AI Testing

What is AI Testing?

AI Testing helps you check how well your AI behaves before you ship it. You can:

  • Create simple test cases (an input and the expected reply).
  • Connect providers (e.g., the AI you're testing and the AI that judges results).
  • Run a test run that scores each case and shows what passed or failed.

Result: Safer, more consistent training conversations and an audit trail you can share with compliance teams.


🚀 Quick Start (3 steps)

  1. Create Test Cases → Add examples of what you ask and what you want the AI to answer.
  2. Configure Providers → Tell the platform which AI to test and which AI will judge the results.
  3. Create a Test Run → Pick your cases, choose models, and run the evaluation.

Business case โ€” Healthcare training assistant

As a healthcare organization, you are testing a Doctor AI Avatar that role-plays with staff for practice. AI Testing lets you check safety, tone, and accuracy before anyone uses it in training.

What to test

  • Harmful or unsafe requests – If a trainee asks for something risky, the model should refuse politely and point to approved guidance (e.g., emergency or crisis resources), avoiding harmful speech.
  • Medical advice with limited information – The model should add clear disclaimers, avoid diagnosis, and encourage consulting a licensed clinician.
  • Privacy & professionalism – Responses should avoid sharing personal health information, remain respectful, and show zero toxicity.
  • Bedside manner – Check for empathetic phrasing (e.g., acknowledging feelings before giving guidance).

โœ๏ธ 1) Create Test Casesโ€‹

On the Test Cases tab you'll see a list of cards, each one representing a test case.

Each case has two fields:

  • Test input – The message or prompt you will send to the AI.
  • Expected output – The reply you want the AI to produce.

Add a new case using the form on the right side of the page, then click Add test case. You can add as many cases as you like (e.g., short answers, polite greetings, factual questions).

Tips for good test cases

  • Keep the input short and clear (one intention per case).
  • Write the expected output in the exact style you want (tone, keywords, must-say phrases).
  • Make a few "edge cases" (tricky or unusual questions) to catch problems early.
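A test case is just an input paired with the reply you expect. As a rough sketch in Python (field names here are illustrative, not the platform's schema):

```python
# Each test case pairs an input prompt with the reply you expect.
# Field names are illustrative, not the platform's actual schema.
test_cases = [
    {
        "input": "A patient asks me for a diagnosis over chat. What should I say?",
        "expected": "I can't provide a diagnosis. Please encourage the patient "
                    "to consult a licensed clinician.",
    },
    {
        "input": "Hello!",
        "expected": "Hello! How can I help you with your training today?",
    },
]

# Keep each case focused: one intention, both fields filled in.
for case in test_cases:
    assert case["input"] and case["expected"], "both fields are required"
```

Each dictionary above would correspond to one card on the Test Cases tab.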
AI Testing – Test Cases tab

🔌 2) Configure Providers

Open the Providers tab to connect the services your tests will use.

You'll usually configure:

  • Test Model provider – The AI you're evaluating (e.g., your character or assistant).
  • Judge Model provider – A separate AI that checks whether the test model's answer matches what you expect.

Fill in the requested fields, such as API Key, Model, and any IDs the provider requires, then click Save Providers.

If you're unsure where to find an API key or model name, ask your team admin or check the provider's dashboard.
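Conceptually, the Providers tab collects one set of credentials per role. A hypothetical sketch (the keys and values are placeholders, not the platform's API; use the exact fields shown on the tab):

```python
# Hypothetical provider configuration: one entry for the model under
# test, one for the judge. Values are placeholders.
providers = {
    "test_model": {
        "api_key": "sk-...",         # from your provider's dashboard
        "model": "my-doctor-avatar",
    },
    "judge_model": {
        "api_key": "sk-...",
        "model": "my-judge-model",
    },
}

def is_configured(cfg: dict) -> bool:
    """Both providers need an API key and a model name before saving."""
    return all(p.get("api_key") and p.get("model") for p in cfg.values())
```

If `is_configured` would return False for your settings, the platform's Save Providers step will typically ask you to fill in the missing field.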

AI Testing – Providers tab

🧪 3) Create a Test Run

On the Test Run tab, click Create test run. A panel opens with the settings you need:

  • Run name – Any friendly name so you can find this run later.
  • Test model – Choose which provider/model you want to test.
  • Judge model – Choose which provider/model will evaluate the answers.
  • Metric type – Select how results will be scored. For example, Semantic Similarity checks how close the AI's answer is to your expected text.
  • Test cases – Pick one or more of your saved cases to include in this run.

Then click Create Test Run.
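The settings in that panel can be pictured as a single record. A minimal sketch, assuming hypothetical field names (not the platform's API):

```python
# One test run = a name, two models, a metric, and the cases to include.
# Field names are illustrative only.
test_run = {
    "name": "bedside-manner-smoke-test",
    "test_model": "my-doctor-avatar",
    "judge_model": "my-judge-model",
    "metric": "semantic_similarity",
    "case_ids": [1, 2, 3],
}

REQUIRED = {"name", "test_model", "judge_model", "metric", "case_ids"}

def ready_to_run(run: dict) -> bool:
    """A run needs every setting filled in and at least one test case."""
    return REQUIRED <= run.keys() and len(run["case_ids"]) > 0
```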


📊 Reading the Results

After a run completes, you'll see a Run Summary with the status, total duration, and quick stats (how many passed, failed, or errored). Below that, open any case to compare:

  • Input – What the AI was asked.
  • Expected – What you said a good answer should look like.
  • Actual – What the AI actually replied.
  • Score – The numeric score based on your chosen metric (e.g., 0 to 1 for similarity).
  • Pass/Fail – Whether it met the threshold.

How thresholds work: Your team can set a threshold (a minimum score). If the model's score meets or exceeds this number, the case is Passed; otherwise it's Failed. For example, with a similarity threshold of 0.5, a score of 0.68 passes and a score of 0.40 fails.
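The threshold rule is a simple comparison, which can be written directly:

```python
def passes(score: float, threshold: float = 0.5) -> bool:
    """A case passes when its score meets or exceeds the threshold."""
    return score >= threshold

print(passes(0.68))  # True: 0.68 >= 0.5
print(passes(0.40))  # False: 0.40 < 0.5
print(passes(0.50))  # True: meeting the threshold exactly still passes
```

Note the boundary behavior: a score exactly equal to the threshold counts as a pass.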

AI Testing – Run Summary and results

✅ Best Practices

  • Start small – Begin with a handful of must-have behaviors, then grow your library over time.
  • Use clear, consistent wording – Both in inputs and expected outputs.
  • Cover real scenarios – Add routine cases and a few tough ones (like ambiguous questions).
  • Review failures together – Use the Actual vs Expected view to decide whether to adjust prompts, data, or thresholds.

โ“ FAQโ€‹

Do I need to be technical to use this? No. If you can describe what a good answer looks like, you can create test cases.

What is a "judge model"? It's an AI that evaluates answers. Think of it as an automated reviewer that checks whether the response matches your expectation.

What is "semantic similarity"? A way to score how close two pieces of text are in meaning, even if the wording is not identical.
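In practice, semantic similarity is usually computed by comparing text embeddings. As a rough intuition only, here is a toy word-overlap cosine score; it is a simplified stand-in, not the platform's actual metric, since real embeddings also recognize synonyms and paraphrases:

```python
import math
from collections import Counter

def toy_similarity(a: str, b: str) -> float:
    """Toy word-overlap cosine score in [0, 1]. Real semantic similarity
    uses embedding models, which also capture meaning across rewordings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0
```

Identical texts score near 1.0, unrelated texts score near 0.0, and partial matches land in between, which is exactly the graded behavior that a pass/fail threshold is applied to.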


Next steps

Create a few test cases now, connect your providers, and run your first evaluation. You'll get quick feedback on where your AI shines, and where it needs tuning.