AI Testing
AI Testing helps you check how well your AI behaves before you ship it. You can:
- Create simple test cases (an input and the expected reply).
- Connect providers (e.g., the AI youโre testing and the AI that judges results).
- Run a test run that scores each case and shows what passed or failed.
Result Safer, consistent training conversations and an audit trail you can share with compliance teams.
๐ Quick Start (3 steps)โ
- Create Test Cases โ Add examples of what you ask and what you want the AI to answer.
- Configure Providers โ Tell the platform which AI to test and which AI will judge the results.
- Create a Test Run โ Pick your cases, choose models, and run the evaluation.
As a healthcare organization, you are testing a Doctor AI Avatar that roleโplays with staff for practice. AI Testing lets you check safety, tone, and accuracy before anyone uses it in training.
What to test
- Harmful or unsafe requests โ If a trainee asks for something risky, the model should refuse politely and point to approved guidance (e.g., emergency or crisis resources), avoiding harmful speech.
- Medical advice with limited information โ The model should add clear disclaimers, avoid diagnosis, and encourage consulting a licensed clinician.
- Privacy & professionalism โ Responses should avoid sharing personal health information, remain respectful, and show zero toxicity.
- Bedside manner โ Check for empathetic phrasing (e.g., acknowledging feelings before giving guidance).
โ๏ธ 1) Create Test Casesโ
On the Test Cases tab youโll see a list of cards, each one representing a test case.
Each case has two fields:
- Test input โ The message or prompt you will send to the AI.
- Expected output โ The reply you want the AI to produce.
Add a new case using the form on the right side of the page, then click Add test case. You can add as many cases as you like (e.g., short answers, polite greetings, factual questions).
Tips for good test cases
- Keep the input short and clear (one intention per case).
- Write the expected output in the exact style you want (tone, keywords, mustโsay phrases).
- Make a few โedge casesโ (tricky or unusual questions) to catch problems early.
๐ 2) Configure Providersโ
Open the Providers tab to connect the services your tests will use.
Youโll usually configure:
- Test Model provider โ The AI youโre evaluating (e.g., your character or assistant).
- Judge Model provider โ A separate AI that checks whether the test modelโs answer matches what you expect.
Fill in the requested fields, such as API Key, Model, and any IDs the provider requires, then click Save Providers.
If youโre unsure where to find an API key or model name, ask your team admin or check the providerโs dashboard.
๐งช 3) Create a Test Runโ
On the Test Run tab, click Create test run. A panel opens with the settings you need:
- Run name โ Any friendly name so you can find this run later.
- Test model โ Choose which provider/model you want to test.
- Judge model โ Choose which provider/model will evaluate the answers.
- Metric type โ Select how results will be scored. For example, Semantic Similarity checks how close the AIโs answer is to your expected text.
- Test cases โ Pick one or more of your saved cases to include in this run.
Then click Create Test Run.
๐ Reading the Resultsโ
After a run completes, youโll see a Run Summary with the status, total duration, and quick stats (how many passed, failed, or errored). Below that, open any case to compare:
- Input โ What the AI was asked.
- Expected โ What you said a good answer should look like.
- Actual โ What the AI actually replied.
- Score โ The numeric score based on your chosen metric (e.g., 0 to 1 for similarity).
- Pass/Fail โ Whether it met the threshold.
How thresholds work Your team can set a threshold (a minimum score). If the modelโs score meets or exceeds this number, the case is Passed; otherwise itโs Failed. For example, with a similarity threshold of 0.5, a score of 0.68 passes, and a score of 0.40 fails.
โ Best Practicesโ
- Start small โ Begin with a handful of mustโhave behaviors, then grow your library over time.
- Use clear, consistent wording โ Both in inputs and expected outputs.
- Cover real scenarios โ Add routine cases and a few tough ones (like ambiguous questions).
- Review failures together โ Use the Actual vs Expected view to decide whether to adjust prompts, data, or thresholds.
โ FAQโ
Do I need to be technical to use this? No. If you can describe what a good answer looks like, you can create test cases.
What is a โjudge modelโ? Itโs an AI that evaluates answers. Think of it as an automated reviewer that checks whether the response matches your expectation.
What is โsemantic similarityโ? A way to score how close two pieces of text are in meaning, even if the wording is not identical.
Create a few test cases now, connect your providers, and run your first evaluation. Youโll get quick feedback on where your AI shinesโand where it needs tuning.