What You Can Do

Once your model is trained, it’s time to test its performance and capabilities. Facet provides powerful evaluation tools to help you understand how well your model performs on different tasks and datasets.

  • Run Inference: Test your model with custom prompts and conversations
  • Batch Evaluation: Evaluate your model on entire datasets with multiple metrics
  • Performance Metrics: Get detailed performance analysis with accuracy, BLEU, ROUGE, and more
  • Export Models: Download your models in various formats for deployment

Evaluation Methods

Single Inference

Test your model with individual prompts to see how it responds:
  • Text prompts: Simple question-answer testing
  • Conversations: Multi-turn dialogue testing
  • Vision tasks: Image understanding and description
  • Custom scenarios: Test specific use cases
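
To illustrate what a single-inference test looks like under the hood, here is a minimal sketch using the Hugging Face transformers and peft libraries. The base model ID and adapter path are placeholders, not Facet-specific values; substitute the values from your own training job.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    # Placeholders: use the base model ID and adapter path from your training job.
    BASE_MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
    ADAPTER_PATH = "path/to/your/adapter"

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    model = PeftModel.from_pretrained(model, ADAPTER_PATH)

    # Single-prompt test: build a chat prompt, generate, and decode only the new tokens.
    messages = [{"role": "user", "content": "Summarize the benefits of LoRA fine-tuning."}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output_ids = model.generate(input_ids, max_new_tokens=200)
    print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))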

Batch Evaluation

Run comprehensive evaluations on your test datasets:
  • Multiple metrics: BLEU, ROUGE, BERTScore, accuracy
  • Task-specific metrics: Customized for your use case
  • Statistical analysis: Confidence intervals and significance testing
  • Sample results: Review individual predictions and references
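
For intuition, a hedged sketch of a batch-evaluation loop is shown below. It reuses the model and tokenizer from the single-inference sketch above, and the prompt and reference column names are assumptions about your test set's schema.

    from datasets import load_dataset

    # Reuses `model` and `tokenizer` from the single-inference sketch above.
    # The "prompt" and "reference" column names are assumptions about your test set.
    test_set = load_dataset("json", data_files="test.jsonl", split="train")

    predictions, references = [], []
    for example in test_set:
        inputs = tokenizer(example["prompt"], return_tensors="pt").to(model.device)
        output_ids = model.generate(**inputs, max_new_tokens=200)
        completion = tokenizer.decode(
            output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
        )
        predictions.append(completion)
        references.append(example["reference"])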

Supported Metrics

Text Quality Metrics

  • BERTScore: Semantic similarity using BERT embeddings
  • ROUGE: Text overlap and summarization quality
  • BLEU: Translation and generation quality

BERTScore is recommended for most LLM evaluations as it captures semantic meaning.
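
These scores can also be reproduced locally with the Hugging Face evaluate library. A minimal sketch, assuming predictions and references are parallel lists of strings (for example, the lists collected by the batch-evaluation loop above):

    # Requires: pip install evaluate bert_score rouge_score sacrebleu
    import evaluate

    bertscore = evaluate.load("bertscore")
    rouge = evaluate.load("rouge")
    bleu = evaluate.load("sacrebleu")

    # `predictions` and `references` are parallel lists of strings.
    print(bertscore.compute(predictions=predictions, references=references, lang="en"))
    print(rouge.compute(predictions=predictions, references=references))
    print(bleu.compute(predictions=predictions, references=[[ref] for ref in references]))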

Task-Specific Metrics

  • Question Answering: Exact match, F1 score
  • Summarization: ROUGE scores, BERTScore
  • Translation: BLEU, METEOR
  • Conversation: BERTScore, response relevance
  • General: BERTScore, ROUGE
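
For question answering, exact match and token-level F1 follow the SQuAD convention. A minimal sketch using the evaluate library's squad metric, with a hypothetical single example:

    import evaluate

    # SQuAD-style exact match and token-level F1.
    squad = evaluate.load("squad")
    preds = [{"id": "q1", "prediction_text": "Paris"}]
    refs = [{"id": "q1", "answers": {"text": ["Paris"], "answer_start": [0]}}]
    print(squad.compute(predictions=preds, references=refs))
    # -> {'exact_match': 100.0, 'f1': 100.0}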

Model Export Options

Export Formats

LoRA Adapters (adapter-only export)

Best for: LoRA/QLoRA models, easy to merge later (see the merge sketch after this list)
  • Smaller file size
  • Requires the base model at load time
  • Easy to combine with other adapters
  • Good for experimentation
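
Merging an exported adapter into its base model can be done later with peft's merge_and_unload. A minimal sketch, where the base model ID and adapter path are placeholders from your training job:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    # Placeholders: use the base model ID and adapter path from your training job.
    BASE_MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
    ADAPTER_PATH = "path/to/your/adapter"

    base = AutoModelForCausalLM.from_pretrained(BASE_MODEL_ID, torch_dtype=torch.bfloat16)
    merged = PeftModel.from_pretrained(base, ADAPTER_PATH).merge_and_unload()

    # Save a standalone checkpoint that no longer needs the adapter or PEFT at load time.
    merged.save_pretrained("merged-model")
    AutoTokenizer.from_pretrained(BASE_MODEL_ID).save_pretrained("merged-model")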

Export Destinations

  • Google Cloud Storage: Download as zip files
  • Hugging Face Hub: Publish to HF Hub for sharing
  • Local Download: Direct download to your machine
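
If you publish to the Hugging Face Hub, an exported (or merged) checkpoint can also be pushed directly with transformers. A hedged sketch with a placeholder repository ID:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # "your-username/your-model" is a placeholder repo ID; authenticate first
    # with `huggingface-cli login` or the HF_TOKEN environment variable.
    model = AutoModelForCausalLM.from_pretrained("merged-model")
    tokenizer = AutoTokenizer.from_pretrained("merged-model")

    model.push_to_hub("your-username/your-model", private=True)
    tokenizer.push_to_hub("your-username/your-model", private=True)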

Getting Started

1. Prepare Your Model: Ensure your training job has completed successfully. You’ll need the model source path and base model ID from your training job.
2. Choose Evaluation Method: Decide between single inference testing and batch evaluation on a dataset. Start with single inference to verify basic functionality, then move to batch evaluation for a comprehensive assessment.
3. Configure Metrics: Select the metrics that best match your task and objectives. BERTScore and ROUGE are good defaults for most text generation tasks.
4. Run Evaluation: Launch your evaluation and monitor progress. Large datasets may take several minutes to evaluate completely.
5. Review Results: Analyze the metrics and sample results to understand your model’s performance. Look for patterns in both successful and failed predictions to identify areas for improvement (see the sketch after this list).
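
One practical way to run that review is to rank examples by a per-sample score and read the weakest predictions first. A hedged sketch, reusing the predictions and references lists from the batch-evaluation sketch above:

    import evaluate

    # Reuses `predictions` and `references` from the batch-evaluation loop above.
    bertscore = evaluate.load("bertscore")
    per_sample_f1 = bertscore.compute(
        predictions=predictions, references=references, lang="en"
    )["f1"]

    # Inspect the five lowest-scoring examples to spot failure patterns.
    worst = sorted(zip(per_sample_f1, predictions, references))[:5]
    for f1, prediction, reference in worst:
        print(f"F1={f1:.3f}\n  prediction: {prediction}\n  reference:  {reference}\n")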

Next Steps

After evaluating your model:
  1. Analyze Results: Review metrics and identify strengths and weaknesses
  2. Export Model: Download your model in the format you need
  3. Deploy: Set up your model for production use
  4. Monitor: Track performance in production and collect feedback
  5. Iterate: Use insights to improve your model with additional training

Ready to start evaluating? Head to the Inference & Export guide for detailed instructions on running evaluations and exporting your models.