What You Can Do
Once your model is trained, it’s time to test its performance and capabilities. Facet provides powerful evaluation tools to help you understand how well your model performs on different tasks and datasets.
Run Inference
Test your model with custom prompts and conversations
Batch Evaluation
Evaluate your model on entire datasets with multiple metrics
Performance Metrics
Get detailed performance analysis with accuracy, BLEU, ROUGE, and more
Export Models
Download your models in various formats for deployment
Evaluation Methods
Single Inference
Test your model with individual prompts to see how it responds (see the sketch after this list):
- Text prompts: Simple question-answer testing
- Conversations: Multi-turn dialogue testing
- Vision tasks: Image understanding and description
- Custom scenarios: Test specific use cases
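If you want to script this outside the UI, the following is a minimal sketch, assuming the trained model is a LoRA/QLoRA adapter on a Hugging Face-compatible base model and using the transformers and peft libraries directly rather than a Facet-specific API. BASE_MODEL_ID and ADAPTER_PATH are placeholders for the base model ID and model source path reported by your training job.

```python
# Minimal single-inference sketch (illustrative, not Facet's API).
# BASE_MODEL_ID and ADAPTER_PATH are placeholders for the values reported
# by your training job.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL_ID = "your-base-model-id"   # the base model used for training
ADAPTER_PATH = "path/to/model/source"  # the model source path from the job

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID)
base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, ADAPTER_PATH)
model.eval()

# Single chat-style prompt, deterministic decoding for repeatable checks.
messages = [{"role": "user", "content": "Summarize what this model was fine-tuned to do."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=256, do_sample=False)

print(tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True))
```

For multi-turn dialogue testing, append alternating assistant and user turns to `messages` before re-applying the chat template.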
Batch Evaluation
Run comprehensive evaluations on your test datasets (see the sketch after this list):
- Multiple metrics: BLEU, ROUGE, BERTScore, accuracy
- Task-specific metrics: Customized for your use case
- Statistical analysis: Confidence intervals and significance testing
- Sample results: Review individual predictions and references
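As a rough sketch of the batch side, the loop below runs the model from the single-inference sketch over a JSONL test set and collects predictions and references for scoring. The file name and the "prompt"/"response" column names are assumptions about your data layout, and `model`/`tokenizer` are reused from the previous sketch.

```python
# Batch-evaluation sketch: generate a prediction for every test example.
# Reuses `model` and `tokenizer` from the single-inference sketch; the file
# name and column names ("prompt", "response") are placeholders.
import torch
from datasets import load_dataset
from tqdm import tqdm

test_set = load_dataset("json", data_files="test.jsonl", split="train")

predictions, references = [], []
for example in tqdm(test_set, desc="evaluating"):
    messages = [{"role": "user", "content": example["prompt"]}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        output = model.generate(inputs, max_new_tokens=256, do_sample=False)
    predictions.append(
        tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True)
    )
    references.append(example["response"])
```

The resulting `predictions` and `references` lists feed directly into the metric computations below.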
Supported Metrics
Text Quality Metrics
- BERTScore: Semantic similarity using BERT embeddings
- ROUGE: Text overlap and summarization quality
- BLEU: Translation and generation quality
BERTScore is recommended for most LLM evaluations as it captures semantic meaning.
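One way to compute these scores, assuming the Hugging Face `evaluate` library rather than any Facet-specific tooling, is the sketch below; the example strings stand in for the `predictions` and `references` collected during batch evaluation.

```python
# Corpus-level BLEU, ROUGE, and BERTScore with the `evaluate` library.
# The two example pairs are stand-ins for real predictions and references.
import evaluate

predictions = ["The cat sat on the mat.", "Paris is the capital of France."]
references  = ["A cat was sitting on the mat.", "The capital of France is Paris."]

rouge = evaluate.load("rouge").compute(predictions=predictions, references=references)
bleu = evaluate.load("bleu").compute(
    predictions=predictions, references=[[r] for r in references]
)
bert = evaluate.load("bertscore").compute(
    predictions=predictions, references=references, lang="en"
)

print("ROUGE-L:", rouge["rougeL"])
print("BLEU:", bleu["bleu"])
print("BERTScore F1 (mean):", sum(bert["f1"]) / len(bert["f1"]))
```

BERTScore returns per-example precision, recall, and F1 lists, which also makes it convenient for inspecting individual failures later.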
Task-Specific Metrics
- Question Answering: Exact match, F1 score (see the sketch after this list)
- Summarization: ROUGE scores, BERTScore
- Translation: BLEU, METEOR
- Conversation: BERTScore, response relevance
- General: BERTScore, ROUGE
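For question answering, exact match and token-level F1 can be computed directly. The sketch below is a simplified SQuAD-style implementation; the normalization rules (lowercasing, stripping punctuation and articles) are assumptions, not Facet's exact definition.

```python
# Simplified SQuAD-style exact match and token-level F1.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "Eiffel Tower"))  # 1.0 after normalization
print(token_f1("in Paris, France", "Paris"))            # 0.5 (partial credit)
```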
Model Export Options
Export Formats
Adapter weights: best for LoRA/QLoRA models, easy to merge later (see the merge sketch after this list)
- Smaller file size
- Requires base model
- Easy to combine with other adapters
- Good for experimentation
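"Easy to merge later" can look like the sketch below, which folds a LoRA/QLoRA adapter into its base model using peft's merge_and_unload and saves a standalone checkpoint; the model ID and paths are placeholders.

```python
# Merge an adapter-format export into standalone weights (illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL_ID = "your-base-model-id"        # placeholder
ADAPTER_PATH = "path/to/exported/adapter"   # placeholder

base = AutoModelForCausalLM.from_pretrained(BASE_MODEL_ID, torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, ADAPTER_PATH).merge_and_unload()

# Save the merged model and tokenizer as a full standalone checkpoint.
merged.save_pretrained("merged-model")
AutoTokenizer.from_pretrained(BASE_MODEL_ID).save_pretrained("merged-model")
```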
Export Destinations
- Google Cloud Storage: Download as zip files
- Hugging Face Hub: Publish to HF Hub for sharing (see the sketch after this list)
- Local Download: Direct download to your machine
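If you publish to the Hugging Face Hub yourself rather than through Facet, a minimal sketch with the huggingface_hub library looks like this; the repo ID and folder path are placeholders, and you need to be authenticated (for example via `huggingface-cli login` or an HF_TOKEN environment variable).

```python
# Upload an exported model folder to the Hugging Face Hub (illustrative).
from huggingface_hub import HfApi

api = HfApi()
api.create_repo("your-username/your-model", private=True, exist_ok=True)
api.upload_folder(
    folder_path="merged-model",           # local export directory
    repo_id="your-username/your-model",
    commit_message="Upload fine-tuned model",
)
```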
Getting Started
1. Prepare Your Model
Ensure your training job has completed successfully; you’ll need the model source path and base model ID that it produced.
2. Choose Evaluation Method
Decide between single-inference testing and batch evaluation on a dataset.
Start with single inference to test basic functionality, then move to batch
evaluation for comprehensive assessment.
3
Configure Metrics
Select the metrics that best match your task and objectives.
BERTScore and ROUGE are good defaults for most text generation tasks.
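As a starting point, the mapping below mirrors the task-specific metric list above; treat it as a sketch of reasonable defaults rather than a fixed Facet configuration.

```python
# Illustrative default metric choices per task; adjust to your objectives.
DEFAULT_METRICS = {
    "question_answering": ["exact_match", "f1"],
    "summarization": ["rouge", "bertscore"],
    "translation": ["bleu", "meteor"],
    "conversation": ["bertscore"],
    "general": ["bertscore", "rouge"],
}

print(DEFAULT_METRICS["summarization"])
```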
4. Run Evaluation
Launch your evaluation and monitor progress.
Large datasets may take several minutes to evaluate completely.
5. Review Results
Analyze the metrics and sample results to understand your model’s performance.
Look for patterns in both successful and failed predictions to identify areas for improvement.
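One simple way to surface those patterns, assuming per-example BERTScore F1 as in the metric sketch above, is to rank examples by score and read the weakest ones; the placeholder lists stand in for your real evaluation outputs.

```python
# Rank examples by per-example BERTScore F1 and inspect the weakest ones.
import evaluate

predictions = ["first model output", "second model output"]      # placeholders
references  = ["first gold reference", "second gold reference"]  # placeholders

bert = evaluate.load("bertscore").compute(
    predictions=predictions, references=references, lang="en"
)

worst = sorted(zip(bert["f1"], predictions, references), key=lambda x: x[0])[:5]
for f1, pred, ref in worst:
    print(f"F1={f1:.3f}")
    print("  prediction:", pred)
    print("  reference: ", ref)
```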
Next Steps
After evaluating your model:
- Analyze Results: Review metrics and identify strengths and weaknesses
- Export Model: Download your model in the format you need
- Deploy: Set up your model for production use
- Monitor: Track performance in production and collect feedback
- Iterate: Use insights to improve your model with additional training