1. Select a model
Pick a completed training job or a previously exported model.
Have the model source path and base model ID handy.
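As a concrete starting point, the two identifiers can be captured up front; the names and values below are illustrative placeholders, not a required API:

```python
# Hypothetical identifiers; substitute the values from your own training job.
MODEL_SOURCE_PATH = "checkpoints/my-finetune-run"  # path to the trained weights
BASE_MODEL_ID = "gpt2"                             # base model the job fine-tuned
```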
2. Choose evaluation mode
Start with single‑prompt inference to sanity‑check behavior, then move to batch evaluation on a labeled dataset.
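For the single-prompt check, a minimal sketch assuming the checkpoint is loadable with Hugging Face transformers; the path, task, and prompt are placeholders to adapt to your setup:

```python
# Minimal single-prompt sanity check, assuming a transformers-compatible
# checkpoint; adjust the task ("text-generation" here) to match your model.
from transformers import pipeline

generate = pipeline("text-generation", model="checkpoints/my-finetune-run")
out = generate("Summarize: The quick brown fox jumps over the lazy dog.",
               max_new_tokens=64)
print(out[0]["generated_text"])
```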
3. Pick metrics
Select task‑appropriate metrics (e.g., BERTScore/ROUGE for generation, EM/F1 for QA, accuracy/F1 for classification).
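One common way to compute the generation metrics is the Hugging Face `evaluate` library (ROUGE and BERTScore need the `rouge_score` and `bert_score` extras); the predictions and references below are toy data:

```python
# Toy example of computing ROUGE and BERTScore with Hugging Face `evaluate`
# (pip install evaluate rouge_score bert_score).
import evaluate

predictions = ["the cat sat on the mat"]
references = ["a cat was sitting on the mat"]

rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))

bertscore = evaluate.load("bertscore")
print(bertscore.compute(predictions=predictions, references=references, lang="en"))
```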
4. Run and monitor
Launch the evaluation job and monitor its progress; runtime grows with dataset size and output length.
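If you drive the batch run yourself rather than through a managed job, it amounts to a loop like the sketch below, with `tqdm` for progress; `eval_set` is a toy stand-in for your labeled dataset:

```python
# Sketch of a self-driven batch evaluation loop with progress reporting;
# `eval_set` is a toy stand-in for your labeled dataset.
from tqdm import tqdm
from transformers import pipeline

eval_set = [("Summarize: The quick brown fox jumps over the lazy dog.",
             "A fox jumps over a dog.")]
generate = pipeline("text-generation", model="checkpoints/my-finetune-run")

predictions, references = [], []
for prompt, reference in tqdm(eval_set, desc="evaluating"):
    out = generate(prompt, max_new_tokens=64)
    predictions.append(out[0]["generated_text"])
    references.append(reference)
```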
5. Review and compare
Inspect metrics and samples, compare against a baseline, and note failure patterns to guide the next training round.
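For the baseline comparison, a minimal sketch assuming you have per-example scores for both models as parallel lists; the numbers here are illustrative:

```python
# Compare per-example scores (e.g., ROUGE-L) against a baseline and surface
# the largest regressions; the score lists here are illustrative.
baseline_scores = [0.42, 0.55, 0.61, 0.38, 0.70]
new_scores = [0.48, 0.51, 0.66, 0.35, 0.74]

deltas = [n - b for n, b in zip(new_scores, baseline_scores)]
print(f"mean delta: {sum(deltas) / len(deltas):+.4f}")

# The most-regressed examples are good candidates for failure analysis.
for i in sorted(range(len(deltas)), key=lambda i: deltas[i])[:3]:
    print(f"example {i}: {baseline_scores[i]:.3f} -> {new_scores[i]:.3f}")
```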
6. Export for serving
When satisfied, export the model in your preferred format for deployment.
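As one example, a transformers-compatible model can be exported with `save_pretrained`; other targets (ONNX, GGUF, TorchScript) depend on your serving stack, and the paths below are placeholders:

```python
# One common export path, assuming a transformers-compatible checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("checkpoints/my-finetune-run")
tokenizer = AutoTokenizer.from_pretrained("checkpoints/my-finetune-run")
model.save_pretrained("export/my-model")
tokenizer.save_pretrained("export/my-model")
```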