Select a model
Pick a completed training job or a previously exported model.
Have the model source path and base model ID handy.
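The two identifiers above can be captured in a small record so later steps reference them consistently. This is a minimal sketch; the field names and example values are hypothetical, not part of any specific platform's API.

```python
from dataclasses import dataclass

@dataclass
class EvalTarget:
    # Hypothetical field names -- adapt to your platform's identifiers.
    model_source_path: str  # where the trained/exported model lives
    base_model_id: str      # the base model the job started from

# Example values are placeholders only.
target = EvalTarget(
    model_source_path="s3://my-bucket/jobs/job-123/export",
    base_model_id="base-model-v1",
)
```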
Choose evaluation mode
Start with single‑prompt inference to sanity‑check behavior, then move to batch evaluation on a labeled dataset.
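The two modes above can be sketched as a pair of helpers: one runs a single prompt for manual inspection, the other loops over a labeled dataset and averages a score. The `generate` and `score` callables are assumptions standing in for whatever inference client and metric you actually use.

```python
from typing import Callable, Iterable, Tuple

def sanity_check(generate: Callable[[str], str], prompt: str) -> str:
    """Run one prompt through the model and return the raw output for inspection."""
    return generate(prompt)

def batch_evaluate(
    generate: Callable[[str], str],
    dataset: Iterable[Tuple[str, str]],  # (prompt, reference) pairs
    score: Callable[[str, str], float],
) -> float:
    """Score every (prompt, reference) pair and return the mean score."""
    scores = [score(generate(prompt), ref) for prompt, ref in dataset]
    return sum(scores) / len(scores)
```

In practice, inspect a handful of `sanity_check` outputs by eye before spending compute on `batch_evaluate`.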
Pick metrics
Select task‑appropriate metrics (e.g., BERTScore/ROUGE for generation, EM/F1 for QA, accuracy/F1 for classification).
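For the QA metrics mentioned above, EM and token-level F1 are simple enough to compute directly. The sketch below uses a basic lowercase/strip-punctuation normalization (an assumption; benchmark-specific scripts often normalize differently, e.g. SQuAD also drops articles).

```python
import re
from collections import Counter

def normalize(text: str) -> list:
    """Lowercase, strip punctuation, and split into tokens (simplified)."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).split()

def exact_match(pred: str, ref: str) -> float:
    """1.0 if the normalized prediction equals the normalized reference."""
    return float(normalize(pred) == normalize(ref))

def token_f1(pred: str, ref: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    p, r = normalize(pred), normalize(ref)
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(r)
    return 2 * precision * recall / (precision + recall)
```

Generation metrics like ROUGE and BERTScore need reference implementations; use an established metrics library rather than hand-rolling them.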
Review and compare
Inspect metrics and samples, compare against a baseline, and note failure patterns to guide the next training round.
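The baseline comparison above can be mechanized: given two metric dictionaries, flag every metric where the candidate fell below the baseline. The function name and `min_delta` tolerance are illustrative assumptions, not a standard API.

```python
from typing import Dict, Tuple

def find_regressions(
    current: Dict[str, float],
    baseline: Dict[str, float],
    min_delta: float = 0.0,  # ignore drops smaller than this tolerance
) -> Dict[str, Tuple[float, float]]:
    """Return {metric: (baseline, current)} for metrics that regressed."""
    return {
        name: (baseline[name], current[name])
        for name in baseline
        if name in current and current[name] < baseline[name] - min_delta
    }
```

A non-empty result is a signal to pull the failing samples for those metrics and look for patterns before the next training round.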