This guide covers everything you need to know about testing your fine-tuned models and exporting them in various formats for deployment.

Running Single Inference

Test your model with individual prompts to see how it responds in real time.

Basic Text Inference

For simple text prompts, use the single inference endpoint:
{
  "hf_token": "your_huggingface_token",
  "model_source": "gs://bucket/trained_adapters/job_123/adapter",
  "model_type": "adapter",
  "base_model_id": "google/gemma-3-2b-pt",
  "prompt": "What is the capital of France?"
}
Response:
{
  "result": "The capital of France is Paris."
}
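
Any HTTP client works for these requests. Below is a minimal Python sketch using requests; the host and /inference path are placeholders for your deployment, not confirmed endpoint names:

import requests

API_URL = "https://your-facet-host/inference"  # placeholder; substitute your deployment's endpoint

payload = {
    "hf_token": "your_huggingface_token",
    "model_source": "gs://bucket/trained_adapters/job_123/adapter",
    "model_type": "adapter",
    "base_model_id": "google/gemma-3-2b-pt",
    "prompt": "What is the capital of France?",
}

# POST the request body shown above and print the model's reply.
response = requests.post(API_URL, json=payload, timeout=300)
response.raise_for_status()
print(response.json()["result"])  # -> "The capital of France is Paris."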

Conversation Inference

For multi-turn conversations, use the batch inference endpoint with a single conversation:
{
  "hf_token": "your_huggingface_token",
  "model_source": "username/model-name",
  "model_type": "merged",
  "base_model_id": "google/gemma-3-2b-pt",
  "messages": [
    [
      { "role": "system", "content": "You are a helpful assistant." },
      {
        "role": "user",
        "content": "Explain quantum computing in simple terms."
      }
    ]
  ]
}
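
When scripting multi-turn tests, remember that messages is a list of conversations, each itself a list of role/content turns. A minimal sketch (the batch endpoint path is an assumption, not a confirmed route):

import requests

BATCH_URL = "https://your-facet-host/batch-inference"  # placeholder path

def run_conversations(conversations: list) -> list:
    """POST one or more conversations to the batch inference endpoint."""
    payload = {
        "hf_token": "your_huggingface_token",
        "model_source": "username/model-name",
        "model_type": "merged",
        "base_model_id": "google/gemma-3-2b-pt",
        "messages": conversations,  # note: a list of conversations
    }
    resp = requests.post(BATCH_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()

# A single conversation still goes in as a one-element list.
print(run_conversations([[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms."},
]]))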

Vision Inference

For multimodal tasks, include images as base64-encoded data URIs:
{
  "hf_token": "your_huggingface_token",
  "model_source": "username/model-name",
  "model_type": "adapter",
  "base_model_id": "google/gemma-3-2b-pt",
  "messages": [
    [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "What do you see in this image?" },
          {
            "type": "image",
            "image": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA..."
          }
        ]
      }
    ]
  ]
}
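
The image field expects a data URI. A small helper for building one from a local file (illustrative, not part of the API):

import base64
import mimetypes

def to_data_uri(path: str) -> str:
    """Read an image file and return a data URI suitable for the "image" field."""
    mime, _ = mimetypes.guess_type(path)
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime or 'image/png'};base64,{encoded}"

# Example: embed a local screenshot in the message content.
image_part = {"type": "image", "image": to_data_uri("screenshot.png")}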

Batch Evaluation

Run comprehensive evaluations on your test datasets to get detailed performance metrics. Let the system automatically select appropriate metrics for your task:
{
  "hf_token": "your_huggingface_token",
  "model_source": "username/model-name",
  "model_type": "merged",
  "base_model_id": "google/gemma-3-2b-pt",
  "dataset_id": "processed_dataset_123",
  "task_type": "conversation",
  "num_sample_results": 5
}
Available Task Types:
  • conversation → BERTScore, ROUGE
  • qa → Exact match, BERTScore
  • summarization → ROUGE, BERTScore
  • translation → BLEU, METEOR
  • classification → Accuracy, Precision, Recall, F1
  • general → BERTScore, ROUGE
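
The mapping above can be mirrored in client code, for example to show users which metrics to expect before a run. This dict simply restates the table; the actual selection logic lives server-side:

# Default metrics per task type, as documented above (an illustrative mirror,
# not the service's actual source code).
DEFAULT_METRICS = {
    "conversation": ["bertscore", "rouge"],
    "qa": ["exact_match", "bertscore"],
    "summarization": ["rouge", "bertscore"],
    "translation": ["bleu", "meteor"],
    "classification": ["accuracy", "precision", "recall", "f1"],
    "general": ["bertscore", "rouge"],
}

print(DEFAULT_METRICS["qa"])  # ['exact_match', 'bertscore']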

Using Specific Metrics

For more control, specify exact metrics to compute:
{
  "hf_token": "your_huggingface_token",
  "model_source": "username/model-name",
  "model_type": "adapter",
  "base_model_id": "google/gemma-3-2b-pt",
  "dataset_id": "processed_dataset_123",
  "metrics": ["bertscore", "rouge", "exact_match"],
  "num_sample_results": 10
}
Available Metrics:
  • bertscore: Semantic similarity (⭐ Recommended for LLMs)
  • rouge: Text overlap and summarization quality
  • exact_match: Perfect string matching
  • accuracy: Token-level accuracy
  • precision, recall, f1: Classification metrics
  • bleu, meteor: Translation metrics
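
To sanity-check scores offline, the same metrics are available in the Hugging Face evaluate library (a separate tool, independent of the evaluation service):

import evaluate  # pip install evaluate bert_score rouge_score

predictions = ["The capital of France is Paris."]
references = ["Paris is the capital of France."]

# BERTScore: semantic similarity between prediction and reference.
bertscore = evaluate.load("bertscore")
bs = bertscore.compute(predictions=predictions, references=references, lang="en")
print("bertscore_f1:", sum(bs["f1"]) / len(bs["f1"]))

# ROUGE: n-gram and longest-common-subsequence overlap.
rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))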

Evaluation Response

{
  "metrics": {
    "bertscore_f1": 0.85,
    "rouge_l": 0.82,
    "rouge_1": 0.85,
    "rouge_2": 0.78,
    "exact_match": 0.65
  },
  "samples": [
    {
      "prediction": "The capital of France is Paris.",
      "reference": "Paris is the capital of France.",
      "sample_index": 42
    }
  ],
  "num_samples": 1000,
  "dataset_id": "processed_dataset_123",
  "task_type": "conversation"
}
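
A small post-processing sketch that reads a response like the one above and surfaces samples for manual review (the 0.8 threshold is an arbitrary example, not a recommended cutoff):

def summarize_eval(result: dict, bertscore_floor: float = 0.8) -> None:
    """Print headline metrics and list sample predictions for manual review."""
    for name, value in result["metrics"].items():
        flag = "  <-- below floor" if name == "bertscore_f1" and value < bertscore_floor else ""
        print(f"{name}: {value:.2f}{flag}")
    for sample in result["samples"]:
        print(f"[{sample['sample_index']}] pred: {sample['prediction']}")
        print(f"[{sample['sample_index']}]  ref: {sample['reference']}")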

Model Export

Export your fine-tuned models in various formats for different deployment scenarios.

Export Formats

Adapter

Best for: LoRA/QLoRA models, experimentation, combining adapters
Characteristics:
  • Small file size (few MB)
  • Requires base model to run
  • Easy to merge with other adapters
  • Good for A/B testing different fine-tunings
Configuration:
{
  "export_id": "exp_123",
  "job_id": "job_456",
  "type": "adapter",
  "destination": ["gcs", "hf_hub"],
  "hf_repo_id": "username/my-adapter"
}
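
Submitting the export config follows the same request pattern as inference; the /export path below is a placeholder, not a confirmed route:

import requests

EXPORT_URL = "https://your-facet-host/export"  # placeholder endpoint

export_config = {
    "export_id": "exp_123",
    "job_id": "job_456",
    "type": "adapter",
    "destination": ["gcs", "hf_hub"],
    "hf_repo_id": "username/my-adapter",
}

resp = requests.post(EXPORT_URL, json=export_config, timeout=600)
resp.raise_for_status()
print(resp.json())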

Export Destinations

Google Cloud Storage

Best for: Downloading models, GCP deployments
  • Download as zip files
  • Direct integration with GCP services
  • Good for private model storage

Hugging Face Hub

Best for: Sharing models, public deployment
  • Publish to HF Hub for sharing
  • Easy integration with HF ecosystem
  • Good for open-source projects
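
Once an adapter is published to the Hub, consumers attach it to the base model with the peft library. This is standard peft usage, independent of Facet:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "google/gemma-3-2b-pt"    # base model the adapter was trained on
adapter_id = "username/my-adapter"  # repo created by the export

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id)  # applies the LoRA weights

inputs = tokenizer("What is the capital of France?", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))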

GGUF Quantization Options

Choose the right quantization level for your needs:
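
The exact presets offered depend on your Facet version; as general background, the most common GGUF levels from the llama.cpp ecosystem are:
  • q4_k_m: ~4-bit; the usual size/quality sweet spot for local inference
  • q5_k_m: ~5-bit; larger, with better quality retention
  • q8_0: 8-bit; near-lossless at roughly half the size of f16
  • f16: half precision; largest files, no quantization loss
A GGUF export can then be loaded locally with llama-cpp-python (standard llama.cpp usage; the file name is illustrative):

from llama_cpp import Llama  # pip install llama-cpp-python

# Load a quantized GGUF export and run a quick smoke test.
llm = Llama(model_path="my-model-q4_k_m.gguf", n_ctx=4096)
out = llm("What is the capital of France?", max_tokens=32)
print(out["choices"][0]["text"])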

Inference Providers

Facet supports multiple inference backends for different use cases:

HuggingFace Transformers

  • Use case: Standard inference, most compatible
  • Supports: All model types and modalities
  • Best for: General use, testing, development

Unsloth

  • Use case: Optimized inference for Unsloth-trained models
  • Supports: Unsloth-optimized models
  • Best for: Models trained with Unsloth framework

vLLM

  • Use case: High-performance production inference
  • Supports: Merged models and adapters (via LoRA)
  • Best for: High-throughput production deployment
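
For a merged model, vLLM serving looks like the sketch below (standard vLLM usage; serving adapters additionally requires enabling vLLM's LoRA support):

from vllm import LLM, SamplingParams  # pip install vllm

llm = LLM(model="username/model-name")  # merged model exported to the Hub
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["What is the capital of France?"], params)
print(outputs[0].outputs[0].text)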

Next Steps

After testing and exporting your model:
  1. Deploy Your Model: Set up your model for production use
  2. Monitor Performance: Track model performance in production
  3. Collect Feedback: Gather user feedback to improve your model
  4. Iterate: Use insights to refine your model with additional training
Ready to deploy? Head to the Deployment guide to learn how to set up your model for production use.