Once your model is trained and evaluated, it’s time to deploy it for production use. This guide covers multiple deployment options from local development to cloud production.

Deployment Options

Local Deployment

Run models on your local machine for development and testing

Cloud Deployment

Deploy on cloud platforms for production use

Container Deployment

Use Docker containers for consistent deployment

Serverless Deployment

Deploy on serverless platforms for cost efficiency

Local Deployment

GGUF models are optimized for CPU inference and work well for local deployment. First, download your model in GGUF format from the Facet AI platform.

Using llama.cpp

Installation:
# Clone and build llama.cpp (recent versions build with CMake instead of make)
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release
# The binaries, including llama-cli, are written to build/bin/
Running your model:
# Basic inference
./build/bin/llama-cli -m path/to/your/model.gguf -p "Your prompt here"

# With system prompt
./build/bin/llama-cli -m path/to/your/model.gguf -sys "You are a helpful assistant." -p "What is the capital of France?"

# Interactive mode
./build/bin/llama-cli -m path/to/your/model.gguf -i
Advanced options:
# Control generation parameters
./build/bin/llama-cli -m model.gguf -p "Your prompt" --temp 0.7 --top-k 40 --top-p 0.9

# Use GPU acceleration (requires a GPU-enabled build; offloads 20 layers to the GPU)
./build/bin/llama-cli -m model.gguf -p "Your prompt" --gpu-layers 20
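
To make the `--temp`, `--top-k`, and `--top-p` options above more concrete, here is a minimal Python sketch of how these three settings narrow down the next-token choice. It is an illustration only: the function name `sample_token` is hypothetical, and real implementations, including llama.cpp, chain more samplers and may apply them in a different order.

```python
import math
import random


def sample_token(logits, temperature=0.7, top_k=40, top_p=0.9, rng=random):
    """Pick a token index from raw logits using temperature, top-k and top-p."""
    # Temperature rescales the logits: values below 1 sharpen the distribution.
    scaled = [logit / temperature for logit in logits]

    # Softmax, shifted by the maximum for numerical stability.
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [(index, value / total) for index, value in enumerate(exps)]

    # Top-k: keep only the k most likely tokens.
    probs.sort(key=lambda pair: pair[1], reverse=True)
    probs = probs[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for index, prob in probs:
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Draw one token from the survivors (weights need not sum to 1).
    indices = [index for index, _ in kept]
    weights = [prob for _, prob in kept]
    return rng.choices(indices, weights=weights, k=1)[0]
```

Lower `temperature`, `top_k`, or `top_p` values shrink the candidate set and make output more deterministic; higher values allow more varied text.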

Using Ollama

Installation:
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
Create a Modelfile:
FROM path/to/your/model.gguf

SYSTEM """
You are a helpful assistant specialized in answering questions about geography.
"""

PARAMETER temperature 0.7
PARAMETER top_k 40
PARAMETER top_p 0.9
Deploy your model:
# Create the model
ollama create my-geography-assistant -f Modelfile

# Run the model
ollama run my-geography-assistant

# If model is on Hugging Face Hub
ollama run hf.co/username/model-name:q8_0
API usage:
# Start Ollama server
ollama serve

# Make API calls
curl http://localhost:11434/api/generate -d '{
  "model": "my-geography-assistant",
  "prompt": "What is the capital of France?",
  "stream": false
}'
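
The same request can be sent from Python. The following is a minimal sketch using only the standard library; it assumes the Ollama server is running on the default port shown above and that the model was created as `my-geography-assistant`. The helper names `build_generate_request` and `generate` are illustrative, not part of any Ollama client library.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt, model="my-geography-assistant"):
    """Build the JSON body for a non-streaming /api/generate call."""
    return {"model": model, "prompt": prompt, "stream": False}


def generate(prompt, model="my-geography-assistant", url=OLLAMA_URL, timeout=120):
    """Send the prompt to Ollama and return the generated text."""
    body = json.dumps(build_generate_request(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # With "stream": false, the generated text is in the "response" field.
    return payload["response"]


# Example (requires a running Ollama server):
# print(generate("What is the capital of France?"))
```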

HuggingFace Transformers (Python)

For more control and integration with Python applications:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load your fine-tuned model
model_name = "username/your-fine-tuned-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Generate responses
def generate_response(prompt, max_new_tokens=100):
    # Move the inputs to the model's device (device_map="auto" may place it on a GPU)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,  # counts generated tokens only, not the prompt
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
response = generate_response("What is the capital of France?")
print(response)

Cloud Deployment

Cloud deployment is experimental for now; this section will be expanded in the future.

Google Cloud Run (vLLM)

Deploy your model on Google Cloud Run using vLLM for high-performance inference.

Prerequisites

  • Google Cloud Project with billing enabled
  • Docker installed locally
  • Google Cloud SDK installed

Step 1: Prepare Your Model

Export your model in merged format and upload to Google Cloud Storage:
# Upload your model to GCS
gsutil cp -r your-model-directory gs://your-bucket/models/

Step 2: Create Dockerfile

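FROM vllm/vllm-openai:latest

# Copy your model (expects the exported model in ./model next to the Dockerfile)
COPY model/ /app/model/

# Set environment variables
ENV MODEL_NAME=your-model
ENV MODEL_PATH=/app/model
ENV HOST=0.0.0.0
ENV PORT=8000

# Expose port
EXPOSE 8000

# Start the server. The base image already defines an ENTRYPOINT, so override
# it here; a CMD would only be appended to that entrypoint as arguments.
# --served-model-name sets the name clients pass in the "model" field.
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server", \
     "--model", "/app/model", \
     "--served-model-name", "your-model", \
     "--host", "0.0.0.0", \
     "--port", "8000"]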

Step 3: Build and Deploy

# Build the container
gcloud builds submit --tag gcr.io/your-project/vllm-deployment

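# Deploy to Cloud Run
# --port must match the port the container listens on (Cloud Run defaults to 8080).
# Note: the stock vllm/vllm-openai image is built for NVIDIA GPUs, so the service
# will likely also need Cloud Run GPU settings to start successfully.
gcloud run deploy vllm-service \
  --image gcr.io/your-project/vllm-deployment \
  --platform managed \
  --region us-central1 \
  --port 8000 \
  --memory 8Gi \
  --cpu 4 \
  --min-instances 0 \
  --max-instances 10 \
  --allow-unauthenticated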
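
The `--memory 8Gi` value above is a placeholder and should be sized to your model. As a rough rule of thumb, the weights alone need about the parameter count times the bytes per parameter. The sketch below turns that into a quick estimate; the function name and the default `overhead` factor of 1.2 (for activations, KV cache, and runtime) are assumptions for illustration, not measured values.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "bfloat16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def estimate_memory_gib(num_params, dtype="float16", overhead=1.2):
    """Rough memory estimate in GiB: weight bytes times an assumed overhead factor."""
    weight_bytes = num_params * BYTES_PER_PARAM[dtype]
    return weight_bytes * overhead / (1024 ** 3)


# A 7-billion-parameter model in float16 needs roughly 13 GiB for weights alone,
# so it would not fit in 8Gi without quantization.
weights_only = estimate_memory_gib(7e9, "float16", overhead=1.0)
```

Treat the result as a lower bound when choosing instance sizes; actual usage depends on context length, batch size, and the serving engine.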

Step 4: Test Your Deployment

# Get the service URL
SERVICE_URL=$(gcloud run services describe vllm-service --region=us-central1 --format="value(status.url)")

# Test the API
curl -X POST "$SERVICE_URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
You can also deploy your model on AWS, Azure, or other cloud platforms using vLLM or SGLang.
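
The Step 4 request can also be issued from Python. This is a minimal standard-library sketch, assuming the service exposes the OpenAI-compatible `/v1/chat/completions` route used in the curl example and that `your-model` is the name the server was started with. The helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(user_message, model="your-model"):
    """Build the JSON body for a single-turn chat completion."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }


def extract_reply(payload):
    """Pull the assistant's text out of an OpenAI-style chat completion response."""
    return payload["choices"][0]["message"]["content"]


def chat(service_url, user_message, model="your-model", timeout=120):
    """POST a chat request to the deployed service and return the reply text."""
    body = json.dumps(build_chat_request(user_message, model)).encode("utf-8")
    request = urllib.request.Request(
        service_url.rstrip("/") + "/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))


# Example (requires a deployed service; pass the URL from `gcloud run services describe`):
# print(chat(service_url, "What is the capital of France?"))
```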

Container Deployment

Docker Compose

Create a docker-compose.yml for easy local deployment:
version: "3.8"
services:
  model-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - MODEL_NAME=your-model
      - MODEL_PATH=/app/model
    volumes:
      - ./model:/app/model
    deploy:
      resources:
        limits:
          memory: 8G
        reservations:
          memory: 4G

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - model-api

Kubernetes Deployment

Deploy your model on Kubernetes for scalable production use:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: model-api
  template:
    metadata:
      labels:
        app: model-api
    spec:
      containers:
        - name: model-api
          image: your-registry/model-api:latest
          ports:
            - containerPort: 8000
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
            limits:
              memory: "8Gi"
              cpu: "4"
          env:
            - name: MODEL_NAME
              value: "your-model"
---
apiVersion: v1
kind: Service
metadata:
  name: model-service
spec:
  selector:
    app: model-api
  ports:
    - port: 80
      targetPort: 8000
  type: LoadBalancer
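
Model servers can take a while to load weights, so a rollout script may want to wait until the service answers before sending traffic. The sketch below polls a health URL until it returns HTTP 200. The URL is an assumption: vLLM's OpenAI-compatible server exposes `/health`, but other servers may use a different path. The function names are illustrative.

```python
import time
import urllib.error
import urllib.request


def is_healthy(url, timeout=5):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or a non-2xx status.
        return False


def wait_until_ready(url, probe=is_healthy, attempts=30, delay=10.0, sleep=time.sleep):
    """Poll `probe(url)` until it succeeds or `attempts` run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        # Do not sleep after the final failed attempt.
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Example (hypothetical address of the LoadBalancer service):
# wait_until_ready("http://<external-ip>/health")
```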

Troubleshooting

Common Issues

Symptoms: Model fails to load or returns errors
Solutions:
  • Check model file integrity
  • Verify model format compatibility
  • Ensure sufficient memory allocation
  • Check file permissions

Symptoms: Slow inference, high latency, timeouts
Solutions:
  • Optimize model quantization
  • Increase compute resources
  • Implement caching
  • Use batch processing

Symptoms: Out of memory errors, crashes
Solutions:
  • Use quantized models
  • Increase memory allocation
  • Implement model sharding
  • Use smaller batch sizes

Symptoms: Connection timeouts, API failures
Solutions:
  • Check network connectivity
  • Verify firewall settings
  • Implement retry logic
  • Use load balancing
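
As one way to implement the retry logic suggested for connection timeouts and API failures, the following sketch retries a call with exponential backoff and jitter. The function name, the default delays, and the choice of exceptions to retry on are assumptions; adjust them to the client library you use.

```python
import random
import time


def retry_with_backoff(call, attempts=5, base_delay=0.5, max_delay=8.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Run `call()` and retry on transient errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            # Give up and re-raise once the last attempt has failed.
            if attempt == attempts - 1:
                raise
            # Double the delay each time, capped at max_delay, plus random jitter
            # so that many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay + random.uniform(0, delay / 2))
```

Wrap only idempotent requests this way, for example `retry_with_backoff(lambda: send_request(prompt))`, where `send_request` stands in for your own client function.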

Next Steps

After deploying your model:
  1. Monitor Performance: Track metrics and user feedback
  2. Scale as Needed: Adjust resources based on usage patterns
  3. Iterate and Improve: Use production data to improve your model
  4. Maintain Security: Keep your deployment secure and updated
Your model is now ready for production use! Monitor its performance and gather feedback to continue improving your AI system.