Model Deployment Guide

Once your model is trained and evaluated, it’s time to deploy it for production use. This guide covers multiple deployment options from local development to cloud production.

Deployment Options

Local Deployment

Run models on your local machine for development and testing

Cloud Deployment

Deploy on cloud platforms for production use

Container Deployment

Use Docker containers for consistent deployment

Serverless Deployment

Deploy on serverless platforms for cost efficiency

Local Deployment

GGUF Format (Recommended for Local)

GGUF models are optimized for CPU inference and work well for local deployment. First, download your model in GGUF format from the Facet AI platform.

Using llama.cpp

Installation:

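# Clone and build llama.cpp (recent versions build with CMake; the plain `make` build is no longer supported)
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release
# Binaries such as llama-cli are written to build/bin/; run the commands below from that directory or add it to your PATH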

Running your model:

# Basic inference
./llama-cli -m path/to/your/model.gguf -p "Your prompt here"

# With system prompt
./llama-cli -m path/to/your/model.gguf -sys "You are a helpful assistant." -p "What is the capital of France?"

# Interactive mode
./llama-cli -m path/to/your/model.gguf -i

Advanced options:

# Control generation parameters
./llama-cli -m model.gguf -p "Your prompt" --temp 0.7 --top-k 40 --top-p 0.9

# Use GPU acceleration (if available)
./llama-cli -m model.gguf -p "Your prompt" --gpu-layers 20
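
To make the three sampling flags less abstract, here is a minimal pure-Python sketch of how temperature, top-k, and top-p are commonly applied to a next-token distribution. It illustrates the general technique only and is not llama.cpp's actual sampler; the function name `filter_distribution` and the toy logits are hypothetical.

```python
import math

def filter_distribution(logits, temp=0.7, top_k=40, top_p=0.9):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    # Temperature: divide logits before softmax; values below 1 sharpen the distribution.
    scaled = {tok: logit / temp for tok, logit in logits.items()}
    # Softmax, subtracting the maximum for numerical stability.
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    # Top-k: keep only the k most likely tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize the surviving tokens so they sum to 1.
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}

# Hypothetical logits for the token following "The capital of France is"
toy_logits = {"Paris": 5.0, "Lyon": 2.0, "London": 1.0, "banana": -3.0}
filtered = filter_distribution(toy_logits, temp=0.7, top_k=3, top_p=0.9)
```

With these toy values, "Paris" alone already exceeds the 0.9 cumulative threshold, so it is the only candidate left. Raising `temp` or `top_p` lets more tokens survive, which is why higher values produce more varied output.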

Using Ollama

Installation:

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

Create a Modelfile:

FROM path/to/your/model.gguf

SYSTEM """
You are a helpful assistant specialized in answering questions about geography.
"""

PARAMETER temperature 0.7
PARAMETER top_k 40
PARAMETER top_p 0.9

Deploy your model:

# Create the model
ollama create my-geography-assistant -f Modelfile

# Run the model
ollama run my-geography-assistant

# If the model is hosted on the Hugging Face Hub
ollama run hf.co/username/model-name:q8_0

API usage:

# Start the Ollama server (skip this if Ollama is already running as a background service)
ollama serve

# Make API calls
curl http://localhost:11434/api/generate -d '{
  "model": "my-geography-assistant",
  "prompt": "What is the capital of France?",
  "stream": false
}'
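
The same call can be made from Python. The following is a minimal sketch using only the standard library, assuming the server is reachable at the default address shown above; the helper names `build_generate_request` and `generate` are hypothetical and not part of any Ollama client library.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # same address as the curl example

def build_generate_request(model, prompt, url=OLLAMA_URL):
    """Build a non-streaming POST request for the /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(model, prompt, timeout=120):
    """Send the request and return the text from the "response" field of the reply."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read())["response"]
```

With the server running, `generate("my-geography-assistant", "What is the capital of France?")` should return the model's answer as a string. Building the request in a separate function makes it possible to inspect the payload without a running server.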

HuggingFace Transformers (Python)

For more control and integration with Python applications:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load your fine-tuned model
model_name = "username/your-fine-tuned-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Generate responses
def generate_response(prompt, max_new_tokens=100):
    # Move the tokenized inputs to the device the model was loaded on
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            # max_new_tokens counts generated tokens only, not prompt + output
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
response = generate_response("What is the capital of France?")
print(response)

Cloud Deployment

Cloud deployment is currently experimental; this section will be expanded in the future.

Google Cloud Run (vLLM)

Deploy your model on Google Cloud Run using vLLM for high-performance inference.

Prerequisites

Google Cloud Project with billing enabled
Docker installed locally
Google Cloud SDK installed

Step 1: Prepare Your Model

Export your model in merged format and upload it to Google Cloud Storage:

# Upload your model to GCS
gsutil cp -r your-model-directory gs://your-bucket/models/

Step 2: Create Dockerfile

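FROM vllm/vllm-openai:latest

# Copy your model (expects the merged model in a local model/ directory next to this Dockerfile)
COPY model/ /app/model/

# Set environment variables
ENV MODEL_NAME=your-model-name
ENV MODEL_PATH=/app/model
ENV HOST=0.0.0.0
ENV PORT=8000

# Expose port
EXPOSE 8000

# Start the server. The base image already defines an ENTRYPOINT that launches the
# API server, so override ENTRYPOINT instead of appending a second command via CMD.
# --served-model-name sets the name clients pass in the "model" field of a request.
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server", \
     "--model", "/app/model", \
     "--served-model-name", "your-model", \
     "--host", "0.0.0.0", \
     "--port", "8000"]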

Step 3: Build and Deploy

# Build the container
gcloud builds submit --tag gcr.io/your-project/vllm-deployment

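# Deploy to Cloud Run
# --port must match the port the container listens on (Cloud Run defaults to 8080).
# Note: the vllm/vllm-openai image targets NVIDIA GPUs, so a CPU-only service may
# fail to start; check Cloud Run's GPU options and raise memory if needed.
gcloud run deploy vllm-service \
  --image gcr.io/your-project/vllm-deployment \
  --platform managed \
  --region us-central1 \
  --port 8000 \
  --memory 8Gi \
  --cpu 4 \
  --min-instances 0 \
  --max-instances 10 \
  --allow-unauthenticated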

Step 4: Test Your Deployment

# Get the service URL
SERVICE_URL=$(gcloud run services describe vllm-service --region=us-central1 --format="value(status.url)")

# Test the API
curl -X POST "$SERVICE_URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
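
For application code, the same request can be sent from Python. This is a minimal standard-library sketch, assuming the service exposes the OpenAI-compatible `/v1/chat/completions` route used in the curl call above; the helper names `build_chat_request`, `extract_reply`, and `chat` are hypothetical.

```python
import json
import urllib.request

def build_chat_request(service_url, model, user_message):
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    url = service_url.rstrip("/") + "/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def extract_reply(response_body):
    """Return the assistant text from a parsed chat completion response."""
    return response_body["choices"][0]["message"]["content"]

def chat(service_url, model, user_message, timeout=120):
    """Send one user message and return the assistant's reply as a string."""
    request = build_chat_request(service_url, model, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return extract_reply(json.loads(reply.read()))
```

Pass the value of `SERVICE_URL` as `service_url`. The `model` argument has to match the name the server registered for the model, otherwise the server rejects the request.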

You can also deploy your model on AWS, Azure, or other cloud platforms using vLLM or SGLang.

Container Deployment

Docker Compose

Create a docker-compose.yml for easy local deployment:

version: "3.8"
services:
  model-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - MODEL_NAME=your-model
      - MODEL_PATH=/app/model
    volumes:
      - ./model:/app/model
    deploy:
      resources:
        limits:
          memory: 8G
        reservations:
          memory: 4G

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - model-api

Kubernetes Deployment

Deploy your model on Kubernetes for scalable production use:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: model-api
  template:
    metadata:
      labels:
        app: model-api
    spec:
      containers:
        - name: model-api
          image: your-registry/model-api:latest
          ports:
            - containerPort: 8000
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
            limits:
              memory: "8Gi"
              cpu: "4"
          env:
            - name: MODEL_NAME
              value: "your-model"
---
apiVersion: v1
kind: Service
metadata:
  name: model-service
spec:
  selector:
    app: model-api
  ports:
    - port: 80
      targetPort: 8000
  type: LoadBalancer

Troubleshooting

Common Issues

Model Loading Errors

Symptoms: Model fails to load or returns errors

Solutions:

Check model file integrity
Verify model format compatibility
Ensure sufficient memory allocation
Check file permissions

Performance Issues

Symptoms: Slow inference, high latency, timeouts

Solutions:

Optimize model quantization
Increase compute resources
Implement caching
Use batch processing

Memory Issues

Symptoms: Out of memory errors, crashes

Solutions:

Use quantized models
Increase memory allocation
Implement model sharding
Use smaller batch sizes
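
A rough calculation shows why quantization helps with memory. The sketch below estimates the memory needed for the weights alone from the parameter count and the nominal bits per weight. It ignores the KV cache, activations, and the per-block overhead that real quantization formats add, so treat the numbers as a lower bound. The function name and the 7-billion-parameter figure are illustrative assumptions.

```python
def estimate_weight_memory_gib(num_params, bits_per_weight):
    """Approximate memory for model weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1024**3

# Hypothetical 7-billion-parameter model at three nominal precisions
for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {estimate_weight_memory_gib(7e9, bits):.1f} GiB")
# FP16: 13.0 GiB
# 8-bit: 6.5 GiB
# 4-bit: 3.3 GiB
```

Under these assumptions, the FP16 weights alone would not fit in an 8 GiB memory limit, while a 4-bit quantization leaves room for the runtime and the context.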

Network Issues

Symptoms: Connection timeouts, API failures

Solutions:

Check network connectivity
Verify firewall settings
Implement retry logic
Use load balancing
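
Retry logic can be sketched as a small wrapper with exponential backoff. This is one common pattern and not something the deployment stack prescribes; the function name `call_with_retries`, the default delays, and the choice of retryable exceptions are assumptions to adapt to your client.

```python
import time

def call_with_retries(operation, max_attempts=4, base_delay=0.5,
                      retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(); on a retryable error, wait with exponential backoff and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Delay doubles each time: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Wrap the network call in a zero-argument function or lambda and pass it as `operation`. If your HTTP client raises other exception types for transient failures, add them to `retry_on`; errors that will not succeed on retry, such as an invalid request, are better left out.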

Next Steps

After deploying your model:

Monitor Performance: Track metrics and user feedback
Scale as Needed: Adjust resources based on usage patterns
Iterate and Improve: Use production data to improve your model
Maintain Security: Keep your deployment secure and updated

Your model is now ready for production use! Monitor its performance and gather feedback to continue improving your AI system.
