Once your model is trained and evaluated, it’s time to deploy it for production use. This guide covers multiple deployment options from local development to cloud production.

    # Clone and build llama.cpp
    git clone https://github.com/ggerganov/llama.cpp.git
    cd llama.cpp
    make

Running your model:

    # Basic inference
    ./llama-cli -m path/to/your/model.gguf -p "Your prompt here"

    # With system prompt
    ./llama-cli -m path/to/your/model.gguf -sys "You are a helpful assistant." -p "What is the capital of France?"

    # Interactive mode
    ./llama-cli -m path/to/your/model.gguf -i

    # Install Ollama
    curl -fsSL https://ollama.com/install.sh | sh

Create a Modelfile:

    FROM path/to/your/model.gguf
    SYSTEM """You are a helpful assistant specialized in answering questions about geography."""
    PARAMETER temperature 0.7
    PARAMETER top_k 40
    PARAMETER top_p 0.9

Deploy your model:

    # Create the model
    ollama create my-geography-assistant -f Modelfile

    # Run the model
    ollama run my-geography-assistant

    # If the model is on the Hugging Face Hub
    ollama run hf.co/username/model-name:q8_0

API usage:

    # Start Ollama server
    ollama serve

    # Make API calls
    curl http://localhost:11434/api/generate -d '{
      "model": "my-geography-assistant",
      "prompt": "What is the capital of France?",
      "stream": false
    }'

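Hand-written JSON in a curl call breaks as soon as a prompt contains a double quote. The sketch below wraps the same `/api/generate` request in small shell functions that escape the prompt first. The function names and the `OLLAMA_URL` variable are hypothetical, chosen for illustration, and the escaping covers only backslashes and double quotes.

```shell
# Escape backslashes and double quotes so a string can sit inside JSON quotes.
# Newlines and other control characters are NOT handled in this sketch.
json_escape() {
    printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Build the request body for the /api/generate endpoint.
# $1 = model name, $2 = prompt
ollama_payload() {
    printf '{"model": "%s", "prompt": "%s", "stream": false}' \
        "$(json_escape "$1")" "$(json_escape "$2")"
}

# Hypothetical helper: send one prompt to a local Ollama server.
# OLLAMA_URL is an assumed variable name; the default is the local port 11434.
ollama_generate() {
    curl -sS --max-time 120 "${OLLAMA_URL:-http://localhost:11434}/api/generate" \
        -H "Content-Type: application/json" \
        -d "$(ollama_payload "$1" "$2")"
}
```

With `ollama serve` running, `ollama_generate my-geography-assistant "What is the capital of France?"` prints the raw JSON reply. With `"stream": false`, the generated text is in the `response` field.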

    # Get the service URL
    SERVICE_URL=$(gcloud run services describe vllm-service --region=us-central1 --format="value(status.url)")

    # Test the API
    curl -X POST "$SERVICE_URL/v1/chat/completions" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "your-model",
        "messages": [
          {"role": "user", "content": "What is the capital of France?"}
        ]
      }'

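A freshly deployed service may take a while to load the model, so the first request can fail even when the deployment is healthy. As a rough sketch, the function below polls the OpenAI-compatible `/v1/models` route until it answers with HTTP 200. The function name, default attempt count, and delay are assumptions for illustration, and it assumes the service accepts unauthenticated requests, as the curl call above does.

```shell
# Poll an endpoint until it returns HTTP 200 or the attempts run out.
# $1 = base URL, $2 = max attempts (default 10), $3 = seconds between attempts (default 5)
wait_for_service() {
    base_url=$1
    max_attempts=${2:-10}
    delay_seconds=${3:-5}
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # -o /dev/null discards the body; -w prints only the status code.
        status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 30 "$base_url/v1/models") || true
        if [ "$status" = "200" ]; then
            return 0
        fi
        echo "attempt $attempt/$max_attempts: HTTP $status" >&2
        attempt=$((attempt + 1))
        sleep "$delay_seconds"
    done
    return 1
}
```

A typical call is `wait_for_service "$SERVICE_URL" 20 10 && echo "service is ready"`, run before sending the first chat completion request.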
You can also deploy your model on AWS, Azure, or other cloud platforms using vLLM or SGLang.
Symptoms: Model fails to load or returns errors

Solutions:

- Check model file integrity
- Verify model format compatibility
- Ensure sufficient memory allocation
- Check file permissions
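
Some of these checks can be scripted before the model is handed to a runtime. The sketch below assumes a GGUF model, as in the llama.cpp and Ollama examples: it confirms that the file exists, is readable, and starts with the four-byte `GGUF` magic. The function name and return codes are hypothetical, and passing does not prove the file is complete or uncorrupted.

```shell
# Basic pre-flight checks for a GGUF model file.
# Returns 0 if OK, 1 if missing, 2 if unreadable, 3 if the magic bytes are wrong.
check_model_file() {
    model_path=$1
    if [ ! -f "$model_path" ]; then
        echo "missing: $model_path" >&2
        return 1
    fi
    if [ ! -r "$model_path" ]; then
        echo "not readable (check permissions): $model_path" >&2
        return 2
    fi
    # GGUF files begin with the ASCII bytes "GGUF"; strip NULs so the shell
    # does not complain when the file is some other binary format.
    magic=$(dd if="$model_path" bs=4 count=1 2>/dev/null | tr -d '\000')
    if [ "$magic" != "GGUF" ]; then
        echo "not a GGUF file: $model_path" >&2
        return 3
    fi
    return 0
}
```

For example, `check_model_file path/to/your/model.gguf && echo "looks like a GGUF file"`. Comparing a checksum against the one published with the model is a stronger integrity test.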
Performance Issues

Symptoms: Slow inference, high latency, timeouts

Solutions:

- Optimize model quantization
- Increase compute resources
- Implement caching
- Use batch processing
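
To illustrate the caching suggestion, here is a minimal file-based cache, not a production design. It stores the output of any command under a key derived from the prompt and skips the command on a repeat. The function name and cache layout are hypothetical. The key uses `cksum`, so collisions are possible, and reusing answers only makes sense when repeated prompts should return the same text.

```shell
# Run a command once per distinct key and replay its output afterwards.
# $1 = cache directory, $2 = text to derive the key from, rest = command to run
cached_call() {
    cache_dir=$1
    key_text=$2
    shift 2
    mkdir -p "$cache_dir" || return 1
    # cksum prints "<crc> <size>"; join the fields into one file name.
    key=$(printf '%s' "$key_text" | cksum | tr -s ' \t' '_')
    cache_file="$cache_dir/$key"
    if [ -f "$cache_file" ]; then
        cat "$cache_file"
        return 0
    fi
    # Cache miss: run the backend and keep the result only if it succeeded.
    if "$@" > "$cache_file.tmp"; then
        mv "$cache_file.tmp" "$cache_file"
        cat "$cache_file"
    else
        rm -f "$cache_file.tmp"
        return 1
    fi
}
```

The command after the key can be any inference call, for example `cached_call ./cache "$PROMPT" ./llama-cli -m path/to/your/model.gguf -p "$PROMPT"`.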
Memory Issues

Symptoms: Out of memory errors, crashes

Solutions:

- Use quantized models
- Increase memory allocation
- Implement model sharding
- Use smaller batch sizes
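
A back-of-the-envelope estimate shows why quantization helps: weight memory is roughly the parameter count times bits per weight, divided by 8. The function below is a sketch of that arithmetic under simplifying assumptions. It counts weights only, so the KV cache, activations, and runtime overhead come on top, and real quantization formats often use slightly more bits per weight than their name suggests. The function name is hypothetical.

```shell
# Approximate weight memory in decimal MB: parameters x bits per weight / 8.
# $1 = parameter count in billions (integer), $2 = bits per weight (integer)
estimate_weight_mb() {
    echo $(( $1 * 1000 * $2 / 8 ))
}
```

For a 7-billion-parameter model, `estimate_weight_mb 7 16` prints `14000` (about 14 GB at 16 bits), while `estimate_weight_mb 7 4` prints `3500` (about 3.5 GB at 4 bits).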
Network Issues

Symptoms: Connection timeouts, API failures

Solutions: