Uploading Your Dataset
Supported File Formats
Facet supports a wide range of data formats for maximum flexibility:CSV Files
Best for: Structured data, spreadsheets, tabular data
- Comma-separated values
- Headers in first row
- UTF-8 encoding recommended
JSON/JSONL
Best for: Conversational data, complex structures - JSON: Single object
with array of examples - JSONL: One JSON object per line - Nested structures
supported
Parquet
Best for: Large datasets, efficient storage - Columnar format - Fast
loading and processing - Compressed storage
Excel Files
Best for: Business data, multiple sheets
- .xlsx and .xls formats
- Multiple sheets supported
- First sheet used by default
File Size Limits
- Maximum file size: 100MB per upload
- Recommended size: 10-50MB for optimal processing speed
- Large datasets: Consider splitting into multiple files or using Hugging Face Hub
Very large files may take longer to process. Consider using Hugging Face Hub
for datasets over 100MB.
Using Hugging Face Datasets
Public Datasets
Import datasets directly from Hugging Face Hub using their repository name:1
Find a Dataset
Browse Hugging Face Datasets to find one that fits your needs.
Look for datasets with conversation or instruction-following formats for best results.
2
Copy the Repository Name
Use the format
organization/dataset-name
from the dataset page.You don’t need to include the full URL, just the repository identifier.
3
Select Subset (Optional)
If the dataset has multiple subsets, choose the one you want to use.
Common subsets include “train”, “test”, “validation”, or specific task names.
Private Datasets
For private repositories, you’ll need to provide your Hugging Face token:- Go to Hugging Face Settings
- Create a new token with “Read” permissions
- Enter the token when prompted during dataset import
Keep your Hugging Face token secure and never share it publicly.
Configuring Field Mappings
Field mappings tell Facet how to transform your data into the conversational format needed for training.Understanding Field Types
Use for: Static text, instructions, prompts
Template fields use fixed text that doesn’t change between examples.
Processing Modes Explained
Language Modeling Mode
Best for: General conversation, instruction following, text generation This mode creates conversations with three roles:- System: Instructions or context for the model
- User: Questions, prompts, or input from the user
- Assistant: Model’s responses
Prompt-Only Mode
Best for: Reasoning tasks, math problems, structured thinking This mode creates prompts without assistant responses, perfect for GRPO training where the model learns to reason step-by-step.Preference Tuning Mode
Best for: Aligning models with human preferences, RLHF This mode creates data with chosen and rejected responses for preference-based training algorithms.Vision Dataset Configuration
When working with images, Facet automatically detects image fields and enables vision processing.Supported Image Formats
- File paths: Direct references to image files in your dataset
- Base64 encoded: Images embedded as base64 strings
- Hugging Face format: Images with
bytes
field (from HF datasets)
Images are always added to user messages only, and you can include multiple
images in a single message.
Next Steps
Once your dataset is processed and ready:- Start Training: Use your processed dataset to fine-tune a Gemma model
- Review Results: Check the training progress and evaluation metrics
- Export Model: Download your fine-tuned model in various formats
- Deploy: Set up your model for production use