Using Hugging Face Datasets
We recommend using datasets from the Hugging Face Hub for easy integration and management. If your data needs are not covered by our platform or by custom upload, you can preprocess the data manually and upload it to the Hugging Face Hub first. Import datasets directly from the Hugging Face Hub using their repository name:
1. Find a Dataset
Browse Hugging Face Datasets to find one that fits your needs.
Look for datasets with conversation or instruction-following formats for best results.
2. Copy the Repository Name
Use the format organization/dataset-name from the dataset page. You don’t need to include the full URL, just the repository identifier.
3. Select Subset and Split
If the dataset has multiple subsets, choose the one you want to use, then choose a split.
Common splits include “train”, “test”, and “validation”; subsets are often named after specific tasks.
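The three steps above can be sketched in Python. This is an illustration, not Facet's own import code: `to_repo_id` and `load_from_hub` are hypothetical helper names, and the sketch assumes the dataset page URL has the form `https://huggingface.co/datasets/organization/dataset-name`. Loading relies on `load_dataset` from the `datasets` package, which takes the repository identifier, an optional subset via `name`, and a `split`.

```python
from urllib.parse import urlparse

HUB_HOST = "huggingface.co"


def to_repo_id(value: str) -> str:
    """Reduce a dataset page URL to organization/dataset-name (hypothetical helper)."""
    value = value.strip()
    if "://" in value:
        parsed = urlparse(value)
        if parsed.netloc != HUB_HOST:
            raise ValueError(f"not a Hugging Face URL: {value}")
        parts = [p for p in parsed.path.split("/") if p]
        # Assumption: dataset pages live under /datasets/<organization>/<dataset-name>
        if len(parts) < 3 or parts[0] != "datasets":
            raise ValueError(f"not a dataset page URL: {value}")
        return "/".join(parts[1:3])
    # Already a bare identifier; just drop stray slashes
    return value.strip("/")


def load_from_hub(repo_id: str, subset=None, split: str = "train"):
    """Load one subset and split; needs the `datasets` package and network access."""
    from datasets import load_dataset  # imported lazily so to_repo_id works without it

    return load_dataset(to_repo_id(repo_id), name=subset, split=split)
```

Calling `load_from_hub("organization/dataset-name", subset=None, split="train")` with a real repository identifier would fetch the training split of the default subset.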
Uploading Your Dataset
We are working on expanding this service to handle unstructured documents with LLM-based synthesis.
Supported File Formats
Facet supports a wide range of data formats for maximum flexibility:
CSV Files
Best for: Structured data, spreadsheets, tabular data
- Comma-separated values
- Headers in first row
- UTF-8 encoding recommended
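A quick local check before uploading a CSV file might look like the sketch below. It uses only the Python standard library; `read_csv_rows` is a hypothetical helper, and the header rule it enforces (every column name non-empty) is an assumption about what "headers in first row" requires.

```python
import csv
import io


def read_csv_rows(raw: bytes) -> list:
    """Decode UTF-8 CSV bytes and return one dict per data row, keyed by header."""
    # utf-8-sig decodes plain UTF-8 and also drops a leading byte-order mark
    text = raw.decode("utf-8-sig")
    reader = csv.DictReader(io.StringIO(text, newline=""))
    headers = reader.fieldnames
    # Assumption: every column in the first row needs a non-empty name
    if not headers or any(not (name or "").strip() for name in headers):
        raise ValueError("first row must contain non-empty column headers")
    return list(reader)
```

A file that is not valid UTF-8 raises `UnicodeDecodeError` at the decode step, which surfaces encoding problems before the upload.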
JSON/JSONL
Best for: Conversational data, complex structures
- JSON: Single object with array of examples
- JSONL: One JSON object per line
- Nested structures supported
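The difference between the two layouts can be illustrated with the standard `json` module. `parse_examples` is a hypothetical helper, and the rule it uses for JSON files (the top-level object holds exactly one list-valued key) is an assumption, since the key name Facet expects is not stated here.

```python
import json


def parse_examples(text: str, fmt: str) -> list:
    """Return the list of examples from JSON or JSONL text."""
    if fmt == "jsonl":
        # One JSON object per line; blank lines are skipped
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if fmt == "json":
        data = json.loads(text)
        if isinstance(data, list):
            return data
        # Assumption: the single object keeps its examples in one list-valued key
        lists = [v for v in data.values() if isinstance(v, list)]
        if len(lists) != 1:
            raise ValueError("expected exactly one array of examples")
        return lists[0]
    raise ValueError(f"unknown format: {fmt}")


# The same nested example, once in each layout
json_text = '{"examples": [{"q": "hi", "meta": {"lang": "en"}}]}'
jsonl_text = '{"q": "hi", "meta": {"lang": "en"}}\n'
```

Both inputs parse to the same list of examples, so the choice between JSON and JSONL is mostly about file size and streaming convenience.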
Parquet
Best for: Large datasets, efficient storage
- Columnar format
- Fast loading and processing
- Compressed storage
Excel Files
Best for: Business data, multiple sheets
- .xlsx and .xls formats
- Multiple sheets supported
- First sheet used by default
File Size Limits
- Maximum file size: 100MB per upload
- Recommended size: 10-50MB for optimal processing speed
- Large datasets: Consider splitting into multiple files or using Hugging Face Hub
Large files take longer to process. For datasets over 100MB, use the Hugging Face Hub
instead of direct upload.
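Splitting a large line-oriented file (JSONL, for instance) into several uploads could be done along these lines. This is a sketch under assumptions: `split_lines_by_size` is a hypothetical helper, and the limit is interpreted as 100 MiB because the exact byte count behind "100MB" is not specified.

```python
MAX_UPLOAD_BYTES = 100 * 1024 * 1024  # assumption: "100MB" means 100 MiB


def split_lines_by_size(lines, max_bytes=MAX_UPLOAD_BYTES):
    """Group lines into chunks whose UTF-8 size stays within max_bytes."""
    chunks, current, current_size = [], [], 0
    for line in lines:
        size = len(line.encode("utf-8")) + 1  # +1 for the newline written after each line
        if size > max_bytes:
            raise ValueError("a single line exceeds the size limit")
        # Start a new chunk when adding this line would pass the limit
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(line)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be written to its own file. Passing a smaller `max_bytes`, such as 50 MiB, keeps files inside the recommended range.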
Configuring Field Mappings
Field mappings tell Facet how to transform your data into the conversational format needed for training.
Understanding Field Types
In the dropdown menus for field selection, you will see three types of fields:
- Template Fields
- Column Fields
- Image Fields
Template Fields
Use for: Static text, instructions, prompts
Template fields use fixed text that doesn’t change between examples.
Processing Modes Explained
Language Modeling Mode
Best for: General conversation, instruction following, text generation
This mode creates conversations with three roles:
- System: Instructions or context for the model
- User: Questions, prompts, or input from the user
- Assistant: Model’s responses
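To make the three roles concrete, here is a minimal sketch of how a field mapping might turn one dataset row into a conversation. The mapping schema (`type` and `value` keys) and the output message layout are assumptions chosen for illustration, not Facet's actual internal format, and `build_conversation` is a hypothetical name. A template field supplies the same fixed text for every row, while a column field reads its text from the row.

```python
def build_conversation(row: dict, mapping: dict) -> list:
    """Build system/user/assistant messages for one row from a field mapping."""
    messages = []
    for role in ("system", "user", "assistant"):
        field = mapping.get(role)
        if field is None:
            continue  # role not mapped, so it is left out of the conversation
        if field["type"] == "template":
            content = field["value"]  # fixed text, identical for every example
        elif field["type"] == "column":
            content = str(row[field["value"]])  # text taken from the named column
        else:
            raise ValueError(f"unsupported field type: {field['type']}")
        messages.append({"role": role, "content": content})
    return messages


# Hypothetical mapping: template for the system role, columns for the others
mapping = {
    "system": {"type": "template", "value": "You are a helpful tutor."},
    "user": {"type": "column", "value": "question"},
    "assistant": {"type": "column", "value": "answer"},
}
row = {"question": "What is 2+2?", "answer": "4"}
conversation = build_conversation(row, mapping)
```

Here `conversation` holds three messages in system, user, assistant order, with the system text shared across all rows and the other two drawn from the row's columns.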
Prompt-Only Mode
Best for: Reasoning tasks, math problems, structured thinking
This mode creates prompts without assistant responses, perfect for GRPO training where the model learns to reason step-by-step.
Preference Tuning Mode
Best for: Aligning models with human preferences, RLHF
This mode creates data with chosen and rejected responses for preference-based training algorithms.
Data Augmentation
We are working on expanding this!
Next Steps
Once your dataset is processed and ready:
- Start Training: Use your processed dataset to fine-tune a Gemma model
- Review Results: Check the training progress and evaluation metrics
- Export Model: Download your fine-tuned model in various formats
- Deploy: Set up your model for production use