What You Can Do
Facet’s preprocessing service transforms your raw data into the format required for fine-tuning Gemma models. Whether you have text conversations, images, or structured data, we’ll help you prepare it for training.

Upload Your Data
Upload CSV, JSON, JSONL, Parquet, or Excel files directly to the platform
Use Hugging Face Datasets
Import datasets directly from Hugging Face Hub with just a repository name
Support Multiple Tasks
Prepare data for language modeling, preference tuning, and reasoning tasks
Handle Vision Data
Process images alongside text for multimodal fine-tuning
Supported Data Formats
Text-Only Datasets
- CSV files: Structured data with columns for prompts, responses, etc.
- JSON/JSONL: Conversational data or structured examples
- Parquet: Efficient columnar format for large datasets
- Excel: Spreadsheet data with multiple sheets
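As a quick illustration, a conversational JSONL dataset stores one JSON record per line. The field names below (`prompt`, `response`) are only examples, not a required Facet schema:

```python
import json

# One record per line in a JSONL file; field names are illustrative.
record = {
    "prompt": "What is Gemma?",
    "response": "Gemma is a family of open models from Google.",
}
line = json.dumps(record)

# Reading a line back yields the original structure.
parsed = json.loads(line)
```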
Multimodal Datasets
- Images: PNG, JPEG, or other common formats
- Text + Images: Combined content for vision-language tasks
- Base64 encoded: Images embedded directly in your data
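For base64-embedded images, the raw image bytes are encoded into a text field alongside the prompt. A minimal sketch, assuming hypothetical field names rather than Facet’s actual schema:

```python
import base64

# Stand-in for the raw bytes of a PNG file.
image_bytes = b"\x89PNG\r\n\x1a\n...fake image data..."

# Embed the image directly in the record as base64 text.
record = {
    "prompt": "Describe this image.",
    "image": base64.b64encode(image_bytes).decode("ascii"),
}

# Decoding recovers the original bytes for preprocessing.
decoded = base64.b64decode(record["image"])
```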
Hugging Face Integration
- Public datasets: Access thousands of pre-existing datasets
- Private repositories: Use your own HF datasets with authentication
- Subset selection: Choose specific splits or subsets of larger datasets
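To make the import options concrete, here is a hypothetical helper that assembles the pieces an HF import needs; with the Hugging Face `datasets` library these map onto `load_dataset(path, name=..., split=..., token=...)`, where split slicing like `"train[:1000]"` selects a subset of a larger dataset:

```python
# Hypothetical helper, not Facet's actual API: collects the arguments
# needed to import a Hugging Face dataset.
def hf_import_args(repo_id, split="train", config=None, token=None):
    args = {"path": repo_id, "split": split}
    if config:
        args["name"] = config  # dataset subset/config name
    if token:
        args["token"] = token  # required for private repositories
    return args

# Split slicing selects only part of a large dataset.
args = hf_import_args("microsoft/DialoGPT-medium", split="train[:1000]")
```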
Processing Modes
Facet supports different processing modes depending on your fine-tuning goals. The conversational mode, best for general conversation, instruction following, and text generation, converts your data into a conversational format with `system`, `user`, and `assistant` roles. This is the most common mode for general-purpose fine-tuning tasks.
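A sketch of that conversion, assuming a flat prompt/response row (the helper and field names are illustrative, not Facet’s API):

```python
def to_conversation(row, system_prompt=None):
    """Turn a flat row into system/user/assistant messages (illustrative helper)."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": row["prompt"]})
    messages.append({"role": "assistant", "content": row["response"]})
    return {"messages": messages}

example = to_conversation(
    {"prompt": "Summarize photosynthesis.",
     "response": "Plants convert light into chemical energy."},
    system_prompt="You are a concise tutor.",
)
```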
Getting Started
1. Choose Your Data Source
   Upload a file from your computer or import from Hugging Face Hub. For Hugging Face datasets, you’ll need the repository name (e.g., “microsoft/DialoGPT-medium”).
2. Configure Field Mappings
   Map your data columns to the appropriate conversation roles. For example:
   - `question` → `user` role
   - `answer` → `assistant` role
   - `instruction` → `system` role
3. Select Processing Mode
   Choose the mode that matches your fine-tuning objective. The platform will automatically detect if you’re using images and enable vision processing.
4. Review and Process
   Preview your processed data and start the preprocessing job. You can always reprocess your dataset with different settings if needed.
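The column-to-role mapping in step 2 can be sketched as a simple dictionary applied per row; this is a minimal illustration of the idea, not Facet’s internal logic:

```python
# Which source column feeds which conversation role (example mapping).
FIELD_MAP = {"instruction": "system", "question": "user", "answer": "assistant"}
ROLE_ORDER = {"system": 0, "user": 1, "assistant": 2}

def map_row(row, field_map=FIELD_MAP):
    """Apply the mapping and order messages system -> user -> assistant."""
    messages = [
        {"role": role, "content": row[col]}
        for col, role in field_map.items()
        if col in row  # columns absent from a row are simply skipped
    ]
    messages.sort(key=lambda m: ROLE_ORDER[m["role"]])
    return {"messages": messages}

mapped = map_row({"question": "What is 2 + 2?", "answer": "4"})
```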
Next Steps
Once your dataset is processed, you can:
- Start Training: Use your processed dataset to fine-tune a Gemma model
- Preview Results: Review sample conversations to ensure quality
- Export Data: Download your processed dataset for external use
- Create Variations: Process the same data with different settings