This guide walks you through everything you need to know about working with datasets in Facet, from uploading your data to configuring field mappings and processing options.

Uploading Your Dataset

Supported File Formats

Facet supports a wide range of data formats for maximum flexibility:

CSV Files

Best for: Structured data, spreadsheets, tabular data
  • Comma-separated values
  • Headers in first row
  • UTF-8 encoding recommended
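For example, a small instruction-style CSV with headers in the first row could look like this (the column names are illustrative; you map them to roles later in the field mapping step):

prompt,response
"What is the capital of France?","The capital of France is Paris."
"Summarize photosynthesis in one sentence.","Plants turn sunlight, water, and carbon dioxide into glucose and oxygen."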

JSON/JSONL

Best for: Conversational data, complex structures
  • JSON: Single object with an array of examples
  • JSONL: One JSON object per line
  • Nested structures supported
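As an illustration, each line of a JSONL file holds one self-contained example, such as (field names are illustrative):

{"system": "You are a helpful assistant.", "question": "What is 2 + 2?", "answer": "2 + 2 equals 4."}
{"system": "You are a helpful assistant.", "question": "Name a prime number.", "answer": "7 is a prime number."}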

Parquet

Best for: Large datasets, efficient storage
  • Columnar format
  • Fast loading and processing
  • Compressed storage
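If your data currently lives in CSV, one common way to produce a Parquet file is with pandas; this is a general-purpose sketch, not a Facet-specific requirement, and it assumes pyarrow or fastparquet is installed:

import pandas as pd

# Read an existing CSV and write it back out as compressed Parquet.
# File names are illustrative.
df = pd.read_csv("my_dataset.csv")
df.to_parquet("my_dataset.parquet", compression="snappy", index=False)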

Excel Files

Best for: Business data, multiple sheets
  • .xlsx and .xls formats
  • Multiple sheets supported
  • First sheet used by default

File Size Limits

  • Maximum file size: 100MB per upload
  • Recommended size: 10-50MB for optimal processing speed
  • Large datasets: Consider splitting into multiple files (see the sketch below) or using the Hugging Face Hub
Very large files may take longer to process, and datasets over the 100MB limit should be imported from the Hugging Face Hub rather than uploaded directly.
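If a single file is over the limit, here is a minimal sketch of splitting a large CSV into smaller parts with pandas (the chunk size and file names are illustrative, and this is a general-purpose script rather than anything Facet-specific):

import pandas as pd

# Read the large file in chunks of 100,000 rows and write each chunk to
# its own smaller CSV so every part stays under the upload limit.
for i, chunk in enumerate(pd.read_csv("large_dataset.csv", chunksize=100_000)):
    chunk.to_csv(f"dataset_part_{i}.csv", index=False)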

Using Hugging Face Datasets

Public Datasets

Import datasets directly from Hugging Face Hub using their repository name:
Examples:
  • databricks/databricks-dolly-15k
  • HuggingFaceH4/ultrachat_200k
  • tatsu-lab/alpaca
1. Find a Dataset

Browse Hugging Face Datasets to find one that fits your needs.
Look for datasets with conversation or instruction-following formats for best results.
2. Copy the Repository Name

Use the format organization/dataset-name from the dataset page.
You don’t need to include the full URL, just the repository identifier.
3. Select Subset (Optional)

If the dataset has multiple subsets, choose the one you want to use.
Common subsets include “train”, “test”, “validation”, or specific task names.
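If you want to check a dataset's columns, subsets, and splits before importing it, you can preview it locally with the datasets library. This step is optional and independent of Facet; the repository name below is one of the examples above:

from datasets import load_dataset

# Load just the "train" split of a public dataset and inspect its structure.
# For datasets with multiple subsets, pass the subset name as the second
# argument, e.g. load_dataset("org/dataset-name", "subset", split="train").
dataset = load_dataset("tatsu-lab/alpaca", split="train")
print(dataset.column_names)  # the columns you will map in Facet
print(dataset[0])            # the first example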

Private Datasets

For private repositories, you’ll need to provide your Hugging Face token:
  1. Go to Hugging Face Settings
  2. Create a new token with “Read” permissions
  3. Enter the token when prompted during dataset import
Keep your Hugging Face token secure and never share it publicly.
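If you want to confirm the token works before importing, you can check it locally with the huggingface_hub library (optional; the token string below is a placeholder for your own "Read" token):

from huggingface_hub import HfApi

# whoami() raises an authentication error if the token is invalid.
api = HfApi(token="hf_xxx")  # placeholder token; never commit real tokens
print(api.whoami())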

Configuring Field Mappings

Field mappings tell Facet how to transform your data into the conversational format needed for training.

Understanding Field Types

Template Fields

Use for: Static text, instructions, prompts
{
  "type": "template",
  "value": "You are a helpful assistant. Answer the following question:"
}
Template fields use fixed text that doesn’t change between examples.

Processing Modes Explained

Language Modeling Mode

Best for: General conversation, instruction following, text generation
This mode creates conversations with three roles:
  • System: Instructions or context for the model
  • User: Questions, prompts, or input from the user
  • Assistant: Model’s responses
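For instance, a single processed example in this mode ends up as a list of role-tagged messages along these lines (illustrative content; the exact output structure may differ):

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant. Answer the following question:"},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."}
  ]
}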

Prompt-Only Mode

Best for: Reasoning tasks, math problems, structured thinking
This mode creates prompts without assistant responses, perfect for GRPO training where the model learns to reason step by step.
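An illustrative prompt-only example keeps only the system and user turns, leaving the assistant response to be generated during training (the exact structure may differ):

{
  "prompt": [
    {"role": "system", "content": "Think step by step and show your reasoning."},
    {"role": "user", "content": "If a train travels 60 km in 45 minutes, what is its average speed in km/h?"}
  ]
}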

Preference Tuning Mode

Best for: Aligning models with human preferences, RLHF
This mode creates data with chosen and rejected responses for preference-based training algorithms.
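Preference data pairs each prompt with a preferred and a less preferred response, roughly like this (illustrative content and field names):

{
  "prompt": "Explain recursion to a beginner.",
  "chosen": "Recursion is when a function solves a problem by calling itself on a smaller version of the same problem, with a base case that stops the calls.",
  "rejected": "Recursion is recursion."
}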

Vision Dataset Configuration

When working with images, Facet automatically detects image fields and enables vision processing.

Supported Image Formats

  • File paths: Direct references to image files in your dataset
  • Base64 encoded: Images embedded as base64 strings
  • Hugging Face format: Images with bytes field (from HF datasets)
Images are always added to user messages only, and you can include multiple images in a single message.
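As a sketch, a JSONL row in a vision dataset might pair an image reference with the text fields, for example (field names are illustrative; Facet detects the image field automatically):

{"image": "images/receipt_001.png", "question": "What is the total amount on this receipt?", "answer": "The total is $42.17."}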

Next Steps

Once your dataset is processed and ready:
  1. Start Training: Use your processed dataset to fine-tune a Gemma model
  2. Review Results: Check the training progress and evaluation metrics
  3. Export Model: Download your fine-tuned model in various formats
  4. Deploy: Set up your model for production use
Ready to start training? Head to the Fine-tuning guide to learn how to train your model.