This guide walks you through everything you need to know about working with datasets in Facet, from uploading your data to configuring field mappings and processing options.

Uploading Your Dataset

Supported File Formats

Facet supports a wide range of data formats for maximum flexibility:

CSV Files

Best for: Structured data, spreadsheets, tabular data
  • Comma-separated values
  • Headers in first row
  • UTF-8 encoding recommended
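For example, a small instruction-style CSV with headers in the first row could look like this (the column names are illustrative; you map them to roles later in the field mapping step):

prompt,response
"What is the capital of France?","The capital of France is Paris."
"Summarize photosynthesis in one sentence.","Plants turn sunlight, water, and carbon dioxide into glucose and oxygen."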

JSON/JSONL

Best for: Conversational data, complex structures
  • JSON: Single object with an array of examples
  • JSONL: One JSON object per line
  • Nested structures supported
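As an illustration, each line of a JSONL file holds one self-contained example, such as (field names are illustrative):

{"system": "You are a helpful assistant.", "question": "What is 2 + 2?", "answer": "2 + 2 equals 4."}
{"system": "You are a helpful assistant.", "question": "Name a prime number.", "answer": "7 is a prime number."}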

Parquet

Best for: Large datasets, efficient storage
  • Columnar format
  • Fast loading and processing
  • Compressed storage
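If your data currently lives in CSV, one common way to produce a Parquet file is with pandas; this is a general-purpose sketch, not a Facet-specific requirement, and it assumes pyarrow or fastparquet is installed:

import pandas as pd

# Read an existing CSV and write it back out as compressed Parquet.
# File names are illustrative.
df = pd.read_csv("my_dataset.csv")
df.to_parquet("my_dataset.parquet", compression="snappy", index=False)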

Excel Files

Best for: Business data, multiple sheets
  • .xlsx and .xls formats
  • Multiple sheets supported
  • First sheet used by default

File Size Limits

  • Maximum file size: 100MB per upload
  • Recommended size: 10-50MB for optimal processing speed
  • Large datasets: Consider splitting into multiple files (see the sketch below) or using the Hugging Face Hub
Very large files may take longer to process, and datasets over the 100MB limit should be imported from the Hugging Face Hub rather than uploaded directly.
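If a single file is over the limit, here is a minimal sketch of splitting a large CSV into smaller parts with pandas (the chunk size and file names are illustrative, and this is a general-purpose script rather than anything Facet-specific):

import pandas as pd

# Read the large file in chunks of 100,000 rows and write each chunk to
# its own smaller CSV so every part stays under the upload limit.
for i, chunk in enumerate(pd.read_csv("large_dataset.csv", chunksize=100_000)):
    chunk.to_csv(f"dataset_part_{i}.csv", index=False)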

Using Hugging Face Datasets

Public Datasets

Import datasets directly from Hugging Face Hub using their repository name:
Examples:
  • databricks/databricks-dolly-15k
  • HuggingFaceH4/ultrachat_200k
  • tatsu-lab/alpaca
1. Find a Dataset

Browse Hugging Face Datasets to find one that fits your needs.
Look for datasets with conversation or instruction-following formats for best results.
2. Copy the Repository Name

Use the format organization/dataset-name from the dataset page.
You don’t need to include the full URL, just the repository identifier.
3. Select Subset (Optional)

If the dataset has multiple subsets, choose the one you want to use.
Common subsets include “train”, “test”, “validation”, or specific task names.
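If you want to check a dataset's columns, subsets, and splits before importing it, you can preview it locally with the datasets library. This step is optional and independent of Facet; the repository name below is one of the examples above:

from datasets import load_dataset

# Load just the "train" split of a public dataset and inspect its structure.
# For datasets with multiple subsets, pass the subset name as the second
# argument, e.g. load_dataset("org/dataset-name", "subset", split="train").
dataset = load_dataset("tatsu-lab/alpaca", split="train")
print(dataset.column_names)  # the columns you will map in Facet
print(dataset[0])            # the first example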

Private Datasets

For private repositories, you’ll need to provide your Hugging Face token:
  1. Go to Hugging Face Settings
  2. Create a new token with “Read” permissions
  3. Enter the token when prompted during dataset import
Keep your Hugging Face token secure and never share it publicly.
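If you want to confirm the token works before importing, you can check it locally with the huggingface_hub library (optional; the token string below is a placeholder for your own "Read" token):

from huggingface_hub import HfApi

# whoami() raises an authentication error if the token is invalid.
api = HfApi(token="hf_xxx")  # placeholder token; never commit real tokens
print(api.whoami())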

Configuring Field Mappings

Field mappings tell Facet how to transform your data into the conversational format needed for training.

Understanding Field Types

Template Fields

Use for: Static text, instructions, prompts
{
  "type": "template",
  "value": "You are a helpful assistant. Answer the following question:"
}
Template fields use fixed text that doesn’t change between examples.

Processing Modes Explained

Language Modeling Mode

Best for: General conversation, instruction following, text generation
This mode creates conversations with three roles:
  • System: Instructions or context for the model
  • User: Questions, prompts, or input from the user
  • Assistant: Model’s responses
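For instance, a single processed example in this mode ends up as a list of role-tagged messages along these lines (illustrative content; the exact output structure may differ):

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant. Answer the following question:"},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."}
  ]
}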

Prompt-Only Mode

Best for: Reasoning tasks, math problems, structured thinking
This mode creates prompts without assistant responses, perfect for GRPO training where the model learns to reason step by step.
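An illustrative prompt-only example keeps only the system and user turns, leaving the assistant response to be generated during training (the exact structure may differ):

{
  "prompt": [
    {"role": "system", "content": "Think step by step and show your reasoning."},
    {"role": "user", "content": "If a train travels 60 km in 45 minutes, what is its average speed in km/h?"}
  ]
}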

Preference Tuning Mode

Best for: Aligning models with human preferences, RLHF
This mode creates data with chosen and rejected responses for preference-based training algorithms.
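Preference data pairs each prompt with a preferred and a less preferred response, roughly like this (illustrative content and field names):

{
  "prompt": "Explain recursion to a beginner.",
  "chosen": "Recursion is when a function solves a problem by calling itself on a smaller version of the same problem, with a base case that stops the calls.",
  "rejected": "Recursion is recursion."
}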

Vision Dataset Configuration

When working with images, Facet automatically detects image fields and enables vision processing.

Supported Image Formats

  • File paths: Direct references to image files in your dataset
  • Base64 encoded: Images embedded as base64 strings
  • Hugging Face format: Images with bytes field (from HF datasets)
Images are always added to user messages only, and you can include multiple images in a single message.
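As a sketch, a JSONL row in a vision dataset might pair an image reference with the text fields, for example (field names are illustrative; Facet detects the image field automatically):

{"image": "images/receipt_001.png", "question": "What is the total amount on this receipt?", "answer": "The total is $42.17."}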

Next Steps

Once your dataset is processed and ready:
  1. Start Training: Use your processed dataset to fine-tune a Gemma model
  2. Review Results: Check the training progress and evaluation metrics
  3. Export Model: Download your fine-tuned model in various formats
  4. Deploy: Set up your model for production use
Ready to start training? Head to the Fine-tuning guide to learn how to train your model.