What You Can Do

Facet’s preprocessing service transforms your raw data into training-ready formats for fine-tuning Gemma models. Whether you have text conversations, images, or structured data, we’ll help you prepare it for training.

Upload Your Data

Upload CSV, JSON, JSONL, Parquet, or Excel files directly to the platform

Use Hugging Face Datasets

Import datasets directly from Hugging Face Hub with just a repository name

Support Multiple Tasks

Prepare data for language modeling, preference tuning, and reasoning tasks

Handle Vision Data

Process images alongside text for multimodal fine-tuning

Supported Data Formats

Text-Only Datasets

  • CSV files: Structured data with columns for prompts, responses, etc.
  • JSON/JSONL: Conversational data or structured examples
  • Parquet: Efficient columnar format for large datasets
  • Excel: Spreadsheet data with multiple sheets
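
To make the conversational JSONL layout concrete, here is a minimal sketch: one JSON object per line, with placeholder field names (instruction, question, answer) standing in for whatever column names your data already uses.

```python
import json

# Illustrative rows for a text-only dataset; the field names are placeholders.
rows = [
    {
        "instruction": "You are a helpful assistant.",
        "question": "What is the capital of France?",
        "answer": "The capital of France is Paris.",
    },
    {
        "instruction": "You are a helpful assistant.",
        "question": "Name a prime number greater than 10.",
        "answer": "11 is a prime number greater than 10.",
    },
]

# Write one JSON object per line -- the JSONL layout described above.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```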

Multimodal Datasets

  • Images: PNG, JPEG, or other common formats
  • Text + Images: Combined content for vision-language tasks
  • Base64 encoded: Images embedded directly in your data
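
As a sketch of the base64 option, the snippet below embeds an image directly in a JSONL record. The file path and field names are illustrative placeholders, not required names.

```python
import base64
import json

# Read an image and embed it as a base64 string next to its text fields.
# "photo.png" and the field names are illustrative placeholders.
with open("photo.png", "rb") as img_file:
    encoded_image = base64.b64encode(img_file.read()).decode("ascii")

record = {
    "image": encoded_image,  # base64-encoded image bytes
    "question": "What is shown in this picture?",
    "answer": "A tabby cat sleeping on a windowsill.",
}

with open("multimodal.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```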

Hugging Face Integration

  • Public datasets: Access thousands of pre-existing datasets
  • Private repositories: Use your own HF datasets with authentication
  • Subset selection: Choose specific splits or subsets of larger datasets
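
If you want to sanity-check a repository name, subset, or split before importing it, the Hugging Face datasets library uses the same concepts. A minimal sketch (the repository names below are examples, not requirements):

```python
from datasets import load_dataset

# Public dataset: the repository name alone is enough.
ds = load_dataset("tatsu-lab/alpaca", split="train")
print(ds.column_names)

# Private repository: authenticate with a Hugging Face access token.
# private_ds = load_dataset("your-org/your-dataset", split="train", token="hf_...")

# Subset selection: some datasets define named configurations and splits.
mrpc_val = load_dataset("glue", "mrpc", split="validation")
print(len(mrpc_val))
```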

Processing Modes

Facet supports different processing modes depending on your fine-tuning goals. For language modeling (best for general conversation, instruction following, and text generation), your data is converted into a conversational format with system, user, and assistant roles. This is the most common mode for general-purpose fine-tuning tasks.
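
For reference, the conversational structure looks roughly like the record below. The exact field names Facet produces may differ; treat this as a sketch of the shape, not the platform’s schema.

```python
# Sketch of a conversational example with system, user, and assistant roles.
# The exact output schema may differ; this only shows the general shape.
example = {
    "messages": [
        {"role": "system", "content": "You are a concise coding assistant."},
        {"role": "user", "content": "How do I reverse a list in Python?"},
        {"role": "assistant", "content": "Use my_list[::-1] for a reversed copy."},
    ]
}

for message in example["messages"]:
    print(f'{message["role"]}: {message["content"]}')
```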

Getting Started

1. Choose Your Data Source

Upload a file from your computer or import from Hugging Face Hub.
For Hugging Face datasets, you’ll need the repository name (e.g., “tatsu-lab/alpaca”).
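
If you are unsure whether a repository name is valid, you can check it locally with the huggingface_hub client before importing. This is an optional sanity check, not a required step, and the repository name is only an example.

```python
from huggingface_hub import dataset_info

# Optional sanity check: confirm the repository exists and is a dataset.
# Replace the name with the repository you plan to import.
info = dataset_info("tatsu-lab/alpaca")
print(info.id)
```
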
2. Configure Field Mappings

Map your data columns to the appropriate conversation roles. For example:
  • question → user role
  • answer → assistant role
  • instruction → system role
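
As a rough illustration of what such a mapping does, the snippet below applies a column-to-role mapping to one raw row. It is a conceptual sketch, not the platform’s implementation.

```python
# Conceptual sketch of applying a column-to-role mapping to one raw row.
raw_row = {
    "instruction": "Answer in one sentence.",
    "question": "Why is the sky blue?",
    "answer": "Sunlight scatters off air molecules, and blue scatters the most.",
}

field_mapping = {
    "instruction": "system",
    "question": "user",
    "answer": "assistant",
}

messages = [
    {"role": role, "content": raw_row[column]}
    for column, role in field_mapping.items()
]
print(messages)
```
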
3. Select Processing Mode

Choose the mode that matches your fine-tuning objective.
The platform will automatically detect if you’re using images and enable vision processing.
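
Conceptually, image detection can be as simple as checking whether any mapped column holds image data. The heuristic below is only an illustration of that idea, not Facet’s actual detection logic.

```python
# Illustrative heuristic only -- not the platform's actual detection logic.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".webp")

def looks_like_image(value: str) -> bool:
    """Guess whether a string value refers to or contains an image."""
    value = value.strip()
    return value.lower().endswith(IMAGE_EXTENSIONS) or value.startswith("data:image/")

row = {"question": "Describe this photo.", "image": "samples/cat.png"}
needs_vision = any(isinstance(v, str) and looks_like_image(v) for v in row.values())
print(needs_vision)  # True -> enable vision processing
```
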
4. Review and Process

Preview your processed data and start the preprocessing job.
You can always reprocess your dataset with different settings if needed.
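
If you export the processed data, a quick local preview can help confirm the conversation structure looks right. The file name below is a placeholder for wherever you saved the export.

```python
import json

# Print the first few processed records to eyeball the conversation structure.
# "processed.jsonl" is a placeholder for wherever you saved the export.
with open("processed.jsonl", encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        if line_number > 3:
            break
        print(json.dumps(json.loads(line), indent=2))
```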

Next Steps

Once your dataset is processed, you can:
  • Start Training: Use your processed dataset to fine-tune a Gemma model
  • Preview Results: Review sample conversations to ensure quality
  • Export Data: Download your processed dataset for external use
  • Create Variations: Process the same data with different settings
Ready to get started? Head to the Datasets page to begin processing your first dataset.