What You Can Do
Facet’s preprocessing service transforms your raw data into the format required for fine-tuning Gemma models. Whether you have text conversations, images, or structured data, we’ll help you prepare it for training.

Upload Your Data
Upload CSV, JSON, JSONL, Parquet, or Excel files directly to the platform
Use Hugging Face Datasets
Import datasets directly from Hugging Face Hub with just a repository name
Support Multiple Tasks
Prepare data for language modeling, preference tuning, and reasoning tasks
Handle Vision Data
Process images alongside text for multimodal fine-tuning
Supported Data Formats
Text-Only Datasets
- CSV files: Structured data with columns for prompts, responses, etc.
- JSON/JSONL: Conversational data or structured examples
- Parquet: Efficient columnar format for large datasets
- Excel: Spreadsheet data with multiple sheets
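As a quick illustration, a conversational JSONL dataset stores one JSON record per line. The field names below (`prompt`, `response`) are only examples, not a required Facet schema:

```python
import json

# One record per line in a JSONL file; field names are illustrative.
record = {
    "prompt": "What is Gemma?",
    "response": "Gemma is a family of open models from Google.",
}
line = json.dumps(record)

# Reading a line back yields the original structure.
parsed = json.loads(line)
```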
Multimodal Datasets
- Images: PNG, JPEG, or other common formats
- Text + Images: Combined content for vision-language tasks
- Base64 encoded: Images embedded directly in your data
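For base64-embedded images, the raw image bytes are encoded into a text field alongside the prompt. A minimal sketch, assuming hypothetical field names rather than Facet’s actual schema:

```python
import base64

# Stand-in for the raw bytes of a PNG file.
image_bytes = b"\x89PNG\r\n\x1a\n...fake image data..."

# Embed the image directly in the record as base64 text.
record = {
    "prompt": "Describe this image.",
    "image": base64.b64encode(image_bytes).decode("ascii"),
}

# Decoding recovers the original bytes for preprocessing.
decoded = base64.b64decode(record["image"])
```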
Hugging Face Integration
- Public datasets: Access thousands of pre-existing datasets
- Private repositories: Use your own HF datasets with authentication
- Subset selection: Choose specific splits or subsets of larger datasets
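To make the import options concrete, here is a hypothetical helper that assembles the pieces an HF import needs; with the Hugging Face `datasets` library these map onto `load_dataset(path, name=..., split=..., token=...)`, where split slicing like `"train[:1000]"` selects a subset of a larger dataset:

```python
# Hypothetical helper, not Facet's actual API: collects the arguments
# needed to import a Hugging Face dataset.
def hf_import_args(repo_id, split="train", config=None, token=None):
    args = {"path": repo_id, "split": split}
    if config:
        args["name"] = config  # dataset subset/config name
    if token:
        args["token"] = token  # required for private repositories
    return args

# Split slicing selects only part of a large dataset.
args = hf_import_args("microsoft/DialoGPT-medium", split="train[:1000]")
```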
Processing Modes
Facet supports different processing modes depending on your fine-tuning goals. The conversational mode, best for general conversation, instruction following, and text generation, converts your data into a conversational format with `system`, `user`, and `assistant` roles. This is the most common mode for general-purpose fine-tuning tasks.
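A sketch of that conversion, assuming a flat prompt/response row (the helper and field names are illustrative, not Facet’s API):

```python
def to_conversation(row, system_prompt=None):
    """Turn a flat row into system/user/assistant messages (illustrative helper)."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": row["prompt"]})
    messages.append({"role": "assistant", "content": row["response"]})
    return {"messages": messages}

example = to_conversation(
    {"prompt": "Summarize photosynthesis.",
     "response": "Plants convert light into chemical energy."},
    system_prompt="You are a concise tutor.",
)
```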
Getting Started
1. Choose Your Data Source
   Upload a file from your computer or import from Hugging Face Hub. For Hugging Face datasets, you’ll need the repository name (e.g., “microsoft/DialoGPT-medium”).
2. Configure Field Mappings
   Map your data columns to the appropriate conversation roles. For example:
   - `question` → `user` role
   - `answer` → `assistant` role
   - `instruction` → `system` role
3. Select Processing Mode
   Choose the mode that matches your fine-tuning objective. The platform will automatically detect if you’re using images and enable vision processing.
4. Review and Process
   Preview your processed data and start the preprocessing job. You can always reprocess your dataset with different settings if needed.
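The column-to-role mapping in step 2 can be sketched as a simple dictionary applied per row; this is a minimal illustration of the idea, not Facet’s internal logic:

```python
# Which source column feeds which conversation role (example mapping).
FIELD_MAP = {"instruction": "system", "question": "user", "answer": "assistant"}
ROLE_ORDER = {"system": 0, "user": 1, "assistant": 2}

def map_row(row, field_map=FIELD_MAP):
    """Apply the mapping and order messages system -> user -> assistant."""
    messages = [
        {"role": role, "content": row[col]}
        for col, role in field_map.items()
        if col in row  # columns absent from a row are simply skipped
    ]
    messages.sort(key=lambda m: ROLE_ORDER[m["role"]])
    return {"messages": messages}

mapped = map_row({"question": "What is 2 + 2?", "answer": "4"})
```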
Next Steps
Once your dataset is processed, you can:
- Start Training: Use your processed dataset to fine-tune a Gemma model
- Preview Results: Review sample conversations to ensure quality
- Export Data: Download your processed dataset for external use
- Create Variations: Process the same data with different settings