POST /datasets/process
Process Dataset
curl --request POST \
  --url https://preprocessing-service-433968519479.us-central1.run.app/datasets/process \
  --header 'Content-Type: application/json' \
  --data '{
  "dataset_name": "<string>",
  "dataset_source": "upload",
  "dataset_id": "<string>",
  "dataset_subset": "default",
  "processing_mode": "language_modeling",
  "config": {
    "field_mappings": {},
    "normalize_whitespace": true,
    "augmentation_config": {
      "augmentation_factor": 1.5,
      "use_eda": true,
      "use_back_translation": true,
      "use_paraphrasing": true,
      "use_synthesis": true,
      "gemini_api_key": "<string>",
      "synthesis_ratio": 123,
      "custom_prompt": "<string>"
    },
    "split_config": {
      "type": "<any>",
      "train_split": "<string>",
      "test_split": "<string>"
    }
  }
}'
{
  "dataset_name": "<string>",
  "dataset_subset": "<string>",
  "dataset_source": "upload",
  "modality": "text",
  "dataset_id": "<string>",
  "processed_dataset_id": "<string>",
  "num_examples": 123,
  "created_at": "<string>",
  "splits": [
    "<string>"
  ],
  "full_splits": []
}
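
For readers calling the endpoint from Python rather than curl, the request above could be built roughly as follows. This is a minimal sketch using only the standard library. The helper name `build_process_request` and the placeholder payload values are hypothetical, and it is an assumption that the service accepts a `config` containing only `field_mappings` and `normalize_whitespace`. The request is constructed but not sent.

```python
import json
import urllib.request

# Endpoint copied from the curl example above.
PROCESS_URL = (
    "https://preprocessing-service-433968519479.us-central1.run.app"
    "/datasets/process"
)


def build_process_request(payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request for /datasets/process."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        PROCESS_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values (hypothetical); replace them with real dataset details.
example_payload = {
    "dataset_name": "my-dataset",
    "dataset_source": "upload",
    "dataset_id": "my-dataset-id",
    "dataset_subset": "default",
    "processing_mode": "language_modeling",
    # Assumption: a config with only these two keys is accepted.
    "config": {"field_mappings": {}, "normalize_whitespace": True},
}

request = build_process_request(example_payload)

# To actually send the request, uncomment:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Keeping request construction separate from sending makes it easy to inspect the method, headers, and body before any network call is made.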

Body

application/json
dataset_name · string · required

dataset_source · enum<string> · required
Available options: upload, huggingface

dataset_id · string · required

processing_mode · enum<string> · required
Specifies the preprocessing mode to format the dataset for a specific fine-tuning task.
Available options: language_modeling, prompt_only, preference

config · object · required

dataset_subset · string · default: default
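
The body fields listed above can be checked on the client before a request is sent. The sketch below encodes only what this page states: the five required fields, the two enums, and the `default` value for `dataset_subset`. The function names `validate_process_body` and `with_defaults` are hypothetical, and the service may enforce further rules (for example inside `config`) that are not covered here.

```python
# Required fields and enum values as listed in the Body section.
REQUIRED_FIELDS = (
    "dataset_name",
    "dataset_source",
    "dataset_id",
    "processing_mode",
    "config",
)
DATASET_SOURCES = ("upload", "huggingface")
PROCESSING_MODES = ("language_modeling", "prompt_only", "preference")


def validate_process_body(body: dict) -> list:
    """Return a list of problems found in a request body (empty if none)."""
    problems = []
    for name in REQUIRED_FIELDS:
        if name not in body:
            problems.append(f"missing required field: {name}")

    source = body.get("dataset_source")
    if source is not None and source not in DATASET_SOURCES:
        problems.append(f"dataset_source must be one of {DATASET_SOURCES}")

    mode = body.get("processing_mode")
    if mode is not None and mode not in PROCESSING_MODES:
        problems.append(f"processing_mode must be one of {PROCESSING_MODES}")

    if "config" in body and not isinstance(body["config"], dict):
        problems.append("config must be an object")
    return problems


def with_defaults(body: dict) -> dict:
    """Return a copy of the body with dataset_subset defaulted to 'default'."""
    return {"dataset_subset": "default", **body}


# Placeholder values (hypothetical).
body = with_defaults(
    {
        "dataset_name": "my-dataset",
        "dataset_source": "huggingface",
        "dataset_id": "my-dataset-id",
        "processing_mode": "preference",
        "config": {},
    }
)
problems = validate_process_body(body)
```

Returning a list of problems rather than raising on the first one lets a caller report every issue with the body at once.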

Response

Successful Response

dataset_name · string · required

dataset_subset · string · required

dataset_source · enum<string> · required
Available options: upload, huggingface

dataset_id · string · required

processed_dataset_id · string · required

num_examples · integer · required

created_at · string · required

splits · string[] · required

modality · enum<string> · default: text
Available options: text, vision

full_splits · object[]
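
A client might map the response fields above onto a typed structure. The following is a sketch only: the class name `ProcessedDataset` and the function `parse_process_response` are hypothetical, and treating `full_splits` as optional with an empty-list default is an assumption, since this page does not mark it as required or give a default. The sample JSON reuses the placeholder values from the example response.

```python
import json
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class ProcessedDataset:
    # Fields marked "required" in the Response section.
    dataset_name: str
    dataset_subset: str
    dataset_source: str
    dataset_id: str
    processed_dataset_id: str
    num_examples: int
    created_at: str
    splits: List[str]
    # Documented default is "text".
    modality: str = "text"
    # Assumption: optional, defaulting to an empty list.
    full_splits: List[dict] = field(default_factory=list)


def parse_process_response(raw: str) -> ProcessedDataset:
    """Parse a JSON response body, ignoring keys not listed on this page."""
    data = json.loads(raw)
    known = {f.name for f in fields(ProcessedDataset)}
    # A missing required field raises TypeError from the dataclass constructor.
    return ProcessedDataset(**{k: v for k, v in data.items() if k in known})


# Placeholder response mirroring the example on this page.
sample = """{
  "dataset_name": "<string>",
  "dataset_subset": "<string>",
  "dataset_source": "upload",
  "dataset_id": "<string>",
  "processed_dataset_id": "<string>",
  "num_examples": 123,
  "created_at": "<string>",
  "splits": ["<string>"]
}"""

parsed = parse_process_response(sample)
```

Because `modality` and `full_splits` are omitted from the sample, the parsed object falls back to the defaults declared on the dataclass.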