Labeled Data
In traditional machine learning workflows, there’s one guaranteed bottleneck: labeled data. We can’t fine-tune a retriever without it, and we can’t test domain-specific relevance until we collect enough of it. What used to take months — hiring annotators, defining taxonomies, manually labeling edge cases — is now being compressed into days using Large Language Models (LLMs).
This isn’t just a speed hack. It’s a shift in the entire prototyping mindset.
From Scarcity to Simulation
Retrieval-Augmented Generation (RAG) helps LLMs stay grounded by supplying external documents at inference time. But a RAG pipeline is only as good as its retriever, and building a solid retriever has typically meant curating thousands of (query, document) pairs by hand.
Enter synthetic query generation.
Instead of waiting for data to exist, we generate it. Instead of assuming relevance is static, we shape it to match user intent. In low-resource domains (legal, tax, multilingual), this approach isn’t optional; it’s essential.
🛠️ Code Demo: Synthesizing (query, document) Pairs for RAG
Here’s a minimal Python example that demonstrates how to use OpenAI’s API (via the legacy, pre-1.0 Python SDK) to synthesize training data for a domain-specific retriever.
This version targets argument retrieval, but can be adapted to other tasks like FAQ generation, legal lookup, or medical triage.
import openai
import json

# NOTE: this example uses the legacy (pre-1.0) OpenAI Python SDK interface.
openai.api_key = "your-openai-api-key"  # in practice, load this from an environment variable

# Example arguments (from your domain corpus)
documents = [
    "Even if a fine is proportional to income, it does not account for wealth disparities or dependents.",
    "Universal basic income could reduce poverty, but it might disincentivize work and strain public budgets.",
    "Banning fossil fuels overnight would cause economic instability and penalize low-income communities first."
]

# Few-shot prompt examples (manually labeled)
few_shot_examples = [
    {
        "argument": "Proportional fines ensure fairness regardless of income.",
        "query": "What's a counter-argument to income-based fines being fair?"
    },
    {
        "argument": "Basic income is necessary to future-proof automation impacts.",
        "query": "Give a counterpoint to why UBI helps with automation."
    }
]

# Instruction prompt, followed by the few-shot examples
base_prompt = (
    "Task: Generate a concise counter-argument query for the given argument.\n"
)
for example in few_shot_examples:
    base_prompt += f"Argument: {example['argument']}\n"
    base_prompt += f"Counter-Argument Query: {example['query']}\n"

# Generate a query for each document
synthetic_pairs = []
for doc in documents:
    prompt = base_prompt + f"Argument: {doc}\nCounter-Argument Query:"
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=40,
        temperature=0.7
    )
    query = response['choices'][0]['message']['content'].strip()
    synthetic_pairs.append({
        "query": query,
        "document": doc
    })

# Output the dataset as JSON Lines
with open("synthetic_dataset.jsonl", "w") as f:
    for pair in synthetic_pairs:
        f.write(json.dumps(pair) + "\n")

print("✅ Synthetic dataset created with", len(synthetic_pairs), "pairs.")
What This Code Teaches
- Prompt Engineering: You’re guiding the model with task-specific examples. The better your few-shot examples, the more usable your generated data.
- Scaling: This approach can be wrapped in batches, parallelized, or run through multiple models to increase diversity (see the parallelization sketch after this list).
- Cost-Efficiency: Generating 50k pairs via API calls typically costs far less than hiring annotators for the same volume, and it is much faster to iterate on.
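As a rough sketch of the scaling point above, the generation loop can be parallelized with a thread pool. The generate_query helper and the worker count below are illustrative choices, and the same documents, base_prompt, and legacy SDK setup from the main script are assumed.

from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 8  # illustrative; tune to your rate limits

def generate_query(doc):
    # Same prompt construction and API call as in the main loop above
    prompt = base_prompt + f"Argument: {doc}\nCounter-Argument Query:"
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=40,
        temperature=0.7
    )
    return {"query": response['choices'][0]['message']['content'].strip(), "document": doc}

with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    synthetic_pairs = list(pool.map(generate_query, documents))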
Final Thought: Your Retriever Is Only As Good As Its Queries
By using synthetic data to pre-train your retriever, you build alignment between what your users will ask and what your model will return. In the early stages of building intelligent features, your job isn’t to “gather more data” — it’s to simulate signal.
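As one concrete illustration (not prescribed by anything above), here is a minimal sketch of fine-tuning a bi-encoder retriever on synthetic_dataset.jsonl with the sentence-transformers library; the base model, batch size, and epoch count are placeholder choices.

# Sketch: fine-tune a bi-encoder retriever on the synthetic (query, document) pairs.
# Assumes the sentence-transformers library; model name, batch size, and epochs
# below are illustrative defaults, not recommendations.
import json
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

train_examples = []
with open("synthetic_dataset.jsonl") as f:
    for line in f:
        pair = json.loads(line)
        train_examples.append(InputExample(texts=[pair["query"], pair["document"]]))

model = SentenceTransformer("all-MiniLM-L6-v2")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)  # treats other in-batch docs as negatives

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("synthetic-tuned-retriever")

The tuned encoder can then embed both your corpus and incoming user queries for dense retrieval.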
In RAG workflows, you’re not just querying documents — you’re training the questions.