Retrieval-Augmented Generation, or RAG, is now the standard architecture for building LLM applications that need access to your own data.
The idea is simple: instead of hoping the language model knows the answer, you fetch relevant information from your database and include it in the prompt. The model reads this content and generates an answer based on what it finds.
Here is how it works: you take your data, break it into smaller pieces called chunks, and convert each chunk into a vector embedding—a list of numbers that represents what the text means. These vectors get stored in a database. When a user asks a question, you convert their question into a vector too, search for chunks with similar vectors, and pass those chunks to the LLM.
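Under the hood, "similar vectors" usually means cosine similarity. A minimal sketch in plain JavaScript (the toy three-dimensional vectors below are invented for illustration; real embeddings have hundreds or thousands of dimensions):

```javascript
// Cosine similarity: 1.0 = same direction (same meaning), near 0 = unrelated.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy "embeddings" for a query and two chunks (values are made up)
const queryVec = [0.9, 0.1, 0.3];
const chunkA = [0.8, 0.2, 0.4]; // similar topic to the query
const chunkB = [0.1, 0.9, 0.1]; // different topic

console.log(cosineSimilarity(queryVec, chunkA) > cosineSimilarity(queryVec, chunkB)); // true
```

The retrieval step is simply: embed the query, compute this score against every stored vector (or an approximation of that, at scale), and return the highest-scoring chunks.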
This approach solves important problems. Language models have outdated knowledge. They have never seen your internal data. Fine-tuning them is expensive and slow. RAG lets you connect any data source to any model without retraining.
But most teams make the same mistake when building these systems.
They spend weeks perfecting their embeddings and chunking logic. They benchmark different models and tweak similarity thresholds. What they forget is that vector search only answers one question: how similar does this text sound to the query? It has no understanding of time, categories, permissions, or business rules. It cannot distinguish between what is semantically relevant and what is actually correct for the user's context.
In other words, similarity is not the same as relevance. Your retrieval system might return content that sounds right but is completely wrong for the situation.
What this guide covers:
There are several mistakes teams make when building RAG systems. This guide focuses on the most common and most damaging one: optimising embeddings while ignoring metadata filtering. We will explore why this happens, why it breaks production systems, and how to fix it with a staged hybrid filtering approach—complete with working Node.js code you can adapt for your own projects.
Let us start by understanding what RAG actually is and why it exists.
Understanding RAG: The Bridge Between Your Data and LLMs
What is Retrieval-Augmented Generation?
RAG combines two systems: a retrieval system that finds relevant information and a generation system that produces answers. First, you search your knowledge base to find content that might help answer the question. Then, you hand that content to an LLM and ask it to generate a response.
The retrieval phase acts as a filter. Instead of asking the LLM to remember everything, you give it exactly what it needs for the current question. The generation phase takes this information and turns it into a clear, readable answer. When retrieval works well, even a simple language model can produce accurate results.
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ │ │ │ │ │
│ Your Data │───▶│ Chunking & │───▶│ Vector │
│ │ │ Embedding │ │ Database │
│ │ │ │ │ │
└──────────────┘ └──────────────┘ └──────────────┘
│
▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ │ │ │ │ │
│ Answer │◀───│ LLM │◀───│ Retrieved │
│ │ │ │ │ Chunks │
│ │ │ │ │ │
└──────────────┘ └──────────────┘ └──────────────┘
▲
│
┌──────────────┐
│ │
│ User Query │
│ │
└──────────────┘
Diagram: RAG Pipeline Overview
Why RAG Exists: The Problem It Solves
Language models have a fundamental limitation: their knowledge stops at the point they were trained. They cannot access information that came after their training cutoff. They also have no awareness of private or proprietary information that was never part of their training data.
Fine-tuning seems like an obvious solution, but it is expensive, slow, and needs to be repeated every time your data changes. For use cases where information updates frequently, fine-tuning is not practical.
RAG keeps the knowledge separate from the model. Your data lives in a vector database that you can update anytime. New information becomes searchable within minutes, not weeks. The LLM stays unchanged—it just receives different context for each query.
The Core Components of a RAG Pipeline
A RAG system has five core components. Ingestion is where your data enters the system. Chunking breaks that data into smaller, searchable units. Embedding converts each chunk into a vector—a numerical representation of its meaning. Storage holds these vectors in a database optimised for similarity search. Retrieval happens at runtime: the query gets embedded, compared against stored vectors, and matching chunks are returned.
Each component involves decisions that affect the final output. But one decision matters more than most teams realise: how you filter what gets searched in the first place.
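The five components map onto a small pipeline. Here is an illustrative skeleton in plain JavaScript; the in-memory array and the word-length "embedding" are stand-ins for a real vector database and embedding model, so treat every name here as hypothetical:

```javascript
// Storage: an in-memory stand-in for a vector database
const store = [];

// Embedding stub: real systems call a model such as text-embedding-3-small
const embed = (text) => {
  const vec = new Array(8).fill(0);
  for (const word of text.toLowerCase().split(/\s+/)) {
    vec[word.length % 8] += 1;
  }
  return vec;
};

// Ingestion + Chunking + Embedding + Storage
function ingest(docs) {
  for (const doc of docs) {
    for (const chunk of doc.split(/\n\n+/)) { // chunk on blank lines
      store.push({ chunk, embedding: embed(chunk) });
    }
  }
}

// Retrieval: embed the query, rank stored chunks by distance
function retrieve(query, topK = 2) {
  const q = embed(query);
  const dist = (a, b) => a.reduce((s, x, i) => s + (x - b[i]) ** 2, 0);
  return store
    .map(({ chunk, embedding }) => ({ chunk, score: -dist(q, embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map(r => r.chunk);
}

ingest(['Cloud costs rose 15% in Q4.\n\nCompute was the main driver.']);
console.log(retrieve('cloud costs'));
```

Every production decision in this guide is a refinement of one of these five functions.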
What Challenges RAG Systems Are Designed to Solve
The Knowledge Boundary Problem
Every LLM has a knowledge cutoff date. Ask it about events that happened after that date and it will either admit it does not know or make something up. RAG solves this by maintaining a separate knowledge base that you control and update independently of the model.
The Hallucination Problem
Without grounding, LLMs generate confident-sounding answers that are partially or completely wrong. They fill gaps with plausible guesses. RAG reduces this by providing source material. The model answers based on actual content, and you can trace every response back to its source.
The Scale and Freshness Problem
Organisations deal with large volumes of information that change constantly. New reports, updated policies, added records. RAG handles this naturally. Adding or updating information means inserting or replacing vectors in your database. Your knowledge base stays current without touching the model.
The Common Mistakes People Make When Building RAG Systems
Before we dive into the solution, let us look at the five most common mistakes teams make when building RAG systems. We will cover all five briefly, then spend the rest of this guide solving Mistake #1 in depth—because it is the most widespread and causes the most damage in production.
| Mistake | What Goes Wrong | Impact Level |
|---|---|---|
| #1 Ignoring metadata filtering | Wrong content gets retrieved despite good embeddings | 🔴 Critical |
| #2 Poor chunking strategy | LLM lacks context or gets diluted information | 🟠 High |
| #3 Treating all content as equal | Drafts compete with verified documents | 🟠 High |
| #4 Retrieval-generation mismatch | Retrieved content is unhelpful for the LLM | 🟡 Medium |
| #5 No observability | Cannot diagnose why wrong answers happen | 🟡 Medium |
Mistake #1: Optimising Embeddings While Ignoring Metadata Filtering
This is the most common and costly mistake in production RAG systems. Teams spend weeks fine-tuning embedding models and perfecting chunking strategies while ignoring a fundamental truth: vector search only measures semantic similarity.
Consider a real scenario. You have a RAG system with 1 million records spanning multiple years. A user asks: "What were our Q4 2025 cloud infrastructure costs?"
Your embedding model converts this query into a vector and searches for similar content. The problem? Reports about "cloud infrastructure costs" from 2019 through 2024 are all semantically similar to this query. The vector search has no concept of time. It cannot distinguish between a 2019 report and a 2025 report—both talk about the same topic.
What happens without metadata filtering:
| Problem | Consequence |
|---|---|
| System searches all 1 million records | Wasted compute resources |
| Results include wrong years | Incorrect information retrieved |
| 75% of context window wasted | Relevant content gets pushed out |
| Latency increases by 30ms+ | Slower response times |
| LLM generates answer using outdated figures | User gets wrong answer |
The mental model you need:
┌─────────────────────────────────────────────────────────────┐
│ │
│ Vector Search → "How similar is this content?" │
│ │
│ Metadata Filter → "Is this content allowed to │
│ compete in the first place?" │
│ │
└─────────────────────────────────────────────────────────────┘
Both are necessary. Neither is sufficient alone.

We will solve this mistake in detail in the sections that follow.
This guide covers Mistake #1 because it is foundational. If you retrieve the wrong content, nothing else matters. Your chunking can be perfect. Your LLM can be state-of-the-art. But if your system returns 2019 data for a 2025 question, your users get wrong answers.
Fixing this mistake requires understanding the staged hybrid filtering approach. Let us dive in.
Deep Dive: The Staged Hybrid Filtering Approach
The solution to Mistake #1 is not choosing between vector search and metadata filtering. It is combining them in a staged pipeline.
Understanding the Three-Stage Architecture
The staged hybrid approach processes queries in three phases:
| Stage | Operation | Purpose | Example |
|---|---|---|---|
| Stage 1 | Pre-filter | Eliminate impossible candidates | Remove all documents not from 2025 |
| Stage 2 | Vector search | Semantic ranking | Find most similar content in filtered set |
| Stage 3 | Post-filter | Lightweight refinement | Keep only verified documents |
This architecture balances speed, accuracy, and flexibility.
┌─────────────────────────────┐
│ │
│ 1,000,000 Records │
│ (Full Corpus) │
│ │
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ STAGE 1: Pre-Filter │
│ • year = 2025 │
│ • quarter = Q4 │
│ • department = finance │
│ +6ms latency │
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ │
│ 100,000 Records │
│ (Filtered Candidates) │
│ │
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ STAGE 2: Vector Search │
│ • Semantic similarity │
│ • HNSW / IVF algorithm │
│ • Top 50 results │
│ ~12ms latency │
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ │
│ 50 Records │
│ (Semantically Similar) │
│ │
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ STAGE 3: Post-Filter │
│ • author_verified = true │
│ • word_count > 20 │
│ • required tags │
│ +2ms latency │
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ │
│ 10 Records │
│ (Final Results) │
│ │
└─────────────────────────────┘
Total: ~20ms for 97% recall with correct category
Diagram: Staged Hybrid Filtering Funnel
Stage 1: Pre-filtering (Indexed, Selective)
Pre-filtering happens before vector search. You use indexed metadata fields to eliminate documents that cannot possibly be correct answers.
Good candidates for pre-filtering:
Date ranges (year, quarter, month)
Categorical fields (department, product, region)
Access control (user permissions, security level)
Why pre-filter works:
Uses traditional database indexes (B-trees, hash indexes)
Sub-millisecond filtering on properly indexed fields
Dramatically reduces the search space
// Example: Pre-filter using Supabase RPC function
const { data, error } = await supabase.rpc('match_documents', {
query_embedding: queryEmbedding,
match_count: 10,
filter_year: 2025,
filter_quarter: 'Q4',
filter_departments: ['finance', 'infrastructure'],
filter_max_access_level: userAccessLevel
});
When to use pre-filtering:
Pre-filter when selectivity is below 10%: if your filter eliminates more than 90% of documents, always pre-filter.
Stage 2: Approximate Nearest Neighbour (ANN) Vector Search
After pre-filtering reduces your search space, vector search ranks the remaining content by semantic similarity.
Modern vector databases use ANN algorithms like HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index). These trade perfect recall for speed—instead of comparing against every vector, they search a subset likely to contain the nearest neighbours.
// Vector search on pre-filtered subset using Supabase
async function vectorSearch(queryEmbedding, queryContext, topK = 10) {
const { data, error } = await supabase.rpc('match_documents', {
query_embedding: queryEmbedding,
match_count: topK,
filter_year: queryContext.year,
filter_quarter: queryContext.quarter,
filter_max_access_level: queryContext.maxAccessLevel
});
if (error) {
throw new Error(`Search failed: ${error.message}`);
}
return data;
}
Key parameters to tune:
ef_search (HNSW) or nprobe (IVF): higher values improve recall but increase latency
topK: the number of results to return for the next stage
Stage 3: Post-filtering (Non-indexed, Lightweight)
Post-filtering applies additional criteria after vector search returns results. Use this for:
Attributes that are expensive to index
High-cardinality fields (thousands of unique values)
Frequently changing metadata
Complex business rules
// Post-filter on vector search results
function postFilter(results, criteria) {
return results.filter(result => {
const meta = result.metadata;
// Check author verification
if (criteria.requireVerified && !meta.author_verified) {
return false;
}
// Check minimum word count
if (criteria.minWordCount && meta.word_count < criteria.minWordCount) {
return false;
}
// Check required tags
if (criteria.requiredTags) {
const hasAllTags = criteria.requiredTags.every(tag =>
meta.tags?.includes(tag)
);
if (!hasAllTags) return false;
}
return true;
});
}
When to use post-filtering:
Post-filter when selectivity is greater than 50%. If the filter only removes a small portion of results, post-filtering is more efficient.
The Hybrid Decision Framework:
┌─────────────────────┐
│ What % of records │
│ does this filter │
│ eliminate? │
└──────────┬──────────┘
│
┌─────────────────────┼─────────────────────┐
│ │ │
▼ ▼ ▼
┌────────────┐ ┌────────────┐ ┌────────────┐
│ │ │ │ │ │
│ > 90% │ │ 50 - 90% │ │ < 50% │
│ │ │ │ │ │
└─────┬──────┘ └─────┬──────┘ └─────┬──────┘
│ │ │
▼ ▼ ▼
┌────────────┐ ┌────────────┐ ┌────────────┐
│ │ │ │ │ │
│ PRE-FILTER │ │ STAGED │ │ POST-FILTER│
│ │ │ HYBRID │ │ ONLY │
│ │ │ │ │ │
└────────────┘ └────────────┘ └────────────┘
│ │ │
▼ ▼ ▼
┌────────────┐ ┌────────────┐ ┌────────────┐
│ Best for: │ │ Best for: │ │ Best for: │
│ • Dates │ │ • Combined │ │ • Tags │
│ • Category │ │ criteria │ │ • Verified │
│ • Access │ │ • Flexible │ │ • Complex │
│ level │ │ rules │ │ rules │
└────────────┘ └────────────┘ └────────────┘
Diagram: Filter Strategy Decision Tree
Use this framework to decide where each filter belongs:
| Selectivity | Strategy | Reason |
|---|---|---|
| Less than 10% | Pre-filter | Eliminates most documents, saves vector search time |
| 10% to 50% | Staged hybrid | Combine pre-filter and post-filter |
| Greater than 50% | Post-filter only | Not selective enough to justify index overhead |
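This decision table can be encoded directly. A small sketch, with selectivity expressed as the fraction of records a filter keeps (the thresholds are the ones from the table; the function name is illustrative):

```javascript
// Choose a filter strategy from selectivity: the fraction of records
// the filter KEEPS (0.08 means it keeps 8% and eliminates 92%).
function chooseFilterStrategy(selectivity) {
  if (selectivity < 0 || selectivity > 1) {
    throw new Error('selectivity must be a fraction between 0 and 1');
  }
  if (selectivity < 0.10) return 'pre-filter';     // eliminates > 90%
  if (selectivity <= 0.50) return 'staged-hybrid'; // eliminates 50-90%
  return 'post-filter';                            // eliminates < 50%
}

// Estimate selectivity from counts, e.g. 100,000 matching of 1,000,000 total
const selectivity = 100_000 / 1_000_000;
console.log(chooseFilterStrategy(0.05));        // 'pre-filter'
console.log(chooseFilterStrategy(selectivity)); // 'staged-hybrid'
console.log(chooseFilterStrategy(0.8));         // 'post-filter'
```

In practice you estimate selectivity from a cheap `count` query, or from known corpus statistics (e.g. roughly one year of a seven-year archive matches a single-year filter).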
Implementation Walkthrough (Node.js)
Let us build a complete staged hybrid filtering pipeline.
Project Setup and Dependencies
mkdir rag-hybrid-filter
cd rag-hybrid-filter
npm init -y
npm install @supabase/supabase-js openai dotenv
Create your environment file:
# .env
SUPABASE_URL=your_supabase_project_url
SUPABASE_ANON_KEY=your_supabase_anon_key
OPENAI_API_KEY=your_openai_api_key
Database Setup (Run in Supabase SQL Editor)
Before using the code, you need to set up your Supabase database with the pgvector extension and create the necessary table and functions.
-- Enable the pgvector extension
create extension if not exists vector;
-- Create the documents table with metadata columns
create table documents (
id text primary key,
content text not null,
embedding vector(1536),
-- Indexed fields (for pre-filtering)
year integer not null,
quarter text not null,
department text not null,
document_type text not null,
access_level integer not null,
-- Non-indexed fields (for post-filtering)
author text,
author_verified boolean default false,
tags text[],
word_count integer,
created_at timestamptz default now(),
source_url text
);
-- Create indexes for pre-filter fields
create index idx_documents_year on documents(year);
create index idx_documents_quarter on documents(quarter);
create index idx_documents_department on documents(department);
create index idx_documents_access_level on documents(access_level);
-- Create index for vector similarity search
create index idx_documents_embedding on documents
using ivfflat (embedding vector_cosine_ops)
with (lists = 100);
-- Function for vector search with pre-filtering
create or replace function match_documents(
query_embedding vector(1536),
match_count int default 10,
filter_year int default null,
filter_quarter text default null,
filter_department text default null,
filter_departments text[] default null,
filter_max_access_level int default null
)
returns table (
id text,
content text,
year int,
quarter text,
department text,
document_type text,
access_level int,
author text,
author_verified boolean,
tags text[],
word_count int,
created_at timestamptz,
source_url text,
similarity float
)
language plpgsql
as $$
begin
return query
select
d.id,
d.content,
d.year,
d.quarter,
d.department,
d.document_type,
d.access_level,
d.author,
d.author_verified,
d.tags,
d.word_count,
d.created_at,
d.source_url,
1 - (d.embedding <=> query_embedding) as similarity
from documents d
where
(filter_year is null or d.year = filter_year)
and (filter_quarter is null or d.quarter = filter_quarter)
and (filter_department is null or d.department = filter_department)
and (filter_departments is null or d.department = any(filter_departments))
and (filter_max_access_level is null or d.access_level <= filter_max_access_level)
order by d.embedding <=> query_embedding
limit match_count;
end;
$$;
Defining Your Metadata Schema
Design your metadata schema before ingesting any data. Think about what filters you will need at query time.
// schemas/document-metadata.js
const documentMetadataSchema = {
// Indexed fields (for pre-filtering)
year: 'number', // 2019, 2020, ..., 2025
quarter: 'string', // Q1, Q2, Q3, Q4
department: 'string', // finance, engineering, marketing
document_type: 'string', // report, memo, policy, guide
access_level: 'number', // 1 = public, 2 = internal, 3 = confidential
// Non-indexed fields (for post-filtering)
author: 'string',
author_verified: 'boolean',
tags: 'string[]',
word_count: 'number',
created_at: 'string', // ISO date string
source_url: 'string'
};
// Validate metadata before ingestion
function validateMetadata(metadata) {
const errors = [];
if (!metadata.year || metadata.year < 2000 || metadata.year > 2030) {
errors.push('Invalid year');
}
if (!['Q1', 'Q2', 'Q3', 'Q4'].includes(metadata.quarter)) {
errors.push('Invalid quarter');
}
if (!metadata.department) {
errors.push('Department is required');
}
if (typeof metadata.access_level !== 'number' || metadata.access_level < 1 || metadata.access_level > 3) {
errors.push('Invalid access level');
}
return { valid: errors.length === 0, errors };
}
module.exports = { documentMetadataSchema, validateMetadata };
Ingesting Documents with Rich Metadata
// ingest.js
require('dotenv').config();
const { createClient } = require('@supabase/supabase-js');
const OpenAI = require('openai');
const { validateMetadata } = require('./schemas/document-metadata');
const supabase = createClient(
process.env.SUPABASE_URL,
process.env.SUPABASE_ANON_KEY
);
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
async function getEmbedding(text) {
const response = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: text
});
return response.data[0].embedding;
}
function chunkDocument(text, maxChunkSize = 500) {
const paragraphs = text.split(/\n\n+/);
const chunks = [];
let currentChunk = '';
for (const paragraph of paragraphs) {
if (currentChunk.length + paragraph.length > maxChunkSize && currentChunk) {
chunks.push(currentChunk.trim());
currentChunk = '';
}
currentChunk += paragraph + '\n\n';
}
if (currentChunk.trim()) {
chunks.push(currentChunk.trim());
}
return chunks;
}
async function ingestDocument(documentId, text, metadata) {
// Validate metadata
const validation = validateMetadata(metadata);
if (!validation.valid) {
throw new Error(`Invalid metadata: ${validation.errors.join(', ')}`);
}
// Chunk the document
const chunks = chunkDocument(text);
// Generate embeddings and prepare records
const records = await Promise.all(
chunks.map(async (chunk, index) => {
const embedding = await getEmbedding(chunk);
return {
id: `${documentId}_chunk_${index}`,
content: chunk,
embedding: embedding,
year: metadata.year,
quarter: metadata.quarter,
department: metadata.department,
document_type: metadata.document_type,
access_level: metadata.access_level,
author: metadata.author,
author_verified: metadata.author_verified,
tags: metadata.tags,
word_count: chunk.split(/\s+/).length,
created_at: metadata.created_at,
source_url: metadata.source_url
};
})
);
// Upsert to Supabase
const { data, error } = await supabase
.from('documents')
.upsert(records, { onConflict: 'id' });
if (error) {
throw new Error(`Supabase insert error: ${error.message}`);
}
console.log(`Ingested ${records.length} chunks for document ${documentId}`);
return records.length;
}
// Example usage
async function main() {
const sampleDocument = `
Cloud Infrastructure Cost Report - Q4 2025
Executive Summary
Total cloud spending for Q4 2025 reached $2.4 million, representing a 15% increase
from Q3 2025. The primary drivers were expanded compute capacity for the ML platform
and increased storage requirements for the data lake.
Breakdown by Service
Compute services accounted for 45% of total spend. Storage services represented 30%.
Networking and data transfer made up the remaining 25%.
Recommendations
We recommend implementing reserved instances for predictable workloads to reduce
compute costs by an estimated 20% in Q1 2026.
`;
const metadata = {
year: 2025,
quarter: 'Q4',
department: 'infrastructure',
document_type: 'report',
access_level: 2,
author: 'finance-team',
author_verified: true,
tags: ['cloud', 'costs', 'infrastructure', 'quarterly-report'],
created_at: '2026-01-15T10:00:00Z',
source_url: 'https://internal.company.com/reports/q4-2025-cloud-costs'
};
await ingestDocument('doc_q4_2025_cloud_costs', sampleDocument, metadata);
}
main().catch(console.error);
Building the Staged Query Pipeline
// query.js
require('dotenv').config();
const { createClient } = require('@supabase/supabase-js');
const OpenAI = require('openai');
const supabase = createClient(
process.env.SUPABASE_URL,
process.env.SUPABASE_ANON_KEY
);
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
async function getEmbedding(text) {
const response = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: text
});
return response.data[0].embedding;
}
// Stage 2: Vector search with pre-filter using Supabase RPC
async function vectorSearchWithPreFilter(queryEmbedding, queryContext, topK = 20) {
const { data, error } = await supabase.rpc('match_documents', {
query_embedding: queryEmbedding,
match_count: topK,
filter_year: queryContext.year || null,
filter_quarter: queryContext.quarter || null,
filter_department: queryContext.department || null,
filter_departments: queryContext.departments || null,
filter_max_access_level: queryContext.maxAccessLevel || null
});
if (error) {
throw new Error(`Supabase query error: ${error.message}`);
}
return data || [];
}
// Stage 3: Post-filter results
function applyPostFilter(results, postFilterCriteria) {
return results.filter(result => {
// Filter by author verification
if (postFilterCriteria.requireVerified && !result.author_verified) {
return false;
}
// Filter by minimum word count
if (postFilterCriteria.minWordCount && result.word_count < postFilterCriteria.minWordCount) {
return false;
}
// Filter by required tags (must have all)
if (postFilterCriteria.requiredTags) {
const docTags = result.tags || [];
const hasAllTags = postFilterCriteria.requiredTags.every(tag =>
docTags.includes(tag)
);
if (!hasAllTags) return false;
}
// Filter by excluded tags (must have none)
if (postFilterCriteria.excludedTags) {
const docTags = result.tags || [];
const hasExcludedTag = postFilterCriteria.excludedTags.some(tag =>
docTags.includes(tag)
);
if (hasExcludedTag) return false;
}
return true;
});
}
// Complete staged hybrid query
async function stagedHybridQuery(query, queryContext, postFilterCriteria = {}, finalTopK = 5) {
console.log('\n=== Staged Hybrid Query ===');
console.log(`Query: "${query}"`);
const startTime = Date.now();
// Log pre-filter context
console.log('\nStage 1 - Pre-filter context:', JSON.stringify(queryContext, null, 2));
// Get query embedding
const queryEmbedding = await getEmbedding(query);
// Stage 2: Vector search with pre-filter
const stage2Start = Date.now();
const vectorResults = await vectorSearchWithPreFilter(queryEmbedding, queryContext, 20);
const stage2Time = Date.now() - stage2Start;
console.log(`\nStage 2 - Vector search: ${vectorResults.length} results in ${stage2Time}ms`);
// Stage 3: Post-filter
const stage3Start = Date.now();
const filteredResults = applyPostFilter(vectorResults, postFilterCriteria);
const stage3Time = Date.now() - stage3Start;
console.log(`Stage 3 - Post-filter: ${filteredResults.length} results in ${stage3Time}ms`);
// Return top K final results
const finalResults = filteredResults.slice(0, finalTopK);
const totalTime = Date.now() - startTime;
console.log(`\nTotal query time: ${totalTime}ms`);
console.log(`Final results: ${finalResults.length}`);
return {
results: finalResults,
metadata: {
totalTime,
stage2Time,
stage3Time,
preFilterApplied: queryContext,
postFilterApplied: postFilterCriteria,
vectorResultCount: vectorResults.length,
postFilterResultCount: filteredResults.length
}
};
}
// Generate answer using retrieved context
async function generateAnswer(query, retrievedChunks) {
const context = retrievedChunks
.map((chunk, i) => `[${i + 1}] ${chunk.content}`)
.join('\n\n');
const response = await openai.chat.completions.create({
model: 'gpt-4o-mini',
messages: [
{
role: 'system',
content: `You are a helpful assistant that answers questions based on the provided context.
Only use information from the context. If the context does not contain
enough information to answer, say so clearly.`
},
{
role: 'user',
content: `Context:\n${context}\n\nQuestion: ${query}`
}
],
temperature: 0.1
});
return response.choices[0].message.content;
}
// Example usage
async function main() {
const query = 'What were our Q4 2025 cloud infrastructure costs?';
// Define query context for pre-filtering
const queryContext = {
year: 2025,
quarter: 'Q4',
departments: ['infrastructure', 'finance'],
maxAccessLevel: 2 // User's access level
};
// Define post-filter criteria
const postFilterCriteria = {
requireVerified: true,
minWordCount: 20
};
// Execute staged hybrid query
const { results, metadata } = await stagedHybridQuery(
query,
queryContext,
postFilterCriteria,
5
);
if (results.length === 0) {
console.log('\nNo results found matching criteria.');
return;
}
// Display retrieved chunks
console.log('\n=== Retrieved Chunks ===');
results.forEach((result, i) => {
console.log(`\n[${i + 1}] Similarity: ${result.similarity.toFixed(4)}`);
console.log(` Year: ${result.year}, Quarter: ${result.quarter}`);
console.log(` Department: ${result.department}`);
console.log(` Text: ${result.content.substring(0, 100)}...`);
});
// Generate answer
console.log('\n=== Generated Answer ===');
const answer = await generateAnswer(query, results);
console.log(answer);
}
main().catch(console.error);
Testing and Validating Your Filters
// test-filters.js
require('dotenv').config();
const { createClient } = require('@supabase/supabase-js');
const OpenAI = require('openai');
const supabase = createClient(
process.env.SUPABASE_URL,
process.env.SUPABASE_ANON_KEY
);
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
async function getEmbedding(text) {
const response = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: text
});
return response.data[0].embedding;
}
// Vector search without any filtering
async function vectorSearchNoFilter(queryEmbedding, topK = 10) {
const { data, error } = await supabase.rpc('match_documents', {
query_embedding: queryEmbedding,
match_count: topK
});
if (error) {
throw new Error(`Supabase query error: ${error.message}`);
}
return data || [];
}
// Vector search with pre-filter
async function vectorSearchWithPreFilter(queryEmbedding, preFilter, topK = 10) {
const { data, error } = await supabase.rpc('match_documents', {
query_embedding: queryEmbedding,
match_count: topK,
filter_year: preFilter.year || null,
filter_quarter: preFilter.quarter || null
});
if (error) {
throw new Error(`Supabase query error: ${error.message}`);
}
return data || [];
}
// Compare results with and without filtering
async function compareFilterStrategies(query, preFilter, postFilterCriteria) {
const queryEmbedding = await getEmbedding(query);
console.log(`\nQuery: "${query}"\n`);
console.log('='.repeat(60));
// Strategy 1: No filtering
const noFilterStart = Date.now();
const noFilterResults = await vectorSearchNoFilter(queryEmbedding, 10);
const noFilterTime = Date.now() - noFilterStart;
console.log('\n1. NO FILTER');
console.log(` Time: ${noFilterTime}ms`);
console.log(` Results: ${noFilterResults.length}`);
if (noFilterResults.length > 0) {
const years = noFilterResults.map(r => r.year);
console.log(` Years in results: ${[...new Set(years)].sort().join(', ')}`);
const correctYear = noFilterResults.filter(r => r.year === 2025).length;
console.log(` Correct year (2025): ${correctYear}/${noFilterResults.length}`);
}
// Strategy 2: Post-filter only
const postOnlyStart = Date.now();
const postOnlyVectorResults = await vectorSearchNoFilter(queryEmbedding, 50);
const postFiltered = postOnlyVectorResults.filter(result => {
if (preFilter.year && result.year !== preFilter.year) return false;
if (preFilter.quarter && result.quarter !== preFilter.quarter) return false;
if (postFilterCriteria.requireVerified && !result.author_verified) return false;
return true;
}).slice(0, 10);
const postOnlyTime = Date.now() - postOnlyStart;
console.log('\n2. POST-FILTER ONLY');
console.log(` Time: ${postOnlyTime}ms`);
console.log(` Results after filter: ${postFiltered.length}`);
  if (postFiltered.length > 0) {
    const years = postFiltered.map(r => r.year);
    console.log(` Years in results: ${[...new Set(years)].sort().join(', ')}`);
  }

  // Strategy 3: Pre-filter
  const preFilterStart = Date.now();
  const preFilterResults = await vectorSearchWithPreFilter(queryEmbedding, preFilter, 10);
  const preFilterTime = Date.now() - preFilterStart;
  console.log('\n3. PRE-FILTER');
  console.log(` Time: ${preFilterTime}ms`);
  console.log(` Results: ${preFilterResults.length}`);
  if (preFilterResults.length > 0) {
    const years = preFilterResults.map(r => r.year);
    console.log(` Years in results: ${[...new Set(years)].sort().join(', ')}`);
    const correctYear = preFilterResults.filter(r => r.year === 2025).length;
    console.log(` Correct year (2025): ${correctYear}/${preFilterResults.length}`);
  }

  // Strategy 4: Staged hybrid
  const hybridStart = Date.now();
  const hybridVectorResults = await vectorSearchWithPreFilter(queryEmbedding, preFilter, 20);
  const hybridFiltered = hybridVectorResults.filter(result => {
    if (postFilterCriteria.requireVerified && !result.author_verified) return false;
    if (postFilterCriteria.minWordCount && result.word_count < postFilterCriteria.minWordCount) return false;
    return true;
  }).slice(0, 10);
  const hybridTime = Date.now() - hybridStart;
  console.log('\n4. STAGED HYBRID (Pre + Post)');
  console.log(` Time: ${hybridTime}ms`);
  console.log(` Results: ${hybridFiltered.length}`);
  if (hybridFiltered.length > 0) {
    const years = hybridFiltered.map(r => r.year);
    console.log(` Years in results: ${[...new Set(years)].sort().join(', ')}`);
    console.log(` All verified: ${hybridFiltered.every(r => r.author_verified)}`);
  }

  // Summary
  console.log('\n' + '='.repeat(60));
  console.log('SUMMARY');
  console.log('='.repeat(60));
  console.log(`
| Strategy | Time | Correct Results |
|----------------|--------|-----------------|
| No filter | ${noFilterTime.toString().padStart(4)}ms | Unreliable |
| Post-filter | ${postOnlyTime.toString().padStart(4)}ms | ${postFiltered.length} results |
| Pre-filter | ${preFilterTime.toString().padStart(4)}ms | ${preFilterResults.length} results |
| Staged hybrid | ${hybridTime.toString().padStart(4)}ms | ${hybridFiltered.length} results |
`);
}

// Run comparison
async function main() {
  const query = 'What were our Q4 2025 cloud infrastructure costs?';
  const preFilter = {
    year: 2025,
    quarter: 'Q4'
  };
  const postFilterCriteria = {
    requireVerified: true,
    minWordCount: 20
  };

  await compareFilterStrategies(query, preFilter, postFilterCriteria);
}

main().catch(console.error);

Seeing the Difference: Wrong vs Right Output
Let us see what happens in practice when you query a RAG system with and without proper filtering.
Query: "What were our Q4 2025 cloud infrastructure costs?"
Without Metadata Filtering
Retrieved Content:
[1] Score: 0.92 - "Cloud Infrastructure Cost Report - Q4 2023...
Total spending reached $1.8 million..."
[2] Score: 0.89 - "Infrastructure Budget Review - Q3 2024...
Cloud costs trending upward..."
[3] Score: 0.87 - "Q4 2022 Cloud Expenses...
Compute services: $1.2 million..."
LLM Answer:
"Based on the available information, cloud infrastructure costs
were approximately $1.8 million in Q4."

What went wrong:
| Issue | Detail |
|---|---|
| Wrong year retrieved | 2023 data returned for a 2025 question |
| High similarity scores | Vector search found semantically similar content, regardless of year |
| Confident wrong answer | The LLM has no way to know the data is outdated |
| User impact | Business decisions made on 2-year-old data |
With Staged Hybrid Filtering
Pre-filter applied:
year: 2025
quarter: Q4
department: infrastructure OR finance
Retrieved Content:
[1] Score: 0.94 - "Cloud Infrastructure Cost Report - Q4 2025...
Total spending reached $2.4 million..."
[2] Score: 0.88 - "Q4 2025 Infrastructure Summary...
15% increase from Q3 2025..."
[3] Score: 0.85 - "2025 Annual Cloud Review...
Q4 represented highest quarterly spend..."
Post-filter applied:
author_verified: true
word_count: > 20
LLM Answer:
"Total cloud spending for Q4 2025 reached $2.4 million,
representing a 15% increase from Q3 2025. The primary drivers
were expanded compute capacity for the ML platform and
increased storage requirements for the data lake."

What went right:
| Improvement | Detail |
|---|---|
| Correct year only | Pre-filter eliminated all non-2025 content |
| Verified sources | Post-filter ensured authoritative content |
| Accurate answer | LLM generated its response from correct data |
| Traceable | The answer can be verified against source material |
The Difference in Numbers
| Metric | Without Filtering | With Staged Hybrid |
|---|---|---|
| Records searched | 1,000,000 | 100,000 |
| Results from correct year | 2 of 10 (20%) | 10 of 10 (100%) |
| Context window efficiency | ~25% relevant | ~95% relevant |
| Answer accuracy | ❌ Wrong | ✅ Correct |
| Query latency | 12ms | 20ms |
The 8ms additional latency is the cost of getting the right answer instead of the wrong one.
Real Benchmarks: The Numbers
Here is what you can expect when running these strategies on a corpus of 1 million documents:
| Strategy | Latency | Recall | Correct Category |
|---|---|---|---|
| No filter | 12ms | ~60% | ❌ Includes wrong years |
| Post-filter only | 45ms | 87% | ⚠️ Sometimes |
| Pre-filter only | 18ms | 95% | ✅ Yes |
| Staged hybrid | 20ms | 97% | ✅ Yes |
Key observations:
No filter is fast but wrong. Low latency means nothing when you return outdated information.
Post-filter is slow and wasteful. You perform vector search on documents you will discard.
Pre-filter is the baseline. Always use it for highly selective metadata like dates and categories.
Staged hybrid gives the best results. A small latency increase buys better recall and more flexibility.
Key Takeaways
Vector search finds meaning. Metadata enforces reality.
Your RAG system needs both. Semantic similarity tells you what content is related to the query. Metadata filtering ensures that related content is also correct for the user's context.
The Decision Framework
| Selectivity | Strategy | Example Filters |
|---|---|---|
| Removes > 90% of records | Pre-filter | Year, quarter, department |
| Removes 50-90% of records | Staged hybrid | Date + verified status |
| Removes < 50% of records | Post-filter only | Tags, word count |
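The framework above can be encoded as a small helper. This is a sketch, not part of the pipeline code earlier in this guide: `chooseStrategy` and `estimateSelectivity` are hypothetical names, and the thresholds simply mirror the table.

```javascript
// Hypothetical helper for applying the decision framework at query time.
// "selectivity" is the estimated fraction of records a filter would REMOVE (0-1).
function chooseStrategy(selectivity) {
  if (selectivity > 0.9) return 'pre-filter';
  if (selectivity >= 0.5) return 'staged-hybrid';
  return 'post-filter';
}

// Estimate selectivity from a sample of records: the share that a
// predicate would exclude. Sampling avoids scanning the full corpus.
function estimateSelectivity(sampleRecords, predicate) {
  const removed = sampleRecords.filter(r => !predicate(r)).length;
  return removed / sampleRecords.length;
}
```

In practice you would run `estimateSelectivity` on a periodic sample of your corpus per filter field, cache the results, and let `chooseStrategy` pick the execution path.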
Three Things to Remember
Design metadata before you design chunking. The filters you need at query time should inform your entire ingestion pipeline.
Pre-filter aggressively on indexed fields. Dates, categories, and access levels are cheap to filter and dramatically reduce your search space.
Measure end-to-end, not just retrieval metrics. High retrieval precision means nothing if users still get wrong answers. Test your complete pipeline with real queries.
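One minimal way to measure end-to-end: run a fixed set of test queries through your full pipeline and check each answer for facts it must contain. In this sketch, `answerQuery` stands in for your complete retrieve-and-generate pipeline; it is an assumption, not a function defined earlier in this guide.

```javascript
// Minimal end-to-end evaluation sketch. `answerQuery` is assumed to be your
// full RAG pipeline (retrieve + generate), returning an answer string.
async function evaluatePipeline(answerQuery, testCases) {
  let correct = 0;
  for (const { query, mustContain } of testCases) {
    const answer = await answerQuery(query);
    // An answer counts as correct only if every expected fact appears in it.
    if (mustContain.every(fact => answer.includes(fact))) correct++;
  }
  return correct / testCases.length; // end-to-end accuracy, 0 to 1
}
```

Substring matching is crude but catches the failure mode from the example above: a fluent answer built on the wrong year's numbers.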
Validation Checklist
Before you ship, verify:
- Pre-filters are applied for date, category, and access control
- Post-filters handle non-indexed criteria
- Query logs capture what was retrieved and why
- Test queries return results from the correct time periods only
- Latency is within your acceptable range (typically < 100ms)
- Recall is measured and meets your threshold (typically > 90%)
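The time-period and latency items on this checklist can be automated. Here is a sketch of a smoke test, assuming a `retrieve` function (your retrieval layer, not defined here) that returns records carrying the metadata fields used throughout this guide (`year`, `author_verified`).

```javascript
// Hypothetical smoke test: verify a query returns only results from the
// expected period, from verified authors, within a latency budget.
async function validateRetrieval(retrieve, query, expected) {
  const start = Date.now();
  const results = await retrieve(query);
  const latencyMs = Date.now() - start;

  const wrongPeriod = results.filter(r => r.year !== expected.year).length;
  const unverified = results.filter(r => !r.author_verified).length;

  return {
    passed: wrongPeriod === 0 && unverified === 0 && latencyMs < expected.maxLatencyMs,
    latencyMs,
    wrongPeriod,
    unverified
  };
}
```

Run a test like this against every pre-filter you rely on in production; a single failing report is a sign that a filter is being silently dropped somewhere in the pipeline.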