The Most Common Mistake in Production RAG Systems: Why Your Vector Search Is Searching the Wrong Documents

06 Feb 2026


Retrieval-Augmented Generation, or RAG, is now the standard architecture for building LLM applications that need access to your own data.

The idea is simple: instead of hoping the language model knows the answer, you fetch relevant information from your database and include it in the prompt. The model reads this content and generates an answer based on what it finds.

Here is how it works: you take your data, break it into smaller pieces called chunks, and convert each chunk into a vector embedding—a list of numbers that represents what the text means. These vectors get stored in a database. When a user asks a question, you convert their question into a vector too, search for chunks with similar vectors, and pass those chunks to the LLM.
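To make "similar vectors" concrete, here is a minimal sketch of cosine similarity, the comparison most vector databases use under the hood. The four-dimensional vectors are made up purely for illustration; real embedding models produce hundreds or thousands of dimensions:

```javascript
// Cosine similarity: 1 means identical direction, 0 means unrelated.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy 4-d embeddings (real models output 1,536+ dimensions).
const queryVector = [0.2, 0.8, 0.1, 0.4];
const chunkA = [0.25, 0.75, 0.05, 0.45]; // close in meaning to the query
const chunkB = [0.9, 0.1, 0.8, 0.05];    // unrelated meaning

console.log(cosineSimilarity(queryVector, chunkA) > cosineSimilarity(queryVector, chunkB)); // true
```

At query time, "search for chunks with similar vectors" simply means ranking stored chunks by this score against the query's vector.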

This approach solves important problems. Language models have outdated knowledge. They have never seen your internal data. Fine-tuning them is expensive and slow. RAG lets you connect any data source to any model without retraining.

But most teams make the same mistake when building these systems.

They spend weeks perfecting their embeddings and chunking logic. They benchmark different models and tweak similarity thresholds. What they forget is that vector search only answers one question: how similar does this text sound to the query? It has no understanding of time, categories, permissions, or business rules. It cannot distinguish between what is semantically relevant and what is actually correct for the user's context.

In other words, similarity is not the same as relevance. Your retrieval system might return content that sounds right but is completely wrong for the situation.

What this guide covers:

There are several mistakes teams make when building RAG systems. This guide focuses on the most common and most damaging one: optimising embeddings while ignoring metadata filtering. We will explore why this happens, why it breaks production systems, and how to fix it with a staged hybrid filtering approach—complete with working Node.js code you can adapt for your own projects.

Let us start by understanding what RAG actually is and why it exists.

Understanding RAG: The Bridge Between Your Data and LLMs

What is Retrieval-Augmented Generation?

RAG combines two systems: a retrieval system that finds relevant information and a generation system that produces answers. First, you search your knowledge base to find content that might help answer the question. Then, you hand that content to an LLM and ask it to generate a response.

The retrieval phase acts as a filter. Instead of asking the LLM to remember everything, you give it exactly what it needs for the current question. The generation phase takes this information and turns it into a clear, readable answer. When retrieval works well, even a simple language model can produce accurate results.

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│              │    │              │    │              │
│   Your Data  │───▶│  Chunking &  │───▶│   Vector     │
│              │    │  Embedding   │    │   Database   │
│              │    │              │    │              │
└──────────────┘    └──────────────┘    └──────────────┘
                                               │
                                               ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│              │    │              │    │              │
│   Answer     │◀───│     LLM      │◀───│  Retrieved   │
│              │    │              │    │   Chunks     │
│              │    │              │    │              │
└──────────────┘    └──────────────┘    └──────────────┘
                           ▲
                           │
                    ┌──────────────┐
                    │              │
                    │  User Query  │
                    │              │
                    └──────────────┘

Diagram: RAG Pipeline Overview

Why RAG Exists: The Problem It Solves

Language models have a fundamental limitation: their knowledge stops at the point they were trained. They cannot access information that came after their training cutoff. They also have no awareness of private or proprietary information that was never part of their training data.

Fine-tuning seems like an obvious solution, but it is expensive, slow, and needs to be repeated every time your data changes. For use cases where information updates frequently, fine-tuning is not practical.

RAG keeps the knowledge separate from the model. Your data lives in a vector database that you can update anytime. New information becomes searchable within minutes, not weeks. The LLM stays unchanged—it just receives different context for each query.

The Core Components of a RAG Pipeline

A RAG system has five core components. Ingestion is where your data enters the system. Chunking breaks that data into smaller, searchable units. Embedding converts each chunk into a vector—a numerical representation of its meaning. Storage holds these vectors in a database optimised for similarity search. Retrieval happens at runtime: the query gets embedded, compared against stored vectors, and matching chunks are returned.

Each component involves decisions that affect the final output. But one decision matters more than most teams realise: how you filter what gets searched in the first place.
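The five components above can be sketched as a single flow. Everything here is deliberately stubbed (a fake embedding function, an in-memory array instead of a vector database) just to show the shape of the pipeline, not a real implementation:

```javascript
// Stubbed RAG pipeline: swap the stubs for a real embedding model and vector DB.
const store = []; // storage: in-memory stand-in for a vector database

const chunk = text => text.split('\n\n');                 // chunking: naive paragraph split
const embed = text => [text.length % 7, text.length % 5]; // embedding: fake 2-d vector

function ingest(doc) {                                    // ingestion
  for (const c of chunk(doc)) store.push({ text: c, vector: embed(c) });
}

function retrieve(query, topK = 2) {                      // retrieval
  const q = embed(query);
  const dist = v => Math.hypot(v[0] - q[0], v[1] - q[1]);
  return [...store].sort((a, b) => dist(a.vector) - dist(b.vector)).slice(0, topK);
}

ingest('Q4 costs rose 15%.\n\nCompute was 45% of spend.');
console.log(store.length); // 2 chunks stored
```

The retrieved chunks would then be passed to the LLM as context, which is the generation half of the pipeline.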

What Challenges RAG Systems Are Designed to Solve

The Knowledge Boundary Problem

Every LLM has a knowledge cutoff date. Ask it about events that happened after that date and it will either admit it does not know or make something up. RAG solves this by maintaining a separate knowledge base that you control and update independently of the model.

The Hallucination Problem

Without grounding, LLMs generate confident-sounding answers that are partially or completely wrong. They fill gaps with plausible guesses. RAG reduces this by providing source material. The model answers based on actual content, and you can trace every response back to its source.

The Scale and Freshness Problem

Organisations deal with large volumes of information that change constantly. New reports, updated policies, added records. RAG handles this naturally. Adding or updating information means inserting or replacing vectors in your database. Your knowledge base stays current without touching the model.

The Common Mistakes People Make When Building RAG Systems

Before we dive into the solution, let us look at the five most common mistakes teams make when building RAG systems. We will cover all five briefly, then spend the rest of this guide solving Mistake #1 in depth—because it is the most widespread and causes the most damage in production.

| Mistake | What Goes Wrong | Impact Level |
|---|---|---|
| #1 Ignoring metadata filtering | Wrong content gets retrieved despite good embeddings | 🔴 Critical |
| #2 Poor chunking strategy | LLM lacks context or gets diluted information | 🟠 High |
| #3 Treating all content as equal | Drafts compete with verified documents | 🟠 High |
| #4 Retrieval-generation mismatch | Retrieved content is unhelpful for the LLM | 🟡 Medium |
| #5 No observability | Cannot diagnose why wrong answers happen | 🟡 Medium |


Mistake #1: Optimising Embeddings While Ignoring Metadata Filtering

This is the most common and costly mistake in production RAG systems. Teams spend weeks fine-tuning embedding models and perfecting chunking strategies while ignoring a fundamental truth: vector search only measures semantic similarity.

Consider a real scenario. You have a RAG system with 1 million records spanning multiple years. A user asks: "What were our Q4 2025 cloud infrastructure costs?"

Your embedding model converts this query into a vector and searches for similar content. The problem? Content about "cloud infrastructure costs" from 2019, 2020, 2021, 2022, 2023, and 2024 is just as semantically similar to this query as content from 2025. The vector search has no concept of time. It cannot distinguish between a 2019 report and a 2025 report—both talk about the same topic.

What happens without metadata filtering:

Problem

Consequence

System searches all 1 million records

Wasted compute resources

Results include wrong years

Incorrect information retrieved

75% of context window wasted

Relevant content gets pushed out

Latency increases by 30ms+

Slower response times

LLM generates answer using outdated figures

User gets wrong answer
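The failure is easy to see in miniature. In this sketch (made-up in-memory records, not a real database), four chunks cover the same topic and would score similarly in vector search, but a plain metadata filter knows only one is allowed to compete:

```javascript
// Toy corpus: same topic every year; only one record is correct for the query.
const chunks = [
  { id: 'a', year: 2019, quarter: 'Q4', topic: 'cloud infrastructure costs' },
  { id: 'b', year: 2023, quarter: 'Q4', topic: 'cloud infrastructure costs' },
  { id: 'c', year: 2025, quarter: 'Q4', topic: 'cloud infrastructure costs' },
  { id: 'd', year: 2025, quarter: 'Q2', topic: 'cloud infrastructure costs' }
];

// Vector search alone sees four equally plausible candidates.
// A metadata filter eliminates the impossible ones before any scoring.
function metadataFilter(records, { year, quarter }) {
  return records.filter(r => r.year === year && r.quarter === quarter);
}

const candidates = metadataFilter(chunks, { year: 2025, quarter: 'Q4' });
console.log(candidates.map(c => c.id)); // [ 'c' ]
```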

The mental model you need:

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   Vector Search    →    "How similar is this content?"      │
│                                                             │
│   Metadata Filter  →    "Is this content allowed to         │
│                          compete in the first place?"       │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Both are necessary. Neither is sufficient alone.

We will solve this mistake in detail in the sections that follow.

This guide focuses on Mistake #1 because it is foundational. If you retrieve the wrong content, nothing else matters. Your chunking can be perfect. Your LLM can be state-of-the-art. But if your system returns 2019 data for a 2025 question, your users get wrong answers.

Fixing this mistake requires understanding the staged hybrid filtering approach. Let us dive in.

Deep Dive: The Staged Hybrid Filtering Approach

The solution to Mistake #1 is not choosing between vector search and metadata filtering. It is combining them in a staged pipeline.

Understanding the Three-Stage Architecture

The staged hybrid approach processes queries in three phases:

| Stage | Operation | Purpose | Example |
|---|---|---|---|
| Stage 1 | Pre-filter | Eliminate impossible candidates | Remove all documents not from 2025 |
| Stage 2 | Vector search | Semantic ranking | Find most similar content in filtered set |
| Stage 3 | Post-filter | Lightweight refinement | Keep only verified documents |

This architecture balances speed, accuracy, and flexibility.

                    ┌─────────────────────────────┐
                    │                             │
                    │     1,000,000 Records       │
                    │        (Full Corpus)        │
                    │                             │
                    └──────────────┬──────────────┘
                                   │
                                   ▼
                    ┌─────────────────────────────┐
                    │   STAGE 1: Pre-Filter       │
                    │   • year = 2025             │
                    │   • quarter = Q4            │
                    │   • department = finance    │
                    │   +6ms latency              │
                    └──────────────┬──────────────┘
                                   │
                                   ▼
                    ┌─────────────────────────────┐
                    │                             │
                    │      100,000 Records        │
                    │    (Filtered Candidates)    │
                    │                             │
                    └──────────────┬──────────────┘
                                   │
                                   ▼
                    ┌─────────────────────────────┐
                    │   STAGE 2: Vector Search    │
                    │   • Semantic similarity     │
                    │   • HNSW / IVF algorithm    │
                    │   • Top 50 results          │
                    │   ~12ms latency             │
                    └──────────────┬──────────────┘
                                   │
                                   ▼
                    ┌─────────────────────────────┐
                    │                             │
                    │        50 Records           │
                    │   (Semantically Similar)    │
                    │                             │
                    └──────────────┬──────────────┘
                                   │
                                   ▼
                    ┌─────────────────────────────┐
                    │   STAGE 3: Post-Filter      │
                    │   • author_verified = true  │
                    │   • word_count > 20         │
                    │   • required tags           │
                    │   +2ms latency              │
                    └──────────────┬──────────────┘
                                   │
                                   ▼
                    ┌─────────────────────────────┐
                    │                             │
                    │        10 Records           │
                    │      (Final Results)        │
                    │                             │
                    └─────────────────────────────┘

Total: ~20ms end to end, with ~97% recall and every result from the correct category

Diagram: Staged Hybrid Filtering Funnel

Stage 1: Pre-filtering (Indexed, Selective)

Pre-filtering happens before vector search. You use indexed metadata fields to eliminate documents that cannot possibly be correct answers.

Good candidates for pre-filtering:

  • Date ranges (year, quarter, month)

  • Categorical fields (department, product, region)

  • Access control (user permissions, security level)

Why pre-filter works:

  • Uses traditional database indexes (B-trees, hash indexes)

  • Sub-millisecond filtering on properly indexed fields

  • Dramatically reduces the search space

// Example: Pre-filter using Supabase RPC function
const { data, error } = await supabase.rpc('match_documents', {
  query_embedding: queryEmbedding,
  match_count: 10,
  filter_year: 2025,
  filter_quarter: 'Q4',
  filter_departments: ['finance', 'infrastructure'],
  filter_max_access_level: userAccessLevel
});

When to use pre-filtering:

Pre-filter when selectivity is less than 10%. If your filter removes more than 90% of documents, always pre-filter.

Stage 2: Approximate Nearest Neighbour (ANN) Vector Search

After pre-filtering reduces your search space, vector search ranks the remaining content by semantic similarity.

Modern vector databases use ANN algorithms like HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index). These trade perfect recall for speed—instead of comparing against every vector, they search a subset likely to contain the nearest neighbours.

// Vector search on pre-filtered subset using Supabase
async function vectorSearch(queryEmbedding, queryContext, topK = 10) {
  const { data, error } = await supabase.rpc('match_documents', {
    query_embedding: queryEmbedding,
    match_count: topK,
    filter_year: queryContext.year,
    filter_quarter: queryContext.quarter,
    filter_max_access_level: queryContext.maxAccessLevel
  });

  if (error) {
    throw new Error(`Search failed: ${error.message}`);
  }
  return data;
}

Key parameters to tune:

  • ef_search (HNSW) or nprobe (IVF): Higher values improve recall but increase latency

  • topK: Number of results to return for the next stage

Stage 3: Post-filtering (Non-indexed, Lightweight)

Post-filtering applies additional criteria after vector search returns results. Use this for:

  • Attributes that are expensive to index

  • High-cardinality fields (thousands of unique values)

  • Frequently changing metadata

  • Complex business rules

// Post-filter on vector search results
function postFilter(results, criteria) {
  return results.filter(result => {
    const meta = result; // match_documents returns flat columns, not a nested metadata object

    // Check author verification
    if (criteria.requireVerified && !meta.author_verified) {
      return false;
    }

    // Check minimum word count
    if (criteria.minWordCount && meta.word_count < criteria.minWordCount) {
      return false;
    }

    // Check required tags
    if (criteria.requiredTags) {
      const hasAllTags = criteria.requiredTags.every(tag =>
        meta.tags?.includes(tag)
      );
      if (!hasAllTags) return false;
    }

    return true;
  });
}

When to use post-filtering:

Post-filter when selectivity is greater than 50%. If the filter only removes a small portion of results, post-filtering is more efficient.


The Hybrid Decision Framework:

                         ┌─────────────────────┐
                         │  What % of records  │
                         │  does this filter   │
                         │     eliminate?      │
                         └──────────┬──────────┘
                                    │
              ┌─────────────────────┼─────────────────────┐
              │                     │                     │
              ▼                     ▼                     ▼
       ┌────────────┐        ┌────────────┐        ┌────────────┐
       │            │        │            │        │            │
       │   > 90%    │        │  50 - 90%  │        │   < 50%    │
       │            │        │            │        │            │
       └─────┬──────┘        └─────┬──────┘        └─────┬──────┘
             │                     │                     │
             ▼                     ▼                     ▼
       ┌────────────┐        ┌────────────┐        ┌────────────┐
       │            │        │            │        │            │
       │ PRE-FILTER │        │  STAGED    │        │ POST-FILTER│
       │            │        │  HYBRID    │        │    ONLY    │
       │            │        │            │        │            │
       └────────────┘        └────────────┘        └────────────┘
             │                     │                     │
             ▼                     ▼                     ▼
       ┌────────────┐        ┌────────────┐        ┌────────────┐
       │ Best for:  │        │ Best for:  │        │ Best for:  │
       │ • Dates    │        │ • Combined │        │ • Tags     │
       │ • Category │        │   criteria │        │ • Verified │
       │ • Access   │        │ • Flexible │        │ • Complex  │
       │   level    │        │   rules    │        │   rules    │
       └────────────┘        └────────────┘        └────────────┘

Diagram: Filter Strategy Decision Tree

Use this framework to decide where each filter belongs:

| Selectivity | Strategy | Reason |
|---|---|---|
| Less than 10% | Pre-filter | Eliminates most documents, saves vector search time |
| 10% to 50% | Staged hybrid | Combine pre-filter and post-filter |
| Greater than 50% | Post-filter only | Not selective enough to justify index overhead |
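These thresholds can be captured in a small helper. This is an illustrative sketch, not a library API; `selectivity` here means the fraction of records the filter keeps:

```javascript
// selectivity = fraction of records the filter KEEPS (0..1).
//   < 0.10      → filter eliminates >90% → pre-filter
//   0.10 - 0.50 → staged hybrid
//   > 0.50      → filter keeps most records → post-filter only
function chooseFilterStrategy(selectivity) {
  if (selectivity < 0 || selectivity > 1) {
    throw new RangeError('selectivity must be between 0 and 1');
  }
  if (selectivity < 0.10) return 'pre-filter';
  if (selectivity <= 0.50) return 'staged-hybrid';
  return 'post-filter';
}

console.log(chooseFilterStrategy(0.05)); // 'pre-filter'   (e.g. year = 2025)
console.log(chooseFilterStrategy(0.30)); // 'staged-hybrid'
console.log(chooseFilterStrategy(0.80)); // 'post-filter'  (e.g. author_verified)
```

Measure selectivity with a simple count query against your metadata before deciding where a filter belongs.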


Implementation Walkthrough (Node.js)

Let us build a complete staged hybrid filtering pipeline.

Project Setup and Dependencies

mkdir rag-hybrid-filter
cd rag-hybrid-filter
npm init -y
npm install @supabase/supabase-js openai dotenv

Create your environment file:

# .env
SUPABASE_URL=your_supabase_project_url
SUPABASE_ANON_KEY=your_supabase_anon_key
OPENAI_API_KEY=your_openai_api_key

Database Setup (Run in Supabase SQL Editor)

Before using the code, you need to set up your Supabase database with the pgvector extension and create the necessary table and functions.

-- Enable the pgvector extension
create extension if not exists vector;

-- Create the documents table with metadata columns
create table documents (
  id text primary key,
  content text not null,
  embedding vector(1536),
  
  -- Indexed fields (for pre-filtering)
  year integer not null,
  quarter text not null,
  department text not null,
  document_type text not null,
  access_level integer not null,
  
  -- Non-indexed fields (for post-filtering)
  author text,
  author_verified boolean default false,
  tags text[],
  word_count integer,
  created_at timestamptz default now(),
  source_url text
);

-- Create indexes for pre-filter fields
create index idx_documents_year on documents(year);
create index idx_documents_quarter on documents(quarter);
create index idx_documents_department on documents(department);
create index idx_documents_access_level on documents(access_level);

-- Create index for vector similarity search
create index idx_documents_embedding on documents 
using ivfflat (embedding vector_cosine_ops)
with (lists = 100);

-- Function for vector search with pre-filtering
create or replace function match_documents(
  query_embedding vector(1536),
  match_count int default 10,
  filter_year int default null,
  filter_quarter text default null,
  filter_department text default null,
  filter_departments text[] default null,
  filter_max_access_level int default null
)
returns table (
  id text,
  content text,
  year int,
  quarter text,
  department text,
  document_type text,
  access_level int,
  author text,
  author_verified boolean,
  tags text[],
  word_count int,
  created_at timestamptz,
  source_url text,
  similarity float
)
language plpgsql
as $$
begin
  return query
  select
    d.id,
    d.content,
    d.year,
    d.quarter,
    d.department,
    d.document_type,
    d.access_level,
    d.author,
    d.author_verified,
    d.tags,
    d.word_count,
    d.created_at,
    d.source_url,
    1 - (d.embedding <=> query_embedding) as similarity
  from documents d
  where
    (filter_year is null or d.year = filter_year)
    and (filter_quarter is null or d.quarter = filter_quarter)
    and (filter_department is null or d.department = filter_department)
    and (filter_departments is null or d.department = any(filter_departments))
    and (filter_max_access_level is null or d.access_level <= filter_max_access_level)
  order by d.embedding <=> query_embedding
  limit match_count;
end;
$$;

Defining Your Metadata Schema

Design your metadata schema before ingesting any data. Think about what filters you will need at query time.

// schemas/document-metadata.js

const documentMetadataSchema = {
  // Indexed fields (for pre-filtering)
  year: 'number',           // 2019, 2020, ..., 2025
  quarter: 'string',        // Q1, Q2, Q3, Q4
  department: 'string',     // finance, engineering, marketing
  document_type: 'string',  // report, memo, policy, guide
  access_level: 'number',   // 1 = public, 2 = internal, 3 = confidential

  // Non-indexed fields (for post-filtering)
  author: 'string',
  author_verified: 'boolean',
  tags: 'string[]',
  word_count: 'number',
  created_at: 'string',     // ISO date string
  source_url: 'string'
};

// Validate metadata before ingestion
function validateMetadata(metadata) {
  const errors = [];

  if (!metadata.year || metadata.year < 2000 || metadata.year > 2030) {
    errors.push('Invalid year');
  }

  if (!['Q1', 'Q2', 'Q3', 'Q4'].includes(metadata.quarter)) {
    errors.push('Invalid quarter');
  }

  if (!metadata.department) {
    errors.push('Department is required');
  }

  if (typeof metadata.access_level !== 'number' || metadata.access_level < 1 || metadata.access_level > 3) {
    errors.push('Invalid access level');
  }

  return { valid: errors.length === 0, errors };
}

module.exports = { documentMetadataSchema, validateMetadata };

Ingesting Documents with Rich Metadata

// ingest.js
require('dotenv').config();
const { createClient } = require('@supabase/supabase-js');
const OpenAI = require('openai');
const { validateMetadata } = require('./schemas/document-metadata');

const supabase = createClient(
  process.env.SUPABASE_URL,
  process.env.SUPABASE_ANON_KEY
);
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function getEmbedding(text) {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text
  });
  return response.data[0].embedding;
}

function chunkDocument(text, maxChunkSize = 500) {
  const paragraphs = text.split(/\n\n+/);
  const chunks = [];
  let currentChunk = '';

  for (const paragraph of paragraphs) {
    if (currentChunk.length + paragraph.length > maxChunkSize && currentChunk) {
      chunks.push(currentChunk.trim());
      currentChunk = '';
    }
    currentChunk += paragraph + '\n\n';
  }

  if (currentChunk.trim()) {
    chunks.push(currentChunk.trim());
  }
  return chunks;
}

async function ingestDocument(documentId, text, metadata) {
  // Validate metadata
  const validation = validateMetadata(metadata);
  if (!validation.valid) {
    throw new Error(`Invalid metadata: ${validation.errors.join(', ')}`);
  }

  // Chunk the document
  const chunks = chunkDocument(text);

  // Generate embeddings and prepare records
  const records = await Promise.all(
    chunks.map(async (chunk, index) => {
      const embedding = await getEmbedding(chunk);
      return {
        id: `${documentId}_chunk_${index}`,
        content: chunk,
        embedding: embedding,
        year: metadata.year,
        quarter: metadata.quarter,
        department: metadata.department,
        document_type: metadata.document_type,
        access_level: metadata.access_level,
        author: metadata.author,
        author_verified: metadata.author_verified,
        tags: metadata.tags,
        word_count: chunk.split(/\s+/).length,
        created_at: metadata.created_at,
        source_url: metadata.source_url
      };
    })
  );

  // Upsert to Supabase
  const { data, error } = await supabase
    .from('documents')
    .upsert(records, { onConflict: 'id' });
  if (error) {
    throw new Error(`Supabase insert error: ${error.message}`);
  }

  console.log(`Ingested ${records.length} chunks for document ${documentId}`);
  return records.length;
}

// Example usage
async function main() {
  const sampleDocument = `
    Cloud Infrastructure Cost Report - Q4 2025

    Executive Summary
    Total cloud spending for Q4 2025 reached $2.4 million, representing a 15% increase
    from Q3 2025. The primary drivers were expanded compute capacity for the ML platform
    and increased storage requirements for the data lake.

    Breakdown by Service
    Compute services accounted for 45% of total spend. Storage services represented 30%.
    Networking and data transfer made up the remaining 25%.

    Recommendations
    We recommend implementing reserved instances for predictable workloads to reduce
    compute costs by an estimated 20% in Q1 2026.
  `;

  const metadata = {
    year: 2025,
    quarter: 'Q4',
    department: 'infrastructure',
    document_type: 'report',
    access_level: 2,
    author: 'finance-team',
    author_verified: true,
    tags: ['cloud', 'costs', 'infrastructure', 'quarterly-report'],
    created_at: '2026-01-15T10:00:00Z',
    source_url: 'https://internal.company.com/reports/q4-2025-cloud-costs'
  };

  await ingestDocument('doc_q4_2025_cloud_costs', sampleDocument, metadata);
}

main().catch(console.error);

Building the Staged Query Pipeline

// query.js
require('dotenv').config();
const { createClient } = require('@supabase/supabase-js');
const OpenAI = require('openai');

const supabase = createClient(
  process.env.SUPABASE_URL,
  process.env.SUPABASE_ANON_KEY
);
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function getEmbedding(text) {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text
  });
  return response.data[0].embedding;
}

// Stage 2: Vector search with pre-filter using Supabase RPC
async function vectorSearchWithPreFilter(queryEmbedding, queryContext, topK = 20) {
  const { data, error } = await supabase.rpc('match_documents', {
    query_embedding: queryEmbedding,
    match_count: topK,
    filter_year: queryContext.year || null,
    filter_quarter: queryContext.quarter || null,
    filter_department: queryContext.department || null,
    filter_departments: queryContext.departments || null,
    filter_max_access_level: queryContext.maxAccessLevel || null
  });

  if (error) {
    throw new Error(`Supabase query error: ${error.message}`);
  }
  return data || [];
}

// Stage 3: Post-filter results
function applyPostFilter(results, postFilterCriteria) {
  return results.filter(result => {
    // Filter by author verification
    if (postFilterCriteria.requireVerified && !result.author_verified) {
      return false;
    }

    // Filter by minimum word count
    if (postFilterCriteria.minWordCount && result.word_count < postFilterCriteria.minWordCount) {
      return false;
    }

    // Filter by required tags (must have all)
    if (postFilterCriteria.requiredTags) {
      const docTags = result.tags || [];
      const hasAllTags = postFilterCriteria.requiredTags.every(tag =>
        docTags.includes(tag)
      );
      if (!hasAllTags) return false;
    }

    // Filter by excluded tags (must have none)
    if (postFilterCriteria.excludedTags) {
      const docTags = result.tags || [];
      const hasExcludedTag = postFilterCriteria.excludedTags.some(tag =>
        docTags.includes(tag)
      );
      if (hasExcludedTag) return false;
    }
    return true;
  });
}

// Complete staged hybrid query
async function stagedHybridQuery(query, queryContext, postFilterCriteria = {}, finalTopK = 5) {
  console.log('\n=== Staged Hybrid Query ===');
  console.log(`Query: "${query}"`);

  const startTime = Date.now();

  // Log pre-filter context
  console.log('\nStage 1 - Pre-filter context:', JSON.stringify(queryContext, null, 2));

  // Get query embedding
  const queryEmbedding = await getEmbedding(query);

  // Stage 2: Vector search with pre-filter
  const stage2Start = Date.now();
  const vectorResults = await vectorSearchWithPreFilter(queryEmbedding, queryContext, 20);
  const stage2Time = Date.now() - stage2Start;
  console.log(`\nStage 2 - Vector search: ${vectorResults.length} results in ${stage2Time}ms`);

  // Stage 3: Post-filter
  const stage3Start = Date.now();
  const filteredResults = applyPostFilter(vectorResults, postFilterCriteria);
  const stage3Time = Date.now() - stage3Start;
  console.log(`Stage 3 - Post-filter: ${filteredResults.length} results in ${stage3Time}ms`);

  // Return top K final results
  const finalResults = filteredResults.slice(0, finalTopK);
  const totalTime = Date.now() - startTime;

  console.log(`\nTotal query time: ${totalTime}ms`);
  console.log(`Final results: ${finalResults.length}`);

  return {
    results: finalResults,
    metadata: {
      totalTime,
      stage2Time,
      stage3Time,
      preFilterApplied: queryContext,
      postFilterApplied: postFilterCriteria,
      vectorResultCount: vectorResults.length,
      postFilterResultCount: filteredResults.length
    }
  };
}

// Generate answer using retrieved context
async function generateAnswer(query, retrievedChunks) {
  const context = retrievedChunks
    .map((chunk, i) => `[${i + 1}] ${chunk.content}`)
    .join('\n\n');

  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      {
        role: 'system',
        content: `You are a helpful assistant that answers questions based on the provided context. 
                  Only use information from the context. If the context does not contain 
                  enough information to answer, say so clearly.`
      },
      {
        role: 'user',
        content: `Context:\n${context}\n\nQuestion: ${query}`
      }
    ],
    temperature: 0.1
  });

  return response.choices[0].message.content;
}

// Example usage
async function main() {
  const query = 'What were our Q4 2025 cloud infrastructure costs?';

  // Define query context for pre-filtering
  const queryContext = {
    year: 2025,
    quarter: 'Q4',
    departments: ['infrastructure', 'finance'],
    maxAccessLevel: 2  // User's access level
  };

  // Define post-filter criteria
  const postFilterCriteria = {
    requireVerified: true,
    minWordCount: 20
  };

  // Execute staged hybrid query
  const { results, metadata } = await stagedHybridQuery(
    query,
    queryContext,
    postFilterCriteria,
    5
  );

  if (results.length === 0) {
    console.log('\nNo results found matching criteria.');
    return;
  }

  // Display retrieved chunks
  console.log('\n=== Retrieved Chunks ===');
  results.forEach((result, i) => {
    console.log(`\n[${i + 1}] Similarity: ${result.similarity.toFixed(4)}`);
    console.log(`    Year: ${result.year}, Quarter: ${result.quarter}`);
    console.log(`    Department: ${result.department}`);
    console.log(`    Text: ${result.content.substring(0, 100)}...`);
  });

  // Generate answer
  console.log('\n=== Generated Answer ===');
  const answer = await generateAnswer(query, results);
  console.log(answer);
}

main().catch(console.error);

Testing and Validating Your Filters

// test-filters.js
require('dotenv').config();
const { createClient } = require('@supabase/supabase-js');
const OpenAI = require('openai');

const supabase = createClient(
  process.env.SUPABASE_URL,
  process.env.SUPABASE_ANON_KEY
);
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function getEmbedding(text) {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text
  });
  return response.data[0].embedding;
}

// Vector search without any filtering
async function vectorSearchNoFilter(queryEmbedding, topK = 10) {
  const { data, error } = await supabase.rpc('match_documents', {
    query_embedding: queryEmbedding,
    match_count: topK
  });

  if (error) {
    throw new Error(`Supabase query error: ${error.message}`);
  }

  return data || [];
}

// Vector search with pre-filter
async function vectorSearchWithPreFilter(queryEmbedding, preFilter, topK = 10) {
  const { data, error } = await supabase.rpc('match_documents', {
    query_embedding: queryEmbedding,
    match_count: topK,
    filter_year: preFilter.year || null,
    filter_quarter: preFilter.quarter || null
  });

  if (error) {
    throw new Error(`Supabase query error: ${error.message}`);
  }

  return data || [];
}

// Compare results with and without filtering
async function compareFilterStrategies(query, preFilter, postFilterCriteria) {
  const queryEmbedding = await getEmbedding(query);

  console.log(`\nQuery: "${query}"\n`);
  console.log('='.repeat(60));

  // Strategy 1: No filtering
  const noFilterStart = Date.now();
  const noFilterResults = await vectorSearchNoFilter(queryEmbedding, 10);
  const noFilterTime = Date.now() - noFilterStart;

  console.log('\n1. NO FILTER');
  console.log(`   Time: ${noFilterTime}ms`);
  console.log(`   Results: ${noFilterResults.length}`);
  if (noFilterResults.length > 0) {
    const years = noFilterResults.map(r => r.year);
    console.log(`   Years in results: ${[...new Set(years)].sort().join(', ')}`);
    const correctYear = noFilterResults.filter(r => r.year === 2025).length;
    console.log(`   Correct year (2025): ${correctYear}/${noFilterResults.length}`);
  }

  // Strategy 2: Post-filter only
  const postOnlyStart = Date.now();
  const postOnlyVectorResults = await vectorSearchNoFilter(queryEmbedding, 50);
  const postFiltered = postOnlyVectorResults.filter(result => {
    if (preFilter.year && result.year !== preFilter.year) return false;
    if (preFilter.quarter && result.quarter !== preFilter.quarter) return false;
    if (postFilterCriteria.requireVerified && !result.author_verified) return false;
    return true;
  }).slice(0, 10);
  const postOnlyTime = Date.now() - postOnlyStart;

  console.log('\n2. POST-FILTER ONLY');
  console.log(`   Time: ${postOnlyTime}ms`);
  console.log(`   Results after filter: ${postFiltered.length}`);
  if (postFiltered.length > 0) {
    const years = postFiltered.map(r => r.year);
    console.log(`   Years in results: ${[...new Set(years)].sort().join(', ')}`);
  }

  // Strategy 3: Pre-filter
  const preFilterStart = Date.now();
  const preFilterResults = await vectorSearchWithPreFilter(queryEmbedding, preFilter, 10);
  const preFilterTime = Date.now() - preFilterStart;

  console.log('\n3. PRE-FILTER');
  console.log(`   Time: ${preFilterTime}ms`);
  console.log(`   Results: ${preFilterResults.length}`);
  if (preFilterResults.length > 0) {
    const years = preFilterResults.map(r => r.year);
    console.log(`   Years in results: ${[...new Set(years)].sort().join(', ')}`);
    const correctYear = preFilterResults.filter(r => r.year === 2025).length;
    console.log(`   Correct year (2025): ${correctYear}/${preFilterResults.length}`);
  }

  // Strategy 4: Staged hybrid
  const hybridStart = Date.now();
  const hybridVectorResults = await vectorSearchWithPreFilter(queryEmbedding, preFilter, 20);
  const hybridFiltered = hybridVectorResults.filter(result => {
    if (postFilterCriteria.requireVerified && !result.author_verified) return false;
    if (postFilterCriteria.minWordCount && result.word_count < postFilterCriteria.minWordCount) return false;
    return true;
  }).slice(0, 10);
  const hybridTime = Date.now() - hybridStart;

  console.log('\n4. STAGED HYBRID (Pre + Post)');
  console.log(`   Time: ${hybridTime}ms`);
  console.log(`   Results: ${hybridFiltered.length}`);
  if (hybridFiltered.length > 0) {
    const years = hybridFiltered.map(r => r.year);
    console.log(`   Years in results: ${[...new Set(years)].sort().join(', ')}`);
    console.log(`   All verified: ${hybridFiltered.every(r => r.author_verified)}`);
  }

  // Summary
  console.log('\n' + '='.repeat(60));
  console.log('SUMMARY');
  console.log('='.repeat(60));
  console.log(`
| Strategy       | Time   | Correct Results |
|----------------|--------|-----------------|
| No filter      | ${noFilterTime.toString().padStart(4)}ms | Unreliable      |
| Post-filter    | ${postOnlyTime.toString().padStart(4)}ms | ${postFiltered.length} results       |
| Pre-filter     | ${preFilterTime.toString().padStart(4)}ms | ${preFilterResults.length} results       |
| Staged hybrid  | ${hybridTime.toString().padStart(4)}ms | ${hybridFiltered.length} results       |
  `);
}

// Run comparison
async function main() {
  const query = 'What were our Q4 2025 cloud infrastructure costs?';

  const preFilter = {
    year: 2025,
    quarter: 'Q4'
  };

  const postFilterCriteria = {
    requireVerified: true,
    minWordCount: 20
  };

  await compareFilterStrategies(query, preFilter, postFilterCriteria);
}

main().catch(console.error);

Seeing the Difference: Wrong vs Right Output

Let us see what happens in practice when you query a RAG system with and without proper filtering.

Query: "What were our Q4 2025 cloud infrastructure costs?"

Without Metadata Filtering

Retrieved Content:
[1] Score: 0.92 - "Cloud Infrastructure Cost Report - Q4 2023... 
    Total spending reached $1.8 million..."
[2] Score: 0.89 - "Infrastructure Budget Review - Q3 2024... 
    Cloud costs trending upward..."
[3] Score: 0.87 - "Q4 2022 Cloud Expenses... 
    Compute services: $1.2 million..."

LLM Answer:
"Based on the available information, cloud infrastructure costs 
were approximately $1.8 million in Q4."

What went wrong:

| Issue                  | Detail                                           |
|------------------------|--------------------------------------------------|
| Wrong year retrieved   | 2023 data returned for a 2025 question           |
| High similarity scores | Vector search found semantically similar content |
| Confident wrong answer | The LLM has no way to know the data is outdated  |
| User impact            | Business decisions made on two-year-old data     |

With Staged Hybrid Filtering

Pre-filter applied:
  year: 2025
  quarter: Q4
  department: infrastructure OR finance

Retrieved Content:
[1] Score: 0.94 - "Cloud Infrastructure Cost Report - Q4 2025... 
    Total spending reached $2.4 million..."
[2] Score: 0.88 - "Q4 2025 Infrastructure Summary... 
    15% increase from Q3 2025..."
[3] Score: 0.85 - "2025 Annual Cloud Review... 
    Q4 represented highest quarterly spend..."

Post-filter applied:
  author_verified: true
  word_count: > 20

LLM Answer:
"Total cloud spending for Q4 2025 reached $2.4 million, 
representing a 15% increase from Q3 2025. The primary drivers 
were expanded compute capacity for the ML platform and 
increased storage requirements for the data lake."

What went right:

| Improvement       | Detail                                          |
|-------------------|-------------------------------------------------|
| Correct year only | Pre-filter eliminated all non-2025 content      |
| Verified sources  | Post-filter ensured authoritative content       |
| Accurate answer   | LLM generated the response from correct data    |
| Traceable         | Answer can be verified against source material  |


The Difference in Numbers

| Metric                    | Without Filtering | With Staged Hybrid |
|---------------------------|-------------------|--------------------|
| Records searched          | 1,000,000         | 100,000            |
| Results from correct year | 2 of 10 (20%)     | 10 of 10 (100%)    |
| Context window efficiency | ~25% relevant     | ~95% relevant      |
| Answer accuracy           | ❌ Wrong          | ✅ Correct         |
| Query latency             | 12ms              | 20ms               |

The 8ms additional latency is the cost of getting the right answer instead of the wrong one.

Real Benchmarks: The Numbers

Here is what you can expect when running these strategies on a corpus of 1 million documents:

| Strategy         | Latency | Recall | Correct Category        |
|------------------|---------|--------|-------------------------|
| No filter        | 12ms    | ~60%   | ❌ Includes wrong years |
| Post-filter only | 45ms    | 87%    | ⚠️ Sometimes            |
| Pre-filter only  | 18ms    | 95%    | ✅ Yes                  |
| Staged hybrid    | 20ms    | 97%    | ✅ Yes                  |

Key observations:

  • No filter is fast but wrong. Low latency means nothing when you return outdated information.

  • Post-filter is slow and wasteful. You perform vector search on documents you will discard.

  • Pre-filter is the baseline. Always use it for highly selective metadata like dates and categories.

  • Staged hybrid gives the best results. Small latency increase for better recall and flexibility.
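The recall figures above only mean something if you measure them consistently. A minimal sketch of that calculation, assuming you have a labelled ground-truth set of relevant chunk IDs per test query (the function name and shape here are illustrative, not from the article's earlier code):

```javascript
// Recall = relevant chunks retrieved / total relevant chunks.
// retrievedIds: IDs returned by your retrieval pipeline for one query.
// relevantIds: hand-labelled IDs that should have been returned.
function computeRecall(retrievedIds, relevantIds) {
  const relevant = new Set(relevantIds);
  const hits = retrievedIds.filter(id => relevant.has(id)).length;
  // With no labelled relevant chunks, recall is trivially perfect.
  return relevant.size === 0 ? 1 : hits / relevant.size;
}
```

Run this over a fixed set of test queries after every change to your filters, so a regression in recall shows up as a number rather than as a user complaint.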

Key Takeaways

Vector search finds meaning. Metadata enforces reality.

Your RAG system needs both. Semantic similarity tells you what content is related to the query. Metadata filtering ensures that related content is also correct for the user's context.

The Decision Framework

Selectivity

Strategy

Example Filters

Removes > 90% of records

Pre-filter

Year, quarter, department

Removes 50-90% of records

Staged hybrid

Date + verified status

Removes < 50% of records

Post-filter only

Tags, word count
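The selectivity thresholds above can be encoded as a small routing helper. This is a sketch of one reasonable mapping, assuming you can estimate (from table statistics or a sampled count) what fraction of records a filter would remove:

```javascript
// Map estimated filter selectivity (fraction of records removed, 0..1)
// to a filtering strategy, following the decision framework above.
function chooseFilterStrategy(selectivity) {
  if (selectivity > 0.9) return 'pre-filter';      // highly selective: dates, categories
  if (selectivity >= 0.5) return 'staged-hybrid';  // moderately selective: date + verified
  return 'post-filter';                            // weakly selective: tags, word count
}
```

In practice you might combine strategies per field rather than per query: pre-filter on the most selective field and post-filter on the rest.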

Three Things to Remember

  • Design metadata before you design chunking. The filters you need at query time should inform your entire ingestion pipeline.

  • Pre-filter aggressively on indexed fields. Dates, categories, and access levels are cheap to filter and dramatically reduce your search space.

  • Measure end-to-end, not just retrieval metrics. Retrieval precision means nothing if users get wrong answers. Test your complete pipeline with real queries.
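"Design metadata before you design chunking" is easiest to enforce with a validation step in your ingestion pipeline. A hedged sketch, reusing the field names from the examples above (`year`, `quarter`, `department`, `access_level`); the helper itself is hypothetical:

```javascript
// Validate and normalise chunk metadata at ingestion time, so every
// field you filter on at query time is guaranteed to be present.
function buildChunkMetadata({ year, quarter, department, accessLevel, authorVerified, wordCount }) {
  if (!Number.isInteger(year)) {
    throw new Error('year is required for pre-filtering');
  }
  if (!['Q1', 'Q2', 'Q3', 'Q4'].includes(quarter)) {
    throw new Error('quarter must be one of Q1-Q4');
  }
  return {
    year,
    quarter,
    department: department || 'general',
    access_level: accessLevel ?? 1,
    author_verified: Boolean(authorVerified),
    word_count: wordCount ?? 0
  };
}
```

Rejecting chunks with missing metadata at write time is far cheaper than discovering at query time that your pre-filter silently excludes half your corpus.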

Validation Checklist

Before you ship, verify:

  • Pre-filters are applied for date, category, and access control

  • Post-filters handle non-indexed criteria

  • Query logs capture what was retrieved and why

  • Test queries return results from correct time periods only

  • Latency is within acceptable range (typically < 100ms)

  • Recall is measured and meets your threshold (typically > 90%)
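The "correct time periods only" check is simple to automate. A minimal sketch, assuming result rows carry `year` and `quarter` metadata as in the examples above:

```javascript
// Check that every retrieved result matches the requested time period.
// Returns a summary object rather than throwing, so it can feed a test report.
function assertCorrectPeriod(results, { year, quarter }) {
  const offenders = results.filter(r => r.year !== year || r.quarter !== quarter);
  return {
    pass: offenders.length === 0,
    checked: results.length,
    offenders: offenders.length
  };
}
```

Wire this into the comparison script above for each test query, and fail the build if any strategy you ship lets an offender through.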