Executive Summary
This comprehensive analysis evaluates three vector storage solutions—Chroma, Facebook AI Similarity Search (FAISS), and Scikit-learn—for building a Retrieval-Augmented Generation (RAG) knowledge base chatbot using Confluence page data.
🎯 The Core Challenge
Confluence data exists in a complex, hierarchical structure where pages have parent-child relationships and rich interconnections. Standard RAG techniques that treat documents as isolated chunks destroy this crucial contextual integrity.
Winner: Chroma
Purpose-built AI database with integrated metadata storage and efficient pre-filtering. Ideal architectural fit for hierarchical RAG.
Performance: FAISS
Unparalleled raw speed for vector similarity search, but requires complex custom architecture for metadata handling.
Prototype: Scikit-learn
Excellent for learning and small-scale experiments, but unsuitable for production RAG systems due to scalability limitations.
The Hierarchical Data Challenge
🏗️ Confluence's Complex Structure
Unlike simple document collections, Confluence data forms a rich, interconnected graph where context is everything:
📄 Page Hierarchies
Tree structures with explicit parent-child relationships creating contextual meaning
🏢 Spaces & Teams
Organizational boundaries that group related knowledge and define access patterns
🏷️ Rich Metadata
Labels, authors, dates, permissions that provide essential filtering context
🔗 Hyperlink Graph
Dense interconnections that span hierarchies and create knowledge pathways
⚠️ The Contextual Integrity Problem
Naive Chunking Destroys Context
Standard document splitting severs the vital link between text and its hierarchical position
Retrieval Ambiguity
A chunk saying "The limit is 100 requests per minute" is useless without knowing it comes from "Production API Billing Documentation"; the sketch below shows how attaching hierarchical metadata to each chunk preserves that link
Lost Knowledge Graphs
The rich interconnections that humans use to navigate knowledge are completely ignored
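One way to preserve that link is to carry each page's position in the hierarchy along with the chunk itself. The following is a minimal sketch, assuming pages have already been fetched from the Confluence API; field names such as `ancestor_titles` and `breadcrumb` are illustrative, not part of any fixed Confluence schema.

```python
# Minimal sketch: enrich each chunk with hierarchical context so that a snippet
# like "The limit is 100 requests per minute" stays tied to its source page.
# Field names (ancestor_titles, space_key, labels) are illustrative assumptions.

def chunk_page(page: dict, chunk_size: int = 500) -> list[dict]:
    """Split a Confluence page into chunks that keep hierarchical metadata."""
    text = page["body"]
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return [
        {
            "text": chunk,
            "metadata": {
                "page_id": page["id"],
                "page_title": page["title"],
                "space_key": page["space_key"],
                # Full ancestor path, e.g. "Engineering > APIs > Production API Billing Documentation"
                "breadcrumb": " > ".join(page["ancestor_titles"] + [page["title"]]),
                "labels": ",".join(page.get("labels", [])),
            },
        }
        for chunk in chunks
    ]

page = {
    "id": "12345",
    "title": "Production API Billing Documentation",
    "space_key": "ENG",
    "ancestor_titles": ["Engineering", "APIs"],
    "labels": ["billing", "rate-limits"],
    "body": "The limit is 100 requests per minute for the production tier...",
}
for c in chunk_page(page):
    print(c["metadata"]["breadcrumb"], "->", c["text"][:40])
```

With this metadata attached, any vector store that can filter on it can answer "production billing" questions without losing the page's place in the hierarchy.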
Vector Store Deep Dive
Chroma
AI-Native Vector Database
A purpose-built database for AI applications that integrates high-performance HNSW indexing with robust, native metadata storage and efficient pre-query filtering.
Key Strengths
- Native metadata co-location with vectors
- Efficient pre-filtering: metadata filters are applied before the vector search (see the sketch below)
- Full database API with CRUD operations
- Excellent developer experience and documentation
- Client-server architecture for scalability
Considerations
- Slightly lower raw performance than FAISS at extreme scale
- Newer ecosystem compared to established alternatives
- Limited indexing algorithm options vs FAISS
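As a hedged sketch of what this integrated approach looks like in practice (collection and metadata field names are illustrative, and Chroma's default embedding function is assumed), a metadata filter is passed as a `where` clause so that candidates are restricted before the similarity search runs:

```python
import chromadb

client = chromadb.Client()  # in-memory client; a persistent or client-server setup works the same way
collection = client.get_or_create_collection("confluence_chunks")

# Vectors and their hierarchical metadata live side by side in the collection.
collection.add(
    ids=["chunk-1", "chunk-2"],
    documents=[
        "The limit is 100 requests per minute.",
        "Use the staging key for integration tests.",
    ],
    metadatas=[
        {"space_key": "ENG", "parent_id": "12345",
         "breadcrumb": "Engineering > APIs > Production API Billing Documentation"},
        {"space_key": "ENG", "parent_id": "67890",
         "breadcrumb": "Engineering > APIs > Staging Environment Guide"},
    ],
)

results = collection.query(
    query_texts=["what is the production rate limit?"],
    n_results=2,
    where={"space_key": "ENG"},  # metadata pre-filter applied before the vector search
)
print(results["documents"][0])
print(results["metadatas"][0])
```

Because the metadata lives next to the vectors, this query path needs no second datastore and no synchronization logic.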
FAISS
High-Performance Vector Library
A C++ library (with Python bindings) offering unparalleled speed for raw vector similarity search, with extensive indexing options and proven scalability to billions of vectors.
Key Strengths
- Blazing fast unfiltered vector search
- Proven at billion-vector scale
- Extensive, highly tunable indexing algorithms
- GPU acceleration support
- Battle-tested in production
Major Limitations
- No native metadata storage
- Requires a separate database for metadata
- Complex application-level synchronization between the index and the metadata store
- Filtering requires post-processing workarounds (illustrated in the sketch below)
- High engineering overhead
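To make the metadata gap concrete, here is a minimal sketch: random vectors stand in for real embeddings, and an in-memory dict stands in for the external metadata store a team would actually have to run. The application maintains the id-to-metadata mapping itself and filters results after the vector search.

```python
import faiss
import numpy as np

d = 384  # embedding dimension (illustrative)
rng = np.random.default_rng(0)
embeddings = rng.random((1000, d), dtype="float32")  # stand-in for real chunk embeddings

# FAISS stores only vectors; metadata must live in a separate structure
# (here a dict, in production typically a relational or document store).
index = faiss.IndexIDMap(faiss.IndexFlatL2(d))
ids = np.arange(len(embeddings), dtype="int64")
index.add_with_ids(embeddings, ids)

metadata = {int(i): {"space_key": "ENG" if i % 2 == 0 else "HR",
                     "page_id": f"page-{i}"} for i in ids}

def search_with_post_filter(query_vec, space_key, k=5, overfetch=50):
    """Post-filtering workaround: over-fetch candidates, then drop non-matching metadata."""
    distances, candidate_ids = index.search(query_vec.reshape(1, -1), overfetch)
    hits = [(int(i), float(dist))
            for i, dist in zip(candidate_ids[0], distances[0])
            if i != -1 and metadata[int(i)]["space_key"] == space_key]
    return hits[:k]

query = rng.random(d, dtype="float32")
print(search_with_post_filter(query, "ENG"))
```

The over-fetch factor is a common workaround, but choosing it well is itself a tuning problem: with highly selective filters, even a large candidate pool can return too few matching results.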
Scikit-learn
ML Toolkit Component
A familiar machine learning toolkit whose nearest-neighbor utilities are excellent for prototyping and educational purposes, but fundamentally unsuitable for production RAG systems.
Good For
- Extremely simple to start with
- Perfect for learning and experimentation
- No external services or infrastructure required
- Familiar sklearn interface
Critical Limitations
- Does not scale: exact brute-force search grows linearly with corpus size, and tree-based indexes degrade in high dimensions (curse of dimensionality)
- Memory-bound, single-machine only (see the sketch below)
- No database features (concurrency, persistence)
- Unsuitable for production environments
- No metadata storage or filtering capabilities
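For completeness, a minimal sketch of the scikit-learn path, with random vectors in place of real embeddings: exact brute-force cosine search over an in-memory matrix, with no persistence, no filtering, and all index-to-chunk bookkeeping left to the caller.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
embeddings = rng.random((500, 384))  # stand-in for chunk embeddings, all held in memory

# Exact (brute-force) cosine search: fine for a few thousand chunks,
# but every query scans the full matrix and nothing is persisted.
nn = NearestNeighbors(n_neighbors=3, metric="cosine", algorithm="brute")
nn.fit(embeddings)

query = rng.random((1, 384))
distances, indices = nn.kneighbors(query)
print(indices[0], distances[0])  # caller must map indices back to chunks and metadata itself
```

This is perfectly adequate for validating an embedding model or a chunking strategy on a laptop, which is exactly the role recommended for it here.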
Feature Comparison Matrix
A detailed comparison across the critical dimensions that matter for hierarchical RAG implementation:
| Feature / Criterion | Chroma | FAISS | Scikit-learn |
|---|---|---|---|
| Architecture Type | Integrated Database | C++ Library | ML Toolkit Component |
| Metadata Storage | Native & Integrated | External DB Required | None |
| Metadata Filtering | Pre-filtering (Efficient) | Post-filtering (Complex) | Not Available |
| Scalability | Good (Client-Server) | Excellent (Billions) | Poor (Memory-bound) |
| Developer Experience | Excellent (Full API) | Fair (Complex Setup) | Good (Prototyping) |
| Hierarchical RAG Fit | Excellent | Fair (Custom Logic) | Unsuitable |
Final Recommendation
Chroma is the Clear Winner
For building a knowledge base chatbot on hierarchical Confluence data, Chroma provides the optimal balance of architectural fit, developer experience, and performance characteristics.
Superior Architectural Fit
Native co-location of vectors and metadata with efficient pre-filtering is purpose-built for advanced retrieval patterns like Parent-Document Retrieval and Graph RAG.
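As one hedged illustration of how the Parent-Document Retrieval pattern maps onto Chroma (the in-memory `parent_store` is a stand-in for whatever docstore a team actually chooses): small child chunks are embedded for precise matching, and each carries a `parent_id` so the full parent page can be returned as context.

```python
import chromadb

client = chromadb.Client()
children = client.get_or_create_collection("confluence_child_chunks")

# Small child chunks are embedded for precise matching; each points at its
# parent page, which is what gets handed to the LLM for fuller context.
children.add(
    ids=["c1", "c2"],
    documents=["The limit is 100 requests per minute.",
               "Overage charges apply beyond the included quota."],
    metadatas=[{"parent_id": "12345"}, {"parent_id": "12345"}],
)

parent_store = {  # illustrative docstore; could equally be another Chroma collection
    "12345": "Production API Billing Documentation: full page text ...",
}

hits = children.query(query_texts=["production rate limit"], n_results=2)
parent_ids = {m["parent_id"] for m in hits["metadatas"][0]}
context = [parent_store[pid] for pid in parent_ids]
print(context)
```

The same `parent_id` linkage generalizes to ancestor pages or linked pages, which is what makes the pattern a natural fit for Confluence's hierarchy.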
Reduced Engineering Overhead
The "batteries-included" database approach eliminates infrastructure complexity, allowing teams to focus on retrieval quality rather than plumbing.
Pragmatic Performance
While FAISS is faster in isolation, Chroma's integrated approach often delivers lower end-to-end latency for filter-heavy hierarchical queries.
⚠️ Important Considerations
- This recommendation is based on total cost of ownership, not raw performance benchmarks
- For use cases requiring billion-vector scale with minimal filtering, FAISS may still be preferred
- Always prototype with your specific data and query patterns before final architecture decisions
- Consider hybrid approaches where appropriate (e.g., Chroma for filtered queries, FAISS for similarity-only)