====================
Data Model Reference
====================

Ada stores all documents in a Chroma vector database with structured metadata schemas. This page documents the complete data model, including schemas, metadata fields, and usage patterns.

.. contents:: Contents
   :local:
   :depth: 2

--------
Overview
--------

Document Structure
==================

Every document in Ada's vector database has four components:

1. **ID**: Unique identifier (``doc-`` prefix followed by a UUID: ``doc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx``)
2. **Text**: The content of the document (plain text)
3. **Embeddings**: 768-dimensional vector from the ``nomic-embed-text`` model
4. **Metadata**: Structured fields describing the document (see schemas below)

Collection Organization
=======================

Ada uses a single Chroma collection named ``conversations`` that contains all document types. Documents are differentiated by their ``type`` metadata field.

Available document types:

- ``persona``: Identity and behavioral guidelines
- ``faq``: Knowledge base entries and specialist documentation
- ``memory``: Long-term facts and context
- ``turn``: Conversation history (user/assistant pairs)
- ``summary``: Conversation summaries

----------------------
Schema Introspection
----------------------

Live Schema API
===============

Ada provides a ``/v1/schema`` endpoint that returns JSON Schema definitions for all document types. This makes the data model fully introspectable at runtime.

**Get all schemas:**

.. code-block:: bash

   curl http://localhost:5000/api/schema

**Get specific schema:**

.. code-block:: bash

   # Quote the URL so the shell does not interpret the "?"
   curl "http://localhost:5000/api/schema?doc_type=memory"

**Response structure:**

.. code-block:: json

   {
     "document_type": "memory",
     "schema": { ... JSON Schema ... },
     "fields": ["type", "timestamp", "source", "scope", ...],
     "required_fields": ["type", "timestamp", "importance"]
   }

Python Schema Access
====================

Import Pydantic models directly for type-safe document creation:

.. code-block:: python

   from brain.schemas import (
       PersonaMetadata,
       FAQMetadata,
       MemoryMetadata,
       TurnMetadata,
       SummaryMetadata,
       validate_metadata,
       get_all_schemas,
   )

   # Validate metadata
   meta = {"type": "memory", "timestamp": "2025-12-16T06:00:00+00:00", "importance": 4}
   validated = validate_metadata("memory", meta)

   # Get JSON Schema
   schemas = get_all_schemas()
   memory_schema = schemas["memory"]

----------------
Document Schemas
----------------

Base Metadata (All Documents)
==============================

All documents share these common fields:

:type:
   :Type: ``string`` (enum)
   :Required: Yes
   :Values: ``persona``, ``faq``, ``memory``, ``turn``, ``summary``
   :Description: Document type identifier

:timestamp:
   :Type: ``string`` (ISO 8601 UTC)
   :Required: Yes
   :Example: ``2025-12-16T06:00:00+00:00``
   :Description: When the document was created or last updated

:source:
   :Type: ``string``
   :Required: Yes
   :Default: ``system``
   :Values: ``chat``, ``kb``, ``system``, ``import``
   :Description: Origin of the document

:scope:
   :Type: ``string``
   :Required: Yes
   :Default: ``global``
   :Format: ``global``, or ``entity:`` followed by the entity name
   :Description: Access scope for document retrieval
   :Example: ``entity:project-alpha`` for project-specific documents

Persona Documents
=================

**Purpose:** Define Ada's identity, tone, and behavioral guidelines.

**When used:** Loaded from ``persona.md`` on startup and retrieved when identity or tone questions arise.

**Metadata fields:**

:type: Fixed value: ``persona``

:version:
   :Type: ``string``
   :Required: Yes
   :Description: Version timestamp for persona updates
   :Example: ``2025-12-14T20:12:53.986582Z``

:topic:
   :Type: ``string`` (optional)
   :Description: Topic or section of the persona, if it is chunked
   :Examples: ``tone``, ``safety``, ``reasoning``

**Example document:**

.. code-block:: json

   {
     "id": "doc-abc123...",
     "document": "You are Ada, a helpful assistant...",
     "metadata": {
       "type": "persona",
       "timestamp": "2025-12-16T06:00:00+00:00",
       "source": "kb",
       "scope": "global",
       "version": "2025-12-14T20:12:53.986582Z",
       "topic": "tone"
     }
   }

**Query example:**

.. code-block:: python

   # Retrieve persona context for identity questions
   results = rag_store.col.query(
       query_texts=["What is my tone?"],
       where={"type": "persona"},
       n_results=2
   )

FAQ Documents
=============

**Purpose:** Knowledge base entries, specialist documentation, and general reference.

**When used:** Retrieved via semantic similarity for relevant context.

**Metadata fields:**

:type: Fixed value: ``faq``

:topic:
   :Type: ``string`` (optional)
   :Description: Category for organization
   :Examples: ``specialists``, ``api``, ``configuration``, ``troubleshooting``

:specialist_name:
   :Type: ``string`` (optional)
   :Field name: ``_specialist_name`` (with underscore)
   :Description: Name of the specialist this document describes
   :Examples: ``web_search``, ``ocr``, ``vision``

:version:
   :Type: ``string`` (optional)
   :Field name: ``_version`` (with underscore)
   :Default: ``auto``
   :Description: Version of the FAQ entry

**Example documents:**

.. code-block:: json

   {
     "id": "doc-faq-001",
     "document": "Q: How do I use the web search specialist?\nA: Use SPECIALIST_REQUEST[web_search:{\"query\":\"...\"}]",
     "metadata": {
       "type": "faq",
       "timestamp": "2025-12-16T06:00:00+00:00",
       "source": "system",
       "scope": "global",
       "topic": "specialists",
       "_specialist_name": "web_search",
       "_version": "auto"
     }
   }

.. code-block:: json

   {
     "id": "doc-faq-002",
     "document": "Q: How do I configure RAG?\nA: Set RAG_ENABLED=true in .env...",
     "metadata": {
       "type": "faq",
       "timestamp": "2025-12-16T06:00:00+00:00",
       "source": "kb",
       "scope": "global",
       "topic": "configuration"
     }
   }

**Query example:**

.. code-block:: python

   # Retrieve FAQ entries about specialists.
   # Chroma accepts one top-level operator per where clause,
   # so multiple conditions are combined with $and.
   results = rag_store.col.query(
       query_texts=["How do I invoke a specialist?"],
       where={"$and": [{"type": "faq"}, {"topic": "specialists"}]},
       n_results=3
   )

Memory Documents
================

**Purpose:** Long-term facts and context that persist across conversations.

**When used:** Retrieved via semantic similarity, weighted by importance and recency.

**Metadata fields:**

:type: Fixed value: ``memory``

:importance:
   :Type: ``integer``
   :Required: Yes
   :Range: 1-5 (1=low, 5=critical)
   :Default: 3
   :Description: Importance level affecting retrieval ranking

:tags:
   :Type: ``array of strings`` (optional)
   :Description: Tags for categorization and filtering
   :Examples: ``["python", "preferences"]``, ``["project-alpha", "deadline"]``

:entity:
   :Type: ``string`` (optional)
   :Description: Entity this memory pertains to (extracted from scope)
   :Examples: ``project-alpha``, ``team-beta``

:conversation_id:
   :Type: ``string`` (optional)
   :Description: Original conversation where the memory was created
   :Format: ``conv-`` prefix followed by the conversation identifier

**Example documents:**

.. code-block:: json

   {
     "id": "doc-mem-001",
     "document": "luna prefers Python over JavaScript for backend development",
     "metadata": {
       "type": "memory",
       "timestamp": "2025-12-16T06:00:00+00:00",
       "source": "chat",
       "scope": "global",
       "importance": 4,
       "tags": ["preferences", "programming"]
     }
   }

.. code-block:: json

   {
     "id": "doc-mem-002",
     "document": "project-alpha deadline is December 20th, 2025",
     "metadata": {
       "type": "memory",
       "timestamp": "2025-12-16T06:00:00+00:00",
       "source": "chat",
       "scope": "entity:project-alpha",
       "entity": "project-alpha",
       "importance": 5,
       "tags": ["deadline", "critical"],
       "conversation_id": "conv-abc123"
     }
   }

**Query example:**

.. code-block:: python

   # Retrieve memories with entity scoping
   results = rag_store.retrieve_memories(
       query="project-alpha status",
       k=5,
       entity="project-alpha"  # Filter to entity-scoped memories
   )

Turn Documents
==============

**Purpose:** Conversation history (user messages and assistant responses).

**When used:** Retrieved for recent context in ongoing conversations.

**Metadata fields:**

:type: Fixed value: ``turn``

:conversation_id:
   :Type: ``string``
   :Required: Yes
   :Format: ``conv-`` prefix followed by the conversation identifier
   :Description: Conversation this turn belongs to

:role:
   :Type: ``string``
   :Required: Yes
   :Values: ``user`` or ``assistant``
   :Description: Speaker role

:turn_index:
   :Type: ``integer`` (optional)
   :Description: Sequential turn number within the conversation
   :Examples: ``1``, ``2``, ``3``

**Example documents:**

.. code-block:: json

   {
     "id": "doc-turn-001",
     "document": "How do I configure the web search specialist?",
     "metadata": {
       "type": "turn",
       "timestamp": "2025-12-16T06:00:00+00:00",
       "source": "chat",
       "scope": "global",
       "conversation_id": "conv-abc123",
       "role": "user",
       "turn_index": 1
     }
   }

.. code-block:: json

   {
     "id": "doc-turn-002",
     "document": "To configure web search, set SEARXNG_URL in your .env file...",
     "metadata": {
       "type": "turn",
       "timestamp": "2025-12-16T06:00:01+00:00",
       "source": "chat",
       "scope": "global",
       "conversation_id": "conv-abc123",
       "role": "assistant",
       "turn_index": 1
     }
   }

**Query example:**

.. code-block:: python

   # Retrieve recent turns from a specific conversation.
   # Multiple conditions are combined with $and.
   results = rag_store.col.query(
       query_texts=["conversation context"],
       where={"$and": [{"type": "turn"}, {"conversation_id": "conv-abc123"}]},
       n_results=10
   )

Summary Documents
=================

**Purpose:** Compressed summaries of conversation turns for efficient context.

**When used:** Generated periodically (every N turns) to maintain context efficiency.

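
The periodic schedule can be sketched as follows. This is a minimal illustration, assuming a fixed window of eight turns (inferred from the ``1-8`` and ``9-16`` ranges used as ``turn_range`` examples on this page; the actual trigger is not documented here). ``summary_windows`` and ``build_summary_metadata`` are hypothetical helpers, not part of Ada's API.

```python
from datetime import datetime, timezone


def summary_windows(turn_count: int, window: int = 8) -> list[str]:
    """Return a turn_range string for every completed window of turns.

    The window size of 8 is an assumption, not a documented Ada setting.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    ranges = []
    # Step through window start positions; incomplete trailing windows are skipped.
    for start in range(1, turn_count - window + 2, window):
        ranges.append(f"{start}-{start + window - 1}")
    return ranges


def build_summary_metadata(conversation_id: str, turn_range: str, summary_index: int) -> dict:
    """Assemble summary metadata using the fields documented on this page."""
    return {
        "type": "summary",
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "source": "system",
        "scope": "global",
        "conversation_id": conversation_id,
        "turn_range": turn_range,
        "summary_index": summary_index,
    }


if __name__ == "__main__":
    for index, turn_range in enumerate(summary_windows(17), start=1):
        print(build_summary_metadata("conv-abc123", turn_range, index))
```

With 17 turns and the assumed window, only the two completed windows produce summaries; the seventeenth turn waits for its window to fill.
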

**Metadata fields:**

:type: Fixed value: ``summary``

:conversation_id:
   :Type: ``string``
   :Required: Yes
   :Format: ``conv-`` prefix followed by the conversation identifier
   :Description: Conversation this summary belongs to

:turn_range:
   :Type: ``string`` (optional)
   :Description: Range of turns covered by this summary
   :Format: First and last turn number joined by a hyphen
   :Examples: ``1-8``, ``9-16``

:summary_index:
   :Type: ``integer`` (optional)
   :Description: Sequential summary number within the conversation
   :Examples: ``1``, ``2``, ``3``

**Example document:**

.. code-block:: json

   {
     "id": "doc-sum-001",
     "document": "Discussed web search configuration. User wants to set up SearxNG...",
     "metadata": {
       "type": "summary",
       "timestamp": "2025-12-16T06:08:00+00:00",
       "source": "system",
       "scope": "global",
       "conversation_id": "conv-abc123",
       "turn_range": "1-8",
       "summary_index": 1
     }
   }

**Query example:**

.. code-block:: python

   # Retrieve conversation summaries
   results = rag_store.col.query(
       query_texts=["web search configuration"],
       where={"type": "summary"},
       n_results=3
   )

--------------------
Retrieval Strategies
--------------------

Semantic Similarity
===================

All documents are retrieved via semantic similarity using the ``nomic-embed-text`` embedding model (768 dimensions).

.. code-block:: python

   # Basic semantic query
   results = rag_store.col.query(
       query_texts=["How do I use specialists?"],
       n_results=5
   )

Metadata Filtering
==================

Chroma supports filtering by metadata fields using ``where`` clauses:

.. code-block:: python

   # Filter by type
   where = {"type": "faq"}

   # Filter by multiple fields.
   # Chroma accepts one top-level operator per where clause, so use $and.
   where = {"$and": [{"type": "memory"}, {"importance": {"$gte": 4}}]}

   # Filter by scope
   where = {"scope": "entity:project-alpha"}

   # Complex filters
   where = {
       "$and": [
           {"type": "memory"},
           {"importance": {"$gte": 4}},
           {"tags": {"$contains": "critical"}}
       ]
   }

Importance-Weighted Retrieval
==============================

For memory documents, retrieval can be weighted by importance and recency:

.. code-block:: python

   # Memory retrieval with importance weighting
   memories = rag_store.retrieve_memories(
       query="project status",
       k=5,
       entity="project-alpha"  # Optional entity scoping
   )

   # Scoring formula:
   # score = (similarity * 0.5) + (importance * 0.5) + (recency_bonus)

Recency-Based Ranking
======================

Turn and summary documents use timestamp-based ranking for temporal awareness:

.. code-block:: python

   conversation_id = "conv-abc123"
   RAG_TURN_TOP_K = 4  # Default value, see Configuration

   # Retrieve recent turns (automatically sorted by timestamp)
   recent_turns = rag_store.col.query(
       query_texts=["context"],
       where={
           "$and": [
               {"type": "turn"},
               {"conversation_id": conversation_id}
           ]
       },
       n_results=RAG_TURN_TOP_K
   )

--------------
Usage Examples
--------------

Creating Documents
==================

**Persona:**

.. code-block:: python

   from brain.rag_store import RagStore

   rag_store = RagStore()

   rag_store.upsert_doc(
       text="You are Ada, a helpful assistant for luna.",
       type="persona",
       scope="global",
       timestamp="2025-12-16T06:00:00+00:00",
       source="kb",
       version="2025-12-16T06:00:00+00:00"
   )

**FAQ:**

.. code-block:: python

   rag_store.upsert_doc(
       text="Q: How do I use web search?\nA: Use SPECIALIST_REQUEST[web_search:...]",
       type="faq",
       scope="global",
       timestamp="2025-12-16T06:00:00+00:00",
       source="system",
       topic="specialists",
       _specialist_name="web_search"
   )

**Memory:**

.. code-block:: python

   rag_store.upsert_memory(
       text="luna prefers Python over JavaScript",
       scope="global",
       importance=4,
       tags=["preferences", "programming"],
       timestamp="2025-12-16T06:00:00+00:00",
       source="chat"
   )

**Turn:**

.. code-block:: python

   conversation_id = "conv-abc123"

   rag_store.upsert_turn(
       conversation_id=conversation_id,
       user_text="How do I configure web search?",
       assistant_text="Set SEARXNG_URL in .env...",
       user_ts="2025-12-16T06:00:00+00:00",
       assistant_ts="2025-12-16T06:00:01+00:00",
       source="chat"
   )

Querying Documents
==================

**Get persona context:**

.. code-block:: python

   persona_docs = rag_store.col.query(
       query_texts=["What is my tone?"],
       where={"type": "persona"},
       n_results=2
   )

**Search FAQs:**

.. code-block:: python

   # Multiple conditions are combined with $and
   faq_results = rag_store.col.query(
       query_texts=["specialist documentation"],
       where={"$and": [{"type": "faq"}, {"topic": "specialists"}]},
       n_results=3
   )

**Retrieve memories:**

.. code-block:: python

   memories = rag_store.retrieve_memories(
       query="user preferences",
       k=5,
       entity=None  # Global scope
   )

**Get conversation history:**

.. code-block:: python

   # Multiple conditions are combined with $and
   turns = rag_store.col.query(
       query_texts=["recent discussion"],
       where={"$and": [{"type": "turn"}, {"conversation_id": "conv-abc123"}]},
       n_results=10
   )

Validating Metadata
===================

Use Pydantic schemas for validation:

.. code-block:: python

   from brain.schemas import validate_metadata

   # Valid metadata
   meta = {
       "type": "memory",
       "timestamp": "2025-12-16T06:00:00+00:00",
       "source": "chat",
       "scope": "global",
       "importance": 4,
       "tags": ["test"]
   }
   validated = validate_metadata("memory", meta)

   # Invalid metadata (raises ValidationError)
   bad_meta = {
       "type": "memory",
       "importance": 10  # Out of range (1-5)
   }
   validate_metadata("memory", bad_meta)
   # Raises: ValidationError: importance must be between 1 and 5

--------------------
Configuration
--------------------

RAG-related configuration options:

:RAG_ENABLED: Master toggle for RAG system (default: ``true``)
:RAG_ENABLE_PERSONA: Enable persona document retrieval (default: ``true``)
:RAG_ENABLE_FAQ: Enable FAQ document retrieval (default: ``true``)
:RAG_ENABLE_MEMORY: Enable memory document retrieval (default: ``true``)
:RAG_ENABLE_TURN: Enable turn document retrieval (default: ``true``)
:RAG_ENABLE_SUMMARY: Enable summary document retrieval (default: ``true``)
:RAG_TURN_TOP_K: Number of turns to retrieve (default: ``4``, range: 1-20)
:RAG_FAQ_TOP_K: Number of FAQs to retrieve (default: ``2``, range: 1-10)
:RAG_MEMORY_TOP_K: Number of memories to retrieve (default: ``3``, range: 1-20)
:RAG_SUMMARY_TOP_K: Number of summaries to retrieve (default: ``2``, range: 1-10)

See :doc:`configuration` for complete details on all RAG settings.

--------
See Also
--------

- :doc:`configuration` - Environment variable configuration
- :doc:`api_usage` - API usage and endpoints
- :doc:`memory` - Memory management and consolidation
- :doc:`architecture` - System architecture overview
- :doc:`testing` - Testing and validation