Token Budget Monitoring
=======================

.. note::

   This feature is part of Ada v2.0's biomimetic foundation. It provides
   visibility into token usage patterns before implementing optimization
   strategies.

Overview
--------

The Token Budget Monitor tracks how many tokens are consumed by each part of
Ada's context window:

- **Persona** - Your AI persona description
- **FAQ** - Frequently asked questions and answers
- **Memories** - Retrieved memories from RAG search
- **Conversation History** - Recent messages
- **Specialist Results** - OCR output, web searches, etc.
- **System Notices** - Important system alerts

Why Token Monitoring?
~~~~~~~~~~~~~~~~~~~~~

**The Problem**: Modern LLMs have large context windows (128k+ tokens), but
loading too much context:

1. Slows down responses (more tokens to process)
2. Costs more (token-based pricing)
3. Can overwhelm the model with irrelevant information
4. Makes it harder to stay within limits

**The Solution**: Track what is using tokens so we can optimize strategically.

Architecture
------------

The monitoring system uses ``tiktoken`` (OpenAI's tokenizer) to count tokens
the same way language models see them.

.. code-block:: python

   from brain.token_monitor import TokenBudgetMonitor

   # Initialize monitor
   monitor = TokenBudgetMonitor(
       max_tokens=128000,        # Context window size
       warning_threshold=102400  # Warn at 80%
   )

   # Track components
   monitor.track("persona", persona_text)
   monitor.track("memories", memory_text)
   monitor.track("specialist_ocr", ocr_results)

   # Get breakdown
   breakdown = monitor.get_breakdown()
   print(f"Total tokens: {breakdown.total_tokens}")
   print(f"Percentage used: {breakdown.percentage_used:.1f}%")

   # Human-readable summary
   print(monitor.get_summary())

Data Structures
---------------

ComponentTokens
~~~~~~~~~~~~~~~

Tracks the token count for a single component:
.. code-block:: python

   @dataclass
   class ComponentTokens:
       tokens: int        # Absolute token count
       percentage: float  # Percentage of total

TokenUsageBreakdown
~~~~~~~~~~~~~~~~~~~

Complete breakdown of all token usage:

.. code-block:: python

   @dataclass
   class TokenUsageBreakdown:
       components: Dict[str, ComponentTokens]  # Per-component stats
       total_tokens: int                       # Sum of all components
       percentage_used: float                  # Percentage of budget
       is_warning: bool                        # True if over threshold

API Reference
-------------

TokenBudgetMonitor
~~~~~~~~~~~~~~~~~~

.. py:class:: TokenBudgetMonitor(max_tokens=128000, warning_threshold=102400, model="gpt-4")

   Main monitoring class.

   :param max_tokens: Maximum token budget (context window size)
   :param warning_threshold: Token count at which to warn
   :param model: Model name for tokenizer (affects counting)

   .. py:method:: count_tokens(text: str) -> int

      Count tokens in text.

      :param text: Text to count tokens for
      :returns: Number of tokens

   .. py:method:: track(component: str, text: str) -> int

      Track token usage for a component.

      :param component: Component name (e.g., "persona", "memories")
      :param text: Text content to track
      :returns: Number of tokens in this text

      Components accumulate tokens if tracked multiple times.

   .. py:method:: get_breakdown() -> TokenUsageBreakdown

      Get a detailed breakdown of token usage.

      :returns: TokenUsageBreakdown with all stats

   .. py:method:: get_summary() -> str

      Get a human-readable summary.

      :returns: Formatted string with the breakdown

   .. py:method:: get_top_components(n: int = 3) -> List[Tuple[str, int]]

      Get the top N components by token usage.

      :param n: Number of components to return
      :returns: List of (name, tokens) tuples, sorted descending

   .. py:method:: reset()

      Reset tracking for a new request. Call at the start of each request
      so requests are tracked independently.

   .. py:method:: log_breakdown()

      Log the detailed breakdown at INFO level. Useful for debugging and
      monitoring.

Usage Examples
--------------

Basic Tracking
~~~~~~~~~~~~~~
.. code-block:: python

   from brain.token_monitor import TokenBudgetMonitor

   monitor = TokenBudgetMonitor()

   # Track different components
   monitor.track("persona", "I am Ada, a friendly AI assistant...")
   monitor.track("memories", "User prefers Python over JavaScript...")
   monitor.track("history", "User: Hello! Ada: Hi there!")

   # Get summary
   print(monitor.get_summary())
   # Output:
   # Token Usage: 1,234 / 128,000 (0.96%)
   #
   # Breakdown by component:
   #   history  : 567 tokens ( 45.9%)
   #   memories : 456 tokens ( 37.0%)
   #   persona  : 211 tokens ( 17.1%)

Warning Detection
~~~~~~~~~~~~~~~~~

.. code-block:: python

   monitor = TokenBudgetMonitor(
       max_tokens=1000,
       warning_threshold=800
   )

   # Add a lot of content
   monitor.track("large_component", "..." * 500)

   breakdown = monitor.get_breakdown()
   if breakdown.is_warning:
       print(f"⚠️ Warning! Using {breakdown.total_tokens} tokens")
       print("Top components:")
       for name, tokens in monitor.get_top_components(3):
           print(f"  {name}: {tokens} tokens")

Integration with Prompt Builder
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from brain.prompt_builder import PromptAssembler
   from brain.token_monitor import TokenBudgetMonitor

   # Create assembler (includes cache automatically)
   assembler = PromptAssembler()

   # Build prompt (cache handles optimization)
   prompt = assembler.build_prompt(
       user_message="What's the weather?",
       conversation_id="conv-123"
   )

   # Check cache stats
   stats = assembler.cache.get_stats()
   print(f"Cache hit rate: {stats.hit_rate:.2%}")

Per-Request Tracking
~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   monitor = TokenBudgetMonitor()

   for request in requests:
       # Reset for new request
       monitor.reset()

       # Track this request
       monitor.track("persona", persona)
       monitor.track("context", request.context)

       # Process request...

       # Log usage for this request
       monitor.log_breakdown()

Token Counting Details
----------------------

The monitor uses ``tiktoken`` to count tokens exactly as language models do.

.. important::

   Different models tokenize differently!
   The default uses GPT-4's tokenizer for consistency, but you can specify
   a different model:

   .. code-block:: python

      # Use a different tokenizer
      monitor = TokenBudgetMonitor(model="gpt-3.5-turbo")

Common token counts (approximate):

- "Hello" = 1 token
- "Hello world" = 2 tokens
- "Hello, how are you?" = 6 tokens
- Average English word ≈ 1.3 tokens
- 100 words ≈ 130 tokens
- 1000 words ≈ 1300 tokens

Performance Considerations
--------------------------

Token counting is fast but not free:

- **Negligible impact** for normal usage (< 1 ms per component)
- **Caching recommended** for large static content (persona, FAQ)
- **Monitor overhead** is far less than token processing by the LLM

The token monitor itself uses minimal memory:

- ~1 KB per component tracked
- Tracking 20 components = ~20 KB overhead
- Completely acceptable for the visibility gained

Future Enhancements
-------------------

Planned improvements for v2.0:

1. **Automatic Optimization** - Suggest which components to cache or reduce
2. **Historical Tracking** - Track token usage over time to identify trends
3. **Budget Allocation** - Set per-component token budgets
4. **Smart Truncation** - Automatically trim the least important components
   when over budget
5. **Model-Specific Optimization** - Tune counting for different model families

Integration Status
------------------

.. list-table::
   :header-rows: 1
   :widths: 30 20 50

   * - Component
     - Status
     - Notes
   * - Token Monitor (Core)
     - ✅ Complete
     - All tests passing
   * - Prompt Builder Integration
     - 🚧 Planned
     - Next step
   * - Logging Integration
     - 🚧 Planned
     - Automatic logging
   * - Dashboard/UI
     - 📋 Future
     - Visual token breakdown
   * - API Endpoint
     - 📋 Future
     - ``GET /v1/tokens/usage``

See Also
--------

- :doc:`architecture` - How token monitoring fits into Ada's architecture
- :doc:`memory` - RAG system that benefits from token optimization
- :doc:`streaming` - How tokens flow through the system
- **Phase 1 Roadmap** - ``.ai/explorations/planning/IMPLEMENTATION_ROADMAP.md``
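The approximate figures listed under *Token Counting Details* translate into a
quick back-of-envelope estimator. The helpers below are hypothetical and not
part of the monitor's API; they simply restate the ~1.3 tokens-per-word rule
of thumb and the 80%-of-budget default warning threshold in code:

.. code-block:: python

   def estimate_tokens(word_count: int) -> int:
       """Rough token estimate using the ~1.3 tokens-per-word rule of thumb."""
       return round(word_count * 1.3)


   def default_warning_threshold(max_tokens: int, fraction: float = 0.8) -> int:
       """Warn at a fraction of the budget, e.g. 80% of a 128k window."""
       return int(max_tokens * fraction)

For a 128,000-token window, ``default_warning_threshold(128000)`` yields the
102,400-token default used throughout this page.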
.. note::

   Token monitoring is **read-only**: it observes but does not modify
   behavior. This provides a safe foundation for the optimization features
   built on top of it.
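To make the data structures and API above concrete, here is a minimal,
dependency-free sketch of the monitor's semantics. It is an illustration, not
the real implementation: a naive whitespace-split counter stands in for
``tiktoken``, the class is deliberately named ``SketchMonitor`` rather than
``TokenBudgetMonitor``, and all internals are assumptions.

.. code-block:: python

   from dataclasses import dataclass
   from typing import Dict, List, Tuple


   @dataclass
   class ComponentTokens:
       tokens: int        # Absolute token count
       percentage: float  # Percentage of total


   @dataclass
   class TokenUsageBreakdown:
       components: Dict[str, ComponentTokens]
       total_tokens: int
       percentage_used: float
       is_warning: bool


   class SketchMonitor:
       """Illustrative stand-in for TokenBudgetMonitor (internals assumed)."""

       def __init__(self, max_tokens: int = 128000,
                    warning_threshold: int = 102400):
           self.max_tokens = max_tokens
           self.warning_threshold = warning_threshold
           self._counts: Dict[str, int] = {}

       def count_tokens(self, text: str) -> int:
           # Naive whitespace count; the real monitor uses tiktoken here.
           return len(text.split())

       def track(self, component: str, text: str) -> int:
           tokens = self.count_tokens(text)
           # Repeated calls for the same component accumulate.
           self._counts[component] = self._counts.get(component, 0) + tokens
           return tokens

       def get_breakdown(self) -> TokenUsageBreakdown:
           total = sum(self._counts.values())
           components = {
               name: ComponentTokens(
                   tokens=n,
                   percentage=(100.0 * n / total) if total else 0.0,
               )
               for name, n in self._counts.items()
           }
           return TokenUsageBreakdown(
               components=components,
               total_tokens=total,
               percentage_used=100.0 * total / self.max_tokens,
               is_warning=total >= self.warning_threshold,
           )

       def get_top_components(self, n: int = 3) -> List[Tuple[str, int]]:
           # Sorted descending by token count, matching the documented API.
           return sorted(self._counts.items(),
                         key=lambda kv: kv[1], reverse=True)[:n]

       def reset(self) -> None:
           self._counts.clear()

The sketch mirrors the documented behavior: components accumulate across
repeated ``track()`` calls, ``get_breakdown()`` derives percentages and the
warning flag from the running totals, and ``reset()`` starts a fresh request.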