# Token Budget Monitoring

> **Note:** This feature is part of Ada v2.0's biomimetic foundation. It provides visibility into token usage patterns before implementing optimization strategies.
## Overview

The Token Budget Monitor tracks how many tokens are consumed by different parts of Ada's context window:

- **Persona** - Your AI persona description
- **FAQ** - Frequently asked questions and answers
- **Memories** - Retrieved memories from RAG search
- **Conversation History** - Recent messages
- **Specialist Results** - OCR output, web searches, etc.
- **System Notices** - Important system alerts
## Why Token Monitoring?

**The Problem:** Modern LLMs have large context windows (128k+ tokens), but loading too much context:

- Slows down responses (more tokens to process)
- Costs more (token-based pricing)
- Can overwhelm the model with irrelevant information
- Makes it harder to stay within limits

**The Solution:** Track what's using tokens so we can optimize strategically.
## Architecture

The monitoring system uses tiktoken (OpenAI's tokenizer) to count tokens exactly as language models see them.

```python
from brain.token_monitor import TokenBudgetMonitor

# Initialize monitor
monitor = TokenBudgetMonitor(
    max_tokens=128000,        # Context window size
    warning_threshold=102400  # Warn at 80%
)

# Track components
monitor.track("persona", persona_text)
monitor.track("memories", memory_text)
monitor.track("specialist_ocr", ocr_results)

# Get breakdown
breakdown = monitor.get_breakdown()
print(f"Total tokens: {breakdown.total_tokens}")
print(f"Percentage used: {breakdown.percentage_used:.1f}%")

# Human-readable summary
print(monitor.get_summary())
```
## Data Structures

### ComponentTokens

Tracks tokens for a single component:

```python
@dataclass
class ComponentTokens:
    tokens: int       # Absolute token count
    percentage: float # Percentage of total
```
### TokenUsageBreakdown

Complete breakdown of all token usage:

```python
@dataclass
class TokenUsageBreakdown:
    components: Dict[str, ComponentTokens]  # Per-component stats
    total_tokens: int                       # Sum of all components
    percentage_used: float                  # Percentage of budget
    is_warning: bool                        # True if over threshold
```
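To make the relationship between these fields concrete, here is a minimal, self-contained sketch of how such a breakdown could be assembled from raw per-component counts (the `build_breakdown` helper is hypothetical; the real logic lives in `brain.token_monitor`):

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ComponentTokens:
    tokens: int       # Absolute token count
    percentage: float # Percentage of total


@dataclass
class TokenUsageBreakdown:
    components: Dict[str, ComponentTokens]
    total_tokens: int
    percentage_used: float
    is_warning: bool


def build_breakdown(counts: Dict[str, int],
                    max_tokens: int = 128000,
                    warning_threshold: int = 102400) -> TokenUsageBreakdown:
    """Assemble a breakdown from raw per-component token counts."""
    total = sum(counts.values())
    components = {
        name: ComponentTokens(
            tokens=n,
            percentage=(100.0 * n / total) if total else 0.0,
        )
        for name, n in counts.items()
    }
    return TokenUsageBreakdown(
        components=components,
        total_tokens=total,
        percentage_used=100.0 * total / max_tokens,
        is_warning=total >= warning_threshold,
    )
```

Note the two different denominators: each component's `percentage` is relative to the total tracked so far, while `percentage_used` is relative to the overall budget.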
## API Reference

### TokenBudgetMonitor

- `class TokenBudgetMonitor(max_tokens=128000, warning_threshold=102400, model='gpt-4')`

  Main monitoring class.

  Parameters:
  - `max_tokens` – Maximum token budget (context window size)
  - `warning_threshold` – Token count at which to warn
  - `model` – Model name for tokenizer (affects counting)

- `count_tokens(text: str) → int`

  Count tokens in text.

  Parameters:
  - `text` – Text to count tokens for

  Returns: number of tokens.

- `track(component: str, text: str) → int`

  Track token usage for a component. Components accumulate tokens if tracked multiple times.

  Parameters:
  - `component` – Component name (e.g., "persona", "memories")
  - `text` – Text content to track

  Returns: number of tokens in this text.

- `get_breakdown() → TokenUsageBreakdown`

  Get a detailed breakdown of token usage.

  Returns: `TokenUsageBreakdown` with all stats.

- `get_summary() → str`

  Get a human-readable summary.

  Returns: formatted string with the breakdown.

- `get_top_components(n: int = 3) → List[Tuple[str, int]]`

  Get the top N components by token usage.

  Parameters:
  - `n` – Number of components to return

  Returns: list of `(name, tokens)` tuples, sorted descending.

- `reset()`

  Reset tracking for a new request. Call at the start of each request so each request is tracked independently.

- `log_breakdown()`

  Log a detailed breakdown at INFO level. Useful for debugging and monitoring.
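The accumulation behavior of `track()` can be illustrated with a toy tracker (a sketch only: the class name is invented, and a whitespace split stands in for the real tiktoken-based counting):

```python
from collections import defaultdict


class ToyTracker:
    """Demonstrates that tracking a component twice accumulates tokens."""

    def __init__(self):
        self.counts = defaultdict(int)

    def count_tokens(self, text: str) -> int:
        # Crude stand-in for a real tokenizer
        return len(text.split())

    def track(self, component: str, text: str) -> int:
        n = self.count_tokens(text)
        self.counts[component] += n  # accumulate rather than overwrite
        return n


tracker = ToyTracker()
tracker.track("memories", "alpha beta")
tracker.track("memories", "gamma")
# tracker.counts["memories"] is now 3, not 1
```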
## Usage Examples

### Basic Tracking

```python
from brain.token_monitor import TokenBudgetMonitor

monitor = TokenBudgetMonitor()

# Track different components
monitor.track("persona", "I am Ada, a friendly AI assistant...")
monitor.track("memories", "User prefers Python over JavaScript...")
monitor.track("history", "User: Hello! Ada: Hi there!")

# Get summary
print(monitor.get_summary())
# Output:
# Token Usage: 1,234 / 128,000 (0.96%)
#
# Breakdown by component:
#   history  : 567 tokens ( 45.9%)
#   memories : 456 tokens ( 37.0%)
#   persona  : 211 tokens ( 17.1%)
```
### Warning Detection

```python
monitor = TokenBudgetMonitor(
    max_tokens=1000,
    warning_threshold=800
)

# Add a lot of content
monitor.track("large_component", "..." * 500)

breakdown = monitor.get_breakdown()
if breakdown.is_warning:
    print(f"⚠️ Warning! Using {breakdown.total_tokens} tokens")
    print("Top components:")
    for name, tokens in monitor.get_top_components(3):
        print(f"  {name}: {tokens} tokens")
```
### Integration with Prompt Builder

```python
from brain.prompt_builder import PromptAssembler

# Create assembler (includes cache automatically)
assembler = PromptAssembler()

# Build prompt (cache handles optimization)
prompt = assembler.build_prompt(
    user_message="What's the weather?",
    conversation_id="conv-123"
)

# Check cache stats
stats = assembler.cache.get_stats()
print(f"Cache hit rate: {stats.hit_rate:.2%}")
```
### Per-Request Tracking

```python
monitor = TokenBudgetMonitor()

for request in requests:
    # Reset for new request
    monitor.reset()

    # Track this request
    monitor.track("persona", persona)
    monitor.track("context", request.context)

    # Process the request...

    # Log usage for this request
    monitor.log_breakdown()
```
## Token Counting Details

The monitor uses tiktoken to count tokens exactly as language models do.

> **Important:** Different models tokenize differently! The default uses GPT-4's tokenizer for consistency, but you can specify a different model:

```python
# Use a different tokenizer
monitor = TokenBudgetMonitor(model="gpt-3.5-turbo")
```

Common token counts (approximate):

- "Hello" = 1 token
- "Hello world" = 2 tokens
- "Hello, how are you?" = 6 tokens
- Average English word ≈ 1.3 tokens
- 100 words ≈ 130 tokens
- 1000 words ≈ 1300 tokens
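When tiktoken is not available, the word-based rule of thumb above can serve as a rough fallback (a sketch; real counts vary by model and text, and the helper name is invented):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~1.3 tokens per English word (heuristic only)."""
    return round(len(text.split()) * 1.3)


estimate_tokens("Hello")                   # 1 token, matching the table above
estimate_tokens(" ".join(["word"] * 100))  # roughly 130
```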
## Performance Considerations

Token counting is fast but not free:

- Negligible impact for normal usage (< 1 ms per component)
- Caching is recommended for large static content (persona, FAQ)
- Monitor overhead is far less than the LLM's own token processing
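For large static content, the counting cost can be amortized with memoization. A sketch using the standard library's `functools.lru_cache` (the `cached_count` helper is hypothetical, and the whitespace split stands in for a real tokenizer):

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def cached_count(text: str) -> int:
    """Count once per unique string; repeated calls are served from cache."""
    # A real implementation would call tiktoken here
    return len(text.split())


persona = "I am Ada, a friendly AI assistant..."
cached_count(persona)  # computed on first call
cached_count(persona)  # cache hit on second call
```

Because the persona and FAQ rarely change between requests, their counts only need to be computed once per process.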
The token monitor itself uses minimal memory:

- ~1 KB per component tracked
- Tracking 20 components ≈ 20 KB of overhead
- A negligible cost for the visibility gained
## Future Enhancements

Planned improvements for v2.0:

- **Automatic Optimization** - Suggest which components to cache or reduce
- **Historical Tracking** - Track token usage over time to identify trends
- **Budget Allocation** - Set per-component token budgets
- **Smart Truncation** - Automatically trim the least important components when over budget
- **Model-Specific Optimization** - Tune counting for different model families
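None of these enhancements exist yet; as a purely illustrative flavor of what per-component budget allocation might look like (function and parameter names are invented):

```python
def enforce_budgets(counts, budgets):
    """Clamp each component's token count to its allocated budget.

    Illustrative only: real smart truncation would trim the underlying
    text, not just the reported counts.
    """
    return {name: min(n, budgets.get(name, n)) for name, n in counts.items()}


enforce_budgets(
    {"memories": 5000, "persona": 300},
    {"memories": 2000},
)
# -> {"memories": 2000, "persona": 300}
```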
## Integration Status

| Component | Status | Notes |
|---|---|---|
| Token Monitor (Core) | ✅ Complete | All tests passing |
| Prompt Builder Integration | 🚧 Planned | Next step |
| Logging Integration | 🚧 Planned | Automatic logging |
| Dashboard/UI | 📋 Future | Visual token breakdown |
| API Endpoint | 📋 Future | |
## See Also

- Architecture - How token monitoring fits into Ada's architecture
- Memory - The RAG system that benefits from token optimization
- Streaming - How tokens flow through the system
- Phase 1 Roadmap - `.ai/explorations/planning/IMPLEMENTATION_ROADMAP.md`

> **Note:** Token monitoring is read-only - it observes but doesn't modify behavior. This provides a safe foundation for optimization features built on top.