Token Budget Monitoring

Note

This feature is part of Ada v2.0’s biomimetic foundation. It provides visibility into token usage patterns before implementing optimization strategies.

Overview

The Token Budget Monitor tracks how many tokens are consumed by different parts of Ada’s context window:

  • Persona - Your AI persona description

  • FAQ - Frequently asked questions and answers

  • Memories - Retrieved memories from RAG search

  • Conversation History - Recent messages

  • Specialist Results - OCR output, web searches, etc.

  • System Notices - Important system alerts

Why Token Monitoring?

The Problem: Modern LLMs have large context windows (128k+ tokens), but loading too much context:

  1. Slows down responses (more tokens to process)

  2. Costs more (token-based pricing)

  3. Can overwhelm the model with irrelevant information

  4. Makes it harder to stay within limits

The Solution: Track what’s using tokens so we can optimize strategically.

Architecture

The monitoring system uses tiktoken (OpenAI’s tokenizer) to accurately count tokens the way language models see them.

from brain.token_monitor import TokenBudgetMonitor

# Initialize monitor
monitor = TokenBudgetMonitor(
    max_tokens=128000,      # Context window size
    warning_threshold=102400 # Warn at 80%
)

# Track components
monitor.track("persona", persona_text)
monitor.track("memories", memory_text)
monitor.track("specialist_ocr", ocr_results)

# Get breakdown
breakdown = monitor.get_breakdown()
print(f"Total tokens: {breakdown.total_tokens}")
print(f"Percentage used: {breakdown.percentage_used:.1f}%")

# Human-readable summary
print(monitor.get_summary())

Data Structures

ComponentTokens

Tracks tokens for a single component:

@dataclass
class ComponentTokens:
    tokens: int          # Absolute token count
    percentage: float    # Percentage of total

TokenUsageBreakdown

Complete breakdown of all token usage:

@dataclass
class TokenUsageBreakdown:
    components: Dict[str, ComponentTokens]  # Per-component stats
    total_tokens: int                        # Sum of all components
    percentage_used: float                   # Percentage of budget
    is_warning: bool                         # True if over threshold
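
To illustrate how these structures fit together, here is a hypothetical helper that assembles a TokenUsageBreakdown from raw per-component counts. The dataclasses are repeated so the sketch runs standalone, and `build_breakdown` is an illustration only; the shipped implementation may differ:

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class ComponentTokens:
    tokens: int          # Absolute token count
    percentage: float    # Percentage of total

@dataclass
class TokenUsageBreakdown:
    components: Dict[str, ComponentTokens]
    total_tokens: int
    percentage_used: float
    is_warning: bool

def build_breakdown(counts: Dict[str, int],
                    max_tokens: int = 128_000,
                    warning_threshold: int = 102_400) -> TokenUsageBreakdown:
    """Assemble a breakdown from raw per-component token counts."""
    total = sum(counts.values())
    components = {
        name: ComponentTokens(
            tokens=n,
            # Each component's share of the total that was tracked
            percentage=(100.0 * n / total) if total else 0.0,
        )
        for name, n in counts.items()
    }
    return TokenUsageBreakdown(
        components=components,
        total_tokens=total,
        percentage_used=100.0 * total / max_tokens,
        is_warning=total >= warning_threshold,
    )
```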

API Reference

TokenBudgetMonitor

class TokenBudgetMonitor(max_tokens=128000, warning_threshold=102400, model='gpt-4')

Main monitoring class.

Parameters:
  • max_tokens – Maximum token budget (context window size)

  • warning_threshold – Token count at which to warn

  • model – Model name for tokenizer (affects counting)

count_tokens(text: str) → int

Count tokens in text.

Parameters:

text – Text to count tokens for

Returns:

Number of tokens

track(component: str, text: str) → int

Track token usage for a component.

Parameters:
  • component – Component name (e.g., “persona”, “memories”)

  • text – Text content to track

Returns:

Number of tokens in this text

Components accumulate tokens if tracked multiple times.
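
The accumulation rule can be sketched with a plain dict. This is a hypothetical stand-in: the real track() counts tokens from text, while this sketch takes pre-counted token numbers:

```python
from collections import defaultdict

class MiniTracker:
    """Minimal sketch of track()'s accumulation semantics."""

    def __init__(self) -> None:
        self.counts: dict = defaultdict(int)

    def track(self, component: str, tokens: int) -> int:
        # Repeated calls for the same component add up rather than overwrite.
        self.counts[component] += tokens
        return tokens

t = MiniTracker()
t.track("history", 120)
t.track("history", 80)  # second call accumulates onto the first
```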

get_breakdown() → TokenUsageBreakdown

Get detailed breakdown of token usage.

Returns:

TokenUsageBreakdown with all stats

get_summary() → str

Get human-readable summary.

Returns:

Formatted string with breakdown

get_top_components(n: int = 3) → List[Tuple[str, int]]

Get top N components by token usage.

Parameters:

n – Number of components to return

Returns:

List of (name, tokens) tuples, sorted descending
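
The sort-and-slice behavior can be reproduced in a few lines. This is a sketch assuming per-component counts live in a plain dict, not the class's actual internals:

```python
from typing import Dict, List, Tuple

def top_components(counts: Dict[str, int], n: int = 3) -> List[Tuple[str, int]]:
    """Return the n components with the highest token counts, largest first."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]

counts = {"persona": 211, "memories": 456, "history": 567, "faq": 90}
top = top_components(counts, n=3)
# history (567) comes first, then memories (456), then persona (211)
```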

reset()

Reset tracking for a new request.

Call at the start of each request so each request's usage is tracked independently.

log_breakdown()

Log detailed breakdown at INFO level.

Useful for debugging and monitoring.

Usage Examples

Basic Tracking

from brain.token_monitor import TokenBudgetMonitor

monitor = TokenBudgetMonitor()

# Track different components
monitor.track("persona", "I am Ada, a friendly AI assistant...")
monitor.track("memories", "User prefers Python over JavaScript...")
monitor.track("history", "User: Hello! Ada: Hi there!")

# Get summary
print(monitor.get_summary())

# Output:
# Token Usage: 1,234 / 128,000 (0.96%)
#
# Breakdown by component:
#   history              :    567 tokens ( 45.9%)
#   memories             :    456 tokens ( 37.0%)
#   persona              :    211 tokens ( 17.1%)

Warning Detection

monitor = TokenBudgetMonitor(
    max_tokens=1000,
    warning_threshold=800
)

# Add enough content to exceed the warning threshold
monitor.track("large_component", "sample text " * 500)

breakdown = monitor.get_breakdown()
if breakdown.is_warning:
    print(f"⚠️  Warning! Using {breakdown.total_tokens} tokens")
    print("Top components:")
    for name, tokens in monitor.get_top_components(3):
        print(f"  {name}: {tokens} tokens")

Integration with Prompt Builder

from brain.prompt_builder import PromptAssembler

# Create assembler (includes cache automatically)
assembler = PromptAssembler()

# Build prompt (cache handles optimization)
prompt = assembler.build_prompt(
    user_message="What's the weather?",
    conversation_id="conv-123"
)

# Check cache stats
stats = assembler.cache.get_stats()
print(f"Cache hit rate: {stats.hit_rate:.2%}")

Per-Request Tracking

monitor = TokenBudgetMonitor()

for request in requests:
    # Reset for new request
    monitor.reset()

    # Track this request
    monitor.track("persona", persona)
    monitor.track("context", request.context)

    # Process request...

    # Log usage for this request
    monitor.log_breakdown()

Token Counting Details

The monitor uses tiktoken to count tokens exactly as language models do.

Important

Different models tokenize differently! The default uses GPT-4’s tokenizer for consistency, but you can specify a different model:

# Use a different tokenizer
monitor = TokenBudgetMonitor(model="gpt-3.5-turbo")

Common token counts (approximate):

  • “Hello” = 1 token

  • “Hello world” = 2 tokens

  • “Hello, how are you?” = 6 tokens

  • Average English word ≈ 1.3 tokens

  • 100 words ≈ 130 tokens

  • 1000 words ≈ 1300 tokens
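
These rules of thumb give a quick pre-flight estimator. It encodes only the 1.3 tokens-per-word approximation above, so use count_tokens() when an exact number matters:

```python
def estimate_tokens(text: str, tokens_per_word: float = 1.3) -> int:
    """Rough token estimate from the word count (approximation only)."""
    return round(len(text.split()) * tokens_per_word)

# 100 words comes out at roughly 130 tokens, matching the rule of thumb
estimate_tokens("word " * 100)
```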

Performance Considerations

Token counting is fast but not free:

  • Negligible impact for normal usage (< 1ms per component)

  • Caching recommended for large static content (persona, FAQ)

  • Monitor overhead is far less than token processing by LLM

The token monitor itself uses minimal memory:

  • ~1KB per component tracked

  • Tracking 20 components = ~20KB overhead

  • Completely acceptable for the visibility gained
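
The caching recommendation for large static content can be sketched with functools.lru_cache. This wrapper is hypothetical (the real monitor may cache differently), and the word-count body is a stand-in for an actual tokenizer call:

```python
from functools import lru_cache

@lru_cache(maxsize=256)
def cached_count(text: str) -> int:
    # Stand-in for an expensive tokenizer call; swap in tiktoken for real use.
    return len(text.split())

persona = "I am Ada, a friendly AI assistant."
cached_count(persona)  # computed once
cached_count(persona)  # repeat call is served from the cache
```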

Future Enhancements

Planned improvements for v2.0:

  1. Automatic Optimization - Suggest which components to cache or reduce

  2. Historical Tracking - Track token usage over time to identify trends

  3. Budget Allocation - Set per-component token budgets

  4. Smart Truncation - Automatically trim least important components when over budget

  5. Model-Specific Optimization - Tune counting for different model families

Integration Status

Component                    Status        Notes
---------------------------  ------------  ----------------------
Token Monitor (Core)         ✅ Complete   All tests passing
Prompt Builder Integration   🚧 Planned    Next step
Logging Integration          🚧 Planned    Automatic logging
Dashboard/UI                 📋 Future     Visual token breakdown
API Endpoint                 📋 Future     GET /v1/tokens/usage

See Also

  • Architecture - How token monitoring fits in Ada’s architecture

  • Memory - RAG system that benefits from token optimization

  • Streaming - How tokens flow through the system

  • Phase 1 Roadmap - .ai/explorations/planning/IMPLEMENTATION_ROADMAP.md

Note

Token monitoring is read-only - it observes but doesn’t modify behavior. This provides a safe foundation for optimization features built on top.