Token Budget Monitoring

Note

This feature is part of Ada v2.0’s biomimetic foundation. It provides visibility into token usage patterns before implementing optimization strategies.

Overview

The Token Budget Monitor tracks how many tokens are consumed by different parts of Ada’s context window:

  • Persona - Your AI persona description

  • FAQ - Frequently asked questions and answers

  • Memories - Retrieved memories from RAG search

  • Conversation History - Recent messages

  • Specialist Results - OCR output, web searches, etc.

  • System Notices - Important system alerts

Why Token Monitoring?

The Problem: Modern LLMs have large context windows (128k+ tokens), but loading too much context:

  1. Slows down responses (more tokens to process)

  2. Costs more (token-based pricing)

  3. Can overwhelm the model with irrelevant information

  4. Makes it harder to stay within limits

The Solution: Track what’s using tokens so we can optimize strategically.

Architecture

The monitoring system uses tiktoken (OpenAI’s tokenizer) to accurately count tokens the way language models see them.

from brain.token_monitor import TokenBudgetMonitor

# Initialize monitor
monitor = TokenBudgetMonitor(
    max_tokens=128000,      # Context window size
    warning_threshold=102400 # Warn at 80%
)

# Track components
monitor.track("persona", persona_text)
monitor.track("memories", memory_text)
monitor.track("specialist_ocr", ocr_results)

# Get breakdown
breakdown = monitor.get_breakdown()
print(f"Total tokens: {breakdown.total_tokens}")
print(f"Percentage used: {breakdown.percentage_used:.1f}%")

# Human-readable summary
print(monitor.get_summary())

Data Structures

ComponentTokens

Tracks tokens for a single component:

@dataclass
class ComponentTokens:
    tokens: int          # Absolute token count
    percentage: float    # Percentage of total

TokenUsageBreakdown

Complete breakdown of all token usage:

@dataclass
class TokenUsageBreakdown:
    components: Dict[str, ComponentTokens]  # Per-component stats
    total_tokens: int                        # Sum of all components
    percentage_used: float                   # Percentage of budget
    is_warning: bool                         # True if over threshold
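
To illustrate how these structures fit together, here is a hypothetical helper that assembles a TokenUsageBreakdown from raw per-component counts. The dataclasses are repeated so the sketch runs standalone, and `build_breakdown` is an illustration only; the shipped implementation may differ:

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class ComponentTokens:
    tokens: int          # Absolute token count
    percentage: float    # Percentage of total

@dataclass
class TokenUsageBreakdown:
    components: Dict[str, ComponentTokens]
    total_tokens: int
    percentage_used: float
    is_warning: bool

def build_breakdown(counts: Dict[str, int],
                    max_tokens: int = 128_000,
                    warning_threshold: int = 102_400) -> TokenUsageBreakdown:
    """Assemble a breakdown from raw per-component token counts."""
    total = sum(counts.values())
    components = {
        name: ComponentTokens(
            tokens=n,
            # Each component's share of the total that was tracked
            percentage=(100.0 * n / total) if total else 0.0,
        )
        for name, n in counts.items()
    }
    return TokenUsageBreakdown(
        components=components,
        total_tokens=total,
        percentage_used=100.0 * total / max_tokens,
        is_warning=total >= warning_threshold,
    )
```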

API Reference

TokenBudgetMonitor

class TokenBudgetMonitor(max_tokens=128000, warning_threshold=102400, model='gpt-4')

Main monitoring class.

Parameters:
  • max_tokens – Maximum token budget (context window size)

  • warning_threshold – Token count at which to warn

  • model – Model name for tokenizer (affects counting)

count_tokens(text: str) → int

Count tokens in text.

Parameters:

text – Text to count tokens for

Returns:

Number of tokens

track(component: str, text: str) → int

Track token usage for a component.

Parameters:
  • component – Component name (e.g., “persona”, “memories”)

  • text – Text content to track

Returns:

Number of tokens in this text

Components accumulate tokens if tracked multiple times.
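
The accumulation rule can be sketched with a plain dict. This is a hypothetical stand-in: the real track() counts tokens from text, while this sketch takes pre-counted token numbers:

```python
from collections import defaultdict

class MiniTracker:
    """Minimal sketch of track()'s accumulation semantics."""

    def __init__(self) -> None:
        self.counts: dict = defaultdict(int)

    def track(self, component: str, tokens: int) -> int:
        # Repeated calls for the same component add up rather than overwrite.
        self.counts[component] += tokens
        return tokens

t = MiniTracker()
t.track("history", 120)
t.track("history", 80)  # second call accumulates onto the first
```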

get_breakdown() → TokenUsageBreakdown

Get detailed breakdown of token usage.

Returns:

TokenUsageBreakdown with all stats

get_summary() → str

Get human-readable summary.

Returns:

Formatted string with breakdown

get_top_components(n: int = 3) → List[Tuple[str, int]]

Get top N components by token usage.

Parameters:

n – Number of components to return

Returns:

List of (name, tokens) tuples, sorted descending
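
The sort-and-slice behavior can be reproduced in a few lines. This is a sketch assuming per-component counts live in a plain dict, not the class's actual internals:

```python
from typing import Dict, List, Tuple

def top_components(counts: Dict[str, int], n: int = 3) -> List[Tuple[str, int]]:
    """Return the n components with the highest token counts, largest first."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]

counts = {"persona": 211, "memories": 456, "history": 567, "faq": 90}
top = top_components(counts, n=3)
# history (567) comes first, then memories (456), then persona (211)
```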

reset()

Reset tracking for a new request.

Call at the start of each request so each request's usage is tracked independently.

log_breakdown()

Log detailed breakdown at INFO level.

Useful for debugging and monitoring.

Usage Examples

Basic Tracking

from brain.token_monitor import TokenBudgetMonitor

monitor = TokenBudgetMonitor()

# Track different components
monitor.track("persona", "I am Ada, a friendly AI assistant...")
monitor.track("memories", "User prefers Python over JavaScript...")
monitor.track("history", "User: Hello! Ada: Hi there!")

# Get summary
print(monitor.get_summary())

# Output:
# Token Usage: 1,234 / 128,000 (0.96%)
#
# Breakdown by component:
#   history              :    567 tokens ( 45.9%)
#   memories             :    456 tokens ( 37.0%)
#   persona              :    211 tokens ( 17.1%)

Warning Detection

monitor = TokenBudgetMonitor(
    max_tokens=1000,
    warning_threshold=800
)

# Add enough content to exceed the warning threshold
monitor.track("large_component", "sample text " * 500)

breakdown = monitor.get_breakdown()
if breakdown.is_warning:
    print(f"⚠️  Warning! Using {breakdown.total_tokens} tokens")
    print("Top components:")
    for name, tokens in monitor.get_top_components(3):
        print(f"  {name}: {tokens} tokens")

Integration with Prompt Builder

from brain.prompt_builder import PromptAssembler

# Create assembler (includes cache automatically)
assembler = PromptAssembler()

# Build prompt (cache handles optimization)
prompt = assembler.build_prompt(
    user_message="What's the weather?",
    conversation_id="conv-123"
)

# Check cache stats
stats = assembler.cache.get_stats()
print(f"Cache hit rate: {stats.hit_rate:.2%}")

Per-Request Tracking

monitor = TokenBudgetMonitor()

for request in requests:
    # Reset for new request
    monitor.reset()

    # Track this request
    monitor.track("persona", persona)
    monitor.track("context", request.context)

    # Process request...

    # Log usage for this request
    monitor.log_breakdown()

Token Counting Details

The monitor uses tiktoken to count tokens exactly as language models do.

Important

Different models tokenize differently! The default uses GPT-4’s tokenizer for consistency, but you can specify a different model:

# Use a different tokenizer
monitor = TokenBudgetMonitor(model="gpt-3.5-turbo")

Common token counts (approximate):

  • “Hello” = 1 token

  • “Hello world” = 2 tokens

  • “Hello, how are you?” = 6 tokens

  • Average English word ≈ 1.3 tokens

  • 100 words ≈ 130 tokens

  • 1000 words ≈ 1300 tokens
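
These rules of thumb give a quick pre-flight estimator. It encodes only the 1.3 tokens-per-word approximation above, so use count_tokens() when an exact number matters:

```python
def estimate_tokens(text: str, tokens_per_word: float = 1.3) -> int:
    """Rough token estimate from the word count (approximation only)."""
    return round(len(text.split()) * tokens_per_word)

# 100 words comes out at roughly 130 tokens, matching the rule of thumb
estimate_tokens("word " * 100)
```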

Performance Considerations

Token counting is fast but not free:

  • Negligible impact for normal usage (< 1ms per component)

  • Caching recommended for large static content (persona, FAQ)

  • Monitor overhead is far less than token processing by LLM

The token monitor itself uses minimal memory:

  • ~1KB per component tracked

  • Tracking 20 components = ~20KB overhead

  • Completely acceptable for the visibility gained
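
The caching recommendation for large static content can be sketched with functools.lru_cache. This wrapper is hypothetical (the real monitor may cache differently), and the word-count body is a stand-in for an actual tokenizer call:

```python
from functools import lru_cache

@lru_cache(maxsize=256)
def cached_count(text: str) -> int:
    # Stand-in for an expensive tokenizer call; swap in tiktoken for real use.
    return len(text.split())

persona = "I am Ada, a friendly AI assistant."
cached_count(persona)  # computed once
cached_count(persona)  # repeat call is served from the cache
```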

Future Enhancements

Planned improvements for v2.0:

  1. Automatic Optimization - Suggest which components to cache or reduce

  2. Historical Tracking - Track token usage over time to identify trends

  3. Budget Allocation - Set per-component token budgets

  4. Smart Truncation - Automatically trim least important components when over budget

  5. Model-Specific Optimization - Tune counting for different model families

Integration Status

Component                    Status        Notes
---------------------------  ------------  ----------------------
Token Monitor (Core)         ✅ Complete   All tests passing
Prompt Builder Integration   🚧 Planned    Next step
Logging Integration          🚧 Planned    Automatic logging
Dashboard/UI                 📋 Future     Visual token breakdown
API Endpoint                 📋 Future     GET /v1/tokens/usage

See Also

  • Architecture - How token monitoring fits in Ada’s architecture

  • Memory - RAG system that benefits from token optimization

  • Streaming - How tokens flow through the system

  • Phase 1 Roadmap - .ai/explorations/planning/IMPLEMENTATION_ROADMAP.md

Note

Token monitoring is read-only - it observes but doesn’t modify behavior. This provides a safe foundation for optimization features built on top.