Research Paper · September 2025 · CKX 004

ContextOS

A Hierarchical Architecture for Managing Context in Large Language Models

Ahmed Hafdi  ·  System Design Proposal  ·  September 2025
Abstract

This paper proposes ContextOS, a practical system architecture for managing context in LLM applications. Current approaches treat context retrieval as a one-size-fits-all operation, so every lookup costs roughly 100–200ms regardless of how recently the same context was used. I propose a three-tier caching system inspired by how computer processors manage memory: keep recent context fast (L1), compress frequently used context (L2), and store everything else efficiently (L3). The architecture introduces context capsules: containers that store information in multiple formats and automatically choose the right format based on recency and importance. Expected reduction: average retrieval latency from 150ms to 30–50ms once access patterns have been learned (§ 6).

§ 1

The Context Problem

Large Language Models are incredibly powerful, but they have a fundamental limitation: they can only pay attention to what you put in their context window. Current systems handle this poorly — every retrieval takes about 150ms regardless of whether you're asking a follow-up question from 30 seconds ago or searching millions of documents for the first time.

The core insight is that context access follows the same patterns as CPU memory access:

Temporal Locality
If you accessed something recently, you'll likely access it again soon. A follow-up question about transformers should find transformer context in under a millisecond, not 150ms.
Spatial Locality
If you accessed something, you'll likely access related things. Accessing attention mechanisms implies likely interest in positional encodings and multi-head attention.
Frequency Patterns
Some information gets accessed repeatedly across sessions. Frequently accessed topics should live permanently in faster tiers, self-organizing based on usage.
Cache Hierarchies
Computer architects solved this decades ago with tiered caches. ContextOS applies the same principle to LLM context, with target latencies of <1ms for L1, <10ms for L2, and <100ms for L3.

The Concrete Scenario

User: "How do transformers handle long-range dependencies?"
System: Searches entire corpus → 150ms → answer.
User (30 seconds later): "What about computational complexity?"
Current: Searches entire corpus again → 150ms (wasteful).
ContextOS: L1 cache hit → <1ms → answer.

§ 2

Three-Tier Cache Architecture

ContextOS is an operating system for context. Just like an OS manages how programs access memory and storage, ContextOS manages how LLMs access context across three tiers with dramatically different speed–capacity trade-offs.

Figure: three-tier cache hierarchy. L1 Hot Context (current session), L2 Warm Context (compressed summaries), L3 Cold Storage (full corpus search).
Tier | Storage                   | Latency | Capacity             | Format
L1   | Server RAM                | <1ms    | 2,000–8,000 tokens   | Full raw text
L2   | Redis cluster             | <10ms   | 32,000–64,000 tokens | Compressed summaries
L3   | VectorDB + Graph + Object | <100ms  | Unlimited            | All formats
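
The fall-through behaviour of the three tiers can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Tier` class, the `lookup` function, and the use of plain dictionaries in place of RAM, Redis, and the L3 backends are not part of the proposal itself.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Tier:
    """One cache tier; `store` is a plain dict standing in for RAM, Redis, or L3."""
    name: str
    store: dict[str, str] = field(default_factory=dict)

    def get(self, key: str) -> Optional[str]:
        return self.store.get(key)


def lookup(key: str, tiers: list[Tier]) -> tuple[Optional[str], Optional[str]]:
    """Search tiers in order (L1 -> L2 -> L3); return (value, name of the tier that hit)."""
    for tier in tiers:
        value = tier.get(key)
        if value is not None:
            return value, tier.name
    return None, None  # miss in every tier


l1, l2, l3 = Tier("L1"), Tier("L2"), Tier("L3")
l3.store["transformers"] = "full text on long-range dependencies"
tiers = [l1, l2, l3]

first = lookup("transformers", tiers)   # cold query: falls through to L3
l1.store["transformers"] = first[0]     # promotion done by hand here for illustration
second = lookup("transformers", tiers)  # follow-up: served from L1
```

In this toy version a key is an exact string; a real scheduler would match on parsed entities or embeddings rather than literal keys.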

Six Components

1. User Interface Layer
Captures query text, timestamp, session ID, and conversation history. The timestamp and session ID determine whether a query is a follow-up or a fresh topic.
2. Context Scheduler
The traffic controller. Parses entities, topics, intent, and complexity. Searches tiers in order (L1 → L2 → L3). Ranks results by a weighted score: semantic similarity (40%), recency (20%), frequency (20%), trust (10%), entity overlap (10%).
3. Three-Tier Cache
L1 (hot, RAM), L2 (warm, Redis), L3 (cold, multi-backend). Each tier is searched in order; misses fall through to the next tier.
4. Context Capsule Engine
Assembles ranked capsules into a structured prompt. Dynamically selects a format (raw/summary/facts) based on token budget and priority. Allocates the token budget as 30% to L1, 20% to L2, 40% to L3, and 10% to facts.
5. LLM Inference Engine
Receives the structured context prompt and generates the response. Richer, more relevant context is expected to produce higher-quality answers.
6. Feedback Loop
After each response: updates capsule access counts, records co-access patterns, promotes frequently used capsules to faster tiers, and creates new capsules from Q&A pairs.
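
The scheduler's ranking weights and the capsule engine's budget split can be written down directly. The sketch below uses the percentages stated above; the names `Signals`, `rank_score`, `rank`, and `split_budget` are hypothetical, and it assumes each signal has already been normalized to the range [0, 1], which the proposal does not specify.

```python
from dataclasses import dataclass

# Ranking weights as stated for the Context Scheduler; they sum to 1.0.
WEIGHTS = {
    "similarity": 0.40,
    "recency": 0.20,
    "frequency": 0.20,
    "trust": 0.10,
    "entity_overlap": 0.10,
}

# Token budget split as stated for the Capsule Engine, in percent.
BUDGET_PCT = {"L1": 30, "L2": 20, "L3": 40, "facts": 10}


@dataclass
class Signals:
    """Per-capsule signals, each assumed to be normalized to [0, 1]."""
    similarity: float
    recency: float
    frequency: float
    trust: float
    entity_overlap: float


def rank_score(s: Signals) -> float:
    """Weighted sum of the five signals."""
    return sum(weight * getattr(s, name) for name, weight in WEIGHTS.items())


def rank(candidates: dict[str, Signals]) -> list[str]:
    """Capsule ids ordered from highest to lowest score."""
    return sorted(candidates, key=lambda cid: rank_score(candidates[cid]), reverse=True)


def split_budget(total_tokens: int) -> dict[str, int]:
    """Integer token allowance per source, using whole-percent arithmetic."""
    return {part: total_tokens * pct // 100 for part, pct in BUDGET_PCT.items()}


candidates = {
    "old-but-similar": Signals(0.9, 0.2, 0.1, 0.5, 0.5),
    "recent-follow-up": Signals(0.6, 1.0, 0.8, 0.5, 0.5),
}
order = rank(candidates)
budget = split_budget(4000)
```

Under these weights a recent, frequently used capsule (score about 0.70) outranks a more similar but stale one (about 0.52), which is the behaviour the recency and frequency terms are meant to produce.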

§ 3

Context Capsules

A capsule is a self-contained information container that stores the same content in multiple formats simultaneously — the fundamental unit of information in ContextOS. The system automatically selects the right format based on priority, token budget, and query type.

Raw Text
Full fidelity, zero information loss. Used for high-priority L1 context where detail matters.
~2,000 tokens · 100% fidelity
Summary
10–30% of original size. Used for background context in L2 where token efficiency matters.
~200 tokens · 90% token saving
Facts
Ultra-compressed atomic claims. Used for fact-checking queries and filling remaining token budget.
~30 tokens · atomic
Embedding
768-dimensional vector for semantic similarity search. Enables fast retrieval across millions of capsules.
768 floats · no tokens
Structured
Tables, code, citations in parseable format. Used when the query involves comparisons or references.
Variable · machine-readable
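
One way to picture format selection is a capsule that holds several renderings and a function that picks the richest one fitting the remaining token budget. This is a sketch under assumptions: the `Capsule` fields, the `select_format` rule, and the whitespace-based `count_tokens` are illustrative stand-ins, and the embedding and structured formats are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Capsule:
    """The same content held in three of the formats described above."""
    raw: str
    summary: str
    facts: list[str]


def count_tokens(text: str) -> int:
    """Crude whitespace count; a real system would use the model's tokenizer."""
    return len(text.split())


def select_format(capsule: Capsule, budget: int, high_priority: bool) -> tuple[str, str]:
    """Return (format name, text) for the richest format that fits the budget."""
    options = [
        ("raw", capsule.raw),
        ("summary", capsule.summary),
        ("facts", " ".join(capsule.facts)),
    ]
    if not high_priority:
        options = options[1:]  # background context never uses raw text
    for name, text in options:
        if count_tokens(text) <= budget:
            return name, text
    return "none", ""  # nothing fits; the capsule is skipped


capsule = Capsule(
    raw=" ".join(["token"] * 2000),
    summary=" ".join(["token"] * 200),
    facts=["Transformers use self-attention."],
)
```

For example, `select_format(capsule, 500, True)` falls back to the summary because the 2,000-token raw text exceeds the budget.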

Capsule Lifecycle

Capsules evolve through three phases, automatically migrating between tiers based on usage patterns:

① Creation
New content arrives → all formats generated simultaneously → capsule starts in L3 cold storage. The exception is capsules created from Q&A pairs, which start in L1 (hot).
② Growth
Gets accessed → builds usage statistics → promoted to faster tiers. More than 10 accesses/hour → L2. More than 50 accesses/10min → L1. Co-access patterns are recorded.
③ Decay
No recent access → relevance score drops → demoted to slower tiers. TTL expiry and source-document updates trigger invalidation.
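
The promotion and decay rules can be sketched with a sliding window of access timestamps. The thresholds come from the text; the `AccessTracker` class and its method names are hypothetical, and treating decay as "the window counts fall back below the thresholds" is one possible reading of the relevance-score drop, not the proposal's stated mechanism.

```python
from collections import deque

HOUR = 3600.0      # seconds
TEN_MIN = 600.0    # seconds


class AccessTracker:
    """Keeps the last hour of access timestamps for one capsule."""

    def __init__(self) -> None:
        self.times: deque[float] = deque()

    def record(self, now: float) -> None:
        self.times.append(now)
        # Drop timestamps older than one hour; they cannot affect either rule.
        while self.times and now - self.times[0] > HOUR:
            self.times.popleft()

    def count_within(self, window: float, now: float) -> int:
        return sum(1 for t in self.times if now - t <= window)

    def target_tier(self, now: float) -> str:
        if self.count_within(TEN_MIN, now) > 50:
            return "L1"
        if self.count_within(HOUR, now) > 10:
            return "L2"
        return "L3"  # too few recent accesses: stays in or decays to cold storage


tracker = AccessTracker()
for i in range(11):                 # 11 accesses spread over 50 minutes
    tracker.record(i * 300.0)
warm = tracker.target_tier(3000.0)

for i in range(51):                 # burst of 51 accesses in under a minute
    tracker.record(3001.0 + i)
hot = tracker.target_tier(3051.0)

cold = tracker.target_tier(3051.0 + 2 * HOUR)  # two idle hours later
```

Here `warm`, `hot`, and `cold` come out as L2, L1, and L3 respectively, tracing one capsule through growth and decay.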

§ 4

L3 Storage Backends

L3 isn't a single database — it's three specialized storage systems working in concert. Each backend excels at a different class of retrieval query.

VectorDB
Pinecone · Weaviate · Milvus
Converts query to 768-dim embedding, returns top-k semantically similar capsules. Best for "find me information about X" queries.
GraphDB
Neo4j · Amazon Neptune
Traverses knowledge graph along edges like compares-to, part-of, developed-by. Best for "how does X relate to Y?" queries.
ObjectStore
S3-compatible storage
Stores and retrieves full documents — PDFs, text, code. Provides the source of truth when raw original content is needed.

The three backends are complementary: vector search finds similar content, graph traversal reveals non-obvious connections, and object storage provides full-fidelity retrieval. Combining all three is expected to give broader coverage than any single backend.
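
A minimal sketch of combining the three backends is a fan-out followed by a merge. The `Backend` protocol, the `StaticBackend` stand-in, and `search_l3` are hypothetical names; no real Pinecone, Neo4j, or S3 client API is used here. The merge also assumes that scores from different backends are comparable on a [0, 1] scale, which would need calibration in practice.

```python
from typing import Protocol


class Backend(Protocol):
    def search(self, query: str, k: int) -> list[tuple[str, float]]:
        """Return up to k (capsule_id, score) pairs."""
        ...


class StaticBackend:
    """Stand-in for a real vector, graph, or object store client."""

    def __init__(self, hits: list[tuple[str, float]]) -> None:
        self.hits = hits

    def search(self, query: str, k: int) -> list[tuple[str, float]]:
        return self.hits[:k]


def search_l3(query: str, backends: list[Backend], k: int = 5) -> list[tuple[str, float]]:
    """Query every backend, keep the best score per capsule, return the top k."""
    best: dict[str, float] = {}
    for backend in backends:
        for capsule_id, score in backend.search(query, k):
            best[capsule_id] = max(score, best.get(capsule_id, 0.0))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:k]


vector_db = StaticBackend([("attention", 0.9), ("rnn", 0.4)])
graph_db = StaticBackend([("positional-encoding", 0.7), ("attention", 0.6)])
object_store = StaticBackend([("attention-paper.pdf", 0.5)])

results = search_l3("how does attention work?", [vector_db, graph_db, object_store])
```

The capsule found by both vector and graph search appears once, with its higher score, and the graph-only result surfaces a related concept that similarity search alone missed.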

§ 5

The Learning Loop

After every response, ContextOS updates its record of access patterns. The system is designed to improve with use: the 10th query in a domain should be much faster than the 1st, because its context is more likely to already sit in L1 or L2.

Capsule Statistics
access_count += 1
last_accessed = now()
utility_score = f(feedback)
Co-Access Patterns
Capsules used together get linked. Next time capsule A is accessed, capsule B is pre-fetched automatically.
Cache Promotion
>10 accesses/hour → L2
>50 accesses/10min → L1
New Capsule Creation
Every Q&A pair becomes a new capsule starting in L1 — immediately hot. The conversation itself becomes context for future questions.
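
The statistics update and co-access linking can be sketched as follows. The `FeedbackLoop` class, its method names, and the `min_count` threshold for pre-fetching are assumptions for illustration; the `utility_score` update is omitted because the proposal does not define the feedback function.

```python
import time
from collections import Counter, defaultdict
from dataclasses import dataclass
from itertools import combinations
from typing import Optional


@dataclass
class CapsuleStats:
    access_count: int = 0
    last_accessed: float = 0.0


class FeedbackLoop:
    def __init__(self) -> None:
        self.stats: defaultdict[str, CapsuleStats] = defaultdict(CapsuleStats)
        self.co_access: defaultdict[str, Counter] = defaultdict(Counter)

    def record_response(self, used_ids: list[str], now: Optional[float] = None) -> None:
        """Update per-capsule stats and link every pair of capsules used together."""
        now = time.time() if now is None else now
        for cid in used_ids:
            self.stats[cid].access_count += 1
            self.stats[cid].last_accessed = now
        for a, b in combinations(sorted(set(used_ids)), 2):
            self.co_access[a][b] += 1
            self.co_access[b][a] += 1

    def prefetch_candidates(self, capsule_id: str, min_count: int = 2) -> list[str]:
        """Capsules co-accessed with `capsule_id` at least `min_count` times."""
        linked = self.co_access[capsule_id].most_common()
        return [cid for cid, n in linked if n >= min_count]


loop = FeedbackLoop()
loop.record_response(["attention", "positional-encoding"], now=1.0)
loop.record_response(["attention", "positional-encoding"], now=2.0)
loop.record_response(["attention", "rnn"], now=3.0)
to_prefetch = loop.prefetch_candidates("attention")
```

After three responses, only the capsule that was co-accessed twice qualifies for pre-fetching; the one-off pairing is ignored until it recurs.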

What the System Learns

1. Your Interests: If you frequently ask about machine learning, ML capsules stay permanently in L1 without manual configuration.
2. Topic Clusters: Which topics are related and queried together. Transformers → attention mechanisms → positional encoding becomes a cached cluster.
3. Query Patterns: Common follow-up questions get pre-fetched into L1 before you even ask them.
4. Optimal Formats: Which capsule format (raw/summary/facts) works best for each query type, learned from response quality feedback.

§ 6

Expected Performance

Based on the architecture design, latency improvements are expected to grow as the system learns access patterns over the first 50–100 queries:

Figure: expected retrieval latency by query type, Traditional RAG vs. ContextOS.
Query Type               | Traditional RAG | ContextOS | Improvement
First-time (cold)        | 150ms           | 150ms     | none
Follow-up (warm)         | 150ms           | 50–80ms   | 47–67%
Frequent (hot)           | 150ms           | 10–20ms   | 87–93%
Average (after learning) | 150ms           | 30–50ms   | 67–80%

Expected Cache Hit Rates

After 50–100 queries in a domain, the system self-organizes to keep most context in fast tiers:

L1 cache hit rate: 40–60% (served in under 1ms from RAM)
L2 cache hit rate: 20–30% (served in under 10ms from Redis)
Combined L1+L2 hit rate: 70–90% (only 10–30% of queries reach L3)
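
The link between hit rates and average latency is a weighted sum, which a short calculation makes explicit. This is arithmetic on the stated targets, not a measurement. The function name `expected_latency_ms` is hypothetical, and the sketch assumes each hit costs its tier's upper-bound latency (1ms, 10ms) and each L3 access costs the 150ms cold-query figure.

```python
def expected_latency_ms(
    l1_hit: float,
    l2_hit: float,
    l1_ms: float = 1.0,
    l2_ms: float = 10.0,
    l3_ms: float = 150.0,
) -> float:
    """Average latency when every query is served by exactly one tier."""
    l3_share = 1.0 - l1_hit - l2_hit
    if l3_share < 0:
        raise ValueError("hit rates must sum to at most 1.0")
    return l1_hit * l1_ms + l2_hit * l2_ms + l3_share * l3_ms


# Midpoints of the expected ranges: 50% L1, 25% L2, 25% falling through to L3.
midpoint = expected_latency_ms(0.50, 0.25)

# Low end of the combined range: 70% served by L1+L2.
low_end = expected_latency_ms(0.45, 0.25)
```

At the midpoint hit rates this gives 40.5ms, in line with the 30–50ms average above, and it shows that the L3 share dominates: each additional 10% of queries that reach L3 adds roughly 15ms to the average.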

§ 7

Limitations & Open Questions

1. Cold Start Problem: First-time users see no benefit until the system learns their patterns. Mitigation: pre-populate caches from similar users or domain defaults.
2. Optimal Cache Sizes: What is the sweet spot for L1/L2 capacity? This likely varies by application domain and user behavior patterns.
3. Cache Consistency: If source documents update, cached summaries become stale. TTL policies and dependency tracking are required for cache invalidation.
4. Compression Strategies: Should L2 use extractive summarization, abstractive summarization, or learned query-aware compression? Each has different trade-offs.
5. Multi-User Caching: How to share L2/L3 across users while preserving L1 privacy? Federated ContextOS across organizations is a long-term research direction.
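
For the cache consistency question, one possible shape of the TTL plus dependency-tracking check is sketched below. It is an assumption-laden illustration rather than a proposed design: `CachedSummary`, `fingerprint`, and `is_fresh` are hypothetical names, and hashing the full source text is only one way to detect that a document has changed.

```python
import hashlib
from dataclasses import dataclass


def fingerprint(text: str) -> str:
    """Content hash used to detect that a source document has changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class CachedSummary:
    summary: str
    source_id: str
    source_fingerprint: str  # hash of the source at summarization time
    created_at: float        # seconds
    ttl_seconds: float


def is_fresh(entry: CachedSummary, current_sources: dict[str, str], now: float) -> bool:
    """Fresh only if the TTL has not expired and the source is unchanged."""
    if now - entry.created_at > entry.ttl_seconds:
        return False
    source = current_sources.get(entry.source_id)
    return source is not None and fingerprint(source) == entry.source_fingerprint


sources = {"doc-1": "Transformers use self-attention."}
entry = CachedSummary(
    summary="Self-attention is the core of transformers.",
    source_id="doc-1",
    source_fingerprint=fingerprint(sources["doc-1"]),
    created_at=0.0,
    ttl_seconds=3600.0,
)

fresh_now = is_fresh(entry, sources, now=60.0)
expired = is_fresh(entry, sources, now=7200.0)

sources["doc-1"] = "Transformers use self-attention and positional encodings."
stale_after_edit = is_fresh(entry, sources, now=60.0)
```

The two checks cover different failure modes: the TTL bounds how long an unnoticed change can persist, while the fingerprint catches an edit as soon as the source is re-read.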