Research Paper · September 2025 · CKX 004

ContextOS

A Hierarchical Architecture for Managing Context in Large Language Models

Ahmed Hafdi · System Design Proposal · September 2025

Abstract

This paper proposes ContextOS, a practical system architecture for managing context in LLM applications. Current approaches treat context retrieval as a one-size-fits-all operation, so every query pays roughly 100–200ms of latency regardless of how recently the same context was used. I propose a three-tier caching system inspired by how computer processors manage memory: keep recent context fast (L1), compress frequently used context (L2), and store everything else efficiently (L3). The architecture introduces context capsules: containers that store information in multiple formats and automatically choose the right format based on recency and importance. Expected reduction: average retrieval latency from 150ms to 30–50ms once the system has learned access patterns.

§ 1 The Context Problem

Large Language Models are powerful, but they have a fundamental limitation: they can only attend to what is placed in their context window. Current systems handle this poorly: every retrieval takes about 150ms, whether you are asking a follow-up to a question from 30 seconds ago or searching millions of documents for the first time.

The core insight is that context access follows the same patterns as CPU memory access:

Temporal Locality

If you accessed something recently, you'll likely access it again soon. A follow-up question about transformers should find transformer context in under 1ms, not 150ms.

Spatial Locality

If you accessed something, you'll likely access related things. Accessing attention mechanisms implies likely interest in positional encodings and multi-head attention.

Frequency Patterns

Some information gets accessed repeatedly across sessions. Frequently accessed topics should live permanently in faster tiers, self-organizing based on usage.

Cache Hierarchies

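Computer architects solved this decades ago with cache hierarchies: small, fast memory for hot data backed by larger, slower storage. ContextOS applies the same principle to LLM context, with target latencies of <1ms for L1, <10ms for L2, and <100ms for L3.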

The Concrete Scenario

User: "How do transformers handle long-range dependencies?"
System: Searches entire corpus → 150ms → answer.

User: "What about computational complexity?" (30 seconds later)
Current:  Search entire corpus again → 150ms (wasteful)
ContextOS: L1 cache hit → <1ms → answer

§ 2 Three-Tier Cache Architecture

ContextOS is an operating system for context. Just like an OS manages how programs access memory and storage, ContextOS manages how LLMs access context across three tiers with dramatically different speed–capacity trade-offs.

Figure: Cache hierarchy. Hot Context (L1, in-memory RAM, current session), Warm Context (L2, Redis cluster, compressed summaries), Cold Storage (L3, VectorDB + GraphDB + ObjectStore, full corpus search).

Tier	Storage	Latency	Capacity	Format
L1	Server RAM	<1ms	2,000–8,000 tokens	Full raw text
L2	Redis cluster	<10ms	32,000–64,000 tokens	Compressed summaries
L3	VectorDB + Graph + Object	<100ms	Unlimited	All formats
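The fall-through behavior implied by the table can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the proposed implementation: `Tier` and `lookup` are hypothetical names, plain dictionaries stand in for RAM, Redis, and the L3 backends, and exact-key lookup stands in for the semantic matching a real scheduler would need.

```python
from typing import Optional


class Tier:
    """One cache tier: a name plus a key -> capsule-text mapping (stand-in store)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.store: dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self.store.get(key)


def lookup(key: str, tiers: list[Tier]) -> tuple[Optional[str], Optional[str]]:
    """Search tiers in order and return (tier_name, value) for the first hit."""
    for tier in tiers:
        value = tier.get(key)
        if value is not None:
            return tier.name, value
    return None, None  # miss in every tier


# Hypothetical contents, for illustration only.
l1, l2, l3 = Tier("L1"), Tier("L2"), Tier("L3")
l2.store["attention"] = "compressed summary"
l3.store["transformers"] = "full document text"
tiers = [l1, l2, l3]

print(lookup("attention", tiers))     # found in L2 after an L1 miss
print(lookup("transformers", tiers))  # falls through to L3
```

Keeping the tiers in an ordered list means the search order (L1 → L2 → L3) is a property of the data, so adding or removing a tier does not change the lookup code.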

Six Components

User Interface Layer

Captures query text, timestamp, session ID, conversation history. The timestamp and session ID determine whether this is a follow-up or a fresh topic.

↓

Context Scheduler

The traffic controller. Parses entities, topics, intent, and complexity. Searches tiers in order (L1 → L2 → L3). Ranks results using semantic similarity (40%), recency (20%), frequency (20%), trust (10%), entity overlap (10%).

↓

Three-Tier Cache

L1 (hot, RAM), L2 (warm, Redis), L3 (cold, multi-backend). Each tier searched in order; misses fall through to the next tier.

↓

Context Capsule Engine

Assembles ranked capsules into a structured prompt. Dynamically selects format (raw/summary/facts) based on token budget and priority. Allocates the token budget as 30% to L1, 20% to L2, 40% to L3, and 10% to facts.

↓

LLM Inference Engine

Receives the structured context prompt and generates the response. The richer, more relevant context produces higher-quality answers.

↓

Feedback Loop

After each response: updates capsule access counts, records co-access patterns, promotes frequently-used capsules to faster tiers, and creates new capsules from Q&A pairs.
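The Context Scheduler's ranking step is a weighted sum, which can be sketched as follows. The weights are the ones stated above; everything else is an assumption made for illustration: the `Candidate` type, the signal names, and the convention that every signal is already normalized to the range 0 to 1.

```python
from dataclasses import dataclass

# Weights from the Context Scheduler description; they sum to 1.0.
WEIGHTS = {
    "semantic": 0.40,
    "recency": 0.20,
    "frequency": 0.20,
    "trust": 0.10,
    "entity_overlap": 0.10,
}


@dataclass
class Candidate:
    capsule_id: str
    signals: dict[str, float]  # assumed normalized to [0, 1]


def score(candidate: Candidate) -> float:
    """Weighted sum of the five ranking signals; missing signals count as 0."""
    return sum(w * candidate.signals.get(name, 0.0) for name, w in WEIGHTS.items())


def rank(candidates: list[Candidate]) -> list[Candidate]:
    """Highest-scoring candidates first."""
    return sorted(candidates, key=score, reverse=True)


# Hypothetical candidates: "a" is the closer semantic match, "b" is more recent.
a = Candidate("a", {"semantic": 0.9, "recency": 0.2, "frequency": 0.1,
                    "trust": 0.8, "entity_overlap": 0.5})
b = Candidate("b", {"semantic": 0.6, "recency": 1.0, "frequency": 0.7,
                    "trust": 0.8, "entity_overlap": 0.5})
print([c.capsule_id for c in rank([a, b])])
```

With these made-up inputs the recent, frequently used capsule outranks the closer semantic match, which is the behavior the 40/20/20 split is meant to allow.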

§ 3 Context Capsules

A capsule is a self-contained information container that stores the same content in multiple formats simultaneously — the fundamental unit of information in ContextOS. The system automatically selects the right format based on priority, token budget, and query type.

Raw Text

Full fidelity, zero information loss. Used for high-priority L1 context where detail matters.

~2,000 tokens · 100% fidelity

Summary

10–30% of original size. Used for background context in L2 where token efficiency matters.

~200 tokens · 90% token saving

Facts

Ultra-compressed atomic claims. Used for fact-checking queries and filling remaining token budget.

~30 tokens · atomic

Embedding

768-dimensional vector for semantic similarity search. Enables fast retrieval across millions of capsules.

768 floats · no tokens

Structured

Tables, code, citations in parseable format. Used when the query involves comparisons or references.

Variable · machine-readable
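One way to picture a capsule and the format selection is the sketch below. It is a simplified illustration, not the paper's engine: the `Capsule` fields cover only the three text formats, `count_tokens` is a crude whitespace count standing in for a real tokenizer, and "pick the richest format that fits" is an assumed policy. The budget split uses the percentages given for the Context Capsule Engine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Capsule:
    raw: str      # full-fidelity text
    summary: str  # compressed version
    facts: str    # atomic claims


def count_tokens(text: str) -> int:
    """Assumption: whitespace-separated words approximate tokens."""
    return len(text.split())


def select_format(capsule: Capsule, budget: int) -> Optional[tuple[str, str]]:
    """Return the richest (name, text) that fits the budget, or None if nothing fits."""
    for name in ("raw", "summary", "facts"):
        text = getattr(capsule, name)
        if count_tokens(text) <= budget:
            return name, text
    return None


# Budget split from the Context Capsule Engine description, in percent.
BUDGET_SPLIT = {"L1": 30, "L2": 20, "L3": 40, "facts": 10}


def allocate_budget(total_tokens: int) -> dict[str, int]:
    """Split a total token budget using integer arithmetic to avoid float rounding."""
    return {name: total_tokens * pct // 100 for name, pct in BUDGET_SPLIT.items()}


capsule = Capsule(raw="a b c d e f g h", summary="a b c", facts="a")
print(select_format(capsule, budget=5))
print(allocate_budget(4000))
```

A production version would also weigh capsule priority and query type, which the text names as selection inputs but does not specify in enough detail to code here.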

Capsule Lifecycle

Capsules evolve through three phases, automatically migrating between tiers based on usage patterns:

① Creation

New content arrives → all formats generated simultaneously → capsule starts in L3 cold storage. Q&A pairs generate new capsules that start in L1 (hot).

② Growth

Gets accessed → builds usage statistics → promoted to faster tiers. More than 10 accesses/hour → L2. More than 50 accesses/10min → L1. Co-access patterns are recorded.

③ Decay

No recent access → relevance score drops → demoted to slower tiers. TTL policies trigger invalidation when source documents update.
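The promotion thresholds can be expressed as counts over sliding time windows. The sketch below assumes access timestamps are kept per capsule as seconds, and that a capsule's tier is simply recomputed from its recent counts; `count_since` and `target_tier` are hypothetical helpers. The relevance-score decay described under Decay is not modeled: here a capsule just falls back to a slower tier when its counts drop below the thresholds.

```python
def count_since(access_times: list[float], now: float, window_s: float) -> int:
    """Number of accesses in the half-open window (now - window_s, now]."""
    return sum(1 for t in access_times if now - window_s < t <= now)


def target_tier(access_times: list[float], now: float) -> str:
    """Apply the stated thresholds: >50 per 10 min -> L1, >10 per hour -> L2."""
    if count_since(access_times, now, window_s=600) > 50:
        return "L1"
    if count_since(access_times, now, window_s=3600) > 10:
        return "L2"
    return "L3"


now = 10_000.0
hot = [now - i for i in range(60)]          # 60 accesses in the last minute
warm = [now - 300 * i for i in range(11)]   # 11 accesses spread over 50 minutes
cold = [now - 100, now - 200, now - 5000]   # only a few accesses

print(target_tier(hot, now), target_tier(warm, now), target_tier(cold, now))
```

Checking the stricter L1 rule first matters: a capsule that satisfies the 10-minute threshold also satisfies the hourly one, so the order decides which tier it lands in.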

§ 4 L3 Storage Backends

L3 isn't a single database — it's three specialized storage systems working in concert. Each backend excels at a different class of retrieval query.

VectorDB

Pinecone · Weaviate · Milvus

Converts query to 768-dim embedding, returns top-k semantically similar capsules. Best for "find me information about X" queries.

GraphDB

Neo4j · Amazon Neptune

Traverses knowledge graph along edges like compares-to, part-of, developed-by. Best for "how does X relate to Y?" queries.

ObjectStore

S3-compatible storage

Stores and retrieves full documents — PDFs, text, code. Provides the source of truth when raw original content is needed.

The three backends are complementary: vector search finds similar content, graph traversal reveals non-obvious connections, and object storage provides full-fidelity retrieval. Combining all three should give broader coverage than any single backend.
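The vector-search step can be illustrated with a brute-force cosine-similarity search. This is only a conceptual sketch: a real deployment would rely on the vector database's own index rather than a linear scan, the vectors here are 3-dimensional instead of 768-dimensional, and `cosine`, `top_k`, and the sample index are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k capsules most similar to the query embedding."""
    scored = sorted(((cosine(query, vec), cid) for cid, vec in index.items()),
                    reverse=True)
    return [cid for _, cid in scored[:k]]


# Toy embeddings, for illustration only.
index = {
    "attention": [1.0, 0.0, 0.0],
    "positional": [0.8, 0.6, 0.0],
    "tokenizers": [0.0, 0.0, 1.0],
}
print(top_k([1.0, 0.1, 0.0], index, k=2))
```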

§ 5 The Learning Loop

After every response, ContextOS updates its record of access patterns. Retrieval should get faster as a domain warms up: once related capsules have been promoted, the 10th query in a domain should be served much faster than the 1st.

Capsule Statistics

access_count += 1
last_accessed = now()
utility_score = f(feedback)

Co-Access Patterns

Capsules used together get linked. Next time capsule A is accessed, capsule B is pre-fetched automatically.

Cache Promotion

>10 accesses/hour → L2
>50 accesses/10min → L1

New Capsule Creation

Every Q&A pair becomes a new capsule starting in L1 — immediately hot. The conversation itself becomes context for future questions.
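The statistics update and co-access linking described above might look like the following. It is a sketch under assumptions: `FeedbackLoop`, `record`, and `prefetch_candidates` are hypothetical names, the `min_count` and `limit` parameters are illustrative, and `utility_score = f(feedback)` is left out because the text does not define `f`.

```python
import time
from collections import Counter, defaultdict
from itertools import combinations
from typing import Optional


class FeedbackLoop:
    def __init__(self) -> None:
        self.access_count: Counter = Counter()
        self.last_accessed: dict[str, float] = {}
        self.co_access: defaultdict = defaultdict(Counter)

    def record(self, capsule_ids: list[str], now: Optional[float] = None) -> None:
        """Update per-capsule stats and link every pair used in the same response."""
        now = time.time() if now is None else now
        unique = sorted(set(capsule_ids))
        for cid in unique:
            self.access_count[cid] += 1
            self.last_accessed[cid] = now
        for a, b in combinations(unique, 2):
            self.co_access[a][b] += 1
            self.co_access[b][a] += 1

    def prefetch_candidates(self, capsule_id: str, min_count: int = 2,
                            limit: int = 3) -> list[str]:
        """Capsules seen together with capsule_id at least min_count times."""
        linked = self.co_access.get(capsule_id, Counter())
        return [cid for cid, n in linked.most_common(limit) if n >= min_count]


loop = FeedbackLoop()
loop.record(["attention", "positional"], now=1.0)
loop.record(["attention", "positional", "multihead"], now=2.0)
print(loop.prefetch_candidates("attention"))
```

Requiring a minimum co-access count before pre-fetching is one way to keep a single coincidental pairing from pulling unrelated capsules into L1.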

What the System Learns

Your Interests: If you frequently ask about machine learning, ML capsules stay in L1 for as long as that usage continues, without manual configuration.

Topic Clusters: Which topics are related and queried together. Transformers → attention mechanisms → positional encoding becomes a cached cluster.

Query Patterns: Common follow-up questions get pre-fetched into L1 before you even ask them.

Optimal Formats: Which capsule format (raw/summary/facts) works best for each query type, learned from response quality feedback.

§ 6 Expected Performance

The figures below are design estimates derived from the architecture, not measurements. Projected latency improves as the system learns access patterns over the first 50–100 queries:

Figure: Projected retrieval latency by query type, Traditional RAG vs. ContextOS.

Query Type	Traditional RAG	ContextOS	Improvement
First-time (cold)	150ms	150ms	—
Follow-up (warm)	150ms	50–80ms	47–67%
Frequent (hot)	150ms	10–20ms	87–93%
Average (after learning)	150ms	30–50ms	67–80%

Expected Cache Hit Rates

After 50–100 queries in a domain, the system self-organizes to keep most context in fast tiers:

L1 cache hit rate: 40–60% (served in under 1ms from RAM)

L2 cache hit rate: 20–30% (served in under 10ms from Redis)

Combined L1+L2 hit rate: 70–90% (only 10–30% of queries reach L3)
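The projected average can be sanity-checked as a hit-rate-weighted mean of per-tier latencies. The inputs below are assumptions chosen from the ranges in this section: midpoint hit rates (50% L1, 25% L2, leaving 25% for L3), the upper-bound latencies for L1 and L2, and the 150ms cold-retrieval figure for anything that reaches L3. `expected_latency_ms` is a hypothetical helper.

```python
def expected_latency_ms(hit_rates: dict[str, float],
                        latencies_ms: dict[str, float]) -> float:
    """Average latency = sum over tiers of (share of queries) x (tier latency)."""
    if abs(sum(hit_rates.values()) - 1.0) > 1e-9:
        raise ValueError("hit rates must sum to 1")
    return sum(hit_rates[tier] * latencies_ms[tier] for tier in hit_rates)


hit_rates = {"L1": 0.50, "L2": 0.25, "L3": 0.25}       # assumed midpoints
latencies_ms = {"L1": 1.0, "L2": 10.0, "L3": 150.0}    # assumed per-tier cost

avg = expected_latency_ms(hit_rates, latencies_ms)     # 0.5 + 2.5 + 37.5
improvement = 1 - avg / 150.0                          # relative to a flat 150ms
print(f"{avg:.1f}ms average, {improvement:.0%} lower than 150ms")
```

Under these assumptions the average comes to 40.5ms, which sits inside the 30–50ms range in the table. The result is dominated by the L3 term, so the share of queries that miss both caches matters far more than shaving L1 or L2 latency.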

§ 7 Limitations & Open Questions

Cold Start Problem — First-time users see no benefit until the system learns their patterns. Mitigation: pre-populate caches from similar users or domain defaults.

Optimal Cache Sizes — What is the sweet spot for L1/L2 capacity? This likely varies by application domain and user behavior patterns.

Cache Consistency — If source documents update, cached summaries become stale. TTL policies and dependency tracking are required for cache invalidation.

Compression Strategies — Should L2 use extractive summarization, abstractive summarization, or learned query-aware compression? Each has different trade-offs.

Multi-User Caching — How to share L2/L3 across users while preserving L1 privacy? Federated ContextOS across organizations is a long-term research direction.
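For the cache consistency question, one possible shape for combining TTL policies with dependency tracking is sketched below. It assumes each cached summary remembers which source document and version it was built from; `CachedSummary`, `is_stale`, and the version map are hypothetical and not part of the proposal.

```python
from dataclasses import dataclass


@dataclass
class CachedSummary:
    text: str
    source_id: str        # document the summary was derived from
    source_version: int   # version of that document at summarization time
    created_at: float     # seconds
    ttl_s: float          # time-to-live in seconds


def is_stale(entry: CachedSummary, now: float,
             current_versions: dict[str, int]) -> bool:
    """Stale if the TTL has expired or the source changed (or no longer exists)."""
    expired = now - entry.created_at > entry.ttl_s
    source_changed = current_versions.get(entry.source_id) != entry.source_version
    return expired or source_changed


entry = CachedSummary("summary text", "doc-1", source_version=3,
                      created_at=0.0, ttl_s=3600.0)
print(is_stale(entry, now=100.0, current_versions={"doc-1": 3}))   # fresh
print(is_stale(entry, now=100.0, current_versions={"doc-1": 4}))   # source updated
print(is_stale(entry, now=4000.0, current_versions={"doc-1": 3}))  # TTL expired
```

The version check catches updates immediately, while the TTL acts as a backstop for sources whose changes are not tracked.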

