Engineering · 14 min read · February 20, 2026

How OpenClaw Actually Works: Sessions, Memory, Browser Automation and Tokens

I've been running OpenClaw for weeks and kept hitting token limits without understanding why. So I dug into every layer — sessions, memory, browser automation, costs. Here's what I found.

AI Agents · OpenClaw · Browser Automation · Architecture · Tokens

I've been running OpenClaw for a few weeks and honestly had no idea what was happening under the hood. I was hitting the weekly limits on my $200/month Claude Max plan and couldn't explain why. Agents were "remembering" things across days but I didn't know how. Browser automation was taking minutes per task. Session files were growing to 5MB and I had no idea what that meant.

So I sat down with Claude Code and went through every layer — session files, SQLite databases, config, heartbeat logs — until I actually understood it. This is what I found.

What OpenClaw Is (And Isn't)

OpenClaw is not a chatbot. It's not a Claude wrapper. It's not "ChatGPT but customized."

The closest analogy is a personal AI operating system — multiple specialized agents running in the background, maintaining their own memory, getting better at their specific jobs over time. You interact with it like a team — sometimes you're talking to the CEO who routes to specialists, sometimes agents are working overnight while you sleep.

The best contrast is with Claude Code, which I also use constantly:

            Claude Code             OpenClaw
What        CLI coding tool         Personal AI OS
When        You reach for it        It's always running
Memory      Per-session only        Persistent across all sessions
Agents      You orchestrate         Self-orchestrating team
Best for    "Build this feature"    "Run things while I sleep"

They're complementary. Claude Code is what I reach for when I want to do focused coding work. OpenClaw handles everything else — monitoring markets, writing and posting Twitter threads, running research pipelines, building things overnight — and it does it whether I'm at my laptop or not.

The Agent Team

OpenClaw runs 14 agents. Each has:

  • A distinct role and personality (defined in a SOUL.md file)
  • Its own workspace and file system
  • Its own persistent memory (markdown files + a vector database)
  • A configured AI model matched to the complexity of its job

The lineup I run:

Claw — The CEO. Every request goes here first. Routes to specialists or handles directly. Has the deepest memory of context and ongoing priorities.

Builder — Engineering. Ships features overnight via sub-agents. Knows my codebases and conventions.

Lobster — Content and research. Writes DeFi/AI Twitter threads, posts them autonomously to @0xlobsterbrain, tracks what's been covered.

Analyst — Deep research. Market intelligence, trend analysis, deep dives.

Trader — Markets and DeFi positions.

Watcher — Monitoring. Flags when something needs attention.

Marketer — Product launches, growth, copy.

Hacker, Director, Kat, DeFi, Iron, Coach, Zen — Other specialists.

Each agent is a separate process with its own context window. They don't share a conversation thread — they communicate by passing messages through the orchestration layer.
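I haven't seen the orchestration layer's actual API, but conceptually the message passing looks something like the sketch below. The class and field names are hypothetical; only the agent names come from my setup.

# Hypothetical sketch of inter-agent message passing. The real
# orchestration layer's API and field names will differ.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AgentMessage:
    sender: str       # e.g. "claw"
    recipient: str    # e.g. "lobster"
    body: str         # the request or result being handed over
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class Orchestrator:
    """Routes messages between agents; each agent only ever sees its own inbox."""
    def __init__(self):
        self.inboxes: dict[str, list[AgentMessage]] = {}

    def register(self, agent_name: str) -> None:
        self.inboxes.setdefault(agent_name, [])

    def send(self, msg: AgentMessage) -> None:
        # No shared conversation thread: the recipient gets the message,
        # but never the sender's context window.
        self.inboxes[msg.recipient].append(msg)

# Claw delegating a research task to Lobster:
router = Orchestrator()
for name in ("claw", "lobster"):
    router.register(name)
router.send(AgentMessage(sender="claw", recipient="lobster",
                         body="Research Hyperliquid funding rates, draft a thread."))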

What Happens Every Time You Send a Message

This is the part nobody explains clearly. Every API call sends three layers of context.

Layer 1: The System Prompt (~8,400 tokens, every session)

This gets assembled when a session starts and stays constant. I could see the exact breakdown from session metadata:

Component                             Size
SOUL.md (personality + rules)         3,529 chars
TOOLS.md (setup-specific notes)       4,387 chars
MEMORY.md (long-term memory)          2,840 chars
AGENTS.md (operating instructions)    1,397 chars
HEARTBEAT.md (periodic checklist)     1,693 chars
IDENTITY.md                           623 chars
USER.md (who you are)                 272 chars
Skills listing (15 skills)            7,488 chars
Tool schemas (24 tools)               15,685 chars
Total                                 ~33,500 chars / ~8,400 tokens

That's the baseline tax. Every message you send, the model receives those ~8,400 tokens before it sees your question. This is what makes the agent feel like "itself" across sessions — but it's not free.
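As a mental model, the assembly is just file concatenation plus a token estimate. This is my own sketch, not OpenClaw's code; the file names match the table above, and the chars-per-token ratio is a rough assumption.

# Sketch of how the per-session system prompt gets assembled.
# Real internals will differ; ~4 chars/token is a crude heuristic.
from pathlib import Path

BOOTSTRAP_FILES = ["SOUL.md", "TOOLS.md", "MEMORY.md", "AGENTS.md",
                   "HEARTBEAT.md", "IDENTITY.md", "USER.md"]

def build_system_prompt(workspace: Path) -> tuple[str, int]:
    parts = []
    for name in BOOTSTRAP_FILES:
        path = workspace / name
        if path.exists():
            parts.append(f"# {name}\n{path.read_text()}")
    prompt = "\n\n".join(parts)      # skills listing + tool schemas get appended too
    est_tokens = len(prompt) // 4    # rough estimate: ~4 chars per token
    return prompt, est_tokens

prompt, tokens = build_system_prompt(Path("workspace"))
print(f"baseline tax: ~{tokens} tokens on every single message")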

Layer 2: Conversation History (the snowball)

The active session file for Claw was 5.6MB with 848 messages, using 109,173 input tokens. On every message, the entire conversation history goes to the API. This is how the model has context about what you discussed this morning when you ask something tonight — but it also means every message gets progressively more expensive.

This is the main driver of token costs. Not the system prompt. Not memory search. The growing conversation history.
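The snowball is easy to see with back-of-envelope math. Here's an illustrative sketch; the average message size is an assumption, and compaction is ignored.

# Why history dominates: every new message re-sends everything before it.
# Message size is an assumed average; compaction is ignored for simplicity.
SYSTEM_PROMPT = 8_400    # tokens, constant on every call
AVG_MESSAGE   = 120      # assumed average tokens per stored message

def input_tokens_for_call(n_prior_messages: int) -> int:
    history = AVG_MESSAGE * n_prior_messages
    return SYSTEM_PROMPT + history + AVG_MESSAGE   # prompt + history + your new message

print(input_tokens_for_call(10))     # early in a session: ~9.7K tokens per call
print(input_tokens_for_call(847))    # 848th message: ~110K tokens per call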

Layer 3: Memory Search Injection (bounded, on demand)

When memory_search is called, the system queries a SQLite vector database, finds up to 6 relevant chunks (up to ~2K tokens), and injects them into context. This is how the agent "remembers" things without loading everything upfront. More on how this actually works below.

The Memory System (Two Layers)

Memory is dual-layered: markdown files for human-readable notes, and SQLite vector databases for semantic search.

Layer 1: Markdown Files

Every agent has a workspace directory:

workspace/
├── SOUL.md          — who the agent is
├── MEMORY.md        — curated long-term memory
├── TOOLS.md         — setup-specific notes
├── USER.md          — who it's helping
├── AGENTS.md        — operating instructions
└── memory/
    ├── 2026-02-20.md  — today's raw log
    ├── 2026-02-19.md  — yesterday
    └── ...

Daily logs are raw notes from each day — decisions made, things that happened, context for ongoing tasks. Agents write to these freely.

MEMORY.md is the curated version — distilled lessons and important context worth keeping long-term. Claw reviews daily logs during heartbeats and updates MEMORY.md with what actually matters.

Every session, the agent reads SOUL.md, USER.md, and today + yesterday's memory files. That's your continuity. That's why the agent knows what happened yesterday even in a fresh session — it literally reads about it.

Layer 2: SQLite Vector Database

Each agent has a SQLite database:

  • files — which markdown files are indexed (26 files for Claw)
  • chunks — text chunks with embeddings (80 chunks for Claw)
  • embedding_cache — cached embeddings to avoid recomputing
  • chunks_fts — full-text search index

When an agent writes to a memory file, a watcher picks it up within 2 seconds: splits into 320-token chunks with 60-token overlap, generates vector embeddings, stores in SQLite.
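The chunking step is simple to picture. Here's a sketch of 320-token chunks with 60-token overlap, using whitespace tokens as a stand-in for the real tokenizer, which I haven't inspected.

# Sketch of the indexing step: split a memory file into overlapping chunks.
# Whitespace "tokens" stand in for the real tokenizer.
def chunk_text(text: str, chunk_size: int = 320, overlap: int = 60) -> list[str]:
    tokens = text.split()
    step = chunk_size - overlap          # 260 new tokens per chunk
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if not window:
            break
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Each chunk then gets an embedding and a row in the chunks table;
# embedding_cache avoids recomputing embeddings it has already seen.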

When memory_search is called, it runs a hybrid search:

  • 75% vector similarity (finds semantically related content)
  • 25% full-text search (exact keyword matching)
  • Temporal decay (recent memories score higher)
  • MMR (maximal marginal relevance) to remove redundant results

So when you ask the agent about something from two weeks ago, it surfaces relevant context without you specifying where to look.
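Mechanically, the scoring is a weighted blend. This sketch shows the shape of it; the decay curve, half-life, and normalization are my assumptions, not OpenClaw's verified implementation.

# Sketch of hybrid memory scoring: 75% vector similarity, 25% keyword match,
# with a recency boost. Decay constant and normalization are assumptions.
import math

def hybrid_score(vector_sim: float, fts_score: float, age_days: float,
                 half_life_days: float = 30.0) -> float:
    base = 0.75 * vector_sim + 0.25 * fts_score
    decay = math.exp(-math.log(2) * age_days / half_life_days)  # recent > old
    return base * decay

# A moderately relevant memory from yesterday can outrank a strong match
# from two months ago:
print(hybrid_score(vector_sim=0.70, fts_score=0.50, age_days=1))    # ~0.64
print(hybrid_score(vector_sim=0.90, fts_score=0.80, age_days=60))   # ~0.22
# MMR then filters near-duplicates before the top 6 chunks are injected.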

Browser Automation: Old Way vs. Efficient Mode

This is where things got interesting. Browser automation is one of OpenClaw's most powerful features — agents can navigate websites, fill forms, click buttons, post content — but the default approach was burning tokens at an insane rate.

The Old Way: Full LLM Loop

The old browser workflow looked like this for every interaction:

snapshot (full DOM) → LLM reads everything → decides action
→ click → snapshot again → LLM reads again → next action...

For a simple web interaction you'd do:

  1. Navigate to page → full snapshot (200KB of DOM)
  2. LLM reads everything, decides what to click
  3. Click → another full snapshot
  4. LLM reads everything again, decides next step
  5. ... repeat for every action

The numbers were brutal: ~15,000 tokens per task, 200KB+ transcripts, 3-5 minutes per workflow. An overnight research session with 20 page visits was catastrophic for token budgets.

Efficient Mode: The Fix

The breakthrough was mode=efficient for snapshots. Instead of capturing the full rendered DOM, it returns only the interactive elements — inputs, buttons, links, selectors — with minimal surrounding context.

# Old way
browser action=snapshot  # Returns full DOM, often 200KB

# Efficient mode
browser action=snapshot mode=efficient  # Returns just interactive elements, ~20KB

This is now the global default in the config:

"browser": {
  "snapshotDefaults": {
    "mode": "efficient"
  }
}

Pair this with batch fill — sending multiple field values in one command instead of one at a time — and the numbers change dramatically:

Method                                 Tokens     Transcript Size    Time
Old (field-by-field LLM loop)          ~15,000    200KB+             3-5 min
Batch fill + efficient snapshot        ~5,000     80KB               ~1 min
Deterministic scripts (no LLM loop)    ~500       20KB               30 sec

The deterministic approach is the extreme end — write scripts that know exactly which selectors to use for known sites, and the agent runs them without an LLM reasoning loop at all. No thinking required, just execution.
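To make the deterministic end concrete, here's roughly what a scripted, no-LLM posting step could look like using Playwright. This is my own illustration with placeholder URL and selectors, not OpenClaw's actual browser tooling.

# Illustrative deterministic browser script (Playwright), not OpenClaw's
# internal tooling. URL and selectors are placeholders for a known site.
from playwright.sync_api import sync_playwright

def post_update(profile_dir: str, text: str) -> None:
    with sync_playwright() as p:
        # Reuse a logged-in Chrome profile so no auth step is needed.
        browser = p.chromium.launch_persistent_context(profile_dir, headless=True)
        page = browser.new_page()
        page.goto("https://example.com/compose")        # placeholder URL
        page.fill("textarea[name='post']", text)        # placeholder selector
        page.click("button[type='submit']")             # placeholder selector
        page.wait_for_load_state("networkidle")
        browser.close()

# The agent just calls this with the draft text: no snapshot, no reasoning
# loop, a few hundred tokens of transcript at most.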

Browser Automation in Practice: Twitter Threads

The Lobster agent is a good example of browser automation being genuinely useful. Lobster's job is to research DeFi/AI topics and post threads to @0xlobsterbrain.

The workflow:

  1. Pick a topic from TOPICS.md priority queue (17 queued topics like "Perps strategies," "Hyperliquid deep dive," etc.)
  2. Research deeply — reads docs, GitHub repos, papers
  3. Writes a draft thread (usually 5-13 tweets) to a drafts file
  4. Pings me on Telegram with the draft
  5. I approve (or tweak)
  6. Lobster posts via browser automation — navigates to Twitter, composes each tweet, threads them together
  7. Marks topic done in TOPICS.md with date + thread URL
  8. Syncs status to Mission Control dashboard

The browser part is what makes step 6 possible without me being at my laptop. Lobster opens a real browser (using a saved Chrome profile with Twitter already logged in) and posts each tweet in sequence. Efficient mode means this doesn't burn thousands of tokens.

A 13-tweet thread gets posted in about 2 minutes. Lobster then runs an hourly cron to check replies and continue research on the next topic.

One thing it learned the hard way: tracking what's already been posted matters. It tried to re-research a topic it had already covered because the Completed section of TOPICS.md was empty. The fix was to check the @0xlobsterbrain Twitter profile for recent threads and backfill the completed list. The agent wrote this lesson to its LEARNING.md so it doesn't repeat the mistake.

What Browser Automation Is Good At vs. Where It Struggles

Works well:

  • Posting to social platforms with consistent structure
  • Web research across multiple pages (real browser = cookies, JS rendering, no scraper blocks)
  • Repetitive multi-step workflows on known sites

Struggles:

  • Sites with heavy anti-bot detection
  • Complex custom dropdowns and multi-step forms
  • CAPTCHAs
  • Sites that frequently change DOM structure

The agent builds up site-specific knowledge over time in LEARNING.md files — what selectors work for which platforms, what fails, what the fallbacks are. That institutional knowledge accumulates the more the agent runs.

Why Token Costs Spiral

When I hit my Claude Max weekly limits and couldn't explain why, it was several things compounding:

Heartbeats × 14 agents. Claw heartbeats every 2 hours — 12 wakeups/day × ~8,400 token system prompt = ~100K tokens just to start sessions. Multiplied across all active agents with their own schedules.

Expensive models on core agents. Claw, Builder, and Trader were on Claude Opus — roughly 5-15x the cost per token vs Sonnet.

Sonnet 4.6 regression. After hitting limits, I switched from Opus to Sonnet 4.6. Turned out Sonnet 4.6 uses 4-5x more tokens than Sonnet 4.5 on certain tasks. Ended up costing as much as Opus for worse results. Sonnet 4.5 is actually the efficient one — worth testing before assuming newer = cheaper.

Large files loaded whole. One agent was doing cat on a 79KB JSON file during every heartbeat just to check a status field. That's 79KB of context 12 times a day. Replaced with a targeted jq query — same information, ~200 tokens instead of 2,000+.
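The fix is generic: pull one field instead of loading the whole file into context. Something like the snippet below, where the file path and field name are made up for illustration.

# Pull a single field with jq instead of cat-ing the whole file into context.
# File path and field name here are illustrative, not the real ones.
import subprocess

result = subprocess.run(
    ["jq", "-r", ".status", "state/pipeline.json"],   # targeted query
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())   # e.g. "idle" — a handful of tokens, not 79KB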

Old browser mode. Before efficient snapshots, heavy browser sessions were burning 15,000+ tokens per task. An overnight research session with 20 pages visited was brutal.

Conversation history snowballing. Builder runs overnight builds with sub-agents. Its session file grew massive over weeks. Every morning heartbeat carries all that history forward.
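Put rough numbers on it and the weekly limit stops being mysterious. A back-of-envelope calculation using the figures above; the agent count and per-agent cadence are simplified assumptions.

# Back-of-envelope daily baseline burn, before anyone asks a single question.
# Agent count and heartbeat cadence are simplified assumptions.
SYSTEM_PROMPT_TOKENS = 8_400
HEARTBEATS_PER_DAY   = 12    # every 2 hours, like Claw
ACTIVE_AGENTS        = 9     # the workspaces that heartbeat on a schedule (of 14 agents)

baseline = SYSTEM_PROMPT_TOKENS * HEARTBEATS_PER_DAY * ACTIVE_AGENTS
print(f"~{baseline:,} tokens/day just waking up")   # ~907,200, before history,
                                                    # browser work, or real tasks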

The Fixes That Helped

Switched models. Main agents now run on Kimi K2.5 (Moonshot AI, free tier, 256K context window) with GPT-5.3-codex as fallback. The token pressure is completely different.

Enabled efficient browser mode globally. One config change — snapshotDefaults: { mode: "efficient" } — cut browser token usage by 60-97% depending on the task.

Archived old daily memory files. 20+ daily files were sitting in active directories. Moved everything older than 10 days to archive/ subdirectories. Added a cron to do this automatically every ~10 days across all 9 active workspaces.
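The archiving itself is trivial. A sketch of the kind of script the cron runs; the 10-day cutoff and memory/ layout match what I described, but the top-level workspace path is assumed and the actual script may differ.

# Sketch of the memory-archiving job: move daily logs older than 10 days
# into an archive/ subdirectory so they stop loading into active context.
import re, shutil
from datetime import date, timedelta
from pathlib import Path

CUTOFF = date.today() - timedelta(days=10)

def archive_old_logs(memory_dir: Path) -> None:
    archive = memory_dir / "archive"
    archive.mkdir(exist_ok=True)
    for f in memory_dir.glob("*.md"):
        m = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})\.md", f.name)
        if m and date(*map(int, m.groups())) < CUTOFF:
            shutil.move(str(f), archive / f.name)

for workspace in Path("workspaces").iterdir():   # assumed top-level layout
    mem = workspace / "memory"
    if mem.is_dir():
        archive_old_logs(mem)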

Fixed the large JSON reads. Replaced whole-file reads with targeted jq queries wherever agents were loading large JSON into context.

Converted repetitive browser tasks to scripts. For known-structure sites, write the selector logic once as a script. The agent calls it — no LLM reasoning loop, just execution.

The "Grows With You" Part Is Real

People are skeptical about this — it sounds like marketing. But the mechanism is concrete.

When OpenClaw was first set up, MEMORY.md was nearly empty. Answers were generic.

Weeks later, MEMORY.md knows specific things: my timezone and location, my active projects and their status, my communication style preferences, that Sonnet 4.6 is token-hungry, that efficient browser mode should always be on. None of that was configured upfront. It came from conversations — the agents wrote things down when they seemed worth keeping, and the vector search surfaces them when relevant.

The compounding works because memory persists across model upgrades. When a better model comes out, you switch it in and the accumulated context stays. You don't start over.

Compare that to a regular chatbot: every session is blank. The model improves but your context doesn't compound. You're permanently working with a smarter amnesiac.

Where It Falls Down

Cold start is rough. Fresh agents with empty memory are basically generic assistants. Takes days to weeks of actual use before it feels personal.

Memory sync is manual. If one specialist learns something, others don't automatically know. You have to be intentional about what gets written to shared vs. agent-specific memory.

Token costs are real, especially early. Before you optimize — right model per agent, browser in efficient mode, large files queried not loaded — costs add up fast.

Session history is the silent killer. Long-running agents accumulate conversation history. The compaction system helps (trim at 80% of context window, hard reset at 92%), but you need to know this is happening.
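The thresholds make it concrete. A sketch of the decision logic, with the context window size as an example value.

# Sketch of the compaction decision, using the 80% / 92% thresholds.
# Context window size here is just an example value.
CONTEXT_WINDOW = 200_000             # tokens, example
TRIM_AT  = 0.80 * CONTEXT_WINDOW     # start trimming old history
RESET_AT = 0.92 * CONTEXT_WINDOW     # hard reset of the session

def compaction_action(session_tokens: int) -> str:
    if session_tokens >= RESET_AT:
        return "hard reset"
    if session_tokens >= TRIM_AT:
        return "trim oldest history"
    return "keep growing"

print(compaction_action(109_173))    # Claw's session: "keep growing" — for now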

Browser automation requires learning time. Agents need to discover site-specific patterns before they're reliable. The first run on a new site will often fail. Those failures get written to LEARNING.md and the agent gets better. But it takes time.

The Architecture That Works

After a few weeks running this, here's what I think the right setup is:

Match model cost to task complexity. CEO doing complex routing and synthesis — worth the expensive model. Agents doing repetitive monitoring or posting tasks — don't need Opus.

Always use efficient browser mode. There's no reason not to. Full DOM capture is almost never needed. Efficient mode loses no meaningful information for the vast majority of tasks.

Heartbeat frequency should match urgency. A market watcher needs frequent heartbeats. A research agent doing weekly deep dives doesn't. Dial this in to cut background token burn.

Keep workspace files small. Every file that auto-loads into the system prompt is tokens you pay forever. Keep MEMORY.md curated. Archive daily logs. I went from 47KB workspace to 14KB — that's real savings across every session.

Use scripts for repetitive browser tasks. If an agent runs the same browser workflow repeatedly, script it. The LLM reasoning loop is for novel situations, not repetition.

Sub-agents are where the real leverage is. Builder can spawn up to 8 sub-agents working in parallel overnight. Eight agents shipping in parallel for 8 hours produces categorically different output than one agent working alone.

The Reason I Keep Running It

Despite the optimization work and the costs, the alternative is starting fresh every conversation.

With a regular chatbot, every session starts blank. The model keeps improving, but your context never compounds; you're back to working with that smarter amnesiac.

With OpenClaw, every interaction adds to a growing picture. Three months from now, the accumulated memory will be meaningfully richer — and the next model upgrade inherits all of it.

That's the bet. It takes patience in the early days and active optimization to keep costs controlled. But if you're using AI for anything ongoing and serious, starting from zero every session starts to feel like a tax you shouldn't be paying.


If you're running something similar or building multi-agent infrastructure, I'd love to hear what you're doing — reach out on Twitter.