How Cursor Indexes Codebases Fast: Merkle Trees Meet AI

Merkle Trees

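  • Each leaf node = hash of a data block (here, a file). Each non-leaf node = hash of its children's hashes.
  • Any change propagates up to the root hash, so comparing two roots shows whether anything changed, and descending only into mismatched subtrees locates what changed.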
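
A minimal sketch of the idea above, not Cursor's implementation: it hashes each block with SHA-256 and then hashes adjacent pairs until one root remains. The function names and the rule of duplicating the last hash on odd-sized levels are assumptions chosen for illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def merkle_root(blocks: list[bytes]) -> str:
    """Return the root hash for a list of data blocks (e.g. file contents)."""
    if not blocks:
        return sha256_hex(b"")
    # Leaf level: one hash per data block.
    level = [sha256_hex(block) for block in blocks]
    # Hash adjacent pairs until a single root remains.
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: duplicate the last hash on odd levels
        level = [
            sha256_hex((level[i] + level[i + 1]).encode())
            for i in range(0, len(level), 2)
        ]
    return level[0]


files = [b"def a(): pass\n", b"def b(): pass\n", b"def c(): pass\n"]
root_before = merkle_root(files)
files[1] = b"def b(): return 1\n"  # edit one file
root_after = merkle_root(files)
print(root_before != root_after)  # True: the edit changed the root hash
```

Only the root is compared here. A fuller version would keep the intermediate levels so that a mismatch can be traced down to the specific changed leaf.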

Cursor's Process

  1. Chunking: Split code into semantic units (functions, classes) via AST
  2. Merkle tree: Compute hashes for each chunk, sync with server
  3. Embedding: OpenAI or custom models vectorize chunks
  4. Vector DB: Stored in Turbopuffer with obfuscated file paths
  5. Change detection: Every 10 minutes, compare Merkle hashes -- upload only changed files
  6. RAG: When user queries (@Codebase), retrieve relevant chunks for LLM context
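
Steps 1, 2, and 5 can be sketched together, under stated assumptions. The example below uses Python's standard ast module to split source into top-level functions and classes, hashes each chunk, and compares the hashes against a stored set to decide what needs re-embedding. The summary does not say which parser Cursor uses, and chunk_hashes and changed_chunks are hypothetical names.

```python
import ast
import hashlib


def chunk_hashes(source: str) -> dict[str, str]:
    """Map each top-level function/class name to a hash of its source text."""
    tree = ast.parse(source)
    hashes = {}
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Recover the exact text of this definition from the original source.
            segment = ast.get_source_segment(source, node)
            hashes[node.name] = hashlib.sha256(segment.encode()).hexdigest()
    return hashes


def changed_chunks(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Names of chunks that are new or whose hash differs from the stored one."""
    return sorted(name for name, digest in new.items() if old.get(name) != digest)


old_source = "def add(a, b):\n    return a + b\n\nclass Greeter:\n    pass\n"
new_source = "def add(a, b):\n    return b + a\n\nclass Greeter:\n    pass\n"
stale = changed_chunks(chunk_hashes(old_source), chunk_hashes(new_source))
print(stale)  # ['add']
```

Only the chunks named in stale would need to be re-embedded and re-uploaded; unchanged chunks keep their existing vectors.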

Security

  • Source code is not stored in the database -- it is discarded once the request completes
  • File paths obfuscated with secret keys
  • Git integration for team collaboration
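
One way path obfuscation with a secret key could work is a keyed hash per path segment, so that the directory hierarchy survives while the names do not. This is an illustrative assumption: the summary does not describe the actual scheme, and obfuscate_path and the key below are hypothetical.

```python
import hashlib
import hmac


def obfuscate_path(path: str, key: bytes) -> str:
    """Replace each path segment with a truncated keyed hash of that segment."""
    masked = [
        hmac.new(key, segment.encode(), hashlib.sha256).hexdigest()[:16]
        for segment in path.split("/")
    ]
    return "/".join(masked)


key = b"example-secret-key"  # hypothetical; a real key would be generated and kept private
a = obfuscate_path("src/auth/login.py", key)
b = obfuscate_path("src/auth/logout.py", key)
print(a.split("/")[:2] == b.split("/")[:2])  # True: shared directories map to the same tokens
print(a == b)  # False: the file names differ
```

Hashing segment by segment keeps sibling files grouped under the same masked directory, which a single hash of the whole path would not.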

Key Insight

"Merkle trees are a fast, secure fingerprint system for detecting changes in data."
