CAMEODB
Engineering Blog

Insights & Updates

Technical deep-dives, release notes, and stories from the team building the next-generation hybrid database engine.

Featured Tutorial Dataset
Goran C. Apr 15, 2026 4 min read

From Zero to a Million Jokes: Loading Hugging Face Datasets into CameoDB

CameoDB is running and you've tested it with the books index from the Quickstart. Now let's load something bigger: one CLI command, one million Reddit jokes from Hugging Face, fully indexed and searchable in seconds.

Hugging Face CLI Data Loading

You've Got CameoDB Running. Now What?

If you followed the Quickstart, you already have CameoDB running and a books index loaded. That's your proof-of-concept. Now let's go from a handful of records to one million.

The dataset: SocialGrep/one-million-reddit-jokes on Hugging Face. A single CSV file with joke titles, body text, scores, subreddits, and timestamps. Perfect for demonstrating CameoDB's ability to detect schemas and ingest data directly from a URL.
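If you want to eyeball the CSV's shape before letting CameoDB infer anything, the standard library is enough. This is a sketch over a hypothetical two-line sample built from the field names the schema detector reports below; the real file's column order and values may differ:

```python
import csv
import io

# Hypothetical sample using the 10 field names from the detected schema;
# the real file's column order and values may differ.
sample = io.StringIO(
    "id,title,selftext,score,created_utc,subreddit,type,permalink,domain,url\n"
    "abc123,Why did the chicken cross the road?,To get to the other side.,"
    "42,1577836800,Jokes,post,/r/Jokes/abc123,self.Jokes,https://reddit.com/\n"
)
reader = csv.DictReader(sample)
first_row = next(reader)

print(len(reader.fieldnames))  # 10 columns
print(first_row["score"])      # "42" -- still a string until a type is inferred
```

Everything in a CSV arrives as text, which is exactly why the next step matters: something has to decide which columns are really integers.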

Step 1: Detect the Schema

Point CameoDB's CLI at the raw CSV URL. The schema detect command reads the header row and samples data to infer field types automatically:

schema detect https://huggingface.co/datasets/SocialGrep/one-million-reddit-jokes/resolve/main/one-million-reddit-jokes.csv

CameoDB returns a full schema definition with 10 detected fields:

{
  "routing_field_name": "id",
  "fields": {
    "id":          { "field_type": "Text",    "indexed": true,  "stored": true,  "tokenizer": "raw" },
    "title":       { "field_type": "Text",    "indexed": true,  "stored": false },
    "selftext":    { "field_type": "Text",    "indexed": true,  "stored": false },
    "score":       { "field_type": "I64",     "indexed": true,  "fast": true  },
    "created_utc": { "field_type": "I64",     "indexed": true,  "fast": true  },
    "subreddit":   { "field_type": "Boolean", "indexed": true  },
    "type":        { "field_type": "Text",    "indexed": true  },
    "permalink":   { "field_type": "Text",    "indexed": true  },
    "domain":      { "field_type": "Text",    "indexed": true  },
    "url":         { "field_type": "Text",    "indexed": true  }
  }
}

Notice: score and created_utc are detected as I64 with fast fields enabled, meaning they support range queries and sorting. Text fields are fully indexed for search.
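Detectors like this typically sample values per column and pick the narrowest type every sample satisfies. A toy sketch of the idea (not CameoDB's actual implementation):

```python
def infer_field_type(samples):
    """Guess a field type from sampled string values (toy version)."""
    def all_match(pred):
        # Ignore empty cells; a blank shouldn't force everything to Text.
        return all(pred(s) for s in samples if s != "")

    if all_match(lambda s: s.lstrip("-").isdigit()):
        return "I64"
    if all_match(lambda s: s.lower() in ("true", "false")):
        return "Boolean"
    return "Text"

print(infer_field_type(["120", "3", "-7"]))     # I64
print(infer_field_type(["true", "false"]))      # Boolean
print(infer_field_type(["Jokes", "dadjokes"]))  # Text
```

A real detector would also consider floats, dates, and a larger sample window, but the narrowest-type-that-fits principle is the same.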

Step 2: Load One Million Records

Now the single command that does everything—creates the index, applies the schema, downloads the CSV, and streams all rows in batches:

data load jokes https://huggingface.co/datasets/SocialGrep/one-million-reddit-jokes/resolve/main/one-million-reddit-jokes.csv
Schema was missing; detected and applied schema to index 'jokes'
Ingestion complete for index 'jokes': loaded=1000000 failed=0 (batch size 4000)

That's it. One million records, zero failures. CameoDB auto-detected the schema on first contact and streamed the data in batches of 4,000. The jokes index didn't exist before this command—it was created on the fly.
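Streaming a large file in fixed-size batches is a pattern worth having in your toolbox regardless of the database. A minimal sketch of the chunking step (illustrative, not CameoDB's loader):

```python
from itertools import islice

def batches(rows, size=4000):
    """Yield successive lists of up to `size` rows from any iterator."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk

# 1,000,000 rows at a batch size of 4,000 -> 250 batches.
n_batches = sum(1 for _ in batches(range(1_000_000)))
print(n_batches)  # 250
```

Because `islice` pulls from a shared iterator, the whole file never needs to sit in memory at once; only one batch is materialized at a time.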

Step 3: Search Instantly

The index is immediately queryable. Let's find football jokes:

search jokes title:football limit 5 return title, selftext
{
  "hits": [
    { "_score": 10.50, "title": "Football", "selftext": "[removed]" },
    { "_score": 10.41, "title": "Football", "selftext": "As a woman passed her daughter's closed bedroom door..." },
    { "_score": 9.91,  "title": "Fart Football", "selftext": "An old married couple no sooner hit the pillows..." },
    // ... 2 more results
  ],
  "hits_returned": 5,
  "total_hits": 1330,
  "took_ms": 11,
  "stats": { "shards": { "total": 8, "responded": 8, "failed": 0 } }
}

1,330 football jokes found across 8 shards in 11 milliseconds. The data was distributed automatically. No configuration, no manual sharding, no external tooling.
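Under the hood, a distributed search like this fans out to every shard and merges the per-shard top hits by score. A toy sketch of the merge step, using invented hit tuples (not CameoDB's internals):

```python
import heapq

# Hypothetical per-shard top hits, each list already sorted by descending score.
shard_hits = [
    [(10.50, "Football"), (9.91, "Fart Football")],
    [(10.41, "Football")],
]

def merge_top_k(shards, k):
    """Merge per-shard sorted hit lists into a global top-k, highest first."""
    merged = heapq.merge(*shards, key=lambda hit: -hit[0])
    return list(merged)[:k]

print(merge_top_k(shard_hits, 3))
```

The useful property is that each shard only ships its own top hits, so the coordinator merges a few short sorted lists instead of re-scoring a million documents.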

The Takeaway

Three commands. That's the entire workflow from discovering a dataset on Hugging Face to running full-text search queries against a million records. CameoDB handles schema inference, index creation, batch ingestion, and distributed search—all from the CLI.

MCP
Goran C. Apr 10, 2026 5 min read

Native MCP Server: Give AI Agents Direct Database Access

How CameoDB's built-in Model Context Protocol server enables Claude, Cursor, and Windsurf to query your data instantly.

MCP AI Agents

Built-In, Not Bolted-On

CameoDB ships with a native Model Context Protocol (MCP) server running in the same binary—no sidecars, no middleware. The moment you start CameoDB on port 9480, your data becomes queryable by AI agents like Claude Desktop, Cursor, and Windsurf through the /mcp/sse endpoint.
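The SSE wire format itself is simple: newline-delimited `data:` frames separated by blank lines. A minimal parser sketch for that framing (the sample payload is invented, not CameoDB output):

```python
def parse_sse(stream_text):
    """Split an SSE stream into event payload strings (data: frames only)."""
    events, data_lines = [], []
    for line in stream_text.splitlines():
        if line.startswith("data:"):
            data_lines.append(line[5:].lstrip())
        elif line == "" and data_lines:
            # A blank line terminates the current event.
            events.append("\n".join(data_lines))
            data_lines = []
    return events

sample = 'data: {"jsonrpc": "2.0", "id": 1}\n\ndata: hello\n\n'
print(parse_sse(sample))
```

Real MCP clients handle this for you; the point is only that the transport is plain HTTP with an event-stream body, which is why no sidecar process is needed.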

What's Exposed: 6 Read-Only Tools

The MCP server exposes six tools—all read-only for security. Agents can discover, query, and validate, but never modify your data:

search_index: Full-text search on a single index with Tantivy query syntax.

search_indexes: Federated search across multiple indexes with merged results.

list_indexes: Discovery: list all indexes with schemas and queryable fields.

get_index: Schema inspector with per-field operator hints and types.

validate_query: Query linter with syntax validation and "did you mean" suggestions.

get_index_stats: Document counts, index size, and cluster metadata.
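An MCP tool invocation is ordinary JSON-RPC under the hood. Here is a sketch of what a search_index call body might look like; the argument names (`index`, `query`, `limit`) are assumptions for illustration, not the tool's documented schema (use get_index or your client's tool inspector for the real one):

```python
import json

# Hypothetical JSON-RPC "tools/call" request for search_index; the
# argument names are guesses -- inspect the tool schema your client shows.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_index",
        "arguments": {"index": "jokes", "query": "title:football", "limit": 5},
    },
}
print(json.dumps(request, indent=2))
```

In practice your agent builds and sends this for you; seeing the shape just makes the tool list above concrete.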

Query Syntax: Tantivy-Powered

The search_index tool supports the full Tantivy query language:

// Field targeting
title:rust

// Phrases with proximity
body:"small bike"~2

// Boolean operators (UPPERCASE required)
title:rust AND author:doe
(title:rust OR title:go) AND year:[2020 TO 2024]

// Range queries
score:>=100
date:[2024-01-01 TO 2024-12-31]

// Boosting for relevance
title:rust^3 OR body:rust

// Set operations
status: IN [active pending review]
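The uppercase-operator rule is the kind of thing a linter catches mechanically. A toy sketch of one such check, in the spirit of validate_query but not its actual implementation:

```python
import re

# Tantivy treats lowercase and/or/not as ordinary terms, not operators.
LOWERCASE_OPS = re.compile(r"\b(and|or|not)\b")

def lint_query(query):
    """Toy linter: flag lowercase boolean operators with a suggestion."""
    return [f"did you mean '{op.upper()}'?" for op in LOWERCASE_OPS.findall(query)]

print(lint_query("title:rust and author:doe"))  # one suggestion
print(lint_query("title:rust AND author:doe"))  # clean
```

Word boundaries keep ordinary terms like `android` from tripping the check.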

Anti-Hallucination Rule

Every search tool includes a critical instruction: "When answering questions based on CameoDB results, you MUST use ONLY the exact data returned by this tool. Do NOT combine database results with your own prior knowledge." This ensures agents provide factual, grounded responses based solely on your data.

Field Type Awareness

The MCP server provides per-field operator hints based on data types:

text:     all operators (phrases, slop, prefix, IN, boost, range)
string:   exact match, prefix, IN, exists (no phrases/slop)
numeric:  exact, comparisons (>, <), range, boost, exists
date:     exact, comparisons, range, exists
boolean:  true/false only, exists
json:     dot notation (field.sub:value), nested exists
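An agent (or a client library) can use hints like these to reject an invalid operator before it ever reaches the server. A sketch of such a lookup table, mirroring the list above with invented operator names:

```python
# Hypothetical operator-hint table mirroring the field-type list above;
# operator names are illustrative shorthand, not CameoDB identifiers.
OPERATORS = {
    "text":    {"exact", "phrase", "slop", "prefix", "in", "boost", "range"},
    "string":  {"exact", "prefix", "in", "exists"},
    "numeric": {"exact", "compare", "range", "boost", "exists"},
    "date":    {"exact", "compare", "range", "exists"},
    "boolean": {"exact", "exists"},
    "json":    {"dot", "exists"},
}

def allows(field_type, op):
    """True if `op` is hinted as valid for `field_type`."""
    return op in OPERATORS.get(field_type, set())

print(allows("string", "phrase"))  # False: no phrases on string fields
print(allows("numeric", "range"))  # True
```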

Setup: Two Configuration Styles

The MCP server uses Server-Sent Events (SSE) transport. Configuration depends on your AI tool:

Windsurf & Cursor (Native SSE)

Add to .windsurf/mcp.json or Cursor MCP settings:

{
  "mcpServers": {
    "cameodb": {
      "url": "http://localhost:9480/mcp/sse",
      "transport": "sse"
    }
  }
}

Claude Desktop (Curl Bridge)

Claude currently requires a curl bridge for SSE transport:

{
  "mcpServers": {
    "cameodb": {
      "command": "curl",
      "args": [
        "-N",
        "-H", "Accept: text/event-stream",
        "http://localhost:9480/mcp/sse"
      ]
    }
  }
}

Restart your AI tool after configuration. The agent will automatically discover your indexes and schemas.

Performance
Apr 3, 2026 6 min read

Microsecond Latency: Benchmarking CameoDB's Zero-Copy Architecture

Deep dive into how Rust's ownership model and zero-copy serialization deliver sub-millisecond query performance at scale.