---
title: "RFC: hub recall API"
url: https://mdfy.app/OGySiVoO
updated: 2026-05-14T18:15:49.480Z
source: "mdfy.app"
---
# RFC: hub recall API

> Status: shipped.

## The endpoint

`POST mdfy.app/api/hub/{slug}/recall`

Body:

```json
{
  "question": "How does cross-AI memory work?",
  "k": 10,
  "level": "doc",
  "rerank": true
}
```

Returns the top-k matching chunks (or docs), ranked.
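For concreteness, a minimal TypeScript caller. The hub slug is a placeholder, and the result field names are inferred from the response description later in this RFC, so treat them as illustrative:

```typescript
// Minimal caller sketch. Slug and result shape are illustrative.
interface RecallResult {
  docId: string;
  title: string;
  url: string;
  chunk: string;
  score: number;
  source: "vector" | "lexical" | "both";
}

const res = await fetch("https://mdfy.app/api/hub/my-hub/recall", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    question: "How does cross-AI memory work?",
    k: 10,
    level: "doc",
    rerank: true,
  }),
});
if (!res.ok) throw new Error(`recall failed: ${res.status}`);
const results: RecallResult[] = await res.json();
```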

## How the recall actually works

1. **Embedding lookup.** Embed the question with `text-embedding-3-small` (1536 dim).
2. **Hybrid retrieval.** Run two queries in parallel against the user's hub:
   - **Vector.** pgvector cosine match against `documents.embedding` (HNSW index, `ef_search = 40`).
   - **Lexical.** Postgres `to_tsvector` full-text search against `documents.fts`.
3. **Union + de-dup.** Concatenate the top 30 from each leg and de-dup by doc id, yielding ~30-50 unique candidates.
4. **Reranker (optional).** If `rerank: true`, send the candidates plus the question to Anthropic's Haiku, which scores each match; re-sort by Haiku's score and take the top k. (Steps 2-4 are sketched after this list.)
5. **Return.** Each result includes: doc id, doc title, doc URL, the matched chunk text, the rank score, and the source (vector / lexical / both).
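A sketch of steps 2 and 3, assuming a node-postgres pool, a `hub_id` column scoping the hub, and pgvector's `<=>` cosine-distance operator over `documents.embedding`. The query shapes are illustrative, not the production SQL:

```typescript
import { Pool } from "pg";

const pool = new Pool();

interface Candidate {
  id: string;
  score: number;
  source: "vector" | "lexical" | "both";
}

// Steps 2-3: run both retrieval legs in parallel, then union + de-dup.
async function hybridCandidates(
  hubId: string,
  question: string,
  qVec: number[] // the question embedding from step 1
): Promise<Candidate[]> {
  // Vector leg: cosine distance over the HNSW index (ef_search = 40 is
  // assumed to be set at the session level; omitted here for brevity).
  const vectorLeg = pool.query(
    `SELECT id, 1 - (embedding <=> $2::vector) AS score
       FROM documents
      WHERE hub_id = $1
      ORDER BY embedding <=> $2::vector
      LIMIT 30`,
    [hubId, JSON.stringify(qVec)]
  );

  // Lexical leg: full-text search over the precomputed fts column.
  const lexicalLeg = pool.query(
    `SELECT id, ts_rank(fts, websearch_to_tsquery('english', $2)) AS score
       FROM documents
      WHERE hub_id = $1
        AND fts @@ websearch_to_tsquery('english', $2)
      ORDER BY score DESC
      LIMIT 30`,
    [hubId, question]
  );

  const [v, l] = await Promise.all([vectorLeg, lexicalLeg]);

  // Union + de-dup by doc id, remembering which leg(s) produced each hit.
  const seen = new Map<string, Candidate>();
  for (const row of v.rows) {
    seen.set(row.id, { id: row.id, score: row.score, source: "vector" });
  }
  for (const row of l.rows) {
    const prev = seen.get(row.id);
    if (prev) prev.source = "both";
    else seen.set(row.id, { id: row.id, score: row.score, source: "lexical" });
  }
  return [...seen.values()]; // ~30-50 unique candidates
}
```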
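Step 4 in the same spirit. The prompt format and score parsing are assumptions; only the shape of the step (one scoring call, re-sort, truncate to k) comes from this RFC:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from env

// Step 4 (optional): have Haiku score each candidate against the
// question, then re-sort and keep the top k. The prompt and the parsing
// here are illustrative, not the production prompt.
async function rerank(
  question: string,
  candidates: { id: string; chunk: string }[],
  k: number
) {
  const listing = candidates.map((c, i) => `[${i}] ${c.chunk}`).join("\n\n");

  const msg = await anthropic.messages.create({
    model: "claude-3-5-haiku-latest",
    max_tokens: 500,
    messages: [{
      role: "user",
      content:
        `Question: ${question}\n\nPassages:\n${listing}\n\n` +
        `Score each passage 0-10 for relevance. Reply with one ` +
        `"index:score" pair per line, nothing else.`,
    }],
  });

  // Parse "index:score" lines; unscored candidates sink to the bottom.
  const text = msg.content[0].type === "text" ? msg.content[0].text : "";
  const scores = new Map<number, number>();
  for (const line of text.split("\n")) {
    const m = line.match(/^(\d+):\s*(\d+(?:\.\d+)?)/);
    if (m) scores.set(Number(m[1]), Number(m[2]));
  }

  return candidates
    .map((c, i) => ({ ...c, rerankScore: scores.get(i) ?? 0 }))
    .sort((a, b) => b.rerankScore - a.rerankScore)
    .slice(0, k);
}
```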

## What's tunable

- `k` — number of results to return. 1-20.
- `level` — `"doc"` returns whole docs; `"chunk"` returns specific passages (chunks are pre-computed at ~500 tokens each).
- `rerank` — boolean, default `true`. The Haiku pass adds ~300ms at p95; set `false` on speed-first paths.
- `min_score` — discard results below a cosine threshold. Useful for "don't return anything if nothing matches." (An example combining these follows the list.)
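For example, a speed-first request body might combine the last two knobs (the `min_score` value here is illustrative; tune it per hub):

```typescript
// Speed-first request: chunk-level results, no rerank, with a cosine floor.
const body = {
  question: "latest notes on recall latency",
  k: 5,
  level: "chunk",
  rerank: false,   // skip the ~300ms Haiku pass
  min_score: 0.25, // nothing is returned if nothing clears the floor
};
```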

## Auth

The endpoint is publicly callable for public hubs. For restricted or private hubs, the caller must be the owner or present an MCP-signed token; anonymous calls to a private hub return 401.
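The rule is small enough to state as code. A sketch, with hypothetical names for the hub visibility flag and the token check:

```typescript
interface Hub {
  ownerId: string;
  visibility: "public" | "restricted" | "private";
}

interface Caller {
  userId?: string;   // present if authenticated
  mcpToken?: string; // present if the call carries an MCP-signed token
}

// Placeholder so the sketch type-checks; real MCP token verification
// is out of scope here.
async function verifyMcpToken(token: string, hub: Hub): Promise<boolean> {
  return false;
}

async function canRecall(hub: Hub, caller: Caller): Promise<boolean> {
  if (hub.visibility === "public") return true;      // anyone may call
  if (caller.userId === hub.ownerId) return true;    // owner
  if (caller.mcpToken) return verifyMcpToken(caller.mcpToken, hub);
  return false;                                      // anonymous -> 401
}
```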

## What it doesn't do

- **Multi-hop reasoning.** No "fetch this, then fetch what it links to, then aggregate." That's a higher-level construct that lives in the caller's loop (sketched after this list).
- **Live recomputation of embeddings.** We embed at write time; recall reads from the existing vectors. Staleness is bounded by the longest delay between a doc edit and the embedding-refresh job (currently 30s).
- **Graph traversal.** Recall is flat over chunks. The graph relationships are at the concept level, accessible separately via the concept index.
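To make the multi-hop boundary concrete, here is roughly what that caller-side loop looks like; every name in it is hypothetical scaffolding around the endpoint, not part of the API:

```typescript
// Multi-hop lives in the caller: recall, decide follow-ups, recall again.
type RecallResult = { docId: string; chunk: string; url: string; score: number };

declare function recall(
  slug: string,
  body: { question: string; k: number; level: "doc" | "chunk" }
): Promise<RecallResult[]>;

// Stands in for however the caller (typically an LLM) picks the next hop.
declare function followUpQuestions(
  original: string,
  results: RecallResult[]
): string[];

async function multiHopRecall(slug: string, question: string, hops = 2) {
  const collected: RecallResult[] = [];
  let questions = [question];

  for (let hop = 0; hop < hops && questions.length > 0; hop++) {
    const batches = await Promise.all(
      questions.map((q) => recall(slug, { question: q, k: 5, level: "chunk" }))
    );
    const results = batches.flat();
    collected.push(...results);
    // The caller inspects results and decides what to ask next; the API
    // itself never does this.
    questions = followUpQuestions(question, results);
  }
  return collected;
}
```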

## What's next

- **Per-hub recall caching.** Common queries against a public hub should be cacheable for ~60s.
- **Streaming results.** Today the response waits for the reranker to finish. We could stream the union results as they arrive and replace them as the reranker scores them. Tradeoff: more complex client code.
- **Configurable embedding model.** Currently hardcoded to `text-embedding-3-small`. Worth exposing if we ever support a non-OpenAI default.
