
RL-Scored Retrieval: How a Neural Reranker Learns What You Actually Need

Static rerankers freeze the day they ship. Ours retrains every night on your codebase, your logs, your knowledge graph. Here is what that buys you.

Why cosine alone fails at the personal level

Every RAG pipeline follows the same path: embed query, embed docs, cosine similarity, top-K. It works—that's why it's everywhere. It also hits a ceiling.

Cosine measures topical overlap, not usefulness. Two snippets can be equally "about authentication" when only one answers your current question. "JWT rotation pattern" and "Why we moved off JWT" live next to each other in embedding space. Which one you want depends on what you're doing right now, and the index can't know that.
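For reference, here is that stage-zero pipeline as a minimal numpy sketch. It assumes query and document vectors already exist; `cosine_top_k` is an illustrative name, not part of any product's API.

```python
import numpy as np

def cosine_top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5):
    """Stage-zero retrieval: rank documents by cosine similarity alone."""
    # Normalize so the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    top = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in top]
```

This is the whole story for plain vector RAG: one score per document, no notion of task, history, or context.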

Two-stage retrieval is now standard

The fix everyone uses is a two-stage pipeline: cheap vector recall for K candidates (50-200), then an expensive reranker scoring (query, candidate) pairs. The 2026 reranker market is crowded and good. Cohere Rerank 3.5 and Voyage rerank-2.5 lead closed-source; BGE reranker v2-m3, Jina Reranker v3, and Mixedbread cover open-source. All share one trait: they are static at inference. Trained once on massive data, frozen, shipped. Excellent general judges that know nothing about you.
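In code, the standard two-stage shape is roughly this. `vector_recall` and `rerank_score` are placeholders for your vector store and whichever reranker you deploy; they are not names from any of the products above.

```python
from typing import Callable

def two_stage_retrieve(
    query: str,
    vector_recall: Callable[[str, int], list[dict]],  # cheap stage: vector store top-K
    rerank_score: Callable[[str, str], float],        # expensive stage: cross-encoder
    recall_k: int = 100,
    final_k: int = 10,
) -> list[dict]:
    """Generic two-stage pipeline: broad cheap recall, then precise scoring."""
    candidates = vector_recall(query, recall_k)       # e.g. 50-200 candidates
    scored = [(rerank_score(query, c["text"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:final_k]]
```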

What our reranker does that they don't

Orchestrator Pro ships a different reranker. Six things put it in a different product category:

1. It learns from YOUR codebase and YOUR patterns. Static rerankers train once on aggregated clicks, then freeze. Ours retrains nightly on your local log via offline_trainer.py. After a month the model is shaped by how you search, what you cite, what you edit—not what millions of anonymous users clicked in 2023.

2. It is KG-context-aware. When scoring a candidate, the reranker sees the candidate's typed neighbours in the knowledge graph (via semantic_graph_search). Static rerankers see only (query, doc_text) pairs. Ours sees (query, doc_text, graph_neighbourhood).

3. It picks detail levels, not just order. hybrid_search exposes detail="titles" | "descriptions" | "full", and the RL policy chooses a level per candidate by confidence: the top-3 high-score candidates get full text, mid-score candidates get summaries, low-score candidates get titles only (sketched in code after this list). Static rerankers only reorder; you still pay full-chunk token cost across the whole top-K. Ours prevents that waste.

4. The same model feeds the agent router. A Q-learning policy decides which agent—coder, planner, kg-navigator—gets a task. Routing and retrieval telemetry share one corpus; the system learns across both axes.

5. Per-user, per-machine, never leaves your box. Weights at state/rl_model_1024.pt are yours. Cohere and Voyage are cloud APIs. BGE and Jina run locally but stay frozen—local execution isn't personalization.

6. Graceful degradation. RL server down or model missing? Retrieval falls through to cosine. No request fails because the reranker is unavailable.
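To make point 3 concrete, here is a minimal sketch of confidence-banded detail selection. The three detail values and the top-3-get-full-text policy come from the list above; the function name and the 0.8/0.5 thresholds are illustrative, not the shipped values.

```python
def assign_detail_levels(ranked: list[dict]) -> list[dict]:
    """Map reranker confidence to a retrieval detail level per candidate.

    Policy from the list above: top-3 high-confidence hits get full text,
    mid-confidence hits get summaries, the rest get titles only.
    The 0.8 / 0.5 cutoffs are placeholders for the learned thresholds.
    """
    for rank, cand in enumerate(ranked):
        if rank < 3 and cand["score"] >= 0.8:
            cand["detail"] = "full"          # pay full token cost only here
        elif cand["score"] >= 0.5:
            cand["detail"] = "descriptions"  # summary-level context
        else:
            cand["detail"] = "titles"        # cheapest: title only
    return ranked
```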

How it actually works

The Pro reranker is a small PyTorch network — input dim 1024 to match qwen3-embedding, weights at state/rl_model_1024.pt, served by a local process on port 11439. On each query the orchestrator pulls the top K from Weaviate, posts query embedding + candidate embeddings + metadata (recency, node type, tags, access frequency, graph neighbours) to the RL server, and receives a reordered top-K plus a per-item detail-level decision.
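Sketched from the orchestrator side, assuming a JSON endpoint on the local RL server; only the port comes from the description above, while the /rerank path and field names are illustrative. It also shows the fallback from point 6: if the server is unreachable, the cosine order stands.

```python
import requests

RL_SERVER = "http://127.0.0.1:11439"

def rl_rerank(query_emb: list[float], candidates: list[dict], timeout: float = 0.5) -> dict:
    """Ask the local RL server to reorder candidates and pick detail levels.

    `candidates` arrive already ranked by cosine similarity, each carrying
    its embedding plus metadata (recency, node type, tags, access frequency,
    graph neighbours). Falls back to the cosine order if the server or the
    model is unavailable, so no request ever fails on the reranker.
    """
    payload = {
        "query_embedding": query_emb,  # 1024-dim, matches qwen3-embedding
        "candidates": [
            {"id": c["id"], "embedding": c["embedding"], "metadata": c["metadata"]}
            for c in candidates
        ],
    }
    try:
        resp = requests.post(f"{RL_SERVER}/rerank", json=payload, timeout=timeout)
        resp.raise_for_status()
        return resp.json()  # reordered ids + per-item detail level
    except requests.RequestException:
        # Graceful degradation: keep cosine order, default detail policy.
        return {"order": [c["id"] for c in candidates], "detail": None}
```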

Every retrieval writes a row via rl_logger.py: which candidates were shown, which the agent cited or edited, which were ignored. That is the reward signal. Overnight, offline_trainer.py replays the log and updates weights. Same pipeline feeds the Q-learning router that dispatches agents.
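What one logged row and its replayed reward might look like: the shown/cited/edited/ignored signal is what the paragraph above describes, while the field names and reward weights here are assumptions for illustration.

```python
import time

# One row per retrieval, written at query time (shape illustrative).
row = {
    "ts": time.time(),
    "query_id": "q-1024",
    "shown": ["doc-17", "doc-3", "doc-88"],  # candidates presented to the agent
    "cited": ["doc-3"],                       # quoted in the agent's answer
    "edited": [],                             # opened and modified
    "ignored": ["doc-17", "doc-88"],          # shown but never used
}

def reward(doc_id: str, row: dict) -> float:
    """Reward shaping replayed by the nightly trainer; weights illustrative."""
    if doc_id in row["edited"]:
        return 1.0    # strongest signal: the agent acted on it
    if doc_id in row["cited"]:
        return 0.7    # cited in the answer
    if doc_id in row["ignored"]:
        return -0.2   # shown but unused: mild negative
    return 0.0

rewards = {d: reward(d, row) for d in row["shown"]}
```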

Pre-trained baseline, personalization gradient

From session one, the reranker performs at a baseline trained on aggregated orchestrator telemetry: not random, not cosine-only. From there, every retrieval feeds the local logger, and nightly retraining tightens the model around your patterns. Within hours of real use the personalization signal is detectable; within weeks it dominates.

If a Python developer types "how does X work" and consistently clicks code examples over conceptual overviews, the reranker learns that. If they type "pattern for Y" and click ADRs, it learns that too. Cohere can't learn either. Neither can BGE.

How this compares

| Product | Method | Personalizes to user | KG-aware | Picks detail level | Local | Pricing |
|---|---|---|---|---|---|---|
| VibeCoded Orchestrator Pro | RL-scored neural reranker (Q-learning) | Yes (nightly retrain) | Yes | Yes | Yes | €19/mo, €149/yr, €199 lifetime (cap 100) |
| Cohere Rerank 3.5 | Static cross-encoder | No | No | No | No (cloud) | Per-query API |
| Voyage rerank-2.5 | Static cross-encoder, instruction-tuned | No | No | No | No (cloud) | Per-query API |
| BGE reranker v2-m3 | Static cross-encoder | No | No | No | Yes | Free |
| Jina Reranker v3 | Static listwise, 131k context | No | No | No | Yes (weights) | Free weights / paid API |
| Mixedbread rerank-large-v1 | Static, GRPO + contrastive + preference | No | No | No | Yes | Free weights |
| Cursor / Copilot retrieval | Proprietary, undisclosed | Unclear | Unclear | Unclear | No | Bundled |
| Mem0 / Mem0g | Vector + BM25 + entity graph | Partial (memory graph) | Partial | No | No (cloud) | Free → $19 → $249/mo |

Static cross-encoders (Cohere Rerank 3.5, Voyage rerank-2.5, BGE-v2-m3, Jina Reranker v3) are excellent zero-shot rerankers: trained on millions of cross-domain pairs, they post strong day-one BEIR accuracy. Our reranker ships pre-trained on aggregated orchestrator telemetry, so it doesn't start from zero; it has a sensible baseline at install. The difference is what happens after. Static models freeze at ship. Ours adapts to your codebase, your KG, your patterns. Within hours of real use it's already moving in your favor; within weeks it compounds into something no static model can replicate.

Funnel

The base VibeCoded Orchestrator is free and AGPL — KG, code graph, cosine retrieval, no RL. Most users will not need more.

Orchestrator Pro adds the RL reranker, KG-aware scoring, detail-level selection, and the shared Q-learning router for €19/month, €149/year, or €199 lifetime (capped at 100 seats). If personalization over time matters to your workflow — and it matters more the more distinctive your workflow is — Pro earns its keep around week three. If not, cosine is still right there, still free, still fast.

MAO (our hierarchical multi-agent orchestrator) is a separate product on a separate track — not part of this comparison.
