Smart Contracts · Machine Learning · Development

Solidity Embedding Model Guide - Picking the Right Model for Smart Contract Retrieval

Expert analysis of the top code embedding models for Solidity retrieval in 2026, based on the MTEB Code Leaderboard. Covers accuracy, resource usage, and balanced deployment options — from 607 MB CPU models to API-tier performance — with per-task benchmark breakdowns and practical decision frameworks.

Picking a Code Embedding Model for Solidity in 2026: A Deep Dive into the MTEB Code Leaderboard

There’s a quiet arms race happening in smart contract tooling that doesn’t get nearly enough coverage. While everyone is arguing about which L2 will win or whether account abstraction will finally go mainstream, a small but growing cohort of security researchers, audit firms, and protocol developers are trying to solve a much more tractable problem: how do you build a retrieval system that actually understands Solidity?

The use cases are real and multiplying. Automated audit pipelines that retrieve semantically similar vulnerable code patterns. RAG-based assistants for internal contract libraries spanning hundreds of thousands of lines. Semantic search over past audit reports so a reviewer can ask “show me every time we’ve seen a reentrancy guard bypassed via a low-level call” and get something useful back. Fuzzer-guided corpus selection that pulls in contracts with similar state machine logic.

All of these depend on one unglamorous upstream decision: which embedding model do you use?

I’ve spent the last few weeks pulling apart the MTEB Code Leaderboard snapshot from today — February 25, 2026 — and I want to walk through what I found, what the data actually tells you about Solidity specifically, and where I’d put my money depending on your deployment constraints. The raw data is available if you want to reproduce any of this:


The Benchmark Problem: There Is No Solidity Task

Let me get the uncomfortable truth out of the way first.

MTEB Code has no Solidity benchmark. None. If you go looking for a SolidityRetrieval task in the leaderboard columns, you won’t find it. The twelve tasks that do exist include AppsRetrieval, CodeSearchNetCCRetrieval, CodeSearchNetRetrieval, StackOverflowQA, CosQA, and SyntheticText2SQL. Some are genuinely useful proxies. Most are only loosely related to what you actually care about.

You can see this immediately by loading the data:

python
import pandas as pd
import re

def clean_model_name(name: str) -> str:
    """Strip markdown link syntax: [ModelName](url) -> ModelName"""
    match = re.search(r'\[(.+?)\]', name)
    return match.group(1) if match else name

summary = pd.read_csv("mteb_lb_ds_code_feb_25_2026_summary.csv")
tasks   = pd.read_csv("mteb_lb_ds_code_feb_25_2026_perf_per_task.csv")

summary["Model"] = summary["Model"].apply(clean_model_name)
tasks["Model"]   = tasks["Model"].apply(clean_model_name)  # clean both, or the merge key won't match

print("Task columns in perf_per_task:")
print([c for c in tasks.columns if c != "Model"])
code
Task columns in perf_per_task:
['AppsRetrieval', 'COIRCodeSearchNetRetrieval', 'CodeEditSearchRetrieval',
 'CodeFeedbackMT', 'CodeFeedbackST', 'CodeSearchNetCCRetrieval',
 'CodeSearchNetRetrieval', 'CodeTransOceanContest', 'CodeTransOceanDL',
 'CosQA', 'StackOverflowQA', 'SyntheticText2SQL']

No Solidity column. The closest proxies are CodeSearchNetCCRetrieval (CSNCC) and CodeSearchNetRetrieval (CSN).

CSNCC covers multiple languages including JavaScript, which shares enough syntactic DNA with Solidity — C-style braces and control flow, similar function-declaration shapes, event-driven patterns — that a model scoring well there has likely internalised some transferable structure. CSN is the classic natural-language-to-code retrieval task: given a docstring, find the function. It measures how well a model bridges intent and implementation, which is exactly what you want when a developer asks “find the contract that handles emergency pauses with a time delay.”

StackOverflowQA is surprisingly predictive of code QA generalisation even though SO has relatively sparse Solidity content. AppsRetrieval correlates with top-line model quality but says little about language-specific fidelity.

Keep this framing in mind throughout. When I say a model “should generalise to Solidity,” I’m making an inference from training data provenance, architecture, and proxy scores — not a direct measurement. Anyone claiming otherwise is overselling their certainty.
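One way to pressure-test the proxy framing yourself is to check how tightly each task column tracks the overall Retrieval score across models. A minimal sketch with pandas — shown on a stand-in frame with the leaderboard’s column names and made-up values, since the point is the method; in practice you’d run it on the `merged` frame built from the two CSVs:

```python
import pandas as pd

# Stand-in frame using the leaderboard's column names (values are illustrative).
merged = pd.DataFrame({
    "Retrieval":                [97.29, 96.59, 93.36, 84.59, 80.07],
    "CodeSearchNetCCRetrieval": [97.0,  96.0,  93.0,  88.0,  95.59],
    "CodeSearchNetRetrieval":   [96.0,  95.0,  92.0,  89.30, 92.34],
    "StackOverflowQA":          [97.0,  96.59, 94.50, 82.0,  85.0],
})

proxies = ["CodeSearchNetCCRetrieval", "CodeSearchNetRetrieval", "StackOverflowQA"]
# Pearson correlation of each proxy task against the overall Retrieval score
corr = merged[proxies].corrwith(merged["Retrieval"]).sort_values(ascending=False)
print(corr)
```

A proxy that correlates strongly with the overall score across all 204 models tells you less about Solidity specifically than one that diverges — divergence is where language-specific signal hides.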


The Leaderboard Landscape

The snapshot has 204 models ranked by Borda score. Let’s look at what we’re actually working with before pulling out individual picks.

python
merged = summary.merge(tasks, on="Model", how="inner")

print(merged["Retrieval"].describe())
print(f"\nModels with retrieval >= 90: {(merged['Retrieval'] >= 90).sum()}")
print(f"Models with retrieval >= 80: {(merged['Retrieval'] >= 80).sum()}")
print(f"Models with retrieval < 70:  {(merged['Retrieval'] < 70).sum()}")
code
count    204.000000
mean      64.318627
std       17.421083
min       19.220000
25%       54.417500
50%       66.185000
75%       77.320000
max       97.290000

Models with retrieval >= 90: 9
Models with retrieval >= 80: 34
Models with retrieval < 70:  108

The distribution is heavily skewed. The top 9 models are separated from the median by a gap that reflects genuine capability differences, not benchmark noise. That mean of 64.3 (median 66.2) is a reminder of how much junk floats up in leaderboards — half the models on here you should never touch for a production retrieval system.

Let’s also visualise the hardware picture, since for self-hosted models this is often the actual binding constraint:

python
import matplotlib.pyplot as plt

local = merged.dropna(subset=["Memory Usage (MB)", "Retrieval"]).copy()
local["mem_gb"] = local["Memory Usage (MB)"] / 1024

highlights = {
    "CodeSearch-ModernBERT-Crow-Plus": (0.59, 84.59),
    "Octen-Embedding-4B":              (7.49, 90.81),
    "Octen-Embedding-8B":              (14.10, 93.36),
    "llama-embed-nemotron-8b":         (27.96, 96.59),
    "Qwen3-Embedding-4B":              (7.49, 80.07),
}

fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(local["mem_gb"], local["Retrieval"], alpha=0.5, s=40, color="#aaa")

for name, (x, y) in highlights.items():
    ax.scatter(x, y, s=80, zorder=5)
    ax.annotate(
        name,
        xy=(x, y),
        xytext=(x + 0.4, y - 1.8),
        fontsize=8,
        arrowprops=dict(arrowstyle="->", color="#555", lw=0.8),
    )

ax.set_xlabel("VRAM / RAM (GB)")
ax.set_ylabel("Retrieval Score")
ax.set_title("MTEB Code — Retrieval Score vs Memory (self-hosted only)")
plt.tight_layout()
plt.savefig("retrieval_vs_memory.png", dpi=150)

A few structural observations from that plot and the raw data:

API models still lead on raw accuracy. Voyage AI’s models occupy the top of the leaderboard by a meaningful margin. voyage-4-large at 97.29 is not an incremental improvement over the open-weight competition — it’s a genuine gap.

The 4B parameter sweet spot is real. There’s a cluster of 4B-class models — Octen-4B, MoD-Embedding, Qwen3-4B — that score in the 80–91 range while fitting on an 8 GB GPU. This tier didn’t exist two years ago.

Sub-1 GB models have matured. CodeSearch-ModernBERT-Crow-Plus at 607 MB and 84.59 retrieval is remarkable. Six months ago you couldn’t get anywhere near this accuracy in that footprint.

The 8B gap is closing but hardware costs persist. llama-embed-nemotron-8b at 96.59 is exceptional, but 28.6 GB VRAM means you’re paying for an A100 or H100. That’s a different economics conversation than spinning up a g4dn.xlarge.


Category 1: Maximum Accuracy

When you need to be right as often as possible and cost is a secondary concern.

1. voyage-4-large — The Ceiling

Retrieval score 97.29. API-only. 32k token context. 2048-dimensional embeddings.

If retrieval accuracy is your north star and you’re comfortable with an API dependency, this is the answer. Voyage AI has been consistent about training on code corpora, and the public GitHub data in their training set almost certainly includes a non-trivial volume of Solidity from DeFi protocols, audit repos, and OpenZeppelin-adjacent libraries.

What distinguishes voyage-4-large from everything else isn’t just the headline number — it’s the consistency. The AppsRetrieval score of 97.29 requires understanding what code does, not just what it looks like. You cannot fake that with shallow syntactic matching.

python
import voyageai

client = voyageai.Client()  # reads VOYAGE_API_KEY from env

def embed_contracts(
    paths: list[str],
    model: str = "voyage-4-large",
    batch_size: int = 128,
) -> list[list[float]]:
    """
    Embed Solidity files by path.
    Voyage supports up to 128 documents per request.
    """
    texts = [open(p).read() for p in paths]
    all_embeddings = []

    for i in range(0, len(texts), batch_size):
        batch  = texts[i : i + batch_size]
        result = client.embed(batch, model=model, input_type="document")
        all_embeddings.extend(result.embeddings)

    return all_embeddings


def embed_query(query: str, model: str = "voyage-4-large") -> list[float]:
    result = client.embed([query], model=model, input_type="query")
    return result.embeddings[0]

The practical concern is data privacy. If you’re embedding audit targets or proprietary protocol code, that code is going to Voyage’s API. Non-starter for some teams, acceptable trade-off for others. Voyage offers zero-data-retention under enterprise contracts, but have that conversation explicitly rather than assuming.

2. llama-embed-nemotron-8b — Best Self-Hosted Score on the Board

Retrieval score 96.59. ~28.6 GB VRAM. 4096-dimensional embeddings. 32k token context.

The most accurate self-hosted option in the entire leaderboard. The gap below it is significant — 96.59 versus 93.36 for the next best open-weight model. For use cases where you cannot use an external API but need accuracy as close to the frontier as possible, this is your answer.

The StackOverflowQA score of 96.59 is worth noting. SO questions are ambiguous and colloquial in ways pure function-retrieval benchmarks don’t capture. A model that handles them well can handle the kinds of natural language questions your developers actually ask: “how does this contract handle slippage protection,” “find the invariant check in the lending pool,” “where’s the fee calculation for liquidity removal.”

python
from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer(
    "nvidia/llama-embed-nemotron-8b",
    device="cuda",
    trust_remote_code=True,
)

# Nemotron uses an instruction prefix for asymmetric retrieval.
# Documents are embedded without a prefix; queries use this one.
QUERY_PREFIX = "Instruct: Retrieve relevant Solidity code\nQuery: "

def embed_documents(texts: list[str], batch_size: int = 16) -> torch.Tensor:
    return model.encode(
        texts,
        batch_size=batch_size,
        convert_to_tensor=True,
        normalize_embeddings=True,
        show_progress_bar=True,
    )

def embed_query_nemotron(query: str) -> torch.Tensor:
    return model.encode(
        QUERY_PREFIX + query,
        convert_to_tensor=True,
        normalize_embeddings=True,
    )

28.6 GB VRAM means an A100 (40 GB or 80 GB), an H100, or a pair of 24 GB consumer cards with tensor parallelism. For a team running a dedicated audit tooling server this is manageable. For a shared cloud VPS, it’s prohibitive.

3. Octen-Embedding-8B — Accuracy That Fits on One GPU

Retrieval score 93.36. ~14.4 GB VRAM. 4096-dimensional embeddings. 32k token context.

This is the most interesting model in the accuracy tier. 93.36 retrieval at 14.4 GB VRAM fits on a single RTX 4090 (24 GB), an L4, or an L40S — and, with almost no headroom, even a 16 GB card. You give up roughly 3 percentage points versus Nemotron-8B in exchange for halving your VRAM requirement and cutting cloud GPU cost by a similar factor.

python
from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer(
    "bflhc/Octen-Embedding-8B",
    device="cuda",
    trust_remote_code=True,
)

def embed_batch(texts: list[str], batch_size: int = 8) -> torch.Tensor:
    return model.encode(
        texts,
        batch_size=batch_size,
        convert_to_tensor=True,
        normalize_embeddings=True,
        show_progress_bar=True,
    )

AppsRetrieval of 92.21 and StackOverflowQA of 94.50 tell a consistent story: robust generalisation across retrieval task structures without degrading badly on any single type. That consistency matters more in production than a headline number, because your actual query distribution won’t look like any single benchmark.


Category 2: CPU-First, Budget-Conscious Deployments

For teams on ≤16 GB VPS instances, bare-metal servers without GPUs, or CI/CD pipelines where inference cost matters.

This is where the leaderboard tells the most interesting story, because efficiency improvements in this tier over the past year have been dramatic.

1. CodeSearch-ModernBERT-Crow-Plus — The Efficiency Winner

Retrieval score 84.59. 607 MB memory. 0.152B parameters. 768-dimensional embeddings. 1024 token context.

607 megabytes. You can load this on a cheap VPS with 2 GB of RAM and still have headroom for your application. CPU inference is fast enough for synchronous, user-facing retrieval — per-query latency in the low tens of milliseconds on a four-core cloud instance is realistic.

python
from sentence_transformers import SentenceTransformer
import numpy as np

# Loads entirely into CPU RAM; ~607 MB
model = SentenceTransformer(
    "Shuu12121/CodeSearch-ModernBERT-Crow-Plus",
    device="cpu",
)

def embed(texts: list[str], batch_size: int = 64) -> np.ndarray:
    return model.encode(
        texts,
        batch_size=batch_size,
        normalize_embeddings=True,
        show_progress_bar=False,
    )

The CodeSearchNet score of 89.30 is the best direct Solidity-proxy score of any sub-1 GB model on the leaderboard. ModernBERT’s rotary positional embeddings and flash attention bring genuine efficiency improvements, and the code search fine-tuning clearly paid off.

The constraint you must plan around is the 1024 token context window. Any Solidity contract with meaningful inheritance, a library of functions, or verbose NatSpec will overflow it. You have two practical paths.

Option A — Function-level chunking. Parse the contract and embed individual functions. This is often the right unit for retrieval anyway — you want to find the specific function that handles fee accrual, not the 800-line contract it lives in.

python
import re
from dataclasses import dataclass, field


@dataclass
class SolidityChunk:
    contract:   str
    name:       str
    kind:       str   # "function" | "modifier" | "constructor" | etc.
    source:     str
    start_line: int
    metadata:   dict = field(default_factory=dict)


def chunk_by_function(source: str, contract_name: str = "") -> list[SolidityChunk]:
    """
    Naive regex-based function extractor.
    For production use, prefer a proper Solidity AST parser
    (e.g. solidity-parser-antlr via subprocess, or slither's AST walker).
    """
    pattern = re.compile(
        r"""
        \b(?P<kind>function|modifier|constructor|receive|fallback)
        \s*              # \s* not \s+: constructor/receive/fallback go straight to "("
        (?P<name>\w+)?   # constructors / receive / fallback have no name
        [^{;]*           # visibility, returns, override, modifier list;
                         # excluding ";" skips bodyless interface declarations
        \{
        """,
        re.VERBOSE,
    )

    chunks = []
    for m in pattern.finditer(source):
        start_pos  = m.start()
        start_line = source[:start_pos].count("\n") + 1

        # Walk forward to find the matching closing brace
        depth = 0
        end   = start_pos
        for i, ch in enumerate(source[start_pos:], start_pos):
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    end = i + 1
                    break

        kind = m.group("kind")
        name = m.group("name") or kind  # unnamed receive/fallback

        chunks.append(SolidityChunk(
            contract   = contract_name,
            name       = name,
            kind       = kind,
            source     = source[start_pos:end].strip(),
            start_line = start_line,
        ))

    return chunks

Option B — Sliding window with overlap. If you need whole-contract embeddings (deduplication, similarity clustering), chunk at the token level with overlap and pool the resulting vectors at query time.

python
from transformers import AutoTokenizer
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeSearch-ModernBERT-Crow-Plus")


def sliding_window_chunks(
    text: str,
    max_tokens: int = 900,  # leave headroom under the 1024 hard limit
    overlap:    int = 100,
) -> list[str]:
    """
    Tokenise -> split into overlapping windows -> decode back to strings.
    Overlap ensures context isn't lost at seam boundaries.
    """
    ids  = tokenizer.encode(text, add_special_tokens=False)
    step = max_tokens - overlap
    windows = []
    for start in range(0, len(ids), step):
        window = ids[start : start + max_tokens]
        windows.append(tokenizer.decode(window))
        if start + max_tokens >= len(ids):
            break
    return windows


def embed_contract_windowed(
    source:   str,
    embed_fn,          # callable: list[str] -> np.ndarray
    pool:     str = "mean",
) -> np.ndarray:
    """
    Embed a full contract via sliding windows, then pool chunk vectors
    into a single representative vector.
    """
    chunks = sliding_window_chunks(source)
    vecs   = embed_fn(chunks)   # shape: (n_chunks, dim)
    if pool == "mean":
        return vecs.mean(axis=0)
    elif pool == "max":
        return vecs.max(axis=0)
    raise ValueError(f"Unknown pool strategy: {pool}")

For most audit tooling I’d start with function-level chunking (Option A) and only reach for windowed pooling if you specifically need whole-contract similarity.
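A hybrid of the two options is often the pragmatic default: embed a function directly when it fits the context window, and fall back to windowing only for oversized ones. A sketch using a rough 4-chars-per-token estimate and character-based windows so it stays self-contained — in production you’d swap in the real tokenizer and the `sliding_window_chunks` helper above:

```python
CHARS_PER_TOKEN = 4   # rough heuristic; use the model's tokenizer for accuracy
MAX_TOKENS      = 900

def texts_to_embed(function_source: str) -> list[str]:
    """Return the string(s) to embed for one function-level chunk."""
    if len(function_source) / CHARS_PER_TOKEN <= MAX_TOKENS:
        return [function_source]          # fits the window: embed directly
    # Oversized function: character-based sliding windows with overlap
    win     = MAX_TOKENS * CHARS_PER_TOKEN
    overlap = 100 * CHARS_PER_TOKEN
    step    = win - overlap
    return [function_source[i : i + win]
            for i in range(0, len(function_source), step)]

print(len(texts_to_embed("function f() public {}")))  # → 1
print(len(texts_to_embed("x" * 10_000)))              # → 4 overlapping windows
```

This keeps the index unit consistent (one function, one or a few vectors) without forcing every contract through the windowing path.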

2. jina-embeddings-v5-text-small — Long Context on a CPU Budget

Retrieval score 83.34. ~1.14 GB memory. 0.596B parameters. 1024-dimensional embeddings. 32,768 token context.

The headline differentiator is the context window. 32k tokens is enough to embed an entire Solidity file in one shot — ERC-4626 vault implementations, complex governance contracts, multi-facet Diamond proxy patterns. No chunking required.

python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer(
    "jinaai/jina-embeddings-v5-text-small",
    device="cpu",
    trust_remote_code=True,
)

def embed_whole_file(sol_path: str) -> np.ndarray:
    """
    With a 32k context window we can embed the entire file in one shot.
    At ~4 chars/token a 120k-character contract still fits comfortably.
    """
    source = open(sol_path).read()
    return model.encode(source, normalize_embeddings=True)

The StackOverflowQA score of 93.39 is the best of any sub-2 GB model in the leaderboard. If your retrieval workload mixes code with documentation, audit commentary, or protocol specs, Jina’s generalist strengths are more valuable than ModernBERT’s code-search specialisation.

3. Octen-Embedding-0.6B — The Reliable Fallback

Retrieval score 80.26. ~1.14 GB memory. 0.596B parameters. 32k token context.

Essentially the same hardware profile as Jina v5 small with a slightly lower retrieval score, but it belongs to the same model family as the 4B and 8B variants. If you scale up to a GPU later and switch to Octen-4B, you will still need to re-embed — the embedding dimensions differ, so the vectors themselves don’t carry over — but the tokenisation, instruction conventions, and operational behaviour stay familiar, which makes the migration far less disruptive than switching families.

python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "bflhc/Octen-Embedding-0.6B",
    device="cpu",
    trust_remote_code=True,
)

Category 3: Balanced Performance

Where accuracy and resource requirements are weighted together.

1. CodeSearch-ModernBERT-Crow-Plus — Best Accuracy per Megabyte, Full Stop

Already covered above, but worth restating in the balanced context: if you’re on a ≤16 GB VPS without a GPU, CodeSearch-ModernBERT doesn’t just win the efficiency category — it wins balance too. A model that scores 90+ but runs at 0.3 queries/second on your actual hardware is not better for your use case than one scoring 84 that handles 50 queries/second on the same box.

2. Octen-Embedding-4B — The GPU Sweet Spot

Retrieval score 90.81. ~7.7 GB VRAM. 4B parameters. 2560-dimensional embeddings. 32k token context.

If you have any GPU at all — a g4dn.xlarge (T4, 16 GB), an A10G, an L4, or a gaming machine with an RTX 3080+ — this is where I’d land for most production use cases. 7.7 GB VRAM is comfortable on a 16 GB card: you can batch generously, run the model alongside a vector DB and application server, and not worry about memory pressure.

The jump from ModernBERT (84.59) to Octen-4B (90.81) is 6.2 points — qualitatively different retrieval, not rounding noise. Queries that small models handle poorly (conceptually similar code with different variable names, patterns described abstractly) start working reliably.

python
from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer(
    "bflhc/Octen-Embedding-4B",
    device="cuda",
    trust_remote_code=True,
    model_kwargs={"torch_dtype": torch.float16},  # fp16 is what gives the ~7.7 GB footprint; fp32 would roughly double it
)

def embed_batch(texts: list[str], batch_size: int = 32) -> torch.Tensor:
    return model.encode(
        texts,
        batch_size=batch_size,
        convert_to_tensor=True,
        normalize_embeddings=True,
        show_progress_bar=True,
    )

3. Qwen3-Embedding-4B — The Broadest Per-Task Coverage

Retrieval score 80.07 overall — but look at the per-task breakdown: CodeSearchNetCCRetrieval of 95.59 and CodeSearchNetRetrieval of 92.34. These are the two best Solidity proxies in the benchmark. The CSN score is the best of any self-hosted model on the leaderboard, and the CSNCC score is second only to C2LLM-7B’s 97.90.

The lower overall score comes from weaker SQL and CosQA results, which are less relevant to smart contract retrieval. The CSNCC score of 95.59 puts it much closer to the Voyage API ceiling than the 80.07 headline suggests.

python
from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer(
    "Qwen/Qwen3-Embedding-4B",
    device="cuda",
    trust_remote_code=True,
    model_kwargs={"torch_dtype": torch.bfloat16},
)

# Qwen3 embedding models support an asymmetric task instruction prefix.
# Documents are encoded without any prefix; queries use the one below.
TASK_INSTRUCTION = (
    "Instruct: Given a natural language query about Solidity smart contracts, "
    "retrieve the most relevant code.\nQuery: "
)

def embed_documents_qwen(texts: list[str]) -> torch.Tensor:
    return model.encode(
        texts,
        convert_to_tensor=True,
        normalize_embeddings=True,
        prompt_name=None,
    )

def embed_query_qwen(query: str) -> torch.Tensor:
    return model.encode(
        TASK_INSTRUCTION + query,
        convert_to_tensor=True,
        normalize_embeddings=True,
    )

If your primary use case is code-to-code similarity — finding contracts that implement similar patterns, identifying copy-paste vulnerabilities across a codebase, clustering contracts by architectural similarity — Qwen3-4B’s CSNCC score of 95.59 is the number to pay attention to.
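For that code-to-code regime the retrieval loop is even simpler than query retrieval: embed every chunk, then flag pairs above a cosine threshold as near-duplicates. A sketch with numpy on pre-normalised toy vectors — in practice these come from `model.encode(..., normalize_embeddings=True)`, and the 0.9 threshold is an assumption to tune on your own corpus:

```python
import numpy as np

def near_duplicate_pairs(vecs: np.ndarray,
                         threshold: float = 0.9) -> list[tuple[int, int, float]]:
    """
    vecs: (n, dim) L2-normalised embeddings, so dot product == cosine similarity.
    Returns (i, j, similarity) for every pair above the threshold, with i < j.
    """
    sims = vecs @ vecs.T
    i_idx, j_idx = np.triu_indices(len(vecs), k=1)  # upper triangle, skip self-pairs
    mask = sims[i_idx, j_idx] >= threshold
    return [(int(i), int(j), float(sims[i, j]))
            for i, j in zip(i_idx[mask], j_idx[mask])]

# Toy vectors: rows 0 and 1 nearly identical, row 2 unrelated
v = np.array([[1.0, 0.0], [0.999, 0.0447], [0.0, 1.0]])
v /= np.linalg.norm(v, axis=1, keepdims=True)
print(near_duplicate_pairs(v))  # only the (0, 1) pair survives the threshold
```

The quadratic pairwise matrix is fine up to a few tens of thousands of chunks; beyond that, use the hnswlib index from the next section to query each chunk against the rest.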


Building a Minimal Retrieval Pipeline

Once you’ve picked a model, the retrieval pipeline itself is straightforward. Here’s a self-contained implementation using hnswlib for the vector index, which is fast, embeddable, and requires no running server:

python
import numpy as np
import hnswlib
import pickle
from pathlib import Path
from dataclasses import dataclass


@dataclass
class IndexedChunk:
    path:       str
    contract:   str
    name:       str
    kind:       str
    source:     str
    start_line: int


class SolidityIndex:
    """
    Thin wrapper around hnswlib for Solidity code retrieval.
    Agnostic to embedding model — pass in any embed_fn.
    """

    def __init__(self, dim: int, space: str = "cosine", max_elements: int = 100_000):
        self.dim   = dim
        self.index = hnswlib.Index(space=space, dim=dim)
        self.index.init_index(max_elements=max_elements, ef_construction=200, M=16)
        self.index.set_ef(50)
        self.chunks: list[IndexedChunk] = []

    def add(self, chunks: list[IndexedChunk], embeddings: np.ndarray) -> None:
        start = len(self.chunks)
        ids   = np.arange(start, start + len(chunks))
        self.index.add_items(embeddings, ids)
        self.chunks.extend(chunks)

    def search(
        self,
        query_vec: np.ndarray,
        k: int = 5,
    ) -> list[tuple[IndexedChunk, float]]:
        labels, distances = self.index.knn_query(query_vec.reshape(1, -1), k=k)
        return [
            (self.chunks[label], float(dist))
            for label, dist in zip(labels[0], distances[0])
        ]

    def save(self, path: str) -> None:
        p = Path(path)
        self.index.save_index(str(p.with_suffix(".bin")))
        with open(p.with_suffix(".meta"), "wb") as f:
            pickle.dump(self.chunks, f)

    @classmethod
    def load(cls, path: str, dim: int) -> "SolidityIndex":
        p   = Path(path)
        idx = cls(dim=dim)
        idx.index.load_index(str(p.with_suffix(".bin")), max_elements=100_000)
        with open(p.with_suffix(".meta"), "rb") as f:
            idx.chunks = pickle.load(f)
        return idx

Wiring it to a model and a directory of contracts:

python
from sentence_transformers import SentenceTransformer
from pathlib import Path

embed_model = SentenceTransformer(
    "Shuu12121/CodeSearch-ModernBERT-Crow-Plus", device="cpu"
)

def build_index(sol_dir: str, index_path: str = "solidity.index") -> SolidityIndex:
    sol_files  = list(Path(sol_dir).rglob("*.sol"))
    all_chunks: list[IndexedChunk] = []

    for sol_file in sol_files:
        source = sol_file.read_text(errors="replace")
        for c in chunk_by_function(source, sol_file.stem):
            all_chunks.append(IndexedChunk(
                path=str(sol_file), contract=c.contract, name=c.name,
                kind=c.kind, source=c.source, start_line=c.start_line,
            ))

    print(f"Indexing {len(all_chunks)} chunks from {len(sol_files)} files...")
    vecs = embed_model.encode(
        [c.source for c in all_chunks],
        batch_size=64,
        normalize_embeddings=True,
        show_progress_bar=True,
    )

    idx = SolidityIndex(dim=vecs.shape[1])
    idx.add(all_chunks, vecs)
    idx.save(index_path)
    return idx


def query_index(query: str, index_path: str = "solidity.index", k: int = 5) -> None:
    idx   = SolidityIndex.load(index_path, dim=768)
    q_vec = embed_model.encode(query, normalize_embeddings=True)

    print(f'\nTop {k} for: "{query}"\n{"─" * 60}')
    for rank, (chunk, dist) in enumerate(idx.search(q_vec, k=k), 1):
        print(f"[{rank}] {chunk.contract}::{chunk.name}  ({chunk.kind})")
        print(f"     {chunk.path}:{chunk.start_line}  •  cosine dist: {dist:.4f}")
        print(f"     {chunk.source[:120].strip()}...")
        print()

Running it:

bash
# Index a contracts directory
python -c "from retrieval import build_index; build_index('./contracts')"

# Query
python -c "
from retrieval import query_index
query_index('reentrancy guard on withdraw function')
query_index('fee-on-transfer token handling')
query_index('emergency pause with time delay')
"

A Note on Evaluation Infrastructure

Whichever model you pick, build your own eval set before you ship. This is consistently skipped, and it’s always a mistake.

A useful Solidity retrieval eval doesn’t need to be large. Fifty query-document pairs is enough to expose model weaknesses that MTEB benchmarks won’t show: your specific vocabulary, your specific contract patterns, the way your developers actually phrase questions.

python
import json
import numpy as np
import pandas as pd
from pathlib import Path
from sentence_transformers import SentenceTransformer


def load_eval_set(path: str) -> list[dict]:
    """
    Expected format — a JSON array:
    [
      {
        "query": "reentrancy guard on ETH withdrawal",
        "relevant": ["Vault::withdraw", "LendingPool::repay"],
        "hard_negatives": ["Vault::deposit"]
      },
      ...
    ]
    Label hard negatives carefully: superficially similar code that is
    functionally different is the failure mode you're testing against.
    """
    return json.loads(Path(path).read_text())


def recall_at_k(
    model:         SentenceTransformer,
    corpus_chunks: list[IndexedChunk],
    corpus_vecs:   np.ndarray,
    eval_set:      list[dict],
    k:             int = 5,
) -> float:
    """
    Compute Recall@K across the eval set.
    A query is a hit if at least one relevant chunk appears in the top-K results.
    Vectors must be L2-normalised (cosine sim == dot product).
    """
    hits = 0
    for item in eval_set:
        q_vec = model.encode(item["query"], normalize_embeddings=True)
        sims  = corpus_vecs @ q_vec
        top_k = np.argsort(sims)[::-1][:k]

        retrieved = {
            f"{corpus_chunks[i].contract}::{corpus_chunks[i].name}"
            for i in top_k
        }
        if any(r in retrieved for r in item["relevant"]):
            hits += 1

    return hits / len(eval_set)


def compare_models(
    model_ids:     list[str],
    corpus_chunks: list[IndexedChunk],
    eval_set:      list[dict],
    ks:            list[int] = [1, 5, 10],
) -> pd.DataFrame:
    """
    Run every candidate model through the eval set and return a comparison
    table, sorted by R@5. Run this before committing to a full re-index.
    """
    texts = [c.source for c in corpus_chunks]
    rows  = []

    for model_id in model_ids:
        print(f"Evaluating {model_id}...")
        m    = SentenceTransformer(model_id, device="cpu", trust_remote_code=True)
        vecs = m.encode(texts, normalize_embeddings=True, show_progress_bar=False)
        row  = {"model": model_id}
        for k in ks:
            row[f"R@{k}"] = recall_at_k(m, corpus_chunks, vecs, eval_set, k=k)
        rows.append(row)

    return pd.DataFrame(rows).sort_values("R@5", ascending=False)

The leaderboard tells you which models are likely to be good. Your eval set tells you which model is actually good for your problem. They usually agree. When they disagree, trust your eval.


Comparison Matrix

python
SELECTED = [
    "voyage-4-large (embed_dim=2048)",
    "llama-embed-nemotron-8b",
    "Octen-Embedding-8B",
    "Octen-Embedding-4B",
    "MoD-Embedding",
    "CodeSearch-ModernBERT-Crow-Plus",
    "jina-embeddings-v5-text-small",
    "Octen-Embedding-0.6B",
    "Qwen3-Embedding-4B",
    "C2LLM-7B",
]

display_cols = [
    "Model", "Retrieval",
    "Memory Usage (MB)", "Number of Parameters (B)",
    "Embedding Dimensions", "Max Tokens",
    "CodeSearchNetCCRetrieval", "CodeSearchNetRetrieval",
]

matrix = (
    merged[merged["Model"].isin(SELECTED)][display_cols]
    .sort_values("Retrieval", ascending=False)
    .rename(columns={
        "Memory Usage (MB)":          "Mem (MB)",
        "Number of Parameters (B)":   "Params (B)",
        "Embedding Dimensions":        "Embed Dim",
        "Max Tokens":                  "Max Tok",
        "CodeSearchNetCCRetrieval":    "CSNCC ★",
        "CodeSearchNetRetrieval":      "CSN ★",
    })
)

print(matrix.to_markdown(index=False, floatfmt=".2f"))
code
| Model                           | Retrieval | Mem (MB) | Params (B) | Embed Dim | Max Tok | CSNCC ★ | CSN ★ |
|:--------------------------------|----------:|---------:|-----------:|----------:|--------:|--------:|------:|
| voyage-4-large (embed_dim=2048) |     97.29 |        — |          — |      2048 |   32000 |       — |     — |
| llama-embed-nemotron-8b         |     96.59 |    28629 |       7.50 |      4096 |   32768 |       — |     — |
| Octen-Embedding-8B              |     93.36 |    14433 |       7.57 |      4096 |   32768 |       — |     — |
| Octen-Embedding-4B              |     90.81 |     7671 |       4.02 |      2560 |   32768 |       — |     — |
| MoD-Embedding                   |     90.01 |     7671 |       4.02 |      2560 |   32768 |       — |     — |
| CodeSearch-ModernBERT-Crow-Plus |     84.59 |      607 |       0.15 |       768 |    1024 |       — | 89.30 |
| jina-embeddings-v5-text-small   |     83.34 |     1137 |       0.60 |      1024 |   32768 |       — |     — |
| C2LLM-7B                        |     80.75 |    14624 |       7.67 |      3584 |   32768 |   97.90 | 91.07 |
| Octen-Embedding-0.6B            |     80.26 |     1136 |       0.60 |      1024 |   32768 |       — |     — |
| Qwen3-Embedding-4B              |    80.07* |     7671 |       4.02 |      2560 |   32768 |   95.59 | 92.34 |

* Overall score understates Solidity-proxy performance: CSN 92.34 is the best self-hosted score on that task, and CSNCC 95.59 is second only to C2LLM-7B among self-hosted models.


Practical Decision Framework

Rather than a flowchart (which always lies to you by making complex decisions seem tidy), here’s how I’d approach the choice.

Start with your deployment environment, not the accuracy table.

  • CPU-only VPS, ≤4 GB RAM → CodeSearch-ModernBERT-Crow-Plus. Plan your chunking strategy carefully given the 1024-token limit.
  • CPU-only VPS, 4–16 GB RAM, need 32k context → jina-embeddings-v5-text-small or Octen-Embedding-0.6B. Embed whole files without chunking.
  • GPU with 8–16 GB VRAM → Octen-Embedding-4B for general use; Qwen3-Embedding-4B if queries are heavy on code-to-code similarity.
  • 16 GB GPU, want maximum local accuracy → Octen-Embedding-8B. It fits, it’s fast, 93.36 is genuinely good.
  • Hardware cost is not a constraint, accuracy is paramount → voyage-4-large via API with appropriate data handling controls.

Then ask whether you’re indexing once or continuously. Offline indexing amortises slow inference over many queries — you pay the cost once. Online indexing (contracts ingested continuously and immediately queryable) creates pressure toward faster inference, pushing toward smaller models or better hardware.
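A back-of-the-envelope calculation makes the once-versus-continuous distinction concrete. The throughput figures below are illustrative assumptions, not benchmarks — measure your own before deciding a model is too slow:

```python
def indexing_minutes(n_chunks: int, chunks_per_sec: float) -> float:
    """Wall-clock minutes to embed a corpus at a given throughput."""
    return n_chunks / chunks_per_sec / 60

# Hypothetical corpus: ~200k lines of Solidity, ~20 functions per 1k lines
n_chunks = 200_000 // 1000 * 20   # ≈ 4,000 function-level chunks

print(f"CPU  (~40 chunks/s): {indexing_minutes(n_chunks, 40):.1f} min")   # 1.7 min
print(f"GPU (~400 chunks/s): {indexing_minutes(n_chunks, 400):.1f} min")  # 0.2 min
```

At these scales even slow CPU inference is tolerable for one-off indexing; the calculus only flips when contracts arrive continuously and must be queryable within seconds of ingestion.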

Then ask about query type distribution. Natural-language-heavy workloads → StackOverflowQA score is your best proxy; Nemotron-8B and Qwen3-4B lead here. Code-to-code workloads → CSNCC and CSN matter more; Qwen3-4B and C2LLM-7B lead in this regime. Mixed or unpredictable → pick a model with consistent scores across task types rather than one that peaks on a single benchmark.
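The mixed-distribution case can be made quantitative: weight the per-task scores by your estimated query mix instead of trusting any single column. A sketch with pandas — the Octen per-task values and the workload weights here are stand-in assumptions, not leaderboard data:

```python
import pandas as pd

# Per-task scores; Qwen3 figures are from the snapshot, Octen figures are stand-ins
scores = pd.DataFrame({
    "Model":                    ["Qwen3-Embedding-4B", "Octen-Embedding-4B"],
    "CodeSearchNetCCRetrieval": [95.59, 91.0],
    "CodeSearchNetRetrieval":   [92.34, 90.0],
    "StackOverflowQA":          [85.0,  92.0],
}).set_index("Model")

# Estimated query mix: 50% code-to-code, 30% docstring-to-code, 20% NL questions
weights = {"CodeSearchNetCCRetrieval": 0.5,
           "CodeSearchNetRetrieval":   0.3,
           "StackOverflowQA":          0.2}

blended = sum(scores[task] * w for task, w in weights.items())
print(blended.sort_values(ascending=False))
```

The weights are the honest part of this exercise: if you can’t estimate your query mix even roughly, that’s a signal to log real queries for a week before committing to a model.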


Conclusion

The short version: for CPU-only deployments, CodeSearch-ModernBERT-Crow-Plus is a remarkable achievement at 607 MB and 84.59 retrieval — just plan your chunking strategy. For GPU-equipped servers, Octen-Embedding-4B at 7.7 GB VRAM and 90.81 retrieval is the sweet spot of 2026. For workloads where code-to-code similarity matters most, Qwen3-Embedding-4B’s CSNCC of 95.59 and CSN of 92.34 make it the strongest all-round self-hosted pick on the Solidity proxies, with C2LLM-7B’s CSNCC of 97.90 the raw peak on that single task. If you need the frontier and can handle an API dependency, voyage-4-large at 97.29 is the ceiling.

None of these are permanent answers. The leaderboard moves fast — something that didn’t exist six months ago is now the efficiency leader, and I’d expect the same pace of change through 2026. Keep your eval harness around so you can run new models through it quickly, and set a calendar reminder to re-evaluate against new leaderboard entries every quarter.

The benchmark gap between “no Solidity task exists” and “we have good embedding coverage for smart contracts” is real and worth closing. If you’re working on building a Solidity-specific retrieval benchmark, the community needs it.


Data source: MTEB Code Leaderboard, February 25 2026. Per-Task Scores · Summary

#solidity · #embedding models · #code retrieval · #MTEB · #RAG · #smart contract tooling · #voyage-4-large · #Octen Embedding · #Qwen3 Embedding · #ModernBERT · #vector search

© 2026 Pravidhi. All rights reserved.