In applications today we represent "things" as rows in a table. A row
for each book in a Goodreads dataset. A row for each product in an
Amazon dataset. And for the last few decades we’ve been working on
ways to turn the textual data (as well as associated graphics, audio, etc.)
into meaningful vectors of floating-point numbers.
The idea is that when we insert a row into a table we also store, in
one of its columns, a vector representing the row. The vector is the
result of passing values from the row (perhaps the title and
description) to a function F.
When a user queries the application, perhaps wanting to
find books about ancient Egypt, their query is run through the same
function F to produce a new vector. This vector can be trivially
compared to the existing vectors in the database to find the nearest
matches: the books most relevant to ancient Egypt.
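In rough pseudocode (a toy sketch with made-up helper names; the concrete F we’ll build below uses a local model and llama.cpp):

# Toy sketch: F is any function mapping text to a vector of floats.
# insert_row and rows_ordered_by_similarity are hypothetical helpers.
def insert_book(db, title, description):
    embedding = F(title + " " + description)  # computed once, at insert time
    db.insert_row(title=title, description=description, embedding=embedding)

def search_books(db, query_text):
    query_embedding = F(query_text)  # the same F used at insert time
    # rank stored rows by how close their embedding is to the query's
    return db.rows_ordered_by_similarity(query_embedding)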
The result of F is called an embedding (vector). F itself is a
combination of:
1. a machine learning embedding model (such as nomic-embed-text-v2-moe or all-MiniLM-L6-v2) in a format like GGUF, ONNX, safetensors, etc.
2. an inference engine (such as vLLM, llama.cpp, TGI, onnxruntime, etc.)
3. the hardware you run it on (GPU, CPU, available memory, etc.)
4. any other settings (context size, etc.)
The inference engine (2) interprets the model (1) and decides how to
execute it given your hardware (3) and settings (4). When you use an
AI provider like OpenAI, Anthropic, or Gemini, all four of these
components may be proprietary, and they are managed for you. Using a
provider may also be simpler, and potentially even cheaper, than
executing F locally.
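With a hosted provider the whole of F collapses into a single API call. Here is a sketch using OpenAI’s Python client; the model name is just an example:

# Hosted F: the model, engine, hardware, and settings all live on the provider's side.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(
    model="text-embedding-3-small",  # example model name
    input="books about ancient egypt",
)
vector = response.data[0].embedding  # a list of floats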
If you run F yourself, there also seems to be no guarantee that a
given model will work with a given inference engine, nor that the
inference engine will support your hardware, and so on. For example,
vLLM did not seem to like my M4 MacBook Air, and llama.cpp did not
seem to like Qwen3 models.
However, we can still run some F locally. In the rest of this post
we’ll assume a fresh Ubuntu virtual machine: we’ll install a runtime,
download a model, and then semantically search a dataset. We’ll even
turn off networking during the semantic search step to show we’re not
doing anything at all with online services.
First install uv to manage Python
dependencies.
$ curl -LsSf https://astral.sh/uv/install.sh | sh
$ source $HOME/.local/bin/env
Create a new uv project and add
llama-cpp-python as the sole
dependency. It is a Python wrapper around llama.cpp, an LLM inference
engine that itself has no dependencies.
$ uv init
$ # If the `uv add` command fails, run `sudo apt -y install build-essential`.
$ uv add llama-cpp-python
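You can sanity-check that the dependency built correctly (it compiles llama.cpp from source during installation, hence the build-essential note above) by importing it. Printing `__version__` is optional; the attribute exists in recent llama-cpp-python releases:
$ uv run python -c "import llama_cpp; print(llama_cpp.__version__)"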
Finally, download the nomic-embed-text-v2-moe model in GGUF format
(the GGUF repository offers a few different
quantizations
of the same model). We’ll use the
nomic-embed-text-v2-moe.Q4_K_M.gguf quantization, which (as indicated
in the
readme)
is a middle-of-the-road option for performance and memory usage.
$ curl -LO https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe-GGUF/resolve/main/nomic-embed-text-v2-moe.Q4_K_M.gguf
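Before cutting off networking, it’s worth confirming the file downloaded fully (the curl above follows redirects, so a truncated or HTML error response would show up as a suspiciously small file):
$ ls -lh nomic-embed-text-v2-moe.Q4_K_M.gguf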
Now disable all network connections except SSH.
$ ping -c 1 -W 1 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=117 time=1.38 ms
--- 8.8.8.8 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.383/1.383/1.383/0.000 ms
$ sudo ufw allow ssh
Rules updated
Rules updated (v6)
$ sudo ufw default deny incoming
Default incoming policy changed to 'deny'
(be sure to update your rules accordingly)
$ sudo ufw default deny outgoing
Default outgoing policy changed to 'deny'
(be sure to update your rules accordingly)
$ sudo ufw enable
Command may disrupt existing ssh connections. Proceed with operation (y|n)? y
Firewall is active and enabled on system startup
$ ping -c 1 -W 1 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
--- 8.8.8.8 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
Now let’s make a little dataset in Python of very different types of things.
import math
import os
from llama_cpp import Llama
dataset = ['dog', 'milkshake', 'college course', 'korea', 'granite', 'cat', 'question', 'crocodile', 'pineapple']
We’ll implement a little cosine similarity function to compare vector embeddings.
def cosine_similarity(v1, v2):
    dot_product = sum(a * b for a, b in zip(v1, v2))
    magnitude1 = math.sqrt(sum(a * a for a in v1))
    magnitude2 = math.sqrt(sum(b * b for b in v2))
    return dot_product / (magnitude1 * magnitude2)
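Cosine similarity measures the angle between two vectors, ignoring their lengths: 1 means they point in the same direction, 0 means they are orthogonal, -1 means they point in opposite directions. A quick sanity check:

assert abs(cosine_similarity([1.0, 0.0], [2.0, 0.0]) - 1.0) < 1e-9  # same direction
assert cosine_similarity([1.0, 0.0], [0.0, 1.0]) == 0.0             # orthogonal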
Then we’ll configure Llama to use the model we downloaded and we’ll
generate embeddings for each item in our dataset.
llm = Llama(model_path="./nomic-embed-text-v2-moe.Q4_K_M.gguf",
            embedding=True,
            verbose=False)

embeddings = []
for item in dataset:
    embeddings.append((item, llm.embed("search_document: " + item)))
Nomic
says
to prepend input you’ll store with `search_document: ` and to
prepend input you’re comparing against with `search_query: `. I
don’t understand this. But it seems to make a tangible difference.
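Each embedding is just a flat list of floats. You can peek at the dimensionality; the model card lists 768 dimensions, though treat the exact number as something to verify rather than assume:

print(len(embeddings[0][1]))  # the embedding dimension; expected to be 768 for this model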
Finally we’ll retrieve a user’s query from environment variables, grab
its embedding vector, and use cosine similarity to find and print the
most similar items in our dataset.
query_embedding = llm.embed("search_query: " + os.environ["QUERY"])

results = []
for (item, embedding) in embeddings:
    score = cosine_similarity(query_embedding, embedding)
    results.append((item, score))

results.sort(key=lambda x: x[1], reverse=True)
for (item, _) in results:
    print(item)
llama.cpp is uncontrollably verbose, so we’ll redirect stderr to
/dev/null when we run this program. Let’s try it out.
$ QUERY="country in asia" uv run main.py 2>/dev/null
korea
question
college course
dog
crocodile
cat
granite
pineapple
milkshake
That makes sense.
$ QUERY="common pet" uv run main.py 2>/dev/null
dog
cat
question
crocodile
college course
milkshake
granite
pineapple
korea
That makes sense too!
$ QUERY="calculus" uv run main.py 2>/dev/null
college course
question
dog
cat
crocodile
granite
korea
pineapple
milkshake
As does that. But not everything does:
$ QUERY="prickly" uv run main.py 2>/dev/null
question
cat
crocodile
dog
granite
milkshake
korea
pineapple
college course
Ok, ok. Now in the real world the dataset would probably be rows in a
table. We’d generate the embeddings once, on insert and update, and
we’d store them in the database itself. You must use the same model
for the vector embeddings you store and for the vector embedding you
generate for your query.
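Here’s a rough sketch of that shape, reusing `llm` and `cosine_similarity` from above and using SQLite from the standard library purely for illustration (the table and column names are made up):

import json
import sqlite3

db = sqlite3.connect("books.db")
db.execute("CREATE TABLE IF NOT EXISTS books (id INTEGER PRIMARY KEY, title TEXT, embedding TEXT)")

def upsert_book(book_id, title):
    # Embed once, at insert/update time, with the same model used for queries.
    vector = llm.embed("search_document: " + title)
    db.execute(
        "INSERT INTO books (id, title, embedding) VALUES (?, ?, ?) "
        "ON CONFLICT(id) DO UPDATE SET title = excluded.title, embedding = excluded.embedding",
        (book_id, title, json.dumps(vector)),
    )
    db.commit()

def search_books(query, limit=3):
    query_vector = llm.embed("search_query: " + query)
    rows = db.execute("SELECT title, embedding FROM books").fetchall()  # still a flat, O(N) scan
    scored = [(title, cosine_similarity(query_vector, json.loads(emb))) for title, emb in rows]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [title for title, _ in scored[:limit]]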
We’re doing O(N) full-table scans here (ML people seem to call this a
Flat index, or brute-force search). There are probabilistic indexing
methods for approximate nearest neighbor (ANN) search, including
Inverted File Indexes (IVF) and Hierarchical Navigable Small World
(HNSW) graphs.
Vector databases (Qdrant,
Weaviate,
turbopuffer, etc.), vector database
extensions (Postgres’s
cube extension,
pgvector, MariaDB’s Vector
type, etc.), and
in-memory libraries
(FAISS,
hnswlib,
Annoy, etc.) are used to store
and efficiently search vector embeddings (with indexes like IVF,
HNSW, etc.), not to create the embeddings themselves. Embeddings are
created with a model and an inference engine, or with a third-party
service.
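For a sense of what the in-memory-library route looks like, here is a hedged sketch with hnswlib, indexing the embeddings we already generated. You’d need to `uv add hnswlib numpy` first, and the index parameters are just common defaults, not tuned recommendations:

# Sketch: an approximate-nearest-neighbor (HNSW) index over the embeddings
# generated earlier, queried with the same query_embedding as before.
import hnswlib
import numpy as np

vectors = np.array([vector for _, vector in embeddings], dtype=np.float32)

index = hnswlib.Index(space="cosine", dim=vectors.shape[1])
index.init_index(max_elements=len(vectors), ef_construction=200, M=16)
index.add_items(vectors, np.arange(len(vectors)))

labels, distances = index.knn_query(np.array(query_embedding, dtype=np.float32), k=3)
for label in labels[0]:
    print(dataset[label])  # approximate nearest neighbors, most similar first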
About the author
Phil is the founder of The Consensus. Before this, he contributed to
Postgres products at EnterpriseDB, cofounded and led marketing at
TigerBeetle, and was an engineering manager at Oracle. He runs the
Software Internals Discord, the NYC Systems Coffee Club, the
Software Internals Email Book Club, and co-runs NYC
Systems. @eatonphil