In applications today we represent "things" as rows in a table. A row
for each book in a Goodreads dataset. A row for each product in an
Amazon dataset. And for the last few decades we’ve been working on
ways to turn the textual data (as well as associated graphics, audio, etc.)
into meaningful vectors of floating-point numbers.
The idea is that when we insert a row into a table we also store, in
one of its columns, a vector representing the row. The vector is the
result of passing values from the row (perhaps the title and
description) to a function F.
When a user queries the application, perhaps wanting to
find books about ancient Egypt, their query is run through the same
function F to produce a new vector. This vector can be trivially
compared to the existing vectors in the database to find the nearest
matches: the books most relevant to ancient Egypt.
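In rough pseudocode (a toy sketch with made-up helper names; the concrete F we’ll build below uses a local model and llama.cpp):

# Toy sketch: F is any function mapping text to a vector of floats.
# insert_row and rows_ordered_by_similarity are hypothetical helpers.
def insert_book(db, title, description):
    embedding = F(title + " " + description)  # computed once, at insert time
    db.insert_row(title=title, description=description, embedding=embedding)

def search_books(db, query_text):
    query_embedding = F(query_text)  # the same F used at insert time
    # rank stored rows by how close their embedding is to the query's
    return db.rows_ordered_by_similarity(query_embedding)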
The result of F is called an embedding (vector). F itself is a
combination of:
1. a machine learning embedding model (such as nomic-embed-text-v2-moe or all-MiniLM-L6-v2) in a format like GGUF, ONNX, safetensors, etc.
2. an inference engine (such as vLLM, llama.cpp, TGI, onnxruntime, etc.)
3. the hardware you run it on (GPU, CPU, available memory, etc.)
4. any other settings (context size, etc.)
The inference engine (2) interprets the model (1) and decides how to
execute it given your hardware (3) and settings (4). When you use an
AI provider like OpenAI, Anthropic, or Gemini, all four of these
components may be proprietary, and they are managed for you. Using a
provider may also be simpler, and potentially even cheaper, than
executing F locally.
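With a hosted provider the whole of F collapses into a single API call. Here is a sketch using OpenAI’s Python client; the model name is just an example:

# Hosted F: the model, engine, hardware, and settings all live on the provider's side.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(
    model="text-embedding-3-small",  # example model name
    input="books about ancient egypt",
)
vector = response.data[0].embedding  # a list of floats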
If you run F yourself, there also seems to be no guarantee that a
given model will work with a given inference engine, nor that the
inference engine will support your hardware, and so on. For example,
vLLM did not seem to like my M4 MacBook Air, and llama.cpp did not
seem to like Qwen3 models.
However, we can still run some F locally. In the rest of this post
we’ll assume a fresh Ubuntu virtual machine: we’ll install a runtime,
download a model, and then semantically search a dataset. We’ll even
turn off networking during the semantic search step to show we’re not
doing anything at all with online services.
First install uv to manage Python
dependencies.
$ curl -LsSf https://astral.sh/uv/install.sh | sh
$ source $HOME/.local/bin/env
Create a new uv project and add
llama-cpp-python as the sole
dependency. It is a Python wrapper around llama.cpp, an LLM inference
engine that itself has no dependencies.
$ uv init
$ # If the `uv add` command fails, run `sudo apt -y install build-essential`.
$ uv add llama-cpp-python
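You can sanity-check that the dependency built correctly (it compiles llama.cpp from source during installation, hence the build-essential note above) by importing it. Printing `__version__` is optional; the attribute exists in recent llama-cpp-python releases:
$ uv run python -c "import llama_cpp; print(llama_cpp.__version__)"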
Finally, download the nomic-embed-text-v2-moe model in GGUF format
(the GGUF repository offers a few different
quantizations
of the same model). We’ll use the
nomic-embed-text-v2-moe.Q4_K_M.gguf quantization, which (as indicated
in the
readme)
is a middle-of-the-road option for performance and memory usage.
$ curl -LO https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe-GGUF/resolve/main/nomic-embed-text-v2-moe.Q4_K_M.gguf
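Before cutting off networking, it’s worth confirming the file downloaded fully (the curl above follows redirects, so a truncated or HTML error response would show up as a suspiciously small file):
$ ls -lh nomic-embed-text-v2-moe.Q4_K_M.gguf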
Now disable all network connections except SSH.
$ ping -c 1 -W 1 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=117 time=1.38 ms
--- 8.8.8.8 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.383/1.383/1.383/0.000 ms
$ sudo ufw allow ssh
Rules updated
Rules updated (v6)
$ sudo ufw default deny incoming
Default incoming policy changed to 'deny'
(be sure to update your rules accordingly)
$ sudo ufw default deny outgoing
Default outgoing policy changed to 'deny'
(be sure to update your rules accordingly)
$ sudo ufw enable
Command may disrupt existing ssh connections. Proceed with operation (y|n)? y
Firewall is active and enabled on system startup
$ ping -c 1 -W 1 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
--- 8.8.8.8 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
Now let’s make a little dataset in Python of very different types of things.
import math
import os
from llama_cpp import Llama
dataset = ['dog', 'milkshake', 'college course', 'korea', 'granite', 'cat', 'question', 'crocodile', 'pineapple']
We’ll implement a little cosine similarity function to compare vector embeddings.
def cosine_similarity(v1, v2):
    dot_product = sum(a * b for a, b in zip(v1, v2))
    magnitude1 = math.sqrt(sum(a * a for a in v1))
    magnitude2 = math.sqrt(sum(b * b for b in v2))
    return dot_product / (magnitude1 * magnitude2)
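Cosine similarity measures the angle between two vectors, ignoring their lengths: 1 means they point in the same direction, 0 means they are orthogonal, -1 means they point in opposite directions. A quick sanity check:

assert abs(cosine_similarity([1.0, 0.0], [2.0, 0.0]) - 1.0) < 1e-9  # same direction
assert cosine_similarity([1.0, 0.0], [0.0, 1.0]) == 0.0             # orthogonal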
Then we’ll configure Llama to use the model we downloaded and we’ll
generate embeddings for each item in our dataset.
llm = Llama(model_path="./nomic-embed-text-v2-moe.Q4_K_M.gguf",
            embedding=True,
            verbose=False)

embeddings = []
for item in dataset:
    embeddings.append((item, llm.embed("search_document: " + item)))
Nomic
says
to prepend input you’ll store with `search_document: ` and to
prepend input you’re comparing against with `search_query: `. I
don’t understand this. But it seems to make a tangible difference.
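Each embedding is just a flat list of floats. You can peek at the dimensionality; the model card lists 768 dimensions, though treat the exact number as something to verify rather than assume:

print(len(embeddings[0][1]))  # the embedding dimension; expected to be 768 for this model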
Finally we’ll retrieve a user’s query from environment variables, grab
its embedding vector, and use cosine similarity to find and print the
most similar items in our dataset.
query_embedding = llm.embed("search_query: " + os.environ["QUERY"])

results = []
for (item, embedding) in embeddings:
    score = cosine_similarity(query_embedding, embedding)
    results.append((item, score))

results.sort(key=lambda x: x[1], reverse=True)
for (item, _) in results:
    print(item)
llama.cpp is uncontrollably verbose, so we’ll redirect stderr to
/dev/null when we run this program. Let’s try it out.
$ QUERY="country in asia" uv run main.py 2>/dev/null
korea
question
college course
dog
crocodile
cat
granite
pineapple
milkshake
That makes sense.
$ QUERY="common pet" uv run main.py 2>/dev/null
dog
cat
question
crocodile
college course
milkshake
granite
pineapple
korea
That makes sense too!
$ QUERY="calculus" uv run main.py 2>/dev/null
college course
question
dog
cat
crocodile
granite
korea
pineapple
milkshake
As does that. But not everything does:
$ QUERY="prickly" uv run main.py 2>/dev/null
question
cat
crocodile
dog
granite
milkshake
korea
pineapple
college course
Ok, ok. Now in the real world the dataset would probably be rows in a
table. We’d generate the embeddings once, on insert and update, and
we’d store them in the database itself. You must use the same model
for the vector embeddings you store and for the vector embedding you
generate for your query.
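Here’s a rough sketch of that shape, reusing `llm` and `cosine_similarity` from above and using SQLite from the standard library purely for illustration (the table and column names are made up):

import json
import sqlite3

db = sqlite3.connect("books.db")
db.execute("CREATE TABLE IF NOT EXISTS books (id INTEGER PRIMARY KEY, title TEXT, embedding TEXT)")

def upsert_book(book_id, title):
    # Embed once, at insert/update time, with the same model used for queries.
    vector = llm.embed("search_document: " + title)
    db.execute(
        "INSERT INTO books (id, title, embedding) VALUES (?, ?, ?) "
        "ON CONFLICT(id) DO UPDATE SET title = excluded.title, embedding = excluded.embedding",
        (book_id, title, json.dumps(vector)),
    )
    db.commit()

def search_books(query, limit=3):
    query_vector = llm.embed("search_query: " + query)
    rows = db.execute("SELECT title, embedding FROM books").fetchall()  # still a flat, O(N) scan
    scored = [(title, cosine_similarity(query_vector, json.loads(emb))) for title, emb in rows]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [title for title, _ in scored[:limit]]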
We’re doing O(N) full-table scans here (ML people seem to call this a
Flat index, or brute-force search). There are probabilistic indexing
methods for approximate nearest neighbor (ANN) search, including
Inverted File Indexes (IVF) and Hierarchical Navigable Small World
(HNSW) graphs.
Vector databases (Qdrant,
Weaviate,
turbopuffer, etc.), vector database
extensions (Postgres’s
cube extension,
pgvector, MariaDB’s Vector
type, etc.), and
in-memory libraries
(FAISS,
hnswlib,
Annoy, etc.) are used to store
and efficiently search vector embeddings (with indexes like IVF,
HNSW, etc.), not to create the embeddings themselves. Embeddings are
created with a model and an inference engine, or with a third-party
service.
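For a sense of what the in-memory-library route looks like, here is a hedged sketch with hnswlib, indexing the embeddings we already generated. You’d need to `uv add hnswlib numpy` first, and the index parameters are just common defaults, not tuned recommendations:

# Sketch: an approximate-nearest-neighbor (HNSW) index over the embeddings
# generated earlier, queried with the same query_embedding as before.
import hnswlib
import numpy as np

vectors = np.array([vector for _, vector in embeddings], dtype=np.float32)

index = hnswlib.Index(space="cosine", dim=vectors.shape[1])
index.init_index(max_elements=len(vectors), ef_construction=200, M=16)
index.add_items(vectors, np.arange(len(vectors)))

labels, distances = index.knn_query(np.array(query_embedding, dtype=np.float32), k=3)
for label in labels[0]:
    print(dataset[label])  # approximate nearest neighbors, most similar first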
About the author
Phil is the founder of The Consensus. Before this, he contributed to
Postgres products at EnterpriseDB, cofounded and led marketing at
TigerBeetle, and was an engineering manager at Oracle. He runs the
Software Internals Discord, the NYC Systems Coffee Club, the
Software Internals Email Book Club, and co-runs NYC
Systems. @eatonphil