picking a vector index without overthinking it

if you are working with vector search, at some point you have to pick an index. the three you will run into most often are ivfflat, hnsw and streamingdiskann. they all do the same basic job, finding the vectors closest to your query without scanning every row, but they trade off speed, accuracy, memory and build time differently. here is how they compare and when to use each one.

(if you are on postgres, the first two come from pgvector and streamingdiskann comes from pgvectorscale.)
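for reference, here is roughly what creating each index looks like on postgres. the statements are kept as plain strings so you can run them with whatever client you already use. the table name items and the column name embedding are made up for this example, the lists, m and ef_construction values are just starting points, and the cosine operator class is only one of the available choices.

```python
# ddl strings only, nothing here talks to a database.
# "items" and "embedding" are hypothetical names for this example.
INDEX_DDL = {
    # pgvector: cluster-based index, "lists" is the number of buckets
    "ivfflat": (
        "CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops) "
        "WITH (lists = 100);"
    ),
    # pgvector: graph-based index
    "hnsw": (
        "CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops) "
        "WITH (m = 16, ef_construction = 64);"
    ),
    # pgvectorscale: the index type is registered under the name "diskann"
    "streamingdiskann": (
        "CREATE INDEX ON items USING diskann (embedding vector_cosine_ops);"
    ),
}

for name, ddl in INDEX_DDL.items():
    print(f"{name}: {ddl}")
```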

ivfflat

ivfflat is a cluster-based index. it groups your vectors into buckets, and when you search it only looks inside the buckets closest to your query instead of scanning everything.
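to make the bucket idea concrete, here is a toy version in plain python. it is a sketch of the general technique (k-means buckets plus a probes setting), not how pgvector implements it, and every function name below is made up for this example.

```python
import math
import random


def assign(vectors, centroids):
    """put each vector id into the bucket of its nearest centroid."""
    buckets = [[] for _ in centroids]
    for idx, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: math.dist(v, centroids[c]))
        buckets[nearest].append(idx)
    return buckets


def build_ivf(vectors, n_lists, n_iters=10, seed=0):
    """a few rounds of k-means, then a final assignment into buckets."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, n_lists)]
    dim = len(vectors[0])
    for _ in range(n_iters):
        buckets = assign(vectors, centroids)
        for c, ids in enumerate(buckets):
            if ids:  # an empty bucket keeps its old centroid
                centroids[c] = [sum(vectors[i][d] for i in ids) / len(ids) for d in range(dim)]
    return centroids, assign(vectors, centroids)


def search_ivf(query, vectors, centroids, buckets, k=3, probes=1):
    """only scan the `probes` buckets whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
    candidates = [i for c in order[:probes] for i in buckets[c]]
    candidates.sort(key=lambda i: math.dist(query, vectors[i]))
    return candidates[:k]


rng = random.Random(42)
vectors = [[rng.random(), rng.random()] for _ in range(200)]
centroids, buckets = build_ivf(vectors, n_lists=8)
query = [0.5, 0.5]
fast = search_ivf(query, vectors, centroids, buckets, k=3, probes=1)   # one bucket
exact = search_ivf(query, vectors, centroids, buckets, k=3, probes=8)  # every bucket
print(fast, exact)
```

with probes equal to the number of buckets the search degenerates into a full scan, so it is exact. with fewer probes it is faster but can miss a neighbor that landed in a bucket it did not check.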

pros

it does not use much memory, so it is cheap to run.
it is much faster than a linear scan because it skips most of the data.

cons

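the clusters are fixed at build time, so it needs to be rebuilt as your data grows. otherwise the clusters stop matching your data and recall drops.
recall is usually a bit lower than the other two, since it only checks a few buckets. checking more buckets (the probes setting in pgvector) raises recall but slows the search down.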

verdict

good for medium-sized data, roughly 100k to 1m vectors, where you care more about cost than squeezing out the last bit of accuracy.

hnsw

hnsw is a graph-based index. it connects vectors to their neighbors and walks the graph to find close matches, which makes search both fast and accurate.
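here is a small sketch of what "walking the graph" means. it is heavily simplified: real hnsw stacks several layers of graphs and builds them incrementally, while this uses a single layer built by brute force. the function names and the m, ef and entry parameters are made up for this example.

```python
import heapq
import math
import random


def build_knn_graph(vectors, m=4):
    """link every vector to its m nearest neighbors (brute force, toy sizes only)."""
    graph = {}
    for i, v in enumerate(vectors):
        others = sorted(
            (j for j in range(len(vectors)) if j != i),
            key=lambda j: math.dist(v, vectors[j]),
        )
        graph[i] = others[:m]
    return graph


def graph_search(query, vectors, graph, entry=0, k=3, ef=8):
    """best-first walk that keeps the ef closest nodes seen so far."""
    d0 = math.dist(query, vectors[entry])
    visited = {entry}
    candidates = [(d0, entry)]  # min-heap: nodes still to expand
    best = [(-d0, entry)]       # max-heap via negated distance: ef best so far
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(best) >= ef and d > -best[0][0]:
            break  # the closest open candidate is worse than the worst kept result
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = math.dist(query, vectors[nb])
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)  # drop the current worst
    ranked = sorted((-neg_d, node) for neg_d, node in best)
    return [node for _, node in ranked[:k]]


rng = random.Random(7)
vectors = [[rng.random(), rng.random()] for _ in range(200)]
graph = build_knn_graph(vectors, m=8)
print(graph_search([0.5, 0.5], vectors, graph, entry=0, k=3, ef=16))
```

the search only ever measures distances to nodes it reaches through links, which is why it is fast. a larger ef explores more of the graph and improves recall at the cost of speed.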

pros

good balance between speed and accuracy.
you can add new vectors without rebuilding the whole index.
the index can build in parallel, which helps because hnsw builds are slower than ivfflat builds.
you can add quantization to bring the memory cost down.

cons

it works best when the whole graph fits in memory, so it uses a lot of ram.
since it lives in ram, your cost goes up as your data grows.
that ram dependency also means it does not scale endlessly.
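filtered search can lose accuracy. in pgvector the filter runs after the index has returned its candidates, so a selective filter can throw most of them away and leave you with fewer or worse matches than you asked for.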
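a common way filtered search loses results is post-filtering: the index hands back a fixed number of candidates and the filter runs afterwards. the sketch below fakes the index with an exact sort so the effect is easy to see. the function name, the ef_search parameter and the row layout are made up for this example.

```python
import math


def post_filtered_search(query, rows, k, ef_search, keep):
    """stand-in for an index scan: take the ef_search nearest rows, then filter."""
    candidates = sorted(rows, key=lambda r: math.dist(query, r["vec"]))[:ef_search]
    return [r for r in candidates if keep(r)][:k]


def exact_filtered_search(query, rows, k, keep):
    """ground truth: filter first, then rank everything that is left."""
    matching = [r for r in rows if keep(r)]
    return sorted(matching, key=lambda r: math.dist(query, r["vec"]))[:k]


# 100 rows on a line, where only every tenth row is in category "a"
rows = [{"id": i, "vec": [float(i)], "cat": "a" if i % 10 == 0 else "b"} for i in range(100)]
only_a = lambda r: r["cat"] == "a"

approx = post_filtered_search([0.0], rows, k=5, ef_search=10, keep=only_a)
exact = exact_filtered_search([0.0], rows, k=5, keep=only_a)
print([r["id"] for r in approx])  # [0]
print([r["id"] for r in exact])   # [0, 10, 20, 30, 40]
```

we asked for 5 matches and got 1, because 9 of the 10 candidates failed the filter. raising the candidate count helps, but the more selective the filter, the more candidates you have to pull to fill the result.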

verdict

best for real-time search on medium workloads, again around 100k to 1m vectors.

streamingdiskann

streamingdiskann is built to live on disk instead of fully in memory, which is what lets it handle much larger datasets.

pros

high accuracy.
scales to 1b+ vectors.
binary quantization by default.
cost scales mostly with disk instead of ram, and since disk is cheaper this keeps things affordable at scale.
handles updates without needing a rebuild (this is the "streaming" part).
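to show why binary quantization saves so much space, here is the plain textbook version: keep one bit per dimension and compare vectors by counting the bits that differ. this is not pgvectorscale's implementation (its default is a statistical refinement of the idea), and the function names and the 768 dimension figure are made up for this example.

```python
def binary_quantize(vec):
    """one bit per dimension, 1 if the value is positive, packed into an int."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming(a, b):
    """number of bit positions where the two codes differ."""
    return bin(a ^ b).count("1")


def bytes_per_vector(dims, bits_per_dim):
    return dims * bits_per_dim // 8


a = binary_quantize([0.5, -1.0, 2.0, -0.1])  # 0b1010
b = binary_quantize([-0.5, 1.0, 2.0, -0.3])  # 0b0110
print(hamming(a, b))              # 2
print(bytes_per_vector(768, 32))  # 3072 bytes as float32
print(bytes_per_vector(768, 1))   # 96 bytes at one bit per dimension
```

that is a 32x reduction per vector. the price is precision, which is why quantized indexes typically re-check the top candidates against the full vectors before returning them.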

cons

longer build times compared to hnsw.

verdict

best when you need real-time search, filtered search and large scale all at once.

conclusion

there is no single best index. it depends on your data size and what you actually care about.

if you want something cheap and simple for medium data, go with ivfflat. if you want fast and accurate real-time search and you have the ram for it, hnsw is a solid pick. and if you are dealing with huge datasets or need filtered search at scale, streamingdiskann is the one to reach for.
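if it helps, the same advice fits in a few lines of code. this is only the rule of thumb from this post written down, not a benchmark result. the function name, its parameters and the 1m cutoff (taken from the "roughly 100k to 1m" range above) are assumptions for this example.

```python
def pick_index(n_vectors, needs_filtered_search=False, ram_is_tight=False):
    """rule of thumb from this post, with a hypothetical 1m vector cutoff."""
    if n_vectors > 1_000_000 or needs_filtered_search:
        return "streamingdiskann"
    if ram_is_tight:
        return "ivfflat"
    return "hnsw"


print(pick_index(500_000))                              # hnsw
print(pick_index(500_000, ram_is_tight=True))           # ivfflat
print(pick_index(500_000, needs_filtered_search=True))  # streamingdiskann
print(pick_index(50_000_000))                           # streamingdiskann
```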