Reduce final clustering pass sample size (#130451)
Figuring out the right balance between index throughput and speed is tricky. Initially I was digging into reducing the "neighborhood" size for the fix-up phase. That did speed things up, but it harmed recall a bit too much in my tests. While I do think there is ground to be covered there, I pivoted to reducing the sample size of the final clustering pass, since we now have true random sampling (instead of taking the first N docs).

In the best case, this improves force-merge time by 25% with zero change in recall. On the lower end, the improvement is only about 8%. I really do think there is ground to be recovered in the fix-up phase, but this is a nice improvement :).

```
index_name                           index_type  num_docs  index_time(ms)  force_merge_time(ms)  num_segments
-----------------------------------  ----------  --------  --------------  --------------------  ------------
corpus-quora-E5-small.fvec.flat      ivf           500000           17443                 18422             0
cohere-wikipedia-docs-768d.vec       ivf          2000000          156320                193383             0
corpus-dbpedia-entity-arctic-0.fvec  ivf          1000000           92902                 82131             0

index_name                           index_type  n_probe  latency(ms)  net_cpu_time(ms)  avg_cpu_count      QPS  recall   visited
-----------------------------------  ----------  -------  -----------  ----------------  -------------  -------  ------  --------
corpus-quora-E5-small.fvec.flat      ivf              10         0.95              0.00           0.00  1052.63    0.83   5713.06
corpus-quora-E5-small.fvec.flat      ivf              20         0.69              0.00           0.00  1449.28    0.89  10620.80
corpus-quora-E5-small.fvec.flat      ivf              30         0.81              0.00           0.00  1234.57    0.92  15498.94
corpus-quora-E5-small.fvec.flat      ivf              40         0.94              0.00           0.00  1063.83    0.93  20088.68
corpus-quora-E5-small.fvec.flat      ivf              50         1.11              0.00           0.00   900.90    0.94  24801.41
cohere-wikipedia-docs-768d.vec       ivf              10         1.20              0.00           0.00   833.33    0.66   2824.19
cohere-wikipedia-docs-768d.vec       ivf              20         1.33              0.00           0.00   751.88    0.74   4875.23
cohere-wikipedia-docs-768d.vec       ivf              30         1.44              0.00           0.00   694.44    0.79   6974.69
cohere-wikipedia-docs-768d.vec       ivf              40         1.56              0.00           0.00   641.03    0.81   9147.20
cohere-wikipedia-docs-768d.vec       ivf              50         1.66              0.00           0.00   602.41    0.83  11478.62
cohere-wikipedia-docs-768d.vec       ivf              60         1.80              0.00           0.00   555.56    0.85  13863.93
cohere-wikipedia-docs-768d.vec       ivf              70         1.96              0.00           0.00   510.20    0.87  16301.12
cohere-wikipedia-docs-768d.vec       ivf              80         2.05              0.00           0.00   487.80    0.88  18761.24
cohere-wikipedia-docs-768d.vec       ivf              90         2.18              0.00           0.00   458.72    0.89  21185.38
cohere-wikipedia-docs-768d.vec       ivf             100         2.27              0.00           0.00   440.53    0.90  23648.77
corpus-dbpedia-entity-arctic-0.fvec  ivf              10         0.79              0.00           0.00  1265.82    0.52   3654.77
corpus-dbpedia-entity-arctic-0.fvec  ivf              20         0.97              0.00           0.00  1030.93    0.61   7170.57
corpus-dbpedia-entity-arctic-0.fvec  ivf              30         1.13              0.00           0.00   884.96    0.67  10761.73
corpus-dbpedia-entity-arctic-0.fvec  ivf              40         1.27              0.00           0.00   787.40    0.70  14550.00
corpus-dbpedia-entity-arctic-0.fvec  ivf              50         1.42              0.00           0.00   704.23    0.72  18149.22
corpus-dbpedia-entity-arctic-0.fvec  ivf              60         1.61              0.00           0.00   621.12    0.74  21971.72
corpus-dbpedia-entity-arctic-0.fvec  ivf              70         1.74              0.00           0.00   574.71    0.76  25612.96
corpus-dbpedia-entity-arctic-0.fvec  ivf              80         1.94              0.00           0.00   515.46    0.77  29311.67
corpus-dbpedia-entity-arctic-0.fvec  ivf              90         2.05              0.00           0.00   487.80    0.78  33034.66
corpus-dbpedia-entity-arctic-0.fvec  ivf             100         2.23              0.00           0.00   448.43    0.80  36743.77
```
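
To illustrate why true random sampling makes a smaller sample safer than "the first N docs": a uniform sample stays representative of the whole index, whereas a prefix can be biased by insertion order. Below is a minimal sketch using reservoir sampling (Algorithm R). It is an assumption for illustration only, not the sampling code used by the clustering pass; the class name `ReservoirSampleSketch` and the method `sampleIndices` are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration; not the sampling code from the actual change.
final class ReservoirSampleSketch {

    // Returns min(n, k) distinct indices drawn uniformly at random from [0, n).
    static int[] sampleIndices(int n, int k, long seed) {
        if (n < 0 || k < 0) {
            throw new IllegalArgumentException("n and k must be non-negative");
        }
        int size = Math.min(n, k);
        int[] reservoir = new int[size];
        // Start with the first `size` indices (this alone is the "first N docs" strategy).
        for (int i = 0; i < size; i++) {
            reservoir[i] = i;
        }
        Random random = new Random(seed);
        // Each later index i replaces a random slot with probability size / (i + 1).
        for (int i = size; i < n; i++) {
            int j = random.nextInt(i + 1);
            if (j < size) {
                reservoir[j] = i;
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        // Draw 10 of 1000 indices; the seed is fixed only to make runs repeatable.
        int[] sample = sampleIndices(1_000, 10, 42L);
        System.out.println(Arrays.toString(sample));
    }
}
```

Every index ends up in the sample with the same probability, which is the property that lets the sample shrink without skewing the centroids toward early documents.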
parent f81d35536d
commit a9625cec7a
@@ -67,7 +67,7 @@ public class HierarchicalKMeans {
         // partition the space
         KMeansIntermediate kMeansIntermediate = clusterAndSplit(vectors, targetSize);
         if (kMeansIntermediate.centroids().length > 1 && kMeansIntermediate.centroids().length < vectors.size()) {
-            int localSampleSize = Math.min(kMeansIntermediate.centroids().length * samplesPerCluster, vectors.size());
+            int localSampleSize = Math.min(kMeansIntermediate.centroids().length * samplesPerCluster / 2, vectors.size());
             KMeansLocal kMeansLocal = new KMeansLocal(localSampleSize, maxIterations, clustersPerNeighborhood, DEFAULT_SOAR_LAMBDA);
             kMeansLocal.cluster(vectors, kMeansIntermediate, true);
         }
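
The effect of the one-line change can be checked in isolation. The sketch below reproduces only the arithmetic from the diff. The class name `SampleSizeSketch`, the method names, and the input values (100 centroids, 256 samples per cluster) are hypothetical; the real value of `samplesPerCluster` is not shown in this diff.

```java
// Hypothetical stand-alone version of the sample size arithmetic from the diff.
final class SampleSizeSketch {

    // Sample size before the change: centroids * samplesPerCluster, capped at the vector count.
    static int before(int numCentroids, int samplesPerCluster, int numVectors) {
        return Math.min(numCentroids * samplesPerCluster, numVectors);
    }

    // Sample size after the change: the product is halved before the cap is applied.
    // `a * b / 2` evaluates left to right as (a * b) / 2, with integer division truncating.
    static int after(int numCentroids, int samplesPerCluster, int numVectors) {
        return Math.min(numCentroids * samplesPerCluster / 2, numVectors);
    }

    public static void main(String[] args) {
        // Assumed inputs, chosen only for illustration.
        System.out.println(before(100, 256, 1_000_000)); // 25600
        System.out.println(after(100, 256, 1_000_000));  // 12800
        // When the vector count is the binding cap, halving changes nothing.
        System.out.println(before(100, 256, 10_000));    // 10000
        System.out.println(after(100, 256, 10_000));     // 10000
    }
}
```

Note that the halving only takes effect when `centroids * samplesPerCluster / 2` is below `vectors.size()`; for small inputs the cap still decides the sample size, which may be one reason the measured speedup varies between datasets.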