Reduce final clustering pass sample size (#130451)

Striking the right balance between index throughput and speed is tricky.
Initially I looked into reducing the "neighborhood" size for the fix-up
phase. That did speed things up, but it harmed recall a bit too much in
my tests. I still think there is ground to be covered there, but I
pivoted to reducing the sample size, since we now have true random
sampling (instead of taking the first N docs).
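
As a rough illustration of why the sampling method matters here, the sketch below contrasts a "first N docs" sample with a uniform random one. It is not the Elasticsearch implementation: `SamplingSketch`, `firstN`, and `reservoirSample` are hypothetical names, and reservoir sampling is only one possible way to draw a uniform sample. A uniform sample represents the whole index regardless of document order, so a smaller sample is less likely to be biased.

```java
import java.util.Arrays;
import java.util.Random;

public class SamplingSketch {
    // Hypothetical helper: ordinals 0..n-1, i.e. what a "first N docs" sample picks.
    static int[] firstN(int total, int n) {
        int size = Math.min(n, total);
        int[] out = new int[size];
        for (int i = 0; i < size; i++) {
            out[i] = i;
        }
        return out;
    }

    // Hypothetical helper: reservoir sampling (Algorithm R), which draws a
    // uniform sample of ordinals without replacement in a single pass.
    static int[] reservoirSample(int total, int n, long seed) {
        int size = Math.min(n, total);
        int[] reservoir = new int[size];
        for (int i = 0; i < size; i++) {
            reservoir[i] = i; // start with the first `size` ordinals
        }
        Random random = new Random(seed);
        for (int i = size; i < total; i++) {
            int j = random.nextInt(i + 1); // uniform in [0, i]
            if (j < size) {
                reservoir[j] = i; // replace a random slot with probability size / (i + 1)
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        // Always the same prefix of the index, so it is biased by document order.
        System.out.println(Arrays.toString(firstN(1000, 5))); // prints [0, 1, 2, 3, 4]
        // Spread across the whole index; exact values depend on the seed.
        System.out.println(Arrays.toString(reservoirSample(1000, 5, 42L)));
    }
}
```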

In the best case, this improves force-merge time by 25% with zero
change in recall. At the lower end, the improvement is only about 8%.

I really do think there is ground to be recovered in the "fix up phase",
but this is a nice improvement :).

```
index_name                           index_type  num_docs  index_time(ms)  force_merge_time(ms)  num_segments
-----------------------------------  ----------  --------  --------------  --------------------  ------------
corpus-quora-E5-small.fvec.flat             ivf    500000           17443                 18422             0
cohere-wikipedia-docs-768d.vec              ivf   2000000          156320                193383             0
corpus-dbpedia-entity-arctic-0.fvec         ivf   1000000           92902                 82131             0

index_name                           index_type  n_probe  latency(ms)  net_cpu_time(ms)  avg_cpu_count      QPS  recall   visited
-----------------------------------  ----------  -------  -----------  ----------------  -------------  -------  ------  --------
corpus-quora-E5-small.fvec.flat             ivf       10         0.95              0.00           0.00  1052.63    0.83   5713.06
corpus-quora-E5-small.fvec.flat             ivf       20         0.69              0.00           0.00  1449.28    0.89  10620.80
corpus-quora-E5-small.fvec.flat             ivf       30         0.81              0.00           0.00  1234.57    0.92  15498.94
corpus-quora-E5-small.fvec.flat             ivf       40         0.94              0.00           0.00  1063.83    0.93  20088.68
corpus-quora-E5-small.fvec.flat             ivf       50         1.11              0.00           0.00   900.90    0.94  24801.41
cohere-wikipedia-docs-768d.vec              ivf       10         1.20              0.00           0.00   833.33    0.66   2824.19
cohere-wikipedia-docs-768d.vec              ivf       20         1.33              0.00           0.00   751.88    0.74   4875.23
cohere-wikipedia-docs-768d.vec              ivf       30         1.44              0.00           0.00   694.44    0.79   6974.69
cohere-wikipedia-docs-768d.vec              ivf       40         1.56              0.00           0.00   641.03    0.81   9147.20
cohere-wikipedia-docs-768d.vec              ivf       50         1.66              0.00           0.00   602.41    0.83  11478.62
cohere-wikipedia-docs-768d.vec              ivf       60         1.80              0.00           0.00   555.56    0.85  13863.93
cohere-wikipedia-docs-768d.vec              ivf       70         1.96              0.00           0.00   510.20    0.87  16301.12
cohere-wikipedia-docs-768d.vec              ivf       80         2.05              0.00           0.00   487.80    0.88  18761.24
cohere-wikipedia-docs-768d.vec              ivf       90         2.18              0.00           0.00   458.72    0.89  21185.38
cohere-wikipedia-docs-768d.vec              ivf      100         2.27              0.00           0.00   440.53    0.90  23648.77
corpus-dbpedia-entity-arctic-0.fvec         ivf       10         0.79              0.00           0.00  1265.82    0.52   3654.77
corpus-dbpedia-entity-arctic-0.fvec         ivf       20         0.97              0.00           0.00  1030.93    0.61   7170.57
corpus-dbpedia-entity-arctic-0.fvec         ivf       30         1.13              0.00           0.00   884.96    0.67  10761.73
corpus-dbpedia-entity-arctic-0.fvec         ivf       40         1.27              0.00           0.00   787.40    0.70  14550.00
corpus-dbpedia-entity-arctic-0.fvec         ivf       50         1.42              0.00           0.00   704.23    0.72  18149.22
corpus-dbpedia-entity-arctic-0.fvec         ivf       60         1.61              0.00           0.00   621.12    0.74  21971.72
corpus-dbpedia-entity-arctic-0.fvec         ivf       70         1.74              0.00           0.00   574.71    0.76  25612.96
corpus-dbpedia-entity-arctic-0.fvec         ivf       80         1.94              0.00           0.00   515.46    0.77  29311.67
corpus-dbpedia-entity-arctic-0.fvec         ivf       90         2.05              0.00           0.00   487.80    0.78  33034.66
corpus-dbpedia-entity-arctic-0.fvec         ivf      100         2.23              0.00           0.00   448.43    0.80  36743.77
```
Benjamin Trent 2025-07-02 10:04:11 -04:00 committed by GitHub
parent f81d35536d
commit a9625cec7a
1 changed file with 1 addition and 1 deletion

@@ -67,7 +67,7 @@ public class HierarchicalKMeans {
         // partition the space
         KMeansIntermediate kMeansIntermediate = clusterAndSplit(vectors, targetSize);
         if (kMeansIntermediate.centroids().length > 1 && kMeansIntermediate.centroids().length < vectors.size()) {
-            int localSampleSize = Math.min(kMeansIntermediate.centroids().length * samplesPerCluster, vectors.size());
+            int localSampleSize = Math.min(kMeansIntermediate.centroids().length * samplesPerCluster / 2, vectors.size());
             KMeansLocal kMeansLocal = new KMeansLocal(localSampleSize, maxIterations, clustersPerNeighborhood, DEFAULT_SOAR_LAMBDA);
             kMeansLocal.cluster(vectors, kMeansIntermediate, true);
         }
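
To make the effect of the one-line change concrete, here is a minimal standalone sketch of the two sample-size expressions from the diff. The class and method names are hypothetical, and the inputs (100 centroids, 256 samples per cluster) are illustrative values only, not defaults taken from the codebase. Two details are worth noting: the division is integer division applied after the multiplication, and the halving has no effect when the result is already capped by the number of vectors.

```java
public class SampleSizeSketch {
    // Expression before the change: centroids * samplesPerCluster, capped at the vector count.
    static int oldLocalSampleSize(int numCentroids, int samplesPerCluster, int numVectors) {
        return Math.min(numCentroids * samplesPerCluster, numVectors);
    }

    // Expression after the change: the product is halved (integer division) before capping.
    static int newLocalSampleSize(int numCentroids, int samplesPerCluster, int numVectors) {
        return Math.min(numCentroids * samplesPerCluster / 2, numVectors);
    }

    public static void main(String[] args) {
        // Illustrative inputs (assumed, not real defaults): plenty of vectors, so the cap is inactive.
        System.out.println(oldLocalSampleSize(100, 256, 1_000_000)); // prints 25600
        System.out.println(newLocalSampleSize(100, 256, 1_000_000)); // prints 12800

        // Few vectors: both expressions hit the cap, so the change makes no difference.
        System.out.println(oldLocalSampleSize(100, 256, 10_000)); // prints 10000
        System.out.println(newLocalSampleSize(100, 256, 10_000)); // prints 10000

        // Odd product: 3 * 5 / 2 evaluates left to right as 15 / 2, which truncates to 7.
        System.out.println(newLocalSampleSize(3, 5, 1_000)); // prints 7
    }
}
```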