Compared to the merge_request_diff association:
1. It's simpler to query. The query uses a foreign key to the
merge_request_diffs table, so no ordering is necessary.
2. It's faster for preloading. The merge_request_diff association has to load
every diff for the MRs in the set, then discard all but the most recent for
each. This association means that Rails can just query for N diffs from N
MRs.
3. It's more complicated to update. This is a bidirectional foreign key, so we
need to update two tables when adding a diff record. This also means we need
to handle this as a special case when importing a GitLab project.
There is some juggling with this association in the merge request model:
* `MergeRequest#latest_merge_request_diff` is _always_ the latest diff.
* `MergeRequest#merge_request_diff` reuses
`MergeRequest#latest_merge_request_diff` unless:
* Arguments are passed. These are typically to force-reload the association.
* It doesn't exist. That means we might be trying to implicitly create a
diff. This only seems to happen in specs.
* The association is already loaded. This is important for the reasons
explained in the comment, which I'll reiterate here: if we a) load a
non-latest diff, then b) get its `merge_request`, then c) get that MR's
`merge_request_diff`, we should get the diff we loaded in c), even though
that's not the latest diff.
Basically, `MergeRequest#merge_request_diff` is the latest diff in most cases,
but not quite all.
This updates the composite index on ci_pipelines (project_id, ref,
status) to also include the "id" column at the end. Adding this column
to the index drastically improves the performance of queries used for
getting the latest pipeline for a particular branch. For example, on
project dashboards we'll run a query like the following:
SELECT ci_pipelines.*
FROM ci_pipelines
WHERE ci_pipelines.project_id = 13083
AND ci_pipelines.ref = 'master'
AND ci_pipelines.status = 'success'
ORDER BY ci_pipelines.id DESC
LIMIT 1;
Limit (cost=0.43..58.88 rows=1 width=224) (actual time=26.956..26.956 rows=1 loops=1)
Buffers: shared hit=6544 dirtied=16
-> Index Scan Backward using ci_pipelines_pkey on ci_pipelines (cost=0.43..830922.89 rows=14216 width=224) (actual time=26.954..26.954 rows=1 loops=1)
Filter: ((project_id = 13083) AND ((ref)::text = 'master'::text) AND ((status)::text = 'success'::text))
Rows Removed by Filter: 6476
Buffers: shared hit=6544 dirtied=16
Planning time: 1.484 ms
Execution time: 27.000 ms
Because of the lack of "id" in the index we end up scanning over the
primary key index, then applying a filter to filter out any remaining
rows. The more pipelines a GitLab instance has the slower this will get.
By adding "id" to the mentioned composite index we can change the above
plan into the following:
Limit (cost=0.56..2.01 rows=1 width=224) (actual time=0.034..0.034 rows=1 loops=1)
Buffers: shared hit=5
-> Index Scan Backward using yorick_test on ci_pipelines (cost=0.56..16326.37 rows=11243 width=224) (actual time=0.033..0.033 rows=1 loops=1)
Index Cond: ((project_id = 13083) AND ((ref)::text = 'master'::text) AND ((status)::text = 'success'::text))
Buffers: shared hit=5
Planning time: 0.695 ms
Execution time: 0.061 ms
This in turn leads to a best-case improvement of roughly 25
milliseconds, give or take a millisecond or two.
We need to make sure merge_requests.source_project_id allows NULL values
_before_ updating rows as otherwise this will lead to a NOT NULL
constraint failing.
add_column_with_default is implemented in terms of update_column_in_batches, but
update_column_in_batches can be used independently. Neither of these should be
used on the specified large tables, because they will cause issues on large
instances like GitLab.com.
This also ignores the cop for all existing migrations, renaming
AddColumnWithDefaultToLargeTable where appropriate.
also, I refactored the MergeRequest#fetch_ref method to express
the side-effect that this method has.
MergeRequest#fetch_ref -> MergeRequest#fetch_ref!
Repository#fetch_source_branch -> Repository#fetch_source_branch!