The previous implementation iterated across the entire patch set
to determine the number of lines added, deleted, and changed. Rugged
has a native method `Rugged::Diff#stat` that does this already,
which appears to be a little faster and require less RAM than doing
this ourselves.
Improves performance in #41524
Because priorities shifted for the Gitaly team, this request does not
get a dedicated endpoint yet. To make it work in a cloud native
environment, the request needs to go to Gitaly, not Rugged. This is
achieved by rerouting to the generic TreeEntry endpoint.
By importing this Ruby code into gitlab-rails (and gitaly-ruby), we avoid
200ms of startup time for each gitlab_projects subprocess we are eliminating.
By not having a gitlab_projects subprocess between gitlab-rails / sidekiq and
any git subprocesses (e.g. for fork_project, fetch_remote, etc. calls), we can
also manage these git processes more cleanly, and avoid sending SIGKILL to them.
Moving the check out of the general requests makes sure we don't have
any slowdown in the regular requests.
To keep the process performing these checks small, the check is still
performed inside a unicorn, but it is called from a process running
on the same server.
Because the checks are now done outside normal requests, we can have a
simpler failure strategy:
The check is now performed in the background every
`circuitbreaker_check_interval`. Failures are logged in Redis. The
failures are reset when the check succeeds. Per check we will try
`circuitbreaker_access_retries` times within
`circuitbreaker_storage_timeout` seconds.
When the number of failures exceeds
`circuitbreaker_failure_count_threshold`, we will block access to the
storage.
After `failure_reset_time` of no checks, we will clear the stored
failures. This could happen when the process that performs the checks
is not running.
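The failure bookkeeping described above can be sketched roughly as follows. This is only an illustration: the `StorageHealth` class and its method names are hypothetical, and an in-memory Hash stands in for Redis.

```ruby
# Hypothetical sketch: an in-memory Hash stands in for Redis.
class StorageHealth
  def initialize(failure_count_threshold:, failure_reset_time:)
    @failure_count_threshold = failure_count_threshold
    @failure_reset_time = failure_reset_time
    @failures = Hash.new(0) # storage name => failed checks since last success
    @last_check = {}        # storage name => time of the last check
  end

  # Called by the background checker once per check interval.
  def record_check(storage, success, now: Time.now)
    expire_stale_failures(storage, now)
    @last_check[storage] = now
    if success
      @failures.delete(storage) # a successful check resets the failures
    else
      @failures[storage] += 1
    end
  end

  # Access is blocked once the failure count exceeds the threshold.
  def blocked?(storage, now: Time.now)
    expire_stale_failures(storage, now)
    @failures[storage] > @failure_count_threshold
  end

  private

  # Clears failures when no check has run for longer than the reset time,
  # for example because the checking process is not running.
  def expire_stale_failures(storage, now)
    last = @last_check[storage]
    return unless last && now - last > @failure_reset_time

    @failures.delete(storage)
  end
end
```

Passing `now:` explicitly keeps the sketch deterministic; a real implementation would more likely rely on key expiry in the backing store.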
Prior to this MR there were two GitHub related importers:
* Github::Import: the main importer used for GitHub projects
* Gitlab::GithubImport: importer that's somewhat confusingly used for
importing Gitea projects (apparently they have a compatible API)
This MR renames the Gitea importer to Gitlab::LegacyGithubImport and
introduces a new GitHub importer in the Gitlab::GithubImport namespace.
This new GitHub importer uses Sidekiq for importing multiple resources
in parallel, though it also has the ability to import data sequentially
should this be necessary.
The new code is spread across the following directories:
* lib/gitlab/github_import: this directory contains most of the importer
code such as the classes used for importing resources.
* app/workers/gitlab/github_import: this directory contains the Sidekiq
workers, most of which simply use the code from the directory above.
* app/workers/concerns/gitlab/github_import: this directory provides a
few modules that are included in every GitHub importer worker.
== Stages
The import work is divided into separate stages, with each stage
importing a specific set of data. Stages will schedule the work that
needs to be performed, followed by scheduling a job for the
"AdvanceStageWorker" worker. This worker will periodically check if all
work is completed and schedule the next stage if this is the case. If
work is not yet completed this worker will reschedule itself.
Using this approach we don't have to block threads by calling `sleep()`,
as doing so for large projects could block the thread from doing any
work for many hours.
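A minimal sketch of the "check, then advance or reschedule" idea, assuming a plain Array in place of the Sidekiq queue. The `StageAdvancer` class, the job names and the stage names used below are hypothetical.

```ruby
# Hypothetical sketch: an Array of [job, stage] pairs stands in for Sidekiq.
class StageAdvancer
  attr_reader :scheduled

  def initialize(stages)
    @stages = stages
    @scheduled = []
  end

  # remaining_jobs is the number of jobs of the current stage still running.
  def perform(current_stage, remaining_jobs)
    if remaining_jobs.zero?
      next_stage = @stages[@stages.index(current_stage) + 1]
      @scheduled << [:start_stage, next_stage] if next_stage
    else
      # Not done yet: schedule another check instead of sleeping in a thread.
      @scheduled << [:advance_stage, current_stage]
    end
  end
end
```

Because every check is a short job of its own, no thread is held while a stage is still running.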
== Retrying Work
Workers will reschedule themselves whenever necessary. For example,
hitting the GitHub API's rate limit will result in jobs rescheduling
themselves. These jobs are not processed until the rate limit has been
reset.
== User Lookups
Part of the importing process involves looking up user details in the
GitHub API so we can map them to GitLab users. The old importer used
an in-memory cache, but this obviously doesn't work when the work is
spread across different threads.
The new importer uses a Redis cache and makes sure we only perform
API/database calls if absolutely necessary. Frequently used keys are
refreshed, and lookup misses are also cached; removing the need for
performing API/database calls if we know we don't have the data we're
looking for.
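The caching of lookup misses could look roughly like this. It is a sketch under assumptions: a Hash stands in for Redis, the block stands in for the API/database lookup, the `UserLookupCache` name is hypothetical, and key refreshing is left out.

```ruby
# Hypothetical sketch: a Hash stands in for Redis.
class UserLookupCache
  MISS = :miss # sentinel stored when a lookup found nothing

  attr_reader :lookups

  def initialize
    @store = {}
    @lookups = 0 # number of times the expensive lookup actually ran
  end

  # Returns the cached GitLab user ID for a GitHub username, or nil.
  # The block is only called when nothing is cached, not even a miss.
  def fetch(username)
    if @store.key?(username)
      cached = @store[username]
      return cached == MISS ? nil : cached
    end

    @lookups += 1
    found = yield(username)
    @store[username] = found.nil? ? MISS : found
    found
  end
end
```

Storing a sentinel rather than nothing is what lets a repeated lookup for an unknown user return without another API/database call.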
== Performance & Models
The new importer in various places uses raw INSERT statements (as
generated by `Gitlab::Database.bulk_insert`) instead of using Rails
models. This allows us to bypass any validations and callbacks,
drastically reducing the number of SQL queries and Gitaly RPC calls
necessary to import projects.
To ensure the code produces valid data the corresponding tests check if
the produced rows are valid according to the model validation rules.
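To illustrate why raw inserts are cheap, here is a sketch of building one multi-row INSERT from an Array of Hashes. It is not the `Gitlab::Database.bulk_insert` implementation: `build_bulk_insert` and `quote_value` are hypothetical helpers, and the quoting is deliberately naive.

```ruby
# Hypothetical helper; real code should rely on the database adapter's quoting.
def build_bulk_insert(table, rows)
  return nil if rows.empty?

  columns = rows.first.keys
  tuples = rows.map do |row|
    values = columns.map { |column| quote_value(row.fetch(column)) }
    "(#{values.join(', ')})"
  end

  "INSERT INTO #{table} (#{columns.join(', ')}) VALUES #{tuples.join(', ')}"
end

# Naive quoting for illustration only: NULL, numbers, and escaped strings.
def quote_value(value)
  case value
  when nil then 'NULL'
  when Numeric then value.to_s
  else "'#{value.to_s.gsub("'", "''")}'"
  end
end
```

All rows go to the database in one statement, and no model is instantiated, so no validations or callbacks run.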
This allows input to start processing immediately without waiting for the process to complete.
This also allows long or infinite inputs to be partially processed;
when reading stops, the process is terminated with SIGPIPE.
Also, I refactored the MergeRequest#fetch_ref method to express
the side-effect that this method has.
MergeRequest#fetch_ref -> MergeRequest#fetch_ref!
Repository#fetch_source_branch -> Repository#fetch_source_branch!
Instead of only checking once within a timeout, check multiple times
within a timeout.
That means that with a timeout of 30 seconds and 3 retries, each try
would be allowed 10 seconds.
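A rough sketch of checking several times within one overall timeout, assuming the timeout is split evenly across the tries. The `available_within?` helper is hypothetical; the block stands in for the actual storage check.

```ruby
require 'timeout'

# Hypothetical sketch: tries the block up to `retries` times, giving each
# try an equal share of the overall timeout.
def available_within?(timeout_seconds, retries)
  per_try = timeout_seconds.fdiv(retries)

  retries.times do
    begin
      return true if Timeout.timeout(per_try) { yield }
    rescue Timeout::Error
      # This try ran out of time; move on to the next one.
    end
  end

  false
end
```

A check that hangs once therefore only consumes its own share of the timeout, and a later try can still succeed.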
The circuitbreaker now has 2 failure modes:
- Backing off: This will raise the `Gitlab::Git::Storage::Failing`
exception. Access to the shard is blocked temporarily.
- Circuit broken: This will raise the
`Gitlab::Git::Storage::CircuitBroken` exception. Access to the shard
will be blocked until the failures are reset.
Replaces all the explicit include metadata syntax in the specs (tag:
true) with the implicit one (:tag).
Added a cop to prevent future errors and handle autocorrection.
When calling pre-receive, post-receive, and update hooks, add the GitLab
username as the GL_USERNAME environment variable.
This patch only handles cases where pushes are over http, or via
the web interface. Later, we will address the ssh case.
If the ref doesn't exist, and the source branch is deleted, we can't get it back
easily. Previously, we ignored this error by shelling out, so replicate that
behaviour.
In gitlab-org/gitlab-ee!2976, we saw that a given OID could point
to a commit, which would cause the delta size check to fail.
Gitaly already returns nil if the OID isn't a blob, so this change
makes the Rugged implementation consistent.
Users of project mirrors would see that the number of branches did not
match the number in the branch dropdown because remote branches were
counted when Rugged was in use. With Gitaly, only local branches
are counted.
Closes #36934
Otherwise some features would go untested in non-specific contexts
I did need to disable the
`gitlab_git_diff_size_limit_increase`-feature in some specs since we
depend on diffs being expandable while the file we are testing on is
smaller than the increased limit.
Submodules have a name in the configuration, but this name is simply
the path at which the submodule was initially checked in (by default
-- the name is totally arbitrary). If a submodule is moved, it
retains its original name, but its path changes. Since we discover
submodules inside trees, we have their path but not necessarily their
name.
Make the submodules() function return the submodule hash indexed by
path rather than name, so that renamed submodules can be looked up.
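As an illustration of indexing by path, here is a small sketch that parses `.gitmodules` content. The `submodules_by_path` helper is hypothetical and only handles the simple `key = value` form.

```ruby
# Hypothetical parser for .gitmodules content; returns a Hash indexed by path.
def submodules_by_path(gitmodules)
  by_path = {}
  current = nil

  gitmodules.each_line do |line|
    case line.strip
    when /\A\[submodule "(.+)"\]\z/
      # A new section starts; remember the configured name.
      current = { 'name' => Regexp.last_match(1) }
    when /\A(\w+)\s*=\s*(.*)\z/
      next unless current

      key = Regexp.last_match(1)
      value = Regexp.last_match(2)
      current[key] = value
      by_path[value] = current if key == 'path'
    end
  end

  by_path
end
```

A submodule that was added as `vendor/old` and later moved to `vendor/new` is then found under `vendor/new`, which is the path seen in the tree.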
Signed-off-by: David Turner <novalis@novalis.org>
This adds an ID-less table containing one row per file, per merge request
diff. It has a column for each attribute on Gitlab::Git::Diff that is serialised
currently, with the advantage that we can easily query the attributes of this
new table.
It does not migrate existing data, so we have fallback code when the legacy
st_diffs column is present instead. For a merge request diff to be valid, it
should have at most one of:
* Rows in this new table, with the correct merge_request_diff_id.
* A non-NULL st_diffs column.
It may have neither, if the diff is empty.
This is controlled with the feature flag gitlab_git_diff_size_limit_increase.
Both of these limits were basically picked arbitrarily in the first place;
disabling the feature flag reverts to the old limits.
- Previously, we sorted commits by date, which seemed to work okay.
- The one edge case where this failed was when multiple commits have the same
commit date (for example: when a range of commits are cherry picked with a
single command, they all have the same commit date [and different author
dates]).
- Commits with the same commit date would be sorted arbitrarily, and usually
break the network graph.
- This commit solves the problem by sorting both by date and
topologically (parents aren't displayed until all their children are
displayed).
- Include review comments from @adamniedzielski
A more detailed explanation is present here:
https://gitlab.com/gitlab-org/gitlab-ce/issues/30973#note_28706230
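The ordering rule described in this commit can be sketched as follows. This is not the graph code itself: the `Commit` struct and `sort_for_graph` are hypothetical, and dates are plain integers.

```ruby
Commit = Struct.new(:id, :date, :parent_ids)

# Hypothetical sketch: emits a commit only after all of its children, and
# picks the newest commit among those that are ready.
def sort_for_graph(commits)
  by_id = commits.to_h { |commit| [commit.id, commit] }

  # Number of not-yet-emitted children per commit.
  pending_children = Hash.new(0)
  commits.each do |commit|
    commit.parent_ids.each do |parent_id|
      pending_children[parent_id] += 1 if by_id.key?(parent_id)
    end
  end

  ready = commits.select { |commit| pending_children[commit.id].zero? }
  sorted = []

  until ready.empty?
    commit = ready.max_by(&:date)
    ready.delete(commit)
    sorted << commit

    commit.parent_ids.each do |parent_id|
      next unless by_id.key?(parent_id)

      pending_children[parent_id] -= 1
      ready << by_id[parent_id] if pending_children[parent_id].zero?
    end
  end

  sorted
end
```

With this rule, commits that share a commit date still come out in parent/child order, since the date only breaks ties between commits that are already allowed to be shown.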
- We upgraded `rugged` to 0.25.1.1 in !10286 for %9.1
- Prior to this upgrade, the default sort order for commits returned by
`Gitlab::Git::Repository#find_commits` was `Rugged::SORT_DATE`, which the
graph relied on.
- While upgrading `rugged`, the MR also changed this default to
`Rugged::SORT_NONE`, which broke commit ordering in the graph.
- This commit adds an option to `Gitlab::Git::Repository#find_commits` to sort
by date, and changes the graph builder `Network::Graph` so it explicitly
requests the `:date` sort order.
This is analogous to `git log -- foo bar baz`, but not the same as
`Gitlab::Git::Repository#log(path: 'foo bar baz')`, which would run `git
log -- 'foo bar baz'`.
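The difference comes down to how the argument list is built. A minimal sketch, with a hypothetical `log_args` helper:

```ruby
# Hypothetical helper: each path becomes its own argument after `--`.
def log_args(paths)
  ['git', 'log', '--', *paths]
end

log_args(%w[foo bar baz]) # three separate path arguments
log_args(['foo bar baz']) # one path argument that contains spaces
```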
- `raise "string"` raises a `RuntimeError` - no need to be explicit
- Remove top-level comment in the `RevList` class
- Use `%w()` instead of `%w[]`
- Extract an `environment_variables` method to cache `env.slice(*ALLOWED_VARIABLES)`
- Use `start_with?` for env variable validation instead of regex match
- Validation specs for each allowed environment variable were identical. Build them dynamically.
- Minor change to `popen3` expectation.
- Don't define "allowed environment variables" in two places.
- Dispatch to different arities of `Popen.open` without an if/else block.
- Use `described_class` instead of explicitly stating the class name within a
spec.
- Remove `git_environment_variables_validator_spec` and keep the validation inline.
The list of environment variables in `Gitlab::Git::RevList` needs to be validated
to make sure that they don't reference any other project on disk.
This commit mixes in `ActiveModel::Validations` into `Gitlab::Git::RevList`, and
validates that the environment variables are on the level (using a custom
validator class). If the validations fail, the force push is still executed
without any environment variables set.
Add specs for the validation using shared examples.
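A simplified sketch of this kind of check, without `ActiveModel::Validations`. The contents of `ALLOWED_VARIABLES`, the `valid_environment?` helper and the repository path are assumptions made for illustration.

```ruby
# Assumed allow-list; the actual list in the class may differ.
ALLOWED_VARIABLES = %w[
  GIT_OBJECT_DIRECTORY
  GIT_ALTERNATE_OBJECT_DIRECTORIES
].freeze

# Keeps only the variables on the allow-list.
def environment_variables(env)
  env.slice(*ALLOWED_VARIABLES)
end

# Hypothetical check: every allowed variable must point inside the repository.
def valid_environment?(env, repository_path)
  environment_variables(env).values.all? do |value|
    value.start_with?(repository_path)
  end
end
```

Variables outside the allow-list are dropped by `environment_variables`, so they never reach the check or the git process.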
Note: This feature was developed independently on master while this was
in review. I've removed the conflicting bits and left the relevant
additions, mainly a test for `Gitlab::Git::Hook`. The original commit
message follows:
1. `gitlab-shell` outputs errors to `stderr`, but we weren't using this
information, prior to this commit. Now we capture the `stderr`, and
display it in the flash message when branch creation fails.
2. This can be used to display better errors for other git operation
failures with small tweaks.
3. The return value of `Gitlab::Git::Hook#trigger` is changed from a
simple `true`/`false` to a tuple of `[status, errors]`. All usages
and tests have been updated to reflect this change.
4. This is only relevant to branch creation _from the Web UI_, since SSH
and HTTP pushes access `gitlab-shell` either directly or through
`gitlab-workhorse`.
5. A few minor changes need to be made on the `gitlab-shell` end. Right
now, the `stderr` message it outputs is prefixed by "GitLab: ", which
shows up in our flash message. This is better removed.
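The `[status, errors]` return shape from point 3 can be sketched like this. It is not the `Gitlab::Git::Hook#trigger` implementation: the standalone `trigger` method below is hypothetical and simply runs a command with `Open3.capture3`.

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical sketch: runs a command and returns [status, errors], where
# status is true on success and errors is the captured stderr.
def trigger(*command)
  _stdout, stderr, process_status = Open3.capture3(*command)
  [process_status.success?, stderr.strip]
end

# A stand-in for a failing hook that writes its reason to stderr.
status, errors = trigger(RbConfig.ruby, '-e', 'warn "branch exists"; exit 1')
```

Callers can destructure the tuple and pass `errors` on to the flash message when `status` is false.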