Document how Git object deduplication works in GitLab
Parent: a55056afb8
Commit: b3663ea4d3
				|  | @ -54,6 +54,7 @@ description: 'Learn how to contribute to GitLab.' | ||||||
| - [Prometheus metrics](prometheus_metrics.md) | - [Prometheus metrics](prometheus_metrics.md) | ||||||
| - [Guidelines for reusing abstractions](reusing_abstractions.md) | - [Guidelines for reusing abstractions](reusing_abstractions.md) | ||||||
| - [DeclarativePolicy framework](policies.md) | - [DeclarativePolicy framework](policies.md) | ||||||
|  | - [How Git object deduplication works in GitLab](git_object_deduplication.md) | ||||||
| 
 | 
 | ||||||
| ## Performance guides | ## Performance guides | ||||||
| 
 | 
 | ||||||
|  |  | ||||||
|  | @ -0,0 +1,261 @@ | ||||||
|  | # How Git object deduplication works in GitLab | ||||||
|  | 
 | ||||||
|  | When a GitLab user [forks a project](../workflow/forking_workflow.md), | ||||||
|  | GitLab creates a new Project with an associated Git repository that is a | ||||||
|  | copy of the original project at the time of the fork. If a large project | ||||||
|  | gets forked often, this can lead to a quick increase in Git repository | ||||||
|  | storage disk use. To counteract this problem, we are adding Git object | ||||||
|  | deduplication for forks to GitLab. In this document, we will describe how | ||||||
|  | GitLab implements Git object deduplication. | ||||||
|  | 
 | ||||||
|  | ## Enabling Git object deduplication via feature flags | ||||||
|  | 
 | ||||||
|  | As of GitLab 11.9, Git object deduplication in GitLab is in beta. In this | ||||||
|  | document, you can read about the caveats of enabling the feature. Also, | ||||||
|  | note that Git object deduplication is limited to forks of public | ||||||
|  | projects on hashed repository storage. | ||||||
|  | 
 | ||||||
|  | You can enable deduplication globally by setting the `object_pools` | ||||||
|  | feature flag to `true`: | ||||||
|  | 
 | ||||||
|  | ``` {.ruby} | ||||||
|  | Feature.enable(:object_pools) | ||||||
|  | ``` | ||||||
|  | 
 | ||||||
|  | Or just for forks of a specific project: | ||||||
|  | 
 | ||||||
|  | ``` {.ruby} | ||||||
|  | fork_parent = Project.find(MY_PROJECT_ID) | ||||||
|  | Feature.enable(:object_pools, fork_parent) | ||||||
|  | ``` | ||||||
|  | 
 | ||||||
|  | To check whether a project uses Git object deduplication, open a Rails | ||||||
|  | console and check whether `project.pool_repository` is present. | ||||||
|  | 
 | ||||||
|  | ## Pool repositories | ||||||
|  | 
 | ||||||
|  | ### Understanding Git alternates | ||||||
|  | 
 | ||||||
|  | At the Git level, we achieve deduplication by using [Git | ||||||
|  | alternates](https://git-scm.com/docs/gitrepository-layout#gitrepository-layout-objects). | ||||||
|  | Git alternates is a mechanism that lets a repository borrow objects from | ||||||
|  | another repository on the same machine. | ||||||
|  | 
 | ||||||
|  | If we want repository A to borrow from repository B, we first write a | ||||||
|  | path that resolves to `B.git/objects` in the special file | ||||||
|  | `A.git/objects/info/alternates`. This establishes the alternates link. | ||||||
|  | Next, we must perform a Git repack in A. After the repack, any objects | ||||||
|  | that are duplicated between A and B will get deleted from A. Repository | ||||||
|  | A is now no longer self-contained, but it still has its own refs and | ||||||
|  | configuration. Objects in A that are not in B will remain in A. For this | ||||||
|  | to work, it is of course critical that **no objects ever get deleted from | ||||||
|  | B** because A might need them. | ||||||
|  | 
 | ||||||
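The borrowing behavior described above can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: `ToyRepo`, `link_alternate`, `repack!` and `object?` are hypothetical names, object IDs are plain strings, and nothing here calls Git or Gitaly.

```ruby
require 'set'

# Toy in-memory model of Git alternates; not GitLab or Gitaly code.
class ToyRepo
  attr_reader :name, :objects, :alternate

  def initialize(name, objects = [])
    @name = name
    @objects = Set.new(objects)
    @alternate = nil
  end

  # Models writing a path to B.git/objects into A.git/objects/info/alternates.
  def link_alternate(other)
    @alternate = other
  end

  # Models the repack: drop local objects that the alternate already has.
  def repack!
    return if alternate.nil?

    @objects -= alternate.objects
  end

  # An object is reachable if it is stored locally or in the alternate.
  def object?(id)
    objects.include?(id) || (!alternate.nil? && alternate.objects.include?(id))
  end
end

b = ToyRepo.new('B', %w[c1 c2 c3])
a = ToyRepo.new('A', %w[c1 c2 c3 c4])

a.link_alternate(b)
a.repack!

puts a.objects.to_a.inspect # A keeps only the objects B does not have
puts a.object?('c1')        # shared objects stay reachable through the alternate

# Deleting an object from B breaks A, which is why B must never lose objects.
b.objects.delete('c1')
puts a.object?('c1')
```

In this model, A ends up storing only `c4`, and removing `c1` from B makes it unreachable from A as well.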
|  | ### Git alternates in GitLab: pool repositories | ||||||
|  | 
 | ||||||
|  | GitLab organizes this object borrowing by creating special **pool | ||||||
|  | repositories** which are hidden from the user. We then use Git | ||||||
|  | alternates to let a collection of project repositories borrow from a | ||||||
|  | single pool repository. We call such a collection of project | ||||||
|  | repositories a pool. Pools form star-shaped networks of repositories | ||||||
|  | that borrow from a single pool, which will resemble (but not be | ||||||
|  | identical to) the fork networks that get formed when users fork | ||||||
|  | projects. | ||||||
|  | 
 | ||||||
|  | At the Git level, pool repositories are created and managed using Gitaly | ||||||
|  | RPC calls. Just like with normal repositories, the authority on which | ||||||
|  | pool repositories exist, and which repositories borrow from them, lies | ||||||
|  | at the Rails application level in SQL. | ||||||
|  | 
 | ||||||
|  | In conclusion, we need three things for effective object deduplication | ||||||
|  | across a collection of GitLab project repositories at the Git level: | ||||||
|  | 
 | ||||||
|  | 1.  A pool repository must exist. | ||||||
|  | 2.  The participating project repositories must be linked to the pool | ||||||
|  |     repository via their respective `objects/info/alternates` files. | ||||||
|  | 3.  The pool repository must contain Git object data common to the | ||||||
|  |     participating project repositories. | ||||||
|  | 
 | ||||||
|  | ### Deduplication factor | ||||||
|  | 
 | ||||||
|  | The effectiveness of Git object deduplication in GitLab depends on the | ||||||
|  | amount of overlap between the pool repository and each of its | ||||||
|  | participants. As of GitLab 11.9, we have a somewhat optimistic system. | ||||||
|  | The only data that will be deduplicated is the data in the source | ||||||
|  | project repository at the time the pool repository is created. That is, | ||||||
|  | the data in the source project at the time of the first fork *after* the | ||||||
|  | deduplication feature has been enabled. | ||||||
|  | 
 | ||||||
|  | If we enable the object deduplication feature for | ||||||
|  | gitlab.com/gitlab-org/gitlab-ce, which is about 1GB at the time of | ||||||
|  | writing, all new forks of that project will be 1GB smaller than they | ||||||
|  | would have been without Git object deduplication. So even in its current | ||||||
|  | optimistic form, we expect Git object deduplication in GitLab to make a | ||||||
|  | difference. | ||||||
|  | 
 | ||||||
|  | However, if a lot of Git objects get added to the project repositories | ||||||
|  | in a pool after the pool repository was created, these new Git objects | ||||||
|  | will currently (GitLab 11.9) not get deduplicated. Over time, the | ||||||
|  | deduplication factor of the pool will get worse and worse. | ||||||
|  | 
 | ||||||
|  | As an extreme example, if we create an empty repository A, and fork that | ||||||
|  | to repository B, behind the scenes we get an object pool P with no | ||||||
|  | objects in it at all. If we then push 1GB of Git data to A, and push the | ||||||
|  | same Git data to B, it will not get deduplicated, because that data was | ||||||
|  | not in A at the time P was created. | ||||||
|  | 
 | ||||||
|  | This also matters in less extreme examples. Consider a pool P with | ||||||
|  | source project A and 500 active forks B1, B2,...,B500. Suppose, | ||||||
|  | optimistically, that the forks are fully deduplicated at the start of | ||||||
|  | our scenario. Now some time passes and 200MB of new Git data gets added | ||||||
|  | to project A. Because of the forking workflow, this data also makes its way | ||||||
|  | into the forks B1, ..., B500. That means we would now have 100GB of Git | ||||||
|  | data sitting around (500 \* 200MB) across the forks that could have | ||||||
|  | been deduplicated. But because of the way we do deduplication, this new | ||||||
|  | data will not be deduplicated. | ||||||
|  | 
 | ||||||
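The arithmetic in the scenario above can be checked with a few lines of Ruby. The variable names are illustrative, and the conversion assumes decimal units (1 GB = 1000 MB), which is what the 100GB figure implies.

```ruby
# Worked arithmetic for the 500-fork scenario, using decimal units.
forks = 500
new_data_mb = 200

# Each fork stores its own copy of the new data because the pool is not updated.
duplicated_mb = forks * new_data_mb
duplicated_gb = duplicated_mb / 1000

puts "#{duplicated_gb} GB of new data stored across the forks"
puts "#{new_data_mb} MB is the size of one copy of that data"
```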
|  | > TODO Add periodic maintenance of object pools to prevent gradual loss | ||||||
|  | > of deduplication over time. | ||||||
|  | > https://gitlab.com/groups/gitlab-org/-/epics/524 | ||||||
|  | 
 | ||||||
|  | ## SQL model | ||||||
|  | 
 | ||||||
|  | As of GitLab 11.8, project repositories in GitLab do not have their own | ||||||
|  | SQL table. They are indirectly identified by columns on the `projects` | ||||||
|  | table. In other words, the only way to look up a project repository is to | ||||||
|  | first look up its project, and then call `project.repository`. | ||||||
|  | 
 | ||||||
|  | With pool repositories we made a fresh start. These live in their own | ||||||
|  | `pool_repositories` SQL table. The relations between these two tables | ||||||
|  | are as follows: | ||||||
|  | 
 | ||||||
|  | -   a `Project` belongs to at most one `PoolRepository` | ||||||
|  |     (`project.pool_repository`) | ||||||
|  | -   as an automatic consequence of the above, a `PoolRepository` has | ||||||
|  |     many `Project`s | ||||||
|  | -   a `PoolRepository` has exactly one "source `Project`" | ||||||
|  |     (`pool.source_project`) | ||||||
|  | 
 | ||||||
|  | ### Assumptions | ||||||
|  | 
 | ||||||
|  | -   All repositories in a pool must use [hashed | ||||||
|  |     storage](../administration/repository_storage_types.md). This is so | ||||||
|  |     that we don't have to ever worry about updating paths in | ||||||
|  |     `objects/info/alternates` files. | ||||||
|  | -   All repositories in a pool must be on the same Gitaly storage shard. | ||||||
|  |     The Git alternates mechanism relies on direct disk access across | ||||||
|  |     multiple repositories, and we can only assume direct disk access to | ||||||
|  |     be possible within a Gitaly storage shard. | ||||||
|  | -   All project repositories in a pool must have "Public" visibility in | ||||||
|  |     GitLab at the time they join. There are gotchas around visibility of | ||||||
|  |     Git objects across alternates links. This restriction is a defense | ||||||
|  |     against accidentally leaking private Git data. | ||||||
|  | -   The only two ways to remove a member project from a pool are (1) to | ||||||
|  |     delete the project or (2) to move the project to another Gitaly | ||||||
|  |     storage shard. | ||||||
|  | 
 | ||||||
|  | ### Creating pools and pool memberships | ||||||
|  | 
 | ||||||
|  | -   When a pool gets created, it must have a source project. The initial | ||||||
|  |     contents of the pool repository are a Git clone of the source | ||||||
|  |     project repository. | ||||||
|  | -   The occasion for creating a pool is when an existing eligible | ||||||
|  |     (public, hashed storage, non-forked) GitLab project gets forked and | ||||||
|  |     this project does not belong to a pool repository yet. The fork | ||||||
|  |     parent project becomes the source project of the new pool, and both | ||||||
|  |     the fork parent and the fork child project become members of the new | ||||||
|  |     pool. | ||||||
|  | -   Once project A has become the source project of a pool, all future | ||||||
|  |     eligible forks of A will become pool members. | ||||||
|  | -   If the fork source is itself a fork, the resulting repository will | ||||||
|  |     neither join a pool repository nor seed a new one. | ||||||
|  | 
 | ||||||
|  |     For example: | ||||||
|  | 
 | ||||||
|  |     Suppose fork A is part of a pool repository. Any forks created off | ||||||
|  |     of fork A *will not* be a part of the pool repository that fork A is | ||||||
|  |     a part of. | ||||||
|  | 
 | ||||||
|  |     Suppose B is a fork of A, and A does not belong to an object pool. | ||||||
|  |     Now C gets created as a fork of B. C will not be part of a pool | ||||||
|  |     repository. | ||||||
|  | 
 | ||||||
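The membership rules above can be sketched in plain Ruby. This is a simplified model, not GitLab's ActiveRecord code: `ToyProject`, `ToyPool` and `assign_pool` are hypothetical names, and the sketch assumes that an "eligible" fork is one that is public and on hashed storage, and that an ineligible fork does not cause a pool to be created.

```ruby
# Hypothetical stand-ins for illustration; not GitLab's ActiveRecord models.
class ToyProject
  attr_reader :name, :visibility, :forked_from
  attr_accessor :pool

  def initialize(name, visibility: :public, hashed_storage: true, forked_from: nil)
    @name = name
    @visibility = visibility
    @hashed_storage = hashed_storage
    @forked_from = forked_from
    @pool = nil
  end

  def fork?
    !forked_from.nil?
  end

  # Assumption: "eligible" means public visibility on hashed storage.
  def eligible?
    visibility == :public && @hashed_storage
  end
end

class ToyPool
  attr_reader :source_project, :members

  def initialize(source_project)
    @source_project = source_project
    @members = []
    add(source_project)
  end

  def add(project)
    members << project
    project.pool = self
  end
end

# Decides pool membership for a newly created fork, following the rules above.
def assign_pool(child)
  parent = child.forked_from
  return nil if parent.fork? # forks of forks neither join nor seed a pool
  return nil unless parent.eligible? && child.eligible?

  pool = parent.pool || ToyPool.new(parent) # the first eligible fork seeds the pool
  pool.add(child)
  pool
end

a = ToyProject.new('A')
b = ToyProject.new('B', forked_from: a)
c = ToyProject.new('C', forked_from: b)

assign_pool(b)
assign_pool(c)

puts a.pool.members.map(&:name).inspect # A and B share one pool
puts c.pool.inspect                     # C, a fork of a fork, has no pool
```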
|  | > TODO should forks of forks be deduplicated? | ||||||
|  | > https://gitlab.com/gitlab-org/gitaly/issues/1532 | ||||||
|  | 
 | ||||||
|  | ### Consequences | ||||||
|  | 
 | ||||||
|  | -   If a normal Project participating in a pool gets moved to another | ||||||
|  |     Gitaly storage shard, its "belongs to PoolRepository" relation must | ||||||
|  |     be broken. Because of the way moving repositories between shards is | ||||||
|  |     implemented, we will automatically get a fresh self-contained copy | ||||||
|  |     of the project's repository on the new storage shard. | ||||||
|  | -   If the source project of a pool gets moved to another Gitaly storage | ||||||
|  |     shard or is deleted, we may have to break the "PoolRepository has | ||||||
|  |     one source Project" relation? | ||||||
|  | 
 | ||||||
|  | > TODO What happens, or should happen, if a source project changes | ||||||
|  | > visibility, is deleted, or moves to another storage shard? | ||||||
|  | > https://gitlab.com/gitlab-org/gitaly/issues/1488 | ||||||
|  | 
 | ||||||
|  | ## Consistency between the SQL pool relation and Gitaly | ||||||
|  | 
 | ||||||
|  | As far as Gitaly is concerned, the SQL pool relations make two types of | ||||||
|  | claims about the state of affairs on the Gitaly server: pool repository | ||||||
|  | existence, and the existence of an alternates connection between a | ||||||
|  | repository and a pool. | ||||||
|  | 
 | ||||||
|  | ### Pool existence | ||||||
|  | 
 | ||||||
|  | If GitLab thinks a pool repository exists (i.e. it exists according to | ||||||
|  | SQL), but it does not on the Gitaly server, then certain RPC calls that | ||||||
|  | take the object pool as an argument will fail. | ||||||
|  | 
 | ||||||
|  | > TODO What happens if SQL says the pool repo exists but Gitaly says it | ||||||
|  | > does not? https://gitlab.com/gitlab-org/gitaly/issues/1533 | ||||||
|  | 
 | ||||||
|  | If GitLab thinks a pool does not exist, while it does exist on disk, | ||||||
|  | that has no direct consequences on its own. However, if other | ||||||
|  | repositories on disk borrow objects from this unknown pool repository | ||||||
|  | then we risk data loss (see below). | ||||||
|  | 
 | ||||||
|  | ### Pool relation existence | ||||||
|  | 
 | ||||||
|  | There are three different things that can go wrong here. | ||||||
|  | 
 | ||||||
|  | #### 1. SQL says repo A belongs to pool P but Gitaly says A has no alternate objects | ||||||
|  | 
 | ||||||
|  | In this case, we miss out on disk space savings but all RPCs on A itself | ||||||
|  | will function fine. As long as Git can find all its objects, it does not | ||||||
|  | matter exactly where those objects are. | ||||||
|  | 
 | ||||||
|  | #### 2. SQL says repo A belongs to pool P1 but Gitaly says A has alternate objects in pool P2 | ||||||
|  | 
 | ||||||
|  | If we are not careful, this situation can lead to data loss. During some | ||||||
|  | operations (repository maintenance), GitLab will try to re-link A to its | ||||||
|  | pool P1. If this clobbers the existing link to P2, then A will lose Git | ||||||
|  | objects and become invalid. | ||||||
|  | 
 | ||||||
|  | Also, keep in mind that if GitLab's database got messed up, it may not | ||||||
|  | even know that P2 exists. | ||||||
|  | 
 | ||||||
|  | > TODO Ensure that Gitaly will not clobber existing, unexpected | ||||||
|  | > alternates links. https://gitlab.com/gitlab-org/gitaly/issues/1534 | ||||||
|  | 
 | ||||||
|  | #### 3. SQL says repo A does not belong to any pool but Gitaly says A belongs to P | ||||||
|  | 
 | ||||||
|  | This has the same data loss possibility as scenario 2 above. | ||||||
|  | 
 | ||||||
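The three mismatch scenarios above can be summarized as a small classification function. This is a sketch only: `pool_link_state` and the returned symbols are hypothetical, and each argument stands for a pool identifier (or `nil` when no pool is recorded), as seen by SQL and by the repository's alternates file respectively.

```ruby
# Classifies the relation between what SQL claims and what is on disk.
# Hypothetical helper for illustration; not part of GitLab or Gitaly.
def pool_link_state(sql_pool, disk_pool)
  return :consistent if sql_pool == disk_pool

  if disk_pool.nil?
    :missing_alternates # scenario 1: lost disk savings, no data loss
  elsif sql_pool.nil?
    :unknown_alternates # scenario 3: data loss risk
  else
    :wrong_alternates   # scenario 2: data loss risk if the link is clobbered
  end
end

puts pool_link_state('P', nil)
puts pool_link_state('P1', 'P2')
puts pool_link_state(nil, 'P')
```

Only the first mismatch is harmless. In the other two, overwriting the existing alternates link would cut the repository off from objects it still needs.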
|  | ## Git object deduplication and GitLab Geo | ||||||
|  | 
 | ||||||
|  | When a pool repository record is created in SQL on a Geo primary, this | ||||||
|  | will eventually trigger an event on the Geo secondary. The Geo secondary | ||||||
|  | will then create the pool repository in Gitaly. This leads to an | ||||||
|  | "eventually consistent" situation because as each pool participant gets | ||||||
|  | synchronized, Geo will eventually trigger garbage collection in Gitaly on | ||||||
|  | the secondary, at which stage Git objects will get deduplicated. | ||||||
|  | 
 | ||||||
|  | > TODO How do we handle the edge case where at the time the Geo | ||||||
|  | > secondary tries to create the pool repository, the source project does | ||||||
|  | > not exist? https://gitlab.com/gitlab-org/gitaly/issues/1533 | ||||||