---
stage: Platforms
group: Scalability
info: Any user with at least the Maintainer role can merge updates to this content. For details, see https://docs.gitlab.com/ee/development/development_processes.html#development-guidelines-review.
---

# Rails request SLIs (service level indicators)

> - [Introduced](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/525) in GitLab 14.4

NOTE:
This SLI is used for service monitoring, but by default it is not used for
[error budgets for stage groups](../stage_group_observability/index.md#error-budget).

The request Apdex SLI and the error rate SLI are [SLIs defined in the application](index.md).

The request Apdex measures the duration of successful requests as an indicator of
application performance. This includes the REST and GraphQL API, and the
regular controller endpoints.

The error rate measures unsuccessful requests as an indicator of
server misbehavior. This includes the REST API and the
regular controller endpoints.

The SLIs are based on these counters:

1. `gitlab_sli_rails_request_apdex_total`: This counter gets
   incremented for every request that did not result in a response
   with a `5xx` status code. Excluding these responses ensures that slow
   failures are not counted twice, because such requests are already
   counted in the error SLI.

1. `gitlab_sli_rails_request_apdex_success_total`: This counter gets
   incremented for every successful request that performed faster than
   the [defined target duration depending on the endpoint's urgency](#adjusting-request-urgency).

1. `gitlab_sli_rails_request_error_total`: This counter gets
   incremented for every request that resulted in a response
   with a `5xx` status code.

1. `gitlab_sli_rails_request_total`: This counter gets
   incremented for every request.

These counters are labeled with:

1. `endpoint_id`: The identification of the Rails controller or the
   Grape API endpoint.

1. `feature_category`: The feature category specified for that
   controller or API endpoint.

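The counter rules above can be illustrated with a minimal sketch. It is not the application's instrumentation code: the `RequestSliCounters` class is hypothetical, plain integers stand in for the Prometheus counters, the `endpoint_id` and `feature_category` labels are omitted, and "successful" is assumed to mean any response without a `5xx` status code.

```ruby
# Hypothetical sketch: a hash of integers stands in for the real counters.
class RequestSliCounters
  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  # status: HTTP status code of the response
  # duration: request duration in seconds
  # target_duration: target duration in seconds for the endpoint's urgency
  def record(status:, duration:, target_duration:)
    @counts[:gitlab_sli_rails_request_total] += 1

    if (500..599).cover?(status)
      # A 5xx response counts toward the error SLI only, so a slow
      # failure is not also counted against the Apdex.
      @counts[:gitlab_sli_rails_request_error_total] += 1
      return
    end

    # Assumption: every non-5xx response is eligible for the Apdex.
    @counts[:gitlab_sli_rails_request_apdex_total] += 1
    @counts[:gitlab_sli_rails_request_apdex_success_total] += 1 if duration < target_duration
  end
end

counters = RequestSliCounters.new
counters.record(status: 200, duration: 0.3, target_duration: 1.0) # fast success
counters.record(status: 200, duration: 2.0, target_duration: 1.0) # slow success
counters.record(status: 503, duration: 9.0, target_duration: 1.0) # slow failure
```

In this sketch, the slow failure increments only the total and error counters, while the slow success increments the Apdex total without incrementing the Apdex success counter.
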
## Request Apdex SLO

The Apdex counters can be combined into a success ratio:
`gitlab_sli_rails_request_apdex_success_total` divided by
`gitlab_sli_rails_request_apdex_total`. The objective for
this ratio is defined in the service catalog per service. For this SLI to meet SLO,
the ratio recorded must be higher than:

- [Web: 0.998](https://gitlab.com/gitlab-com/runbooks/blob/master/metrics-catalog/services/web.jsonnet#L19)
- [API: 0.995](https://gitlab.com/gitlab-com/runbooks/blob/master/metrics-catalog/services/api.jsonnet#L19)
- [Git: 0.998](https://gitlab.com/gitlab-com/runbooks/blob/master/metrics-catalog/services/git.jsonnet#L22)

For example, for the web service, we want at least 99.8% of requests
to be faster than their target duration.

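As a rough illustration of the ratio check, the following sketch compares a success ratio with the thresholds listed above. The helper names are hypothetical, the thresholds are copied from this page rather than read from the service catalog, and returning `nil` when there is no traffic is an assumption made for the example.

```ruby
# Thresholds copied from the list above; the service catalog is the source of truth.
APDEX_SLO = { web: 0.998, api: 0.995, git: 0.998 }.freeze

# Returns the Apdex success ratio, or nil when no requests were recorded.
def apdex_ratio(success_total:, total:)
  return nil if total.zero?

  success_total.fdiv(total)
end

# Returns true when the ratio is higher than the objective for the service,
# false when it is not, and nil when there is no traffic to judge (assumption).
def meets_apdex_slo?(service:, success_total:, total:)
  ratio = apdex_ratio(success_total: success_total, total: total)
  return nil if ratio.nil?

  ratio > APDEX_SLO.fetch(service)
end
```

For example, 9,970 fast requests out of 10,000 give a ratio of 0.997, which in this sketch meets the API objective but not the web objective.
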
We use these targets for alerting and service monitoring. Take them into
account when setting target durations, so that we don't cause alerts. The goal,
however, is to set the urgency to a target that satisfies our users.

Both successful measurements and unsuccessful ones affect the
error budget for stage groups.

## Adjusting request urgency

Not all endpoints perform the same type of work, so it is possible to
define different urgency levels for different endpoints. An endpoint with a
lower urgency can have a longer request duration than endpoints with high urgency.

Long-running requests are more expensive for our infrastructure. While serving
one request, the thread remains occupied for the duration of that request and
can handle nothing else. Due to Ruby's Global VM Lock, the thread might keep the
lock and stall other requests handled by the same Puma worker
process. In effect, the request is a noisy neighbor for other requests
handled by the worker. For this reason, we cap the upper bound for a target
duration at 5 seconds.

## Decreasing the urgency (setting a higher target duration)

You can decrease the urgency on an existing endpoint on
a case-by-case basis. Take the following into account:

1. Apdex is about perceived performance. If a user is actively waiting
   for the result of a request, waiting 5 seconds might not be
   acceptable. However, if the endpoint is used by an automation
   requiring a lot of data, 5 seconds could be acceptable.

   A product manager can help to identify how an endpoint is used.

1. The workload for some endpoints can differ greatly
   depending on the parameters specified by the caller. The urgency
   needs to accommodate those differences. In some cases, you could
   define a separate [application SLI](index.md#defining-a-new-sli)
   for what the endpoint is doing.

   When an endpoint turns into a no-op in certain cases, making those
   requests very fast, ignore these fast requests when setting the
   target. For example, if the `MergeRequests::DraftsController` is
   hit for every merge request being viewed, but rarely renders
   anything, pick a target that
   still accommodates the endpoint performing work.

1. Consider the dependent resources consumed by the endpoint. If the endpoint
   loads a lot of data from Gitaly or the database, and this causes
   unsatisfactory performance, consider optimizing the
   way the data is loaded rather than increasing the target duration
   by lowering the urgency.

   In these cases, it might be appropriate to temporarily decrease
   urgency to make the endpoint meet SLO, if this is bearable for the
   infrastructure. If you do, add a code comment linking to an issue.

   If the endpoint consumes a lot of CPU time, also consider
   this: these requests are the kind of noisy neighbors we
   should try to keep as short as possible.

1. Take traffic characteristics into account. If the
   traffic to the endpoint sometimes bursts, like CI traffic spinning up a
   big batch of jobs hitting the same endpoint, then having these
   endpoints take 5 seconds is unacceptable from an infrastructure point of
   view. We cannot scale up the fleet fast enough to accommodate
   the incoming slow requests alongside the regular traffic.

When lowering the urgency for an existing endpoint, involve a
[Scalability team member](https://handbook.gitlab.com/handbook/engineering/infrastructure/team/scalability/)
in the review. We can use request rates and durations available in the
logs to come up with a recommendation. You can pick a threshold
using the same process as for
[increasing urgency](#increasing-urgency-setting-a-lower-target-duration),
picking a target duration that is higher than the duration observed at the
percentile matching the SLO for the service.

Don't set the longest durations on endpoints in the merge
requests that introduce them, because we don't yet have data to support
the decision.

## Increasing urgency (setting a lower target duration)

When increasing the urgency, we must make sure the endpoint
still meets SLO for the fleet that handles the request. You can use the
information in the logs to check:

1. Open [this table in Kibana](https://log.gprd.gitlab.net/goto/bbb6465c68eb83642269e64a467df3df).

1. The table loads information for the busiest endpoints by
   default. To speed up the response, add both:

   - A filter for `json.meta.caller_id.keyword`.
   - The identifier you're interested in, for example:

     ```ruby
     Projects::RawController#show
     ```

     or:

     ```plaintext
     GET /api/:version/projects/:id/snippets/:snippet_id/raw
     ```

1. Check the [appropriate percentile duration](#request-apdex-slo) for
   the service handling the endpoint. The overall duration should
   be lower than your intended target.

1. If the overall duration is below the intended target, check the peaks over time
   in [this graph](https://log.gprd.gitlab.net/goto/9319c4a402461d204d13f3a4924a89fc)
   in Kibana. Here, the percentile in question should not peak above
   the target duration we want to set.

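The percentile check in the steps above can be sketched offline on a list of request durations. This is only an illustration under assumptions: the helper names are hypothetical, the durations are made up, and the sketch uses the nearest-rank method on raw samples, so Kibana's own percentile aggregation may report slightly different values.

```ruby
# Nearest-rank percentile: the smallest sample such that at least
# `ratio` of all samples are less than or equal to it.
def percentile_duration(durations, ratio)
  raise ArgumentError, 'durations must not be empty' if durations.empty?

  sorted = durations.sort
  # Rational arithmetic avoids floating-point rounding when computing the rank.
  rank = (Rational(ratio.to_s) * sorted.size).ceil
  sorted[[rank, 1].max - 1]
end

# True when the duration at the percentile matching the service's SLO
# is lower than the target duration we intend to set.
def target_supported?(durations, slo_ratio:, target_duration:)
  percentile_duration(durations, slo_ratio) < target_duration
end

# Made-up sample: 998 fast requests and 2 slow ones, durations in seconds.
sample = Array.new(998, 0.2) + [3.0, 3.0]
target_supported?(sample, slo_ratio: 0.998, target_duration: 0.5)
```

With one more slow request in the same sample, the duration at the 99.8th percentile would become 3 seconds and a 0.5 second target would no longer be supported. The peaks over time still need to be checked separately, as described in the last step.
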
Because decreasing a threshold too much could result in alerts for
Apdex degradation, also involve a Scalability team member in
the merge request.

## How to adjust the urgency

You can specify urgency similar to how endpoints
[get a feature category](../feature_categorization/index.md). Endpoints without a
specific target use the default urgency, which has a target duration of 1s.
These configurations are available:

| Urgency    | Target duration | Notes                                  |
|------------|-----------------|----------------------------------------|
| `:high`    | [0.25s](https://gitlab.com/gitlab-org/gitlab/-/blob/2f7a38fe48934b78f04233c4d2c81cde88a06da7/lib/gitlab/endpoint_attributes/config.rb#L8) |                                        |
| `:medium`  | [0.5s](https://gitlab.com/gitlab-org/gitlab/-/blob/2f7a38fe48934b78f04233c4d2c81cde88a06da7/lib/gitlab/endpoint_attributes/config.rb#L9) |                                        |
| `:default` | [1s](https://gitlab.com/gitlab-org/gitlab/-/blob/2f7a38fe48934b78f04233c4d2c81cde88a06da7/lib/gitlab/endpoint_attributes/config.rb#L10) | The default when nothing is specified. |
| `:low`     | [5s](https://gitlab.com/gitlab-org/gitlab/-/blob/2f7a38fe48934b78f04233c4d2c81cde88a06da7/lib/gitlab/endpoint_attributes/config.rb#L11) |                                        |

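The table can be read as a simple lookup from urgency to target duration. The following sketch only mirrors the values in the table. It is not the code in `lib/gitlab/endpoint_attributes/config.rb`, and the constant and method names are hypothetical.

```ruby
# Target durations in seconds, copied from the table above.
TARGET_DURATIONS = {
  high: 0.25,
  medium: 0.5,
  default: 1.0,
  low: 5.0
}.freeze

# Returns the target duration for an urgency, falling back to the
# default when no urgency is specified. Unknown urgencies raise KeyError.
def target_duration_for(urgency = nil)
  TARGET_DURATIONS.fetch(urgency || :default)
end

target_duration_for(:high) # 0.25
target_duration_for        # 1.0, because nothing was specified
```

Raising on an unknown urgency is a design choice for this sketch: a typo in an urgency name then fails loudly instead of silently getting the default target.
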
### Rails controller

An urgency can be specified for all actions in a controller:

```ruby
class Boards::ListsController < ApplicationController
  urgency :high
end
```

To specify the urgency only for certain actions in a controller:

```ruby
class Boards::ListsController < ApplicationController
  urgency :high, [:index, :show]
end
```

A custom RSpec matcher is available to check an endpoint's request urgency in the controller specs:

```ruby
specify do
  expect(get(:index, params: request_params)).to have_request_urgency(:medium)
end
```

### Grape endpoints

To specify the urgency for an entire API class:

```ruby
module API
  class Issues < ::API::Base
    urgency :low
  end
end
```

To specify the urgency only for certain actions in an API class:

```ruby
module API
  class Issues < ::API::Base
    urgency :medium, [
      '/groups/:id/issues',
      '/groups/:id/issues_statistics'
    ]
  end
end
```

Or, we can specify the urgency per endpoint:

```ruby
get 'client/features', urgency: :low do
  # endpoint logic
end
```

The custom RSpec matcher also works in the specs for Grape endpoints:

```ruby
specify do
  expect(get(api('/avatar'), params: { email: 'public@example.com' })).to have_request_urgency(:medium)
end
```

WARNING:
You can't specify the urgency at the namespace level. The directive is ignored if you do.

### Error budget attribution and ownership

This SLI is used for service level monitoring. It feeds into the
[error budget for stage groups](../stage_group_observability/index.md#error-budget).

For more information, read the epic for
[defining custom SLIs and incorporating them into error budgets](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/525).
The endpoints for the SLI feed into a group's error budget based on the
[feature category declared on it](../feature_categorization/index.md).

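To illustrate attribution by feature category, the following sketch groups made-up counter values by their `feature_category` label and computes an Apdex ratio per category. The `Sample` struct, the helper name, and all values are hypothetical, and the mapping from feature category to stage group is not shown. The real attribution happens in the monitoring stack, not in application code like this.

```ruby
# Hypothetical samples: each entry mimics one labeled pair of Apdex counters.
Sample = Struct.new(:endpoint_id, :feature_category, :apdex_success, :apdex_total,
                    keyword_init: true)

# Sums the counters per feature category and returns the Apdex ratio for each,
# or nil for a category without any recorded requests.
def apdex_by_feature_category(samples)
  samples.group_by(&:feature_category).transform_values do |group|
    success = group.sum(&:apdex_success)
    total = group.sum(&:apdex_total)
    total.zero? ? nil : success.fdiv(total)
  end
end

samples = [
  Sample.new(endpoint_id: 'Boards::ListsController#index',
             feature_category: 'team_planning', apdex_success: 90, apdex_total: 100),
  Sample.new(endpoint_id: 'Boards::ListsController#show',
             feature_category: 'team_planning', apdex_success: 45, apdex_total: 50),
  Sample.new(endpoint_id: 'Projects::RawController#show',
             feature_category: 'source_code_management', apdex_success: 10, apdex_total: 10)
]

ratios = apdex_by_feature_category(samples)
```

In this sketch, both `Boards::ListsController` endpoints contribute to the same category, so a slow endpoint lowers the ratio for every group that owns that category.
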
To find out which endpoints are included for your group, see the
request rates on the
[group dashboard for your group](https://dashboards.gitlab.net/dashboards/f/stage-groups/stage-groups).
In the **Budget Attribution** row, the **Puma Apdex** log link shows you
how many requests are not meeting a 1s or 5s target.

For more information about the content of the dashboard, see
[Dashboards for stage groups](../stage_group_observability/index.md). For more information
about our exploration of the error budget itself, see
[issue 1365](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1365).