https://gitlab.com/gitlab-org/gitlab-ce/issues/62971
Adds support to EnvironmentsController#metrics_dashboard
for the following params: group, title, y_label
These params are used to uniquely identify a panel on
the metrics dashboard.
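For illustration, a request spec might pass these params like so (the
route helper name and the panel values are assumptions, not taken from
the commit):

get metrics_dashboard_project_environment_path(
  project,
  environment,
  group: 'System metrics (Kubernetes)', # dashboard group containing the panel
  title: 'Memory Usage (Total)',        # panel title
  y_label: 'Total Memory Used'          # panel y-axis label
)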
Metrics are stored in several places, so this adds
utilities to find a specific panel from the database
or filesystem depending on the metric specified.
Also moves some shared utilities into separate classes,
notably default values and errors.
Previously, both InfluxSampler and RubySampler were relying on the
`GC::Profiler.total_time` data, which is the sum over the list
of captured GC events. Also, both samplers asynchronously called
`GC::Profiler.clear`, which led to incorrect metric data, because
each sampler wrongly assumed it was the only object calling
`GC::Profiler.clear` and could therefore rely on the results
gathered between its own calls.
We should ensure that `GC::Profiler.total_time` is called in only one
place, making it possible to rely on the data accumulated between such
wipes. We also need to track the number of profiler reports we lose.
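A minimal sketch of the idea, with hypothetical names (the actual
GitLab classes differ):

module GcProfilerOwner
  MUTEX = Mutex.new
  @accumulated_time = 0.0

  # The only caller of GC::Profiler.clear; all samplers read the
  # accumulated value instead, so no sampler can wipe data that
  # another one still needs.
  def self.flush
    MUTEX.synchronize do
      @accumulated_time += GC::Profiler.total_time
      GC::Profiler.clear
      @accumulated_time
    end
  end
end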
In preparation for embedding specific metrics in issues
https://gitlab.com/gitlab-org/gitlab-ce/issues/62971,
this commit moves the BaseService for metrics dashboards
to a new services subdirectory. This is purely for the sake
of organization and maintainability.
* Remove `controller` and `action` labels from duration histogram.
* Create a new simple counter for `controller` and `action`.
* Adjust histogram buckets to observe smaller response times.
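For illustration, the split might look like this with the standalone
prometheus-client gem (GitLab wraps this differently; the metric names
and bucket boundaries below are made-up values skewed toward small
response times):

require 'prometheus/client'

registry = Prometheus::Client.registry

# Duration histogram with no controller/action labels.
duration = Prometheus::Client::Histogram.new(
  :http_request_duration_seconds,
  docstring: 'HTTP request duration',
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5]
)

# A plain counter carries the high-cardinality controller/action labels.
requests = Prometheus::Client::Counter.new(
  :http_requests_total,
  docstring: 'HTTP requests per controller action',
  labels: [:controller, :action]
)

registry.register(duration)
registry.register(requests)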
Adds GFM pipeline filters to insert a placeholder into the HTML
generated from GFM when a metrics dashboard link is present.
The front end should look for the class 'js-render-metrics' to
determine whether it should replace the element with metrics charts.
The data attribute 'data-dashboard-url' should be the endpoint the
front end hits to obtain a dashboard layout and render the charts
appropriately.
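A sketch of such a filter; only the 'js-render-metrics' class and the
'data-dashboard-url' attribute come from the description above, while
the filter name and link selector are hypothetical:

require 'html/pipeline'

class MetricsDashboardPlaceholderFilter < HTML::Pipeline::Filter
  def call
    # The selector is illustrative; the real filter matches metrics
    # dashboard links differently.
    doc.css('a.metrics-dashboard-link').each do |link|
      placeholder = doc.document.create_element(
        'div',
        'class' => 'js-render-metrics',
        'data-dashboard-url' => link['href']
      )
      link.replace(placeholder)
    end

    doc
  end
end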
Adds permission checks to the metrics_dashboard endpoint. Users
with the Reporter role or above should have access to view the
metrics for a given project.
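A minimal sketch of such a check using GitLab's usual can? helper (the
ability name here is an assumption):

before_action only: [:metrics_dashboard] do
  render_403 unless can?(current_user, :read_environment, project)
end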
This commit adds support for embedding metrics
dashboards. If the flag 'embedded' is provided
to the environments/id/metrics_dashboard endpoint,
the response will be suitable for embedding in
issues or other content.
This is a precursor to supporting embedded
metrics in GFM.
Error classes associated with individual stages of
dashboard processing tend to have very long names.
As dashboard post-processing includes more steps,
we will likely need to handle more error cases.
Refactoring to have all errors inherit from a specific
base class will help accommodate this and keep the code
more readable.
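For illustration, the hierarchy could look like this (the class names
are hypothetical):

module Gitlab
  module Metrics
    module Dashboard
      # Common base class; callers rescue this instead of
      # enumerating every stage-specific error.
      DashboardProcessingError = Class.new(StandardError)

      PanelNotFoundError = Class.new(DashboardProcessingError)
      LayoutError        = Class.new(DashboardProcessingError)
    end
  end
end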
Adds a new stage to the dashboard processing steps for the
EnvironmentsController::metrics_dashboard endpoint. This
allows the front end to avoid generating the endpoint through
unintuitive string mutations.
In some cases (during worker start) it's possible that
Puma.stats returns an empty hash for a worker's last status. In
that case we simply skip sampling that worker until its
stats are available.
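A sketch of the guard, assuming Puma's clustered stats format and a
running Puma process (the sample_worker helper is hypothetical):

require 'json'

stats = JSON.parse(Puma.stats)

Array(stats['worker_status']).each do |worker|
  last_status = worker['last_status']
  next if last_status.nil? || last_status.empty? # worker still booting

  sample_worker(last_status) # hypothetical helper
end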
This adds Ruby and Unicorn instrumentation. This was originally
intended in 11.11 but was reverted due to performance concerns. This
new commit forgoes the sys-proctable gem, which was causing the
performance issues previously.
Updates the EnvironmentController#metrics_dashboard endpoint
to support a "dashboard" param, which can be used to specify
the filepath of a dashboard configuration from a project
repository. Dashboard configurations are expected to be
stored in .gitlab/dashboards/.
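For example, a request spec might pass the param like this (the route
helper name and the file name are assumptions):

get metrics_dashboard_project_environment_path(
  project,
  environment,
  dashboard: '.gitlab/dashboards/business_metrics.yml' # illustrative file
)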
Updates dashboard post-processing steps to exclude custom
metrics, which should only display on the system dashboard.
This updates the monitoring docs to reflect the new Ruby and Unicorn
metrics, and makes it so we fetch the process start time via the proc
table instead of via CLOCK_BOOTTIME.
This adds new metrics for monitoring unicorn. These metrics include
process_cpu_seconds_total, process_start_time_seconds, process_max_fds,
and unicorn_workers.
* Use a simple counter for sampler duration instead of a histogram.
* Use a counter to collect GC time.
* Remove unused objects metric.
* Cleanup metric names to match Prometheus conventions.
* Prefix generic GC stats with `gc_stat`.
* Include worker label on memory and file descriptor metrics.
In MR https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/15834 we
removed use of the data produced by the Allocations Gem. However, we
never removed the code that just enables tracking of allocations. In
this commit we remove all remaining traces of this Gem.
There seem to be a lot of cases where the suffix of an action (e.g.
".html") is set to bogus data, such as "*/*" or entire URLs. This can
increase cardinality of our metrics, and isn't very useful for
monitoring and filtering. To work around this, we enforce a whitelist
containing a few suffixes we actually care about. Suffixes not supported
will be grouped under the action without a suffix. This means that a
request to "FooController#bar.jpeg" will be assigned to
"FooController#bar".
Certain controllers (e.g. Doorkeeper::TokensController) don't expose the
method "request_format". This commit changes
Gitlab::Metrics::WebTransaction so we don't rely on this method, instead
using the underlying code this method uses.
Fixes https://gitlab.com/gitlab-org/gitlab-ce/issues/46412
The method "content_type" on a controller does not always return the
correct content type. On the other hand, the method "request_format"
does _and_ immediately returns a Symbol (e.g. :json) instead of a
mime-type name (e.g. application/json). With these changes metrics
should again report their action names correctly.
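A sketch of the idea: read the format from the request object itself,
which every controller exposes, rather than from request_format (the
surrounding method is illustrative):

def action_suffix(controller)
  # Doorkeeper::TokensController lacks #request_format, but every
  # controller has #request; format.ref yields a Symbol like :json.
  controller.request.format.try(:ref)
end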
Fixes https://gitlab.com/gitlab-com/infrastructure/issues/3499
The Sidekiq exporter logs were mixing with the normal Sidekiq logs. In order
to support structured logging in Sidekiq, we either need to split this data
out or convert the exporter to produce structured logs. Since Sidekiq job
processing is fundamentally different information from Web server traffic,
it seems cleaner to move the metrics traffic into a separate file, where they
can be parsed by a different filter if needed.
Relates to #20060
Reduce cardinality of some of GitLab's Prometheus metrics and fix observed duration reporting.
Closes #41045
See merge request gitlab-org/gitlab-ce!15881
Using compare-and-swap, we update the expires_at value. The thread
that actually manages to update this value will also set the cache
holding the method_call enabled state.
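A minimal sketch with concurrent-ruby, assuming a 30-second cache
interval (the names are hypothetical):

require 'concurrent'

EXPIRES_AT = Concurrent::AtomicReference.new(Time.now)

def refresh_method_call_cache
  old = EXPIRES_AT.get
  return if Time.now < old # cached state is still fresh

  # Only the thread that wins the compare-and-swap recomputes the
  # cached state; losing threads keep the previous value.
  if EXPIRES_AT.compare_and_set(old, Time.now + 30)
    @method_call_enabled = compute_enabled_state # hypothetical helper
  end
end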
On GitLab.com, InfluxSampler#sample_objects appears to take about 1.2 s
to iterate through 1059 objects. This led to delays of a couple hundred
milliseconds in the main thread's processing. Remove this code since
it's not really being used.
Closes gitlab-com/infrastructure#3250
* Follow Prometheus naming conventions[0].
* Simplify metrics by adding response labels to the histogram.
* Use standard `http_request_duration_seconds_...` names for the histogram.
[0]: https://prometheus.io/docs/practices/naming/#metric-names
`endpoint.route` is calling `env[Grape::Env::GRAPE_ROUTING_ARGS][:route_info]`
but `env[Grape::Env::GRAPE_ROUTING_ARGS]` is `nil` in the case of a 405 response.
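A nil-safe lookup along these lines avoids the crash (only the env keys
come from the description above):

routing_args = env[Grape::Env::GRAPE_ROUTING_ARGS]
route = routing_args && routing_args[:route_info]
return unless route # e.g. a 405 response never set the routing args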
Signed-off-by: Rémy Coutable <remy@rymai.me>
GitLab Performance Monitoring is now able to track custom events not
directly related to application performance. These events include the
number of tags pushed, repositories created, builds registered, etc.
The use of these events is to get a better overview of how a GitLab
instance is used and how that may affect performance. For example, a
large number of Git pushes may have a negative impact on the underlying
storage engine.
Events are stored in the "events" measurement and are not prefixed with
"rails_" or "sidekiq_"; this makes it easier to query events with the
same name triggered from different parts of the application. All events
being stored in the same measurement also makes it easier to downsample
data.
Currently the following events are tracked:
* Creating repositories
* Removing repositories
* Changing the default branch of a repository
* Pushing a new tag
* Removing an existing tag
* Pushing a commit (along with the branch being pushed to)
* Pushing a new branch
* Removing an existing branch
* Importing a repository (along with the URL we're importing)
* Forking a repository (along with the source/target path)
* CI builds registered (and when no build could be found)
* CI builds being updated
* Rails and Sidekiq exceptions
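For illustration, recording one of these events might look like this
(the helper name and tag values are assumptions based on the
description above):

Gitlab::Metrics.add_event(:push_tag, tag: 'v1.0.0')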
Fixes gitlab-org/gitlab-ce#13720
This reduces the overhead of the method instrumentation code primarily
by reducing the number of method calls. There are also some other small
optimisations such as not casting timing values to Floats (there's no
particular need for this), using Symbols for method call metric names,
and reducing the number of Hash lookups for instrumented methods.
The exact impact depends on the code being executed. For example, for a
method that's only called once the difference won't be very noticeable.
However, for methods that are called many times the difference can be
more significant.
For example, the loading time of a large commit
(nrclark/dummy_project@81ebdea5df)
was reduced from around 19 seconds to around 15 seconds using these
changes.
Process.clock_gettime allows getting the real time in nanoseconds as
well as allowing one to get a monotonic timestamp. This offers greater
accuracy without the overhead of having to allocate a Time instance. In
general using Time.now/Time.new is about 2x slower than using
Process.clock_gettime(). For example:
require 'benchmark/ips'

Benchmark.ips do |bench|
  bench.report 'Time.now' do
    Time.now.to_f
  end

  bench.report 'clock_gettime' do
    Process.clock_gettime(Process::CLOCK_MONOTONIC, :millisecond)
  end

  bench.compare!
end
Running this benchmark gives:
Calculating -------------------------------------
            Time.now   108.052k i/100ms
       clock_gettime   125.984k i/100ms
-------------------------------------------------
            Time.now    2.343M (± 7.1%) i/s - 11.670M
       clock_gettime    4.979M (± 0.8%) i/s - 24.945M

Comparison:
       clock_gettime:  4979393.8 i/s
            Time.now:  2342986.8 i/s - 2.13x slower
Another benefit of using Process.clock_gettime() is that we can simplify
the code a bit since it can give timestamps in nanoseconds out of the
box.
Previously we'd create a separate Metric instance for every method call
that would exceed the method call threshold. This is problematic because
it doesn't provide us with information to accurately get the _total_
execution time of a particular method. For example, if the method
"Foo#bar" was called 4 times with a runtime of ~10 milliseconds we'd end
up with 4 different Metric instances. If we were to then get the
average/95th percentile/etc of the timings this would be roughly 10
milliseconds. However, the _actual_ total time spent in this method
would be around 40 milliseconds.
To solve this problem we now create a single Metric instance per method.
This Metric instance contains the _total_ real/CPU time and the call
count for every instrumented method.
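A sketch of the accumulating shape (the class and field names are
illustrative):

class MethodCall
  attr_reader :real_time, :cpu_time, :call_count

  def initialize
    @real_time = 0.0
    @cpu_time = 0.0
    @call_count = 0
  end

  # Every invocation adds to the running totals, so the reported
  # metric reflects total time spent in the method rather than a
  # sample per call.
  def measure
    start_real = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    start_cpu = Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID)

    retval = yield

    @real_time += Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_real
    @cpu_time += Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID) - start_cpu
    @call_count += 1

    retval
  end
end

With this shape, four ~10 ms calls yield a call_count of 4 and a
real_time of roughly 0.04, recovering the total rather than the
per-call average.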
We can't do a lot with classes without names as we can't filter by them,
have no idea where they come from, etc. As such it's best to just ignore
these.
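A minimal guard, assuming the instrumentation iterates over candidate
classes:

# Anonymous classes (e.g. Class.new without a constant assignment)
# return nil from #name, so we skip them.
def instrumentable?(klass)
  !klass.name.nil?
end

instrumentable?(Class.new) # => false
instrumentable?(String)    # => true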