rabbitmq-server

Commit Graph

Author	SHA1	Message	Date
Michael Klishin	f414c2d512	More missed license header updates #9969	2024-02-05 11:53:50 -05:00
Michael Klishin	01092ff31f	(c) year bumps	2024-01-01 22:02:20 -05:00
Lajos Gerecs	82e25af5d5	Grafana: make sure dashboards do not break when detailed metrics are used (#5945) * Fix broken dashboards if detailed metrics are used If detailed metrics are pulled into the same Prometheus, then we get an error in Grafana: execution: many-to-many matching not allowed: matching labels must be unique on one side This is because both endpoints provide `rabbit_identity_info` which is not unique to the endpoint. * add detailed metric scraper to prometheus config --------- Co-authored-by: Michal Kuratczyk <michal.kuratczyk@broadcom.com>	2023-12-27 15:44:05 +01:00
Péter Gömöri	fec09c0792	Escape prometheus core metric label values For example special characters like double quotes are allowed in queue names, in which case detailed metrics could produce unparsable text format output.	2023-12-03 01:14:44 +01:00
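As an illustration of the escaping described in the commit above, here is a minimal Python sketch (the plugin itself is written in Erlang, and this is not its code). It applies the escaping rules of the Prometheus text exposition format for label values: backslash, double quote and line feed. The function names are hypothetical.

```python
def escape_label_value(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    # Backslashes are escaped first so the escapes added below are not doubled.
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def render_sample(name: str, labels: dict, value) -> str:
    """Render one sample line, e.g. metric{queue="q"} 1 (hypothetical helper)."""
    rendered = ",".join(
        f'{key}="{escape_label_value(str(val))}"' for key, val in labels.items()
    )
    return f"{name}{{{rendered}}} {value}"
```

Without the escaping step, a queue name containing a double quote would end the label value early and leave the line unparsable for a scraper.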
Michael Klishin	1b642353ca	Update (c) according to [1] 1. https://investors.broadcom.com/news-releases/news-release-details/broadcom-and-vmware-intend-close-transaction-november-22-2023	2023-11-21 23:18:22 -05:00
Johan Rhodin	0b2a94c1ec	Update RabbitMQ-Overview.json Global counters for producers added in https://github.com/rabbitmq/rabbitmq-server/pull/3127 but never made it to this dashboard	2023-11-01 13:23:41 -05:00
Diana Parra Corbacho	5f0981c5a3	Allow to use Khepri database to store metadata instead of Mnesia [Why] Mnesia is a very powerful and convenient tool for Erlang applications: it is a persistent disc-based database, it handles replication across multiple Erlang nodes and it is available out-of-the-box from the Erlang/OTP distribution. RabbitMQ relies on Mnesia to manage all its metadata: * virtual hosts' properties * internal users * queue, exchange and binding declarations (not queues data) * runtime parameters and policies * ... Unfortunately Mnesia makes it difficult to handle network partition and, as a consequence, the merge conflicts between Erlang nodes once the network partition is resolved. RabbitMQ provides several partition handling strategies but they are not bullet-proof. Users still hit situations where it is a pain to repair a cluster following a network partition. [How] @kjnilsson created Ra [1], a Raft consensus library that RabbitMQ already uses successfully to implement quorum queues and streams for instance. Those queues do not suffer from network partitions. We created Khepri [2], a new persistent and replicated database engine based on Ra and we want to use it in place of Mnesia in RabbitMQ to solve the problems with network partitions. This patch integrates Khepri as an experimental feature. When enabled, RabbitMQ will store all its metadata in Khepri instead of Mnesia. This change comes with behavior changes. While Khepri remains disabled, you should see no changes to the behavior of RabbitMQ. If there are changes, it is a bug. After Khepri is enabled, there are significant changes of behavior that you should be aware of. Because it is based on the Raft consensus algorithm, when there is a network partition, only the cluster members that are in the partition with at least `(Number of nodes in the cluster ÷ 2) + 1` number of nodes can "make progress". 
In other words, only those nodes may write to the Khepri database and read from the database and expect a consistent result. For instance in a cluster of 5 RabbitMQ nodes: * If there are two partitions, one with 3 nodes, one with 2 nodes, only the group of 3 nodes will be able to write to the database. * If there are three partitions, two with 2 nodes, one with 1 node, none of the groups can write to the database. Because the Khepri database will be used for all kinds of metadata, it means that RabbitMQ nodes that can't write to the database will be unable to perform some operations. A list of operations and what to expect is documented in the associated pull request and the RabbitMQ website. This requirement from Raft also affects the startup of RabbitMQ nodes in a cluster. Indeed, at least a quorum number of nodes must be started at once to allow nodes to become ready. To enable Khepri, you need to enable the `khepri_db` feature flag: rabbitmqctl enable_feature_flag khepri_db When the `khepri_db` feature flag is enabled, the migration code performs the following two tasks: 1. It synchronizes the Khepri cluster membership from the Mnesia cluster. It uses `mnesia_to_khepri:sync_cluster_membership/1` from the `khepri_mnesia_migration` application [3]. 2. It copies data from relevant Mnesia tables to Khepri, doing some conversion if necessary on the way. Again, it uses `mnesia_to_khepri:copy_tables/4` from `khepri_mnesia_migration` to do it. This can be performed on a running standalone RabbitMQ node or cluster. Data will be migrated from Mnesia to Khepri without any service interruption. Note that during the migration, the performance may decrease and the memory footprint may go up. Because this feature flag is considered experimental, it is not enabled by default even on a brand new RabbitMQ deployment. More about the implementation details below: In the past months, all accesses to Mnesia were isolated in a collection of `rabbit_db` modules. 
This is where the integration of Khepri mostly takes place: we use a function called `rabbit_khepri:handle_fallback/1` which selects the database and performs the query or the transaction. Here is an example from `rabbit_db_vhost`: * Up until RabbitMQ 3.12.x: get(VHostName) when is_binary(VHostName) -> get_in_mnesia(VHostName). * Starting with RabbitMQ 3.13.0: get(VHostName) when is_binary(VHostName) -> rabbit_khepri:handle_fallback( #{mnesia => fun() -> get_in_mnesia(VHostName) end, khepri => fun() -> get_in_khepri(VHostName) end}). This `rabbit_khepri:handle_fallback/1` function relies on two things: 1. the fact that the `khepri_db` feature flag is enabled, in which case it always executes the Khepri-based variant. 2. the ability or not to read and write to Mnesia tables otherwise. Before the feature flag is enabled, or during the migration, the function will try to execute the Mnesia-based variant. If it succeeds, then it returns the result. If it fails because one or more Mnesia tables can't be used, it restarts from scratch: it means the feature flag is being enabled and depending on the outcome, either the Mnesia-based variant will succeed (the feature flag couldn't be enabled) or the feature flag will be marked as enabled and it will call the Khepri-based variant. The meat of this function really lives in the `khepri_mnesia_migration` application [3] and `rabbit_khepri:handle_fallback/1` is a wrapper on top of it that knows about the feature flag. However, some calls to the database do not depend on the existence of Mnesia tables, such as functions where we need to learn about the members of a cluster. For those, we can't rely on exceptions from Mnesia. Therefore, we just look at the state of the feature flag to determine which database to use. There are two situations though: * Sometimes, we need the feature flag state query to block because the function interested in it can't return a valid answer during the migration. 
Here is an example: case rabbit_khepri:is_enabled(RemoteNode) of true -> can_join_using_khepri(RemoteNode); false -> can_join_using_mnesia(RemoteNode) end * Sometimes, we need the feature flag state query to NOT block (for instance because it would cause a deadlock). Here is an example: case rabbit_khepri:get_feature_state() of enabled -> members_using_khepri(); _ -> members_using_mnesia() end Direct accesses to Mnesia still exist. They are limited to code that is specific to Mnesia such as classic queue mirroring or network partitions handling strategies. Now, to discover the Mnesia tables to migrate and how to migrate them, we use an Erlang module attribute called `rabbit_mnesia_tables_to_khepri_db` which indicates a list of Mnesia tables and an associated converter module. Here is an example in the `rabbitmq_recent_history_exchange` plugin: -rabbit_mnesia_tables_to_khepri_db( [{?RH_TABLE, rabbit_db_rh_exchange_m2k_converter}]). The converter module (`rabbit_db_rh_exchange_m2k_converter` in this example) is in fact a "sub" converter module called by `rabbit_db_m2k_converter`. See the documentation of a `mnesia_to_khepri` converter module to learn more about these modules. [1] https://github.com/rabbitmq/ra [2] https://github.com/rabbitmq/khepri [3] https://github.com/rabbitmq/khepri_mnesia_migration See #7206. Co-authored-by: Jean-Sébastien Pédron <jean-sebastien@rabbitmq.com> Co-authored-by: Diana Parra Corbacho <dparracorbac@vmware.com> Co-authored-by: Michael Davis <mcarsondavis@gmail.com>	2023-09-29 16:00:11 +02:00
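The quorum rule quoted in the commit above, `(Number of nodes in the cluster ÷ 2) + 1`, can be checked against its two 5-node examples with a short Python sketch. It assumes the division is integer division, and the function names are hypothetical; this is not code from RabbitMQ or Khepri.

```python
def quorum_size(cluster_size: int) -> int:
    """Minimum partition size that can still make progress (assumes integer division)."""
    return cluster_size // 2 + 1


def partitions_with_quorum(cluster_size: int, partition_sizes: list) -> list:
    """Return the sizes of the partitions that are able to write to the database."""
    needed = quorum_size(cluster_size)
    return [size for size in partition_sizes if size >= needed]
```

For a 5-node cluster, `partitions_with_quorum(5, [3, 2])` keeps only the group of 3, and `partitions_with_quorum(5, [2, 2, 1])` returns an empty list. Both results agree with the examples in the commit message.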
Péter Gömöri	e009a0af72	Expose number of unreachable cluster peers via Prometheus Unreachable peers is a subset of DB cluster nodes that are not connected to the current node via Erlang distribution for any reason.	2023-09-17 17:18:41 +02:00
Michal Kuratczyk	c56f2e2678	Remove the query threshold The graph looks empty or broken when values are sometimes above and sometimes below the 5000 limit. I think it's better to just show everything.	2023-09-07 11:33:35 +02:00
David Ansari	0f5fe8fadd	Add Prometheus metric messages dropped by MQTT QoS 0 queue type Why: A RabbitMQ operator should be able to see whether RabbitMQ drops MQTT QoS 0 messages due to overload protection. It's an indication that an MQTT subscriber does not consume fast enough. How: Use Prometheus global counters. There are 2 valid solutions: 1. Introduce a new metric called messages_dropped specifically for the rabbitmq_mqtt_qos0_queue type. This would work in a similar fashion to how streams extend the per protocol global counters, but requires extending the per protocol & queue type global counters for the MQTT QoS queue type. The emitted metrics would look as follows: ``` rabbitmq_global_messages_dropped_total{protocol="mqtt310",queue_type="rabbit_mqtt_qos0_queue"} 0 rabbitmq_global_messages_dropped_total{protocol="mqtt311",queue_type="rabbit_mqtt_qos0_queue"} 0 rabbitmq_global_messages_dropped_total{protocol="mqtt50",queue_type="rabbit_mqtt_qos0_queue"} 0 ``` 2. Reuse the existing metric rabbitmq_global_messages_dead_lettered_maxlen_total This commit decides to go for the 2nd approach because: a) there is no need to add a new metric. Even though dead lettering is not supported for the MQTT QoS 0 queue type, this metric maps nicely to what happens: The queue drops messages since its max length (mqtt.mailbox_soft_limit) is exceeded with overflow behaviour drop-head. Furthermore the label `dead_letter_strategy="disabled"` tells that dead lettering is not taking place from this queue type. b) this metric allows to support dead lettering for the MQTT QoS 0 queue type in the future. 
The new dead lettering metrics look as follows: ``` rabbitmq_global_messages_dead_lettered_maxlen_total{queue_type="rabbit_classic_queue",dead_letter_strategy="at_most_once"} 0 rabbitmq_global_messages_dead_lettered_maxlen_total{queue_type="rabbit_classic_queue",dead_letter_strategy="disabled"} 0 rabbitmq_global_messages_dead_lettered_maxlen_total{queue_type="rabbit_mqtt_qos0_queue",dead_letter_strategy="disabled"} 0 rabbitmq_global_messages_dead_lettered_maxlen_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_most_once"} 0 rabbitmq_global_messages_dead_lettered_maxlen_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="disabled"} 0 rabbitmq_global_messages_dead_lettered_expired_total{queue_type="rabbit_classic_queue",dead_letter_strategy="at_most_once"} 0 rabbitmq_global_messages_dead_lettered_expired_total{queue_type="rabbit_classic_queue",dead_letter_strategy="disabled"} 0 rabbitmq_global_messages_dead_lettered_expired_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_least_once"} 0 rabbitmq_global_messages_dead_lettered_expired_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_most_once"} 0 rabbitmq_global_messages_dead_lettered_expired_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="disabled"} 0 rabbitmq_global_messages_dead_lettered_rejected_total{queue_type="rabbit_classic_queue",dead_letter_strategy="at_most_once"} 0 rabbitmq_global_messages_dead_lettered_rejected_total{queue_type="rabbit_classic_queue",dead_letter_strategy="disabled"} 0 rabbitmq_global_messages_dead_lettered_rejected_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_least_once"} 0 rabbitmq_global_messages_dead_lettered_rejected_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_most_once"} 0 rabbitmq_global_messages_dead_lettered_rejected_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="disabled"} 0 
rabbitmq_global_messages_dead_lettered_delivery_limit_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_least_once"} 0 rabbitmq_global_messages_dead_lettered_delivery_limit_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_most_once"} 0 rabbitmq_global_messages_dead_lettered_delivery_limit_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="disabled"} 0 rabbitmq_global_messages_dead_lettered_confirmed_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_least_once"} 0 ```	2023-08-15 16:06:15 +02:00
Iliia Khaprov	19f122fea4	Prometheus core metrics collector: Do not render any samples that are NaN or undefined close #8740	2023-07-08 18:45:03 +02:00
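A minimal Python sketch of the filtering idea in the commit above. The real collector is Erlang code; here the Erlang atom `undefined` is modelled, as an assumption, by `None` or the string `"undefined"`, and the function names are hypothetical.

```python
import math


def is_renderable(value) -> bool:
    """Return False for values that should not appear in the scrape output."""
    # Stand-ins for the Erlang atom `undefined` (an assumption of this sketch).
    if value is None or value == "undefined":
        return False
    # NaN never compares equal to anything, so it needs an explicit check.
    if isinstance(value, float) and math.isnan(value):
        return False
    return True


def renderable_samples(samples):
    """Keep only the (name, value) pairs that are safe to render."""
    return [(name, value) for name, value in samples if is_renderable(value)]
```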
Simon Unge	2a2af36b9c	rename fix	2023-06-23 14:53:57 -07:00
Simon Unge	8b3ca4c972	See #8605. Add authentication support to prometheus.	2023-06-23 13:54:45 -07:00
Michael Klishin	55442aa914	Replace @rabbitmq.com addresses with rabbitmq-core@groups.vmware.com Don't ask why we have to do it. Because reasons!	2023-06-20 15:40:13 +04:00
Rin Kuryloski	eb94a58bc9	Add a workflow to compare the bazel/erlang.mk output To catch any drift between the builds	2023-05-15 13:54:14 +02:00
Michael Klishin	59fe5dc01b	Prometheus: handle scenarios when no listener is configured Start a plain TCP one with all defaults.	2023-05-06 00:19:58 +04:00
Chunyi Lyu	4ddb0c2038	Support TLS-only listener for Prometheus - tcp listener can be turned off by setting 'prometheus.tcp.listener = none' - config schema follows web_mqtt and web_stomp	2023-05-05 15:44:53 +01:00
Rin Kuryloski	a944439fba	Replace globs in bazel with explicit lists of files As this is preferred in rules_erlang 3.9.14	2023-04-25 17:29:12 +02:00
Rin Kuryloski	854d01d9a5	Restore the original -include_lib statements from before #6466 since this broke erlang_ls requires rules_erlang 3.9.13	2023-04-20 12:40:45 +02:00
Rin Kuryloski	8de8f59d47	Use gazelle generated bazel files Bazel build files are now maintained primarily with `bazel run gazelle`. This will analyze and merge changes into the build files as necessitated by certain code changes (e.g. the introduction of new modules). In some cases there are hints to gazelle in the build files, such as `# gazelle:erlang...` or `# keep` comments. xref checks on plugins that depend on the cli are a good example.	2023-04-17 18:13:18 +02:00
Rin Kuryloski	8a7eee6a86	Ignore warnings when building plt files for dependencies As we don't generally care if a dependency has warnings, only the target	2023-04-17 10:09:24 +02:00
Rin Kuryloski	609171ec70	Rename the tanzu cli scope to vmware And update other references to commercial editions	2023-02-16 13:49:54 +01:00
Ilia Kurenkov	051419e46b	Fix descriptions of auth metrics.	2023-02-11 23:36:21 +01:00
Alexey Lebedeff	c7da0da8b8	Cleanup dialyzer calls - Use the same base .plt everywhere, so there is no need to list standard apps everywhere - Fix typespecs: some typos and the use of not-exported types	2023-02-06 17:05:30 +01:00
Rin Kuryloski	5ef8923462	Avoid the need to pass package name to rabbitmq_integration_suite	2023-01-18 15:25:27 +01:00
Rin Kuryloski	a317b30807	Use improved assert_suites2 macro from rules_erlang 3.9.0	2023-01-18 15:07:06 +01:00
Michael Klishin	ba7b44df8a	Merge pull request #6879 from rabbitmq/dialyzer-warnings-rabbitmq-prometheus Fix all dialyzer warnings in rabbitmq_prometheus	2023-01-13 12:21:15 -06:00
Alexey Lebedeff	cd92258346	Fix all dialyzer warnings in rabbitmq_prometheus	2023-01-13 15:52:26 +01:00
Michal Kuratczyk	510415f8b9	Update prometheus.erl to 4.10.0 Since 4.10.0 was released specifically to address an issue we encountered in RabbitMQ integration with prometheus.erl, new test was added to validate this functionality in the future.	2023-01-13 10:24:41 +01:00
Michael Klishin	8e8def801c	Wording	2023-01-03 14:19:58 -05:00
Ilia Kurenkov	98a5e34e90	DRY: Link docs for `/metrics/detailed` endpoint to website.	2023-01-03 13:45:45 +01:00
Michael Klishin	ec4f1dba7d	(c) year bump: 2022 => 2023	2023-01-01 23:17:36 -05:00
Luke Bakken	7fe159edef	Yolo-replace format strings Replaces `~s` and `~p` with their unicode-friendly counterparts. ``` git ls-files *.erl | xargs sed -i.ORIG -e s/~s>/~ts/g -e s/~p>/~tp/g ```	2022-10-10 10:32:03 +04:00
Loïc Hoguin	73dd0acf01	rabbit_prometheus_http_SUITE: Update tests for new CQs CQs without consumers will have only one message in memory.	2022-09-27 12:00:10 +02:00
Michael Klishin	96b6e6c368	Merge pull request #5463 from rabbitmq/global-metrics-values Move message rate metrics from aggregated to global counters	2022-09-12 20:25:08 +04:00
Iliia Khaprov - VMware	e15d12d767	Merge pull request #5449 from rabbitmq/grafana-9-support Update RabbitMQ Dashboards to support latest Grafana versions	2022-08-24 21:25:56 +02:00
David Ansari	c3cccf4963	Rename run_queues_length_total to run_queues_length It's a gauge, not a counter. @deadtrickster fixed the bug in `d0feb0df58` See #4380	2022-08-24 18:04:14 +02:00
Péter Gömöri	bf00ee4cfc	Fix a typo in a comment in prometheus_rabbitmq_core_metrics_collector	2022-08-23 00:54:35 +02:00
Connor Rogers	6ee0a318e8	Move message rate metrics from channel/queue aggregation to global counters	2022-08-08 16:19:01 +01:00
Connor Rogers	c88326ef23	Add README.md for creating/updating dashboards	2022-08-05 17:41:47 +01:00
Connor Rogers	e35fd65ff3	Fix overview graphs in Grafana 9 '-1' is no longer accepted as of Grafana 9, and causes a console error when rendering	2022-08-05 17:09:34 +01:00
Connor Rogers	9ac6862e06	Fix dist link graph Both directions of the link were showing as one entry instead of two. This is because of https://github.com/flant/grafana-statusmap/issues/277	2022-08-05 16:51:21 +01:00
Connor Rogers	42f30ba7c3	Set time series to show all series in tooltip	2022-08-05 16:14:29 +01:00
Connor Rogers	40767cdae4	Take dashboard definitions straight from exported Grafana for simplicity	2022-08-05 16:09:30 +01:00
Connor Rogers	4d28eef0f8	Migrate from deprecated panels in Grafana	2022-08-05 15:46:27 +01:00
Connor Rogers	8e404ecd04	Update to supported Grafana and Prometheus versions	2022-08-05 12:52:26 +01:00
Jean-Sébastien Pédron	6e9ee4d0da	Remove test code which depended on the `quorum_queue` feature flags These checks are now irrelevant as the feature flag is required.	2022-08-01 12:41:30 +02:00
Iliia Khaprov	360db38db0	Add process_start_time_seconds metrics. See #4539	2022-07-13 08:35:33 +02:00
Philip Kuryloski	15a79466b1	Use the new xref2 macro from rules_erlang That adopts the modern erlang.mk xref behaviour	2022-06-09 23:18:28 +02:00
Philip Kuryloski	327f075d57	Make rabbitmq-server work with rules_erlang 3 Also rework elixir dependency handling, so we no longer rely on mix to fetch the rabbitmq_cli deps Also: - Specify ra version with a commit rather than a branch - Fixup compilation options for erlang 23 - Add missing ra reference in MODULE.bazel - Add missing flag in oci.yaml - Reduce bazel rbe jobs to try to save memory - Use bazel built erlang for erlang git master tests - Use the same cache for all the workflows but windows - Avoid using `mix local.hex --force` in elixir rules - Fetching seems blocked in CI, and this should reduce hex api usage in all builds, which is always nice - Remove xref and dialyze tags since rules_erlang 3 includes them in the defaults	2022-06-08 14:04:53 +02:00
Loïc Hoguin	dc70cbf281	Update Erlang.mk and switch to new xref code	2022-05-31 13:51:12 +02:00
Michael Klishin	018b04a1ea	Wording	2022-04-26 23:08:00 +04:00
Péter Gömöri	35b21797ae	Expose head_message_timestamp via Prometheus plugin as well It is already exposed via rabbitmqctl and the API. It is also exposed by old or unofficial prometheus plugins and other monitoring integrations (DataDog).	2022-04-26 15:42:57 +02:00
Michael Klishin	7c47d0925a	Revert "Correct a double quote introduced in #4603" This reverts commit `6a44e0e2ef`. That wiped a lot of files unintentionally	2022-04-20 16:05:56 +04:00
Michael Klishin	6a44e0e2ef	Correct a double quote introduced in #4603	2022-04-20 16:01:29 +04:00
Luke Bakken	dba25f6462	Replace files with symlinks This prevents duplicated and out-of-date instructions.	2022-04-15 06:04:29 -07:00
Loïc Hoguin	499e0b9197	Remove the CQv1 disabled stats from management/Prometheus	2022-04-05 12:37:54 +02:00
Michael Klishin	c38a3d697d	Bump (c) year	2022-03-21 01:21:56 +04:00
David Ansari	a3905da47c	Add note about missed Prometheus counter updates Currently, the quorum queue state machine updates counters via mod_call effects which are not guaranteed to be executed. They are updated via mod_call effects such that only the leader increments the counter (and not the followers). In certain failure scenarios when dead-lettering lots of messages at the same time, these mod_call effects might not be executed. Hence, one shouldn't rely on the counters for dead-lettered messages and dead-lettered confirmed messages matching up 100%, even though all dead-lettered messages were eventually confirmed.	2022-02-28 16:28:09 +01:00
David Ansari	8c286cc680	Add Prometheus metrics for dead-lettered messages > curl -s localhost:15692/metrics | grep rabbitmq_global_messages_dead_lettered # TYPE rabbitmq_global_messages_dead_lettered_delivery_limit_total counter # HELP rabbitmq_global_messages_dead_lettered_delivery_limit_total Total number of messages dead-lettered due to delivery-limit exceeded rabbitmq_global_messages_dead_lettered_delivery_limit_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_least_once"} 0 rabbitmq_global_messages_dead_lettered_delivery_limit_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_most_once"} 0 rabbitmq_global_messages_dead_lettered_delivery_limit_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="disabled"} 0 # TYPE rabbitmq_global_messages_dead_lettered_expired_total counter # HELP rabbitmq_global_messages_dead_lettered_expired_total Total number of messages dead-lettered due to message TTL exceeded rabbitmq_global_messages_dead_lettered_expired_total{queue_type="rabbit_classic_queue",dead_letter_strategy="at_most_once"} 0 rabbitmq_global_messages_dead_lettered_expired_total{queue_type="rabbit_classic_queue",dead_letter_strategy="disabled"} 0 rabbitmq_global_messages_dead_lettered_expired_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_least_once"} 0 rabbitmq_global_messages_dead_lettered_expired_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_most_once"} 0 rabbitmq_global_messages_dead_lettered_expired_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="disabled"} 0 # TYPE rabbitmq_global_messages_dead_lettered_rejected_total counter # HELP rabbitmq_global_messages_dead_lettered_rejected_total Total number of messages dead-lettered due to basic.reject or basic.nack rabbitmq_global_messages_dead_lettered_rejected_total{queue_type="rabbit_classic_queue",dead_letter_strategy="at_most_once"} 0 
rabbitmq_global_messages_dead_lettered_rejected_total{queue_type="rabbit_classic_queue",dead_letter_strategy="disabled"} 0 rabbitmq_global_messages_dead_lettered_rejected_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_least_once"} 0 rabbitmq_global_messages_dead_lettered_rejected_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_most_once"} 0 rabbitmq_global_messages_dead_lettered_rejected_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="disabled"} 0 # TYPE rabbitmq_global_messages_dead_lettered_confirmed_total counter # HELP rabbitmq_global_messages_dead_lettered_confirmed_total Total number of messages dead-lettered and confirmed by target queues rabbitmq_global_messages_dead_lettered_confirmed_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_least_once"} 0 # TYPE rabbitmq_global_messages_dead_lettered_maxlen_total counter # HELP rabbitmq_global_messages_dead_lettered_maxlen_total Total number of messages dead-lettered due to overflow drop-head or reject-publish-dlx rabbitmq_global_messages_dead_lettered_maxlen_total{queue_type="rabbit_classic_queue",dead_letter_strategy="at_most_once"} 0 rabbitmq_global_messages_dead_lettered_maxlen_total{queue_type="rabbit_classic_queue",dead_letter_strategy="disabled"} 0 rabbitmq_global_messages_dead_lettered_maxlen_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="at_most_once"} 0 rabbitmq_global_messages_dead_lettered_maxlen_total{queue_type="rabbit_quorum_queue",dead_letter_strategy="disabled"} 0 A few notes: * dead_letter_strategy 'disabled' means either user did not configure dead-letter-exchange or configured dead-letter-exchange does not exist. * Only time series that make sense get output. Example 1: Combination of 'at_least_once' and 'maxlen' will always be 0. Hence, we omit that time series. Example 2: 'confirmed' makes only sense with quorum queues and 'at_least_once'. Example 3: 'delivery_limit' makes only sense with quorum queues. 
* Users get to know why messages were dead-lettered. * Before this commit, there was no possibility for users to alert based on messages being dropped from the head of the queue when overflow=drop-head. * Users can now easily create alerts: Example 1: Message gets silently dropped (i.e. dead_letter_strategy='disabled') instead of actually dead-lettered. Example 2: Detect dead-letter topology misconfigurations. Example 3: Messages expire Example 4: Messages overflow Example 5: Messages requeued too often * Stream queues by definition do not dead-letter.	2022-02-28 16:28:02 +01:00
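The first alerting example in the commit above (messages silently dropped because `dead_letter_strategy` is `disabled`) can be sketched in Python against scraped text like the `curl` output shown there. This is an illustration only: the parser handles just the simple `name{labels} value` lines that appear in the commit, does not unescape label values, and uses hypothetical function names. A real deployment would more likely express this as a Prometheus alerting rule.

```python
import re

# Matches lines of the form: metric_name{label="value",...} 12
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
# Matches one label pair; escaped characters inside the value are tolerated.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text: str) -> list:
    """Parse labelled sample lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # TYPE / # HELP comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # samples without labels are out of scope for this sketch
        labels = dict(LABEL_RE.findall(match.group("labels")))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def silently_dropped(samples: list) -> float:
    """Sum the dead-lettered counters whose strategy is 'disabled'."""
    prefix = "rabbitmq_global_messages_dead_lettered_"
    return sum(
        value
        for name, labels, value in samples
        if name.startswith(prefix)
        and labels.get("dead_letter_strategy") == "disabled"
    )
```

A non-zero result means messages left a queue without being dead-lettered anywhere, which is the situation the commit describes as previously impossible to alert on.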
Philip Kuryloski	226e00fcd2	Tighten up dialyzer usage now that rules_erlang no longer cascades up dialyzer warnings from deps	2022-02-24 11:18:41 +01:00
Philip Kuryloski	d8201726ae	Ignore dialyzer warnings for most apps	2022-02-21 09:19:56 +01:00
Philip Kuryloski	efcd881658	Use rules_erlang v2 bazel-erlang has been renamed rules_erlang. v2 is a substantial refactor that brings Windows support. While this alone isn't enough to run all rabbitmq-server suites on windows, one can at least now start the broker (bazel run broker) and run the tests that do not start a background broker process	2022-01-18 13:43:46 +01:00
Alexey Lebedeff	7676ed9685	Use `rabbitmq_cluster_` prefix for cluster-wide metrics	2021-11-24 16:49:43 +01:00
Michael Klishin	38d64a54b1	Wording	2021-11-24 14:19:57 +03:00
Michael Klishin	a1c0cd3785	Wording	2021-11-24 14:02:10 +03:00
Alexey Lebedeff	6e3012aaf9	Add optional metrics for vhost and exchange count These can make sense in some scenarios, e.g. when vhost/exchanges are created using self-service automation	2021-11-24 11:00:41 +01:00
Luke Bakken	bd2858c208	Compile the regex	2021-11-22 08:30:17 -08:00
dcorbacho	a7c9b66653	Use own key to exclude queues	2021-11-16 16:53:17 +01:00
dcorbacho	242cb539b3	Exclude queues from aggregated metrics in prometheus collector Uses same exclusion pattern as the management agent	2021-11-16 10:23:39 +01:00
Alexey Lebedeff	8598c51579	Pre-render prometheus labels This makes per-object metrics twice as fast. Depends on https://github.com/deadtrickster/prometheus.erl/pull/137	2021-11-09 13:04:39 +01:00
Alexey Lebedeff	b9ebfb8980	Fix ssl port handling in prometheus plugin All ssl options were stored in the same proplist, and the code was then trying to determine whether an option actually belongs to ranch ssl options or not. Some keys landed in the wrong place, like it did happen in #2975 - different ports were mentioned in listener config (default at top-level, and non-default in `ssl_opts`). Then `ranch` and `rabbitmq_web_dispatch` were treating this differently. This change just moves all ranch ssl opts into proper place using schema, removing any need for guessing in code. The only downside is that advanced config compatibility is broken.	2021-10-20 14:55:33 +02:00
Michael Klishin	3826a0df25	Compile #3561	2021-10-13 01:27:16 +03:00
Michael Klishin	670f240537	Compile #3561	2021-10-12 20:17:51 +03:00
Johannes Würbach	84de860b4c	feat(prom): expose cluster id in identity	2021-10-12 15:43:46 +02:00
Alexey Lebedeff	989a299720	Emit identity info in prometheus /metrics/detailed endpoint This is needed to make filtering metrics on a cluster name possible.	2021-09-28 19:35:02 +02:00
Alexey Lebedeff	5501d07b8b	Use rabbitmq_ct_helpers to allocate prometheus port This test always used standard 15692 before, which were causing conflicts with e.g. local `make run-broker`.	2021-09-22 15:23:35 +02:00
Alexey Lebedeff	4bb2262140	Allow selective querying for prometheus plugin	2021-09-20 14:59:17 +02:00
Michael Klishin	47b20e8f7c	Prometheus: alarm-related metric naming	2021-08-17 20:58:24 +03:00
Ilya Khaprov	9fed915192	Add alarms prometheus collector. close #2653	2021-08-16 20:32:29 +02:00
Gerhard Lazu	62d82e1660	Break down metrics by node in all RabbitMQ-Stream pie charts Otherwise we won't be able to see which nodes are running "hot" Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>	2021-08-11 13:39:30 +01:00
David Ansari	4b774db5c1	Use same threshold color for "Errors since boot"	2021-08-02 17:05:17 +02:00
David Ansari	c99ee6961e	Use same colorMode in all RabbitMQ-Stream panels Co-authored-by: Gerhard Lazu <gerhard@lazu.co.uk>	2021-08-02 13:33:00 +02:00
David Ansari	ea18c31288	Make RabbitMQ-Stream dashboard work via ConfigMap Before this commit, importing the dashboard via ConfigMap as seen in `1eb1dc618e` didn't work because DS_PROMETHEUS variable was undefined in Grafana. Related to https://github.com/rabbitmq/rabbitmq-server/pull/3250 Co-authored-by: Gerhard Lazu <gerhard@lazu.co.uk>	2021-08-02 13:12:48 +02:00
Gerhard Lazu	65afbb931b	Ensure RabbitMQ-Stream dashboard works correctly after import This breaks the docker-compose integration, but we need to move away from it anyways, the whole dev flow needs revisiting after our focus on K8s. $__rate_interval does not work with irate, dropping it in favour of 60s, same as all other dashboards. This is a follow-up to https://github.com/rabbitmq/rabbitmq-server/pull/3250 Thanks @ansd for mentioning about the post-import issues. It was uploaded as https://grafana.com/api/dashboards/14798/revisions/3/download Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>	2021-07-30 13:53:02 +01:00
Gerhard Lazu	35a6369327	Restart stream-perf-test on-failure This handles the scenario where rmq2 is not available, and stream-perf-test exits with a non-zero exit code. Good spot @ansd! Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>	2021-07-30 11:25:36 +01:00
David Ansari	47d572908d	Convert string to integer for ulimits.nofile Before this commit: > make overview metrics services.rmq1.ulimits.nofile.hard must be a integer make: *** [Makefile:68: overview] Error 15 According to the docs https://docs.docker.com/compose/compose-file/compose-file-v3/#ulimits this must be an integer.	2021-07-30 09:46:38 +02:00
Gerhard Lazu	6f5c4118ea	Publish RabbitMQ-Stream dashboard to grafana.com Removed the Dockerfile and slimmed down the Makefile, all of this is now handled by https://github.com/rabbitmq/rabbitmq-server/blob/master/.github/workflows/oci.yaml cc @Zerpet @pjk25 More details here (including the steps used to publish to grafana.com): https://github.com/rabbitmq/release-engineering/issues/11#issuecomment-887627938 I don't want to hold up this PR, will invest in automating the steps described in the previous link another time. Time to 🚀 Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>	2021-07-29 19:34:05 +01:00
Philip Kuryloski	b26289cb47	Adjust rabbitmq_prometheus test suite timeouts in bazel	2021-07-22 11:00:14 +02:00
Gerhard Lazu	66ef8adfc8	Fix accept dependency in rabbitmq_prometheus It's a runtime dependency, not a build dependency. This is a fix and should be backported to v3.9.x, after rc.2 and just before the final release. Would you disagree @dumbbell? Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>	2021-07-21 13:38:54 +01:00
Philip Kuryloski	8f9de08de7	Also assert no missing suites for all other deps	2021-07-12 18:05:55 +02:00
dcorbacho	b636ad2565	Rename protocol error counters to _total	2021-06-30 12:46:41 +02:00
dcorbacho	c9305d948a	Use number of publishing channels as global publishers in amqp091	2021-06-29 08:10:42 +01:00
Philip Kuryloski	8c7e7e0656	Revert "Default all `rabbitmq_integration_suite` to flaky in bazel" This reverts commit `70cb8147b2`.	2021-06-23 20:53:14 +02:00
Gerhard Lazu	c7971252cd	Global counters per protocol + protocol AND queue_type This way we can show how many messages were received via a certain protocol (stream is the second real protocol besides the default amqp091 one), as well as by queue type, which is something that many have asked for for a really long time. The most important aspect is that we can also see them by protocol AND queue_type, which becomes very important for Streams, which have different rules from regular queues (for example, consuming messages is non-destructive, and deep queue backlogs - think billions of messages - are normal). Alerting and consumer scaling due to deep backlogs will now work correctly, as we can distinguish between regular queues & streams. This has gone through a few cycles, with @mkuratczyk & @dcorbacho covering most of the ground. @dcorbacho had most of this in https://github.com/rabbitmq/rabbitmq-server/pull/3045, but the main branch went through a few changes in the meantime. Rather than resolving all the conflicts, and then making the necessary changes, we (@gerhard + @kjnilsson) took all learnings and started re-applying a lot of the existing code from #3045. We are confident in this approach and would like to see it through. We continued working on this with @dumbbell, and the most important changes are captured in https://github.com/rabbitmq/seshat/pull/1. We expose these global counters in rabbitmq_prometheus via a new collector. We don't want to keep modifying the existing collector, which grew really complex in parts, especially since we introduced aggregation, but start with a new namespace, `rabbitmq_global_`, and continue building on top of it. The idea is to build in parallel, and slowly transition to the new metrics, because semantically the changes are too big since streams, and we have been discussing protocol-specific metrics with @kjnilsson, which makes me think that this approach is least disruptive and... simple.
While at this, we removed redundant empty return value handling in the channel. The function called no longer returns this. Also removed all DONE / TODO & other comments - we'll handle them when the time comes, no need to leave TODO reminders. Pairs @kjnilsson @dcorbacho @dumbbell (this is multiple commits squashed into one) Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>	2021-06-22 14:14:21 +01:00
Philip Kuryloski	70cb8147b2	Default all `rabbitmq_integration_suite` to flaky in bazel Most tests that can start rabbitmq nodes have some chance of flaking. Rather than chase individual flakes for now, this commit changes the default (though it can still be overridden, as is the case for config_scheme_SUITE in many places, since I have yet to see that particular suite flake).	2021-06-21 16:10:38 +02:00
Philip Kuryloski	30f9a95b9f	Add dialyze for remaining tier-1 plugins	2021-06-01 10:19:10 +02:00
Philip Kuryloski	a3dbdecb8c	Mark //deps/rabbitmq_prometheus:rabbit_prometheus_http_SUITE flaky	2021-05-21 18:32:20 +02:00
Philip Kuryloski	98e71c45d8	Perform xref checks on many tier-1 plugins	2021-05-21 12:03:22 +02:00
Philip Kuryloski	e6df6615e1	Further bazel file refactoring and deduplication	2021-05-11 16:15:33 +02:00

518 Commits