redis

Commit Graph

Author	SHA1	Message	Date
Ozan Tezcan	366c6aff81	Put replica online when bgsave is done (#13895 ) CI / test-ubuntu-latest (push) Waiting to run Details CI / test-sanitizer-address (push) Waiting to run Details CI / build-debian-old (push) Waiting to run Details CI / build-macos-latest (push) Waiting to run Details CI / build-32bit (push) Waiting to run Details CI / build-libc-malloc (push) Waiting to run Details CI / build-centos-jemalloc (push) Waiting to run Details CI / build-old-chain-jemalloc (push) Waiting to run Details Codecov / code-coverage (push) Waiting to run Details External Server Tests / test-external-standalone (push) Waiting to run Details External Server Tests / test-external-cluster (push) Waiting to run Details External Server Tests / test-external-nodebug (push) Waiting to run Details Spellcheck / Spellcheck (push) Waiting to run Details Before https://github.com/redis/redis/pull/13732, replicas were brought online immediately after master wrote the last bytes of the RDB file to the socket. This behavior remains unchanged if rdbchannel replication is not used. However, with rdbchannel replication, the replica is brought online after receiving the first ack which is sent by replica after rdb is loaded. To align the behavior, reverting this change to put replica online once bgsave is done. Additonal changes: - INFO field `mem_total_replication_buffers` will also contain `server.repl_full_sync_buffer.mem_used` which shows accumulated replication stream during rdbchannel replication on replica side. - Deleted debug level logging from some replication tests. These tests generate thousands of keys and it may cause per key logging on some cases.	2025-03-31 13:48:49 +03:00
Vitah Lin	057f039c4b	Fix 'RESTORE can set LFU' test (#13896 ) CI / test-ubuntu-latest (push) Has been cancelled Details CI / test-sanitizer-address (push) Has been cancelled Details CI / build-debian-old (push) Has been cancelled Details CI / build-macos-latest (push) Has been cancelled Details CI / build-32bit (push) Has been cancelled Details CI / build-libc-malloc (push) Has been cancelled Details CI / build-centos-jemalloc (push) Has been cancelled Details CI / build-old-chain-jemalloc (push) Has been cancelled Details Codecov / code-coverage (push) Has been cancelled Details External Server Tests / test-external-standalone (push) Has been cancelled Details External Server Tests / test-external-cluster (push) Has been cancelled Details External Server Tests / test-external-nodebug (push) Has been cancelled Details Spellcheck / Spellcheck (push) Has been cancelled Details When the `restore foo 0 $encoded freq 100` command and `set freq [r object freq foo]` run in different minute timestamps (i.e., when server.unixtime/60 changes between these operations), the assertion may fail due to the LFU decay. This PR updates the “RESTORE can set LFU” test to verify the actual freq value based on minute timestamps. --------- Co-authored-by: debing.sun <debing.sun@redis.com>	2025-03-28 13:33:58 +08:00
Ozan Tezcan	a0da8390a2	Fix use-after-free when diskless load config is not swapdb (#13887 ) CI / test-ubuntu-latest (push) Waiting to run Details CI / test-sanitizer-address (push) Waiting to run Details CI / build-debian-old (push) Waiting to run Details CI / build-macos-latest (push) Waiting to run Details CI / build-32bit (push) Waiting to run Details CI / build-libc-malloc (push) Waiting to run Details CI / build-centos-jemalloc (push) Waiting to run Details CI / build-old-chain-jemalloc (push) Waiting to run Details Codecov / code-coverage (push) Waiting to run Details External Server Tests / test-external-standalone (push) Waiting to run Details External Server Tests / test-external-cluster (push) Waiting to run Details External Server Tests / test-external-nodebug (push) Waiting to run Details Spellcheck / Spellcheck (push) Waiting to run Details When the diskless load configuration is set to on-empty-db, we retain a pointer to the function library context. When emptyData() is called, it frees this function library context pointer, leading to a use-after-free situation. I refactored code to ensure that emptyData() is called first, followed by retrieving the valid pointer to the function library context. Refactored code should not introduce any runtime implications. Bug introduced by https://github.com/redis/redis/pull/13495 (Redis 8.0) Co-authored-by: Oran Agra <oran@redislabs.com>	2025-03-26 21:50:10 +03:00
Cong Chen	981aa5c12f	Fix timing issue in HEXPIREAT test (#13873 ) CI / test-ubuntu-latest (push) Waiting to run Details CI / test-sanitizer-address (push) Waiting to run Details CI / build-debian-old (push) Waiting to run Details CI / build-macos-latest (push) Waiting to run Details CI / build-32bit (push) Waiting to run Details CI / build-libc-malloc (push) Waiting to run Details CI / build-centos-jemalloc (push) Waiting to run Details CI / build-old-chain-jemalloc (push) Waiting to run Details Codecov / code-coverage (push) Waiting to run Details External Server Tests / test-external-standalone (push) Waiting to run Details External Server Tests / test-external-cluster (push) Waiting to run Details External Server Tests / test-external-nodebug (push) Waiting to run Details Spellcheck / Spellcheck (push) Waiting to run Details This fixes an error that occurs in the job [test-valgrind-no-malloc-usable-size-test](https://github.com/redis/redis/actions/runs/13912357739/job/38929051397) of the Daily workflow: ``` *** [err]: HEXPIREAT - Set time and then get TTL (listpackex) in tests/unit/type/hash-field-expire.tcl Expected '999' to be between to '1000' and '2000' (context: type eval line 6 cmd {assert_range [r hpttl myhash FIELDS 1 field1] 1000 2000} proc ::test) ```	2025-03-26 10:00:38 +08:00
Oran Agra	2a189709e0	avoid possible use-after-free with module KSN changes (#13875 ) CI / test-ubuntu-latest (push) Has been cancelled Details CI / test-sanitizer-address (push) Has been cancelled Details CI / build-debian-old (push) Has been cancelled Details CI / build-macos-latest (push) Has been cancelled Details CI / build-32bit (push) Has been cancelled Details CI / build-libc-malloc (push) Has been cancelled Details CI / build-centos-jemalloc (push) Has been cancelled Details CI / build-old-chain-jemalloc (push) Has been cancelled Details Codecov / code-coverage (push) Has been cancelled Details External Server Tests / test-external-standalone (push) Has been cancelled Details External Server Tests / test-external-cluster (push) Has been cancelled Details External Server Tests / test-external-nodebug (push) Has been cancelled Details Spellcheck / Spellcheck (push) Has been cancelled Details in #13505, we changed the code to use the string value of the key rather than the integer value on the stack, but we have a test in unit/moduleapi/keyspace_events that uses keyspace notification hook to modify the value with RM_StringDMA, which can cause this value to be released before used. the reason it didn't happen so far is because we were using shared integers, so releasing the object doesn't free it.	2025-03-24 12:24:52 +02:00
Benson-li	427c36888e	Fix potential infinite loop of RANDOMKEY during client pause (#13863 ) CI / test-ubuntu-latest (push) Has been cancelled Details CI / test-sanitizer-address (push) Has been cancelled Details CI / build-debian-old (push) Has been cancelled Details CI / build-macos-latest (push) Has been cancelled Details CI / build-32bit (push) Has been cancelled Details CI / build-libc-malloc (push) Has been cancelled Details CI / build-centos-jemalloc (push) Has been cancelled Details CI / build-old-chain-jemalloc (push) Has been cancelled Details Codecov / code-coverage (push) Has been cancelled Details External Server Tests / test-external-standalone (push) Has been cancelled Details External Server Tests / test-external-cluster (push) Has been cancelled Details External Server Tests / test-external-nodebug (push) Has been cancelled Details Spellcheck / Spellcheck (push) Has been cancelled Details The bug mentioned in this [#13862](https://github.com/redis/redis/issues/13862) has been fixed. --------- Signed-off-by: li-benson <1260437731@qq.com> Signed-off-by: youngmore1024 <youngmore1024@outlook.com> Co-authored-by: youngmore1024 <youngmore1024@outlook.com>	2025-03-20 21:32:12 +08:00
debing.sun	cb02bd190b	Fix timing issue in module defrag test (#13870 ) After #13840, the data we populate becomes more complex and slower, we always wait for a defragmentation cycle to end before verifying that the test is okay. However, in some slow environments, an entire defragmentation cycle can exceed 5 seconds, and in my local test using 'taskset -c 0' it can reach 6 seconds, so increase the threshold to avoid test failures.	2025-03-20 21:22:47 +08:00
Yuan Wang	951ec79654	Cluster compatibility check (#13846 ) CI / test-ubuntu-latest (push) Waiting to run Details CI / test-sanitizer-address (push) Waiting to run Details CI / build-debian-old (push) Waiting to run Details CI / build-macos-latest (push) Waiting to run Details CI / build-32bit (push) Waiting to run Details CI / build-libc-malloc (push) Waiting to run Details CI / build-centos-jemalloc (push) Waiting to run Details CI / build-old-chain-jemalloc (push) Waiting to run Details Codecov / code-coverage (push) Waiting to run Details External Server Tests / test-external-standalone (push) Waiting to run Details External Server Tests / test-external-cluster (push) Waiting to run Details External Server Tests / test-external-nodebug (push) Waiting to run Details Spellcheck / Spellcheck (push) Waiting to run Details ### Background The program runs normally in standalone mode, but migrating to cluster mode may cause errors, this is because some cross slot commands can not run in cluster mode. We should provide an approach to detect this issue when running in standalone mode, and need to expose a metric which indicates the usage of no incompatible commands. ### Solution To avoid perf impact, we introduce a new config `cluster-compatibility-sample-ratio` which define the sampling ratio (0-100) for checking command compatibility in cluster mode. When a command is executed, it is sampled at the specified ratio to determine if it complies with Redis cluster constraints, such as cross-slot restrictions. A new metric is exposed: `cluster_incompatible_ops` in `info stats` output. The following operations will be considered incompatible operations. - cross-slot command If a command has multiple cross slot keys, it is incompatible - `swap, copy, move, select` command These commands involve multi databases in some cases, we don't allow multiple DB in cluster mode, so there are not compatible - Module command with `no-cluster` flag If a module command has `no-cluster` flag, we will encounter an error when loading module, leading to fail to load module if cluster is enabled, so this is incompatible. - Script/function with `no-cluster` flag Similar with module command, if we declare `no-cluster` in shebang of script/function, we also can not run it in cluster mode - `sort` command by/get pattern When `sort` command has `by/get` pattern option, we must ask that the pattern slot is equal with the slot of keys, otherwise it is incompatible in cluster mode. - The script/function command accesses the keys and declared keys have different slots For the script/function command, we not only check the slot of declared keys, but only check the slot the accessing keys, if they are different, we think it is incompatible. Besides, commands like `keys, scan, flushall, script/function flush`, that in standalone mode iterate over all data to perform the operation, are only valid for the server that executes the command in cluster mode and are not broadcasted. However, this does not lead to errors, so we do not consider them as incompatible commands. ### Performance impact test cross slot test Below are the test commands and results. When using MSET with 8 keys, performance drops by approximately 3%. single key test It may be due to the overhead of the sampling function, and single-key commands could cause a 1-2% performance drop.	2025-03-20 10:35:53 +08:00
Filipe Oliveira (Redis)	3e012c9260	Fix string2d usage in case of hexadecimal strings parsing and overflow (#13845 ) CI / test-ubuntu-latest (push) Waiting to run Details CI / test-sanitizer-address (push) Waiting to run Details CI / build-debian-old (push) Waiting to run Details CI / build-macos-latest (push) Waiting to run Details CI / build-32bit (push) Waiting to run Details CI / build-libc-malloc (push) Waiting to run Details CI / build-centos-jemalloc (push) Waiting to run Details CI / build-old-chain-jemalloc (push) Waiting to run Details Codecov / code-coverage (push) Waiting to run Details External Server Tests / test-external-standalone (push) Waiting to run Details External Server Tests / test-external-cluster (push) Waiting to run Details External Server Tests / test-external-nodebug (push) Waiting to run Details Spellcheck / Spellcheck (push) Waiting to run Details Since https://github.com/redis/redis/pull/11884, what was previously accepted as a valid input (hexadecimal string) before 8.0 returned an error. This PR addresses it. To avoid performance penalties if hints the compiler that the fallbacks are not likely to happen. Furthermore, we were ignoring std::result_out_of_range outputs from fast_float. This PR addresses it as well and includes tests for both identified scenarios. --------- Co-authored-by: debing.sun <debing.sun@redis.com>	2025-03-19 20:08:45 +08:00
debing.sun	a5a3afd923	Fix crash during SLAVEOF when clients are blocked on lazyfree (#13853 ) CI / test-ubuntu-latest (push) Has been cancelled Details CI / test-sanitizer-address (push) Has been cancelled Details CI / build-debian-old (push) Has been cancelled Details CI / build-macos-latest (push) Has been cancelled Details CI / build-32bit (push) Has been cancelled Details CI / build-libc-malloc (push) Has been cancelled Details CI / build-centos-jemalloc (push) Has been cancelled Details CI / build-old-chain-jemalloc (push) Has been cancelled Details Codecov / code-coverage (push) Has been cancelled Details External Server Tests / test-external-standalone (push) Has been cancelled Details External Server Tests / test-external-cluster (push) Has been cancelled Details External Server Tests / test-external-nodebug (push) Has been cancelled Details Spellcheck / Spellcheck (push) Has been cancelled Details After https://github.com/redis/redis/pull/13167, when a client calls `FLUSHDB` command, we still async empty database, and the client was blocked until the lazyfree completes. 1) If another client calls `SLAVEOF` command during this time, the server will unblock all blocked clients, including those blocked by the lazyfree. However, when unblocking a lazyfree blocked client, we forgot to call `updateStatsOnUnblock()`, which ultimately triggered the following assertion. 2) If a client blocked by Lazyfree is unblocked midway, and at this point the `bio_comp_list` has already received the completion notification for the bio, we might end up processing a client that has already been unblocked in `flushallSyncBgDone()`. Therefore, we need to filter it out. --------- Co-authored-by: oranagra <oran@redislabs.com>	2025-03-17 20:27:05 +08:00
debing.sun	f364dcca2d	Make RM_DefragRedisModuleDict API support incremental defragmentation for dict leaf (#13840 ) After https://github.com/redis/redis/pull/13816, we make a new API to defrag RedisModuleDict. Currently, we only support incremental defragmentation of the dictionary itself, but the defragmentation of values is still not incremental. If the values are very large, it could lead to significant blocking. Therefore, in this PR, we have added incremental defragmentation for the values. The main change is to the `RedisModuleDefragDictValueCallback`, we modified the return value of this callback. When the callback returns 1, we will save the `seekTo` as the key of the current unfinished node, and the next time we enter, we will continue defragmenting this node. When the return value is 0, we will proceed to the next node. ## Test Since each dictionary in the global dict originally contained only 10 strings, but now it has been changed to a nested dictionary, each dictionary now has 10 sub-dictionaries, with each sub-dictionary containing 10 strings, this has led to a corresponding reduction in the defragmentation time obtained from other tests. Therefore, the other tests have been modified to always wait for defragmentation to be turned off before the test begins, then start it after creating fragmentation, ensuring that they can always run for a full defragmentation cycle. --------- Co-authored-by: ephraimfeldblum <ephraim.feldblum@redis.com>	2025-03-04 17:19:41 +08:00
debing.sun	7939ba031d	Enable the callback to be NULL for RM_DefragRedisModuleDict() and reduce the system calls of RM_DefragShouldStop() (#13830 ) 1) Enable the callback to be NULL for RM_DefragRedisModuleDict() Because the dictionary may store only the key without the value. 2) Reduce the system calls of RM_DefragShouldStop() The API checks the following thresholds before performing a time check: over 512 defrag hits, or over 1024 defrag misses, and performs the time judgment if any of these thresholds are reached. 3) Added defragmentation statistics for dictionary items to cover the associated code for RM_DefragRedisModuleDict(). 4) Removed `module_ctx` from `defragModuleCtx` struct, which can be replaced by a temporary variable. --------- Co-authored-by: oranagra <oran@redislabs.com>	2025-02-26 20:04:29 +08:00
Yuan Wang	f1d6542b1a	Stabilize tcl test cases (#13829 ) Recently encountered some errors as bellow, HGETEX/HSETEX with PXAT/EXAT options, after getting ttl, we calculate current time by `[clock seconds]` that may have a delay that causes results greater than expected. Dismiss memory test error, now we introduced rdb-channel replication, the full synchronization might finish before the child process exits. So we may fail if calling `bgsave` immediately after full sync.	2025-02-25 16:31:53 +08:00
Denis Nevmerzhitskii	33f03f6fc8	Fix wrong behavior of XREAD + after last entry of stream have been removed (#13632 ) Close #13628 This PR changes behavior of special `+` id of XREAD command. Now it uses `streamLastValidID` to find last entry instead of `last_id` field of stream object. This PR adds test for the issue. Notes Initial idea to update `last_id` while executing XDEL seems to be wrong. `last_id` is used to strore last generated id and not id of last entry. --------- Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: guybe7 <guy.benoish@redislabs.com>	2025-02-25 13:40:24 +08:00
Moti Cohen	0200e8ada6	Fix multiple issues with "INFO KEYSIZES" (#13825 ) This commit addresses several issues related to the `INFO KEYSIZES` feature: - HyperLogLog commands: `KEYSIZES` hooks were not properly set or tested. - HFE lazy expiration: `KEYSIZES` hooks were not properly set or tested. - Empty DB & SYNC flow: On `blocking_async=0` flow, global `keysizes` histogram were not reset (can reproduced using `DEBUG RELOAD`). - Empty string handling: Fix histogram for strings of size 0. Not relevant to other data-types.	2025-02-25 00:38:44 +02:00
debing.sun	ee933d9e2b	Fixed passing incorrect endtime value for module context (#13822 ) 1) Fix a bug that passing an incorrect endtime to module. This bug was found by @ShooterIT. After #13814, all endtime will be monotonic time, and we should no longer convert it to ustime relative. Add assertions to prevent endtime from being much larger thatn the current time. 2) Fix a race in test `Reduce defrag CPU usage when module data can't be defragged` --------- Co-authored-by: ShooterIT <wangyuancode@163.com>	2025-02-23 12:58:48 +08:00
debing.sun	032357ec0f	Add RM_DefragRedisModuleDict module API (#13816 ) After #13815, we introduced incremental defragmentation for global data for module. Now we added a new module API `RM_DefragRedisModuleDict` to incremental defrag `RedisModuleDict`. This PR adds a new APIs and a new defrag callback: ```c RedisModuleDict RM_DefragRedisModuleDict(RedisModuleDefragCtx ctx, RedisModuleDict dict, RedisModuleDefragDictValueCallback valueCB, RedisModuleString seekTo); typedef void (RedisModuleDefragDictValueCallback)(RedisModuleDefragCtx ctx, void data, unsigned char key, size_t keylen); ``` Usage: ```c RedisModuleString seekTo = NULL; RedisModuleDict dict = = RedisModule_CreateDict(ctx); ... populate the dict code ... /* Defragment a dictionary completely / do { RedisModuleDict new = RedisModule_DefragRedisModuleDict(ctx, dict, defragGlobalDictValueCB, &seekTo); if (new != NULL) { dict = new; } } while (seekTo); ``` --------- Co-authored-by: ShooterIT <wangyuancode@163.com> Co-authored-by: oranagra <oran@redislabs.com>	2025-02-20 21:09:29 +08:00
debing.sun	695126ccce	Add support for incremental defragmentation of global module data (#13815 ) ## Description Currently, when performing defragmentation on non-key data within the module, we cannot process the defragmentation incrementally. This limitation affects the efficiency and flexibility of defragmentation in certain scenarios. The primary goal of this PR is to introduce support for incremental defragmentation of global module data. ## Interface Change New module API `RegisterDefragFunc2` This is a more advanced version of `RM_RegisterDefragFunc`, in that it takes a new callbacks(`RegisterDefragFunc2`) that has a return value, and can use RM_DefragShouldStop in and indicate that it should be called again later, or is it done (returned 0). ## Note The `RegisterDefragFunc` API remains available. --------- Co-authored-by: ShooterIT <wangyuancode@163.com> Co-authored-by: oranagra <oran@redislabs.com>	2025-02-20 00:28:16 +08:00
debing.sun	725cd268e6	Refactor of ActiveDefrag to reduce latencies (#13814 ) This PR is based on: https://github.com/valkey-io/valkey/pull/1462 ## Issue/Problems Duty Cycle: Active Defrag has configuration values which determine the intended percentage of CPU to be used based on a gradient of the fragmentation percentage. However, Active Defrag performs its work on the 100ms serverCron timer. It then computes a duty cycle and performs a single long cycle. For example, if the intended CPU is computed to be 10%, Active Defrag will perform 10ms of work on this 100ms timer cron. * This type of cycle introduces large latencies on the client (up to 25ms with default configurations) * This mechanism is subject to starvation when slow commands delay the serverCron Maintainability: The current Active Defrag code is difficult to read & maintain. Refactoring of the high level control mechanisms and functions will allow us to more seamlessly adapt to new defragmentation needs. Specific examples include: * A single function (activeDefragCycle) includes the logic to start/stop/modify the defragmentation as well as performing one "step" of the defragmentation. This should be separated out, so that the actual defrag activity can be performed on an independent timer (see duty cycle above). * The code is focused on kvstores, with other actions just thrown in at the end (defragOtherGlobals). There's no mechanism to break this up to reduce latencies. * For the main dictionary (only), there is a mechanism to set aside large keys to be processed in a later step. However this code creates a separate list in each kvstore (main dict or not), bleeding/exposing internal defrag logic. We only need 1 list - inside defrag. This logic should be more contained for the main key store. * The structure is not well suited towards other non-main-dictionary items. For example, pub-sub and pub-sub-shard was added, but it's added in such a way that in CMD mode, with multiple DBs, we will defrag pub-sub repeatedly after each DB. ## Description of the feature Primarily, this feature will split activeDefragCycle into 2 functions. 1. One function will be called from serverCron to determine if a defrag cycle (a complete scan) needs to be started. It will also determine if the CPU expenditure needs to be adjusted. 2. The 2nd function will be a timer proc dedicated to performing defrag. This will be invoked independently from serverCron. Once the functions are split, there is more control over the latency created by the defrag process. A new configuration will be used to determine the running time for the defrag timer proc. The default for this will be 500us (one-half of the current minimum time). Then the timer will be adjusted to achieve the desired CPU. As an example, 5% of CPU will run the defrag process for 500us every 10ms. This is much better than running for 5ms every 100ms. The timer function will also adjust to compensate for starvation. If a slow command delays the timer, the process will run proportionately longer to ensure that the configured CPU is achieved. Given the presence of slow commands, the proportional extra time is insignificant to latency. This also addresses the overload case. At 100% CPU, if the event loop slows, defrag will run proportionately longer to achieve the configured CPU utilization. Optionally, in low CPU situations, there would be little impact in utilizing more than the configured CPU. We could optionally allow the timer to pop more often (even with a 0ms delay) and the (tail) latency impact would not change. And we add a time limit for the defrag duty cycle to prevent excessive latency. When latency is already high (indicated by a long time between calls), we don't want to make it worse by running defrag for too long. Addressing maintainability: * The basic code structure can more clearly be organized around a "cycle". * Have clear begin/end functions and a set of "stages" to be executed. * Rather than stages being limited to "kvstore" type data, a cycle should be more flexible, incorporating the ability to incrementally perform arbitrary work. This will likely be necessary in the future for certain module types. It can be used today to address oddballs like defragOtherGlobals. * We reduced some of the globals, and reduce some of the coupling. defrag_later should be removed from serverDb. * Each stage should begin on a fresh cycle. So if there are non-time-bounded operations like kvstoreDictLUTDefrag, these would be less likely to introduce additional latency. Signed-off-by: Jim Brunner [brunnerj@amazon.com](mailto:brunnerj@amazon.com) Signed-off-by: Madelyn Olson [madelyneolson@gmail.com](mailto:madelyneolson@gmail.com) Co-authored-by: Madelyn Olson [madelyneolson@gmail.com](mailto:madelyneolson@gmail.com) --------- Signed-off-by: Jim Brunner brunnerj@amazon.com Signed-off-by: Madelyn Olson madelyneolson@gmail.com Co-authored-by: Madelyn Olson madelyneolson@gmail.com Co-authored-by: ShooterIT <wangyuancode@163.com>	2025-02-20 00:05:24 +08:00
Ozan Tezcan	e2608478b6	Add HGETDEL, HGETEX and HSETEX hash commands (#13798 ) This PR adds three new hash commands: HGETDEL, HGETEX and HSETEX. These commands enable user to do multiple operations in one step atomically e.g. set a hash field and update its TTL with a single command. Previously, it was only possible to do it by calling hset and hexpire commands subsequently. - HGETDEL command ``` HGETDEL <key> FIELDS <numfields> field [field ...] ``` Description Get and delete the value of one or more fields of a given hash key Reply Array reply: list of the value associated with each field or nil if the field doesn’t exist. - HGETEX command ``` HGETEX <key> [EX seconds \| PX milliseconds \| EXAT unix-time-seconds \| PXAT unix-time-milliseconds \| PERSIST] FIELDS <numfields> field [field ...] ``` Description Get the value of one or more fields of a given hash key, and optionally set their expiration Options: EX seconds: Set the specified expiration time, in seconds. PX milliseconds: Set the specified expiration time, in milliseconds. EXAT timestamp-seconds: Set the specified Unix time at which the field will expire, in seconds. PXAT timestamp-milliseconds: Set the specified Unix time at which the field will expire, in milliseconds. PERSIST: Remove the time to live associated with the field. Reply Array reply: list of the value associated with each field or nil if the field doesn’t exist. - HSETEX command ``` HSETEX <key> [FNX \| FXX] [EX seconds \| PX milliseconds \| EXAT unix-time-seconds \| PXAT unix-time-milliseconds \| KEEPTTL] FIELDS <numfields> field value [field value...] ``` Description Set the value of one or more fields of a given hash key, and optionally set their expiration Options: FNX: Only set the fields if all do not already exist. FXX: Only set the fields if all already exist. EX seconds: Set the specified expiration time, in seconds. PX milliseconds: Set the specified expiration time, in milliseconds. EXAT timestamp-seconds: Set the specified Unix time at which the field will expire, in seconds. PXAT timestamp-milliseconds: Set the specified Unix time at which the field will expire, in milliseconds. KEEPTTL: Retain the time to live associated with the field. Note: If no option is provided, any associated expiration time will be discarded similar to how SET command behaves. Reply Integer reply: 0 if no fields were set Integer reply: 1 if all the fields were set	2025-02-14 17:13:35 +03:00
Yuan Wang	7f5f588232	AOF offset info (#13773 ) ### Background AOF is often used as an effective data recovery method, but now if we have two AOFs from different nodes, it is hard to learn which one has latest data. Generally, we determine whose data is more up-to-date by reading the latest modification time of the AOF file, but because of replication delay, even if both master and replica write to the AOF at the same time, the data in the master is more up-to-date (there are commands that didn't arrive at the replica yet, or a large number of commands have accumulated on replica side ), so we may make wrong decision. ### Solution The replication offset always increments when AOF is enabled even if there is no replica, we think replication offset is better method to determine which one has more up-to-date data, whoever has a larger offset will have newer data, so we add the start replication offset info for AOF, as bellow. ``` file appendonly.aof.2.base.rdb seq 2 type b file appendonly.aof.2.incr.aof seq 2 type i startoffset 224 ``` And if we close gracefully the AOF file, not a crash, such as `shutdown`, `kill signal 15` or `config set appendonly no`, we will add the end replication offset, as bellow. ``` file appendonly.aof.2.base.rdb seq 2 type b file appendonly.aof.2.incr.aof seq 2 type i startoffset 224 endoffset 532 ``` #### Things to pay attention to - For BASE AOF, we do not add `startoffset` and `endoffset` info, since we could not know the start replication replication of data, and it is useless to help us to determine which one has more up-to-date data. - For AOFs from old version, we also don't add `startoffset` and `endoffset` info, since we also don't know start replication replication of them. If we add the start offset from 0, we might make the judgment even less accurate. For example, if the master has just rewritten the AOF, its INCR AOF will inevitably be very small. However, if the replica has not rewritten AOF for a long time, its INCR AOF might be much larger. By applying the following method, we might make incorrect decisions, so we still just check timestamp instead of adding offset info - If the last INCR AOF has `startoffset` or `endoffset`, we need to restore `server.master_repl_offset` according to them to avoid the rollback of the `startoffset` of next INCR AOF. If it has `endoffset`, we just use this value as `server.master_repl_offset`, and a very important thing is to remove this information from the manifest file to avoid the next time we load the manifest file with wrong `endoffset`. If it only has `startoffset`, we calculate `server.master_repl_offset` by the `startoffset` plus the file size. ### How to determine which one has more up-to-date data If AOF has a larger replication offset, it will have more up-to-date data. The following is how to get AOF offset: Read the AOF manifest file to obtain information about the last INCR AOF 1. If the last INCR AOF has `endoffset` field, we can directly use the `endoffset` to present the replication offset of AOF 2. If there is no `endoffset`(such as redis crashes abnormally), but there is `startoffset` filed of the last INCR AOF, we can get the replication offset of AOF by `startoffset` plus the file size 3. Finally, if the AOF doesn’t have both `startoffset` and `endoffset`, maybe from old version, and new version redis has not rewritten AOF yet, we still need to check the modification timestamp of the last INCR AOF ### TODO Fix ping causing inconsistency between AOF size and replication offset in the future PR. Because we increment the replication offset when sending PING/REPLCONF to the replica but do not write data to the AOF file, this might cause the starting offset of the AOF file plus its size to be inconsistent with the actual replication offset.	2025-02-13 17:31:40 +08:00
kei-nan	d9134f8f95	Update tests/modules/moduleconfigs.c missing else clause Co-authored-by: debing.sun <debing.sun@redis.com>	2025-02-06 13:16:33 +02:00
jonathan keinan	7a40fd630d	* fix comments	2025-02-06 13:16:33 +02:00
jonathan keinan	b9361ad5fe	* only use new api if override-default was provided as an argument	2025-02-06 13:16:33 +02:00
kei-nan	f164012c19	Update tests/unit/moduleapi/moduleconfigs.tcl Co-authored-by: nafraf <nafraf@users.noreply.github.com>	2025-02-06 13:16:33 +02:00
jonathan keinan	49455c43ae	* change foo to goo so test will be correct and pass	2025-02-06 13:16:33 +02:00
jonathan keinan	c2694fb696	* change config value in test to be different than overwritten value	2025-02-06 13:16:33 +02:00
jonathan keinan	f7353db7eb	* fix test * cleanup strval2 on if an error during the OnLoad was encountered.	2025-02-06 13:16:33 +02:00
jonathan keinan	294492dbf2	* fix tests * add some logging to test module	2025-02-06 13:16:33 +02:00
jonathan keinan	fd5c325886	* initial commit	2025-02-06 13:16:33 +02:00
YaacovHazan	0aeb86d78d	Revert "Improve GETRANGE command behavior (#12272 )" Although the commit #6ceadfb58 improves GETRANGE command behavior, we can't accept it as we should avoid breaking changes for non-critical bug fixes. This reverts commit `6ceadfb580`.	2025-02-05 20:49:42 +02:00
YaacovHazan	8afb72a326	Revert "improve performance for scan command when matching data type (#12395 )" Although the commit #7f0a7f0a6 improves the performance of the SCAN command, we can't accept it as we should avoid breaking changes for non-critical bug fixes. This reverts commit `7f0a7f0a69`.	2025-02-05 20:49:42 +02:00
Raz Monsonego	04589f90d7	Add internal connection and command mechanism (#13740 ) # PR: Add Mechanism for Internal Commands and Connections in Redis This PR introduces a mechanism to handle internal commands and connections in Redis. It includes enhancements for command registration, internal authentication, and observability. ## Key Features 1. Internal Command Flag: - Introduced a new module command registration flag: `internal`. - Commands marked with `internal` can only be executed by internal connections, AOF loading flows, and master-replica connections. - For any other connection, these commands will appear as non-existent. 2. Support for internal authentication added to `AUTH`: - Used by depicting the special username `internal connection` with the right internal password, i.e.,: `AUTH "internal connection" <internal_secret>`. - No user-defined ACL username can have this name, since spaces are not aloud in the ACL parser. - Allows connections to authenticate as internal connections. - Authenticated internal connections can execute internal commands successfully. 4. Module API for Internal Secret: - Added the `RedisModule_GetInternalSecret()` API, that exposes the internal secret that should be used as the password for the new `AUTH "internal connection" <password>` command. - This API enables the modules to authenticate against other shards as local connections. ## Notes on Behavior - ACL validation: - Commands dispatched by internal connections bypass ACL validation, to give the caller full access regardless of the user with which it is connected. - Command Visibility: - Internal commands do not appear in `COMMAND <subcommand>` and `MONITOR` for non-internal connections. - Internal commands are logged in the slow log, latency report and commands' statistics to maintain observability. - `RM_Call()` Updates: - Non-internal connections: - Cannot execute internal commands when the command is sent with the `C` flag (otherwise can). - Internal connections bypass ACL validations (i.e., run as the unrestricted user). - Internal commands' success: - Internal commands succeed upon being sent from either an internal connection (i.e., authenticated via the new `AUTH "internal connection" <internal_secret>` API), an AOF loading process, or from a master via the replication link. Any other connections that attempt to execute an internal command fail with the `unknown command` error message raised. - `CLIENT LIST` flags: - Added the `I` flag, to indicate that the connection is internal. - Lua Scripts: - Prevented internal commands from being executed via Lua scripts. --------- Co-authored-by: Meir Shpilraien <meir@redis.com>	2025-02-05 11:48:08 +02:00
Ozan Tezcan	09f8a2f374	Start AOFRW before streaming repl buffer during fullsync (#13758 ) During fullsync, before loading RDB on the replica, we stop aof child to prevent copy-on-write disaster. Once rdb is loaded, aof is started again and it will trigger aof rewrite. With https://github.com/redis/redis/pull/13732 , for rdbchannel replication, this behavior was changed. Currently, we start aof after replication buffer is streamed to db. This PR changes it back to start aof just after rdb is loaded (before repl buffer is streamed) Both approaches may have pros and cons. If we start aof before streaming repl buffers, we may still face with copy-on-write issues as repl buffers potentially include large amount of changes. If we wait until replication buffer drained, it means we are delaying starting aof persistence. Additional changes are introduced as part of this PR: - Interface change: Added `mem_replica_full_sync_buffer` field to the `INFO MEMORY` command reply. During full sync, it shows total memory consumed by accumulated replication stream buffer on replica. Added same metric to `MEMORY STATS` command reply as `replica.fullsync.buffer` field. - Fixes: - Count repl stream buffer size of replica as part of 'memory overhead' calculation for fields in "INFO MEMORY" and "MEMORY STATS" outputs. Before this PR, repl buffer was not counted as part of memory overhead calculation, causing misreports for fields like `used_memory_overhead` and `used_memory_dataset` in "INFO STATS" and for `overhead.total` field in "MEMORY STATS" command reply. - Dismiss replication stream buffers memory of replica in the fork to reduce COW impact during a fork. - Fixed a few time sensitive flaky tests, deleted a noop statement, fixed some comments and fail messages in rdbchannel tests.	2025-02-04 21:40:18 +03:00
Meir Shpilraien (Spielrein)	870b6bd487	Added a shared secret over Redis cluster. (#13763 ) The PR introduces a new shared secret that is shared over all the nodes on the Redis cluster. The main idea is to leverage the cluster bus to share a secret between all the nodes such that later the nodes will be able to authenticate using this secret and send internal commands to each other (see #13740 for more information about internal commands). The way the shared secret is chosen is the following: 1. Each node, when start, randomly generate its own internal secret. 2. Each node share its internal secret over the cluster ping messages. 3. If a node gets a ping message with secret smaller then his current secret, it embrace it. 4. Eventually all nodes should embrace the minimal secret The converges of the secret is as good as the topology converges. To extend the ping messages to contain the secret, we leverage the extension mechanism. Nodes that runs an older Redis version will just ignore those extensions. Specific tests were added to verify that eventually all nodes see the secrets. In addition, a verification was added to the test infra to verify the secret on `cluster_config_consistent` and to `assert_cluster_state`.	2025-02-03 09:54:37 +02:00
Raz Monsonego	c688537d49	Add flag for ability of a module context to execute debug commands (#13774 ) This PR adds a flag to the `RM_GetContextFlags` module-API function that depicts whether the context may execute debug commands, according to redis's standards.	2025-02-03 09:52:41 +02:00
debing.sun	f86575f210	Gradually reduce defrag CPU usage when defragmentation is ineffective (#13752 ) This PR addresses an issue where if a module does not provide a defragmentation callback, we cannot defragment the fragmentation it generates. However, the defragmentation process still considers a large amount of fragmentation to be present, leading to more aggressive defragmentation efforts that ultimately have no effect. To mitigate this, the PR introduces a mechanism to gradually reduce the CPU consumption for defragmentation when the defragmentation effectiveness is poor. This occurs when the fragmentation rate drops below 2% and the hit ratio is less than 1%, or when the fragmentation rate increases by no more than 2%. The CPU consumption will be gradually decreased until it reaches the minimum threshold defined by `active-defrag-cycle-min`. --------- Co-authored-by: oranagra <oran@redislabs.com>	2025-01-24 11:35:32 +08:00
debing.sun	0f65806b5b	Update info.tcl test to revert client output limits sooner (#13738 ) This PR is based on: https://github.com/valkey-io/valkey/pull/1462 We set the client output buffer limits to 10 bytes, and then execute info stats which produces more than 10 bytes of output, which can cause that command to throw an error. I'm not sure why it wasn't consistently erroring before, might have been some change related to the ubuntu upgrade though. failed CI: https://github.com/redis/redis/actions/runs/12738281720/job/35500381299 ------ Co-authored-by: Madelyn Olson [madelyneolson@gmail.com](mailto:madelyneolson@gmail.com)	2025-01-14 17:30:18 +08:00
YaacovHazan	4a95b3005a	Fix Read/Write key pattern selector (CVE-2024-51741) The '%' rule must contain one or both of R/W	2025-01-13 21:20:19 +02:00
Ozan Tezcan	73a9b916c9	Rdb channel replication (#13732 ) This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 Motivation During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. Implementation steps 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. Some details - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. API changes in INFO output: 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` API changes in CLIENT LIST In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` Config changes: - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) Internal API changes: Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. Testing: As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. Replica state diagram ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>	2025-01-13 15:09:52 +03:00
debing.sun	21aee83abd	Fix issue with argv not being shrunk (#13698 ) Found by @ShooterIT ## Describe If a client first creates a command with a very large number of parameters, such as 10,000 parameters, the argv will be expanded to accommodate 10,000. If the subsequent commands have fewer than 10,000 parameters, this argv will continue to be reused and will never be shrunk. ## Solution When determining whether it is necessary to rebuild argv, if the length of the previous argv has already exceeded 1024, we will progressively create argv regardless. ## Free argv in cron Add a new condition to determine whether argv needs to be resized in cron. When the number of parameters exceeds 128, we will resize it regardless to avoid a single client consuming too much memory. It will now occupy a maximum of (128 * 8 bytes). --------- Co-authored-by: Yuan Wang <wangyuancode@163.com>	2025-01-08 16:12:52 +08:00
debing.sun	08d714d0e5	Fix crash due to cron argv release (#13725 ) Introduced by https://github.com/redis/redis/issues/13521 If the client argv was released due to a timeout before sending the complete command, `argv_len` will be reset to 0. When argv is parsed again and resized, requesting a length of 0 may result in argv being NULL, then leading to a crash. And fix a bug that `argv_len` is not updated correctly in `replaceClientCommandVector()`. --------- Co-authored-by: ShooterIT <wangyuancode@163.com> Co-authored-by: meiravgri <109056284+meiravgri@users.noreply.github.com>	2025-01-08 09:57:23 +08:00
raffertyyu	04f63d4af7	Fix index error of CRLF when replying with integer-encoded strings (#13711 ) close #13709 Fix the index error of CRLF character for integer-encoded strings in addReplyBulk function --------- Co-authored-by: debing.sun <debing.sun@redis.com>	2024-12-31 21:41:10 +08:00
Yuan Wang	64a40b20d9	Async IO Threads (#13695 ) ## Introduction Redis introduced IO Thread in 6.0, allowing IO threads to handle client request reading, command parsing and reply writing, thereby improving performance. The current IO thread implementation has a few drawbacks. - The main thread is blocked during IO thread read/write operations and must wait for all IO threads to complete their current tasks before it can continue execution. In other words, the entire process is synchronous. This prevents the efficient utilization of multi-core CPUs for parallel processing. - When the number of clients and requests increases moderately, it causes all IO threads to reach full CPU utilization due to the busy wait mechanism used by the IO threads. This makes it challenging for us to determine which part of Redis has reached its bottleneck. - When IO threads are enabled with TLS and io-threads-do-reads, a disconnection of a connection with pending data may result in it being assigned to multiple IO threads simultaneously. This can cause race conditions and trigger assertion failures. Related issue: redis#12540 Therefore, we designed an asynchronous IO threads solution. The IO threads adopt an event-driven model, with the main thread dedicated to command processing, meanwhile, the IO threads handle client read and write operations in parallel. ## Implementation ### Overall As before, we did not change the fact that all client commands must be executed on the main thread, because Redis was originally designed to be single-threaded, and processing commands in a multi-threaded manner would inevitably introduce numerous race and synchronization issues. But now each IO thread has independent event loop, therefore, IO threads can use a multiplexing approach to handle client read and write operations, eliminating the CPU overhead caused by busy-waiting. the execution process can be briefly described as follows: the main thread assigns clients to IO threads after accepting connections, IO threads will notify the main thread when clients finish reading and parsing queries, then the main thread processes queries from IO threads and generates replies, IO threads handle writing reply to clients after receiving clients list from main thread, and then continue to handle client read and write events. ### Each IO thread has independent event loop We now assign each IO thread its own event loop. This approach eliminates the need for the main thread to perform the costly `epoll_wait` operation for handling connections (except for specific ones). Instead, the main thread processes requests from the IO threads and hands them back once completed, fully offloading read and write events to the IO threads. Additionally, all TLS operations, including handling pending data, have been moved entirely to the IO threads. This resolves the issue where io-threads-do-reads could not be used with TLS. ### Event-notified client queue To facilitate communication between the IO threads and the main thread, we designed an event-notified client queue. Each IO thread and the main thread have two such queues to store clients waiting to be processed. These queues are also integrated with the event loop to enable handling. We use pthread_mutex to ensure the safety of queue operations, as well as data visibility and ordering, and race conditions are minimized, as each IO thread and the main thread operate on independent queues, avoiding thread suspension due to lock contention. And we implemented an event notifier based on `eventfd` or `pipe` to support event-driven handling. ### Thread safety Since the main thread and IO threads can execute in parallel, we must handle data race issues carefully. client->flags The primary tasks of IO threads are reading and writing, i.e. `readQueryFromClient` and `writeToClient`. However, IO threads and the main thread may concurrently modify or access `client->flags`, leading to potential race conditions. To address this, we introduced an io-flags variable to record operations performed by IO threads, thereby avoiding race conditions on `client->flags`. Pause IO thread In the main thread, we may want to operate data of IO threads, maybe uninstall event handler, access or operate query/output buffer or resize event loop, we need a clean and safe context to do that. We pause IO thread in `IOThreadBeforeSleep`, do some jobs and then resume it. To avoid thread suspended, we use busy waiting to confirm the target status. Besides we use atomic variable to make sure memory visibility and ordering. We introduce these functions to pause/resume IO Threads as below. ``` pauseIOThread, resumeIOThread pauseAllIOThreads, resumeAllIOThreads pauseIOThreadsRange, resumeIOThreadsRange ``` Testing has shown that `pauseIOThread` is highly efficient, allowing the main thread to execute nearly 200,000 operations per second during stress tests. Similarly, `pauseAllIOThreads` with 8 IO threads can handle up to nearly 56,000 operations per second. But operations performed between pausing and resuming IO threads must be quick; otherwise, they could cause the IO threads to reach full CPU utilization. freeClient and freeClientAsync The main thread may need to terminate a client currently running on an IO thread, for example, due to ACL rule changes, reaching the output buffer limit, or evicting a client. In such cases, we need to pause the IO thread to safely operate on the client. maxclients and maxmemory-clients updating When adjusting `maxclients`, we need to resize the event loop for all IO threads. Similarly, when modifying `maxmemory-clients`, we need to traverse all clients to calculate their memory usage. To ensure safe operations, we pause all IO threads during these adjustments. Client info reading The main thread may need to read a client’s fields to generate a descriptive string, such as for the `CLIENT LIST` command or logging purposes. In such cases, we need to pause the IO thread handling that client. If information for all clients needs to be displayed, all IO threads must be paused. Tracking redirect Redis supports the tracking feature and can even send invalidation messages to a connection with a specified ID. But the target client may be running on IO thread, directly manipulating the client’s output buffer is not thread-safe, and the IO thread may not be aware that the client requires a response. In such cases, we pause the IO thread handling the client, modify the output buffer, and install a write event handler to ensure proper handling. clientsCron In the `clientsCron` function, the main thread needs to traverse all clients to perform operations such as timeout checks, verifying whether they have reached the soft output buffer limit, resizing the output/query buffer, or updating memory usage. To safely operate on a client, the IO thread handling that client must be paused. If we were to pause the IO thread for each client individually, the efficiency would be very low. Conversely, pausing all IO threads simultaneously would be costly, especially when there are many IO threads, as clientsCron is invoked relatively frequently. To address this, we adopted a batched approach for pausing IO threads. At most, 8 IO threads are paused at a time. The operations mentioned above are only performed on clients running in the paused IO threads, significantly reducing overhead while maintaining safety. ### Observability In the current design, the main thread always assigns clients to the IO thread with the least clients. To clearly observe the number of clients handled by each IO thread, we added the new section in INFO output. The `INFO THREADS` section can show the client count for each IO thread. ``` # Threads io_thread_0:clients=0 io_thread_1:clients=2 io_thread_2:clients=2 ``` Additionally, in the `CLIENT LIST` output, we also added a field to indicate the thread to which each client is assigned. `id=244 addr=127.0.0.1:41870 laddr=127.0.0.1:6379 ... resp=2 lib-name= lib-ver= io-thread=1` ## Trade-off ### Special Clients For certain special types of clients, keeping them running on IO threads would result in severe race issues that are difficult to resolve. Therefore, we chose not to offload these clients to the IO threads. For replica, monitor, subscribe, and tracking clients, main thread may directly write them a reply when conditions are met. Race issues are difficult to resolve, so we have them processed in the main thread. This includes the Lua debug clients as well, since we may operate connection directly. For blocking client, after the IO thread reads and parses a command and hands it over to the main thread, if the client is identified as a blocking type, it will be remained in the main thread. Once the blocking operation completes and the reply is generated, the client is transferred back to the IO thread to send the reply and wait for event triggers. ### Clients Eviction To support client eviction, it is necessary to update each client’s memory usage promptly during operations such as read, write, or command execution. However, when a client operates on an IO thread, it is not feasible to update the memory usage immediately due to the risk of data races. As a result, memory usage can only be updated either in the main thread while processing commands or in the `ClientsCron` periodically. The downside of this approach is that updates might experience a delay of up to one second, which could impact the precision of memory management for eviction. To avoid incorrectly evicting clients. We adopted a best-effort compensation solution, when we decide to eviction a client, we update its memory usage again before evicting, if the memory used by the client does not decrease or memory usage bucket is not changed, then we will evict it, otherwise, not evict it. However, we have not completely solved this problem. Due to the delay in memory usage updates, it may lead us to make incorrect decisions about the need to evict clients. ### Defragment In the majority of cases we do NOT use the data from argv directly in the db. 1. key names We store a copy that we allocate in the main thread, see `sdsdup()` in `dbAdd()`. 2. hash key and value We store key as hfield and store value as sds, see `hfieldNew()` and `sdsdup()` in `hashTypeSet()`. 3. other datatypes They don't even use SDS, so there is no reference issues. But in some cases client the data from argv may be retain by the main thread. As a result, during fragmentation cleanup, we need to move allocations from the IO thread’s arena to the main thread’s arena. We always allocate new memory in the main thread’s arena, but the memory released by IO threads may not yet have been reclaimed. This ultimately causes the fragmentation rate to be higher compared to creating and allocating entirely within a single thread. The following cases below will lead to memory allocated by the IO thread being kept by the main thread. 1. string related command: `append`, `getset`, `mset` and `set`. If `tryObjectEncoding()` does not change argv, we will keep it directly in the main thread, see the code in `tryObjectEncoding()`(specifically `trimStringObjectIfNeeded()`) 2. block related command. the key names will be kept in `c->db->blocking_keys`. 3. watch command the key names will be kept in `c->db->watched_keys`. 4. [s]subscribe command channel name will be kept in `serverPubSubChannels`. 5. script load command script will be kept in `server.lua_scripts`. 7. some module API: `RM_RetainString`, `RM_HoldString` Those issues will be handled in other PRs. ## Testing ### Functional Testing The commit with enabling IO Threads has passed all TCL tests, but we did some changes: Client query buffer: In the original code, when using a reusable query buffer, ownership of the query buffer would be released after the command was processed. However, with IO threads enabled, the client transitions from an IO thread to the main thread for processing. This causes the ownership release to occur earlier than the command execution. As a result, when IO threads are enabled, the client's information will never indicate that a shared query buffer is in use. Therefore, we skip the corresponding query buffer tests in this case. Defragment: Add a new defragmentation test to verify the effect of io threads on defragmentation. Command delay: For deferred clients in TCL tests, due to clients being assigned to different threads for execution, delays may occur. To address this, we introduced conditional waiting: the process proceeds to the next step only when the `client list` contains the corresponding commands. ### Sanitizer Testing The commit passed all TCL tests and reported no errors when compiled with the `fsanitizer=thread` and `fsanitizer=address` options enabled. But we made the following modifications: we suppressed the sanitizer warnings for clients with watched keys when updating `client->flags`, we think IO threads read `client->flags`, but never modify it or read the `CLIENT_DIRTY_CAS` bit, main thread just only modifies this bit, so there is no actual data race. ## Others ### IO thread number In the new multi-threaded design, the main thread is primarily focused on command processing to improve performance. Typically, the main thread does not handle regular client I/O operations but is responsible for clients such as replication and tracking clients. To avoid breaking changes, we still consider the main thread as the first IO thread. When the io-threads configuration is set to a low value (e.g., 2), performance does not show a significant improvement compared to a single-threaded setup for simple commands (such as SET or GET), as the main thread does not consume much CPU for these simple operations. This results in underutilized multi-core capacity. However, for more complex commands, having a low number of IO threads may still be beneficial. Therefore, it’s important to adjust the `io-threads` based on your own performance tests. Additionally, you can clearly monitor the CPU utilization of the main thread and IO threads using `top -H -p $redis_pid`. This allows you to easily identify where the bottleneck is. If the IO thread is the bottleneck, increasing the `io-threads` will improve performance. If the main thread is the bottleneck, the overall performance can only be scaled by increasing the number of shards or replicas. --------- Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: oranagra <oran@redislabs.com>	2024-12-23 14:16:40 +08:00
Nugine	684077682e	Fix bug in PFMERGE command (#13672 ) The bug was introduced in #13558 . When merging dense hll structures, `hllDenseCompress` writes to wrong location and the result will be zero. The unit tests didn't cover this case. This PR + fixes the bug + adds `PFDEBUG SIMD (ON\|OFF)` for unit tests + adds a new TCL test to cover the cases Synchronized from https://github.com/valkey-io/valkey/pull/1293 --------- Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn> Co-authored-by: debing.sun <debing.sun@redis.com>	2024-12-18 14:41:04 +08:00
Moti Cohen	c51c96656b	modules API: Add test for ACL check of empty prefix (#13678 ) - Add empty string test for the new API `RedisModule_ACLCheckKeyPrefixPermissions`. - Fix order of checks: `(pattern[patternLen - 1] != '*' \|\| patternLen == 0)` --------- Co-authored-by: debing.sun <debing.sun@redis.com>	2024-12-10 09:16:30 +02:00
Moti Cohen	0dd057222b	Modules API: new HashFieldMinExpire(). Add flag REDISMODULE_HASH_EXPIRE_TIME to HashGet(). (#13676 ) This PR introduces API to query Expiration time of hash fields. # New `RedisModule_HashFieldMinExpire()` For a given hash, retrieves the minimum expiration time across all fields. If no fields have expiration or if the key is not a hash then return `REDISMODULE_NO_EXPIRE` (-1). ``` mstime_t RM_HashFieldMinExpire(RedisModuleKey *hash); ``` # Extension to `RedisModule_HashGet()` Adds a new flag, `REDISMODULE_HASH_EXPIRE_TIME`, to retrieve the expiration time of a specific hash field. If the field does not exist or has no expiration, returns `REDISMODULE_NO_EXPIRE`. It is fully backward-compatible (RM_HashGet retains its original behavior unless the new flag is used). Example: ``` mstime_t expiry1, expiry2; RedisModule_HashGet(mykey, REDISMODULE_HASH_EXPIRE_TIME, "field1", &expiry1, NULL); RedisModule_HashGet(mykey, REDISMODULE_HASH_EXPIRE_TIME, "field1", &expiry1, "field2", &expiry2, NULL); ```	2024-12-05 11:14:52 +02:00
Moti Cohen	06b144aa09	Modules API: Add RedisModule_ACLCheckKeyPrefixPermissions (#13666 ) This PR introduces a new API function to the Redis Module API: ``` int RedisModule_ACLCheckKeyPrefixPermissions(RedisModuleUser user, RedisModuleString prefix, int flags); ``` Purpose: The function checks if a given user has access permissions to any key that match a specific prefix. This validation is based on the user’s ACL permissions and the specified flags. Note, this prefix-based approach API may fail to detect prefixes that are individually uncovered but collectively covered by the patterns. For example the prefix `ID-` is not fully included in pattern `ID-[0]` and is not fully included in pattern `ID-[^0]` but it is fully included in the set of patterns `{ID-[0], ID-[^0]*}`	2024-11-28 18:33:58 +02:00
Ozan Tezcan	9ebf80a28c	Fix memory leak of jemalloc tcache on function flush command (#13661 ) Starting from https://github.com/redis/redis/pull/13133, we allocate a jemalloc thread cache and use it for lua vm. On certain cases, like `script flush` or `function flush` command, we free the existing thread cache and create a new one. Though, for `function flush`, we were not actually destroying the existing thread cache itself. Each call creates a new thread cache on jemalloc and we leak the previous thread cache instances. Jemalloc allows maximum 4096 thread cache instances. If we reach this limit, Redis prints "Failed creating the lua jemalloc tcache" log and abort. There are other cases that can cause this memory leak, including replication scenarios when emptyData() is called. The implication is that it looks like redis `used_memory` is low, but `allocator_allocated` and RSS remain high. Co-authored-by: debing.sun <debing.sun@redis.com>	2024-11-21 14:12:58 +03:00
Moti Cohen	155634502d	modules API: Support register unprefixed config parameters (#13656 ) PR #10285 introduced support for modules to register four types of configurations — Bool, Numeric, String, and Enum. Accessible through the Redis config file and the CONFIG command. With this PR, it will be possible to register configuration parameters without automatically prefixing the parameter names. This provides greater flexibility in configuration naming, enabling, for instance, both `bf-initial-size` or `initial-size` to be defined in the module without automatically prefixing with `<MODULE-NAME>.`. In addition it will also be possible to create a single additional alias via the same API. This brings us another step closer to integrate modules into redis core. Example: Register a configuration parameter `bf-initial-size` with an alias `initial-size` without the automatic module name prefix, set with new `REDISMODULE_CONFIG_UNPREFIXED` flag: ``` RedisModule_RegisterBoolConfig(ctx, "bf-initial-size\|initial-size", default_val, optflags \| REDISMODULE_CONFIG_UNPREFIXED, getfn, setfn, applyfn, privdata); ``` # API changes Related functions that now support unprefixed configuration flag (`REDISMODULE_CONFIG_UNPREFIXED`) along with optional alias: ``` RedisModule_RegisterBoolConfig RedisModule_RegisterEnumConfig RedisModule_RegisterNumericConfig RedisModule_RegisterStringConfig ``` # Implementation Details: `config.c`: On load server configuration, at function `loadServerConfigFromString()`, it collects all unknown configurations into `module_configs_queue` dictionary. These may include valid module configurations or invalid ones. They will be validated later by `loadModuleConfigs()` against the configurations declared by the loaded module(s). `Module.c:` The `ModuleConfig` structure has been modified to store now: (1) Full configuration name (2) Alias (3) Unprefixed flag status - ensuring that configurations retain their original registration format when triggered in notifications. Added error printout: This change introduces an error printout for unresolved configurations, detailing each unresolved parameter detected during startup. The last line in the output existed prior to this change and has been retained to systems relies on it: ``` 595011:M 18 Nov 2024 08:26:23.616 # Unresolved Configuration(s) Detected: 595011:M 18 Nov 2024 08:26:23.616 # >>> 'bf-initiel-size 8' 595011:M 18 Nov 2024 08:26:23.616 # >>> 'search-sizex 32' 595011:M 18 Nov 2024 08:26:23.616 # Module Configuration detected without loadmodule directive or no ApplyConfig call: aborting ``` # Backward Compatibility: Existing modules will function without modification, as the new functionality only applies if REDISMODULE_CONFIG_UNPREFIXED is explicitly set. # Module vs. Core API Conflict Behavior The new API allows to modules loading duplication of same configuration name or same configuration alias, just like redis core configuration allows (i.e. the users sets two configs with a different value, but these two configs are actually the same one). Unlike redis core, given a name and its alias, it doesn't allow have both configuration on load. To implement it, it is required to modify DS `module_configs_queue` to reflect the order of their loading and later on, during `loadModuleConfigs()`, resolve pairs of names and aliases and which one is the last one to apply. "Relaxing" this limitation can be deferred to a future update if necessary, but for now, we error in this case.	2024-11-21 09:55:02 +02:00

1 2 3 4 5 ...

2358 Commits