Commit Graph

22 Commits

Author SHA1 Message Date
Jim Brunner 1b5f245eae
Refactor of LFU/LRU code for modularity (#2857)
General cleanup on LRU/LFU code. Improve modularity and maintainability.

Specifically:
* Consolidates the mathematical logic for LRU/LFU into `lrulfu.c`, with
an API in `lrulfu.h`. Knowledge of the LRU/LFU implementation was
previously spread out across `db.c`, `evict.c`, `object.c`, `server.c`,
and `server.h`.
* Separates knowledge of the LRU from knowledge of the object containing
the LRU value. `lrulfu.c` knows about the LRU/LFU algorithms, without
knowing about the `robj`. `object.c` knows about the `robj` without
knowing about the details of the LRU/LFU algorithms.
* Eliminates `server.lruclock`, instead using `server.unixtime`. This
also eliminates the periodic need to call `mstime()` to maintain the LRU
clock.
* Fixes a minor computation bug in the old `LFUTimeElapsed` function
(off by 1 after rollover).
* Eliminates specific IF checks for rollover, relying on the defined
behavior of unsigned rollover instead.
* Fixes a bug in `debug.c` which would perform LFU modification on an
LRU value.
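
A minimal sketch of the unsigned-rollover idea (illustrative only, not the actual Valkey code; the 16-bit clock width here is an assumption):

```c
#include <assert.h>
#include <stdint.h>

/* With an unsigned clock, elapsed time across rollover falls out of
 * well-defined modular arithmetic -- no explicit IF check is needed. */
typedef uint16_t lru_clock_t;

static lru_clock_t lru_elapsed(lru_clock_t now, lru_clock_t then) {
    /* Unsigned subtraction wraps modulo 2^16, so even when `now` has
     * rolled over past `then`, the distance comes out correct. */
    return (lru_clock_t)(now - then);
}
```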

---------

Signed-off-by: Jim Brunner <brunnerj@amazon.com>
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
2025-12-02 10:14:33 -08:00
Zhijun Liao 3fd0942279
Refactor TCL reference to support running tests with CMake (#2816)
Historically, Valkey’s TCL test suite expected all binaries
(src/valkey-server, src/valkey-cli, src/valkey-benchmark, etc.) to exist
under the src/ directory. This PR enables Valkey TCL tests to run
seamlessly after a CMake build — no manual symlinks or make build
required.

The test framework accepts a new environment variable `VALKEY_BIN_DIR`
to look for the binaries.

CMake copies all TCL test entrypoints (runtest, runtest-cluster,
etc.) into the CMake build dir (e.g. `cmake-build-debug`) and inserts
`VALKEY_BIN_DIR` into them. We can now run all tests either with
./cmake-build-debug/runtest from the project root or with ./runtest
from the CMake dir.

A new CMake post-build target prints a friendly reminder after
successful builds, guiding developers on how to run tests with their
CMake binaries:

```
Hint: It is a good idea to run tests with your CMake-built binaries ;)
      ./cmake-build-debug/runtest

Build finished
```

A helper TCL script `tests/support/set_executable_path.tcl` is added to
support this change, which gets called by all test entrypoints:
`runtest`, `runtest-cluster`, `runtest-sentinel`.

---------

Signed-off-by: Zhijun <dszhijun@gmail.com>
2025-12-02 14:14:28 +01:00
Madelyn Olson a9a65abc85
Implement a lolwut for version 9 (#2646)
As requested, here is a version of lolwut for 9 that visualizes a Julia
set with ASCII art.

Example:
```
127.0.0.1:6379> lolwut version 9
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                     .............                              
                                 ......................                         
                              ............................                      
                           ......:::--:::::::::::::::.......                    
                         .....:::=+*@@@=--------=+===--::....                   
                        ....:::-+@@@@&*+=====+%@@@@@@@@=-::....                 
                      .....:::-=+@@@@@%%*++*@@@@@@@@@&*=--::....                
                     .....::--=++#@@@@@@@@##@@@@@@@@@@@@@@=::....               
                    ......:-=@&#&@@@@@@@@@@@@@@@@@@@@@@@@@%-::...               
                   ......::-+@@@@@@@@@@@@@@@@@@&&@@@#%#&@@@-::....              
                  .......::-=+%@@@@@@@@@@@@@@@@#%%*+++++%@+-:.....              
                  .......::-=@&@@@@@@@@@@@@@@@@&*++=====---::.....              
                 .......:::--*@@@@@@@@@@@@@@@@@%++===----::::.....              
                ........::::-=+*%&@@@@@@@@@&&&%*+==----:::::......              
                ........::::--=+@@@@@@@@@@&##%*++==---:::::.......              
                .......:::::---=+#@@@@@@@@&&&#%*+==---:::::.......              
               ........:::::---=++*%%#&&@@@@@@@@@+=---::::........              
               .......:::::----=++*%##&@@@@@@@@@@%+=--::::.......               
               ......::::-----==++#@@@@@@@@@@@@@&%*+=-:::........               
               ......:::---====++*@@@@@@@@@@@@@@@@@@+-:::.......                
               .....::-=++==+++**%@@@@@@@@@@@@@@@@#*=--::.......                
                ....:-%@@%****%###&@@@@@@@@@@@@@@@@&+--:.......                 
                ....:-=@@@@@&@@@@@@@@@@@@@@@@@@@@@@@@=::......                  
                 ...::+@@@@@@@@@@@@@@@&&@@@@@@@@%**@+-::.....                   
                 ....::-=+%#@@@@@@@@@&%%%&@@@@@@*==-:::.....                    
                  ....::--+%@@@@@@@%++==++*#@@@@&=-:::....                      
                   ....:::-*@**@@+==----==*%@@@@+-:::....                       
                     .....:::---::::::::--=+@=--::.....                         
                       .........::::::::::::::.......                           
                         .........................                              
                             ..................                                 
                                    ...                                         
                                                                                
                                                                                
                                                                                
Ascii representation of Julia set with constant 0.41 + 0.29i
Don't forget to have fun! Valkey ver. 255.255.255
```

You can pass in arbitrary rows and columns (it looks best when rows is 2x
the number of columns) and an arbitrary Julia constant, so output is
repeatable. In the worst case it takes about 100us on my M2 MacBook, so
it shouldn't consume too many system resources.
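
For reference, the core of such a renderer is just the escape-time iteration z → z² + c; this is a hedged sketch of the general technique (function names and palette are made up, not the actual lolwut code):

```c
#include <assert.h>

/* Count iterations of z -> z^2 + c until |z|^2 exceeds 4 (escape),
 * capped at `max`; the count is then mapped to an ASCII shade. */
static int julia_iters(double zr, double zi, double cr, double ci, int max) {
    int n;
    for (n = 0; n < max && zr * zr + zi * zi <= 4.0; n++) {
        double tmp = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = tmp;
    }
    return n;
}

static char julia_shade(int n, int max) {
    static const char palette[] = " .:-=+*#%@";
    return palette[n * 9 / max]; /* deeper iteration -> denser glyph */
}
```

Rendering is then a double loop over the character grid, mapping each cell to a point in the complex plane and printing `julia_shade(julia_iters(x, y, 0.41, 0.29, max), max)`.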

---------

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2025-09-30 08:25:53 +02:00
Björn Svensson 2db4eeb1fc Remove temporary build correction for RDMA and libvalkey 0.1.0
Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
2025-08-21 12:35:10 +02:00
Jacob Murphy d7993b78d8
Introduce atomic slot migration (#1949)
Introduces a new family of commands for migrating slots via replication.
The procedure is driven by the source node which pushes an AOF formatted
snapshot of the slots to the target, followed by a replication stream of
changes on that slot (a la manual failover).

This solution is an adaptation of the solution provided by
@enjoy-binbin, combined with the solution I previously posted at #1591,
modified to meet the designs we had outlined in #23.

## New commands

* `CLUSTER MIGRATESLOTS SLOTSRANGE start end [start end]... NODE
node-id`: Begin sending the slot via replication to the target. Multiple
targets can be specified by repeating `SLOTSRANGE ... NODE ...`
* `CLUSTER CANCELMIGRATION ALL`: Cancel all slot migrations
* `CLUSTER GETSLOTMIGRATIONS`: See a recent log of migrations

This PR only implements "one shot" semantics with an asynchronous model.
Later, "two phase" (e.g. slot level replicate/failover commands) can be
added with the same core.

## Slot migration jobs

Introduces the concept of a slot migration job. While active, a job
tracks a connection created by the source to the target over which the
contents of the slots are sent. This connection is used for control
messages as well as replicated slot data. Each job is given a 40
character random name to help uniquely identify it.

All jobs, including those that finished recently, can be observed using
the `CLUSTER GETSLOTMIGRATIONS` command.

## Replication

* Since the snapshot uses AOF, the snapshot can be replayed verbatim to
any replicas of the target node.
* We use the same proxying mechanism used for chaining replication to
copy the content sent by the source node directly to the replica nodes.

## `CLUSTER SYNCSLOTS`

To coordinate the state machine transitions across the two nodes, a new
command is added, `CLUSTER SYNCSLOTS`, that performs this control flow.

Each end of the slot migration connection is expected to install a read
handler in order to handle `CLUSTER SYNCSLOTS` commands:

* `ESTABLISH`: Begins a slot migration. Provides slot migration
information to the target and authorizes the connection to write to
unowned slots.
* `SNAPSHOT-EOF`: Appended to the end of the snapshot to signal that the
snapshot is done being written to the target.
* `PAUSE`: Informs the source node to pause whenever it gets the
opportunity.
* `PAUSED`: Added to the end of the client output buffer when the pause
is performed. The pause is only performed after the buffer shrinks below
a configurable size.
* `REQUEST-FAILOVER`: Requests the source to either grant or deny a
failover for the slot migration. The failover is only granted if the
target is still paused. Once a failover is granted, the pause is
refreshed for a short duration.
* `FAILOVER-GRANTED`: Sent to the target to inform it that
REQUEST-FAILOVER was granted.
* `ACK`: Heartbeat command used to ensure liveness.

## Interactions with other commands

* FLUSHDB on the source node (which flushes the migrating slot) will
result in the source dropping the connection, which will flush the slot
on the target and reset the state machine back to the beginning. The
subsequent retry should succeed very quickly (the slot is now empty)
* FLUSHDB on the target will fail the slot migration. We can iterate
with better handling, but for now it is expected that the operator would
retry.
* Generally, FLUSHDB is expected to be executed cluster wide, so
preserving partially migrated slots doesn't make much sense
* SCAN and KEYS are filtered to avoid exposing importing slot data

## Error handling

* For any transient connection drops, the migration will be failed and
require the user to retry.
* If there is an OOM while reading from the import connection, we will
fail the import, which will drop the importing slot data
* If there is a client output buffer limit reached on the source node,
it will drop the connection, which will cause the migration to fail
* If at any point the export loses ownership or either node is failed
over, a callback will be triggered on both ends of the migration to fail
the import. The import will not reattempt with a new owner
* The two ends of the migration are routinely pinging each other with
SYNCSLOTS ACK messages. If at any point there is no interaction on the
connection for longer than `repl-timeout`, the connection will be
dropped, resulting in migration failure
* If a failover happens, we will drop keys in all unowned slots. The
migration does not persist through failovers and would need to be
retried on the new source/target.

## State machine

```
                                                                            
                Target/Importing Node State Machine                         
   ─────────────────────────────────────────────────────────────            
                                                                            
             ┌────────────────────┐
             │SLOT_IMPORT_WAIT_ACK┼──────┐
             └──────────┬─────────┘      │
                     ACK│                │
         ┌──────────────▼─────────────┐  │
         │SLOT_IMPORT_RECEIVE_SNAPSHOT┼──┤
         └──────────────┬─────────────┘  │
            SNAPSHOT-EOF│                │                                  
        ┌───────────────▼──────────────┐ │                                  
        │SLOT_IMPORT_WAITING_FOR_PAUSED┼─┤                                  
        └───────────────┬──────────────┘ │                                  
                  PAUSED│                │                                  
        ┌───────────────▼──────────────┐ │ Error Conditions:                
        │SLOT_IMPORT_FAILOVER_REQUESTED┼─┤  1. OOM                          
        └───────────────┬──────────────┘ │  2. Slot Ownership Change        
        FAILOVER-GRANTED│                │  3. Demotion to replica          
         ┌──────────────▼─────────────┐  │  4. FLUSHDB                      
         │SLOT_IMPORT_FAILOVER_GRANTED┼──┤  5. Connection Lost              
         └──────────────┬─────────────┘  │  6. No ACK from source (timeout) 
      Takeover Performed│                │                                  
         ┌──────────────▼───────────┐    │                                  
         │SLOT_MIGRATION_JOB_SUCCESS┼────┤                                  
         └──────────────────────────┘    │                                  
                                         │                                  
   ┌─────────────────────────────────────▼─┐                                
   │SLOT_IMPORT_FINISHED_WAITING_TO_CLEANUP│                                
   └────────────────────┬──────────────────┘                                
Unowned Slots Cleaned Up│                                                   
          ┌─────────────▼───────────┐                                      
          │SLOT_MIGRATION_JOB_FAILED│                                      
          └─────────────────────────┘                                      

                                                                                           
                                                                                           
                      Source/Exporting Node State Machine                                  
         ─────────────────────────────────────────────────────────────                     
                                                                                           
               ┌──────────────────────┐                                                    
               │SLOT_EXPORT_CONNECTING├─────────┐                                          
               └───────────┬──────────┘         │                                          
                  Connected│                    │                                          
             ┌─────────────▼────────────┐       │                                          
             │SLOT_EXPORT_AUTHENTICATING┼───────┤                                          
             └─────────────┬────────────┘       │                                          
              Authenticated│                    │                                          
             ┌─────────────▼────────────┐       │                                          
             │SLOT_EXPORT_SEND_ESTABLISH┼───────┤                                          
             └─────────────┬────────────┘       │                                          
  ESTABLISH command written│                    │                                          
     ┌─────────────────────▼─────────────┐      │                                          
     │SLOT_EXPORT_READ_ESTABLISH_RESPONSE┼──────┤                                          
     └─────────────────────┬─────────────┘      │                                          
   Full response read (+OK)│                    │                                          
          ┌────────────────▼──────────────┐     │ Error Conditions:                        
          │SLOT_EXPORT_WAITING_TO_SNAPSHOT┼─────┤  1. User sends CANCELMIGRATION           
          └────────────────┬──────────────┘     │  2. Slot ownership change                
     No other child process│                    │  3. Demotion to replica                  
              ┌────────────▼───────────┐        │  4. FLUSHDB                              
              │SLOT_EXPORT_SNAPSHOTTING┼────────┤  5. Connection Lost                      
              └────────────┬───────────┘        │  6. AUTH failed                          
              Snapshot done│                    │  7. ERR from ESTABLISH command           
               ┌───────────▼─────────┐          │  8. Unpaused before failover completed   
               │SLOT_EXPORT_STREAMING┼──────────┤  9. Snapshot failed (e.g. Child OOM)     
               └───────────┬─────────┘          │  10. No ack from target (timeout)        
                      PAUSE│                    │  11. Client output buffer overrun        
            ┌──────────────▼─────────────┐      │                                          
            │SLOT_EXPORT_WAITING_TO_PAUSE┼──────┤                                          
            └──────────────┬─────────────┘      │                                          
             Buffer drained│                    │                                          
            ┌──────────────▼────────────┐       │                                          
            │SLOT_EXPORT_FAILOVER_PAUSED┼───────┤                                          
            └──────────────┬────────────┘       │                                          
   Failover request granted│                    │                                          
           ┌───────────────▼────────────┐       │                                          
           │SLOT_EXPORT_FAILOVER_GRANTED┼───────┤                                          
           └───────────────┬────────────┘       │                                          
      New topology received│                    │                                          
            ┌──────────────▼───────────┐        │                                          
            │SLOT_MIGRATION_JOB_SUCCESS│        │                                          
            └──────────────────────────┘        │                                          
                                                │                                          
            ┌─────────────────────────┐         │                                          
            │SLOT_MIGRATION_JOB_FAILED│◄────────┤                                          
            └─────────────────────────┘         │                                          
                                                │                                          
           ┌────────────────────────────┐       │                                          
           │SLOT_MIGRATION_JOB_CANCELLED│◄──────┘                                          
           └────────────────────────────┘                                                 
```
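
As an illustrative sketch (not the actual implementation), the target-side happy path above can be expressed as a transition function keyed on the SYNCSLOTS control messages; the state and message names come from the diagram, the helper function is hypothetical:

```c
#include <assert.h>
#include <string.h>

/* States from the target/importing diagram above (happy path only). */
typedef enum {
    SLOT_IMPORT_WAIT_ACK,
    SLOT_IMPORT_RECEIVE_SNAPSHOT,
    SLOT_IMPORT_WAITING_FOR_PAUSED,
    SLOT_IMPORT_FAILOVER_REQUESTED,
    SLOT_IMPORT_FAILOVER_GRANTED,
    SLOT_IMPORT_FINISHED_WAITING_TO_CLEANUP, /* error path */
} slotImportState;

/* Advance on a SYNCSLOTS control message. Any of the listed error
 * conditions (OOM, FLUSHDB, lost connection, ...) instead jumps to
 * SLOT_IMPORT_FINISHED_WAITING_TO_CLEANUP, modeled here as the
 * fallthrough for any unexpected message. */
static slotImportState importNext(slotImportState s, const char *msg) {
    if (s == SLOT_IMPORT_WAIT_ACK && !strcmp(msg, "ACK"))
        return SLOT_IMPORT_RECEIVE_SNAPSHOT;
    if (s == SLOT_IMPORT_RECEIVE_SNAPSHOT && !strcmp(msg, "SNAPSHOT-EOF"))
        return SLOT_IMPORT_WAITING_FOR_PAUSED;
    if (s == SLOT_IMPORT_WAITING_FOR_PAUSED && !strcmp(msg, "PAUSED"))
        return SLOT_IMPORT_FAILOVER_REQUESTED;
    if (s == SLOT_IMPORT_FAILOVER_REQUESTED && !strcmp(msg, "FAILOVER-GRANTED"))
        return SLOT_IMPORT_FAILOVER_GRANTED;
    return SLOT_IMPORT_FINISHED_WAITING_TO_CLEANUP;
}
```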

Co-authored-by: Binbin <binloveplay1314@qq.com>

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2025-08-11 18:02:37 -07:00
Ran Shidlansik c3b1b0317d Introduce volatile set
-------------

   Overview:
   ---------
   This PR introduces a complete redesign of the 'vset' (stands for volatile set) data structure,
   creating an adaptive container for expiring entries. The new design is
   memory-efficient, scalable, and dynamically promotes/demotes its internal
   representation depending on runtime behavior and volume.

   The core concept uses a single tagged pointer (`expiry_buckets`) that encodes
   one of several internal structures:
       - NONE   (-1): Empty set
       - SINGLE (0x1): One entry
       - VECTOR (0x2): Sorted vector of entry pointers
       - HT     (0x4): Hash table for larger buckets with many entries
       - RAX    (0x6): Radix tree (keyed by aligned expiry timestamps)

   This allows the set to grow and shrink seamlessly while optimizing for both
   space and performance.
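
The tagged-pointer encoding can be sketched as follows (the tag values are the ones listed above and `VSET_PTR_MASK` appears later in this description; the helper names are assumptions):

```c
#include <assert.h>
#include <stdint.h>

/* Low 3 bits of the expiry_buckets pointer carry the bucket type. */
#define VSET_TAG_MASK 0x7UL
#define VSET_PTR_MASK (~VSET_TAG_MASK)

#define VSET_TAG_SINGLE 0x1UL
#define VSET_TAG_VECTOR 0x2UL
#define VSET_TAG_HT     0x4UL
#define VSET_TAG_RAX    0x6UL

static void *tagPtr(void *p, uintptr_t tag) {
    /* Pointers must have their low 3 bits free (aligned allocations). */
    return (void *)(((uintptr_t)p & VSET_PTR_MASK) | tag);
}

static uintptr_t ptrTag(void *p) {
    return (uintptr_t)p & VSET_TAG_MASK;
}

static void *untagPtr(void *p) {
    return (void *)((uintptr_t)p & VSET_PTR_MASK);
}
```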

   Motivation:
   -----------
   The previous design lacked flexibility in high-churn environments or
   workloads with skewed expiry distributions. This redesign enables dynamic
   layout adjustment based on the time distribution and volume of the inserted
   entries, while maintaining fast expiry checks and minimal memory overhead.

   Key Concepts:
   -------------
   - All pointers stored in the structure must be odd-aligned to preserve
     3 bits for tagging. This is safe with SDS strings (which set the LSB).

   - Buckets evolve automatically:
       - Start as NONE.
       - On first insert → become SINGLE.
       - If another entry with similar expiry → promote to VECTOR.
       - If VECTOR exceeds 127 entries → convert to RAX.
       - If a RAX bucket's vector fills and cannot split → promote to HT.

   - Each vector bucket is kept sorted by `entry->getExpiry()`.
   - Binary search is used for efficient insertion and splitting.

 # Coarse Buckets Expiration System for Hash Fields

 This PR introduces **coarse-grained expiration buckets** to support per-field
 expirations in hash types — a feature known as *volatile fields*.

 It enables scalable expiration tracking by grouping fields into time-aligned
 buckets instead of individually tracking exact timestamps.

 ## Motivation
 Valkey traditionally supports key-level expiration. However, in many applications,
 there's a strong need to expire individual fields within a hash (e.g., session keys,
 token caches, etc.).

 Tracking these at fine granularity is expensive and potentially unscalable, so
 this implementation introduces *bucketed expirations* to batch expirations together.

 ## Bucket Granularity and Timestamp Handling

 - Each expiration bucket represents a time slice of fixed width (e.g., 8192 ms).
 - Expiring fields are mapped to the **end** of a time slice (not the floor).
 - This design facilitates:
   - Efficient *splitting* of large buckets when needed
   - *Downgrading* buckets when fields permit tighter packing
   - Coalescing during lazy cleanup or memory pressure

 ### Example Calculation

 Suppose a field has an expiration time of `1690000123456` ms and the max bucket
 interval is 8192 ms:

 ```
 BUCKET_INTERVAL_MAX = 8192;
 expiry = 1690000123456;

 bucket_ts = (expiry & ~(BUCKET_INTERVAL_MAX - 1LL)) + BUCKET_INTERVAL_MAX;
           = (1690000123456 & ~8191) + 8192
           = 1690000121856 + 8192
           = 1690000130048
 ```

 The field is stored in a bucket that **ends at** `1690000130048` ms.
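
The same computation, as a runnable check (the macro name follows the example above):

```c
#include <assert.h>

#define BUCKET_INTERVAL_MAX 8192LL

/* Round the expiry down to its bucket floor, then move to the bucket
 * end; buckets are keyed by their end timestamp. */
static long long bucket_ts(long long expiry) {
    return (expiry & ~(BUCKET_INTERVAL_MAX - 1LL)) + BUCKET_INTERVAL_MAX;
}
```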

 ### Bucket Alignment Diagram

 ```
 Time (ms) →
 |----------------|----------------|----------------|
  8192 ms buckets → 1690000121856   1690000130048
                       ^               ^
                       |               |
                 expiry floor     assigned bucket end
 ```

 ## Bucket Placement Logic

 - If a suitable bucket **already exists** (i.e., its `end_ts > expiry`), the field is added.
 - If no bucket covers the `expiry`, a **new bucket** is created at the computed `end_ts`.

 ## Bucket Downgrade Conditions

 Buckets are downgraded to smaller intervals when overpopulated (>127 fields).
 This happens when **all fields fit into a tighter bucket**.

 Downgrade rule:
 ```
 (max_expiry & ~(BUCKET_INTERVAL_MIN - 1LL)) + BUCKET_INTERVAL_MIN < current_bucket_ts
 ```
 If the above holds, all fields can be moved to a tighter bucket interval.
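
As a runnable sketch of the rule (the 1024 ms `BUCKET_INTERVAL_MIN` matches the downgrade example in this description but is otherwise an assumption):

```c
#include <assert.h>
#include <stdbool.h>

#define BUCKET_INTERVAL_MIN 1024LL

/* All fields fit a tighter bucket when the tightest bucket end covering
 * the latest expiry still lies before the current bucket's end. */
static bool can_downgrade(long long max_expiry, long long current_bucket_ts) {
    return (max_expiry & ~(BUCKET_INTERVAL_MIN - 1LL)) + BUCKET_INTERVAL_MIN <
           current_bucket_ts;
}
```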

 ### Downgrade Bucket — Diagram

 ```
 Before downgrade:

   Current Bucket (8192 ms)
   |----------------------------------------|
   | Field A | Field B | Field C | Field D  |
   | exp=+30 | +200    | +500    | +1500    |
   |----------------------------------------|
                     ↑
        All expiries fall before tighter boundary

 After downgrade to 1024 ms:

   New Bucket (1024 ms)
   |--------------------|
   | A  | B  | C  | D   |
   |--------------------|
 ```

 ### Bucket Split Strategy

 If downgrade is not possible, the bucket is **split**:
 - Fields are sorted by expiration time.
 - A subset that fits in an earlier bucket is moved out.
 - Remaining fields stay in the original bucket.

 ### Split Bucket — Diagram

 ```
 Before split:

   Large Bucket (8192 ms)
   |--------------------------------------------------|
   | A | B | C | D | E | F | G | H | I | J | ... | Z  |
   |---------------- Sorted by expiry ---------------|
             ↑
      Fields A–L can be moved to an earlier bucket

 After split:

   Bucket 1 (end=1690000129024)     Bucket 2 (end=1690000130048)
   |------------------------|       |------------------------|
   | A | B | C | ... | L    |       | M | N | O | ... | Z    |
   |------------------------|       |------------------------|
 ```

 ## Summary of Bucket Behavior
 | Scenario                        | Action Taken                 |
 |--------------------------------|------------------------------|
 | No bucket covers expiry        | New bucket is created        |
 | Existing bucket fits           | Field is added               |
 | Bucket overflows (>127 fields) | Downgrade or split attempted |

API Changes:
------------

   Create/Free:
      void vsetInit(vset *set);
      void vsetClear(vset *set);

  Mutation:
      bool vsetAddEntry(vset *set, vsetGetExpiryFunc getExpiry, void *entry);
      bool vsetRemoveEntry(vset *set, vsetGetExpiryFunc getExpiry, void *entry);
      bool vsetUpdateEntry(vset *set, vsetGetExpiryFunc getExpiry, void *old_entry,
                                 void *new_entry, long long old_expiry,
                                 long long new_expiry);

  Expiry Retrieval:
      long long vsetEstimatedEarliestExpiry(vset *set, vsetGetExpiryFunc getExpiry);
      size_t vsetPopExpired(vset *set, vsetGetExpiryFunc getExpiry, vsetExpiryFunc expiryFunc, mstime_t now, size_t max_count, void *ctx);

  Utilities:
      bool vsetIsEmpty(vset *set);
      size_t vsetMemUsage(vset *set);

  Iteration:
      void vsetStart(vset *set, vsetIterator *it);
      bool vsetNext(vsetIterator *it, void **entryptr);
      void vsetStop(vsetIterator *it);

   Entry Requirements:
   -------------------
   All entries must conform to the following interface via `volatileEntryType`:

       sds entryGetKey(const void *entry);          // for deduplication
       long long getExpiry(const void *entry);      // used for bucketing
       int expire(void *db, void *o, void *entry);  // used for expiration callbacks

   Diagrams:
   ---------

   1. Tagged Pointer Representation
      -----------------------------
      Lower 3 bits of `expiry_buckets` encode bucket type:

         +------------------------------+
         |     pointer       | TAG (3b) |
         +------------------------------+
           ↑
           masked via VSET_PTR_MASK

         TAG values:
           0x1 → SINGLE
           0x2 → VECTOR
           0x4 → HT
           0x6 → RAX

   2. Evolution of the Bucket
      ------------------------

 *Volatile set top-level structure:*

```
+--------+     +--------+     +--------+     +--------+
| NONE   | --> | SINGLE | --> | VECTOR | --> |   RAX  |
+--------+     +--------+     +--------+     +--------+
```

*If the top-level element is a RAX, it has child buckets of type:*

```
+--------+     +--------+     +-----------+
| SINGLE | --> | VECTOR | --> | HASHTABLE |
+--------+     +--------+     +-----------+
```

*Vectors can split into multiple vectors and shrink into SINGLE buckets. A RAX with only one element is collapsed by replacing the RAX with its single element on the top level (except for HASHTABLE buckets which are not allowed on the top level).*

   3. RAX Structure with Expiry-Aligned Keys
      --------------------------------------
      Buckets in RAX are indexed by aligned expiry timestamps:

         +------------------------------+
         | RAX key (bucket_ts) → Bucket|
         +------------------------------+
         |     0x00000020  → VECTOR     |
         |     0x00000040  → VECTOR     |
         |     0x00000060  → HT         |
         +------------------------------+

   4. Bucket Splitting (Inside RAX)
      -----------------------------
      If a vector bucket in a RAX fills:
        - Binary search for best split point.
        - Use `getExpiry(entry)` + `get_bucket_ts()` to find transition.
        - Create 2 new buckets and update RAX.

         Original:
             [entry1, entry2, ..., entryN]  ← bucket_ts = 64ms

         After split:
             [entry1, ..., entryK]  → bucket_ts = 32ms
             [entryK+1, ..., entryN] → bucket_ts = 64ms

      If all entries share same bucket_ts → promote to HT.

   5. Shrinking Behavior
      ------------------
      On deletion:
        - HT may shrink to VECTOR.
        - VECTOR with 1 item → becomes SINGLE.
        - If RAX has only one key left, it’s promoted up.

   Summary:
   --------
   This redesign provides:
     ✓ Fine-grained memory control
     ✓ High scalability for bursty TTL data
     ✓ Fast expiry checks via windowed organization
     ✓ Minimal overhead for sparse sets
     ✓ Flexible binary-search-based sorting and bucketing

   It also lays the groundwork for future enhancements, including metrics,
   prioritized expiry policies, or segmented cleaning.

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
2025-08-05 18:28:47 +03:00
Ran Shidlansik 65215e5378 Introduce HASH items expiration
Closes https://github.com/valkey-io/valkey/issues/640

This PR introduces support for **field-level expiration in Valkey hash types**, making it possible for individual fields inside a hash to expire independently — creating what we call **volatile fields**.
This is just the first out of 3 PRs. The content of this PR focuses on enabling the basic ability to set and modify hash field expiration, as well as persistence (AOF+RDB) and defrag.
[The second PR](https://github.com/ranshid/valkey/pull/5), which introduces the new algorithm (volatile-set) used to track volatile hash fields, is in the last stages of review. The current implementation in this PR (in volatile-set.h/c) is just a stub implementation and will be replaced by [the second PR](https://github.com/ranshid/valkey/pull/5).
[The third PR](https://github.com/ranshid/valkey/pull/4/) introduces the active expiration and defragmentation jobs.

For more high-level design details you can track the RFC PR: https://github.com/valkey-io/valkey-rfc/pull/22.

---

Some high-level major decisions taken as part of this work:
1. We decided to copy the existing Redis API in order to maintain compatibility with existing clients.
2. We decided to avoid introducing lazy-expiration at this point, in order to reduce complexity and rely only on active-expiration for memory reclamation. This will require us to continue improving the active expiration job and potentially consider introducing lazy-expiration support later on.
3. Although the various commands which add expiration on hash fields increase memory utilization (by allocating more memory for the expiration time and metadata), we decided to avoid adding DENYOOM to these commands (an exception is HSETEX), in order to be better aligned with high-level key commands like `expire`.
4. Some hash type commands will produce unexpected results:
 - HLEN - will still reflect the number of fields which exist in the hash object (whether actually expired or not).
 - HRANDFIELD - in some cases we will not be able to randomly select a field which has not already expired. This happens in 2 cases: 1/ when we are asked to provide non-unique fields (i.e. negative count) 2/ when the size of the hash is much bigger than the count and we need to provide unique results. In both cases it is possible that an empty response will be returned to the caller, even when the hash contains fields which are either persistent or not yet expired.
5. For the case where a field is provided with a zero (0) expiration time or an expiration time in the past, it is immediately deleted. We decided that, in order to be aligned with how high-level keys are handled, we will emit an hexpired keyspace event for that case (instead of hdel).
6. We will ALWAYS load hash fields during RDB load. This means that when a primary is rebooting with an old snapshot, it will take time to reclaim all the expired fields. However this simplifies the current logic and avoids a major refactoring that I suspect would otherwise be needed.

For example:
```
HSET myhash f1 v1
> 0
HGETEX myhash EX 0 FIELDS 1 f1
> "v1"
HTTL myhash FIELDS 1 f1
>  -2
```

The reported events are:
```
1) "psubscribe"
2) "__keyevent@0__:*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:hset"
4) "myhash"
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:hexpired" <---------------- note this
4) "myhash"
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:del"
4) "myhash"
```
---

This PR also **modularizes and exposes the internal `hashTypeEntry` logic** as a new standalone `entry.c/h` module. This new abstraction handles all aspects of **field–value–expiry encoding** using multiple memory layouts optimized for performance and memory efficiency.

An `entry` is an abstraction that represents a single **field–value pair with optional expiration**. Internally, Valkey uses different memory layouts for compactness and efficiency, chosen dynamically based on size and encoding constraints.

The entry pointer is the field sds, which lets us use an entry just like any
sds. We encode the entry layout type in the field SDS header. SDS type
SDS_TYPE_5 doesn't have any spare bits to encode this, so we use it only for
the first layout type.

Entry with embedded value, used for small sizes. The value is stored as
SDS_TYPE_8. The field can use any SDS type.

An entry can also have an expiration timestamp, which is the UNIX time at which
it expires. For aligned fast access, we keep the expiry timestamp just before
the start of the sds header.

     +----------------+--------------+---------------+
     | Expiration     | field        | value         |
     | 1234567890LL   | hdr "foo" \0 | hdr8 "bar" \0 |
     +-----------------------^-------+---------------+
                             |
                             |
                            entry pointer (points to field sds content)

Entry with value pointer, used for larger fields and values. The field is SDS
type 8 or higher.

     +--------------+-------+--------------+
     | Expiration   | value | field        |
     | 1234567890LL | ptr   | hdr "foo" \0 |
     +--------------+--^----+------^-------+
                       |           |
                       |           |
                       |         entry pointer (points to field sds content)
                       |
                      value pointer = value sds

The `entry.c/h` API provides methods to:
- Create, read, and update the field, value, and expiration
- Set or clear expiration
- Check expiration state
- Clone or delete an entry
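The layouts above can be modeled at a high level, ignoring the single-allocation packing around the sds header. A minimal Python sketch of the entry semantics — the class and method names here are illustrative and do not match the real `entry.c/h` API:

```python
import time

class HashEntry:
    """Illustrative model of a field-value pair with an optional expiry.

    The real entry.c/h packs all three into one allocation around the
    field's sds header; here we only model the observable semantics.
    """

    def __init__(self, field, value, expire_at_ms=None):
        self.field = field
        self.value = value
        self.expire_at_ms = expire_at_ms  # absolute UNIX time in ms, or None

    def set_expiration(self, unix_ms):
        self.expire_at_ms = unix_ms

    def persist(self):
        # Clear the TTL; the field becomes persistent.
        self.expire_at_ms = None

    def is_expired(self, now_ms=None):
        if self.expire_at_ms is None:
            return False
        now_ms = int(time.time() * 1000) if now_ms is None else now_ms
        return now_ms >= self.expire_at_ms

e = HashEntry("f1", "v1")
assert not e.is_expired()
e.set_expiration(1_000)               # a timestamp long in the past
assert e.is_expired(now_ms=2_000)
e.persist()
assert not e.is_expired(now_ms=2_000)
```

This mirrors the API surface listed above (create, update, set/clear expiration, check expiration state) without the memory-layout details.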

---

This PR introduces **new commands** and extends existing ones to support field expiration:

The proposed API is essentially identical to the API provided by Redis (7.4 + 8.0). This is intentional, in order to avoid breaking client applications that have already opted in to hash field TTLs.

**Synopsis**

```
HSETEX key [NX | XX] [FNX | FXX] [EX seconds | PX milliseconds |
  EXAT unix-time-seconds | PXAT unix-time-milliseconds | KEEPTTL]
  FIELDS numfields field value [field value ...]
```

Set the value of one or more fields of a given hash key, and optionally set their expiration time or time-to-live (TTL).

The HSETEX command supports the following set of options:

* `NX` — Only set the fields if the hash object does NOT exist.
* `XX` — Only set the fields if the hash object does exist.
* `FNX` — Only set the fields if none of them already exist.
* `FXX` — Only set the fields if all of them already exist.
* `EX seconds` — Set the specified expiration time in seconds.
* `PX milliseconds` — Set the specified expiration time in milliseconds.
* `EXAT unix-time-seconds` — Set the specified Unix time in seconds at which the fields will expire.
* `PXAT unix-time-milliseconds` — Set the specified Unix time in milliseconds at which the fields will expire.
* `KEEPTTL` — Retain the TTL associated with the fields.

The `EX`, `PX`, `EXAT`, `PXAT`, and `KEEPTTL` options are mutually exclusive.
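Internally, each of the mutually exclusive time options reduces to a single absolute millisecond timestamp. A hedged sketch of that normalization — the function name and shape are illustrative, not the actual implementation:

```python
def resolve_expire_at_ms(option, value, now_ms):
    """Convert an EX/PX/EXAT/PXAT argument to an absolute UNIX time in ms.

    Returns None for KEEPTTL, meaning the existing TTL is left untouched.
    """
    if option == "EX":    # relative, seconds
        return now_ms + value * 1000
    if option == "PX":    # relative, milliseconds
        return now_ms + value
    if option == "EXAT":  # absolute, seconds
        return value * 1000
    if option == "PXAT":  # absolute, milliseconds
        return value
    if option == "KEEPTTL":
        return None
    raise ValueError("EX, PX, EXAT, PXAT and KEEPTTL are mutually exclusive")

now = 1_700_000_000_000
assert resolve_expire_at_ms("EX", 10, now) == now + 10_000
assert resolve_expire_at_ms("PXAT", 123, now) == 123
assert resolve_expire_at_ms("KEEPTTL", None, now) is None
```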

**Synopsis**

```
HGETEX key [EX seconds | PX milliseconds | EXAT unix-time-seconds |
  PXAT unix-time-milliseconds | PERSIST] FIELDS numfields field
  [field ...]
```

Get the value of one or more fields of a given hash key and optionally set their expiration time or time-to-live (TTL).

The `HGETEX` command supports a set of options:

* `EX seconds` — Set the specified expiration time, in seconds.
* `PX milliseconds` — Set the specified expiration time, in milliseconds.
* `EXAT unix-time-seconds` — Set the specified Unix time at which the fields will expire, in seconds.
* `PXAT unix-time-milliseconds` — Set the specified Unix time at which the fields will expire, in milliseconds.
* `PERSIST` — Remove the TTL associated with the fields.

The `EX`, `PX`, `EXAT`, `PXAT`, and `PERSIST` options are mutually exclusive.

**Synopsis**

```
HEXPIRE key seconds [NX | XX | GT | LT] FIELDS numfields
  field [field ...]
```

Set an expiration (TTL or time to live) on one or more fields of a given hash key. You must specify at least one field. Field(s) will automatically be deleted from the hash key when their TTLs expire.
Field expirations will only be cleared by commands that delete or overwrite the contents of the hash fields, including `HDEL` and `HSET` commands. This means that all the operations that conceptually *alter* the value stored at a hash key's field without replacing it with a new one will leave the TTL untouched.
Note that calling `HEXPIRE`/`HPEXPIRE` with zero for the 'seconds' argument, or with a time in the past, will result in the hash field being deleted immediately. To clear a field's TTL without deleting the field, use `HPERSIST`.

The `HEXPIRE` command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.
* `XX` — For each specified field, set expiration only when the field has an existing expiration.
* `GT` — For each specified field, set expiration only when the new expiration is greater than current one.
* `LT` — For each specified field, set expiration only when the new expiration is less than current one.
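The four flags are per-field preconditions on the existing expiration. A hedged sketch of the check, assuming `GT` treats a missing expiration as an infinite TTL (as key-level `EXPIRE` does) — names are illustrative:

```python
def ttl_update_allowed(flag, current_ms, new_ms):
    """Per-field check for the NX/XX/GT/LT flags.

    current_ms is the field's existing absolute expiry in ms, or None if
    the field is persistent (has no expiration).
    """
    if flag == "NX":
        return current_ms is None
    if flag == "XX":
        return current_ms is not None
    if flag == "GT":
        # A persistent field behaves like an infinite TTL: GT never wins.
        return current_ms is not None and new_ms > current_ms
    if flag == "LT":
        # Against an infinite TTL, any finite expiry is smaller.
        return current_ms is None or new_ms < current_ms
    return True  # no flag given: always set

assert ttl_update_allowed("NX", None, 100)
assert not ttl_update_allowed("NX", 50, 100)
assert ttl_update_allowed("GT", 50, 100)
assert not ttl_update_allowed("GT", None, 100)
assert ttl_update_allowed("LT", None, 100)
```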

**Synopsis**

```
HEXPIREAT key unix-time-seconds [NX | XX | GT | LT] FIELDS numfields
  field [field ...]
```

`HEXPIREAT` has the same effect and semantics as `HEXPIRE`, but instead of specifying the number of seconds for the TTL (time to live), it takes an absolute Unix timestamp in seconds since Unix epoch. A timestamp in the past will delete the field immediately.

The `HEXPIREAT` command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.
* `XX` — For each specified field, set expiration only when the field has an existing expiration.
* `GT` — For each specified field, set expiration only when the new expiration is greater than current one.
* `LT` — For each specified field, set expiration only when the new expiration is less than current one.

**Synopsis**

```
HPEXPIRE key milliseconds [NX | XX | GT | LT] FIELDS numfields
  field [field ...]
```

This command works like `HEXPIRE`, but the expiration of a field is specified in milliseconds instead of seconds.

The `HPEXPIRE` command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.
* `XX` — For each specified field, set expiration only when the field has an existing expiration.
* `GT` — For each specified field, set expiration only when the new expiration is greater than current one.
* `LT` — For each specified field, set expiration only when the new expiration is less than current one.

**Synopsis**

```
HPEXPIREAT key unix-time-milliseconds [NX | XX | GT | LT]
  FIELDS numfields field [field ...]
```

`HPEXPIREAT` has the same effect and semantics as `HEXPIREAT`, but the Unix time at which the field will expire is specified in milliseconds since Unix epoch instead of seconds.

**Synopsis**

```
HPERSIST key FIELDS numfields field [field ...]
```

Remove the existing expiration on a hash key's field(s), turning the field(s) from *volatile* (a field with expiration set) to *persistent* (a field that will never expire as no TTL (time to live) is associated).

**Synopsis**

```
HSETEX key [NX] seconds field value [field value ...]
```

Similar to `HSET` but adds one or more hash fields that expire after the specified number of seconds. By default, this command overwrites the values and expirations of specified fields that exist in the hash. If the `NX` option is specified, the field data will not be overwritten. If `key` doesn't exist, a new hash key is created.

The HSETEX command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.

**Synopsis**

```
HTTL key FIELDS numfields field [field ...]
```

Returns the **remaining** TTL (time to live) of a hash key's field(s) that have a set expiration. This introspection capability allows you to check how many seconds a given hash field will continue to be part of the hash key.

**Synopsis**

```
HPTTL key FIELDS numfields field [field ...]
```

Like `HTTL`, this command returns the remaining TTL (time to live) of a field that has an expiration set, but in milliseconds instead of seconds.
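The reply codes follow the key-level `TTL`/`PTTL` convention. A small sketch — the hash is modeled as a dict of field → absolute expiry in ms (or None for persistent fields), and rounding up to whole seconds is an illustrative choice:

```python
def httl_reply(hash_obj, field, now_ms):
    """Sketch of HTTL's per-field reply codes:
       -2  field (or key) does not exist
       -1  field exists but has no expiration
        n  remaining time to live, rounded up to whole seconds
    """
    if field not in hash_obj:
        return -2
    expire_at_ms = hash_obj[field]
    if expire_at_ms is None:
        return -1
    return max((expire_at_ms - now_ms + 999) // 1000, 0)

h = {"f1": None, "f2": 10_000}
assert httl_reply(h, "missing", 0) == -2
assert httl_reply(h, "f1", 0) == -1
assert httl_reply(h, "f2", 4_000) == 6
```

`HPTTL` would return the raw `expire_at_ms - now_ms` difference instead of converting to seconds.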

**Synopsis**

```
HEXPIRETIME key FIELDS numfields field [field ...]
```

Returns the absolute Unix timestamp in seconds since Unix epoch at which the given key's field(s) will expire.

**Synopsis**

```
HPEXPIRETIME key FIELDS numfields field [field ...]
```

`HPEXPIRETIME` has the same semantics as `HEXPIRETIME`, but returns the absolute Unix expiration timestamp in milliseconds since Unix epoch instead of seconds.

This PR introduces new notification events to support field-level expiration:

| Event       | Trigger                                  |
|-------------|-------------------------------------------|
| `hexpire`   | Field expiration was set                  |
| `hexpired`  | Field was deleted due to expiration       |
| `hpersist`  | Expiration was removed from a field       |
| `del`       | Key was deleted after all fields expired  |

Note that we diverge from Redis in the cases where we emit the hexpired event.
For example, given the following use case:
```
HSET myhash f1 v1
(integer) 0
HGETEX myhash EX 0 FIELDS 1 f1
1) "v1"
HTTL myhash FIELDS 1 f1
1) (integer) -2
```
Regarding the keyspace notifications, Redis reports:
```
1) "psubscribe"
2) "__keyevent@0__:*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:hset"
4) "myhash2"
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:hdel" <---------------- note this
4) "myhash2"
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:del"
4) "myhash2"
```

However, in our current suggestion, Valkey will emit:
```
1) "psubscribe"
2) "__keyevent@0__:*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:hset"
4) "myhash"
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:hexpired" <---------------- note this
4) "myhash"
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:del"
4) "myhash"
```
---

- Expiration-aware commands (`HSETEX`, `HGETEX`, etc.) are **not propagated as-is**.
- Instead, Valkey rewrites them into equivalent commands like:
  - `HDEL` (for expired fields)
  - `HPEXPIREAT` (for setting absolute expiration)
  - `HPERSIST` (for removing expiration)

This ensures compatibility with replication and AOF while maintaining consistent field-level expiry behavior.
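The rewrite can be sketched as a mapping from relative options to a deterministic absolute form. Command shapes below are simplified and the function name is illustrative:

```python
def rewrite_for_replication(cmd, now_ms):
    """Sketch: rewrite a relative-TTL command into a deterministic
    absolute form before propagating it to replicas/AOF.

    cmd is modeled as (name, key, option, value, fields)."""
    name, key, option, value, fields = cmd
    if option == "EX":       # relative seconds -> absolute ms
        return ("HPEXPIREAT", key, now_ms + value * 1000, fields)
    if option == "PX":       # relative ms -> absolute ms
        return ("HPEXPIREAT", key, now_ms + value, fields)
    if option == "PERSIST":  # TTL removal propagates as HPERSIST
        return ("HPERSIST", key, fields)
    return cmd               # already deterministic; propagate as-is

out = rewrite_for_replication(("HGETEX", "myhash", "EX", 10, ["f1"]), 1_000)
assert out == ("HPEXPIREAT", "myhash", 11_000, ["f1"])
```

Rewriting to absolute timestamps keeps primary and replicas consistent even when the command is applied at different wall-clock times.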

---

| Command Name | QPS Standard | QPS HFE | QPS Diff % | Latency Standard (ms) | Latency HFE (ms) | Latency Diff % |
|--------------|-------------|---------|------------|----------------------|------------------|----------------|
| **One Large Hash Table** |
| HGET | 137988.12 | 138484.97 | +0.36% | 0.951 | 0.949 | -0.21% |
| HSET | 138561.73 | 137343.77 | -0.87% | 0.948 | 0.956 | +0.84% |
| HEXISTS | 139431.12 | 138677.02 | -0.54% | 0.942 | 0.946 | +0.42% |
| HDEL | 140114.89 | 138966.09 | -0.81% | 0.938 | 0.945 | +0.74% |
| **Many Hash Tables (100 fields)** |
| HGET | 136798.91 | 137419.27 | +0.45% | 0.959 | 0.956 | -0.31% |
| HEXISTS | 138946.78 | 139645.31 | +0.50% | 0.946 | 0.941 | -0.52% |
| HGETALL | 42194.09 | 42016.80 | -0.42% | 0.621 | 0.625 | +0.64% |
| HSET | 137230.69 | 137249.53 | +0.01% | 0.959 | 0.958 | -0.10% |
| HDEL | 138985.41 | 138619.34 | -0.26% | 0.948 | 0.949 | +0.10% |
| **Many Hash Tables (1000 fields)** |
| HGET | 135795.77 | 139256.36 | +2.54% | 0.965 | 0.943 | -2.27% |
| HEXISTS | 138121.55 | 137950.06 | -0.12% | 0.951 | 0.952 | +0.10% |
| HGETALL | 5885.81 | 5633.80 | **-4.28%** | 2.690 | 2.841 | **+5.61%** |
| HSET | 137005.08 | 137400.39 | +0.28% | 0.959 | 0.955 | -0.41% |
| HDEL | 138293.45 | 137381.52 | -0.65% | 0.948 | 0.955 | +0.73% |

[ ] Consider extending HSETEX with extra arguments (NX/XX) so that it is possible to prevent adding/setting/mutating fields of a non-existent hash
[ ] Avoid loading expired fields when a non-preamble RDB is being loaded on the primary. This is an optimization to avoid loading unnecessary (already expired) fields. It would also require us to propagate the HDEL to the replicas in the case of RDBFLAGS_FEED_REPL. Note that it might require some refactoring:
1/ propagate the rdbflags and current time to rdbLoadObject. 2/ consider the cases of restore and check_rdb, etc.
For this reason I would like to avoid this optimization for the first drop.

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
2025-08-05 18:28:47 +03:00
Binbin 3b12132ac0
Rename trace bgsave type to rdb and aof (#2400)
Previously, we called it the bgsave type, and it had two
events: rdb_unlink_temp_file and fork. However, the code
actually counts all forks as part of the bgsave type, such
as aof fork and module fork.

In this commit, the bgsave type was renamed to rdb type,
to align with the aof type, also adding the aof fork event.

Also doing some cleanup around the README file.
See #2070 for more details.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-08-03 16:55:12 +08:00
chzhoo 3b25c4d6b8
Optimize scan/sscan/hscan/zscan commands by replacing list with vector (#2160)
The scan/sscan/hscan/zscan commands store iterated keys and values in
a `list`, where each list element triggers a separate memory allocation
and deallocation. This causes performance degradation when the list
contains a large number of elements.

We reduce memory allocation/deallocation by replacing `list` with
`vector`, a new dynamic array implementation.
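The win comes from amortization: a doubling dynamic array performs O(log n) reallocations for n appends, while a linked list allocates one node per element. A rough Python illustration, counting allocations only — the names are illustrative and not the actual `vector` API:

```python
def vector_allocs(n):
    """Count the reallocations a doubling dynamic array needs for n appends."""
    cap = allocs = size = 0
    for _ in range(n):
        if size == cap:           # out of capacity: grow geometrically
            cap = max(1, cap * 2)
            allocs += 1           # one (re)allocation per growth step
        size += 1
    return allocs

# A list of nodes would allocate once per element: 1000 allocations.
# The doubling vector needs only 11 growth steps (capacities 1..1024):
assert vector_allocs(1000) == 11
```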

For specific performance gains, refer to the benchmark results below.

### Benchmark
Run the benchmark script five times, and take the peak value as the
final result.

#### Scan command
**benchmark script**
```
valkey-benchmark -r 1000000000 -n 10000000 -P 32 set key-__rand_int__ value-__rand_int__
for count in `echo 4 16 64 512`;do
	for((x=0;x<5;x+=1));do
		valkey-benchmark --threads 2 -r 100000000 -n 2000000  scan __rand_int__ count $count;
	done
done
```
**benchmark result**
count | QPS after optimization | QPS before optimization | Performance Boost
-- | -- | -- | --
4 | 114148.73 | 108090.58 | 5.6%
16 | 98745.93 | 87892.77 | 12.3%
64 | 64432.99 | 49971.27 | 28.9%
512 | 15725.87 | 10174.75 | 54.5%

#### Hscan Command
**benchmark script**
```
valkey-benchmark -r 1024 -n 1000000 -P 32 hset hash field-__rand_int__ value-__rand_int__
for count in `echo 4 16 64 512`;do
	for((x=0;x<5;x+=1));do
		valkey-benchmark --threads 2 -r 100000000 -n 2000000 hscan hash __rand_int__ count $count;
	done
done
```
**benchmark result**
count | QPS after optimization | QPS before optimization | Performance Boost
-- | -- | -- | --
4 | 115921.87 | 114266.12 | 1.4%
16 | 103879.91 | 97546.70 | 6.5%
64 | 71405.62 | 62976.26 | 13.4%
512 | 23168.53 | 16064.64 | 44.2%

#### Sscan command
**benchmark script**
```
valkey-benchmark -r 1024 -n 1000000 -P 32 sadd set element-__rand_int__
for count in `echo 4 16 64 512`;do
	for((x=0;x<5;x+=1));do
		valkey-benchmark --threads 2 -r 100000000 -n 2000000 sscan set __rand_int__ count $count;
	done
done
```
**benchmark result**
count | QPS after optimization | QPS before optimization | Performance Boost
-- | -- | -- | --
4 | 123054.20 | 119388.73 | 3.1%
16 | 115928.59 | 110975.48 | 4.5%
64 | 94099.94 | 86945.18 | 8.2%
512 | 40320.95 | 32501.83 | 24.1%

#### Zscan command
**benchmark script**
```
valkey-benchmark -r 1024 -n 1000000 -P 32 zadd zset __rand_int__ element-__rand_int__
for count in `echo 4 16 64 512`;do
	for((x=0;x<5;x+=1));do
		valkey-benchmark --threads 2 -r 100000000 -n 2000000 zscan zset __rand_int__ count $count;
	done
done
```
**benchmark result**
count | QPS after optimization | QPS before optimization | Performance Boost
-- | -- | -- | --
4 | 97437.40 | 96362.33 | 1.1%
16 | 71405.62 | 70155.75 | 1.8%
64 | 35066.80 | 33002.75 | 6.3%
512 | 7498.02 | 6654.89 | 12.7%

CPU: AMD EPYC 9754 128-Core Processor * 8
OS: 	Ubuntu Server 22.04 LTS 64bit
Memory: 16GB
VM: Tencent cloud SA5 | SA5.2XLARGE16
Server startup command: valkey-server --save '' --appendonly no

---------

Signed-off-by: chzhoo <czawyx@163.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2025-06-18 00:13:29 +02:00
skyfirelee 1941d28acd
[NEW] Introduce lttng based tracing (#2070)
## Introduce

In a production environment, it's quite challenging to figure out why a
Valkey server is under high load. Right now, tools like INFO or slowlog can
offer some clues. But if the server can't respond, we might not get any
information at all.
Usually, we have to rely on tools like `strace` or `perf` to find the
root cause. If we set up trace points in advance during development,
we can quickly pinpoint performance issues.

This PR adds support for all latency sampling points, as well as
information reporting for command execution. The reporting can be
dynamically turned on or off as required. The trace feature is
implemented on top of LTTng, a capability also used by projects like
QEMU and Ceph.

## How to use

Building Valkey with LTTng support:

```
USE_LTTNG=yes make
```

Open event report:
```
config set trace-events "sys server db cluster aof commands"
```

Events are classified as follows:
- sys (System-level operations)
- server (Server core logic)
- db (Database core operations)
- cluster (Cluster configuration operations)
- aof (AOF persistence operations)
- commands(Command execution information)

## How to trace

Enable lttng trace events dynamically:
```
~# lttng destroy valkey
~# lttng create valkey
~# lttng enable-event -u valkey:*
~# lttng track -u -p `pidof valkey-server`
~# lttng start
~# lttng stop
~# lttng view
```

Examples (a client run 'SET', another run 'keys'):

```
[15:30:19.334463706] (+0.000001243) libai valkey:command_call: { cpu_id = 15 }, { name = "set", duration = 0 }
[15:30:19.334465183] (+0.000001477) libai valkey:command_call: { cpu_id = 15 }, { name = "set", duration = 1 }
[15:30:19.334466516] (+0.000001333) libai valkey:command_call: { cpu_id = 15 }, { name = "set", duration = 0 }
[15:30:19.334467738] (+0.000001222) libai valkey:command_call: { cpu_id = 15 }, { name = "set", duration = 0 }
[15:30:19.334469105] (+0.000001367) libai valkey:command_call: { cpu_id = 15 }, { name = "set", duration = 1 }
[15:30:19.334470327] (+0.000001222) libai valkey:command_call: { cpu_id = 15 }, { name = "set", duration = 0 }
[15:30:19.369348485] (+0.034878158) libai valkey:command_call: { cpu_id = 15 }, { name = "keys", duration = 34874 }
[15:30:19.369698322] (+0.000349837) libai valkey:command_call: { cpu_id = 15 }, { name = "set", duration = 4 }
[15:30:19.369702327] (+0.000004005) libai valkey:command_call: { cpu_id = 15 }, { name = "set", duration = 2 }

```

Then we can use another script to analyze topN slow commands and other
system
level events.

About performance overhead (valkey-benchmark -t get -n 1000000 --threads
4):
1> no lttng builtin: 285632.69 requests per second
2> lttng builtin, no trace: 285551.09 requests per second (almost 0
overhead)
3> lttng builtin, trace commands: 266595.59 requests per second (about
~6.6% overhead)

Generally valkey-server does not run at full utilization, so the overhead
is acceptable.

## Problem analysis

Add prot and conn field into trace command

Run benchmark tool:
```
GET: rps=227428.0 (overall: 222756.2) avg_msec=0.114 (overall: 0.117)
GET: rps=225248.0 (overall: 223005.2) avg_msec=0.115 (overall: 0.117)
GET: rps=167474.1 (overall: 217942.2) avg_msec=0.193 (overall: 0.122) --> performance drop
GET: rps=220192.0 (overall: 218129.5) avg_msec=0.118 (overall: 0.122)
GET: rps=222868.0 (overall: 218493.7) avg_msec=0.117 (overall: 0.121)

```
Running a 'keys *' command on another connection leads to a benchmark
performance drop.

At the same time, lttng traces events:
```
[21:16:30.420997167] (+0.000004064) zhenwei valkey:command_call: { cpu_id = 6 }, { prot = "tcp", conn = "127.0.0.1:6379-127.0.0.1:54668", name = "get", duration = 1 }
[21:16:30.421001262] (+0.000004095) zhenwei valkey:command_call: { cpu_id = 6 }, { prot = "tcp", conn = "127.0.0.1:6379-127.0.0.1:54782", name = "get", duration = 1 }
[21:16:30.485562459] (+0.064561197) zhenwei valkey:command_call: { cpu_id = 6 }, { prot = "tcp", conn = "127.0.0.1:6379-127.0.0.1:54386", name = "keys", duration = 64551 } --> root cause
[21:16:30.485583101] (+0.000020642) zhenwei valkey:command_call: { cpu_id = 6 }, { prot = "tcp", conn = "127.0.0.1:6379-127.0.0.1:54522", name = "get", duration = 1 }
[21:16:30.485763891] (+0.000180790) zhenwei valkey:command_call: { cpu_id = 6 }, { prot = "tcp", conn = "127.0.0.1:6379-127.0.0.1:54542", name = "get", duration = 1 }
[21:16:30.485766451] (+0.000002560) zhenwei valkey:command_call: { cpu_id = 6 }, { prot = "tcp", conn = "127.0.0.1:6379-127.0.0.1:54438", name = "get", duration = 1 }
```

From this change, we can see that connection
127.0.0.1:6379-127.0.0.1:54386
affects other connections.

---------

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: artikell <739609084@qq.com>
Signed-off-by: skyfirelee <739609084@qq.com>
Co-authored-by: zhenwei pi <pizhenwei@bytedance.com>
2025-06-08 14:39:57 -07:00
zhenwei pi 70106f06bc
Support RDMA for valkey-cli and benchmark (#2059)
Add '--rdma' for both valkey-cli and valkey-benchmark to enable RDMA.

Valkey has already replaced dependency `hiredis` with `libvalkey`
(#2032), and libvalkey also supports RDMA.

---------

Signed-off-by: zhenwei pi <zhenwei.pi@linux.dev>
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
2025-05-09 09:05:15 +02:00
Björn Svensson b5c7743971
Replace dependency `hiredis` with `libvalkey` (#2032)
This PR removes the dependency `deps/hiredis` and replaces it with
`deps/libvalkey`.
`libvalkey` is a forked `hiredis` with `hiredis-cluster` included.

Makefiles/CMake, types, include paths, comments and used API function
names are updated to match the new library.

Since both hiredis and Valkey use an `sds` type, we previously needed to
patch the sds type provided by hiredis, and have a `sdscompat.h` file to
map `sds` calls to the hiredis variant. This is no longer needed since
we can build a static libvalkey using `sds` provided by Valkey (same
with `dict`). Now we use Valkey's `sds` type in `valkey-cli` and
`valkey-benchmark` and that's why they now also require `fpconv` to
build.

The files from [libvalkey
0.1.0](https://github.com/valkey-io/libvalkey/releases/tag/0.1.0) is
added as a `git subtree` similar to how hiredis was included.

---------

Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
2025-05-07 11:49:27 +02:00
eifrah-aws f8bc378dec
[CMake] Check both arm64 and aarch64 for ARM based system architecture (#1829)
While macOS reports `arm64` for the `uname -m` command, others might
report `aarch64`. This PR fixes this.

Signed-off-by: Eran Ifrah <eifrah@amazon.com>
2025-03-10 13:29:21 -07:00
Ricardo Dias b58088e31b
Adds support for running `EVAL` with different scripting engines (#1497)
In this PR we re-implement the `EVAL` commands (`EVAL`, `EVALSHA`,
`SCRIPT LOAD`, etc...) to use the scripting engine infrastructure
introduced in 6adef8e2f9. This allows
`EVAL` to run scripts using different scripting engines.

The Lua scripting engine implementation code was moved into its own
subdirectory `src/lua`.

This new implementation generalizes the module API for implementing
scripting engines to work with both `FUNCTION` and `EVAL` commands.

Module API changes include:
* Rename of callback
`ValkeyModuleScriptingEngineCreateFunctionsLibraryFunc` to
`ValkeyModuleScriptingEngineCompileCodeFunc`.
* Addition of a new enum `enum ValkeyModuleScriptingEngineSubsystemType`
to specify the scripting engine subsystem (EVAL, or FUNCTION, or both).
* In most callbacks was added a new parameter with the type
`ValkeyModuleScriptingEngineSubsystemType`.
* New callback specific for EVAL
`ValkeyModuleScriptingEngineResetEvalEnvFunc`.
* New API function `ValkeyModuleScriptingEngineExecutionState
(*ValkeyModule_GetFunctionExecutionState)(ValkeyModuleScriptingEngineServerRuntimeCtx
*server_ctx)` that is used by scripting engines to query the server
about the execution state of the script that is running.

Fixes #1261
Fixes #1468
Follow-up of #1277

---------

Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Ping Xie <pingxie@outlook.com>
2025-02-10 12:23:08 +01:00
zhaozhao.zz 3f21705a6c
Feature COMMANDLOG to record slow execution and large request/reply (#1294)
As discussed in PR #336.

We have different types of resources like CPU, memory, network, etc. The
`slowlog` can only record commands eat lots of CPU during the processing
phase (doesn't include read/write network time), but can not record
commands eat too many memory and network. For example:

1. running a "SET key value(10 megabytes)" command would not be recorded in
the slowlog, since when processing it the SET command only inserts the
value's pointer into the db dict. But that command eats huge memory in the
query buffer and bandwidth from the network. In this case, just 1000 TPS can
cause 10GB/s of network flow.
2. running a "GET key" command where the key's value length is 10 megabytes.
The GET command can eat huge memory in the output buffer and bandwidth to
the network.

This PR introduces a new command `COMMANDLOG`, to log commands that
consume significant network bandwidth, including both input and output.
Users can retrieve the results using `COMMANDLOG get <count>
large-request` and `COMMANDLOG get <count> large-reply`, all subcommands
for `COMMANDLOG` are:

* `COMMANDLOG HELP`
* `COMMANDLOG GET <count> <slow|large-request|large-reply>`
* `COMMANDLOG LEN <slow|large-request|large-reply>`
* `COMMANDLOG RESET <slow|large-request|large-reply>`

And the slowlog is also incorporated into the commandlog.

For each of these three types, additional configs have been added for
control:

* `commandlog-request-larger-than` and
`commandlog-large-request-max-len` represent the threshold for large
requests(the unit is Bytes) and the maximum number of commands that can
be recorded.
* `commandlog-reply-larger-than` and `commandlog-large-reply-max-len`
represent the threshold for large replies(the unit is Bytes) and the
maximum number of commands that can be recorded.
* `commandlog-execution-slower-than` and
`commandlog-slow-execution-max-len` represent the threshold for slow
executions(the unit is microseconds) and the maximum number of commands
that can be recorded.
* Additionally, `slowlog-log-slower-than` and `slowlog-max-len` are now
set as aliases for these two new configs.
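Put together, a hedged valkey.conf fragment exercising these configs might look like this (the threshold values are illustrative, not defaults):

```
# Log requests larger than 1 MB; keep at most 128 entries (unit: Bytes)
commandlog-request-larger-than 1048576
commandlog-large-request-max-len 128

# Log replies larger than 1 MB (unit: Bytes)
commandlog-reply-larger-than 1048576
commandlog-large-reply-max-len 128

# Log executions slower than 10 ms (unit: microseconds)
commandlog-execution-slower-than 10000
commandlog-slow-execution-max-len 128
```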

---------

Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
2025-01-24 11:41:40 +08:00
Ricardo Dias af71619c45
Extract the scripting engine code from the functions unit (#1312)
This commit creates a new compilation unit for the scripting engine code
by extracting the existing code from the functions unit.
We're doing this refactor to prepare the code for running the `EVAL`
command using different scripting engines.

This PR has a module API change: we changed the type of error messages
returned by the callback
`ValkeyModuleScriptingEngineCreateFunctionsLibraryFunc` to be a
`ValkeyModuleString` (aka `robj`);

This PR also fixes #1470.

---------

Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
2025-01-16 10:08:16 +01:00
eifrah-aws b3b4bdcda4
CMake: fail on warnings (#1503)
When building with `CMake` (especially the targets `valkey-cli`,
`valkey-server` and `valkey-benchmark`) it is possible to have a
successful build while having warnings.

This PR fixes this - which is aligned with how the `Makefile` is working
today:
- Enable `-Wall` + `-Werror` for valkey targets
- Fixed warning in valkey-cli:jsonStringOutput method

Signed-off-by: Eran Ifrah <eifrah@amazon.com>
2025-01-03 09:44:41 +08:00
Viktor Söderqvist c8ee5c2c46 Hashtable implementation including unit tests
A cache-line aware hash table with a user-defined key-value entry type,
supporting incremental rehashing, scan, iterator, random sampling,
incremental lookup and more...

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-12-10 21:30:56 +01:00
zhenwei pi 4695d118dd
RDMA builtin support (#1209)
There are several patches in this PR:

* Abstract set/rewrite config bind option: `bind` option is a special
config; `socket` and `tls` use the same one. However RDMA uses a
similar style but a different one. Use a bit of abstraction to make it
flexible for both `socket` and `RDMA`. (Even for QUIC in the future.)
* Introduce closeListener for connection type: closing socket by a
simple syscall would be fine, RDMA has complex logic. Introduce
connection type specific close listener method.
* RDMA: Use valkey.conf style instead of module parameters: use
`--rdma-bind` and `--rdma-port` style instead of module parameters. The
module style config `rdma.bind` and `rdma.port` are removed.
* RDMA: Support builtin: support `make BUILD_RDMA=yes`. module style is
still kept for now.

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
2024-11-29 11:13:34 +01:00
eifrah-aws 33f42d7fb5
CMake fixes + README update (#1276) 2024-11-22 12:17:53 -08:00
zvi-code b56eed2479
Remove valkey specific changes in jemalloc source code (#1266)
### Summary of the change

This is a base PR for refactoring defrag. It moves the defrag logic to
rely on jemalloc [native
api](https://github.com/jemalloc/jemalloc/pull/1463#issuecomment-479706489)
instead of relying on custom code changes made by valkey in the jemalloc
([je_defrag_hint](9f8185f5c8/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h (L382)))
library. This enables valkey to use the latest vanilla jemalloc without the
need to maintain code changes across jemalloc versions.

This change requires some modifications because the new API provides
only the information, not a yes/no defrag decision. The logic needs to be
implemented in valkey code. Additionally, the API does not provide,
within a single call, all the information needed to make a decision; this
information is available through an additional API call. To reduce the
calls to jemalloc, in this PR the required information is collected
during `computeDefragCycles` and not for every single ptr; this way
we avoid the additional API call.
Followup work will utilize the new options that are now open and will
further improve the defrag decision and process.

### Added files: 

`allocator_defrag.c` / `allocator_defrag.h` - These files implement the
allocator-specific knowledge for making defrag decisions. Everything
about slabs, allocation logic and so on goes into these files. This
improves the separation between jemalloc-specific code and other
possible implementations.


### Moved functions: 

[`zmalloc_no_tcache`, `zfree_no_tcache`
](4593dc2f05/src/zmalloc.c (L215))
- these encode very jemalloc-specific assumptions and are specific to
how we defrag with jemalloc. Moving them also fits the vision that, from
a performance perspective, we should consider using tcache; we only need
to make sure we don't recycle entries without going through the arena
(for example, we can use a private tcache: one for free and one for
alloc).
`frag_smallbins_bytes` - the logic and implementation moved to the new
file.

### Existing API:

* [once a second + when a full cycle completes]
[`computeDefragCycles`](4593dc2f05/src/defrag.c (L916))
* `zmalloc_get_allocator_info` : gets from jemalloc _allocated, active,
resident, retained, muzzy_, `frag_smallbins_bytes`
*
[`frag_smallbins_bytes`](4593dc2f05/src/zmalloc.c (L690))
: for each bin; gets from jemalloc bin_info, `curr_regs`, `cur_slabs`
* [during defrag, for each pointer]
* `je_defrag_hint` takes a memory pointer and returns {0,1}.
[Internally it
uses](4593dc2f05/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h (L368))
these data points:
        * #`nonfull_slabs`
        * #`total_slabs`
        * # of free regs in the ptr's slab
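
Not the actual implementation, but a minimal self-contained sketch of
the kind of per-pointer decision such a hint drives: relocate an
allocation when its slab is emptier than the bin-wide average, so
survivors pack into fuller slabs (all struct and field names here are
illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative per-bin statistics, collected once per defrag cycle
 * (hypothetical names, not the actual valkey/jemalloc structures). */
typedef struct {
    size_t bin_usage;    /* regs currently allocated across the bin */
    size_t bin_capacity; /* total regs the bin's slabs can hold     */
} bin_stats;

/* Relocate a pointer when its slab is emptier than the bin average,
 * so surviving allocations consolidate into fuller slabs. This only
 * mirrors the general idea of je_defrag_hint, not its arithmetic. */
static bool should_defrag(const bin_stats *b, size_t slab_nfree, size_t slab_nregs) {
    if (slab_nfree == 0) return false; /* slab already full, nothing to gain */
    size_t slab_used = slab_nregs - slab_nfree;
    /* integer comparison of slab_used/slab_nregs < bin_usage/bin_capacity */
    return slab_used * b->bin_capacity < b->bin_usage * slab_nregs;
}
```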

## Jemalloc API (via ctl interface)


[BATCH][`experimental_utilization_batch_query_ctl`](4593dc2f05/deps/jemalloc/src/ctl.c (L4114))
: gets an array of pointers; returns 3 values for each pointer:

* number of free regions in the extent
* number of regions in the extent
* size of the extent in terms of bytes


[EXTENDED][`experimental_utilization_query_ctl`](4593dc2f05/deps/jemalloc/src/ctl.c (L3989))
: gets a single pointer; returns:

* memory address of the extent a potential reallocation would go into
* number of free regions in the extent
* number of regions in the extent
* size of the extent in terms of bytes
* [stats-enabled]total number of free regions in the bin the extent
belongs to
* [stats-enabled]total number of regions in the bin the extent belongs
to
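
Since the batch ctl returns its three counters per pointer as one flat
array, the `handle_results` step amounts to unpacking that array; a
small sketch under assumed (illustrative) struct and function names:

```c
#include <assert.h>
#include <stddef.h>

/* One record per queried pointer; field meanings follow the batch
 * query list above, but the struct name is illustrative. */
typedef struct {
    size_t nfree; /* free regions in the extent (slab) */
    size_t nregs; /* total regions in the extent       */
    size_t size;  /* extent size in bytes              */
} slab_usage;

/* Unpack the flat ctl output (3 values per pointer) for n pointers. */
static void unpack_batch_results(const size_t *raw, size_t n, slab_usage *out) {
    for (size_t i = 0; i < n; i++) {
        out[i].nfree = raw[3 * i];
        out[i].nregs = raw[3 * i + 1];
        out[i].size  = raw[3 * i + 2];
    }
}
```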

### `experimental_utilization_batch_query_ctl` vs valkey
`je_defrag_hint`?
[good]
- We can query pointers in a batch, reducing the overall overhead
- The per-ptr decision algorithm is not part of the jemalloc api;
jemalloc only provides the information, so valkey can
tune/configure/optimize it easily

[bad]
- In the batch API we only know the utilization of the slab (of that
memory ptr); we don't get the data about #`nonfull_slabs` and total
allocated regs.


## New functions:
1. `defrag_jemalloc_init`: reduces the cost of calls to je_ctl by using
the [MIB interface](https://jemalloc.net/jemalloc.3.html) for faster
calls. See this quote from the jemalloc documentation:

    The mallctlnametomib() function provides a way to avoid repeated
    name lookups for applications that repeatedly query the same
    portion of the namespace, by translating a name to a “Management
    Information Base” (MIB) that can be passed repeatedly to
    mallctlbymib().

2. `jemalloc_sz2binind_lgq*` : this api provides a reverse map from a
bin size to its info without a lookup. The mapping depends on the
number of size classes, which is derived from
[`lg_quantum`](4593dc2f05/deps/Makefile (L115))
3. `defrag_jemalloc_get_frag_smallbins` : this function replaces
`frag_smallbins_bytes`; the logic moved to the new file
`allocator_defrag`. `defrag_jemalloc_should_defrag_multi` →
`handle_results` unpacks the results.
4. `should_defrag` : implements the same logic as the existing
implementation
[inside](9f8185f5c8/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h (L382))
`je_defrag_hint`
5. `defrag_jemalloc_should_defrag_multi` : implements the hint for an
array of pointers, utilizing the new batch api. Currently only 1
pointer is passed.


### Logical differences:

In order to get the information about #`nonfull_slabs` and #`regs`, we
use the query cycle to collect the information per size class. In order
to find the index of the bin information for a given bin size in O(1),
we use `jemalloc_sz2binind_lgq*`.
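
Real jemalloc size classes are not uniformly spaced, so the following
is only a simplified illustration of the O(1) size-to-bin-index idea
behind `jemalloc_sz2binind_lgq*` (the 16-byte quantum and the function
name are assumptions made for the sketch):

```c
#include <assert.h>
#include <stddef.h>

#define QUANTUM 16 /* illustrative spacing; jemalloc derives it from lg_quantum */

/* Map an allocation size (>= 1) to a bin index by pure arithmetic,
 * with no table walk. Real jemalloc classes grow non-linearly, so
 * this uniform-quantum version is a deliberate simplification. */
static size_t sz2binind_linear(size_t sz) {
    return (sz + QUANTUM - 1) / QUANTUM - 1; /* round up, index from 0 */
}
```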


## Testing
This is the first draft. I did some initial testing that basically
creates fragmentation by reducing max memory and then waiting for
defrag to reach the desired level. The test only serves as a sanity
check that defrag eventually succeeds; no data is provided here
regarding efficiency or performance.

### Test: 
1. disable `activedefrag`
2. run valkey benchmark on overlapping address ranges with different
block sizes
3. wait until `used_memory` reaches 10GB
4. set `maxmemory` to 5GB and `maxmemory-policy` to `allkeys-lru`
5. stop load
6. wait for `mem_fragmentation_ratio` to reach 2
7. enable `activedefrag` - start test timer
8. wait until reach `mem_fragmentation_ratio` = 1.1

#### Results*:
(With this PR) Test result: `56 sec`
(Without this PR) Test result: `67 sec`

*Both runs perform the same "work": an equal number of buffers is moved
to reach the fragmentation target.

Next benchmarking is to compare to:
- DONE // existing `je_get_defrag_hint` 
- compare with naive defrag all: `int defrag_hint() {return 1;}`

---------

Signed-off-by: Zvi Schneider <ezvisch@amazon.com>
Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com>
Signed-off-by: zvi-code <54795925+zvi-code@users.noreply.github.com>
Co-authored-by: Zvi Schneider <ezvisch@amazon.com>
Co-authored-by: Zvi Schneider <zvi.schneider22@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-11-21 16:29:21 -08:00
eifrah-aws 07b3e7ae7a
Add CMake build system for valkey (#1196)
With this commit, users are able to build valkey using `CMake`.

## Example usage:

Build `valkey-server` in Release mode with TLS enabled and using
`jemalloc` as the allocator:

```bash
mkdir build-release
cd $_
cmake .. -DCMAKE_BUILD_TYPE=Release \
         -DCMAKE_INSTALL_PREFIX=/tmp/valkey-install \
         -DBUILD_MALLOC=jemalloc -DBUILD_TLS=1
make -j$(nproc) install

# start valkey
/tmp/valkey-install/bin/valkey-server
```

Build `valkey-unit-tests`:

```bash
mkdir build-release-ut
cd $_
cmake .. -DCMAKE_BUILD_TYPE=Release \
         -DBUILD_MALLOC=jemalloc -DBUILD_UNIT_TESTS=1
make -j$(nproc)

# Run the tests
./bin/valkey-unit-tests 
```

Current features supported by this PR:

- Building against different allocators: (`jemalloc`, `tcmalloc`,
`tcmalloc_minimal` and `libc`), e.g. to enable `jemalloc` pass
`-DBUILD_MALLOC=jemalloc` to `cmake`
- OpenSSL builds (to enable TLS, pass `-DBUILD_TLS=1` to `cmake`)
- Sanitizer support: pass `-DBUILD_SANITIZER=<address|thread|undefined>`
to `cmake`
- Install target + redis symbolic links
- Build `valkey-unit-tests` executable
- Standard CMake variables are supported. e.g. to install `valkey` under
`/home/you/root` pass `-DCMAKE_INSTALL_PREFIX=/home/you/root`

Why use `CMake`? To list *some* of the advantages of using `CMake`:

- Superior IDE integration: cmake generates the file
`compile_commands.json`, which `clangd` needs to provide
compiler-accurate code completion (in other words: your VSCode will
thank you)
- Out-of-source build tree: with the current build system, object files
are created all over the place, polluting the source tree; the best
practice is to build the project in a separate folder
- Multiple build types can co-exist: with the current build system it
is often hard to have multiple build configurations; with cmake it is
easy
- It is the de-facto standard for C/C++ projects these days

More build examples: 

ASAN build:

```bash
mkdir build-asan
cd $_
cmake .. -DBUILD_SANITIZER=address -DBUILD_MALLOC=libc
make -j$(nproc)
```

ASAN with jemalloc:

```bash
mkdir build-asan-jemalloc
cd $_
cmake .. -DBUILD_SANITIZER=address -DBUILD_MALLOC=jemalloc 
make -j$(nproc)
```

As the previous examples show, any combination is allowed, and the
builds co-exist in the same source tree.

## Valkey installation

With the new `CMake` build, it is possible to install the binaries by
running `make install`, or to create a package with `make package`
(currently supported on Debian-like distros)

### Example 1: build & install using `make install`:

```bash
mkdir build-release
cd $_
cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/valkey-install -DCMAKE_BUILD_TYPE=Release
make -j$(nproc) install
# valkey is now installed under $HOME/valkey-install
```

### Example 2: create a `.deb` installer:

```bash
mkdir build-release
cd $_
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc) package
# ... CPack deb generation output
sudo gdebi -n ./valkey_8.1.0_amd64.deb
# valkey is now installed under /opt/valkey
```

### Example 3: create an installer for non-Debian systems (e.g. FreeBSD
or macOS):

```bash
mkdir build-release
cd $_
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc) package
mkdir -p /opt/valkey && ./valkey-8.1.0-Darwin.sh --prefix=/opt/valkey  --exclude-subdir
# valkey-server is now installed under /opt/valkey

```

Signed-off-by: Eran Ifrah <eifrah@amazon.com>
2024-11-07 18:01:37 -08:00