Denial of Service in Cassandra
How Denial of Service Manifests in Cassandra
Cassandra is designed for high write throughput and linear scalability, but certain query patterns can exhaust resources and lead to a denial‑of‑service (DoS) condition. The most common vectors are:
- Unbounded IN clauses – A query like `SELECT * FROM users WHERE user_id IN (?, ?, …)` with thousands of placeholder values forces the coordinator to fetch many partitions simultaneously, heap‑allocating large result sets and triggering long garbage‑collection pauses.
- Large range scans without paging – A request such as `SELECT * FROM sensor_data WHERE timestamp > ? ALLOW FILTERING` can cause the node to stream millions of rows, filling network buffers and causing back‑pressure that stalls other operations.
- Massive unlogged batches – Submitting a BATCH containing hundreds of INSERT/UPDATE statements makes the coordinator serialize all mutations into a single mutation object, increasing memory usage and potentially exceeding the native transport frame size.
- Tombstone storms – Repeatedly deleting wide rows (e.g., `DELETE FROM logs WHERE day < ?`) creates many tombstones; subsequent reads must scan all tombstones before returning live data, dramatically increasing read latency and CPU usage.
These patterns map to API4:2023 Unrestricted Resource Consumption in the OWASP API Security Top 10 (2023).
The following Java driver snippet shows a dangerous pattern that can trigger a DoS:
// Dangerous: building a huge IN list at runtime
List<Integer> ids = getIdsFromUserInput(); // could be thousands of values
BoundStatement stmt = prepared.bind()
    .setList("ids", ids, Integer.class);
session.execute(stmt); // may overload the coordinator
When the list size exceeds practical limits, the coordinator must allocate a large internal buffer, leading to heap pressure and eventual node stall.
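A cheap first line of defense is to reject oversized lists before they are ever bound to a statement. The sketch below is plain Java with no driver dependency; the class name and the cap of 100 values are illustrative, not part of any driver API.

```java
import java.util.List;

public class InListGuard {
    // Illustrative cap; choose a limit that matches your partition sizes.
    static final int MAX_IN_VALUES = 100;

    // Fail fast in the application instead of overloading the coordinator.
    static void validateInList(List<?> ids) {
        if (ids.size() > MAX_IN_VALUES) {
            throw new IllegalArgumentException(
                "IN list of " + ids.size() + " values exceeds cap of " + MAX_IN_VALUES);
        }
    }
}
```

Calling `validateInList(ids)` before binding turns a cluster-wide availability problem into a single rejected request.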
Cassandra‑Specific Detection
Detecting a DoS‑prone configuration involves both runtime observation and active testing of the exposed API surface.
Runtime indicators:
- Elevated GC pauses (>1 second) visible via `nodetool gcstats` or the JMX `GarbageCollector` metrics.
- Increasing `ReadTimeoutException` rates in application logs, often accompanied by `OverloadedException`.
- High `PendingCompactions` or `FlushWriter` queue lengths, indicating that write pressure is blocking reads.
- Large pending-task counts for the `MutationStage` or `ReadStage` thread pools in `nodetool tpstats`.
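The GC indicator above can also be sampled programmatically. This minimal sketch reads the collector beans of the local JVM via `java.lang.management`; pointing it at a remote Cassandra node would additionally require a JMX connection, which is omitted here.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcPauseCheck {
    // Print per-collector totals; a sustained average pause approaching
    // one second is a strong DoS indicator on a Cassandra node.
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            long totalMs = gc.getCollectionTime();
            double avgMs = count > 0 ? (double) totalMs / count : 0.0;
            System.out.printf("%s: %d collections, %.1f ms avg pause%n",
                    gc.getName(), count, avgMs);
        }
    }
}
```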
Active testing with middleBrick:
middleBrick’s unauthenticated black‑box scan can be pointed at any Cassandra‑native HTTP/gRPC gateway (e.g., DataStax Astra HTTP API, Stargate, or a custom REST wrapper). The scanner attempts:
- Requests with increasingly large IN clause payloads to observe response time growth and error codes.
- Unpaginated range scans with wide row ranges to detect uncontrolled data streaming.
- Large batch submissions to measure coordinator memory usage via side‑channel timing.
Example CLI invocation:
middlebrick scan https://api.example.com/cassandra/v1/keyspace/myks/table/myTbl
The resulting report includes a Denial of Service finding with severity, the specific CQL pattern tested, and remediation guidance (see next section). Because middleBrick works without agents or credentials, it can be run against staging or production endpoints as part of a CI pipeline.
Cassandra‑Specific Remediation
Mitigations focus on limiting the amount of work a single request can force the cluster to perform, and on enabling built‑in throttling mechanisms.
Application‑level fixes:
- Cap the size of IN lists (e.g., max 100 values) and paginate larger sets via multiple queries.
- Always use paging for range scans: set a fetch size (`session.execute(stmt.setFetchSize(1000))`) and iterate over the returned `ResultSet`; the driver fetches subsequent pages transparently.
- Avoid unlogged batches for more than a few statements; use logged batches only when atomicity is required, and keep batch size under the configured `batch_size_fail_threshold_in_kb` (default 50 KB).
- Prefer token‑aware routing so that requests are sent directly to the replica owning the partition, reducing coordinator load.
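The IN‑list cap and paging advice combine naturally: split the user‑supplied IDs into bounded chunks and issue one paged query per chunk. The chunking logic below is plain, runnable Java; the driver calls shown in the comment follow the driver API used earlier in this article and are not executed here.

```java
import java.util.ArrayList;
import java.util.List;

public class InListChunker {
    // Split a large ID list into bounded chunks so each query's
    // IN clause stays under the cap (100 here, an illustrative limit).
    static <T> List<List<T>> chunk(List<T> ids, int maxPerQuery) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += maxPerQuery) {
            chunks.add(new ArrayList<>(ids.subList(i, Math.min(i + maxPerQuery, ids.size()))));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 250; i++) ids.add(i);
        // Each chunk would then be bound and paged, e.g.:
        //   prepared.bind().setList("ids", chunk, Integer.class).setFetchSize(1000)
        List<List<Integer>> chunks = chunk(ids, 100);
        System.out.println(chunks.size());        // 3
        System.out.println(chunks.get(2).size()); // 50
    }
}
```

Issuing several small queries costs a few extra round trips but keeps per-request work on the coordinator bounded and predictable.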
Configuration‑level fixes (cassandra.yaml):
# Rate‑limit native transport requests (available since Cassandra 4.1)
native_transport_rate_limiting_enabled: true
# Cap coordinator-level request throughput; tune to node capacity
native_transport_max_requests_per_second: 100000
# Prevent overly large frames
native_transport_max_frame_size_in_mb: 256
# Limit batch size
batch_size_fail_threshold_in_kb: 50
batch_size_warn_threshold_in_kb: 10
# Throttle compaction to avoid CPU starvation during spikes
compaction_throughput_mb_per_sec: 16
# Control concurrent operations
concurrent_reads: 32
concurrent_writes: 32
Most of these settings take effect only after a node restart (compaction throughput can additionally be adjusted at runtime with nodetool setcompactionthroughput). Verify effectiveness with nodetool netstats (network and streaming activity) and nodetool tpstats (thread‑pool utilization).
Verification:
Rescan the endpoint with middleBrick; the Denial of Service finding should downgrade from high to low or disappear, confirming that the request size limits and rate limiting are active.
Related CWEs
| CWE ID | Name | Severity |
|---|---|---|
| CWE-400 | Uncontrolled Resource Consumption | HIGH |
| CWE-770 | Allocation of Resources Without Limits or Throttling | MEDIUM |
| CWE-799 | Improper Control of Interaction Frequency | MEDIUM |
| CWE-835 | Infinite Loop | HIGH |
| CWE-1050 | Excessive Platform Resource Consumption within a Loop | MEDIUM |