Denial of Service in CockroachDB

How Denial of Service Manifests in CockroachDB

Denial of Service (DoS) attacks against CockroachDB often exploit the database's distributed architecture and consensus mechanisms. A common vector targets the Raft consensus protocol that underpins CockroachDB's replication: every write must be replicated to a quorum of replicas, so flooding the cluster with writes multiplies Raft log traffic and can saturate the network and exhaust CPU across nodes.

A CockroachDB-specific pattern involves contention on hot ranges. Each range has a single leaseholder replica that coordinates its reads and writes, so when many clients concentrate requests on a few ranges, those leaseholders become bottlenecks. Under sustained contention, requests queue up, and retries and lease transfers add further load, which can lead to cascading timeouts and degraded performance.

Another vector targets CockroachDB's MVCC (Multi-Version Concurrency Control) layer. By issuing a high volume of conflicting transactions that update the same keys, an attacker forces the system to retain many historical versions, which consumes disk space and slows garbage collection. This is particularly effective against tables with long GC TTL settings.
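As a rough back-of-the-envelope sketch of that effect, the number of versions retained for a hot key is approximately the update rate multiplied by the GC TTL. The function name and the workload figures below are illustrative assumptions, not measurements from a real cluster, and the model ignores compression and per-version overhead.

```python
def retained_mvcc_bytes(updates_per_sec: float, gc_ttl_seconds: int,
                        bytes_per_version: int) -> int:
    """Approximate bytes of historical versions kept for one hot key.

    Assumes every update writes a new MVCC version and that a version
    becomes collectable only after the GC TTL has elapsed.
    """
    versions = updates_per_sec * gc_ttl_seconds
    return int(versions * bytes_per_version)


# Hypothetical workload: 50 updates/s against one row, 200-byte versions.
short_ttl = retained_mvcc_bytes(50, 600, 200)     # 10-minute TTL
long_ttl = retained_mvcc_bytes(50, 90_000, 200)   # 25-hour TTL
print(short_ttl, long_ttl)
```

Under these assumptions the longer TTL retains 150 times as much history for the same write rate, which is why shortening the TTL on frequently updated tables reduces the attack's leverage.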

Resource exhaustion can also occur through query planning abuse. Complex queries with nested subqueries, especially those involving joins across large datasets, can trigger expensive planner operations. An attacker can craft queries that force the planner to explore exponential execution plan spaces, consuming CPU and memory before execution even begins.

Here's an example of a query that can trigger planner exhaustion:

SELECT * FROM table1
WHERE id IN (
    SELECT id FROM table2
    WHERE id IN (
        SELECT id FROM table3
        WHERE id IN (
            SELECT id FROM table4
            WHERE id IN (
                SELECT id FROM table5
            )
        )
    )
)
LIMIT 1;

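Each nested IN subquery adds another table to the join problem the optimizer must solve, so deeper nesting enlarges the space of candidate execution plans. Whether a given query actually exhausts the planner depends on table statistics and optimizer limits, so treat this as a shape to test rather than a guaranteed exploit.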
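To give a sense of how quickly a plan space can grow, the sketch below counts join orderings in the textbook model, with no pruning. These are generic upper bounds, not what CockroachDB's optimizer actually enumerates; real optimizers prune heavily, and CockroachDB limits join reordering through the reorder_joins_limit session setting. The function names are illustrative.

```python
from math import factorial


def left_deep_orderings(n_tables: int) -> int:
    """Number of left-deep join orders for n tables: n!."""
    return factorial(n_tables)


def bushy_orderings(n_tables: int) -> int:
    """Number of ordered binary join trees over n tables.

    Closed form: (2(n-1))! / (n-1)!, i.e. n! times the (n-1)th
    Catalan number.
    """
    return factorial(2 * (n_tables - 1)) // factorial(n_tables - 1)


for n in (2, 5, 8, 10):
    print(n, left_deep_orderings(n), bushy_orderings(n))
```

For the five-table query above, the model gives 120 left-deep orders and 1,680 bushy trees; the counts grow factorially as more tables are added.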

CockroachDB-Specific Detection

Detecting DoS exposure in CockroachDB requires monitoring specific metrics and patterns. The crdb_internal schema provides insight into cluster health and potential abuse patterns.

Key tables to monitor include:

  • crdb_internal.node_statement_statistics - Tracks query execution statistics per node
  • crdb_internal.node_execution_statistics - Shows execution stage performance
  • crdb_internal.ranges - Reveals range distribution and leaseholder locations
  • crdb_internal.cluster_queries - Provides real-time query execution data

Here's a query to identify potentially abusive patterns:

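-- Column names follow crdb_internal.node_statement_statistics, where
-- latencies are reported in seconds. crdb_internal is not a stable API,
-- so verify the columns against your version before relying on this.
SELECT node_id, application_name, key AS statement, count, service_lat_avg
FROM crdb_internal.node_statement_statistics
WHERE count > 1000
AND service_lat_avg > 1.0
ORDER BY count DESC;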

This identifies queries executed frequently with high execution times, which could indicate DoS attempts.

middleBrick's black-box scanning approach can detect DoS vulnerabilities without requiring database credentials. The scanner tests for resource exhaustion patterns by:

  • Executing queries with exponential complexity to trigger planner exhaustion
  • Testing lease contention by rapidly acquiring locks on hot ranges
  • Monitoring response times for signs of resource exhaustion
  • Checking for proper rate limiting implementation

The scanner also examines CockroachDB's configuration for vulnerable settings, such as overly permissive connection limits or disabled query timeouts.

For continuous monitoring, middleBrick's Pro plan can be configured to periodically scan your CockroachDB endpoints, alerting you when DoS vulnerabilities are detected or when security scores drop below your threshold.

CockroachDB-Specific Remediation

Mitigating DoS attacks in CockroachDB requires a multi-layered approach combining configuration changes, query optimization, and architectural patterns.

First, set a default statement timeout and related limits at the cluster level:

SET CLUSTER SETTING kv.bulk.ingest.max_rate = '100MB';
SET CLUSTER SETTING kv.range_split.by_load_enabled = false;
SET CLUSTER SETTING sql.defaults.statement_timeout = '10s';

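The statement timeout prevents queries from running indefinitely, while the other two settings affect bulk-ingest throughput and load-based range splitting. Disabling load-based splitting stops CockroachDB from automatically splitting hot ranges, so evaluate that setting against your workload, and confirm each setting name on your version with SHOW CLUSTER SETTING before applying it.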

To cap connections, combine connection pooling in the application with a per-role limit in the database:

CREATE USER app_user WITH PASSWORD '' CONNECTION LIMIT 100;

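This prevents a single application role from exhausting connection resources. CONNECTION LIMIT is PostgreSQL role syntax, so verify that your CockroachDB version accepts it; the server.max_connections_per_gateway cluster setting offers a per-node cap as an alternative.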
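On the application side, a pool-level cap might look like the following minimal sketch. ConnectionGate is a hypothetical helper, not part of any driver, and the connect argument stands in for whatever function your driver uses to open a connection. Most production connection pools provide an equivalent maximum-size option, which is preferable to a hand-rolled gate.

```python
import threading
from contextlib import contextmanager


class ConnectionGate:
    """Caps the number of connections an application holds at once."""

    def __init__(self, max_connections: int, acquire_timeout: float = 1.0):
        self._slots = threading.BoundedSemaphore(max_connections)
        self._timeout = acquire_timeout

    @contextmanager
    def connection(self, connect):
        # Fail fast instead of queueing forever when the cap is reached.
        if not self._slots.acquire(timeout=self._timeout):
            raise TimeoutError("connection limit reached")
        try:
            conn = connect()
            try:
                yield conn
            finally:
                # Close the connection if the driver object supports it.
                close = getattr(conn, "close", None)
                if close is not None:
                    close()
        finally:
            # Always free the slot, even if connect() or the caller raised.
            self._slots.release()
```

Callers wrap each unit of work in `with gate.connection(connect) as conn:`. Raising on timeout pushes backpressure to the caller rather than letting requests pile up behind a saturated database.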

Implement query cancellation for long-running operations:

CREATE TABLE monitoring.cancellable_queries (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    -- CockroachDB reports query IDs as strings (see SHOW QUERIES)
    query_id STRING,
    created_at TIMESTAMP DEFAULT now(),
    cancelled_at TIMESTAMP,
    CHECK (cancelled_at IS NULL OR cancelled_at > created_at)
);

This table can track queries that should be cancelled if they exceed thresholds.
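The threshold check itself could be sketched as follows. The helper name cancel_statements is hypothetical, the input rows are assumed to have been fetched separately (for example from SHOW CLUSTER QUERIES), and query IDs are assumed to be lowercase hexadecimal strings. The function only builds CANCEL QUERY statements; executing them is left to the caller.

```python
from datetime import datetime, timedelta, timezone

HEX_DIGITS = set("0123456789abcdef")


def cancel_statements(active_queries, max_age, now=None):
    """Return CANCEL QUERY statements for queries older than max_age.

    active_queries: iterable of (query_id, started_at) pairs, where
    started_at is a timezone-aware datetime.
    """
    now = now or datetime.now(timezone.utc)
    statements = []
    for query_id, started_at in active_queries:
        if now - started_at <= max_age:
            continue
        # Reject anything that is not plain hex rather than
        # interpolating untrusted text into SQL.
        if not query_id or not set(query_id) <= HEX_DIGITS:
            raise ValueError(f"unexpected query id: {query_id!r}")
        statements.append(f"CANCEL QUERY '{query_id}';")
    return statements
```

A scheduled job could run this against the active query list and record each cancellation, together with its timestamp, in the tracking table.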

For range-level protection, consider implementing application-level rate limiting on hot ranges:

CREATE TABLE hot_ranges_lock (
    range_id INT8 PRIMARY KEY,
    lock_acquired_at TIMESTAMP,
    lock_released_at TIMESTAMP,
    CHECK (lock_released_at IS NULL OR lock_released_at > lock_acquired_at)
);

Application code can check this table before accessing hot ranges, implementing backpressure when necessary.
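One way to implement that backpressure in the application is a token bucket per hot spot, as in the minimal sketch below. TokenBucket and its parameters are illustrative assumptions, the key can be any identifier the application uses for a hot range or key prefix, and this in-memory version is neither thread-safe nor shared across processes.

```python
import time


class TokenBucket:
    """Token-bucket limiter keyed by a caller-chosen hot-spot identifier."""

    def __init__(self, rate_per_sec: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock
        self._state = {}  # key -> (tokens, last_refill_time)

    def allow(self, key) -> bool:
        now = self.clock()
        tokens, last = self._state.get(key, (float(self.burst), now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self._state[key] = (tokens - 1, now)
            return True
        self._state[key] = (tokens, now)
        return False
```

When allow returns False, the caller can delay, queue, or reject the request instead of sending it to the database, which keeps a burst against one hot spot from consuming capacity needed by other traffic.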

middleBrick's CLI tool can help validate these protections:

npx middlebrick scan your-cockroachdb-url:26257

The scanner will test whether your DoS mitigations are effective by attempting resource-intensive operations and verifying they're properly constrained.

For production environments, integrate middleBrick into your CI/CD pipeline to ensure DoS protections aren't accidentally removed during deployments:

jobs:
  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run middleBrick scan
        run: npx middlebrick scan ${{ secrets.COCKROACHDB_URL }}
        continue-on-error: true
      - name: Check security score
        run: |
          echo

Related CWEs: resourceConsumption

CWE ID      Name                                        Severity
CWE-400     Uncontrolled Resource Consumption           HIGH
CWE-770     Allocation of Resources Without Limits      MEDIUM
CWE-799     Improper Control of Interaction Frequency   MEDIUM
CWE-835     Infinite Loop                               HIGH
CWE-1050    Excessive Platform Resource Consumption     MEDIUM