Severity: HIGH

LLM Data Leakage in CockroachDB

How LLM Data Leakage Manifests in CockroachDB

LLM data leakage in CockroachDB environments typically occurs when LLM endpoints are integrated with database operations without proper isolation. A common pattern involves using LLM responses to construct SQL queries, or using database credentials to authenticate LLM API calls. This creates attack vectors in which malicious prompts can extract database credentials, table schemas, or even manipulate database transactions.
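The snippet below is a minimal sketch of that first pattern: model-generated SQL executed verbatim against CockroachDB. Here llmClient and its SendMessage method stand in for whatever LLM client the application actually uses; the question-to-SQL flow is illustrative rather than a specific library's API.

// Anti-pattern sketch: executing model-generated SQL directly against CockroachDB.
// llmClient.SendMessage is a hypothetical stand-in for the application's LLM client.
func answerWithSQL(ctx context.Context, db *pgxpool.Pool, userQuestion string) (string, error) {
    // The model is asked to translate a user question into SQL.
    generatedSQL, err := llmClient.SendMessage(ctx, "Write a SQL query for: "+userQuestion)
    if err != nil {
        return "", err
    }

    // Running the generated text verbatim means a prompt-injected question can
    // read any table the connection's role can see, or even modify data.
    var result string
    if err := db.QueryRow(ctx, generatedSQL).Scan(&result); err != nil {
        return "", err
    }
    return result, nil
}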

One specific manifestation involves the client connection libraries used with CockroachDB, such as pgx connection pools. When LLM endpoints receive system prompts containing database credentials, those credentials can be logged in plaintext or exposed through error messages. For example, a prompt like "Connect to my CockroachDB instance at localhost:26257 with username admin and password secret123" could be captured in logs or returned in error responses.
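A minimal sketch of how this happens in application code, assuming the same hypothetical llmClient and the standard library logger: the connection string travels inside the prompt, so any log line or error path that echoes the prompt also exposes the credentials.

// Anti-pattern sketch: a CockroachDB DSN embedded in the prompt ends up in logs.
// llmClient.SendMessage is a hypothetical stand-in for the application's LLM client.
func askWithDSN(ctx context.Context, dsn, question string) (string, error) {
    // The DSN (e.g. "postgresql://admin:secret123@localhost:26257/mydb") is
    // pasted into the system prompt so the model can "help" with the database.
    prompt := "You are a database assistant for " + dsn + ". " + question

    resp, err := llmClient.SendMessage(ctx, prompt)
    if err != nil {
        // Logging the full prompt on failure writes the credentials to plaintext logs.
        log.Printf("LLM call failed for prompt %q: %v", prompt, err)
        return "", err
    }
    return resp, nil
}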

Another pattern involves CockroachDB's distributed transaction IDs. LLM endpoints that process database transaction metadata might inadvertently expose CockroachDB's internal transaction IDs through their responses. These IDs can be used to infer database activity patterns, identify active sessions, or even reconstruct query patterns when combined with other leaked information.

CockroachDB's JSONB data type presents another vulnerability. When LLM endpoints process JSONB responses containing sensitive data, they might return the entire JSONB object without proper filtering. This is particularly problematic when the JSONB contains nested structures with credentials, API keys, or personally identifiable information that the LLM wasn't designed to handle.

// Vulnerable pattern - LLM processing a CockroachDB JSONB response.
// The customers.data JSONB column is shaped like:
//   {"id": "...", "name": "...", "credit_card": {"number": "...", "cvv": "..."}}
// llmClient is assumed to be the application's LLM API client.
func processCustomerData(ctx context.Context, db *pgxpool.Pool) (string, error) {
    // Fetch the raw JSONB document, including the nested credit card fields.
    var jsonData string
    row := db.QueryRow(ctx, "SELECT data FROM customers WHERE id = $1", "12345")
    if err := row.Scan(&jsonData); err != nil {
        return "", err
    }
    
    // Pass the raw JSON to the LLM without sanitization or field filtering.
    response, err := llmClient.SendMessage(ctx, jsonData)
    return response, err
}

The above code demonstrates how CockroachDB JSONB responses containing sensitive credit card information can be passed directly to LLM endpoints without sanitization, leading to potential data leakage through the LLM's responses or logs.

CockroachDB-Specific Detection

Detecting LLM data leakage in CockroachDB requires examining both the database layer and the LLM integration points. Start by scanning your CockroachDB logs for patterns that match LLM API endpoints, prompt formats, or system prompt indicators. CockroachDB's logging system can be configured to capture query execution details, which may reveal when LLM-related queries are executed.
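As a starting point, the sketch below scans a CockroachDB log file for LLM-related keywords using Go's standard library; the file path and keyword list are placeholders to adapt to your deployment and log sink configuration.

// Sketch: scan a CockroachDB log file for LLM-related keywords.
// The path and keyword list are illustrative; adjust both for your deployment.
func findLLMLogLines(logPath string) ([]string, error) {
    f, err := os.Open(logPath)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    // Keywords that commonly indicate LLM traffic or prompt material in queries.
    pattern := regexp.MustCompile(`(?i)(openai|anthropic|llm|system prompt|prompt)`)

    var hits []string
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        if line := scanner.Text(); pattern.MatchString(line) {
            hits = append(hits, line)
        }
    }
    return hits, scanner.Err()
}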

Use CockroachDB's built-in statement statistics to identify queries that process sensitive data. The exact columns of crdb_internal.statement_statistics vary by CockroachDB version; the following query assumes the layout in which the statement text lives in the metadata JSONB column, and surfaces recent statements that might involve LLM processing:

SELECT 
    metadata ->> 'query' AS query,
    app_name,
    aggregated_ts
FROM 
    crdb_internal.statement_statistics 
WHERE 
    metadata ->> 'query' ILIKE '%llm%' OR 
    metadata ->> 'query' ILIKE '%prompt%' OR 
    metadata ->> 'query' ILIKE '%openai%' OR 
    metadata ->> 'query' ILIKE '%anthropic%'
ORDER BY 
    aggregated_ts DESC
LIMIT 50;

This query surfaces statements containing LLM-related keywords. Review their execution details in the statistics JSONB column and flag statements with unusually long service latencies or those that process large JSONB objects, as these often indicate LLM integration points.

For automated detection, middleBrick's LLM/AI Security scanning can identify CockroachDB-specific vulnerabilities. The scanner tests for system prompt leakage using CockroachDB-specific regex patterns that detect database connection strings, SQL statements, and CockroachDB-specific error messages within LLM responses.
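middleBrick's actual detection rules aren't reproduced here; the sketch below illustrates the general approach with a few hand-written regexes for connection strings, SQL fragments, and CockroachDB-style error text appearing in an LLM response.

// Sketch: illustrative regexes for spotting CockroachDB artifacts in LLM output.
// These patterns are examples, not middleBrick's actual detection rules.
var leakagePatterns = []*regexp.Regexp{
    // postgres-wire connection strings, including the default CockroachDB port 26257
    regexp.MustCompile(`(?i)postgres(ql)?://[^\s"']+:[^\s"']+@[^\s"']+:26257[^\s"']*`),
    // raw SQL statements echoed back by the model
    regexp.MustCompile(`(?i)\bSELECT\b.+\bFROM\b`),
    // error text carrying a SQLSTATE code, as CockroachDB clients report it
    regexp.MustCompile(`(?i)ERROR:.*SQLSTATE\s+\w{5}`),
}

// containsDBLeakage reports whether an LLM response matches any leakage pattern.
func containsDBLeakage(response string) bool {
    for _, p := range leakagePatterns {
        if p.MatchString(response) {
            return true
        }
    }
    return false
}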

middleBrick also performs active prompt injection testing against LLM endpoints that might interact with CockroachDB. The scanner attempts to extract system prompts containing database credentials, tests for SQL injection through LLM responses, and verifies that CockroachDB transaction IDs are not exposed through LLM outputs.

Network-level detection is crucial for CockroachDB deployments. Monitor network traffic between your LLM endpoints and CockroachDB nodes for unusual patterns. CockroachDB uses port 26257 by default for SQL client connections, so any LLM-related traffic on this port should be investigated. Look for patterns like:

# Monitor LLM-related traffic to CockroachDB (payload matching only works for
# unencrypted traffic; TLS-encrypted SQL connections will not match)
tcpdump -i eth0 -l -A port 26257 | grep -i -E "(openai|anthropic|llm|prompt)"

This helps identify when LLM services are communicating directly with CockroachDB, which might indicate improper architectural separation.

CockroachDB-Specific Remediation

Remediating LLM data leakage in CockroachDB environments requires architectural changes and careful data handling. The first principle is strict separation between LLM processing and database operations: never pass raw database responses directly to LLM endpoints. Instead, implement data sanitization layers that strip sensitive information before LLM processing.

// Secure pattern - sanitize data before LLM processing.
// llmClient is assumed to be the application's LLM API client, as above.
func processCustomerDataSecure(ctx context.Context, db *pgxpool.Pool) (string, error) {
    // Fetch only non-sensitive fields
    var customer struct {
        ID   string `json:"id"`
        Name string `json:"name"`
    }
    
    row := db.QueryRow(ctx, "SELECT id, name FROM customers WHERE id = $1", "12345")
    err := row.Scan(&customer.ID, &customer.Name)
    if err != nil {
        return "", err
    }
    
    // Convert to safe JSON without sensitive fields
    jsonData, err := json.Marshal(customer)
    if err != nil {
        return "", err
    }
    
    // Process sanitized data
    response, err := llmClient.SendMessage(ctx, string(jsonData))
    return response, err
}

This approach ensures that sensitive fields like credit card numbers, SSNs, or internal IDs never reach the LLM endpoint. The sanitization layer acts as a security boundary between your database and LLM services.

Implement CockroachDB's row-level security (RLS), available in newer CockroachDB releases with PostgreSQL-compatible syntax, to restrict which rows the LLM service role can read. Because RLS filters rows rather than columns, pair it with views that hide sensitive columns. Together these limit what LLM services can access even if they somehow bypass application-layer controls:

-- Enable RLS and restrict which rows the LLM service role can read
ALTER TABLE customers ENABLE ROW LEVEL SECURITY;

CREATE POLICY llm_access_policy ON customers
FOR SELECT
TO llm_service
USING (sensitive_data IS NULL);

-- Alternative: use a view to expose only non-sensitive columns
CREATE VIEW customer_safe AS
SELECT id, name, email FROM customers;

-- Grant the LLM service access only to the safe view
GRANT SELECT ON customer_safe TO llm_service;

These policies and views ensure that even if LLM services hold database credentials, they can only read filtered rows and sanitized columns.

Configure CockroachDB's logging to redact sensitive information from LLM-related queries. CockroachDB's log configuration (supplied to cockroach start via --log-config-file) supports per-sink redact and redactable options that keep credentials out of log files. A minimal sketch, with keys that may vary by version:

# CockroachDB log configuration (example; exact keys may vary by version)
file-defaults:
  redact: true        # strip sensitive values from log entries
  redactable: true    # keep redaction markers so logs can be filtered later
sinks:
  file-groups:
    sql-audit:
      channels: [SENSITIVE_ACCESS, SQL_EXEC]

Implement network segmentation between LLM services and CockroachDB nodes. Use firewall rules to restrict which services can communicate with your database:

# Firewall rules to isolate LLM services (replace the example CIDRs with your subnets)
iptables -A INPUT -p tcp --dport 26257 -s 10.0.20.0/24 -j DROP    # LLM service network
iptables -A INPUT -p tcp --dport 26257 -s 10.0.10.0/24 -j ACCEPT  # trusted application network

This prevents direct LLM-to-database communication, forcing all data through your application's sanitization layer.

For LLM services that must interact with CockroachDB, implement API gateways that enforce data sanitization policies. The gateway should strip any database credentials, transaction IDs, or sensitive metadata from requests and responses.
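A minimal sketch of that gateway step, using a hypothetical redactDBArtifacts helper: it strips connection strings and UUID-shaped transaction IDs from a response body before the gateway forwards it. The patterns are illustrative only.

// Sketch: redact CockroachDB connection strings and UUID-shaped transaction IDs
// from text passing through an API gateway. The patterns are illustrative only.
var (
    dsnPattern  = regexp.MustCompile(`(?i)postgres(ql)?://[^\s"']+`)
    uuidPattern = regexp.MustCompile(`\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b`)
)

// redactDBArtifacts replaces connection strings and transaction-ID-shaped UUIDs
// with placeholders before the body leaves the gateway.
func redactDBArtifacts(body string) string {
    body = dsnPattern.ReplaceAllString(body, "[REDACTED_DSN]")
    body = uuidPattern.ReplaceAllString(body, "[REDACTED_ID]")
    return body
}

The same redaction should be applied to request bodies and to anything the gateway itself logs.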

Finally, use CockroachDB's table-level audit logging to monitor LLM-related database access. Enabling auditing on a table writes an entry to the SENSITIVE_ACCESS log channel for every read or write against it, which you can then filter for the LLM service account:

-- Enable audit logging for all reads and writes on the customers table;
-- audit events are emitted on the SENSITIVE_ACCESS log channel
ALTER TABLE customers EXPERIMENTAL_AUDIT SET READ WRITE;

-- Turn auditing off again when it is no longer needed
-- ALTER TABLE customers EXPERIMENTAL_AUDIT SET OFF;

# Monitor for unusual patterns: filter recent SENSITIVE_ACCESS entries for the
# LLM service account (with default file sinks this is cockroach-sql-audit.log;
# adjust the path for your log configuration)
grep 'llm_service' cockroach-sql-audit.log | tail -n 100

This provides visibility into how LLM services interact with your database and helps detect any data leakage attempts.

Related CWEs: LLM Security

CWE ID    Name                                                    Severity
CWE-754   Improper Check for Unusual or Exceptional Conditions    MEDIUM

Frequently Asked Questions

How can I test my CockroachDB + LLM integration for data leakage vulnerabilities?
Use middleBrick's LLM/AI Security scanning to test your endpoints. The scanner actively probes for system prompt leakage, attempts prompt injection attacks, and checks for exposed database credentials in LLM responses. You can scan any API endpoint without credentials or setup: just provide the URL and middleBrick will test for CockroachDB-specific patterns like transaction ID exposure and JSONB data leakage.
What's the difference between LLM data leakage and regular API data exposure?
LLM data leakage involves the unique characteristics of language model processing, where system prompts, training data, or model responses can inadvertently expose sensitive information. Unlike regular API exposure, where data is returned directly, LLM leakage can occur through model logs, error messages, or even the model's internal representations. CockroachDB-specific LLM leakage often involves transaction IDs, connection strings, or database schemas being exposed through LLM processing pipelines.