LLM Data Leakage in CockroachDB
How LLM Data Leakage Manifests in CockroachDB
LLM data leakage in CockroachDB environments typically occurs when LLM endpoints are integrated with database operations without proper isolation. A common pattern involves using LLM responses to construct SQL queries or using database credentials to authenticate LLM API calls. This creates attack vectors where malicious prompts can extract database credentials, table schemas, or even manipulate database transactions.
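The safe counterpart to the first pattern is to treat LLM output as untrusted input: validate it against a strict allowlist and only ever bind it as a query parameter, never splice it into SQL text. A minimal sketch, assuming the model is expected to return a bare numeric customer ID (the function name and regex are illustrative, not from any library):

```go
package main

import (
	"fmt"
	"regexp"
)

// llmOutputID matches only the narrow shape we expect from the model:
// a plain numeric customer ID. Anything else is rejected outright.
var llmOutputID = regexp.MustCompile(`^[0-9]{1,20}$`)

// safeCustomerID validates LLM output before it is allowed anywhere near
// a query. The query itself must still be parameterized, e.g.:
//   db.QueryRow(ctx, "SELECT name FROM customers WHERE id = $1", id)
func safeCustomerID(llmOutput string) (string, error) {
	if !llmOutputID.MatchString(llmOutput) {
		return "", fmt.Errorf("rejected LLM output %q: not a bare customer ID", llmOutput)
	}
	return llmOutput, nil
}

func main() {
	if id, err := safeCustomerID("12345"); err == nil {
		fmt.Println("accepted:", id)
	}
	if _, err := safeCustomerID("1; DROP TABLE customers"); err != nil {
		fmt.Println("rejected injection attempt")
	}
}
```

Even when validation passes, the parameterized form keeps the database driver, not string concatenation, in charge of quoting.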
One specific manifestation involves CockroachDB's connection pooling libraries. When LLM endpoints receive system prompts containing database credentials, these credentials can be logged in plaintext or exposed through error messages. For example, a prompt like "Connect to my CockroachDB instance at localhost:26257 with username admin and password secret123" could be captured in logs or returned in error responses.
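One mitigation for this logging path is to scrub credential-shaped substrings before a prompt is persisted anywhere. A minimal sketch; the pattern list here is an assumption for illustration and should be extended for your environment:

```go
package main

import (
	"fmt"
	"regexp"
)

// Patterns for secrets that commonly appear in prompts: "password <value>"
// phrases and postgres-style connection strings (CockroachDB uses the same
// DSN format). This list is a starting assumption, not a complete policy.
var secretPatterns = []*regexp.Regexp{
	regexp.MustCompile(`(?i)password\s+\S+`),
	regexp.MustCompile(`postgres(?:ql)?://\S+`),
}

// redactPrompt replaces credential-shaped substrings so the prompt can be
// logged or echoed back safely.
func redactPrompt(prompt string) string {
	for _, p := range secretPatterns {
		prompt = p.ReplaceAllString(prompt, "[REDACTED]")
	}
	return prompt
}

func main() {
	raw := "Connect to my CockroachDB instance at localhost:26257 with username admin and password secret123"
	fmt.Println(redactPrompt(raw))
}
```

Applying this at the logging boundary means a credential that slips into a prompt never reaches disk in plaintext.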
Another pattern involves CockroachDB's distributed transaction IDs. LLM endpoints that process database transaction metadata might inadvertently expose CockroachDB's internal transaction IDs (UUIDs) through their responses. These IDs can be used to infer database activity patterns, identify active sessions, or, combined with other leaked information, reconstruct query patterns.
CockroachDB's JSONB data type presents another vulnerability. When LLM endpoints process JSONB responses containing sensitive data, they might return the entire JSONB object without proper filtering. This is particularly problematic when the JSONB contains nested structures with credentials, API keys, or personally identifiable information that the LLM wasn't designed to handle.
// Vulnerable pattern - LLM processing CockroachDB JSONB response.
// The customers.data JSONB column contains nested credit_card fields
// (number, cvv) alongside the public id and name.
func processCustomerData(ctx context.Context, db *pgxpool.Pool) (string, error) {
    row := db.QueryRow(ctx, "SELECT data FROM customers WHERE id = $1", "12345")
    var jsonData string
    if err := row.Scan(&jsonData); err != nil {
        return "", err
    }
    // Pass raw JSONB to the LLM without sanitization: the credit card
    // number and CVV travel into the prompt and the provider's logs.
    response, err := llmClient.SendMessage(ctx, jsonData)
    return response, err
}
The above code demonstrates how CockroachDB's JSONB responses containing sensitive credit card information can be passed directly to LLM endpoints without sanitization, leading to potential data leakage through the LLM's responses or logs.
CockroachDB-Specific Detection
Detecting LLM data leakage in CockroachDB requires examining both the database layer and LLM integration points. Start by scanning your CockroachDB logs for patterns that match LLM API endpoints, prompt formats, or system prompt indicators. CockroachDB's logging system can be configured to capture query execution details, which may reveal when LLM-related queries are executed.
Use CockroachDB's built-in statement statistics to identify queries that process sensitive data. The crdb_internal.statement_statistics table stores the statement text in its metadata JSONB column (column names vary across CockroachDB versions; this form targets v21.2+). The following query helps identify recent statements that might involve LLM processing:
SELECT
    metadata ->> 'query' AS query,
    aggregated_ts,
    app_name
FROM
    crdb_internal.statement_statistics
WHERE
    metadata ->> 'query' ILIKE '%llm%' OR
    metadata ->> 'query' ILIKE '%prompt%' OR
    metadata ->> 'query' ILIKE '%openai%' OR
    metadata ->> 'query' ILIKE '%anthropic%'
ORDER BY
    aggregated_ts DESC
LIMIT 50;
This query identifies statements containing LLM-related keywords and their execution patterns. Look for queries with unusually long execution times or those that process large JSONB objects, as these often indicate LLM integration points.
For automated detection, middleBrick's LLM/AI Security scanning can identify CockroachDB-specific vulnerabilities. The scanner tests for system prompt leakage patterns using CockroachDB-specific regex patterns that detect database connection strings, SQL statements, and CockroachDB-specific error messages within LLM responses.
middleBrick also performs active prompt injection testing against LLM endpoints that might interact with CockroachDB. The scanner attempts to extract system prompts containing database credentials, tests for SQL injection through LLM responses, and verifies that CockroachDB transaction IDs are not exposed through LLM outputs.
Network-level detection is crucial for CockroachDB deployments. Monitor network traffic between your LLM endpoints and CockroachDB nodes for unusual patterns. CockroachDB uses port 26257 for SQL communication, so any LLM-related traffic on this port should be investigated. Look for patterns like:
# Monitor LLM-related traffic to CockroachDB (-A prints packet payloads;
# payload matching only works when the connection is not TLS-encrypted)
tcpdump -i eth0 -A port 26257 | grep -i -E "(openai|anthropic|llm|prompt)"
This helps identify when LLM services are directly communicating with CockroachDB, which might indicate improper architectural separation.
CockroachDB-Specific Remediation
Remediating LLM data leakage in CockroachDB environments requires architectural changes and careful data handling. The first principle is strict separation between LLM processing and database operations. Never pass raw database responses directly to LLM endpoints. Instead, implement data sanitization layers that strip sensitive information before LLM processing.
// Secure pattern - sanitize data before LLM processing
func processCustomerDataSecure(ctx context.Context, db *pgxpool.Pool) (string, error) {
    // Fetch only non-sensitive fields
    var customer struct {
        ID   string `json:"id"`
        Name string `json:"name"`
    }
    row := db.QueryRow(ctx, "SELECT id, name FROM customers WHERE id = $1", "12345")
    if err := row.Scan(&customer.ID, &customer.Name); err != nil {
        return "", err
    }
    // Convert to safe JSON without sensitive fields
    jsonData, err := json.Marshal(customer)
    if err != nil {
        return "", err
    }
    // Process only the sanitized payload
    response, err := llmClient.SendMessage(ctx, string(jsonData))
    return response, err
}
This approach ensures that sensitive fields like credit card numbers, SSNs, or internal IDs never reach the LLM endpoint. The sanitization layer acts as a security boundary between your database and LLM services.
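When the JSONB shape is not known up front, a complementary approach is a recursive scrubber that drops denylisted keys at any nesting depth, so newly added nested fields are stripped by default. A sketch, assuming a hypothetical key denylist that you would derive from your own schema:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Field names treated as sensitive. This denylist is an assumption for
// the sketch; derive yours from your schema and compliance rules.
var sensitiveKeys = map[string]bool{
	"credit_card": true,
	"cvv":         true,
	"ssn":         true,
	"password":    true,
}

// scrub walks a decoded JSONB value and removes sensitive keys at any
// nesting depth, recursing through objects and arrays.
func scrub(v interface{}) interface{} {
	switch t := v.(type) {
	case map[string]interface{}:
		out := make(map[string]interface{}, len(t))
		for k, val := range t {
			if sensitiveKeys[k] {
				continue
			}
			out[k] = scrub(val)
		}
		return out
	case []interface{}:
		out := make([]interface{}, len(t))
		for i, val := range t {
			out[i] = scrub(val)
		}
		return out
	default:
		return v
	}
}

// sanitizeJSONB scrubs a raw JSONB document fetched from CockroachDB
// before it is handed to an LLM client.
func sanitizeJSONB(raw string) (string, error) {
	var doc interface{}
	if err := json.Unmarshal([]byte(raw), &doc); err != nil {
		return "", err
	}
	clean, err := json.Marshal(scrub(doc))
	return string(clean), err
}

func main() {
	raw := `{"id":"12345","name":"Alice","credit_card":{"number":"4111","cvv":"123"}}`
	clean, _ := sanitizeJSONB(raw)
	fmt.Println(clean) // {"id":"12345","name":"Alice"}
}
```

A denylist fails open for unanticipated field names, so an allowlist of known-safe keys is the stricter variant where your schema permits it.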
Implement CockroachDB's row-level security (RLS, available in newer versions and following PostgreSQL's CREATE POLICY syntax) to restrict data access at the database level. Note that RLS filters rows, not columns; to hide sensitive columns from the LLM service, use views:
-- Enable RLS and restrict which rows the LLM service can read
ALTER TABLE customers ENABLE ROW LEVEL SECURITY;
CREATE POLICY llm_access_policy ON customers
    FOR SELECT
    TO llm_service
    USING (sensitive_data IS NULL);
-- RLS filters rows, not columns: use views to expose only safe columns
CREATE VIEW customer_safe AS
    SELECT id, name, email FROM customers;
-- Grant the LLM service access only to the safe view
GRANT SELECT ON customer_safe TO llm_service;
These policies ensure that even if LLM services have database credentials, they can only access sanitized data views.
Configure CockroachDB's logging to redact sensitive information from LLM-related queries. CockroachDB's logging configuration (supplied via the --log or --log-config-file flag) supports redaction at the sink level:
# CockroachDB logging configuration with redaction enabled
file-defaults:
  redact: true       # strip sensitive spans from emitted log entries
  redactable: true   # keep redaction markers for later processing
sinks:
  file-groups:
    default:
      channels: all
Implement network segmentation between LLM services and Cockroachdb nodes. Use firewall rules to restrict which services can communicate with your database:
# Firewall rules: default-deny on the SQL port, allow only the trusted
# application network (placeholder CIDR names; substitute your own)
iptables -A INPUT -p tcp --dport 26257 -s trusted_application_network/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 26257 -j DROP
This prevents direct LLM-to-database communication, forcing all data through your application's sanitization layer.
For LLM services that must interact with CockroachDB, implement API gateways that enforce data sanitization policies. The gateway should strip any database credentials, transaction IDs, or sensitive metadata from requests and responses.
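Such a gateway filter can be sketched as a response scrubber. The patterns below (postgres-style DSNs and UUID-shaped strings, since CockroachDB transaction IDs are UUIDs) are a starting assumption, not a complete policy:

```go
package main

import (
	"fmt"
	"regexp"
)

// Patterns the gateway strips from every LLM response before it leaves
// the trust boundary: connection strings and UUIDs (CockroachDB
// transaction IDs are UUIDs). Extend this list for your deployment.
var responseScrubbers = []*regexp.Regexp{
	regexp.MustCompile(`postgres(?:ql)?://\S+`),
	regexp.MustCompile(`[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}`),
}

// scrubLLMResponse is the gateway-side filter applied to LLM output
// before it is returned to callers or written to logs.
func scrubLLMResponse(body string) string {
	for _, p := range responseScrubbers {
		body = p.ReplaceAllString(body, "[FILTERED]")
	}
	return body
}

func main() {
	leaky := "txn 63f80cba-1f6f-4486-a2b3-3b4a42c4b42d failed connecting to postgresql://admin:secret@db:26257/app"
	fmt.Println(scrubLLMResponse(leaky))
}
```

Placing this filter in the gateway rather than each application keeps the policy enforced even when a new LLM integration forgets its own sanitization.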
Finally, use CockroachDB's table-level audit logging to monitor LLM-related database access. Audit events are written to the SENSITIVE_ACCESS log channel rather than to a queryable table:
-- Enable audit logging of all reads and writes on the customers table
ALTER TABLE customers EXPERIMENTAL_AUDIT SET READ WRITE;
Each audit entry records the user, the statement, and a timestamp. Filter the audit log for the LLM service account to spot unusual access patterns:
# Review audit entries generated by the LLM service account
# (log file name may vary with your logging configuration)
grep llm_service cockroach-sql-audit.log
This provides visibility into how LLM services interact with your database and helps detect any data leakage attempts.
Related CWEs
| CWE ID | Name | Severity |
|---|---|---|
| CWE-754 | Improper Check for Unusual or Exceptional Conditions | MEDIUM |