LLM Data Leakage in Flask with CockroachDB
LLM Data Leakage in Flask with CockroachDB — how this specific combination creates or exposes the vulnerability
When an LLM-enabled Flask application uses CockroachDB as its primary datastore, data leakage risks arise from how prompts, model responses, and database content intersect. The LLM/AI Security checks in middleBrick specifically look for system prompt leakage, exposure of sensitive data in model output, and unsafe consumption patterns. If your Flask routes feed database records directly into LLM input or return LLM-generated text without filtering, findings such as PII or API keys in model output may appear.
In a typical Flask + CockroachDB setup, developers may construct prompts by concatenating user input with record fields (e.g., account numbers or emails) and sending the result to an LLM endpoint. If the LLM response is returned raw to the client, sensitive information stored in CockroachDB can be reflected in the model output, leading to data exposure. Additionally, if the application uses an unauthenticated LLM endpoint or incorporates user-influenced system prompts, middleBrick’s LLM security probes can detect system prompt leakage and test for prompt injection that may expose more data than intended.
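To make the anti-pattern concrete, here is a minimal sketch; `fetch_account` and `call_llm` are hypothetical placeholders, not real APIs:

```python
from flask import Flask, request

app = Flask(__name__)

# ANTI-PATTERN: sensitive CockroachDB fields concatenated into a prompt,
# with the raw model output returned to the client.
@app.route('/api/summary/<int:account_id>')
def risky_summary(account_id):
    row = fetch_account(account_id)       # hypothetical helper returning a full DB row
    question = request.args.get('q', '')  # user-controlled text mixed into the prompt
    prompt = (f"Summarize account {row['account_number']} "
              f"({row['email']}): {question}")
    return call_llm(prompt)               # hypothetical LLM call; response is not filtered
```

Here both stored PII (account number, email) and attacker-controlled text reach the model, and nothing inspects the response on the way back out.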
Flask routes that dynamically build SQL queries using string formatting increase the risk of insecure data handling feeding into LLM interactions. CockroachDB’s SQL compatibility means typical ORM or query patterns may inadvertently pass sensitive columns into LLM prompts: including a user’s full profile or transaction history in a prompt, for example, can lead the model to regurgitate that data. The combination of Flask’s lightweight request handling, CockroachDB’s distributed SQL rows, and LLM endpoints that echo or rephrase content creates a chain in which sensitive database content can appear in model responses without proper output scanning or input validation.
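The contrast is easiest to see side by side. Assuming a psycopg2 cursor `cur` as in the full example below (column names are illustrative):

```python
# Risky: string formatting invites SQL injection, and SELECT * pulls every
# column (email, ssn, ...) into memory where it can reach a prompt.
cur.execute(f"SELECT * FROM users WHERE id = '{user_id}'")

# Safer: parameterized query that fetches only the fields the prompt needs.
cur.execute("SELECT display_name, country FROM users WHERE id = %s", (user_id,))
```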
middleBrick’s scan tests for these conditions by analyzing the unauthenticated attack surface and, when an OpenAPI spec is available, cross-referencing endpoint definitions with observed runtime behavior. If your API spec documents an endpoint that accepts user-controlled data and forwards it to an LLM, and runtime probes show that database-derived fields are included in prompts, findings will highlight insecure consumption and potential data exposure. The LLM/AI Security section also checks for excessive-agency patterns and scans output for PII and API keys, which is relevant when CockroachDB rows contain such data.
CockroachDB-Specific Remediation in Flask — concrete code fixes
To reduce LLM data leakage risk with CockroachDB in Flask, control what data reaches the LLM and inspect outputs before returning them. Use parameterized queries, explicitly select only the fields you need so sensitive columns never enter application memory, and apply output filters to model-generated text before it leaves your API. The following example assumes a CockroachDB cluster and a Flask application; adapt the connection settings to your environment.
```python
import psycopg2
from flask import Flask, jsonify

app = Flask(__name__)


def get_db_connection():
    # Example connection for CockroachDB (PostgreSQL wire protocol, port 26257);
    # load credentials from a secrets manager in production.
    return psycopg2.connect(
        dbname='mydb',
        user='app_user',
        password='**',
        host='localhost',
        port='26257',
    )


@app.route('/api/user/<int:user_id>')
def get_user_profile(user_id):
    conn = get_db_connection()
    cur = conn.cursor()
    # Explicitly select only safe fields; avoid SELECT *.
    cur.execute('SELECT id, display_name, country FROM users WHERE id = %s', (user_id,))
    row = cur.fetchone()
    cur.close()
    conn.close()
    if row is None:
        return jsonify({'error': 'not found'}), 404
    user_data = {'id': row[0], 'display_name': row[1], 'country': row[2]}
    # Only pass safe, non-sensitive fields to the LLM prompt.
    prompt = f"Summarize preferences for user {user_data['display_name']} from {user_data['country']}."
    # Here you would call your LLM endpoint; ensure the response is scanned.
    llm_response = call_llm(prompt)
    # Basic output scan: redact potential PII before returning.
    safe_response = redact_pii(llm_response)
    return jsonify({'summary': safe_response})


def redact_pii(text: str) -> str:
    # Simple placeholder; use a robust library or service in production.
    return text.replace('EMAIL', '[REDACTED]').replace('SSN', '[REDACTED]')


def call_llm(prompt: str) -> str:
    # Stub for your LLM integration; add error handling and timeouts.
    return "Sample response"


if __name__ == '__main__':
    app.run(debug=False)
```
The above example avoids exposing sensitive CockroachDB columns (such as email, ssn, or internal IDs) by selecting only display_name and country. It demonstrates explicit column selection and a basic redaction step before returning text that may contain model-generated echoes of stored data. For production, integrate a vetted PII redaction or scanning library and enforce strict input validation on user_id (the <int:user_id> route converter already rejects non-integer values).
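As a small step in that direction, here is a regex-based sketch of redact_pii; the patterns below are illustrative assumptions and not a substitute for a dedicated PII-detection library or service:

```python
import re

# Illustrative patterns only; real PII detection needs broader coverage.
EMAIL_RE = re.compile(r'\b[\w.+-]+@[\w-]+\.[\w.-]+\b')
SSN_RE = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')


def redact_pii(text: str) -> str:
    # Replace anything matching the email or SSN patterns before the
    # model output is returned to the client.
    text = EMAIL_RE.sub('[REDACTED EMAIL]', text)
    return SSN_RE.sub('[REDACTED SSN]', text)
```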
Additionally, secure your LLM endpoints: avoid unauthenticated access and validate prompts before they are sent. If you provide an OpenAPI spec, middleBrick can compare it to runtime behavior and highlight mismatches where database-derived fields enter LLM prompts. Combine these practices with the framework-specific guidance in the middleBrick documentation to manage risk effectively.
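One possible shape for that authentication and prompt validation, as a sketch reusing the `app` object from the example above; the X-API-Key header and LLM_API_KEY config name are assumptions for illustration:

```python
from functools import wraps
from flask import abort, request

MAX_PROMPT_LEN = 2000  # assumed limit; tune for your model and use case


def require_api_key(view):
    # Reject unauthenticated calls before any prompt reaches the LLM.
    # 'X-API-Key' and the LLM_API_KEY config entry are illustrative names.
    @wraps(view)
    def wrapped(*args, **kwargs):
        if request.headers.get('X-API-Key') != app.config.get('LLM_API_KEY'):
            abort(401)
        return view(*args, **kwargs)
    return wrapped


def validate_prompt(prompt: str) -> str:
    # Basic guardrails: bound prompt length and drop non-printable characters.
    if len(prompt) > MAX_PROMPT_LEN:
        raise ValueError('prompt too long')
    return ''.join(ch for ch in prompt if ch.isprintable())
```

Decorating a route with @require_api_key and passing prompts through validate_prompt before call_llm keeps both unauthenticated access and oversized or control-character-laden prompts out of the LLM pipeline.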
Related CWEs (category: llmSecurity)
| CWE ID | Name | Severity |
|---|---|---|
| CWE-754 | Improper Check for Unusual or Exceptional Conditions | MEDIUM |