Severity: HIGH

LLM Data Leakage in Express with CockroachDB

LLM Data Leakage in Express with CockroachDB: how this specific combination creates or exposes the vulnerability

When an Express application interacts with CockroachDB, the combination of database drivers, ORM/query patterns, and LLM-facing endpoints can inadvertently expose sensitive data through LLM-related channels. This occurs when application logic that queries CockroachDB is invoked by or exposed to LLM tools, and responses or intermediate data contain information that should remain confidential.

In this stack, data leakage can happen in several concrete ways. An Express route that builds dynamic SQL for CockroachDB using string concatenation may embed sensitive values (e.g., PII, internal IDs, or tenant context) into query strings. If that route is called by an LLM tool or surfaced in logs that an LLM can access, those values may be exposed. For example, constructing queries via string interpolation can produce verbose error messages that include raw data, which a prompt-injected LLM or an output scan might capture.
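
As a minimal sketch of the difference (the route paths, table, and column names are illustrative, not taken from any real schema):

const express = require('express');
const { Pool } = require('pg');
const app = express();
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Anti-pattern: interpolating request data into the SQL text. A failed query
// can echo the interpolated value back in a driver error that ends up in logs
// or in a response an LLM tool can read.
app.get('/orders', async (req, res) => {
  const sql = `SELECT * FROM orders WHERE customer_email = '${req.query.email}'`;
  const { rows } = await pool.query(sql); // avoid this
  res.json(rows);
});

// Safer: a placeholder keeps user input out of the SQL text entirely.
app.get('/orders-safe', async (req, res) => {
  const { rows } = await pool.query(
    'SELECT id, status FROM orders WHERE customer_email = $1',
    [req.query.email]
  );
  res.json(rows);
});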

LLM-specific risks amplify when endpoints return data intended for model consumption without proper filtering. Consider an endpoint that retrieves user records from CockroachDB and returns JSON directly to an LLM client. If the response includes fields such as emails, API keys, or internal metadata, and the endpoint is registered as an LLM tool or exposed via an MCP server, the model may inadvertently reveal this data in its outputs. System prompt leakage patterns (27 regexes for ChatML, Llama 2, Mistral, Alpaca) can detect these accidental disclosures in model responses, especially when combined with active prompt injection probes that test for data exfiltration and cost exploitation.
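
A lightweight complement to careful field selection is masking obvious secrets before a payload leaves an LLM-facing route. The helper and regexes below are an illustrative sketch, not middleBrick's scanner:

// Illustrative redaction pass applied to data bound for an LLM tool or MCP server.
const EMAIL_RE = /[\w.+-]+@[\w-]+\.[\w.-]+/g;
const TOKEN_RE = /\b[A-Za-z0-9_-]{32,}\b/g; // crude heuristic for API keys/secrets

function redactForLlm(value) {
  if (typeof value === 'string') {
    return value.replace(EMAIL_RE, '[redacted-email]').replace(TOKEN_RE, '[redacted-token]');
  }
  if (Array.isArray(value)) return value.map(redactForLlm);
  if (value && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, v]) => [key, redactForLlm(v)])
    );
  }
  return value;
}

// Usage in an LLM-facing route: res.json(redactForLlm(rows));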

Another vector involves introspection and inventory management. If an Express app exposes database schema or query metadata to an LLM (for example, to assist in generating SQL), sensitive schema details—table names, column definitions, or constraints in CockroachDB—may be surfaced. Output scanning for PII, API keys, and executable code becomes critical here, as models might return query results or debug traces that contain credentials or personal data. Excessive agency detection also matters: tool_calls or function_call patterns that allow an LLM to invoke multiple database queries can create chains where one overly permissive endpoint exposes related datasets.
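
If schema context must be shared with a model so it can generate SQL, expose only an allowlisted subset of the catalog. A sketch, assuming a pg Pool like the one in the remediation section and a hypothetical allowlist:

// Only these tables are ever described to the model; everything else in the
// CockroachDB catalog (internal tables, constraints, defaults) stays hidden.
const LLM_VISIBLE_TABLES = ['reports', 'public_metrics']; // hypothetical allowlist

async function schemaForLlm(pool) {
  const { rows } = await pool.query(
    `SELECT table_name, column_name, data_type
       FROM information_schema.columns
      WHERE table_schema = 'public' AND table_name = ANY($1)`,
    [LLM_VISIBLE_TABLES]
  );
  return rows; // names and types only, no constraint or index details
}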

Because CockroachDB supports distributed SQL, connection strings and driver configurations can inadvertently carry cluster-level details. If these are logged or echoed in responses reachable by LLMs, an attacker can infer deployment topology or tenant boundaries. This is especially relevant for unauthenticated LLM endpoint detection: if an endpoint is reachable without proper authorization and returns data derived from CockroachDB, the model itself becomes a leakage channel.
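
One simple guard is to never log the raw connection string; log a redacted description instead. The helper below is a sketch using Node's built-in URL parser:

// Keep hosts, ports, credentials, and cluster options out of logs that an LLM
// (or a log-forwarding integration) might later read.
function describeDatabaseUrl(rawUrl) {
  try {
    const url = new URL(rawUrl);
    return `database "${url.pathname.slice(1)}" (host and credentials redacted)`;
  } catch {
    return 'database URL (unparseable, redacted)';
  }
}

console.log('Connecting to', describeDatabaseUrl(process.env.DATABASE_URL));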

middleBrick’s LLM/AI Security checks are designed to surface these risks in this specific context: they run active prompt injection probes against your Express endpoints, scan outputs for PII and API keys, detect system prompt leakage patterns, and flag unauthenticated LLM endpoints that interact with sensitive data sources. Findings map to the OWASP API Security Top 10 and help prioritize remediation for data exposure in LLM-facing paths.

CockroachDB-Specific Remediation in Express: concrete code fixes

To mitigate LLM data leakage when using CockroachDB with Express, focus on strict query parameterization, output filtering, and access controls. Avoid dynamic SQL construction and ensure responses sent to LLM clients do not contain unnecessary or sensitive fields.

Use parameterized queries or an ORM that supports placeholders to prevent injection and reduce data exposure in errors. For example, with the node-postgres (pg) driver, which CockroachDB is wire-compatible with, prefer placeholders over string concatenation in your Express routes:

const express = require('express');
const { Pool } = require('pg');
const app = express();
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

app.get('/users/:id', async (req, res) => {
  try {
    const { rows } = await pool.query('SELECT id, name, email FROM users WHERE id = $1', [req.params.id]);
    if (!rows.length) return res.status(404).json({ error: 'Not found' });
    // Explicitly select only safe fields; avoid returning internal metadata
    res.json(rows[0]);
  } catch (err) {
    // Avoid leaking stack traces or raw query details
    console.error('Query failed:', err.message);
    res.status(500).json({ error: 'Internal error' });
  }
});

app.listen(3000);

For more complex queries, use an ORM that enforces parameterization and schema awareness. If you use an ORM, ensure it does not log raw queries containing sensitive data and that it respects field-level permissions. Always validate and sanitize inputs before they reach the database layer.
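
For instance, with Sequelize (which talks to CockroachDB over its PostgreSQL wire protocol), query logging can be switched off so raw SQL and bound values never reach the logs; treat this as a sketch and consult your ORM's own options:

const { Sequelize } = require('sequelize');

// logging: false prevents Sequelize from printing every executed query,
// which could otherwise expose sensitive literals in log streams.
const sequelize = new Sequelize(process.env.DATABASE_URL, {
  dialect: 'postgres',
  logging: false,
});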

Apply output filtering before sending data to LLM tools. If an endpoint is consumed by an LLM, strip or mask fields that should not be exposed:

app.get('/reports/:reportId', async (req, res) => {
  const { rows } = await pool.query('SELECT id, title, generated_at FROM reports WHERE id = $1', [req.params.reportId]);
  if (!rows.length) return res.status(404).json({ error: 'Report not found' });
  // Remove sensitive fields before LLM consumption
  const safeReport = {
    id: rows[0].id,
    title: rows[0].title,
    generated_at: rows[0].generated_at,
  };
  res.json(safeReport);
});

Ensure that endpoints accessible to LLMs are properly authenticated and rate-limited, even if they are intended for internal use. Use middleware to enforce authorization checks and avoid exposing raw CockroachDB errors to clients:

// isValidToken and the token payload are app-specific stand-ins here; in a real
// app, requireAuth should also attach the verified caller identity (for example,
// a tenantId claim from a validated JWT) so downstream queries stay tenant-scoped.
function requireAuth(req, res, next) {
  const token = req.headers.authorization;
  if (!token || !isValidToken(token)) {
    return res.status(401).json({ error: 'Unauthorized' });
  }
  req.user = { tenantId: getTenantIdFromToken(token) }; // hypothetical helper
  next();
}

app.get('/llm-data', requireAuth, async (req, res) => {
  const { rows } = await pool.query('SELECT id, summary FROM data WHERE tenant_id = $1', [req.user.tenantId]);
  res.json(rows);
});
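
For the rate-limiting half of that advice, a common choice is the express-rate-limit middleware; the window and limit below are placeholder values:

const rateLimit = require('express-rate-limit');

// Throttle LLM-facing routes so an over-eager agent cannot enumerate records
// or drive up query costs with rapid repeated calls.
const llmLimiter = rateLimit({
  windowMs: 60 * 1000, // 1-minute window
  max: 30,             // at most 30 requests per window per client
});

// Apply alongside the authentication middleware shown above
app.use('/llm-data', llmLimiter);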

Finally, integrate middleBrick’s CLI or GitHub Action to scan your Express endpoints regularly. The CLI can be run as part of development workflows:

# Scan from terminal
middlebrick scan https://api.example.com

You can also fail builds via the GitHub Action if a new endpoint introduces high-risk findings that could lead to LLM data leakage.

Related CWEs (category: LLM Security)

CWE ID | Name | Severity
CWE-754 | Improper Check for Unusual or Exceptional Conditions | MEDIUM

Frequently Asked Questions

How can I detect if my Express endpoints are leaking data to LLMs?
Use middleBrick’s LLM/AI Security checks, which include active prompt injection probes, output scanning for PII and API keys, and detection of unauthenticated LLM endpoints. These checks surface leakage risks specific to stacks involving CockroachDB and Express.
Do parameterized queries alone prevent LLM data leakage?
Parameterized queries prevent injection and reduce error-based leakage, but they do not prevent intentional or accidental exposure of sensitive fields in responses. You must also filter outputs, enforce authentication, and scan endpoints with tools that test LLM-specific attack vectors.