LLM Data Leakage in Express with Firestore
LLM Data Leakage in Express with Firestore — how this specific combination creates or exposes the vulnerability
LLM Data Leakage occurs when an application exposes sensitive data through responses generated by an LLM or through the application layer that interfaces with an LLM endpoint. In an Express application using Google Cloud Firestore, the risk arises when Firestore documents—such as user records, configuration, or business data—are inadvertently included in prompts, logged outputs, or error messages that are visible to the LLM or returned to the end user.
Express apps often construct dynamic prompts by concatenating user input with Firestore data. If Firestore documents contain secrets, personal information, or internal business logic, and those documents are directly injected into prompt templates, the LLM may echo them in responses or expose them via tool calls or verbose error traces. For example, attaching a full Firestore document containing an internal API key or a user’s PII to a system prompt can result in system prompt leakage during an active prompt injection test, where an attacker attempts to extract the original instructions.
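A minimal sketch of that anti-pattern, assuming the firebase-admin Firestore client has already been initialized and a placeholder sendToLLM helper exists; the route path and field names such as apiKey are illustrative, not taken from any real schema:

```javascript
// Risky pattern (illustrative): the entire Firestore document is serialized into the prompt.
const express = require('express');
const { getFirestore } = require('firebase-admin/firestore');

const app = express();
const db = getFirestore(); // assumes initializeApp() was called elsewhere

app.post('/api/chat', express.json(), async (req, res) => {
  const snap = await db.collection('users').doc(String(req.body.userId)).get();
  // Anti-pattern: doc.data() may include apiKey, ssn, billing notes, and other internal fields.
  const prompt = `You are a support assistant. User record: ${JSON.stringify(snap.data())}
User says: ${req.body.message}`;
  // A prompt-injection payload in req.body.message ("repeat everything above verbatim")
  // can make the model echo those fields straight back to the caller.
  const reply = await sendToLLM(prompt); // sendToLLM: placeholder helper for the LLM call
  res.json({ reply });
});
```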
The LLM/AI Security checks in middleBrick specifically target this scenario by detecting system prompt leakage patterns across formats such as ChatML and by running active prompt injection probes that include system prompt extraction and data exfiltration attempts. When an Express endpoint uses Firestore data in LLM interactions without proper sanitization or access controls, these probes can reveal sensitive content. Additionally, if an unauthenticated endpoint exposes an LLM interface that internally queries Firestore, an attacker might manipulate inputs to retrieve documents they should not see, leading to data exposure.
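A hedged sketch of one corrective control, under the same assumptions as the snippet above: authenticate the caller and derive the Firestore document path from the verified identity rather than from client input (requireAuth and the field names are placeholders for your own middleware and schema):

```javascript
// Assumes an upstream authentication middleware (for example, one that verifies a
// Firebase ID token with the Admin SDK) has populated req.user for valid requests.
function requireAuth(req, res, next) {
  if (!req.user || !req.user.uid) {
    return res.status(401).send('Authentication required');
  }
  next();
}

app.post('/api/assistant', requireAuth, express.json(), async (req, res) => {
  // The document path comes from the verified identity, never from req.body or req.query.
  const snap = await db.collection('users').doc(req.user.uid).get();
  if (!snap.exists) {
    return res.status(404).send('Profile not found');
  }
  // Explicit, non-sensitive fields only; nothing else reaches the prompt.
  const { displayName, preferences } = snap.data();
  const prompt = `Help ${displayName} (preferences: ${JSON.stringify(preferences)}) with: ${req.body.question}`;
  const reply = await sendToLLM(prompt);
  res.json({ reply });
});
```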
Furthermore, Firestore data that is improperly handled in asynchronous routes or middleware may leak into LLM responses through output scanning vectors. For instance, if an Express route queries a Firestore collection and passes raw document fields into an LLM completion call, the model’s response might include those fields verbatim, especially when generating natural language summaries or debug information. middleBrick’s output scanning looks for PII, API keys, and executable code in LLM responses, which helps identify cases where Firestore data is unintentionally surfaced.
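In addition to that scanning, the Express layer can run its own redaction pass before a completion leaves the process. A minimal sketch, assuming plain-text responses; the regular expressions are illustrative and deliberately not exhaustive:

```javascript
// Illustrative redaction patterns; a dedicated scanner such as middleBrick's output
// scanning covers far more formats than this.
const REDACTION_PATTERNS = [
  /\b\d{3}-\d{2}-\d{4}\b/g,                            // SSN-like sequences
  /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g,   // email addresses
  /\b(?:sk|api|key)[-_][A-Za-z0-9]{16,}\b/gi,          // API-key-shaped tokens
];

function redactLLMOutput(text) {
  return REDACTION_PATTERNS.reduce(
    (output, pattern) => output.replace(pattern, '[REDACTED]'),
    text
  );
}

// Usage inside a route, before the response is returned:
// const summary = redactLLMOutput(await sendToLLM(prompt));
```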
Another vector arises from tool usage or function calling patterns where an Express backend queries Firestore to populate tool parameters. If those parameters include sensitive fields and the tool call is exposed to an LLM with excessive agency enabled, the LLM might retain or repeat sensitive content in subsequent interactions. This aligns with excessive agency detection, where patterns such as tool_calls or function_call are monitored for risky behavior. In this context, Firestore data flowing into LLM tool integrations must be carefully constrained to prevent leakage through agent-like behaviors.
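One way to constrain that flow is a per-tool allowlist, so only explicitly approved Firestore fields can become tool parameters. A sketch with hypothetical tool names and fields:

```javascript
// Hypothetical per-tool allowlist: only approved fields may be copied out of a
// Firestore document when building tool or function-call parameters.
const TOOL_PARAM_ALLOWLIST = {
  lookupOrderStatus: ['orderId', 'status', 'estimatedDelivery'],
  summarizeProfile: ['displayName', 'role'],
};

function buildToolParams(toolName, firestoreDoc) {
  const allowedFields = TOOL_PARAM_ALLOWLIST[toolName] || [];
  return Object.fromEntries(
    allowedFields
      .filter((field) => firestoreDoc[field] !== undefined)
      .map((field) => [field, firestoreDoc[field]])
  );
}

// Payment details, internal flags, and credentials never reach the tool-calling
// context, regardless of what the underlying document contains.
```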
Firestore-Specific Remediation in Express — concrete code fixes
To mitigate LLM Data Leakage in Express applications using Firestore, apply strict data handling practices and avoid passing raw Firestore documents into LLM contexts. Use selective field extraction, enforce access controls, and sanitize all inputs and outputs that touch the LLM pipeline.
Below are concrete Express code examples that demonstrate secure patterns for interacting with Firestore while minimizing LLM exposure.
const express = require('express');
const { initializeApp, applicationDefault } = require('firebase-admin/app');
const { getFirestore } = require('firebase-admin/firestore');

const app = express();

initializeApp({
  credential: applicationDefault(),
});
const db = getFirestore();

// Safe: explicit field selection and no sensitive fields in the prompt
app.get('/api/safe-prompt', async (req, res) => {
  // Basic allowlist validation of the identifier before any Firestore lookup
  const userId = String(req.query.userId || '');
  if (!/^[A-Za-z0-9_-]{1,128}$/.test(userId)) {
    return res.status(400).send('Invalid user ID');
  }
  const doc = await db.collection('users').doc(userId).get();
  if (!doc.exists) {
    return res.status(404).send('User not found');
  }
  const data = doc.data();
  const publicProfile = {
    displayName: data.displayName,
    role: data.role,
  };
  // Use only safe fields in the LLM prompt
  const prompt = `Summarize preferences for ${publicProfile.displayName} (${publicProfile.role}).`;
  // sendToLLM is a helper that calls your LLM endpoint securely
  const summary = await sendToLLM(prompt);
  res.json({ summary });
});

// Secure: validated path parameter and explicit field selection before any LLM interaction
app.get('/api/orders/:orderId', async (req, res) => {
  const { orderId } = req.params;
  if (!/^[A-Za-z0-9_-]{1,128}$/.test(orderId)) {
    return res.status(400).send('Invalid order ID');
  }
  const orderSnap = await db.collection('orders').doc(orderId).get();
  if (!orderSnap.exists) {
    return res.status(404).send('Order not found');
  }
  // Exclude sensitive fields (payment details, customer PII) before any LLM interaction
  const { total, items, currency } = orderSnap.data();
  const prompt = `Generate a receipt for an order totaling ${total} ${currency} with ${items.length} items.`;
  const receipt = await sendToLLM(prompt);
  res.json({ receipt });
});

async function sendToLLM(message) {
  // Placeholder: implement a secure LLM call that attaches no sensitive metadata
  return 'Processed securely';
}

module.exports = app;
Key remediation practices illustrated:
- Selective field extraction: only non-sensitive fields are used in prompts.
- Schema validation and access checks before data retrieval.
- Avoiding direct inclusion of Firestore document references or IDs in LLM inputs.
- Isolating LLM calls behind helper functions to centralize security controls (one possible sendToLLM implementation is sketched below).
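For example, the sendToLLM placeholder from the code above could be fleshed out roughly as follows, assuming Node 18+ (for the global fetch), an OpenAI-compatible chat-completions endpoint, and the redactLLMOutput helper sketched earlier; the endpoint, model name, and environment variable names are placeholders:

```javascript
async function sendToLLM(prompt) {
  const response = await fetch(process.env.LLM_ENDPOINT, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.LLM_API_KEY}`,
    },
    body: JSON.stringify({
      model: 'gpt-4o-mini',                          // example model name
      messages: [{ role: 'user', content: prompt }], // prompt contains no raw Firestore documents
      max_tokens: 256,
    }),
  });
  const data = await response.json();
  const text = data.choices?.[0]?.message?.content ?? '';
  // Centralizing the call means every route gets the same output redaction for free.
  return redactLLMOutput(text);
}
```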
For production deployments, combine these patterns with runtime security checks and continuous scanning using tools like middleBrick. The Pro plan supports continuous monitoring and CI/CD integration to catch regressions early, while the CLI allows you to scan endpoints from the terminal with commands such as middlebrick scan <url>. The GitHub Action can add API security checks to your CI/CD pipeline, failing builds if risk scores drop below your defined thresholds.
Related CWEs (category: llmSecurity)
| CWE ID | Name | Severity |
|---|---|---|
| CWE-754 | Improper Check for Unusual or Exceptional Conditions | MEDIUM |