LLM Data Leakage in Hapi with Firestore
LLM Data Leakage in Hapi with Firestore — how this specific combination creates or exposes the vulnerability
When building server-side Hapi applications that use Google Cloud Firestore, developers often pass raw document data directly into prompts sent to LLM endpoints. This practice can lead to LLM data leakage, where sensitive document contents, such as personally identifiable information (PII), authentication tokens, or internal business data, are unintentionally exposed to an external model. Because Firestore documents may contain nested fields, arrays, and metadata, it is easy to serialize an entire entity and include it in a request body without realizing the scope of the data being shared.
In a Hapi service, routes commonly retrieve documents using the Firestore SDK and then construct a response or forward data to an LLM for summarization, classification, or content generation. If the route handler does not explicitly filter fields before inclusion in the prompt or request payload, confidential information can be included in the model input. For example, a user profile document might contain email, phone number, and address fields; if the handler passes the full document to an LLM to generate a support reply, those fields are now present in the LLM’s input and could appear in model outputs or logs, leading to leakage.
The risk is compounded when developers use convenience methods that serialize Firestore documents with JSON.stringify() or spread operator patterns without redaction. Firestore document snapshots include metadata fields such as createTime, updateTime, and readTime, which may inadvertently reveal internal system timing or versioning details. Even when developers believe they are sending only a subset of fields, dynamic queries or incorrect assumptions about document structure can result in additional fields being included at runtime.
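For illustration, the vulnerable pattern looks roughly like the following sketch. The route path, collection name, field names, and the callLlm stub are placeholders rather than part of any real service.

```js
// Vulnerable sketch: the entire Firestore document is serialized into the prompt.
// The collection name, route path, and callLlm stub are hypothetical.
const Hapi = require('@hapi/hapi');
const { initializeApp } = require('firebase-admin/app');
const { getFirestore } = require('firebase-admin/firestore');

initializeApp();
const db = getFirestore();
const server = Hapi.server({ port: 4000, host: 'localhost' });

// Stand-in for whatever client actually sends the prompt to the model.
const callLlm = async (prompt) => { /* real LLM call would go here */ return ''; };

server.route({
  method: 'GET',
  path: '/support/reply/{docId}',
  handler: async (request, h) => {
    const doc = await db.collection('userProfiles').doc(request.params.docId).get();

    if (!doc.exists) {
      return h.response({ error: 'Not found' }).code(404);
    }

    // Every field in the document (email, phone, address, internal notes,
    // plus anything added to the schema later) ends up in the model input.
    const prompt = `Write a support reply for this user:\n${JSON.stringify(doc.data())}`;

    const reply = await callLlm(prompt);
    return { reply };
  }
});

server.start();
```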
An additional vector specific to this combination involves Firestore triggers (e.g., Cloud Functions) that invoke Hapi services or external LLMs. In such architectures, a triggered document write can propagate sensitive data outside the secure environment, especially if the integration does not validate or sanitize data before forwarding. Because LLM endpoints are often unauthenticated or only lightly guarded during experimentation, sensitive Firestore content can end up at endpoints that lack authentication, access controls, or audit logging, increasing exposure risk.
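A trigger-based integration can apply the same allowlisting before data leaves the trusted environment. The following is a minimal sketch using a Cloud Functions (2nd gen) Firestore trigger; the collection name, field names, and internal Hapi endpoint URL are assumptions for illustration.

```js
// Hypothetical Cloud Function: forwards only allowlisted fields to a Hapi
// service that performs the LLM call, instead of the full document.
const { onDocumentCreated } = require('firebase-functions/v2/firestore');

exports.forwardForSummary = onDocumentCreated('userQueries/{docId}', async (event) => {
  if (!event.data) {
    return;
  }
  const data = event.data.data();

  // Allowlist before the data ever leaves the trusted environment.
  const safePayload = {
    subject: data.subject || '',
    message: data.message || '',
  };

  // Placeholder URL: point this at an authenticated, internal Hapi endpoint.
  await fetch('https://internal.example.com/support/suggestion', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(safePayload),
  });
});
```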
middleBrick’s LLM/AI Security checks detect this class of issue by scanning for system prompt leakage patterns, running active prompt injection probes, and scanning model output for PII or API keys. When Firestore data reaches an LLM endpoint without proper filtering, the scanner can identify missing redaction, excessive agency patterns, or unauthenticated endpoints that further elevate risk. These findings highlight the importance of explicitly defining which document fields are safe for LLM consumption and validating that no sensitive data is included in prompts or logged model outputs.
To mitigate LLM data leakage in Hapi with Firestore, developers should adopt a strict allowlist approach when preparing data for LLM interactions. This includes explicitly selecting only necessary, non-sensitive fields, removing nested objects that may contain hidden sensitive content, and avoiding direct serialization of Firestore snapshots. Where possible, transform Firestore documents into simplified view models before inclusion in prompts, and ensure that any data sent to LLM endpoints is reviewed against compliance frameworks such as OWASP API Top 10 and GDPR.
Firestore-Specific Remediation in Hapi — concrete code fixes
Remediation focuses on controlling which Firestore document fields are exposed to LLMs and ensuring that data transformations occur before any external call. The following examples demonstrate secure patterns for retrieving and preparing Firestore data within a Hapi route handler.
First, initialize Firestore and define a route that explicitly selects safe fields. This pattern avoids passing the entire document snapshot to an LLM and reduces the chance of accidental data exposure.
```js
const Hapi = require('@hapi/hapi');
const { initializeApp } = require('firebase-admin/app');
const { getFirestore } = require('firebase-admin/firestore');

initializeApp();
const db = getFirestore();

const server = Hapi.server({ port: 4000, host: 'localhost' });

server.route({
  method: 'GET',
  path: '/support/suggestion/{docId}',
  handler: async (request, h) => {
    const docId = request.params.docId;
    const docRef = db.collection('userQueries').doc(docId);
    const doc = await docRef.get();

    if (!doc.exists) {
      return h.response({ error: 'Not found' }).code(404);
    }

    const data = doc.data();

    // Explicitly allowlisted fields for LLM input
    const safeInput = {
      subject: data.subject || '',
      message: data.message || '',
      category: data.category || 'general',
      // Intentionally excluding email, phone, userId, internalNotes
    };

    // Example: send safeInput to an LLM endpoint
    // const llmResponse = await callLlm(safeInput);

    return { summary: 'Processed with safe fields only' };
  }
});

server.start().then(() => console.log('Server running'));
```
In this example, the handler retrieves a Firestore document but constructs safeInput using an explicit field allowlist. Sensitive fields such as email, phone, and internalNotes are omitted, ensuring that only intended data is forwarded to the LLM. This pattern aligns with best practices for input validation and helps meet compliance requirements by limiting data exposure.
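The callLlm call in the commented-out line is left undefined in the example above. One possible shape for such a helper, assuming an OpenAI-compatible chat completions endpoint and an API key supplied via an environment variable, is sketched below; treat the endpoint URL and model name as placeholders.

```js
// Hypothetical helper: sends only the allowlisted fields to an
// OpenAI-compatible chat completions endpoint (uses Node 18+ global fetch).
const callLlm = async (safeInput) => {
  const response = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: 'gpt-4o-mini',
      messages: [
        { role: 'system', content: 'Draft a short support reply.' },
        // Only the allowlisted view model is serialized into the prompt.
        { role: 'user', content: JSON.stringify(safeInput) },
      ],
    }),
  });

  const body = await response.json();
  return body.choices[0].message.content;
};
```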
Second, when dealing with arrays or nested objects, use transformation functions to flatten or filter content before inclusion. The following snippet shows how to sanitize nested arrays and remove potentially sensitive entries.
```js
const sanitizeForLlm = (data) => {
  const { notes, internalTags, ...safeData } = data;

  // Remove or redact sensitive nested content
  const publicNotes = Array.isArray(notes)
    ? notes.filter(n => !n.private).map(n => n.text)
    : [];

  return {
    ...safeData,
    notes: publicNotes,
  };
};

server.route({
  method: 'POST',
  path: '/process',
  handler: async (request, h) => {
    const docId = request.payload.docId;
    const docRef = db.collection('items').doc(docId);
    const doc = await docRef.get();

    if (!doc.exists) {
      return h.response({ error: 'Not found' }).code(404);
    }

    const data = doc.data();
    const prepared = sanitizeForLlm(data);

    // prepared now contains only safe, flattened data suitable for LLM consumption
    return { prepared };
  }
});
```
This second example demonstrates a sanitization function that removes internal notes and tags before constructing the payload for an LLM. By explicitly excluding or filtering fields, you reduce the likelihood of LLM data leakage and ensure that only necessary, non-sensitive information is used during model interactions.
Finally, adopt schema validation for Firestore documents to enforce field-level constraints and prevent unexpected data from reaching LLM endpoints. Using libraries such as Joi or Zod, you can define allowed structures and reject documents that do not conform, adding an additional layer of protection against inadvertent data exposure.
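As a minimal sketch of that approach using Joi (the field names and constraints are illustrative, not prescriptive):

```js
// Hypothetical Joi schema enforcing which fields may reach the LLM.
const Joi = require('joi');

const llmInputSchema = Joi.object({
  subject: Joi.string().max(200).required(),
  message: Joi.string().max(5000).required(),
  category: Joi.string().valid('general', 'billing', 'technical').default('general'),
}).options({ stripUnknown: true }); // silently drop any field not listed above

const validateForLlm = (data) => {
  const { error, value } = llmInputSchema.validate(data);
  if (error) {
    throw new Error(`Document failed LLM input validation: ${error.message}`);
  }
  return value; // only schema-approved fields remain
};
```

Calling validateForLlm(doc.data()) before building the prompt ensures that any field not named in the schema is stripped rather than silently forwarded to the model.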
Related CWEs (LLM Security)
| CWE ID | Name | Severity |
|---|---|---|
| CWE-754 | Improper Check for Unusual or Exceptional Conditions | MEDIUM |