Unicode Normalization in Flask with Firestore
Unicode Normalization in Flask with Firestore — how this specific combination creates or exposes the vulnerability
Unicode normalization inconsistencies arise when equivalent characters or strings have multiple binary representations. In Flask applications that use Firestore as a backend, accepting user input, normalizing it inconsistently, and then persisting or querying it can lead to authentication bypass, data leakage, or unexpected record matches.
Firestore stores strings as UTF-8 and does not apply normalization automatically. If Flask receives a form submission in NFC (composed characters) and queries Firestore with an NFD (decomposed) version of the same text, the query may return no results, causing logic errors. Conversely, if Flask normalizes incoming data to one form but compares it against IDs or keys generated elsewhere in another form, authorization checks can match or fail incorrectly.
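As a minimal illustration of the mismatch described above, the following sketch uses only the standard library to compare the composed and decomposed spellings of the same word. The sample string is an assumption chosen for the example, not data from a real application.

```python
import unicodedata

# "é" as one precomposed code point (NFC) vs. "e" + combining acute accent (NFD).
composed = "caf\u00e9"
decomposed = "cafe\u0301"

# The two strings render identically but are different code point sequences.
same_raw = composed == decomposed

# Their UTF-8 encodings differ as well, which is what a byte-wise store compares.
composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

# After normalizing both to one form, they compare equal.
same_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

print(same_raw, len(composed_bytes), len(decomposed_bytes), same_nfc)
# prints: False 5 6 True
```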
Consider a Flask route that looks up a user document by email:
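@app.route("/user")
def get_user():
    email = request.args.get("email")
    # Only the lookup key is normalized; the stored document ID may use another form.
    normalized = unicodedata.normalize("NFC", email)
    doc = db.collection("users").document(normalized).get()
    # get() always returns a snapshot object, so check .exists rather than truthiness.
    return {"email": doc.to_dict() if doc.exists else None}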
If the stored document ID was created using NFD, this lookup will miss. Similarly, user-controlled input used in Firestore queries or document paths can lead to BOLA/IDOR when normalization differences allow one user to access another’s data by supplying a canonically equivalent but differently encoded identifier.
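The miss can be reproduced without Firestore. The sketch below uses a plain dictionary as a hypothetical stand-in for a collection keyed by document ID; `store_user` and `lookup_nfc` are illustrative names, not part of any library.

```python
import unicodedata

# Hypothetical in-memory stand-in for a Firestore collection keyed by document ID.
users = {}

def store_user(doc_id, data):
    # No normalization on write: the writer keeps whatever form it received.
    users[doc_id] = data

def lookup_nfc(doc_id):
    # The reader normalizes to NFC before the lookup, as in the route above.
    return users.get(unicodedata.normalize("NFC", doc_id))

nfc_id = "jos\u00e9@example.com"
nfd_id = unicodedata.normalize("NFD", nfc_id)

# A legacy writer stored the NFD spelling.
store_user(nfd_id, {"email": nfd_id})
missed = lookup_nfc(nfc_id)  # None: stored key is NFD, lookup key is NFC

# A second registration with the NFC spelling creates a separate record
# for what looks like the same identity.
store_user(nfc_id, {"email": nfc_id})
record_count = len(users)  # 2
```

Two records that display identically but live under different keys are the kind of ambiguity that makes ownership checks unreliable.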
LLM-generated outputs or API payloads containing non-ASCII characters can also introduce normalization variants. If Flask does not enforce a consistent normalization policy across all inputs, outputs, and Firestore identifiers, findings related to Input Validation and Data Exposure may appear in scans. An unauthenticated LLM endpoint that returns strings with non-ASCII characters could, when stored or compared without normalization, expose PII or enable injection-style confusion in downstream logic.
To detect these issues, scans test whether canonically equivalent inputs produce different runtime behaviors, including whether document lookups, query filters, or path-based references resolve as expected. This helps surface inconsistencies in how Flask and Firestore handle normalization, which can map to OWASP API Security Top 10 items such as Broken Object Level Authorization, and to input validation weaknesses more generally.
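A similar check can be approximated locally. The following is a rough sketch, not how any particular scanner works: it feeds the NFC and NFD spellings of one input to a handler and reports whether the results agree. `variants`, `behaves_consistently`, and the two lookup functions are hypothetical names introduced for the example.

```python
import unicodedata

# Canonical equivalence only: NFC and NFD. (NFKC/NFKD also fold compatibility
# characters, which is a broader and lossier comparison.)
CANONICAL_FORMS = ("NFC", "NFD")

def variants(text):
    """Return the distinct canonical spellings of text."""
    return {unicodedata.normalize(form, text) for form in CANONICAL_FORMS}

def behaves_consistently(handler, text):
    """True if handler returns the same (hashable) result for every variant."""
    results = {handler(v) for v in variants(text)}
    return len(results) == 1

KNOWN_USERS = {"jos\u00e9"}  # stored in NFC

def naive_lookup(name):
    # Compares the raw input against stored values.
    return name in KNOWN_USERS

def normalizing_lookup(name):
    # Normalizes the input to the stored form first.
    return unicodedata.normalize("NFC", name) in KNOWN_USERS

naive_ok = behaves_consistently(naive_lookup, "jos\u00e9")              # False
normalizing_ok = behaves_consistently(normalizing_lookup, "jos\u00e9")  # True
```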
Firestore-Specific Remediation in Flask — concrete code fixes
Apply normalization consistently before any interaction with Firestore. Decide on a canonical form (NFC is common for web forms) and enforce it at the boundary where Flask receives data, as well as when constructing document IDs or field values for queries.
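One way to enforce a single form is to route every inbound string through one helper. The sketch below is an assumption about how that boundary might look; `canonical` and `CANONICAL_FORM` are hypothetical names. It also rejects missing values, since `request.args.get` returns `None` for an absent parameter and `unicodedata.normalize` raises `TypeError` on non-strings.

```python
import unicodedata

# Single place where the application's canonical form is chosen.
CANONICAL_FORM = "NFC"

def canonical(value):
    """Normalize a user-supplied string to the canonical form.

    Raises ValueError for missing or non-string input so callers can
    turn it into a 400 response instead of an unhandled TypeError.
    """
    if not isinstance(value, str):
        raise ValueError("expected a string")
    return unicodedata.normalize(CANONICAL_FORM, value)

key = canonical("cafe\u0301")  # decomposed input becomes "caf\u00e9"
```

Keeping the form in one constant means a later switch (for example to NFKC) touches one line, although existing stored identifiers would still need to be migrated.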
Use Python’s unicodedata module to normalize inputs, and ensure any data read from Firestore is normalized before comparison or rendering. Below is a Flask example that normalizes email input before using it as a document ID lookup:
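import unicodedata

from flask import Flask, request
from google.cloud import firestore

app = Flask(__name__)
db = firestore.Client()

@app.route("/user")
def get_user():
    email = request.args.get("email")
    if not email:
        # unicodedata.normalize() raises TypeError on None, so reject early.
        return {"error": "email is required"}, 400
    # Normalize to the same form that is used when the document is written.
    key = unicodedata.normalize("NFC", email)
    doc = db.collection("users").document(key).get()
    # get() always returns a snapshot; .exists tells whether the document is there.
    return {"email": doc.to_dict() if doc.exists else None}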
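For the read side mentioned above, a comparison helper can normalize both operands so that a value stored in a legacy form still matches. This is a minimal sketch; `same_text` and the sample values are hypothetical.

```python
import unicodedata

def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

stored_owner = "zoe\u0308"  # as read from storage: "e" + combining diaeresis (NFD)
session_user = "zo\u00eb"   # as supplied by the client: precomposed (NFC)

raw_match = stored_owner == session_user                  # False
normalized_match = same_text(stored_owner, session_user)  # True
```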
When storing data, normalize both the document identifier and any searchable fields:
@app.route("/register", methods=["POST"])
def register_user():
    data = request.json
    email = unicodedata.normalize("NFC", data["email"])
    user_data = {
        "email": email,
        "display_name": unicodedata.normalize("NFC", data.get("display_name", "")),
    }
    db.collection("users").document(email).set(user_data)
    return {"status": "created"}, 201
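Normalizing new writes does not fix identifiers that were stored earlier in another form. A small audit can flag them; the sketch below assumes Python 3.8+ for `unicodedata.is_normalized` and takes any iterable of ID strings (for instance the `id` of each snapshot from a collection's `stream()`). `non_nfc_ids` is a hypothetical helper name.

```python
import unicodedata

def non_nfc_ids(doc_ids):
    """Return the IDs that are not already in NFC."""
    return [i for i in doc_ids if not unicodedata.is_normalized("NFC", i)]

# Sample IDs standing in for existing document IDs.
existing_ids = ["caf\u00e9@example.com", "cafe\u0301@example.com", "plain@example.com"]
needs_migration = non_nfc_ids(existing_ids)  # only the decomposed ID
```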
For queries involving fields that may contain user-supplied strings, normalize the query value as well:
@app.route("/search")
def search_users():
    # Default to "" so normalize() never receives None.
    term = request.args.get("q", "")
    norm_term = unicodedata.normalize("NFC", term)
    results = db.collection("users").where("email", "==", norm_term).stream()
    # Flask 2.2+ serializes a returned list as JSON.
    return [doc.to_dict() for doc in results]
If your application uses Firestore document paths that include user input (e.g., usernames), normalize the path components to avoid BOLA/IDOR due to encoding mismatches:
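@app.route("/profile")
def get_profile():
    username = request.args.get("username")
    if not username:
        return {"error": "username is required"}, 400
    safe_username = unicodedata.normalize("NFC", username)
    doc = db.collection("profiles").document(safe_username).get()
    # A Flask view must not return None; answer with 404 when the profile is missing.
    if not doc.exists:
        return {"error": "not found"}, 404
    return doc.to_dict()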
For the LLM/AI Security checks, ensure that any non-ASCII output from models is normalized before storage or comparison. This reduces the risk of hidden injection or access confusion that scans might flag under Input Validation or Data Exposure categories.
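Model output often arrives as nested JSON rather than a single string. The following sketch walks such a structure and normalizes every string it finds, including dictionary keys; `normalize_payload` and the sample payload are hypothetical and only cover dicts, lists, and strings.

```python
import unicodedata

def normalize_payload(value, form="NFC"):
    """Recursively normalize all strings in a JSON-like structure."""
    if isinstance(value, str):
        return unicodedata.normalize(form, value)
    if isinstance(value, dict):
        return {
            normalize_payload(k, form): normalize_payload(v, form)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [normalize_payload(v, form) for v in value]
    # Numbers, booleans, and None pass through unchanged.
    return value

# Sample payload with decomposed characters, standing in for model output.
payload = {
    "name": "Rene\u0301",
    "tags": ["cafe\u0301", 3],
    "meta": {"city": "Malmo\u0308"},
}
cleaned = normalize_payload(payload)
```

Running this once, just before the payload is stored or compared, keeps the rest of the code free of per-field normalization calls.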
middleBrick scans can surface these normalization issues by testing equivalent inputs with different Unicode representations. If you use the CLI, you can run middlebrick scan <url> to validate that your endpoints handle normalization consistently. In CI/CD, the GitHub Action can fail builds when responses differ across normalization forms, and the Dashboard can track changes over time.