MEDIUM | unicode normalization | flask | firestore

Unicode Normalization in Flask with Firestore

Unicode Normalization in Flask with Firestore — how this specific combination creates or exposes the vulnerability

Unicode normalization inconsistencies arise when equivalent characters or strings have multiple binary representations. In Flask applications that use Firestore as a backend, accepting user input, normalizing it inconsistently, and then persisting or querying it can lead to authentication bypass, data leakage, or unexpected record matches.

Firestore stores strings as UTF-8 and does not apply normalization automatically. If Flask receives a form submission in NFC (composed characters) and the stored value is in NFD (decomposed), or the reverse, the query compares two different code point sequences and may return no results, causing logic errors. Conversely, if Flask normalizes incoming data to one form but compares it against IDs or keys generated elsewhere in another form, authorization checks can incorrectly match or fail.
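
The mismatch can be reproduced with the standard library alone. The following is a minimal sketch; the address is an illustrative value, not taken from any real dataset:

```python
import unicodedata

# "é" as one precomposed code point (NFC) vs. "e" plus a combining accent (NFD).
composed = unicodedata.normalize("NFC", "jos\u00e9@example.com")
decomposed = unicodedata.normalize("NFD", "jos\u00e9@example.com")

# The strings render identically but are different code point sequences,
# so they also encode to different UTF-8 bytes.
same_text = composed == decomposed
composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

# Normalizing both sides to the same form makes them compare equal again.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
```

Any store that compares strings exactly, as Firestore does, treats `composed` and `decomposed` as two unrelated values.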

Consider a Flask route that looks up a user document by email:

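@app.route("/user")
def get_user():
    email = request.args.get("email")
    if not email:
        return {"error": "email is required"}, 400
    normalized = unicodedata.normalize("NFC", email)
    # Document IDs are matched exactly, so this only hits if the ID was stored as NFC.
    doc = db.collection("users").document(normalized).get()
    # get() returns a snapshot even when the document is missing; check .exists.
    return {"email": doc.to_dict() if doc.exists else None}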

If the stored document ID was created using NFD, this lookup will miss. Similarly, user-controlled input used in Firestore queries or document paths can lead to BOLA/IDOR when normalization differences allow one user to access another’s data by supplying a canonically equivalent but differently encoded identifier.
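
To make the failure mode concrete, here is a small sketch that uses a plain dict as a hypothetical stand-in for a Firestore collection. The dict, the `register` helper, and the owner names are assumptions for illustration; the only property borrowed from Firestore is that IDs are matched exactly:

```python
import unicodedata

# Hypothetical stand-in for a collection: keys are exact document IDs.
users = {}

def register(raw_id, owner):
    # No normalization: the ID is stored exactly as it was received.
    users[raw_id] = {"owner": owner}

nfc_id = unicodedata.normalize("NFC", "ren\u00e9e")
nfd_id = unicodedata.normalize("NFD", "ren\u00e9e")

register(nfd_id, "alice")          # written by a client that sends NFD
found_with_nfc = nfc_id in users   # an NFC lookup misses the NFD document

register(nfc_id, "mallory")        # a visually identical ID is accepted
document_count = len(users)        # two documents now share one displayed name
```

The second registration is the risky part: two records that look the same to a human reviewer can end up with different owners.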

LLM-generated outputs or API payloads containing non-ASCII characters can also introduce normalization variants. If Flask does not enforce a consistent normalization policy across all inputs, outputs, and Firestore identifiers, findings related to Input Validation and Data Exposure may appear in scans. An unauthenticated LLM endpoint that returns strings with non-ASCII characters could, when stored or compared without normalization, expose PII or enable injection-style confusion in downstream logic.

To detect these issues, scans test whether canonically equivalent inputs produce different runtime behaviors, including whether document lookups, query filters, or path-based references resolve as expected. This helps surface inconsistencies in how Flask and Firestore handle normalization, which can map to OWASP API Top 10 items like Broken Object Level Authorization and Improper Input Validation.
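
A check of this kind can be approximated locally. The sketch below is not how any particular scanner is implemented; `canonical_variants`, `behaves_consistently`, and the two lookup functions are hypothetical names standing in for an endpoint's lookup logic:

```python
import unicodedata

def canonical_variants(value):
    """Return the distinct NFC and NFD spellings of value."""
    forms = {unicodedata.normalize(form, value) for form in ("NFC", "NFD")}
    return sorted(forms)

def behaves_consistently(handler, value):
    """True if handler returns the same result for every canonical variant."""
    results = [handler(variant) for variant in canonical_variants(value)]
    return all(result == results[0] for result in results)

# Hypothetical store holding one record under its NFC spelling.
stored = {unicodedata.normalize("NFC", "z\u00fcrich"): "doc-1"}

def raw_lookup(key):
    # Uses the input exactly as received.
    return stored.get(key)

def normalized_lookup(key):
    # Normalizes the input before the lookup.
    return stored.get(unicodedata.normalize("NFC", key))
```

With these definitions, `behaves_consistently(raw_lookup, "z\u00fcrich")` is false because the NFD variant misses, while the normalized lookup gives the same answer for both variants.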

Firestore-Specific Remediation in Flask — concrete code fixes

Apply normalization consistently before any interaction with Firestore. Decide on a canonical form (NFC is common for web forms) and enforce it at the boundary where Flask receives data, as well as when constructing document IDs or field values for queries.
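
One way to enforce the policy at the boundary is a single helper that every request payload passes through. This is a sketch under the assumption that NFC is the chosen form; `CANONICAL_FORM`, `normalize_text`, and `normalize_payload` are illustrative names, not part of Flask or the Firestore client:

```python
import unicodedata

CANONICAL_FORM = "NFC"  # assumed project-wide policy

def normalize_text(value):
    return unicodedata.normalize(CANONICAL_FORM, value)

def normalize_payload(payload):
    """Recursively normalize every string key and value in a JSON-like structure."""
    if isinstance(payload, str):
        return normalize_text(payload)
    if isinstance(payload, dict):
        return {normalize_payload(k): normalize_payload(v) for k, v in payload.items()}
    if isinstance(payload, list):
        return [normalize_payload(item) for item in payload]
    return payload  # numbers, booleans, and None pass through unchanged
```

In a Flask view this could be applied as `normalize_payload(request.get_json())` before any value reaches Firestore, so individual routes do not each have to remember the rule.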

Use Python’s unicodedata module to normalize inputs, and ensure any data read from Firestore is normalized before comparison or rendering. Below is a Flask example that normalizes the email input before using it as the document ID in a lookup:

import unicodedata
from flask import Flask, request
from google.cloud import firestore

app = Flask(__name__)
db = firestore.Client()

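@app.route("/user")
def get_user():
    email = request.args.get("email")
    if not email:
        return {"error": "email is required"}, 400
    # Normalize to the same form that was used when the document was created.
    key = unicodedata.normalize("NFC", email)
    doc = db.collection("users").document(key).get()
    # get() returns a snapshot even when the document is missing; check .exists.
    return {"email": doc.to_dict() if doc.exists else None}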

When storing data, normalize both the document identifier and any searchable fields:

@app.route("/register", methods=["POST"])
def register_user():
    data = request.json
    email = unicodedata.normalize("NFC", data["email"])
    user_data = {
        "email": email,
        "display_name": unicodedata.normalize("NFC", data.get("display_name", "")),
    }
    db.collection("users").document(email).set(user_data)
    return {"status": "created"}, 201

For queries involving fields that may contain user-supplied strings, normalize the query value as well:

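@app.route("/search")
def search_users():
    term = request.args.get("q")
    if not term:
        return {"error": "q is required"}, 400
    # Equality filters compare exact strings, so normalize the query value first.
    norm_term = unicodedata.normalize("NFC", term)
    results = db.collection("users").where("email", "==", norm_term).stream()
    # Wrap the list in a dict; returning a bare list needs Flask 2.2 or newer.
    return {"results": [doc.to_dict() for doc in results]}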

If your application uses Firestore document paths that include user input (e.g., usernames), normalize the path components to avoid BOLA/IDOR due to encoding mismatches:

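@app.route("/profile")
def get_profile():
    username = request.args.get("username")
    if not username:
        return {"error": "username is required"}, 400
    safe_username = unicodedata.normalize("NFC", username)
    doc = db.collection("profiles").document(safe_username).get()
    # A Flask view must not return None, so report a missing profile explicitly.
    if not doc.exists:
        return {"error": "profile not found"}, 404
    return doc.to_dict()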

For the LLM/AI Security checks, ensure that any non-ASCII output from models is normalized before storage or comparison. This reduces the risk of hidden injection or access confusion that scans might flag under Input Validation or Data Exposure categories.
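
For model output, it can be useful to know whether a string arrived in a non-canonical form, for example to log it. A minimal sketch, assuming NFC as the policy and Python 3.8 or newer for `unicodedata.is_normalized`; the function name is hypothetical:

```python
import unicodedata

def canonicalize_model_output(text, form="NFC"):
    """Return (normalized_text, was_changed) for a model-generated string."""
    if unicodedata.is_normalized(form, text):  # available since Python 3.8
        return text, False
    return unicodedata.normalize(form, text), True
```

The returned flag lets the caller store the normalized text while still recording that the original differed.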

middleBrick scans can surface these normalization issues by testing equivalent inputs with different Unicode representations. If you use the CLI, you can run middlebrick scan <url> to validate that your endpoints handle normalization consistently. In CI/CD, the GitHub Action can fail builds when responses differ across normalization forms, and the Dashboard can track changes over time.

Frequently Asked Questions

Why does an NFC-normalized input sometimes not match a Firestore document created with NFD?
Firestore stores strings as UTF-8 bytes and does not apply normalization. If a document ID or field is stored using NFD, a query using the NFC form of the same logical string will not match because the byte representations differ, leading to missed lookups or authorization gaps.

Should I normalize to NFC or NFD for Firestore in Flask?
Choose one canonical form for your application (NFC is common for web forms) and enforce it consistently: at input acceptance, before storing to Firestore, and before querying. Document this policy and ensure any data compared with Firestore values is normalized to the same form to avoid mismatches.
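
When a policy is introduced on a collection that already holds data, existing identifiers may need an audit first. The following sketch assumes the IDs have already been collected into a list (for instance from `doc.id` while iterating `db.collection("users").stream()`); `audit_ids` is a hypothetical helper, not a Firestore feature:

```python
import unicodedata
from collections import defaultdict

def audit_ids(doc_ids, form="NFC"):
    """Group IDs by canonical form; report non-canonical IDs and collisions."""
    groups = defaultdict(list)
    for doc_id in doc_ids:
        groups[unicodedata.normalize(form, doc_id)].append(doc_id)
    # IDs that would change if normalized to the chosen form.
    non_canonical = [
        doc_id
        for ids in groups.values()
        for doc_id in ids
        if unicodedata.normalize(form, doc_id) != doc_id
    ]
    # Canonical forms shared by more than one stored ID.
    collisions = {canon: ids for canon, ids in groups.items() if len(ids) > 1}
    return non_canonical, collisions
```

Entries in `collisions` deserve manual review before any migration, since merging them means deciding which document owns the canonical ID.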