Unicode Normalization in MongoDB
How Unicode Normalization Manifests in MongoDB
MongoDB stores strings as UTF‑8, but the database does not automatically apply Unicode normalization when indexing or comparing values. This means that two strings that are canonically equivalent (e.g., the letter “e” followed by a combining acute accent versus the pre‑composed “é”) are treated as different keys unless the application normalizes them before storage or query. Attackers can exploit this in several ways:
- Authentication bypass: A login endpoint that checks `username` against a unique index may accept `admin` (U+0061 U+0064 U+006D U+0069 U+006E) while treating the fullwidth variant (U+FF41 U+FF44 U+FF4D U+FF49 U+FF4E) as a different key, because the index was built on the raw bytes. By supplying the fullwidth variant, an attacker can log in as a non‑existent user or create a duplicate account that later confuses authorization logic.
- Data‑integrity subversion: An application that relies on a unique index on `email` to prevent duplicate accounts can be tricked into storing two records that render identically to a user but differ in normalization form (e.g., `usër@example.com` with the precomposed U+00EB versus the decomposed `e` followed by the combining diaeresis U+0308). This can lead to account takeover or privilege escalation when the application later normalizes the value for display.
- Input‑validation evasion: Validation routines that reject specific dangerous characters may miss fullwidth equivalents or characters from other Unicode blocks that normalize (under NFKC) to the same ASCII character. An attacker can inject `admin%EF%BC%81` (`admin` followed by the fullwidth exclamation mark, U+FF01) to bypass a filter that only looks for the ASCII `!`; if the value is normalized after validation, the dangerous character reappears. Even allowlist patterns such as `[^a-zA-Z0-9@.]` are only reliable when applied after normalization.
- Query manipulation: MongoDB’s `$regex` operator works on the raw code points. A pattern like `/^admin$/i` will not match the fullwidth `admin`, allowing an attacker to craft a payload that evades a regex‑based WAF or application‑level blocklist while still resolving to the intended username once the application normalizes it for lookup.
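The underlying mismatch is easy to demonstrate with Python's standard `unicodedata` module: canonically equivalent strings compare unequal until they are normalized, and NFKC additionally folds compatibility characters such as fullwidth letters.

```python
import unicodedata

composed = "caf\u00e9"       # 'é' as a single precomposed code point (NFC form)
decomposed = "cafe\u0301"    # 'e' followed by U+0301 COMBINING ACUTE ACCENT (NFD form)

# MongoDB compares the raw code points, so these are two different keys:
print(composed == decomposed)                                # False
print(unicodedata.normalize('NFC', decomposed) == composed)  # True

# NFKC additionally folds compatibility characters such as fullwidth letters:
fullwidth_admin = "\uff41\uff44\uff4d\uff49\uff4e"
print(fullwidth_admin == "admin")                                 # False
print(unicodedata.normalize('NFKC', fullwidth_admin) == "admin")  # True
```

Note that NFC alone does not fold the fullwidth variant; only the compatibility forms (NFKC/NFKD) do.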
These patterns fall under the injection category of the OWASP Top 10 (A03:2021 – Injection), where normalization tricks are a recognized technique for bypassing input validation and authentication mechanisms.
MongoDB‑Specific Detection
Detecting Unicode normalization issues requires checking both the application logic and the database schema. middleBrick’s Input Validation scan includes a set of tests that send payloads with various Unicode normalization forms (NFC, NFD, fullwidth ASCII, homoglyphs) to each endpoint and observes whether the response deviates from the expected behavior. If the scanner receives a successful login or data retrieval when a normalized‑variant payload is used, it flags the endpoint as vulnerable.
In addition, middleBrick examines the MongoDB schema when an OpenAPI/Swagger spec is supplied. It looks for:
- Unique indexes that do not specify a collation strong enough to fold the relevant variants. Note that lower strength values fold more distinctions: `strength: 1` compares base characters only, ignoring both case and diacritics.
- Fields used for authentication, authorization, or as keys in application logic that lack explicit validation rules.
- Endpoints that perform direct string comparison (`$eq`) on user‑supplied values without applying a normalization step.
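A scanner of this kind needs a way to derive the probe payloads from a legitimate value. The helper below is a sketch of that idea, not middleBrick's actual implementation: it produces the four standard normalization forms plus a fullwidth variant by shifting printable ASCII into the Halfwidth and Fullwidth Forms block (the name `normalization_variants` is hypothetical).

```python
import unicodedata

def normalization_variants(value):
    """Return the Unicode variants of `value` a black-box probe might send."""
    variants = {form: unicodedata.normalize(form, value)
                for form in ('NFC', 'NFD', 'NFKC', 'NFKD')}
    # Fullwidth variant: printable ASCII (U+0021..U+007E) maps to the
    # Fullwidth Forms block at a fixed offset of 0xFEE0.
    variants['fullwidth'] = ''.join(
        chr(ord(c) + 0xFEE0) if '!' <= c <= '~' else c
        for c in value)
    return variants

print(normalization_variants('admin')['fullwidth'])  # ａｄｍｉｎ
```

Each variant is then sent in place of the legitimate value; any variant that succeeds where the raw value would not indicates a missing normalization step.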
Example of a finding that middleBrick might return:
```json
{
  "endpoint": "/api/v1/login",
  "method": "POST",
  "finding": "Unicode normalization bypass possible",
  "severity": "medium",
  "details": "The endpoint accepts the fullwidth username ‘admin’ (U+FF41 U+FF44 U+FF4D U+FF49 U+FF4E) and authenticates as the user ‘admin’. No normalization is applied before querying the users collection.",
  "remediation": "Normalize usernames to NFC (or NFKC) before querying, or create a unique index with collation { locale: 'en', strength: 1 }."
}
```
Because the scan is unauthenticated and black‑box, it does not require any credentials or agents; it simply sends the crafted payloads and analyses the responses.
MongoDB‑Specific Remediation
Fixing Unicode normalization issues in MongoDB involves two complementary steps: normalizing data at the application layer and, where appropriate, configuring the database to treat canonically equivalent strings as equal.
1. Application‑level normalization

Before storing or comparing any user‑provided string that participates in security decisions (usernames, emails, tokens, etc.), convert it to a single Unicode normalization form. The most common choice is NFC (Canonical Composition); use NFKC (Compatibility Composition) if you also want to fold compatibility characters such as fullwidth letters.
Example in Node.js using the unorm package (on modern runtimes the built‑in `String.prototype.normalize('NFC')` does the same job):

```javascript
const unorm = require('unorm');
const { MongoClient } = require('mongodb');

async function registerUser(rawUsername, rawEmail) {
  const username = unorm.nfc(rawUsername); // canonical composition (NFC)
  const email = unorm.nfc(rawEmail);
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const db = client.db('app');
  await db.collection('users').insertOne({ username, email, createdAt: new Date() });
  await client.close();
}

// Usage
registerUser('admin', 'Usér@example.com');
```
Example in Python with pymongo and the built‑in unicodedata module:
```python
import unicodedata
from pymongo import MongoClient

def normalize(value):
    return unicodedata.normalize('NFC', value)

def register_user(raw_username, raw_email):
    username = normalize(raw_username)
    email = normalize(raw_email)
    client = MongoClient('mongodb://localhost:27017')
    db = client.get_database('app')
    db.users.insert_one({'username': username, 'email': email})
    client.close()

# Usage
register_user('admin', 'Usér@example.com')
```
2. Database‑level collation

If you prefer to keep the raw strings in the database but want queries and unique indexes to treat equivalent strings as identical, create the index with a collation at an appropriate comparison strength. Note that lower strength levels fold more distinctions: strength 1 (primary) compares base characters only, ignoring both diacritics and case; strength 2 (secondary) considers diacritics but ignores case; strength 3 (tertiary, the default) is both case‑ and diacritic‑sensitive.
Example using the MongoDB shell:
db.users.createIndex(
{ username: 1 },
{
unique: true,
collation: { locale: 'en', strength: 2 } // ignore accents, treat e and é as same
}
);
With this index in place, inserting `admin` and then `admín` fails with a duplicate‑key error, preventing the bypass described earlier.
3. Validation and testing

Add unit tests that attempt to register or authenticate with NFC, NFD, fullwidth, and homoglyph variants of legitimate values. Ensure the application either rejects the input (if it does not conform to the allowed character set) or treats all variants as the same account.
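Such tests can be a few lines of pytest‑style assertions. The sketch below assumes the application exposes a single canonicalization helper (here a hypothetical `canonical()` using NFKC, so that both decomposed accents and fullwidth letters fold to the same value):

```python
import unicodedata

def canonical(value):
    # Assumed application policy: NFKC folds decomposed accents and
    # compatibility characters such as fullwidth letters.
    return unicodedata.normalize('NFKC', value)

def test_variants_collapse_to_one_username():
    variants = [
        'us\u00e9r',   # NFC: precomposed é
        'use\u0301r',  # NFD: e + combining acute accent
    ]
    for variant in variants:
        assert canonical(variant) == canonical('us\u00e9r')

def test_fullwidth_folds_to_ascii():
    fullwidth = '\uff41\uff44\uff4d\uff49\uff4e'  # fullwidth 'admin'
    assert canonical(fullwidth) == 'admin'

# Run the checks directly (or via pytest)
test_variants_collapse_to_one_username()
test_fullwidth_folds_to_ascii()
```

If the application instead rejects non‑ASCII usernames outright, the same tests should assert that every variant is refused, not silently accepted.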
By combining application‑level normalization with, where needed, a collation‑aware unique index, you eliminate the attack surface that Unicode normalization introduces in MongoDB.