Severity: HIGH. Tags: unicode normalization, flask, cockroachdb

Unicode Normalization in Flask with CockroachDB

Unicode Normalization in Flask with CockroachDB: how this specific combination creates or exposes the vulnerability

Inconsistent Unicode normalization between Flask request handling and CockroachDB string comparison can lead to authentication bypass and data exposure. Flask hands request strings to the application exactly as the client sent them, so unless the application normalizes them they reach SQL queries in whatever form the client chose. CockroachDB, under its default collation, does not normalize either: it stores the code points it receives and compares strings as stored, so two canonically equivalent strings are treated as different values.

For example, the character 'é' can be represented as the single code point U+00E9 or as the decomposed sequence 'e' + U+0301. The two render identically but are not equal as stored values. If some code paths normalize and others do not, an attacker can supply the other representation to register a visually identical username, or to make an application-side check disagree with the row CockroachDB returns. This becomes an IDOR-related issue when user-controlled identifiers such as usernames or API keys are involved.
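The difference between the two representations can be seen directly in Python. This is a small illustration using only the standard library; the variable names are chosen for this example.

```python
import unicodedata

composed = "\u00e9"        # 'é' as a single precomposed code point
decomposed = "e\u0301"     # 'e' followed by COMBINING ACUTE ACCENT

# The two strings render identically but are different code point sequences.
print(composed == decomposed)            # False
print(len(composed), len(decomposed))    # 1 2
print(composed.encode("utf-8"))          # b'\xc3\xa9'
print(decomposed.encode("utf-8"))        # b'e\xcc\x81'

# After normalizing both to NFC they compare equal.
nfc_composed = unicodedata.normalize("NFC", composed)
nfc_decomposed = unicodedata.normalize("NFC", decomposed)
print(nfc_composed == nfc_decomposed)    # True
```

Any comparison that works on the raw code points or bytes will see two different values unless both sides have been normalized first.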

The combination is risky because:

  • Flask may pass raw Unicode strings to CockroachDB via SQL queries or ORM layers without normalization.
  • CockroachDB compares strings as stored, without normalizing them, so canonically equivalent values do not match and what the application expects can differ from what the database returns.
  • Search and filtering operations may return multiple records or incorrect records, enabling privilege escalation or data leakage.

In security testing, this pattern surfaces in the BOLA/IDOR and Input Validation checks. An unauthenticated attacker could enumerate users by supplying canonically equivalent but non-identical Unicode strings and observing how the application's responses differ depending on how CockroachDB matches them.
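When probing an endpoint for this behavior, it helps to generate the equivalent spellings of a test value. The following is a minimal sketch; canonical_variants is a hypothetical helper name, and sending each variant to the endpoint and comparing the responses is left to the tester.

```python
import unicodedata
from typing import Set

def canonical_variants(value: str) -> Set[str]:
    """Return the distinct NFC and NFD spellings of value (hypothetical test helper)."""
    return {unicodedata.normalize(form, value) for form in ("NFC", "NFD")}

variants = canonical_variants("Jos\u00e9")
for v in sorted(variants, key=len):
    # Print each spelling as a list of code points so the difference is visible.
    print([f"U+{ord(ch):04X}" for ch in v])
```

If the application answers differently for two spellings of the same name, for example "user exists" for one and "not found" for the other, normalization is probably not applied consistently.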

CockroachDB-Specific Remediation in Flask: concrete code fixes

Remediation focuses on normalizing Unicode consistently before any string is sent to CockroachDB, and on validating input against the expected canonical form. Use Python's unicodedata module to normalize incoming data, and apply the same normalization to any string literals used in SQL statements.
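Instead of silently normalizing, an application can reject input that is not already in canonical form. The following is a minimal sketch of that validation, assuming Python 3.8 or later for unicodedata.is_normalized; require_nfc is a hypothetical helper name.

```python
import unicodedata

def require_nfc(value: object, field: str) -> str:
    """Return value unchanged if it is an NFC string; otherwise raise ValueError."""
    if not isinstance(value, str):
        raise ValueError(f"{field} must be a string")
    if not unicodedata.is_normalized("NFC", value):
        raise ValueError(f"{field} is not in NFC form")
    return value

print(require_nfc("Jos\u00e9", "username"))   # accepted: already NFC
try:
    require_nfc("Jose\u0301", "username")     # decomposed spelling
except ValueError as exc:
    print(exc)
```

Rejecting is stricter than normalizing and may inconvenience clients that legitimately send decomposed text, so which approach fits depends on the application.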

Example: Normalizing user input before database operations

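import unicodedata
from flask import Flask, request, jsonify
import psycopg2

app = Flask(__name__)

def normalize_unicode(value: str) -> str:
    """Normalize to NFC form, recommended for consistent storage and comparison."""
    return unicodedata.normalize('NFC', value)

@app.route('/login', methods=['POST'])
def login():
    # silent=True returns None instead of raising on a missing or malformed JSON body
    data = request.get_json(silent=True)
    if not isinstance(data, dict):
        return jsonify({'status': 'invalid credentials'}), 401
    raw_username = data.get('username', '')
    raw_password = data.get('password', '')
    # Reject non-string values, which unicodedata.normalize cannot handle
    if not isinstance(raw_username, str) or not isinstance(raw_password, str):
        return jsonify({'status': 'invalid credentials'}), 401
    username = normalize_unicode(raw_username)
    password = normalize_unicode(raw_password)

    conn = psycopg2.connect(
        host='your-cockroachdb-host',
        port=26257,
        dbname='yourdb',
        user='youruser',
        password='yourpassword'
    )
    try:
        cur = conn.cursor()
        # Use parameterized queries to avoid SQL injection.
        # Assumes crypt() is available in your CockroachDB version, and that stored
        # usernames and the passwords behind password_hash were normalized to NFC too.
        cur.execute(
            'SELECT id, username FROM users WHERE username = %s AND password_hash = crypt(%s, password_hash)',
            (username, password)
        )
        user = cur.fetchone()
        cur.close()
    finally:
        # Always release the connection, even if the query raises
        conn.close()

    if user:
        return jsonify({'status': 'ok', 'user_id': user[0]})
    return jsonify({'status': 'invalid credentials'}), 401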

Example: Normalizing identifiers in API endpoints

When identifiers stored in CockroachDB, such as tenant IDs or API keys, come from the request, normalize them before constructing queries:

import unicodedata
from flask import Flask, g, jsonify, request
import psycopg2

app = Flask(__name__)

def normalize_identifier(value: str) -> str:
    return unicodedata.normalize('NFC', value)

@app.before_request
def resolve_tenant():
    raw_tenant_id = request.headers.get('X-Tenant-ID', '')
    g.tenant_id = normalize_identifier(raw_tenant_id)

@app.route('/data')
def get_tenant_data():
    conn = psycopg2.connect(
        host='your-cockroachdb-host',
        port=26257,
        dbname='yourdb',
        user='youruser',
        password='yourpassword'
    )
    cur = conn.cursor()
    cur.execute(
        'SELECT sensitive_info FROM tenant_data WHERE tenant_id = %s',
        (g.tenant_id,)
    )
    result = cur.fetchone()
    cur.close()
    conn.close()
    if result:
        return jsonify({'data': result[0]})
    return jsonify({'error': 'not found'}), 404

Database-side considerations

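CockroachDB stores text in the normalization form provided at insertion and does not normalize it on write. Queries that compare normalized input against non-normalized stored data will fail to match, so rows written before normalization was introduced need the same treatment as new input. Therefore, ensure that: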

  • All incoming strings are normalized to a consistent form (typically NFC) in the application layer before any database operation.
  • Any search or comparison involving user-controlled strings applies the same normalization to both sides of the comparison.
  • If you rely on ORM behavior, verify that the ORM does not alter Unicode representation before sending queries to CockroachDB.
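Existing rows that were stored before normalization was enforced have to be brought into the same form, and doing so can reveal accounts that collapse to the same name. The following is a minimal sketch of planning such a backfill in the application layer. The function plan_backfill and the (id, username) row shape are assumptions for illustration; fetching the rows and applying the updates with parameterized queries is left out.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def plan_backfill(rows: Iterable[Tuple[int, str]]):
    """Given (id, username) rows, return (updates, collisions).

    updates:    list of (id, nfc_username) for rows whose stored value is not NFC
    collisions: dict mapping an NFC username to the ids that would share it
    """
    by_nfc: Dict[str, List[int]] = defaultdict(list)
    updates: List[Tuple[int, str]] = []
    for row_id, username in rows:
        nfc = unicodedata.normalize("NFC", username)
        by_nfc[nfc].append(row_id)
        if nfc != username:
            updates.append((row_id, nfc))
    collisions = {name: ids for name, ids in by_nfc.items() if len(ids) > 1}
    return updates, collisions

rows = [(1, "Jos\u00e9"), (2, "Jose\u0301"), (3, "alice")]
updates, collisions = plan_backfill(rows)
print(updates)      # [(2, 'José')]
print(collisions)   # {'José': [1, 2]}
```

Collisions should be resolved manually before applying the updates, since two distinct accounts would otherwise end up with the same username or violate a unique constraint.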

These steps reduce the risk of authentication bypass, IDOR, and inconsistent authorization checks that depend on string equality in CockroachDB.

Frequently Asked Questions

Does middleBrick detect Unicode normalization issues in Flask applications using CockroachDB?
Yes, middleBrick's Input Validation checks can identify inconsistent Unicode handling that may lead to authentication bypass or IDOR when Flask interacts with CockroachDB.

Can the GitHub Action fail builds if Unicode normalization issues are found?
Yes, by configuring the GitHub Action with a security score threshold, builds can be automatically failed when findings related to input validation and data handling are detected.