Unicode Normalization with Bearer Tokens

How Unicode Normalization Manifests in Bearer Tokens

Unicode normalization vulnerabilities in Bearer tokens occur when authentication systems fail to consistently handle Unicode characters across different normalization forms. When a server accepts a token but processes it in a different Unicode normalization form than the client submitted, attackers can exploit this inconsistency to bypass authentication or access unauthorized resources.

The most common attack pattern involves submitting a token with Unicode characters that have multiple valid representations. For example, the character 'é' can be represented as U+00E9 (Latin small letter e with acute) or as U+0065 U+0301 (e followed by combining acute accent). If an authentication system normalizes tokens inconsistently, an attacker might craft a token that validates as legitimate when normalized one way but fails when normalized another way.
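
For instance, the two representations compare unequal as raw strings but become equivalent once both sides are normalized, which a short JavaScript check makes concrete:

// Two canonically equivalent representations of 'é'
const precomposed = '\u00E9';   // U+00E9 (NFC form)
const decomposed = 'e\u0301';   // U+0065 + U+0301 (NFD form)
console.log(precomposed === decomposed);                                    // false: different code points
console.log(precomposed.normalize('NFC') === decomposed.normalize('NFC')); // true: canonically equivalent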

Consider this attack scenario: An API accepts a Bearer token containing Unicode characters. The client submits the token as-is, but the server normalizes it to NFC (Canonical Decomposition followed by Canonical Composition) before validation. An attacker who knows the original token can submit a version in NFD (Canonical Decomposition) form. If the server's comparison logic normalizes only one side or relies on loose string matching, the NFD version might pass validation even though it is byte-for-byte different from the original token.

// Vulnerable comparison logic
function validateBearerToken(providedToken, storedToken) {
  // Normalizes provided token but not stored token
  const normalizedProvided = providedToken.normalize('NFC');
  return normalizedProvided === storedToken; // storedToken might be in different form
}

// Attack: the stored token contains "\u00e9" (precomposed é, NFC form)
// The attacker submits the same token with "e\u0301" (e + combining acute, NFD form)
// Because only the provided token is normalized, whether the comparison succeeds
// depends on which form the stored token happens to be in, making validation inconsistent

Another manifestation occurs in token generation systems. If a token generation service produces tokens with Unicode characters but doesn't specify the normalization form, different instances might generate tokens in different forms. This creates a situation where a token valid on one server instance fails on another, leading to inconsistent authentication behavior across a distributed system.

Property-based authorization checks are particularly vulnerable. If a Bearer token contains Unicode characters representing user IDs or role identifiers, inconsistent normalization can cause authorization decisions to be bypassed. An attacker might craft a token where the normalized form grants elevated privileges while the original form appears legitimate.
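
As a minimal sketch of how this can go wrong, suppose a hypothetical authorization layer applies NFKC normalization to a role claim after extracting it from the token (the isPrivileged helper and role values below are illustrative, not taken from any specific framework):

// Sketch: NFKC compatibility normalization collides visually distinct identifiers
// U+FF41 is FULLWIDTH LATIN SMALL LETTER A; NFKC maps it to ASCII 'a'
const claimedRole = '\uFF41dmin';   // resembles "admin" but is a different string
console.log(claimedRole === 'admin');                     // false: raw comparison
console.log(claimedRole.normalize('NFKC') === 'admin');   // true: collides after compatibility normalization

// Hypothetical check that normalizes only at authorization time; if issuance compared
// the raw value, an attacker-supplied role is promoted to "admin" here
function isPrivileged(role) {
    return role.normalize('NFKC') === 'admin';
}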

Bearer Token-Specific Detection

Detecting Unicode normalization vulnerabilities in Bearer tokens requires systematic testing across different normalization forms. The most effective approach is to probe API endpoints that accept Bearer tokens by submitting the same token converted to each Unicode normalization form and comparing the responses.

middleBrick's Bearer token security scanning specifically tests for this vulnerability by submitting the same token in multiple normalization forms and observing how the server responds. The scanner sends tokens in NFC, NFD, NFKC, and NFKD forms, then analyzes whether the server treats them differently.

// middleBrick scan output example
{
  "unicode_normalization": {
    "status": "vulnerable",
    "description": "Server accepts Bearer tokens in multiple Unicode normalization forms without consistent validation",
    "severity": "high",
    "remediation": "Implement consistent Unicode normalization using NFC form before token validation",
    "evidence": [
      {
        "submitted_form": "NFC",
        "response_code": 200,
        "response_time": 123
      },
      {
        "submitted_form": "NFD",
        "response_code": 200,
        "response_time": 125
      }
    ]
  }
}

Manual detection involves creating test tokens with Unicode characters and submitting them in different normalization forms. Use tools like Python's unicodedata module or Node.js's String.prototype.normalize() to generate different forms of the same token.

// Manual testing script
import requests
import unicodedata

def test_unicode_normalization(base_url, token):
    forms = {
        'NFC': unicodedata.normalize('NFC', token),
        'NFD': unicodedata.normalize('NFD', token),
        'NFKC': unicodedata.normalize('NFKC', token),
        'NFKD': unicodedata.normalize('NFKD', token)
    }
    
    results = {}
    for form, token_form in forms.items():
        headers = {'Authorization': f'Bearer {token_form}'}
        response = requests.get(f'{base_url}/protected', headers=headers)
        results[form] = {
            'status_code': response.status_code,
            'content': response.text[:200]  # truncate for brevity
        }
    
    return results

# Test with a token containing Unicode characters
original_token = "eyJ0eXAiOiJKV1QiLC...\u00e9..."
results = test_unicode_normalization('https://api.example.com', original_token)
print(results)

Look for inconsistent behavior across normalization forms. If the server returns different responses (200 vs 401, or different error messages) for the same logical token in different forms, you've identified a normalization vulnerability. Pay special attention to timing differences, as some implementations might normalize on-the-fly, causing subtle timing variations that leak information about the token's structure.
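
As a rough probe of that behavior (a sketch assuming Node 18+ with global fetch and a hypothetical /protected endpoint), each normalization form can be submitted while recording the status code and elapsed time; single samples are noisy, so in practice repeat each request and compare medians:

// Rough per-form probe: status code plus elapsed time (Node 18+, hypothetical endpoint)
async function probeForms(baseUrl, token) {
    const results = {};
    for (const form of ['NFC', 'NFD', 'NFKC', 'NFKD']) {
        const start = performance.now();
        const res = await fetch(`${baseUrl}/protected`, {
            headers: { Authorization: `Bearer ${token.normalize(form)}` },
        });
        results[form] = { status: res.status, elapsedMs: Math.round(performance.now() - start) };
    }
    return results; // divergent status codes or consistent timing gaps across forms warrant a closer look
}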

Bearer Token-Specific Remediation

Remediating Unicode normalization vulnerabilities in Bearer tokens requires implementing consistent normalization across the entire authentication and authorization pipeline. The key principle is to normalize all tokens to a single, well-defined form before any validation or processing occurs.

The most secure approach is to normalize all incoming Bearer tokens to NFC (Canonical Decomposition followed by Canonical Composition) form immediately upon receipt. This ensures consistent processing regardless of how the client submitted the token.

// Secure Bearer token validation middleware
function bearerTokenMiddleware(req, res, next) {
    const authHeader = req.headers.authorization;
    if (!authHeader || !authHeader.startsWith('Bearer ')) {
        return res.status(401).json({ error: 'Missing Bearer token' });
    }
    
    // Extract and normalize token
    let token = authHeader.substring(7); // remove 'Bearer '
    token = normalizeBearerToken(token);
    
    // Validate normalized token
    if (!validateToken(token)) {
        return res.status(401).json({ error: 'Invalid token' });
    }
    
    req.token = token;
    next();
}

function normalizeBearerToken(token) {
    // Normalize to NFC form - most compatible and widely accepted
    return token.normalize('NFC');
}

function validateToken(token) {
    // Your validation logic here
    // Always operates on normalized token
    return tokenService.verify(token);
}

For token generation systems, ensure all tokens are generated in a specific normalization form. If your token format allows Unicode characters (which is rare for JWTs but possible in custom token formats), generate them in NFC form and document this requirement for all client implementations.

// Token generation with consistent normalization
const crypto = require('crypto');
const base64url = require('base64url');

function generateSecureBearerToken(payload) {
    const header = { alg: 'HS256', typ: 'JWT' };
    const encodedHeader = base64url.encode(JSON.stringify(header));
    
    // Ensure payload is normalized before encoding
    const normalizedPayload = normalizePayload(payload);
    const encodedPayload = base64url.encode(JSON.stringify(normalizedPayload));
    
    // JWT signatures must be base64url-encoded, not standard base64
    const signature = base64url.encode(
        crypto
            .createHmac('sha256', process.env.JWT_SECRET)
            .update(`${encodedHeader}.${encodedPayload}`)
            .digest()
    );
    
    return `${encodedHeader}.${encodedPayload}.${signature}`;
}

function normalizePayload(payload) {
    // Normalize all string properties in payload
    const normalized = {};
    for (const [key, value] of Object.entries(payload)) {
        normalized[key] = typeof value === 'string' 
            ? value.normalize('NFC') 
            : value;
    }
    return normalized;
}

When storing tokens in databases or caches, normalize them before storage and always compare using constant-time comparison functions to prevent timing attacks. Many Unicode characters can have visually similar representations that differ in byte length, making length-based timing attacks feasible if not properly mitigated.

// Secure token comparison
function constantTimeCompare(val1, val2) {
    if (val1.length !== val2.length) return false;
    
    let result = 0;
    for (let i = 0; i < val1.length; i++) {
        result |= val1.charCodeAt(i) ^ val2.charCodeAt(i);
    }
    return result === 0;
}

// Usage in validation
const normalizedProvided = providedToken.normalize('NFC');
const normalizedStored = storedToken.normalize('NFC');
if (!constantTimeCompare(normalizedProvided, normalizedStored)) {
    return res.status(401).json({ error: 'Invalid token' });
}

Implement comprehensive testing with tokens containing various Unicode characters, including those from different scripts, combining characters, and characters with multiple valid representations. Test your system's behavior with tokens in all four Unicode normalization forms to ensure consistent handling.
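
A minimal consistency check along those lines (a sketch reusing the normalizeBearerToken and constantTimeCompare helpers shown above, with a made-up sample token containing U+00E9) verifies that canonically equivalent submissions are all handled the same way:

// Consistency check across normalization forms (sample token is illustrative)
const storedToken = 'token-\u00E9-sample'.normalize('NFC'); // stored in NFC per the guidance above

for (const form of ['NFC', 'NFD', 'NFKC', 'NFKD']) {
    const submitted = 'token-\u00E9-sample'.normalize(form);
    const accepted = constantTimeCompare(normalizeBearerToken(submitted), storedToken);
    console.log(form, accepted); // every form should be accepted, since é has only canonical decompositions
}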

Frequently Asked Questions

Why does Unicode normalization matter for Bearer tokens if most tokens are base64-encoded?
While JWTs and many Bearer tokens are base64-encoded (which typically avoids Unicode issues), custom token formats, API keys, or tokens containing user identifiers may include Unicode characters. Additionally, base64 encoding doesn't guarantee consistent handling across all systems, especially when tokens are decoded for validation or when systems process token metadata containing Unicode characters.

Which Unicode normalization form should I use for Bearer token validation?
NFC (Canonical Decomposition followed by Canonical Composition) is the recommended form for most applications. It's the most widely supported form, provides good compatibility across systems, and represents characters in a composed form that's typically more compact. The key is consistency—once you choose a form, apply it uniformly across your entire authentication pipeline.