
Unicode Normalization: NFC, NFD, NFKC, NFKD Explained

The same visible character can have multiple Unicode representations. Learn when and how to normalize text to prevent comparison failures and search issues.

The Normalization Problem

Unicode allows the same visual character to be represented in multiple ways. The letter "é" (e with acute accent) can be a single code point (U+00E9) or a combination of "e" (U+0065) plus a combining acute accent (U+0301). The two forms look identical but fail byte-for-byte string comparison.
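A minimal sketch of the problem using Python's standard `unicodedata` module:

```python
import unicodedata

composed = "\u00e9"     # "é" as a single precomposed code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent

# Visually identical, but the underlying code point sequences differ.
print(composed == decomposed)  # False

# Normalizing both sides to the same form makes the comparison succeed.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```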

The Four Normalization Forms

NFC (Canonical Decomposition, followed by Canonical Composition) composes characters where possible. It is the most common form on the web and what most applications expect. NFD (Canonical Decomposition) decomposes composed characters into base characters plus combining marks. NFKC and NFKD additionally apply compatibility mappings: for example, converting the single Roman-numeral character "Ⅳ" (U+2163) into the two ordinary letters "I" and "V".
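The difference between the canonical forms and the compatibility forms can be seen by running all four on the Roman-numeral character:

```python
import unicodedata

s = "\u2163"  # "Ⅳ": ROMAN NUMERAL FOUR, a single compatibility character

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    # NFC/NFD leave the character alone; NFKC/NFKD replace it with "IV".
    print(form, unicodedata.normalize(form, s))
```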

When to Normalize

Normalize text at system boundaries: when receiving user input, importing data, and before storing in a database. Use NFC for web content (W3C recommendation), database storage, and general-purpose text. Use NFKC for search indexing and string comparison where visual similarity matters more than semantic precision.

Real-World Impact

Without normalization, a search for "café" won't match a stored "café" if one uses the NFC encoding of the accent and the other NFD. File names with accented characters can create "duplicate" files that differ only in normalization form. URLs containing Unicode characters must be normalized to prevent routing failures.
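The filename case can be reproduced directly: two names that render identically but differ in normalization form only compare equal after both are normalized.

```python
import unicodedata

name_a = "caf\u00e9.txt"     # "café.txt" in NFC (precomposed é)
name_b = "cafe\u0301.txt"    # "café.txt" in NFD (e + combining accent)

print(name_a == name_b)  # False: visually identical "duplicates"

# Normalizing both names to one form reveals they are the same file name.
same = unicodedata.normalize("NFC", name_a) == unicodedata.normalize("NFC", name_b)
print(same)  # True
```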

Platform Differences

macOS's HFS+ file system normalizes filenames to a variant of NFD, while Windows filenames are typically created in NFC. Copying files between operating systems can therefore change filename byte sequences. Most Linux file systems are normalization-agnostic, which can result in visually identical filenames coexisting in the same directory.
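When handling filenames that cross platforms, it can help to detect which form a string is already in. A small sketch using `unicodedata.is_normalized` (available since Python 3.8):

```python
import unicodedata

# A filename as it might come back from an HFS+ volume: NFD form.
name = "cafe\u0301"

print(unicodedata.is_normalized("NFC", name))  # False
print(unicodedata.is_normalized("NFD", name))  # True

# Convert to NFC before comparing against names from other systems.
portable = unicodedata.normalize("NFC", name)
```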
