Character encoding is the foundation of all digital text. Whether you're debugging garbled emails, fixing database collation, or figuring out why your JSON file breaks on certain characters, understanding ASCII, Unicode, and UTF-8 is essential. This comprehensive guide explains how text is represented in computers, the evolution from ASCII to Unicode, and why UTF-8 encoding became the universal standard for the modern web.
Convert between binary and text with our Binary-Text Converter →
1. ASCII — The 7-Bit Foundation
ASCII (American Standard Code for Information Interchange) was published in 1963 and uses 7 bits to represent 128 characters (0–127). It includes 33 control characters (0–31 and 127) and 95 printable characters (32–126).
ASCII was designed for English text and teletype machines. Every character fits in a single byte, with the high bit (bit 7) unused or used for parity checking.
The 95 printable characters include uppercase letters (A–Z), lowercase letters (a–z), digits (0–9), the space, and 32 punctuation/symbol characters. Control characters include NUL (0), TAB (9), LF (10), CR (13), and ESC (27).
| Decimal | Hex | Character | Description |
|---|---|---|---|
0 | 00 | NUL | Null character |
9 | 09 | TAB | Horizontal tab |
10 | 0A | LF | Line feed (newline) |
13 | 0D | CR | Carriage return |
27 | 1B | ESC | Escape |
32 | 20 | (space) | Space |
48 | 30 | 0 | Digit zero |
57 | 39 | 9 | Digit nine |
65 | 41 | A | Uppercase A |
90 | 5A | Z | Uppercase Z |
97 | 61 | a | Lowercase a |
122 | 7A | z | Lowercase z |
127 | 7F | DEL | Delete |
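These code values are easy to check from any language; a quick Python illustration:

```python
# ASCII code points map directly to Python's ord()/chr():
print(ord('A'))   # 65
print(chr(90))    # Z
print(ord('0'))   # 48
# Upper- and lowercase letters differ by exactly 32 (bit 5),
# which is why ASCII case conversion is a single bit flip:
print(chr(ord('A') | 0x20))  # a
```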
2. Extended ASCII — The Code Page Chaos
Since ASCII uses only 7 bits, the 8th bit in a byte (values 128–255) was free. Different systems used this range for different characters, creating code pages — incompatible extensions of ASCII.
ISO 8859-1 (Latin-1) added Western European characters like é, ü, ñ, and å. It was the default encoding for HTTP/1.1 and early HTML. Windows-1252 is Microsoft's superset of ISO 8859-1, adding characters like curly quotes (“ ”), em dash (—), and the euro sign (€) in the range 128–159.
The fundamental problem: byte 0xE9 means é in Latin-1 and Windows-1252 but й in Windows-1251 (Cyrillic). Without knowing which code page was used, you cannot correctly decode text. This is why "mojibake" (garbled text) was rampant in the early web.
Other notable code pages include ISO 8859-5 (Cyrillic), ISO 8859-15 (Latin-9, replaced Latin-1 with the euro sign), Shift_JIS and EUC-JP (Japanese), Big5 (Traditional Chinese), and GB2312/GBK (Simplified Chinese).
# Same byte, different interpretations:
Byte: 0xE9
Latin-1 (ISO 8859-1): é (LATIN SMALL LETTER E WITH ACUTE)
Windows-1251 (Cyrillic): й (CYRILLIC SMALL LETTER SHORT I)
Windows-874 (Thai): ้ (THAI CHARACTER MAI THO)
# This is why encoding metadata is critical!

3. Unicode — The Universal Character Set
Unicode is a universal character set that assigns a unique code point to every character in every script. Code points are written as U+ followed by 4–6 hex digits (e.g., U+0041 = A, U+4E16 = 世, U+1F600 = 😀).
Unicode currently defines over 154,000 characters covering 168 scripts, from Latin and Cyrillic to Egyptian hieroglyphs and emoji. The maximum code point is U+10FFFF, giving a theoretical limit of 1,114,112 code points.
Unicode is organized into 17 **planes**, each containing 65,536 code points:
| Plane | Name | Code Point Range | Content |
|---|---|---|---|
0 | Basic Multilingual Plane (BMP) | U+0000 - U+FFFF | Latin, Cyrillic, Greek, CJK, Arabic, Hebrew, most symbols |
1 | Supplementary Multilingual Plane (SMP) | U+10000 - U+1FFFF | Emoji, historic scripts, musical notation, math symbols |
2 | Supplementary Ideographic Plane (SIP) | U+20000 - U+2FFFF | Rare CJK ideographs |
3 | Tertiary Ideographic Plane (TIP) | U+30000 - U+3FFFF | Rare CJK ideographs (Extensions G and H)
4-13 | Unassigned | U+40000 - U+DFFFF | Reserved for future use
14 | Supplementary Special-purpose Plane (SSP) | U+E0000 - U+EFFFF | Tag characters, variation selectors |
15-16 | Private Use Areas | U+F0000 - U+10FFFF | Custom characters (not standardized) |
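Because each plane holds exactly 0x10000 code points, the plane of any character is just its code point shifted right by 16 bits. A quick Python illustration:

```python
# Determine which Unicode plane a character lives in.
# Each plane holds 0x10000 code points, so plane = code point >> 16.
for ch in ('A', '世', '😀'):
    cp = ord(ch)
    print(f"U+{cp:04X} -> plane {cp >> 16}")
# U+0041 -> plane 0
# U+4E16 -> plane 0
# U+1F600 -> plane 1
```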
Key distinction: Unicode is a character set (mapping numbers to characters), not an encoding. The encoding determines how those numbers are stored as bytes. This is where UTF-8, UTF-16, and UTF-32 come in.
# Unicode code points are abstract numbers:
U+0041 → A (Latin capital letter A)
U+00E9 → é (Latin small letter e with acute)
U+4E16 → 世 (CJK Unified Ideograph - "world")
U+1F600 → 😀 (Grinning face emoji)
# These are just numbers — the encoding determines
# how they become bytes in memory or files.

4. UTF-8 Encoding — Variable-Length Brilliance
UTF-8 (Unicode Transformation Format — 8-bit) is a variable-length encoding that uses 1 to 4 bytes per character. It was designed by Ken Thompson and Rob Pike in 1992 and is now the dominant encoding on the web (used by over 98% of websites).
UTF-8's design is elegant: it's backward compatible with ASCII (any valid ASCII file is also valid UTF-8), it's self-synchronizing (you can find character boundaries from any position), and it never produces zero bytes in non-NUL characters (safe for C strings).
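The self-synchronizing property is easy to demonstrate: continuation bytes always match the bit pattern 10xxxxxx (0x80–0xBF), so from any byte offset you can scan backward to a character boundary. A minimal Python sketch (the helper name is my own):

```python
def char_start_before(data: bytes, pos: int) -> int:
    """Back up from an arbitrary byte offset to the start of a UTF-8 character.
    Continuation bytes always match 10xxxxxx, i.e. (byte & 0xC0) == 0x80."""
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos

data = "café".encode('utf-8')   # bytes: 63 61 66 c3 a9
print(char_start_before(data, 4))  # 3 -- byte 4 (0xA9) continues the é at offset 3
```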
The encoding algorithm uses a prefix system to indicate the number of bytes:
| Bytes | Code Point Range | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|
1 | U+0000 - U+007F | 0xxxxxxx | - | - | - |
2 | U+0080 - U+07FF | 110xxxxx | 10xxxxxx | - | - |
3 | U+0800 - U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | - |
4 | U+10000 - U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
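This table translates directly into code. Below is a from-scratch Python sketch of the algorithm for illustration only; in practice you would always use the built-in str.encode('utf-8'):

```python
def encode_utf8(cp: int) -> bytes:
    """Encode a single Unicode code point to UTF-8 (illustrative, no error checks)."""
    if cp < 0x80:                       # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(encode_utf8(0x00E9).hex())   # c3a9
print(encode_utf8(0x4E16).hex())   # e4b896
print(encode_utf8(0x1F600).hex())  # f09f9880
```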
Let's trace through encoding the character é (U+00E9, Latin small letter e with acute):
# Encoding é (U+00E9) to UTF-8:
# Step 1: 0x00E9 = 0000 0000 1110 1001 in binary
# Step 2: U+00E9 falls in range U+0080–U+07FF → 2 bytes
# Step 3: Template: 110xxxxx 10xxxxxx
# Step 4: Fill in bits from right:
# 00E9 = 000 1110 1001
# Byte 1: 110_00011 = 0xC3
# Byte 2: 10_101001 = 0xA9
# Result: é = C3 A9 (2 bytes in UTF-8)
echo -n "é" | xxd
# Output: 00000000: c3a9

And a more complex example — the Chinese character 世 (U+4E16, meaning "world"):
# Encoding 世 (U+4E16) to UTF-8:
# Step 1: 0x4E16 = 0100 1110 0001 0110 in binary
# Step 2: U+4E16 falls in range U+0800–U+FFFF → 3 bytes
# Step 3: Template: 1110xxxx 10xxxxxx 10xxxxxx
# Step 4: Fill in bits:
# 4E16 = 0100 1110 0001 0110
# Byte 1: 1110_0100 = 0xE4
# Byte 2: 10_111000 = 0xB8
# Byte 3: 10_010110 = 0x96
# Result: 世 = E4 B8 96 (3 bytes in UTF-8)
echo -n "世" | xxd
# Output: 00000000: e4b8 96

For emoji like 😀 (U+1F600, grinning face):
# Encoding 😀 (U+1F600) to UTF-8:
# Step 1: 0x1F600 = 0001 1111 0110 0000 0000 in binary
# Step 2: U+1F600 falls in range U+10000–U+10FFFF → 4 bytes
# Step 3: Template: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
# Step 4: Fill in bits:
# 1F600 = 0 0001 1111 0110 0000 0000
# Byte 1: 11110_000 = 0xF0
# Byte 2: 10_011111 = 0x9F
# Byte 3: 10_011000 = 0x98
# Byte 4: 10_000000 = 0x80
# Result: 😀 = F0 9F 98 80 (4 bytes in UTF-8)
echo -n "😀" | xxd
# Output: 00000000: f09f 9880

5. UTF-16 and UTF-32 — Alternatives to UTF-8
UTF-16 uses 2 or 4 bytes per character. Characters in the BMP (U+0000–U+FFFF) use 2 bytes. Characters outside the BMP use a surrogate pair — two 16-bit code units.
Surrogate pairs work as follows: subtract 0x10000 from the code point, split the 20 resulting bits into a high surrogate (0xD800–0xDBFF) and a low surrogate (0xDC00–0xDFFF).
For example, 😀 (U+1F600): 0x1F600 - 0x10000 = 0xF600. High 10 bits: 0x3D → 0xD800 + 0x3D = 0xD83D. Low 10 bits: 0x200 → 0xDC00 + 0x200 = 0xDE00. So 😀 in UTF-16 is D83D DE00.
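The same arithmetic in Python (the helper name is illustrative):

```python
def to_surrogates(cp):
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    v = cp - 0x10000            # 20 bits remain after subtracting 0x10000
    high = 0xD800 + (v >> 10)   # top 10 bits -> high surrogate
    low = 0xDC00 + (v & 0x3FF)  # bottom 10 bits -> low surrogate
    return high, low

high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00
```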
Where UTF-16 is used: JavaScript strings (String.charCodeAt() returns UTF-16 code units), Java char type, Windows APIs (wchar_t), and .NET System.String.
// JavaScript uses UTF-16 internally
const emoji = "😀";
// .length counts UTF-16 code units, NOT characters
console.log(emoji.length); // 2 (surrogate pair)
console.log(emoji.charCodeAt(0)); // 55357 (0xD83D - high surrogate)
console.log(emoji.charCodeAt(1)); // 56832 (0xDE00 - low surrogate)
// Use codePointAt() for actual Unicode code points
console.log(emoji.codePointAt(0)); // 128512 (0x1F600)
// Use spread or Array.from for correct character counting
console.log([...emoji].length); // 1 (correct!)
console.log(Array.from(emoji).length); // 1
// String.fromCodePoint handles supplementary characters
console.log(String.fromCodePoint(0x1F600)); // 😀

UTF-32 uses exactly 4 bytes per character. It's the simplest encoding (code point = stored value) but wastes space for Latin text (4x the size of ASCII). It's used internally in some programs for easy character indexing but rarely for storage or transmission.
| Encoding | Size | Primary Use |
|---|---|---|
| UTF-8 | 1-4 bytes/char | Web, files, APIs, JSON, databases |
| UTF-16 | 2 or 4 bytes/char | JavaScript, Java, Windows, .NET |
| UTF-32 | 4 bytes/char (fixed) | Internal processing, random access |
6. BOM (Byte Order Mark)
The Byte Order Mark (BOM) is a special Unicode character U+FEFF placed at the beginning of a file to indicate its encoding and byte order.
For UTF-16 and UTF-32, the BOM is essential — it tells the reader whether the file uses big-endian or little-endian byte order.
| Encoding | BOM Bytes | Description |
|---|---|---|
| UTF-8 | EF BB BF | Optional (discouraged) |
| UTF-16 BE | FE FF | Big-endian byte order |
| UTF-16 LE | FF FE | Little-endian byte order |
| UTF-32 BE | 00 00 FE FF | Big-endian byte order |
| UTF-32 LE | FF FE 00 00 | Little-endian byte order |
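Python makes the byte-order effect easy to observe: the plain 'utf-16' codec prepends a BOM and uses the platform's native byte order, while the explicit LE/BE variants emit no BOM:

```python
text = "A"
# Plain 'utf-16' prepends a BOM (native byte order; shown for a little-endian machine):
print(text.encode('utf-16').hex(' '))     # ff fe 41 00
# The -le / -be variants fix the byte order and emit no BOM:
print(text.encode('utf-16-le').hex(' '))  # 41 00
print(text.encode('utf-16-be').hex(' '))  # 00 41
```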
The UTF-8 BOM controversy: UTF-8 has no byte order issue (bytes are always in the same order), so a BOM is technically unnecessary. Microsoft tools (Notepad, Excel) add a UTF-8 BOM (EF BB BF) by default, which can cause problems:
- PHP scripts may output the BOM before headers, causing "headers already sent" errors
- Shell scripts may fail if the shebang line is preceded by BOM bytes
- JSON parsers may reject files starting with a BOM (RFC 8259 forbids it)
- Concatenating files can embed BOMs in the middle of the output
Recommendation: Do not use BOM for UTF-8 files unless required by a specific tool (e.g., Excel CSV import). Most modern editors and systems handle UTF-8 without a BOM.
# Check for BOM in a file:
xxd file.txt | head -1
# UTF-8 BOM: 00000000: efbb bf...
# Remove UTF-8 BOM with GNU sed (BSD/macOS sed needs -i '' and lacks \x escapes):
sed -i '1s/^\xEF\xBB\xBF//' file.txt
# Remove BOM with Python:
with open('file.txt', 'rb') as f:
    content = f.read()
if content.startswith(b'\xef\xbb\xbf'):
    content = content[3:]
with open('file.txt', 'wb') as f:
    f.write(content)

7. Encoding in HTML
Correct encoding declaration ensures browsers render your pages correctly. There are three ways the browser determines encoding:
1. HTTP Content-Type header:
Content-Type: text/html; charset=utf-8

2. HTML meta tag (must be within the first 1024 bytes):
<!-- HTML5 (recommended) -->
<meta charset="UTF-8">
<!-- HTML4 / XHTML equivalent -->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

3. HTML character references for individual characters:
<!-- Named entity -->
<p>Copyright &copy; 2025</p>
<!-- Decimal numeric entity -->
<p>&#169; = &#233; = &#19990;</p>  <!-- renders as: © = é = 世 -->
<!-- Hex numeric entity -->
<p>&#xA9; = &#xE9; = &#x4E16;</p>  <!-- renders as: © = é = 世 -->
<!-- Emoji via hex entity -->
<p>&#x1F600; = 😀</p>

Priority order: BOM > HTTP header > meta tag > encoding sniffing (per the WHATWG HTML standard, a byte order mark, when present, overrides the other declarations). Always set both the HTTP header and meta tag for maximum reliability.
For HTML5, the <meta charset="UTF-8"> declaration should always be the first element inside <head> to ensure it's parsed before any text content.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8"> <!-- Must be first! -->
<title>My Page</title>
</head>
<body>
<p>café 世界 😀</p> <!-- All render correctly with UTF-8 -->
</body>
</html>

8. Encoding in Programming Languages
Different languages handle string encoding differently. Here's how the major languages work:
Python 3 distinguishes between str (text, Unicode) and bytes (raw bytes). All strings are Unicode by default:
# Python 3: str is Unicode, bytes is raw bytes
text = "café 世界 😀" # str (Unicode)
encoded = text.encode('utf-8') # bytes
decoded = encoded.decode('utf-8') # back to str
print(type(text)) # <class 'str'>
print(type(encoded)) # <class 'bytes'>
print(encoded) # b'caf\xc3\xa9 \xe4\xb8\x96\xe7\x95\x8c \xf0\x9f\x98\x80'
print(len(text)) # 9 (characters)
print(len(encoded)) # 17 (bytes)
# Handling encoding errors
bad_bytes = b'\xc3\x28' # Invalid UTF-8
# text = bad_bytes.decode('utf-8') # UnicodeDecodeError!
text = bad_bytes.decode('utf-8', errors='replace') # '\ufffd('
text = bad_bytes.decode('utf-8', errors='ignore') # '('
# Reading files with explicit encoding
with open('file.txt', 'r', encoding='utf-8') as f:
    content = f.read()

JavaScript / TypeScript uses UTF-16 internally. The TextEncoder and TextDecoder APIs handle encoding conversion:
// JavaScript: strings are UTF-16, use TextEncoder for UTF-8
const text = "café 世界 😀";
// TextEncoder converts string → UTF-8 bytes
const encoder = new TextEncoder();
const bytes = encoder.encode(text);
console.log(bytes); // Uint8Array(17) [99, 97, 102, ...]
console.log(bytes.length); // 17 bytes
// TextDecoder converts UTF-8 bytes → string
const decoder = new TextDecoder('utf-8');
const decoded = decoder.decode(bytes);
console.log(decoded); // "café 世界 😀"
// Handling errors
const badBytes = new Uint8Array([0xC3, 0x28]);
const strictDecoder = new TextDecoder('utf-8', { fatal: true });
try {
strictDecoder.decode(badBytes); // throws TypeError
} catch (e) {
console.log('Invalid UTF-8 sequence');
}
// Correct character counting with Intl.Segmenter
const segmenter = new Intl.Segmenter();
const graphemes = [...segmenter.segment(text)];
console.log(graphemes.length); // 9 (correct grapheme clusters)

Go uses the rune type (alias for int32) to represent Unicode code points. Strings are UTF-8 byte sequences:
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	text := "café 世界 😀"
	// len() returns byte count
	fmt.Println(len(text)) // 17 bytes
	// utf8.RuneCountInString() returns character count
	fmt.Println(utf8.RuneCountInString(text)) // 9 runes
	// Range over string iterates by rune (not byte)
	for i, r := range text {
		fmt.Printf("byte %d: U+%04X %c (%d bytes)\n",
			i, r, r, utf8.RuneLen(r))
	}
	// byte 0: U+0063 c (1 bytes)
	// byte 1: U+0061 a (1 bytes)
	// byte 2: U+0066 f (1 bytes)
	// byte 3: U+00E9 é (2 bytes)
	// byte 5: U+0020   (1 bytes)
	// byte 6: U+4E16 世 (3 bytes)
	// byte 9: U+754C 界 (3 bytes)
	// byte 12: U+0020   (1 bytes)
	// byte 13: U+1F600 😀 (4 bytes)
}

9. Common Encoding Errors
Mojibake (literally "character transformation" in Japanese) is garbled text caused by decoding bytes with the wrong encoding. Here are common patterns and fixes:
Common mojibake patterns and their causes:
- é instead of é — UTF-8 bytes interpreted as Latin-1
- ä¸– instead of 世 — UTF-8 bytes interpreted as Latin-1 (3-byte character)
- ��� (replacement characters) — invalid byte sequences in UTF-8 decoding
- ??? (question marks) — characters not representable in target encoding
# Why does "é" become "é"?
# UTF-8 encodes é as two bytes: C3 A9
# If those bytes are read as Latin-1:
# C3 → Ã
# A9 → ©
# Result: "é" instead of "é"
# Why does "世" become "ä¸–"?
# UTF-8 encodes 世 as three bytes: E4 B8 96
# If those bytes are read as Latin-1:
# E4 → ä
# B8 → ¸
# 96 → (C1 control character; shown as – in Windows-1252)
# Result: "ä¸–" instead of "世"

Debugging tools:
- xxd or hexdump — inspect raw bytes in a file
- file -i filename (Linux/macOS) — detect file encoding
- iconv -f FROM -t TO filename — convert between encodings
- chardet / chardetect (Python library) — auto-detect encoding
Common fix with iconv:
# Convert from Latin-1 to UTF-8:
iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt
# Convert from Windows-1252 to UTF-8:
iconv -f WINDOWS-1252 -t UTF-8 input.txt > output.txt
# List all available encodings:
iconv -l
# Detect encoding with chardet (Python):
pip install chardet
chardetect file.txt
# file.txt: utf-8 with confidence 0.99

In Python, you can repair double-encoded text:
# Fix double-encoded UTF-8 in Python:
# If UTF-8 bytes were incorrectly decoded as Latin-1 then re-encoded
broken = "café" # é was double-encoded
fixed = broken.encode('latin-1').decode('utf-8')
print(fixed) # "café"
# Using chardet to auto-detect encoding:
import chardet
with open('mystery.txt', 'rb') as f:
raw = f.read()
result = chardet.detect(raw)
print(result)
# {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
text = raw.decode(result['encoding'])

10. Emoji Encoding
Emoji are Unicode characters, mostly in the Supplementary Multilingual Plane (U+1F000–U+1FFFF) and other supplementary planes. In UTF-8, most emoji require 4 bytes.
# Simple emoji: single code point
😀 = U+1F600
UTF-8: F0 9F 98 80 (4 bytes)
UTF-16: D83D DE00 (4 bytes, surrogate pair)

ZWJ (Zero Width Joiner) sequences combine multiple emoji into a single glyph using U+200D. For example, the family emoji is composed of individual person emoji joined with ZWJ:
# ZWJ (Zero Width Joiner) sequences:
# 👨👩👧👦 = Man + ZWJ + Woman + ZWJ + Girl + ZWJ + Boy
# U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466
# In JavaScript:
const family = "👨👩👧👦";
console.log(family.length); // 11 (UTF-16 code units!)
console.log([...family].length); // 7 (code points, still wrong!)
// Correct: use Intl.Segmenter
const seg = new Intl.Segmenter();
console.log([...seg.segment(family)].length); // 1 (one grapheme cluster)
# In Python:
import unicodedata
family = "👨\u200D👩\u200D👧\u200D👦"
print(len(family)) # 7 code points
# For grapheme clusters, use the 'grapheme' library

Variation selectors modify the presentation of a character. U+FE0F (VS16) requests emoji presentation, while U+FE0E (VS15) requests text presentation:
# Variation selectors change presentation:
# ❤ (U+2764) + U+FE0F → ❤️ (emoji presentation, red heart)
# ❤ (U+2764) + U+FE0E → ❤︎ (text presentation, outline heart)
# ☺ (U+263A) + U+FE0F → ☺️ (emoji style)
# ☺ (U+263A) + U+FE0E → ☺︎ (text style)
# In HTML:
# <span>❤️</span> → red emoji heart
# <span>❤︎</span> → text-style heart

Skin tone modifiers (U+1F3FB–U+1F3FF) follow a base emoji to change skin tone. The base + modifier appear as a single character:
# Skin tone modifiers (Fitzpatrick scale):
# 👋 (U+1F44B) + 🏻 (U+1F3FB) → 👋🏻 (light skin)
# 👋 (U+1F44B) + 🏽 (U+1F3FD) → 👋🏽 (medium skin)
# 👋 (U+1F44B) + 🏿 (U+1F3FF) → 👋🏿 (dark skin)
# Each modifier adds 4 bytes in UTF-8:
# 👋 alone: F0 9F 91 8B (4 bytes)
# 👋🏽 : F0 9F 91 8B F0 9F 8F BD (8 bytes)

Flag sequences use pairs of Regional Indicator Symbols (U+1F1E6–U+1F1FF). Each letter corresponds to a country code:
# Flag emoji use Regional Indicator Symbols:
# 🇺🇸 = U+1F1FA (RI U) + U+1F1F8 (RI S) = US flag
# 🇯🇵 = U+1F1EF (RI J) + U+1F1F5 (RI P) = JP flag
# 🇩🇪 = U+1F1E9 (RI D) + U+1F1EA (RI E) = DE flag
# Each regional indicator is 4 bytes in UTF-8
# So each flag emoji is 8 bytes in UTF-8

Because of these combining mechanisms, a single visible "character" (grapheme cluster) can be many Unicode code points and many bytes long. The family emoji with skin tones can be over 25 bytes in UTF-8.
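The byte counts are easy to verify in Python; the plain family emoji (no skin tones) is already 25 bytes:

```python
# Family emoji: 4 person emoji (4 UTF-8 bytes each) + 3 ZWJs (3 bytes each)
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"  # 👨‍👩‍👧‍👦
print(len(family))                   # 7 code points
print(len(family.encode('utf-8')))   # 25 bytes (4*4 + 3*3)
```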
11. Best Practices
Follow these rules to avoid encoding issues in your projects:
- Always use UTF-8. Set it in your editor, your database, your HTTP headers, and your HTML meta tags. There is no reason to use any other encoding for new projects.
- Declare encoding explicitly. Never rely on encoding detection or defaults. Use `<meta charset="UTF-8">` in HTML and `Content-Type: text/html; charset=utf-8` in HTTP headers.
- Database collation matters. Use `utf8mb4` (not `utf8`) in MySQL/MariaDB. MySQL's `utf8` only supports 3-byte characters (no emoji). For collation, use `utf8mb4_unicode_ci` or `utf8mb4_0900_ai_ci`.
- Save files as UTF-8 without BOM. Configure your editor to save in UTF-8. Avoid BOM unless specifically required.
- Handle encoding at the boundary. Decode bytes to strings as early as possible (at input), and encode strings to bytes as late as possible (at output).
- Test with non-ASCII data. Use strings like `"café 世界 😀"` in your test data to catch encoding bugs early.
- Use parameterized queries. Never construct SQL with string concatenation. Parameterized queries handle encoding correctly and prevent SQL injection.
- Normalize Unicode. Use NFC normalization for storage and comparison. The character é can be represented as a single code point (U+00E9) or as e + combining acute accent (U+0065 U+0301). NFC ensures a consistent form.
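The normalization point can be verified with Python's standard unicodedata module:

```python
import unicodedata

composed = "\u00E9"     # é as a single code point (U+00E9)
decomposed = "e\u0301"  # e + combining acute accent (U+0065 U+0301)

print(composed == decomposed)   # False: different code point sequences
print(unicodedata.normalize('NFC', decomposed) == composed)    # True
print(unicodedata.normalize('NFD', composed) == decomposed)    # True
```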
Database creation example with proper encoding:
-- MySQL / MariaDB: Always use utf8mb4
CREATE DATABASE myapp
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
CREATE TABLE users (
id INT PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(255),
bio TEXT
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
-- PostgreSQL: UTF-8 is the default and recommended encoding
CREATE DATABASE myapp
ENCODING 'UTF8'
LC_COLLATE 'en_US.UTF-8'
LC_CTYPE 'en_US.UTF-8';
-- Check current encoding:
-- MySQL: SHOW VARIABLES LIKE 'character_set%';
-- PostgreSQL: SHOW server_encoding;

# .editorconfig — enforce UTF-8 across your project
root = true
[*]
charset = utf-8
end_of_line = lf
insert_final_newline = true
trim_trailing_whitespace = true

12. Frequently Asked Questions
What is the difference between ASCII and UTF-8?
ASCII is a 7-bit character set with 128 characters (English letters, digits, symbols, and control characters). UTF-8 is a variable-length encoding that can represent all 154,000+ Unicode characters. UTF-8 is backward compatible with ASCII — the first 128 UTF-8 characters are identical to ASCII, using a single byte each. Characters beyond ASCII use 2–4 bytes in UTF-8.
Why should I use UTF-8 instead of UTF-16 or UTF-32?
UTF-8 is the most space-efficient for text that's primarily Latin-based (which includes most code, markup, and web content). It's backward compatible with ASCII, so existing ASCII tools work unchanged. It's also the standard for the web (98%+ of websites), JSON (required by RFC 8259), and most modern APIs. UTF-16 is used internally by JavaScript and Java, and UTF-32 is used for easy code point indexing, but neither is recommended for interchange or storage.
How do I fix mojibake (garbled characters) in my text?
First, identify the original encoding by examining the byte pattern. Common patterns: é instead of é means UTF-8 was misread as Latin-1. Fix by re-encoding: in Python, use text.encode("latin-1").decode("utf-8"). For files, use iconv -f WRONG_ENCODING -t utf-8 file.txt. The chardet Python library can auto-detect the original encoding.
What is the difference between Unicode and UTF-8?
Unicode is a character set — a mapping from numbers (code points) to characters. For example, U+0041 = A, U+4E16 = 世. UTF-8 is an encoding — the rules for converting those code points into bytes for storage and transmission. Think of Unicode as the dictionary and UTF-8 as the handwriting style. Other encodings like UTF-16 and UTF-32 can also represent Unicode characters, just using different byte patterns.
Why does MySQL utf8 not support emoji, and what should I use instead?
MySQL's utf8 charset is actually a non-standard 3-byte-max subset of UTF-8. It cannot store characters that require 4 bytes in UTF-8, which includes all emoji (e.g., 😀 U+1F600) and many CJK characters. Use utf8mb4 instead, which supports the full UTF-8 range (1–4 bytes). When creating databases or tables, specify: CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci.
Understanding character encoding is fundamental to building reliable software. The evolution from ASCII to Unicode to UTF-8 solved the global text representation problem. Always use UTF-8, declare it explicitly, and test with diverse characters to avoid encoding bugs.
Convert text to binary and back with our Binary-Text Converter →