The most commonly used encoding for Unicode.

Uses between 1 and 4 bytes to represent a code point.

Compared to UTF-32 and UTF-16, it wastes less space for the Unicode characters with values below 128 (which include the letters, numbers, and punctuation commonly used in English), encoding each of them in a single byte, while still supporting code points with larger values by expanding to a maximum of 4 bytes.
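A quick sketch of this variable-length behavior (the sample characters are illustrative, not from the original text):

```python
# Each code point encodes to between 1 and 4 bytes in UTF-8,
# depending on its value.
for ch in ["a", "é", "€", "🎉"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} ({ch!r}) -> {len(encoded)} byte(s)")
```

The ASCII letter takes 1 byte, while the emoji needs the full 4.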

Unlike UTF-32 and UTF-16, you don’t have to worry about little-endian versus big-endian byte order. UTF-8 also allows you to look at any byte in a sequence and tell whether you are at the start of a character or somewhere in the middle. That means you can’t accidentally read a character incorrectly.
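This works because UTF-8 gives continuation bytes a distinctive bit pattern: every byte in the middle of a character starts with the bits `10`. A small sketch of that check (the helper name is my own, not from the original text):

```python
def is_continuation(byte: int) -> bool:
    # Continuation bytes in UTF-8 have the bit pattern 10xxxxxx,
    # so masking the top two bits identifies them.
    return byte & 0xC0 == 0x80

data = "héllo".encode("utf-8")  # "é" encodes to two bytes
for i, b in enumerate(data):
    kind = "continuation" if is_continuation(b) else "start"
    print(i, f"{b:08b}", kind)
```

Given any byte offset, one such check tells you whether you landed mid-character, so a decoder can always resynchronize.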

The only downside is that you cannot randomly access a string encoded with UTF-8. While you can detect whether you are in the middle of a character, you can’t tell how many characters precede that point. You need to start at the beginning of the string and count.
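Counting characters therefore means scanning the whole byte sequence. One way to sketch it: since exactly one byte per character is a non-continuation byte, counting those gives the character count (the sample string is illustrative):

```python
s = "naïve"  # 5 characters, but "ï" takes 2 bytes in UTF-8
data = s.encode("utf-8")

# Count characters by counting the bytes that start a character,
# i.e. every byte that is not a 10xxxxxx continuation byte.
char_count = sum(1 for b in data if b & 0xC0 != 0x80)
print(len(data), char_count)  # 6 bytes, 5 characters
```

This is an O(n) scan; finding the Nth character requires the same walk, which is the random-access cost the paragraph describes.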

INFO

Fun fact: UTF-8 was invented in 1992 by Ken Thompson and Rob Pike, two of the creators of Go.