Article Link: https://www.reedbeta.com/blog/programmers-intro-to-unicode/
The goal of Unicode is to represent all of the world’s writing systems. In the Unicode Consortium’s words: “Enabling people around the world to use computers in any language”.
Unicode Codespace
The elements of Unicode (its characters) are called code points. Code points are identified by number, written in hexadecimal with the prefix “U+”. Each code point has a short name, and quite a few other properties, specified in the Unicode Character Database.
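As a small illustration of the "U+" notation and of code point names, here is a minimal Python sketch. It assumes the standard-library `unicodedata` module, which exposes part of the Unicode Character Database; the helper name `describe` is made up for these notes.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character (hypothetical helper)."""
    # ord() gives the code point number; it is conventionally shown
    # in hexadecimal with at least four digits.
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe("A"))           # U+0041 LATIN CAPITAL LETTER A
print(describe("\u00e9"))      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
print(describe("\U0001F600"))  # U+1F600 GRINNING FACE
```

Note that `unicodedata.name` raises `ValueError` for code points that have no name, such as unassigned ones.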
The set of all code points is called the codespace. The Unicode codespace consists of 1,114,112 code points.
Codespace Allocation
This is a map of the codespace
Each small square is 256 code points, each large square is a “plane” of 65,536 code points. There are 17 planes in total
- White - Unassigned space
- Green - Private Use Areas (use case can be defined by the user)
- Red - Surrogates
- Plane 0 - Basic Multilingual Plane (BMP), contains essentially all the characters needed for modern text in any script
- Plane 1 - Historical scripts, as well as emoji and other symbols
- Plane 2 - Large block of less common and historical Han characters
- Planes 3 to 13 - Empty at the time the article was written
- Plane 14 - A few rarely used formatting characters
- Planes 15 and 16 - Reserved for private use
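The plane layout above follows directly from the numbers: each plane holds 65,536 (0x10000) code points, so a code point's plane is its value divided by 0x10000. A minimal Python sketch, where `plane_of` is a name invented for these notes:

```python
PLANE_SIZE = 0x10000       # 65,536 code points per plane
MAX_CODE_POINT = 0x10FFFF  # last code point in the codespace


def plane_of(code_point: int) -> int:
    """Return the plane index (0 to 16) of a code point (hypothetical helper)."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    # Shifting right by 16 bits is the same as dividing by 0x10000.
    return code_point >> 16


plane_count = (MAX_CODE_POINT + 1) // PLANE_SIZE
print(plane_count)               # 17
print(plane_count * PLANE_SIZE)  # 1114112
print(plane_of(ord("A")))        # 0  (BMP)
print(plane_of(0x1F600))         # 1  (emoji)
print(plane_of(0x20000))         # 2  (Han characters)
print(plane_of(0xE0001))         # 14 (formatting characters)
```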
Originally the codespace was just the BMP and nothing more: Unicode was conceived as a straightforward 16-bit encoding with only 65,536 code points. It was expanded to its current size in 1996.
Encodings
Code points are abstractly identified by their index in the codespace, ranging from U+0000 to U+10FFFF. But how do these code points get represented as bytes, in memory or in a file?
The most straightforward way would be to store each code point as a 32-bit integer (this is the UTF-32 encoding). This works, but it is expensive in space, as it consumes 4 bytes per code point.
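To make the 4-bytes-per-code-point cost concrete, here is a rough Python sketch using the built-in "utf-32-be" codec (big-endian, no byte order mark). The helper `utf32_units` is a name made up for these notes.

```python
def utf32_units(text: str) -> list[int]:
    """Decode big-endian UTF-32 bytes back into code point numbers (hypothetical helper)."""
    data = text.encode("utf-32-be")
    # Every code point occupies exactly one 4-byte unit.
    return [int.from_bytes(data[i:i + 4], "big") for i in range(0, len(data), 4)]


text = "A\u00e9\U0001F600"  # three code points: U+0041, U+00E9, U+1F600
encoded = text.encode("utf-32-be")

print(len(text))     # 3
print(len(encoded))  # 12
print([f"U+{u:04X}" for u in utf32_units(text)])
```

Even the plain ASCII letter "A" takes 4 bytes here, which is why this format is rarely used for storage or transmission.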
The more common encoding formats are UTF-8 and UTF-16.
UTF-8
Each code point is stored using 1 to 4 bytes, depending on its index value.
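A quick way to see the variable length is to encode a few characters and count the bytes. The sketch below is in Python; `utf8_length` is a name invented for these notes, and the range boundaries it uses (0x80, 0x800, 0x10000) come from the standard UTF-8 definition rather than from the notes above.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 uses for a code point (hypothetical helper)."""
    if code_point < 0x80:
        return 1  # U+0000 to U+007F (ASCII)
    if code_point < 0x800:
        return 2  # U+0080 to U+07FF
    if code_point < 0x10000:
        return 3  # U+0800 to U+FFFF (rest of the BMP)
    return 4      # U+10000 to U+10FFFF (planes 1 to 16)


for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
# U+0041 -> 1 byte(s): 41
# U+00E9 -> 2 byte(s): c3 a9
# U+20AC -> 3 byte(s): e2 82 ac
# U+1F600 -> 4 byte(s): f0 9f 98 80
```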