Article Link: https://www.reedbeta.com/blog/programmers-intro-to-unicode/

The goal of Unicode is to represent all of the world’s writing systems. In the words of the Unicode Consortium: “Enabling people around the world to use computers in any language.”

Unicode Codespace

The elements of Unicode (its “characters”) are called code points. Code points are identified by number, written in hexadecimal with the prefix “U+”, for example U+0041. Each code point has a short name and a number of other properties, all specified in the Unicode Character Database.
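As a rough illustration of the notation above, the sketch below uses Python's standard `unicodedata` module, which exposes some of the properties from the Unicode Character Database. The helper name `describe` is made up for this example, and the output depends on the Unicode version bundled with the Python interpreter.

```python
import unicodedata


def describe(ch: str) -> str:
    """Format one character as 'U+XXXX NAME (general category)'.

    `describe` is a hypothetical helper for illustration only.
    """
    cp = ord(ch)                               # code point as an integer
    name = unicodedata.name(ch, "<unnamed>")   # short name, if one exists
    category = unicodedata.category(ch)        # e.g. 'Lu' = uppercase letter
    return f"U+{cp:04X} {name} ({category})"


print(describe("A"))        # U+0041 LATIN CAPITAL LETTER A (Lu)
print(describe("\u20AC"))   # U+20AC EURO SIGN (Sc)
```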

The set of all code points is called the codespace. The Unicode codespace consists of 1,114,112 code points.

Codespace Allocation

Map of the Unicode codespace

In the map, each small square is 256 code points and each large square is a “plane” of 65,536 code points. There are 17 planes in total.

  • White - Unassigned Space

  • Green - Private Use Areas (meaning is left for users to define)

  • Red - Surrogates

  • Plane 0 - Basic Multilingual Plane (BMP), contains essentially all the characters needed for modern text in any script

  • Plane 1 - Historical Scripts as well as emojis and other symbols

  • Plane 2 - Large block of less-common and Historical Han characters

  • Planes 3-13 - Empty when the article was written

  • Plane 14 - Rarely used formatting characters

  • Planes 15-16 - Reserved for private use
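The plane layout above comes down to simple arithmetic: 17 planes of 65,536 code points each, with the plane number being the code point divided by 65,536. A minimal sketch, where the constant and function names are chosen here for illustration:

```python
import sys

PLANE_SIZE = 0x10000                       # 65,536 code points per plane
NUM_PLANES = 17
CODESPACE_SIZE = PLANE_SIZE * NUM_PLANES   # 1,114,112 code points in total


def plane_of(ch: str) -> int:
    """Return the plane number (0-16) of a single character."""
    return ord(ch) >> 16                   # same as ord(ch) // 65,536


# Python's largest valid code point is the last one in the codespace.
assert CODESPACE_SIZE == sys.maxunicode + 1 == 0x110000

print(plane_of("A"))            # 0 (Basic Multilingual Plane)
print(plane_of("\U0001F600"))   # 1 (an emoji)
print(plane_of("\U00020000"))   # 2 (a Han character)
```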

In the past, the codespace was just the BMP and nothing more. Unicode was originally conceived as a straightforward 16-bit encoding with only 65,536 code points. It was expanded to its current size in 1996.

Encodings

Unicode code points are abstractly identified by their index in the codespace, ranging from U+0000 to U+10FFFF. But how do these code points get represented as bytes, in memory or in a file?

The most convenient way would be to store each code point as a 32-bit integer (this is the UTF-32 encoding). This works, but it is expensive, as it consumes 4 bytes per code point.
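To make the cost concrete, here is a small sketch using Python's built-in `utf-32-be` codec (big-endian, no byte order mark). The sample string is arbitrary; every character takes 4 bytes, and each 4-byte group is simply the code point written as an integer.

```python
text = "A\u20AC\U0001F600"           # 'A', euro sign, an emoji
encoded = text.encode("utf-32-be")   # fixed width: 4 bytes per code point

print(len(text), len(encoded))       # 3 12

# Reading each 4-byte group back as a big-endian integer recovers the code point.
code_points = [
    int.from_bytes(encoded[i:i + 4], "big")
    for i in range(0, len(encoded), 4)
]
print([f"U+{cp:04X}" for cp in code_points])   # ['U+0041', 'U+20AC', 'U+1F600']
```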

The more common encodings are UTF-8 and UTF-16.

UTF-8

Each code point is stored using 1 to 4 bytes, based on its index value.
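A minimal sketch of how the byte count follows from the index value, assuming the standard UTF-8 range boundaries (U+007F, U+07FF, U+FFFF). The function name `utf8_length` is made up for this example, and the result is cross-checked against Python's own encoder.

```python
def utf8_length(cp: int) -> int:
    """Number of bytes UTF-8 uses for code point `cp` (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if cp <= 0x7F:
        return 1      # ASCII range
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3      # rest of the BMP
    return 4          # planes 1-16


for ch in ["A", "\u00E9", "\u20AC", "\U0001F600"]:
    # Compare the computed length with what the built-in encoder produces.
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} -> {utf8_length(ord(ch))} byte(s)")
```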