A Programmer's Introduction to Unicode

Article Link: https://www.reedbeta.com/blog/programmers-intro-to-unicode/

The goal of Unicode is to represent all of the world’s writing systems. In the Unicode Consortium’s words: “Enabling people around the world to use computers in any language.”

Unicode Codespace

The elements of Unicode (its “characters”) are called code points. Code points are identified by number, customarily written in hexadecimal with the prefix “U+”, for example U+0041 for “A”. Each code point has a short name and a number of other properties, specified in the Unicode Character Database.
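As a rough illustration of the notation, here is a minimal Python sketch that prints the “U+” form, the name, and one property (the general category) of a few characters. It relies on Python's standard `unicodedata` module, which exposes a subset of the Unicode Character Database. The helper name `describe` is made up for this example.

```python
import unicodedata

def describe(ch: str) -> str:
    """Format one character as 'U+XXXX NAME (category XX)'."""
    cp = ord(ch)                                  # numeric code point
    name = unicodedata.name(ch, "<unnamed>")      # short name from the database
    category = unicodedata.category(ch)           # general category property
    return f"U+{cp:04X} {name} (category {category})"

# Escapes are used so the sample does not depend on the file's own encoding.
for ch in "A\u00e9\u20ac\U0001F600":
    print(describe(ch))
```

The names and properties returned depend on the Unicode version bundled with the Python interpreter in use.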

The set of all code points is called the codespace. The Unicode codespace consists of 1,114,112 code points.

Codespace Allocation

This is a map of the codespace

Each small square is 256 code points, and each large square is a “plane” of 65,536 code points. There are 17 planes in total.

White - Unassigned Space
Green - Private Use Areas (their meaning is left to applications to define)
Red - Surrogates
Plane 0 - Basic Multilingual Plane (BMP), contains essentially all the characters needed for modern text in any script
Plane 1 - Historical scripts, as well as emoji and other symbols
Plane 2 - Large block of less-common and Historical Han characters
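Planes 3 to 13 - Empty at the time of the source article (Plane 3 has since been allocated to additional Han characters)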
Plane 14 - Rarely used formatting characters
Planes 15 to 16 - Reserved for private use
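The plane layout can be checked with a little arithmetic. The sketch below is only an illustration: the constant names and the `plane_of` helper are made up for this note. Since each plane holds 65,536 (0x10000) code points, the plane number is simply the code point shifted right by 16 bits.

```python
PLANE_SIZE = 0x10000                    # 65,536 code points per plane
PLANE_COUNT = 17                        # planes 0 through 16
CODESPACE_SIZE = PLANE_SIZE * PLANE_COUNT
MAX_CODE_POINT = CODESPACE_SIZE - 1     # 0x10FFFF

def plane_of(cp: int) -> int:
    """Return the plane number (0..16) that contains code point cp."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError(f"{cp:#x} is outside the Unicode codespace")
    return cp >> 16                     # same as cp // PLANE_SIZE

print(CODESPACE_SIZE)                   # 1114112
print(plane_of(ord("A")))               # 0, the BMP
print(plane_of(0x1F600))                # 1, where most emoji live
```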

Originally the codespace was just the BMP: Unicode was conceived as a straightforward 16-bit encoding with only 65,536 code points. It was expanded to its current size in 1996.

Encodings

Code points are abstractly identified by their index in the codespace, ranging from U+0000 to U+10FFFF. But how do these code points get represented as bytes, in memory or in a file?

The most convenient way would be to store each code point as a 32-bit integer (this is the UTF-32 encoding). This works, but it is expensive in space: it consumes 4 bytes per code point.
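To make the cost concrete, here is a small sketch using Python's built-in `utf-32-le` codec (little-endian, no byte-order mark). The function name `to_utf32` is just a label for this example. Every code point, even plain ASCII, ends up occupying 4 bytes.

```python
def to_utf32(text: str) -> bytes:
    """Encode text as little-endian UTF-32 without a byte-order mark."""
    return text.encode("utf-32-le")

sample = "Hi \U0001F600"                # 4 code points: H, i, space, emoji
encoded = to_utf32(sample)

print(len(sample), len(encoded))        # 4 code points, 16 bytes
# Each 4-byte group is just the code point stored as a 32-bit integer.
for i in range(0, len(encoded), 4):
    cp = int.from_bytes(encoded[i:i + 4], "little")
    print(f"U+{cp:04X}")
```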

The more common encodings are UTF-8 and UTF-16.

UTF-8

Each code point is stored using 1 to 4 bytes, depending on its index value: lower code points take fewer bytes.
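The notes above do not spell out the byte layout, so the following is a sketch based on the standard UTF-8 bit patterns rather than anything taken from the article's text. The function name `encode_utf8` is made up for this example. In practice, `str.encode("utf-8")` does this job; the hand-written version only shows where the 1-to-4-byte thresholds come from.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode a single code point as UTF-8 (illustrative, not optimized)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"{cp:#x} cannot be encoded in UTF-8")
    if cp < 0x80:                       # 7 bits  -> 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 11 bits -> 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 16 bits -> 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 21 bits -> 4 bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Compare against Python's built-in encoder for one code point per length.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert encode_utf8(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {len(encode_utf8(cp))} byte(s)")
```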

Ayush Garg
