Unicode Basics: What's Character Set, Character Encoding, UTF-8?


What's a Character Set?

A character set is a fixed collection of symbols. For example, the English alphabet “A” to “Z” and “a” to “z” can be a character set, with a total of 52 symbols.

One of the simplest standardized character sets is ASCII (American Standard Code for Information Interchange). It dates from the 1960s and was almost the only one used in the USA up to the 1990s. ASCII contains 128 characters: 95 printable ones (the space plus the letters, digits, and punctuation you see on a US PC keyboard) and 33 invisible control characters.

ASCII was designed for English, which uses only the unaccented Latin alphabet. ASCII cannot be used for Chinese characters (漢字), the Arabic alphabet (أبجدية عربية), the Russian alphabet (русский алфавит), etc. Also, ASCII does not contain symbols such as { © α β « » …}. Nor can ASCII be used for European languages that have characters such as è é å ñ ü.
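
The limits described above are easy to observe in practice. The following is a minimal Python sketch, not part of the original article; the helper name is_ascii_encodable is made up for illustration. It tries to encode a string as ASCII and reports whether that succeeds.

```python
def is_ascii_encodable(text: str) -> bool:
    """Return True if every character of text is in the ASCII set.

    Hypothetical helper for illustration only.
    """
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        # At least one character has no ASCII representation.
        return False
    return True


for sample in ["Hello", "漢字", "русский", "©", "é"]:
    print(sample, is_ascii_encodable(sample))
```

Only the first sample passes; the others contain characters outside ASCII's 128.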

Here's the complete list of ASCII characters:

Dec   Hex   Char
─────────────────────────────────────────────
0     00    NUL '\0'
1     01    SOH (start of heading)
2     02    STX (start of text)
3     03    ETX (end of text)
4     04    EOT (end of transmission)
5     05    ENQ (enquiry)
6     06    ACK (acknowledge)
7     07    BEL '\a' (bell)
8     08    BS  '\b' (backspace)
9     09    HT  '\t' (horizontal tab)
10    0A    LF  '\n' (new line)
11    0B    VT  '\v' (vertical tab)
12    0C    FF  '\f' (form feed)
13    0D    CR  '\r' (carriage ret)
14    0E    SO  (shift out)
15    0F    SI  (shift in)
16    10    DLE (data link escape)
17    11    DC1 (device control 1)
18    12    DC2 (device control 2)
19    13    DC3 (device control 3)
20    14    DC4 (device control 4)
21    15    NAK (negative ack.)
22    16    SYN (synchronous idle)
23    17    ETB (end of trans. blk)
24    18    CAN (cancel)
25    19    EM  (end of medium)
26    1A    SUB (substitute)
27    1B    ESC (escape)
28    1C    FS  (file separator)
29    1D    GS  (group separator)
30    1E    RS  (record separator)
31    1F    US  (unit separator)

32    20    SPACE
33    21    !
34    22    " 
35    23    #
36    24    $
37    25    %
38    26    &
39    27    '
40    28    (
41    29    )
42    2A    *
43    2B    +
44    2C    ,
45    2D    -
46    2E    .
47    2F    /
48    30    0
49    31    1
50    32    2
51    33    3
52    34    4
53    35    5
54    36    6
55    37    7
56    38    8
57    39    9
58    3A    :
59    3B    ;
60    3C    <
61    3D    =
62    3E    >
63    3F    ?
64    40    @

65    41    A
66    42    B
67    43    C
68    44    D
69    45    E
70    46    F
71    47    G
72    48    H
73    49    I
74    4A    J
75    4B    K
76    4C    L
77    4D    M
78    4E    N
79    4F    O
80    50    P
81    51    Q
82    52    R
83    53    S
84    54    T
85    55    U
86    56    V
87    57    W
88    58    X
89    59    Y
90    5A    Z

91    5B    [
92    5C    \  '\\'
93    5D    ]
94    5E    ^
95    5F    _
96    60    `

97    61    a
98    62    b
99    63    c
100   64    d
101   65    e
102   66    f
103   67    g
104   68    h
105   69    i
106   6A    j
107   6B    k
108   6C    l
109   6D    m
110   6E    n
111   6F    o
112   70    p
113   71    q
114   72    r
115   73    s
116   74    t
117   75    u
118   76    v
119   77    w
120   78    x
121   79    y
122   7A    z

123   7B    {
124   7C    |
125   7D    }
126   7E    ~
127   7F    DEL
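
Rows of a table like this one can be regenerated with a few lines of code. This is a rough Python sketch, assuming the same Dec/Hex/Char column layout; the function name ascii_row is hypothetical.

```python
def ascii_row(code: int) -> str:
    """Format one row as 'Dec   Hex   Char' for a printable ASCII code."""
    return f"{code:<5} {code:02X}    {chr(code)}"


# Printable ASCII runs from 32 (space) through 126 (~).
printable = [chr(code) for code in range(32, 127)]
print(len(printable))   # 95

print(ascii_row(65))    # 65    41    A
print(ascii_row(122))   # 122   7A    z
```

The built-in functions chr and ord convert between a number and its character, so ord("A") gives back 65.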

What's Character Encoding?

Any text has to go through encoding and decoding in order to be stored as a file or displayed on screen. Suppose your language is Chinese (or Japanese, Russian, Arabic, or even English). Your computer needs a way to translate the characters of your language's writing system into a sequence of 1s and 0s. This transformation is called character encoding.
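
To make the "sequence of 1s and 0s" concrete, here is a small Python sketch (not from the original article). The helper to_bits is a hypothetical name, and UTF-8 is used only as one example of an encoding.

```python
def to_bits(text: str, encoding: str = "utf-8") -> str:
    """Encode text and show each resulting byte as 8 binary digits."""
    data = text.encode(encoding)
    return " ".join(f"{byte:08b}" for byte in data)


print(to_bits("A"))    # 01000001
print(to_bits("漢"))   # 11100110 10111100 10100010
```

The same character can produce different bits under a different encoding, which is why the encoding used must be known in order to read the text back.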

There are many encoding systems. The most popular encoding systems used today are:

Character Set and Encoding System

A character set and an encoding system are different concepts, but they are often confused.

In the early days of computing, the two concepts were not clearly distinguished, and both were just called a char set or an encoding system. For example, ASCII does not really separate the concepts, since it is very simple, dealing with only 128 chars (including the invisible “control characters”). Another example: HTML has <meta http-equiv="Content-Type" content="text/html;charset=utf-8">; the syntax contains the word “charset”, but it is actually about the encoding, not the character set. See: HTML: Character Sets and Encoding.

An encoding system defines a character set implicitly, because it has to define which characters it is designed to handle.

Unicode's Character Set and Encoding Systems

Unicode's Character Set

Unicode's character set aims to include the written symbols of all human languages. It includes tens of thousands of Chinese characters, math symbols, and characters of dead languages, such as Egyptian hieroglyphs. See: Sample Characters of Unicode.


Unicode Character's Code Point

Each character in Unicode is given a unique ID. This ID is a number (an integer), and is called the character's code point.

For example, the code point for the Greek letter alpha (α) is 945. In hexadecimal that is “3B1”. In the standard Unicode notation it is written as “U+03B1”.
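
The alpha example can be checked directly. In this short Python sketch (not from the original article), ord and chr are Python built-ins, while code_point_notation is a hypothetical helper name.

```python
def code_point_notation(char: str) -> str:
    """Return the U+XXXX notation for a single character."""
    # At least 4 hex digits, uppercase, as in the Unicode standard's notation.
    return f"U+{ord(char):04X}"


print(ord("α"))                   # 945
print(hex(ord("α")))              # 0x3b1
print(code_point_notation("α"))   # U+03B1
print(chr(0x03B1))                # α
```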

Unicode's Encoding System: UTF-8, UTF-16, …

Unicode then defines several encoding systems. UTF-8 and UTF-16 are the two most popular. Each encoding system has advantages and disadvantages.

UTF-8 is a variable-length encoding: each character takes 1 to 4 bytes. It is suitable for texts that are mostly Latin alphabet letters, such as English, Spanish, and French, and for most web technologies, such as HTML, CSS, and JavaScript. Most files on Linux are in UTF-8 by default. UTF-8 is backwards compatible with ASCII (meaning: if a file contains only ASCII characters, then encoding it with UTF-8 results in the same byte sequence as encoding it with ASCII).

UTF-16 is another encoding system from Unicode. With UTF-16, every char is encoded into at least 2 bytes: the commonly used characters of Unicode take exactly 2 bytes, and the rest take 4. For texts made up mostly of Chinese characters, such as Chinese and Japanese, UTF-16 creates smaller files than UTF-8, which needs 3 bytes for most of those characters.

There's also UTF-32, which always uses 4 bytes per character. It creates larger files, but is simpler to parse. Currently, UTF-32 is not used much.
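
The ASCII compatibility can be verified with a quick Python sketch (not from the original article; the sample strings are arbitrary):

```python
text = "Hello, world!"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For ASCII-only text the two encodings give identical bytes.
print(ascii_bytes == utf8_bytes)    # True

# A character outside ASCII takes more than one byte in UTF-8.
print(len("é".encode("utf-8")))     # 2
```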
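
The size trade-offs between the three encodings can be compared with a small Python sketch (not from the original article). The helper encoded_sizes is a hypothetical name. The codec names "utf-16-le" and "utf-32-le" are used so that Python does not prepend a byte order mark, which would otherwise inflate the counts.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of text under each Unicode encoding."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(text.encode(name)) for name in encodings}


print(encoded_sizes("hello"))   # {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
print(encoded_sizes("漢字"))    # {'utf-8': 6, 'utf-16-le': 4, 'utf-32-le': 8}
```

English text is smallest in UTF-8, while text in Chinese characters is smallest in UTF-16.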

Decoding

When an editor opens a file, it needs to know the encoding system used, in order to decode the binary stream and map it to fonts to display the original characters properly. In general, the info about the encoding system used for a file is not bundled with the file.

Before the internet, this was not much of a problem, because most of the English-speaking world used ASCII, and non-English regions used encoding schemes particular to their regions.

With the internet, files in different languages started to be exchanged a lot. When opening a file, Windows applications may try to guess the encoding system used, by some heuristics. When an app opens a file assuming the wrong encoding, the result is typically gibberish. Usually, you can explicitly tell an app to use a particular encoding to open the file. (For example, web browsers usually have a menu for this. In Firefox, it is under View, Character Encoding.) Similarly, when saving a file, there's usually an option to specify which encoding to use. For example, in Microsoft Notepad, when you save a file, there's an “Encoding” menu at the bottom of the Save dialog.
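
The gibberish that results from a wrong guess can be reproduced in a few lines. This Python sketch is not from the original article; Latin-1 is chosen here only as an example of a wrong guess for bytes that are really UTF-8.

```python
original = "é å ñ ü"
raw = original.encode("utf-8")       # the bytes as stored in the file

# Decoding with the wrong encoding silently produces gibberish.
garbled = raw.decode("latin-1")
print(garbled)                       # Ã© Ã¥ Ã± Ã¼

# Decoding with the right encoding recovers the text.
restored = raw.decode("utf-8")
print(restored == original)          # True

# Some wrong guesses fail outright instead of producing gibberish.
legacy = "é".encode("latin-1")       # a single byte, 0xE9
try:
    legacy.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)                    # False
```

When opening a file in Python, the encoding can be stated explicitly with the encoding parameter of the built-in open function, for example open(path, encoding="utf-8"), where path stands for your own file name.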

Font

When a computer has decoded a file, it then needs to display the characters as glyphs on the screen. For our purposes, this set of glyphs is a font. So, your computer now needs to map the Unicode code points to a font.

For Asian languages, such as Chinese, Japanese, and Korean, or languages written in the Arabic alphabet (Arabic, Persian), you also need the proper font to display the file correctly.

See: Best Unicode Fonts for Programing.

Input Method

For languages that are not based on an alphabet, such as Chinese, you need an input method to type them. For an example, see: Emacs Chinese Input for Studying Chinese.

What's the Most Popular Encoding?

Figure: growth of Unicode on the web. Source: “Unicode nearing 50% of the web”, by Mark Davis, Senior International Software Architect at Google. @ googleblog.blogspot.com…

The ones likely to remain widely used in the future are:

See also: Intro to Chinese Encoding; What Character Encoding Does Chinese Sites Use?.

For more detail, see: “General questions, relating to UTF or Encoding Form”, by the Unicode Consortium. @ http://www.unicode.org/faq/utf_bom.html
