What is Unicode? Definition and explanation
Unicode is an international standard for encoding, displaying, and processing text characters from nearly all the world’s writing systems. Each character is assigned a unique code point, which can be stored in various character encodings like UTF-8 or UTF-16. This allows Unicode to provide consistent representation and processing of texts across different platforms and languages.
What is Unicode?
The name Unicode is meant to suggest a unique, unified, universal encoding. It is a global standard for representing text characters in binary form and enables consistent storage, exchange, and processing of text across different digital systems and platforms.
Unicode is innovative in that it is not tied to the formats and encodings of a single alphabet of a particular human language. Rather, Unicode was created with the aim of serving as a unified standard for representing all writing systems and characters developed by humans.
Since the release of Unicode 1.0 at the end of 1991, the standard has fulfilled its purpose. Unicode is internally used by browsers and operating systems as a unified format. With the release of version 16.0 by the Unicode Consortium in 2024, the Unicode Standard now encompasses a repertoire of 154,998 characters. The character set covered by the Unicode Standard is completely identical to the “Universal Coded Character Set” (UCS), which is internationally standardised as ISO/IEC 10646.
Technical basis for character encoding
First, it’s important to understand that all information present in a digital system consists of long chains of zeros and ones on a deeper level. This is also referred to as “binary representation”. The binary code is somewhat like an alphabet in itself. However, in binary code, there are only two “letters”: zeros and ones. Each position within a sequence of zeros and ones is called a “bit”.
The basic trick of digital information technology is to represent characters from different alphabets as sequences of zeros and ones. This allows for encoding numbers and letters, as well as any other distinguishable states. Usually, these are called “symbols”. The longer the sequence of zeros and ones for representing a single symbol, the more symbols can be depicted. With each added bit, the number of possible symbols doubles.
A concrete example: Imagine we have binary “words” that are two bits long. This would allow us to encode four numbers:
| 2-bit word | Number |
|---|---|
| 00 | 0 |
| 01 | 1 |
| 10 | 2 |
| 11 | 3 |
If we add another bit to the beginning of the sequence, the number of possible bit-words doubles. These consist of the already known bit sequences, each preceded by a zero or one. Thus, we can encode eight numbers:
| 3-bit word | Number |
|---|---|
| 000 | 0 |
| 001 | 1 |
| 010 | 2 |
| 011 | 3 |
| 100 | 4 |
| 101 | 5 |
| 110 | 6 |
| 111 | 7 |
An 8-bit word is referred to as an octet or byte.
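The doubling described above can be checked with a few lines of Python. This is a minimal sketch; the helper name `bit_words` is an assumption made for this illustration and is not part of any standard.

```python
from itertools import product

def bit_words(n_bits):
    """Return every binary word of the given length, in counting order."""
    return ["".join(bits) for bits in product("01", repeat=n_bits)]

for n_bits in (2, 3, 8):
    words = bit_words(n_bits)
    # Each additional bit doubles the count: 2 ** n_bits words in total
    print(n_bits, "bits ->", len(words), "words")

# Reading a word as a base-2 number recovers the entries of the tables above
print([(word, int(word, 2)) for word in bit_words(2)])
```

For eight bits, the count is 256, which is why a single byte can distinguish 256 different symbols.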
For simplicity, we’ve shown the encoding of numbers as an example here. However, the same principle applies to digital systems for encoding letters or any other characters and states. Here is a highly simplified example of binary encoding of letters:
| 3-bit word | Letter |
|---|---|
| 000 | A |
| 001 | B |
| 010 | C |
The graphic representation of a character is called a glyph. Depending on the font used, there are different glyphs for the same character, and even within a single font, there can be multiple variations for a glyph. Think, for instance, of different weights, ligatures, italics, etc. Here is an expanded representation that includes the mapping from the character to the glyph:
| Binary representation | Decimal number | Encoded character | Glyph |
|---|---|---|---|
| 1000001 | 65 | uppercase “A” of the Latin alphabet | A |
| 1100001 | 97 | lowercase “a” of the Latin alphabet | a |
| 0110000 | 48 | Arabic numeral “0” | 0 |
| 0111001 | 57 | Arabic numeral “9” | 9 |
| 11000100 | 196 | uppercase “Ä” | Ä |
| 11000001 | 193 | uppercase “Á” | Á |
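The binary and decimal columns of this table can be reproduced in Python, whose built-in `ord` returns the number assigned to a character. The helper `describe` is a name invented for this sketch, and the padding to seven digits simply mirrors the table.

```python
def describe(character):
    """Return (binary string, decimal number) for one character."""
    code_point = ord(character)
    # "07b" pads to at least seven binary digits, as in the table above
    return format(code_point, "07b"), code_point

# "\u00c4" is the uppercase A with diaeresis, "\u00c1" the uppercase A with acute
for character in "Aa09\u00c4\u00c1":
    binary, decimal = describe(character)
    print(binary, decimal, character)
```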
Terminology of character encoding
Digital character encoding involves a range of specific terms and concepts. In everyday usage, some of these may be used interchangeably, but in technical contexts, especially when working with Unicode, it’s important to distinguish them clearly. Below are key terms along with their definitions:
| Term | Meaning |
|---|---|
| Character set | A collection of possible characters, such as digits “0–9” or letters “a–z” |
| Code point | A numerical value assigned to a specific character within a coding system |
| Coded character set | A system that maps each character to exactly one code point |
| Character encoding | The process of converting characters into a digital format (e.g., binary) |
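The difference between a code point and a character encoding becomes clearer with a small experiment. The sketch below uses the euro sign purely as an arbitrary example: the character has exactly one code point, but that code point turns into different byte sequences depending on the encoding chosen.

```python
character = "\u20ac"  # the euro sign, chosen as an arbitrary example

# Coded character set: the character maps to exactly one code point
code_point = ord(character)
print(f"U+{code_point:04X}")

# Character encoding: the same code point becomes different byte sequences
utf8_bytes = character.encode("utf-8")
utf16_bytes = character.encode("utf-16-be")
print(utf8_bytes.hex(" "), "|", utf16_bytes.hex(" "))

# Decoding with the matching encoding recovers the original character
assert utf8_bytes.decode("utf-8") == character
```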
Overview of common character encodings
Before the advent of Unicode, there was a wide variety of specific encodings. The norm was to use a distinct encoding for each language or language family. This often led to display errors and data inconsistencies. To counter this, character encodings were frequently modelled as backward-compatible supersets of an existing standard. The modern Unicode standard builds on the earlier ISO Latin-1 encoding, which in turn is based on the ASCII character code.
| Character encoding | Bits per character | Possible characters | Character set |
|---|---|---|---|
| ASCII | 7 bits | 128 | Letters, numbers, and special characters of the American keyboard, as well as control characters for teletypes |
| ISO Latin-1 (ISO 8859-1) | 8 bits | 256 | First 128 characters like ASCII, next 128 characters for special characters of European languages |
| Universal Coded Character Set 2 (UCS-2) | 16 bits | 65,536 | Characters of the “Basic Multilingual Plane” (BMP); first 256 characters like in ISO Latin-1 |
| Universal Coded Character Set 4 (UCS-4) | 32 bits | 1,114,112 | Characters of the BMP and the additional planes beyond it; total of 154,998 characters as of Unicode Version 16.0; first 256 characters like ISO Latin-1 |
| UCS Transformation Format 8 Bit (UTF-8) | 8/16/24/32 bits | 1,114,112 | Any characters from UCS-2 and UCS-4; the first 128 characters are encoded byte for byte like ASCII, while code points 128–255 match ISO Latin-1 but take two bytes |
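To see how the encodings in this table differ in practice, the following sketch counts the bytes each one needs for a few sample characters. The helper `encoded_length` is a hypothetical name for this illustration; the codec names are those of Python's standard library, where `utf-32-be` plays the role of UCS-4.

```python
# A, A with diaeresis, the euro sign, and an emoji outside the BMP
samples = ["A", "\u00c4", "\u20ac", "\U0001F600"]
encodings = ["ascii", "latin-1", "utf-8", "utf-16-be", "utf-32-be"]

def encoded_length(character, encoding):
    """Return the number of bytes needed, or None if not representable."""
    try:
        return len(character.encode(encoding))
    except UnicodeEncodeError:
        return None

for character in samples:
    lengths = {name: encoded_length(character, name) for name in encodings}
    print(f"U+{ord(character):04X}", lengths)
```

A result of `None` means the encoding has no representation for that character, which is the kind of gap that made the older single-language encodings error-prone.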
Structure of the Unicode Standard
The Unicode Standard defines characters and corresponding code points for letters, syllabaries, ideograms, punctuation marks, special characters, and numerals. It supports the Latin, Greek, Cyrillic, Arabic, Hebrew, and Thai alphabets. Additionally, it includes Japanese (Katakana, Hiragana), Chinese, and Korean scripts (Hangul). There are also mathematical, commercial, and technical special characters, as well as historical control characters for teletypes.
The characters are compiled in a series of character tables. We provide an overview of the most common character tables here.
Writing systems of the Unicode Standard
| Character table | Includes these alphabets, among others |
|---|---|
| European Writing Systems | Armenian, Georgian, Greek, Latin |
| African Writing Systems | Ethiopian, Egyptian Hieroglyphs, Coptic |
| Middle Eastern Writing Systems | Arabic, Hebrew, Syriac |
| Central Asian Writing Systems | Mongolian, Tibetan, Old Turkic |
| South Asian Writing Systems | Brahmi, Tamil, Vedic |
| Southeast Asian Writing Systems | Khmer, Rohingya, Thai |
| Writing Systems of Indonesia and Oceania | Balinese, Buginese, Javanese |
| East Asian Writing Systems | CJK (Chinese, Japanese, Korean), Hangul (Korean), Hiragana (Japanese) |
| American Writing Systems | Cherokee, Canadian Syllabics, Osage |
Symbols and punctuation of the Unicode Standard
| Character table | Includes these characters, among others |
|---|---|
| Notation systems | Braille Patterns, Musical Notation, Duployan Shorthand |
| Punctuation | Punctuation of the English Language, Punctuation of European Languages, CJK Punctuation |
| Alphanumeric symbols | Mathematical Unicode Letters, Circled Unicode Letters |
| Technical symbols | Symbols of the APL Programming Language, Symbols for Optical Character Recognition |
| Numbers & numerals | Maya Numerals, Ottoman Siyaq Numerals, Numerals of Sumerian Cuneiform |
| Mathematical symbols | Arrows, Mathematical Operators, Geometric Shapes |
| Emoji & pictograms | Emoticons, Dingbats, Other Pictograms |
| Other symbols | Alchemical Symbols, Currency Characters, Chess, Domino, and Mahjong Characters |
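Python's standard `unicodedata` module offers a quick way to explore characters from these tables. It does not expose the name of the character table a code point belongs to, so the sketch below prints the official character name and the general category instead. The sample characters are an arbitrary selection.

```python
import unicodedata

# Latin, Greek, Hebrew, Thai, a mathematical operator, a currency sign, a chess piece
samples = ["\u0041", "\u03a9", "\u05d0", "\u0e01", "\u2211", "\u20ac", "\u265e"]

for character in samples:
    # The second argument is returned for code points without a name
    name = unicodedata.name(character, "<unnamed>")
    # Two-letter general category, e.g. "Lu" (uppercase letter) or "Sc" (currency symbol)
    category = unicodedata.category(character)
    print(f"U+{ord(character):04X}", category, name)
```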
What is Unicode used for?
The Unicode Standard primarily serves as a universal foundation for processing, storing, and exchanging text in any language. Most modern software components, such as libraries, protocols, databases, etc., that operate on text are based on Unicode. We illustrate the range of possible uses with the following examples.
Operating systems
Unicode is the internal standard for text representation in most modern operating systems. Some operating systems, like Apple’s macOS, allow the use of Unicode characters in file names.
Websites
The Unicode variant UTF-8 has become the standard for encoding HTML documents. As early as 2016, more than 80 percent of the world’s most visited websites used UTF-8 for storing and displaying their HTML documents. The Punycode standard has become established for using non-ASCII letters in domain names.
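How Punycode turns a non-ASCII label into plain ASCII can be tried out with the codecs that ship with Python. This is only a sketch: the built-in `idna` codec follows the older IDNA 2003 rules, and the domain `bücher.example` is a made-up illustration.

```python
label = "b\u00fccher"  # "bücher", a commonly cited example label

# The raw Punycode algorithm applied to a single label
print(label.encode("punycode"))   # b'bcher-kva'

# The "idna" codec additionally prefixes each converted label with "xn--"
domain = f"{label}.example"
ascii_domain = domain.encode("idna")
print(ascii_domain)               # b'xn--bcher-kva.example'

# Decoding reverses the transformation
assert ascii_domain.decode("idna") == domain
```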
Programming languages
Many modern programming languages use Unicode as the basis for text processing. A recent development is the ability to use Unicode characters for naming variables and functions. This is possible in ECMAScript/JavaScript, as illustrated in the following code:
```javascript
// Identifiers may contain non-ASCII Unicode letters
let α = true;
let β = false;
let bool_var = α;
if (bool_var === α) {
  // …
}
```
Databases
The popular and widely used database MySQL supports the complete Unicode character set with the character encoding “utf8mb4”. In contrast, the “utf8” encoding (historically an alias for “utf8mb3”) stores at most three bytes per character, so characters whose UTF-8 encoding needs four bytes, such as most emojis, cannot be stored with it.
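Whether a given text contains characters that need four bytes in UTF-8 can be checked before it ever reaches the database. The function `needs_utf8mb4` below is a hypothetical helper for this sketch; it does not talk to MySQL and only inspects the string.

```python
def needs_utf8mb4(text):
    """Return True if any character needs four bytes in UTF-8."""
    return any(len(character.encode("utf-8")) == 4 for character in text)

# Only one- and two-byte characters
print(needs_utf8mb4("Gr\u00fc\u00dfe"))      # False
# Contains an emoji from outside the Basic Multilingual Plane
print(needs_utf8mb4("Hi \U0001F600"))        # True
```

The same condition can be expressed as `ord(character) > 0xFFFF`, since exactly the code points beyond the Basic Multilingual Plane take four bytes in UTF-8.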
Fonts
Fonts contain the glyphs used for the graphic representation of text. Due to the large number of characters included in the Unicode Standard, there is no font that contains all characters. Even the subset of the Basic Multilingual Plane is covered completely by only a few fonts. Here are a few examples:
| Unicode font | Glyphs | License |
|---|---|---|
| Noto | approx. 77,000 | Open Font License |
| Sun-ExtA/B | approx. 50,000 | Freeware |
| Unifont | approx. 63,000 | GNU GPL |
| Code2000 | approx. 63,000 | Shareware |
How is Unicode used?
In many cases, users employ Unicode without ever being aware of it. Digital text is presented in most documents and applications as Unicode and can be freely copied, pasted, and edited by users. Sometimes, the end user may need to insert a specific Unicode character into text. There are various methods for doing this, which we will present below.
Special software keyboards
The use of special software keyboards is probably the most common method to insert Unicode characters into text. Ubiquitous on mobile devices, software keyboards allow for switching between languages and their respective alphabets. The key layout changes, with all characters originating from the Unicode repertoire. These characters can be mixed and combined freely in texts.
A good example of this is emojis: Emojis are regular Unicode characters like letters, numbers, and special symbols. As with other characters, the visual representation of an emoji is independent of its internal encoding, which is why each operating system displays the same emoji slightly differently.
Software keyboards are not only found on mobile devices; they’re also available on desktops. They can be easily accessed in Windows, macOS, and many Linux distributions, displaying a different set of characters depending on the selected language. Since the number of keys is limited, not all Unicode characters are shown. Instead, there’s a language-specific selection of the most commonly used characters.
Unicode character tables
Besides software keyboards, Unicode character tables are probably the most useful way to access Unicode characters. Remember, a coded character set is the collection of all characters along with their corresponding unique code points. Such a structure lends itself to a table format, and indeed the Unicode Standard includes exactly such tables, called Unicode Code Charts. From these tables, users can copy specific characters to use elsewhere. Alternatively, end users can read the corresponding code point, for example, to use it as a numeric character reference, as described in the next section.
Many desktop operating systems also include a Unicode character table. This provides an overview of all available Unicode characters along with their code point, description, and glyph. A character can be inserted or copied with a click. A character table can also be created with just a few lines of code. Later in this article, we’ll show an example using the Python programming language.
Numeric character reference
The core of the Unicode Standard is the mapping of characters to code points. Knowing a character’s code point allows it to be used to embed the corresponding character in various contexts. On Windows, entering Unicode symbols is done using the standard hardware keyboard with a special key combination. Note that the code point number typically needs to be entered in hexadecimal format.
Programmers most often need numeric character references. The hexadecimal representation of code points allows for the mapping of a Unicode character into characters of the ASCII character set. We demonstrate this approach in HTML; fundamentally, it works the same in Python, C++, etc.
The general scheme for embedding a character using a numeric reference includes the reference itself, as well as an opening and closing term: In HTML documents, the numeric reference starts with &#x and ends with ;. In between, without any spaces, the hexadecimal code point is entered. It can have up to six digits, and leading zeros may be omitted. For a four-digit code point, this results in the pattern &#xNNNN;.
To insert the copyright symbol “©” into an HTML document, for example, we proceed as follows:
- Search for the character in a Unicode table.
- Read the code point associated with the character. In our example, the code point is listed as “U+00A9”, which is the hexadecimal representation.
- Compose the character reference and enter it into HTML source code or a Markdown document. In our case, we input &#xA9;, which renders as the character “©”.
A less common approach allows for the use of code points in decimal rather than hexadecimal representation. In this case, the numeric reference begins with &# (without the “x”) and ends as usual with ;. In between, the code point is written in decimal form. In our example, the numeric reference &#169; results in the copyright symbol.
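Both forms of numeric reference can be generated and checked programmatically. In the sketch below, `hex_reference` and `decimal_reference` are helper names invented for this illustration, while `html.unescape` is part of Python's standard library.

```python
import html

def hex_reference(character):
    """Build a hexadecimal numeric character reference."""
    return f"&#x{ord(character):X};"

def decimal_reference(character):
    """Build a decimal numeric character reference."""
    return f"&#{ord(character)};"

copyright_sign = "\u00a9"
print(hex_reference(copyright_sign), decimal_reference(copyright_sign))

# html.unescape resolves both forms back to the same character
assert html.unescape("&#xA9;") == html.unescape("&#169;") == copyright_sign
```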
Use the Unicode Character Inspector to quickly find the different codes for a character.
Named character entities
Since the notation of Unicode characters as numeric references is not intuitive for humans, there is another method: named character entities. These are defined for commonly used characters and assign a short, memorable name to the character. A named character entity starts with the ampersand & and ends with a semicolon ;. The defined name is placed in between without spaces. To insert the copyright symbol “©” in HTML, simply write &copy;.
The complete list of defined character entities is documented in the HTML Standard.
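Python's standard library ships a table of named entities, which makes it easy to look up the code point behind a name. The sketch below uses `html.entities.name2codepoint`, which covers the older HTML 4 entity names, together with `html.unescape` and `html.escape`.

```python
import html
from html.entities import name2codepoint

# name2codepoint maps entity names (without "&" and ";") to code points
print(name2codepoint["copy"])                   # 169

# html.unescape resolves named entities as well as numeric references
print(html.unescape("&copy; &euro; &amp;"))

# html.escape only replaces the characters that are special in HTML markup
print(html.escape("<b>\u00a9</b>"))
```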
Programming languages
Most programming languages include basic functions to convert characters and code points. The corresponding functions are often called ord(character) and chr(code point). The following applies:
chr(ord(character)) == character
Note that it is always possible to determine the code point corresponding to a character. Conversely, the assignment only works for numbers that are actually defined as code points of the character code. We demonstrate the basic scheme here with a short Python example:
```python
# Determine the decimal code point of a character
ord('A')  # `65`
# Determine the hexadecimal code point of a character
hex(ord('A'))  # `0x41`
# Determine the character corresponding to a code point
chr(65)  # `'A'`
chr(0x41)  # `'A'`
chr(0x110001)  # ValueError, because code point > `0x10FFFF`
```
With the help of these functions, it’s easy to create a character table for code points of the Unicode character set. For this, you iterate over the code points and output the corresponding characters. With Python, this can be done in just a few lines of code:
```python
# Start `range` at `32` to avoid control characters being printed
# Print the printable ASCII characters (`127` is the DEL control character)
for code_point in range(32, 127):
    print(code_point, hex(code_point), chr(code_point))
# Print ISO Latin-1, skipping DEL and the C1 control characters (`127` to `159`)
for code_point in range(32, 256):
    if 127 <= code_point <= 159:
        continue
    print(code_point, hex(code_point), chr(code_point))
```
Program Library ICU
The International Components for Unicode (ICU) are consolidated in a program library provided by the Unicode Consortium. The library is released under an open-source license and can be used on many operating systems. The software serves the purpose of programmatic internationalisation (often abbreviated as “i18n”). Its applications include:
- Processing of Unicode texts
- Support for regular expressions in Unicode
- Parsing and formatting of calendar dates, times, numbers, currencies, and messages
The ICU library is available in two versions:
- “icu4c” is written in C/C++ and provides an API for these languages.
- “icu4j” is written in Java and provides an API for this language.
The use of the components provides consistent results regardless of the underlying platform.
Charset meta tag in the head of HTML documents
Most HTML documents today use the UTF-8 character encoding. To ensure that visitors see the document without erroneous characters, a “Charset” meta tag should be placed in the head of the HTML document. This instructs the browser to interpret the retrieved document as UTF-8 and is illustrated below:
```html
<head>
  <meta charset="utf-8">
  <!-- additional head elements -->
</head>
```
Instagram fonts
The popular social network Instagram does not allow text formatting for biography information, posts, or stories. This limits users’ creative options. However, clever developers have found a workaround: Instagram uses Unicode, making it possible to compose text that appears formatted using special characters. This often involves characters that resemble Latin letters. The easiest way to create such text is with an Insta Fonts Generator. Additionally, using Instagram fonts also works in other social networks.
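The principle behind such generators can be sketched in a few lines of Python. The example below is an assumption about how a generator might work, not a description of any particular tool: it replaces ASCII letters with their look-alikes from the Mathematical Bold range of Unicode, which starts at U+1D400 for capitals and U+1D41A for small letters. The function name `to_math_bold` is made up for this illustration.

```python
import string

BOLD_CAPITAL_A = 0x1D400  # MATHEMATICAL BOLD CAPITAL A
BOLD_SMALL_A = 0x1D41A    # MATHEMATICAL BOLD SMALL A

def to_math_bold(text):
    """Replace ASCII letters with Mathematical Bold look-alikes."""
    result = []
    for character in text:
        if character in string.ascii_uppercase:
            result.append(chr(BOLD_CAPITAL_A + ord(character) - ord("A")))
        elif character in string.ascii_lowercase:
            result.append(chr(BOLD_SMALL_A + ord(character) - ord("a")))
        else:
            result.append(character)  # leave everything else untouched
    return "".join(result)

print(to_math_bold("Hello, Unicode!"))
```

Because the result consists of mathematical symbols rather than letters, screen readers and search functions may not treat it as ordinary text.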

