‘UTF-8’ stands for ‘8-Bit UCS Transformation Format’ and is the most widespread character encoding on the World Wide Web. The international Unicode standard captures the characters and text elements of virtually all of the world’s languages for data processing. UTF-8 plays a crucial role within the Unicode character set.

The development of UTF-8 coding

UTF-8 is a character encoding: it assigns every Unicode character a specific bit sequence, which can also be read as a binary number. In this way, every character, whether letter, number, or symbol, from an ever-expanding range of languages receives a unique binary representation. International organisations focused on setting internet standards, such as the W3C and the Internet Engineering Task Force (IETF), actively promote UTF-8 as the universal standard for character encoding. In fact, as early as 2009, the majority of websites had already adopted UTF-8. According to a W3Techs report from April 2025, 98.6% of all websites now use this encoding format.

Problems faced before UTF-8 was introduced

Different regions with related languages and writing systems developed their own coding standards because they had different needs. In English-speaking countries, for instance, the ASCII encoding was sufficient, as it allowed 128 characters to be represented as computer-readable strings.

Languages that use Asian scripts or the Cyrillic alphabet, however, require a much larger set of unique characters. Even German umlauts—such as the letter ä—are not included in the ASCII character set. On top of that, different encoding systems could assign the same binary values to entirely different characters. As a result, a Russian document opened on an American computer might appear not in Cyrillic but in Latin letters mapped by the local encoding, producing unreadable text. This kind of mismatch seriously disrupted international communication.

Creation of UTF-8

To solve this problem, Joseph D. Becker developed the universal character set Unicode for Xerox between 1988 and 1991. From 1992, the IT industry consortium X/Open was also searching for a system to replace ASCII and expand the character repertoire. The coding was still meant to remain compatible with ASCII.

This requirement was not met by the first coding, named UCS-2, as it simply transferred character numbers into 16-bit values. UTF-1 also failed because its Unicode assignments partially collided with existing ASCII character assignments. A server set to ASCII thus sometimes output incorrect characters. This was a significant issue, since most English-speaking computers operated this way at the time.

The next attempt was the File System Safe UCS Transformation Format (FSS-UTF) by Dave Prosser, which eliminated the overlap with ASCII characters. In August of the same year, the draft circulated among experts. At Bell Labs, known for its numerous Nobel laureates, Unix co-founders Ken Thompson and Rob Pike were working on the Plan 9 operating system. They adopted Prosser’s idea, developed a self-synchronising coding (each character indicates how many bytes it needs), and established rules for characters that could be represented in more than one way (for example, ‘ä’ as its own character or as ‘a’ plus a combining diaeresis). They successfully used the coding in their operating system and presented it to the standards bodies. Thus FSS-UTF, now known as ‘UTF-8’, was essentially complete.

UTF-8 in the Unicode character set is a standard for all languages

The UTF-8 coding is a transformation format within the Unicode standard. The international standard ISO/IEC 10646 largely mirrors Unicode and refers to it as the ‘Universal Coded Character Set’. The Unicode developers set certain parameters for practical application. The standard aims to ensure the internationally uniform and compatible coding of characters and text elements.

When Unicode was introduced in 1991, it defined 24 modern writing systems and currency symbols for data processing. In the Unicode standard published in 2024, there were 168. There are various Unicode Transformation Formats, or ‘UTFs’, which reproduce the 1,114,112 possible codepoints. Three formats have prevailed: UTF-8, UTF-16, and UTF-32. Other encodings such as UTF-7 or SCSU have their own advantages but never became established. Unicode is divided into 17 levels (planes), each containing 65,536 characters. In overview charts, each level is typically laid out as a grid of columns and rows. The zeroth level, the ‘Basic Multilingual Plane’, covers most of the writing systems currently used worldwide, along with punctuation, control characters, and symbols. Six additional levels are currently in use:

  • Supplementary Multilingual Plane (Level 1): historical writing systems, rarely used characters
  • Supplementary Ideographic Plane (Level 2): rare CJK characters (‘Chinese, Japanese, Korean’)
  • Tertiary Ideographic Plane (Level 3): further CJK characters, encoded here since Unicode version 15.1
  • Supplementary Special-Purpose Plane (Level 14): individual control characters
  • Supplementary Private Use Area – A (Level 15): private use
  • Supplementary Private Use Area – B (Level 16): private use
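Because each level holds exactly 65,536 (0x10000) codepoints, the plane a character belongs to follows directly from its codepoint. A minimal Python sketch (the helper name is ours, not part of any standard API):

```python
def plane(ch: str) -> int:
    """Return the Unicode plane (0-16) that a single character belongs to."""
    return ord(ch) >> 16  # integer division of the codepoint by 65,536

print(plane("A"))           # U+0041 lies in the Basic Multilingual Plane: 0
print(plane("\U00010300"))  # OLD ITALIC LETTER A lies in Plane 1: 1
```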

The UTF encodings provide access to all Unicode characters, and each format’s specific properties make it recommended for certain areas of application.

UTF-32 and UTF-16 as alternatives

UTF-32 always operates with 32 bits, or 4 bytes. The simple structure increases the readability of the format. In languages that primarily use the Latin alphabet, and thus only the first 128 characters, the encoding takes up much more storage space than necessary (4 bytes instead of 1).

UTF-16 established itself as a display format in operating systems like Apple macOS and Microsoft Windows, and it is also used in many software development frameworks. It is one of the oldest UTFs still in use. Its structure is particularly suitable for memory-efficient encoding of non-Latin characters: most characters can be represented in 2 bytes (16 bits), with the length doubling to 4 bytes only for rare characters.

UTF-8 is efficient and scalable

UTF-8 uses up to four sequences of 8 bits (one byte each), while its predecessor, ASCII, relies on a single 7-bit sequence. Both encodings represent the first 128 characters in exactly the same way, covering the letters and symbols commonly used in English. As a result, characters from the English-speaking world can be stored using just one byte, making UTF-8 particularly efficient for texts in Latin-based languages. This efficient use of storage is one reason why operating systems like Unix and Linux use UTF-8 internally. However, UTF-8 plays its most important role in internet applications, especially when displaying text on websites or in emails.
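Python’s built-in encoder illustrates this sliding byte count (an example of ours, not from the article):

```python
# Each character costs between 1 and 4 bytes in UTF-8, depending on
# how high its codepoint is.
for ch in ["A", "ä", "€", "😀"]:
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), encoded.hex(" "))
# "A" needs 1 byte, "ä" 2 bytes, "€" 3 bytes, and the emoji 4 bytes.
```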

Thanks to the self-synchronising structure, readability is maintained despite the variable length per character. Without the Unicode limitation, UTF-8 could theoretically allow 2⁴² (4,398,046,511,104) character mappings. Due to the 4-byte restriction in Unicode, it’s effectively 2²¹ (2,097,152), which is more than sufficient. Even the Unicode range still has empty planes for many more writing systems. The precise mapping prevents codepoint overlaps, which in the past limited communication.

While UTF-16 and UTF-32 also allow precise mapping, UTF-8 uses storage space particularly efficiently for the Latin writing system and is designed so that different writing systems can coexist seamlessly. This enables their concurrent, meaningful display within a single text field without compatibility issues.

The basics of UTF-8 coding and composition

The UTF-8 coding stands out not only for its backward compatibility with ASCII but also for its self-synchronising structure, which makes it easier for developers to locate sources of error afterwards. For all ASCII characters, UTF-8 uses only 1 byte. The total length of a byte sequence can be recognised from the first digits of the binary number. Since ASCII code encompasses only 7 bits, the leading digit is the identifier 0. The 0 pads the value to a full byte and signals the start of a sequence without continuation bytes. The name ‘UTF-8’ is represented as binary numbers in UTF-8 coding as follows:

Character                    U          T          F          -          8
UTF-8, binary                01010101   01010100   01000110   00101101   00111000
Unicode Point, hexadecimal   U+0055     U+0054     U+0046     U+002D     U+0038

ASCII characters, like those used in the table, are assigned a single byte by the UTF-8 coding. All subsequent characters and symbols within Unicode use two to four 8-bit sequences. The first byte is called the start byte, and the additional bytes are continuation bytes. Start bytes of multi-byte sequences always begin with 11, while continuation bytes begin with 10. If you manually search for a specific point in the code, you can recognise the start of an encoded character by the markers 0 and 11. The first printable multi-byte character is the inverted exclamation mark:

Character                    ¡
UTF-8, binary                11000010 10100001
Unicode Point, hexadecimal   U+00A1

Prefix coding prevents another character from being encoded within a byte sequence. If a byte stream starts in the middle of a document, the computer still displays readable char­ac­ters correctly, as it doesn’t render in­com­plete ones. When searching for the beginning of a character, the 4-byte lim­it­a­tion means you only need to go back at most three byte sequences at any given point to find the start byte.
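This back-tracking can be sketched in a few lines of Python: continuation bytes match the bit pattern 10xxxxxx, so stepping backwards skips at most three of them before a start byte appears (the function name is ours):

```python
def find_start_byte(data: bytes, pos: int) -> int:
    """Step back from pos until a byte that is not a continuation byte."""
    while pos > 0 and (data[pos] & 0b1100_0000) == 0b1000_0000:
        pos -= 1  # 10xxxxxx marks a continuation byte
    return pos

data = "a¡€".encode("utf-8")     # a 1-, a 2- and a 3-byte character
print(find_start_byte(data, 5))  # byte 5 belongs to '€'; its start byte is 3
```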

Another struc­tur­ing element: The number of ones at the beginning of the start byte indicates the length of the byte sequence:

  • 110xxxxx rep­res­ents 2 bytes
  • 1110xxxx rep­res­ents 3 bytes
  • 11110xxx rep­res­ents 4 bytes
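Python can make these prefixes visible by printing the first byte of each encoded character in binary (an illustration of ours):

```python
for ch in ["¡", "€", "😀"]:  # 2-, 3- and 4-byte characters
    start_byte = ch.encode("utf-8")[0]
    print(ch, format(start_byte, "08b"))  # e.g. ¡ starts with 110
```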

In Unicode, codepoints are assigned in ascending order, and UTF-8 preserves this order in its byte values, which enables a logical, lexical sorting. However, this sequence includes some gaps. The range U+007F to U+009F is reserved for non-visible control characters rather than printable ones. In this section, Unicode assigns no readable symbols, only command functions and control codes.

As mentioned, UTF-8 coding can the­or­et­ic­ally link up to eight byte sequences. However, Unicode pre­scribes a maximum length of 4 bytes. This results in byte sequences of 5 bytes or more being invalid by default. Moreover, this re­stric­tion reflects the aim to create code that is as compact—using minimal storage space—as possible, and as struc­tured as possible. A fun­da­ment­al rule when using UTF-8 is to always use the shortest possible encoding.

However, for some characters there are multiple equivalent encodings. For example, the letter ä is encoded using 2 bytes: 11000011 10100100. Theoretically, it’s also possible to combine the codepoints for the letter ‘a’ (01100001) and the combining diaeresis U+0308 (11001100 10001000) to represent ‘ä’: 01100001 11001100 10001000. This corresponds to the Unicode Normalization Form NFD, in which characters are canonically decomposed. Both encodings shown lead to the exact same result (namely ‘ä’) and are therefore canonically equivalent.

Note

Normalisations are used to unify different Unicode representations of the same character. Canonical equivalence is important because it means that two sequences of characters can be encoded differently but have the same meaning and appearance. Compatibility equivalence, on the other hand, also allows sequences that differ in format or style but are substantively the same. The Unicode normalisation forms (NFC, NFD, NFKC, NFKD) use these concepts to standardise texts. This ensures that comparisons, sorting, and searches work consistently and reliably.
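In Python, the standard unicodedata module applies these normalisation forms; a short sketch:

```python
import unicodedata

decomposed = "a\u0308"  # 'a' followed by the combining diaeresis U+0308
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 2 codepoints vs. 1
print(composed == "\u00e4")            # True: both render as 'ä'
print(unicodedata.normalize("NFD", "\u00e4") == decomposed)  # True
```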

Some Unicode value ranges were not defined for UTF-8 because they are reserved for UTF-16 surrogates. The overview shows which bytes in UTF-8 under Unicode are considered valid according to the Internet Engineering Task Force (IETF) (areas marked green are valid bytes; those marked orange are invalid).

Image: Table: UTF-8 value ranges
The table provides an overview of the valid UTF-8 value ranges.

Conversion from Unicode Hexadecimal to UTF-8 binary

Computers read only binary numbers, while humans use the decimal system. The hexadecimal system serves as an interface between the two and helps to represent long chains of bits compactly. It uses the digits 0 through 9 and the letters A through F and operates on base 16. Since 16 is the fourth power of 2, the hexadecimal system is better suited than the decimal system for representing eight-digit byte values.

A hexadecimal digit represents a quartet (‘nibble’) within the octet. A byte with eight binary digits can therefore be written with just two hexadecimal digits. Unicode uses the hexadecimal system to describe the position of a character within its own system. From this, the binary number and finally the UTF-8 byte sequence can be calculated.

First, convert the hexadecimal number into a binary number. Then fit the bits into the structure of the UTF-8 coding. To simplify this, use the following overview, which shows how many codepoint bits fit into a byte chain and what structure can be expected in which Unicode value range.

Size in bytes   Free bits for the codepoint   First Unicode codepoint   Last Unicode codepoint   Start byte   Follow byte 2   Follow byte 3   Follow byte 4
1               7                             U+0000                    U+007F                   0xxxxxxx
2               11                            U+0080                    U+07FF                   110xxxxx     10xxxxxx
3               16                            U+0800                    U+FFFF                   1110xxxx     10xxxxxx        10xxxxxx
4               21                            U+10000                   U+10FFFF                 11110xxx     10xxxxxx        10xxxxxx        10xxxxxx
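The ranges in this overview translate directly into code; a Python sketch (the function name is ours) that can be cross-checked against Python’s own encoder:

```python
def utf8_length(codepoint: int) -> int:
    """Number of bytes UTF-8 spends on a codepoint, per the ranges above."""
    if codepoint <= 0x7F:
        return 1
    if codepoint <= 0x7FF:
        return 2
    if codepoint <= 0xFFFF:
        return 3
    if codepoint <= 0x10FFFF:
        return 4
    raise ValueError("outside the Unicode range")

print(utf8_length(0x1162))  # U+1162 lies between U+0800 and U+FFFF: 3
```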

Within a given code range, you can predict the number of bytes used, because the lexical order is consistent in both the Unicode codepoints and the corresponding UTF-8 binary values. For the range U+0800 to U+FFFF, UTF-8 uses 3 bytes. This range provides 16 bits to represent the codepoint of each symbol. The binary number is fitted into the UTF-8 encoding from right to left, with any unused bits on the left filled with zeros.

Cal­cu­la­tion example:

The character ᅢ (Hangul Jungseong AE) is located at position U+1162 in Unicode. To calculate the binary number, first convert the hexadecimal number into a decimal number. Each digit of the number corresponds to the matching power of 16. The rightmost digit has the lowest value, 16⁰ = 1. Starting from the right, multiply each digit’s numeric value by the value of its power, then add up the results.

Image: Example calculation: Convert Hexadecimal Number to Decimal
Convert the hexa­decim­al number to a decimal number in the first step.

4450 is the calculated decimal number. Now convert this into a binary number. To do this, repeatedly divide the number by 2 until the result is 0. The remainders, written from right to left, form the binary number.

Image: Example calculation: Convert decimal number to Binary
Convert the decimal number to a binary number in the next step.
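Both conversion steps can be verified with Python’s built-in base handling (a sketch of ours):

```python
codepoint = int("1162", 16)  # hexadecimal U+1162 as a decimal number
print(codepoint)             # 4450
print(bin(codepoint))        # 0b1000101100010
```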

The UTF-8 code prescribes 3 bytes for the codepoint U+1162, because the codepoint lies between U+0800 and U+FFFF. The start byte therefore begins with 1110, and the two subsequent bytes each start with 10. Fill the binary number into the free bits, which do not dictate the structure, from right to left. Complete the remaining bit positions in the start byte with 0 until the octet is full. The UTF-8 coding then looks like this:

11100001 10000101 10100010 (the structural prefixes are 1110 and the two leading 10s; the remaining bits carry the inserted codepoint)

Character   Unicode codepoint, hexadecimal   Decimal number   Binary number     UTF-8
ᅢ           U+1162                           4450             1000101100010     11100001 10000101 10100010
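The bit filling described above can be reproduced in Python and checked against the built-in encoder (the function is ours and covers only the 3-byte range):

```python
def encode_utf8_3byte(codepoint: int) -> bytes:
    """Encode a codepoint from U+0800-U+FFFF into its 3-byte UTF-8 form."""
    assert 0x0800 <= codepoint <= 0xFFFF
    byte1 = 0b1110_0000 | (codepoint >> 12)          # 1110 + top 4 bits
    byte2 = 0b1000_0000 | ((codepoint >> 6) & 0x3F)  # 10 + middle 6 bits
    byte3 = 0b1000_0000 | (codepoint & 0x3F)         # 10 + low 6 bits
    return bytes([byte1, byte2, byte3])

manual = encode_utf8_3byte(0x1162)
print(manual.hex(" "))                     # e1 85 a2
print(manual == "\u1162".encode("utf-8"))  # True
```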

UTF-8 in the Editor

UTF-8 is the most widespread standard on the internet, but simple text editors do not necessarily save texts in this format by default. Microsoft Notepad, for instance, traditionally used a default encoding referred to as ‘ANSI’ (which is actually the ASCII-based encoding Windows-1252). If you want to convert a text file from Microsoft Word to UTF-8 (for example, to represent various writing systems), proceed as follows: go to ‘Save As’ and select ‘Plain Text’ in the File Type option.

Image: Screenshot: Saving document in Word
You also have the option to save documents as plain text in Microsoft Word.

The pop-up window ‘File Conversion’ will open. Under ‘Text Encoding’, select ‘Other encoding’ and choose ‘Unicode (UTF-8)’ from the list. In the drop-down menu ‘End lines with’, choose ‘Carriage Return/Line Feed’ or ‘CR/LF’. This is how you easily convert a file to the Unicode character set with UTF-8.

Image: Screenshot: File conversion in Word
In addition to UTF-8, the ‘File Con­ver­sion’ window also offers options such as Unicode (UTF-16) with and without Big-Endian, as well as ASCII and many other encodings.

Opening an unmarked text file, where you don’t know beforehand which encoding was applied, can lead to issues during editing. In Unicode, the Byte Order Mark (BOM) is used for such situations. This invisible character indicates whether the document is in Big-Endian or Little-Endian format. If a program decodes a UTF-16 Little-Endian file using UTF-16 Big-Endian, the text will be output incorrectly.

Documents based on the UTF-8 character set do not have this problem, as the byte order is always read as a Big-Endian byte sequence. In this case, the BOM merely serves as an indication that the document is UTF-8 encoded.

Note

Characters represented with more than one byte can have the most significant byte at the front (left) or the back (right) in some encodings (UTF-16 and UTF-32). If the most significant byte (MSB) comes first, the encoding is labelled ‘Big-Endian’; if it comes last, ‘Little-Endian’ is added.

You place the BOM before a data stream or at the start of a file. This marker takes precedence over all other directives, even over the HTTP header. The BOM acts as a sort of signature for Unicode encodings and has the codepoint U+FEFF. Depending on the encoding used, the BOM appears differently in its encoded form.

Encoding Format BOM, Code point: U+FEFF (hex.)
UTF-8 EF BB BF
UTF-16 Big-Endian FE FF
UTF-16 Little-Endian FF FE
UTF-32 Big-Endian 00 00 FE FF
UTF-32 Little-Endian FF FE 00 00

Do not use the Byte Order Mark if the protocol explicitly prohibits it or if your data is already assigned a specific type. Some programs, according to their protocol, expect ASCII characters. Since UTF-8 is backward compatible with ASCII coding and its byte order is fixed, you don’t need a BOM. In fact, Unicode recommends not using the BOM with UTF-8. However, since BOMs can still appear in older files and cause problems, it’s important to recognise an existing BOM for what it is.
