Preamble: Smoke signals
The smoke signal is one of the oldest forms of communication: warning of an impending attack, signalling the guards to shut the gates and everyone to batten down the hatches. Different signals can also mean different things. One lit fire might indicate enemies are approaching, whereas two might indicate the king has returned. However, this only works if there is a shared understanding of what each signal means. If you forget how many fires to light, you might cause some rather unintended consequences...
Character Encoding
What has this got to do with character encoding, you ask? Just as with smoke signals, when you send a message to someone, you both need a shared understanding of how to interpret that message.
Let's say two people, Alice and Bob, would like to exchange messages over the internet. Alice would like to send "hello". The message needs to be translated into a stream of 0s and 1s that the computer understands, sent to Bob, and then converted back into a human-readable format.
Note: the illustration above uses a fake, made-up encoding.
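To make the idea concrete, here is a minimal Python sketch of that round trip, using a made-up lookup table in the spirit of the fake encoding above (the table itself is invented purely for illustration):

```python
# A made-up, toy encoding shared by Alice and Bob (purely illustrative).
FAKE_ENCODING = {"h": 1, "e": 2, "l": 3, "o": 4}
FAKE_DECODING = {number: letter for letter, number in FAKE_ENCODING.items()}

def encode(message):
    # Alice's side: turn each letter into its agreed-upon number.
    return [FAKE_ENCODING[letter] for letter in message]

def decode(numbers):
    # Bob's side: reverse the lookup using the same shared table.
    return "".join(FAKE_DECODING[number] for number in numbers)

numbers = encode("hello")
print(numbers)          # [1, 2, 3, 3, 4]
print(decode(numbers))  # hello
```

The whole scheme only works because both sides hold the same table; real encodings like ASCII are simply agreed-upon versions of this idea.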
ASCII
ASCII provides one way of doing this conversion step; a scheme like this is called an encoding system. ASCII defines a mapping between 128 different characters and numbers. The letter ‘A’, for example, is represented by the (decimal) number 65. In binary this is 1000001 (represented with 7 bits).
A full mapping can be seen here: https://www.ascii-code.com/.
If we translate each letter (remember letters are case-sensitive) in the message “Hello” (note the uppercase H) we get:
H = 72, e = 101, l = 108, l = 108, o = 111
So (using 8-bit bytes) the message we send from Alice to Bob looks like this:
01001000 01100101 01101100 01101100 01101111
Bob’s computer receives this sequence and, to convert it into something that makes sense to Bob, it uses the shared understanding of what these numbers mean: ASCII. It performs the reverse lookup of the process we just did, and Bob can successfully read the message. Fantastic!
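If you want to see this for yourself, a short Python sketch reproduces the exact bytes above (Python is just a convenient way to poke at the bits; the mapping itself is plain ASCII):

```python
message = "Hello"

# Alice's side: look up each character's ASCII number and write it as an 8-bit byte.
bits = " ".join(format(ord(char), "08b") for char in message)
print(bits)  # 01001000 01100101 01101100 01101100 01101111

# Bob's side: reverse the lookup, turning each 8-bit group back into a character.
decoded = "".join(chr(int(byte, 2)) for byte in bits.split())
print(decoded)  # Hello
```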
What’s the problem?
ASCII uses 7 bits for its encoding. This means 2^7 = 128 different characters can be represented using ASCII. That’s great if you are only writing plain English, but anyone who needs other characters (European, Arabic or Japanese, for example) needs more than the allowed 128 characters.
Since ASCII only uses 7 bits, there is one spare bit in your standard 8-bit byte. Using it doubles the number of possible characters to 256! But because there are so many other characters in different languages that need encoding, there was an explosion of competing opinions on what the extra 128 characters should be. People all over the world started producing their own tables, each making use of the 8th bit to represent the extra characters they needed.
If you typed up a message on one computer using one character encoding and sent it to someone else who was using a different character encoding, any characters in that extra range would show up as something completely different. The one saving grace was that almost all of these encodings agreed on the first 128 characters; they adhered to the original ASCII standard. However, even with the extra 8th bit, some languages have thousands of characters, which simply cannot fit into 256 slots.
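As a quick illustration of how this goes wrong, here is a small Python sketch decoding the very same byte with two legacy 8-bit encodings (Latin-1 and KOI8-R, picked purely as examples):

```python
# The single byte 0xE4 (228), which sits in the "extra" range above 127.
raw = bytes([0xE4])

print(raw.decode("latin-1"))  # ä  (Western European table)
print(raw.decode("koi8_r"))   # Д  (Russian table)
```

Same bits, two completely different characters, depending on which table the reader happens to use.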
Another solution was needed. Enter Unicode...
✨ Unicode ✨
Unicode is designed to be a universal standard that unifies all the different characters and the ways they can be represented. It therefore supports many different alphabets and even emojis. There are currently 143,859 different characters in the Unicode specification.
One major difference between ASCII and Unicode is that Unicode has no opinion on how this mapping should actually be implemented in terms of representing characters as binary bits. Instead it only specifies which character maps to which code point. A code point is a number, conventionally written in hexadecimal, that identifies a character.
E.g. The letter A has a code point of U+0041
Note that 0041 in hexadecimal is the number 65 in decimal. The same mapping as in ASCII! Brilliant!
🎉 🙌 🎉
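In Python you can inspect code points directly with the built-in ord and chr functions, which makes the ASCII/Unicode overlap easy to verify:

```python
# 'A' has code point U+0041, which is 65 in decimal, exactly as in ASCII.
print(ord("A"))       # 65
print(hex(ord("A")))  # 0x41
print(chr(0x41))      # A

# Code points go far beyond 128 characters; even emojis have one.
print(hex(ord("🎉")))  # 0x1f389  (i.e. U+1F389)
```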
If Unicode is more abstract and only defines a mapping of characters to code points, how does this actually get implemented? As it turns out, there are multiple encodings that implement the Unicode spec, each with different pros and cons. However, the web has more or less settled on UTF-8 for a number of good reasons.
UTF-8 (Unicode Transformation Format)
“An algorithmic mapping from every Unicode code point to a unique byte sequence” - https://unicode.org/faq/utf_bom.html
UTF-8 is a variable-length encoding. This means a single code point can be stored in 1, 2, 3 or 4 bytes (the original design even allowed for 5 or 6).
Problem: how do you know whether a byte represents a code point on its own or is part of a larger byte sequence?
The answer is to use that extra bit we talked about earlier! A special pattern of leading bits tells you how to read the rest of the sequence. Any byte that starts with a 0 is always a single-byte character. This has the very useful property of being backwards compatible with regular ASCII encoding.
E.g. 01000001 = the letter A in both UTF-8 and ASCII!
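A quick Python check confirms this backwards compatibility: encoding 'A' with UTF-8 and with ASCII produces exactly the same single byte.

```python
utf8_bytes = "A".encode("utf-8")
ascii_bytes = "A".encode("ascii")

print(utf8_bytes == ascii_bytes)     # True
print(format(utf8_bytes[0], "08b"))  # 01000001 (leading 0 means a single-byte character)
```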
For characters above the 127 range, a single byte is not enough, so we need two bytes to store the value.
2 byte encoding (UTF-8)
To indicate that the next two bytes should be read together, the first three bits of the first byte are set to 110 and the first two bits of the second byte are set to 10. The remaining bits are filled in from right to left and padded with 0s (if needed).
Example: the symbol µ has the code point U+00B5 (181 in decimal, 10110101 in binary).
Therefore µ gets represented in UTF-8 as 11000010 10110101.
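You can verify this two-byte pattern in Python by encoding µ and printing the bits of each byte:

```python
encoded = "µ".encode("utf-8")  # U+00B5

print(" ".join(format(byte, "08b") for byte in encoded))
# 11000010 10110101
# First byte starts with 110 (two-byte sequence), second starts with 10 (continuation byte).
```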
Using two bytes gives us 11 bits to use, so we can represent 2048 characters. However, as mentioned earlier, some Asian languages have thousands of characters and so need even more bytes to represent them all.
3 byte encoding (UTF-8)
When you have a character that requires three bytes to encode, the first four bits of the first byte are set to 1110 and the first two bits of each of the next two bytes are set to 10. The same rules apply for filling in the remaining bits for the specific character.
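For instance (choosing the Chinese character 中, code point U+4E2D, purely as an illustration), the three-byte pattern looks like this in Python:

```python
encoded = "中".encode("utf-8")  # U+4E2D

print(" ".join(format(byte, "08b") for byte in encoded))
# 11100100 10111000 10101101
# First byte starts with 1110 (three-byte sequence), the next two start with 10.
```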
4, 5 and 6 byte encoding (UTF-8)
The pattern repeats with up to six leading 1’s in the original design, allowing for up to 6 bytes to encode a character. In practice, UTF-8 is now restricted to a maximum of 4 bytes, which is enough to cover every Unicode code point up to U+10FFFF.
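Four-byte sequences are what emojis use. The 🎉 from earlier (U+1F389), for example, encodes like this:

```python
encoded = "🎉".encode("utf-8")  # U+1F389

print(" ".join(format(byte, "08b") for byte in encoded))
# 11110000 10011111 10001110 10001001
# First byte starts with 11110 (four-byte sequence), followed by three continuation bytes.
```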
Summary
At this point you hopefully have a decent understanding of the differences between ASCII, Unicode and UTF-8. There are other Unicode encodings that are not covered here, such as UTF-16 and GB 18030. There are plenty of great resources out there, so I've left a few sources and further reading below. Enjoy!
Further reading 📚
Characters, Symbols and the Unicode Miracle - Computerphile - Tom Scott (YouTube)
Unicode, UTF8 & Character Sets: The Ultimate Guide (Blog Article)
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (Blog Article)
What is Unicode, UTF-8, UTF-16? (Stack Overflow)
Introduction to UTF-8 and Unicode (YouTube)