Have you ever encountered problems displaying foreign characters in your app or website, or been confused by the appearance of strange question marks like this: ���? These are the result of character encoding mismatches, and those mismatches can turn a software localization job into an encoding nightmare.
Encoding nightmares can overrun product deadlines and spark frustration and doubt for your clients. If a website or app has an international future, then you’ll need to know your way around character encoding before having it translated. A little knowledge up front can save you hours and even days of debugging.
Character Encoding and the Modern Tower of Babel
All computing is made possible by standards for encoding and decoding information. To render letters, words, images and sound from strings of 1s and 0s, there must be an agreed-upon way to “encode” them in binary.
Character encoding refers to the way that “character sets” for different languages are mapped to computer systems. These mappings are defined in “code pages” or “character maps,” tables that match each character with a specific sequence of 1s and 0s.
In the early history of computing, when English dominated, it was a relatively simple task to map 100 or so distinct characters into what became known as ASCII code. Although the code itself used only 7 bits (making it capable of representing up to 2^7, or 128, different characters), characters were stored and transmitted in 8-bit blocks called “bytes.” The eighth bit was not needed or officially used by ASCII, so it was usually set to “0.”
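This 7-bit layout is easy to see in Python, which we’ll use for quick sketches throughout: every ASCII character’s byte value is below 128, so the high (eighth) bit is always 0.

```python
# A minimal sketch: printing ASCII characters as 8-bit binary
# shows the high bit is always 0.
for ch in "Hi!":
    print(f"{ch!r} -> {ord(ch):08b}")
# 'H' -> 01001000
# 'i' -> 01101001
# '!' -> 00100001
```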
As personal computers spread through Europe, they required encoding schemes that could display additional characters, like Ü and ß. Because languages such as French and German use more unique characters than English requires, encoding designers began to make use of the 8th bit in the byte. With that, code page tables had room for twice the number of characters as basic ASCII (2^8 = 256). That still didn’t account for every character in modern use, but it was enough to communicate. In subsequent years, many other countries used the extra 128 characters afforded by that 8th bit to define their own character sets. Thai and Farsi scripts, for example, were encoded using the extra 8th bit.
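A minimal sketch of the problem this created: the very same high-bit byte means different things under different legacy code pages (Latin-1 and the Cyrillic code page Windows-1251 are used here as examples).

```python
# One byte with the 8th bit set: its meaning depends entirely on
# which legacy code page you decode it with.
raw = bytes([0xDF])

print(raw.decode("latin-1"))  # ß  (Western European)
print(raw.decode("cp1251"))   # Я  (Cyrillic, Windows-1251)
```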
When Asian languages with much larger character sets entered the scene, things got more complicated. In Chinese, for example, you need to read and write at least 2,000 distinct characters to be considered literate, and more than 60,000 characters are defined in large dictionaries. New “double byte” (16-bit) encoding schemes emerged in China, Japan and Korea, giving them the ability to store up to 65,536 (2^16) different characters.
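As a quick sketch, here is one such legacy double-byte encoding, GBK (a Chinese encoding supported by Python), compared with UTF-8 for the same text:

```python
# In the legacy double-byte encoding GBK, each Chinese character
# occupies exactly two bytes; in UTF-8 the same characters take three.
text = "中文"  # "Chinese (written language)"

gbk_bytes = text.encode("gbk")
utf8_bytes = text.encode("utf-8")

print(gbk_bytes, len(gbk_bytes))    # 4 bytes for 2 characters
print(utf8_bytes, len(utf8_bytes))  # 6 bytes for 2 characters
```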
With the number of distinct character encodings exploding worldwide, sharing files across international boundaries was fraught with problems. The same binary text file opened in different countries would display in dramatically different ways. Often, it would be simply unintelligible. Authors who for whatever reason might want to display two different languages in a single document were often out of luck. Computer scientists realized something had to be done. The Unicode standard was developed to allow for a common mapping of characters for all the world’s languages.
Rise of Unicode
It’s important to clarify that the Unicode standard is not a character encoding, per se. Unicode defines a master table of the world’s characters linked to unique “code points.” It’s up to a specific Unicode character encoding, like UTF-8, to specify which sequence of bits and bytes is mapped to which code point. These encodings vary, for example, in the number of bytes used to store a given character.
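To make the distinction concrete, here is a small sketch showing a single code point, U+00E9 (é), rendered as three different byte sequences by three Unicode encodings:

```python
# One code point, three encodings, three different byte sequences.
ch = "é"  # code point U+00E9

print(f"code point: U+{ord(ch):04X}")
print("UTF-8: ", ch.encode("utf-8"))      # 2 bytes
print("UTF-16:", ch.encode("utf-16-be"))  # 2 bytes
print("UTF-32:", ch.encode("utf-32-be"))  # 4 bytes
```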
The Unicode standard comprises more than 1 million unique code points (1,114,112 to be exact) divided into 17 “planes,” each with a capacity of 65,536 characters. The vast majority of these code points have yet to be assigned. This is by design, to accommodate not only growth and change in existing languages, but also the invention and discovery of new ones.
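The plane arithmetic is simple enough to sketch: 17 planes of 65,536 slots each, and a code point’s plane is its value divided by 0x10000 (65,536).

```python
# 17 planes x 65,536 slots = 1,114,112 code points in total.
assert 17 * 65_536 == 1_114_112

# A code point's plane number is its value // 0x10000.
for ch in ("A", "é", "中", "😀"):
    cp = ord(ch)
    print(f"{ch} = U+{cp:04X}, plane {cp // 0x10000}")
```

Note that the emoji lands in plane 1, the Supplementary Multilingual Plane mentioned below.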
Emoji, defined on the “Supplementary Multilingual Plane,” are some of the more recent entrants into the standard. There seems to be little worry that Unicode will ever run out of room. Unicode even reserves “private use” areas, where more whimsical projects have emerged, like an unofficial encoding for the Klingon script.
Character Encoding Still a Problem Despite Unicode
While Unicode has been around and growing since the early 1990s, there are still many non-Unicode encodings in the world – Shift_JIS-2004 (Japanese), GB 2312 (simplified Chinese), Windows-1252 (European countries and the US) – and you’re likely to encounter them during your localization project.
Computer operating systems in current use have not yet made a complete switch to Unicode. Popular applications, like MS Word and Excel, often misapply and/or hide encoding information and can corrupt data beyond repair. To complicate things further, Unicode itself supports several specific character encodings, including UTF-8, UTF-16 and UTF-32. While UTF-8 is by far the most popular for Internet communication, UTF-16 is the standard for the Windows API and is common in China, Japan and Korea.
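One practical clue when juggling these flavors: many Unicode files begin with a byte-order mark (BOM). Here is a sketch of sniffing it with Python’s standard `codecs` module, using a hypothetical file’s bytes:

```python
import codecs

# Hypothetical file contents: UTF-16-LE text preceded by a BOM.
data = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

if data.startswith(codecs.BOM_UTF16_LE):
    # Python's "utf-16" codec reads and consumes the BOM itself.
    text = data.decode("utf-16")
    print(text)  # hi
```

A BOM is optional (and unusual in UTF-8), so this is a hint, not a guarantee.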
If you receive a file and assume it is UTF-8 (or don’t assume much of anything), you might end up corrupting it to the point where recovery becomes impossible. In our March 30, 2017 blog post, we’ll discuss tips and tricks for managing software localization projects and avoiding encoding nightmares.
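This kind of corruption is easy to reproduce. A sketch of classic “mojibake”: UTF-8 bytes decoded with the wrong legacy code page, and invalid bytes replaced with � when a strict decode would otherwise fail:

```python
original = "Müller"

# UTF-8 bytes decoded with the wrong code page: classic mojibake.
utf8_bytes = original.encode("utf-8")          # b'M\xc3\xbcller'
print(utf8_bytes.decode("windows-1252"))       # MÃ¼ller

# Latin-1 bytes force-decoded as UTF-8: the invalid byte becomes U+FFFD (�).
latin1_bytes = original.encode("latin-1")      # b'M\xfcller'
print(latin1_bytes.decode("utf-8", errors="replace"))  # M�ller
```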