Translating and localizing digital media raises technical issues that not all translators have the know-how to resolve. In our last post on Encoding Nightmares, we explained the character encodings used for different languages around the globe. Unicode was developed to establish a character encoding standard that works for all languages and can be opened and read by any application. Despite this, many different encodings remain in use. In this post, we share rules of thumb and tricks of the trade for managing files and avoiding encoding nightmares during the localization process.
Rules of Thumb
Understanding the purpose of character encoding and the role of Unicode in facilitating global communication is a big step toward avoiding an encoding nightmare. Here are some general rules of thumb to preserve the integrity of your digital content.
Limit your software applications
Every different application presents an opportunity for file corruption. By limiting the number of unique software applications used to open and save the files, you (and your extended team) can significantly lower the risk of data corruption from encoding mismatches.
Be Unicode compliant
It should go without saying that any project with a global perspective should be Unicode compliant, but there are multiple Unicode character encodings out in the field (UTF-8, UTF-16, and UTF-32 are the most common). For the vast majority of purposes, such as localizing a website (by extracting translatable files from your CMS) or a mobile app, you should be using UTF-8. If a language partner delivers files in UTF-16 or UTF-32, be aware of it and, when possible, convert the files before attempting to modify them. If you don't pay attention to the differences between them, you may be one of the unlucky ones who discovers that encoding nightmares are still possible inside the Unicode universe.
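If you do receive UTF-16 or UTF-32 files, the conversion itself is straightforward in most scripting languages. Here is a minimal sketch in Python (the file names and the helper function are hypothetical); Python decodes to an internal Unicode representation, so the round trip is lossless as long as the source encoding is declared correctly:

```python
# Sketch: re-encode a file from UTF-16 (or UTF-32) to UTF-8 before editing.
# src_encoding must match the file's actual encoding, or the decode step
# will either fail loudly or, worse, silently produce mojibake.

def convert_to_utf8(src_path: str, dst_path: str, src_encoding: str = "utf-16") -> None:
    with open(src_path, "r", encoding=src_encoding) as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)
```

A failed decode raises an exception immediately, which is exactly what you want: it tells you the declared encoding was wrong before any damage is done.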
Use metadata for character encoding declarations
You are probably familiar with the “Content Type” meta tag in HTML that allows the author to declare the specific character set being used. Modern browsers like Chrome and Firefox read that metadata and display the page accordingly. Other file formats, like XML and PDF, also support character encoding declarations, so we recommend that you use them.
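For reference, these are the standard forms of the declaration in HTML; in XML, the declaration `<?xml version="1.0" encoding="UTF-8"?>` on the first line of the file serves the same purpose:

```html
<!-- HTML5: place inside <head>, within the first 1024 bytes of the file -->
<meta charset="utf-8">

<!-- Legacy HTML4 equivalent -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
```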
Know the difference between characters, glyphs and fonts
The Unicode standard encodes abstract characters, not their specific visual forms, or glyphs. A font is a collection of glyphs for displaying a particular character set. Fonts rarely provide glyphs for all 100,000+ characters in the Unicode standard, so if you are having trouble displaying foreign-language characters, it might not be an encoding problem at all. Instead, you may need to find a font that supports the characters you want to display.
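The distinction is easy to see from Python's standard library, which knows everything Unicode says about a character but nothing about whether your font can draw it:

```python
import unicodedata

# A character is an abstract code point; a glyph is its drawn shape inside
# a particular font.
ch = "\ufb01"  # U+FB01
print(unicodedata.name(ch))      # LATIN SMALL LIGATURE FI
print(unicodedata.category(ch))  # Ll (lowercase letter)

# If a font lacks a glyph for a code point, the renderer shows a fallback
# box ("tofu"), but the underlying bytes are intact. That is a font
# problem, not an encoding problem.
```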
Confirm your partners are encoding savvy
Confirm that everyone involved in the localization chain is aware of the source encoding and knows how to preserve it. This is especially important if you are working with individual translators directly. Ask them which types of files and applications they work with, and confirm they know how to preserve encoding integrity at transfer points. And be careful that nobody opens a file “just to look” and inadvertently changes the encoding.
Tricks of the Trade
Excel spreadsheets and localization in CSV Files
Technically, Microsoft Excel has been able to set and change character encodings since Office 2007. In practice, the controls are buried and hard to use, and Excel has trouble maintaining character encodings with imported CSV (comma-separated value) files. If you’re not careful with CSV files, you can irretrievably damage character encoding. If you receive a CSV file, the trick is to open it outside of Microsoft Office, in either Apple Numbers or Google Sheets, and then save it there as an Excel (.xlsx) file. You can then work with the file in Excel. When you are ready to send or transfer the file as CSV, open the Excel file in Apple Numbers or Google Sheets again and save it as CSV there. It’s a circuitous route, but one well worth taking to protect yourself from an encoding nightmare.
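If you have a scripting language available, you can bypass the spreadsheet application entirely and keep the encoding pinned down yourself. A minimal sketch in Python (file names are hypothetical):

```python
import csv

# Sketch: read a CSV with an explicitly declared encoding and rewrite it
# as UTF-8, never letting a spreadsheet application guess the encoding.
# newline="" is required by the csv module so embedded line breaks in
# quoted fields survive the round trip.

def reencode_csv(src_path, dst_path, src_encoding="utf-8"):
    with open(src_path, newline="", encoding=src_encoding) as src:
        rows = list(csv.reader(src))
    # "utf-8-sig" prepends a BOM, which current versions of Excel use to
    # recognize the file as UTF-8 when it is opened by double-clicking.
    with open(dst_path, "w", newline="", encoding="utf-8-sig") as dst:
        csv.writer(dst).writerows(rows)
```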
Checking encoding upon receipt is always a good idea. Although HTML pages should include meta tags indicating the character encoding, they are often omitted, and they can be wrong if the source file was re-saved incorrectly somewhere in the workflow. If you get a batch of files from a language service provider and want to do a quick check that they were delivered in proper UTF-8 format, you can use the open-source File Encoding Checker utility to check all files in a folder at once.
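The same check is easy to script as a stand-in for a GUI tool. This sketch walks a folder and flags every file whose bytes are not valid UTF-8 (the function name and folder layout are assumptions for illustration):

```python
from pathlib import Path

# Sketch: strict UTF-8 validation for every file under a folder.
# Returns the paths that fail to decode.

def find_non_utf8(folder):
    bad = []
    for path in Path(folder).rglob("*"):
        if path.is_file():
            try:
                path.read_bytes().decode("utf-8", errors="strict")
            except UnicodeDecodeError:
                bad.append(path)
    return bad
```

Note that this proves a file is *not* UTF-8; a file that decodes cleanly could still be plain ASCII or, coincidentally, valid in several encodings at once.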
UTF-8 with BOM
UTF-8 files come in two varieties: with and without a Byte Order Mark (BOM). The BOM is a sequence of bytes at the beginning of a file (specifically, 0xEF 0xBB 0xBF) that signals to receiving applications that the following bytes are encoded in UTF-8. Some popular applications, like Windows Notepad, have historically added a BOM to UTF-8 files by default. For many applications, the BOM is unnecessary and unexpected, and it can lead to programming errors; PHP scripts, for example, will choke on UTF-8 with BOM. Make sure you convert any UTF-8-with-BOM files to plain UTF-8 before using them in a programming environment. Better yet, choose a text editor that keeps your files free of BOM, like Sublime Text.
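Stripping a BOM is a small, mechanical fix. A minimal sketch in Python, using the standard library's constant for the three-byte signature mentioned above (the function name is hypothetical):

```python
import codecs

# Sketch: remove a leading UTF-8 BOM in place, if one is present.
# codecs.BOM_UTF8 is the byte string b"\xef\xbb\xbf".

def strip_utf8_bom(path):
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith(codecs.BOM_UTF8):
        with open(path, "wb") as f:
            f.write(data[len(codecs.BOM_UTF8):])
```

When you only need to read such a file, decoding with `encoding="utf-8-sig"` consumes an optional BOM transparently instead of passing it through as a stray character.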
How to know when to throw in the towel
Are you seeing question marks framed by black diamonds? That is the Unicode replacement character, a sign that some characters from the source file have been irretrievably lost. If someone opens a file with the wrong encoding and then saves it, the original sequence of bytes can be corrupted to the point where restoring it is either impossible or extremely labor-intensive. In those situations, you may want to start over. If that becomes necessary, having a translation memory can save you a lot of time and frustration.
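That symbol has a fixed code point, U+FFFD, so its damage is easy to measure programmatically. A quick triage sketch (the function name is hypothetical):

```python
# Sketch: count occurrences of U+FFFD REPLACEMENT CHARACTER, the
# question-mark-in-a-black-diamond symbol. Any occurrence means the
# original bytes were destroyed when the file was decoded with the
# wrong encoding and then re-saved.

REPLACEMENT = "\ufffd"

def count_replacements(text: str) -> int:
    return text.count(REPLACEMENT)
```

A count of zero doesn't guarantee the file is clean (mojibake can substitute wrong-but-valid characters instead), but a nonzero count is proof of irreversible loss.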
To avoid nightmares like this in the future, be sure that everyone in the localization workflow knows their way around character encoding.