In the beginning…
…there was ASCII, and it was good.
Computer technology accelerated quickly in the United States, and so did certain standards. Foremost was the decision to codify the basic unit of data as the byte (1). A byte was large enough to hold every letter in the English language, all ten digits, and common punctuation, with room left over. In the end, the American Standard Code for Information Interchange (ASCII) was devised to standardize how computers would store and communicate a, b, c, 1, 2, 3…
But anything as useful as a computer could not remain the province of one country or language, so software systems evolved to support people around the world. The big problem was… well… big characters.
English has an amazingly compact alphabet – just 26 characters. Double that to allow for capital and lower case, and toss in digits 0 through 9, and you get a mere 62 possible combinations before including punctuation. Since a byte can hold 256 different values, ASCII and a one-byte-per-character system worked just fine for Americans, using less than half the space available in a single byte.
But it didn’t work for the Japanese, Chinese, and a number of cultures around the globe. Depending on the source, the ideographic Chinese writing system can have upwards of 80,000 distinct characters. Using basic binary math, we see that instead of one byte per character, Chinese computers would need three bytes per character. Add other languages and regional variations, and you had a mess. So different computer manufacturers, standards organizations, and government agencies set out to solve this problem.
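The byte arithmetic behind these claims can be checked in a few lines. Here is a minimal sketch in Python (the 80,000-character figure is the estimate quoted above):

```python
import math

# One byte holds 8 bits, so it can represent 2**8 distinct values.
combinations_per_byte = 2 ** 8          # 256

# Upper case + lower case + digits fit comfortably in one byte.
english_core = 26 + 26 + 10             # 62 combinations before punctuation

# 80,000 Chinese characters need ceil(log2(80000)) bits to address,
# which no longer fits in one or even two bytes.
bits_needed = math.ceil(math.log2(80_000))   # 17 bits
bytes_needed = math.ceil(bits_needed / 8)    # 3 bytes per character
```

Seventeen bits overflow a two-byte unit by a single bit, which is why whole-byte schemes for Chinese end up at three bytes per character.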
Digital Tower of Babel
The nice thing about standards is that you have so many to choose from!
In the rush to support all possible character sets, several different systems for codifying characters came into existence. This of course meant that if you created software on one operating system, it likely would not run on another. This made exporting software an absurd business since basic functions – like sorting strings of characters – would have to change from system to system and language to language.
But time and market dynamics have helped reduce this hodge-podge of character sets to a manageable few, with some obvious choices. Here we document those that really matter.
As noted, ASCII is the primordial character set. It serves all English-speaking countries, and with common extensions into the extra storage provided in one byte of data, even local variations (such as the British pound symbol – £ – or common European characters – ö) can be accommodated. By using the spare eighth bit, ASCII was extended to include characters for other scripts such as Cyrillic, Arabic, Greek, and Hebrew. If your product will never be sold outside of the US and Western Europe, then ASCII may be sufficient. Just remember, never is a long, long time.
Double byte – a nice concept, but…
Doubling the size of ASCII encoding – from one byte to two – would offer 65,536 possible combinations, as opposed to a mere 256. Though this is not enough to hold all possible character sets of all languages, it would hold enough to make common communications possible (2).
But there was a problem, namely money. Not long ago, computer memory and storage were expensive. Computer programmers constantly searched for ways of economizing storage. This led to a number of half steps toward a universal encoding scheme. Most notable was the multibyte system.
Programmers, being the slick people they are, devised a complicated way of using as little space as possible to store characters, while still allowing for language representation from compact English to the full range of Chinese.
However, multi-byte bought its compactness at the cost of complexity. A language like Chinese might represent a character in one, two, or three bytes depending on its position in a character table. Needless to say, this complicated even simple tasks like scanning text for specific elements, sorting strings, or merely displaying text on screen.
This is not to say that multi-byte systems are rare. The UTF-8 standard is common in systems that were born in the age of ASCII (UNIX being an obvious example). Multilingual web sites are often encoded in UTF-8, which provides both flexibility for supporting many languages as well as compactness in transmitting data across potentially slow internet connections.
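UTF-8's variable width is easy to observe directly. A small Python sketch (the sample characters are illustrative picks, one per width class):

```python
# UTF-8 spends more bytes as characters move further from the ASCII range.
samples = {
    "A": 1,   # plain ASCII: one byte
    "£": 2,   # British pound sign: two bytes
    "中": 3,  # a CJK ideograph: three bytes
}

for char, expected_width in samples.items():
    encoded = char.encode("utf-8")
    assert len(encoded) == expected_width
```

Because byte count and character count no longer match, scanning or indexing such text byte-by-byte is not the same as character-by-character – exactly the complexity described above.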
The ideal solution would be one where all characters from all languages could be stored in identically sized units (i.e., the same number of bytes regardless of the language in use). Once again, time and market pressures addressed the problem.
Unicode – a double byte standard
As computer storage became cheaper (as everything associated with computers does over time), a more direct way of encoding became practical. Having a uniform character size simplified systems software and application programming, and saved a few grey hairs.
Back in the dark ages of computing – around 1986 – some bright fellows at Xerox started to map the relationships between identical Japanese and Chinese characters to quickly build a font for extended Chinese characters. This humble start led other developers and companies (notably Apple) to drive toward a new standard called Unicode.
Unicode fixes all characters at two bytes and carries the fundamental characters of most languages. This means one character encoding scheme can store and present text in any language. For example, Unicode contains characters for Latin, Arabic, Cyrillic, (Uni)Han, Hebrew, and more. More to the point, these characters maintain a specific place in character tables, with the original ASCII characters in their original positions (talk about backwards compatibility!).
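That backwards compatibility can be verified in one line per character: a character's Unicode code point, as reported by Python's built-in `ord()`, matches its old ASCII value exactly.

```python
# The first 128 Unicode code points are exactly the ASCII table.
assert ord("A") == 0x41       # same value ASCII assigned to 'A'
assert ord("a") == 0x61
assert ord("0") == 0x30

# Characters from other scripts simply occupy higher positions
# in the same table.
assert ord("£") == 0xA3       # British pound sign
assert ord("א") == 0x05D0     # Hebrew alef sits well beyond one byte
```

Any byte stream of pure ASCII text is therefore already valid Unicode text, which is what made migration feasible for systems born in the ASCII era.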
Unicode has become the de facto standard for most software development. Using Unicode allows a company to develop new applications for English-speaking countries first, and rapidly localize for other regions without fundamental coding changes. Unicode is so popular that it is supported by all modern operating systems – including all commercial versions of UNIX, Linux, Windows, and MacOS – as well as the World Wide Web.
Take a byte
Though the history of character encoding on computers is a checkered one, the future is clear. Standards have shaken out, and your options for new product development are few. It is best to design for internationalization from the start, because the effort costs little more than falling back to ASCII.
(1) Actually, the most fundamental data element is a ‘bit’, but bits do not by themselves transfer anything that is truly ‘information’.
(2) Unicode, which we discussed earlier, does just this. By eliminating obscure and infrequently used characters, Unicode compacts all industrialized languages into two bytes.