The ISO C/SUS V2 standard uses two different encodings for multilingual support: an internal and an external encoding. We use the term "internal encoding" for a multilingual encoding system used inside C programs (for example, in variables manipulated by the program), normally declared as an array of wchar_t. In other papers the term "process code" is also used. "External encoding" means a multilingual encoding system used outside of programs (such as files or a screen output stream). It is also called "file code". The ISO C/SUS V2 libraries supply conversion functions between the external and internal encodings. Here we call these functions the "encoding engine".
Many existing ISO C/SUS V2 libraries, especially freely available ones like GNU libc, assume that the internal encoding is Unicode-based. More specifically, they assume UCS4 as the internal encoding. They also advocate UTF-8 [Yergeau, 1998] as the external encoding. The encoding engine is hardcoded for the UCS4 internal encoding. External encodings other than UTF-8 are supported by conversion functions in the encoding engine, which convert external encodings to UCS4, and vice versa. To implement correct multilingual support, we believe it is very important not to hardcode any encoding, including Unicode. The Citrus project library does not hardcode any existing encoding. In the next section we discuss more fully why Unicode is not enough.
In order to make fewer assumptions about multilingual encodings, we have designed our libraries to support multiple external encodings, multiple encoding engines, and multiple internal encodings. In other words, each internal encoding has its own encoding engine. We call this concept a "multi-script framework." To support a new encoding, we simply supply the appropriate encoding engine. We also use dynamic loading for encoding engines, so that we can add more engines on the fly. A sketch of such an engine interface follows the figure below.
                 inside programs
        +------------------------+
        |     User programs      |
        +------------------------+
        | ISO C/SUS V2 functions |
        +------------------------+

   Files/char devices ---[encoding engine]--- wchar_t stream
   (external encoding)                        (internal encoding)
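To make the framework concrete, here is a minimal sketch of what a dynamically loadable encoding engine could look like. The struct layout, the exported symbol name engine_ops, and the function names are hypothetical illustrations for this paper, not the actual Citrus interface:

    #include <dlfcn.h>
    #include <stddef.h>
    #include <wchar.h>

    /*
     * Hypothetical encoding-engine interface: each engine converts
     * between its external encoding (char stream) and the internal
     * encoding (wchar_t stream).  Names are illustrative only.
     */
    struct encoding_engine {
        const char *name;                     /* e.g. "EUC-JP" */
        size_t (*mbstowcs)(wchar_t *, const char *, size_t);
        size_t (*wcstombs)(char *, const wchar_t *, size_t);
    };

    /* Load an additional engine on the fly from a shared object. */
    static struct encoding_engine *
    load_engine(const char *path)
    {
        void *handle = dlopen(path, RTLD_NOW);

        if (handle == NULL)
            return NULL;
        /* Each engine module exports one descriptor symbol. */
        return (struct encoding_engine *)dlsym(handle, "engine_ops");
    }

With such an interface, adding an encoding is a matter of shipping one more shared object; no code in the core library changes.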
We specifically have chosen NOT to hardcode Unicode in our library suite. Since we started using computers for text processing, we have experienced many transitions in character set encodings. Here we list our history in chronological order:
Separately from the above direction, there were several regional encodings (encodings that support only a single language within a single plaintext stream), including EUC (like euc-kr), MS-Kanji (Shift JIS), and Unicode. We list Unicode under "regional encodings" here because we cannot mix Chinese/Taiwanese/Japanese text in a single plaintext stream; to do so, we would need to annotate the plaintext with font designations.
Unicode cannot handle characters from multiple Asian regions in a single plaintext stream, due to "han unification". Han unification maps multiple characters from different Asian regions onto the same Unicode codepoint. It was introduced to reduce the number of codepoints used in Unicode (to fit the characters into the 16bit UCS2 region). While some of the unified characters are indeed the same across Asian regions, others have totally different glyphs and meanings in different regions [KUBOTA]. Suppose that we had assigned the same codepoint to O and O with umlaut. Some people would not notice the difference; however, it would become a significant problem for people whose language uses O with umlaut (such as German speakers). Also note that if we convert Asian multilingual text into Unicode and then convert it back into another multilingual encoding, the conversion cannot preserve the information contained in the original text: the distinctions between unified characters are lost the moment the text is converted into Unicode.
NOTE: there are proposals to perform language tagging [Whistler, 1999] in Unicode; however, language tagging jeopardizes one of the most important aspects of Unicode, the uniform 32bit wide-character representation, and we do not consider it to be useful.
Every time we change from one encoding system to another, the transition is very painful. In fact, there still are applications that are not 8bit-clean. Even for Unicode, there are implementations that assume 16bit UCS2, and they will need to migrate to 32bit UCS4. From our experience, it no longer makes sense to pick a single encoding to rely on. We do not advocate the use of ISO-2022, either. We believe that no encoding should be hardcoded, since to do so means that we will have to bear painful transitions over and over again.
Here is another reason to avoid hardcoded encodings, including hardcoded Unicode support. There is a widely deployed user base that uses non-Unicode multilingual text, including big5, euc-kr, KOI8-R, MS-Kanji and other encodings. If someone says "all people just need to transition to Unicode", that is wrong: Unicode imposes no pain on the Latin-1 user base, while imposing huge pain on Asian and other non-Latin-1 users. And even if we transition to Unicode, we are unsure whether it will be enough. So we conclude that we should hardcode no encoding.
What we should do is very similar to the approach taken by MIME [Freed, 1996]. We have a way to identify the encoding of a plaintext stream, and support multiple different encodings. We switch encoding engines according to the encoding identification. In the C library case, we identify the encoding by the locale settings made with setlocale(3), as in the sketch below.
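For example, a program switches engines simply by switching locales; a minimal sketch, where the locale names are system-dependent examples:

    #include <locale.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* Select the encoding engine for Japanese EUC text ... */
        if (setlocale(LC_CTYPE, "ja_JP.eucJP") == NULL)
            fprintf(stderr, "eucJP locale not available\n");

        /* ... then switch engines for a UTF-8 stream.  The
           names "ja_JP.eucJP"/"ja_JP.UTF-8" are examples only. */
        if (setlocale(LC_CTYPE, "ja_JP.UTF-8") == NULL)
            fprintf(stderr, "UTF-8 locale not available\n");
        return 0;
    }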
The external encoding is normally a stateful or stateless multibyte encoding, like ISO-2022-JP [Murai, 1993] or UTF-8. An octet stream is used for the multilingual representation inside files. Many existing external encodings are variable-length: one letter may be represented as one octet, two octets, or more. When we read an external-encoding representation into a C program, we normally hold it in an array of char. The internal encoding is a stateless, fixed-bitwidth encoding. It is defined internally by the library, depending on the encoding engine currently in use. We use the type wchar_t to hold it. At this moment wchar_t is a 32bit integer (int32_t).
When the external encoding is 8bit (like Latin-1), we can simply typecast the external encoding (char) into the internal encoding (wchar_t). When the external encoding is UTF-8, the natural choice for the internal encoding is UCS4; RFC2279 [Yergeau, 1998] defines the standard conversion between them. When the external encoding is ISO-2022, we use a compressed representation of the ISO-2022 stream as the internal encoding.
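As an illustration of that conversion, here is a simplified decoder for the one- to three-octet UTF-8 forms of RFC2279. It is a sketch that omits validation and the longer forms, not the engine our library actually uses:

    #include <wchar.h>

    /*
     * Simplified UTF-8 -> UCS4 decoder for one- to three-octet
     * sequences.  Returns the number of octets consumed, or -1
     * for forms this sketch does not handle.
     */
    static int
    utf8_to_ucs4(const unsigned char *s, wchar_t *wc)
    {
        if (s[0] < 0x80) {                 /* 0xxxxxxx */
            *wc = s[0];
            return 1;
        }
        if ((s[0] & 0xe0) == 0xc0) {       /* 110xxxxx 10xxxxxx */
            *wc = ((wchar_t)(s[0] & 0x1f) << 6) | (s[1] & 0x3f);
            return 2;
        }
        if ((s[0] & 0xf0) == 0xe0) {       /* 1110xxxx 10xxxxxx 10xxxxxx */
            *wc = ((wchar_t)(s[0] & 0x0f) << 12) |
                ((wchar_t)(s[1] & 0x3f) << 6) | (s[2] & 0x3f);
            return 3;
        }
        return -1;                         /* four octets and up omitted */
    }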
With setlocale(3), we pick a pair of internal and external encodings, along with an encoding engine. A programmer can then convert between the internal and external encodings using ISO C/SUS V2-compatible library calls, like mbstowcs(3).
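A minimal usage sketch, assuming a UTF-8 locale is installed under the example name used below:

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <wchar.h>

    int
    main(void)
    {
        char ext[] = "conversion example";   /* external encoding */
        wchar_t in[64];                      /* internal encoding */

        /* Pick the internal/external encoding pair and the engine;
           the locale name is an example and varies by system. */
        setlocale(LC_CTYPE, "ja_JP.UTF-8");

        /* The encoding engine converts external to internal. */
        if (mbstowcs(in, ext, 64) == (size_t)-1) {
            perror("mbstowcs");
            return 1;
        }
        printf("%d wide characters\n", (int)wcslen(in));
        return 0;
    }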
User programs should manipulate the wchar_t stream only, and should not manipulate text in the external encoding. The details of the internal and external encodings are hidden behind the library API. Programs should not, and need not, care about the internal or external encoding at all. If programmers hardcode assumptions about the internal/external encoding into their programs, the programs will not be future-proof. We will also supply wchar_t-ready curses, regex and other libraries, to keep the external encoding out of user programs.
Our strategy has two major benefits. First, our library is future-proof, as long as the internal encoding fits into the bitwidth of wchar_t (currently 32bit). Second, we can simplify the encoding engine as we wish. Suppose we need to support JIS X0201 or Latin 1 as internal/external encodings. In this case, the encoding engine can be simple memory-copy logic; we do not need to visit a slow encoding engine for simple encodings.
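A sketch of such a trivial engine for Latin 1, where the "conversion" degenerates to one cast per octet (illustrative only; the real engine interface differs):

    #include <stddef.h>
    #include <wchar.h>

    /*
     * Trivial Latin 1 engine: every external octet maps directly
     * to one wchar_t, so the conversion is effectively a copy.
     */
    static size_t
    latin1_mbstowcs(wchar_t *dst, const char *src, size_t n)
    {
        size_t i;

        for (i = 0; i < n && src[i] != '\0'; i++)
            dst[i] = (wchar_t)(unsigned char)src[i];
        if (i < n)
            dst[i] = L'\0';
        return i;
    }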
Our strategy also avoids the problems with Unicode and han unification. Since we can use a specific encoding engine that matches the external and internal encodings, we can preserve the information supplied by the external data representation in the internal data representation. Therefore, there is no data loss during conversion from the external to the internal encoding, or vice versa.
If we used Unicode as the internal and external encoding, we would not be able to enjoy the above-mentioned benefits. With UCS4 as the internal encoding, it is very hard to support external encodings other than UTF-8. To support them, we would need a huge conversion table to convert the external encoding into the internal encoding, and vice versa. Also, it would not be possible to simplify the encoding engine, even when the internal/external encodings are simple enough.
There are various applications that need library support for multiple encodings during a single runtime session; examples are web clients, text editors and email readers. For these applications, we need to pick the internal and external encodings based on the input data stream, so the ISO C/SUS V2 API is not sufficient for this type of application.
We have a temporary workaround that provides better support for multiple encodings. The ISO C/SUS V2 API has a data type, mbstate_t, which holds the intermediate state of the encoding engine. Our implementation includes a hidden reference to the encoding engine inside mbstate_t. By holding an mbstate_t variable alongside a wchar_t array, we can identify the encoding engine used to encode the wchar_t stream. When we convert wchar_t (internal encoding) back to the external encoding, we can automatically use the appropriate encoding engine.
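The sketch below illustrates the idea using the standard restartable conversion calls; in our implementation, the mbstate_t handed to these calls would also carry the hidden engine reference, so the reverse conversion picks the same engine:

    #include <limits.h>
    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>

    int
    main(void)
    {
        const char *ext = "example";   /* external encoding */
        wchar_t wc;
        char back[MB_LEN_MAX];
        mbstate_t st;
        size_t n, m;

        /* A zeroed mbstate_t is the initial conversion state. */
        memset(&st, 0, sizeof(st));

        /* External -> internal, one character at a time. */
        n = mbrtowc(&wc, ext, strlen(ext), &st);
        if (n == (size_t)-1 || n == (size_t)-2)
            return 1;

        /* Internal -> external: the same state object selects
           the same engine for the reverse conversion. */
        m = wcrtomb(back, wc, &st);
        printf("%zu octets in, %zu octets back\n", n, m);
        return 0;
    }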
We are still investigating a better API to abstract the manipulation of multiple encodings in a single program. We would like to make a proposal when we are done.
At this moment, we are using a tool called mklocale(1) to convert LC_CTYPE locale definition files into a binary representation. The mklocale(1) tool and the locale definition file format are derived from the runelocale implementation [Borman]. We should migrate to a more standard tool, localedef(1), and the standard file format for locale definition files.
There still are a couple of issues to be resolved. We picked a 32bit wchar_t for now, just like many other UNIX operating systems. Is it good enough for the future? That is a good question. We believe 32bit is a good compromise, and it keeps us future-proof: as long as people do not hardcode the current assumption that wchar_t is a 32bit quantity, we will be able to expand the wide-character representation to 64bit, or something larger. Doing so requires a full recompilation of the operating system and userland programs, but the transition will require no change in code, just like transitioning from a 32bit time_t to a 64bit one.
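Concretely, code should derive sizes from sizeof(wchar_t) rather than from today's 32bit assumption, as in this small sketch:

    #include <stdlib.h>
    #include <wchar.h>

    /*
     * Allocate room for n wide characters plus the terminator.
     * Deriving the size from sizeof(wchar_t), rather than a
     * hardcoded 4, keeps the code correct if wchar_t ever grows
     * to 64 bits or beyond.
     */
    static wchar_t *
    alloc_wcs(size_t n)
    {
        return malloc((n + 1) * sizeof(wchar_t));
    }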
For full locale support (including LC_TIME and LC_COLLATE), we will need a large database of localized character tables, time formats and other data. For a voluntary free software effort this can be far too hard to maintain; the maintenance cost of the LC_CTYPE locale databases is also high. We are investigating whether we can integrate the database from ICU [IBM] and reuse it in our multi-script framework.
DEC/Compaq Tru64 UNIX uses technology similar to ours [Compaq], including multiple switchable internal encodings, dynamic library support for additional encodings, and the use of a 32bit wchar_t. As mentioned in the abstract, vendor UNIX implementations are well ahead of free software implementations with regard to multilingual support. We wish to see more vendor technologies made available for public consumption, under a BSD or GNU license.
The author would like to thank the Freenix reviewers, including Wendy Rannenberg, and the Citrus project members, including Henry Nelson, for the time they gave to improve the paper.
Yergeau, 1998. F. Yergeau, "UTF-8, a transformation format of ISO 10646" in RFC2279 (January 1998). ftp://ftp.isi.edu/in-notes/rfc2279.txt.
KUBOTA. Tomohiro KUBOTA, Most Important 1006 Ideographs for Japanese. https://www.debian.or.jp/~kubota/unicode/.
Whistler, 1999. K. Whistler and G. Adams, "Language Tagging in Unicode Plain Text" in RFC2482 (January 1999). ftp://ftp.isi.edu/in-notes/rfc2482.txt.
Freed, 1996. N. Freed and N. Borenstein, "Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies" in RFC2045 (November 1996). ftp://ftp.isi.edu/in-notes/rfc2045.txt.
Murai, 1993. J. Murai, M. Crispin, and E. van der Poel, "Japanese Character Encoding for Internet Messages" in RFC1468 (June 1993). ftp://ftp.isi.edu/in-notes/rfc1468.txt.
Borman. Paul Borman, 4.4BSD rune(3) implementation.
IBM. IBM, ICU: International Components for Unicode. https://oss.software.ibm.com/developerworks/opensource/icu/project/.
Compaq. Compaq, "Writing software for the International Market" in Tru64 UNIX Version 5.1 programming online documentation. https://tru64unix.compaq.com/faqs/publications/base_doc/DOCUMENTATION/V51_HTML/ARH9YBTE/TITLE.HTM.
This paper was originally published in the Proceedings of the FREENIX Track: 2001 USENIX Annual Technical Conference, June 25-30, 2001, Boston, Massachusetts, USA.