Technical Q&As

TX 01 - Kanji and Special Text-Processing (1-May-95)

Q Our application supports Kanji, but it still has to perform some special text-processing operations on certain documents, such as searching for and removing control characters. To accomplish this, we need additional information on international character-set encoding.

It appears that $0D is a carriage return, and $09 is a tab, even in Kanji (the Kanji character set seems to encompass the entire Roman character set), if they occupy the first byte of a double-byte character. However, we are finding what appear to be control characters in the second bytes of double-byte characters. Can we safely assume that $0D, or any other common control-character code, is a carriage return wherever we find one in a Kanji document? If so, does this hold true for other non-Roman script systems?

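A All scripts have the same low ASCII values ($00-$7F), and all double-byte scripts use only high ASCII values ($80-$FF) for high-byte (first byte) values and $40-$FF for low-byte (second byte) values. Because neither byte of a double-byte character can fall below $40, control characters, numbers, and elementary punctuation characters are all unique: $0D is a carriage return and $09 is a tab wherever they appear, in Kanji and in other double-byte scripts.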
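As an illustration of the byte ranges described above, here is a minimal sketch in Python (not Toolbox code). It assumes the ranges hold for the script in use, so a plain byte-by-byte scan can find or remove control codes without tracking character boundaries. The function names and the sample bytes are hypothetical.

```python
CONTROL_MAX = 0x1F  # control codes occupy $00-$1F


def find_control_bytes(data):
    """Return (offset, value) pairs for every control byte in data.

    Assumes no byte of a double-byte character falls below $40,
    so any byte <= $1F is a standalone control character.
    """
    return [(i, b) for i, b in enumerate(data) if b <= CONTROL_MAX]


def strip_control_bytes(data, keep=b"\r\t"):
    """Remove control bytes, except the ones listed in keep."""
    return bytes(b for b in data if b > CONTROL_MAX or b in keep)


# Hypothetical sample: two double-byte characters, a carriage return,
# one more double-byte character, a tab, and a bell ($07).
sample = b"\x93\xfa\x96\x7b\r\x8c\xea\t\x07"

hits = find_control_bytes(sample)
cleaned = strip_control_bytes(sample)
```

The scan never inspects pairs of bytes. That is safe only for values below $40; a value such as $7B can legitimately appear as a second byte, so searches for characters at or above $40 still need to walk the text character by character.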

To see exactly what is permitted for the particular script you are working with, call the ParseTable Script Manager routine to obtain a table of high/low byte values. The difficulty of dealing with control characters in scripts will disappear when Unicode (which uses 16-bit characters that can contain any combination of byte values) is in widespread use. Because of fundamental compatibility problems with our system software and with any application that assumes that $0D is always a <CR>, Unicode will never be a 'script system'. Instead, it will probably be an alternate encoding platform, with entirely new rules. There is no reason to plan extensively for Unicode at the present time, but you should make as few assumptions as possible in your code. This will help to minimize the effort required to make your code compatible with Unicode in the future.
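The table-driven approach mentioned above can be sketched as follows, in Python rather than as a real Script Manager call. The 256-entry table here is built by hand from Shift-JIS-style first-byte ranges, which is an assumption made for illustration; it does not reproduce the actual format or contents of the table the system returns. All names are hypothetical.

```python
# Assumption: Shift-JIS-style first-byte ranges, for illustration only.
HYPOTHETICAL_LEAD_RANGES = [(0x81, 0x9F), (0xE0, 0xFC)]


def build_lead_byte_table(ranges):
    """Build a 256-entry table: True where a byte starts a two-byte character."""
    table = [False] * 256
    for lo, hi in ranges:
        for value in range(lo, hi + 1):
            table[value] = True
    return table


def split_characters(data, lead_table):
    """Split data into one-byte and two-byte characters using the table."""
    chars = []
    i = 0
    while i < len(data):
        # A first byte with nothing after it is kept as a single byte.
        width = 2 if lead_table[data[i]] and i + 1 < len(data) else 1
        chars.append(data[i:i + width])
        i += width
    return chars


table = build_lead_byte_table(HYPOTHETICAL_LEAD_RANGES)
pieces = split_characters(b"A\x93\xfa\r\x96\x7b", table)
```

Walking the text this way keeps the $7B in the last character attached to its first byte instead of mistaking it for a single-byte "{". Keeping the encoding rules in a table, rather than hard-coding ranges throughout the code, is one way to make fewer assumptions.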

To obtain a more in-depth understanding of international character-set encoding, software localization, and Unicode, locate a copy of Guide to Macintosh Software Localization (an Addison-Wesley publication). While this is available as soft copy on one of the developer CDs, you may find that some of the content won't display properly unless you have all of the appropriate fonts installed, so it might be best to obtain a printed copy.
