Friday, February 10, 2006

Web Internationalization [I18N]: Part V

Sorry for the long break; here we are again.

Now that we have discussed how characters are transferred over the Internet, we can look at where they can get scrambled in each step.

  • Step 1: Input via input method
    Almost all input methods now support multiple code pages. But if your system runs under a particular code page (character encoding) such as Arabic, and you type Japanese or Chinese characters into an editor that does not support Unicode, you will see scrambled characters. If you then save the file and publish it as an HTML page, you will find the same scrambled characters in your browser.
  • Step 2: Stored in a text file
    A text file can carry a signature that indicates which character encoding it uses. If you open a text file in a hex editor such as UltraEdit or 010 Editor, you can see the difference between an ANSI text file and a Unicode text file. An ANSI text file is parsed with the system default code page (character encoding). A Unicode text file starts with one of 3 signatures (byte order marks):
    A. 0xFE 0xFF: a Unicode text file in UTF-16, big endian.
    B. 0xFF 0xFE: a Unicode text file in UTF-16, little endian.
    C. 0xEF 0xBB 0xBF: a Unicode text file in UTF-8.

    If none of the above signatures is found at the beginning of a text file, the editor falls back to the system's character encoding to interpret the characters.
    Here comes the problem: if you save a text file containing Japanese characters in ANSI format and then open it on a Chinese system, you will see scrambled characters.
  • Step 3: Transferred over HTTP
    HTTP has a header (Content-Type, with its charset parameter) that indicates the character encoding used in the HTML file. Here comes the problem: if the header declares Shift-JIS (a Japanese character encoding, one of the ANSI code pages) but the HTML file is actually encoded in GB2312, the page will be scrambled in the client browser. The character encoding declared over HTTP and the one actually used in the HTML file must be consistent.
  • Step 4: Find the correct interpreter for the character encoding specified over HTTP
    As mentioned above, if the character encoding specified in the HTTP header is not consistent with the encoding actually used in the HTML file, the browser will use the wrong code page to map the characters. That is the cause of the scrambled characters.
  • Step 5: Mapping from character code to font glyph
    As in the 2 steps above, the system tries to find the correct glyphs in the font according to the character code. Unicode has unified the East Asian characters (Chinese, Japanese and Korean), so a given character has the same code everywhere under Unicode. The ANSI code pages, however, each have their own character codes. In 2 different code pages, the same code stands for 2 different characters, so if you choose the Japanese code page to interpret Chinese characters, the result is obviously scrambled text.
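
To make Steps 2 through 4 concrete, here is a minimal Python sketch. It checks for the three signatures listed in Step 2 and, if none is found, falls back to a caller-supplied code page. The names `decode_text` and `BOMS`, the sample string, and the choice of `gbk` to stand in for a "Chinese system" are my own assumptions for illustration; a real editor or browser does more than this (for example, UTF-32 signatures are ignored here).

```python
import codecs

# The three signatures from Step 2, paired with the codec used for the rest
# of the file. Names here are illustrative, not taken from any real editor.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def decode_text(data: bytes, fallback: str) -> str:
    """Decode bytes the way Step 2 describes.

    If a known signature is present, strip it and decode as Unicode.
    Otherwise use `fallback`, which plays the role of the system code page.
    """
    for bom, codec_name in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(codec_name)
    # No signature: bytes that are invalid in the fallback code page become
    # U+FFFD instead of raising an error.
    return data.decode(fallback, errors="replace")


original = "日本語"

# An "ANSI" file written on a Japanese system (no signature).
ansi_bytes = original.encode("shift_jis")

# Opened with the right code page, the text survives.
same_system = decode_text(ansi_bytes, fallback="shift_jis")

# Opened on a system whose default code page is Chinese, it is scrambled.
other_system = decode_text(ansi_bytes, fallback="gbk")

# A file with a UTF-8 signature is read correctly whatever the fallback is.
with_signature = decode_text(codecs.BOM_UTF8 + original.encode("utf-8"), fallback="gbk")
```

The same bytes give different text depending only on which code page the reader assumes, which is exactly the failure in Steps 2 to 4. The file with a signature does not depend on that assumption.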

So far, we have seen how characters are input, transferred, interpreted and displayed to the user, and for each step we discussed how scrambled characters arise. From this analysis, we can draw a conclusion:

In order to avoid scrambled characters, we must use Unicode, and use it throughout the entire process of authoring the HTML file and transporting it.
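
One way to follow this rule in practice is to take the declared encoding and the actual bytes from a single setting, so they cannot drift apart. The Python sketch below is only an illustration under my own assumptions: `ENCODING` and `build_response` are hypothetical names, and the response is reduced to a header dictionary plus a body rather than a real server.

```python
import html
from typing import Dict, Tuple

# Single source of truth: used for the HTTP header, the in-page declaration,
# and the actual byte encoding of the page.
ENCODING = "utf-8"


def build_response(title: str, text: str) -> Tuple[Dict[str, str], bytes]:
    """Return (headers, body) for a small HTML page encoded in ENCODING."""
    page = (
        "<!DOCTYPE html>\n"
        "<html><head>"
        f'<meta charset="{ENCODING}">'
        f"<title>{html.escape(title)}</title>"
        "</head>\n"
        f"<body><p>{html.escape(text)}</p></body></html>\n"
    )
    body = page.encode(ENCODING)  # the bytes really are in ENCODING
    headers = {
        # The charset the browser is told to use (Steps 3 and 4).
        "Content-Type": f"text/html; charset={ENCODING}",
        "Content-Length": str(len(body)),
    }
    return headers, body


headers, body = build_response("I18N test", "日本語 and 中文 on one page")
```

Because Unicode covers both Japanese and Chinese, one page can mix them, and the browser decodes the body with the same encoding it was written in.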

