Friday, November 25, 2005

Web Internationalization [I18N]: Part II

Today, I'd like to talk about character encodings.

Character encodings are as old as computer science itself. The famous ASCII table is one of the most popular character encodings.

So, what does character encoding mean? A character encoding is a scheme of numeric codes that represent the characters of a character set in memory.
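
To make that definition concrete, here is a small Python sketch. The sample characters and the three encodings are my own picks for illustration; the point is only that one character set (Unicode) can be laid out in memory in several different ways.

```python
# One character, one numeric code in the character set, several byte layouts.
ascii_code = ord("A")                    # 65 in ASCII (and in Unicode)
ascii_bytes = "A".encode("ascii")        # a single byte with value 65

text = "é"                               # U+00E9 LATIN SMALL LETTER E WITH ACUTE
code_point = ord(text)                   # the character's numeric code: 0xE9

# The same code point is stored differently depending on the encoding.
latin1_bytes = text.encode("latin-1")    # 1 byte
utf8_bytes = text.encode("utf-8")        # 2 bytes
utf16_bytes = text.encode("utf-16-be")   # 2 bytes, big-endian

print(ascii_code, ascii_bytes)
print(hex(code_point), latin1_bytes, utf8_bytes, utf16_bytes)
```

Decoding bytes with the wrong encoding is what produces the familiar garbled text, which is why the encoding has to travel with the data.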

There are many character encodings in the world because many people have tried to find ways to express their own languages and characters on computers.

Before we dig deep into character encoding, we need to understand some basic concepts.

  • Character: According to the glossary of the Unicode Standard [Unicode Standard 4.0], a character is the smallest component of a written language that has semantic value.
  • Phoneme: A phoneme is a minimally distinct sound in the context of a particular spoken language. We can also say that a phoneme is the unit of aural rendering. In some scripts, characters are closely related to phonemes, while in others they are closely related to meaning. There is no one-to-one correspondence between characters and phonemes.
  • Glyph: Glyphs are defined by ISO/IEC 9541-1 [ISO/IEC 9541-1] as "a recognizable abstract graphic symbol which is independent of a specific design". A glyph is usually also referred to as the unit of visual rendering. There is no one-to-one correspondence between characters and glyphs.
  • Unit of input: In keyboard input, it's NOT ALWAYS the case that keystrokes and input characters correspond one-to-one. Only a few languages, such as English, can map keystrokes to characters one-to-one. Many other languages use far more complex writing systems. It's impossible to fit all their characters on a keyboard, so they must rely on some kind of input method that transforms a keystroke sequence into a character sequence.
  • Unit of collation: String comparison is used in sorting and searching, and it is based on collation units rather than characters. Collation units do not have a one-to-one relation with characters. For example, in traditional Spanish sorting, the character sequences 'ch' and 'll' are each treated as an atomic collation unit.
  • Unit of storage: All information is kept in physical storage, which is a basic principle of CS, and as usual we measure it in bits and bytes. This is the most complex part. A frequent error in specifications and implementations is equating characters with units of physical storage. The mapping between characters and units of storage is our subject, and it is usually called the character encoding.
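
The last point, that characters are not units of storage, is easy to check in Python. This is only a sketch: the sample string and the choice of UTF-8 and UTF-16 are my assumptions, not something the definitions above require.

```python
# Characters and storage units are different things.
text = "日本語"                          # three characters

utf8 = text.encode("utf-8")             # 3 bytes per character here
utf16 = text.encode("utf-16-le")        # 2 bytes per character here

char_count = len(text)                  # counts characters (code points)
utf8_len = len(utf8)                    # counts bytes
utf16_len = len(utf16)                  # counts bytes

# Cutting the byte sequence at an arbitrary position can split a character.
try:
    utf8[:4].decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

print(char_count, utf8_len, utf16_len, split_ok)
```

Code that treats "length in bytes" as "length in characters" works for ASCII text and then breaks as soon as other scripts show up.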

The above terms are the basic concepts for understanding character encoding.
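
As one more illustration, the traditional Spanish rule from the collation bullet can be sketched in Python. Everything here is hypothetical and simplified: the `ALPHABET` list, the helper names, and the sample words are mine, accented vowels are ignored, and a real application would use a proper collation library instead.

```python
# Simplified traditional Spanish alphabet: "ch" and "ll" are single units.
ALPHABET = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k",
            "l", "ll", "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u",
            "v", "w", "x", "y", "z"]
RANK = {unit: i for i, unit in enumerate(ALPHABET)}

def collation_units(word):
    """Split a word into collation units, keeping 'ch' and 'll' together."""
    word = word.lower()
    units = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in ("ch", "ll"):
            units.append(pair)
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units

def sort_key(word):
    """Map each unit to its rank; unknown units sort after the alphabet."""
    return [RANK.get(u, len(ALPHABET)) for u in collation_units(word)]

words = ["cuna", "chico", "dama", "luz", "llama", "casa"]
by_code_point = sorted(words)              # plain character-by-character order
traditional = sorted(words, key=sort_key)  # order by collation units

print(by_code_point)
print(traditional)
```

With plain sorting "chico" lands between "casa" and "cuna", while with collation units it comes after every word that starts with a plain "c".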

That is the end of Part II.

