::scr Ramblings of a Classic Refugee or How I Learned To Stop Worrying and Love OS X

Andy Wardley scr@thegestalt.org
Wed, 6 Feb 2002 11:03:33 +0000


On Wed, Feb 06, 2002 at 10:48:26AM +0000, Alaric Snell wrote:
> Unicode, not ASCII. Never forget that. An XML processor is a complex piece of 
> software since it *must* operate at the Unicode level (even if it's just 
> mapping some ASCII variant to Unicode) to be able to process valid XML 
> documents, which may contain character references to Unicode, or be encoded 
> in UTF-8.

Oops!  Yes, of course, you're absolutely right.  No wait, you're _almost_ 
right.  To be extremely pedantic, the XML spec defines:

  A character is an atomic unit of text as specified by ISO/IEC 10646.

Quoting from http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html

  ISO/IEC 10646 is a relatively new character set standard, published in 
  1993 by the International Organization for Standardization (ISO). Its 
  name is "Universal Multiple-Octet Coded Character Set" (UCS).

  Unicode is a coded character set specified by a consortium of major 
  American computer manufacturers, primarily to overcome the chaos of 
  different coded character sets in use when creating multilingual programs 
  and internationalizing software. From version 1.1 on, Unicode is 
  scrupulously kept compatible with ISO/IEC 10646 and its extensions. 
  The consortium is also an important contributor to the ISO work to 
  further develop ISO/IEC 10646.

I never knew that until just now, and I'm not sure that my life has been 
enhanced at all by it, but there you go... 10646 is the man :-)
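
To make the point about character references and UTF-8 concrete, here is
a minimal sketch in Python (my choice of language; the element name <w>
and the sample word are made up for illustration).  It assumes the
standard library's xml.etree.ElementTree parser, and shows the same
character arriving once as a numeric character reference in an ASCII-only
document and once as raw UTF-8 bytes:

```python
import xml.etree.ElementTree as ET

# The same character (U+00E9, "e" with an acute accent) written two ways:
# as a numeric character reference inside a pure-ASCII document...
as_reference = b'<?xml version="1.0" encoding="US-ASCII"?><w>caf&#xE9;</w>'
# ...and as raw UTF-8 bytes in a UTF-8 document.
as_utf8 = '<?xml version="1.0" encoding="UTF-8"?><w>caf\u00e9</w>'.encode("utf-8")

text_from_reference = ET.fromstring(as_reference).text
text_from_utf8 = ET.fromstring(as_utf8).text

# List the code points the parser handed back for the first document.
code_points = [f"U+{ord(ch):04X}" for ch in text_from_reference]

# Compare what the parser produced for the two byte-level spellings.
same = text_from_reference == text_from_utf8

print(code_points, same)
```

Either way the parser hands back the same code points, which is the sense
in which an XML processor works at the ISO/IEC 10646 / Unicode level,
whatever the bytes on the wire looked like.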


A