
= Unicode and Microsoft Windows NT =

Article ID: 99884

Article Last Modified on 11/1/2006


APPLIES TO


 * Microsoft Windows NT Advanced Server 3.1
 * Microsoft Windows NT Workstation 3.1




This article was previously published under Q99884



SUMMARY
Windows NT version 3.1 employs a relatively new standard of character representation called Unicode. This new standard allows for greater flexibility in adding support for localized versions of Microsoft Windows NT.



MORE INFORMATION
The most prominent character standard in use by computers today is ASCII, a 7-bit code that defines 128 characters. This format is adequate for English, but as computers became more popular in European countries, the limitations of ASCII became clear.

In an effort to overcome some of these limitations, the International Organization for Standardization (ISO) established a standard called Latin-1 that defined European characters omitted from ASCII. Microsoft Windows modified the Latin-1 standard further and called the resulting character set Windows ANSI. However, because Windows ANSI continues to use an 8-bit coding scheme, it can represent only 256 unique symbols, considerably fewer than the 10,000 symbols that are common in languages such as Chinese, Korean, and Japanese. Beyond the language barriers, as the capabilities of computers broadened beyond uppercase, mono-spaced fonts, the need for a large set of unique characters (for example, letters, punctuation, mathematical and technical symbols, and publishing characters) also grew far beyond the capabilities of 8-bit text.
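The 8-bit limit can be illustrated with a short sketch. It is not taken from the article: it assumes Python's standard "latin-1", "cp1252", and "cp1251" codecs as stand-ins for ISO Latin-1, the Western Windows ANSI character set, and a Cyrillic Windows code page, and the helper name decode_byte is hypothetical.

```python
# Illustration only. The codec names are assumptions: Python's "latin-1",
# "cp1252", and "cp1251" stand in for ISO Latin-1, Western Windows ANSI,
# and a Cyrillic Windows code page.

SYMBOLS_IN_8_BITS = 2 ** 8  # an 8-bit code has only 256 possible values


def decode_byte(value: int, code_page: str) -> str:
    """Interpret a single byte value under the given 8-bit code page."""
    return bytes([value]).decode(code_page)


# The same byte names different characters depending on the code page,
# so 8-bit text is ambiguous unless the code page is known.
western = decode_byte(0xE9, "cp1252")   # LATIN SMALL LETTER E WITH ACUTE
cyrillic = decode_byte(0xE9, "cp1251")  # CYRILLIC SMALL LETTER SHORT I

# Windows ANSI departs from Latin-1 in the 0x80-0x9F range.
latin1_0x93 = decode_byte(0x93, "latin-1")  # a C1 control code
ansi_0x93 = decode_byte(0x93, "cp1252")     # LEFT DOUBLE QUOTATION MARK

if __name__ == "__main__":
    print(SYMBOLS_IN_8_BITS)
    print(repr(western), repr(cyrillic))
    print(repr(latin1_0x93), repr(ansi_0x93))
```

Because every 8-bit code page reuses the same 256 values, text written under one code page can be misread under another.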

The lowest level of localization (adaptation to a particular language) is the actual binary representation of characters: the code set. To overcome the limitations of the other coding methods, several major computer companies, including Apple Computer, Inc., Sun Microsystems, Inc., Xerox Corp., and IBM (International Business Machines Corp.), formed Unicode Inc., a non-profit consortium, to define a new standard for international character sets. At the same time, the ISO began developing its own standard. Eventually, these standards merged and became Unicode. Unicode is published as The Unicode Standard, Worldwide Character Encoding.

Unicode employs a 16-bit coding scheme that allows for 65,536 distinct characters, more than enough to include all languages in use today. In addition, it supports several archaic or arcane scripts and languages, such as Sanskrit and Egyptian hieroglyphs. Unicode also includes representations for punctuation marks, mathematical symbols, and dingbats, with room left for future expansion. Because Unicode establishes a unique code for each character in each script, Windows NT can ensure that character translation from one language to another is accurate.
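The 16-bit scheme described above can be sketched as follows. This is an illustration under stated assumptions, not part of the article: it uses Python's "utf-16-le" codec, which yields exactly one 16-bit value per character for the sample characters chosen here, and the helper name to_16bit_units is hypothetical.

```python
import struct
from typing import List

CODE_VALUES_IN_16_BITS = 2 ** 16  # 65,536 distinct 16-bit values


def to_16bit_units(text: str) -> List[int]:
    """Return the 16-bit code units of text (little-endian byte order).

    Assumption: "utf-16-le" is used as a stand-in for the fixed 16-bit
    scheme; it produces one unit per character for the sample below.
    """
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


# One Latin, one accented Latin, one Cyrillic, and one Han character.
sample = "A\u00e9\u0439\u4e2d"
units = to_16bit_units(sample)

if __name__ == "__main__":
    print(CODE_VALUES_IN_16_BITS)
    print([hex(u) for u in units])
```

Each character in the sample occupies exactly two bytes and has its own code, so characters from different scripts can share one string without a code page switch.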

Unicode in Windows NT
Unicode is the native code set of Windows NT, but the Win32 subsystem provides both ANSI and Unicode support. Character strings in the system, including object names, path names, and file and directory names, are represented with 16-bit Unicode characters. The Win32 subsystem converts any ANSI characters it receives into Unicode strings before manipulating them. It then converts them back to ANSI, if necessary, upon exit from the system.
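The convert-in, work-in-Unicode, convert-out pattern can be sketched as below. This is a simplified model, not the Win32 implementation: the function names are hypothetical, the "cp1252" ANSI code page is an assumption, and uppercasing merely stands in for whatever work the system does internally. In Win32 itself, the documented conversion functions are MultiByteToWideChar and WideCharToMultiByte, which this sketch does not call.

```python
ANSI_CODE_PAGE = "cp1252"  # assumption: Western Windows ANSI code page


def ansi_to_unicode(ansi: bytes) -> bytes:
    """Widen an ANSI byte string to 16-bit little-endian Unicode units."""
    return ansi.decode(ANSI_CODE_PAGE).encode("utf-16-le")


def unicode_to_ansi(wide: bytes) -> bytes:
    """Narrow 16-bit Unicode units back to the ANSI code page."""
    return wide.decode("utf-16-le").encode(ANSI_CODE_PAGE)


def call_with_ansi_name(ansi_name: bytes) -> bytes:
    """Hypothetical ANSI entry point: convert in, work in Unicode, convert out."""
    wide = ansi_to_unicode(ansi_name)
    # Stand-in for internal processing, which sees only Unicode text.
    wide_result = wide.decode("utf-16-le").upper().encode("utf-16-le")
    return unicode_to_ansi(wide_result)


if __name__ == "__main__":
    print(ansi_to_unicode(b"A\xe9"))
    print(call_with_ansi_name(b"caf\xe9.txt"))
```

Note that the conversion back to ANSI can fail or lose information when the Unicode string contains characters that the ANSI code page lacks. In this sketch, encode raises UnicodeEncodeError in that case.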

