another opinion
- To: ietf-charsets@INNOSOFT.COM
- Subject: another opinion
- From: Borka Jerman-Blazic <jerman-blazic@ijs.si>
- Date: Fri, 05 Nov 1993 09:10:32 +0100
- Conversion: Prohibited
- Resent-message-id: <01H4XSE0XTLU9BVGPI@INNOSOFT.COM>
- X400-Content-type: P2-1984 (2)
- X400-MTS-identifier: [/PRMD=ac/ADMD=mail/C=si/;931105101032]
- X400-Originator: jerman-blazic@ijs.si
- X400-Received: by mta kanin.arnes.si in /PRMD=ac/ADMD=mail/C=si/; Relayed; Fri,5 Nov 1993 10:11:42 +0100
- X400-Received: by /PRMD=ac/ADMD=mail/C=si/; Relayed; Fri,5 Nov 1993 09:10:32 +0100
- X400-Recipients: ietf-charsets@innosoft.com
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The address of Mr. Masami Hasegawa from his business card is:
ENET:JRDV04::MA_HASEGAWA LOC.CODE:JRD
The chair of the Japanese delegation to the Rennes meeting was Prof. Wada.
His address is the following: wada@ccut.utyo.junet.jp
I will try to find out their newer e-mail addresses and mail them to you.
Forwarded messages:
+-+-+-+-+-+-+-+-+-+ C I N E T - L N e w s l e t t e r +-+-+-+-+-+-+-+-+-+
Issue No. 5, Sunday, October 31, 1993
+--------------------------------------------------------------------------+
| China's InterNET Technical Forum (CINET-L) is a non-public discussion |
| list. CINET-L is technically sponsored by China News Digest and CINET-L |
| newsletter is published by volunteers in CINET-EDITOR@CND.ORG. For more |
| information regarding CINET-L, please see the end of this message. |
+--------------------------------------------------------------------------+
Table of Contents # of Lines
============================================================================
1. News Briefs (2 Items) ................................................ 21
2. Book Review: The New Nutshell Handbook on sendmail ................... 70
3. Move Over, ASCII! Unicode Is Here .................................. 310
4. Unicode, Wide Characters, and C ..................................... 344
============================================================================
----------------------------------------------------------------------------
1. News Briefs (2 Items) ................................................ 21
----------------------------------------------------------------------------
Contributed by: Hao Xin of Computer Network Center NCFC
Date: October 27, 1993
The local distribution list at CNC for the China InterNET Tech Forum
(CINET-L) has grown from zero a month ago to 21 users on CASnet.
Most of the users on the local distribution list are graduate students and
young staff members who do not have direct international email access yet.
Other users on the local list do have international mail access, but choose
to receive CINET-L via local distribution to save network traffic and money.
___ ___ ___
Contributed by: Hao Xin of Computer Network Center NCFC
Date: October 27, 1993
More than two weeks after their arrival on Chinese shores, the long-awaited
DEC computers for the NCFC project now get to see the light of day: they are
being taken out of their crates today. The equipment includes a VAX4000/100,
a VAXstation 4000, a DECstation 5000/25, and an NIS 600 multi-protocol
router. They will be used for the final link of the three campus networks
(Beijing Univ., Tsinghua, and CAS) which comprise the NCFC project. It is
hoped that the NCFC network will be functioning by the end of January 1994.
----------------------------------------------------------------------------
2. Book Review: The New Nutshell Handbook on sendmail ................... 70
----------------------------------------------------------------------------
Forwarded by: Liedong Zheng
Title: Sendmail
By Bryan Costales, with Eric Allman & Neil Rickert
1st Edition November 1993 (est.)
750 pages (est.), ISBN: 1-56592-056-2, $32.95 (est.)
Book Description:
This new Nutshell Handbook is far and away the most comprehensive book ever
written on sendmail, a program that acts like a traffic cop in routing and
delivering mail on UNIX-based networks. Although sendmail is used on almost
every UNIX system, it's one of the last great uncharted territories--and
most difficult utilities to learn--in UNIX system administration.
This book provides a complete sendmail tutorial, plus extensive reference
material on every aspect of the program. What's more, it's authoritative,
having been co-authored by Eric Allman, the developer of sendmail, and Neil
Rickert, one of the leading sendmail gurus on the Net.
The book covers both major versions of sendmail: the standard version
available on most systems, and IDA sendmail, a version from Europe that uses
a much more readable configuration file. Part One of the book is a tutorial
on understanding sendmail from the ground up. Starting from an empty file,
it has the reader work through exercises, building a configuration file and
testing the results. Part Two covers practical issues in sendmail
administration, while Part Three is a comprehensive reference section.
Author Information:
Bryan Costales is System Manager at the International Computer Science
Institute in Berkeley, California. He has been writing articles and books
about computer software for over ten years. His most notable books are C
from A to Z (Prentice Hall), and Unix Communications (Howard Sams). In his
free time (chuckle, chuckle) he sails San Francisco Bay in his 26 foot
sloop, goes camping with his Land Rover, and walks his dog Zypher. He is an
avid movie viewer, reads tons of science fiction, and plays chess and
volleyball.
Eric Allman is the Lead Programmer on the Mammoth Project at the University
of California at Berkeley. This is his second incarnation at Berkeley;
previously, he was the Chief Programmer on the INGRES database management
project. In addition to his assigned tasks, he got involved with the early
UNIX effort at Berkeley. His first experiences with UNIX were with 4th
Edition, and he still has the manuals to prove it (and has been accused of
being a pack rat because of it). Over the years, he wrote a number of
utilities that appeared with various releases of BSD, including the -me
macros, tset, trek, syslog, vacation, and of course sendmail.
Eric spent the years between the two Berkeley incarnations at Britton Lee
(later Sharebase) doing database user and application interfaces, and at the
International Computer Science Institute, contributing to the Ring Array
Processor project for neural-net-based speech recognition. He also
co-authored the "C Advisor" column for Unix Review for several years.
Eric has been accused of working incessantly, enjoys writing with fountain
pens, and collects wines which he stashes in the cellar of the house that he
shares with Kirk McKusick, his partner of 14-and-some-odd years. He is a
member of the Board of Directors of USENIX Association, which is much more
work than he had expected.
Neil Rickert earned his Ph.D. at Yale in Mathematics. He is currently a
professor of computer science at Northern Illinois University. He likes to
keep contact with the practical side of computing, and so spends part of his
time in UNIX system administration. He has been involved with the IDA
sendmail project, and is largely responsible for the current version of the
IDA configuration.
----------------------------------------------------------------------------
3. Move Over, ASCII! Unicode Is Here .................................. 310
----------------------------------------------------------------------------
Forwarded by: Liu Jian
Source: PC Magazine, October 26, 1993
Written by: Petzold, Charles
A great concept deserves a great name, and that name is Unicode.
Say it to yourself a few times. Get used to it. The prefix uni is from the
Latin word for one, and the word code (as defined in sense 3a in The
American Heritage Dictionary) is "A system of signals used to represent
letters or numbers in transmitting messages."
Say it again: Unicode. How hip and mellifluous the word is, particularly
when compared with ASCII (pronounced ASS-key), EBCDIC (EB-see-dik), or even
Baudot (baw-DOE). These comparisons are quite valid, for the goal of Unicode
is nothing less than to dislodge and replace what is perhaps the most
dominant standard in personal computing--the American Standard Code for
Information Interchange.
Ambitious? Of course. But Unicode makes so much sense, it seems inevitable.
Check out some of the companies that collaborated in The Unicode Consortium
to bring Unicode about: IBM, Microsoft, Apple, Xerox, Sun, Digital, Novell,
Adobe, NeXT, Lotus, and WordPerfect.
With the release of Windows NT, Unicode has become not just a proposed
standard, but a reality, and in the next couple of issues we'll take a look
at that reality. Let's begin, however, with a historical perspective, and
why Unicode is so important to the future of computing.
CODING LANGUAGE
Human beings differ from other species in their comparatively high level of
communication and the development of spoken language. The need to record
spoken language led to writing, which makes it possible to preserve and
convey knowledge and experience. Computers and other digital systems work
entirely with numbers, so to represent text in our computers, it is
necessary to create an equivalence between numbers and characters.
Until the invention of the telegraph by Samuel Morse in the mid-1800s,
long-distance communication required letters to be transported by person,
horse, or train. The telegraph made long-distance communication nearly
instantaneous by transmitting a series of electrical pulses through a wire.
But what do electrical pulses have to do with language? The telegraph
required that a code be devised correlating each letter in the alphabet with
a particular series of short and long pulses (dots and dashes) that sounded
like clicks on the receiving end.
Morse code was not the first instance of written language being represented
by something other than drawn or printed glyphs. Braille came earlier and
was inspired by a system for coding secret military messages. And Morse code
was not a binary system: The long and short pulses had to be separated by
different delays between letters and words. Binary systems for representing
written language (letters represented by a fixed-length series of 0s and 1s)
came later.
One of the early binary systems used in telexes was called Baudot (named
after a French engineer who died in 1903). Baudot was a 5-bit code.
Normally, the use of 5 bits is limited to representing 32 characters, which
is sufficient for the 26 characters of the alphabet (not differentiated by
case) but not much else. However, one Baudot code represented a "shift" that
made subsequent codes map to numbers and punctuation symbols. This feature
extended Baudot to be nearly as extensive as a 6-bit code.
The American Standard Code for Information Interchange (ASCII) was crowned
as a standard by the American National Standards Institute (ANSI) some 20
years ago. As defined by ANSI, ASCII is a 7-bit code, of which the first 32
codes and the last code are control characters (such as a carriage return,
line-feed, and tab). That leaves room for 26 lowercase letters, 26 uppercase
letters, 10 numbers, and 33 symbols and punctuation marks.
ASCII has become the dominant standard for all computers, except for
mainframes made by a not-insignificant company called IBM. The IBM heavy
iron machines use an 8-bit system called the Extended Binary Coded Decimal
Interchange Code (EBCDIC). Using 8 bits should allow for twice as many codes
as ASCII, but much of the EBCDIC code space is not assigned. One peculiarity
is that EBCDIC doesn't represent the alphabet with consecutive codes--the
capital letters A through I are hexadecimal codes 0xC1 through 0xC9; J
through R are 0xD1 through 0xD9; and S through Z are 0xE2 through 0xE9. This
only makes sense when you see the patterns on punch cards!
THE WORLD BEYOND OUR BORDERS
With the exception of IBM mainframes, ASCII is just about the only standard
common among computers. No other standard is as prevalent or as ingrained in
our keyboards, video displays, system hardware, printers, font files,
operating systems, electronic mail, and information services.
But there's a big problem with ASCII, and that problem is indicated by the
first word of the acronym. ASCII is truly an American standard, but there's
a whole wide world outside our borders where ASCII is simply inadequate. It
isn't even good enough for countries that share our language, for where is
the British pound sign in ASCII?
Among written languages that use the Latin (or Roman) alphabet, English is
unusual in that almost all of our words use the bare letters without accent
marks. Go across the Atlantic and take a look at the French, German, or
Swedish languages in print to see a variety of diacritics that originally
aided in adapting the Latin alphabet to the differences in spoken sounds
among these languages.
Journey farther east or south, and you'll encounter written languages that
don't use the Latin alphabet at all, such as Greek, Hebrew, Arabic, and
Russian (which uses Cyrillic). And if you travel even farther east, you'll
discover the logographic Han characters of Chinese, which were also adopted
in Japan and Korea. (Interestingly enough, in Vietnam you'll come across the
Latin alphabet again, a triumph of sorts for early missionaries!)
I live in one of the most ethnically diverse cities of the world--New York.
Every day I witness this diversity in a potpourri of languages heard and
seen on the streets. There are Ukrainian churches, Korean delicatessens,
Chinese restaurants, Pakistani newsstands, and subway advertisements in
languages I don't even recognize.
And then I come home and use ASCII, a character-encoding system that is not
only inadequate for the written languages of much of the world, but also for
many people who live right in my own neighborhood.
We simply can't be so parochial as to foster a system as exclusive and
limiting as ASCII. The personal computing revolution is quickly encompassing
much of the world, and it's totally absurd that the dominant standard is
based solely on English as spoken in the U.S.
I can't pretend to be dispassionate on this subject. The character encoding
used in our computers must truly reflect the diversity of the world's people
and languages.
ONGOING ATTEMPTS
Of course, there have been some partial solutions to this problem. Because
ASCII is a 7-bit code, and 8-bit bytes have become common in many systems,
it is possible to extend ASCII with another 128 characters. The original IBM
extended character set included some accented characters and a lowercase
Greek alphabet (useful for mathematics notation), as well as some block- and
line-drawing characters.
Unfortunately, this extended character set did not include enough accented
letters for all European languages that used the Latin alphabet, so
alternative extended character sets were devised. These are called code
pages, and they still exist in DOS and (in great profusion) in OS/2. OS/2
users and programs can switch among code pages and get a different mapping
of 8-bit codes to characters. An OS/2 program can even select extended
EBCDIC code pages for over ten different languages!
Microsoft didn't entirely abandon the IBM extended character set in the
first and subsequent versions of Windows, but most Windows fonts were built
around an alternative extended character set. Microsoft called this the
"ANSI character set," but it was actually based on an ISO (International
Standards Organization) standard. The ANSI character set abandons the block-
and line-drawing characters to include more accented characters that are
useful for European languages employing the Latin alphabet.
But what about non-Latin alphabets? Some font vendors devised solutions to
rendering other alphabets (such as Hebrew) with fonts designed specifically
for that purpose. With such fonts, ASCII codes normally corresponding to the
Latin alphabet are mapped to characters in other alphabets.
With either code pages or alternative fonts, the interpretation of 8-bit
character codes is ambiguous because it depends upon the selected code page
or font. And then there's the problem of communicating with the Macintosh,
which uses a different extended character set than either the original IBM
PC or Windows uses. Even when communicating over electronic mail in American
English, I sometimes see odd characters in letters from my Mac-user friends.
Another response to the limitations of ASCII is the double-byte character
set (DBCS). With DBCS, some characters require 1 byte and some require 2
bytes (indicated by an initial byte greater than hexadecimal 0x80). This
system allows representing both the ASCII character set and a non-Latin
alphabet. DBCS has problems of its own, though, such as multiple standards.
Also, because DBCS characters are not of uniform length, programmed parsing
becomes difficult. For example, you can't simply skip ahead 6 characters by
skipping ahead 6 bytes. You have to look at each and every character to see
if it's represented by 1 or 2 bytes.
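In code, that byte-by-byte scan looks something like the following (a
minimal sketch, assuming the simplified convention above, where any byte
greater than 0x80 is the lead byte of a 2-byte pair; real DBCS encodings
use more intricate lead-byte ranges):

#include <stddef.h>

/* Count the characters (not the bytes) in a DBCS string. */
size_t dbcs_strlen(const unsigned char *s)
{
    size_t n = 0;
    while (*s)
    {
        s += (*s > 0x80) ? 2 : 1;   /* lead byte: skip 2; else skip 1 */
        n++;
    }
    return n;
}

There is no way to reach the sixth character except by walking past the
five before it.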
UNICODE TO THE RESCUE
The basic problem is that the world's written languages simply cannot be
represented by only 256 8-bit codes. The previous solutions have proven
insufficient and awkward. What's the real solution?
It doesn't take a genius to figure out that if 8 bits are inadequate, then
16 bits might be just fine. Congratulations! You've just invented Unicode!
Unicode is truly as simple as that: Rather than the confusion of multiple
256-character code mappings or double-byte character sets that have some
1-byte codes and some 2-byte codes, Unicode is a uniform 16-bit system, thus
allowing the representation of 65,536 characters. This is sufficient for all
the most common characters and logographs in all the written languages of
the world (including some math and symbol collections), with about half the
code space left free.
Sixteen-bit characters are often called wide characters. Wide characters do
not necessarily have to be Unicode characters, although for our purposes
I'll tend to use the terms Unicode and wide characters synonymously.
THE WINDOWS NT SUPPORT
Of course, dealing with character codes that are 16 bits in length rather
than 8 is quite a foreign concept to many of us. We are so accustomed to
identifying a character with 8 bits that it seems unnatural and impossible.
What about our operating systems? What about our programming languages? What
about our hardware and printers?
It's really not as bad as it sounds, although obviously quite a few years
will pass before Unicode replaces ASCII as the universal system of character
coding. Still, some essential support is already falling into place.
You can write programs for Windows NT that continue to use the ASCII
character set, or you can write programs that use Unicode. You can even mix
the use of ASCII and wide characters in the same program. How does this
work? Well, every function call in Windows NT that requires a character
string as a parameter (and there are quite a few of them) has two different
entry points in the operating system. For example, there is a TextOutA
function (the ASCII version) and a TextOutW function (the wide-character
version); depending on the definition of an identifier, the name TextOut is
defined as one or the other of them. We'll see how this works in more detail
in a future column.
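In outline, the mapping works like this (a simplified sketch of what the
Windows NT headers arrange; the real headers involve more machinery):

/* If the program is compiled with UNICODE defined, TextOut */
/* resolves to the wide-character entry point.              */
#ifdef UNICODE
#define TextOut TextOutW     /* wide-character version */
#else
#define TextOut TextOutA     /* ASCII version          */
#endif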
The ANSI standard for C also has support for wide characters, and this
support is included in Microsoft's C compiler for Windows NT. Rather than
using the strlen() function to find the length of a string, for example, you
can use wcslen() (which translates to "wide character string length").
Instead of using sprintf() to format a string for display, you can use
swprintf().
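Here is a minimal illustration (using the ISO %ls conversion for wide
strings; note that the exact swprintf signature has varied among
compilers):

#include <wchar.h>

int main(void)
{
    wchar_t msg[] = L"Hello!" ;

    /* wcslen counts characters, not bytes: this prints 6 */
    wprintf(L"%ls contains %u characters\n", msg, (unsigned) wcslen(msg)) ;
    return 0 ;
}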
What about displaying non-Latin characters on our video displays and
printers? Well, TrueType also supports wide character sets, and while a
TrueType font file containing all the Unicode characters might be somewhere
in the region of 5 to 10 megabytes, that's not an inordinate size for
representing characters of all the world's written languages.
THE REFERENCE BOOKS
Unicode is documented in two volumes compiled by The Unicode Consortium and
published by Addison-Wesley in 1991 called The Unicode Standard: Worldwide
Character Encoding, Version 1.0.
Because the books contain charts showing all the characters of Unicode, they
are marvelous to explore, and I highly recommend them. These books reveal
the richness and diversity of the world's written languages in a way that
few other documents have. In addition, the books provide the rationale and
details behind the development of Unicode.
You'll probably be pleased to know that the first 128 codes of Unicode are
identical to ASCII, thus facilitating a conversion from ASCII to Unicode.
(Just add another zero byte to each ASCII character.) The second 128
characters are called the Latin 1 character set, which is the same as the
Windows character set except that hexadecimal codes 0x0080 through 0x009F
are defined as control characters in Latin 1. Many blocks of non-Latin
characters are also based on existing standards, also easing conversions.
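The zero-extension described above ("just add another zero byte") is a
one-line loop. A minimal sketch:

#include <wchar.h>

/* Widen an ASCII (or Latin 1) string to Unicode: each byte becomes */
/* a 16-bit value with a zero high byte, including the terminating  */
/* zero. dst must have room for the length of src plus one.         */
void ascii_to_wide(const unsigned char *src, wchar_t *dst)
{
    do
        *dst++ = (wchar_t) *src;
    while (*src++);
}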
Codes 0x0100 through 0x01FF provide additional variations of the Latin
character set. The codes 0x0400 through 0x04FF are for Cyrillic. Armenian,
Hebrew, and Arabic come next, and soon you'll encounter more esoteric
languages such as Devanagari (used in classical Sanskrit and modern Hindi),
Bengali, Gurmukhi, and Gujarati (all North Indian scripts). And on and on
and on.
The famous Zapf Dingbats character set uses codes 0x2700 through 0x27BF, and
the Han ideographs begin at 0x4E00. These Han characters are used to
represent whole words or concepts in Chinese, and they are also used in
Japanese and Korean. Unicode contains over 20,000 Han characters, about a
third of the entire code space.
WHAT UNICODE DOESN'T ADDRESS
Sorting words in English is made easier by the consecutive coding of letters
in the ASCII character set. The coding of characters in Unicode does not
imply any collation sequence, and if you think about it, it doesn't make
much sense to pretend that we know how to alphabetize a collection of words
using different alphabets such as Latin, Hebrew, and Bengali.
Even sorting English words is not as straightforward as ASCII would imply,
because alphabetizing words is usually case-insensitive. Thus, sorting
always requires special consideration beyond the simple numeric sequence of
character codes. And even with extended ASCII character sets, sorting gets
more complex with the accented Latin letters of many European languages. But
at least with Unicode we have a consistent encoding of accented letters, so
a table could be created that allows reasonable sorting.
Another issue Unicode doesn't address is the use of similar alphabets and
logographs in different countries. For example, there are no separate
French, German, or Finnish character sets within Unicode. The written
languages of these countries share many unaccented and accented Latin
characters.
The situation is similar for the Han logographs. Quite often, the same Han
character represents something different depending on whether it's used in
Chinese, Japanese, or Korean. Unicode makes no attempt to reflect these
differences. If the character is in Unicode, then it can be used in any of
those three languages, regardless of its meaning.
Another problem with international programming is that some written
languages do not run from left to right on the printed page. Unicode
specifies that character strings be stored in logical order, which is the
order that someone would type the characters from a keyboard. Properly
displaying such text is left up to an application, but the Unicode reference
books contain some information on this issue.
CAN IT BE DONE?
Unicode is certainly an important step in the move towards truly
international programming. The question of whether it can really replace
ASCII as the standard for worldwide character coding is almost irrelevant.
It simply must.
In my next columns on this subject, I'll refrain from further proselytizing
and will focus on the programming mechanics involved in using Unicode.
----------------------------------------------------------------------------
4. Unicode, Wide Characters, and C ..................................... 344
----------------------------------------------------------------------------
Forwarded by: Liu Jian
Source: PC Magazine, November 9, 1993
Written by: Petzold, Charles
People who write about computers in more general interest magazines often
avoid using the word byte, instead describing storage capabilities in terms
of characters. That's a pretty simple conversion, because one character of
text requires 1 byte of storage. Right?
Wrong! When dealing with ASCII character sets, the equivalence is certainly
correct. But ASCII character sets (even when extended by another 128 codes)
are unable to represent anything beyond the Latin alphabet and some accented
letters used in European alphabets. Several makeshift solutions exist, such
as non-Latin fonts and double-byte character sets (DBCS). In a DBCS, some
characters require 1 byte and some require 2; those requiring 2 bytes are
used for Far Eastern languages.
A far better solution for international computing is to replace ASCII with a
uniform 2-byte character encoding. As I discussed in the last issue, the
system that shows the most promise of becoming a standard is Unicode.
Unicode was developed by a consortium of big names in the computer industry
and is supported by Windows NT. Its first 128 codes are the same as ASCII,
but it is capable of representing all the characters of all the written
languages of the world. It may even come to pass someday that journalists
who write about computers in general-interest magazines will have to adjust
their convenient equivalence of bytes and characters. (Presumably they can
divide by 2!)
When Unicode text is stored in memory or files, character strings are
represented as a series of 16-bit values rather than as bytes. The first
time I encountered this concept, I got the shivers. How on earth do you use
your favorite programming language with Unicode? Luckily, other people have
considered that problem, and support for "wide characters" (as they are
called) is part of the ANSI C standard. That's what I'll examine in this
issue.
EIGHT-BIT CHARACTERS
We all know how to store characters and character strings in our C programs.
You simply use the char data type. But to facilitate an understanding of
how C handles wide characters, let's first review normal character
definition.
The following statement defines and initializes a variable containing a
single character:
char c = 'A' ;
The variable c requires 1 byte of storage containing the value 65 (the
hexadecimal value 0x41), which is the ASCII code for the letter A.
You can define a pointer to a character string like so:
char * p ;
Windows NT, being a 32-bit operating system, reserves 4 bytes of storage for
the character pointer. You can also initialize a pointer to a character
string:
char * p = "Hello!" ;
In this case, the variable p requires 4 bytes of storage, and the character
string is stored in static memory using 7 bytes of storage--the 6 bytes of
the string plus a terminating zero.
You can also define an array of characters, like this:
char a[10] ;
In this case, the compiler reserves 10 bytes of storage for the array. If
the array variable is global (outside any function), you can initialize it
with
char a[] = "Hello!" ;
If you define this array as a local variable to a function, it must be
defined as a static variable, as follows:
static char a[] = "Hello!" ;
In either case, the string is stored in static program memory with a zero
appended at the end, thus requiring 7 bytes of storage.
LET'S GET WIDER
The char data type continues to be a single-byte value in C. To use 16-bit
wide characters in a C program, you must include the WCHAR.H (wide
character) header file in your program:
#include <WCHAR.H>
This header file contains definitions of new data types, structures, and
functions for using wide characters. In particular, WCHAR.H defines the new
data type wchar_t as
typedef unsigned short wchar_t ;
Although the int data type has grown from 16 bits to 32 bits under the
Windows NT C compiler, the short data type is still a 16-bit value. Thus,
the wchar_t data type is the same as an unsigned short integer.
To define a variable containing a single wide character, you use the
following statement:
wchar_t c = 'A' ;
The variable c is the two-byte value 0x0041, the Unicode representation of
the letter A. (However, given the Intel protocol of storing
least-significant bytes first, the bytes are actually stored in memory in
the sequence 0x41, 0x00. Keep this fact in mind as we examine the output of
a sample program shortly.)
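You can verify this byte ordering directly (a minimal sketch; the output
shown is for Intel hardware):

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    wchar_t c = 'A' ;
    unsigned char * b = (unsigned char *) &c ;

    printf("%02X %02X\n", b[0], b[1]) ;   /* prints 41 00 */
    return 0 ;
}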
You can also define and initialize a pointer to a wide character string:
wchar_t * p = L"Hello!" ;
Notice the L (for long) directly preceding the first quotation mark. This
indicates to the compiler that the string is to be stored with wide
characters, that is, with every character occupying 2 bytes. The variable p
requires 4 bytes of storage, as usual, but the character string requires 14
bytes--2 bytes for each character with two bytes of zeros at the end.
Similarly, you can define an array of wide characters this way:
wchar_t a[] = L"Hello!" ;
The string again requires 14 bytes of storage.
Although it looks rather ugly and unnatural, that L preceding the first
quotation mark is very important, and there must not be a space between the
two symbols. Only with that L will the compiler know you want the string to
be stored with 2 bytes per character. Later on, when we look at wide
character strings in places other than variable definitions, you'll
encounter the L preceding the first quotation mark again. Fortunately, the C
compiler will often give you an error message if you forget to include the
L.
LET'S TRY IT OUT
If the concept of wide character strings is new to you, you're definitely
not alone, and you may be skeptical that this can really work. So let's try
it out. The UNITEST1 program is shown in Figures 1 and 2.
To compile and run this program, you'll need Windows NT 3.1 and the
Microsoft Win32 Software Development Kit installed. You can compile and link
the program using the command line
NMAKE UNITEST1.MAK
Notice that the UNITEST1.C program includes the WCHAR.H header file at the
beginning. The program defines two character strings (the text "Hello,
world!"), one using the char data type and the other using wchar, t. The two
variable names for these character strings are acString (the ASCII version)
and wcString (the wide-character version). UNITEST1 then uses the printf
function to display each string, determine its storage size, determine the
number of characters using the strlen function, and then display the first
five characters in both character and hexadecimal format.
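(The figures are not reproduced in this digest; what follows is a minimal
sketch of UNITEST1.C reconstructed from the description above, not the
actual listing.)

#include <stdio.h>
#include <string.h>
#include <wchar.h>

char    acString[] =  "Hello, world!" ;
wchar_t wcString[] = L"Hello, world!" ;

int main(void)
{
    int i ;

    printf("acString = %s, %u bytes, %u characters\n",
           acString, (unsigned) sizeof(acString),
           (unsigned) strlen(acString)) ;
    for (i = 0 ; i < 5 ; i++)
        printf("%c = 0x%02X  ", acString[i], (unsigned) acString[i]) ;
    printf("\n") ;

    /* Passing the wide string to strlen draws the compiler warning  */
    /* quoted below, and printf's %s stops at the first zero byte -- */
    /* exactly the misbehavior discussed in the text.                */
    printf("wcString = %s, %u bytes, %u characters\n",
           wcString, (unsigned) sizeof(wcString),
           (unsigned) strlen(wcString)) ;
    for (i = 0 ; i < 5 ; i++)
        printf("%c = 0x%04X  ", (char) wcString[i], (unsigned) wcString[i]) ;
    printf("\n") ;
    return 0 ;
}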
The first thing you'll notice when compiling the program is a warning
message reading (in part) "incompatible types - from 'unsigned short [14]'
to 'const char *'." This message results from passing the wcString variable to
the strlen function, which expects a pointer to a character string but gets
instead a pointer to a string of short integers. It's only a warning
message, so the compilation will continue and you can run the program. The
results are shown in Figure 3.
The top half of the output looks fine, exactly as we expected. But what
happened with the wide character string? First, printf simply displays the
string as 'H'. Why is this? Well, printf expected a string of single-byte
characters terminated by a zero byte. The first character has a 16-bit
hexadecimal representation of 0x0048. But these bytes are stored in memory
in the sequence 0x48, 0x00. The printf function thus assumed that the
string was only one character long. Similarly, the strlen function reported
that the string was only a single character.
Everything else seems to work. In particular, the sizeof operator reported
that the ASCII string required 14 bytes of storage, and the wide character
string required 28 bytes of storage. Also, indexing the wchar_t array
correctly retrieves each character of the string for printf to display.
This program clearly illustrates the differences between the C language
itself and the runtime library functions. The compiler interprets the string
L"Hello, world!" as a collection of 16-bit short integers and stores them in
the wchar_t array. The compiler also handles the array indexing and the
sizeof operator correctly. But the runtime library functions strlen and
printf are added during link time. These functions expect strings comprised
of single-byte characters. When confronted with wide character strings, they
don't perform as we'd like.
THE WCHAR LIBRARY FUNCTIONS
The solution is alternate runtime library functions that accept wide
character strings rather than single-byte character strings. Fortunately,
such functions exist in Microsoft's 32-bit C compiler package, and they're
all defined in WCHAR.H.
For example, the wide-character version of the strlen function is called
wcslen (wide character string length); the wide-character version of the
printf function is called wprintf. Let's put these two functions to use in
the UNITEST2 program shown in Figures 4 and 5. Notice that in the second
part of the UNITEST2 program, the strlen function has been replaced with
wcslen, and all the printf functions have been replaced with wprintf
(although only one of them gave us trouble in UNITEST1).
The only other code change is that a capital L now precedes the formatting
string in the wprintf functions. From personal experience, I guarantee
you'll frequently forget to include the L when you first start working with
wide character strings. When you use the wide character functions defined in
WCHAR.H, every string you pass to them must be composed of wide characters.
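(Again, the figures are not reproduced here; this is a minimal sketch of
the wide-character portion of UNITEST2 reconstructed from the description,
using the ISO %ls and %lc conversions rather than Microsoft's
%s-within-wprintf convention.)

#include <wchar.h>

wchar_t wcString[] = L"Hello, world!" ;

int main(void)
{
    int i ;

    /* Note the L before every format string. */
    wprintf(L"wcString = %ls, %u bytes, %u characters\n",
            wcString, (unsigned) sizeof(wcString),
            (unsigned) wcslen(wcString)) ;
    for (i = 0 ; i < 5 ; i++)
        wprintf(L"%lc = 0x%04X  ", wcString[i], (unsigned) wcString[i]) ;
    wprintf(L"\n") ;
    return 0 ;
}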
The output of UNITEST2 is shown in Figure 6. This is what we want. Although
the size of the wide character string is 28 bytes (the 13 wide characters
plus the terminating 16-bit zero), the wcslen function reports 13
characters. Keep in mind that the character length of a string does not
change when you move to wide characters--only the byte length changes. And
as I explained earlier, a byte is not necessarily a character.
BUT WHAT'S THE POINT?
Of course we haven't yet established any real benefit to using Unicode in
these two programs. We're still displaying pure ASCII text in character
mode. The character mode font in the U.S. version of Windows NT isn't
capable of displaying the extra Unicode characters. If such characters
appeared in a string, they'd simply be ignored upon display. (You can test
this by inserting, for example, the character 0x0413 into the wcString array.
This is the character code for a letter in the Cyrillic alphabet.)
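For instance (0x0413 is the Cyrillic capital letter GHE):

wcString[0] = 0x0413 ;   /* replaces the 'H' */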
Of course, where Unicode is most important is in graphical Windows NT
programs. Indeed, the retail release of Windows NT is shipped with a
TrueType font containing a small subset of the complete Unicode character
set. It's called Lucida Sans Unicode, and it includes additional accented
Latin letters; the Greek, Cyrillic, and Hebrew alphabets; and a bunch of
symbols. We'll be making good use of this font in future columns when we
begin exploring the use of Unicode in graphical programs. For now, we're
simply trying to nail down the mechanics of using wide characters in a C
program with the C runtime library functions.
MAINTAINING A SINGLE SOURCE
There are, of course, certain disadvantages to using Unicode. First and
foremost is that every string in your program will occupy twice as much
space. In addition, you'll observe that the functions in the wide character
runtime library are larger than the usual functions. For this reason, you
might want to create two versions of a program--one for a U.S. market that
works strictly with ASCII and another for an international market that uses
Unicode. The best solution would be to maintain a single source code file
that you could compile for either ASCII or Unicode.
That's a bit of a problem, though, because the runtime library functions
have different names, you're defining characters differently, and then
there's that nuisance of preceding the string literals with an L.
One answer is to use the TCHAR.H header file supplied with Microsoft's
32-bit C compiler. (My speculation is that the T of TCHAR stands for text.)
This header file is not part of the ANSI standard, so every function and
macro defined therein is preceded by an underscore. TCHAR.H provides a set
of alternative names for the normal runtime library functions requiring
string parameters; for example, _tprintf and _tcslen.
If an identifier called _UNICODE is defined and the TCHAR.H header file is
included in your program, then _tprintf is defined to be wprintf:
#define _tprintf wprintf
If not, then _tprintf is defined to be printf:
#define _tprintf printf
And so on. TCHAR.H also solves the problem of the two character data types
with a new data type named TCHAR. If the _UNICODE identifier is defined,
then TCHAR is wchar_t:
typedef wchar_t TCHAR ;
Otherwise, TCHAR is simply a char:
typedef char TCHAR ;
Now it's time to address that L problem. If the _UNICODE identifier is
defined, then a macro called __T is defined like this:
#define __T(x) L##x
That pair of number signs is called a "token paste" and causes the letter L
to be prepended to the macro parameter. If the _UNICODE identifier is not
defined, the __T macro is simply defined in the following way:
#define __T(x) x
Regardless, two other macros are defined to be the same as __T:
#define _T(x) __T(x)
#define _TEXT(x) __T(x)
Which you use will depend on how concise or verbose you would like to be.
Basically, you must define your string literals inside the _T or _TEXT
macro in the following way:
_TEXT("Hello, world!") ;
This causes the string to be interpreted as composed of wide characters if
the _UNICODE identifier is defined, and as 8-bit characters if not.
Let's test it out with a single source code module named UNITEST3.C. Figures
7 and 8 show two make files, one for creating the ASCII version of the
program (UNITESTA) and the other for the Unicode version (UNITESTW).
UNITESTA.MAK compiles UNITEST3.C, shown in Figure 9, to create an object
module named UNITESTA.OBJ. (Note that the compile command line uses the -Fo
option to give the object file a different name than the source code file.)
The UNITESTA.OBJ file is linked to create UNITESTA.EXE. UNITESTW.MAK is
similar, except that the compile line also uses the -D (define) option to
define the identifier _UNICODE.
UNITEST3 displays only one set of output lines. The printf and wprintf
functions have been replaced with _tprintf. The strlen and wcslen functions
have been replaced with _tcslen. The definition of the character string now
uses the TCHAR data type. All character strings are enclosed in the _TEXT
macro. Note that the program includes both WCHAR.H and TCHAR.H. The output
from UNITESTA.EXE and UNITESTW.EXE is identical except for the line that
reports the number of bytes occupied by the string in memory.
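(As before, here is a minimal sketch of what UNITEST3.C might look like,
reconstructed from the description; the actual listing is in Figure 9.)

#include <stdio.h>
#include <wchar.h>
#include <tchar.h>

TCHAR tcString[] = _TEXT("Hello, world!") ;

int main(void)
{
    /* With -D_UNICODE this compiles to wprintf and wcslen calls; */
    /* without it, to plain printf and strlen calls.              */
    _tprintf(_TEXT("tcString = %s, %u bytes, %u characters\n"),
             tcString, (unsigned) sizeof(tcString),
             (unsigned) _tcslen(tcString)) ;
    return 0 ;
}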
WHAT HAVE WE LEARNED?
We've seen how to use both ASCII strings and Unicode strings in the same
source code file, and how to have a single source code file that can be
compiled for either ASCII or Unicode. Of course what I've discussed in this
column doesn't represent the extent of converting an existing program to
Unicode. You'll have to find any places in your code where you've
previously assumed that the size of a character is a byte or where you
access a binary buffer or file as if it were a collection of characters.
(You can download the UNITEST program and source code file from PC MagNet's
Programming Forum, archived as UNI.ZIP.)
In the next installment of this column, we'll examine how the Windows NT API
provides methods for using ASCII and Unicode in the same program, or for
creating a single source that can be compiled for either. The methods are a
little different from those provided by the C compiler, C runtime library,
and header files, but as you will see, the results are similar.
+--------------------------------------------------------------------------+
| Executive Editor: Sifeng Ma (U.S.A.) |
+--------------------------------------------------------------------------+
| CINET-L (China's InterNET Tech Forum) is a non-public discussion list, |
| however, CINET-EDITOR@CND.ORG welcomes contributions on networking in |
| China. Some related discussions may be found on CHINANET@TAMVM1.TAMU.EDU |
| To join the forum CHINANET@TAMVM1.TAMU.EDU (or CHINANET@TAMVM1.BITNET), |
| send a mail to LISTSERV@TAMVM1.TAMU.EDU or LISTSERV@TAMVM1.BITNET |
| (Note: NOT CHINANET@TAMVM1) with FIRST LINE of the mail body as follows: |
| SUB CHINANET Your_First_Name Last_Name |
+--------------------------------------------------------------------------+
------- End of Forwarded Message