A brief description of Normative Addendum 1
Introduction
When the (then draft) ANSI C Standard was being considered for adoption as an International Standard in 1990, there were several objections because it didn't address internationalization issues. Because the Standard had already been several years in the making, it was agreed that a few changes would be made to provide the basis (for example, the functions in subclause 7.10.7 were added), and work would be carried out separately to provide proper internationalization of the Standard. This work has culminated in Normative Addendum 1.
Normative Addendum 1 embodies C's reaction to both the limitations and promises of international character sets. Digraphs and the <iso646.h> header were meant to improve the appearance of C programs written in national variants of ISO 646 without, e.g., { or } characters. On the other end of the spectrum, the facilities connected to <wchar.h> and <wctype.h> extend the old Standard's barely adequate basis into a complete and consistent set of utilities for handling wide characters and multibyte strings.
This document summarizes Normative Addendum 1. It is intended to quickly inform readers who are already familiar with the Standard; it does not, and cannot, introduce the complex subject matter behind NA1, nor can it replace the original document as a reference manual. (Nevertheless, it tries to be as accurate as possible, and its author would like to hear about any errors or omissions.)
Version
The macro __STDC_VERSION__ shall expand to 199409L.
(The Normative Addendum was formally registered with ISO in September 1994.)
Alternative Spellings
The following two controversial changes became known as the ``Danish Proposal'' during their discussion, and were intended to make C programs more visually appealing on terminals that only offer the seven-bit ISO 646 character set (which does not contain the characters [ ] { } #, and some others). They add no new functionality to the trigraphs introduced by ISO/IEC 9899:1990.
Digraphs
The following new tokens and preprocessing tokens are added:
<: :> <% %> %: %:%:
These tokens behave identically to the tokens and preprocessing tokens:
[ ] { } # ##
respectively (except that they are spelled differently, and so stringize differently).
It is possible to construct source files which undergo a quiet change because of the introduction of %: and %:%:.
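As a small illustration of the digraphs, the sketch below writes two function bodies with <% %> and an array subscript with <: :>, and uses a stringizing macro to show that a digraph keeps its own spelling. The names STR, digraph_sum and digraph_spelling_survives are invented for this example and do not come from NA1.

```c
#include <assert.h>
#include <string.h>

#define STR(x) #x   /* stringizes the spelling of its argument */

/* Sums the first n elements of v; the body is delimited by <% %>
   and the subscript is written with <: :>. */
int digraph_sum(const int *v, int n)
<%
    int total = 0;
    assert(v != 0 || n == 0);
    for (int i = 0; i < n; i++)
        total += v<:i:>;
    return total;
%>

/* Returns 1 if the digraph <: stringizes as "<:" rather than "[". */
int digraph_spelling_survives(void)
<%
    return strcmp(STR(<:), "<:") == 0 && strcmp(STR(<:), "[") != 0;
%>
```

Apart from the stringized spelling, a translator treats this exactly as if the conventional brackets and braces had been used.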
Header <iso646.h>
This defines the following 11 macros as shown:
#define and &&
#define and_eq &=
#define bitand &
#define bitor |
#define compl ~
#define not !
#define not_eq !=
#define or ||
#define or_eq |=
#define xor ^
#define xor_eq ^=
These macro names are reserved for all purposes in translation units that include the header, but are not reserved in those that do not (this is the same as for any other Standard macros).
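A short sketch of how these spellings read in practice; the two helper functions are hypothetical and exist only to exercise a few of the macros.

```c
#include <assert.h>
#include <iso646.h>

/* Returns 1 if x lies outside the closed range [lo, hi], else 0. */
int outside_range(int x, int lo, int hi)
{
    assert(lo <= hi);
    return not (x >= lo and x <= hi);
}

/* Returns flags with every bit set in mask cleared. */
unsigned clear_bits(unsigned flags, unsigned mask)
{
    flags and_eq compl mask;   /* same as: flags &= ~mask; */
    return flags;
}
```

Because these are ordinary macros, a translation unit that does not include <iso646.h> remains free to use and, or, and the rest as identifiers.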
Extended character sets
The remainder of NA1 is designed to improve the facilities for handling text in complex or multiple character sets. In the C model, each locale defines a set of characters - abstract entities - that can be represented in two ways:
- as wide characters: each character is given a code value that can be stored in an object of type wchar_t. Not all code values have to represent a character; those that do not must not appear in wide strings that are converted to multibyte characters. Code value 0 is reserved for the ``end of string'' indicator.
- as multibyte characters: each character is represented by a sequence of one or more bytes (values that can be stored in an object of type char).
The interpretation of a multibyte string (a sequence of bytes) depends on the current state. There is a special state called the initial state, and most strings are interpreted starting in the initial state. A multibyte string is then a sequence of zero or more of the following:
- a byte sequence that represents a character and might or might not also change the state;
- a byte sequence which changes the state but does not represent a character (this is called a shift sequence).
A character can have representations in more than one state, and can have more than one representation in any given state. The representation in different states can differ. Not all byte sequences are necessarily valid; an invalid sequence causes an encoding error when interpreted (normally shown by setting errno to EILSEQ).
However, for encodings used by other library functions, there are further restrictions:
- the zero byte is reserved as the ``end of string'' indicator, and may not occur in any other byte sequence.
- in the C locale, the 99 characters required by the Standard have representations in the initial state which are one byte long and do not alter the state.
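The model can be made concrete with a small sketch that walks a multibyte string from the initial state and counts the characters it represents. It uses mbrlen from <wchar.h>; the function name mb_char_count is an invention of this example, and the result depends on the LC_CTYPE category in effect.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Counts the characters in the multibyte string s, interpreting it
   from the initial state. Returns (size_t)-1 if an invalid or
   incomplete sequence is met. */
size_t mb_char_count(const char *s)
{
    mbstate_t st;
    size_t left, count = 0;

    assert(s != NULL);
    memset(&st, 0, sizeof st);      /* all zeroes: the initial state */
    left = strlen(s) + 1;           /* include the terminating zero byte */

    for (;;) {
        size_t len = mbrlen(s, left, &st);
        if (len == 0)               /* reached the end-of-string indicator */
            return count;
        if (len == (size_t)-1 || len == (size_t)-2)
            return (size_t)-1;      /* encoding error or truncated input */
        s += len;                   /* len covers the character and any
                                       shift sequence that preceded it */
        left -= len;
        count++;
    }
}
```

In the C locale every required character is one byte long, so the count there simply matches strlen for strings built from those characters.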
Reserved Identifiers
Certain identifiers are defined with external linkage by NA1, but were not reserved as identifiers with external linkage by the original Standard (for example fwprintf); all these identifiers are declared by <wctype.h> or <wchar.h>. These identifiers are reserved with external linkage in all the translation units of a program if and only if any translation unit includes either of those headers (thus changes in one translation unit may cause another translation unit to invoke undefined behavior).
Header <errno.h>
A new macro EILSEQ is added to the list of error conditions (currently this list consists of EDOM and ERANGE).
Header <wctype.h>
typedef ... wint_t;
- An integral type unchanged by integral promotion. It must be capable of holding every valid wide character, and also the value WEOF (described below). It can be the same type as wchar_t.
typedef ... wctrans_t;
typedef ... wctype_t;
- Scalar types used to hold magic cookies. wctype_t represents a classification of characters (like ``is lower case'' or ``is accented''), while wctrans_t represents a character conversion (like ``change to upper case'' or ``remove any accent'').
WEOF
- is an object-like macro which evaluates to a constant expression of type wint_t. It need not be negative nor equal to EOF, but it serves the same purpose: the value, which must not be a valid wide character, is used to represent an end of file or as an error indication.
The functions declared in this header are affected by the LC_CTYPE category of the current locale.
Extended character testing functions
int iswalnum (wint_t);
int iswalpha (wint_t);
int iswcntrl (wint_t);
int iswdigit (wint_t);
int iswgraph (wint_t);
int iswlower (wint_t);
int iswprint (wint_t);
int iswpunct (wint_t);
int iswspace (wint_t);
int iswupper (wint_t);
int iswxdigit (wint_t);
- The argument must be WEOF or representable as a wchar_t. The function will return nonzero if and only if the argument is a wide character of the appropriate type. The types are the same as for the <ctype.h> functions, except that iswpunct and iswgraph are guaranteed to return false not only for space (as their char counterparts do), but for any character that iswspace() considers white space. Thus, even in a locale where isgraph('\t') is true, iswgraph(L'\t') is false. For the remaining nine functions the expression (!isXXXXX(wctob(wc)) || iswXXXXX(wc)) is true for every wide character. That is, for any wide character which has a corresponding single-byte character (which is what wctob returns), if the latter has the given property, then so does the former. Note that this is not a symmetric relationship.
wctype_t wctype (const char *);
int iswctype (wint_t, wctype_t);
- While an implementation can add extra isXXXXX or iswXXXXX functions to test for other properties (e.g. ``is a katakana character''), it was felt that this cluttered the namespace (though the names are all reserved) without being flexible enough for future needs. Instead, the committee introduced a mechanism that can be extended at run-time.
  The string argument to wctype() names a category to test for; wctype() returns a wctype_t magic cookie that can be handed to iswctype to test for the named category, or zero if it does not recognize the category. The eleven built-in categories "alnum", "alpha", ... "xdigit" must be recognized by all implementations. Thus, iswctype(ch, wctype("punct")) is the same as iswpunct(ch). The wctype_t value is only valid for the LC_CTYPE category used to create it.
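The run-time mechanism might be used along the following lines. This is only a sketch: count_in_class is a hypothetical helper, and the category names that succeed beyond the eleven required ones depend on the implementation and locale.

```c
#include <assert.h>
#include <wchar.h>
#include <wctype.h>

/* Counts the wide characters in ws that belong to the category named
   by class_name. Returns (size_t)-1 if wctype() does not recognize
   the name in the current locale. */
size_t count_in_class(const wchar_t *ws, const char *class_name)
{
    wctype_t cls;
    size_t n = 0;

    assert(ws != NULL && class_name != NULL);
    cls = wctype(class_name);       /* look the category up once */
    if (cls == 0)
        return (size_t)-1;          /* unknown category */

    for (; *ws != L'\0'; ws++)
        if (iswctype((wint_t)*ws, cls))
            n++;
    return n;
}
```

The cookie is fetched once outside the loop, since it stays valid as long as the LC_CTYPE category is not changed.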
Extended character conversion functions
wint_t towlower (wint_t);
wint_t towupper (wint_t);
- Wide character versions of toupper and tolower. There is no requirement that the mapping corresponds to that of single-byte characters (thus it might be that toupper('é') == 'E', while towupper(L'é') == L'É' on a system where É is not a single-byte character).
wctrans_t wctrans (const char *);
wint_t towctrans (wint_t, wctrans_t);
- Provide extensible conversions, in just the same way as wctype() and iswctype() provide extensible tests.
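A sketch of the extensible conversion interface, mapping a wide string in place. The helper map_wide is invented for illustration; "toupper" and "tolower" are the two mapping names every implementation must accept.

```c
#include <assert.h>
#include <wchar.h>
#include <wctype.h>

/* Applies the mapping named by mapping_name to every wide character
   of ws, in place. Returns 0 on success, or -1 if wctrans() does not
   recognize the name in the current locale. */
int map_wide(wchar_t *ws, const char *mapping_name)
{
    wctrans_t map;

    assert(ws != NULL && mapping_name != NULL);
    map = wctrans(mapping_name);
    if (map == 0)
        return -1;                  /* unknown mapping */

    for (; *ws != L'\0'; ws++)
        *ws = (wchar_t)towctrans((wint_t)*ws, map);
    return 0;
}
```

Characters with no mapping (digits, for instance) are returned unchanged by towctrans, just as towupper would leave them alone.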
Header <wchar.h>
struct tm;
typedef ... size_t;
typedef ... wchar_t;
typedef ... wint_t;
#define NULL ...
#define WEOF ...
- All described elsewhere. This header does not complete the definition of struct tm; it is still necessary to include <time.h> before defining a variable of this type.
typedef ... mbstate_t;
- A non-array object type that can represent a conversion state (the internals of the representation are not specified).
WCHAR_MAX and WCHAR_MIN
- evaluate to the maximum and minimum values a wchar_t can hold. They are integral constant expressions of type wchar_t, but not necessarily valid as wide characters. For example, if wchar_t is a typedef for unsigned short, then WCHAR_MIN will be zero and WCHAR_MAX will be the same as USHRT_MAX.
Input and output
Each stream has associated with it an object of type mbstate_t, and an orientation; it can be byte-oriented, wide-oriented, or unoriented. When a stream is opened (including stdin etc., and calls to freopen), it is unoriented. The functions ungetc, fgetc, fputc, and those defined to work through them, change an unoriented stream to byte-oriented, and shall not be called on a wide-oriented stream. The functions ungetwc, fgetwc, fputwc, and those defined to work through them, change an unoriented stream to wide-oriented, and shall not be called on a byte-oriented stream.
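The orientation rules can be observed with fwide, which is declared in <wchar.h> and reports a stream's orientation when its second argument is zero. The sketch below assumes a temporary file can be created with tmpfile; the function name check_orientation_rules is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Returns 1 if a fresh stream behaves as described (unoriented at
   first, wide-oriented after wide output, and not changeable after
   that), 0 if not, or -1 if no temporary file could be opened. */
int check_orientation_rules(void)
{
    FILE *f = tmpfile();
    int ok = 1;

    if (f == NULL)
        return -1;

    if (fwide(f, 0) != 0)           /* just opened: unoriented */
        ok = 0;
    if (fputwc(L'x', f) == WEOF)    /* wide output sets the orientation */
        ok = 0;
    if (fwide(f, 0) <= 0)           /* now wide-oriented */
        ok = 0;
    if (fwide(f, -1) <= 0)          /* a request for byte orientation
                                       has no effect */
        ok = 0;

    assert(fclose(f) == 0);
    return ok;
}
```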
Wide-oriented binary streams shall obey the positioning restrictions of both text and binary streams. Positioning a wide-oriented stream in the middle of an existing character representation and then writing makes all following contents undefined.
The mbstate_t object associated with a stream is saved by fgetpos and restored by fsetpos. The object is initialized when the stream is opened as if it were an object declared with static lifetime (i.e. all zeroes and null pointers).
The *scanf and *printf functions have the ability to handle strings of the opposite type to the majority (that is, wide strings in fprintf etc. and multibyte strings in fwprintf etc.). These strings are converted to the majority form before (for *printf) or after (for *scanf) any other processing. This conversion is done as if using calls to mbrtowc or wcrtomb, but with an mbstate_t object set to the initial state before each such conversion.
wint_t fgetwc (FILE *);
- Reads bytes from the stream and passes them to mbrtowc (using the stream's mbstate_t object) until a complete wide character has been read, or an error occurs. The character or WEOF is returned; the latter can indicate end of file (the eof indicator is set), a read error (the error indicator is set), or a conversion error (errno is set to EILSEQ). All other wide character input is done as if via fgetwc.
wint_t fputwc (wchar_t, FILE *);
- Passes the wide character to wcrtomb (using the stream's mbstate_t object) and writes the resulting bytes to the stream. The character or WEOF is returned; the latter can indicate a write error (the error indicator is set) or a conversion error (errno is set to EILSEQ). All other wide character output is done as if via fputwc.
Two new conversions are added to fprintf (and printf and sprintf): %lc, which requires a wint_t argument, and %ls, which requires a wchar_t * argument. %lc is equivalent to %ls called with a two element array (the argument in the first element, and zero in the second). %ls converts the wide characters to bytes; the precision indicates the maximum number of bytes written (conversion will also stop on a zero wide character); a partial multibyte character will not be output, though complete trailing shift sequences might be.
Three new conversions are added to fscanf (and scanf and sscanf): %lc, %ls, and %l[; all take a pointer to wchar_t, and convert the input to wide characters after matching. (The qualified and unqualified conversions match the same input.)
int fwprintf (FILE *, const wchar_t *, ...);
int wprintf (const wchar_t *, ...);
int swprintf (wchar_t *, size_t, const wchar_t *, ...);
int vfwprintf (FILE *, const wchar_t *, va_list);
int vwprintf (const wchar_t *, va_list);
int vswprintf (wchar_t *, size_t, const wchar_t *, va_list);
- Valid conversions are the same as for fprintf, including the extensions above. With %c, the character is converted using btowc; with %s, the string is converted to wide characters before output. With all formats, width and precision are measured in wide characters. The second argument of swprintf is the number of elements of the destination array (including the terminating zero which is always written).
int fwscanf (FILE *, const wchar_t *, ...);
int wscanf (const wchar_t *, ...);
int swscanf (const wchar_t *, const wchar_t *, ...);
- Valid conversions are the same as for fscanf, including the extensions above. With %c, %s, and %[, the accepted input field will be converted to its multibyte equivalent after being matched. With all formats, width and precision are measured in wide characters.
wchar_t *fgetws (wchar_t *, int, FILE *);
int fputws (const wchar_t *, FILE *);
wint_t getwc (FILE *);
wint_t getwchar (void);
wint_t putwc (wchar_t, FILE *);
wint_t putwchar (wchar_t);
wint_t ungetwc (wint_t, FILE *);
- Equivalent to the corresponding functions in subclause 7.9.7 of the Standard (including the multiple expansion rules for getwc and putwc's FILE * argument).
int fwide (FILE *, int);
- If the second argument is greater than zero, attempts to make the stream wide-oriented; if it is less than zero, attempts to make it byte-oriented. Returns the orientation of the stream after the call (<0 for byte-oriented, 0 for unoriented, >0 for wide-oriented). Once a stream has been given an orientation, it cannot be changed.
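To illustrate the mixed-string conversions, here is a sketch that formats a multibyte label (%s) and a wide value (%ls) into a wide buffer with swprintf. The wrapper name format_pair is made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Writes "<label>=<value>" into dst, which has room for cap wide
   characters including the terminating zero. Returns the number of
   wide characters written (excluding the terminator), or a negative
   value if the result does not fit or cannot be converted. */
int format_pair(wchar_t *dst, size_t cap,
                const char *label, const wchar_t *value)
{
    assert(dst != NULL && label != NULL && value != NULL);
    /* %s takes a multibyte string, which is converted to wide
       characters; %ls takes a wide string as it stands. */
    return swprintf(dst, cap, L"%s=%ls", label, value);
}
```

Unlike sprintf, swprintf always takes the size of the destination array, so an undersized buffer leads to a negative return value rather than an overrun.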
String handling facilities
double wcstod (const wchar_t *, wchar_t **);
long int wcstol (const wchar_t *, wchar_t **, int);
unsigned long int wcstoul (const wchar_t *, wchar_t **, int);
wchar_t *wcscpy (wchar_t *, const wchar_t *);
wchar_t *wcsncpy (wchar_t *, const wchar_t *, size_t);
wchar_t *wcscat (wchar_t *, const wchar_t *);
wchar_t *wcsncat (wchar_t *, const wchar_t *, size_t);
int wcscmp (const wchar_t *, const wchar_t *);
int wcscoll (const wchar_t *, const wchar_t *);
int wcsncmp (const wchar_t *, const wchar_t *, size_t);
size_t wcsxfrm (wchar_t *, const wchar_t *, size_t);
wchar_t *wcschr (const wchar_t *, wchar_t);
size_t wcscspn (const wchar_t *, const wchar_t *);
wchar_t *wcspbrk (const wchar_t *, const wchar_t *);
wchar_t *wcsrchr (const wchar_t *, wchar_t);
size_t wcsspn (const wchar_t *, const wchar_t *);
wchar_t *wcsstr (const wchar_t *, const wchar_t *);
size_t wcslen (const wchar_t *);
wchar_t *wmemchr (const wchar_t *, wchar_t, size_t);
int wmemcmp (const wchar_t *, const wchar_t *, size_t);
wchar_t *wmemcpy (wchar_t *, const wchar_t *, size_t);
wchar_t *wmemmove (wchar_t *, const wchar_t *, size_t);
wchar_t *wmemset (wchar_t *, wchar_t, size_t);
size_t wcsftime (wchar_t *, size_t, const wchar_t *, const struct tm *);
- Equivalent to the corresponding functions in subclauses 7.10, 7.11, and 7.12 of the Standard.
wchar_t *wcstok (wchar_t *, const wchar_t *, wchar_t **);
- Tokenizes a wide string in the same way as strtok, but uses the object pointed to by the third argument to keep state, rather than keeping it internally as strtok does. This change makes it possible to interleave calls to wcstok over different input strings.
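The benefit of the caller-supplied state in wcstok shows up when two tokenizations are interleaved. In this sketch the function count_cells and the ``rows separated by ; and cells separated by ,'' input layout are assumptions made for the example.

```c
#include <assert.h>
#include <wchar.h>

/* Counts the comma-separated cells across all semicolon-separated
   rows of text. The string is modified in place, as with strtok. */
size_t count_cells(wchar_t *text)
{
    wchar_t *row, *cell;
    wchar_t *row_state, *cell_state;   /* one state object per scan */
    size_t cells = 0;

    assert(text != NULL);
    for (row = wcstok(text, L";", &row_state);
         row != NULL;
         row = wcstok(NULL, L";", &row_state)) {
        /* The inner scan does not disturb the outer one, because
           each keeps its position in its own state object. */
        for (cell = wcstok(row, L",", &cell_state);
             cell != NULL;
             cell = wcstok(NULL, L",", &cell_state))
            cells++;
    }
    return cells;
}
```

The same nesting with strtok would fail, because the inner calls would overwrite the single hidden position used by the outer loop.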
Conversion facilities
Most of these functions take a pointer to an mbstate_t object that they keep their conversion state in. Such an object can be set to all zeroes (e.g. by assigning to it the value of an mbstate_t object with static lifetime which has not been explicitly initialized) and is then in its initial state. When an object is in the initial state (no matter how this occurred), it is prepared for conversion in either direction (from multibyte to wide characters or vice versa) starting in the initial state. Once an object has left its initial state (which happens whenever it is used with one of the following functions unless the description says otherwise), it shall only be used in the same LC_CTYPE category [*] and same direction as the previous call, and shall not be used after a conversion error. If a null pointer is passed, each function uses its own internal object which is initialized to all zeroes at program startup.
___________________________________________________________________
[*] The mbstate_t object associated with a stream is bound to an encoding by the first fgetwc or fputwc call after the stream is opened, and can then be used with any locale.
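A minimal sketch of setting up a conversion state as just described, using mbsinit and mbrtowc from <wchar.h>. The function name state_checks is hypothetical, and the last check assumes the C locale, where converting a one-byte character leaves the state unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Returns 1 if a state copied from an uninitialized static object is
   reported as initial, both before and after converting a single
   one-byte character; returns 0 otherwise. */
int state_checks(void)
{
    static mbstate_t pristine;      /* static lifetime: all zeroes */
    mbstate_t st = pristine;        /* assignment, as described above */
    wchar_t wc;
    int ok = 1;

    if (!mbsinit(&st))              /* fresh object: initial state */
        ok = 0;
    if (mbrtowc(&wc, "A", 1, &st) != 1 || wc != L'A')
        ok = 0;
    if (!mbsinit(&st))              /* still initial in the C locale */
        ok = 0;

    assert(mbsinit(NULL) != 0);     /* a null pointer counts as initial */
    return ok;
}
```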
wint_t btowc (int);
- Converts the argument (treated as an unsigned char) to the corresponding wide character, if any, or else returns WEOF.
int wctob (wint_t);
- If the argument wide character has a multibyte encoding in the initial shift state which is a single byte, returns that byte. Otherwise returns EOF.
int mbsinit (const mbstate_t *);
- Returns nonzero if passed a null pointer, or if the mbstate_t object is in the initial state (the object is unaffected).
size_t mbrlen (const char *s, size_t n, mbstate_t *pcs);
- Equivalent to mbrtowc (NULL, s, n, pcs), except that it uses its own internal mbstate_t object, not that of mbrtowc, when given a null pointer.
size_t mbrtowc (wchar_t *ws, const char *s, size_t n, mbstate_t *pcs);
- Converts a multibyte character from s (inspecting no more than n bytes) to a wide character. If ws is not a null pointer, the wide character is stored in *ws. If s is a null pointer, mbrtowc ignores ws and n and acts as if the first three arguments are a null pointer, an empty string, and 1 respectively.
  The return value can be:
  (size_t)-2 - The inspected bytes have been used to update the mbstate_t, but no complete wide character has been found.
  (size_t)-1 - An encoding error has occurred.
  0 - The inspected bytes represent a zero wide character, which is stored in the array pointed to by the first argument (if not null), and the mbstate_t object has been restored to the initial state.
  other - That number of bytes represent a single wide character, which is stored in the array pointed to by the first argument (if not null), and the mbstate_t object has been updated.
  After a return of (size_t)-2, the partial character is recorded in the mbstate_t object; the inspected bytes do not need to be passed to the function a second time.
size_t wcrtomb (char *, wchar_t, mbstate_t *);
- If the first argument is a null pointer, ignores the second argument and acts as if they are a pointer to an internal buffer and zero respectively. Otherwise it converts the wide character to at most MB_CUR_MAX bytes and places them in the array pointed to by the first argument; if the wide character is zero, the resulting sequence will consist of any shift sequence needed to return to the initial state, followed by a zero byte, and the mbstate_t object will be in the initial state.
  wcrtomb returns the number of bytes written to the character buffer, or (size_t)-1 to indicate an encoding error (errno is set to EILSEQ).
size_t mbsrtowcs (wchar_t *ws, const char **ps, size_t n, mbstate_t *pcs);
- Converts the multibyte character sequence pointed to by *ps to wide characters. The result is either (size_t)-1 if a conversion error occurs (in which case errno is set to EILSEQ), or else the number of characters successfully converted.
  - If ws is a null pointer, processing stops at the end of the string (the terminating zero byte is not counted in the returned value), and *pcs will be set to the initial state.
  - If ws is not a null pointer, the resulting wide character sequence is stored in the array it points to. Conversion stops:
    - when n wide characters have been stored; *pcs will be set to the conversion state after processing the bytes consumed, and *ps will point to the first unprocessed byte;
    - at the end of the string if this comes first (the terminating zero byte is not counted in the returned value); *pcs will be set to the initial state, *ps will be set to a null pointer, and a zero wide character will have been stored.
size_t wcsrtombs (char *s, const wchar_t **pws, size_t n, mbstate_t *pcs);
- Converts the wide character sequence pointed to by *pws to a multibyte character sequence. The result is either (size_t)-1 if a conversion error occurs (in which case errno is set to EILSEQ), or else the number of bytes in the resulting multibyte string. Processing of the wide string stops either when a zero wide character - indicating the end of the wide string - is reached (the resulting multibyte string will end with a zero byte which is not included in the returned result), or (if s is not a null pointer) when it is not possible to process another wide character without placing more than n bytes into the array pointed to by s. In the first case, *pcs will be left in the initial state. If s is a null pointer, the value of n is ignored. Otherwise *pws will be set to either a null pointer (if conversion stopped on a zero wide character) or a pointer to the first unprocessed wide character. In the latter case, the returned value will be at least (n-MB_CUR_MAX+1).
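The two string conversion functions are commonly used in a measure-then-convert pattern: a first call with a null destination yields the required length, and a second call fills a buffer of that size. The helpers mb_to_wide and wide_to_mb below are hypothetical names, sketched under the assumption that the whole string should be converted starting from the initial state.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Converts the multibyte string s to a newly allocated wide string.
   Returns NULL on an encoding error or allocation failure. */
wchar_t *mb_to_wide(const char *s)
{
    mbstate_t st;
    const char *p = s;
    wchar_t *out;
    size_t n;

    assert(s != NULL);
    memset(&st, 0, sizeof st);              /* initial state */
    n = mbsrtowcs(NULL, &p, 0, &st);        /* measure only */
    if (n == (size_t)-1)
        return NULL;
    out = malloc((n + 1) * sizeof *out);    /* +1 for the terminator */
    if (out == NULL)
        return NULL;
    p = s;
    memset(&st, 0, sizeof st);
    if (mbsrtowcs(out, &p, n + 1, &st) == (size_t)-1) {
        free(out);
        return NULL;
    }
    return out;     /* p is now null: the whole string was converted */
}

/* Converts the wide string ws to a newly allocated multibyte string.
   Returns NULL on an encoding error or allocation failure. */
char *wide_to_mb(const wchar_t *ws)
{
    mbstate_t st;
    const wchar_t *p = ws;
    char *out;
    size_t n;

    assert(ws != NULL);
    memset(&st, 0, sizeof st);
    n = wcsrtombs(NULL, &p, 0, &st);        /* measure only */
    if (n == (size_t)-1)
        return NULL;
    out = malloc(n + 1);                    /* +1 for the zero byte */
    if (out == NULL)
        return NULL;
    p = ws;
    memset(&st, 0, sizeof st);
    if (wcsrtombs(out, &p, n + 1, &st) == (size_t)-1) {
        free(out);
        return NULL;
    }
    return out;
}
```

The caller owns the returned buffers and releases them with free.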
Future directions
<wctype.h> reserves function names beginning with is or to followed by a lowercase letter. <wchar.h> reserves function names beginning with wcs followed by a lowercase letter. Lowercase letters are reserved as conversion specifiers for fwprintf and fwscanf.