
A brief description of Normative Addendum 1

Introduction

When the (then draft) ANSI C Standard was being considered for adoption as an International Standard in 1990, there were several objections because it didn't address internationalization issues. Because the Standard had already been several years in the making, it was agreed that a few changes would be made to provide the basis (for example, the functions in subclause 7.10.7 were added), and work would be carried out separately to provide proper internationalization of the Standard. This work has culminated in Normative Addendum 1.

Normative Addendum 1 embodies C's reaction to both the limitations and promises of international character sets. Digraphs and the <iso646.h> header were meant to improve the appearance of C programs written in national variants of ISO 646 without, e.g., { or } characters. On the other end of the spectrum, the facilities connected to <wchar.h> and <wctype.h> extend the old Standard's barely adequate basis into a complete and consistent set of utilities for handling wide characters and multibyte strings.

This document summarizes Normative Addendum 1. It is intended to quickly inform readers who are already familiar with the Standard; it does not, and cannot, introduce the complex subject matter behind NA1, nor can it replace the original document as a reference manual. (Nevertheless, it tries to be as accurate as possible, and its author would like to hear about any errors or omissions.)

Version

The macro __STDC_VERSION__ shall expand to 199409L. (The Normative Addendum was formally registered with ISO in September 1994.)
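
A program can test for NA1 support at translation time. The following is a minimal sketch; HAVE_NA1 is a hypothetical macro name chosen for this example, not something NA1 defines. It relies on the fact that an implementation of the original Standard without NA1 does not define __STDC_VERSION__ at all.

```c
/* HAVE_NA1 is a hypothetical name used only in this sketch. */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199409L
#define HAVE_NA1 1   /* NA1 (or a later revision) is claimed */
#else
#define HAVE_NA1 0   /* original Standard only, or not conforming */
#endif
```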

Alternative Spellings

The following two controversial changes became known as the ``Danish Proposal'' during their discussion, and were intended to make C programs more visually appealing on terminals that only offer the seven-bit ISO 646 character set (which does not contain the characters [] {} #, and some others). They add no new functionality beyond the trigraphs introduced by ISO/IEC 9899:1990.

Digraphs

The following new tokens and preprocessing tokens are added:

    <:   :>   <%   %>   %:   %:%:

These tokens behave identically to the tokens and preprocessing tokens:

    [    ]    {    }    #    ##

respectively (except that they are spelled differently, and so stringize differently).
It is possible to construct source files which undergo a quiet change because of the introduction of %: and %:%:.
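
As an illustration, here is a small sketch of a function written entirely with digraph spellings. The function name sum_digraph is made up for this example, and it assumes a compiler running in a mode that accepts NA1 digraphs.

```c
%:include <stddef.h>

/* Sums n elements of a; every bracket, brace and # is spelled as a digraph. */
long sum_digraph(const int *a, size_t n)
<%
    long total = 0;
    size_t i;

    for (i = 0; i < n; i++)
        total += a<:i:>;      /* same as a[i] */
    return total;
%>
```

Callers need not use digraphs themselves; the tokens are equivalent, so sum_digraph can be called from code written with the ordinary spellings.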

Header <iso646.h>

This defines the following 11 macros as shown:

    #define  and     &&
    #define  and_eq  &=
    #define  bitand  &
    #define  bitor   |
    #define  compl   ~
    #define  not     !
    #define  not_eq  !=
    #define  or      ||
    #define  or_eq   |=
    #define  xor     ^
    #define  xor_eq  ^=

These macro names are reserved for all purposes in translation units that include the header, but are not reserved in those that do not (this is the same as for any other Standard macros).
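
A short sketch of how the macros read in practice follows. The function names in_range_except and clear_bits are hypothetical, chosen only for this example.

```c
#include <iso646.h>

/* Nonzero if lo <= x <= hi and x is not equal to skip. */
int in_range_except(int x, int lo, int hi, int skip)
{
    return x >= lo and x <= hi and not (x == skip);
}

/* Returns flags with every bit that is set in mask cleared. */
unsigned clear_bits(unsigned flags, unsigned mask)
{
    return flags bitand compl mask;
}
```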

Extended character sets

The remainder of NA1 is designed to improve the facilities for handling text in complex or multiple character sets. In the C model, each locale defines a set of characters - abstract entities - that can be represented in two ways:

as wide characters: each character is given a code value that can be stored in an object of type wchar_t. Not all code values have to represent a character; those that do not must not appear in wide strings that are converted to multibyte characters. Code value 0 is reserved for the ``end of string'' indicator.

as multibyte characters: each character is represented by a sequence of one or more bytes (values that can be stored in an object of type char).

The interpretation of a multibyte string (a sequence of bytes) depends on the current state. There is a special state called the initial state, and most strings are interpreted starting in the initial state. A multibyte string is then a sequence of zero or more of the following:

a byte sequence that represents a character and might or might not also change the state;
a byte sequence which changes the state but does not represent a character (this is called a shift sequence).

A character can have representations in more than one state, and can have more than one representation in any given state. The representation in different states can differ. Not all byte sequences are necessarily valid; an invalid sequence causes an encoding error when interpreted (normally shown by setting errno to EILSEQ).

For the encoding of file contents, the above rules are complete. For example, Unicode uses 16-bit values to represent each character. On a file system based on bytes, this means that each character is represented by two bytes - the upper and lower half of the value (in some order). This is a valid multibyte encoding for files.

However, for encodings used by other library functions, there are further restrictions:

the zero byte is reserved as the ``end of string'' indicator, and may not occur in any other byte sequence.
in the C locale, the 99 characters required by the Standard have representations in the initial state which are one byte long and do not alter the state.

The Unicode encoding described above cannot be used here, because there are many codes where one or the other byte is zero, and because no code is one byte long.

Reserved Identifiers

Certain identifiers are defined with external linkage by NA1, but were not reserved as identifiers with external linkage by the original Standard (for example fwprintf); all these identifiers are declared by <wctype.h> or <wchar.h>. These identifiers are reserved with external linkage in all the translation units of a program if and only if any translation unit includes either of those headers (thus changes in one translation unit may cause another translation unit to invoke undefined behavior).

Header <errno.h>

A new macro EILSEQ is added to the list of error conditions (currently this list consists of EDOM and ERANGE).

Header <wctype.h>

typedef ... wint_t;

An integral type unchanged by integral promotion. It must be capable of holding every valid wide character, and also the value WEOF (described below). It can be the same type as wchar_t.

typedef ... wctrans_t;
typedef ... wctype_t;

Scalar types used to hold magic cookies. wctype_t represents a classification of characters (like ``is lower case'' or ``is accented''), while wctrans_t represents a character conversion (like ``change to upper case'' or ``remove any accent'').

WEOF

is an object-like macro which evaluates to a constant expression of type wint_t. It need not be negative nor equal to EOF, but it serves the same purpose: the value, which must not be a valid wide character, is used to represent an end of file or as an error indication.

All the following functions are affected by the LC_CTYPE category of the current locale.

Extended character testing functions

int iswalnum (wint_t);
int iswalpha (wint_t);
int iswcntrl (wint_t);
int iswdigit (wint_t);
int iswgraph (wint_t);
int iswlower (wint_t);
int iswprint (wint_t);
int iswpunct (wint_t);
int iswspace (wint_t);
int iswupper (wint_t);
int iswxdigit (wint_t);

The argument must be WEOF or representable as a wchar_t. The function will return nonzero if and only if the argument is a wide character of the appropriate type. The types are the same as for the <ctype.h> functions, except that iswprint and iswgraph are guaranteed to return false not only for space (as their char counterparts do), but for any character that iswspace() considers white space. Thus isgraph('\t') is true, but iswgraph(L'\t') is false. For the remaining nine functions the expression (!isXXXXX(wctob(wc)) || iswXXXXX(wc)) is true for every wide character. That is, for any wide character which has a corresponding single-byte character (which is what wctob returns), if the latter has the given property, then so does the former. Note that this is not a symmetric relationship.
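
The implication can be written out for one of the nine functions. This is a sketch only; alpha_implication_holds is a hypothetical helper name, and the other eight functions would be checked in the same way.

```c
#include <ctype.h>
#include <wchar.h>
#include <wctype.h>

/* Evaluates (!isalpha(wctob(wc)) || iswalpha(wc)) for one wide character.
   wctob returns EOF when wc has no single-byte form, and isalpha(EOF)
   is well defined and false, so the expression is then trivially true. */
int alpha_implication_holds(wint_t wc)
{
    int c = wctob(wc);

    return !isalpha(c) || iswalpha(wc);
}
```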

wctype_t wctype (const char *);
int iswctype (wint_t, wctype_t);

While an implementation can add extra isXXXXX or iswXXXXX functions to test for other properties (e.g. ``is a katakana character''), it was felt that this cluttered the namespace (though the names are all reserved) without being flexible enough for future needs. Instead, the committee introduced a mechanism that can be extended at run-time.
The string argument to wctype() names a category to test for; wctype() returns a wctype_t magic cookie that can be handed to iswctype to test for the named category, or zero if it does not recognize the category. The eleven built-in categories "alnum", "alpha", ... "xdigit" must be recognized by all implementations. Thus, iswctype(ch, wctype("punct")) is the same as iswpunct(ch). The wctype_t value is only valid for the LC_CTYPE category used to create it.
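
A minimal sketch of a lookup by category name follows. The helper name is_in_class and its three-way return convention are assumptions made for this example.

```c
#include <wctype.h>

/* Tests wc against the category called name in the current locale.
   Returns 1 or 0 for a recognized category, -1 if wctype() does not
   know the name. */
int is_in_class(wint_t wc, const char *name)
{
    wctype_t cookie = wctype(name);

    if (cookie == (wctype_t)0)
        return -1;
    return iswctype(wc, cookie) != 0;
}
```

Because the cookie is tied to the LC_CTYPE category in effect when it was created, this sketch looks it up on every call rather than caching it across a possible setlocale call.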

Extended character conversion functions

wint_t towlower (wint_t);
wint_t towupper (wint_t);

Wide character versions of toupper and tolower.
There is no requirement that the mapping corresponds to that of single-byte characters (thus it might be that
toupper('é') == 'E',
while
towupper(L'é') == L'É'
on a system where É is not a single-byte character).

wctrans_t wctrans (const char *);
wint_t towctrans (wint_t, wctrans_t);

Provide extensible conversions, in just the same way as wctype() and iswctype() provide extensible tests.
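
The same pattern applies to conversions. In the sketch below, map_char is a hypothetical helper name; "toupper" and "tolower" are the mapping names that every implementation is required to recognize.

```c
#include <wctype.h>

/* Applies the mapping called name to wc; returns wc unchanged if the
   implementation does not recognize the mapping. */
wint_t map_char(wint_t wc, const char *name)
{
    wctrans_t cookie = wctrans(name);

    if (cookie == (wctrans_t)0)
        return wc;
    return towctrans(wc, cookie);
}
```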

Header <wchar.h>

struct tm;
typedef ... size_t;
typedef ... wchar_t;
typedef ... wint_t;
#define NULL ...
#define WEOF ...

All described elsewhere. This header does not complete the definition of struct tm; it is still necessary to include <time.h> before defining a variable of this type.

typedef ... mbstate_t;

A non-array object type that can represent a conversion state (the internals of the representation are not specified).

WCHAR_MAX and WCHAR_MIN

evaluate to the maximum and minimum values a wchar_t can hold. They are integral constant expressions of type wchar_t, but not necessarily valid as wide characters. For example, if wchar_t is a typedef for unsigned short, then WCHAR_MIN will be zero and WCHAR_MAX will be the same as USHRT_MAX.

Input and output

Each stream has associated with it an object of type mbstate_t, and an orientation; it can be byte-oriented, wide-oriented, or unoriented. When a stream is opened (including stdin etc., and calls to freopen), it is unoriented. The functions ungetc, fgetc, fputc, and those defined to work through them, change an unoriented stream to byte-oriented, and shall not be called on a wide-oriented stream. The functions ungetwc, fgetwc, fputwc, and those defined to work through them, change an unoriented stream to wide-oriented, and shall not be called on a byte-oriented stream.

Wide-oriented binary streams shall obey the positioning restrictions of both text and binary streams. Positioning a wide-oriented stream within the middle of an existing character representation and then writing makes all following contents undefined.

The mbstate_t object associated with a stream is saved by fgetpos and restored by fsetpos. The object is initialized when the stream is opened as if it were an object declared with static lifetime (i.e. all zeroes and null pointers).

The *scanf and *printf functions have the ability to handle strings of the opposite type to the majority (that is, wide strings in fprintf etc. and multibyte strings in fwprintf etc.). These strings are converted to the majority form before (for *printf) or after (for *scanf) any other processing. This conversion is done as if using calls to mbrtowc or wcrtomb, but with an mbstate_t object set to the initial state before each such conversion.

wint_t fgetwc (FILE *);

Reads bytes from the stream and passes them to mbrtowc (using the stream's mbstate_t object) until a complete wide character has been read, or an error occurs. The character or WEOF is returned; the latter can indicate end of file (the eof indicator is set), a read error (the error indicator is set), or a conversion error (errno is set to EILSEQ). All other wide character input is done as if via fgetwc.

wint_t fputwc (wchar_t, FILE *);

Passes the wide character to wcrtomb (using the stream's mbstate_t object) and writes the resulting bytes to the stream. The character or WEOF is returned; the latter can indicate a write error (the error indicator is set) or a conversion error (errno is set to EILSEQ). All other wide character output is done as if via fputwc.
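
The two functions can be exercised together. The sketch below writes a wide string to a temporary file and reads it back; wide_roundtrip is a hypothetical name, and the use of tmpfile is just a convenient assumption for the example.

```c
#include <stdio.h>
#include <wchar.h>

/* Writes ws with fputwc, rewinds, and reads it back with fgetwc.
   Returns 1 if the characters match and are followed by end of file,
   0 on any mismatch or error, -1 if no temporary file could be opened. */
int wide_roundtrip(const wchar_t *ws)
{
    FILE *fp = tmpfile();
    const wchar_t *p;
    int ok = 1;

    if (fp == NULL)
        return -1;
    for (p = ws; *p != L'\0'; p++)      /* first call makes fp wide-oriented */
        if (fputwc(*p, fp) == WEOF)
            ok = 0;
    rewind(fp);
    for (p = ws; ok && *p != L'\0'; p++)
        if (fgetwc(fp) != (wint_t)*p)
            ok = 0;
    if (ok && fgetwc(fp) != WEOF)       /* nothing may follow the string */
        ok = 0;
    fclose(fp);
    return ok;
}
```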

Two new conversions are added to fprintf (and printf and sprintf):

%lc, which requires a wint_t argument, and %ls, which requires a wchar_t * argument. %lc is equivalent to %ls called with a two element array (the argument in the first element, and zero in the second). %ls converts the wide characters to bytes; the precision indicates the maximum number of bytes written (conversion will also stop on a zero wide character); a partial multibyte character will not be output, though complete trailing shift sequences might be.
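
For example, the precision can be used to limit the number of bytes produced from a wide string. This is a sketch; narrow_format is a hypothetical name, and the caller is assumed to supply at least max_bytes + 1 bytes of space.

```c
#include <stdio.h>
#include <wchar.h>

/* Converts at most max_bytes bytes' worth of ws into dst using %ls.
   Returns the number of bytes written (excluding the terminating zero),
   or a negative value on a conversion error. */
int narrow_format(char *dst, const wchar_t *ws, int max_bytes)
{
    return sprintf(dst, "%.*ls", max_bytes, ws);
}
```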

Three new conversions are added to fscanf (and scanf and sscanf):

%lc, %ls, and %l[; all take a pointer to wchar_t, and convert the matched multibyte input to wide characters after matching. (The qualified and unqualified conversions match the same input.)

int fwprintf (FILE *, const wchar_t *, ...);
int wprintf (const wchar_t *, ...);
int swprintf (wchar_t *, size_t, const wchar_t *, ...);
int vfwprintf (FILE *, const wchar_t *, va_list);
int vwprintf (const wchar_t *, va_list);
int vswprintf (wchar_t *, size_t, const wchar_t*, va_list);

Valid conversions are the same as for fprintf, including the extensions above. With %c, the character is converted using btowc; with %s, the string is converted to wide characters before output. With all formats, width and precision are measured in wide characters. The second argument of swprintf is the number of elements of the destination array (including the terminating zero which is always written).
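
A small sketch of swprintf with a multibyte %s argument follows; format_label is a hypothetical name, and the format itself is arbitrary.

```c
#include <stddef.h>
#include <wchar.h>

/* Formats "<id>-<name>" into dst, which has room for n wide characters
   including the terminating zero. name is a multibyte string and is
   converted to wide characters by %s. Returns the number of wide
   characters written, or a negative value on failure. */
int format_label(wchar_t *dst, size_t n, int id, const char *name)
{
    return swprintf(dst, n, L"%d-%s", id, name);
}
```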

int fwscanf (FILE *, const wchar_t *, ...);
int wscanf (const wchar_t *, ...);
int swscanf (const wchar_t *, const wchar_t *, ...);

Valid conversions are the same as for fscanf, including the extensions above. With %c, %s, and %[, the accepted input field will be converted to its multibyte equivalent after being matched. With all formats, the field width is measured in wide characters.
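
The following sketch reads a number and a word from a wide string, storing the word in multibyte form via %s. The name parse_entry and the 16-byte buffer size are assumptions of the example.

```c
#include <wchar.h>

/* Parses "<id> <name>" from a wide string. name must have room for at
   least 16 bytes; the field width of 15 limits the number of wide
   characters matched, which is enough only while each one converts to
   a single byte. Returns the number of items assigned. */
int parse_entry(const wchar_t *line, int *id, char *name)
{
    return swscanf(line, L"%d %15s", id, name);
}
```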

wchar_t *fgetws (wchar_t *, int, FILE *);
int fputws (const wchar_t *, FILE *);
wint_t getwc (FILE *);
wint_t getwchar (void);
wint_t putwc (wchar_t, FILE *);
wint_t putwchar (wchar_t);
wint_t ungetwc (wint_t, FILE *);

Equivalent to the corresponding functions in subclause 7.9.7 of the Standard (including the multiple expansion rules for getwc and putwc's FILE * argument).

int fwide (FILE *, int);

If the second argument is greater than zero, attempts to make the stream wide-oriented; if it is less than zero, attempts to make it byte-oriented. Returns the orientation of the stream after the call (<0 for byte-oriented, 0 for unoriented, >0 for wide-oriented). Once a stream has been given an orientation, it cannot be changed.
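
The behaviour can be observed on a freshly opened stream. In this sketch, orientation_is_sticky is a hypothetical name and tmpfile is used only as a convenient source of a new stream.

```c
#include <stdio.h>
#include <wchar.h>

/* Returns 1 if a new stream starts unoriented, becomes wide-oriented on
   request, and then stays wide-oriented when byte orientation is
   requested; 0 otherwise; -1 if no temporary file could be opened. */
int orientation_is_sticky(void)
{
    FILE *fp = tmpfile();
    int before, wide, after;

    if (fp == NULL)
        return -1;
    before = fwide(fp, 0);    /* query only */
    wide   = fwide(fp, 1);    /* ask for wide orientation */
    after  = fwide(fp, -1);   /* too late to change it */
    fclose(fp);
    return before == 0 && wide > 0 && after > 0;
}
```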

String handling facilities

double wcstod (const wchar_t *, wchar_t **);
long int wcstol (const wchar_t *, wchar_t **, int);
unsigned long int wcstoul (const wchar_t*, wchar_t**, int);
wchar_t *wcscpy (wchar_t *, const wchar_t *);
wchar_t *wcsncpy (wchar_t *, const wchar_t *, size_t);
wchar_t *wcscat (wchar_t *, const wchar_t *);
wchar_t *wcsncat (wchar_t *, const wchar_t *, size_t);
int wcscmp (const wchar_t *, const wchar_t *);
int wcscoll (const wchar_t *, const wchar_t *);
int wcsncmp (const wchar_t *, const wchar_t *, size_t);
size_t wcsxfrm (wchar_t *, const wchar_t *, size_t);
wchar_t *wcschr (const wchar_t *, wchar_t);
size_t wcscspn (const wchar_t *, const wchar_t *);
wchar_t *wcspbrk (const wchar_t *, const wchar_t *);
wchar_t *wcsrchr (const wchar_t *, wchar_t);
size_t wcsspn (const wchar_t *, const wchar_t *);
wchar_t *wcsstr (const wchar_t *, const wchar_t *);
size_t wcslen (const wchar_t *);
wchar_t *wmemchr (const wchar_t *, wchar_t, size_t);
int wmemcmp (const wchar_t *, const wchar_t *, size_t);
wchar_t *wmemcpy (wchar_t *, const wchar_t *, size_t);
wchar_t *wmemmove (wchar_t *, const wchar_t *, size_t);
wchar_t *wmemset (wchar_t *, wchar_t, size_t);
size_t wcsftime (wchar_t *, size_t, const wchar_t *, const struct tm *);

Equivalent to the corresponding functions in subclauses 7.10, 7.11, and 7.12 of the Standard.

wchar_t *wcstok (wchar_t*, const wchar_t*, wchar_t**);

Tokenizes a wide string in the same way as strtok, but uses the object pointed to by the third argument to keep state, rather than keeping it internally as strtok does. This change makes it possible to interleave calls to wcstok over different input strings.
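
A sketch of the calling pattern follows; count_tokens is a hypothetical name. As with strtok, the input string is modified in place.

```c
#include <stddef.h>
#include <wchar.h>

/* Counts the tokens in text that are separated by characters from
   delims. The state pointer is local, so independent tokenizations
   can be interleaved safely. */
size_t count_tokens(wchar_t *text, const wchar_t *delims)
{
    wchar_t *state;
    wchar_t *tok;
    size_t count = 0;

    for (tok = wcstok(text, delims, &state);
         tok != NULL;
         tok = wcstok(NULL, delims, &state))
        count++;
    return count;
}
```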

Conversion facilities

Most of these functions take a pointer to an mbstate_t object that they keep their conversion state in. Such an object can be set to all zeroes (e.g. by assigning to it the value of an mbstate_t object with static lifetime which has not been explicitly initialized) and is then in its initial state. When an object is in the initial state (no matter how this occurred), it is prepared for conversion in either direction (from multibyte to wide characters or vice versa) starting in the initial state. Once an object has left its initial state (which happens whenever it is used with one of the following functions unless the description says otherwise), it shall only be used in the same LC_CTYPE category [*] and same direction as the previous call, and shall not be used after a conversion error. If a null pointer is passed, each function uses its own internal object which is initialized to all zeroes at program startup.
___________________________________________________________________
[*] The mbstate_t object associated with a stream is bound to an encoding by the first fgetwc or fputwc call after the stream is opened, and can then be used with any locale.

wint_t btowc (int);

Converts the argument (treated as an unsigned char) to the corresponding wide character, if any, or else returns WEOF.

int wctob (wint_t);

If the argument wide character has a multibyte encoding in the initial shift state which is a single byte, returns that byte. Otherwise returns EOF.

int mbsinit (const mbstate_t *);

Returns nonzero if passed a null pointer, or if the mbstate_t object is in the initial state (the object is unaffected).
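
The initialization technique described at the start of this section can be combined with mbsinit as follows. This is a sketch; fresh_state_is_initial is a hypothetical name.

```c
#include <wchar.h>

/* Copies an all-zero mbstate_t (obtained from an object with static
   lifetime that was never explicitly initialized) and confirms that
   mbsinit reports the copy as being in the initial state. */
int fresh_state_is_initial(void)
{
    static mbstate_t zeroed;   /* all zeroes by virtue of static lifetime */
    mbstate_t st = zeroed;

    return mbsinit(&st) != 0;
}
```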

size_t mbrlen (const char *s, size_t n, mbstate_t *pcs);

Equivalent to mbrtowc(NULL, s, n, pcs), except that it uses its own internal mbstate_t object, not that of mbrtowc, when given a null pointer.

size_t mbrtowc (wchar_t *ws, const char *s, size_t n, mbstate_t *pcs);

Converts a multibyte character from s (inspecting no more than n bytes) to a wide character. If ws is not a null pointer, the wide character is stored in *ws. If s is a null pointer, mbrtowc ignores ws and n and acts as if the first three arguments are a null pointer, an empty string, and 1 respectively.
The return value can be:

(size_t)-2

The inspected bytes have been used to update the mbstate_t, but no complete wide character has been found.

(size_t)-1

An encoding error has occurred.

0

The inspected bytes represent a zero wide character, which is stored in the array pointed to by the first argument (if not null), and the mbstate_t object has been restored to the initial state.

other

That number of bytes represent a single wide character, which is stored in the array pointed to by the first argument (if not null), and the mbstate_t object has been updated.

Note that a return value of -2 means that a partial sequence might have been stored in the mbstate_t object; the inspected bytes do not need to be passed to the function a second time.
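
The return values can be handled in a loop like the one below, which counts the characters in a byte buffer. This is a sketch under the assumption that the buffer starts in the initial state; count_chars is a hypothetical name.

```c
#include <string.h>
#include <wchar.h>

/* Counts the complete multibyte characters in the first len bytes of s,
   stopping at a zero character. Returns (size_t)-1 on an encoding
   error. A trailing incomplete character is not counted. */
size_t count_chars(const char *s, size_t len)
{
    mbstate_t st;
    size_t count = 0;

    memset(&st, 0, sizeof st);          /* initial state */
    while (len > 0) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, s, len, &st);

        if (r == (size_t)-1)
            return (size_t)-1;          /* encoding error */
        if (r == (size_t)-2)
            break;                      /* bytes consumed, no character yet */
        if (r == 0)
            break;                      /* zero wide character */
        s += r;
        len -= r;
        count++;
    }
    return count;
}
```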

size_t wcrtomb (char *, wchar_t, mbstate_t *);

If the first argument is a null pointer, ignores the second argument and acts as if they are a pointer to an internal buffer and zero respectively. Otherwise it converts the wide character to at most MB_CUR_MAX bytes and places them in the array pointed to by the first argument; if the wide character is zero, the resulting sequence will end in the initial state, followed by a zero byte, and the mbstate_t object will be in the initial state.
wcrtomb returns the number of bytes written to the character buffer, or (size_t)-1 to indicate an encoding error (errno is set to EILSEQ).
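
A minimal sketch of converting one wide character, starting from the initial state, is shown below. The name encode_one is hypothetical; the caller is assumed to provide MB_LEN_MAX bytes, which is never less than MB_CUR_MAX.

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Converts wc to its multibyte form in out (at least MB_LEN_MAX bytes).
   Returns the number of bytes stored, or (size_t)-1 on an encoding
   error. */
size_t encode_one(wchar_t wc, char *out)
{
    mbstate_t st;

    memset(&st, 0, sizeof st);   /* initial state */
    return wcrtomb(out, wc, &st);
}
```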

size_t mbsrtowcs (wchar_t *ws, const char **ps, size_t n, mbstate_t *pcs);

Converts the multibyte character sequence pointed to by *ps to wide characters. The result is either (size_t)-1 if a conversion error occurs (in which case errno is set to EILSEQ), or else the number of multibyte characters successfully converted (that is, the number of wide characters produced).

If ws is a null pointer, processing stops at the end of the string (the terminating zero byte is not counted in the returned value), and *pcs will be set to the initial state.
If ws is not a null pointer, the resulting wide character sequence is stored in the array it points to. Conversion stops when:
- n wide characters have been stored; *pcs will be set to the conversion state after the last character converted, and *ps will point to the first unprocessed byte
- at the end of the string if this comes first (the terminating zero byte is not counted in the returned value); *pcs will be set to the initial state, *ps will be set to a null pointer, and a zero wide character will have been stored.
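
Both modes of use can be sketched with one wrapper. The name widen is hypothetical, and the conversion is assumed to start in the initial state.

```c
#include <string.h>
#include <wchar.h>

/* Converts the multibyte string src into at most n wide characters in
   dst. With dst == NULL nothing is stored and the return value is the
   length the wide string would have. Returns (size_t)-1 on an
   encoding error. */
size_t widen(wchar_t *dst, const char *src, size_t n)
{
    mbstate_t st;
    const char *p = src;   /* mbsrtowcs updates this copy, not src */

    memset(&st, 0, sizeof st);
    return mbsrtowcs(dst, &p, n, &st);
}
```

A caller can first call widen(NULL, src, 0) to size a buffer, then call it again with room for one more element for the terminating zero.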

size_t wcsrtombs (char *s, const wchar_t **pws, size_t n, mbstate_t *pcs);

Converts the wide character sequence pointed to by *pws to a multibyte character sequence. The result is either (size_t)-1 if a conversion error occurs (in which case errno is set to EILSEQ), or else the number of bytes in the resulting multibyte string. Processing of the wide string stops either when a zero wide character - indicating the end of the wide string - is reached (the resulting multibyte string will end with a zero byte which is not included in the returned result), or (if s is not a null pointer) when it is not possible to process another wide character without placing more than n bytes into the array pointed to by s. In the first case, *pcs will be left in the initial state. If s is a null pointer, the value of n is ignored. Otherwise *pws will be set to either a null pointer (if conversion stopped on a zero wide character) or a pointer to the first unprocessed wide character. In the latter case, the returned value will be at least (n-MB_CUR_MAX+1).
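
The reverse direction follows the same pattern. In this sketch, narrow is a hypothetical name and the conversion is assumed to start in the initial state.

```c
#include <string.h>
#include <wchar.h>

/* Converts the wide string src into at most n bytes in dst. With
   dst == NULL nothing is stored and the return value is the number of
   bytes required, excluding the terminating zero byte. Returns
   (size_t)-1 on an encoding error. */
size_t narrow(char *dst, const wchar_t *src, size_t n)
{
    mbstate_t st;
    const wchar_t *p = src;   /* wcsrtombs updates this copy, not src */

    memset(&st, 0, sizeof st);
    return wcsrtombs(dst, &p, n, &st);
}
```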

Future directions

<wctype.h> reserves function names beginning with is or to followed by a lowercase letter. <wchar.h> reserves function names beginning with wcs followed by a lowercase letter. Lowercase letters are reserved as conversion specifiers for fwprintf and fwscanf.