Re: [Patch] Support UTF-8 scripts

From: "Martin v. Löwis"
Date: Mon Sep 19 2005 - 16:41:06 EST


Bernd Petrovitsch wrote:
>>>It depends on the definition of "character". There are other standards
>>>which define "character" as "byte".
>>
>>Certainly. However, you specifically talked about 'wc -c', and, in
>>wc(1), at least in the implementation commonly used on Linux, characters
>>and bytes are not the same.
>
>
> Yes, now that multi-byte character sets are becoming more commonly used.
> However, I don't think you get this into the C standard. But we are now
> far off the discussion ....

We are indeed, so just one final clarification. wc(1) is not part
of the C standard - ISO 9899 does not talk about command line utilities
at all. The relevant standard is POSIX; IEEE Std 1003.1, 2004 Edition
says, in

http://www.opengroup.org/onlinepubs/009695399/utilities/wc.html

-c
    Write to the standard output the number of bytes in each input file.
[...]
-m
    Write to the standard output the number of characters in each input
    file.

[...]
RATIONALE
[...]
The -c option stands for "character" count, even though it counts bytes.
This stems from the sometimes erroneous historical view that bytes and
characters are the same size. Due to international requirements, the -m
option (reminiscent of "multi-byte") was added to obtain actual
character counts.
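
To make the distinction concrete, here is a minimal Python sketch (not part
of the standard text). It assumes UTF-8 input; the names count_bytes and
count_chars are made up for this illustration, and counting decoded code
points only approximates what wc -m reports in a UTF-8 locale.

```python
def count_bytes(data: bytes) -> int:
    """Rough analogue of `wc -c`: the number of bytes in the input."""
    return len(data)


def count_chars(data: bytes, encoding: str = "utf-8") -> int:
    """Rough analogue of `wc -m`: decode first, then count code points.

    Assumption: the input is valid in the given encoding; invalid input
    raises UnicodeDecodeError here, whereas wc has its own handling.
    """
    return len(data.decode(encoding))


# "ö" occupies two bytes in UTF-8, so the two counts differ by one.
sample = "Löwis\n".encode("utf-8")
print(count_bytes(sample))  # 7
print(count_chars(sample))  # 6
```

For pure ASCII input both functions return the same number, which is
where the historical view that bytes and characters are the same size
comes from.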

Regards,
Martin