Re: [Patch] Support UTF-8 scripts

From: Bodo Eggert
Date: Sun Sep 18 2005 - 14:24:26 EST


Bernd Petrovitsch <bernd@xxxxxxxxx> wrote:
> On Sun, 2005-09-18 at 09:23 +0200, "Martin v. Löwis" wrote:
> [...]
>> >>Hmm. What does that have to do with the patch I'm proposing? This
>> >>patch does *not* interfere with all text files. It is only relevant
>> >>for executable files starting with the #! magic.
>> >
>> > It *does* interfere since scripts are also text files in every aspect.
>> > So every feature you want for "scripts" you also get for text files (and
>> > vice versa BTW).
>>
>> The specific feature I get is that when I pass a file starting
>> with <utf8sig>#! to execve, Linux will execute the file following
>> the #!. In what way do I get this feature for text in general?
>> And if I do, why is that a problem?
>
> After applying this patch it seems that "Linux" is supporting this
> marker officially in general - especially if the kernel supports it.

It will be the first POSIX kernel to correctly support utf-8 scripts.
It's 2005, and according to other(?) posters, this should be standard.

> I
> suppose the next kernel patch is to support Win-like CR-LF sequences
> (which is not the case AFAIK).

Maybe it should, maybe it shouldn't. If I used MAC or DOS, I'd be sure it
should.-)

> BTW even some standards body thinks that this is the way to go,

Not surprisingly the Unicode Consortium is one of them.

> it
> raises more problems and questions than resolves anything.

The problem of ow to handle BOM is solved by reading the standard.

> And though scripts are usually edited/changed/"parsed"/... with an text
> editor, it is not always the case. Therefore the automatic extension to
> *all text files* (especially as the marker basically applies to all text
> files, not only scripts).
> You want to focus just on your patch and ignore the directly implied
> potential problems arising ...

There is no problem arising from the patch, it solves one.
To solve the rest, use recode.

[...]
> Apparently I have to repeat: If you do `cat a.txt b.txt >c.txt` where
> a.txt and b.txt have this marker, then c.txt have the marker of b.txt
> somewhere in the middle. Does this make sense in anyway?
> How do I get rid of the marker in the middle transparently?

The unicode standard defines how to handle them.

>> > Let alone the confusion why the size of a file with `ls -l` is different
>> > from the size in the editor or a marker-aware `wc -c`.
>>
>> This is true for any UTF-8 file, or any multibyte encoding. For any
>> multibyte encoding, the number of bytes in the file is different from
>> the number of characters. That doesn't (and shouldn't) stop people from
>> using multi-byte encodings.
>
> It is different even if a pure ASCII file is marked as UTF-8.

No pure ASCII file will be marked, since a marked file will be no
ASCII file.

> And sure, the problem exists in general with multi-byte encodings.

ACK, but that's not a kernel problem nor a specific unicode problem.
Fix it by making China, Greece an Japan convert to ASCII and by making
all mathematicans stop using strange characters. All other users will
follow.

>> What the editor displays as the number of "things" is up to its own.
>> The output of wc -c will always be the same as the one of ls -l,
>> as wc -c does *not* give you characters:
>>
>> -c, --bytes
>> print the byte counts
>>
>> You might have been thinking of 'wc -m'.
>
> It depends on the definition of "character". There are other standards
> which define "character" as "byte".

There are architectures defining a byte to be 32 bit.
They are irrelevant, too.

[...]
>> Not sure what this has
>> to do with the specific patch, though.
>
> It is not supported by the kernel. So either you remove it or you make
> some compatibility hack (like an appropriate sym-link

-EDOESNOTWORK

#!/usr/bin/perl -T -s -w

>, etc.). Since the
> kernel can start java classes directly, you can probably make a similar
> thing for the UTF-8 stuff.

If MSDOS text files are text files are legal scripts, the kernel
should recognize [\x0D\x0A] as valid line breaks.

(The real reason would be unicode allowing NEL to be encoded as 0x0D
or 0x0A.)

This compile-tested patch adds 32 bytes to binfmt_script:

--- ./fs/binfmt_script.c.old 2005-09-18 20:28:32.000000000 +0200
+++ ./fs/binfmt_script.c 2005-09-18 20:29:44.000000000 +0200
@@ -18,7 +18,7 @@

static int load_script(struct linux_binprm *bprm,struct pt_regs *regs)
{
- char *cp, *i_name, *i_arg;
+ char *cp, *cp2, *i_name, *i_arg;
struct file *file;
char interp[BINPRM_BUF_SIZE];
int retval;
@@ -47,6 +47,9 @@ static int load_script(struct linux_binp
bprm->buf[BINPRM_BUF_SIZE - 1] = '\0';
if ((cp = strchr(bprm->buf, '\n')) == NULL)
cp = bprm->buf+BINPRM_BUF_SIZE-1;
+ if ((cp2 = strchr(bprm->buf, '\x0D')) != NULL
+ && cp2 < cp)
+ cp = cp2;
*cp = '\0';
while (cp > bprm->buf) {
cp--;
--
Ich danke GMX dafür, die Verwendung meiner Adressen mittels per SPF
verbreiteten Lügen zu sabotieren.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/