Re: [Patch] Support UTF-8 scripts

From: Bernd Petrovitsch
Date: Sun Sep 18 2005 - 09:53:48 EST


On Sun, 2005-09-18 at 09:23 +0200, "Martin v. LÃwis" wrote:
[...]
> >>Hmm. What does that have to do with the patch I'm proposing? This
> >>patch does *not* interfere with all text files. It is only relevant
> >>for executable files starting with the #! magic.
> >
> > It *does* interfere since scripts are also text files in every aspect.
> > So every feature you want for "scripts" you also get for text files (and
> > vice versa BTW).
>
> The specific feature I get is that when I pass a file starting
> with <utf8sig>#! to execve, Linux will execute the file following
> the #!. In what way do I get this feature for text in general?
> And if I do, why is that a problem?

After applying this patch it seems that "Linux" is supporting this
marker officially in general - especially if the kernel supports it. I
suppose the next kernel patch is to support Win-like CR-LF sequences
(which is not the case AFAIK).
BTW even some standards body thinks that this is the way to go, it
raises more problems and questions than resolves anything.

> > If you think "script" and "text file" are different, define both of
> > them, please, otherwise a discussion is pointless.
>
> A script file (in the context of this discussion) is a text file
> that is executable (i.e. has the appropriate subset of
> S_IXUSR|S_IXGRP|S_IXOTH set), starts with #!, and has the path
> name of an executable file after the #!.
>
> More generally, a script file is a text file written in a scripting
> language. A scripting language is a programming language which
> supports "direct" execution of source code. So in the more
> general definition, a script file does not need to start with
> #!; for the context of this discussion, we should restrict
> attention to files actually affected by the patch.

And though scripts are usually edited/changed/"parsed"/... with an text
editor, it is not always the case. Therefore the automatic extension to
*all text files* (especially as the marker basically applies to all text
files, not only scripts).
You want to focus just on your patch and ignore the directly implied
potential problems arising ...

[...]
> > It *may* break just because of some to-be-ignored inline marking due to
> > some questionable feature.
>
> Be more specific. For what specific kind of file will cat(1) break?

`cat` as such will not break (as such).

> Unless cat(1) has a 2GB limitation, I very much doubt it will break
> (i.e. fail to do its job, "concatenate files and print on the standard
> output") for any kind of input - whether this is text files, binary
> files, images, sound files, HTML files. cat always does what it is
> designed to do.

Apparently I have to repeat: If you do `cat a.txt b.txt >c.txt` where
a.txt and b.txt have this marker, then c.txt have the marker of b.txt
somewhere in the middle. Does this make sense in anyway?
How do I get rid of the marker in the middle transparently?

> > Let alone the confusion why the size of a file with `ls -l` is different
> > from the size in the editor or a marker-aware `wc -c`.
>
> This is true for any UTF-8 file, or any multibyte encoding. For any
> multibyte encoding, the number of bytes in the file is different from
> the number of characters. That doesn't (and shouldn't) stop people from
> using multi-byte encodings.

It is different even if a pure ASCII file is marked as UTF-8.
And sure, the problem exists in general with multi-byte encodings.

> What the editor displays as the number of "things" is up to its own.
> The output of wc -c will always be the same as the one of ls -l,
> as wc -c does *not* give you characters:
>
> -c, --bytes
> print the byte counts
>
> You might have been thinking of 'wc -m'.

It depends on the definition of "character". There are other standards
which define "character" as "byte".

[...]
> > Then write a short python script (with a "#!/usr/bin/python" line at the
> > start [without parameters]) natively on a Win*-system, copy it binary
> > over to an arbitrary Linux system and see what's happening.
>
> It depends on the editor I use, of course: the kernel will consider any

No, more on the OS the editor runs on.

> CR after the n as part of the interpreter name. Not sure what this has

ACK.

> to do with the specific patch, though.

It is not supported by the kernel. So either you remove it or you make
some compatibility hack (like an appropriate sym-link, etc.). Since the
kernel can start java classes directly, you can probably make a similar
thing for the UTF-8 stuff.

Bernd
--
Firmix Software GmbH http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
Embedded Linux Development and Services



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/