Re: [Patch] Support UTF-8 scripts

From: "Martin v. Löwis"
Date: Sun Sep 18 2005 - 23:55:31 EST


Bernd Petrovitsch wrote:
>>The specific feature I get is that when I pass a file starting
>>with <utf8sig>#! to execve, Linux will execute the file following
>>the #!. In what way do I get this feature for text in general?
>>And if I do, why is that a problem?
>
>
> After applying this patch it seems that "Linux" is supporting this
> marker officially in general - especially if the kernel supports it.

What makes it seem so? That binfmt_script supports a certain convention
doesn't mean that all other programs also somehow need to support that
convention - and certainly not in the same way.
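
To make the convention concrete, here is a minimal sketch in Python. It is
not the kernel patch itself (binfmt_script is C), and the helper name
script_interpreter_line is made up for illustration. It only shows the idea:
skip an optional UTF-8 signature, then look for "#!".

```python
UTF8_SIG = b"\xef\xbb\xbf"  # UTF-8 encoding of U+FEFF


def script_interpreter_line(head: bytes):
    """Return the text following '#!' on the first line, or None.

    Hypothetical helper: it accepts an optional UTF-8 signature
    in front of '#!', which is the convention discussed above.
    """
    if head.startswith(UTF8_SIG):
        head = head[len(UTF8_SIG):]      # skip the signature
    if not head.startswith(b"#!"):
        return None                      # not a script by this convention
    first_line = head[2:].split(b"\n", 1)[0]
    return first_line.strip().decode("utf-8")
```

Nothing in this check says anything about how other programs should treat
the signature; it only affects the decision of what to execute.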

> I suppose the next kernel patch is to support Win-like CR-LF sequences
> (which is not the case AFAIK).

What makes you suppose that? I have no plans to submit such a patch.

> And though scripts are usually edited/changed/"parsed"/... with a text
> editor, it is not always the case. Therefore the automatic extension to
> *all text files* (especially as the marker basically applies to all text
> files, not only scripts).
> You want to focus just on your patch and ignore the directly implied
> potential problems arising ...

Because there are no problems arising. If somebody later submits
a patch to cat(1) to strip off UTF-8 signatures, you can complain
*then* that this is the wrong thing to do, because it violates the
specification of cat.

This reasoning is just flawed: it is like saying to a web browser
developer: "don't _support_ XHTML, because there are so many tools
which use HTML 4".

> Apparently I have to repeat: If you do `cat a.txt b.txt >c.txt` where
> a.txt and b.txt have this marker, then c.txt has the marker of b.txt
> somewhere in the middle. Does this make sense in any way?

Indeed, it does. There is nothing inherently wrong with having
the marker in the middle.
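
A small sketch of what the concatenation produces, using in-memory byte
strings as stand-ins for a.txt and b.txt (the contents are made up). The
result is still valid UTF-8; the second marker simply decodes as the
character U+FEFF.

```python
import codecs

# Stand-ins for a.txt and b.txt, each starting with the UTF-8 signature.
a = codecs.BOM_UTF8 + "first\n".encode("utf-8")
b = codecs.BOM_UTF8 + "second\n".encode("utf-8")

# What `cat a.txt b.txt >c.txt` writes: the bytes of a, then the bytes of b.
c = a + b

# The plain "utf-8" codec decodes this without error and keeps both
# markers as U+FEFF characters.
text = c.decode("utf-8")
marker_positions = [i for i, ch in enumerate(text) if ch == "\ufeff"]
```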

> How do I get rid of the marker in the middle transparently?

http://www.unicode.org/faq/utf_bom.html#38
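
If a tool does want to drop the marker, one possible approach (a sketch
only, with hypothetical helper names, and not something cat itself should
do) is to strip the signature from each input before joining them:

```python
import codecs


def strip_signature(data: bytes) -> bytes:
    """Remove a leading UTF-8 signature, if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def concat_without_signatures(parts):
    """Join byte strings after removing each one's leading signature."""
    return b"".join(strip_signature(p) for p in parts)
```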

>>What the editor displays as the number of "things" is up to its own.
>>The output of wc -c will always be the same as the one of ls -l,
>>as wc -c does *not* give you characters:
>>
>>   -c, --bytes
>>       print the byte counts
>>
>>You might have been thinking of 'wc -m'.
>
>
> It depends on the definition of "character". There are other standards
> which define "character" as "byte".

Certainly. However, you specifically talked about 'wc -c', and, in
wc(1), at least in the implementation commonly used on Linux, characters
and bytes are not the same.
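
The difference is easy to see with any non-ASCII text. In this sketch the
byte count corresponds to what 'wc -c' reports and the character count to
what 'wc -m' reports in a UTF-8 locale; the sample string is arbitrary.

```python
# "ö" takes two bytes in UTF-8, every other character here takes one.
data = "Löwis\n".encode("utf-8")

byte_count = len(data)                  # bytes, as counted by wc -c
char_count = len(data.decode("utf-8"))  # characters, as counted by wc -m
```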

>>It depends on the editor I use, of course
>
>
> No, more on the OS the editor runs on.

You talked about Windows specifically. On Windows, most editors let you
choose the line ending, and will preserve whatever line ending they
find when adding new lines to a file.
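
Preserving the existing convention takes very little logic. The following
is a rough sketch of the idea, not the behaviour of any particular editor;
both function names are invented for the example.

```python
def detect_line_ending(data: bytes) -> bytes:
    """Guess the file's line ending: CRLF if it is more common than bare LF."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # LF not preceded by CR
    return b"\r\n" if crlf > lf else b"\n"


def append_line(data: bytes, line: bytes) -> bytes:
    """Append a line using whatever line ending the file already uses."""
    eol = detect_line_ending(data)
    if data and not data.endswith(b"\n"):
        data += eol                 # terminate an unterminated last line
    return data + line + eol
```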

Regards,
Martin