Re: UTF-8, OSTA-UDF [why?], Unicode, and miscellaneous gibberish

Alex Belits (abelits@phobos.illtel.denver.co.us)
Tue, 26 Aug 1997 01:38:20 -0700 (PDT)


On 26 Aug 1997, H. Peter Anvin wrote:

> Actually, UTF-8 is open-ended; it is only defined to 2^31 at this
> point; depending on how you extend it it could be expanded
> indefinitely.
>
> We already have:
>
> 0xxxxxxx for up to 7 bits
> 110xxxxx 10xxxxxx for up to 11 bits
> 1110xxxx (2 * 10xxxxxx) for up to 16 bits
> 11110xxx (3 * 10xxxxxx) for up to 21 bits
> 111110xx (4 * 10xxxxxx) for up to 26 bits
> 1111110x (5 * 10xxxxxx) for up to 31 bits
>
> ... we can then define ...
>
> 11111110 (6 * 10xxxxxx) for up to 36 bits
> 11111111 100xxxxx (7 * 10xxxxxx) for up to 41 bits
> 11111111 1010xxxx (8 * 10xxxxxx) for up to 46 bits
> 11111111 10110xxx (9 * 10xxxxxx) for up to 51 bits
> 11111111 101110xx (10 * 10xxxxxx) for up to 56 bits
> 11111111 1011110x (11 * 10xxxxxx) for up to 61 bits
> 11111111 10111110 (12 * 10xxxxxx) for up to 66 bits
> 11111111 10111111 100xxxxx (13 * 10xxxxxx) for up to 71 bits
>
> ... etc ...

...and no doubt we could use even more. Or use the same scheme to, say,
replace fixed-size integers and floats (portability! no problems with
64-bit processors!). The question is why anyone would need such a slow
monster (with integers the overhead is just more noticeable). A lot of
things can be done, but that doesn't mean they are harmless.
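
To make the overhead concrete, here is a minimal sketch in Python of the
six patterns quoted above (up to 31 bits) applied to plain integers. It is
an illustration only, not code from this thread: the names encode() and
decode() are made up, and it does not reject overlong forms the way a real
UTF-8 decoder should.

```python
def encode(value):
    """Encode 0 <= value < 2**31 using the six quoted byte patterns."""
    if value < 0 or value >= 1 << 31:
        raise ValueError("the six quoted patterns cover only 0 .. 2**31 - 1")
    if value < 0x80:
        return bytes([value])          # 0xxxxxxx, up to 7 bits
    # An n-byte sequence (n = 2..6) carries 5*n + 1 payload bits:
    # 11, 16, 21, 26 and 31 bits, as in the quoted table.
    n = 2
    while value >= 1 << (5 * n + 1):
        n += 1
    out = bytearray()
    for _ in range(n - 1):             # continuation bytes, low bits first
        out.append(0x80 | (value & 0x3F))
        value >>= 6
    lead = (0xFF << (8 - n)) & 0xFF    # n leading 1 bits, then a 0 bit
    out.append(lead | value)
    out.reverse()
    return bytes(out)


def decode(data):
    """Decode one complete sequence produced by encode()."""
    first = data[0]
    if first < 0x80:
        if len(data) != 1:
            raise ValueError("trailing bytes after a one-byte sequence")
        return first
    n = 0                              # count the leading 1 bits
    while first & (0x80 >> n):
        n += 1
    if n < 2 or n > 6 or len(data) != n:
        raise ValueError("bad lead byte or wrong sequence length")
    value = first & (0x7F >> n)        # payload bits of the lead byte
    for byte in data[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError("expected a 10xxxxxx continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value
```

With this sketch, encode(0x7FFFFFFF) is six bytes long where a fixed-size
integer takes four, and every read goes through the loop and the masks in
decode() instead of a single load.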

--
Alex