Stream: Mirror: Isabelle Development Mailing List

Topic: [isabelle-dev] Script characters in words


view this post on Zulip Email Gateway (Mar 06 2024 at 12:20):

From: Lawrence Paulson <lp15@cam.ac.uk>
I have noticed something unusual, which is true both on Isabelle2023 and in the current version: most script letters seem to be regarded as word characters (as we would expect), but not all of them.

To see this, it is enough to type

thm ๐’œ_def โ„ฌ_def ๐’ž_def ๐’Ÿ_def โ„ฐ_def โ„ฑ_def โ„›_def ๐’ฎ_def

and double click on the letter f. It's pretty hard to explain why A should be treated differently from B here, or R from S.

Larry


isabelle-dev mailing list
isabelle-dev@in.tum.de
https://mailman46.in.tum.de/mailman/listinfo/isabelle-dev

view this post on Zulip Email Gateway (Mar 06 2024 at 13:51):

From: Makarius <makarius@sketis.net>
You have rediscovered a long-standing problem of "Unicode". It started out as
a universal and uniform 16-bit character sets many decades ago (UCS-2), but is
now a conglomerate of many add-ons and adhoc complexity, in order to
approximate text representation for every language on the planet (including
mathematics).

The problem here is that the script "๐’œ" was introduced much later than "โ„ฌ":
thus it requires 2 16-bit characters instead of just 1. Many Java (or
JavaScript) programs will usually get it wrong, because they work with type
Char ("character") instead of Int (codepoint).

This complexity (and failure) of Unicode is avoided in Isabelle by using a
more stable and uniform notion of "Isabelle symbols". The Isabelle/jEdit
front-end uses "poor-man's rendering of Isabelle symbols in Unicode". Thus the
underlying jEdit text editor may get things wrong occasionally: it would be
better to work natively with Isabelle symbols everywhere.

Here are some snippets to illustrate the situation:

(Isabelle/ML symbols: good)
ML โ€นSymbol.explode "๐’œ_def โ„ฌ_def" |> map (fn s => (s, Symbol.is_letter s))โ€บ

(Isabelle/Scala symbols or UTF-16 codepoints: good)
Symbol.explode("๐’œ_def โ„ฌ_def").map(s => (s, Symbol.is_letter(s)))

(Java UTF-16 codepoints: good)
Codepoint.iterator("๐’œ_def โ„ฌ_def").toList.map(c => (c, Character.isLetter(c)))

(Java UCS-2 characters: bad)
"๐’œ_def โ„ฌ_def".toList.map(c => (c, Character.isLetter(c)))

The first one is for the Isabelle/jEdit document buffer, within a formal
theory context. The other ones are for the Console/Scala REPL: here you may
notice that the Java GUI component cannot display the "๐’œ" anymore, after it
got chopped up into 2 characters.

I am already used to post tickets to the average Java project, in order to
make such fine-points more correct. But for jEdit we have a certain status-quo
that won't move much.

Makarius


isabelle-dev mailing list
isabelle-dev@in.tum.de
https://mailman46.in.tum.de/mailman/listinfo/isabelle-dev

view this post on Zulip Email Gateway (Mar 06 2024 at 14:56):

From: Makarius <makarius@sketis.net>
On 06/03/2024 15:01, Lawrence Paulson wrote:

Thanks for the explanation. I guess you are saying this is outside of our
control.

It would mean to submit patches to the jEdit project. I did this for really
important things in the past: it does work, but can take a long time.

You do have to wonder how they did not introduce the entire script
alphabet at the same time.

The Unicode committee has always been a bit of a mess. I guess the problem is
that US-centric mentality has tried to rule over the world's languages, with
limited success.

In recent years, related to the STIX project, much of the mathematical
repertoire from TeX/LaTeX has been added in a post-hoc fashion.

Makarius


isabelle-dev mailing list
isabelle-dev@in.tum.de
https://mailman46.in.tum.de/mailman/listinfo/isabelle-dev

view this post on Zulip Email Gateway (Mar 06 2024 at 15:36):

From: Tobias Nipkow <nipkow@in.tum.de>
On 06/03/2024 15:56, Makarius wrote:

On 06/03/2024 15:01, Lawrence Paulson wrote:

Thanks for the explanation. I guess you are saying this is outside of our
control.

It would mean to submit patches to the jEdit project. I did this for really
important things in the past: it does work, but can take a long time.

You do have to wonder how they did not introduce the entire script alphabet at
the same time.

The Unicode committee has always been a bit of a mess. I guess the problem is
that US-centric mentality has tried to rule over the world's languages, with
limited success.

Introducing script "๐’œ" much later than "โ„ฌ" is a typical US-centric ploy.

Tobias

In recent years, related to the STIX project, much of the mathematical
repertoire from TeX/LaTeX has been added in a post-hoc fashion.

Makarius


isabelle-dev mailing list
isabelle-dev@in.tum.de
https://mailman46.in.tum.de/mailman/listinfo/isabelle-dev

smime.p7s

view this post on Zulip Email Gateway (Mar 06 2024 at 16:41):

From: Makarius <makarius@sketis.net>
Yes. If you want to understand Unicode history, see this detailed exposition:
http://utf8everywhere.org

I was in particular thinking of the original work by Joseph D. Becker that is
cited early on that webpage. It reads a bit like "16bit should be sufficient"
for everything". And then they had to omit many mathematical script
characters, just because they were not widely used at that time.

Makarius


isabelle-dev mailing list
isabelle-dev@in.tum.de
https://mailman46.in.tum.de/mailman/listinfo/isabelle-dev


Last updated: Dec 21 2024 at 16:20 UTC