Stream: Archive Mirror: Isabelle Users Mailing List

Topic: [isabelle] Potential Spam: Isabelle/PIDE as IDE for Stand...


view this post on Zulip Email Gateway (Aug 19 2022 at 16:40):

From: Michael Norrish <michael.norrish@nicta.com.au>
Some minor criticisms: the Isabelle/ML handling of string constants is
contaminating SML in this mode.

The lexeme "\<forall>" is not a valid string literal in SML (Poly, mlton and
Moscow ML, certainly agree with me), but it's fine in a file being interpreted
with SML_file.

The SML_file mode also rejects "\\<" as a string literal, which is incorrect.

Similarly, the string literal "∀" (UTF-8 encoding) is interpreted as if it were
"\<forall>". It's perhaps arguable what SML really requires of "∀": mlton and
Poly/ML both see it as an error; Moscow ML accepts it as a string of size 3. If
mlton and Poly/ML were to accept it, I expect they would follow Moscow ML and
evaluate

"∀" = "\226\136\128"

to true. The SML_file facility does not.

Worse, if I load a file with a genuine occurrence of the UTF-8 encoded "∀", then
the IDE will silently write it out to disk as "\<forall>" when the file is
saved. This even happens if the ∀ occurs bare in a comment. I find this
disturbing.

More importantly, only some special characters get this treatment. If I write

String.explode "大学";

the IDE (at least on my machine) just displays black boxes in the edit window,
but does decompose the sequence into the UTF-8 encoding as chars:

val it = [#"\229", #"\164", #"\167", #"\229", #"\173", #"\166"]: char list

So, not only is the SML mode perhaps deviating from the standard by accepting
string literals like "∀" and "大学", but it's doing so in a confusing way: some
string literals are silently turned into \<....> forms, and others are not.

Note that this is also a minor annoyance in Isabelle proper. Users can't access
files called ∀.c by passing that name as written to an Isabelle/ML routine.
Instead they have to write \226\136\128.c

Michael
signature.asc

view this post on Zulip Email Gateway (Aug 19 2022 at 16:41):

From: Makarius <makarius@sketis.net>
On Tue, 18 Nov 2014, Michael Norrish wrote:

The lexeme "\<forall>" is not a valid string literal in SML (Poly, mlton and
Moscow ML, certainly agree with me), but it's fine in a file being interpreted
with SML_file.

That is not a "lexeme", but an "Isabelle symbol". This concept is below
the lexical syntax, like \uXXXX in Java. It is documented in the
"implementation" and "isar-ref" manual.

The deeper question here is how plain 7 bit ASCII -- the only universal
text representation standard that actually works -- can be extended to do
a bit more, without breaking too many things. Unicode with its many
uncertainties does break a lot. The UTF-8 encoding could be called
half-decent today, but the JVM uses the older UTF-16, which causes many
headaches.

The SML_file mode also rejects "\\<" as a string literal, which is incorrect.

Formally that is a "malformed symbol". One could try harder to accept
that. The question is if this is of practical relevance. Are there
concrete SML programs where this happens?

We don't have millions of SML projects around. So it might be simpler to
change these, e.g. by using "\\" ^ "<".

In Isabelle/PIDE, the support for "auxiliary files" is limited to
something that conforms to Isabelle text represented with this special
symbol notation. So far I have never seen practical problems, by mere
luck. If there are concrete counter-examples, we can think about concrete
counter-measures to avoid problems.

Similarly, the string literal "∀" (UTF-8 encoding) is interpreted as if
it were "\<forall>".

That is a additional aspect. The PIDE default encoding is UTF-8-Isabelle,
which identifies certain Unicode points with certain Isabelle symbols, as
specified in the cumulative etc/symbols files of the Isabelle system
installation.

Encoding for Unicode introduce inherent unreliably. In practice one
normally commits to just one encoding. For Isabelle, the above does the
job without too many people ever noticing the point.

For pure SML, one could try harder to conform to strict UTF-8, but again
the question is practical relevance. Traditionally, SML never really
supported Unicode anyway, so it cannot be relied on in existing programs.

It's perhaps arguable what SML really requires of "∀": mlton and
Poly/ML both see it as an error; Moscow ML accepts it as a string of size 3. If
mlton and Poly/ML were to accept it, I expect they would follow Moscow ML and
evaluate

"∀" = "\226\136\128"

to true.

Such things are generally dangerous in Unicode. Even with a well-defined
encoding, so-called "combining sequences" introduce some uncertainty about
the actual text content. Do you mean before or after normalization?

Worse, if I load a file with a genuine occurrence of the UTF-8 encoded
"∀", then the IDE will silently write it out to disk as "\<forall>"
when the file is saved. This even happens if the ∀ occurs bare in a
comment. I find this disturbing.

It is quite well documented in the Isabelle/PIDE documentation and
literature. You could disable these conversions by removing all
etc/symbols files. For Isabelle that would be very impractical. For SML
it could be done, if there were a real need for it.

Note that this is also a minor annoyance in Isabelle proper. Users can't access
files called ∀.c by passing that name as written to an Isabelle/ML routine.
Instead they have to write \226\136\128.c

Does that have any practical relevance, or is it just a synthetic example
to break the system?

Unicode file-names are a sure way to ask for trouble. In conjunction with
the "portable" JVM things are particularly fragile. I have seen routine
problems with Far Eastern file names on Windows and sometimes Linux.
There are also problems with continental European diacritics on Windows.

I don't think that Unicode will ever get there to deliver something that
just works everywhere and for everyone. We can't just subscribe to a
standard that is not a standard at all, but many of them with ongoing
changes and no clear perspective.

Makarius


http://stop-ttip.org 909,790 people so far


view this post on Zulip Email Gateway (Aug 19 2022 at 16:42):

From: Michael Norrish <michael.norrish@nicta.com.au>
On 19/11/14 03:28, Makarius wrote:

On Tue, 18 Nov 2014, Michael Norrish wrote:

The lexeme "\<forall>" is not a valid string literal in SML (Poly, mlton and
Moscow ML, certainly agree with me), but it's fine in a file being interpreted
with SML_file.

That is not a "lexeme", but an "Isabelle symbol". This concept is below the
lexical syntax, like \uXXXX in Java. It is documented in the "implementation"
and "isar-ref" manual.

I appreciate that all of these notions make sense in the context of Isabelle.
All of my criticisms are of the IDE in the context of developing SML rather than
Isabelle/ML.

The SML_file mode also rejects "\\<" as a string literal, which is incorrect.

Formally that is a "malformed symbol". One could try harder to accept that.
The question is if this is of practical relevance. Are there concrete SML
programs where this happens?

Indeed, any SML program attempting to emit Isabelle symbols in a way that is
valid SML may want to contain string literals like "\\<forall>".

We don't have millions of SML projects around. So it might be simpler to change
these, e.g. by using "\\" ^ "<".

Of course there are workarounds. I just hesitate to recommend a tool for
general SML usage that doesn't quite implement SML.

Unicode file-names are a sure way to ask for trouble. In conjunction with the
"portable" JVM things are particularly fragile. I have seen routine problems
with Far Eastern file names on Windows and sometimes Linux. There are also
problems with continental European diacritics on Windows.

I adjusted the seL4 C parser precisely to cope with Chinese in filenames (a
user-reported problem). So, yes, there was a problem with a Unicode filename,
but luckily, I could create a system that worked. The files were part of a C
project that clearly did work in that context. If the C infrastructure could
cope with Unicode filenames, I felt I should try too.

Michael
signature.asc

view this post on Zulip Email Gateway (Aug 19 2022 at 16:42):

From: Makarius <makarius@sketis.net>
On Wed, 19 Nov 2014, Michael Norrish wrote:

On 19/11/14 03:28, Makarius wrote:

That is not a "lexeme", but an "Isabelle symbol". This concept is below the
lexical syntax, like \uXXXX in Java. It is documented in the "implementation"
and "isar-ref" manual.

The SML_file mode also rejects "\\<" as a string literal, which is incorrect.

Formally that is a "malformed symbol". One could try harder to accept that.
The question is if this is of practical relevance. Are there concrete SML
programs where this happens?

Indeed, any SML program attempting to emit Isabelle symbols in a way that is
valid SML may want to contain string literals like "\\<forall>".

That is an application of generating sources, which could be easily
changed to emit "\027\027<forall>" instead.

BTW, "malformed symbol" above only refers to \< not a proper \<forall>,
but "\\<forall>" is also rejected due to the way how sources are passed
through the system: text is always treated at Isabelle symbol boundaries.

This is the reason why the Prover IDE works with symbols at all. The UTF-8
that is occasionally seen in Coq sources does not work in that respect.

In any case, I would like to see real applications and real problems.
Then one can find ways to make it work. In contrast, the general problem
of adding Unicode support everywhere and uniformly cannot be solved.

Unicode file-names are a sure way to ask for trouble. In conjunction with the
"portable" JVM things are particularly fragile. I have seen routine problems
with Far Eastern file names on Windows and sometimes Linux. There are also
problems with continental European diacritics on Windows.

I adjusted the seL4 C parser precisely to cope with Chinese in filenames (a
user-reported problem). So, yes, there was a problem with a Unicode filename,
but luckily, I could create a system that worked.

Such things can be made to suck less in a case-by-case basis. I am doing
it myself whenever it is feasible, but the general problem remains. E.g.
when you drag-and-drop files on some application, it depends on many
side-conditions of the operating system desktop what really happens.

Makarius


http://stop-ttip.org 912,049 people so far



Last updated: Nov 21 2024 at 12:39 UTC