[isabelle] Pattern matching on literal characters · Archive Mirror: Isabelle Users Mailing List

Stream: Archive Mirror: Isabelle Users Mailing List

Topic: [isabelle] Pattern matching on literal characters

Email Gateway (Aug 22 2022 at 16:21):

From: Andreas Lochbihler <andreas.lochbihler@inf.ethz.ch>
Dear all and Florian in particular,

While update some old Isabelle files, I noticed that fun no longer support pattern
matching on characters. It used to work in Isabelle2016, but does not in Isabelle2016-1
and Isabelle2017. Here is an small example:

fun test :: "char ⇒ nat" where
"test (CHR ''a'') = 0"
| "test (CHR ''b'') = 1"
| "test _ = 2"

Nowadays, fun complains about non-constructor patterns. Unfortunately, the NEWS file entry
"Characters (type char) are modelled as finite algebraic type corresponding to {0..255}."
does not say how this is supposed to be done nowadays. I can of course rewrite my
functions as a longish cascade of ifs, but that's not really nice and I don't get any
useful case analysis and simplification rules. Is there a better way to do it?
(Registering Char as a free constructor does not work because Char is not injective :-(.)

Moreover, the Tutorial on Function Definitions still pretends as if pattern matching was
possible (section 7.4). In the underlying theory file, I found that this is just a
manually typeset example. So this should be adapted to whatever is the canonical way nowadays.

Andreas

Email Gateway (Aug 22 2022 at 16:21):

From: Dominique Unruh <unruh@ut.ee>
Dear Andreas,

I ran into this problem as well.

While I do not have a nice solution, I wrote the attached theory that
defines a datatype legacy_char (basically a copy of the old definition of
char), and bijections between char and legacy_char.

In some cases, this may simplify migration, because one can first convert a
char into a legacy_char and then pattern match.
(Although not using the nice syntax CHR''a''.)

I am aware that this is far from an ideal solution, and in many cases might
not help at all, but I am providing it in case someone with Andreas'
problem finds it useful. (For example, automated methods may work better on
datatypes.)

Best wishes,
Dominique.
Legacy_Char.thy

Email Gateway (Aug 22 2022 at 16:22):

From: Florian Haftmann <florian.haftmann@informatik.tu-muenchen.de>
Hi Andreas,

that indeed is unsatisfactory. But the current situation need not be
the final word on this.

How many clauses do you have? For a small number of clauses, the
following works:

The disadvantage is that you have to spell out the »otherwise« in the
last clause explicitly, which is burdening if there are many clauses.
Also the »otherwise« is just a formal logical completion and not
suitable for execution.

If there are many clauses, I would think about introducing an explicit
case selector for char and registering that using free_constructors, but
I don't know yet how this will scale.

Paritcularly wrt. executability, depending on your application it might
also be feasible to constructor some kind of lookup association list
using »zip (Enum.enum :: char list) …« or similar.

What is your particular application?

Cheers,
Florian
signature.asc

Email Gateway (Aug 22 2022 at 16:22):

From: Andreas Lochbihler <andreas.lochbihler@inf.ethz.ch>
Hi Florian,

I ran into this problem while porting the solution to an assignment in our Isabelle course
at ETH (analysing cryptographic messages in a Dolev-Yao intruder model). In there, there
were always just a few cases (no more than three) that involved matching on characters. So
it's not a huge problem because I know how to rewrite them into if-then-else.
Nevertheless, the definitions somewhat get redundant because I have to repeat the
otherwise case several times. Moreover, the characters are not top-level arguments, but
the nesting occurs somewhere deep down in the message terms, say

...
| "f a (Enc m (Key ''I'')) = ..."
...
| "f b _ = otherwise b"

i.e., the pattern match on CHR ''I'' is happens inside "Enc _ (Key (Cons _ Nil))". So the
sequential option to function is really an essential convenience here. Executability is
also a key issue for us.

Regarding the case combinator for char, I don't yet see how this solves the problem. We'd
have to register a free constructor for chars and the parse translation for CHR would have
to use this constructor. Otherwise we cannot use the pattern matching. On the plus side,
the predicate compiler should then also be able to match against chars and strings again.
(We're not using the predicate compiler for this at the moment.).

Andreas

Email Gateway (Aug 22 2022 at 16:22):

From: Lars Hupel <hupel@in.tum.de>
Dear Andreas,

to offer yet another point in the design space: I have created a type
for bytes that is essentially an 8-tuple of bools. It's not efficient,
but it does have the advantage that it's a free constructor "for free".
Combined with a parse/print translation it should satisfy all your
requirements.

In the attached theory there's some additional stuff to make the code
generator use bytes as implementation for chars.

Cheers
Lars
Byte.thy

Email Gateway (Aug 22 2022 at 16:25):

From: Florian Haftmann <florian.haftmann@informatik.tu-muenchen.de>
Hi all,

I take this opportunity to set out some thoughts that have guarded my
work on characters before sketching possible solutions to the pattern
matching problem.

a) Using num to represent characters dramatically simplifies any kind of
»exporting« literal expressions like characters or numerals, since there
is only one constituting »literal« type to represent them, »num«.

b) Why having a different representation for characters in the regular
logic and in code equations seems to be a bad idea.

For illustration, have a look at the following example:

definition max_int :: int
where "max_int = 32767"

definition overflow :: "int ⇒ int option"
where "overflow k = (if k > max_int then None else Some k)"

code_thms overflow

Here »32767« has to be represented differently to accomplish code
generation; but the loss of readability is bearable since numeral
literals most times are not used directly, but as symbolic constants.

Compare this with an example involving a literal string:

definition overflow' :: "int ⇒ int"
where "overflow' k = (if k > max_int then Code.abort (STR Integer overflow) (λ_. k) else k)"

code_thms overflow'

Since string literals are likely to be used as rather arbitrary
notification, the appear scattered over the whole code equations;
replacing them by something without readable syntax obfuscates
everything considerably.

c) What to do next?

Disregarding b), there is always the possibility to re-establish a
canonical datatype representation for char, where I have slightly more
sympathy for Lars' suggestion than the quite involved traditional nibble
construction.

What IMHO should be thought about alternatively is to separate bytes and
characters, which is how programming languages deal with the fact that
in the world of unicode a particular glyph is one thing and its bit
representation another.

This could work along the following lines:

Type »char« is isomorphic to »nat«, i. e. a type representing an
unbounded set of possible glyphs.
Type »string« remains an alias for »char list«.
Type »char« is treated wrt. code generation more or less as now, using
explicit conversion functions from integers if a particular code point
is not directly representable in a particular target language.
Type »byte« is isomorphic to [0..<256]; some representants have
suggestive character syntax, e. g. »BYTE A«, cf. type »char« in ANSI C.
Type »byte« is a proper datatype.
Type »byte« is not treated specifically by code generation.
Some canonical conversion functions between »char« and »byte« can be
specified, e.g. »utf8_of_char :: char => byte list« and »string_of_utf8
:: byte list => string«

Curious for comments,
Florian
signature.asc

Email Gateway (Aug 22 2022 at 16:25):

From: Florian Haftmann <florian.haftmann@informatik.tu-muenchen.de>
Hi Lars,

just a few remarks on your theory:

»nat_of_bool«: there is already »of_bool«
»n mod 2 = 1«: usually better to write »odd n«
»nth_or_else«: there is already »nth_default« in More_List.thy

Generally, I like your idea for using proper bit representations rather
the quite ad-hoc traditional nibble auxiliary.

Cheers,
Florian
signature.asc

Email Gateway (Aug 22 2022 at 16:25):

From: Lars Hupel <hupel@in.tum.de>

»nat_of_bool«: there is already »of_bool«

»n mod 2 = 1«: usually better to write »odd n«

»nth_or_else«: there is already »nth_default« in More_List.thy

Thanks, that made some things easier.

Generally, I like your idea for using proper bit representations rather
the quite ad-hoc traditional nibble auxiliary.

I specifically created this theory to make interop between CakeML native
characters and HOL characters easier. There, I have hand-written
functions that convert between bytes and CakeML characters.

It could be potentially used for other target languages. "byte" can be
code-adapted to machine bytes (e.g. "Byte" in Scala). This would make it
more efficient, but still won't give us integration into target language
strings. But at least it's correct.

Cheers
Lars

Email Gateway (Aug 22 2022 at 16:25):

From: Lars Hupel <hupel@in.tum.de>
Hi Florian,

Disregarding b), there is always the possibility to re-establish a
canonical datatype representation for char, where I have slightly more
sympathy for Lars' suggestion than the quite involved traditional nibble
construction.

this will only work assuming everything is ASCII. Unfortunately the
situation in the various target languages is not uniform.

Haskell: "Char" is an opaque type representing Unicode codepoints,
"String = [Char]", but everybody uses "Text" that comes in strict and
lazy flavours

Scala: "char" is 16 bits for what was thought to be the final say
(UCS-2), "strings" are allegedly sequences of these "char"s, but a
proper iteration over them requires taking surrogate pairs into account
(UTF-16)

OCaml/SML: everything is ASCII, lol

What IMHO should be thought about alternatively is to separate bytes and
characters, which is how programming languages deal with the fact that
in the world of unicode a particular glyph is one thing and its bit
representation another.

Minor nitpick: "glyph" → "codepoint"

Type »char« is isomorphic to »nat«, i. e. a type representing an
unbounded set of possible glyphs.

Yes. but it is still not clear how we can make pattern matching work
with that. Hypothetically, if "Chr" is the morphism for "char", we can
register "Chr" as a free constructor. Then, "CHR ''a''" could be a
parse-translation resulting in "Chr 97". But

fun f :: "nat ⇒ nat" where
"f 97 = 97"

is not an acceptable function definition.

(Even if it were, it would probably take horribly long to be defined.)

Type »string« remains an alias for »char list«.

Also yes.

Type »char« is treated wrt. code generation more or less as now, using
explicit conversion functions from integers if a particular code point
is not directly representable in a particular target language.

Not sure how this can be made to work sort of uniformly, but I'm open to
suggestions.

Type »byte« is isomorphic to [0..<256]; some representants have
suggestive character syntax, e. g. »BYTE A«, cf. type »char« in ANSI C.

Are you suggesting to only allow pattern matching on bytes (0..255)
instead of codepoints? I could live with that, and I guess others could,
too.

Adapting "byte" to target languages would probably require a functrans
akin to what we have for "nat".

Type »byte« is a proper datatype.

Type »byte« is not treated specifically by code generation.

Yes.

Cheers
Lars

Email Gateway (Aug 22 2022 at 16:25):

From: Makarius <makarius@sketis.net>
On 16/11/17 15:56, Lars Hupel wrote:

Scala: "char" is 16 bits for what was thought to be the final say
(UCS-2), "strings" are allegedly sequences of these "char"s, but a
proper iteration over them requires taking surrogate pairs into account
(UTF-16)

That is indeed very complicated. It shows the full failure of the
Unicode project over the past decades. Just a few weeks ago, I was again
struggling with problems of performance and correctness in some corner
cases of Unicode in Isabelle/Scala.

OCaml/SML: everything is ASCII, lol

ASCII has actually only 7 bits, but OCaml and SML strings consist of
clean well-defined 8-bit bytes. Thus they fit nicely into the following
scheme http://utf8everywhere.org and turn out more modern than Java,
Scala, Python etc. who attempted a bit too early to get on the wrong train.

What IMHO should be thought about alternatively is to separate bytes and
characters, which is how programming languages deal with the fact that
in the world of unicode a particular glyph is one thing and its bit
representation another.

Minor nitpick: "glyph" → "codepoint"

It is even more complicated. See again http://utf8everywhere.org for
more details of Unicode that nobody really understands.

Makarius

Email Gateway (Aug 22 2022 at 16:27):

From: Florian Haftmann <florian.haftmann@informatik.tu-muenchen.de>
Hi all,

after some personally discussions, I come up with the following sketch
how to regain pattern matching on HOL chars while at the same time
clarifying some historical misperceptions around strings and code
generation.

Details would have be to figured out iteratively.

Cheers,
Florian
signature.asc

Email Gateway (Aug 22 2022 at 16:27):

From: Andreas Lochbihler <andreas.lochbihler@inf.ethz.ch>
Hi Florian,

Your agenda looks good, but I would like to point out one further usage requirement for
strings:

Strings are used to exchange data between generated and hand-written code. At the moment,
the functions String.implode and String.explode provide this interface between char list
and String.literal. IMO we should provide such an interface after the refactoring, too.
This does not have to be between char list and String.literal; some other element type for
target-language strings (brand: type character, akin to integer and natural) should do as
well.

Andreas

Email Gateway (Aug 22 2022 at 17:07):

From: Florian Haftmann <florian.haftmann@informatik.tu-muenchen.de>
Hi all,

I take the opportunity to post two patches (applicable on isabelle
b5e29bf0aeab and efd58f8b1659) containing a refinement of string-like
types as sketched in this mail thread.

There are still some AFP sessions to be sorted out, but the way to go
has become reasonable clear. Quoth NEWS:

* Type "char" is now a proper datatype of 8-bit values.

* Conversions "nat_of_char" and "char_of_nat" are gone; use more
general conversions "of_char" and "char_of" with suitable
type constraints instead.

* The zero character is just written "CHR 0x00", not
"0" any longer.

* Type "String.literal" (for code generation) is now isomorphic
to lists of 7-bit (ASCII) values; concrete values can be written
as "STR ''...''" for sequences of printable characters and
"ASCII 0x..." for one single ASCII code point given
as hexadecimal numeral.

* Type "String.literal" supports concatenation "... + ..."
for all standard target languages.

* Theory Library/Code_Char is gone; study the explanations concerning
"String.literal" in the tutorial on code generation to get an idea
how target-language string literals can be converted to HOL string
values and vice versa.

* Imperative-HOL: operation "raise" directly takes a value of type
"String.literal" as argument, not type "string".

When open ends are finally ironed out, there remain still some minor
questions:

Renaming of type "char" to "byte", to make its semantics clear once
and forall.
Avoid the ASCII marker for literals – use "STR" uniformly.
Revisit the Syntax for literals more deeply. The "STR" is traditional
but need not be the last word on that topic. Suggestions welcome.

Cheers,
Florian
char_8_bit@char
char_8_bit@char
signature.asc

Email Gateway (Aug 22 2022 at 17:07):

From: Andreas Lochbihler <mail@andreas-lochbihler.de>
Dear Florian,

Thanks for moving this forward. Could you please name all the lemmas that you declare as
[transfer_rule]? Without the name, it becomes much harder to remove them later in case
they confuse the transfer prover in applications. (I haven't tried your rules yet; it's
just my experience that there may easily be unexpected interactions between rules.)

As for the AFP entry Native_Word, I am puzzled that deleted the conversions between char
and uint8. Do you plan to unify the two types uint8 and char?

Best,
Andreas

Email Gateway (Aug 22 2022 at 17:08):

From: Florian Haftmann <florian.haftmann@informatik.tu-muenchen.de>
See now 1f9f973eed2a including follow-ups ad4b8b6892c3 and 27ba50c79328
for the current matter of affairs.

I am still considering two further refinements:

a) Renaming of type "char" to "byte", to make its semantics clear once
and forall.

b) The "STR" syntax for values of type String.literal has a certain
tradition but is not very suggestive.

At the moment I am inclined to pursue a) further but I am open for comments.

Concerning b), I have no better idea at hand and would put that aside.
But maybe there are innovative ideas around (cartouches instead of
''…''?) that would allow to reconsider that question.

Cheers,
Florian
signature.asc

Email Gateway (Aug 22 2022 at 17:08):

From: Florian Haftmann <florian.haftmann@informatik.tu-muenchen.de>
Hi Andreas,

Thanks for moving this forward. Could you please name all the lemmas> that you declare as [transfer_rule]? Without the name, it becomes
much> harder to remove them later in case they confuse the transfer
prover in> applications. (I haven't tried your rules yet; it's just my
experience> that there may easily be unexpected interactions between rules.)

I am happy to name them but so far this has never been done for the
transfer rules e. g. on integer. They are not used by default
(lifting_update / lifting_forget) – is there really a need to address
them as single theorems by name? (This is an open question to me)

As for the AFP entry Native_Word, I am puzzled that deleted the> conversions between char and uint8. Do you plan to unify the two
types> uint8 and char?

By its very nature the HOL char type by itself has no guaranteed
connection to possibly existing char types in target languages.
Explicit conversions to target-language types can be done e.g. using
char_of and of_char with suitable constraints or intermediate conversions.

It would still be an option to map char to e.g. Word8.word in SML
directly in Native_Word, but it seems to be a better idea to have a
particular target-language-specific mapping from the very beginning –
and not to optionally add it later on. This tradition has emerged with
the integer type (in contrast to int) and is now pervasive.

Am I missing something?

Cheers,
Florian
signature.asc

Email Gateway (Aug 22 2022 at 17:08):

From: Makarius <makarius@sketis.net>
Concerning type string (not String.literal) I am still hoping that we
may eventually replace the double-single quotes ''...'' by normal double
quotes "...", and thus make it less surprising for the world out there.

De-facto, this requires to use cartouches to embed inner syntax within
outer syntax by default, instead of old-fashioned double quotes. A
systematic conversion for existing sources requires batch-builds with a
PIDE markup database, to know precisely where inner syntax items need to
be replaced. This is not available yet, but we are close to have that soon.

Another reason why only a very small elite has adopted cartouches for
inner syntax quotation so far: the notation in LaTeX looks a bit odd,
because \<open> and \<close> are not really French single-quotes, but
angle brackets with adhoc manipulation of the size. Maybe anybody has a
better idea to get the "guillemets simples" right in LaTeX.

This is all for type string, but there might be reasons to treat
String.literal the same way, or reasons not to treat it the same.

A cartouche without additional marker is always somewhat special: it is
the "one shot for free" principle for nested sub-languages. E.g. in
Isabelle/ML a plain cartouche is a literal for uninterpreted
Input.source, and in the document language for uninterpreted formal text.

My feeling is that type String.literal is too exotic for the prominent
notation of cartouche without marker. Or maybe, type string could use
double quotes and type String.literal double single quotes, now without
marker.

Makarius
signature.asc

Email Gateway (Aug 22 2022 at 17:08):

From: Andreas Lochbihler <mail@andreas-lochbihler.de>
Hi Florian,

Thanks for moving this forward. Could you please name all the lemmas> that you declare
as [transfer_rule]? Without the name, it becomes
much> harder to remove them later in case they confuse the transfer
prover in> applications. (I haven't tried your rules yet; it's just my
experience> that there may easily be unexpected interactions between rules.)

I am happy to name them but so far this has never been done for the
transfer rules e. g. on integer. They are not used by default
(lifting_update / lifting_forget) – is there really a need to address
them as single theorems by name? (This is an open question to me)

OK, I overlooked the lifting_forget part. In that case, it is indeed less likely that
transfer gets confused about them. But it is nevertheless good practice to name all lemmas
that end up in some list of theorems. The style guides for the AFP, e.g., demand that:

Only named lemmas should carry attributes such as [simp].
https://www.isa-afp.org/submitting.html

At least when I simulate transfer for debugging or manually write a transfer proof, it is
a pain to do so if I have to extract those unnamed rules from transfer_raw with an index.

Thanks for the explanation. So the type char (soon to be called byte) could be the logical
type of bytes that an implementation theory akin to Code_Target_Int links to an
implementation type of bytes like uint8. That sounds like a good plan.

Best,
Andreas

Email Gateway (Aug 22 2022 at 17:16):

From: Florian Haftmann <florian.haftmann@informatik.tu-muenchen.de>
Dear list,

after an explorative try this seems out of scope for me: there are too
many references to the well-established name »char« to make that
economically feasible.

Cheers,
Florian
signature.asc

Last updated: Aug 16 2025 at 04:27 UTC