Stream: Archive Mirror: Isabelle Users Mailing List

Topic: [isabelle] Unicode strings in HOL


view this post on Zulip Email Gateway (Aug 22 2022 at 10:25):

From: Lars Hupel <hupel@in.tum.de>
Dear list,

while playing some more with cartouches I was wondering why we don't
support Unicode strings in HOL.

For example, the following can't be written:

term "''ß''"

... because it fails with the message:

Bad character: "ß"

In fact, a very "character" in HOL is defined to be in the range of 0
and 255, so only old-fashioned charsets like ISO-8859-* can be represented.

In "Unicode_Examples.thy" [0], you will find input and output syntax to
support this. I've implemented appropriate parse and print translations,
such that at least literals work. However, the encoding I use is rather
cumbersome [1]. Conversions to and from the existing string type, code
setup and actual operations are missing.

Anyhow, this is on my "low-priority nice-to-have" list. Does anybody
actually need that? I whipped this up quickly during a commute, but if
somebody is seriously interested in communicating with the outside world
in a formal setting, this might be worth investigating.

Cheers
Lars

[0] <https://github.com/larsrh/purely-experimental>

[1] Many programming languages will store Unicode strings as a sequence
of codepoints (e.g. Haskell). A codepoint is a number designating a
specific Unicode character and tells nothing about the specific encoding
used. Now, Isabelle sources are encoded in UTF-8 and any proper
parse/print routine has to perform a conversion between UTF-8 surface
syntax and internal codepoint representation. I didn't do that. Instead,
I'm just storing chunks of UTF-8 bytes. The reason is that I couldn't
find appropriate encoding/decoding functions in Isabelle/ML.


Last updated: Nov 21 2024 at 12:39 UTC