From: Lars Hupel <hupel@in.tum.de>
Dear list,
while playing some more with cartouches I was wondering why we don't
support Unicode strings in HOL.
For example, the following can't be written:
term "''ß''"
... because it fails with the message:
Bad character: "ß"
In fact, a very "character" in HOL is defined to be in the range of 0
and 255, so only old-fashioned charsets like ISO-8859-* can be represented.
In "Unicode_Examples.thy" [0], you will find input and output syntax to
support this. I've implemented appropriate parse and print translations,
such that at least literals work. However, the encoding I use is rather
cumbersome [1]. Conversions to and from the existing string type, code
setup and actual operations are missing.
Anyhow, this is on my "low-priority nice-to-have" list. Does anybody
actually need that? I whipped this up quickly during a commute, but if
somebody is seriously interested in communicating with the outside world
in a formal setting, this might be worth investigating.
Cheers
Lars
[0] <https://github.com/larsrh/purely-experimental>
[1] Many programming languages will store Unicode strings as a sequence
of codepoints (e.g. Haskell). A codepoint is a number designating a
specific Unicode character and tells nothing about the specific encoding
used. Now, Isabelle sources are encoded in UTF-8 and any proper
parse/print routine has to perform a conversion between UTF-8 surface
syntax and internal codepoint representation. I didn't do that. Instead,
I'm just storing chunks of UTF-8 bytes. The reason is that I couldn't
find appropriate encoding/decoding functions in Isabelle/ML.
Last updated: Nov 21 2024 at 12:39 UTC