Hello,
I have a question about parsing theory files. I've searched the chat but I haven't found the answer I was looking for, hence my message.
I need to split theory files into specs and proofs: definitions go to one file and lemmas with their proofs into another. Makarius pointed me towards Outer_Syntax.scala and, with the help of Copilot, I have some (incomplete) code running on simple theories such as the Fibonacci example presented in the code generation documentation.
I would just like to check whether the approach is "right": I currently match on span.name and I wonder whether I should match on span.kind instead.
I also wonder whether I should include keywords such as "::", ":", ".", "(", ")", "[code]" in Outer_Syntax, and whether there is a unique way to classify keywords (like mapping ":" to "thy_goal") when creating thy_syntax.
For reference, the code I have is below.
Thank you,
Lacramioara
package isabelle

object Test_Thy_Parser {
  private def is_definition(span: Command_Span.Span): Boolean =
    span.name match {
      case "theory" | "imports" | "begin" | "end" | "definition" | "fun" | "datatype" | "::" => true
      case _ => false
    }

  private def is_proof(span: Command_Span.Span): Boolean =
    span.name match {
      case "theorem" | "lemma" | "qed" | "proof" | "by" | "(" | ")" | ":" | "[code]" => true
      case _ => false
    }

  private def write_spans(path: Path, spans: List[Command_Span.Span]): Unit = {
    val content = new StringBuilder()
    spans.zipWithIndex.foreach { case (span, index) =>
      content.append(span.content.map(_.source).mkString(" "))
      spans.lift(index + 1).foreach { nextSpan =>
        if (!Set("::", ":", "(", ")", "[code]", "by").contains(nextSpan.name)) {
          content.append("\n\n")
        } else if (nextSpan.name == "[code]") {
          content.append(" ")
        } else if (nextSpan.name == "by") {
          content.append("\n  ")
        }
      }
    }
    File.write(path, content.toString)
  }

  def main(args: Array[String]): Unit = {
    if (args.length != 1) {
      println("Usage: isabelle scala -e 'split_isa_spec_proofs.scala' <theory_file.thy>")
      sys.exit(1)
    }
    val thy_file = Path.explode(args(0))
    if (!thy_file.is_file) {
      println(s"Error: File not found - ${thy_file}")
      sys.exit(1)
    }

    // Setup syntax with common theory keywords
    val thy_syntax =
      Outer_Syntax.empty +
        ("::", "thy_defn") +
        ("theory", "thy_begin") +
        ("imports", "thy_decl") +
        ("begin", "thy_begin") +
        ("fun", "thy_defn") +
        ("datatype", "thy_defn") +
        ("definition", "thy_defn") +
        (":", "thy_goal") +
        ("(", "thy_goal") +
        (")", "thy_goal") +
        ("lemma", "thy_goal") +
        ("[code]", "thy_goal") +
        ("theorem", "thy_goal") +
        ("proof", "proof") +
        ("by", "proof") +
        ("qed", "proof") +
        ("end", "thy_end")

    // Read and parse theory content
    val theory_content = File.read(thy_file)
    val spans = thy_syntax.parse_spans(theory_content)

    // Separate definitions and proofs
    val definitions = spans.filter(is_definition)
    val proofs = spans.filter(is_proof)

    // Generate output files
    val base_path = thy_file.expand.implode
    val def_path = Path.explode(base_path.replaceAll("\\.thy$", "_definitions.thy"))
    val proof_path = Path.explode(base_path.replaceAll("\\.thy$", "_proofs.thy"))
    write_spans(def_path, definitions)
    write_spans(proof_path, proofs)
  }
}
From: lacramioara <lacramioara.astefanoaei@gmail.com>
Hello,
I have posted the above on the Zulip chat but I'm reposting here in case someone in this audience has advice.
My question is about parsing theory files. I've searched the Zulip chat but I haven't found the answer I was looking for, hence my message.
I need to split theory files into specs and proofs: definitions go to one file and lemmas with their proofs into another. Makarius pointed me towards Outer_Syntax.scala and, with the help of Copilot, I have some (incomplete) code running on simple theories such as the Fibonacci example presented in the code generation documentation.
What is not clear to me is whether there is a "right" way to construct the mapping for the keywords to be added to Outer_Syntax.empty: apparently, I do not need to include "imports", so I was wondering if there is a way to avoid enumerating all the tokens that might appear in theory files. I would also like to check whether my understanding is correct that I can choose the mapping as best suits my needs. For instance, I classify "lemma" as having kind "PRF_GOAL" just as a way to make it appear in Keyword.proof. Would this be something not in line with the underlying philosophy of Isabelle/Scala?
For reference, the code I have is below.
Thank you,
Lacramioara
package isabelle

object Test_Thy_Parser {
  private def is_definition(span: Command_Span.Span, keywords: Keyword.Keywords): Boolean =
    keywords.kinds.get(span.name) match {
      case Some(kind) => Keyword.theory_defn.contains(kind)
      case None => false
    }

  private def is_proof(span: Command_Span.Span, keywords: Keyword.Keywords): Boolean =
    keywords.kinds.get(span.name) match {
      case Some(kind) => Keyword.proof.contains(kind)
      case None => false
    }

  private def write_spans(path: Path, spans: List[Command_Span.Span]): Unit = {
    val content = new StringBuilder()
    spans.zipWithIndex.foreach { case (span, index) =>
      content.append(span.content.map(_.source).mkString(" "))
      spans.lift(index + 1).foreach { nextSpan =>
        if (!Set("(", ")", "[code]", "by").contains(nextSpan.name)) {
          content.append("\n\n")
        } else if (nextSpan.name == "[code]") {
          content.append(" ")
        } else if (nextSpan.name == "by") {
          content.append("\n  ")
        }
      }
    }
    File.write(path, content.toString)
  }

  def main(args: Array[String]): Unit = {
    if (args.length != 1) {
      println("Usage: isabelle scala -e 'split_isa_spec_proofs.scala' <theory_file.thy>")
      sys.exit(1)
    }
    val thy_file = Path.explode(args(0))
    if (!thy_file.is_file) {
      println(s"Error: File not found - ${thy_file}")
      sys.exit(1)
    }

    type Keywords = List[(String, Keyword.Spec)]
    val thy_keywords: Keywords =
      List(
        ("%", Keyword.Spec()),
        ("(", Keyword.Spec()),
        (")", Keyword.Spec()),
        (",", Keyword.Spec()),
        ("::", Keyword.Spec()),
        (":", Keyword.Spec()),
        (".", Keyword.Spec()),
        ("=", Keyword.Spec()),
        ("and", Keyword.Spec()),
        ("begin", Keyword.Spec(kind = Keyword.QUASI_COMMAND)),
        //("imports", Keyword.Spec(kind = Keyword.QUASI_COMMAND)),
        ("text", Keyword.Spec(kind = Keyword.DOCUMENT_BODY)),
        ("fun", Keyword.Spec(kind = Keyword.THY_DEFN, tags = List("fun"))),
        ("datatype", Keyword.Spec(kind = Keyword.THY_DEFN, tags = List("datatype"))),
        ("definition", Keyword.Spec(kind = Keyword.THY_DEFN, tags = List("definition"))),
        ("theory", Keyword.Spec(kind = Keyword.THY_DEFN, tags = List("theory"))),
        ("proof", Keyword.Spec(kind = Keyword.PRF_GOAL, tags = List("proof"))),
        ("lemma", Keyword.Spec(kind = Keyword.PRF_GOAL, tags = List("lemma"))),
        ("qed", Keyword.Spec(kind = Keyword.QED, tags = List("qed"))),
        ("by", Keyword.Spec(kind = Keyword.PRF_ASM_GOAL, tags = List("by"))),
        ("next", Keyword.Spec(kind = Keyword.NEXT_BLOCK, tags = List("next"))),
        ("end", Keyword.Spec(kind = Keyword.THY_DEFN, tags = List("end")))
      )

    // Setup syntax with common theory keywords
    val thy_syntax = Outer_Syntax.empty.add_keywords(thy_keywords)

    // Read and parse theory content
    val theory_content = File.read(thy_file)
    val spans = thy_syntax.parse_spans(theory_content)
    print(thy_syntax.keywords)

    // Separate definitions and proofs
    val definitions = spans.filter(span => is_definition(span, thy_syntax.keywords))
    val proofs = spans.filter(span => is_proof(span, thy_syntax.keywords))

    // Generate output files
    val base_path = thy_file.expand.implode
    val def_path = Path.explode(base_path.replaceAll("\\.thy$", "_definitions.thy"))
    val proof_path = Path.explode(base_path.replaceAll("\\.thy$", "_proofs.thy"))
    write_spans(def_path, definitions)
    write_spans(proof_path, proofs)

    // Report results
    println(s"\nProcessed theory file: ${thy_file}")
    println("Generated files:")
    println(s"  Definitions: ${def_path}")
    println(s"  Proofs: ${proof_path}")
    println(s"\nFound ${definitions.length} definitions and ${proofs.length} proofs")
  }
}
From: Fabian Huch <huch@in.tum.de>
On 1/31/25 13:14, lacramioara wrote:
I need to split theory files into specs and proofs: definitions go to one file and lemmas with their proofs into another. Makarius pointed me towards Outer_Syntax.scala and, with the help of Copilot, I have some (incomplete) code running on simple theories such as the Fibonacci example presented in the code generation documentation.

What is not clear to me is whether there is a "right" way to construct the mapping for the keywords to be added to Outer_Syntax.empty: apparently, I do not need to include "imports", so I was wondering if there is a way to avoid enumerating all the tokens that might appear in theory files.

Keywords and their kinds are specified by the theories themselves (the 'keywords' specification in the beginning, cf. https://isabelle.in.tum.de/doc/isar-ref.pdf#keyword.keywords); to parse a theory file, you have to read all theory headers (in the imports graph) first and collect the outer syntax.
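[Editor's note: this recipe could be sketched in Isabelle/Scala roughly as follows. This is only a sketch, not code from the thread: it needs the Isabelle/Scala classpath (e.g. run via "isabelle scala"), the exact signature of Thy_Header.read is an assumption that may differ between Isabelle versions, and the file list is assumed to be in import order (imports before importers). add_keywords is the same operation used in the code later in the thread.]

```scala
// Sketch only (assumed API): fold the keyword specifications declared in each
// theory header of the import graph into one Outer_Syntax. Thy_Header.read and
// its arguments are an assumption; header.keywords is assumed to have the shape
// List[(String, Keyword.Spec)], which is what add_keywords expects.
def collect_syntax(thy_files: List[Path]): Outer_Syntax =
  thy_files.foldLeft(Outer_Syntax.empty) { (syntax, path) =>
    val header = Thy_Header.read(path, File.read(path))  // assumed signature
    syntax.add_keywords(header.keywords)
  }
```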
I would also like to check whether my understanding is correct that I can choose the mapping as best suits my needs. For instance, I classify "lemma" as having kind "PRF_GOAL" just as a way to make it appear in Keyword.proof. Would this be something not in line with the underlying philosophy of Isabelle/Scala?

This would be like interpreting verbs as nouns as you are reading a text. I doubt that this is what you want to achieve, but you could do it.

Fabian
From: Lacramioara Astefanoaei <lacramioara.astefanoaei@gmail.com>
On Fri, 31 Jan 2025 at 15:03, Fabian Huch <huch@in.tum.de> wrote:
On 1/31/25 13:14, lacramioara wrote:
I need to split theory files into specs and proofs: definitions go to one file and lemmas with their proofs into another. Makarius pointed me towards Outer_Syntax.scala and, with the help of Copilot, I have some (incomplete) code running on simple theories such as the Fibonacci example presented in the code generation documentation.

What is not clear to me is whether there is a "right" way to construct the mapping for the keywords to be added to Outer_Syntax.empty: apparently, I do not need to include "imports", so I was wondering if there is a way to avoid enumerating all the tokens that might appear in theory files.

Keywords and their kinds are specified by the theories themselves (the 'keywords' specification in the beginning, cf. https://isabelle.in.tum.de/doc/isar-ref.pdf#keyword.keywords); to parse a theory file, you have to read all theory headers (in the imports graph) first and collect the outer syntax.

Thank you for the pointer. I read there: "Both minor keywords and major keywords of the Isar command language need to be specified, in order to make parsing of proof documents work properly." This answers my question that there is no shortcut. (I still wonder why the parsing works in my case without me having to explicitly add the "imports" keyword.)

I would also like to check whether my understanding is correct that I can choose the mapping as best suits my needs. For instance, I classify "lemma" as having kind "PRF_GOAL" just as a way to make it appear in Keyword.proof. Would this be something not in line with the underlying philosophy of Isabelle/Scala?

This would be like interpreting verbs as nouns as you are reading a text. I doubt that this is what you want to achieve, but you could do it.

For the moment, the code I have achieves what I want in that it separates definitions from lemmas and their proofs. I understand from your comparison that "lemma" shouldn't have kind "PRF_GOAL". What was somewhat implicit in my question was: where could one find the "right" kinds for keywords? I did not find this information in the docs, while in the code the only relevant piece I found was src/Pure/Thy/thy_header.scala. I see now that I can find examples of associations (keyword, kind) in what is generated by isabelle dump. For instance, I see in the markup file for a toy theory: accepted command_span name=lemma kind=thy_goal_stmt. If I do a pdfgrep -r "thy_goal_stmt" doc/ I find no results, so I conclude that to get this information about the "right" mapping I just have to extract it from the output of dump.
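[Editor's note: this last step could be automated with a small amount of Scala. The self-contained sketch below (not from the thread) extracts (name, kind) pairs from lines shaped like the example above; the demo lines are illustrative, and real dump output is YXML, so a proper YXML parser may be preferable to a regex.]

```scala
// Pull (keyword name, kind) pairs out of dump output lines such as:
//   accepted command_span name=lemma kind=thy_goal_stmt
object Extract_Kinds {
  private val Pattern = """.*command_span name=(\S+) kind=(\S+).*""".r

  // Keep only lines matching the pattern; later occurrences win on duplicates.
  def extract(lines: List[String]): Map[String, String] =
    lines.collect { case Pattern(name, kind) => name -> kind }.toMap

  def main(args: Array[String]): Unit = {
    val demo = List(
      "accepted command_span name=lemma kind=thy_goal_stmt",
      "accepted command_span name=definition kind=thy_defn",
      "some unrelated line")
    println(extract(demo))
    // prints: Map(lemma -> thy_goal_stmt, definition -> thy_defn)
  }
}
```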
Thank you for taking the time to respond,
Lacramioara
From: Lacramioara Astefanoaei <lacramioara.astefanoaei@gmail.com>
On Mon, 3 Feb 2025 at 10:31, Fabian Huch <huch@in.tum.de> wrote:
On 2/1/25 10:15, Lacramioara Astefanoaei wrote:
On Fri, 31 Jan 2025 at 15:03, Fabian Huch <huch@in.tum.de> wrote:
On 1/31/25 13:14, lacramioara wrote:
I need to split theory files into specs and proofs: definitions go to one file and lemmas with their proofs into another. Makarius pointed me towards Outer_Syntax.scala and, with the help of Copilot, I have some (incomplete) code running on simple theories such as the Fibonacci example presented in the code generation documentation.

What is not clear to me is whether there is a "right" way to construct the mapping for the keywords to be added to Outer_Syntax.empty: apparently, I do not need to include "imports", so I was wondering if there is a way to avoid enumerating all the tokens that might appear in theory files.

Keywords and their kinds are specified by the theories themselves (the 'keywords' specification in the beginning, cf. https://isabelle.in.tum.de/doc/isar-ref.pdf#keyword.keywords); to parse a theory file, you have to read all theory headers (in the imports graph) first and collect the outer syntax.

Thank you for the pointer. I read there: "Both minor keywords and major keywords of the Isar command language need to be specified, in order to make parsing of proof documents work properly." This answers my question that there is no shortcut. (I still wonder why the parsing works in my case without me having to explicitly add the "imports" keyword.)

The imports keyword is part of Pure and not part of the user-space, so it is always part of the outer syntax.

I see, thank you!

I would also like to check whether my understanding is correct that I can choose the mapping as best suits my needs. For instance, I classify "lemma" as having kind "PRF_GOAL" just as a way to make it appear in Keyword.proof. Would this be something not in line with the underlying philosophy of Isabelle/Scala?

This would be like interpreting verbs as nouns as you are reading a text. I doubt that this is what you want to achieve, but you could do it.

For the moment, the code I have achieves what I want in that it separates definitions from lemmas and their proofs. I understand from your comparison that "lemma" shouldn't have kind "PRF_GOAL". What was somewhat implicit in my question was: where could one find the "right" kinds for keywords? I did not find this information in the docs, while in the code the only relevant piece I found was src/Pure/Thy/thy_header.scala. I see now that I can find examples of associations (keyword, kind) in what is generated by isabelle dump. For instance, I see in the markup file for a toy theory: accepted command_span name=lemma kind=thy_goal_stmt. If I do a pdfgrep -r "thy_goal_stmt" doc/ I find no results, so I conclude that to get this information about the "right" mapping I just have to extract it from the output of dump.

I am not sure if keyword kinds are explicitly documented anywhere; however, I find the categories in Pure/Isar/keyword.scala very helpful to understand them.

You are right. (I was reusing the functionality from Keyword.scala in my code, following the suggestions from Makarius.) What I was missing was the mappings (span_name, span_kind) which, in the end, I extracted from the YXML files generated by isabelle dump. These mappings do exist; I've just found them in src/Pure/Pure.thy (previously, I overlooked them as I was searching in Scala files). It would be amazing if I could figure out how Pure.thy is used so that the mappings appear in YXML files.

Also note Tools/Find_Facts/src/thy_blocks.scala, which uses the keyword kinds to parse the block structure of a theory.

Thank you, yes, the code I have is a bit more verbose but basically works like your span.is_of_kind(Keyword.<x>).
Last updated: Mar 09 2025 at 12:28 UTC