Stream: Mirror: Isabelle Users Mailing List

Topic: [isabelle] How to parse theory files with Outer_Syntax.scala


view this post on Zulip Lacramioara Astefanoaei (Jan 29 2025 at 07:41):

Hello,

I have a question about parsing theory files. I've searched the chat but I haven't found the answer I was looking for, hence my message.

I need to split theory files into specs and proofs: definitions go to one file and lemmas with their proofs to another. Makarius pointed me towards Outer_Syntax.scala and, with the help of Copilot, I have some (incomplete) code running on simple theories such as the Fibonacci example presented in the code generation doc.
I would just like to see if the approach is "right": I currently match on span.name and I wonder if I should match instead on span.kind (see the sketch after the code below).
I also wonder if I should include keywords such as "::", ":", ".", "(", ")", "[code]" in Outer_Syntax and whether there is a unique way to classify keywords (like mapping ":" to "thy_goal" when creating thy_syntax).

For reference, the code I have is below.

Thank you,

Lacramioara  

package isabelle

object Test_Thy_Parser {
  // Classify a span as belonging to the "definitions" file, by command name.
  private def is_definition(span: Command_Span.Span): Boolean = span.name match {
    case "theory" | "imports" | "begin" | "end" | "definition" | "fun" | "datatype" | "::" => true
    case _ => false
  }

  // Classify a span as belonging to the "proofs" file, by command name.
  private def is_proof(span: Command_Span.Span): Boolean = span.name match {
    case "theorem" | "lemma" | "qed" | "proof" | "by" | "(" | ")" | ":" | "[code]" => true
    case _ => false
  }

  // Reassemble the selected spans into printable text, with ad-hoc spacing
  // depending on what the following span is.
  private def write_spans(path: Path, spans: List[Command_Span.Span]): Unit = {
    val content = new StringBuilder()
    spans.zipWithIndex.foreach { case (span, index) =>
      content.append(span.content.map(_.source).mkString(" "))
      spans.lift(index + 1).foreach { nextSpan =>
        if (!Set("::", ":", "(", ")", "[code]", "by").contains(nextSpan.name)) {
          content.append("\n\n")
        } else if (nextSpan.name == "[code]") {
          content.append(" ")
        } else if (nextSpan.name == "by") {
          content.append("\n ")
        }
      }
    }
    File.write(path, content.toString)
  }

  def main(args: Array[String]): Unit = {
    if (args.length != 1) {
      println("Usage: isabelle scala -e 'split_isa_spec_proofs.scala' <theory_file.thy>")
      sys.exit(1)
    }
    val thy_file = Path.explode(args(0))
    if (!thy_file.is_file) {
      println(s"Error: File not found - ${thy_file}")
      sys.exit(1)
    }

    // Set up syntax with common theory keywords (keyword name -> kind).
    val thy_syntax = Outer_Syntax.empty +
      ("::", "thy_defn") +
      ("theory", "thy_begin") +
      ("imports", "thy_decl") +
      ("begin", "thy_begin") +
      ("fun", "thy_defn") +
      ("datatype", "thy_defn") +
      ("definition", "thy_defn") +
      (":", "thy_goal") +
      ("(", "thy_goal") +
      (")", "thy_goal") +
      ("lemma", "thy_goal") +
      ("[code]", "thy_goal") +
      ("theorem", "thy_goal") +
      ("proof", "proof") +
      ("by", "proof") +
      ("qed", "proof") +
      ("end", "thy_end")

    // Read and parse theory content into command spans.
    val theory_content = File.read(thy_file)
    val spans = thy_syntax.parse_spans(theory_content)

    // Separate definitions and proofs.
    val definitions = spans.filter(is_definition)
    val proofs = spans.filter(is_proof)

    // Generate output files.
    val base_path = thy_file.expand.implode
    val def_path = Path.explode(base_path.replaceAll("\\.thy$", "_definitions.thy"))
    val proof_path = Path.explode(base_path.replaceAll("\\.thy$", "_proofs.thy"))
    write_spans(def_path, definitions)
    write_spans(proof_path, proofs)
  }
}
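
On the span.name vs span.kind question, here is a minimal sketch under an assumed API: in Command_Span.scala the span's kind distinguishes proper commands from ignored or malformed input (with constructors along the lines of Command_Span, Ignored_Span, Malformed_Span), while the keyword kind ("thy_defn", "thy_goal", ...) of a command name is looked up in the Outer_Syntax keyword tables:

package isabelle

object Span_Kind_Demo {
  // A sketch, assuming span.kind is Command_Span.Command_Span(name, pos) for
  // a proper command, Ignored_Span for white space/comments, and
  // Malformed_Span otherwise. The keyword kind of a command is looked up by
  // name in the syntax's keyword tables (cf. keywords.kinds above).
  def keyword_kind(syntax: Outer_Syntax, span: Command_Span.Span): Option[String] =
    span.kind match {
      case Command_Span.Command_Span(name, _) => syntax.keywords.kinds.get(name)
      case _ => None
    }
}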

view this post on Zulip Email Gateway (Jan 31 2025 at 12:15):

From: lacramioara <lacramioara.astefanoaei@gmail.com>
Hello,

I have posted the above on the Zulip chat, but I'm reposting here in case
someone in this audience has advice.

My question is about parsing theory files. I've searched the Zulip chat but I
haven't found the answer I was looking for, hence my message.

I need to split theory files into specs and proofs: definitions go to one file
and lemmas with their proofs to another. Makarius pointed me towards
Outer_Syntax.scala and, with the help of Copilot, I have some (incomplete)
code running on simple theories such as the Fibonacci example presented in the
code generation doc.

What is not clear to me is whether there is a "right" way to construct the
mapping for the keywords to be added to Outer_Syntax.empty: apparently, I do
not need to include "imports", so I was wondering if there is a way to avoid
enumerating all the tokens that might appear in theory files. I would also
like to check that my understanding is correct in that I can choose the
mapping as best suits my needs. For instance, I classify "lemma" as having
kind "PRF_GOAL" just as a way to make it appear in Keyword.proof. Would this
be something not in line with the underlying philosophy of Isabelle/Scala?

For reference, the code I have is below.

Thank you,
Lacramioara

package isabelle

object Test_Thy_Parser {

  private def is_definition(span: Command_Span.Span, keywords: Keyword.Keywords): Boolean = {
    keywords.kinds.get(span.name) match {
      case Some(kind) => Keyword.theory_defn.contains(kind)
      case None => false
    }
  }

  private def is_proof(span: Command_Span.Span, keywords: Keyword.Keywords): Boolean = {
    keywords.kinds.get(span.name) match {
      case Some(kind) => Keyword.proof.contains(kind)
      case None => false
    }
  }

  private def write_spans(path: Path, spans: List[Command_Span.Span]): Unit = {
    val content = new StringBuilder()
    spans.zipWithIndex.foreach { case (span, index) =>
      content.append(span.content.map(_.source).mkString(" "))
      spans.lift(index + 1).foreach { nextSpan =>
        if (!Set("(", ")", "[code]", "by").contains(nextSpan.name)) {
          content.append("\n\n")
        } else if (nextSpan.name == "[code]") {
          content.append(" ")
        } else if (nextSpan.name == "by") {
          content.append("\n ")
        }
      }
    }
    File.write(path, content.toString)
  }

  def main(args: Array[String]): Unit = {
    if (args.length != 1) {
      println("Usage: isabelle scala -e 'split_isa_spec_proofs.scala' <theory_file.thy>")
      sys.exit(1)
    }

    val thy_file = Path.explode(args(0))
    if (!thy_file.is_file) {
      println(s"Error: File not found - ${thy_file}")
      sys.exit(1)
    }

    type Keywords = List[(String, Keyword.Spec)]

    val thy_keywords: Keywords = List(
      ("%", Keyword.Spec()),
      ("(", Keyword.Spec()),
      (")", Keyword.Spec()),
      (",", Keyword.Spec()),
      ("::", Keyword.Spec()),
      (":", Keyword.Spec()),
      (".", Keyword.Spec()),
      ("=", Keyword.Spec()),
      ("and", Keyword.Spec()),
      ("begin", Keyword.Spec(kind = Keyword.QUASI_COMMAND)),
      //("imports", Keyword.Spec(kind = Keyword.QUASI_COMMAND)),
      ("text", Keyword.Spec(kind = Keyword.DOCUMENT_BODY)),
      ("fun", Keyword.Spec(kind = Keyword.THY_DEFN, tags = List("fun"))),
      ("datatype", Keyword.Spec(kind = Keyword.THY_DEFN, tags = List("datatype"))),
      ("definition", Keyword.Spec(kind = Keyword.THY_DEFN, tags = List("definition"))),
      ("theory", Keyword.Spec(kind = Keyword.THY_DEFN, tags = List("theory"))),
      ("proof", Keyword.Spec(kind = Keyword.PRF_GOAL, tags = List("proof"))),
      ("lemma", Keyword.Spec(kind = Keyword.PRF_GOAL, tags = List("lemma"))),
      ("qed", Keyword.Spec(kind = Keyword.QED, tags = List("qed"))),
      ("by", Keyword.Spec(kind = Keyword.PRF_ASM_GOAL, tags = List("by"))),
      ("next", Keyword.Spec(kind = Keyword.NEXT_BLOCK, tags = List("next"))),
      ("end", Keyword.Spec(kind = Keyword.THY_DEFN, tags = List("end"))))

    // Set up syntax with common theory keywords
    val thy_syntax = Outer_Syntax.empty.add_keywords(thy_keywords)

    // Read and parse theory content
    val theory_content = File.read(thy_file)
    val spans = thy_syntax.parse_spans(theory_content)

    print(thy_syntax.keywords)

    // Separate definitions and proofs
    val definitions = spans.filter(span => is_definition(span, thy_syntax.keywords))
    val proofs = spans.filter(span => is_proof(span, thy_syntax.keywords))

    // Generate output files
    val base_path = thy_file.expand.implode
    val def_path = Path.explode(base_path.replaceAll("\\.thy$", "_definitions.thy"))
    val proof_path = Path.explode(base_path.replaceAll("\\.thy$", "_proofs.thy"))

    write_spans(def_path, definitions)
    write_spans(proof_path, proofs)

    // Report results
    println(s"\nProcessed theory file: ${thy_file}")
    println("Generated files:")
    println(s"  Definitions: ${def_path}")
    println(s"  Proofs: ${proof_path}")
    println(s"\nFound ${definitions.length} definitions and ${proofs.length} proofs")
  }
}

view this post on Zulip Email Gateway (Jan 31 2025 at 14:03):

From: Fabian Huch <huch@in.tum.de>

On 1/31/25 13:14, lacramioara wrote:

> I need to split theory files into specs and proofs: definitions go to one
> file and lemmas with their proofs to another. Makarius pointed me towards
> Outer_Syntax.scala and, with the help of Copilot, I have some (incomplete)
> code running on simple theories such as the Fibonacci example presented in
> the code generation doc.
>
> What is not clear to me is whether there is a "right" way to construct the
> mapping for the keywords to be added to Outer_Syntax.empty: apparently, I
> do not need to include "imports", so I was wondering if there is a way to
> avoid enumerating all the tokens that might appear in theory files.

Keywords and their kinds are specified by the theories themselves (the
'keywords' specification in the beginning, cf.
https://isabelle.in.tum.de/doc/isar-ref.pdf#keyword.keywords); to parse
a theory file, you have to read all theory headers (in the imports
graph) first and collect the outer syntax.
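
In Isabelle/Scala terms, that collection step might look roughly like the following sketch. Both function arguments are hypothetical stand-ins: imports_closure for traversing the imports graph, and read_header_keywords for parsing the 'keywords' section of one theory header (e.g. via src/Pure/Thy/thy_header.scala).

package isabelle

object Collect_Syntax {
  // Sketch: fold the keyword specifications of a theory and all of its
  // imports into a single Outer_Syntax. The two helpers are hypothetical:
  // imports_closure should enumerate the theory files of the imports graph,
  // read_header_keywords should parse one header's 'keywords' section.
  def collect_syntax(
    thy: Path,
    imports_closure: Path => List[Path],
    read_header_keywords: Path => List[(String, Keyword.Spec)]
  ): Outer_Syntax = {
    imports_closure(thy).foldLeft(Outer_Syntax.empty) {
      case (syntax, file) => syntax.add_keywords(read_header_keywords(file))
    }
  }
}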

> I would also like to check that my understanding is correct in that I can
> choose the mapping as best suits my needs. For instance, I classify
> "lemma" as having kind "PRF_GOAL" just as a way to make it appear in
> Keyword.proof. Would this be something not in line with the underlying
> philosophy of Isabelle/Scala?

This would be like interpreting verbs as nouns as you are reading a
text. I doubt that this is what you want to achieve, but you could do it.

Fabian

view this post on Zulip Email Gateway (Feb 01 2025 at 09:15):

From: Lacramioara Astefanoaei <lacramioara.astefanoaei@gmail.com>
On Fri, 31 Jan 2025 at 15:03, Fabian Huch <huch@in.tum.de> wrote:

> On 1/31/25 13:14, lacramioara wrote:
>> What is not clear to me is whether there is a "right" way to construct
>> the mapping for the keywords to be added to Outer_Syntax.empty:
>> apparently, I do not need to include "imports", so I was wondering if
>> there is a way to avoid enumerating all the tokens that might appear in
>> theory files.
>
> Keywords and their kinds are specified by the theories themselves (the
> 'keywords' specification in the beginning, cf.
> https://isabelle.in.tum.de/doc/isar-ref.pdf#keyword.keywords); to parse
> a theory file, you have to read all theory headers (in the imports
> graph) first and collect the outer syntax.

Thank you for the pointer. I read there:

*Both minor keywords and major keywords of the Isar command language need
to be specified, in order to make parsing of proof documents work properly.*

which answers my question that there is no shortcut. (I still wonder why
the parsing works in my case without me having to explicitly add the
"imports" keyword.)

>> I would also like to check that my understanding is correct in that I
>> can choose the mapping as best suits my needs. For instance, I classify
>> "lemma" as having kind "PRF_GOAL" just as a way to make it appear in
>> Keyword.proof. Would this be something not in line with the underlying
>> philosophy of Isabelle/Scala?
>
> This would be like interpreting verbs as nouns as you are reading a
> text. I doubt that this is what you want to achieve, but you could do it.

For the moment, the code I have achieves what I want in that it separates
definitions from lemmas and their proofs. I understand from your comparison
that "lemma" shouldn't have kind "PRF_GOAL". What was somewhat implicit in
my question was: where could one find the "right" kinds for keywords? I did
not find this information in the docs; in the code, the only relevant piece
I found was src/Pure/Thy/thy_header.scala. I see now that I can find
examples of associations (keyword, kind) in what is generated by isabelle
dump. For instance, I see in the markup file for a toy theory:
accepted command_span name=lemma kind=thy_goal_stmt.

If I do a pdfgrep -r "thy_goal_stmt" doc/ I find no results, so I conclude
that to get this information about the "right" mapping I just have to
extract it from the output of dump.
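
For the extraction, something along these lines seems workable (a sketch using Isabelle/Scala's XML/YXML modules; it assumes the dumped markup really contains elements named "command_span" with "name" and "kind" properties, as in the line quoted above, and that YXML.parse_body accepts the raw file content):

package isabelle

object Dump_Kinds {
  // Sketch: collect (command name, keyword kind) pairs from a YXML markup
  // file produced by "isabelle dump". Assumes elements named "command_span"
  // carrying "name" and "kind" properties, as in the example above.
  def command_kinds(tree: XML.Tree): Set[(String, String)] =
    tree match {
      case XML.Elem(Markup("command_span", props), body) =>
        val here =
          for {
            name <- props.collectFirst({ case ("name", a) => a })
            kind <- props.collectFirst({ case ("kind", b) => b })
          } yield (name, kind)
        here.toSet ++ body.flatMap(command_kinds)
      case XML.Elem(_, body) => body.flatMap(command_kinds).toSet
      case _ => Set.empty
    }

  def read_kinds(markup_file: Path): Set[(String, String)] =
    YXML.parse_body(File.read(markup_file)).flatMap(command_kinds).toSet
}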

Thank you for taking the time to respond,
Lacramioara


view this post on Zulip Email Gateway (Feb 03 2025 at 13:38):

From: Lacramioara Astefanoaei <lacramioara.astefanoaei@gmail.com>
On Mon, 3 Feb 2025 at 10:31, Fabian Huch <huch@in.tum.de> wrote:

> On 2/1/25 10:15, Lacramioara Astefanoaei wrote:
>> [...] (I still wonder why the parsing works in my case without me having
>> to explicitly add the "imports" keyword.)
>
> The imports keyword is part of Pure and not part of the user-space, so it
> is always part of the outer syntax.

I see, thank you!

>> [...] I see now that I can find examples of associations (keyword, kind)
>> in what is generated by isabelle dump. [...] I conclude that to get this
>> information about the "right" mapping I just have to extract it from the
>> output of dump.
>
> I am not sure if keyword kinds are explicitly documented anywhere;
> however, I find the categories in Pure/Isar/keyword.scala very helpful to
> understand them.

You are right. (I was reusing the functionality from Keyword.scala in my
code, following the suggestions from Makarius.)
What I was missing were the mappings (span_name, span_kind), which, in the
end, I extracted from the YXML files generated by isabelle dump. These
mappings do exist; I have just found them in src/Pure/Pure.thy (previously,
I overlooked them as I was searching in Scala files). It would be amazing
if I could figure out how Pure.thy is used so that the mappings appear in
the YXML files.

> Also note Tools/Find_Facts/src/thy_blocks.scala, which uses the keyword
> kinds to parse the block structure of a theory.

Thank you, yes, the code I have is a bit more verbose but basically works
like your span.is_of_kind(Keyword.<x>).
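
A compact version of that classification, in the spirit of thy_blocks.scala, might look like this (a minimal sketch using only the Keyword.Keywords tables; the helper is illustrative, not the actual Find_Facts API):

package isabelle

object Span_Kinds {
  // Sketch: classify a span by the keyword kind recorded for its command
  // name, mimicking span.is_of_kind from Tools/Find_Facts/src/thy_blocks.scala.
  def is_of_kind(
    keywords: Keyword.Keywords,
    span: Command_Span.Span,
    kinds: Set[String]
  ): Boolean = keywords.kinds.get(span.name).exists(kinds.contains)
}

With this, the two predicates above collapse to
spans.filter(span => Span_Kinds.is_of_kind(thy_syntax.keywords, span, Keyword.theory_defn)) and
spans.filter(span => Span_Kinds.is_of_kind(thy_syntax.keywords, span, Keyword.proof)).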



Last updated: Mar 09 2025 at 12:28 UTC