The Data Scientist’s Toolbox: Parsing

Parsing complex documents can be easy if you have the right tools

Published in Towards Data Science · 9 min read · Nov 11, 2023

Source code of the new Python-based rd2md parser and transformer discussed in this article. Image by the author.

For many Data Scientists, converting complex documents into usable data is a common problem. Let’s look at a complex document, and explore different methods of transforming the data.

TL;DR

We’ll explore these rules as we develop a complex parser:

Rule 1: Be lazy; don’t do any more than what is needed.
Rule 2: Start with the easy parts of the problem.
Rule 3: Don’t be afraid to throw away code and start over!
Rule 4: Use the simplest method possible to get the job done.

The Problem

As the Head of Research at an ML company, I am often faced with a variety of problems that need to be explored and solutions designed. Last week an interesting little problem arose: we needed a way to generate markdown documentation for our open source R SDK that allows ML experiments to log important details. And we needed a solution quickly without spending a lot of time on it.

This problem may be a little more complex than what a Data Scientist would encounter on a daily basis, but it will serve as a nice example of how to use different methods of parsing. And as a bonus, we’ll end up with a new open source project that fills a particular niche. Let’s dive in!

On hearing of the problem, my first rule of Research and Design (R&D) kicked in:

Rule 1: Be lazy; don’t do any more than what is needed (Laziness was identified by Larry Wall as one of the Three Great Virtues of a Programmer).

So I started looking to see if converting R documentation to markdown was a solved problem. It appeared that it was! However, after trying all of the available programs I could find (such as R’s old Rd2md), none of them worked, and their git repositories were no longer active. Okay, so I was on my own. If I were a better R programmer, I probably would have tried to fix the existing solutions. But I enjoy Python more, and thought it would make a nice parsing example. So, yes, we’ll be parsing R documentation in Python.

So, I started to write some code. And that reminded me of my next R&D rule:

Rule 2: Start with the easy parts of the problem.

Rule 2 is probably just my way of satisfying my need for some instantaneous feedback. But it also serves a more important purpose: if you start with the easy parts, maybe the hard parts won’t turn out to be so hard. It also works as a warm-up for the problem. I typically have one or two false starts in coding a solution, which leads to my next rule:

Rule 3: Don’t be afraid to throw away code and start over!

Finally, when you are on the right track, the last rule is:

Rule 4: Use the simplest method possible to get the job done.

Ok, so what is the easiest method to convert R documentation files into markdown? First, what is an R documentation file? R documentation is converted directly from the R code into something that looks a lot like LaTeX. Here is an example (files end in .Rd):

% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/experiment.R
\name{Experiment}
\alias{Experiment}
\title{A Comet Experiment object}
\description{
A comet experiment object can be used to modify or get information about an active
experiment. All methods documented here are the different ways to interact with an
experiment. Use \code{\link[=create_experiment]{create_experiment()}} to create or \code{\link[=get_experiment]{get_experiment()}} to
retrieve a Comet experiment object.
}

The goal is to convert the LaTeX into markdown that looks something like:

## Description

A comet experiment object can be used to modify or get information about an active
experiment. All methods documented here are the different ways to interact with an
experiment. Use [`create_experiment()`](../create_experiment) to create or [`get_experiment()`](../get_experiment) to
retrieve a Comet experiment object.

which renders like:

Example markdown output. Image by the author.
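
For the inline markup in this example, a pair of regular expressions is enough to sketch the transformation. This is an illustration only, not necessarily how rd2md does it: the function name inline_to_markdown() and both patterns are assumptions, and they only cover \code{...} and \code{\link[=target]{label}} without further nesting.

```python
import re

# Assumed patterns for illustration; they do not handle deeper nesting.
LINK_RE = re.compile(r"\\code\{\\link\[=(\w+)\]\{([^{}]*)\}\}")
CODE_RE = re.compile(r"\\code\{([^{}]*)\}")


def inline_to_markdown(text):
    """Convert simple Rd inline markup into markdown."""
    # Links first, so the plain \code{} rule does not see them.
    text = LINK_RE.sub(r"[`\2`](../\1)", text)
    return CODE_RE.sub(r"`\1`", text)


rd_text = r"Use \code{\link[=create_experiment]{create_experiment()}} to create"
markdown_text = inline_to_markdown(rd_text)
```

The order matters: the link pattern has to run before the plain code pattern, because a linked name is itself wrapped in \code{}.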

Ok, let’s start with something very simple. We’ll go through the Rd file line by line with something like:

import re

doc = Documentation()
...
for line in lines:
    if line.startswith("%"):
        pass  # a comment in the Rd file
    elif line.startswith("\\name{"):
        matches = re.search("{(.*)}", line)
        groups = matches.groups()
        name = groups[0]
        doc.set_name(name)
    ...

In this code, we look to see if a line starts with “%” and if it does, we skip it (it is just a comment in the Rd file). Likewise, if it starts with “\name” then we set the current doc name. Note that we need to escape the backslash if we don’t use “raw” Python strings. The code re.search("{(.*)}", line) assumes that the line will contain the ending curly brace. This assumption holds true in all of the examples in our SDK, so I won’t make this code any more complicated than it needs to be, as per Rule 4.

Note that we construct a Documentation() instance before processing the lines in the file. We do this to collect all of the parts, and then call doc.generate() at the end. We do that (rather than generating the markdown on the fly) because some of the items that we parse will be in a different order in the markdown.
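
To make the collect-then-generate idea concrete, here is a minimal sketch. The Documentation class below is a hypothetical stand-in (the real class in rd2md is not shown in this article), and the set_title() method, the parse_rd() function, and the output layout are all assumptions made for illustration.

```python
import re


class Documentation:
    """Hypothetical collector: gather the parts first, render them later."""

    def __init__(self):
        self.name = None
        self.title = None

    def set_name(self, name):
        self.name = name

    def set_title(self, title):
        self.title = title

    def generate(self):
        # The output order is decided here, not by the order of the input.
        return f"# {self.name}\n\n{self.title}\n"


def parse_rd(lines):
    doc = Documentation()
    for line in lines:
        if line.startswith("%"):
            continue  # a comment in the Rd file
        for macro, setter in (("name", doc.set_name), ("title", doc.set_title)):
            if line.startswith("\\" + macro + "{"):
                match = re.search(r"\{(.*)\}", line)
                if match:
                    setter(match.group(1))
    return doc


# The title deliberately comes before the name in this input.
rd_lines = [
    "% Generated by roxygen2: do not edit by hand",
    "\\title{A Comet Experiment object}",
    "\\name{Experiment}",
]
markdown = parse_rd(rd_lines).generate()
```

Because generate() runs only after every line has been seen, the name still ends up in the heading even though it appeared last in the input.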

We can handle some of the R code in exactly this fashion: look for a pattern on a line from the Rd file, and immediately process it. However, let’s look at the next easiest part that can’t be handled this way:

\usage{
create_experiment(
  experiment_name = NULL,
  project_name = NULL,
  workspace_name = NULL,
  api_key = NULL,
  keep_active = TRUE,
  log_output = TRUE,
  log_error = FALSE,
  log_code = TRUE,
  log_system_details = TRUE,
  log_git_info = FALSE
)
}

The usage section always starts with a line that is \usage{, and ends with a line that is a single }. As this is the case, we can use these facts to create a slightly more complicated parser:

...
for line in lines:
    ....
    elif line.startswith("\\usage{"):
        usage = ""
        line = fp_in.readline().rstrip()
        while line != "}":
            usage += line + "\n"
            line = fp_in.readline().rstrip()
        doc.set_usage(usage)

This will read line-by-line, gathering up all of the text in the \usage{} section.
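
A self-contained variant of the same idea is sketched below. It assumes the file object itself is what we iterate over, so the inner reader continues from wherever the outer loop stopped; the helper name read_usage() is made up for this example, and the end-of-file check is an addition that the snippet above does not have.

```python
import io


def read_usage(lines):
    """Collect lines up to (not including) a line that is a lone '}'."""
    collected = []
    for line in lines:
        line = line.rstrip()
        if line == "}":
            return "".join(item + "\n" for item in collected)
        collected.append(line)
    raise ValueError("reached the end of the file before the closing '}'")


# io.StringIO stands in for an open .Rd file.
fp_in = io.StringIO("\\name{x}\n\\usage{\nf(\n  a = NULL\n)\n}\n\\title{t}\n")
usage = None
other_lines = []
for line in fp_in:
    if line.startswith("\\usage{"):
        usage = read_usage(fp_in)  # continues from the current position
    else:
        other_lines.append(line.rstrip())
```

After the usage section is consumed, the outer loop simply carries on with the line that follows the closing brace.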

As we move on to the next most-complicated part, we have to start being a bit clever, and, for the first time, use the idea of “state.”

Consider this LaTeX code:

\item{log_error}{If \code{TRUE}, all output from 'stderr' (which includes errors,
warnings, and messages) will be redirected to the Comet servers to display as message
logs for the experiment. Note that unlike \code{auto_log_output}, if this option is on then
these messages will not be shown in the console and instead they will only be logged
to the Comet experiment. This option is set to \code{FALSE} by default because of this
behavior.}

This is tricky. The top-level format is:

\item{NAME}{DESCRIPTION}

However, DESCRIPTION can itself have curly braced items inside it. If you had this section of code as a string (even with newlines), you could use Python’s re (Regular Expression) module, like so:

import re

# A raw string (r"""...""") keeps the backslashes exactly as written
text = r"""\item{log_error}{If \code{TRUE}, all output from 'stderr' (which includes errors,
warnings, and messages) will be redirected to the Comet servers to display as message
logs for the experiment. Note that unlike \code{auto_log_output}, if this option is on then
these messages will not be shown in the console and instead they will only be logged
to the Comet experiment. This option is set to \code{FALSE} by default because of this
behavior.}"""

matches = re.search("{(.*)}{(.*)}", text, re.DOTALL)

You can get NAME and DESCRIPTION as matches.groups(). The parentheses in the re pattern “{(.*)}{(.*)}” indicate to match two groups: the first group between the first set of curly braces, and the second group between the next set. This works great, assuming that text is only that section. To be able to parse this without first breaking out this section we’ll actually have to parse the text, character by character. But this isn’t hard.
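
The caveat about the text being only that one section is easy to demonstrate. The short sketch below uses shortened, made-up descriptions rather than the real SDK text: the same pattern works on one \item, but is too greedy as soon as a second \item follows.

```python
import re

PATTERN = re.compile(r"\{(.*)\}\{(.*)\}", re.DOTALL)

# One item on its own: the two groups are the name and the description.
one_item = r"\item{log_error}{If \code{TRUE}, log errors.}"
name, description = PATTERN.search(one_item).groups()

# Two items in a row: the greedy first group now runs on into the
# second item, so it is no longer just the name.
two_items = one_item + "\n" + r"\item{log_code}{If \code{TRUE}, log code.}"
greedy_name, greedy_description = PATTERN.search(two_items).groups()
```

In the second case the first group starts with "log_error}{" and swallows most of the first item, which is why a regular expression alone is not enough once the section has not been isolated first.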

Here is a little function that will get a number of curly-brace sections given a file pointer (also known as a “file-like object” in modern Python parlance) that is positioned at the first opening curly brace:

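def get_curly_contents(number, fp):
    retval = []   # the sections parsed so far
    count = 0     # current nesting depth of curly braces
    current = ""  # text gathered for the current section
    while True:
        char = fp.read(1)
        if char == "":
            # end of file: stop rather than loop forever
            raise Exception("unexpected end of input", current)
        if char == "}":
            count -= 1
            if count == 0:
                if current.startswith("{"):
                    retval.append(current[1:])
                elif current.startswith("}{"):
                    retval.append(current[2:])
                else:
                    raise Exception("malformed?", current)
                current = ""
        elif char == "{":
            count += 1
        if len(retval) == number:
            return retval
        current += char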

In the function get_curly_contents() you would pass in the number of curly-braced sections, and a file pointer. So, to get 2 curly braced sections from a file, you can do:

fp = open(FILENAME)
name, description = get_curly_contents(2, fp)

get_curly_contents() is almost as complicated as it gets in this project. It has three state variables: retval, count, and current. retval is a list of parsed sections. count is the depth of the current curly-braced items. current is what is currently being processed. This function is actually useful in a few places, as we will see.
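
Here is a small runnable check of that behavior, using io.StringIO in place of a real file. The function is repeated so that the snippet runs on its own (with an end-of-file guard added), and the sample text is a shortened, made-up \item. Note the assumption that the macro name has already been consumed, so the file pointer sits on the first "{".

```python
import io


def get_curly_contents(number, fp):
    # Same logic as the function above, plus an end-of-file guard.
    retval = []
    count = 0
    current = ""
    while True:
        char = fp.read(1)
        if char == "":
            raise Exception("unexpected end of input", current)
        if char == "}":
            count -= 1
            if count == 0:
                if current.startswith("{"):
                    retval.append(current[1:])
                elif current.startswith("}{"):
                    retval.append(current[2:])
                else:
                    raise Exception("malformed?", current)
                current = ""
        elif char == "{":
            count += 1
        if len(retval) == number:
            return retval
        current += char


fp = io.StringIO(r"\item{log_error}{If \code{TRUE}, log errors.}")
fp.read(len("\\item"))  # consume the macro name; fp now sits on the first "{"
name, description = get_curly_contents(2, fp)
```

The nested \code{TRUE} stays intact inside the description because the closing brace only ends a section when the depth counter returns to zero.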

Finally, there is one more level of complexity. The problem area is the Method subsection of an R class definition. Here is a stripped-down example:

\if{html}{\out{<hr>}}
\if{html}{\out{<a id="method-Experiment-new"></a>}}
\if{latex}{\out{\hypertarget{method-Experiment-new}{}}}
\subsection{Method \code{new()}}{
Do not call this function directly. Use \code{create_experiment()} or \code{get_experiment()} instead.
\subsection{Usage}{
\if{html}{\out{<div class="r">}}\preformatted{Experiment$new(
  experiment_key,
  project_name = NULL
)}\if{html}{\out{</div>}}
}

\subsection{Arguments}{
\if{html}{\out{<div class="arguments">}}
\describe{
\item{\code{experiment_key}}{The key of the \code{Experiment}.}

\item{\code{project_name}}{The project name (can also be specified using the \code{COMET_PROJECT_NAME}
parameter as an environment variable or in a comet config file).}
}
\if{html}{\out{</div>}}
}
}

This is complicated because we have nested sections: Usage and Arguments are inside Method. We’re going to bring out the full parsing arsenal for this one.

To make this somewhat easier on ourselves, the first thing we’ll do is “tokenize” the Method subsection. That is a fancy word for breaking text into relevant strings. For example, consider this LaTeX text:

\subsection{Usage}{
\if{html}{\out{<div class="r">}}\preformatted{Experiment$new(
  experiment_key,
  project_name = NULL
)}\if{html}{\out{</div>}}
}

It could be tokenized into a list of strings, like:

[
 "\\", "subsection", "{", "Usage", "}", "{", "\\", "if",
 "{", "html", "}", "{", "\\", "out", "{", "<", "div",
 " ",  "class", "=", "\"r\"", ">", "}", "}", "\\",
 "preformatted", "{", "Experiment$new", "(", "experiment_key",
 "project_name", "=", "NULL", ")", "}", "\\", "if",
 "{", "html", "}", "{", "\\", "out", "{", "<", "/", "div",
 ">", "}", "}", "}"
]
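
A tokenizer that produces a list along these lines can be sketched with a single regular expression. The token rules below are assumptions for illustration, and rd2md’s actual tokenizer may well differ: this version drops all whitespace and keeps punctuation such as commas as separate tokens.

```python
import re

# Assumed token rules for illustration only.
TOKEN_RE = re.compile(
    r"""
    \\                  # a backslash that introduces a macro
  | [{}]                # an opening or closing curly brace
  | "[^"]*"             # a double-quoted string, kept as one token
  | [A-Za-z_][\w$]*     # a word, allowing R names like Experiment$new
  | \S                  # any other single non-whitespace character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split Rd text into a flat list of string tokens."""
    return TOKEN_RE.findall(text)


tokens = tokenize(r'\subsection{Usage}{ \if{html}{\out{<div class="r">}} }')
```

Because the alternatives are tried from left to right, the more specific rules (quoted strings and words) win over the catch-all single-character rule at the end.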

A list of tokens gives you the ability to easily process the text into its sub-parts. In addition, you can easily “look ahead” one or more tokens to see what is coming. This can be hard to do with Regular Expressions, or if you are dealing with individual characters rather than tokens. Here is an example of parsing a tokenized section:

doc = Documentation()
...
method = Method()
position = 0
preamble = ""
in_preamble = True  # everything before the first subsection is preamble
tokens = tokenize(text)
while position < len(tokens):
    token = tokens[position]
    if token == "\\":
        if tokens[position + 1] == "subsection":
            in_preamble = False
            # tokens[position + 3] is the subsection title, and
            # tokens[position + 5] is the "{" that opens its body
            if tokens[position + 3] == "Usage":
                position, usage = get_tokenized_section(
                    position + 5, tokens
                )
                method.set_usage(usage)
            elif tokens[position + 3] == "Arguments":
                # skip this, we'll get with describe
                position += 5
            elif tokens[position + 3] == "Examples":
                position, examples = get_tokenized_section(
                    position + 5, tokens
                )
                method.set_examples(examples)
            elif tokens[position + 3] == "Returns":
                position, returns = get_tokenized_section(
                    position + 5, tokens
                )
                method.set_returns(returns)
            else:
                raise Exception("unknown subsection:", tokens[position + 3])
        elif tokens[position + 1] == "describe":
            position, describe = get_tokenized_section(
                position + 2, tokens
            )
            method.set_describe(describe)
        else:
            # any other macro, such as \if or \out: skip the backslash
            position += 1
    else:
        if in_preamble:
            preamble += token
        position += 1

method.set_preamble(preamble)
doc.add_method(method)
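
The helper get_tokenized_section() is not shown above. One possible shape is sketched here, under two assumptions: it is called with the position of the section’s opening "{", and it returns the position just past the matching "}" together with the tokens in between. Whether the real rd2md helper returns a token list or a joined string is not something this article specifies.

```python
def get_tokenized_section(position, tokens):
    """Return (next_position, inner_tokens) for the brace group that
    opens at tokens[position]. A hypothetical sketch, not rd2md's code."""
    if tokens[position] != "{":
        raise ValueError("expected '{' at position %d" % position)
    depth = 0
    start = position + 1
    while position < len(tokens):
        token = tokens[position]
        if token == "{":
            depth += 1
        elif token == "}":
            depth -= 1
            if depth == 0:
                # Matching close found: skip past it, return the inside.
                return position + 1, tokens[start:position]
        position += 1
    raise ValueError("unbalanced curly braces")


tokens = ["\\", "subsection", "{", "Usage", "}", "{",
          "a", "{", "b", "}", "}", "x"]
next_position, usage_tokens = get_tokenized_section(5, tokens)
```

This is the same depth-counting idea as get_curly_contents(), only applied to tokens instead of characters, which is what makes the look-ahead indexing in the loop above straightforward.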

That’s it! To see the finished project, check out the new Python-based rd2md. It is a pip-installable, open source Python library for generating markdown from R’s Rd files. We’ve used it on our own R documentation here:

https://www.comet.com/docs/v2/api-and-sdk/r-sdk/overview/

Is this a bit of an afternoon hack? Yes. It is composed of no fewer than four different methods of parsing. But it gets the job done, and it is the only working Rd-to-markdown converter that I know of. If I were to refactor it, I’d probably tokenize the entire file first, and then process it using the last method shown above. Remember Rule 3: Don’t be afraid to throw away code and start over!

If you would like to contribute to the GitHub repo, please do. If you have questions, please let us know in the Issues.

Doug is Head of Research at comet.com, an ML experiment-tracking and model-monitoring company.

Written by Douglas Blank, PhD