An overview and comparison between concepts and where they fit in

Donald Knuth formally introduced the idea of "Literate Programming" in 1984 (Knuth, 1992). In his book he suggested the following change in attitude within the construction of programs:
Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want the computer to do.
At Stanford University the idea of literate programming was realized through the WEB
system. It involves combining a document formatting language with a programming language (Knuth, 1992). In the original report submitted to The Computer Journal in 1983, the example presented of the WEB
system consisted of the combination of TeX
, and PASCAL
. This choice is not surprising, as Donald Knuth created TeX
. The code was written in a single.WEB
file. A reason Knuth was happy with the term WEB
is because he considered a complex piece of code as a web simple materials delicately placed together (Knuth, 1992). The ‘weaving’ process involves separating out the document formatting language into a separate file. While the ‘tangling’ entails extracting the programming language to produce a machine-executable code (Knuth, 1992). This process was depicted as figure 1 (Knuth, 1992)

.WEB
file. The ‘weaving’ separates out the TeX
from the .WEB
file compiles it to an output document, while the ‘tangling’ extracts the PASCAL
code to produce a machine-executable program. The diagram was created in draw.io.If you consider using a tool for literate programming their website has a long list of tools which can be used for writing code with literate programming ("More Tools"). While it is comprehensive, it does not list a fantastic tools which will be mentioned later. There are also several other projects online which seem to be popular, such as Zachary Yedidia’s Literate project.
Data Science and Reproducible Research
The term "data science" likely first appeared in 1974 and initially defined by the following quote (Cao, 2017).
The science of dealing with data, once they have been established, while the relation of the data to what they represent is delegated to other fields of data.
The term has increased in popularity since 2012 (Cao, 2017). This increase is likely due to the emergence of data mining, big data, and the advances in Artificial Intelligence (AI) and Machine Learning (ML). Due to the accumulation of large amounts of data, interdisciplinary tools are being combined to form a "new" field. A more modern and comprehensive definition of data science is as follows (Cao, 2017).
From the disciplinary perspective, data science is a new interdisciplinary field that synthesizes and builds on statistics, informatics, computing, communication, management, and sociology to study data and its environments (including domains and other contextual aspects, such as organizational and social aspects) in order to transform data to insights and decisions by following a data-to-knowledge-to-wisdom thinking and methodology.
A central point in literate programming is the intent of ‘weaving’ and ‘tangling’ the original document into two separate files. The original web document was not necessarily intended to be used as the final result. However, the modern use of "reproducible research" implies using a similar literate programming web document as the final result (Goodman et al., 2018).

The computer scientist Jon Claerbout coined the term "reproducible research" and described it with a reader of a document being able to see the entire process from the raw data to the code producing tables and figures (Goodman et al., 2018). The term can be slightly confusing, as reproducibility has long been an important part of the research and the scientific method.
One of the issues in with the use of software in scientific field generally, is the reproducibility of the research being conducted (Hinsen, 2013). Therefore, the greater context surrounding the code is more important and possibly not self explanatory through the code alone. Experts within the field of exploratory data science even struggle keeping track of the experiments they conduct. It can lead to confusion of what experiments have been performed and how they came by the results (Hinsen, 2013). Since digitalization and digital tools have become a more important part of many fields, reproducible research has become popular in such fields such as epidemiology, computational biology, economics, and clinical trials (Goodman et al., 2018).
While literate programming and reproducible research are closely related in form, they were intended for slightly different use-cases. Literate programming is not specifically for research, but is a more broad approach to programming. While reproducible research focuses on providing the full context of an experiment or study and not making the code itself more legible.
Some of the more popular tools for reproducible research are Jupyter Notebooks for programming in python (example in figure 2), and the integration of SWeave
and knitr
into R Markdown for programming in the statistical language R (Kery et al., 2018). Jupyter Notebooks is an open source spin-off project from the IPython (Interactive Python) project. Python is considered a fairly easy language to learn, there are several python projects for both data manipulation, and support for different machine learning frameworks such as Tensorflow and PyTorch. This makes Jupyter Notebooks particularly well suited for data science.
Readable and Clean Code
Knuth’s attitude change mentioned above was primarily about making code more readable for humans. Literate programming is a method of accomplishing this goal, but it is not the only avenue to take. What if the code was written so well, it was completely self-explanatory.
Robert C. Martin, affectionately called "Uncle Bob", is a software engineer and the author of the bestselling book Clean Code: A Handbook of Agile Software Craftsmanship (Martin & Coplien, 2009). He has an interesting talk on clean code where he explains that code should read like "well written prose". In his book he writes (Martin & Coplien, 2009):
Indeed, the ratio of time spent reading versus writing is well over 10 to 1. We are constantly reading old code as part of the effort to write new code. …[Therefore,] making it easy to read makes it easier to write.
While writing readable code is something to always strive towards, it will likely never be completely self-explanatory. Therefore, it would likely be good to documentation to complement the code.
Self-Documentation with Docstrings
Emacs has a nice method of self-documentation which is very helpful to understanding the editors source code written in elisp
. It uses documentation strings (or docstrings), which are comments directly written into the code. Writing comments in code to describe functions is not at all uncommon. However, traditional in-file documentation is stripped from the source during runtime, such as Javadocs. Docstrings are maintained throughout the runtime of the code in which it is written. The programmer can then interactively inspect this throughout the runtime of the program.
Some notable languages which support docstrings include:
- Lisp/Elisp
- Python
- Elixir
- Clojure
- Haskell
- Gherkin
- Julia
Emacs is almost like working in an lisp
terminal. The editor is written entirely in a variant of lisp
, called elisp
. For example, in order to split a window vertically, the function split-window-below
is called. The code snippet below is the start of the source code for this function.
When calling the function describe-function
and inputting the function name, "split-window-below", it will generate the following output in a mini-buffer (a kind of little buffer). This documentation is created dynamically during runtime.
split-window-below is an interactive compiled Lisp function in ‘window.el’.
It is bound to SPC w -, SPC w s, M-m w -, M-m w s, C-x 2,
(split-window-below &optional SIZE)
Probably introduced at or before Emacs version 24.1.
Split the selected window into two windows, one above the other. The selected window is above. The newly split-off window is below and displays the same buffer. Return the new window.
If optional argument SIZE is omitted or nil, both windows get the same height, or close to it. If SIZE is positive, the upper (selected) window gets SIZE lines. If SIZE is negative, the lower (new) window gets -SIZE lines.
If the variable ‘split-window-keep-point’ is non-nil, both windows get the same value of point as the selected window. Otherwise, the window starts are chosen so as to minimize the amount of redisplay; this is convenient on slow terminals.
[back]
Comparing the docstring in the source code and the output of describe-function
, there is more information added to the output documentation. This was a slight aside, but using describe-*
functions in Emacs is probably of the most useful and helpful ways to learn emacs. However, this does show a benefit of how docstrings are used in emacs.
Org-Mode for Literate Programming and Reproducible Research
While Jupyter Notebooks are a fantastic way of writing reproducible research, they are not a method of literate programming as they are not intended to be ‘weaved’ and ‘tangled’. The original tool such as WEB
system for literate programming, does not allow for compiling embedded code interactively. However, Org-mode was the first to provide full support for reproducible research and literate programming (Schulte et al., 2012).
Org-mode is something called a major mode in emacs. An .org
file is essentially a plain text markup language, but there are so many things that can be done in org-mode it is mind-boggling. With some configuration it can be used as almost anything including an advanced agenda system, calendar, financial ledger, zettlekasten note taking system (org-roam), and an exporter into almost any text formatting. Even that was an inadequate list, but for those interested this is a more complete list.

.org'
file with code blocks in emacs-lisp
which will be tangled to a file called spacemacs.el
. This is a screenshot of my .spacemacs.org literate configuration file.Using a package called org-babel
it can allow for both ‘weaving’ using org-mode’s built in export functionality, and ‘tangling’ using org-babel-tangle
of the entire file (see figure 3). As with other markup languages,.org
files already support rich text formatting of code blocks. With simple tags and commands the code can run code blocks can be run dynamically by using certain session
tags. Running embedded code blocks in this manner through the same, or multiple, sessions in the same file is important for Reproducible Research. More details on using org-mode for both literate programming and reproducible research is the paper by Schulte et al., 2012. A blog post which explains and demonstrates how to use org-babel
can be found here.
Closing thoughts
While literate programming is the a useful tool for providing more context to programming projects, in many cases writing self-explanatory code might be the best avenue for pure software engineering. Reproducible research is more suited for explaining a study or an experiment which is particularly useful in the scientific practice today as the use of digital tools continues to grow. Contrary to software engineering, these use-cases aren’t well suited for simply writing self-explanatory code. Literate programming and reproducible research attack mixed natural language and computational language documents for different ends (Schulte et al., 2012). While literate programming introduces natural language to programming files, reproducible research adds embedded executable code into natural language documents (Schulte et al., 2012).
If humans were capable of easily writing in machine code, there would be no need for all the different types of programming languages. The reason we use programming languages is to be able to interpret and understand what we are instructing the computer to do. Therefore, making the code understandable should be a top priority. Even though Knuth’s literate programming might not be suitable in all settings, I think his attitude about telling the reader what we are instructing the machine to perform is valuable in any programming situation. This is especially true for collaboration and long term complexity of coding projects.
While some of the differences between literate programming and reproducible research have been mentioned, particular examples of literate programming were not discussed. Configuration files are ideal for literate programming. They typically require explanations of the context of the settings or an explanation of the setting itself. After writing the literate configuration, they can be tangled to generate the actual configuration. I hope to write a post soon about showing how I created a literate configuration for spacemacs.
Bibliography
[1] Knuth, D. E., Literate programming (1992), Cent. Study Lang. Inf.
[2] Cao, L., Data science: A comprehensive overview (2017), ACM Comput. Surv., Vol. 50, pp. 1–42. Association for Computing Machinery. Retrieved from https://dl.acm.org/doi/10.1145/3076253
[3] Goodman, S. N., Fanelli, D., & Ioannidis, J. P., What does research reproducibility mean? (2018), Get. to Good Res. Integr. Biomed. Sci. (Vol. 8, pp. 96–102). Springer International Publishing. Retrieved from www.ScienceTranslationalMedicine.org
[4] Hinsen, K., Software development for reproducible research (2013), Comput. Sci. Eng., 15(4), 60–63.
[5] Kery, M. B., Radensky, M., Arya, M., John, B. E., & Myers, B. A., The Story in the Notebook: Exploratory Data Science using a Literate Programming Tool (2018), Retrieved from https://doi.org/10.1145/3173574.3173748
[6] Martin, R. C., & Coplien, J. O., Clean code: a handbook of agile software craftsmanship (2009), Upper Saddle River, NJ: Prentice Hall. Retrieved from https://www.amazon.de/gp/product/0132350882/ref=oh%5Fdetails%5Fo00%5Fs00%5Fi00
[7] Schulte, E., Davison, D., Dye, T., & Dominik, C., A multi-language computing environment for literate programming and reproducible research (2012), J. Stat. Softw., 46 (3), 1–24. Retrieved from [https://www.jstatsoft.org/index.php/jss/article/view/v046i03](https://www.jstatsoft.org/index.php/jss/article/view/v046i03)/v46i03.pdf https://www.jstatsoft.org/index.php/jss/article/view/v046i03
[8] More Tools. Lit. Program. Retrieved from http://www.literateprogramming.com/tools.html
Originally published at http://www.olavpedersen.com on March 13, 2021.