Extracting Information from Historical Genealogical Documents

How HTR (Handwritten Text Recognition) and Related Technologies Are Empowering Family Discoveries

Jon Morrey
Towards Data Science

--

Overview

This article explains how Machine Learning (ML) technologies such as HTR (Handwritten Text Recognition) and NLP (Natural Language Processing) are improving people’s prospects for learning about their family’s roots via historical documents. It also outlines some of the difficult (yet worthwhile) problems which remain for ML researchers to solve.

Introduction

Our world is awash in digital content. It is quite possible that more new text is authored in one year than was written throughout the millennia prior to the digital age. For example, on Twitter alone, there are an estimated 200 billion tweets a year (Twitter Usage Statistics, 2021). Assuming an average tweet length of 28 characters, that is over 5.5 trillion characters of new text a year on a single social media platform. By comparison, by 1975 (prior to the modern digital age), the United States Library of Congress had approximately 17 million volumes in its book collection [1]. Assuming an average of 450,000 characters per book, that equates to about 7.5 trillion characters. In other words, in one year, a single social media platform now produces almost as much text as all the books in the pre-digital Library of Congress!
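The back-of-envelope comparison above can be checked in a few lines. The tweet and book counts come from the sources cited; the average character lengths are the same assumptions stated in the text.

```python
# Back-of-envelope comparison: one year of tweets vs. the
# pre-digital Library of Congress book collection.
TWEETS_PER_YEAR = 200e9     # estimated tweets per year (2021)
AVG_TWEET_CHARS = 28        # assumed average tweet length

LOC_VOLUMES_1975 = 17e6     # LoC book volumes by 1975
AVG_BOOK_CHARS = 450_000    # assumed characters per book

twitter_chars = TWEETS_PER_YEAR * AVG_TWEET_CHARS   # 5.6 trillion
loc_chars = LOC_VOLUMES_1975 * AVG_BOOK_CHARS       # 7.65 trillion

print(f"Twitter, one year:  {twitter_chars / 1e12:.2f} trillion characters")
print(f"LoC books by 1975:  {loc_chars / 1e12:.2f} trillion characters")
```

Under these assumptions, a single year of tweets (5.6 trillion characters) lands within striking distance of the entire 1975 collection (7.65 trillion).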

The staggering amount of content online today can lead someone unfamiliar with efforts to digitize the world's genealogical records to believe that most historical (non-digital) content has already been digitized and made available, or even that the most important content has been. This is simply not true.

The process of turning physical historical documents into online searchable text is time consuming, complex, expensive, and hard to reproduce generically. And technology is hardly the only problem. Often, the bigger challenges are relationships with archives, needed contracts, navigation of local laws, and physical access to the source documents. Every major digitization project involves a large amount of “custom work.” This fact often causes large brokers of online information to look elsewhere for their content. Yet we underestimate the value of text still locked up in older, undigitized documents that we cannot access. In many cases, that undigitized content may be far more informative, interesting, and useful compared to anything currently available online.

It is important to remember that before personal computers and the Internet, written communication was more difficult and costly. People thus generally focused their writing efforts on what they believed was important. Also, the advent of mass printing in the 15th century did not change the fact that most written content still was not duplicated in any way. As a result, surviving physical documents often contain key information found nowhere in our vast digital world. Discoveries and digitization of such documents continue to change our global historical narratives. Furthermore, there may be documents that will change your personal historical narrative!

To illustrate, consider the experience of a FamilySearch user (who happens to be the author) in the United States wanting to know where in Germany his great-grandmother (Susanna Neises) was born and who her parents were. Not only did the historical marriage record below (figure 1) answer those questions, but it was also a starting point to identify and connect with relatives still living in Germany.

Figure 1: The marriage record (in Latin) which indicates the parents and birthplace of Susanna Neises. Image copyright FamilySearch.org. Used with permission.
Figure 2: The author’s mother (left) reunited with her Neises cousin in Germany in 2015. Image by author.

Enter the discipline of genealogy, which is the study of family history and the tracing of one’s lineage (family tree). Genealogists and family historians rely on historical documents like censuses, church and civil records, journals, wills, etc. to research lineages and ancestral stories far into the past. Unfortunately, to date, only a small fraction of historical genealogical documents has been made readily searchable online by name and other fields that enable you to home in on who or what you are seeking. The reason is simply one of costs versus benefits. Digital scanning and manual transcription remain expensive, so archives and genealogical businesses focus their resources on collections which appeal to the greatest number of paying customers — which usually means English records. This works out well for you if your ancestors appear in one of those important collections. However, often the information you need is somewhere in another historical document, or a different language, which has yet to be transcribed and rendered searchable. That document may not be valuable to the masses, but it is certainly important to you in unlocking additional clues to your family’s history.

OCR (Optical Character Recognition), HTR (Handwritten Text Recognition), and related information extraction technologies promise to change this cost equation and make virtually any historical document searchable online. Indeed, FamilySearch and others in our industry have already begun leveraging these technologies to favorable effect. On the other hand — as we will see — historical documents may contain the hardest variety of problems to solve. In this article, we will outline some of those challenges and what we are doing to address them.

Background: About Us

FamilySearch International is a non-profit organization which helps people discover their family’s history through its website, mobile apps, and in-person help at over 5,000 local family history centers. Operated by The Church of Jesus Christ of Latter-day Saints, FamilySearch provides these services free of charge to everyone, regardless of tradition, culture, or religious affiliation. FamilySearch resources help millions of people around the world discover their heritage and connect with family members.

FamilySearch grew out of The Genealogical Society of Utah, which was founded by the Church in 1894 to help the waves of pioneers immigrating to the Rocky Mountains from across the world to document their family’s history for future generations in their new homeland. In 1938, the Society began microfilming (photographing) historical documents found in church and government archives worldwide.

The Birth of Computer-assisted Indexing (CAI)

In the early 2000s, after The Genealogical Society of Utah became FamilySearch, we began to convert these microfilms (over 2.4 million rolls to be exact) to digital formats to provide more access to more people online. We also collaborated with other genealogical organizations to expand our collective holdings. To date, FamilySearch has over 12 billion images of historical records freely available on our website. Thanks to efforts of volunteers and friend organizations, many of those images also have text-searchable transcriptions of key information (which we call “indexes”). However, human transcription/indexing projects struggle to keep pace with image acquisition. Often, it can take hours to index a single document. Older and non-English collections can be particularly challenging for our base of online volunteers. Without a fundamental change to our indexing approach, most images will remain unindexed — and therefore undiscoverable for most — for years to come.

To that end, in about 2011, FamilySearch began serious investments in technology, research, data, and talent to automatically transcribe and index all our historical document images. We also sought out collaborative interests in commercial and academic research groups to solve this problem together. Since then, we have assembled a capable internal team of research scientists and engineers dedicated to what we now call “Computer Assisted Indexing” (CAI) of historical records.

Our earliest CAI efforts involved Natural Language Processing (NLP) of modern obituaries found in 26 million US newspapers. Prior to that, volunteers had to carefully read each obituary to extract key genealogical information such as names, family relationships, and event dates. In 2015, we were able to automatically index and publish over 170M of these obituaries using an NLP system we built in-house. Initially, we applied these techniques to “born-digital” obituary text which required no previous OCR. Afterwards, we began to develop machine learning technology to find, segment, and transcribe images of obituaries printed on paper.

Figure 3: A printed obituary indexed entirely by our CAI system (in partnership with GenealogyBank™). Image copyright FamilySearch.org. Used with permission.

At this point, it is worth noting why we opted to build much of our CAI stack in-house. In the case of historical documents where we could use OCR technology, we discovered early that there was no commercial OCR software which could read historical newspaper print with sufficient accuracy to produce usable indexes. In fact, this continued as a recurring theme throughout our problem set. Excellent commercial solutions often exist, but they are tuned to modern documents which produce revenue for the solution provider’s customers. By contrast, historical documents are the domain of non-profits, academics, churches, archives, and a small number of genealogical organizations. These segments represent a small potential revenue generator for such services compared to the domains of medical billing, law firms, etc. That said, FamilySearch and our friend organizations are rich in terms of the training data needed to make machine learning work on historical documents. Even a few large tech companies have expressed astonishment at the image data we have accumulated and gradually annotated over the years.

Our decision to build (rather than buy) OCR proved especially beneficial as we shifted our focus from historical newspapers towards historical handwritten documents. The term “OCR” (Optical Character Recognition) is a little misleading in that it implies segmentation and recognition of each distinct character. Instead, our engine segmented at the line level and recognized sequences of characters. We employed a language model to further refine each character-level prediction in the context of actual words. This approach works better for HTR (Handwritten Text Recognition) because distinct characters are often extremely hard to segment in cursive handwriting. For us, the move towards HTR was simply a matter of utilizing new training data. In full disclosure, we originally built our OCR system with HTR in mind.
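To make the language-model step concrete, here is a toy sketch of how a word-level model can resolve ambiguous character sequences emitted by a recognizer. The word frequencies, function names, and the rn/m cursive confusion example are all illustrative assumptions, not FamilySearch code; a production system would use far richer models and beam search rather than a flat candidate list.

```python
# Toy unigram "language model": relative frequencies of words seen in
# previously transcribed records. Values here are made up for illustration.
WORD_FREQ = {"john": 0.04, "smith": 0.03, "born": 0.05, "bom": 0.0001}

def lm_score(candidate):
    """Score a candidate line transcription by its words' frequencies."""
    score = 1.0
    for word in candidate.lower().split():
        score *= WORD_FREQ.get(word, 1e-6)  # smooth unseen words
    return score

def rescore(candidates):
    """Combine recognizer confidence with the language-model score.
    `candidates` is a list of (transcription, recognizer_confidence)."""
    return max(candidates, key=lambda c: c[1] * lm_score(c[0]))[0]

# In cursive, "rn" is easily confused with "m"; the language model
# prefers the reading that forms a real word.
best = rescore([("John Smith bom", 0.52), ("John Smith born", 0.48)])
```

Even though the recognizer slightly prefers "bom", the language model overwhelmingly favors "born", so the rescored output is the correct reading.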

Figure 4: An example 19th century land deed which we transcribed with HTR in 2016. Image copyright FamilySearch.org. Used with permission.

About this time, we also partnered with researchers at Brigham Young University who helped shape our Machine Learning (ML) recognition approach. The result was a state-of-the-art HTR system for historical paragraph/prose-style, handwritten cursive documents. Towards the end of 2018, we ran this system on over 110M images of historical handwritten will, probate, and land records across the United States to benefit from a wide variety of handwriting styles and samples. To the best of our knowledge, that was (and may still be) the largest single corpus ever processed through HTR. FamilySearch has not officially released these transcripts/indexes yet, but we anticipate doing so soon.

We have since continued expanding our OCR/HTR, NLP and related capabilities by building models for several new languages/scripts and by creating a production ML pipeline. Among other things, we now have a recognizer which can transition between handwriting and printed text in a single line. By the end of 2021, we will have published more than 150 million Spanish and Portuguese records like those in Figures 5–7. In fact, millions of these records are already available to users on our site today.

Figure 5: Automatically Indexed Spanish Baptism Record from Bolivia. Image copyright FamilySearch.org. Used with permission.
Figure 6: Automatically Indexed Spanish Baptism Record from the Dominican Republic. Image copyright FamilySearch.org. Used with permission.
Figure 7: Automatically Indexed Civil Death Record from Brazil. Image copyright FamilySearch.org. Used with permission.

In 2021, we will have published more new Spanish records via CAI than any other sole source of record indexing. We anticipate that this trend will continue across languages. At the same time, we recognize that even our best ML models are unlikely to match the overall accuracy and reasoning ability of experienced human indexers. To account for this, FamilySearch is designing new online and mobile experiences which allow users to correct indexing mistakes “just in time” wherever/whenever they find them. We are also using automation to pre-group human indexing tasks by similarity to reduce the cognitive load on our volunteers.

Ongoing Challenges

While we have already had remarkable success, the ML research problems grow increasingly difficult as we march towards our goal of making every image searchable — and in a growing number of languages. In fact, historical documents like the ones we have may contain some of the most difficult document processing problems to be solved overall. In this next section, we will detail some of those challenges and what we are doing to attack them.

Languages and Scripts

Our goal is to make genealogical research possible for everybody, regardless of their language, nationality, or background. FamilySearch officially has 227 different languages in its current holdings. Unofficially, the number is probably greater. We have found languages in our record images that nobody on our team had previously heard of! Even these obscure documents are important in that they contain the names and key life events of somebody’s ancestors. In fact, if we research our family trees back far enough, all of us will begin to encounter languages we do not understand.

Figure 8: Chinese alongside Manchu (a near-extinct language). Image copyright FamilySearch.org. Used with permission.
Figure 9: Devanagari script from India. Image copyright FamilySearch.org. Used with permission.
Figure 10: An archaic English script (ca. 1665). Image copyright FamilySearch.org. Used with permission.

The term “language” itself is an oversimplification of the problem. For one thing, historical documents often contain sub-languages which are unique. Within each major language, specialized topics have their own vocabularies and concepts. To illustrate, you may have learned conversational Spanish in school, but try describing Data Science concepts to somebody in Spanish! There is also an important distinction between language and written script. Speakers of a language may be perfectly fluent, but still entirely unable to read the script. In fact, you may even feel daunted by some of the images in this article if you did not learn cursive handwriting in grade school! Modern cursive is only one of many different scripts found in our document holdings.

These factors compound in the context of the data-hungry supervised ML algorithms of today. To automatically read a document in a particular language and script, we need plenty of carefully annotated examples of script in that language. At the same time, it can be difficult to find people who can do that needed annotation. There are even a few “extinct” languages and scripts in our current holdings which can be understood by only a handful of academics in the world.

Our primary approach to this problem thus far has been transfer learning coupled with harvesting training data from unconventional sources. For example, for our Portuguese HTR model, we initially annotated only 1/10th as much training data as for Spanish. We also selectively incorporated Spanish data from our much larger Spanish training corpus into our Portuguese model. Finally, we patched in more Portuguese training examples from our Portuguese records that were already indexed. The accuracy of the resulting Portuguese HTR model is quickly approaching that of Spanish at much less cost to FamilySearch. While Spanish and Portuguese are admittedly similar, we have noted that training data selectively shared even between more dissimilar languages (generally within the same language family/script) can still reduce the total amount of training data required.
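The corpus-blending part of this strategy can be sketched as follows. The dataset sizes, the 30% sampling ratio, and the function name are illustrative assumptions; the actual selection of which related-language examples to share is more deliberate than random sampling.

```python
# Illustrative sketch of blending a small target-language HTR corpus
# with data from a larger related-language corpus, plus examples
# harvested from already-indexed target-language records.
import random

def build_training_corpus(portuguese, spanish, indexed_portuguese,
                          spanish_fraction=0.3, seed=0):
    """Return a shuffled mix of (image, transcription) training pairs."""
    rng = random.Random(seed)
    n_spanish = int(len(spanish) * spanish_fraction)
    mixed = (list(portuguese)
             + rng.sample(list(spanish), n_spanish)
             + list(indexed_portuguese))
    rng.shuffle(mixed)
    return mixed

corpus = build_training_corpus(
    portuguese=[("img_pt_%d" % i, "linha %d" % i) for i in range(100)],
    spanish=[("img_es_%d" % i, "línea %d" % i) for i in range(1000)],
    indexed_portuguese=[("idx_pt_%d" % i, "nome %d" % i) for i in range(50)],
)
```

The payoff is that the target-language model sees many more examples of shared letter forms and vocabulary than its own small corpus provides.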

Regarding archaic scripts and languages, an idea we have entertained is “working backwards through time.” FamilySearch’s holdings are so large that the evolution of a particular language or script towards something modern (readable) can sometimes be observed in our own documents. By sequencing these documents in reverse order (generally easiest to hardest), we may be able to train our recognizer using a semi-automatic “rinse-and-repeat” approach until the archaic text is understood.
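The control flow of that "rinse-and-repeat" idea might look like the sketch below. Everything here (function names, the confidence threshold, the toy recognizer) is an assumption for illustration; a real system would retrain the recognizer between batches rather than merely accumulate data.

```python
# Semi-automatic harvest of training data, working from the most
# modern (easiest) batch of documents back towards the most archaic.
def bootstrap_backwards(batches_newest_first, recognize, confidence,
                        threshold=0.9):
    """Accumulate (image, text) pairs whose predictions look confident."""
    training_set = []
    for batch in batches_newest_first:          # easiest -> hardest
        for image in batch:
            text = recognize(image, training_set)
            if confidence(image, text) >= threshold:
                training_set.append((image, text))
        # (a production system would retrain the recognizer here
        #  before tackling the next, more archaic batch)
    return training_set

# Toy stand-ins just to exercise the control flow:
recognize = lambda image, training_set: image.upper()
confidence = lambda image, text: 0.95           # always confident in the toy
harvested = bootstrap_backwards([["page1880", "page1850"], ["page1750"]],
                                recognize, confidence)
```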

Variety of Record Types

The content in about 60% of FamilySearch’s historical documents fits a small number of pre-defined “record type” categories (familiar to genealogists) such as birth certificates, censuses, etc. However, approximately 40% of our historical documents are difficult to categorize in a genealogical sense. This categorization is important when it comes to producing a structured, searchable index. It is even more important in the context of “record hints,” which are automatic new linkages between historical record indexes and well-researched people already documented in family trees. FamilySearch users depend heavily on these record hints to research their own family trees.

The fundamental problem with record types is they require extracting information according to a fixed ontology or schema. In turn, at some point, that extraction process typically involves an inflexible rules-based approach (read: hard-coded software rules instead of ML). This works okay for our major record types but is problematic for the long tail of documents which are difficult to categorize. An example of such a document is a court record, which might be anything from criminal trial proceedings to a ledger of financial transactions. Of course, we can automatically transcribe those documents and provide some level of searchability, but it is difficult to interpret them in a generic (yet meaningful) way.

Figure 11: A monetary receipt from one of our less-structured collections. Image copyright FamilySearch.org. Used with permission.

Arbitrary document “understanding” of this kind may not be entirely solvable with current ML technology as it may require human-level reasoning. We have sometimes joked that by the time we have AI (Artificial Intelligence) which can solve these kinds of problems, the AI may not *want* to! That said, we are exploring alternate record searching paradigms which lend themselves to less-structured data (including full text search). Also, even with unstructured/unclassifiable documents, our technology can accurately detect common entities such as person names, organizations, dates, events, etc. We can often also relate those entities to each other in meaningful ways. In other words, our system may not always know exactly what a document means (genealogically), but it may still find useful structured information like people and their relatives.
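To give a flavor of entity detection on unstructured transcripts, here is a deliberately minimal sketch using regular expressions. The production system uses trained NLP models, not patterns like these; the patterns, sample text, and entity categories below are illustrative assumptions.

```python
# Minimal entity spotting on an unstructured transcript:
# find candidate person names and dates without knowing the record type.
import re

DATE_RE = re.compile(
    r"\b\d{1,2} (?:January|February|March|April|May|June|July|"
    r"August|September|October|November|December) \d{4}\b")
NAME_RE = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")  # Firstname Lastname

def extract_entities(text):
    """Return candidate names and dates found anywhere in the text."""
    return {
        "names": [m.group(0) for m in NAME_RE.finditer(text)],
        "dates": [m.group(0) for m in DATE_RE.finditer(text)],
    }

entities = extract_entities(
    "Received of Samuel Harris the sum of five dollars, 3 March 1851.")
```

Even without knowing that this is a monetary receipt, the extraction surfaces a searchable name and date, which is exactly the kind of partial structure described above.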

Complex Layouts

Extracting structured information from historical documents often involves more than understanding (NLP) transcribed prose text. Many documents convey additional semantics through non-textual cues such as font sizes/weights, dividing lines, arrows, diagrams, etc. Tables are a good example (see figures 12 and 13). Tables are not “natural language” at all. Yet, as people, we readily understand the relationship between rows, columns, headers, etc. and can interpret the data accordingly. To be sure, there are ML patterns which can approximate our intuitive visual understanding ability in simple cases. However, in our experience, these techniques break down in more complex cases.

Figure 12: A simple tabular layout which is easy to process with a combination of ML and generic rules. Image copyright FamilySearch.org. Used with permission.
Figure 13: A complex tabular document with more difficult visual semantics. Image copyright FamilySearch.org. Used with permission.
Figure 14: A land deed with a supplementary diagram. Image copyright FamilySearch.org. Used with permission.
Figure 15: A document with visual semantics that are difficult even for human readers. Image copyright FamilySearch.org. Used with permission.
Figure 16: A particularly challenging layout! Image copyright FamilySearch.org. Used with permission.

In some ways, the problem set of complex layouts resembles that of non-standard record types mentioned earlier. We can already “solve” many of these problems with a combination of ML and code rules, but not in a generic way which reliably adapts itself to previously unseen layouts. Nevertheless, our engine still does a remarkable job of segmenting and correctly transcribing text in many of these complicated layouts. Also, we can still recognize entities (such as names, places, etc.) without understanding the layout entirely.

Document Damage, Noise and Poor Image Quality

Unlike most business documents scanned in a modern office, historical documents may be hundreds of years old. During their long lifetime, they may have been subject to mold, insects, fire, and all manner of damage. Most commercial document processing systems are not resilient to these kinds of problems. Besides document damage, many of our images were originally microfilmed (photographed) starting as far back as the 1930s. The film itself may have deteriorated or the exposure was never good to begin with. And, in many cases, it is not an option to simply rescan the original documents (which may no longer exist).

Figure 17: A Chinese Jiapu (genealogy book) with worm damage. Image copyright FamilySearch.org. Used with permission.
Figure 18: A severely deteriorated document. Image copyright FamilySearch.org. Used with permission.
Figure 19: A document with “bleed through” from the reverse page. Image copyright FamilySearch.org. Used with permission.

We obviously cannot extract information from a document that is not there. That said, our language models do a surprisingly reasonable job of “guessing” missing/unreadable text that appears frequently in other documents. Also, we have found that our recognition models can be resilient to noise (such as bleed-through, stains, etc.) when trained with sufficient examples.

As a different approach, we have experimented with models to remove common noise from an image up front before our recognition process begins. The training data for these models were synthetic. We simulated “noisy” images by overlaying clean historical document images with fake artifacts such as stamps, stray marks, and odd paper textures. The model was an autoencoder architecture which accepted the synthetic image as an input and produced the clean image as the target. This experiment was moderately successful in certain scenarios (see figure 20). However, our general approach is towards training our recognizer to cope with noise directly.

Figure 20: A stamp (“noise”) scrubbed out with our autoencoder. Images copyright FamilySearch.org. Used with permission.
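The synthetic-pair generation for the autoencoder experiment can be sketched as below. Images here are tiny grayscale grids of pixel values (0 = black ink, 255 = paper), and the rectangular "stamp" artifact, coordinates, and function names are illustrative assumptions; the real training data used full page images and a wider variety of fake artifacts.

```python
# Generate (noisy, clean) training pairs by overlaying clean page
# images with a fake artifact. The autoencoder takes the noisy image
# as input and the clean image as its reconstruction target.
import random

def add_fake_stamp(image, rng, intensity=120):
    """Return a noisy copy with a rectangular 'stamp' darkened onto it."""
    h, w = len(image), len(image[0])
    top, left = rng.randrange(h // 2), rng.randrange(w // 2)
    noisy = [row[:] for row in image]
    for y in range(top, min(top + h // 3, h)):
        for x in range(left, min(left + w // 3, w)):
            noisy[y][x] = min(noisy[y][x], intensity)  # darken pixel
    return noisy

def make_training_pairs(clean_images, n_variants=2, seed=0):
    """Each clean image yields several differently-corrupted variants."""
    rng = random.Random(seed)
    return [(add_fake_stamp(img, rng), img)
            for img in clean_images for _ in range(n_variants)]

clean = [[[255] * 8 for _ in range(8)]]   # one blank 8x8 "page"
pairs = make_training_pairs(clean)
```

Because the clean originals are known exactly, the model gets a perfect supervision signal without anyone manually labeling noisy documents.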

Some source documents may be too physically damaged to recover information with post-imaging digital clean-up alone. However, technology may provide hope even when a document appears unreadable to the human eye. One of our industry friends, Ancestry.com, has done some promising research in multispectral imaging of historical documents. In this research, Ancestry demonstrated that they could recover vital information hidden beyond the visible spectrum.

Understanding Local Context

Although we consider most of our historical documents as “genealogical documents,” they rarely were created with genealogists in mind. Most were created for legal, taxation, or religious purposes. The original authors did not expect their document would be read by people trying to discover their family roots, often hundreds of years later and in another country. On the contrary, record keepers assumed readers would have local context pertinent to the area and original intended purpose. Abbreviations for places are a good example. Our death records from the state of São Paulo, Brazil contain obscure place abbreviations such as “SJC.” In a global genealogical context, that abbreviation is almost impossible to resolve. It does not appear in any standard list of abbreviations for Brazilian cities. In fact, it might be easily misidentified as the airport code for San Jose, California. However, a civil registrar living in São Paulo would immediately recognize “SJC” as “São José dos Campos,” a major city in São Paulo.

Fortunately, FamilySearch already has a vast database of standardized values which can be easily filtered to different levels of geographic boundaries. Whenever we encounter a new local variation (such as “SJC” above), our approach thus far has been to manually update our standards database with that new information.

Conclusion

On behalf of the genealogical sector, FamilySearch would like to thank ML practitioners across the greater industry whose research is helping unlock the value of our historical document images. Advancements in ML are helping thousands of everyday people learn about their families — and ultimately themselves. We invite all interested readers to experience this firsthand at FamilySearch.org or connect with us directly.

[1] Cole, John Young, For Congress and the Nation: A Chronological History of the Library of Congress (1979), Library of Congress
