
Build an interface to parse text from any type of document in 1 line of code

Extract text from txt, docx, pdf, html, pptx files, and more, using object-oriented design patterns.

Photo by Clayton Cardinalli on Unsplash

This post is about applying object-oriented design principles to a common data science task: extracting text from documents. We will build a maintainable and customisable interface to extract text from a wide variety of file types. By interface I mean an object, i.e. an instance of a class, which can extract text from a document, regardless of the file type, in a single line. You’ll need this functionality if you’re building text mining applications and want to extract text data from a variety of file types.

There’s a good way and a not so good way to write code to perform this task. This is how you shouldn’t do it:
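A minimal sketch of the kind of function meant here. The function name parse_file is hypothetical, and the use of the third-party python-docx package for docx files is an assumption, not necessarily the library used in the original listing.

```python
from pathlib import Path


def parse_file(path):
    """Anti-pattern: one function detects the type, picks a branch, and parses."""
    # Responsibility 1: determine the file type.
    extension = Path(path).suffix.lower().lstrip(".")

    # Responsibilities 2 and 3: choose a logical path and execute it.
    if extension == "txt":
        with open(path, encoding="utf-8") as f:
            return f.read()
    elif extension == "docx":
        import docx  # third-party python-docx (assumed), imported only when needed

        document = docx.Document(str(path))
        return "\n".join(paragraph.text for paragraph in document.paragraphs)
    else:
        raise ValueError(f"Unsupported file type: {extension!r}")
```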

This function serves multiple purposes: determining the file type, choosing which logical path to execute with the if-elif-else structure, and finally executing the path. This is problematic from a software engineering perspective, since it violates the single responsibility principle [1].

Suppose you want to add new logical paths or alter an existing one to provide more functionality, or because a package update is forcing you to change your code; in order to do this, you have to alter the code which is responsible for parsing all file types. This increases the likelihood that a bug will break your application completely, as opposed to partially. Furthermore, you can quickly end up with a giant function with tens of elif statements: the more you add, the more likely you are to introduce a bug, and nobody (including yourself) is going to enjoy trying to understand what is happening when it comes to using, altering or fixing that code. We need to break it down into parts.

We’re going to refactor this code by employing the factory method pattern, an object-oriented design pattern made popular in the 1990s [2] that is used for generating new instances of a class. In our case, the object we want to generate will be a string, but initially, we don’t know exactly how we’re going to do it, because that depends on the file type that we want to read.

The factory method pattern defines an interface that accepts any file type, then defers the implementation of object generation to a sub-class or a function. To clarify this concept, let’s talk about what’s happening in the following example.

Refactoring using the factory method pattern

The class DocParser has a single method, parse. When a user calls parse with a file path, a series of events occurs:

  1. The path is passed to the function _getformat and the returned value is stored in the variable parser.
  2. _getformat extracts the file extension from the path and passes this to another function, _getparser, in its return statement.
  3. Inside _getparser, one of three logical paths is executed depending on the file extension using an if-elif-else structure. If a valid file extension is provided, one of two functions is returned, _parsetxt or _parsedocx.
  4. At this point, the variable parser stores either _parsetxt or _parsedocx. The file path is then passed to the function stored in parser and the resulting string object is returned to the user.

This is why we refer to an instance of DocParser as an interface: its only responsibility is to act as the interface between the user and the logic that generates the object that the user has requested. After receiving a file path, the interface passes the responsibility of determining the file type, choosing a parsing function, and actually generating the object to other functions. This is what we mean when we say the interface defers the implementation of the class.

The functions _parsetxt and _parsedocx are called the concrete implementations of the class. They are the parts that generate the string object for a given file type.

Now the initial function has been broken down into multiple parts, each with a single responsibility. Why is this better?

We can now add and change the functionality of our interface without having to touch other parts of the code: we simply define a new concrete implementation as a function and add another elif statement to _getparser. This is much easier to maintain, especially when the number of file types that you want to parse increases, and means that if one part stops working the rest of the code will be unaffected. Here is how to adapt the above example to provide functionality for parsing pdf, html, and pptx files.

The first thing we have done is add more elif statements to _getparser. However, since we only have a return statement in each logical path, the function remains readable. Then we define three new functions, _parsepdf, _parsehtml, and _parsepptx, while the original parsing functions remain untouched. It’s easy to see how you could continue this pattern to parse any type of file you desire.

Using the common interface

By saving the above code in a single file, _parsefile.py, we can import the interface into our applications with a simple import statement. All we need to do is instantiate a DocParser object, and we can read text from all of our defined file types in one line using the parse method.

In this example, we’re calling the parse method in a loop with paths to txt, docx, pdf, html, and pptx files, and printing the results to the terminal. You can access the test files at the project GitHub repo, and I downloaded the html text file from here. The output is shown in the animation below.
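A self-contained sketch of that usage. To keep it runnable on its own, it defines a small txt-only stand-in for the imported DocParser and creates two temporary test files; in a real application the class would come from the import statement described above, and the file names here are hypothetical.

```python
import tempfile
from pathlib import Path


# Stand-in for the imported interface (txt only), so this snippet runs by itself.
def _parsetxt(path):
    return Path(path).read_text(encoding="utf-8")


def _getparser(extension):
    if extension == "txt":
        return _parsetxt
    raise ValueError(f"Unsupported file type: {extension!r}")


class DocParser:
    def parse(self, path):
        parser = _getparser(Path(path).suffix.lower().lstrip("."))
        return parser(path)


doc_parser = DocParser()

with tempfile.TemporaryDirectory() as folder:
    # Hypothetical test files, created here only for the demonstration.
    paths = [Path(folder) / "first.txt", Path(folder) / "second.txt"]
    paths[0].write_text("first document", encoding="utf-8")
    paths[1].write_text("second document", encoding="utf-8")

    texts = []
    for path in paths:
        text = doc_parser.parse(path)  # the single line that extracts the text
        print(text)
        texts.append(text)
```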

The text contained in each test file is sequentially printed to the terminal (I've added in a 2-second sleep command to make the animation easier to watch). Animation by the author.

As you can see, the single instance of DocParser can parse all five file types which we pass to it. The pdf and html files contain some white space, so if you like, you can go and edit _parsepdf and _parsehtml to remove this, safe in the knowledge that your other parsing functionality will remain unchanged.

Conclusion

We’ve covered how to use the factory method design pattern to build a maintainable interface which can parse many different types of document, enabling us to take advantage of as many data sources as we like in our text mining projects. Once the interface has been built, we can use it to extract text in a single line of code, keeping our application files clean.

Be aware that this pattern is general and can be used in any situation where the logic you use to generate an object is dependent upon some parameter, which is usually determined by the user of your application. Elsewhere, I’ve found this useful for giving extra functionality to an API so it can accept a variety of request payloads. With a bit of consideration, you’ll likely find ways to use this design pattern in your own applications, keeping them clean, maintainable, and versatile.


[1] The single responsibility principle

[2] Design Patterns: Elements of Reusable Object-Oriented Software
