
3 Python Modules You Should Know to Extract Text Data

Python for Text Analytics

Photo by João Ferrão on Unsplash

Extracting text data is the first step toward any further analysis. A considerable amount of text is produced over social media and in documents every day, but we need a system that can help us pull the useful information out of that bundle of text. Well-known applications that rely on text extraction include resume parsing and invoice reading. In this article, we will look at three free-to-use Python libraries for extracting text data and see how to use them.


1. pdfplumber

The pdfplumber library is written in Python and serves several purposes when extracting content from PDFs. If you want to pull plain text or tabular data out of a document, this library is very handy.

How to Install

To install the library, open a command prompt and type the command below. Make sure Python is available on the machine.

pip install pdfplumber

How to Use

To use the library, first import it, then call pdfplumber.open to read a PDF file and extract_text to pull the text from a page.

import pdfplumber

# Open the PDF and extract the text of the first page
with pdfplumber.open("Pranjal Saxena Resume.pdf") as pdf:
    page = pdf.pages[0]
    text = page.extract_text()

Output

I used my own resume to extract the data, and the result is clean enough to feed straight into further text processing.

pdfplumber output
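
The snippet above reads only the first page. pdfplumber can also walk through every page and pull out tables, so here is a minimal sketch of both, using the same file; a resume usually contains no real tables, so the table list may simply come back empty.

import pdfplumber

with pdfplumber.open("Pranjal Saxena Resume.pdf") as pdf:
    # Join the text of every page; extract_text() can return None for empty pages
    full_text = "\n".join(page.extract_text() or "" for page in pdf.pages)

    # extract_tables() returns each table as a list of rows (lists of cell values)
    tables = [table for page in pdf.pages for table in page.extract_tables()]

print(full_text[:500])           # preview the first 500 characters
print(len(tables), "table(s) found")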

2. PyPDF2

PyPDF2 by Matthew Stamy is another good library that can help us extract data from PDF documents. It can perform the following actions:

  • Extracting document information
  • Splitting documents page by page
  • Merging documents page by page
  • Cropping pages
  • Merging multiple pages into a single page
  • Encrypting and decrypting PDF files

It can handle all of these actions on PDF documents. Let us see how it performs at extracting text; a short sketch of reading document information and merging files follows the output below.

How to Install

To install PyPDF2, open a command prompt and type the command below. Make sure Python is available on the machine.

pip install PyPDF2

How to Use

To use PyPDF2, first import it, then use PdfFileReader to read the PDF file. Finally, call extractText() on a page to get the text data.

from PyPDF2 import PdfFileReader

# Open the PDF in binary mode and print the text of the first page
pdfFile_pypdf = open('Pranjal Saxena Resume.pdf', 'rb')
pdfReader = PdfFileReader(pdfFile_pypdf)
print(pdfReader.getPage(0).extractText())
pdfFile_pypdf.close()

Output

The output here is not as clean as what pdfplumber produced, which is understandable: PyPDF2 also focuses on many other PDF manipulation tasks besides text extraction.

PyPDF2 output
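
The extraction example covers only the first item in the capabilities list above. As a small illustration of two more, here is a minimal sketch that reads the document information and merges two files, using the same classic PyPDF2 API as before (newer releases rename these methods); the file names in the merge step are just placeholders.

from PyPDF2 import PdfFileReader, PdfFileMerger

# Document information: title, author, number of pages, ...
with open('Pranjal Saxena Resume.pdf', 'rb') as f:
    reader = PdfFileReader(f)
    info = reader.getDocumentInfo()
    if info is not None:
        print(info.title, info.author)
    print(reader.getNumPages(), "page(s)")

# Merging: append whole documents and write the combined result
merger = PdfFileMerger()
merger.append('part1.pdf')    # placeholder input files
merger.append('part2.pdf')
merger.write('combined.pdf')
merger.close()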

3. Apache Tika

Apache Tika is a content detection and analysis framework, written in Java and stewarded at the Apache Software Foundation. I was surprised by the kind of output it can provide (you will be too), because the output is well structured and easy to turn into valuable data.

How to Install

To install and work with the Apache Tika Python library, you should have a recent version of Java installed. After installing Java, open a command prompt and type the command below. Make sure Python is available on the machine.

pip install tika==1.23

On the first run, the tika package downloads and starts the Tika server in the background automatically, so no extra setup is needed beyond having Java installed.

How to Use

To use the Apache Tika library, first import parser from tika and then use parser.from_file to read a PDF file. Finally, index the result with ["content"] to get the text data.

from tika import parser

# Parse the PDF; from_file returns a dictionary of parsed results
parsed_tika = parser.from_file("Pranjal Saxena Resume.pdf")
print(parsed_tika["content"])

Output

The output is very promising: the text comes out of the document properly organized and ready for use.

Apache Tika output
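
Besides the text under "content", the dictionary returned by parser.from_file also carries the status of the Tika call and the document metadata, which is worth checking before further processing. A minimal sketch:

from tika import parser

parsed_tika = parser.from_file("Pranjal Saxena Resume.pdf")

print(parsed_tika["status"])                         # 200 means the parse succeeded
print(parsed_tika["metadata"].get("Content-Type"))   # e.g. application/pdf
print((parsed_tika["content"] or "")[:500])          # preview of the extracted text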

Closing Points

We have looked at three free-to-use Python libraries for extracting text or tabular data from documents. They are a great help in gathering informative data from files, and you can try all three and pick whichever suits the format of the document at hand. Once the data is extracted, the next step is to find patterns in it using regular expressions and store the results for further processing.
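
As a small illustration of that next step, here is a minimal sketch that pulls email addresses and phone-like numbers out of extracted text with regular expressions; the sample string and the patterns are just placeholders for whichever text and patterns your own documents need.

import re

# text is assumed to hold the string returned by one of the extractors above
text = "Reach me at pranjal@example.com or +91 98765 43210."

emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)
phones = re.findall(r"\+?\d[\d\s-]{8,}\d", text)

print(emails)  # ['pranjal@example.com']
print(phones)  # ['+91 98765 43210']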

That is all for this article. I will see you somewhere around.


Before you go…

If you liked this article and want to stay tuned for more exciting articles on Python and data science, consider becoming a Medium member by clicking here: https://pranjalai.medium.com/membership.

Please consider signing up using my referral link. That way, a portion of the membership fee goes to me, which motivates me to write more exciting content on Python and data science.

Also, feel free to subscribe to my free newsletter: Pranjal’s Newsletter.

