Introduction
There are many different types of files a data scientist may need to read for a project. While text files are particularly easy to read in Python, other file types need additional support from various Python libraries before they are readable and usable. Today’s code provides several Python classes that can be used for reading many different file types. The output of each class is a text string, which a data scientist can then use for information extraction as well as similarity analysis across documents.
PDF Files
I have discussed many times in the past the importance of PDF files and how to work with them in Python.
PDF Parsing Dashboard with Plotly Dash
Natural Language Processing: PDF Processing Function for Obtaining a General Overview
I wanted to include the class below because I cannot stress enough how important it is to be able to use PDF files in your next project. PDFs are used for many different tasks today, and they hold too much information to simply be thrown away. The pdfReader class has two functions: one that makes the PDF Python-readable and another that turns the PDF into one string of text in Python. *Note: There have recently been new updates to PyPDF2, so take caution when using old tutorials that have not been updated. Many of the method names changed in PyPDF2 3.0.0.*
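For reference, here is a rough (and deliberately not exhaustive) mapping of a few commonly used names that changed in the PyPDF2 3.0.0 release — check the library's migration guide for the full list:

```python
# A few of the PyPDF2 names that changed in the 3.0.0 release
# (not exhaustive -- see the library's migration guide for the full list).
renamed = {
    "PdfFileReader": "PdfReader",
    "PdfFileWriter": "PdfWriter",
    "PdfFileMerger": "PdfMerger",
    "reader.getPage(i)": "reader.pages[i]",
    "reader.numPages": "len(reader.pages)",
    "page.extractText()": "page.extract_text()",
}

for old, new in renamed.items():
    print(f"{old:22} -> {new}")
```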
!pip install PyPDF2
import PyPDF2
import re
class pdfReader:
    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def pdf_reader(self) -> PyPDF2.PdfReader:
        """A function that reads a .pdf formatted file
        and returns a Python-readable PDF object.

        Returns:
            read_pdf: A Python-readable .pdf file.
        """
        opener = open(self.file_path, 'rb')
        read_pdf = PyPDF2.PdfReader(opener)
        return read_pdf

    def PDF_one_pager(self) -> str:
        """A function that returns a one-line string of the
        pdf.

        Returns:
            one_page_pdf (str): A one-line string of the pdf.
        """
        pdf = self.pdf_reader()
        content = ''
        num_pages = len(pdf.pages)
        for i in range(num_pages):
            content += pdf.pages[i].extract_text() + "\n"
        # Collapse non-breaking spaces, newlines and repeated whitespace.
        content = " ".join(content.replace(u"\xa0", " ").strip().split())
        # Strip "page X of Y"-style footers.
        page_number_removal = r"\d{1,3} of \d{1,3}"
        page_number_removal_pattern = re.compile(page_number_removal, re.IGNORECASE)
        content = re.sub(page_number_removal_pattern, '', content)
        return content
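The last few lines of PDF_one_pager do the real cleanup work. Here is a minimal sketch of just that cleanup, run on a hypothetical extracted string so no PDF file is needed:

```python
import re

# Hypothetical raw text as extract_text() might return it: non-breaking
# spaces, stray newlines, and a "page X of Y" footer.
raw = "Quarterly\xa0Report\nRevenue grew 12%.\n2 of 10\nOutlook remains strong."

# Collapse whitespace (including \xa0 and newlines) into single spaces.
content = " ".join(raw.replace("\xa0", " ").strip().split())

# Strip "page X of Y"-style footers, then squeeze any leftover double spaces.
content = re.sub(re.compile(r"\d{1,3} of \d{1,3}", re.IGNORECASE), "", content)
content = re.sub(" +", " ", content).strip()

print(content)  # "Quarterly Report Revenue grew 12%. Outlook remains strong."
```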
Microsoft Powerpoint Files
Fun fact: there is a way to read Microsoft Powerpoint files into Python and see what information they contain. The pptReader class below reads the text from each slide in your Powerpoint presentation and adds it to one string of text. Do not worry about images in your Powerpoint; the ppt_text() function simply skips any shape that has no text frame.
!pip install python-pptx
from pptx import Presentation

class pptReader:
    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def ppt_text(self) -> str:
        """A function that returns a string of text from all
        of the slides in a pptReader object.

        Returns:
            text (str): A single string containing the text
              within each slide of the pptReader object.
        """
        prs = Presentation(self.file_path)
        text = str()
        for slide in prs.slides:
            for shape in slide.shapes:
                # Shapes without a text frame (e.g. pictures) are skipped.
                if not shape.has_text_frame:
                    continue
                for paragraph in shape.text_frame.paragraphs:
                    for run in paragraph.runs:
                        text += ' ' + run.text
        return text
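python-pptx exposes text through a slides → shapes → text_frame → paragraphs → runs hierarchy. The sketch below mimics that hierarchy with plain namedtuples (hypothetical stand-ins, not the real library objects) just to show the traversal order and how image shapes get skipped:

```python
from collections import namedtuple

# Hypothetical stand-ins for python-pptx objects, only to show the traversal.
Run = namedtuple("Run", "text")
Paragraph = namedtuple("Paragraph", "runs")
TextFrame = namedtuple("TextFrame", "paragraphs")
Shape = namedtuple("Shape", "has_text_frame text_frame")
Slide = namedtuple("Slide", "shapes")

title = Shape(True, TextFrame([Paragraph([Run("Q3"), Run("Results")])]))
picture = Shape(False, None)  # an image: no text frame, gets skipped
slide = Slide([title, picture])

text = ""
for shape in slide.shapes:
    if not shape.has_text_frame:
        continue
    for paragraph in shape.text_frame.paragraphs:
        for run in paragraph.runs:
            text += ' ' + run.text

print(text)  # " Q3 Results"
```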
Microsoft Excel Files
Surprisingly, Excel files are a bit more difficult to read into Python than I had originally expected. I first used the openpyxl library to read an Excel file into Python, but I was not getting the output I wanted, so I saved the read-in file as a standard CSV file and continued from there. That trick worked!
!pip install openpyxl
import openpyxl
import pandas as pd

class xlsxReader:
    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def xlsx_text(self) -> str:
        """A function that will return a string of text from
        the information contained within an xlsxReader object.

        Returns:
            text (str): A string of text containing the information
              within the xlsxReader object.
        """
        inputExcelFile = self.file_path
        text = str()
        wb = openpyxl.load_workbook(inputExcelFile)
        for sn in wb.sheetnames:
            # Convert each sheet to a standard CSV file, then read it back in.
            excelFile = pd.read_excel(inputExcelFile, engine='openpyxl', sheet_name=sn)
            excelFile.to_csv("ResultCsvFile.csv", index=None, header=True)
            with open("ResultCsvFile.csv", "r") as csvFile:
                lines = csvFile.read().split(",")  # "\r\n" if needed
            for val in lines:
                if val != '':
                    text += val + ' '
        text = text.replace('\ufeff', '')
        text = text.replace('\n', ' ')
        return text
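The CSV-flattening step at the end of xlsx_text() is plain string handling. Here is a minimal, stdlib-only sketch of that step on an in-memory string (hypothetical data, no Excel file needed), including the UTF-8 BOM ('\ufeff') that the replace call strips out:

```python
# Hypothetical content as the intermediate "ResultCsvFile.csv" might hold,
# including a UTF-8 byte-order mark at the start.
csv_content = "\ufeffname,score\nAda,91\nGrace,88\n"

text = ""
for val in csv_content.split(","):
    if val != "":
        text += val + " "

# Drop the BOM and flatten newlines into spaces.
text = text.replace("\ufeff", "").replace("\n", " ")
print(text)
```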
CSV Files
You may be wondering why I did not use the Pandas library for reading my CSVs. I used the traditional open() method because I wanted all of the information within the given CSV files, and many CSV files are formatted differently. Additionally, I am inclined to use the Python Standard Library rather than import external APIs when the built-in functions work just as well or can quickly perform my desired task.
class csvReader:
    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def csv_text(self) -> str:
        """A function that returns a string of text containing
        the information within a csvReader object.

        Returns:
            text (str): A string of text containing information
              within a csvReader object.
        """
        text = str()
        with open(self.file_path, "r") as csvFile:
            lines = csvFile.read().split(",")  # "\r\n" if needed
        for val in lines:
            text += val + ' '
        text = text.replace('\ufeff', '')
        text = text.replace('\n', ' ')
        return text
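Here is a short end-to-end sketch of the same read-and-flatten logic csv_text() uses, writing a small throwaway CSV first so the example is self-contained (the file name and contents are made up for illustration):

```python
import os
import tempfile

# Write a small throwaway CSV so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w") as f:
    f.write("city,population\nParis,2100000\nOslo,700000\n")

# The same read-and-flatten logic csv_text() uses.
with open(path, "r") as csvFile:
    lines = csvFile.read().split(",")

text = ""
for val in lines:
    text += val + " "
text = text.replace("\ufeff", "").replace("\n", " ")

print(text)
```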
Extra: Natural Language Preprocessing Class
When working with all of the above classes in one project, the outputs are normalized by formatting each of them as a string of text. I was working with various documents in a recent project of mine and needed to preprocess each of the strings to add more normalization and remove unnecessary information. After reading all of your files into Python, use the dataprocessor class to clean each of the strings and standardize them across all samples.
import string

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

class dataprocessor:
    def __init__(self):
        return

    @staticmethod
    def get_wordnet_pos(text: str) -> str:
        """Map POS tag to the first character lemmatize() accepts.

        Parameters:
            text (str): A string of text

        Returns:
            tag: The tag of the word
        """
        tag = nltk.pos_tag([text])[0][1][0].upper()
        tag_dict = {"J": wordnet.ADJ,
                    "N": wordnet.NOUN,
                    "V": wordnet.VERB,
                    "R": wordnet.ADV}
        return tag_dict.get(tag, wordnet.NOUN)

    @staticmethod
    def preprocess(input_text: str) -> str:
        """A function that accepts a string of text and conducts various
        NLP preprocessing steps on said text, including punctuation removal,
        stopword removal and lemmatization.

        Parameters:
            input_text (str): A string of text

        Returns:
            output_text (str): A processed string of text.
        """
        # lowercase
        input_text = input_text.lower()

        # punctuation removal
        input_text = "".join([i for i in input_text if i not in string.punctuation])

        # stopword removal
        stopwords = nltk.corpus.stopwords.words('english')
        custom_stopwords = ['\n', '\n\n', '&', ' ', '.', '-', '$', '@']
        stopwords.extend(custom_stopwords)
        words = [word for word in input_text.split(' ') if word not in stopwords]
        input_text = ' '.join(words)

        # lemmatization
        lm = WordNetLemmatizer()
        words = [lm.lemmatize(word, dataprocessor.get_wordnet_pos(word))
                 for word in input_text.split(' ')]
        text = ' '.join(words)
        output_text = re.sub(' +', ' ', text)
        return output_text
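Note that the NLTK parts of this pipeline require the relevant corpora to be downloaded first (for example nltk.download('stopwords'), nltk.download('wordnet') and nltk.download('averaged_perceptron_tagger')). To show just the non-NLTK steps, here is a stdlib-only sketch of the lowercasing, punctuation stripping, stopword filtering and whitespace collapsing, with a tiny hard-coded stopword list standing in for NLTK's English list:

```python
import re
import string

# Tiny hard-coded stopword list standing in for NLTK's English stopwords.
stopwords = {"the", "a", "an", "of", "and", "is"}

text = "The Quarterly Report, of course, IS a summary!"

# lowercase
text = text.lower()
# punctuation removal
text = "".join(ch for ch in text if ch not in string.punctuation)
# stopword removal
text = " ".join(w for w in text.split(" ") if w and w not in stopwords)
# collapse repeated spaces
text = re.sub(" +", " ", text)

print(text)  # "quarterly report course summary"
```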
Conclusion
Today we looked at different Python classes a data scientist can use in their next project for reading in various file types. While there are other file types that can be read into Python, the varieties discussed today are very important and sometimes overlooked. Using today’s file-reading classes along with the final data cleaning class allows you to compare and contrast the information within completely different file types. For example, maybe you want to use Natural Language Processing to look at different research documents, find which are similar, and then recommend those for a student to study in their current class. This is just one of many different projects that can be created from the classes provided today, enjoy!
If you enjoyed today’s reading, PLEASE give me a follow and let me know if there is another topic you would like me to explore! If you do not have a Medium account, sign up through my link here! Additionally, add me on LinkedIn, or feel free to reach out! Thanks for reading!