
From PowerPoints to PDFs to CSV Files: Python Classes for Reading Major File Types

Be able to extract and compare information from multiple different file types in your next data science project!

Photo by Glen Carrie on Unsplash

Introduction

There are many different types of files a data scientist may need to read for a project. While it is particularly easy to read text files in Python, other file types need additional support from various Python APIs to make them Python-readable and usable. Today’s code provides several Python classes that can be used for reading many different file types. The output of each class is a string of text, which a data scientist can then use for information extraction as well as similarity analysis across documents.

PDF Files

I have discussed many times in the past the importance of PDF files and how to work with them in Python.

PDF Parsing Dashboard with Plotly Dash

Natural Language Processing: PDF Processing Function for Obtaining a General Overview

I wanted to include the below class because I cannot stress enough how important it is to be able to use PDF files in your next project. Many people use PDFs for various tasks today, and there is too much information in them to simply throw away. The pdfReader class has two functions: one that makes the PDF Python-readable and another that turns the PDF into one string of text in Python. *Note: PyPDF2 was recently updated, so use caution with old tutorials that have not been revised. Many of the method names changed in PyPDF2 3.0.0.*

!pip install PyPDF2
import PyPDF2
import re

class pdfReader:

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def pdf_reader(self) -> PyPDF2.PdfReader:
        """A function that can read .pdf formatted files
           and returns a Python-readable PDF.

           Returns:
           read_pdf: A Python-readable .pdf file.
        """
        opener = open(self.file_path, 'rb')
        read_pdf = PyPDF2.PdfReader(opener)
        return read_pdf

    def PDF_one_pager(self) -> str:
        """A function that returns a one-line string of the
           pdf.

           Returns:
           content (str): A one-line string of the pdf.
        """
        pdf = self.pdf_reader()
        content = ''
        num_pages = len(pdf.pages)

        for i in range(0, num_pages):
            content += pdf.pages[i].extract_text() + "\n"
        content = " ".join(content.replace(u"\xa0", " ").strip().split())
        page_number_removal = r"\d{1,3} of \d{1,3}"
        page_number_removal_pattern = re.compile(page_number_removal, re.IGNORECASE)
        content = re.sub(page_number_removal_pattern, '', content)
        return content
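To see the page-number cleanup step from PDF_one_pager in isolation, here is a quick sketch on a made-up snippet of extracted text (the sample string is invented for this demo; in practice it would come from extract_text()):

```python
import re

# Hypothetical extracted text containing "X of Y" page footers,
# the kind of noise PDF_one_pager is designed to strip.
raw = "Quarterly report 1 of 12 Revenue grew steadily. 2 of 12 Costs fell."

pattern = re.compile(r"\d{1,3} of \d{1,3}", re.IGNORECASE)
cleaned = re.sub(pattern, "", raw)
cleaned = " ".join(cleaned.split())  # collapse the leftover double spaces
print(cleaned)  # → Quarterly report Revenue grew steadily. Costs fell.
```

The `\d{1,3}` bounds keep the pattern from eating legitimate phrases like "1000 of 2000" while still catching typical footers.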

Microsoft PowerPoint Files

Fun fact: there is a way to read Microsoft PowerPoint files into Python and see what information they contain. The pptReader class below will read the text from each slide in your PowerPoint presentation and add it to a string of text. Do not worry about images in your PowerPoint: shapes without a text frame, such as pictures, are simply skipped by the ppt_text() function!

!pip install python-pptx
from pptx import Presentation

class pptReader:

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def ppt_text(self) -> str:
        """A function that returns a string of text from all
           of the slides in a pptReader object.

           Returns:
           text (str): A single string containing the text
           within each slide of the pptReader object.
        """
        prs = Presentation(self.file_path)
        text = str()

        for slide in prs.slides:
            for shape in slide.shapes:
                if not shape.has_text_frame:
                    continue
                for paragraph in shape.text_frame.paragraphs:
                    for run in paragraph.runs:
                        text += ' ' + run.text
        return text

Microsoft Excel Files

Surprisingly, Excel files are a bit more difficult to read into Python than I had originally expected. I first used the openpyxl library to read an Excel file into Python, but I was not getting the output I wanted, so I saved the read-in file as a standard CSV file and continued from there. That trick worked!


!pip install openpyxl
import openpyxl
import pandas as pd

class xlsxReader:

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def xlsx_text(self) -> str:
        """A function that will return a string of text from
           the information contained within an xlsxReader object.

           Returns:
           text (str): A string of text containing the information
           within the xlsxReader object.
        """
        inputExcelFile = self.file_path
        text = str()
        wb = openpyxl.load_workbook(inputExcelFile)

        for sn in wb.sheetnames:
            excelFile = pd.read_excel(inputExcelFile, engine='openpyxl', sheet_name=sn)
            excelFile.to_csv("ResultCsvFile.csv", index=None, header=True)

            with open("ResultCsvFile.csv", "r") as csvFile:
                lines = csvFile.read().split(",")  # "\r\n" if needed
                for val in lines:
                    if val != '':
                        text += val + ' '
                text = text.replace('\ufeff', '')
                text = text.replace('\n', ' ')
        return text
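The round trip at the heart of xlsx_text — a pandas DataFrame written out as CSV, then flattened into one string — can be sketched without touching an actual workbook. The DataFrame below stands in for what pd.read_excel would return, and an in-memory buffer replaces the ResultCsvFile.csv written to disk:

```python
import io
import pandas as pd

# Hypothetical worksheet contents; in xlsx_text this DataFrame
# would come from pd.read_excel on one sheet of the workbook.
excelFile = pd.DataFrame({"name": ["Ada", "Grace"], "score": [90, 95]})

# Write to an in-memory CSV instead of a file on disk.
buffer = io.StringIO()
excelFile.to_csv(buffer, index=None, header=True)

# Flatten the CSV the same way the class does: split on commas,
# drop empty fields, replace newlines with spaces.
text = ""
for val in buffer.getvalue().split(","):
    if val != "":
        text += val + " "
text = text.replace("\ufeff", "").replace("\n", " ")
print(text)
```

The resulting string interleaves headers and cell values ("name score Ada 90 Grace 95"), which is exactly the normalized form the other classes produce.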

CSV Files

You may be wondering why I did not use the Pandas library for reading my CSVs. I used the traditional open() method because I wanted all of the information within the given CSV files, and many CSV files are formatted differently. Additionally, I am inclined to use the Standard Python Library rather than import external APIs when the built-in functions work just as well, or can perform my desired task more quickly.

class csvReader:

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def csv_text(self) -> str:
        """A function that returns a string of text containing
           the information within a csvReader object.

           Returns:
           text (str): A string of text containing information
           within a csvReader object.
        """
        text = str()
        with open(self.file_path, "r") as csvFile:
            lines = csvFile.read().split(",")  # "\r\n" if needed

            for val in lines:
                text += val + ' '
            text = text.replace('\ufeff', '')
            text = text.replace('\n', ' ')
        return text
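As a quick sanity check of the flattening logic inside csv_text, the sketch below writes a tiny throwaway CSV (the file contents are invented for this demo) and applies the same split-and-join steps:

```python
import os
import tempfile

# Create a small throwaway CSV so there is something to parse.
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w") as f:
    f.write("city,population\nParis,2100000\nOslo,700000\n")

# Same flattening logic as csv_text: split on commas,
# join with spaces, strip BOM characters and newlines.
text = ""
with open(path, "r") as csvFile:
    lines = csvFile.read().split(",")
    for val in lines:
        text += val + " "
    text = text.replace("\ufeff", "").replace("\n", " ")

os.remove(path)
print(text)
```

Note that splitting on commas alone discards the row structure on purpose: the goal here is a bag of tokens for downstream text analysis, not a faithful table.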

Extra: Natural Language Preprocessing Class

When working with all of the above classes in one project, the outputs are already normalized to a degree because each class formats them as strings of text. I was working with various documents in a recent project of mine, and I needed to also preprocess each of the strings to add more normalization and remove unnecessary information. After reading all of your files into Python, use the dataprocessor class to clean each of the strings and standardize them across all samples.

!pip install nltk
import re
import string
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

class dataprocessor:

  def __init__(self):
    return

  @staticmethod
  def get_wordnet_pos(text: str) -> str:
    """Map POS tag to the first character lemmatize() accepts.

       Parameters:
       text (str): A string of text

       Returns:
       tag: The tag of the word
    """
    tag = nltk.pos_tag([text])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

  @staticmethod
  def preprocess(input_text: str) -> str:
    """A function that accepts a string of text and conducts various
       NLP preprocessing steps on said text, including punctuation removal,
       stopword removal, and lemmatization.

       Parameters:
       input_text (str): A string of text

       Returns:
       output_text (str): A processed string of text.
    """
    #lowercase
    input_text = input_text.lower()

    #punctuation removal
    input_text = "".join([i for i in input_text if i not in string.punctuation])

    #stopword removal (word by word, not character by character)
    stopwords = nltk.corpus.stopwords.words('english')
    custom_stopwords = ['\n', '\n\n', '&', ' ', '.', '-', '$', '@']
    stopwords.extend(custom_stopwords)
    input_text = [i for i in input_text.split(' ') if i not in stopwords]
    input_text = ' '.join(word for word in input_text)

    #lemmatization
    lm = WordNetLemmatizer()
    input_text = [lm.lemmatize(word, dataprocessor.get_wordnet_pos(word)) for word in input_text.split(' ')]
    input_text = ' '.join(word for word in input_text)

    #collapse repeated spaces
    output_text = re.sub(' +', ' ', input_text)

    return output_text
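The first and last steps of preprocess() need nothing beyond the standard library, so they can be sketched on their own (the sample sentence is made up; the NLTK-based stopword and lemmatization steps are skipped here because they require the downloaded corpora):

```python
import re
import string

# A made-up input sentence for illustration.
input_text = "The Model's Accuracy improved -- by 12.5%!"

# Lowercase and strip punctuation, exactly as preprocess() does.
input_text = input_text.lower()
input_text = "".join(i for i in input_text if i not in string.punctuation)

# Collapse the repeated spaces left behind by the removed characters
# (the final step of preprocess()).
output_text = re.sub(" +", " ", input_text)
print(output_text)  # → the models accuracy improved by 125
```

One design consequence worth noting: punctuation removal also strips the decimal point, so "12.5%" becomes "125". If numeric values matter for your analysis, consider removing numbers entirely or tokenizing before stripping punctuation.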

Conclusion

Today we looked at different Python classes a data scientist can use in their next project for reading various file types. While there are other file types that can be read into Python, the varieties discussed today are very important and sometimes overlooked. Using today’s file-reading classes along with the final data-cleaning class allows you to compare and contrast the information within completely different file types. For example, maybe you want to use Natural Language Processing to look at different research documents, find which are similar, and then recommend those for a student to study in their current class. This is just one of many different projects that can be created from the classes provided today. Enjoy!

If you enjoyed today’s reading, PLEASE give me a follow and let me know if there is another topic you would like me to explore! If you do not have a Medium account, sign up through my link here! Additionally, add me on LinkedIn, or feel free to reach out! Thanks for reading!

