
Different ways of Importing Data into R

How to handle different types of data in R

Photo by Caspar Camille Rubin on Unsplash

I have completed two courses at DataCamp that introduced me to the concept of importing data into R. There are numerous ways to import data, and I would like to discuss in detail some of the methods that I learned in the courses. Let's get started.

Data can come from many sources. Some of the most common ones are

  • Flat files – CSV, TXT, TSV, etc.
  • Data from Excel
  • Databases – PostgreSQL, MySQL, etc.
  • Web
  • Statistical software – SAS, SPSS, STATA

Flat-Files

What is a flat-file?

According to Wikipedia, a flat-file database is a database stored in a file called a flat file. Records follow a uniform format, and there are no structures for indexing or recognizing relationships between records. The file is simple. A flat file can be a plain text file or a binary file.

Listed below are some of the packages that will help you work with flat files in R.

utils

This package is loaded by default when you load R.

  • read.table(): The main function. It reads a file in table format and creates a data frame from it. It offers many arguments to control how the incoming data is parsed.
  • read.csv(): Wrapper function for read.table(). Used to read comma-separated (CSV) files.
  • read.delim(): Wrapper function used to read tab-separated files. read.delim() is used if the numbers in your file use periods (.) as decimals.
  • read.csv2(): A near-identical twin of read.csv(); the difference is that it is set up for files that use commas as decimal points in numbers.
  • read.delim2(): read.delim2() is used when the numbers in your file use commas (,) as decimals.
Output for read.csv() function
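A minimal sketch of these readers; the file names (states.csv, states.txt, states_eu.csv) are hypothetical placeholders:

```r
# Base utils readers -- no extra packages needed
df_csv <- read.csv("states.csv")                        # comma-separated, "." as decimal
df_tab <- read.delim("states.txt")                      # tab-separated, "." as decimal
df_eu  <- read.csv2("states_eu.csv")                    # ";" separated, "," as decimal
df_any <- read.table("states.txt", header = TRUE, sep = "\t")  # the general-purpose function
str(df_csv)
```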

Specialized Packages

readr

This package makes our life easier. It is fast, convenient, and more efficient than the utils package. I tend to use it all the time.

readr supports seven file formats with seven functions:

  • [read_csv()](https://readr.tidyverse.org/reference/read_delim.html): comma-separated (CSV) files
  • [read_csv2()](https://readr.tidyverse.org/reference/read_delim.html): semicolon-separated files, with , as the decimal separator
  • [read_tsv()](https://readr.tidyverse.org/reference/read_delim.html): tab-separated files
  • [read_delim()](https://readr.tidyverse.org/reference/read_delim.html): general delimited files
  • [read_fwf()](https://readr.tidyverse.org/reference/read_fwf.html): fixed-width files
  • [read_table()](https://readr.tidyverse.org/reference/read_table.html): tabular files where columns are separated by white-space.
  • [read_log()](https://readr.tidyverse.org/reference/read_log.html): weblog files
Output for read_csv() function
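The readr equivalents look like this (again, the file names are placeholders):

```r
library(readr)

tbl_csv  <- read_csv("states.csv")                  # returns a tibble and prints column types
tbl_tsv  <- read_tsv("states.tsv")
tbl_pipe <- read_delim("states.txt", delim = "|")   # any delimiter you choose
```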

The readr package works with tibbles. According to the documentation, tibbles are data frames, but they tweak some older behaviours to make life a little easier. The printout also shows the column types, which is missing from read.csv()'s output.

data.table

For the authors of data.table, Matt Dowle & Arun Srinivasan, the key metric is speed. The package is mainly about data manipulation, but it also features a super-powerful function to read data into R: fread().

If you have huge files to import into R, you can use the data.table package.

Output for fread() function

fread() can handle column names automatically. It can also infer column types and field separators without you having to specify them. It is an improved version of read.table() that is extremely fast, more convenient, and adds more functionality.
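A quick sketch, assuming a hypothetical large file big_file.csv:

```r
library(data.table)

# fread() infers the separator, header, and column types on its own
dt <- fread("big_file.csv")
class(dt)   # "data.table" "data.frame"

# You can still override the defaults, e.g. import only selected columns
dt_small <- fread("big_file.csv", select = c("id", "value"))
```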

Excel

Microsoft Excel is the most common tool used in data analysis. A typical Excel file contains different sheets with tabular data.

We need to explore the file and then import some data from it. The readxl package offers two functions to handle this.

  • excel_sheets(): Explore different sheets

The result is a simple character vector that contains the names of the sheets inside the Excel file.

  • read_excel() : Import the Data into R
Output for read_excel() function

The first sheet is imported as a tibble by default. We can explicitly specify the sheet to import by passing either its index or its name to the sheet argument. Both of the calls below do the same work.
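For example, assuming a hypothetical cities.xlsx workbook whose second sheet is named "year_2000":

```r
library(readxl)

excel_sheets("cities.xlsx")                     # e.g. "year_1990" "year_2000"

read_excel("cities.xlsx", sheet = 2)            # by index ...
read_excel("cities.xlsx", sheet = "year_2000")  # ... or by name
```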

However, loading in every sheet manually and then merging them in a list can be quite tedious. Luckily, you can automate this with [lapply()](http://www.rdocumentation.org/packages/base/functions/lapply). This function returns a list with one element per sheet.
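A sketch of that pattern, using the same hypothetical workbook:

```r
path <- "cities.xlsx"

# One read_excel() call per sheet, collected into a list of tibbles
all_sheets <- lapply(excel_sheets(path), read_excel, path = path)
str(all_sheets)
```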

XLConnect

Developed by Martin Studer, it acts as a bridge between R and Excel. It allows the user to perform activities like editing sheets and formatting data in Excel from inside R, and it works with both XLS and XLSX files. XLConnect runs on top of Java, so make sure you have the dependencies, like the Java Development Kit (JDK), installed and correctly registered in R.

Install the package before using it; the following command will do the work for you:

Loading XLConnect into the workspace
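Something along these lines (run the install step only once):

```r
install.packages("XLConnect")   # requires a working Java (JDK) installation
library(XLConnect)
```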

loadWorkbook(): This function loads a Microsoft Excel file into R, which can then be manipulated further. Setting the create argument to TRUE ensures that the file will be created if it does not exist yet.

Structure of loadWorkbook() function

This object is the actual bridge between R and Excel. After building a workbook in R, you can use it to get the information on the Excel file it links to. Some of the basic functions are

  • getSheets(): This function returns the names of the sheets in the Excel file.
Output for getSheets()
  • readWorksheet(): Allows the user to read the data from the specified sheets by simply giving the name of the sheet in the sheet argument of the function.
Output for readWorksheet()

The best part of this function is that you can specify the row and the column from which to start reading information.

Output for readWorksheet()
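Putting the pieces together, a sketch of the XLConnect workflow (the workbook and sheet names are hypothetical):

```r
# The workbook object is the bridge between R and Excel
book <- loadWorkbook("cities.xlsx", create = TRUE)

getSheets(book)                           # character vector of sheet names

# Read a whole sheet ...
readWorksheet(book, sheet = "year_2000")

# ... or only a region of it, starting at a given row and column
readWorksheet(book, sheet = "year_2000", startRow = 3, startCol = 2, header = TRUE)
```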

PS: Make sure the dataset is in your working directory before performing any operation on it.

Relational Databases

A relational database is a collection of data items with pre-defined relationships between them. These items are organized as a set of tables with columns and rows. Tables are used to hold information about the objects to be represented in the database.

Open Source: MySQL, PostgreSQL, SQLite

Proprietary: Oracle Database, Microsoft SQL Server

Depending on the type of database you want to connect to, you’ll have to use different packages in R.

  • MySQL: RMySQL
  • PostgreSQL: RPostgreSQL
  • Oracle Database: ROracle

DBI defines an interface for communication between R and relational database management systems. All classes in this package are virtual and need to be extended by the various R/DBMS implementations.

In more technical terms, DBI is an interface and RMySQL is the implementation.

As usual, let’s install the package first and import the library DBI. Installing RMySQL will automatically install DBI.

The first step is creating a connection to the remote MySQL database. You can do it as follows.
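A sketch of the connection step; the host, database name, and credentials below are placeholders you would replace with your own:

```r
library(DBI)

con <- dbConnect(RMySQL::MySQL(),
                 dbname   = "company",
                 host     = "db.example.com",
                 port     = 3306,
                 user     = "student",
                 password = "secret")
```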

Now that we are connected to the database, we have access to the content inside it. The following functions help us in the reading, listing, querying, and performing other operations on the database.

  • dbListTables(): This function lets the user list the tables in the database. It requires the connection object as input and returns a character vector of table names.
  • dbReadTable(): Reads the desired table and returns the result as a data frame.
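For example, with the connection object from above (the table name is hypothetical):

```r
dbListTables(con)                            # character vector of table names

employees <- dbReadTable(con, "employees")   # whole table as a data frame
head(employees)
```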

Selective Importing

We can do this in two ways

  • Reading the entire table with dbReadTable() and then subsetting the data in R.
  • dbGetQuery(): This function sends a query, retrieves the results, and then clears the result set. The string passed to it is a regular SQL query.
  • dbFetch(): This function fetches records from a previously executed query and lets us specify the maximum number of records to retrieve per fetch.

Note: dbSendQuery() sends the query to the database, and dbFetch() retrieves the results. Together they do the same work as dbGetQuery(), but fetching in chunks can be useful when you want to load tons of records piece by piece, as in the sketch below.

Do not forget to disconnect from it.
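A sketch of both approaches, reusing the hypothetical employees table, plus the final disconnect:

```r
# One shot: send the query, fetch the results, clear the result set
recent <- dbGetQuery(con, "SELECT * FROM employees WHERE start_date > '2020-01-01'")

# Chunk by chunk: dbSendQuery() + dbFetch()
res <- dbSendQuery(con, "SELECT * FROM employees")
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 100)   # at most 100 records per fetch
  print(nrow(chunk))
}
dbClearResult(res)

# And do not forget to disconnect
dbDisconnect(con)
```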

Web

Downloading a file from the Internet means sending a GET request and receiving the file you asked for.

To read CSV, TSV, or text files from the web, we can simply pass the URL as a character string to the reading function.
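For instance, with a placeholder URL pointing at a raw CSV file:

```r
url_csv <- "https://example.com/data/states.csv"

read.csv(url_csv)    # base R
read_csv(url_csv)    # readr, returns a tibble
```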

Excel File

R doesn't know how to handle Excel files coming directly from the web, so we need to download the file before importing it. Once the file is downloaded, we can use the read_excel() function to read and import it.
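A sketch with a placeholder URL; mode = "wb" matters on Windows so the binary file does not get corrupted:

```r
url_xlsx <- "https://example.com/data/cities.xlsx"
download.file(url_xlsx, destfile = "cities.xlsx", mode = "wb")

library(readxl)
read_excel("cities.xlsx")
```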

JSON Data

JavaScript Object Notation (JSON) is a very simple, concise, and well-structured form of data. Moreover, it is human-readable and easy for machines to parse and generate. This is why JSON is used to communicate with APIs (Application Programming Interfaces).

jsonlite Package

It is a robust, high-performance JSON parser and generator for R.

Let's install the package first. After a successful installation, we will use the fromJSON() function to get data from a URL.

R List with JSON Data
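A minimal sketch, with a placeholder API endpoint:

```r
install.packages("jsonlite")   # run once
library(jsonlite)

resp <- fromJSON("https://example.com/api/cities.json")
str(resp)   # JSON objects become named lists / data frames
```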

Two other interesting functions from the package are prettify() and minify(). They are mostly used to format JSON data.

  • prettify()/minify(): prettify() adds indentation to a JSON string; minify() removes all indentation and whitespace.
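For example:

```r
ugly <- '{"id":1,"name":"Ana","scores":[10,9,8]}'

prettify(ugly)            # adds newlines and indentation for readability
minify(prettify(ugly))    # strips the whitespace back out
```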

Statistical Software Packages

  • haven

This package is used to read SAS, STATA, and SPSS data files. It does this by wrapping the ReadStat C library by Evan Miller. The package is extremely simple to use.

Let’s install the package first and load the library.

  • read_sas(): This function reads SAS files.

Similarly, we can use read_stata() and read_dta() for STATA files, and read_por() and read_sav() for SPSS files.
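A sketch with hypothetical file names, one per format:

```r
library(haven)

sales  <- read_sas("sales.sas7bdat")   # SAS
person <- read_dta("person.dta")       # STATA
survey <- read_sav("survey.sav")       # SPSS
```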

  • foreign

Written by the R Core Team. It is less consistent in naming and use, but it supports many foreign data formats, such as Systat and Weka, and can also export data to various formats.

Let’s install the package and load the library.

SAS

The drawback of this package is that it cannot import .sas7bdat files. It can only import SAS transport libraries (.xport).
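For example, with a hypothetical transport file:

```r
library(foreign)

sales <- read.xport("sales.xpt")   # SAS transport (XPORT) library
```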

STATA

This package can read .dta files from STATA versions 5 to 12 with read.dta(). Some of its useful arguments are:

  • convert.factors: Convert labeled STATA values to R factors.
  • convert.dates: Convert STATA dates and times to Date and POSIXct.
  • missing.type: If FALSE, convert all types of missing values to NA; if TRUE, store how values are missing in attributes.
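A sketch with a hypothetical .dta file:

```r
florida <- read.dta("florida.dta",
                    convert.factors = TRUE,   # labelled values -> factors
                    convert.dates   = TRUE,   # STATA dates -> Date / POSIXct
                    missing.type    = FALSE)  # all missing values -> NA
```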

SPSS

SPSS files are read with read.spss(). Useful arguments:

  • use.value.labels: Convert labeled SPSS values to R factors.
  • to.data.frame: Return a data frame instead of a list.
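A sketch with a hypothetical .sav file:

```r
demo <- read.spss("international.sav",
                  use.value.labels = TRUE,   # labelled values -> factors
                  to.data.frame    = TRUE)   # data frame instead of a list
```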

Conclusion

That's it for importing basic data into R. Loading the data is the first step in any process, be it analysis, visualization, or manipulation.

These are just some of the many methods available to import data into R.

I predominantly use read_excel and read_csv.

What do you use? Let me know in the comments.

Thanks for reading!

