
Data collection and extraction involves pulling data from different sources and converting it into a useful format for further processing or analysis. It is the first step of the Extract-Transform-Load (ETL) pipeline in the data engineering process.
As a data scientist, you might need to combine data that is available in multiple file formats such as JSON, XML, CSV, and SQL.
In this tutorial, we will use Python libraries such as pandas, json, and requests to read data from different sources and load it into a Jupyter notebook as a pandas dataframe.
1. CSV files
This refers to a ‘comma-separated values’ file that is used to store data in a tabular format, similar to a spreadsheet. Each line in the file is an observation (or record), and each record has one or more attributes separated by commas.

Often, but not always, the first line represents the feature (column) names.
We will download the math results data from the data.gov website and save it in our working directory. Launch a new Jupyter notebook.
Before we create a pandas dataframe from the data, it is important to first peek into the file using Python's built-in file reading. The following code displays the first 5 lines of the downloaded file.
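A minimal sketch of that peek, assuming the file was saved as 'math results.csv' (adjust the name to match your download).
with open('math results.csv') as f:
    for _ in range(5):
        print(f.readline().rstrip())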
What to look out for:
- If the first line contains the column names, use pd.read_csv(file) without additional parameters, since pandas assumes the first line holds the column headers. If not, use pd.read_csv(file, header=None); pandas then creates numbered column names for you, and you can add proper names later using df.columns = <list_of_names>.
- If there are any empty rows at the top of the file, skip them using pd.read_csv(file, skiprows=n). Both cases are sketched below.
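A short sketch of both cases; the filename and column names are illustrative, and our actual file has a header row, so neither option is needed for it.
# no header row in the file: let pandas number the columns, then name them ourselves
df = pd.read_csv('math results.csv', header=None)
df.columns = ['district', 'grade', 'number_tested', 'mean_scale_score']
# empty rows at the top of the file: skip them
df = pd.read_csv('math results.csv', skiprows=2)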

The code below downloads the file directly from the website and creates a Pandas dataframe.
file_link = 'https://data.cityofnewyork.us/api/views/7yig-nj52/rows.csv'
df_csv = pd.read_csv(file_link)
2. XML data
Extensible Markup Language, or XML, is a markup language very similar to HTML. You can recognize XML-formatted data because it is marked up with data-specific tags, unlike the pre-defined tags of HTML.
We will continue using the math results data from the data.gov website, which provides an option to download the data in an XML format.
Running the code below in a Jupyter notebook will print the first 20 lines.
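A sketch of such a snippet, using Python's built-in file reading and the same filename used further below.
with open('math results.xml') as f:
    for _ in range(20):
        print(f.readline().rstrip())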

To extract XML data, we use a python library called BeautifulSoup. To install BeautifulSoup, along with the lxml parser we will pass to it below, paste the code below into the Jupyter notebook.
!pip install bs4 lxml
Then import the library, open the file and load its contents into BeautifulSoup.
from bs4 import BeautifulSoup
with open("math results.xml") as f:
soup = BeautifulSoup(f, "lxml")
From the collected XML data, we will use the find_all() method of BeautifulSoup, which returns all matching tags as a list.

For our case, the tags of interest are 'district', 'grade', 'number_tested', and 'mean_scale_score'.
We will then loop through the rows, and for each tag, retrieve the value inside it using the get_text() method.
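A sketch of those two steps, assuming every record contains each of the four tags exactly once (the variable names are illustrative).
# find_all() returns every occurrence of a tag as a list
districts = soup.find_all('district')
grades = soup.find_all('grade')
tested = soup.find_all('number_tested')
scores = soup.find_all('mean_scale_score')
# pair the four lists row by row and extract the text inside each tag
data = []
for d, g, n, m in zip(districts, grades, tested, scores):
    data.append([d.get_text(), g.get_text(), n.get_text(), m.get_text()])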
The list of lists named ‘data’ represents all our records.

The next step is to create a pandas dataframe, passing the column names to be used for the data.
df_xml = pd.DataFrame(data, columns=['district', 'grade','number_tested', 'mean_scale_score'])

3. JSON data
JavaScript Object Notation, or JSON, stores data in a dictionary-style, key-value format. The JSON file for the math results is available at this link. Below is a screenshot of the file using a code editor.

JSON was developed after XML, but has become more popular over time. The two formats are frequently used for sending data via web APIs, which are discussed in point 5 below.
The pandas library offers a simple way to read in a JSON file: pd.read_json(). You can pass the 'orient' option depending on how the data is formatted. The pandas documentation explains the 'orient' options.
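As a minimal illustration (the filename and contents here are made up, not our math results file):
# works for a file laid out as a JSON array of objects, e.g. [{"district": "01", "grade": "3"}, ...]
df = pd.read_json('some_records.json', orient='records')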
Our JSON data is formatted differently, and pd.read_json() returns an error. The file has a lot of meta information; 920 lines of it. The actual data starts at the 921st line.

The image below shows a screenshot of the data using the Notepad++ code editor. Note that the first 8 fields in each row are internal metadata that we will not need.

We will use the json library which loads the JSON data as a dictionary.
import json
with open('math results.json') as f:
    json_data = json.load(f)
type(json_data)
###dict
In our case, there will be two dictionary entries: 'meta' and 'data'. We can therefore retrieve the data portion using dict.get(key).
data = json_data.get('data')
data = [x[8:] for x in data]
At this point, our data is a list of lists, with one inner list per record. The list comprehension above keeps only the items from index 8 onwards in each record, dropping the 8 metadata fields; the images below show the data before and after this step.


Next, we create a pandas dataframe. Based on our knowledge of the dataset, we will pass the column names as a parameter.
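One possible sketch reads the column names from the 'meta' entry we set aside earlier instead of typing them by hand. The nested 'view' and 'columns' structure, and the assumption that its first 8 entries correspond to the 8 fields we dropped, are specific to this file, so verify them before relying on this.
# recover the column names from the metadata, skipping the 8 internal fields
columns = [col['name'] for col in json_data['meta']['view']['columns']][8:]
df_json = pd.DataFrame(data, columns=columns)
df_json.head()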

4. SQL databases
Structured Query Language, or SQL, enables us to access and manipulate databases. You can collect data from SQL databases using the pandas library by opening a connection to the database and running an SQL query against it.
Two Python libraries can be used to make the connection, depending on the type of database: the sqlite3 library or the sqlalchemy library.
For our purposes, we have an SQLite database file with the math results we have been working with. SQLite provides local data storage for individual applications and devices. Read more on when to use SQLite here. We will therefore use the sqlite3 library. The db file has only one table.
import sqlite3
import pandas as pd
# connect to the database
conn = sqlite3.connect('maths_results.db')
# run a query
data_sql = pd.read_sql('SELECT * FROM maths_results', conn)
# display 5 rows
data_sql.head()
When working with large client/server SQL databases such as MySQL and PostgreSQL, use the sqlalchemy library. Check the SQLAlchemy documentation for how to connect to different databases, including SQLite, which we demonstrate in the code below.
import pandas as pd
from sqlalchemy import create_engine
#create a db engine
engine = create_engine('sqlite:///maths_results.db')
#run query
df_sql = pd.read_sql("SELECT * FROM maths_results", engine)
5. Extracting data from the web using APIs
Application Programming Interfaces, or APIs, provide a way for companies such as Facebook, Twitter, and Google to share their data with users. This discourages web scraping, which falls into a legal gray area.
These companies provide web addresses of public-facing APIs from where developers can access data and use it for analysis or create their own applications. Some require login credentials when sending the requests, while others are public. They also provide documentation for how to format the URL requests.
In this section, we will pull data from the World Bank data catalog. The website does not require login credentials. The API documentation is available here.
This page has a list of indicators with options to download using the API. We will look at the ‘number of female business owners’ which shows the results per country per year.
Scroll down to the indicator data you want and click on the ‘API’ access option.

This will redirect you to a new webpage with the raw JSON data. Copy the URL from the address bar and open a new Jupyter notebook. Write the code below, replacing the url value with the one you copied.
Python has a requests library for sending HTTP requests. The code below returns a response object.
import requests
import pandas as pd
url = 'https://api.worldbank.org/v2/en/country/all/indicator/IC.WEF.LLCO.FE?format=json&per_page=20000&source=14'
r = requests.get(url)
Since the data is in JSON format, we will call response.json() to return the extracted JSON data.
r.json()
This returns a list with two dictionary parts. The first part contains the meta-information. The second part holds the actual data.

To create the dataframe, index the second element of the list with [1].
data_api = pd.DataFrame(r.json()[1])
data_api.head()

Conclusion
In this post, we explored data collection by extracting data in 5 different formats and loading it into a pandas dataframe. The complete code is available here on GitHub.
The skills to collect data are extremely valuable to a data scientist who wishes to extract and analyze real-world datasets. The next step after extraction is to transform the data into a structured and organized format.
The resources below list public datasets you can use for your projects.