Stock Market Analysis in Python

PART 1: Getting Data by Web Scraping

Faizan Ahemad
Towards Data Science


Staying Invested! What was your gain?

The stock markets have been on a spectacular bull run globally since 2013. In India, many companies have grown over 10 times. Even the industry leaders, the Nifty 50 or India’s top 50 companies, have more than doubled. The annual growth (or return) of the Nifty 50 was over 20% for 2017, and the trend seems to be continuing in 2018. People are investing in mutual funds and ETFs to follow the trend.

In such a situation it seems we should invest more in stocks, yet if you look at the news you will see that most market analysts are calling the current market too expensive to invest in. So do we take their word for it, or do we do some data analysis to find out for ourselves? How do we find good companies in a highly overvalued market? Is this another hype like the bitcoin/cryptocurrency bubble?

In this series of tutorials we are going to find that out using Python.

In Part 1 we learn how to get the data; in Part 2 we will look at how to do the analysis.

In this tutorial (Part 1) we will learn to:

  • Make HTTP requests in Python via the requests library.
  • Use Chrome dev tools to see where data sits on a page.
  • Scrape data from downloaded pages with the BeautifulSoup library when the data is not available in structured form.
  • Parse data like tables into Python 2D arrays.
  • Write a scraping function to get data in the form of a dictionary (key-value pairs).

The Jupyter Notebook required for this is here.

Setup

  1. Install Jupyter Notebooks by installing Anaconda. See my previous article for installing on a Linux server.
  2. Make sure you have the following Python packages installed in addition to Anaconda’s default package set:

beautifulsoup4
fastnumbers
dill

  3. Start a Python 3 Jupyter notebook and add the following imports.

import numpy as np # linear algebra
import pandas as pd # pandas for dataframe based data processing and CSV file I/O
import requests # for http requests
from bs4 import BeautifulSoup # for html parsing and scraping
import bs4
from fastnumbers import isfloat
from fastnumbers import fast_float
from multiprocessing.dummy import Pool as ThreadPool

import matplotlib.pyplot as plt
import seaborn as sns
import json
from tidylib import tidy_document # for tidying incorrect html

sns.set_style('whitegrid')
%matplotlib inline
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Now we are ready to get started. In case of any difficulty, just refer to the Jupyter notebook I have on GitHub.

Some Utilities We Will Need for Scraping Data

String to float conversion

A lot of numbers in web pages are present as strings with commas and % symbols. We use the fast_float function from the fastnumbers library to convert them.

def ffloat(string):
    if string is None:
        return np.nan
    if type(string)==float or type(string)==np.float64:
        return string
    if type(string)==int or type(string)==np.int64:
        return string
    return fast_float(string.split(" ")[0].replace(',','').replace('%',''),
                      default=np.nan)

We check if the input is already a float/int and return it as-is; otherwise we remove the commas and % sign and then convert.
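
A few quick checks of its behaviour (outputs shown as comments):

ffloat("63,783.84")   # 63783.84
ffloat("4750.00%")    # 4750.0
ffloat(None)          # nan
ffloat(17.27)         # 17.27 (already a float, returned unchanged)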

Another function applies this conversion to a list of strings:

def ffloat_list(string_list):
    return list(map(ffloat, string_list))
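
For example:

ffloat_list(["63,783.84", "17.27", "4750.00%"])  # [63783.84, 17.27, 4750.0]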

Removing Multiple Spaces from Within a String

When extracting text from web pages, some strings contain multiple spaces between words. We collapse these into single spaces to maintain consistency.

def remove_multiple_spaces(string):
    if type(string)==str:
        return ' '.join(string.split())
    return string
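
For example:

remove_multiple_spaces("Hero   Motocorp    Ltd.")  # 'Hero Motocorp Ltd.'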

Making HTTP Requests in Python

For this we will use the Python requests library. You need to know the URL of the page you will make the request to.

Use the requests.get method to make the request; response.status_code and response.content give us the HTTP status and the content of the page respectively:

response = requests.get("http://www.example.com/", timeout=240)
response.status_code
response.content

Note that the requests library does not run the JavaScript on the page, so any data/content that is fetched via JavaScript after the HTML loads will not be available. This is not an issue here, though, since most financial websites use server-side rendering and send fully prepared pages to clients.

Getting JSON Content and Parsing It

To get the JSON content from a page, just call response.json():

url = "https://jsonplaceholder.typicode.com/posts/1"
response = requests.get(url, timeout=240)
response.status_code
response.json()

content = response.json()
content.keys()
JSON response

To be sure that your request succeeded, always check response.status_code.
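
A minimal guard, for illustration; raise_for_status() is part of the requests API and raises an HTTPError for 4xx/5xx responses:

response = requests.get("http://www.example.com/", timeout=240)
if response.status_code != 200:
    raise ValueError("Request failed with status %d" % response.status_code)
# or, equivalently, let requests raise the error for you:
response.raise_for_status()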

Scrape Data by Parsing and Traversing HTML

We will be using the beautifulsoup4 library to parse HTML strings into a tree-like representation.

Rendering HTML Strings in Jupyter Notebook

from IPython.core.display import HTML
HTML("<b>Rendered HTML</b>")

Using Chrome Inspector (Dev Tools) to get the position of Content

We will be working with this URL: https://www.moneycontrol.com/india/stockpricequote/auto-2-3-wheelers/heromotocorp/HHM

It’s the page of a company which makes motorbikes in India. Check the page out.

Snapshot of the link

Now, to get any content out of the page, you need to know where it sits in the HTML. First we will get the title (“Hero Motocorp Ltd.”). Let’s inspect the page with the Chrome inspector. To open it, use Cmd+Option+I on Mac, or Ctrl+Shift+I on Windows and Linux.

Using inspector

Open the Chrome inspector -> click on Elements (1) -> click on the cursor-box item (2) -> point at “Hero Motocorp Ltd.” and then click.

Finding element location

As you can see, the company name is in an <h1> tag. Next we look at how to get that content into our notebook.

Parsing and displaying Content using BeautifulSoup4

For this we need to get the response, then parse the content using the BeautifulSoup class. Finally we get the content of the <h1> tag and render it.

response = requests.get("https://www.moneycontrol.com/india/stockpricequote/auto-2-3-wheelers/heromotocorp/HHM", timeout=240)
page_content = BeautifulSoup(response.content, "html.parser")
HTML(str(page_content.find("h1")))

Getting HTML elements by attributes

We will now get the day’s price change, which is in a <div> with the id “b_changetext”. For this you simply pass an attrs dict to .find:

response = requests.get("https://www.moneycontrol.com/india/stockpricequote/auto-2-3-wheelers/heromotocorp/HHM", timeout=240)
content = BeautifulSoup(response.content, "html.parser")
price_div = content.find("div",attrs={"id":'b_changetext'})
HTML(str(price_div))
Output of price getter code

Some other variants of find are as follows:

content.find_all("p")
content.find_next("p",attrs={"class":"my-id"})

.find_all finds all occurrences of the given specification on the page, while .find_next finds only the next one. Once you have an element, use .text to get its textual content (the innerText of the browser DOM).

elem = content.find("p",attrs={"class":"my-id"})
text = elem.text

Getting Child elements

To find the children of an element (one you found using the above methods), call .children on it. This gives you an iterable which you can use in loops.

list(price_div.children)
getting children of element

As you can see above, .children gives us 3 children, the first of which is an element containing nothing but whitespace. We will create a function that filters these out and gives us only proper elements. Any actual element on the page, once parsed, is represented by the bs4.element.Tag type. We remove any string which is just spaces or newlines unless it is enclosed in a Tag element.

def get_children(html_content):
    return [item for item in html_content.children
            if type(item)==bs4.element.Tag
            or len(str(item).replace("\n","").strip())>0]
Get children filter function output
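
As an illustration (the exact children depend on the page), calling it on the price div from earlier keeps only the meaningful elements:

get_children(price_div)  # whitespace-only strings are filtered out;
                         # only Tag elements and non-empty strings remain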

Parsing Tables

So far we have learned how to find the data we need in individual elements. But what about tables? Traversing a table cell by cell to find the necessary information each time would be very cumbersome. Note that tables can be created using the <table> tag, but table-like structures can also be built in HTML using other tags; we will learn to parse both kinds.

So we are going to create a function that extracts the tabular data from a table into a 2D array format.

First we create a table and display it.

html = '''
<table>
<tr>
<td>Month</td>
<td>Price</td>
</tr>
<tr>
<td>July</td>
<td>2</td>
</tr>
<tr>
<td>August</td>
<td>4</td>
</tr>
<tr>
<td>September</td>
<td>3</td>
</tr>
<tr>
<td>October</td>
<td>2</td>
</tr>
</table>
'''
HTML(html)
Table to be parsed

Let me explain the parsing process in pseudo-code before the actual implementation:

Step 1: Initialise the final table_data as an empty list.
Step 2: Get all rows in a list.
Step 3: For each row in the list of rows:
    - Initialise row_data as an empty list.
    - Get a list of cells in the row.
    - For each cell, get its text content:
        # if no text content is present, skip to the next cell
        # else append the text content to row_data
    - Append row_data to table_data.
Step 4: Return table_data.

The Python function below implements these steps. We will use it to parse the table from earlier.

def get_table_simple(table, is_table_tag=True):
    elems = table.find_all('tr') if is_table_tag else get_children(table)
    table_data = list()
    for row in elems:
        row_data = list()
        row_elems = get_children(row)
        for elem in row_elems:
            text = elem.text.strip().replace("\n","")
            text = remove_multiple_spaces(text)
            if len(text)==0:
                continue
            row_data.append(text)
        table_data.append(row_data)
    return table_data
Using get_table_simple
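
Running it on the <table> we built above should give a 2D list like this:

get_table_simple(BeautifulSoup(html, "html.parser").find("table"))
# [['Month', 'Price'], ['July', '2'], ['August', '4'],
#  ['September', '3'], ['October', '2']]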

Let’s see if it can parse another type of table: one created with <div> tags rather than a <table> tag.

html = '''
<html>
<body>
<div id="table" class="FL" style="width:210px; padding-right:10px">
<div class="PA7 brdb">
<div class="FL gL_10 UC">MARKET CAP (Rs Cr)</div>
<div class="FR gD_12">63,783.84</div>
<div class="CL"></div>
</div>
<div class="PA7 brdb">
<div class="FL gL_10 UC">P/E</div>
<div class="FR gD_12">17.27</div>
<div class="CL"></div>
</div>
<div class="PA7 brdb">
<div class="FL gL_10 UC">BOOK VALUE (Rs)</div>
<div class="FR gD_12">589.29</div>
<div class="CL"></div>
</div>
<div class="PA7 brdb">
<div class="FL gL_10 UC">DIV (%)</div>
<div class="FR gD_12">4750.00%</div>
<div class="CL"></div>
</div>
<div class="PA7 brdb">
<div class="FL gL_10 UC">Market Lot</div>
<div class="FR gD_12">1</div>
<div class="CL"></div>
</div>
<div class="PA7 brdb">
<div class="FL gL_10 UC">INDUSTRY P/E</div>
<div class="FR gD_12">19.99</div>
<div class="CL"></div>
</div>
</div>
</body>
</html>
'''
HTML(html)
content = BeautifulSoup(html,"html.parser")
get_table_simple(content.find("div",attrs={"id":"table"}),is_table_tag=False)
Parsed 2D table in python

As you can see, it successfully parses this as well.
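
For reference, the call above should return the following 2D list (the values are still strings; ffloat_list can convert the numeric ones later):

# [['MARKET CAP (Rs Cr)', '63,783.84'], ['P/E', '17.27'],
#  ['BOOK VALUE (Rs)', '589.29'], ['DIV (%)', '4750.00%'],
#  ['Market Lot', '1'], ['INDUSTRY P/E', '19.99']]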

Putting it Together

Let’s look at the places on the page from which we can take data.

data locations

I inspected each of these areas using the Chrome dev tools and found the correct id for each. For the lower two big boxes I used the get_table_simple function we wrote earlier. The step-wise process is as follows:

Step 1: Get the page content using requests.
Step 2: Parse the page content using BeautifulSoup.
Step 3: Use the Chrome dev tools to find the id of each highlighted block.
Step 4: Get the price and the yearly low and high.
Step 5: Get the tag enclosing the lower two boxes.
Step 6: The 1st box is the 1st child; parse it as a table.
Step 7: The 2nd box is the 2nd child; parse it as a table.
Step 8: Combine the tables into a single dict named collector.
Step 9: Populate the final dict key_val_pairs.

The final function is as follows.

Final Scraping Function
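
The full function is embedded in the linked notebook; a minimal sketch following the steps above might look like the code below. The id mktdet_1 used for the div enclosing the two lower boxes is an assumption for illustration — locate the real enclosing tag with the Chrome inspector as shown earlier.

def get_scrip_info(url):
    response = requests.get(url, timeout=240)
    page = BeautifulSoup(response.content, "html.parser")
    key_val_pairs = dict()

    # Step 4: name and day's price change, from the elements we located earlier
    key_val_pairs["name"] = remove_multiple_spaces(page.find("h1").text)
    change_div = page.find("div", attrs={"id": "b_changetext"})
    if change_div is not None:
        key_val_pairs["change"] = ffloat(change_div.text)

    # Steps 5-7: the tag enclosing the two lower boxes;
    # "mktdet_1" is a placeholder id -- find the real one with dev tools
    boxes = page.find("div", attrs={"id": "mktdet_1"})
    collector = dict()
    if boxes is not None:
        for box in get_children(boxes)[:2]:
            for row in get_table_simple(box, is_table_tag=False):
                if len(row) == 2:  # keep only key-value rows
                    collector[row[0]] = ffloat(row[1])

    # Steps 8-9: merge into the final dict
    key_val_pairs.update(collector)
    return key_val_pairs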

To use the function, we pass it the page URL:

get_scrip_info("https://www.moneycontrol.com/india/stockpricequote/auto-2-3-wheelers/heromotocorp/HHM")
Final Data from Scraping.

What can we do next?

  • Make a search function to find stocks by their NSE scrip name. (Just as Apple is AAPL, Indian stocks have short names as well.)
  • Get past prices for analysing returns.
  • Deal with pages with incorrect HTML (pages whose HTML is not syntactically correct; browsers fix them before rendering, but scraping them is difficult).
  • Simple parallelisation of map operations to speed up scraping.
  • Store snapshots of scraped data using the dill or pickle library.
  • Explore stock data like annual returns and deviations in price using various plots.
  • Compare how indexes like the Nifty 50, Nifty 100 and Nifty Mid-cap 50 performed relative to each other. ETFs track indexes, so if you invest in them you need to know how the index performs.
  • Select stocks based on P/E, P/B and other measures.

I will cover these in a future tutorial.

Notebook Link for reference.
