Stock Market Analysis in Python
PART 1: Getting Data by Web Scraping
Stock markets globally have been on a spectacular bull run since 2013. In India many companies have grown over 10 times, and even the industry leaders that make up the Nifty 50 (India’s top 50 companies) have more than doubled. The annual return of the Nifty 50 was over 20% for 2017, and the trend seems to be continuing in 2018. People are investing in mutual funds and ETFs to follow the trend.
In such a situation it seems that we should invest more in stocks, yet if you look at the news you will see that most market analysts are calling the current market very expensive to invest in. So do we take their word for it, or do we do some data analysis to find out for ourselves? How do we find good companies in a highly overvalued market? Is this another hype like the bitcoin/cryptocurrency bubble?
In this series of tutorials we are going to find that out using Python.
In Part 1 we learn how to get the data; in Part 2 we will look at how to do the analysis.
In this tutorial (Part 1) we will learn how to:
- Make HTTP requests in Python via the requests library.
- Use Chrome dev tools to see where data is located on a page.
- Scrape data from downloaded pages with the BeautifulSoup library when the data is not available in structured form.
- Parse data like tables into Python 2D arrays (lists of lists).
- Write a scraping function that returns the data as a dictionary (key-value pairs).
The Jupyter Notebook required for this is here.
Setup
1. Install Jupyter Notebook by installing Anaconda. See my previous article for installing on a Linux server.
2. Make sure you have the following Python packages installed in addition to Anaconda’s default package set:
beautifulsoup4
fastnumbers
dill
3. Start a Python 3 Jupyter notebook and add the following imports.
import numpy as np # linear algebra
import pandas as pd # pandas for dataframe based data processing and CSV file I/O
import requests # for http requests
from bs4 import BeautifulSoup # for html parsing and scraping
import bs4
from fastnumbers import isfloat
from fastnumbers import fast_float
from multiprocessing.dummy import Pool as ThreadPool
import matplotlib.pyplot as plt
import seaborn as sns
import json
from tidylib import tidy_document # for tidying incorrect html
sns.set_style('whitegrid')
%matplotlib inline
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
Now we are ready to get started. In case of any difficulty, just see the Jupyter notebook I have on GitHub.
Some Utilities we will need in Scraping data
String to float conversion
A lot of numbers in web pages are present as strings with commas and % symbols. We use the fast_float function from the fastnumbers library to convert them.
def ffloat(string):
    if string is None:
        return np.nan
    if type(string)==float or type(string)==np.float64:
        return string
    if type(string)==int or type(string)==np.int64:
        return string
    return fast_float(string.split(" ")[0].replace(',','').replace('%',''),
                      default=np.nan)
If the input is already a float or an int we return it as-is; otherwise we strip commas and % symbols and then convert.
Another function applies this conversion to a whole list of strings:
def ffloat_list(string_list):
    return list(map(ffloat, string_list))
Removing multiple spaces from within a string
When extracting text from web pages, some strings contain multiple spaces between words. We collapse them into single spaces to maintain consistency.
def remove_multiple_spaces(string):
    if type(string)==str:
        return ' '.join(string.split())
    return string
Making HTTP Requests in Python
For this we will use the Python requests library. You need to know the URL of the page you are making the request to. Use the method requests.get to make the request, then use response.status_code and response.content to get the HTTP status and the page content respectively.
response = requests.get("http://www.example.com/", timeout=240)
response.status_code
response.content
Note that the requests library does not run the JavaScript on a page, so any data/content that is fetched via JavaScript after the HTML loads will not be available. This is usually not an issue, since most financial websites use server-side rendering and send fully prepared pages to clients.
Getting JSON content and parsing it
To get the JSON content from a page, just call response.json():
url = "https://jsonplaceholder.typicode.com/posts/1"
response = requests.get(url, timeout=240)
response.status_code
response.json()
content = response.json()
content.keys()
To be sure that your request succeeded, always check response.status_code (200 means OK).
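Under the hood, response.json() simply parses the response body as JSON, the same thing the standard-library json module does. You can see the same parsing offline (the payload below is a made-up sample shaped like the jsonplaceholder response above):

```python
import json

# A sample payload mirroring the shape of the jsonplaceholder post endpoint
payload = '{"userId": 1, "id": 1, "title": "sample post", "body": "sample body"}'
content = json.loads(payload)

print(content.keys())    # dict_keys(['userId', 'id', 'title', 'body'])
print(content["title"])  # sample post
```

Once parsed, the result is an ordinary Python dict, so .keys(), indexing and iteration all work as usual.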
Scrape Data by Parsing and Traversing HTML
We will be using the beautifulsoup4 library to parse HTML strings into a tree-like representation.
Rendering HTML Strings in Jupyter Notebook
from IPython.core.display import HTML
HTML("<b>Rendered HTML</b>")
Using Chrome Inspector (Dev Tools) to get the position of Content
We will be seeing this url: https://www.moneycontrol.com/india/stockpricequote/auto-2-3-wheelers/heromotocorp/HHM
It’s the page of a company that makes motorbikes in India. Check the page out.
To extract any content from the page you need to know where it sits in the HTML, so first we will get the title (“Hero Motocorp Ltd.”). Let’s inspect the page with the Chrome inspector. To open it, use Cmd+Option+I on Mac, or Ctrl+Shift+I on Windows and Linux.
Open chrome inspector -> Click on Elements (1) -> Click on cursor-box item (2) -> Point at “Hero Motocorp Ltd.” and then click.
As you can see, the company name is in an <h1> tag. Next we will look at how to get that content into our notebook.
Parsing and displaying Content using BeautifulSoup4
For this we fetch the response, parse the content using the BeautifulSoup class, and finally extract the content of the <h1> tag and render it.
response = requests.get("https://www.moneycontrol.com/india/stockpricequote/auto-2-3-wheelers/heromotocorp/HHM", timeout=240)
page_content = BeautifulSoup(response.content, "html.parser")
HTML(str(page_content.find("h1")))
Getting HTML elements by attributes
We will be getting the day’s price change, which is in a <div> with the id "b_changetext". For this you simply pass an attrs dictionary to the find method:
response = requests.get("https://www.moneycontrol.com/india/stockpricequote/auto-2-3-wheelers/heromotocorp/HHM", timeout=240)
content = BeautifulSoup(response.content, "html.parser")
price_div = content.find("div", attrs={"id": 'b_changetext'})
HTML(str(price_div))
Some other variants of find are as follows:
content.find_all("p")
content.find_next("p",attrs={"class":"my-id"})
find_all finds all occurrences matching the given specification in the page, while find_next finds just the next occurrence. Once you have an element, you can use .text to get its textual content (the innerText of the browser DOM).
elem = content.find("p",attrs={"class":"my-id"})
text = elem.text
Getting Child elements
To find the children of an element (one you found using the methods above), access its .children attribute. This gives you an iterable which you can use in loops.
list(price_div.children)
As you can see above, .children gives you 3 children, of which the first is a string containing nothing but whitespace. We will create a function that filters these out and gives us only proper elements. Any actual element on the page is represented after parsing by the bs4.element.Tag type. We remove any string which is just spaces or newlines unless it is enclosed in a Tag element.
def get_children(html_content):
    return [item for item in html_content.children
            if type(item)==bs4.element.Tag or len(str(item).replace("\n","").strip())>0]
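To see the filter in action on a tiny made-up snippet (assuming beautifulsoup4 is installed; the numbers inside the spans are arbitrary sample values):

```python
import bs4
from bs4 import BeautifulSoup

def get_children(html_content):
    # Keep Tag elements and any string that is more than just whitespace
    return [item for item in html_content.children
            if type(item) == bs4.element.Tag
            or len(str(item).replace("\n", "").strip()) > 0]

snippet = BeautifulSoup(
    "<div>\n  <span>46,890.30</span>\n  <span>-0.75%</span>\n</div>",
    "html.parser")
div = snippet.find("div")

print(len(list(div.children)))  # 5 nodes: whitespace strings plus the two spans
print(len(get_children(div)))   # 2: only the <span> tags survive the filter
```

The raw .children iterator interleaves whitespace-only strings between tags, which is exactly the noise get_children strips out.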
Parsing Tables
So far we have learnt how to find the data we need in single elements. But what about tables? Traversing a table cell by cell to find the necessary info each time would be very cumbersome. Note that tables can be created with the <table> tag, but table-like structures can also be built in HTML using other tags; we will learn to parse both kinds.
We are going to create a function that extracts the tabular data out of a table in a 2D array format.
First we create a table and display it.
html = '''
<table>
<tr>
<td>Month</td>
<td>Price</td>
</tr>
<tr>
<td>July</td>
<td>2</td>
</tr>
<tr>
<td>August</td>
<td>4</td>
</tr>
<tr>
<td>September</td>
<td>3</td>
</tr>
<tr>
<td>October</td>
<td>2</td>
</tr>
</table>'''
HTML(html)
Let me explain the parsing process in pseudo-code before the actual implementation:
Step 1: Initialise the final table_data as an empty list.
Step 2: Get all rows in a list.
Step 3: For each row in the list of rows:
- Initialise row_data as an empty list
- Get a list of cells in the row
- For each cell, get its text content
# if no text content is present, skip to the next cell
# else append the text content to row_data
- Append row_data to table_data
Step 4: Return table_data.
The below python function implements these steps. We will use it to parse the earlier table.
def get_table_simple(table, is_table_tag=True):
    elems = table.find_all('tr') if is_table_tag else get_children(table)
    table_data = list()
    for row in elems:
        row_data = list()
        row_elems = get_children(row)
        for elem in row_elems:
            text = elem.text.strip().replace("\n","")
            text = remove_multiple_spaces(text)
            if len(text)==0:
                continue
            row_data.append(text)
        table_data.append(row_data)
    return table_data
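Running it on a <table> like the month/price one above gives a clean 2D array. Here is a self-contained re-run on an abbreviated copy of that table (assuming beautifulsoup4 is installed):

```python
import bs4
from bs4 import BeautifulSoup

def remove_multiple_spaces(string):
    if type(string) == str:
        return ' '.join(string.split())
    return string

def get_children(html_content):
    # Keep Tag elements and any string that is more than just whitespace
    return [item for item in html_content.children
            if type(item) == bs4.element.Tag
            or len(str(item).replace("\n", "").strip()) > 0]

def get_table_simple(table, is_table_tag=True):
    elems = table.find_all('tr') if is_table_tag else get_children(table)
    table_data = list()
    for row in elems:
        row_data = list()
        for elem in get_children(row):
            text = remove_multiple_spaces(elem.text.strip().replace("\n", ""))
            if len(text) == 0:
                continue
            row_data.append(text)
        table_data.append(row_data)
    return table_data

html = '''<table>
<tr><td>Month</td><td>Price</td></tr>
<tr><td>July</td><td>2</td></tr>
<tr><td>August</td><td>4</td></tr>
</table>'''
content = BeautifulSoup(html, "html.parser")
print(get_table_simple(content.find("table")))
# → [['Month', 'Price'], ['July', '2'], ['August', '4']]
```

Each <tr> becomes one inner list, with empty cells skipped, which is exactly the shape we want for downstream analysis.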
Let’s see if it can parse another type of table: one created with <div> tags rather than a <table> tag.
html = '''
<html>
<body>
<div id="table" class="FL" style="width:210px; padding-right:10px">
<div class="PA7 brdb">
<div class="FL gL_10 UC">MARKET CAP (Rs Cr)</div>
<div class="FR gD_12">63,783.84</div>
<div class="CL"></div>
</div>
<div class="PA7 brdb">
<div class="FL gL_10 UC">P/E</div>
<div class="FR gD_12">17.27</div>
<div class="CL"></div>
</div>
<div class="PA7 brdb">
<div class="FL gL_10 UC">BOOK VALUE (Rs)</div>
<div class="FR gD_12">589.29</div>
<div class="CL"></div>
</div>
<div class="PA7 brdb">
<div class="FL gL_10 UC">DIV (%)</div>
<div class="FR gD_12">4750.00%</div>
<div class="CL"></div>
</div>
<div class="PA7 brdb">
<div class="FL gL_10 UC">Market Lot</div>
<div class="FR gD_12">1</div>
<div class="CL"></div>
</div>
<div class="PA7 brdb">
<div class="FL gL_10 UC">INDUSTRY P/E</div>
<div class="FR gD_12">19.99</div>
<div class="CL"></div>
</div>
</div>
</body>
</html>
'''
HTML(html)
content = BeautifulSoup(html, "html.parser")
get_table_simple(content.find("div",attrs={"id":"table"}),is_table_tag=False)
As you can see it successfully parses this as well.
Putting it Together
Let’s look at the places on the page from which we can take data.
I inspected each of these areas using Chrome dev tools and found the correct id for each. For the lower two big boxes I used the get_table_simple function we wrote earlier. The step-wise process is as below:
Step 1: Get page content using requests
Step 2: Parse page content using BeautifulSoup
Step 3: Use chrome dev tool to find id of each highlighted block
Step 4: Get the price and the yearly low/high.
Step 5: Get the enclosing tag of the lower two boxes.
Step 6: 1st box is 1st child, parse it as table.
Step 7: 2nd box is 2nd child, parse it as table.
Step 8: Combine the tables into a single dict named collector.
Step 9: Populate the final dict key_val_pairs.
The final function is as follows.
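The full implementation is in the notebook; a sketch along the lines of steps 1 to 9 might look like the code below. Note that apart from "b_changetext" (which we located earlier), the id "mktdet_1" for the enclosing box is a placeholder guess and the helper definitions are minimal stand-ins for the versions written earlier; inspect the live page with dev tools to confirm the real ids before relying on this.

```python
import requests
import bs4
from bs4 import BeautifulSoup

# Minimal stand-ins for the helpers defined earlier in the tutorial
def ffloat(string):
    if isinstance(string, (int, float)):
        return string
    try:
        return float(string.split(" ")[0].replace(',', '').replace('%', ''))
    except (ValueError, AttributeError):
        return float('nan')

def get_children(html_content):
    return [item for item in html_content.children
            if type(item) == bs4.element.Tag
            or len(str(item).replace("\n", "").strip()) > 0]

def get_table_simple(table, is_table_tag=True):
    elems = table.find_all('tr') if is_table_tag else get_children(table)
    table_data = list()
    for row in elems:
        table_data.append([' '.join(e.text.split()) for e in get_children(row)
                           if len(e.text.strip()) > 0])
    return table_data

def get_scrip_info(url):
    # Steps 1-2: fetch and parse the page
    response = requests.get(url, timeout=240)
    page_content = BeautifulSoup(response.content, "html.parser")
    key_val_pairs = {}

    # Steps 3-4: company name (<h1>) and the day's change ("b_changetext")
    key_val_pairs["name"] = page_content.find("h1").text.strip()
    change_div = page_content.find("div", attrs={"id": "b_changetext"})
    if change_div is not None:
        key_val_pairs["change"] = ffloat(change_div.text)

    # Steps 5-8: the enclosing tag of the two lower boxes ("mktdet_1" is a
    # placeholder id), parse each child box as a div-based table and combine
    collector = {}
    boxes = page_content.find("div", attrs={"id": "mktdet_1"})
    if boxes is not None:
        for box in get_children(boxes):
            for row in get_table_simple(box, is_table_tag=False):
                if len(row) == 2:
                    collector[row[0]] = ffloat(row[1])

    # Step 9: populate the final dict
    key_val_pairs.update(collector)
    return key_val_pairs
```

The function is deliberately defensive (None checks on every find), since scraped pages change layout without notice.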
To use the function, we pass it the page URL.
get_scrip_info("https://www.moneycontrol.com/india/stockpricequote/auto-2-3-wheelers/heromotocorp/HHM")
What can we do next?
- Make a search function to find a stock by its NSE scrip name. (Just as Apple is AAPL, Indian stocks have short names as well.)
- Get past prices for analysing returns
- Deal with pages with incorrect HTML (pages which don’t have syntactically correct HTML; browsers correct them before rendering, but scraping them is difficult).
- Simple parallelisation of map operation to speed up scraping.
- Storing snapshots of scraped data using dill or pickle library.
- Exploring stocks data like annual returns and deviations in price using various plots.
- Comparing how indexes like Nifty 50, Nifty 100 and Nifty Mid-cap 50 performed relative to each other. ETFs track indexes, so if you invest in them you need to know how the index performs.
- Selecting stocks based on P/E, P/B and other ratios.
I will cover these in a future tutorial.