How web scraping helped me go from learning to teaching

And also a quick tutorial

Otávio Simões Silveira
Towards Data Science


Business vector created by katemangostar — www.freepik.com

I started learning Python for data science at the beginning of the current year, 2020. I had basically no coding background, with the exception of a C++ programming class back in college, a long time ago. Although I could find some great (and sometimes free) courses on the internet, there are always situations where you need to turn to the community for help, sometimes because you didn’t understand a concept or technique, sometimes because you’re trying something more complex. I did (and still do) this a lot.

But during my fifth month of studies, I started to discover the other side of this: the people who answer the questions. I was (and still am) using an online platform called Dataquest. They have a great community where people share projects, ask for help, and so on, and where I found out that no matter how much of a beginner you are, someone else will always be even more of one.

So, I’ve been active in their community for about a month now. I started visiting it more often after I was accepted into their Covid-19 Financial Aid Scholarship. Being accepted into the program made me feel really grateful for being helped during these tough times everybody is going through, so I felt I should put in more effort to help others. I started visiting the community on a daily basis to see if I could help someone.

The idea was that, if I could help other students with their questions, I would be helping not only the student whose question I’d answer but also the platform that helped me in the first place.

Although I’ve been studying for almost six months now, I’m still not a Python expert or an experienced data scientist or anything like that. I’m just a data science student, so I wasn’t sure whether I was capable of answering people’s questions. But as it turns out, I was. And as I kept answering questions, I noticed that I was also helping myself, since I had to revisit something I had already studied, or even learn something new, to answer a question.

After a while, I caught myself typing community.dataquest.io in my browser several times a day and having fun doing it. So I thought of a way to optimize my time and the help I was providing: I wrote a web scraper to notify me via email every time a new question was posted in the community.

The pros of this are:

• I practiced scraping;
• I do not need to check the website manually all the time anymore;
• Students can have their questions answered faster (when I’m capable of answering them, of course).

And I know that when you’re a beginner and you’re stuck, it’s easy to get demotivated, especially if you rely on the community for answers and have to wait hours or even days for a reply that lets you move on.

And the cons are… well, I don’t see any. Everybody wins.

After I got my scraper working, I felt it would be good to share the idea and the code with everyone in the community. My intention in sharing was to get more people to help others, to show that scraping (which is lots of fun) can be simple and maybe get someone interested in it, and, of course, to share what I’d done. The idea was very well received by the other users and moderators, and I was encouraged to try to publish it and reach a larger audience. So, here I am now.

Enough talking, let’s code!

After I got this scraper working, the number of questions I was answering every day skyrocketed. Now that you know the history behind the scraper, let me show you how I made it happen.

First, the code is written to use Google Chrome for the scraping and Gmail to send the emails. You can, of course, use other tools, but adapting the code for them is on you.

We’ll start by importing the libraries we’ll use. You’re probably already familiar with pandas and the sleep function from time. Other than those, we’ll use smtplib to send the emails and selenium, which is an extremely powerful tool, to scrape the website. If you’re into web scraping, `selenium` is a must.
Also, you need to download the Chrome WebDriver (if you’re using Chrome) and place it in the same directory as your script.
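
Here’s what the imports look like:

```python
import smtplib
from time import sleep

import pandas as pd
from selenium import webdriver
```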

Now, we’ll write the send_email function to send the emails. This function is pretty straightforward even if you’ve never worked with smtplib.
We’ll use the try and except clauses so the script doesn’t raise an error if it fails to connect to the Gmail server.
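
Here’s a sketch of what that function can look like. The address, password, and recipient are placeholders for you to replace with your own, and I’m assuming Gmail’s standard SSL port, 465:

```python
def send_email(subject, message):
    # placeholder credentials: replace with your own Gmail address and password
    sender = 'your_email@gmail.com'
    password = 'your_password'
    receiver = 'your_email@gmail.com'

    email = f'Subject: {subject}\n\n{message}'

    try:
        # connect to Gmail's SMTP server over SSL and send the message
        with smtplib.SMTP_SSL('smtp.gmail.com', 465) as server:
            server.login(sender, password)
            server.sendmail(sender, receiver, email)
        print('Email sent!')
    except Exception:
        # don't crash the scraper if the connection to Gmail fails
        print('Failed to send email.')
```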

Now, let’s get to the scraping. We’ll use an infinite loop to keep the code running all the time and, at the end of the loop, we’ll use sleep to set how long the scraper waits between each check for new posts. From now on, everything is inside the while loop.
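
In skeleton form, it’s just this (we’ll fill in the body step by step):

```python
while True:
    # 1. open the Community page with selenium
    # 2. read the topics table with pandas
    # 3. send an email if there are new topics
    sleep(600)  # wait ten minutes before checking again
```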

So, first we set up selenium and instantiate the driver object. Then we tell the driver to get the website. If you set option.headless to False, you can actually watch your browser open and go to the website to scrape the data, which is really fun.
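
A sketch of that setup, assuming chromedriver is in the script’s directory or on your PATH (the category URL here is my assumption of the Q&A section’s address, so double-check it):

```python
option = webdriver.ChromeOptions()
option.headless = True  # set to False to watch the browser navigate
driver = webdriver.Chrome(options=option)
driver.get('https://community.dataquest.io/c/qa')  # assumed Q&A section URL
```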

Now that we are in the Community, this is the part of the website we are interested in. If you know HTML, you know this is a table. If you don’t, you don’t need to worry about it:

The community’s Q&A section

We’ll use pd.read_html to read the driver’s page source. This returns a list with all the tables on the website as dataframes. As the page we’re scraping only has one table (the one we want), we’ll assign the first (and only) element of the list to our table variable:
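
```python
# read_html parses every HTML table on the page into a list of DataFrames;
# this page has a single table, so we keep the first (and only) element
table = pd.read_html(driver.page_source)[0]
```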

This is the table:

Although you can’t see it here, it has all the columns shown in the image above. Also, notice that the first topic seems to be a pinned one, which we’re not interested in since we’re looking for new topics. We’ll then use slicing to select the ten topics below the pinned one, keeping only the Topic, Replies, and Activity columns.

We also need to split the activity column into two columns, the first containing only the number and the other containing the letter that represents the time unit (hours or minutes):
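
```python
# keep the ten topics below the pinned one and only the columns we need
# (the column names here are assumed from the screenshot above)
table = table[['Topic', 'Replies', 'Activity']][1:11].copy()

# split Activity values such as '7m' or '2h' into a number and a time unit
table['number'] = table['Activity'].str[:-1].astype(int)
table['unit'] = table['Activity'].str[-1]
```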

And now we have this:

I defined a new topic as one created no more than 10 minutes before the scraper runs and with no replies. So, we’ll create the new_topics dataframe by selecting only the rows that fulfill these requirements. Then we’ll use the shape attribute to assign the number of new topics to the variable num_new:
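
```python
# a new topic: posted at most ten minutes ago (unit 'm') and still unanswered
new_topics = table[
    (table['unit'] == 'm') & (table['number'] <= 10) & (table['Replies'] == 0)
]
num_new = new_topics.shape[0]  # shape[0] is the number of rows
```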

And then we have:

The work is basically done. We’ll use an if statement to check whether the number of new posts is greater than zero and, if so, set up the subject and the message and call the send_email function. If it isn’t, the script just prints “No new topics found.”.

The subject contains the number of new posts, and the message’s body contains the new_topics dataframe, so we can see the titles of the new topics. The message also contains the URL, so we can just click and go to the Community right away.

After that, we’ll use sleep to make our code wait before it checks the website again. I set it to ten minutes, which I think is a fair amount of time for this.
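
Putting those last steps together, the end of the loop looks something like this (the exact subject and message wording are up to you):

```python
if num_new > 0:
    subject = f'{num_new} new topic(s) in the Community!'
    # the body shows the new topics plus a link back to the Community
    message = new_topics.to_string() + '\n\nhttps://community.dataquest.io/'
    send_email(subject, message)
else:
    print('No new topics found.')

driver.quit()  # close the browser before waiting for the next round
sleep(600)  # ten minutes until the next check
```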

Finally, for the email to be sent from your Gmail account, you must allow less secure apps in your account. I’m not providing a link for this; just google “less secure apps google” and you’ll see how to do it.

And that’s it. Feel free to use this code for whatever you want; it’s absolutely free. The entire code is available in this GitHub repository.

I hope you enjoyed this and that it can be useful somehow. As I said, I just wanted to help and maybe motivate people, and I think I accomplished at least one of those.

If you have a question or a suggestion, or just want to be in touch, feel free to contact me through Twitter, GitHub, or LinkedIn.
