
How to Collect Live Feed and Frequently Updated Data Using Cron

Cron lets you schedule repeating tasks, making it a great tool for running data collection scripts

Photo by Nick Chong on Unsplash

A major concern when collecting time series data is ensuring that all data is collected at equal time intervals. Without equal time intervals, you will be unable to use most methods for time series analysis. Unfortunately, not all data comes so clean and evenly spaced. Whether you are web scraping or using an API to collect your data, you may be stuck with a live feed of current data when evenly spaced historical data is unavailable. Or perhaps you need to repeatedly grab the most up-to-date data for your dashboard or project. Cron handles both of these situations well.

Cron is a tool that you can use in your terminal/console to schedule tasks, called "cronjobs", to run repeatedly at a specified time interval. These tasks can be anything you can do from the terminal, but for the typical data scientist using Python, it’s great for repeatedly running a .py script that scrapes a website or interfaces with an API. This way you can choose the time interval at which you want to fetch new data, either adding to your live feed data collection or updating your dataset with the most recent information. Here’s how to implement it:

Step 0: Create a script (or other task)

Before we even get started with cron, identify what you want to do, and make sure you know how to execute it through the terminal. If you are web scraping or interfacing with an API to create a time-sensitive dataset, create your .py script, and be sure it executes properly in the terminal with the command:

python your_script.py

Also, if you are collecting data, it is important that your script outputs the information collected and stores it somewhere for later. For example, let’s say I am collecting a dataset of tweets using Twitter’s API and want to make sure it has the most recent tweets for a particular hashtag. After receiving tweets from the API, I need to store these tweets somewhere. Having the script output a simple text file may do, or if I’ve converted the data into a pandas DataFrame, it’s easy enough to use the to_csv() method. However you store your data, be sure that your script saves these new files to a predetermined location.
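As a rough illustration of a script that stores what it collects, here is a minimal sketch using only the standard library. The fetch_snapshot() function, its field names, and the data/snapshots.csv path are all hypothetical placeholders; swap in your own scraping or API call and your own location. Each run appends one timestamped row, so repeated runs build up a time series.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical output location, relative to the directory the script runs in.
OUTPUT_FILE = Path("data") / "snapshots.csv"


def fetch_snapshot():
    """Placeholder for your own scraping or API call (made-up fields)."""
    return {"station_id": "demo", "bikes_available": 0}


def append_snapshot(row, output_file=OUTPUT_FILE):
    """Append one timestamped row, writing the header only for a new file."""
    output_file = Path(output_file)
    output_file.parent.mkdir(parents=True, exist_ok=True)
    is_new_file = not output_file.exists()
    # Record when the row was collected, in UTC, alongside the data itself.
    record = {"collected_at": datetime.now(timezone.utc).isoformat(), **row}
    with output_file.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(record))
        if is_new_file:
            writer.writeheader()
        writer.writerow(record)


if __name__ == "__main__":
    append_snapshot(fetch_snapshot())
```

Appending rather than overwriting is a design choice for live feed collection; if you only need the latest snapshot for a dashboard, overwriting the file each run may be simpler.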

Step 1: Create a crontab

A crontab is a file that contains a list of cronjobs and how often you want to execute them. To open/edit a crontab simply use the command:

crontab -e

The -e is for edit. This should open a vim window that contains all of your cronjobs. If you’ve never used cron before it should look something like this:

Empty Vim Cron Window

Step 2: Create a Cronjob

In your crontab, enter the cronjob you want to execute. If the editor is vim, press the i key to enter insert mode and type your first cronjob.

Cron syntax:

The syntax of a cronjob can be a little confusing. Here’s a recent one of mine:

1-59/15 * * * * cd ~/Desktop/live_data && /Users/mitchellkrieger/opt/anaconda3/envs/learn-env/bin/python3 bikecron.py >> ~/Desktop/live_data/logs/log.txt 2>&1

Let’s break this down piece by piece. A simplified pseudocode cronjob would look like this:

[time interval] [cronjob] >> [where to log] 2>&1

Time interval: The first part sets how often you want the cronjob to run. It consists of five slots for the minute (0-59), hour (0-23), day of month (1-31), month (1-12), and day of week (0-6), respectively.

There are 4 main special characters for the cron time interval:

  • * indicates any value
  • , separates values in a list
  • - indicates a range of values
  • / indicates the step size between values

So, if you wanted to run the cronjob every 30 minutes from 9am to 5pm every other month, you would put 0,30 9-17 * */2 *, where 0,30 indicates the 0th and 30th minute of every hour, 9-17 indicates the hours from 9am through 5pm (so the last run is at 5:30pm), the first * is any day of the month, */2 is every other month, and the last * is any day of the week. Note that for day of week, Sunday is 0. The Crontab Guru is a helpful tool that translates the cron time format into an English phrase if you are worried about hitting your desired interval with this syntax.
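To make the four special characters concrete, here is a small sketch that expands a single cron field into the values it matches. The expand_cron_field function is a hypothetical helper written for illustration, not part of cron itself, and it only handles numeric fields (no month or weekday names, and no shortcuts such as @daily).

```python
def expand_cron_field(field, low, high):
    """Return the sorted values matched by one numeric cron field."""
    values = set()
    for part in field.split(","):            # ',' separates values in a list
        base, _, step = part.partition("/")  # '/' introduces a step size
        step = int(step) if step else 1
        if base == "*":                      # '*' covers the whole allowed range
            start, end = low, high
        elif "-" in base:                    # '-' marks an inclusive range
            start, end = (int(x) for x in base.split("-"))
        else:                                # a single value
            start = end = int(base)
        if not (low <= start <= end <= high) or step < 1:
            raise ValueError(f"invalid cron field: {field!r}")
        values.update(range(start, end + 1, step))
    return sorted(values)


print(expand_cron_field("0,30", 0, 59))     # minutes: [0, 30]
print(expand_cron_field("9-17", 0, 23))     # hours: [9, 10, 11, 12, 13, 14, 15, 16, 17]
print(expand_cron_field("*/2", 1, 12))      # months: [1, 3, 5, 7, 9, 11]
print(expand_cron_field("1-59/15", 0, 59))  # minutes: [1, 16, 31, 46]
```

Passing each slot's allowed range (for example 0 to 59 for minutes) lets the helper reject out-of-range values such as a minute of 60.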

Cronjob: This section is what you are executing from the terminal. In my example above, I first navigate to the directory where my script lives (cd ~/Desktop/live_data) and then execute it. The && chains a second command that runs only if the first one succeeds. The long file path points to where my python3 executable is installed; spelling it out in full matters because cron runs with a minimal environment and may not find the Python you expect. If you don’t know where this is, try the which python command in the terminal and it should output this path. Lastly, I give the name of my script (bikecron.py) so that Python executes it.

Where to log: This is optional, but the last piece tells cron where to log the job’s output, including any errors that may be helpful in troubleshooting. The >> appends standard output to a text file called log.txt at that file path. The 2>&1 redirects standard error (file descriptor 2) to wherever standard output (file descriptor 1) is going, so errors land in the same file; the & marks the 1 as a file descriptor rather than a file name. Without this piece, cron may try to email you the output of a job. I highly recommend using it.
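If the shell redirection feels opaque, the same idea can be sketched in Python with the standard subprocess module. The run_and_log function below is a hypothetical helper for illustration: opening the file in append mode plays the role of >>, and stderr=subprocess.STDOUT plays the role of 2>&1.

```python
import subprocess
import sys
from pathlib import Path


def run_and_log(command, log_path):
    """Run a command, appending both its stdout and stderr to log_path."""
    log_path = Path(log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a") as log_file:   # "a" appends, like >>
        result = subprocess.run(
            command,
            stdout=log_file,               # standard output goes to the log
            stderr=subprocess.STDOUT,      # standard error follows it, like 2>&1
        )
    return result.returncode
```

For example, run_and_log([sys.executable, "bikecron.py"], "logs/log.txt") would run the script with the current interpreter and collect its regular output and its error messages in one file.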

Putting it all together:

* * * * * cd ~/navigate/to/directory && /navigate/to/python3 script_to_execute.py >> ~/place/to/store/log.txt 2>&1
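If you would rather not hand-type the paths, a short helper can assemble the line for you. This is only a sketch: build_cron_line is a hypothetical function, and it assumes a macOS or Linux machine. By default it uses sys.executable, the interpreter currently running the code, which is the same path that which python reports in an activated environment.

```python
import shlex
import sys
from pathlib import Path


def build_cron_line(schedule, script_path, log_path, python_path=sys.executable):
    """Assemble one crontab line from its parts, using absolute paths."""
    script = Path(script_path).expanduser().absolute()
    log = Path(log_path).expanduser().absolute()
    # shlex.quote protects paths that contain spaces or other shell characters.
    command = (
        f"cd {shlex.quote(str(script.parent))} && "
        f"{shlex.quote(str(python_path))} {shlex.quote(script.name)} "
        f">> {shlex.quote(str(log))} 2>&1"
    )
    return f"{schedule} {command}"


# Prints a line you can paste into the editor opened by crontab -e.
print(build_cron_line("*/15 * * * *", "~/Desktop/live_data/bikecron.py",
                      "~/Desktop/live_data/logs/log.txt"))
```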

Step 3: Save Crontab and allow access to computer

After you’ve created your cronjob in the crontab, save it by hitting esc, typing :wq, and hitting enter. A confirmation window may pop up; if it does, hit OK.

At this point you may be all set to go, but on some Macs the security preferences may prevent cron from executing. You’ll know if this happens because an "[Errno 1] Operation not permitted" error will be logged in the log.txt file. To fix this, go to System Preferences > Security & Privacy > Privacy. Scroll down to Full Disk Access and grant access to cron.

If you don’t see cron, use the + symbol to find it. If you don’t know where cron is, press Command+Shift+G to open Finder’s Go to Folder… feature, type /usr/sbin/cron, and hit Enter.

Be careful with this step. Malware on your computer could try to use cron to create cronjobs, and when cron is checked in the Full Disk Access list, it has unrestricted access to everything on your computer. If you are not actively running cron, I recommend denying Full Disk Access to cron by unchecking it.

Step 4: Cron does stuff for you!

After that, you should be all set, and cron will execute your cronjob at the interval you told it to, effectively collecting evenly spaced time series data and/or updating to the newest data at regular intervals. I recommend checking after the first couple of intervals to make sure everything is set up correctly.
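One way to do that check, assuming your script records a timestamp with every run, is to look for spacings that differ from the interval you scheduled. The find_gaps function below is a hypothetical helper, and the 30-second tolerance is an arbitrary allowance for small differences in how long each run takes.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected, tolerance=timedelta(seconds=30)):
    """Return (previous, current) pairs whose spacing is off by more than tolerance."""
    ordered = sorted(timestamps)
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if abs((current - previous) - expected) > tolerance:
            gaps.append((previous, current))
    return gaps


# Made-up timestamps: the 9:45 run is missing.
runs = [
    datetime(2021, 1, 1, 9, 0),
    datetime(2021, 1, 1, 9, 15),
    datetime(2021, 1, 1, 9, 30),
    datetime(2021, 1, 1, 10, 0),
]
print(find_gaps(runs, expected=timedelta(minutes=15)))
```

An empty list means every run landed on schedule; any pair returned marks a stretch where a run was skipped or delayed, for example because the computer was asleep.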

Step 5: Learn more about Cron

Other useful cron commands as you continue to use it:

  • crontab -l lists all of your current cronjobs in your crontab
  • crontab -r removes the crontab file
  • crontab file installs the named file as your crontab, replacing your current one

Cron will only run while your computer is awake, so you’ll either need to set your computer to stay awake or wake up at regular intervals.

