
Exploring detection events

Doing Data Science from Scratch: Exploratory Data Analysis with Time Series

Doing Data Science from scratch task by task

Photo by Kalen Emsley on Unsplash

This is the next article in my series Doing Data Science from Scratch. We are doing a project to count the number of passing vehicles on the road where I live. The question to be answered is fundamental: when is the best time to go for a walk without being mowed down by speeding cars, trucks, or other agricultural vehicles? One of the earlier articles focused on recovering event data from log files, and that was a lot of fun. My most recent report, solving a video sequencing puzzle, allowed me to provide more insight into the project work. So I went ahead and recorded an entire day's events, and I now have a collection of video clips and the motion.log file I discussed in the previous article. In this post, I will explain my approach to some early analysis and report on the collection process. Buckle up your seat belt and let's do some Exploratory Data Analysis using Time Series and Pandas.


Data Collection

The previous articles already covered my strategy. In my Lab, I have a large number of purpose-built collection devices that can measure most things in our environment. I am using my Smart Doorbell to measure the flow of traffic outside my house. Perhaps just for giggles, here is a picture of some of the "stuff" I have built to help me measure and collect data.

An illustration of the data collection devices the author has. Image by the author, Ireland Dec 9

You can see my doorbell in the top left-hand set of images. The collection ran overnight from 18:00 GMT on December 8th to 18:00 GMT on December 9th. I will restrict the data to December 9th, as I won't be interested in walking in the pitch dark. Sunrise was at 08:30 and sunset at 16:30, and I will work with data in that range only. Our scope, therefore, is December 9th, 2020, 08:30–16:30.

Data

We are dealing with two types of data in this series. Those are:

  • A plain text log file. I wrote a script to parse the log file, and I modified it to create a serialised Python object. You could think of it as a JSON-like object for convenience. I wrote the serialised object to a disk file (a sketch of the idea follows this list). There is no need for an elaborate code base to perform CRUD operations against a database. Just spill to disk and read from disk.
  • Video data stored as clips, which I can process using a script that reads all those clips and makes a single video.
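
I won't reproduce the full parser here, but a minimal sketch of the spill-to-disk idea looks like this. The record fields match the DataFrame we build later; the [EVT] filter and the timestamp regex are placeholders rather than the real motion.log format.

import pickle
import re

# Hypothetical parser: pull event lines out of motion.log and build
# a list of dicts with the fields used later in the DataFrame.
records = []
with open("motion.log") as log:
    for line in log:
        if "[EVT]" in line:
            # The real pattern depends on the motion.log format; this is a placeholder.
            match = re.search(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})", line)
            if match:
                records.append({
                    "Camera": "motion2",
                    "FileName": None,       # filled in by the real parser
                    "StartEvent": match.group(1),
                    "EndEvent": None,
                    "DurMins": 0,
                })

# Spill the serialised Python object to disk - no database required.
with open("anal", "wb") as f:
    pickle.dump(records, f)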

Recovering the data really only requires me to visit my Raspberry Pi-based file server. I wrote about creating and debugging this device earlier. I have a Samba (CIFS) share mapped on my Mac Mini M1, which I am using to write this article, so I could just copy and paste the files to my working directory.

Processing the log file is easy and quick. I use Spyder in Dark mode on the Mac. Nothing to see here. The Python variable 'records' has been written to disk, and I will recover it shortly for the next step.

Image by author after processing the Log file

There are 199 video files with a 2020-12-09 timestamp, so it will take a couple of minutes to make the compressed video file. Below is a screenshot of the process starting. I am comfortable running the script from the IDE, but others might prefer the command line and a set of arguments. It is really up to you, but keep the focus on the project and don't get goal-displaced by nice-to-have extra work.

Image by the Author – running a video processing script. Combine 199 clips please!
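
My actual combining script isn't shown here, but a minimal sketch of the same idea using moviepy (the clip directory, file pattern, and output name are all assumptions) would be:

import glob
from moviepy.editor import VideoFileClip, concatenate_videoclips

# Gather the day's clips in timestamp order; adjust the pattern and
# extension to match whatever motion actually writes out.
paths = sorted(glob.glob("/path/to/clips/2020-12-09*.mp4"))
clips = [VideoFileClip(p) for p in paths]

# Stitch the clips into a single compressed video file.
combined = concatenate_videoclips(clips)
combined.write_videofile("2020-12-09-combined.mp4")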

I made a cup of tea, and when I returned, the job had finished. The 199 video files have been combined into one mp4 file, and the file size is 52.5 MB.

Image by Author – the output directory for my video processing script

Collecting information about passing cars produced a video file of 51.31 minutes, with the first frame recorded at 05:44:35 GMT, as shown below. The red blur is a passing car. The camera is behind a window, and the temperature is near freezing.

Watching the 50-minute video is the subject of future posts because I am gearing up to use Deep Neural Networks (YOLOv3) to perform object detection on the files. Computer Vision is new to me, so I am learning as I go. That might not seem ideal, but when else would you learn something new?
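
For orientation only, since the detail belongs to a future post, here is a rough sketch of a single YOLOv3 forward pass over one frame using OpenCV's DNN module. The config, weights, and video file names are placeholders, and none of this is wired into the project yet.

import cv2

# Hypothetical pre-trained YOLOv3 files - not part of this project yet.
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")

cap = cv2.VideoCapture("2020-12-09-combined.mp4")
ok, frame = cap.read()
if ok:
    # Resize and normalise the frame the way YOLOv3 expects.
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(net.getUnconnectedOutLayersNames())
    # Each detection row holds box coordinates plus per-class scores;
    # rows above a confidence threshold would be the passing vehicles.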

The log data

When I do Exploratory Data Analysis, I always use Jupyter Notebooks. If I work with devices or define repeatable scripts, I use the IDE. You will have to find your own happy medium; it comes down to what works best for you. Let's discuss my exploratory analysis a bit.

import pickle
import pandas as pd

# Path to the folder holding the serialised log records
pathd = "/Users/davidmoore/Downloads/e2eSensorNetwork-master/motion/"

# Recover the pickled list of event records from disk
with open(pathd + "anal", "rb") as f:
    dump = pickle.load(f)

# Load the records into a Pandas DataFrame
df = pd.DataFrame.from_records(dump)

My first step is to recover the disk file 'anal', which contains the Python object. Next, I load the data into a Pandas DataFrame called, without any imagination at all, 'df'. That said, we can now examine our data structure.

df.info()

We have 5 data elements. Camera and FileName are strings, DurMins is an integer, and the others are datetime objects. It looks clean so far, but this is the data structure I defined, so I wasn't that surprised. There are 442 entries, with the log representing all events since the file was first created; we will filter down to the 199 events we consumed into the MP4 file above. There are 442 File Names, each with a corresponding StartEvent and EndEvent, and there are no missing values so far.

DurMins should be reasonably tight as we expect a series of short video clips.

df['DurMins'].unique(),  df['Camera'].unique()
I didn't expect to see an event of 386 minute duration.

There is only data for a single camera ('motion2'), but the values in the duration feature are surprising. We have events ranging from 1 through 6 minutes and also one of 386 minutes, so we need to investigate that. Short clips of 1 through 6 minutes are undoubtedly expected, but 386 seems wrong. We can just ask for the record(s) with the outlier value, as shown below.
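
The exact line isn't reproduced here, but a simple boolean filter along these lines pulls back the offending record:

# Return the record(s) whose duration matches the 386-minute outlier
df[df['DurMins'] == 386]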

Record 243 indicates a continuous motion event that lasted almost 6 hours, so the value of 386 seems to be reported correctly. This is why we emphasise the importance of exploring the data, and why, according to many Data Scientists, data analysis takes up 80% of the effort.

Looking at the clips suggests a problem. There are two clips: one created at 23:22 and another at 05:45. So there must have been an error!

Looking at the above log entry, we can see a detection event [EVT] followed by an [ERR], which resulted in a service restart. Motion is a daemonised service, so if it falls over, the OS will restart it automatically.

A new [EVT] is thrown at 05:44, and it seems the service returns to normal. I need to refine the log parsing script to detect this situation. For today, we can drop that record as it is outside of our target time frame. It is vital to investigate outliers thoroughly and only drop them when you fully understand them. Filtering the data and examining the range of the duration field, we see events from 1 minute to 6 minutes long. There are 198 events, and each event has 5 attributes, as the output below shows.
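
My filtering step isn't reproduced verbatim here, but a sketch of the idea, dropping the broken record and restricting to the December 9th daylight window, looks something like this:

# Drop the spurious 386-minute record and keep only the daylight window
df = df[df['DurMins'] != 386]
mask = (df['StartEvent'] >= '2020-12-09 08:30:00') & (df['StartEvent'] <= '2020-12-09 16:30:00')
df = df[mask]

# 198 events remain, each with 5 attributes
df['DurMins'].unique(), df.shape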

(array([1, 4, 3, 2, 5, 6]), (198, 5))

OK, so we have the log data, we found an error, and we filtered it out. Now what? It isn't easy to understand time-series data without some visualisations. Let's do that next.

Visuals and Time Series Data

It would be nice to see a plot of the events. There are 198 discrete data points, so a raw plot would likely make for a packed x-axis and make things difficult to see. We need a summary of events in specific time slots. After a bit of transformation, I managed to create my first view of the data collected.

Creating the chart requires a bit of work

# Build an index containing every second in the target window
idx = pd.date_range('2020-12-09 08:30:00', '2020-12-09 16:30:00', freq='S')
new_df = pd.DataFrame(idx)
new_df.columns = ['StartEvent']

# Left-join the per-second index with the event records
new_df = pd.merge(new_df, df, on="StartEvent", how="left")

The first thing I had to do was make a new index of all the seconds from 08:30 to 16:30. I then merged that index with my readings to create a new DataFrame, which I called, again without imagination or clarity for you, the reader, 'new_df'. So I just made a new matrix with all the seconds in our target window and whatever readings I got. We will have a lot of missing values because we do not record every second; there is only a record when movement is detected in front of the house.

# Index by event start time and treat missing seconds as no movement
new_df = new_df.set_index('StartEvent')
new_df['DurMins'] = new_df['DurMins'].fillna(0)

Now I've set the index for the new DataFrame and filled all missing values with 0. This kind of imputation is fine here: 0 indicates no reading, which implies no movement. But that created 238 entries, which is more than the 199 I complained about. How does that make sense? Well, hold on!

# Resample from per-second to 10-minute bins, sum, and plot the totals
ax = new_df.resample('10T').sum().plot(title='Movement outside House on the Road')
ax.set_xlabel("Time (08:30 - 16:30 GMT Dec 9)")
ax.set_ylabel("Total Events per 10 min interval")

Okay, so I used resample('10T') to resample all my readings from per-second to 10-minute intervals. Having a datetime index in place is what makes resampling possible.

I hope that made sense. I guess you won't have expected that making a simple line chart would become so involved and hard to explain, yet so powerful in the end. Doing Data Science from scratch is hard work.


Closing

It has been an exciting day. First, I ran the collector (my doorbell) for 24 hours, and the device recorded movements in front of my house for that period. Retrieving the log data and making the video file was, at first, straightforward, as I could just run pre-arranged scripts. The fun started when we found an outlier value, and that led to a discovery: there is a problem with the equipment, which left a large gap in the recorded data. That is fantastic, and I count the day a success. Finally, using Pandas and its Time Series functionality, I was able to plot the daylight hours and see the amount of movement outside the house through that period.

Doing Data Science from Scratch is a lot of work, hard to explain, and the equipment seems to fight you all the way. Still, it is great fun, the data set is exciting, and it is really bringing out the Exploratory Data Analysis and cleaning requirements. I hope you enjoyed this instalment; it was fun writing it. My next step is to check and fix the equipment and then do another run to make sure everything is stable. Come back for the next write-up.


I got a message on a different channel asking me how I did things so quickly. Initially, I wasn't sure what to say! But then I remembered that I wrote an article on my set-up. It is imperative to get your equipment and workstation set up correctly. My workstation is what unlocks productivity for me.

What is your workspace like?

