Feature Engineering Time

Calculating statistics of circular/periodic features.

Avishalom Shalit
Towards Data Science

--

In this post we discuss a mathematical method for treating periodic features.
This can be used for calculating an average.
(e.g. What is the average time of 23:50, 22:40 and 00:30?)
It is also a useful distance metric for clustering.
(e.g. 00:01 is closer to 23:59 than to 00:10, can we solve this elegantly?)
The same method proves useful when feature-engineering a circular feature like “time-of-day”, “day-of-year” or “azimuth” for an ML model.

Problems:

  1. Cluster events based on the time of event.
  2. Learn a model of some outcome when time of day is an important feature.
  3. Calculate a typical time of an action per user, when each user performs it several times.

A Problem with the Problems:

How do you deal with events in a cluster that cross over into the next day? Some users might be night owls, or in another timezone; if their activity spans midnight, the math becomes harder.
If a user sends messages at 23:57, 23:58 and 23:59, calculating the average is straightforward. But what if the next message is a minute later? Obviously we don’t want to add 00:00 to the average as a zero and drag the result toward the afternoon.
Imagine you are trying to predict some outcome that is dependent on the time of day; a simple model (e.g. a regression) will suffer from the discontinuity at midnight.
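To see the discontinuity concretely, here is a minimal sketch (not from the original post) of what a naive arithmetic mean does to the example above once 00:00 joins the set:

```python
# Naive averaging of times near midnight, in minutes past midnight.
times_min = [23 * 60 + 57, 23 * 60 + 58, 23 * 60 + 59, 0]  # 23:57, 23:58, 23:59, 00:00

naive_avg = sum(times_min) / len(times_min)  # plain arithmetic mean
print(f"{int(naive_avg // 60):02d}:{int(naive_avg % 60):02d}")  # → 17:58
```

One message a minute late drags the “average” from just before midnight to the late afternoon.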

Attempt #1, Brute Force:

Let’s assume that our users must be offline at some point, so that somewhere in the day there is a boundary with no activity near it.
We can then “move” to that timezone: add some hours so that the day boundary falls in the empty space.

Problem:

We might have users all around the globe, with different activity patterns, so we would need to find the right time shift for each one. At best this is highly inefficient; at worst it might be impossible.
Worse still, what if our assumption is wrong? What if we don’t see such a clean boundary between activity and inactivity?

Inspiration: Average Wind Direction.

Wind direction off the East coast.

Looking at a map of the winds we have good intuition about the average direction (and magnitude) of the wind.

The arbitrary distinction between 359° and 1° doesn’t bother us in the slightest because we take the vector average. The mathematical jump at 360° doesn’t matter when we are pointing at a direction with a finger.

How can we formalize that intuition? Let’s just describe the direction as a 2D vector (x, y), rather than a single scalar “heading”.

Yep, let’s do that thing, with time

Let’s map the times to the circumference of a 24-hour clock.

The geometric representation of time of day lends itself to geometric clustering and averaging. (Code to generate these figures, if you want to play around with them, is provided in a notebook linked below.)

If we want to cluster, we cluster in 2d space (x, y). If we want to take an average, we take the geometric average of the points (center of gravity).
If we want to use the time as a feature in a model we use the vector, (x, y).
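As a minimal sketch of that distance metric (the helpers `time_to_xy` and `dist` are my own names, not from the notebook):

```python
import math

def time_to_xy(hours):
    """Map a time of day (hours in [0, 24)) to a point on the unit circle."""
    theta = 2 * math.pi * hours / 24.0
    return (math.cos(theta), math.sin(theta))

def dist(a, b):
    """Plain Euclidean distance between two mapped points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

p_0001 = time_to_xy(0 + 1 / 60)    # 00:01
p_2359 = time_to_xy(23 + 59 / 60)  # 23:59
p_0010 = time_to_xy(0 + 10 / 60)   # 00:10

# In (x, y) space, 00:01 really is closer to 23:59 than to 00:10.
assert dist(p_0001, p_2359) < dist(p_0001, p_0010)
```

The chord length between two mapped points grows monotonically with the gap between the times, so the wrap-around at midnight simply disappears.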

What happens when the average is not on the circle?

Any average of more than one distinct point will lie strictly inside the circle; the larger the spread of the points, the further from the circumference the average will be.

Left: with small variance, the center is very close to the circle’s edge and shows the average time; Center: the high variance of the points puts the center inside the circle. The direction of the center indicates the average time, in this case 3:00. Right: A measure that is analogous to standard deviation can be derived by examining the arc defined by the chord that is bisected by that center.

Summary

By mapping the time of day onto a circle that represents 24 hours, we can eliminate the discontinuity that occurs when the hour hand goes from 23 back to 0.
The same trick works for day of year, or for an actual direction.

This new 2D vector (which can simply be two columns in a model) behaves very naturally: it defines a metric that puts 23:59 and 00:01 close together. Similarly, it puts Dec 31st and Jan 1st right next to each other. NNE and NNW do not care that they are on different sides of 360° (0°).

Those points of discontinuity were arbitrary to begin with, but so long as we were using only a single dimension they had to occur somewhere. In 2D things are nice again.

Code:

You can clone this notebook to play around with the code. (It also contains a workable example of calculating the averages in pySpark.)

Appendix, math and how-to:

Mapping the time to the circumference of the circle is straightforward.

  1. Convert the time to a single number in [0, 24) {or [0, 1), which you might get for free}, so 12:36 becomes 12.6 {or 0.525}.
  2. Then stretch that number to cover 360° by multiplying by 15°. {or by 360°}
  3. Then take the sin and cos to get the Y and X coordinates. (Happy note: SQL has trigonometric functions; don’t forget to convert to radians.)
  4. To calculate the average, simply take the average of X and Y independently.
  5. To transform back to a time, use atan2(E(Y), E(X)), then reverse the transformations (radians to degrees, and un-stretch back to 24 hours).
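The five steps above can be put together as a minimal Python sketch (the function name `times_to_avg` is illustrative; the notebook linked above holds the actual code, including the pySpark version):

```python
import math

def times_to_avg(times_h):
    """Circular average of times of day, given and returned as hours in [0, 24)."""
    angles = [2 * math.pi * t / 24.0 for t in times_h]        # steps 1-2: stretch to a full turn (radians)
    mean_x = sum(math.cos(a) for a in angles) / len(angles)   # steps 3-4: average X...
    mean_y = sum(math.sin(a) for a in angles) / len(angles)   # ...and Y independently
    avg_angle = math.atan2(mean_y, mean_x)                    # step 5: back to an angle...
    return (avg_angle / (2 * math.pi) * 24.0) % 24.0          # ...and un-stretch to hours

# The example from the intro: 23:50, 22:40 and 00:30 average to roughly 23:40.
avg = times_to_avg([23 + 50 / 60, 22 + 40 / 60, 0 + 30 / 60])
```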

Bonus, STDDEV:

  1. Calculate the radius r of the average point with the Pythagorean theorem, given the averaged X and Y.
  2. The arc that represents the standard deviation is given by arccos(r). Proving this is left as an exercise for the reader.


Head of Data Science & Analytics @ Sedna; Xoogler- Sr. SWE; (Data Scientist, but I used to be called an “Algorithm Developer”, which was the style at the time)