The world’s leading publication for data science, AI, and ML professionals.

Connecting Datapoints to a Road Graph with Python Efficiently

How to build a function that does it from scratch. And then speed it up 80 times with numba.

Photo by NASA on Unsplash
Photo by NASA on Unsplash

Generally, it’s easy to find the nearest road to a given point, but when we’re talking about millions of points and thousands of roads things may get a bit complicated. I want to tell you how to build a function that does this from scratch, and then speed it up 80 times with Numba.

So, in this text, I’m going to solve the following task:

Given

  • 7 million points, each determined by its longitude and latitude
  • 10 thousand roads in the graph. Roads are stored in form of OSMnx graph

To do

For each road in the graph find out the number of points for which this particular road is the nearest one.

What is essentially needed, is to transform the first map to the second one, but with a lot more points and streets.

Colors represent the number of points associated with the given street. (Image by Author)
Colors represent the number of points associated with the given street. (Image by Author)

The given task is unfortunately too heavy for standard Shapely functions, so we need to make a custom solution.

1. Geometry

Image by Author
Image by Author

Let’s start with the easiest approach. The distance from the origin of coordinates to a line defined by two points can be calculated with this formula:

Image by Author
Image by Author

We can use this formula to calculate the distance from any given point. All that’s needed to do this is to transform the coordinates in the way that the given point ends up in the origin.

Let’s write a simple function that iterates over all the streets and finds the nearest one.

The only problem is that in real life streets are not infinite, and are usually defined by their endpoints.

Image by Author
Image by Author

Consider the situation in the picture. While point A is closer to the first segment, the formula above tells us that point A is closer to the second one, because its continuation lies right near the point.

Image by Author
Image by Author

To eliminate these streets, let’s introduce a rule, that a perpendicular line, drown from the data point to a street must intersect it between its endpoints. Otherwise, a street is considered irrelevant. I’d like to point out that this rule is equivalent to a requirement that the lines drawn from the data point to the streets’ endpoints form acute angles with the street. The second requirement is easier to check. All we need to do is calculate a scalar product of vectors representing sides of the triangle. If the angle between two vectors is obtuse the scalar product is negative and vice-versa.

Here’s the code that checks if the triangle is acute.

I guess you already know the next problem we’ll face. Not all streets are straight in real life. Consider the situation in the picture below

Image by Author
Image by Author

The algorithm we just developed returns the first line as being the nearest one, while the answer is obviously the second one. This is due to the fact that the current approach considers endpoints only, and doesn’t account for curvature. Fortunaltely, the OSMnx graph also contains geometries of the streets, which can be represented in a form of a sequence of coordinates of endpoints of the streets subsegments.

Image by Author
Image by Author

Now, all we need to do to resolve this issue is to iterate over all the subsegments for every street considered.

This, however, creates one more, rather an unexpected problem. What happens if the continuation of one of the segments of some distant street lies right near the datapoint?

Image by Author
Image by Author

The point will be associated with the street №2 while it clearly belongs to the first one. The problem, however, can be solved by checking the triangle for being acute for every subsegment, just like we already did for streets’ endpoints.

That covers all the cases that may cause a problem, so we can proceed to the next part, the efficiency optimization.

2. Efficiency optimization

At this point, the overall algorithm is finished. The only thing we can add to a function’s logic to speed it up is to check if the street isn’t too far away, and drop the street if it is. After adding a line, that checks if all the endpoints are not farther than a certain distance the function looks like this:

Unfortunately after all the adjustments, it would take approximately a week to calculate the nearest street for every point out of 7 million. We need to go deeper.

I’m going to use the Numba library to speed the function up even more. What it does is translate certain Python functions to optimized machine code at runtime using the LLVM compiler library. The only drawback is that it doesn’t support dynamic typing along with some Python-specific datatypes such as Pandas data frames. I deliberately didn’t use not-supported data types, so it won’t be a problem. So all we have to do is to specify data types of the variables that are used in speeded-up functions.

For a function to be compelled with Numba the @jit decorator must be placed before it. And that’s all.

Now to test the efficiency gain, let’s load a street graph for the London city center, and generate a thousand points. I tried to find the nearest street for all the points with Numba speed-up and without it. Here are the results:

  • without Numba — 4 minutes, 27 seconds
  • with Numba — 0 minutes, 3.4 seconds

The code works 80 times faster, that’s impressive. The initial task of associating 7 million points to streets has been finished in just a few hours, instead of a week.

I made a notebook with all the code and more pictures. You can find it in this repository.

Thank you for reading the article, I hope you found it useful!

Danil Vityazev – Medium


Related Articles