
Data Pipeline for Autonomous Vehicle Test Drive Results


Autonomous Vehicle Test Drive Results | A Data Pipeline

Autonomous Vehicles have run millions of miles on public roads. What is the result so far?

Source: Yanshu Lee via Unsplash

Are you an Autonomous Vehicle (AV) enthusiast wondering how far AV development has advanced? Are you working at an AV company and wondering how your performance compares to that of your competitors?

There is no single answer and no single source. You would need to mine regulations, safety standards, market studies, the pulse of business-case acceptance, automotive sub-system reports in isolation and in conjunction, algorithm benchmarks, test reports, burn-rate analyses, and the list goes on.

More than one post is required to analyze all of these comprehensively. This post takes a bite-size chunk: it looks at the AV road test reports quantitatively, using publicly available data. It gives you the code and a guide so that you can start using the AV test drive result data pipeline (remember, the data is not available in a readily consumable format). It ends with thoughts on future improvements and a brief analysis based on the consolidated data.

Fasten your seat belt and read along …


Publicly Available AV Test Drive Result Data

Plenty of AV sensor data is publicly available from Udacity, UC Berkeley, Nvidia, and other institutions, in purpose-built formats for experimenting with algorithms (rightly so, to drive innovation). In contrast, publicly available AV test result data is scarce, which makes it hard to comprehensively assess the current standing, even for public policy! AV testing is done in three phases: simulation, track test, and road test. The road test is the last mile before deploying AVs in the field, and regulators look at the road test data.

The California DMV has published AV collision reports and disengagement reports, which the companies holding AV testing permits in CA submit to the DMV. Safety is the foremost criterion for the viability of AVs. The collision report provides direct insight; the disengagement report is a proxy indicator. The hypothesis is that if an AV has disengaged, it is not ready for a driverless ride. But what if the safety driver initiated the disengagement only because the AV did not exactly follow his or her line of thought? What if the same scenario, run in simulation without a disengagement, turns out to be safe? The point is that the reports are not without flaws.

What is the alternative for evaluation, then? None at this point. Besides California, no other state in the USA has published any such data. Some states, for example Arizona and Florida, have shared AV literature in PowerPoint presentations.


Getting to know the California DMV AV Data Pipeline

The purpose of the data pipeline is to access the data in an analysis-ready format.

Who could use the data pipeline? Developers working in the AV industry have access to rich data from their own company but lack other companies' data. They could use the pipeline to benchmark. Market research firms and AV enthusiasts could use it as well.

Is there a need for a data pipeline? The disengagement report is available in Excel, is ingestible by any analytics tool, and does not require a data pipeline. Whether the intuitions drawn from the disengagement metric are sound is a debatable topic in itself. The collision report, however, is a form stored in PDF format. Hundreds of collision reports are submitted every year, so analyzing the data manually is impractical.

The data pipeline generates a neat Excel file after processing several hundred multi-page forms submitted by different AV companies on different dates. Here is a list of the fields (column headers) and a snippet of the values for a subset of the fields.

List of the fields (source: author)
Snippet of the field values (source: author)

The input for the above output is the set of PDF form files downloaded from the CA DMV site. The screenshots below show 1) crash reports filed by Waymo, Cruise, Apple, Zoox, Aurora (which bought Uber's self-driving unit), and other AV companies, and 2) a sneak peek inside a PDF form filed by Lyft.

Input = pdf forms by different companies on different dates (source: author)
Sneak peek inside a pdf form submitted to CA DMV by Lyft self-driving lab (source: author)

Getting Started Guide for the Data Pipeline

Refer to GitHub. The README contains the step-by-step guide.

A few points are noteworthy. Two tools are used to extract information from the forms: PDFMiner, which reads the text embedded in a PDF, and Pytesseract, a Python wrapper for the Tesseract Optical Character Recognition (OCR) engine. They complement each other.
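One way the two extractors could complement each other is sketched below: try the PDF's embedded text first and fall back to OCR when too little text comes back, as happens with scanned forms. This is an assumption about how to combine them, not necessarily what the repository does. The function names, the `min_chars` threshold, and the use of `pdf2image` to rasterize pages are all hypothetical choices for illustration.

```python
from typing import Callable, Tuple


def text_from_layer(pdf_path: str) -> str:
    """Read the embedded text layer with pdfminer.six (no OCR involved)."""
    from pdfminer.high_level import extract_text  # pip install pdfminer.six
    return extract_text(pdf_path) or ""


def text_from_ocr(pdf_path: str) -> str:
    """Rasterize each page and OCR it with Tesseract via pytesseract."""
    import pytesseract                        # requires the Tesseract binary
    from pdf2image import convert_from_path   # assumption: requires Poppler
    pages = convert_from_path(pdf_path, dpi=300)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


def extract_form_text(
    pdf_path: str,
    layer_fn: Callable[[str], str] = text_from_layer,
    ocr_fn: Callable[[str], str] = text_from_ocr,
    min_chars: int = 200,  # hypothetical cutoff for "the text layer is usable"
) -> Tuple[str, str]:
    """Return (text, method), preferring the text layer over OCR."""
    text = layer_fn(pdf_path)
    if len(text.strip()) >= min_chars:
        return text, "text-layer"
    # Scanned or image-only form: fall back to OCR.
    return ocr_fn(pdf_path), "ocr"
```

Passing the two extractors in as parameters keeps the fallback rule easy to try out without installing either library.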

spaCy is used for Natural Language Processing (NLP). The form has a descriptive section, shown below, and not all AV companies fill it in the same way. When the speed is stated, the speed and the associated object are captured in the Speed and Context fields of the Excel file. In the screenshots below: 1) Waymo entered the speed of the self-driving vehicle. 2) The speed, the verb, and the corresponding noun object are extracted using part-of-speech tagging and dependency parsing; the diagram shown is from the spaCy documentation. It is an interesting experiment. However, due to the inconsistency of the data and the complexity of NLP, this part is not very fruitful.

Waymo entered detail of AV speed during the crash on 2/21/19 in the CA DMV form (source: author)
Dependency parser visualization using spaCy (source: spaCy developer documentation)
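For comparison with the dependency-parsing approach, here is a much simpler regex baseline for pulling speeds out of the narrative section. It is a stand-in, not the spaCy pipeline described above: it finds numbers followed by "mph" or "miles per hour" and keeps a few preceding characters as rough context. The function name, the `window` size, and the sample sentence are hypothetical.

```python
import re

# A number (optionally decimal) followed by "mph" or "miles per hour".
SPEED_RE = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?:mph|miles per hour)\b",
    re.IGNORECASE,
)


def extract_speeds(narrative: str, window: int = 40):
    """Return a list of (speed_mph, preceding_context) pairs."""
    results = []
    for match in SPEED_RE.finditer(narrative):
        start = max(0, match.start() - window)
        context = narrative[start:match.start()].strip()
        results.append((float(match.group("value")), context))
    return results


# Hypothetical sentence, not taken from a real report.
sample = (
    "The AV was traveling at 4 MPH when the other vehicle, "
    "moving at approximately 10 miles per hour, made contact."
)
for speed, context in extract_speeds(sample):
    print(speed, "|", context)
```

A baseline like this cannot tell whose speed is being reported, which is the question the verb and noun-object extraction tries to answer.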

Future Improvements to the AV Test Result Data Pipeline

Make something work, test it out, make it better. The current version is a minimum viable version. It creates an Excel file that follows the current format of the collision reports.

The following are ideas for future improvements:

Capture multi-select options. To illustrate, the weather could be rainy, foggy, and windy at the same time, and the form could be filled in by ticking all of these options. No form has selected more than one option thus far, so the code honors only one option.

Collect the input files automatically. Downloading files from the CA DMV website takes a while. Consider using a web crawler with auto-download. The crawler can run once initially and then quarterly or yearly to collect new files. The code should process only the new files, quickly and incrementally.

Have an easy-to-maintain and robust codebase. The CA DMV could change the format of the forms in the future, which would inevitably demand a code change. This underscores the need for standards in the AV space. Some manual intervention is currently required during image processing and final format checking. The code should be free of any manual intervention.

Use NLP to recognize pertinent information. If the data, and hence the pipeline, continues to be useful, the last thing to improve is accurate and relevant text extraction.
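The incremental-processing idea above could be sketched as a small manifest of content hashes, so that a rerun picks up only the files it has not seen before. This is a minimal sketch under assumptions: the manifest file name, the function names, and the choice of SHA-256 are all hypothetical, and the crawler that downloads the files is out of scope.

```python
import hashlib
import json
from pathlib import Path
from typing import Iterable, List, Set


def file_digest(path: Path) -> str:
    """Hash the file content so renamed duplicates are still recognized."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def _load_seen(manifest_path: Path) -> Set[str]:
    if manifest_path.exists():
        return set(json.loads(manifest_path.read_text()))
    return set()


def find_new_reports(input_dir: Path, manifest_path: Path) -> List[Path]:
    """Return the PDFs in input_dir whose content is not in the manifest."""
    seen = _load_seen(manifest_path)
    return [p for p in sorted(input_dir.glob("*.pdf")) if file_digest(p) not in seen]


def mark_processed(paths: Iterable[Path], manifest_path: Path) -> None:
    """Record the given files as processed."""
    seen = _load_seen(manifest_path)
    seen.update(file_digest(p) for p in paths)
    manifest_path.write_text(json.dumps(sorted(seen)))
```

A run would call `find_new_reports`, process the returned files, and then call `mark_processed` on the ones that succeeded.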

A Glimpse of the Analysis Results

The Excel file produced by the data pipeline can be used for data analysis. Here is an analysis of the 2019 data, in which 104 incidents are reported.

CA DMV AV Collision reports 2019 | Damage level | Speed at the time of crash (source: author)
CA DMV AV Collision reports 2019 | Collision type | Pre-collision movement of AV (source: author)

It is worth taking a look at the data before diving into the insights. The data has a bias. All the records concern crashes on city roads involving self-driving cars. There is no data regarding highway driving or larger vehicles such as autonomous trucks and buses.

The speed (mph) data is not reliable enough to draw a conclusion from, as the speed is not reported in 75% of the cases.
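A check like this can be sketched in a few lines once the consolidated rows are loaded. The sketch below uses only the standard library. The column names `Speed` and `Damage` and the rows themselves are placeholders for illustration, not values from the DMV reports.

```python
from collections import Counter


def share_missing(rows, field):
    """Fraction of rows where the field is absent or empty."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(field) in (None, ""))
    return missing / len(rows)


def value_shares(rows, field):
    """Share of each non-empty value of the field."""
    counts = Counter(
        row[field] for row in rows if row.get(field) not in (None, "")
    )
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()} if total else {}


# Placeholder rows with hypothetical column names (not real report data).
rows = [
    {"Damage": "Minor", "Speed": ""},
    {"Damage": "Moderate", "Speed": "5"},
    {"Damage": "Minor", "Speed": None},
    {"Damage": "Major", "Speed": "12"},
]
print(share_missing(rows, "Speed"))
print(value_shares(rows, "Damage"))
```

Reporting the missing share next to each chart makes it clear how much of the data a conclusion rests on.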

It is surprising to see how often the autonomous vehicle was driving straight in the pre-collision stage and then had a crash. An AV that had stopped before the crash also seems to be a common scenario. Did the AV anticipate the crash and stop? Or did it stop at a stop sign and get hit? When a moderate-intensity crash took place, did the AV try to reduce it to a minor one? If so, what was the trade-off? To know more, access to the AV's event log and a causal analysis of the situation would be required.

There is an overwhelming representation of rear-end crashes. Should the automotive OEMs innovate and experiment on the rear material of the car?

Do you see any good news for AV readiness? While crashes did take place, 70% of them were minor. Let's cross the finish line on a positive note.

