The world’s leading publication for data science, AI, and ML professionals.

Compared to Native Spark 3.0, We Have Achieved Significant Optimizations for AI Scenarios

For AI: extended SQL syntax, more feature-engineering (FE) functions, and time-series feature calculations

Introducing OpenMLDB and its advantages over native Spark

Photo by SpaceX on Unsplash

Background

Spark has quickly become the de facto standard for big data processing and needs no introduction, but it still has many shortcomings in AI scenarios.

  1. Good: native Spark excels at big data processing on Hadoop clusters.
  2. Insufficient: SparkSQL's shortcomings are gradually exposed in the field of feature extraction.
  3. Insufficient: Koalas, the pandas API on Apache Spark, has similar limitations.

Even though SparkSQL and Koalas provide substantial data processing functionality, the functions required for feature extraction are still very limited, and the performance of the supported time-series feature calculations is unsatisfactory. They support only offline analysis, while AI production also requires online computing. And because this process relies heavily on UDFs/UDAFs, there is still much room for improvement in the integration of open-source Spark with AI.

At present, more and more native execution engine solutions have emerged in the industry, such as Databricks' Photon module, Intel's OAP project, Nvidia's Spark-rapids project, and Alibaba's EMR service. These projects each have their bright spots: by generating underlying native code, they make full use of CPU vectorization and GPU parallelism and significantly improve Spark's performance in specific scenarios. But there are still some shortcomings. First, they only support offline calculation, while landing AI scenarios also requires online feature computation and online model serving. Second, there is no further optimization for the commonly used time-series calculations (SQL over window). Finally, they also lack support for the feature-extraction functions needed in AI scenarios.

The comparison of several Spark engines

Image by Author

So we released our project OpenMLDB (SparkFE was the engine of OpenMLDB). It makes up for the shortcomings of the above projects and achieves better performance in AI applications. Based on LLVM optimization, its performance is more than six times that of the original Spark.

The advantages

  • Based on LLVM optimization, performance is improved more than 6x over the original Spark.
  • For AI: extended SQL syntax, supporting more feature-engineering functions and time-series feature calculations.
  • Online and offline consistency: through window feature aggregation over a time-series database, SQL can be deployed online with one click.
  • No migration cost: compatible with SparkSQL applications, so Scala, Java, Python, and R applications enjoy the performance improvements without code changes.
  • Finally, whether in time-series feature calculation scenarios or the special table-join scenarios of machine learning, performance tests show that OpenMLDB achieves a 6x to hundreds-of-times improvement without increasing hardware cost.
Time-serial features extraction. Image by Author
Custom table join for machine learning. Image by Author

Performance optimizations compared to native Spark 3.0

Here we introduce some application scenarios of OpenMLDB, focusing on the performance optimization methods.

The first optimization is the native window calculation.

The bottom layer is based on a C++ double-ended queue (deque), which efficiently implements standard window and sub-window functions. The expanded ROWS_RANGE window bound also better solves the ordering problem for data within the same millisecond.

  • Native Window Computing
  • ➡Performance improvement
  • ➡ ROWS_RANGE bound
  • ➡ Window with union tables
  • ➡ Handle conflicts of same-millisecond data
select trip_duration, passenger_count,
    sum(pickup_latitude) over w as vendor_sum_pl,
    max(pickup_latitude) over w as vendor_max_pl,
    min(pickup_latitude) over w as vendor_min_pl,
    avg(pickup_latitude) over w as vendor_avg_pl,
    sum(pickup_latitude) over w2 as pc_sum_pl,
    max(pickup_latitude) over w2 as pc_max_pl,
    min(pickup_latitude) over w2 as pc_min_pl,
    avg(pickup_latitude) over w2 as pc_avg_pl,
    count(vendor_id) over w2 as pc_cnt,
    count(vendor_id) over w as vendor_cnt
from t1
window w as (partition by vendor_id order by pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW),
    w2 as (partition by passenger_count order by pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW);
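The engine itself implements these windows in C++; as a rough illustration of the deque-based technique, here is a minimal Python sketch of a `ROWS_RANGE ... PRECEDING AND CURRENT ROW` running sum over time-ordered rows (the function name and data layout are illustrative, not part of OpenMLDB's API):

```python
from collections import deque

def rows_range_sum(rows, range_ms):
    """Running sum over `rows` = [(ts, value)] sorted by ts.
    For each row, sums the values whose timestamps fall in
    [ts - range_ms, ts] -- i.e. the semantics of
    ROWS_RANGE BETWEEN range_ms PRECEDING AND CURRENT ROW."""
    window = deque()   # rows currently inside the time range
    running = 0.0
    out = []
    for ts, val in rows:
        window.append((ts, val))
        running += val
        # evict rows that fell below the lower bound of the frame
        while window and window[0][0] < ts - range_ms:
            _, old_val = window.popleft()
            running -= old_val
        out.append(running)
    return out
```

Because each row is pushed and popped at most once, the whole partition is aggregated in linear time, which is why a deque is a natural fit for sliding time windows.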

The second point of optimization is the native LastJoin implementation.

This extended join type ensures that the result after joining has exactly the same rows as the left table before joining. The native implementation at the Spark source-code level can be more than a hundred times faster than an implementation built on top of Spark 3.0, and it uses less memory.

SELECT t1.name FROM t1 LAST JOIN t2 ON t1.id == t2.id
LastJoin performance. Image by Author
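To make the semantics concrete, here is a hedged Python sketch of what a last join computes (the function and field names are illustrative; OpenMLDB's real implementation lives in Spark's Scala/C++ source): each left row is paired with at most one right row, the matching row that comes last in some ordering, so the output row count always equals the left table's.

```python
def last_join(left, right, key, order):
    """Pair each left row with the matching right row having the
    greatest `order` value, or None if no right row matches.
    The output always has len(left) rows -- the guarantee that
    LAST JOIN provides over an ordinary left join."""
    best = {}  # key -> the "last" matching right row seen so far
    for r in right:
        k = r[key]
        if k not in best or r[order] > best[k][order]:
            best[k] = r
    return [(l, best.get(l[key])) for l in left]
```

A plain left join would duplicate left rows when several right rows match; last join avoids that, which is why it is useful for assembling feature tables without changing the sample count.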

The third point of optimization is the optimization of multi-window parallel computing.

Spark's default serial execution of multiple windows is optimized into parallel computation, making full use of cluster resources, so the overall SparkSQL task time can be greatly reduced.

SELECT
 min(age) OVER w1 as w1_min_age,
 min(age) OVER w2 as w2_min_age
FROM t1
WINDOW
 w1 as (PARTITION BY name ORDER BY age ROWS BETWEEN 10 PRECEDING AND CURRENT ROW),
 w2 as (PARTITION BY age ORDER BY age ROWS BETWEEN 10 PRECEDING AND CURRENT ROW)
Image by Author
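Since the two windows above partition by different keys, they are independent and need not run one after the other. As a small Python sketch of the idea (purely illustrative; the real optimization happens inside the Spark execution plan), each window aggregation can be submitted as its own task:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def add_min_over_window(rows, partition_key, order_key, value_key,
                        preceding, out_key):
    """Compute min(value) OVER (PARTITION BY partition_key ORDER BY
    order_key ROWS BETWEEN preceding PRECEDING AND CURRENT ROW) and
    store the result on each row under out_key."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[partition_key]].append(row)
    for part in parts.values():
        part.sort(key=lambda r: r[order_key])
        for i, row in enumerate(part):
            frame = part[max(0, i - preceding): i + 1]
            row[out_key] = min(r[value_key] for r in frame)

data = [{"name": "a", "age": 30}, {"name": "a", "age": 25},
        {"name": "b", "age": 40}]

# The two windows write to different output columns, so they can run
# concurrently instead of serially -- the essence of the optimization.
with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(add_min_over_window, data, "name", "age", "age", 10, "w1_min_age")
    f2 = pool.submit(add_min_over_window, data, "age", "age", "age", 10, "w2_min_age")
    f1.result(); f2.result()
```

In the real engine the parallelism comes from scheduling independent window stages concurrently on the cluster, not from threads in one process, but the dependency argument is the same.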

The fourth point of optimization is time-window data-skew calculation.

Similar to skewed-join optimization, the skewed window data is repartitioned a second time, which significantly improves the parallelism of the task, makes full use of cluster resources, and reduces the overall TCO (Total Cost of Ownership).

Before opt, parallelism: 2. Image by Author
After opt, parallelism: 4. Image by Author
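One way to picture the repartitioning, sketched here in illustrative Python rather than OpenMLDB's actual implementation: a hot partition is split into several blocks, and each block also carries the rows that fall within the window range just before it, so every block can compute its window frames independently of the others.

```python
def split_skewed_partition(rows, range_ms, n_blocks):
    """Split one time-sorted partition of (ts, value) rows into
    n_blocks pieces that separate tasks can process. Each block
    carries the rows within `range_ms` before its first row as
    read-only context, so window frames at the block boundary
    still see their full preceding range."""
    rows = sorted(rows, key=lambda r: r[0])
    size = (len(rows) + n_blocks - 1) // n_blocks  # ceil division
    blocks = []
    for i in range(0, len(rows), size):
        start_ts = rows[i][0]
        # context rows are aggregated over but not re-emitted,
        # so no output row is duplicated across blocks
        context = [r for r in rows[:i] if r[0] >= start_ts - range_ms]
        blocks.append((context, rows[i:i + size]))
    return blocks
```

Doubling the number of blocks roughly doubles the achievable parallelism for the hot key, at the cost of replicating a small amount of boundary context, which matches the before/after parallelism figures above.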

SparkFE (now merged into OpenMLDB) has many more optimizations than can be listed here. Because it is compatible with the SparkSQL API, its application scenarios are similar to Spark's, especially AI-related time-series feature calculations and data-table joins.

  • Time-serial feature extraction
  • Concurrency for multiple windows
  • Custom table join
  • Custom feature extraction functions
  • Skew optimization
  • Native aggregation functions

Here is the architecture of OpenMLDB; you are welcome to join the community.

Image by Author
