
MLOps Practice: Using OpenMLDB in the Real-Time Anti-Fraud Model for the Bank’s Online Transaction

Online anti-fraud MLOps practice in China

Background

Nowadays, many banks have begun introducing machine learning models to assist rule engines in making decisions. Banks have also explored many machine learning models offline with good results, but these models rarely make it into online environments, mainly for the following reasons:

  1. In the risk control scenario, online inference must be fast, with a TP99 of 20 ms, and it is challenging for most machine learning models to achieve such performance.
  2. Feature calculation must be timely, but most machine learning infrastructure cannot compute time-series features over sliding time windows quickly enough.
  3. Feature calculation is complex and voluminous, so the cost of running it online is very high.
  4. It isn't easy to keep offline and online feature calculation consistent. As a result, a model that performs well offline can perform poorly online.

That is because, to bring an anti-fraud model online, the following things need to be done to solve the consistency problem:

First, the data scientists use SQL to process data offline. Next, the feature engineering solution needs to be translated from offline to online. In this process, the data engineers need to align the feature processing logic with the data scientists and account for the risk that the data scientists change their scripts (which always happens in production systems).

Second, the data engineers design an online, real-time data processing architecture based on the architecture of the online anti-fraud system.

Finally, the data engineers manually re-develop all of the feature processing logic that the data scientists already developed in SQL. In this process, the data engineers act as "translators" between different computer languages. Considering the millions of features in an anti-fraud machine learning project, this translation job is redundant and error-prone.
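To make the consistency risk concrete, here is a minimal Python sketch. It is not taken from any real bank system: the function names and sample transactions are hypothetical. It compares an offline feature definition (the sum of amounts over the last 30 days, boundaries included) with a hand-translated online version that accidentally excludes the lower boundary.

```python
from datetime import datetime, timedelta

# Hypothetical transactions for one card: (time, amount).
TRANSACTIONS = [
    (datetime(2022, 1, 1, 12, 0), 100.0),
    (datetime(2022, 1, 20, 9, 30), 40.0),
    (datetime(2022, 1, 31, 12, 0), 60.0),
]

WINDOW = timedelta(days=30)


def offline_sum_30d(rows, now):
    """Offline definition: window is [now - 30d, now], both ends inclusive."""
    return sum(amt for ts, amt in rows if now - WINDOW <= ts <= now)


def online_sum_30d_hand_translated(rows, now):
    """Hand-translated version with a subtle bug: the lower bound is exclusive."""
    return sum(amt for ts, amt in rows if now - WINDOW < ts <= now)


now = datetime(2022, 1, 31, 12, 0)
offline_value = offline_sum_30d(TRANSACTIONS, now)
online_value = online_sum_30d_hand_translated(TRANSACTIONS, now)

# The two values differ because one transaction sits exactly on the boundary.
print(offline_value, online_value)
```

A one-character difference in a comparison operator is enough to make the model see different feature values online than it saw during training, and such differences are hard to spot in review when there are many features.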

The picture below describes this process for most MLOps platforms.

Image by Author

So is there a way for the data engineers to simply throw the SQL scripts written by the data scientists into a magic box that handles the machine learning model inference job and makes it work online, without having to understand the logic of the feature scripts?

Image by Author

Today, as a machine learning application developer, I will tell you how to solve these problems with OpenMLDB. If you use it in your machine learning system, you can enjoy these benefits:

  1. There is no need to understand the logic of the data scientist's feature plan. Even if the plan is adjusted, I only need to update the SQL in OpenMLDB.
  2. There is no need to design a complete online data calculation flow. Using OpenMLDB feels as simple as using MySQL to develop traditional applications.
  3. Say goodbye to manually developing the feature engineering.

The development process can then be simplified as shown in the picture below.

image by author

How to use OpenMLDB to solve consistency problems

You may be curious how [OpenMLDB](https://github.com/4paradigm/OpenMLDB) can execute the same SQL in batch and in real time without any modification. It can do so because OpenMLDB supports two execution modes:

  1. Batch mode generates samples for the training process, similar to how a traditional database executes SQL.
  2. Request mode generates samples in real time for the inference process and only calculates the features related to the request.

Let's use bank transaction data as an example.

We have two tables:

  1. t_ins: the query records, which include the user_id and the record time
  2. t_trx: the transaction records of all the users, which include the user_id, the amount of each transaction, and the transaction time

image by author
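The idea behind the two modes can be sketched in plain Python. This is not how OpenMLDB is implemented; it is only an illustration, and the table contents, the 10-minute window, and the function names are all hypothetical. The key point is that both modes call one shared feature definition, so the value computed for training and the value computed for an online request agree.

```python
from datetime import datetime, timedelta

# Hypothetical rows shaped like the transaction table: (user_id, time, amount).
T_TRX = [
    ("u1", datetime(2022, 3, 1, 10, 0), 20.0),
    ("u1", datetime(2022, 3, 1, 10, 5), 30.0),
    ("u2", datetime(2022, 3, 1, 10, 6), 5.0),
    ("u1", datetime(2022, 3, 1, 10, 12), 50.0),
]

WINDOW = timedelta(minutes=10)


def window_sum(history, request):
    """Shared feature definition: the user's total amount in the 10 minutes
    up to the request row, plus the request row itself."""
    user_id, ts, amount = request
    in_window = [
        a for u, t, a in history
        if u == user_id and ts - WINDOW <= t <= ts
    ]
    return sum(in_window) + amount


def batch_mode(table):
    """Compute the feature for every stored row (training samples)."""
    rows = sorted(table, key=lambda r: r[1])
    return [window_sum(rows[:i], row) for i, row in enumerate(rows)]


def request_mode(stored_rows, request):
    """Compute the feature only for one incoming row (online inference)."""
    return window_sum(stored_rows, request)


batch_features = batch_mode(T_TRX)
# Treat the last row as a live request arriving after the first three rows.
online_feature = request_mode(T_TRX[:3], T_TRX[3])
print(batch_features, online_feature)
```

In this sketch, the feature computed for the last row in batch mode equals the value computed for the same row arriving as a request, which is the property that keeps offline and online results consistent.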

How to use OpenMLDB to solve performance problems

[OpenMLDB](https://github.com/4paradigm/OpenMLDB) comes with a number of column compilation optimization techniques, such as function dynamic loop binding. In addition, the online part of the data is kept entirely in memory, which ensures very high execution performance. The following is the performance data published with OpenMLDB.

image by author

We can see that the execution performance of OpenMLDB has great advantages over SingleStore and HANA. Next, let's look at the execution efficiency of an actual SQL query. The SQL is as follows:

```sql
select * from
(select
    card_no,
    trx_time,
    merchant_id,
    month(trx_time) as fea_month,
    dayofmonth(trx_time) as fea_day_of_month,
    hour(trx_time) as fea_hour,
    week(trx_time) as fea_week,
    substr(card_no, 1, 6) as card_no_prefix,
    max(trx_amt) over w30d as w30d_trx_max,
    min(trx_amt) over w30d as w30d_trx_min,
    sum(trx_amt) over w30d,
    avg(trx_amt) over w30d,
    max(usd_amt) over w30d,
    min(usd_amt) over w30d,
    sum(usd_amt) over w30d,
    avg(usd_amt) over w30d,
    max(org_amt) over w30d,
    min(org_amt) over w30d,
    sum(org_amt) over w30d,
    avg(org_amt) over w30d,
    distinct_count(merchant_id) over w30d,
    count(merchant_id) over w30d,
    distinct_count(term_city) over w30d,
    count(term_city) over w30d,
    max(trx_amt) over w10d,
    min(trx_amt) over w10d,
    sum(trx_amt) over w10d,
    avg(trx_amt) over w10d,
    max(usd_amt) over w10d,
    min(usd_amt) over w10d,
    sum(usd_amt) over w10d,
    avg(usd_amt) over w10d,
    max(org_amt) over w10d,
    min(org_amt) over w10d,
    sum(org_amt) over w10d,
    avg(org_amt) over w10d,
    distinct_count(merchant_id) over w10d,
    count(merchant_id) over w10d,
    distinct_count(term_city) over w10d,
    count(term_city) over w10d
from tran
window w30d as (PARTITION BY tran.card_no ORDER BY tran.trx_time ROWS_RANGE BETWEEN 30d PRECEDING AND CURRENT ROW),
       w10d as (PARTITION BY tran.card_no ORDER BY tran.trx_time ROWS_RANGE BETWEEN 10d PRECEDING AND CURRENT ROW)) as trx_fe
last join card_info order by card_info.crd_lst_isu_dte
    on trx_fe.card_no = card_info.crd_nbr and trx_fe.trx_time >= card_info.crd_lst_isu_dte;
```
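The last clause of the query deserves a note. One way to read `last join ... order by ... on ...` is that each row on the left is paired with at most one row from `card_info`: among the rows that satisfy the condition, the one that sorts last by `crd_lst_isu_dte`. The pure-Python sketch below illustrates that reading only. It is not OpenMLDB code; the `last_join` helper, the integer timestamps, and the sample rows are hypothetical.

```python
def last_join(left_rows, right_rows, on, order_key):
    """For each left row, attach the matching right row that sorts last by
    order_key, or None when no right row satisfies the condition."""
    joined = []
    for left in left_rows:
        matches = [r for r in right_rows if on(left, r)]
        best = max(matches, key=order_key) if matches else None
        joined.append((left, best))
    return joined


# Hypothetical sample data using column names from the query above.
# Times are simplified to integers for readability.
trx_fe = [
    {"card_no": "622200001", "trx_time": 50},
    {"card_no": "622200002", "trx_time": 5},
]
card_info = [
    {"crd_nbr": "622200001", "crd_lst_isu_dte": 10},
    {"crd_nbr": "622200001", "crd_lst_isu_dte": 40},
    {"crd_nbr": "622200001", "crd_lst_isu_dte": 90},
    {"crd_nbr": "622200002", "crd_lst_isu_dte": 20},
]

result = last_join(
    trx_fe,
    card_info,
    on=lambda t, c: t["card_no"] == c["crd_nbr"]
    and t["trx_time"] >= c["crd_lst_isu_dte"],
    order_key=lambda c: c["crd_lst_isu_dte"],
)

for left, right in result:
    print(left["card_no"], right)
```

Under this reading, the first transaction is paired with the card record issued most recently before the transaction, and the second transaction gets no card record because the only candidate was issued after it.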

First, let's analyze this SQL. The factors that affect its performance are:

  1. The number of records in the time window
  2. The number of features

Because the number of features is fixed, we test the performance with different amounts of data in the time window.
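For readers who want to run a similar experiment on their own workload, the following Python sketch shows one way to measure p99 latency as the window size grows. It is only a measurement harness under stated assumptions: `compute_features` is a hypothetical stand-in workload, not OpenMLDB, and the numbers it prints say nothing about OpenMLDB's performance.

```python
import math
import random
import time


def p99(latencies_ms):
    """Nearest-rank 99th percentile of a list of latencies."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(len(ordered) * 99 / 100)
    return ordered[rank - 1]


def compute_features(window):
    """Stand-in workload: a few aggregates over one window of amounts."""
    return max(window), min(window), sum(window), sum(window) / len(window)


def benchmark(window_size, requests=500):
    """Time `requests` feature computations and return the p99 in ms."""
    rng = random.Random(0)
    window = [rng.uniform(1, 1000) for _ in range(window_size)]
    latencies = []
    for _ in range(requests):
        start = time.perf_counter()
        compute_features(window)
        latencies.append((time.perf_counter() - start) * 1000)
    return p99(latencies)


# Vary only the amount of data in the window; the feature set stays fixed.
for size in (100, 500, 2000):
    print(size, round(benchmark(size), 4))
```

Reporting p99 rather than the mean matters in risk control, because the latency requirement applies to the slowest requests, not the typical ones.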

OpenMLDB performs very well here. Even with 2,000 records in a time window, p99 latency stays around 1 ms, which comfortably meets the demanding performance requirements of anti-fraud scenarios.

Image by Author


Business results of using OpenMLDB online

Business effectiveness is something everyone cares about. So far, we have launched anti-fraud scenarios for multiple banks, and the online results are consistent with the offline evaluation. The model results are 2–8 times better than the customers' expert rules, and the recall rate stays the same between the offline and online processes. The customers are very satisfied with our work. I hope this article helps you.

About OpenMLDB

[OpenMLDB](https://github.com/4paradigm/OpenMLDB) is an open-source database that provides correct and efficient data supply for Machine Learning applications. In addition to the more than 10 times improvement in the efficiency of machine learning data development, OpenMLDB also provides a unified computing and storage engine to reduce the complexity and overall cost of development, operation and maintenance.

Architecture

Everyone is welcome to join the community at https://github.com/4paradigm/OpenMLDB.

