Signature Fraud Detection: An Advanced Analytics Approach

Sourish Dey
Towards Data Science
9 min read · May 22, 2020


In my previous article, I discussed advanced analytics applications in fraud detection in a generic fashion. In this article I will delve into the details of a specific area of fraud: signature forgery. Institutions and businesses recognize signatures as the primary way of authenticating transactions. People sign checks, authorize documents and contracts, validate credit card transactions and verify activities through signatures. As the number of signed documents, and their availability, has increased tremendously, so has signature fraud.

According to recent studies, check fraud alone costs banks about $900M per year, with 22% of all fraudulent checks attributed to signature fraud. Clearly, with more than 27.5 billion checks written each year in the United States (according to the 2010 Federal Reserve Payments Study), visually comparing signatures by manual effort on the hundreds of millions of checks processed daily is impractical.

The advent of big data on distributed Hadoop-based platforms like MapR has made it possible to economically and efficiently store and process large volumes of signature images. This enables enterprises to use comprehensive historical transaction data to discover patterns of fraudulent signatures by developing algorithms that can automate the traditional visual comparison.

The art and science of signatures:

Before coming to the types of automated signature verification and the detailed method, let's understand some concepts related to the signing process, some popular myths, the types of signature forgeries, and hence the loopholes of conventional visual comparison of static signature images.

Myth: The authentic signatures of the same person will be exactly similar across all transactions.

Reality: The physical act of signing a signature requires coordinating the brain, eyes, arms, fingers, muscles and nerves. Considering all factors in play, it’s no wonder that people don’t sign their name exactly the same every time: some elements may be omitted or altered. Personality, emotional state, health, age, conditions under which the individual signs, space available for the signature and many other factors all influence signature-to-signature deviations.

Types of signature forgeries:

In real life a signature forgery is an event in which the forger mainly focuses on accuracy rather than fluency.

The range of signature forgeries falls into the following three categories:

1. Random/Blind forgery — Typically has little or no similarity to the genuine signatures. This type of forgery is created when the forger has no access to the authentic signature.

2. Unskilled (Trace-over) Forgery: The signature is traced over, appearing as a faint indentation on the sheet of paper underneath. This indentation can then be used as a guide for a signature.

3. Skilled forgery — Produced by a perpetrator who has access to one or more samples of the authentic signature and can imitate it after much practice. Skilled forgery is the most difficult of all forgeries to authenticate.

An effective signature verification system must have the ability to detect all these types of forgeries by means of reliable, customized algorithms.

Manual verification conundrum:

Because the decision is subjective and varies heavily with human factors such as expertise, fatigue, mood and working conditions, manual verification is error prone and inconsistent. In the case of skilled forgery (the offline setting), this leads to the following instances:

False Rejection: Genuine transactions mistakenly flagged as fraudulent and declined, creating a negative impact on customer satisfaction; often called type-I error.

False Acceptance: A skilled forgery that the operator accepts as an authentic signature, leading to financial and reputational loss; often called type-II error.

The goal of an accurate verification system is to minimize both types of error.

Signature traits:

Let's understand the signature features a human document examiner uses to distinguish frauds from genuine signatures. The following is a non-exhaustive list of static and dynamic characteristics used for signature verification:

· Shaky handwriting (static)

· Pen lifts (dynamic)

· Signs of retouching (static and dynamic)

· Letter proportions (static)

· Signature shape/dimension (static)

· Slant/angulation (static)

· Very close similarity between two or more signatures (static)

· Speed (dynamic)

· Pen pressure (dynamic)

· Pressure change patterns (dynamic)

· Acceleration pattern (dynamic)

· Smoothness of curves (static)

Depending on the verification environment and sample collection conditions, not all features are available for analysis.

Types of automatic signature verification system:

As discussed, depending on which signature characteristics are feasible (available) for extraction and on the business/functional requirements, broadly two categories of signature verification systems exist in the market.

A) Offline Signature Verification: Deployed where there is no scope for monitoring the real-time signing activity of a person. In applications that scrutinize signed paper documents, only a static, two-dimensional image is available for verification, so in this type of verification engine the dynamic characteristics are lost. To account for the loss of this important information and produce highly accurate signature comparison results, offline signature verification systems have to imitate the methodologies and approaches used by human forensic document examiners. This method is heavily dependent on tedious image preprocessing (image scaling, resizing, cropping, rotation, filtering, histogram-of-oriented-gradients thresholding, hashing etc.) and adept machine learning skills. The features mainly used here are static in nature: image texture (wavelet descriptors), geometry and topology (shape, size, aspect ratio etc.), stroke positions, handwriting similarity etc.
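As a rough sketch of the kind of preprocessing involved, the toy function below binarizes a grayscale signature image and crops it to the ink's bounding box. It is an illustrative stand-in, not the production pipeline (which would also handle noise removal, slant normalization and resizing):

```python
import numpy as np

def preprocess(img, thresh=128):
    """Binarize a grayscale signature image (dark pixels = ink) and
    crop to the bounding box of the ink. A toy stand-in for the fuller
    offline preprocessing pipeline described above."""
    ink = img < thresh                       # boolean ink mask
    rows = np.any(ink, axis=1)
    cols = np.any(ink, axis=0)
    r0, r1 = np.where(rows)[0][[0, -1]]      # first/last ink row
    c0, c1 = np.where(cols)[0][[0, -1]]      # first/last ink column
    return ink[r0:r1 + 1, c0:c1 + 1].astype(np.uint8)

# toy 6x8 "scan": white page (255) with one dark horizontal stroke
img = np.full((6, 8), 255, dtype=np.uint8)
img[2, 2:6] = 10
cropped = preprocess(img)
print(cropped.shape)  # (1, 4)
```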

Although this approach has many limitations, in most real-life check transactions and digital document verification the signatures are executed beforehand, leaving no scope for real-time signature monitoring to capture the dynamic features.

For offline signature verification the machine learning task can be further categorized into 1) general learning (person-independent), where verification is performed by comparing the questioned signature against each known signature on a 1:1 basis, and 2) special learning (person-dependent), which verifies whether the questioned signature falls within the range of variation among multiple genuine signatures of the same individual.

B) Online Signature Verification: Signing is a reflex based on repeated action rather than deliberate muscle control, and even accurate forgeries take longer to produce than genuine signatures. As the name suggests, in this type of verification system the capture of crucial dynamic features, such as speed, acceleration and pressure, is feasible. This type of system is more accurate, as it is virtually impossible even for a copy machine or an expert to mimic the unique behavioral patterns and characteristics of the original signer.
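While the offline pipeline in this article never sees these signals, the dynamic features an online system captures can be estimated from a sampled pen trajectory. The helper below is illustrative (not part of the original system), computing speed and acceleration magnitudes by finite differences:

```python
import numpy as np

def dynamic_features(x, y, t):
    """Estimate pen-tip speed and acceleration magnitudes from sampled
    coordinates (x, y) at timestamps t via finite differences."""
    x, y, t = np.asarray(x, float), np.asarray(y, float), np.asarray(t, float)
    dt = np.diff(t)
    vx, vy = np.diff(x) / dt, np.diff(y) / dt
    speed = np.hypot(vx, vy)                 # magnitude of velocity
    accel = np.abs(np.diff(speed) / dt[1:])  # change of speed per unit time
    return speed, accel

# toy trajectory: pen moves right at increasing speed
speed, accel = dynamic_features([0, 1, 3, 6], [0, 0, 0, 0],
                                [0.0, 0.1, 0.2, 0.3])
```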

Experiment Brief:

Let's discuss a simplistic offline verification solution in a simulated environment. For this research, data was prepared from 40 individuals, each contributing 25 signatures, giving 1,000 genuine signatures. Subjects were then randomly chosen to forge another person's signature, at 15 per individual, giving 600 forgeries (a decent oversampling of fraud). With 25 genuine and 15 forged signatures per person, the data is randomly split into train (75%) and validation (25%) sets, ensuring at least 15 genuine signatures per person in the train data. The goal is to build an offline algorithmic signature verification system with the person-independent learning method: an engine to determine whether or not a questioned signature from the validation set belongs to a particular individual.

Fig: Genuine signature samples. Fig: Samples for an individual (genuine and forged)
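The per-person split described above can be sketched as follows; the function name and dictionary layout are illustrative, not from the original experiment:

```python
import random

def split_person(genuine, forged, train_frac=0.75, min_genuine_train=15, seed=42):
    """Shuffle one person's signatures and split 75/25, keeping at least
    `min_genuine_train` genuine samples in the train portion."""
    rng = random.Random(seed)
    g, f = list(genuine), list(forged)
    rng.shuffle(g)
    rng.shuffle(f)
    n_g = max(min_genuine_train, round(train_frac * len(g)))
    n_f = round(train_frac * len(f))
    train = {"genuine": g[:n_g], "forged": f[:n_f]}
    valid = {"genuine": g[n_g:], "forged": f[n_f:]}
    return train, valid

# 25 genuine and 15 forged signatures for one person, as in the experiment
train, valid = split_person([f"g{i}" for i in range(25)],
                            [f"f{i}" for i in range(15)])
```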

Solution Framework:

Person-independent supervised learning: The learning problem is converted into a two-class classification problem in which the input consists of the difference (dissimilarity) between a pair of signatures. The odds of a genuine signature are calculated as a likelihood ratio (LR) derived from suitable parametric distributions of the distance (the dissimilarity score of paired signatures), fitted separately for the good (authentic) and bad (forged) populations. The distance of a questioned signature from a person's genuine signatures is then evaluated against these fitted distributions to calculate the LR score, and based on the LR and a pre-specified threshold (chosen for maximum accuracy), the classification decision is taken as to whether or not a questioned signature (from the test data) is genuine with respect to a particular person.

Model equation:

LR = ∏ (i = 1 to N) P(Dg|d_i) / P(Db|d_i) > Ψ

Where

• P(Dg|d) is the probability density function value of the Dg (genuine) distance distribution at distance d

• P(Db|d) is the probability density function value of the Db (forged) distance distribution at distance d

• N is the number of known samples from a person for 1:1 comparison

• Ψ is a pre-specified threshold value > 1
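The decision rule above can be sketched in code, accumulating the ratio in log space for numerical stability. The density functions below are hypothetical stand-ins for the fitted Dg and Db distributions:

```python
import math

def likelihood_ratio(distances, pdf_genuine, pdf_forged):
    """Product over the N 1:1 comparisons of P(Dg|d_i) / P(Db|d_i),
    accumulated in log space to avoid underflow/overflow."""
    llr = sum(math.log(pdf_genuine(d)) - math.log(pdf_forged(d))
              for d in distances)
    return math.exp(llr)

# hypothetical densities: genuine distances cluster near 0.2, forged near 0.6
def pdf_g(d):
    return math.exp(-((d - 0.2) ** 2) / 0.02) / math.sqrt(0.02 * math.pi)

def pdf_b(d):
    return math.exp(-((d - 0.6) ** 2) / 0.02) / math.sqrt(0.02 * math.pi)

psi = 1.0  # threshold
lr = likelihood_ratio([0.25, 0.30, 0.22], pdf_g, pdf_b)
decision = "genuine" if lr > psi else "forged"
```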

Although the modeling task is straightforward, a lot of image preprocessing is required to calculate the distance (or vector of distances, d) between signature pairs based on extracted static features, along with suitable parametric model selection and tuning of the optimal cutoff value.

Steps involved:

A) Feature Extraction: This is a highly technical area and involves complex image processing to extract the discriminating elements, and combinations of elements, for a particular person.

1) Image preprocessing and grid formation: Each signature went through salt-and-pepper noise removal and slant normalization after gray-scale transformation. Then, after suitable resizing, cropping and other augmentation steps, each image is reconstructed as a 4x7 grid.
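The grid formation step can be sketched with a reshape, assuming the resized image dimensions divide evenly into the 4x7 grid:

```python
import numpy as np

def to_grid(img, rows=4, cols=7):
    """Partition a resized binary signature image into a rows x cols grid
    of cells, from which local features are later extracted. Assumes the
    image dimensions are divisible by the grid shape."""
    h, w = img.shape
    cells = (img.reshape(rows, h // rows, cols, w // cols)
                .swapaxes(1, 2))  # -> (rows, cols, cell_h, cell_w)
    return cells

img = np.arange(8 * 14).reshape(8, 14) % 2  # toy 8x14 binary image
cells = to_grid(img)
print(cells.shape)  # (4, 7, 2, 2)
```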

2) Binary feature vector extraction: A GSC (gradient, structural and concavity) feature map is extracted from the pixel image grid, and each corresponding local histogram cell is quantized into a binary feature vector of 1024 bits (summing the bits of the G, S and C features).
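A toy stand-in for this quantization step is sketched below: per-cell gradient-direction histograms thresholded into bits. The real GSC pipeline emits 1024 bits from gradient, structural and concavity maps; here only a gradient-like component is simulated, and the bit count per cell is arbitrary:

```python
import numpy as np

def binary_feature_vector(cells, bits_per_cell=12):
    """Toy quantization: for each grid cell, histogram the gradient
    directions and threshold each bin against the cell's mean count,
    yielding one bit per bin."""
    feats = []
    for cell in cells.reshape(-1, *cells.shape[2:]):
        gy, gx = np.gradient(cell.astype(float))
        angles = np.arctan2(gy, gx).ravel()
        hist, _ = np.histogram(angles, bins=bits_per_cell,
                               range=(-np.pi, np.pi))
        feats.append((hist > hist.mean()).astype(np.uint8))
    return np.concatenate(feats)

# toy 4x7 grid of 4x4 binary cells
rng = np.random.default_rng(0)
cells = rng.integers(0, 2, (4, 7, 4, 4))
vec = binary_feature_vector(cells)
print(vec.size)  # 336 bits (4*7 cells x 12 bits), vs 1024 in the real system
```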

Fig: Image grid and 1024 bit binary feature vector

B) Similarity (distance) measure: Gaussian landmark sets (exp(−r_ij²/2σ²)) are developed for point-to-point matching of paired images, and an overall similarity or distance measure is used to compute a score that signifies the strength of the match between two signatures. The similarity measure converts the pairwise data from feature space to distance space. Several such measures exist; here the Hamming distance is used.
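On binary feature vectors, the Hamming distance reduces to the fraction of bit positions where the two vectors differ, as a minimal sketch:

```python
import numpy as np

def hamming_distance(a, b):
    """Normalized Hamming distance between two equal-length binary
    feature vectors: fraction of differing bit positions (0 = identical,
    1 = complementary)."""
    a, b = np.asarray(a), np.asarray(b)
    return np.count_nonzero(a != b) / a.size

d = hamming_distance([1, 0, 1, 1], [1, 1, 1, 0])  # 2 of 4 bits differ
print(d)  # 0.5
```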

(Apologies for not elaborating on these topics here because of space constraints; I will discuss them in a separate post.)

C) Model training (distribution fit): The pairwise distances (d) of the train data are categorized into two vectors: Dg, the distances between all pairs of genuine signatures (samples truly from the same person), and Db, the distances between all pairs of forged signatures (samples from different persons). These two distance vectors can be modeled using known distributions such as the Gaussian or gamma. For this example the gamma distribution fits the data well.
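A lightweight sketch of the distribution fit, using method-of-moments estimates for the gamma parameters rather than a full maximum-likelihood fit; the helper names and toy distances are illustrative:

```python
import math

def fit_gamma(samples):
    """Method-of-moments gamma fit: shape k = mean^2/var,
    scale theta = var/mean."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    return mean ** 2 / var, var / mean

def gamma_pdf(d, k, theta):
    """Gamma density, later evaluated as P(Dg|d) or P(Db|d)."""
    return d ** (k - 1) * math.exp(-d / theta) / (math.gamma(k) * theta ** k)

# toy genuine-pair distances
dg = [0.10, 0.20, 0.15, 0.25, 0.20]
k, theta = fit_gamma(dg)
```

The same fit would be run separately on the Db distances to obtain the forged-pair density.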

D) Likelihood ratio (LR) and classification decision: A questioned signature of a particular person from the untagged data (here, from validation) is 1:1 matched with the person's genuine signatures after the preprocessing described above, and the distance score (pairwise dissimilarity) is projected against the fitted density curves to get the LR value, P(Dg|d)/P(Db|d). If the likelihood ratio is greater than 1, the classification decision is that the two samples belong to the same person; if the ratio is less than 1, they belong to different persons. If there are a total of N known samples from a person, then for one questioned sample N 1:1 verifications can be performed and the likelihood ratios multiplied. For convenience, log likelihood-ratios (LLR) are used rather than likelihood ratios.

Fig: Distribution fit and Classification Decision

Performance Evaluation: The above distributions, although there is a noticeable overlapping zone, do a reasonable job of discriminating the two regions (genuine and fraud). The default decision boundary is given by the sign of the LLR, and a modified decision boundary can be constructed using a threshold α such that log P(Dg|d) − log P(Db|d) > α. The model accuracy, defined as [1 − ((false acceptance + false rejection)/2)], is maximized at a particular value of α. This involves model tuning, and the best setting of α is denoted the operating point for the specified number of known samples. In the ROC curves, generated with a varied number of known samples (from 12 to 15), the operating point is shown as '*'. The overall accuracy is around 77%.
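The threshold sweep can be sketched as follows; the LLR values below are made up for illustration, not taken from the experiment:

```python
def tune_threshold(genuine_llrs, forged_llrs, candidates):
    """Sweep candidate thresholds alpha and return the one maximizing
    accuracy = 1 - (false acceptance rate + false rejection rate) / 2,
    i.e. the operating point."""
    best = None
    for a in candidates:
        # genuine signatures rejected (LLR at or below threshold)
        fr = sum(l <= a for l in genuine_llrs) / len(genuine_llrs)
        # forgeries accepted (LLR above threshold)
        fa = sum(l > a for l in forged_llrs) / len(forged_llrs)
        acc = 1 - (fa + fr) / 2
        if best is None or acc > best[1]:
            best = (a, acc)
    return best

alpha, acc = tune_threshold([2.0, 3.1, 0.5, 4.0],
                            [-1.0, -2.5, 0.8, -0.2],
                            candidates=[-1, 0, 1])
print(alpha, acc)  # 0 0.875
```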

Fig: Model Tuning and Performance

Improvement and road ahead:

Through this experiment and its simplistic solution a moderate accuracy is achieved. However, the accuracy can be improved with bigger training data and by fitting and ensembling with other models, including non-parametric methods (deep learning, CNNs etc.). Also, incorporating other distance measures (e.g. Levenshtein distance, Chamfer distance) between image pairs as additional features, and/or taking a simple or weighted average of these dissimilarity features, would make the dissimilarity measure more robust and reliable and add more discriminatory power to the model.

Finally, cutting-edge signature verification systems need to be adaptive, agile and accurate. This requires deep analysis of ever-growing datasets and continuous updates to production models, so that efficiency remains stable over time, unlike the results achieved in high-volume situations with human operators.


Sourish has 9+ years of experience in Data Science, Machine Learning, Business Analytics and Consulting across domains, and is currently based in Berlin with dunnhumby gmbh.