Anomaly Detection Framework Enabling Fraud Prevention In AdTech
August 28, 2023
The client is one of the leading global players in the mobile advertising and app monetization space. It facilitates running offers for advertising companies across multiple platforms in its partner network, which helps both advertisers and publishers with customer acquisition and retention.
The client operates in more than 100 countries with 15M+ daily active users who complete 80M+ advertiser offers daily through its interface on publisher mobile applications. Advertisers incentivize users to complete offers with rewards in the form of in-app currency or dollars.
Over the last few years, scale has grown tremendously, and so have the challenges posed by fraudulent users exploiting the reward system. The dollar amount associated with fraudulent attempts is to the tune of $25,000 every day. If not handled correctly, these attempts result in the client incurring losses to the tune of millions of dollars every year.
Build a Machine Learning system that can identify fraudulent attempts with high precision and recall. While identifying fraudulent attempts is extremely important, it is equally imperative for the business not to misclassify genuine attempts.
“Impact of misclassification of a genuine attempt is often much higher than that of a fraudulent attempt in client’s business”
Legacy framework at the client –
- Rule-based Risk Framework that scores every offer attempt based on various factors. Attempts scoring above a threshold are blocked
- Post-event scoring by third-party products that flag potentially fraudulent attempts that should have been blocked but weren’t
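A rule-based risk framework like the one described above can be sketched as follows; the rules, weights, and threshold here are hypothetical, as the client's actual configuration is not described.

```python
# Minimal sketch of a rule-based risk framework. The rules, weights, and
# threshold below are hypothetical, not the client's actual configuration.
RULES = [
    # (name, predicate over an attempt record, score weight)
    ("vpn_detected",      lambda a: a.get("uses_vpn", False),          40),
    ("rapid_completions", lambda a: a.get("offers_last_hour", 0) > 20, 35),
    ("new_device",        lambda a: a.get("device_age_days", 999) < 1, 25),
]

BLOCK_THRESHOLD = 50  # attempts scoring above this are blocked

def score_attempt(attempt: dict) -> int:
    """Sum the weights of all rules that fire for this offer attempt."""
    return sum(weight for _, check, weight in RULES if check(attempt))

def is_blocked(attempt: dict) -> bool:
    return score_attempt(attempt) > BLOCK_THRESHOLD

suspicious = {"uses_vpn": True, "offers_last_hour": 30}
print(score_attempt(suspicious), is_blocked(suspicious))  # 75 True
```

The weakness of such a framework is also visible in the sketch: it only catches patterns someone has already written a rule for, which is what the post-event third-party scoring compensates for.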
Yugen’s objective was to build an in-house solution that helps the client achieve high precision and recall for Risk Framework predictions, as well as a high Match Rate with the ML predictions made by third-party solutions. The latter was an area of unknowns, as the client had little visibility into the factors considered by 3P products.
Project execution was split into three phases –
Yugen’s diagnostic framework is designed to lay a strong foundation so that, once kicked off, the study proceeds more smoothly and at a much faster pace.
Yugen conducted an accelerated two-week diagnostic phase with the client to –
- Understand the problem by conducting stakeholder interviews – A series of discussions was conducted with stakeholders in the business, risk, engineering and data science teams to understand the fraud identification issue, how it is currently being solved, and what gaps exist in the current workflow
- Understand data availability and get accesses sorted
- Conduct preliminary analysis to identify top drivers for the problem
- Scope the work streams and align on targets
Solution Design & Development
An ML project consists of 4 key phases –
- Data Analysis
- Feature Engineering
- Model Development & Evaluation
- Model Deployment
While these 4 components look siloed, for efficient study execution Yugen undertakes “an iterative approach for faster deployment and continuous optimization”.
The team came up with all possible hypotheses that could distinguish the behaviour of a fraudulent user from that of a genuine user. A few examples are –
- Usage of VPN – Geo-hopping to exploit offers from other regions
- Bot usage – Identifying patterns in data that distinguish bots from humans
- Identity theft – Features to identify a user from behavioural or demographic attributes if the ID is changing frequently
- Click activity – Identify anomalous patterns in click events of the user
and many more.
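To illustrate how one such hypothesis becomes a concrete feature, here is a sketch of a click-regularity signal for the bot-usage hypothesis: near-constant gaps between clicks suggest automation. The feature name and interpretation are assumptions for illustration, not the client's actual features.

```python
# Hypothetical bot-detection feature: coefficient of variation (CV) of
# inter-click gaps. A CV near 0 means machine-like, evenly spaced clicks.
from statistics import mean, pstdev

def click_interval_cv(click_times: list) -> float:
    """CV of gaps between consecutive click timestamps (seconds)."""
    gaps = [b - a for a, b in zip(click_times, click_times[1:])]
    if len(gaps) < 2 or mean(gaps) == 0:
        return 0.0
    return pstdev(gaps) / mean(gaps)

bot_like = [0, 1.0, 2.0, 3.0, 4.0]      # perfectly regular clicks
human_like = [0, 0.4, 3.1, 3.5, 9.0]    # irregular, bursty clicks
print(click_interval_cv(bot_like), click_interval_cv(human_like))
```

The same pattern (turn a behavioural hypothesis into a numeric signal over an event window) applies to the VPN, identity-theft, and click-activity hypotheses as well.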
Every hypothesis was tested on three fronts –
- Whether it helps in distinguishing fraudulent behaviour from the rest
- How much value it can drive in model explainability
- How much history the model needs for the feature (1 hr, 1 day, 1 month, etc.)
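The first test, whether a feature separates fraudulent behaviour from the rest, can be sketched with a simple rank-based AUC over the two groups' feature values (the data below is illustrative; the case study does not state which separability test was used).

```python
# Rank-based AUC: probability that a random fraud value exceeds a random
# genuine value. 0.5 = no separation; near 1.0 (or 0.0) = strong separation.
def feature_auc(fraud_vals, genuine_vals):
    wins = sum(
        1.0 if f > g else 0.5 if f == g else 0.0
        for f in fraud_vals for g in genuine_vals
    )
    return wins / (len(fraud_vals) * len(genuine_vals))

fraud = [0.9, 0.8, 0.95, 0.7]     # illustrative feature values
genuine = [0.1, 0.3, 0.2, 0.6]
print(feature_auc(fraud, genuine))  # 1.0 -> perfect separation
```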
Based on the above 3 criteria, the team came up with around 200 unique features (more than 500 counting different forms of the same feature) that appeared to have potential value for model development.
During the process, small batches of features (~5) went through the steps –
Hypothesis testing -> Feature deployment -> Model development -> Evaluation
This was done to make sure that all components can run in parallel and make the study as efficient as possible.
The scale of the system made it clear during the diagnostic phase itself that Engineering would need to work closely with the ML team throughout the study to build the right infrastructure to support ML system development.
The team designed an in-house Feature Store for the client with more than 500 features. Beyond the fraud detection use case the team was working on, the feature store also served several other data science, analytics and dashboarding use cases.
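The client's feature store internals are not described, but the core interface of such a store can be sketched as follows (class and method names are hypothetical): register a feature's compute function once, then serve cached values to any consuming use case.

```python
# Hypothetical minimal feature-store interface: register feature
# definitions once, serve cached values with a time-to-live (TTL).
import time

class FeatureStore:
    def __init__(self):
        self._definitions = {}  # feature name -> compute function
        self._cache = {}        # (name, entity_id) -> (value, timestamp)

    def register(self, name, fn):
        self._definitions[name] = fn

    def get(self, name, entity_id, raw, ttl=3600):
        """Return the cached value if fresh, else recompute from raw data."""
        key = (name, entity_id)
        hit = self._cache.get(key)
        if hit and time.time() - hit[1] < ttl:
            return hit[0]
        value = self._definitions[name](raw)
        self._cache[key] = (value, time.time())
        return value

store = FeatureStore()
store.register("offers_last_hour", lambda raw: len(raw["recent_offers"]))
print(store.get("offers_last_hour", "user_1", {"recent_offers": [1, 2, 3]}))  # 3
```

A shared store like this is what lets the fraud model, other data science models, and dashboards all read the same feature values rather than recomputing them per use case.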
Model Development & Evaluation
The fraud detection use case the team was solving is often referred to as anomaly detection in the machine learning space, because fraudulent user behaviour appears anomalous when viewed alongside that of genuine users.
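The case study names a deep-learning anomaly detection algorithm without specifying which one. As a dependency-free stand-in that illustrates the same train-on-normal idea, here is a per-feature z-score detector fitted on genuine attempts only, which flags attempts that deviate strongly from that baseline (all numbers are illustrative).

```python
# Train-on-normal anomaly detection sketch: fit per-feature statistics on
# genuine attempts only, then score new attempts by how far they deviate.
# This is a stand-in for the (unspecified) deep-learning model used.
from statistics import mean, pstdev

class ZScoreAnomalyDetector:
    def fit(self, normal_rows):
        cols = list(zip(*normal_rows))
        self.mu = [mean(c) for c in cols]
        self.sd = [pstdev(c) or 1.0 for c in cols]  # guard against zero stdev
        return self

    def score(self, row):
        """Mean absolute z-score across features; higher = more anomalous."""
        return sum(abs((x - m) / s)
                   for x, m, s in zip(row, self.mu, self.sd)) / len(row)

genuine = [[1.0, 10], [1.2, 11], [0.9, 9], [1.1, 10]]  # illustrative features
det = ZScoreAnomalyDetector().fit(genuine)
print(det.score([1.0, 10]))  # small -> looks genuine
print(det.score([5.0, 40]))  # large -> looks anomalous
```

Because the detector is fitted only on the majority (genuine) class, it sidesteps the resampling problem described below: the 0.5% fraud rate never needs to be inflated or the genuine class shrunk.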
Moreover, the data at hand was highly imbalanced: only 0.5% of user attempts were flagged fraudulent, making this an extreme rare-event problem. Most ML algorithms fail on such rare-event use cases. While there are techniques to tackle class imbalance, they all rely on reducing the sample size of one class, inflating that of the other, or both. In all such approaches, the data landscape changes significantly, which impacts model performance.
These reasons led us to leverage a deep-learning anomaly detection algorithm for model training. The process comprised 3 steps –
- Train the ML model on all features and identify the ones adding value through feature-importance tools, exploratory analysis, and iterative improvement. Multiple iterations were performed to identify around 40 important features that delivered almost the same value as all 100 unique features together
- Optimize the feature weights through techniques such as a Genetic Algorithm, assigning higher weights to features that add more value and thereby enhancing model performance
- Evaluate model metrics and performance – precision, recall, true positive/negative rate, Match Rate with external data, etc.
Following these 3 steps, the team was able to reach the desired level of performance and freeze the models.
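The feature-weight search in step 2 can be sketched with a toy genetic algorithm. The fitness function here is a stand-in (distance to a hypothetical "ideal" weight vector); in practice it would be a model metric such as validation precision/recall on the fraud task.

```python
# Toy genetic algorithm for feature-weight optimization (illustrative).
# TARGET and the fitness function are assumptions for the sketch.
import random

random.seed(42)
TARGET = [0.7, 0.1, 0.2]  # hypothetical "ideal" weight vector

def fitness(w):
    # Negative squared distance to TARGET; higher is better.
    return -sum((a - b) ** 2 for a, b in zip(w, TARGET))

def mutate(w, rate=0.1):
    # Perturb each weight slightly, clamped to [0, 1].
    return [min(1.0, max(0.0, x + random.uniform(-rate, rate))) for x in w]

def crossover(a, b):
    # Pick each gene from one of the two parents.
    return [random.choice(pair) for pair in zip(a, b)]

pop = [[random.random() for _ in range(3)] for _ in range(20)]
for _ in range(200):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]  # elitism: keep the best half
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(10)]
    pop = parents + children

best = max(pop, key=fitness)
print(best)  # should land close to TARGET
```

Elitism (carrying the best candidates forward unchanged) guarantees the search never regresses, which matters when each fitness evaluation is an expensive model run.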
At first, a post-event model was deployed in production to run inference in batches once a day on the previous day’s attempts. The results were analysed over time to understand drift in performance.
Once the model performed well and remained stable on post-event predictions, a real-time model was deployed to run A/B experiments.
On successful execution of the experiments, the model recommendations were scaled to 100% of transactions.
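The case study does not state which drift metric was used for the post-event analysis. One common choice for comparing a day's score distribution against the training baseline is the Population Stability Index (PSI); the bins and the 0.2 alert threshold below are conventional but illustrative.

```python
# Population Stability Index (PSI) between two score samples.
# A PSI above ~0.2 is commonly treated as meaningful drift.
from math import log

def psi(expected, actual, bins=10):
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]

    def frac(sample, a, b, last):
        # Fraction of the sample falling in [a, b); last bin includes b.
        n = sum(1 for x in sample if a <= x < b or (last and x == b))
        return max(n / len(sample), 1e-6)  # avoid log(0)

    total = 0.0
    for i in range(bins):
        last = i == bins - 1
        e = frac(expected, edges[i], edges[i + 1], last)
        a = frac(actual, edges[i], edges[i + 1], last)
        total += (a - e) * log(a / e)
    return total

baseline = [i / 100 for i in range(100)]                 # training-time scores
same_day = [i / 100 for i in range(100)]                 # stable day
shifted = [min(1.0, i / 100 + 0.4) for i in range(100)]  # drifted day
print(psi(baseline, same_day), psi(baseline, shifted))
```

Each PSI term is non-negative, so a stable distribution scores near zero while any sustained shift in the score histogram pushes the index up.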
The current model running in production has the following attributes –
- 3 different models have been developed and deployed for 3 different types of offers
- Models have been scaled from 1 country to 21 countries now
- Model Performance –
- 98% correct predictions for genuine attempts
- 95% correct predictions for fraudulent attempts
- 30% match rate with attempts that are flagged by third party products
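Metrics like those above follow directly from the confusion-matrix counts; the counts in this sketch are illustrative, not the client's actual figures.

```python
# Precision and recall from raw confusion-matrix counts (illustrative).
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp)  # of flagged attempts, how many were fraud
    recall = tp / (tp + fn)     # of fraud attempts, how many were caught
    return precision, recall

# e.g. 1000 true fraud attempts; model flags 970 attempts, 950 correctly
p, r = precision_recall(tp=950, fp=20, fn=50)
print(round(p, 3), round(r, 3))  # 0.979 0.95
```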
The Machine Learning system developed and deployed by Yugen’s team is expected to help the client save more than USD 5M by the end of October 2024.