Real Time Credit Card Fraud Detection with Apache Spark and Event Streaming
https://mapr.com/blog/real-time-credit-card-fraud-detection-apache-spark-and-event-streaming/
In this post, we are going to discuss building a real-time solution for credit card fraud detection.
There are two phases to real-time fraud detection:
- The first phase involves analysis and forensics on historical data to build the machine learning model.
- The second phase uses the model in production to make predictions on live events.
Building the Model
Classification
Classification is a family of supervised machine learning algorithms that identify which category an item belongs to (for example whether a transaction is fraud or not fraud), based on labeled examples of known items (for example transactions known to be fraud or not). Classification takes a set of data with known labels and pre-determined features and learns how to label new records based on that information. Features are the “if questions” that you ask. The label is the answer to those questions. In the example below, if it walks, swims, and quacks like a duck, then the label is "duck".
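The duck example above can be sketched as a toy rule-based classifier. The feature names are hypothetical, purely for illustration — a real model learns the rule from labeled examples rather than having it hand-coded:

```python
# Toy illustration of classification: the features are the "if questions",
# and the label is the answer. Here the rule is hand-coded; a real
# classifier learns it from labeled examples.
def label_animal(walks_like_duck, swims_like_duck, quacks_like_duck):
    """Return the label for an item given its feature values."""
    if walks_like_duck and swims_like_duck and quacks_like_duck:
        return "duck"
    return "not duck"
```

For example, `label_animal(True, True, True)` returns `"duck"`, while any feature being `False` yields `"not duck"`.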
Let’s go through an example of car insurance fraud:
- What are we trying to predict?
- This is the Label: The Amount of Fraud
- What are the “if questions” or properties that you can use to predict?
- These are the Features. To build a classifier model, you extract the features of interest that most contribute to the classification.
- In this simple example, we will use the claimed amount.
Linear regression models the relationship between the Y “Label” and the X “Feature”, in this case the relationship between the amount of fraud and the claimed amount. The coefficient measures the impact of the feature, the claimed amount, on the label, the fraud amount.
Multiple linear regression models the relationship between two or more “Features” and a response “Label”. For example, if we wanted to model the relationship between the amount of fraud and the age of the claimant, the claimed amount, and the severity of the accident, the multiple linear regression function would look like this:
AmntFraud = intercept + coeff1 * age + coeff2 * claimedAmnt + coeff3 * severity + error
The coefficients measure the impact on the fraud amount of each of the features.
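The multiple linear regression function above can be sketched in plain Python. The coefficient and intercept values below are made-up placeholders for illustration; in practice they are learned from the historical training data:

```python
def predict_fraud_amount(age, claimed_amnt, severity,
                         intercept=100.0, coeff1=2.0, coeff2=0.5, coeff3=50.0):
    # AmntFraud = intercept + coeff1*age + coeff2*claimedAmnt + coeff3*severity
    # (the error term is what the fitted model cannot explain, so it is
    # omitted from the prediction itself)
    return intercept + coeff1 * age + coeff2 * claimed_amnt + coeff3 * severity
```

Each coefficient scales its feature's contribution to the predicted fraud amount, which is exactly what "the coefficients measure the impact of each feature" means.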
Let’s take credit card fraud as another example:
- Example Features: transaction amount, type of merchant, distance from and time since the last transaction.
- Example Label: Probability of Fraud
Logistic regression measures the relationship between the Y “Label” and the X “Features” by estimating probabilities using a logistic function. The model predicts a probability which is used to predict the label class.
- Classification: identifies which category (e.g., fraud or not fraud)
- Linear Regression: predicts a value (e.g., amount of fraud)
- Logistic Regression: predicts a probability (e.g., probability of fraud)
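The logistic function that turns a linear combination of features into a probability can be sketched as follows (the coefficient values used in any call would be learned from the data, not chosen by hand):

```python
import math

def logistic(z):
    # The logistic (sigmoid) function maps any real value into (0, 1),
    # which is what lets the model output a probability.
    return 1.0 / (1.0 + math.exp(-z))

def fraud_probability(features, coefficients, intercept):
    # Linear combination of the features, squashed into a probability.
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return logistic(z)
```

The predicted label class is then derived from the probability, e.g. predicting "fraud" when the probability exceeds 0.5.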
Linear and logistic regression are just a couple of the algorithms used in machine learning; there are many more, as shown in this cheat sheet.
Feature Engineering
Feature engineering is the process of transforming raw data into inputs for a machine learning algorithm. Feature engineering is extremely dependent on the type of use case and potential data sources.
(reference Learning Spark)
Looking more in depth at the credit card fraud example for feature engineering, our goal is to distinguish normal card usage from fraudulent card usage.
- Goal: we are looking for someone using the card other than the cardholder
- Strategy: we want to design features to measure the differences between recent and historical activities.
For a credit card transaction we have features associated with the transaction, features associated with the card holder, and features derived from transaction history. Some examples of each are shown below:
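As a hedged sketch of the history-derived features, the snippet below computes "time since last transaction" and the ratio of this amount to the card holder's average spend. The field names and data shapes are hypothetical, chosen only to illustrate the recent-versus-historical comparison:

```python
from datetime import datetime

def derive_features(txn, history):
    """Derive features comparing this transaction to the card holder's
    history. `txn` and `history` are plain dicts with illustrative fields."""
    minutes_since_last = (
        txn["time"] - history["last_txn_time"]
    ).total_seconds() / 60.0
    avg_amount = history["total_spent"] / max(history["txn_count"], 1)
    return {
        # A very short gap since the last transaction can signal fraud.
        "minutes_since_last": minutes_since_last,
        # How far this amount deviates from the card holder's average spend.
        "amount_ratio": txn["amount"] / avg_amount if avg_amount else 0.0,
    }
```

Features like these quantify the difference between recent and historical activity, which is the stated strategy for spotting someone other than the cardholder using the card.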
Model Building Workflow
A typical supervised machine learning workflow has the following steps:
- Feature engineering to transform historical data into feature and label inputs for a machine learning algorithm.
- Split the data into two parts, one for building the model and one for testing the model.
- Build the model with the training features and labels
- Test the model with the test features to get predictions. Compare the test predictions to the test labels.
- Loop until satisfied with the model accuracy:
- Adjust the model fitting parameters, and repeat tests.
- Adjust the features and/or machine learning algorithm and repeat tests.
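The split / build / test loop above can be sketched without any ML library. To keep the sketch self-contained and runnable, the "model" here is a trivial single-threshold classifier on the transaction amount — a stand-in for a real Spark ML algorithm, not the method the post uses:

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    # Split labeled rows into a training set and a held-out test set.
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def fit_threshold_model(train):
    # "Build the model": pick the amount threshold that best separates
    # the fraud labels in the training data.
    best_t, best_acc = 0.0, 0.0
    for t in sorted({amount for amount, _ in train}):
        acc = sum((amount > t) == label for amount, label in train) / len(train)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def accuracy(threshold, test):
    # "Test the model": compare test predictions to the test labels.
    return sum((amount > threshold) == label for amount, label in test) / len(test)
```

In the real workflow, the "loop until satisfied" step would re-run `fit` and `accuracy` after each adjustment to the fitting parameters, features, or choice of algorithm.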
Real Time Fraud Detection Solution in Production
The figure below shows the high-level architecture of a real-time fraud detection solution, which is capable of high performance at scale. Credit card transaction events are delivered through the MapR Streams messaging system, which supports the Kafka 0.9 API. The events are processed and checked for fraud by Spark Streaming, using Spark machine learning with the deployed model. MapR-FS, which supports the POSIX NFS API and the HDFS API, is used for storing event data. MapR-DB, a NoSQL database which supports the HBase API, is used for storing and providing fast access to credit card holder profile data.
Streaming Data Ingestion
MapR Streams is a distributed messaging system which enables producers and consumers to exchange events in real time via the Apache Kafka 0.9 API. MapR Streams topics are logical collections of messages which organize events into categories. In this solution there are three categories:
- Raw Trans: raw credit card transaction events.
- Enriched: credit card transaction events enriched with card holder features, which were predicted to be not fraud.
- Fraud Alert: credit card transaction events enriched with card holder features which were predicted to be fraud.
Topics are partitioned, spreading the load for parallel messaging across multiple servers, which provides for faster throughput and scalability.
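A sketch of how a producer might spread events across partitions: hashing a message key (here, the card number — an assumption for illustration) keeps all events for one card on the same partition, preserving their order, while different cards spread across the cluster:

```python
import zlib

def partition_for(key, num_partitions):
    # Kafka-style partitioner sketch: hash the message key so that all
    # events for the same card land on the same partition (preserving
    # per-card ordering), while load spreads across partitions.
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```

This per-key stickiness is what makes parallel consumption safe: each partition can be processed independently without reordering any single card's transactions.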
Real-time Fraud Prediction Using Spark Streaming
Spark Streaming lets you use the same Spark APIs for streaming and batch processing, meaning that well-modularized Spark functions written for offline machine learning can be re-used for real-time machine learning.
The data flow for the real time fraud detection using Spark Streaming is as follows:
1) Raw events come into Spark Streaming as DStreams, which internally are sequences of RDDs. RDDs are like a Java collection, except that the data elements contained in RDDs are partitioned across a cluster. RDD operations are performed in parallel on data cached in memory, making the iterative algorithms often used in machine learning much faster when processing large amounts of data.
2) The credit card transaction data is parsed to get the features associated with the transaction.
3) Card holder features and profile history are read from MapR-DB using the account number as the row key.
4) Some derived features are re-calculated with the latest transaction data.
5) Features are run with the model algorithm to produce fraud prediction scores.
6) Non fraud events enriched with derived features are published to the enriched topic. Fraud events with derived features are published to the fraud topic.
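Steps 2–6 can be sketched as a per-event scoring function. The `profile_store` dict stands in for MapR-DB (keyed by account number) and the `score` callable stands in for the deployed model — both are simplifying assumptions, not the actual APIs:

```python
def process_event(raw_event, profile_store, score, threshold=0.5):
    """Score one transaction event and route it to the right topic.
    `profile_store` stands in for MapR-DB; `score` for the trained model."""
    # 2) Parse the raw event into its transaction features.
    account, amount = raw_event["account"], raw_event["amount"]
    # 3) Read card holder features by account number (the row key).
    profile = profile_store[account]
    # 4) Re-calculate derived features with the latest transaction data.
    enriched = {"account": account, "amount": amount,
                "amount_ratio": amount / profile["avg_amount"]}
    # 5) Run the features through the model to get a fraud score, and
    # 6) publish to the fraud-alert or enriched topic accordingly.
    topic = "fraud-alert" if score(enriched) > threshold else "enriched"
    return topic, enriched
```

In the actual pipeline this logic would run inside a Spark Streaming transformation over each micro-batch, with the publish step writing to the corresponding MapR Streams topic.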
Storage of Credit Card Events
Messages are not deleted from topics when read, and topics can have multiple different consumers; this allows processing of the same messages by different consumers for different purposes.
In this solution, MapR Streams consumers read and store all raw events, enriched events, and alerts in MapR-FS for future analysis, model training, and updating. MapR Streams consumers also read enriched events and alerts to update the card holder features in MapR-DB. Alert events are also used to update dashboards in real time.
Rapid Reads and Writes with MapR-DB
With MapR-DB (HBase API), a table is automatically partitioned across a cluster by key range, and each server is the source for a subset of the table. Grouping the data by key range provides for very fast reads and writes by row key.
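The key-range partitioning idea can be sketched in a few lines: each server owns a contiguous range of row keys, so a read by row key can be routed directly to the one server that owns it. The split points below are illustrative placeholders:

```python
import bisect

# Illustrative split points: server 0 owns keys < "3000", server 1 owns
# ["3000", "6000"), server 2 owns keys >= "6000".
SPLIT_KEYS = ["3000", "6000"]

def server_for_key(row_key):
    # bisect finds which contiguous key range (and thus which server)
    # owns this row key, so a lookup touches exactly one server.
    return bisect.bisect_right(SPLIT_KEYS, row_key)
```

Because each row key maps deterministically to one range, a read or write by row key never needs to fan out to the whole cluster — which is what makes keyed access so fast.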
All of the components of the use case architecture we just discussed can run on the same cluster with the MapR Converged Data Platform. There are several advantages of having MapR Streams on the same cluster as all the other components. For example, maintaining only one cluster means less infrastructure to provision, manage, and monitor. Likewise, having producers and consumers on the same cluster means fewer delays related to copying and moving data between clusters, and between applications.
Summary
In this blog post, you learned how the MapR Converged Data Platform integrates Hadoop and Spark with real-time database capabilities, global event streaming, and scalable enterprise storage.
References and More Information:
- Free Online training on MapR Streams, Spark, and HBase at learn.mapr.com
- Getting Started with MapR Streams Blog
- Ebook: New Designs Using Apache Kafka and MapR Streams
- Ebook: Getting Started with Apache Spark: From Inception to Production
- https://www.mapr.com/blog/parallel-and-iterative-processing-machine-learning-recommendations-spark
- https://www.mapr.com/blog/fast-scalable-streaming-applications-mapr-streams-spark-streaming-and-mapr-db
- https://www.mapr.com/blog/apache-spark-machine-learning-tutorial
- https://www.mapr.com/blog/life-message-mapr-streams
- https://www.mapr.com/blog/spark-streaming-hbase
- Apache Spark Streaming Programming Guide
- Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection Book, by Wouter Verbeke; Veronique Van Vlasselaer; Bart Baesens
- Learning Spark Book, By Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia