1 By – Amey Gangal, Ganga Charan Gopisetty, Rakesh Sangameswaran

2 Machine Learning applications relevant to the Financial Services sector

3 Machine Learning - Understanding
Machine learning is the science of designing and applying algorithms that learn from past cases. It uses complex algorithms that iterate over large data sets and analyze the patterns in the data, enabling machines to respond to situations for which they have not been explicitly programmed. It is used in spam detection, image recognition, product recommendation, predictive analytics, etc.

A significant reduction of human effort is the main aim of data scientists in implementing ML. Even with modern analytics tools, it takes humans a lot of time to read, collect, categorize and analyze data. ML teaches machines to identify and gauge the importance of patterns in place of humans. Particularly for use cases where data must be analyzed and acted upon in a short amount of time, the support of machines allows humans to be more efficient and act with confidence. Machine learning converts data-intensive and confusing information into a simple format that suggests actions to decision makers. A user further trains the ML system by continually adding data and experience. Thus, at its core, machine learning is a three-part cycle: train, test, predict. Optimizing this cycle makes predictions more accurate and relevant to the specific use case.
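As a minimal sketch of this train-test-predict cycle (the file names, label column and model choice below are illustrative assumptions, not part of the original deck):

```python
# Minimal sketch of the train-test-predict cycle (file/column names are assumed).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Train: learn patterns from labelled historical cases.
history = pd.read_csv("claims_history.csv")            # hypothetical file
X = history.drop(columns=["is_fraud"])                 # hypothetical label column
y = history["is_fraud"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Test: measure how well the learned patterns generalise to held-out data.
print(classification_report(y_test, model.predict(X_test)))

# Predict: score new, unseen cases; retraining with fresh data repeats the cycle.
new_cases = pd.read_csv("new_claims.csv")              # hypothetical file
print(model.predict(new_cases))
```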

4 Use of Machine Learning in Insurance Claim Fraud
Insurance fraud covers the range of improper activities an individual may commit in order to obtain a favorable outcome from the insurance company. This could range from staging the incident, to misrepresenting the situation (including the relevant actors and the cause of the incident), to inflating the extent of the damage caused. Potential situations include:
Covering up a situation that was not covered under the insurance (e.g. drunk driving, performing risky acts, illegal activities, etc.)
Misrepresenting the context of the incident: shifting the blame for incidents where the insured party is at fault, or failing to take agreed-upon safety measures
Inflating the impact of the incident: increasing the estimate of the loss incurred, either by adding unrelated losses (faking losses) or by attributing increased cost to the losses

5 Process followed

6 Data Set - Sample

                           Data Set 1           Data Set 2      Data Set 3     Data Set 4
                           (Multiple Parties)   (For Insured)   (FIR Logged)   (Age of the Vehicle)
Number of Claims           8,627                562,275         595,360        15,420
Number of Attributes       34                   59              62             –
Categorical Attributes     12                   11              13             24
Normal Claims              8,537                591,902         595,141        14,497
Frauds Identified          90                   373             219            913
Fraud Incidence Rate       1.04%                0.06%           0.03%          5.93%
Missing Values             11.36%               10.27%          0.00%          –
Number of Years of Data    10                   3               –              –
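For illustration, the summary figures reported in this table (fraud incidence rate, share of missing values, number of categorical attributes) can be derived with pandas; the file and column names below are assumptions:

```python
# Sketch: deriving the summary statistics shown in the table (names assumed).
import pandas as pd

claims = pd.read_csv("dataset1_multiple_parties.csv")    # hypothetical file

n_claims = len(claims)
n_fraud = int((claims["fraud_flag"] == 1).sum())          # hypothetical label column
fraud_rate = 100.0 * n_fraud / n_claims                   # e.g. 90 / 8,627 is about 1.04%

missing_pct = 100.0 * claims.isna().sum().sum() / claims.size
n_categorical = claims.select_dtypes(include=["object", "category"]).shape[1]

print(f"claims={n_claims}, frauds={n_fraud}, incidence={fraud_rate:.2f}%")
print(f"missing values={missing_pct:.2f}%, categorical attributes={n_categorical}")
```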

7 BETTER DATA, BETTER RESULTS
Volume of Data: A fraud management solution needs access to a vast store of historical transaction data to help train its models and maximize the likelihood that it will uncover patterns of suspicious activity.
Richness of Data: It is not just the number of past transactions that counts; it is important to get as much information as possible about each transaction. Pulling data from different sources can enhance data quality and fill gaps left by missing information (see the sketch below).
Relevancy of Data: By collecting data from payment processors, businesses and major payment networks, it is possible to tap into a vast reservoir of "truth information": data on duplicated records, claim IDs found to be invalid, and manual reviews based on the actual outcomes of past transactions. This data is critical for distinguishing between good and bad transactions.
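As a sketch of the point about pulling data from different sources, claim records can be enriched by joining external "truth" data on a shared key; every file, key and column name here is hypothetical:

```python
# Sketch: enriching claims with external data to fill information gaps (names assumed).
import pandas as pd

claims = pd.read_csv("claims.csv")                 # insurer's own records
processor = pd.read_csv("payment_processor.csv")   # external "truth" data

# Left-join on a shared key so every claim keeps its row and gains extra columns
# (duplicate flags, invalid-claim-ID flags, manual-review outcomes) where available.
enriched = claims.merge(processor, on="claim_id", how="left", suffixes=("", "_ext"))

# Fill gaps in the insurer's own fields from the external source where possible.
enriched["customer_segment"] = enriched["customer_segment"].fillna(
    enriched["customer_segment_ext"])
```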

8 Challenges Faced in Detection
The incidence of fraud is far lower than the total number of claims, and each fraud is unique in its own way.
Another challenge encountered in the machine learning process is handling missing values; missing data arises in almost all serious statistical analyses.
A further challenge is handling categorical attributes (e.g. the gender variable is transposed into two separate columns, say male and female). A sketch of handling all three issues follows below.
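A minimal sketch of addressing the three challenges (class imbalance, missing values, categorical attributes); the file, columns and model are assumptions for illustration:

```python
# Sketch: imputation, one-hot encoding and class weighting (names assumed).
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

data = pd.read_csv("claims.csv")                          # hypothetical file

# Missing values: impute numeric gaps with the column median.
numeric_cols = data.select_dtypes(include="number").columns
data[numeric_cols] = SimpleImputer(strategy="median").fit_transform(data[numeric_cols])

# Categorical attributes: one-hot encode, e.g. gender -> gender_male / gender_female.
data = pd.get_dummies(data, columns=["gender"])           # hypothetical column

# Class imbalance: weight the rare fraud class more heavily during training.
model = LogisticRegression(max_iter=1000, class_weight="balanced")
model.fit(data.drop(columns=["fraud_flag"]), data["fraud_flag"])
```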

9 Machine Learning models which can be used
The goal is to construct machine learning algorithms that can learn from a dataset and make predictions on unseen data. Such algorithms operate by building a model from historical data in order to make predictions or decisions on new, unseen data.
Logistic Regression: Logistic regression measures the relationship between a dependent variable and one or more independent variables by estimating probabilities using a logit function. Within the generalized linear model, a binomial prediction can be performed instead of a regression.
Multivariate Normal Distribution: The multivariate normal distribution is a generalization of the univariate normal to two or more variables.
Boosting: Boosting is a procedure that combines the outputs of many "weak" classifiers to produce a powerful "committee".
Bagging: Single decision trees are likely to suffer from high variance or high bias. Decision trees learn the input-output map of a supervised learning problem by expressing the map as a branching tree; the method does well on problems with complicated structure, but has the weakness of high variance, which can be mitigated through bagging.
Random Forest: Tuning can be controlled by using an "Out-Of-Bag" error estimate for each observation. This estimate is the average error from the trees corresponding to the bootstrap samples in which that observation did not appear, and is used to control the training. A sketch of these models follows below.
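A sketch of the models above fitted with scikit-learn, including the out-of-bag score mentioned for Random Forests; the synthetic, imbalanced toy data stands in for a real claims table:

```python
# Sketch: logistic regression, boosting, bagging and a random forest with OOB estimate.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy data standing in for claims (about 5% positive class).
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),   # logit link, binomial outcome
    "boosting": AdaBoostClassifier(n_estimators=200),           # committee of weak learners
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=200),  # averages high-variance trees
    "random forest": RandomForestClassifier(n_estimators=300, oob_score=True),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test accuracy:", round(model.score(X_test, y_test), 3))

# Out-of-bag estimate: averaged over trees whose bootstrap sample excluded the observation.
print("random forest OOB score:", round(models["random forest"].oob_score_, 3))
```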

10 Conclusion The machine learning models discussed and applied to the datasets should be able to identify most of the fraudulent cases with a low false positive rate, i.e. with reasonable precision. This enables the system to focus on new fraud scenarios and ensures that the models adapt to identify them.
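For example, precision and the false positive rate can be read off a confusion matrix on held-out data; the toy labels below are only for illustration:

```python
# Sketch: precision and false positive rate from a confusion matrix (toy labels).
from sklearn.metrics import confusion_matrix, precision_score

y_test = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # true labels (1 = fraud)
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]   # model predictions

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
precision = precision_score(y_test, y_pred)   # TP / (TP + FP)
fpr = fp / (fp + tn)                          # share of normal claims wrongly flagged

print(f"precision={precision:.2f}, false positive rate={fpr:.2f}")
```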

11

12 Prediction of consumer credit risk
Because of the increasing number of companies and startups in the field of microcredit and peer-to-peer lending, we tried through this project to build an efficient tool for peer-to-peer lending managers, so that they can easily and accurately assess the default risk of their clients. In order to restore trust in the financial system and to prevent credit and default risk from recurring, banks and other credit companies have recently tried to develop new models that assess the credit risk of individuals even more accurately.

13 Data Set – from Kaggle
Age of the borrower
Number of dependents in family
Monthly income
Monthly expenditures divided by monthly gross income
Total balance on credit cards divided by the sum of credit limits
Number of open loans and lines of credit
Number of mortgage and real estate loans
Number of times the borrower has been 30-59 days past due but no worse in the last 2 years
Number of times the borrower has been 60-89 days past due but no worse in the last 2 years
Number of times the borrower has been 90 days or more past due
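A sketch of loading and preparing these features; the column names follow the public Kaggle "Give Me Some Credit" files and should be treated as assumptions:

```python
# Sketch: loading the Kaggle credit data (column names assumed from the
# public "Give Me Some Credit" competition files).
import pandas as pd

data = pd.read_csv("cs-training.csv", index_col=0)

target = "SeriousDlqin2yrs"                    # 1 = serious delinquency within two years
features = [
    "age",
    "NumberOfDependents",
    "MonthlyIncome",
    "DebtRatio",                               # monthly expenditures / gross income
    "RevolvingUtilizationOfUnsecuredLines",    # card balances / sum of credit limits
    "NumberOfOpenCreditLinesAndLoans",
    "NumberRealEstateLoansOrLines",
    "NumberOfTime30-59DaysPastDueNotWorse",
    "NumberOfTime60-89DaysPastDueNotWorse",
    "NumberOfTimes90DaysLate",
]

X = data[features].fillna(data[features].median())   # impute missing income/dependents
y = data[target]
print(X.shape, y.mean())                             # sample size and default rate
```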

14 Machine Learning models which can be used
Logistic Regression: a classic model for this type of problem.
Classification and Regression Trees (CART): trees are particularly efficient at classification.
Random Forests: this model averages multiple deep decision trees trained on different parts of the training set, which aims at reducing the variance.
Gradient Boosting Trees (GBT): the gradient boosting algorithm improves the accuracy of a predictive function through incremental minimization of the error term. After the initial tree is grown, each tree in the series is fitted with the purpose of reducing the error. A sketch comparing these models follows below.
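A sketch comparing the four model families with cross-validated AUC; X and y are assumed to come from the data-preparation sketch on the previous slide:

```python
# Sketch: comparing Logit, CART, Random Forest and GBT by cross-validated AUC
# (X, y assumed from the data-preparation sketch above).
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

models = {
    "Logit": LogisticRegression(max_iter=1000),
    "CART": DecisionTreeClassifier(max_depth=5),
    "Random Forest": RandomForestClassifier(n_estimators=300),
    "GBT": GradientBoostingClassifier(n_estimators=300, learning_rate=0.05),
}

for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: AUC = {auc:.3f}")
```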

15 Conclusion The results show two distinct groups of models: the first consists of Logit and CART, the second of the more sophisticated tree models, Random Forest and Gradient Boosting Trees. By combining trees with the gradient boosting technique (the GBT model), we can implement a model with two principal features. First, its predictive power is very accurate. Second, its small variance makes it much more reliable.

