How to Implement the WHOLE Data Science Process
Understand Model Building Process in Python
Recently, I completed the “Predicting Credit Card Approvals” project on DataCamp. The project covers the ins and outs of building a machine learning model to predict whether a credit card application will get approved. Here I explain the process I use to build models in Python!
(1) Know what the business needs
In this case study, we want to help commercial banks predict credit card approvals, since they receive many applications. Many of those applications get rejected for reasons such as high loan balances, low income levels, or too many inquiries on an individual’s credit report.
(2) Set performance metric
The performance metrics we choose to evaluate our machine learning algorithms are very important. The choice of metrics influences how the performance of machine learning algorithms is measured and compared.
Classification Metrics
Here is a list of common classification metrics (a short scikit-learn sketch follows this list):
Classification Accuracy - the number of correct predictions made as a ratio of all predictions made
Log Loss - a performance metric for evaluating the predicted probabilities of membership in a given class
Area Under ROC Curve - a performance metric for binary classification problems
Confusion Matrix - a handy presentation of the accuracy of a model with two or more classes
Classification Report - displays the precision, recall, f1-score and support for each class
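To make these concrete, here is a minimal scikit-learn sketch; the y_true, y_pred and y_prob values below are made-up placeholders for illustration, not numbers from this project.
from sklearn.metrics import (accuracy_score, log_loss, roc_auc_score,
                             confusion_matrix, classification_report)
# Hypothetical labels, predicted classes and predicted probabilities of class 1
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
y_prob = [0.2, 0.9, 0.4, 0.1, 0.8]
print(accuracy_score(y_true, y_pred))         # classification accuracy
print(log_loss(y_true, y_prob))               # log loss on predicted probabilities
print(roc_auc_score(y_true, y_prob))          # area under the ROC curve
print(confusion_matrix(y_true, y_pred))       # confusion matrix
print(classification_report(y_true, y_pred))  # precision, recall, f1-score, support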
Regression Metrics
Here is a list of 3 of the most common metrics for evaluating predictions on regression machine learning problems (again, a short scikit-learn sketch follows the list):
Mean Absolute Error - the average of the absolute differences between predictions and actual values
Mean Squared Error - the average of the squared differences between predictions and actual values, which gives a gross idea of the magnitude of error while penalizing large errors more heavily
R-Squared - provides an indication of the goodness of fit of a set of predictions to the actual values
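Similarly, here is a minimal scikit-learn sketch with made-up y_true and y_pred values to show how these regression metrics are computed.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Hypothetical regression targets and predictions, just to illustrate the API
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print(mean_absolute_error(y_true, y_pred))  # average absolute difference
print(mean_squared_error(y_true, y_pred))   # average squared difference
print(r2_score(y_true, y_pred))             # goodness of fit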
(3) Get the data
First, we will start off by loading and viewing the dataset.
import pandas as pd
cc_apps = pd.read_csv('cc_approvals.data', header=None)
cc_apps.head()
(4) Use EDA to understand the data
Exploratory Data Analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. Typical graphical techniques used in EDA are: Box plot, Histogram, Pareto chart, Scatter plot, and Stem-and-leaf plot.
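As a small sketch of what this first pass could look like on the cc_apps DataFrame loaded above (the exact checks are my own suggestion, not the project's prescribed steps):
# Summary statistics of the numeric columns
print(cc_apps.describe())
# Column types and non-null counts
print(cc_apps.info())
# Inspect the last rows to look for missing or placeholder values
print(cc_apps.tail(17))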
(5) Split the data
As we work with datasets under supervised learning, we split a dataset into training data and test data.
from sklearn.model_selection import train_test_split
# Drop the features 11 and 13 and convert the DataFrame to a NumPy array
cc_apps = cc_apps.drop([cc_apps.columns[11], cc_apps.columns[13]], axis=1)
cc_apps = cc_apps.values
# Separate features and labels, then create the train/test split
X, y = cc_apps[:, 0:13], cc_apps[:, 13]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train = X_train.copy()
X_test = X_test.copy()
y_train = y_train.copy()
y_test = y_test.copy()
(6) Preprocessing for training and testing data
Data preprocessing is a technique used to convert raw data into a clean data set. In other words, when data is gathered from different sources, it is collected in a raw format that is not feasible for analysis.
Here are 8 ways to implement preprocessing (a short sketch of two of them follows this list):
Standardization, Mean Removal, Variance Scaling
Non-linear Transformation - quantile transforms and power transforms
Normalization - the process of scaling individual samples to have unit norm
Encoding Categorical Features - OrdinalEncoder (converts categorical features to integer codes) and OneHotEncoder (transforms each categorical feature with n_categories possible values into n_categories binary features, with one of them 1 and all others 0)
Discretization - provides a way to partition continuous features into discrete values
Imputation of Missing Values
Generating Polynomial Features
Custom Transformers
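As a hedged sketch of two of these steps (the tiny age and employed arrays are made-up examples, not columns from this dataset):
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
import numpy as np
# Imputation of missing values: replace the NaN with the column mean
age = np.array([[25.0], [np.nan], [40.0]])
age_imputed = SimpleImputer(strategy='mean').fit_transform(age)
print(age_imputed)
# Encoding categorical features: one-hot encode into binary indicator columns
employed = np.array([['yes'], ['no'], ['yes']])
employed_encoded = OneHotEncoder().fit_transform(employed).toarray()
print(employed_encoded)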
Fit/Transform on Training Data
We use a fit method to learn model parameters from a training set, and a transform method to apply this transformation model to unseen data. Therefore, we use fit_transform() on the train data so that we learn the scaling parameters from the train data and, at the same time, scale the train data.
Transform on Test Data
We only use transform() on the test data because we use the scaling parameters learned on the train data to scale the test data.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)
(7) Modeling
Fitting a model to the train set
Which model should we pick? A question to ask is: are the features that affect the approval process correlated with each other? Because of this correlation, we will take advantage of the fact that generalized linear models perform well in these cases. Let’s start our machine learning modeling with a Logistic Regression model.
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(rescaledX_train, y_train)
Making predictions and evaluating performance
But how well does our model perform?
Based on the performance metrics we set, we will now evaluate our model on the test set with respect to classification accuracy and the confusion matrix. In the case of predicting credit card applications, it is equally important to see whether our machine learning model correctly predicts as denied the applications that originally got denied. The confusion matrix helps us view our model’s performance from this angle.
from sklearn.metrics import confusion_matrix
y_pred = logreg.predict(rescaledX_test)
print("Accuracy of logistic regression classifier: ", logreg.score(rescaledX_test, y_test))  # 0.8333333
print(confusion_matrix(y_test, y_pred))
[[60 10]
 [13 55]]
For the confusion matrix, the first element of the first row denotes the true negatives, meaning the number of negative instances (denied applications) that the model predicted correctly. The last element of the second row denotes the true positives, meaning the number of positive instances (approved applications) that the model predicted correctly.
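As a quick sketch (assuming, as above, that denied applications are the negative class), the four cells of a binary confusion matrix can be unpacked by name:
# Unpack the confusion matrix into true/false negatives and positives
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("True negatives (correctly predicted denials):  ", tn)
print("False positives (denials predicted as approved):", fp)
print("False negatives (approvals predicted as denied):", fn)
print("True positives (correctly predicted approvals): ", tp)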
(8) Validation and Hyperparameter Tuning
In machine learning, hyperparameter tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process.
Let’s see if we can do better. We can perform a grid search over the model parameters to improve the model’s ability to predict credit card approvals. Grid search implements “fit” and “score” methods that are used to find the hyperparameters of a model that result in the most accurate predictions.
from sklearn.model_selection import GridSearchCV
# Define the grid of hyperparameter values to search over
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]
param_grid = dict(tol=tol, max_iter=max_iter)
# Run a 5-fold cross-validated grid search on the rescaled data
grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)
rescaledX = scaler.fit_transform(X)
grid_model_result = grid_model.fit(rescaledX, y)
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))
Best: 0.850725 using ...
While building this credit card predictor, we finish with a machine learning model that predicts whether a person’s application for a credit card will get approved, given some information about that person, and we reach a cross-validated accuracy score of about 85%.
(9) A/B testing the final model before full deployment
After hyperparameter tuning for every model we create, we can pick the best performing one. Eventually, we can A/B test the final model before full deployment to make sure it will be implemented correctly for future predictions.
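As a rough, purely illustrative sketch of an offline A/B-style comparison (the 50/50 split, and the choice of logreg as control and the tuned best_logreg from the sketch above as candidate, are my own assumptions, not part of the project):
import numpy as np
# Randomly route about half of the held-out samples to the candidate model
rng = np.random.RandomState(42)
group_b = rng.rand(len(y_test)) < 0.5
acc_a = logreg.score(rescaledX_test[~group_b], y_test[~group_b])     # control
acc_b = best_logreg.score(rescaledX_test[group_b], y_test[group_b])  # candidate
print("Control accuracy:  ", acc_a)
print("Candidate accuracy:", acc_b)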
The source code behind this article can be found on my GitHub.
About me
Thank you so much for reading my article! Hi, I’m Shirley, currently studying for an MS in Business Analytics at ASU. If you have questions, please don’t hesitate to contact me!
Email me at kchen122@asu.edu and feel free to connect with me on LinkedIn!