How to Implement the WHOLE Data Science Process

Understand the Model Building Process in Python

Shirley Chen
5 min read · Mar 2, 2020

Recently, I completed the “Predicting Credit Card Approvals” project on DataCamp. The project covers the ins and outs of how to build a machine learning model to predict whether a credit card application will be approved. Here I explain the process I use to build models in Python!

(1) Know what the business needs

In this case study, we want to help commercial banks predict credit card approvals, because they receive a lot of applications. Many of those applications get rejected for reasons like high loan balances, low income levels, or too many inquiries on an individual’s credit report.

(2) Set performance metric

The performance metrics we choose to evaluate our machine learning algorithms are very important: the choice of metric influences how the performance of machine learning algorithms is measured and compared.

Classification Metrics

Here is a list of common classification metrics (a small scikit-learn sketch follows the list):

  1. Classification Accuracy - the number of correct predictions made as a ratio of all predictions made
  2. Log Loss - a performance metric for evaluating the predictions of probabilities of membership to a given class
  3. Area Under ROC Curve - a performance metric for binary classification problems
  4. Confusion Matrix - a handy presentation of the accuracy of a model with two or more classes
  5. Classification Report - display the precision, recall, f1-score and support for each class
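
As a minimal sketch of how these metrics are computed with scikit-learn (on made-up labels and probabilities, not our credit card model yet):

# Classification metrics on toy binary data
from sklearn.metrics import (accuracy_score, log_loss, roc_auc_score,
                             confusion_matrix, classification_report)

y_true = [0, 0, 1, 1, 1, 0]                  # actual classes
y_pred = [0, 1, 1, 1, 0, 0]                  # hard predictions
y_prob = [0.1, 0.6, 0.8, 0.9, 0.4, 0.3]      # predicted P(class=1)

print(accuracy_score(y_true, y_pred))        # classification accuracy
print(log_loss(y_true, y_prob))              # log loss
print(roc_auc_score(y_true, y_prob))         # area under ROC curve
print(confusion_matrix(y_true, y_pred))      # confusion matrix
print(classification_report(y_true, y_pred)) # precision/recall/f1/support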

Regression Metrics

Here are three of the most common metrics for evaluating predictions on regression machine learning problems (a small sketch follows the list):

  1. Mean Absolute Error - the average of the absolute differences between predictions and actual values
  2. Mean Squared Error - the average of the squared differences between predictions and actual values; taking the square root gives the RMSE, which is back in the original units
  3. R-Squared - provides an indication of the goodness of fit of a set of predictions to the actual values
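
And a corresponding sketch for the regression metrics, again on made-up values:

# Regression metrics on toy data
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print(mean_absolute_error(y_true, y_pred))   # MAE: mean of |error|
print(mean_squared_error(y_true, y_pred))    # MSE: mean of squared error
print(r2_score(y_true, y_pred))              # R-squared: goodness of fit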

(3) Get the data

First, we will start off by loading and viewing the dataset.

import pandas as pd
cc_apps = pd.read_csv('cc_approvals.data', header=None)
cc_apps.head()

(4) Use EDA to understand the data

Exploratory Data Analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. Typical graphical techniques used in EDA include box plots, histograms, Pareto charts, scatter plots, and stem-and-leaf plots.
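
For example, a few quick EDA steps on the cc_apps DataFrame loaded in step (3) might look like this (a sketch; which plots to draw depends on the data):

# A quick first pass at EDA on the DataFrame from step (3)
print(cc_apps.describe())   # summary statistics for the numeric columns
cc_apps.info()              # column dtypes and non-null counts
print(cc_apps.tail())       # inspect the end of the data as well

# Histograms of the numeric features (requires matplotlib)
import matplotlib.pyplot as plt
cc_apps.hist(figsize=(8, 6))
plt.show()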

(5) Split the data

When working with datasets under supervised learning, we split the dataset into training data and test data.

Train/Test Split

from sklearn.model_selection import train_test_split

# Drop the features 11 and 13 and convert the DataFrame to a NumPy array
cc_apps = cc_apps.drop([cc_apps.columns[11], cc_apps.columns[13]], axis=1)
cc_apps = cc_apps.values

# Separate features and labels, then hold out 20% of the data for testing
X, y = cc_apps[:, 0:13], cc_apps[:, 13]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

(6) Preprocessing for training and testing data

Data preprocessing is the technique of converting raw data into a clean data set. Data gathered from different sources arrives in a raw format that is not feasible for analysis, so it must be cleaned and transformed first.

Here are eight ways to implement preprocessing (a short sketch of two of them follows the list):

  1. Standardization, Mean removal, Variance Scaling
  2. Non-linear Transformation - quantile transforms and power transforms
  3. Normalization - the process of scaling individual samples to have unit norm
  4. Encoding Categorical Features - OrdinalEncoder (converts categorical features to integer codes) and OneHotEncoder (transforms each categorical feature with n_categories possible values into n_categories binary features, one of which is 1 and all others 0)
  5. Discretization - provides a way to partition continuous features into discrete values
  6. Imputation of Missing Values
  7. Generating Polynomial Features
  8. Custom Transformer
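
As a tiny illustration of two techniques from the list, here is a sketch using scikit-learn’s SimpleImputer and OneHotEncoder on toy data (not the cc_apps dataset):

# Imputing missing values and one-hot encoding a categorical column
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

X_num = np.array([[1.0], [np.nan], [3.0]])
imputer = SimpleImputer(strategy='mean')   # replace NaN with the column mean
print(imputer.fit_transform(X_num))        # [[1.], [2.], [3.]]

X_cat = np.array([['owner'], ['renter'], ['owner']])
encoder = OneHotEncoder()                  # one binary column per category
print(encoder.fit_transform(X_cat).toarray())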

Fit/Transform on Training Data

We use a fit method to learn model parameters from a training set, and a transform method to apply that transformation to data. Therefore, we use fit_transform() on the training data so that we learn the scaling parameters from the training data and scale it at the same time.

Transform on Test Data

We only use transform() on the test data because we use the scaling parameters learned on the train data to scale the test data.

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)

(7) Modeling

Fitting a model to the train set

Which model should we pick? A question to ask is: are the features that affect the process correlated with each other? Since the features in this dataset are correlated, we will take advantage of the fact that generalized linear models perform well in such cases. Let’s start our machine learning modeling with a logistic regression model.

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(rescaledX_train, y_train)

Making predictions and evaluating performance

But how well does our model perform?
Based on the performance metrics we set, we will now evaluate the model on the test set with respect to classification accuracy and the confusion matrix. In the case of predicting credit card applications, it is equally important to see whether our model can correctly predict, as denied, the applications that originally got denied. The confusion matrix helps us view the model’s performance from both aspects.

from sklearn.metrics import confusion_matrix

# Predict on the test set and evaluate
y_pred = logreg.predict(rescaledX_test)
print("Accuracy of logistic regression classifier: ", logreg.score(rescaledX_test, y_test))   # 0.8333333
print(confusion_matrix(y_test, y_pred))
[[60 10]
[13 55]]

In the confusion matrix, the first element of the first row denotes the true negatives: the number of negative instances (denied applications) the model predicted correctly. The last element of the second row denotes the true positives: the number of positive instances (approved applications) the model predicted correctly.
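
If you prefer named counts to matrix positions, scikit-learn lets you unpack the four cells of a binary confusion matrix directly (a small convenience, not part of the original project code):

# For a binary problem, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("true negatives:", tn, " true positives:", tp)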

(8) Validation and Hyperparameter Tuning

In machine learning, hyperparameter tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is set before training and controls the learning process.

Let’s see if we can do better. We can perform a grid search over the model’s hyperparameters to improve its ability to predict credit card approvals. GridSearchCV implements “fit” and “score” methods and is used to find the hyperparameter values that yield the most accurate predictions.

from sklearn.model_selection import GridSearchCV

# Define the grid of hyperparameter values to search over
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]
param_grid = dict(tol=tol, max_iter=max_iter)

grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)

# Note: following the DataCamp project, the grid search is fit on the full
# rescaled dataset; in practice you would tune on the training set only
rescaledX = scaler.fit_transform(X)
grid_model_result = grid_model.fit(rescaledX, y)
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))
0.850725

With this grid search, we finish building a model that predicts whether a person’s credit card application will be approved given some information about that person, and the best cross-validated accuracy improves to about 85%.

(9) A/B testing the final model before full deployment

After hyperparameter tuning for every model we create, we have our best-performing model. Before full deployment, we can A/B test the final model against the current one to make sure it performs correctly on future predictions.
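
As a minimal sketch of what such a traffic split could look like (the predict_with_ab_test helper, the routing fraction, and the models passed in are hypothetical, not part of the DataCamp project):

import random

# Hypothetical A/B routing: send a small fraction of incoming applications
# to the candidate model and keep the rest on the current model
def predict_with_ab_test(features, current_model, candidate_model, candidate_fraction=0.1):
    if random.random() < candidate_fraction:
        return "B", candidate_model.predict(features)
    return "A", current_model.predict(features)

# e.g. arm, decision = predict_with_ab_test(x_new, logreg, grid_model_result.best_estimator_)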

The source code behind this article can be found on my GitHub.

About me

Thank you so much for reading my article! Hi, I’m Shirley, currently studying for a Master’s degree in Business Analytics (MS-BA) at ASU. If you have questions, please don’t hesitate to contact me!

Email me at kchen122@asu.edu, and feel free to connect with me on LinkedIn!
