Analyzing subscription through app behavior

Business impact:

Many companies have a mobile presence to provide free/trial service in an attempt to transfer their customers to a paid membership , such as Netflix, Youtube Red, Microsoft 365, Adobe cloud service and so on;
Companies need to know which customer will most likely to pay for service and to target with offers and promotions;
However, in many cases, customer data is rarely clean and requires additional preprocessing steps to make data available to build a solid machine learning model.
Data source: Fintech case study from Udemy

Pro-processing steps:

dataset = pd.read_csv('appdata10.csv')
dataset.head(5) # Viewing the Data

	user	first_open	dayofweek	hour	age	screen_list	numscreens	minigame	used_premium_feature	enrolled	enrolled_date	liked
0	235136	2012-12-27 02:14:51.273	3	02:00:00	23	idscreen,joinscreen,Cycle,product_review,ScanP…	15	0	0	0	NaN	0
1	333588	2012-12-02 01:16:00.905	6	01:00:00	24	joinscreen,product_review,product_review2,Scan…	13	0	0	0	NaN	0
2	254414	2013-03-19 19:19:09.157	1	19:00:00	23	Splash,Cycle,Loan	3	0	1	0	NaN	1
3	234192	2013-07-05 16:08:46.354	4	16:00:00	28	product_review,Home,product_review,Loan3,Finan…	40	0	0	1	2013-07-05 16:11:49.513	0
4	51549	2013-02-26 18:50:48.661	1	18:00:00	31	idscreen,joinscreen,Cycle,Credit3Container,Sca…	32	0	0	1	2013-02-26 18:56:37.841	1
5	56480	2013-04-03 09:58:15.752	2	09:00:00	20	idscreen,Cycle,Home,ScanPreview,VerifyPhone,Ve…	14	0	0	1	2013-04-03 09:59:03.291	0

The customer data is kind of a mess, and there is much information that may not be useful or not understandable even for human-being. It will bring difficulty for machine to train a model. We need to perform feature engineering and cleaning to the raw data.

Simple cross-correlation analysis between different features:

Some feature engineering steps are considered:

Convert data type from string to int for integer number;
Drop out some columns which may contain too many “N/A”, which could bring instability in training;
Calculate cross-correlation between different features and response rate to analyze the data;
Correlation matrix and heatmap are two very useful figures to plot the relationship between different features;
Normalization and standardization

Such as, calculate how long will users subscribe to the service after they start to use the app:

 from dateutil import parser
dataset['first_open'] = [parser.parse(row_date) for row_date in dataset['first_open']]
dataset['enrolled_date'] = [parser.parse(row_date) if isinstance(row_date, str) else row_date for row_date in dataset['enrolled_date']]
#New feature
dataset['difference'] = (dataset.enrolled_date - dataset.first_open).astype('timedelta64[h]')
plt.hist(dataset['difference'].dropna(), color='r', range=[0,100])

Model building

After pre-processing, we can start to build model:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(dataset, target, test_size = 0.2, random_state =0)

Implementscaling to training and testing data:

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = pd.DataFrame(sc_X.fit_transform(X_train))
X_test = pd.DataFrame(sc_X.transform(X_test))

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score

#Training
classifier = LogisticRegression(random_state = 0, penalty = 'l1')
classifier.fit(X_train, y_train)
#Prediction
y_pred = classifier.predict(X_test)
#Evaluation
cm = confusion_matrix(y_test, y_pred)
accuracy_score(y_test, y_pred)
precision_score(y_test, y_pred) # tp / (tp + fp)
recall_score(y_test, y_pred) # tp / (tp + fn)
f1_score(y_test, y_pred)

accuracy_score(y_test, y_pred)= 0.7681
precision_score(y_test, y_pred)= 0.7618952017667135
recall_score(y_test, y_pred)= 0.7700892857142857
f1_score(y_test, y_pred)= 0.7659703300030276

The last step, I implemented random forest classifier to analyze the importance of each feature, it gives a rough estimation of which feature could contribute more weights on the prediction result:

import pandas as pd
%matplotlib inline
feats = {} # a dict to hold feature_name: feature_importance
for feature, importance in zip(X_train.columns, RFC.feature_importances_):
    feats[feature] = importance #add the name/value pair 

importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Importance'})
importances.sort_values(by='Importance',ascending=False).plot(figsize=(20,10), kind='bar', rot=45)

Conclusions

Analyze collected app usage data to help the company build a model to estimate what kind of customer will most like to subscribe and pay for service;
Help marketing and ads group to provide offers and promotion to a specific target and increase subscription rate;
For those users who may not subscribe predicted by the model, we could offer them greater discount and it will still bring revenue to the company if they decide to accept the offer and subscribe in the end.