User Churn Prediction with PySpark

Wenjing YE
8 min read · Nov 13, 2021


From user activity logs to machine learning models

Sparkify is a fictional music streaming platform that lets its customers listen to their favorite songs either through a free plan that includes advertisements or through a paid, ad-free plan.

The project is provided by Udacity as part of the Data Scientist Nanodegree and aims at practicing the application of machine learning while leveraging the power of distributed computation with Spark.

Content

  • 1. Questions to investigate
  • 2. Data Understanding
  • 3. Data exploration
  • 4. Feature engineering
  • 5. Modeling
  • 6. Conclusion
  • 7. To go further

You can also find the full code in this GitHub repository.

1. Questions to investigate

Often, service providers need a good understanding of user profiles to anticipate the needs of different groups of users. This is why they usually record users’ actions.

By identifying the critical drivers, platforms can provide a better user experience. One of the most important applications is user churn prediction.

In this project, we can start with these questions and develop step by step:

  • About users’ usage: how much time do they spend on the platform, and what share of their usage is on the paid plan?
  • Are there significant differences in the usage behavior of churned and retained users?
  • If so, which factors drive these differences?

2. Data Understanding

As stated above, our data is extracted from user activity logs. In other words, each message records the specific behavior of a given user at a given time.

For this project, it records information on the following aspects:

User Information
| — firstName: string
| — lastName: string
| — gender: string
| — location: string
| — userAgent: string [device information]

User Account Information
| — userId: string
| — level: string [“paid” or “free” version]
| — registration: long [ registration timestamp]

User Activities
| — artist: string [name of singer]
| — song: string [name of song listened]
| — auth: string [authentication status, e.g. “Cancelled” or “Logged In”]
| — sessionId: long
| — itemInSession: long [number of items in the session]
| — page: string [page visited]
| — ts: long [activity timestamp]
| — status: long [service request status (200/307/404)]
| — method: string [POST or GET]
| — length: double [usage duration]

The information recorded in “page” is worth exploring further. It includes interaction information (“NextSong”, “Thumbs Up”, “Roll Advert”…), account services (“Downgrade”, “Cancellation Confirmation”…) and other key activities.

A typical record will be like this:

artist='Martha Tilston', auth='Logged In', firstName='Colin', gender='M', itemInSession=50, lastName='Freeman', length=277.89016, level='paid', location='Bakersfield, CA', method='PUT', page='NextSong', registration=1538173362000, sessionId=29, song='Rockpools', status=200, ts=1538352117000, userAgent='Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0', userId='30'
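For reference, here is a minimal sketch of how such a log can be loaded and inspected with PySpark. The file name mini_sparkify_event_data.json is an assumption; the dataset used in the project may be named differently.

from pyspark.sql import SparkSession

# create (or reuse) a Spark session
spark = SparkSession.builder.appName("Sparkify").getOrCreate()

# the event log is a JSON-lines file; the file name here is an assumption
df = spark.read.json("mini_sparkify_event_data.json")

# inspect the schema and one sample record
df.printSchema()
df.show(1, truncate=False)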

3. Data Exploration

Next, we look at the data in more detail.

3.1 Basics about data size

There are 286,500 records in total, concerning 226 distinct users.
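These counts can be obtained directly from the DataFrame; a quick sketch, assuming the log is loaded as df as above:

# total number of log records
print(df.count())                              # 286500

# number of distinct user ids
print(df.select("userId").distinct().count())  # 226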

3.2 Categorical values in “page”

In the “page” column, there are some similar expressions, such as “Cancel” and “Cancellation Confirmation”. We need to determine whether they point to the same behavior.

(Figure: count of page_status)

Combined with a basic knowledge of HTTP status codes, we can guess that these are associated pages. After further verification, we can confirm that in this dataset “Cancel” and “Cancellation Confirmation” point to the same behavior: cancellation of the service (i.e. user churn).
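In code, the churn label can then be defined from this page; a minimal sketch, assuming the log DataFrame is named df:

from pyspark.sql import functions as F

# flag the cancellation-confirmation events
churn_event = F.when(F.col("page") == "Cancellation Confirmation", 1).otherwise(0)

# a user counts as churned if any of their records contains that event
user_churn = (df.withColumn("churn_event", churn_event)
                .groupBy("userId")
                .agg(F.max("churn_event").alias("Churn")))

# attach the per-user label back to every log record
df_labeled = df.join(user_churn, on="userId", how="left")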

3.3 Check data validity

Next, we need to make sure that our data is properly formatted and logical.

A look at the null values in the data results in the following:

(Figure: null value counts in the original data)

As can be seen, “artist”, “song”, and “length” have the same number of null values, which is reasonable considering that users are not listening to music all the time.

But for “firstName”, “lastName”, “gender”, “registration” and “userAgent”, the cause is less obvious. Further querying shows that all of these null values belong to records with an empty userId string, which typically capture the activity of anonymous visitors. So we need to delete these records. The null value counts after cleaning are as follows:

(Figure: null value counts after deleting the “visitor” records)
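For reference, a minimal sketch of this check and cleanup in PySpark, with column and variable names following the log described above:

from pyspark.sql import functions as F

# count null values per column
df.select([F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]).show()

# drop the anonymous "visitor" records, which have an empty userId string
df_clean = df.filter(F.col("userId") != "")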

3.4 Exploration of key factors

Here comes one of the most interesting sections. We will explore the behavioral differences between churned users and existing users.

The first factor is usage time; the following figure shows the distribution of usage time for the two types of users.

The distribution of existing users’ usage time appears more concentrated, and the mean seems slightly larger. To confirm this, we also calculated the numerical results. The details are as follows:

Usage duration (min)
[Existing users] mean: 249.14, median: 249.17
[Churned users] mean: 248.31, median: 248.28

The mean and median usage duration of existing users are both only slightly greater than those of churned users. Given the relatively small sample size (225 users), we cannot determine whether usage duration is a valid metric. However, based on practical experience, we will keep this feature.
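Group-level statistics like these can be computed with a simple aggregation. A sketch, assuming a user-level DataFrame user_df with a Churn flag and a per-user usage-duration column; the column name avg_session_min is an assumption:

from pyspark.sql import functions as F

# mean and approximate median of usage duration, per user group
(user_df.groupBy("Churn")
        .agg(F.mean("avg_session_min").alias("mean"),
             F.expr("percentile_approx(avg_session_min, 0.5)").alias("median"))
        .show())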

Turning to the pages users visit, the following shows the percentage of page types visited by the two types of users.

Obviously, the most frequently accessed page is “NextSong”. Since other pages are visited relatively rarely, we can remove the “NextSong” results and look further.

This time, we have made some new discoveries. Churned users seem to have more negative activities (“Thumbs Down”, “Roll Advert”), while existing users have more positive activities (“Thumbs Up”, “Add Friend”).

As for the pages related to account settings (“Home”, “Save Settings”, “About”), they may not have a significant impact, so we will not consider them for now.
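A comparison like this can be computed by counting page visits per group and normalizing within each group; a minimal sketch, reusing the labeled DataFrame df_labeled from the sketch in section 3.2:

from pyspark.sql import functions as F
from pyspark.sql import Window

# count visits per page for each user group, excluding the dominant "NextSong" page
page_counts = (df_labeled
               .filter(F.col("page") != "NextSong")
               .groupBy("Churn", "page")
               .count())

# convert counts into a share within each group so the two groups are comparable
w = Window.partitionBy("Churn")
page_share = page_counts.withColumn("share", F.col("count") / F.sum("count").over(w))

page_share.orderBy("Churn", F.desc("share")).show(40, truncate=False)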

4. Feature engineering

Based on practical experience, we selected features from the following aspects for subsequent modeling.

1. User's personal information:
- gender
2. User account status:
- Tenure (number of days since registration)
- Number of active days
- Level: 1 for paid, 0 for free
- Change of service:
-- hasUpgraded: 1 for yes, 0 for no
-- hasDowngraded: 1 for yes, 0 for no
3. Usage information
- Average length of use during the current window period
- Change in average usage duration
- Average items number during the current window period
- Change in average items number
- Number of pages visited in the recent period
- Change in the number of pages visited
*The "change" above is the difference compared to the previous window period

Arguably, this part is the most complex. This article does not aim to discuss the implementation details; if you are interested, you can see this part of the code by clicking here.
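As a rough illustration only (not the full feature pipeline), a few of the account-status features listed above could be derived like this, assuming the cleaned log DataFrame df_clean from section 3.3; the page name "Submit Downgrade" is an assumption about how downgrades appear in the log:

from pyspark.sql import functions as F

# one row per user with a few of the account-status features
user_features = (df_clean
    .groupBy("userId")
    .agg(
        # tenure in days: last activity timestamp minus registration (both in ms)
        ((F.max("ts") - F.max("registration")) / (1000 * 60 * 60 * 24)).alias("tenure_days"),
        # number of distinct active days
        F.countDistinct(
            F.to_date(F.from_unixtime((F.col("ts") / 1000).cast("long")))
        ).alias("active_days"),
        # 1 if the user ever submitted a downgrade, else 0
        F.max(F.when(F.col("page") == "Submit Downgrade", 1).otherwise(0)).alias("hasDowngraded"),
        # 1 if the user ever used the paid level, else 0 (a simplification of "level")
        F.max(F.when(F.col("level") == "paid", 1).otherwise(0)).alias("level_paid"),
    ))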

Some of these features are binary (0/1) while others are continuous values with very different magnitudes. To prevent their absolute scale from affecting the modeling, we normalize the feature vectors (with pyspark.ml’s Normalizer, as shown in the code below) so that they are of comparable magnitude.

5. Modeling

Here, we mainly talk about the usage of the pyspark.ml package. It is a bit different from sklearn, but follows the same ideas.

The linear SVM (LinearSVC) and Gradient-Boosted Trees (GBT) models are selected for demonstration.

The general steps are as follows:

  1. split the train set and test set
  2. scale the sets separately
  3. build the classifier
  4. grid search and cross-validation
  5. model training
  6. model evaluation

Note that all data-scaling transformations happen inside each fold of the cross-validation to avoid data leakage. So we put steps 2 and 3 into a pipeline, which serves as the input to step 4.

Taking the SVM model as an example, the code is as follows.

from pyspark.ml.feature import VectorAssembler, Normalizer
from pyspark.ml.classification import LinearSVC
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml import Pipeline
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# assemble the feature columns into a single vector
assembler = VectorAssembler(inputCols=feature_names, outputCol="features")
df_vec = assembler.transform(df).select("features", "Churn")

# split into train set and test set
train, test = df_vec.randomSplit([0.8, 0.2], seed=42)

# build the scaler (applied inside each CV fold, so no data leakage)
scaler = Normalizer(inputCol="features", outputCol="scaledFeatures")

# build the model on the scaled features
svc = LinearSVC(featuresCol="scaledFeatures", labelCol="Churn")

# set the grid search range
svc_grid = ParamGridBuilder().addGrid(svc.regParam, [0.0, 0.1, 0.01]).build()

# build the pipeline: scaling + classifier
pipeline = Pipeline(stages=[scaler, svc])

# evaluator used to select the best parameters
bi_evaluator = BinaryClassificationEvaluator(labelCol="Churn")

# build the cross validator
cv = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=svc_grid,
    evaluator=bi_evaluator,
    numFolds=3
)

# train the model
cv_model = cv.fit(train)

# predict on the test set with the best model found
svc_pred = cv_model.transform(test)

# evaluate accuracy on the test set
acc_evaluator = MulticlassClassificationEvaluator(
    labelCol="Churn", predictionCol="prediction", metricName="accuracy")
svc_accuracy = acc_evaluator.evaluate(svc_pred)
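The GBT model follows the same pattern. Below is a minimal sketch that swaps in GBTClassifier; the grid values are illustrative assumptions, not necessarily the ones used for the reported results:

from pyspark.ml.classification import GBTClassifier

# same pipeline idea, with a tree ensemble instead of the linear SVM
gbt = GBTClassifier(featuresCol="scaledFeatures", labelCol="Churn")
gbt_grid = (ParamGridBuilder()
            .addGrid(gbt.maxDepth, [3, 5])
            .addGrid(gbt.maxIter, [10, 20])
            .build())
gbt_pipeline = Pipeline(stages=[scaler, gbt])
gbt_cv = CrossValidator(estimator=gbt_pipeline,
                        estimatorParamMaps=gbt_grid,
                        evaluator=bi_evaluator,
                        numFolds=3)
gbt_model = gbt_cv.fit(train)
gbt_pred = gbt_model.transform(test)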

The results obtained for the two models on the test set are as follows:

Clearly, SVC outperforms GBT, with the former reaching an accuracy of 84% and the latter 77%. Considering the small size of our sample, such a result is quite acceptable.

6. Conclusion

The results of the models demonstrate the validity of our selected features to some extent. In other words, the user’s usage time, number of actions, user activities (positive/negative feedback, social activities, paid behavior, advertisements) and changes in those activities all reflect the user’s intention to keep using the service.

More specifically, dissatisfaction with songs (Thumbs Down), an increasing number of ads, decreasing usage time, and decreasing user activity are the clearest signs that a user is unhappy with the product.

7. To go further

In the course of this project, some details have still been overlooked; these, in turn, point to directions for further work.

  1. Feature selection: We ignored the device and region information. In fact, information such as device and region (population, education level, average income) can be very reflective of user characteristics.
  2. Data balance: In this dataset, 52 out of 225 users churned, so the dataset is somewhat imbalanced. In theory, we could balance the data using methods such as upsampling.
  3. Use of relative values: In the example above, we mostly used absolute values, such as the change in the number of times a user views a page. But these values can differ greatly across users simply because of different usage habits; a better approach would be to use the percentage increase/decrease.
  4. Feature importance visualization: We could visualize feature importances, for example by outputting the importances of an ensemble model (see the sketch after this list).
  5. Data volume: Obviously, we have a very small amount of data. Increasing the amount of data would be very effective in improving model performance.
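As a hint for point 4, tree ensembles such as the GBT model expose feature importances. A minimal sketch, assuming the cross-validated GBT pipeline (gbt_model) from the sketch in section 5 and the feature_names list used by the assembler:

# take the fitted GBT stage out of the best pipeline found by cross-validation
best_gbt = gbt_model.bestModel.stages[-1]

# pair each importance score with its feature name and print them, largest first
importances = list(zip(feature_names, best_gbt.featureImportances.toArray()))
for name, score in sorted(importances, key=lambda x: -x[1]):
    print(f"{name}: {score:.3f}")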

Do you have any comments or suggestions? Feel free to leave your thoughts in the comments section. Also, if you’re interested in code and data, do not hesitate to check them out on my GitHub!
