Note: Please see an updated version of this report here.

Brief Report:

Customer Retention:

Businesses care about maintaining their customer base. There are a variety of approaches to doing so: contacting customers who haven’t recently used the business’s service, creating barriers to cancellation (e.g., if I try to delete my Facebook account, I have to click through several pages reminding me of all the friends I’ll lose contact with), or simply creating a monopoly, so that customers have no choice but to use your service (e.g., my apartment building only offers Comcast).

Yet another option, and the focus of this project, is to use machine learning. Can we take non-obvious aspects of customers’ usage patterns and use them to predict which customers are at risk of leaving? This problem is similar to those involving recommender systems, in that it nicely intersects theoretical and practical interests; however, it has been less well studied than recommender systems.

The Data:

The dataset comes from a web business that provides an SEO tool its clients can install on their websites, so that website visitors who want a free SEO audit share their email with the client (clients can also use the tool for self-audits). The dataset I used (i.e., the raw data after cleaning/munging) consists of a row for each customer who has ever purchased a subscription (n = 755). Each row indicates (a) the duration of their subscription, (b) whether they are still subscribed (as of the data extraction), and (c) a set of usage details and properties to be used in prediction. Some of the latter include (for the full list, see the detailed report below):

  • Submission Ratio (numeric): The client installs the tool because they want to collect visitors’ emails. How well is the tool working for each client? That is, how many emails do they get (relative to their traffic)?
  • Mean Audit Grade (numeric): For every audit run, the website being audited gets a grade. This is the average grade that each client’s SEO tool outputs.
  • Number of Recent Logins (numeric): How many times has the client logged into the system in the past 60 days?

Survival Analysis:

The dataset presents an interesting challenge because what we want to predict (how long a customer subscribes before cancelling) is partially missing. For clients who have already cancelled, we know how long they lasted, so we can use their information to make inferences; but what about clients who are still subscribers? We only know how long they’ve lasted so far, which inherently underestimates how long they will last. We don’t want to just ignore these clients, because (a) it removes precious data from our already undersized dataset, and (b) it biases the sample toward customers who have already cancelled, which understates retention.

The way to approach this challenge is with a tool called “survival analysis”. We can fit a “survival curve” to our data, which captures, for any given timepoint, the probability of surviving at least up to that timepoint. The way this curve is calculated allows us to gain information from the “missing” data of still-subscribed customers: they contribute to the early parts of the curve and are then filtered out (censored) in the later parts. In my reading for this project, my favorite gentle-but-still-technical introductions to survival analysis have been from this post on cancer and this blog post.
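To make this concrete, here is a minimal sketch of estimating a survival curve in R with the survival package’s Kaplan-Meier estimator. (The column and data-frame names are the ones used later in this report; the code itself is my illustration, not the original analysis.)

library(survival)

# Kaplan-Meier fit: SubscriptionDuration is the observed time, and
# SubscriptionInactive marks whether the cancellation was actually observed
# (still-subscribed customers are "censored" rather than dropped).
km <- survfit(Surv(SubscriptionDuration, event = SubscriptionInactive) ~ 1,
              data = df_p)
summary(km)   # survival probability at each observed timepoint
plot(km)      # the survival curve, with a confidence band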

Estimating a survival curve allows us to make inferences about our data in an unbiased manner. From there, traditional statistical and machine-learning techniques become available. I was drawn to decision trees for this project, where each split is chosen by finding maximally different survival curves in the resulting groups (rather than by minimizing entropy, as in a classification task). The idea here is to make readable and (for a business) potentially actionable predictions, so decision trees seemed well-suited.
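As a rough illustration of what “maximally different survival curves” means, the log-rank test (also in the survival package) quantifies how different two groups’ curves are; a candidate split that produces a large test statistic separates the curves well. This is only a sketch of the general idea, using the Embedder feature as an example split; it is not necessarily the exact splitting statistic rpart uses internally.

# Compare the survival curves of two candidate child nodes with a log-rank test;
# a larger chi-square statistic means the split separates the curves better.
survdiff(Surv(SubscriptionDuration, event = SubscriptionInactive) ~ Embedder,
         data = df_p)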

Analysis & Results:

The primary free parameter in a survival tree, as in any decision tree, is how large we grow it. I decided on this by maximizing test-set accuracy in 10-fold cross-validation. Assessing the tree’s test-set accuracy was difficult for technical reasons I go into in the full report below, but the upshot is that a survival tree can output a survival curve for each instance/customer: a curve giving the probability of surviving (not cancelling) at least up to any given timepoint. Each customer’s observed timepoint can then be located on their curve, giving a probability that that customer would have cancelled (or not) by the timepoint at which they were observed. This can be compared to whether the customer actually has cancelled. This measure is ideal because of how close it is to what you’d actually want to do with this project: take a database of customers and constantly monitor their “risk” of cancelling.

Below is the winning tree. Plotted at each terminal node is the survival curve for any customers who fall into that node.

The tree generated by rpart allows us to extract the computed feature importance:

##               NumRecentLogins                LoginFrequency 
##                          0.33                          0.20 
##                 ClientTraffic ControlPanelSubmissionsPerDay 
##                          0.13                          0.12 
##                MeanAuditGrade              EmailSubmitRatio 
##                          0.11                          0.11 
##                    EmailCount 
##                          0.01
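For reference, those numbers come straight off the fitted rpart object; something along these lines would reproduce them (the normalization to proportions is my guess at how the table was formatted):

# rpart stores a named vector of importance scores on the fitted tree;
# dividing by the total expresses each as a share of the whole.
imp <- tree$variable.importance
round(imp / sum(imp), 2)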

Many of the most important attributes, and how the tree uses them, seem fairly intuitive. Both the number of recent logins and the login frequency are highly predictive of (not) cancelling, although only one is used in the tree, probably because the two are highly correlated with each other (Pearson correlation = .78). Users who haven’t logged in recently tend to cancel (with the exception of users who logged in just once recently, possibly to cancel). Further, the size of the client’s business (traffic), as well as how well the tool is working for them (email submit ratio), are also predictive.

The average test-set accuracy of this survival tree (trained only on the training folds) is ~68%. Since the dataset has 284 cancelled users out of 755, baseline performance (always guessing “not cancelled”) is around 62% (471/755). So overall, the model’s performance isn’t stellar, but it does outperform the baseline.

Further Details:

Data Building/Cleaning:

The data originally came in the form of separate tables: a table specifying logins, a table specifying subscriptions (multiple per user), a table specifying SEO-tool usage, etc. These had to be cleaned and merged in an (as always) painful process. The biggest obstacle was that none of the columns could be trusted to be describing what they seemed to be describing (e.g., if column A was total tool usage and column B was supposedly a subset of it, then for several users column B > column A).

Additionally, caution was needed in removing any hint of time-correlation from variables. For example, I couldn’t use “number of logins”, since this is too highly correlated with “length of subscription”, which is part of what’s being predicted in survival analysis.

Through excluding data and looking at lots of histograms, I managed to parse the data into one table consisting of features I could (pretty much) trust. I detailed three of these above (submission ratio, number of recent logins, and mean audit grade) and I detail the rest below. All numeric attributes except for email-count and num-recent-logins were log-transformed when I cleaned the data, in order to make their distributions more normal, just in case I wanted to use a parametric machine-learning technique that depends on normally distributed variables (I believe the survival trees I used do not have this assumption).
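For concreteness, the transformation step looked something like the sketch below (the exact code, and the +1 guard against zeros, are my reconstruction rather than a copy of the original script):

# Log-transform the skewed numeric features: everything numeric except
# EmailCount and NumRecentLogins, per the description above.
skewed <- c("ClientTraffic", "ControlPanelSubmissionsPerDay",
            "EmailSubmitRatio", "MeanAuditGrade", "LoginFrequency")
for (col in skewed) {
  df_p[[col]] <- log(df_p[[col]] + 1)   # +1 avoids log(0)
}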

Full Feature List:

  • Email-Count (numeric): How many times was the client emailed during the first ten days of their subscription? This can vary based on whether they preemptively signed up for the full membership, or whether they opted out of automated emails.
  • Client Traffic (numeric): The total number of times the embedded SEO tool was viewed, divided by the length of the customer’s subscription. This acts as a measure of how much traffic the client gets.
  • Embedder (binary): The SEO tool can be used not only by the client’s customers, but also by the client herself in a control panel. Curiously, many clients only use the SEO tool for this purpose. This binary attribute indicates whether the client uses the tool only this way (equivalent to “traffic == 0”).
  • Control-Panel Submissions Per Day (numeric): How many times the tool was used in the above way (in the control panel), divided by the length of the subscription.
  • OS (nominal): What operating system does the client use (or rather, usually log in with)?
  • Login-Frequency (numeric): The number of times the user logged in, divided by the duration of their subscription.

The original dataset had thousands of records, but the vast majority of these were (potential) clients who signed up for a 30-day free trial. This could be an interesting dataset for a very different type of project (predicting client conversion instead of client retention), but that is not the project I chose to do. Additionally, this alternative project would be problematic for this dataset, since we only have current records: to build a model that predicts client conversion, we’d want data for each client frozen at the time of their 30-day trial. (That is, if I want to figure out why Sue didn’t stay past the free trial but Bob did, I should compare Sue’s behavior when she had the free trial to Bob’s behavior when he had the free trial, not Sue’s free-trial behavior to Bob’s current, as-a-member behavior.)

Model Building:

Building the model, in contrast to cleaning the data (above) and assessing the model (below), was a piece of cake. I was originally going to write my own survival-tree implementation, modifying my decision-tree code from the earlier homework to make splits that maximize survival-curve difference (rather than minimize class entropy); however, I decided against this when I found that R already has several packages with high-quality implementations.

I used the package rpart. To build the tree, it’s literally a line of code:

library(rpart); library(survival)  # Surv() comes from the survival package
tree = rpart(Surv(SubscriptionDuration, event = SubscriptionInactive) ~ ., data = df_p)

The above code builds a model that fits SubscriptionDuration, “censored” by SubscriptionInactive, with every other column in df_p as a feature.

Model Validation:

The complicated part comes when assessing the model. The main free parameter in a survival tree (like a decision tree) is how many branches/nodes it has. This is most intelligently controlled in rpart with the “complexity parameter” (CP), which dictates the minimum improvement in model fit that a new split has to yield in order to be made.

I chose this free parameter using cross-validation. I decided to implement this part myself, because I wanted to make sure the parameter was chosen to maximize the kind of accuracy I described above (rather than maximizing some other metric of fit, or minimizing some metric of error, that rpart might use internally). Again, this measure of accuracy, predicting each customer’s “risk” and comparing it to whether they’ve cancelled, is ideal because it’s exactly what you’d want to use this model for in a practical setting.
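In outline, the tuning loop looked something like the sketch below. This is my reconstruction, not the original code; accuracy_at_observed_time() is a hypothetical stand-in for the risk-versus-outcome accuracy measure just described (its logic is sketched in the Model Predictions section), and the grid of CP values is illustrative.

library(rpart)
library(survival)

# 10-fold cross-validation over a grid of complexity-parameter values.
cp_grid <- c(0.30, 0.20, 0.15, 0.10, 0.05, 0.02, 0.01)
folds   <- sample(rep(1:10, length.out = nrow(df_p)))

cv_accuracy <- sapply(cp_grid, function(cp) {
  mean(sapply(1:10, function(k) {
    train <- df_p[folds != k, ]
    test  <- df_p[folds == k, ]
    fit <- rpart(Surv(SubscriptionDuration, event = SubscriptionInactive) ~ .,
                 data = train, control = rpart.control(cp = cp))
    accuracy_at_observed_time(fit, test)   # hypothetical helper (see Model Predictions)
  }))
})
names(cv_accuracy) <- cp_grid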

Below, I plot 10-fold cross-validation performance as a function of the complexity parameter threshold. I compare accuracy on the training set with accuracy on the test set, as well as with accuracy on the test set using “ZeroR” as a baseline.

This same data can also be plotted against the size of the tree:

We can see a familiar trend in both cases. As we allow our model to become increasingly complex, training accuracy keeps increasing (due to overfitting), while test-set accuracy forms a “hill”: it initially improves as the model better fits the data, then gets worse and worse as the model fails to generalize.

What we are interested in is the highest point on this ‘hill’: when does the model hit a sweet spot? Based on inspecting the graph, this appears to be when the complexity parameter is set to .10. This was the parameter I used in generating the tree earlier in this report.

Model Performance:

We might wonder whether the model is actually outperforming the baseline, or whether we just picked a lucky point on the curve where it is marginally better than ZeroR, but not significantly so. One way of quickly testing this concern is a t-test: take the 10 cross-validation accuracies at CP = .10 and compare them to the baseline of 62.38%.
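The test itself is a single call, reconstructed here from the output below:

# Compare the 10 cross-validated accuracies at CP = .10 against the 62.38% baseline.
t.test(best_train$acc, mu = 0.6238)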

## 
##  One Sample t-test
## 
## data:  best_train$acc
## t = 2.7297, df = 9, p-value = 0.02323
## alternative hypothesis: true mean is not equal to 0.6238
## 95 percent confidence interval:
##  0.6327956 0.7198359
## sample estimates:
## mean of x 
## 0.6763158

The t-test is significant, which supports the idea that our model’s performance is not a fluke (the fact that the gray confidence intervals around the green and blue lines in the graph below do not overlap at CP ~= .10 is further support).

Model Predictions:

I’d like to devote a few paragraphs to one issue that came up in assessing my model, because (a) this particular issue took a very long time to resolve, and it’s the reason I didn’t get a chance to try other approaches on this dataset, (b) it was a really good learning experience, and (c) I still don’t quite understand what happened.

As described above, the model outputs a “survival curve” for each node. For each customer, we can take her node’s survival curve, find her place on that curve, and thereby output a “survival probability”.

The question, then, is what to do with the survival probability when comparing it to actual survival (non-cancellation). The solution here seemed obvious to me: for each observation where P(survival) < .5, predict a cancellation, otherwise predict still-a-customer.
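A sketch of that procedure (my reconstruction, not the original code, and written for the data the tree was fit on): find each customer’s terminal node, fit a Kaplan-Meier curve from the customers in that node, read the survival probability off that curve at the customer’s observed timepoint, and apply the .5 threshold.

library(rpart)
library(survival)

# Survival probability at each customer's observed time, using the Kaplan-Meier
# curve of the terminal node the customer falls into (tree$where gives the node).
node_survival_prob <- function(tree, df) {
  nodes <- factor(tree$where)
  probs <- numeric(nrow(df))
  for (nd in levels(nodes)) {
    rows <- nodes == nd
    km <- survfit(Surv(SubscriptionDuration, event = SubscriptionInactive) ~ 1,
                  data = df[rows, ])
    probs[rows] <- sapply(df$SubscriptionDuration[rows], function(t) {
      summary(km, times = t, extend = TRUE)$surv   # S(t) on this node's curve
    })
  }
  probs
}

p_surv <- node_survival_prob(tree, df_p)
predicted_cancel <- p_surv < 0.5   # the naive .5 threshold described above
mean(predicted_cancel == (df_p$SubscriptionInactive == 1))   # raw accuracy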

However, it turns out that when I did that, the model performed terribly. In the best case it performed as well as ZeroR, and it never performed better. This absolutely stumped me. It seemed like it should at least perform as well as ZeroR, since ZeroR is essentially an “intercept only” model, whereas the survival tree should be equivalent to “intercept + some free parameters.”

It’s helpful to depict this idea visually. Below, each little dot is an observation, either =1 for cancelled or =0 for not cancelled. On the x-axis is the output of the model; each line is a logistic curve relating the model’s output probability to the cancelled outcome (one line per CV run). The left graph is what I thought I would get: as the probability of “survival” increases, the probability of cancellation decreases, with .5 as the crossover point. Therefore, any dot on the left side of the graph should also tend to be at the top of the graph (and, similarly, dots on the right should be at the bottom).

But the right graph is what I usually got instead. Here, the model’s output probability isn’t related to cancellation at all. In fact, the probabilities don’t even capture the simple fact (which ZeroR does capture) that most customers have not cancelled: the number of dots on the bottom (customers who didn’t cancel) is greater than the number on the top (customers who did cancel), but the number of dots on the left (P(survival) < .5, i.e., predicted to cancel) and on the right (predicted to stay) is about equal. So if I simply classify as cancellers everyone for whom the model says P(survival) < .5, I’ll do worse than ZeroR.

[Figure: model output vs. observed cancellation, with fitted logistic curves; expected pattern (left) and what was actually obtained (right).]

The solution is not to classify based on a fixed .5 threshold, but to pick the threshold intelligently, as those logistic curves do. So I eventually solved my problem with what feels like a weird workaround: rather than classifying test customers directly from the predictions of a tree trained on the training data, I fit a logistic regression (on the training data) that maps the tree’s predictions to observed cancellations, and then used that regression’s predictions to classify the test customers. What this seems to mean is that the model does not output a survival probability; instead it outputs some measure of survival likelihood that’s comparable from customer to customer, but isn’t calibrated to an actual “probability.” I don’t know why this would be the case: it’s not a bug in my code, and I can verify that it happens on much simpler simulated datasets predicted with much simpler survival models. I suspect it has something to do with how one can (and cannot) interpret survival curves; this speaks to the fact that it’s important to understand the tools you’re using (so you can avoid hours of debugging!).
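Concretely, the workaround looked roughly like this (train_df/test_df and the variable names are placeholders; this is a sketch of the approach rather than the original code):

# The tree's raw output on the training and test folds: a score that is comparable
# across customers but, as described above, not a calibrated probability.
train_score <- predict(tree, newdata = train_df)
test_score  <- predict(tree, newdata = test_df)

# Fit a logistic regression on the training fold mapping the tree's output to
# observed cancellation...
calib <- glm(cancelled ~ score,
             data = data.frame(cancelled = train_df$SubscriptionInactive,
                               score = train_score),
             family = binomial)

# ...then classify test customers with the recalibrated probabilities.
p_cancel <- predict(calib, newdata = data.frame(score = test_score), type = "response")
predicted_cancel <- p_cancel > 0.5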

**That being said, I don’t fully understand what’s going on here. If you’re a future reader of this report, know something about survival analysis, and want to help de-confuse me, I’d love to hear from you! Email me at jacobwdink@gmail.com.** I’ve also posted a StackExchange question about it here.

Future Directions:

There are a lot of things I wasn’t able to accomplish in this first pass.

  • There were two primary features of this dataset that I ran out of time to use. First, I had each client’s IP address, and could have used it to extract their country as another predictive feature. Second, I had information on support tickets, which would have been useful: both the number of tickets filed and perhaps a sentiment analysis of the ticket text.
  • This was a tough dataset for the job. In particular, there were only 284 users in the dataset whose cancellations were observed. That means we were trying to predict customer retention in a dataset where “retention length” was a (partially) missing data-point for roughly two-thirds of the cases. I would love to see how a similar approach might fare on a larger dataset.
  • I was never able to compare my model’s performance to a non-survival based model. The theoretical assumptions of survival analysis suggest that a model that simply adds in “duration of subscription” as another feature, and then performs classification (cancel/not) should not perform well. However, it would have been nice to verify this. This might have been difficult to verify simply by cross-validation, however, since the bias that survival analysis attempts to eliminate–that the “true” customer retentions are longer than the observed ones in this dataset–would be present in both training and test sets (since both are derived from this dataset). It would be better to return to this dataset in a year or so and verify that a survival model performs better in predicting who cancelled in that year than a non-survival model could.
  • I still don’t fully understand why survival models don’t output probabilities of survival (the content of the “model prediction” section above). Email me if you know!

Anyways, I learned a lot, and I’m glad I finally got my model to outperform ZeroR. It’s a small victory, but, considering the difficult, real-world nature of the dataset, I’ll take it.

Happy summer!

-Jacob