Forecasting user activity using Machine Learning and R language

Forecasting user activity using Machine Learning and R language

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 24 average: 4.82 (rating: 96%)
During the active technology development, terms such as Artificial Intelligence, Machine Learning, Data Mining and Data Science are becoming more and more known. The main task of these technologies in business is to learn how to understand the client's behavior in constantly changing conditions. After all, if we know how the client will behave in the future, we can best plan and conduct marketing activities.
 
Mathematical algorithms of machine learning work with large volumes of data and find even hidden patterns of customer behavior. These patterns are not visible to managers and, moreover, sometimes even customers themselves do not suspect about them.
 
In this article, we will consider one of the approaches that is used in the eSputnik artificial intelligence system.
 
This approach is based on several ideas:
  • understanding the patterns of users behavior
  • reducing users anxiety
  • increase users longevity
  • prevention of users base burnout
  • effective use of promotional offers
You will learn how to increase the effectiveness of email letters and at the same time to reduce the total number of sent letters by more than 30%.
 

1. The formal statement of the problem

Pattern recognition algorithms assume the existence of historical information, which makes it possible to construct models of the statistical connection x → y, where
  • y ∈ Y, Y – observed user actions (responses) or a simulated random variable
  • x ∈ X, X a set of variables (predictors), through which it is supposed to explain the variability of the variable y
Most of supervised machine leaarning algorithms are arranged in such a way that they can be written in the form y = f(x, β) + ε, where:
  • f – a mathematical function chosen from some arbitrary family
  • β – vector of parameters of the function f
  • ε – residials of the model, which are usually generated by an unbiased, uncorrelated random process
When learning the model from fixed sample values of y, we minimize some function of the residuals of the model Q(y, β). As a result, we find a vector with optimal estimates of the model parameters β̂.
 
Varying the form of the functions f and Q, it is possible to obtain different models from which the most effective ones are selected (i.e. models that gives unbiased, accurate and reliable y-response predictions).
 

2. Data preparation

It should be remembered that any statistical method will be good enough, as far as good the input data ("garbage in - garbage out!"). Miracles does not happen without understanding the process of simulated flow and the effort of preparing a training sample filtering, transformations, handeling missed values, creating derived variables, etc.

2.1. Simulation of initial data

In practice, it is necessary to collect and analyze any available information before building a model. For example, the type of email letter, user activity on the site, transactions, gender, age, marital status, individual preferences, etc.
 
For simplicity of exposition, we will consider the general case, and simulate a simple data set that includes:
  • ContactID - unique user id
  • Date - email delivery time
  • Response - conversion (0 - no, 1 - yes)
Additionally we will take into account that the modeled data should be similar to the behavior of real users. In our data there will be users who do not read emails at all and those who read rarely and read almost everything. We will also consider that some of the users read the letters with a certain periodicity, and some - from time to time.
 
First we load the necessary packages, and set the functions:
library(data.table)
library(dplyr)
library(lubridate)

repeat_last <- function(x, forward = TRUE, maxgap = Inf, na.rm = FALSE) {
        if (!forward) x = rev(x)
        ind = which(!is.na(x))
        if (is.na(x[1]) && !na.rm)
                ind = c(1,ind)
        rep_times = diff(
                c(ind, length(x) + 1))
        if (maxgap < Inf) {
                exceed = rep_times - 1 > maxgap
                if (any(exceed)) {
                        ind = sort(c(ind[exceed] + 1, ind))
                        rep_times = diff(c(ind, length(x) + 1))
                }
        }
        x = rep(x[ind], times = rep_times)
        if (!forward) x = rev(x)
        x
}

shiftUp <- function(x, n){
        as.integer(c(rep(0, n), x[-((length(x)-n+1) : length(x))]))
}

Simulate the data:
n <- 1000000
initialDate <- ymd_hms("2015-01-01 00:00:00")

set.seed(9)
Activ <- data.table(ContactID = base::sample(1:5000, replace = T),
                Date = initialDate + 
                       as.difftime(runif(n, min = 0, max = 5*365/2*24*60),
                                        units = "mins"))
set.seed(9)
Activ[, Response := rnorm(.N, mean = sample(1:200, 1), 
                          sd = sample(1:50, 1)), by = ContactID]

q <- quantile(Activ$Response, .85)
Activ[Response < q, Response := 0][Response > q, Response := 1] 

Here is the data that we have generated:

  ContactID Date Response
1: 45 2017-05-30 22:22:49 0
2: 5 2017-02-14 18:35:40 0
3: 42 2017-03-10 17:31:10 0
4: 44 2017-06-04 14:49:50 0
5: 89 2017-05-17 03:59:05 0
6: 27 2017-04-18 08:30:45 0
7: 79 2017-01-24 07:39:52 0
8: 74 2017-03-14 01:20:03 0
9: 134 2017-05-15 00:25:07 1
10: 199 2017-05-01 16:14:43 0
... ... ... ...
100000: 151 2017-01-07 04:08:03 0
Exclude information about users who are completely inactive. In practice, such users must be reactivated in personal way.
Activ <- Activ[order(ContactID, Date), ]
ActivUsers <- Activ %>% group_by(ContactID) %>%
        summarise(IsActive = ifelse(sum(Response) == 0, 
                                    yes = 0, no = 1)) %>%
        filter(IsActive == 1)

setDT(ActivUsers)
Activ <- merge(Activ, ActivUsers, by = "ContactID", all.x = F, all.y = F)
Activ[, IsActive := NULL]

2.2. Aggregation of information for each user (formation of training set)

It is important to understand that before we know the user's reaction to the next letter, we know only the previous history of his behavior. The statement is obvious, but you can not pay attention to it when forming variables for training.
 
Form the set of variables X. We calculate for each user, at the time of receiving the next letter in history, the following features:
  • number of received/read letters
  • number of received/read letters on weekdays
  • number of received/read letters on weekends
Activ[, Day := wday(Date, label = T)]
Activ[, DayType := ifelse(Day %in% c("Sat", "Sun"), 
                          yes = "Weekend", no = "Weekday")]
Activ[, Day := NULL]

Activ[, Delivered := .N, by = list(ContactID, Date)]
Activ[, Delivered := cumsum(Delivered), by = list(ContactID)]
Activ[Response == 1, Opened := .N, by = list(ContactID, Response, Date)]
Activ[is.na(Opened), Opened := 0]
Activ[, Opened := cumsum(Opened), by = list(ContactID)]
Activ[, Opened := shiftUp(Opened, 1), by = list(ContactID)]
Activ[, Delivered := shiftUp(Delivered, 1), by = list(ContactID)]

Activ[, DeliveredByDayType := .N, 
      by = list(ContactID, DayType, Date)]
Activ[, DeliveredByDayType := cumsum(DeliveredByDayType), 
      by = list(ContactID, DayType)]

Activ[Response == 1, OpenedByDayType := .N, 
      by = list(ContactID, Response, Date)]
Activ[is.na(OpenedByDayType), OpenedByDayType := 0]
Activ[, OpenedByDayType := cumsum(OpenedByDayType), 
      by = list(ContactID, DayType)]
Activ[, OpenedByDayType := shiftUp(OpenedByDayType, 1), 
      by = list(ContactID, DayType)]
Activ[, DeliveredByDayType := shiftUp(DeliveredByDayType, 1), 
      by = list(ContactID)]

Activ[, OpenRate := Opened / Delivered]
Activ[, OpenRateByDayType := OpenedByDayType / DeliveredByDayType]
  • user's open rate
  • user's open rate on weekdays/weekends
Activ[, OpenRate := Opened / Delivered]
Activ[, OpenRateByDayType := OpenedByDayType / DeliveredByDayType]
  • user's lifetime in the system
Activ <- Activ[order(ContactID, Date), ]
Activ[, CreatedDate := min(Date), by = ContactID]
Activ[, LifeTime := difftime(Date, CreatedDate, units = "weeks"), 
      by = ContactID]
Activ[, LifeTime := as.numeric(LifeTime)]
  • date of the last email received
  • date of the last reading of the email
Activ[, LastDelivered := data.table::shift(Date, type = "lag"), 
      by = ContactID]
Activ[Response == 1, LastOpened := data.table::shift(Date, type = "lag"),
      by = list(ContactID, Response)]
Activ[, temp := LastOpened]
Activ[, temp := repeat_last(temp), by = list(ContactID)]
Activ <- Activ[order(desc(ContactID), desc(Date)), ]
Activ[!is.na(temp), LastOpened := repeat_last(LastOpened), by = list(ContactID)]
Activ <- Activ[order(ContactID, Date), ]
Activ[, temp := NULL]
Activ <- Activ[!is.na(LastOpened), ]
  • the time elapsed since the last email received
  • the time elapsed since the last reading of the email
Activ[, RecentlyDelivery := difftime(Date, LastDelivered, 
                                     units = "days"), 
      by = ContactID]
Activ[, RecentlyOpened := difftime(Date, LastOpened, 
                                   units = "days"), 
      by = ContactID]
Activ[, RecentlyDelivery := as.numeric(RecentlyDelivery)]
Activ[, RecentlyOpened := as.numeric(RecentlyOpened)]
  • normalized features:
    • average number of emails received per week
    • average number of emails opened per week
    • the time after the last received email regarding the lifetime of the user in the system
    • the time after the last opened email regarding the lifetime of the user in the system
Activ[, DeliveryNormed := Delivered / LifeTime]
Activ[, OpenedNormed := Opened / LifeTime]
Activ[, RecentlyDeliveryNormed := RecentlyDelivery / LifeTime]
Activ[, RecentlyOpenedNormed := RecentlyOpened / LifeTime]
  • binary activity indicators for selected periods (0 - no, 1 - yes):
    • active for the last 5 days
    • active for the last 10 days
    • active for the last 15 days
Activ[RecentlyOpened <= 5, ActiveLast5 := 1]
Activ[is.na(ActiveLast5), ActiveLast5 := 0]
Activ[RecentlyOpened <= 10, ActiveLast10 := 1]
Activ[is.na(ActiveLast10), ActiveLast10 := 0]
Activ[RecentlyOpened <= 15, ActiveLast15 := 1]
Activ[is.na(ActiveLast15), ActiveLast15 := 0]
Activ[, ActiveLast5  := as.factor(ActiveLast5)]
Activ[, ActiveLast10 := as.factor(ActiveLast10)]
Activ[, ActiveLast15 := as.factor(ActiveLast15)]
Response variable Y takes two values (already specified in the original simulated data):
  • 0, if the user hasn't read the letter;
  • 1, if the user read the letter.
Removing the auxiliary features:
Activ[, CreatedDate := NULL]
Activ[, LastDelivered := NULL]
Activ[, LastOpened := NULL]
Activ[, Delivered := NULL]
Activ[, Opened := NULL]
Activ[, Date := NULL]
Activ[, DeliveredByDayType := NULL]
Activ[, OpenedByDayType := NULL]
Activ[, ContactID := NULL]
Activ[, RecentlyDelivery := NULL]
Activ[, RecentlyOpened := NULL]
Excellent, the data set for training the model are ready. Now we construct a model for predicting the value of the response variable y for a given values of the parameters xi  ∈ X, i = {1,..,n}. In other words, the model will answer this question: will the user read the next email or not? If so then we send the letter.
 
So, our problem reduces to the problem of binary classification. We classify the variable y with only two values 0 or 1 (where 0 - no, he will not read the email; 1 - yes, he will read the email).
 
By the way, we have generated features set X only for an example. Perhaps some of them are statistically insignificant. In one of the following articles, we will consider the topic of analysis of the significance of variables.
 

3. Selection of the machine learning algorithm and training of the model

One of the key points that a researcher encounters in the development of a statistical model is the choice of the optimal algorithm for extracting regularities.
 
Without going into the description of the existing machine learning algorithms, we can say that they are quite complex. Usually complex algorithms are counterintuitive and difficult to interpret. But, different studies show that simple methods (by the example of classifiers) often outperform more sophisticated algorithms in solving practical problems.
 
For our task we take the logistic regression function as a mathematical function f, and the logistic maximum likelihood function as a function of the remainders Q (intuitively, the function Q helps to find such values of the parameters β̂ for which the predicted response probabilities  are closest to the actual values of the response Y).
 
The iterative interaction of the functions f and Q will pick up the optimal parameters of the function f to describe the behavior of the responses of Y through the variables X.

3.1. Division of data into training and test samples

When building a model, it is necessary to check its accuracy. Therefore, we divide our data into two parts: training (80%) and test (20%).
set.seed(9)
trainIndexes <- sample(1:nrow(Activ), round(nrow(Activ) * .8))
Activ_train <- Activ[trainIndexes,  ]
Activ_test  <- Activ[-trainIndexes, ]

3.2. Training the model on the training sample

We'll train the model using the function glm from basic package stats.
formula <- Response ~ .
fit <- glm(formula, family = binomial(link='logit'), data = Activ_train)

3.3. Response prediction for a test sample

When evaluating the parameter y, the model calculates the probability of reading the email, rather than a specific value of 0 or 1. That is, we get the value of  from the continuous interval [0, 1].
 
The question arises: how to choose the probability threshold at which we assign the user to the group 0 or 1? Now take the threshold parameter equal to 0.09.
 
Proceed as follows:
  • if ŷ ≤ threshold, then Response = 0,
  • if ŷ > threshold, then Response = 1;
glmpred <- predict(fit, newdata = Activ_test,  type = 'response')
threshold <- 0.09
glmpredRound <- ifelse(glmpred > trashHold, yes = 1, no = 0)

How to select the threshold parameter, we'll cover in the next articles.

4. Analysis of model accuracy

Let's compare the results of the model forecast with the actual data.
(testResult <- t(table(Actual = Activ_test$Response, Predicted = glmpredRound)))
We get the table of conjugacy of the actual and predicted response values on the test sample.
              Actual
              0     1
Predicted   0 13642   630
            1 34967 29054
We send email to those users for whom we have Predicted = 1, and do not send email to those users who have Predicted = 0.
 

5. Analysis of results

  1. We have reduced the number of emails sent by 14272 (18,24%)
    • 13642 + 630 = 14272
    • (13642+630) / (13642 + 630 +34967 + 29004) * 100 = 18,24%
  2. The accuracy of the response prediction is 97,88%, that is, we lost only 2.12% of the email opens
    • (1 - 630 / (630 + 29004)) * 100 = 97,88%

Is it significant?

Obviously, it is reasonable to sacrifice this number of the email opens and not send an extra 18.24% of letters that will not be read. This will not bother the users without the need to prevent burnout of contact base.

In addition, if a user receives less emails, he is more interested in reading them, and as a consequence it is also more likely that he will make a targeted action on the next email.

Note that this result is obtained using a minimum of information. When adding additional information to the model, which we mentioned above, the efficiency of the model increases several times.

In future articles we will consider 

  1. The choice of the threshold parameter and its effect on the false positive and false negative error
  2. A simple way to determine the best time to send email to a user

 

Add new comment

Forecasting user activity using Machine Learning and R language