How long users spend on Moodle (or e-learning more generally) is another common question worth some initial exploration as part of my broader goal of building an engagement metric. This article discusses an approach to defining and obtaining insights from the idea of a session length for learning. It is mostly a data wrangling exercise that approximates duration from event logs, and it will tell us that while all events are born equal, some are more equal than others. The algorithm should prove useful when I progress to course breakdowns: identifying particularly dedicated or struggling students who are investing large amounts of time online, or at-risk students who aren't spending enough. These are questions I will return to in a future post as part of the project.

## Learning Duration

This works on the same data as last week's look at some basic distribution analysis, which also contains the extraction SQL.

### Step 1. Data Wrangling

The goal here is to calculate duration based on the difference between events, and the challenge is determining when a session starts and ends. Notably, there are no consistent, clearly recorded logged-in and logged-out events in the data.

The principle I have used is to sort the events by user and time, then compare each row with the previous one and determine if a new session should be started. These are the rules I arrived at empirically for when to start counting a new session:

- If the event is a login event (`\core\event\user_loggedin`), as this is a new login;
- If the event is earlier than the previous one, as this implies the user has changed given the sort order;
- If the event is a course view and the duration is over 5 minutes, as this suggests the user has left the browser without logging out. This threshold was chosen because most course views last less than 1 minute and 99% of course view durations were under 5 minutes, with several outliers that created unusually long sessions;
- If the previous event was over 60 minutes earlier, which is based on the session timeout value.
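The rules above can be sketched as a single predicate. This is a simplified, self-contained illustration in base R; the function name and arguments are mine for illustration, not columns from the actual log:

```r
# Simplified sketch of the session-break rules (illustrative only).
new_session <- function(is_login, diff_mins, is_course_view) {
  is.na(diff_mins)                  ||  # first event in the sorted log
  is_login                          ||  # explicit \core\event\user_loggedin
  diff_mins < 0                     ||  # time went backwards => user changed
  (is_course_view && diff_mins > 5) ||  # stale course view, browser left open
  diff_mins > 60                        # past the 60-minute session timeout
}

new_session(FALSE, 3, FALSE)   # FALSE: within the current session
new_session(FALSE, 90, FALSE)  # TRUE: past the session timeout
```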

#### Setting up the data

This is the same as last week's data, where the setup is explained in more detail.

```r
library(ggplot2)
library(scales)
library(dplyr)
library(tidyr)
library(magrittr)
library(RColorBrewer)
library(ggthemes)
library(zoo)

setwd("./Documents/")
mdl_log <- read.csv('mdl_logstore_standard_log.csv')

### Setup additional date variables
mdl_log$time  <- as.POSIXlt(mdl_log$timecreated, tz = "Australia/Sydney", origin = "1970-01-01")
mdl_log$day   <- mdl_log$time$mday
mdl_log$month <- mdl_log$time$mon + 1     # mon is zero-indexed
mdl_log$year  <- mdl_log$time$year + 1900 # year is years since 1900
mdl_log$hour  <- mdl_log$time$hour
mdl_log$date  <- as.Date(mdl_log$time)
mdl_log$week  <- format(mdl_log$date, '%Y-%U')
mdl_log$dts   <- as.POSIXct(mdl_log$date)
mdl_log$dts_str  <- interaction(mdl_log$day, mdl_log$month, mdl_log$year, mdl_log$hour, sep = '_')
mdl_log$dts_hour <- strptime(mdl_log$dts_str, "%d_%m_%Y_%H")
mdl_log$dts_hour <- as.POSIXct(mdl_log$dts_hour)
```

#### Manipulating the data

As mentioned already, the principle I am going to use is to break the data into sessions and calculate the difference between the start and end times.

So the first thing I need is a counter function to keep track of the session number, which I modified from this post. By default it returns the next increment, or the current value if the increment argument (`inc`) is set to `FALSE`.

```r
# Source: adapted from Hadley Wickham
init.counter <- function() {
  x <- 0
  function(inc = TRUE) {
    if (inc) {
      x <<- x + 1
    }
    x
  }
}
```
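To make the closure behaviour concrete, here is a quick usage sketch (the definition is repeated so the snippet runs on its own):

```r
# Repeated definition so this example is self-contained
init.counter <- function() {
  x <- 0
  function(inc = TRUE) {
    if (inc) x <<- x + 1
    x
  }
}

counter_demo <- init.counter()
counter_demo()            # returns 1
counter_demo()            # returns 2
counter_demo(inc = FALSE) # returns 2 (no increment)
```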

Next initialise the counter instance and create the dplyr data frame.

```r
counter1 <- init.counter()
d <- tbl_df(mdl_log)
```

For the first wrangle there are a few additional variables to add to make the calculation smooth. This is likely to be the area to focus on when refining the rules if you have unusual outliers.

- Fix the time format to POSIXct
- Sort the data by user and time so that it is sequenced by user sessions
- Calculate the difference (diff) between the current row and the previous row in minutes
- Calculate the duration (dur) of the event as the difference between the current row and the next row in minutes
- Set a binary login flag if the event is a loggedin event
- Set a binary course dwell flag (cdwell) if the user remains on a course view event for more than 5 minutes

```r
d %<>%
  mutate(time = as.POSIXct(time)) %>%
  arrange(userid, time) %>%
  mutate(diff = difftime(time, lag(time), "Australia/Sydney", "mins")) %>%
  mutate(dur = -1 * difftime(time, lead(time), "Australia/Sydney", "mins")) %>%
  mutate(login = ifelse(action == "loggedin", 1, 0)) %>%
  mutate(cdwell = ifelse(eventname == '\\core\\event\\course_viewed' & diff > 5, 1, 0))
```
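To make the lag and lead differences concrete, here is a minimal base-R illustration on a toy timestamp vector (synthetic data, not the Moodle log). The negative difference on the last row is what the user-change rule picks up:

```r
t <- as.POSIXct(c("2015-03-01 09:00:00", "2015-03-01 09:03:00",
                  "2015-03-01 09:10:00", "2015-03-01 08:00:00"),
                tz = "Australia/Sydney")

# diff: minutes since the previous event (NA for the first row),
# equivalent to difftime(time, lag(time)) in the dplyr pipeline
diff_mins <- c(NA, as.numeric(difftime(t[-1], t[-length(t)], units = "mins")))
diff_mins  # NA 3 7 -70 -- the -70 flags a change of user in the sorted log
```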

The second wrangle uses the `rowwise()` method to apply a function to each row in turn, namely to calculate the session number (sessnum). Each call to counter1() uses the function above to increment the session (i.e. start tracking a new one), while counter1(inc=FALSE) returns the current session. There are 6 nested rules applied at this stage:

- If the diff is NA then set the session to 0 (this deals with the first row and sets the count sequence)
- If the login flag is set (i.e. this is the loggedin event) then increment the session
- If the difference is negative (i.e. the current event happened before the previous) then increment the session (this case deals with the user sort as this indicates a change in user and so a new session)
- If this is flagged as an extended course dwell event then start a new session (assume the user left the browser)
- If the difference (diff) from the last event is greater than 1 hour (60 mins) then start a new session
- Otherwise this is within the current tracked session

```r
d %<>%
  rowwise() %>%
  mutate(sessnum = ifelse(is.na(diff), 0,
                   ifelse(login, counter1(),
                   ifelse(diff < 0, counter1(),
                   ifelse(cdwell, counter1(),
                   ifelse(diff > 60, counter1(),
                          counter1(inc = FALSE)))))))
```

Finally, group the data by user and session and use this to calculate the start and end times, the number of events in the session, and the duration in minutes (end minus start).

```r
d_session <- group_by(d, userid, sessnum) %>%
  summarise(start = min(time),
            end = max(time),
            date = min(date),
            events = n(),
            duration = difftime(max(time), min(time), "Australia/Sydney", "mins")) %>%
  arrange(start) %>%
  mutate(sesslength = ceiling(as.numeric(duration))) %>%
  mutate(weeknum = format(date, "%Y-%U"))
```

### Step 2. Data Visualisation

**Daily duration spread**

This presents the overall spread of durations per day and is useful for understanding general use of the site, and where you have peaks, upper bounds or outliers. I used it to identify outliers and further refine the rules above, removing false positives in session sequences. The size of each circle indicates the number of events in the session, which I will discuss further in the last visualisation.

```r
ggplot(d_session, aes(x = start, y = sesslength)) +
  geom_point(aes(size = events), alpha = 0.75, color = "#60B3CE", position = position_jitter()) +
  geom_smooth() +
  scale_x_datetime(breaks = date_breaks("1 week"),
                   minor_breaks = date_breaks("1 day"),
                   labels = date_format("%d-%b-%y")) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  labs(x = "Day", y = "Session length (mins)")
```

**Distribution**

The next visual looks at the daily distributions within each week (not the total hours per week) to show averages for individual sessions. The indication is that while there are many sessions longer than half an hour, these represent less than 25% of all sessions. Learning activities or sequences intended to last more than 30 minutes are likely to be incomplete in a single session, which may influence learning design choices. In the second version of the chart I have zoomed in to sessions of one hour or less to give a better idea of the distribution; this shows that 15 minutes is the average session time, which calls for quite bite-sized learning.

```r
ggplot(d_session, aes(x = weeknum, y = sesslength)) +
  geom_boxplot(fill = "#60b3ce") +
  geom_smooth(aes(group = 1), color = "#FF8300", linetype = 2) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  scale_y_continuous(breaks = seq(0, 600, 60), minor_breaks = seq(0, 600, 15)) +
  labs(x = "Week", y = "Session length (mins)")
```

```r
ggplot(d_session, aes(x = weeknum, y = sesslength)) +
  geom_boxplot(fill = "#60b3ce") +
  geom_smooth(aes(group = 1), color = "#FF8300", linetype = 2) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  scale_y_continuous(breaks = seq(0, 600, 15), minor_breaks = seq(0, 600, 15)) +
  labs(x = "Week", y = "Session length (mins)") +
  coord_cartesian(ylim = c(0, 60))
```
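Claims like "under 25% of sessions exceed 30 minutes" can also be sanity-checked numerically rather than read off the boxplots. A toy illustration with made-up session lengths (not the real data):

```r
# Synthetic session lengths in minutes -- illustrative only
sesslength <- c(2, 5, 8, 12, 15, 18, 22, 35, 50, 70)

mean(sesslength > 30)               # proportion of sessions over 30 mins (0.3 here)
quantile(sesslength, c(0.5, 0.75))  # median and upper quartile
```

On the real data the same expressions would run against `d_session$sesslength`.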

**Correlation between events and duration**

One might expect that more events in a session indicates a longer session, which is to some extent true. There is a correlation, but the visualisation shows that it is not consistent. Having been guilty of historically using the total number of events as a proxy for time online, this needs greater justification in future analyses, as not all events are equal.
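The visual impression can be backed with a number; something like `cor(d_session$events, d_session$sesslength)` would quantify the relationship on the real data. A self-contained toy version (made-up values, purely illustrative):

```r
# Made-up event counts and session lengths -- illustrative only
events  <- c(3, 10, 25, 40, 8, 60)
minutes <- c(2, 12, 15, 55, 30, 35)

cor(events, minutes)                      # Pearson correlation
cor(events, minutes, method = "spearman") # rank-based, less outlier-sensitive
```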

```r
ggplot(d_session, aes(x = events, y = sesslength)) +
  geom_point(alpha = 0.75, position = position_jitter(h = 0), color = "orange") +
  labs(x = "Number of event logs", y = "Session length (mins)") +
  coord_cartesian(xlim = c(0, 100), ylim = c(0, 400))
```

Hopefully others will find this useful in exploring this question. I want to use and refine the algorithm as a factor for engagement later on. Next up I will be exploring forums in more detail.

The data.table wiki at https://github.com/Rdatatable/data.table/wiki may also be of interest.