At some point I always come back to the question of how we understand use of the VLE/LMS, which I’ve theorised about a lot. As part of an interest in learning about Data Science I’ve signed up to Sliderule and am being mentored through a capstone project with some Moodle data. The main goal is for me to learn R, which I’d never touched until 2 weeks ago, but hopefully the data can tell me something about Moodle at the same time. Feedback or advice on techniques is welcome.
Exploratory Data Analysis on mdl_logstore_standard
For this part I am going to focus on producing some simple two-dimensional analysis. This assumes you have MySQL access to your Moodle database and RStudio.
Step 1. Data Extraction
I started with a full data extraction of all events in the system to a CSV file (mdl_logstore_standard_log.csv).
SELECT c.fullname AS courseName,
       FROM_UNIXTIME(l.timecreated) AS DateTime,
       l.*
FROM mdl_logstore_standard_log l
LEFT JOIN mdl_course c ON l.courseid = c.id
WHERE l.origin = 'web';
Step 2. Data Wrangling
In order to do time series analysis the data needs some reformatting.
Load the necessary R packages (run install.packages() first for any that are missing)
library(ggplot2)
library(scales)
library(dplyr)
library(tidyr)
library(magrittr)
library(ggthemes)
Set your working folder to the directory with your CSV
Import the CSV file
mdl_log <- read.csv('mdl_logstore_standard_log.csv')
Create a POSIXlt time field and break down into day, month, year, and hour components
mdl_log$time <- as.POSIXlt(mdl_log$timecreated, tz = "Australia/Sydney", origin = "1970-01-01")
mdl_log$day <- mdl_log$time$mday
mdl_log$month <- mdl_log$time$mon + 1      # mon is zero-indexed (0 = January)
mdl_log$year <- mdl_log$time$year + 1900   # year is stored as years since 1900
mdl_log$hour <- mdl_log$time$hour
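To see why the month and year need adjusting, here is a minimal sketch on a single timestamp. The value 1420070400 (midnight UTC on 1 January 2015) is an illustrative assumption, not a row from the Moodle log.

```r
# One illustrative Unix timestamp: 2015-01-01 00:00:00 UTC (assumed example value).
ts <- 1420070400
lt <- as.POSIXlt(ts, tz = "Australia/Sydney", origin = "1970-01-01")

day   <- lt$mday         # day of month, 1-31
month <- lt$mon + 1      # mon counts from 0 (January), so add 1
year  <- lt$year + 1900  # year counts from 1900
hour  <- lt$hour         # local hour in Sydney, which is UTC+11 in January

print(c(day = day, month = month, year = year, hour = hour))
```

Note that the hour is 11 rather than 0: the components are reported in the Sydney time zone, not UTC.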
Create a date format field and break down into week component
mdl_log$date <- as.Date(mdl_log$DateTime)
mdl_log$week <- format(mdl_log$date, '%Y-%U')
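The '%Y-%U' format numbers weeks from the first Sunday of the year, so the days before that Sunday fall into week 00. A small sketch with hand-picked dates (assumed for illustration) shows the effect:

```r
# Hand-picked dates around the start of 2015 (1 January 2015 was a Thursday).
dates <- as.Date(c("2015-01-01", "2015-01-03", "2015-01-04", "2015-01-11"))

# %U = week of year (00-53) with Sunday as the first day of the week.
weeks <- format(dates, "%Y-%U")

print(data.frame(date = dates, week = weeks))
```

This is worth keeping in mind when reading weekly plots: the first and last weeks of a year are usually partial weeks.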
Create a timestamp version for the day for daily time series
mdl_log$dts <- as.POSIXct(mdl_log$date)
Create a timestamp version of hour for hourly time series
mdl_log$dts_str <- interaction(mdl_log$day, mdl_log$month, mdl_log$year, mdl_log$hour, sep = '_')
mdl_log$dts_hour <- strptime(mdl_log$dts_str, "%d_%m_%Y_%H")
mdl_log$dts_hour <- as.POSIXct(mdl_log$dts_hour)
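As a possible alternative to building and re-parsing a string, base R's trunc() can round a timestamp down to the hour directly. This is only a sketch of a different technique, using made-up timestamps; one difference is that it keeps the Sydney time zone, whereas strptime() without a tz argument parses in the session's time zone.

```r
# Made-up timestamps: 00:00, 00:30 and 01:00 UTC on 2015-01-01.
ts <- c(1420070400, 1420072200, 1420074000)
time <- as.POSIXct(ts, tz = "Australia/Sydney", origin = "1970-01-01")

# trunc() drops minutes and seconds; convert back to POSIXct for use in data frames.
dts_hour <- as.POSIXct(trunc(time, units = "hours"))

hour_labels <- format(dts_hour, "%Y-%m-%d %H:%M")
print(hour_labels)
```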
Filter to participation education level events for 2015
mdl_2015 <- subset(mdl_log, year == 2015)
participation_2015 <- subset(mdl_2015, edulevel == 2)   # edulevel 2 = participating
Create the dplyr data table
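mdl_log$time <- as.POSIXct(mdl_log$time)   # dplyr cannot work with POSIXlt columns
d <- as_tibble(mdl_log)                    # tbl_df(mdl_log) in older versions of dplyr
# d is built from the full log; use participation_2015 instead to keep only the filtered events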
Step 3. Data Visualisation
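The plotting code in this step references a cbPalette vector that is never defined in this post. A minimal sketch, assuming the eight-colour colour-blind-friendly palette popularised by the Cookbook for R is what was intended (any character vector of colours would work):

```r
# Assumed definition: the post does not show the palette it actually used.
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73",
               "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
```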
Daily activity data
Create the daily aggregation
daily <- group_by(d, userid, dts) %>% summarise(Total = n())
Create a day of the week factor
daily$dow = as.factor(format(daily$dts, format="%a"))
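To make the shape of the daily table concrete, here is a small sketch on made-up events (the user ids and dates are assumptions, not real log rows). Each output row is one user on one day, with Total counting that user's events.

```r
library(dplyr)

# Made-up events: one row per logged event, with the userid and dts columns used above.
events <- data.frame(
  userid = c(1, 1, 1, 2, 2, 3),
  dts = as.POSIXct(c("2015-03-02", "2015-03-02", "2015-03-03",
                     "2015-03-02", "2015-03-07", "2015-03-07"), tz = "UTC")
)

# Count events per user per day.
daily <- events %>%
  group_by(userid, dts) %>%
  summarise(Total = n(), .groups = "drop")

# Format in UTC explicitly so the weekday does not depend on the session time zone.
daily$dow <- factor(format(daily$dts, format = "%a", tz = "UTC"))

print(daily)
```

User 1 has two events on 2 March, so that row has a Total of 2; every other user-day has a Total of 1.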
Plot the day of the week breakdown
I’m using a really small sample here, which produces similar results across all days; most real populations show more variance between days.
ggplot(daily, aes(dow, Total)) +
  geom_boxplot(aes(fill = dow)) +
  scale_x_discrete(limits = c('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun')) +
  theme_few() +
  xlab('Day of week') +
  ylab('User activity frequency') +
  guides(fill = "none")
Create the hourly data aggregation
hourly <- group_by(d, dts_hour) %>% summarise(Total = n())
Create the hour of day factors
hourly$dow <- as.factor(format(hourly$dts_hour, format = "%a"))   # hourly has no dts column, only dts_hour
hourly$hr <- format(hourly$dts_hour, format = "%H")
hourly$weekend <- ifelse(hourly$dow %in% c('Sat', 'Sun'), 'weekend', 'weekday')
Plot the hour of the day breakdown
This gives an idea of the spread of activity throughout the day, and suggests that people sleep in on weekends.
# cbPalette must be a vector of colours defined before this plot is run
ggplot(hourly, aes(hr, Total)) +
  geom_boxplot(aes(fill = weekend)) +
  geom_smooth(aes(group = weekend)) +
  scale_fill_manual(values = cbPalette) +
  xlab('Hour of day') +
  ylab('Daily activity frequency')
Distinct user data
Create the distinct user aggregation
udaily <- group_by(d, dts) %>% summarise(users = n_distinct(userid))
Plot the daily distinct users
This shows how many distinct users were active each day (a proxy for logins), and can also be analysed as a percentage of your total user base.
ggplot(udaily, aes(dts, users)) +
  geom_bar(stat = "identity", fill = "#60B3CE") +
  scale_x_datetime(breaks = date_breaks("1 week"),
                   minor_breaks = date_breaks("1 day"),
                   labels = date_format("%d-%b-%y")) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(x = "Day", y = "Distinct number of users")
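Expressing daily users as a percentage of the user base only needs one extra column. The sketch below uses made-up counts, and total_users is an assumed figure that you would replace with your own count of accounts (for example from the mdl_user table).

```r
# Made-up daily distinct-user counts, shaped like the udaily table.
udaily <- data.frame(
  dts = as.POSIXct(c("2015-03-02", "2015-03-03", "2015-03-04"), tz = "UTC"),
  users = c(120, 96, 30)
)

# Assumed figure: total number of user accounts on the site.
total_users <- 400

# Share of the user base active on each day, as a percentage.
udaily$pct_active <- 100 * udaily$users / total_users

print(udaily)
```

Plotting pct_active instead of users makes sites of different sizes easier to compare.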
Create the user activity aggregation
utotal <- group_by(d, userid) %>% summarise(Total = n())
Plot the activity distribution
This provides a sorted view of total activity per user, showing how activity is distributed: a small number of users create a large number of events.
ggplot(utotal, aes(reorder(userid, Total), Total)) +
  geom_point(alpha = 0.5, color = "#FF8300") +
  scale_x_discrete(breaks = NULL) +
  xlab('User') +
  ylab('Total activity')
Plot the frequency distribution of activity per user
This is a histogram of the same per-user totals. It confirms the skew in the data: the majority are low-activity users, and a minority create a large number of events. This may be concerning in an educational context and warrants further analysis (in a future post).
ggplot(utotal, aes(Total)) + geom_histogram(binwidth=10, fill="#60B3CE") + xlab('Total activity') + ylab('Frequency of users')
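The skew can also be put into numbers. A rough sketch, using made-up per-user totals rather than real data: compare the mean with the median, and work out what share of all events comes from the most active 10% of users.

```r
# Made-up per-user event totals, shaped like the Total column of utotal.
totals <- c(2, 3, 3, 5, 8, 10, 12, 15, 40, 402)

# A mean far above the median is a quick sign of right skew.
mean_total <- mean(totals)
median_total <- median(totals)

# Share of all events generated by the most active 10% of users.
sorted <- sort(totals, decreasing = TRUE)
top_n <- max(1, floor(length(sorted) / 10))
top_share <- sum(sorted[seq_len(top_n)]) / sum(sorted)

print(c(mean = mean_total, median = median_total, top_share = top_share))
```

In this toy sample a single user accounts for about 80% of the events, which is the kind of pattern the histogram hints at.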
Create the component aggregation
component <- group_by(d, userid, component) %>% summarise(Total = n())
Plot the module use
This gives an idea of the spread of tools being used within course or learning design. Quality is more important than quantity here but this might be a useful springboard into further analysis.
ggplot(subset(component, component != 'core'), aes(x = component, y = Total)) +
  geom_bar(stat = "identity", fill = "#FF8300") +
  coord_polar(theta = "x") +
  labs(x = "Component", y = "Total number of events")
I’ve used a polar version of the bar chart, but you can also get a bar chart view of this with the following:
ggplot(component, aes(x = component, Total)) +
  geom_bar(stat = "identity", fill = "#60B3CE") +
  coord_flip() +
  labs(x = "Component", y = "Total number of events") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
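Because the component table has one row per user and component, each bar in these charts is a stack of per-user totals. If you want a plain ranked table instead, one option is to sum across users first. This sketch uses made-up rows and base R's aggregate(), which is a different tool from the dplyr pipeline used elsewhere in this post.

```r
# Made-up per-user, per-component totals, shaped like the component table.
component <- data.frame(
  userid = c(1, 1, 2, 2, 3),
  component = c("mod_forum", "mod_quiz", "mod_forum", "core", "mod_quiz"),
  Total = c(10, 4, 6, 50, 3)
)

# Drop core events, then sum across users to get one row per component.
by_component <- aggregate(Total ~ component,
                          data = subset(component, component != "core"),
                          FUN = sum)

# Order from most to least used.
by_component <- by_component[order(-by_component$Total), ]

print(by_component)
```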