Assignment engagement timeline – starting with basics @salvetore #mootau15 #moodle #learninganalytics

Having joined the assessment analytics working group for Moodle Moot AU this year, I thought I’d have a play around with the feedback event data and its relation to future assignments. The simplified assumption to explore is that learners who view their feedback are enabled to perform better in subsequent assignments, which may be a simplification of potentially more complex ‘distance travelled’ style analytics. To get started exploring the data I have produced a simple timeline that shows the frequency of assignment views within a course based on the identified status of the submission:

  1. Pre-submission includes activities when the learner is preparing a submission
  2. Submitted includes views after submission but before receiving feedback (possibly anxious about results)
  3. Graded includes feedback views once the assignment is graded
  4. Resubmission includes activities that involve the learner resubmitting work if allowed

The process I undertook was to sort the log data into user sequences and use a function to set the status based on preceding events. For example, once the grade is released, subsequent views are counted as ‘graded’. This gives an idea of the spread and frequency of assignment engagement.

Timeline

The timeline uses days on the x-axis and users on the y-axis. Each point plotted represents when events were logged for each learner – coloured by status and sized according to the frequency on that day. There are a few noticeable vertical blue lines which correspond to feedback release dates (i.e. many learners view feedback immediately on its release), and you start to get an idea that some learners view feedback much more than others. The pattern of yellow points reveals learners who begin preparing for their assignment early, contrasted with those who cram a lot of activity closer to deadlines. I have zoomed into a subset of the learners below to help show this.

Timeline-zoomed

Having put this together quickly, I am hoping I will have some time to refine the visualisation to better identify some of the relationships between assignments. I could also bring in some data from the assignment tables to enrich this, having limited myself to event data in the logs thus far. Some vertical bars showing deadlines, for example, might be helpful, or timelines for individual users with assignments on the y-axis to see how often users return to previous feedback across assignments, as shown below. Here you can see the very distinct line of a feedback release; for formative assessment it may have been better learning design to release feedback more regularly and closer to the submission.

timeline-learner

How-to guide

The following shares the code used to produce the above visualisations and should work with recent Moodle versions.

Step 1: Data Extraction

This uses the logstore data from the Moodle database query in the initial log analysis. You can also do this with a CSV file of the logs downloaded from the Moodle interface, although some columns may be named or formatted differently and need tidying up.

Step 2: Data Wrangling

Load the libraries and files, and set up the time series on events, as discussed in detail earlier.

library(ggplot2)
require(scales)
library(dplyr)
library(tidyr)
library(magrittr)
library(RColorBrewer)
library(GGally)
library(zoo)
library(igraph)
library(devtools)
require(indicoio)
library(tm)

setwd("/home/james/infiniter/data")
mdl_log = read.csv(file = "mdl_logstore_standard_log.csv", header = TRUE, sep = ",")

### Create a POSIX time from timestamp
mdl_log$time <- as.POSIXlt(mdl_log$timecreated, tz = "Australia/Sydney", origin="1970-01-01")
mdl_log$day <- mdl_log$time$mday
mdl_log$month <- mdl_log$time$mon+1 # month of year (zero-indexed)
mdl_log$year <- mdl_log$time$year+1900 # years since 1900
mdl_log$hour <- mdl_log$time$hour
mdl_log$date <- as.Date(mdl_log$DateTime)
mdl_log$week <- format(mdl_log$date, '%Y-%U')

#mdl_log$dts <- strptime(mdl_log$date)
mdl_log$dts <- as.POSIXct(mdl_log$date)

mdl_log$dts_str <- interaction(mdl_log$day,mdl_log$month,mdl_log$year,mdl_log$hour,sep='_')
mdl_log$dts_hour <- strptime(mdl_log$dts_str, "%d_%m_%Y_%H")
mdl_log$dts_hour <- as.POSIXct(mdl_log$dts_hour)

Create a data frame for analysing the sequence. In order to maintain the learner-centric view of the data, this manipulates the userid and relateduserid fields to create a new realuser field that associates the grading event with the learner rather than the teacher, so we know where it fits into the learner’s sequence of events. The data is then ordered by learner and assignment so that we can look for changes in the sequence, with flags set when the user or assignment id changes to indicate that the status should be reset.

# mdl_2015 is the 2015 subset of the logs, created earlier with subset(mdl_log, year == 2015)
a_log2015 <- tbl_df(mdl_2015) %>%
mutate(time = as.POSIXct(time)) %>%
mutate(relateduserid = as.integer(levels(relateduserid))[relateduserid]) %>%
mutate(eventname = as.character(levels(eventname))[eventname]) %>%
filter(component == 'mod_assign') %>%
mutate(realuser = ifelse(is.na(relateduserid), userid,
ifelse(eventname=='\\mod_assign\\event\\submission_graded', relateduserid,
userid))) %>%
arrange(realuser, contextinstanceid, time) %>%
mutate(userchange = ifelse(abs(realuser - lag(realuser)) == 0, 0, 1)) %>%
mutate(assignchange = ifelse(abs(contextinstanceid - lag(contextinstanceid)) == 0, 0, 1)) %>%
mutate(userchange = ifelse(is.na(userchange), 1, userchange)) %>%
mutate(assignchange = ifelse(is.na(assignchange), 1, assignchange))

I then created a function that acts like a counter to determine the status.

## Set a status for an assignment.
# 0 = not submitted
# 1 = submitted
# 2 = graded
# 3 = resubmission
init.assignStatus <- function(){
  x <- 0
  function(eventname, assignnew=FALSE) {
    if(assignnew) {
      x <<- 0
      x
    } else if(eventname=='\\mod_assign\\event\\assessable_submitted') {
      x <<- 1
      x
    } else if(eventname=='\\mod_assign\\event\\submission_graded') {
      x <<- 2
      x
    } else if(eventname=='\\mod_assign\\event\\submission_form_viewed' & x>0) {
      x <<- 3
      x
    } else {
      x
    }
  }
}

Then instantiate the counter and process the logs using it to set the status.

assignStatus1 <- init.assignStatus()

a_log2015 %<>% rowwise() %>%
  mutate(status = ifelse(userchange>0, assignStatus1(eventname, TRUE),
                  ifelse(assignchange>0, assignStatus1(eventname, TRUE),
                  assignStatus1(eventname))))

Rather than work with the entire dataset I drilled down into a specific course and analysed this. Change the course id to whichever course you are interested in.

science <- a_log2015 %>% filter(courseid == 157)

I filtered my data to just show students, which can be done by identifying teachers to filter out manually or by including roles in the data extraction. At this point I tidied up a few variable types for improved visualisation.

#Filter out staff.
students <- (science %>%
  filter(!(userid %in% c(2,16123))))$userid

science %<>% filter(realuser %in% students) %>%
 mutate(status = as.factor(status)) %>%
 mutate(realuser = as.factor(realuser))
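An alternative to hard-coding staff ids, assuming you have also run the group-members extraction used in the forum analysis later in this post (it returns a role column with Moodle role shortnames), is to build the student list from roles; a minimal sketch:

# hypothetical roles extract: userid plus role shortname, from the group-members SQL
roles <- read.csv("mdl_groups_members.csv")
students <- unique((tbl_df(roles) %>% filter(role == 'student'))$userid)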

Finally summarise by user, day and status and count the frequency of events for each group.

sci_summary <- science %>%
group_by(realuser, dts, status) %>%
summarise(total = n())

And plot the timeline graph.

ggplot(sci_summary, aes(x=dts, color=status)) +
theme_bw() +
geom_point(aes(y=realuser, size=total)) +
scale_x_datetime(breaks = date_breaks("1 week"),
minor_breaks = date_breaks("1 day"),
labels = date_format("%d-%b-%y")) +
scale_colour_manual(name = 'Status',
values = c('#FABE2E', '#60B3CE', '#4D61C0', '#FA962E'),
labels = c('Pre-submission', 'Submitted', 'Graded', 'Resubmission')) +
theme(axis.text.x = element_text(angle = 60, hjust = 1),
axis.text.y = element_text(angle = 30, vjust = 1)) +
labs(x="Day", y="User")

Timeline
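If you have the assignment due dates (for example from the duedate column of mdl_assign), the deadline bars suggested earlier could be overlaid on this plot; a minimal sketch with hypothetical dates:

# hypothetical due dates - replace with values from mdl_assign.duedate
deadlines <- as.POSIXct(c("2015-03-20", "2015-04-24"), tz = "Australia/Sydney")

last_plot() +
  geom_vline(xintercept = as.numeric(deadlines), linetype = 3, colour = "grey40")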

 

And to plot the same for an individual learner:

sci_16210 <- science %>% filter(userid==16210)

summary_16210 <- sci_16210 %>%
 group_by(contextinstanceid, dts, status) %>%
 summarise(total = n())

ggplot(summary_16210, aes(x=dts, color=status)) +
 theme_bw() + 
 geom_point(aes(y=contextinstanceid, size=total)) +
 scale_x_datetime(breaks = date_breaks("1 week"),
 minor_breaks = date_breaks("1 day"),
 labels = date_format("%d-%b-%y")) +
 scale_colour_manual(name = 'Status',
 values = c('#FABE2E', '#60B3CE', '#4D61C0', '#FA962E'),
 labels = c('Pre-submission', 'Submitted', 'Graded', 'Resubmission')) +
 theme(axis.text.x = element_text(angle = 60, hjust = 1),
 axis.text.y = element_text(angle = 30, vjust = 1)) + 
 labs(x="Day", y="Assignment")

timeline-learner

Can activity analytics support understanding engagement as a measurable process? Inspiration from @birdahonk

I was pleased to find out my revised paper on this topic was accepted for publication in the September issue of the Journal of Applied Research in Higher Education. The basic premise of the paper is that engagement can be measured as a metric through the appropriation of ideas commonly used in social marketing metrics. For this post I’ll briefly discuss how I approached this by presenting engagement as a learning theory using the ideas of Freire and Vygotsky, as a process, and as a metric. I’ll also share my workshop slides from the conference if you want to try and create your own learner engagement profile. While I’ve started looking into different approaches, this post summarises some of the key principles developed throughout the paper that have guided my thinking of engagement.

Engagement as a learning theory

The paper proposes a concept of engagement that draws on the work of Paulo Freire and Lev Vygotsky and the evolution of the learner voice. The first aspect of this is to re-position the learner as the subject within education and not the object of education, supplanting previous models which portray the learner as a passive recipient of pre-packaged knowledge. The second aspect is understanding the learner voice as a creative (Freire) and spontaneous (Vygotsky) expression within a socialised teaching-learning process that supports dialectical interactions between learner and teacher curiosity. This positions engagement as the process of recognising and respecting the learner’s world (which, as Freire reveals, is after all the ‘primary and inescapable face of the world itself’) in order to support the development of higher-order thinking skills. The repression of this voice is likely to result in patterns of inertia, non-engagement and alienation that are discussed widely in the motivation and engagement literature. This relationship between motivation and engagement remains a theme central to a range of learning analytics research, and the correlation between learning and autonomy remains an interesting area of research.

Engagement as Process

For the paper I used Haven’s engagement process model and overlaid it with concepts from engagement literature reviews by Fredricks, Blumenfeld, & Paris (2004) and Trowler (2010). Haven posits that engagement is the new metric that supersedes previous linear metaphors, encompassing the quantitative data of site visits, the qualitative data of surveys and performance, as well as the fuzzy data in between that represents social media. Haven and Vittel elaborate this into an expansive process that links four components of engagement (involvement, interaction, intimacy, and influence) through the key stages of discovery, evaluation, use, and affinity (see below). To map this onto the educational research of Fredricks et al., one can explore examples of involvement and interaction as behavioural engagement, intimacy as emotional engagement, and influence as cognitive engagement. Furthermore, when considering whether engagement is high or low in each component, Trowler’s categorisation of negative engagement, non-engagement, and positive engagement can be adopted.

Engagement Process

Engagement as a metric

Learner Dashboard

The goal of positioning this as a metric was to create a learner engagement profile, similar to Haven’s engagement profile for marketing. I used Stevenson’s (2008) Pedagogical Model and Conole and Fill’s (2005) Task Type Taxonomy as ways of classifying log data, and social network analysis to understand interactions between the different course actors. These were used to form dashboards such as the example above, which could then be used to understand profiles such as the one below (name fictionalised). One insight is that where simple raw VLE data might have suggested an engaged learner who is regularly online and features centrally in discussions, the engagement profile reveals the possibility of a learner who may lack academic support during their time online (evenings) and who demonstrates a pattern of alienation based on an apparently strategic approach within an environment that is heavily structured through teacher-led inscription. Given the number of users who have not logged in or have yet to post to the discussions, it might also seem sensible to target other learners for engagement interventions; however, this would miss opportunities, revealed in the engagement profile, to provide useful support interventions targeting improved learner voice.

Engagement Profile

Where next?

I’m currently writing a PhD proposal which develops many of the ideas in this paper, although I am heading towards a more quantitative analysis of qualitative data. Hopefully I’ll get a chance to recreate the dashboards in RStudio and share them soon, as an interruption to the cluster analysis I am trying to get working.

As part of delivering the paper as a workshop I created some case study data and a group discussion activity. These are included below for anyone who wants to follow along. I’ve run similar sessions in Malaysia and Utah and would love to try it with some other groups. The slides from Utah are below for anyone interested.

One of the main things I’ll always have this paper to thank for is the amazing trip to the Utah Valley and the stroll to Rock Canyon.

Rock Canyon

Inside forum posts – politics, networks, sentiment and words! Inspired by @phillipdawson, @shaned07, and @indicoData #moodle #learninganalytics

Enhanced communication has long been championed as a benefit of online learning environments, and many educational technology strategies will include statements around increased communication and collaboration between peers. So, thinking towards an engagement metric for my current project, and needing to get inside activities for my in-progress PhD proposal, exploring forum use is one of the more interesting analytics spaces within the LMS. I’ve used three techniques for my initial analysis: (1) a look at post and reply counts inspired by @phillipdawson and his work on the Moodle engagement block, (2) social network analysis inspired by a paper by @shaned07 on teacher support networks; and (3) sentiment and political view analysis provided by @indicoData as an introduction to text mining.

I’ll start with sharing the visualisations and where these might be useful and then finish with details of how I coded these.

Forum posts

Total weekly forum posts by student

Following Phillip Dawson’s work on the engagement block for Moodle, I decided to look into two posting patterns: (1) posts over time; and (2) average post word count. The over-time analysis (above) compares the weekly posting pattern of each student in a group. For most students, replies to peers and teachers are “in phase”, suggesting that when they are active they discuss with the entire group, and so learning design might focus on keeping them active. One can also notice that those who only reply to peers appear to have much lower overall post activity, which in the original engagement block would place them at risk – learning design may consider teacher-led interventions to understand whether discussions with the teacher impact their overall activity. The average word count analysis (below) reinforces the latter case, showing that those who only reply to peers post infrequent, shorter replies. Conversely, those who post infrequent lengthy posts tend to target the teacher and do not follow up with many further replies in the discussion. There is some suggestion of an optimal word count of around 75-125 words for forum posts that might warrant further investigation.

Forum Posts

Social Network Analysis

Social Network Analysis

The network diagram (above) confirms what was emerging in the post analysis: that a smaller core of students (yellow circles) are responsible for a majority of the posts, and it further reveals the absolute centrality of the teacher (blue circle), highlighting how important teacher-led interventions may be to this group. This is probably not surprising, although the teacher may use this to consider how they might respond more equally to the group – here the number of replies is represented by the increasing thickness of the grey edges, and they appear to favour conversations in the lower left of the network. A similar theme is explored by Shane Dawson (2010) in “‘Seeing’ the learning community”. One can understand this further by plotting eigenvector centrality against betweenness centrality (below), where a student with high betweenness and low eigenvector centrality may be an important gatekeeper to a central actor, while a student with low betweenness and high eigenvector centrality may have unique access to central actors.

Centrality

Content Analysis

Sentiment analysis

Text analysis of forums provides a necessary complement to the above analysis, exploring the content within the context. I have used the Indico API to aid my learning of this part of the field rather than try to build this from scratch. The sentiment analysis API determines whether a piece of text was positive or negative in tone and rates this on a scale from 0 (negative) to 1 (positive). Plotting this over time (above) provides insights into how different topics might have been received, with this group showing generally positive participation, although with two noticeable troughs that might be worth some further exploration. The political opinion API scores political leaning within a text on a scale of 0 (neutral) to 1 (strong). Plotting this for each user (below) shows that more politicised posts tend to be conservative (unsurprising), although there is a reasonable mix of views across the discussion. What might be interesting here is how different students respond to different points of view and whether a largely conservative discussion, for example, might discourage contribution from others. Plotting sentiment against libertarian leaning (second chart below) shows that participants are, at least, very positive when leaning towards libertarian ideology, though this is not the only source of positivity. Exploring text analysis is fascinating, and if projects such as Cognitive Presence Coding and the Quantitative Discourse Analysis Package make this more accessible then there are some potentially powerful insights to be had here. I had also hoped to analyse the number of external links embedded in posts, following a talk by Gardner Campbell I heard some years ago about making external connections of knowledge; however, the dataset I had yielded zero links, which while informative to learning design is not well represented in a visual (code is included below).

Political leaning

Libertarian sentiment

How-to guide

The following shares the code used to produce the above visualisations and should work with any Moodle version.

Step 1: Data Extraction

This requires some new data sets from the Moodle database query in the initial log analysis.

Group members

SELECT ue.userid, e.courseid, g.id AS groupid, r.shortname AS role, gm.timeadded
FROM mdl_groups_members gm
JOIN mdl_groups g ON gm.groupid = g.id
JOIN mdl_user_enrolments ue ON gm.userid = ue.userid
JOIN mdl_enrol e ON ue.enrolid = e.id AND e.courseid = g.courseid
JOIN mdl_context co ON co.instanceid = e.courseid AND co.contextlevel = 50
JOIN mdl_role_assignments ra ON ra.userid = ue.userid AND ra.contextid = co.id
JOIN mdl_role r ON ra.roleid = r.id
GROUP BY ue.id

Forum posts

SELECT p1.id, d.forum, d.course, d.groupid, p1.discussion, p1.parent, p1.userid, p1.created, p1.modified, p1.subject, p1.message, p1.attachment, p1.totalscore, IFNULL(p2.userid, d.userid) as target
FROM mdl_forum_posts p1
LEFT JOIN mdl_forum_posts p2 ON p1.parent = p2.id
LEFT JOIN mdl_forum_discussions d ON p1.discussion = d.id

 Step 2: Data Wrangling

Load the libraries and files, and set up the time series on posts, similar to the process discussed in detail earlier.

library(ggplot2)
require(scales)
library(dplyr)
library(tidyr)
library(magrittr)
library(RColorBrewer)
library(GGally)
library(zoo)
library(igraph)
library(devtools)
require(indicoio)
library(tm)

setwd("/home/james/infiniter/data")
mdl_log = read.csv(file = "mdl_logstore_standard_log.csv", header = TRUE, sep = ",")
posts = read.csv(file = "mdl_forum_posts.csv", header = TRUE, sep = ",")
groups = read.csv(file = "mdl_groups_members.csv", header = TRUE, sep = ",")

### Create a POSIX time from timestamp
posts$time <- as.POSIXlt(posts$created, tz = "Australia/Sydney", origin="1970-01-01")
posts$day <- posts$time$mday
posts$month <- posts$time$mon+1 # month of year (zero-indexed)
posts$year <- posts$time$year+1900 # years since 1900
posts$hour <- posts$time$hour 

posts$dts_str <- interaction(posts$day,posts$month,posts$year,sep='_')
posts$dts <- strptime(posts$dts_str, "%d_%m_%Y")
posts$dts <- as.POSIXct(posts$dts)
posts$week <- format(posts$dts, '%Y-%U')

I filtered users on a particular group within a course, however readers will need to adjust this accordingly using the group members extraction above (if you don’t use Moodle groups just use the course instead). The below code creates the filter variable users.

g <- tbl_df(groups)
users <- (g %>% filter(groupid == 1234))$userid

Forum posts

Create some simple text cleaning functions

# returns string without HTML tags.
clean <- function(x) gsub("(<[^>]*>)", " ", x)

# returns string with spaces instead of &nbsp;
space <- function(x) gsub("&nbsp;", " ", x)

# returns string w/o leading or trailing whitespace.
trim <- function (x) gsub("^\\s+|\\s+$", "", x)

# returns string w/o double-spaces.
trim2 <- function (x) gsub("^ *|(?<= ) | *$", "", x, perl=T)

Process the forum post data. This essentially creates a data frame, tidies up the time and userid variable types, counts the number of external links, tags whether the post is a reply, filters to the group under analysis, cleans the message of HTML, coded spaces, leading and trailing spaces, and extra whitespace, counts the number of words, and finally determines whether the post was a reply to the teacher or a peer (based on hard-coded ids – this could be improved, as sketched after the code below). Forum posts use simple HTML markup and structures so cleaning can be achieved with regular expressions – this would not work on all text.

f <- tbl_df(posts)
f %<>% mutate(time = as.POSIXct(time)) %>%
 mutate(userid = as.factor(userid)) %>%
 # count external links in each message
 mutate(links = sapply(gregexpr("<a href=", message),
                       function(m) ifelse(m[1] == -1, 0, length(m)))) %>%
 mutate(reply = ifelse(parent>0, 1, 0)) %>%
 filter(groupid == 1234) %>%
 mutate(clean_message = clean(message)) %>%
 mutate(clean_message = space(clean_message)) %>%
 mutate(clean_message = trim(clean_message)) %>%
 mutate(clean_message = trim2(clean_message)) %>%
 mutate(wcount = sapply(gregexpr("\\S+", clean_message), length)) %>%
 mutate(to = ifelse(target %in% c(10603, 10547), 'Teacher', 'Peer'))
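As noted above, the hard-coded teacher ids could instead be derived from the data; a minimal sketch, assuming the group-members extract g from earlier (its role column uses Moodle role shortnames such as ‘editingteacher’):

# derive teacher userids from the roles in the group-members extract
teacher_ids <- unique((g %>% filter(role %in% c('editingteacher', 'teacher')))$userid)

f %<>% mutate(to = ifelse(target %in% teacher_ids, 'Teacher', 'Peer'))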

Then apply the sentiment analysis using Indico (you’ll need an API key).

f %<>% rowwise() %>% mutate(sentiment = sentiment(clean_message, api_key = 'xxxxx'))
f %<>% rowwise() %>% mutate(libertarian = political(clean_message, api_key = 'xxxxx')$Libertarian)
f %<>% rowwise() %>% mutate(liberal = political(clean_message, api_key = 'xxxxx')$Liberal)
f %<>% rowwise() %>% mutate(green = political(clean_message, api_key = 'xxxxx')$Green)
f %<>% rowwise() %>% mutate(conservative = political(clean_message, api_key = 'xxxxx')$Conservative)

Finally, set up the social network matrix using the igraph library.

n <- tbl_df(posts) %>% mutate(time = as.POSIXct(time))
n %<>% filter((userid %in% users) | (target %in% users)) %>%
  select(userid, target) 

nmatrix <- as.matrix(n)
ngraph <- graph.data.frame(nmatrix)
adj.mat <- get.adjacency(ngraph,sparse=FALSE)
net=graph.adjacency(adj.mat,mode="directed",weighted=TRUE,diag=FALSE)

And establish the network metrics for centrality

metrics <- data.frame(
 deg=degree(net),
 bet=betweenness(net),
 clo=closeness(net),
 eig=evcent(net)$vector,
 cor=graph.coreness(net)
 )

metrics <- cbind(userid = rownames(metrics), metrics)
rownames(metrics) <- NULL
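To pull out the potential gatekeepers discussed earlier (high betweenness but low eigenvector centrality), a quick filter over this metrics frame could look like the following sketch; the quartile cut-offs are an arbitrary choice:

# possible gatekeepers: high betweenness, low eigenvector centrality
gatekeepers <- metrics %>%
  filter(bet > quantile(bet, 0.75), eig < quantile(eig, 0.25))
gatekeepers$userid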

Step 3: Data Visualisation

Forum posts per week per student

Weekly forum

ggplot(subset(f, !(userid %in% c(10603, 10547))), aes(x=dts)) +
 geom_line(aes(color=to), stat="bin", binwidth=7*24*60*60) +
 scale_x_datetime(breaks = date_breaks("4 week"),
 minor_breaks = date_breaks("1 week"),
 labels = date_format("%d-%b")) +
 scale_y_continuous(breaks = seq(0, 10, 5)) +
 coord_cartesian(ylim = c(0,10)) +
 xlab('Week') + ylab('Forum posts') +
 theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
 facet_wrap(~userid, ncol = 7)

Forum words per post

Forum Posts

fg <- group_by(f, userid, to) %>%
 summarise(wordspp = mean(wcount), medwords = median(as.numeric(wcount)),
 totalwords=sum(wcount), posts=n()) %>%
 filter(!(userid %in% c(10603, 10547)))

ggplot(fg, aes(userid, wordspp), group=to) +
 geom_bar(stat="identity", aes(fill=to), position="dodge") +
 geom_text(aes(label=posts), size=3, color="#2C3E50",
 position=position_dodge(width=0.9), vjust=0) +
 xlab('User (label = total posts)') + ylab('Words per post') +
 theme(axis.text.x = element_text(angle = 60, hjust = 1))

Social Network

Social Network Analysis

V(net)$Role=as.character(g$role[match(V(net)$name,g$userid)])
V(net)$color=V(net)$Role
V(net)$color=gsub("coursedeveloper","#06799F",V(net)$color)
V(net)$color=gsub("editingteacher","#60B3CE",V(net)$color)
V(net)$color=gsub("student","#FF8300",V(net)$color)

V(net)$degree <- degree(net)
V(net)$label.cex <- 2.2 * V(net)$degree / max(V(net)$degree) + .2
V(net)$label.color <- rgb(0, 0, .2, .8)
V(net)$frame.color <- NA
egam <- (log(E(net)$weight)+.4) / max(log(E(net)$weight)+.4)
E(net)$color <- "#9E9E9E"
E(net)$width <- egam

tkplot(net, layout=layout.fruchterman.reingold, edge.width=0.25*E(net)$weight,
 edge.curved=TRUE)

Centrality

Centrality

ggplot(metrics, aes(x=bet, y=eig, label=userid)) +
 geom_text(angle=30, size=3.5, alpha=0.8, color="#60b3ce") +
 labs(x="Betweenness centrality", y="Eigenvector centrality")
Sentiment

Sentiment analysis

ggplot(f, aes(dts, sentiment)) +
 geom_rect(aes(xmin=strptime('2015-01-22',"%Y-%m-%d"),
 xmax=strptime('2015-05-10',"%Y-%m-%d"),
 ymin=0, ymax=0.5), fill="#FFB2B2", alpha=0.5) +
 geom_rect(aes(xmin=strptime('2015-01-22',"%Y-%m-%d"),
 xmax=strptime('2015-05-10',"%Y-%m-%d"),
 ymin=0.5, ymax=1), fill="#B2F0B2", alpha=0.5) +
 geom_point() +
 geom_line(stat = 'summary', fun.y = mean, linetype=2) +
 scale_x_datetime(breaks = date_breaks("1 week"),
 minor_breaks = date_breaks("1 day"),
 labels = date_format("%d-%b-%y")) +
 theme(axis.text.x = element_text(angle = 60, hjust = 1))

Political leaning

Political leaning

ggplot(f, aes(x=userid)) +
 geom_point(aes(y=libertarian, color="yellow")) +
 geom_point(aes(y=liberal, color="red")) +
 geom_point(aes(y=green, color="green")) +
 geom_point(aes(y=conservative, color="blue")) +
 scale_colour_manual(name = 'Political leaning',
 values =c('yellow'='yellow', 'red'='red',
 'green'='green','blue'='blue'),
 labels = c('Conservative', 'Green', 'Liberal', 'Libertarian')) +
 theme(axis.text.x = element_text(angle = 60, hjust = 1))

Libertarian Sentiment

Libertarian sentiment

ggplot(f, aes(x=libertarian, y=sentiment, label=userid)) +
 geom_point(size=3.5, alpha=0.8, color="#60b3ce") +
 coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +
 labs(x="Libertarian leaning", y="Sentiment")

Next task: assignment metrics

Learning logs: how long are your users online? Analytics Part 2 #moodle #learninganalytics

How long users spend on Moodle (or e-learning more generally) is another common question worth some initial exploration as part of my broader goal towards the notion of an engagement metric. This article discusses an approach to defining and obtaining insights from the idea of a session length for learning. This is mostly a data wrangling exercise to approximate the duration from event logs, which will tell us that while all events are born equal, some are more equal than others. The algorithm should prove useful when I progress to course breakdowns in identifying particularly dedicated or struggling students who are investing larger amounts of time online, or those at risk who aren’t spending enough. These are questions I will return to in a future post as part of the project.

Learning Duration

This works on the same data as last week’s look at some basic distribution analysis, which contains the extraction SQL.

Event-duration Correlation

Duration distribution

Session spread

Step 1. Data Wrangling

The goal here is to calculate the duration based on the difference between events; the challenge is determining when a session starts and ends. Notably, there is not a consistent and clear pair of logged-in and logged-out events recorded in the data.

The principle I have used is to sort the events by user and time, and then compare each row with the previous one to determine whether a new session should be started. These are the rules I came up with empirically for when to start counting a new session:

  1. If the event is a log in event (\core\event\user_loggedin), as this is a new login;
  2. If the event is earlier than the previous one, as this implies the user has changed given the sort order;
  3. If the event is a course view and the duration is over 5 mins, as this suggests the user has left the browser without logging out – this was determined because most course views are less than 1 minute and 99% of course view durations were under 5 minutes, with several outliers that created unusually long sessions;
  4. If the previous event was over 60 minutes earlier, which is based on the session timeout value.

Setting up the data

This is the same setup as last week’s post, where it is explained in more detail.

library(ggplot2)
require(scales)
library(dplyr)
library(tidyr)
library(magrittr)
library(RColorBrewer)
library(ggthemes)
library(zoo)
setwd("./Documents/")

mdl_log <- read.csv('mdl_logstore_standard_log.csv')

### Setup additional date variables
mdl_log$time <- as.POSIXlt(mdl_log$timecreated, tz = "Australia/Sydney", origin="1970-01-01")
mdl_log$day <- mdl_log$time$mday
mdl_log$month <- mdl_log$time$mon+1 # month of year (zero-indexed)
mdl_log$year <- mdl_log$time$year+1900 # years since 1900
mdl_log$hour <- mdl_log$time$hour 
mdl_log$date <- as.Date(mdl_log$DateTime)
mdl_log$week <- format(mdl_log$date, '%Y-%U')
mdl_log$dts <- as.POSIXct(mdl_log$date)

mdl_log$dts_str <- interaction(mdl_log$day,mdl_log$month,mdl_log$year,mdl_log$hour,sep='_')
mdl_log$dts_hour <- strptime(mdl_log$dts_str, "%d_%m_%Y_%H")
mdl_log$dts_hour <- as.POSIXct(mdl_log$dts_hour)

Manipulating the data

As mentioned already, the principle is to break the data into sessions and calculate the difference between the start and end times.

So the first thing I need is a counter function to keep track of the session number, which I modified from this post. Basically this will return the next increment by default, or the current value if increment (inc) is set to false.

init.counter <- function(){
    x <- 0
    function(inc=TRUE){
        if(inc) {
            x <<- x + 1
            x
        } else {
            x
        }
    }
} #source: hadley wickham

Next initialise the counter instance and create the dplyr data frame.

counter1 <- init.counter()
d <- tbl_df(mdl_log)

For the first wrangle there are a few additional parameters to add to make the calculation smooth. This is likely to be the area to focus on to refine the rules if you have unusual outliers.

  1. Fix the time format to POSIXct
  2. Sort the data by user and time so that is sequenced by user sessions
  3. Calculate the difference (diff) between the current row and the previous row in minutes
  4. Calculate the duration (dur) of the event as the difference between the current row and the next row in minutes
  5. Set a binary login flag if the event is logged in
  6. Set a binary course dwell flag (cdwell) if the user remains on a course view event for more than 5 minutes

d %<>% mutate(time = as.POSIXct(time)) %>%
 arrange(userid,time) %>%
 mutate(diff = difftime(time, lag(time), "Australia/Sydney", "mins")) %>%
 mutate(dur = -1 * difftime(time, lead(time), "Australia/Sydney", "mins")) %>%
 mutate(login = ifelse(action=="loggedin", 1, 0)) %>%
 mutate(cdwell = ifelse(eventname=='\\core\\event\\course_viewed' & diff>5, 1, 0))

The second wrangle uses the rowwise method to apply a function to each row in turn, namely to calculate the session number (sessnum). Each time counter1() is called it will use the function above to increment the session (i.e. start tracking a new session), otherwise counter1(inc=FALSE) returns the current session. There are 6 nested rules applied at this stage:

  1. If the diff is NA then set the session to 0 (this deals with the first row and sets the count sequence)
  2. If the login flag is set (i.e. this is the loggedin event) then increment the session
  3. If the difference is negative (i.e. the current event happened before the previous) then increment the session (this case deals with the user sort as this indicates a change in user and so a new session)
  4. If this is flagged as an extended course dwell event then start a new session (assume the user left the browser)
  5. If the difference (diff) from the last event is greater than 1 hour (60 mins) then start a new session
  6. Otherwise this is within the current tracked session

d %<>% rowwise() %>%
  mutate(sessnum = 
      ifelse(is.na(diff), 0,
      ifelse(login, counter1(),
      ifelse(diff < 0, counter1(),
      ifelse(cdwell, counter1(),
      ifelse(diff > 60, counter1(), counter1(inc=FALSE)
  ))))))

Finally group the data by user and session and use this to calculate the start and end times, the number of events in the session, and the duration in minutes as end minus start.

d_session <- group_by(d, userid, sessnum) %>%
  summarise(start = min(time), end = max(time), date=min(date),
      events = n(), duration = difftime(max(time),min(time), 
      "Australia/Sydney", "mins")) %>%
  arrange(start) %>%
  mutate(sesslength = ceiling(as.numeric(duration))) %>%
  mutate(weeknum = format(date, "%Y-%U"))
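From d_session it is a short step to the kind of per-user totals that the at-risk question above would need; a minimal sketch:

# total minutes online and number of sessions per user
user_time <- d_session %>%
  group_by(userid) %>%
  summarise(sessions = n(), total_mins = sum(sesslength))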

Step 2. Data Visualisation

Daily duration spread

This presents the overall spread of durations per day and is useful to understand the general use of the site, and where you have peaks, upper bounds or outliers. I used this to identify outliers and further refine my codification above to remove false positives in session sequences. The size of the circle indicates the number of events in the session which I will discuss further in the last visualisation.

Daily Duration Spread

ggplot(d_session, aes(x=start, y=sesslength)) +
 geom_point(aes(size=events), alpha=0.75, color="#60B3CE", position=position_jitter()) +
 geom_smooth() +
 scale_x_datetime(breaks = date_breaks("1 week"),
 minor_breaks = date_breaks("1 day"),
 labels = date_format("%d-%b-%y")) +
 theme(axis.text.x = element_text(angle = 60, hjust = 1)) + 
 labs(x="Day", y="Session length (mins)")

Distribution

The next visual looks at the daily distributions within each week (not the total hours per week) to demonstrate averages for individual sessions. The indication is that while there are many sessions longer than half an hour, these represent less than 25% of all sessions. Learning activities or sequences intended to last more than 30 minutes are likely to be incomplete in a single session, which may influence learning design choices. In the second version of the chart I have zoomed in to sessions of one hour or less to give a better idea of the distribution; this shows an average of around 15 minutes, which calls for quite bitesize learning.

Weekly spread

ggplot(d_session, aes(x=weeknum, y=sesslength)) +
 geom_boxplot(fill="#60b3ce") +
 geom_smooth(aes(group=1), color="#FF8300", linetype=2) +
 theme(axis.text.x = element_text(angle = 60, hjust = 1)) + 
 scale_y_continuous(breaks=seq(0, 600, 60), minor_breaks=seq(0, 600, 15)) +
 labs(x="Week", y="Session length (mins)")

Distribution

ggplot(d_session, aes(x=weeknum, y=sesslength)) +
 geom_boxplot(fill="#60b3ce") +
 geom_smooth(aes(group=1), color="#FF8300", linetype=2) +
 theme(axis.text.x = element_text(angle = 60, hjust = 1)) + 
 scale_y_continuous(breaks=seq(0, 600, 15), minor_breaks=seq(0, 600, 15)) +
 labs(x="Week", y="Session length (mins)") +
 coord_cartesian(ylim = c(0,60))
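To put numbers behind the quartile claims above, the session-length distribution can be summarised directly:

# quartiles and 95th percentile of session length (minutes)
quantile(d_session$sesslength, probs = c(0.25, 0.5, 0.75, 0.95))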

Correlation between events and duration

One may expect that more events in a session indicates a longer session, which is to some extent true. There is a correlation; however, the visualisation shows that it is not consistent. Having historically been guilty of using the total number of events as a proxy for time online, I will need greater justification for that in future analyses, as not all events are equal.

correlation

ggplot(d_session, aes(x=events, y=sesslength)) +
 geom_point(alpha=0.75, 
 position = position_jitter(h =0),
 color = "orange") + 
 labs(x="Number of event logs", y="Session length (mins)") +
 coord_cartesian(xlim = c(0,100), ylim = c(0,400))
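To complement the plot with a single summary figure, a quick correlation check on the same d_session data might look like this:

# Pearson correlation between number of events and session length
cor(d_session$events, d_session$sesslength, use = "complete.obs")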

Hopefully some others will find this useful in exploring this question. I want to use and refine the algorithm as a factor for engagement later on. Next up I will be exploring forums in more detail.

Scratching the surface: Moodle analytics in Rstudio Part 1 #moodle #learninganalytics

At some point I always come back to the question of how we understand use of the VLE/LMS, which I’ve theorised a lot. As part of an interest in learning about Data Science I’ve signed up to Sliderule (@MySlideRule) and am being mentored through a capstone project with some Moodle data. The main goal is for me to learn R, which I’d never touched until 2 weeks ago, but hopefully the data can tell me something about Moodle at the same time. Feedback or advice on techniques is welcome.

Exploratory Data Analysis on mdl_logstore_standard

For this part I am going to focus on producing some simple two-dimensional analysis. This assumes you have MySQL access to your Moodle database and RStudio.

Daily logins

Hourly access

Module use

Day of week

Frequency distribution

Activity distribution

Step 1. Data Extraction

I started with a full data extraction of all events in the system to a CSV file (mdl_logstore_standard_log.csv).

SELECT c.fullname as courseName,
FROM_UNIXTIME(l.timecreated) as DateTime, l.*
FROM mdl_logstore_standard_log l
LEFT JOIN mdl_course c ON l.courseid = c.id
WHERE origin = 'web'

Step 2. Data Wrangling

In order to do time series analysis the data needs some reformatting.

Install necessary R packages

library(ggplot2)
require(scales)
library(dplyr)
library(tidyr)
library(magrittr)
library(ggthemes)
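A couple of the plots below reference a cbPalette colour vector that isn’t defined in these snippets; a reasonable assumption is the colour-blind-friendly palette from the Cookbook for R:

# colour-blind-friendly palette used by some of the plots below
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73",
               "#F0E442", "#0072B2", "#D55E00", "#CC79A7")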

Set your working folder to the directory with your CSV

setwd("./Documents/")

Import the CSV file

mdl_log <- read.csv('mdl_logstore_standard_log.csv')

Create a POSIXlt time field and break down into day, month, year, and hour components

mdl_log$time <- as.POSIXlt(mdl_log$timecreated,
 tz = "Australia/Sydney", origin="1970-01-01")
mdl_log$day <- mdl_log$time$mday
mdl_log$month <- mdl_log$time$mon+1 # month of year (zero-indexed)
mdl_log$year <- mdl_log$time$year+1900 # years since 1900
mdl_log$hour <- mdl_log$time$hour

Create a date format field and break down into week component

mdl_log$date <- as.Date(mdl_log$DateTime)
mdl_log$week <- format(mdl_log$date, '%Y-%U')

Create a timestamp version for the day for daily time series

mdl_log$dts <- as.POSIXct(mdl_log$date)

Create a timestamp version of hour for hourly time series

mdl_log$dts_str <- interaction(mdl_log$day,mdl_log$month,mdl_log$year,mdl_log$hour,sep='_')
mdl_log$dts_hour <- strptime(mdl_log$dts_str, "%d_%m_%Y_%H")
mdl_log$dts_hour <- as.POSIXct(mdl_log$dts_hour)

Filter to participation education level events for 2015

mdl_2015 <- subset(mdl_log, year == 2015)
participation_2015 <- subset(mdl_2015, edulevel %in% c("2"))

Create the dplyr data table

d <- tbl_df(mdl_log)
d %<>% mutate(time = as.POSIXct(time))

Step 3. Data Visualisation

Daily activity data

Create the daily aggregation

daily <- group_by(d, userid, dts) %>% summarise(Total = n())

Create a day of the week factor

daily$dow = as.factor(format(daily$dts, format="%a"))

Plot the day of the week breakdown

I’m using a really small sample size here, which creates similar results across all days, however most real populations have a degree of variance.

ggplot(daily, aes(dow, Total)) +
geom_boxplot(aes(fill=dow)) +
scale_x_discrete(limits=c('Mon','Tue','Wed','Thu','Fri','Sat','Sun')) +
theme_few() +
xlab('Day of week') + ylab('User activity frequency') +
guides(fill=FALSE)

Hourly data

Create the hourly data aggregation

hourly <- group_by(d, dts_hour) %>% summarise(Total = n())

Create the hour of day factors

hourly$dow = as.factor(format(hourly$dts_hour, format="%a"))
hourly$hr = format(hourly$dts_hour, format="%H")

Separate weekends

hourly$weekend = 'weekday'
hourly[hourly$dow=='Sat'|hourly$dow=='Sun',]$weekend = 'weekend'

Plot the hour of the day breakdown

Gives an idea of the spread of activity throughout the day and indicates people sleep in on weekends.

Hourly access

ggplot(hourly, aes(hr,Total)) +
geom_boxplot(aes(fill=weekend)) +
geom_smooth(aes(group=weekend)) +
scale_fill_manual(values=cbPalette) +
xlab('Hour of day') + ylab('Daily activity frequency')

Distinct user data

Create the distinct user aggregation

udaily <- group_by(d, dts) %>% summarise(users = n_distinct(userid))

Plot the daily logins data

Gives an idea of user logins per day – can be analysed as a percentage of your total user base.

logins

ggplot(udaily, aes(dts, users)) +
geom_bar(stat="identity", fill="#60B3CE") +
scale_x_datetime(breaks = date_breaks("1 week"),
minor_breaks = date_breaks("1 day"),
labels = date_format("%d-%b-%y")) +
scale_color_manual(values=cbPalette) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(x="Day", y="Distinct number of users")

Activity distribution

Create the user activity aggregation

utotal <- group_by(d, userid) %>% summarise(Total = n())

Plot the activity distribution

Provides a sorted list of activity to see the nature of the activity distribution across users, with a small number of users creating a large number of events.

Activity by user

ggplot(utotal, aes(reorder(userid, Total), Total)) +
geom_point(alpha=0.5, color = "#FF8300") +
scale_x_discrete(breaks=NULL) +
xlab('User') + ylab('Total activity')

Plot the frequency distribution of activity per user

Provides a histogram of the above plot, confirming the skew of the data: the majority are low-activity users while a minority create a large number of events. This may be concerning in an educational context and warrants further analysis (in a future post).

Frequency

ggplot(utotal, aes(Total)) +
geom_histogram(binwidth=10, fill="#60B3CE") +
xlab('Total activity') + ylab('Frequency of users')

Module use

Create the component aggregation

# note: yearLevel is not produced by the extraction query above - remove it from group_by (or add it to your SQL) if it is missing
component <- group_by(d, yearLevel, userid, component) %>% summarise(Total = n())

Plot the module use 

This gives an idea of the spread of tools being used within course or learning design. Quality is more important than quantity here but this might be a useful springboard into further analysis.

Module use

ggplot(subset(component, component != 'core'),
aes(x = component, y = Total)) +
geom_bar(stat="identity", fill="#FF8300") +
coord_polar(theta = "x") +
labs(x="Component", y="Total number of events") +
scale_fill_manual(values=cbPalette)

I’ve used a polar version of the bar chart, but you can also get a bar chart view of this with the following:

ggplot(component, aes(x = component, Total)) +
geom_bar(stat="identity", fill="#60B3CE") +
coord_flip() +
labs(x="Component", y="Total number of events") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))

the purpose of education is to delight (#purposedu #500words)

Learning begins in delight and ends in wisdom – Gardner Campbell

With its simplicity and panache, the purpose of delight is offered in a similar spirit to those so far: hope, independence, curiosity, magical experience, connection, confidence, enthusiasm, optimism, preparation … and to the seemingly over-arching theme of  helping people become what they are capable of becoming.

However this made me wonder how education actually supports a journey from delight to wisdom; a seemingly different journey than that from the classroom to the exam hall.

An education provider, as social institution, implies a certain structure of time and space and creates its own social system of relationships – perhaps most commonly being one of categorisation:

  • Categorisation of learning providers through league tables;
  • Categorisation of learning through curriculum;
  • Categorisation of learners through standardised testing.

For Foucault such classifications operate a disciplinary function that constitutes the individual as effect and object of power (pouvoir) and knowledge (connaissance). This produces the individual ‘case’ – learner as UCAS points, bachelor, master, drop-out, failure – the examination fixing individual differences and the commonality of potential.

Such systems may well suggest that the end of education is the dawn of learning.

An alternative to categorisation is sense-making, a process based on exploration rather than exploitation. In other words education must shift from instruction to discovery; from boring to building. Taking this further, McLuhan suggests that anyone who makes a distinction between education and entertainment doesn’t know the first thing about either. This should shift interest to the territory of knowledge (savoir) to be explored rather than domains of knowledge (connaissance) being imposed.

Following Deleuze & Guattari this introduces an alternative concept of power (puissance) as a range of potential or ‘capacity for existence’.  This power resides with the learner – the power to be, so beautifully presented already as burning brightly, building minds, or the magical key to unlocking potential.

For me delight is the interest that can spark a connection with the world. Interest-driven learning or rhizomatic learning provide examples where there aren’t ‘things people should know’ but rather ‘new connections to be made’. The purpose of education and the importance of teachers become to help learners follow their delights and make new connections; the community becomes the curriculum.

As a technologist my interest has been in understanding how personalised learning might see systems adapt to the learner rather than learners to systems. One need only look at the impact of the long tail or the social network as ways of piquing personal interest and connecting people with shared interests.

While possibilities exist for improving learning, technology itself cannot act as an isolated catalyst for these changes which is why debates like this are so important. The classification model of education will resist such changes: rather than allow learners to explore through technology, a computer curriculum was developed … before the computer could change School, School changed the computer.

To reverse this, the purpose of delight seeks to help learners perceive education as something they’re participants in rather than recipients of. 

Thinking in Colour

I have spent a lot of time practising writing and even doing presentations, but far less time exploring colour. This became apparent while attempting to design a document that needs to make an impression and be pleasing to the viewer. I decided to try an adapted activity from a Colour Theory lesson, which I’ve shared below (took about 2 hours). This is a fun exercise in creativity and like free writing it may help engage otherwise dormant thinking processes.

(1) Draw a colour wheel – I usually use an online colour scheme designer, however this doesn’t really encourage me to think about the choices.

(2) Find a location on Google Street View (I didn’t intend or have time to draw on street for this)

(3) Draw the negative space – this has 2 purposes: getting your brain to think differently and creating colour spaces for next step;

(4) If available copy to coloured card – I didn’t have any but might pick some up for next time

(5) Choose complementary colours for spaces (I had pastels so went with bold colours and hue change whereas coloured pencil may have been preferable)

I have to wonder if this was an exercise in procrastination or creativity – the negative spaces took a while to get used to. While I could have adapted a Word template in the same time as choosing a colour for my new format (using Scribus), this would have been far less enjoyable. I won’t claim to have radically altered my thinking, however it does draw attention to how the eyes perceive both colour and text in the same and yet different ways. While I can definitely improve my drawing, it also seems sensible to add colour to any set of thinking tools.