In light of my new position as a HarvardX Research Fellow, I have been thinking about the role of data in improving online learning experiences (aka MOOCs) at edX. Can data tell us everything about the ideal learning experience of tomorrow? Can product developers at edX come up with the best version single-handedly? Or, maybe, the online students could also tell us what the ideal MOOC is?
First, let's think about what the "ideal MOOC" could be. There is a broad consensus that an ideal online learning experience would yield the best "educational outcomes" for the students. For now, let's think about the educational outcome as something that's well-approximated by the amount of learning. Specifically, this means that we want students to extract and internalize as much educational content from the interactive learning experience as possible. Finally, the educational content is information that is relevant to the substance of the class. For example, for a probability course, this would include information on how to use Bayes' rule or the change of variables. For a Python programming class, this would include information on how to use Python modules and the language syntax. For a class on interactive visualization, this could include (of course!) information on how to use d3js.
This is an important point. Educational content is information relevant to the substance of the class. We want the students to internalize as much of it as possible, make it their knowledge. How can we do that?
Let's assume that the educational materials (lectures, homework, tests, examples) have already been prepared and we believe that they are good. How do we expose the materials to the students in the best possible way so that students learn the most, stay engaged, and more students complete the class?
Clearly, the setting of a MOOC is different from the setting of a standard classroom. One of the significant differences is the number of students - it's massive. Depending on the course, the number of enrolled students can exceed 150 thousand - CS50x by David Malan on HarvardX is a great example. Do we want to expose every single student, no matter what country he/she is from, no matter what talents and aspirations he/she has, no matter how many peers he/she will study with, all to the same sequence of the material? Maybe, yes. And maybe, no.
The setting of MOOCs can be a wonderful platform for adaptive media - an algorithmic way of sequentially presenting content and interacting with the user in order to maximize the informational content that the user "internalizes".
Adaptive media. It's the characterizing trait of a computer as a medium - the ability to simulate responses, interact, predict, "act like a living being". We can use it to model, predict, and synthesize the best way to serve content to users, algorithmically.
Adaptive media is used actively across the Web in conjunction with social media. Often, the inputs of adaptive media are the outputs of social media (and then it repeats). When you share an article on Facebook, the system learns about your preferences and makes sure that the next time you see content it'll be more relevant to your interests. A lot of the time, by the custom-tailored content we mean advertisements. Same goes for LinkedIn - ever noticed the "Ads you may be interested in" section to the right on your LinkedIn profile?
Can we use adaptive media in MOOCs? The benefits are obvious - with hundreds of thousands of enrollees, it is impossible to adequately staff the course with enough qualified facilitators. Adaptive media could be used together with the teachers' input and social media such as forums, social grading, and study groups. The purpose, instead of displaying personalized ads, would be to make sure each student learns as much as possible from the interactive learning experience, in his or her unique way. There could also be a multitude of positive extras - reduced dropout rate, higher engagement, higher enrollment for adaptive MOOCs.
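One very simple way to make the idea of "algorithmically choosing what to show next" concrete is an epsilon-greedy bandit. The sketch below is only an illustration under stated assumptions, not a description of anything edX actually runs: the variant names, the notion of a numeric "learning gain" per interaction, and the class `EpsilonGreedySequencer` are all hypothetical.

```python
import random


class EpsilonGreedySequencer:
    """Choose the next content variant, favoring variants with the highest
    observed average learning gain (a hypothetical numeric signal)."""

    def __init__(self, variants, epsilon=0.1, seed=None):
        self.variants = list(variants)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {v: 0 for v in self.variants}
        self.mean_gain = {v: 0.0 for v in self.variants}

    def choose(self):
        # Explore a random variant with probability epsilon,
        # otherwise exploit the variant with the best estimate so far.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.variants)
        return max(self.variants, key=lambda v: self.mean_gain[v])

    def update(self, variant, gain):
        # Incremental running mean of the gains observed for this variant.
        self.counts[variant] += 1
        n = self.counts[variant]
        self.mean_gain[variant] += (gain - self.mean_gain[variant]) / n


# Toy simulation with made-up "true" gains for three presentation orders.
true_gain = {"video_first": 0.3, "quiz_first": 0.5, "example_first": 0.4}
sequencer = EpsilonGreedySequencer(true_gain, epsilon=0.1, seed=0)
for _ in range(2000):
    variant = sequencer.choose()
    observed = sequencer.rng.gauss(true_gain[variant], 0.1)  # noisy feedback
    sequencer.update(variant, observed)
print(sequencer.counts)
```

A real system would need a defensible measure of learning and would probably condition on what is known about each student. The point here is only the explore/exploit loop.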
Isn't this interesting?
Tags: algorithm design, experiment design, MOOC, product improvement
Registration numbers in Health in Numbers and Health and Environmental Change were similar in magnitude. As of September 8, 2013, edX had records of 61,181 students who had registered for Health in Numbers and 53,340 students who had registered for Health and Environmental Change. The difference can be attributed in part to registrations in Health in Numbers that have happened since the course “wrapped,” after all graded materials were due.
Figure 1 shows the cumulative registration for Health in Numbers and Health and Environmental Change. The enrollment period before launch was over 60 days longer for Health and Environmental Change. Health in Numbers, however, had a slightly higher registration rate during its shorter period. The two courses had different levels of enrollment at launch (34,970 for Health in Numbers and 45,390 for Health and Environmental Change) and similar numbers at 120 days after course launch, just after each course wrapped (54,007 for Health in Numbers and 53,340 for Health and Environmental Change).
Figure 1. Cumulative enrollment through 120 days after course launch for Health in Numbers (n=54,007) and Health and Environmental Change (n=53,340).
Since Health in Numbers started nine months before Health and Environmental Change, there was an almost ten-month span between the wrap of Health in Numbers and the data collection for this report, and only three weeks between the wrap of Health and Environmental Change and data collection. During this post-wrap period, over 7,000 people signed up for Health in Numbers. We include these 7,000 post-wrap registrants from Health in Numbers in many of our analyses below because they are an interesting constituency to consider. Those who registered after the due date could not have a complete course experience: they did not have access to Professor Pagano’s textbook, the free version of Stata, the discussion forums, or the final exam, and they could not earn a certificate. However, they could view the lectures, submit answers to problems and view correct answers, and take the practice exams. These students could have a meaningful, self-directed learning experience.
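As a quick arithmetic check on the "over 7,000" figure, the post-wrap registrations can be approximated from the counts quoted above. This is only a sketch: it assumes the 120-day enrollment figure is a fair stand-in for enrollment at wrap, and the variable names are ours.

```python
# Registration counts quoted in the text.
total_at_data_collection = {
    "Health in Numbers": 61_181,
    "Health and Environmental Change": 53_340,
}
enrolled_at_120_days = {
    "Health in Numbers": 54_007,
    "Health and Environmental Change": 53_340,
}

# Approximate post-wrap registrations: total on record minus the 120-day count.
post_wrap = {
    course: total_at_data_collection[course] - enrolled_at_120_days[course]
    for course in total_at_data_collection
}
print(post_wrap)
# {'Health in Numbers': 7174, 'Health and Environmental Change': 0}
```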
In Figures 2, 3, and 4, we present data about the demographic characteristics of students in Health in Numbers and Health and Environmental Change, and we compare these characteristics to the averaged percentages from the five initial HarvardX large-scale courses (Justice, Heroes, Computer Science, Health in Numbers, and Health and Environmental Change). Health and Environmental Change was one of the more gender-balanced courses among the initial HarvardX offerings, with 49% female registrants; Health in Numbers was more typical with 43% female registrants. The multiple course report has additional details about course-specific demographics for HarvardX and MITx courses.
Figure 2. Gender distribution in Health in Numbers (n=57,536; 3,645 missing), Health and Environmental Change (n=50,114; 3,226 missing) and the average distribution of five 2012–2013 HarvardX large-scale courses (n=384,060; 35,254 missing).
Figure 3. Distribution of highest degree earned in Health in Numbers (n=55,638; 5,543 missing), Health and Environmental Change (n=48,262; 5,078 missing), and five HarvardX large-scale courses (n=368,579; 50,735 missing).
Figure 4. Age distribution of Health in Numbers (n=57,000; 4,181 missing), Health and Environmental Change (n=53,340; 3,740 missing), and the average of the first five HarvardX large-scale courses (n=342,048; 77,266 missing).
As with other HarvardX courses, registrants in the HSPH offerings were highly educated, especially in Health in Numbers. Over 85% of students registered for Health in Numbers held at least a Bachelor’s degree. In terms of registrants’ highest degree attained, 36% of registrants had a Bachelor’s degree, an additional 38% had a Master’s degree, and an additional 12% of students possessed doctorates. The proportion of advanced degree holders (Master’s and PhD) in Health in Numbers was the highest of any of the first HarvardX courses. Health and Environmental Change was more typical of HarvardX courses: 39% of students had a Bachelor’s degree, an additional 29% held a Master’s degree, and an additional 6% had a PhD.
As expected, given their higher-than-average educational attainment, Health in Numbers skewed somewhat older than other HarvardX courses, with an especially high proportion of students in their 30s. Health and Environmental Change more closely tracked the average distribution of the first five large-scale HarvardX courses.
Both HSPH courses were global enterprises. Of the students with identifiable countries of residence (detected through geo-locating IP addresses and parsing self-reported addresses), 75% came from outside of the United States, with the second-largest cohort in each course coming from India. The proportion of international students is higher than in the other early HarvardX large-scale courses, where approximately two-thirds of identifiable registrants came from outside the United States.
Table 3. Country of residence for registrants with identifiable addresses in Health and Environmental Change (n=48,360; 4,980 missing) and Health in Numbers (n=58,520; 2,661 missing).
Of the students who registered for these HSPH courses, the degree and kind of participation varied considerably. Some students completed all the materials available in the course; others focused on videos and readings while avoiding assessments; and still others focused mostly on taking assessments. To illustrate these diverse course-taking patterns, Figures 5 and 6 show scatterplots of student activity on two dimensions. On the y-axis, we plot student grades, and on the x-axis, we plot the number of “chapters” that were viewed at least once by the student (the points in the plot are jittered to show density). Chapters are the highest-level organizational unit in the edX courseware; Health in Numbers had 16 chapters: one for each of the 12 weeks, one for a practice exam, one for the real exam, one introduction, and one collection of videos from guest lecturers.
Figure 5. Scatterplot of grade and chapters viewed for Health in Numbers registrants (n=61,181).
Figure 6. Scatterplot of grade and chapters viewed for Health and Environmental Change registrants (n=53,340).
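For readers unfamiliar with jittering: many students share exactly the same pair of values (for example, zero chapters and a grade of zero), so their points would be drawn on top of each other. Adding a small random offset to each point spreads them out and makes density visible. The sketch below shows one common way to do this with uniform noise; the function name, the noise widths, and the sample records are illustrative assumptions, not the procedure used to produce Figures 5 and 6.

```python
import random


def jitter(values, width, seed=None):
    """Return a copy of `values` with uniform noise in [-width, +width] added."""
    rng = random.Random(seed)
    return [v + rng.uniform(-width, width) for v in values]


# Hypothetical records; the first three students coincide exactly.
chapters_viewed = [0, 0, 0, 16, 8]
grades = [0.0, 0.0, 0.0, 1.0, 0.6]

chapters_jittered = jitter(chapters_viewed, width=0.4, seed=1)
grades_jittered = jitter(grades, width=0.02, seed=2)

for c, g in zip(chapters_jittered, grades_jittered):
    print(f"{c:6.2f} {g:6.3f}")
```

The noise width is kept below half the spacing between adjacent chapter counts, so jittered points remain visually attributable to their original value.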
Within these plots, we identify four interesting categories of students and several interesting specific cases. In the top of the figure, we show those students who earned a certificate in the course. In the top right, we highlight a “completionist,” a student who had the highest possible grade and also viewed all of the chapters in the course. In the top left of this top section, we highlight an “optimizer,” a student who earned a certificate with a grade exactly at the cutoff score while opening a small number of chapters in the courseware.
In the lower sections of the plots, we show students who did not earn a certificate, and we distinguish between students who viewed more than half of the chapters in the course and those who viewed fewer than half. We define those who viewed more than half of the course but did not earn a certificate as “explorers.” In the bottom right, we highlight the students who viewed all of the chapters in the courseware but answered 0 graded questions correctly as “listeners,” borrowing an MIT term for auditors. We define those who viewed less than half of the courseware and did not earn a certificate as having “viewed” the course. In the bottom left, we define those students who viewed zero chapters as “only registered.” While these points are clustered in a small space on the scatterplot, they represent a substantial portion of students in each course: 22,327 in Health in Numbers and 30,496 in Health and Environmental Change.
One of the signature features of these plots is that students can be found at nearly every possible location in the possibility space. Some students focused on earning a certificate by targeting assessment questions; some students viewed all parts of the course, eschewing all assessment; some students dabbled in various dimensions; and some students successfully completed all parts of the course.
Motivated by this variation (found throughout all of the initial HarvardX courses), we defined four subsamples of participants to investigate in this series of HarvardX reports: Registrants, Viewers, Explorers, and Certified. In Figure 7, we present the numbers in each group as disjoint subsets in Health in Numbers and Health and Environmental Change.
Figure 7. Numbers of participants in Health in Numbers and Health and Environmental Change presented in four disjoint subsets of Only Registered, Only Viewed, Only Explored, and Certified.
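The four disjoint groups can be written down as a simple classification rule. The sketch below is our reading of the definitions in the text, not the code used for the report. In particular, the text does not say how a student who viewed exactly half of the chapters is counted, so assigning that case to "Only Viewed" is an assumption, as are the function name and the sample records.

```python
from collections import Counter


def classify(chapters_viewed, total_chapters, certified):
    """Assign a registrant to one of the four disjoint groups."""
    if certified:
        return "Certified"
    if chapters_viewed == 0:
        return "Only Registered"
    # Strictly more than half of the chapters, without a certificate.
    if 2 * chapters_viewed > total_chapters:
        return "Only Explored"
    # Assumption: exactly half falls here, since the text leaves it unspecified.
    return "Only Viewed"


# Hypothetical records: (chapters viewed, certificate earned) in a 16-chapter course.
records = [(0, False), (3, False), (8, False), (12, False), (16, True), (5, True)]
groups = Counter(classify(ch, 16, cert) for ch, cert in records)
print(groups)
```

Checking `certified` first is what keeps the groups disjoint: an "optimizer" who earned a certificate after opening only a few chapters is counted as Certified, not as Only Viewed.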
Examining student demographics through the lens of these categories reveals patterns of some interest. Figure 8 shows that in Health in Numbers, the female percentage was lower overall (43%) but slightly higher for certificate earners at 46%. In Health and Environmental Change, the female percentage was higher overall (49%) but slightly lower for certificate earners at 45%.
Figure 8. Percentage of female students in four disjoint groups of Health in Numbers (n=61,181) and Health and Environmental Change (n=53,340).
We found a relationship between certificate attainment and level of education in both courses. In Figure 9, we show the distribution of the proportion of students with at least a Bachelor’s degree in our four disjoint subsets, and we see that in both courses—more strongly in Health in Numbers—certificate earners were disproportionately more highly educated. Students who earned a certificate in either course also tended to be older. The median age among only registered students was 27 in Health and Environmental Change and 29 in Health in Numbers. The median age among certified students was 29 in Health and Environmental Change and 31 in Health in Numbers.
Figure 9. Percentage of students with Bachelor’s degrees in four disjoint groups of Health in Numbers (n=61,181) and Health and Environmental Change (n=53,340).