# Introduction:

Today I am going to address an interesting phenomenon that occurred at the Educational Data Mining conference, which took place in Chania, Greece this summer. I have two desired audiences, maybe three. First is the community of people who should be attending these conferences but aren’t aware of the exciting work being presented there. The second audience is the community that attends the conferences, who I hope will appreciate what I am suggesting for next year’s EDM. The last potential audience is the people who know me but have no clue what I do all day when I say I am working on my PhD.

# Conference Details:

So for the first audience, if you are interested in education or are interested in the next generation of educational software, aka Interactive Learning Environments, you may want to check out the following conferences – loads of exciting things happening there and a very nice, welcoming community.

Intelligent Tutoring Systems – 2014

Educational Data Mining – 2013 (Co-located with AIED)

Artificial Intelligence in Education – 2013

EDM is especially nice because you can download their entire proceedings for free from their webpage. In addition, EDM is typically co-located with either ITS or AIED; the latter two conferences alternate each year, ITS on even years, AIED on odd years.

# Bayesian Knowledge Tracing and Intelligent Tutors:

Alright, next. Audience one, here is the heart of Bayesian knowledge tracing, real quick. Audience two, you can probably skip this section, as it covers much the same ground as the introductions given for many of the papers in **Session II: Knowledge Tracing** at the EDM conference. My suggestion is to scroll down to “Then What Happened?”.

When building an intelligent tutor, a KC-model, i.e. a list of which Knowledge Components (KCs, or skills) exist, is typically included. Let’s look at a simple case. When students start using the tutor, the value for each of their KCs is P(L0), which we will assume is zero (0). It doesn’t always have to be zero; P(L0) represents the knowledge students bring into the tutor, and zero means you have no confidence the student knows any of the skills. Next, the student starts answering questions. Each correct answer increases your confidence that the student knows the skill, so K might become 0.2, then 0.5, then 0.7, and so on. When the student misses a question, the value decreases. Guess and slip, P(G) and P(S), are also part of this model, so missing a question after getting two questions correct reduces the value of K by a smaller amount than missing two questions and then answering one correctly would. The goal is for every student’s K to reach 1.0, which means you have 100% confidence the student knows the skill. The remaining parameter, P(T), is the probability that the student transitions from the unlearned state to the learned state after a practice opportunity.
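To make the mechanics concrete, here is a minimal sketch of the standard BKT update in Python. The parameter values (P(G)=0.2, P(S)=0.1, P(T)=0.3) are invented for illustration, not taken from any of the cited papers:

```python
# A minimal sketch of the Bayesian Knowledge Tracing update.
# Parameter values are illustrative, not from any cited paper.

def bkt_update(p_know, correct, p_guess, p_slip, p_transit):
    """Update P(L), the probability the student knows the skill,
    after observing one answer (correct=True/False)."""
    if correct:
        # A correct answer is either true knowledge (no slip) or a lucky guess.
        posterior = (p_know * (1 - p_slip)) / (
            p_know * (1 - p_slip) + (1 - p_know) * p_guess)
    else:
        # An incorrect answer is either a slip or true ignorance.
        posterior = (p_know * p_slip) / (
            p_know * p_slip + (1 - p_know) * (1 - p_guess))
    # Learning step: the student may transition to the learned state.
    return posterior + (1 - posterior) * p_transit

# Confidence climbs with correct answers; a miss after two correct answers
# pulls it back down, but guess/slip soften how hard the evidence swings it.
p = 0.0  # P(L0): no prior confidence
for answer in [True, True, False]:
    p = bkt_update(p, answer, p_guess=0.2, p_slip=0.1, p_transit=0.3)
    print(round(p, 3))  # → 0.3, then 0.761, then 0.499
```

Note how the miss at the end does not drop the estimate back to zero: the slip parameter lets the model treat one wrong answer as possibly a careless mistake.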

When a person builds an intelligent tutor, they program all the Knowledge Components into the tutor, and then they use BKT to determine whether a person knows a skill. Determining what those skills are, how many there are, whether a question involves one skill or two, etc. is hard to do, requires experts, and there is a lot of research in this area. That’s the downside.

The upside, however, has two points. If you have a good tutor (which can be a big if), where K is modeled well and there is a question for each KC, your students get two benefits. The first is that you can build a tutor that will not allow students to proceed without successfully demonstrating that they possess some desired skill (a KC). The second benefit is that once they’ve demonstrated they have the skill, questions requiring that skill will ideally no longer be asked of the student. That likely means more efficient learning, because students don’t train excessively on skills they have already mastered, allowing them to move on to the next topic, in which they may not yet have proficiency.

If you are interested in reading more, there is the original work by Corbett and Anderson [2], or section 1.1 of the work by Nooraei et al. [3], which I found very easy to understand and likely better written than my own description of Bayesian knowledge tracing above.

## Validation:

After people have their model, they need to train it, which is to *learn* the values of P(L0), P(T), P(G), and P(S). To do this, you have data of people working in a tutor, and you have a post-test, which is used as the ground truth. The model says that if you have a 0.95 for some skill, then on the post-test you should get a question testing that skill correct. So people take some tutor data and some post-test scores and train their model, assigning values to P(L0), P(T), P(G), and P(S). Then they use the trained model to make predictions on new, unseen data and measure the errors. The goal is to reduce the errors to zero (0), because that would mean your model fits perfectly; in practice a perfect fit is impossible, but we hope to get close. Usually this is reported via the RMSE, which measures how closely the model fits the new, unseen data. For more on RMSE, check out linear regression.
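As a rough sketch of that validation step, here is how RMSE is computed over a model’s predictions. The predicted probabilities and the post-test outcomes below are made-up numbers for illustration:

```python
# Compare a model's predicted probabilities of correctness against actual
# post-test outcomes via RMSE. All numbers here are invented.
import math

def rmse(predictions, actuals):
    """Root mean squared error: how far, on average, the model's
    predictions fall from the observed outcomes."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Model's P(correct) for four post-test questions, and what the student
# actually did (1 = correct, 0 = incorrect):
predicted = [0.95, 0.80, 0.30, 0.60]
observed  = [1,    1,    0,    1]
print(round(rmse(predicted, observed), 3))  # → 0.27
```

A lower RMSE means the model’s predictions sit closer to what students actually did on the post-test; zero would be a perfect fit.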

# Then What Happened?

**OK, audience two, you can continue reading**. One of the first things people realized with Intelligent Tutoring Systems is that the BKT model is slightly too simple. It only accounts for whether the student guessed or slipped, when in reality there are a bunch of characteristics that could be considered. One such characteristic is time: if a lot of time passes between questions that train a certain skill, you might have forgotten that skill. This was studied in 2011 [4]; it turns out students do seem to forget things. Next, there are a number of adjustments you can make when training your model, i.e. *learning* the values of the parameters. So, also in 2011, researchers looked at how much history to use when training the parameters: just the last question, maybe the last 3, maybe all the previous questions a student ever answered – how many? Well, it turns out using the last 5 questions provides the best results [3], i.e. the lowest RMSE – more on this later.
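To illustrate just the windowing idea, here is a toy sketch where a knowledge estimate is computed from only a student’s last five responses. The update rule and parameters below are invented stand-ins, not the actual training procedure from [3]:

```python
# A toy illustration of using only recent history, in the spirit of [3].
# The update rule here is an invented stand-in, not the authors' method.

def estimate_knowledge(history, window=5, p0=0.0, gain=0.3):
    """Estimate knowledge from only the last `window` responses: nudge
    confidence up on a correct answer, down on an incorrect one."""
    p = p0
    for correct in history[-window:]:
        p = p + (1 - p) * gain if correct else p * (1 - gain)
    return p

# With a window of 5, early stumbles fall out of the estimate entirely:
early_struggles = [False, False, False, True, True, True, True, True]
recent_only     = [True, True, True, True, True]
print(estimate_knowledge(early_struggles) == estimate_knowledge(recent_only))  # → True
```

The point of the sketch: a student who struggled early but has since answered five in a row correctly looks identical to a student who never struggled, which is arguably the right call if what you care about is their knowledge now.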

# Knowledge Tracing Models and the Latest:

We have seen some interesting discoveries about knowledge tracing, and it is still one of the leading methods to *insert* intelligence into our *intelligent* tutoring systems. If we fast forward a year, into 2012, what might we expect? You might expect loads and loads more models, each considering more and more parameters, or adjusting the number of KCs being considered, or changing the way the model is trained, and you would kinda be right.

# The Challenge:

Then an *interesting* thing happened, kudos to Joe Beck for his foresight. During a panel session, Joe Beck pointed out (and I’ll paraphrase): I have a magic black box and it’s the perfect knowledge tracing model with 100% accuracy – what do you do with it?

See, knowledge tracing lets you do two things: A) prevent people from working extra problems when they already have mastery of a skill, and B) make sure students don’t move on to new material until they show mastery. Granted, those are great things to have, but the key point here is that the basic BKT model might already be doing well enough, and a much more complicated model may be only marginally better. What Beck asked with his question is: if your new model accounts for hundreds of parameters, considers the moon’s rotation and the time of day, etc., but only improves accuracy by 1%, what are we really gaining by developing the more complex model? In terms of real-world effect, maybe not much, maybe even nothing. Perhaps you saved a student from working one more problem than they needed to, or from working one fewer problem than they should have – or maybe not even that much. Maybe a student worked 0.3 more problems than they should have. But what the hell does 0.3 too many problems even mean? In other words, let’s not forget our real-world effect sizes. Solving 30% of a problem never gave my teachers any confidence I had mastered a skill.

After that, another great thing happened, thanks to Jung In Lee and Emma Brunskill. Ironically enough, I don’t believe the greatest contribution of their work was the intended contribution, though that part was good too. The more significant contribution was the format in which they analyzed their results. During the question portion its value was pointed out, and it was like a universal light bulb going off in the head of everyone in the room – “wow, yeah, we should be doing that!”. Lee and Brunskill wrote a short paper titled “The Impact on Individualizing Student Models on Necessary Practice Opportunities” [5], which focused on the parameters of the BKT model: P(L0), P(T), P(S), and P(G). The problem, they argued, is that we train those parameters on a population, i.e. a bunch of people – like 200, or even 1000, maybe more. The better thing to do, they argued, would be to train the parameters for each individual student on that student’s own data, meaning my four parameters (P(L0), etc.) are likely different from yours. Which makes sense: maybe I make mistakes more often, meaning the probability of a slip is higher in my case, or perhaps you learn faster than I do, so your P(T) is higher than mine. Their results in a nutshell: yeah, individual parameters matter.

The beauty of their analysis came about because, to argue that their individualized parameters were meaningful, they needed to compare the number of problems predicted as necessary for a student to obtain mastery of a skill against the number predicted for that same student using the population parameters. In other words, the population parameters might report that some student X has to do 10 problems, whereas the individual parameters report that student X should do 6 problems instead. So the significant contribution comes from a great graph depicting the differences between the two models in terms of how many more or fewer problems a student should do.

In figure 2, the authors compare the two models: the model with the individual parameters and the model with the population parameters. The key here is not the actual results shown by the graph, but the graphs themselves. In the past, people compared models in terms of RMSE, root mean squared error, which is the square root of the mean of the squared errors – in layman’s terms, roughly how far the data points sit from the model’s predictions, or how *wrong* your model is compared to the data. Which isn’t a bad idea, but what does it mean? Sure, a model with an RMSE of 0.4346 is better than another model with 0.4442, but what is an educator supposed to do with that?

Well, now we will hopefully know. We can follow the move made by Jung In Lee and Emma Brunskill and start comparing models in terms of the expected number of problems the models suggest students should solve. Their paper goes on to show that, for some students, the population model suggests working more problems than the individualized model does, while other students are moved on prior to mastery compared to the individual model.
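As a sketch of what an “expected problems” comparison could look like: count how many practice opportunities a model demands before P(L) crosses a mastery threshold. The parameter values are invented, and the mastery criterion here is deliberately simplified compared to Lee and Brunskill’s actual procedure:

```python
# Compare models by expected practice opportunities rather than RMSE.
# Parameter values are invented; Lee and Brunskill's actual analysis is
# more involved than this simplified mastery criterion.

MASTERY = 0.95  # a common mastery threshold for P(L)

def opportunities_to_mastery(p_l0, p_transit, threshold=MASTERY):
    """Count practice opportunities until P(L) crosses the mastery
    threshold, letting knowledge grow only via the learning rate P(T)."""
    p, count = p_l0, 0
    while p < threshold:
        p = p + (1 - p) * p_transit
        count += 1
    return count

# Population parameters vs. hypothetical individualized parameters for a
# student who learns faster than the population average:
population = opportunities_to_mastery(p_l0=0.2, p_transit=0.2)
individual = opportunities_to_mastery(p_l0=0.2, p_transit=0.4)
print(population - individual)  # → 7 extra problems the population model assigns
```

That difference – “this student would be assigned 7 problems too many” – is something a teacher can reason about, in a way an RMSE delta of 0.01 is not.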

The graph in figure 2(a) (figure 3 in the authors’ text) shows that roughly 80 students should be doing 2+ more problems, and another 80 should be doing at least one more problem. There are also about 30 students who should be doing fewer problems – all out of a total of 265 students, a healthy percentage. So the claim that individualized parameters probably perform better is legitimate.

More importantly, future comparisons between models should reduce the importance placed on reporting the RMSE and instead emphasize reporting the difference in the expected number of questions being asked of students, because that is something people can use, and it strengthens the impact of results by making them more tangible – in a sense.

# To Beat a Dead Horse:

If you take your latest and greatest model, regardless of what parameters it considers or its RMSE, and you compare it to a few other *“standard”* models and end up with the graph shown above, no one should really care, because a teacher likely can’t ask a student to work 0.3 more problems.

## The Point of the Story:

Let Jung In Lee’s and Emma Brunskill’s figure 3(a), expected opportunities, be the new metric for measuring the quality of a model. I don’t suspect RMSE will be disappearing, and it shouldn’t – we need it for comparison to older models, and it’s not meaningless – but the expected opportunities metric, now that’s a metric that educators can understand and make sense of.

# In 2013 I’d like to See…

A paper that is a model comparison and lit-review incorporating Lee’s & Brunskill’s expected opportunities metric for all the different knowledge tracing models that consider all the different parameters people have written about over the past half dozen years or so. It would be nice to see which parameters were added to which models, and ideally the amount of effect each parameter had in terms of the expected opportunities.

# Conclusion:

We had a great set of conferences this year – cheers, congratulations, and many thanks to the organizers and authors who submitted interesting and exciting works. I am looking forward to next year’s conference in Memphis, Tennessee.

The purpose of this article is to foster conversation and discussion; your comments, corrections of mistakes I may have made, ideas, etc. are always welcome. Please comment and share, which can be done via many internet IDs – Facebook, Google, etc.

Thank you for reading.

# References:

1. Zachary A. Pardos and Neil T. Heffernan. 2010. Modeling individualization in a bayesian networks implementation of knowledge tracing. In *Proceedings of the 18th international conference on User Modeling, Adaptation, and Personalization* (UMAP’10), Paul Bra, Alfred Kobsa, and David Chin (Eds.). Springer-Verlag, Berlin, Heidelberg, 255-266. DOI=10.1007/978-3-642-13470-8_24 http://dx.doi.org/10.1007/978-3-642-13470-8_24

2. Albert T. Corbett, John R. Anderson. 1994. Knowledge tracing: Modeling the acquisition of procedural knowledge. *User Modeling and User-Adapted Interaction*, Vol. 4, No. 4 (December 1994), pp. 253-278. doi:10.1007/BF01099821

3. Nooraei, B. B.; Pardos, Z. A.; Heffernan, N. T. & de Baker, R. S. J. 2011. Less is More: Improving the Speed and Prediction Power of Knowledge Tracing by Using Less Data. *In Proceedings of Educational Data Mining.* pp. 101-110.

4. Qiu, Y.; Qi, Y.; Lu, H.; Pardos, Z. A. & Heffernan, N. T. 2011. Does Time Matter? Modeling the Effect of Time with Bayesian Knowledge Tracing. *In Proceedings of Educational Data Mining.* pp. 139-148.

5. Lee, Jung In; Brunskill, Emma. 2012. The Impact on Individualizing Student Models on Necessary Practice Opportunities. *In Proceedings of Educational Data Mining.* pp. 118-121.
