Identifying and Correcting Programming Language Behavior Misconceptions

Misconceptions about core linguistic concepts like mutable variables, mutable compound data, and their interaction with scope and higher-order functions seem to be widespread. But how do we detect them, given that experts have blind spots and may not realize the myriad ways in which students can misunderstand programs? Furthermore, once identified, what can we do to correct them? In this paper, we present a curated list of misconceptions, and an instrument to detect them. These are distilled from student work over several years and match and extend prior research. We also present an automated, self-guided tutoring system. The tutor builds on strategies in the education literature and is explicitly designed around identifying and correcting misconceptions. We have tested the tutor in multiple settings. Our data consistently show that (a) the misconceptions we tackle are widespread, and (b) the tutor appears to improve understanding.


INTRODUCTION
A large number of widely used modern programming languages share a common semantic basis:
• lexical scope
• nested scope
• eager evaluation
• sequential evaluation (per "thread")
• mutable first-order variables
• mutable first-class structures (objects, vectors, etc.)
• higher-order functions that close over bindings
• automated memory management (e.g., garbage collection)
This semantic core can be seen in languages from "object-oriented" languages like C# and Java, to "scripting" languages like JavaScript, Python, and Ruby, to "functional" languages like the ML and Lisp families. Of course, there are sometimes restrictions (e.g., Java has restrictions on closures) and extensions (such as the documented semantic oddities of JavaScript and Python [Bernhardt 2012; Guha et al. 2010; Politz et al. 2012, 2013]). Still, this semantic core bridges many syntaxes.
Authors' addresses: Kuang-Chen Lu, Department of Computer Science, Brown University, 115 Waterman Street, Providence, RI, 02912, United States of America, kuang-chen_lu@brown.edu; Shriram Krishnamurthi, Department of Computer Science, Brown University, 115 Waterman Street, Providence, RI, 02912, United States of America, shriram@brown.edu.

Table 1. SMoL's primitive operators.

Operators              Meaning
+ - * /                Arithmetic operators
< > <= >=              Number comparison
mvec                   Create a (mutable) vector (a.k.a. array)
vec-ref                Look up a vector element
vec-set!               Replace a vector element
vec-len                Get the length of a vector
mpair                  Create a 2-element (mutable) vector
left, right            Look up the first/second element of a 2-element vector
set-left!, set-right!  Replace the first/second element of a 2-element vector
eq?                    Equality

A Note on Terminology. We use the term "behavior" to refer to the meaning of programs in terms of the answers they produce. A more standard term for this would, of course, be "semantics". However, the term "semantics tutor" might mislead some readers into thinking it teaches people to read or write a formal semantics, e.g., an introduction to "Greek" notation. Because that is not the kind of tutor we are describing, to avoid confusion, we use the term "behavior" instead.

BACKGROUND: TUTORING SYSTEMS AND PEDAGOGIC TECHNIQUES
There is an extensive body of literature on tutoring systems ([VanLehn 2006] is a quality survey), and indeed whole conferences are dedicated to them. We draw on this literature. In particular, it is common in the literature to talk about a "two-loop" architecture [VanLehn 2006], where the outer loop iterates through "tasks" (i.e., educational activities) and the inner loop iterates through UI events within a task. We follow the same structure in our tutor (Section 6.1).
Many tutoring systems focus on teaching programming (such as the well-known and heavily studied LISP Tutor [Anderson and Reiser 1985]), and in the process undoubtedly address some program behavior misconceptions. Our SMoL Tutor differs in a notable way: it does not try to teach programming per se. Instead, it assumes a basic programming background and focuses entirely on program behavior misconceptions and correcting them. We are not aware of a tutoring system (in computer science) that has this specific design.
The SMoL Tutor is firmly grounded in one technique from cognitive and educational psychology. The fundamental problem is: how do you tackle a misconception? One approach is to present "only the right answer", for fear that discussing the wrong conception might actually reinforce it. Instead, there is a body of literature, starting from [Posner et al. 1982], that presents a theory of conceptual change, at whose heart is the refutation text. A refutation text tackles the misconception directly, discussing and providing a refutation for the incorrect idea. Several studies [Schroeder and Kucera 2022] have shown their effectiveness in multiple other domains.
The SMoL Tutor's content structure is also influenced by work on case comparisons (which draws analogies between examples). [Alfieri et al. 2013] suggests that asking (rather than not asking) students to find similarities between cases, and providing principles after the comparisons (rather than before or not at all), are associated with better learning outcomes.

THE SMOL LANGUAGE
SMoL is designed to capture common features of many modern languages. The syntax of SMoL is presented in Figure 1, where t stands for terms, d stands for definitions, e stands for expressions, c stands for constants (i.e., numbers, booleans, and strings), and x and f are identifiers (variables). The last kind of expression (i.e., (e e ...)) is function application.
SMoL's Lispy syntax is an artifact of the course (Section 4) using Racket [Friedman et al. 2001; Krishnamurthi 2007]; however, it also proved to be pedagogically valuable. We have found the parentheses useful when discussing scope in conjunction with local-binding features like let. This avoids the various confusing "variable lifting" semantics found in languages like Python ([Politz et al. 2013]; see also posts like [froadie 2022]), where the actual defined range of a variable is not apparent from the source code.
Nevertheless, most SMoL programs are easy to translate to other languages. We present machine-translated Python and JavaScript versions of our programs in Translated_Programs.html in the supplementary materials. Furthermore, we intend to make a multi-lingual tutor in the future (Section 12).
The semantics of SMoL is as described in Section 1: essentially, it is the semantics of Scheme and the dynamic part of SML/OCaml. SMoL includes limited primitive operators (Table 1) to work with numbers, strings, and vectors. It provides only one equality operator, which tests for exact equality between atomic values and for pointer equality between other values.
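Since SMoL programs translate readily to Python (Section 3), the intent of this single equality operator can be sketched as a Python analogue. The helper name smol_eq and the exact set of "atomic" types are our own illustrative assumptions, not part of SMoL's implementation:

```python
# A rough Python analogue of SMoL's single eq? operator (an assumption
# for illustration; SMoL itself is Lispy, not Python).
def smol_eq(a, b):
    atomic = (int, float, bool, str)
    # Atomic values (numbers, booleans, strings): exact/value equality.
    if isinstance(a, atomic) and isinstance(b, atomic):
        return a == b
    # Compound values (vectors, pairs, etc.): pointer (identity) equality.
    return a is b

assert smol_eq(1, 1)                 # same atomic value
assert not smol_eq([1, 2], [1, 2])   # equal contents, distinct vectors
v = [1, 2]
assert smol_eq(v, v)                 # the very same vector
```

The last two assertions are the crux: two vectors with identical contents are not eq?, but two references to one vector are.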
Students are given a working implementation of SMoL, built as a #lang language inside Racket [Felleisen et al. 2018]. This language provides only the defined syntax and semantics of SMoL, with no other Racket features present. (As in Racket, arguments evaluate left-to-right, to give stateful programs an unambiguous semantics.) The implementation is available online: https://github.com/shriram/smol.

PAPER ROADMAP
This paper describes a four-year effort to obtain a quality instrument to measure misconceptions and to build a tutor to address them.
The first step was a two-year process to generate (a) programs that tend to trip up students and (b) incorrect responses to those programs (Section 5), using a tool for the purpose (Section 5.1). These were then curated (Section 5.2) to produce an instrument that was tested for one more year (Section 5.3). This underwent one more round of curation to produce the final instrument (Section 6.2).
We then present the tutor (Section 6). In particular, we present the list of misconceptions, grounded in student data, that are the focus of this project (Section 7). We also evaluate the tutor's effectiveness at correcting them (Section 8).
All this work is done with students in a "Principles of Programming Languages" class at a selective, private US university. The class has about 70-75 students per year. It is not required, so students take it by choice. Virtually all are computer science majors. Most are in their third or fourth year of tertiary education; about 10% are graduate students. All have had at least one semester of imperative programming, and most have significantly more experience with it. Most have had close to a semester of functional programming. The student work described here was required, but students were graded on effort, not correctness.
Naturally, we should wonder to what extent the demographic affects the results: we may simply be studying weaknesses at this institution! We therefore work with two other populations (Section 9). In both, we find similar results.

GENERATING AND COLLATING PROBLEMS
In Section 10, we discuss several papers that have provided reports of student misconceptions with different fragments of SMoL. However, it is difficult to know how comprehensive these are. While some are unclear on the origin of their programs, they generally seem to be expert-generated.
The problem with expert-generated lists is that they can be quite incomplete. Education researchers have documented the phenomenon of the expert blind spot [Nathan et al. 2001]: experts simply do not conceive of many learner difficulties. Thus, we need methods to identify problems beyond what experts conceive.
Additionally, in this paper, we intentionally use the word misconception rather than mistake. A mistake can happen for any reason (e.g., selecting the wrong answer from a menu). A misconception, however, implies a conceptual problem: the student has formed an incorrect concept in their head. For instance, they may think that mutable structures are copied on function calls, or that scope is resolved dynamically. This requires probing what they are thinking.
Finally, we are inspired by the significant body of education research on concept inventories [Hestenes et al. 1992] (with a growing number for computer science, as a survey lists [Taylor et al. 2014]). In terms of mechanics, a concept inventory is just an instrument consisting of multiple-choice questions (MCQs), where each question has one correct answer and several wrong ones. However, the wrong ones are chosen with great care. Each one has been validated so that if a student picks it, we can quite unambiguously determine what misconception the student has. For instance, if the question is "What is sqrt(4)?", then 37 is probably an uninteresting wrong answer, but if people appear to confuse square-roots with squares, then 16 would be present as an answer. All these, however, add up to a somewhat challenging demand. We want to produce a list of questions (each one an MCQ) such that (1) we can get past the expert blind spot, (2) we have a sense of what misconceptions students have, and (3) we can generally associate wrong answers with specific misconceptions, approaching a concept inventory.

Generating Problems Using Quizius
Our main solution to the expert blind spot is to use the Quizius system [Saarinen et al. 2019]. In contrast to the very heavyweight process (involving a lot of expert time) that is generally used to create a concept inventory, Quizius uses a lightweight, interactive approach to obtain fairly comparable data, which an expert can then shape into a quality instrument.
In Quizius, experts create a prompt; in our case, we asked students to create small but "interesting" programs using the SMoL language. Quizius shows this prompt to students and gathers their answers. Each student is then shown a set of programs created by other students and asked to predict (without running it) the value produced by each program. Students are also asked to provide a rationale for why they think it will produce that output.
Quizius runs interactively during an assignment period. At each point, it needs to determine which previously authored program to show a student. It can either "exploit" a given program that already has responses or "explore" a new one. Quizius thus treats this as a multi-armed bandit problem [Katehakis and Veinott 1987] and uses that to choose a program.
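The paper does not specify Quizius's actual selection policy, so as a minimal sketch of the explore/exploit idea behind the bandit framing, an epsilon-greedy chooser might look like the following. The function name, the use of response counts as the "value" of an arm, and the epsilon parameter are all our own assumptions for illustration:

```python
import random

def choose_program(programs, response_counts, epsilon=0.2, rng=random):
    """Pick the next program to show a student: usually one that already
    has responses ('exploit'), occasionally a fresh one ('explore')."""
    unseen = [p for p in programs if response_counts.get(p, 0) == 0]
    if unseen and rng.random() < epsilon:
        return rng.choice(unseen)   # explore: a program with no data yet
    # exploit: deepen the response pool of the most-answered program
    return max(programs, key=lambda p: response_counts.get(p, 0))
```

With epsilon = 0 this always exploits; real bandit policies (e.g., UCB or Thompson sampling) balance the two adaptively rather than with a fixed rate.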
The output from Quizius is (a) a collection of programs; (b) for each program, a collection of predicted answers; and (c) for each answer, a rationale. Clustering the answers is easy (after ignoring some small syntactic differences). Thus, for each cluster, we obtain a set of rationales.
After running Quizius in the course (Section 4), we took over as experts. Determining which is the right answer is easy. Where expert knowledge is useful is in clustering the rationales. If all the rationales for a wrong answer are fairly similar, this is strong evidence that there is a common misconception that generates it. If, however, there are multiple rationale clusters, that means the program is not discriminative enough to distinguish the misconceptions, and it needs to be further refined to tell them apart. Interestingly, even the correct answer needs to be analyzed, because sometimes correct answers do have incorrect rationales (again, suggesting the program needs refinement to discriminate correct conceptions from misconceptions).
Prior work using Quizius [Saarinen et al. 2019] finds that students do author programs that the experts did not imagine. In our case, we seeded Quizius with programs from prior papers (Section 10), which gives the first few students programs to respond to. However, we found that Quizius significantly expanded the scope of our problems and misconceptions. In our final instrument, most programs were directly or indirectly inspired by the output of Quizius.

Collating Problems
While Quizius is very useful in principle, it also produced data that needed significant curation, for the following reasons:
• A problem may have produced diverse outputs simply because it was written in a very confusing way. Such programs do not reveal any useful behavior misconceptions, and must therefore be filtered out. For instance:

(defvar x 1)
(defvar y 2)
(defvar z 3)
(deffun (sum a ...) (+ a ...))
(sum x y z)

A reader might think that sum takes variable arguments (so the program produces 6), but in fact ... is a single variable, so this produces an arity error.
• Some programs relied on (or stumbled upon) intentionally underspecified aspects of SMoL, such as floating-point versus rational arithmetic. While these are important to programming in general, we considered them outside the scope of SMoL (due to their lack of standardization). Mystery languages (Section 10) are a good way to explore these features.
• A problem may have produced diverse outputs simply because it is hard to parse or to (mentally) trace its execution. One example was a 17-line program with 6 similar-looking and -named functions. As another example:

(defvar a (or (/ 1 (- 0.25 0.25)) (/ 1 0.0)))
(defvar b (and (/ 1 (- 0.25 0.25)) (/ 1 0.0)))
(defvar c (and

This program is not only confusing, it also tests interpretations of (a) exact versus inexact numbers and (b) truthy/falsy values, leading to significant (but not very useful) answer diversity.
• As noted above, a program's wrong (or even correct) answers may correspond to multiple (mis)conceptions. In these cases, the program must be refined to be more discriminative; and so on.
We therefore manually curated the Quizius output to address these issues.

The SMoL Quizzes
Having curated the output, we had to confirm that these programs were still effective! That is, they needed to actually find student errors.
We therefore turned the programs into a set of quizzes (in the US sense: namely, a brief test of knowledge) that we call the SMoL Quizzes. There were three quizzes, ordered by linguistic complexity. The first consisted of only basic operators and first-order functions. The second added variable and structure mutation. The third added lambda and higher-order functions.

Table 2. Some questions and student answers from Quiz 1. Each table row is a program and the (relative) frequencies of student answers. Answers that can be considered correct are marked with *. Wrong answers that we discuss are marked with †.
The goal of the SMoL Quizzes was to confirm that the aforementioned processes of cleansing and enriching the problems were successful. The quizzes were therefore administered in the third year of this project in the same course. Figure 2 shows a sample question. Every question got an "Other" option. If chosen, the quiz gave the user a text box with the caption "Please specify". The goal here was to record any other answers, which in turn might lead to fresh misconceptions.
Question orders were partially randomized. We wanted students to get some easy, warm-up questions initially, so those were kept at the beginning. Similarly, we wanted programs that are syntactically similar to stay close to each other in the quiz. This is so that, when students got a second such program, they would not have to look far to find the first one and confirm that they are indeed (slightly) different, rather than wonder if they were seeing the same program again.
Students only received feedback after having completed a whole quiz. At the end of each quiz, they received both summative feedback and a refutation text that explained every program that appeared in the quiz. Students were also encouraged to run the programs, but we have little reason to believe that they did (and certainly they asked few questions on the class forum about them).
Due to space limitations, we present the entire instrument in SMoL Quizzes/instrument in the supplementary materials. Here we focus on a few programs where student choices correspond to misconceptions identified earlier, thereby also showing that the curated programs are still effective. Syntactically, #t and #f are true and false, while #(...) is a vector. We use a * to indicate the correct answer (which also matches the implementation's output).

5.3.1 Quiz 1: Basic Operators and First-Order Functions.
(2) 29% of students believe that a variable defined in a function will be available in the top-level (or global) environment after the function is called. That is, these students may have a dynamic-scope misconception. (3) 22% (11% + 11%) of students believe variables themselves can be passed as arguments and redefined inside the function. They disagree on whether the redefinition persists after the function call.
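Because SMoL translates readily to Python (Section 3), both findings can be illustrated with a small, hypothetical Python analogue rather than the original SMoL programs (the function names here are our own):

```python
# Lexical scope: a variable defined inside a function does NOT become
# available at the top level after the call (a dynamic-scope reading
# would expect otherwise).
def f():
    y = 10
    return y

f()
try:
    y               # the dynamic-scope misconception predicts 10 here
    leaked = True
except NameError:
    leaked = False
assert not leaked

# Rebinding a parameter rebinds only the local name, not the caller's
# variable: variables themselves are not passed as arguments.
def redefine(v):
    v = 99
    return v

x = 1
assert redefine(x) == 99
assert x == 1       # the caller's binding is untouched
```

Both misconceptions predict observable differences here: 10 surviving the call, or x becoming 99.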
5.3.2 Quiz 2: Adding Variable and Structure Mutation. Table 3 lists interesting programs after the addition of state. These data suggest the students have aliasing-related misconceptions:
• Up to 50% of students think vectors are copied rather than aliased.

• 16% of students think trying to construct and print a self-referring vector would error. (This program is ambiguous: constructing works in most SMoL languages, but printing may well cause a problem. These identified ambiguities are addressed in Section 6.2.)
• 31% of students think a variable is aliased by a parameter if the two variables have the same name. Perhaps interestingly, fewer students (27%) think the variable aliasing would happen if the two variables had different names.
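The correct aliasing behavior these questions probe can be sketched with a Python analogue, using lists as stand-ins for SMoL's mutable vectors (an illustration of ours, not a quiz program):

```python
# Vectors (here, Python lists) are aliased, not copied, both by
# definitions and by function calls.
a = [0, 0]
b = a               # b aliases a; no copy is made
b[0] = 7
assert a == [7, 0]  # the mutation is visible through a

def mutate(v):      # the parameter aliases the caller's vector,
    v[1] = 8        # regardless of whether names coincide
mutate(a)
assert a == [7, 8]
```

The "copied rather than aliased" misconception predicts [0, 0] at the first assertion; the name-based aliasing misconception wrongly ties this behavior to the parameter being named a.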

5.3.3 Quiz 3: Adding Closures and Higher-Order Functions.
• 21% of the students think mutating a variable defined outside a lambda can't possibly change the behavior of the lambda. Perhaps interestingly, this misconception seems to depend on how the lambda is constructed (compare the middle two programs).
• Student understanding of the let-over-lambda pattern, as seen in Table 4, is weak. The same pattern occurs in any SMoL language that permits local variable binding outside a closure.
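The two behaviors at issue, closures seeing later mutations and let-over-lambda, can be sketched in Python (our own analogue; make_counter is a hypothetical name):

```python
# A closure closes over the binding, not a copy of the value: mutating
# the variable after the lambda is created changes the lambda's answer.
x = 1
f = lambda: x
x = 2
assert f() == 2

# Let-over-lambda: a local binding outside the closure gives it
# private, persistent, mutable state.
def make_counter():
    count = 0
    def step():
        nonlocal count
        count += 1
        return count
    return step

c = make_counter()
assert (c(), c()) == (1, 2)
```

The first misconception predicts f() == 1; weak grasp of let-over-lambda typically mispredicts the counter resetting on every call.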

THE SMOL TUTOR
So far, we have focused on identifying problems. As noted earlier (Section 5.2), we provided students with refutation texts after the quizzes, but it is unclear to what extent students read, understood, or internalized these. We also wanted to determine whether other populations run into these issues. It was, however, unclear whether just quizzes would interest them. Furthermore, while the quizzes may be a useful diagnostic, our goal is not only to find faults but to improve the understanding of basic programming language behavior. We believe (and presumably so does anyone else who writes a formal semantics!) that an understanding of these basic program behaviors is important, and even more so when it comes to understanding concurrency, ownership, and other advanced features. Even more simply, misunderstanding these clearly impacts development and debugging time.
In response, we created a tutor: the SMoL Tutor (https://smol-tutor.xyz/). It is built around our quiz instruments and the detected misconceptions. We describe the Tutor from a user's perspective in Section 6.1, and explain how it was populated from the SMoL Quizzes in Section 6.2. We then discuss what we learned from its data (Section 7), and finally evaluate its effectiveness as a tutor (Section 8).

The User Experience
The Tutor covers five major topics, shown on the left in Table 5. The larger topics are further broken down into 2-3 modules. The goal was for students to spend at most 20-30 minutes per module. Our data show that in practice, students spent about 9.8 minutes (median).
Each tutorial consists of a sequence of questions. Most questions in the Tutor are interpreting questions. These questions (illustrated in Figure 3) are versions of the SMoL Quizzes (obtained by the process described in Section 6.2). In each of these questions, students are shown a program and asked to predict the program's running result(s) by answering an MCQ (with an "Other" option). After making a choice, students receive feedback. If a student chooses incorrectly, they are (a) given an explanation that is based on the misconception associated with that wrong answer (or a generic one, if there is not a specific misconception), and then (b) asked to answer a second question: (1) The second question is semantically the same as the first, but with superficial changes (e.g., variable names, constants, and operators are changed) so that the student cannot immediately guess the answer.
(2) Instead of multiple-choice, students must type the answer into a text box. (The Tutor normalizes text to accommodate variations.) This is intentional. First, we wanted to force reflection on the explanation just given, whereas with an MCQ, students could just guess. Second, we felt that students would find typing more onerous than clicking. In case students had just guessed on a question, our hope is that the penalty of having to type means, on subsequent tasks, they would be more likely to pause and reflect before choosing.
In addition to asking students questions, the Tutor along the way introduces terminology and states the true conceptions. These are the teaching goals of the Tutor. We therefore refer to these as goal sentences. Table 5 lists the (abbreviated) goal sentences for each tutorial. Readers can find the full Tutor in SMoL Tutor/instrument in the supplementary materials.
Some later tutorials include questions about earlier tutorials so that we can check whether students remember the concepts across modules. In particular, the mut-vars tutorial starts with questions about scope, and lambda starts with questions about mut-vars and vectors.
Of the modules, local is the least portable across languages. While local binding is of course present in other languages, this is mostly covered using nested defvars in earlier modules, starting with scope. The distinction central to this module (between three local-binding constructs with slightly different scopes) is primarily a focus of Lispy languages. Therefore, we exclude this module from our analysis and instruments, and indeed plan to deprecate the module in future versions.
Section 5.2 discusses the partial randomization employed by the SMoL Quizzes. In contrast, the SMoL Tutor does not randomize question order. This is because the Tutor consists of more than just questions: it also has explanatory text. This text is based on the preceding problems. Authoring it is therefore somewhat like writing a textbook, with complex dependencies that cannot easily be broken. Reordering the problems would introduce dangling references or even lead to misleading ones.

Collating Problems for the Tutor
The SMoL Tutor's interpreting questions are the final instrument of this paper. We bootstrapped it from the SMoL Quizzes, but made the following alterations. Some questions turned on fine distinctions; for instance, one hinged on the difference between

(pair? (mpair 1 2))
(pair? (mvec 1 2))

As another example, one hinged on whether or not a function's formal parameter could have the same name as the function itself. (5) We resolved ambiguities in some programs, either adding answer choices or even adding other questions to tease out different interpretations. (6) To reduce the number of concepts, we removed programs that relied upon immutable vectors and lists, because they did not seem to create problems. (For brevity, we leave these out of the presentation of SMoL in Section 3, though they are in the implementation.) (7) We removed questions related to function equality, which was not a focus of the Tutor. (8) We removed programs of a "Lispy" nature, such as one where the answer depended on whether the reader correctly understood the inequality (> 3 n): this checks whether n < 3, but some presumably vocalized it as "greater than 3, n?".
Most of these steps either modify or elide programs. Two cases, filling gaps in light of the goal sentences and resolving ambiguities, introduce programs. Starting with the 37 programs in the SMoL Quizzes, we added 52 more programs to arrive at a total of 89.
In addition to generating this set of programs, we also modified the answer options. Besides retaining the correct and incorrect answers from the SMoL Quizzes (and constructing our best guess of analogous incorrect answers for the new problems), we added more wrong answers.
The reason is as follows. The SMoL Quizzes often have very few wrong choices: of the 37 tasks, 26 have only three choices (including the correct answer and the "Other" option), seven have 4, and only four have 5 or 6. Thus, in most cases, students have a 25% or even 33% chance of just guessing the right answer or successfully using a process of elimination. By increasing the number of options, we hoped to greatly reduce the odds of getting the right answer by chance or by elimination.
It was important to add wrong answers that are not utterly implausible, because those would become easy to eliminate. Therefore, we added numeric constants mentioned in the problem, permuted some of the values in case of multiple outputs, and so on. In general, we ended up increasing the number of choices substantially: only 8 out of the 89 questions have three or four choices; 70 questions have 5-8 choices; and 11 questions have 9-14 choices. We hope this reduced both guessing and elimination, and forced students to actually think through the program. Of course, these new answers do not have a clear associated misconception, so those mistakes are given the generic explanation.

MISCONCEPTIONS DETECTED BY THE SMOL TUTOR
We now examine what we learned from the interpreting questions in terms of program behavior misconceptions. But first, we explain how we associate incorrect answers with misconceptions.

We do not present the results of begin and local (Table 5), because they focus on constructs not usually found in non-Lispy languages. They were included in the Tutor to help students write programs for the course, but are not interesting from a SMoL perspective.

Misinterpreters
In principle, an expert can identify what misconceptions a particular incorrect answer might correspond to. In practice, we found this rather difficult, for three reasons. First, with 89 programs, each with several answers, it is easy to make mistakes. Second, our own expert blind spot may prevent us from seeing an interpretation that would lead to an additional association. Finally, and specific to our case, since we had either curated or written all the programs and answers, we were likely to miss some associations we had not intended.
To address this problem, we formalized misconceptions as interpreters. That is, for each misconception we created a corresponding misinterpreter. This is an interpreter for the SMoL syntax that intentionally has a semantic error corresponding to that misconception. Put differently, a misinterpreter is like a "definitional interpreter for a misconception".
By running programs through the misinterpreters, we can more uniformly and rigorously identify all the misconceptions associated with a wrong answer. Furthermore, if we identify a new misconception (or alter a misinterpreter), it is easy to automatically re-classify all the answers. By using misinterpreters, we indeed found new interpretations for existing program-answer pairs. We provide all the misinterpreters in the artifact.
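To make the idea concrete, here is a minimal, self-contained sketch (not the authors' actual misinterpreters, which interpret full SMoL syntax). The same tiny "program" is run under a correct call rule and under a rule with the CallsCopyStructs error injected, and a student answer is classified by which interpreter's output it matches:

```python
import copy

def call(fn, args, copy_structs=False):
    # Apply fn to args; with copy_structs=True, inject the
    # CallsCopyStructs misconception: calls copy mutable structures.
    if copy_structs:
        args = [copy.deepcopy(a) for a in args]
    return fn(*args)

def program(call_rule):
    # A tiny stand-in for a quiz program: mutate a vector inside a
    # function, then read it back at the top level.
    v = [0]
    def set_first(vec):
        vec[0] = 42
    call_rule(set_first, [v])
    return v[0]

correct_answer = program(lambda f, a: call(f, a, copy_structs=False))
miscon_answer = program(lambda f, a: call(f, a, copy_structs=True))
assert correct_answer == 42   # aliasing: the mutation is visible
assert miscon_answer == 0     # the misinterpreter's predicted answer

def classify(student_answer):
    # An answer matching the misinterpreter's output (and not the
    # correct one) is evidence for that misconception.
    if student_answer == correct_answer:
        return "correct"
    if student_answer == miscon_answer:
        return "CallsCopyStructs"
    return "unexplained"
```

Re-running every program under every misinterpreter then mechanically yields all answer-to-misconception associations, which is what makes re-classification cheap when a misinterpreter changes.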

A Catalog of Misconceptions
We iteratively created our final catalog of misconceptions (from the perspective of the data in this paper). We started with misinterpreters representing the misconceptions described in Section 5.3. These are misconceptions for which we have reasonable validation (due to the prose in Quizius), so we call them grounded misconceptions. We then looked at wrong answers not covered by these but chosen by students, and did our best to distill these into misconceptions. These are surmised misconceptions (which we identify with a ‡), which need to be validated in the future. We then re-ran the misinterpreters against the chosen answers. We terminated when all the remaining wrong answers were either (a) found in very few students (we found a gap between 23% and 13%, and hence took 14% as the threshold), (b) difficult for us to attribute to a misconception, or (c) appeared to us to be "Lispy" and hence not of broad interest.
Tables 6 to 8 list the final catalog. For each, we also present a Tutor question for which the marked wrong answer can be explained by only the named misconception. That is, that program-answer pair is a representative example of that misconception.
Confirmed Misconceptions. The Tutor confirmed all the misconceptions from Quizius via the SMoL Quizzes.
Added Misconceptions. The misinterpreters helped us find misinterpretations that we had overlooked. For instance, consider this program from Quiz 2, which begins:

(defvar x (mvec 100))

We had assumed the wrong answer 0 is caused by CallByRef. However, our misinterpreters made us realize it can also be explained by FlatEnv. This could also explain why we see a difference in error rate when the formal and actual parameter have the same name (as mentioned in Section 5.3.2). Among the catalog entries involved here:
DeepClosure: Closures copy the values of free variables.
DefByRef: Variable definitions alias variables.
DefOrSet: Both definitions and variable assignments are interpreted as follows: if a variable is not defined in the current environment, it is defined; otherwise, it is mutated to the new value.
StructsCopyStructs: Storing data structures into data structures makes copies (its representative program begins (defvar x (mvec 2 3))).
A New Potential Misconception. For the following program, added in the Tutor:

(defvar y (+ x 2))
(defvar x 1)
x
y

56% of students asserted that it produces 1 3. Based on this, we surmise that students might have another misconception, which we define as Lazy‡. (Recall that SMoL is eager, but even in many lazy languages, this would be an error.)
Another Potential Misconception, and Its Effect on Interpreting Descriptions. Consider a program from the SMoL Tutor involving a function get-x and an inner definition of x. This program, suitably translated, would produce 2 in a wide variety of languages (Python, JavaScript, Racket, Java, etc.), because get-x and the inner x are in the same scope block. The answer 1 cannot be explained by any of our existing misconceptions. Based on this, we surmise a new misconception, NestedDef‡. (Though its frequency falls below our threshold, our reading of answers suggests this may be more widespread, and we feel it needs to be investigated more.) Once we turned this into a misinterpreter, we found that it unexpectedly captured the program-answer pair for a program from Quiz 1 that double-binds x in the same scope block (and should therefore be an error) and the answer 2 0. Previously, we had interpreted this only as DefOrSet, because students had stated that the second (defvar x ....) "mutates" or "redefines" x. This is reminiscent of the behavior of languages like Python, which use the same syntax both for binding new variables and for mutating existing ones.
The problem here is that the word "redefine" underspecifies how the second definition is interpreted. We had interpreted it as mutating the binding established by the first definition, which fits DefOrSet. However, another possibility is that it shadows the binding (i.e., establishes a new scope block). We did not recognize that the original misconception was underspecified until we uncovered the new (surmised) misconception.
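The two readings of "redefine" are both observable in Python, where a nested definition shadows by default and mutates only under nonlocal. A small sketch of our own contrasting them:

```python
def mutate_reading():
    x = 1
    def set_x():
        nonlocal x   # mutate the existing binding (the DefOrSet reading)
        x = 2
    set_x()
    return x         # the outer x was changed

def shadow_reading():
    x = 1
    def set_x():
        x = 2        # a new binding that shadows x (the NestedDef reading)
    set_x()
    return x         # the outer x is untouched

print(mutate_reading(), shadow_reading())  # prints: 2 1
```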

Summary.
To summarize, we used the SMoL Tutor, with an expanded set of programs, to investigate student misconceptions. We implemented the idea of misinterpreters to help us properly classify student performance. We were able to reconfirm all the previously identified misconceptions, refine some of them, and also identify potential new ones (that need further investigation).

Fig. 4. How many students chose a wrong answer that (uniquely) represents a misconception? (Downward tendency suggests improvement over time.)

IS THE TUTOR EFFECTIVE?
Recall that the SMoL Tutor is not only a collection of MCQs: it is also a tutor! So far, we have investigated the value of the MCQs. Now we examine its tutoring aspect. Concretely, we ask: RQ How effective is the Tutor at correcting each misconception? To study this, we perform the following analysis per misconception. Using misinterpreters, we identify those questions whose wrong answers fit only one misconception (i.e., only one output matches that produced by that misinterpreter, and it matches only that misinterpreter). We then examine student performance over time across those questions. This would let us examine how they do on just that topic, in isolation, over time.
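A minimal sketch of this classification step (the names, program representation, and toy misinterpreters below are ours, not the Tutor's actual implementation, and we simplify by checking only uniqueness across misinterpreters):

```python
def uniquely_diagnostic(program, wrong_answers, misinterpreters):
    """Map wrong answers to the single misconception that explains them.
    `misinterpreters` maps a misconception name to a function from a
    program to the (mis)computed output."""
    produced = {}
    for name, run in misinterpreters.items():
        out = run(program)
        if out in wrong_answers:
            produced.setdefault(out, []).append(name)
    # keep only answers produced by exactly one misinterpreter
    return {ans: names[0] for ans, names in produced.items()
            if len(names) == 1}

# Toy stand-ins for real misinterpreters:
misints = {
    "Lazy":     lambda p: "1 3",
    "DefOrSet": lambda p: "2 0",
    "FlatEnv":  lambda p: "2 0",   # collides with DefOrSet on this program
}
print(uniquely_diagnostic("...", {"1 3", "2 0"}, misints))
# prints {'1 3': 'Lazy'}: "2 0" is ambiguous, so it cannot diagnose
# a single misconception
```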
We present the results of this analysis in two forms. Graphically, we show plots in Figure 4. Each figure shows the percentage of students who chose the answer corresponding to that misconception. Ideally, we would like to see these percentages diminish.
Indeed, that is what we see in most of the graphs. The exceptions are CallsCopyStructs, NestedDef ‡ , and StructsCopyStructs, which have only one problem (and hence no trend), and DefOrSet and FunNotVal, which show an increase. The lack of improvement for FunNotVal is unsurprising because the Tutor does not explicitly address this issue, focusing on closures created by lambdas rather than named functions. (However, this does not explain the increase!) We also performed a logistic regression to see whether these improvements are significant (at a p < 0.05 threshold); details are in Paper.html in the supplementary materials. Of the 11 misconceptions:
• Of the nine seemingly improved (i.e., decreased) misconceptions:
  - Two are not significant: CallByRef (p-value = .074) and NoCircularity (p-value = .075).
  - The other seven are significant.
• However, the two with an increasing trend (DefOrSet and FunNotVal) are also significant.
These data broadly suggest that the Tutor is a net positive. Ultimately, however, what these data really show is a need for improvement in the Tutor. When designing the Tutor questions, many more were intended to be representative of single misconceptions. However, as noted in Section 7.1, it is easy to be incomplete (or incorrect) in ascribing misconceptions. Furthermore, as their set grows, it is difficult to reassess all the problems manually. Once evaluated using misinterpreters, we found far fewer uniquely diagnostic problems than we would have liked. Concretely, only 40 of the 71 eligible problems (after removing the non-SMoL modules) were useful in the above analysis.
We therefore view the above data as purely formative: they suggest that the Tutor most likely did not do harm and perhaps even may have done some good. However, it would be improper to read too much into the analysis. It is quite possible that some of the other problems would have found issues. Rather, what we really see is value in the misinterpreter concept. It is not only useful for analysis, it is also valuable for problem design: in the next iteration of the Tutor, we will use misinterpreters actively to shape the incorrect answers and, as necessary, update the programs as well. We therefore hope to have much more thorough analyses in future work.

PERFORMANCE ON OTHER POPULATIONS
There is, of course, a significant danger that the data above have been overfitted to only one institution (Section 4, henceforth University 1), thereby actually reflecting the state of its curriculum rather than some greater truth about program behavior understanding. We already have some reason to believe this is not the case: the related work discussed in Section 10 is drawn from many institutions in multiple countries with different educational preparations, levels, and demographics. Nevertheless, that gives us only limited information about the specific questions and misconceptions described above.
Fortunately, we were able to deploy the Tutor on two other populations:
• University 2 is a primarily public university in the US. It is one of the largest Hispanic-serving institutions in the country. As such, its demographics are extremely different from those whose data were used above. The course is a third-year programming languages course. The students are required to have taken two introductory programming courses (C++ focused).
• A separate instance of the Tutor was published on the website of a programming languages textbook [Krishnamurthi 2022]. Over the course of 8 months, 597 people started the first module and 103 users made it to the last one. To protect privacy, we intentionally do not record demographic information, but we conjecture that the population is largely self-learners (who are known to use the accompanying book), including some professional programmers.
This population is extremely unlikely to include students from either university, because they would not get credit for work on the public instance; they needed to use the university-specific instance. Furthermore, since they were not penalized for wrong answers, it would make little sense to do a "test run" on the public instance. Finally, we note that there is no overlap between the dates of submission on the public instance and the semester at University 1.
These two populations are therefore at least somewhat different from the original population, and help us assess whether the problems we identify are merely an artifact of the first institution.
To evaluate, we computed a Spearman's rank correlation ρ, ranking questions by the percentage of students who got each question right. Between the original university and University 2, we obtain a p-value = 2.013e-07. Between the original university and the online population, we obtain a p-value < 2.2e-16. These show that the other two populations performed similarly to the original one. While further validation on other populations remains essential, these results suggest that the questions are finding misconceptions that may be universal.

In addition, students may well perform differently with more familiar syntaxes. It seems unlikely their performance would be too different, given the many languages that have produced similar misconceptions (Table 9). Nevertheless, we intend to use a variety of syntaxes to examine this issue further.
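For reference, the rank correlation used above is Pearson correlation on ranks; with no ties it reduces to a simple formula. A plain-Python sketch of our own (illustration only, ignoring tie correction, with made-up per-question correct rates):

```python
def spearman_rho(xs, ys):
    """Spearman's rank correlation, assuming no ties (illustration only)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical per-question fractions of correct answers at two sites:
site1 = [0.91, 0.52, 0.77, 0.34, 0.60]
site2 = [0.88, 0.49, 0.81, 0.55, 0.30]
print(spearman_rho(site1, site2))  # prints 0.6
```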
Finally, we curated programs by hand, so they may well reflect our own biases about misconceptions.It may be possible to mitigate this problem by synthesizing MCQ programs using the misinterpreters.

Internal Validity
Is our reasoning valid? We have applied standard techniques and measurements for evaluating student responses. However, our instruments still lack the validity of a proper concept inventory. Their creation requires heavyweight processes (such as Delphi methods [Goldman et al. 2010]) that require many hours of expert attention as well as conversations with learners, and can hence be prohibitive in cost. The Quizius method (Section 5.1) was created precisely to be an inexpensive method that provides a good proxy.
Ultimately, our goal is to provide a reasonable instrument for widespread use. While the set of all misconceptions could be unbounded, we believe our tasks, especially as embedded in the Tutor, provide a good starting point for others. In particular, if students select a wrong answer, that is still of some use to an educator, even if the precise misconception cannot be pinned down with the highest accuracy. We therefore believe our instruments, and our misinterpreter technique, are of general value.
A failure in our Tutor's logging infrastructure led us to miss some responses. These are unlikely to be task-specific because the Tutor has a generic framework that should perform the same across tasks. Moreover, on average only 0.4% (sd = 0.6%) of values are missing. Therefore, we do not believe this had a noticeable impact. (In addition, every wrong answer is still wrong! We may just not have exactly the right proportions of them.)

External Validity
How well do our results generalize? Certainly there is reason to question whether our results would apply to other populations. Section 9 provides preliminary evidence that the results are not specific to one institution, and Section 10 suggests these issues are widespread. Nevertheless, much broader testing is needed to confirm our specific instruments.
The other major concern is the tie to Lispy syntax.Learners in other settings may do worse or even better with it.Building a Tutor that supports multiple, and more traditional, syntaxes should help address this issue.We defer this to future work.

DISCUSSION
The paper has already identified several areas for future work:
• testing on more varied populations;
• using other syntaxes;
• having more questions that uniquely identify misconceptions; and
• enabling textual responses in the Tutor so we can better characterize wrong answers.
These can help get us even closer to a good approximation of a true concept inventory. We would also like the Tutor to make more use of the education theories discussed in Section 2.
We focus here on some issues that we think warrant broader discussion.
Language Design Implications. Our work takes the Standard Model as given. Naturally, it is reasonable to ask whether that should be the default for languages. Suppose, for instance, programmers converge on an "incorrect" behavior; perhaps that should become the behavior of future (versions of) languages? Indeed, Section 1 presents examples of such changes to existing languages. Some researchers have considered these questions before. For instance, the natural programming project [Pane et al. 2002] had non-programmer children describe how they would write a program, and created languages around their utterances; but these were very limited in expressive power. Stefik and Siebert [2013] did user studies to design a language's syntax (a topic we have not covered). Tunnell Wilson et al. [2017] asked what would happen if we "crowdsource" a language's semantics, and similarly Tunnell Wilson et al. [2018] studied programmer preferences for gradual typing systems.
However, this method is not always productive. The crowdsourcing study found, in general, a lack of not only consensus (agreement across people) but even consistency (agreement across responses from a single person). This agrees with our data: students (and programmers) do not seem to have clear conceptions of program behavior. Furthermore, even if they agreed, other considerations (such as performance) might become relevant: e.g., copying data avoids aliasing, but it comes at a steep cost, which is presumably why languages in the Standard Model do not do it. (Similarly, the most preferred gradual typing behavior was also the most expensive.) That, combined with the presumed importance of understanding the behavior of the languages we program in, is why the Tutor focuses on creating a shared, uniform understanding of SMoL.
Program Tracing and Visualization. This paper has not discussed the use of program visualization, or what is known in the education literature as a "notional machine" (i.e., a student-accessible presentation of the semantics) [du Boulay et al. 1999; Krishnamurthi and Fisler 2019; Sorva 2012b, 2013]. In fact, we do have a notional machine called the Stacker that accompanies the Tutor, inspired by earlier research showing student misconceptions about the stack's behavior [Clements and Krishnamurthi 2022].
As Fig. 3 shows, every program is accompanied by a button labeled "COPY as program", which puts it in the clipboard in a form that can be directly pasted and run in the Stacker. However, the version of the Stacker used in these studies ran inside DrRacket, which creates friction: students need to copy the program, perhaps launch DrRacket, then paste and run it. We suspect that students made little use of it except when required to. A new version of the Stacker (https://smol-tutor.xyz/stacker/) runs entirely in the browser and is linked directly from the Tutor, which should considerably reduce this friction.
Other Uses for Misinterpreters. Another idea we are toying with is the following. Once we believe the student has sufficiently fixed their understanding of a misconception, we can show them a program stepping through a misinterpreter and have them identify a state where the correct and incorrect semantics diverge. There may be multiple reasonable states, so we would accept any reasonable one. This would let us assess their understanding of where they were previously wrong. Note that checking the marked state can be done automatically, so this is an assessment that is easy to run at scale.

Static Semantics. Our work has focused purely on the dynamic semantics of programs. We did this because there is much less agreement (needed to construct a standard model) over static types: once we go beyond the simply-typed lambda calculus, even basic user-defined data structures introduce questions such as structural versus nominal equality. Nevertheless, it would be very interesting to study these questions for types and other static semantics as well. The closest work we know of is [Crichton et al. 2023], which examines ways in which learners misunderstand Rust's static semantics (and its consequences for dynamic behavior as well). It does not use mis-typecheckers, but instead relies on notional machines.
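The divergence-marking assessment described under "Other Uses for Misinterpreters" could be automated along these lines (a sketch of our own; the state representation is hypothetical): run the correct interpreter and the misinterpreter in lockstep, and accept any marked step at which their states differ.

```python
def divergence_points(correct_trace, wrong_trace):
    """Indices of steps where the two traces disagree. Traces are lists
    of machine states; any comparable representation works."""
    n = min(len(correct_trace), len(wrong_trace))
    return [i for i in range(n) if correct_trace[i] != wrong_trace[i]]

def accept_marked_state(marked, correct_trace, wrong_trace):
    # Accept any reasonable (i.e., actually diverging) marked state.
    return marked in divergence_points(correct_trace, wrong_trace)

# Toy traces: here a state is just an environment snapshot.
ok  = [{"x": 1}, {"x": 1, "y": 2}, {"x": 9, "y": 2}]
bad = [{"x": 1}, {"x": 1, "y": 2}, {"x": 1, "y": 2}]  # misses the mutation
print(divergence_points(ok, bad))  # prints [2]
```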

The State-Aliasing-Function Triangle. Our problematic programs hardly include any "advanced" programming features: there are no threads, asynchrony, sophisticated type systems, ownership, inversion of control, etc. Indeed, many of those features typically build on an understanding of these basics (e.g., it is hard to make sense of ownership [Clarke et al. 1998] without a good understanding of this triangle). But many populations seem to struggle even with this, which may explain why concepts like ownership are considered hard [Crichton et al. 2023].
It is worth noting that we find problems even without all three components, as Section 5.2 shows! Nevertheless, we think it would be useful for curricula to focus on achieving mastery of this triangle. This may require a deep revision of widely accepted pedagogy. For instance, it is common to explain variables as "boxes". But if taken seriously by a learner, this metaphor may cause more harm than good. As prior research [Grover and Basu 2017; Hermans et al. 2018; Putnam et al. 1986] shows, students then expect that a variable can contain more than one value, that removing one makes the other accessible, etc. That research has not explored aliasing, but the metaphor may affect that too. A box is a closed object, so a (mutable) value put "into" it clearly can't be modified through another "box" (variable); i.e., the metaphor not only fails to explain aliasing but is antithetical to it.
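The failure of the box metaphor for aliasing is easy to exhibit; in this Python sketch (ours, for illustration), two "boxes" would predict independent contents:

```python
a = [1, 2]   # under the box metaphor, a "contains" the list
b = a        # ...so b would be a second box with its own contents
b[0] = 99
print(a)     # prints [99, 2]: both names refer to one shared, mutable value
```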
Terminology. It is common to teach programming using the terms "call-by-value" and "call-by-reference". The reader will note that we instead have three ByRef misconceptions.
The emphasis on "calling" suggests that the semantics is associated only with calls.That leaves open what happens when one simply binds a variable.In a reasonable language (and definitionally, in SMoL), the behavior is exactly the same: the formal parameter can be viewed as a binding to the value of the actual.But students often form inconsistent views because the "call" terminology breaks this deep similarity.We therefore recommend that languages in general use the term bind-by-, to emphasize the underlying semantic unity of these syntactically different mechanisms.
We feel that further confusion is caused by terms like "pass" and "return". We often vocalize the call (f x) as "passing x to f"; saying "passing the value of x to f" is a mouthful. But is it then surprising that students think x is being aliased? Similarly, consider a statement like return y (in Python syntax). Of course, semanticists understand that it is the value of y, not y itself, that is being returned. Nevertheless, it is not surprising if this also results in assumptions that y is either aliased or has escaped from the function (leading to an interpretation of dynamic scope). We believe there is a need for significant research to investigate these kinds of effects, which are similar to, but not strictly the same as, "vernacular misconceptions" [National Research Council 1997].
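This ambiguity is concrete in any Standard Model language. In the following Python sketch (our own), "passing x" binds the formal to the value of x; rebinding the formal does not affect the caller, while mutating a shared structure does:

```python
def rebind(v):
    v = [0]      # rebinds the formal only; the caller's x is untouched

def mutate(v):
    v[0] = 0     # mutates the shared vector that both names refer to

x = [1]
rebind(x)
print(x)  # prints [1]
mutate(x)
print(x)  # prints [0]
```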

DATA-AVAILABILITY STATEMENT
The software that supports sections 6 to 9 and 11 is available on the ACM DL [Lu and Krishnamurthi 2024].

Fig. 3.
Screenshots of an interpreting question in the SMoL Tutor. The top-left shows the initial state, where the question is presented as an MCQ. If a student chooses a wrong answer, they receive feedback (bottom-left) and are asked a similar question (right). The similar question must be answered by typing.

Table 3.
Some questions and student answers from Quiz 2. Each table row is a program and the (relative) frequencies of student answers. Answers that can be considered correct are marked with *. Wrong answers that we discuss are marked with † .
Table 2 lists programs in Quiz 1 that we consider the most interesting. These data confirm the presence of scope-related misconceptions:

Proc. ACM Program. Lang., Vol. 8, No. OOPSLA1, Article 106. Publication date: April 2024.

Table 4.
Some questions and student answers from Quiz 3. Each table row is a program and the (relative) frequencies of student answers. Answers that can be considered correct are marked with *. Wrong answers that we discuss are marked with † .
Table 4 lists interesting programs after the addition of closures and higher-order functions. It extends what we saw with loops in Section 1: that students have misconceptions about their interaction with mutable variables:
• 17% of students think lambda functions can't refer to free variables.

Table 6.
Ground misconceptions identified by the SMoL Tutor. Answers marked with "**" represent the misconception. (Part I)

Table 7.
Ground misconceptions identified by the SMoL Tutor. Answers marked with "**" represent the misconception. (Part II)

Table 8.
Surmised misconceptions identified by the SMoL Tutor. Answers marked with "**" represent the misconception.

Table 9.
Similar misconceptions found in prior research.

(The Tutor was used in one course in Spring 2023, taken by 12 students.)