Analyzing-Evaluating-Creating: Assessing Computational Thinking and Problem Solving in Visual Programming Domains

Computational thinking (CT) and problem-solving skills are increasingly integrated into K-8 school curricula worldwide. Consequently, there is a growing need to develop reliable assessments for measuring students' proficiency in these skills. Recent works have proposed tests for assessing these skills across various CT concepts and practices, in particular, based on multi-choice items enabling psychometric validation and usage in large-scale studies. Despite their practical relevance, these tests are limited in how they measure students' computational creativity, a crucial ability when applying CT and problem solving in real-world settings. In our work, we have developed ACE, a novel test focusing on the three higher cognitive levels in Bloom's Taxonomy, i.e., Analyze, Evaluate, and Create. ACE comprises a diverse set of 7x3 multi-choice items spanning these three levels, grounded in elementary block-based visual programming. We evaluate the psychometric properties of ACE through a study conducted with 371 students in grades 3-7 from 10 schools. Based on several psychometric analysis frameworks, our results confirm the reliability and validity of ACE. Our study also shows a positive correlation between students' performance on ACE and performance on Hour of Code: Maze Challenge by Code.org.


INTRODUCTION
Computational thinking (CT) is emerging as a critical skill in today's digital world. According to the work of [1], "computational thinking involves solving problems, designing systems, and understanding human behavior, by drawing on the concepts fundamental to computer science". Several works have also discussed the multi-faceted nature of CT and its broader role in the acquisition of creative problem-solving skills [2, 3]. As a result, CT is being increasingly integrated into K-8 curricula worldwide [4, 5]. With the growing integration of CT at all academic stages, there has also been a surge in demand for validated and reliable tools to assess CT skills, especially at the K-8 stages [6, 7]. These assessment tools are essential for tracking students' progress, guiding the design of curricula, and supporting teachers as well as researchers in helping students acquire CT skills [3, 6, 8, 9].

(* This extended version of the SIGCSE 2024 paper includes all 21 test items from ACE along with their answers in the appendix.)
Prior work has proposed several assessments that measure students' CT during their K-8 academic journey. On one end, several portfolio-based assessments have been proposed that measure students' CT through projects in specific programming environments [10]. Although portfolio-based tests provide open-ended projects to capture students' analytical, evaluative, and creative skills, they are challenging to implement and interpret on a larger scale [7, 11]. On the other end, several diagnostic assessment tools have been proposed that measure CT in the form of multiple-choice items [7, 12-14]. These assessment tools are preferred for their practicality in large-scale administration and suitability for both pretest and posttest conditions [11]. However, this scalability comes at the cost of limiting the ability to effectively measure students' computational creativity. Thus, there is a need to develop multi-choice tests that also capture students' computational creativity.
To this end, we have developed a novel test for grades 3-7, ACE, that focuses on the three higher cognitive levels of Bloom's Taxonomy, i.e., Analyzing, Evaluating, and Creating [15]. It comprises a diverse set of multiple-choice items spanning all three higher cognitive levels, including the highest level of Creating. Figure 1 illustrates the diversity of items covered by ACE. Further details of the development of ACE are presented in Section 3. In this paper, our objective is to validate ACE with students from grades 3-7, and report on its psychometric properties. Specifically, we center the analysis around the following research questions: (1) RQ1: How is the internal structure of ACE organized w.r.t. item categories pertaining to Bloom's higher cognitive levels? (2) RQ2: What is the reliability of ACE w.r.t. consistency of its items? (3) RQ3: How does performance on ACE correlate with performance on real-world programming platforms and students' prior programming experience?
Table 1: Categorization of different CT assessments proposed in recent works. The first column shows the specific CT assessment. The next three columns, Applying-Analyzing, Analyzing-Evaluating, and Evaluating-Creating, classify the assessment based on these different cognitive levels of Bloom's Taxonomy, where "✓" implies presence of the levels and "✗" implies absence of the levels. The "Grade" column refers to the intended grades (age group) for the test. The "Validity" column refers to three dimensions across which the test was validated, including (i) "Student": test items validated with students; (ii) "Expert": test items validated with experts; (iii) "Convergent": test validated w.r.t. performance on another test/course. Finally, the "Domain" column shows the domain on which the items in the test were designed. Further details are presented in Section 2.


RELATED WORK
Prior work has proposed several CT assessments and categorized them based on their format, including the following [7, 11]: (a) portfolios, which are project-based programming assessments; (b) interviews, which are used in conjunction with portfolios to gain insights into students' thinking process; (c) summative assessments, which are long-format answer-type questions that measure CT in the context of a particular domain; (d) multi-choice diagnostic tests, which measure CT aptitude and may be administered in both pretest and posttest conditions. As mentioned in Section 1, we focus on multi-choice CT tests due to their practicality and scalability. Table 1 presents several different multi-choice diagnostic tests proposed in the literature, viewed through the lens of Bloom's taxonomy [11]. Specifically, we classify them based on their coverage of the higher cognitive levels of the taxonomy (Applying, Analyzing, Evaluating, and Creating).
These tests cater to students from different school years, from kindergarten (K) through the early years of college. Next, we describe three representative assessments targeting different grade levels. The competent Computational Thinking test (cCTt) [7] was proposed in 2022 for students in grades 3-4. The test comprises items that only require finding solution codes or completing a given solution code. These types of items invoke students' Applying, Analyzing, and Evaluating cognitive levels. The Computational Thinking Challenge (CTC) [13] was proposed in 2021 for students in grades 9-12. The test contains programming items in the form of Parsons problems [24], solution-finding multi-choice items, and general items on real-world problem-solving. The items in CTC also cover all cognitive levels except the Creating level. Finally, the Placement Skill Inventory v1 (PSIv1) [14] was proposed in 2022 for college students as a placement test. The test contains multi-choice theoretical items on programming and covers only the Applying and Analyzing levels of Bloom's taxonomy. In contrast to these tests, ACE contains items that require synthesizing new problem instances to verify the correctness of a proposed solution. These items are intended to cover Bloom's Creating cognitive level. ACE is developed for students in grades 3-7.
Table 1 also shows the different domains in which CT is measured in these tests. For grades K-8, the most popular setting is block-based visual programming, likely because of the low syntax overhead of these domains and the ease of measuring CT concepts such as conditionals, loops, and sequences [7, 9, 12, 21]. Beyond block-based programming domains, several CT tests also utilize real-world settings, including everyday scenarios (e.g., a scenario related to seating arrangements in a gathering) [20], robotics [19], and real-world problem-solving (e.g., a problem related to route planning in a city) [13]. The advantage of these real-world settings and domains is that they can be administered with minimal domain knowledge, making them suitable pretest and posttest candidates. ACE is based on the block-based visual programming domain.
Finally, an important aspect of developing such CT assessments is their validation and reliability [7]. Generally, CT assessments are validated using three methods: (a) with students in the specific grades for which the assessment was designed; (b) with expert feedback; (c) w.r.t. another test or performance in a course (i.e., convergent validity). For a well-rounded evaluation, it is advisable to explore all three validation methods [7, 11]. As shown in Table 1, most tests are validated with students, while some are refined by experts. However, the incorporation of convergent validity is less common. ACE is validated using all three methods.

OUR TEST: ACE
The development of ACE is centered around the higher cognitive levels of Bloom's taxonomy: Analyzing, Evaluating, and Creating. The test contains items grounded in the domain of block-based visual programming. Specifically, we consider the popular block-based visual programming domain of Hour of Code: Maze Challenge [16] by Code.org [25].
We picked this domain as it encapsulates important CT and problem-solving concepts of conditionals, loops, and sequences, within the simplicity of the block-based structure.
Students can attempt tasks in this domain with a simple description of the constructs and task, as discussed in the caption of Figure 1. Next, we describe the items in ACE, which are divided into the following three categories based on Bloom's higher cognitive levels:
• Applying-Analyzing: This category comprises items either on finding a solution code of a given task or reasoning about the trace of a given solution code on one or more visual grids. They are based on the Applying and Analyzing levels of Bloom's taxonomy, as they require applying CT concepts and analyzing code traces. These items are typically the most common type of items included in several CT tests [7, 20].
• Analyzing-Evaluating: This category comprises items that require reasoning about errors in candidate solution codes of a task and evaluating the equivalence of different codes for a given task. They are based on the Analyzing and Evaluating levels of Bloom's taxonomy. Several CT assessments also include these types of debugging items [9, 13].
• Evaluating-Creating: This category comprises items that require reasoning about the design of task grids for given solution codes. They are based on the Evaluating and Creating levels of Bloom's taxonomy, as they involve synthesizing components of visual grids such as the Avatar, Goal, and Walls. These items are unique to ACE and capture the open-ended nature of task design, such as counting the possible task configurations that satisfy a given solution code (see items Q18 and Q21 in Figure 1, and the sketch after this list).
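To make the open-ended nature of these task-design items concrete, the following is a minimal sketch of how the number of valid GOAL placements for a Q18-style item could be counted. The grid encoding, command names, and solvability convention are our illustrative assumptions (and the sketch handles only straight-line codes, without loops or conditionals); it is not the actual HoCMaze semantics or an actual ACE item.

```python
# Hypothetical sketch of a Q18-style counting item. Grid cells: '.' free, '#' wall.
MOVES = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}
LEFT = {"N": "W", "W": "S", "S": "E", "E": "N"}
RIGHT = {v: k for k, v in LEFT.items()}

def run(code, grid, start, facing):
    """Execute a straight-line code; return the cells visited, or None on a crash."""
    r, c = start
    visited = [(r, c)]
    for block in code:
        if block == "turn_left":
            facing = LEFT[facing]
        elif block == "turn_right":
            facing = RIGHT[facing]
        elif block == "move_forward":
            dr, dc = MOVES[facing]
            r, c = r + dr, c + dc
            # Crash if the avatar leaves the grid or hits a WALL cell.
            if not (0 <= r < len(grid) and 0 <= c < len(grid[0])) or grid[r][c] == "#":
                return None
            visited.append((r, c))
    return visited

def count_goal_placements(code, grid, start, facing, end_only=False):
    """Count free cells (not the avatar's start) where placing the GOAL
    lets the code solve the grid. With end_only=True the avatar must
    finish on the goal; otherwise passing through it counts."""
    visited = run(code, grid, start, facing)
    if visited is None:
        return 0
    candidates = visited[-1:] if end_only else visited[1:]
    return len({(r, c) for (r, c) in candidates
                if (r, c) != start and grid[r][c] == "."})

grid = ["....",
        ".##.",
        "...."]
code = ["move_forward", "move_forward", "move_forward",
        "turn_right", "move_forward"]
print(count_goal_placements(code, grid, start=(0, 0), facing="E"))  # -> 4
```

The end_only flag makes the assumed solvability convention explicit (finish on the goal vs. merely pass through it); a Q21-style item (minimum number of added walls) would additionally require a search over wall placements.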

STUDY AND DESCRIPTIVE STATISTICS
In this section, we provide details of the data collection process for ACE's psychometric evaluation.

Two-Phase Data Collection Process
The study to evaluate the psychometric properties of ACE was planned in two phases, spread across two weeks. The first phase was intended to familiarize students with the block-based visual programming domain of Hour of Code: Maze Challenge (HoCMaze) [16] by Code.org [25], and introduce them to basic programming concepts. Additionally, it would serve as a baseline to correlate students' performance w.r.t. ACE and measure the convergent validity of ACE. In the second phase of the study, students would take the ACE test. This two-phase study design ensured that students could focus on each study component and that there was a time gap between domain familiarization and the actual test.
We obtained an Ethical Review Board approval from the Ethics Committee of Tallinn University before conducting the study. The study was conducted in Estonia, where 10 schools were randomly selected from a pool spanning 11 of the 15 counties. Participation in the study was voluntary for both schools and students. The data collection process was conducted in May 2023.
During both phases, students received usernames to ensure anonymity throughout the study. The first phase of data collection included one 45-minute lesson during which the students filled in a short background questionnaire in Google Forms (about 5 minutes) and then solved 20 tasks from HoCMaze (about 40 minutes). We hosted these 20 tasks on a separate platform created for the study to enable the collection of students' performance data on these tasks. Students were allowed multiple attempts to solve each task and could score a maximum of 20 points, i.e., 1 point per task. Henceforth, we refer to students' performance in this phase as their HoCMaze score. The second phase took place one week later and involved a 45-minute lesson during which students took the ACE test. The test was administered through a Qualtrics survey. Students could score a maximum of 21 points, i.e., 1 point per item. Henceforth, we refer to students' performance in this phase as their ACE score.
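As a concrete illustration of this scoring scheme, here is a minimal sketch assuming binary per-item response matrices; the 371 x 20 and 371 x 21 shapes follow the study design, but the data below are simulated, not the collected data.

```python
import numpy as np

# Simulated stand-ins for the collected data: one row per student, one
# column per HoCMaze task (20) or ACE item (21); an entry is 1 if the
# task was solved (in any number of attempts) / the item was answered
# correctly, and 0 otherwise.
rng = np.random.default_rng(0)
hocmaze_responses = rng.integers(0, 2, size=(371, 20))
ace_responses = rng.integers(0, 2, size=(371, 21))

# 1 point per task/item: total scores are simple row sums.
hocmaze_score = hocmaze_responses.sum(axis=1)  # each in 0..20
ace_score = ace_responses.sum(axis=1)          # each in 0..21
```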

RESULTS AND DISCUSSION
In this section, we discuss the results of the study centered around the research questions (RQs) introduced in Section 1.

RQ1: Internal Structure of ACE
We assess the internal structure of ACE w.r.t. the item categories pertaining to Bloom's higher cognitive levels.

RQ2: Reliability of ACE
Next, we determine the reliability of ACE, i.e., a measure of its ability to produce consistent and stable results over repeated administrations (a higher value being better). One standard way to measure this is through the Cronbach alpha value [13, 28], which reflects the average inter-item correlations in a test. Another method is the reliability of student ability estimates obtained from Item Response Theory (IRT). In our study, we apply IRT analysis to students' responses to ACE and fit a 1-parameter logistic Rasch model (1-PL IRT) [26]. The model estimates the per-item difficulties and students' abilities, and provides the reliability of these estimates.
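For concreteness, below is a minimal sketch, assuming a students x items matrix of 0/1 responses, of the two reliability ingredients named above: Cronbach's alpha, alpha = K/(K-1) * (1 - sum of item variances / variance of total scores), and the 1-PL (Rasch) item characteristic curve, P(correct | ability theta, difficulty b) = 1 / (1 + exp(-(theta - b))). The plain NumPy implementation and simulated data are illustrative only; the study's actual analysis fits the Rasch model to the real responses.

```python
import numpy as np

def cronbach_alpha(responses):
    """Cronbach's alpha: K/(K-1) * (1 - sum(item variances) / var(total scores)).
    `responses` is a students x items matrix of 0/1 scores."""
    k = responses.shape[1]
    item_vars = responses.var(axis=0, ddof=1).sum()
    total_var = responses.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

def icc_1pl(theta, b):
    """1-PL (Rasch) item characteristic curve:
    P(correct | ability theta, item difficulty b)."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# Illustrative usage on simulated data (371 students x 21 items).
rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(371, 21))
print(round(float(cronbach_alpha(responses)), 3))
print(float(icc_1pl(theta=0.0, b=-1.0)))  # an easy item: P(correct) ~ 0.73
```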
The overall reliability of our test was good, with a Cronbach alpha value of 0.813. Among the three item categories, Cronbach alpha was 0.622 for ACE[01-07], 0.562 for ACE[08-14], and 0.625 for ACE[15-21]. Figure 3a shows the 1-PL IRT item characteristic curves for all items; we find that Q02 is the easiest and Q17 is the hardest ACE item. Figure 3b illustrates the difficulty of items as well as the estimated abilities of students in our population. The 1-PL IRT Person reliability value for all 21 items is 0.790 (with p < 0.01).
Next, we discuss the potentially problematic item Q17 shown in Figure 4. We find that its exclusion from the model does not significantly improve the IRT Person reliability. One possible reason Q17 prompted incorrect responses is that it was the first item in ACE requiring enumeration of all possible Avatar locations. However, students adapted to similar formats in subsequent items (e.g., Q18 and Q21 in Figure 1). Prior work confirms that varying response formats can cause such deviations [29]. A possible revision of item Q17 could be to simplify the visual grid to reduce its complexity.

RQ3: Correlating ACE Scores
We measure the convergent validity of ACE w.r.t. HoCMaze scores. Additionally, we measure the correlation of the three ACE categories with both HoCMaze scores and overall ACE scores. Finally, we measure the influence of extrinsic factors such as prior programming experience on ACE scores. To measure these correlations, we perform standard Pearson's correlation analysis between each pair of these measures on data from our entire student population [13, 30]. High positive values of Pearson's correlation coefficient, r, indicate a strong positive correlation. In terms of the effect of prior programming experience on ACE, we observed a significant positive correlation with both the student's year of study (r = 0.358, p < 0.01) and age (r = 0.359, p < 0.01). Our result aligns with prior work [31] indicating that participants' developmental factors (e.g., reading skills, abstract thinking) can impact test performance. In our student population, exposure to elective programming courses led to varying prior programming skills. Analyzing this further, we found that students who took after-school programming classes outperformed those who did not on ACE (p < 0.05, w.r.t. t-test [32]).
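The correlation and group-comparison analyses above can be reproduced with standard SciPy routines. The sketch below uses simulated stand-ins for the study data; the variable names and values are illustrative, not the actual dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical arrays standing in for the study data: each student's ACE
# score, HoCMaze score, and whether they took after-school programming classes.
rng = np.random.default_rng(0)
ace_scores = rng.integers(0, 22, size=371)
hocmaze_scores = np.clip(ace_scores + rng.normal(0, 4, 371), 0, 20)
took_classes = rng.integers(0, 2, size=371).astype(bool)

# Pearson correlation between ACE and HoCMaze scores.
r, p = stats.pearsonr(ace_scores, hocmaze_scores)
print(f"r = {r:.3f}, p = {p:.3g}")

# Two-sample t-test: ACE scores of students with vs. without
# after-school programming classes.
t, p = stats.ttest_ind(ace_scores[took_classes], ace_scores[~took_classes])
print(f"t = {t:.2f}, p = {p:.3g}")
```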

Limitations
Next, we discuss a few limitations of our current study. Firstly, we evaluated the convergent validity of ACE w.r.t. HoCMaze scores. However, it would be more informative to evaluate ACE w.r.t. other types of assessments, such as portfolios, which specifically target the Creating cognitive level. Moreover, it would be interesting to evaluate the convergent validity of ACE w.r.t. students' performance in other subjects involving CT. Secondly, grade 3 did not present a significant correlation between ACE and HoCMaze scores (Pearson's r = 0.068; p = 0.633), possibly because of difficulties with text comprehension of the item descriptions. Hence, refining the presentation of items could be beneficial for this age group. Finally, we presented the test items in a fixed order, which might have affected students' performance on specific items such as Q17. Implementing a randomized order of the test items within each category could be a way to address this limitation.

CONCLUSION AND FUTURE WORK
We developed a new test, ACE, to assess CT and problem-solving skills, focusing on higher levels of Bloom's taxonomy, including Creating. We capture this level through a novel category of items that go beyond solution finding or debugging and consider task design. In this paper, we studied the psychometric properties of ACE, and our results confirm ACE's reliability and validity. There are several exciting directions for future work. Firstly, we can extend the framework of items to develop tests with more advanced programming constructs, such as variables and functions, suitable for higher grades. Secondly, while we studied the utility of the items in ACE for CT assessment, these items could also be incorporated into curricula to teach students richer CT and problem-solving skills such as problem design and test-case creation.

[Figure 1, panels (b)-(f): five example items, Q07 (solution checking), Q09 (code debugging), Q13 (code equivalence), Q18 (goal design), and Q21 (wall design); their grids and answer options are images and could not be recovered from the extracted text.]

Figure 1: (a) shows the distribution of test items w.r.t. CT and problem-solving concepts and Bloom's cognitive levels. (b)-(f) are examples of five items from ACE. These items are grounded in the domain of Hour of Code: Maze Challenge (HoCMaze) [16], which can be found at studio.code.org/s/hourofcode. The HoCMaze domain comprises elementary block-based visual programming tasks where one has to write a solution code that navigates the Avatar (blue dart) to the Goal (red star) without crashing into Walls (gray grid cells). We encourage the reader to attempt these items; all 21 test items from ACE along with their answers are provided in the appendix.

Figure 2: An overview of the performance of students on ACE. (a) overall distribution of ACE scores across all 371 students; (b) distribution of ACE scores per grade; (c) success rate of students for each item in ACE. Details are in Section 4.

Figure 3: Results from a 1-parameter Rasch model [26] on the ACE items and student scores. (a) Item characteristic curve for each item in ACE and (b) Wright map (item and student distribution) corresponding to our student population.

Figure 5: Pearson's correlation coefficient, r, between ACE and HoCMaze, between ACE and its categories, and between each pair of categories. All values are significant with p < 0.001.
APPENDIX: ACE TEST ITEMS
[The appendix reproduces all 21 ACE test items with their grids, codes, and answer options. The grids and code blocks are images and could not be recovered from the extracted text. Consistent with Sections 3 and 5, the items are: Q01-Q07 (solution finding and code tracing), Q08-Q14 (code debugging and code equivalence), and Q15-Q21 (Avatar, Goal, and Wall design).]

ANSWERS TO ACE TEST ITEMS
Below we provide answers to the 21 ACE test items. [The answer key could not be recovered from the extracted text.]