Comparing Authoring Experiences with Spreadsheet Interfaces vs GUIs

There is little consensus over whether graphical user interfaces (GUIs) or programmatic systems are better for word processing. Even less is known about each interface's affordances and limitations in the context of creating content for adaptive tutoring systems. To afford instructors the use of such systems with their own or adapted pedagogies, we must study their experiences in inputting their content. In this study, we conduct a between-subjects A/B test with two content authoring interfaces, a GUI and a spreadsheet, to explore 32 instructors' experiences in authoring algebra content with hints, scaffolds, images, and special characters. We study their experiences by measuring time taken, accuracy, and their perceptions of each interface's usability. Our findings indicate no significant relationship between interface used and time taken authoring problems, but significantly higher accuracy in authoring problems in the spreadsheet interface than in the GUI. Although both interfaces performed reasonably well in time taken and accuracy, both were perceived as average to low in usability, highlighting a dissonance between instructors' perceptions and actual performance. Since both interfaces are reasonable for authoring content, other factors, such as cost and author incentive, can be explored when deciding which interface approach to take for authoring tutor content.


INTRODUCTION
Human-computer authoring interfaces are often designed either as direct manipulation graphical user interfaces (GUIs) or as programmatic systems. Direct manipulation interfaces (e.g., Microsoft Word, Google Docs, Apple Pages) consist of visual representations that replace command language syntax by allowing users to directly manipulate objects of interest [25]. Programmatic systems (e.g., LaTeX), on the contrary, consist of abstractions where users follow a cycle of changing parameters and inspecting the rendered output [6]. Each interface's merits and trade-offs have been debated in terms of speed of use, prior knowledge required, ease of learning, and errors [17]. Although the research community has not reached consensus on which type of interface is better for documents, we explore each interface for the purpose of authoring content in tutoring systems.
Creating authoring tools for adaptive educational systems and technology learning environments has been a continuous challenge researched by the education research community, with a recognized need for educators to become more involved in the creation process [16]. Some researchers even advocate for teachers to be involved in the creation of content, recognizing that, just as teachers adjust textbook content with their own examples and explanations, they also need the ability to adjust content in adaptive systems [5]. While teachers are not expected to be the main preparers of adaptive systems' content, such systems should contain interfaces that give them the opportunity to integrate their own created or adapted pedagogies. Thus, studying instructors' experiences in inputting their own content into such systems is imperative to ensure interfaces can realize instructors' expressions. With an increasing number of systems published yearly, it can become difficult to identify critical differences among various content creation interfaces and assess appropriate systems for instructors' desired tasks. Understanding each system's affordances and biases in constructing content is important for the learning analytics community to then be able to analyze.
Ongoing research has also explored generative AI as a potential solution to optimize content environments [33]. Such efforts have seen some success in creating help to assist 1:1 online instruction [8]. However, when asked to produce worked solutions for problems, systems like ChatGPT confabulate the answers around 30% of the time [21]. Thus, it is reasonable to assume instructors and other human content creators will continue to play a significant role in content authoring until GPT-4 and other models evolve to a state with minimal error rates. Some adaptive tutors, such as ASSISTments [23], involve teachers in the authoring process because of teachers' pedagogical training, which further emphasizes the need for human-focused content authoring. Fundamentally, humans remain involved with education as it is a humanistic pursuit. Thus, it is necessary to consider how instructors interface with content production. By comparing the different types of content interfaces, we hope to identify trends that can help empower instructors in using such systems to best fit the needs of their students. This paper seeks to understand the differences in instructor experiences authoring the same algebra content with different interfaces. We present an A/B study conducted as a survey with algebra instructors that evaluates each interface's time taken, accuracy, and perceived usability in transcribing problems with images and special characters, using hints and scaffolding. Our research questions are: (1) How do GUI and spreadsheet interfaces influence the time, accuracy, and perceived usability in authoring algebra content with images, special characters, hints, and scaffolds? (2) What are the affordances and limitations of GUI and spreadsheet interfaces in relation to authoring algebra content with different problem components?

RELATED WORK

Authoring Interfaces
The release of the Alto Personal Computer brought with it the introduction of formatting text with fonts, spacing, and other features [28]. This was one of the first "What You See Is What You Get" (WYSIWYG) systems, allowing users to edit content in a form that represents its printed appearance. WYSIWYG systems would go on to become popular in all sorts of digital authoring, from websites [32] to adaptive tutoring systems [12]. As the complexity of authored content increased, many systems turned towards a graphical user interface (GUI) to maintain an interface that represents what is being authored. With authoring requirements becoming increasingly demanding, such as 2D graphics in addition to text, GUIs were popularized to help fulfill such needs [13]. The rise of the GUI continued, and it became standard practice for many content authoring systems such as the ASSISTments Builder [23].

Adaptive Tutoring Systems
With the push for more effective educational tools and the increasing adoption of artificial intelligence, intelligent tutoring systems (ITS) have become a noteworthy form of tutoring system over the past 30 years [7]. Originally, ITS came to be with a focus on creating intuitive problem-solving environments for mathematics, specifically algebra [7]. A major progression in the ITS ecosystem came in the form of the Cognitive Tutor [24]. Cognitive Tutor saw widespread implementation in classrooms, allowing for data collection that would be utilized for system improvements. Cognitive Tutor was tested internationally, with its scaffolding distinguished as a main highlight that contributed to its success [18]. Due to the great cost of developing an ITS [7], many systems branched out beyond strict ITS principles to develop adaptive and effective tutoring environments. Khan Academy rose to mainstream popularity due to its supplemental video lessons, before being able to support a tutor with guided hints [29]. Khan Academy's content authoring has, at least historically, been programmatic. ASSISTments originated as a test preparation tutor with all content created in-house, but later offered a content builder, allowing instructors to author content via a GUI [10]. A plethora of adaptive systems have followed over the years, most drawing inspiration from ITS but not fully conforming to the original principles.

Authoring For Tutoring Systems
Adaptive tutors necessitate an authoring interface. Because content follows a structured format and must be interactive and contain certain information, there is a desire to remove potential barriers that would limit individuals from contributing content to systems. Content in adaptive tutors is usually authored by either teachers or other appropriate experts. Each tutor enables its users to author content in various environments. In ITS history, the Cognitive Tutor Authoring Tools (CTAT) set the foundation for content development [1]. While the tools offered certain features of a direct manipulation interface, some programming knowledge was required, making them a medium between programmatic and direct manipulation interfaces. CTAT enabled the development of ITS content with a minimal programming background, making it the foundation for many tutors through its graphical user interface (GUI). However, ITS content remains costly in development time: CTAT reduced the estimate of 200-300 hours of development to produce content equivalent to one hour of instruction found in old ITS systems [1] to 50-100 hours [31].
Adaptive systems, on the other hand, can afford to create content at a faster pace since they do not perfectly conform to ITS principles. ASSISTments hosts its own content builder, which matches the 50-hour estimate of CTAT while being four times faster than original ITS authoring tools, and does not require programming knowledge to create content due to its GUI [23]. With a focus on providing a simple but usable environment for teachers, the builder enables skill mapping of knowledge components. The builder further introduces variabilization support directly into the GUI, allowing potential content generalizability. ASSISTments offers an even further simplified "quick builder" that does not support hints and scaffolds but allows for even faster problem creation [11]. Other computer tutors such as GnuTutor [19], Guru [20], and ThesisITS also enable content building for their systems.
Authoring tools such as ASPIRE have also recently been created, focusing on easy ITS creation [15]. With development times similar to CTAT's, such tools attempt to create a simple process of content creation for domain experts with limited experience with adaptive tutors [27]. Development times remain a challenge for such tools, making GUI creation interfaces such as the one in ASSISTments still faster.
Alternative authoring interfaces utilize pre-existing platforms. Open Adaptive Tutor (OATutor) utilizes Google Spreadsheets to facilitate content creation for its system [22]. The interface banks on familiarity: new OATutor users are likely to have used a spreadsheet interface before, so they should not find it difficult to familiarize themselves with the system's content creation process. Furthermore, the spreadsheet curation format directly maps content between the system and a textbook [2]. Since OATutor was open-sourced only recently, further insights into the efficiency and success of such an interface remain to be seen.

Content Authoring Challenges
Each type of interface faces different challenges in achieving optimal implementation. Due to the uniqueness of a GUI builder, instructors can become overwhelmed learning its specific components in order to teach their classes [31]. As a multitude of new systems are developed yearly, learning a GUI only to switch to a completely different system after a few months can pose a major time risk. Furthermore, in order to enable accessible content creation, GUI builders sometimes sacrifice complex problem design capabilities in the process [23]. This trade-off for improved authoring time and user-friendliness can result in further difficulties if a user requires a different form of problem creation.
Within spreadsheet interfaces, the lack of graphical support creates the need for additional tools or methods to preview the finished iteration of a problem [22]. This, in turn, can make identifying minor errors within a problem difficult and increase time spent testing the finished product. Newer systems, such as OATutor, open the potential to examine the automation of interface features, such as hint and scaffold creation [2]. The uncertainty surrounding such an interface further creates a need for additional testing and research surrounding its content development efficiency.

EXPERIMENTAL DESIGN
Our study uses a between-subjects design, where each instructor learns one system's authoring interface before transcribing two randomly selected algebra problems from a pool of eight problems. We utilize a survey created with Qualtrics to administer our study. Participants are randomly assigned to one of two conditions: a graphical user interface (GUI), utilizing the ASSISTments Builder, or a spreadsheet interface, utilizing the OATutor spreadsheet format. We measure the average time taken transcribing each problem and the accuracy in transcribing each problem's components. In addition, we measure participants' perceived usability of each interface with the System Usability Scale (SUS) questionnaire [4] and open-ended questions about their experience authoring.

Training Materials
The OATutor slides matched the general number and structure of the ASSISTments slides, modeling a similar breakdown of authoring problems in a spreadsheet interface. Figures 1 and 2 show the first training slide from each interface. In addition to the training slides, we also included short video tutorials (5 minutes and 27 seconds for ASSISTments and 4 minutes and 55 seconds for OATutor) walking through the same process with an example problem. Similar to the modifications we made to the ASSISTments slides, we removed the first 48 seconds of the originally posted tutorial video, about the quick builder, and 36 seconds in the middle, about the test drive, since the spreadsheet interface does not contain comparable components.

Selected Algebra Problems
We created a pool of eight Algebra I problems from which participants received two randomly selected problems. Problems were curated from Khan Academy (4), OATutor (2), and ASSISTments (2). Each problem was selected based on its inclusion of hints, special characters (fractions, exponents, absolute value, pi), images, and scaffolds. The first problem contained hints, and the second contained a mix of hints and scaffolds. Either hint or scaffold problems could also contain images and special characters.
The first problem, with hints, was selected from a pool of four problems from Khan Academy, chosen as a reputable open educational resource (OER) that contains problems with images and special characters. At the time of the experiment, Khan Academy's problems did not offer scaffolds. We went through the Algebra I content from Khan Academy in order, selecting the first problems that met our criteria of including an image or special characters. The final four selected problems from Khan Academy contained (1) 2 hints, 1 image, and 1 special character (absolute value); (2) 2 hints and 2 special characters (exponents and fractions); (3) 2 hints and 1 special character (pi); and (4) 2 hints and 2 images. Because we are also interested in investigating problems with scaffolds, the second problem participants received was pulled from ASSISTments and OATutor Algebra I content to ensure balanced representation from each system in each condition. Problems were selected by going through Algebra I problems in order and finding the first ones that fulfilled the image and special character criteria. The final selected problems from ASSISTments and OATutor contained (1) 4 scaffolds, 4 hints, 3 images, and 1 special character (pi); (2) 2 scaffolds, 2 hints, and 2 images; (3) 1 scaffold, 2 hints, 1 image, and 1 special character (fraction); and (4) 3 scaffolds, 4 hints, and 1 special character (exponent). Since our problems came from three different sources that present their problems differently, we created our own template design using Google Docs that broke down each problem by problem title, problem body, answer choices, answer, hints, and scaffolds. Figure 3 depicts the template used across all eight problems.
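To make the template concrete, the following is an illustrative rendering of those fields as a structured record. The problem content and field names are our own invented example (in the spirit of the Khan Academy hint problems), not the exact schema of ASSISTments, OATutor, or the Google Doc template.

```python
# Hypothetical example of the template fields described above; the problem
# content and field names are illustrative, not either system's real schema.
problem_template = {
    "title": "Solving an absolute value equation",
    "body": "Solve |x - 3| = 7 for x.",
    "answer_choices": ["x = 10 or x = -4", "x = 10 only", "x = -4 only"],
    "answer": "x = 10 or x = -4",
    "hints": [
        "Rewrite the absolute value equation as two linear equations.",
        "Solve x - 3 = 7 and x - 3 = -7 separately.",
    ],
    "scaffolds": [],  # hint-only problems, like this one, leave this empty
}
```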

Accuracy Rubric
To assess how accurately instructors authored each problem, we quantify errors by enumerating every problem component correctly authored. We created a grading rubric for each of the eight problems, awarding a point each for the problem title, problem body, correct answer identified, hints/scaffolds, inclusion of images and special characters, and absence of extraneous components. Each score is the percentage of components correctly presented, calculated by dividing the number of components correctly transcribed by the total number of components. The second author, a researcher familiar with creating content for authoring interfaces, graded all problems.
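As a minimal sketch of this scoring scheme (not the rubric's actual implementation, and with hypothetical component names), the per-problem accuracy can be computed as follows:

```python
# Sketch of the rubric scoring described above. Keys are hypothetical rubric
# components; each boolean records whether that component was authored
# correctly.

def rubric_score(components: dict[str, bool]) -> float:
    """Return the fraction of rubric components transcribed correctly."""
    return sum(components.values()) / len(components)

# Example: a hint problem with 7 rubric components, 5 authored correctly.
score = rubric_score({
    "title": True, "body": True, "answer": True,
    "hint_1": True, "hint_2": False,
    "image": False, "no_extraneous_components": True,
})
print(f"{score:.4f}")  # 0.7143
```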

System Usability Scale (SUS)
To evaluate the interfaces' perceived usability, instructors answered the System Usability Scale (SUS) after finishing authoring two problems. The SUS, tested for reasonable reliability [3], is a 10-item questionnaire that measures users' perceptions of a system's effectiveness, efficiency, and satisfaction. We also calculate each system's learnability, multiplying items 4 and 10 by 12.5, to use alongside the overall SUS score [14]. When interpreting SUS scores, scores below 50 should be cause for significant concern, scores between 50 and 70 are marginally acceptable, and scores above 70 are adequate [3].
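For reference, the standard SUS scoring procedure and the item-4/item-10 learnability subscale described above can be computed as in the sketch below (our own illustration, not the study's analysis code):

```python
# Standard SUS scoring: odd (positively worded) items contribute r - 1,
# even (negatively worded) items contribute 5 - r; the raw 0-40 sum is
# scaled by 2.5. Learnability uses items 4 and 10, scaled by 12.5 [14].

def sus_scores(responses: list[int]) -> tuple[float, float]:
    """responses: ten 1-5 Likert ratings, item 1 first."""
    assert len(responses) == 10
    contrib = [(r - 1) if i % 2 == 0 else (5 - r)
               for i, r in enumerate(responses)]
    overall = 2.5 * sum(contrib)
    learnability = 12.5 * (contrib[3] + contrib[9])  # items 4 and 10
    return overall, learnability

print(sus_scores([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))  # (75.0, 75.0)
```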

Procedure
Each participant began the study by rating their level of familiarity with either GUIs or spreadsheet interfaces on a scale from 1 to 5 (1 being not at all familiar and 5 being very familiar). Next, participants were presented with training slides to read through and were told they could access them again once they began transcribing the problem at hand. To gauge a new user's typical training time for a system, we timed how long instructors spent on the directions slides. They then watched a training video about the respective system, which walked them through authoring an example problem.
Participants were not able to proceed to the next page until the full training video had played. After, they were presented with the Google Doc template problem that outlined each step for them to transcribe. The next page presented the interface to use, the template problem to transcribe, and links to the tutorial slides and video as references. After transcribing the problem to the best of their ability, participants were asked to reflect on their experience authoring, answering what worked well for them in the process and what they found confusing or difficult. Participants then repeated the process with a different problem. Finally, participants rated the usability of the interface with the SUS questionnaire. They also had the option to leave any other feedback regarding their experience.

RESULTS
We recruited 38 participants from a user research platform, Respondent. To qualify as a complete response, participants must have scored at least 3 points on at least one problem. Responses that did not meet this threshold were disqualified from our analyses. With this criterion, 6 participants were excluded. Thus, 32 participants were included in our analysis, with 16 participants per interface. All participants were instructors who have taught Algebra 1 for at least one year (39% at the middle school level, 34% at the high school level, 8% at both the middle school and high school level, and 18% at post-secondary institutions in the United States). All participants were given a $70 gift card for their participation.

Familiarity with Spreadsheet and Text Editors
Instructors, on average, were very familiar with spreadsheets (4.5 ± 0.803) like Microsoft Excel and Google Spreadsheets and very familiar with GUIs (4.5 ± 0.622) like Microsoft Word and Google Docs. Prior to analyzing the statistical significance of our independent variables on our dependent variable of time taken, we checked whether instructors' level of familiarity with spreadsheets and GUIs was related to time taken and accuracy in authoring problems. There were no statistically significant interaction effects between familiarity ratings and average time taken authoring problems in the GUI (F(1, 30) = 9.486, p = 0.004) and spreadsheet interfaces (F(1, 30) = 0.968, p = 0.968). In addition, there is no statistically significant interaction effect between prior knowledge rating of spreadsheet interfaces and accuracy in authoring problems (F(1, 30) = 0.817, p = 0.817). However, there is a statistically significant relationship between prior knowledge of GUIs and accuracy in authoring problems (F(1, 30) = 9.486, p = 0.004).

Time Taken
We perform a Mann-Whitney U test to evaluate whether time taken authoring a problem differs by interface. We chose this test because our data are not normally distributed and our sample size is small. The results indicate that there is no significant difference between the time taken by instructors authoring in the spreadsheet interface and instructors authoring in the GUI (U = 545.0, p = 0.663). There was also no significant difference in time taken authoring problems with hints in the spreadsheet interface versus the GUI (U = 155.0, p = 0.318), nor for mixed hints and scaffolds (U = 124.0, p = 0.895), images (U = 182.0, p = 0.965), or special characters (U = 476.0, p = 0.388).
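For illustration, this test can be run with SciPy as below; the arrays are placeholder values, not the study's measurements:

```python
# Mann-Whitney U test comparing authoring times by interface.
# The data here are hypothetical placeholders.
from scipy.stats import mannwhitneyu

spreadsheet_minutes = [11.2, 13.0, 9.9, 12.7, 10.8, 11.5]
gui_minutes = [9.8, 12.4, 7.5, 14.1, 11.0, 10.2]

u, p = mannwhitneyu(spreadsheet_minutes, gui_minutes,
                    alternative="two-sided")
print(f"U = {u:.1f}, p = {p:.3f}")
```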
In addition, we use an ordinary least squares regression to analyze the relationship between each interface and average time taken. Based on the model, we do not find statistically significant effects of the interface used on time taken (F(3, 28) = 0.937, p = 0.436). Due to the small sample size, the model also has low explanatory power, with R² = 0.091. On average, instructors take similar amounts of time, in minutes, to author a problem in the spreadsheet interface (11.56 ± 3.263 minutes) and the GUI (11.01 ± 3.839 minutes). The average time taken in minutes for each problem in each interface is presented in Table 1.
On average, instructors author problems with hints faster in the GUI (9.52 ± 4.71 minutes) than in the spreadsheet interface (10.15 ± 1.35 minutes). Instructors, on average, are also faster in the GUI authoring problems with images (12.66 ± 7.06 minutes) and special characters (10.87 ± 6.08 minutes) than in the spreadsheet interface. Instructors take similar amounts of time, on average, to author mixed problems with hints and scaffolds in the GUI (12.51 ± 2.50 minutes) and the spreadsheet interface (12.97 ± 4.21 minutes). The average time taken in minutes for each problem component type in each interface is presented in Table 2.
We also perform separate multiple linear regression analyses for each interface to decompose the relationship of time taken authoring each problem component relative to the other components. Using ordinary least squares, we model time taken to author a problem in each interface as

$T = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \epsilon$,

where $T$ is time taken in the interface, $x_1$ indicates hints, $x_2$ mixed hints and scaffolds, $x_3$ images, and $x_4$ special characters. We fit this regression for each interface separately. When decomposing problem components in the spreadsheet interface, we find that problems with images, special characters, hints, and mixed hints and scaffolds do not show significant relationships with time taken relative to one another. However, in the GUI, we do find significant effects on time taken for problems with hints (F(3, 28) = 2.533, p < 0.001), mixed hints and scaffolds (F(3, 28) = 2.533, p < 0.001), and special characters (F(3, 28) = 2.533, p = 0.047), as compared to problems with images. On average, instructors take 22.61 more minutes authoring problems with mixed hints and scaffolds, 16.61 more minutes authoring problems with hints, and 8.63 fewer minutes authoring problems with special characters in the GUI, as compared to problems with images.
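A minimal sketch of this dummy-coded model, using the statsmodels library with images as the reference category, is shown below; the data frame is a placeholder, not our study data:

```python
# Dummy-coded OLS decomposing authoring time by problem component,
# with images as the omitted baseline. Data are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "minutes":   [9.5, 12.5, 12.7, 10.9, 16.6, 8.1],
    "component": ["hints", "mixed", "images", "special", "mixed", "special"],
})
# C(...) expands the categorical predictor into indicator variables;
# Treatment(reference='images') makes image problems the baseline, so each
# coefficient is the difference in mean time relative to image problems.
model = smf.ols("minutes ~ C(component, Treatment(reference='images'))",
                data=df).fit()
print(model.summary())
```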
In addition, upon comparing the average time instructors spent reviewing the tutorial slides for each system before beginning the authoring task, we find a significant effect (U = 876.0, p = 1.04e-06): instructors spent more time looking at the spreadsheet interface's slides (7.65 ± 4.78 minutes) than the GUI's slides (3.12 ± 4.97 minutes).

Accuracy
We perform another Mann-Whitney U test to evaluate whether accuracy authoring a problem differs overall between the spreadsheet interface and the GUI. The results indicate a significant effect of more accurately authoring problems in the spreadsheet interface than in the GUI (U = 660.0, p = 0.047). Upon breaking down problems by component, we find a significant effect of more accurately authoring problems with hints (U = 200.0, p = 0.007) and special characters (U = 553.5, p = 0.328) in the spreadsheet interface than in the GUI. However, there is no statistically significant difference in accuracy for mixed hints and scaffolds (U = 553.5, p = 0.328) or images (U = 553.5, p = 0.382) between the two interfaces. The average accuracy score for each problem in each interface is presented in Table 3. On average, instructors more accurately authored problems in the spreadsheet interface (0.7341 ± 0.1534) than in the GUI (0.6574 ± 0.1546).
In addition, on average, instructors more accurately authored problems with hints in the spreadsheet interface (0.7951 ± 0.1597) than in the GUI (0.6927 ± 0.2292). Instructors, on average, also more accurately authored mixed problems with hints and scaffolds in the spreadsheet interface (0.6558 ± 0.3115) than in the GUI (0.5728 ± 0.1933). Instructors also, on average, authored problems with images more accurately in the spreadsheet interface (0.6976 ± 0.3122) than in the GUI (0.665 ± 0.2311). In addition, instructors authored problems with special characters more accurately in the spreadsheet interface (0.7100 ± 0.2542) than in the GUI (0.6100 ± 0.2092). The average accuracy score for each problem component type is presented in Table 4.
Figure 4 summarizes each interface's performance differentials in terms of time taken and accuracy for each problem component. Each bar leans towards the interface that performed best. We also perform separate multiple linear regression analyses for each interface to decompose the relationship of accuracy authoring each problem component relative to the other components. Using ordinary least squares, we model average accuracy in authoring a problem as

$A = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \epsilon$,

where $A$ is accuracy in the interface, $x_1$ indicates hints, $x_2$ mixed hints and scaffolds, $x_3$ images, and $x_4$ special characters. We fit this regression for each interface separately.
In the spreadsheet interface, there is no significant effect on average accuracy for problems with images and special characters. However, there is a statistically significant relationship between accuracy and inputting problems with hints (F(3, 28) = 1.238, p < 0.001) and problems with mixed hints and scaffolds (F(3, 28) = 1.238, p < 0.001) in the spreadsheet interface, as compared to the other problem components in the same interface. On average, instructors more accurately input problems with hints in the spreadsheet interface by 1 point and mixed hint and scaffold problems by 1.92% of a point, as compared to the other problem components in the same interface. In the GUI, there is a statistically significant relationship between accuracy and authoring problems with hints (F(3, 28) = 2.087, p < 0.001) and problems with mixed hints and scaffolds (F(3, 28) = 2.087, p = 0.027), as compared to the other problem components. On average, instructors more accurately author problems with hints by 2.67% of a point and mixed hints and scaffolds by 0.977% of a point in the GUI, as compared to the other problem components in the same interface.

System Usability Scale
After instructors completed authoring two problems in their respective interface, they completed the 10-item SUS questionnaire on a scale from 1 to 5, 1 being strongly disagree and 5 being strongly agree. Responses to individual SUS items, with mean scores and corresponding standard deviations for each interface, the learnability score, and the overall SUS score, are presented in Table 5.
After performing a Mann-Whitney U test to evaluate whether the SUS score differs between the interfaces, we found a significantly higher SUS score for the GUI than the spreadsheet interface (U = 60.0, p = 0.012). There is no statistically significant difference between the interfaces in learnability (U = 164.0, p = 0.177).

Qualitative Feedback on Authoring Experience
We synthesize responses from instructors on their experiences with each system, reporting trends shared by two or more instructors (at least 12% of instructors), by looking at their answers to questions about what worked well, what they found difficult or confusing about the authoring process, and any general feedback.

Spreadsheet Interface.
Inputting hints and content into the different row types was straightforward for 38% of instructors. The ability to copy and paste from the problem document to the interface was useful for 25% of participants. 19% of instructors cited their previous spreadsheet experience as a helpful prerequisite for learning the authoring process. However, 12% of instructors did not find anything that worked well in the process, saying "nothing worked well" and that the process "was not intuitive." Although placing problem components into the spreadsheet was straightforward for some participants, 44% found the spreadsheet fields and placement difficult. Specifically, keeping track of the different column meanings, labeling titles, and keeping track of dependencies of hints and scaffolds were confusing. Finally, formatting questions was confusing for 38% of instructors. 19% of instructors wanted the ability to preview the question to ensure they inputted each problem component correctly, and another 19% found the formatting for equations and special characters difficult, referencing the formatting guide as they transcribed problems to ensure they correctly followed the shorthand specifications. Overall, 31% of instructors reflected that their experience was tedious and that they currently use programs they find much easier. An instructor shared, "I already have another program that allows me to create my own problems, hints, steps, and answers, and insert images or make images on the interface. The program does not require any type of 'coding' like this." 12% of instructors felt that the authoring process would get easier over time, with one instructor mentioning, "it was complicated at first, but when I tried to do the first question as practice I kinda was getting a hang of it."

GUI.
The equation editor was helpful in inputting problem components for 56% of instructors who authored problems in the GUI. In addition, 44% found adding each of the problem components to the interface straightforward. Adding images was particularly easy for 25% of instructors. 25% of instructors also cited that learning this GUI was intuitive, as it matched other systems they were already familiar with, such as Google Docs and Microsoft Word. 44% of instructors were unsure if they entered hints correctly when asked to title more than one hint. One participant noted, "I was not sure if I should put the hint in the initial 'title' section or in the next step. I figured out I needed to utilize the next step, but then was unsure what the title was for or if I should have put some other crucial information in there. I wanted to be able to see what it would look like on the student end and couldn't immediately see a way to do that, so it left me unsure if my final product was what I might desire." Navigating among hints, scaffolds, and the main problem was also cumbersome for 19% of instructors. 12% of instructors were frustrated that they had to save their work at each step, with one sharing, "I realize that the work does not automatically save and I have to manually save each portion of the problem. I found this to be a little time-consuming and it could be frustrating if the teacher forgot to save it and the entire length of the problem, for example, and the question part gets deleted and that would require the teacher to start all over again." Overall, instructors found that the GUI was "well-intentioned" and "seems like a good system", but navigating and editing scaffolds took more effort to understand.

DISCUSSION
Upon inspecting authoring time per interface, several trends emerge. The similar overall times taken per problem between the spreadsheet interface and the GUI indicate that in the average problem creation scenario, neither interface format surpasses the other. Such a result indicates that both interface types can be reasonably used for content creation without immediate time drawbacks. This is further reinforced by mixed problems with hints and scaffolds also taking similar time per interface. A combination of hints and images (Problem D) was not only the fastest but also the most accurate problem for the spreadsheet interface, suggesting that this format is well suited for authoring. However, when only a single special feature is present (hint, image, special character), the GUI tends to be faster on average by about a minute per category than the spreadsheet interface. After examining problems by their components, we find that instructors author problems with special characters significantly faster in the GUI as compared to problems with images. They also author problems with hints and mixed hints and scaffolds significantly slower in the GUI than problems with special characters and images. No significant relationships were found between time taken in the spreadsheet interface and authoring problems with hints, mixed hints and scaffolds, images, or special characters in comparison to one another. These results suggest that the GUI might afford more efficient authoring of problems with special characters, while no specific problem components are especially efficient in the spreadsheet interface.
In comparing the accuracy of authoring problems in each interface, the spreadsheet interface was significantly more accurate than the GUI overall, for problems with hints, and for problems with special characters. These results could indicate that the spreadsheet interface lends itself to an easier creation process for problems with symbols that are more programmatic in nature. Further examining problems by their components finds that in both the spreadsheet interface and the GUI, instructors author problems with hints and mixed scaffolds significantly more accurately than problems with images and special characters. This demonstrates that authoring problems with dependencies was straightforward in both graphical and programmatic environments.
In our study, the mean SUS score of 40 for the spreadsheet interface indicates that the spreadsheet interface is unacceptable. The mean SUS score of 62 for the GUI indicates that the interface is marginally acceptable. Our findings show that although instructors' perceptions of each system's usability were low, their average accuracies were relatively acceptable (73.41% in the spreadsheet interface and 65.74% in the GUI). The discrepancy between instructors' perceptions of each system (SUS) and their actual performance (accuracy) reflects cognitive dissonance, where instructors' cognitive structures of expectations and actuality are inconsistent with one another, resulting in a psychological state of dissonance [9]. In our study's case, we observed a case of positive disconfirmation, a situation where actual performance exceeds expected performance [26]. The dissonance between instructors' perceptions of their experience transcribing problems and their actual performance in terms of accuracy might be explained by the lack of integration between learning strategies, mental models of learning, and learning orientations [30]. Instructors may have also tried to relate elements of the spreadsheet interface and GUI to systems they are familiar with in their current workflows, with 31% of instructors sharing that they currently use programs they find much easier. Both interfaces scored marginally acceptable in learnability (79.69 for the spreadsheet interface and 67.19 for the GUI), demonstrating that it is reasonable to assume first-time users can navigate each system to author content with the minimal training resources provided.
Thus, when examining the affordances of GUIs for authoring open access content, we can highlight the GUI's faster average speed as compared to the spreadsheet interface's higher accuracy. In terms of limitations, the GUI appeared to have issues displaying all problem information on a single page, which can cause further concern if a co-creator has to make edits to created content. As for spreadsheet interfaces, since instructors took longer examining the tutorial information, learning to use a spreadsheet interface in the context of authoring problems may require a shift in understanding to match the programmatic nature to instructors' expressions. Further research would be required to better understand the difference in learning times.
We chose algebra as our subject in this study because algebra serves as one of the fundamental building blocks for STEM (in addition to calculus), acting as a baseline for content that can expand into other domains. Khan Academy, one of the systems we curated problems from, features content for various STEM disciplines, such as physics and statistics, which reasonably build upon algebra. This indicates the foundational importance of testing on such content. Furthermore, the features of hints, scaffolds, images, and special characters are not exclusive to algebra or even mathematics as a domain. Thus, it is reasonable to assume that our insights are not restricted simply to algebra tutoring, but are relevant to content creation for multiple disciplines.

Design Implications
A major point of concern within the GUI rested in the complexity of its interface panels. The spreadsheet (OATutor format) features the entire problem creation on a single page. Meanwhile, the GUI used (ASSISTments Builder) requires creation of hints and scaffolds on separate pages, potentially creating difficulties in tracking the flow of the problem. Furthermore, deleting a hint or a scaffold in the spreadsheet interface requires the removal of a column (taking two individual mouse inputs), whereas deleting a hint or a scaffold in the GUI requires deleting an entire page worth of content. A more linear GUI display could alleviate such concerns, allowing for easier viewing of problem help and enabling fast deletion of unwanted components.
As for the spreadsheet interface, its labeling was referenced as unintuitive multiple times, being reminiscent of coding. This displays a major limitation of the spreadsheet interface, as it implies potential difficulty in understanding its usage. Potential solutions could lie in further simplification of the columns, making them optional if they are unused, or, instead, additional resources to support new users.
When instructors choose which type of interface to author content in, they might use GUIs for their perceived usability. This could suggest that future systems provide a GUI to cater to independent instructors and other communities of users who voluntarily experiment with and utilize content creation interfaces. However, instructors might consider using spreadsheet interfaces if they wish to more accurately author problems, especially ones that contain hints or special characters. Our results suggest that a joint interface type could be effective, depending on the types of problem components most commonly used and whether time taken or accuracy, if not both, holds more importance. Instructors who wish to reduce time taken authoring problems in a GUI might consider including problems with special characters instead of images. In addition, instructors who wish to maintain higher accuracy in expressing their content in either spreadsheet interfaces or GUIs might include only problems with hints and mixed hints/scaffolds over problems with images and special characters. Including a preview option in either interface type would be useful for educators in ensuring they author problem structures in the way they intend.

LIMITATIONS/FUTURE WORK
The spreadsheet interface's introductory tutorial slides were modeled after a pre-existing tutorial from the GUI to create as similar a training environment as possible for the study. There exists a possibility that each interface requires a distinct type of training environment for new users, and the spreadsheet interface's introduction was not sufficient to onboard users with the same skill set as the spreadsheet interface's content creators (who undergo a 2-hour introductory course). One of the inherent benefits of a GUI tends to be that it is more intuitive, perhaps suggesting that the spreadsheet interface has a steeper learning curve or that the spreadsheet structure is not as obvious even though it lends itself to higher accuracy. Future studies could experiment with different types of training for each interface and examine how training affects time taken and accuracy in authoring problems.
As for the authors of the content, we only sampled instructors. Future studies could consider a different audience's experience authoring. From researchers to new curricula developers, different interfaces could lend themselves to different audiences, and thus sampling instructors is perhaps not conclusive for every use case of said authoring interfaces. For example, OATutor's original content library was authored using spreadsheets by undergraduates with experience tutoring the subject matter, not instructors.
While we discussed that humans will play some role in content creation for the foreseeable future, it is vital to give the appropriate amount of attention to inevitable LLM integrations. Once LLM confabulation rates decrease, tools such as ChatGPT will likely play critical roles in creating educational content. To ease the incorporation of such generated content into human content creation, future work could look into interface modifications that best accommodate this hybrid content creation. One issue identified in the GUI condition is that having a page for editing hints and scaffolds separate from editing the main problem was confusing. The simplest solution may be to integrate all editing onto a single page; however, if that approach proves overly cluttered, a natural language interface (i.e., voice) could be used in place of an editing interface, where the author gives commands to the LLM, which then translates those authoring commands into structured content. Users also encountered confusion in the spreadsheet interface with the hint/scaffold dependency columns. One remedy may be to make this dependency information optional and create a default dependency behavior (i.e., previous hints must be viewed to unlock the next hint); a minimal sketch of this default appears below. Should a power user want to utilize this feature, a natural language interface could again be utilized to specify the dependency rules, with an LLM filling in the structural dependency information based on an utterance such as, "Students should have to answer the first scaffold question before seeing the first hint." This possible combination of LLM and human problem creation could be visualized as a third, separate interface. The LLM would handle the problem creation, perhaps with the content author providing prompting for personalization of the content. The content interface would then manifest in the form of problem editing, enabling the content author to edit LLM-generated content and perhaps add the aforementioned additional hints and scaffolds. In an NLP-focused system, errors are still bound to appear, requiring a human-facing interface to audit and refine content. The most sensible interface for integrating LLMs will rely on nascent research about how to best integrate new technology into the authoring process and what integration best serves students' interests, considering teachers' and content authors' pedagogical agencies.
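As a minimal sketch of the default dependency behavior proposed above (the field names and structure are hypothetical, not OATutor's actual format):

```python
# If an author supplies no explicit dependency, each hint unlocks only
# after the previous one has been viewed; explicit entries (e.g., filled
# in by an LLM from a natural language utterance) override the default.

def resolve_dependencies(hints: list[dict]) -> list[dict]:
    """Fill in a default linear unlock order for hints lacking one."""
    for i, hint in enumerate(hints):
        if "requires" not in hint:
            hint["requires"] = [hints[i - 1]["id"]] if i > 0 else []
    return hints

hints = resolve_dependencies([
    {"id": "h1", "text": "Isolate the variable term."},
    {"id": "h2", "text": "Divide both sides by 2."},
    # A power user's explicit rule, e.g., "answer scaffold s1 first":
    {"id": "h3", "text": "Check your answer.", "requires": ["s1"]},
])
print(hints)
```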

CONCLUSION
Our study measured instructors' time taken, accuracy, and perceived usability in authoring algebra problems with different components in a spreadsheet interface and a GUI to explore the affordances and limitations of each. We found that with the spreadsheet, instructors authored problems significantly more accurately than with the GUI as a whole (U = 660.0, p = 0.047), in particular problems with hints (U = 200.0, p = 0.007) and problems with special characters (U = 553.5, p = 0.328). We did not find a significant effect of interface used on time taken. Although instructors perceived the usability of the GUI to be significantly higher than that of the spreadsheet interface (U = 60.0, p = 0.012), instructors authored problems as a whole more accurately in the spreadsheet interface by 7.67%, problems with hints by 10.24%, and problems with special characters by 10%, than in the GUI. The dissonance between instructors' perceptions and actual accuracies in authoring problems serves as a reminder that the System Usability Scale score measures user satisfaction but cannot be used alone to measure the entire usability experience.
Our study did not find significant results in terms of time taken authoring and the interface used. The decision of which interface to ultimately build for authoring systems is left up to researchers and designers. Both interface types are reasonable for authoring content in terms of accuracy and time. Beyond usability, researchers can consider other factors such as cost, available resources, time, and target audience. Authoring system designers should consider what works well and what can be improved upon in existing systems to build upon users' prior knowledge and ease learnability.

Figure 4 :
Figure 4: Overview of difference in time taken and accuracy by every problem component type in each interface.

Table 1 :
Average time taken in minutes for every problem in each interface

Table 2 :
Average time taken in minutes for each problem component type in each interface

Table 3 :
Accuracy scores for every problem in each interface

Table 4 :
Average accuracy scores for each problem component type in each interface