TutoAI: A Cross-domain Framework for AI-assisted Mixed-media Tutorial Creation on Physical Tasks

Mixed-media tutorials, which integrate videos, images, text, and diagrams to teach procedural skills, offer more browsable alternatives than timeline-based videos. However, manually creating such tutorials is tedious, and existing automated solutions are often restricted to a particular domain. While AI models hold promise, it is unclear how to effectively harness their powers, given the multi-modal data involved and the vast landscape of models. We present TutoAI, a cross-domain framework for AI-assisted mixed-media tutorial creation on physical tasks. First, we distill common tutorial components by surveying existing work; then, we present an approach to identify, assemble, and evaluate AI models for component extraction; finally, we propose guidelines for designing user interfaces (UI) that support tutorial creation based on AI-generated components. We show that TutoAI has achieved higher or similar quality compared to a baseline model in preliminary user studies.


INTRODUCTION
Instructional videos are important sources for people to acquire new skills. However, the linear timeline-based video format provides limited overviews, with no explicit representation of the steps and their dependencies. Besides, navigating the timeline is tedious and imprecise. While users can fast-forward or replay videos, scrubbing the timeline might cause them to overlook vital information [66,81].
Recent work has shown that mixed-media tutorials, which unify videos, images, text, and diagrams in an interactive user interface, offer more browsable alternatives. For example, YouTube Chapters [54] help navigate long-form videos: each chapter corresponds to a video segment with a short text description, a thumbnail, and a timestamp. Researchers have also proposed non-linear mixed-media tutorials for tasks such as applying makeup and cooking [49,65,74]. Such tutorials optimize user navigation by providing object details and organizing steps based on dependencies.
Although the benefits of mixed-media tutorials are confirmed, creating such tutorials from the original instructional videos remains challenging. Current approaches for authoring mixed-media tutorials are usually domain-specific, with both the tutorial components and extraction techniques tailored for each domain [14,20,49,65]. While many have acknowledged the importance of generalization and argued how their approaches could apply to tutorials in other domains [32,65,67,70], a cross-domain framework with shared vocabulary and reusable methodologies for mixed-media tutorial creation is still lacking. We believe such a framework will benefit the future development of mixed-media tutorial creation, as demonstrated in other research areas [7,11].
Recent advances in AI, especially large language models (LLM) [9], have shown promise in content understanding and generation, and can potentially play a vital role in establishing a cross-domain framework. However, integrating AI with mixed-media tutorial creation is not straightforward. First, we have neither a vocabulary to describe the common components of mixed-media tutorials nor a systematic account of the roles of humans and AI in extracting such components. Second, a single component may have multi-modal appearances (e.g., cooking ingredients appearing in both the audio narration and video frames), and multiple machine learning (ML) models are applicable. Currently, there are no guidelines on how to assemble and evaluate ML models to obtain mixed-media tutorial components from original videos. Though the landscape of ML models changes over time, we believe there are general guidelines that could transcend specific models.
To address these challenges, we present TutoAI, the first cross-domain framework to integrate AI in creating mixed-media tutorials (Figure 1). We focus on instructional videos on physical tasks (e.g., cooking, hardware assembly) instead of concepts (e.g., lectures) or digital artifacts (e.g., software usage, programming). The TutoAI framework has three levels: components, models, and user interfaces (UI). At the component level, we conduct a comprehensive survey to identify common components of mixed-media tutorials and analyze their representations. At the model level, we review ML methods to extract each component and present an approach to assemble and evaluate applicable ML models. At the UI level, we propose guidelines for building UIs that allow creators to review and edit AI-generated components and also implement an example interactive prototype.
We evaluate TutoAI in two ways. At the model level, we validate the performance of the assembled ML pipeline on a large set of cooking videos and a small set of diverse instructional videos. At the UI level, we evaluate the user-perceived component quality by conducting two studies with 24 general instructional video viewers and 2 YouTube creators. Our results show that TutoAI-generated components have higher or similar quality compared to a baseline model (YouTube Chapters [54]), and the TutoAI framework has the potential to be integrated into creators' workflow. In summary, we make the following contributions:
• A comprehensive survey of mixed-media tutorials and a taxonomy of mixed-media tutorial components.
• TutoAI, a cross-domain framework for AI-assisted mixed-media tutorial creation on physical tasks, including components, models, and UIs.
• Empirical evaluation of the TutoAI framework in terms of model quality, user-perceived quality, and workflow integration.

RELATED WORK

Mixed-media tutorials
Mixed-media tutorials, though diverse in format, share commonalities in tutorial components and extraction methods.
Tutorial components. A common component is a step, usually a video segment, comprising a text description, a thumbnail, and a timestamp [20,32,53,54,70]. A step could range from a cooking procedure [12] to a software operation [20]. Another common component is objects, e.g., ingredients and equipment for cooking tutorials [44,49,74]. Besides steps and objects, some tutorials also organize steps based on dependencies, e.g., Truong et al. [65] grouped makeup video segments by facial parts in a two-level hierarchical format; Nawhal et al. [49] and Yang et al. [74] arranged cooking steps non-linearly by temporal and spatial dependencies. TutoAI, our proposed framework, has a Components level built upon components distilled from existing mixed-media tutorials.
Extraction methods. Tutorial component extraction from original videos could be manual, automatic, or mixed-initiative (detailed comparison in Appendix Table 2-4). Websites like WikiHow [71] and Allrecipes [44] depend on experts to draft tutorials; Crowdy [70] requires learners to identify subgoals and steps. In certain domains, automatic extraction methods are feasible. MixT [14] segments Photoshop videos using software logs. Fraser et al. [20] implement a dynamic programming method to segment creative stream videos based on the transcript and software logs; Truong et al. [65] apply video shot detection and transcript segmentation methods for makeup videos. However, the above methods require domain-specific data and may not apply to other domains. Mixed-initiative methods involve both human effort and computational techniques. Humans could provide input, e.g., ToolScape [32] gathers steps from crowdworkers and converges them through clustering algorithms. EverTutor [68] converts smartphone demonstrations by humans into interactive tutorials. Humans could also refine computational results, e.g., VideoWhiz [49] and RecipeDeck [12] both employ Part-of-Speech (POS) tagging to detect cooking actions and objects and then rely on annotators to refine the results. Video Digests [53] applies Bayesian topic segmentation to generate chapters in lecture videos, allowing users to improve upon them. The second level of TutoAI focuses on models, including an approach to identifying, evaluating, and assembling AI models to extract tutorial components. TutoAI also adopts a mixed-initiative approach, where humans refine computational results.
Cross-domain applicability is a goal in previous work on mixed-media tutorials. For example, Truong et al. suggest their segmentation algorithm for makeup videos could be adapted for cooking, DIY, and bartending [65]; Soloist [67] transforms instructional guitar videos into mixed-media tutorials, and the processing pipeline can be generalized to other instruments; Kim et al. show that the same annotation pattern combined with a clustering algorithm can process cross-domain instructional videos [32]; Crowdy [70] is a subgoal-based crowdsourcing annotation workflow.
TutoAI extends this line of work, aiming to create a general cross-domain framework for mixed-media tutorials. Unlike crowdsourcing annotation workflows, TutoAI relies on AI.

AI-assisted creation
AI has augmented human creativity, from generating visuals [45,57] to crafting slogans and aiding scientific writing [9,22]. However, AI outputs may be imperfect or misaligned with user intentions, necessitating human refinement. Researchers have built AI-assisted creation tools in multiple domains, e.g., Cococo [41] allows users to adjust the mood of AI-generated music notes. Morai Maker [25] is a game-level editor in which human and AI designers take turns to build a Super Mario Bros game. LaMPost [23] facilitates email writing for people with dyslexia. Dang et al.'s text editor [16] helps writers refine automatically generated paragraph summaries. Some tools focus on refinement instead of creation: e.g., refinement of topics returned by topic models [61]; repair of auto-extracted PDF tables [27]; refinement of medical images retrieved by ML models [10]. TutoAI also adopts an AI-assisted approach, supporting the creation of mixed-media tutorials with extensive refinement. Unlike previous work focusing on a single modality, TutoAI supports multi-modal mixed-media tutorial creation empowered by various ML models.
Providing guardrails for AI output is crucial. Previous research has proposed several principles for designing such mixed-initiative user interfaces [4,28], such as "provide mechanisms for efficient agent-user collaboration to refine results" and "support efficient correction". TutoAI adheres to these principles, and additionally shares design considerations for choosing ML methods across modalities.

Large language models (LLM) prompting
Large language models (LLM) [8,9,64], trained on internet-scale data, have demonstrated extraordinary potential in information processing tasks such as text summarization. Users interact with LLMs by providing natural language descriptions of the task, also called prompting [55]. The most commonly used prompting technique is zero-shot prompting [5], which describes the task directly. There are also other prompting techniques, including few-shot prompting [42] and prompt chaining [73]. Researchers have applied zero-shot prompting to summarize various types of data, including news [24,79], Reddit posts [75], meeting records [34] and stories [75]. Researchers have also applied LLMs to summarize video transcripts. Croitoru et al. [15] applied GPT-3 to summarize software tutorial video transcripts and then used the summary to detect key moments. LUSE [60] also uses zero-shot prompting to summarize tutorial video transcripts and generalize steps for a task across different videos. To evaluate the summarization quality of LLMs, researchers have used traditional metrics like ROUGE scores [37], which measure the number of overlapping n-grams between the reference and the summarized text, and have also employed humans to examine different aspects of the output, including coverage [60], descriptivity [60], coherence [79], faithfulness [79], relevance [79] and personal preferences [24].
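To make the n-gram overlap idea behind ROUGE concrete, the following is a minimal sketch of a ROUGE-N style score. It is a simplification and not the official ROUGE implementation: it assumes lowercase whitespace tokenization with no stemming, and the function names `ngrams` and `rouge_n` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """Simplified ROUGE-N: clipped n-gram overlap between two strings.

    Assumption: lowercase whitespace tokenization, no stemming.
    """
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & cand).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Example: a human annotation versus a generated step description.
scores = rouge_n("preheat oven to 375 degrees", "preheat the oven to 375 degrees")
print(scores["recall"])  # 1.0: every reference unigram appears in the candidate
```

Because the score only counts surface overlap, a generated step that adds correct detail or rephrases the annotation can score lower without being a worse summary, which is one reason human inspection is used alongside it.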
TutoAI also relies on zero-shot prompting to summarize video transcripts. In addition to requesting a summary, TutoAI also asks an LLM to extract objects and timestamp information. Like Croitoru et al. [15] and LUSE [60], TutoAI uses the generated summary as input for other models. The difference is that their contributions are models that focus on a single task (e.g., detect video moments) and exclude humans from the loop, but TutoAI contributes an AI-assisted framework. As LLMs suffer from hallucination (plausible yet incorrect output) [80], involving human refinement is crucial for end users. Similar to previous research, we manually evaluated the output besides ROUGE scores.

TUTOAI OVERVIEW: AN AI-ASSISTED FRAMEWORK
The TutoAI framework aims to provide a cross-domain approach to AI-assisted creation of mixed-media tutorials on physical tasks. We expect the input to include an instructional video and its transcript.
Our design goals, informed by the review of current mixed-media tutorials and ML methods, are:
D1 Support cross-domain tutorial creation: Mixed-media tutorials are useful in diverse domains, and TutoAI should offer a generalized approach.
D2 Handle multi-modal data types: The input instructional videos and the output mixed-media tutorials both contain multi-modal data. TutoAI should support multi-modality.
D3 Empower creators without information overload: Given the multi-modalities in mixed-media tutorials and the vast landscape of ML models, TutoAI should present information to creators without overwhelming them.

Level 1: Components
As shown in Figure 1, TutoAI is built on three types of cross-domain components in mixed-media tutorials (D1): steps, objects, and dependencies (detailed in section 4). These components are multi-modal (D2), specifically:
• Steps: represented as text, images, video clips, and temporal metadata (timestamps)
• Objects: represented as text, images, and temporal metadata (appearance time in videos)
• Dependencies: encoded as hierarchical structures, diagrams, and links
The output mixed-media tutorials may include all or a subset of these components. For instance, YouTube Chapters [54] only utilize steps. For completeness (D2), we discuss all three component types in level 2 and level 3.
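One way to picture the three component types is as a small data model. The sketch below is illustrative only: the class and field names (`Step`, `TutorialObject`, `Tutorial`, `thumbnail`, `appearances`) are assumptions made for this example, not names taken from the TutoAI implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Step:
    """A step: text, an optional image, and temporal metadata."""
    description: str
    start: float                      # start time in seconds (assumed unit)
    end: float                        # end time in seconds (assumed unit)
    thumbnail: Optional[str] = None   # hypothetical path to a video frame
    objects: List[str] = field(default_factory=list)


@dataclass
class TutorialObject:
    """An object: text, an optional image, and appearance times."""
    name: str
    image: Optional[str] = None
    appearances: List[float] = field(default_factory=list)


@dataclass
class Tutorial:
    """A tutorial may hold all three component types or only a subset."""
    steps: List[Step]
    objects: List[TutorialObject] = field(default_factory=list)
    # Each dependency is a link (earlier_step_index, later_step_index).
    dependencies: List[Tuple[int, int]] = field(default_factory=list)


# A chapter-style tutorial that only uses steps.
chapters = Tutorial(steps=[Step("Cut the board", 12.0, 45.5)])
print(len(chapters.objects), len(chapters.dependencies))  # 0 0
```

Keeping objects and dependencies optional mirrors the point that an output tutorial can use only a subset of the components.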

Level 2: Models
After identifying the components and their representations, we focus on methodologies to select and evaluate applicable ML models to obtain such components from instructional videos. Even though cutting-edge ML models change over time, the general approaches we suggest here transcend particular models (Section 5).

Identifying relevant models.
The first task is identifying models capable of extracting information required for a component. We consider models that take visual or transcript data from the video as inputs (D2), and with outputs that match the desired component representations. For instance, if a step component requires text descriptions, then models that ingest video transcripts or frames, and output text descriptions are applicable.

Assembling models.
After identifying relevant models, we assemble models into candidate pipelines based on input and output modalities. For example, if a step component requires text descriptions and timestamps, instead of finding a single model that generates both, we can assemble two different pipelines serving the same goal. In the first pipeline, one model generates text descriptions, and the other locates the descriptions in the video. Alternatively, we can assemble another pipeline where one model segments videos first and the other model generates text descriptions for each segment.
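The modality-matching idea above can be sketched as a small search over model chains. This is a toy illustration under stated assumptions: the model names and their input/output modality sets in `MODELS` are hypothetical placeholders, and the chaining rule (each later model must consume something an earlier model produced) is a simplification chosen for this example.

```python
from itertools import permutations

# Hypothetical model registry: name -> (input modalities, output modalities).
MODELS = {
    "text_summarizer": ({"transcript"}, {"description"}),
    "moment_localizer": ({"frames", "description"}, {"timestamps"}),
    "video_segmenter": ({"frames"}, {"timestamps"}),
    "segment_captioner": ({"frames", "timestamps"}, {"description"}),
}


def candidate_pipelines(models, available, required, max_len=2):
    """Enumerate ordered model chains that yield all required modalities."""
    pipelines = []
    for length in range(1, max_len + 1):
        for order in permutations(models, length):
            have = set(available)   # modalities available so far
            produced = set()        # modalities produced by earlier models
            valid = True
            for position, name in enumerate(order):
                inputs, outputs = models[name]
                runnable = inputs <= have
                adds_something = not outputs <= have
                chained = position == 0 or bool(inputs & produced)
                if not (runnable and adds_something and chained):
                    valid = False
                    break
                have |= outputs
                produced |= outputs
            if valid and required <= have:
                pipelines.append(order)
    return pipelines


for pipeline in candidate_pipelines(
    MODELS, {"transcript", "frames"}, {"description", "timestamps"}
):
    print(" -> ".join(pipeline))
```

With this toy registry the search returns the two alternatives described in the text: summarize then localize, or segment then caption.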

Evaluating models.
After considering alternative ways to assemble models, we first find common benchmark metrics for model evaluation. Besides objective metrics, we also assess correction efforts for creators. For example, false positives (FPs) are deemed easier to fix than false negatives (FNs), as fixing FPs requires deletion, but fixing FNs requires creation.
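The asymmetry between fixing FPs and FNs can be expressed as a weighted count. The sketch below is only one possible formalization: the function name `correction_effort` and the cost values (1.0 per deletion, 3.0 per creation) are hypothetical and are not measurements from the paper.

```python
def correction_effort(predicted, ground_truth, delete_cost=1.0, create_cost=3.0):
    """Estimate creator effort to fix a model's output.

    False positives need a deletion; false negatives need a creation.
    The default costs are illustrative assumptions, not measured values.
    """
    predicted, ground_truth = set(predicted), set(ground_truth)
    false_positives = predicted - ground_truth
    false_negatives = ground_truth - predicted
    effort = delete_cost * len(false_positives) + create_cost * len(false_negatives)
    return {
        "false_positives": len(false_positives),
        "false_negatives": len(false_negatives),
        "effort": effort,
    }


truth = {"flour", "egg", "sugar"}
over_generating = correction_effort({"flour", "egg", "sugar", "necklace", "table"}, truth)
under_generating = correction_effort({"flour", "egg"}, truth)
print(over_generating["effort"], under_generating["effort"])  # 2.0 3.0
```

Under these assumed weights, a model that over-generates two extra objects is cheaper to correct than one that misses a single required object, even though it makes more errors in total.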

Level 3: User Interface (UI) design
AI-generated results are typically imperfect, requiring further refinement from humans. As shown in Figure 1, the UI should support creators to review and revise AI-generated results. To manage cognitive load (D3), the UI should display AI-generated results sequentially, allowing creators to focus on one aspect at a time, and the refined results could be input for subsequent stages, mitigating error propagation. Section 6 discusses UI design guidelines and presents an example implementation.
For each mixed-media tutorial, we annotated the informational units, such as ingredients in recipe tutorials, and visual representations. These units were then categorized into three types of components: step, object, and dependency. We also annotated extraction methods based on human roles (Appendix Table 2-4).

Step
Every tutorial comprises a sequence of steps, e.g., "duplicating a layer" in a Photoshop tutorial [14]. These steps may be conveyed through text, images, and video clips. Among the 13 tutorials we studied, 12 used text descriptions, 10 featured images, and 7 included video clips. Auxiliary elements can enrich the primary media. Timestamps help locate the step in the original video, overlays emphasize parts of an image, and glyphs connect images or text. We found 5 out of 7 tutorials with video clips also provide timestamps; two tutorials have overlays on images, and one uses glyphs.
Figure 2 provides step examples in mixed-media tutorials. Specifically, Figure 2a shows a step in an interactive smartphone tutorial, marked by an overlay indicating the screen area to be clicked [68]; Figure 2b depicts an auto-generated YouTube Chapter for a DIY craft video featuring text, images, and video clips (with timestamps); Figure 2c illustrates a step in a cooking tutorial, where red and blue dots signify ingredients and actions, respectively [12]. Comprehensive details are in the Appendix.
Figure 2: Examples of steps in mixed-media tutorials (images used with permission). (a) A step with an image and overlays in a smartphone tutorial [69]. (b) A step with text, images, video clips, and temporal metadata in a DIY craft tutorial [44]. (c) A step with text and glyph in a cooking tutorial [12].

Object
Many mixed-media tutorials explicitly specify objects required for the task, such as ingredients and equipment in cooking tutorials [74], and UI widgets in software tutorials [20]. These objects can be represented through text, images, and timestamps marking their appearance in videos. In our dataset of 13 mixed-media tutorials, 7 explicitly included object components. While the remaining 6 tutorials contained objects implicitly in the instructions, they did not extract and represent these objects as individual components. All 7 tutorials with object components featured text descriptions, 3 incorporated object images, and 2 had appearance time in videos. Additionally, 3 offered interaction features, including checkboxes or clickable buttons that link objects with other components.
Figure 3 illustrates examples of objects in mixed-media tutorials. Figure 3a displays an object component from a roof repair tutorial on WikiHow [6], with interactive checkboxes to help users gather things needed; Figure 3b shows object buttons; clicking on an object button (e.g., "beef (steak)") brings up video frames containing that object and the appearance time on the timeline [74]. All the examples are in the Appendix.
Figure 3: Examples of objects in mixed-media tutorials (images used with permission). (a) Object components represented using text and interactive checkboxes in a roof repairing tutorial [6]. (b) Object components represented using text, images, and appearance time in a cooking tutorial [75].

Dependency
Dependencies between steps are everywhere; they could be food processing order in recipe tutorials [12,49,74], concept prerequisites in lectures [39] and facial parts in makeup tutorials [65]. Dependencies may imply a different order than the one presented in the original instructional video. For example, in a cake recipe video, though the preparations of dry and wet ingredients are shown sequentially, they could be done in parallel [70]. In the TutoAI framework, we focus on physical tasks, where the dependencies between steps are the execution order. Of our collected 13 examples, 5 include dependencies explicitly. Among those 5, 4 utilize spatial layout to encode the dependency, and 3 have links in the diagram.
Figure 4 shows dependency examples. Figure 4a shows groupings in a makeup tutorial where steps within each group are sequential but independent of other groups. Figure 4b maps out the dependencies of concepts in a lecture: orange nodes are already covered, and gray nodes are not. Figure 4c outlines cooking steps in different rows and columns: steps on the same row must be done sequentially, but steps on different rows could be done simultaneously; steps are also grouped by spatial dependencies (e.g., cutting board) in rectangles. All the examples are in the Appendix.
Figure 4: Dependency examples in mixed-media tutorials (images used with permission). (a) Spatial dependencies in a makeup tutorial [66]. (b) Concept prerequisites in a lecture [39]. (c) Action dependencies in a cooking tutorial [75].

LEVEL 2: ASSEMBLE AND EVALUATE MODELS
We first review applicable models and candidate pipelines to extract mixed-media tutorial components. We then evaluate them on an annotated dataset of 347 cooking videos and finalize a pipeline.
Note that we only apply ML models to step and object extraction; for dependencies, we build a directed acyclic graph (DAG) based on the temporal order and shared objects between steps. To assemble pipelines that extract all the step information, we start with models that take video frames and transcripts as input and chain additional models based on the output. Figure 5 shows 4 candidate pipelines.

Applicable models and candidate pipelines
• Pipeline 1: text summarization + NLVL + shot boundary detection. As shown in Figure 5, pipeline (1)  Given timestamps and video frames, dense captioning models generate step descriptions.

Object extraction.
For the sake of completeness, we assume that an object component needs the following information: object names and an image containing the object's bounding box. We have identified relevant models:
• Models for object names. We identified three types of models to extract object names: Part-of-Speech (POS) taggers, LLM prompting, and traditional object detectors. POS taggers take text as input, categorizing words' roles in a sentence with grammatical properties such as nouns and verbs [2]. Obtaining object names from POS tagging results requires parsing nouns. LLMs can also be prompted to extract object names from text input. Traditional object detectors are trained on predefined object categories and, given input images, output detection names and bounding boxes [19,38].
• Models for object bounding boxes. We identified two types of models to obtain object bounding boxes: traditional and open-vocabulary object detectors. As mentioned before, traditional object detectors take images as input and return bounding boxes as output. However, they can only recognize objects in the training dataset. Open-vocabulary object detectors take in both object names and images, and output bounding boxes for the object names [30,36,47].
After considering the relevant models, we assemble them into three candidate pipelines.
• Pipeline 1: POS tagger + open-vocabulary detector. As shown in Figure 6, pipeline (1) uses POS taggers to identify object names from the video transcript. It then passes these names and video frames into open-vocabulary object detectors to localize the objects.
• Pipeline 2: LLM + open-vocabulary detector. Pipeline (2) prompts an LLM to extract object names from the transcript and runs an open-vocabulary object detector.
• Pipeline 3: traditional object detectors. Pipeline (3) only uses traditional object detectors to obtain both the object names and bounding boxes.

Evaluation of applicable models and candidate pipelines
Overall evaluation approach and metrics.
We evaluate models within the mentioned pipelines and discard any with subpar performance. Based on available source code and pre-trained models, we use at least one state-of-the-art (SoTA) implementation for each model type. While objective metrics are utilized, we also conduct manual inspections, especially when standard metrics fail to capture the error profiles. In the following subsections, we report the main findings from the evaluation. Appendix A.1 provides detailed information about the evaluation dataset and results.

Evaluation dataset.
We evaluated on the validation set of YouCook2 [82], containing 347 cooking videos with auto-generated English transcripts.Each video has human-annotated objects, step descriptions, and start/end times.
Traditional NLP metrics might not effectively gauge the quality of text generated by LLMs [40]. Through manual comparisons between GPT-generated descriptions and human annotations, we noted discrepancies that could affect ROUGE scores without necessarily compromising summarization quality. For instance:
• LLM identifies optional steps, e.g., put the salad in the fridge.
• LLM turns states into steps, e.g., from the statement "I've preheated my oven to 375 degrees", it derived a step "Preheat oven to 375 degrees".
• LLM includes more cooking details, e.g., temperature.
Given this, we decided to use an LLM for text summarization.
Text descriptions: video dense captioning. Pipeline 4 relies on dense captioning to obtain text descriptions. We evaluated two video dense captioning methods, MT [83] and PDVC [69], and found evident errors in object names and actions. For example, in the video "How to Make Fried Calamari | Hilah Cooking", the human annotation is "drop the squid pieces into the oil", but the dense captioning returns "add the chicken in a pot of water boil". Consequently, we decided not to incorporate dense captioning models, leading to the removal of pipeline 4.
For pipeline 1, we provided the video and ground truth step descriptions to DORi [59] to predict each step's start and end time. After manual inspection, we found that the returned steps did not preserve the step order (e.g., step 3 is localized before step 2) and overlapped with one another. Given the considerable editing effort required to fix such errors, and because other NLVL models suffer from similar limitations, we eliminated Pipeline 1.
For pipeline 2, we applied an LLM alone to predict the boundary timestamps. We sent a transcript and a prompt "summarize the video transcripts in several steps and find the start and end time for each step". The transcript format is the same as the YouTube transcript, with each sentence beginning with a timestamp. Because this approach predicts the step summaries and timestamps simultaneously, a quantitative evaluation would require manually timestamping all 347 videos. We therefore sampled 20 videos and conducted a qualitative evaluation, which showed that the LLM returns ordered and non-overlapping steps, and that the step descriptions and timestamps reasonably match the ground truth.
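The two error types discussed here, out-of-order steps and overlapping steps, can be checked mechanically once step boundaries are available as numbers. The following is a minimal sketch of such a check; the function name `find_boundary_issues` and the representation of steps as `(start, end)` pairs in seconds are assumptions for this example, and it does not reproduce the manual inspection used in the evaluation.

```python
def find_boundary_issues(steps):
    """Flag ordering and overlap problems in predicted step boundaries.

    `steps` is a list of (start, end) pairs in seconds, in the order the
    model returned them. Returns a list of (step_index, issue) tuples.
    """
    issues = []
    for i, (start, end) in enumerate(steps):
        if end <= start:
            issues.append((i, "empty or reversed interval"))
        if i > 0:
            prev_start, prev_end = steps[i - 1]
            if start < prev_start:
                issues.append((i, "out of order"))
            elif start < prev_end:
                issues.append((i, "overlaps previous step"))
    return issues


# Ordered, non-overlapping steps produce no issues.
print(find_boundary_issues([(0, 10), (10, 25), (25, 40)]))  # []
# The third step starts before the second one.
print(find_boundary_issues([(0, 10), (30, 40), (20, 35)]))  # [(2, 'out of order')]
```

A check like this could also serve as a guard before results are shown to creators, since each flagged step implies manual boundary editing.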
For Pipeline 3, we employed ProcNets [82] to determine video shot boundaries. Relying solely on frame visuals, ProcNets scores each segment. We evaluated top-scored segments against the ground truth by computing the average temporal intersection over union (tIOU). Given the low alignment (tIOU = 0.18), we did not proceed to generate a text summary for each step.
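For reference, temporal IoU between two segments is the length of their intersection divided by the length of their union. The sketch below computes it and averages over ground-truth segments. The matching rule in `mean_tiou` (each ground-truth segment is paired with its best-overlapping prediction) is an assumption for this illustration; the exact protocol used in the evaluation is not specified here.

```python
def temporal_iou(a, b):
    """Temporal IoU of two (start, end) segments given in seconds."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def mean_tiou(predicted, ground_truth):
    """Average, over ground-truth segments, of the best tIoU with any prediction.

    Assumption: best-match pairing; other matching protocols are possible.
    """
    if not ground_truth:
        return 0.0
    best = [
        max((temporal_iou(gt, p) for p in predicted), default=0.0)
        for gt in ground_truth
    ]
    return sum(best) / len(best)


print(temporal_iou((0, 10), (0, 10)))   # 1.0 for identical segments
print(temporal_iou((0, 10), (20, 30)))  # 0.0 for disjoint segments
print(round(mean_tiou([(0, 10), (20, 30)], [(0, 10), (25, 35)]), 3))  # 0.667
```

On this scale, an average of 0.18 means predicted and annotated segments share less than a fifth of their combined duration.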
Therefore, we retained Pipeline 2 for extracting steps.
Object names. In Pipeline 1, we applied POS tagger Flair [3] to extract object names. For Pipeline 2, we prompted GPT-3 [9,52] with the transcript and an instruction: "Identify the objects, ingredients, tools, equipment in this tutorial" and parsed objects from the response. In Pipeline 3, we employed a Faster R-CNN [58] trained on the Visual Genome dataset [33]. Both POS taggers and GPT-3 outperformed visual detectors in identifying true positives. However, POS taggers often identified non-cooking objects, e.g., the chef's necklace (Appendix Table 6). As such, we retained only Pipeline 2, leveraging LLM for object extraction.
Object bounding boxes. Considering the underwhelming results of traditional object detectors, we only evaluated open-vocabulary object detectors and eventually chose OWL-ViT [47] considering both performance and computational cost.

Final pipeline
We finalized our pipeline as shown in Figure 7, which includes Step pipeline 2 (Figure 5) and Object pipeline 2 (Figure 6). First, we extract steps from video transcripts by prompting an LLM (here we use GPT-3.5 [50], assuming it has better performance than GPT-3): "Summarize the video transcripts in several steps and find start and end time for each step," then we use a shot boundary detector [63] to pick thumbnails for each step. Next, to extract object components, we use a different prompt: "Find out what objects/ingredients/tools/equipment are required in this tutorial." Then, we run an open-vocabulary detector [18] to identify the bounding boxes in video frames. Finally, we match object names to each step's description via string matching, then build dependencies between steps based on the shared objects.
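The last two operations, string matching and dependency building, can be sketched as follows. This is an illustration under explicit assumptions rather than the actual TutoAI code: `match_objects` uses case-insensitive substring matching, and `build_dependencies` links each step to the most recent earlier step that shares an object, which is one of several plausible linking rules. Both function names and the example steps are hypothetical.

```python
def match_objects(step_descriptions, object_names):
    """For each step, collect object names that occur in its description.

    Assumption: case-insensitive substring matching. This is naive; for
    example, "egg" would also match "eggplant".
    """
    matches = []
    for description in step_descriptions:
        text = description.lower()
        matches.append({name for name in object_names if name.lower() in text})
    return matches


def build_dependencies(step_objects):
    """Link each step to the latest earlier step that uses the same object.

    Edges always point from a lower to a higher step index (temporal
    order), so the resulting graph is acyclic by construction.
    """
    edges = set()
    last_used = {}  # object name -> index of the latest step using it
    for index, objects in enumerate(step_objects):
        for obj in sorted(objects):
            if obj in last_used:
                edges.add((last_used[obj], index))
            last_used[obj] = index
    return sorted(edges)


steps = [
    "Cut the wood board to length",
    "Sand the board",
    "Paint the base",
    "Attach the board to the base",
]
per_step = match_objects(steps, ["board", "base"])
print(build_dependencies(per_step))  # [(0, 1), (1, 3), (2, 3)]
```

In this example, step 2 (index 2) has no incoming link, so it does not depend on the board steps and could be done in parallel with them before the final assembly.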

LEVEL 3: USER INTERFACES FOR MIXED-MEDIA TUTORIAL CREATION

Design considerations
Section 4 shows various mixed-media tutorial formats regarding visual representation, layout, and interactivity tailored to specific domains. Rather than advocating a one-size-fits-all format, we embrace the principle of separating content from style: mixed-media tutorial components are content that can be extracted, reviewed, and edited, with different styles (e.g., visual representations, layouts, and interactive behaviors) added later. We focus on enabling creators to inspect and modify content, assuming that a tool will auto-apply styles to the final tutorial. Thus, we propose the following UI design considerations to elevate the creator experience without information overload (D3).
C1 Component-based creation. The UI should break down the creation process into individual tasks based on the mixed-media tutorial components. The UI should sequence tasks so that the output from one task can provide context to help users perform subsequent tasks efficiently.
C2 One modality at a time. To reduce context switching, when a component encompasses multiple modalities (e.g., text and images), the UI should break it down into subtasks. This will help simplify user interactions and avoid requiring users to operate across multiple modalities in a single task.
C3 Editable AI output. The UI should enable creators to keep, modify, or dismiss AI-generated results and add information missed by AI.
C4 Real-time edit preview. Upon editing, the UI should automatically reflect changes in the tutorial.

An example prototype
We reify these design guidelines into an example UI and use the video "How to make a seesaw for kids"¹ as input. In this implementation, we use the tutorial format depicted in Figure 8. The tutorial contains the following components: a video player with the step boundaries below it (Figure 8A); an object list (Figure 8B) over which users can hover to see an image of the selected object (Figure 8C); step overviews, each consisting of a text description, a representative thumbnail, and the objects for that step (Figure 8D); and the associated dependencies (Figure 8E), represented as arrows between steps, where the buttons on each arrow show the objects that connect the steps. We chose this tutorial design for its comprehensive components without domain-specific assumptions.
The UI breaks up the creation process into five sequential tasks, each targeting a single tutorial component (steps, objects, or dependencies) in a single modality (C2). Creators can bypass any task and accept the default results if they deem the task unnecessary (C3). As they make changes, creators can preview the tutorial with the current modifications (C4) using the "view" button (Figure 9). Here is the workflow:
1) Identify steps. The UI shows the video and its transcript on the left, and AI-generated steps with text descriptions and start/end timestamps on the right (Figure 9); creators can edit the text, add/delete steps, and update the time boundaries by dragging the range slider (C3).
2) Choose step thumbnails. The UI presents dissimilar candidate video frames. Creators can adjust the number of frames using a "show more/less" slider and select a frame (Appendix Figure 12). The thumbnails presented for a given step are bounded by the time boundaries identified for that step in task 1 (C1).
3) Select objects. The UI suggests a list of objects required for the tutorial and associates the objects with the steps (Appendix Figure 13). Creators can modify objects and change their step associations (C3).
4) Crop objects. Creators can choose a representative image for each object (Appendix Figure 14). The UI shows the list of objects refined by users in task 3 (C1) and presents candidate frames with probable object bounding boxes, which creators can adjust (C3).
5) Build dependencies. The final task is to build dependencies (Appendix Figure 15). The UI displays a node-link diagram of dependencies based on shared objects between the steps, as identified in task 3 (C1). Creators can add/delete links via drag and drop (C3).
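To make the shared-object idea behind task 5 concrete, the following is a minimal sketch of how dependency links could be derived from the step-object associations of task 3. It is an illustration under stated assumptions, not the prototype's implementation: the function name `build_dependencies`, the rule of linking each step to the most recent earlier step that used the same object, and the example object names are all hypothetical.

```python
def build_dependencies(step_objects):
    """Link steps that share an object (illustrative assumption).

    step_objects: list of sets; step_objects[i] holds the object names
    associated with step i, with steps listed in video order.
    Returns a list of (earlier_step, later_step, shared_object) edges.
    """
    last_seen = {}  # object name -> index of the latest step that used it
    edges = []
    for idx, objects in enumerate(step_objects):
        for obj in sorted(objects):  # sorted for a deterministic edge order
            if obj in last_seen:
                # The current step reuses an object from an earlier step.
                edges.append((last_seen[obj], idx, obj))
            last_seen[obj] = idx
    return edges


# Hypothetical step-object associations for a small woodworking tutorial.
steps = [
    {"plank", "saw"},
    {"plank", "sandpaper"},
    {"base", "screws"},
    {"plank", "base", "screws"},
]
edges = build_dependencies(steps)
for src, dst, obj in edges:
    print(f"step {src} -> step {dst} via {obj}")
```

Each edge corresponds to one arrow in the node-link diagram, and the shared object is what a button on that arrow would display. Links that creators add or delete by hand would then be edits to this edge list.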

TUTOAI FRAMEWORK EVALUATION: MODEL
To demonstrate our pipeline's generality, we evaluated it on a small yet diverse dataset.

Dataset
Inspired by the object-action quadrant for instructional videos [13], we considered the following diversity dimensions of instructional videos: creator, task, video duration, number of steps, and number of objects. The content creator dimension allows us to capture variations in editing style, such as instructional or conversational narration, concise versus verbose steps, and use of music fillers. As a result, we collected a dataset of 20 videos (Table 1) across four domains: cooking, crafting, makeup, and repair. Each video within a domain focused on a different task (e.g., fixing an iPhone vs. fixing a hole in the wall for repairs) and was made by a different creator. We manually annotated 1) the objects and 2) the step boundary timestamps and used these as ground truths. We assessed our pipeline on object extraction and timestamp prediction.

Object extraction results
We compare object extraction results with the ground truth using the F1 score, computed as:

$$F_1 = \frac{2\,|S_p \cap S_g|}{|S_p| + |S_g|}$$

where $S_p$ is the set predicted by our pipeline and $S_g$ is the ground truth, and $|\cdot|$ denotes the number of objects in the set. As shown in Table 1 column 8 ("F1"), our object extraction F1 scores fall between
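As a rough illustration of the set-based F1 score described above, the sketch below computes it for one video. The function name `set_f1`, the handling of empty sets, and the example object names are assumptions made for this illustration; they are not taken from the evaluation itself.

```python
def set_f1(predicted, ground_truth):
    """F1 between a predicted set of object names and the annotated set."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    if not predicted and not ground_truth:
        return 1.0  # assumption: two empty sets count as a perfect match
    overlap = len(predicted & ground_truth)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)   # share of predictions that are correct
    recall = overlap / len(ground_truth)   # share of annotated objects recovered
    return 2 * precision * recall / (precision + recall)


# Hypothetical objects for a baking video.
predicted = {"flour", "butter", "sugar", "whisk"}
annotated = {"flour", "butter", "sugar", "eggs", "bowl"}
score = set_f1(predicted, annotated)
print(round(score, 3))
```

Here three of the four predicted objects are correct and three of the five annotated objects are found, so precision is 0.75, recall is 0.6, and the F1 score is 2/3.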

Step boundaries
Our pipeline outputs a sequence of steps, including text descriptions and start and end timestamps. On average, it yields 1.3 false negative steps and 0.25 false positive steps per video (Table 1 column 11 "# False Neg." and column 12 "# False Pos."). These low counts suggest that our pipeline recovers most annotated steps while adding few spurious ones. Introduction and conclusion segments accounted for most false negative steps, and false positive steps were incorrectly inferred from verbose narrations. We then used the F1 score to assess predicted timestamps against the ground truth. For false negative steps, we set the predicted interval to [0, 0] to signify that the step did not appear. Aggregate F1 scores ranged from 0.22 to 0.95, averaging 0.59 (Table 1 column 13 "Avg. F1"). In general, we found that our pipeline performed better on the step localization task for shorter tutorials and tutorials with more concise steps. Certain video editing decisions, such as using non-speech fillers between steps, showing step execution before verbally describing it, and describing steps out of order, also negatively impacted localization. Our aggregate F1 score suggests reasonable alignment between predicted step boundaries and ground truth, with room for improvement that might be achieved via more sophisticated prompt engineering.
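The text does not spell out how the F1 score is applied to time intervals. One plausible reading, sketched below, treats each step as a span of seconds and measures precision and recall by the length of the overlap. The function name `interval_f1` and the example timestamps are assumptions for illustration only.

```python
def interval_f1(pred, truth):
    """F1 between a predicted and an annotated (start, end) interval in seconds.

    Assumption: precision and recall are measured by overlap duration.
    """
    pred_start, pred_end = pred
    true_start, true_end = truth
    overlap = max(0.0, min(pred_end, true_end) - max(pred_start, true_start))
    pred_len = pred_end - pred_start
    true_len = true_end - true_start
    if overlap == 0 or pred_len <= 0 or true_len <= 0:
        return 0.0  # also covers a missed step encoded as [0, 0]
    precision = overlap / pred_len
    recall = overlap / true_len
    return 2 * precision * recall / (precision + recall)


partial = interval_f1((10, 40), (20, 50))  # 20 s of overlap on two 30 s spans
missed = interval_f1((0, 0), (20, 50))     # false negative step
print(round(partial, 3), missed)
```

Under this reading, a false negative step encoded as [0, 0] scores 0, which pulls down the per-video average.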

TUTOAI FRAMEWORK EVALUATION: UI
To evaluate the quality of AI-extracted components perceived by users and the tutorial creation experience, we conducted two preliminary user studies to understand 1) if the TutoAI framework generates higher-quality mixed-media tutorial components than a baseline method before editing, 2) if the TutoAI framework generates mixed-media tutorials that are more useful for consumers than a baseline method after editing, and 3) the potential of integrating TutoAI into creators' existing workflow.

Study design rationales
We identify both instructional video consumers and influencers who make instructional videos as potential users of our prototype. Video consumers who want to learn instructional content are motivated to interact with mixed-media tutorials and can benefit from tutorial creation. For example, Kim et al. find that when students contributed to creating subgoal-based tutorials, they became more attentive to learning [70]; popular video platforms also let video consumers create video clips (e.g., YouTube's "create clip"²) and mixed-media notes (e.g., Coursera's "save note"³). Therefore, we recruited participants who frequently watch instructional videos for study 1. Several participants also disclosed that they had created mixed-media tutorials before, confirming our assumption. For study 2, we recruited two YouTube creators who regularly publish instructional videos.

Table 1: Pipeline evaluation on ground truth. We annotate ground truth for 20 instructional videos from 4 different domains and test the object extraction and step boundary detection components of our pipeline on these videos. Our pipeline performs object extraction very well (average F1 = 0.88) across domains. Our step boundary detection performs relatively well on at least one video in each domain (average F1 = 0.59).

In both studies, we used auto-generated YouTube Chapters [54] as the baseline. Although TutoAI was inspired by previous works, their tutorials were either generated automatically using a domain-specific approach [14,20,31,65,68] or created manually without AI assistance [39,74]. Mixed-initiative approaches [12,32,49,53] do not provide a creation experience comparable to TutoAI's. We thus determined that YouTube Chapters [54] are the most reasonable baseline, since they also support cross-domain generation of steps.

Instructional videos: we chose two instructional videos on YouTube: office chair assembly⁴ and strawberry blueberry shortcakes⁵. We randomly split the participants into two groups: A (office chair assembly, video length: 5 minutes 18 seconds) and B (strawberry blueberry shortcakes, video length: 7 minutes 32 seconds). Participants' median familiarity with the video topic was 2.5 and 3.0, respectively (1: not familiar at all, 5: extremely familiar).

8.2.3 Procedures: First, we briefly introduced the concept of mixed-media tutorials and the editing features of the UI. Participants then followed step-by-step instructions to reproduce a Kung Pao chicken⁶ mixed-media tutorial created by TutoAI as a warm-up. Next, they were asked to create a mixed-media tutorial for the assigned video while thinking aloud. Finally, participants completed a survey and provided open-ended feedback. Each session was conducted remotely over Zoom and lasted about 1 hour. Each participant received a $20 Amazon gift card. The study was approved by the Institutional Review Board (IRB) Committee.

Findings: We observed that participants applied different strategies to create mixed-media tutorials. Some participants watched the entire video first, some watched each step's video clip based on the AI-generated results first, and some did not watch the video but read the transcript instead.

Quality of AI-generated results. We asked the participants to rate the quality of components generated by TutoAI before editing and of YouTube auto-generated Chapters on a five-point Likert scale, where 1 means "the quality is so low that the author needs to start from scratch", and 5 means "the quality is so high that the author barely needs to do anything". YouTube Chapters only generates timestamps, thumbnails, and text descriptions for each step. We conducted a Wilcoxon signed-rank test with a Bonferroni correction and found that TutoAI generated higher-quality results than YouTube Chapters in 2/3 comparisons in group A (Figure 10a): TutoAI vs. YouTube Chapters, text: 4.6±0.65 vs. 2.0±0.71 (p = 0.009); timestamps: 3.5±0.65 vs. 2.5±1.19 (p = 0.075); thumbnails: 3.6±0.49 vs. 2.3±0.75 (p = 0.021). For group B, the benefits of TutoAI are not statistically significant (Appendix Figure 17). Other scores of TutoAI components are in Appendix Figure 18.

Perceived usefulness of tutorial components. We asked participants to rate each component's usefulness for tutorial consumers after editing, where 1 refers to "I don't think consumers will benefit from this component," and 5 refers to "I'm confident that consumers will benefit from this component." We conducted a Wilcoxon signed-rank test with a Bonferroni correction and found TutoAI results more useful than YouTube Chapters in 3/3 comparisons in group A (Figure 10b). Specifically, TutoAI vs. YouTube Chapters, text: 4.7±0.62 vs. 2.3±1.25 19.

TutoAI vs. YouTube Chapters. Although TutoAI received higher scores than YouTube Chapters for both videos in the user study, the differences are not statistically significant for the strawberry blueberry shortcake video. We looked into the user study recordings and found that, since the text descriptions of YouTube Chapters are very short ("Strawberry topping" and "Chantilly cream"), the participants deemed them helpful as long as they contained important keywords. In comparison, the step descriptions generated by TutoAI are "Preparing the strawberries for the topping" and "Preparing the Chantilly cream using an air disc container". Although TutoAI provided more details, the participants believed the essential keywords had been captured by YouTube Chapters. On the other hand, the YouTube Chapters for the office chair assembly video missed most keywords, e.g., "Base Assembly", and were deemed less useful than TutoAI-generated text descriptions such as "Attaching Caster Arm to Base". To demonstrate more conclusively whether the fine-grained text descriptions generated by TutoAI are superior, we need more experimental data involving more instructional videos.

Dependencies and other components. Many participants (17/24) found the dependency diagram useful (rated 4 or 5), e.g., P12 said "The flow charts were amazing...if I didn't want to watch the video, I could just see the steps...I am getting a visual representation of the whole video." Some expressed confusion, e.g., P4 said "dependency diagram was a bit tricky to understand." Besides existing components, participants also brainstormed new tutorial components, e.g., 3D object augmentation/more camera angles (P11).

Application scenarios. The participants shared situations where they would like to have a mixed-media tutorial, e.g., building a pet snake vivarium (P5) and collaborative software development (P8). Some participants also mentioned situations where they would like to create a mixed-media tutorial to refresh their memory, e.g., P9 said "I make quilts, and I have to look up a lot of tutorials for how to finish the quilt because you only do it once every time."

Study 2: YouTubers
8.3.1 Preparation: we recruited two YouTube creators (E1 and E2) who regularly publish instructional videos. For each YouTuber, we picked several of their videos with auto-generated YouTube Chapters. We ran our ML pipeline on the videos "bike rack installation"⁷ (E1) and "how to make a seesaw for kids"⁸ (E2) and loaded the results into the TutoAI UI. During the study, we briefly introduced mixed-media tutorials and asked the creators to complete a step-by-step warm-up task to get familiar with the UI. Then, they created a mixed-media tutorial for their video and provided oral feedback along the way. Each participant received a $50 Amazon gift card.

Findings: we asked them about their impression of the AI-generated results and about their workflows for creating instructional videos.

TutoAI vs. YouTube auto-generated Chapters. Both YouTubers spoke highly of the TutoAI-generated results, e.g., when asked about the quality of steps, E1 said "I'd say probably about a 4 (out of 5). There were a few things I changed, but for the most part, it was a good starting point." When shown the auto-generated YouTube Chapters, E1 gave them a 2.5 to 3: "the first few are getting the breaks pretty good, but they lost some of the steps that your software captured". E2 believed they needed a complete redo: "I won't be able to use any of this... 'Wood blocks' is just the name of the material, not something meaningful for the viewers to imagine". The author-created steps are in Appendix Figure 16.

Attitudes towards dependencies. E1 expressed enthusiasm about applying dependency diagrams: "I really like the dependency diagram, especially for a procedural how-to video...it helps them understand... when you might need to skip a step or there might be a branch..." E2 saw the dependency diagram as more useful for cooking videos: "for example, cooking...you can do many things at the same time. But for my (DIY) tutorial, it kind of depends on one flow."

Incorporating TutoAI into existing workflows. We asked both E1 and E2 to share their thoughts on incorporating TutoAI into their workflow. E1 said "I think this is a great tool...I don't know that it would necessarily save me time just creating chapters. It's a different animal because this is giving me the ability to do a lot more, especially creating the flow charts, which I really like... viewers would get a lot out of this as opposed to just a regular chapter". E2 recounted that in the past, she spent about 1 hour writing down the steps and time boundaries of a 10-minute video she created (6 times the original video length); to her relief, with the help of TutoAI, it took her only 17.5 minutes to finalize the steps and time boundaries for an 11.5-minute video (1.5 times the original video length).

DISCUSSION
We have proposed TutoAI, the first cross-domain framework for AI-assisted mixed-media tutorial creation. TutoAI extends earlier efforts in generalizing tutorial creation beyond a single domain [32,65,70]. It adopts a holistic approach by distilling common tutorial components from existing work, presenting methodologies to identify, evaluate, and assemble AI models to extract components, and introducing a guided workflow for users to inspect and modify extraction results. In this section, we reflect on the lessons learned from our exploration and discuss the broader implications.

Selecting models and constructing pipelines
We demonstrated how to identify, evaluate, and assemble computational models into integrated pipelines to extract tutorial components. Given the rapid advancement in AI, we acknowledge that the pipeline we selected may not sustain peak performance. For example, multi-modal LLMs are equipped with vision capabilities [43,51,78], and dense video captioning models may improve rapidly by benefiting from large-scale pre-trained models [85]. Despite technological advances, our work provides enduring insights that transcend the specific models. We propose the following guidelines for future endeavors that incorporate AI into tutorial creation:
• Adopt a multi-modal perspective: Models across different modalities could achieve similar goals, e.g., object detectors based on video frames and LLM prompting based on transcripts can both identify object names, and each has its SoTA models. By assembling multiple pipelines with the same objective, we can explore the solution space more comprehensively without premature commitment.
• Leverage strong models for cross-modal enhancement: Currently, an LLM performs the best at extracting object names. Starting with the best results in one modality, we can minimize errors in other modalities, e.g., object names extracted by an LLM will guide open-vocabulary object detectors to localize objects. Future research should keep monitoring SoTA methods in different modalities.
• Focus on user-centric model selection: While each ML problem has standard metrics for evaluation, higher scores do not equate to better user experience. Though comparing models across modalities may not be straightforward due to distinct metrics, a potential universal metric could be the user effort required to refine the output. For example, the NLVL model DORi [59] returns a higher tIoU (temporal intersection over union) than ProcNets [82] in video segmentation, but DORi does not observe the order of steps, leading to overlapping and reverse-ordered steps, which require additional user edits. To avoid overwhelming users, we eventually dropped the model.

Designing AI-Assisted user workflows
We believe it is important to tailor the design of mixed-media tutorial formats for different use cases. The tutorial format in our prototype shown in Figure 8 serves only as an example interface. The following guidelines can inform future efforts to design AI-assisted tutorial creation workflows.
• Simplify tutorial creation by guiding and constraining user actions: The sequential editing workflow in TutoAI is structured and domain-agnostic, following the Wizard interface design pattern [72]. One potential benefit of this approach is that the complex task of tutorial creation is transformed into a sequence of understandable stages, where the relationships between the stages are implicitly captured. Users can thus focus on individual tasks without worrying about how to structure the overall workflow. The UI should also ensure the results satisfy implicit constraints (e.g., the intervals of two steps should not overlap).
• Separate content from style: While mixed-media tutorials are available in diverse formats, TutoAI underscores the value of separating content from style. In our prototype, the user workflow focuses on extracting accurate component information; the visual representations and interactivity of the components in the tutorial are automatically applied to the extraction results. This general approach is adaptable to any mixed-media tutorial with a predefined format. Our prototype offers multiple formats for a customized consumer experience, including a list-based view of steps and a dependency diagram (Appendix Figure 16). Future tools can provide more flexibility in formatting tutorials, yet the principle of separating content from style remains valid.
• Support graceful degradation: The performance of ML models can be uncertain and unpredictable. Even though the overall performance of our pipeline is reasonable, it may be disappointing in some cases. Therefore, it is important to design a UI that supports tutorial creation when AI-powered component extraction fails. To support such graceful degradation, users must be able to interpret the extraction results and make edits easily. To facilitate this, our UI is designed for low-effort error correction, e.g., users can adjust step boundaries with a range slider. In the worst case, where the extraction result is completely wrong, users can override the results and update the component manually.
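The implicit constraint mentioned in the first guideline, that the intervals of two steps should not overlap, can be checked with a few lines of code. The sketch below is an assumption about how such a check might look, not a description of the prototype; the function name `find_overlaps` and the example intervals are hypothetical.

```python
def find_overlaps(steps):
    """Return index pairs of steps whose time intervals overlap.

    steps: list of (start, end) tuples in seconds.
    Only pairs that are adjacent after sorting by start time are compared,
    which is enough to detect whether any overlap exists at all.
    """
    order = sorted(range(len(steps)), key=lambda i: steps[i][0])
    conflicts = []
    for a, b in zip(order, order[1:]):
        if steps[b][0] < steps[a][1]:  # next step starts before this one ends
            conflicts.append((a, b))
    return conflicts


# Hypothetical step boundaries; steps 0 and 1 overlap by five seconds.
steps = [(0, 30), (25, 60), (60, 90)]
conflicts = find_overlaps(steps)
print(conflicts)
```

A UI could run such a check after every slider edit and either block the change or highlight the conflicting steps. Steps that merely touch, such as one ending at 60 seconds and the next starting at 60 seconds, are treated as valid here.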

Cross-domain generalization: tutorials, tools, and methodologies
TutoAI is motivated by previous work's effort to generalize mixed-media tutorial creation beyond a single domain. Reflecting on our experience, we have identified multiple interpretations of cross-domain generalization:
• CD1: Same tutorial format, diverse domains: a tool for creating tutorials with the same format.
• CD2: Same creation experience, diverse tutorial formats and domains: a general-purpose tool for creating tutorials with diverse formats.
• CD3: Same methodologies, diverse creation experiences, tutorial formats and domains: a set of generalized methodologies to guide the design and development of tutorial creation tools; the tools can be general-purpose or domain-specific, supporting the creation of diverse tutorial formats.
It is not our intention to advocate a one-size-fits-all tutorial format (CD1), as we have discussed in Section 6.1 and Section 9.2. We believe a general-purpose creation tool (CD2) can be useful, as exemplified by our prototype. Nevertheless, a general-purpose tool risks overlooking domain-specific nuances in terms of both components and ML pipelines. In TutoAI, we not only build a general-purpose tool (CD2) but also propose a set of generalized methodologies for tool builders (CD3). With advancements in AI, we demonstrate the feasibility of designing tutorial creation tools systematically. Our framework, encompassing three levels (components, models, and UIs) and the associated guidelines, is adaptable to various contexts. For example, to develop a tutorial creation tool for software instructional videos, we can standardize the components first (e.g., UI widgets, commands, data), then identify, evaluate, and assemble ML pipelines based on the guidelines outlined in Section 9.1. Though the component and model details may differ, the underlying approach remains the same.

Limitations and future work
Domain limitations. Though TutoAI is a cross-domain framework, it does not apply to all instructional videos. Chang et al. [13] classified instructional videos into a quadrant along an object-action coordinate system, distinguishing between "diverse objects and diverse actions" (cooking, car repair, makeup, etc.), "diverse objects and few actions" (crafts and packing, etc.), "few objects and few actions" (drawing, musical instrument, etc.), and "few objects and diverse actions" (dance, exercise, etc.). TutoAI focuses on physical tasks that involve diverse objects. For instructional videos with few objects or without concrete objects (e.g., lecture videos), TutoAI will have difficulty constructing dependencies, as the dependency parser assumes that steps sharing the same object depend on each other. A related limitation arises when the same object is referred to differently: e.g., in the berry cake video, the creator uses "berries" to refer to both strawberries and blueberries in the late stage, and our method fails to detect the dependency between steps containing "berries" and "strawberries" (or "blueberries"). Future work could investigate identifying abstract items and more intelligent dependency parsing, especially for dependencies between abstract concepts.

Representative frames selection. Currently, we use shot boundary detectors to present diverse frames as step thumbnail candidates, independent of text descriptions. In the future, thumbnail selection could leverage the text descriptions; e.g., multi-modal video summarization methods can extract representative frames and text summaries simultaneously [26,48], and thus have the potential to return high-quality text-dependent representative frames.

Framework evaluation. We use user-perceived component quality as a proxy for learning effects, though the two may not be positively correlated. Further research is necessary to study whether user ratings of tutorials directly translate to better learning outcomes. Besides, the fact that users interacted with TutoAI but only looked at static screenshots of YouTube Chapters may also have biased users' ratings.

CONCLUSION
Transforming linear instructional videos into more browsable mixed-media tutorials can improve the learning experience; however, existing methods do not harness the full potential of the latest AI advances and are usually limited to specific domains. In response, we introduced TutoAI, a cross-domain framework for AI-assisted mixed-media tutorial creation. TutoAI provides a taxonomy for mixed-media tutorial components, a methodology to evaluate and select models for component extraction, and guidelines for UI implementation. Our preliminary evaluation suggests that TutoAI can extract high-quality mixed-media tutorial components and help authors create mixed-media tutorials. Moving forward, we believe the TutoAI framework will provide a strong foundation for future mixed-media tutorial development.

Figure 5: Four candidate pipelines for step extraction. Models are in green, and generated subcomponents are in blue. After evaluation, the chosen one is No. 2.

Figure 6: Three candidate pipelines for object extraction. Models are in green, and generated subcomponents are in blue. After evaluation, the chosen one is No. 2.

Figure 7: TutoAI's machine learning pipelines to obtain objects and steps in instructional videos: 1. extract steps: ChatGPT processes the video transcript to produce text descriptions and time intervals for each step, then a shot boundary detector augments each step with a thumbnail; 2. extract objects: ChatGPT identifies the objects in the tutorial, then an open-vocabulary object detector returns the frames and bounding boxes of the objects; 3. build dependencies: an object matcher checks if objects are in a step's transcript and produces a dependency graph.

Figure 8: A mixed-media tutorial template on making a seesaw for kids: below the video player (A) is a list of required objects (B); hovering on the blue-bordered object will show the object's image along with a bounding box (C); on the right is an overview of steps (D), where each step is a video clip with start and end times, a text description, and associated objects. The arrows between the steps (E) indicate the dependencies.

Figure 9: Identify steps. This task aims to break down the video into several steps and provide text descriptions and time boundaries for each step. On the left is a video player and its transcript ("Make a seesaw for kids"); on the right are the AI-generated steps.

Figure 10: Component quality of group A: office chair assembly. Before editing (left), after editing (right).
5.1.1 Step extraction. For the sake of completeness, we assume that a step component needs the following: a text description, the start and end timestamps in the video, and a representative video frame (thumbnail). As mentioned in Section 3.2, we first identify relevant models:
• Models for text descriptions. We identified two types of models for generating text descriptions: text summarization and video dense captioning. Text summarization takes a chunk of text as input and shortens it while preserving the key information [17,35,46,56]. Video dense captioning takes video frames and step timestamps as input and generates text descriptions for objects and their interactions within the step's boundary [29,69,83].
• Models for step timestamps. We identified four model types for obtaining step timestamps: natural language video localization (NLVL), shot boundary detection, video summarization, and LLM prompting. NLVL localizes the start and end time of a step given a video and a step text description [21,59,77]. Shot boundary detection takes video frames as input and returns candidate shot transition frames. Assuming that each shot represents a step, we can convert adjacent transition frame indices into the start and end timestamps [63,82]. Video summarization condenses a long video by selecting and stitching together keyframes to form a shorter video [1,26,62,84]. Similar to shot boundary detection, we can convert adjacent keyframe indices into step timestamps. We can also prompt LLMs to generate step timestamps if the input transcript contains word- or sentence-level timestamps.
• Models for step thumbnails. We identified two types of models for selecting thumbnails: video summarization and shot boundary detection. As mentioned before, video summarization outputs representative keyframes. In addition to representative keyframes, shot boundary detection can filter dissimilar frames to get more thumbnail candidates.
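The conversion from shot transition frames to step timestamps mentioned above can be sketched as follows, under the stated assumption that each shot corresponds to one step. The function name `shots_to_steps`, the frame rate, and the example frame indices are hypothetical and chosen only for illustration.

```python
def shots_to_steps(transition_frames, fps, total_frames):
    """Turn shot transition frame indices into (start, end) times in seconds.

    Assumes every shot is one step: adjacent transition frames bound a step,
    and the first and last frames of the video are added as outer boundaries.
    """
    boundaries = sorted(set(transition_frames) | {0, total_frames})
    # Drop any indices that fall outside the video.
    boundaries = [b for b in boundaries if 0 <= b <= total_frames]
    return [(start / fps, end / fps)
            for start, end in zip(boundaries, boundaries[1:])]


# Hypothetical 50-second video at 30 fps with two detected transitions.
intervals = shots_to_steps([300, 900], fps=30, total_frames=1500)
print(intervals)
```

The same conversion applies to keyframe indices returned by a video summarization model. In practice a single step often spans several shots, so intervals produced this way would likely need merging or manual adjustment.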

• Leverage strong models for cross-modal enhancement: Currently, an LLM performs the best at extracting object names. Starting with the best results in one modality, we can minimize errors in other modalities, e.g., object names extracted by an LLM will guide open-vocabulary object detectors to localize objects. Future research should keep monitoring SoTA methods in different modalities.
• Focus on user-centric model selection: While each ML