Millions of Views, But Does It Promote Learning? Analyzing Popular SciComm Production Styles Regarding Learning Success, User Behavior and Perception


Figure 1: Overview of the four tested video production styles in this between-subject design study. All conditions show the same moment, explaining why databases are relevant for data science. The first two panels show the difference in the written media (digital and analogue), while the last two showcase differences in gestures and social cues when a speaker is visible.

ABSTRACT
With a rising amount of highly successful educational content on major video platforms, science communication (SciComm) can be considered mainstream. Although the success in terms of social media metrics (e.g. followers and watch time) is undoubted, the learning mechanisms of these production styles are under-researched. Through a between-subject design with 980 adult learners in a MOOC about data science, this study analyzes how much of a difference four popular SciComm production styles about relational databases make with regard to perceived quality, learning success and technical user behavior. Testing the isolated effect showed no statistically significant difference. Additionally, a multivariate regression model estimating the overall course points with robust standard errors showed six significant variables: the time spent with the material and the number of exercise submissions are particularly noteworthy. Based on our results, an underlying (video) script is more relevant than the actual production style. Prioritizing the preparation of this material instead of following a specific, pre-existing video production style is recommended.

INTRODUCTION
In 2010, YouTube reported an upload rate of 35 hours of content every minute [2]. Thirteen years later, this number appears to have risen to 500 hours per minute [39]. Given that the site also ranks second among all web pages globally [38], the medium of video, and YouTube specifically, is undoubtedly popular. Within these masses of uploads, the proportion of educationally relevant material that meets standards and is factually correct is unknown, as anybody is able to upload to the platform and many categories exist on it (e.g. sports, music, gaming, news). Second, the number of larger channels with more than 100,000 subscribers can be considered low at 0.28% (or 1 in 350) [32]. With a large demand from the viewer side and a comparatively small supply of educational material, the larger and well-known channels have a lot of impact on the average learner and classroom: not only do students use the platform [44], but teachers and lecturers incorporate it into their teaching too [26, 35, 37]. Therefore, understanding these successful and often-watched formats and measuring them in terms of formal assessment methods is vital and, at the same time, under-researched.
The given study analyzes whether the learning success of these educational video production styles is on par with their social media success. Based on the following literature (Section 2), the paper raises two main research questions (see 2.4), followed by a description of the underlying course setup in Section 3, which was used to gather and analyze the data. Section 4 outlines the intended methods of analysis, while the respective results are presented in Section 5. Finally, these findings are discussed (see 6) and the paper concludes (Section 7) with the existing limitations and recommendations.

RELATED WORK
Within the existing academic literature, three main branches are relevant, covering science communication characteristics, the influence of learning theories and video production styles.

SciComm: Science & Education Becomes Mainstream
An early example of scientific communication (SciComm) is the 1936 video about differentials and how they are used in automobiles ("Around the Corner") [19]. Although the car was invented over a hundred years ago, explaining in laypeople's terms how two tyres can spin at different speeds while making a turn can be considered a timeless topic for an explainer video. Interestingly, the video uses different camera scenes and perspectives while including animations and real-life props alike. Historically, Carl Sagan is considered one of the earliest science communicators, due to his show "Cosmos: A Personal Voyage" airing in 1980 [1]. In the meantime, the media switched from documentaries on television to posts on social media. As a middle ground between TV productions and an individual content creator, larger YouTube channels such as Veritasium [9] and kurzgesagt [4] can be identified. A recent study by Xia et al. ([46], 2022) interviewed 25 science communicators of larger channels and 13 of their viewers and identified their motivations. Among other findings, they indicate that SciComm channels start with a specific niche topic and grow into mainstream science education later on. The viewers reported a twofold interest: getting educated is relevant, but so is being entertained. The same study inspired our research, asking "how effective this communication medium is" (p. 171f), thus pointing to an open research gap. Despite this broad range of video content, there is no general definition of video-based scientific communication. For the given study, we define it as learning material that is provided by formally educated individuals and consumed by learners who are not part of a formal class or even enrolled at all. Burns et al. [13] additionally include five responses to SciComm (awareness, enjoyment, interest, opinion-forming, understanding).

Influences on Educational Success and Perception
Previous studies have discussed whether specific elements promote learning, such as (not) seeing a lecturer's face [23], video angle and proximity [12], digital or hand-written annotations [33], the usage of metaphors versus direct language [21], as well as material directed at specific classes [29, 41-43]. In general, the existing literature shows no significant differences in learning outcomes, but only in preferences or perception. Associated with these sub-topics, a list of relevant learning theories emerges: based on the work by Gunawardena [17], social presence is mentioned whenever a person is supposed to be perceived as a human and not a mere digital copy. Generally, the existing body of knowledge associates a higher degree of social presence with a better perception of the material [30, 45].
Another large influence is the work by Moreno and Mayer. Through multiple principles, general recommendations have been drawn that were incorporated by the academic community. For this work, we identified three principles as most vital: First, the redundancy principle [28] recommends consistency between what is said and what is shown: a video should not insert a text element that effectively works as a footnote while the main content continues1. Second, the temporal contiguity principle is the academic version of "show and tell in the same time frame": we explain what a query does and how its syntax works while seeing the query, without a time delay. Third, the multimedia principle [27] states that it is generally better to offer various media and incorporate them in a planned and intentional way; using a video merely to show text would not satisfy this principle, however.
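As a concrete illustration of the temporal contiguity principle, a query can be explained while its result is displayed at the same time. The following minimal Python sketch uses an invented table and column names purely for illustration; it is not taken from the course material:

```python
import sqlite3

# Build a tiny in-memory database for illustration (table name and
# columns are hypothetical, not part of the actual course content).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE track (title TEXT, duration_s INTEGER)")
con.executemany("INSERT INTO track VALUES (?, ?)",
                [("Intro", 90), ("Outro", 120), ("Theme", 45)])

# The query is explained while its result is shown, without a delay:
# SELECT picks the columns, FROM names the table, ORDER BY sorts.
query = "SELECT title, duration_s FROM track ORDER BY duration_s"
rows = con.execute(query).fetchall()
print(rows)  # [('Theme', 45), ('Intro', 90), ('Outro', 120)]
```

Showing the statement and its result rows in the same frame is exactly the "show and tell in the same time frame" idea the principle describes.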
The disadvantage of showing more, whether people or media, is that the viewer's attention is split, which leads to a higher cognitive load and therefore less learning success. A study by Kizilcec et al. [24] quantified the rapid eye movement: learners switch between a lecturer and the slides every 3.7 seconds when compared against a condition without the lecturer. At the same time, various media split our attention all the time: while reading an article, a newsletter prompt competes with the working memory, a video could have an unskippable advertisement, and we tend to use multiple screens at once rather than limit the number of stimuli from a single source [25]. Regarding the assessment in previous experiments and comparisons, a posttest testing recall capabilities or a score in a transfer test has often been used; grades as an output variable are common as well [31].

Successful Production Styles of Higher Education Media
Two main sources were identified that guided the video production: previous literature influenced the underlying approach, and large SciComm channels showed how they successfully reach their audience. Thus, both sources were equally relevant for the video production process. The case work done by Hansch et al. [20] is highly influential for this study: they outline the boon and bane of having sheer endless possibilities when it comes to educational video production ("There is no one-size-fits-all approach to making a learning video", p. 10), but also outline 18 types of video approaches, although not all of them are mutually exclusive. Similar work exists in the shape of a video classification grid [15], surveys among instructors [36] and an analysis of MOOC courses [34]. All share a qualitative case-study approach that gives an overview of the possibilities but lacks a basis for generalization. Unfortunately, scientific publications often lack access to the used video conditions and the underlying (recording) scripts; a public or self-hosted repository attached to a study is the exception rather than the rule. Furthermore, in MOOCs or digital courses, the learning or curriculum context is often unknown; instead, isolated experiments are conducted. In reality, learning activities are rarely compartmentalized, and the overall context of the learning outcomes is important.

Research questions
Against this background, two research questions are asked, the first being a qualitative research question divided into three subquestions: RQ1: How does the production style of educational videos influence (1.1) the learning success, (1.2) the user behavior and (1.3) the perceived quality?
The variable learning success is operationalized through three different post-tests: a no stake quiz, a low stake programming exercise and a high stake exam. Further details about these assessments are outlined in Section 3 (Course Design). User behavior is defined as the technical interaction and whether learners used video player functionalities, specifically the events play/pause, seek, watching in full-screen mode and whether the video was watched until the very end. Perceived quality is collected via a survey conducted at the end of the course, asking users to rate the videos on a Likert scale. RQ1 will be evaluated by a hypothesis test: H0: The variable between the treatment conditions is not significantly different. H1: The variable between the treatment conditions is significantly different.
As the literature outlines various sources and factors for successful digital learning experiences, a second question is raised, contextualizing the overall course success of the same video experiment user sample: RQ2: Which course elements account for learning success in an online data science course?
Here, learning success is defined by the overall achieved points that learners receive for solving programming exercises and exams. Although the primary focus of the research is on the cause-effect relationship between video formats and knowledge retention, we motivate this second question with overall learning outcomes and contextualize learning gains so that our results can be compared with other studies.
Methodologically, RQ2 will be answered and evaluated through a multivariate regression analysis, estimating the impact of demographics, course elements and user behavior on the dependent variable "total points". Taken together, this provides a detailed picture of the effect of an isolated experiment and an insight into how learners use different learning functions.

Novelty of the study
Next to this contextualization of our results, three more characteristics make this paper a novel contribution to the established research field of MOOCs and digital online courses:
• Matching video styles with established metrics: Previous publications did not adapt classroom recordings and video production styles to popular SciComm formats. Despite the growing popularity of these channels and their videos, the impact on the formal education system and its assessment landscape has not yet been assessed.
• Specific production of learning material: Instead of incorporating existing SciComm videos and embedding them into a course, the study scripted and produced four specific video styles. Since the presenters and their voices are the same throughout the course, the risk of exposing the field experiment is decreased. We argue that these two elements lead to a generally higher internal validity.
• Providing dataset and video material: In order to evaluate and contextualize the learning elements further, we provide both the underlying research data as well as the used teaching material in an OSF repository2.

COURSE DESIGN AND EXPERIMENT SETUP
The research was conducted in an English-speaking MOOC, teaching the basics of data operations and analysis (e.g. databases, use of an EDA, bi- and multivariate analysis), machine learning (e.g. the KNN algorithm, decision trees, logistic regression) and general data science (e.g. data cleaning, label encoding, visualization) in the summer of 2023. The course concept and all its weeks were supervised by the research team. After the six weeks of the active course phase, the course content remains online as a self-paced course offering. For the given analysis, all entries are taken from the active course phase, totaling 2,801 active learners (or a show rate of 66.45% based on 4,215 total enrollments at the end of the course). People accessing the course and the material in a self-paced fashion are therefore excluded.
The digital learning experience consisted of the typical MOOC elements (14 topics in 16 videos, six no stake quizzes, six reading items, six programming exercises, four high stake exams and (optional) forum participation), as well as two live streams, an anonymous help form and an integrated Jupyter Labs (JL) environment as an integrated development environment (IDE). The latter was implemented directly into the course platform to make sure that each learner had access to a programming instance, lowering the entry barrier especially for novices and first-time enrollments on the platform (n=554 and 19.8% of the active course cohort). Three exercises could be submitted directly within the exercise item, while for some exercises it was necessary to transfer the answers to a dedicated answer form. The outlined SQL exercise (low stake) could only be submitted through the specific learning item.
A necessary prerequisite was a basic understanding of Python, as there was no repetition of typical beginner concepts such as different variable types, for-loops or type casting. From the overall course cohort, a large group reported basic knowledge without practical experience (n=455), while the largest group (n=546) stated they already had some practical experience on top of the same basic knowledge. Both extremes (no knowledge at all and being an expert) were reported significantly less often (n=142 and n=18, respectively). To ensure that every learner could use these different learning contents effectively, a video tutorial on how to use the programming environment and an optional, ungraded programming exercise, a sandbox, were offered. In the same introductory section, learners were asked to complete two surveys: the first on their own demographic background, the second on their self-reported skill level. Additionally, each learner was asked to submit a pre-test over the whole curriculum; the results of this test were neither graded nor communicated to the learners and served as the baseline for the four video conditions. The first week was published with a delay of three days to separate administrative onboarding and survey participation from the actual learning content.
As part of this first series of learning items, the experiment items on the topic "Introduction to SQL" were released. The underlying design follows a pretest-posttest design: after watching the allocated video condition (round robin scheme, balanced groups, each n=245), a short text about further database tips was the next learning item, which served as an active break between watching the lecture and answering the following quiz. Here, the learners saw their (in-)correct answers and the respective solution; there was no time limit, and the achieved points were not considered for the overall points. Then, a low stake exercise was presented in the programming environment: again without a time limit, but now without revealing correct answers and counting towards the final course performance; multiple submissions were allowed and possible, giving learners a chance to revisit and change their answers. The learners were then supposed to find the answers by using SQLite queries in the JL notebooks and submit their answers via the exercise solution sheet.
The primary learning targets were understanding the general SQLite syntax, applying different keywords (select, from, order by) and being able to avoid unwanted cross tables by matching a primary and a foreign key. For this practical programming exercise, learners had to use the chinook database3. In order to make answers less searchable, be it through search engines or large language models, the database was adapted: multiple tables were removed and values were changed and added4. The assessment proportion of the experiment ended with an exam. Here, the same concepts from the learning video were asked; for two questions the chinook database was needed again. With a time limit of 90 minutes and only one attempt, as well as relevance to the overall performance of the course, this exam was a high stake assessment.
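The core learning target, joining two tables via a matching primary/foreign key instead of producing an unwanted cross table, can be sketched as follows. The schema below is a deliberately simplified stand-in, not the actual (adapted) chinook schema:

```python
import sqlite3

# Minimal two-table schema mirroring the primary/foreign-key idea
# (table and column names are illustrative, not the chinook schema).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE artist (artist_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE album  (album_id INTEGER PRIMARY KEY, title TEXT,
                         artist_id INTEGER REFERENCES artist(artist_id));
    INSERT INTO artist VALUES (1, 'A'), (2, 'B');
    INSERT INTO album  VALUES (10, 'First', 1), (11, 'Second', 2);
""")

# Without a join condition, every artist is paired with every album,
# an unwanted cross table: 2 x 2 = 4 rows.
cross = con.execute("SELECT * FROM artist, album").fetchall()

# Matching the foreign key to the primary key yields only valid pairs.
joined = con.execute("""
    SELECT artist.name, album.title
    FROM artist, album
    WHERE artist.artist_id = album.artist_id
    ORDER BY artist.name
""").fetchall()
print(len(cross), joined)  # 4 [('A', 'First'), ('B', 'Second')]
```

The contrast between the four-row cross product and the two valid pairs is exactly the pitfall the exercise asked learners to avoid.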
All three assessments had an eight-point reference scale for the respective SQL proportion to balance their impact on the statistical analysis. Two more weeks of content were then published on a weekly basis. These learning items had no relation to the given analysis. After four weeks of the core teaching content, the learners were asked to evaluate the overall course and each video specifically on a scale from 1 (best) to 6 (worst) (response rate 14.9%, n=311). Similar to the pre-test, we asked the learners to evaluate the course as a whole in order to keep a potential recency bias as low as possible and to reduce the risk of exposing the experiment. The remaining two weeks of the overall six-week MOOC were used for an optional, but graded, peer review, which is unrelated to the given analysis.

Learners Demographic
In order to outline the background of our learners, the following metrics refer to those of the 2,801 active learners who submitted the optional survey at course start. Thus, the number of samples varies slightly between the demographic questions (sample sizes between n=1,337 for "age" and n=1,405 for "intended number of hours for studying").
The average enrolled learner is male and comes from Germany (58.23%), the USA (7.99%) or India (7.27%). Most of them have an undergraduate or graduate degree (35%) and 64.4% are in the range of 21 to 45 years, asked via five-year ranges (e.g. 31-35, 36-40), resulting in a roughly equal distribution across these ranges. Part of the survey asked about the intended time per week that a learner planned to spend studying ("How many hours per week did you plan to work on this course?"). Although the course was advertised with five to seven hours per week of recommended workload in the course description, the majority (43.4%; n=610 out of 1,405 submissions) indicated two to four hours, followed by the recommended five to seven hours per week (n=314 or 22.3%). These demographic data are later used in the statistical model, estimating the impact on the overall course success. As for the application usage, 87.8% of the learners exclusively used the desktop version of the course platform; the remaining 12.2% are divided between the mobile web and the smartphone app version of the platform. For a course that deals with programming and data analysis, this high number of desktop users is plausible. The teaching team also recommended using a desktop or laptop to work on the exercises and JL notebooks. All mandatory and graded learning items accounted for a total of 173 points. In order to receive a certificate, 51% had to be achieved, which 541 unique learners accomplished.

Video Styles and Production Process
Independent of the actual production style, the research team wrote a general script that was recorded. The initial version was reviewed by a research colleague who is not part of the author team. Besides being factually correct, two more aspects were important: First, the script had to be independent of a specific production style or hardware requirement (e.g. needing a Lightboard to execute the flow of explanation was not an option [16]). All conditions needed to use the same underlying visuals and elements; thus, using an IDE was not feasible, as the results would not have been comparable. At the same time, we acknowledge that using a JL notebook would have been the natural reference point, teaching directly from a coding environment. To mimic this effect, the teaching script showcased various queries and showed their results, as an IDE or Jupyter notebook would have done as well, which is in line with the temporal contiguity principle of Moreno and Mayer [27]. Second, [11, 18]. Derived from the literature, feasible production styles for a randomized controlled trial were identified: The popular KhanAcademy [5] style of drawing and writing key points of a concept or a flow of mechanisms, linking entities and thoughts, was deemed both very interesting and realistic to implement (template for "Drawing"). Although working without a script is recommended by Khan himself [22], we decided to use a script in order to make the approach comparable. A similar approach is to write directly on pieces of paper and move them around as tangible objects. On YouTube, the channel of chemist TylerDeWitt (PhD) was the template for this production style ("Moving") [8]. The channel "Bozeman Science" uses a similar technique [3]. Third, a talking head video with different video zooms, camera perspectives and digital text annotations was identified as the YouTube production style. A known science channel that uses these features is Vsauce [10]. Although there are different formats on YouTube, this name was
chosen based on certain characteristics (accent lights in the back, various (jump) cuts and multiple backgrounds and scenes). Visually, this production style is the greatest contrast to the other video conditions. Given the fact that most institutions most likely use at least one type of slide deck (be it through PowerPoint, Keynote or LaTeX), the "University" condition was set as the control condition. The MIT OpenCourseWare channel uses such a slide-based format [6], among other production elements. Figure 1 gives an overview of the four styles and their production characteristics. From a practical point of view, only three to five conditions were feasible, as a number too low would result in an A/B test, and a number too high would diminish the statistical power. Thus, three experiment conditions and one control condition were targeted.
Hansch et al. [20] outlined various production styles, although some could not be considered. While the documentary style seen in "Veritasium" or "Physics Girl" [7] videos is of a captivating nature, its content cannot be controlled and aligned with other formats. Similarly, an interview situation would rather be a staged play than an in-the-moment dialogue. Recording the lecture within a traditional classroom would have been possible, but that is usually not done on the platform, as the content is specifically produced for a digital learning experience in a supervised MOOC.
The remaining videos of the course were of a mixed nature as well. While some videos were recorded in a studio environment using a green screen, some informal videos were interview situations, and the tutorial on JL was a picture-in-picture (PiP) video directly in the IDE. Overall, the learners were used to various types of videos, and thus the experiment conditions did not break a strict video format. At the same time, each learner, regardless of their assigned condition, had seen and heard the teaching team in previous videos before seeing their video condition. Both aspects made sure that the presenters did not stand out, either in a negative or positive way, but were evaluated for their learning content. What unites all videos within the course is the absence of background music and subtitles. Although subtitles would be helpful for accessibility, we decided to omit them, as they could lead to reading a transcript rather than pure listening comprehension.

Hardware and Software Components
After writing the teaching script and deciding which video styles to produce, the research team recorded the four videos. The technical baseline was set to render all videos in Full HD (1080p), with 30 frames per second and a 16:9 ratio, as all of these metrics are within the scope of a usual web video experience. For the YouTube condition, a 4K camera was used, making sure that a video zoom in the post-production does not reduce the underlying resolution. The other three videos were shot with a DSLR camera. For the KhanAcademy-like "Drawing" condition, a dedicated writing input was already at our disposal. Audio-wise, all audio tracks were recorded with professional microphones (lapel and condenser types) at a sample rate of 48 kHz. By using a teleprompter, to which the research team was accustomed, the script could be followed verbatim without losing the natural inflection of speech. All digital text annotations and changes of the video track (e.g. zoom/tilt) were done in DaVinci Resolve (Studio Version 18). All these mechanisms led to four video conditions that use the same sound, the same examples and the same core visuals.

METHODOLOGY
The following section outlines how data was aggregated and filtered, followed by the statistical analysis applied to the components of the pretest-posttest design.

Data aggregation and filtering
Starting from 4,215 enrolled learners when the course ended, 66.45% unique, active learners were screened. As all of our learners had given consent to research activities within the platform, an additional approval by an ethics committee or an IRB was not necessary for our research project.
The following steps of data cleaning and filtering were applied: 1,575 learners did not watch their allocated video condition (proven by the learning analytics event log of our course platform) and 46 learners decided to leave the course. 32 learners had inconsistent entries (quiz timings of 0 seconds between opening and closing the quiz while having a score above 0), while 7 learners were associated with the same educational institution that offered the course. These exclusions were applied sequentially (equivalent to a listwise deletion), meaning that if one exclusion criterion was met, the respective record was excluded as a whole. To increase the robustness of the posttest, the first 24 hours after the release of the course content were used to see if the questions were understood correctly and if the learners could handle the material. As a result, a sub-population of 132 learners (equivalent to 13.5%) was excluded. Lastly, a check for potential cheating behavior was applied, defined by the overall score of the high stake exam (getting more than 85% of the points in less than four minutes), and two records were excluded. In the end, 1,007 records were eligible for analysis. As the four video conditions were not equally large, the smallest sample size with 245 entries was used as reference and the data set was balanced to 980. Figure 2 summarizes the filtering procedure.
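The sequential exclusion steps amount to chained filters in the spirit of a listwise deletion. The following pandas sketch uses invented column names and toy records; it only illustrates the order of the criteria, not the actual platform data:

```python
import pandas as pd

# Hypothetical learner records; the column names are assumptions
# made for illustration, not the actual platform schema.
df = pd.DataFrame({
    "learner_id": range(6),
    "watched_video": [True, True, False, True, True, True],
    "left_course": [False, False, False, True, False, False],
    "quiz_seconds": [120, 0, 90, 60, 45, 200],
    "quiz_score": [5, 3, 4, 2, 6, 7],
    "is_staff": [False, False, False, False, True, False],
})

# Exclusion criteria applied one after the other (listwise deletion):
df = df[df["watched_video"]]                # did not watch the condition
df = df[~df["left_course"]]                 # left the course
df = df[~((df["quiz_seconds"] == 0) & (df["quiz_score"] > 0))]  # inconsistent
df = df[~df["is_staff"]]                    # affiliated with the institution
print(len(df))  # 2 eligible records remain in this toy example
```

Because each filter operates on the already-reduced frame, a record that triggers any criterion is dropped as a whole, matching the described procedure.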

Measuring Learning Success
Building on the outlined experiment setup (Section 3), the primary unit of analysis is the t-test difference5 between the control condition "University", which follows a slide-based style, and the three experiment conditions "Moving", "Drawing" and "YouTube". This is done to analyze whether one SciComm format has an advantage over the formal, university slide-based approach. Additionally, the no stake quiz and the low stake programming exercise can be differentiated between the first and final attempt score, which will also be evaluated by a t-test. In order to evaluate the existing knowledge and the scores of the first attempts, the Pearson correlation between the pre-existing knowledge and the posttests is reported as well (computed using pandas).
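A minimal sketch of this analysis, with invented score vectors rather than the study data, could look as follows, using scipy for the independent-samples t-test and pandas' default Pearson correlation:

```python
import pandas as pd
from scipy import stats

# Illustrative score vectors; the real analysis compares each SciComm
# condition against the slide-based "University" control.
university = pd.Series([6, 7, 5, 8, 6])
youtube = pd.Series([7, 6, 6, 8, 5])

# Independent-samples t-test between control and treatment condition.
t_stat, p_value = stats.ttest_ind(university, youtube)

# Pearson correlation between pretest and first-attempt posttest
# scores (pd.Series.corr defaults to the Pearson method).
pretest = pd.Series([2, 3, 1, 4, 2])
posttest = pd.Series([6, 7, 5, 8, 6])
r = pretest.corr(posttest)
print(round(p_value, 3), round(r, 3))  # 1.0 1.0 for these toy vectors
```

In the toy data the group means coincide (t near 0, p near 1) and the posttest is a perfect linear shift of the pretest (r of 1.0); the study's actual values are reported in Section 5.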

Measuring User Behavior
The second metric describes how users interacted with the video player elements. Through five video events (play, pause, seek, fullscreen and end), a holistic overview of how our learners interacted with the videos emerges. Compared to the two other metrics, these statistics are not reflected back to the user, which could result in data of a more unbiased nature: while every user has an understandable motivation for a high score and is actively asked about their opinion, the video player analytics remain in the background.
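Technically, such player analytics reduce to counting events per condition. A minimal pandas sketch with an invented event log (the event names follow the five listed above) could look like this:

```python
import pandas as pd

# Hypothetical player-event log; in the study, each row would come
# from the platform's learning analytics event stream.
events = pd.DataFrame({
    "condition": ["University", "University", "YouTube", "YouTube", "YouTube"],
    "event": ["play", "seek", "play", "pause", "end"],
})

# Count events per condition and type, yielding a condition x event table.
usage = events.groupby(["condition", "event"]).size().unstack(fill_value=0)
print(usage)
```

The resulting table (conditions as rows, event types as columns) is the kind of aggregate that can later feed the regression model as behavioral covariates.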
In order to answer the outlined research questions, the four video conditions are evaluated in isolation (RQ1), as well as in the overall course context (RQ2).

Measuring Perceived Video Quality
In order to make a statement about the perception, four questions are asked. Perception is operationalized by 1) the overall course rating, 2) the individual video rating, 3) the word association with seeing a lecturer's face and 4) the chance of recommending the course to others.
The overall course impression and the individual videos use the same scale, ranging from one to six, together with a grade (1-Very good, 2-Good, 3-Satisfactory, 4-Sufficient, 5-Poor, 6-Deficient). Not only are our learners used to this scale, it also has the advantage of lacking a neutral midpoint, asking learners to decide whether they grade something above or below the median (decreasing central tendency bias). Third, seven adjectives that could be associated with seeing a lecturer's face are presented. With three positive (useful, helpful, pleasant) and three negative (distracting, annoying, frustrating) answer options, as well as a neutral one (does not matter), this gives enough range of potential word associations while minimizing the effort of coding open-answer choices. As we derived these from the existing literature [29, 40], it has the benefit of allowing a comparison of our results with previous research using the very same answer options. Finally, the likelihood of recommending the course to others, from 1 (lowest) to 10 (highest), is asked (net promoter score, NPS). All four are asked at the end of the course, when learners could have seen all of the learning material but had not yet received their final grade.

RESULTS
5.0.1 Self-Reported Skills. All of the included learners had a similar self-reported skill level, ranging between 2.71 (YouTube) and 2.87 (Moving). The four groups were tested against each other, resulting in no significant t-test statistics (df's between 326 and 335, t-values between 0.26 and 1.144).
5.0.2 Pretest. The existing knowledge is similar between the four groups, with the exception of "YouTube" in contrast to "Moving". Here, the average pre-existing score was 2.07 (YouTube) and 2.41 (Moving). Based on these descriptive differences, the t-test shows a significant difference (t(317) = 2.214, p < .05, Cohen's d = .25) between the YouTube and the Moving condition (variances < 0.1). A post-hoc analysis with an adjusted p-value (Bonferroni correction) could not validate this significance (p = 0.15 for "YouTube" vs. "Moving"). In any case, with a sample size of n > 130 per group, this potential effect is likely random noise. The experiment conditions were not statistically different from the control group (means: University 2.31, Drawing 2.3, with t-values between 0.02 and 1.56).
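The pairwise group comparisons used throughout this section (Welch-style t-tests followed by a Bonferroni correction) can be sketched in plain Python; the function names are illustrative, not the study's actual analysis code:

```python
import math


def welch_t(x, y):
    """Welch's t statistic and degrees of freedom for two independent samples."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)  # sample variances
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    se2 = vx / nx + vy / ny
    t = (mx - my) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

For example, a raw p-value of 0.025 across six pairwise comparisons is adjusted to 0.15, which matches the pattern of a nominally significant pretest difference disappearing after correction.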

Learning Success
5.1.1 Posttest 1 - No Stake Quiz. The first attempt of the no stake quiz had a mean and a median between 6 and 7 points (out of 8), with a standard deviation of 1.4 (across all groups combined). Between the groups, no statistically significant difference has been found (all t-values below 0.9, df's > 389). The final score showed the same result, without a relevant difference between the four groups (all df's between 393 and 412, t-values between 0.24 and 1.10). Interestingly, the score difference between the first and final attempt is 1.26, or 15.75% of the total points. At the same time, all groups benefited from seeing the correct answers and most users corrected their initial answers, leading to this expected, higher final score. Additionally, the pretest score and the first attempt score showed only a weak correlation in the overall cohort (r = 0.31), indicating that the achieved results are largely independent of the existing knowledge. Within the groups, the Pearson correlation ranges between 0.16 (University) and 0.42 (YouTube), underlining this weak dependence.
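The reported pretest correlations are plain Pearson coefficients; a minimal, self-contained sketch (illustrative, not the study's pipeline):

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


print(pearson_r([1, 2, 3], [2, 4, 6]))  # approximately 1.0 (perfectly linear)
```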

5.1.2 Posttest 2 - Low Stake Exercise. The first attempts of the programming exercise showed no statistically significant difference, as all t-values are below 1.02 (all df's > 328). The same applies to the final scores. Overall, the users scored between 6.78 (Drawing) and 6.91 (University & YouTube). In contrast to the quiz, the improvement between the first and the final attempts is not as high, as the difference between the two means is 0.12 (= 1.5% of the exercise points). The impact of the existing knowledge is low again, with correlations between 0.15 (University) and 0.28 (YouTube).

5.1.3 Posttest 3 - High Stake Exam. Finally, the high stake exam showed similar scores. Means range from 3.87 (University, SD: 2.26) to 4.15 (Moving, SD: 2.11). The average scores of the four groups are therefore not significantly different (all df's > 334, t-values < 1.21). As for the pretest correlation, with a range of 0.24 (Drawing) to 0.39 (University), the coefficients are slightly higher compared to the quiz and exercise assessments, but still low.
Consequently, based on three posttests, the null hypothesis for RQ 1.1 can be retained, hence accepting H0: The video production style does not show a statistically significant difference with regard to the learning success.

User Behavior
Regarding how users interacted with the video player when given a specific video production style of an introduction to SQL, the functionalities Play, Pause, Full-screen and End do not show a significant difference, while the event Seek does show a difference in two comparisons. On average, learners pressed Play between 4.37 (YouTube) and 6.17 (Drawing) times while watching their video condition. Combined with the Pause event, the YouTube production condition was interrupted less often than the other three conditions. As for Seek, learners watching the YouTube condition jumped to a different time of the video more often than those watching Moving. Testing the number of seeks, a statistically significant difference can be reported for the conditions University and Moving (t(435) = 2.96, p < .01, d = .005), albeit with a barely measurable effect size.
Consequently, the null hypothesis for RQ 1.2 can be retained, hence accepting H0: The video production style does not show a statistically significant difference with regard to the user behavior.

5.3.1 Course Rating. The overall course rating does not show a difference between the groups, as all learners reported a value between 2.12 and 2.38 with similar standard deviations. As the course rating and the individual video ratings do not converge, we interpret this as a successful differentiation: The time gap between watching a condition and completing the survey allowed our learners to evaluate the video on SQL separately from the overall course.

5.3.2 SQL Video Rating. Regarding the perception, the four different videos about SQL were rated by our learners: In general, there is no substantial deviation between the control group and the three experiment conditions (means between 2.32 and 2.98, SD between 1.22 and 1.42; the lower the grade, the better). In contrast, YouTube and Drawing are perceived significantly differently, as YouTube shows an average treatment effect of 0.66 (mean_youtube minus mean_drawing) and a moderate effect size, favoring the YouTube production style (t(112) = 2.63, p < .01, d = .52), also confirmed by a p-value adjustment method (adjusted p = 0.041).
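The reported effect size is Cohen's d; a minimal sketch using the common pooled-standard-deviation definition (the paper does not specify which variant was computed):

```python
import math


def cohens_d(x, y):
    """Cohen's d: standardized mean difference, pooled-SD variant."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    pooled_sd = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    return (mx - my) / pooled_sd
```

By common rule of thumb, |d| around 0.2 is a small effect, around 0.5 a moderate one, which is how the reported d = .52 is interpreted above.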

5.3.3 Association. The word association of learners (completing "I think seeing the instructor in the video is ___, compared to not seeing him") shows a strong positive association towards seeing a lecturer: The majority across all groups indicated either useful, helpful or pleasant, while only a few users reported a negative connotation. This highly positive association is in line with previous studies [29,40]. The indifferent proportion is equally low. After testing for a normal distribution (all resulting in p < .05), the Mann-Whitney U test was chosen as a non-parametric test (asymptotic, list-wise deletion): No significant difference can be reported for the positive associations, while the negative word associations yield a difference for distracting (U: 2044.0; p < 0.05) and annoying between the groups University vs. YouTube and Moving vs. YouTube. Given the very low number of negatively associated answers (12 in total; see Figure 3), this difference is not robust enough for generalization.
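The U statistic itself is computed from rank sums over the pooled sample; a self-contained sketch with mid-ranks for ties (this only yields the statistic, not the asymptotic p-value the test reports):

```python
def mann_whitney_u(x, y):
    """Smaller of the two Mann-Whitney U statistics, using mid-ranks for ties."""
    pooled = sorted(x + y)
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # average rank of positions i+1..j
        i = j
    r1 = sum(ranks[v] for v in x)  # rank sum of the first sample
    n1, n2 = len(x), len(y)
    u1 = r1 - n1 * (n1 + 1) / 2
    return min(u1, n1 * n2 - u1)


print(mann_whitney_u([1, 2], [3, 4]))  # 0.0 (complete separation of groups)
```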

5.3.4 Net Promoter Score. Finally, the NPS aligns with these overall results, as no condition has a significantly higher or lower recommendation level (all df's > 102, t-values < 1.83), also confirmed by the adjusted p-value method.
Consequently, based on four different quantitative measurements, the null hypothesis for RQ 1.3 can be retained, hence accepting H0: The video production style does not show a statistically significant difference with regard to the perceived quality.

Overall Course Context
In order to answer RQ2 and give an overall context that can be compared to other studies, Table 3 summarizes the factors that predict learning success: Of 23 variables, six show a significant impact when predicting the total points a user achieved. While being in an older age range decreases the estimated points, the existing knowledge increases them by roughly 0.8 per pretest point. With the pretest worth a total of 17 points, an already highly trained learner would have an advantage of 13.6 points on average; the average user had a pretest score of 13.2 and benefited statistically by 10.56 points, or 6.1% of the total course points (= 173). Similarly, the total number of sessions a learner had increases their total points by 0.28 per session. If people continued to follow the course materials, they visited more items, resulting in 2.65 points per item. Out of 63 learning items, the average user visited 44, accounting for a statistical effect worth 116 points. For each of the three notebooks that could have been submitted, the multivariate regression accounts for 4.61 course points, or 13.83 points in total. Finally, submitting an exercise is statistically worth 2.31 points, resulting in a maximum of 13.86 points from the six low stake exercises. Rather passive interactions, such as downloading videos or slides, or longer learning sessions, do not contribute significantly to learning success.
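The arithmetic behind these per-variable effects can be recomputed directly from the coefficients; the variable names below are our own labels, while the coefficient values are those stated in the text:

```python
# Regression coefficients per unit, as reported in the text
# (dictionary keys are illustrative labels, not the study's variable names)
COEFS = {
    "pretest_point": 0.80,   # per pretest point
    "session": 0.28,         # per learning session
    "item_visited": 2.65,    # per visited learning item
    "notebook": 4.61,        # per submitted notebook
    "exercise": 2.31,        # per submitted exercise
}


def contribution(variable, count):
    """Estimated course points attributable to `count` units of a predictor."""
    return COEFS[variable] * count


# Average user: pretest score of 13.2 -> about 10.56 estimated course points
print(round(contribution("pretest_point", 13.2), 2))  # 10.56
```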

DISCUSSION
Overall, the absence of differences in terms of assessment scores is in line with previous results [12,23,29,40]. As these studies tested more formal video production styles (e.g. talking head versus slides), our results add to the existing body of knowledge. This is especially true, as the underlying sample size is large enough and was carefully vetted and filtered. Since the at-scale field experiment used a repeated measurement of three posttests, the internal validity is considered robust. Striking is the delta between the first attempt of the quiz and of the exercise in comparison to the final, highest attempt: Users showed an improvement on the former and none on the latter assessment, again independent of their video condition. This effect could be explained by the visual feedback mechanism that defined the level of stake: The no stake quiz showed the correct answers after the first submission, the exercise only after the submission deadline. The multivariate analysis underlines this logic: Submitting more quizzes does not increase the total course points, but the exercises play a relevant role for the overall course success. As the three submittable notebooks displayed the achieved points once the correct solution was entered, it is plausible that learners used this feature more, resulting in a higher impact.
An important factor for experimental studies is keeping the input variables equal, so that any difference can be attributed to the manipulation that constitutes the actual experiment. With an absence of differences and equal results across multiple tests in multiple conditions, we argue that the distinctive factor is the script used as an input. As the experiment needed a written baseline that was reviewed and agreed upon, multiple rounds of edits and changes were applied. This process yielded a teaching script that was then ready to be produced as a video, regardless of the video production style. The choice of video format does not matter, but the previous steps for creating the teaching script do. Overall, the relevant predictors are similar to the recommendations we give students in the first weeks of a new semester: "Come to class, be attentive" (no. of sessions and no. of items visited), "be aware of your knowledge gaps, close them" (score in pretest) and "read the given material" (submitted exercises).

Limitations
Although these results have been controlled for various variables and the larger sample size increases the overall robustness, we identify two major limitations: First, the definition of a high-stake exam will be different in formal educational setups, such as a final exam or an on-site presentation. Within a MOOC, a time-limited quiz with one attempt is the only scaling option. Without advanced authentication (e.g. proctoring mechanisms [14]), a learner could simply use a second account and cheat the system. By controlling the timings and filtering out potential cheaters, we tried to limit the impact of this scenario as much as possible. Similarly, the second limitation concerns what learners do in a second tab, on a second screen or between sessions. For this study, it was assumed that the items were done in the given order. As learners could jump between the learning items, some data points might be influenced by this behavior. Given that the pretest has been put into a different course week, being visually separated, and that the quizzes were presented in a logical order (e.g. first the exercise notebook to solve the task, then the submission form), we assume that overall, our learners followed the intended learning path. Future studies could consider the costs of producing SciComm formats. Although we already had access to hardware and software, producing the four conditions required more effort and preparation than following our usual MOOC production process with slides, PiP screens and talking heads.

CONCLUSION
Prior to this study, it was assumed that SciComm formats are successful because they were successful in terms of social media metrics: Reaching a lot of people was the primary quality metric, thus the provided content was considered high quality. The actual learning gain was unknown, but we assumed there is a benefit to watching scientific and educational content: millions of views cannot be wrong. With four different production styles, tested in balanced groups of 245 learners, we can be confident that existing SciComm formats are indeed effective. Furthermore, they can be applied to typical lecture topics in formal education, such as relational databases and SQLite queries, as there does not seem to be a predominant strategy in terms of the style of video production; rather, success depends on the preparation of the video material itself. The similar and thus insignificant differences between the conditions mean, at the same time, that all conditions are equally successful, ensuring that every student achieved a substantial learning gain. For the typical classroom, whether virtual, on-site or hybrid, this results in more creative and pedagogical freedom: Instead of chasing a specific video style of a large channel and hoping it works, educators can pick any of them and match it to their (existing) mode of teaching.

Figure 2 :
Figure 2: Procedure of the data collection and filter criteria.

Figure 3 :
Figure 3: Users' perception regarding seeing a lecturer's face in a video ("I think seeing the instructor in the video is ___, compared to not seeing him"). Multiple answer choices possible. Sample size varying per experiment condition and question, total sample size n=299.

Table 1 :
Summary of the Stylistic Features of the Four Video Conditions. All videos use a natural tone, including rhetorical questions and addressing the learner directly (e.g. "What could be the worst thing that happens to our database? Besides losing it or the server room catching fire, consistency is our biggest concern."). Both aspects are backed by recommendations from the literature.

Table 2 :
Results of the dimensions Learning Success, User Behavior and Perceived Quality. Regarding the RQs, all variables show no significant differences between the control condition and the experiment conditions; Md: Mode, Mn: Mean.

Table 3 :
Multivariate regression results predicting the total course points achieved with different categories of the learning items.Showing coefficients with standard errors in parentheses.