Learning Big Data Systems via Emulation

Big data systems are becoming an integral part of the computing and data science curriculum. However, current curricula focus largely on how to use the systems. An effective approach to learning the internals of big data systems is through emulation. In this paper, we report on a study in which students in a graduate database course were asked to complete a course project emulating big data systems such as Hadoop and Spark. We present the design of the emulation projects and examine their impact on students' learning. Our key finding is that the emulation projects can greatly improve students' self-efficacy in completing tasks that require in-depth knowledge of and skills with big data systems.


INTRODUCTION
There is a huge demand for IT professionals and data scientists who can manage and analyze large volumes of diverse data [11,14]. To meet this demand, many universities and colleges have added coverage of big data systems to their computing and data science curricula [6,9,10,19,21]. However, the current curriculum on big data systems is largely focused on how to use the systems [4,8,17,28]. While the applications of the systems are clearly important, we believe that it is also very important for students, especially those in graduate programs, to gain key insights into how these systems work.
An effective approach to helping students gain such insights is to give them hands-on experience in developing such systems. While having students develop a full-fledged big data system may not be feasible in the short duration of a course, we believe that, with sufficient scaffolding and careful planning, it is possible for students to develop prototype systems that emulate the working of big data systems. Through critical thinking and hands-on experience in the emulation process, students can master the key techniques that drive big data systems.
Towards this goal, in this paper we present the design of four emulation projects (Section 2), touching upon popular NoSQL databases (e.g., Firebase and MongoDB) that manage large amounts of semistructured data (e.g., JSON documents), distributed file systems (e.g., HDFS) that store and manage large files across multiple machines, and parallel computing systems (e.g., Hadoop MapReduce and Spark) that process large amounts of data in parallel. We report on a study in which students in a graduate database course offered in Spring 2023 at a major research-oriented university in the U.S. were asked to select one of the emulation projects as their course project (Section 3). The study found that the emulation projects can greatly improve students' self-efficacy in completing tasks that require in-depth knowledge of and skills with big data systems. We also discuss possible adjustments to the structure of the course (Section 3.3) that might increase the number of students selecting some of the emulation topics (e.g., MapReduce and Spark).
Learning via emulation: Garrity et al. [12] developed WebMapReduce, a Web browser-based app that allows students to write mapper and reducer code for MapReduce jobs in their preferred programming languages and run the jobs in the Hadoop system. The app utilized Hadoop's streaming module [13] to facilitate the integration of emulation code with the system. Du et al. developed the SEED Internet Emulator [7], a Python library for emulating the essential elements of the Internet, e.g., hosts, networks, and routers. Users can use the library to construct a mini-Internet that emulates the real-world Internet. Emulators have also been used to help teach computer architecture [26], organization [2], and theory [23].
However, we are not aware of any prior work on developing emulators for big data systems and examining the impact of the emulation process on students' learning. Since the developed emulators typically have a simpler structure and code base than the original systems, they can also become a great tool for training students on big data systems.

DESIGN OF EMULATION PROJECTS
We now describe the design of four emulation projects: Firebase, HDFS, MapReduce, and Spark. For each project, we first describe the working of the big data system and then discuss how to approach the emulation. The learning objectives of each project are given in Table 1. For example, items F1 to F5 are the objectives that students working on the Firebase emulation project are expected to achieve. We will further discuss Table 1 in the evaluation (Section 3).

Emulating Firebase
Google's Firebase real-time database [16] is a cloud-hosted NoSQL database that manages JSON data. It has a RESTful API that allows users to send requests using HTTP commands. In particular, PUT writes data at a path, GET retrieves data, PATCH updates specified fields, and DELETE removes data. Collectively, these commands form the CRUD operations [25].
Example requests, sent via the curl utility, are shown below.
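The original example figure is not reproduced here; the commands below are an illustrative sketch of such requests. The project URL (demo-project.firebaseio.com) is hypothetical, so these commands will not run as-is; the `.json` suffix on each path follows Firebase's REST API convention.

```shell
# Hypothetical project URL -- replace with a real Firebase database URL.
DB=https://demo-project.firebaseio.com

# Create/replace the object at /users/alice (PUT)
curl -X PUT -d '{"name": "Alice", "age": 30}' "$DB/users/alice.json"

# Read it back (GET)
curl -X GET "$DB/users/alice.json"

# Update only the age field (PATCH)
curl -X PATCH -d '{"age": 31}' "$DB/users/alice.json"

# Delete the object (DELETE)
curl -X DELETE "$DB/users/alice.json"
```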
Firebase automatically syncs the data in the server in real time with the data kept in the apps. This is accomplished via WebSocket, an Internet communication protocol that maintains a long-lived connection between the server and the apps. The server notifies all connected apps of changes to the data through the communication channels established by the WebSockets.
Emulation: As illustrated in Figure 1, one approach to emulating Firebase is as follows. (1) Store data in MongoDB [3], which also manages JSON data and is covered in depth in our database course. (2) Build a RESTful service interface to MongoDB, which itself does not have such a service. The developed interface should emulate the CRUD operations provided by Firebase. (3) Build a WebSocket server that monitors changes in the data and pushes the changes to the connected devices. (4) Develop Web/mobile apps that demonstrate the real-time syncing of the data with the server.
Although MongoDB and Firebase both manage JSON data, the two systems structure the data differently; e.g., MongoDB requires every collection in its database to have a primary key attribute called "_id". Their query languages and query processing are also different. For example, a GET request in Firebase may be translated into the find function in MongoDB, while PUT and PATCH requests may be implemented using the update function in MongoDB. Furthermore, Firebase requires an index to be created on an attribute before filtering can be performed on that attribute. MongoDB does not have such a restriction, but it does provide a facility for creating indexes on particular attributes. Such a facility can be utilized to emulate how Firebase processes filtering queries.
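As a sketch of this translation, the hypothetical helper below maps a Firebase-style request (HTTP method plus path) to the name and arguments of the corresponding MongoDB operation. The path convention (first segment names the collection, second gives the document "_id") is an assumption for illustration, not part of either system's specification.

```python
def translate(method, path, body=None):
    """Map a Firebase-style REST request to a MongoDB-style operation.

    Hypothetical convention: /users/u1.json -> collection "users", _id "u1".
    """
    segments = path.strip("/").split("/")
    collection, doc_id = segments[0], segments[1].removesuffix(".json")
    selector = {"_id": doc_id}
    if method == "GET":
        return {"collection": collection, "op": "find", "filter": selector}
    if method == "PUT":      # replace the whole document
        return {"collection": collection, "op": "replace_one",
                "filter": selector, "document": body}
    if method == "PATCH":    # update only the supplied fields
        return {"collection": collection, "op": "update_one",
                "filter": selector, "update": {"$set": body}}
    if method == "DELETE":
        return {"collection": collection, "op": "delete_one", "filter": selector}
    raise ValueError(f"unsupported method: {method}")

print(translate("PATCH", "/users/u1.json", {"age": 31}))
```

Students can then dispatch on the returned operation name to call the corresponding pymongo function on the selected collection.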
Building the RESTful server and WebSocket requires students to have some familiarity with Web app development. Many development frameworks and tools can be utilized. For example, Figure 2 shows Python code that uses the Flask library to intercept all PUT requests. Students can build on this to implement the functions that process the RESTful requests and insert the data into MongoDB.

Emulating HDFS
As Figure 3.a illustrates, the Hadoop Distributed File System (HDFS) [22] consists of a NameNode server and a number of DataNode servers. The NameNode server manages the metadata of the file system, which includes file names, file sizes, the directory structure, and the locations of the blocks that store the contents of the files. The blocks are stored in the DataNode servers, and a block may have multiple replicas stored in different DataNodes. For example, Figure 3.a shows that the content of the file "cars.csv" (located under the /user/john directory) is stored in blocks 1 and 3. Block 1 is replicated in DataNodes A and C, while block 3 is replicated in DataNodes A and B.
To read a file, an HDFS client first contacts the NameNode to find out the locations of the blocks that store the content of the file, and then retrieves the content directly from the DataNodes. To write a file, the client first asks the NameNode to select DataNodes to store the content of the file, and then writes data to the assigned DataNodes.
HDFS provides a shell program where users can issue file system commands. An HDFS client interprets these commands and transforms them into requests to the HDFS servers for processing. Example commands are: ls (retrieving a list of file system objects under a directory), cat (displaying the content of a file), put (uploading files to HDFS), and mkdir (creating a directory).
Emulation: Students will develop a metadata server and a number of data servers to emulate HDFS. All servers may be implemented as RPC (remote procedure call) servers. For example, Figure 3.b shows a metadata server implemented using Python's RPyC library [20]. It provides services such as opening a file for reading or creating a new file. The metadata server may use a relational database to manage the metadata of the distributed file system. The data servers, in turn, will be responsible for storing and serving blocks, where each block may be stored as a file in a data server.
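The project's sample code uses RPyC; as an analogous sketch using only the standard library, the snippet below stands up a minimal XML-RPC metadata server whose open_file service returns block locations. The file path, block IDs, and data server names are the illustrative ones from Figure 3.a, and the in-memory dictionary stands in for the relational database mentioned above.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# In-memory metadata: file path -> [(block_id, [data servers holding a replica])]
METADATA = {"/user/john/cars.csv": [(1, ["A", "C"]), (3, ["A", "B"])]}

def open_file(path):
    """Return the block locations of a file (empty list if it does not exist)."""
    return METADATA.get(path, [])

# Bind to port 0 so the OS picks a free port.
server = SimpleXMLRPCServer(("localhost", 0), logRequests=False)
server.register_function(open_file, "open_file")
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# An HDFS-client-like caller asks the metadata server where the blocks live.
client = ServerProxy(f"http://localhost:{port}")
locations = client.open_file("/user/john/cars.csv")
print(locations)
```

Note that XML-RPC marshals tuples as arrays, so the client receives lists; RPyC, by contrast, can pass richer Python objects between client and server.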
The services provided by the metadata and data servers may be used to implement a program that emulates how the HDFS shell and client work. For example, to process the shell command "hdfs -cat /user/john/cars.csv", the program first calls the open function of the metadata server to find out the locations of the files in the data servers that store the blocks of "cars.csv". It then contacts the data servers to retrieve the files and displays their contents on the screen.
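A sketch of this flow is shown below, with simple in-memory stand-ins for the metadata and data servers (the class names, file path, and block contents are illustrative assumptions, not the project's required interface):

```python
class MetadataServer:
    """Stand-in metadata server: path -> [(block_id, [replica servers])]."""
    def __init__(self, table):
        self.table = table
    def open(self, path):
        return self.table[path]

class DataServer:
    """Stand-in data server storing blocks as block_id -> content."""
    def __init__(self, blocks):
        self.blocks = blocks
    def read_block(self, block_id):
        return self.blocks[block_id]

def cat(path, meta, data_servers):
    """Emulate 'hdfs -cat': fetch each block in order from a replica holder."""
    pieces = []
    for block_id, replicas in meta.open(path):
        # Read from the first data server holding a replica of the block.
        pieces.append(data_servers[replicas[0]].read_block(block_id))
    return "".join(pieces)

meta = MetadataServer({"/user/john/cars.csv": [(1, ["A", "C"]), (3, ["A", "B"])]})
servers = {
    "A": DataServer({1: "id,model\n", 3: "1,tesla\n"}),
    "B": DataServer({3: "1,tesla\n"}),
    "C": DataServer({1: "id,model\n"}),
}
print(cat("/user/john/cars.csv", meta, servers), end="")
```

In the actual project the two stand-in classes would be replaced by RPC calls to the remote metadata and data servers.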

Emulating Hadoop MapReduce
While HDFS enables distributed storage of files, Hadoop MapReduce supports parallel processing of these files [24]. Figure 4.a illustrates the MapReduce framework, and Figure 4.b shows a sample implementation of MapTask. The PartitionTask will partition the output key-value pairs of the MapTask by hashing on the keys. For example, if there are two reduce tasks, then the output file "/tmp/output1" in Figure 4.b will be divided into two files: output1-r1 (for reduce task 1) and output1-r2 (for reduce task 2). The ShuffleTask will then fetch the files to the nodes running the reduce tasks. Next, the GroupTask will merge the files fetched from different map tasks and sort the key-value pairs by key to produce groups. Finally, the ReduceTask will call a user-supplied reduce function for each group and output the final key-value pairs into a file. Students will assemble these tasks into a MapReduce job for execution. For example, MapTasks may be executed on the servers holding the data blocks (see Section 2.2).
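The map-partition-group-reduce pipeline above can be sketched in a single process. The word-count map function and sum reduce function are the usual illustrative choices (not part of the project specification), and the MD5-based partitioner is an assumption made so that the same key always lands on the same reduce task (Python's built-in hash is randomized across runs):

```python
import hashlib
from itertools import groupby

def word_count_map(_, line):
    """User-supplied map function for word count (illustrative)."""
    for word in line.split():
        yield word, 1

def hash_partition(key, num_reducers):
    # Stable hash: identical keys always map to the same reduce task.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_reducers

def sum_reduce(key, values):
    """User-supplied reduce function: sum the counts of a group."""
    yield key, sum(values)

def run_job(lines, num_reducers=2):
    # MapTask + PartitionTask: apply the map function, hash-partition its output.
    partitions = [[] for _ in range(num_reducers)]
    for offset, line in enumerate(lines):
        for key, value in word_count_map(offset, line):
            partitions[hash_partition(key, num_reducers)].append((key, value))
    # GroupTask + ReduceTask: sort each partition by key, group, and reduce.
    result = {}
    for part in partitions:
        part.sort(key=lambda kv: kv[0])
        for key, group in groupby(part, key=lambda kv: kv[0]):
            for out_key, total in sum_reduce(key, (v for _, v in group)):
                result[out_key] = total
    return result

print(run_job(["a b a", "b c"]))
```

In the distributed version, each partition list becomes a file (e.g., output1-r1), the ShuffleTask moves these files across nodes, and the sort/group/reduce steps run on the reduce nodes.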

Emulating Apache Spark
Apache Spark [27] is built on the concept of the RDD (resilient distributed dataset). An RDD is a partitioned dataset where different partitions may reside on different machines. An initial RDD may be created from a file (e.g., a file in HDFS) or a data collection (e.g., a list in Python). An RDD may be transformed into another RDD, e.g., by applying a map or filter function. For example, data.map(lambda x: x * 2) will create a new RDD whose values are double the values in the original RDD. An action may be called upon an RDD, e.g., to print its content or compute some statistics of the data in the RDD. For example, data.reduce(lambda x, y: x + y) will compute the sum of the values in the data RDD. Both transformations and actions may be performed on different partitions of the RDD in parallel, and results from different partitions may be combined in parallel to produce the final output. Figure 5 illustrates the architecture of Spark, which consists of a driver node running a driver program called SparkContext and a number of worker nodes, each having an executor for executing Spark tasks (for transformations or actions). SparkContext is responsible for distributing the tasks among the worker nodes and for collecting and combining their outputs. A key feature of Spark is lazy evaluation, where transformations of an RDD are not processed until an action is called upon the RDD. This enables data pipelining and avoids the cost of storing intermediate results.
Emulation: The project consists of the following tasks. (1) Implement a class SC (e.g., in Python) that emulates SparkContext. The class should allow users to create an RDD from a file or a Python list, specify the number of partitions, partition the dataset, and support major operations such as map, filter, and reduce on the RDD. (2) Develop a class RDD, which should keep track of the partitions in the RDD (e.g., on which nodes they reside). It should also implement an iterator-like interface for the RDD that pipelines the data from one RDD to another and enables lazy evaluation. For example, Figure 5.b shows how the lazy evaluation of a map function on an RDD may be accomplished by delaying the evaluation to the compute method of MappedRDD, a subclass of RDD. The compute method will be applied when an action is called on the RDD. (3) Implement a scheduler for SC, which will be responsible for sending tasks to worker nodes, coordinating with executors, and fetching task results. Data and code for the tasks may be sent over the network to worker nodes through a serialization library, e.g., pickle or dill for Python. Students may be provided with sample code demonstrating the usage of the library.
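Along the lines of Figure 5.b, the sketch below shows how a MappedRDD subclass can defer all work to its compute method. The class and method names follow the paper's description; the in-memory list of partitions is an illustrative stand-in for data spread across worker nodes.

```python
class RDD:
    def __init__(self, data, num_partitions=2):
        # Split the input list into roughly equal partitions.
        n = max(1, len(data) // num_partitions)
        self.partitions = [data[i:i + n] for i in range(0, len(data), n)]

    def num_partitions(self):
        return len(self.partitions)

    def compute(self, i):
        """Return an iterator over partition i."""
        return iter(self.partitions[i])

    def map(self, f):
        # Transformation: build a new RDD but do no work yet (lazy evaluation).
        return MappedRDD(self, f)

    def collect(self):
        # Action: only now are the pipelined compute methods actually run.
        return [x for i in range(self.num_partitions())
                for x in self.compute(i)]

class MappedRDD(RDD):
    def __init__(self, parent, f):
        self.parent, self.f = parent, f

    def num_partitions(self):
        return self.parent.num_partitions()

    def compute(self, i):
        # Pipeline records one at a time from the parent partition.
        return (self.f(x) for x in self.parent.compute(i))

data = RDD([1, 2, 3, 4])
doubled = data.map(lambda x: x * 2)   # nothing evaluated yet
print(doubled.collect())              # the action forces evaluation
```

Because compute returns a generator, chaining several maps pipelines records through all of them without materializing any intermediate RDD, which is exactly the benefit of lazy evaluation described above.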

EVALUATION

Setup
We conducted a study to evaluate the effectiveness of the emulation projects described in Section 2 in helping students gain key knowledge and skills on the internals of big data systems. Students in a graduate course on data management, offered in Spring 2023, participated in the study. The course is a key part of a master's program in applied data science. The course covers the data models, query languages, and query execution of relational and NoSQL databases such as MySQL, Firebase, and MongoDB, as well as large-scale data processing systems such as Hadoop and Spark.
The course consists of lectures, homework assignments, lab tasks, exams, and a course project. The theme of the project is emulating big data systems. Students could choose one of the emulation projects described in Section 2 as their course project. The project was done in phases, with the proposal due in week 4, a midterm progress report in week 8, and the final report and demo in the last week of the semester. Some of the lecture time in the first four weeks was used to introduce students to the big data systems, discuss the emulation steps, and run sample code. A guideline was also provided to the students with details on the big data systems, the emulation process, and the project requirements. Students could form project groups of up to 3 people. Multi-person groups were required to complete additional tasks, which typically involved developing a Web app that helps showcase the functions of the emulated systems.
To measure the effectiveness of the emulation projects, we constructed a self-efficacy scale [1,18] using the items in the learning objectives. As Table 1 shows, the scale has four subscales, each corresponding to a project topic: Firebase, HDFS, MapReduce, and Spark; each subscale consists of the items in the learning objectives of the corresponding project topic. Altogether, there are 20 items in the scale, with five items in each subscale. Based on the scale, we created a survey using Google Forms and administered it at the end of the semester to better understand students' capabilities in completing the tasks stated in the learning objectives.
For example, for the first learning objective of the Firebase emulation project (F1 in Table 1), the corresponding survey question asked students to rate their confidence in articulating how the RESTful API and real-time data syncing work in Firebase. Students indicated their confidence on a 5-point Likert scale, from not confident at all (with a score of 1), mostly not confident (2), 50/50 (3), mostly confident (4), to absolutely confident (5). Students were required to rate all items in all subscales/topics, including the topics that they did not work on in their projects.

The topics/subscales and their items are as follows.

Firebase
F1 Articulate how RESTful API and real-time data syncing work in Firebase
F2 Translate Firebase requests into MongoDB operations
F3 Compare and contrast data model and query language of Firebase and MongoDB
F4 Implement a RESTful server that supports CRUD operations
F5 Explain how indexing works in Firebase and MongoDB

HDFS
H1 Articulate the architecture of HDFS
H2 Explain how files are partitioned and stored in a distributed file system
H3 Develop methods to manage metadata for a distributed file system
H4 Implement a metadata server program that emulates NameNode of HDFS
H5 Develop an app that emulates how HDFS shell program processes user requests (e.g., hdfs -ls /usr/john)

MapReduce
M1 Articulate the working of Hadoop MapReduce
M2 Implement map tasks that process input files and invoke map function
M3 Develop algorithms for partitioning, shuffling, and sorting the output of map tasks
M4 Implement reduce tasks that invoke the reduce function to process data groups
M5 Develop an integrative view of HDFS and MapReduce

Spark
S1 Articulate the concept of RDD and internals of Spark
S2 Implement functions of SparkContext for generating initial RDDs (e.g., from files/Python lists)
S3 Implement functions for transformations and actions of RDDs
S4 Explain lazy evaluation and data pipelining in RDD
S5 Develop an iterator interface for data pipelining in RDD

Survey prompt: Please rate your confidence in completing the following tasks, from not confident at all, mostly not confident, 50/50, mostly confident, to absolutely confident.

Table 2: Distribution of students by group size and topic
The course had 167 students, and 160 students submitted a response to the survey. Table 2 shows the distribution of students based on group size and the project topic chosen by the group (note that MR means MapReduce). Among the 160 students who completed the survey, 102 worked in a team of 3 people, and 111 students chose Firebase emulation as their project topic. We further examined students' responses and removed students who gave the same rating (e.g., all with a score of 5) to all 20 questions. Given the varied level of difficulty of the tasks in the questions, such responses were likely given by students who did not take the survey seriously. The last row of the table shows the distribution for the cleaned data. We use the cleaned data in all the data analysis reported below.

Data Analysis
First, we computed a self-efficacy score for a student on a particular topic by taking the average of the student's confidence scores over all items in the subscale corresponding to the topic. For example, there are five items, F1 to F5 (see Table 1), in the Firebase subscale. Suppose the confidence scores of a student, John, on the five items are 5, 3, 5, 4, and 5. Then John's self-efficacy score on Firebase will be 4.4.
As Table 2 shows, 91 students (based on the cleaned data) worked on the Firebase emulation project. Figure 6.a shows the distribution of these students' self-efficacy on Firebase, while Figure 6.b shows the distribution of their self-efficacy on HDFS. We can see that students who worked on Firebase emulation in their projects tend to be more confident in completing tasks related to Firebase than HDFS-related tasks. For example, 54 students had a self-efficacy score of at least 4.6 on Firebase, while only 21 students had such high scores for HDFS. Table 3 shows the average self-efficacy scores of students working on different project topics. For each topic (e.g., Firebase), there are four average scores: one is the average score on the topic from the students who chose the topic for their projects, and the other three are the average scores of the same students on the other topics (i.e., HDFS, MapReduce, and Spark). We can see that students consistently had much higher self-efficacy on the topic they chose for their projects than on the other topics. For example, for the students working on Firebase, the average self-efficacy on Firebase is 4.49, while the self-efficacy for all other topics was around 3.7.
Denote the topic that students worked on for their projects as T_p, another topic as T_o, and a student's self-efficacy score on a topic T as s(T). We further conducted Wilcoxon's signed rank tests to compare the distributions of s(T_p) and s(T_o). The null hypothesis is that the two distributions are statistically similar, that is, the probability of s(T_p) > s(T_o) is the same as the probability of s(T_o) > s(T_p). The alternative hypothesis is that it is more likely that s(T_p) will be greater than s(T_o). Wilcoxon's test computes a statistic based on the signed ranks of |s(T_p) - s(T_o)|. Table 4 shows the p-values of the statistics of these tests for different project topics and alternative hypotheses. For example, when T_p is Firebase and T_o is HDFS, the p-value is 1.42E-12.
At a significance level of .05, we can reject the null hypothesis for all topics, except when T_p = MapReduce and T_o = Firebase or HDFS (highlighted in bold font). We examined the data and noted that among the six students working on MapReduce emulation, three had higher self-efficacy on MapReduce than on Firebase (the differences are .2, 1, and 1.6, respectively), and the other three had the same self-efficacy on the two topics. The case for T_o = HDFS is very similar. We expect that the results might become more decisive if more students worked on the MapReduce project. We will further discuss this in Section 3.3.
Overall, the results suggest that the emulation projects can greatly improve students' self-efficacy in completing challenging tasks that require in-depth knowledge of big data systems.

Discussions
First, course adjustment: we note that 68% (115 out of 167) of the students chose to work on Firebase emulation. This is likely because Firebase was covered in depth in week 2 of the semester, as an example of a cloud-hosted NoSQL database. Hence, students were more familiar with the subject when the proposal was due in week 4. Furthermore, there were fewer students working on MapReduce and Spark emulation. As mentioned earlier, we introduced these systems to the students before the proposal was due. However, many students might still find it challenging to explore in their projects some of the advanced concepts and techniques used in these systems. Both subjects were covered in depth in the second half of the semester.
One solution is to adjust the course schedule to focus on big data first, while the coverage of relational databases (data models, SQL, views, indexing, query execution, etc.) may be moved to the latter part of the course. Furthermore, students might be given more time to work on their project proposals (e.g., the proposal may be due in the middle, instead of at the beginning, of the semester).
Second, statistical tests: although the numbers of students working on the MapReduce and Spark projects were relatively small, they do meet the minimum number of cases for Wilcoxon's tests. For example, according to [5,15], there should be at least 5 or 6 pairs for conducting Wilcoxon's tests on paired samples.
Third, project scores: the project accounts for 20% of the course grade, and most of the students took the project seriously, making sure that they completed the required emulation. Except for a handful of students (who might have had glitches in their implementations), about 90% of the groups received perfect scores for their projects. As a result, it might be difficult to gauge the effectiveness of the projects from students' project grades alone. Nevertheless, almost all groups stated in their final reports that the projects had greatly helped them gain key system and development skills.

CONCLUSIONS
We have proposed a novel emulation-based approach to learning the internals of big data systems and presented the design of four emulation projects, covering major NoSQL databases and distributed and parallel data systems. We have also evaluated the effectiveness of the projects in helping students acquire advanced knowledge and skills on big data systems. The results show that the projects can greatly improve students' confidence in completing challenging tasks that require in-depth knowledge of big data systems.
In addition to course projects, the emulation projects described in Section 2 may also be used in other forms.For example, learning modules may be created to cover the design of the emulation systems in depth, and the tasks in the projects may be transformed into a series of progressive programming assignments.Furthermore, a new course on "Internals of Big Data Systems" may be created by incorporating the above learning modules and assignments.

Figure 2: A Flask Web server that catches all PUT requests

Figure 4: (a) Hadoop MapReduce framework; (b) a sample implementation of MapTask. The MapTask class in Figure 4.b has an instance of a record reader, an output collector, and a mapper. The mapper has a user-defined map function. When the task runs, it uses the record reader to generate key-value pairs from an input file and, for each pair, calls the mapper's map function and collects the output in a file, e.g., "/tmp/output1" in Figure 4.b.

Figure 6: Comparing the self-efficacy of students working on the Firebase emulation project

Table 1: Project topics, learning objectives, and the corresponding self-efficacy scale

Table 3: Self-efficacy of students working on different topics

Table 4: Wilcoxon tests for varied project topics and hypotheses