Integration of Metaverse Recordings in Multimedia Information Retrieval

This paper addresses the challenges of integrating Metaverse Recordings into Multimedia Information Retrieval as a new type of multimedia. Specifically, we describe the characteristics of video content and explain the key differences between metaverse recordings and other videos. We present a specialized process to analyze and index Metaverse Recordings. We demonstrate that existing techniques can be used for metaverse-specific retrieval, but need content-specific adaptations. Furthermore, we evaluate a prototypical implementation based on the proposed methods with respect to its impact on the retrieval process.


INTRODUCTION
The metaverse will be a multimedia generator, and Multimedia Information Retrieval (MMIR) [31] should support it [35]. The concept of the metaverse is a persistent multi-user environment consisting of virtual worlds which mix physicality and digitality [24]. It has increasingly attracted scientific and social interest because of its multifaceted applications [12], [23], [15]. As evidenced by increasing adoption rates [21], along with the advancement and proliferation [5] of Virtual Reality (VR) [4] and Augmented Reality (AR) [1] devices such as Meta Quest [45] and Apple Vision Pro [43], there is palpable momentum surrounding this digital realm. As individuals engage with the metaverse, they capture and archive their unique experiences [8]. We define such multimedia as Metaverse Recordings (MVR): a documentation of a user's session within the metaverse over time, which can manifest in various formats, such as videos of the media they perceive, but also as a technical representation in the form of, e.g., scene graphs.
MVRs present a new type of multimedia, which can be added to existing collections such as personal memories and social media platforms. The role of MMIR in this field is to retrieve and access elements of such collections. This paper addresses the challenges of integrating MVRs into MMIR as a new type of multimedia.
We employ the method of Nunamaker [28], which organizes the research process into the phases of Observation, Theory Building, Systems Development, and Experimentation. The remainder of this paper is structured according to these phases. Section 2 describes our observations. In Section 3 we present our theory building (i.e., the modeling), and in Section 4 the corresponding implementation is given. Section 5 presents the evaluation of the models.

STATE OF THE ART
This section provides an overview of two areas: the existing landscape of metaverse technologies specifically related to MVR, and the integration of MVR in MMIR. Initially, we survey current adoption and key technologies behind MVR. This is followed by a brief introduction to MMIR and a summary of relevant work concerning our approach to integrating MVR and MMIR.

Metaverse
The metaverse "is an interconnected web of social, networked immersive environments in persistent multiuser platforms" [24].The idea emerged from the science-fiction book [37] "Snow Crash".Over time, various instances have been created, all limited to a certain extent by existing technology and mostly not interconnected.Among the instances that have prevailed in recent years [23] are Minecraft [22], Roblox [3], Spatial [33], Decentraland [11], or Horizon Worlds [20].They provide immersive virtual worlds based on 3D technology similar to massive multiplayer online games (MMOGs) [2].The metaverse visitor, represented by an avatar, can move through the world from a first-or third-person perspective.All such examples are built on a virtual world which is shared between different users represented avatars.Some virtual worlds, such as Horizon Worlds, are accessible only by VR devices.
MMOGs, whether 3D or VR, share the same principles in visualization, navigation, 3D models, textures, and logical structures, such as scene graphs [36]. The rendering engine uses the inputs to render a continuous stream of video and audio, along with optional outputs, such as controller commands for force feedback. Thus, for recording MVRs, both the input and the output are important points. Capturing the audiovisual output stream as a screen recording is common. The technical information of the input is also interesting, but probably not easy to collect. Non-audiovisual output, such as log files, may provide relevant information. Log files have been used successfully as a source of semantic information [38], [26]. For MMIR, it may be more efficient to obtain features from log files than via feature extraction from screen recordings. In conclusion, MVRs can be captured by using different inputs and/or outputs of the rendering process.

MMIR Introduction
MMIR, and Information Retrieval (IR) [19] in general, aim at retrieving information from a larger collection of information or multimedia and providing access to the elements of the collection.
The retrieval activity is driven by a gap in a human's knowledge, described by Belkin [7] as an Anomalous State of Knowledge (ASK). From this state originates an information need, which is the key driver for using an IR system: the user expects the system to provide the information that closes the gap.
The state-of-the-art approach in such systems is grounded in the principles of semantic understanding of the multimedia content in a collection. At its core is in-depth semantic analysis, i.e., employing feature extraction, to create a comprehensive index. The information need can be formulated as a query, which is processed in the MMIR system by comparing the query to the index and computing a result set in the form of ranked lists or exact matches. These results are presented to the users and either satisfy their information need or provide a basis for further refinement of the query.
IR systems have been created for different domains, such as image, video, and 3D retrieval.MMIR addresses the requirements of the mix of different media types.
As described, MVRs can be made by recording the rendering output displayed on a screen, resulting in a video. Hence, Video Retrieval can process this type of media, but, in our observations, we see that methods to analyze video lack support for concepts specific to virtual worlds. For example, common scene boundary detection is based on shots, whereas MVRs are continuous. Many object recognition algorithms, such as face, celebrity, or place detection, do not work with MVRs, because the content is computer generated, often displaying lower-resolution, simplified graphics that are far from photorealistic. Furthermore, in virtual worlds, avatars of any kind subsume the role of real-world people.
Beyond the lack of analysis methods, support for this new media type is missing in other respects: MVRs can also contain rendering inputs and further outputs, which current MMIR systems do not support. There is also a lack of playback options for MVR formats beyond plain video. However, some MMIR systems are adaptable through extensions, which we present next.

Multimedia Feature Graphs and Graph Codes
We presented the Generic Multimedia Analysis Framework (GMAF) [41], [39], [42] as a versatile MMIR system. The indexing structure of GMAF is the Multimedia Feature Graph (MMFG), an efficient method for representing features detected in multimedia analysis.
MMFGs are based on a directed acyclic graph, where nodes represent the detected features and edges represent the typed relationships between nodes. The MMFG can be converted into Graph Codes (GCs) employing the type-coded adjacency matrix introduced in [40]. GCs are very efficient for calculating the similarity score of assets in the retrieval process. The similarity score is based on three metrics, the feature metric, the feature-relationship metric, and the feature-relationship-type metric, which provide the basis for the calculations in our modeling.
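To make the three metrics concrete, the following minimal Python sketch compares two toy feature graphs. The set-overlap formulas are our simplification for illustration, not the exact Graph Code encoding of [40]:

    # Illustrative sketch of Graph Code comparison (not the exact formulas of [40]).
    # An MMFG is abstracted to its node terms and typed edges to show the idea
    # behind the three metrics.

    def gc_metrics(nodes_a, nodes_b, edges_a, edges_b):
        """Return (feature, feature-relationship, feature-relationship-type) scores."""
        m_f = len(nodes_a & nodes_b) / max(len(nodes_a | nodes_b), 1)   # feature metric
        rel_a = {(s, t) for s, t, _ in edges_a}   # edges as (source, target, type)
        rel_b = {(s, t) for s, t, _ in edges_b}
        m_fr = len(rel_a & rel_b) / max(len(rel_a | rel_b), 1)          # relationship metric
        m_frt = len(edges_a & edges_b) / max(len(edges_a | edges_b), 1) # typed metric
        return m_f, m_fr, m_frt

    # Example: two tiny feature graphs
    a = ({"avatar", "fish"}, {("avatar", "fish", "holds")})
    b = ({"avatar", "rod"},  {("avatar", "rod", "holds")})
    print(gc_metrics(a[0], b[0], a[1], b[1]))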
With this understanding of the metaverse and MMIR, we present our modeling work.

MODELING
In this section, we explore the intricacies of incorporating MVR into existing MMIR paradigms. We start by identifying the idiosyncratic characteristics of MVR compared to traditional media types. Following this, we model the modifications and additions needed for successful integration into MMIR systems. The section culminates in the introduction of a generalized solution. The methodologies applied in this section are User-Centered System Design [27] and the Unified Modeling Language [29].

Differences of Metaverse Recordings as multimedia type
To integrate the metaverse into MMIR, we first outline the distinguishing characteristics of MVRs.
The hypothesis is that MVRs are a separate type of multimedia. Multimedia objects can be of different types, such as text, image, audio, or video. Assuming that a multimedia object $M$ is described as $M = (C, S, F)$, where $C$ describes the content, $S$ describes the structure of the content (e.g., sections or chapters), and $F$ is the format (e.g., video, audio), we can describe the differences in the following sections.

Content Differences
MVRs share stylistic elements with other video formats, such as animated films and live sports broadcasts, but remain distinct. Specifically, while they resemble animated films in visual style, their shot composition diverges. They are closer to live sports by genre, offering a continuous, action-camera viewpoint with minimal cuts and graphic overlays. Thus, despite certain similarities, MVRs as a whole have a unique combination of features. We can describe the major content differences as follows: "Video is a structured medium in which actions and events, in a time and space, comprise stories or convey particular visual information." [47]. The same definition applies to an MVR; the difference is in the type of actions and events. Let $A$ represent the set of all actions of a video, $E$ the set of all events of a video, $T$ the set of all points in time, and $P$ the set of all points in space. Then, the content $C$ can be defined as $C = (A, E, T, P)$. Considering the video content of a movie $C_{movie}$ and an MVR $C_{MVR}$, this definition is the same for both. The difference is revealed in the components of the actions. The content of videos differs from type to type, e.g., between a movie and a live soccer game. Hence, a different set of actions is expected: a soccer match has a kickoff, goals, and shots, but no fishing or car driving. Even if other types of action can happen, usually they would not be included in a describing set of actions and events. Hence, different actions and events are expected in an MVR.
There is no action without objects, and the objects in the metaverse are different. The set of actions, including objects, and events, as well as the mapping to time and space, can be described based on concepts in ontologies, such as OntoMedia [18], [9], LSCOM [25], VideoAnnotation [16], or the Timeline Ontology [30].
An illustration of the disparity in ontology can be seen when defining an individual in the real world versus the metaverse. In the real world, an individual human can be described as a person with head, torso, and limbs: $Person = \{head, torso, arms, legs\}$. In a metaverse, an individual human is described as an avatar, usually shown with a name tag above a character, which can be of any form, e.g., a 3D or 2D model or even just a line of text: $Avatar = \{name\ tag, model\}$. In LSCOM, the concept for an individual is a perceptual agent of type Person or Sentient Animal. For MVRs, this concept needs to be extended by the described avatars. This example illustrates the differences in the content of video types. Hence, the distinctiveness between video formats such as movies, news, and MVRs lies in their underlying concepts of objects, actions, and events, so that $C_{movie} \neq C_{news} \neq C_{MVR}$. Within the metaverse, the diversity of content extends from simplified to realistic visuals, varying in their Level of Detail (LoD). This complexity is further amplified by the introduction of new and different concepts unique to the virtual world.
In terms of perspective, the metaverse differs from traditional media like movies, news, or recordings of physical games such as soccer matches, which usually employ either a first- or third-person viewpoint. In the metaverse, the perspective is more comparable to that found in life logs or action camera recordings, but with the added layer of unique metaverse-specific objects and structures.
Taking all of these factors into account, it can be concluded that MVRs constitute a distinct category of video-based multimedia, separate from existing categories like movies, animated films, or news shows.

Structural differences
Video content is generally organized in some kind of structure. Soccer games follow the structure of "pre-show," "first half," "halftime," "second half," and "after-show." News consists of repetitive blocks with newscasters, followed by reportage, commentary, or interviews. Movies have opening credits, followed by several scenes, and conclude with closing credits.
Technically, the structure $S$ of a movie can be described as $S_{movie} = (opening, scene_1, \ldots, scene_n, closing)$. MVR structures $S_{MVR}$ are different: a continuous stream of arbitrary actions (e.g., playing a game or having a conversation) and events (e.g., a teleport or a message notification), all from the same camera angle. Hence, MVRs have a weak structure, and a generic approach to $S_{MVR}$ is limited to frames and segments, e.g., $S_{MVR} = (segment_1, \ldots, segment_n)$, where each segment is a sequence of frames.
In conclusion, the structure of an MVR is weak and limited.

Format Differences
Finally, we analyzed the format of MVRs. The key difference is that MVRs are digitally created, while most video data is produced by digitally capturing real-world actions and events. For a semantic understanding of video captured by filming, the methods are limited to image analysis. This is done mostly by feature extraction, while a few approaches also take sensory data such as GPS coordinates into account. However, image analysis depends on the quality of the input. The metaverse with its virtual worlds is, at least partly, a digital space, where theoretically anything can be recorded as an MVR. Such information could make content analysis as simple as a transformation into the index structure. Instead of laboriously detecting an avatar on a keyframe of a video, a log file of scene information generated by the metaverse system can simply hold the information needed.
Practically, it is difficult to store all data: rendering data is complex and large in volume. Ideally, one could store everything and revisit a scene as an observer in 3D space and time. A first approach is to store the partial data that is valuable for content analysis. Data from the rendering process is what we define as Scene Raw Data (SRD). Data from I/O devices we define as Peripheral Data (PD). Alternatively, the recording of audio/video produces a Multimedia Content Object (MMCO). The combination of these data results in a Composite Multimedia Content Object (CMMCO). If, furthermore, a common timecode is available, we define it as a time-mapped CMMCO (TCMMCO). From this it follows that an asset $a$ can be of type $a \in \{SRD, PD, MMCO, CMMCO, TCMMCO\}$. Based on these definitions, the content analysis process can be analyzed in more detail. Furthermore, it is a task to identify how to record the data. The idea of having all information in log data in a well-structured way is advantageous, but current rendering engines do not provide such logs, and a standard is lacking. At the time of writing, screen recordings are the most realistic type of recording.
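The following sketch captures these data types; the field layout is our assumption for illustration:

    from dataclasses import dataclass, field
    from enum import Enum
    from typing import List, Optional

    class AssetType(Enum):
        SRD = "scene raw data"              # data from the rendering process
        PD = "peripheral data"              # data from I/O devices
        MMCO = "multimedia content object"  # plain audio/video recording
        CMMCO = "composite MMCO"            # combination of the above
        TCMMCO = "time-mapped CMMCO"        # combination with a common timecode

    @dataclass
    class MVRAsset:
        kind: AssetType
        components: List["MVRAsset"] = field(default_factory=list)  # for (T)CMMCO
        timecode_base: Optional[float] = None  # shared clock; set only for TCMMCO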
However, the fact that the metaverse is a technical space makes optimized recording formats possible and influences the content analysis methods, because identifying segments in a textual log file with a regular expression is different from identifying visual patterns in a series of frames.
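As an illustration of the first case, the following sketch extracts segment boundaries from a log with a single regular expression. Since no logging standard exists, the line format below is a hypothetical assumption:

    import re

    # Hypothetical log format -- no logging standard exists, so the line
    # layout below is an assumption for illustration only.
    LOG = """\
    12.3 EVENT teleport world=BobberBay
    47.9 EVENT world_load world=KillerGame
    95.0 EVENT teleport world=KillerGame
    """

    # Segment boundaries: teleport or world-load events, found with one regex
    # instead of visual pattern matching over frames.
    boundaries = [
        float(m.group(1))
        for m in re.finditer(r"^\s*(\d+(?:\.\d+)?) EVENT (?:teleport|world_load)\b", LOG, re.M)
    ]
    print(boundaries)  # [12.3, 47.9, 95.0]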

Definitions
From the differences in content, structure, and format described above, it can be concluded that MVRs are a new type of media. Assuming multimedia is described as $M = (C, S, F)$, the difference between MVRs and, e.g., movies can formally be represented as $M_{movie} = (C_{movie}, S_{movie}, F_{movie})$ and $M_{MVR} = (C_{MVR}, S_{MVR}, F_{MVR})$. For the calculation of similarity or differences, metrics can be defined. Let the content descriptor be $d_C$, for movies $d_{C,movie}$ and for MVRs $d_{C,MVR}$, assigning a value of 0 for person and 1 for avatar for a given content: for movies, $d_{C,movie}$ will typically be close to 0, while for MVRs, $d_{C,MVR}$ leans towards 1.
Much like the content descriptor $d_C$, a structure descriptor $d_S$ can be articulated based on the frequency of scene changes per hour. Similarly, a format descriptor $d_F$ can be characterized according to the complexity of the video format, assigning a value of 0 for plain video and 1 for complex video.
Taking into account the tuple $(d_C, d_S, d_F)$, it can be observed that movies typically register values in the vicinity of $(0.1, 50, 0)$, while MVRs exhibit a different distribution, approximately $(0.9, 5, 1)$. This disparate allocation of values effectively distinguishes the two mediums, thereby substantiating the classification of MVRs as a novel form of multimedia. In summary, MVRs differ significantly from other video formats. Based on this understanding, we discuss the impact on MMIR when integrating MVRs.
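To illustrate how the descriptor tuple separates the two media types, the following sketch classifies an asset by its distance to the example values above. Scaling $d_S$ into a comparable range is our assumption, not part of the model:

    # Toy nearest-centroid classifier over the (d_C, d_S, d_F) descriptors.
    # d_S (scene changes per hour) is scaled by 1/100 so all components lie
    # in a comparable range -- the scaling is an assumption.
    CENTROIDS = {
        "movie": (0.1, 0.50, 0.0),   # (d_C, d_S / 100, d_F)
        "mvr":   (0.9, 0.05, 1.0),
    }

    def classify(d_c, scene_changes_per_hour, d_f):
        x = (d_c, scene_changes_per_hour / 100, d_f)
        dist = lambda c: sum((a - b) ** 2 for a, b in zip(x, c))  # squared Euclidean
        return min(CENTROIDS, key=lambda name: dist(CENTROIDS[name]))

    print(classify(0.85, 4, 1.0))  # -> 'mvr'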

Challenges for MMIR to support MVR
MVRs bring several challenges to MMIR. Differences in content, structure, and format require adaptations in the MMIR process. The process in MMIR systems starts with content analysis, followed by indexing and querying. Since we work user-centered, we proceed in the opposite direction, explaining the differences between common MMIR and MVR-specific MMIR from the user's perspective, starting with querying, followed by indexing, content analysis, and formats.

Query. The ASK describes the information need of a person to achieve a certain task and can be formulated as a query. To understand the information need for MVR-specific MMIR, we modeled three exemplary use contexts (UCs) to discuss and illustrate our further modeling work.
• UC example 1 - Personal Memories: A person has recorded his visits to the metaverse. The person wants to remember and review the activities.
• UC example 2 - Entertainment: People record metaverse content for entertainment, e.g., as a video log [46] or a commentary Let's Play [44]. A viewer searches for content with keywords.
• UC example 3 - Research: Researchers want to analyze human behavior, e.g., study the use of explicit language, and therefore search for relevant metaverse recordings.
The information need in the MVR-specific UC examples is similar to that in non-MVR-specific UCs: if metaverse recordings are replaced by videos, each example remains valid. The differences lie in the details.
The query examples show that existing query methods can be used for MVRs. But they differ in concepts and ontologies, which is consistent with the substantial differences in the content of MVRs. The different concepts of objects in MVRs require an adapted vocabulary in the query. To describe the concepts, ontologies can be used for both analysis and querying.
We conclude that MVR queries are specific to the metaverse. If the information need is expressed as a query $q$, MMIR can be described as a function of a query $q$ applied to all assets $a$ in a collection. If $Q$ is the space of all possible queries $q$ and $A$ is the space of all assets $a$, then $MMIR(q, A) = L(A)$, where $L(A)$ is a sorted list of assets $a$.
Hence, $Q' \subseteq Q$ is the space of metaverse-specific queries $q'$ and $A' \subseteq A$ is the space of all metaverse-specific assets $a'$. The metaverse-specific function is defined as $MMIR_{MVR}(q', A') = L(A')$. In MMIR, the relevance function $r(q', a')$ describes the subjective distance of a result $a' \in A'$ to the user's information need, or query $q'$. The elements in the sorted list are arranged so that the relevance values are in descending order, i.e., $r(q', a'_i) \geq r(q', a'_{i+1})$.
The longer the duration of a recording, the less relevant the whole video may be for the user: either because users might not be interested in the whole video, or because the relevance becomes too unspecific due to too much data. A long video contains much content that matches the query, so the relevance (or similarity) metric is high, yet for the different UCs the relevance of the whole recording is inaccurate. As seen before, MVRs have a weak substructure. Hence, metaverse content differs from traditional multimedia content processed by MMIR. However, the single objective of the user is to find the relevant part $p$, not the whole asset. Structural analysis should therefore improve the relevance of the part compared to the relevance of the whole asset: $r(q', p') \geq r(q', a')$. In summary, metaverse-specific queries can use existing techniques, but are content-specific, because metaverse content has different structures and concepts, which result in a metaverse-specific vocabulary and ontology.
Index. With different concepts and structures, indexing needs to be adapted. For videos, important factors are times and events, which should be reflected in indexes. Graph-based indexes are common and perform very well, but the concept of time is difficult to integrate.
Let a graph $G$ consist of nodes $N$ and edges $E$: $G = (N, E)$. A simple integration of time is to store it in the nodes. A node can integrate time and space by $n = (t, s) \in N$, where $t$ represents the start and duration of the appearance and $s$ represents the location in the virtual world.
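A minimal sketch of such a time- and space-aware node (field names are illustrative):

    from dataclasses import dataclass
    from typing import Tuple

    # Graph node carrying time and space, following n = (t, s):
    # t is (start, duration) of the appearance, s a position in the virtual world.
    @dataclass(frozen=True)
    class FeatureNode:
        label: str                      # detected feature, e.g. "avatar"
        t: Tuple[float, float]          # (start_seconds, duration_seconds)
        s: Tuple[float, float, float]   # (x, y, z) location in the virtual world

    node = FeatureNode("avatar", t=(12.0, 4.5), s=(3.0, 0.0, -7.2))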
However, this representation is not specific to metaverse content and can be used for other cases too. Hence, we can state that indexing structures are not specific to metaverse content; only the queries work in a different way. Therefore, index data structures such as the MMFG can be used for MVRs.

Content Analysis
With the previously described differences in the content and structure of metaverse multimedia, adapted content analysis is needed.
Regarding structure, an asset $a'$ can be divided into parts $p$, i.e., $a' = \{p_1, \ldots, p_n\}$. Different types of parts are possible. Mutually exclusive segments give a clear structure and fit into the index graph.
To identify metaverse-specific concepts, such as the avatar, in MVRs, feature extraction needs to be adapted or even developed specifically. This will be outlined in more detail in our future work. In the context of this paper, it is important to highlight that the MMIR mechanisms for content representation are also valid for MVRs, but feature extraction needs to be enhanced in an MVR-specific way.
Summarizing this, the differences between general MMIR and MVR-specific MMIR are primarily rooted in the unique content concepts they each incorporate.While this alone may not set MVRs apart from other video-based media types, such as movies, news shows, or soccer games, the additional layer of differentiation emerges from the different formats that MVRs can take.Together, these factors contribute to the classification of MVR as a distinct and novel content type within the broader MMIR landscape.
For MVR retrieval, we aim at presenting the most relevant parts of the MVR to the users by ordering the results based on their similarity to the query. Structure analysis improves relevance; content-based analysis improves similarity metrics. Next, we present our approach to integrating the new multimedia type MVR into MMIR in the order of processing, starting with format, followed by content analysis, indexing, and querying.
The first step in the process, MVR Structure Analysis, defines the segments in the MMCO in order to optimize relevance, as described above. The sum of all segments is the MVR.
The second step, MVR Feature Extraction, employs multiple feature extraction methods to reconstruct the features of the MVR. It is expected that this step only provides basic features, which need to be refined in later process steps. Extracted features $f$ are defined as basic features derived from the input formats.
The process step MVR Data Mining uses the extracted features to find structural patterns of MVR contents, such as behavior patterns of moving objects, content characteristics of a scene, event patterns and their associations, and other semantic knowledge, in order to enable intelligent applications such as MMIR. A simple example: feature extraction detects a fish and a fishing rod, which data mining translates into the activity of fishing. This results in new features or the elimination of features. MVR Data Mining is expected to be rule-based and ideally improves similarity through better semantic features. Mathematically, MVR Data Mining can be described as a function applied to feature vectors containing the features detected by MVR Feature Extraction, e.g., $f_1$ and $f_2$, resulting in a new feature vector $f'$.
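A minimal rule-based sketch of this step, with an illustrative rule set and feature names:

    # Rule-based sketch of the fishing example: feature extraction yields
    # "fish" and "fishing rod" in a segment, and a rule derives "fishing".
    RULES = [
        ({"fish", "fishing rod"}, "fishing"),
        ({"ball", "goal"}, "soccer"),
    ]

    def mine(features: set) -> set:
        derived = {activity for required, activity in RULES if required <= features}
        return features | derived

    print(mine({"fish", "fishing rod", "avatar"}))
    # {'fish', 'fishing rod', 'avatar', 'fishing'}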
MVR Feature Fusion combines multiple classification results into a single one. The process can be based on rules such as a confidence threshold or a feature mapping. MVR Feature Fusion improves similarity by processing, e.g., keeping only significant features. Formally, it can be described as a function $fusion(f_1, \ldots, f_n) = f_{fused}$; a small sketch follows below. The final step is MVR Indexing, which simply stores the results in the index in an MMIR-typical way.
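A sketch of a confidence-threshold fusion rule, as referenced above; the threshold value of 0.8 is an assumed example:

    # Confidence-threshold fusion: classifier outputs for the same segment are
    # merged, keeping the best score per label and dropping anything below the
    # threshold.
    def fuse(results: list, threshold: float = 0.8) -> dict:
        fused: dict = {}
        for classifier_output in results:
            for label, confidence in classifier_output.items():
                fused[label] = max(fused.get(label, 0.0), confidence)
        return {l: c for l, c in fused.items() if c >= threshold}

    print(fuse([{"shark": 0.9, "fish": 0.6}, {"fish": 0.85, "boat": 0.4}]))
    # {'shark': 0.9, 'fish': 0.85}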

Integration of PFMR into an MMIR system
To test PFMR, we selected GMAF as the MMIR system in which to adopt it. GMAF is a flexible architecture for MMIR and has extension points, which are used for the integration of PFMR. GMAF is based on the MMFG as indexing structure; the multimedia analysis results in an MMFG for each multimedia content object. An essential part is the extension point of plugins, which perform feature extraction and produce an MMFG. Multiple plugins result in multiple MMFGs, which are fused by a feature fusion strategy, the second extension point. A third extension point is the workflow mechanism, which allows the creation of different workflows: a chain of plugins and feature fusion strategies along with basic operations. Figure 2 shows how the extension points are used in PFMR. The whole process is built as a workflow. The structural analysis is covered by plugins, which produce features that can be fused before the next processing step. Basic operations of GMAF flows can be used to split media for further analysis steps. Next, the MVR Feature Extraction is done by GMAF plugins, which also results in multiple MMFGs that may need to be fused. MVR Data Mining is also a plugin. Finally, the MVR Feature Fusion step applies a GMAF Feature Fusion Strategy to create the final MMFG.
To represent the detected features, multiple scenes can be attached to the MMFG root, as shown in Figure 3. The detected features are attached to the root node as child nodes.
Next, we implement a concrete workflow based on an example to demonstrate the PFMR integration into GMAF and later evaluate it. The example is based on the previously described ASKs of UC examples 1, 2, and 3 (Personal Memories, Entertainment, and Behavioral Research). For a deeper analysis, we use a segmentation of elements in the recordings. Furthermore, we perform label detection and voice/speech analysis. Our implementation follows the PFMR and implements a GMAF flow with two steps, visualized in Figure 4: first an analysis of the video of three kinds (segment, label, and speech), followed by a fusion without further MVR Data Mining. After the fusion, the data is handed over to the GMAF index.
Based on the modeled integration of PFMR in GMAF, we implemented the required plugins and a fusion strategy. We created three GMAF plugins to perform feature extraction, connected to a fusion plugin (see Figure 5). The plugins are based on AWS Rekognition [6], and their results are used in a later stage by the time-mapping fusion. The fusion strategy takes all the detected segments, searches for features detected within the TimeRange of the segment, and adds them as child nodes.
The query execution needed to be modified, because GMAF does not support similarity search on subgraphs. Hence, the similarity calculation is adapted to create subgraphs of scenes and perform the similarity calculation on them. The process is described in the following pseudocode:

    for each asset:
        split the MMFG into scenes
        for each scene: create an MMFG and a Graph Code
    for each Graph Code: calculate the similarity to the query
    order the ranking by relevance, so that S_scene > S_recording

This procedure requires a higher number of graphs for the similarity calculation. Assuming 5 segments per minute, it results in 25 times more graphs for a five-minute video.
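The following runnable sketch mirrors this pseudocode. Graph Codes are reduced to term sets, and gc_similarity() stands in for the three GC metrics; asset and scene names are illustrative:

    # Sketch of the adapted query execution: one Graph Code per scene subgraph,
    # similarity computed per scene, ranking ordered by descending relevance.
    def gc_similarity(query_gc, scene_gc):
        return len(query_gc & scene_gc) / max(len(query_gc), 1)

    def query_scenes(query_terms, collection):
        query_gc = frozenset(query_terms)
        ranked = []
        for asset, scenes in collection.items():        # split MMFG into scenes
            for scene, features in scenes.items():      # one GC per scene subgraph
                score = gc_similarity(query_gc, frozenset(features))
                ranked.append((score, f"{asset}/{scene}"))
        ranked.sort(reverse=True)                       # order rank by relevance
        return ranked

    collection = {"13_Video_18.mp4": {"shot18": ["shark", "avatar"],
                                      "shot03": ["tree", "water"]}}
    print(query_scenes(["shark"], collection))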
With this implementation, we could see that queries can be fulfilled.

EVALUATION
The evaluation presented in this section focuses on the implementation of PFMR and its implications for the retrieval process. We conducted two experiments, evaluating the implementation of PFMR and showing that queries work with correct similarity, without a severe impact on the efficiency of the similarity calculation caused by the segmentation in the query process.
The evaluation is based on our own screen recordings of sessions in Meta Horizon Worlds [20]. The videos were annotated by us, describing activities over time periods.

Indexing
We evaluated the segmentation of MVRs with existing tools. Horizon Worlds essentially uses two types of transition events: loading a new world and resetting/moving the position. We regard these as possible segment boundaries. The first event is an animation; the second event is a positional reset indicated by black frames. We applied ffmpeg [13] black frame detection with a threshold (trs) of 0.05 and a minimum duration of black frames (min) of 0.01 as a baseline, which results in a confidence of 1. We used the publicly available segmentation service AWS Rekognition for comparison. Rekognition is trained for movie and TV series segmentation, which differs from MVRs as explained. We worked with standard settings, a minimum confidence of 0.5, and used an interval of 0.5 to 0.9 in steps of 0.1 to calculate the Average Precision (AP).
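The baseline can plausibly be reproduced with ffmpeg's blackdetect filter; mapping the trs and min settings onto the filter options pix_th and d is one possible reading, and the file name is illustrative:

    import re
    import subprocess

    # Black frame detection via ffmpeg's blackdetect filter: pixel threshold
    # 0.05, minimum black duration 0.01 s (our mapping of trs/min).
    cmd = [
        "ffmpeg", "-i", "mvr_session.mp4",
        "-vf", "blackdetect=d=0.01:pix_th=0.05",
        "-an", "-f", "null", "-",
    ]
    run = subprocess.run(cmd, capture_output=True, text=True)

    # blackdetect reports boundaries on stderr, e.g. "black_start:12.3 black_end:12.5"
    starts = re.findall(r"black_start:(\d+(?:\.\d+)?)", run.stderr)
    print([float(s) for s in starts])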
Based on the annotations, we evaluated different kinds of actions: entering a building, interacting with an object, teleporting, a sound being played, and a message from the system (such as a notification); these produce the best AP (see also Table 1).
In the analysis of the videos, it was noticed that the accuracy of the scene segmentation varies a lot due to the content and technical implementation.For example, in a world where a shooter is played, a teleport is performed after each player's death, which is implemented with black frames and makes the segments very easy to recognize.In another world, which is about fishing, segments hardly contain any teleports and thus the scenes are very hard to recognize.

Table 1: AP, Recall (trs 0.5), and Recall@3s (trs 0.5) per segmentation method.

Further experiments with object detection based on AWS Rekognition were carried out to support semantic understanding in the MVR Feature Extraction phase. With the MVRs of the virtual world Bobber Bay Fishing, we evaluated the detection of rods and fish. The experiments failed because AWS Rekognition could not detect the fishing rod. The only type of fish recognized, although with poor accuracy, was the shark. To detect the fishing rod, we experimented with alternative tools, such as YOLO [17] trained on the Objects365 [32] dataset, which contains the class fishing rod. The detection accuracy was also too low to be used. However, these examples show that multiple feature extraction techniques are required in the process of building a semantic understanding of MVR content. In this example, the fusion of multiple extracted features in the MVR Feature Fusion phase is necessary to join the feature sets of different tools. The MVR Data Mining phase is required to determine whether a detected fishing rod and a detected fish in the same segment are recognized as a fishing activity. Overall, these experiments support the relevance of PFMR in MVR-specific MMIR.
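A sketch of such an object detection run, assuming the Ultralytics YOLO package; the Objects365-trained weights file and the keyframe name are hypothetical:

    from ultralytics import YOLO  # assumes the Ultralytics YOLO package

    # Run a YOLO model over an extracted keyframe and keep detections above a
    # confidence floor. The weights file (a model trained on Objects365, which
    # contains the class "Fishing Rod") is assumed to be available locally.
    model = YOLO("yolo_objects365.pt")           # hypothetical weights file
    results = model("keyframe_0042.jpg", conf=0.25)

    for r in results:
        for box in r.boxes:
            label = model.names[int(box.cls)]
            print(label, float(box.conf))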

Activity-Related Experiments
In UC example 1, a user wants to review a personal memory, e.g., where he was fishing for sharks. We prepared several MVRs for this example. A query with the keyword shark should retrieve three files. Furthermore, with the splitting of scenes based on PFMR, the corresponding scene should appear in the result list with the same similarity score as the corresponding total video.
Because of the lack of proper recognition, we formulated an artificial query. The similarity ranking for the keyword query "shark, shot18" was expected to match with the highest similarity score, which was the case.
The result is:

    MMFG (13_Video_18.mp4 / Subgraph-shot18-13_Video_18.mp4), Similarity: 1.0, 1.0, 1.0
    MMFG (13_Video_18.mp4), Similarity: 1.0, 0.5, 1.0
    MMFG (13_Video_17.mp4), Similarity: 1.0, 0.0, 0.0
    MMFG (killergame.mp4), Similarity: 0.5, 0.0, 0.0

In terms of the impact on the runtime, we evaluated the produced MVRs. The video length is 5 minutes on average, which resulted in an average of 90 (median 61) detected shots with a standard deviation of 91. In total, the number of MMFGs for a similarity search is 1807 segments for 20 MVRs, which is a significant increase.
This extends the runtime on a MacBook Pro (Core i7, 2.6 GHz) from 0.7 to 1.07 seconds. If the processing were performed as a web service request, the processing duration would exceed one second, and according to [10], "Anything over 1 second is a problem". However, with further optimization through parallelization [34], the processing time should stay below one second even for larger collections.
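A sketch of such a parallelization, distributing the per-scene Graph Code similarity calculations over CPU cores; similarity() and the Graph Code representation are placeholders for the structures used in the prototype:

    from concurrent.futures import ProcessPoolExecutor
    from functools import partial

    # similarity() must be a picklable top-level function; graph_codes maps
    # scene names to Graph Codes.
    def rank_parallel(query_gc, graph_codes, similarity, workers=8):
        names = list(graph_codes)
        score = partial(similarity, query_gc)   # score(gc) == similarity(query_gc, gc)
        with ProcessPoolExecutor(max_workers=workers) as pool:
            scores = list(pool.map(score, (graph_codes[n] for n in names)))
        return sorted(zip(scores, names), reverse=True)  # descending relevance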

Summary
The PFMR can improve the results for the user through segmentation. The segmentation results with the selected methods are poor and need to be improved. The multiplication of the MMFGs and GCs is measurable but addressable by parallelization. Also, better segmentation should further reduce the number of graphs.

CONCLUSION & OUTLOOK
In this paper, we have described the characteristics of video content and explained the key differences of MVRs compared to other videos. This was used to analyze which changes to MMIR are necessary to support MVRs. We proposed the PFMR for MMIR to support MVRs. We demonstrated the integration of PFMR in GMAF and showed the impact on the efficiency of the retrieval process.
The presented approach has been shown to work, while analysis methods for videos need optimization for metaverse recordings.

Figure 2: PFMR process steps mapped to extension points in GMAF

IMPLEMENTATION
This section describes the prototypical implementation of our model for validation by experiments.

Figure 4: Activity Diagram of the implemented PFMR in a GMAF flow.

Figure 5: Class diagram of plugins and fusion strategy used in the example workflow.