Multimodal AI-Based Summarization and Storytelling for Soccer on Social Media

The rapid advancement of technology has been revolutionizing the field of sports media, creating a growing need for sophisticated data processing methods. Current methodologies for extracting information from soccer broadcast videos to generate game highlights and summaries for social media are predominantly manual and rely heavily on text-based NLP techniques, overlooking the rich visual and auditory information available. In response to this challenge, our research introduces SoccerSum, a tool that integrates computer vision and audio analysis with advanced language models such as GPT-4. This multimodal approach enables automated, enriched content summarization, including the detection of players and key field elements, thereby enhancing the metadata used in summarization algorithms. SoccerSum uniquely combines textual and visual data, offering a comprehensive solution for generating accurate, platform-specific content. This development represents a significant advancement in automated, data-driven sports media dissemination and sets a new benchmark in soccer information extraction. A video of the demo can be found here: https://youtu.be/za4VIi2ARXY.


INTRODUCTION
In the dynamic and fast-paced world of sports summarization, the nuanced analysis of raw video and audio in soccer is pivotal for a deeper understanding of the game's intricate dynamics. We propose a new tool called SoccerSum, designed to analyze such inputs to summarize key moments in soccer games, with a special focus on goal events. A distinctive feature of SoccerSum is its ability to create engaging captions, enhancing the way soccer clubs connect with their audiences at a much faster pace while requiring fewer tedious manual operations.
In the competitive landscape of sports, fostering fan engagement, expanding audience reach, and establishing brand visibility are instrumental in building and maintaining a loyal fan base [22]. With its ability to create succinct and engaging social media captions, SoccerSum facilitates a seamless connection between clubs and their fans and provides an innovative and cost-effective approach to content creation, offering immense value. The tool's capacity to generate platform-specific content empowers clubs to effectively utilize various social media channels, optimizing their digital engagement strategy.
SoccerSum's workflow (see Figure 1) is robustly constructed, integrating video, audio, and metadata from game events. The metadata includes details such as the time of the goal, the players involved, and other relevant statistics, providing crucial information for analysis. By examining this input alongside the visual build-up and potential audio cues such as crowd reactions, SoccerSum effectively dissects the complexity of each event. The integration of multiple modalities ensures a more thorough event understanding, capturing both the technical depth and narrative richness inherent in pivotal moments of the game [5].
In this demonstration, we showcase a Flask application where participants can actively interact with SoccerSum. By configuring the parameters of various modules targeting different modalities, they can experience an end-to-end process, culminating in the generation of descriptive tweets that vividly capture the essence of the goal event.

BACKGROUND AND MOTIVATION
The field of sports summarization, particularly soccer, is undergoing a significant transformation with the advent of AI-driven summarization techniques [4,9,23]. Traditional systems primarily utilize Natural Language Processing (NLP) and other methods for text summarization, analyzing game metadata extensively [2,6,7,11,29,34]. However, these state-of-the-art systems do not integrate automated pipelines for extracting detailed scene information using computer vision from video and audio data. This leads to a significant gap in leveraging the rich audio-visual content from sports footage. Advancements in this field include works focusing on video captioning for real-time soccer commentary [26], which underscore the complexity of video captioning in sports, where models are required to master spatio-temporal dynamics [3], bridge video-text elements [8], and generate context-rich narratives [31]. The addition of background knowledge to video captioning poses further challenges, necessitating advanced techniques in video understanding and knowledge integration [21].
As the role of social media is pivotal for fan engagement and visibility [10,13,25,32], SoccerSum aims to bridge the gap in existing systems [15-17, 30, 33], which often overlook crucial visual and auditory features that could enhance soccer narratives created for the general public. Our system comprehensively understands scene context by extracting audio and visual features, and integrates NLP with audio-visual data analysis. SoccerSum can offer a holistic portrayal of soccer events on social media, providing descriptive and context-rich textual output based on an enhanced understanding of event dynamics.

SOCCERSUM
The SoccerSum framework, depicted in Figure 1, is an advanced system specifically engineered to automate the summarization of soccer goal events. The framework integrates multiple data modalities, encompassing high-resolution video feeds, multichannel audio recordings, and comprehensive game metadata. The fundamental aim is to efficiently process and synthesize these abundant and intricate data streams into succinct, compelling summaries tailored for major social media channels. This involves crafting concise Twitter posts and creating engaging Instagram captions that accompany goal videos, each adhering to the specific content guidelines and audience preferences of these platforms [13]. Central to the system's design is the application of cutting-edge AI algorithms, capable of processing vast amounts of data while adhering to the input limitations of the GPT-4 Turbo model. Currently, GPT-4 Turbo imposes a 128,000-token context window, which necessitates meticulous data curation and optimization across all processing modules. The framework's architecture is therefore tailored not only to extract the most pertinent and impactful elements from the visual, audio, and metadata layers of a soccer goal event, but also to synthesize these elements into a coherent narrative that fits within the specified limit. This constraint fundamentally influences the configuration and functionality of each module within SoccerSum, ensuring that the final input delivered to GPT-4 for response generation is both informationally rich and within the prescribed size. The framework thus represents a harmonious blend of deep learning, natural language processing, and computer vision technologies, all working in concert to transform raw soccer goal event data into captivating stories for the digital audience.

Video
SoccerSum has several components processing the video input:
• Preprocessing: We extract the video format, frame rate, and metadata to obtain frame-accurate information. Here, we also extract the audio for the audio processing described below.
• Key Frame Selection: To capture crucial moments efficiently, we implement a frame down-sampling algorithm, customized to parse the video feed effectively and reduce data volume. We have two approaches: i) a dynamic frame decimation algorithm that adaptively determines action intensity within the video, enabling the retention of a higher frame rate during moments of intense activity; and ii) a uniform frame selection that statically selects every n-th frame. This dual-method approach both enhances computational efficiency and maintains the integrity of important events within the game.
• Camera Shot Type Classification: We employ a down-sampling algorithm for the classification of camera shot types in soccer game footage. This is achieved using a deep learning-based model, specifically designed to categorize each frame into one of several camera shot types. Primarily, we focus on full, long, and medium camera shots to effectively summarize goal events. This classification leverages the MovieShot classifier [28] as a foundation, which we have extensively fine-tuned using a specialized soccer dataset [1]. By honing these camera shot types, the framework focuses on frames that significantly contribute to a holistic and informative summary of key game events.
• Object Detection: For the frames selected above, we perform object detection (Figure 2a). We have fine-tuned the state-of-the-art YOLOv8 model on a custom dataset [12,14] of 750 frames, primarily consisting of full, long, and medium shots, and covering eight distinct classes: player, goalkeeper, referee, ball, logo, goal net, penalty mark, and corner flag post. Identifying such objects enriches each frame's analysis, making the overall summarization of key game events more data-rich.
• Pitch Segmentation: A key limitation is the absence of specialized models for accurately segmenting various critical areas of a soccer pitch. We therefore add an enhanced version of the medium-sized YOLO Segment model, trained on our own annotated dataset [12,14]. This pitch segmentation model significantly augments our framework (Figure 2b). It allows for a more detailed spatial analysis, precisely identifying player positions and game strategies relative to the segmented areas of the pitch.
• Object Tracking: We track the movement of dynamic objects, predominantly the players, ball, and goalkeeper, across sequential frames. This process hinges on the effectiveness of our object detection model, which has demonstrated an F1 score of 95% in identifying key objects on the pitch. Moreover, to consistently maintain identifiers for objects across interactions, we integrate ByteTrack [35] to reassign the correct ID to objects that might have been temporarily lost or misidentified. Understanding these intricate movements and interactions is key to generating a thorough and contextually rich summary of goal events.
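The two key frame selection strategies described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names and the idea of using a per-frame motion score as the "action intensity" signal are our assumptions.

```python
def uniform_keyframes(total_frames: int, step: int) -> list[int]:
    """Uniform selection: keep every `step`-th frame index (approach ii)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return list(range(0, total_frames, step))

def dynamic_keyframes(motion_scores: list[float], base_step: int,
                      threshold: float) -> list[int]:
    """Dynamic frame decimation (approach i, sketched): keep every frame
    while a hypothetical motion score exceeds `threshold`, otherwise fall
    back to uniform sampling with `base_step`."""
    selected = []
    for i, score in enumerate(motion_scores):
        if score >= threshold or i % base_step == 0:
            selected.append(i)
    return selected
```

In the dynamic variant, high-action stretches retain their full frame rate while quiet stretches are thinned out, which matches the trade-off between computational efficiency and event integrity described above.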

Audio
SoccerSum also has several modules for audio processing (Figure 1):
• Automatic Speech Recognition (ASR): ASR plays a vital role in converting commentary and sideline discussions into text. To harness this narrative power, we employ the OpenAI Whisper model [27]. To address Whisper's short-window challenge, we implemented a segmentation mechanism with an overlapping window strategy, dividing the audio into segments of approximately 20 seconds each and applying an overlap to ensure no sentences are lost between segments. Duplicate sentences are removed using an NLP-based similarity detection mechanism [19]. Moreover, the Whisper model's capability to recognize speech in various languages adds another layer of versatility to our ASR module. All of this ensures that our framework can effectively process and transcribe goal event narrations from diverse linguistic sources, thereby enriching the input for generating more contextually relevant and engaging captions.
• Audio Intensity Analysis: Audio intensity analysis is important for augmenting visual data with auditory cues, particularly in capturing crowd reactions and the overall atmosphere. We employ Librosa [20] to calculate the Root Mean Square (RMS) [24] of the audio signal, serving as a robust metric for quantifying the signal's power and combining both perceived loudness and objective amplitude measures. By correlating peaks in audio intensity with specific goal event segments, we can generate captions that not only depict the visual action but also echo the excitement and energy captured in the audio. This holistic approach ensures that our summaries offer a rich, multi-dimensional portrayal of the soccer game, immersing the audience in both the visual spectacle and the emotional fervor of the event.
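The overlapping-window segmentation and the RMS computation above can be sketched as follows. This is an illustrative approximation, not the paper's code: the 2-second overlap value is our assumption (the paper only specifies ~20-second segments), and `frame_rms` recreates the framewise RMS that Librosa's `librosa.feature.rms` provides, using plain NumPy for self-containment.

```python
import numpy as np

def overlapping_segments(duration: float, segment_len: float = 20.0,
                         overlap: float = 2.0) -> list[tuple[float, float]]:
    """Split `duration` seconds of audio into ~`segment_len`-second windows,
    each overlapping the previous one by `overlap` seconds, so a sentence
    cut at one boundary appears whole in the neighboring segment."""
    step = segment_len - overlap
    segments, start = [], 0.0
    while start < duration:
        end = min(start + segment_len, duration)
        segments.append((start, end))
        if end >= duration:
            break
        start += step
    return segments

def frame_rms(y: np.ndarray, frame_length: int = 2048,
              hop_length: int = 512) -> np.ndarray:
    """Framewise Root Mean Square of an audio signal `y`."""
    n = 1 + max(0, len(y) - frame_length) // hop_length
    return np.array([
        np.sqrt(np.mean(y[i * hop_length: i * hop_length + frame_length] ** 2))
        for i in range(n)
    ])
```

Peaks in the `frame_rms` curve can then be aligned with the segment boundaries returned by `overlapping_segments` to localize crowd reactions around the goal.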

Game Metadata
The game metadata processing module is dedicated to capturing and integrating essential data for each goal event. This metadata provides a factual foundation for our analysis, supplementing the insights obtained from the video and audio analyses. For each event, the module collects factual information crucial to understanding the dynamics of the game. This includes the exact timestamp of the goal, which is critical for synchronizing with video and audio data. Additionally, it records the name of the player who scored and, if applicable, the assisting player's name. Another key piece of metadata is the type of shot that resulted in the goal, distinguishing between various scenarios like a standard field goal or an own goal. This metadata offers details about the events that are otherwise not ascertainable through video or audio analysis. The integration of this metadata into our framework ensures that our generated summaries encompass not only the visual and emotional narrative of the game but also the precise and essential details of each goal event.
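A metadata record of the kind described above could look like the following. The field names and value formats are purely illustrative assumptions; the paper does not specify its metadata schema.

```python
# Hypothetical per-goal metadata record; every key name is an assumption.
goal_metadata = {
    "timestamp": "63:41",      # game clock at the moment of the goal
    "scorer": "Player A",      # name of the scoring player
    "assist": "Player B",      # assisting player, or None if unassisted
    "goal_type": "open_play",  # e.g. open_play, penalty, own_goal
    "score_after": "2-1",      # scoreline immediately after the goal
}
```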

Prompt Engineering
In the prompt engineering phase, a critical step is the strategic organization and optimization of the data collected from prior modules for input into the GPT model. This process begins with aggregating data by frame number to chronologically align information from object detection and segmentation with each moment of the goal event.
A key aspect of this stage is optimizing the data to fit within the GPT-4 Turbo model's input limitations. This involves not only reducing the character count, but also carefully scrutinizing the frames based on object detection count. Frames with object counts below a set threshold are subjected to a secondary, more detailed analysis. The purpose of this analysis is twofold. Firstly, it assesses each frame's contribution to creating a meaningful and informative caption. Secondly, it involves comparing the bounding box dimensions with the frame size to identify potential misclassification of shot types. The development of prompts marks the culmination of this phase, where the goal is to transform processed data into coherent and contextually appropriate narratives. This process requires iterative testing and refinement to ensure that the prompts accurately capture key data points and are both precise and comprehensive. These prompts are designed to structure the data in a way that leverages the GPT model's ability to generate detailed and engaging summaries, particularly for goal events. Another critical aspect of this phase is the mapping of information obtained from various modalities to specific parts of the output. For example, if the first segment of a video contains a specific narrative from the audio, it is essential to align this information with the corresponding video frames to ensure a cohesive and accurate representation of the event. Through this meticulous process, we ensure that the data fed into the GPT model is not only rich in detail but also organized in a manner that facilitates the generation of accurate, contextually relevant, and engaging narrative descriptions of the goal events.
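The frame-curation step described above, chronological aggregation, a sparse-frame threshold, and a size budget, can be sketched as follows. This is a simplified illustration under our own assumptions: the serialization format, the default threshold of 3 objects, and the character budget are all hypothetical, and the secondary bounding-box analysis for sparse frames is only indicated by a comment.

```python
def curate_frames(detections: dict[int, list[str]],
                  min_objects: int = 3,
                  char_budget: int = 4000) -> str:
    """Serialize per-frame detections in chronological order, dropping
    frames below the object-count threshold and stopping once the
    prompt text would exceed the character budget."""
    lines, used = [], 0
    for frame_id in sorted(detections):
        objs = detections[frame_id]
        if len(objs) < min_objects:
            # Sparse frame: in the full pipeline this triggers the
            # secondary analysis (caption contribution + bbox-vs-frame
            # size check); here we simply skip it.
            continue
        line = f"frame {frame_id}: " + ", ".join(objs)
        if used + len(line) > char_budget:
            break  # budget exhausted; remaining frames are dropped
        lines.append(line)
        used += len(line)
    return "\n".join(lines)
```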

GPT API Interactions
The GPT API Interactions module represents a critical component of our framework, enabling the transformation of processed data into narrative content through the OpenAI GPT API. In adapting the API for our SoccerSum application, significant emphasis has been placed on the various parameters [18] that guide the generative process of the GPT model. These parameters include temperature, which controls the randomness of the output; max tokens, which sets the limit for the length of the generated content; and top-p, which dictates the probability threshold for considering potential output tokens. Adjusting these parameters has been a process of iterative experimentation, aiming to balance creativity with coherence in the generated narratives. Additionally, the response length and frequency penalty parameters [18] have been calibrated to refine the model's output. The response length parameter ensures that the summaries are concise yet comprehensive, while the frequency penalty helps reduce repetitiveness in the generated text.
An integral feature of this module is its capability to record the history of interactions between the LLM and the input data. This historical record enhances the user experience in the SoccerSum application, allowing for iterative refinements to the prompts based on user feedback. Users can modify their queries or add new information, prompting the system to adapt its responses accordingly, thus providing a more tailored and interactive narrative generation experience. The strategic optimization of these parameters, coupled with cost-effective token management and interactive user features, makes the GPT API Interactions module a cornerstone in delivering nuanced and engaging summaries that encapsulate the essence of soccer goal events.
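The parameters discussed above map directly onto the request body of the OpenAI chat-completions endpoint. The sketch below only assembles that payload, with the interaction history prepended so refinements carry context; it does not perform the network call. All concrete values (model name, temperature, token limit, penalties) are illustrative assumptions, not the tuned settings from the paper.

```python
def build_summary_request(prompt: str, history: list[dict]) -> dict:
    """Assemble a chat-completion payload, carrying prior turns in
    `history` so the user can iteratively refine the generated caption."""
    return {
        "model": "gpt-4-1106-preview",   # a GPT-4 Turbo model name; assumption
        "messages": history + [{"role": "user", "content": prompt}],
        "temperature": 0.7,        # randomness of the output (illustrative)
        "max_tokens": 280,         # keep captions roughly tweet-sized
        "top_p": 0.9,              # nucleus-sampling probability threshold
        "frequency_penalty": 0.5,  # discourage repetitive phrasing
    }
```

In use, the returned dict would be passed to the OpenAI client's chat-completions call, and the model's reply appended to `history` for the next refinement round.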

DEMONSTRATION
The SoccerSum GUI, built on the Flask framework, showcases a step-by-step, robust, and interactive platform for summarizing soccer goal events. Flask serves as the foundation, facilitating the seamless integration of front-end interactions and back-end processing. The front-end of SoccerSum, designed with HTML and CSS, offers an intuitive user interface, as depicted in Figures 3, 4, 5, and 6. A video of the demo can be found here: https://youtu.be/za4VIi2ARXY, with corresponding software artifacts here: https://github.com/simula/SoccerSum.

Video Modality.
SoccerSum offers various options for key frame selection following video input. Users have the choice of either setting a fixed frames per second (FPS) or opting for uniform frame selection, such as selecting every second frame or every 13th frame. When the fixed FPS method is chosen, users specify the desired FPS for video downsampling. In contrast, the uniform frame selection method allows users to determine the frequency of frame selection, providing a customizable approach. Additionally, users can select one or multiple options in the shot type classification. Our model classifies each frame of the video into five distinct categories, ranging from extreme close-up shots to long shots. The default selection for our end-to-end pipeline includes long shots, full shots, and medium shots. Moreover, users can choose between different models for object detection and pitch segmentation. The default models in our pipeline are medium-sized, fine-tuned versions of YOLOv8, specialized for 8 detection and 2 segmentation classes. Users can also select specific objects for tracking, ranging from the goalkeeper to the ball and other players. Furthermore, we have implemented a manual key frame selection method, which allows users to manually choose their preferred frames from the entire video, after which the rest of the pipeline is applied to these selected frames (Figure 4).

Audio Modality.
The application provides a suite of audio processing features that users can toggle according to their preferences. These features encompass video transcript processing, which facilitates speech recognition, and audio RMS analysis for assessing audio loudness. The activation or deactivation of these features significantly impacts the comprehensiveness and detail of the captions produced. For ASR, users have the option to select from various sizes of the Whisper V2 model; the default model integrated into our pipeline is the base version. In terms of audio intensity analysis, users can adjust the frame length and hop length within the Librosa library, which are key parameters in the analysis process (Figure 5).

Metadata Input and GPT-4 API Interaction.
In the metadata processing section, users provide essential details such as the goal scorer's name, the goal type, and the timestamp. This information greatly enhances the contextual framework of the summary. Additionally, users are required to input their private OpenAI GPT-4 API key, enabling interaction with the language model. Users also have the option to upload a JSON file containing metadata specific to the goal event in question. Furthermore, they can experiment with various parameter combinations when working with the GPT-4 API. These parameters include temperature, top-p, and frequency penalty, which influence the determinism or randomness of the results generated by the model (Figure 6).

Intermediate Results
Upon initiating the 'Run' action, SoccerSum's backend, orchestrated by Flask, efficiently processes each module, from video and audio analysis to metadata integration. The interaction with the GPT model is managed to generate coherent and contextually relevant summaries. The results are displayed dynamically, step by step, on the dashboard, showcasing the effectiveness of the integrated processing and AI-driven content generation.

Final Outputs
The outcomes of our analysis across the various modalities scrutinized in the preceding stages are shown in Figure 7. We commence by examining the output generated by GPT-4, which manifests in the form of tweets. These tweets are parsed individually, rendering them readily accessible for copying to the clipboard for immediate sharing. Subsequently, we delve into the results stemming from the video modality. This encompasses a comprehensive assessment involving a series of key steps. First, we report the number of frames analyzed after the careful selection of key frames. Concurrently, we enumerate the detected and segmented objects within the video. Additionally, we provide a visual representation in the form of a 3D plot, illustrating the positional variations of the goalkeeper across the selected keyframes. Moving on to the audio results, we present the automatically detected transcript of the narrator for the goal event. Notably, this transcript can encompass diverse languages, demonstrating the robustness of our analysis. Furthermore, we offer insight into the audio intensity analysis, portrayed through a 2D plot spanning the entire duration of the video (Figure 7).

DISCUSSION
Our framework demonstrates efficiency in processing and summarizing soccer events for social media content. For instance, for a 15-second video clip, it takes less than 90 seconds to generate a related tweet, highlighting its capability for rapid content creation. This is not limited to goals but extends to other events such as bookings and important scenes. By restructuring the metadata specific to the event, our system can adapt and derive the necessary content from our pipeline.
While our detection, segmentation, and ASR models are operable on CPUs, the use of GPUs significantly accelerates runtime, making them a preferable choice for time-sensitive applications. In cases where not all modalities are available, the system will still function by placing more weight on the existing modalities. For example, if there is no audio transcript, the model detects this and adjusts the output generation process accordingly, ensuring the result remains relevant and informative.
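The graceful-degradation behavior just described can be sketched as a simple availability check that downstream prompt construction could consult. This is an illustrative sketch only; the function name, the set of modality flags, and the idea of boolean gating are our assumptions about how such weighting might be wired up.

```python
def available_modalities(transcript, rms_curve, detections, metadata) -> dict:
    """Report which modalities produced usable output, so that prompt
    construction can lean on the remaining ones when a source is missing."""
    return {
        "audio_transcript": bool(transcript and transcript.strip()),
        "audio_intensity": rms_curve is not None and len(rms_curve) > 0,
        "video_detections": bool(detections),
        "metadata": bool(metadata),
    }
```

A prompt builder could then, for example, omit the commentary section entirely when `audio_transcript` is False and instead emphasize the detection and metadata sections.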
This balance of speed and performance underlines its potential impact on real-time sports media dissemination, paving the way for advancements in AI-driven content generation. Additionally, this system represents a significant advantage for smaller clubs operating on tight budgets. By automating content generation, it saves considerable time and resources, eliminating the need for manual labor in monitoring events and tailoring content to various social media platforms.
Regarding the requirements outlined in [13], our outputs are tailored to meet the specific characteristics and limitations of different social media platforms. This includes an emphasis on output token size and adapting to the unique requirements of each platform. As a result, our framework not only streamlines content creation but ensures that the generated summaries are optimized for social media dissemination, meeting the diverse and platform-specific requirements effectively. We are also considering the integration of additional models, such as replay detection, and further fine-tuning the language model with more platform-specific language. This includes adding user-centric customization options, allowing for even more tailored and engaging content. These enhancements aim to further refine our system's capabilities, ensuring it not only meets but exceeds the dynamic needs of social media content generation in the realm of sports.

CONCLUSION
In this study, we introduce a framework called SoccerSum for automating the summarization of soccer goals for social media.It integrates video and audio analysis with natural language processing, enhancing sports analytics and media sharing.
SoccerSum can undertake comprehensive data processing and extract information from various sources and multimedia modalities. It analyzes game footage for the positions of players and other objects, crowd reactions, and audio intensity, using object tracking and pitch segmentation for spatial insights. The Automatic Speech Recognition module adds contextual depth. We optimize data for the GPT-4 Turbo model's input limits, maintaining information richness. The GPT API interactions module is fine-tuned for engaging, accurate soccer summaries. The framework is also economically efficient, reducing operational costs without losing output quality. Its interactive nature allows for personalized content generation. Overall, our system advances soccer game summarization, setting new standards in AI and sports analytics. It opens the door to more accessible, engaging sports narratives, improving the experience for fans and stakeholders.

Figure 2 :
Figure 2: Fine-tuned machine learning models for video processing.

Video Input.
The application's initial section is dedicated to video input, where users can upload highlights of specific goal events. SoccerSum handles various video formats, accepting both HLS m3u8 playlists and MP4 files. Users have the flexibility to upload videos directly from their local systems or input URLs of videos hosted on public servers (Figure 3).