The SoccerSum Dataset for Automated Detection, Segmentation, and Tracking of Objects on the Soccer Pitch

This paper introduces SoccerSum, a novel dataset aimed at enhancing object detection and segmentation in video frames depicting the soccer pitch, using footage from the Norwegian Eliteserien league across 2021-2023. With the goal of detecting elements beyond common entities in existing datasets, such as the soccer ball, players and referees, this dataset includes additional annotations for the goal net, corner flag posts, and the penalty mark. SoccerSum also includes the segmentation of key pitch areas such as the penalty and goal boxes for the same frame sequences. Comprising 750 frames annotated with 10 classes for advanced analysis, SoccerSum offers compatibility with existing frameworks, providing a rich dataset for the development of computer vision algorithms. This dataset not only serves as a resource for improving sports analytics, but also introduces a new application for automatic game summarization, enabling the generation of detailed and engaging content for fans and professionals. The SoccerSum dataset is accessible on Zenodo: https://zenodo.org/records/10612084.


INTRODUCTION
Soccer is a globally celebrated sport, drawing millions of enthusiasts and players from a multitude of cultures and continents. Its rapid dynamics, strategic complexity, and the vibrant interplay among teams not only make it an enthralling spectator sport but also a rich domain for analytical exploration and technological innovation.
The pursuit of automatic summarization of soccer videos has emerged as a critical area of interest in recent years [21], driven by the growing demand for enhanced viewing experiences, advanced tactical analysis, and detailed player performance evaluations. The incorporation of cutting-edge computer vision techniques in this area has revolutionized the ability to extract comprehensive and nuanced insights from soccer footage, significantly surpassing traditional approaches that rely on basic statistics and scorelines.
Effective machine learning applications, particularly those aimed at sports analytics, require access to well-curated and diverse datasets for training. The inherent complexity and dynamic nature of sports environments, such as that of soccer, demand that these applications are capable of accurately identifying and interpreting various elements of the game under a range of conditions and scenarios. Hence, there is an essential need for datasets that are not only comprehensive but also reflective of the myriad real-world situations encountered in soccer games.
This paper introduces the SoccerSum dataset, curated from the Norwegian Eliteserien soccer league, with the specific goal of addressing the challenges in automatic soccer game summarization. SoccerSum concentrates on augmenting object detection capabilities beyond traditional game entities and includes the segmentation of key areas on the pitch. Featuring diverse video recording styles, weather conditions, and team jersey colors, SoccerSum provides a robust foundation for the development and enhancement of algorithms capable of handling the complexity inherent in soccer videos.
Overall, our contributions are as follows: We present a unique dataset called SoccerSum, curated from the Norwegian Eliteserien soccer league and tailored for enhancing automatic soccer video summarization. SoccerSum includes object detection bounding boxes and segmentation masks, covering a total of 10 classes with annotations in the You Only Look Once (YOLO) [28] format. It stands out due to its diversity in video recording styles, weather conditions, and jersey colors, ensuring its applicability to a wide range of machine learning use cases for sports analytics.
… [20], and SoccerNet-tracking from the perspective of different applications.
Figure 1: Illustration of the soccer pitch with key dimensions and markings [4].

BACKGROUND AND RELATED WORK
We outline the main features of a soccer pitch (Figure 1), based on FIFA's official rules [4]. The pitch is a rectangle, 90-120 meters long and 45-90 meters wide. The goal area extends 5.5 meters from the inside of each goalpost and 5.5 meters into the field. The larger penalty area, which extends 16.5 meters from each post, contains the goal area and is the region in which the goalkeeper may handle the ball. The penalty mark is 11 meters from the goal line, and a D-shaped arc with a radius of 9.15 meters is drawn around the mark for penalty kicks. Flag posts, at least 1.5 meters high with non-pointed tops and flags, are placed at each corner and optionally at each end of the halfway line. The goals are placed at the center of each goal line, with the goalposts 7.32 meters apart. Corner areas for kicks are marked by quarter circles with a 1-meter radius at each corner.
There are a number of open datasets available for research on soccer game understanding. SoccerNet-v1/v2/v3 [2,3,9] are comprehensive datasets designed for analyzing soccer broadcast videos. They encompass a wide range of tasks beyond player tracking, including action spotting, replay grounding, ball action spotting, dense video captioning, and more. These versions are characterized by their extensive corpus of manual annotations that cover various aspects of a soccer broadcast. The progressive updates in each version introduce new challenges and tasks, promoting ongoing research in diverse areas of soccer video analysis. These datasets are particularly suited for general soccer video understanding and summarization, offering a holistic view of the game beyond mere player movements.
Moreover, SoccerNet-Tracking is specifically tailored for Multiple Object Tracking (MOT) [19] in soccer videos. This dataset is one of the largest in the sports domain, especially for soccer, focusing on multi-object tracking. It includes detailed annotations for player tracklets, bounding boxes, and tracklet IDs, catering to the training of tracking models for soccer. The dataset consists of 200 sequences of 30 seconds each, covering various challenging soccer scenarios. Additionally, it includes a 45-minute sequence (one entire game period) for long-term tracking [2]. It also introduces specific challenges for tracking tasks, using metrics like HOTA [18] to evaluate both detection and association accuracy. This dataset is ideal for in-depth analysis of player and object movement within the soccer field.
Berjon et al. [1] introduced a novel pitch segmentation technique utilizing stochastic watershed transform for line marking detection, which efficiently handles radial distortions and links segmented points to primitives. Finally, Liu et al. [17] used object detection results with bounding boxes as visual inputs to Large Language Models (LLMs), showcasing the opportunity for the outputs from object detection, segmentation and tracking models to be used as inputs to LLMs [6].
Automated summarization of soccer game videos has received interest as a research area, with the different modalities available in soccer broadcasts shown to have potential use [8]. From data-to-text approaches using template engines and rule bases [25] to others employing deep learning based vision and language models, many works have utilized the available modalities such as visual features [16,22,26], audio features [7,16,23,26], captions [30], event information, team lineups and player information [29], and transcribed commentary [5,14]. Guntuboina et al. [11] employed a scoreboard detection method for automatic key-event extraction and summarization of sports videos. Yan et al. [31] proposed utilizing a pose detection algorithm along with object detection for clipping sports activity highlights.
Marques et al. [20] extended the SoccerNet dataset with a focus on camera calibration [10], identifying 10 segmentation areas of the soccer pitch, including the goal and penalty areas, penalty arc, half field, center circle, and entire field. Although these datasets have advanced sports analytics, they have limitations (see Table 1), especially in detecting specific objects such as penalty marks, corner flag posts, goal nets, and logo overlays, which are very common in soccer broadcasts. These elements are crucial for analyzing game events, player positions, and field localization. Our SoccerSum dataset bridges this gap by including extended object detection and segmentation classes annotated in sequential game frames.
• Varied weather conditions: The frames encapsulate a range of weather conditions, from sunny and clear to snowy and overcast, providing insights into how different environmental factors affect gameplay.
• Diverse jersey colors: To account for the visual variability in uniforms, the dataset includes frames featuring a wide spectrum of jersey colors for players, goalkeepers, and the main referee. This aids in the development of algorithms capable of distinguishing players from multiple teams under various lighting conditions.
• Multiple ball colors: Recognizing the importance of ball detection and tracking in soccer video analysis, the dataset includes frames with balls of different colors, ensuring robustness in ball detection across diverse backgrounds and lighting conditions.
• Variety of logos: Team and league logos, and their placement in video frames, offer unique challenges in image recognition. Our dataset includes frames with a variety of logo shapes and sizes.
• Different pitch styles: To capture the nuances of different playing surfaces, the dataset features frames showcasing a range of pitch styles, from well-manicured grass to artificial turf, each presenting its own set of visual characteristics.
• Varied camera angles and recording quality: Understanding the impact of camera angles and recording quality on video analysis, the dataset includes frames from different camera angles, simulating real-world conditions for advanced algorithm training.

Sources and Collection
In partnership with the Norwegian Professional Football League (Norsk Toppfotball), we used APIs from their partner company Forzasys (https://forzasys.com/) and accessed M3U8 URLs for games from the Eliteserien league, spanning three years (2021 to 2023). Our data collection methodology involved downloading 10-minute segments from randomly selected games of all participating teams for each year. For each video segment, we extracted 5000 random frames, ensuring a minimum interval of 150 frames between selections to maximize the diversity of captured moments and avoid the inclusion of contiguous sequences. The original source videos were encoded in H.264 format with a resolution of 1280x720 pixels and a frame rate of 25 fps. The selected frames were stored as JPEG files, using a quality setting of 90 to maintain high fidelity with the original content. Consequently, we amassed a dataset comprising a vast array of frames, each representing distinct scenarios and conditions encountered in these soccer games.
The subsequent stage entailed a manual review of all collected frames. Through several iterations of this review process, a total of 750 frames were chosen, culminating in a representative and richly diverse dataset. Table 2 presents a breakdown of the dataset, showcasing the number of frames annotated for each year, and the teams involved in the games from which the frames were extracted.
Table 2: Eliteserien teams across 3 years as dataset sources.
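The random frame selection with a minimum spacing described above can be sketched as follows. This is a minimal illustration and not the authors' extraction script: the function name `sample_frames_with_gap`, the seed, and the count `k=50` are hypothetical values chosen for the example; only the 25 fps rate, the 10-minute segment length, and the 150-frame spacing come from the text.

```python
import random


def sample_frames_with_gap(n_frames, k, min_gap, seed=None):
    """Pick k frame indices in [0, n_frames) that are pairwise at least min_gap apart."""
    # Shrinking every required gap to a single step turns the task into
    # drawing k distinct values from a smaller range.
    reduced = n_frames - (k - 1) * (min_gap - 1)
    if k < 1 or min_gap < 1 or reduced < k:
        raise ValueError("cannot fit k frames with this gap in the segment")
    rng = random.Random(seed)
    base = sorted(rng.sample(range(reduced), k))
    # Re-expand: the j-th pick (0-based) is shifted by j * (min_gap - 1).
    return [b + j * (min_gap - 1) for j, b in enumerate(base)]


# 10 minutes at 25 fps gives 15000 frames; k=50 is an illustrative count only.
picks = sample_frames_with_gap(n_frames=10 * 60 * 25, k=50, min_gap=150, seed=0)
```

Sampling in the reduced range and re-expanding avoids rejection loops and makes every valid set of indices equally likely.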

Annotations
We focused on identifying and labeling key elements within each frame that are critical to the understanding and analysis of soccer games. These elements were categorized into 8 classes for detection (player, goalkeeper, referee, ball, logo, penalty mark, corner flag post, and goal net), each assigned a specific integer as an identifier.
In addition to the classes for detection, our dataset also includes 2 classes for segmentation (the penalty area and the goal area), focusing on critical areas within the soccer pitch. Segmentation involves delineating specific regions or areas within each frame for more detailed analysis.
Figure 4 provides a summary of our dataset, showing the number of annotations for each class over three years. Annotations for the "Player" class are most frequent, aligning with expectations. Other classes have a consistent distribution, except for the "Logo" class, which has no instances in 2021 due to broadcasting changes. Figure 5 illustrates the distribution of soccer teams, based on game participation and appearance in frames.

Public Dataset
The SoccerSum dataset, accessible on Zenodo [12], is a collection of soccer game frames. Each frame within the dataset is complemented by two types of annotation files: one dedicated to detection and the other to segmentation. The dataset is hierarchically structured, starting with the league (Eliteserien), followed by the year, and down to individual game IDs. Each year folder is divided into three primary folders: "frames", "detection", and "segmentation". Within these folders, the files are further grouped by their respective game IDs. As a result, each frame in our dataset is paired with two corresponding .txt files, one in the "detection" folder and the other in the "segmentation" folder, both sharing the same naming convention as the frame (see Figure 3).
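The folder layout described above suggests a simple way to locate both annotation files for a given frame. The sketch below assumes the path pattern league/year/frames/game ID/file name implied by the text; the helper name `annotation_paths`, the game ID, and the frame file name are hypothetical.

```python
from pathlib import PurePosixPath


def annotation_paths(frame_path):
    """Return (detection_txt, segmentation_txt) for a frame path."""
    frame = PurePosixPath(frame_path)
    parts = list(frame.parts)
    if "frames" not in parts:
        raise ValueError("expected a 'frames' folder in the path")
    idx = parts.index("frames")
    paired = []
    for folder in ("detection", "segmentation"):
        # Swap the 'frames' folder for the annotation folder, keep the rest.
        swapped = parts[:idx] + [folder] + parts[idx + 1:]
        paired.append(PurePosixPath(*swapped).with_suffix(".txt"))
    return tuple(paired)


# Hypothetical game ID and file name, for illustration only.
det, seg = annotation_paths("Eliteserien/2021/frames/1234/frame_000150.jpg")
```

Because only the folder name and the file extension change, the same helper works regardless of how the frame files themselves are named.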
Coordinate system: The coordinates (x, y) used in the dataset are normalized to the dimensions of the frame image. This means:
x = absolute_x / image_width, y = absolute_y / image_height
where absolute_x and absolute_y are the pixel coordinates in the image, and image_width and image_height are the dimensions of the image.
Sample line in the detection file: In YOLO's classification format [15,27], each line in the .txt file corresponds to one classified object in the associated image. The format for each line is as follows:
<class_id>, <x>, <y>, <w>, <h>
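To make the line format and the normalization concrete, the following sketch converts one detection line back to a pixel-space bounding box. It assumes the usual YOLO convention in which (x, y) is the box center and (w, h) are the box width and height, all normalized; the function name `parse_detection_line` and the class ID in the example are placeholders. Standard YOLO label files separate fields with whitespace, so the parser accepts both whitespace and commas.

```python
def parse_detection_line(line, image_width, image_height):
    """Parse one label line into (class_id, x_min, y_min, x_max, y_max) in pixels."""
    # Accept both whitespace- and comma-separated fields.
    fields = line.replace(",", " ").split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    class_id = int(fields[0])
    x, y, w, h = (float(v) for v in fields[1:])
    # Undo the normalization: x and w scale with width, y and h with height.
    cx, cy = x * image_width, y * image_height
    bw, bh = w * image_width, h * image_height
    return class_id, cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2


# Class ID 3 is a placeholder; 1280x720 is the frame size stated in the paper.
box = parse_detection_line("3 0.5 0.5 0.25 0.5", 1280, 720)
```

For a 1280x720 frame, the example line describes a box centered in the image that is 320 pixels wide and 360 pixels high.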

Novel Contributions of SoccerSum
Curation of the SoccerSum dataset is a strategic response to the limitations of existing soccer video analysis resources. By extending the range of detectable objects, focusing on critical event-rich areas, and providing detailed frame-by-frame annotations, SoccerSum is positioned to significantly advance the field of automated soccer video analysis. It addresses the nuanced requirements of comprehensive game analysis, enabling researchers and practitioners to develop more sophisticated models for understanding and predicting soccer game events. SoccerSum introduces several novelties:
• Augmented object detection capabilities: By including bounding boxes for the player, goalkeeper, referee, ball, penalty mark, corner flag post and goal net classes, SoccerSum provides a more comprehensive set of annotations that are crucial for detailed game analysis. This enables a deeper understanding of player and object positions in relation to key areas of the field.
• Segmentation of key areas: The inclusion of segmentation masks for penalty and goal areas offers precise spatial information for analyzing events and player movements in these critical areas of the pitch.
• Focus on event-rich regions: SoccerSum's emphasis on annotating areas and objects associated with goal-area events allows for the development of models that are more adept at interpreting significant moments in soccer games, such as scoring opportunities, defensive actions, and set pieces.
• Frame-by-frame annotation for tracking: By annotating consecutive frames, SoccerSum facilitates the development of tracking algorithms for players, referees, and the ball. This is a significant step forward in creating models capable of real-time analysis and prediction of game dynamics.

DATASET APPLICATIONS
SoccerSum enables the development and refinement of models tailored for various tasks.

Application 1: Object Detection
The SoccerSum dataset's inclusion of frames featuring a wide range of scenarios, including different weather conditions, jersey colors and pitch styles, is pivotal for training robust object detection models. This diversity helps models identify and classify objects accurately under multiple conditions, enhancing gameplay analysis [24]. SoccerSum includes annotations for the detection of players, goalkeepers, referees, balls, penalty marks, corner flag posts, and goal nets, allowing for comprehensive game analysis that is crucial for understanding dynamic game events. Detailed annotations, as shown in Figure 6, for a broader range of detectable objects and critical event-rich areas support the development of models capable of interpreting significant moments in soccer games.

Application 2: Pitch Segmentation
Precise spatial information for analyzing events and movements in critical areas of the pitch enhances understanding of gameplay strategies and player efficiency. Access to segmentation masks for the important regions, as shown in Figure 7, allows for a holistic view of game and player dynamics, facilitating advanced tactical analysis.
Additionally, we experimented with deriving pitch segments from the SoccerNet Camera Calibration Challenge dataset [10], as demonstrated in Figure 2, akin to Marques et al.'s [20] approach for pitch segmentation. We found that this augmentation is a valuable resource for creating pitch segmentation datasets and deriving tentative positions of objects and regions of interest, such as corners, flag posts, goal net, middle arch, center circle, penalty arc, half field and whole field boundary. The integration of the SoccerSum dataset with the segmentation masks derived from the SoccerNet Camera Calibration Challenge dataset presents a unique opportunity to push the boundaries of sports analytics in soccer through pitch segmentation and its derivative applications, such as player localization and tactics analysis.

Application 3: Object Tracking
Detailed frame-by-frame annotations, as shown in Figure 8, especially in sequences capturing goal-area events, are crucial for developing tracking algorithms. These algorithms can follow players, referees, and the ball, providing real-time analysis and prediction of game dynamics. Sequential frame annotation of a whole clip facilitates the development of models for real-time tracking and analysis, essential for live game analytics and post-game review. The dataset's varied conditions and high-quality annotations enable the training of more sophisticated models capable of understanding and predicting game events accurately.
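As a rough illustration of how per-frame detections can be linked into tracks, the sketch below greedily matches box centers between two consecutive frames by distance. This is a deliberately simple baseline chosen for the example, not the tracking method used by the authors; the function name `associate`, the pixel coordinates, and the 50-pixel threshold are assumptions.

```python
import math


def associate(prev_tracks, detections, max_dist=50.0):
    """Greedily link detections to tracks by nearest box center.

    prev_tracks: dict track_id -> (cx, cy) from the previous frame.
    detections: list of (cx, cy) centers in the current frame.
    Returns dict track_id -> (cx, cy) for the current frame.
    """
    pairs = sorted(
        (math.dist(p, d), tid, j)
        for tid, p in prev_tracks.items()
        for j, d in enumerate(detections)
    )
    tracks, used = {}, set()
    for dist, tid, j in pairs:
        if dist > max_dist:
            break  # all remaining pairs are even farther apart
        if tid in tracks or j in used:
            continue
        tracks[tid] = detections[j]
        used.add(j)
    # Unmatched detections start new tracks with fresh IDs.
    next_id = max(prev_tracks, default=-1) + 1
    for j, d in enumerate(detections):
        if j not in used:
            tracks[next_id] = d
            next_id += 1
    return tracks


frame1 = {0: (100.0, 200.0), 1: (400.0, 220.0)}
frame2 = associate(frame1, [(405.0, 225.0), (104.0, 203.0), (900.0, 300.0)])
```

In the example, the two nearby detections inherit track IDs 0 and 1, and the distant third detection opens a new track. Trackers used in practice add motion models and appearance cues on top of this kind of association step.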

Application 4: Automatic Summarization
The integration of trained detection, segmentation, and tracking models with LLMs advances automatic soccer game summarization [13]. Our trained model supports this, achieving a mean Average Precision at an IoU threshold of 0.5 (mAP@0.5) of 0.98 for the two segmentation classes and 0.895 for the eight detection classes. This integration enhances the depth and accuracy of LLM-generated summaries, offering a nuanced understanding of on-field events. Such results not only enrich the narrative by detailing player movements and game dynamics but also provide engaging and informative content for fans, coaches, and analysts. This approach, synthesizing complex data into coherent, context-rich summaries, represents a notable step forward in sports analytics and content generation.
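The IoU threshold of 0.5 behind the reported precision figures can be illustrated with a small sketch. It shows only the per-box matching criterion (a prediction of the right class counts as correct when its overlap with a ground-truth box reaches the threshold); the full mean Average Precision computation additionally ranks predictions by confidence and averages over classes. The function names and the box coordinates are made up for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, truth, threshold=0.5):
    """A prediction counts as correct at this threshold if IoU >= threshold."""
    return iou(pred, truth) >= threshold


truth = (0.0, 0.0, 10.0, 10.0)
good = (0.0, 0.0, 10.0, 8.0)    # IoU = 80 / 100 = 0.8
poor = (5.0, 5.0, 15.0, 15.0)   # IoU = 25 / 175, about 0.14
```

Here `good` would be accepted and `poor` rejected at the 0.5 threshold.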

Application 5: Event Detection
The SoccerSum dataset holds significant potential for event detection in soccer analytics, enabling a deeper understanding of game dynamics. By leveraging its annotations and segmentation of players, ball, goal nets, and other pitch elements, the dataset facilitates the recognition of specific game scenarios. For example, by combining the detected position of the ball, its proximity to the corner flag, and the distribution of players within the penalty area, a model trained on SoccerSum can identify a corner kick situation. This capability extends to various in-game events, allowing for a nuanced analysis of team formations, strategies, and potential play outcomes.
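The corner kick example above can be turned into a simple rule over detection and segmentation outputs. The sketch below is a hypothetical heuristic, not a method from the paper: the function names, the distance threshold, the minimum player count, and all coordinates are assumptions, and distances are measured directly in normalized image coordinates, which ignores the frame's aspect ratio and camera perspective.

```python
import math


def point_in_polygon(pt, polygon):
    """Ray-casting test: is pt inside the polygon (list of (x, y) vertices)?"""
    x, y = pt
    inside = False
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y1 > y) != (y2 > y):
            # x coordinate where this edge crosses the horizontal line through pt
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def looks_like_corner_kick(ball, corner_flag, players, penalty_area,
                           max_ball_dist=0.05, min_players=6):
    """Heuristic: ball near the corner flag and a crowded penalty area."""
    near_flag = math.dist(ball, corner_flag) <= max_ball_dist
    crowded = sum(point_in_polygon(p, penalty_area) for p in players) >= min_players
    return near_flag and crowded


# Made-up normalized coordinates for one frame.
penalty_area = [(0.55, 0.30), (0.95, 0.30), (0.95, 0.80), (0.55, 0.80)]
players = [(0.60 + 0.05 * i, 0.50) for i in range(6)] + [(0.20, 0.50)]
result = looks_like_corner_kick((0.97, 0.12), (0.98, 0.10), players, penalty_area)
```

With these inputs the ball lies close to the flag and six of the seven players fall inside the penalty-area polygon, so the rule fires. A practical detector would need thresholds tuned on data and temporal smoothing across frames.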

CONCLUSION
The SoccerSum dataset represents an advance in soccer analytics, providing a strong foundation for future research and development in this area. By addressing the limitations of existing datasets and incorporating enhanced annotations for object detection and segmentation, our dataset facilitates the development of machine learning models, such as YOLOv8 and YOLOv9, which are effective in object detection and segmentation. This improves the analysis and understanding of soccer games, offering valuable insights to coaches, teams, and broadcasters who aim to enhance their analytical capabilities and their understanding of the sport. We anticipate that the SoccerSum dataset will significantly influence the future approach to game analysis. Additionally, SoccerSum has the potential to aid in developing automated game summarization pipelines that utilize large language models, creating detailed narratives of soccer matches and thereby transforming the future of sports analytics and content generation.

Figure 2: Derivation of pitch segments from the SoccerNet Camera Calibration Challenge dataset, illustrating varying levels of accuracy and challenges in automated mask extraction for pitch regions.

Figure 4: Number of annotations per class per year.

Figure 5: Annotations per team in terms of the number of occurrences in frames and games.

Figure 6: Object detection showcase: object detection with eight classes trained on SoccerSum, highlighting capability in identifying diverse objects.

Figure 8: Player tracking showcase: frames showing player positions on the field where movements are marked by blue lines.