TACDEC: Dataset of Tackle Events in Soccer Game Videos

This paper introduces TACDEC, a dataset of tackle events in soccer game videos. Recognizing the gap in existing open datasets that predominantly focus on official soccer events such as goals and cards, TACDEC targets a comprehensive analysis of tackles --- a critical aspect of soccer that combines technical skills, tactical decision-making, and physical engagement. By leveraging video data from the Norwegian Eliteserien league across multiple seasons, we annotated 425 videos with 4 types of tackle events, categorized into "tackle-live", "tackle-replay", "tackle-live-incomplete", and "tackle-replay-incomplete", yielding a total of 836 event annotations. The dataset offers an unprecedented resource for the development and testing of machine learning models aimed at understanding and analyzing soccer game dynamics. A proof-of-concept classification model demonstrates the dataset's utility, achieving promising results in automatic tackle detection, thereby validating TACDEC's potential to support not only advanced game analytics but also to enhance fan engagement and player development initiatives.


INTRODUCTION
Soccer is one of the most popular sports in the world, attracting millions of fans and participants from various cultures and communities. Its global appeal and the significant commercial interests it generates have spurred extensive research and innovation in areas such as soccer analytics, player performance tracking, and fan engagement strategies. Recent studies, such as Pu et al. [36], highlight the evolving nature and potential of game analysis techniques, emphasizing the importance of data-driven decision-making in modern soccer.
Automatic content production for soccer has seen significant advances, driven by the integration of artificial intelligence and machine learning [10][11][12][13]. This evolution is reflected in the growing sophistication of algorithms capable of not only analyzing player and ball movements but also generating real-time statistics and predictive insights. Works such as [19,27,38] demonstrate how data-based analysis such as deep learning models have been effectively used to automate content production, offering faster and more accurate analyses than traditional methods. These innovations are transforming how soccer is broadcast, analyzed, and experienced by audiences worldwide.
Detecting events within video streams presents a significant challenge, given the diversity of methodologies developed for various applications. Specifically, in the sports domain, the conventional method of manually annotating video segments, where a team tags portions of the content in real-time during live broadcasts, is both time-consuming and costly. For soccer, this process involves individuals closely watching the game to identify and mark significant moments like goals and fouls, leading to delays in the availability of annotated content. To counter these limitations, there has been a surge in the development of automated systems over recent years, with several notable contributions making strides in this area [4,14,21,25,31,34,37,39,41,42]. Furthermore, the SoccerNet challenge has featured action spotting as a key task [8], where recent competitions have seen the top-performing teams achieve an average mean Average Precision (mAP) of approximately 80% in 2023, demonstrating the progress in this field.
Machine learning applications require well-curated datasets for training, which is particularly challenging in the context of sports video analysis. The quality and granularity of the data significantly influence the model performance. For soccer, datasets must accurately capture intricate details of the game, such as player movements, ball position, and game events. This demands high-resolution video footage as well as precise temporal and spatial annotations. Existing datasets in this domain, like SoccerNet, provide extensive annotations covering a wide range of events, but there is still a gap in terms of real-time processing and multi-dimensional analysis. The integration of advanced techniques in computer vision and deep learning promises to bridge this gap, offering the potential to revolutionize how sports analytics are performed, leading to more nuanced and real-time insights into games.
In this paper, we introduce TACDEC, a dataset designed to enable automatic tackle event detection in soccer games by leveraging video data from the Eliteserien league. TACDEC incorporates various class annotations based on a comprehensive understanding of a tackle and aims to fill the gap in soccer analytics and research concerning the technical, tactical, and physical aspects of tackles. Overall, our contributions are as follows:
• Creation of the TACDEC dataset, a novel resource for detailed tackle event analysis in soccer.

RELATED WORK
Several datasets already exist in the context of soccer and data analysis. For example, SoccerNet-v3 [5] is a comprehensive dataset that supports tasks like camera calibration, pitch localization, and player re-identification, with a focus on video data from soccer games. Another notable dataset, SoccerDB [20], offers a large-scale resource for video understanding, incorporating a substantial collection of images and extensive video footage of soccer games. These datasets serve as valuable resources for analyzing various aspects of soccer games, including player tracking and event spotting. Despite the availability of such datasets, a gap exists in the analysis of tackles: current datasets mainly cover goals, cards, and set pieces, i.e., events directly affecting game outcomes, while overlooking tactical actions and specific player behaviors such as tackling.
In other sports, tackling has been more in focus, and research in that direction shows the advantages of specifically focusing on tackling. In 2022, Nonaka et al. aimed to reduce the risk of severe injuries in contact sports; more specifically, they implemented a tackle detection system for rugby [32]. Using rugby games from the Japan Rugby Top League, deep learning was used together with pose estimation to detect potentially concussion-leading tackles. A total of 293 tackles were extracted, with the subcategories "low-risk tackles" (238) and "high-risk tackles" (155). Some years earlier, in 2014, Gastin et al. explored how the use of wearable microsensors could help detect tackles in Australian football [9]. The detection accuracy for tackles was observed to be 78%. This accuracy was derived from two scenarios: a 66% correct detection rate when identifying the player executing the tackle and a 90% correct detection rate when identifying the player being tackled. Observations from actual videos were used as the criterion.
Back to computer vision tasks, Morra et al. [28] developed a two-level system that detects atomic and complex events using soccer data. For this, they used a synthetic dataset generated from a soccer game simulator, which included events such as goals, corners, and penalty kicks, but also events such as fouls. Although around 60,000 tackle events were generated, simulated data might not capture the full complexity of real-game tackles.
Most of the work in the context of tackling is related to heavy-impact sports. Studies such as those by Klein et al. [24] show that tackling also contributes significantly to injuries in soccer players and that specific analysis and datasets can help reduce the burden on players and gain new insights.
Recognition of a gap in existing soccer datasets, particularly in their lack of coverage of tactical and technical elements such as tackling, shows the need for our specialized tackle dataset. This gap is further highlighted by SoccerNet's incorporation of multi-view foul recognition in their annual challenges, demonstrating a growing interest in this research area [2]. Our dataset is not just an addition to the existing array of soccer data, but it represents a targeted approach with distinct advantages:
• Tackles are a key indicator of defensive abilities. By analyzing tackles, as depicted in Figure 1, we can obtain granular data on defensive actions, which allows for a more nuanced evaluation of player performances.
• Tackles, often dynamic and thrilling, significantly contribute to the spectator experience. Highlighting these moments can enhance fan engagement, especially in digital media.
• Data on tackles can be instrumental in coaching and training, helping players refine their defensive skills with targeted feedback.
• Understanding the mechanics and context of tackles that lead to injuries can guide preventive measures, both in training and during games.
• Data on tackles can inform referees and soccer governing bodies about high-risk scenarios, leading to improved safety rules and enforcement.
The focus on tackles in soccer and the creation of a specific dataset not only address a significant gap in existing soccer analysis methods but also offer a comprehensive view of the game's dynamics. The dataset enriches tactical and performance analysis, enhances fan experiences, and supports player development initiatives.

DATASET CURATION
Sources and Data Collection
TACDEC was curated by capturing and annotating soccer videos from the Norwegian Eliteserien league. In partnership with the Norwegian Professional Football League (Norsk Toppfotball), we used APIs from their partner company Forzasys (https://forzasys.com/) and retrieved video clips with yellow card events, which present a higher probability of containing tackles. Clips were extracted from 70 games in 2021 and 176 games in 2022, giving a total of 246 unique games. All yellow card events for 6 of the 16 teams (Odd, Rosenborg, Sarpsborg, Brann, Viking, and Vålerenga) from the 2021 season and all yellow card events from the 2022 season were extracted. This returned a total of 1015 clips, which were inspected manually to determine whether a tackle occurred, to assess the presence of audio, and to exclude clips where distinguishing between pushing/pulling and tackles was challenging. The total number of clips was thereby reduced to the current dataset size of 425. These 425 clips include a total of 836 annotated events, covering all teams involved in each season. Due to relegation, some teams only have entries in one season. Figure 2 depicts dataset statistics, including a histogram of video durations and the distribution of event classes by occurrence and duration.

Annotation
To label our dataset, we have defined a tackling sequence with start and end frames as follows:
Start frame: The moment the defender redistributes their weight onto the leg that is used to launch into the tackle, initiating the movement.
End frame: The tackle is considered complete when one of the two following conditions is met: (1) the forward motion of the tackling player has stopped, or (2) if airborne, the tackled player makes contact with the ground.
Tackle classes (and corresponding class labels) are defined as follows:
I tackle-live: A "real-time" tackle in a broadcast video. A specific tackle can only be a live tackle once, even though there could be multiple unique live tackles in one clip.
II tackle-replay: When a tackle is replayed during the same clip. The replay could be in slow motion or zoomed in, and there could be multiple replays for each live tackle.
III tackle-live-incomplete: A live tackle for which the start or end frame is not visible according to the definition of a tackle. This could be due to clipping artifacts or players being off-screen.
IV tackle-replay-incomplete: A tackle replay for which the start or end frame is not visible according to the definition of a tackle. This could be due to clipping artifacts or players being off-screen.
V background: When none of the four other class labels are applicable.
LabelBox [1] was used to manually annotate frames due to its feature allowing for double review or verification of all annotations. Here, we iterated through the frames of each video and labeled them according to the classes above. Labels were then exported from LabelBox and post-processed in terms of cleaning and removing excess information. Overall, among the 425 labeled videos, each video contains at least one event, often multiple. Figure 3 visually illustrates two tackle events. Approximately 68% of the videos feature labeled replay tackles, while 60% have tackles occurring in real time. 23% of the videos contain incomplete replay sequences, and 5% showcase incomplete live tackles. It is important to note that despite the presence of incomplete tackles within a video, multiple tackle sequences are typically observed.

Final Output
The curated dataset comprises a total of 425 labeled videos. The average length per clip is 0.43 minutes, translating to 25.8 seconds. Table 1 presents the number of samples in the final dataset per team and season. To prevent each sample from being counted twice, we exclusively associated each sample with the home team. Each video in the dataset is accompanied by a separate JSON file (Listing 1) that contains detailed information about the labeled sequences within the video (e.g., class label, start and end time for each sequence). The TACDEC dataset can be found on Hugging Face (https://huggingface.co/datasets/SimulaMet-HOST/TACDEC).
Figure 3: Sample frames from a tackle-live event with a side block tackle (top), and sample frames from a tackle-replay event with a sliding tackle (bottom). Starting frames in (a) and (g), ending frames in (f) and (l).

{
  "id": "1234_qwerty",
  "metadata": {
    "game-id": 1234,
    "date": "2023-02-01",
    "team-home": { "name": "TeamH", "id": 12 },
    "team-away": { "name": "TeamA", "id": 34 },
    "clip-id": "qwerty"
  },
  "media_attributes": {
    "height": 720,
    "width": 1280,
    "mime_type": "video/mp4",
    "frame_rate": 25,
    "frame_count": 650
  },
  "events": [
    { "start_frame": 200, "end_frame": 300, "type": "tackle_live" },
    { "start_frame": 325, "end_frame": 450, "type": "tackle_live" },
    { "start_frame": 520, "end_frame": 600, "type": "tackle_replay" }
  ]
}

DATASET USE CASE DEMONSTRATION
In this section, we aim to demonstrate the validity and utility of our dataset by applying it to a straightforward classification task. As a proof of concept, we focus on the automatic detection of tackle actions within video frames. While recognizing that this application represents a relatively narrow scope of the dataset's potential, it serves as a practical entry point due to the comparative ease of evaluating classification tasks. Through this methodological approach, we aim to validate the robustness of the dataset and establish a foundation for its application in more complex and varied research domains.
The initial stage of our pipeline involves the pre-processing of video frames, such as resizing and transforming, to ensure a standardized format and dimensionality across all inputs. This is crucial for maintaining consistency when feeding data into the feature extraction model.
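A minimal sketch of such a pre-processing step is shown below. The target size and normalization constants are assumptions (ImageNet statistics, commonly used with DINOv2-style backbones); the text above only states that frames are resized and transformed to a standard format.

```python
import numpy as np

# Assumed values, not specified in the paper.
TARGET_H, TARGET_W = 224, 224
MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])


def preprocess_frame(frame):
    """Resize an HxWx3 uint8 frame (nearest neighbour) and normalize to float.

    Returns a TARGET_H x TARGET_W x 3 float array, standardized per channel.
    """
    h, w, _ = frame.shape
    rows = np.arange(TARGET_H) * h // TARGET_H
    cols = np.arange(TARGET_W) * w // TARGET_W
    resized = frame[rows][:, cols].astype(np.float32) / 255.0
    return (resized - MEAN) / STD
```

In practice, a library resize (e.g., bilinear) would replace the nearest-neighbour indexing; the point is that every 1280x720 broadcast frame ends up with identical dimensions and value range before feature extraction.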

Feature Extraction
Following pre-processing, each video is processed on a frame-by-frame basis using the DINOv2 [35] model to extract a set of rich, descriptive features. This model is adept at distilling significant visual information from individual frames, producing a feature vector for each frame. To accommodate videos of varying lengths, a fixed-length feature matrix is constructed for each video. This matrix is populated with the extracted features, and padding frames are utilized as necessary to standardize the matrix size across all videos, ensuring that each video, regardless of its original length, is represented by a feature matrix of uniform dimensions.
Using the DINOv2 model, we extract characteristics from video frames, where each frame is individually processed to produce a feature tensor of dimensions [257, 1024]: a 1024-dimensional representation for each of 256 regions within the frame, plus a CLS token. The extraction process capitalizes on the model's capacity to distill comprehensive visual representations, where we specifically retain the output of the last layers to harness rich features conducive to various research domains. This approach facilitates a robust foundation for feature representation, accommodating a broad spectrum of downstream tasks beyond our primary focus. The version used during the feature extraction for this work is the Hugging Face implementation of Facebook's DINOv2-large, with around 302M parameters.
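One plausible way to build the fixed-length per-video matrix described above is sketched below, keeping only the CLS token of each [257, 1024] frame feature for compactness (the released features include all 257 tokens; the cap `max_frames` is a hypothetical value, as the paper does not state the exact padded length):

```python
import numpy as np

FEATURE_DIM = 1024  # dimensionality of each DINOv2-large token


def pad_video_features(frame_features, max_frames):
    """Stack per-frame CLS features into a fixed [max_frames, 1024] matrix.

    frame_features: list of [257, 1024] arrays, one per frame
                    (index 0 is the CLS token).
    Clips longer than max_frames are truncated; shorter clips are
    zero-padded so every video has uniform dimensions.
    """
    cls_per_frame = [f[0] for f in frame_features[:max_frames]]
    matrix = np.zeros((max_frames, FEATURE_DIM), dtype=np.float32)
    if cls_per_frame:
        matrix[: len(cls_per_frame)] = np.stack(cls_per_frame)
    return matrix
```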

Multi-Class Classification
Once the feature extraction phase is complete, we construct a frame-classification dataset comprising 35,609 (13.10%) randomly sampled frame instances across 3 distinct classes ("background", "tackle-live", "tackle-replay"), leaving 236,293 of the 271,902 frames out. For the classification, we utilize only the initial token (CLS) in the feature sequence, excluding the remaining 256 of the 257 available tokens. It is also worth noting that instances of the "incomplete" classes were merged into their respective complete classes, since a frame-by-frame classifier cannot learn the temporal features needed to distinguish a frame inside an incomplete tackle from one inside a complete tackle. The division of the dataset into training and testing subsets adheres to an 80:20 ratio, with an allocation strategy that ensures that frames from an individual clip are exclusively assigned to either subset. This partitioning scheme is pivotal in preserving the integrity of the evaluation process, preventing informational leakage between the training and testing phases, and ensuring a realistic assessment of the model's performance on unseen data.
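The clip-exclusive split can be sketched as follows (a hypothetical helper; the actual partitioning code is published with the dataset). The key point is that clips, not frames, are shuffled and assigned, so no clip contributes frames to both subsets:

```python
import random


def split_by_clip(frame_records, test_ratio=0.2, seed=42):
    """Split frame records into train/test so no clip spans both subsets.

    frame_records: list of (clip_id, frame_index, label) tuples.
    """
    clip_ids = sorted({clip_id for clip_id, _, _ in frame_records})
    rng = random.Random(seed)
    rng.shuffle(clip_ids)
    n_test = max(1, int(len(clip_ids) * test_ratio))
    test_clips = set(clip_ids[:n_test])
    train = [r for r in frame_records if r[0] not in test_clips]
    test = [r for r in frame_records if r[0] in test_clips]
    return train, test
```

Splitting at the frame level instead would leak near-duplicate frames from the same clip across both subsets and inflate the measured accuracy.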
The resulting subsets are used to train and evaluate a multi-layered neural network designed for classification, characterized by a minimalist architecture incorporating dropout mechanisms and L1 regularization [40] as measures against over-fitting, reinforcing the model's generalization capability. The architecture of this neural network is chosen to optimize performance on the classification task, incorporating three linear layers and ReLU activation functions [30] suited to handling the complexity and variability of the input features. This network uses the extracted features to make per-frame predictions, identifying specific actions or characteristics defined by the classification task. To ensure versatility for other potential downstream applications, the full DINOv2 features for all 271,902 frames in the dataset are made available as part of the public dataset, along with the dataset generation code, feature extraction code, the full specification of the baseline classification architecture, the training process, and the model weights, enhancing transparency and facilitating reproducibility.
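A minimal sketch of such a classifier head is given below. The hidden sizes, dropout probability, and L1 weight are illustrative assumptions, not the published baseline configuration (which ships with the dataset):

```python
import torch
import torch.nn as nn


class TackleClassifier(nn.Module):
    """Three linear layers with ReLU and dropout over the 1024-d CLS feature.

    Layer widths and dropout rate are illustrative choices only.
    """

    def __init__(self, in_dim=1024, hidden=256, num_classes=3, p_drop=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x):
        return self.net(x)


def l1_penalty(model, weight=1e-5):
    """L1 regularization term, added to the cross-entropy loss during training."""
    return weight * sum(p.abs().sum() for p in model.parameters())
```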

Evaluation
Performance Evaluation: The classifiers are evaluated using a range of standard metrics and visualization methods to ensure a comprehensive assessment of their performance [15]. These metrics include accuracy, which measures the overall correctness of the classifier; precision, which assesses the classifier's ability to avoid false positives; recall, which evaluates the classifier's capability to find all relevant instances; F1 score, which provides a balance between precision and recall; the confusion matrix, which visualizes classifier performance across categories; the Area Under the Receiver Operating Characteristic curve (AUC-ROC), which reflects the trade-off between sensitivity and specificity [16]; and the Precision-Recall (PR) curve, which shows the relationship between precision and recall for various threshold settings. Each metric offers insight into different aspects of classifier performance, allowing a holistic evaluation [7]. The classifier was evaluated based on various metrics that are essential to understanding its strengths and weaknesses in the classification of soccer events, as shown in Table 2. The trained classification model outperformed the stratified dummy classifier baseline (random guessing) by a significant margin.
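All of the scalar metrics above derive from the confusion matrix. As a self-contained illustration (the matrix values below are made up, not the results in Table 2):

```python
def per_class_metrics(confusion):
    """Compute per-class precision, recall, and F1 from a confusion matrix.

    confusion[i][j] = number of samples with true class i predicted as class j.
    """
    n = len(confusion)
    metrics = []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(n)) - tp  # column sum minus TP
        fn = sum(confusion[c]) - tp                       # row sum minus TP
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return metrics


def accuracy(confusion):
    """Fraction of all samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total
```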
Overall Performance: The classifier showed varied effectiveness across event types. As seen in Table 2, the lowest F1 score (0.66) was observed for "tackle-live", with a relatively low precision of 0.52, even though 90% of its instances are captured (recall of 0.90). For "background", we see a high precision, indicating that background predictions are correct 93% of the time, with an F1 score reflecting a balanced trade-off between precision and recall. The "tackle-replay" class has an F1 score of 0.88, the highest overall, reflecting strong precision and recall of 0.84 and 0.94, respectively.
Overall Accuracy: The classifier achieved an overall accuracy of 0.81, indicating that, with a more complex model and further fine-tuning, even better results should be achievable.
Confusion Matrix: The confusion matrix in Figure 4a shows that "background" has the highest number of false negatives, indicating that it can be difficult to differentiate between the background and the tackle events. For "tackle-live", we see a lower number of false positives and false negatives compared to "background", suggesting the model distinguishes this class better. "tackle-replay" shows the best performance, with high true-positive counts and far fewer misclassifications, meaning the classifier is most effective at identifying "tackle-replay". These results might be due to "background" and "tackle-live" sharing similar camera settings, whereas "tackle-replay" often features closer zooms.
ROC Curve: A multi-class ROC curve was plotted to analyze performance across the classes and is shown in Figure 4b. The figure shows that the blue line, denoting "background", is the most challenging class to distinguish from the rest, while the curves for the other two classes move closer toward the top-left corner.
Precision-Recall (PR) Curve: A multi-class PR curve was plotted to analyze the performance across the classes and is shown in Figure 4c. The precision for "background" (blue) remains high until a sharp decline at high recall, indicating stable precision until it struggles with positive instances. "tackle-live" (cyan) shows lower and inconsistent precision, dropping with increased recall. "tackle-replay" (orange) maintains high precision across recall levels, highlighting the classifier's effectiveness in distinguishing this class.

Discussion
The TACDEC dataset, centered on tackle events in soccer, opens numerous possibilities for research and practical applications in various domains [3,22,23,26,29,33]. It serves as a key resource for advanced sports analytics, enabling a deeper understanding of defensive strategies and player dynamics, which are crucial for player performance evaluation and team tactics [18]. The dataset on tackle events facilitates injury prevention research by identifying patterns linked to player injuries, informing safer training practices and potential rule changes [3,26,33]. For the computer vision and machine learning community, TACDEC provides a rich ground for developing and testing algorithms aimed at tackle recognition and enhancing automated sports analysis technologies.
Media companies can leverage tackle recognition models to create engaging fan content, while coaches and players could benefit from it as an educational tool to improve defensive skills and game understanding [6,29]. Additionally, it offers insights for social and psychological studies on player behavior and decision-making [17,29]. Lastly, the dataset could assist soccer governing bodies in refining rules and officiating standards, ultimately contributing to the sport's safety and integrity. Overall, the potential of TACDEC to impact a wide array of fields underscores its value as a significant contribution to soccer analytics and the broader sports technology landscape.

CONCLUSION
The development and evaluation of the TACDEC dataset represent a significant step forward in the domain of soccer analytics, specifically in the automatic detection of tackle events. By focusing on a previously under-explored aspect of soccer game analysis, this work fills a gap in sports research. Our proof-of-concept classification model, applied to tackle detection, underscores the dataset's practical value and offers an example of its broader applicability for various research purposes beyond the initial demonstration. As TACDEC becomes available to the research community, we anticipate that it will lead to a wide range of studies in computer vision, game understanding, and analytics, thus contributing to the enhancement of tactical analysis, player performance evaluation, and the overall experience of soccer fans. As future work, we aim to expand the dataset, incorporate additional event types, connect resulting injuries to tackle events, and explore more sophisticated models for event detection.
Figure 2: Dataset statistics: (a) histogram of video duration; (b) event occurrence and duration, showing the distribution of event classes.

Figure 4: (a) Confusion matrix showing the performance of the classifier in predicting action events from the soccer game. (b) Multi-class ROC curve for all event types. (c) Multi-class PR curve for all event types.

Table 1: Number of annotated events in the dataset, per home team and season.
Listing 1: Template for the JSON files in the dataset.

Table 2: Evaluation metrics for the classifier and comparison with a dummy classifier.