Abstract
Live video broadcasting normally requires a wide range of skills and domain expertise to coordinate multi-camera productions. As the number of cameras keeps increasing, directing a live sports broadcast has become more complicated and challenging than ever before: broadcast directors must stay highly focused, responsive, and knowledgeable throughout the production. To relieve directors of this intensive workload, we develop an automated sports broadcast directing system, called Smart Director, which mimics the typical human-in-the-loop broadcasting process and uses a set of multi-view video analysis algorithms to automatically create near-professional broadcast programs in real time. Inspired by the so-called “three-event” construction of sports broadcasts [14], we build our system as an event-driven pipeline consisting of three consecutive novel components: (1) Multi-View Event Localization, which detects events by modeling correlations across camera views; (2) Multi-View Highlight Detection, which ranks camera views by visual importance for view selection; and (3) the Auto-Broadcasting Scheduler, which controls the production of the broadcast video. To the best of our knowledge, our system is the first end-to-end automated directing system for multi-camera sports broadcasting that is driven entirely by the semantic understanding of sports events. It is also the first system to address the novel problem of multi-view joint event detection through cross-view relation modeling. Objective and subjective evaluations on a real-world multi-camera soccer dataset demonstrate that the quality of our auto-generated videos is comparable to that of human-directed videos. Thanks to its faster response, our system also captures more fast-passing and short-duration events, which human directors usually miss.
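The event-driven pipeline described above can be illustrated with a toy sketch of how detected events might gate camera selection. This is not the Smart Director implementation. The function names `localize_events` and `schedule`, the averaging of per-view event probabilities (a simple stand-in for the learned cross-view relation modeling), the 0.5 threshold, the per-frame highlight scores, and the minimum-shot-length rule are all hypothetical assumptions chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Event:
    start: int  # first frame of the event (inclusive)
    end: int    # frame after the event ends (exclusive)


def localize_events(event_probs: Dict[str, List[float]],
                    threshold: float = 0.5) -> List[Event]:
    """Hypothetical stand-in for Multi-View Event Localization: average the
    per-view, per-frame event probabilities (NOT the paper's cross-view
    relation model) and return maximal runs of frames at or above threshold."""
    views = list(event_probs.values())
    n_frames = len(views[0])
    fused = [sum(v[t] for v in views) / len(views) for t in range(n_frames)]
    events: List[Event] = []
    start: Optional[int] = None
    for t, score in enumerate(fused):
        if score >= threshold and start is None:
            start = t
        elif score < threshold and start is not None:
            events.append(Event(start, t))
            start = None
    if start is not None:  # close an event that runs to the last frame
        events.append(Event(start, n_frames))
    return events


def schedule(events: List[Event], highlight: Dict[str, List[float]],
             default_view: str, min_shot: int = 2) -> List[str]:
    """Hypothetical stand-in for highlight ranking plus scheduling: inside an
    event, target the view with the highest highlight score; otherwise target
    the default view. Cut only after the current shot has lasted min_shot frames."""
    n_frames = len(next(iter(highlight.values())))
    in_event = [False] * n_frames
    for e in events:
        for t in range(e.start, e.end):
            in_event[t] = True
    timeline: List[str] = []
    current, held = default_view, 0
    for t in range(n_frames):
        if in_event[t]:
            target = max(highlight, key=lambda v: highlight[v][t])
        else:
            target = default_view
        if target != current and held >= min_shot:
            current, held = target, 0  # cut to the new view
        timeline.append(current)
        held += 1
    return timeline


# Toy input: three hypothetical camera views over eight frames.
event_probs = {
    "wide":  [0.1, 0.2, 0.7, 0.8, 0.9, 0.3, 0.2, 0.1],
    "left":  [0.1, 0.1, 0.6, 0.9, 0.8, 0.2, 0.1, 0.0],
    "right": [0.0, 0.3, 0.5, 0.7, 0.7, 0.1, 0.3, 0.2],
}
highlight = {
    "wide":  [0.5] * 8,
    "left":  [0.2, 0.2, 0.9, 0.9, 0.3, 0.2, 0.2, 0.2],
    "right": [0.1, 0.1, 0.4, 0.6, 0.95, 0.1, 0.1, 0.1],
}
events = localize_events(event_probs)
timeline = schedule(events, highlight, default_view="wide")
print(events)    # [Event(start=2, end=5)]
print(timeline)  # ['wide', 'wide', 'left', 'left', 'right', 'right', 'wide', 'wide']
```

Holding each shot for at least `min_shot` frames is one plausible way to avoid rapid cutting between cameras; it is an assumption of this sketch, not a rule stated in the abstract. Note how the held shot delays the return to the wide view by one frame after the event ends.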
References
- [1] 2013. Soccer, broadcasting, and narrative: On televising a live soccer match. Communication & Sport 1, 4 (2013), 326–341.
- [2] 2017. SST: Single-stream temporal action proposals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2911–2920.
- [3] 2019. Exploring object relation in mean teacher for cross-domain detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11457–11466.
- [4] 2013. Computational sports broadcasting: Automated director assistance for live sports. In 2013 IEEE International Conference on Multimedia and Expo (ICME’13). IEEE, 1–6.
- [5] 2019. Sports camera calibration via synthetic data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops.
- [6] 2019. Learning sports camera selection from internet videos. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV’19). IEEE, 1682–1691.
- [7] 2018. Camera selection for broadcasting soccer games. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV’18). IEEE, 427–435.
- [8] 2019. Mocycle-GAN: Unpaired video-to-video translation. In Proceedings of the 27th ACM International Conference on Multimedia. 647–655.
- [9] 2009. Automatic broadcast video generation for ball sports from multiple views. In Proceedings of the Korean Society of Broadcast Engineers Conference. The Korean Institute of Broadcast and Media Engineers, 193–198.
- [10] 2011. Multi-camera scheduling for video production. In 2011 Conference for Visual Media Production. IEEE, 11–20.
- [11] 2019. Relation distillation networks for video object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 7023–7032.
- [12] 2013. Temporal localization of actions with actoms. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 11 (2013), 2782–2795.
- [13] 2018. CTAP: Complementary temporal action proposal generation. In Proceedings of the European Conference on Computer Vision (ECCV’18). 68–83.
- [14] 2018. Playing for Keeps: Sport, the Media and Society. Hybrid Publishers.
- [15] 2017. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1125–1134.
- [16] 2016. An efficient framework for automatic highlights generation from sports videos. IEEE Signal Processing Letters 23, 7 (2016), 954–958.
- [17] 2018. Exploiting web images for video highlight detection with triplet deep ranking. IEEE Transactions on Multimedia 20, 9 (2018), 2415–2426.
- [18] 2017. Deep sequential context networks for action prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1473–1481.
- [19] 2017. Computational video editing for dialogue-driven scenes. ACM Transactions on Graphics 36, 4 (2017), Article 130.
- [20] 2018. Automatic camera selection in the context of basketball game. In International Conference on Image and Signal Processing. Springer, 72–79.
- [21] 2019. psDirector: An automatic director for watching view generation from panoramic soccer video. In International Conference on Multimedia Modeling. Springer, 218–230.
- [22] 2019. Learning click-based deep structure-preserving embeddings with visual attention. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 15, 3 (2019), 1–19.
- [23] 2018. Jointly localizing and describing events for dense video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7492–7500.
- [24] 2019. Deep metric learning with density adaptivity. IEEE Transactions on Multimedia 22, 5 (2019), 1285–1297.
- [25] 2017. Single shot temporal action detection. In Proceedings of the 25th ACM International Conference on Multimedia. 988–996.
- [26] 2019. Gaussian temporal awareness networks for action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 344–353.
- [27] 2019. Pointly-supervised action localization. International Journal of Computer Vision 127, 3 (2019), 263–281.
- [28] 2013. Action and event recognition with Fisher vectors on a compact feature set. In Proceedings of the IEEE International Conference on Computer Vision. 1817–1824.
- [29] 2015. Television Sports Production. CRC Press.
- [30] 2016. Learning deep intrinsic video representation by exploring temporal coherence and graph structure. In IJCAI. 3832–3838.
- [31] 2015. Semi-supervised hashing with semantic confidence for large scale visual search. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. 53–62.
- [32] 2020. X-linear attention networks for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10971–10980.
- [33] 2014. Click-through-based subspace learning for image search. In Proceedings of the 22nd ACM International Conference on Multimedia. 233–236.
- [34] 2000. Diatom autofocusing in brightfield microscopy: A comparative study. In Proceedings 15th International Conference on Pattern Recognition (ICPR-2000), Vol. 3. IEEE, 314–317.
- [35] 2014. Category-specific video summarization. In European Conference on Computer Vision. Springer, 540–555.
- [36] 2017. Learning spatio-temporal representation with pseudo-3D residual networks. In Proceedings of the IEEE International Conference on Computer Vision. 5533–5541.
- [37] 2015. Automatic summarization of soccer highlights using audio-visual descriptors. SpringerPlus 4, 1 (2015), 1–19.
- [38] 2000. Automatically extracting highlights for TV baseball programs. In Proceedings of the 8th ACM International Conference on Multimedia. 105–115.
- [39] 2017. A survey of content-aware video analysis for sports. IEEE Transactions on Circuits and Systems for Video Technology 28, 5 (2017), 1212–1231.
- [40] 2016. Temporal action localization in untrimmed videos via multi-stage CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1049–1058.
- [41] 2019. Relational action forecasting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 273–283.
- [42] 2008. Automatic composition of broadcast sports video. Multimedia Systems 14, 4 (2008), 179–193.
- [43] 2004. Automatic replay generation for soccer video broadcasting. In Proceedings of the 12th Annual ACM International Conference on Multimedia. 32–39.
- [44] 2016. Personal multi-view viewpoint recommendation based on trajectory distribution of the viewing target. In Proceedings of the 24th ACM International Conference on Multimedia. 471–475.
- [45] 2014. Context-dependent viewpoint sequence recommendation system for multi-view video. In 2014 IEEE International Symposium on Multimedia. IEEE, 195–202.
- [46] 2019. Less is more: Learning highlight detection from video duration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1258–1267.
- [47] 2016. Highlight detection with pairwise deep ranking for first-person video summarization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- [48] 2018. Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV’18). 684–699.
- [49] 2019. Hierarchy parsing for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2621–2629.
- [50] 2020. SeCo: Exploring sequence supervision for unsupervised representation learning. CoRR abs/2008.00975. arXiv:2008.00975 https://arxiv.org/abs/2008.00975
- [51] 1997. Lane boundary detection using a multiresolution Hough transform. In Proceedings of International Conference on Image Processing, Vol. 2. IEEE, 748–751.
- [52] 2016. Video summarization with long short-term memory. In European Conference on Computer Vision. Springer, 766–782.
- [53] 2018. Retrospective encoders for video summarization. In Proceedings of the European Conference on Computer Vision (ECCV’18). 383–399.
- [54] 2020. Robust visual object tracking with two-stream residual convolutional networks. CoRR abs/2005.06536. arXiv:2005.06536 https://arxiv.org/abs/2005.06536
- [55] 2017. FaceBoxes: A CPU real-time face detector with high accuracy. In 2017 IEEE International Joint Conference on Biometrics (IJCB’17). IEEE, 1–9.
- [56] 2019. Dance with flow: Two-in-one stream action detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9935–9944.
- [57] 2017. Temporal action detection with structured segment networks. In Proceedings of the IEEE International Conference on Computer Vision. 2914–2923.
- [58] 2020. iDirector: An intelligent directing system for live broadcast. In Proceedings of the 28th ACM International Conference on Multimedia. 4545–4547.
Smart Director: An Event-Driven Directing System for Live Broadcasting