Abstract
Discrete event simulations model the behavior of complex, real-world systems. Simulating a wide range of events and conditions provides a more nuanced model, but also increases its computational footprint. To manage these processing requirements in a scalable manner, discrete event simulations can be distributed across multiple computing resources. Orchestrating the simulations in a distributed setting involves coping with resource uncertainty. We consider three key aspects of resource uncertainty: resource failures, heterogeneity, and slowdowns. Each of these aspects is managed autonomously, which involves making accurate predictions of future execution times and latencies while also accounting for differences in hardware capabilities and dynamic resource consumption profiles. Further complicating matters, individual tasks within the simulation are stateful and stochastic, requiring inter-task communication and synchronization to produce accurate outcomes. We deal with these challenges through intelligent state collection and migration, active resource monitoring, and empirical evaluation of resource capabilities under changing conditions. To underscore the viability of our solution, we provide benchmarks using a production discrete event simulation that can simultaneously sustain failures, manage resource heterogeneity, and handle slowdowns while being orchestrated by our framework.
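As an illustration of the kind of monitoring described above, the following is a minimal sketch of slowdown detection that compares each observed task execution time against a predicted one. It is not the framework's actual method: the exponentially weighted moving average predictor, the class names `ExecutionTimePredictor` and `SlowdownMonitor`, the smoothing factor `alpha`, and the `threshold` multiplier are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ExecutionTimePredictor:
    """Predicts the next execution time as an exponentially weighted
    moving average of past observations (an assumed, simple predictor)."""
    alpha: float = 0.3  # hypothetical smoothing factor
    estimate: Optional[float] = None

    def predict(self) -> Optional[float]:
        # None until at least one observation has been recorded.
        return self.estimate

    def observe(self, seconds: float) -> None:
        if self.estimate is None:
            self.estimate = seconds
        else:
            self.estimate = self.alpha * seconds + (1 - self.alpha) * self.estimate


@dataclass
class SlowdownMonitor:
    """Flags a resource when a task runs much longer than predicted."""
    threshold: float = 1.5  # hypothetical: flag if > 1.5x the prediction
    predictors: Dict[str, ExecutionTimePredictor] = field(default_factory=dict)

    def record(self, resource: str, seconds: float) -> bool:
        predictor = self.predictors.setdefault(resource, ExecutionTimePredictor())
        predicted = predictor.predict()
        slow = predicted is not None and seconds > self.threshold * predicted
        if not slow:
            # Only normal observations update the baseline, so a sustained
            # slowdown does not silently become the new expectation.
            predictor.observe(seconds)
        return slow
```

Calling `record("node-a", seconds)` after each task completes returns `True` when that resource appears to have slowed down. Keeping one predictor per resource is one simple way to account for heterogeneous hardware, since each resource is judged only against its own history.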
Autonomous Orchestration of Distributed Discrete Event Simulations in the Presence of Resource Uncertainty