Abstract
Cloud providers have begun to offer their surplus capacity in the form of low-cost transient servers, which can be revoked unilaterally at any time. While the low cost of transient servers makes them attractive for a wide range of applications, such as data processing and scientific computing, failures due to server revocation can severely degrade application performance. Since different transient server types offer different cost and availability tradeoffs, we introduce server portfolios, a notion based on financial portfolio modeling. Server portfolios enable the construction of an "optimal" mix of servers that matches an application's sensitivity to cost and revocation risk. We implement model-driven portfolios in a system called ExoSphere, and show how diverse applications can use portfolios and application-specific policies to gracefully handle transient servers. We show that ExoSphere enables widely used parallel applications such as Spark, MPI, and BOINC to be made transiency-aware with modest effort. Our experiments show that, by allowing applications to use suitable transiency-aware policies, ExoSphere achieves 80% cost savings compared to on-demand servers and greatly reduces revocation risk compared to existing approaches.
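To make the portfolio idea concrete, the following is a minimal sketch of one plausible formulation in the spirit of mean-variance portfolio selection. It is not taken from ExoSphere and may differ from the model the paper uses. It assumes each transient server market has a known expected cost and that revocation risk is summarized by a covariance matrix. The function names, the `risk_aversion` parameter, and all numbers are hypothetical. The sketch minimizes expected cost plus `risk_aversion` times portfolio variance over weights that are non-negative and sum to one, using projected gradient descent.

```python
from typing import List


def project_to_simplex(v: List[float]) -> List[float]:
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(sorted(v, reverse=True), start=1):
        cumulative += u_k
        candidate = (cumulative - 1.0) / k
        if u_k - candidate > 0:
            theta = candidate  # the last k satisfying the test gives the shift
    return [max(x - theta, 0.0) for x in v]


def build_portfolio(costs: List[float],
                    covariance: List[List[float]],
                    risk_aversion: float,
                    iterations: int = 2000) -> List[float]:
    """Minimize costs.w + risk_aversion * w' C w over the simplex.

    All inputs are illustrative assumptions: costs[i] is the expected cost
    of market i, covariance[i][j] models how revocations co-occur.
    """
    n = len(costs)
    weights = [1.0 / n] * n  # start from an equal split
    # Upper bound on the gradient's Lipschitz constant (max absolute row sum).
    lipschitz = 2.0 * risk_aversion * max(
        sum(abs(x) for x in row) for row in covariance)
    step = 1.0 / max(lipschitz, 1.0)
    for _ in range(iterations):
        gradient = [
            costs[i] + 2.0 * risk_aversion *
            sum(covariance[i][j] * weights[j] for j in range(n))
            for i in range(n)
        ]
        weights = project_to_simplex(
            [weights[i] - step * gradient[i] for i in range(n)])
    return weights


# Hypothetical markets: the cheapest one is also the most revocation-prone.
example_costs = [0.2, 0.3, 0.5]
example_covariance = [[0.04, 0.0, 0.0],
                      [0.0, 0.01, 0.0],
                      [0.0, 0.0, 0.005]]
for alpha in (0.0, 50.0):
    mix = build_portfolio(example_costs, example_covariance, alpha)
    print(alpha, [round(w, 3) for w in mix])
```

With `risk_aversion` set to zero the sketch concentrates on the cheapest market, while larger values spread the allocation toward markets with lower revocation variance. This is the cost-versus-risk sensitivity the abstract describes.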
Portfolio-driven Resource Management for Transient Cloud Servers