Lancet: Better Network Resilience by Designing for Pruned Failure Sets

Published: 17 December 2019
Abstract

Recently, researchers have started exploring the design of route protection schemes that ensure networks can sustain traffic demand without congestion under failures. Existing approaches focus on ensuring that worst-case performance over all simultaneous f-failure scenarios is acceptable. Unfortunately, even a single bad scenario may render these schemes unable to protect against any f-failure scenario. In this paper, we present Lancet, a system designed to handle most failures when not all can be tackled. Lancet comprises three components: (i) an algorithm to analyze which failure scenarios the network can intrinsically handle, which provides a benchmark for any protection routing scheme and guides the design of new schemes; (ii) an approach to efficiently design protection schemes for more general failure sets than all f-failure scenarios; and (iii) techniques to determine which of combinatorially many scenarios to design for. Our evaluations with real topologies and validations on an emulation testbed show that Lancet outperforms a worst-case approach by protecting against many more scenarios, and can even match the scenarios that can be handled by optimal network response.
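
To make the notion of f-failure scenarios concrete, the following is a minimal sketch, not Lancet's algorithm. It assumes a toy four-node topology with a single source-destination demand, and it brute-forces every set of f failed links, using a max-flow check as a stand-in for "the network can intrinsically handle this scenario". All names (`max_flow`, `handled_scenarios`, `TOY_LINKS`) and all capacities are hypothetical and chosen only for illustration.

```python
from collections import deque
from itertools import combinations


def max_flow(links, src, dst):
    """Edmonds-Karp max flow. `links` maps a directed link (u, v) to its capacity."""
    residual = {}
    for (u, v), cap in links.items():
        residual.setdefault(u, {}).setdefault(v, 0)
        residual[u][v] += cap
        residual.setdefault(v, {}).setdefault(u, 0)  # reverse residual edge
    if src not in residual or dst not in residual:
        return 0
    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {src: None}
        queue = deque([src])
        while queue and dst not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if dst not in parent:
            return total
        # Find the bottleneck capacity along the path, then push flow.
        bottleneck = float("inf")
        v = dst
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = dst
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


def handled_scenarios(links, src, dst, demand, f):
    """Return the f-link failure scenarios under which `demand` is still routable."""
    handled = []
    for failed in combinations(sorted(links), f):
        surviving = {e: c for e, c in links.items() if e not in failed}
        if max_flow(surviving, src, dst) >= demand:
            handled.append(failed)
    return handled


# Hypothetical toy topology: two parallel two-hop paths plus a thin cross link.
TOY_LINKS = {
    ("s", "a"): 10, ("s", "b"): 10,
    ("a", "t"): 10, ("b", "t"): 10,
    ("a", "b"): 5,
}
DEMAND = 10

single = handled_scenarios(TOY_LINKS, "s", "t", DEMAND, f=1)
double = handled_scenarios(TOY_LINKS, "s", "t", DEMAND, f=2)
print(len(single), "of 5 single-link scenarios handled")
print(len(double), "of 10 double-link scenarios handled")
```

In this toy network every single-link failure is survivable, but some double-link failures are not, so a design that insists on covering all 2-failure scenarios has no feasible answer even though most of those scenarios could be handled. Brute-force enumeration like this grows combinatorially with f and network size, which is why the abstract treats scenario selection as a problem in its own right.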
