
Blasting through the Front-End Bottleneck with Shotgun

Published: 19 March 2018

Abstract

The front-end bottleneck is a well-established problem in server workloads owing to their deep software stacks and large instruction working sets. Despite years of research into effective L1-I and BTB prefetching, state-of-the-art techniques still force a trade-off between performance and metadata storage costs. This work introduces Shotgun, a BTB-directed front-end prefetcher powered by a new BTB organization that maintains a logical map of an application's instruction footprint, which enables high-efficacy prefetching at low storage cost. To map active code regions, Shotgun precisely tracks an application's global control flow (e.g., function and trap routine entry points) and summarizes local control flow within each code region. Because the local control flow enjoys high spatial locality, with most functions consisting of a handful of instruction cache blocks, it lends itself to a compact region-based encoding. Meanwhile, the global control flow is naturally captured by the application's unconditional branch working set (calls, returns, traps). Based on these insights, Shotgun devotes the bulk of its BTB capacity to the branches responsible for the global control flow and to a spatial encoding of their target regions. By effectively capturing a map of the application's instruction footprint in the BTB, Shotgun enables highly effective BTB-directed prefetching. Using a storage budget equivalent to a conventional BTB, Shotgun outperforms the state-of-the-art BTB-directed front-end prefetcher by up to 14% on a set of varied commercial workloads.
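
The abstract does not spell out the region encoding, so the following Python sketch shows only one plausible reading of "a spatial encoding of their target regions": each entry for an unconditional branch keeps its target plus a small bit vector marking which nearby instruction cache blocks were fetched. The names `RegionEntry`, `record_fetch`, and `prefetch_candidates`, the 64-byte block size, and the 8-bit footprint width are all assumptions made for illustration, not details taken from the paper.

```python
from dataclasses import dataclass

BLOCK_BYTES = 64      # assumed instruction cache block size
FOOTPRINT_BITS = 8    # assumed number of blocks tracked after the target block


def block_of(addr: int) -> int:
    """Return the cache-block number that contains a byte address."""
    return addr // BLOCK_BYTES


@dataclass
class RegionEntry:
    """Hypothetical BTB entry for one unconditional branch (call, return, trap)."""
    target: int          # entry point of the target code region
    footprint: int = 0   # bit i set => block (target block + 1 + i) was fetched

    def record_fetch(self, addr: int) -> None:
        """Note that an instruction at addr was fetched while in this region."""
        offset = block_of(addr) - block_of(self.target) - 1
        if 0 <= offset < FOOTPRINT_BITS:
            self.footprint |= 1 << offset

    def prefetch_candidates(self) -> list[int]:
        """Block-aligned addresses to prefetch when this branch hits in the BTB."""
        base = block_of(self.target)
        blocks = [base] + [
            base + 1 + i for i in range(FOOTPRINT_BITS) if self.footprint >> i & 1
        ]
        return [b * BLOCK_BYTES for b in blocks]


# Train one entry on a few fetch addresses, then ask what to prefetch.
entry = RegionEntry(target=0x4000)
for pc in (0x4000, 0x4044, 0x40C8, 0x4400):  # the last fetch is outside the region
    entry.record_fetch(pc)
candidates = entry.prefetch_candidates()
print([hex(a) for a in candidates])  # ['0x4000', '0x4040', '0x40c0']
```

In this sketch a single entry summarizes a whole function-sized region in `FOOTPRINT_BITS` bits. That is the kind of compactness the abstract attributes to local control flow, though the real design may size and lay out its entries differently.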

