Abstract
The front-end bottleneck is a well-established problem in server workloads owing to their deep software stacks and large instruction working sets. Despite years of research into effective L1-I and BTB prefetching, state-of-the-art techniques force a trade-off between performance and metadata storage costs. This work introduces Shotgun, a BTB-directed front-end prefetcher powered by a new BTB organization that maintains a logical map of an application's instruction footprint, which enables high-efficacy prefetching at low storage cost. To map active code regions, Shotgun precisely tracks an application's global control flow (e.g., function and trap routine entry points) and summarizes local control flow within each code region. Because the local control flow enjoys high spatial locality, with most functions comprised of a handful of instruction cache blocks, it lends itself to a compact region-based encoding. Meanwhile, the global control flow is naturally captured by the application's unconditional branch working set (calls, returns, traps). Based on these insights, Shotgun devotes the bulk of its BTB capacity to branches responsible for the global control flow and a spatial encoding of their target regions. By effectively capturing a map of the application's instruction footprint in the BTB, Shotgun enables highly effective BTB-directed prefetching. Using a storage budget equivalent to a conventional BTB, Shotgun outperforms the state-of-the-art BTB-directed front-end prefetcher by up to 14% on a set of varied commercial workloads.
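The region-based encoding described above can be illustrated with a small sketch. The abstract does not give the actual entry layout, so everything here is an assumption made for illustration: the names `RegionEntry` and `RegionBTB`, the 64-byte block size, the 8-block window, and the bit-vector footprint are all hypothetical. The sketch only shows the general idea of keying entries by unconditional branches and summarizing the blocks touched in each target region. It is not Shotgun's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

BLOCK_BYTES = 64     # assumed instruction cache block size
REGION_BLOCKS = 8    # assumed number of blocks tracked per target region


@dataclass
class RegionEntry:
    """Hypothetical entry for one unconditional branch (call, return, trap)."""
    target: int          # branch target, i.e. the entry point of a code region
    footprint: int = 0   # bit i set => block (target block + i) was fetched

    def record_access(self, addr: int) -> None:
        """Mark the block holding addr if it falls inside the tracked window."""
        offset = addr // BLOCK_BYTES - self.target // BLOCK_BYTES
        if 0 <= offset < REGION_BLOCKS:
            self.footprint |= 1 << offset

    def prefetch_candidates(self) -> List[int]:
        """Block-aligned addresses recorded in the footprint."""
        base = (self.target // BLOCK_BYTES) * BLOCK_BYTES
        return [base + i * BLOCK_BYTES
                for i in range(REGION_BLOCKS)
                if (self.footprint >> i) & 1]


class RegionBTB:
    """Toy map from unconditional-branch PC to a summary of its target region."""

    def __init__(self) -> None:
        self.entries: Dict[int, RegionEntry] = {}
        self.current: Optional[RegionEntry] = None

    def on_unconditional_branch(self, pc: int, target: int) -> List[int]:
        """Return blocks to prefetch for the target region, allocating on a miss."""
        entry = self.entries.get(pc)
        if entry is None or entry.target != target:
            entry = RegionEntry(target)
            self.entries[pc] = entry
        self.current = entry
        return entry.prefetch_candidates()

    def on_fetch(self, addr: int) -> None:
        """Fold a fetched address into the footprint of the active region."""
        if self.current is not None:
            self.current.record_access(addr)


btb = RegionBTB()
first = btb.on_unconditional_branch(0x1000, 0x4000)   # cold: nothing recorded yet
for addr in (0x4000, 0x4044, 0x40C0, 0x8000):         # 0x8000 is outside the window
    btb.on_fetch(addr)
second = btb.on_unconditional_branch(0x1000, 0x4000)  # warm: footprint replayed
print(first, [hex(a) for a in second])
```

In this toy model, the second visit to the same call site yields the three blocks touched during the first visit, which is the sense in which a compact footprint can stand in for detailed local control flow.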
Blasting through the Front-End Bottleneck with Shotgun
ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems