research-article

I can't believe it's not (only) software!: bionic distributed storage for Parquet files

Published: 01 August 2019

Abstract

There is a steady increase in the size of data stored and processed as part of data science applications, leading to bottlenecks and inefficiencies at various layers of the stack. One way of reducing such bottlenecks and increasing energy efficiency is by tailoring the underlying distributed storage solution to the application domain, using resources more efficiently. We explore this idea in the context of a popular column-oriented storage format used in big data workloads, namely Apache Parquet.

Our prototype uses an FPGA-based storage node that offers high-bandwidth data deduplication and a companion software library that exposes an API for Parquet file access. This way, the storage node remains general purpose and could be shared by applications from different domains while still benefiting from deduplication well suited to Apache Parquet files and from selective reads of columns in the file.
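
To make the idea of deduplication in a storage node concrete, the following is a minimal in-memory sketch, not the prototype's implementation. It assumes content-addressed storage keyed by a SHA-256 digest; the class name `DedupStore`, the key naming scheme, and the 4 KiB page size are all hypothetical choices made for illustration, and the sketch omits reference counting and deletion.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store (hypothetical, in-memory only)."""

    def __init__(self):
        self.blocks = {}  # digest -> payload; each unique payload is kept once
        self.index = {}   # user key -> digest of the payload it refers to

    def put(self, key, value):
        digest = hashlib.sha256(value).digest()
        # Keep the payload only if this content has not been seen before.
        self.blocks.setdefault(digest, value)
        self.index[key] = digest

    def get(self, key):
        return self.blocks[self.index[key]]

    def stored_bytes(self):
        return sum(len(block) for block in self.blocks.values())


store = DedupStore()
page = b"\x00" * 4096
store.put("file_a/col0/page0", page)
store.put("file_b/col0/page0", page)            # same content, different key
store.put("file_b/col1/page0", b"\x01" * 4096)  # new content
```

In this sketch three pages are written but only two distinct payloads are stored, so `store.stored_bytes()` reflects two pages rather than three.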

In this demonstration we show, on the one hand, that by relying on the FPGA's dataflow processing model, it is possible to implement in-line deduplication without increasing latencies significantly or reducing throughput. On the other hand, we highlight the benefits of implementing the application-specific aspects in a software library instead of FPGA circuits and how this enables, for instance, regular data science frameworks running in Python to access the data on the storage node and to offload filtering operations.
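
The division of labor described above, with a client library that understands columns and a storage node that can filter data before returning it, might look roughly like the sketch below. Everything here is an assumption for illustration: `ToyStorageNode`, `write_table`, `read_columns`, and the `table/column` key scheme are hypothetical names, the node is a local dictionary rather than a networked device, and the predicate is a Python callable standing in for whatever filter description a real node would accept. Real Parquet files also carry row groups, encodings, and footer metadata that this sketch ignores.

```python
class ToyStorageNode:
    """Stand-in for a remote key-value storage node (hypothetical)."""

    def __init__(self):
        self.kv = {}

    def put(self, key, values):
        self.kv[key] = list(values)

    def get(self, key):
        return self.kv[key]

    def get_filtered(self, key, predicate):
        # Filtering happens "on the node": only matching (row, value)
        # pairs are returned to the caller.
        return [(i, v) for i, v in enumerate(self.kv[key]) if predicate(v)]


def write_table(node, name, table):
    """Store each column under its own key so it can be fetched alone."""
    for column, values in table.items():
        node.put(f"{name}/{column}", values)


def read_columns(node, name, columns):
    """Selective read: fetch only the requested columns."""
    return {c: node.get(f"{name}/{c}") for c in columns}


node = ToyStorageNode()
write_table(node, "trips", {
    "distance_km": [1.2, 8.5, 3.0, 12.4],
    "fare": [4.0, 21.5, 9.0, 30.0],
    "driver": ["a", "b", "a", "c"],
})

fares = read_columns(node, "trips", ["fare"])
long_trips = node.get_filtered("trips/distance_km", lambda d: d > 5.0)
```

Here `fares` contains only the `fare` column, and `long_trips` holds the row positions and values whose distance exceeds 5 km, so the untouched columns never leave the node.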


  • Published in

    Proceedings of the VLDB Endowment, Volume 12, Issue 12
    August 2019
    547 pages

    Publisher

    VLDB Endowment
