Applications of Sketching and Pathways to Impact

Data summaries (a.k.a., sketches) are compact data structures that can be updated flexibly and efficiently to capture certain properties of a data set. Well-known examples include set summaries (Bloom Filters) and cardinality estimators (e.g., Hyperloglog), amongst others. PODS and SIGMOD have been home to many papers on sketching, including several best paper recipients. Sketch algorithms have emerged from the theoretical research community, but have found wide impact in practice. This paper describes some of the impacts that sketches have had, from online advertising to privacy-preserving data analysis. It will consider the range of different strategies that researchers can follow to encourage the adoption of their work, and what has and has not worked for sketches as a case study.


INTRODUCTION
The concept of a "sketch" of data is that of a compact data summary that is designed to capture certain properties of the input data. The sketch comprises a description of the data structure, and algorithms that can update the structure and query it. This can include routines to update the sketch with a single new piece of information (capturing a streaming model of data processing), or to merge together two sketches (capturing a distributed model of data processing).
The queries that are supported by a sketch usually ask for an efficient approximation of some function of the input data. For instance, some sketches report the cardinality of the set of input items that they have processed, leading to the count distinct (a.k.a. $F_0$) sketches. Sketches for dimensionality reduction approximate the Euclidean norm (or other norm) of their input, interpreted as a high-dimensional vector. Other sketches represent frequency histograms in order to answer heavy hitter or quantile queries. Sketches can also capture properties of more complex data types, such as graphs and matrices.
Sketching algorithms make use of a number of basic algorithmic concepts. Deterministic sketches track counts and other simple statistics of the input data in order to give exact or approximate results. But the majority of sketches use randomization to provide a summary that obtains an accurate approximation with high probability. Key techniques include random sampling, or other probabilistic ways to select information about the set of input items; and hashing, to select items from the input in a random but repeatable way. These are combined with tracking counts or setting bits in a particular fashion. From this collection of methods, a wide variety of techniques can flourish. For instance, both the Bloom filter and Hyperloglog summary use hashing and bit vectors to represent the input, but for quite different purposes (tracking approximate set membership, and approximate set cardinality, respectively).
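To make the combination of hashing and bit setting concrete, the following is a minimal Bloom filter sketch in Python, offered as an illustration rather than a reference implementation. The class name, the default parameters, and the use of salted SHA-256 as the hash family are assumptions made for this example; production implementations typically use faster hash functions and pack the bits more tightly.

```python
import hashlib


class BloomFilter:
    """Approximate set membership using k hashed positions in an m-bit array."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the example simple (an assumption, not a requirement).
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Salting a standard hash with the index gives k repeatable positions.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        """Streaming update: set the k bits chosen by the hash functions."""
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        """False means definitely absent; True means present or a false positive."""
        return all(self.bits[pos] for pos in self._positions(item))

    def merge(self, other):
        """Union with a filter built using identical parameters: bitwise OR."""
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("filters must share parameters to be merged")
        for i, bit in enumerate(other.bits):
            self.bits[i] |= bit
```

The three methods correspond to the update, query and merge operations described above: an item that was added is always reported as present, while an item that was never added may occasionally be reported as present.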
The purpose of this article, accompanying a talk at the PODS 2023 conference, is not to give a complete overview of sketches or their workings. Rather, the aim is to give a flavour of the development of sketch algorithms over the decades, and to cast some light on how they have found applications in practice.
We proceed as follows: in Section 2 we present a history of the development of sketches over the years. We complement this in Section 3 by looking at the corresponding timeline of how sketches have been motivated for different problems. In Section 4 we consider some of the ways that have been tried to encourage the adoption of sketches in practice, and offer concluding remarks in Section 5.

A (VERY) BRIEF HISTORY OF SKETCHES
The Pre-history of sketching (1970s and 1980s). The earliest instance of something that we could reasonably refer to as a sketch algorithm would be (uniform) random sampling, which far predates computers. Computer science and random sampling intersected with the notion of 'reservoir sampling': drawing a uniform random sample from a large stream of examples, whose cardinality is initially unknown. The simple incremental reservoir sampling algorithm is attributed variously to Fan et al. and to Waterman. Generalizations of sampling have led to a wide range of statistical techniques, going far beyond what can be discussed in this short note.
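The simple incremental algorithm can be sketched in a few lines of Python. This is a minimal illustration of the textbook one-pass scheme; the function name and the optional `rng` parameter are choices made for this example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of k items from a stream of unknown length."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item n replaces a uniformly chosen slot with probability k/n.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir
```

For example, `reservoir_sample(range(10**6), 10)` draws ten items in a single pass while holding only ten items in memory at any time.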
Apart from sampling, the earliest sketches started to emerge along with the wider availability of programmable computers, in the second half of the 20th century, from the 1970s onwards. Perhaps the first example of something we can think of as a sketch is due to Bloom in 1970, in the form of the famed 'Bloom Filter' [9] (if any reader can point to earlier examples, I would be delighted to hear of them). The Bloom filter compactly represents a set as a collection of bits, and is easy to update with new entries, and to query for (approximate) set membership. It does however take space that is linear in the size of the set that is represented (albeit with a small constant of proportionality). For an asymptotic space reduction, we have to look to examples such as the Morris counter (1977) [37], which allows us to count $n$ events approximately in space proportional to $O(\log \log n)$, rather than the exact binary counter that requires $\log_2 n$ bits. There is also the Flajolet and Martin distinct counter (1983), which uses $O(\log n)$ bits, but tracks the number of distinct items that have been observed within the input [22].
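The idea behind the Morris counter can be illustrated with a short Python sketch. This is the basic base-2 variant as it is commonly presented, not necessarily the exact formulation of [37]; the class and method names are assumptions for this example. The counter stores only an exponent, which grows roughly as $\log_2 n$ and so needs about $\log \log n$ bits.

```python
import random


class MorrisCounter:
    """Approximate counter that stores only an exponent (base-2 variant)."""

    def __init__(self, rng=None):
        self.exponent = 0
        self.rng = rng if rng is not None else random.Random()

    def increment(self):
        # Bump the exponent with probability 2^(-exponent).
        if self.rng.random() < 2.0 ** -self.exponent:
            self.exponent += 1

    def estimate(self):
        # 2^exponent - 1 is an unbiased estimate of the number of increments.
        return 2 ** self.exponent - 1
```

A single counter has high variance, so in practice one would average several independent counters or use a base closer to 1, trading a little space for accuracy.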
Another notion developed during these early years is the Munro-Paterson approach to finding quantiles in sublinear space (1980) [38]. The original focus of the work was to find the exact median (and other order statistics) with multiple passes over the input (assumed to be stored on tape). Later work reframed these results as providing deterministic approximations with a single pass. Boyer and Moore provided a simple algorithm to find the majority item in a sequence (1981) [11], which was generalized by Misra and Gries to find all frequently occurring items [35]. From mathematics, the Johnson-Lindenstrauss lemma (1984) [26] argued that Euclidean distances could be preserved among a set of high-dimensional points via a suitable projection. However, it took until the 1990s before explicit constructions emerged, based on random projections that were approximately distance preserving. A quirk of the literature is that these sketches are mostly known by the surnames of their designers: Bloom, Morris, Flajolet-Martin, Munro and Paterson, Johnson-Lindenstrauss.
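As an illustration of the frequent items generalization, here is a minimal Python sketch of the Misra-Gries approach as it is usually described. The function name and the convention of keeping at most k - 1 counters are assumptions for this example; with k = 2 it reduces to tracking a single majority candidate in the style of Boyer and Moore.

```python
def misra_gries(stream, k):
    """Keep at most k - 1 counters; any item occurring more than n/k times survives."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop those reaching zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

Each stored count underestimates the true frequency by at most n/k, where n is the stream length, so a second pass (or an acceptance of this error) is needed to confirm which candidates are truly frequent.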
The Streaming Years. The area of sketching accelerated from the mid 1990s through the first decade of the 2000s (approximately), due to the sudden growth in interest in 'data streaming' and the streaming model of computation. This model was formulated based on sources of large volumes of data, where it was necessary to process many small, incremental updates as a 'stream' of information. The work of Alon, Matias and Szegedy on the space complexity of the frequency moments launched the interest in this from an algorithmic perspective [5]. One key result was their "tug-of-war" or AMS sketch, based on maintaining the inner product of the input with Rademacher random variables (which can be viewed as a small space version of the Johnson-Lindenstrauss lemma). In parallel, Indyk and Motwani introduced the notion of Locality Sensitive Hashing, which builds a sketch of a large object, such that similar objects are likely to have similar sketches, also relying in part on Johnson-Lindenstrauss ideas [25].
The problem of finding the quantiles from a stream of items has been a keystone problem for sketching over the years. Manku, Rajagopalan and Lindsay adapted the Munro-Paterson algorithm to the streaming setting, and proposed extensions that obtained polylogarithmic space bounds [32]. Greenwald and Khanna presented and analyzed a streaming algorithm for quantiles that obtained logarithmic space [23]. Then Shrivastava et al. presented the q-digest sketch for quantile estimation, which focused on mergeability for distributed data [44].
Sketches based on carefully structured random projection appeared. The Count sketch can be viewed as an improvement of the AMS sketch, replacing averaging with hashing to speed up the computation [12]. Originally proposed for estimating item frequencies, it has been generalized as the basis of sparse Johnson-Lindenstrauss transforms. The Count-Min sketch seeks to further streamline sketching, by removing the Rademacher random variables, in order to provide frequency estimation with $\ell_1$ instead of $\ell_2$ guarantees [14]. The SpaceSaving algorithm was introduced to give a fast, deterministic solution to frequency estimation [34]; it was later connected with the similar Misra-Gries algorithm [35].
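A minimal Python sketch of the Count-Min idea follows, intended only as an illustration. The class name, the default dimensions, and the use of salted SHA-256 in place of the pairwise independent hash functions assumed by the analysis are all choices made for this example. Each row hashes an item to one counter, and the estimate is the minimum over rows, which never underestimates when all updates are non-negative.

```python
import hashlib


class CountMinSketch:
    """Frequency estimation with a depth x width array of counters."""

    def __init__(self, width=272, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # Salted SHA-256 stands in for a pairwise independent hash (an assumption).
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def update(self, item, count=1):
        """Streaming update: add count to one counter in every row."""
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        """Minimum over rows: an overestimate of the true frequency."""
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))

    def merge(self, other):
        """Merge a sketch built with the same dimensions and hash functions."""
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must share dimensions to be merged")
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]
```

In the standard analysis, a width of about $e/\varepsilon$ and a depth of about $\ln(1/\delta)$ give an additive error of at most $\varepsilon$ times the total count, with probability at least $1 - \delta$; the defaults above correspond roughly to $\varepsilon = 0.01$ and $\delta = 0.01$.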
The distinct element counting problem was also revisited in the streaming model, with the aim of providing strong approximation guarantees with tighter space bounds. The loglog algorithm reduced the dependence on the cardinality from logarithmic to double-logarithmic [18]. Subsequently, the hyperloglog (HLL) further squeezed the space cost for this problem, while remaining very simple to implement (the same cannot be said about the algorithmic analysis, which is highly sophisticated) [21].
During this era, it was common to refer to sketches by initialisms of their authors, e.g., AMS, MRL, GK, CM; or by names given to the sketches, such as SpaceSaving, Q-Digest, LogLog, Hyperloglog and Count Sketch.
From streaming to mergeable. Interest in sketches for the streaming model of computation has remained, but over the last decade there has been more emphasis on generalizing sketches to work in distributed settings (as opposed to on centralized streams), and on improving their performance for practical implementations.
Agarwal et al. [3] placed emphasis on the notion of mergeability of summaries, drawing out a theme that was present in many prior works. They provided new results on mergeability of deterministic frequency estimation algorithms and randomized quantile algorithms. A sequence of papers further tightened results on quantiles, leading to the Karnin-Lang-Liberty (KLL) [30] optimal quantile sketch, combining sampling with sketching ideas. Some deep theoretical advances were made. Truly sparse constructions of the Johnson-Lindenstrauss lemma were presented by Kane and Nelson, similar in outline to the Count Sketch but with stronger guarantees [28]. Such dimensionality reduction techniques led to the development of the areas of compressed sensing [17] and subspace embeddings [48]. Sketch techniques for graphs were developed by Ahn, Guha and McGregor, based on $\ell_0$ sampling, which allowed dynamic connectivity and minimum spanning trees to be solved in near-linear space [4].
On the practical side, work from Google discussed how to optimize the HLL algorithm for tracking cardinalities of very high magnitude, while improving accuracy at small cardinalities [24]. A team from Yahoo! started the "data sketches" project, which aimed to provide robust implementations of many sketches to ease adoption. The project emphasised the need for concurrency and mergeability of sketches [41].
Sketching at PODS. The PODS conference has been a welcoming home for results on sketching, with many papers on sketching and related topics appearing here. A complete enumeration is out of the question, and instead I highlight some examples of papers on sketching that have been honoured with awards at PODS:
• "An optimal algorithm for the distinct elements problem" [29] (PODS 2010, best paper award) gives a sketch algorithm that achieves the lower bound for counting the number of distinct items in a stream of updates.
• "Tight bounds for $\ell_p$ samplers" [27] (PODS 2011, Test of time award in 2021) uses sketch techniques to sample according to the $\ell_p$ distribution, where the probability of picking an item is proportional to a monomial function of its frequency.
• "Mergeable Summaries" [3] (PODS 2012, Test of time award in 2022) formalizes the notion of mergeable summaries, and shows sketches that can be merged for frequency estimation, quantiles, and geometric approximations.
• "A framework for adversarially robust streaming algorithms" [7] (PODS 2020, best paper award) considers how randomized sketch algorithms can be built that are robust to an adversary trying to break the approximation guarantee.
• "Relative Error streaming quantiles" [13] (PODS 2021, best paper award) gives a near-optimal sketch for the problem of summarizing a stream of items to find the quantiles with a relative error guarantee.
• "Optimal Bounds for Approximate Counting" [39] (PODS 2022, best paper award) revisits the foundational problem of approximate counting [37] and shows a variant sketch that achieves improved dependency on the approximation parameters.
Collectively, these demonstrate the depth and challenge of problems relating to sketches, and their high level of interest to the PODS community.
Sketching in print. For further reading, and a technical presentation of the fundamentals of sketching, there are now several textbooks that cover the topic in depth. "Probability and Computing" by Mitzenmacher and Upfal [36] is a general introduction to probabilistic algorithms, which uses some sketch algorithms to illustrate the key techniques. "Mining of Massive Datasets" (Leskovec, Rajaraman and Ullman) [31] devotes a chapter to sketch techniques for data analysis. Likewise, "Foundations of Data Science" by Blum, Hopcroft and Kannan [10] has a chapter on sketching as a mathematical tool in data science. The book "Algorithms and Data Structures for Massive Datasets" by Medjedovic, Tahirovic and Dedovic [33] is almost entirely concerned with presenting core sketch algorithms and their analyses, and similarly, "Small Summaries for Big Data" (Cormode and Yi) [15] presents multiple sketch algorithms, and discusses implementation issues.

THE SHIFTING MOTIVATIONS FOR SKETCHING
As hinted in the previous section, the motivations for using sketches have shifted over time, presenting different demands on the algorithms, and highlighting different concerns. Some applications have faded due to shifts in technologies, while others remain broadly relevant today.

Memory constrained systems (1970s-1980s). The initial motivation for the first sketches was the constraints of memory. Bloom filters were proposed as a compact way to perform spell checking when it was not feasible to keep a full dictionary in memory [9].
Similarly, the Munro-Paterson work on quantiles was in the context of tape-resident data sets which were too large to be brought into memory. The advent of hierarchical memory systems with larger main memory and ready access to (relatively) fast disks diminished the need for such algorithms during the subsequent decades. However, the emergence of systems performing analytics on very large volumes of data meant that these applications did not disappear entirely.
Massive Data Streams (mid 1990s-early 2000s). The growth of the internet, and associated ecosystems, provided the setting for "massive data" and "data streams". Initially, the focus came from the network/ISP world: for the first time, we could easily see examples where the volume of data moving through a system dwarfed the capacity to store it at rest. Although the (meta)data was mostly ephemeral, it was desirable to be able to summarize and query it in order to monitor and debug networked systems. This drove the demand for sketches that could be built in a streaming (incremental) fashion, and integrated into special-purpose data stream management systems. These included systems from Sprint (CMON [46]) and AT&T (Gigascope [16]) in industry, and academic systems such as Stream, Aurora and Borealis [1,2,6,49]. Here, the need was often not to build one sketch, but to maintain huge numbers of sketches in parallel (i.e., to support GROUP BY aggregate queries over many groups). While impactful within their specialist domains, these applications tended to be internal and bespoke to specific network management problems. Attempts to generalize these ideas to distributed models, captured in settings such as sensor networks, provided rich fodder for research papers, but had more limited practical impact.

From ISPs to Internet Companies (2000s onwards). A shift in the motivation for sketch algorithms came in the first decade of the current century, when a new class of Internet-based companies came along with a focus on novel technologies. Starting with search engines, these companies handled vast amounts of data, and hence brought applications that could benefit from sketching. Google was the leading example here, and several sketches found important motivation from search data: the Count sketch was proposed by academic visitors to Google [12], while locality sensitive hashing was built into systems to perform multimedia (image) search. Even though many of the technologies have changed over the years, sketching still has relevance to these applications. For instance, the mechanism for image similarity search may have shifted from simple feature extraction to learned vector embeddings. However, both rely on notions of (high-dimensional) vector similarity which can be supported efficiently by LSH-based techniques.
Online advertising (2010s). The financial underpinning for these new tech companies primarily derived from online advertising: connecting internet users with adverts to draw their clicks. A basic question that advertisers wanted to understand was exactly how many individuals their adverts were reaching. This could be a non-trivial question, due to the complexity and scale of the online advertising ecosystem that quickly grew up. Sketches, specifically distinct count sketches such as loglog and hyperloglog, were proposed as an answer: these sketches could be used to track how many distinct users (based on cookie information) were exposed to a particular campaign, while avoiding double counting. Properties of these sketches meant that it was possible to "slice and dice" these statistics, by reporting response rates across multiple dimensions (e.g., demographic attributes). Systems were built and put into production based on this principle, by companies such as Aggregate Knowledge. However, there were obstacles that prevented this approach having a major impact. First, a long-standing limitation of sketches that use randomization is the challenge in communicating a randomized approximation guarantee to non-technical consumers. This is not unique to sketching, and can be overcome with appropriate communication tools (e.g., confidence intervals on reported statistics), but presents an initial barrier. A more fundamental issue is that computer systems eventually scaled faster than advertising clicks: it became possible to track and process advertising information in highly performant data warehouses, giving "exact" results (up to sampling bias and other noise factors). While there remain cases where the data volume is very high (e.g., systems that track every tiny interaction, such as a mouse movement) that could benefit from the use of sketching, these may instead be handled by alternative downsampling techniques to reduce the data down to more manageable amounts.
The Big Data era (2010s onwards). Meanwhile, the terminology shifted from 'massive data' to 'big data', and new applications emerged. Other applications, such as social media and video streaming, became mainstream, and brought their own data analytics questions with them. While the primary data (posts, videos etc.) have to be stored and delivered exactly, there are large volumes of secondary data that can be summarized in sketches. For example, Twitter used count-min sketches to keep track of how many views were received by "embedded tweets", such as a tweet that is presented within a news article. New algorithms for the core problems of heavy hitters, quantiles, and count distinct were developed (e.g., the KLL algorithm, the t-digest summary) and made available via libraries (the Apache Data Sketches Library) and platforms (e.g., Splunk and Salesforce). While it remains challenging to find detailed information on the extent to which sketches have been deployed in practice, anecdotally some sketches are very widely used, and many software engineers sing the praises of sketches such as Bloom filters and hyperloglog.
Private Data Analysis (late 2010s onwards). As the focus on data analysis has grown over time, so has the need for privacy enhancing technologies to support it. The data being analyzed is often related to individual people, and so it is necessary to modify the analysis procedure in order to protect the privacy of the individuals who have contributed to the data. Formal definitions of privacy have emerged in the form of $k$-anonymity [43] and differential privacy [19], which require that the data analysis procedure adheres to some requirements, such as coarsening the level of information available, or adding calibrated random noise to the output.
Such definitions happen to cohere well with sketching: the compact representations formed by sketch algorithms tend to mix and concentrate the information from many individuals, making the perturbations due to privacy less disruptive than other representations would be [50].
A concrete example is the RAPPOR system deployed by Google to collect statistics on web browsing activity [20]. The system can be summarized as combining the Bloom filter summary [9] with randomized response [47], to randomly flip some of the bits. Similarly, Apple's deployment of differential privacy can be understood as taking a Count-Min sketch of a sparse input and applying randomized response to each entry [45]. More generally, the emerging area of Federated Analytics [8], which aims to collect data privately from a large population of distributed individuals, can be crudely described as being based on sketches with privacy.
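The core step of flipping bits at random, and then correcting for the flips in aggregate, can be sketched as follows in Python. This is a simplified illustration of per-bit randomized response, not the RAPPOR or Apple protocol itself, both of which involve additional encoding and randomization stages; the function names and the `keep_prob` parameter are assumptions for this example.

```python
import random


def randomize_bits(bits, keep_prob, rng):
    """Report each bit truthfully with probability keep_prob, else flip it."""
    return [b if rng.random() < keep_prob else 1 - b for b in bits]


def estimate_true_counts(noisy_sums, num_reports, keep_prob):
    """Unbiased estimate of how many reports truly had each bit set.

    If T reports have the bit set, the expected noisy sum is
    T * keep_prob + (num_reports - T) * (1 - keep_prob); solve for T.
    Requires keep_prob != 0.5, otherwise the reports carry no signal.
    """
    flip_prob = 1.0 - keep_prob
    return [(s - flip_prob * num_reports) / (keep_prob - flip_prob)
            for s in noisy_sums]
```

Each individual report is noisy, yet summing many reports position by position and debiasing recovers the population-level bit counts, which is what makes bit-vector sketches such as the Bloom filter a natural fit for this style of collection.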
Optimizing Machine Learning (mid 2010s onwards). The vast growth in interest in machine learning over the last decade has drawn on many aspects of computer science: optimization to train models from data, hardware integration to speed up training at scale, and so on. One direction of interest has been on using sketches to reduce the cost of the training process. A basic idea is to make use of sketches that preserve the norm of data in high-dimensional space to perform the learning in the sketch space, rather than in the original space. This has been leveraged to reduce the communication cost of distributed machine learning [42]. Other potential directions include using sketching as a way to approximate expensive linear algebra operations, such as matrix multiplication, and to incorporate kernel transformations [40,48]. To the best of my knowledge, such uses of sketches have so far been primarily of academic interest, but it is feasible that future work can more directly benefit from sketch techniques.

LESSONS LEARNED FROM SKETCHES IN PRACTICE
In studying and working with sketches over many years, and being excited about their potential for adoption in a wide range of applications, it is natural to have considered a number of ways to spread this enthusiasm, and to accelerate the pathway to adoption. This section reviews some of the different strategies that have been tried, and comments on their efficacy in this regard.
Launching a startup. The most direct way of pushing ideas from research into practice is to do it yourself: to launch a company around your latest paper. This idea has been suggested many times for sketches, given the powerful results that they can achieve. But, at the risk of stating the obvious, a successful startup needs a business idea: a product or service that can succeed in the marketplace. Sketches, like many ideas from the data management and algorithms worlds, are too far "under the hood" for a clear case to emerge: they don't obviously solve a problem that was impossible before, or tackle an issue that is a pain point for many users. Profitable companies that have come from academic research (e.g., Google with web search, and Akamai with consistent hashing) have ultimately succeeded due to having a business model that is supported by the technology (i.e., online advertising, and content delivery networks). So while sketching may be a useful tool in building software that is of use to people, it is not (yet) a piece that is vital to the success of a product, and so is hard to build a business around.
Pushing out code. A more direct route to getting research ideas into use is to provide code to implement them. Sketches are a particularly good test case for this: the algorithms needed are often relatively simple for a researcher to code up. But the concepts and techniques (such as certain kinds of hash function, or fiddly bit manipulation tricks) may be sufficiently unfamiliar to the typical software engineer that prototype code can be very valuable, simply for showing a proof of concept. A reference implementation, even if crudely written and lightly documented, is much preferable to, and more tangible than, pseudocode in a paper. So researchers are strongly encouraged to make their code available to others, via GitHub or other means.
Inflict ideas on the next generation. The computer science curriculum is far more dynamic than, say, the mathematics curriculum, and it is still feasible to include research ideas in undergraduate classes. There may not be room to go in-depth on cutting edge ideas, but including a few results from the current century may help to keep students engaged. Sketches are a good exemplar for this, since the ideas can fit well into an algorithms or database class, and illustrate some of the underlying principles and concepts. A long-term benefit of this approach is that some students may just remember these ideas after graduation, and be motivated to make use of them in whatever career they choose to go into.
Write accessible notes, and put them where people can read them. While we may think of peer-reviewed academic publications as the medium for sharing research ideas, these are unfortunately not the place where practitioners will find them. You can have more reach by writing accessible notes addressing the software engineering community. For sketches, we made web pages and wrote articles in practitioner-focused journals. Today, you should consider making more use of platforms like Medium and Substack, and promoting posts via social media.
Give talks and tutorials. On a similar note, talks can be more accessible than articles, particularly if they are captured as an online video that is easy to share. For some people, their first stop to learn about a new topic is on YouTube, rather than arXiv. Things don't have to be too polished, but there is potential for short-form, custom made introductory videos to reach a wide audience interested in learning new techniques that they can adopt.
Work directly with companies. Lastly, it can be valuable to work directly with a company that can benefit from translating research into practice. This is not a simple proposition: working very closely with a company requires building up a lot of trust and understanding, or going through a demanding application and recruitment process. Most organizations want a solution to their problems that can be implemented and deployed at scale. This does not necessarily align with research novelty: sometimes simple partial solutions will suffice, while research-inspired approaches are viewed as too complex and not scalable enough. However, the experience can be eye-opening, and can allow not only real-world use of research ideas, but inspiration for new research questions that are well-motivated.

CONCLUDING REMARKS
The notion of sketching is a compelling one: to build a compact representation of a large dataset that nonetheless allows certain properties of the data to be accurately approximated. Contributions to this topic have required substantial theoretical advances, appearing in venues such as PODS, but promising substantial impact in practice.
The road to genuine impact is a long and bumpy one. Many great theoretical ideas never make any significant impact, because the scenario they solve does not arise as urgently in practice, or can be tackled satisfactorily with heuristic or less efficient approaches. Sketches can rightly claim to have had meaningful impact on the practice of computer science. However, the motivations for sketches have changed over time, and once-compelling demands may no longer be relevant. Moreover, of the hundreds, if not thousands, of papers that have presented ideas on sketches, there may be only a handful that have achieved very widespread use. Bloom filters and Hyperloglog sketches are the most well-known, along with Count sketches and Count-Min sketches, plus various sketches for quantile and frequent item estimation (SpaceSaving, Misra-Gries, Q-Digest, T-Digest, Greenwald-Khanna).
This illustrates the principle that good theory can lead to good impact. The biggest contributor to this may be time: it can take time for ideas to diffuse and to find their application. After this, active effort to connect the ideas with their potential beneficiaries is useful: make the ideas as easy as possible to access and digest. Finally, there is always some element of luck: will the right motivation align with the idea's potential, and will the right people be inspired by the ideas to put them into practice?