SOSP '19: Proceedings of the 27th ACM Symposium on Operating Systems Principles
ACM 2019 Proceeding
Publisher:
Association for Computing Machinery, New York, NY, United States
Conference:
SOSP '19: ACM SIGOPS 27th Symposium on Operating Systems Principles, Huntsville, Ontario, Canada, October 2019
ISBN:
978-1-4503-6873-5
In-Cooperation:
USENIX Assoc

Abstract

SOSP is the flagship conference of ACM SIGOPS. Held every two years, it brings together the leading researchers and practitioners interested in the design, implementation, and evaluation of computer systems software.

PipeDream: generalized pipeline parallelism for DNN training

DNN training is extremely time-consuming, necessitating efficient multi-accelerator parallelization. Current approaches to parallelizing training primarily use intra-batch parallelization, where a single iteration of training is split over the available ...

A generic communication scheduler for distributed DNN training acceleration

We present ByteScheduler, a generic communication scheduler for distributed DNN training acceleration. ByteScheduler is based on our principled analysis that partitioning and rearranging the tensor transmissions can result in optimal results in theory ...

Parity models: erasure-coded resilience for prediction serving systems

Machine learning models are becoming the primary work-horses for many applications. Services deploy models through prediction serving systems that take in queries and return predictions by performing inference on models. Prediction serving systems are ...

TASO: optimizing deep learning computation with automatic generation of graph substitutions

Existing deep neural network (DNN) frameworks optimize the computation graph of a DNN by applying graph transformations manually designed by human experts. This approach misses possible graph optimizations and is difficult to scale, as new DNN operators ...

Teechain: a secure payment network with asynchronous blockchain access

Blockchains such as Bitcoin and Ethereum execute payment transactions securely, but their performance is limited by the need for global consensus. Payment networks overcome this limitation through off-chain transactions. Instead of writing to the ...

Fast and secure global payments with Stellar

International payments are slow and expensive, in part because of multi-hop payment routing through heterogeneous banking systems. Stellar is a new global payment network that can directly transfer digital money anywhere in the world in seconds. The key ...

Open Access
Notary: a device for secure transaction approval

Notary is a new hardware and software architecture for running isolated approval agents in the form factor of a USB stick with a small display and buttons. Approval agents allow factoring out critical security decisions, such as getting the user's ...

CrashTuner: detecting crash-recovery bugs in cloud systems via meta-info analysis
October 2019, pp 114–130, https://doi.org/10.1145/3341301.3359645

Crash-recovery bugs (bugs in crash-recovery-related mechanisms) are among the most severe bugs in cloud systems and can easily cause system failures. It is notoriously difficult to detect crash-recovery bugs since these bugs can only be exposed when ...

Open Access
The inflection point hypothesis: a principled debugging approach for locating the root cause of a failure
October 2019, pp 131–146, https://doi.org/10.1145/3341301.3359650

The end goal of failure diagnosis is to locate the root cause. Prior root cause localization approaches almost all rely on statistical analysis. This paper proposes taking a different approach based on the observation that if we model an execution as a ...

Finding semantic bugs in file systems with an extensible fuzzing framework
October 2019, pp 147–161, https://doi.org/10.1145/3341301.3359662

File systems are too large to be bug free. Although handwritten test suites have been widely used to stress file systems, they can hardly keep up with the rapid increase in file system size and complexity, leading to new bugs being introduced and ...

Efficient scalable thread-safety-violation detection: finding thousands of concurrency bugs during testing
October 2019, pp 162–180, https://doi.org/10.1145/3341301.3359638

Concurrency bugs are hard to find, reproduce, and debug. They often escape rigorous in-house testing, but result in large-scale outages in production. Existing concurrency-bug detection techniques unfortunately cannot be part of industry's integrated ...

Privacy accounting and quality control in the Sage differentially private ML platform
October 2019, pp 181–195, https://doi.org/10.1145/3341301.3359639

Companies increasingly expose machine learning (ML) models trained over sensitive user data to untrusted domains, such as end-user devices and wide-access model stores. This creates a need to control the data's leakage through these models. We present ...

Honeycrisp: large-scale differentially private aggregation without a trusted core
October 2019, pp 196–210, https://doi.org/10.1145/3341301.3359660

Recently, a number of systems have been deployed that gather sensitive statistics from user devices while giving differential privacy guarantees. One prominent example is the component in Apple's macOS and iOS devices that collects information about ...

Open Access
Yodel: strong metadata security for voice calls
October 2019, pp 211–224, https://doi.org/10.1145/3341301.3359648

Yodel is the first system for voice calls that hides metadata (e.g., who is communicating with whom) from a powerful adversary that controls the network and compromises servers. Voice calls require sub-second message latency, but low latency has been ...

Scaling symbolic evaluation for automated verification of systems code with Serval
October 2019, pp 225–242, https://doi.org/10.1145/3341301.3359641

This paper presents Serval, a framework for developing automated verifiers for systems software. Serval provides an extensible infrastructure for creating verifiers by lifting interpreters under symbolic evaluation, and a systematic approach to ...

Verifying concurrent, crash-safe systems with Perennial
October 2019, pp 243–258, https://doi.org/10.1145/3341301.3359632

This paper introduces Perennial, a framework for verifying concurrent, crash-safe systems. Perennial extends the Iris concurrency framework with three techniques to enable crash-safety reasoning: recovery leases, recovery helping, and versioned memory. ...

Using concurrent relational logic with helpers for verifying the AtomFS file system
October 2019, pp 259–274, https://doi.org/10.1145/3341301.3359644

Concurrent file systems are pervasive but hard to correctly implement and formally verify due to nondeterministic interleavings. This paper presents AtomFS, the first formally-verified, fine-grained, concurrent file system, which provides linearizable ...

Verifying software network functions with no verification expertise
October 2019, pp 275–290, https://doi.org/10.1145/3341301.3359647

We present the design and implementation of Vigor, a software stack and toolchain for building and running software network middleboxes that are guaranteed to be correct, while preserving competitive performance and developer productivity. Developers ...

Optimizing data-intensive computations in existing libraries with split annotations
October 2019, pp 291–305, https://doi.org/10.1145/3341301.3359652

Data movement between main memory and the CPU is a major bottleneck in parallel data-intensive applications. In response, researchers have proposed using compilers and intermediate representations (IRs) that apply optimizations such as loop fusion under ...

Niijima: sound and automated computation consolidation for efficient multilingual data-parallel pipelines
October 2019, pp 306–321, https://doi.org/10.1145/3341301.3359649

Multilingual data-parallel pipelines, such as Microsoft's Scope and Apache Spark, are widely used in real-world analytical tasks. While the involvement of multiple languages (often including both managed and native languages) provides much convenience ...

Nexus: a GPU cluster engine for accelerating DNN-based video analysis
October 2019, pp 322–337, https://doi.org/10.1145/3341301.3359658

We address the problem of serving Deep Neural Networks (DNNs) efficiently from a cluster of GPUs. In order to realize the promise of very low-cost processing made by accelerators such as GPUs, it is essential to run them at sustained high utilization. ...

Lineage stash: fault tolerance off the critical path
October 2019, pp 338–352, https://doi.org/10.1145/3341301.3359653

As cluster computing frameworks such as Spark, Dryad, Flink, and Ray are being deployed in mission critical applications and on larger and larger clusters, their ability to tolerate failures is growing in importance. These frameworks employ two broad ...

File systems unfit as distributed storage backends: lessons from 10 years of Ceph evolution
October 2019, pp 353–369, https://doi.org/10.1145/3341301.3359656

For a decade, the Ceph distributed file system followed the conventional wisdom of building its storage backend on top of local file systems. This is a preferred choice for most distributed file systems today because it allows them to benefit from the ...

I4: incremental inference of inductive invariants for verification of distributed protocols
October 2019, pp 370–384, https://doi.org/10.1145/3341301.3359651

Designing and implementing distributed systems correctly is a very challenging task. Recently, formal verification has been successfully used to prove the correctness of distributed systems. At the heart of formal verification lies a computer-checked ...

Aegean: replication beyond the client-server model
October 2019, pp 385–398, https://doi.org/10.1145/3341301.3359663

This paper presents Aegean, a new approach that allows fault-tolerant replication to be implemented beyond the confines of the client-server model. In today's computing, where services are rarely standalone, traditional replication protocols such as ...

Snap: a microkernel approach to host networking
October 2019, pp 399–413, https://doi.org/10.1145/3341301.3359657

This paper presents our design and experience with a microkernel-inspired approach to host networking called Snap. Snap is a userspace networking system that supports Google's rapidly evolving needs with flexible modules that implement a range of ...

Risk based planning of network changes in evolving data centers
October 2019, pp 414–429, https://doi.org/10.1145/3341301.3359664

Data center networks evolve as they serve customer traffic. When applying network changes, operators risk impacting customer traffic because the network operates at reduced capacity and is more vulnerable to failures and traffic variations. The impact ...

Open Access
Taiji: managing global user traffic for large-scale internet services at the edge
October 2019, pp 430–446, https://doi.org/10.1145/3341301.3359655

We present Taiji, a new system for managing user traffic for large-scale Internet services that accomplishes two goals: 1) balancing the utilization of data centers and 2) minimizing network latency of user requests.

Taiji models edge-to-datacenter ...

KVell: the design and implementation of a fast persistent key-value store
October 2019, pp 447–461, https://doi.org/10.1145/3341301.3359628

Modern block-addressable NVMe SSDs provide much higher bandwidth and similar performance for random and sequential access. Persistent key-value stores (KVs) designed for earlier storage devices, using either Log-Structured Merge (LSM) or B trees, do not ...

Recipe: converting concurrent DRAM indexes to persistent-memory indexes
October 2019, pp 462–477, https://doi.org/10.1145/3341301.3359635

We present Recipe, a principled approach for converting concurrent DRAM indexes into crash-consistent indexes for persistent memory (PM). The main insight behind Recipe is that isolation provided by a certain class of concurrent in-memory indexes can be ...

Contributors

  • Tim Benedict Brecht
    University of Waterloo
  • Carey Williamson
    University of Calgary
