MLSys 2021 Papers

FirePlace: Placing Firecracker Virtual Machines with Hindsight Imitation

Don't Forget to Sign the Gradients!

Scaling Polyhedral Neural Network Verification on GPUs

In-network Aggregation for Shared Machine Learning Clusters

Scaling Distributed Training with Adaptive Summation

Pufferfish: Communication-efficient Models At No Extra Cost

A Learned Performance Model for Tensor Processing Units

Accounting for Variance in Machine Learning Benchmarks

Learning on Distributed Traces for Data Center Storage Systems

CODE: Compiler-based Neuron-aware Ensemble training

Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery

Lost in Pruning: The Effects of Pruning Neural Networks beyond Test Accuracy

Adaptive Gradient Communication via Critical Learning Regime Identification

A Distributed Graph-Theoretic Framework for Automatic Parallelization in Multi-core Systems

Boveda: Building an On-Chip Deep Learning Memory Hierarchy Brick by Brick

Pipelined Backpropagation at Scale: Training Large Models without Batches

Value Learning for Throughput Optimization of Deep Learning Workloads

MicroNets: Neural Network Architectures for Deploying TinyML Applications on Commodity Microcontrollers

Horizontally Fused Training Array: An Effective Hardware Utilization Squeezer for Training Novel Deep Learning Models

Equality Saturation for Tensor Graph Superoptimization

VS-Quant: Per-vector Scaled Quantization for Accurate Low-Precision Neural Network Inference

MicroRec: Efficient Recommendation Inference by Hardware and Data Structure Solutions

FLAML: A Fast and Lightweight AutoML Library

Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters

Larq Compute Engine: Design, Benchmark and Deploy State-of-the-Art Binarized Neural Networks

Accelerate Inference of CNNs for Video Analysis While Preserving Exactness Exploiting Activation Sparsity

RL-Scope: Cross-stack Profiling for Deep Reinforcement Learning Workloads

Learning Fitness Functions for Machine Programming

Fluid: Resource-aware Hyperparameter Tuning Engine

Bit Error Robustness for Energy-Efficient DNN Accelerators

ByzShield: An Efficient and Robust System for Distributed Training

An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems

Nimble: Efficiently Compiling Dynamic Neural Networks for Model Inference

ModularNAS: Towards Modularized and Reusable Neural Architecture Search

Wavelet: Efficient DNN Training with Tick-Tock Scheduling

sensAI: ConvNets Decomposition via Class Parallelism for Fast Inference on Live Data

Data Movement Is All You Need: A Case Study on Optimizing Transformers

Doping: A technique for Extreme Compression of LSTM Models using Sparse Structured Additive Matrices

TT-Rec: Tensor Train Compression for Deep Learning Recommendation Models

To Bridge Neural Network Design and Real-World Performance: A Behaviour Study for Neural Networks

IOS: Inter-Operator Scheduler for CNN Acceleration

TensorFlow Lite Micro: Embedded Machine Learning for TinyML Systems

Exploring the Limits of Concurrency in ML Training on Google TPUs

Rethinking Floating Point Overheads for Mixed Precision DNN Accelerators

Swift for TensorFlow: A portable, flexible platform for deep learning

Characterizing and Taming Model Instability Across Edge Devices

PipeMare: Asynchronous Pipeline Parallel DNN Training

A Deep Learning Based Cost Model for Automatic Code Optimization

Amazon SageMaker Debugger: A System for Real-Time Insights into Machine Learning Model Training

Accelerating SLIDE Deep Learning on Modern CPUs: Vectorization, Quantizations, Memory Optimizations, and More

SUOD: Accelerating Large-Scale Unsupervised Heterogeneous Outlier Detection

Cortex: A Compiler for Recursive Deep Learning Models