
Software-Hardware Codesign for Machine Learning Workloads
Ritwik Gupta · John Wohlbier · Tze Meng Low · Jeffrey Vetter · Natalia Vassilieva

Wed Mar 04 07:00 AM -- 03:30 PM (PST) @ Level 3 Room 9
Event URL: https://www.sei.cmu.edu/go/codesign

Machine learning development workflows today involve the siloed design and optimization of task-specific software for a limited number of fixed hardware options. As a result, hardware and software are treated as separate components, and the impact of each on the other cannot be assessed or optimized jointly. This abstraction leads to computationally inefficient machine learning workloads.

Recently, both software and hardware have taken steps to become more domain specific. Machine-learning-focused software libraries provide operations and abstractions limited to workload-relevant use cases. Hardware makers have started manufacturing workload-relevant chips in the form of FPGAs, ASICs, and DLAs. However, these efforts are still largely independent of each other, resulting in inefficiencies and less-than-ideal workload performance.

Ideally, hardware and software would be codesigned for a specific ML workload, but investing in a particular hardware design is costly, especially in the face of the rapidly evolving state of ML. This workshop is soliciting extended abstracts that seek to bridge the gap between software and hardware in the areas of model design, model abstractions, model primitives, workload compression, hardware design, hardware optimization for power, data flow optimization, and compiler technologies.

Wed 6:00 a.m. - 6:10 a.m.
Welcome, Introduction, Logistics (Programmatics)
Wed 6:10 a.m. - 6:35 a.m.

Dr. Thomas Rondeau - Program Manager - DARPA



Keywords: software defined hardware, domain specific SoC, reconfigurable hardware, many core heterogeneous specialized accelerators

Bio: Tom Rondeau is a program manager in DARPA's Microsystems Technology Office with a focus on adaptive and reconfigurable radios, improving the development cycle for new signal-processing techniques, and exploring new approaches and applications with the electromagnetic spectrum. Prior to joining DARPA, Tom was the maintainer and lead developer of the GNU Radio project, a visiting researcher with the University of Pennsylvania, and an Adjunct with the IDA Center for Communications Research in Princeton, NJ.

Wed 6:35 a.m. - 7:00 a.m.

Dr. Christopher Aberger - Director, Software Engineering - SambaNova Systems

Abstract: In many applications, traditional software development is being replaced by machine-learning-generated models, resulting in accuracy improvements and deployment advantages. This fundamental shift in how we develop software is known as Software 2.0. The continued success of Software 2.0 will require efficient and flexible computer hardware optimized for the dataflow computational graphs at the core of machine learning. In this talk, we will discuss the design of high-performance dataflow computer architectures for machine learning. Our vertically integrated approach to machine learning performance combines new machine learning algorithms, new domain-specific languages, advanced compilation technology and software-defined hardware.

Keywords: dataflow computational graph, domain specific languages, compiler technology, software defined hardware

Bio: Dr. Christopher Aberger is a director of software engineering at SambaNova Systems where he leads the machine learning team. Christopher works on efficient training algorithms for new and emerging hardware architectures. He received his Ph.D. degree in Computer Science from Stanford University where he studied the intersection of graph, database, and machine learning systems; this work received a Best Of award at VLDB in 2016 and an invited TODS article in 2017.

Wed 7:00 a.m. - 7:25 a.m.

Dr. Dennis Abts - Chief Architect - Groq

Title: From Supercomputers to Superchips: Deep Learning One PataOp at a Time


Keywords: tensor streaming architecture, inference, software defined hardware, compiler technology

Bio: Dennis is the Chief Architect at Groq and an expert in scalable vector architectures for high performance computing. Previously, he worked at Google on datacenter network topologies for energy-proportional networking and at Cray, where he was a Sr. Principal Architect on several Top500 massively parallel supercomputers. Dennis has published over 20 technical papers in the areas of memory systems, interconnection networks, and fault-tolerant systems. He holds over two dozen patents spanning 20+ years of experience at Cray and Google. Dennis holds a Ph.D. in Computer Science from the University of Minnesota and is a Senior Member of IEEE and ACM Computer Society.

Wed 7:25 a.m. - 7:50 a.m.

Matt Fyles - VP Software - Graphcore

Title: Compiling For Distributed Memory Architectures

Abstract: The Graphcore Intelligence Processing Unit (IPU) is designed for targeting machine learning workloads and supporting the scaling of applications across multiple devices. The IPU architecture is based around massively parallel distributed processing where applications are mapped over thousands of processor cores and operate using a Bulk Synchronous Parallel (BSP) execution model which separates computation from communication. In order to achieve performance from applications mapped onto the IPU the software tool chain has to deal with the complex task of partitioning machine learning computational graphs. In this presentation we discuss how we take a machine learning application and through our software tools partition and schedule the work across the IPU. We also discuss the hardware / software trade-offs that were made to build a processor to execute these workloads.
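
As a rough illustration of the BSP execution model mentioned in the abstract above, the following Python sketch runs a small ring of workers through supersteps in which an exchange phase and a local compute phase are separated by barriers. It is not Graphcore's toolchain; the function name `bsp_ring_sum`, the ring layout, and the use of threads are assumptions made only for this example.

```python
import threading


def bsp_ring_sum(values):
    """Sum `values` across a ring of workers using BSP-style supersteps.

    Worker i starts with values[i]; after len(values) - 1 supersteps every
    worker holds the global sum. (Hypothetical example, not a real IPU API.)
    """
    n = len(values)
    if n == 0:
        return []
    barrier = threading.Barrier(n)  # global synchronisation point
    totals = list(values)   # per-worker running sum (local state)
    outbox = list(values)   # value each worker forwards next
    inbox = [0] * n         # written only during the exchange phase

    def worker(rank):
        for _ in range(n - 1):
            # Exchange phase: send one value to the right-hand neighbour.
            inbox[(rank + 1) % n] = outbox[rank]
            barrier.wait()  # all messages delivered before anyone computes
            # Compute phase: touch only this worker's own slots.
            totals[rank] += inbox[rank]
            outbox[rank] = inbox[rank]
            barrier.wait()  # all computation done before the next exchange

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals
```

With four workers, `bsp_ring_sum([1, 2, 3, 4])` returns `[10, 10, 10, 10]`. Because no worker reads its inbox until the barrier has passed, the communication and computation phases never overlap, which is the separation the BSP model relies on.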

Keywords: IPU, ML computational graph, bulk synchronous parallel, training, inference

Bio: Matt Fyles is a computer scientist with over 20 years of experience in the design, development, delivery and support of software and hardware for the microprocessor market, spanning a wide range of applications from consumer electronics to high performance computing, with a particular focus on parallel processors. He began his career at STMicroelectronics, Europe’s largest semiconductor company, followed by SuperH, Clearspeed and XMOS. He is currently Vice President of Software at Graphcore, a Bristol-based artificial intelligence hardware and software company. Matt is a graduate of Computer Science from the University of Exeter.

Wed 7:50 a.m. - 8:05 a.m.
Wed 8:05 a.m. - 8:30 a.m.

Dr. Natalia Vassilieva - Technical Product Manager - Cerebras Systems

Title: Accelerating Deep Learning with a purpose-built solution: the Cerebras approach

Abstract: The new era of chip specialization for deep learning is here. Traditional approaches to computing can no longer meet the computational and power requirements of this workload, arguably the most important of our generation. What is the right processor for deep learning? To answer this question, this talk will provide an overview of deep neural nets and discuss the computational requirements of different types of models and the limitations of existing hardware architectures and scale-out approaches. Then we will discuss Cerebras' approach to meeting the computational requirements of deep learning with the Cerebras Wafer Scale Engine (WSE), the largest computer chip in the world, and the Cerebras Software Platform, co-designed with the WSE. The WSE provides cluster-scale resources on a single chip with full utilization for tensors of any shape (fat, square and thin, dense and sparse), enabling researchers to explore novel network architectures and optimization techniques at any batch size. Finally, we will discuss potential co-design ideas for new neural net models and learning methods for the WSE.

Keywords: WSE, training, inference, dataflow computational graph

Bio: Natalia Vassilieva is a Technical Product Manager at Cerebras Systems, a computer systems company dedicated to accelerating deep learning. Her focus is machine learning and artificial intelligence, analytics, and application-driven software-hardware optimization and co-design. Before joining Cerebras, Natalia was a Sr. Research Manager at Hewlett Packard Labs, where she led the Software and AI group and served as the head of HP Labs Russia from 2011 to 2015. Prior to HPE, she was an Associate Professor at St. Petersburg State University and worked as a software engineer for different IT companies. Natalia holds a PhD in computer science from St. Petersburg State University.

Wed 8:30 a.m. - 8:55 a.m.

Dr. Jeffrey Vetter - Future Technologies Group Leader - Oak Ridge National Laboratory




Wed 8:55 a.m. - 9:20 a.m.

Professor Tze Meng Low - Carnegie Mellon University




Bio: Tze Meng Low is an Assistant Research Professor with the Department of Electrical and Computer Engineering at Carnegie Mellon University. He graduated from the University of Texas at Austin with an M.S.(C.S) in 2004, and a Ph.D. in Computer Science in 2013. His research focuses on the systematic derivation and implementation of high-performance algorithms through the use of formal methods and analytical models. His goal is to achieve performance portability across both architectures and domains by understanding and capturing the interaction between software algorithms and hardware features through analytical models so as to build better code-generators, and/or software libraries for emerging domains and architectures.

Wed 9:20 a.m. - 9:45 a.m.

Professor Michael Taylor - University of Washington




Wed 9:45 a.m. - 10:10 a.m.

Professor Luca Carloni - Columbia University

Title: Accelerating Embedded Machine Learning with the Open-Source ESP Infrastructure

Abstract: Recent advances in machine learning (ML) have depended on the continued progress of hardware computing platforms. Future advances will depend even more on the synergistic progress of hardware and software. This is the case particularly for embedded ML applications, where developers must meet performance requirements under tighter resource constraints. The emerging open-source hardware community can play a unique role in supporting embedded ML research. ESP is an open-source research platform to design and program heterogeneous systems-on-chip. With the design automation capabilities of ESP, application developers can synthesize hardware accelerators from models specified in common ML frameworks, integrate these accelerators in a complete system-on-chip, and quickly obtain FPGA-based prototypes to evaluate their design by running embedded ML applications.

Keywords: SoC design, ML frameworks, FPGA

Bio: Luca Carloni is Professor of Computer Science at Columbia University in the City of New York. He holds a Laurea Degree Summa cum Laude in Electronics Engineering from the University of Bologna, Italy, and the MS and PhD degrees in Electrical Engineering and Computer Sciences from the University of California, Berkeley. His research interests include methodologies and tools for system-on-chip platforms with emphasis on heterogeneous computing, intellectual property reuse, design of networks-on-chip, embedded software, and distributed embedded systems. He coauthored over one hundred and fifty refereed papers and is the holder of two patents. Luca received the Faculty Early Career Development (CAREER) Award from the National Science Foundation in 2006, was selected as an Alfred P. Sloan Research Fellow in 2008, and received the ONR Young Investigator Award and the IEEE CEDA Early Career Award in 2010 and 2012, respectively. In 2013 Luca served as general chair of Embedded Systems Week (ESWeek), the premier event covering all aspects of embedded systems and software. Luca is an IEEE Fellow.

Wed 10:10 a.m. - 12:00 p.m.
Wed 12:00 p.m. - 12:25 p.m.
Facebook (Talk)
Wed 12:25 p.m. - 12:50 p.m.

Nick Ni - Director of Product Marketing, AI and Software - Xilinx

Title: Vitis AI: TensorFlow to FPGAs from edge to cloud

Abstract: AI scientists are moving from research (training) using high-price, high-power, large-form-factor HPCs to productization (inference). AI inference requires orders of magnitude more horsepower while keeping price, power, latency, and form factor intact, and Xilinx adaptable devices are ideal for that. However, the biggest challenge has been the programming model, which has required developers to be hardware savvy. In this talk, we will introduce the newly released development environment called Vitis AI, which allows users to directly take their TensorFlow trained models and target Xilinx devices from edge to cloud. Vitis AI consists of a suite of familiar tools for AI scientists: quantizer, pruner, compiler, profiler, runtime, and pre-optimized Deep Learning Processing Units (DPU).

Keywords: ML toolkit, FPGA, DPU

Bio: Nick Ni is the director of product marketing, AI and software at Xilinx. His team’s responsibilities include business development, go-to-market plans, ecosystem development, and outbound marketing for Xilinx’s artificial intelligence products and software/hardware development tools. Ni joined Xilinx in 2014. Before Xilinx, he held multiple roles in R&D and applications at ATI, AMD, Qualcomm, and Intel, focusing on embedded systems design and high-level synthesis. Ni earned a master’s degree in Computer Engineering from the University of Toronto and holds over 10 patents and publications.

Wed 12:50 p.m. - 1:05 p.m.
Wed 1:05 p.m. - 1:30 p.m.

James Moawad - Technical Solution Specialist - Intel


Abstract: We propose to show a Deep Learning Inference toolkit (OpenVINO), which provides a common API for inference independent of the underlying compute hardware. The inference engine can operate on CPU or be accelerated with GPU, VPU or FPGA. We will further look into details of an OpenCL-based Deep Learning Accelerator running on FPGA and how this is integrated into the software flow. We will conclude with a brief discussion of how the oneAPI unified programming model could be used for future development of such hardware-agnostic accelerators.
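
The idea of a common inference API that is independent of the compute device can be sketched as follows. This is a hypothetical illustration in plain Python, not the OpenVINO API: the class `DeviceAgnosticRunner`, its methods, the toy `cpu_matvec` backend, and the fall-back-to-CPU behaviour are all assumptions made for this example.

```python
from typing import Callable, Dict, List, Sequence

Matrix = Sequence[Sequence[float]]
Backend = Callable[[Matrix, Sequence[float]], List[float]]


def cpu_matvec(weights: Matrix, x: Sequence[float]) -> List[float]:
    """Reference CPU backend: one dense layer as a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class DeviceAgnosticRunner:
    """Hypothetical front end: callers name a device, backends do the work."""

    def __init__(self) -> None:
        # The CPU backend is always available in this sketch.
        self._backends: Dict[str, Backend] = {"CPU": cpu_matvec}

    def register(self, device: str, backend: Backend) -> None:
        """Plug in an accelerator-specific implementation (e.g. "FPGA")."""
        self._backends[device] = backend

    def infer(self, weights: Matrix, x: Sequence[float],
              device: str = "CPU") -> List[float]:
        # Assumption for this sketch: unknown devices fall back to the CPU.
        backend = self._backends.get(device, self._backends["CPU"])
        return backend(weights, x)
```

Calling code stays the same whichever backend is registered: `DeviceAgnosticRunner().infer([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0])` returns `[3.0, 7.0]`, and passing `device="FPGA"` gives the same result through whichever implementation is registered under that name.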

Keywords: Inference, ML toolkit, CPU, GPU, VPU, FPGA

Bio: James Moawad is a Technical Solution Specialist with Intel’s Programmable Solutions Group specializing in compute acceleration using Field Programmable Gate Arrays (FPGA). He holds a B.S. in Electrical Engineering from the University of Illinois at Urbana-Champaign and a M.S. in Electrical and Computer Engineering from Georgia Institute of Technology with a focus on processor architecture. He designed telecommunication systems at Bell Laboratories / Lucent Technologies from 1999 to 2006 utilizing FPGAs and multi-processor arrays. Since 2006, he has worked as a Field Application Engineer helping customers architect systems with FPGA, embedded processors, DSP and various memory solutions including DRAM, solid state drives and high bandwidth memory (HBM).

Wed 2:00 p.m. - 2:40 p.m.

Dr. Kshitij Sudan - Principal Solutions Architect - Arm


Abstract: Machine learning processing gets a lot of attention due to novel hardware accelerators being developed to speed up emerging use cases. The large and rapidly evolving accelerator space for ML processing, however, is eclipsed in reality by the amount of ML processing that happens on general-purpose CPUs. Some estimates suggest that more than 80% of ML inference occurs on general-purpose CPUs. The driving factors for on-CPU processing are threefold: 1) ease of programming, 2) integration of ML analysis output with business applications, and 3) the duty cycle of ML workloads. In this talk we will first outline the use cases that are well served by on-CPU ML workload execution, followed by how Arm is working to enable more efficient use of general-purpose Arm CPUs for edge-to-cloud processing of ML workloads. Efficient processing requires both hardware and software features to be co-developed, especially since ML algorithms are rapidly evolving. Arm is leveraging this co-design philosophy along with its traditional strength in energy-efficient design to make on-CPU ML processing pervasive and easy to use.

Keywords: inference, CPU

Bio: Dr. Kshitij Sudan is a Principal Solution Architect in the Infrastructure Business Unit at Arm where he helps build solutions to address market and customer needs. A solution could either be a single piece of Arm IP or a whole platform offering consisting of Arm IP and an enabling open-source software stack. His current areas of focus include smart offload (like SmartNICs), platform security, video encoding, and efficient ML/AI processing. He received his Ph.D. from the University of Utah where his research focused on DRAM-based memory systems. He has been granted two US patents and has multiple applications in the pipeline.

Wed 2:20 p.m. - 3:00 p.m.

Author Information

Ritwik Gupta (Carnegie Mellon University Software Engineering Institute)
John Wohlbier (Carnegie Mellon University Software Engineering Institute)

Dr. John G. Wohlbier is a Senior Research Scientist in the Emerging Technology Center at Carnegie Mellon University’s Software Engineering Institute. Wohlbier started his career at Los Alamos National Laboratory where he spent over a decade working on computational physics for the US Department of Energy Advanced Simulation and Computing program. After Los Alamos he spent several years supporting DoD HPC programs. His current focus is performance engineering for data intensive software. His interests include computation on modern and emerging hardware, performance engineering, and computational physics.

Tze Meng Low (Carnegie Mellon University)
Jeffrey Vetter (Oak Ridge National Laboratory)
Natalia Vassilieva (Cerebras Systems)