My name is Amir Yazdanbakhsh. I joined Google Research as a Research Scientist in 2019, following a one-year AI residency. I am the co-founder and co-lead of the Machine Learning for Computer Architecture team. We leverage recent machine learning methods and advances to innovate and design better hardware accelerators. The work of our team has been covered by media outlets including WIRED, ZDNet, Analytics Insight, and InfoQ.

I am also interested in designing large-scale distributed systems for training machine learning applications. To that end, I led the development of a massively scalable distributed reinforcement learning system that scales to TPU Pods and efficiently manages thousands of actors to solve complex, real-world tasks. As a case study, our team demonstrated how this highly scalable system enables reinforcement learning to accomplish chip placement in about an hour, instead of the days or weeks required by human effort.

I received my Ph.D. in computer science from the Georgia Institute of Technology. My Ph.D. work has been recognized by various awards, including the Microsoft Research PhD Fellowship and the Qualcomm Innovation Fellowship.

Publications

Please visit my Google Scholar page for the complete bibliography.

  1. FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks, ASPLOS 2023. [PDF]

  2. Text and Patterns: For Effective Chain of Thought, It Takes Two to Tango, ArXiv 2022. [PDF]

  3. GRANITE: A Graph Neural Network Model for Basic Block Throughput Estimation, IISWC 2022.

  4. An Evaluation of Edge TPU Accelerators for Convolutional Neural Networks, IISWC 2022.

  5. Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation, MICRO 2022. [PDF]

  6. Accelerating Attention through Gradient-Based Learned Runtime Pruning, ISCA 2022. [PDF]

  7. Data-Driven Offline Optimization for Architecting Hardware Accelerators, ICLR 2022. [PDF]

  8. Apollo: Transferable Architecture Exploration, ArXiv 2020. [PDF]

  9. Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation, ICLR 2020. [PDF]

  10. ReLeQ: A Reinforcement Learning Approach for Automatic Deep Quantization of Neural Networks, IEEE Micro 2020. [PDF]

  11. Mixed-Signal Charge-Domain Acceleration of Deep Neural Networks through Interleaved Bit-Partitioned Arithmetic, PACT 2020. [PDF]

  12. AxMemo: Hardware-Compiler Co-design for Approximate Code Memoization, ISCA 2019. [PDF | Talk]

  13. GANAX: A Unified MIMD-SIMD Acceleration for Generative Adversarial Networks, ISCA 2018. [PDF | Talk]

  14. SnaPEA: Predictive Early Activation for Reducing Computation in Deep Convolutional Neural Networks, ISCA 2018. [PDF]

  15. Towards Statistical Guarantees in Controlling Quality Tradeoffs for Approximate Acceleration, ISCA 2016. [PDF]

  16. Neural Acceleration for GPU Throughput Processors, MICRO 2015. [PDF]

  17. General-purpose Code Acceleration with Limited-precision Analog Computation, ISCA 2014. [PDF]

Awards and Honors

  • Microsoft Research PhD Fellowship, 2016-2018.

  • Qualcomm Innovation Fellowship, 2014-2015.

  • Gold Medal in ACM Student Research Competition (SRC), 2018.

  • Honorable Mention in IEEE Micro Top Picks, 2016.

Media Coverage

  • Need to Fit Billions of Transistors on a Chip? Let AI Do It, WIRED, 2021.

  • Google’s Deep Learning Finds a Critical Path in AI Chips, ZDNet, 2021.

  • Google's Apollo AI for Chip Design Improves Deep Learning Performance by 25%, InfoQ, 2021.

  • Google's AI Lab Unveiled A New Framework for Efficient Chip Design, Analytics Insight, 2021.

Blogs

In “Data-Driven Offline Optimization for Architecting Hardware Accelerators”, accepted at ICLR 2022, we introduce PRIME, a data-driven optimization approach that architects hardware accelerators using only existing logged data (e.g., data left over from traditional accelerator design efforts), consisting of accelerator designs and their corresponding performance metrics (e.g., latency, power), without any further hardware simulation. This alleviates the need to run time-consuming simulations and enables the reuse of data from past experiments, even when the set of target applications changes (e.g., an ML model for vision, language, or another objective), and even, in a zero-shot fashion, for applications that are related to but absent from the training set.
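The core loop of this kind of data-driven optimization can be sketched as follows. This is a deliberately simplified stand-in, not PRIME's actual method: PRIME learns a conservative neural surrogate, whereas here an ordinary least-squares fit on synthetic logged data plays that role, and the configuration parameters and cost numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical logged data from past design efforts: each row is an
# accelerator configuration (processing elements, buffer KiB) and its
# measured latency (ms). In practice these come from real simulator logs.
configs = rng.integers(low=[8, 32], high=[256, 1024], size=(200, 2)).astype(float)
latency = 50.0 / configs[:, 0] + 2000.0 / configs[:, 1] + rng.normal(0, 0.01, 200)

# Fit a surrogate f(config) -> latency on the logged data only.
features = np.column_stack([1.0 / configs[:, 0], 1.0 / configs[:, 1],
                            np.ones(len(configs))])
w, *_ = np.linalg.lstsq(features, latency, rcond=None)

# Optimize over candidate designs using the surrogate alone -- no new
# hardware simulations are run at this stage.
candidates = np.array([[pe, buf] for pe in (16, 64, 128, 256)
                       for buf in (64, 256, 512, 1024)], dtype=float)
cand_feats = np.column_stack([1.0 / candidates[:, 0], 1.0 / candidates[:, 1],
                              np.ones(len(candidates))])
best = candidates[np.argmin(cand_feats @ w)]
print(best)  # the design the surrogate predicts to be fastest
```

The key property mirrored here is that the optimizer touches only logged data and the learned surrogate; the expensive simulator never appears in the loop.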

In “Apollo: Transferable Architecture Exploration”, we present the progress of our research on ML-driven design of custom accelerators. While recent work has demonstrated promising results in leveraging ML to improve the low-level floorplanning process (in which the hardware components are spatially laid out and connected in silicon), in this work we focus on blending ML into the high-level system specification and architectural design stage, in which the design elements that control the high-level functionality are established and which is a pivotal contributor to the chip's overall performance. Our research shows how ML algorithms can facilitate architecture exploration and suggest high-performing architectures across a range of deep neural networks, with domains spanning image classification, object detection, OCR, and semantic segmentation.
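To illustrate the kind of search this design stage involves, here is a minimal sketch of black-box architecture exploration over a toy parameter space. The parameter names, the stand-in cost model, and the simple (1+1) evolutionary loop are all illustrative assumptions; Apollo's actual optimizers and design spaces are considerably richer.

```python
import random

random.seed(0)

# Hypothetical high-level accelerator parameters to explore.
SPACE = {
    "pe_x": [2, 4, 8, 16],        # processing-element grid width
    "pe_y": [2, 4, 8, 16],        # processing-element grid height
    "buffer_kib": [64, 128, 256, 512],
}

def runtime_ms(cfg):
    # Stand-in cost model; a real flow would invoke a simulator for the
    # target neural network workload here.
    compute = 1e5 / (cfg["pe_x"] * cfg["pe_y"])
    memory = 4e4 / cfg["buffer_kib"]
    return compute + memory

def sample():
    return {k: random.choice(v) for k, v in SPACE.items()}

def mutate(cfg):
    child = dict(cfg)
    key = random.choice(list(SPACE))
    child[key] = random.choice(SPACE[key])
    return child

# (1+1) evolutionary search: keep the best design found so far.
best = sample()
for _ in range(200):
    child = mutate(best)
    if runtime_ms(child) < runtime_ms(best):
        best = child

print(best, runtime_ms(best))
```

Even this crude loop conveys the structure of the problem: a discrete design space, an expensive evaluation per point, and an optimizer that proposes candidates and keeps the best.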

Today we introduce Menger, a large-scale distributed RL infrastructure with localized inference that scales up to several thousand actors across multiple processing clusters (e.g., Borg cells), reducing the overall training time on the task of chip placement. In this post we describe how we implement Menger using Google TPU accelerators for fast training iterations, and present its performance and scalability on the challenging task of chip placement. Menger reduces the training time by up to 8.6x (down to about one hour) compared to a baseline implementation.
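A minimal, single-process sketch of the localized-inference idea (each actor acting on a cached policy that it refreshes periodically) might look like this. The class names, refresh schedule, and toy "policy" are all assumptions for illustration, not Menger's actual implementation.

```python
class Learner:
    """Central trainer that owns the canonical policy parameters."""

    def __init__(self):
        self.version = 0
        self.params = 1.0

    def train_step(self, batch):
        # Stand-in for a TPU training step on a batch of trajectories.
        self.params += 0.01 * sum(batch)
        self.version += 1

    def snapshot(self):
        return self.version, self.params


class Actor:
    """Environment worker performing inference on cached parameters."""

    def __init__(self, learner, refresh_every=10):
        self.learner = learner
        self.refresh_every = refresh_every
        self.version, self.params = learner.snapshot()
        self.steps = 0

    def act(self, observation):
        self.steps += 1
        # Refresh the cached policy only every `refresh_every` steps,
        # avoiding a round trip to the learner on every single action.
        if self.steps % self.refresh_every == 0:
            self.version, self.params = self.learner.snapshot()
        return observation * self.params


learner = Learner()
actors = [Actor(learner) for _ in range(4)]

for _ in range(100):
    batch = [actor.act(1.0) for actor in actors]  # local, cached inference
    learner.train_step(batch)

print(learner.version)    # 100 training steps completed
print(actors[0].version)  # actors lag the learner by at most refresh_every
```

The point of the sketch is the trade-off at Menger's core: actors tolerate slightly stale parameters in exchange for eliminating a per-action inference round trip to a central server.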