My name is Amir Yazdanbakhsh. I joined Google Research as a Research Scientist in 2019, following a one year AI residency. I am the co-founder and co-lead of the Machine Learning for Computer Architecture team. We leverage the recent machine learning methods and advancements to innovate and design better hardware accelerators. The work of our team has been covered by media outlets including WIRED, ZDNet, AnalyticsInsight, InfoQ

I am also interested in designing large-scale distributed systems for training machine learning applications. To that end, I led the development of a massively large-scale distributed reinforcement learning system that scales to TPU Pod and efficiently manages thousands of actors to solve complex, real-world tasks. As a case study, our team demonstrates how using this highly scalable system enables reinforcement learning to accomplish chip placement in ~an hour instead of days or weeks by human effort. 

I received my Ph.D. degree in computer science from the Georgia Institute of Technology. My Ph.D. work has been recognized by various awards, including Microsoft PhD Fellowship and Qualcomm Innovation Fellowship.

Publications

Please visit my Google Scholar page for the complete bibliography.

Model Compression


ML for Computer Architecture and Systems

Foundational Models


Computer Architecture

Awards and Honors

Media Coverage

Blogs

Computer Architecture research has a long history of developing simulators and tools to evaluate and shape the design of computer systems. These shared resources and infrastructure have benefited industry and academia and have enabled researchers to systematically build on each other's work, leading to significant advances in the field. Nonetheless, computer architecture research is evolving, with industry and academia turning towards machine learning (ML) optimization to meet stringent domain-specific requirements. Although prior work has demonstrated the benefits of ML in design optimization, the lack of strong, reproducible baselines hinders fair and objective comparison across different methods and poses several challenges to their deployment. To ensure steady progress, it is imperative to understand and tackle these challenges collectively. To alleviate these challenges, inArchGym: An Open-Source Gymnasium for Machine Learning Assisted Architecture Design”, accepted at ISCA 2023, we introduced ArchGym, which includes a variety of computer architecture simulators and ML algorithms.

In “Data-Driven Offline Optimization for Architecting Hardware Accelerators”, accepted at ICLR 2022, we introduce PRIME, an approach focused on architecting accelerators based on data-driven optimization that only utilizes existing logged data (e.g., data leftover from traditional accelerator design efforts), consisting of accelerator designs and their corresponding performance metrics (e.g., latency, power, etc) to architect hardware accelerators without any further hardware simulation. This alleviates the need to run time-consuming simulations and enables reuse of data from past experiments, even when the set of target applications changes (e.g., an ML model for vision, language, or other objective), and even for unseen but related applications to the training set, in a zero-shot fashion. 

In “Apollo: Transferable Architecture Exploration”, we present the progress of our research on ML-driven design of custom accelerators. While recent work has demonstrated promising results in leveraging ML to improve the low-level floor-planning process (in which the hardware components are spatially laid out and connected in silicon), in this work we focus on blending ML into the high-level system specification and architectural design stage, a pivotal contributing factor to the overall performance of the chip in which the design elements that control the high-level functionality are established. Our research shows how ML algorithms can facilitate architecture exploration and suggest high-performing architectures across a range of deep neural networks, with domains spanning image classification, object detection, OCR and semantic segmentation.

Today we introduce Menger, a massive large-scale distributed RL infrastructure with localized inference that scales up to several thousand actors across multiple processing clusters (e.g., Borg cells), reducing the overall training time in the task of chip placement. In this post we describe how we implement Menger using Google TPU accelerators for fast training iterations, and present its performance and scalability on the challenging task of chip placement. Menger reduces the training time by up to 8.6x (down to ~one hour) compared to a baseline implementation.