Amir Yazdanbakhsh

My name is Amir Yazdanbakhsh. I joined Google Research as a Research Scientist in 2019, following a one-year AI residency. I am the co-founder and co-lead of the Machine Learning for Computer Architecture team. We leverage recent advances in machine learning to design better hardware accelerators. Our team's work has been covered by media outlets including WIRED, ZDNet, Analytics Insight, and InfoQ.

I am also interested in designing large-scale distributed systems for training machine learning applications. To that end, I led the development of a massively large-scale distributed reinforcement learning system that scales to TPU Pods and efficiently manages thousands of actors to solve complex, real-world tasks. As a case study, our team demonstrated how this highly scalable system enables reinforcement learning to accomplish chip placement in about an hour, instead of the days or weeks it takes by human effort.

I received my Ph.D. in computer science from the Georgia Institute of Technology. My Ph.D. work has been recognized by several awards, including the Microsoft Research PhD Fellowship and the Qualcomm Innovation Fellowship.

Publications

Please visit my Google Scholar page for the complete bibliography.

Efficient Generative AI (Sparsity, Quantization, Linearization, Parallelization Strategies, etc.)

🆕 Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding, ICML 2025. [PDF][X Post]
🆕 SLIM: One-shot Quantization and Sparsity with Low-rank Approximation for LLM Weight Compression, ICML 2025. [PDF]
🆕 SparseLoRA: Accelerating LLM Fine-Tuning with Contextual Sparsity, ICML 2025. [TBD]
The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws, ICLR 2025. [PDF]
SLOPE: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs, ICLR 2025. [PDF]
Effective Interplay between Sparsity and Quantization: From Theory to Practice, ICLR 2025. [PDF] ** 🏆 Spotlight Presentation **
Progressive Gradient Flow for Robust N:M Sparsity Training in Transformers, CPAL 2025. [PDF][Source] ** Oral Presentation **
ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization, NeurIPS 2024. [PDF]
When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models, ICML 2024. [PDF]
USM-Lite: Quantization and Sparsity Aware Fine-tuning for Speech Recognition with Universal Speech Models, ICASSP 2024. [PDF] ** Oral Presentation **
Jaxpruner: A Concise Library for Sparsity Research, CPAL 2024. [PDF][Source] ** Oral Presentation **
STEP: Learning N:M Structured Sparsity Masks from Scratch with Precondition, ICML 2023. [PDF]
ReLeQ: A Reinforcement Learning Approach for Automatic Deep Quantization of Neural Networks, IEEE MICRO 2020. [PDF]

ML for Computer Architecture and Systems / ML for Code

🆕 ECO: An LLM-Driven Efficient Code Optimizer for Warehouse Scale Computers, arXiv 2025. [PDF]
Concorde: Fast and Accurate CPU Performance Modeling with Compositional Analytical-ML Fusion, ISCA 2025. [PDF]
QuArch: A Question-Answering Dataset for AI Agents in Computer Architecture, IEEE CAL 2025. [PDF]
CodeRosetta: Pushing the Boundaries of Unsupervised Code Translation for Parallel Programming, NeurIPS 2024. [PDF]
TAO: Re-Thinking DL-based Microarchitecture Simulation, ACM SIGMETRICS / IFIP PERFORMANCE 2024. [PDF]
Learning Performance-Improving Code Edits, ICLR 2024. [PDF][Source] ** 🏆 Spotlight Presentation ** || ** 🏆 MICRO Top Picks 2025 **
GRANITE: A Graph Neural Network Model for Basic Block Throughput Estimation, IISWC 2022. [PDF][Source]
An Evaluation of Edge TPU Accelerators for Convolutional Neural Networks, IISWC 2022. [PDF]
Data-Driven Offline Optimization for Architecting Hardware Accelerators, ICLR 2022. [PDF][Source]
Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation, ICLR 2020. [PDF]

Foundational Models

Self-Refine: Iterative Refinement with Self-Feedback, NeurIPS 2023. [PDF][Source]
What Makes Chain-of-Thought Prompting Effective? A Counterfactual Study, EMNLP 2023. [PDF][arXiv][Source]

Computer Architecture

RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving, ISCA 2025. [PDF]
LIA: A Single-GPU LLM Inference Acceleration with Cooperative AMX-Enabled CPU-GPU Computation and CXL Offloading, ISCA 2025. [TBD]
DaCapo: Accelerating Continuous Learning in Autonomous Systems for Video Analytics, ISCA 2024. [PDF] ** 🏆 Distinguished Artifact Award **
Tandem Processor: Grappling with Emerging Operators in Neural Networks, ASPLOS 2024. [PDF][Source] ** 🏆 MICRO Top Picks Honorable Mention **
In-Storage Domain-Specific Acceleration for Serverless Computing, ASPLOS 2024. [PDF][Source]
MESA: Microarchitecture Extensions for Spatial Architecture Generation, ISCA 2023. [PDF] ** 🎖️ Inducted into the ISCA Hall of Fame **
Architecture Gym for Benchmarking Machine-Learning Aided Design, ISCA 2023. [PDF][Source]
  - Architecture 2.0: Data-Centric AI Gymnasium for Computer Architecture Design, Global OCP Summit, 2023. [Video]
FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks, ASPLOS 2023. [PDF]
Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation, MICRO 2022. [PDF]
Accelerating Attention through Gradient-Based Learned Runtime Pruning, ISCA 2022. [PDF]
AxMemo: Hardware-Compiler Co-design for Approximate Code Memoization, ISCA 2019. [PDF]
GANAX: A Unified MIMD-SIMD Acceleration for Generative Adversarial Networks, ISCA 2018. [PDF]
SnaPEA: Predictive Early Activation for Reducing Computation in Deep Convolutional Neural Networks, ISCA 2018. [PDF]
Towards Statistical Guarantees in Controlling Quality Tradeoffs for Approximate Acceleration, ISCA 2016. [PDF]
Neural Acceleration for GPU Throughput Processors, MICRO 2015. [PDF]
General-purpose Code Acceleration with Limited-precision Analog Computation, ISCA 2014. [PDF][Retrospective] ** 🏆 MICRO Top Picks Honorable Mention **

Awards and Honors

Inducted into the ISCA Hall of Fame, 2023.
Microsoft Research PhD Fellowship, 2016-2018.
Qualcomm Innovation Fellowship, 2014-2015.
Gold Medal in ACM Student Research Competition (SRC), 2018.
Honorable Mention in IEEE Micro Top Picks, 2016.

Media Coverage

Need to Fit Billions of Transistors on a Chip? Let AI Do It, WIRED, 2021.
Google’s Deep Learning Finds a Critical Path in AI Chips, ZDNet, 2021.
Google's Apollo AI for Chip Design Improves Deep Learning Performance by 25%, InfoQ, 2021.
Google's AI Lab Unveiled A New Framework for Efficient Chip Design, Analytics Insight, 2021.

Blogs and Videos

An open-source gymnasium for machine learning assisted computer architecture design

Computer architecture research has a long history of developing simulators and tools to evaluate and shape the design of computer systems. These shared resources and infrastructure have benefited industry and academia and have enabled researchers to systematically build on each other's work, leading to significant advances in the field. Nonetheless, computer architecture research is evolving, with industry and academia turning towards machine learning (ML) optimization to meet stringent domain-specific requirements. Although prior work has demonstrated the benefits of ML in design optimization, the lack of strong, reproducible baselines hinders fair and objective comparison across different methods and poses several challenges to their deployment. To ensure steady progress, it is imperative to understand and tackle these challenges collectively. To that end, in “ArchGym: An Open-Source Gymnasium for Machine Learning Assisted Architecture Design”, accepted at ISCA 2023, we introduced ArchGym, which includes a variety of computer architecture simulators and ML algorithms.

Offline Optimization for Architecting Hardware Accelerators

In “Data-Driven Offline Optimization for Architecting Hardware Accelerators”, accepted at ICLR 2022, we introduce PRIME, a data-driven optimization approach that architects hardware accelerators using only existing logged data (e.g., data left over from traditional accelerator design efforts), consisting of accelerator designs and their corresponding performance metrics (e.g., latency, power, etc.), without any further hardware simulation. This alleviates the need to run time-consuming simulations and enables reuse of data from past experiments, even when the set of target applications changes (e.g., an ML model for vision, language, or other objective), and even, in a zero-shot fashion, for unseen applications that are related to those in the training set.
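The general idea of choosing a design from logged data alone can be illustrated with a small sketch. This is not PRIME's actual model: it swaps in a plain k-nearest-neighbour surrogate as a stand-in, and the design encoding, the function names, and every number in the log are hypothetical values made up for illustration.

```python
from typing import List, Tuple

# Hypothetical design encoding: (num_processing_elements, sram_kb).
Design = Tuple[int, int]


def predict_latency(logged: List[Tuple[Design, float]], query: Design, k: int = 2) -> float:
    """k-nearest-neighbour surrogate: mean latency of the k closest logged designs."""
    def dist(a: Design, b: Design) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    nearest = sorted(logged, key=lambda rec: dist(rec[0], query))[:k]
    return sum(latency for _, latency in nearest) / len(nearest)


def pick_design(logged: List[Tuple[Design, float]], candidates: List[Design], k: int = 2) -> Design:
    """Rank candidates by predicted latency only; no simulator is ever called."""
    return min(candidates, key=lambda d: predict_latency(logged, d, k))


# Made-up log of (design, measured latency in ms) pairs from "past experiments".
LOGGED = [((16, 64), 9.0), ((32, 64), 6.0), ((64, 128), 3.5), ((128, 256), 3.0)]
CANDIDATES = [(24, 64), (96, 192)]

best = pick_design(LOGGED, CANDIDATES)
print(best, predict_latency(LOGGED, best))
```

The point of the sketch is only the workflow: the surrogate is fit to (here, looked up from) previously logged designs, and new candidates are scored without running any further simulation. A real system would need a far more robust surrogate than nearest neighbours.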

Machine Learning for Computer Architecture

In “Apollo: Transferable Architecture Exploration”, we present the progress of our research on ML-driven design of custom accelerators. While recent work has demonstrated promising results in leveraging ML to improve the low-level floor-planning process (in which the hardware components are spatially laid out and connected in silicon), in this work we focus on blending ML into the high-level system specification and architectural design stage. This stage establishes the design elements that control high-level functionality and is a pivotal contributor to the overall performance of the chip. Our research shows how ML algorithms can facilitate architecture exploration and suggest high-performing architectures across a range of deep neural networks, with domains spanning image classification, object detection, OCR, and semantic segmentation.

Massively Large-Scale Distributed Reinforcement Learning with Menger

Today we introduce Menger, a massively large-scale distributed RL infrastructure with localized inference that scales up to several thousand actors across multiple processing clusters (e.g., Borg cells), reducing the overall training time in the task of chip placement. In this post, we describe how we implement Menger using Google TPU accelerators for fast training iterations, and present its performance and scalability on the challenging task of chip placement. Menger reduces the training time by up to 8.6x (down to ~one hour) compared to a baseline implementation.
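As a rough illustration of the actor pattern with localized inference, the toy sketch below runs a few actors as threads. Each actor holds its own snapshot of a (trivial) policy, computes actions locally, and pushes experience onto a shared queue for a learner to consume. This is not Menger's implementation: the `Actor` class, the `collect` function, the scalar "policy", and all parameter values are hypothetical.

```python
import queue
import threading


class Actor(threading.Thread):
    """Hypothetical actor that acts with a local copy of the policy."""

    def __init__(self, actor_id, policy_weight, out_queue, steps):
        super().__init__()
        self.actor_id = actor_id
        self.policy_weight = policy_weight  # local snapshot; no remote inference call
        self.out_queue = out_queue
        self.steps = steps

    def run(self):
        for t in range(self.steps):
            observation = float(t)  # stand-in for an environment observation
            action = self.policy_weight * observation  # inference happens locally
            self.out_queue.put((self.actor_id, observation, action))


def collect(num_actors=4, steps=5, policy_weight=0.5):
    """Run all actors to completion and drain their experience into one batch."""
    experience = queue.Queue()
    actors = [Actor(i, policy_weight, experience, steps) for i in range(num_actors)]
    for actor in actors:
        actor.start()
    for actor in actors:
        actor.join()
    batch = []
    while not experience.empty():
        batch.append(experience.get())
    return batch


batch = collect()
print(len(batch), "transitions collected")
```

In a real system the learner would train on such batches and periodically publish updated weights for the actors to refresh their local copies; that loop is omitted here to keep the sketch short.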