Faculty, Computer Science, Johns Hopkins University
I'm a faculty member at Johns Hopkins University and, until early 2025, concurrently a scientist at Meta.
I lead the Foundational Networked Systems Lab, where my research focuses on the intersection of
AI and large-scale networked systems — both
building the networks that power distributed AI and using AI to manage these networks intelligently.
Our work spans AI Datacenters (e.g., network architecture, congestion, and collectives)
and AI for Datacenters (e.g., multi-agent LLMs for autonomous infrastructure management).
Our systems and tools have been deployed in some of the largest production networks in the world,
shaping how modern AI infrastructure is built and operated.
Prospective students and postdocs: I’m always looking for strong, motivated researchers interested in
AI-driven networking, distributed systems, and datacenter infrastructure.
Acknowledgment: Our research is supported by the National Science Foundation, Meta, Intel, Microsoft,
HPE, and Google.
Our lab sits at the interface of AI and networked systems.
We design datacenter fabrics that can sustain the traffic patterns of distributed AI,
and we build AI-driven control planes that make these fabrics observable and self-optimizing.
The questions below motivate some of our recent work.
AI Datacenters: Building Networks for Distributed AI
Why do AI networks experience millisecond-scale traffic explosions even under steady workloads?
We trace these “microbursts” to feedback loops between transport protocols, NIC queues, and switch scheduling,
not just application behavior. Using synchronized telemetry across thousands of endpoints, we show how
interactions between NIC pacing, queue build-up, and congestion signals amplify small timing skews into
fabric-wide pressure waves. The resulting model revises long-standing assumptions about where burst mitigation
should occur — shifting focus from application reshaping to cross-layer feedback control.
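To make the telemetry angle concrete, here is a minimal Python sketch of the kind of analysis this implies: flag millisecond-scale bursts in synchronized per-port queue-depth samples, then check whether bursts on different ports line up in time. The data layout, thresholds, and function names are our own illustrative choices, not the measurement pipeline from the papers.

```python
# A minimal sketch (our illustration, not the papers' pipeline): flag
# millisecond-scale bursts in synchronized per-port queue-depth samples,
# then check whether bursts on different ports line up in time.

def find_bursts(samples, threshold, min_len_ms=1):
    """samples: list of (timestamp_ms, queue_depth) at a fixed 1 ms cadence.
    Returns (start_ms, end_ms) intervals where depth stays at/above threshold."""
    bursts, start = [], None
    for ts, depth in samples:
        if depth >= threshold and start is None:
            start = ts
        elif depth < threshold and start is not None:
            if ts - start >= min_len_ms:
                bursts.append((start, ts))
            start = None
    if start is not None:
        bursts.append((start, samples[-1][0]))
    return bursts

def aligned_bursts(per_port, threshold, slack_ms=2):
    """Group bursts whose starts fall within slack_ms of each other across
    ports: multi-port alignment is the fingerprint of a fabric-wide
    pressure wave rather than independent application behavior."""
    events = sorted(
        (start, port, end)
        for port, samples in per_port.items()
        for start, end in find_bursts(samples, threshold)
    )
    groups, current = [], []
    for ev in events:
        if current and ev[0] - current[0][0] > slack_ms:
            groups.append(current)
            current = []
        current.append(ev)
    if current:
        groups.append(current)
    return [g for g in groups if len({port for _, port, _ in g}) > 1]

# Two ports whose 3 ms bursts start 1 ms apart are grouped as one event.
per_port = {
    "spine1:eth4": [(t, 80 if 10 <= t <= 12 else 5) for t in range(20)],
    "spine2:eth1": [(t, 90 if 11 <= t <= 13 else 4) for t in range(20)],
}
print(aligned_bursts(per_port, threshold=50))
```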
Where does congestion actually “live” in AI clusters?
Our large-scale measurements reveal that traffic pressure has shifted from the edge to the fabric core
— driven by short, synchronized bursts aligned with collective communication barriers.
We correlate pause-frame propagation with RDMA-level counters and uncover distinct spatiotemporal “congestion
signatures” that repeat across training waves. These insights led to new coarse-grained congestion
indicators (e.g., PFC pause ratio vectors) that accurately localize core imbalance without requiring
expensive fine-grained sampling, enabling fabric-aware scheduling and topology optimization.
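As a rough illustration of how cheap such indicators can be, the sketch below builds a pause-ratio vector from two coarse counters per port and scores core imbalance as a max-to-mean ratio. The counter format and the scoring rule are assumptions for illustration, not the exact definitions from our papers.

```python
# A minimal sketch, assuming per-port counters of the form
# (pause_duration_ns, interval_ns) sampled coarsely; the counter layout
# and imbalance score are illustrative, not the papers' definitions.

def pause_ratio_vector(port_counters):
    """Map each port to the fraction of the sampling interval it spent
    paused by PFC. Cheap to collect: two counters per port, with no
    per-packet sampling required."""
    return {
        port: paused_ns / interval_ns
        for port, (paused_ns, interval_ns) in port_counters.items()
    }

def core_imbalance(vector, core_ports):
    """Score how unevenly pause pressure spreads across core-facing ports:
    a high max-to-mean ratio suggests a localized hot spot in the fabric
    core rather than uniform load."""
    ratios = [vector[p] for p in core_ports if p in vector]
    if not ratios or sum(ratios) == 0:
        return 0.0
    mean = sum(ratios) / len(ratios)
    return max(ratios) / mean

counters = {"eth0": (4e8, 1e9), "eth1": (1e7, 1e9), "eth2": (2e7, 1e9)}
vec = pause_ratio_vector(counters)
print(vec, core_imbalance(vec, ["eth0", "eth1", "eth2"]))
```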
One to Many: Scalable Multicast for AI Datacenters
Can in-network replication scale to trillion-parameter training jobs?
PEEL introduces a layer-peeling heuristic that builds near-optimal multicast trees in asymmetric Clos fabrics.
Instead of treating tree construction as a global optimization, PEEL incrementally peels outer network layers,
deriving compact, prefix-based rules that achieve bandwidth within 1.4% of Steiner-optimal solutions while
maintaining switch rule sets under 64 entries. This design makes hardware-accelerated collectives viable at
hyperscale and exposes new tradeoffs between rule compression and replication latency.
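The layer-peeling idea can be sketched in a few lines: group destinations under their parent switch at the outermost layer, emit one replication rule there, then repeat one layer in with the parents as the new destinations. The topology encoding and rule format below are simplified stand-ins for PEEL's actual prefix-based rules.

```python
# A minimal sketch of the layer-peeling idea as summarized above; the
# topology encoding and (switch, children) rule format are our own
# simplifications, not PEEL's prefix-based rules.

from collections import defaultdict

def peel(dests, parent_of, layers):
    """dests: destination node ids at the outermost layer.
    parent_of[layer][node] -> that node's uplink parent one layer in.
    Returns replication rules as (switch, [downstream children])."""
    rules = []
    frontier = set(dests)
    for layer in range(layers):
        groups = defaultdict(list)
        for node in frontier:
            groups[parent_of[layer][node]].append(node)
        # one compact rule per parent switch: replicate to its children
        rules.extend((parent, sorted(kids)) for parent, kids in groups.items())
        frontier = set(groups)  # parents become the next layer's targets
    return rules

# Two-layer example: hosts -> leaves -> spine.
parent_of = [
    {"h1": "leaf1", "h2": "leaf1", "h3": "leaf2"},  # layer 0: host uplinks
    {"leaf1": "spine", "leaf2": "spine"},           # layer 1: leaf uplinks
]
for switch, children in peel(["h1", "h2", "h3"], parent_of, layers=2):
    print(switch, "->", children)
```

Peeling from the outside in keeps each rule local to one switch, which is what bounds rule-set size even as the destination set grows.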
AI for Datacenters: Learning-Based Network Management
Confucius: Multi-Agent LLMs for Network Management
Can LLM agents reason about datacenter state at the scale and precision engineers require?
Confucius is a multi-agent LLM framework for datacenter operations that blends retrieval-augmented generation
with domain-specific reasoning. Each agent specializes in planning, validation, or execution, using
structured telemetry embeddings to translate high-level intent into low-level configuration actions.
Deployed at Meta for over two years, Confucius executes workflows like topology redesigns, fault triage,
and capacity planning — achieving 80%+ automation accuracy while preserving human-in-the-loop safety
via explainable intermediate reasoning.
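A schematic of the planner / validator / executor split might look like the sketch below. The agent interfaces, the safety invariant, and the example step are placeholders we made up for illustration; the deployed system's internals are not public, and the LLM calls are stubbed out.

```python
# A minimal sketch of the planner / validator / executor split; every
# interface here is a hypothetical placeholder, with the LLM call stubbed.

from dataclasses import dataclass

@dataclass
class Step:
    action: str     # e.g., "drain_port", "update_route"
    target: str
    rationale: str  # intermediate reasoning kept for human review

def plan(intent: str, telemetry: dict) -> list[Step]:
    """Planner agent: turn high-level intent plus retrieved telemetry
    context into an ordered list of candidate steps (LLM call stubbed)."""
    return [Step("drain_port", "switch7:eth3", f"high pause ratio; intent={intent}")]

def validate(steps: list[Step], telemetry: dict) -> list[Step]:
    """Validator agent: reject steps that violate safety invariants,
    e.g., never drain a port that is the last healthy uplink."""
    return [s for s in steps if telemetry.get("healthy_uplinks", 0) > 1]

def execute(steps: list[Step], approved: bool) -> None:
    """Executor agent: apply steps only after human-in-the-loop approval,
    surfacing each rationale so operators can audit the reasoning."""
    for s in steps:
        if approved:
            print(f"apply {s.action} on {s.target}: {s.rationale}")

telemetry = {"healthy_uplinks": 3}
steps = validate(plan("rebalance pod 4", telemetry), telemetry)
execute(steps, approved=True)
```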
Toward Intelligent and Efficient AI Infrastructure
Together, these projects span the full stack of AI datacenter networking —
from diagnosing emergent traffic behavior to rethinking communication primitives and embedding intelligence
into management planes. Our broader goal is infrastructure that is self-optimizing, observable, and efficient at unprecedented scale.
Department of Computer Science, Johns Hopkins University — Spring 2025, Fall 2021, Fall 2018
Selected Topics in Cloud and Networked Systems
Department of Computer Science, Johns Hopkins University — Spring 2022, Spring 2021, Fall 2020
Bio & Research Lab
I'm a faculty member in the Department of Computer Science at Johns Hopkins (since 2018) and was a scientist at Meta until early 2025. I received my Ph.D. from the University of Illinois at Urbana-Champaign, my M.Sc. from the University of Toronto, and my B.Sc. from Sharif University of Technology. I previously worked or interned at the Max Planck Institute (MPI), Microsoft Research, and Princeton University.
At Hopkins, I direct the Foundational Networked Systems Lab. Our research integrates systems and theory and has been supported by Intel, Meta, NSF, VMware, Microsoft, and others. I'm fortunate to work with the outstanding group of students listed below.