Soudeh Ghorbani

I'm a faculty member at Johns Hopkins University and, until earlier this year, concurrently a scientist at Meta. I lead the Foundational Networked Systems Lab, where my research focuses on the intersection of AI and large-scale networked systems — both building the networks that power distributed AI and using AI to manage these networks intelligently.

Our work spans AI Datacenters (e.g., network architecture, congestion, and collectives) and AI for Datacenters (e.g., multi-agent LLMs for autonomous infrastructure management). Our systems and tools have been deployed in some of the largest production networks in the world, shaping how modern AI infrastructure is built and operated.

Prospective students and postdocs: I’m always looking for strong, motivated researchers interested in AI-driven networking, distributed systems, and datacenter infrastructure.

Acknowledgment: Our research is supported by the National Science Foundation, Meta, Intel, Microsoft, HPE, and Google.


Explore Our Research

Our lab sits at the interface of AI and networked systems. We design datacenter fabrics that can sustain the traffic patterns of distributed AI, and we build AI-driven control planes that make these fabrics observable and self-optimizing. The questions below motivate some of our recent work.


AI Datacenters: Building Networks for Distributed AI

Origins of Bursts


NSDI'23 | Paper

Why do AI networks experience millisecond-scale traffic explosions even under steady workloads? We trace these “microbursts” to feedback loops between transport protocols, NIC queues, and switch scheduling, not just application behavior. Using synchronized telemetry across thousands of endpoints, we show how interactions between NIC pacing, queue build-up, and congestion signals amplify small timing skews into fabric-wide pressure waves. The resulting model revises long-standing assumptions about where burst mitigation should occur — shifting focus from application reshaping to cross-layer feedback control.

Congestion Patterns in AI Datacenters


IMC'25 | Paper

Where does congestion actually “live” in AI clusters? Our large-scale measurements reveal that traffic pressure has shifted from the edge to the fabric core — driven by short, synchronized bursts aligned with collective communication barriers. We correlate pause-frame propagation with RDMA-level counters and uncover distinct spatiotemporal “congestion signatures” that repeat across training waves. These insights led to new coarse-grained congestion indicators (e.g., PFC pause ratio vectors) that accurately localize core imbalance without requiring expensive fine-grained sampling, enabling fabric-aware scheduling and topology optimization.
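The pause-ratio idea can be illustrated with a minimal sketch. Everything here — the counter layout, tier names, and function name — is hypothetical scaffolding for exposition, not the paper's actual data model: the point is simply that aggregating PFC pause time per fabric tier yields a cheap indicator of where congestion concentrates.

```python
from collections import defaultdict

def pause_ratio_vector(counters, window_s):
    """Aggregate per-switch PFC pause durations into a per-tier
    pause-ratio vector: pause time divided by the measurement window,
    averaged over the switches in each fabric tier."""
    tier_ratios = defaultdict(list)
    for (tier, _switch_id), pause_s in counters.items():
        tier_ratios[tier].append(pause_s / window_s)
    return {tier: sum(r) / len(r) for tier, r in tier_ratios.items()}

# Toy counters over a 1-second window: core switches spent far more
# time paused than edge switches, consistent with congestion having
# shifted into the fabric core.
counters = {
    ("edge", "e1"): 0.01, ("edge", "e2"): 0.02,
    ("core", "c1"): 0.40, ("core", "c2"): 0.36,
}
vec = pause_ratio_vector(counters, window_s=1.0)
print(vec)  # core ratio dominates the edge ratio
```

A coarse vector like this can localize core imbalance without per-packet sampling, which is the efficiency argument the measurement study makes.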

One to Many: Scalable Multicast for AI Datacenters


HotNets'25 | Paper

Can in-network replication scale to trillion-parameter training jobs? PEEL introduces a layer-peeling heuristic that builds near-optimal multicast trees in asymmetric Clos fabrics. Instead of treating tree construction as a global optimization, PEEL incrementally peels outer network layers, deriving compact, prefix-based rules that achieve bandwidth within 1.4% of Steiner-optimal solutions while maintaining switch rule sets under 64 entries. This design makes hardware-accelerated collectives viable at hyperscale and exposes new tradeoffs between rule compression and replication latency.
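The layer-peeling intuition can be sketched on a toy two-tier fabric. This is not PEEL's algorithm or rule format — the topology encoding, names, and single-parent simplification are all illustrative assumptions — but it shows the core move: resolve replication points layer by layer from the destinations upward, rather than solving one global tree optimization.

```python
def peel_multicast_tree(topology, src_leaf, dst_leaves):
    """Toy layer-peeling sketch: walk each destination up toward the
    root, installing a downward replication rule at each hop, and stop
    as soon as the path merges with the source's path. `topology` maps
    each node to its single parent; real Clos fabrics have multiple
    parents and asymmetric links, which this sketch ignores."""
    def path_to_root(node):
        path = [node]
        while node in topology:
            node = topology[node]
            path.append(node)
        return path

    src_ancestors = set(path_to_root(src_leaf))
    rules = {}  # node -> set of children to replicate toward
    for dst in dst_leaves:
        path = path_to_root(dst)
        for child, parent in zip(path, path[1:]):
            rules.setdefault(parent, set()).add(child)
            if parent in src_ancestors:
                break  # merged with the source's path; peeling done
    return rules

# Two-tier fabric: leaves l0..l3 under ToRs t0, t1; one spine s0.
# Upward forwarding from the source toward s0 is left implicit.
topo = {"l0": "t0", "l1": "t0", "l2": "t1", "l3": "t1",
        "t0": "s0", "t1": "s0"}
rules = peel_multicast_tree(topo, "l0", ["l1", "l2", "l3"])
print(rules)
```

Note how the spine emits a single copy toward t1 even though two destinations sit beneath it — the replication fan-out is deferred to the lowest layer that needs it, which is what keeps per-switch rule sets small.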

AI for Datacenters: Learning-Based Network Management

Confucius: Multi-Agent LLMs for Network Management


SIGCOMM'25 | Paper

Can LLM agents reason about datacenter state at the scale and precision engineers require? Confucius is a multi-agent LLM framework for datacenter operations that blends retrieval-augmented generation with domain-specific reasoning. Each agent specializes in planning, validation, or execution, using structured telemetry embeddings to translate high-level intent into low-level configuration actions. Deployed at Meta for over two years, Confucius executes workflows like topology redesigns, fault triage, and capacity planning — achieving 80%+ automation accuracy while preserving human-in-the-loop safety via explainable intermediate reasoning.
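The planner/validator/executor split can be sketched as a minimal pipeline. All of the names, the hard-coded plan, and the drain-before-reboot safety rule below are invented for illustration — Confucius's agents are LLM-driven and its interfaces are richer — but the sketch shows why a dedicated validation stage between planning and execution is the safety hinge of the design.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    device: str
    command: str

@dataclass
class Workflow:
    intent: str
    plan: list = field(default_factory=list)
    log: list = field(default_factory=list)

def planner(wf):
    # Stand-in for an LLM planning agent: translate high-level intent
    # into concrete steps. Here the plan is hard-coded for illustration.
    wf.plan = [Action("sw-42", "drain"), Action("sw-42", "reboot"),
               Action("sw-42", "undrain")]
    return wf

def validator(wf):
    # Stand-in for the validation agent: reject any plan that reboots
    # a device before draining traffic off it.
    drained = set()
    for a in wf.plan:
        if a.command == "drain":
            drained.add(a.device)
        elif a.command == "reboot" and a.device not in drained:
            raise ValueError(f"unsafe plan: reboot {a.device} before drain")
    return wf

def executor(wf):
    # Stand-in for the execution agent: apply each validated action.
    for a in wf.plan:
        wf.log.append(f"{a.device}: {a.command} ok")
    return wf

wf = executor(validator(planner(Workflow("replace failing switch sw-42"))))
print(wf.log)
```

Because the validator sits between intent and execution, an unsafe plan (e.g., a bare reboot with no drain) is rejected before any device is touched — a toy version of the human-in-the-loop safety property described above.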

Toward Intelligent and Efficient AI Infrastructure

Together, these projects span the full stack of AI datacenter networking — from diagnosing emergent traffic behavior to rethinking communication primitives and embedding intelligence into management planes. Our broader goal is infrastructure that is self-optimizing, observable, and efficient at unprecedented scale.

Teaching

Computer Networks

Department of Computer Science, Johns Hopkins University — Fall 2022, Spring 2021

Cloud Computing

Department of Computer Science, Johns Hopkins University — Fall 2025, Spring 2022, Fall 2020, Spring 2020, Spring 2019

Advanced Computer Networks

Department of Computer Science, Johns Hopkins University — Spring 2025, Fall 2021, Fall 2018

Selected Topics in Cloud and Networked Systems

Department of Computer Science, Johns Hopkins University — Spring 2022, Spring 2021, Fall 2020

I'm a faculty member in the Department of Computer Science at Johns Hopkins (since 2018) and was a scientist at Meta until early 2025. I received my Ph.D. from the University of Illinois at Urbana-Champaign, my M.Sc. from the University of Toronto, and my B.Sc. from Sharif University of Technology. I previously worked or interned at the Max Planck Institute (MPI), Microsoft Research, and Princeton University.

At Hopkins, I direct the Foundational Networked Systems Lab. Our research integrates systems and theory and has been supported by Intel, Meta, NSF, VMware, Microsoft, and others. I'm fortunate to work with the outstanding group of students listed below.

Lab Members

  • Erfan Sharafzadeh (PhD student, 2019–2025, now a scientist at Meta)
  • Jinqi (Kevin) Lu (PhD student since 2024)
  • Sana Mahmood (PhD student since 2021)
  • Sepehr Abdous (PhD student since 2019)
  • Venkata Datta Adithya Gadhamsetty (MS student, 2024)
  • Shriya Atulbhai Kaneriya (MS student, 2020–21)
  • Katarina Mayer (MS student, 2020–21, Gerald M. Masson Fellow)
  • Pranav Shirke (MS student, 2020–21)
  • Wajiha Naveed (undergraduate student, 2024)
  • Alexandra Minetree Dill (undergraduate student, 2025)