January 07, 2026 · 8 min read

I've commissioned and operated a lot of facilities—submarines, data centers, industrial plants. You learn pretty quickly that the engineering challenge is about more than capacity. It's about understanding what kind of machine you're actually building.
This time, we weren't building a data center. We were building a race car.
That's what an AI factory is—a high-performance machine where every system has to coordinate in real time. If one component lags, the whole system skids. And at gigawatt scale, any slide is expensive.
The challenge was building infrastructure that operates as a single unit of compute. It has to handle synchronized AI workloads and extreme thermal demands. And while over-sizing leaves some room to compensate, every redundant piece of equipment cuts into the margin needed to make the facility profitable.
In the following log entries, I’ll cover:
What makes AI factories fundamentally different from traditional data centers
Why thermal control became the bottleneck we didn't see coming
How Phaidra's AI agents changed the way we think about infrastructure management
What it takes to coordinate systems at the speed AI workloads demand
Rule #1 of AI factories: This is NOT just a larger data center.
In a traditional data center, loads are distributed. Different tenants, different workloads, different times. You might see a spike on Black Friday or during a product launch, but mostly things ramp gradually and somewhat predictably. This predictability creates flexibility. You can even over-subscribe power (install more IT equipment than the power infrastructure is rated for) because not everything's going to be on at once.
AI factories don't work that way.
When an AI training job kicks off, every rack goes from 20% to 100% almost instantly. Thousands of GPUs light up in sync. And when the job completes, everything drops back down together.
It's like Black Friday every day, except it happens multiple times, with little to no warning, across the entire facility.
This is not a bigger data center. Traditional data centers are digital commercial real estate where compute space is rented. Whether there are multiple tenants or just one, the compute infrastructure is assigned to do different things at different times, in different ways. Like a flexible commercial office building.
An AI factory is one big computer. Every component must synchronize toward a single purpose. And that changes everything about how you design, operate, and control it.
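To make that concrete, here's a quick back-of-the-envelope sketch. The numbers are made up for illustration (1,000 racks at 40 kW each), not from our facility, but the shape of the curves is the whole story:

```python
# Illustrative only: contrast aggregate load when tenant activity is uncorrelated
# versus when one synchronized training job drives every rack together.
import random

RACKS = 1_000          # assumed rack count
RACK_KW = 40.0         # assumed per-rack power draw at full load

def traditional_load_kw() -> float:
    """Uncorrelated tenants: each rack independently sits somewhere between 20% and 80%."""
    return sum(RACK_KW * random.uniform(0.2, 0.8) for _ in range(RACKS))

def ai_factory_load_kw(job_running: bool) -> float:
    """One big computer: every rack steps between ~20% idle and ~100% together."""
    return RACKS * RACK_KW * (1.0 if job_running else 0.2)

if __name__ == "__main__":
    random.seed(0)
    print(f"traditional facility, typical minute: {traditional_load_kw() / 1000:.1f} MW")
    print(f"AI factory, between jobs:             {ai_factory_load_kw(False) / 1000:.1f} MW")
    print(f"AI factory, training job running:     {ai_factory_load_kw(True) / 1000:.1f} MW")
```

The traditional facility hovers near its average. The AI factory swings from roughly 8 MW to 40 MW in a single step, then back down, over and over.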
I'd worked on Google's early AI infrastructure—systems with TPUs and CDUs that behaved a lot like what we were building here. Dense, synchronized, unforgiving. But even those systems didn't operate at this scale or with this level of coordination demand.
The challenge wasn't theoretical anymore. It was right in front of us.
Early load testing revealed the problem fast.
We'd bring a training workload online, watch power spike across the racks, and then—lag. The cooling system would react, but not fast enough. By the time chillers ramped up and water started circulating at the right temperature, we'd already seen thermal overshoot.
Not catastrophic. But enough to throttle performance.
Enough to make you nervous.
Traditional control systems wait for temperatures to rise before they respond. That's fine when loads change slowly—comfort cooling, district energy, steady industrial processes. Those systems were engineered for gradual ramps, not synchronized spikes.
But in an AI factory, gradual doesn't exist.
You either keep up, or you fall behind.
The issue wasn't capacity. We had enough chillers, enough pumps, enough cooling infrastructure to handle the load—if we had time to respond. But time was the constraint we couldn't engineer around with more equipment.
The limit wasn't on total cooling available. The limit was on the rate of change.
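A toy simulation makes the point. The numbers below (thermal mass, load step, ramp limits) are assumptions for illustration, not measurements from our facility, but they show why total capacity isn't the binding constraint when the plant can only ramp so fast:

```python
# Illustrative toy model (assumed numbers, not facility data): even with enough total
# cooling capacity, a ramp-rate limit on the plant turns a synchronized load step
# into thermal overshoot.

def peak_overshoot_c(ramp_limit_kw_per_s: float, seconds: int = 600) -> float:
    """Peak coolant temperature rise after a 2 MW -> 10 MW load step at t = 60 s."""
    thermal_mass_kj_per_c = 50_000.0      # assumed lumped thermal mass of the coolant loop
    cooling_kw = 2_000.0                  # plant output, initially matched to the load
    temp_rise_c = peak_c = 0.0
    for t in range(seconds):
        it_load_kw = 2_000.0 if t < 60 else 10_000.0   # every rack steps up together
        # Reactive plant: chase the load, but ramp no faster than the equipment allows.
        error_kw = it_load_kw - cooling_kw
        cooling_kw += max(-ramp_limit_kw_per_s, min(ramp_limit_kw_per_s, error_kw))
        # Any unmatched heat shows up as a temperature rise in the loop (dt = 1 s).
        temp_rise_c = max(0.0, temp_rise_c + (it_load_kw - cooling_kw) / thermal_mass_kj_per_c)
        peak_c = max(peak_c, temp_rise_c)
    return peak_c

for ramp in (100.0, 500.0, 2_000.0):
    print(f"ramp limit {ramp:>6.0f} kW/s -> peak overshoot {peak_overshoot_c(ramp):.1f} degC")
```

Same capacity in every case. Only the ramp rate changes the outcome, and that's the constraint no amount of extra equipment fixes.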
And that's when it became clear: we needed a different kind of control.
There's an analogy I kept coming back to during those early weeks.
When humans learned to hunt, we didn't become the dominant species because we were faster or stronger. We became dominant because we learned to throw rocks, and later spears, to where the animal was going to be, not where it was.
That's the difference between lagging indicators and leading indicators.
One reacts. The other predicts.
Traditional control systems are reactive. They wait for a temperature sensor to report that something's getting hot, then they respond. Therefore, they follow lagging indicators.
By that point, the thermal spike is already happening.
What we needed was feed-forward control: a way to see the spike coming (workload changes, IT load signals, power ramping, system response times) and act before temperatures ever started to rise.
That's what Phaidra's AI agents do.
They don't wait for the problem. They anticipate it, adjust proactively, and stage the cooling ahead of the spike, so the system never has to catch up; it's already there. It's like traction control in a Formula 1 car: adjusting power delivery before the tires lose grip, not after.
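As a rough sketch of the idea (this is not Phaidra's controller; the signal names and gain are assumptions), feed-forward control pairs a term driven by a leading indicator, such as the power a scheduled job is about to draw, with a small feedback trim on measured temperature:

```python
# Rough sketch of feed-forward plus feedback trim; not Phaidra's controller.
# Signal names and the gain value are assumptions for illustration.

def cooling_setpoint_kw(scheduled_it_power_kw: float,
                        measured_temp_c: float,
                        target_temp_c: float,
                        kp_kw_per_c: float = 500.0) -> float:
    """Stage cooling for where the load is going to be; feedback only trims the residual."""
    feed_forward_kw = scheduled_it_power_kw                        # leading indicator
    feedback_kw = kp_kw_per_c * (measured_temp_c - target_temp_c)  # lagging correction
    return max(0.0, feed_forward_kw + feedback_kw)

# The scheduler announces a ramp to 10 MW: cooling is staged immediately,
# before the coolant temperature has moved at all.
print(cooling_setpoint_kw(scheduled_it_power_kw=10_000.0,
                          measured_temp_c=24.0,
                          target_temp_c=24.0))   # 10000.0 kW, ahead of the thermal rise
```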
We deployed liquid cooling agents on the CDUs across a 100 MW GPU cluster and watched them go to work.
Within hours, thermal behavior tightened.
Thermal overshoot that used to spike 3-5°C dropped to less than 0.5°C. Thermal spikes that would trigger equipment throttling or manual interventions—gone. The system wasn't just responding anymore. It was predicting, adjusting, and learning.
And the best part was that it kept getting better!
Phaidra's agents were trained in simulation first—high-fidelity digital twins of NVIDIA DGX SuperPODs that let the team prototype and test different architectures before deployment. Once the agent hit our live system, it continued learning through reinforcement learning. No manual tuning. No human intervention.
It observed, adapted, and improved.
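For intuition only, here's a toy version of that train-in-simulation, refine-on-the-live-system loop. A simple random-search policy improvement stands in for the reinforcement learning Phaidra actually uses, and the "digital twin" is a one-line cost model; every number and name here is an assumption.

```python
# Toy sketch only: random-search policy improvement stands in for reinforcement
# learning, and the digital twin is a one-line cost model. All values are assumptions.
import random

def twin_cost(preramp_fraction: float) -> float:
    """Stand-in digital twin: penalize overshoot from under-staging cooling,
    plus a small energy penalty for over-cooling."""
    overshoot_c = max(0.0, 1.0 - preramp_fraction) * 8.0
    overcooling_penalty = preramp_fraction * 0.3
    return overshoot_c + overcooling_penalty

def train_in_sim(episodes: int = 200, seed: int = 0) -> float:
    """Propose small policy changes, keep the ones that score better in simulation."""
    random.seed(seed)
    best_action, best_cost = 0.0, twin_cost(0.0)
    for _ in range(episodes):
        candidate = min(1.0, max(0.0, best_action + random.gauss(0.0, 0.1)))
        if twin_cost(candidate) < best_cost:
            best_action, best_cost = candidate, twin_cost(candidate)
    return best_action

print(f"learned pre-ramp fraction: {train_in_sim():.2f}")   # converges toward fully pre-staged
```

The real agents optimize over far more signals and keep learning once they're live, but the loop has the same shape: act, score against the twin, keep what works.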
In one test, we threw a 30-70% load ramp at it—the kind of spike that would've caused major thermal overshoot with traditional controls. The AI agent absorbed it like it was nothing. Smooth, stable, precise.
It reminded me of working on a nuclear submarine, where you had layers of redundancy and safety systems engineered into everything. Except this time, the safety system was learning in real time and getting smarter with every cycle.
The results were clear:
80% reduction in thermal spikes compared to the baseline control system
<0.5°C precision in maintaining target temperatures
Higher TCS (technology cooling system) temperature setpoint without sacrificing stability
Less cooling infrastructure required overall—no need to over-build just to buffer spikes
We didn't tune the system. The system tuned itself.
And that fundamentally changed how we thought about infrastructure management.
Here's what I've learned after years of working in mission-critical facilities: systems only perform as well as their weakest coordination point.
You can have the best chillers, the cleanest power delivery, the most advanced GPUs—but if those systems don't talk to each other in real time, you're leaving performance on the table.
AI factories are no different. They're just faster, denser, and less forgiving of lag.
That's why the NVIDIA Omniverse DSX ecosystem exists. It's not just about hardware specs or design plans. It's about coordination—every partner bringing precision to their piece of the stack so the whole system works as one.
Partners like Jacobs design the mechanical and electrical core. Siemens provides digital twin capabilities and automation systems. Schneider Electric handles power distribution. GE Vernova interfaces with the grid. Emerald AI runs predictive simulations. And Phaidra brings the real-time control layer that keeps thermal systems stable under dynamic load conditions.
It's like a Formula 1 team.
Every component matters. Every system has to be synchronized. And when it works, the machine doesn't just run—it out-performs.
Phaidra's role became clear to me: the traction control that keeps the system on-line. Phaidra manages thermal precision so the rest of the infrastructure can do its job without over-building, without waste, and without slipping.
Because at gigawatt scale, small inefficiencies magnify and compound fast.
A single gigawatt AI factory represents a $50 billion investment and a $200 billion revenue opportunity. Every 1% of inefficiency costs ~$2 billion in lost performance.
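The arithmetic behind that last figure, using the post's own numbers:

```python
# Back-of-the-envelope check using the figures quoted above.
revenue_opportunity_usd = 200e9      # revenue opportunity of a 1 GW AI factory
inefficiency = 0.01                  # 1% of performance left on the table
print(f"${revenue_opportunity_usd * inefficiency / 1e9:.0f}B lost per 1% of inefficiency")
```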
You can't afford to guess. You can't afford to react too late. And you can't afford systems that don't talk to each other.
We're building smarter facilities—not just bigger ones.
AI factories require infrastructure that thinks ahead—systems that predict, adapt, and coordinate at speeds that exceed human reaction time. That's not optional. It's foundational.
Before Phaidra's agents came online, we were reacting to problems. Now, we're preventing them. The system anticipates thermal spikes before they happen, adjusts proactively, and maintains precision without manual intervention.
That's the difference between lagging and leading. And at gigawatt scale, that difference is everything.
I've seen a lot of control systems in my career. The ones that work best aren't the ones with the most capacity. They're the ones that know how to use what they have.
Phaidra's AI agents do exactly that.
They give the system more time to respond by acting on leading indicators, reducing the need for oversized infrastructure while improving reliability and efficiency.
It's like traction control in a race car. You don't win by having the biggest engine. You win by staying on the racing line—maintaining grip, avoiding drift, and extracting every bit of performance the system has to offer.
That's what intelligent control does. And that's what the future of AI infrastructure looks like.
An AI factory is a machine built for intelligence generation. But it only works when every system coordinates in real time—power delivery, cooling response, workload scheduling, thermal management. Everything working in harmony to create an infrastructure asset that operates as a single unit of compute.
At the speeds AI workloads demand, I can’t track everything manually anymore. That's why AI agents became essential—not to replace operators like me, but to handle the coordination speed that exceeds what manual oversight can manage alone.
Phaidra's work within the NVIDIA Omniverse DSX ecosystem is defining what this looks like at gigawatt scale. Their AI agents deliver the real-time precision that makes it possible, keeping systems stable, efficient, and performing at their peak.
Because in the end, the fastest machine isn't always the one with the most power.
It's the one that can use it all, consistently and reliably.
Featured Expert
Learn more about one of our subject matter experts interviewed for this post

Daniel Fuenffinger
Solution Engineer
As a Solution Engineer, Daniel is responsible for identifying and aiding in the application of AI solutions to optimize physical infrastructure. He bridges the gap between operational needs and technical innovation, collaborating with customers, researchers, and software engineers to help define the "what" and "why" behind high-impact AI agents. With a career rooted in mission-critical environments, Daniel brings over a decade of experience in the operation and maintenance of nuclear submarine propulsion plants for the US Navy and hyper-scale data centers at Google. His background in technical program management for data center controls allows him to deliver solutions that prioritize long-term efficiency and system reliability.