Why Human-in-the-Loop Really Means AI Failed

2nd Order Thinkers.

Why AI's Best Hackathon Demos Are Built on Workarounds


The fix that got three products to the top of the latest Google hackathon leaderboard also created the jobs every CEO swore AI would eliminate.

Jing Hu

May 25, 2026

For years, the pitch was: the AI will handle it.

Bigger context windows, smarter reasoning, and longer chains of thought are what AI labs are pushing for.

The AI CEOs keep telling you that, any month now, the AI will understand your business, run your workflows, draft your contracts, and be the yellow brick road to a world with no more lazy human employees.

In the case of the Gemini 3 Hackathon, judges had thousands of entries to look at. Somehow, they handed the top three prizes to the teams that had stopped believing the pitch.

The grand prize was won by accepting that the AI shouldn’t make the final call. Second place was won by forcing the AI to think out loud so we can verify the logic. And third place was won by accepting that the AI shouldn’t pretend to know what’s been lost from memory.

The CEOs have spent years telling us AI will replace human workers, and many people are convinced. Yet none of the awarded products would have received a prize if that were true.

I’ve studied all three winning products, and I’ll walk you through the workaround each one built to compensate for what AI failed to achieve. More importantly, I’ll teach you how to read any AI pitch backward, so you can see through AI BS the way I do.

First up is the human-in-the-loop problem that the industry keeps trying to design around.

Human-in-the-Loop, Translation, AI Can’t be Trusted.

Globot is built for the moment a supply chain manager has to make a million-dollar rerouting call while a geopolitical or natural crisis is breaking in real time.

The product starts with four agents, each pulling information from one crucial source to support supply chain decisions.

Together they pull geopolitical signals from Reuters and Bloomberg, calculate the financial impact, redraw the shipping route, and check whether the reroute will change the insurance status.

Finally, it runs a fifth agent whose only job is to find holes in what the other four just concluded. After that check, the system hands the decision over to a human.

The fifth agent and the human hand-off are built into the product, and they are the key points to discuss here.
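To make the shape of that architecture concrete, here is a minimal sketch of four specialists, one critic, and a human approval gate. It is not Globot’s code: the names `run_pipeline`, `Proposal`, and `cautious_manager`, and every stubbed finding, are hypothetical stand-ins for real model calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Proposal:
    findings: Dict[str, str]                       # agent name -> what it concluded
    objections: List[str] = field(default_factory=list)
    approved: bool = False                         # stays False until a human signs off


def run_pipeline(
    specialists: Dict[str, Callable[[], str]],
    critic: Callable[[Dict[str, str]], List[str]],
    human_gate: Callable[[Proposal], bool],
) -> Proposal:
    # Step 1: each specialist agent reports on its own slice of the problem.
    findings = {name: agent() for name, agent in specialists.items()}
    # Step 2: the critic only looks for holes; it cannot approve anything.
    proposal = Proposal(findings, objections=critic(findings))
    # Step 3: the final call belongs to a person, not to any agent.
    proposal.approved = human_gate(proposal)
    return proposal


# Hypothetical stubs standing in for real model calls.
specialists = {
    "geopolitics": lambda: "Strait closure reported",
    "finance": lambda: "Delay costs more than the reroute",
    "routing": lambda: "Alternative route adds nine days",
    "insurance": lambda: "Cover unchanged on the alternative route",
}


def critic(findings: Dict[str, str]) -> List[str]:
    # Toy rule: object if the plan rests on an unconfirmed report.
    if "reported" in findings["geopolitics"]:
        return ["Closure not confirmed by a second source"]
    return []


def cautious_manager(proposal: Proposal) -> bool:
    # A human who refuses to sign while objections are still open.
    return not proposal.objections


result = run_pipeline(specialists, critic, cautious_manager)
```

Notice that nothing in `run_pipeline` can set `approved` except the `human_gate` callable. That one line is the whole “human-in-the-loop” feature.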

You’ve seen variants of this architecture before. Anthropic, for example, baked a version of it into training as Constitutional AI: roughly, one model writes and a second pass marks its homework against a written rulebook. Some call the checker a ‘critic,’ others a ‘judge.’ Whatever the label, the design is the same.

Even after the judge, you still need a layer of humans to make the final call.

Because, unfortunately, this far more expensive multi-agent setup, judge included, still cannot be trusted to make the right call on its own (basically, the judge can still BS its way out).

No matter how many check layers there are, it just cannot be trusted to make a $X million decision on its own.

The whole “human-in-the-loop” phrase is load-bearing for this architecture (though the label sounds like some kind of brilliant, all-inclusive, and be-kind-to-humans gesture).

Ironically, the cheapest and final fix is still the human approval gate.

No, it doesn’t just stop here.

Because if you dig deeper, you realize this creates one thing none of these AI lab CEOs expected to see: jobs.

Someone now has to build and safeguard the agents, experts have to verify the agents’ output, and someone has to sit at the end of the pipeline reviewing high-stakes AI outputs. And repeat.

So you aren’t saving the cost of human labor after all.

Next up: forcing AI to reveal its thinking.

Explainable AI, Translation, The Black Box Is Still a Black Box.

The demo for Aegis opens with the volume problem.

When a natural disaster occurs, the regional 911 system is most likely overwhelmed. You'd have ten thousand distress signals (calls, SMS, tweets, drone feeds) within a matter of minutes.

The demo rests on two main design pillars.

First, it has five agents in total: the coordinator routes raw signals, triage assigns priority, surveillance analyses incoming drone imagery, logistics calculates routes for rescue assets, and the reporter writes the post-mission summary.

It is a multi-agent scaffolding similar to the one we just saw in Globot.

And it’s there for the same reason: a single model can’t hold a complex task end to end. Worth pointing out: this is the moment you finally respect how incredible the human brain is at handling such a complex task.

Second, the Explainable AI.

Early users couldn’t act on the AI’s priority scoring because its recommendations were a black box: they couldn’t see why one casualty had been ranked above another. It’s the same black-box problem that has killed many enterprise AI pilots.

The fix was to force every agent to display its reasoning next to its decision, visible on the main canvas.

The industry has a name for this. It’s sold to enterprises as Explainable AI, or simply “show your working.” But showing the reasoning doesn’t open the black box. It just gives the black box a narrator.
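A minimal sketch of what “reasoning next to the decision” can look like as a data structure. This is not Aegis’s code: `TriageDecision`, `triage`, and `render` are hypothetical names, and a keyword rule stands in for the model call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriageDecision:
    signal_id: str
    priority: int      # 1 = most urgent
    reasoning: str     # the narration shown next to the score


def triage(signal_id: str, text: str) -> TriageDecision:
    # Hypothetical keyword rule standing in for a model call.
    lowered = text.lower()
    if "trapped" in lowered or "not breathing" in lowered:
        return TriageDecision(signal_id, 1, "Mentions a life-threatening condition")
    if "injured" in lowered:
        return TriageDecision(signal_id, 2, "Mentions an injury, no immediate threat to life")
    return TriageDecision(signal_id, 3, "No injury or entrapment mentioned")


def render(decision: TriageDecision) -> str:
    # One canvas row: the decision and its stated reason side by side.
    return f"[P{decision.priority}] {decision.signal_id}: {decision.reasoning}"


row = render(triage("sms-0412", "Two people trapped under debris"))
# row == "[P1] sms-0412: Mentions a life-threatening condition"
```

In this toy, the stated reason really is the rule that fired. With a language model, the `reasoning` field is just more generated text, and nothing in the record checks it against how the score was actually produced.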

Both the multi-agent setup and the explainable AI ‘solution’ fixed their immediate issues but created other unwanted consequences.

First, as covered earlier, running multiple agents is many times more expensive than running a single one, especially now that AI labs like Anthropic are aggressively redesigning their pricing. Expect to spend four or even five figures in US dollars per month on a single small project, which is arguably the cost of hiring a full-time engineer.

Second, the forced thinking-out-loud.

When people see a confident, readable explanation attached to an automated decision, they are more likely to accept it — including when it is wrong. It has a name, automation bias: the tendency to trust automated outputs over our own judgment, even when the evidence shows the opposite.

Deloitte's research for the UK Ministry of Defence illustrates the risk with a simple driving analogy. Imagine you're driving and see a 30 mph sign, but your dashboard shows 60, so you accelerate, trusting the machine over your own eyes. Or you skip a turn you know is right because the sat-nav didn't prompt it.

Either way, the system has quietly replaced your judgment instead of supporting it. Here’s what’s said in the report:

In this situation, the use of AI has unwittingly impaired the operator’s judgment, rather than improving it.

But all of this assumes the system even knows what situation it is returning to. Which brings us to the next demo.

Engineered Memory, Translation, LLMs Start From Zero Every Time.

A user with limited vision asks Netra to find their water bottle.

The system scans the room and says: “Your water bottle is on the desk, to the left of your silver laptop, approximately a meter away.” The user walks to it and picks up the bottle.

Now move the water bottle, don’t tell the AI, and come back tomorrow to see what happens.

The project brief calls the underlying system the Memory Palace, “a persistent JSON store” that “builds a robust, long-term spatial map of a user’s most important locations.”

That, however, is precisely the problem with many LLM implementations like this one.

Yes, the water bottle is still on the desk, to the left of the laptop — in the file.

But what if someone loaded it into the dishwasher and forgot to put it back next to the laptop? The model will confidently send the user to an empty desk.

The brief doesn’t mention how often the memory is updated, but it is going to be a delicate balance between freshness and cost.

This is statelessness, one of the more serious limitations of LLMs.

The model has no continuous thread of experience. Every session, it wakes up as a stranger with notes. The memory store is only a workaround, and it introduces a new failure mode on top of the original problem: stale data that the model treats with the same confidence as fresh data, because it has no mechanism to distinguish between them.
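Here is a minimal sketch of the staleness problem and the kind of bookkeeping that has to be bolted on outside the model. The brief does not say whether the Memory Palace records when an object was last seen, so the `seen_at` field, the one-hour `MAX_AGE_SECONDS` window, and the function names are all assumptions for illustration.

```python
import json

MAX_AGE_SECONDS = 60 * 60  # assumed freshness window: one hour


def remember(store: dict, item: str, location: str, now: float) -> None:
    # Record where the item was seen and when (seconds on any shared clock).
    store[item] = {"location": location, "seen_at": now}


def recall(store: dict, item: str, now: float) -> str:
    entry = store.get(item)
    if entry is None:
        return f"I have no record of your {item}."
    age = now - entry["seen_at"]
    if age > MAX_AGE_SECONDS:
        # Old notes get a hedge instead of a confident answer.
        hours = int(age // 3600)
        return (f"I last saw your {item} {entry['location']} about {hours} hours ago. "
                "It may have moved, so let me look again.")
    return f"Your {item} is {entry['location']}."


store = {}
remember(store, "water bottle", "on the desk, left of the laptop", now=0.0)
# Round-trip through JSON to mimic a persistent store between sessions.
store = json.loads(json.dumps(store))

fresh = recall(store, "water bottle", now=600.0)         # ten minutes later
stale = recall(store, "water bottle", now=24 * 3600.0)   # the next day
```

Without the timestamp check, both calls would return the same confident sentence. With it, every hedged answer triggers a fresh scan, which is exactly the freshness-versus-cost trade-off.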

At demo scale, it’s a file on a server. At any real deployment scale, it is a retention liability that carries the risk of contamination, scope creep, and leakage of whatever is in the memory.

This is one of the reasons Klarna, after piloting AI customer service bots meant to fully replace human agents, rolled them back to "assistant only": the bot forgets you the second you switch from chat to phone, or if your connection drops.

And once you see that pattern, it’s hard to unsee.

The industry keeps selling these workarounds as breakthroughs. I’m going to teach you how to read the pitch backward so you can see through the sales lines, understand the model’s limits, and assess the cost of workarounds for these limits.

If this gave you a better BS detector for AI demos, pass it on to someone who needs one.

Learn to Read Any Tech Pitch Backward.

You probably started this article expecting an analysis of what each demo did right to deserve the hackathon prize. But I'd rather rip the label off and show you the plumbing underneath.

Learning to read things backward will help you evaluate AI’s true value and cost.

But how?

Let’s start by understanding this: a solution always generates a new set of issues. The railroad created rail engineers. The cloud created DevOps and FinOps.

Let’s take harness engineering (multi-agent orchestration) as the example.

Forward reading, like this one:

We’re hiring harness engineers to build sophisticated multi-agent systems that orchestrate specialised AI workers.

Again, a sales pitch.

Backward reading:

A single model loses the plot on anything longer than a certain number of tokens of real reasoning. So you split the job across x narrow agents, maybe add one more to check their work, and hire a human (the “harness engineer”) whose entire role is to glue the crew together and keep it from drifting.

Ignore the pitch that says ‘one harness engineer plus a multi-agent crew replaces a five-person team,’ because the invoice at the end of the month says otherwise.

A harness engineer in 2026 is a specialist hire at $180k–$350k, and you don’t get just one.

Underneath them sits a compute bill that scales multiplicatively: an x-agent crew with a critic re-ingests prior agents’ outputs as context, so a single task runs 10–15× the tokens of one model call, and labs now bill agentic tool calls separately on top of seat licenses.
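A back-of-the-envelope sketch of why the token bill multiplies. It assumes the agents run one after another, each re-reading the task plus every earlier agent’s output, with the critic going last. The function names and the 1,000-token figures are illustrative assumptions, not measurements from any of the three products.

```python
def single_call_tokens(task_tokens: int, output_tokens: int) -> int:
    # One model reads the task once and writes one answer.
    return task_tokens + output_tokens


def crew_tokens(task_tokens: int, output_tokens: int, n_agents: int) -> int:
    # Sequential crew: agent k re-reads the task plus the k earlier outputs,
    # then writes its own. The critic is one extra position at the end.
    total = 0
    for position in range(n_agents + 1):
        context = task_tokens + position * output_tokens
        total += context + output_tokens
    return total


single = single_call_tokens(task_tokens=1_000, output_tokens=1_000)        # 2_000
crew = crew_tokens(task_tokens=1_000, output_tokens=1_000, n_agents=4)     # 20_000
ratio = crew / single                                                      # 10.0
```

Under these assumptions, four agents plus a critic cost ten times the tokens of one call, and the re-read context grows roughly with the square of the crew size. Different prompt sizes or a parallel layout would move the number, but not the direction.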

And then the hidden stack appears. You need databases for retrieval, orchestration services to route work, memory stores to preserve context, and monitoring tools to stop the whole thing from wandering off.

Even the guardrails tell on the system. Iteration caps, timeouts, and cost ceilings are not just safety features. They are admissions that, left alone, an agent may keep spending without converging.
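Those three guardrails can be sketched in a few lines. This is a generic illustration, not any vendor’s API: `Budget`, `run_agent`, and the `step` callback (which returns whether the agent is done and what the step cost) are hypothetical names.

```python
import time
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Budget:
    max_iterations: int
    max_seconds: float
    max_cost_usd: float


def run_agent(
    step: Callable[[int], Tuple[bool, float]],
    budget: Budget,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[str, int, float]:
    """Run `step` until it converges or any guardrail trips.

    Returns (reason, iterations_run, usd_spent).
    """
    started = clock()
    spent = 0.0
    for iteration in range(1, budget.max_iterations + 1):
        # Timeout: checked before each step starts.
        if clock() - started > budget.max_seconds:
            return ("timeout", iteration - 1, spent)
        done, cost = step(iteration)
        spent += cost
        if done:
            return ("converged", iteration, spent)
        # Cost ceiling: stop paying once the budget is used up.
        if spent >= budget.max_cost_usd:
            return ("cost_ceiling", iteration, spent)
    # Iteration cap: the loop ran out without converging.
    return ("iteration_cap", budget.max_iterations, spent)


# A stubbed agent that never converges and costs $0.50 per step.
outcome = run_agent(lambda i: (False, 0.5), Budget(10, 60.0, 2.0))
# outcome == ("cost_ceiling", 4, 2.0)
```

Three of the four exit paths exist only for the case where the agent does not finish on its own.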

Zoom out from the engineering, and you also have the oversight tax: Carnegie Mellon and Stanford found that when humans supervise agents' work closely, productivity drops by about 20% compared with humans working alone.

So the cost of operators doesn’t really disappear, as many might dream.

Instead, it gets replaced by a more expensive team, a variable compute bill that grows with usage, a stack of net-new infrastructure, and a human still sitting at the gate to sign off on the high-stakes calls.

But most importantly (and this will likely scare away most bosses), you get a brand-new way of operating that bosses have nearly zero control over and zero expertise in.

None of these is something you intended to buy. Each one is a tax on the model’s limits, charged to you under a new name tag.

For paid subscribers, I will send you a few more AI terms, with backward reading examples. Sign up if you’d like to receive this exclusive piece too; it will save you millions and headaches.

I hope this one was insightful. If so, please like and share it. This article is only possible with your engagement and support.