Sarah stared at her laptop screen, watching the quarterly reports pile up in her inbox. As a project manager at a mid-sized tech company, she often wondered if artificial intelligence could handle the endless coordination meetings, spreadsheet analysis, and cross-department emails that filled her days. “Maybe AI really could do my job,” she thought, half-joking with her colleague over coffee.
Little did she know, researchers were already putting that exact theory to the test. And the results would surprise everyone who’s ever fantasized about AI taking over the mundane parts of office work.
In what might be the most ambitious AI company experiment ever conducted, scientists created an entire virtual workplace staffed exclusively by artificial intelligence “employees.” No human workers, no human managers – just AI systems trying to run a business from top to bottom.
When AI Tries to Play Office
Carnegie Mellon University researchers didn’t just want to see if AI could answer questions or write emails. They wanted to know if these systems could actually replace human workers in the messy, complicated world of real office work. So they built something unprecedented: a complete simulated company where every single employee was an AI agent.
The AI company experiment featured some of the most advanced systems available today. Each “worker” was powered by a different leading AI model – Anthropic’s Claude 3.5 Sonnet, OpenAI’s GPT-4o, Google’s Gemini, Amazon’s Nova, Meta’s Llama, and Alibaba’s Qwen. Think of it as an all-star team of artificial intelligence, each given its own desk, responsibilities, and deadlines.
But here’s what made this experiment different from typical AI tests: these weren’t simple chatbots answering one question at a time. Each AI agent was assigned a specific role – financial analyst, project manager, software engineer – and expected to handle complex, multi-step tasks that required coordination with other “departments.”
“Instead of neat, one-off prompts, the models faced messy, multi-step tasks that looked a lot like an ordinary day at the office,” the researchers noted.
Real Work for Artificial Workers
The tasks assigned in this AI company experiment weren’t designed to trick or confuse the systems. They were the kind of routine but complex work that millions of office workers handle every day:
- Navigating company file systems to analyze databases
- Pulling information from multiple documents and creating summaries
- Organizing virtual office tours and comparing options for new premises
- Communicating with other departments for approvals and additional data
- Following instructions that included both explicit steps and implied expectations
- Managing time and cost constraints while completing projects
The researchers also created simulated departments like HR that the AI agents had to interact with, just like real employees would. The artificial workers had to send messages, request information, coordinate schedules, and navigate office politics – all the stuff that makes work feel like, well, work.
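The setup described above – each model given a role, a set of tools, and a multi-step task – can be sketched as a simple agent loop. Everything here is illustrative: the `Action` shape, the tool names, and the loop itself are assumptions for explanation, not the researchers’ actual harness.

```python
from collections import namedtuple

# Illustrative only: the benchmark's real harness, tool names, and
# model APIs differ. This just shows the shape of an agent loop.
Action = namedtuple("Action", ["name", "argument"])

def run_agent(llm, role, task, tools, max_steps=20):
    """Drive one AI 'employee' through a multi-step office task."""
    history = [f"You are the company's {role}. Task: {task}"]
    for _ in range(max_steps):
        action = llm(history)            # the model picks the next step
        if action.name == "finish":
            return action.argument       # final deliverable
        handler = tools.get(action.name)
        observation = (handler(action.argument) if handler
                       else f"unknown tool: {action.name}")
        # Feeding each result back into the history is what gives the
        # agent memory across steps -- the "contextual persistence"
        # the study found current models struggle to maintain.
        history.append(f"{action.name}({action.argument}) -> {observation}")
    return None  # step budget exhausted without finishing
```

The key design point is the history list: the agent only “remembers” what gets appended there, which is exactly where the study’s AI workers tended to lose the thread on long tasks.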
How the AI workers scored:

| AI Model | Full-Task Success Rate | Partial-Credit Score |
|---|---|---|
| Claude 3.5 Sonnet | 24% | 34.4% |
| GPT-4o | 18% | 28.2% |
| Google Gemini | 15% | 22.8% |
| Other models | 10-12% | 15-20% |
The Sobering Reality Check
If you’re expecting a story about AI domination in the workplace, prepare to be disappointed. The results of this AI company experiment were, in the researchers’ own words, “sobering.”
Even the best-performing AI, Claude 3.5 Sonnet, successfully completed only 24% of its assigned tasks. When researchers gave partial credit for work that was incomplete but heading in the right direction, Claude’s score improved to just 34.4%. The other AI systems performed even worse, with most struggling to complete even 15-20% of their assignments.
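The two numbers reported for each model can be sketched with a toy scoring function. This assumes a simple checkpoint-style partial-credit scheme (credit for each sub-step completed); the study’s actual scoring formula may weight things differently.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    checkpoints_passed: int   # sub-steps the agent completed
    checkpoints_total: int    # sub-steps the task defines

def full_success_rate(results):
    """Fraction of tasks completed end-to-end (every checkpoint passed)."""
    done = sum(1 for r in results
               if r.checkpoints_passed == r.checkpoints_total)
    return done / len(results)

def partial_credit_score(results):
    """Average per-task fraction of checkpoints passed."""
    return sum(r.checkpoints_passed / r.checkpoints_total
               for r in results) / len(results)

# Toy data: one task fully done, one partly done, one failed outright.
results = [TaskResult(3, 3), TaskResult(1, 4), TaskResult(0, 2)]
print(full_success_rate(results))     # 1 of 3 tasks fully completed
print(partial_credit_score(results))  # averages 1.0, 0.25, and 0.0
```

The gap between the two functions is the story of the experiment: agents often made real progress (partial credit) without ever delivering a finished task (full success).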
“We expected some challenges, but the extent of the failures was eye-opening,” one researcher commented. “These are supposed to be some of the smartest AI systems in the world, yet they couldn’t handle basic office workflows.”
The problems weren’t just about getting wrong answers. The AI agents frequently got lost in their own processes, forgot important details from earlier in their tasks, or simply gave up when faced with multi-step challenges. Some would start strong but lose track of their objectives halfway through complex assignments.
What made these failures particularly striking was that the tasks weren’t especially difficult by human standards. Most involved the kind of analytical and coordination work that entry-level employees handle routinely after a few weeks of training.
Why Your Job Is Probably Safe (For Now)
The AI company experiment reveals something important about the current state of artificial intelligence: there’s a massive gap between answering questions in a chat interface and actually doing work in a realistic environment.
Real office work requires what researchers call “contextual persistence” – the ability to remember what you’re doing, why you’re doing it, and how it fits into larger goals over extended periods. It also demands flexibility when plans change, creativity when standard approaches don’t work, and social intelligence to navigate relationships with colleagues.
“The AI systems excel at individual tasks but struggle with the kind of sustained, adaptive problem-solving that defines most knowledge work,” explained one of the study’s authors.
This doesn’t mean AI won’t eventually get better at these challenges. But it does suggest that the timeline for AI replacing human workers in complex roles might be longer than many predictions suggest.
For workers like Sarah, the project manager we met at the beginning, this research offers both reassurance and insight. While AI can certainly help with specific parts of her job – writing emails, analyzing data, scheduling meetings – the complete replacement of human judgment, creativity, and adaptability remains a distant prospect.
The experiment also highlights an important distinction between AI as a tool and AI as a replacement. These systems showed their value when used to support human decision-making, but struggled when asked to operate independently in complex, changing environments.
FAQs
How did the researchers create an entire AI company?
They built a simulated business environment where different AI models were assigned specific roles and given realistic office tasks that required coordination between departments.
Which AI system performed best in the experiment?
Claude 3.5 Sonnet achieved the highest success rate at 24%, though even this top performer struggled with most assigned tasks.
What kinds of tasks were the AI employees asked to complete?
Typical office work like analyzing databases, summarizing documents, choosing office locations, and coordinating with other departments under time and budget constraints.
Does this mean AI can’t help with office work at all?
Not at all – AI can be very effective at specific tasks, but this experiment showed the challenges of using AI as a complete replacement for human workers in complex environments.
Will AI eventually be able to run entire companies?
While AI will likely improve, this research suggests that fully autonomous AI companies remain a significant technical challenge requiring advances in contextual understanding and sustained problem-solving.
Should office workers be worried about AI taking their jobs?
This experiment suggests that complex knowledge work requiring adaptation, creativity, and sustained coordination remains challenging for current AI systems, though AI will likely continue serving as a valuable tool to support human workers.
