Research on the Effectiveness of Modern AI Systems
A study conducted by Scale AI and the Center for AI Safety found that modern artificial intelligence systems, such as ChatGPT, Gemini, and Claude, perform poorly on real-world projects. Across hundreds of test tasks, the best AI system completed only 2.5% of projects. Almost half of the tasks were completed with low quality, and a third remained unfinished.
Examples of Failed Task Completion
Among the specific examples of failed task completion, the following are noteworthy:
- An interior design project where the AI created an implausible floor plan.
- A data visualization dashboard where the system overlapped text on graphs and confused colors.
- A brewing-themed game developed by AI turned out to be abstract and did not meet expectations.
ChatGPT, which was released three years ago, and the new Gemini 3 Pro model, tested in November 2025, showed similar results, completing only 1.3% of tasks. On cost, by comparison, having a human create the game cost $1,485, while running Claude Sonnet cost less than $30.
Jason Hausenloy, one of the authors of the study, stated that "AI cannot learn from mistakes within a project that lasts for weeks."
These results highlight the importance of acknowledging the limitations of artificial intelligence when using it for practical projects. Despite significant progress in development, AI systems are still not ready to fully replace human expertise in complex tasks. This raises new questions about how to integrate AI into workflows to maximize its potential and ensure quality project execution.