AI Failed Real-World Task Test: ChatGPT and Gemini Struggled with 97% of Projects

AI struggled with projects
Artificial intelligence failed to cope with real-world tasks: ChatGPT and Gemini showed poor results in 97% of cases.

Research on the Effectiveness of Modern AI Systems

According to Главком, a study conducted by Scale AI and the Center for AI Safety found that modern artificial intelligence systems such as ChatGPT, Gemini, and Claude perform poorly on real-world projects. In tests on hundreds of tasks, the best AI system completed only 2.5% of projects. Almost half of the tasks were completed with low quality, and a third remained unfinished.

Examples of Failed Task Completion

Notable examples of failed tasks include:

  • An interior design project where the AI created an implausible floor plan.
  • A data visualization dashboard where the system overlapped text on graphs and confused colors.
  • A brewing-themed game developed by AI turned out to be abstract and did not meet expectations.

ChatGPT, which was released three years ago, and the new Gemini 3 Pro model, tested in November 2025, demonstrated similar results, completing only 1.3% of tasks. By comparison, having a human create the game cost $1,485, while running Claude Sonnet cost less than $30.

Jason Hausenloy, one of the authors of the study, stated that "AI cannot learn from mistakes within a project that lasts for weeks."

These results highlight the importance of acknowledging the limitations of artificial intelligence when using it for practical projects. Despite significant progress in development, AI systems are still not ready to fully replace human expertise in complex tasks. This raises new questions about how to integrate AI into workflows to maximize its potential and ensure quality project execution.
