
AI Models Face Data Shortage, Goldman Sachs Warns

Goldman Sachs analysts warn the AI industry is running out of high-quality public data for training models, pushing focus towards untapped enterprise data.

By David Chen

David Chen is a senior technology and markets correspondent for Neurozzio, specializing in the semiconductor industry, corporate finance, and the business of artificial intelligence. He analyzes major deals and strategic shifts shaping the global tech landscape.

Analysts from Goldman Sachs are warning that the artificial intelligence industry is approaching a critical limit: the available supply of high-quality public data for training advanced AI models is nearly exhausted. The solution, they suggest, lies within the vast, untapped data reserves held by private companies.

According to top executives at the financial institution, the continued progress of AI, particularly in business applications, now depends on unlocking and utilizing proprietary enterprise information. This shift presents both a significant opportunity and a major challenge for the industry.

Key Takeaways

  • Goldman Sachs analysts report a growing scarcity of high-quality public data for training large AI models.
  • Developers are increasingly using synthetic data or the output of other AIs, which creates a risk of performance degradation known as "model collapse."
  • The next major source of valuable training material is believed to be the proprietary data stored within corporations.
  • Unlocking enterprise data requires significant investment in data cleaning, normalization, and semantic understanding to be effective.

The Looming Data Scarcity

The development of powerful AI systems is fundamentally dependent on massive volumes of training data. However, the industry may have reached a ceiling on readily available, high-quality information from the public internet.

During a recent webcast, George Lee, co-head of the Goldman Sachs Global Institute, highlighted this dependency. "The quality of the outputs from these models, particularly in enterprise settings, is highly dependent on the quality of the data that you're sourcing and referencing," Lee stated.

The core issue is the finite nature of this resource. Neema Raphael, Goldman Sachs' chief data officer, put the situation in stark terms.

"We've already run out of data," Raphael said, pointing to an emerging crisis for developers building the next generation of AI.

This scarcity forces developers to seek alternative, and potentially problematic, data sources to continue advancing their models.

Risks of Current Training Methods

With the well of public data running low, AI developers are turning to two main alternatives: creating synthetic data or training new models on the output generated by existing AI systems. While these methods provide a path forward, they introduce significant risks.

The Danger of Model Collapse

A primary concern is a phenomenon known as model collapse. This occurs when an AI system's performance degrades after being trained on data generated by another AI. Over successive generations, the model can lose its understanding of complex nuances and begin to amplify errors.
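To make the failure mode concrete, the toy sketch below repeatedly fits a very simple generative model, a one-dimensional Gaussian, to samples drawn from its own previous generation. This is an illustrative assumption rather than anything described by Goldman Sachs: the sample size, seed, and number of generations are arbitrary, and real model collapse involves far more complex systems. The basic mechanism is the same, though: with no fresh real-world data, estimation errors compound and the learned distribution narrows.

```python
import numpy as np

# Toy illustration of "model collapse": each generation of a simple
# generative model (a Gaussian) is fitted only to samples drawn from the
# previous generation, never to fresh real-world data.
rng = np.random.default_rng(42)

SAMPLES_PER_GEN = 200   # illustrative values, chosen only to make the drift visible
GENERATIONS = 500

# Generation 0: "real-world" data with a known spread (std = 1.0).
data = rng.normal(loc=0.0, scale=1.0, size=SAMPLES_PER_GEN)

for gen in range(1, GENERATIONS + 1):
    mu, sigma = data.mean(), data.std()            # fit the model to current data
    data = rng.normal(mu, sigma, SAMPLES_PER_GEN)  # next generation trains on model output only
    if gen % 100 == 0:
        # Without fresh real data, small estimation errors compound and the
        # learned spread tends to drift toward zero: rare "tail" behaviour
        # effectively disappears from the training distribution.
        print(f"generation {gen:3d}: learned std = {data.std():.3f}")
```

In a typical run the printed standard deviation falls well below the original 1.0, a crude analogue of a model gradually losing the nuance and diversity of real-world data.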

Raphael commented on this trend, noting that some newer, more efficient models are likely trained on the outputs of their predecessors. "The interesting thing is going to be how previous models then shape what the next iteration of the world looks like," he explained, referring to a future in which models learn less and less from real-world data.

What is 'AI Slop'?

The term 'AI slop' refers to the low-quality, often erroneous or biased content generated by AI systems. When this content is used to train new models, it can perpetuate and worsen inaccuracies, leading to a decline in overall AI reliability and performance.

This cycle of training on AI-generated content, or "AI slop," could lead to a future where AI systems become less reliable and detached from the complexities of real-world information.

Enterprise Data: The Next Frontier

Despite the challenges, Goldman Sachs analysts believe a solution exists. Raphael suggested that the risk of model collapse will not be a "massive constraint" on future AI progress for one key reason: the enormous amount of data that remains locked away inside corporations.

"There is a lot of trapped enterprise data that still has not been harnessed," he said. This proprietary information, which includes everything from customer interactions and operational logs to internal research, is considered highly valuable for creating specialized, high-performing AI applications.

The Value of Proprietary Data

Enterprise data offers unique advantages over public web data. It is often structured, specific to a particular industry or business process, and contains insights not available anywhere else. For an AI model, this data can provide the context needed to perform specialized tasks with high accuracy, creating a competitive advantage for the company that uses it effectively.

According to Goldman Sachs, this information is "highly salient to garnering business value." The ability to harness this internal data is what will separate companies that successfully integrate AI from those that do not.

The Challenge of Unlocking Corporate Data

While enterprise data holds immense potential, accessing and preparing it for AI training is a major undertaking. Much of this information is stored in disparate systems, is often unstructured, and may contain inconsistencies.

Raphael emphasized the foundational work required. "Cleaning your data, normalizing it, having the semantics of the data understood, all of this stuff is what's going to allow enterprises to level up," he explained.
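The groundwork Raphael describes is easier to picture with a small example. The sketch below is hypothetical, not a Goldman Sachs workflow: the table, column names, and values are invented, and it assumes pandas 2.0 or later. It shows three of the routine steps he lists: dropping unusable records, normalizing names, dates, and amounts, and removing duplicates.

```python
import pandas as pd

# A hypothetical extract of "trapped" enterprise records: inconsistent casing,
# mixed date formats, duplicated rows, and missing values are typical problems.
raw = pd.DataFrame({
    "customer":   ["Acme Corp", "acme corp ", "Globex", None, "Globex"],
    "order_date": ["2024-01-05", "05/01/2024", "2024-02-10", "2024-02-11", "2024-02-10"],
    "amount_usd": ["1,200.00", "1200", "950.5", "880", "950.5"],
})

clean = (
    raw
    .dropna(subset=["customer"])                                   # drop records with no owner
    .assign(
        customer=lambda d: d["customer"].str.strip().str.lower(),  # normalize entity names
        order_date=lambda d: pd.to_datetime(d["order_date"],
                                            format="mixed"),       # unify date formats (pandas >= 2.0)
        amount_usd=lambda d: (d["amount_usd"]
                              .str.replace(",", "")
                              .astype(float)),                     # numeric values in a single unit
    )
    .drop_duplicates()                                             # remove exact repeated rows
)

print(clean)
```

The harder part, having the semantics of the data understood, typically means mapping columns like these to shared business definitions so that models and people interpret them the same way, and that work is as much organizational as technical.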

This data preparation phase is critical for any company looking to build a meaningful advantage with AI. Without a clean and well-organized data foundation, any AI initiatives are likely to produce disappointing results.

A Reality Check on AI Investment

This optimistic view of enterprise data's potential is tempered by current market realities. Reporting referenced by Goldman Sachs noted several key findings:

  • U.S. companies have reportedly invested up to $40 billion in Generative AI projects with limited returns so far.
  • Autonomous AI agents tasked with office work have been found to get tasks wrong a majority of the time.
  • Current AI systems still require significant human oversight to monitor their performance and correct mistakes.

These points underscore the gap between AI's potential and its current practical application in the enterprise. While the data inside companies may be the key to future breakthroughs, the journey to unlock that value is complex and requires substantial investment in both technology and strategy.