A new report from AI data provider Appen highlights that companies increasingly struggle to source and manage the high-quality data needed to power AI systems. Appen’s 2024 State of AI report, which surveyed more than 500 U.S. IT decision-makers, reveals a 17% increase in generative AI adoption over the past year. However, organizations now confront significant hurdles in data preparation and quality, with a 10% rise in bottlenecks related to data sourcing, cleaning, and labeling.

“As AI models take on more complex and specialized tasks, their data needs also shift,” said Si Chen, Head of Strategy at Appen, in an interview with VentureBeat. “Companies are realizing that it’s no longer enough to have lots of data. For optimal model performance, data must be high-quality, meaning accurate, diverse, well-labeled, and tailored to specific AI applications.”

The 17% surge in generative AI (GenAI) adoption in 2024 has been driven by advances in large language models (LLMs) that enable automation across IT, R&D, and other enterprise applications. The increase in GenAI use, however, brings new challenges in data management. “Generative AI outputs are diverse, unpredictable, and subjective, which complicates defining and measuring success,” said Chen. Custom data collection has become the preferred approach to sourcing training data, as companies shift away from generic datasets toward more specific, reliable ones.

Despite the growing interest in AI, fewer projects are reaching deployment, and ROI is dropping. Since 2021, the percentage of AI projects reaching deployment has fallen by 8.1%, while deployed projects showing meaningful ROI have declined by 9.4%. Chen attributes this trend to the complexity of emerging AI applications. While simpler use cases such as image recognition are now mature, new, ambitious initiatives—particularly in generative AI—require high-quality, specialized data, making them harder to execute effectively.

Data accuracy has dropped by nearly 9% since 2021, a concerning trend for AI development. As AI models become more sophisticated, data complexity has increased, often requiring specialized annotations. The report notes that 86% of companies now retrain or update their AI models at least quarterly, heightening the demand for fresh, accurate data. Nearly 90% also rely on external data providers to support their training and evaluation needs.

“Data quality will continue to be a major challenge as models grow more complex,” Chen noted. “With more advanced generative AI models, sourcing, cleaning, and labeling data have already become critical bottlenecks.”

As AI applications become more specialized, preparing the right data has become increasingly complex. “Data preparation issues have worsened,” said Chen. “More specialized models demand unique, tailored datasets.” Many companies are developing long-term strategies to improve data accuracy, consistency, and diversity, and are partnering with external data providers to navigate the complexities of the AI data lifecycle.

Despite AI advancements, human input remains essential. The report indicates that 80% of respondents value human-in-the-loop (HITL) machine learning, which integrates human expertise to guide and improve AI models. “Human involvement is critical for creating high-performing, ethical, and contextually accurate AI,” Chen explained. Human experts are particularly valuable for mitigating biases and ensuring ethical AI development, helping to refine models and align them with real-world behaviors. This need is especially pressing for generative AI, where unpredictable outputs require close monitoring to avoid harmful or biased results.

Appen’s report underscores that as AI continues to advance, companies must focus on high-quality, specialized data and human oversight to overcome emerging challenges and achieve successful AI deployments.

By Impact Lab