GenAI models and large language models (LLMs) consume enormous volumes of unstructured data such as photos, text, audio, and video. It is very difficult to ascertain the quality of this data, which may contain ambiguous, duplicated, or unverified information. How can you assure the quality of GenAI output when the quality of the unstructured input data is questionable?
An IDC study notes that companies that used unstructured data in the past 12 months reported improved customer satisfaction and retention, data governance, compliance with regulations, innovation, and employee productivity. Naturally, there is a rush to leverage unstructured data with GenAI for business growth, innovation, and compliance. However, Forrester reports that data quality is now the primary limiting factor for GenAI adoption.
So, is it time to rethink data quality in the GenAI era?
What is Data Quality?
Traditionally, data quality is a measure of how fit data is for its intended use. Fitness is assessed along dimensions such as accuracy, completeness, consistency, validity, uniqueness, integrity, accessibility, and timeliness. Assessing these dimensions is practical only for structured data, which has a well-defined format and organization.
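For structured data, these dimensions reduce to simple, computable checks. Here is a minimal sketch; the column names and the validity rule are illustrative assumptions:

```python
# Minimal sketch: scoring traditional quality dimensions on a structured table.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, 102, 102, 104],
    "email": ["a@x.com", None, "b@x.com", "c@x.com"],
    "age": [34, 29, 29, 210],  # 210 falls outside a plausible range
})

# Completeness: share of non-null values per column.
completeness = df.notna().mean()

# Uniqueness: share of distinct values in the key column.
uniqueness = df["customer_id"].nunique() / len(df)

# Validity: share of values that satisfy a domain rule.
validity = df["age"].between(0, 120).mean()

print(completeness, uniqueness, validity, sep="\n")
```

No comparable one-line checks exist for a folder of contracts, call recordings, or screenshots.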
When dealing with unstructured data, the absence of a defined format makes it challenging to evaluate completeness, consistency, or validity. Uniqueness is also hard to confirm, as unstructured data is often duplicated across silos: sending a document to a group, for instance, leaves copies in multiple accounts. When several versions of a document exist, determining the most recent and relevant one is crucial. Additionally, understanding the document's context is essential so that GenAI interprets and uses it correctly.
Ultimately, the quality of unstructured data hinges on its contextual accuracy, relevance, and freshness. But how do you assess these attributes in the vast volumes of unstructured data that organizations are constantly flooded with?
Challenges in Assuring the Quality of Unstructured Data
Assuring the quality of unstructured data presents several challenges:
- No standards: There is no single way to determine the quality of unstructured data. The various formats of text, images, videos, and audio make it harder to apply a uniform quality standard.
- Large volume and noise: The sheer volume of unstructured data, often streaming in real time, can be overwhelming to process. It also typically contains irrelevant, redundant, or noisy information that degrades quality.
- Contextual accuracy: Ensuring the data accurately reflects its context is challenging, as interpretation depends on factors, such as author, intent, and timing, that simple analysis does not capture.
- Resource-intensive processing: Delivering quality requires sophisticated tools and human oversight to interpret ambiguous data correctly, which can be resource-intensive.
- Sensitive information: Unstructured data may contain personal information (PI), personally identifiable information (PII), or other sensitive content, posing privacy risks. However, omitting this data can degrade quality and, subsequently, GenAI responses. Sanitizing data is essential for its safe use (see the redaction sketch after this list).
Addressing these challenges involves deploying advanced tools and establishing robust data governance frameworks to maintain high data quality.
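As a concrete illustration of the sanitization challenge, here is a minimal sketch of pattern-based redaction. The patterns are illustrative assumptions, not an exhaustive set; production systems pair regexes with NER models and policy engines:

```python
# Minimal sketch: regex-based redaction of common PII patterns before text
# reaches a GenAI pipeline. These patterns are illustrative, not exhaustive.
import re

PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\+?\d[\d\s().-]{7,}\d\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def sanitize(text: str) -> str:
    """Replace each detected PII span with a typed placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(sanitize("Contact Jane at jane.doe@example.com or 555-010-7788."))
# -> Contact Jane at [EMAIL] or [PHONE].
```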
Data Quality: Structured vs. Unstructured Data
| Structured Data | Unstructured Data |
| --- | --- |
| Data organized in tables with rows and columns, ensuring that each data point conforms to a specific type, range, and structure. | Data includes text, images, and videos with no predefined format or organization, making it difficult to apply any standard definition of quality. |
| Quality is defined by accuracy, completeness, and consistency. | Quality depends on the richness and contextual accuracy of the content, along with relevance and freshness. |
| Quality implies the data is fit for use in business processes and analytics. | Quality indicates that the data can be reliably processed and analyzed using advanced techniques like NLP and ML. |
Rethinking Data Quality for GenAI
To deliver high data quality, it is essential to understand how GenAI works with unstructured data. GenAI builds context around data by inferring metadata and connecting related concepts, something relational tables cannot express. It also interprets data that can take any value within a range rather than well-defined discrete values, so your data quality approach must curate ongoing GenAI interactions rather than a one-time dataset. Finally, GenAI consumes large volumes of data and needs inline processing to deliver fast, accurate, contextual conversations.
It is also important to note that GenAI consumes everything you provide, including sensitive data, and may retain that information indefinitely. Safeguarding sensitive data as part of the data quality initiative ensures safe and compliant data use.
In essence, GenAI needs new data quality measures, such as freshness, relevance, and uniqueness, along with data curation and data sanitization, to build trusted, robust models.
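To make these measures concrete, here is a minimal sketch of per-document scoring. The half-life value is an illustrative assumption, and the vectors are assumed to come from whatever embedding model your pipeline uses:

```python
# Minimal sketch: scoring freshness, uniqueness, and relevance for documents
# before they feed a GenAI pipeline.
import hashlib
import math
import time

def freshness(mtime: float, half_life_days: float = 90.0) -> float:
    """Decay a document's score exponentially with its age."""
    age_days = (time.time() - mtime) / 86400
    return 0.5 ** (age_days / half_life_days)

def is_duplicate(text: str, seen_hashes: set) -> bool:
    """Uniqueness via content hashing: exact duplicates share a digest."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False

def relevance(doc_vec: list, query_vec: list) -> float:
    """Cosine similarity between a document embedding and a use-case embedding."""
    dot = sum(a * b for a, b in zip(doc_vec, query_vec))
    norms = math.hypot(*doc_vec) * math.hypot(*query_vec)
    return dot / norms if norms else 0.0
```

Exact hashing is only a first pass for uniqueness; near-duplicates (re-saved or lightly edited copies) need fuzzier techniques such as MinHash or embedding similarity.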
How Securiti Delivers High Data Quality
Delivering high data quality begins with understanding both the data and the GenAI models that will use it. Securiti helps you gain contextual insights into data from all key perspectives with a multidimensional Data Command Graph: a knowledge graph that captures essential metadata, and the relationships between metadata elements, for all data types, including documents, images, audio, video, CLOBs, and more.
With the Securiti Data Command Graph, you can get a complete view of:
- File categories based on content, for example, legal, finance, or HR
- Access and user entitlements
- Sensitive objects within a file
- Regulations applicable to file content
- File quality, such as freshness, relevance, or uniqueness
- Lineage of files and embeddings used in GenAI pipelines
With these insights, you can respond to any question about data, GenAI models, and their relationships, enabling the safe use of data and AI.
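For intuition, here is a minimal sketch of how such a metadata knowledge graph can be represented and queried. It illustrates the general pattern only, not Securiti's implementation; every node, attribute, and relation name below is an assumption:

```python
# Minimal sketch: a metadata knowledge graph linking files to categories,
# users, regulations, and embeddings, then answering a cross-cutting question.
import networkx as nx

g = nx.MultiDiGraph()

# Nodes: a file, its content category, a user, a regulation, and an embedding.
g.add_node("contract_2024.pdf", kind="file", freshness=0.92)
g.add_node("Legal", kind="category")
g.add_node("alice@corp.com", kind="user")
g.add_node("GDPR", kind="regulation")
g.add_node("vec_8841", kind="embedding")

# Edges: the relationships enumerated in the list above.
g.add_edge("contract_2024.pdf", "Legal", relation="categorized_as")
g.add_edge("alice@corp.com", "contract_2024.pdf", relation="entitled_to")
g.add_edge("contract_2024.pdf", "GDPR", relation="subject_to")
g.add_edge("vec_8841", "contract_2024.pdf", relation="derived_from")  # lineage

# Question: which regulations apply to the files Alice can access?
for _, f, d in g.out_edges("alice@corp.com", data=True):
    if d["relation"] == "entitled_to":
        regs = [r for _, r, e in g.out_edges(f, data=True)
                if e["relation"] == "subject_to"]
        print(f, "->", regs)  # contract_2024.pdf -> ['GDPR']
```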
Next come data curation, data sanitization, and inline data quality.
Data Curation
Securiti helps you curate and auto-label files and objects for use in GenAI projects. You can:
- Curate data by analyzing content and automatically adding labels to files based on that content (a minimal labeling sketch follows this list).
- Use an extensible policy framework to automatically apply sensitivity and use-case labels to files and documents. These labels can include personal data category, purpose, retention, and more, delivering the contextual accuracy and relevance needed to ensure only appropriate data enters your GenAI projects.
- Preserve labels and tags when moving files from source systems for feeding to GenAI models.
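As a rough illustration of content-based labeling that travels with the file, consider this sketch. The keyword rules and sidecar format are illustrative assumptions; a production curation pipeline would use trained classifiers and a policy engine:

```python
# Minimal sketch: keyword-based auto-labeling with labels persisted as a
# sidecar file, so they survive moves between source systems and GenAI feeds.
import json
from pathlib import Path

LABEL_RULES = {
    "finance": ["invoice", "payment", "balance sheet"],
    "hr": ["salary", "performance review", "onboarding"],
    "legal": ["agreement", "liability", "indemnify"],
}

def auto_label(path: Path) -> list[str]:
    """Assign every label whose keywords appear in the file's text."""
    text = path.read_text(errors="ignore").lower()
    return [label for label, keywords in LABEL_RULES.items()
            if any(kw in text for kw in keywords)]

def write_sidecar(path: Path) -> None:
    """Persist labels next to the file so they move with it."""
    sidecar = path.parent / (path.name + ".labels.json")
    sidecar.write_text(json.dumps({"file": path.name,
                                   "labels": auto_label(path)}))
```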