The Evolution of Data Quality: How GenAI is Setting New Standards

Published July 9, 2024

A few years back, Google's Photos app, an AI tool designed to categorize and tag images, made several mistakes in labeling photos. The biased results stemmed from poor-quality training data lacking diversity and representation of different skin tones. This incident highlighted how critical it is to use complete, correct, and representative training data so that AI systems perform accurately and without misrepresentation. A recent survey echoes that sentiment: 46% of data leaders identified “data quality” as the greatest challenge to realizing GenAI’s potential in their organizations.

Garbage in, garbage out. This timeless truth carries even more weight in the GenAI (Generative AI) era. Unstructured data is at the epicenter of the GenAI revolution, driving innovation with diverse inputs to sophisticated models. The wealth of information contained in unstructured data enables deep and accurate insights, transforming industries and enhancing decision-making processes. Our earlier blog discusses how Unstructured Data Intelligence is critical for harnessing it effectively.

Unstructured Data

GenAI models and LLMs (Large Language Models) use enormous volumes of unstructured data such as photos, text, audio, and video. It is very difficult to ascertain the quality of this data, which may contain ambiguous, duplicated, and unverified information. How can you assure the quality of GenAI output when the quality of the input unstructured data is questionable?

An IDC study notes that companies that used unstructured data in the past 12 months reported improved customer satisfaction and retention, data governance, compliance with regulations, innovation, and employee productivity. Naturally, there is a rush to leverage unstructured data with GenAI for business growth, innovation, and compliance. However, Forrester reports that data quality is now the primary limiting factor for GenAI adoption.

So, is it time to rethink data quality in the GenAI era?

What is Data Quality

In the traditional definition, data quality is a measure of how fit the data is for its intended use. The fitness of data is measured by accuracy, completeness, consistency, validity, uniqueness, integrity, accessibility, and timeliness. Assessing these dimensions is straightforward mainly for structured data, which has well-defined formats and organization.
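To make these dimensions concrete, here is a minimal Python sketch of how three of them (completeness, uniqueness, and validity) might be scored on tabular records. The sample rows, field names, and the simplified email rule are assumptions made up for illustration; they do not come from any particular tool.

```python
from datetime import date

# Hypothetical customer records; the field names are illustrative only.
rows = [
    {"id": 1, "email": "ana@example.com", "signup": date(2024, 1, 5)},
    {"id": 2, "email": None, "signup": date(2024, 2, 9)},
    {"id": 2, "email": "bo@example", "signup": date(2024, 3, 1)},
]

def completeness(rows, field):
    """Share of rows where the field is present."""
    return sum(r[field] is not None for r in rows) / len(rows)

def uniqueness(rows, field):
    """Share of distinct values among all values of the field."""
    values = [r[field] for r in rows]
    return len(set(values)) / len(values)

def validity(rows, field, is_valid):
    """Share of non-missing values that pass a format rule."""
    present = [r[field] for r in rows if r[field] is not None]
    return sum(is_valid(v) for v in present) / len(present) if present else 1.0

def looks_like_email(value):
    # Deliberately simple rule for the sketch, not a full email validator.
    local, _, domain = value.partition("@")
    return bool(local) and "." in domain
```

Checks like these rely on every record sharing the same fields and types, which is exactly what unstructured content lacks.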

When dealing with unstructured data, the absence of any defined format makes it challenging to evaluate completeness, consistency, or validity. Uniqueness is also hard to confirm, as unstructured data is often duplicated across different silos. For instance, sending a document to a group results in multiple copies saved in various accounts. Determining the most recent and relevant version of a document is crucial, especially when multiple versions exist. Additionally, understanding the context of the document is essential to ensure that GenAI interprets and utilizes it correctly.
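As a rough illustration of the duplicate-copy problem, the Python sketch below groups documents by a hash of their normalized text and keeps only the most recently modified copy of each. The `Doc` class, its fields, and the normalization rule are assumptions for the example. Exact hashing only catches copies whose text is identical after normalization; edited versions of a document would need a similarity-based approach instead.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Doc:
    path: str           # hypothetical location of the copy
    text: str
    modified: datetime

def fingerprint(text: str) -> str:
    """Hash of whitespace- and case-normalized text, so trivial copies match."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def keep_latest_copies(docs):
    """Return one Doc per distinct content, preferring the newest copy."""
    latest = {}
    for doc in docs:
        key = fingerprint(doc.text)
        if key not in latest or doc.modified > latest[key].modified:
            latest[key] = doc
    return list(latest.values())
```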

Ultimately, the quality of unstructured data hinges on its contextual accuracy, relevance, and freshness. But how do you assess these attributes in the vast volumes of unstructured data that organizations are constantly flooded with?

Challenges in Assuring the Quality of Unstructured Data

Assuring the quality of unstructured data presents several challenges:

  1. No standards: There is no single way to determine the quality of unstructured data. The various formats of text, images, videos, and audio make it harder to apply a uniform quality standard.
  2. Large volume and noise: The sheer volume of real-time streaming of unstructured data can be overwhelming to process. It also typically contains irrelevant, redundant, or noisy information that affects quality.
  3. Contextual accuracy: Ensuring the data accurately reflects its context is challenging, as the interpretation is based on various factors not captured by simple analysis.
  4. Resource-intensive processing: Delivering quality requires sophisticated tools and human oversight to interpret ambiguous data correctly, which can be resource-intensive.
  5. Sensitive information: Unstructured data may contain PI, PII, or sensitive information, posing privacy risks. However, omitting this data can affect quality and, subsequently, the GenAI responses. Sanitizing data is essential for its safe use.

Addressing these challenges involves deploying advanced tools and establishing robust data governance frameworks to maintain high data quality.

Data Quality: Structured vs. Unstructured Data

Structured Data

  • Data organized in tables with rows and columns, ensuring that each data point conforms to a specific type, range, and structure.
  • Quality is defined by the accuracy, completeness, and consistency.
  • Quality implies the data is fit for use in business processes and analytics.

Unstructured Data

  • Data includes text, images, and videos with no predefined format or organization, making it difficult to apply any standard definition of quality.
  • Quality depends on the richness and contextual accuracy of the content, along with relevance and freshness.
  • Quality indicates that the data can be reliably processed and analyzed using advanced techniques like NLP and ML.

Rethinking Data Quality for GenAI

To deliver high data quality, it is essential to understand how GenAI works with unstructured data. GenAI builds the context around data by inferring metadata and connecting data concepts, which is not possible with relational tables. It also works with open-ended data that can take any value within a range, rather than well-defined discrete values, so your data quality approach should focus on curating ongoing GenAI interactions. Finally, GenAI consumes large volumes of data and needs inline processing to deliver fast, accurate, contextual conversations.

It is also important to note that GenAI consumes everything you provide, including sensitive data, and retains the information forever. Safeguarding sensitive data as part of the data quality initiative can ensure safe and compliant data use.

In essence, GenAI needs new data quality measures such as freshness, relevance, and uniqueness, along with data curation and data sanitization to build trusted, robust models.

How Securiti Delivers High Data Quality

Delivering high data quality begins with understanding data and the GenAI models that will use the data. Securiti helps you gain contextual insights for data from all key perspectives with a multidimensional Data Command Graph. It is a Knowledge Graph that captures all essential metadata, and the relationships within it, for all data types, including documents, images, audio, video, CLOBs, and more.

With the Securiti Data Command Graph, you can get a complete view of:

  • File categories based on content, for example, legal, finance, or HR
  • Access and user entitlements
  • Sensitive objects within a file
  • Regulations applicable to file content
  • File quality, such as freshness, relevance, or uniqueness
  • Lineage of files and embeddings used in GenAI pipelines

With these insights, you can respond to any question about data, GenAI models, and their relationships, enabling the safe use of data and AI.
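To give a feel for how a metadata graph can answer such questions, here is a toy Python sketch that stores facts as (subject, relation, object) triples and looks them up in both directions. It is not the Data Command Graph or any Securiti API; the class, relation names, and file names are all hypothetical.

```python
from collections import defaultdict

class MetadataGraph:
    """Tiny in-memory triple store: (subject, relation, object)."""

    def __init__(self):
        self._out = defaultdict(set)

    def add(self, subject, relation, obj):
        self._out[(subject, relation)].add(obj)

    def objects(self, subject, relation):
        """Everything the subject points to via the relation."""
        return sorted(self._out.get((subject, relation), set()))

    def subjects(self, relation, obj):
        """Every subject that points to the object via the relation."""
        return sorted(
            s for (s, r), objs in self._out.items() if r == relation and obj in objs
        )

# Hypothetical facts about two files.
graph = MetadataGraph()
graph.add("offer_letter.pdf", "category", "hr")
graph.add("offer_letter.pdf", "contains", "salary")
graph.add("offer_letter.pdf", "readable_by", "hr-team")
graph.add("q2_forecast.xlsx", "category", "finance")
```

With facts stored this way, a question such as "which files are categorized as HR?" becomes `graph.subjects("category", "hr")`, and "what sensitive objects does this file contain?" becomes `graph.objects("offer_letter.pdf", "contains")`.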

Next come data curation, data sanitization, and inline data quality.

Data Curation

Securiti helps you curate and auto-label files and objects for use in GenAI projects. You can:

  • Curate data by analyzing content and automatically adding data labels to files based on content.
  • Use an extensible policy framework to automatically apply sensitivity and use case labels within files and documents. These labels can include personal data category, purpose, retention, and more, to deliver contextual accuracy and relevance to ensure you use only appropriate data for your GenAI projects.
  • Preserve labels and tags when moving files from source systems for feeding to GenAI models.
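The general idea of content-based labeling can be sketched in a few lines of Python. This is a deliberately naive keyword matcher, not how Securiti classifies content; the `LABEL_RULES` table and its keywords are invented for the example, and a production system would more likely use trained classifiers and a policy engine.

```python
# Hypothetical label rules: category -> keywords that suggest it.
LABEL_RULES = {
    "legal": {"contract", "liability", "clause"},
    "finance": {"invoice", "revenue", "budget"},
    "hr": {"salary", "onboarding", "employee"},
}

def label_document(text, rules=LABEL_RULES):
    """Return the sorted list of categories whose keywords appear in the text."""
    words = {w.strip(".,;:!?()").lower() for w in text.split()}
    return sorted(cat for cat, keywords in rules.items() if words & keywords)
```

A document can receive several labels at once, which is useful when downstream policies depend on both the category and the sensitivity of the content.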

Data Sanitization

If GenAI models learn from any sensitive information, it remains with them forever, compromising data privacy and security. Securiti enables you to:

  • Discover and classify data in flight for PII and sensitive information for sanitization.
  • Automatically mask, anonymize, redact, or tokenize data in-flight within a GenAI pipeline.
  • Ensure compliance with internal controls and the ever-evolving global data and AI regulations before transferring data for use with LLMs for training or inference.
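The difference between redaction and tokenization can be shown with a small Python sketch. The two regular expressions below are simplified assumptions that cover only emails and one US SSN format; real PII discovery needs far broader detection. The salted-hash token is likewise only a stand-in for a proper tokenization service.

```python
import hashlib
import re

# Simplified patterns for the sketch; real PII detection needs far broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact(text):
    """Replace matches with a fixed placeholder, discarding the original value."""
    text = EMAIL.sub("[EMAIL]", text)
    return US_SSN.sub("[SSN]", text)

def tokenize(text, salt="demo-salt"):
    """Replace each email with a stable token so repeated mentions still match."""
    def to_token(match):
        digest = hashlib.sha256((salt + match.group(0)).encode("utf-8")).hexdigest()
        return f"<tok:{digest[:10]}>"
    return EMAIL.sub(to_token, text)
```

Redaction removes the value entirely, while tokenization keeps the fact that two mentions refer to the same person, which can preserve more context for the model without exposing the underlying identifier.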

Data Quality

Securiti helps ensure GenAI model efficacy by maintaining file quality and eliminating duplicates and stale data.

  • Infer and analyze metadata on files, such as their recency and topic, to measure data quality.
  • Evaluate files inline to ensure:
    • Freshness
    • Uniqueness
    • Relevance to the topic
    • Reliability of sources
  • Develop new data quality measures, such as robustness and non-hallucination of model responses in a non-deterministic world.
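An inline check for freshness, uniqueness, and topic relevance might look like the Python sketch below. The `QualityGate` class, its thresholds, and the keyword-overlap relevance score are assumptions chosen for simplicity; source reliability is left out because it depends on provenance metadata that this example does not model.

```python
import hashlib
from datetime import datetime, timedelta

def is_fresh(modified, now, max_age_days=365):
    """True if the file was modified within the allowed age window."""
    return now - modified <= timedelta(days=max_age_days)

def relevance(text, topic_terms):
    """Fraction of topic terms that occur in the text (0.0 to 1.0)."""
    words = {w.strip(".,;:!?()").lower() for w in text.split()}
    return len(words & topic_terms) / len(topic_terms) if topic_terms else 0.0

class QualityGate:
    def __init__(self, topic_terms, max_age_days=365, min_relevance=0.5):
        self.topic_terms = {t.lower() for t in topic_terms}
        self.max_age_days = max_age_days
        self.min_relevance = min_relevance
        self._seen = set()  # hashes of content already passed through the gate

    def check(self, text, modified, now):
        """Return the list of failed checks; an empty list means the file passes."""
        failures = []
        if not is_fresh(modified, now, self.max_age_days):
            failures.append("stale")
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in self._seen:
            failures.append("duplicate")
        self._seen.add(digest)
        if relevance(text, self.topic_terms) < self.min_relevance:
            failures.append("off-topic")
        return failures
```

Returning the list of failed checks, rather than a single pass or fail flag, lets a pipeline decide per check whether to drop a file, quarantine it, or send it for review.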

5 Best Practices to Ensure Data Quality for GenAI

Here are five best practices to ensure you deliver high-quality data essential for GenAI's success.

  1. Include unstructured data in your quality strategy: In a recent survey of CDOs and data leaders, 93% of respondents agreed that data strategy is critical for getting value from GenAI. Extend your data quality management strategy to include unstructured data for comprehensive quality across all data types. This inclusion helps capture valuable insights from diverse unstructured data sources like text, images, and social media.
  2. Define your data quality objectives for GenAI projects: Evaluate your quality requirements to gain clarity on your specific goals. They can include relevance of data, accuracy, freshness, or other attributes. Prioritize them to decide on controls.
  3. Choose the right tools to deliver inline data quality: For GenAI, dynamic controls across diverse data sources and flows are essential to deliver accurate, non-hallucinating model responses.
  4. Harness the power of the Knowledge Graph for quality: The Knowledge Graph reveals interconnected relationships essential for building context and intelligence on data. This visibility drives the quality and security of data within GenAI pipelines.
  5. Invest in a Data Command Center for streamlined collaboration: A comprehensive Data Command Center addresses privacy, security, governance, and compliance, complementing your quality initiatives. It can streamline operations across organizational data silos to deliver a single source of truth for data and AI intelligence.

In Summary

In the GenAI era, the quality of large volumes of unstructured data can affect the accuracy of GenAI output, which is essential for driving business growth and compliance. However, defining and delivering quality for this data is fraught with challenges, especially the lack of standards and the risk of exposing sensitive data.

Securiti empowers you to safely harness your structured and unstructured data with GenAI models. Overcome the data quality challenges with Securiti and follow best practices to ensure trusted GenAI responses. Learn how to assure the quality of unstructured data and use it effectively for powering your GenAI use cases.

In our upcoming blog, we will explore how tracing the lineage of unstructured data is critical to the success of GenAI initiatives.
