The Evolution of Data Quality: How GenAI is Setting New Standards

Author

Ankur Gupta

Director for Data Governance and AI Products at Securiti

A few years back, Google's Photos app, an AI tool designed to categorize and tag images, made several mistakes in labeling photos. The biased results stemmed from poor-quality training data that lacked diversity and representation of different skin tones. This incident highlighted how critical it is to use complete, correct, and representative training data so that AI systems perform accurately and without misrepresentation. The quality of the data used for AI is the key issue here. Thomas C. Redman, the Data Doc, notes that the quality requirements for AI are far broader and deeper.

Garbage in, garbage out. This timeless truth carries even more weight in the GenAI (Generative AI) era, a sentiment echoed by a recent survey in which 46% of data leaders identified “data quality” as the greatest challenge to realizing GenAI’s potential in their organizations. The complexity of managing unstructured data adds to the challenge.

Unstructured data is at the epicenter of the GenAI revolution, driving innovation with diverse inputs to sophisticated models. The wealth of information contained in unstructured data enables deep and accurate insights, transforming industries and enhancing decision-making processes. Our earlier blog discusses how Unstructured Data Intelligence is critical for harnessing it effectively.

Unstructured Data

GenAI models and LLMs (Large Language Models) use enormous volumes of unstructured data such as photos, text, audio, and video. It is very difficult to ascertain the quality of this data, which may contain ambiguous, duplicated, and unverified information. How can you assure the quality of GenAI output when the quality of the input unstructured data is questionable?

An IDC study notes that companies that used unstructured data in the past 12 months reported improved customer satisfaction and retention, data governance, compliance with regulations, innovation, and employee productivity. Naturally, there is a rush to leverage unstructured data with GenAI for business growth, innovation, and compliance. However, Forrester reports that data quality is now the primary limiting factor for GenAI adoption.

So, is it time to rethink data quality in the GenAI era?

What is Data Quality

In the traditional definition, data quality is a measure of how fit the data is for its intended use. The fitness of data is measured by accuracy, completeness, consistency, validity, uniqueness, integrity, accessibility, and timeliness. Assessing these dimensions is straightforward only for structured data, which has well-defined formats and organization.

When dealing with unstructured data, the absence of any defined format makes it challenging to evaluate completeness, consistency, or validity. Uniqueness is also hard to confirm, as unstructured data is often duplicated across different silos. For instance, sending a document to a group results in multiple copies saved in various accounts. Determining the most recent and relevant version of a document is crucial, especially when multiple versions exist. Additionally, understanding the context of the document is essential to ensure that GenAI interprets and utilizes it correctly.

Ultimately, the quality of unstructured data hinges on its contextual accuracy, relevance, and freshness. But how do you assess these attributes in the vast volumes of unstructured data that organizations are constantly flooded with?
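
One small piece of this, finding exact copies and keeping the most recent one, can be approximated mechanically. The sketch below is illustrative only: the `Document` type, its fields, and the normalization rule are assumptions made for this example, not part of any product. It hashes normalized text so that copies saved in different accounts collapse to a single fingerprint, then keeps the latest copy.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Document:
    path: str            # hypothetical location of this copy
    text: str            # extracted text content
    modified: datetime   # last-modified timestamp


def fingerprint(text: str) -> str:
    """Hash case- and whitespace-normalized text so trivial differences collapse."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def keep_latest_unique(docs: list[Document]) -> list[Document]:
    """Keep one copy per distinct content, preferring the most recently modified."""
    latest: dict[str, Document] = {}
    for doc in docs:
        key = fingerprint(doc.text)
        if key not in latest or doc.modified > latest[key].modified:
            latest[key] = doc
    # Sort by path only to make the output order predictable.
    return sorted(latest.values(), key=lambda d: d.path)
```

Note the limits of this approach: it only catches copies that are identical after normalization. Edited versions of the same document, and judgments about context or relevance, need richer techniques than a content hash.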

Challenges in Assuring the Quality of Unstructured Data

Assuring the quality of unstructured data presents several challenges:

  1. No standards: There is no single way to determine the quality of unstructured data. The various formats of text, images, videos, and audio make it harder to apply a uniform quality standard.
  2. Large volume and noise: The sheer volume of unstructured data streaming in real time can be overwhelming to process. It also typically contains irrelevant, redundant, or noisy information that affects quality.
  3. Contextual accuracy: Ensuring the data accurately reflects its context is challenging, as the interpretation is based on various factors not captured by simple analysis.
  4. Resource-intensive processing: Delivering quality requires sophisticated tools and human oversight to interpret ambiguous data correctly, which can be resource-intensive.
  5. Sensitive information: Unstructured data may contain PI, PII, or other sensitive information, posing privacy risks. However, omitting this data can affect quality and, subsequently, the GenAI responses. Sanitizing data is essential for its safe use.

Addressing these challenges involves deploying advanced tools and establishing robust data governance frameworks to maintain high data quality.

Data Quality: Structured vs. Unstructured Data

| Structured Data | Unstructured Data |
| --- | --- |
| Data organized in tables with rows and columns, ensuring that each data point conforms to a specific type, range, and structure. | Data includes text, images, and videos with no predefined format or organization, making it difficult to apply any standard definition of quality. |
| Quality is defined by accuracy, completeness, and consistency. | Quality depends on the richness and contextual accuracy of the content, along with relevance and freshness. |
| Quality implies the data is fit for use in business processes and analytics. | Quality indicates that the data can be reliably processed and analyzed using advanced techniques like NLP and ML. |

Rethinking Data Quality for GenAI

To deliver high data quality, it is essential to understand how GenAI works with unstructured data. GenAI builds context around data by inferring metadata and connecting data concepts, which is not possible with relational tables. It also interprets data that can take any value within a range, rather than values from well-defined discrete sets, so your data quality approach should focus on curating ongoing GenAI interactions. Finally, GenAI consumes large volumes of data and needs inline processing to deliver fast, accurate, contextual conversations.

It is also important to note that GenAI consumes everything you provide, including sensitive data, and retains the information forever. Safeguarding sensitive data as part of the data quality initiative can ensure safe and compliant data use.

In essence, GenAI needs new data quality measures such as freshness, relevance, and uniqueness, along with data curation and data sanitization, to build trusted, robust models.

How Securiti Delivers High Data Quality

Delivering high data quality begins with understanding the data and the GenAI models that will use it. Securiti helps you gain contextual insights into data from all key perspectives with a multidimensional Data Command Graph. It is a Knowledge Graph that captures essential metadata, and the relationships within it, for all data types, including documents, images, audio, video, CLOBs, and many more.

With the Securiti Data Command Graph, you can get a complete view of:

  • File categories based on content, for example, legal, finance, or HR
  • Access and user entitlements
  • Sensitive objects within a file
  • Regulations applicable to file content
  • File quality, such as freshness, relevance, or uniqueness
  • Lineage of files and embeddings used in GenAI pipelines

With these insights, you can respond to any question about data, GenAI models, and their relationships, enabling the safe use of data and AI.

Next come data curation, data sanitization, and inline data quality.

Data Curation

Securiti helps you curate and auto-label files and objects for use in GenAI projects. You can

  • Curate data by analyzing content and automatically adding data labels to files based on content.
  • Use an extensible policy framework to automatically apply sensitivity and use case labels within files and documents. These labels can include personal data category, purpose, retention, and more. They provide the contextual accuracy and relevance needed to ensure you use only appropriate data for your GenAI projects.
  • Preserve labels and tags when moving files from source systems for feeding to GenAI models.
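
To make the idea of content-based labeling concrete, here is a deliberately simple sketch. It is not how Securiti implements curation. The keyword rules, the `LabeledFile` type, and the `select_for_use_case` helper are all hypothetical, and a real classifier would rely on far more than keyword matching.

```python
from dataclasses import dataclass, field

# Hypothetical keyword rules; a real policy framework would be far richer.
CATEGORY_KEYWORDS: dict[str, set[str]] = {
    "legal": {"contract", "liability", "clause"},
    "finance": {"invoice", "revenue", "budget"},
    "hr": {"salary", "onboarding", "employee"},
}


@dataclass
class LabeledFile:
    name: str
    text: str
    labels: set[str] = field(default_factory=set)


def auto_label(file: LabeledFile) -> LabeledFile:
    """Add a category label for every rule whose keywords appear in the text."""
    words = {w.strip(".,;:!?()").lower() for w in file.text.split()}
    for category, keywords in CATEGORY_KEYWORDS.items():
        if words & keywords:
            file.labels.add(category)
    return file


def select_for_use_case(files: list[LabeledFile], allowed: set[str]) -> list[LabeledFile]:
    """Keep files that carry at least one label and whose labels are all allowed."""
    return [f for f in files if f.labels and f.labels <= allowed]
```

Because the labels travel with the `LabeledFile` object, a downstream step can filter on them without re-reading the content, which is the point of preserving labels when files move toward a model.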

Data Sanitization

If GenAI models learn from any sensitive information, it remains with them forever, compromising data privacy and security. Securiti enables you to

  • Discover and classify data in flight for PII and sensitive information for sanitization.
  • Automatically mask, anonymize, redact, or tokenize data in-flight within a GenAI pipeline.
  • Ensure compliance with internal controls and the ever-evolving global data and AI regulations before transferring data for use with LLMs for training or inference.
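
As an illustration of redaction before text reaches a model, the sketch below replaces a few common identifier formats with placeholder tokens. The patterns and the `sanitize` function are assumptions for this example: they cover only simple US-style formats, and pattern matching alone will miss much of what a production classifier detects.

```python
import re

# Hypothetical patterns covering only simple, US-style formats.
PATTERNS: dict[str, re.Pattern[str]] = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def sanitize(text: str) -> str:
    """Replace each match with a placeholder naming the kind of data removed."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


print(sanitize("Contact jane.doe@example.com or 555-123-4567, SSN 123-45-6789."))
# prints: Contact [EMAIL] or [PHONE], SSN [SSN].
```

Keeping a typed placeholder such as `[EMAIL]` rather than deleting the value preserves sentence structure, which helps limit the quality loss that comes from omitting sensitive data outright.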

Data Quality

Securiti helps ensure GenAI model efficacy by maintaining file quality and eliminating duplicates and stale data.

  • Infer and analyze metadata on files, such as their recency and topic, to measure data quality
  • Evaluate files inline to ensure:
    • Freshness
    • Uniqueness
    • Relevance to the topic
    • Reliability of sources
  • Develop new data quality measures, such as robustness and non-hallucination of model responses in a non-deterministic world.
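
The inline checks listed above can be pictured as a gate that each file must pass before it enters a GenAI pipeline. The following is a minimal sketch under stated assumptions: the `FileMeta` fields, the one-year freshness window, the term-overlap relevance score, and the 0.5 threshold are all hypothetical choices for illustration, not product behavior.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class FileMeta:
    name: str
    text: str
    modified: datetime
    source: str  # hypothetical identifier of the originating system


def is_fresh(meta: FileMeta, now: datetime, max_age: timedelta = timedelta(days=365)) -> bool:
    """A file counts as fresh if it was modified within max_age of now."""
    return now - meta.modified <= max_age


def relevance(text: str, topic_terms: set[str]) -> float:
    """Fraction of topic terms that appear in the text (0.0 to 1.0)."""
    if not topic_terms:
        return 0.0
    words = {w.strip(".,;:!?()").lower() for w in text.split()}
    return len(words & topic_terms) / len(topic_terms)


def passes_gate(
    meta: FileMeta,
    now: datetime,
    topic_terms: set[str],
    trusted_sources: set[str],
    min_relevance: float = 0.5,
) -> bool:
    """Admit a file only if it is fresh, from a trusted source, and on topic."""
    return (
        is_fresh(meta, now)
        and meta.source in trusted_sources
        and relevance(meta.text, topic_terms) >= min_relevance
    )
```

Term overlap is a crude stand-in for relevance; embedding similarity or a topic classifier would be a more realistic choice, at a higher processing cost per file.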

Properly managed high-quality data is increasingly seen as an asset of potentially limitless value, with AI as the key to unlocking that potential. Securiti helps you realize this potential.

5 Best Practices to Ensure Data Quality for GenAI

Here are five best practices to ensure you deliver high-quality data essential for GenAI's success.

  1. Include unstructured data in your quality strategy: In a recent survey of CDOs and data leaders, 93% of respondents agreed that data strategy is critical for getting value from GenAI. Extend your data quality management strategy to include unstructured data for comprehensive quality across all data types. This inclusion helps capture valuable insights from diverse unstructured data sources like text, images, and social media.
  2. Define your data quality objectives for GenAI projects: Evaluate your quality requirements to gain clarity on your specific goals. They can include relevance of data, accuracy, freshness, or other attributes. Prioritize them to decide on controls.
  3. Choose the right tools to deliver inline data quality: For GenAI, dynamic controls across diverse data sources and flows are essential to deliver accurate, non-hallucinating model responses.
  4. Harness the power of the Knowledge Graph for quality: The Knowledge Graph reveals interconnected relationships essential for building context and intelligence on data. This visibility drives the quality and security of data within GenAI pipelines.
  5. Invest in a Data Command Center for streamlined collaboration: A comprehensive Data Command Center addresses privacy, security, governance, and compliance, complementing your quality initiatives. It can streamline operations across organizational data silos to deliver a single source of truth for data and AI intelligence.

In Summary

In the GenAI era, the quality of the large volumes of unstructured data you feed to models directly affects the accuracy of GenAI output, which is essential for driving business growth and compliance. However, defining and delivering quality for this data is fraught with challenges, especially the lack of standards and the risk of exposing sensitive data.

Securiti empowers you to safely harness your structured and unstructured data with GenAI models. Overcome the data quality challenges with Securiti and follow best practices to ensure trusted GenAI responses. Learn how to assure the quality of unstructured data and use it effectively for powering your GenAI use cases.

In our upcoming blog, we will explore how tracing the lineage of unstructured data is critical to the success of GenAI initiatives.
