The Evolution of Data Quality: How GenAI is Setting New Standards

Author

Ankur Gupta

Director for Data Governance and AI Products at Securiti

A few years back, Google's Photos app, an AI tool designed to categorize and tag images, made several mistakes in labeling photos. The biased results stemmed from poor-quality training data that lacked diversity and did not represent different skin tones. This incident highlighted how critical complete, correct, and representative training data is for AI systems to perform accurately and without misrepresentation. Data quality is the key here. Thomas C. Redman, the Data Doc, notes that the quality requirements for AI are far broader and deeper.

Garbage in, garbage out. This timeless truth makes much more sense now in the GenAI (Generative AI) era, a sentiment echoed by a recent survey in which 46% of data leaders identified “data quality” as the greatest challenge to realizing GenAI’s potential in their organizations. The complexity of managing unstructured data adds to the challenge.

Unstructured data is at the epicenter of the GenAI revolution, driving innovation with diverse inputs to sophisticated models. The wealth of information contained in unstructured data enables deep and accurate insights, transforming industries and enhancing decision-making processes. Our earlier blog discusses how Unstructured Data Intelligence is critical for harnessing it effectively.

Unstructured Data

GenAI models and LLMs (Large Language Models) use enormous volumes of unstructured data such as photos, text, audio, and video. It is very difficult to ascertain the quality of this data, which may contain ambiguous, duplicated, and unverified information. How can you assure the quality of GenAI output when the quality of the input unstructured data is questionable?

An IDC study notes that companies that used unstructured data in the past 12 months reported improved customer satisfaction and retention, data governance, compliance with regulations, innovation, and employee productivity. Naturally, there is a rush to leverage unstructured data with GenAI for business growth, innovation, and compliance. However, Forrester reports that data quality is now the primary limiting factor for GenAI adoption.

So, is it time to rethink data quality in the GenAI era?

What is Data Quality

In the traditional definition, data quality is a measure of how fit the data is for its intended use. The fitness of data is measured by accuracy, completeness, consistency, validity, uniqueness, integrity, accessibility, and timeliness. Assessing these dimensions of data is possible only for structured data, which has well-defined formats and organization.

When dealing with unstructured data, the absence of any defined format makes it challenging to evaluate completeness, consistency, or validity. Uniqueness is also hard to confirm, as unstructured data is often duplicated across different silos. For instance, sending a document to a group results in multiple copies saved in various accounts. Determining the most recent and relevant version of a document is crucial, especially when multiple versions exist. Additionally, understanding the context of the document is essential to ensure that GenAI interprets and utilizes it correctly.
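
As a rough illustration of the duplication problem described above, the sketch below groups file copies by a content hash and keeps the most recently modified copy of each. The `FileCopy` type, the paths, and the sample contents are hypothetical and exist only for this example. Hashing catches byte-identical copies only; telling which of several edited versions is the relevant one needs more context than a hash can provide.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FileCopy:
    path: str            # hypothetical location of this copy
    content: bytes       # raw file bytes
    modified: datetime   # last-modified timestamp


def content_fingerprint(content: bytes) -> str:
    """Return a SHA-256 hex digest; identical bytes give identical digests."""
    return hashlib.sha256(content).hexdigest()


def latest_unique_copies(copies: list[FileCopy]) -> list[FileCopy]:
    """Keep one copy per distinct content, preferring the newest copy."""
    newest: dict[str, FileCopy] = {}
    for copy in copies:
        key = content_fingerprint(copy.content)
        if key not in newest or copy.modified > newest[key].modified:
            newest[key] = copy
    return list(newest.values())


# The same document sent to a group ends up in several accounts.
copies = [
    FileCopy("alice/inbox/policy.docx", b"Travel policy v1", datetime(2024, 1, 5)),
    FileCopy("bob/inbox/policy.docx", b"Travel policy v1", datetime(2024, 1, 6)),
    FileCopy("hr/share/policy.docx", b"Travel policy v2", datetime(2024, 3, 1)),
]
unique = latest_unique_copies(copies)
print([c.path for c in unique])
# prints ['bob/inbox/policy.docx', 'hr/share/policy.docx']
```

Note that the edited "v2" file survives as a separate item because its bytes differ. Deciding that it supersedes "v1" would require version or lineage metadata, which this sketch does not model.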

Ultimately, the quality of unstructured data hinges on its contextual accuracy, relevance, and freshness. But how do you assess these attributes in the vast volumes of unstructured data that organizations are constantly flooded with?

Challenges in Assuring the Quality of Unstructured Data

Assuring the quality of unstructured data presents several challenges:

  1. No standards: There is no single way to determine the quality of unstructured data. The various formats of text, images, videos, and audio make it harder to apply a uniform quality standard.
  2. Large volume and noise: The sheer volume of unstructured data streaming in real time can be overwhelming to process. It also typically contains irrelevant, redundant, or noisy information that affects quality.
  3. Contextual accuracy: Ensuring the data accurately reflects its context is challenging, as the interpretation is based on various factors not captured by simple analysis.
  4. Resource-intensive processing: Delivering quality requires sophisticated tools and human oversight to interpret ambiguous data correctly, which can be resource-intensive.
  5. Sensitive information: Unstructured data may contain PI, PII, or other sensitive information, posing privacy risks. However, omitting this data can affect quality and, subsequently, the GenAI responses. Sanitizing data is essential for its safe use.

Addressing these challenges involves deploying advanced tools and establishing robust data governance frameworks to maintain high data quality.

Data Quality: Structured vs. Unstructured Data

| Structured Data | Unstructured Data |
| --- | --- |
| Data organized in tables with rows and columns, ensuring that each data point conforms to a specific type, range, and structure. | Data includes text, images, and videos with no predefined format or organization, making it difficult to apply any standard definition of quality. |
| Quality is defined by accuracy, completeness, and consistency. | Quality depends on the richness and contextual accuracy of the content, along with relevance and freshness. |
| Quality implies the data is fit for use in business processes and analytics. | Quality indicates that the data can be reliably processed and analyzed using advanced techniques like NLP and ML. |

Rethinking Data Quality for GenAI

To deliver high data quality, it is essential to understand how GenAI works with unstructured data. GenAI builds context around data by inferring metadata and connecting data concepts, which is not possible with relational tables. It also interprets data that can take any value within a range, rather than well-defined discrete values, so your data quality approach should focus on curating ongoing GenAI interactions. Finally, GenAI consumes large volumes of data and needs inline processing to deliver fast, accurate, contextual conversations.

It is also important to note that GenAI consumes everything you provide, including sensitive data, and a trained model may retain that information indefinitely. Safeguarding sensitive data as part of the data quality initiative can ensure safe and compliant data use.

In essence, GenAI needs new data quality measures such as freshness, relevance, and uniqueness, along with data curation and data sanitization, to build trusted, robust models.

How Securiti Delivers High Data Quality

Delivering high data quality begins with understanding the data and the GenAI models that will use it. Securiti helps you gain contextual insights into data from all key perspectives with a multidimensional Data Command Graph. It is a Knowledge Graph that captures all essential metadata, and the relationships between metadata elements, for all data types, including documents, images, audio, video, CLOBs, and many more.

With the Securiti Data Command Graph, you can get a complete view of:

  • File categories based on content, for example, legal, finance, or HR
  • Access and user entitlements
  • Sensitive objects within a file
  • Regulations applicable to file content
  • File quality, such as freshness, relevance, or uniqueness
  • Lineage of files and embeddings used in GenAI pipelines

With these insights, you can respond to any question about data, GenAI models, and their relationships, enabling the safe use of data and AI.

Next come data curation, data sanitization, and inline data quality.

Data Curation

Securiti helps you curate and auto-label files and objects for use in GenAI projects. You can:

  • Curate data by analyzing content and automatically adding data labels to files based on content.
  • Use an extensible policy framework to automatically apply sensitivity and use case labels within files and documents. These labels can include personal data category, purpose, retention, and more, to deliver contextual accuracy and relevance to ensure you use only appropriate data for your GenAI projects.
  • Preserve labels and tags when moving files from source systems for feeding to GenAI models.
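
To make the idea of content-based labeling concrete, here is a minimal sketch that assigns category labels from keyword rules and carries them along when a file is copied. It is not Securiti's implementation: `LABEL_RULES`, `LabeledFile`, `auto_label`, and `copy_with_labels` are hypothetical names invented for this example, and a real policy framework would use far richer classification than keyword matching.

```python
from dataclasses import dataclass, field

# Hypothetical keyword rules: label -> terms that suggest the label applies.
LABEL_RULES: dict[str, set[str]] = {
    "legal": {"contract", "liability", "clause"},
    "finance": {"invoice", "budget", "revenue"},
    "hr": {"payroll", "onboarding", "benefits"},
}


@dataclass
class LabeledFile:
    name: str
    text: str
    labels: set[str] = field(default_factory=set)


def auto_label(doc: LabeledFile, rules: dict[str, set[str]] = LABEL_RULES) -> LabeledFile:
    """Add every label whose rule terms appear in the file's text."""
    words = {w.strip(".,;:!?()").lower() for w in doc.text.split()}
    for label, terms in rules.items():
        if words & terms:
            doc.labels.add(label)
    return doc


def copy_with_labels(doc: LabeledFile, new_name: str) -> LabeledFile:
    """Simulate moving a file toward a GenAI pipeline while keeping its labels."""
    return LabeledFile(new_name, doc.text, set(doc.labels))


source = auto_label(
    LabeledFile("q3-summary.txt", "The contract includes a liability clause and the Q3 budget.")
)
staged = copy_with_labels(source, "genai-staging/q3-summary.txt")
print(sorted(staged.labels))
# prints ['finance', 'legal']
```

A downstream step could then admit only files whose labels match the GenAI use case, which is the point of preserving labels across the move.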

Data Sanitization

If GenAI models learn from sensitive information, that information can remain with them indefinitely, compromising data privacy and security. Securiti enables you to:

  • Discover and classify data in flight for PII and sensitive information for sanitization.
  • Automatically mask, anonymize, redact, or tokenize data in-flight within a GenAI pipeline.
  • Ensure compliance with internal controls and the ever-evolving global data and AI regulations before transferring data for use with LLMs for training or inference.
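
The masking step can be sketched with two regular expressions that replace email addresses and US Social Security number patterns before text is handed to a model. This is a simplified illustration, not Securiti's method: the patterns, the placeholder tokens, and the `redact` function are assumptions made for this example, and production-grade discovery of sensitive data needs much broader and more careful classification.

```python
import re

# Simplified, assumed patterns; real PII detection covers many more cases.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Replace matched emails and SSN-like numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    text = SSN.sub("[SSN]", text)
    return text


sample = "Contact jane.doe@example.com, SSN 123-45-6789."
print(redact(sample))
# prints Contact [EMAIL], SSN [SSN].
```

Redaction keeps the sentence structure intact, so the surrounding context stays usable for GenAI while the sensitive values are removed. Anonymization and tokenization would replace the placeholders with generalized or reversible values instead.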

Data Quality

Securiti helps ensure GenAI model efficacy by maintaining file quality and eliminating duplicates and stale data.

  • Infer and analyze metadata on files, such as their recency and topic, to measure data quality
  • Evaluate files inline to ensure:
    • Freshness
    • Uniqueness
    • Relevance to the topic
    • Reliability of sources
  • Develop new data quality measures, such as robustness and non-hallucination of model responses in a non-deterministic world.
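
An inline check of this kind can be sketched as a small gate that admits a document only if it is recent enough and sufficiently on topic. The `Doc` type, the one-year age limit, the 0.5 relevance threshold, and the keyword-overlap relevance score are all assumptions chosen for illustration. Uniqueness and source reliability are left out for brevity.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Doc:
    text: str
    modified: datetime


def is_fresh(doc: Doc, now: datetime, max_age: timedelta) -> bool:
    """A document is fresh if it was modified within max_age of now."""
    return now - doc.modified <= max_age


def relevance(doc: Doc, topic_terms: set[str]) -> float:
    """Fraction of topic terms found in the document, from 0.0 to 1.0."""
    if not topic_terms:
        return 0.0
    words = {w.strip(".,;:!?()").lower() for w in doc.text.split()}
    return len(words & topic_terms) / len(topic_terms)


def passes_gate(
    doc: Doc,
    now: datetime,
    topic_terms: set[str],
    max_age: timedelta = timedelta(days=365),  # assumed freshness limit
    min_relevance: float = 0.5,                # assumed relevance threshold
) -> bool:
    """Admit a document only if it is both fresh and relevant."""
    return is_fresh(doc, now, max_age) and relevance(doc, topic_terms) >= min_relevance


now = datetime(2024, 6, 1)
topic = {"refund", "policy"}
docs = [
    Doc("Updated refund policy for 2024.", datetime(2024, 5, 1)),
    Doc("Refund policy from the old handbook.", datetime(2021, 1, 1)),
    Doc("Office party photos.", datetime(2024, 5, 20)),
]
results = [passes_gate(d, now, topic) for d in docs]
print(results)
# prints [True, False, False]
```

The second document fails on freshness and the third on relevance, so only the first would reach the model. Keyword overlap is a crude stand-in for relevance; an embedding-based similarity score would be a more realistic choice.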

Properly managed high-quality data is increasingly seen as an asset of potentially limitless value, with AI as the key to unlocking that potential. Securiti helps you realize this potential.

5 Best Practices to Ensure Data Quality for GenAI

Here are five best practices to ensure you deliver high-quality data essential for GenAI's success.

  1. Include unstructured data in your quality strategy: In a recent survey of CDOs and data leaders, 93% of respondents agreed that data strategy is critical for getting value from GenAI. Extend your data quality management strategy to include unstructured data for comprehensive quality across all data types. This inclusion helps capture valuable insights from diverse unstructured data sources like text, images, and social media.
  2. Define your data quality objectives for GenAI projects: Evaluate your quality requirements to gain clarity on your specific goals. They can include relevance of data, accuracy, freshness, or other attributes. Prioritize them to decide on controls.
  3. Choose the right tools to deliver inline data quality: For GenAI, dynamic controls across diverse data sources and flows are essential to deliver accurate, non-hallucinating model responses.
  4. Harness the power of the Knowledge Graph for quality: The Knowledge Graph reveals interconnected relationships essential for building context and intelligence on data. This visibility drives the quality and security of data within GenAI pipelines.
  5. Invest in a Data Command Center for streamlined collaboration: A comprehensive Data Command Center addresses privacy, security, governance, and compliance, complementing your quality initiatives. It can streamline operations across organizational data silos to deliver a single source of truth for data and AI intelligence.

In Summary

In the GenAI era, large volumes of unstructured data can impact the GenAI output's accuracy, which is essential for driving business growth and compliance. However, defining and delivering the quality of this data is fraught with several challenges, especially the lack of standards and the risk of exposing sensitive data.

Securiti empowers you to safely harness your structured and unstructured data with GenAI models. Overcome the data quality challenges with Securiti and follow best practices to ensure trusted GenAI responses. Learn how to assure the quality of unstructured data and use it effectively for powering your GenAI use cases.

In our upcoming blog, we will explore how tracing the lineage of unstructured data is critical to the success of GenAI initiatives.
