How Tracing Unstructured Data Lineage Can Make or Break GenAI Success

Author

Ankur Gupta

Director for Data Governance and AI Products at Securiti

Businesses have relied on structured data for decades to make decisions. Over the same period, they have generated a substantial amount of unstructured data in the form of reports, emails, texts, voicemails, photos, illustrations, and videos, and, more recently, a large volume of social media posts and messages. Despite the immense value this data holds, its lack of a defined structure has prevented businesses from using and managing it effectively. An IDC study notes that even though the volume and variety of unstructured data is vastly greater than that of structured data, less than half of all unstructured data is shared, analyzed, or reused.

GenAI (Generative AI) has changed this situation, in which unstructured data lay unused in the numerous silos of the enterprise. For the first time, the real worth of unstructured data is in the limelight. Unstructured data is now at the epicenter of the GenAI revolution, powering wide-ranging use cases from aviation to retail and medicine to research.

No wonder 80% of CDOs agree that GenAI will eventually transform their organization’s business environment. While this shift toward using this vast treasure of information is positive, the safe use of unstructured data is essential for GenAI's success. One of the central pillars of safe and compliant use is data lineage: knowing where the data originated and how it has been transformed across its GenAI lifecycle.

Why Unstructured Data Lineage is Critical for GenAI

Google Cloud Chief Evangelist Richard Seroter says, “If you don’t have your data house in order, AI is going to be less valuable.” Getting your data house in order for GenAI requires a multipronged approach, which we covered in our earlier blogs on data intelligence and data quality. In this blog, let us focus on data lineage.

Harnessing unstructured data safely is crucial, as it often contains sensitive and proprietary information. Data lineage plays a critical role here by tracing data sources, destinations, transformations along the way, and usage. It can help you optimize operations, enhance security, and support better decision-making through improved data governance and traceability.

For GenAI, tracing the data mapping and flow from data systems to vector databases, LLMs (Large Language Models), and final endpoints is essential. It provides an end-to-end lineage view, helping monitor GenAI model inputs, identify contributing sources, verify response integrity, and protect sensitive data.
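To make the end-to-end view concrete, here is a minimal sketch of a lineage graph that records each hop from a source file to a vector database chunk, an LLM response, and a final endpoint, and then walks backward to find every contributing source. It is an illustration under stated assumptions, not a description of any product's internals: the `LineageGraph` class and all of the URIs are hypothetical.

```python
from collections import defaultdict


class LineageGraph:
    """Hypothetical in-memory lineage graph: edges point from source to derived item."""

    def __init__(self):
        # Maps each derived item to the set of items it was produced from.
        self._parents = defaultdict(set)

    def add_edge(self, source, target):
        """Record that `target` was derived from `source`."""
        self._parents[target].add(source)

    def upstream(self, node):
        """Return every item that contributed, directly or indirectly, to `node`."""
        seen = set()
        stack = [node]
        while stack:
            current = stack.pop()
            for parent in self._parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# Illustrative identifiers only (assumptions, not real systems).
graph = LineageGraph()
graph.add_edge("sharepoint://hr/policy.docx", "vectordb://chunk/17")
graph.add_edge("vectordb://chunk/17", "llm://response/abc")
graph.add_edge("llm://response/abc", "endpoint://helpdesk-chat")

contributors = graph.upstream("endpoint://helpdesk-chat")
```

Here `contributors` holds the original file, the chunk, and the response, which is the information needed to identify contributing sources for a given output.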

Effective management of data lineage can enhance transparency, compliance, and trust in GenAI outputs, making it a critical factor that can make or break GenAI success. However, the volume and variety of unstructured data, coupled with the loss of associated metadata (such as sensitivity, access, and regulatory intelligence), make tracing its lineage extremely challenging.

Challenges in Tracing the Lineage of Unstructured Data

Tracing the lineage of unstructured data poses several challenges because of its diverse formats and the complexity of handling it. Here are some key challenges:

  1. Massive volumes and data silos: Enterprises generate large quantities of data every day, and 90% of it is unstructured. Moreover, this data streams in real time and goes through several processes. Keeping track of it across organizational silos is not easy.
  2. Complex data systems: Modern data systems can be complex. When data moves across several systems, tracking the transformations at every stage can be difficult.
  3. Tool limitations: Traditional tools rely on structured ETL processes for tracing lineage. Unstructured data requires a new approach that infers lineage from changes in content and metadata.
  4. Metadata limitations: Unstructured data and embeddings may lack clear and complete metadata, such as file date, ownership, and sensitive information. Context needs to be built around unstructured data to fully understand it.
  5. Sensitive information: Unstructured data can contain sensitive or personally identifiable information (PII). Proper handling and strict access control are crucial when tracing its lineage.

The challenges in tracing the lineage of unstructured data arise from its large volumes, complex transformations, limited tool support, and privacy concerns. Addressing these challenges requires advanced technologies and robust methodologies.
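One way to picture inferring lineage from content changes, as mentioned in the tool limitations item above, is to fingerprint overlapping word windows in two documents and measure how many fingerprints they share. The sketch below is a simple illustration using word shingles and Jaccard overlap; it is an assumed technique chosen for clarity, not a description of how any specific lineage tool works, and the function names and the 0.5 threshold are hypothetical.

```python
import hashlib


def fingerprints(text, size=5):
    """Hash every run of `size` consecutive words (a 'shingle') in the text."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < size:
        shingles = [" ".join(words)]
    else:
        shingles = [
            " ".join(words[i:i + size]) for i in range(len(words) - size + 1)
        ]
    return {hashlib.sha256(s.encode("utf-8")).hexdigest() for s in shingles}


def overlap(text_a, text_b):
    """Jaccard overlap of the two fingerprint sets, from 0.0 to 1.0."""
    fa, fb = fingerprints(text_a), fingerprints(text_b)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)


def likely_derived(candidate, original, threshold=0.5):
    """Guess that `candidate` was derived from `original` if the overlap is high."""
    return overlap(candidate, original) >= threshold


original = "The quarterly revenue report shows strong growth in the retail segment"
edited = original + " this year"
unrelated = "Employee onboarding checklist for new hires in the marketing team"
```

With these inputs, `likely_derived(edited, original)` is true and `likely_derived(unrelated, original)` is false. A real system would also need to weigh metadata such as timestamps and owners, since content overlap alone cannot tell which copy came first.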

Data Lineage: Structured vs. Unstructured Data

Structured Data

  • Tracking data lineage in structured databases is facilitated by clear schemas and transactional logs.
  • Lineage tools can trace data transformations and movements through structured ETL processes.

Unstructured Data

  • Data lineage is harder to establish due to the lack of clear, traceable pathways in unstructured formats.
  • New tools must infer lineage by analyzing content changes and metadata across various systems and formats.

How Securiti Delivers Unstructured Data Lineage for GenAI Success

For unstructured data, Securiti infers data lineage by tracking changes in content and metadata. The changes are documented and analyzed to build data and flow mapping. Metadata and context are preserved before the data is chunked and loaded into the vector database. This approach provides a complete view of the data map. For example, a clear visual map can show where specific unstructured data originated, how it was processed, how it was used during RAG (Retrieval Augmented Generation), model fine-tuning, or model training, and how it was finally consumed by the end user or system.
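The idea of preserving metadata before chunking can be sketched in a few lines. The example below copies a document's metadata onto every chunk and adds a pointer back to the source, so that a chunk stored in a vector database can still be traced to its origin. This is a generic illustration, not Securiti's implementation: `SourceDocument`, `Chunk`, `chunk_with_metadata`, the fixed-size character split, and the metadata keys are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class SourceDocument:
    uri: str
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Chunk:
    chunk_id: str
    text: str
    metadata: dict


def chunk_with_metadata(doc, max_chars=200):
    """Split a document into fixed-size chunks that each carry the source metadata."""
    chunks = []
    for index, start in enumerate(range(0, len(doc.text), max_chars)):
        piece = doc.text[start:start + max_chars]
        # Copy so that chunks never share (or mutate) the document's metadata dict.
        meta = dict(doc.metadata)
        meta.update(
            {"source_uri": doc.uri, "chunk_index": index, "char_start": start}
        )
        chunks.append(Chunk(f"{doc.uri}#{index}", piece, meta))
    return chunks


# Hypothetical document and labels, for illustration only.
doc = SourceDocument(
    uri="s3://hr-bucket/salary-review.txt",
    text="a" * 450,
    metadata={"sensitivity": "confidential", "owner": "hr"},
)
chunks = chunk_with_metadata(doc)
```

Each resulting chunk keeps the `sensitivity` and `owner` labels alongside its `source_uri`, which is what allows downstream controls to act on a chunk without going back to the original file.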

Securiti enables you to gain contextual insights for data with a multidimensional Data Command Graph that captures key metadata, and the relationships between metadata items, for all types of data. It provides a complete view of:

  • File categories
  • Sensitive objects within a file
  • File access and entitlements
  • Internal policies and controls
  • Applicable regulations for a file
  • Lineage of files and embeddings used in GenAI pipelines

A key use case for lineage in GenAI involves ensuring that sensitive data is accessible only to authorized users. For instance, within an organization, the HR team may access employee personal data like salaries, whereas the marketing team cannot. If a marketing team member creates a prompt potentially accessing employee data, how can this be prevented? Securiti Data Command Graph monitors the data sources used by GenAI models for specific prompts and checks if the user has the right to access those sources. This capability helps identify and manage vulnerabilities that could expose sensitive data, using a clear visual map to establish appropriate controls.
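The HR and marketing scenario can be sketched as a filter that runs between retrieval and generation: each retrieved chunk carries the groups entitled to its source, and only chunks the prompting user may access are passed on to the model. This is a simplified illustration with hypothetical names (`authorized_chunks`, `allowed_groups`) and made-up data; it is not the Data Command Graph API.

```python
def authorized_chunks(user_groups, retrieved_chunks):
    """Split retrieved chunks into those the user may see and those to block.

    Each chunk is a dict with an `allowed_groups` list inherited from its source.
    """
    user_groups = set(user_groups)
    allowed, blocked = [], []
    for chunk in retrieved_chunks:
        if user_groups & set(chunk["allowed_groups"]):
            allowed.append(chunk)
        else:
            blocked.append(chunk)
    return allowed, blocked


# Illustrative retrieval results for a prompt from a marketing team member.
retrieved = [
    {"id": "salaries.xlsx#0", "allowed_groups": ["hr"]},
    {"id": "brand-guide.pdf#3", "allowed_groups": ["marketing", "hr"]},
]
allowed, blocked = authorized_chunks(["marketing"], retrieved)
```

In this example the salary chunk ends up in `blocked` and never reaches the model, while the brand guide chunk is allowed. The blocked list is also useful as an audit record of attempted access.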

5 Best Practices to Deliver Unstructured Data Lineage for GenAI

Here are five best practices to ensure your data lineage collection is accurate and efficient.

  1. Set your data lineage objectives to match your use cases: Data lineage collection is a resource-intensive process. To optimize resource use, collect only the lineage information that is essential. Evaluate what lineage information your GenAI use case needs, and set your objectives accordingly.
  2. Choose the right data lineage tool: One of the challenges of unstructured data lineage is capturing the metadata, as it is often not fully defined. Selecting a tool that leverages AI and ML can significantly improve the ability to capture complete metadata as well as data transformations in real time.
  3. Invest in a Data Command Center: The Data Command Center can break down silos to provide a unified view of your data landscape and capture lineage for both unstructured and structured data. It also addresses privacy, security, governance, and compliance across a broad range of use cases in your organization.
  4. Integrate with data quality and security initiatives: Use data lineage to support your data quality and data security efforts. Knowing where your data comes from, how it changes, and where it goes helps ensure its accuracy and reliability. This is especially crucial for sensitive information, which needs to be trusted and protected throughout its lifecycle.
  5. Promote a data governance culture: Foster a culture of data governance in your organization through training, awareness, and collaboration. This will ensure the value of data lineage is fully appreciated.

In Summary

Unstructured data, like emails, reports, and social media posts, is valuable but often underutilized due to its complexity. GenAI brings this data to the forefront, unlocking its potential for business growth and innovation. The success of GenAI depends on the safe and compliant use of unstructured data, which makes data lineage crucial: it traces how data moves and changes across the GenAI lifecycle so that its integrity can be verified.

Securiti helps you overcome lineage challenges such as data volumes and tool limitations, ensuring trust, transparency, and compliance in your GenAI projects. Learn to unlock the value of unstructured data safely and effectively. Download the white paper Harnessing Unstructured Data for GenAI: A Primer for CDOs.

In our next blog, we will explore the need to honor data permissioning and entitlements to prevent data leakage and ensure the safe utilization of unstructured data with GenAI.
