5 Ways to Accelerate Unstructured Data Cleansing for AI with Securiti and Databricks

Author

Jocelyn Houle

Senior Director of Product Management at Securiti

The Unstructured Data Challenge

LLMs have created an opportunity for organizations to extract tremendous value from their unstructured data. However, CDAOs are all too aware of the challenges involved in incorporating unstructured data into large-scale data transformations. In an ideal world, it would be just as easy to use unstructured data as it is to use structured data. Organizations need to know that data can be trusted and that it has been thoroughly sanitized at an element level, with granular access entitlements that protect all the data in the data estate. Today, organizations struggle to give their ever-expanding reservoir of unstructured data the same degree of governance they typically afford their business-critical structured data. Meanwhile, eagerly awaited AI initiatives stall.

Organizations that leverage Databricks for analytics and AI face specific technical challenges when working with unstructured data, which comprises approximately 90% of enterprise information. While Databricks excels at handling structured data and has made progress on unstructured sources, teams in complex, hybrid cloud data environments may encounter several critical pain points when attempting to incorporate unstructured sources into their data pipelines:

1. Complex and Manual Preprocessing Requirements

Ingesting unstructured data (including zipped folders, mixed file types, and inconsistent CSV formats) requires preprocessing before it can be loaded into Databricks. Teams typically need to build custom Python scripts or use external tools to parse, clean, and convert data into Delta Lake format, which creates scalability challenges and maintenance overhead.
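To make the maintenance burden concrete, here is a minimal sketch of the kind of custom preprocessing script described above. It is an illustration only: the function names, the candidate delimiter list, and the header-normalization rule are assumptions, and a real pipeline would still need to write the result to Delta Lake.

```python
import csv
import io
import zipfile


def guess_delimiter(header_line: str, candidates: str = ",;\t|") -> str:
    """Pick the candidate delimiter that occurs most often in the header line."""
    return max(candidates, key=header_line.count)


def normalize_csvs_from_zip(zip_bytes: bytes) -> dict[str, list[dict[str, str]]]:
    """Read every .csv member of a zip archive and return its rows keyed by file name.

    Header names are trimmed and lower-cased so inconsistent exports line up.
    """
    tables: dict[str, list[dict[str, str]]] = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # mixed file types: skip anything that is not CSV
            text = archive.read(name).decode("utf-8-sig")
            if not text.strip():
                continue  # empty file, nothing to parse
            delimiter = guess_delimiter(text.splitlines()[0])
            reader = csv.reader(io.StringIO(text), delimiter=delimiter)
            header = [col.strip().lower() for col in next(reader)]
            tables[name] = [
                dict(zip(header, (cell.strip() for cell in row)))
                for row in reader
                if row  # drop blank lines
            ]
    return tables
```

Even this small script bakes in decisions about encodings, delimiters, and header conventions, and each new source format tends to add another special case to maintain.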

2. Granular Permission Management is Cumbersome

To build permission-aware AI applications that safeguard confidential and proprietary data, firms must ensure that only authorized users can access sensitive unstructured data. Today, that often requires meticulous configuration. Unity Catalog provides centralized access control, but setting up granular permissions, especially for external locations in cloud storage, is a manual and error-prone process. The reasons are both technical and organizational. Locking down unstructured data requires comprehensive, fine-grained permissions across the organization. Because data sources change constantly, even the best-run companies tend to be over-provisioned, with far too many people having access. AI use cases complicate matters further: the AI workflow includes a vectorization step that turns content into an indexable representation that LLMs can read, and that step can strip away the access controls that protected the original files.
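One common way to keep entitlements intact through vectorization is to copy each file's access list onto every chunk at indexing time and filter on it at retrieval time. The sketch below illustrates that idea under simplifying assumptions: the `Chunk` type, the group names, and the keyword match standing in for vector similarity search are all hypothetical, and this is not a description of how Unity Catalog or any Securiti product implements it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str
    allowed_groups: frozenset[str]  # copied from the source file's ACL at indexing time


def chunk_document(source: str, text: str, allowed_groups: set[str], size: int = 40) -> list[Chunk]:
    """Split a document into fixed-size chunks, each tagged with the file's ACL."""
    groups = frozenset(allowed_groups)
    return [Chunk(text[i:i + size], source, groups) for i in range(0, len(text), size)]


def retrieve(index: list[Chunk], query: str, user_groups: set[str]) -> list[Chunk]:
    """Return matching chunks that the user is entitled to see.

    A case-insensitive keyword match stands in for vector similarity search.
    """
    return [
        chunk for chunk in index
        if query.lower() in chunk.text.lower() and chunk.allowed_groups & user_groups
    ]
```

If the `allowed_groups` tag were dropped during chunking, every user who can query the index would see every chunk, which is the failure mode the paragraph above describes.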

3. Security and Compliance Risks in Data Sharing and Rapid Deployment

Databricks' collaborative environment, like all modern cloud data platforms, accelerates the speed at which data can be shared, which in turn increases the risk of accidental or intentional data exposure. Unstructured data often contains sensitive information, and unless it is thoroughly scanned, it is impossible to ensure that sensitive data is fully accounted for. Rapid data ingestion and sharing often result in partial scans and misconfigured access controls, making it difficult to maintain compliance with regulations such as GDPR, HIPAA, or PCI-DSS.

4. Feature Extraction and Structuring Overhead

It is not enough to find sensitive data in complex multi-user scenarios. Tools must be in place to minimize, redact, and sanitize sensitive data before it is loaded or considered a gold copy. Before unstructured data can be used for analytics or AI, it must undergo complex feature extraction and transformation. Today, this requires additional pipelines and specialized tooling that engineering teams must build and maintain.

5. Query Performance and Storage Management Challenges

Querying unstructured data can be slow and resource-intensive. Transformations such as flattening nested data degrade performance at scale, while unstructured data quickly balloons storage costs and complicates governance. Without tools to curate and trust precisely the unstructured data they need, no more and no less, organizations may face unexpectedly large bills.

How Securiti Expands Solutions to Unstructured Data Challenges

Securiti has partnered with Databricks to deliver end-to-end, trusted unstructured data management with full context. Securiti’s Gencore AI solution now integrates directly with Delta tables and Unity Catalog. This partnership enables organizations to more easily and quickly build safe, enterprise-grade generative AI (GenAI) systems and AI agents, using high-value, proprietary enterprise data.

Securiti AI enhances Databricks in five powerful ways:

1. Simplified Unstructured Data Ingestion

Gencore AI safely ingests unstructured and structured data from SaaS apps and on-prem systems into Databricks Delta tables. It eliminates the need for custom preprocessing scripts by providing hundreds of native connectors to quickly and securely ingest data at scale from anywhere, including public, private, SaaS, and data clouds.

Data engineers benefit: Instead of building and maintaining custom scripts, teams can leverage Securiti's extensive connector library to streamline the ingestion process, reducing data preparation time by up to 60% as reported by shared Securiti and Databricks customers.

2. Automated Data Sanitization and Protection

Gencore AI helps sanitize (redact, mask, or anonymize) sensitive information before bringing it into Databricks. The solution automatically classifies and redacts sensitive data on-the-fly, ensuring privacy and compliance before data is exposed to AI models or transformed into vectors that can be later retrieved.
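As a rough illustration of what redaction before ingestion means, the sketch below replaces two kinds of sensitive values with placeholder labels. It assumes simple regular expressions are good enough for the example; the patterns and the `redact` function are hypothetical, and production classifiers, including whatever Gencore AI uses internally, are far more sophisticated than this.

```python
import re

# Illustrative patterns only; real classifiers cover many more data types.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str) -> tuple[str, dict[str, int]]:
    """Replace each match with its label and report how many were found."""
    counts: dict[str, int] = {}
    for label, pattern in PATTERNS.items():
        text, found = pattern.subn(f"[{label}]", text)
        counts[label] = found
    return text, counts


clean, counts = redact("Contact ada@example.com, SSN 123-45-6789.")
print(clean)   # Contact [EMAIL], SSN [SSN].
print(counts)  # {'EMAIL': 1, 'SSN': 1}
```

Running the redaction before text is chunked and embedded matters because, once a sensitive value is inside a vector index, it can be retrieved by any prompt that happens to match it.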

Security teams benefit: Before data enters AI pipelines and LLMs, comprehensive checks ensure alignment with AI governance, privacy, security, compliance, and sovereignty requirements, dramatically reducing security and compliance risks.

3. Advanced Data Security & Governance

Built-in data protection, alignment with OWASP Top 10 for LLMs, and a graph-based full provenance view of AI and data enable safe AI systems at scale. Gencore AI implements advanced LLM firewalls to understand the context of all AI interactions, including prompts, responses, and data retrievals, to offer end-to-end protection of enterprise data far beyond easily circumvented model guardrails.

Compliance teams benefit: Custom and pre-configured policies block malicious attacks, prevent sensitive data leaks, and ensure enterprise AI systems align with corporate policies. These context-aware firewalls also preserve access entitlements to documents and files throughout the AI pipeline.

4. Enhanced Unity Catalog Intelligence

Unity Catalog gains enriched context through Securiti's Data Command Graph, thus increasing data utilization. Securiti's Data Command Graph contains rich context about relationships between files, tables, columns, AI objects, users, permissions, and regulations that can be seamlessly registered within Unity Catalog.

Data administrators benefit: The comprehensive context increases Unity Catalog's utility and enables safer data usage across the platform.

5. The Securiti Data Command Graph: A Game-Changer for Databricks

At the heart of Securiti's solution is the Data Command Graph—a knowledge graph that provides contextual intelligence about enterprise data. This graph enables:

  • Precise selection of relevant files and datasets based on labels, entitlements, regulations, and quality
  • Comprehensive visibility into data lineage and relationships
  • Preservation of user entitlements at the prompt level, enhancing security and compliance

"Contextual intelligence for both unstructured and structured data is at the heart of GenAI use cases," said Jocelyn Houle, Sr. Director of Product Management. "The Data Command Graph automatically builds knowledge about your data that provides insights to the GenAI pipeline at every step for its safe use."

The graph provides in-depth contextual insights into data objects, such as files, folders, buckets, tables, or columns, including related context, such as sensitive information, entitlements, location, applicable policies and processes, and regulations.
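To show how this kind of context can drive file selection, here is a toy sketch that filters data objects by entitlements, labels, and regulations. The `DataObject` fields and the `select_for_pipeline` function are assumptions made up for illustration; they do not reflect the Data Command Graph's actual schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    path: str
    labels: set[str] = field(default_factory=set)
    entitled_groups: set[str] = field(default_factory=set)
    regulations: set[str] = field(default_factory=set)


def select_for_pipeline(
    objects: list[DataObject],
    user_groups: set[str],
    excluded_labels: set[str],
    excluded_regulations: set[str],
) -> list[str]:
    """Return paths the user may access that carry no excluded label or regulation."""
    return [
        obj.path for obj in objects
        if obj.entitled_groups & user_groups
        and not (obj.labels & excluded_labels)
        and not (obj.regulations & excluded_regulations)
    ]


# Hypothetical catalog entries for illustration.
catalog = [
    DataObject("handbook.pdf", {"public"}, {"staff", "hr"}, set()),
    DataObject("payroll.xlsx", {"pii"}, {"hr"}, {"GDPR"}),
    DataObject("roadmap.docx", {"confidential"}, {"product"}, set()),
]
print(select_for_pipeline(catalog, {"hr"}, {"pii"}, set()))  # ['handbook.pdf']
```

The point of the sketch is that selection becomes a query over metadata rather than a hand-maintained list of approved files.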

Conclusion: Unlocking AI Adoption with Securiti and Databricks

The partnership between Securiti and Databricks represents a significant advancement in enterprise AI and permission-aware solution-building capabilities. By addressing the critical challenges of unstructured data management, organizations can now unlock the full potential of their data assets while maintaining rigorous security, governance, and compliance standards.

As organizations continue to invest in AI initiatives, solutions like Gencore AI will become essential for scaling enterprise AI responsibly and efficiently. The integration enables teams to focus on innovation rather than wrestling with the complexities of unstructured data management, ultimately accelerating the path to AI-driven business transformation.

To learn more about how Securiti and Databricks can help your organization, visit Securiti's Gencore AI website.
