10 Best Practices to Effectively Manage Unstructured Data

Author

Anas Baig

Product Marketing Manager at Securiti

Published July 14, 2024

Unstructured data, and its potential to drive data and AI initiatives, is a recurring topic among chief data officers (CDOs). Generative AI is well suited to the richness of unstructured data, which carries context and nuance that structured data often lacks. Those nuanced insights help large language models (LLMs) better understand human communication and behavior, improve machine learning, and simulate complex real-world scenarios. They also help LLMs develop the natural language understanding needed to generate human-like outputs.

However, are CDOs data-ready to make the most of unstructured data to fuel their AI and data-related transformational initiatives? The answer may be surprising. A 2023 survey of 334 CDOs and data leaders reveals that organizations, although enthusiastic about the transformative impact of GenAI, have yet to develop new data strategies for leveraging the technology effectively.

Read on to learn more about the challenges CDOs face in managing unstructured data and the best practices for governing it.

What is Unstructured Data?

Before moving to the challenges and best practices, let’s take a quick look at what unstructured data is.

Unlike structured data, which follows a defined format, unstructured data lacks a pre-defined data model. As the name implies, it exists in free-form formats, ranging from media files and text documents to markup text and database files.

As such data lacks a pre-defined format, it is commonly managed in non-relational (NoSQL) databases or data lakes, where it is stored in its native or raw format.

Because unstructured data comes in so many commonly used formats, it is no wonder that, as estimated by IDC, it makes up 90% of an organization’s data. Yet only a small fraction of this data is used and analyzed.

Top Challenges of Managing Unstructured Data

Traditional discovery and cataloging tools were built primarily for managing structured data. Hence, they fail to provide detailed insights into unstructured data, hindering organizations from leveraging it for analytics, machine learning, or other strategic purposes.

Following are the top challenges that organizations face with managing unstructured data.

Volume and Variety

Unstructured data exists everywhere across an organization’s data landscape, including shadow data assets. It also comes in many formats, such as video and audio files, markup text, source code, text and image files, and emails. The sheer volume and variety make it difficult for organizations to discover and classify the data with conventional discovery and automated classification tools.

Data Quality Issues

To make the most of unstructured data, it must be carefully curated for accuracy and quality, which is easier said than done. To put things into perspective, the same survey reveals that 46% of CDOs and data leaders believe data quality is the biggest challenge hindering their GenAI initiatives. Data quality suffers when unstructured data accumulates over time alongside outdated, duplicated, and trivial data. Reducing that redundant or outdated data is a challenge in its own right, as it requires complex tools to identify such data across hundreds of data lakes and other repositories.

Lack of Data Lineage

The dynamic nature of unstructured data allows it to be moved swiftly across different repositories and cloud environments. As it moves through systems, applications, and departments, it undergoes various transformations. Without clear insights into data sources, it is difficult to track the data’s lineage or verify its integrity and authenticity. Unclear lineage and limited transparency expose organizations to compliance, governance, and security risks.

Compliance & Security Problems

Unstructured data is a privacy and security minefield if it is not managed appropriately. It often contains high volumes of personally identifiable information (PII), including sensitive information, and GenAI applications use this data to train an LLM or fine-tune its performance. Without proper controls and policies to accurately identify sensitive information and redact or encrypt it, organizations face compliance and security threats. Similarly, there are now various data and AI laws that may have overlapping requirements regarding the collection, use, and selling of personal information and the development of AI systems. Without clear visibility of sensitive data and AI models across the environment, organizations fail to implement appropriate security, governance, and compliance controls.

Access Governance Challenges

Governing access to unstructured data is a significant challenge for mid-size to large organizations, which hold it in petabyte volumes. Missing or inefficient access controls create a risk of sensitive data exposure. Unfortunately, many organizations lack a unified approach to governing access, in part because traditional tools cannot manage access to unstructured data scattered across silos.

10 Best Practices to Manage Unstructured Data

A piecemeal approach to managing unstructured data can result in more silos, lack of data context across teams, and increased challenges and costs. Organizations must strive for a unified framework to govern unstructured data that includes key capabilities like unstructured data discovery and classification, access entitlements, lifecycle management, data sanitization and validation, and robust security controls.

To begin with, CDOs must implement the following best practices to manage data effectively.

1. Discover Unstructured Data

Effective governance of unstructured data begins with having complete visibility of all your data across all your repositories and environments. Hence, discover unstructured data in all your repositories, including data lakes, enterprise applications, cloud storage, emails, and content management systems. Gain insights into the metadata of your unstructured data assets, such as encryption status, location of the data, owner, size of the data, etc. These insights help security, governance, and compliance teams to drive and implement better data strategies.
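As a rough illustration of metadata discovery, the sketch below walks a local directory tree and records basic metadata for each file. It is a minimal example under stated assumptions: the function name `discover_files` and the record fields are hypothetical, it covers only a local file system, and it does not capture owner or encryption status, which depend on the platform and storage service.

```python
import os
from datetime import datetime, timezone
from pathlib import Path


def discover_files(root):
    """Walk `root` and return one metadata record per file found.

    Hypothetical helper: a real discovery tool would also scan cloud
    storage, email, and content management systems.
    """
    inventory = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            stat = path.stat()
            modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
            inventory.append({
                "path": str(path),                  # location of the data
                "extension": path.suffix.lower(),   # rough hint at the format
                "size_bytes": stat.st_size,         # size of the data
                "modified_utc": modified.isoformat(),
            })
    return inventory


if __name__ == "__main__":
    for record in discover_files("."):
        print(record)
```

Each record can then feed later steps such as cataloging and classification.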

2. Catalog Unstructured Data

Organizations must build a comprehensive catalog of their data to gain complete visibility. Data cataloging also gives teams a single source of truth, so every team and department across the business works from the same definition of specific datasets. Cataloging further enables seamless searchability and accessibility of data based on different categories. For instance, legal teams may easily search datasets based on their regulatory labels, or a marketing team may look for the required data based on marketing tags. Therefore, build the inventory by adding tags and metadata to files according to their content and context, or by grouping the files according to departments, formats, or functions.
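Tag-based search of this kind can be sketched with a small in-memory catalog. The `Catalog` class, the file paths, and the tag names below are all hypothetical and chosen only for illustration; a production catalog would persist its index and handle updates and deletions.

```python
from collections import defaultdict


class Catalog:
    """Minimal in-memory catalog: files carry tags, tags index files."""

    def __init__(self):
        self._tags_by_file = {}
        self._files_by_tag = defaultdict(set)

    def add(self, path, tags):
        # Note: re-adding a path does not remove its old tags in this sketch.
        self._tags_by_file[path] = set(tags)
        for tag in tags:
            self._files_by_tag[tag].add(path)

    def search(self, *tags):
        """Return the sorted paths that carry every requested tag."""
        if not tags:
            return sorted(self._tags_by_file)
        matches = set.intersection(
            *(self._files_by_tag.get(tag, set()) for tag in tags)
        )
        return sorted(matches)


catalog = Catalog()
catalog.add("contracts/nda.pdf", ["legal", "gdpr"])
catalog.add("campaigns/q3.pptx", ["marketing"])
catalog.add("hr/policy.docx", ["hr", "gdpr"])

# A legal team searching by regulatory label:
print(catalog.search("gdpr"))
```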

3. Classify Unstructured Data

Classification enables the discovery and identification of personally identifiable information (PII), including sensitive data, in unstructured datasets. Leverage out-of-the-box classifiers and automate the classification of data based on sensitivity and other important attributes. To go beyond the conventional keyword and pattern-matching approach, governance teams may capitalize on AI/ML techniques and algorithms. For instance, Natural Language Processing (NLP) techniques like text classification, entity recognition, topic modeling, and text mining can transform unstructured data into valuable insights for seamless classification and searchability.
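For reference, the conventional pattern-matching baseline mentioned above can be sketched in a few lines. The two patterns here (email address and US Social Security number format) are illustrative assumptions, not a complete or production-grade rule set, and the `classify` function is hypothetical. NLP-based entity recognition would replace or supplement the regular expressions.

```python
import re

# Illustrative patterns only; real classifiers need many more rules
# plus validation to limit false positives.
PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def classify(text):
    """Return which sensitive patterns occur in `text` and their matches."""
    found = {label: pattern.findall(text) for label, pattern in PATTERNS.items()}
    found = {label: hits for label, hits in found.items() if hits}
    return {"sensitive": bool(found), "matches": found}


result = classify("Contact jane@example.com, SSN 123-45-6789.")
print(result["sensitive"], sorted(result["matches"]))
```

Pattern matching of this kind misses context (for example, a nine-digit number that is not an SSN), which is one reason to add AI/ML techniques on top.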

4. Ensure Access Entitlements

Knowing and preserving data entitlements is critical for preventing unauthorized access and sensitive data leakage. Access governance teams must start by identifying users and roles with access to sensitive data, files, and folders in unstructured repositories. Secondly, they must map the relationship of those entitlements between users, roles, and permissions. For GenAI systems, teams must ensure that they preserve the entitlements from source systems while extracting the data and enforce those entitlements within GenAI pipelines or at the prompt level.
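One way to picture entitlement enforcement at the prompt level is to carry the source system's permissions along with each extracted chunk and filter on them before any text reaches the model. The sketch below assumes a simple group-based model; the `Chunk` type, its `allowed_groups` field, and the group names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str
    allowed_groups: frozenset  # entitlements copied from the source system


def authorized_chunks(chunks, user_groups):
    """Keep only chunks the user could already read in the source system."""
    user_groups = set(user_groups)
    return [c for c in chunks if c.allowed_groups & user_groups]


retrieved = [
    Chunk("Salary bands for 2024", "hr/comp.xlsx", frozenset({"hr"})),
    Chunk("Office opening hours", "wiki/office.md", frozenset({"hr", "all-staff"})),
]

# Only chunks visible to the requesting user are passed into the prompt.
context = authorized_chunks(retrieved, user_groups=["all-staff"])
print([c.source for c in context])
```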

5. Track Data Lineage

Monitor the flow and transformation of data across its lifecycle to ensure its integrity, reliability, and transparency. Start by evaluating and documenting the source and usage of data in GenAI and other projects for compliance and risk assessments. Create a visual map that illustrates where the unstructured data originated, how it was processed, such as during LLM training or fine-tuning, and how the end user consumed it. Verify the source and integrity of the data behind each GenAI response to ensure transparency and compliance.
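A lineage record can be as simple as a log of processing steps keyed by content fingerprints. The following sketch is one possible shape, not a standard: the `LineageLog` class and the step names are hypothetical, and SHA-256 hashes stand in for whatever identifiers a real pipeline uses.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Content hash used to identify an artifact and check its integrity."""
    return hashlib.sha256(data).hexdigest()


class LineageLog:
    def __init__(self):
        self.events = []

    def record(self, step, output, inputs=()):
        """Log that `step` produced `output` from the given input artifacts."""
        out_id = fingerprint(output)
        self.events.append({
            "step": step,
            "output": out_id,
            "inputs": [fingerprint(i) for i in inputs],
        })
        return out_id

    def trace(self, output_id):
        """Return the steps behind `output_id`, origin first.

        The ordering is exact for linear chains; branching pipelines
        would need a proper topological sort.
        """
        by_output = {e["output"]: e for e in self.events}
        steps, pending, seen = [], [output_id], set()
        while pending:
            current = pending.pop()
            event = by_output.get(current)
            if event is None or current in seen:
                continue
            seen.add(current)
            steps.append(event["step"])
            pending.extend(event["inputs"])
        return list(reversed(steps))


log = LineageLog()
log.record("ingest", b"raw pdf bytes")
log.record("extract_text", b"plain text", inputs=[b"raw pdf bytes"])
chunk_id = log.record("chunk", b"plain", inputs=[b"plain text"])
print(log.trace(chunk_id))
```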

6. Curate Unstructured Data

Successful GenAI transformational initiatives also depend on data precision and usefulness. For that purpose, it is important to ensure that the data is high-quality in terms of its accuracy and reliability (precision) and relevancy and applicability (utility) to specific data or GenAI initiatives. To achieve that objective, data teams must curate unstructured data and automate labeling based on its content, sensitivity, and use cases.

7. Extract Data for Utilization

Data extraction offers a number of benefits, chief among them better data utilization and analysis. Extracting data from multiple sources allows teams to create a unified view of all their data and make it more accessible for analysis. To be effective, extraction must cover every format in which unstructured data is available, and there are a number of ways to do that. For instance, with high-fidelity parsing, teams can capture a document or file’s visual layout, which improves chunking for vectorization and enhances an LLM’s ability to understand the data. Similarly, Optical Character Recognition (OCR) can be used to extract text from images.
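To make the chunking step concrete, the sketch below splits already-extracted plain text into overlapping word windows. It is a simplified stand-in: it ignores visual layout entirely, which is what high-fidelity parsing would preserve, and the function name and default sizes are assumptions chosen for illustration.

```python
def chunk_words(text, size=200, overlap=20):
    """Split `text` into chunks of up to `size` words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    boundary still appears whole in at least one chunk.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, size=4, overlap=1):
    print(chunk)
```

Each resulting chunk would then be embedded and stored in a vector index.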

8. Run Data Sanitization

Data must go through a careful sanitization process before it is made available to be used in GenAI projects. After all, once an LLM is trained on a specific set of data, it cannot untrain itself. Therefore, when unstructured data is extracted, especially when it contains sensitive data, it should be sanitized using automated masking, anonymization, redaction, and tokenization. It is further critical that the data goes through internal compliance controls to make sure that it doesn’t violate any data or AI regulations before it is used for LLM training.
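Two of the techniques named above, redaction and tokenization, can be sketched for a single kind of sensitive value. The example assumes email addresses are the only target and uses a keyed hash (HMAC-SHA256) to build tokens; the function names, the token format, and the hard-coded demo key are hypothetical, and a real deployment would manage keys in a secrets store.

```python
import hashlib
import hmac
import re

EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


def redact(text, placeholder="[REDACTED_EMAIL]"):
    """Replace every email address with a fixed placeholder."""
    return EMAIL.sub(placeholder, text)


def tokenize(text, secret: bytes):
    """Replace every email address with a stable, keyed token.

    The same address always maps to the same token, so records can still
    be joined, but the address cannot be read back from the token.
    """
    def _token(match):
        value = match.group(0).lower().encode("utf-8")
        digest = hmac.new(secret, value, hashlib.sha256).hexdigest()
        return f"<email:{digest[:12]}>"

    return EMAIL.sub(_token, text)


line = "Escalate to jane@example.com before Friday."
print(redact(line))
print(tokenize(line, secret=b"demo-key-only"))
```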

9. Ensure Data Quality

As discussed earlier, data quality is one of the biggest concerns for CDOs and data leaders and a major obstacle to their GenAI projects. To derive meaningful analysis or results from data, or to develop ethically sound and reliable GenAI applications, the data should be fresh, unique, complete, accurate, and relevant. Measure data quality by inferring metadata, such as recency and topic, and by evaluating files in-line for freshness and reliability of source.
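Freshness and uniqueness are the easiest of these properties to check mechanically. The sketch below drops documents that are empty, older than a cutoff, or exact duplicates after whitespace and case normalization. The `quality_filter` function, the record fields, and the one-year default are assumptions for illustration; near-duplicate detection, accuracy, and relevance need more involved techniques.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def quality_filter(docs, now, max_age_days=365):
    """Keep documents that are non-empty, recent, and not exact duplicates.

    `docs` is an iterable of dicts with a 'text' string and a timezone-aware
    'modified' datetime (hypothetical record shape).
    """
    cutoff = now - timedelta(days=max_age_days)
    seen, kept = set(), []
    for doc in docs:
        normalized = " ".join(doc["text"].lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if not normalized or doc["modified"] < cutoff or digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept


now = datetime(2024, 7, 14, tzinfo=timezone.utc)
docs = [
    {"text": "Refund policy v2", "modified": now - timedelta(days=10)},
    {"text": "refund   policy V2", "modified": now - timedelta(days=5)},   # duplicate
    {"text": "Refund policy v1", "modified": now - timedelta(days=900)},   # stale
]
print([d["text"] for d in quality_filter(docs, now)])
```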

10. Establish Data+AI Security Controls

Build in-line privacy and security controls around data and LLM interactions. Make sure that data systems and AI models are properly configured and that permissions are assigned only to authorized users to prevent sensitive data exposure. Formulate and implement policies that cover sensitive data, tone, topics, phishing, and attacks.

Manage & Safeguard Your Unstructured Data with Securiti

Conventional data governance tools lack the capabilities required to govern unstructured data, such as inline data discovery and classification, data quality insights, lineage tracking, or data extraction and sanitization controls.

Securiti Data Command Graph, a key capability of our Data+AI Command Center, helps organizations capture all the important metadata and the relationships between them, providing contextual insights into unstructured data for all key perspectives, such as:

  • Data Systems.
  • Buckets / Folders.
  • Files / Objects / Documents.
  • Data Sensitivity.
  • Access & Entitlements.
  • Internal Policies & Controls.
  • Applicable Regulations.
  • GenAI Models / Pipelines.

This is the baseline intelligence that organizations need to use data effectively and enable the safe use of GenAI. Together with the Data Command Graph, the Data+AI Command Center helps organizations:

  • Discover files of all types (docs, audio, video, images, CLOBs, etc.).
  • Identify file categories (legal, finance, HR, etc.) based on content.
  • Gain insights into and automate access and user entitlements.
  • Find sensitive objects within a file.
  • Map regulations applicable to file content.
  • Ensure data quality (freshness, relevance, uniqueness, etc.).
  • Track the lineage of files & embeddings used in GenAI pipes.

Request a demo to learn more.
