The Role of Unstructured Data in GenAI: From Key Driver to Top Challenge

Published April 23, 2024

Not a day goes by without a new “breaking” headline about generative AI. The coverage runs the gamut: the overwhelming opportunities that AI can unlock for businesses, the staggering risks these new technologies hold, and how regulators are responding to the frenzy of AI-related activity (new GenAI models being released, new products being built and launched with GenAI technology) to shape the future of AI governance. In March, the European Union adopted a sweeping set of regulations around the use of AI by businesses in the form of the EU AI Act, and the US Treasury Department issued a report on the safe use of AI in financial services just two weeks later. The discussion, however broad, has shifted to a real concern about managing, using, and governing unstructured data.

Cybersecurity, privacy, and data teams now find themselves having to react, and react quickly, to take advantage of generative AI technology while ensuring that their customers are protected and their companies can meet compliance requirements. And that means, among other things, quickly learning how to deal with unstructured data.

Back to basics: What is unstructured data, and why does it put a knot in your stomach?

Unstructured data refers to data that does not have a predefined data model or is not organized in a traditional row-column database format. It’s typically text-heavy and lacks the structural organization and properties of structured data — for example, all of the documents, emails, social media posts, web pages, and multimedia content that a company may have or own. It can also include all the regulations and policies that companies may need to adhere to, such as tax codes or insurance terms of coverage.

While the majority of most organizations’ data is of the unstructured variety, the bulk of their data management investments are in structured data, which lives in databases or spreadsheets. Semi-structured data has also received some attention over the past several years, with many companies improving their handling of formats like XML documents or returns from APIs in JSON format, which are often used in integrations for exchanging data within or between companies.
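
To make the distinction concrete, here is a minimal Python sketch built around a hypothetical customer record (the names, fields, and sample values are invented for illustration). The same fact is easy to read from structured and semi-structured data, but once it is buried in free text it takes pattern matching, or more advanced natural language processing, to recover it.

```python
import csv
import io
import json
import re

# Hypothetical sample data: the same customer email in three shapes.
structured = "name,email\nAda Example,ada@example.com\n"
semi_structured = (
    '{"customer": {"name": "Ada Example", '
    '"contact": {"email": "ada@example.com"}}}'
)
unstructured = (
    "Hi team, Ada Example called today. "
    "Please reply to ada@example.com by Friday."
)


def email_from_structured(text: str) -> str:
    # Columns are known up front, so the field is addressed by name.
    row = next(csv.DictReader(io.StringIO(text)))
    return row["email"]


def email_from_semi_structured(text: str) -> str:
    # There is no fixed table, but the keys still describe the data.
    return json.loads(text)["customer"]["contact"]["email"]


def emails_from_unstructured(text: str) -> list[str]:
    # No schema at all: fall back to a simple (and imperfect) pattern.
    return re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)


print(email_from_structured(structured))
print(email_from_semi_structured(semi_structured))
print(emails_from_unstructured(unstructured))
```

The first two functions rely on structure that already exists. The third has to guess at it, which is why unstructured data calls for more specialized tooling.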

But, for most companies, this still leaves enormous volumes of unstructured data deprioritized at best and neglected at worst. Unstructured data management and handling has simply not seen the same level of attention as its structured-data counterpart, with many organizations even struggling to identify all the locations where their unstructured data might live — across which shared drives, cloud systems, applications, and so on. And once it is identified, unstructured data requires different, more complex management and specialized techniques in order for data teams to extract meaningful insights and patterns from it — techniques such as natural language processing, text mining, and machine learning.

Enter GenAI: Why unstructured data is especially relevant in new GenAI technologies

Unstructured data is the driving input for most generative AI systems, particularly for language models and multimodal systems (think picture and video applications), for several reasons:

  1. Massive training data: Generative AI models require massive amounts of training data to learn patterns and representations, and unstructured data provides a rich and diverse source of information.
  2. Natural language understanding: Unstructured text data — such as books, articles, and websites — is crucial for developing natural language understanding capabilities in AI systems. Language models like OpenAI GPT-4 and Anthropic Claude are trained on vast amounts of unstructured text data, enabling them to understand and generate human-like text.
  3. Contextual understanding: Unstructured data often contains rich contextual information, such as sentiment, tone, and implicit relationships, which are essential for AI systems to develop a deep understanding of human communication and behavior.
  4. Domain-specific knowledge: Unstructured data from specific domains — like medical records, legal documents, or scientific papers — can provide valuable domain-specific knowledge for AI systems, enabling them to generate more accurate and relevant outputs in those domains.

Whether a company licenses access to a commercial generative AI system or wants to build or fine-tune its own, the critical components are the documents, images, videos, and other content used to train the system, which provide the context in which the system operates.

Companies’ challenges around unstructured data

For most organizations, unstructured data is inherently difficult to manage, govern, and secure. Here are a few reasons why:

  1. Volume and variety: The sheer volume and variety of unstructured data sources — from emails to documents to social media posts to multimedia files — is the core issue, making it difficult for teams to keep track of and enforce consistent governance and security policies across the organization.
  2. Uncontrolled access and sharing: Once created, unstructured data proliferates rapidly across various systems, devices, and cloud services as people copy, modify, manipulate, and share the content, making it easy to lose track of the data’s original provenance.
  3. Data silos and ambiguous ownership: Compounding this, unstructured data is often created and managed by different departments or individuals within an organization, leading to data silos and ambiguity around data ownership and accountability. While structured data is more likely to have known ownership within an organization due to understood security or cost implications, a company’s unstructured data is often sequestered, either for legitimate reasons (e.g., upcoming commentary for an acquisition) or for less desirable ones (e.g., political boundaries between divisions).
  4. Inconsistent formats: Finally, the formats of unstructured data are varied. Whereas structured data has converged on a small set of universal standards, SQL being a principal one, unstructured content systems have a multitude of formats and legacy patterns. Managing these formats in a unified way takes specialized tools, and the organization must commit to deploying and using them.

In the past, Enterprise Content Management (ECM) systems gained popularity for their ability to manage and organize unstructured data, including documents, images, and other content. However, due to cost, architecture, user experience, and—most notably—many companies’ migration to the cloud, they fell out of favor with most businesses.

Today, many organizations have opted to replace or augment ECM systems with more modern, cloud-native, AI-powered content services platforms that better align with their digital transformation initiatives and the evolving needs of managing unstructured data at scale. Systems like Microsoft’s Office 365, Atlassian Confluence, and Google Workspace now dominate usage. Unlike their ECM predecessors, these systems are flexible and easy to use, which is great for creative work but does little from a governance or security perspective.

How companies can start to tackle the unstructured data problem

To effectively manage their unstructured data, companies should implement the following strategies:

  1. Data discovery and classification: Identify and classify unstructured data assets across the organization, including documents, emails, multimedia files, and other content. Use data discovery tools, machine learning, and natural language processing to automate the process and categorize data based on sensitivity, content, and purpose.
  2. Data governance framework: Establish a comprehensive data governance framework that defines policies, roles, and responsibilities for managing unstructured data throughout its lifecycle. This includes data creation, storage, access, retention, and disposal.
  3. Metadata management: Implement metadata management practices to enrich unstructured data with contextual information, such as data owners, access permissions, retention periods, and other relevant metadata.
  4. Access controls and data security: Apply appropriate access controls, encryption, and data loss prevention (DLP) measures to protect sensitive unstructured data from unauthorized access, data breaches, or accidental exposure.
  5. Data lifecycle management: Define and enforce policies for data retention, archiving, and disposal. Automate processes for managing data lifecycle stages, ensuring compliance with regulatory requirements and minimizing data storage costs.
  6. Cloud and on-premises integration: Develop strategies to manage unstructured data across cloud and on-prem environments, ensuring consistent governance, security, and compliance across hybrid infrastructure.
  7. Continuous monitoring and auditing: Implement processes to track data access, usage, and potential data leakage or misuse.
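
As a rough illustration of how the first few strategies fit together, the sketch below classifies a document by sensitivity, attaches governance metadata, and checks it against a retention period. It is only a sketch under stated assumptions: the regex patterns, the sensitivity labels, the `DocumentRecord` fields, and the retention periods are all invented for this example, and real discovery tools typically rely on much richer machine learning and NLP-based detection.

```python
import re
from dataclasses import dataclass
from datetime import date, timedelta

# Illustrative patterns only; real classifiers use far more robust detection.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

# Assumed retention periods (in days) for each sensitivity label.
RETENTION_DAYS = {"restricted": 365, "internal": 730}


@dataclass
class DocumentRecord:
    """Hypothetical metadata attached to one unstructured document."""
    path: str
    owner: str
    created: date
    sensitivity: str
    findings: list[str]

    def is_past_retention(self, today: date) -> bool:
        # True once the document is older than its label's retention period.
        limit = timedelta(days=RETENTION_DAYS[self.sensitivity])
        return today - self.created > limit


def classify(text: str) -> tuple[str, list[str]]:
    """Return a sensitivity label and the names of the patterns that matched."""
    findings = [name for name, rx in PATTERNS.items() if rx.search(text)]
    return ("restricted" if findings else "internal"), findings


def build_record(path: str, owner: str, created: date, text: str) -> DocumentRecord:
    # Discovery step: classify the content, then enrich it with metadata.
    sensitivity, findings = classify(text)
    return DocumentRecord(path, owner, created, sensitivity, findings)


record = build_record(
    "shared/hr/notes.txt",  # hypothetical location
    "hr-team",              # hypothetical owner
    date(2023, 1, 10),
    "Employee SSN 123-45-6789, contact jo@example.com",
)
print(record.sensitivity, record.findings)
print(record.is_past_retention(date(2024, 4, 23)))
```

Keeping the classification result, owner, and creation date in one record is a design choice for this example: it lets access controls, retention checks, and audits all work from the same metadata.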

Overcoming the challenges presented by unstructured data requires a comprehensive data governance strategy that includes data discovery, classification, access controls, lifecycle management, and robust security measures. Organizations need to invest in specialized tools and technologies, and to train their employees on best practices for handling and securing unstructured data.

For the first time, Securiti, the pioneer of the Data Command Center, and Lacework, a best-in-class Cloud Native Application Protection Platform (CNAPP), come together with a strategic, collaborative solution built to empower enterprises to manage and safeguard their unstructured data across complex multicloud environments. Learn more about how the combined solution can protect you and your data — everywhere, at scale — and contribute to your peace of mind.
