Singapore's rapidly evolving digital landscape has positioned the country as a leader in technological innovation, particularly in artificial intelligence (AI) and data science. The nation's commitment to responsible AI development has recently led to the introduction of a proposed guide on synthetic data generation.
On 15 July 2024, the Personal Data Protection Commission (PDPC), Singapore's main regulatory authority for administering the Personal Data Protection Act (PDPA), introduced the Proposed Guide on Synthetic Data Generation (Guide) to assist organizations in understanding and utilizing synthetic data. The Guide comprehensively explains synthetic data, its potential applications, and the best practices for generating it.
The Guide, jointly developed with the Agency for Science, Technology and Research (A*STAR) and supported by the Info-communications Media Development Authority (IMDA), is accessible as a resource within the Privacy Enhancing Technology (PET) Sandbox.
Understanding Privacy Enhancing Technologies (PETs) and Synthetic Data (SD)
Understanding the Guide first requires an understanding of Privacy Enhancing Technologies (PETs) and Synthetic Data (SD).
A. What are Privacy Enhancing Technologies (PETs)?
PETs are methods and tools that facilitate data processing, analysis, and insight extraction while preventing the exposure of personal or sensitive information.
Benefits of PETs
PETs help organizations innovate by making the most of their data assets while ensuring compliance with evolving data protection regulations, lowering the likelihood of data breaches, and demonstrating a strong commitment to data security. In the digital age, PETs help preserve an organization’s reputation by securing data and fostering a proactive data protection culture.
Categories of PETs
PETs can generally be classified into three key categories:
a. Data Obfuscation
Data obfuscation techniques transform or mask data so that it can be used and shared without exposing the individuals it relates to:
- Data obfuscation methods, including anonymization and pseudonymization, mask personal identifiers for secure storage, data sharing, retention, and software testing.
- Synthetic data production also promotes privacy-preserving AI and machine learning, enhances data sharing and analysis, and supports software testing by generating artificial data replicating real-world datasets.
- Differential privacy adds calibrated statistical noise to data or query results, broadening research possibilities and facilitating secure data sharing (a minimal sketch follows this list).
- Zero-knowledge proofs facilitate the verification of information without disclosing specific details, such as age verification, enhancing privacy in transactions.
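To make the differential privacy idea above concrete, here is a minimal, illustrative sketch of the Laplace mechanism in Python. The dataset and the epsilon value are hypothetical; the point is simply that noise calibrated to a query's sensitivity and a privacy budget is added before a result is released.

```python
import numpy as np

def dp_count(values, predicate, epsilon):
    """Release a differentially private count using the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so Laplace noise with scale
    1/epsilon provides epsilon-differential privacy for this query.
    """
    true_count = sum(1 for v in values if predicate(v))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical example: noisy count of records with age over 60.
ages = [34, 67, 45, 72, 29, 61, 58, 70]
print(dp_count(ages, lambda age: age > 60, epsilon=1.0))
```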
b. Encrypted Data Processing
Encrypted data processing includes techniques such as:
- Homomorphic encryption, which enables secure data storage in the cloud while allowing computations to be performed on private data without disclosing it.
- Multi-party computation, including techniques such as private set intersection, which allows several parties to compute jointly on their undisclosed private data.
- Trusted execution environments, which provide hardware-isolated computing for privacy-sensitive workloads, ensuring that private data is processed securely while protecting both the data and the models used.
c. Federated Analytics
Federated analytics, including federated learning, enables privacy-preserving AI and distributed machine learning by allowing models to be trained and evaluated across numerous decentralized devices or servers without revealing the underlying data.
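As an illustration of this idea (a simplified sketch, not a method prescribed by the Guide), the following example trains a linear model locally on each simulated client and shares only the model weights with a coordinator, which averages them; the raw data never leaves the clients. The data and client setup are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_train(X, y, epochs=200, lr=0.1):
    """Fit a linear model w on local data by gradient descent; only w is shared."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# Three hypothetical clients, each holding its own private data.
clients = []
true_w = np.array([2.0, -1.0])
for _ in range(3):
    X = rng.normal(size=(100, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=100)
    clients.append((X, y))

# Federated averaging: the coordinator only ever sees model weights.
local_weights = [local_train(X, y) for X, y in clients]
global_w = np.mean(local_weights, axis=0)
print("Averaged global weights:", global_w)
```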
B. What is Synthetic Data (SD)?
Synthetic data, also referred to as artificial data, is generated using a purpose-built mathematical model or algorithm, including AI/machine learning (ML) models. It is typically produced by training a model (or algorithm) on a source dataset so that it imitates the properties and structure of the source data. Good-quality synthetic data preserves the statistical features and patterns of the original data to a high degree; consequently, analysis performed on synthetic data may yield findings comparable to those obtained from the source data.
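As a minimal, hypothetical illustration of this definition, the sketch below fits a very simple statistical model (a multivariate normal) to numeric source data, samples synthetic rows from it, and compares summary statistics. Real projects would use purpose-built synthesizers and far more careful validation; the column names and values here are invented.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical numeric source data: e.g., an age-like and an income-like column.
source = np.column_stack([
    rng.normal(40, 10, size=1_000),
    rng.lognormal(mean=10, sigma=0.5, size=1_000),
])

# "Train" a simple generative model: estimate mean and covariance.
mu = source.mean(axis=0)
cov = np.cov(source, rowvar=False)

# Sample synthetic rows that imitate the source's statistical structure.
synthetic = rng.multivariate_normal(mu, cov, size=1_000)

# Analysis on synthetic data should be broadly comparable to the source.
print("source means:   ", source.mean(axis=0))
print("synthetic means:", synthetic.mean(axis=0))
print("source corr:    ", np.corrcoef(source, rowvar=False)[0, 1])
print("synthetic corr: ", np.corrcoef(synthetic, rowvar=False)[0, 1])
```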
When is Synthetic Data Most Beneficial?
Synthetic data has many use cases, including data analysis, collaboration, and the creation of training datasets for artificial intelligence algorithms. Moreover, it is used for software testing in place of real data to ensure the software works properly without risking breaches. It can accelerate research, innovation, collaboration, and decision-making while reducing exposure to cybersecurity incidents and data breaches, thereby supporting compliance with data privacy and protection regulations. Notable examples of its use include:
- J.P. Morgan effectively trained AI models using synthetic data for fraud detection.
- Mastercard used synthetic data to develop fair and unbiased AI models while protecting demographic privacy.
- Johnson & Johnson improved data analysis using high-quality synthetic data in healthcare research.
- A*STAR enabled data collaboration by generating synthetic data for external use in pharmaceutical research.
Key Considerations and Best Practices in Synthetic Data Generation
Here’s an overview of the five-step approach to generating synthetic data:
Step 1: Know Your Data
Before initiating a synthetic data project, it is necessary to understand the project's aim, its use cases, and the source data that the synthetic data will replicate. This knowledge helps in evaluating the relevance and potential risks of using synthetic data. Key considerations include:
- recognizing that synthetic data reproduces the broad patterns of the underlying data, which may not protect sensitive insights contained in those patterns;
- prioritizing data protection above relevance when sharing data publicly; and
- implementing contractual safeguards to prevent reidentification attacks.
Risk thresholds and benchmarks should be set and included in the organization's risk assessments, such as a Data Protection Impact Assessment (DPIA). They should be reviewed and updated regularly to meet business goals while minimizing residual risks.
Step 2: Prepare Your Data
When preparing source data for producing synthetic data, organizations must determine the critical insights and attributes needed to accomplish their goals. This entails analyzing relevant patterns, statistical properties, and attribute relationships in the underlying data, such as correlations between demographics and health conditions.
Moreover, decisions must be made about whether to include outliers. If they are not essential and pose a significant re-identification risk, they should be removed, as retaining them may require extra risk mitigation. If the aim is to mimic the source data, including its outliers, the organization should retain them but address the higher re-identification risk. If the aim is to balance data categories, generating additional outliers in the synthetic data can be useful.
Organizations should also adopt data minimization practices, such as deleting or pseudonymizing direct identifiers and generalizing or adding noise to data to limit re-identification risks. Moreover, standardizing and documenting data properties in a data dictionary is critical for protecting the integrity of the synthetic data and ensuring consistency throughout validation.
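The sketch below illustrates two of the minimization techniques mentioned above on a hypothetical record layout: salted hashing to pseudonymize a direct identifier, and small random noise added to numeric quasi-identifiers. It is a sketch of the general idea, not the Guide's prescribed procedure.

```python
import hashlib
import os
import random

SALT = os.urandom(16)  # keep secret and stored separately from the data

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a salted hash (pseudonym)."""
    return hashlib.sha256(SALT + identifier.encode()).hexdigest()[:16]

def add_noise(value: float, scale: float) -> float:
    """Perturb a numeric quasi-identifier to limit re-identification risk."""
    return value + random.gauss(0, scale)

# Hypothetical source record.
record = {"national_id": "S1234567A", "age": 42, "income": 85_000}

prepared = {
    "pseudo_id": pseudonymize(record["national_id"]),
    "age": round(add_noise(record["age"], scale=2)),
    "income": round(add_noise(record["income"], scale=5_000), -2),
}
print(prepared)
```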
Step 3: Generate Synthetic Data
Depending on their particular use cases, goals, and data types, organizations may choose from various techniques for producing synthetic data, including copulas, sequential tree-based synthesizers, and deep generative models (DGMs). After the data has been generated, data fidelity, data integrity, and data utility tests should be performed to assess its quality (a simple check is sketched after this list):
- Data integrity – ensures that the synthetic data is accurate, complete, consistent, and valid compared to the source data.
- Data fidelity – assesses whether the synthetic data accurately reflects the characteristics and statistical attributes of the original data.
- Data utility – evaluates the degree to which synthetic data can support or replace the original data to achieve the organization's goals.
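As a minimal, hypothetical illustration of such checks, the sketch below compares a synthetic dataset's marginal distributions against the source using a Kolmogorov-Smirnov test (a simple fidelity check) and compares the correlation structure (a rough utility proxy). Production evaluations would use fuller test suites; the data here is simulated.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical source and synthetic datasets with two numeric columns.
cov = [[100, 15_000], [15_000, 4e8]]
source = rng.multivariate_normal([40, 60_000], cov, size=1_000)
synthetic = rng.multivariate_normal([40, 60_000], cov, size=1_000)

# Fidelity: do the marginal distributions look alike?
for i, name in enumerate(["age", "income"]):
    ks = stats.ks_2samp(source[:, i], synthetic[:, i])
    print(f"{name}: KS statistic={ks.statistic:.3f}, p-value={ks.pvalue:.3f}")

# Utility proxy: is the correlation structure preserved?
corr_gap = abs(np.corrcoef(source, rowvar=False)[0, 1]
               - np.corrcoef(synthetic, rowvar=False)[0, 1])
print(f"Absolute difference in age-income correlation: {corr_gap:.3f}")
```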
Step 4: Assess Re-identification Risks
Once synthetic data has been generated and its utility is considered acceptable, organizations should conduct a re-identification risk assessment against their internal criteria. Potential risks, including inference attacks, linkability, and singling out, should be covered in this evaluation.
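One simple, illustrative check an organization might run at this step (an assumption about how the assessment could be operationalized, not a test mandated by the Guide) is to measure how many synthetic records are exact or near copies of source records, since such copies make singling out trivial.

```python
import numpy as np

def copy_rate(source: np.ndarray, synthetic: np.ndarray, tol: float = 1e-6) -> float:
    """Fraction of synthetic rows that (nearly) duplicate a source row."""
    copies = 0
    for row in synthetic:
        # Distance from this synthetic row to its closest source row.
        nearest = np.min(np.linalg.norm(source - row, axis=1))
        if nearest < tol:
            copies += 1
    return copies / len(synthetic)

# Hypothetical data: a high copy rate would be a red flag for singling out.
rng = np.random.default_rng(1)
source = rng.normal(size=(500, 3))
synthetic = rng.normal(size=(500, 3))
print(f"Copy rate: {copy_rate(source, synthetic):.2%}")
```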
Step 5: Manage Residual Risks
Organizations should identify residual risks and enforce appropriate governance, contractual, and technological mitigation strategies, which should be documented and formally approved within the organization's risk management framework. Key risks to consider include:
- new insights derived from synthetic data that could be sensitive or misleading;
- the potential impact of membership disclosure, where adversaries might determine whether particular individuals or groups were included in the source data;
- the risks posed by parties receiving the synthetic data;
- the increasing risk of re-identification over time due to advancements in technology; and
- the potential for malicious attacks, where adversaries could reconstruct parts of the source data.
Best Practices and Security Controls to Implement and Manage Risks
1. Ensure Data Governance
Governance for synthetic data involves implementing the following:
a. Access Controls
Establishing access controls for both the source data and the synthetic data generator model is crucial, especially if re-identification risks are high or the data is sensitive.
b. Asset Management
Effective asset management, including labeling synthetic data to avoid confusion with source data, is essential for ensuring clarity and accuracy.
c. Risk Management
Risk management should involve regular assessments of re-identification risks, particularly for publicly accessible datasets.
d. Legal Controls
Legal controls are essential when establishing contractual agreements with third-party recipients and solution providers. These agreements should specify their responsibility to secure the data or model and prohibit attempts to re-identify individuals. Additionally, solution providers may need to verify that adequate risk assessments and mitigations are implemented.
2. ICT controls
ICT controls are the policies, processes, and activities an organization implements to protect the confidentiality, integrity, and availability of its ICT systems and data. For example, ICT controls for database security should include keeping synthetic data and source data in separate storage, decreasing the risk of inadvertent data breaches or mismanagement.
3. IT Operations
To ensure data handling activities are secure and traceable, IT operations should incorporate complete logging and monitoring of use and access to both source and synthetic data, as well as the synthetic data generator model.
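As a small, hypothetical example of what such logging might look like, the sketch below records each access to the source data, the synthetic data, and the generator model with a timestamp, user, action, and purpose. The asset names are invented, and a production system would typically send these events to a centralized, tamper-evident audit log.

```python
import logging
from datetime import datetime, timezone

# Append-only audit log for data and model access events.
logging.basicConfig(
    filename="sd_access_audit.log",
    level=logging.INFO,
    format="%(message)s",
)

def log_access(asset: str, user: str, action: str, purpose: str) -> None:
    """Record who touched which asset, when, and why."""
    logging.info(
        "%s | asset=%s | user=%s | action=%s | purpose=%s",
        datetime.now(timezone.utc).isoformat(), asset, user, action, purpose,
    )

# Hypothetical usage.
log_access("source_dataset_v3", "alice", "read", "train synthetic data generator")
log_access("sd_generator_model", "alice", "execute", "generate synthetic release")
log_access("synthetic_dataset_v1", "bob", "export", "share with research partner")
```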
4. Risk Management
Risk management practices should involve securely destroying source data, synthetic data, and the synthetic data generator model whenever they are no longer required or have passed the end of their retention term to ensure that sensitive data is not kept longer than necessary.
5. Incident Management
Organizations that use synthetic data must identify the risks associated with data breaches involving synthetic data, models used to generate the data, and model parameters as part of their incident management process. Key considerations include:
- Loss of Fully Synthetic Data: Although fully synthetic data with low re-identification risk is generally not considered personal data, organizations should still investigate such incidents. This helps them identify root causes, enhance internal safeguards, and monitor for potential re-identification attempts. If re-identification occurs, it may need to be reported to the PDPC as a data breach.
- Loss of Synthetic Data Generator Model or Parameters: If an adversary gains access to the synthetic data generator model or its parameters, a model inversion attack raises the possibility of reconstructing the original source data. Organizations must investigate such incidents, assess the likelihood of a successful attack, and determine whether a data breach notification is necessary.
Methods of Synthetic Data Generation
Methods of Synthetic Data Generation are categorized into statistical methods and deep generative models:
1. Statistical Methods:
- Bayesian Networks (BN): These models capture dependencies between variables and are used in industries such as healthcare and finance. However, they can be computationally demanding, which limits their scalability.
- Conditional Copulas: Copulas model joint distributions based on variable correlations and handle complicated, non-linear relationships at a relatively low computational cost, making them well suited to moderate-sized datasets (a minimal sketch appears after this list).
- Marginal-Based Data Synthesis: This approach selects marginal distributions from the source data, adds noise to them for privacy, and uses statistical models (e.g., Bayesian networks) to generate data that maintains the source correlations.
- Sequential Tree-based Synthesizers (SEQ): SEQ generates synthetic data for regression and classification using decision trees. However, smoothing may be necessary for more accurate distributions.
2. Deep Generative Models:
- Generative Adversarial Networks (GANs): GANs are deep learning models that create realistic synthetic data. A generator produces data, while a discriminator evaluates it, helping the generator improve its results.
- Large Language Models (LLMs): Transformer-based models, originally developed for natural language processing, can now synthesize tabular data efficiently via attention mechanisms, despite their significant computational overhead.
These methods address a range of industry-specific data types, privacy requirements, and computational constraints.
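To make the copula idea above concrete, the sketch below implements a minimal Gaussian copula synthesizer for numeric columns: it converts each column to normal scores via ranks, estimates the correlation of those scores, samples correlated normal values, and maps them back through each column's empirical quantiles. It is an illustrative simplification; the Guide and production tools handle conditioning, mixed data types, and privacy controls far more carefully, and the data here is simulated.

```python
import numpy as np
from scipy import stats

def gaussian_copula_synthesize(source: np.ndarray, n_samples: int, seed: int = 0) -> np.ndarray:
    """Generate synthetic numeric data with a simple Gaussian copula."""
    rng = np.random.default_rng(seed)
    n, d = source.shape

    # 1. Transform each column to approximate standard-normal scores via ranks.
    ranks = np.argsort(np.argsort(source, axis=0), axis=0) + 1
    uniform = ranks / (n + 1)
    normal_scores = stats.norm.ppf(uniform)

    # 2. Estimate the correlation structure in normal-score space.
    corr = np.corrcoef(normal_scores, rowvar=False)

    # 3. Sample correlated normal scores and map back to uniforms.
    samples = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    u = stats.norm.cdf(samples)

    # 4. Invert through each column's empirical quantiles (preserves marginals).
    synthetic = np.column_stack(
        [np.quantile(source[:, j], u[:, j]) for j in range(d)]
    )
    return synthetic

# Hypothetical source data with non-Gaussian marginals and correlation.
rng = np.random.default_rng(3)
age = rng.normal(45, 12, size=2_000)
income = np.exp(0.03 * age + rng.normal(10, 0.3, size=2_000))
source = np.column_stack([age, income])

synthetic = gaussian_copula_synthesize(source, n_samples=2_000)
print("source corr:   ", np.corrcoef(source, rowvar=False)[0, 1])
print("synthetic corr:", np.corrcoef(synthetic, rowvar=False)[0, 1])
```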
Re-identification Risks
The main types of re-identification attacks in synthetic data include:
- Singling-Out Attack: This attack targets outliers or unique data points in synthetic datasets. While it may not directly identify individuals, it can reveal characteristics that, when combined with external data, may lead to re-identification.
- Linkability Attack: In this attack, adversaries link synthetic data with other datasets to identify individuals. By matching similar data points, such as demographics, adversaries can infer sensitive information, raising the risk of privacy breaches.
- Inference Attack: Adversaries analyze patterns in synthetic data to infer sensitive attributes not explicitly present, such as health conditions or lifestyle details, which could be traced back to individuals from the original dataset.
These attacks emphasize the need to establish strong privacy safeguards in synthetic data to reduce the possibility of re-identification.
How to Evaluate Re-identification Risks
Three key approaches to evaluating re-identification risks in synthetic data are:
- Attribution and Membership Disclosure: This method checks how easily individuals can be identified through synthetic data. It assesses whether personal attributes can be matched between the real and synthetic datasets (attribution) and whether someone’s inclusion in the original dataset can be inferred (membership).
- Privacy Risk Assessment Framework: This framework tests synthetic data against three main risks: identifying unique individuals, linking synthetic and real data, and inferring sensitive information. It splits the original data into training and control sets to measure how well the synthetic data protects privacy (a simplified version of this check is sketched after this list).
- Differential Privacy Audit: This audit evaluates the privacy integrity of the synthetic data generation process using differential privacy principles. It identifies sensitive outliers in the original data and checks if they can be found in synthetic data, ensuring a balance between privacy and data utility.
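A simplified, hypothetical version of the control-versus-training comparison described above is sketched below: if synthetic records sit systematically closer to the training records than to held-out control records, the synthetic data may be leaking membership information. The data, distance metric, and warning threshold are illustrative assumptions.

```python
import numpy as np

def mean_nearest_distance(queries: np.ndarray, references: np.ndarray) -> float:
    """Average distance from each query row to its closest reference row."""
    dists = np.linalg.norm(queries[:, None, :] - references[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())

rng = np.random.default_rng(11)
original = rng.normal(size=(1_000, 4))

# Split the original data: one half trains the generator, one half is a control.
train, control = original[:500], original[500:]

# Hypothetical synthetic output (here just random noise, for illustration).
synthetic = rng.normal(size=(500, 4))

d_train = mean_nearest_distance(synthetic, train)
d_control = mean_nearest_distance(synthetic, control)
print(f"mean distance to training set: {d_train:.3f}")
print(f"mean distance to control set:  {d_control:.3f}")

# If synthetic data is much closer to the training set than the control set,
# that asymmetry suggests membership information may be leaking.
if d_train < 0.9 * d_control:
    print("Warning: possible membership disclosure risk")
```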
How Securiti Can Help
Securiti’s Data Command Center enables organizations to comply with Singapore's data privacy, cybersecurity, and GenAI laws by securing the organization’s data, enabling organizations to maximize data value, and fulfilling obligations around data security, data privacy, data governance, and compliance.
It helps organizations overcome the challenges of hyperscale data environments by delivering unified intelligence and controls for data across public clouds, data clouds, and SaaS, enabling them to swiftly comply with privacy, security, governance, and compliance requirements.
Request a demo to witness Securiti in action.