{"id":1038870,"date":"2024-05-29T09:00:00","date_gmt":"2024-05-29T16:00:00","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?p=1038870"},"modified":"2024-05-30T07:37:31","modified_gmt":"2024-05-30T14:37:31","slug":"the-crossroads-of-innovation-and-privacy-private-synthetic-data-for-generative-ai","status":"publish","type":"post","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/the-crossroads-of-innovation-and-privacy-private-synthetic-data-for-generative-ai\/","title":{"rendered":"The Crossroads of Innovation and Privacy: Private Synthetic Data for Generative AI"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"788\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/PSD-for-Gen-AI-2024-BlogHeroFeature-1400x788-1.jpg\" alt=\"A flow chart with four successive blocks. Starting with a data owner, private data is provisioned to train a language model with differential privacy. The language model is subsequently prompted to generate novel synthetic data resembling the private data. 
This data can be used for down-stream applications such as machine learning, feedback analysis or statistical analysis.\" class=\"wp-image-1041408\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/PSD-for-Gen-AI-2024-BlogHeroFeature-1400x788-1.jpg 1400w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/PSD-for-Gen-AI-2024-BlogHeroFeature-1400x788-1-300x169.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/PSD-for-Gen-AI-2024-BlogHeroFeature-1400x788-1-1024x576.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/PSD-for-Gen-AI-2024-BlogHeroFeature-1400x788-1-768x432.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/PSD-for-Gen-AI-2024-BlogHeroFeature-1400x788-1-1066x600.jpg 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/PSD-for-Gen-AI-2024-BlogHeroFeature-1400x788-1-655x368.jpg 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/PSD-for-Gen-AI-2024-BlogHeroFeature-1400x788-1-240x135.jpg 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/PSD-for-Gen-AI-2024-BlogHeroFeature-1400x788-1-640x360.jpg 640w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/PSD-for-Gen-AI-2024-BlogHeroFeature-1400x788-1-960x540.jpg 960w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/PSD-for-Gen-AI-2024-BlogHeroFeature-1400x788-1-1280x720.jpg 1280w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"introduction\">Introduction<\/h2>\n\n\n\n<p>In today&#8217;s data-driven world, organizations strive to leverage data to train and adapt AI models. 
However, this pursuit often faces an important challenge: balancing the value of data with the need to safeguard individuals\u2019 right to privacy and comply with data privacy regulations like the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/gdpr.eu\/\" target=\"_blank\" rel=\"noopener noreferrer\">General Data Protection Regulation<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> (GDPR) and the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.europarl.europa.eu\/topics\/en\/article\/20230601STO93804\/eu-ai-act-first-regulation-on-artificial-intelligence\" target=\"_blank\" rel=\"noopener noreferrer\">EU AI Act<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.&nbsp;<\/p>\n\n\n\n<p>Synthetic data has emerged as a powerful solution to privacy and compliance challenges. It allows organizations to create realistic and useful datasets, tailored to specific use cases, without compromising individual privacy. 
This enables organizations to:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Train and adapt AI models<\/strong>: Synthetic data can be used to train and adapt models to specific domains and industries, even when real-world data is limited, or privacy concerns exist.<\/li>\n\n\n\n<li><strong>Comply with regulations<\/strong>: Since it doesn\u2019t require user data, synthetic data generation helps organizations adhere to data privacy regulations.<\/li>\n\n\n\n<li><strong>Unlock new possibilities<\/strong>: Synthetic data opens doors to innovative AI applications that were previously limited by data availability or privacy constraints.<\/li>\n<\/ul>\n\n\n\n<p><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/azure.microsoft.com\/en-us\/blog\/introducing-phi-3-redefining-whats-possible-with-slms\/\" target=\"_blank\" rel=\"noopener noreferrer\">Microsoft&#8217;s Phi-3<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> small language model (SLM) is a good example of how synthetic data can contribute to responsible AI development, enabling the creation of powerful language models without compromising privacy. Phi-3 leverages a combination of \u201ctextbook quality\u201d web data and LLM-generated synthetic content, creating a strategic approach that doesn&#8217;t need real-world personal data.&nbsp;<\/p>\n\n\n\n<p>However, synthetic data carries limitations. It can be difficult to artificially generate realistic data that anticipates a wide range of use cases and individual scenarios. Furthermore, synthetic data generated by pre-trained large-language models (LLMs) can sometimes reduce accuracy and increase bias on <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2403.04190v1\" target=\"_blank\" rel=\"noopener noreferrer\">down-stream tasks<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. 
So, how could we generate synthetic data that accurately captures the diversity and specificity of private data while maintaining strict privacy protections for data contributors?&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"differential-privacy-a-bridge-between-innovation-and-privacy\">Differential privacy: A bridge between innovation and privacy<\/h3>\n\n\n\n<p>Differentially private (DP) synthetic data generation is a promising solution. It allows developers to pursue innovations in machine learning while prioritizing privacy. The goal of synthetic data generation is to produce data statistically similar to real-world data sources. However, when the data is too similar, replicating uniquely identifying details of the source data, the promise of preserving privacy is compromised. This is where DP can help. DP is a mathematical framework for providing a guarantee that a particular computation is relatively invariant to the addition or removal of a single data contributor. <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/privacy-preserving-machine-learning-maintaining-confidentiality-and-preserving-trust\/\" target=\"_blank\" rel=\"noreferrer noopener\">Using DP techniques<\/a>, researchers can generate synthetic datasets that retain the statistical properties of the original data while ensuring that information that could help identify data contributors remains obscured.&nbsp;<\/p>\n\n\n\n<p>This blog post explores recent advancements in private synthetic data generation. 
We examine four recently published research papers that propose innovative techniques for generating synthetic data with strong privacy guarantees, while maintaining its usefulness for analytics, training AI models, and other tasks.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong><a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/synthetic-text-generation-with-differential-privacy-a-simple-and-practical-recipe\/\">Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe<\/a><\/strong> by Yue <em>et al., <\/em>which appeared at <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/2023.aclweb.org\/calls\/main_conference\/\" target=\"_blank\" rel=\"noopener noreferrer\">ACL 2023<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, proposes applying DP during the fine-tuning of a generative LLM. This approach injects noise into the model&#8217;s updates during training, ensuring privacy guarantees while maintaining the model&#8217;s ability to generate realistic text.<\/li>\n\n\n\n<li><strong><a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/differentially-private-synthetic-data-via-foundation-model-apis-1-images\/\">Differentially Private Synthetic Data via Foundation Model APIs 1: Images<\/a><\/strong> and <strong><a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/differentially-private-synthetic-data-via-foundation-model-apis-2-text-2\/\" target=\"_blank\" rel=\"noreferrer noopener\">Differentially Private Synthetic Data via Foundation Model APIs 2: Text<\/a><\/strong> by Lin, Xie, <em>et al<\/em>., which appeared at <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/iclr.cc\/Conferences\/2024\" target=\"_blank\" rel=\"noopener noreferrer\">ICLR 2024<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link 
glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/icml.cc\/\" target=\"_blank\" rel=\"noopener noreferrer\">ICML 2024<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, respectively, present an approach to data synthesis that focuses on leveraging pre-trained foundation models as black boxes. This method utilizes differentially private queries to the models&#8217; inference APIs for data generation, offering an API-based, training-free approach.<\/li>\n\n\n\n<li><a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/privacy-preserving-in-context-learning-with-differentially-private-few-shot-generation\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Privacy-Preserving In-Context Learning with Differentially Private Few-Shot Generation<\/strong><\/a> by Tang <em>et al., <\/em>which appeared at ICLR 2024, explores applying DP to the task of few-shot learning, where models are conditioned on a handful of synthetically generated demonstration examples at inference time. 
This approach is useful when only private labeled examples are available, and the generalizing power of an LLM can be leveraged to solve an in-context task.<\/li>\n<\/ul>\n\n\n\n<p>In the remainder of this blog post, we describe each approach in more detail, and present experimental results illustrating their value.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"technical-deep-dive-differentially-private-synthetic-data-generation\">Technical deep dive: Differentially private synthetic data generation&nbsp;<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"synthetic-text-generation-with-differential-privacy-a-simple-and-practical-recipe\">Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe<\/h3>\n\n\n\n<p>Generative LLMs offer the opportunity to produce synthetic text by sampling from LLM outputs.&nbsp;One avenue to generating realistic synthetic text is to fine-tune<strong> <\/strong>an LLM using representative data. For example, we could consider fine-tuning a pre-trained LLM on a corpus of scientific papers, enabling the model to more readily produce text that captures the knowledge and writing style used in scientific writing. Suppose, however, that we want to produce synthetic text based on a <em>private <\/em>corpus of documents. What steps can we take to protect the document authors and any sensitive information in their documents?&nbsp;For example, we may want to produce synthetic medical notes, or personal emails. LLMs have a well-known capacity to memorize training examples, and a model with the potential for reproducing samples from the training set might pose significant privacy risks.<\/p>\n\n\n\n<p>In the paper <em>Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe<\/em>, researchers from Microsoft presented an approach to leveraging a private data corpus for synthetic generation, without compromising the privacy of the data subjects. 
This approach uses <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/differentially-private-fine-tuning-of-language-models\/\" target=\"_blank\" rel=\"noreferrer noopener\">differentially private stochastic gradient descent<\/a> (DP-SGD) to fine-tune an LLM on the private documents with a strong privacy guarantee. <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/differentially-private-fine-tuning-of-language-models\/\" target=\"_blank\" rel=\"noreferrer noopener\">Differentially private model training<\/a> provides a mathematical guarantee that the trained model parameters, and any<strong> <\/strong>subsequent model outputs, are relatively unaffected by the addition or removal of any single user\u2019s training examples.<\/p>\n\n\n\n<p>The synthetic generation approach described in this work was validated by training on restaurant reviews with varying levels of privacy protection, then prompting the model to generate novel reviews. These reviews were then used for downstream classification tasks, such as sentiment prediction and restaurant genre classification, and the results, which are shown in Table 1, demonstrated only small accuracy penalties compared to training on the raw private data. This approach unlocks a powerful way for realistic synthetic data to be generated from private data without compromising privacy or confidentiality.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"529\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure1_Synthetic-Data-Generation.jpg\" alt=\"A flow chart with four successive blocks. Starting with a data owner, private data is provisioned to train a language model with differential privacy. The language model is subsequently prompted to generate novel synthetic data resembling the private data. 
This data can be used for down-stream applications such as machine learning, feedback analysis or statistical analysis.\" class=\"wp-image-1039950\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure1_Synthetic-Data-Generation.jpg 1400w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure1_Synthetic-Data-Generation-300x113.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure1_Synthetic-Data-Generation-1024x387.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure1_Synthetic-Data-Generation-768x290.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure1_Synthetic-Data-Generation-240x91.jpg 240w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><figcaption class=\"wp-element-caption\">Figure 1: By fine-tuning an LLM with differential privacy, the model can be used to generate synthetic examples that resemble the private corpus&nbsp;<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"700\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/table1_Synthetic-Data-Generation.jpg\" alt=\"A table of results with four columns and four rows. The columns indicate data type, data generator, epsilon, rating and category.  The first row indicates \u201coriginal\u201d data type and no entry for data generator or epsilon. The rating is 0.733 and category is 0.775.  The following three rows all indicate Synthetic for data type and GPT2, GPT2-Medium, and GPT2-Large for the data generator. Each row is further divided into two rows corresponding to epsilon = 4 and epsilon = infinity respectively. In all cases the rating and category scores are lower than the row marked original by a few percentage points. 
The rows corresponding to epsilon = 4 are lower than corresponding rows marked epsilon=infinity by 1-2 percentage points. In general the epsilon = 4 rows have increased scores for larger GPT2 models, while the epsilon=infinity rows are relatively flat.\" class=\"wp-image-1039956\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/table1_Synthetic-Data-Generation.jpg 1400w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/table1_Synthetic-Data-Generation-300x150.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/table1_Synthetic-Data-Generation-1024x512.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/table1_Synthetic-Data-Generation-768x384.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/table1_Synthetic-Data-Generation-240x120.jpg 240w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><figcaption class=\"wp-element-caption\">Table 1: Various versions of GPT-2 were trained on restaurant reviews both with (\u03b5=4) and without (\u03b5 =\u221e) a privacy guarantee. These models were used to produce synthetic training sets, which were used to train classification models for review rating and restaurant category, and subsequently evaluated for accuracy on a private hold-out set. The results show that models trained on the synthetic data can achieve accuracy competitive with models trained without a privacy guarantee.&nbsp;<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"differentially-private-synthetic-data-via-foundation-model-apis\">Differentially Private Synthetic Data via Foundation Model APIs<\/h3>\n\n\n\n<p>While the ACL paper demonstrated a robust approach to synthetic data generation, fine-tuning a large model can be impractical. 
Model training requires significant computing capacity and some of the most powerful models available are proprietary and not accessible for DP training. Recognizing this challenge, researchers at Microsoft explored whether synthetic data can be generated directly using <strong>only inference API access to a model<\/strong>, even while utilizing an untrusted model controlled by a third party. Crucially, the synthetic data should resemble a targeted private corpus, and yield a similar DP guarantee as was met in the previous work based on model training. In two separate papers, the authors demonstrate an approach to this problem using a differentially private sampling approach called Private Evolution (PE).&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1138\" height=\"676\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/Synthetic-Data-Generation-NEW-Fig2.png\" alt=\"Two independent flow charts. In the first, private data is applied to a pre-trained model using DP-SGD. The fine-tuned model is used to produce differentially private synthetic data.  In the second chart, a pre-trained model is prompted via its API to produce generic data. Private data is used to inform selection of the generated data, with a strong privacy guarantee, yielding differentially private synthetic data. 
\" class=\"wp-image-1040847\" style=\"width:718px;height:auto\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/Synthetic-Data-Generation-NEW-Fig2.png 1138w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/Synthetic-Data-Generation-NEW-Fig2-300x178.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/Synthetic-Data-Generation-NEW-Fig2-1024x608.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/Synthetic-Data-Generation-NEW-Fig2-768x456.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/Synthetic-Data-Generation-NEW-Fig2-240x143.png 240w\" sizes=\"auto, (max-width: 1138px) 100vw, 1138px\" \/><figcaption class=\"wp-element-caption\">Figure 2: Instead of fine-tuning pre-trained models with DP-SGD (top figure), Private Evolution (PE) only requires accessing the inference APIs of a model (bottom figure). Thus, PE is easily compatible with foundation models that are difficult to DP-fine-tune (e.g., because they are too large) or infeasible to fine-tune (e.g., they are only accessible through inference APIs).<\/figcaption><\/figure>\n\n\n\n<p><strong>Synthetic image generation using foundation model APIs: <\/strong>In <em>Differentially Private Synthetic Data via Foundation Model APIs 1: Images<\/em>, the authors introduced Private Evolution (PE), an approach that enables DP image synthesis merely through inference APIs of a generative model. PE operates by sampling from a pre-trained diffusion model such as Stable Diffusion, which has no knowledge of the private corpus. PE then iteratively compares these samples to the private corpus, keeps the ones that are most similar to the private corpus, and uses the pre-trained model to generate more such samples. 
Crucially, the comparison to the private corpus is done with a DP guarantee, so that any information revealed about the private corpus is strictly bounded. Also, all the queries to the foundation model APIs satisfy the same DP guarantee, so that we can safely use APIs provided by (untrusted) third parties.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"818\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure3_Synthetic-Data-Generation.jpg\" alt=\"Overview of PE. We use two private and synthetic images for illustration. Step 1 (RANDOM_API): we use the model API to generate random images. Step 2: We iteratively go through steps 2.1-2.3 to refine the synthetic images towards the private images. Step 2.1: Each private image votes for its closest synthetic image in the embedding space. In this example, we assume that the bird image gets two votes, and the car image gets zero votes. We then add Gaussian noise to the votes to ensure DP. This gives us the DP Nearest Neighbor Histogram (DP_NN_HISTOGRAM). Step 2.2: We resample the generated images proportional to the histogram. We assume that only the bird image remains. 
Step 2.3 (VARIATION_API): We use the model API to generate new images similar to the bird image, which are the initial synthetic images in the next iteration.\u00a0\" class=\"wp-image-1039968\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure3_Synthetic-Data-Generation.jpg 1400w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure3_Synthetic-Data-Generation-300x175.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure3_Synthetic-Data-Generation-1024x598.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure3_Synthetic-Data-Generation-768x449.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure3_Synthetic-Data-Generation-480x280.jpg 480w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure3_Synthetic-Data-Generation-240x140.jpg 240w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><figcaption class=\"wp-element-caption\">Figure 3: Overview of PE. We use two private and synthetic images for illustration. Step 1 (RANDOM_API): we use the model API to generate random images. Step 2: We iteratively go through steps 2.1-2.3 to refine the synthetic images towards the private images. Step 2.1: Each private image votes for its closest synthetic image in the embedding space. In this example, we assume that the bird image gets two votes, and the car image gets zero votes. We then add Gaussian noise to the votes to ensure DP. This gives us the DP Nearest Neighbor Histogram (DP_NN_HISTOGRAM). Step 2.2: We resample the generated images proportional to the histogram. We assume that only the bird image remains. 
Step 2.3 (VARIATION_API): We use the model API to generate new images similar to the bird image, which are the initial synthetic images in the next iteration.&nbsp;<\/figcaption><\/figure>\n\n\n\n<p>Even without doing any model training, PE significantly advances state-of-the-art results on some datasets. For example, on the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.cs.toronto.edu\/~kriz\/cifar.html\" target=\"_blank\" rel=\"noopener noreferrer\">CIFAR10 dataset<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, we achieve an FID score (image quality measure, smaller is better) \u2264 7.9 with DP privacy cost \u03f5 = 0.67, significantly improving on the previous SOTA of \u03f5 = 32. In the paper, we also show that PE requires fewer computational resources (GPU hours) than DP fine-tuning to achieve such results.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"1011\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure4_Synthetic-Data-Generation.jpg\" alt=\"A 2D line chart with six line series, comprising conditional and unconditional variations on the private evolution and DP-MEPF methods, as well as DP-GAN and DP-Diffusion. The x axis presents values of epsilon from 0 to 32. The y axis presents values of the image quality measure FID from 0 to 80, where lower values are better.  All six series show decreasing values of FID for increasing values of epsilon. 
Both of the series corresponding to private evolution show significantly lower FID values, ranging from about epsilon = 0.1 to epsilon = 2.\" class=\"wp-image-1039971\" style=\"width:531px;height:auto\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure4_Synthetic-Data-Generation.jpg 1400w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure4_Synthetic-Data-Generation-300x217.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure4_Synthetic-Data-Generation-1024x739.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure4_Synthetic-Data-Generation-768x555.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure4_Synthetic-Data-Generation-240x173.jpg 240w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><figcaption class=\"wp-element-caption\">Figure 4: FID (image quality measure, lower is better) vs. DP privacy cost \u03f5 on CIFAR10 (\u03b4 = 10<sup>\u22125<\/sup> ). (Un)cond means (un)conditional generation. Ours achieves the best privacy-quality trade-off compared to prior training-based approaches.<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"1400\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure5_Synthetic-Data-Generation.jpg\" alt=\"An array of ten rows of thumbnails, each row depicting ten instances of generated synthetic images. The rows include birds, cars, cats, dogs, and other animals, planes, boats and trucks.  Most of the images appear to be realistic with some exhibiting unusual artifacts. 
\" class=\"wp-image-1039974\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure5_Synthetic-Data-Generation.jpg 1400w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure5_Synthetic-Data-Generation-300x300.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure5_Synthetic-Data-Generation-1024x1024.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure5_Synthetic-Data-Generation-150x150.jpg 150w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure5_Synthetic-Data-Generation-768x768.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure5_Synthetic-Data-Generation-180x180.jpg 180w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure5_Synthetic-Data-Generation-360x360.jpg 360w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><figcaption class=\"wp-element-caption\">Figure 5: Private Evolution-generated samples using CIFAR-10 as the private corpus (\u03b5 =0.67, \u03b4 =10<sup>-5<\/sup>). Each row corresponds to one object class.<\/figcaption><\/figure>\n\n\n\n<p><strong>Synthetic Text Generation using foundation model APIs: <\/strong>the PE approach described above works well for images since it is easy to produce nearby perturbations of promising images. In <em>Differentially Private Synthetic Data via Foundation Model APIs 2: Text<\/em>, Microsoft researchers explored whether a similar approach could be applied to text. Their method, called Augmented Private Evolution (Aug-PE), operates similarly to the basic PE approach, but leverages the power of a pre-trained LLM to produce variations and re-wordings of input text. 
Aug-PE also proposes some fundamental algorithmic improvements that may benefit future development of PE.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"813\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure6_Synthetic-Data-Generation.jpg\" alt=\"An overview of the Augmented Private Evolution algorithm for synthetic text generation. Step 1 invokes a language model to produce random text. Step 2.1 uses private data and differential privacy to vote on the best candidates from step 1. Step 2.2 samples from this differentially private histogram to produce a selected set of generations. Step 2.3 prompts a language model to produce variants of the selected generations, and steps 2.1 to 2.3 are repeated.\" class=\"wp-image-1039977\" style=\"width:804px;height:auto\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure6_Synthetic-Data-Generation.jpg 1400w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure6_Synthetic-Data-Generation-300x174.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure6_Synthetic-Data-Generation-1024x595.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure6_Synthetic-Data-Generation-768x446.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure6_Synthetic-Data-Generation-480x280.jpg 480w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure6_Synthetic-Data-Generation-240x139.jpg 240w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><figcaption class=\"wp-element-caption\">Figure 6: Augmented Private Evolution (Aug-PE) leverages a foundational LLM to synthesize text and compare it in a privacy-preserving way with a private corpus. 
Similar to PE for images, in Aug-PE, samples that more closely resemble the private data are retained and refined to produce new synthetic text with a strong privacy guarantee. The illustration shows how we generate DP synthetic reviews for restaurants given two private samples.<\/figcaption><\/figure>\n\n\n\n<p>Results show that Aug-PE is a promising alternative to DP-fine-tuning for DP text synthesis. With the same foundation model, PE can match or even beat DP-fine-tuning in terms of the trade-off between text quality and privacy. Moreover, as Aug-PE only requires inference APIs, Aug-PE can easily work with the most advanced LLMs such as GPT-3.5, LLaMA, and Mixtral to further improve the text quality. In terms of computational cost (GPU hours), PE can achieve up to 65.7x speedup compared to the DP fine-tuning approach.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"682\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/table2_Synthetic-Data-Generation.jpg\" alt=\"A table of results for area and rating classification accuracy for a variety of models and comparing PE with DP synthesis. The table contains the remark that with the same model PE matches or beats DP fine-tuning on text quality vs privacy, and PE works well with advanced LLMs which may be challenging or impossible to fine-tune. The models compared include three sizes of GPT-2, several major open source models, and GPT-3.5. 
PE on the Mixtral model shows the strongest Area classification accuracy at 43.6 while PE on GPT-3.5 shows the strongest Rating classification accuracy at 43.1.\" class=\"wp-image-1039980\" style=\"width:698px;height:auto\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/table2_Synthetic-Data-Generation.jpg 1400w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/table2_Synthetic-Data-Generation-300x146.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/table2_Synthetic-Data-Generation-1024x499.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/table2_Synthetic-Data-Generation-768x374.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/table2_Synthetic-Data-Generation-240x117.jpg 240w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><figcaption class=\"wp-element-caption\">Table 2: Results on ICLR 2023 paper reviews (\u03f5 = 1). We use each method to generate DP synthetic paper reviews and test the utility of the data by training downstream paper area or rating classifiers and evaluating their accuracy on the real hold-out data (higher is better). Under the same base model (GPT-2 families), PE achieves competitive results with DP fine-tuning. PE also supports advanced LLMs that may be challenging to use with DP fine-tuning due to large model sizes or black box access.<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"privacy-preserving-in-context-learning-with-differentially-private-few-shot-generation\">Privacy-Preserving In-Context Learning with Differentially Private Few-Shot Generation<\/h3>\n\n\n\n<p>In-context learning is a technique for performing tasks with an LLM by providing a sample of demonstration examples in the prompt of the LLM before presenting it with a specific task. 
For example, we might show a few movie plots and their genres and ask the LLM to suggest the genre for a particular plot of interest. In-context learning harnesses the strong generalization capabilities of LLMs, but it requires a sample of labeled demonstration examples at inference time. How can we perform in-context learning when the only available labeled examples are private? A na\u00efve solution might be to use the private examples but hide the demonstration prompt from the user. However, the threat posed by <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/security\/blog\/2024\/04\/11\/how-microsoft-discovers-and-mitigates-evolving-attacks-against-ai-guardrails\/\" target=\"_blank\" rel=\"noreferrer noopener\">jailbreak attacks<\/a> puts these examples at risk of exposure to a malicious user.<\/p>\n\n\n\n<p>In <em>Privacy-Preserving In-Context Learning with Differentially Private Few-Shot Generation<\/em>, Microsoft researchers explored how demonstration examples can be synthesized from a private corpus with a privacy guarantee. The method operates by incrementally drawing samples from a token distribution defined by the private examples but with noise added to the distribution. The noise is calibrated to ensure a bound on the privacy lost with each sample. The research demonstrated that in-context learning with the synthesized demonstrations can outperform zero-shot learning (querying a model without any demonstration examples) and comes close to performing at the same level as the case with no privacy mitigations, as shown in Table 3.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"328\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure7_Synthetic-Data-Generation.jpg\" alt=\"An overview of differentially private few-shot generation.  A round of token generation is depicted with four steps. 
Given the tokens generated so far, step 1 selects the relevant private data. Step 2 takes an M by N sample of the private data, producing M batches of N examples. Step 3 assembles M LLM prompts with task instructions and the N examples appended. Step 4 feeds the M prompts to the LLM and performs noisy aggregation over the LLM\u2019s output probabilities to select the next generated token. \" class=\"wp-image-1039983\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure7_Synthetic-Data-Generation.jpg 1400w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure7_Synthetic-Data-Generation-300x70.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure7_Synthetic-Data-Generation-1024x240.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure7_Synthetic-Data-Generation-768x180.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/figure7_Synthetic-Data-Generation-240x56.jpg 240w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><figcaption class=\"wp-element-caption\">Figure 7: Illustration of DP few-shot generation. The example shows a synthetic demonstration generated token by token for the topic school with a differentially private guarantee. As new tokens are sampled, the private examples inform the sampling probability of each subsequent token, with noise injected to preserve privacy.&nbsp;<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"395\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/table3_Synthetic-Data-Generation.jpg\" alt=\"A table of results for private in-context learning tasks, including text classification on three datasets (AGNews, DBPedia, and TREC) and information extraction on two datasets (MIT-G and MIT-D).  
Accuracy is compared across two cases where epsilon = 0 (zero-shot and four-shot) and values of epsilon at 1, 2, 4, 8 and infinity. Generally, accuracy improves as epsilon increases but epsilon = 8 often outperforms epsilon = infinity. \" class=\"wp-image-1039986\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/table3_Synthetic-Data-Generation.jpg 1400w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/table3_Synthetic-Data-Generation-300x85.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/table3_Synthetic-Data-Generation-1024x289.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/table3_Synthetic-Data-Generation-768x217.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/table3_Synthetic-Data-Generation-240x68.jpg 240w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><figcaption class=\"wp-element-caption\">Table 3: For classification and information extraction tasks, DP in-context learning achieves accuracy similar to non-private ICL (\u03f5 = \u221e).&nbsp;<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"conclusion\">Conclusion<\/h2>\n\n\n\n<p>Synthetic data generation presents enormous opportunities to develop AI systems without compromising end-user privacy. In this blog post, we have explored recent innovations in synthetic data generation with strong privacy guarantees. These approaches can enable practitioners to produce synthetic data from private entities, while mitigating the risk that private information might be revealed. While these approaches are highly promising, they do have limitations. For example, we are currently limited to producing relatively short text passages. 
Future work will continue to explore the opportunities presented by these approaches, with an aim to produce increasingly realistic data with strong privacy guarantees.<\/p>\n\n\n\n<p><strong>Acknowledgments: <\/strong>The authors are grateful for the contributions of the co-authors of the papers reviewed in this blog post: Xiang Yue, Xuechen Li, Girish Kumar, Julia McAnallen, Hoda Shajari, Huan Sun, David Levitan, Chulin Xie, Arturs Backurs, Sivakanth Gopi, Da Yu, Harsha Nori, Haotian Jiang, Huishuai Zhang, Yin Tat Lee, Bo Li, Janardhan Kulkarni, Xinyu Tang, Richard Shin, Andre Manoel, and Niloofar Mireshghallah.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Synthetic data could potentially help address some privacy concerns with AI model development and training, but it comes with limitations. Researchers at Microsoft are exploring techniques for producing more realistic data with strong privacy protections.<\/p>\n","protected":false},"author":37583,"featured_media":1041408,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Gbola Afonja","user_id":"42846"},{"type":"user_nicename","value":"Robert Sim","user_id":"36650"},{"type":"user_nicename","value":"Zinan Lin","user_id":"42327"},{"type":"user_nicename","value":"Huseyin Atahan Inan","user_id":"40426"},{"type":"user_nicename","value":"Sergey 
Yekhanin","user_id":"34990"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[243984],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-1038870","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-post-option-blog-homepage-featured"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[1054512],"related-projects":[556311],"related-events":[],"related-researchers":[{"type":"user_nicename","value":"Gbola Afonja","user_id":42846,"display_name":"Gbola Afonja","author_link":"<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/gafonja\/\" aria-label=\"Visit the profile page for Gbola Afonja\">Gbola Afonja<\/a>","is_active":false,"last_first":"Afonja, Gbola","people_section":0,"alias":"gafonja"},{"type":"user_nicename","value":"Robert Sim","user_id":36650,"display_name":"Robert Sim","author_link":"<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/rsim\/\" aria-label=\"Visit the profile page for Robert Sim\">Robert Sim<\/a>","is_active":false,"last_first":"Sim, Robert","people_section":0,"alias":"rsim"},{"type":"user_nicename","value":"Zinan Lin","user_id":42327,"display_name":"Zinan Lin","author_link":"<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/zinanlin\/\" aria-label=\"Visit the profile page for Zinan Lin\">Zinan Lin<\/a>","is_active":false,"last_first":"Lin, Zinan","people_section":0,"alias":"zinanlin"},{"type":"user_nicename","value":"Huseyin Atahan Inan","user_id":40426,"display_name":"Huseyin Atahan Inan","author_link":"<a 
href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/huinan\/\" aria-label=\"Visit the profile page for Huseyin Atahan Inan\">Huseyin Atahan Inan<\/a>","is_active":false,"last_first":"Inan, Huseyin Atahan","people_section":0,"alias":"huinan"},{"type":"user_nicename","value":"Sergey Yekhanin","user_id":34990,"display_name":"Sergey Yekhanin","author_link":"<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/yekhanin\/\" aria-label=\"Visit the profile page for Sergey Yekhanin\">Sergey Yekhanin<\/a>","is_active":false,"last_first":"Yekhanin, Sergey","people_section":0,"alias":"yekhanin"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/PSD-for-Gen-AI-2024-BlogHeroFeature-1400x788-1-960x540.jpg\" class=\"img-object-cover\" alt=\"A flow chart with four successive blocks. Starting with a data owner, private data is provisioned to train a language model with differential privacy. The language model is subsequently prompted to generate novel synthetic data resembling the private data. 
This data can be used for down-stream applications such as machine learning, feedback analysis or statistical analysis.\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/PSD-for-Gen-AI-2024-BlogHeroFeature-1400x788-1-960x540.jpg 960w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/PSD-for-Gen-AI-2024-BlogHeroFeature-1400x788-1-300x169.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/PSD-for-Gen-AI-2024-BlogHeroFeature-1400x788-1-1024x576.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/PSD-for-Gen-AI-2024-BlogHeroFeature-1400x788-1-768x432.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/PSD-for-Gen-AI-2024-BlogHeroFeature-1400x788-1-1066x600.jpg 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/PSD-for-Gen-AI-2024-BlogHeroFeature-1400x788-1-655x368.jpg 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/PSD-for-Gen-AI-2024-BlogHeroFeature-1400x788-1-240x135.jpg 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/PSD-for-Gen-AI-2024-BlogHeroFeature-1400x788-1-640x360.jpg 640w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/PSD-for-Gen-AI-2024-BlogHeroFeature-1400x788-1-1280x720.jpg 1280w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/05\/PSD-for-Gen-AI-2024-BlogHeroFeature-1400x788-1.jpg 1400w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"","formattedDate":"May 29, 2024","formattedExcerpt":"Synthetic data could potentially help address some privacy concerns with AI model development and training, but it comes with limitations. 
Researchers at Microsoft are exploring techniques for producing more realistic data with strong privacy protections.","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/1038870","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/users\/37583"}],"replies":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/comments?post=1038870"}],"version-history":[{"count":44,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/1038870\/revisions"}],"predecessor-version":[{"id":1041411,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/1038870\/revisions\/1041411"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media\/1041408"}],"wp:attachment":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1038870"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/categories?post=1038870"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/tags?post=1038870"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1038870"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=1038870"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp
-json\/wp\/v2\/msr-event-type?post=1038870"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1038870"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1038870"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1038870"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=1038870"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=1038870"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}