Data Privacy
Data privacy in AI refers to the practices, technical controls, and legal requirements governing how personal information is collected, stored, used, and protected throughout AI system development and deployment. It also covers risks specific to AI, including training data memorization and sensitive attribute inference.
AI systems trained on large datasets frequently encounter personal information — names, health records, communications, and behavioral logs — either as deliberate training inputs or as incidental inclusions in scraped web data. Data privacy governs who has the right to control this information, under what conditions it may be used for AI training, how long it may be retained, and what protections apply when AI outputs could expose or infer personal details. In regulatory frameworks, "personal data" is broadly defined: the EU's General Data Protection Regulation (GDPR) covers any information that can identify an individual directly or indirectly, including inferred attributes.
Technical privacy measures in AI include differential privacy and federated learning. Differential privacy, as applied to model training, bounds each training example's contribution to the gradient and adds calibrated random noise, so that the trained model depends only weakly on any single record and is less likely to reproduce it. In federated learning, training occurs locally on user devices and only model updates, typically aggregated across many devices, are shared with a central server. At inference time, access controls, output filtering, and data minimization reduce the risk of exposing sensitive information. Legal mechanisms include informed consent requirements, data subject rights (access, erasure, and portability), purpose limitation, and contractual data processing agreements between AI developers and data sources.
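The noise-addition idea behind differentially private training can be illustrated with a minimal Python sketch. It assumes the common recipe of clipping each example's gradient to a fixed L2 norm before adding Gaussian noise scaled to that bound. The function names and the parameters max_norm and noise_multiplier are illustrative choices, not any particular library's API, and the sketch omits the privacy accounting needed to state a formal guarantee.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale one example's gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each gradient, sum them, add Gaussian noise, then average."""
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Noise scale is tied to the clipping bound (illustrative parameterization).
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]


# Hypothetical per-example gradients for a two-parameter model.
rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]
print(private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping limits how much any one record can move the update, and the noise masks what remains. A larger noise_multiplier gives stronger protection at the cost of a noisier, less accurate update.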
AI models trained on internet-scale corpora exhibit measurable memorization: when appropriately prompted, they can reproduce verbatim passages of training text, including personally identifiable information. Research by Carlini et al. (2021) showed that GPT-2 could be prompted to output email addresses, phone numbers, and names present in its training data. Beyond memorization, AI systems can infer sensitive attributes such as health conditions, political views, or financial status from ostensibly innocuous inputs, creating secondary privacy risks that existing regulations did not originally anticipate.
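Verbatim memorization of the kind described above can be checked, in a simplified form, by looking for word sequences that a model output shares with known training text. The Python sketch below is an illustrative overlap check, not the extraction method used by Carlini et al. The whitespace tokenization, the window length n, and the function names are assumptions made for the example, and the sample strings are invented.

```python
def ngrams(tokens, n):
    """Return the set of all length-n token windows in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def memorized_spans(generated, training_docs, n=5):
    """List the n-token spans of `generated` that appear verbatim in training_docs."""
    train = set()
    for doc in training_docs:
        train |= ngrams(doc.split(), n)
    tokens = generated.split()
    spans = []
    for i in range(len(tokens) - n + 1):
        window = tuple(tokens[i:i + n])
        if window in train:
            spans.append(" ".join(window))
    return spans


# Invented example data: one training document and one model output.
training_docs = ["contact jane doe at jane@example.com for details"]
generated = "you can contact jane doe at jane@example.com today"
print(memorized_spans(generated, training_docs, n=4))
```

A longer window reduces false positives from common phrases, while a shorter one catches smaller leaked fragments. Exact matching of this kind misses paraphrased or lightly altered reproductions.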
By 2026, the EU AI Act layers AI-specific obligations on top of GDPR, requiring documentation of training data sources and prohibiting certain uses of sensitive personal data in high-risk systems. Italy's data protection authority (Garante) temporarily blocked ChatGPT in March 2023 over GDPR concerns, after which OpenAI introduced user opt-out mechanisms and a process for handling data deletion requests from European users. Apple and Google have deployed differential privacy in production for on-device analytics and model training, though its adoption in large language model pretraining remains limited because the accuracy cost of the added noise is hard to absorb at that scale.