Generative AI's Third Party Risk Problem

May 23, 2024

technology

A look at the risks, responses, and predictions in a rapidly evolving Gen AI environment.


INTRODUCTION

NEW AVENUES FOR UNWANTED DATA EXPOSURE

AI has made a case for itself as a valuable assistant to employees across all industries. Gen AI allows employees to complete programming tasks 55% faster and professional writing tasks 44% faster, demonstrating its usefulness in roles ranging from software engineering to management consulting. A 2023 study found that management consultants using GPT-4 saw a 25% increase in speed and a 43% improvement in task performance compared to a control group. However, to accomplish these tasks, employees must enter considerable amounts of data into these models. Without appropriate guardrails to prevent the mishandling of this data, these tools can become vectors for data exfiltration.

WIDESPREAD DATA EXPOSURE

In a recent survey, 62% of respondents admitted to entering information about internal processes into public GenAI tools, and 48% reported inputting non-public information about their company. When proprietary data enters these GenAI models, it leaves the company's data ecosystem, introducing unprecedented challenges beyond traditional third-party risk concerns.

"We know that employees, by accident and not in a malicious way, are exposing company data in prompts."

- CISO, leading financial services organization

DATA PRIVACY RISKS

MODEL MEMORIZATION

Most leading model providers, including OpenAI, the maker of ChatGPT, train on the prompts their consumer services receive by default, exposing organizations to the risk that sensitive or proprietary information shared by employees is unintentionally memorized by these models. For example, Amazon discovered that certain ChatGPT outputs closely matched internal Amazon data, leading to company-wide restrictions on ChatGPT use. Similarly, in a heavily publicized incident, Samsung employees inadvertently shared confidential source code and internal meeting notes with ChatGPT, resulting in the unintentional exposure of proprietary information.

ADVERSARIAL DATA EXTRACTION

Through sophisticated extraction attacks, adversaries can exfiltrate training data — including sensitive or confidential information — with alarming ease. Google and academic researchers found that with appropriate prompting, private information such as email signatures and personal contact information could be extracted from ChatGPT. Even when models are aligned to behave responsibly, they can inadvertently leak confidential data.


"Using only $200 USD worth of queries to ChatGPT, we are able to extract over 10,000 unique verbatim memorized training examples. Our extrapolation to larger budgets suggests that dedicated adversaries could extract far more data."

- Nasr et al.

LEGAL AND REGULATORY RISKS

Generative AI's integration into business operations introduces compliance challenges, especially in highly regulated industries and jurisdictions.

HIPAA

Description: HIPAA mandates stringent security and privacy standards for entities handling protected health information (PHI).

Example Violation: An employee at a healthcare provider uses a public AI tool to draft patient communications or notes, including details such as medical conditions or treatment plans in the prompt, breaching HIPAA regulations.

PCI-DSS

Description: PCI-DSS sets requirements for companies that process, store, or transmit credit card information.

Example Violation: An employee at a retail company uses a public AI tool to generate customer service scripts and inadvertently inputs credit card details or transaction data, violating PCI-DSS requirements.

GDPR

Description: GDPR requires that the personal data of EU residents be processed under strict data protection rules, including restrictions on transfers outside the EU.

Example Violation: An employee at a multinational corporation uses a public AI tool that stores and processes data in a non-EU country without adequate safeguards, exposing EU customer data and potentially resulting in substantial fines and legal action against the company.

ORGANIZATIONAL RESPONSES

BANNING PUBLIC AI TOOLS

Companies like Apple, Spotify, Verizon, and Bank of America have reportedly issued blanket bans on the use of public AI tools such as ChatGPT. A report earlier this year found that 27% of organizations have banned GenAI use. However, in the absence of alternatives, this approach may restrict productivity gains and lead to unsanctioned and ungoverned use, known as shadow AI.

DEVELOPING EMPLOYEE USAGE POLICIES

Organizations are establishing AI usage policies to enforce compliant use of AI tools, including data handling protocols. A 2023 survey found that 21% of respondents reported that their organizations have policies regarding the workplace use of generative AI. However, the effectiveness of these policies depends heavily on consistent enforcement and employee adherence.

EMPLOYEE TRAINING AND AWARENESS PROGRAMS

Employee training and awareness programs focusing on the risks of generative AI and data leakage can bolster an organization's overall security posture. A 2023 study found that 44% of business leaders and 14% of front-line workers received AI-related training. However, the effectiveness of these programs relies on regular updates given the rapidly evolving nature of AI tools and the extent to which employees apply their learnings in practice.

PRIVATE MODELS

Some organizations are opting for private AI models hosted in on-premises or secure cloud environments. This approach offers more control over data but is costly and resource-intensive to develop and maintain. Additionally, there is a risk of shadow AI if employees seek out more capable public models for specialized tasks.

DLP TOOLS

Data Loss Prevention (DLP) tools are increasingly being integrated with AI systems to monitor and control the information entering public AI tools. Because these tools are still emerging, most rely on simple techniques such as regular-expression pattern matching, which limits their applicability to well-structured identifiers such as Social Security and credit card numbers. Models that can classify and identify sensitive information specific to a company could resolve many of the data exposure issues around Gen AI tools.
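As a minimal sketch of the pattern-based detection that these early DLP tools rely on (the patterns and function names here are illustrative, not taken from any specific product):

```python
import re

# Regex patterns for common structured identifiers (illustrative, not exhaustive)
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_valid(number: str) -> bool:
    """Luhn checksum, used to reduce false positives on card-like digit runs."""
    digits = [int(d) for d in re.sub(r"\D", "", number)][::-1]
    total = 0
    for i, d in enumerate(digits):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def scan_prompt(text: str) -> list[str]:
    """Return the types of sensitive data detected in a prompt."""
    findings = []
    if SSN_RE.search(text):
        findings.append("ssn")
    for m in CARD_RE.finditer(text):
        if luhn_valid(m.group()):
            findings.append("credit_card")
            break
    return findings
```

The Luhn checksum pass is included because raw digit-run patterns alone flag many non-card numbers; even so, this style of detection cannot recognize unstructured sensitive content such as source code or strategy documents.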








PREDICTIONS

CONTINUED FRAGMENTATION OF AI USAGE

When ChatGPT was released in November 2022, it was the first LLM-based product to see mainstream adoption. Times have changed. OpenAI is no longer the sole leading LLM provider, and a wave of bespoke AI tools and embedded AI solutions has taken the stage. The trend among AI tools is toward fragmentation, not consolidation.

In the short term, organizations will continue to adopt increasingly specialized AI tools, as well as embedded AI solutions within existing software. These tools will cater to specific industry needs, enabling more precise and intelligent applications of AI.

In the long term, we can expect a shift toward agentic workflows and autonomous agent-to-agent communication. The tide is already turning. For instance, half of customer contacts in the banking, telecommunications, and utilities sectors in North America are now automated. Initially, AI agents will augment existing workflows as evidenced by the rapid growth of human-in-the-loop (HITL) customer operations. However, as these agents become more advanced and capable, they may fully automate these workflows, reducing or even eliminating the need for human intervention.



1. THIRD-PARTY APPLICATIONS WILL INGEST A LOT MORE DATA THAN THEY DO TODAY

With the proliferation of AI tools, third-party applications will ingest more data than ever before. LLMs require vast amounts of data to be effective, both during training and inference. Every part of the software stack, from CRM systems to SCM platforms, will require more data.

2. LLM DATA PROTECTION WILL BE MODEL-AGNOSTIC BY NECESSITY

Data protection agreements with a single model provider or the use of private models may serve as an effective short-term solution. However, in a landscape where dozens of specialized models and autonomous agents are in play, this approach will no longer suffice. Organizations will need to adopt model-agnostic data protection strategies to ensure the safe and compliant use of these tools across various platforms and providers.
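One way to picture a model-agnostic approach is a redaction layer that sits in front of every provider, scrubbing prompts before they leave the organization. The patterns and placeholder tokens below are assumptions for this sketch, not a reference implementation:

```python
import re

# Illustrative redaction pass applied before a prompt reaches ANY model
# provider; the patterns and placeholder tokens are assumptions.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(prompt: str) -> str:
    """Replace detected sensitive substrings with placeholder tokens,
    regardless of which model the prompt is destined for."""
    for label, pattern in PATTERNS.items():
        prompt = pattern.sub(f"[{label}]", prompt)
    return prompt
```

Because the redaction happens before any provider-specific SDK is invoked, the same safeguard covers every model and agent the organization adopts.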

3. AUTONOMOUS COMMUNICATION WILL REQUIRE AUTONOMOUS DATA SECURITY

As we move toward autonomous agent-to-agent communication, prompt validation and access controls will be critical to preventing data exfiltration. Achieving fully autonomous communication will require reliable data classification workflows, real-time data classification at the perimeter, and dynamic access controls to ensure that only authorized agents can access and transmit sensitive information securely.
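The access controls described above can be sketched as a policy check between agents. The sensitivity labels, clearances, and keyword classifier below are hypothetical stand-ins; a production system would use a trained classifier rather than keyword matching:

```python
from dataclasses import dataclass

# Hypothetical sensitivity levels, lowest to highest
LEVELS = {"public": 0, "internal": 1, "confidential": 2}

@dataclass
class Agent:
    name: str
    clearance: str  # highest sensitivity level this agent may receive

def classify(message: str) -> str:
    """Stand-in classifier; a real deployment would use a trained model."""
    lowered = message.lower()
    if "source code" in lowered or "customer record" in lowered:
        return "confidential"
    if "internal" in lowered:
        return "internal"
    return "public"

def authorize(sender: Agent, receiver: Agent, message: str) -> bool:
    """Permit agent-to-agent transmission only when the receiving agent's
    clearance covers the message's classified sensitivity."""
    return LEVELS[classify(message)] <= LEVELS[receiver.clearance]
```

For example, a billing agent cleared for confidential data would be blocked from forwarding a customer record to a support agent cleared only for internal data.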

Copyright © 2025 Vallum
