Is private AI data even possible?

Hey Reader,

Tech companies have always had access to a lot of our confidential personal and business data. But prior to AI, I never got questions like:

"Is Google reading all my documents and stealing my ideas?"
"Is Microsoft scanning my spreadsheets and using them as training data?"

AI changed everything about data privacy because it showed that the value of data access goes way beyond targeted advertising (which has been the primary use case for Meta and Google for two decades).

Now, we can build smarter computers just by having those computers consume more information – which means your creativity and unique insights need to be guarded much more closely.

Today I’ll answer the biggest data-security questions that I’ve been getting from students in the Innovating with AI Incubator program, including...

Do I need to care about how OpenAI processes my data?
If I want to keep my data private, which tools should I use (and which should I avoid)?
Are there any actual legal protections here?

Let’s dive in.

Data security: who needs to care?

The status quo is that OpenAI, Google, Amazon, Meta and other creators of large language models (LLMs) are rapidly ingesting as much published content as they can find into their models. Whether or not this is legal is an open question (the New York Times, for example, argues in its lawsuit against OpenAI that it is not), but that won’t stop these companies for at least the next few years.

From a data-security standpoint, this means that if your data or content is already public somewhere on the internet, there’s no real way for you to stop it from becoming part of an LLM’s knowledge base. There are ways to add directives to your website to discourage crawling (typically in a robots.txt file), but ultimately these are just polite suggestions – your content could still end up in the Internet Archive, for example, and could make its way indirectly into an LLM’s knowledge.
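If you’re curious what those crawling directives look like in practice, here’s a minimal sketch. It assumes you publish a robots.txt file that names a few AI-related crawler tokens (GPTBot for OpenAI, Google-Extended for Google’s AI training, CCBot for Common Crawl), and it uses Python’s built-in robots.txt parser to show how a well-behaved crawler would read the file. The example.com URL is a placeholder, and which crawlers you list is up to you. Nothing here forces a crawler to comply.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (an illustration, not a complete blocklist).
# Each "User-agent" block addresses one crawler by its published token.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

# Placeholder URL standing in for any page on your site.
PAGE = "https://example.com/blog/my-post"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text the way a compliant crawler would."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask, for each crawler, whether the file permits fetching the page.
for agent in ("GPTBot", "Google-Extended", "CCBot", "Googlebot"):
    verdict = "allowed" if parser.can_fetch(agent, PAGE) else "asked to stay out"
    print(f"{agent}: {verdict}")
```

The three AI-related crawlers are asked to stay out, while an ordinary search crawler like Googlebot is still allowed. The key word is “asked”: the file only works if the crawler chooses to honor it.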

The flip side is that if your content is already public (or you plan to make it public), you can worry less about pasting it into ChatGPT or another language model. Even if OpenAI were to eventually use it for future training, it’s highly likely they would also find it elsewhere on the internet anyway, so pasting it into ChatGPT really makes no difference.

On the other hand, if you have data that is proprietary and not already somewhere on the internet, you may have a good reason to worry about data privacy, particularly if it’s something that you wouldn’t want ChatGPT to be able to easily answer questions about in the future. Many large companies, for example, have already invested in private LLMs and private datacenter operations to avoid their internal, proprietary data being ingested by OpenAI.

We have a number of clients and students who are in this boat. Here’s what they’re doing...

How to protect your content from LLMs

Within the OpenAI ecosystem, the key data-privacy trick is to use business-focused tools rather than consumer-focused ones. For example, if you are using the OpenAI APIs via their Platform, the company promises not to use your content to train future versions of GPT. (Note: Every time OpenAI releases a new version of GPT, that version has been trained on newer, more recent content. The older versions never “update” – each release is one-and-done, since the training process is so intense.)

OpenAI also recently released ChatGPT Team, which is considered a business tool as well and comes with the same promise not to use your content for future training.

However, the “normal” version of ChatGPT and ChatGPT Plus are considered consumer products, so by default, the company can use any content you put in there (including in your custom GPTs) for future training. You can change this setting by submitting a “do not train on my content” request via your OpenAI account. Keep in mind that this request affects future submissions only; it doesn’t retroactively affect anything you’ve already typed into ChatGPT. (In my experience, it takes a couple of days for them to process your request.)

Other nascent AI models, such as Google’s Gemini, offer similar protections at the business level – in fact, this has become the industry standard since so many corporations have trade secrets they want to protect.

It’s the smaller businesses and consumers that might not have this as a top-of-mind concern, so if you’re in that group, it’ll be valuable for you to get to know the different policies and options that come with each AI platform.

The limits of “do not train”

While I’ve taken advantage of all the privacy settings above on my own accounts (and recommend you do the same), I’m also somewhat skeptical of the idea that we can keep AI contained to “approved content only” in the long term.

That’s because, as LLMs grow and multiply, it’s going to be more and more difficult to determine what actually is and is not in their knowledge base. And even if you know something is in their knowledge base, it will be very difficult to determine where it originally came from and whether it is possible to remove it. We are already seeing some of these problems in the New York Times v. OpenAI case. These issues are confounding even for companies willing to spend millions on legal fees, so it’s going to be extremely difficult for smaller players to enforce the rules.

The result is that the big companies can effectively say whatever they want about privacy, but if there’s no way to actually verify what’s happening behind the scenes, those promises don’t add up to very much actual security.

Even non-AI privacy initiatives, like Europe’s GDPR, struggle with the difficulty of enforcement – and if it is literally impossible to see what training data has been incorporated into an LLM, then nobody can realistically enforce any rules.

In other words, you’re just trusting the companies to do what they say they’re going to do – which means someone, somewhere is likely to eventually break the rules without anyone else realizing.

The balanced approach

The upshot of this pessimistic outlook is that you should optimize the amount of time you spend caring about data privacy by choosing not to care in many situations.

If it is not mission-critical that your data remain private, you should feel comfortable doing “the basics” that I described above and then not worrying much about it – there’s just so much uncertainty that it’s not a useful thing to stress about.

And if you’re dealing with data that absolutely must never see the light of day, it’s probably better to not use it within an AI application quite yet, since it’s very difficult to guarantee (or verify) the level of security the big AI companies are providing.

Talk to you soon,

– Rob Howard
Founder of Innovating with AI

PS. I'm putting the finishing touches on my next AI education workshop. It's called "Find Your Next 5 AI Ideas".

I started putting it together when I realized that the first step to launching an AI idea in 30 days was [insert drumroll 🥁 here] to make sure you had a really valuable idea to work on.

If you want to be the first to hear about it, click here to join the waitlist for Find Your Next 5 AI Ideas.