Hey Reader,

Tech companies have always had access to a lot of our confidential personal and business data. But prior to AI, I never got the kinds of data-security questions I'm getting now.
AI changed everything about data privacy because it showed that the value of data access goes way beyond targeted advertising (which has been the primary use case for Meta and Google for two decades). Now, we can build smarter computers just by having those computers consume more information – which means your creativity and unique insights need to be guarded much more closely. Today I’ll answer the biggest data-security questions that I’ve been getting from students in the Innovating with AI Incubator program, including...
Let’s dive in.

Data security: who needs to care?

The status quo is that OpenAI, Google, Amazon, Meta and other creators of large language models are rapidly ingesting everything that has ever been published into their LLMs. Whether or not this is legal is an open question (the New York Times, for example, thinks it is illegal), but that won’t stop these companies for at least the next few years.

From a data-security standpoint, this means that if your data or content is already public somewhere on the internet, there’s no real way for you to stop it from becoming part of an LLM’s knowledge base. There are ways to add directives to your website to discourage crawling, such as the robots.txt rules shown further below, but ultimately these are just polite suggestions – your content could still end up being part of the Web Archive, for example, and could make its way indirectly into an LLM’s knowledge.

The flip side is that if your content is already public (or you plan to make it public), you can worry less about pasting it into ChatGPT or another language model. Even if OpenAI were to eventually use it for future training, it’s highly likely they would also find it elsewhere on the internet anyway, so pasting it into ChatGPT really makes no difference.

On the other hand, if you have data that is proprietary and not already somewhere on the internet, you may have a good reason to worry about data privacy, particularly if it’s something that you wouldn’t want ChatGPT to be able to easily answer questions about in the future. Many large companies, for example, have already invested in private LLMs and private datacenter operations to keep their internal, proprietary data from being ingested by OpenAI. We have a number of clients and students who are in this boat. Here’s what they’re doing...

How to protect your content from LLMs

Within the OpenAI ecosystem, the key data-privacy trick is to use business-focused tools rather than consumer-focused ones. For example, if you are using the OpenAI APIs via their Platform, the company promises not to use your content to train future versions of GPT. (Note: Every time OpenAI releases a new version of GPT, that version is trained on new and more recent content. The older versions never “update” – it’s just a one-and-done release since the training process is so resource-intensive.) OpenAI also recently released ChatGPT Team, which is also considered a business tool and promises not to use your content for future training.

However, the “normal” version of ChatGPT and ChatGPT Plus are considered consumer products, so by default, the company can use any content you put in there (including in your custom GPTs) for future training. You can change this setting by submitting a “do not train on my content” request via your OpenAI account. Keep in mind that this request affects future submissions only; it doesn’t retroactively affect anything you’ve already typed into ChatGPT. (In my experience, it takes a couple of days for them to process your request.)

Other newer AI models, such as Google’s Gemini, offer similar protections at the business level – in fact, this has become the industry standard, since so many corporations have trade secrets they want to protect. It’s the smaller businesses and consumers that might not have this as a top-of-mind concern, so if you’re in that group, it’ll be valuable for you to get to know the different policies and options that come with each AI platform.
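If you want to see what the "business tool" route looks like in practice, here is a minimal sketch using the OpenAI Python SDK on the Platform side. The model name and prompt are just placeholders, so adjust them for your own use case:

    from openai import OpenAI

    # The SDK reads OPENAI_API_KEY from your environment.
    # Platform/API usage falls under OpenAI's business terms, which
    # promise that your content won't be used for future training.
    client = OpenAI()

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; use whichever model fits your project
        messages=[
            {"role": "user", "content": "Summarize this internal memo: ..."},
        ],
    )

    print(response.choices[0].message.content)

And if you want to add the crawling directives I mentioned earlier, they live in your site's robots.txt file. Here is a rough example that asks the major AI crawlers to stay away (crawler names change over time, so check each provider's documentation for the current list, and remember these are polite requests, not guarantees):

    # robots.txt
    User-agent: GPTBot            # OpenAI's web crawler
    Disallow: /

    User-agent: Google-Extended   # Google's token for opting out of AI training
    Disallow: /

    User-agent: CCBot             # Common Crawl, a frequent source of training data
    Disallow: /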
The limits of “do not train”

While I’ve taken advantage of all the privacy settings above on my own accounts (and recommend you do the same), I’m also somewhat skeptical of the idea that we can keep AI contained to “approved content only” in the long term. That’s because, as LLMs grow and multiply, it’s going to be more and more difficult to determine what actually is and is not in their knowledge base. And even if you know something is in their knowledge base, it will be very difficult to determine where it originally came from and whether it is possible to remove it.

We are already seeing some of these problems in the New York Times v. OpenAI case. These issues are confounding even for companies willing to spend millions on legal fees, so it’s going to be extremely difficult for smaller players to enforce the rules. The result is that the big companies can effectively say whatever they want about privacy, but if there’s no way to actually verify what’s happening behind the scenes, those promises don’t add up to very much actual security.

Even non-AI privacy initiatives, like Europe’s GDPR, struggle with the difficulty of enforcement – but if you have a situation where it is literally impossible to see what training data is incorporated into an LLM, then you’re in a position where nobody can realistically enforce any rules. In other words, you’re just trusting the companies to do what they say they’re going to do – which means someone, somewhere is likely to eventually break the rules without anyone else realizing.

The balanced approach

The upshot of this pessimistic outlook is that you should optimize the amount of time you spend caring about data privacy by choosing not to care in many situations. If it is not mission-critical that your data remain private, you should feel comfortable doing “the basics” that I described above and then not worrying much about it – there’s just so much uncertainty that it’s not a useful thing to stress about. And if you’re dealing with data that absolutely must never see the light of day, it’s probably better not to use it within an AI application quite yet, since it’s very difficult to guarantee (or verify) the level of security the big AI companies are providing.

Talk to you soon,

– Rob Howard

PS. I'm putting the finishing touches on my next AI education workshop. It's called "Find Your Next 5 AI Ideas".
We help entrepreneurs and executives harness the power of AI.
Hey Reader, Hope you're getting ready for a wonderful holiday season! Thanks so much for being part of the Innovating with AI community this year. 🎄 ☃️ 🍾 Here's what's coming next year plus what I'm reading during holiday downtime... ••• We'll be opening enrollment in The AI Consultancy Project in January. You're already on the list to get all the details, but if you also want to get extra-early access, sign up for text or WhatsApp alerts here. ••• Rob's Holiday AI Reading List #1 –...
Hey Reader, A super-quick share for today – last night I got access to Sora, the new video-generation model from OpenAI. I handed the keyboard to my 10-year-old son to see what he'd do with it, and here are the results: Yes, that's a Corgi Taco Police Chase. Watch: Our First Look at Sora Video AI ••• (By the way, be sure to subscribe to our YouTube channel for lots more free tutorials and walkthroughs like this.) Enjoy! And if you have time this weekend, I highly recommend creating an account...
Hey Reader, I hope life, work and the world of AI are treating you well! We're back with 3 big things in AI. It was a huge week for AI, so I've done my best to distill it down to just a few items – more to come next week! By the way, make sure you're subscribed to our YouTube channel for free AI tutorials. I just posted a new one where Brian (our CTO) shows you how to connect Google Forms, Zapier and ChatGPT step-by-step to build a no-code AI client intake system. Watch the...