Playing Chor-Police with Generative AI
Concerns around AI companies being sitting ducks for cyber-attacks are just one side of the emerging story
A top-ranking official in the US government recently came out with a rather interesting take on generative AI and its impact on cybercrime. His view is that while threat actors are using the new technology to steal data, US Intelligence is using the same tech to seek out malicious actors and clamp down on their activities.
Now, anyone remotely familiar with cybercrime would tell you that this is one catch-up game where the criminals are always a couple of steps ahead of the victims. The recent reports about how OpenAI might have gotten hacked have now fuelled concerns over how AI companies could become the juiciest targets for cybercriminals.
Why AI could also be the scourge of cybercrime
However, there appears to be a flip side to this argument as well. US Intelligence says it has used AI technologies effectively to track malicious activity, with NSA director of cybersecurity Rob Joyce claiming that AI and ML were making the agency better at such tracking, though he did not cite any specific instance.
He did mention, though, that recent efforts by China-backed hackers to target critical infrastructure in the US were an example of how AI technologies are surfacing malicious activity and giving the intelligence community an upper hand. “Machine learning, AI and big data helps us surface those activities [and] brings them to the fore because those accounts don’t behave like the normal business operators on their critical infrastructure, so that gives us an advantage,” Joyce said.
The issue took centre stage once again following the NYT report of a hack of OpenAI’s systems, which was initially considered serious but then dismissed as a minor security incident in which the hackers only got as far as an employee discussion forum. However, the point that security experts are making is that companies working on LLMs are too tempting for hackers to resist, given the massive datasets they hold.
Three types of data that are critical for AI companies
This brings us to the types of data that these companies appear to be accessing and storing in large volumes. It can be classified into three types: the first is used to train new language models, the second comprises bulk user interactions that can reveal patterns, and the third, and most important, is customer data (minus whatever masking is applied).
No one can be sure what the training data contains, given the opacity maintained by companies engaged in this activity, and OpenAI is no exception. It would be wrong to assume these are just large sets of scraped web content; cleaning up raw data and structuring it is bound to involve human intervention alongside partial automation.
Given the importance of data quality in training large language models, it is highly likely that these AI companies have systems and processes that would be worth every effort of a cybercriminal. Such training datasets could be of immense value to competitors and adversaries, which could also include governments wanting more transparency.
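To make that clean-up point concrete, here is a minimal, purely illustrative sketch of the kind of partially automated filtering pass a training-data pipeline might contain. The rules, thresholds and function names are assumptions for illustration, not anything OpenAI or any other lab has disclosed.

```python
# Illustrative only: a toy clean-up pass of the sort that might sit in a
# training-data pipeline. Every rule and threshold here is an assumption.
import html
import re

def clean_scraped_text(raw: str, min_words: int = 20):
    """Return a cleaned passage, or None if it is too short to keep."""
    text = html.unescape(raw)                 # decode HTML entities
    text = re.sub(r"<[^>]+>", " ", text)      # strip leftover tags
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    if len(text.split()) < min_words:         # drop trivial fragments
        return None
    return text

samples = [
    "<p>Breaking: markets rallied today as&nbsp;tech stocks led gains...</p>",
    "<div>Click here</div>",
]
cleaned = [c for s in samples if (c := clean_scraped_text(s, min_words=5))]
print(cleaned)  # keeps the article snippet, drops the navigation fragment
```

The value, and hence the attraction for an attacker, lies less in any single rule than in the accumulated pipeline and the curated corpus it produces.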
User conversations give context to data
Potentially more important and hack-worthy is the user data: the very large number of conversations users have with ChatGPT across millions of topics. In some ways, this data forms the core of understanding the psyche of web users, much as search data became critical for serving ads.
Unlike search, where user queries form the basis of further analytics, chatbots can delve much deeper through conversations about a user’s needs, capacity and willingness to pay, and social surroundings. Google’s move to add AI interaction to search is a clear indication that such data will be critical for future AI enhancements.
Every conversation a user has with ChatGPT potentially generates datasets that can be used not only for training models and developing AI use cases, but also by teams involved in marketing, content generation and a whole host of allied activities. This data amounts to a gold mine that AI companies can tap at their own pace.
Customer data that they themselves volunteer
However, above these two datasets, the most important category of data that OpenAI and its ilk capture relates to how customers are actually using AI and the data they volunteer to share with the models. Companies using AI chatbots and similar tools hand over exactly the material they consider critical to fine-tuning their use cases and building their own datasets.
This sort of data could be anything, ranging from the code of an unreleased software module to something as mundane as personnel records or budget planning data from a bygone era. Of course, these datasets may be of little future use to the companies themselves, but the fact remains that the AI provider, by virtue of its privileged access, captures them.
Now, on the face of it, this seems like just another act of data collection, but seen in the light of what sort of data could be available within a company, one would not be wrong to classify it as industrial secrets. That this side of the picture is only emerging now heightens the risk for AI companies further.
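The masking mentioned earlier is the usual hedge: redacting identifiers before a prompt ever leaves the customer’s own systems. Below is a minimal, purely illustrative sketch of such a client-side pass; the regex patterns and the EMP-style employee ID format are hypothetical, and real deployments rely on far more sophisticated detection.

```python
# Illustrative only: a minimal client-side masking pass a company might run
# before text reaches a third-party chatbot API. Patterns are placeholders.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.\w+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "EMPLOYEE_ID": re.compile(r"\bEMP-\d{5}\b"),  # assumed in-house ID format
}

def mask_pii(text: str) -> str:
    """Replace recognised identifiers with labelled placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

prompt = "Summarise the review for EMP-48213, contact jane.doe@example.com."
print(mask_pii(prompt))
# -> Summarise the review for [EMPLOYEE_ID], contact [EMAIL].
```

Masking only catches what the patterns anticipate, which is precisely why the exposure described above never fully goes away.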
What’s the solution? Nothing for now!
Like any other SaaS provider, AI companies have put in place industry-leading levels of security, privacy and on-premises options to safeguard the private databases and API calls of their Fortune 1000 customers. But the challenge is that security is a cat-and-mouse game in which there is always someone trying to sneak in.
The NYT story around OpenAI and how it played out shows that while there is still no reason to hit the panic button, the fact remains that AI companies hold enough data for any motivated threat actor to get excited and prepare for a break-in. That is how we should read the NYT story: the targets are out in the open and visible. It is up to each of these companies to do more to safeguard that data.