OpenAI to Check for Plagiarism
This could well be the weirdest headline to pass our editorial desk in recent times. Why would the company that gave us a tool to freely plagiarize now give us another one to check if the content is plagiarized? As the old TV commercial says: “Yeh Baat Kuch Hazam Nahi Hui” (roughly, “this one is a bit hard to digest”).
First, they created a large language model-based chatbot that encourages students to plagiarize and content creators to indulge in other forms of cheating. Now, they have built a tool that uses text watermarking to catch these culprits. What exactly is OpenAI doing here? Attempting to right earlier wrongs? Or claiming the moral high ground over gibberish creation?
Either way, reports that OpenAI now has a tool to catch students asking ChatGPT to write their assignments are more confusing than ever, especially since The Wall Street Journal reported that the company, in which Microsoft has a major stake, is facing a dilemma over whether to release the tool or let things be.
To launch or not to launch is the question
One can understand their hesitation, given that earlier detection tools had erroneously labelled original content written by non-native English speakers as ChatGPT output. OpenAI's latest effort is based on making small changes to how the chatbot selects words from its repertoire. These changes add an invisible watermark to ChatGPT output, which a separate detection tool can later pick up with a reported accuracy of 99.9%.
In fact, OpenAI even updated an earlier blog post on the topic after the WSJ report surfaced. It says the text watermarking method was “effective against local tampering such as paraphrasing”, which has tripped up earlier attempts to detect AI-generated content. But it is not robust against tampering that uses translation systems, rewording with another generative AI model, or asking the model to insert a special character between words and then removing it.
The blog says teams were researching the use of metadata as a text provenance method, though the work was still in very early stages of exploration and it was a touch early to say whether it would be effective. “But there are characteristics of metadata that would make this approach particularly promising,” says the blog post.
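OpenAI has not published how its watermark works, so the following is only a toy sketch of one publicly discussed idea: nudging word choice toward a secret "green list" and later counting how many green words appear. Everything in it (the made-up vocabulary, the hash rule, the function names) is an assumption for illustration, not OpenAI's method.

```python
import hashlib
import math
import random

# Hypothetical toy vocabulary standing in for a real model's word list.
VOCAB = [f"w{i}" for i in range(1000)]


def is_green(prev_word: str, word: str) -> bool:
    """Assign roughly half of all words to a 'green list' keyed on the previous word."""
    digest = hashlib.sha256(f"{prev_word}|{word}".encode()).digest()
    return digest[0] % 2 == 0


def generate(n_words: int, watermark: bool, seed: int = 0) -> list[str]:
    """Pick words at random; when watermarking, prefer green candidates."""
    rng = random.Random(seed)
    words = ["<start>"]
    for _ in range(n_words):
        # Stand-in for the handful of next words a chatbot finds plausible.
        candidates = rng.sample(VOCAB, 8)
        if watermark:
            green = [c for c in candidates if is_green(words[-1], c)]
            candidates = green or candidates  # fall back if none are green
        words.append(rng.choice(candidates))
    return words[1:]


def z_score(words: list[str]) -> float:
    """How far the green-word count sits above the 50% expected by chance."""
    prev, hits = "<start>", 0
    for word in words:
        hits += is_green(prev, word)
        prev = word
    n = len(words)
    return (hits - 0.5 * n) / math.sqrt(0.25 * n)


if __name__ == "__main__":
    print("watermarked:", round(z_score(generate(200, watermark=True)), 1))
    print("plain:      ", round(z_score(generate(200, watermark=False)), 1))
```

In this sketch the watermarked sample should score far above a cut-off such as 4, while ordinary text hovers near zero. It also hints at why rewording by another system is a problem: every swapped word has only a coin-flip chance of being green, so the score drifts back toward chance.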
What exactly is OpenAI hoping for now?
In a statement carried by TechCrunch, OpenAI confirmed the research and claimed it was taking a deliberate approach, given the complexities involved and its impact on the broader ecosystem beyond OpenAI. “The text watermarking method we’re developing is technically promising but has important risks we’re weighing while we research alternatives, including susceptibility to circumvention by bad actors and the potential to disproportionately impact groups like non-English speakers,” the report said, quoting an OpenAI spokesperson.
Of course, the indications are that the tool would only catch plagiarism and cheating done via ChatGPT. So, in case you shift over to Gemini or one of the other generative AI-based chatbots, OpenAI will just do a facepalm. What we also aren’t sure about is how AI would treat content created using ChatGPT when it loops back and becomes training data for one of its later models.
Why do we ask? Quite obviously, the most important challenge before generative AI is what academics have called “model collapse”, where it has been shown that content created by AI models and fed back into the system eventually degrades into gibberish. Maybe this is one area where OpenAI’s tool can actually help its own model.
Then there is also the issue of web search, until recently the preserve of Google and a few smaller players. OpenAI has come out with SearchGPT, which it says will provide timely answers to questions by drawing from web sources. What happens when some of these web sources are produced by ChatGPT itself? How long before the quality of search dips, and can it be saved using the new tool?
For now, there are more questions than answers. So, all we can do is wait and watch for further developments, that is, if OpenAI does launch its new plagiarism detection tool.