TEXT AND DATA MINING – Decoding Copyright Challenges in India

By Anju Jain Kumar, Gunjan Jadiya, Hriday Chokshi

  • Increasing number of copyright claims in the United States and other countries challenging the use of in-copyright works in machine learning and in AI-generated output.
  • Increasing trend of countries introducing exceptions in their copyright laws to enable machine learning / text and data mining.
  • Indian copyright law does not include specific safeguards for machine learning or text and data mining, and the general exceptions under the law are limited.
  • Due diligence of the terms of the platforms / databases where the content resides becomes important for participants in the value chain, whether content creators or AI product developers.


    With the increasing use of AI and generative AI, human ingenuity is being challenged by these rapidly advancing technologies. In December 2023, the New York Times1 sued OpenAI and Microsoft in the United States for the alleged infringing use of its copyright works.2 The primary claim made by the NY Times is that millions of its articles were used to train chatbots that now compete with it. This legal battle is one among many copyright claims against OpenAI, including actions brought by numerous authors and artists.3 While the law on the use of in-copyright works in training data continues to develop in the United States, jurisdictions like Japan, Singapore and the EU have included limited exceptions under their copyright laws to enable text and data mining.4 Closer to home, a pivotal question looms: how does copyright law in India balance the interests of copyright holders on the one hand and the enabling of machine learning and AI on the other? According to a recent press release, the GOI5 has expressed confidence in the adequacy of the copyright laws to address concerns surrounding AI-generated works and related innovations. This write-up looks at this question under Indian law.


    Text and data mining (TDM) to assemble training data is the foundational step for any AI model. It is the systematic collection of extensive digitized material, coupled with the use of software to analyze and extract valuable information from this corpus.6 This involves web scraping, web crawling, and web archiving, amongst other things. The EU Directive on Copyright describes TDM as “New technologies that enable the automated computational analysis of information in digital form, such as text, sounds, images or data.”7 According to the EU Directive, TDM makes possible the processing of large amounts of information to gain new knowledge and discover new trends. While TDM finds application in several non-AI8 contexts, this write-up focuses on TDM employed for training AI models.

    Use Cases

    A. Machine learning, deep learning, pattern recognition without reproducing in-copyright works in generative output

    TDM involves collecting and cleaning data for analysis, pattern recognition, deep learning, etc. This includes making a copy of the data to be studied and subsequently transferring it to a tool for examination.9 Making a copy, or reproduction, of an in-copyright work is an exclusive right vested in the copyright owner; it requires either the copyright owner's permission or an applicable exception under Indian copyright law.

    One of the claims made in the NYT Complaint is that the act of making an unauthorized copy of in-copyright works for machine learning amounts to copyright infringement.10 In previous non-AI related cases11, US courts have supported the view that copying in-copyright texts in TDM for research purposes is fair use. These claims are yet to be determined in the context of AI. Countries like Singapore12 and Japan13 have reduced the uncertainty by introducing exceptions in their copyright laws that permit copying of in-copyright works for machine learning, pattern recognition and data verification, subject to conditions.14 The EU Directive on Copyright directs member states to allow reproduction of in-copyright works for TDM (i) by research organizations and cultural heritage institutions, for the purpose of scientific research; and (ii) for all other purposes, on the condition that the right holder has not opted out of such use of their work.15

    In India, there are no specific exceptions for copying or reproduction of in-copyright works for machine learning purposes. Therefore, the use cases need to fall within the ambit of existing exceptions under the copyright law, such as fair dealing.16 The scope of fair dealing in India is narrow and applies to literary, dramatic, musical, or artistic works.17 Sound recordings and cinematograph films fall outside the scope of fair dealing.18 Only use cases such as private or personal use, including research, criticism, or review, that satisfy the test of fair dealing are not considered infringement.19 The courts have traditionally looked at the following three factors in deciding what is fair dealing of an in-copyright work: (i) the amount and substantiality of the portion used; (ii) the purpose and character of the use; and (iii) the effect on the potential market.20

    Courts have held that if the purpose of the use is commercial in nature then it is not considered private or personal use, thus falling outside the scope of fair dealing.21

    Keeping in mind the narrow exceptions under Indian copyright law, it would be prudent to evaluate certain aspects of the TDM activity, for example (i) the purpose or use of the TDM and whether that purpose falls within the exceptions; and (ii) the terms and conditions of the databases / datasets that are used for machine learning or TDM.

    B. Machine learning, deep learning, pattern recognition with use of in-copyright works in generative output

    In generative AI, the training data may also be reproduced while generating responses, solutions or services. Such reproductions in output could trigger rights of copyright holders such as reproduction rights, communication rights and adaptation rights. In the NYT Complaint, the NY Times claims that “the current GPT-4 LLM will output near-verbatim copies of significant portions of Times Works when prompted to do so. Such memorized examples constitute unauthorized copies or derivative works of the Times Works used to train the model.”22 Other cases in the United States have made similar claims. In October 2023, Universal Music Group filed a copyright infringement lawsuit against Anthropic AI alleging that the AI is “copying and distributing lyrics from over 500 songs by renowned artists such as Katy Perry, the Rolling Stones, and Beyoncé.”23

    Under Indian law, the analysis would hinge on where the generative output falls on the spectrum of copyright: from full reproduction, through adaptation or derivative work, to a new original work. The test commonly used by Indian courts has been whether the work is substantially similar to the in-copyright learning data. If there is substantial similarity, it would be considered infringement unless it falls within the statutory exceptions, which, as we observed in A, are limited. Courts have looked at (i) the quality of the content copied as opposed to the quantity24; (ii) the ‘total concept and feel test’, where the determination is based on whether a reader, spectator, or viewer, after experiencing both works, unmistakably perceives the subsequent work as a copy of the original;25 and (iii) the abstraction-filtration-comparison test, which involves analyzing works by abstracting their core ideas, filtering out unprotectable elements, and comparing the remaining protected elements to assess whether infringement has occurred.26

    Our Lens

    We are seeing an increasing number of countries amending their copyright laws to include TDM-related exceptions, some wider than others. These changes are being made so that those countries can participate and stay ahead in the building and adoption of AI models. In India, there has been a history of exceptions being carved out to balance the rights of copyright holders against technological advancements.27 The government’s current stance, as articulated in the press release, indicates the absence of immediate plans to modify existing laws in the context of training data and AI. Would the existing exceptions support the increasing use of in-copyright works in training AI models for commercial use? Unlikely. Currently, individual participants in the value chain are left to determine how their works and databases are used and the commercial terms associated with such use.28

    (The authors are Anju Jain Kumar, Gunjan Jadiya, Hriday Chokshi | Veritas Legal, and the views expressed in this article are their own)