Copyright Exceptions in the EU for Artificial Intelligence: A Legal Analysis of Text and Data Mining under the DSM Directive

The rapid development of artificial intelligence (AI) technologies has fundamentally transformed approaches to data processing, analysis, and the creation of digital products. At the core of most modern AI systems lies text and data mining (TDM) — an automated process of analyzing large volumes of text and data to identify patterns, trends, and correlations.

At the same time, TDM inevitably involves the reproduction and extraction of information from copyrighted works, raising complex legal issues. In response to these challenges, the European Union adopted Directive (EU) 2019/790 on Copyright in the Digital Single Market (DSM Directive), which for the first time provides a systematic legal framework for the use of works for TDM in Articles 3 and 4.

The purpose of this article is to analyze the legal regime of TDM in the EU, assess its impact on AI development, and evaluate the balance between innovation and the protection of right holders.

Legal Framework of TDM under the DSM Directive

Article 3: A Mandatory Exception for Scientific Research

Article 3 of the DSM Directive establishes a mandatory exception to copyright for the use of works for scientific research purposes. It allows research organizations and cultural heritage institutions to reproduce and extract information from works to which they have lawful access for the purpose of carrying out TDM.

The key features of this provision are:

Beneficiaries: universities, research institutions, libraries, and archives;
Purpose: strictly scientific research;
Condition: lawful access to the materials;
Legal nature: mandatory exception (no opt-out permitted).

A crucial aspect of Article 3 is that right holders cannot override it. Under Article 7(1) of the Directive, contractual provisions contrary to this exception are unenforceable, so research institutions retain the right to conduct TDM even where a license purports to restrict it. This ensures a stable legal environment for academic research and non-commercial AI development.

Article 4: A General Exception with an Opt-Out Mechanism

Unlike Article 3, Article 4 provides a broader exception applicable to all users, including commercial entities and AI developers. However, this exception is subject to an important limitation: right holders may expressly reserve their rights, effectively opting out.

The main elements of Article 4 include:

Beneficiaries: any natural or legal persons;
Purpose: any form of TDM, including commercial use;
Condition: lawful access;
Limitation: opt-out by right holders;
Retention: copies may be retained for as long as is necessary for the purposes of TDM.

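The opt-out mechanism is central to Article 4. Under Article 4(3), the reservation must be made expressly and in an appropriate manner. Right holders may reserve their rights through:

contractual terms or unilateral declarations;
machine-readable means for content made publicly available online (e.g., metadata or, as commonly used in practice, robots.txt).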

As a result, although Article 4 appears broad, its practical scope depends significantly on the decisions of right holders.
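
For online content, a crawler can check one commonly used machine-readable signal, robots.txt, before collecting a page. The following is a minimal Python sketch using the standard library's urllib.robotparser. The user-agent name ExampleTDMBot, the sample rules, and the helper tdm_disallowed are hypothetical, and whether a given robots.txt rule amounts to a valid reservation under Article 4 is a legal question that the code does not answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: one named crawler is excluded,
# all other crawlers are allowed.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleTDMBot
Disallow: /

User-agent: *
Allow: /
"""


def tdm_disallowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots.txt tells this user agent not to fetch the URL.

    This only reads the technical signal; it does not decide whether the
    signal is a legally effective opt-out.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from text, no network call
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"
    print(tdm_disallowed(SAMPLE_ROBOTS_TXT, "ExampleTDMBot", page))
    print(tdm_disallowed(SAMPLE_ROBOTS_TXT, "OtherBot", page))
```

In this sketch the check runs per URL and per user agent, which mirrors the point that a developer relying on Article 4 has to look at each source individually. Reservations expressed in metadata or in a site's terms and conditions would need separate handling.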

The Concept of “Lawful Access”

Both Articles 3 and 4 rely on the concept of “lawful access.” This means that users must obtain content through legitimate means, such as:

subscriptions;
licenses;
publicly accessible online sources.

However, lawful access does not equate to unrestricted use. A user may have the right to read or view content, but not necessarily to reproduce or extract it for AI training purposes. This distinction creates a complex legal boundary between access and use, which remains a subject of ongoing debate.
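
The conditions summarized above for Articles 3 and 4 can be read as a small decision procedure. The Python sketch below encodes only the simplified criteria listed in this article (beneficiary, purpose, lawful access, opt-out). All class, field, and function names are hypothetical labels, and the model leaves out national implementations and any case-by-case assessment.

```python
from dataclasses import dataclass
from enum import Enum


class Basis(Enum):
    ARTICLE_3 = "Article 3 (scientific research, no opt-out)"
    ARTICLE_4 = "Article 4 (general exception, subject to opt-out)"
    NONE = "no TDM exception; authorization needed"


@dataclass
class TdmUse:
    # Hypothetical flags mirroring the conditions listed in the article.
    lawful_access: bool
    research_or_heritage_institution: bool
    scientific_research_purpose: bool
    rights_reserved: bool  # right holder has opted out under Article 4


def applicable_basis(use: TdmUse) -> Basis:
    """Map a planned TDM use to the exception that may cover it."""
    if not use.lawful_access:
        # Both exceptions presuppose lawful access.
        return Basis.NONE
    if use.research_or_heritage_institution and use.scientific_research_purpose:
        # Article 3 is mandatory, so an opt-out does not matter here.
        return Basis.ARTICLE_3
    if not use.rights_reserved:
        return Basis.ARTICLE_4
    return Basis.NONE


if __name__ == "__main__":
    university = TdmUse(True, True, True, rights_reserved=True)
    startup = TdmUse(True, False, False, rights_reserved=True)
    print(applicable_basis(university).value)
    print(applicable_basis(startup).value)
```

The ordering of the checks reflects the structure described above: lawful access is the common threshold, Article 3 is tested first because it cannot be overridden, and Article 4 applies only where no reservation has been made.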

Implications for Artificial Intelligence Development

Facilitating Innovation

The DSM Directive represents a significant step toward legitimizing TDM in the EU. It:

reduces legal uncertainty;
provides predictable rules;
encourages investment in AI technologies.

Article 3 is particularly important, as it guarantees freedom of research for academic institutions without requiring authorization from right holders.

Constraints on Commercial AI

At the same time, Article 4 introduces significant limitations for the commercial sector. The opt-out mechanism means that:

access to data may become fragmented;
large datasets may be partially unavailable;
developers must verify the legal status of each data source.

This leads to:

increased compliance costs;
residual legal uncertainty about whether and how rights have been reserved;
risk of copyright infringement.

Comparison with the U.S. Approach

In contrast to the EU, the United States relies on the flexible doctrine of fair use. U.S. courts have recognized certain large-scale computational uses of works as transformative, and such uses may be lawful even without the consent of right holders. How the doctrine applies to AI training is still being tested in litigation.

For example, in Authors Guild v. Google (2d Cir. 2015), the large-scale digitization of books to enable full-text search and the display of snippets was held to be fair use.

As a result:

the U.S. provides greater flexibility;
the EU offers greater legal certainty, but with more restrictions.

Balancing Innovation and Copyright Protection

The DSM Directive reflects an attempt to balance two competing objectives:

promoting technological innovation and AI development;
protecting the economic interests of right holders.

Article 3 clearly favors the public interest and scientific research, while Article 4 prioritizes the control of right holders.

Critics argue that the opt-out mechanism may:

limit access to data;
create barriers for smaller AI companies;
reduce the EU’s competitiveness in the AI sector.

Conclusion

For lawyers worldwide, it is increasingly important not only to analyze existing legislation but also to anticipate its evolution. In the field of artificial intelligence, this dynamic is particularly evident: technological progress significantly outpaces legal regulation, forcing states and international institutions to develop new legal frameworks.

In this context, the DSM Directive can be seen as only the first step toward a fundamentally new architecture of copyright law adapted to the data-driven era. It is already clear that traditional licensing models — such as licenses for public performance, broadcasting, or online distribution — are insufficient to address new forms of use related to AI training.

In the near future, legal systems are likely to evolve toward the creation of new types of licenses, focused not on the “display” or “communication” of works, but on the processing, analysis, and transformation of data. Such licenses may regulate access to datasets for TDM purposes, establish conditions for training AI systems, and provide mechanisms for compensating right holders whose works are used within training datasets.

In a sense, this evolution resembles the transformation of the media industry — from cinema to television, and later to streaming platforms. However, unlike previous stages, where the primary object of regulation was access to content, the new paradigm focuses on content as a resource for generating knowledge and technological innovation.

Thus, the key challenge for modern law will be to develop flexible and technologically neutral mechanisms that both foster AI development and ensure fair remuneration for right holders. The effectiveness of this balance will determine not only the future of copyright law but also the competitiveness of legal systems in the global digital economy.

Olena Yaremchuk
Attorney-at-Law, Patent Attorney
Managing Partner, International Legal Consulting Group “Yaremchuk & Partners”
www.yaremchukandpartners.com