Cross-border research collaboration reveals gaps in how Brazilian news websites shield content from AI crawlers
As of November 2025, only 7.2% of Brazilian news sites block at least one “AI crawler” through their robots.txt files, even though 75% have such a file in place. AI crawlers are automated bots that systematically browse and collect data from websites to train large language models and power AI assistants and search engines. The few outlets that do restrict access target mainly crawlers operated by well-known companies such as OpenAI, Google, Common Crawl, ByteDance, Amazon, Apple, Meta, and Huawei.
Amid surging demand for data to train large language models (LLMs), develop AI assistants, and power AI-driven search engines, the findings suggest that many news websites are not yet using robots.txt as a strategic tool to signal their preferences regarding AI companies scraping their content.
Without mechanisms to control AI crawlers, media organizations cannot effectively protect or monetize content that AI systems reuse, further straining outlets already facing declining traffic and visibility from digital platforms.
Robots.txt is a simple text file placed at the root of a website's domain that tells web crawlers which pages or sections they are allowed to access. Compliance with robots.txt is voluntary rather than legally binding, and the mechanism has its limitations, but it remains one of the few free and widely recognized tools publishers can use to signal their preferences regarding AI scraping.
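For illustration, a minimal robots.txt that blocks a few widely documented AI crawler user agents while leaving the rest of the site open to conventional crawlers might look like the sketch below. The tokens shown (GPTBot for OpenAI, CCBot for Common Crawl, Google-Extended for Google's AI training opt-out, Bytespider for ByteDance) are publicly documented, but each company's crawler list can change, so publishers should verify against current documentation rather than treat this as definitive.

    # Disallow OpenAI's training crawler from the entire site
    User-agent: GPTBot
    Disallow: /

    # Disallow Common Crawl's crawler
    User-agent: CCBot
    Disallow: /

    # Opt out of Google's AI training uses (separate from Search indexing)
    User-agent: Google-Extended
    Disallow: /

    # Disallow ByteDance's crawler
    User-agent: Bytespider
    Disallow: /

    # All other crawlers keep full access
    User-agent: *
    Disallow:

The empty Disallow line under "User-agent: *" grants everyone else full access. Because compliance is voluntary, a file like this is a signal of the publisher's preferences, not an enforcement mechanism.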

The impact of AI crawlers on media business models was one of the key issues highlighted during the 2025 CTRL+J conference series, co-hosted and sponsored by the International Fund with regional partners. Participants emphasized the need for stronger technical defense mechanisms that allow publishers to control AI crawler access and potentially monetize it.
Produced by the Journalism Relay Project, Momentum - Journalism and Tech Task Force, and the International Fund for Public Interest Media, this first report is part of a broader research collaboration focused on Brazil, Indonesia, and South Africa. Through technical and qualitative research, it aims to raise awareness among media organizations of how AI systems access their content and to help inform strategies and policies for managing AI crawlers in ways that best serve newsroom interests. The report includes a detailed methodology for anyone who wants to replicate this robots.txt research in other countries and regions.
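For readers curious how such a survey can be replicated, the sketch below uses Python's standard-library robots.txt parser to test whether a site's robots.txt permits selected AI crawler user agents to fetch the homepage; the report's own methodology should be consulted for comparable results. The crawler list and the example domain are illustrative assumptions, not the report's actual inputs.

    import urllib.robotparser

    # Illustrative list of AI crawler user-agent tokens; the report's
    # actual crawler list may differ.
    AI_CRAWLERS = ["GPTBot", "CCBot", "Google-Extended", "Bytespider"]

    def check_site(domain: str) -> dict:
        """Map each AI crawler token to whether the site's robots.txt
        allows it to fetch the homepage (True = allowed)."""
        parser = urllib.robotparser.RobotFileParser()
        parser.set_url(f"https://{domain}/robots.txt")
        parser.read()  # fetches and parses the file over HTTP
        return {bot: parser.can_fetch(bot, f"https://{domain}/")
                for bot in AI_CRAWLERS}

    if __name__ == "__main__":
        # example.com is a placeholder, not one of the surveyed sites
        print(check_site("example.com"))

A full replication would also need to record whether a robots.txt file exists at all, since the report distinguishes sites with no file from sites whose file simply does not mention AI crawlers.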
Read the full report in English, Portuguese or Spanish.