Protecting your site from AI scraping March 26, 2024
AI has been a frequent topic since last year. A constant problem is AI bots using – some may call it exploiting – text and images from any public site for training without actual consent, which naturally collides with intellectual property rights. The EU even passed a new law regarding this issue last year.
This article focuses on the ways available to protect your site.
Make your site non public via htaccess
This is actually the only really reliable way to cover all the different AI bots. Using Zenphoto's gallery password protection is an option, but it does not protect folders on the server level, so bots may bypass it.
Using htaccess to password protect your site, however, works on the server level. The drawback is of course that this blocks everything and everyone, including search engines and the visitors you actually want on your site.
There are lots of tools and descriptions out there to help you, such as: https://www.web2generators.com/apache-tools/htpasswd-generator
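As a sketch, an htaccess fragment for basic authentication could look like this (the path to the .htpasswd file is only a placeholder and must match your server):

```
AuthType Basic
AuthName "Restricted site"
# Placeholder path – point this to the .htpasswd file you generated
AuthUserFile /full/server/path/to/.htpasswd
Require valid-user
```

The .htpasswd file itself holds the username/password pairs, which a generator like the one linked above can create for you.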
Block bots via htaccess
This actually works pretty well, but you need to know the name of every bot you want to block. Sadly, lots of new AI tools appear every day. An htaccess example for a few known bots looks like this:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (CCBot|ChatGPT|GPTBot|anthropic-ai|Omgilibot|Omgili|FacebookBot) [NC]
RewriteRule ^ - [F]
You can add that to Zenphoto's htaccess. We have not included it by default because it is likely to require frequent changes.
Block bots via robots.txt
You can also advise bots via robots.txt, just as with search engines. Here is an example with some known bots:
User-agent: CCBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Omgilibot
Disallow: /
User-agent: Omgili
Disallow: /
User-agent: FacebookBot
Disallow: /
We have not included this either because it is likely to require frequent updates.
Also note that robots.txt is only a recommendation and no bot has to comply with it – even the big search engines follow it voluntarily.
Spawning's ai.txt
There is also the initiative Spawning that aims to create tools to control AI access to your content by opting out. There you can add yourself to a registry and create a specific ai.txt file, which is similar to robots.txt: https://site.spawning.ai/spawning-ai-txt#ai-text-generator.
Sadly it is more a concept of an opt-out infrastructure, and so far only two AI companies seem to follow the registry. And the ai.txt is no more secure a way of protection than a robots.txt, as no one has to comply with it.
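For illustration, an ai.txt that opts out all content might look similar to the following (the exact syntax is defined by Spawning's generator and may change, so verify it there):

```
# ai.txt – opt out of AI training for everything
User-Agent: *
Disallow: /
```

Spawning's generator can also produce more fine-grained rules, e.g. opting out only images or only text.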
Use html meta for blocking AI
You can add something like this code to the HTML head of your site:
<meta name="robots" content="noai, noimageai">
The html_meta_tags plugin included in Zenphoto 1.6.1 now provides options for this.
This, too, is only a recommendation and no bot has to comply with it.
Exif/IPTC: Use copyright notes
Some sites also recommend indicating via EXIF/IPTC metadata that your images are copyrighted and that AI usage is not allowed. There are plans to add an extra IPTC field specifically for AI. Note that only the Imagick library can preserve metadata when resizing and processing images!
Of course metadata will not make a bot comply, but it may be useful in legal conflicts at some point.
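As an example sketch, such a notice could be written with the exiftool command line tool (the wording and file name are only placeholders):

```shell
# Write a copyright notice into the EXIF and IPTC fields of an image
exiftool -EXIF:Copyright="© 2024 Example Name – AI training not permitted" \
         -IPTC:CopyrightNotice="© 2024 Example Name – AI training not permitted" \
         image.jpg
```

Remember that this metadata only survives Zenphoto's image processing when using the Imagick library.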
Image modifications
Another recommendation is to use a watermark on the images themselves. Of course a prominent watermark may disturb the image itself. But generally, putting an indication of the source within the image at a larger size is not a bad idea.
There are also various research projects that "poison" images with additions meant to confuse AI training, such as the projects Glaze and Nightshade of the University of Chicago.
Add legal info
It does not keep any bot from crawling, but for any legal issue it might be a good idea to add a notice on your legal/imprint page prohibiting usage of your content by bots. Some jurisdictions have provisions for this within their laws (e.g. EU/German intellectual property rights).
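A hedged example of such a notice (the wording is only an illustration, not legal advice):

```
The use of any content of this site for text and data mining or for
training AI/machine learning models is expressly prohibited. This is
an opt-out in the sense of e.g. § 44b UrhG (Germany) and Art. 4 of the
EU DSM directive.
```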
Extra note: Social media
Note that if you post your images on social media platforms, you may grant rights to use them for AI training. For example, X, formerly known as Twitter, has such a condition in its terms of service.
Conclusion
The only really secure way to protect images on the net against AI or other scraping is not to put them on the net at all.
Sources and recommended reading
This article only scratches the surface of the issue, so here are a few more:
This text by www.zenphoto.org is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Code examples are released under the GPL v2 or later license
For questions and comments please use the forum or discuss on the social networks.