Protecting your site from AI scraping March 26, 2024
AI has been a frequent topic since last year. A constant problem is AI bots using – some may call it exploiting – text and images from any public site for training without actual consent, which naturally collides with intellectual property rights. The EU even passed a new law regarding this issue last year.
This article focuses on the ways available to protect your site.
Make your site non public via htaccess
This is actually the only really reliable way to cover all the different AI bots. Using Zenphoto's gallery password protection is an option, but it does not protect folders on the server level, so bots may bypass it.
Using htaccess to password protect your site, however, works on the server level. The drawback is of course that this blocks everything and everyone, including search engines and the visitors you actually want on your site.
There are lots of tools and descriptions out there to help you, such as: https://www.web2generators.com/apache-tools/htpasswd-generator
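As a sketch, an htaccess fragment for basic authentication could look like this (the path to the .htpasswd file is only a placeholder and must match your server):

```
AuthType Basic
AuthName "Restricted site"
# Placeholder path – point this to the .htpasswd file you generated
AuthUserFile /full/server/path/to/.htpasswd
Require valid-user
```

The .htpasswd file itself holds the username/password pairs, which a generator like the one linked above can create for you.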
Block bots via htaccess
This actually works pretty well, but you need to know the name of every bot you want to block. Sadly, lots of new AI tools appear every day. An htaccess example for a few known bots looks like this:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (CCBot|ChatGPT|GPTBot|anthropic-ai|Omgilibot|Omgili|FacebookBot) [NC]
RewriteRule ^ - [F]
You can add that to Zenphoto's htaccess. We have not included it by default because it is likely to require frequent changes.
Block bots via robots.txt
You can also advise bots via robots.txt, just as with search engines. Here is an example with some known bots:
User-agent: CCBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Omgilibot
Disallow: /
User-agent: Omgili
Disallow: /
User-agent: FacebookBot
Disallow: /
We have not included this either because it is likely to require frequent updates.
Also note that robots.txt is only a recommendation and no bot has to comply with it – even the big search engines follow it voluntarily.
Spawning's ai.txt
There is also the initiative Spawning that aims to create tools to control AI access to your content by opting out. There you can add yourself to a registry and create a specific ai.txt file, which is similar to robots.txt: https://site.spawning.ai/spawning-ai-txt#ai-text-generator.
Sadly it is more a concept of an opt-out infrastructure, and so far only two AI companies seem to follow the registry. And the ai.txt is no more secure a way of protection than a robots.txt, as no one has to comply with it.
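For illustration, an ai.txt that opts out all content might look similar to the following (the exact syntax is defined by Spawning's generator and may change, so verify it there):

```
# ai.txt – opt out of AI training for everything
User-Agent: *
Disallow: /
```

Spawning's generator can also produce more fine-grained rules, e.g. opting out only images or only text.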
Use html meta for blocking AI
You can add something like this code to the HTML head of your site:
<meta name="robots" content="noai, noimageai">
The html_meta_tags plugin included in Zenphoto 1.6.1 now provides options for this.
This, too, is only a recommendation and no bot has to comply with it.
Exif/IPTC: Use copyright notes
Some sites also recommend indicating via EXIF/IPTC metadata that your images are copyrighted and that AI usage is not allowed. There are plans to add an extra IPTC field specifically for AI. Note that only the Imagick library can preserve metadata when resizing and processing images!
Of course metadata will not make a bot comply, but it may be useful in legal conflicts at some point.
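As an example sketch, such a notice could be written with the exiftool command line tool (the wording and file name are only placeholders):

```shell
# Write a copyright notice into the EXIF and IPTC fields of an image
exiftool -EXIF:Copyright="© 2024 Example Name – AI training not permitted" \
         -IPTC:CopyrightNotice="© 2024 Example Name – AI training not permitted" \
         image.jpg
```

Remember that this metadata only survives Zenphoto's image processing when using the Imagick library.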
Image modifications
Another recommendation is to use a watermark on the images themselves. Of course a prominent watermark may disturb the image itself. But generally, putting an indication of the source within the image at a larger size is not a bad idea.
There are also various research projects that "poison" images with additions meant to confuse AI training, such as the projects Glaze and Nightshade of the University of Chicago.
Add legal info
It does not keep any bot from crawling, but for any legal issue it might be a good idea to add a notice on your legal/imprint page prohibiting usage of your content by bots. Some jurisdictions have provisions for this within their laws (e.g. EU/German intellectual property rights).
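A hedged example of such a notice (the wording is only an illustration, not legal advice):

```
The use of any content of this site for text and data mining or for
training AI/machine learning models is expressly prohibited. This is
an opt-out in the sense of e.g. § 44b UrhG (Germany) and Art. 4 of the
EU DSM directive.
```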
Extra note: Social media
Note that if you post your images on social media platforms, you may grant rights to use them for AI training. For example, X, formerly known as Twitter, has such a condition in its terms of service.
Conclusion
The only really secure way to protect images on the net against AI or other scraping is not to put them on the net at all.
Sources and recommended reading
This article only scratches the surface of the issue, so here are a few more:
This text by www.zenphoto.org is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Code examples are released under the GPL v2 or later license
For questions and comments please use the forum or discuss on the social networks.