Protecting your site from AI scraping

    AI has been a frequent topic since last year. A recurring problem is that AI bots use (some may call it exploit) text and images from any public site for training without actual consent, which naturally collides with intellectual property rights. The EU even passed a new law regarding this issue last year.

    This article focuses only on the ways that are available to protect your site.

    Make your site non-public via htaccess

    This is actually the only really reliable way to cover all the different AI bots. Using Zenphoto's gallery protection is one option, but it does not protect folders at the server level, so bots may bypass it.

    Password protecting your site via htaccess, however, works at the server level. The drawback is of course that this blocks everything and everyone, including search engines and the visitors you actually want on your site.

    There are lots of tools and guides out there to help you, for example: https://www.web2generators.com/apache-tools/htpasswd-generator

    Block bots via htaccess

    This actually works pretty well, but you need to know the name of every bot you want to block. Sadly, lots of new AI tools appear every day. An htaccess example for a few known bots looks like this:

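    RewriteEngine On
    # Match any of these bot names anywhere in the User-Agent header; [NC] ignores case
    RewriteCond %{HTTP_USER_AGENT} (CCBot|ChatGPT|GPTBot|anthropic-ai|Omgilibot|Omgili|FacebookBot) [NC]
    # Leave the URL unchanged ("-") and answer with 403 Forbidden ([F])
    RewriteRule ^ - [F]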

    You can add that to Zenphoto's htaccess. We have not included this because it is likely to require frequent changes.
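
    Before deploying such a rule, it can help to check which User-Agent strings the pattern would actually catch. The following is a minimal Python sketch, not part of Zenphoto: it reuses the alternation from the RewriteCond above and assumes that Python's regex engine behaves like Apache's for a pattern this simple. The function name and the sample User-Agent strings are made up for illustration.

```python
import re

# Same alternation as in the RewriteCond pattern; re.IGNORECASE mirrors the [NC] flag.
BLOCKED_AGENTS = re.compile(
    r"(CCBot|ChatGPT|GPTBot|anthropic-ai|Omgilibot|Omgili|FacebookBot)",
    re.IGNORECASE,
)


def would_be_blocked(user_agent: str) -> bool:
    """Return True if the pattern matches anywhere in the User-Agent string."""
    return BLOCKED_AGENTS.search(user_agent) is not None


# Hypothetical User-Agent strings, invented for this example only.
samples = [
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "ccbot/2.0",
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0",
]

for ua in samples:
    print(ua, "->", "blocked" if would_be_blocked(ua) else "allowed")
```

    Keep in mind that a bot can send any User-Agent it likes, so this only shows what the rule does for bots that identify themselves honestly.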

    Block bots via robots.txt

    You can also instruct bots via robots.txt, just like with search engines. Here is an example with some known bots:

    User-agent: CCBot
    Disallow: /
    User-agent: ChatGPT-User
    Disallow: /
    User-agent: GPTBot
    Disallow: /
    User-agent: Google-Extended
    Disallow: /
    User-agent: anthropic-ai
    Disallow: /
    User-agent: Omgilibot
    Disallow: /
    User-agent: Omgili
    Disallow: /
    User-agent: FacebookBot
    Disallow: /

    We have not included this either because it is likely to require frequent updates.

    Keep in mind that robots.txt is only a recommendation and no bot has to comply with it. The big search engines do, but that is voluntary as well.
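
    To see how a compliant crawler would read such rules, you can feed them to the robots.txt parser from Python's standard library. This is only a sketch: the rules are a shortened excerpt of the example above, the helper name is made up, and https://example.com/albums/ is a placeholder URL.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the robots.txt example, in the same layout.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(agent: str, url: str = "https://example.com/albums/") -> bool:
    """Return True if the parsed rules allow this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


for agent in ("GPTBot", "CCBot", "Googlebot"):
    print(agent, "->", "allowed" if is_allowed(agent) else "disallowed")
```

    This only tells you how a well-behaved parser interprets the file. It says nothing about whether a given bot actually honors it.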

    Spawning's ai.txt

    There is also the Spawning initiative, which aims to build tools that let you block or consent to AI access, mainly by opting out. There you can add yourself to a registry and create a specific ai.txt, which is similar to robots.txt: https://site.spawning.ai/spawning-ai-txt#ai-text-generator

    Sadly it is still more a concept of an opt-out infrastructure, and so far only two AI companies seem to follow that registry. And the ai.txt is no more secure a form of protection than a robots.txt, as no one has to comply with it.

    Use HTML meta tags for blocking AI

    You can add something like this to the HTML head of your site:

    <meta name="robots" content="noai, noimageai">

    The html_meta_tags plugin included in Zenphoto 1.6.1 now includes options for this.

    This is a recommendation and no bot has to comply with it.
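
    If you want to confirm that a page really emits the tag, a small script can extract the robots directives from its HTML. This is a minimal sketch using only Python's standard library. The class and function names are made up for illustration, and the page string stands in for HTML you would fetch from your own site.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma separated; normalize whitespace and case.
        self.directives.update(
            part.strip().lower() for part in content.split(",") if part.strip()
        )


def robots_directives(html: str) -> set:
    """Return the set of robots directives found in the given HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Placeholder page; replace with the HTML of one of your own pages.
page = '<html><head><meta name="robots" content="noai, noimageai"></head><body></body></html>'
found = robots_directives(page)
print("noai" in found, "noimageai" in found)
```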

    EXIF/IPTC: Use copyright notices

    Some sites also recommend indicating via EXIF/IPTC metadata that your images are copyrighted and that AI usage is not allowed. There are plans to add an extra IPTC field specifically for AI. Note that only the Imagick library can preserve metadata when resizing and processing images!

    Of course metadata will not make a bot comply, but it may be useful in a legal dispute at some point.

    Image modifications

    Another recommendation is to put a watermark on the images themselves. Of course a dominant watermark may spoil the image. But in general, indicating the source within the larger image sizes is not a bad idea.

    There are also various research projects that "poison" images with additions meant to confuse AI usage, such as the Glaze and Nightshade projects of the University of Chicago.

    Add legal info

    This does not keep any bot from crawling, but for any legal issue it might be a good idea to add a notice to your legal/imprint page that prohibits the use of your content by bots. Some jurisdictions have provisions for this in their laws (e.g. EU/German intellectual property law).

    Extra note: Social media

    Note that if you post your images on social media platforms, you may be granting them the right to use these for AI training. For example, X (formerly known as Twitter) has this condition in its terms of service.

    Conclusion

    The only really secure way to protect images on the net against AI or other scraping is not to put images on the net.

    Sources and recommended reading

    This article only scratches the surface of the issue, so here are a few more sources:

    This text by www.zenphoto.org is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

    Code examples are released under the GPL v2 or later license

    For questions and comments please use the forum or discuss on the social networks.
