Update, January 2024: A new version of the AI image detector described in this article has been released, expanding its scope to include non-artistic imagery and bringing it up to date with more recent image generation models, such as SDXL:
https://huggingface.co/Organika/sdxl-detector
The open-source release of Stable Diffusion, a text-to-image tool developed by a group of computer vision researchers with funding from Stability.ai, has created boom times for “AI art”. There have been other similar tools (such as VQGAN+CLIP, Disco Diffusion, DALLE-2, Midjourney, and Craiyon), but Stable Diffusion is the first free and open-source project that can produce convincing, highly detailed images from user-supplied prompts. It is a tremendous gift to the “creatively constipated” masses (as Stability.ai founder Emad Mostaque has called them) who lack the skills, tools, and time traditionally needed to produce visual art.
However, the open release of Stable Diffusion has not been good news for a different group: working artists, graphic designers, and illustrators. The output of DALLE-2 may be more passable as plausibly human-made, but Stable Diffusion comes close, and it is free. This creates price competition that trained professional human artists can scarcely hope to match, and in response to their outcry, some online art communities have moved to ban AI art from their sites.
Such restrictions will be difficult to enforce, which creates a technical challenge I thought might be interesting to explore: could a machine learning model be used to detect AI-generated art? Kaggle hosted a competition in 2020 to detect AI deepfakes, with some success, and image classification models are used to identify an astonishing variety of things, ranging from diseases in plants to NSFW content in online image posts. Having watched text-to-image AI emerge over the past few years, I have personally learned to quickly spot the subtle visual artifacts and “tells” that give away an AI-generated image (in fact, the trippy aesthetic of those artifacts is part of what drew me to these tools). Could an image classification model, trained on a sufficiently large sample of images labeled “artificial” or “human”, learn to do the same thing?
I decided to start by re-purposing a framework developed by a Reddit user for training an NSFW detection bot to assist subreddit moderators. An advantage of their repository was that it included an easy-to-follow Colab notebook for training the model on one’s own data. Using an online tool called Reddit Downloader, I quickly collected thousands of images from “traditional” art subreddits (e.g. r/art, r/painting, r/learntodraw) as well as explicitly AI-generated art subreddits (e.g. r/bigsleep, r/midjourney, and r/stablediffusion). I labeled the images rather naively as either “human” or “artificial” depending on the subreddit they came from.
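For anyone who wants to reproduce that labeling step, a minimal sketch is below. The folder names and file extensions are assumptions about how Reddit Downloader organizes its output, not a literal copy of my scripts.

```python
from pathlib import Path
import shutil

# Assumed layout: Reddit Downloader saves each subreddit's images into its
# own folder, e.g. downloads/art, downloads/midjourney (this is an assumption).
DOWNLOADS = Path("downloads")
DATASET = Path("dataset")

# Naive labeling: the subreddit an image came from decides its class.
LABELS = {
    "art": "human",
    "painting": "human",
    "learntodraw": "human",
    "bigsleep": "artificial",
    "midjourney": "artificial",
    "stablediffusion": "artificial",
}

for subreddit, label in LABELS.items():
    out_dir = DATASET / label
    out_dir.mkdir(parents=True, exist_ok=True)
    for img in (DOWNLOADS / subreddit).glob("*.jpg"):  # add *.png etc. as needed
        # Prefix filenames with the subreddit to avoid collisions.
        shutil.copy(img, out_dir / f"{subreddit}_{img.name}")
```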
Training the model using this dataset was quick and easy thanks to the underlying FastAI library, and resulted in a model that purported to have about 80% accuracy — not too bad compared to the leaderboards of the Deepfake detection challenge.
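The repository’s Colab notebook handles the training for you; the snippet below is just a rough sketch of the core FastAI calls against my folder layout, not the repo’s actual code, and the resnet34 backbone is my own assumption.

```python
from fastai.vision.all import *

# Labels come from the parent folder names: dataset/human and dataset/artificial.
dls = ImageDataLoaders.from_folder(
    Path("dataset"),
    valid_pct=0.2, seed=42,        # hold out 20% of the images for validation
    item_tfms=Resize(224),         # uniform input size
    batch_tfms=aug_transforms(),   # light augmentation
)

learn = vision_learner(dls, resnet34, metrics=accuracy)  # small pretrained CNN (assumed)
learn.fine_tune(4)
learn.export("ai_art_detector.pkl")  # exported for use in the bot below
```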
To test it in practice, I dropped the model into a modified clone of the Reddit NSFW detection bot and started running the bot on r/SubSimGPT2Interactive, a subreddit which features AI-powered bots that can make image posts. Most of these images are retrieved by the bots via Bing search, but some bots generate their own original images using tools like Stable Diffusion. The humans on this sub tend to be AI enthusiasts, and they sometimes post AI-generated images as well. This makes it a great playground for testing AI art detection “in the wild”.
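The real bot is just that modified clone, so the following is only a minimal sketch of its inference loop, written with PRAW and the exported FastAI model from above; the credentials are placeholders and the reply text is invented for illustration.

```python
import io

import praw
import requests
from fastai.vision.all import PILImage, load_learner

# Placeholder credentials; the real bot's configuration differs.
reddit = praw.Reddit(
    client_id="...", client_secret="...", user_agent="ai-art-detector",
    username="...", password="...",
)
learn = load_learner("ai_art_detector.pkl")

for submission in reddit.subreddit("SubSimGPT2Interactive").stream.submissions(skip_existing=True):
    if not submission.url.lower().endswith((".jpg", ".jpeg", ".png")):
        continue  # only score direct image posts
    img_bytes = requests.get(submission.url, timeout=30).content
    pred, _, probs = learn.predict(PILImage.create(io.BytesIO(img_bytes)))
    if pred == "artificial":
        submission.reply(f"This image looks AI-generated (p={probs.max().item():.2f}).")
```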
The results of this initial effort were somewhat disappointing. The bot did correctly identify some AI-generated images, but it also erroneously flagged many that were not. The false positives were often recognizably “computer-generated” graphics, e.g. screenshots or memes, which made sense: the training dataset did not include images of this type, since my focus was on differentiating AI-generated images from art manually created by humans. However, there were also some false negatives, AI-generated images that the model failed to recognize as such. One subreddit user commented that it seemed as though the bot was “just making random numbers up”, leading me to conclude that there was room for improvement.
Computer vision progresses with every passing day, and I wanted to make sure I wasn’t overlooking newer models I hadn’t yet studied. This made me curious about the image classification support that Hugging Face had recently added to its AutoTrain service. AutoTrain trains a variety of candidate models appropriate to the nature of your dataset, allowing you to choose the one that best fits your data according to whichever metric you prefer.
At first, AutoTrain essentially rejected my dataset, so I did some work to clean it up, converting all of the images to the same size and aspect ratio with ImageMagick scripts. Based on feedback from a friend, I also re-gathered the “human” category, being careful to only include images posted before 2019 (when text-to-image really took off), so as to avoid accidentally including AI-generated images mis-labeled as human art. I stuck with my naive labeling strategy based on subreddit topic, however, and chose not to address the false positives associated with noticeably computer-generated, though not AI-generated, imagery. The use case in which I envision such a tool being deployed is art-specific, and no matter how good the model, I would expect human moderators to review posts flagged as AI-generated, so false positives are preferable to false negatives.

My cleaned-up dataset was accepted by AutoTrain; however, it was too large to run the job for free. I decided it would be a worthwhile investment to upgrade to a Hugging Face Pro account and pay the excess-data surcharge so that I could continue with my experiment.
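For reference, the resize-and-crop cleanup amounted to a couple of ImageMagick one-liners; a rough Pillow equivalent is sketched below, with a 512×512 target size assumed purely for illustration.

```python
from pathlib import Path
from PIL import Image

SIZE = 512  # assumed target size; the actual scripts used ImageMagick

for path in Path("dataset").rglob("*.jpg"):  # extend the glob for other formats
    with Image.open(path) as src:
        im = src.convert("RGB")  # also forces the pixel data to load
    # Center-crop to a square so every image ends up with the same aspect ratio...
    side = min(im.size)
    left = (im.width - side) // 2
    top = (im.height - side) // 2
    im = im.crop((left, top, left + side, top + side))
    # ...then resize to a uniform resolution and overwrite the file in place.
    im.resize((SIZE, SIZE), Image.LANCZOS).save(path, quality=95)
```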
Actually running AutoTrain was fast and easy, and it was kind of fun to watch the model accuracy metrics update in real time as the job progressed, like a machine learning horse race. Ultimately the best-fitting model had a reported loss of 0.163, the worst had 0.209, and the rest fell in between. Other validation metrics are visualized below.
So now I had a shiny new image classification model that purported to identify AI-generated images. Cool! But what kind of model was it? The model card automatically generated by AutoTrain only described it by the problem type, “binary classification” (and informed me how many grams of CO2 I had emitted in defense of human artists). I knew it was a PyTorch model, but that could mean a wide variety of things. Eventually, by reading the config.json file stored with the model (which had to be downloaded, since for some reason it was stored with Git LFS despite being a relatively small text file), I was able to identify it as a “SwinForImageClassification” model. The Hugging Face documentation for this model provided some helpful background, including a link to the original paper by Liu et al. proposing the Swin Transformer architecture (published in 2021, three years after I audited that course on computer vision).
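If you would rather not dig through the raw JSON, transformers can report the same information directly; the repository id below is a placeholder for whatever name AutoTrain gave your model.

```python
from transformers import AutoConfig

# Placeholder repo id; substitute the one AutoTrain created for you.
config = AutoConfig.from_pretrained("your-username/autotrain-ai-art-detector")

print(config.architectures)  # e.g. ['SwinForImageClassification']
print(config.id2label)       # class-label mapping, e.g. {0: 'artificial', 1: 'human'}
```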
It was also dead simple to create a Gradio demo of the model, although the performance of both the demo and the inference API sandbox built into the model’s page on the Hugging Face Hub was very spotty, probably because a major update to Hugging Face’s Diffusers library had been released the day I was testing and hordes of users were overwhelming the servers.
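The demo itself only needs a few lines. This is a generic sketch rather than the exact demo I published, and the model id is again a placeholder.

```python
import gradio as gr
from transformers import pipeline

# Placeholder repo id for the AutoTrain-generated Swin classifier.
classifier = pipeline("image-classification", model="your-username/autotrain-ai-art-detector")

def detect(image):
    # Return {label: score}, which gr.Label renders as ranked confidence bars.
    return {p["label"]: p["score"] for p in classifier(image)}

demo = gr.Interface(
    fn=detect,
    inputs=gr.Image(type="pil"),
    outputs=gr.Label(num_top_classes=2),
    title="AI art detector (sketch)",
)
demo.launch()
```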
As with the previous FastAI model, the Swin model did not appear to fare as well “in the wild” as it did in the AutoTrain validation tests. Some of this was the same out-of-domain difficulty with screenshots, memes, and other “captioned” images I noticed before, which I’m chalking up as a known limitation of the model for now. However, I also noticed some false negatives on very convincing images generated with Stable Diffusion and DALLE-2. These models hadn’t been around very long, so their outputs made up only a modest share of my training data. My model would probably be better at differentiating, say, VQGAN-generated imagery (or DALLE-mini output) from real paintings.
After I shared the demo with some artist friends and AI enthusiasts, another quirk became apparent: AI-generated images could sneak past the model after being cropped. In one case this was because the cropping removed the visible watermark that DALLE-2 adds to the lower-right corner of all its outputs. The vision transformer had clearly learned to treat that watermark as a feature indicating an AI-generated image, though that is of little added value to a human reviewer, who would recognize it without the help of an automated tool. The same effect was observed with images generated by Stable Diffusion, which by default embeds an invisible watermark (though it is easily removed, as it is in some of the most popular forks of the original codebase). It’s possible that the Swin model had picked up on this watermark even though it is explicitly designed not to be recognizable to human eyes, and that cropping disrupted its ability to rely on that hidden feature to identify AI-generated art. However, it’s worth noting that many images in my training dataset had also been automatically cropped by my pre-processing scripts, so watermark removal doesn’t completely explain why cropping would increase the chance of false negatives when no watermark is present.
I have another theory about the “cropping effect”: in my experience, image-generation models tend to deteriorate and produce artifacts near the edges of an image. This has always been the case, and it is most visible in samples from the viral website ThisPersonDoesNotExist.com, which serves up fake people from an older, yet still very impressive, model called StyleGAN. Refresh the page enough times and you are very likely to get images like the one below. Stable Diffusion is not immune to this problem either, and cropping might remove enough of those edge artifacts for an image to slip past the detector.
Additional testing revealed that digital image processing filters could trigger false positives. This suggests that the subtle artifacts the Swin transformer picks up on in AI art are similar to those produced by some (though perhaps not all) filters. If tools like this are widely deployed by artist communities to screen out AI-generated images, artists may need to limit their use of filters to avoid being falsely flagged.
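Both quirks are easy to probe with a few lines of code. In the sketch below, the model id and the “artificial” label name are placeholders, and the specific crop margin and smoothing filter are arbitrary choices rather than the exact transformations my testers used.

```python
from PIL import Image, ImageFilter
from transformers import pipeline

# Placeholder repo id and label name; adjust to match your own model.
classifier = pipeline("image-classification", model="your-username/autotrain-ai-art-detector")

def p_artificial(im):
    return next(p["score"] for p in classifier(im) if p["label"] == "artificial")

im = Image.open("sample.png").convert("RGB")  # any test image

# 1. Crop away a 10% border, which in my testing tended to lower scores on AI images.
dx, dy = im.width // 10, im.height // 10
cropped = im.crop((dx, dy, im.width - dx, im.height - dy))

# 2. Apply a smoothing filter, which tended to raise scores on human-made art.
filtered = im.filter(ImageFilter.SMOOTH_MORE)

for name, variant in [("original", im), ("cropped", cropped), ("filtered", filtered)]:
    print(f"{name}: p(artificial) = {p_artificial(variant):.3f}")
```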
This is probably as far as I will take this experiment on my own time and dime. If greater resources were at my disposal, I would probably focus on Stable Diffusion and build a much larger training dataset using Open Prompts, a dataset of 10 million prompts and generated images that has been used to create krea.ai and lexica.art, along with some subset of the LAION data used to train the Stable Diffusion model. Presumably an image classification model trained on millions of images would outperform my current results based on a few thousand examples scraped from Reddit, but I have neither the cash nor the computers to find out. Perhaps my experiment will inspire someone else, or some organization, to investigate further… and I’d be happy to collaborate with them if they’re interested!
Many thanks go to @vertabia and @devoidgazer for testing the AI art detection tool and noting the quirks described above.