“Wait, not like that”: Free and open access in the age of generative AI
The real threat isn’t AI using open knowledge — it’s AI companies killing the projects that make knowledge free


The visions of the open access movement have inspired countless people to contribute their work to the commons: a world where “every single human being can freely share in the sum of all knowledge” (Wikimedia), and where “education, culture, and science are equitably shared as a means to benefit humanity” (Creative Commons[a]).
But there are scenarios that can introduce doubt for those who contribute to free and open projects like the Wikimedia projects, or who independently release their own works under free licenses. I call these “wait, no, not like that” moments.
When a passionate Wikipedian discovers their carefully researched article has been packaged into an e-book and sold on Amazon for someone else’s profit? Wait, no, not like that.
When a developer of an open source software project sees a multi-billion dollar tech company rely on their work without contributing anything back? Wait, no, not like that.
When a nature photographer discovers their freely licensed wildlife photo was used in an NFT collection minted on an environmentally destructive blockchain? Wait, no, not like that.
And perhaps most recently, when a person who publishes their work under a free license discovers that work has been used by tech mega-giants to train extractive, exploitative large language models? Wait, no, not like that.
These reactions are understandable. When we freely license our work, we do so in service of those goals: free and open access to knowledge and education. But when trillion-dollar companies exploit that openness while giving nothing back, or when our work enables harmful or exploitative uses, it can feel like we’ve been naïve. The natural response is to try to regain control.
This is where many creators find themselves today, particularly in response to AI training. But the solutions they're reaching for — more restrictive licenses, paywalls, or not publishing at all — risk destroying the very commons they originally set out to build.
The first impulse is often to try to tighten the licensing, maybe by switching to something like Creative Commons’ non-commercial (and thus non-free) licenses. When NFTs enjoyed a moment of popularity in the early 2020s, some artists looked to Creative Commons in hopes that they might declare NFTs fundamentally incompatible with their free licenses (they didn’t[1]). The same thing happened again with the explosion of generative AI companies training models on CC-licensed works, and some were disappointed to see the group take the stance that not only do CC licenses not prohibit AI training wholesale, but that AI training should be considered non-infringing by default from a copyright perspective.[2]
But the trouble with trying to continually narrow the definition of “free” is that it is impossible to write a license that perfectly prohibits each possibility that makes a person go “wait, no, not like that” while retaining the benefits of free and open access. If that is truly what a creator wants, they are likely better served by a traditional, all-rights-reserved model in which any prospective reuser must individually negotiate terms with them; but this undermines the purpose of free licensing, and restricts permitted reuse to those with the time, means, and bargaining power to negotiate on a case-by-case basis.[b]
Particularly with AI, there’s also no indication that tightening the license even works. We already know that major AI companies have been training their models on all-rights-reserved works in their ongoing efforts to ingest as much data as possible. Such training may yet prove permissible in US courts under fair use, and it’s probably best that it does.[3][4][5][6]
There’s also been an impulse by creators concerned about AI to dramatically limit how people can access their work. Some artists have decided it’s simply not worthwhile to maintain an online gallery of their work when that makes it easily accessible for AI training. Many have implemented restrictive content gates — paywalls, registration-walls, “are you a human”-walls, and similar — to try to fend off scrapers. This too closes off the commons, making it more challenging or expensive for the “every single human being” described in open access manifestos to access material that was originally intended to be a common good.
Often, by trying to wall off those considered to be bad actors, people wall off the very people they intended to give access to. People who gate their work behind paywalls likely didn’t set out to create works that only the wealthy could access. People who implement registration walls probably didn’t intend for their work to be available only to those willing to risk incessant email spam after relinquishing their personal information. People who try to stave off bots with CAPTCHAs asking “are you a human?” probably didn’t mean to limit their material to abled people[7] willing to abide ever more protracted and irritating riddles.[8] And people using any of these strategies likely didn’t want their work to be hard to even find in the first place, after the paywalls and regwalls and anti-bot mechanisms thwarted search engine indexers and social media previews.
And frankly, if we want to create a world in which every single human being can freely share in the sum of all knowledge, and where education, culture, and science are equitably shared as a means to benefit humanity, we should stop attempting to erect these walls. If a kid learns that carbon dioxide traps heat in Earth's atmosphere or how to calculate compound interest thanks to an editor’s work on a Wikipedia article, does it really matter if they learned it via ChatGPT or by asking Siri or from opening a browser and visiting Wikipedia.org?
Instead of worrying about “wait, not like that”, I think we need to reframe the conversation to “wait, not only like that” or “wait, not in ways that threaten open access itself”. The true threat from AI models training on open access material is not that more people may access knowledge thanks to new modalities. It’s that those models may stifle Wikipedia and other free knowledge repositories, benefiting from the labor, money, and care that go into supporting them while also bleeding them dry. It’s that trillion-dollar companies become the sole arbiters of access to knowledge after subsuming the painstaking work of those who made knowledge free to all, killing those projects in the process.
Irresponsible AI companies are already imposing huge loads on Wikimedia infrastructure, which is costly not only in raw bandwidth, but also because it requires dedicated engineers to maintain and improve systems to handle the massive automated traffic. And AI companies that do not attribute their responses or otherwise provide any pointers back to Wikipedia prevent users from knowing where that material came from, and do not encourage those users to visit Wikipedia, where they might then sign up as an editor, or donate after seeing a request for support. (This is most AI companies, by the way. Many AI “visionaries” seem perfectly content to promise that artificial superintelligence is just around the corner, yet claim that attribution is somehow a permanently unsolvable problem.)
And while I rely on Wikipedia as an example here, the same goes for any website containing freely licensed material, where scraping benefits AI companies, often at extreme cost to the content hosts. This isn’t just about strain on one individual project; it’s about the systematic dismantling of the infrastructure that makes open knowledge possible.
Anyone at an AI company who stops to think for half a second should be able to recognize they have a vampiric relationship with the commons. While they rely on these repositories for their sustenance, their adversarial and disrespectful relationships with creators reduce the incentives for anyone to make their work publicly available going forward (freely licensed or otherwise). They drain resources from maintainers of those common repositories often without any compensation. They reduce the visibility of the original sources, leaving people unaware that they can or should contribute towards maintaining such valuable projects. AI companies should want a thriving open access ecosystem, ensuring that the models they trained on Wikipedia in 2020 can be continually expanded and updated. Even if AI companies don’t care about the benefit to the common good, it shouldn’t be hard for them to understand that by bleeding these projects dry, they are destroying their own food supply.
And yet many AI companies seem to give very little thought to this, seemingly looking only at the months in front of them rather than operating on years-long timescales. (Though perhaps anyone who has observed AI companies’ activities more generally will be unsurprised to see that they do not act as though they believe their businesses will be sustainable on the order of years.)
It would be very wise for these companies to immediately begin prioritizing the ongoing health of the commons, so that they do not wind up strangling the goose that lays their golden eggs. It would also be very wise for the rest of us not to rely on AI companies suddenly, miraculously coming to their senses or developing a conscience en masse.
Instead, we must ensure that mechanisms are in place to force AI companies to engage with these repositories on their creators' terms.
There are ways to do this. One is a model like Wikimedia Enterprise, which welcomes AI companies to use Wikimedia-hosted data, but requires them to do so through paid, high-volume pipes, ensuring both that they do not clog up the system for everyone else and that they financially support the extra load they place on the project’s infrastructure. Creative Commons is experimenting with the idea of “preference signals” — a non-copyright-based model by which to communicate to AI companies and other entities the terms on which they may or may not reuse CC-licensed work.[c] And everyday people need to be given the tools — both legal and technical — to enforce their own preferences around how their works are used.
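To make “preference signals” a little more concrete: the closest widely deployed signal today is robots.txt. The sketch below is purely illustrative and is not Creative Commons’ actual proposal; the crawler tokens (OpenAI’s GPTBot, Common Crawl’s CCBot, Google’s Google-Extended) are real, but the policy shown is a hypothetical example.

```python
# Illustrative sketch only: robots.txt is today's closest analogue to a
# machine-readable "preference signal". The user-agent tokens below are
# real AI crawlers; the policy itself is a hypothetical example.
from urllib import robotparser

EXAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())

# A well-behaved crawler checks the policy before fetching:
print(parser.can_fetch("GPTBot", "https://example.org/essay"))       # False
print(parser.can_fetch("SomeBrowser", "https://example.org/essay"))  # True
```

The catch, of course, is that nothing in such a file enforces itself: a scraper that ignores it faces no technical barrier. Richer signals face the same question, which is why enforcement is the crux.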
Some might argue that if AI companies are already ignoring copyright and training on all-rights-reserved works, they'll simply ignore these mechanisms too. But there's a crucial difference: rather than relying on murky copyright claims or threatening to expand copyright in ways that would ultimately harm creators, we can establish clear legal frameworks around consent and compensation that build on existing labor and contract law. Just as unions have successfully negotiated terms of use, ethical engagement, and fair compensation in the past, collective bargaining can help establish enforceable agreements between AI companies, those freely licensing their works, and communities maintaining open knowledge repositories. These agreements would cover not just financial compensation for infrastructure costs, but also requirements around attribution, ethical use, and reinvestment in the commons.
The future of free and open access isn’t about saying “wait, not like that” — it’s about saying “yes, like that, but under fair terms”. With fair compensation for infrastructure costs. With attribution and avenues by which new people can discover and give back to the underlying commons. With deep respect for the communities that make the commons — and the tools that build off them — possible. Only then can we truly build that world where every single human being can freely share in the sum of all knowledge.
As I was writing this piece, I discovered that a SXSW panel featuring delegates from the Wikimedia Foundation and Creative Commons, titled “Openness Under Pressure: Navigating the Future of Open Access”, discussed some of the same topics. (I was, sadly, scheduled to speak at the same time and so was unable to attend in person.) The audio recording is available online, and I would highly recommend giving it a listen if this topic interests you!
Footnotes
a. Creative Commons is a non-profit that releases the Creative Commons licenses: easily reusable licenses that broadly release some rights so that anyone can share and/or build upon the works under specified terms.
b. However, these restrictive licenses cut both ways. The more restrictive the license on your work, the more incentive for powerful entities to bargain your own rights away from you. For example: when I agree to restrictive licensing terms in freelance writing contracts, I am often prohibited from republishing my own writing later on (e.g. in an anthology of my work) or sharing it with others (such as with my readers who have not purchased access to a paywalled publication).
c. This is somewhat similar to my approach with Web3 is Going Great, which I published under a CC BY 3.0 license while also separately stating that I do not wish for the content to be reused in NFT or other crypto projects. The question here will come down to enforceability: frankly, I do not think this is a problem we can solve by simply asking AI companies nicely and hoping they are generous enough to comply with our requests. Many of these companies have shown nothing but contempt for those whose works they have used without consent to train their models, and I don’t see how that will suddenly change. If CC can establish a way to communicate these preferences and for creators to subsequently enforce them, I will be very interested.
References
1. “FAQ: CC and NFTs”, Creative Commons.
2. “Should CC-Licensed Content be Used to Train AI? It Depends.”, Creative Commons.
3. “Comment from Creative Commons”, published by the US Copyright Office.
4. “AI And Copyright: Expanding Copyright Hurts Everyone—Here’s What to Do Instead”, EFF.
5. “Neither the devil you know nor the devil you don’t”, Cory Doctorow.
6. “If Creators Suing AI Companies Over Copyright Win, It Will Further Entrench Big Tech”, TechDirt.
7. “Inaccessibility of CAPTCHA”, W3C.
8. “You’re not imagining it, Captchas are getting harder”, The Times.