OSI's Open Source AI Definition v1.0
Lars Marowsky-Brée · November 02, 2024 · #Open Source #Policy #AI

Introduction
Artificial Intelligence (AI) and Machine Learning (ML) systems, such as Generative AI and Large Language Models (LLMs), have, for better or worse, a growing impact on critical societal outcomes, from healthcare and finance to public safety in general (consider self-driving cars), as well as on our everyday lives as individuals.
Without openness and transparency, AI could perpetuate unchecked biases that shape public policy, healthcare, and legal outcomes. Without openness and the ability to modify the systems that influence and run our lives, we are robbed of freedom and agency.
While the jury is still out on whether they can deliver on their promises in the real world, there is no denying that AI/ML/LLM-based systems are replacing, or at the very least augmenting, traditional Information Technology systems.
For the traditional IT world, we now have the familiar and highly successful paradigms of Free, Libre, and Open Source Software (FLOSS) that strive to ensure our rights and abilities not just to Use, but also to Study, Modify, and Share these systems.
For individuals and society at large not to be at the questionable mercy of inscrutable black boxes (often proprietary, and under the control of for-profit businesses), such rights are fundamental to maintaining our digital sovereignty.
And if our digital human rights are not argument enough: Free, Libre, and Open Source Software nowadays provides immense value to the global economy.
Thus arises the necessity to protect those freedoms in AI/ML-driven systems.
The Open Source Initiative (OSI) has just released the first version of their Open Source AI Definition (OSAID) and its associated Frequently Asked Questions (FAQ).
The OSI declares themselves to be "the authority that defines Open Source", and the steward of the Open Source Definition. It is fair to say that they are one of the most influential organizations in this space, with broad impact on the IT industry, communities, and political policy.
This means we must hold OSI's version 1.0 release, which is no longer declared a draft asking for comments but now asks for active endorsement, to a rather high standard: we must critically evaluate whether it manages to uphold and protect the values behind Open Source Software (that they do not wish to uphold the values of Free Software in full we already know), and what might be needed to address any gaps or misses.
This is not the first such critique, and I want to give explicit acknowledgement to the sources and authors referenced at the end for influencing my thinking. The tenor of the positions from the FSF and the Software Freedom Conservancy's aspirational stance, as well as from other respected individuals, clearly shows that advocates expect the OSI to do better.
The Positive Intentions and Achievements
OSI states that it has engaged with diverse stakeholders, who will not always have had perfectly aligned interests and positions. Anyone who has ever worked on any sort of large collaborative definition effort recognizes the impossibility of achieving everyone's goals without often painful compromises, especially when global stakeholders from all across the industry meet ethical and moral concerns.
It is always easy to criticize someone else's work - after it has been published. We would not be having this discourse right now if OSI hadn't published, and it behoves us to acknowledge this, too, despite the criticism.
Thus, providing an initial framework for "Open Source AI" and addressing a previously largely undefined space is an important step that I do not wish to discount. OSI is providing a baseline to argue from.
The key goal of OSAID is not merely to address the "traditional" source code used to build, train, run, and operate AI/ML systems; it would have been easy for OSI to limit themselves to this aspect which most closely aligns with their original scope and experiences. Instead, it recognizes and aims to cover the multiple components that together form an AI/ML-based system.
Software Source Code
Requiring the "complete" software source to be made available under an OSI-approved license is the obvious application of FLOSS principles to AI systems. This is fairly broadly phrased, including even supporting libraries and the model architecture.
It would be nice if this included explicit requirements for the (kernel) drivers as well, and there is always the question of where "software" ends for GPUs and other hardware components that implement quite complex firmware - or whether this covers all the layers through which a user ultimately interacts with the system (and which might further process its results).
This is indeed the minimal standard any "Open Source AI" system must meet. Though if the scope were limited to this, I would argue we would not have gone beyond what was already sufficiently covered.
So how successfully does OSI address the new components?
Critique
The Data
Positively, the OSAID does outline guidelines that clearly make systems that comply with them more "open" than those that do not.
Describing the data used in fair detail - its provenance, sourcing, processing, and labelling, and whether it is public or from third parties (including for a fee, which is a more polite way of saying proprietary and non-free) - is a significant step up from those AI/ML systems where we have no idea at all.
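To make this concrete, here is a minimal sketch of what such a "data information" record could look like; the OSAID prescribes no schema, so every field name and value below is an illustrative assumption of mine:

```python
# A hypothetical "data information" record. The OSAID prescribes no concrete
# schema; all fields here are illustrative assumptions.
training_data_description = {
    "name": "example-web-corpus",  # hypothetical dataset name
    "provenance": "crawled from public websites, 2023-01 through 2024-06",
    "processing": ["deduplication", "language filtering", "toxicity filtering"],
    "labelling": "none (self-supervised pre-training)",
    "availability": "third party, for a fee",  # i.e. proprietary and non-free
    "license": "unspecified",
}

# Note what such a record does NOT give us: the data itself, and with it any
# way to audit for bias, contamination, or unlawfully sourced material.
print(training_data_description["availability"])
```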
However, here is also where the OSAID falls critically short of its stated goal: it does not even recommend that the training data be free/open by default, much less require it.
Unlike traditional software, AI/ML systems derive their main logic and function from the data they are trained on. Withholding that data denies some of the core freedoms and rights.
Yes, traditional systems also use external assets (such as media) that are not always governed by the same open/free rights as the source code; but such data is usually what the system operates on, or what configures it, not the core of the system itself - and it was still possible to inspect how the system processed it. (It is worth pointing out that some voices would not consider even such systems "open" if they had critical non-open/non-free dependencies.)
The OSAID FAQ elaborates that not all data can be free or open, and perhaps not everywhere: consider confidential medical data, or situations where legislation differs globally. That is clearly valid. Even the Free Software Foundation concedes this aspect, arguing that non-free AI/ML systems could still be just.
Yet assessing whether a system was trained for a particular situation, or whether its training data contains any relevant biases that might produce discriminatory results, can only reasonably be done if access to that data is open.
OSI-approved terms
Another shortcoming that keeps the OSAID from being truly meaningful: even for the limited "data information" we do get, as well as for the model parameters and configuration settings, it specifies only that they "shall be made available under OSI-approved terms".
Which, the FAQ elaborates somewhat unhelpfully, does not (yet?) refer to any defined legal mechanisms, and can thus at best be seen as an aspirational statement.
A typical OSI-approved license that governs source code does not directly apply, since data is not software as such in the eyes of the law, courts, and international treaties. Yet the Linux Foundation has developed terms that might serve (such as the Community Data License Agreements), and for the OSI to suggest nothing here is disappointing.
For something that is supposed to govern the key components beyond software source code (which, I might remind the reader, is the entire point), this is far too vague.
A description of the data is not the data
To summarize: for the components that go beyond traditional software systems, the OSAID neither requires their full openness nor provides concrete guidance.
Limited Transparency and Ability to Audit
A key benefit of Open Source Software is the transparency to inspect and audit a system's behaviours, decisions, and actions through its source code.
This must imply access to all relevant components of our digital systems, notably the training data.
This is crucial for individuals and society to understand the systems they're interacting with or even controlled by. We must be able to discover biases of all kinds - non-representative data sets, discriminatory data, lack of coverage for critical scenarios, or even intentional data manipulation that introduces malicious exploits.
Running experiments on the resulting models remains necessary ("has the system learned the intended correlations?"), but it is not sufficient. It is already not feasible to thoroughly test traditional software; it is even less feasible to test a system whose points of reference are unknown and opaque.
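To make "necessary, but not sufficient" concrete, here is a minimal sketch of a black-box bias probe; the toy model, the probe pairs, and the whole setup are hypothetical stand-ins, not any established test suite:

```python
# A toy stand-in for an opaque model we can only query, never inspect.
def model_score(text: str) -> float:
    return 0.8 if text.startswith("He") else 0.5  # deliberately gender-biased toy

# Counterfactual pairs that differ only in a sensitive attribute.
probe_pairs = [
    ("He is an engineer.", "She is an engineer."),
    ("He is a nurse.", "She is a nurse."),
]

for a, b in probe_pairs:
    gap = abs(model_score(a) - model_score(b))
    print(f"{a!r} vs. {b!r}: score gap {gap:.2f}")

# Such probes only ever cover the cases we thought to enumerate; without
# the training data, we cannot know which correlations were actually learned.
```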
Not spotting and correcting these issues may result in reproducing gender or ethnic bias in generated imagery. It could be used to manipulate financial decisions. In scenarios such as healthcare, policing, or autonomous vehicles, it might prove fatal.
Hidden data, hidden exploitation?
As pointed out in this blog article by tante, there is another reason to want to keep the training data secret, beyond valid privacy and safety concerns - and not just secret from society, but also from competitors, or from legitimate copyright holders. (The large ones; the small ones stand no chance against the offenders anyway.)
Namely: the data was not, in fact, sourced in compliance with all legal and regulatory obligations, or at least is severely ethically or morally contentious.
The blog raises the rather plausible claim that all large data sets are thus contaminated.
Such contamination is significantly easier to hide behind a mere "description" of the data, or behind a barrier with contractual clauses attached - and anyone able to pay the high fee is likely similarly compromised. This saves the companies from both public outrage and, worse, penalty payments or regulatory fines.
tante further explores this theme, which I highly recommend you read and think through, even though I don't necessarily agree with the final conclusion: the conflict between Free, Libre, and Open Source values and our rights on the one hand, and profit-maximizing and exploitative business models (that is, un- or under-regulated capitalism with shareholder primacy and venture capital) on the other, is very real.
This aspect is particularly ironic given the suspicion that a significant portion of AI/ML/LLM systems intended to assist with software development and code generation are, indeed, trained on data that includes code licensed under Free, Libre, and Open Source terms (or, e.g., non-code assets governed by Creative Commons licenses that require attribution and/or explicitly exclude commercial use), and that their operators might be going to significant lengths to obscure that in their output.
This is impossible to tell if all you get is an "Open Source" model and its weights (we cannot reverse the training process and make strong assertions about the training data from the resulting model alone), or when you cannot truly inspect the source data. Shouldn't an "Open Source AI" go to great lengths to explicitly safeguard the rights and shared values of the Open Source community?
Modifying and Sharing
In my opinion, the rights to Modify and Share are slightly less of a priority in the holistic context of this critique (since they'd likely also be resolved by addressing the core criticisms), but nonetheless necessary. Again, these are greatly hindered by the weak requirements and rights provided.
Yes, models can be iteratively trained further (depending on what "OSI-approved terms" end up requiring), and that allows certain influence over their behaviours. But that is not the same as the ability to rebuild the model, or to retrain another, modified one, as the sketch below illustrates.
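A minimal sketch of this distinction, using PyTorch with a toy stand-in model (the architecture, the weights file, and the data are all hypothetical):

```python
import torch
import torch.nn as nn

# A toy stand-in for a released model whose weights we have...
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 2))
# model.load_state_dict(torch.load("released_weights.pt"))  # hypothetical file

# ...which we can fine-tune further on data of our own:
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
inputs, labels = torch.randn(8, 16), torch.randint(0, 2, (8,))  # our data
loss = nn.functional.cross_entropy(model(inputs), labels)
loss.backward()
optimizer.step()

# But a structurally modified variant cannot inherit those weights:
modified = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
# Rebuilding `modified` to comparable quality requires the original
# training corpus, which the OSAID does not require to be released.
```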
Yes, this also needs to acknowledge that training data can be huge: not mere gigabytes, but up to exabytes. And the hardware resources and power required to train a model can be similarly vast. Not everyone can aim to reproduce everything - but we accept the same for FLOSS in general, without weakening the rights.
Still: for an "Open Source AI" definition to provide so little guidance on how to actually modify and share those modifications fails to address half of the Four Freedoms.
Risk of Open-Washing - Undermining "Open Source"
Why then, in light of all this, the insistence on defining such a seemingly comprehensive term as "Open Source AI"?
OSI does not allow calling something "Open Source Software" if it just comes with a description of what the source code looks like and where you can pay for access to it.
If the only component of the system that is fully "Open Source" is the software one, we should rather call this an "AI/ML system provided by Open Source Software".
This could lead to (further) undermining of the value of "Open Source" to individuals and society, and allow companies to use the marketing brand of "Open Source AI" while, in fact, continuing business as usual.
Unfortunately, the Free, Libre, and Open Source Software community has a history with such term dilution: "Open Source Software" is a more industry-pleasing alternative to "Free Software"; and calling less-protective licenses "more permissive" was a stroke of marketing genius for capitalist value extraction. (Some of this history is referenced in the SFC blog post.)
Superficial marketing term?
The additional demands placed on companies wanting to use the new label are not onerous: releasing the software source is already the default for most AI/ML software development and in many other areas of IT, given the clear economic benefits and marketing value.
Describing the data sources is something a few regulatory efforts (see the next section) already request, and it is in line with the general move towards strongly documented supply chains and bills of materials.
From a commercial perspective, a superficial definition is preferable to one with teeth, since it allows for an easy marketing win: another positive label to stick on the box.
But there are possibly additional benefits - from a commercial perspective, not necessarily from an individual and societal one:
Potential implications for public policy and regulation
The European Union's Artificial Intelligence Act outlines certain transparency requirements for AI systems. However, it also lowers certain standards for "free and open license GPAI models".
It is not inconceivable that a company might claim compliance with the OSAID to benefit from these reduced requirements, bamboozling policy makers and public decision makers by using a term defined by a respected industry organization.
This demonstrates the need for a visionary definition that is a truer representation of Free, Libre, and Open values - and those of our open and democratic societies - and that is harder to subvert.
TL;DR
The Open Source AI Definition v1.0 claims to address the need for a definition of "Open Source" in the context of AI/ML systems, but it falls critically short, with potentially broad implications.
As it stands, I can only conclude that "Open Source AI" is a somewhat rushed marketing effort that does not actually deliver what I would expect, or what I believe society needs for digital sovereignty and safety - much less what our individual freedoms and rights require - and that it further erodes the proclaimed values.
A less charitable interpretation might be that the industry wanted a pragmatic lobbying organisation to produce a weaker definition - and to gather endorsements for it, to make it appear widely accepted - ahead of public discourse and political regulation defining "our" requirements, thereby shaping and steering the discourse in its commercial interests. While I would not accuse any specific individual involved, we need to be mindful of how this discourse is now primed.
Applications that have valid reasons for not making their sources publicly accessible exist; data privacy around personal information and medical data comes to mind. For these systems to be just and aligned with our (digital) human rights, they may not be able to be fully "Open Source" - but then they should not be called by that term. This is a tricky but quite feasible balance to maintain and regulate, and such systems can still be required to be reviewed by appointed and chartered experts; this is not a new concept.
Such valid concerns must not allow more common scenarios to take advantage of them for for-profit goals and marketing, and to lower our baseline standards for openness and transparency.
We must protect the value of the "Open Source" term (such as it is).
Given the speed at which AI/ML is adopted into critical infrastructure and thus the urgent need to ensure its compatibility with our values, now is the time to set our direction; we cannot afford to rush this to serve only industry stakeholders' needs.
Where to next
We need to create a nuanced and differentiated position beyond what OSI's OSAID delivers today. My personal highest hope lies with the efforts and aspirations of the Software Freedom Conservancy, who are well respected, with a strong voice and a history of upholding principled positions.
(And if you can, consider supporting Bradley M. Kuhn on the OSI Board of Directors. Whether OSI itself can be swayed is up for debate; at the very least, we need to ensure that his position and background in "the community" (such as it exists) are better understood by all.)
While it would be great if all of this could indeed be wrapped in a single term, the complexity of the real world might require addressing Open Data, Open Source, and Open Weights separately, or perhaps with multiple tiers per component. A truly open and just system would also need to somehow address the resource requirements; freedoms must be exercisable in reality.
This could come together under the broad umbrella of "Open Source AI" in the end; but when used without further qualification, this must at least default to real openness for all covered components.
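As a thought experiment, such a component-wise, tiered assessment could be modelled as follows; the components, the tiers, and the rule that the unqualified label requires full openness everywhere are my own illustrative assumptions, not part of any existing definition:

```python
from enum import IntEnum

class Tier(IntEnum):
    CLOSED = 0      # nothing released
    DESCRIBED = 1   # only a description ("data information") released
    RESTRICTED = 2  # released, but under non-open terms
    OPEN = 3        # released under open/free terms

# Hypothetical per-component assessment of an AI/ML system.
system = {
    "source_code": Tier.OPEN,
    "model_weights": Tier.OPEN,
    "training_data": Tier.DESCRIBED,  # the OSAID's minimum for data
}

def unqualified_open_source_ai(components: dict) -> bool:
    """The rule argued for above: the bare label must mean full openness."""
    return all(tier == Tier.OPEN for tier in components.values())

print(unqualified_open_source_ai(system))  # False: the data is only described
```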
I hope that the ongoing dialogue in the community, industry, and media, as well as amongst advocates, researchers, ethicists, and policy makers, contributes to this effort.
Please raise your own voice in demanding fully open AI/ML systems wherever possible and not just where commercially convenient.
We might see an improved Open Source AI Definition, version 2.0. Or we might need to adopt another organization's guidebook, if OSI keeps their focus on industry-palatable pragmatism.