Hi,
Is it fair use to train commercial AI tools using publicly-available data?
If GitHub users have their way, the answer to this question will soon go from being a matter of opinion to being a matter of law.
GitHub is a code hosting platform that allows users to publicly share and collaborate on software development projects. Although Microsoft purchased GitHub in 2018, it does not own the code hosted on it. The code is typically made publicly available under open-source licenses whereby copyright is retained by the software authors.
However, Microsoft has recently launched GitHub Copilot, an AI-assisted coding tool trained using publicly available GitHub data without acknowledging the data owners. This has led to GitHub users investigating a potential class-action lawsuit against Microsoft for breach of their copyright.
Although this will potentially be the first US lawsuit to address the consequences of AI training using open-source data, it's not the first time concerns have been raised around the copyright and privacy implications of using publicly available data in training AI-based tools.
Last month, stock photo website Getty Images announced it was banning AI-generated images from its site, due to copyright concerns. Furthermore, in late 2021, Clearview AI was found to have breached Australia's privacy laws by making use of images obtained from social media without consent in its facial recognition tool.
Whether or not this case makes it to court, it highlights the importance of ensuring any data used to train AI systems is ethically sourced.
Ten years ago, it was not uncommon for people to illegally download and share music and movies from file-sharing sites, such as Napster and The Pirate Bay. However, with the rise of legal streaming sites, such as Netflix and Spotify, illegally downloading movies and music has largely become a thing of the past. This is because Netflix and Spotify made it easy for people to do the right thing.
I believe most people fundamentally want to do the right thing. Currently, however, it can be difficult for AI researchers and developers to obtain the vast datasets needed to train their models. I expect that, in the future, companies that are the data equivalent of Netflix and Spotify will rise to meet this demand ethically and legally.
Until then, though, it will be interesting to see the outcome of any legal proceedings against Microsoft, and whether or not US courts rule that the use of public data in AI training is fair use.
What are your thoughts around the use of public data in AI training? Hit reply to let me know.
Talk again soon,
Dr Genevieve Hayes.