I was recently lucky enough to have an interesting discussion with smartR AI’s founder, CEO and engineer, Oliver King-Smith, about one of the most fascinating conundrums in the field of AI. Strict copyright and IP laws are making AI products more biased by limiting the sources available to train them on; how do we strike a balance between respecting copyright and IP law and allowing reasonable access to information for training AI?

Oliver explains the issue in the following way:

“You can definitely see why the copyright holders would be concerned about their information being used in training AI models. A legitimate concern is that, of course, AI models can consume so much information, way more than any human can possibly absorb in a lifetime…

On the flip side, if we say we are not going to use any copyrighted information in the creation of models, then we have to go back and look for sources that we can use. And a lot of the high-quality information that we have in terms of writing is from pre-copyright works, which would go back to the 1930s and earlier.

And when you look at that era of time … it definitely comes with a bunch of biases… it was written entirely from a male, and you can argue a white male, perspective at least for the English literature…  That work is very valuable, if we are restricted from using copyrighted material, because it is high-quality English. It’s some of the best written English around. So, we use that particular data source in terms of creating the ability of these models to understand and utilize English.”

Oliver provides the examples of Japan and the United States to illustrate the extreme positions countries can take. In Japan, AIs can be trained on copyrighted material without any legal repercussions. Yet in the US, large copyright holders maintain that it is a legal violation to train an AI on copyrighted material.

Oliver points out that this is currently being challenged in US courts under the doctrine of ‘fair use’, and he doesn’t think this strict approach is going to withstand legal challenges. He suggests that, just as humans learn from reading copyrighted material, it would be reasonable for AIs to be trained to some degree on copyrighted material.

“It’s reasonable to assume that [in the future] AI models will have access to and can use some of that information… With the current system that we have, where people are arguing for no copyright material, it is creating this bias towards a view of the world that is really out of touch with the way that we look at the world now.”

When humans read copyrighted material they often have to pay for it in some way, so I asked Oliver whether he envisioned some kind of recompense system being implemented. Oliver suggested that those creating the AIs would likely be happy to set up some sort of fund where a proportion of the AI’s profits go to compensating the authors of the copyrighted material involved in training the AI.

However, part of the problem in creating a compensation fund is that huge amounts of information are currently needed to train an AI. This means that when it comes to compensating authors, the profits may be spread quite thinly, which is unlikely to satisfy copyright owners. Oliver points out that if, in the future, the technology advances to the point where it becomes more sample efficient and fewer sources are needed to train an AI, then this might be a more viable solution.

On the other hand, the problems involved in replication may mean copyright owners remain unsatisfied with a compensation solution:

“Imagine robots advanced enough to attend school. Let’s say we send our AI model SCOTi to learn engineering at university. SCOTi pays for the courses and excels in its studies.

Would the university and textbook publishers be pleased? On one hand, SCOTi successfully completes the coursework and its owner pays all the fees. However, the issue arises when SCOTi can be replicated. If multiple copies of SCOTi are made after attending university, all that knowledge is transferred to the replicas.

This raises a problem for copyright holders: if they allow SCOTi to use the copyrighted material, it’s difficult to restrict its usage once it’s been replicated. Perhaps, in the case of robots like SCOTi, paying for and using copyrighted materials would be acceptable if replication doesn’t happen after learning, because their ability to interact and work in the world is similar to humans.

However, this becomes more complicated when considering computers running AI software like SCOTi. With scalability, SCOTi could answer thousands of questions per hour on a powerful computer. Therefore, pricing for copyright material might need to depend on the extent of SCOTi’s usage.”

So, if the issue of compensation is creating such difficulties between copyright holders and AI companies, are we stuck in a sort of stalemate? Oliver suggests that bigger AI companies may be able to find their way forward while smaller AI companies get left behind:

“There are definitely deals that are going to get done with the really massive models who will get access to this information, but is this generally helpful or useful for the general forward step of AI? I am not sure it is. If you just say these giant tech companies with multimillion-dollar budgets can buy access to this material, even if it was created by users who are not getting compensated.

Is that really a fair system? What happens with the smaller players in the market space? Are they just locked out? Can they only work with the more biased system that already exists in the marketplace?”

Oliver describes the risks smaller companies face when trying to adhere to ethical standards by training their AIs without copyrighted material. Often, they rely on large open-source datasets such as ‘The Pile’ to do their initial training. Yet massive liability problems can arise if copyrighted material makes its way in.

“Just to remove two pieces of information is another $50-100 million hit.”

This issue is also extensively discussed by Amanda Levendowski:

“Friction created by copyright law plays a role in AI bias by encouraging AI creators to use easily accessible, legally low-risk works as training data, even when those works are demonstrably biased.”

Levendowski is also a strong advocate of the ‘fair use’ of copyrighted material; she argues that, if it were adopted, it would also encourage AI companies to disclose which datasets they use to train their algorithms without fear of legal repercussions. I posed this suggestion to Oliver, to which he had a response directed at strong supporters of AI disclosure (particularly Lord Holmes, if you are listening):

“…For particular models in particular industries I think it makes a lot of sense that you understand what data the model was trained on. A good example is healthcare, where, at least at a high level, you should know what data it was trained on.

But let’s say we are training a model for an aerospace company that has a lot of proprietary information in it… Should the aerospace company really be required to disclose the information they are using? Because they are using the model not in a consumer-facing context, but rather to develop more advanced components.

You need to understand how the AI model is being used and who is using it. In the case of healthcare, the consumer has a strong interest in understanding what is in the model and how it works, but in the second situation the aerospace company is producing a component that will go through normal FAA verification and testing, and consumers don’t care so much about an aircraft widget, assuming the part has been shown to be safe.”

I think the discussion in this blog makes it abundantly clear that there are a lot of issues still to be resolved surrounding AI and copyright law. The views of all parties, and what each stands to gain or lose, should be considered carefully in the process of making decisions and regulations.

Written by Celene Sandiford, smartR AI