The future of intellectual property: AI and machine learning
https://asa1cadmoremedia.blob.core.windows.net/asset-95609fac-e804-4547-9905-2866bf558456/16 - The future of intellectual property - Al and machine l.mov
BOHYUN KIM: Welcome, everyone. My name is Bohyun Kim, and I'm the Chief Technology Officer and Associate Professor at the University of Rhode Island Libraries. As a member of the NISO Plus 2021 Conference Planning Committee, I'm here to introduce our two fabulous speakers, and also moderate the session. So first, let me introduce our speakers, Nancy and Roy. Nancy Sims is Copyright Program Librarian at the University of Minnesota Libraries.
BOHYUN KIM: She is fascinated by copyright issues in modern life, helps people understand how copyright may affect their lives, and advocates for policies and laws that enable wide public cultural participation. Roy Kaufman is Managing Director of Business Development and Government Relations for Copyright Clearance Center. He is a member of the Bar of the State of New York, the Authors Guild, and the editorial board of UKSG Insights.
BOHYUN KIM: He also advises the US government on international trade matters, and was the founding corporate secretary of CrossRef. He has written and lectured extensively on the subjects of copyright licensing, open access, text and data mining, new media, artists' rights, and art law. In this session, The Future of Intellectual Property: AI and Machine Learning, we will discuss what AI and machine learning mean for the future of intellectual property, and how they relate to libraries.
BOHYUN KIM: So in recent years, machine learning has been a hot topic for interesting and exciting discussion in many fields, including librarianship, publishing, and of course, technology. As most of you already know, machine learning algorithms take a large amount of data and look for a statistical pattern of correlation that can be used for purposes of analysis, prediction, or automated action. So how does machine learning relate to libraries and copyright?
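To make that description concrete, here is a minimal, hypothetical sketch, not from the session itself, of what "finding a statistical pattern in data for prediction" can mean at its simplest: an ordinary least-squares line fit. The data and function names are invented for illustration.

```python
# Minimal sketch of "machine learning" as statistical pattern-finding:
# fit a line to observed (x, y) pairs, then use it to predict new values.

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def predict(model, x):
    """Apply the learned pattern to an unseen input."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical training data (entirely made up for this example).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]
model = fit_line(xs, ys)
print(predict(model, 6))  # extrapolates the learned pattern; ~11.93
```

Real machine learning systems fit far more complex patterns over far more data, but the workflow Bohyun describes, data in, statistical pattern out, prediction from the pattern, is the same shape.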
BOHYUN KIM: Libraries collect, organize, and preserve many types of resources and data. Some of the resources and data that libraries hold or license will be asked to serve as the input for machine learning research projects. Some of the outputs and outcomes from such machine learning projects will also need to be stored for access, maintained over time, and preserved for the future.
BOHYUN KIM: And again, libraries are likely to be tasked with this work in the near future. And when that time comes, the outputs of machine learning projects will probably form another type of library collection. So with machine learning, first, libraries will be involved as the party that holds or licenses the resources and data that will be used as the input for such machine learning projects and research.
BOHYUN KIM: And secondly, libraries are highly likely to be involved in the future as the stewards of the outputs of machine learning projects. So in light of this, I would like all of you to hold two issues in mind. First, what copyright-related concerns arise when library resources and data are used as the input for machine learning? And second, what will the copyright status be of the machine learning outputs that libraries will store in the future, whether those outputs take the form of an application, structured or unstructured data sets, or other types of creative work? That having been said as a quick overview, let's now hear from Roy and Nancy, who will enlighten us about copyright and machine learning. Roy will go first, and then Nancy will follow. Take it away, Roy.
ROY KAUFMAN: Thank you so much, Bohyun. Let me do screen share. Hopefully, this shall work. And let me hit share. And now I'm going to go into slide mode. There we go. So just first of all, thank you everyone for being here. If you're anything like me, voluntarily showing up for one more Zoom thing is a bit of a stretch.
ROY KAUFMAN: And I find it's hard to attend these things. It's hard to present at these things. But thank you for coming. And I just appreciate your being here. For those of you who are not from the United States, I was just told I need to speak slowly. I am a New Yorker. And so if you're not from the United States, what you don't realize is that means I talk really quickly, especially when I'm nervous because I'm presenting, and have a lot of material to get through in a short time.
ROY KAUFMAN: So I apologize if I talk too quickly, or if I use a lot of acronyms or terms of art. Just ask questions in the question period. So with all of that, very briefly, I don't know if you know my organization, Copyright Clearance Center. We're a licensing organization. We aggregate and deliver content to universities, and we run libraries on behalf of corporations.
ROY KAUFMAN: And we aggregate content, and we license rights. And so we're sort of in the middle of copyright discussions pretty much at all times, especially around research, news, scientific content, things like that. So I just want to get into some really fundamental basics. So when we're talking about AI, I'm going to be talking about the inputs. The copyright implications of the material that goes into training the AI or the machines.
ROY KAUFMAN: Nancy will probably touch on that, but will also talk more about the outputs. And I just want to say, when we're talking about input, a lot of the input that goes into AI has nothing to do with copyright. It's raw data. It's not protected by copyright. And that's not what I'm going to talk about. I'm really only talking about when copyrightable materials are used to train machines or used as input for AI.
ROY KAUFMAN: And so what's copyrightable? Well, in this context, books, journals, magazines, newspapers. And copyright is this sort of weird thing. It seems very complicated. But at its heart, it's the right to make copies for certain works that are protected. And copies are made sometimes, and sometimes they're not. So a human reads a book, there's no copy being made.
ROY KAUFMAN: Copyright is not implicated in that reading process. However, for hundreds of years, machines couldn't read or use content. So if you made a copy for a human to read, that implicated copyright law. Didn't mean it was an unexcused infringement. It just meant it implicated copyright law. And the reason I say this is sometimes we talk about the right to read and the right to mine.
ROY KAUFMAN: It's something people say. It sounds good. And it's a useful framework for a discussion But again, if the machine is making a copy, you need to figure out what does copyright law say about it. And that's why I really wanted to start with fundamentally if a copy is made, copyright applies. If the copy isn't made, copyright doesn't apply. And if copyright applies, then you have to figure out, well, is my making these copies an infringement?
ROY KAUFMAN: Did I do it under a license, which means I have permission from the rights holder? Or is it done under an exception or limitations? Some sort of excused copying for the greater good. I want to talk a little bit just very briefly-- I've been using this slide for years now talking about text and data mining, which is sort of analogous or an overlapping Venn diagram with AI more generally.
ROY KAUFMAN: And a lot of times I'll meet someone and they'll say, oh, I just met with someone who's doing text mining. And they're arguing all input should be free; you shouldn't have to pay to use it. And I said, well, they do text mining. Are they a software company, or are they the input company? Are they creating the content, or are they somewhere else? And a lot of times, if you're making cars, you wish gasoline were free.
ROY KAUFMAN: You're not going to worry about the implication. You just want people to drive your car. And you want the gas to be free. So you just always remember when people are talking which side they're talking about. Remember it with me too. I'm focusing on the input side. So everyone has a bias in this. And just to torture this analogy a little bit further, the other thing I often think about with this is if that's your software, that car, your input better not be kerosene.
ROY KAUFMAN: Your input has to be fit for purpose. It has to be structured, refined, appropriate for what you're trying to do. So getting a little bit into the heart of the copyright here-- and this is why this is such a complicated issue-- is all copyright exceptions really under international law should fall within the Berne Convention Article 9 Section 2, which says you can have exceptions for copying, i.e. you can have copying without the consent of the rights holder under special cases that don't conflict with normal exploitation.
ROY KAUFMAN: And don't unreasonably prejudice the legitimate rights of the author. OK. But what is normal exploitation? Because 20 years ago, I can assure you, training a machine wasn't in anyone's thoughts. So compare normal exploitation 20 years ago with normal exploitation today, when news companies enter into licenses with hedge funds where all their data flows into a pipe and the funds do trades. Well, that's a normal exploitation now.
ROY KAUFMAN: And they make a lot of money on that. Scientific publishers: for some of the libraries we help manage, increasingly a text mining service is the first point of contact. The humans don't always even read the content; it's the machine that does the sorting. It becomes more of a primary usage. And law, particularly US law, which I'll get into in the next slide, only makes decisions at a point in time.
ROY KAUFMAN: And so it's a very tricky area of law, which makes it kind of fun for someone like me. Now other legal regimes-- so the European Union, which is more of a statutory regime-- they don't have this sort of mushy gushy fair use analysis that we have in America. They have statutes that say you can do this, and you can't do this. And so the EU has probably the most significant copyright law.
ROY KAUFMAN: It's not really about AI. It's about text mining. But it's really, I think, going to be the same. And what they've done is they've created directive that needs to be implemented in two years by every country. One country might make that deadline-- Germany. I don't think anyone else will make the deadline.
ROY KAUFMAN: But anyway. It created two copyright exceptions. One is for non-commercial research text and data mining. So if you are an educational institution, and you're subscribing to a lot of journals, and you're doing it for non-commercial purposes, it doesn't matter what the agreement says you can text mine. Then there's a broader exception sort of for everyone, including commercial.
ROY KAUFMAN: And this is sort of materials on the open web primarily. You can mine it unless the rights holder opts out. And the way I think about this-- and I've actually discussed this with the commission, not that they'll confirm or deny what I say-- you have things like Reddit. I'm a Reddit user. It's protected by copyright-- each little post.
ROY KAUFMAN: But it's ephemera. No one's really got an economic interest in that except for maybe the company itself. But they don't own that copyright except by agreement. Whatever. And on the other hand, you'll have a publisher-- New York Times or John Wiley and Sons. And they can put an opt out on their website saying, sorry, you can't mine this without our permission.
ROY KAUFMAN: And I'm not sure that actually is consistent with another clause of the Berne Convention. But sort of how the EU decides to splice this. The EU is sort of coming off of a limited UK commercial TDM exception. They're not identical. But for all intents and purposes in my mind they're similar. Again, it's a non-commercial exception. In the US, so far, it's all been about fair use analysis.
ROY KAUFMAN: I could spend hours on fair use. And then Nancy could disagree with everything I say. That would be wonderful and kind of fun. But nonetheless, fair use is fact determinant. And one thing I particularly-- outside the US people say, well, isn't text mining in the US a fair use? And I say, well, it's not a use. It might be fair use. Making a copy to mine may very well be fair use.
ROY KAUFMAN: But the use isn't the mining. That's like saying photocopying is a use. The use is what you're mining for. And so to give you an example, I'm going to talk about two cases and one non case. So Google Books Hathi Trust, which are actually two cases. I'm sort of calling one because it's common facts. And there was text mining of books on the-- copying made of books on the shelves.
ROY KAUFMAN: And one of the excuses was, well, we're making those copies so we can text mine. And the example given was we're doing non-commercial semantic research as to when the United States is became the United States are. That's about all I can really find directly about text and data mining in US law. But that's scanning books to do semantic research. It doesn't seem like an ordinary use of those works.
ROY KAUFMAN: I'm an Author's Guild member. I mean, I think there are other reasons why these cases are actually wrongly decided. But this one I sort of, yeah, I can see this. At the same time that Google-- and this is going back 15 years or whatever-- started scanning the books in the library, they're also scanning journals in the library. They stopped that way before there was any litigation decision because what Google found out during the litigation was that the publishers were already scanning their works and selling them to libraries.
ROY KAUFMAN: And again, Google will never admit this. But they did stop scanning the journals. And it did seem to me, at least as a lawyer who had a stake in this because I was working for one of the publishers involved, that was probably because their fair use claim was a lot weaker there. Again, we don't really know the answer. But this is what I think. And then there was another case more recently that didn't say text mining.
ROY KAUFMAN: But it was sort of similar. It was called the TVEyes case involving Fox. A lot of these cases, while it's transformative use and therefore it's fair use. And what TVEyes said was, well, yeah, it's transformative. They're sort of looking at all the programs, and cutting out little excerpts, and sending them to their clients. But that really was violating not necessarily Fox's licensing revenues, but Fox's ability to enter into licenses that it shouldn't have been able to enter into.
ROY KAUFMAN: And therefore it wasn't fair use. It was just an infringement. So the cases are different places. And the only point I want to make here is every case is fact determinate. That's how fair use works. You can't make blanket rules about fair use very easily. So I'm going to switch grounds a little bit. And I've got 3 minutes left.
ROY KAUFMAN: So Nancy, and I, and Bohyun, and our prep, we started talking about copyright. And if you need a license, you can only license what you can license. And starts getting into issues of equity, and accessibility, and all of this stuff. And so I wanted to leave with four questions, and two really depressing quotations on the next slide. But I'm going to start with the questions, which is does your input reflect the world?
ROY KAUFMAN: Or does it reflect only the world of your input? So if you're inputting 19th century books by dead white men, what you're going to get is the world according to dead white men. Is their equity in license availability? What works are available for mining? And in the selection of those works was their bias? And bias here doesn't have to be necessarily a bad thing. And I'll give you the example is we at CCC, we've created a corpus for text mining I've alluded to biomedical content from scientific journals.
ROY KAUFMAN: OK. We haven't yet aggregated social science because this is what we've aggregated for our market. And that is a bias. You're not going to get as much social science research in our corpus of biomedical research unless you're doing social science research on biomedical literature. Equity and technological availability.
ROY KAUFMAN: Again, is everything ready to be mined? Is it scattered out there? Is that API available? Is it tagged? How is it tagged? Is it tagged with the specific ontology? How did you choose that ontology? And all of these things have to do with bias. And they also have to do with equity.
ROY KAUFMAN: Are we tagging everything? Are we inclusive in what we're doing? And then also with respect to the creators-- and it could also be the subjects by the way. In the next slide, I'll talk about subjects and creators because they conflate sometimes. But have we asked people? Do they feel if we're using their content, or their output, or their image in a system, do they feel that they have the right to consent to be in?
ROY KAUFMAN: The right to consent to be out? And that is a form of equity. And so I'm just going to leave you with two thoughts. I spent a lot of time on open access. I'm a big fan of a particularly sustainable open access. I'll just say there's a bunch of different models out there for sustainable open access. One of them being the APC based model. And a study that was published in MIT Press not shockingly found out that if you have to pay to get your article published, it's going to favor people who have money at prestigious institutions.
ROY KAUFMAN: I don't have to read this. It's all there. And then, after I created this deck a couple of days ago, I came across this other thing. And this gets into one of my other topics that we don't have time to talk about, which is Creative Commons Licenses, which are quite useful, and quite confusing, and have a lot of unintended consequences.
ROY KAUFMAN: So my favorite unintended consequence-- or not favorite because it's terrifying-- was someone who's both the creator and the subject put his family photos up on Flickr. There was a CC by license. Probably didn't even notice it or care. Next thing he knows it's being used to track people-- not just Chinese Uighur population. But other things that he felt very uncomfortable.
ROY KAUFMAN: And then with that, I am about 30 seconds over time. I appreciate your indulgence. And I'm handing it over to Nancy. And going to stop sharing.
NANCY SIMS: Let me unmute myself, and then start sharing. So hold on one moment here. I think those are my slides there. And this one I probably don't need to review, because, hey, this is just what we were just talking about. Doing the prep for this session was actually really helpful, because I found some real connection points with what Roy was just talking about. So I'm going to be eventually shifting gears to think more about the outputs of AI and machine learning projects.
NANCY SIMS: But I think one of the interesting pieces of that is how copyright law in general, and social systems in general, think about creators. And what Roy was just talking about is a really interesting aspect of how creators' interests are absorbed and reflected in various ways on the input side of things. So one really big issue, which I think Roy was definitely indicating, is that there's some stuff here that's not just legal issues.
NANCY SIMS: There's some sort of consent issues. And there's some that people will talk about more. And some that people talk about less. So a lot of things that have happened in the last few years around using big data sets for research projects, especially big data sets that have been collected in public or collected from publicly available information, a lot of people say, well, you know it's OK for me to have done that because it was in the terms of service for the site that I was visiting that people would have their stuff used.
NANCY SIMS: You can see this with things like research that Facebook did on Facebook users manipulating their feeds. Another aspect that people do talk about fairly frequently is whether people have opted out or opted in to uses that are made of things that they shared publicly. So I actually think that the Flickr example that Roy raised was one that I had also already seen on Flickr-- on Twitter, sorry.
NANCY SIMS: But I had seen at least from the New York Times article site than from people sharing a new site called Exposing.ai, which if you put it in your Flickr username, will tell you which of your images has been used in right now I think it's four or five of the big machine learning data sets. I have a bunch of images on Flickr. Not all of them are public, which is nice. Flickr has been pretty good about respecting people's choices about whether things are public or not.
NANCY SIMS: Not every website does that over time. But I intentionally put Creative Commons licenses on them. And to be clear, you can't accidentally have Creative Commons license to your images on Flickr. You had to make a positive choice to do that. It wasn't something that could have happened accidentally. But I do think this sort of use was something I would have anticipated at the time because I was interning at Electronic Frontier Foundation among other things.
NANCY SIMS: But even for me, I think I may be an outlier. And people don't always expect things to be used after they shared them with Creative Commons licenses. And I think that even knowing this kind of use could have happened the way facial recognition systems have developed and are being deployed, certainly gives me a lot of pause to think that things that I took were in those.
NANCY SIMS: The images that they used from my site, from my feed, were all of sort of crowds outside. So that was also particularly interesting to me. Like why those particular images? I don't know. But thinking about consent leads me to some other thoughts here, which have to do with the ethics of communities of creators, which people often think they understand as a universal.
NANCY SIMS: And the only thing I know I understand about this is that there aren't universals here. I definitely don't know all the possible variations on what creators think about having their stuff shared. So for example here, there was a period in time where Tumblr had-- I don't think this is as big anymore because Tumblr isn't as big anymore. There were some pretty strong intercommunity frictions around the idea of re-posting versus re-blogging.
NANCY SIMS: And without going into the details, this had to do with whether you were maintaining a connection to a creator's original post or not when you re-posted their stuff. Then there were also lots of debates about whether you were crediting the creator or not. And that was just how Tumblr dealt with and cared about these sort of things. But then other communities work in completely different ways.
NANCY SIMS: And often actually just a few changes to the interface will change things even within one community. TikTok is a great example of a place where, as far as I know, there's a fairly well understood cross site expectation of reuse. Of other people reusing your stuff and changing it. But I'd suspect there are even some TikTok creators who have some schisms in their perceptions of the right ethical way to use TikTok.
NANCY SIMS: Another example from my own experience on Twitter: among other things, I have learned from seeing other people's uses that thread reader apps are a thing on Twitter, where, if somebody posts a long thread, you can at the end of the thread invoke a bot, speaking of machine learning or AI. You can invoke a bot that will take all the tweets in the thread and post them, usually as a separate web page, where you can read them all together.
NANCY SIMS: A lot of people don't have a problem with this. A lot of people do. And the people I've seen who do have big problems with tend to be people of color. In particular women of color because their stuff ends up showing up on other people's sites, and used, and inspiring other people, and not always really linking back to them in the long term. So you can block a lot of the thread reader apps.
NANCY SIMS: Not all of them. And I think a lot of people underestimate the degree to which those variations exist even within one community. And these are just tech communities. Across different types of creative communities the expectations are so hugely varied that it's really hard to talk about. So if you're thinking about this issue, and this is just sort of barely scraping the surface, how do you know about those varied ethics and take them into consideration while still doing high quality AI machine learning work the way we've been doing?
NANCY SIMS: And for the first part of that question, I have a suggestion, which is you know more about varied community ethics if your AI or machine learning dev team reflects more communities. So that's one piece of the puzzle. The other piece I'm not sure I have an answer to, which is how do you take them into consideration while still doing the same kinds of AI machine learning work?
NANCY SIMS: There are some things about how we've been doing AI and machine learning work that maybe we should stop. Not that I expect that to actually resonate with most of the people doing the work. So I think the Twitter example that I just invoked about thread reader apps and who doesn't like them, and who uses them anyway, highlights a really interesting thing that will take us over towards the output side, which is how does copyright law in particular, but just the society in general think about who is a creator?
NANCY SIMS: And one of the things that I think people are surprised by on Twitter in the thread reader app issue is that to put an extreme lens on it, some people don't think of Black women as authors. So when you copy their work out to somewhere else, you haven't done any copying of an author's work. And they wouldn't necessarily think to care about the author's preferences in that particular setup.
NANCY SIMS: And that's just sort of ethics, morals, personal variations in norms. Whereas we can also look at this inside the scope of the law. I use a song called "Folsom Prison Gangstaz" as a teaching example fairly frequently, especially when I'm talking to musicians. And it's a remix. It's a mash up of Johnny Cash's "Folsom Prison Blues" with Eazy-E's "Luv 4 Dem Gangsta'z." It's very listenable.
NANCY SIMS: It's super fun. And I ask people basically a few questions about the remix and about creative developments in music over time. And then I say, does copyright exist in order to prevent this remix from happening without permission? And some people sort of say, yeah. Or at least they say, copyright exists so that the creators can say yes or no. And that's the sort of idea that copyright exists to add some friction and some control for creators.
NANCY SIMS: This is pretty widespread. But the interesting thing in this discussion is that except for groups that are mostly people under 25, the only creator people spontaneously talk about in this discussion is Johnny Cash. They don't talk about Eazy-E's right to say yes or no to the remix. And neither one of them had any input into whether the remix happened.
NANCY SIMS: I don't say this to castigate people for not thinking of it. But to illustrate that there are some constructions we have about who's an author that tend to center certain types of creators and certain art forms. And you see this very clearly in courts. One really great example that is hard to show without spending far too much time on it as well, but appropriation art is something that happens in the sort of fine arts world where there are some artists who have made their reputation basically by copying other people's images and usually recontextualizing them in some way.
NANCY SIMS: And their art is the copying. Richard Prince is really well known for this. And Richard Prince gets sued, and tends to win. Not always. But tends to. Jeff Koons is another example. There's a variety of examples here. Whereas music sampling, which came up in communities of color with creators who are working generally outside of established commercial systems, early case law on music sampling basically tanked unpaid for music sampling right away with the idea that copying a single note without a license is something that people have to pay for.
NANCY SIMS: So appropriation art in its context is not something you have to pay for. But music sampling in its context is for a variety of other reasons. But also something about how we think about who is creating what in those situations. Trying to keep moving here. There are a few other issues when we're thinking about who's a creator and how does creation, creatorship, authorship work?
NANCY SIMS: Contracts and work for hire often control things about who owns their work. Who owns their creative work. So creators often do not own the results of their work. Anybody who has more resources or power-- and is just absolutely connected to Roy's great example about open access-- who can publish via an APC where an author has to pay a fee. People with more resources and power are more able to do that.
NANCY SIMS: And in contracting, and I promise this does connect to machine learning outputs, people with more resources and power can shape their contracts more easily. There's also some really interesting ideas, and this is really just sort of gesturing at a huge thicket of issues, which is there are lots of people who have ideas about ownership and use that are not compatible with legal models that are currently constituted.
NANCY SIMS: If we want to diversify inputs in machine learning, that often means representing more cultures in those inputs. So what if members of an indigenous culture consider something in your online collections? Something that you're thinking about using for a machine learning project? They consider that inappropriate for public viewing at all. They would like you to remove it completely. I don't have answers, but I know that this is an issue.
NANCY SIMS: And these kinds of issues are under considered in how libraries deal with our resources and collections, and in how the whole AI machine learning field thinks about this kind of stuff. So shifting gears a little bit to being really on target with the idea of copyright and AI and machine learning outputs. Here's a slightly off-putting example of a machine generated image.
NANCY SIMS: And as you're looking at it, if you can't see it, well, it's this kind of colorful blob that sort of looks like pieces of bread. And there's places in the image that sort of look like parts of a person. But not any recognizable parts of a person. And then some sort of green outdoorsy background. And something that sort of looks like a bunting of some sort with flags on it.
NANCY SIMS: But none of it's actually like a real image. It just sort of looks like a real image. This is what happens when you let machines make photographs from inputs and you let AIs create. So a question that I think people don't always answer in the way that the law would answer is, who owns the copyright in this image? As an aside, and I'll come back to this in a minute, can you tell the source material from which this image was generated?
NANCY SIMS: Some people who have watched things will recognize this image [INAUDIBLE].. To be more clear though in terms of the people involved in its creation, Janelle Shane, who blogs at AI Weirdness produced this using some existing commercial content, and existing image data sets, and some machine learning algorithms. So ownership and authorship. Big reverberations in AI and machine learning work.
NANCY SIMS: Under US law, and this is not necessarily true in the same way in every other country interesting place where jurisdiction is going to matter a lot. Under US law, a work eligible for copyright protection must be an original work of authorship, which arguably something like the machine learning picture we just looked at it is pretty original. I never seen anything quite like that before. But we also hold that in US law the copyrightable works must have a human author.
NANCY SIMS: If there isn't a human author, they're not eligible. Artificial authorship is not a new issue. The current Compendium of U.S. Copyright Office Practices-- it's a big huge book produced by the Copyright Office. The current edition says-- quoting from, well, maybe not quoting from-- Section 313.2 says the office will not register works produced by a machine or mere mechanical process that operates randomly or automatically without any creative input or intervention from a human author.
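To make that Compendium phrase concrete, here is a toy sketch, in Python, of a "mere mechanical process that operates randomly or automatically": once a seed is chosen, every pixel value follows deterministically from the random number generator, with no creative human intervention along the way. This is only an illustration of the legal concept, not Janelle Shane's actual pipeline or any real image-generation system.

```python
import random

def machine_generated_image(seed, width=4, height=4):
    """Produce a grid of RGB pixel tuples purely mechanically from a seed.

    A toy stand-in for the Compendium's "mere mechanical process that
    operates randomly or automatically": the human picks a seed, but
    every pixel is then determined by the generator, not by any
    creative choice.
    """
    rng = random.Random(seed)
    return [[(rng.randint(0, 255), rng.randint(0, 255), rng.randint(0, 255))
             for _ in range(width)]
            for _ in range(height)]

img = machine_generated_image(seed=42)
print(len(img), len(img[0]))  # 4 4 -- four rows of four RGB tuples
```

Note that the same seed always reproduces the same "work," which is part of why it is hard to locate any creative input from a human author in the output itself.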
NANCY SIMS: They also make reference to a report about this particular issue that goes back to 1966. This opens up a huge can of worms. What is creative input or intervention from a human author? And I'm about to run out of time. So when I look at this image, I think it doesn't have creative input from a human author. It has work input from a human author.
NANCY SIMS: Janelle Shane generated it. But I'm not sure it has copyrightable work input from a human author. In fact, I'm pretty sure it doesn't. If you want to say that you let an algorithm learn how to do something from data sets, I'm pretty sure that you're not doing enough creative work to be the author. On the other hand, there are ways you could use AI absolutely as a tool to produce a creative work where you did put enough creative input into the process that you would be the author.
NANCY SIMS: So there aren't super clear answers here. This is a lot like how I was talking about fair use. When we look to forward directions on this, some things to think about as you work with people who may be generating outputs: there are some folks in various places around the world who are lobbying for new approaches that do create ownership rights regardless of human authorship. So some of them specifically allow for non-human authorship.
NANCY SIMS: That creates some really interesting questions about animals. But never mind. There's also lobbying around this issue of what's creative enough to create something under current understandings. And then there are these alternate approaches. And the alternate approaches are things I think we need to be really aware of in libraries, which is that some people are open licensing all their outputs.
NANCY SIMS: Although that's usually hooked on to owning something in the first place, which is an interesting conundrum. And then the commercial actors in this space are aware that the copyright issues are unsettled. So they're often relying on things like trade secret, which is just: keep your stuff secret enough that nobody else knows how to do it. Or-- and this is something we're super familiar with in libraries-- contractual limitations on use and access.
NANCY SIMS: So those same things that we're used to thinking about in terms of how we contract for or license content that we're acquiring from outside sources are things that we may need to help people think about in terms of AI or machine learning outputs that they create in the long run. So thank you. And I'm going to stop sharing. That's all my slides for now.
BOHYUN KIM: So thank you, Roy and Nancy, for great presentations. So now let me ask a few questions from the audience before we move to the live chat. So I have the first question, about the bias and equity issues in machine learning that you both talked about in your presentations. It has two parts. The first part of the question is whether you believe copyright laws will play a significant role in this area.
BOHYUN KIM: And if so, we would like to hear more about that from you. And also in relation to this, are there any things that you can think of that libraries, publishers, or individual creators can do in order to mitigate those bias related issues and achieve more data justice and equity?
ROY KAUFMAN: Sure. Nancy, do you want to go first?
NANCY SIMS: Sure. The first part of the question, and maybe we go back and forth a little bit here, do you believe copyright laws will play a significant role in this area? And this is an interesting one because I don't see laws as being very good at addressing equity issues or bias issues. Our laws reflect our biases pretty strongly. One of the interesting things to think about here is-- and I was talking about kind of like who we think of as creators and authors in terms of people.
NANCY SIMS: Real people. And then for the AI and machine learning context, what do we think about in terms of authorship of things created by or with automated systems? And I am not 100% positive on my attribution here, but I think it was the researcher Abeba Birhane-- an Ethiopian researcher who works in Dublin right now, I think-- who pointed out that a lot of people in this field are more interested in resolving the machine authorship problem. In many cases, lots of people are interested in having the laws recognize a copyright in machine-authored works; that's an issue people want to work on and want to find an answer for.
NANCY SIMS: And very few people are particularly interested in why we as a society seem-- and this is my personal anecdote only, not data driven-- more ready to think of Johnny Cash as a creator than Eazy-E.
ROY KAUFMAN: Yeah. So a couple of different directions here. One is, obviously, I think copyright is related to the inputs of training machines and AI regardless of the equity lens. So if stuff is copyrightable, and you're using it, then you have to look at the copyright law and say, does this need a license? Or is this under an exception? And then also, as Nancy says, a lot of people are interested in that.
ROY KAUFMAN: Cynical me thinks, well, yeah, because these are people who think, maybe I can make money on this by claiming ownership-- which is fine. That's the basis of copyright for human ownership. It's actually what sustains creativity. I think it's a good thing. The question is whether that's something we need to continue to reward when the machine is creating it. But with that said, to me copyright is less significant in the bias and equity issue.
ROY KAUFMAN: It's just one more thing-- like privacy, like contracts, like formats-- that you need to think about. As you're creating an AI system, you have to remember a bunch of things, and copyright is one of them. But I don't see any really huge causal link or tension between copyright in and of itself and sort of inclusive use-- except, as in the quote I pointed out, which was about open access but still protected by copyright:
ROY KAUFMAN: If you're using published content-- published in any sense of the word-- think about who has access to the mechanisms of publication. So I think that's the only way. But even still, it's not really copyright so much. It's access to all the tools required.
NANCY SIMS: For the second piece, about things that libraries, and publishers, and individual creators might be able to do-- individual creators I think I would put off to the side. For individual creators, one of the things I say about almost any problem is that you need to think about what your goals are and what you're trying to accomplish. And I think that's very relevant here. But for organizational actors like libraries and publishers, and anybody who's working on AI, one of the things that I think is incredibly important to mitigate bias is to literally have more people in the room, or in the virtual room.
NANCY SIMS: But literally have more people involved in the processes. And as I was kind of saying in my presentation, how do you know the different cultural community practices across Tumblr, and TikTok, and Twitter? Well, you know those by having people who are parts of those communities, or who know people who are parts of those communities. And the more that your library, or your publisher, or your research group is made up of people who have similar backgrounds, the less that group is going to be able to know.
NANCY SIMS: And then, of course, diversifying who is working on your project is a pretty intractable issue across lots of different areas. And I have ideas, but I'm not sure how many of them are things many HR departments want to implement.
ROY KAUFMAN: Yeah. It seems really simple, but if you're looking at trying to create something with equity and inclusion in mind, then, as Nancy says, have your goal be having equity and inclusion in mind. I was thinking about this a lot in preparation. If you just sit down and say, I want to create a great AI for visual recognition of people's images, and you go about creating that--
ROY KAUFMAN: And it's like, wait a minute-- after you've started-- how well does this work? What was my input bias? Does this cover all these different races and ethnicities? Whereas if for that exact same thing you said, I'm going to create this AI facial recognition tool with diversity, equity, and inclusion in mind-- if you started there, all of a sudden you've just changed how you're thinking of your inputs.
ROY KAUFMAN: And so if you're thinking about things as a goal from the moment you're starting it, you're just going to change your inputs. Take out images. Put in content. Put in your articles. Whatever it is. If this is your goal, start with your goal because that will automatically change your decisions as you're going through.
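The "check your inputs" step Roy describes can be made mechanical. Here is a minimal sketch of a training-data coverage audit in Python: it counts how often each group label appears and flags any group that falls below a chosen share of the data. The labels, counts, and 20% threshold are all hypothetical placeholders; a real audit would read group metadata from your actual dataset and pick thresholds deliberately.

```python
from collections import Counter

# Hypothetical demographic labels attached to a training set --
# invented numbers for illustration, not real data.
training_labels = ["group_a"] * 800 + ["group_b"] * 150 + ["group_c"] * 50

def coverage_report(labels, threshold=0.20):
    """Return each group's share of the data and whether it falls
    below the minimum-share threshold (i.e., is under-represented)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: (n / total, n / total < threshold)
            for group, n in counts.items()}

report = coverage_report(training_labels)
for group, (share, flagged) in sorted(report.items()):
    print(f"{group}: {share:.0%}" + (" UNDER-REPRESENTED" if flagged else ""))
```

The point is the one Roy makes: if the goal is stated up front, a check like this runs before training, when the inputs can still be changed, rather than after the system is built.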
ROY KAUFMAN: And I think that's the best way to do it. And there just seems to be, with AI and tech, a lot of amorality. I don't want to say immorality. I think people just-- you worked at EFF? People just come up with, hey, wouldn't it be great if I did this. And then five years later say, well, gee, look at all the laws I violated. But I wasn't doing it to violate the laws.
ROY KAUFMAN: I just wasn't thinking about the laws.
NANCY SIMS: Well, and who has the ability to not think about the law is always a real interesting piece in all of that.
ROY KAUFMAN: Yeah. That's there too. So again, if you start with that as a goal, it changes your path. And that's, I think, the best way to do it.
BOHYUN KIM: Thank you. I think those are really great points. A lot of times these new things come up, and people think there are going to be some new and profound solutions. But actually a lot of the solutions seem to be things that we already know. We just have to act on them. So I think those were great. All right.
BOHYUN KIM: So let's move onto the second question. On the copyright law front, do you see any discernible movement or directions from commercial actors specifically related to the use of data for machine learning purposes? What are they? And how do you think they may impact the libraries?
ROY KAUFMAN: I'll take this one first because we have a service in this area. So at CCC, one of the things we've been offering for a few years is this: we have taken biomedical and chemical articles from about 60 publishers, turned them into normalized XML, and we license it to commercial libraries-- staying on the commercial focus. And it's about 13 million articles.
ROY KAUFMAN: And people can actually use that for text mining. It's formatted. It can get delivered there. So that's one thing that we are doing. It tends to be very commercially focused, actually, although I don't think the user's tax status should change whether it's useful. But we've been focusing there-- which is to say it could be used for non-commercial or commercial purposes.
ROY KAUFMAN: That's my clunky way of saying that. But anyway, that's something we've done. We've noticed in news-- I think I alluded to this in my talk-- a lot of news providers license to financial firms that want to mine the news as it comes out. And sometimes they'll move offices right next to the provider so they get that split-second timing, so they can make trades. Again, not getting into the ethics of that.
ROY KAUFMAN: Just talking about what's actually going on in the world. So for the library customers who, say, use our XML, it's doing pretty well. There's a whole balance on: are they getting stuff that they've already subscribed to, versus stuff that they haven't subscribed to, versus stuff that's open access? And that's something we have to manage between the publishers and the users.
ROY KAUFMAN: Making sure everyone feels like it's fair. And then in the non-commercial space, I'm actually more aware of what goes on sort of on the lobbying front-- in the UK and then in the EU, where at least most of the commercial science publishers didn't really oppose the exception I talked about for non-commercial reuse for their customers in an academic or non-commercial setting.
ROY KAUFMAN: So that's the commercial actor saying, yeah, this is OK as a matter of law or as a matter of lobbying practice. So that's sort of what I've been seeing.
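To show why "normalized XML" matters for text mining, here is a minimal sketch in Python: it flattens a structured article record into plain text, which is typically the first step before any mining or statistical analysis. The element names (`article`, `title`, `abstract`, `body`) are illustrative only; they are not CCC's actual schema, which isn't described in this session.

```python
import xml.etree.ElementTree as ET

# A hypothetical, simplified article record -- invented for illustration.
article_xml = """<article>
  <title>Kinase inhibitors in oncology</title>
  <abstract>We survey recent kinase inhibitor trials.</abstract>
  <body><p>Results suggest improved response rates.</p></body>
</article>"""

def extract_text(xml_string):
    """Pull all human-readable text out of a structured article record,
    the kind of flattening step a text-mining pipeline starts with."""
    root = ET.fromstring(xml_string)
    # itertext() walks every text node; drop whitespace-only fragments.
    return " ".join(t.strip() for t in root.itertext() if t.strip())

text = extract_text(article_xml)
print(text)
```

When thousands of publishers' formats have been normalized to one schema, this kind of extraction works uniformly across the whole corpus, which is the practical value of the service described above.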
NANCY SIMS: I think that totally sounds like things that I've been seeing too. One of the things I think about, from our perspective as a library that subscribes to things, is that we've spent a lot of time-- and this connects to something that Roy was talking about in his prepared notes earlier, about fair use and when people's plans for how they will use their work, or rights holders' plans, change over time.
NANCY SIMS: Several years ago, text mining was not a use people were necessarily selling access for. Now it is. Does that mean it's not fair use anymore? And that's a really interesting and really deep tension in fair use in general. A variety of lawyers I know-- maybe not everybody-- talk about it as the chicken-or-egg problem of fair use.
NANCY SIMS: If a court rules a particular use is fair use, it's not something you have to pay for, even if somebody would like you to pay for it. So when people stake out the ground that it's something you need to pay for before a court has ruled on that, I'm always really interested, because you can equally stake out the ground that it's not. And if you don't stake out the ground that it's not, then you lose in the long run.
NANCY SIMS: If you don't say, this is not something I need to pay for, you may end up having to pay for it in the future. Like I said, this is a deep tension. And it's not just about machine learning. It's riddled throughout anything to do with fair use and commercial exploitation. But one of the things on the library side of things that's really interesting and challenging to deal with is, for example, that my organization has been working for several years to try to make sure that our contracts say that our statutory copyright rights-- somebody is joining us, sorry-- are preserved.
NANCY SIMS: In many cases, we negotiate with publishers to say that anything that the law allows us to do, our users are allowed to do. And many commercial vendors and publishers have been changing or trying to change licenses over the years to say that text and data mining is definitionally something that our users do have to pay for regardless of other parts of our contract.
NANCY SIMS: So that's a bit of a challenge. Where what the publisher is offering is a value-added product, I'm actually pretty OK with that kind of an agreement. If you have something you can give to my researchers that is better than them scraping it out of your content, excellent. That's something that might be worth us paying for. If what you want us to agree to is that our users need to separately get permission for something that the law would normally allow them to do with a database, that's a challenge.
NANCY SIMS: And that's not so much copyright law. That's about contracts, and licensing, and all of those things. But this is an area where these things are kind of playing out with a little bit of back and forth of do we need to pay for that? Is this something that's additional value? And all of that kind of stuff.
ROY KAUFMAN: Yeah. I mean, the challenge that Nancy's talking about is largely around fair use, and the fact that there is certainly case law that says the most important thing is the impact on the market of allowing the use. Then you say, well, if you declare it fair use, then there's no impact on the market. And if you declare it an infringement, there's a big impact on the market.
ROY KAUFMAN: So your number one test is-- so courts might say, well, is there a license in place? Or is there licensing? Or does this look like what you would normally license? Or how does this fit? And that's why sometimes I think the European approach makes a little bit more sense. And I won't get into the actual contours.
ROY KAUFMAN: But just to say, here's what you can do. Here's what you can't do because otherwise it gets kind of mushy. But that's fair use, and that's what we live with. And I guess that's why we have jobs. Right, Nancy?
NANCY SIMS: And that's something I remember from your prepared remarks too-- you were saying the European approach is kind of nice because you get a decision at a particular point in time. My counterpoint is, well, yes, but also you have to wait until you get a decision from a legislative body about whether something's allowed or not. If you haven't gotten a law saying it's allowed, then it's not.
NANCY SIMS: Whereas fair use provides the space of I think it should be, or I think it shouldn't be. I like arguing though. So I mean, I know a lot of people like the clear definitions.
ROY KAUFMAN: Yeah. I mean, I could cut this either way. One thing is, you can say legislation solves yesterday's problems tomorrow. But you can say the same about litigation. So you're never really sure what's going on today by looking at anything, particularly with something as quickly moving as AI.
BOHYUN KIM: Yeah. Thank you, Nancy and Roy. I think these are such interesting things, and we could just talk forever about this. These things are always changing, so I think that makes it really fascinating. So thank you so much for your great answers. And now we're going to move to the live chat room. So I hope all the attendees will come with us to talk to Roy and Nancy in real time.
BOHYUN KIM: So thank you, everyone.