Name:
The Scholarly Kitchen: Trends Around AI and Scholarly Content Licensing
Description:
The Scholarly Kitchen: Trends Around AI and Scholarly Content Licensing
Thumbnail URL:
https://cadmoremediastorage.blob.core.windows.net/559d1eff-68e8-4395-b60d-3f57247adf11/thumbnails/559d1eff-68e8-4395-b60d-3f57247adf11.png
Duration:
T01H00M54S
Embed URL:
https://stream.cadmore.media/player/559d1eff-68e8-4395-b60d-3f57247adf11
Content URL:
https://cadmoreoriginalmedia.blob.core.windows.net/559d1eff-68e8-4395-b60d-3f57247adf11/GMT20241120-180000_Recording_gallery_1920x1080.mp4?sv=2019-02-02&sr=c&sig=b0tnA1rvcUgb%2F4%2FrOn9JEDNB6zP5B%2Bkqca41lU%2FEYGQ%3D&st=2025-08-01T18%3A08%3A47Z&se=2025-08-01T20%3A13%3A47Z&sp=r
Upload Date:
2024-11-22T00:00:00.0000000
Transcript:
Language: EN.
Hello, everyone. We're just starting to assemble and let everyone into the webinar. We'll get started momentarily.
Hello, all. We'll get underway in about 30 seconds. OK, we'll get underway now. Thank you for joining us.
And welcome to today's SSP Scholarly Kitchen webinar, Trends Around AI and Scholarly Content Licensing. Before we start, I want to thank our 2024 education sponsors, Access Innovations, OpenAthens, and Silverchair. We are very grateful for your support. My name is Lori Carlin. I am the chief commercial officer at Delta Think and the lead of the SSP webinar subcommittee. Before we get started, I have just a few housekeeping items to review.
Attendee microphones have been muted automatically. Please use the Q&A feature in Zoom to enter questions for the moderator and panelists. You can also use the chat feature to communicate directly with other participants and organizers. Remember to select Everyone to share your chat message with all the attendees. Closed captions have been enabled.
If you don't see the CC icon on your toolbar, you can view captions by selecting the More option on your screen and choosing Show Captions in the dropdown menu. This one-hour session will be recorded and available to registrants following today's event. Registered attendees will be sent an email when the recording is available. A quick note on SSP's code of conduct: in today's meeting, we are committed to diversity, equity, and providing an inclusive meeting environment that fosters open dialogue and the free expression of ideas, free of harassment, discrimination, and hostile conduct.
We ask all participants, whether speaking or in chat, to consider and debate relevant viewpoints in an orderly, respectful, and fair manner. And now I'd like to briefly introduce today's moderator, David Crotty. David is the editor in chief of The Scholarly Kitchen. He is also a senior consultant at Clarke & Esposito, a boutique management consulting firm focused on strategic issues related to professional and academic publishing and information services.
So with that, I will hand it over to David to get us underway. Great, thanks. Thanks, Lori, and thanks to everyone for joining us today. If you've been to a publishing meeting in the last year or so, you've probably already heard an awful lot about artificial intelligence. Just as the potential impact of AI has dominated the conversation in so many other aspects of our lives, from technology companies to classrooms, so too does it have profound implications for the communication of research.
Most of the focus in the scholarly communication world has been on internally facing questions: how can I make peer review or the publishing process more efficient, or, on the research integrity side, what are the implications of AI and the issues it poses there. But today we're going to look at a slightly different angle and talk about the business of licensing content for AI use.
The research literature is a very valuable corpus of high-quality content. A lot of these AI training sets are just pulled from whatever's available, and there's a lot of poor-quality material in there, a lot of stuff that's unreliable. So the research literature has a lot of appeal: it's gone through a rigorous pre- and often post-publication evaluation process.
So today what we're going to do is take a look at the opportunity and the value that's presented here. Is this a real business? How is it likely to work? What sorts of things do we need to be thinking about if we're going to be licensing our content for use in this manner? I'm going to start by having our panelists introduce themselves.
Maybe we'll go alphabetically by last name and start with Todd. Hello, everyone, and welcome. My name is Todd Carpenter. I'm the executive director of NISO, the National Information Standards Organization. We develop technical standards for publishers, libraries, and software providers that serve the content creation communities.
I'm also one of the chefs on The Scholarly Kitchen. Thank you. OK, we'll go next to Mandy Hill. Hi, so I'm Mandy, from Cambridge University Press, which is the academic publishing part of Cambridge University Press & Assessment. And just to give a bit of context for those less familiar with our publishing, we publish about 1,500 books a year and about 400 journals.
And perhaps also relevant in this conversation is the size of our backlist. We have, depending on how you count, somewhere between 30,000 and 50,000 ISBNs in our backlist. You would think we'd know the exact number, but it does depend on how you count. So it's a large backlist, which is relevant in some of these conversations. And on the journal side, about two-thirds of our journals are owned by scholarly societies as well.
OK, next, Josh Jarrett. Hello, everybody. I'm glad to be here. My name is Josh Jarrett. I am the senior vice president of AI growth at Wiley. AI growth is a new team that we formed in the last year to really deeply understand these AI opportunities and risks and figure out how we can leverage them for both impact and opportunity for Wiley and our society partners.
Great, and finally, Roanie Levy. I want to make sure I got your name pronounced right. You did. Thank you. So I'm Roanie Levy. I recently joined the Copyright Clearance Center. I work with the CCC team on everything that's at the intersection of AI and copyright. Prior to starting to work with CCC, I was the CEO of Access Copyright.
That's the equivalent of CCC up here in Canada. I also work with enterprises that are looking to integrate generative AI into their work and to do so responsibly, so I do a lot of work on responsible AI integration in enterprises. OK, great. So I want to start with the basics and a definitional question here: what exactly is this market?
What are we talking about? Who are we licensing content to, and what sort of content are AI organizations interested in? Who wants to jump in and start us off? Not everybody at once. Well, maybe I could start with a high-level overview. I think initially, once LLMs became a reality back in late 2022, what we started seeing was really about licensing for large language models.
But today, it's much broader than that. It's not just AI companies per se, but anybody who's looking to leverage generative AI in their work and needs to use proprietary or protected content in order to optimize their use of AI. So the market is growing. We're looking at LLM operators, but we're also looking at enterprises and AI application developers.
And I would just build on that. I think there's a distinction, exactly as Roanie said. There's the model builders, and book content has historically been most interesting there because it's long form, it's more comprehensive, it's written at an accessible level but still a relatively high Lexile level, and it's a lot of words.
They need a lot of words for this training. I think now we're seeing those AI developers coming back around and saying, we're trying to get smarter in a particular subject, we're trying to fine-tune or refine our models, so we're looking for a physics catalog as opposed to a broad-brush catalog. So that's evolving in that LLM model training space. But then, exactly as Roanie said, we're seeing a lot of individual companies saying, I'm trying to build AI tools, and I want to bring in high-quality information from third parties to match with my internal first-party data.
And that's being used in much more of a real-time application, or what's sometimes called RAG, retrieval augmented generation, where I'm not just using the weights in the model that was trained on the original data, but directly referencing the content. That's a sort of ongoing subscription relationship, and that's where authoritative content is particularly valuable. I think we're seeing more interest in journal content in those kinds of inference RAG implementations.
Yeah, let's break that down a little bit. I want to talk about the types of content and what's useful in a second, but first this LLM-versus-RAG distinction. For the large language models, we're talking about licensing for training, your ChatGPT, something like that: you need a lot of content to train the basic model. And that's different from RAG, retrieval augmented generation.
And with RAG, if I'm understanding this correctly, you take the trained LLM system, and then, depending on the query, that system will reach out and pull in the information necessary to answer that specific query. So can you talk about the difference in licensing to those different types of systems? There's the "we're going to license everything to you to train your LLM" model, versus potentially a subscription model of "I'm at an institution that has access to this LLM and these types of content that can be brought in for RAG."
Yeah, I'll jump in here. Essentially, there's kind of a foundational level: a handful of really large models that are essentially trained to understand language and how humans communicate. And they need a tremendous amount of data, billions and billions, trillions of words, to understand how language works, and how language works in different situations, say, in Shakespearean theater versus scholarly academic papers on physics.
And building that representation, to both understand your query and produce a result that sounds like English, is done through mathematical analysis of tremendous amounts of information. It doesn't really matter what that information is, necessarily, because what the large language model is trying to do is just understand your language, process it, and then generate text that you as a human will understand.
And we don't need to get into all of the details about how it tokenizes content, et cetera, all of that fun stuff. But building upon those general models, and there's only a handful of really large general models, you then have literally hundreds or thousands of applications that are built upon those models, specific to particular tasks you'd like to do.
As Josh described, a kind of RAG model that is specific to pharmacy. And in order for that to work, we want trusted and vetted content in pharmacy, pharmaceuticals, and adjacent areas, biomedicine, maybe. And you define the corpus you want to query against, and maybe mix your data with their data in a way that uses natural language, but also uses other machine learning processes, to generate meaningful and interesting results.
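To make the retrieval-augmented generation pattern the panelists are describing concrete, here is a minimal sketch in Python. It is illustrative only: the toy corpus, the bag-of-words scoring, and the prompt format are stand-ins for a real embedding model, vector store, and licensed LLM endpoint; none of the names come from the panel.

import math
from collections import Counter

# Toy "vetted corpus" of licensed content with provenance metadata.
CORPUS = [
    {"doi": "10.1000/example.1",
     "text": "Aspirin irreversibly inhibits platelet aggregation at low doses."},
    {"doi": "10.1000/example.2",
     "text": "Staging conventions in Shakespearean theater varied by venue."},
]

def bag_of_words(text):
    # Crude stand-in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, k=1):
    # Rank the licensed corpus against the query; keep the top k documents.
    q = bag_of_words(query)
    ranked = sorted(CORPUS, key=lambda d: cosine(q, bag_of_words(d["text"])),
                    reverse=True)
    return ranked[:k]

def build_prompt(query):
    # Retrieved passages go into the prompt with their DOIs kept alongside,
    # so attribution can flow through to the generated answer.
    context = "\n".join(f"[{d['doi']}] {d['text']}" for d in retrieve(query))
    return f"Answer using only these sources:\n{context}\n\nQuestion: {query}"

print(build_prompt("How does aspirin affect platelet aggregation?"))
# The assembled prompt would then go to a licensed LLM endpoint; the model's
# weights are untouched, which is what distinguishes RAG from training.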
So it's important to think of those two things as very different and distinct, and also to think of them as distinct licensing markets. Yeah, Josh had brought up this idea of the big deals we've heard about, Taylor & Francis doing a big deal, Wiley doing these sorts of big deals. As far as I understand, these are largely book content, and I'm assuming that leans into the LLM training side: we want these big, comprehensive bodies of content.
I think one of the real questions, certainly on the publishing side, is how much of our content is too niche to play in this market. You can see a significant market for drug discovery, a pharma company having their LLM and their RAG system that goes out and pulls that in, or clinical medicine. But is there a market for 17th-century Italian poetry journal articles, I guess is my question.
And what's the difference between those two markets, as we think about licensing and where things are going? If I could just add one nuance: we're actually starting to see the market move even within licensing. So, building on Todd's explanation of the difference between LLM and RAG systems, what's interesting is stepping away from scholarly and learning from what's happening in other areas.
When you look at news media and the licensing that's happening there, what we're seeing is a lot of licensing happening today, much of it RAG-based rights, by the same players that are being sued for having trained their models without getting authorization first. So first they needed to build the brain. Now they have the brain, and they're trying to appease the rights holders by having some deals there too.
But really, what they're looking to do is make their brains more current, and make the applications that sit on the LLM brains, such as ChatGPT search, for example, which uses a RAG system, more current too. And I think we're going to see them go through the different verticals. So right now it's news media, and they're getting deals to be able to do RAG, which allows them to do attribution as well.
Which is a great benefit to everyone. And then we're going to see them go through the different subject matters and different areas, as the use cases and the adoption start happening in enterprise in particular. Right now, if you look at ChatGPT, for example, most of OpenAI's revenues come from consumers and not enterprise. But within five years,
that's going to flip: enterprise revenue is going to be dominant and consumer revenue is going to be lower. And you can imagine, in that context, when it's consumers, then you need news. But when it's enterprise, they're going to need other kinds of more specialized, vetted content, the kind of stuff we see from scholarly publications. And so I think we're going to see them move down those different content types.
Yeah, and you mentioned this idea of licensing to AI companies versus licensing to different entities. Do you see a future where we just license chemistry content to OpenAI, and then the chemistry corporations come and license from OpenAI? Or does everyone have their own general system, and then you start licensing corpuses of data to the different pharma companies or whoever?
Or is that yet to be determined? I think we're going to see a multitude of these different scenarios. Absolutely, we're seeing that already with a lot of enterprises. First we had the LLMs. Now we have enterprises who are looking to integrate generative AI, and they need data and content to do this well.
And so they're looking at RAG systems. They also need things that the LLMs did not need: audit trails, transparency, explainability. And all of that leads them to more specific, high-value content that they will need to license access to, because they want it to be current, they want it to be regularly updated.
So I think this whole area is going to bloom from all kinds of different directions. The other thing I wanted to throw in there is that I don't think this is all going to be about new partners. There already are discoverability partners we work with, to whom we've been licensing our content in a variety of different ways for traditional discoverability.
And now the way they are going to achieve that discoverability and improve their tools is going to be through the creation and use of generative AI and RAG, too. And so the terms and licensing deals under which we work with some of our traditional partners may also change. Yeah, the point is that licensing is not new; we've been doing it for a long period of time.
And it's a mechanism to extend impact and reach and to enhance discoverability. It's a new set of tools, and so it brings new challenges, but I think it's a continuation rather than something brand new. I also think the point that this is a general-purpose technology that is going to need to be customized into many, many different use cases is a key one. Think of the internet revolution or the smartphone revolution.
We're in those similar early days, where it's really the iOS-versus-Android fight that's happening right now. These model builders are fighting over who's going to be the base infrastructure that people build the applications on top of. But if you think of smartphones or the internet, it's the applications, it's TikTok, it's Facebook, it's salesforce.com, it's Amazon.
That's where the value got created, as we started serving different use cases. And I think that's Roanie's point: now the work begins to solve real use cases. There are general tools out there that are good for "how do I write a haiku about my dog." Then there are people building horizontal solutions that serve customer service or marketing or whatever, and then they're going to get to vertical solutions that serve medicine or electric vehicles.
And I think we literally have a decade of working through these use cases, getting the right solutions for the right problems. Yeah, absolutely. And to Josh's point, our community has been licensing this content, and we've even been testing the waters, if you will, with applications for text and data mining.
We've been talking about licensing for text and data mining for seven, eight, nine years. This is a similar type of application to that, with some interesting caveats about how the information is used and stored, and how we're going to continue that ongoing relationship. So this is not just "I sold it to you once, you sliced it into a million tokens, and now it's embedded in the system."
That's not a sustainable model for any publisher. The ongoing relationships that Mandy was talking about are going to be critical here. And I do expect that some of the service providers, the search and discovery services, and other intermediaries will provide tools to the scholarly market and the professional market that are customized to a particular domain or niche, in the same way that there were A&I services for literature or Shakespeare.
Whether they will be that narrow, we will wait and see. But to Josh's point, I think there absolutely will be 10-plus years of this market trying to figure out how we serve the needs of end users with this new technology. And similarly, we'll be talking for 10 years about how best to license this content to those tool providers. Right. I want to turn to a phrase that kept coming up in our discussions as we were setting up this webinar, which was responsible AI.
So the question I have for the panel, and maybe we'll start with you, Mandy, because I know Cambridge has been working a lot on this, is how do we ensure that the AI we're working with incorporates the values of the research community? For example, we've seen a lot of, at least, very vocal pushback (I can't speak to the quantity) from authors saying: I published my book, my publisher has now licensed it for AI training, and no one asked my permission.
How do we address the concerns of our authors? Yeah, and I think the point that Josh made a moment ago is critical here. This is a continuation of licensing that we've been doing for years, and I think most publishers, Cambridge included, can look at our contracts and think: legally, we have the rights to do this, because these are rights we've been working with for some time.
And each publisher is then going to make its own choices about what it wants to do. At Cambridge, we took the decision that we wanted to ask authors if they wanted to opt in, and that was a big decision for us. That's why I made the point about the size of our backlist: it is no small feat for us to contact authors to ask them to opt in. We felt that our relationship with our authors is so important that this is about more than just our contractual rights.
In this space in particular, at this moment in time, there are so many unknowns and so much heat. We felt that really getting to grips with what was driving and motivating authors, and how they felt about it, was critical, in a way that just doing a survey, to be honest, wasn't going to tell us. So we felt this was the right thing for us.
During this first wave, we've had a really good opt-in rate: well over 50% of authors have already opted in, and I think that's really encouraging to see. We're making it very clear to authors that we think this is in their best interest; we are encouraging them to opt in; we think it's the right thing for them to do. Through this, we're learning what matters to authors.
And obviously that's going to inform our approaches moving forward. This has been the approach we've taken in the first wave, as I say, but we need to think about what the approach beyond that will be. Yeah, I'm cognizant of the lift involved in that. I'm thinking of the days when we first started making ebooks, and we realized: oh no, we have to go back to every book we've ever published and track down the author.
All of the authors. Very similar. But it sounds like it's been really positive. It seems like there's a lot of goodwill from just including the authors in the discussion, rather than making that choice on their behalf. And just to be clear, I completely understand why other publishers have not made that choice, because I'm sure most publishers look at their licenses and think, we have the rights to do this.
And there is a moment in time here, as we've just been saying. I think the opportunities from RAG deals will be persistent, but some of these LLM deals are a moment-in-time opportunity, and so different publishers are going to be driven by that need for expediency. So yeah, I completely get why different people have made different choices.
But for us, this was the route we went down. Yeah, I think an interesting question there also is: OK, money came in from this; you have licensing terms in your contracts with an author; how is that handled? There's a post in today's Scholarly Kitchen from an author who's published several books that are included in these deals, and she has been asking her publisher: what am I getting paid for this?
How does this work? How do you allocate? If we're selling 20,000 books to OpenAI to use to train their LLM for X amount of money, how is that allocated to the authors for their share? I think that's really difficult. And I think the answer may well be different for LLM agreements than for RAG agreements, where you're going to get much more information on what's happening, what people are using.
So you may well end up with much more even allocations on the LLM revenues, versus something more fine-tuned from the RAG deals. Yes, number of words. And to the question about responsibility here: I wrote a piece a couple of weeks ago in the Kitchen about attribution as a critical feature of scholarly processes.
My wife is a professor, and I spent the summer talking with a number of academics who are friends. They were concerned about this, and I'm paraphrasing, but their concern was: you'll have this machine, and it will take my ideas and combine them with other ideas and then spit them out in a way that won't be connected to me in any way.
So how do I get recognition and credit? And if you think back to the initial forays into open access, the position of at least some of the early proponents was: publish everything open as long as it's CC BY, so that you get that credit. And that credit was the most important thing, because you don't get any money.
That ran through all of the strategies around open access. And the problem is that the initial versions of AI systems and generative AI systems broke that chain. There are ways in which we as a community, I hope, can influence the conversation by saying: here are the things that are most important to our community. Attribution, quotation.
Other elements here, preservation of copyright, et cetera. We can recognize those things, and through best practice, through community standards, through technology implementation, maybe through legislation and regulation, there are different ways we can influence this community and embed some of our core principles. Yeah, Todd, I want to be careful on the language here, because I think it's important to separate attribution from citation.
And Lisa Hinchliffe sort of opened my eyes to this idea that when we talk about attribution, very often we're talking about a CC BY license. There's a court case involving GitHub and Microsoft over open source software under licenses that require, if you reuse this code, that you attribute the code. So there's a question: what does that mean for AI reuse of open access content?
And the CC BY license does not require you to cite someone else if you use their content to further your research or refer to their research in your work. It requires that, if you republish that piece of work, you say this paper was originally by this person, here's the DOI, that kind of thing. That's attribution, but it's different from what researchers themselves are looking for, which is citation: hey, I discovered this thing.
That's my idea that you used to further your research. And so when we talk about attribution, there are different pieces of it. There's: this is an open access paper, a big chunk of it is coming out of this AI, how is that attributed to the original? Versus: this is the work of Todd Carpenter, how do we ensure that Todd is cited so he gets the reputational career benefits of being an important researcher?
Yeah, and we all do this throughout our lives. I spend a lot of time talking with David, and I spend a lot of time talking to Josh. These ideas inform my opinions about the world. At what point do you say, well, this was Josh's idea and I'm building on it, versus, this is just general information in the world? It may have come from Josh, or it may have come from Roanie or from Lisa.
Yeah, but you also mentioned the idea of having some sort of enforcement, some ability to enforce community values on how this content is used. Right now the AI companies are largely ignoring the norms. Can we talk a little bit about those routes? Do we have any say in this, or is it just going to be: we license it and they do what they want with it?
Are there ways we can enforce the values of the research community in how the research community's content gets used? Yeah, I can jump in. I do think this is going to be an ongoing process and an ongoing dialogue, and so knowing where we're headed is going to be really important: what are the things we're solving for, even if we can't get them all today?
I have a little heuristic, although, David, you just blew it up. I call it TPAC, which is transparency, protection for copyright, attribution, and compensation. Those are the four aspirations we're working toward on behalf of the community and society partners and authors, et cetera. Now you're going to make me turn attribution into citation; I don't know, I might have to keep some linguistic gray area there.
But I do think that sometimes you can get all those things, and sometimes you can't. And I think there are ways that we ought to assert some points of view as a broader community, as stewards of research, and embed that stewardship, or force that stewardship, onto the counterparties in a lot of these conversations. They're not going to do it out of goodwill.
They're going to do it because it's part of the trade-off. And I think there are things we can do from an advocacy standpoint. There's work around transparency: what goes into these models? Wouldn't it be good to know what copyrighted material has already been ingested into these models? The EU AI Act is increasing the transparency of what's gone into them.
So what can we do from an advocacy standpoint? How do we think about the data provenance, and representing the data provenance, of what comes out of these models? I know that if I'm making a health care decision for a loved one, I want to know whether that model was trained on a Reddit discussion board, a preprint server, or the version of record that's been checked for retractions in the last 90 days.
So how do we make it really easy to show quality in terms of what's in those models, whether being used for RAG or otherwise? Citations certainly help that traceability. A couple more ideas: when you are engaging in these conversations or contracts, how do we write in there that you have to take updates? Science is not static.
It is a journey; it's a living and breathing thing. So within what time period do you have to accept updates, take out the retractions, and take in changes in the literature? We need to make sure that goes into those contracts. Attribution standards: let's make it super, super easy for attribution and citation to flow through the systems. And even just data standards.
I think there's a bunch of stuff we can do there. XML is not AI-readable, and we think XML is the latest and greatest format, but it's actually not the way these systems ingest content. So the more we can put into the metadata standards around that data, the more the quality flows through. And if it flows in, and we put some constraints in contractually, those quality marks should flow out the back side too.
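As a rough illustration of Josh's point about metadata carrying quality signals through ingestion, here is a hypothetical sketch of turning a JATS-flavored XML fragment into an "AI-ready" chunk. The element names, the fields, and the idea of riding retraction status along with the text are assumptions for illustration, not an existing standard.

import json
import xml.etree.ElementTree as ET

# Simplified, JATS-flavored fragment; element and attribute names illustrative.
XML = """<article doi="10.1000/example.42" license="CC-BY-4.0" version="VoR">
  <title>Platelet aggregation under low-dose aspirin</title>
  <abstract>Low-dose aspirin reduced aggregation in the treatment arm.</abstract>
</article>"""

def to_ai_ready_chunk(xml_text):
    root = ET.fromstring(xml_text)
    # Flatten the markup to plain text, but carry provenance forward as
    # explicit metadata instead of leaving it buried in the tagging.
    return {
        "id": root.get("doi"),
        "license": root.get("license"),
        "version": root.get("version"),  # e.g. version of record
        "retracted": False,              # would be refreshed from update feeds
        "text": f"{root.findtext('title')}\n\n{root.findtext('abstract')}",
    }

print(json.dumps(to_ai_ready_chunk(XML), indent=2))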
Josh, can you repeat what the TPAC acronym stood for? Someone is asking in the Q&A. Sure, sure. Sorry, so TPAC, not Tupac; that's different. TPAC is transparency, protection for copyright, attribution, and compensation. And just to double-click on the protection for copyright.
I mean, I think one of the reasons that most publishers and author advocacy groups are in support of licensing is that having a robust licensing market is one of the strongest protections against the fair use argument that's been used to justify training a lot of these models on copyrighted material. If there's no way to get access to this material because there's no licensing market, or, in some cases, if you have to go author by author to get permission (notwithstanding that I fully respect Mandy and CUP's point of view on that),
if it is really hard to get access to this content, it supports the courts siding in favor of fair use. So a robust licensing environment sets protections: you only have these grants of rights; you cannot produce substantially similar works; how many characters of verbatim text are you allowed to present; what happens if we see a violation?
Do we have an escalation path? You can actually build those protections for copyright into these contracts if you do them. Can I just say, I think we need to be a bit careful about talking about what we would put into contracts. I think we might be getting onto slightly dodgy ground; we want to avoid competition law issues, collusion, or things like that.
And that's an important point to make, because I want to ask a little bit about whether there are standards that can be set. But mindful of Mandy's caveat, let's start with: what do these deals look like? If I'm a publisher thinking about these deals, what sorts of things do I want to be sure to consider? So before going to that, I just want to add to Josh's comment.
Everything that he mentioned is part of this growing field called responsible AI. And it's not just AI companies that are looking at implementing generative AI responsibly; we're talking about widget manufacturers. They're all concerned about responsible AI, and everything, transparency, explainability, data provenance, all of that, is actually part of the different levers around responsible AI.
And I was looking at a survey that was done by McKinsey with business leaders who are looking to integrate, or have integrated, generative AI into their work: what are their areas of concern? What's interesting is that the number two area of concern is IP infringement. And when you look at what they mean by IP infringement, it is infringing everybody else's copyright.
That's what they mean. That's number two. I've been working in this copyright policy space for over 30 years, and I've never, ever seen copyright infringement be top of mind for business leaders across all of the industry verticals. That is a really interesting opportunity when you are in the content industry.
And it was only second to inaccuracy. So inaccuracy was number one, number two was IP infringement, and data privacy, bias, and security come after copyright infringement. So it is something that is top of mind not just for publishers, not just for content creators, but for the enterprises and leaders using AI. That's really, really important to think about.
So that's really an interesting opportunity for everybody in the content space. Yeah, there's a question before we leave that subject. We were asked: could you say a bit about risk management and guidelines for third-party material when publications are ingested? If I run an art history journal that's got a bunch of Andy Warhol paintings in it, how are publishers thinking about making sure we're not licensing things that aren't ours to license, I guess?
I mean, certainly from our perspective, taking images out is relatively straightforward in terms of what you then provide. I think where it gets a little bit harder, from our files at least (others may have better systems than we do), is how well tagged third-party text is in our old files. If we've got a poem in a journal article from 15 years ago, are we confident we can extract that before we provide the article to the LLM?
I'm afraid I wouldn't have complete confidence in that, so we are having to think through the implications: does that have any ramifications for the terms we would include? But images, I think, are much easier. Yeah, and so, to get back to that question, for Roanie and maybe for you, Josh, as well: are there standards for these types of deals? I think of groups trying to put together a standard transformative agreement, smaller societies saying, this is the standard transformative agreement I'm going to use,
because I'm a small society and I don't have the capacity to negotiate every deal myself. Are there standards or collective ways we can work on these types of things, so it's a little less "every deal is bespoke"? Just last week at the Charleston Conference, I had some conversations with publishers who are potentially interested in developing model license language around some of these ideas, in the same sort of way that there exist model licenses for subscriptions to electronic content in libraries.
There's precedent for developing model license language which, being a model, every publisher and every service provider is obviously free to adapt in appropriate ways for their own situation. But there's the start of some conversations around that, particularly within certain domains. Say you're developing a RAG model in
manufacturing, and you want particular manufacturing content from engineering societies or something (and I am making this up; this is not the conversation I had in Charleston, I want to protect the innocent). In order for a RAG developer to say "we have a trusted corpus against which our RAG model will work," you need a significant portion of that content, 70%, 80%, 90% of it, for it to be trustworthy, and no single publisher has that in any domain, even in some of the niche fields.
So there would need to be pools of publishers who work together and are producing content in those domains, and it would be helpful if they started from the base of a conversation with a model developer or a tool developer. Obviously, every one of those parties can adjust that model language, but there is precedent for this kind of collective action in the past.
That's certainly been the value proposition for collective licensing for over 40 years: standardizing the terms. Each publisher would have different terms, and those would continue to be in place; then on top of that would sit something like a CCC license, which is a rights-only license, not providing the content, to standardize the terms so that it's easy, particularly in enterprise, to use content across different publishers that have slightly varying terms, and then be able to use it in their RAG model, for example.
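To illustrate what standardized, machine-readable terms layered over per-publisher agreements might look like, here is a hypothetical sketch; every field name is invented for illustration and does not reflect CCC's actual license schema or any real agreement.

# Hypothetical standardized AI-use terms; all fields are invented examples.
STANDARD_AI_TERMS = {
    "granted_uses": ["internal_rag", "fine_tuning"],   # permitted uses
    "prohibited_uses": ["redistribution", "resale"],
    "verbatim_output_limit_chars": 200,                # quotation ceiling
    "attribution_required": True,                      # DOIs must flow through
    "update_cadence_days": 90,                         # take retractions/updates
    "term_months": 12,                                 # ongoing, not perpetual
}

def permits(terms, proposed_use):
    # An enterprise RAG builder could run the same check against every
    # participating publisher's content, which is the point of standardizing.
    return (proposed_use in terms["granted_uses"]
            and proposed_use not in terms["prohibited_uses"])

assert permits(STANDARD_AI_TERMS, "internal_rag")
assert not permits(STANDARD_AI_TERMS, "redistribution")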
So that's definitely one of the demands we're seeing from our enterprise customers: standardized terms, the ability to know that, irrespective of the publisher agreement they're under, they're still able to use the content within certain parameters and limitations in the RAG system. Yeah, so I have a wrap-up question for the end, but I wanted to get to some of the audience questions, because there are some interesting ones.
So Roger Schonfeld, for example, is asking: if we draw a line between licensing to LLMs, which is about language training, as Todd had said, and RAG, which is more about knowledge acquisition, one, do you agree with that framing? And then, as Mandy said, is this a particular moment when LLMs are being trained? Roger asks: does the LLM get good enough at some point, such that there's no more licensing revenue to be generated from LLM training,
and we're really just talking about the knowledge acquisition piece? I think this is a really, really important question, and I don't know that anybody has the answer. You could poll hundreds of people in Silicon Valley, and you'd probably get 50/50 answers on how smart the foundational frontier models get.
And so I do think that these frontier models, the biggest models, the ones you hear about in the news, are trying to get really good at language, they're trying to get really good at reasoning, and they're trying to get core knowledge, basic knowledge, built in there. And right now they're good, like good college students: they're smart, they do well on the test.
They haven't really lived in the world yet; they don't have a lot of common sense. And the question is, if you keep putting more content into them, do they become good master's students? Do they become good PhDs? Do they become experienced practitioners in their field? Where is the limit where you need to say: OK, that foundational model got me so far; now I need to build a specialized, either fine-tuned or RAG, application on top of it that's going to be the expert?
And I don't know that we know the answer. The scaling laws would say eventually they'll get there. I'm a little skeptical myself, because when they say scaling, it's not linear, it's logarithmic. The models we use now are $100 million models; the models being developed now are billion-dollar models;
and the next ones are going to be $10 billion models. I don't know that the math works to have a $10 billion model that is a PhD in every discipline. So I think it is an open question, and I'm sure my fellow panelists here know the answer and are about to tell us. Oh, no. No, I was going to jump in and say: 100%, Josh.
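One way to make Josh's "not linear, logarithmic" point concrete, assuming the published neural scaling-law form (Kaplan et al., 2020) continues to hold, is that training loss falls only as a small power of compute:

L(C) \approx \left( \frac{C_0}{C} \right)^{\alpha}, \qquad \alpha \approx 0.05

so multiplying compute, and therefore roughly cost, by 10 multiplies the loss by 10^{-0.05} \approx 0.89, about an 11% improvement per tenfold increase in spend. Costs compound multiplicatively while measured gains arrive in roughly equal increments per order of magnitude; the exponent is carried over from the 2020 paper as an assumption, not a number from the panel.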
I think we also need to fundamentally recognize that artificial intelligence systems are mathematical models. They don't know things. They can recognize attributes of information, but they don't know what red is, fundamentally, and they don't know what heavy is. They don't know what weightlessness might be.
There's no reasoning going on, no understanding, no intelligence. Yes, yes, it's a different kind of intelligence. And there is also a probabilistic element to these systems. It's really fascinating: go into any one of these LLM systems and ask it the same question 30 times.
If you ask it, what is 2 plus 2, it will say four. And if you ask it again and again and again, it should always say four. It won't always, because it's probabilistic; if you keep asking it the same question, it thinks it's done something wrong. There's some fascinating research by a team, I think at Microsoft, just last month, which explored this question. And there was an Apple study where they changed the words, the same question with extra information put into it.
That shouldn't change factual answers, but it does. It was a word problem: how many apples does Susan pick from a tree? Versus: how many apples does Susan pick from a tree; small apples are common in the fall; and things like that. And it got wildly different answers. It's the sort of language test you'd take in fifth grade and pass, but these systems fall down. Will they get better? Certainly. But I think... sorry, sorry, Todd.
But these things, these systems fall down. Will they get better. Certainly but I think the. Sorry sorry, Tom. No, I just think it's a fundamental when you're asking, as Josh said. Or am I going to die of cancer. You don't want a probabilistic answer. You would a definitive answer.
So, Mandy, go ahead. I was going to say, the other reason I think these models will never be done: I think the curve will slow, but language is constantly evolving, as we all know. And as Todd was just saying, these are probabilistic; they're building on trends and word patterns. So as we all start to use language in different ways, as we do all of the time, they need to learn those new patterns.
And so they can't just be based on the way language was, and on what we published up until 2023, because that won't give them the answer to how something should be answered in 2030. Yeah, I mean, Josh, you mentioned this idea of ongoing renewal, particularly of the scientific record. If you just ingest all the papers from the 1950s, you're not going to get a very good answer for where we are in 2024.
So can you talk a little bit about this idea of subscription, ongoing revenue, rather than a one-time-deal kind of approach? Yeah, and I think this is at the heart of the answer to Roger's question about the hope for the future. In a world of infinite AI-generated content, my belief and my hope is that the market for new knowledge from trusted sources is the only safe harbor, the only evergreen place.
Because there'll be so much other stuff. How do you cut through that? How do you update? How do you carry the human experiment forward? It's going to be the process of peer review, whatever that ends up looking like in the future: content from scholars, validated by peers. This is actual, certified new knowledge that goes into the human project.
So that's the hope. There is a related question around whether this is just about distribution; isn't this just A&I, part two? And I think there is a part of that that's true. But think of how many times a researcher or scholar goes to Google Scholar or wherever they are, finds an article, hits one of our websites, and downloads it.
They find a, they Google Scholar or wherever they are, find an article, they go, they hit one of our websites, they download it. That's not I don't need that. So is that what 8 out of 10 articles. I don't know what the number is now that I. How many people read beyond the abstract. Yeah right. And help you figure out that actually those eight you didn't need to go skim or read the abstract or whatever it was.
then I think it will be a more refined way in which people engage. Hopefully they're going deeper; hopefully they're going outside their discipline to find a similar method that was used for a different problem. I'm hoping they're finding new content, so maybe they're finding two of the original ten, plus two more new ones they never would have found before. That's a cool idea.
But maybe the total number of page views and downloads, if we're still in that mentality, is going to go down. Yeah, I mean, that gets to some of the questions Sam has asked in the Q&A, which are almost existential questions: if I can get Google's AI, Gemini, to summarize a paper for me, do I really want to read that paper, and what does that mean for a journal publisher? Traffic goes down by 90%, and only the people who are really, really deeply into that very specific subject actually go and get the paper.
We may have lost Mandy. But do we just become a licensing business at that point? But, David, to your point: we've conflated usage and value for a very long time, and if that paper is downloaded and read and applied by 20 people instead of 2,000, those 20 uses are probably more valuable than the 2,000.
And again, we as a community have gotten too wrapped up in the impact factor and quantification, as opposed to real value. Perhaps we'll have more high-value use, as opposed to unused downloads. Yeah, and that gets, as we're up on time, to my closing question. We've talked about licensing, we've talked about the business, but maybe let's take a step back and think about mission.
Every scholarly publisher, whether commercial or not-for-profit, shares this goal, this mission of bringing essential information to the world. So, final thoughts: beyond the business and financial side, what are the benefits of bringing the content we produce to AI training or RAG use that will help us fulfill our missions?
Thoughts? Oh, Mandy, I think you're muted. Is there a way to unmute Mandy? Susan, I don't see any way to unmute her. Oh well, try unmuting now. Mandy, see if that's any different. There you go.
Try... Nope, still can't hear you. All right, anyone else, thoughts on the big picture, on mission fulfillment here? It says you're not muted; your audio is not working. Yeah, I might offer two, and hopefully we can hear Mandy's wisdom shortly.
I think one is new discovery: the idea of looking for white space in the research, finding hypotheses that should have been asked or can be asked now, or finding experiments or relevant things in adjacent literature. Our disciplines have gotten so deep that it's just impossible to be interdisciplinary in all the 75 dimensions you'd want to be.
And so that ability to reach across, to have hundreds of research assistants working for you at any point in time, is kind of exciting. That's one thing I think is compelling. And I think the other is just how much of the content in research articles doesn't get out to new audiences. Think of scientific education. Think of the ways these nuggets can actually get out: the idea that a Perplexity search, or a consumer search, is going to pull a nugget from a scientific journal article because that's the relevant piece.
I think there are so many more audiences for this content, which has been trapped in that form factor for hundreds of years. Mandy, do you want to try again, or... Nope, still can't hear you. Well, I think that's a really good point, Josh. Because, getting back to Todd's idea that these things aren't necessarily thinking, they are dealing with enormous quantities of information.
Remember when we were in the big data era: we're not going to have science anymore, we're just going to do big data and the machines are going to figure out what it means. There's a piece of this about finding those correlations that maybe we didn't come across, or taking those hundreds if not thousands of small, incremental pieces and bringing them together in a way that the human mind can't. I still think there's a piece of human inspiration and creativity that is always going to be essential for progress and for new ideas to come about.
But I think there's an incredible power here that can drive a lot of progress and spur those new ideas. And maybe beyond discoverability, there is just accessibility. Most of those research papers are way beyond my ability to comprehend. Someone sent a paper to me recently, and I started reading it and thought, there's no way; I could get through page one. But I fed it into an AI, and it helped me understand what the paper was about and where I needed to go.
And so that's a huge accessibility gain, which has tremendous value for society. Yeah, getting beyond our small niche audience of professionals: the journal is traditionally an in-depth conversation between professionals, and this allows that same information to reach a much wider audience. So, very positive. I think we want to kick back to Lori.
Do you want to close things up for us now that we're at time? Sure, sure. Just to thank everyone, thank our participants, and thank our panel for a wonderful discussion. And thanks again to our sponsors, Access Innovations, OpenAthens, and Silverchair. Today's webinar was recorded, and all registrants will receive a link to the recording when it's posted on the SSP website.
And that concludes our session for today. Thanks, everyone. Thanks for having us. Thank you.