Name:
NISO Plus 2024 - Thomas Padilla - States of Open AI Keynote
Description:
NISO Plus 2024 - Thomas Padilla - States of Open AI Keynote
Thumbnail URL:
https://cadmoremediastorage.blob.core.windows.net/77531239-61fa-48aa-807f-4885a13cc672/videoscrubberimages/Scrubber_1.jpg
Duration:
T00H56M38S
Embed URL:
https://stream.cadmore.media/player/77531239-61fa-48aa-807f-4885a13cc672
Content URL:
https://cadmoreoriginalmedia.blob.core.windows.net/77531239-61fa-48aa-807f-4885a13cc672/NISO Plus 2024 - Thomas Padilla - States of Open AI Keynote.mp4?sv=2019-02-02&sr=c&sig=S4b0w%2FU09e0NUHNv%2BMuXd2SnjB%2FL25%2FteZb5e5k9LKA%3D&st=2024-12-10T07%3A27%3A19Z&se=2024-12-10T09%3A32%3A19Z&sp=r
Upload Date:
2024-03-05T00:00:00.0000000
Transcript:
Language: EN.
Segment:0 .
Morning, morning. Morning. Thank you. So, as Jason mentioned, my name is Thomas Padilla. I'm at the Internet Archive. I'm here to talk about states of open AI, to some degree. It's a talk about AI, but like most talks about technology, it ends up being a talk about us and about what we want to achieve.
That's kind of what we do. I do want to start by thanking the many folks who have contributed to the content of this particular talk. Some of them will be recognized formally on the slides, but many of them have also just been part of informal conversations and community, not unlike what happens at a conference like this. So I want to recognize them, you know, formally, for the record: thank you to all these great people.
So what the heck is open AI? Is it open source AI? Is it open data? Is it open infrastructure? Is it open science? Is it open method? Like, what the heck is this? Why does it matter? There's a fair amount of flux about what open AI is and why that matters.
You know, to some degree, having some interpretive flexibility around different kinds of concepts is helpful, particularly when we're trying to do interdisciplinary or interprofessional work. We have some sort of wiggle room there. It accommodates a certain degree of collaboration, and when we concretize things, it sometimes makes things a bit more difficult. Where having interpretive flexibility around what open AI is becomes a bit problematic is when we start to move into operations.
Drawing on this great work from David Widder, Sarah Myers West, and Meredith Whittaker — "Open (For Business): Big Tech, Concentrated Power, and the Political Economy of Open AI" — you know, they say: we find that the terms open and open source are used in confusing and diverse ways, often constituting more aspiration or marketing than technical descriptor, and that there is no commonly accepted definition of what open means in the context of AI.
Why is that consequential? As I just mentioned, I think it becomes more consequential as we move toward operations and implementation, and we start moving into assessment of licensing terms and infrastructure requirements. We think we know what open means, and then we start having this kind of Escher-like experience: we're on a road, we think we know what to expect, and then we encounter a license term where it's like, whoa, that actually seems not quite open.
Or we encounter what the infrastructure stack looks like: oh gosh, that's all highly proprietary compute that is incredibly expensive. How the heck are we going to afford this? How is this actually going to work in practice? So I think it behooves us to have a more concrete sense of what we want from AI and the knowledge work that we do collectively together.
I want to tell a quick, I guess, allegorical story about trees before we get into the substance of this talk. There's this newsletter called The Overstory, and there was an article in it by someone named Peter Smallidge, the director of the Cornell University teaching and research forest. He talks about forest management, and one of the common things that happens in the management of a forest is the question: do I girdle this tree, or do I fell this tree?
And you may be thinking, well, what is girdling a tree? I, for one, was not current on tree care, maybe because I live in Las Vegas and there aren't many trees. But for those of you who don't know, girdling a tree is the process where you disrupt the living connection between the roots and the leaves, usually by cutting or chopping away the outer bark and the inner bark, or the cambium. You peel away, technically sever, the phloem — the vascular tissue that carries the products of photosynthesis from the leaves to the roots.
Therefore, girdling starves the roots of the tree, and the tree will die over a year or more. Why might you girdle a tree? According to this newsletter, you would girdle a tree for a couple of different reasons. One might be that you don't want to cut the tree down, because if it falls, it'll smash and kill smaller trees, and you don't want that to happen. Or there's some residual structural value to having a dead tree continue to stand and serve as a habitat for other creatures in the forest.
You're like, huh? How does this connect to knowledge? What is happening? Why are we talking about cutting a tree down or killing a tree? I started to think of this as sort of an allegory for what could happen to our world if we adopt AI too easily, in a way that basically takes us away from the connection between the means of knowledge production and the knowledge products themselves.
So we become just an intermediary that's trafficking in something that we no longer have any say over or any control of — that essentially we could end up severing our own bark and separating ourselves from the means of knowledge production and the products that we try to get out into the world to advance research and understanding. I would rather have us be like this graduate student — well, perhaps not a graduate student, because I remember that time and the pay was not very much — but I would rather have us be in this position: if we consider our community to be a forest, that we're in the position of cross-pollinating each other with good practices and a common sense, or common understanding, of what those good practices might be.
For me, I think that open AI should evidence the following characteristics. It should be reusable, in the sense of true open source — not the sort of "open" that is eye-rolling marketing grammar, but true open source. Open AI should be transparent: it should support the promotion of knowledge integrity and credit for authors and knowledge producers. Open AI should be accountable:
as we're developing and using these tools in our workflows, they should be accountable to specific community needs rather than relying alone upon generalized ones. The use of open AI should be affirmative — that is to say, positive — supporting affirmative evolution of roles in our organizations and of the organizations themselves. And our use of open AI should be sustainable, and we might adopt something like a stewardship mindset in that work.
So that's generally the roadmap, and now I'm going to move through them one by one. So, reusable: more like open source, not OpenAI. Drawing again on the Widder, West, and Whittaker piece, they say that, as a rule, open refers to systems that offer transparency, reusability, and extensibility — that they can be scrutinized, reused, and built upon.
They then transition to reference some fantastic work by Irene Solaiman at the company Hugging Face, where she notes that AI exists on a gradient of openness in actual practice. And Irene provides a really useful graphical depiction in her article "The Gradient of Generative AI Release: Methods and Considerations," where she characterizes, with a fair amount of granularity, the relative openness of various models that are on the market.
And it's not a totally simple view, right — like open is automatically good and closed is automatically bad. She characterizes it this way: on the fully closed side of model release there is, say, low auditability of the model, because it's closed; but because it's closed and within the control of a particular company, there is a high degree of risk control, because that control is centralized. As opposed to a fully open model, where there is the possibility of highly decentralized auditability of the model by a broader community, because it is open; but because of the reliance upon decentralization, arguably there's lower risk control, because the responsibility is distributed across many actors in the community that are variably resourced.
Irene goes on to provide other wonderful graphical depictions that really help us try to understand the universe of models that are coming to market, both closed and open. So I'm thankful to her, and also thankful to her for establishing this concept of a gradient of generative AI release, with various degrees of openness and trade-offs.
One of the more popular, ostensibly open source models that has been released — and relatively recent in AI terms, I guess — is Llama 2 from Meta/Facebook: "open source," "free for research and commercial use," self-described. So I put my librarian hat on: I want to look at the license terms for Llama. And so if you go to their GitHub repo — and bear with me, we're going to look at pictures of some license terms on Google Slides —
we go, OK, OK, let's see: "Llama 2 Community License Agreement." Hmm — "Llama 2 Community License Agreement." And you say, OK, let's read some of the terms. OK, you're granted a non-exclusive, worldwide — blah, blah, blah — you can use, reproduce, distribute, copy. OK, fine. Then you go to redistribution and use: if you distribute or use this thing or any derivative works, you will provide a copy of this agreement. Fine, all right.
And then you're like, oh, what's this — "Additional Commercial Terms." OK: if, on the Llama 2 version release date, the monthly active users of the products and services are greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may or may not grant to you. Which is —
yeah — oh, interesting. So that's basically some shade against the other big, big tech companies, for the most part, right? And you have this useful provocation from Stephen Green: this is "Linux is open source — unless you work at Facebook." Right? There's some interesting gatekeeping amongst some of the most well-resourced companies in the world.
How applicable is that, or how much does that apply to us? It could, for some of us in this room. Where it becomes more problematic to me is: you will not use the Llama Materials or any of the results of the Llama Materials to improve any other large language model. That feels to me like quite not in the spirit of open source. It feels to me like it sort of explodes —
it explodes the whole thing. How is that open source? It's not. I think how it is not becomes a bit more salient when you look at the statement that Zuckerberg gave to his shareholders, where he said: open source software often becomes an industry standard, and when companies standardize on building with our stack, it then becomes easier to integrate innovations into our products.
So I basically read this as: this is having your cake and eating it too, right? It's like, we're going to put this out here, we're going to make some investment, and then we're going to basically benefit from the world's contributions within the context of our own products. Doesn't feel super open source to me.
I thought this was an interesting article that explicates the economic potential at stake here, and how the profit that an entity like Meta realizes is, I feel, quite unfair and not in the spirit of open source. This article came out in January of this year, so relatively recent: Harvard Business School Strategy Working Paper No. 24-038, "The Value of Open Source Software," where the economists estimate that the supply-side cost to create the value of widely used open source software is about $4 billion.
They estimate that the demand-side replacement cost value is much larger, at almost $9 trillion. They estimate that firms would need to spend three and a half times more on software than they currently do if open source software did not exist. So I think that puts things into pretty concrete terms on the money side, and really underscores the have-your-cake-and-eat-it-too nature of what entities like that are doing — not in the spirit of open source.
So who determines what the spirit of open source is? Many different parties, one of them being the Open Source Initiative. They do happen to be working on trying to define what open source AI is — so maybe they're going to define it wrong, but I really like what they're doing. They have a consultative community process, with multiple stages of input into defining what open source AI should be.
And I will also call out that the board has expanded, and our colleague, Professor Choudhury, in the back corner there, at Carnegie Mellon Libraries, is part of that one. So he's going to solve it for us. That said, in the interim, other models have come onto the market that do show a degree of promise, I think, for having a stronger alignment with the spirit of open source.
One from the Allen Institute is called OLMo — Open Language Model, as it describes itself — a "truly open" framework. As highlighted in pink at the bottom, they say that all code, weights, and intermediate checkpoints will be released under the Apache 2.0 license. A license like that would be more workable. That does exist.
That is a real thing — the OLMo. Moving on. So, I also believe that part of open AI's openness is contingent upon the degree to which it is transparent, and that transparency is important because it promotes the integrity of knowledge and credit and attribution for knowledge creators. At the moment, in particular with a lot of the generative AI that we see being implemented, it's a fair amount of: you interact with it, kind of like a magic ball, and get something back that kind of makes sense.
But then sometimes you start to probe a bit more, and you get something that's kind of like, "I don't know." And then you're like, oh, well, why don't you know? And then the response — not to pick on GPT, but maybe to pick on ChatGPT, because I think they can handle it — you get responses like, "I don't have access to how I provided you this answer," which is completely wild.
Like, can we imagine any other knowledge work that we do where this would be an acceptable response? Where you had a paper in peer review, and if the author was pressed — like, how did you make this argument? — the answer was, oh, well, you know, I don't have access to my sources, I can't share them? It's completely wild.
Or if you did that in a classroom — I'm not sure that would pass. It's not an acceptable situation. Further commenting on Llama, a professor at the University of Washington said, you know, they've made it more open, but still the data is not available. We don't understand the connection starting from the data all the way to the capabilities, and also the details of the training code are not available.
A lot of things are still hidden. There are some indicators of promise in terms of trying to prompt greater transparency and better documentation. Hugging Face has put forward something called model cards, which draws on prior work from Margaret Mitchell — model cards for model reporting — which in turn draws on prior work, datasheets for datasets, that actually happened to incorporate feedback from the library and information science community.
That feedback is not as present in the model cards as it had been, but what we see in the model card is something not too dissimilar to practices that we've tried to promulgate in research data management or reproducible science, in collaboration with researchers around the world. Users of Hugging Face, when they upload a model, are prompted to complete a model card, and the model card prompts them to describe various aspects of the model, such as intended uses and limitations, parameters and experimental info,
which datasets were used to train the model, and the model's evaluation results. It's really nice. It's great.
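[Editorially added illustration.] For readers who want a concrete picture of what is being described here, below is a minimal, hypothetical sketch of the kinds of fields a model card captures. The field names, model name, and numbers are paraphrased assumptions for illustration, not an official Hugging Face template or a real model.

```python
# Illustrative sketch only: a minimal representation of the kinds of fields a
# model card asks an uploader to document. Field names and values here are
# hypothetical paraphrases of the practice described above, not an official
# Hugging Face template.

example_model_card = {
    "model_name": "example-historical-text-model",  # hypothetical model
    "intended_uses": "OCR post-correction for nineteenth-century newspapers",
    "limitations": "Not evaluated on non-Latin scripts; English-heavy training data",
    "parameters_and_training": {
        "parameter_count": "125M",
        "training_datasets": ["hypothetical-newspaper-corpus-v1"],
    },
    "evaluation_results": {"character_error_rate": 0.042},  # hypothetical figure
}


def missing_fields(card: dict) -> list:
    """Return the documentation fields that were left empty or omitted."""
    required = [
        "intended_uses",
        "limitations",
        "parameters_and_training",
        "evaluation_results",
    ]
    return [field for field in required if not card.get(field)]


if __name__ == "__main__":
    # Prints [] when the card is fully filled in; any names returned are gaps.
    print(missing_fields(example_model_card))
```

The point of the sketch is simply that a model card is structured documentation whose completeness can be checked, much as a data management plan can.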
But in practice, based on assessments of model card completion, it's kind of similar to, I don't know, a research data management plan, in terms of the degree of comprehensiveness with which those are actually completed. As we know, if we work in a university setting or have been part of grant review panels in the past, the research data management plan is often just sort of a nice little thing that hangs off the submission. So it's a great indicator — I would love to see more uptake of this, possibly even a requirement at upload.
Perhaps that's around the corner. To me, transparency is needed throughout the chain of AI production and application, and without transparency you have no assurance of knowledge integrity. And if you have no knowledge integrity, then why bother being involved in something like this at all? I really think it weakens the value proposition of the entire thing. I also want to talk a bit about knowledge producer attribution.
I think, for the most part, at the moment it's ghosted — it kind of doesn't exist, or it's vaguely referenced. I think that's bad. As sort of an aside, there are a fair amount of apples-to-apples comparisons being made at the moment between Wikipedia — and the way that society received it when it did come out — and the way that most generative AI products are being perceived now. I just, personally, don't really see that connection.
I think Wikipedia from the beginning has been a truly open knowledge project, warts and all. It never espoused transitioning itself to a for-profit model — starting open source and then shifting to for-profit. It's always been an open knowledge project. It is highly transparent, as we see in this revision history: good revisions and also really janky revisions — the "Dr. Jackson is not a real..." kind, right?
They have always been highly transparent. It's there for us to see how knowledge is produced on Wikipedia. And you also can't put anything up there without attribution to existing knowledge production and the creators of the knowledge that is synthesized into Wikipedia articles. So the comparison — I don't really see it. To me, again, building on this transparency thing: if there's no transparency, there's no knowledge producer attribution, and if there's no attribution, to me that's essentially anti-knowledge work.
Like, that's not knowledge work — it's the opposite of knowledge work. So I would argue that we should not do that. Moving on: I think that our work with open AI should be accountable to specific communities, right? I think it's great that, in various national contexts, policy and regulation and executive orders are happening that, at the societal level, are attempting to enshrine certain protections and safeties to guide the development and implementation of AI.
But I would argue that in our community we have an additional sort of professional burden — or opportunity, if you will, colleagues. And we have prior use cases that highlight that opportunity for us to consider. Some years back now — this doesn't feel that long ago, but it really was, like 2014, 2015, 2016 — there was a project called Documenting the Now, which arose in the wake of the Ferguson protests in Missouri.
And it was an effort by archivists to build an open source tool to archive mass evidence of social protest occurring on Twitter. At the time, millions and millions and millions of tweets were captured and archived. And the archivist who was one of the leads on this project, Bergis Jules, who is now at the Shift Collective, was really trying to grapple with: how do we square this mass data collection with the established principles of archival practice?
And so he said that the ethical considerations for this work are many: "I really don't think the best way to do this is to ignore the issues and just take a collect-it-all approach. I think that we should consider how practices that we've developed to deal with sensitive data, with donor relations, appraisal, ethics, privacy, and so forth with physical collections should transfer to the collection of social media archives."
That sentiment, to some degree, was agreed to by some prominent machine learning practitioners and researchers, one of them being Timnit Gebru — forced out of Google, she later established the Distributed Artificial Intelligence Research Institute at Code for Science and Society. She wrote this article called "Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning," where she and her co-author try to understand and draw from work in our community, particularly archives, to assess how the ways we work with collections, with specific communities, with donor agreements, and so forth could transfer into the development of training data and applications of machine learning.
I recommend that you take a look at it if you haven't read it yet. It is from the times of yore, as was mentioned — 2019, which was on the cusp of the pandemic — but I think that it's still relevant now. Timnit later, as I mentioned, went on to launch the Distributed AI Research Institute. The next couple of slides are essentially just going to be kind of a postcard for possibly working with DAIR.
So bear with me. But I think they have it right with respect to what I'm talking about — specific community engagement in AI development and use. So, when they talk about community in their projects, one of their aims is to reduce the distance between researchers and community collaborators. Another aim of theirs is to focus on trust and time, to forge meaningful relationships with communities.
And they in particular are talking about minoritized or marginalized communities that are not often at the center of AI research, but are often on the receiving end of negative impacts of AI research that does not take into account their lived experience, or even just their physical characteristics, if we're talking about facial recognition or other well-known uses of some of these technologies.
So they focus on trust and time, on forging relationships, especially if those relationships don't exist or if there's a reason why those relationships have been challenged in the past. They also focus on interrogating power, creating spaces for communities to question elites, and understanding why people and communities on the margins are broadly excluded from AI development. All of which is to say: I think AI should be focused on serving the specific needs of specific communities, and part of that requires relationship and conversation.
And I think DAIR provides us a nice roadmap. And some of that roadmap, I think, does track back to some of the ways that our community has worked with these communities, as noted with the Documenting the Now work. I think that our work with AI should be affirmative, supporting positive evolution of roles in our organizations.
Early in my career, when I would talk about the potential of computation, machine learning, data science, digital scholarship, et cetera, I would often integrate this quote from a column into a presentation: "At the back of any subject is the threat of knowing nothing of certainty at all. The threat of knowing nothing is not the same as uncertainty, which is the presence of alternatives."
I think it's a really nice sentiment, but in practice it often kind of runs aground on the realities — because of the variable experience of uncertainty. Different people experience uncertainty in different ways, and there are often reasons, sort of material conditions, for why that's the case.
A brief story I want to share about uncertainty, part one. I guess it's maybe in the spirit of Chatham House rules — it's going to be anonymous. But I was recently talking to someone who has been experimenting with AI at their organization, and they've realized a significant impact on workflows and access to collections and use data. But they told me that they're feeling quite uncertain, nervous, or possibly even fearful that if they share the results of their experimentation with their administration, it's going to result in the loss of some of their colleagues' jobs.
And, you know, that's a real fear. That's, in part, I think, a leadership challenge in terms of trying to reassure staff of an affirmative vision for the use of these technologies. And I think that's a consequential challenge, particularly for leadership in the community, in moving forward with the use of some of these technologies. The uncertainty or the fear, of course, is not unfounded.
This data is a bit outdated now and only runs out to 2018 — it's from a 2017-2018 statistical survey, and it runs from 1998 to 2018 — but it does characterize a significant drop-off in some core areas, right? So it wouldn't be surprising that a line staff member experimenting with AI would anticipate similar conclusions from their work, and an impact on their organization, that they might not be comfortable with.
We do have studies that are ongoing in our community now, in academic libraries. Leo Lo recently completed a survey of many, many academic librarians assessing the state of AI literacy in academic libraries. I think that provides some data to help make some choices and guide some conversations about staffing, training, organizational structure, and so forth.
A brief story about uncertainty, part two. Chatham-ness — dang, is that what it breaks down to, Jason? Pretty much. OK. I was at a thing, and there were people there who were leaders. And, you know, we were talking about AI, and one of the questions that I asked this group of leaders was:
what are partners asking you for relative to AI, and what would you like them to ask you for? So we kind of tried to characterize the difference between what partners are asking for and the advocacy component of trying to partner in the way that you would like to. And almost uniformly, the leaders were silent on this topic.
And I don't think that's good. I think that we need to have a better sense of what we want in our partnerships as we're working with AI. I often hear the phrase about having a seat at the table, or something like that. Well, if we don't know why we want to be at the table, then being at the table is not going to be a good experience, even if we get there.
So, some work to do there. I think that the work that ARL and CNI have initiated with their joint task force on scenario planning for AI and ML futures is helpful for the university library context, and I'm looking forward to the results of that work being made more broadly public. I believe at the spring CNI meeting, it was mentioned, some of the task force outcomes will be shared.
Just a couple of points from the field in terms of emerging scholarship trying to address the question of what our roles or our organizations could look like. This is from Laurie Bridges and colleagues at Oregon State University, where they're finding corollaries between educating library patrons on AI capabilities and antecedents in traditional information literacy instruction.
So it's good to see a thread of library staff finding antecedents and alignment with core activities that already exist, where we have the ability to approach a particular kind of activity from a position of strength rather than from a completely amorphous, unknown hype space about AI. More that I would share is from Chris Bourg, Sue Kriegsman, and colleagues at MIT Libraries, where, in high-level terms, they allude to the role of a library in promoting a durable, trustworthy, and sustainable scholarly record relative to AI, in their work "Generative AI for Trustworthy, Open, and Equitable Scholarship."
So I think a couple of positive notes from the field. Lastly, as I move toward closing, I want to talk a bit about sustainability. I think that open AI should be sustainable, and that adoption should be guided by a stewardship mindset. I like this concept of ecosystem stewardship that Chapin and colleagues put forward, where they characterize stewardship as reducing vulnerability to expected changes, fostering resilience to sustain desirable conditions, and transforming from undesirable trajectories when opportunities emerge.
To me, a community sustainability strategy depends on awareness of interdependence, threats, and opportunities. And I would share three methods that I think are potentially useful, which I have made up names for: exploded view, systems view, and replacement view — drawing on Crawford and Joler, Wallerstein, and Bender. So, exploded view.
Many of you have probably seen this before — I think the Met or MoMA or something actually acquired it. It was put forward by Crawford and Joler. It's hard to see on the screen, so I'm going to zoom in a bit. But essentially, one of the ways that you can engage with what AI is, is by starting with the human and then moving down to, at the time, the Amazon Echo, and then going all the way down to the periodic table of elements, up to the strip mining that needs to happen to gain the raw materials, up to smelters and refiners and the associated labor conditions, up to the component manufacturing, and so forth.
And so you get this really universal view of dependencies — of people, raw materials, the Earth's resources, quid pro quo exchanges between the data that we produce in interaction with AI, and so forth. I think it's one way to get a sense of the interdependency that's at play when we adopt not just AI but any technology, and I find that to be helpful. And then the other one that I want to put forward comes from world-systems theory, which was work by a sociologist in the 70s, an attempt to characterize the operations of global capitalist systems since the 1500s.
And this particular framework basically characterizes world production into core, semi-peripheral, and peripheral countries. A common way to explain this might be to focus on electronic goods production, where there are parts of the world that are primarily relegated to extracting rare earth materials, things like cobalt. Those goods are then shipped to semi-peripheral countries to, say, produce batteries, and then in core countries they end up as end-of-line,
highly polished commodity products. So it's core, semi-periphery, and periphery as an operation of production. You think about, how does this apply to AI? I think some of the ways you can see it are in the ways that high-performance computing, cloud computing infrastructure, and so forth are primarily concentrated in core countries, and then there's this sort of purgatory for the rest of the world relative to AI production: the raw material production, training data development, and the well-documented PTSD of content moderators for some of these systems — sort of
capital concretized into inequitable, systematic roles across the world. Highly problematic. It behooves us to think about, as we invest capital into AI, where do we have opportunities to allocate our resources differently? And then finally — I think I called this replacement view — it's from Emily Bender, a computational linguist at the University of Washington, where she just said: I think that we should just stop saying AI and start saying automation, and then we can ask things like, what is being automated?
Who's automating it and why? Who benefits from that automation? How well does the automation work in the case that we're considering? Who's being harmed? Who has accountability for the functioning of the automated system? What existing regulations already apply to the activities where the automation is being used?
That's essentially the end of my talk today. I wanted to share a nice GIF of flowers blooming, because I think it's going to be spring in this part of the world. I live in Las Vegas, so it's kind of always sunny. But I wanted to end on that sort of colorful image — the cherry blossoms are going to happen and things around here. And I'll just —
I'll just close and leave my slide up for what I think open AI and all the work should be. Thank you so much. So — as a software engineer, I'm sort of in this world where I'm always forced to consider truly open source software, whether it's a library that I'm using in my code to produce for-profit software, right?
There's this balance, or that tension, in my mind. What I do know is that a lot of the libraries and projects and some of the software that I use are also produced and maintained by large corporations that themselves are in the business of making money, and they have their own versions of the software. So some of those projects would not exist without the continuing funding of these corporations.
How do we then — is this the model that we're looking for when it comes to open AI? Are we looking to the larger corporations that have the compute and the funds to train those very compute-intensive models — right, we rely on them to do this work and make it available, truly open? Or are we going to go the other way and say, OK, to have truly open models, truly open AI, we have to fund that effort?
How do we reconcile these two? — Yeah, it's a good question. I think that part of it breaks down to having better licensing terms, in the case of Llama, right? And full commitment to a genuine open source definition. And we also do have alternatives that are being released, like OLMo from the Allen Institute, which is under an Apache license.
Right? So there is an ecosystem where it's not the case that all of the ostensibly open source or open models are being released with the qualifications of something like Llama — the Llama example is consequential because of how large it is — but we do have other models being released that are more principled in their licensing. So it's not just about the investment of capital — it costs money to produce a model — but we are seeing some entities make choices such that, even when they do invest significant capital, they release it under a more permissive license, with comparable performance.
So I think part of it for us is making decisions about trade-offs in performance and, I think, broader reusability and the potential for others in our community to build upon. — Andrew Bates, Association of Research Libraries. I'm going to ask you the question that we've been discussing since Responsible Operations was published.
You've been talking about AI quite a bit. How do we create the ANSI that's required for higher education, for libraries? You know, there's ANSI already being created in the corporate world; there's ANSI being created on the government side of things. But I'm worried that the corporations will plunder and the governments will flounder. How do we create ANSI around this topic for higher education and libraries?
That's a great question. Thanks — thank you. I think we need to do a better job of advocacy, in the sense of advocacy with the government. I think that we have some representation in DC, which is fantastic. I think that we need to be doing more joint advocacy with other communities, in addition to the library community.
I think that we, of course, need to represent our voice, but in the larger picture, you know, we need to be united with particular disciplinary communities and other academic associations, and, where we can, try to carve out some ability to advocate with private sector partners, you know — with the Metas, with the Googles, and so forth. I think that's a steep hill to climb, but
I think it's necessary. Thanks, thanks. — Hi, Jennifer Call from the HathiTrust Digital Library. Thank you for your talk. So, the anecdote that you told us, about the thing you were at where you talked to people, and some of them were leaders, and they didn't have any thoughtful answers to your questions —
I found that quite disturbing. Not that everyone's going to have the same depth and sophistication of understanding that you do and all the others in the room, but we have to think about it. So I'm just wondering, why do you think they didn't really have anything to say? Because we need to know, so that we can maybe do a better job of talking with them. — Yeah, thank you for that.
I think that the community is still trying to figure out what our particular ANSI is, right? So, you know, where are we best — in sort of a leadership posture, or a partnership posture, or really gathering the evidence behind something else? I also think that the position might be underdeveloped because the community is already under duress, and has been under duress.
Especially across the pandemic, which really compressed a lot of services and the ways that universities operate. So it feels like, in a post-pandemic picture, things are still settling a bit around what libraries could be and should be. And that's an additional potential reason — but also a complication to it. Those are, I guess, my thoughts about why the answer was what it was.
I also think that there just remains more concrete operational experience that needs to be had to inform the strategy — it's still in process: in terms of implementation in library discovery systems and public-facing services, in partnerships with other entities on campus, in divisions of labor — say, between high-performance computing and the library.
I remember, again pre-pandemic, there were discussions — I don't know where it landed, but I think I can attribute this one; I don't think it's a Chatham thing, I think it was a matter of public record — at Columbia at the time, there was a discussion about distributing levels of service for, I think it was, data science instruction or something like that, where the library would carry, like, tier 1 instruction and high-performance computing would handle all the rest.
I don't know what happened to that, but I think more experience needs to be accrued, and there's a perception of a lot of pressure that a lot of decisions and strategy need to be settled right now. I guess more coffee chats would be good. Thank you. Thank you.
Thank you very much. You know, I'm a fan of yours — whenever you have a presentation, I share it immediately. You know, I'm with a university in Germany, and one in Wroclaw. So first, a shameless plug: you mentioned scenarios earlier. The closing keynote in this room —
I will lead that conversation, around 2030 AI futures, and I was inspired by that study, so we'll discuss this, actually, in the last session. Second thing: I was disturbed when you mentioned the professional roles and that one chart, which is troubling. By the way — this is not public yet, but I know a couple of unions in the Northeast have started the conversation in higher education, not-for-profit.
They're talking to senior administration, saying: what are you going to do about AI and automation? There is already kind of a pushback, which is kind of helped by what big tech is doing — firing people. And they are saying it's not the real reason — I mean, AI will be here for a while, but it's not AI at all, because they were hiring right and left, so their stocks go up, and then in January, oops, people are let go. And given the information mentioned, it's not yet hitting the library, but I felt it behooves me to tell people.
So, I mean, you're not going to be fired — maybe you will have less hiring, maybe, let's say, for six, seven years. But firing people and using AI? No way — I doubt it, not in our field. — Great, thank you. I'd like to raise an issue that I think is a fundamental problem that we have to address in all kinds of ways.
Let me put it this way: when we're talking about large language models, the large language appears to be English. This technology can't be truly globally open until we deal with this really extreme imbalance — that language is very unequally distributed around the world. So I'm interested in your comments on, you know, what can be done about that.
Is anything being looked at about that? It seems to me an almost intractable problem. — I guess I agree; that would be the first thing that I would say. And I think that it can be improved upon. I think that we have some particular challenges with historic multilingual content that are particular to our community.
Right — like, it's kind of difficult to tell, because most of the training data, and data sources in general for the large language models, are opaque now. But it seems, from what we can tell, that most of the data is relatively contemporary in nature, right? Most of the language content, for example, is relatively contemporary. But what we have collectively in libraries, archives, and museums is thousands of years of language expression, in many different kinds of formats, of various degrees of machine legibility.
So if there is any future potential for multilinguality — and also to have it be comprehensive in terms of representing the historical trajectory of languages as well — there's going to need to be a fair degree of investment and work. Like, for us — we are the primary stewards of that, right?
Like historic Polish, for example. So I can see there needing to be investments there. — So I'm curious what you're hearing in the open source model world when people talk about the balance between openness and safety. Right? If you do research on smallpox right now, you need to follow certain guidelines and regulations; likewise if people are studying nuclear energy.
But if you start accumulating a whole bunch of it, somebody's going to know about it. And so, when it comes to these large models and having them maximally open, how do you balance that with the appropriate level of safety? What are people in the open source community saying about that? — Yeah, what I see in reaction to that is a series of arguments that the technology is actually safer the more open it is, because it's more possible to audit how it's functioning and not functioning and causing harms, as opposed to it being closed.
But again, I think the counterpoint to that, if we think back to the gradient of openness, is that there are others who argue that a highly controlled, closed model centralizes the responsibility for managing risk and things like that. So I think there are two different arguments on the safety question, at least that I've seen, along that gradient of openness.
Thank you. Good morning. Shiva here. My question is on ethical considerations. How do we ensure that an AI algorithm does better — that the patient data is not biased, or that there is some fairness in the data? How do we build that,
and how do we take the ethics into consideration? — That's a great question. In the medical field in particular, I can imagine, it's quite important. I'm actually not going to try to answer that one, because I've worked with enough medical libraries in my career to understand that they know way more than I do. So I would just say that I think that's quite complicated, you know.
I think it quite concretely poses immediate risks to people. And I'll look forward to your talk on the topic. — We have time for, like, one more question. — Thank you so much. My question is about machine learning models: they are highly variable, and there are quite many of them.
And even though they are open and available to everybody, their responses will be so different; they are highly trained. So what do you think about how we can get more reproducible results from those models? — Can you say more about your question? More context? — More context:
again, today's large language models try to imitate our thinking process. Your thoughts today will be a little different tomorrow, and the model's response is also very variable, right? You can ask a question today, and tomorrow, or two days later, the response will not be the same. So if anybody else wanted to recreate your response, or recreate that model and wanted to get out of that model the response you got,
it's less likely they'd get the same outcome. So the challenge here will be reproducibility of the result. What would be the solution for that? — Oh, that's a good question. I think that we see some examples of engineered systems that try to ensure the same response, or a similar response, each time.
One of the examples that I've seen — I think it's in beta — is a JSTOR product that has some integration, I believe, with generative AI. I may have seen someone from JSTOR here, and perhaps a conversation with them would be useful, because I've seen some examples out there where people say, oh, it's generative,
the response is always going to be something different, right? But then, of course, the standard of expectation for different users — that's not always going to work for them. Right? Like, if you ask: what is this article about? How is this article related to this article? Who is prominent, for whatever reason, in this article? We don't want it to say a different person every time, right?
So I'd recommend talking with the JSTOR folks. I also saw an interesting experiment that just came out, I think recently, from the Library Innovation Lab at Harvard, called WARC-GPT, where they're trying to create a system to query web archive content, and I believe they had a setup that tried to account for the phenomenon that you're describing.
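[Editorially added illustration.] One general technique behind the "same response each time" engineering alluded to in this answer is deterministic decoding — fixing random seeds and disabling sampling. The minimal sketch below assumes the Hugging Face transformers library and a small stand-in model; it is not a description of what JSTOR, WARC-GPT, or any other specific product actually does.

```python
# Illustrative sketch only: one common way to make language model output
# repeatable. This does not describe any particular product; it just
# demonstrates deterministic decoding with the Hugging Face transformers
# library.

from transformers import pipeline, set_seed

set_seed(42)  # fix the random state so repeated runs start identically

# A small openly licensed model, used purely for illustration.
generator = pipeline("text-generation", model="gpt2")

prompt = "Web archives are important because"

# do_sample=False means greedy decoding: the highest-probability token is
# chosen at every step, so the same model, prompt, and settings produce the
# same text on every run.
result = generator(prompt, max_new_tokens=30, do_sample=False)
print(result[0]["generated_text"])
```

Retrieval-grounded designs, which constrain a model to answer from a fixed set of retrieved documents, are another way systems can narrow how much answers drift between runs.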
So, Thomas, thank you.