Name:
AI, metadata creation, and historical bias
Description:
AI, metadata creation, and historical bias
Thumbnail URL:
https://cadmoremediastorage.blob.core.windows.net/6936476f-6210-4105-8c0a-59ae8c16fa40/videoscrubberimages/Scrubber_1.jpg?sv=2019-02-02&sr=c&sig=%2FmPFNUZ8ir4TNoPgeX8wkhNKrszHaHT6fE%2BI8PPNM3o%3D&st=2024-10-16T02%3A05%3A41Z&se=2024-10-16T06%3A10%3A41Z&sp=r
Duration:
T00H52M36S
Embed URL:
https://stream.cadmore.media/player/6936476f-6210-4105-8c0a-59ae8c16fa40
Content URL:
https://cadmoreoriginalmedia.blob.core.windows.net/6936476f-6210-4105-8c0a-59ae8c16fa40/18 - AI%2c metadata creation%2c and historical bias-HD 1080p.mov?sv=2019-02-02&sr=c&sig=Ws4n5VteHrPXXTA6ZG6pEULlgjIWz2bzKaoJrjjzzwo%3D&st=2024-10-16T02%3A05%3A41Z&se=2024-10-16T04%3A10%3A41Z&sp=r
Upload Date:
2021-08-23T00:00:00.0000000
Transcript:
Language: EN.
Segment:0 .
[MUSIC PLAYING]
KARIM BOUGHIDA: Hello, everyone. Welcome to the NISO Plus 2021 session. My name is Karim Boughida. I'm the Dean of Libraries at the University of Rhode Island, Kingston, Rhode Island, in the US. This session is around AI, metadata creation, and historical bias. It's planned for Tuesday, February 23, 2021,
8:00 PM Eastern time. It's recorded on February 11. Why are we doing this? Because it's a very important topic. I'm a firm believer that we should pay attention to AI in our field: archives, museums, libraries, and academic publishing. And that's the reason, by the way, there is a poster on AI. We have the first AI lab in the world at the library here.
This session is high level, so our goal is to stimulate the conversation and address biases and superstructures around metadata systems. It's not a practical session on specific techniques in AI, machine learning, or natural language processing. So we're lucky to have three distinguished experts today.
First, we are going to have a talk from Dominique Luster. She's the Charles "Teenie" Harris Archivist at the Carnegie Museum of Art, Pittsburgh, Pennsylvania, in the US. Then it's Michelle Urberg, "Return on Investment-- Reframing AI and Metadata." Michelle is with Maverick Publishing. And then we have Joris van Rossum in Amsterdam, who is the Director of Research Integrity at the International STM Association, representing the academic publishing industry.
So we're going to start in this order, and then we're going to open it up for conversation during the live session. So welcome, everybody. And now it's Dominique.
DOMINIQUE LUSTER: Fantastic. Thank you so much. I am going to share my screen. And you'd think after a year of Zooms, we'd all be brilliant at doing that. So I'm going to share my screen here, and why don't we just get started. If I could just get a thumbs up from my fellow presenters to indicate that we are good. Great. You can see my screen now.
DOMINIQUE LUSTER: Perfect. Thank you. So I am so excited to be here today. I'm excited to have this conversation. I agree that it's a very important conversation for all of us to have, and I hope that this will be a very fruitful conversation that we can all embark on together. Again, my name is Dominique Luster.
DOMINIQUE LUSTER: I'm coming to you from Pittsburgh, Pennsylvania, where I am the Charles "Teenie" Harris archivist. And I would love to talk to you for just a few minutes on AI, metadata, and historical bias. A little bit about me-- like I said, again, my name is Dominique. I work between the lanes of information and cultural heritage.
DOMINIQUE LUSTER: I specifically have a strong affinity for Black identity within the art and the cultural arts space. I believe in myself, and I push myself to be a champion for Black art, especially and particularly in lanes of silenced historical records and identities. And I wanted to ground us today in a conversation from this quote that I absolutely love, which reads as follows: time and again, racist ideas have not merely been cooked up from the boiling pot of ignorance and hate.
DOMINIQUE LUSTER: No, but actually, time and again, powerful men and women, brilliant men and women, have produced racist ideas in order to justify the racist policies of their era, and they do that in order to redirect blame for their era's racial disparities away from those actual policies and onto Black people. And what he's saying in this section of the book, which I highly recommend that everyone go read, is that racist ideas and policies don't simply come out of nowhere. Typically, they don't come from sporadic and spontaneous modes of hate and violence, out of nowhere, really.
DOMINIQUE LUSTER: They come from policies, practices, teachings, guidance, recommendations, structures that are passed down and taught to us. And they're often taught in very brilliant and nuanced ways that are hard to see, unless we take the initiative to interrogate every single lane. And that's what I hope that we will do together when we look at metadata and historical bias using artificial intelligence.
DOMINIQUE LUSTER: The two main places where I tend to see historical bias in galleries, libraries, archives, and museums are, one, the idea of white normativity. This is the often unconscious and invisible set of ideas and practices that make whiteness appear neutral. And this is often something we see when whiteness is only really called out in the idea of otherness, or as something to oppose against it.
DOMINIQUE LUSTER: So you start to identify whiteness when you start to identify Blackness, and you really only look at whiteness as opposed to something else; it's the idea of othering. And furthermore, in our records within libraries, archives, and museums, we are often creating this bias, white normativity, through the lens of the white gaze, and that is the assumption that the reader or the receiver of any piece of record, of any piece of art, of any piece of archival material, of any book comes from the perspective of someone who identifies as white.
DOMINIQUE LUSTER: And it's the idea that both the creator and the consumer, naturally, as a neutral lane, have a perspective, a mindset, an identity that is white, and therefore they agree, essentially, with the biases of the created record. How this often translates into historical practice is the historical professional mindset of gatekeeping. We see that, in fact, records are extremely powerful things and that in all systems, public and private, to be clear, records can be used as an instrument of power.
DOMINIQUE LUSTER: And that power can actually be used to empower, to liberate, for salvation, for freedom, but equally, it can also be used to silence, for erasure, for oppression of a people group by the inclusion or exclusion of their records from a system or the treatment of those records, the language that's used to describe those records, for or against a group of individuals.
DOMINIQUE LUSTER: More so, archives can be used as a technology of governance. And what I would love for our profession to do as a collective community is to interrogate and pay close attention to how access, custodianship, stewardship, and use are regulated by ourselves, in and out of these systems. So it isn't just the system, it's our use of the system. It's using the system for our own institutional gains, and where do those institutional gains come from?
DOMINIQUE LUSTER: More so, as we dive deep into the crux here, into the actual archival metadata, I want to remind us all that the framework of archival metadata is typically made of three components, and that is content, context, and structure. And we are often led to believe, or we're taught in library school, that archives are neutral.
DOMINIQUE LUSTER: And I am here to posit, as a stand in the ground for myself, that archives are not neutral. We are taught to use passive, neutral, or objective language as we are creating metadata records, as we are creating our findings in our discovery layers. However, what if the language that we are taught is not actually passive, neutral, or objective? What if we have simply been taught that it is?
DOMINIQUE LUSTER: But in reality, if we think about the language that we use through white normativity and through the idea of the white gaze, what is considered objective is actually still uplifting and empowering the white gaze, or the white normativity, of our consumers of materials. Further, context: the context that a collection exists within in the institutional repository, and the guidelines or requirements for how we contextualize our records in our discovery layers, can often be institutionally flawed.
DOMINIQUE LUSTER: And it's not just the context of the institution or the other collections that are in the repository and how the collection that you're processing relates to those other collections, but more so how the collection that you're processing relates to the actual world that it comes from and the community that produced those materials. Does the contextual relationship that you are framing match the actual lived experience of the community that you are seeking to steward and to represent through this collection?
DOMINIQUE LUSTER: And finally, as a principle of archival metadata work, I ask our community to look at the structure, the actual structure itself, whether that be the cataloging structure or the discovery layer. On either side, are the schema and software that we are using inherently flawed? And I always think of the example here, the story of our Indigenous or First American communities and the collections that may be processed to tell the story of those communities.
DOMINIQUE LUSTER: Now, if the cataloging software and/or the discovery layer that you are using does not accurately enable that community to share the full breadth of what it means for that piece of material, then the structure is inherently flawed because it actually isn't inclusive. And finally, as I'm asking you to think about all of these things before I shift us into the ideas of where AI plays into all of this, I first want to ask you to think about who, actually, is assumed to be in authority.
DOMINIQUE LUSTER: Who is given that privilege of authority? Who grants privileges, and who takes them away? Because I don't just want to ask who was given the power of authority, but to re-understand that the authority to name, the power to name, is a privilege. And how do these assumed or presumed privileges translate into machines? And I believe, and I am positing for the conversation that I hope we will have following these presentations, that AI and ML are only as strong as the data set on which they are trained.
DOMINIQUE LUSTER: And what I mean by that is if the data set that we are going to use to process the backlog collections at a faster rate is made by humans, who are consciously or unconsciously filling that data set with their own human biases, then everything that comes out of that data will reflect those same cultural incompetencies.
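To make that point concrete, here is a minimal sketch, with entirely hypothetical descriptions and labels, of how a classifier trained on legacy catalog metadata can only reproduce the framing it was given. The data and the scikit-learn pipeline are illustrative assumptions, not anything used in the Teenie Harris archive or any specific institution.

```python
# Minimal illustration (hypothetical data): a classifier trained on legacy
# catalog labels can only hand those same labels, and the framing behind
# them, back to every new record it processes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical legacy descriptions: some subjects are named and given
# occupations, others are recorded only as "unidentified."
descriptions = [
    "portrait of an attorney at his downtown office",
    "unidentified man standing outside a church",
    "schoolteacher with her class, names listed on verso",
    "unidentified woman and children on a porch",
]
legacy_labels = ["Lawyers", "Unidentified persons",
                 "Teachers", "Unidentified persons"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(descriptions)
model = MultinomialNB().fit(X, legacy_labels)

# Applied to a backlog, the model cannot invent a more equitable vocabulary:
# whatever it predicts is drawn from the legacy labels above, so the original
# bias is reproduced, and compounded, at scale.
backlog_item = ["woman and children gathered on a front porch"]
print(model.predict(vectorizer.transform(backlog_item)))
```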
DOMINIQUE LUSTER: But rather than say, oh, well, what can we do, I believe that we have the opportunity, or the obligation, to create or recreate specifically anti-racist and anti-oppressive structures and data sets. And I want to say this very clearly: it is not simply about creating diverse or inclusive data sets. No, what I'm actually saying is that these data sets, the work of our libraries, must be specifically and by design anti-racist and anti-oppressive. So finally, for your consideration, as we move through our work using artificial intelligence, and in my experience using machine learning in my own work, I am finding that the use of machines doesn't actually speed up my work at all, because the thought and the care that must go into this work take a substantial amount of time just to build the on-ramp to using it.
DOMINIQUE LUSTER: To get clean, clear, and culturally competent, racially conscious data sets to do this work takes time and is very complicated and must be interrogated at every stage and at every layer, from whether you're using DACS as an archival standard, to MODS, to your repository's processing scheme, because otherwise, machines actually have the ability to exponentially compound the impact of cultural inequities.
DOMINIQUE LUSTER: So just because you have one data set that may be flawed, well, now you've applied it to a backlog of 50 collections, and now your records for all of those collections share and potentially expand those flaws even further. So I just ask us to be thoughtful and to interrogate everything.
DOMINIQUE LUSTER: And I'd like to leave you with this, that "let the globe, if nothing else, say that this is true, that even as we grieved, we grew, even as we hurt, we hoped, even as we tired, we tried, and that forever we will be tied together victorious, not because we will never again know defeat, but because we will never again sow division." And the reason I share Amanda Gorman's beautiful quote, this specific line, with you is because I do feel, as we move into the 21st century, division could potentially occur in the use of artificial intelligence and machine learning, but it does not have to.
DOMINIQUE LUSTER: It does not have to divide communities between who is included and who isn't. As we move forward in the information era, we have the ability to bring together more than we divide. We, as this information profession, have the ability to grow and to build more than we separate, more than our structures and our systems might like us to divide. But that requires courage, even if it requires work.
DOMINIQUE LUSTER: But I do believe that this community has the ability and the capacity and the heart to do that work. And I hope that you will join in that work with me. And if you are interested at all-- I see my slides, of course, got to the very last slide before there was a structure issue-- feel free to connect with me. You can reach me by email, which is LusterD@cmoa, on my website, you can follow me on Instagram, or connect with me on LinkedIn.
DOMINIQUE LUSTER: Or let's chat on Twitter, though I will admit I'm lazy on Twitter. So if you message me there, it'll take me a while to get back to you. But I would love to join in with you on this conversation, and I am going to stop sharing my screen. Thank you.
KARIM BOUGHIDA: Thank you, Dominique. Thank you for presenting in a very artistic way and making it clear that metadata and archives are not neutral. Next is Michelle Urberg.
MICHELLE URBERG: Hello. Let me share my-- OK, and let me play it from the start. Welcome today. I am also very happy to be here and to be on this panel. I was pleased to receive this invitation to speak on a topic that I think is very important, the intersection between artificial intelligence, metadata, and historical bias.
MICHELLE URBERG: And my contribution to this conversation today is to get us all thinking about the effects of naming in the context of historical bias and how that can hinder the ability of artificial intelligence to leverage metadata to its fullest in a responsible and equitable way. Metadata can produce a great return on investment, which is the title of my talk, "Return on Investment-- Reframing AI and Metadata." But as professionals in the industry, we're talking not only about bottom lines here, of course, but also about social responsibility, particularly in the archival, library, and academic publishing space.
MICHELLE URBERG: So I'm going to challenge us to think about that a little bit differently today. My framing of historical bias and metadata is influenced, in fact, by my own experiences with metadata over the course of my professional life. I function as a metadata professional in the library and publishing industries. I'm currently a member of NISO's Video and Metadata working group, and if you know me and we talk, you know that I'm always championing the cause of creating better data about content, whatever that content may be.
MICHELLE URBERG: And I'm sure many of us can embrace the idea that metadata can make a great return on investment, again returning to the theme of my talk, because it facilitates better user experiences, always thinking about the end user, the customer of your product, be it books, video, journals, or any other content discovered through library and academic research platforms. I'm also a historian by training, a music historian to be specific, and I know firsthand from the end user experience how metadata and description can expedite or hinder primary source study.
MICHELLE URBERG: I spent many, many months photographing manuscripts and then later transcribing their contents to trace changes in melody or text, and then later cataloging, in addition to that manuscript's body of work, images of women in particular-- Saint Birgitta, for example-- and how they're depicted iconographically. I loved getting my hands dirty with these materials, and I loved working on my dissertation topic.
MICHELLE URBERG: The time to produce a dissertation chapter based on this type of research could have been halved or better if I could have more easily searched the images, or if the music had been previously transcribed and encoded in a way that let me run different types of queries on it. And while this work was illustrative to me and valuable for the musicological community, I could have answered different kinds of questions if I had had different pieces of metadata and information about these two data sets, music and image, actually at my disposal.
MICHELLE URBERG: As an end user, then, my return on investment to create metadata or queryable data sets is enormous. And had I continued as a musicologist, it would have been worth my time to create more robust metadata about these images. But regardless, I bring those two perspectives into play here in our discussion today.
MICHELLE URBERG: And I want to think about, especially with respect to archival and other types of content data sets, what we are talking about when we're talking about metadata: tagging, subject analysis, identification of people. Naming is a powerful tool, and the person who is the namer is the person who controls the power of the end user experience and is given a platform, a voice.
MICHELLE URBERG: The end user has to figure out what that voice is when they approach the library discovery catalog, the archival finding aid, the material in a publishing platform. You don't get to exercise control over that, even though you as the end user are the person who has to figure out how to manipulate it to get your job done. So to this end, I want to talk a little bit more now about several concrete examples.
MICHELLE URBERG: I think Dominique did a wonderful job of framing us in a theoretical context, and I purposely wanted to sit alongside that, showing some really specific things about the power of naming and how we have to be careful. The first example that I have comes from a blog. I was looking around the internet as I was preparing this presentation, and the IEEE Spectrum blog, which has a lot of really cool content, has a nice section on artificial intelligence, a six-part series that ran in 2019.
MICHELLE URBERG: A particular article by Oscar Schwartz chronicles the systematic algorithmic bias built into St. George's medical school application process during the 1980s, which disproportionately weighted name and place of birth to reject applicants at the initial screening stage. So these were candidates who would never have made it to the interview process to even be admitted to the school.
MICHELLE URBERG: So based upon two pieces of information, a label of either Caucasian or non-Caucasian, points were deducted for non-Caucasian applicants, or added if you were, in fact, identified as Caucasian. And fewer points translated into less likelihood of being chosen for an interview to study medicine at this particular school. And in aggregate, this one category with these two terms led to intentional racial and gender discrimination, and ultimately a significant reduction in the diversity gains made in the later 1970s for entrance into study at the school.
MICHELLE URBERG: Not only did this reduce diversity, it actually enacted an institutional block against those applicants who were rejected by the algorithm, thus changing career trajectories and life stories for a number of the candidates. And as I understand it, there was only a handful of potential candidates who were able to get their decisions reversed based on the inquiry into this issue that happened in the mid 1980s.
MICHELLE URBERG: The commission that investigated this issue was widely publicized, and the problems inherent in it were not truly resolved, despite the censure placed on St. George's hospital. Algorithmic learning that privileges one term or a set of terms to restrict or enhance discovery of a scholar, an institution, or work from a particular region is readily reproducible in the systems we have in place today that feed discovery layers, content aggregators, preprint servers-- all things that have been much talked about during the COVID time that we find ourselves in, where we are relying on large new bodies of information and looking for authority in that information.
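To make the mechanism in this example concrete, here is a hypothetical sketch of the kind of weighted screening rule described above. The actual St. George's program and its weights are not public; the field names, point values, and cutoff below are invented purely for illustration.

```python
# Hypothetical reconstruction of a screening rule like the one described for
# St. George's in the 1980s. Weights and cutoff are invented for illustration.
def screening_score(academic_points: int, labeled_caucasian: bool,
                    labeled_female: bool) -> int:
    score = academic_points
    if not labeled_caucasian:
        score -= 15   # points removed solely because of the label
    if labeled_female:
        score -= 5    # the inquiry also found gender-based deductions
    return score

INTERVIEW_CUTOFF = 55  # illustrative threshold for an interview offer

# Two applicants with identical academic records land on opposite sides of
# the cutoff, so the discrimination happens before any human reads a file.
for caucasian in (True, False):
    s = screening_score(60, labeled_caucasian=caucasian, labeled_female=False)
    print("caucasian" if caucasian else "non-caucasian", s,
          "interview" if s >= INTERVIEW_CUTOFF else "rejected")
```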
MICHELLE URBERG: Nevertheless, there are always opportunities for improvement. My second example was inspired by conversations we had as a panel prior to this event. As I prepared for my talk, I went back to Safiya Umoja Noble's important book, Algorithms of Oppression-- How Search Engines Reinforce Racism, which takes a great and important look at a number of issues relating to-- for me personally, I really honed in on the library and discovery issues that she talks about.
MICHELLE URBERG: And she, at one point in her book, takes a look at Artstor, which is actually a response to another scholar in the field, Matthew Reidsma. If you haven't read his work, his last name is spelled R-E-I-D-S-M-A. Take a look, because he has a lot to say about library discovery. So she goes into Artstor, and she runs several queries-- "Black history," "African-American stereotype," and "racism." And I'm going to talk about what she found, which I pulled directly from the book, and I reran those queries to see how Artstor has improved or changed the way they index their information, which, again, is fed by the metadata that they're using for their algorithm.
MICHELLE URBERG: At the time her queries were run in 2016, the first results for "Black history" returned about 2,000 hits, and there are a number of white European artists. The first result is for this gentleman, Thomas Waterman Wood, Self Portrait, which is an interesting first search result for that particular query. The second is "African-American stereotype," and the first item for this is On to Liberty by Theodore Kaufmann, who is a German painter.
MICHELLE URBERG: And the third search-- I'm going to go back one slide here. Note that there are 42 results for this particular image search, "African-American stereotype." For "racism," there's 917 results, and the first item is actually again On to Liberty by Theodore Kaufmann. And in this particular item or this particular search, I'd like to draw your attention to the last item on the right.
MICHELLE URBERG: It's called Rent a Negro, which Noble rightly points out is, as she says in the quotation, a critique of the racial ideologies that tokenized African-Americans. When I reran these queries, I kept in mind what she was trying to achieve with these particular searches that she engaged with in her book.
MICHELLE URBERG: And this is an example, in fact, of the "white racial gaze on information," which is quoted from her book. And again, I quote, "a result of investment of the profession in colorblind ideology." So this is librarians librarianing to its fullest. From an end user perspective, which is what I was thinking about in these searches, it is confusing at best and offensive at worst. And so, bottom line, we need to do better as professionals in this field.
MICHELLE URBERG: I reran the search with help from Verletta Kern, who's been a research partner of mine at the University of Washington. And while I think I was able to duplicate the search more or less faithfully, the results that I have to share with you here actually reflect a new conversation to be had between what happened in 2016 and now what's available in 2021.
MICHELLE URBERG: "Black history" now returns mostly clothing from the 19th and 20th century in the first page of hits. But notice that the sheer number of results is something over 130,000, where it was approximately 2,000 previously. So this has diluted the usefulness of this particular search string, not only by sheer content, but by the way the metadata tagging has been done in the platform.
MICHELLE URBERG: "African-American stereotype," by comparison, returns fewer results, 35 instead of 42, but it's still arguably inscribes stereotypes associated with slavery in art. And is this a worse type of misidentification than the previous results? And finally, the results for "racism" again are slightly larger, 1,400 results compared to 900 previously, but they're equally problematic.
MICHELLE URBERG: You can take a look at those images there. So tagging and metadata associated with these images need continual tuning, and that is the bottom line. But I would like to draw your attention to one key difference between the searches in 2016 and those in 2021. Artstor has demonstrably improved the UX to include an abundance of limiting options on the left-hand side of the page.
MICHELLE URBERG: And while an abundance of options is known to overwhelm users, in this case, for the term "racism," we can see acknowledgment that the media encapsulated in the platform comes from all over the world, and the facets represent a good portion of work towards acknowledging previous issues with misidentification in the Artstor platform.
MICHELLE URBERG: So, ROI from metadata-- this is the last slide before I conclude here. This brings me back to the return on investment. If time, talent, and resources are invested, as the Artstor example suggests, more detailed and specific naming facilitates better faceted searching. Facets in this case visualize and display all the locations where provocative studio art, photographic journalism, et cetera, is recording or training a perception of racism.
MICHELLE URBERG: It's not perfect, by any stretch. But there is more nuance in the platform than there was even a few years ago. Naming and metadata is everything, and historical bias in the library catalog and the conventions of other metadata creation suffer from what David [INAUDIBLE] calls language decay. Language decay is a more insidious problem than refusing to remove outdated terms or naming conventions in controlled vocabularies.
MICHELLE URBERG: It is a statement of the cultural values and unconscious bias inherent in cataloging and metadata creation. Metadata pushed through algorithms to train artificial intelligence requires constant vigilance to train a system to behave equitably and to respond to the changing needs of user discovery. My challenge here today is to begin a discussion about transformative naming with metadata as a strategy to improve AI technology for discovery.
MICHELLE URBERG: Thank you. All right, and we'll stop sharing.
KARIM BOUGHIDA: Thank you, Michelle. This is very interesting. It reminded me of Artstor. I was involved a little bit in the beginning of Artstor, maybe 18 years ago, and we never discussed the biases. We were discussing metadata, quality, metadata enrichment, merging fields, enhancing UX/UI. But now there is a little bit of improvement, but there's a lot to do.
MICHELLE URBERG: Yeah, there's always room for improvement.
KARIM BOUGHIDA: Yeah, when we approached this, we approached it as almost a technical problem. We had poor quality from feeders because the content comes from all over the world. But anyway, next, Joris.
JORIS VAN ROSSUM: Yes. And I'm just going to set up my screen here, and I assume that you can all see my screen. So hello, everyone. Good morning, good evening. Good night for me, when the presentation is live. My name is Joris van Rossum. I am the Director of Research Integrity at STM. STM is the International Association of Scientific, Technical, and Medical Publishers.
JORIS VAN ROSSUM: We have about 150 members worldwide, and our members publish about 2/3 of research articles globally. I'm going to take a slightly different-- show a slightly different perspective. I'm going to talk about AI, metadata creation, and historical bias from the publishing or publisher perspective in this session. But first, I want to take a bit broader view and look at science in general and how it's changed throughout the ages.
JORIS VAN ROSSUM: Science started as quite observational, looking at the stars and coming up with theories, and evolved into more empirical and experimental science in the 17th and 18th centuries. From theoretical to computational: the invention of the computer changed science fundamentally. We're now in the data science era, with vast amounts of data being available, also changing scientific practice.
JORIS VAN ROSSUM: And the availability of data and the increase in computational power, but also new technologies, especially AI, will, we feel, change science even more, into smart science. So one technology, of course, that is very promising is artificial intelligence, and it really has the potential to change science fundamentally and improve it as well. It can basically test a hypothesis against vast amounts of data.
JORIS VAN ROSSUM: Now we are able to really plow through enormous data sets and come up with new hypotheses, design new theories, explore new connections that a human could never be able to make. It can also do research. And yes, it could even run entire labs and write research articles. Actually, there is already some experimentation with writing books using AI.
JORIS VAN ROSSUM: So there is really a lot of potential to change science, improve science, and therefore also improve the impact that science has on society. When I think about AI a bit more deeply, there are several components in AI. First of all, the input data, the training data that feeds the AI systems; we talked about it in the earlier presentations.
JORIS VAN ROSSUM: Then, of course, there are the algorithms, the computer code that actually does the AI; the outputs of the AI process; and, of course, the use and applications of those AI outputs. And publishers, academic publishers, are actually involved in all four of these. First of all, we are, of course, providers of data. That's our expertise, high quality data. So in that component, we play an important role.
JORIS VAN ROSSUM: And of course, we are also developing AI tools, we communicate the AI outputs through our journals, and we are using AI tools. So AI is important for us as publishers. We are currently preparing a white paper on AI and science, particularly ethics, which we plan to publish by the end of April. And for that reason we did a questionnaire-- we asked publishers, how are you currently using AI?
JORIS VAN ROSSUM: And actually, to our surprise, they are already using it in quite different ways. So how is AI already used by publishers? Well, it's used to recommend articles to readers, similar to what Amazon does, learning from what readers consume and feeding them new content. That's a practice that has already been happening for a few years. AI is used to identify journals, the right journal for a manuscript, but also to find the right editors or reviewers in the process.
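As a rough illustration of the first of those uses, here is a minimal content-based recommendation sketch: rank articles by the similarity of their abstracts to what a reader just consumed. The abstracts, the TF-IDF approach, and the function are assumptions for illustration and do not describe any particular publisher's system, which would typically also learn from usage signals.

```python
# Minimal sketch of content-based article recommendation (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

abstracts = {
    "A1": "deep learning methods for protein structure prediction",
    "A2": "transformer models applied to genomic sequence annotation",
    "A3": "field measurements of glacial melt rates in greenland",
}
ids = list(abstracts)
tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(abstracts.values())

def recommend(just_read: str, k: int = 2):
    """Rank the catalog by cosine similarity to the abstract just read."""
    sims = cosine_similarity(tfidf.transform([just_read]), matrix)[0]
    ranked = sorted(zip(ids, sims), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

print(recommend("neural networks for protein folding"))
```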
JORIS VAN ROSSUM: AI is used to identify the quality of English in submitted manuscripts and see what workflow is appropriate for that particular manuscript. We even have examples of AI, as I mentioned, that writes books, that takes content and automatically writes books. And what's especially interesting on the horizon is that it allows us to detect and prevent fraud, which is an increasing issue for us.
JORIS VAN ROSSUM: Think about duplicate submissions, plagiarism, data and image manipulation-- applying AI to prevent that is, of course, a very important goal of ours, basically improving scientific integrity. So AI can really help there. Are there downsides? Are there risks?
JORIS VAN ROSSUM: Yes, we talked about some of them in the earlier presentations, and we think there are three big risks in AI, or things we have to be very cautious about. One is, as was mentioned before, that AI is only as good as the data that feeds it. Using flawed or wrong data will basically ensure that whatever comes out is worthless.
JORIS VAN ROSSUM: You can compare it to an espresso machine: you can have a fantastic espresso machine, but if the beans are of bad quality, the coffee machine will not make the coffee better. So that's an important consideration, making sure we have the right data, correct data, and well-selected data. I will talk a bit about that later. The second, of course, is that if you have flawed models, then AI also doesn't work well, and it can do harm.
JORIS VAN ROSSUM: That risk, of course, is enhanced by the fact that it's an opaque technology, it's a difficult technology. People don't easily understand the algorithms, and sometimes the companies deploying it don't understand it themselves. So that's a big risk for AI. The third one is that, of course, by its nature, with AI, computers learn by identifying patterns in existing data and existing processes.
JORIS VAN ROSSUM: So it amplifies the past and the present, and that carries a risk: if you have the characteristics, for example, of a group that was previously less successful, and AI is used as a predictive tool, it can lead to discrimination. So historical bias can lead to discrimination, and I think we're all aware that it's something we really have to work to prevent. But I think for science there is another risk related to that historical bias, and that is that in science, AI risks consolidating the contemporary paradigms and structures.
JORIS VAN ROSSUM: What has been successful in the past will be used to predict the future. But real science, as Thomas Kuhn, the famous historian and philosopher of science, has said, makes its significant breakthroughs by breaking patterns, breaking paradigms, and finding new ways of thinking. And the risk of AI is that if you strengthen the existing patterns, the existing paradigms, it can actually suppress innovation and make sure we never get out of existing paradigms.
JORIS VAN ROSSUM: And I think that is a real threat for science and therefore for society. Humans are conservative and biased enough, and we shouldn't allow technology to make us even more so. Especially when AI is used in evaluation of science and peer review, we have to be very careful not to allow this bias to impact science in this way.
JORIS VAN ROSSUM: AI has risks, not only in science, but, of course, everywhere in society, hence various initiatives and organizations are working on ethics principles. Some of those organizations are here: the OECD, the EU, the US government, but also specific organizations are working on ethical guidelines. And in general, if you look at them, they have certain things in common: general principles that AI should benefit society.
JORIS VAN ROSSUM: It should respect the rule of law and privacy, of course. It should be robust, safe, secure, and accountable. Discrimination, as we mentioned before, is a true risk, and AI has to be transparent and has to have human oversight. At STM, our mission is to advance trusted research. That is also very relevant to AI, as we've seen looking at the risks, so we are working on outlining our ethical principles for AI as well.
JORIS VAN ROSSUM: We are, as I mentioned, working on a white paper, which will be launched by the end of April and will outline the ethical principles, also touching on a lot of the aspects that we've seen in the previous slides-- transparency, accountability, trust, fairness, sustainability, et cetera. So at the end of April we hope to publish that white paper with more detail.
JORIS VAN ROSSUM: I want to spend a few minutes talking about data, the data aspects again. The first risk of AI is that we don't use the right data. So I want to just talk about the crucial role of quality data and quality metadata because metadata describing the content and various aspects, like integrity and curation, is really crucial in, first of all, the quality of data and the right selection of data as training and input.
JORIS VAN ROSSUM: And I want to spend a few minutes talking about how that metadata is created in the publication process. First of all, there is scientific content, the journal articles, which are increasingly being used for AI as well. And metadata here is, first of all, created by the authors themselves, by means of giving keywords, the institutions, all the names, et cetera, and then through the submission process more metadata is added.
JORIS VAN ROSSUM: But also, of course, journal selection is important metadata, because it already defines the specific [INAUDIBLE] area in which the research is being done. Also important, A&I services do additional indexing, and often AI engines using scientific content are looking for content, selecting content, using these services. So the metadata added there is also important. The next one is scientific data, which, again, is very promising.
JORIS VAN ROSSUM: If AI feeds on data, primary data, it can, we expect, weigh even more, because machines are better at interpreting and reading data than written narratives. So here, how is metadata created in the process? Well, data is mostly submitted to repositories. The metadata is added there, by the repository selection in the first place, of course, but also by the repository and in data creation.
JORIS VAN ROSSUM: Additional metadata is added through the association with a published article, potentially, that links to that data, which also adds metadata, and that information, the data, is then fed into AI systems. The question here, of course, is: is this enough? Is this sufficient if you compare it to the metadata created for journal articles? And this is a discussion we need to have as an ecosystem. Do we need more processes to create the right metadata, ensuring the appropriate use of that data for AI?
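To make that comparison concrete, here is an illustrative sketch of the metadata layers that typically accompany a journal article versus a deposited dataset. All field names and values are hypothetical; the point is only to show where the layers thin out on the data side.

```python
# Illustrative only: typical metadata layers for an article versus a dataset.
article_metadata = {
    "title": "An example article title",
    "author_supplied": {"keywords": ["..."], "affiliations": ["..."]},
    "journal": "Example Journal",            # journal choice already narrows the field
    "submission_system": {"subject_codes": ["..."]},
    "a_and_i_indexing": {"subject_headings": ["..."], "classification": "..."},
}

dataset_metadata = {
    "repository": "generalist repository",   # often the main selection signal
    "deposit_form": {"title": "...", "creator": "...", "license": "..."},
    "linked_article": "DOI of the associated article, when one exists",
    # No routine indexing or review layer comparable to A&I services above:
    # the gap behind the "is this enough?" question.
}

# The asymmetry is the point: AI systems selecting training data from
# repositories have far less descriptive metadata to go on.
print(len(article_metadata), "article layers vs", len(dataset_metadata), "dataset layers")
```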
JORIS VAN ROSSUM: There is a lot of talk in the ecosystem about data stewards, for example, at institutions, helping with adding the metadata. You can also think about peer review of data creation that could, again, ensure the right metadata and the right quality of data, which is so crucial for AI. So to conclude, AI has the potential to really revolutionize science and help society even more than it already does today.
JORIS VAN ROSSUM: It plays an important role for publishers as well, as we are involved in a lot of the components of AI. However, there are risks, so we need to work on ethical principles. And again, the producing, creating, and disseminating of data and metadata is really crucial, and therefore also a focus for us. So that's my talk. Thank you very much.
JORIS VAN ROSSUM: If you want to contact me, here are my details, and I'm very much looking forward to the discussion.
KARIM BOUGHIDA: Thank you, Joris. I'm looking forward to your STM AI ethics report-- very interesting. This is it. This is the end of our recorded session, and we're really, really looking forward to the live session on February 23 for Q&A and also debate. I want to thank our presenters, Dominique, Michelle, and Joris. Thank you very much, and see you. Bye for now.
[MUSIC PLAYING]