Name:
Discoverability in an AI world
Description:
Discoverability in an AI world
Thumbnail URL:
https://cadmoremediastorage.blob.core.windows.net/ba9426ec-f847-4906-9d3c-a39b4214715e/videoscrubberimages/Scrubber_1.jpg?sv=2019-02-02&sr=c&sig=2Pi5scsKKroLbdxcJE4MDB0J1GvOug02BiAGRpBJIIA%3D&st=2024-09-08T21%3A30%3A05Z&se=2024-09-09T01%3A35%3A05Z&sp=r
Duration:
T00H43M11S
Embed URL:
https://stream.cadmore.media/player/ba9426ec-f847-4906-9d3c-a39b4214715e
Content URL:
https://cadmoreoriginalmedia.blob.core.windows.net/ba9426ec-f847-4906-9d3c-a39b4214715e/17 - Discoverability in an AI world-HD 1080p.mov?sv=2019-02-02&sr=c&sig=CzhESkXhW4pF1%2BI7bnMU3zIZ%2BmGxb71Epm3GFTdcrz0%3D&st=2024-09-08T21%3A30%3A06Z&se=2024-09-08T23%3A35%3A06Z&sp=r
Upload Date:
2021-08-23T00:00:00.0000000
Transcript:
Language: EN.
Segment:0 .
[MUSIC PLAYING]
JASON GRIFFEY: Hi, everybody. Welcome to the NISO Plus 2021 session on discoverability in an AI world. My name is Jason Griffey. I'm the director of the NISO Plus conference, and the director of strategic initiatives at NISO. Today, we have a presentation by Andromeda Yelton, who is a software engineer and a librarian who investigates humanistic applications of machine learning.
JASON GRIFFEY: She's also an adjunct faculty member at the San Jose State University iSchool, where she teaches AI. Andromeda has a long and storied history of working on machine learning in libraries. I'm really excited about her presentation. After her presentation, we will have a panel discussion and question-and-answer session with her, as well as Christine Stohn from Ex Libris and Karim Boughida from the University of Rhode Island Libraries, who will be acting as moderator.
JASON GRIFFEY: We hope that you all join us for that conversation immediately after our presentation. Thank you for being here, and, Andromeda.
ANDROMEDA YELTON: Thanks, Jason. So I'm going to be talking to you today about discoverability in an AI world. First, I'll give you a whirlwind tour of how libraries and other cultural heritage organizations are building and using artificial intelligence. And then after the glamorous parts, some concerns. So I'm going to structure the AI use cases into two segments. First, I'm going to talk about AI that supports traditional forms of discovery.
ANDROMEDA YELTON: And then I'll talk about some novel forms of discoverability. So I'm going to be looking at four projects in the AI that supports traditional discovery realm, starting with Annif, which does automated subject header assignment. It's a project of the National Library of Finland. And its code is available at the GitHub link provided, and also you can see the project website at annif.org. And I'll try to give you this type of information for each of the projects so that you can explore more as you like.
ANDROMEDA YELTON: So how does Annif work? Well, you can see a demo of it at annif.org, and it looks like this. There's an input text box where you can upload whatever you want. So I chose to test it by uploading part of one of Henriette Avram's papers on the development of the MARC bibliographic format, and whether it could be adequately maintained by COBOL programmers.
ANDROMEDA YELTON: There was some concern that programmers might need to be skilled at languages other than Fortran or COBOL to use MARC, so I'll leave that as sort of a project for you to decide whether that's turned out to be the case. So you upload your text, and then you select the neural net model-- the NN in there stands for neural net, I believe-- that you would like to use to analyze it and suggest subject headings.
ANDROMEDA YELTON: So it has a number of choices that are based on different machine learning libraries, and that are available for Finnish, Swedish, and English. So I picked English because unfortunately, I don't speak Finnish or Swedish, and it provided me with the following suggestions in the lower right corner. And even though this is a pretty short chunk of text, for the most part, these are pretty applicable. And it's definitely something about data systems, and programming languages, and databases.
ANDROMEDA YELTON: Not quite sure what Finland is doing in there, but this is a Finnish project. So maybe there's a lot of Finland-related stuff in the corpus they have trained on, and you're likely to see that. So as you can see, this is the sort of thing where you wouldn't necessarily want to assign these subject headings with a machine and put them on the web or put them in your catalog without any kind of oversight.
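The human-oversight workflow described here amounts to triaging scored suggestions before a cataloger sees them. Here is a minimal sketch in Python; the labels, scores, and thresholds are all made up for illustration (a system like Annif returns similarly scored suggestions, but this is not its API):

```python
# Toy sketch of a human-in-the-loop filter for automated subject
# suggestions. The (label, score) pairs and thresholds are invented.
def triage(suggestions, accept=0.8, reject=0.2):
    """Split scored suggestions into auto-accept and needs-review lists;
    anything scoring below `reject` is silently dropped."""
    accepted = [s for s, score in suggestions if score >= accept]
    review = [s for s, score in suggestions if reject <= score < accept]
    return accepted, review

suggestions = [
    ("data systems", 0.91),
    ("programming languages", 0.85),
    ("databases", 0.74),
    ("Finland", 0.31),
]

accepted, review = triage(suggestions)
print(accepted)  # high-confidence labels a cataloger can quickly confirm
print(review)    # borderline labels that deserve a closer look
```

The point of the sketch is that the machine does the broad sweep and the human spends time only on the borderline cases, which is exactly the productivity win described next.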
ANDROMEDA YELTON: But it could be the sort of thing that could radically accelerate a cataloger's productivity and allow them to catalog a much larger set of texts by giving them the quick set of suggestions, and then they can sort of thumbs-up/thumbs-down the obvious choices, and then just spend more time on the ones that are particularly challenging to catalog that a computer can't handle. Another set of projects actually is the Teenie Week of Play, so-called because it was a week of investigating various computational approaches to the Charles Teenie Harris Archive.
ANDROMEDA YELTON: You can hear lots more about this in about seven hours, at 8:00 PM Eastern, when Dominique Luster, who is the archivist for the Charles Teenie Harris Collection talks about it. So definitely check that out. I've seen her talk about this work before, and it was awesome, which is why I talk about it every chance I get. So look at that.
ANDROMEDA YELTON: But the archive, along with several other Pittsburgh-area technologists and creatives, investigated various things they could do computationally with their archival data and metadata. The artificial intelligence things they tried included automatically shortening titles, extracting locations and personal names, and looking for the same person in different photos across the collection.
ANDROMEDA YELTON: And so let me show you how a bit of that worked out. So here's an example of shortening titles. Now your descriptions, of course, you want to be fairly rich and full of whatever nouns might be relevant to you. But a lot of their titles were also really long. And this proved to be a problem for putting things on the web because you just have these extremely unpredictable lengths of titles attached to your images and it didn't display well.
ANDROMEDA YELTON: And so Luster was having to manually edit all of these titles down to a form short enough to be web-friendly. So they checked out whether machine learning could do it. This used the Python modules spaCy and textacy to identify parts of speech and main clauses and break down long sentences, like "Children, including Cub Scouts and Brownies, posing on the grandstand with television personalities Lolo and Marty Wolfson with sign for TV Safety Rangers at WTAE studio," to just "Children at WTAE studio."
ANDROMEDA YELTON: Similarly, they extracted personal names and place names. So this is an example of extracting the personal names, going from "Pittsburgh Pirates' baseball team manager Fred Haney posing with Birmingham Black Barons player Dwight Smallwood at Forbes Field," and it successfully found Fred Haney and Dwight Smallwood. It's missing the location name Forbes Field, but that's fine, because this example was just looking for personal names, not locations.
ANDROMEDA YELTON: That was a different data pass. They tried both the Natural Language Toolkit and the Stanford Named Entity Recognition libraries, and they had better results with the latter. And they were doing this work in hopes of supporting geotagging and making it easier to identify people across the collection. So that's cool. It's not, however, flawless. This example successfully pulled out Eddie Cooper and Wilhelmina, who didn't have a last name, but also pulled out Home-A-Rama, who is not a person.
ANDROMEDA YELTON: So they had a 73% success rate for pulling out named entities, and concluded that it could be improved by teaching their machine learning system some context specifically relevant to the collection-- things like Forbes Field, a local place that a general-purpose named entity recognition toolkit might not know about, but that anyone in Pittsburgh would. But even with the imperfect success rate, it still speeds up archivists' work.
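A success rate like that 73% figure is essentially a precision measure: of the entities the system pulled out, what fraction were actually right? A minimal sketch, using the Eddie Cooper / Home-A-Rama example from above (the gold-standard list here is a hypothetical human judgment):

```python
# Sketch of scoring a named-entity pass against a human-checked answer
# key. The entity lists are taken from the example in the talk; the
# "correct" set stands in for human review.
def precision(extracted, correct):
    """Fraction of extracted entities that are actually right."""
    if not extracted:
        return 0.0
    hits = sum(1 for e in extracted if e in correct)
    return hits / len(extracted)

extracted = ["Eddie Cooper", "Wilhelmina", "Home-A-Rama"]  # NER output
correct = {"Eddie Cooper", "Wilhelmina"}                   # human judgment

print(round(precision(extracted, correct), 2))  # 0.67
```

A real evaluation would also track recall (entities the system missed entirely), but even this one number tells you how much cleanup work remains for the archivist.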
ANDROMEDA YELTON: And then they looked for people. So this is an example of identifying the same person in different settings. And the pipeline here is there's one artificial intelligence system that does face detection. So it draws those sort of green boxes around wherever it thinks a face is. And then there's a second one that does facial recognition. So it says, this person is the same as this person type questions.
ANDROMEDA YELTON: This is a great match that shows the same person in just completely different events, and the sort of thing that might be very hard to manually discover as a human. Of course, there are also lots of obviously terrible matches, and I'll touch a bit more on that later. They used Mechanical Turk to verify some of the AI suggestions-- not all of them-- and estimated that it would have cost about $500 to check all of their potential matches that way.
ANDROMEDA YELTON: That's much lower than having a human look through all 70,000 of your photos and try to identify everyone who might be the same. Of course, behind Mechanical Turk there are also humans, but they are being handed a much simpler task than assessing an entire collection. Another project that supports traditional forms of discovery in archives is Transkribus.
ANDROMEDA YELTON: This transcribes handwritten documents. It's a project of READ-COOP, which is a European cooperative society-- so it's sort of a corporation, but organized for social benefit rather than for profit-- that's open to non-EU members as well as EU members, and that was established to further this particular project, although it also supports other projects as well.
ANDROMEDA YELTON: And this is a particularly cool project because it's not quite the same as standard optical character recognition systems. Unlike OCR systems, it's not so tightly bound to a particular alphabet or a particular language. So it's been used in the Amsterdam City Archives and the Finnish archives. Out of the box, it supports Arabic, English, Old German, Polish, Hebrew, Bangla, and Dutch, and it can be trained on more languages and scripts.
ANDROMEDA YELTON: They have some pretrained models which recognize particular author's handwriting or the handwriting of particular places and times, but you can train your own models as well. And they have a demo on the web that you can check out or you can download the software and install it to have a full set of features. So let me give you an example. This is one of their sort of standard documents you get if you just try out the free demo on the web.
ANDROMEDA YELTON: And as you see, it kind of goes line by line. It matches up lines in the handwriting with lines on the English side. And you can see, if you can read the English from here, that it's pretty good. I zoomed in so you can see that Edinburgh, 25 November, 1807. It's not perfect. So for instance, the 11 on the next line is the closing quote of that Edinburgh thing.
ANDROMEDA YELTON: And there's a word "cler" in the line above that should be "clerk." But it's a really good first pass. And you can see how doing this and then having a human double-check it could get you to digitize a whole collection very quickly. And then that collection would be full-text searchable. It could be available on the web. It could be used as input to further machine learning processes which need full-text corpora to work with.
ANDROMEDA YELTON: So as with the Teenie Collection, this can be a real acceleration to the human labor that is involved in the archive. And then another example is Laesekompas. This is Reading Compass, in essence, in English. This is a system to recommend books to library patrons. It is a project of the Danish Bibliographic Center, which is a public-private partnership that provides bibliographic data and IT services to Danish libraries.
ANDROMEDA YELTON: As a side note, between this and the previous one, I'm really interested in all of these novel business models that are sort of neither libraries nor corporations. And I think there are some interesting questions as to the future of business models as well as the future of the technology itself, and how tightly interwoven those may or may not be. Anyway, I found that very interesting. But the project here, what it does is it's based on loan data.
ANDROMEDA YELTON: And it's also based on some novel metadata. So they worked with librarians in Denmark to have the librarians design a new vocabulary, instead of sort of traditional subject metadata, that better served their users' expectations. And all new fiction in Denmark is now cataloged with this new vocabulary. And what it uses is sort of the feelings that you might want to feel or the atmospheres you might want the book to convey.
ANDROMEDA YELTON: And you'll see some examples of that on the next slide. So that metadata and the loan data feeds into a couple of different AI recommender algorithms. And librarians who are using the system in their library can customize the relative priorities of those different algorithms. And there's some guardrails on those algorithms. So for instance, they're not just recommending all of the other books by the same author.
ANDROMEDA YELTON: So you might also like those, but you wouldn't be discovering new and interesting things because you probably already know you would like those. There's some guardrails to keep it from just recommending the most popular books because loan data is always going to funnel you toward the most popular books if you're not a little thoughtful. I'm not sure how they handle privacy in the system. That's always a concern of mine with loan data.
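The guardrails just described (don't recommend the same author back, don't let loan data funnel everyone toward the bestsellers) can be sketched as a post-filter on raw recommender output. All of the book data below is invented, and this is not Laesekompas's actual logic, just an illustration of the idea:

```python
# Toy sketch of recommender guardrails: drop same-author titles and cap
# how many of the most-borrowed books can appear in the results.
def apply_guardrails(candidates, seed_author, popular, max_popular=1):
    """Filter raw recommender output before showing it to a patron."""
    out, popular_used = [], 0
    for title, author in candidates:
        if author == seed_author:
            continue  # patron already knows this author's other books
        if title in popular:
            if popular_used >= max_popular:
                continue  # don't let bestsellers crowd out discovery
            popular_used += 1
        out.append(title)
    return out

candidates = [
    ("Sequel", "A. Author"),        # same author: filtered out
    ("Bestseller 1", "B. Writer"),
    ("Bestseller 2", "C. Scribe"),
    ("Hidden Gem", "D. Novelist"),
]
popular = {"Bestseller 1", "Bestseller 2"}

print(apply_guardrails(candidates, "A. Author", popular))
# ['Bestseller 1', 'Hidden Gem']
```

Note that the guardrails deliberately trade a little accuracy for serendipity: the filtered list is less "safe" than raw loan-data similarity, but more likely to surface something new.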
ANDROMEDA YELTON: But let me show you how it works. So if you go to the laesekompas.dk website, you'll see these various reader archetypes that have these adjectives next to them, like fantastic, or mystic. So OK, here's our gothy teenager who likes mystical dark stuff. And you can click on any of those adjectives to get other books that are cataloged with those, or the yellow button at the bottom to get books that fit all of them.
ANDROMEDA YELTON: So let's look at books that are warm. And so far, this is not AI. This is just following the traditional cataloging-- or the traditional cataloging with the new metadata. But if you hover over the books, you'll see this blue button that in English means reminiscent of. If you click on that, you've got the AI recommendations for other books you might also like. And you can just keep doing this.
ANDROMEDA YELTON: And like if I spoke Danish, I would still be doing this because this is really addictive. So there's an example of how AI can advance traditional discovery functions having to do with like readers' advisory. Now AI can also power novel forms of discovery. We can use artificial intelligence to create discovery systems that are unlike anything that currently exists in libraries or cultural heritage organizations.
ANDROMEDA YELTON: And that's super fun, so let's do it. In fact, I have already done some of that. The first project I'm going to be talking about in this category is a project that I wrote, Hamlet. And this is for exploring a corpus of graduate theses. You can see the code at github.com/thatandromeda/hamlet, or you can play with Hamlet itself-- not the visualizations on the next slide, but the other forms of discovery it has-- at hamlet.andromedayelton.com. So this was trained on a corpus of about 43,000 master's and PhD theses, mostly from science, technology, and engineering-type subjects.
ANDROMEDA YELTON: And this corpus had a little bit of metadata, like authors and departments. But it didn't have very much. And in particular, it didn't have full-text search capabilities and it didn't have subject access. And that was a bummer to me because without subject access, it's very hard to find the documents that might be most relevant to you. You can look at all the documents that come from the same department and hope that maybe they have something in common.
ANDROMEDA YELTON: But the fact is there's thousands of theses from the same department, sometimes. So it's very hard to find a needle in a haystack there. And in addition, sometimes things that are from the same department don't have much in common. So this was a corpus of MIT graduate theses, and they have an Electrical Engineering and Computer Science Department, which has like 5,000 or 6,000 theses in it.
ANDROMEDA YELTON: But fundamentally, some of the theses at the electrical engineering end of that department don't really have anything in common with computer science theses. They're much closer to mechanical engineering or physics. And some of the theses at the computer science end of that department have nothing in common with electrical engineering; they're basically math.
ANDROMEDA YELTON: So departments aren't great for colocation or discovery. So I'm like, well, what else can we do? And I used an algorithm called doc2vec, which is-- it trains a neural net, and it puts theses sort of closer together or farther apart depending on how conceptually similar they are to one another. And that's something it figures out by itself, according to its own mysterious methods.
ANDROMEDA YELTON: I don't tell it what to do. And once we've got things that are closer together or farther apart, you know what you can do? You can put them on a map. You can say, like, this is the whole world of theses. What are all of them? And so this is all 44,000-some of them. I'm going to take you through a couple of these labeled places. In the interest of time, I'm not going to take you through all of them.
ANDROMEDA YELTON: But for instance, you can see up at the northern end of the big island, near the number one, there's the sort of red and orange. It's color-coding by department. So it turns out if you sort of hover over those-- you can't here, but in the page I have, if you were to hover over those, you would see that the orange ones are chemistry and the red ones are biology.
ANDROMEDA YELTON: So in fact, the algorithm has discovered that biology and chemistry theses are in the same neighborhood. That they generally relate to one another, which makes sense because biochemistry is a whole discipline. And there's a lot of overlap sometimes between biology and chemistry. A couple other things of note here.
ANDROMEDA YELTON: This area two-- this green is physics. But if you look very closely, you may be able to see there's a little bit of orange up here. And that is the same orange as over here. Those are chemistry department theses that have been placed over near physics instead of near the bulk of the chemistry department. Why is that? Well, if you were to zoom in on those dots specifically and look at the titles of those theses, you would find that they're all about nuclear magnetic resonance.
ANDROMEDA YELTON: So they're all from the end of chemistry that is, honestly, really physics. It happens to be pursued by people who are getting chemistry degrees with chemistry methods, but could just as easily be in the physics department. And then I also want to have a look at this thing that looks like some kind of giant manta ray swimming as fast as possible away from most of the rest of the theses.
ANDROMEDA YELTON: This one I struggled with a lot because as you can probably tell, it's a lot of different colors. So it's not just sort of one department that's really different and is hanging out together. And I couldn't find a pattern when I looked at their titles. Like it didn't seem to be some giant interdisciplinary glob focused on the same thing. I just couldn't figure it out for the longest time. And then I actually started looking at the original texts online of these documents.
ANDROMEDA YELTON: And I realized they're all things like low-quality photographs or old theses that were obviously typewritten, where the back side of the page is bleeding through and visible on the front side of the page. Which is to say, they're all things that are horrendously difficult to OCR correctly into text. This is Bad OCR Island. These are the theses whose digitization and scans were of low enough quality that the computer couldn't figure out how to turn them into full text in a really clean way.
ANDROMEDA YELTON: And so the machine learning algorithm struggled. And this made me sad because there's all this data in here that isn't really amenable to computational extraction right now. Some of which was originally written on a computer and submitted as a computer file, but has been stored as a PDF or a print form of record. And so we've lost that data we originally had. And we're going to have to do work if we want to recover it and make it amenable to discovery in this way.
ANDROMEDA YELTON: Yeah, save your originals. Otherwise, other people in the future have to do work. And it's very sad. What are some other things you can do, though, besides visualization, if you have a sense of what theses are closer together and farther apart? So you can literally just search in sort of a recommendation way.
ANDROMEDA YELTON: Search for authors and search by titles, and you can do this on the Hamlet website right now. And you can find out, what are the closest, what are the most similar other theses to a given title or to the works of a given author? And so if you know that one of them is relevant to your research, you can ask the neural net, hey, what else might I want to read? And it can tell you.
ANDROMEDA YELTON: Or if you are an author, and you have an ego, you can see who else is doing work similar to your own. Or if you are an author and you want collaborators, you can see who else is doing work similar to your own, and possibly who else is doing work similar to your own who might be in a different department, and therefore you might not have come across them sort of socially or through your lab. Similarly, if you're like a faculty member, an advisor, you can see who else's advisees are doing work that's similar to your advisees' work and therefore find potential collaborators.
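The "most similar theses" lookups described above all reduce to nearest-neighbor search over document vectors: doc2vec assigns each document a vector, and similarity is the angle between vectors (cosine similarity). A toy sketch with invented 3-dimensional vectors (real doc2vec vectors typically have hundreds of dimensions, and this is not Hamlet's actual code):

```python
import math

# Cosine similarity: 1.0 means same direction (very similar documents),
# near 0 means unrelated.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Invented document vectors for illustration.
vectors = {
    "NMR spectroscopy thesis": [0.9, 0.1, 0.0],
    "Quantum physics thesis":  [0.8, 0.2, 0.1],
    "Compiler design thesis":  [0.0, 0.1, 0.9],
}

def most_similar(query, vectors):
    """Rank every other document by cosine similarity to the query."""
    qv = vectors[query]
    scored = [(cosine(qv, v), name) for name, v in vectors.items() if name != query]
    return [name for _, name in sorted(scored, reverse=True)]

print(most_similar("NMR spectroscopy thesis", vectors))
# physics ranks above compilers, mirroring the chemistry/physics overlap
```

At 44,000 theses a production system would use an approximate nearest-neighbor index rather than scanning every vector, but the ranking idea is the same.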
ANDROMEDA YELTON: What else can you do? Some of this is stuff I've tried working out, some of this is stuff I haven't yet. But all of this is stuff that could absolutely be supported technologically by a system like this. You can automate the first step of someone's literature review. There's a thing on Hamlet that lets you upload your own text and it will tell you what's most similar to that.
ANDROMEDA YELTON: So for instance, you could upload the first chapter of a work in progress, and it would tell you what other dissertations are most like it. Once you've done that, you can try to parse out the bibliographies from those works. And you can say, people like you also cited. Now I've actually tried this. There's a first pass of it on the website. But it turns out extracting bibliographic data from unstructured text is hard because everyone formats their bibliographies differently.
ANDROMEDA YELTON: So I wouldn't put that into production right now, but it's the kind of thing you could do. You could look for prospective coauthors, like I said earlier-- particularly valuable if you're in a very large university system where it would be quite difficult to know who all is doing what if they're not in your department. You could take the picture that I put up the first time, put a date slider underneath it, and move forward through time to see the different research areas growing and shrinking according to what the priorities of the institution were at the time.
ANDROMEDA YELTON: So you could see computer science just explode in the late '60s and early '70s, for instance. And you can ego surf. And I always believe in empowering the faculty to ego surf themselves. That seems to me good for the library's relationship with its community. All right. Let me show you some other novel forms of discoverability.
ANDROMEDA YELTON: PixPlot is a project out of the Yale Digital Humanities Lab. And it is for exploring and visualizing a large corpus of manuscript images. So they're looking at a corpus of 27,000 images from the Beinecke Library. Again, that's sort of prohibitively large for a human to familiarize themselves with all of it. And similar to Hamlet, they are putting images closer together or farther apart depending on their level of similarity.
ANDROMEDA YELTON: They're using a different algorithm to do so-- I used doc2vec, they're using an Inception convolutional neural net-- but it's the same similarity-based idea, applied to a completely different type of corpus: images versus text. And this is also a project I really like because it's hard to take 27,000 data points and make them render cleanly in the browser, and make them sort of zoomable and pannable, and PixPlot has done that, and I think that's really cool.
ANDROMEDA YELTON: So let me show you an example. This is all 27,000 images. You can see again they've organized themselves into islands. And the DH Lab has pulled out a couple of hotspots, places where the collection has a lot of stuff. So if I click on the buildings hotspot, it takes me over here. And I can just zoom right in, and hopefully you can see as the pictures get closer to the screen, these are all buildings.
ANDROMEDA YELTON: In fact, as I scroll sort of off to the edge, the buildings get kind of squarer, whereas there are more sort of fancy buildings off to the left side. So it hasn't just organized buildings together, it's made a bit of an attempt to organize them by architectural style. Then if I kind of zoom out and scroll over to the main island, my hope is that I will see things that are not buildings at all.
ANDROMEDA YELTON: So let's pick a place, double-click on it, and see what's in there. And these, in fact, are not buildings at all. These are people. Many of them in kind of fancy clothes, many of them doing some kind of maybe performing. But these look like formal portraits. And so PixPlot has made this extremely large collection explorable and accessible by making it visual in a way that it might not have been traditionally.
ANDROMEDA YELTON: One other thing I want to show you, and continuing with the theme of different media types, is Citizen DJ. This is a project by Brian Foo working at the Library of Congress. Brian Foo has done lots of super cool work. You should definitely have a look at his web page, brianfoo.com. He's done stuff with the New York Public Library, the American Museum of Natural History, as well as Library of Congress.
ANDROMEDA YELTON: And he has a set of media tools that use scikit-learn, a machine learning package, to process samples out of large audiovisual data sets. And then he can apply that in various contexts. And so Citizen DJ lets the public remix public domain audio and video from the Library of Congress into hip-hop. So if you go to https://citizendj.labs.loc.gov/, you'll be presented with a bunch of different collections that he's extracted samples out of using these machine learning tools.
ANDROMEDA YELTON: So I looked at the Inventing Entertainment collection. I clicked the remix, and literally, just the first thing it gave me was this sample it assembled out of a Hungarian rag from 1914 and a '60s funk drum pattern. You can remix the stuff to your heart's content. This dropdown is huge. You can do '80s new wave, you can do '70s soul, like whatever you want. This is huge, too.
ANDROMEDA YELTON: And then you can customize which beats-- these are each four-beat measures. And you customize like where each of your little sample shows up. I decided to remove sample three entirely just because I didn't find it auditorily pleasing. But I left the rest of them exactly as they were, and this is what it was. [MUSIC PLAYING] Which honestly, I think that slaps.
ANDROMEDA YELTON: It's great. I played with that a lot. And that's all public domain. So you can download it, you can turn it into whatever you want. You can have fun being a musician with public collections. Yeah, I spent a lot of time playing with that, actually. And so I love the way that machine learning is letting him take a collection that might be hard to navigate otherwise.
ANDROMEDA YELTON: And not just making it more accessible to people, but turning into something where people can make their own creative works on it. So they're not just exploring it, but they're using it for their own ends. That said, I did say earlier that I would talk a bit about some challenges with AI. So I'm going to do that. This is an incomplete list.
ANDROMEDA YELTON: There are a number of important things I have left out-- actually, extremely important things, such as explainability, and digitization, and rights clearance, which are all critical and difficult problems. But I do want to leave some time for the discussion later on, so I left some of them out. But some other challenges with AI, especially in the library setting. One, data cleanliness-- any of you who have ever done any sort of library technology project are already cringing in sympathy right now.
ANDROMEDA YELTON: But in case you have not, some problems with data cleanliness, one, OCR is hard. Like we saw with Hamlet and the bad OCR island, a lot of the texts in a collection are just not going to digitize cleanly. And so it's not going to be possible to feed them forward into machine learning discoverability systems in a way that makes any sense. Metadata is inconsistent.
ANDROMEDA YELTON: Different institutions have different best practices for how they fill out their metadata records, what fields they use, and even how they use some of those fields. The same institution can have different in-house style guides and best practices at different times. There may be some records that are extremely thorough and extremely accurate, and others that are extremely skeletal, so you can't really count on all the fields being filled out.
ANDROMEDA YELTON: I once encountered a set of MARC records that had the US publisher Simon and Schuster as Simon space A-N-D space Schuster, and Simon space ampersand space Schuster, and Simon ampersand Schuster, and if you're a human, those are all the same. But if you're a computer, because computers are very stupid, those are all different. And that's the sort of barrier that can make it really hard to computationally build on things, because computers really need consistency.
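The Simon and Schuster problem is exactly what string normalization is for: collapse the variants a human reads as identical into one canonical form before any computational processing. A minimal sketch (the rules here are illustrative; real authority control handles far more cases):

```python
import re

# Sketch of normalizing publisher-name variants into one canonical form.
def normalize_publisher(name):
    name = re.sub(r"\s*&\s*", " and ", name)  # unify ampersand vs. "and"
    name = re.sub(r"\s+", " ", name).strip()  # collapse stray whitespace
    return name.lower()

variants = [
    "Simon and Schuster",
    "Simon & Schuster",
    "Simon&Schuster",
]

# All three variants collapse to the same canonical string.
print({normalize_publisher(v) for v in variants})
```

Of course, normalization rules like these have to be written and maintained by humans who know the data, which is why data cleanliness stays expensive even when each individual fix is simple.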
ANDROMEDA YELTON: Typos, of course-- we're humans, we make typos. There are lots of them in metadata records and full texts. And computers aren't smart enough to know that two spellings are the same word. Entities with changed names-- people change names all the time, serial publications change names, institutions change names. And so if you have a data set that spans a long time period, figuring out how to match those up can be challenging.
ANDROMEDA YELTON: And the list goes on and on. Related to data cleanliness, but pointier, is data bias. I could go on forever on this topic. I spent several weeks on it in my AI class, so I will radically oversimplify. But the data sets we have tend to be full of biases about gender, race, nationality, any number of things. They may reflect stereotypes.
ANDROMEDA YELTON: For instance, a lot of English language-- natural language machine learning is trained on a news corpus that just happens to talk about men and women very differently because US culture is like that. And that may accurately reflect facts about the culture, but it isn't necessarily what you want your machine learning discoverability system to reflect. There's a classic paper, "Man is to Computer Programmer as Woman is to Homemaker," which found that in some standard English language machine learning systems, literally like the word man and related words like male were very close conceptually to the word computer programmer.
ANDROMEDA YELTON: And then the word woman was not. It was close to the word homemaker. That's a bummer for those of us who are women and computer programmers. And this affects things like translation. If you translate from a language that doesn't have gendered pronouns into English, the translation system has to pick he or she. And it will pick "he is a doctor" and "she is a nurse"-- that's great.
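The analogy test from that paper can be sketched as cosine similarity over word vectors. The 2-D vectors below are toys invented purely to mimic the bias geometry the paper reported; real embeddings have hundreds of dimensions learned from large text corpora, not hand-picked values like these:

```python
import numpy as np

# Toy 2-D "embeddings" invented for illustration only. One axis loosely
# encodes gender, the other occupation-relatedness.
vec = {
    "man":        np.array([ 1.0, 0.0]),
    "woman":      np.array([-1.0, 0.0]),
    "programmer": np.array([ 0.9, 1.0]),
    "homemaker":  np.array([-0.9, 1.0]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means the vectors point the same way."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The paper's analogy: man - woman is roughly parallel to
# programmer - homemaker, i.e. the gender direction lines up
# with a direction separating occupations.
gender_axis = vec["man"] - vec["woman"]
occupation_gap = vec["programmer"] - vec["homemaker"]
alignment = cosine(gender_axis, occupation_gap)
```

In this toy geometry the alignment is 1.0, and "programmer" sits closer to "man" than to "woman": exactly the kind of learned association the debiasing work in that paper tries to remove.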
ANDROMEDA YELTON: So there's lots of stereotypes, there's lots of omissions, and overrepresentations. So for instance, facial recognition systems tend to be trained on white men, and tend correspondingly to do worse on women and worse on dark-skinned people, and especially badly on dark-skinned women. There's a project, Gender Shades, that explores the extent of this problem. I think it's gendershades.org, but if you search for Gender Shades, you'll find it.
ANDROMEDA YELTON: And I know this is one of the problems that Dominique Luster, et al., ran across in exploring the Charles "Teenie" Harris Archive, which documents Black people's lives in Pittsburgh in the middle of the last century. So I'm sure she'll be talking about that later tonight. Similarly, English itself is wildly overrepresented in any sort of natural language processing.
ANDROMEDA YELTON: There's just a lot of English text out there to train machine learning systems on. And so Google Translate works pretty well for English, gender stereotype problems notwithstanding. It works much less well for minority languages, as you have probably noticed if you speak a language other than English or another very widely spoken language. Certainly, I used to teach Latin, and Google Translate for Latin is not good.
ANDROMEDA YELTON: There are also problems with homogeneous teams, where these problems simply aren't noticed because they don't affect anyone on the team. For instance, Google deployed photo tagging without ever noticing that it did incredibly badly on Black people, because there just aren't a lot of Black engineers at Google. So it's easy not to notice this kind of stuff when you don't have a diverse team.
ANDROMEDA YELTON: And a lot of this, at best, is embarrassing for institutions that may be using these tools, but at worst, can cause actual real-world harm in terms of misclassifying people or making products that don't work on them. Another problem with a lot of AI systems is surveillance. The data that many AI systems are trained on is data about human behavior, be that the things we click on online for advertising, face recognition cameras in public life, or user behavior in libraries.
ANDROMEDA YELTON: So this is a thing that can have a disproportionate effect on marginalized populations who are more likely to be surveilled already. AI systems add to that surveillance. And also, quite frankly, it's, I think, setting up a collision course between library values and potential artificial intelligence business models, because privacy is important to libraries, and there are ways you can implement AI systems that are not about tracking, and commodifying, and monetizing user data.
ANDROMEDA YELTON: But those aren't the systems that are necessarily being built. And I would really like to see in the space a partnership between libraries and other cultural heritage organizations that have the data sets and the patrons, and vendors who are much more likely to have the software expertise. But I'm really concerned this is going to end up being an adversarial relationship. And one where libraries feel like they have to choose between cool, futuristic technology and protecting their patrons.
ANDROMEDA YELTON: And I don't really relish that sort of victimization situation. So hopefully, it will be awesome. But I'm concerned that it will not be. And then another problem is resources. If you want to develop artificial intelligence stuff in house as a cultural heritage institution, I mean, you need software engineers. You need some software engineers who are comfortable with math, potentially.
ANDROMEDA YELTON: You need money, you need labor to assemble and label your data sets. You need time and cloud computing to train your models. And that means that in-house artificial intelligence is really out of reach for a lot of cultural heritage organizations. It's getting cheaper all the time, and certainly, partnerships between libraries, archives, or museums and other organizations, like we saw with the Teenie Harris Week of Play, help.
ANDROMEDA YELTON: But it's not a realistic option for most cultural heritage organizations to do this in-house. It's, I think, much more likely that most of the high-profile things in this space will come from the vendor community. But again, I have concerns with that in terms of pricing. Like, will these be things that are only available to rich libraries? Even if the price points are accessible, will they require so much in-house knowledge to run and maintain that only a small number of libraries can have them?
ANDROMEDA YELTON: I feel like there's a lot of opportunities here, but there's also a lot of concerns. And so hopefully, having raised lots of questions and given you lots to talk, think, and argue about, let's have that discussion. Please click the button to join the conversation with me, Christine, and Karim. [MUSIC PLAYING]