Name:
A conversation about semantic censorship-NISO Plus
Description:
A conversation about semantic censorship-NISO Plus
Thumbnail URL:
https://cadmoremediastorage.blob.core.windows.net/ce3eb12d-0d20-4609-9918-2ab768ce129f/thumbnails/ce3eb12d-0d20-4609-9918-2ab768ce129f.png?sv=2019-02-02&sr=c&sig=ixeOZUatMRvsOUlIMQpc%2Fbsk7%2F69E0ugIspWA2DKxT8%3D&st=2024-12-08T22%3A36%3A07Z&se=2024-12-09T02%3A41%3A07Z&sp=r
Duration:
T00H45M19S
Embed URL:
https://stream.cadmore.media/player/ce3eb12d-0d20-4609-9918-2ab768ce129f
Content URL:
https://cadmoreoriginalmedia.blob.core.windows.net/ce3eb12d-0d20-4609-9918-2ab768ce129f/A conversation about semantic censorship-NISO Plus.mp4?sv=2019-02-02&sr=c&sig=PCik7NQ9fjvxbwQ3BA%2FNFLoXbq1fWfjR9s0XHsRAhHY%3D&st=2024-12-08T22%3A36%3A09Z&se=2024-12-09T00%3A41%3A09Z&sp=r
Upload Date:
2022-08-26T00:00:00.0000000
Transcript:
Language: EN.
Segment:0 .
[MUSIC PLAYING]
YASUSHI OGASAKA: Welcome to the NISO Plus 2022 session, A Conversation about Semantic Censorship. My name is Yasushi Ogasaka, from the Japan Science and Technology Agency, based in Tokyo, Japan. I am also a member of the NISO Plus Planning Committee, and I am very honored to serve as the moderator of this session. This session is going to cover the implications of choosing or not choosing taxonomy terms, viewed through the lens of censorship.
YASUSHI OGASAKA: The session speakers are-- Marjorie Hlava, the President and Chairman of Access Innovations, Angela Cochran, the Vice President of Publishing at the American Society of Clinical Oncology, and Shelly Ray, a Principal Transportation Planner, Research and Records Information Management at Los Angeles County Metropolitan Transportation Authority.
YASUSHI OGASAKA: The presentation will be led by Marjorie, and Angela and Shelly will make comments and share their thoughts at each segment of the presentation. Now, before moving on to the presentation, I would like to make some logistical announcements. This is a pre-recorded session, followed by a live conversation where the speakers will take questions from attendees and have a discussion about them.
YASUSHI OGASAKA: So if you are attending the session live, you are encouraged to post your questions to the chat window, together with the name of the speaker you would like to ask. If you are viewing this before the session date, please come back and attend the live session for further conversation. Thank you for your participation. Now, Marjorie, please start the presentation.
MARJORIE M.K. HLAVA: Hi, everyone.
HLAVA: Angela Cochran, Shelly Ray, and I, Marjorie Hlava, are going to have a conversation about word control and how it can lead to semantic censorship. The question that we're trying to deal with is, how do we as stewards of information and the concepts it represents-- those of us who are publishers, librarians, and taxonomists-- ensure access to the full range of topical ideas while being sensitive to the political climate of our times, or the time in which the item was created?
HLAVA: So what we're going to do is talk through several segments: an introduction, some things about controlling thinking, censoring examples, some information loss, and detecting bias. I'm going to present each segment, and then Angela and Shelly are going to come back with some comments. And then I imagine we'll have more comments during the conversation that follows with the open audience.
HLAVA: So this is a delicate topic. Outlines of knowledge to guide people's thinking, or just to organize your own, have been around for ages. I mean, they go back to Socrates and St. Augustine, and John Locke, who really made an outline of what we now know as Western knowledge. And then, of course, Descartes, who's been increasingly popular lately.
HLAVA: Melvil Dewey put together the Dewey Decimal System to try to figure out how to organize the knowledge held in the books in a library. And the Encyclopedia Britannica built an outline so that they knew what kind of knowledge they should cover and include in the encyclopedias that they built. And Karl Marx, of course, built a whole outline of how he saw the world. Each of them had a point of view and organized their worldview according to that outline.
HLAVA: And they're fascinating things to study. I actually wrote a book on that. But it's a slippery slope. On the one hand, it helps to organize our information. On the other hand, it becomes hard to separate from the politics and thinking of the day. And we're going to try to stay above politics, but it's really hard. So when we're talking about controlling thinking as practiced through those outlines, there have been many who reviewed and decided what other people should read or watch.
HLAVA: And that happens in motion pictures. It happens in library collections and their selection policies, in deciding what is and isn't acceptable content to include in a journal, for example, and other things. So here's an example. In Venice, in 1564, the Catholic Church, represented by the Congregation for the Doctrine of the Faith, made a list of what was OK to read and what wasn't.
HLAVA: It was the original list of prohibited books, at least in the Western hemisphere. And people followed it pretty tightly for quite a long time. Another broadly known example is the motion picture codes. The "Don'ts and Be Carefuls" started out in 1927 for the Motion Picture Producers and Distributors of America. And they were pretty careful about not including things that had to do with vulgar expressions, or venereal diseases, or ridicule of the clergy, for example.
HLAVA: And it went on for quite some time. You shouldn't use any firearms in movies. You shouldn't show any sympathy for criminals. You shouldn't talk about smuggling. All kinds of things were covered by that listing. And they went on to talk about sedition, or first-night scenes, or just an incredible array of things.
HLAVA: Even kissing-- you needed to be very, very careful about that. And that held for many years, until Clark Gable said, "Frankly, my dear, I don't give a damn," and the critics went crazy. And the evaluation criteria got rewritten over time into the Motion Picture Association guidelines that we know now. So you all know about G, and PG, and PG-13, and R for restricted, and so on.
HLAVA: And that's what we live with now. But it's gone through quite a number of iterations. Another thing that happens is that people ban books, just as in 1564, only now they ban them a little differently. The Nazis burned books. They didn't think that kind of learning was good; only certain things were sanctioned. But according to the American Library Association, these are the most banned books from public libraries and schools in the US in 2020, which is not very long ago, although it seems forever with COVID.
HLAVA: Most of those books, I have to tell you, were required reading when I was in high school, which is probably dating myself. But this year, they are banned. What's with that? We also have screening for things that are unacceptable: filters for inappropriate content, or bad science, or what's inadmissible, and so on.
HLAVA: You can make a taxonomy out of them if you like, including preferred terms, non-preferred terms, and lots and lots of synonyms. This is an example of one suspect-science taxonomy. Or you can just set an editorial policy for the organization. Here's an example of what BioMed Central thinks should be there. They want to be very careful about what is and isn't included in the journals that they represent.
HLAVA: So that's just a quick overview. Shelly and Angela, what do you think?
ANGELA COCHRAN: So I've been thinking about controlling thinking this week after newly empowered politicians in the Commonwealth of Virginia have introduced bills and even executive orders that prohibit teachers from assigning work that has the potential to make students feel uncomfortable or guilty on the basis of gender, race, or sexual orientation. Another bill prohibits teachers from bringing up divisive topics without representing both sides, whatever the quote unquote "both sides" of the topic might be.
ANGELA COCHRAN: And lastly, one prohibits all public school systems in Virginia from providing training to teachers on bias, equity, and inclusion, and prohibits any position from having "equity" in the job title. So these are very much about words, for the most part. I mean, we can say that they're also about concepts. But there are specific words chosen in these kinds of legislative proposals that really put us in the full swing of controlling thinking.
ANGELA COCHRAN: Now, these silly bills aren't likely to get approved, if only because the two halves of our legislature are controlled by different parties. On a bit of a lighter note, I chuckled about the movie ratings. If you go back to watch a movie rated PG from the early 1980s, you will quickly see that it's vastly different from PG today. The comedy Airplane!, released in 1980, is rated PG despite having drinking, smoking, drug use, swearing, and nudity.
ANGELA COCHRAN: I don't even think it would get a PG-13 rating today. So the Motion Picture Association of America chose not to change the ratings of older movies, which is sort of interesting when you think about it, because viewers today are left to apply today's definition of what's acceptable in a PG movie to older movies that no longer meet that standard. And I think that we can do better than that. Regarding screening for unacceptable content, this is becoming more and more critical.
ANGELA COCHRAN: When I was at ASCE, there was a conference paper, submitted twice to two different conferences, that was a case study out of Malaysia. It argued that all the street signs needed to be changed because more and more women were driving and they were, quote unquote, "intellectually incapable" of learning the rules of the road. We need to really be vigilant about these things: the controversial topics, the terms that require an additional human review to make sure that what is being put forth for publication is acceptable, or at least in context.
SHELLY RAY: One of the things that really stood out to me in the prohibited content for films was the ban on any sort of "sexual perversion," knowing that over many decades what is considered a perversion has changed. The effect of excluding some of that content is to stigmatize those conditions-- stigmatizing relationships, people with disabilities, and things like that-- and all of this really is subjective to the culture of the time.
SHELLY RAY: And to what Angela was saying about bills to ban any sort of teaching that would relate to reducing bias or increasing inclusion: I think of a lot of the revisions that we've seen in the language. For example, there is the documentary Change the Subject, about the students from Dartmouth who petitioned the Library of Congress to remove the subject headings "aliens" and "illegal aliens" and replace them with "undocumented immigrants."
SHELLY RAY: And the documentary opens with the voices of what sound like network news commentators essentially saying, well, what do these liberal snowflake students think they're going to do? Write this letter and get them to change the language, because it hurts their feelings? The other side of that is, if this language really is dehumanizing, and our interpretation of it changes,
SHELLY RAY: and we have better alternatives, are we really censoring if we remove and replace those terms, or is that just part of the ethical effort to be more inclusive?
MARJORIE M.K. HLAVA: Thank you both. Any more thoughts before we move on?
ANGELA COCHRAN: I think, as Shelly just mentioned, we do have a responsibility and a desire to evolve. And so moving ourselves forward is certainly where we want to be going, to reduce harm that may or may not have been intended in the past. When we know better, we should correct. But we should correct not by erasing, but by amending.
MARJORIE M.K. HLAVA: Well, let's move on to some other censoring examples. You guys have just covered a couple of them, but I have one to share as well. And that is, we decided to make a taxonomy on COVID. It seemed like a good idea. We are professional taxonomists, and we could share it with our customers. We gave everybody a copy for free-- we still do-- as a service in the time of the pandemic.
MARJORIE M.K. HLAVA: So I started collecting a whole bunch of words that are used to name COVID and put them into a taxonomy. And you can see, we also covered biochemistry, and drugs, and epidemiology, and related viral diseases, and so on. But the problem came when we talked about exactly what we should cover in those terms. So these are a whole lot of names that it was called early in the pandemic.
MARJORIE M.K. HLAVA: And I've noticed a couple of new ones recently that also need to be covered. And of course, when we started covering all the variants, those needed to be added as well. In this case, we have COVID-19 as the current preferred term-- PT stands for preferred term, and UF stands for used for, that is, a synonym. And a term that's rich in synonyms gives you the opportunity to cover a concept while not making a given name the term that you think should be the one in common usage.
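To make the structure concrete, here is a minimal Python sketch of a preferred-term entry with used-for synonyms and a lookup that resolves any synonym to the preferred term. The entries and the resolve_term helper are illustrative assumptions, not Access Innovations' actual thesaurus software.

# A minimal thesaurus entry: one preferred term (PT) with its
# "used for" (UF) synonyms. Terms shown are illustrative only.
THESAURUS = {
    "COVID-19": [
        "SARS-CoV-2 disease",
        "novel coronavirus disease",
        "2019-nCoV disease",
        "coronavirus disease 2019",
    ],
}

# Reverse index: every UF synonym points back to its preferred term.
SYNONYM_INDEX = {
    uf.lower(): pt for pt, ufs in THESAURUS.items() for uf in ufs
}

def resolve_term(term: str) -> str:
    """Return the preferred term for any synonym, else the term itself."""
    return SYNONYM_INDEX.get(term.lower(), term)

print(resolve_term("novel coronavirus disease"))  # prints: COVID-19

Because indexing and search both call the same resolver, research tagged under any of the names remains findable under the preferred term.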
MARJORIE M.K. HLAVA: My trouble came when I started adding the terms Wuhan disease, or Wuhan flu, or the CCP virus, or the Chinese Communist Party virus, because according to one of my staff, those were used by some groups of people and not by others. And that, I said, is exactly the point. We want to get everybody's comments and everybody's information into the thesaurus, so that no matter what they called it, their research is usable.
MARJORIE M.K. HLAVA: And it was quite a disagreement. I have to tell you, this person was firmly against including them, calling them politically laden terms. Although, politically laden terms are often the ones that carry a lot of additional conceptual information, which some people would call conceptual baggage. So it's hard to include that.
MARJORIE M.K. HLAVA: So what's included? What information is lost if we don't use this word? Should we not use the word because it's politically incorrect, or should we just put it in the synonym list? And will our search engine accommodate that synonym list? So what about the Hong Kong flu, or Middle East respiratory syndrome, or the Spanish flu, which is now called the 1918 flu?
MARJORIE M.K. HLAVA: But it lasted about five years. So it's hard to just call it by one name. And it was very hard to settle. In the end, we decided that it was OK to put them in as synonyms, but certainly not use them as main preferred terms. Comments, Shelly and Angela?
SHELLY RAY: I think that's the beauty of having a taxonomy: you're including those as used-for terms, noting that this is not the primary term, and also allowing yourself to include, in the body of content that you're curating, works by people who are using that term. Because I think it's important for us to know who is using that term and what they're talking about, and not exclude that from our search results. So you're building out a tool that allows you to retrieve all the different variations.
SHELLY RAY: And early on in the pandemic-- for example, I found a copy of Fortune magazine from March 2020, OK? So this is a well-known publication, and there is an instance of "the Wuhan virus" mentioned in there. So very early on it was just another term, and it became politically loaded later. Would we want to exclude an article like that, by a reputable author, because today we are uncomfortable with that term?
SHELLY RAY: And I don't think we would.
ANGELA COCHRAN: The current political climate-- and not to say that it's unique, because certainly we've gone through them before-- in which terms are misused for completely different purposes, seriously presents a challenge when we're talking about indexing and providing taxonomy terms. Even something like critical race theory, which is an actual theory actually taught in a university setting, has now been maligned to mean any number of things in our public school systems.
ANGELA COCHRAN: And so what happens to that term now? Will the actual critical race theory need a rebranding exercise because it's now seen as a negative thing in our public education system, even though it really doesn't have a place there? Taking innocuous terminology and molding it to fit an argument, particularly a political argument, is certainly nothing new, but it's also an example of why we need to be really clear about preferred terms and the terms that are synonyms.
MARJORIE M.K. HLAVA: I can remember when CRT meant cathode ray tube. And I sat in front of them for a great many hours a few years ago; I think probably most people did. And now it has a completely different meaning if you say it. I also have a recent example of someone who thinks that we shouldn't say "preferred term" anymore. We should say "main term" or something like that, because "preferred" indicates a bias.
MARJORIE M.K. HLAVA: Let's move on to another indication of how words play a big role for us as stewards of information. And that is lost information. What happens when the words are not part of the taxonomy? Shelly just brought this up a little bit. By not preferring or not using one term or another for a Fortune magazine article, you don't have access to the article at all in a big search system.
MARJORIE M.K. HLAVA: So how do we make that information discoverable? And when do we decide if it's censorship or if it's selection? One way to think about that is this fellow, the pigeon on the left, who asks, what are the other words for censorship? And the scholarly pigeon answers: suppression, blackout, censoring, restriction, expurgation, control, bowdlerization, iron curtain. And I'm sure he could go on.
MARJORIE M.K. HLAVA: But it's a delicate path that we walk. We know that a lot of people over the years have deleted information on purpose. Like the damnatio memoriae-- I can't say that now. Anyway, the decree in Rome that meant the name of the damned was removed. Their names were completely removed from statues, and statues were blasted or hammered to pieces.
MARJORIE M.K. HLAVA: People were removed from the record of history. And in Egypt, some of the pharaohs were chiseled out of history by subsequent pharaohs or rival families. And it's like that right now: if you're of the opposite political party, you're dead to me, I don't want to hear your name ever again-- that kind of thing. And it's just for having an opinion, or for representing an idea that other people don't hold.
MARJORIE M.K. HLAVA: Some people over the years have been excommunicated by the church for bringing up their particular opinions. And many libraries have been burned, as have other cultural resources, by conquering peoples who didn't hold with that culture or those ideas. And some of these things are not on purpose-- I mean, there might be a fire or a natural disaster. But there's also war. And sometimes the authors themselves burn their works.
MARJORIE M.K. HLAVA: They don't want anybody to read them after they're dead, for example. And often works are lost when they're digitized, which I think is entirely preventable. It really bothers me. It used to happen on microfilm, when the only access to the information on the film was the header, the microfilm header. You had one header for 96 images and no other index.
MARJORIE M.K. HLAVA: And I've gone through a lot of fiche and film in my day, and I can tell you that that information is effectively lost. The same thing happens when we digitize information and don't tag it before we dump it into a computer. When we just move it to the computer as independent images, there's still only a file name or a header. And I think that's information lost.
MARJORIE M.K. HLAVA: It's something we need to be really careful about. So of course, one of the reasons that information is lost is the lack of subject metadata, or taxonomy terms, or whatever you want to call them. Sometimes it's because nobody remembers what the code is anymore, like these two codes here. The one on the top is probably well known to everybody: it's the Rosetta Stone.
MARJORIE M.K. HLAVA: It's the same exact text in three different scripts, which were no longer well understood, but somebody broke the code and we were able to read them all again. Sometimes it's the indexers themselves. When they're adding the taxonomy terms, they think, yeah, this isn't really important. That's a minor topic.
MARJORIE M.K. HLAVA: I'm not going to index it. Or they believe the topic is not accessible based on some code or another. Sometimes it's just because it's not understood, or not deemed relevant or important. But a large part of it is term changes over time that have not been tracked. So there were a couple of articles that came out this week having to do with the changing of names.
MARJORIE M.K. HLAVA: One of them was about Google's inclusive language. It's quite an interesting article, and it talks about how different terms are being replaced. Some of them I can see: I would agree that we shouldn't have master-slave servers anymore; I can see where that's not appropriate. But I don't know how dummy variables and black boxes are so bad.
MARJORIE M.K. HLAVA: And since I think I'm part of the group, senior citizen has been changed to older person. And I don't know-- somehow I personally would prefer to be a senior citizen than an older person. And then there's this article from the University of Michigan's University Record, talking about how the library is taking steps to remediate harmful metadata language, primarily the terms that are applied. I think Shelly can talk more about that in just a second.
MARJORIE M.K. HLAVA: But this is a very common process at the moment. I know a number of people that are changing their taxonomies so that they can be more current. And what they're doing in that case is changing the preferred and non-preferred terms. OK, guys. Back over to you. What do you think?
SHELLY RAY: Well, another example of moving to more inclusive language in IT-- you gave a few examples, like black box or dummy value. The University of California, Irvine Office of Information Technology last year released an inclusive language guide. And there's a lot of slang in IT. And some of the changes to make the standard language more inclusive-- I would say they are improvements, because they're more clear.
SHELLY RAY: They don't rely so much on any individual being culturally part of the IT culture to understand what you're talking about. And there are other examples they give. Like, instead of whitelist, use allow list or safe list-- that's very literal about what whitelist means. Instead of blacklist, deny list or block list. Some of this new language is more clear.
SHELLY RAY: Master-slave, converting that to primary and secondary-- that's more clear. And you see now even a lot of realtors listing a primary bedroom rather than a master bedroom. The inclusive language suggestions also remove ableist language: instead of sanity check, use smoke test, competence test, or coherence test.
SHELLY RAY: Again, becoming more literal instead of relying on a metaphor, or removing violent language: so instead of a kill, you use a halt or a stop. One thing that is challenging, and that archivists have overcome, is when they have archival metadata and there is now different language to use for that tag-- just turned my video off. Sorry about that.
SHELLY RAY: They'll have the previous term in quotes to identify that it is an archival term. And for coders, or the IT world, you can simply use comments to say the to-do is to remove this and convert everything-- for example, "the terminology in this file will be replaced with primary and secondary by the end of the year; in the meantime, the information here reflects the existing ISO standard."
SHELLY RAY: So there are ways to make that migration, make it clear in the meantime, and keep coherency so you don't lose the information that you're looking to carry. So I guess my overall point is that this isn't an impossible task. We can make information more accessible. And oftentimes these are improvements that make the language we use more clear, rather than relying on whatever bias the first cataloger or the first code writer brought to it.
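As a concrete illustration of the two practices Shelly describes, here is a small Python sketch; the field names, role names, and comment wording are hypothetical, not taken from any specific archival standard or codebase.

# Practice 1: keep the superseded heading alongside the current one,
# in quotes, so it reads as archival rather than current usage.
catalog_record = {
    "subject": "Undocumented immigrants",
    "archival_subject": '"Illegal aliens" (superseded subject heading)',
}

# Practice 2: flag in-code terminology for planned replacement.
# TODO: replace "master"/"slave" with "primary"/"secondary" throughout
# this file by the end of the year; the terms below reflect the
# existing standard until the migration is complete.
replication_roles = {"master": "server-a", "slave": "server-b"}

# Retrieval on either the current or the archival heading still works.
print(catalog_record["subject"], "|", catalog_record["archival_subject"])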
ANGELA COCHRAN: I love the idea that more inclusive language is clearer anyway. That makes a lot of sense. And we should think about whether a colloquialism, or an insider word that becomes an outside word, really explains to someone new what it is and what it means-- and whether being inclusive means actually taking some of that ambiguity out of the terminology.
ANGELA COCHRAN: So I hadn't thought about that before. So that's a really great angle.
SHELLY RAY: There was another interesting one, too, from the UCI guide: "off the reservation." And I've been programming for years. I was like, what does that even mean? Oh, OK. Instead you use "against the grain" or "counterproductive." I know what that means. Yeah.
ANGELA COCHRAN: And when I think about lost work and information lost in general, the preservation piece of it is so important, but it's also, as you pointed out, Margie, in a couple of instances, really subjective. I'm reminded of the social media platforms deciding that some content was worth keeping on the platform, even though it violated the terms of use of the platform provider, because the person saying it is important-- such as the President of the United States. So the President of the United States, or some other head of state or politician, says something that contradicts the terms of use, but it is decided the post can stay up, whereas if I had tweeted it or posted it on a platform, it would be flagged and taken down because it violates the terms of use.
ANGELA COCHRAN: But because someone who's important said it-- and I actually get that argument. However, it is completely subjective to decide which people fall into that category and which content gets to stay up. We had bots that were created to screen-capture and archive presidential tweets during the Trump administration, so that history could not easily be rewritten in real time by just deleting tweets, or reposting a new tweet that says something slightly different.
ANGELA COCHRAN: So deciding who and what gets preserved and what doesn't is pretty fraught. But when it comes to content tagging, we have the ability to use those synonyms and non-preferred terms to make the connection between archived content and modern terminology. I do wonder, and I'd be interested in your thoughts on this, Margie: how difficult is it to take a fairly modern taxonomy, based on, let's say, 20 years' worth of content, and apply it to an archive? When you decide to go ahead and archive older content, even from the same corpus, how difficult is it to apply modern taxonomy terms to content that's maybe 60 to 75 years old or older? How do you accommodate those changes in terminology?
MARJORIE M.K. HLAVA: Well, there are two ways to do it. One is to have your search engine adjust to the terms and map the old term to the preferred term, or the main term, so that when people search for the older term, they find the newer content. And the inverse is true: when the content uses the older term, it is presented under the new term.
MARJORIE M.K. HLAVA: And that can happen by setting up the term equivalency either in the user interface or just below the user interface. The other way is to reindex the corpus. It used to be that reindexing the corpus was a horrendous deal, and now it's not so bad, because machines are a lot bigger and faster, and the technology to do it is much easier. But it wasn't just presidents. It was when Clark Gable said that nasty word, and they rewrote the entire guidelines because he was such a popular star.
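Here is a minimal Python sketch of the two approaches Marjorie describes, query-time term equivalency and reindexing the corpus. The term map and document structure are illustrative assumptions, not any particular search engine's API.

# Old (archival) term mapped to its current preferred term.
TERM_MAP = {"illegal aliens": "undocumented immigrants"}

def expand_query(query):
    """Query-time equivalency: search old and new terms together, so
    older content surfaces under the new term and vice versa."""
    terms = {query}
    for old, new in TERM_MAP.items():
        if query in (old, new):
            terms.update({old, new})
    return terms

def reindex(documents):
    """The other approach: rewrite the stored index terms across the corpus."""
    for doc in documents:
        doc["index_terms"] = [TERM_MAP.get(t, t) for t in doc["index_terms"]]

docs = [{"title": "1998 article", "index_terms": ["illegal aliens"]}]
print(expand_query("undocumented immigrants"))  # both terms searched
reindex(docs)
print(docs[0]["index_terms"])  # ['undocumented immigrants']

Query-time mapping leaves the stored records untouched, which preserves the historical terminology; reindexing rewrites the records once and keeps searches simple afterward.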
MARJORIE M.K. HLAVA: So it's happened more than once, I think. Anything else on information loss before we move on? OK, so the next thorny area for us is detecting bias, which is also a word game. We can introduce bias as writers or publishers, depending on what kind of filters we're applying to the data as it comes in through collection and analysis, or big data equations, or defining meanings through our interpretations and tagging.
MARJORIE M.K. HLAVA: Training sets, in particular, are usually not objective. And so we need to be really careful when we're drawing inferences from all those things, because they are indeed the creations of human design. Who decides what and where, and how do we find that bias? It's really easy to create a model that's going to give the kind of results that you want. And so those models might be flawed, but they aren't necessarily flagged.
MARJORIE M.K. HLAVA: And there are very few databases that list flawed models, or models that have later been determined to be flawed. There are such lists for cell lines: we know which cell lines are no longer appropriate; they've been proved flawed in some way or another. But for models, not quite as much. Still, bias is something that really helps us process information quickly.
MARJORIE M.K. HLAVA: So we listen to the facts, and then we filter them through our personal beliefs, and in the middle we get what we heard, or what we're going to write. So two people can have exactly the same conversation-- they're having it with each other-- and come away with entirely different assessments. It's a really common thing in marriage counseling, for example.
MARJORIE M.K. HLAVA: One hears one thing, and the other hears the other thing. They heard exactly the same piece, but they filtered it differently. The same certainly happens in all of our other interactions as well. And of course, term usage changes over time. So people who were brought up in one time period will hear something differently from people brought up in later times.
MARJORIE M.K. HLAVA: And a lot of the time, it seems to happen much more quickly now. Things move from the full name to an acronym. One of the famous ones is laser-- light amplification by stimulated emission of radiation-- which is just called laser now. Very few people know what it actually stands for. But there are lots of other examples, like the washing-up machine, which everyone knows is a dishwasher.
MARJORIE M.K. HLAVA: Or Kleenex, which lost its battle to remain a trade name because people said, no, sorry, everybody calls it a Kleenex, whatever brand of tissue it is. And then we have problems with homonyms, words spelled the same that mean something different. Like lead: the thing that you walk your dog with, or the channel that feeds into the mouth of a river.
MARJORIE M.K. HLAVA: Or lead, the metal. Or lead, which is a management practice. It's a widely used word with many, many meanings. And mercury, which is a god, or an element, or a planet, or a car-- any number of things. And so terms change over time. And if you do a Google search, you go, wait, that's not what I meant. It's sometimes hard to get to exactly the term you meant. And of course, the clarity and popularity of one usage or another will change significantly.
MARJORIE M.K. HLAVA: I mean, gay used to mean happy and having a party. And "guys" was a popular term for everybody-- meaning everybody, not meaning male or female. So it's a really slippery slope for us to censor information. It could be simple bias, or it could lead to loss. And it could be intentional, or it could be a match to criteria. So as publishers, taxonomists, and information stewards, we need to walk that balance.
MARJORIE M.K. HLAVA: And it's not always easy. So over to you, guys.
SHELLY RAY: Well, I would argue that with machine learning, the point is to create a bias. Yes, there is the societal bias that we're reckoning with now, and the inherent bias that each of us possesses. But the problem really for machine learning is the scale. So in my organization, when I have a training set, I am creating a bias in the machine,
SHELLY RAY: so that it knows, within the corpus that this machine will be looking at-- we're going to influence it so it knows which lead, or what context to look for, to decide whether it's a particular usage of the word lead. So yes, we want to address that in big data. The humans are going to bring bias to it, but that's the point in a lot of cases.
SHELLY RAY: The machine needs to learn from us. We need to create that. So it's a tough balancing act when doing this at scale. But it's kind of the point of what the human part of the work is when you're working with a specific set.
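To make Shelly's point concrete, here is a toy Python sketch in which a labeled training set deliberately "biases" a classifier toward the senses of "lead" that matter in a given corpus. The examples, the labels, and the use of scikit-learn are illustrative assumptions, not her organization's actual pipeline.

# The curators' labels are the "bias": they teach the model which
# senses of "lead" exist in this corpus and how to tell them apart.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "lead exposure in drinking water pipes",
    "lead paint abatement in older housing",
    "she will lead the planning team this quarter",
    "managers lead projects and set direction",
]
train_labels = ["metal", "metal", "management", "management"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# The machine now reads "lead" the way its human curators taught it to.
print(model.predict(["soil tests found lead in the water pipes"]))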
ANGELA COCHRAN: I'm curious, Shelly, because you do build out some of these, or work with them a lot more closely than I do-- the machine learning and some of the databases. Is it actually kind of like the example that Marjorie gave on cell lines, in that so much processing or analysis is done on databases or a corpus of data that you did not train yourself?
ANGELA COCHRAN: So at a certain point, we have collections on top of collections, with taxonomies and what was included and not included, and we get to a point where you might actually be doing some analysis on content that's been taken from multiple databases that were created by multiple people or organizations, with their own biases built in. What do we do about identifying the biases that were, intentionally and for good reasons, built into a database?
SHELLY RAY: I'm going to throw that one to Margie, because I think you have more experience with the different integrations than I do.
MARJORIE M.K. HLAVA: Well, yeah. What I do first is look at the editorial guidelines for any particular source, because you know that they're collecting against those guidelines. And they may or may not be the guidelines that you would have followed for your collection. I mean, for example, there's the American Association for Cancer Research, and there's the American Society of Clinical Oncology.
MARJORIE M.K. HLAVA: Both of them have to do with cancer. One of them calls cancers neoplasms; the other calls them cancers. One of them is focused on research; the other is focused on the clinical setting and what the practitioner should do. And having built taxonomies for both of them, I can tell you that they take a considerably different view of the entire field, right down to what's the proper name of this nasty little thing that climbs around your body.
MARJORIE M.K. HLAVA: And so you just have to be aware of the bias of the corpus. And it's not necessarily a negative thing. I mean, the reason those two organizations exist is that they serve very different bodies of people, and the information that they supply is very different. And they should have that bias. So I'm not against it. But if you tried to use one of those collections to index the other and were not aware of the bias, you would get really awful results.
MARJORIE M.K. HLAVA: OK, I'm going to move on to our next segment, because-- this is so much fun-- we're running out of time. Oh, I guess that's it. So I want to thank Angela, and thank Shelly. Angela is pretty well known to this community; I think Shelly is new to a lot of you. But you can see that their brains are great, and having conversations with them about these topics is really fun.
MARJORIE M.K. HLAVA: So thank you, everybody.
ANGELA COCHRAN: Thank you, Margie.
SHELLY RAY: Thank you.
YASUSHI OGASAKA: Thank you very much, Marjorie, Angela, and Shelly. We now move on to the question and answer session. [MUSIC PLAYING]