Name:
AI and machine learning - less theory, more practice-NISO Plus
Description:
AI and machine learning - less theory, more practice-NISO Plus
Thumbnail URL:
https://cadmoremediastorage.blob.core.windows.net/8e97341e-e0b2-4bc9-a353-d0b7f46474db/thumbnails/8e97341e-e0b2-4bc9-a353-d0b7f46474db.png?sv=2019-02-02&sr=c&sig=RyC3oUAfQ4wYQUZXj7ZTPM%2FHe0Zfo2NpdkzkKnKvlVE%3D&st=2024-09-08T23%3A42%3A07Z&se=2024-09-09T03%3A47%3A07Z&sp=r
Duration:
T00H48M28S
Embed URL:
https://stream.cadmore.media/player/8e97341e-e0b2-4bc9-a353-d0b7f46474db
Content URL:
https://cadmoreoriginalmedia.blob.core.windows.net/8e97341e-e0b2-4bc9-a353-d0b7f46474db/AI and machine learning - less theory%2c more practice-NISO Pl.mp4?sv=2019-02-02&sr=c&sig=MpMHzPypRZrVd0X8sfhzSzNEoSvcbE0P6tv6YZR%2BMLs%3D&st=2024-09-08T23%3A42%3A08Z&se=2024-09-09T01%3A47%3A08Z&sp=r
Upload Date:
2022-08-26T00:00:00.0000000
Transcript:
Language: EN.
Segment: 0.
[MUSIC PLAYING]
CLIFFORD ANDERSON: Welcome to the NISO Plus 2022 session titled "AI and Machine Learning-- Less Theory, More Practice." My name is Clifford Anderson and I'm the associate University librarian for research and digital strategy at Vanderbilt University. It's my distinct pleasure today to introduce our panel and panelists. We are living through an era of major progress in artificial intelligence and machine learning.
CLIFFORD ANDERSON: Nearly every day, computer scientists announce the development of a new technique that promises to deliver astounding results, from beating the world's best players at Go, to discovering new combinations of drugs that might bring an end to our current pandemic. But historians of artificial intelligence tell us that many promised applications have failed to pan out.
CLIFFORD ANDERSON: In the past, the field of artificial intelligence has seen cycles of boom and bust. What makes today's promises different? Today's session aspires to cut through the hype surrounding the potential application of artificial intelligence and machine learning in library and information systems by discussing how such technologies are already successfully being deployed to solve problems today. We have a stellar set of leading experts for our panel today.
CLIFFORD ANDERSON: Our speakers will present on a variety of applications for AI and ML in the information ecosystem. And then we'll open the floor to discussion about future needs in the AI, ML space. I would like now to introduce our panelists in the order that they'll present. Andromeda Yelton is a software engineer and librarian investigating humanistic applications of machine learning and an adjunct faculty member at the San Jose State University School of Information.
CLIFFORD ANDERSON: She recently completed a contract with the Library of Congress, applying machine learning and data visualization to discovery. In the past, she's written code for the Berkman Klein Center, the MIT Libraries, the Wikimedia Foundation, and bespoke knitting patterns, among other things. Previously she was a jack-of-all-trades at the open licensed e-book startup unglue.it. She taught Latin to middle-school boys, and was a member of the Ada Initiative advisory board.
CLIFFORD ANDERSON: She has a BS in mathematics from Harvey Mudd College, an MA in classics from Tufts, and an MLS from Simmons. She's been a 2011 ALA Emerging Leader, president of the Library and Information Technology Association, and a listener-contestant on "Wait Wait-- Don't Tell Me." Our next speaker is Professor Ruixue Zhou, and she is a chief scientist of the Big Data and Knowledge Services Innovation team at the Chinese Academy of Agricultural Sciences.
CLIFFORD ANDERSON: She has more than 20 years of experience in digital libraries, knowledge organization, knowledge service, and scientific data management of agriculture. She spearheaded major projects at the Ministry of Science and Technology in China, the National Key Technology R&D program, the Ministry of Science and Technology, and the National Science and Technology Library, among others. Professor Zhou also serves as director of library and information branch of the Chinese Society of Agronomy.
CLIFFORD ANDERSON: She's the deputy director of the Chinese National Data Center for Agricultural Sciences, serves as a committee member of the Science and Technology Libraries in [INAUDIBLE] and is a member of the Chinese National Library Standardization Technical Committee. Our third speaker is Barry Bealer. He's currently the chief revenue officer of Access Innovations, makers of the Data Harmony suite and the Data Harmony Hub.
CLIFFORD ANDERSON: As an entrepreneurial leader in the software and information industry, he enjoys building and managing teams to execute sound business strategies. Barry has held executive positions at both global enterprises and software startups, including co-founding the first XML enterprise content management system software company. For over 30 years, he has brought structure and process to business practices to execute more efficiently and profitably.
CLIFFORD ANDERSON: Barry is a frequent speaker and moderator at industry events, and has been involved as a board member or committee member of the Software and Information Industry Association, the Society for Scholarly Publishing, and the W3C Accessibility Education and Outreach working group. He holds an MBA with a concentration in Information Systems from St. Joseph's University, and a Bachelor of Science in communications from Millersville University.
CLIFFORD ANDERSON: With that, I'd like to turn over our session to Andromeda Yelton.
ANDROMEDA YELTON: Thanks for the introduction. Let me get this screen shared while [? Griffey ?] narrowly edits this part out. Thanks for the introduction. So I'm Andromeda Yelton, as you said. I'm ThatAndromeda on Twitter. And I'll be talking about a few examples of libraries, archives, and related organizations creating AI software that is in a prototype or production stage, but in either case actually usable, and working on a wide variety of problems, including metadata, transcription, discovery, and facilitating creativity.
ANDROMEDA YELTON: So, I'm a freelance librarian or software developer-- talk to me if you want one of those-- who was most recently on contract with the Library of Congress as part of their Computing Cultural Heritage in The Cloud initiative. This slide is a teaser because, as of this recording, the results of that project have not been formally announced. But I used machine learning to create an interactive data visualization of about half a million Library of Congress documents pertaining to the Reconstruction period.
ANDROMEDA YELTON: There are two other researchers, Dr. Lauren Tilton who used computer vision to enhance discoverability of photographs within the Library of Congress collection, and Dr. Lincoln Mullen who used machine learning to detect Bible quotations across a large volume of Library of Congress collections as an aid to historians and religious scholars. So the first example I'd like to discuss is the Teenie Week of Play.
ANDROMEDA YELTON: This is a project by the Carnegie Museum of Art and several digital consultancies in the Pittsburgh area. And it's a project on the Charles Teenie Harris collection, which is a set of about 70,000 photographs by photographer Charles Teenie Harris of Black life in Pittsburgh in the early to mid 1900s. And so they prototyped a variety of options for machine learning, which included shortening titles, because they're very long descriptive titles that had lots of good information but didn't display well on the web; and also extracting locations and personal names from their descriptions in order to provide more structured metadata.
ANDROMEDA YELTON: But also, and what I'm going to show you in a bit more detail, is they worked on using computer vision to find instances of the same person in different photos. As you might imagine, if you have 70,000 photos it would be quite difficult using only human labor to look through them all and find, for instance, this. Two instances of the same person in different sets of photos. Different events, different clothes, different other people in the frame.
ANDROMEDA YELTON: This would be really hard to find by just poring over 70,000 photos and hoping for the best. And the metadata for your photo collection may not name everyone who's in the photos, and creating that metadata would be extremely time consuming. So using one machine learning system that does facial detection, and then another one that compares the results of those detections, which you can see outlined here with the green box, you can hopefully find a bunch of instances of the same person.
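To make the two-stage approach just described concrete, here is a minimal sketch in Python, assuming the open-source face_recognition library. The folder name, file names, and the 0.6 distance threshold are illustrative assumptions, not details from the Carnegie Museum of Art project.

```python
# A minimal sketch of the two-stage approach described above: detect faces,
# then compare face embeddings across photos to surface candidate matches.
# Assumes the open-source `face_recognition` library; the folder name and
# the 0.6 threshold are illustrative, not the project's actual settings.
from itertools import combinations
from pathlib import Path

import face_recognition

def faces_in(photo_path):
    """Return one embedding per face detected in the photo."""
    image = face_recognition.load_image_file(photo_path)
    locations = face_recognition.face_locations(image)         # stage 1: detection
    return face_recognition.face_encodings(image, locations)   # stage 2: embedding

# Build an index of (photo name, embedding) pairs for the whole collection.
index = []
for photo in sorted(Path("photo_scans").glob("*.jpg")):        # hypothetical folder
    for encoding in faces_in(photo):
        index.append((photo.name, encoding))

# Flag pairs of faces from different photos whose embeddings are close.
# The brute-force pairwise loop is fine for a sketch; a real collection of
# 70,000 photos would want an approximate-nearest-neighbor index instead.
candidates = []
for (photo_a, enc_a), (photo_b, enc_b) in combinations(index, 2):
    if photo_a == photo_b:
        continue
    distance = face_recognition.face_distance([enc_a], enc_b)[0]
    if distance < 0.6:   # smaller = more similar; candidates still need human review
        candidates.append((photo_a, photo_b, float(distance)))

# These candidate pairs are what you would then hand to human reviewers
# (for example, a Mechanical Turk task) rather than treating them as truth.
print(len(candidates), "candidate matches to review")
```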
ANDROMEDA YELTON: They also found a bunch of candidate matches that were not, in fact, of high quality. Machine learning systems for photos tend to be trained on modern, full-color, high-resolution photos. They tend to be trained, at least in the West, disproportionately on white people. And so finding Black people in historical photos is not their highest accuracy space.
ANDROMEDA YELTON: And as a way of countering that, they also used Mechanical Turk to verify the suggestions of the AI. And they estimated it would take about $500 to check all 3,500 or so potential matches. And this points to an important fact about a lot of AI projects. Which is, in fact, there tend to be humans in the loop somewhere.
ANDROMEDA YELTON: There are still things that humans are better at. And there are reasons to consider using AI as sort of an aid and accelerant to human labor, but to continue to involve people at points where they make sense. The second project I'd like to talk about is Transkribus, which is a project of READ-COOP, which is a European cooperative society that was established to further this project, though it has several other projects as well.
ANDROMEDA YELTON: And this is a project to transcribe handwritten documents. If you've ever tried to do any sort of computer processes on handwritten documents, you know that at some point you probably need the full text of them in a machine-readable form, not just a photograph of the original. And it can be very time consuming to transcribe that by hand even with a crowdsourcing project.
ANDROMEDA YELTON: So Transkribus uses ML to do that automatically. It's a particularly interesting project to me because many OCR systems are limited to particular languages or particular scripts. English is far better supported than most other languages, for instance. But of course documents occur in all sorts of languages. And your primary interest may not be in English-language documents.
ANDROMEDA YELTON: And it may not be in documents that use the Roman alphabet either. So this project has been used by the Amsterdam city archives, the Finnish archives. It's been used on documents in Arabic, old German, Polish, Hebrew, Bangla. It has existing models for a number of authors and handwriting styles. But you can also train your own models on your own documents if you like.
ANDROMEDA YELTON: If you go to their website, you see this demo. And you can do some of the basic features on the website. You need to download and install it to get all the features. But what you'll see if you look at the demo is that it's kind of going line by line, and it's identifying lines in the document, and then matching them up with lines that it has inferred. And you may be able to see that, for the most part, the document it's come up with is English.
ANDROMEDA YELTON: There are definitely a few typos. And there are things like this, where it has "Edinburgh 25 November 1807," and then an 11 on the next line. The 11 is actually a closing quotation mark that it doesn't know what to do with. But it's pretty good. Again, you'd want a human in the loop to clean that up. But it would be much faster to have a human review and fix this document than to have a human transcribe it from scratch.
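Transkribus is its own platform with its own trained models, so the sketch below is only a rough illustration of the same line-by-line handwriting recognition idea, assuming the Hugging Face transformers library and its publicly available TrOCR handwritten model. The image file name is hypothetical, and real pages would first need line segmentation.

```python
# A rough sketch of automated handwriting transcription, assuming the
# Hugging Face `transformers` library and the public TrOCR handwritten model.
# This is not Transkribus itself; it just illustrates the line-by-line
# recognition idea described above.
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

def transcribe_line(line_image_path):
    """Transcribe a single cropped image of one handwritten line."""
    image = Image.open(line_image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# In practice you would segment the page scan into lines first, run each line
# through the model, and then have a person review the output (for example,
# the stray "11" that was really a closing quotation mark).
print(transcribe_line("letter_line_03.png"))  # hypothetical file name
```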
ANDROMEDA YELTON: And then you can download the results and go from there. For my next project I'd like to talk about Hamlet, which is a project that I did sort of as a precursor to my Computing Cultural Heritage in the Cloud project. This is a project where I trained a neural net on a corpus of approximately 43,000 master's and PhD theses. I was at MIT at the time.
ANDROMEDA YELTON: So these are primarily theses in science, technology, engineering, and mathematics subjects. Although not exclusively. MIT actually does award some graduate humanities degrees. The algorithm that I used infers conceptual similarity between documents automatically, using the full text, not the metadata, and places theses closer together or further apart according to how similar they are.
ANDROMEDA YELTON: And the fact that it uses full text and not metadata was really important to me, because, as I expect is the case with many institutional repositories, this collection has some great data on things like authors and titles and departments, but there's no subject cataloging. And there's no reasonable prospect of there being enough human labor to do original cataloging on all of these documents and the thousands more that come in every year.
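The talk doesn't name the exact model behind Hamlet, so the sketch below shows one common way to get this kind of full-text similarity, assuming gensim's Doc2Vec: train a document-embedding model on the raw text and query it for nearest neighbors. The thesis identifiers and text snippets are invented for illustration.

```python
# A minimal sketch of full-text document similarity, assuming gensim's
# Doc2Vec. The talk doesn't specify Hamlet's actual model, so treat this as
# one common way to get "closer together / further apart" behavior from the
# full text alone, with no subject metadata involved.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

# Hypothetical corpus: thesis identifier -> full text (the real collection
# had roughly 43,000 documents).
theses = {
    "thesis_aero_1963": "line of sight guidance techniques for manned orbital rendezvous ...",
    "thesis_eecs_1987": "storage structures and query processing for large databases ...",
}

corpus = [
    TaggedDocument(words=simple_preprocess(text), tags=[thesis_id])
    for thesis_id, text in theses.items()
]

model = Doc2Vec(vector_size=200, min_count=1, epochs=20)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Nearest neighbors of a known thesis: conceptually similar documents,
# regardless of which department they came from.
print(model.dv.most_similar("thesis_aero_1963", topn=1))

# Or embed a brand-new document (say, an uploaded paper) and search with it.
new_vector = model.infer_vector(simple_preprocess("a format for machine readable bibliographic data ..."))
print(model.dv.most_similar([new_vector], topn=1))
```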
ANDROMEDA YELTON: So once you have a back end that's a neural net that knows how close together theses are or aren't, there's a lot of things you can put on top of it. So, one example is that you can search for an author or title and find out what is most similar. So, for instance, you can search for noted MIT graduate Buzz Aldrin, second man to walk on the moon, who wrote a PhD thesis about guidance for orbital rendezvous.
ANDROMEDA YELTON: And Hamlet will find a bunch of theses that have to do with guidance and navigation systems for spacecraft. And again, it figured out automatically that these were conceptually related to one another. I didn't tell it anything about what these words meant. It wasn't using the metadata at all. These are all theses from the Department of Aeronautics and Astronautics, but the computer didn't know that. In addition, and one of the things I really hoped for, is that you'll find things don't have to be from the same department to be put together by the neural net.
ANDROMEDA YELTON: So, for instance, I uploaded one of Henriette Avram's early papers on the development of the MARC format for bibliographic data. And it found four theses that it thought were pretty similar to it. And those all had to do with designing sort of storage- and database-type systems for computers. So they were really kind of similar to a paper describing a file format.
ANDROMEDA YELTON: And a couple of things I noticed here that were really interesting to me is firstly, these documents span 1969 through 1995. And second, they're from a variety of departments. Two of them are from the Department of Electrical Engineering and Computer Science, as we might hope. But one is from the Department of Meteorology, which no longer even exists, and one is from Urban Studies and Planning.
ANDROMEDA YELTON: And this was really interesting to me because department name was the closest thing to subject metadata that existed in the traditional metadata records. But department name doesn't do a great job of co-locating similar documents. Two documents can be in the same department and have very unrelated topics. So for instance, the Department of Electrical Engineering and Computer Science has some electrical engineering theses that could be basically mechanical engineering,
ANDROMEDA YELTON: and some computer science theses that could be basically math, which have much less in common with one another than they do with documents in other departments. And this is also interesting to me because if you're a grad student, you probably know about stuff that your lab group has done recently. But you don't necessarily know about stuff that people did a couple of decades ago or that people did in other departments.
ANDROMEDA YELTON: And so the ability to find things that are related to your work, but are not co-located by the metadata that we have available, is really useful. Because, under the hood, Hamlet is putting things closer together or further apart based on similarity, it lends itself to a visual display of information. And this is a precursor to the interactive data visualization stuff that I did with Computing Cultural Heritage in the Cloud.
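Hamlet's actual visualization pipeline isn't spelled out in the talk, so here is a generic sketch of how document embeddings typically become a 2D map with clusters, assuming scikit-learn and a hypothetical file of precomputed per-thesis vectors.

```python
# A generic sketch of turning document embeddings into a 2D map with
# clusters, assuming scikit-learn. Hamlet's actual pipeline isn't described
# in the talk, so t-SNE and k-means here are illustrative choices, and the
# input file of per-thesis vectors is hypothetical.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

embeddings = np.load("thesis_vectors.npy")   # hypothetical: one row per thesis

# Project the high-dimensional vectors down to 2D for plotting.
coords_2d = TSNE(n_components=2, random_state=42).fit_transform(embeddings)

# Group the documents. A cluster dominated by unreadable OCR (like the one
# described below) tends to sit far away from everything else on the map.
labels = KMeans(n_clusters=10, random_state=42, n_init=10).fit_predict(embeddings)

for cluster_id in range(10):
    size = int((labels == cluster_id).sum())
    print(f"cluster {cluster_id}: {size} theses")
```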
ANDROMEDA YELTON: But Hamlet did, in fact, find interesting ways to group documents that are more or less similar to one another. I have a whole blog post on this. But the one I want to call your attention to here is actually cluster number seven, which looks kind of like a manta ray swimming as fast as possible away from the rest of the documents. This turns out to be a bunch of documents that are grouped together because they all have terrible OCR quality.
ANDROMEDA YELTON: So that is not actually anything to do with the underlying concepts of the documents. That's the fact that the neural net couldn't figure out what to do with these because it couldn't read them. And that is going to be a running theme in any applied machine learning project, including all of the ones I've talked about here, is that you really are constrained by the quality of your data, the extent and the quality of your digitization, and the data pipelines you can build on top of that.
ANDROMEDA YELTON: And in practice that is actually the hard part of many of these projects. It's not the ML. It's the digitization. It's the data. Maybe that will be a thing worth discussing in the discussion section. But first, let's move on to my fourth example of a machine-learning project, which I love because this one is about creativity.
ANDROMEDA YELTON: The other three I've talked about are about some fairly traditional problems in library and information science, even though they're using modern techniques. But they're about metadata, discoverability, and transcription. Citizen DJ is a project that is about creativity and play. It's a website that lets the public remix public domain audio samples into their own hip-hop tracks. I really encourage you to go to CitizenDJ.labs.loc.gov and play with it, because it's super fun.
ANDROMEDA YELTON: This is a project by Brian Foo, who is an Innovator in Residence at the Library of Congress. When you go there and you hit the button to have it remix for you, you'll see something like this, where the machine learning has identified a bunch of potentially interesting, very brief, audio samples from eight or nine public domain audio collections. And it's spaced them out in what it thinks will be an interesting sounding way in this four measure, four beats per measure, visual system here.
ANDROMEDA YELTON: But you can put the sounds on other beats if you like. You can see in the first line, it's put sample one on the first beat of the first measure and the third beat of the third measure. But you could click those blue squares off. You could put them on other squares. You can do whatever you want with the samples it gives you. It's just giving you a starting point. And then it also has a whole bunch of different drum patterns-- funk and soul and R&B and all sorts of different feels and sounds you can put under the samples that it's come up with.
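As a toy illustration of the step-sequencer grid just described, here is a minimal sketch; the sample names and the data structure are invented for illustration and are not Citizen DJ's actual data model.

```python
# A toy sketch of the kind of grid Citizen DJ presents: 4 measures of 4 beats,
# with each sample toggled on or off per beat. Names and structure here are
# invented for illustration, not Citizen DJ's actual implementation.
MEASURES, BEATS = 4, 4

# Each sample maps to a 4x4 grid of booleans: True = play on that beat.
grid = {
    "sample_1": [[False] * BEATS for _ in range(MEASURES)],
    "drums_funk": [[False] * BEATS for _ in range(MEASURES)],
}

def toggle(sample, measure, beat):
    """Flip one square on or off, like clicking it in the web interface."""
    grid[sample][measure][beat] = not grid[sample][measure][beat]

# Reproduce the arrangement described above: sample one on the first beat of
# the first measure and the third beat of the third measure.
toggle("sample_1", 0, 0)
toggle("sample_1", 2, 2)

# A player would walk the grid beat by beat and queue whatever is switched on.
for measure in range(MEASURES):
    for beat in range(BEATS):
        active = [name for name, cells in grid.items() if cells[measure][beat]]
        print(f"measure {measure + 1}, beat {beat + 1}: {active}")
```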
ANDROMEDA YELTON: You can change the tempo. So you can really use this public domain audio to express yourself and to engage with these collections from 100 years ago in a way that feels very fresh and modern and personal. And I love how new technology opens up new possibilities for people to explore with collections and use them in ways that maybe they couldn't use before. Research support is great and book recommendations are great, but making hip hop samples is also great.
ANDROMEDA YELTON: And it's fun to have the ability to empower people to do new things with library data. So there you have it. Four examples of things you can do today with library and archival data. There are many more that one could talk about that hopefully we'll hear about in other parts of the presentation. And I'm looking forward to the discussion with all of you.
BARRY BEALER: Well, hello, everybody. My name is Barry Bealer. I'm the chief revenue officer of Access Innovations. And I wanted to talk to you today about a topic that's near and dear to my heart, which is why explainable AI is sometimes not explainable. Now, the other presenters talked about practical applications. I'm taking a more macro view of AI, so I hope this benefits those of you who are looking at AI technologies and possibly thinking about implementing them, or have implemented them and are interested in the lessons learned coming out of that.
BARRY BEALER: So what is AI? Well, as many of you know, there are a lot of words that describe AI. And there's a listing here of just some of them. It can be very confusing when speaking to folks about what AI truly is. Is it artificial intelligence? Is it assisted intelligence? Is it advanced algorithms?
BARRY BEALER: Things of that nature. So it can be extremely confusing depending on your lens and view of what AI truly is. I'm here to tell you that's not unusual, and we're going to talk about that in the upcoming slides. Gartner-- and, as many people know, Gartner has hype cycles for pretty much everything-- in artificial intelligence, they actually track 34 different artificial intelligence themes or technologies.
BARRY BEALER: And I can almost guarantee that every organization really only touches upon two, maybe three, of these areas. So again, just understand that AI is an umbrella. And talking about and trying to explain it can be confusing unless you have the view of the organization that you're speaking with. The nice thing is, for many of those of us who are in the content industry, semantic search is coming into that Slope of Enlightenment.
BARRY BEALER: So that's good news for us. But again, when you see a slide like this and you see 34 different topics being tracked, you know that there's going to be, and bound to be, confusion around it. Doing research to prepare for this presentation, I'm not trying to point out anybody's doing anything wrong, but I found this article describing "What Is Artificial Intelligence?" Which is great.
BARRY BEALER: However, in the second line, they then change and say artificial intelligence is general intelligence and it's strong AI. OK. So, we're using words now to describe artificial intelligence with more words of artificial intelligence. And I think we need some clarity around that. So we need some things to be explainable.
BARRY BEALER: Pictorially though, I'm not so sure this chart necessarily gets to what is AI. The takeaway here, though, is it's evolving. OK? AI has been evolving for a number of years as many of you know. I don't know if we're yet jumping out of a human body and becoming just a brain and a stem walking around. But be that as it may, AI really is evolving and it continues to evolve.
BARRY BEALER: And I think as more and more people invest in it, it's just going to get better and better. And there's going to be more clarity around how it can help you. I love this one. For people who are in a hurry, there are just hand-drawn diagrams of what AI is. And again, I'm not trying to make fun of the fact that some people draw pictures of what AI is.
BARRY BEALER: We're all trying to explain this. We're all trying to, in our context, explain what AI means to me. But if you're in a hurry, it looks like hand-drawn slides are the way to go. I found this and it really struck me and stuck out. And I blanked out the name of the company who was promoting this but it's 8 important definitions of artificial intelligence.
BARRY BEALER: Which made me pause and think, wait a minute, there's eight definitions of artificial intelligence? And I thought, wait a minute, no. These are only the important ones. So there have to be more than eight definitions of artificial intelligence. Again, you can see out in the industry, as you start to do research around AI technologies, there are a lot of definitions.
BARRY BEALER: There's a lot of confusion. Hence, the reason why we have to start explaining things a little differently. Interestingly, an MIT class came up with this diagram, which asks 14 different questions. And as you go down through the list, it's trying to figure out, from the start, whether or not you're actually using AI technology. I thought this was brilliant.
BARRY BEALER: After you go through the various tree structure there, you can figure out, yep, it's using AI technology if you get to the bottom. The point here, though, is it's important to ask questions when you start implementing or evaluating AI technology. Ask a lot of questions to figure out what's going to be explainable to you in your environment. What is all this leading to?
BARRY BEALER: Well, unfortunately it's leading to frustration. I don't think it's leading to two guys in a suit and tie putting boxing gloves on and boxing it out. But it is leading to frustration in the industry. We have, in many cases, organizations that are talking past each other. We don't know what each other's perspective on AI is and how it's going to be applied. So we don't even have a common definition of what artificial intelligence and machine learning really are.
BARRY BEALER: So we have to be careful about how we speak about AI and AI technologies. Essentially, I want people to understand we're all trying to say the same thing. So if we take a step back and really define, for the person we're speaking to, what your definition of AI or machine learning is, I think we're all going to get there. But just be aware that different lenses or views of the world, and of what artificial intelligence is, cause confusion.
BARRY BEALER: So we all need to be mindful of them. If we look at Oxford University Press, the OED, the Oxford English Dictionary, their definition of AI is, "the theory and development of computer systems able to perform tasks normally requiring human intelligence." Pretty straightforward. It gets a little messier than that as all of us know. But that's a pretty straightforward, standard definition.
BARRY BEALER: Pictorially, most of the time you're going to run across this type of Venn diagram, which I think is pretty clear. Which is there are kind of three components of this AI umbrella. One is the intelligence or the rules, which is AI. And then a component of that is machine learning, which is taking that data and training the model. And then deep learning, which is more the application of multiple models, where it starts going into the neural network area.
BARRY BEALER: But from a simplistic standpoint, these are kind of the three main areas that you see when you start describing or talking about AI with someone. So most of the folks who are in this event are from the content industry. Whether you're a journal publisher, a book publisher, or an online publisher, it doesn't matter.
BARRY BEALER: You're either going to be a publisher of content or you're a consumer of that content. So I think this will resonate with everyone. But about three years ago, there was a study-- and I don't want to be duplicative of other presentations-- but there was a study done on the future impact of artificial intelligence on the publishing industry back in 2019.
BARRY BEALER: And this is interesting. The first thing that jumped out at me reading this is, "As discussions on AI increase"-- excuse me as my thing is in the way here a little bit-- "so does the hype and therefore the confusion surrounding it," which is absolutely true. So as the hype, and hence the hype curve you saw, continues to go up, there's just more and more confusion.
BARRY BEALER: I swear, we just make up new phrases every day around what AI truly is. So AI in publishing. If you look at the publishing life cycle, AI has really started to penetrate the entire life cycle. So from author, content creation, whether it's in word or some other form, to editorial, to production, to aggregation and distribution, to the end-user experience.
BARRY BEALER: It really is in that full gamut now of where AI can apply. The question is, where is it most beneficial in your organization to apply AI-based technologies? From this study, The Future Impact of AI on the Publishing Industry, there's the expected impact of AI. And I thought this was interesting to point out. So there are two areas I wanted to focus on. The first was future company importance.
BARRY BEALER: And as you'll see in the second line here, the majority say, yeah, you know what? This is important for us. But then you see the current company investments, and there's very little. Now, this was pre-COVID. And I will tell you, I did an informal survey of half a dozen executives in the publishing industry, three from very large global publishers and three from smaller publishers, just to kind of get a sense of where they think things are going with respect to the impact of AI.
BARRY BEALER: Interestingly, 2020, for both large and small, was all about fiscal responsibility and making sure they're surviving, because they weren't sure how long the pandemic was going to last. And at that point, they weren't sure is the revenue going to increase or is revenue going to decrease? By the end of 2021 though, many of them were seeing an uptick in their content sales because everything was moving online or more people were subscribing to databases.
BARRY BEALER: The larger publishers also told me they are starting to invest in AI-based technologies. So they're already testing things out or implementing them in their production cycle. The smaller publishers, though, I hate to say it, are still sitting on the sidelines, not investing for the most part in AI-based technologies. Which, unfortunately, is going to create a world where the bigger publishers are able to advance much, much more quickly than the smaller players because of AI technologies.
BARRY BEALER: So, at some point, I hope those two worlds come together a little bit more with their adoption of AI. So the challenges, as you can imagine, around implementing AI solutions-- financial investment. Any new technology, it's going to be a financial investment, right? And then understanding the ROI, the return on investment.
BARRY BEALER: Those are two challenges for pretty much any technology, not just AI solutions. But I thought the other areas to point out are recruiting skilled employees in AI-based technologies and then training staff. So this isn't about replacing staff. This is about creating an environment where the staff are more productive, but they understand and are able to explain how that AI is helping them automate specific points in that publication process.
BARRY BEALER: Not surprisingly, by the way, the departments that are expected to benefit the most right now are end-user applications. So chat bots, which I'm sure we've all run across, recommender functions, more-like-this features, things of that nature. AI is out there and is being embraced. And again, from the larger publisher perspective, what I'm hearing is it's all focused on how to service the client or consumer.
BARRY BEALER: Quite frankly, to be able to sell more content, right? The other areas are starting to adopt more. Distribution makes a ton of sense from the standpoint of aggregation and distribution; that can be highly automated using AI. There doesn't necessarily have to be manual effort involved. But in the other areas, I think it's going to take some more time for AI, depending on the technologies, to really be applied in the appropriate fashion.
BARRY BEALER: Again, when it comes to which departments are specifically trying to implement AI: leveraging analytics-type technology, obviously sales and marketing is doing that. Content creation: in some cases, what you're reading about is content creators out there creating content just based on data. And I know that's actually outside of publishing, but in news they actually create content just from data.
BARRY BEALER: And you're going to see more and more of those things creeping into the traditional journal or book publisher. And then, again, we talked about chat bots and the other things. But the end-user, client-facing technologies are ahead of everything else right now, and are adopted, I should say, more readily at publishers. So the question really is, is AI explainable, given all these things that try to describe AI and what it truly is?
BARRY BEALER: Well, I personally think that AI can be explainable, but it's complicated. And because it is complicated, that's where you have to understand the lens through which you're talking to somebody about AI, and also understand what their lens on AI is. So what do I mean by that? Well, there are factors for explaining AI. Understanding your audience.
BARRY BEALER: You just have to understand who you're speaking to. If you're speaking to a vendor about AI, make sure you understand their perspective on what AI and machine learning truly are. If you're a leader in the organization, you have to communicate that vision around AI. Because with the implementation and automation of those processes, people have to understand they're not losing their jobs; they're going to be reallocated and retrained.
BARRY BEALER: So you have to communicate that vision up front. But the context of automation is important as well. Not everything, 100%, can be automated. I guess, in theory, it could. In practicality, for publishers, it's just not going to be, right? There is going to have to be some human intervention along the way. But providing that context around what you're automating in that process I think is important.
BARRY BEALER: Early adoption, while not necessarily for everybody, but early adoption really can help you prove out specific use cases. And understanding that vendors are willing to work with you with respect to pilots and trials to see how it does, how AI does affect your publishing cycle. One of the things I heard from the publishers that I spoke to about the follow-up to that survey from 2019 is they are incrementally implementing and investing in AI.
BARRY BEALER: This is not about replacing everything. They're trying to find specific points in their workflow or process where they can incrementally implement new technology based on AI with a minimal investment. This is not something where they have to write a check for a million dollars. They're incrementally trying to figure out, how can I spend x amount of money and leverage AI to decrease the production timeline?
BARRY BEALER: The last thing I want to emphasize is verifying results. You have to be able to verify your results. You have to be able to explain what your AI is doing. If you can't explain what your AI is doing, you're not going to be able to verify those results. So with that, I'll turn it back over to the panel. Thank you very much.
JIAO LI: OK. Hello, everyone. This is Jiao Li from the Agricultural Information Institute of the Chinese Academy of Agricultural Sciences. And our topic is A Knowledge Discovery System Integrating Knowledge Organization and Machine Learning. This is the outline of this presentation. Nowadays, we're seeing constant growth of the scientific literature, making access to scholarly content more and more challenging through traditional search methods.
JIAO LI: Elsevier's report Trust in Research has shown that, on average, researchers spend just over four hours searching for research articles a week, and more than five hours reading them. The rate is 625 articles per week, and half are considered useful. In addition, compared to the year 2011, researchers read 10% less literature and spend 11% more time finding it.
JIAO LI: So how do we help search and discovery? With the development of artificial intelligence, a variety of tools and services are being developed today for [INAUDIBLE] scholarly knowledge and data. The Agricultural Information Institute, a scientific research institute that is home to the National Agricultural Library, has also made active efforts to provide more precise and efficient knowledge services for researchers.
JIAO LI: And today, I'd like to share some of that progress with you. Semantic search based on knowledge organization mainly refers to extended inference over cross-language terms, synonyms, and semantic relations like hypernyms and hyponyms. Here is the framework. The knowledge organization system is semantically expressed in [INAUDIBLE] format to help the computer understand the natural language query, so as to enable modeling and reasoning in the expansion of the original input.
JIAO LI: This also tends to produce more accurate and comprehensive search results to meet users' needs. STKOS is the abbreviation of the Scientific and Technological Knowledge Organization System, which is a bilingual knowledge organization system supported by the Ministry of Science and Technology of China. It integrates glossaries, thesauri, subject words and keywords, and other knowledge organization systems.
JIAO LI: It covers fundamental terms, normative concepts, category systems, and ontology networks, with more than 600,000 concepts and two million terms in the fields of science, engineering, medicine, and agriculture. Here is an example of the extended inference on a user's input. The system can understand the Chinese query [NON-ENGLISH SPEECH] for soybean, and its synonyms in Chinese, [NON-ENGLISH SPEECH] and [NON-ENGLISH SPEECH].
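As a minimal sketch of the thesaurus-driven query expansion just described, here is an illustration with a tiny in-memory vocabulary; STKOS itself is far larger, and the entry contents below are invented for illustration.

```python
# A minimal sketch of thesaurus-driven query expansion. The tiny in-memory
# "thesaurus" below stands in for a large vocabulary like STKOS; its entries
# are invented for illustration.
thesaurus = {
    "soybean": {
        "synonyms": ["soya bean", "glycine max"],
        "zh_labels": ["大豆", "黄豆"],           # cross-language labels
        "narrower": ["soybean meal", "soybean oil"],
    },
}

def expand_query(term):
    """Turn one user term into an OR-group of equivalent and related terms."""
    entry = thesaurus.get(term.lower())
    if entry is None:
        return [term]
    return [term] + entry["synonyms"] + entry["zh_labels"] + entry["narrower"]

# "soybean" becomes a richer boolean query for the search engine to run.
print(" OR ".join(f'"{t}"' for t in expand_query("soybean")))
```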
JIAO LI: In order to provide access to the full text of scientific articles, we have also used the unique identifier [? DOI ?] from [INAUDIBLE] as a basis for data complementing and enrichment of articles. To do that, we collect data from abstract and paper websites and pull full-text links from publishers like [INAUDIBLE], Wiley, and [? Springer Nature, ?] the Chinese Academy search engine [NON-ENGLISH SPEECH], and other open data sources like [? minor, ?] [? MAG. ?] Here is our search result page.
JIAO LI: If the article has full-text links, they will be displayed here. Users can directly view the full text by clicking the red sign, and the green one will show four full-text links for users to choose from. Also, a context-aware model is designed to show whether the full text of a search result is available in the user's current network environment. Here is the framework of the context-aware model.
JIAO LI: A resource scheduling knowledge base, which records the institutions and databases accessible to users' institutions, is necessary. The system locates the user's available services through local, context-aware invocation of the resource scheduling knowledge base, and then [INAUDIBLE] resource parameters to the link resolver following the OpenURL protocol, which returns the available links.
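Here is a small sketch of handing article metadata to a link resolver via the OpenURL (Z39.88-2004) protocol mentioned above; the resolver base URL, helper function, and example DOI are placeholders rather than the system's actual implementation.

```python
# A sketch of passing article metadata to a link resolver via OpenURL
# (Z39.88-2004), the protocol mentioned above. The resolver base URL and the
# example DOI are placeholders, not the production system's values.
from urllib.parse import urlencode

RESOLVER_BASE = "https://resolver.example.edu/openurl"  # hypothetical resolver

def openurl_for(doi, article_title, journal_title):
    """Build an OpenURL that asks the resolver for available full-text links."""
    params = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft_id": f"info:doi/{doi}",
        "rft.atitle": article_title,
        "rft.jtitle": journal_title,
    }
    return f"{RESOLVER_BASE}?{urlencode(params)}"

# The context-aware layer would first check which databases the user's
# institution can reach, then follow a link like this one.
print(openurl_for("10.1000/example", "An example article", "An Example Journal"))
```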
JIAO LI: Here is an example. If the sign of the result is green, it means that users can obtain the full text; otherwise, the full text cannot be obtained in the current network environment. To provide researchers with higher quality resources, we also designed evaluation models of resource impact for articles and patents.
JIAO LI: We connect ESI top papers from [INAUDIBLE] and patents from [NON-ENGLISH SPEECH] innovation in a graph. The [INAUDIBLE] model for articles is shown in this table. A series of indicators are used and their weights are defined; with calculation and parameter correction, the algorithm model is formed. From the perspective of knowledge services, high-quality resources can be provided not only as a kind of resource with a simple indication, but also as a filter on the search result page.
JIAO LI: Moreover, we have also designed linking to relevant research institutions and authors. Users can see the details by clicking on them. Multi-factor ranking is developed for the sorting of search results, to realize the priority display of high-quality and time-sensitive retrieval results by configuring a ranking strategy of position, word frequency, and weight.
JIAO LI: The model extracts multi-dimensional factors such as timeliness, authority, and relevance, and designs a customized ranking strategy for different types of resources like articles, patents, research data, and experts. High-quality resources will also be preferentially recommended based on the algorithm model.
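As a minimal sketch of multi-factor ranking with configurable weights, using the factors named above (timeliness, authority, relevance); the weights, scoring functions, and sample records below are illustrative stand-ins, not the production model.

```python
# A minimal sketch of multi-factor ranking: combine normalized scores for
# relevance, timeliness, and authority with configurable weights. The factor
# names come from the talk; the weights, scoring functions, and sample
# records below are illustrative stand-ins for the production model.
from datetime import date

WEIGHTS = {"relevance": 0.5, "timeliness": 0.3, "authority": 0.2}

def timeliness(pub_year, horizon=10):
    """Newer items score near 1.0; items older than the horizon score 0."""
    age = max(0, date.today().year - pub_year)
    return max(0.0, 1.0 - age / horizon)

def score(item):
    factors = {
        "relevance": item["relevance"],      # e.g. a text-match score scaled to 0..1
        "timeliness": timeliness(item["year"]),
        "authority": item["authority"],      # e.g. from the impact evaluation model
    }
    return sum(WEIGHTS[name] * value for name, value in factors.items())

results = [
    {"title": "Highly cited recent paper", "relevance": 0.7, "year": 2021, "authority": 0.9},
    {"title": "Older but very relevant paper", "relevance": 0.9, "year": 2009, "authority": 0.4},
]
for item in sorted(results, key=score, reverse=True):
    print(round(score(item), 3), item["title"])
```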
JIAO LI: Here is an example. When multi-factor ranking is chosen, high-quality resources with full-text links will show on the first page. Linking datasets and articles can promote their discoverability and retrieval, and improve the transparency and reusability of scientific research. [INAUDIBLE] DB, and so on have launched services to link datasets to publications.
JIAO LI: The relations are usually realized through [INAUDIBLE] files, citations, or metadata. We introduce semantic entities into our method, using a data-linking model and the relation types defined in the designed schema to form the relation database. Here is the research data navigation on the article search result page. And on the research data page, we can see the entity navigation, and users can download and cite the research data.
JIAO LI: On the page of a research dataset or article, users can discover the link and see the details. For example, for this article, we can click the linked research data and the page will jump to the details of that research data. Similarly, if we click the related article link, it will go to the detail page of that article. Next, the knowledge graph.
JIAO LI: Large networks of entities and relationships have been successfully used in intelligent knowledge services like semantic search and intelligent Q&A. Among the existing representations, the knowledge graph has proved to be an effective solution for knowledge representation. With resources on diseases and insect pests in Chinese agriculture [INAUDIBLE], we have generated a knowledge graph with deep-learning models, covering diseases, pests, microorganisms, crops, and animals, with more than 16 [INAUDIBLE] and 24 [INAUDIBLE] RDF triples, and we then developed an intelligent Q&A model based on that graph.
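As a small sketch of representing crop, pest, and disease relations as RDF triples, assuming the rdflib library; the namespace, entity names, and relation names below are invented for illustration and are not the actual graph's schema.

```python
# A small sketch of crop/pest/disease relations as RDF triples, assuming the
# `rdflib` library. The namespace, entities, and relation names are invented
# for illustration; they are not the actual graph's schema.
from rdflib import Graph, Literal, Namespace, RDF

AGRI = Namespace("http://example.org/agri-kg/")  # hypothetical namespace

g = Graph()
g.bind("agri", AGRI)

# Entities: a crop, an insect pest, and a disease.
g.add((AGRI.soybean, RDF.type, AGRI.Crop))
g.add((AGRI.soybean_aphid, RDF.type, AGRI.InsectPest))
g.add((AGRI.soybean_mosaic_virus, RDF.type, AGRI.Disease))
g.add((AGRI.soybean, AGRI.commonName, Literal("soybean", lang="en")))

# Relations that, in the real system, deep-learning models extract from text.
g.add((AGRI.soybean_aphid, AGRI.damages, AGRI.soybean))
g.add((AGRI.soybean_aphid, AGRI.transmits, AGRI.soybean_mosaic_virus))
g.add((AGRI.soybean_mosaic_virus, AGRI.affects, AGRI.soybean))

# A question-answering layer can then query the graph: "what damages soybean?"
query = """
    SELECT ?pest WHERE {
        ?pest <http://example.org/agri-kg/damages> <http://example.org/agri-kg/soybean> .
    }
"""
for row in g.query(query):
    print(row.pest)
```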
JIAO LI: That is some of our progress so far. In the future, we believe that the complementary integration of knowledge organization and AI will become the mainstream of knowledge service research.
JIAO LI: And this picture shows the six key development directions. [INAUDIBLE] highly specialized knowledge service systems and problem-oriented interaction establish connections between users and the right resources. Knowledge bases and corpus resources provide high-quality data and semantic knowledge models for machine learning. Text mining technology is used to extract information and knowledge and to identify concepts, entities, and relationships, with a special focus on [INAUDIBLE] semantic relationships.
JIAO LI: And we explore computational causal reasoning. Intelligent Q&A based on collaborative reasoning and knowledge computing will be the mainstream knowledge service. Human-computer interaction is conducive to providing users with accurate literature and knowledge retrieval. That's all for this presentation. Many thanks for your attention.
[MUSIC PLAYING]