Name:
Supercharge Your AI
Description:
Supercharge Your AI
Thumbnail URL:
https://cadmoremediastorage.blob.core.windows.net/7125328c-4722-423b-825d-5d6d4fa29eb0/videoscrubberimages/Scrubber_1.jpg
Duration:
T00H28M47S
Embed URL:
https://stream.cadmore.media/player/7125328c-4722-423b-825d-5d6d4fa29eb0
Content URL:
https://cadmoreoriginalmedia.blob.core.windows.net/7125328c-4722-423b-825d-5d6d4fa29eb0/industry_breakout__access_innovations_2024-05-29 (1080p).mp4?sv=2019-02-02&sr=c&sig=d2pSnE8wbe6JKhdydqryerJU19UYUerwNHBzR93XFFU%3D&st=2025-04-29T19%3A11%3A37Z&se=2025-04-29T21%3A16%3A37Z&sp=r
Upload Date:
2024-12-03T00:00:00.0000000
Transcript:
Language: EN.
Segment: 0.
So what I want to talk about today is supercharging your AI. And there are several ways to do it. So I'm going to, in 20 minutes or so, go through a great many of them. I am Margie Hlava. I'm known for working with taxonomies. I'm currently chairing the committee for the new international standard on taxonomies, thesauri, and other structured controlled vocabularies.
I am chief scientist now at Access Innovations, and we have a booth in the exhibit hall, booth 215. So one of the things that's coming forth is that all these large language models that we have not only mirror but also magnify the problems with data sets. So if you've got bad data, it's going to look even worse, or it's going to give really weird answers. And a lot of people just don't even realize they've got the problems until they start working with large language models.
Even in data you've converted to XML, if you've done that, there are hidden biases and gaps that are a real danger in data sets when you implement a generative AI of some kind. But we already had a whole lot of worries. People's search still isn't working quite the way they wanted. They wanted good web navigation. They have a ton of articles and submissions coming in, and they don't know how to handle them.
And organizations — a lot of them have different vocabularies, and they don't talk to each other because their information is siloed. And they might want to personalize conference sessions, all kinds of things. And now we have large language models and ChatGPT, and it seems like every year there's something new to worry about. But as opposed to knowledge graphs and knowledge maps and ontologies and all kinds of other things that have recently come on the scene, what's different now is power and size.
The power and size these systems are using is incredible. There are these huge server farms, and I'm sure you've all read some predictions about what those server farms will bring you in terms of brownouts in the summer, and how the solar arrays to support them are not big enough to do so, and so on. That's really not our problem, but it is a looming problem. But the real question for these large language models, again, comes down to your data.
It's the core asset for an LLM, and without it you can't do much. But if you enrich it with ontologies or something else that tailors your search — knowledge maps and knowledge graphs — then you can make some good forward progress. And as for that core asset, everybody should try to remember that it's not the shiny new object; it's the data.
The data is the essential component of the strategy, and without it, the rest of the initiative is nothing. So enriching that data with metadata — particularly subject metadata, because it's the topics or the concepts or the subjects that people are really looking for — does add some good material to what you're working with, because those large language models not only mirror but magnify the problems in data sets, and you don't realize you've got those problems until you implement them.
The technology is a good tool. It's a new one. It's exciting. Well, it's not new technology, but it's powerful, and the size is incredible. But it's not the main character; it's the chorus. A lot of people lead with the technology. Most companies of any size have at least five different search software systems, for example — ones that didn't work.
So they try another one, and it didn't work, and they tried another one, and it didn't work, and so on. So now they're ready to try something new, which is the large language model. And my prediction is that won't work either, unless the data is enriched.
So it's the governance and the modeling. But the modeling is not just getting together a data model; it's working on the content model that's important. The data has to be well sourced and managed. You need to think about the ethical and performance issues in that data, and then the quality of the data. And it's somewhat hard work because it's new work. It's not intellectually different once you've done a couple of them, but the first one requires some attention. And getting the executives in the organization to agree on strategy, and on the structure model for the content, is not always straightforward. But you need that, because otherwise there's no strategic value to an artificial intelligence initiative.
And Gartner echoes that. They say that by 2024, companies that use knowledge graphs and semantic approaches for natural language technology projects will have 75% less artificial intelligence technical debt than those who do not. That's a lot. And the way that they're doing that is through existing standards. There are standards for introducing structured vocabularies, schemas, getting the data and organizational structures, and ontologies as a starting point.
And when you extract a list of key terms that need to be modeled, using data mining, entity extraction, and data profiling tools, you're really far ahead. You can add additional handcrafted rules when you need them, because no system is perfect. You can add attributes to the entities. Say you have an author. You want to know it's an author. You might want to know their institution.
You might want to know what they publish on, and other relationships that you can get from business dictionaries and glossaries. And those things already exist. So does the data need to be structured to implement an AI? Well, no. Actually, the AI systems are taking text and images and sounds and all the formats you can think of — museum collections included.
And do you need a new platform? No, it's an add-on. And do you worry about hallucinations in the data? What was it this week — AI telling kids to eat a rock a day? Really. But if you guide it and you tag it, then you do have some protection on your content. So protecting your content is important. You want to be sure that it's not leaked into the great wide world of generative AI, but you do want to organize it so that it's protected and that it has guardrails.
And I'll talk about that in a second. And when does that tagging happen? Very early in the workflow, actually — before you dump your data into the large language model. So what's the process? You need a controlled vocabulary, that is, a structured vocabulary of keywords and taxonomy terms, plus entity identification — entities being people, places, and things, and the taxonomy terms being the concepts, the things that you can't actually touch but that we all share as things we describe. So walking and respiration and breast cancer are all concepts, whereas entities are the exact little critter or the person that talked about it. You apply it to your data automatically if possible, because most people have a lot of data, and it's a painstaking process if you do it manually, like the old abstracting and indexing journals used to do.
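The automatic tagging step she describes — matching text against a controlled vocabulary and its synonyms — can be sketched very simply. This is a minimal illustration, not Access Innovations' actual software, and the two-term vocabulary is hypothetical:

```python
import re

# Hypothetical controlled vocabulary: preferred term -> synonym (used-for) list.
# A production thesaurus would hold thousands of term records.
VOCAB = {
    "neoplasms": ["cancer", "tumor", "malignancy"],
    "respiration": ["breathing"],
}

def autotag(text):
    """Return preferred taxonomy terms whose label or any synonym
    appears in the text (case-insensitive, whole-word match)."""
    tags = set()
    lowered = text.lower()
    for preferred, synonyms in VOCAB.items():
        for label in [preferred] + synonyms:
            if re.search(r"\b" + re.escape(label) + r"\b", lowered):
                tags.add(preferred)
                break  # one hit is enough for this term
    return sorted(tags)

print(autotag("New breathing therapies for patients with a tumor"))
# ['neoplasms', 'respiration']
```

Note that the synonyms map back to the single preferred term, which is exactly the consistency benefit she attributes to machine tagging.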
But it's still possible. And then you can use the power of the LLM. But one of the things you want to try to do is keep your data separate. I remember a few years ago, Google wanted to crawl everybody's sites and get their content, and they said, nope, nope, not my content — keep your mitts off my content. And now everybody wants to be sure that Google crawls their content.
At this point, we're at the same thing with large language models. People do not want their content just dumped wholesale into the large language models because you don't know who's using it and you worry about your subscription models and all that kind of stuff. So there are ways to join into that system without dumping your content in there.
You can keep it separate. The models learn from themselves. And this is not a primer on how to do LLM implementations, although I'd be happy to talk with you about that if you come around. But it does learn from itself. It learns from the written resources, from the text that's put into it. And the longer it's used on a particular data set, the more accurate it becomes.
I don't know how many of you remember the late 90s — there are some faces in here, I'm sure, who were not using Google back then, but I was. And it was really not very good. It was interesting, but as the years went by, it improved and improved, because it got millions and billions — and trillions now — of articles into the system, and the more it had, the better it became.
And the same will be true of the large language models: the more information that comes into them, the better they will be. Except that in learning from itself, a model can, more than other systems, get into logical loops, and you don't want the system to get into logical loops that give you the wrong answers. It learns from the interactions with people. So every time you ask ChatGPT a question, it refines the answer the next time you come back to it with the same question, based partially on how you reacted to the answer it gave you the last time you were interacting with it. I did a fun experiment with that fairly recently. I asked the same question of ChatGPT over a 48-hour period, and every time the answer was a little bit different. And so I started slipping in things that I thought would be interesting in the response, and sure enough, it spat them back at me. In the training, the system studies grammar.
We used to diagram sentences in grade school and in advanced English classes, and the models do that in spades. They look at the order of words and the possible meanings, and if you help with a taxonomy, you can guide the meanings that they use. And then they look at how things fit together through co-occurrence algorithms and make a prediction. And that prediction is the answer to your query.
And it looks like a human response — there's a human response generator. So let's talk for a minute about what the possibilities are for moving your data forward. And one of them is to just dump your data into the vortex, and everything will be fine. No worries. Don't worry. Everything's fine.
But maybe not. Another way to work with it is to bludgeon your data. You gather a large amount of text, you license an AI engine, and you just dump everything in there. And there are a lot of systems you can use, like OpenAI or Bard, but also Claude or DALL-E or Midjourney or any one of the other ones. That'll convert your data, load it to the large servers, and train it. You ask some refining questions, which you can answer, and then clarify the user's intent interactively — which is somewhat challenging to do, but it's where you're testing the system. And then you can convert the text queries to hybrid search queries, and that'll give you the ability to summarize and classify and all those things.
So that you get good answers. But it's definitely the large approach to the large language model. Did I miss one? Another approach is to do taxonomy term priority, where when you collect that data, you gather the documents, clean them up a little for noise and irrelevant junk, extract entities, and then add the taxonomy — map the extracted concepts as well as the entities into the data.
And then you can build a secondary index to the database for retrieval and display, because those LLMs really need a little help. They need to be accurate, and they need to avoid hallucinations. So you enhance it and add a lot of synonyms, and you suggest the structure to the data. And for specialty publications — topical areas that are very narrow in scope, like a lot of journals are — that's what you want.
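The secondary index she mentions is essentially an inverted index from taxonomy terms back to documents. A minimal sketch, with hypothetical document IDs and tags standing in for the output of the tagging step:

```python
from collections import defaultdict

# Hypothetical tagged corpus: in practice these tags come from running
# the autotagger over each document against your taxonomy.
docs = {
    "doc1": ["neoplasms", "immunotherapy"],
    "doc2": ["neoplasms", "radiation therapy"],
    "doc3": ["cardiology"],
}

def build_index(tagged_docs):
    """Invert doc -> tags into tag -> sorted doc IDs for retrieval."""
    index = defaultdict(set)
    for doc_id, tags in tagged_docs.items():
        for tag in tags:
            index[tag].add(doc_id)
    return {tag: sorted(ids) for tag, ids in index.items()}

index = build_index(docs)
print(index["neoplasms"])  # ['doc1', 'doc2']
```

A retrieval layer (or an LLM grounding step) can then look up a concept and fetch exactly the documents tagged with it, rather than relying on the model's co-occurrence statistics alone.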
So then what you're doing with the generative AI is starting with the enriched, tagged content, and then you feed that to the system, and it puts new rules and inference-engine work in place. And then the research results get better and better, and you keep iterating what you're adding to it and giving little tweaks to the system. That sounds pretty easy, but how about a little more detail?
So, taxonomies. They are somewhat time consuming to create, particularly if you have a good-sized synonymy in them. They also have a number of pieces to them. They build a hierarchical structure — broader and narrower, parent-child kinds of relationships. They build equivalence with the synonymy, and the synonymy will vary by field.
So if you're in physics, plasma means one thing. If you're in medicine, plasma means something quite different. And so you need to make sure that those definitions are clear to your system. And they may have a bunch of codes and other things in them that could be useful to your data, depending on what you're dealing with. You can buy those already made, which is a big plus, because instead of having to build them from scratch, you can adopt them.
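The pieces of a term record described above — hierarchy, equivalence, scope — can be modeled as a simple data structure. This is a sketch using field names in the style of the Z39.19 conventions (BT/NT/UF/RT), with a hypothetical record for the medical sense of "plasma":

```python
from dataclasses import dataclass, field

@dataclass
class TermRecord:
    # BT = broader term, NT = narrower term,
    # UF = used-for (synonym), RT = related term.
    preferred_label: str
    broader: list = field(default_factory=list)
    narrower: list = field(default_factory=list)
    used_for: list = field(default_factory=list)
    related: list = field(default_factory=list)
    scope_note: str = ""  # clarifies which sense of the word applies

# Hypothetical record: plasma in the medical domain, not physics.
plasma_med = TermRecord(
    preferred_label="plasma (blood)",
    broader=["blood components"],
    used_for=["blood plasma"],
    scope_note="The liquid component of blood; not ionized gas.",
)
```

The scope note is what keeps the physics and medical senses apart when the vocabulary is applied to a specific domain.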
We're very fortunate in the US that the government has built a great many taxonomies, and fortunately for me, they build them against the standard. There are two standards. One is the international one, and the other is available from the National Information Standards Organization (NISO), which is part of ANSI, the American National Standards Institute. That one is free for download from niso.org. It's Z39.19, but if you just put in "vocabulary standard," you'll find it.
But you can also use those to build out knowledge graphs and ontologies. You need to have a taxonomy before you can build those anyway, so you might as well get started with one that already exists. In Europe, copyright usually remains with the agency, and there's a fee for them. But in the US they make them freely available, and there are quite a few.
Plus there are a number that have been built otherwise. So is it expensive and time consuming? Maybe. But if you can adopt one that already exists, maybe not. So let's talk about that. This is kind of an expanded list of what you need to do when you're implementing a knowledge domain, or a taxonomy, into a generative AI system. But it's basically the same thing: you take the data, and then you iterate it until it comes out the way you want it to.
And in this case, we've added a subject matter expert review, which is important to do, to check that the taxonomy will match your own content. So in the example of plasma, it's a different definition for a different domain, and you want to be sure that continues to be the case with your domain, or else your data gets cattywampus. If you deal with water, for example, you might talk about water leading into a bay that leads into a river that leads into the ocean.
But if you're working on management practices, you might talk about leadership. Well, that's an entirely different kind of lead. And if you're talking about chemistry, lead is lead. So it's obvious that all those different usages could easily lead the system astray until it gets a great deal of data in there.
And when we're talking a great deal of data, we're talking about millions of records — not thousands, or hundreds of thousands, or tens of thousands, which is what most publishing organizations have. So unless you're one of the few, you probably don't have a corpus that's big enough to leverage all the co-occurrence. So that's why you need to put guardrails on it.
And that's why you need the extensive synonymy. You need to be able to disambiguate lead from lead, and Mercury the god, Mercury the car, mercury the element, Mercury the planet. And so there's a lot of feedback loops built in, and that's what prevents the hallucinations, because otherwise those multiple word meanings give nonsensical replies.
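One way to picture that disambiguation guardrail is scoring an ambiguous term's possible senses against the words that co-occur with it — a crude stand-in for the co-occurrence algorithms she describes. The sense cues below are hypothetical:

```python
# Hypothetical sense profiles: each sense of "mercury" gets a small set
# of cue words that tend to co-occur with that meaning.
SENSES = {
    "mercury": {
        "element": {"toxic", "thermometer", "metal", "exposure"},
        "planet": {"orbit", "nasa", "solar", "astronomy"},
        "god": {"roman", "myth", "messenger"},
        "car": {"ford", "sedan", "model"},
    }
}

def disambiguate(term, context):
    """Pick the sense whose cue words overlap most with the context."""
    words = set(context.lower().split())
    scores = {sense: len(words & cues)
              for sense, cues in SENSES[term].items()}
    return max(scores, key=scores.get)

print(disambiguate("mercury", "thermometer readings of toxic metal exposure"))
# 'element'
```

A real thesaurus-backed system would use the vocabulary's scope notes and rules rather than hand-listed cue words, but the principle is the same: context selects the sense.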
And then your query is going to go against the rules of the system — the ones that you've added in. And that's the prompt construction, or prompt engineering, that makes searching possible against the systems. So by understanding that input and understanding the organization of the content, you get a different kind of perspective. One of our customers is the American Society for Clinical Oncology.
They deal with cancer. Another is the American Association for Cancer Research. They also deal with cancer. But they can't use the same taxonomy: one of them is for researchers, and they call the little critters neoplasms; the other is for the clinical setting, and they call it cancer. So you have to organize for your own domain, for your own subject area.
That's why there are so many journals in different topical areas — because they have an emphasis, a specialty. And of course it'll help with query expansion and content control and so on. So, a knowledge domain. I've used that word a few times. It's a specific area or field of knowledge.
It's a conceptual area, and it should be well organized, and we should establish boundaries for the disciplines and the subdisciplines. Another way to say it is that it's a thesaurus with a rule base. They're pre-built. They're already there. They have full term records — all those fields that I described to you — or they could come as a hierarchy alone.
They match the standards, and there are lots and lots of formats available. The output format is really immaterial to the content structure — a lot of people get confused about the format versus the content, and it's the content that's the emphasis. It's the data. So here are some examples of different topical areas that we have for taxonomies that already exist.
There are others that are available as SKOS downloads, and you can see that there are a lot from different government agencies. And some of them — because they're big coding systems, available from the American Medical Association or the Royal Botanic Gardens at Kew or other places, or because they're very fast changing — need to be available with a constant maintenance system behind them.
So in TaxoGene, for example, the one on the bottom there, you may not be aware that there are 19 synonyms on average for gene names. So that means that some genes have a whole lot more of them. So if you're publishing with one gene name and you don't revert to the approved central name through synonymy, then the chances that people will discover and be able to read the material that you have are limited.
Same thing with the Medicinal Plant Names Service. They have an average of 16 synonyms per plant name for medicinal plants. So for the literature, you either need to put in all of those 16 names, or all of those 19 gene variations, in order to get a comprehensive search. And it's much nicer for people if they're already gathered up into a synonymy for them. So as I said, these are already built, and they're licensable.
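Gathering all those name variants into one search is classic synonym-based query expansion. A sketch, with a hypothetical and deliberately abbreviated gene synonym list:

```python
# Hypothetical synonymy table: preferred name -> variant names.
# Real gene names average about 19 synonyms each, per the talk.
SYNONYMY = {
    "TP53": ["p53", "tumor protein p53", "LFS1"],
}

def expand_query(term):
    """Expand a preferred term into an OR query over all its variants,
    so a search retrieves documents using any of the names."""
    variants = [term] + SYNONYMY.get(term, [])
    return " OR ".join(f'"{v}"' for v in variants)

print(expand_query("TP53"))
# '"TP53" OR "p53" OR "tumor protein p53" OR "LFS1"'
```

A term with no recorded synonyms simply passes through unchanged, which is the safe default.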
There are a number of ways to get them. You can either go directly to government agencies, get them, and sign up for updates, or you can license them from us. We also list about 2,500 different licensable taxonomies in TaxoBank — that's taxobank.org — and you might go there and browse. It's got some that are pretty silly, like shipbuilding and belly dancing, but it's got some really big and useful ones that are likely adaptable to people's content.
And of course Access Innovations has some, as do others. So the indexing — we're going to do the tagging for search and retrieval, but also so that it'll support recommendation engines, for example, using tag sets and not vectors, so that the information is absolute. Same thing with getting a semantic fingerprint on a potential peer reviewer.
If you've tagged them with a taxonomy, you can get a good look at them and match them to incoming manuscripts. So autotagging is fast — it's nanoseconds, actually. It's always to the depth of the vocabulary that you're using. No misspellings, so it's consistent.
The machine is consistent. No editorial drift, as opposed to people, who use different tags over time. And if you supply a list to authors, they'll often say, yeah, all of that — I'm all of those — and they pick far more tags than actually have to do with the article that they've submitted. And it's replicable.
It's not a black box. It's easy to see. And when you add a knowledge base to your content, you're using your own data. You don't need to deposit it to an LLM, nor do you need to structure it for XML if you haven't done that yet, although most people have. And if you want to add more stuff, you can.
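The tag-set approach to matching — for recommendation engines or the peer-reviewer semantic fingerprint mentioned a moment ago — can be sketched as simple set overlap (Jaccard similarity) rather than vectors. The reviewer profiles here are invented for illustration:

```python
def jaccard(a, b):
    """Overlap of two tag sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical reviewer fingerprints: taxonomy tags accumulated from
# each reviewer's own tagged publications.
reviewers = {
    "dr_smith": {"neoplasms", "immunotherapy", "clinical trials"},
    "dr_jones": {"cardiology", "hypertension"},
}

def best_reviewer(manuscript_tags):
    """Pick the reviewer whose tag set best overlaps the manuscript's."""
    return max(reviewers,
               key=lambda r: jaccard(reviewers[r], manuscript_tags))

print(best_reviewer({"neoplasms", "clinical trials"}))  # 'dr_smith'
```

Because the comparison is over explicit taxonomy tags, the match is inspectable — you can see exactly which shared concepts drove it, which is the "not a black box" point above.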
The chat systems have a lot of flooding of irrelevant responses if you're using one, but the tagging of the data in advance will apply as a filter to incoming queries. So you can look at those and then monitor the chat logs so you know about new data coming in. Here is an example of an LLM system, and you notice up here at the top, keywords are pretty central to them.
I know I started late, but I've got to finish up here. So this is kind of how the traditional system works: you have the taxonomy and the auto-indexing system, your content is here, it gets the terms, it goes to the documents, and then up to search and the system. It's the same on this side in terms of indexing the client data, but you enrich your own content set, you add the power of the system and all of its queries behind the scenes, and that feeds into the query parser, and that'll feed the chat box or the search interaction for your users.
Pretty powerful. So can taxonomies supercharge your AI? Yeah. They can add to the decision making. They can enhance the understanding. They improve your consistency. They facilitate interoperability and support compliance, if you're in a compliance industry. And yeah, they do it. I wasn't sure whether that would work or not.
So the technology stack at Access Innovations is fairly extensive, and this is an example of what the model looks like. If you go to our website, you'll see these bars, and you can click on them and they'll give a bit more information. The stack looks like this: one side is automatic indexing, with a thesaurus part of it. You don't get to see the mouse here.
And the other is the automatic indexing, which has a number of parts. We also have a content repository system that will automatically launch from a schema. Once the schema is decided — if you have an XML schema — it auto-launches, and there is a search system on top of that as well. So that's it.
We'd be happy to have you come and see us in booth 215 during the show, or talk to any one of us as the meeting goes forward. Do you have any questions? That was kind of a fire hose. Anybody want to use the microphone besides me? Well then, thank you for attending, and see you around the show.