Name:
Good Data: The Most Important Ingredient for AI
Description:
Good Data: The Most Important Ingredient for AI
Thumbnail URL:
https://cadmoremediastorage.blob.core.windows.net/8000450c-da20-4649-8aef-1ecffde3fbd9/videoscrubberimages/Scrubber_1.jpg
Duration:
T00H31M26S
Embed URL:
https://stream.cadmore.media/player/8000450c-da20-4649-8aef-1ecffde3fbd9
Content URL:
https://cadmoreoriginalmedia.blob.core.windows.net/8000450c-da20-4649-8aef-1ecffde3fbd9/industry_breakout__hum (1080p).mp4?sv=2019-02-02&sr=c&sig=UfdGvHq7lGrAQXOcbq0mahcblLVJJpws%2FBadeWLnKno%3D&st=2025-04-29T19%3A14%3A40Z&se=2025-04-29T21%3A19%3A40Z&sp=r
Upload Date:
2024-12-03T00:00:00.0000000
Transcript:
Language: EN.
Segment: 0.
If you remember one thing today, it's this: Hum will help you drive revenue now while you position yourself for the AI-driven future. And so I want to start with AI. I want to start with AI because it's cool and sexy, but also because it's probably front of mind for you. If you are thinking about any of these sorts of use cases within your organization, then we'll be talking today about things that are upstream from that.
What do you need to do, particularly when you get into these sorts of things down here at the bottom? Conversational search: do you want people to be able to come to your site, ask a question, and get an answer based on the content you've got? If you do, there's a lot of underlying AI technology in there that needs to work. And when you tease it all apart, what I'll be talking about today is data.
Data about people and data about content. All right. So if you want to do any of what I just showed you, you need both of those things. You really need to understand your reader and researcher, you really need to understand your content corpus, and you need to understand how those two things interact. And that's what we'll be talking about today. All right.
So, maybe not to hammer this home too much, but what do data and AI have to do with each other? Data is, obviously, food for the robots; without it, the robots won't work well. It's the new version of the adage of garbage in, garbage out. So data already drives products like ours. Ours is a form of marketing technology that has an AI-enabled tool suite in it.
But you also require clean data to do generative and interpretive AI. You can't do it without that. All right, so this is me setting up the problem, something you probably want to be thinking about. So what is Hum? For the purposes of today, it's really two things.
It is a real-time data engine, a.k.a. a customer data platform, that collects data about people and, in Hum's case, about content. And it is Alchemist. Alchemist is a suite of AI tools, and I will show you some of them today in passing as we do a quick demo. There are others; most of our roadmap is around AI functionality. So if you're interested in that, come see us later.
We have a table next door. We'd be happy to talk about it. There are a lot of use cases for Hum; I put up here some that should be of interest to scholarly publishers. We are not going to have time to go over all of them. I will mention read and publish and Subscribe to Open. I will mention reviewer recruitment. I will mention promoting events. I will mention taxonomy creation, and we'll talk about that in a bit more detail, actually.
And we will talk a fair bit about content intelligence as well. All right. So what is great data? By the way, I hate the fact that I have to come up with the title of a talk nine months in advance. That's awful. And this one was "good data." It's not good data.
It's great data. You want to have great data. So what is great data? Number one, it's not perfect. It will never be perfect, and your systems can't rely on perfection. Most publishers are already collecting the top three on this list.
So, zero-party demographic data, that is to say, what people tell you about themselves: their name, their email, where they're affiliated. Transactional information: did they buy something? Did they subscribe to something? Did they attend something? Geographic: where are they? Maybe firmographic.
That's spotty in a lot of cases, but it's the bottom three that most scholarly publishers aren't currently collecting, and that's where Hum helps. So, behavioral data: what do people do on your site? Not just what pages they land on, but what are they most engaged with? Google Analytics can tell you what pages people have been on, but it doesn't know what those pages are about. Hum does. So Hum is able to see what people read and how engaged they are with it, and put together a full contextual understanding of all the stuff that individual is interested in.
You can also have derived data. So, what career stage is somebody at? Overall engagement score. Engagement score by topic. Customer lifetime value. Et cetera, et cetera. So you can see where you're going with this derived information. And then, when you have that, you can make predictions.
What is somebody likely to do next? Hum is able to help with all of that. That's the people side. And then you have the content side. So, on the content side: the usual metadata, the who, what, where, when. Then protocols-related data, as more and more things are being published as part of the publication process. Extraction of funder information.
Third-party impact: that's data that might come from Dimensions and so on. Topical taxonomies and tagging. And embeddings. Embeddings are the AI-native format. AI is just math, right? And an embedding is the mathematical expression of a piece of work. So a paper can have an embedding. A person can also have an embedding. Hum creates embeddings of content and people.
And if you put both of those in the same embedding space, you can see what content a person should be interested in, because they're close together in the embedding space. The embedding space, by the way, is hard to conceive of. It's 768 dimensions, so don't try to think too hard about it until after lunch.
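To make the proximity idea concrete, here is a minimal Python sketch (an illustration, not Hum's implementation; random vectors stand in for real embeddings): once people and papers share one space, recommending content reduces to ranking by cosine similarity.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)

# Hypothetical 768-dimensional embeddings (the dimensionality mentioned above).
person = rng.normal(size=768)                                    # a reader's embedding
papers = {f"paper-{i}": rng.normal(size=768) for i in range(5)}  # content embeddings
papers["paper-0"] = person + rng.normal(size=768)                # one paper near this reader

# Rank content by proximity to the person in the shared space: the closest
# papers are the ones this reader should be most interested in.
for title in sorted(papers, key=lambda t: cosine_similarity(person, papers[t]), reverse=True):
    print(title, round(cosine_similarity(person, papers[title]), 3))
```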
So when I talk about behavioral data, what am I talking about? I'm talking about all of these kinds of people, and all of those sorts of things. So it's not just did they read something. If you're a society and you have events, if you have learning: everything you do, Hum can capture. And this is a huge amount of the data that you have available to you as a publisher. Next. OK. So the other thing to say about this data is that you probably have it, and it's probably all over the place, right?
The other thing that Hum does is bring it all together. And the way it does that is by connecting into your other systems. This is a very hubristic diagram; it shows Hum at the center of the universe. But this is the data universe: data about people, data about content. All right. So I'm going to work with Tim, and we're going to dive in and look at some live client sites.
So you're looking at live software. We've teed up a few things, and I just want to illustrate some points, give you a sense of what this is like. All right. So I told you we have data about people. The gold standard for marketing is that 360-degree profile: everything you know about somebody, all in one place.
And so here we are. This is Cyber Risk Alliance, a B2B publisher, actually, in the cybersecurity space. Jake Minturn is a colleague of mine; he doesn't mind if I show you his data. So on the left-hand side, you have all the information that's getting pulled in from other CRA platforms: their CRM system, their marketing system, et cetera.
So anything else that gets pulled in will show up here. In addition to that, they have a third-party data arrangement with a company called Bombora. Think of Bombora as B2B Dimensions; it just has a lot of information about companies and so on. All of this information gets pulled in. So Jake's been identified, Bombora pulls this information in, and because it's in Hum, CRA can segment on it, which is to say they can use it as a criterion for including Jake, or not including him, in a group of people.
The other thing you've got here are all the segments that Jake is in. You can see Jake, who works closely with CRA, is in rather a lot of segments, but these are all the segments they've created that he finds himself in. Here's another profile of Jake. This is his profile at Rockefeller University Press, a charter client of ours.
And on the right-hand side here, at the top, you'll see the top 50 topics published by RUP that Jake's engaged with, each with a normalized engagement score. Top 50; there are lots more. So, what is he interested in? That's quantified there. Down below are a series of content recommendations: the five articles that Jake has not yet seen that he ought to be interested in, based on his topical affinities.
And you'll notice, because we do a lot of testing with RUP, there are lots of different recommendations. These aren't necessarily all being surfaced; you can see them here, but the client's readers don't necessarily see them. And you can see they're just from one journal, the Journal of Experimental Medicine: what's the content he should see from that one journal? So all of this is available.
You can, as you'll see in a second, build collections of content. You can build collections of people, which are segments. And then we have an activity log. This is the raw data, the atoms of what Jake has done, every single thing he's done. And if you go in here at the top, on Activity, you'll see all the kinds of things we're capturing: every time he was shown an ad, every time he clicked on an ad, every time he was shown a lead gen, every article.
How far down the article did he read? Did he go to an event? Did he register for an event but not show up? All of that information is available on him and can be used to segment. All right. So that's people. What kinds of people do you have? Well, most of the people are anonymous.
You don't know who they are, but you know an awful lot about them. This is true for every publisher. This is CRA again: you can see they've got 480,000 connected profiles, which is to say profiles where they have behavioral data and identifying data and have been able to merge them together. But 14 million anonymous profiles. So for most people, they don't know who someone is, but they know what they're interested in.
They know what they've done. They have this behavioral data, and leveraging that data is a huge monetization opportunity for publishers. OK, next. OK, here's the International Water Association. And you can see over there they've got an audience of about 2.3 million. And I want to show you an AI feature here. So they've got this group of people, and these people have been reading their content.
And unsurprisingly, it's all about water. Now, I didn't want to show you anything proprietary, so I went to a different organization, the Water Quality Association. There's a seminar they have coming up. OK, Tim's going to cut and paste that seminar description in here, under Profile Search. So we're looking for people
who will be interested in this seminar topic. So now, out of that audience of 2.3 million, 5,933 people are going to be interested. You can tweak the level of interest: you'll see here I'm basically capturing anybody in the top quartile. Actually, not even top quartile people, but people with an engagement score of 75 or higher. If I drop that to 73, the number of people grows to 30,000.
So if I had it in my head that I wanted to put this out to 30,000 people to let them know about it, I could do that. That took me 12 seconds. I bet you can't do that currently. All right. So how is Hum doing this? It's building an embedding of this description and comparing that embedding to all the people in the embedding space,
and saying: who are the people who are interested? It's basically just drawing bigger and bigger circles based on where you pull this line. We used a webinar example, but you could do this for calls for papers. In fact, many of our clients are using it for calls for papers. So, drop in the description of the special issue you're going to do: who should I be talking to?
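A hedged sketch of what that threshold slider could be doing under the hood (the audience construction, the embedding step, and the 0-100 rescaling are all stand-ins, not Hum's code): embed the pasted description, score every profile against it, and keep everyone above the bar.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 768  # the embedding dimensionality mentioned earlier

def unit(v):
    return v / np.linalg.norm(v)

# Stand-in for embedding the pasted seminar description (hypothetical model).
query = unit(rng.normal(size=DIM))

# Hypothetical audience: each profile mixes the query direction (interest)
# with a unit of noise, at a random interest strength.
audience = [unit(rng.uniform(0, 3) * query + unit(rng.normal(size=DIM)))
            for _ in range(10_000)]

def count_matches(threshold: float) -> int:
    """Profiles whose similarity, rescaled from [-1, 1] to 0-100, clears the bar."""
    scores = 50 * (1 + np.array([float(p @ query) for p in audience]))
    return int((scores >= threshold).sum())

# Lowering the threshold draws a bigger circle, as in the demo (75 -> 73).
print(count_matches(75), count_matches(73))
```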
OK, activation. So you've built this segment of people. Now what do you do? Well, you want to reach out to them. So here we have RUP again. At the bottom, they've built a segment of people who are highly engaged with a collection of previously published works.
That segment has about 190,000 people in it. And you can see how they built it: they were looking for people with a high level of engagement in any of those topics. If you go back to the previous page and scroll down, here's the modal that they want to put in front of people. So they're basically just promoting this pre-built content collection.
And you can see it's coming up here as a pop-up modal. There are some anti-fatigue settings they set up here: once every 14 days, between these two dates. So it's a campaign they're running. And if you go to the top and take a look at the reporting, we'll see how this one did, or is doing. I think it's still live.
All right. So of the 190,000 people in this segment, they've served it 2,773 times, with 83 conversions, a conversion rate of 3%. You can see the conversion rate over time. You can see where people were when they actually saw the modal. I wanted to show you one new thing
we've got. This is Karger. Karger is making use of variants. So they have a campaign, but they want to try different versions of it. What if it shows up in a different spot? What if I use a different headline? What if I use a different image? Basically A/B testing, but with a twist.
We're using multi-armed bandits. You can see here that they're running different variants, and you can see the performance of the different variants over time. So, in here... it takes a second to load. Now it's really a live demo, because it's not working.
OK, here we go. And so down here at the bottom, you'll see variant performance. There we go. So, variant performance over time. And again, you can see most of the people touching this are anonymous. I don't know who they are.
Doesn't matter. I know they're engaged in this topic. So, that's variant performance over time. OK, topics. We've talked about people, we've talked about activating campaigns; now, topics. Every paper is about something. In fact, it's about multiple things. Here you can see all the topics in RUP.
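For anyone who hasn't met the term: a multi-armed bandit reallocates traffic toward better-performing variants as evidence accumulates, instead of splitting it evenly for the whole test the way classic A/B testing does. Here is a minimal Thompson-sampling sketch (one common bandit algorithm; the talk doesn't say which one Hum uses, and the conversion rates here are invented):

```python
import random

# Hypothetical true conversion rates per campaign variant (unknown to the bandit).
true_rates = {"variant_a": 0.02, "variant_b": 0.035}

# Track wins/losses per arm; Beta(wins + 1, losses + 1) is the posterior.
stats = {arm: {"wins": 0, "losses": 0} for arm in true_rates}

for _ in range(10_000):
    # Thompson sampling: draw from each arm's posterior and show the best draw.
    arm = max(stats, key=lambda a: random.betavariate(stats[a]["wins"] + 1,
                                                      stats[a]["losses"] + 1))
    converted = random.random() < true_rates[arm]
    stats[arm]["wins" if converted else "losses"] += 1

for arm, s in stats.items():
    shown = s["wins"] + s["losses"]
    print(f"{arm}: shown {shown:>5} times, observed rate {s['wins'] / max(shown, 1):.3f}")
```

The better variant ends up shown far more often, so less traffic is wasted on the loser than in a fixed 50/50 split.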
So, "hum_protein trafficking mechanism": that prefix tells me it's an Alchemist-applied keyword. Alchemist applied this keyword to 1,395 pieces of content, and it's a fairly highly engaged-with topic. So you can see all the topics for RUP. Maybe they already had a taxonomy they used; maybe they already had author-supplied keywords. This is as well as, not instead of: you can take any existing taxonomy and include it. You can even train Alchemist on an existing taxonomy.
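One plausible mechanism for embedding-based auto-tagging, sketched as a toy (an illustration, not Alchemist's actual method; the threshold and the random vectors are invented): apply any taxonomy term whose embedding sits close enough to the article's, using the "hum_" prefix convention mentioned above.

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(2)
# Hypothetical embeddings for two taxonomy terms and one article.
terms = {"protein trafficking": rng.normal(size=768),
         "nuclear transcription": rng.normal(size=768)}
article = terms["protein trafficking"] + 0.5 * rng.normal(size=768)  # about trafficking

# Apply every term whose embedding sits close enough to the article's.
tags = [f"hum_{t}" for t, v in terms.items() if cos(article, v) > 0.5]
print(tags)  # ['hum_protein trafficking']
```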
We can also use Alchemist to give you a taxonomy; more on that in a second. But here are our topics. Each topic has a profile, and if you click here, you will see the topic profile for protein trafficking. There are 1,395 pieces of content. There are 102,000 people who've engaged with it. When was it created? When was it last updated?
And here are all the articles tagged with it, in descending order of engagement. Every article has a profile. So here is that top article's profile. You're seeing all the keywords it's tagged with, all the metadata about the article, an excerpt, and the article's performance over time.
This one is kind of interesting: evergreen. In fact, it's had a little bit of a peak lately. That's probably me preparing for this. All right. So we've got content, we have topics, and we have profiles for each of those. What about querying all this? So here we have Content Explorer. I sort of showed you Audience Explorer before.
Here's Content Explorer. So, back with our friends at RUP: they have 305,538 pieces of content that have been ingested into Hum. And you can come in here and put together a collection. So they've done this: they are looking for content from their Journal of General Physiology, I think, and only journal articles. And there are 733 of those within their corpus. And here they are; you can see the top keywords for that collection. Just like you can save a segment of people, you can save a collection of content. That's important later, when you want to filter.
OK, content collections. The other thing I wanted to say about this: this is CRA again. CRA did a content collection about events; they just wanted to know, what are all the events I've put in, and which ones were the most popular? So here they are.
Instantaneously, they can see the 864 events in their content corpus. It's not just articles. It's not just books. It's book chapters, anything that's a piece of content: a learning module, a blog post, an email. Everything, Hum can read; everything, Hum can classify; everything, Hum can measure.
OK, all right. OK, content discovery. So we have out-of-the-box reports on common things that publishers are interested in. One of them: what's my top content? So here at RUP, for the Journal of Cell Biology, I've selected a particular collection, for the last 365 days.
Here are their top articles in descending order of engagement. This may not match Google Analytics, because Google Analytics measures people who've landed on the page. This is measuring engagement, including scroll depth, whether somebody cited it, saved it, downloaded it, et cetera. All of those things contribute to the engagement score. So I can see that here: this article on image manipulation is the number one article in JCB that meets these criteria down here.
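As a rough illustration of how such a composite engagement score might be assembled (the signals are the ones named in the talk, but the weights and the 0-100 scale are assumptions, not Hum's formula):

```python
# Hypothetical weights for the behavioral signals mentioned above.
WEIGHTS = {
    "scroll_depth": 30,   # fraction of the article scrolled (0.0-1.0)
    "cited": 25,          # did the reader cite it?
    "downloaded": 25,     # did the reader download it?
    "saved": 20,          # did the reader save it?
}

def engagement_score(event: dict) -> float:
    """Composite 0-100 engagement score from weighted behavioral signals."""
    score = WEIGHTS["scroll_depth"] * event.get("scroll_depth", 0.0)
    for signal in ("cited", "downloaded", "saved"):
        if event.get(signal):
            score += WEIGHTS[signal]
    return score

print(engagement_score({"scroll_depth": 0.9, "downloaded": True}))  # 52.0
```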
So, this is articles; down here, topics: what are the top topics? And I can filter this by content collection or by segment. So if I have a segment of early-career researchers who are into lipids, I could put in that segment and say, well, what content are those people interested in, and pull that out. And again, it takes about 10 or 15 seconds to do that. All right. Next we have the content opportunities report.
This is where you have high engagement in a topic but a low content count. So let's see: nuclear transcription in JCB. I think this is... yeah, JCB. So, nuclear transcription: they have 25 articles on that topic, but an extremely high engagement score.
They have a content hole. It's a suggestion to an editor about where you might want to go out and start to commission; if you're doing strategic commissioning, maybe it's a special issue topic or something.
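A minimal sketch of what a content opportunities report could compute (illustrative topic names and numbers): rank topics by engagement per article, so topics where reader demand outstrips the amount of content float to the top.

```python
# Hypothetical per-topic stats: (article_count, total_engagement).
topics = {
    "nuclear transcription": (25, 48_000),
    "protein trafficking":   (1_395, 210_000),
    "cell migration":        (600, 90_000),
}

# A "content hole": lots of engagement relative to how much content exists.
by_opportunity = sorted(topics.items(),
                        key=lambda kv: kv[1][1] / kv[1][0],
                        reverse=True)
for topic, (count, engagement) in by_opportunity:
    print(f"{topic}: {engagement / count:,.0f} engagement per article ({count} articles)")
```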
We have other reports as well, but those are the two exciting ones. And then the final one I wanted to show, which the salespeople will use, helps you understand where people are. Everybody's usually at a university. So here's a report. This is the IWA again, the International Water Association. They use the Subscribe to Open model, so they're interested in the free riders: who uses my content a lot but isn't actually a subscriber? Well, here it is, in descending order, and you can see not only what they're looking at; if you select one institution, which you can do, you can go in and see what articles they're looking at, what journals they're coming from, what type of content.
So if they won't subscribe to the whole thing, maybe they'll subscribe to one journal, or maybe they'll subscribe to a collection of content about a certain topic. But you've got all the information here, essentially instantaneously. OK, so that's it for Hum. Well, I guess I should say: I talked about Subscribe to Open, and read and publish is the same thing. You can use the exact same approach for read and publish.
OK, but wait, there's more. Please tell me somebody else in the room remembers this: Ginsu knives, please. OK. Thank you. Thank you, old people. All right.
So I just want to say a couple of words about taxonomies and tagging. It breaks my heart to see how much money publishers spend getting taxonomies created when Alchemist can do it in a week. So I just want to show you a couple we've done recently. We did one for the American Psychiatric Association and we did one for IEEE. I'm not kidding: about a week. So here's part of the American Psychiatric Association's.
It's a proper taxonomy, so you can see some taxonomy at the front; that's how I know what it is. Psychopharmacology and medication; medication-assisted therapy for addiction. So you can see it's a multi-level (three-level, in this case) proper taxonomy. They have 30,000 terms all in, including the various levels. And every one of their articles, their entire history of everything that's digitally readable, has been tagged with this taxonomy.
That's psychiatry. Here's IEEE. I like this one because it shows a couple of terms right next to each other that have obviously changed.
So every time somebody engages with an article, these can shift. So after what I did this morning, they've shifted a bit. But anyway, you can get a sense of how this works. All right. So what can you use Hum for? Summary slide. You can collect behavioral data that you're probably not collecting currently.
You can use that data to put messages in front of audiences, leveraging your anonymous audience. And there are all those use cases: non-dues revenue, dues revenue, membership, calls for papers, promotion, read and publish. So the guy at Max Planck doesn't know that Rockefeller University Press has a read and publish deal with Max Planck, but you can remind him, because you know where he's coming from.
You can then deploy a deeper, data-driven content strategy. Where are my holes? What I didn't show you was the report that's the opposite of your holes: where do you have tons of content that nobody cares about? So stop. You can get your content ready for AI. If you want to do vector search, if you want to do conversational querying of your content, you're going to need your content as embeddings, and Hum can help with that.
And, to be clear, this is frontier AI. This isn't just software. Sometimes when people talk about AI, they really mean software. This is proper AI. So taxonomies and tagging help you organize your content; embeddings help you query your content; predictive interests help you match people to content.
And not only the people of today, but where is the puck going to be? I'm Canadian; I'm required to use one hockey analogy in every presentation. OK, so, tomorrow: I just want to do a commercial for a panel tomorrow at 2:30, session 2F, with some high-powered names really working at the frontier of AI, including my colleague Dustin Smith. So, just a plug for that.
And this QR code will get you more information specific to publishers, if you're interested. Again, drop by and say hi. I'm now going to shut up and will be happy to take questions. Thank you. No questions?
OK, OK. Here we go. Here we go. You mentioned conversational search. Do you see Hum's place in that as providing the embeddings for, and working with, an external commercial model, or as having a model that also powers it? So, ultimately it will be the latter,
just because what publishers want is someone who can help them do all of this, and no publisher is going to be able to keep themselves at the forefront of generative AI. I mean, generative AI is the one that gets all the attention, all the press. Much of what we talked about today is interpretive AI. So it's encoding models; it's understanding what's been written.
It's not writing stuff for you. But that's what you want for conversational search. So yes, I think Hum will be providing all of that, and we are working to provide things that are specific to scholarly publishers. What I didn't say, but should make clear, is that psychiatry has its own model, IEEE has its own model. Each client has their own model.
Which is important to clients, because, you know: who in here, put up your hand, wants your content out in the wild training AI models? Yeah, I didn't see any hands, so I'm not surprised. This man. Thank you.
What does the startup process typically look like when you're working with a new publisher? I'm imagining you have to index content, get information about subscribers, a JavaScript library; what else is going on? Actually, not so much. So we build integrations with your existing platforms. The base platform would be your publishing platform,
so Atypon or Silverchair or HighWire or whatever you use. That's where a lot of your content sits. If you have your books on a different platform, then your CMS, whatever you use for that. Then your marketing system. And if you're an association or society, you may have an association management software system. We'll pull that data in.
But again, that tends to be demographic data; it's pretty straightforward. So that gets ingested into Hum, cleaned a little bit, unified. The content reading, not counting the ingestion, takes about a week. Ingestion can be the long one. IEEE, I think, was six weeks start to finish. And then: do you have an email delivery component to your campaigns, or is it mostly web traffic?
We do not do emails. We will feed back to your email service provider. So if you use Marketo or Pardot, we'll feed a segment to you, and that segment will be dynamic: a segment of everybody who's not yet signed up for my event but is interested in the topic. And then, as people sign up, they automatically fall off. So it's a dynamic segment.
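A toy sketch of the dynamic-segment idea (the field names and API are hypothetical, not Hum's): the segment is a query re-evaluated on each sync, so registrants drop out automatically.

```python
from dataclasses import dataclass

@dataclass
class Profile:
    email: str
    topic_engagement: dict   # topic -> 0-100 engagement score
    registered_events: set

def dynamic_segment(profiles, topic, event_id, min_score=75):
    """Everyone engaged with `topic` who has NOT yet registered for `event_id`.
    Re-running this after each sync makes the segment dynamic: registrants
    automatically fall off."""
    return [p for p in profiles
            if p.topic_engagement.get(topic, 0) >= min_score
            and event_id not in p.registered_events]

people = [
    Profile("a@example.org", {"water quality": 80}, set()),
    Profile("b@example.org", {"water quality": 90}, {"seminar-42"}),  # already registered
    Profile("c@example.org", {"water quality": 40}, set()),
]
print([p.email for p in dynamic_segment(people, "water quality", "seminar-42")])
# ['a@example.org']
```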
So, I'm thinking, just to add one thing to what John said: what John described, and the picture we showed, was an enterprise deployment of Hum, where Hum is plugged into not just a publishing platform but a whole range of other systems. Many of our clients just start with the publishing platform, and just start with understanding and relating to their anonymous audience, because there's a huge amount of value you can get out of just that anonymous audience.
OK. Joanne Fogelson, ASC. I was wondering: if we can do the publishing platform, can we also do the society platforms, using their CMS, and the search that people are doing on those? Absolutely, yeah. Yeah, and it's worth saying, in relation to those original publishing platform implementations, they're incredibly quick.
We're up and running within a week or two. Yeah, when you think of implementation, you may think of publishing platforms and think 18 or 24 months. That's not what we're talking about here. When you have disparate data in all different platforms, what do you do for data disambiguation? Yeah, excellent question. So that's some of the stuff that we're working on at the moment.
You're disambiguating people and you're disambiguating institutions. There are some third parties that already do the latter, with varying degrees of success. But this is an area of intense interest for us at the moment. Sometimes it's hard and you can't tell, so we tend to be a little conservative: we tend to use deterministic methods and not probabilistic ones at this time.
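To make the deterministic-versus-probabilistic distinction concrete, a toy illustration: a deterministic matcher merges records only on an exact identifier match, while a probabilistic matcher would also score fuzzy signals, such as near-identical names, and merge above a confidence threshold.

```python
def deterministic_match(rec_a: dict, rec_b: dict) -> bool:
    """Merge only when a stable identifier matches exactly - no guessing."""
    for key in ("email", "orcid"):
        if rec_a.get(key) and rec_a.get(key) == rec_b.get(key):
            return True
    return False

a = {"name": "Bill Smith", "email": "bsmith@example.edu"}
b = {"name": "Will Smith", "email": "bsmith@example.edu"}
c = {"name": "Will Smith", "email": "wsmith@example.edu"}

print(deterministic_match(a, b))  # True: same email, safe to merge
print(deterministic_match(a, c))  # False: similar names alone are not enough
```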
But this is something we're going to give clients a bit more control over. B2B publishers, for example, are not nearly as worried as you all are about making mistakes, about whether Bill Smith and Will Smith are the same person. But that might be a big deal for you, where you worry a lot about reviewer integrity: who is really reviewing this paper?
What did that person read? What do they look at? You start to be able to do things like tell when someone has offered to review a paper that's way outside their history with you, or when they've got a different email address than the one you're used to. It gives you some clues.
OK, I think we're technically out of time. Thank you so much for your attention. And come say hi. I should also say, Hum has socks. We're famous; we're famous for our socks. So come by the booth and you can have a pair. They're very warm and fuzzy. Thank you.