Name:
The Future of Interfaces, part 2
Description:
The Future of Interfaces, part 2
Thumbnail URL:
https://cadmoremediastorage.blob.core.windows.net/3a49358c-7fd6-451f-ab57-1f7ca610d831/videoscrubberimages/Scrubber_3510.jpg
Duration:
T00H58M53S
Embed URL:
https://stream.cadmore.media/player/3a49358c-7fd6-451f-ab57-1f7ca610d831
Content URL:
https://cadmoreoriginalmedia.blob.core.windows.net/3a49358c-7fd6-451f-ab57-1f7ca610d831/Platform Strategies 2018 - The Future of Interfaces Pt 2.mp4?sv=2019-02-02&sr=c&sig=DwIINFv7ntEl70KLpAS8UsDiWOCe%2BbGQpzRueJ1XIjw%3D&st=2024-11-23T15%3A30%3A30Z&se=2024-11-23T17%3A35%3A30Z&sp=r
Upload Date:
2020-11-18T00:00:00.0000000
Transcript:
Language: EN.
SPEAKER: Welcome to part two of the Future of Interfaces. This particular session will be about smart content, enriching content, and the journey that we've been on. The promise of deeply tagged data has certainly become a reality for a lot of programs, almost to the point where it's in the mature part of the S curve, where it's being used across many different applications now. Whereas 5, 10 years ago, it was certainly more in theory.
SPEAKER: We're going to hear today about folks who are using this smart content to enrich their products. With us today speaking is Travis Hicks from the American Society of Clinical Oncology, ASCO, John Magee from Gale Cengage Learning, and Jabin White from JSTOR, part of ITHAKA. And I am going to quickly step out of the way and let them present to you on this topic. So John, you're starting us off.
JOHN MAGEE: There we go. Hi, everybody. I'm John Magee. I'm the director of indexing and vocabulary services for Cengage. My group handles a wide variety of controlled metadata for Cengage. And we're responsible for most of the manual application of that metadata to all of our content. On the Gale side, we have pretty close to about a billion documents in our repository now between periodical databases and historical archives and reference content.
JOHN MAGEE: We like to say we're not quite big data, but we are getting to be mid-sized data there. We've got a lot of information there. We've done a lot of work in my group with the Gale side of the business for a long time. We've done an increasing amount of work with the learning side of the business, working on course metadata like learning objectives, discipline taxonomies, and other things.
JOHN MAGEE: There's a really interesting discussion to be had about the differences between metadata for search and retrieval and metadata for learning products. This isn't that talk. Sorry. But what I am going to talk about-- I thought this would be really helpful for people-- is one of the other services that my group provides is we serve as internal consultants in the business for different needs for metadata for different products.
JOHN MAGEE: And one of the real problems that we encounter is sort of the blank page problem: what to do when you know you need something, but you don't know what to do. And I thought I'd talk a little bit about some of the things that we try to do to figure out what you should do when you're really trying to organize stuff. So it would be nice if people would come to my group and say, hey, you know, John, we've got this interface here.
JOHN MAGEE: We really need a three-step thing. It's got to be about 200 terms. We've got to have everything. Actually, when I prepared this slide, I thought that never happens. And then I thought, oh, it did happen twice. And both times it wasn't actually what that group needed. But it's really difficult to know what sort of metadata, what sort of descriptive metadata, you should put on content.
JOHN MAGEE: And it's really easy to fall into the trap of letting either time or budget or what you already have at hand determine what you're doing. My group went through a process of learning when we started working on learning metadata. One of the things we have control over is the Gale subject thesaurus. Gale research has a fabulous subject thesaurus of about 60,000 preferred terms and another 60,000 nonpreferred terms.
JOHN MAGEE: And it really covers sort of the breadth of all human knowledge, more or less. And yeah, I am proud of it. But when we started looking at learning metadata, it didn't really include some of the pedagogical stuff that people really needed to work through courses. So we've worked a lot on figuring out what to do about it. And our approach is always to take the user and put them at the center.
JOHN MAGEE: There is a lot of different types of controlled metadata out there. We could sit here and have days of conference on just the different types of controlled metadata. Actually, in a couple of months they're going to have that conference here in Washington DC. It's called Taxonomy Bootcamp. It's a really good conference if you get a chance to attend it. There's everything from controlled lists, even simple things like Boolean true/false.
JOHN MAGEE: That, in a way, is controlled metadata. You're saying, hey, here are two options here. Here's where it is. There are taxonomies. There are authority control files. In my group, we have a lot of named entity authority control files. They contain a lot of distinguishing characteristics-- birth and death dates for authors, place of birth, companies' places of incorporation, all sorts of things.
JOHN MAGEE: But the real key is figuring out what somebody needs. So where do we start? The most important thing we start with is how are people going to use this vocabulary? What do they need it for? Who's the user group? What are they going to do? And we go out and we ask questions. So somebody comes to us.
JOHN MAGEE: They say, we need some organization here. We're not sure what we need. We go out, and we ask questions. So I'll talk a little bit about the sorts of questions we ask and what we're trying to get out of that. The number one question is, how are people going to use it? So the first thing is, what do I want to do with this? There are all sorts of things that you can do with controlled metadata, and it really depends.
JOHN MAGEE: You can use it to organize your catalogs. You can use it to do deep data mining. You can use it to link pages directly. You can use it to link pages by implied relationships. You can use it to steer content. We use a lot of controlled metadata and mapping to add additional controlled metadata stuff. We have a philosophy of, index once, use many times in many places.
JOHN MAGEE: And that really means we need to figure out the structures that are needed by our products. And the reason they're needed by our products is they're needed by our users. So here are a couple of things in particular I want to talk about that we work with. Hierarchy. What sort of hierarchy do you need? What sort of granularity?
JOHN MAGEE: What's the user community? A really high-end academic group is going to have one sort of language and one sort of jargon. K-through-6 users, elementary school kids, are going to have a completely different way of talking. And you need to be able to account for that. And then there are a couple more practical things as you get into it.
JOHN MAGEE: What's the ongoing maintenance? Is this something that you're going to do as a single vocabulary, a one-off? Or is this something that you're really going to build over time, you're going to use, you're going to apply in other places? What are the mappings that you need? Who's going to apply the content? Where is it going to go?
JOHN MAGEE: How is the metadata actually going to get onto the content? Everybody likes to think it's all automated indexing now and you can just press a button and apply it. That works well for some stuff. It doesn't work very well for others. And last-- and intentionally last-- budget considerations, because after my nice general talk, we all live in the real world. And we've got to figure out what we can do with the time and money we have.
JOHN MAGEE: So hierarchy-- a couple of things to consider when you're looking at building a controlled vocabulary is, what sort of depth do you need? Do you need something that is used as a drill down in an interface? Do you need something that's navigated by internal people to organize content? Do you have something that's going to roll up content? This is something that people don't think about often.
JOHN MAGEE: But oftentimes people are going to use that controlled vocabulary. They're going to want to apply specific terms. One of the things that we have as a general principle-- we violate it if there's a good reason to-- but it's always to apply the most specific concept to a piece of content because we can always use that hierarchy to navigate down to it, to bring it together. Those are the sorts of questions you want to ask.
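A minimal sketch of that roll-up principle in Python: index each document with its most specific term, then use the hierarchy to gather everything beneath a broader one. The vocabulary and documents here are hypothetical illustrations, not Gale's actual thesaurus:

```python
# Each term maps to its broader term (None = top of the hierarchy).
BROADER = {
    "13-inch beagles": "beagles",
    "15-inch beagles": "beagles",
    "beagles": "hounds",
    "hounds": "dogs",
    "dogs": None,
}

# Documents are indexed with the most specific applicable term only.
DOCS = {
    "doc1": ["13-inch beagles"],
    "doc2": ["15-inch beagles"],
    "doc3": ["hounds"],
}

def narrower_terms(term):
    """Return the term plus all narrower terms beneath it."""
    children = {}
    for child, parent in BROADER.items():
        children.setdefault(parent, []).append(child)
    result, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in result:
            result.add(t)
            stack.extend(children.get(t, []))
    return result

def rollup_search(term):
    """Find every document indexed at or below the given term."""
    wanted = narrower_terms(term)
    return [doc for doc, terms in DOCS.items() if wanted & set(terms)]

# A search on the broad term "hounds" rolls up doc1, doc2, and doc3,
# even though each was indexed only with its most specific concept.
print(rollup_search("hounds"))
```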
JOHN MAGEE: So how are people navigating this? How are they getting down? Are we rolling it up? Granularity. Granularity is another really important concept in the world of thesaurus, in the world of taxonomy. How discrete really are the concepts that we need? How many items are you going to return? As I said, we're closing in on about a billion items.
JOHN MAGEE: That's a lot to retrieve all at once. And if you have a small vocabulary with really broad concepts, you're probably not going to get a useful set of results just by doing that. You might need something more specific. So when we look to think about creating a new term, one of the things we think about is, all right, do we have other content?
JOHN MAGEE: Can that bring like content together? And can it bring it all together in the same place? The example I have up here, which I kind of like, is beagles. There are two breeds of beagles. There are 13-inch beagles. There are 15-inch beagles. For the most part, nobody really cares about the difference between them, other than you've got a slightly smaller beagle.
JOHN MAGEE: At the Westminster Dog Show, this is really important. There are two categories of beagles. So if your audience is the Westminster Dog Show, if your audience is the American Kennel Club, you need to have 13-inch and 15-inch beagles. If your audience is maybe a general college class looking into animals in general, beagles is probably good. You might even just need hounds.
JOHN MAGEE: But you've really got to think about who your audience is and what they need. User syntax and language. So there are a lot of things to consider in terms of how you create those terms. What do you use? Reading level-- I mentioned elementary school students-- we have a product that goes to elementary school students. And we have some special terms that we use for them.
JOHN MAGEE: And they're fairly straightforward. We have a lot of products that go to academic markets. And those require more complicated and more specific terms. So you know, you've got to think about what you've got there. Language and multilingual vocabularies. There's another really fascinating talk to be had on multilingual thesauri and how you get out across other languages, how to structure it efficiently, and how to do it-- again, sorry, this isn't that talk either.
JOHN MAGEE: Form of entry. Form of entry is interesting. And we've actually been engaging with this on the person-name front for a while because we have, in our authority control files, a default form of entry, which is the traditional inverted last-name-first form of entry, and we use that a lot.
JOHN MAGEE: But that is really not how people search for names anymore. People search first name first. They type it in as first name, then last name. And you can use a proximity indicator to bring those together. But it's not really an exact search. And sometimes when you're going across a lot of data, you need to have that exact form of the term.
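A minimal sketch of deriving a direct-order form from the traditional inverted authority entry, with a Lucene-style proximity query as the fuzzier fallback John describes; the record format is a hypothetical illustration:

```python
def name_forms(inverted):
    """Derive a direct-order form from the traditional inverted entry."""
    last, _, rest = inverted.partition(", ")
    direct = f"{rest} {last}".strip()
    return {"inverted": inverted, "direct": direct}

forms = name_forms("Washington, George")
# {'inverted': 'Washington, George', 'direct': 'George Washington'}

# Without a controlled direct form, you fall back to a proximity query
# in Lucene/Solr syntax, e.g. "george washington"~2, which is not an
# exact match the way a controlled form of entry is.
query = f'"{forms["direct"].lower()}"~2'
print(forms, query)
```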
JOHN MAGEE: And another thing I just wanted to mention is nested context or hierarchy. You see this a lot. I kind of dislike this. But you know, you make compromises when you've got to do things that are helpful to the people who are using it. Sometimes you'll see terms that actually nest the hierarchy in the term.
JOHN MAGEE: So it won't be a term called "beagles." It'll be "mammals/dogs/beagles." And that's OK if you need to do that. But be aware that that has implications in how somebody is seeing those terms and what they're looking at. Ongoing maintenance. There's a tendency that when people build vocabularies, when people work on products, they build a new product.
JOHN MAGEE: It's something new. It's something exciting. And you say, all right, we've built it. We're done. Onto the next project. Hopefully whatever you build is going to be sustained over time. And so this is really something to consider. Is this something that we're doing in the short run, or not?
JOHN MAGEE: As a technical consideration, one of the things that happens in ongoing maintenance is that sometimes a vocabulary, especially when you're trying to get something out the door, will get encoded in the code of a system. People will use a lookup table. People will use some way of putting this in there. And you really want to avoid that if you can. You want to keep your vocabulary out someplace where you can really maintain it, update it, and change it without having to do a full code release of a product, for instance.
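A minimal sketch of the pattern John recommends: the vocabulary lives in an external, maintained file and is loaded at startup, so updating terms never requires a code release. The file name and JSON format are assumptions for illustration:

```python
import json

def load_vocabulary(path="vocabulary.json"):
    """Read the controlled vocabulary from a maintained external file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Anti-pattern this replaces: a lookup table hardcoded in the source,
#   TERMS = {"1": "beagles", "2": "hounds"}  # requires a release to change
vocab = load_vocabulary()
print(f"Loaded {len(vocab)} terms without touching application code.")
```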
JOHN MAGEE: Application of terms to content. I love creating vocabularies. I think about that a lot. But I also have another hat, which is our indexing manager. So I'm vocabulary director, and I'm indexing director. And I often have to think about, how are these terms getting on the content? That's something that's sometimes underrated, too.
JOHN MAGEE: You want to think of your end users first. But you also need to consider all the users who are going to use this stuff. And the people who are applying these terms to content need to be able to access it. There's a really good talk to be had about indexing, and a lot of subtleties there. Again, not this time.
JOHN MAGEE: Required mappings to other vocabularies. So it's great to build a vocabulary. It's great to build, be it an ontology or a thesaurus or what have you. But ultimately you want to be able to get it out in the world. And this is a place that I think is a really interesting place of growth in the world of vocabulary, the world of controlled metadata.
JOHN MAGEE: And it's outside standards. It's open web standards. A couple of them are really interesting these days to us. Wikidata, which is basically the vocabulary that underlies Wikipedia, is open. It's out there. You can link into that. You can link out to that. It provides a lot of disambiguation for people who are searching across concepts on the open web.
JOHN MAGEE: And if you can map that into whatever vocabulary you're using, you can provide some pretty direct connections. VIAF stands for the Virtual International Authority File. It's a project that was started by OCLC, which is a large library non-profit. And they started it up with the Library of Congress and the Bibliothèque nationale de France and, I believe, the German National Library a few years ago. It's a large intermapped connection of vocabularies across the world.
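A minimal sketch of what such outbound mappings can look like, in the spirit of skos:exactMatch; the internal identifiers, Wikidata QID, and VIAF number below are hypothetical placeholders, not real records:

```python
# Internal term IDs mapped to open-web identifiers (all hypothetical).
MAPPINGS = {
    "gale:term/beagles": {
        "skos": "exactMatch",
        "wikidata": "http://www.wikidata.org/entity/Q0000001",  # placeholder QID
    },
    "gale:person/washington-george": {
        "skos": "exactMatch",
        "viaf": "http://viaf.org/viaf/00000000",  # placeholder VIAF cluster
    },
}

def outbound_links(term_id):
    """Return the open-web URIs a term links out to, if any."""
    return {k: v for k, v in MAPPINGS.get(term_id, {}).items() if k != "skos"}

print(outbound_links("gale:term/beagles"))
```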
JOHN MAGEE: And it's also a really interesting place to be able to link out to libraries worldwide and get them to your concepts. So think about that. I am running out of time. So I've talked about everything else. I haven't talked about budget and time considerations, other than running out of time. But the reason I put that last is, it's the real world.
JOHN MAGEE: We all live in it. You're going to have time crunches. You need to build something. You're only going to have so much budget to do stuff. But think about all that other stuff first. Think about what your users need. And then you can start to think about, OK, now where do I cut a corner? What do I have to do to make this practical?
JOHN MAGEE: What do I have to do to get this out on time? And there are questions you can ask around that. You know, how much stuff do we have? How much time do we need? But that's all I'm trying to say, is you start with a blank page sometimes. What's your deciding factor? Make sure that it's, who's going to use this vocabulary? Who's going to use this controlled metadata?
JOHN MAGEE: And I think that's my last slide. So thanks, everybody. Onto the next. [APPLAUSE]
SPEAKER: Take it away.
JABIN WHITE: Good morning, everyone. I am Jabin White from JSTOR. And I'll start my clock. All right. All right. Let's set the agenda here. I'm going to give a little background of who the heck I am and what my problem is. And I've got a problem.
JABIN WHITE: I've got several. But this is the one I'm talking about today-- a little bit about search versus browse, how we attempted to solve some of those problems with what we call the JSTOR Thesaurus, and then some conclusions. So my company is ITHAKA. It's a not-for-profit. Well, you can read as well as I can.
JABIN WHITE: But it's basically made up of four entities. The ones I work for are JSTOR and Portico. I run the content management groups that feed content into JSTOR and into Portico, as well as the systems that the production folks use to shovel all that content. We also have Artstor and ITHAKA S+R. So my background-- I was an STM publishing guy for years. And I bring that up just to sort of give some context to what I'm going to talk about later.
JABIN WHITE: Most of that time was spent in digital publishing. I worked at Elsevier, Wolters Kluwer, and even Silverchair for a little bit. Since 2010, I have been vice president of content management at ITHAKA JSTOR, my current role. So I'm going to tell you a story about JSTOR. Everybody's probably heard of JSTOR, yeah? Show of hands. A little exercise this morning.
JABIN WHITE: All right, good. JSTOR launched in 1995. It was the brainchild of the former president of Princeton University. And the idea, 23 years ago, was pretty revolutionary. You know, libraries had print journals on their shelves taking up a lot of space. And it was expensive to maintain and limited access. And JSTOR-- again, this is 2018, so this doesn't seem very revolutionary-- but JSTOR scanned these journals, made an actual digital copy of that journal.
JABIN WHITE: And it did a couple of things. Number one, it gave access to a lot of folks that didn't have access to the print title in the first place. And it allowed the libraries to do what we call deaccession, which is basically throw out a lot of print and make room for a Starbucks or something else in their library. And I can brag about this because I wasn't there at the time, but it was a pretty neat idea.
JABIN WHITE: And it took off. The usage of the content, a lot of it was driven by the exclusivity. You couldn't have access to these journals anywhere else but JSTOR. In the physical world, you had to be at a library that had that particular title. And it was really revolutionary. And the usage reflected that.
JABIN WHITE: We're a not-for-profit, so we don't call it sales. But our director of outreach has this joke. He says, a sale back in the day was walking over to the fax machine and ripping off the order. And that was his sales cycle. So I'm not saying there wasn't any competition, but it was a nice business. So how people searched was appropriate for the times.
JABIN WHITE: They started at jstor.org, and they went on about their day. But times were a-changing. So back in my day, back in a lot of people's day, search was pretty vanilla. We would do a full-text search. And then you had this great option of an advanced search. And that was kind of it. The level of effort and the level of what you expected from a search was pretty universal around the world.
JABIN WHITE: And let's be honest, we were spoiled. So this is what it looks like for us. I mean, this could be 10 years ago. You know, the basic search box. You could do some advanced stuff. I only want to search for these authors. Or you can even search by disciplines. And don't even get me started on this. We use disciplines at the journal title level.
JABIN WHITE: So you could have a discipline about this journal with 15 articles that are on different topics. And that's a mess. But this was search back in the day, and nobody complained. Well, today, search expectations are completely different. So with minimal input, people expect absolutely precise results. And our dear moderator, Mr. Jake Zarniger has a great joke, if I can steal it.
JABIN WHITE: May I? Oh, you know what joke, pal. [LAUGHTER] An oncologist would never think of walking up to a librarian, a physical person, and say, cancer! But you give him a search box-- right? And so that is the expectation.
JABIN WHITE: People expect everything from very little input. Jake, your joke kills, man. That's great. [LAUGHTER] All right.
JOHN MAGEE: Oh, I intend to steal it.
JABIN WHITE: Oh, yeah, exactly. I just ruined my own punch line, but, you know, thanks Obama. Well, that joke doesn't really work anymore. But it's not Obama's fault. It's Google. And I love Google. Not to fanboy out, but Google is great because they have raised the expectation of search. And there's really no way around it. I mean, this is expectation management. People expect all the results from one search.
JABIN WHITE: Now, we at JSTOR could choose to say, well, they have these advantages. It doesn't matter. The expectation is out there that I'm going to enter one word, and I'm going to expect to see everything you have related to that word, things related to those relations and everything. So Google has some advantages, just in terms of volume. And I want to give you a couple of statistics that absolutely blow me away.
JABIN WHITE: JSTOR does 100 million searches per year. I'm pretty proud of that number. That's a lot of searches and a lot of academic, journal, book, and research report content-- 100 million per year. Google does 40,000 per second. That translates to 3.5 billion per day and 1.2 trillion per year. So I'm no math major, but Google does about 35 times as many searches per day as we do in a year.
JABIN WHITE: That's pretty humbling. And we do a lot of searches. So we're not going to compete-- we're not really competing-- but we're not going to compete with those expectations, because Google sort of churns all of those results and where the users went to after the results back into their search algorithms, and they make it better, which is fantastic.
JABIN WHITE: And we all take advantage of that. But we're an archive of academic, scholarly, and very focused on humanities and social sciences content. So we don't have some of those same advantages. But the expectation is still there. So what do we do about it? Well, the other thing we have to consider is that searching humanities and social sciences is a little bit different than STM.
JABIN WHITE: I'm speaking very generally now. But a lot of times in STM, there is a right answer. You're searching for it, and there are ways of getting to it. A lot of times there's not. In humanities and social sciences, it's more mungy. It's, well, it could be this, or it could be that, and I want to study the effects. It's a little bit of a different user experience. I like to refer to it as search, then browse.
JABIN WHITE: I want to search, and then when I see my results, I want to look and see what's around that. And we see this in user behavior. But it's a sort of search and browse thing. And our browse functionality-- and I think this is probably the same for a lot of people-- is just god awful. Most people's browse is really bad. And the thing is, we don't invest a lot in our browse function because nobody uses it.
JABIN WHITE: But nobody uses it because we don't invest in our browse. So there's the snake eating its own head thing. So we needed a solution that married these two things together. We called it the Goldilocks solution. This is where semantic metadata came in. We embraced the concept of semantic indexing. And we wanted to use that in our search and discovery. But there was no such thing as a humanities and social sciences based thesaurus.
JABIN WHITE: It wasn't that there was no existence of one. There were too many. So for instance, there were four history taxonomies we could have used. And they all were very well-intentioned and all did very similar things. But they did it differently. And guess who was right? All four of them.
JABIN WHITE: Just ask them. So we had to sort of referee that and then munge these all together to customize it around the JSTOR content set. So we did that. We took about 20 source vocabularies, munged them all together, refereed the disagreements, made sure there was nothing in philosophy pointing to history that was wrong-- did all that.
JABIN WHITE: And then we used basically TDM, text and data mining, spidering of our content to match words with the appropriate JSTOR articles and book chapters and sort of linked those together. We loaded that into our Solr index. And then we also came up with this concept of topics associated with each content set. I'm going to show you those in a second. But I first want to tell you there's also cultural changes to this stuff.
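Before the cultural side, a minimal sketch of that matching-and-loading step: scan article text for thesaurus terms, attach the hits as topics, and post the enriched record to Solr. The thesaurus, article, core name, and Solr URL are hypothetical stand-ins, not JSTOR's actual pipeline:

```python
import re
import requests

THESAURUS = ["democracy", "security studies", "sustainability"]

def match_topics(text):
    """Return thesaurus terms that appear in the text as whole words."""
    found = []
    for term in THESAURUS:
        if re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE):
            found.append(term)
    return found

article = {"id": "article-123", "title": "On Democracy",
           "body": "Democracy and sustainability in modern states..."}
article["topics"] = match_topics(article["body"])

# Load the enriched document into a Solr index (core name assumed).
requests.post(
    "http://localhost:8983/solr/jstor/update?commit=true",
    json=[article],
)
```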
JABIN WHITE: We never really had to worry about this stuff before because we just did full-text search. The platform engineers, those guys and gals, took care of that stuff. We didn't have to worry about this. Our metadata was very different because in the early days of JSTOR, like I said, we had to basically prove to the librarians that we had every issue of every journal of every title that we said, because if they're going to throw away print, they want proof.
JABIN WHITE: And we wanted to provide that proof. Well, we're now talking about changing that focus to a more semantic metadata approach. And we're talking about communicating those changes to the product and platform teams that are doing stuff with it-- very different from a cultural perspective and probably not to be underestimated. So I talked about this topic.
JABIN WHITE: So you do a search on JSTOR now. And we like to think, because of the Solr index, the search is improved. But on top of that is this thing at the bottom where these topics are. You searched on this, we think you might be interested in these topics, which are right around that semantic tree, and they're all related. And if you click on those topics, you get to-- well, I'll show you that in a second.
JABIN WHITE: To maintain that, we had to change the jobs of a couple of our metadata librarians. I referenced the different metadata before. We actually have two metadata librarians who, five, six years ago, didn't know a thing about semantic indexing that now are taxonomists in name and in function. They now recruit subject-matter experts, external folks who are expert in anthropology or history or smaller, minute topics.
JABIN WHITE: And we have really built that muscle of being able to recruit and speak to, everyone calls them SMEs-- I kind of hate that term-- but subject-matter experts in all these different fields. One of the things that is really nice is we've been able to do custom branches of the thesaurus for, we call them topic theme collections. So we just launched, about a year and a half ago, a thing on sustainability.
JABIN WHITE: And it's a very topic-focused site with books and journals and research reports. And every topic you see at the bottom is a new thesaurus entry. And it's a specific branch that we built out, knowing that we were going to be publishing in this space. The problem this solves, or the thing that this addresses, is that the JSTOR Thesaurus, and I'll be the first to admit it, is very broad, but it's not very deep.
JABIN WHITE: How can it be? We've got to be able to have it manageable. There are 20 top-level disciplines within JSTOR, everything from business to religion to history to economics. And it's very difficult to have a very deep and manageable thesaurus around that. So when building out branches for these topic collections, we gave ourselves the ability to go deeper.
JABIN WHITE: So sustainability was the first one. Security studies was the second. There's a knock-on benefit to this, which is we are recruiting SMEs in these areas, who then go out and talk to the people at their institution about how wonderful this new collection from JSTOR is going to be. And it helps drive goodwill. So this is our results in security studies. You see the topics.
JABIN WHITE: And again, if you're navigating after a search-- I talked about search then browse-- this supports that. So I'm going to search for maybe a broad topic, maybe a specific one. But I'm going to get to the relevant articles that I hit on. And then right below, I'm going to be shown all these topics that we hope, at least, are related to the thing that you're searching for.
JABIN WHITE: You can then go out to a topic description page of that thing that you just clicked on. The very top part of the page is a bit controversial. We pulled in-- and someone mentioned earlier-- Wikidata. This is the description of democracy pulled in from Wikidata. We actually took some grief for this. Somebody called it-- and I love this phrase-- they said, "JSTOR, you're a Porsche, and you're linking out to Yugo," which, yeah, that kind of hurt.
JABIN WHITE: But our response was, well, shouldn't we make the Yugo better, and the world would be a better place? So we actually gave access to JSTOR to some of the Wiki librarians, and they are improving their topics. And in theory, hopefully they will improve their reputation with some of the people who were giving us grief for linking to those evil Wikipedia people. After you do a search, and if you've gotten to a topic, when you come back out of downloading a PDF-- and by the way, I should mention, we are a PDF factory.
JABIN WHITE: And we're proud of that. And getting out of the way for the end user to be able to get to that PDF is not necessarily a bad thing. I'm going to speed up because I'm out of time. I mentioned engagement with users. We also allow them to give us feedback on the thesaurus. If they get to an article and they're linking to a piece that is foreign to them or is just plain wrong, they can let us know, and we'll take a look at it.
JABIN WHITE: It's great engagement. I'm out of time to run text analyzer. But one of the cool things we've done is you're able to upload content. Any site you want to link to, a paper you've done, an article, you can upload it to JSTOR. Even if it's an image, we will OCR the text. We will run a search for you against the thesaurus and bring back relevant search results.
JABIN WHITE: The best use case I can think of is if you've got something on your subject matter and you're not sure about what other things you can write about. This is a student, obviously. So they could use a syllabus and say, these are the terms. It uploads it. It finds matches in the thesaurus. And it does a search across JSTOR.
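A minimal sketch of that Text Analyzer-style flow: OCR the upload if it's an image, match it against the thesaurus, and treat the matched terms as the would-be query. The use of pytesseract, the thesaurus, and the file name are assumptions for illustration, not the actual JSTOR implementation:

```python
import pytesseract
from PIL import Image

THESAURUS = {"democracy", "oncology", "sustainability"}

def extract_text(path):
    """OCR an uploaded image into plain text."""
    return pytesseract.image_to_string(Image.open(path))

def analyze(path):
    """Match uploaded content against the thesaurus terms."""
    words = {w.strip(".,;").lower() for w in extract_text(path).split()}
    matched = sorted(THESAURUS & words)
    # A real system would now run these terms against the index;
    # here we just return them as the search to be executed.
    return matched

print(analyze("syllabus.png"))  # hypothetical uploaded file
```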
JABIN WHITE: So I am out of time, and thank you so much. [APPLAUSE] [AUDIO OUT]
TRAVIS HICKS: Good morning, everybody. I just wanted to say that all of these presentations were made in isolation. But I hope, as you'll see with mine, there is a real interconnectivity between them all. So John started talking about, really, the broad overview. Jabin got more into some of the weeds. And I really want to talk about a very discrete functionality that can be powered, essentially, at least we believe so, by metadata moving forward.
TRAVIS HICKS: My name's Travis Hicks. I am the associate director of digital content strategy at ASCO. For those of you who do not know, ASCO is a member organization for oncologists. We have about 46,000 members currently. We have a real mission that is driving education of the field. Within that scope, we produce reams of content. But primarily, they fall into the categories of five journals and about 7,500 abstracts produced a year from between 7 and 9 meetings.
TRAVIS HICKS: In addition, we have roughly 4,000 presentations that are given at various meetings that we also video capture, transcribe, and provide to our member base as well as attendees. So underlying all of this is the idea that we needed a metadata base with which to standardize and categorize all of these disparate pieces of content. With that, we started creating several thesauri in order to provide that categorization.
TRAVIS HICKS: Within that, we took a look at what our needs were and developed, essentially, a three-pronged approach, three distinct vocabularies-- a subject/topical vocabulary, a drugs vocabulary, and a genes vocabulary. The subject/topical hierarchy is a poly-hierarchical structure, while the others are flat lists. This gives us the ability to understand, as was discussed earlier, that you can roll more granular topics up to a higher level in the topical categorization.
TRAVIS HICKS: But within drugs and genes, we're really talking about distinct identifiers that we want to be able to apply. For a visual perspective, I just wanted to give a snapshot here of what our topic thesaurus looks like. The largest area is, unsurprisingly, cancers, which really becomes cancer types, followed by medicine and treatment.
TRAVIS HICKS: So as we dug into this-- and when any of you get into a metadata project, I really hope that you clearly define what the goals are of that project. That enables you to focus as you move forward. And it also enables you to discuss somewhat intelligently with the folks who might not be familiar with the concept of taxonomy, thesaurus, or metadata in general.
TRAVIS HICKS: So we essentially came up with three goals that we wanted to be able to use our three thesauri for. Number one was to allow for personalization and automated curation of all of our various sites and products and services and potentially advertising. And this is based on those standardized content characteristics. Secondly, it's the ability to actually inventory our content. How much content do we have in particular areas?
TRAVIS HICKS: Prior to even starting this, we had no idea. We knew discretely, looking at individual content points, we could identify to an extent. But we still weren't using that standardized vocabulary which would allow us to fully understand the depth and breadth of our content. And finally, as Jabin was discussing, it's the ability to really enhance search-- to produce more targeted and more accurate results.
TRAVIS HICKS: And also, one of the things that we do-- and Jabin discussed it; and many people do-- is we then use faceted filters based on our taxonomy to enable that browsing component after an initial search is executed. So what I really want to talk about here is the actionable potential of using taxonomy in combination with personalization. Jabin didn't get to really talk in depth about his text analyzer that JSTOR Labs has developed.
TRAVIS HICKS: But it's a really, really interesting model. And what it does is it looks at relevancy amongst different pieces of content. I think that's something that most of us are probably familiar with-- the concept of like-minded content. What I'm going to propose here is taking relevancy and applying it not only to content, but using those same standardized characteristics to apply to an individual-- essentially breaking an individual down based on the data that we have about them, whether that's hard data or activity data, and creating what I will term a taxonomical profile.
TRAVIS HICKS: So personalization-- I think it's generally accepted that it's delivering the right content, the what, to the right users, the who, at the right time, which is the when. Now, the when and the who and the what, they're driven differently. Right now we're going to really talk about how to get the what to the who.
TRAVIS HICKS: So in essence, when I talk about a taxonomical profile, what I mean is really taking relevant data-- now, this could be anything you have in your database-- matching it with additional information, including browser data, and then basically whittling that down to a subset of terms that you can use to describe a person. So ultimately, the same way that you define a piece of content, what that content is about, you can define a person based on what it is that they want to look for, or what you believe they're going to look for, based on those data characteristics that you can analyze.
TRAVIS HICKS: In theory, the idea really depends upon being able to take all the disparate pieces of data and process them into a single database, from which, through multiple processes and models, you can then end up with a very distinct taxonomical profile of an individual. That profile can then be aligned with your content, which should be categorized in very much the same way, which really leads into that third point: matching relevant content, services, and opportunities based on those profiles.
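A minimal sketch of that alignment step: a profile expressed as weighted vocabulary terms, content tagged with the same vocabulary, and a simple overlap score to rank matches. All terms and weights here are hypothetical:

```python
profile = {"breast cancer": 0.9, "immunotherapy": 0.7, "genomics": 0.3}

content = [
    {"id": "session-1", "terms": {"breast cancer", "immunotherapy"}},
    {"id": "session-2", "terms": {"genomics"}},
    {"id": "session-3", "terms": {"health policy"}},
]

def score(item):
    """Sum the profile weights of the terms shared with the content."""
    return sum(profile.get(t, 0.0) for t in item["terms"])

# Rank content by overlap with the person's profile; session-1 wins.
for item in sorted(content, key=score, reverse=True):
    print(item["id"], round(score(item), 2))
```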
TRAVIS HICKS: I have person A here. This aligns with content B over here. So just to give some use cases. Now, these are things that have been kicked around at ASCO. By no means are they absolute sureties, but just to kind of get your mind flowing on, when we talk about personalization, how this can potentially be delivered.
TRAVIS HICKS: And we don't want to limit the platforms at all. So in this case, I'm going to give an example both from an email perspective on delivery and then also on the delivery of digital content on a traditional web-based platform. So in this case, perhaps ASCO wants to personalize emails by adding dynamic content blocks at the top of emails. You all familiar with dynamic content blocks?
TRAVIS HICKS: OK, good. So that will all be based on the taxonomical profile. The email creators would index the content blocks themselves against our taxonomy. And then, in the development of the email, they would build a query that looks for these characteristics of an individual. So a traditional query in an email platform says, I'm looking for data point ABC, exclude DEF.
TRAVIS HICKS: We could use that because we've now defined an individual based on these taxonomical data points. Second, we might want to reduce the amount of information that our website visitors have to process when landing on a home page. I'm sure, like most organizations, we have a home page that tries to appeal to the broadest audience.
TRAVIS HICKS: So we have everything we think is important to potentially anyone who arrives on the site. The reality is that not all of that information is relevant or important to everybody. So by understanding what an interest level is for somebody-- and this is not just limited to taxonomy from a subject/topical area, but also, if we know who they are as a person, where they work, what their profile pieces are-- we can start to limit the view that they see and only present content that is relevant for them, therefore, in theory, boosting the relevance of our site in their eyes.
TRAVIS HICKS: So when I talk about profile data-- so really, when we start talking about how we're going to take this large breadth of data, squeeze it down into a taxonomical profile, we're looking at two distinct sources of data. One is profile data. And two is activity data. I'm going to define both of those here, starting with profile data.
TRAVIS HICKS: So essentially, profile data are self-reported demographic data. In the case of ASCO, we have a plethora of data that would fall under this umbrella. But for the purposes of really developing a personalized content approach, these are the data points that I think are probably the most relevant. It could be a specialty or interest, degree, the individual's primary professional role, the location at which they work, whether or not they are involved in research and at what level, and then any of their board certifications.
TRAVIS HICKS: Now, the best part about profile data is their reliability. They're self-reported. They're concrete. They don't involve interpretation. Activities data, on the other hand, are not concrete and do involve interpretation. So when we think about this, activities data are really built from how an individual engages with us as an organization.
TRAVIS HICKS: So that can be products purchased. It can be their learning profile and the courses they take on an LMS. It could be the credits claimed through continuing medical education, via those courses or via attendance of our meetings. And it can also simply be the attendance of our meetings itself. So again, examples of those data points: they could be a listing of sessions completed for credit.
TRAVIS HICKS: They could be courses completed for credit. It could be the meetings attended. It could be those products purchased. Also, another piece of this is, it could be anything authored by an individual, whether that be an abstract, manuscript, or a presentation given. All of those things have an underlying taxonomical structure that can be indexed that we can then use to build a profile from.
TRAVIS HICKS: So again, as I mentioned at the top of the activities piece, the implications are somewhat difficult to fully measure. They're not as concrete as profile data. And usually there is more inference involved in what the composite picture is, because they are not self-reported and we are just gathering that data. So again, we start talking about this database.
TRAVIS HICKS: What we want to do is to be able to store attributable demographic and activity data over a certain time period. Time period is very important because, in particular on activities, they can change. We have general oncologists, for instance, who may be very interested in a particular subset of a specialty because that's the people that they're seeing come into their office at the given point.
TRAVIS HICKS: That could change a year later. So we want to make sure that we do have a time limitation and the ability to decay some of this data that we're relying on over time, replacing it with new data. Ultimately we want to take all of these individual data points that we talked about, process those down into individual term subsets, and then, finally, take all of those term subsets stored in this database and process them into a single profile that would be accessible for this individual.
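A minimal sketch of that decay idea, with an assumed one-year half-life as the tunable parameter (not an ASCO figure):

```python
import math

def decayed_weight(base_weight, age_days, half_life_days=365):
    """Halve a term's weight for every half-life that has elapsed."""
    return base_weight * math.exp(-math.log(2) * age_days / half_life_days)

# A session attended two years ago counts a quarter as much as one
# attended today, given a one-year half-life.
print(decayed_weight(1.0, 0))    # 1.0
print(decayed_weight(1.0, 730))  # ~0.25
```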
TRAVIS HICKS: So this is kind of what that might look like. As you can see, we've got the data sources at the top. Go into the personalized database. All of that information is processed to create a taxonomical profile. Then, once you have the taxonomical profile, there essentially is not a limitation on the applications it can be applied to.
TRAVIS HICKS: You can use it any way. Here I'm just showing, similarly, that you can push it through Solr using Solr queries, then processing it out to a variety of websites. Similarly, as I mentioned with the email example, you can use those same characteristics in the development and distribution of emails. I'm going to quickly get into the nitty gritty on how this might operate in terms of processing so you can see.
TRAVIS HICKS: So I'm going to move through the concept of a discrete activity into a subset activity and then, finally, how you process the entire taxonomical profile. In this instance, all of these surrounding pieces here are essentially sessions that an individual has put in his or her schedule in attending the ASCO annual meeting. Each of these sessions has an underlying taxonomical structure.
TRAVIS HICKS: So if we take all of these and put them into, basically, a large bucket, you're going to have a long tail on that. In essence, you cut that tail: you're going to have these pieces up here that are very strong and ones out here that get cut off. Essentially, we work with an algorithm to present that in a way that gets us a subset of, let's say, maybe 12 terms that comprise that individual activity.
TRAVIS HICKS: We then can take that activity and move it into a larger subset, so a meeting activity profile. So as you can see where that arrow is, that's the scheduled sessions we just discussed. There are other discrete activities that would build into that, ultimately being processed down into a final terms and weights for the meeting activity alone. We would then take that bucket and move it into a full taxonomical profile.
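A minimal sketch of that long-tail cut: pool the terms from every scheduled session, weight by frequency, and keep only the head of the distribution. The session data is hypothetical; the 12-term cutoff echoes the example in the talk:

```python
from collections import Counter

scheduled_sessions = [
    ["breast cancer", "immunotherapy"],
    ["breast cancer", "genomics"],
    ["immunotherapy", "biomarkers"],
    ["breast cancer"],
]

def activity_profile(sessions, top_n=12):
    """Aggregate session terms and keep only the strongest top_n."""
    counts = Counter(t for session in sessions for t in session)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.most_common(top_n)}

# The head of the distribution survives; one-off terms fall away
# as top_n shrinks.
print(activity_profile(scheduled_sessions, top_n=3))
```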
TRAVIS HICKS: So here are all of the places-- one, two, three, four; actually, it should be five, I'm missing the browser data-- the five distinct areas from which we would create subset taxonomical profiles that would then be processed down to that final taxonomical profile. In application, this is what this would look like. This is an example of a potential master taxonomical profile.
TRAVIS HICKS: In here, you can then see the subsets of content that we may be able to offer, their underlying taxonomies, and then you see where the alignment is. Each of those green areas that are highlighted are terms that were indexed based on the taxonomy. Then they align with member X's taxonomical profile. One thing I neglected to mention is that, within that taxonomical profile, there are also rankings of terms.
TRAVIS HICKS: So in the groupings, there would be three levels. Essentially you have high, which are the ones that we most want to have the highest matches for in terms of relevancy. You have a second, which would be kind of in the middle, and a third, which are going to be the lowest. So that goes into the relevancy development of the algorithm. Real quickly, because I'm out of time, I just want to hit on this.
TRAVIS HICKS: We ran a pilot on this, actually, at each of the last two years at our annual meeting, with a goal of essentially creating a taxonomical profile from self-submitted information. So instead of going through and interpreting all of the data, we actually had verification from individuals: they input the data themselves, went through their specialty, then went more granular in terms of terms. We also pivoted that on their professional role and their location of activity, because that had some variation in terms of the types of sessions that they were more likely to attend.
TRAVIS HICKS: And so quickly I'll get to that. And then essentially we're able to take that algorithm and spit out a list of recommended sessions from our annual meeting. The reason we did that is because in the slide that I just had, the scope of the annual meeting is extremely large. We have 210 sessions over four days. And folks get really overwhelmed when they start trying to build a schedule.
TRAVIS HICKS: So what we're trying to do is give them a path so that, within five minutes, we could offer suggestions that would guide the development of their schedule at that annual meeting. Quickly, the results of that-- we had roughly 3,200 of our 32,000 professional attendees actually complete the recommendations. Of those, we've received 1,400 bits of feedback.
TRAVIS HICKS: We wanted to build a feedback mechanism in there because it was a pilot and we wanted to test and get that feedback. Overall we received a 73% favorable rating on the accuracy of those recommendations, meaning a 4 or 5. We had 20% of those come in at a 3, so relatively neutral, and just a 7% negative rating of 1 or 2 in terms of the relevancy of those results.
TRAVIS HICKS: Thank you. [APPLAUSE]
SPEAKER: Thank you to our panelists. We are now at the question portion of the session. I do have a few that have been submitted here via text. But as always, I open the floor to live questions for priority. All right. We'll start with the iPad questions. Well, Travis, this is one that just came in regarding your presentation. And that is, "Do you, or how do you, create profiles for anonymous and unauthenticated individuals coming across your platforms?"
TRAVIS HICKS: Whoa. I should back up a little bit. So to answer that, to create an anonymous profile, we could just base that on browser data. And that would be the limitation that we have. So again, that's probably the least reliable form of data that we could have, so the accuracy would be reduced at that point.
SPEAKER: Great. Questions? I see one. Can we--
AUDIENCE: Question for Jabin. So now that you've got the taxonomy-based search interface, are you getting any qualitative or quantitative data of people going through a Google search of JSTOR versus one that's in the site search? Any feedback that you've gotten?
JABIN WHITE: Wow. We are, but I'm not sure we're supposed to talk about that publicly. So we also send our metadata to discovery services. And I didn't talk about that too much. But that has been enriched, as well. So we track how people are coming to that article. And I can tell you that most people come in through Google or a related search engine.
JABIN WHITE: The discovery service traffic has sort of increased a bit in the past couple of years, but not that much. What they do once they're on the site is really what we're really concerned with around the thesaurus work. And we've noticed some positive results there. We've got some work to do in terms of getting this stuff out there beyond our walls.
JABIN WHITE: But that's definitely in the works. So that's a total cop out to your question, Bert. But yeah, stay tuned. In the future we're going to be doing more with that.
JOHN MAGEE: I think I can talk about that a little bit. Jabin and I were just talking about the fact that it's always interesting because we often all have the same problems in sort of slightly different domains. But for our products, which involve a lot of periodical content as well, 15 or 20 years ago, the main access point for people was a subject guide. People would come in, they'd get to a subject guide. They'd search for topics. They'd maybe use that hierarchy a little bit.
JOHN MAGEE: Usage of that is really, really small now. People like the search window. They go into the search window. So there was a time when people thought, well, maybe we just don't need subject indexing. We could just do full-text searches and a couple keyword searches and move on with our day. But it turns out that the results are a lot better if you've got good descriptive metadata underlying the search.
JOHN MAGEE: And what you really want to do is you want to make the results that people get really relevant. But you want to sort of hide that complexity for them, which there is sort of an art to it. But it really revolves around looking at what searches people are doing, what they're looking for, and then finding ways to improve what they get. So we use our vocabularies internally, not just for the subject guide, which doesn't get used a lot anymore.
JOHN MAGEE: But in the searches, we've created portal pages in a lot of our products. So you can go in, and if you search on World War II, you'll come to a portal page about World War II, and it's got some facts about it. Or you search on George Washington, and it's got some facts. And that portal page actually has a quick fact box that pulls information from our vocabulary on George Washington-- birth date, death date, birthplace, death place, nationality, profession, things like that.
JOHN MAGEE: And then underneath that portal page are a lot of custom queries that we've developed that use a combination of subject indexing and keyword searching and other things to bring back related content and to bucket it a little bit for people. So there are some overviews and some biographies and maybe some historical documents. And it's a way that we've developed to give people the benefits of all of that really deep, rich subject indexing without making them go through all of the steps, because a power user or a high-end librarian-- really happy to do that and wants to go in and use our advanced search or use the subject guide or do some other things.
JOHN MAGEE: But the average student coming in to research our products really just wants to type in the box and see the best papers and get them for their paper, their assignment, or whatever, and move on with their day. So it's a similar thing. And we've really seen that shift in our usage from advanced search or subject guide to, hey, here's a box, and I just want to see what I'm getting.
SPEAKER: Great. Thank you. A question for Travis. "What are your suggestions for scaling down an ASCO type project to a smaller society?" Are there pieces that you would want as the building blocks that might not be as comprehensive as what you showed?
TRAVIS HICKS: So I should probably make more clear that this is also a very theoretical concept right now. We've done, in the very small scale, a pilot. It's one of a few approaches we're thoroughly exploring. So I'm sure you can take subsets of any of it, actually. I'm not sure that you have to go whole hog into creating a profile that pulls in that much activity data. You certainly can work from self-reported data and then use a little bit of supplementary data.
TRAVIS HICKS: I don't think that you have to look at the entire scope of the activities. I think, even in our case, phasing in the construction of a database would involve doing the activities one at a time and rolling it out-- not necessarily shooting for the moon immediately, but working from demographic data, then taking some of our more reliable data, which is going to be the meetings data we have coming in, and then kind of moving from there.
TRAVIS HICKS: I also think, actually, our abstract and general manuscript data is also extremely reliable, in terms of, if an author has authored a piece, we've indexed it, they have those terms. I think that that's a really good starting point, as well.
SPEAKER: Great. So more of an accretive model, where you can start and you build up as you go.
TRAVIS HICKS: Absolutely.
SPEAKER: Right. Question here. Michael?
AUDIENCE: So I'd like to ask what I probably think is a grouchy question. [LAUGHTER] So for anyone, I kind of think of the controlled metadata topics as being the known knowns of content. But can anyone talk about how you might use these techniques to uncover the unknown unknowns that searchers might be looking for to find that content which is more orthogonally related, as we heard about in that geology example earlier?
JOHN MAGEE: [LAUGHS] Yeah, nobody wants to answer that one. [LAUGHS] Yeah, so one of the problems that we have, I mentioned we're running up to about a billion documents, and it's really hard to put those on. And we've got a lot of manual ways of putting terms on content. We've got a lot of automated ways of putting terms on content.
JOHN MAGEE: We have some content that sort of defies terms in various ways. So we really rely on a couple of things to figure out what people are looking for and what they're interested in. One of the things that we use to really improve how we're organizing our content, what terms we have in our vocabularies, is search logs. People come into our products, and they search for things.
JOHN MAGEE: And we take a look at what they're searching for. And we match that up against our vocabularies and our indexing to see, hey, is this something that we have? Is this something that we're doing a good job with? Or is this something that we need work on? We have a fairly good size staff of people who work on vocabularies. And they spend a lot of time actually keeping an eye on the news and events that happen, names that come up, adding them into our vocabularies, and then going back and populating them in our back data so that when users come, they can find in search.
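A minimal sketch of that search-log triage: queries that match neither a preferred term nor a nonpreferred variant get flagged for vocabulary work. All data here is hypothetical:

```python
PREFERRED = {"beagles", "hounds"}
NONPREFERRED = {"beagle": "beagles"}  # variant -> preferred term

search_log = ["beagles", "beagle", "13 inch beagle", "foxhounds"]

def triage(queries):
    """Split queries into vocabulary-covered and review candidates."""
    covered, gaps = [], []
    for q in queries:
        term = q.lower().strip()
        if term in PREFERRED or term in NONPREFERRED:
            covered.append(q)
        else:
            gaps.append(q)  # review: new term, new variant, or noise?
    return covered, gaps

covered, gaps = triage(search_log)
print("covered:", covered)
print("needs review:", gaps)
```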
JOHN MAGEE: One of the other things that we do-- we're really excited about it; and it's really just coming out this fall-- is we've put together a digital humanities platform. We call it the Digital Scholar Lab. And it's associated with our primary source historical archives. And it gives researchers and searchers a whole variety of data mining tools that they can use against our archives.
JOHN MAGEE: And those are some pretty large archives. It's things like the Eighteenth Century Collections Online, which is pretty much everything the British Library has for the 18th century, and a lot of 19th century stuff, or the American Civil Liberties Union papers, which are all the papers of the American Civil Liberties Union from the 20th century. It gives them a whole variety of data mining tools that they can use directly with those products to find things outside of just the controlled metadata, just what we've done, but gives them some options.
JABIN WHITE: I'm going to answer your-- I didn't think it was a grouchy question at all. And I'm going to answer it with a blatant plug, if you don't mind. Go to jstor.org, and click on Text Analyzer. It's right below the Search box. And I ran out of time to get to this, but it essentially does what you're saying. So it does a search, but then it doesn't just-- you have to have a controlled vocabulary.
JABIN WHITE: And I think all three of us would agree with that. And I don't think you're suggesting otherwise. But in addition to the controlled vocabulary, you want to be able to say, what else is around this concept? Because if you didn't call it the Battle of Bull Run, we want you also to know that it was the Battle of Manassas. And machines will never do that unless you give them help. So what the Text Analyzer does is it uses-- this gets really geeky really fast-- but LDA topic modeling.
JABIN WHITE: So it counts the words in this section. And it says, we think this is about this concept. And then it looks at the JSTOR Thesaurus and says, do I have a match here or not? If it does, it's like, good on you. Let's go. And if not, it keeps looking, or it dismisses that paragraph as a, nothing to see here. Let's move on.
JABIN WHITE: So I'm suggesting that it does exactly what you're asking for, which is take a bunch of words that a user doesn't necessarily have to enter into a search box and say, what do we think this article or this dissertation or this book chapter is about? And what do we have-- and it's completely selfish-- what do we have in JSTOR that is about that thing? It's a pretty powerful thing.
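A minimal sketch in that spirit: fit LDA topics over a small corpus, then check each topic's top words against the thesaurus, the "do I have a match here or not?" step. The corpus, thesaurus, and parameters are hypothetical, and the real Text Analyzer is far more involved:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

THESAURUS = {"democracy", "elections", "sustainability"}

docs = [
    "democracy depends on free and fair elections",
    "elections legitimize democratic institutions",
    "sustainability of energy systems and climate policy",
]

# Count words per document, then infer latent topics over those counts.
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

words = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[::-1][:4]]
    hits = THESAURUS.intersection(top)  # match topic words to the thesaurus
    print(f"topic {i}: {top} -> thesaurus matches: {sorted(hits)}")
```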
JABIN WHITE: And the Twitter love we've gotten for it is just ridiculous. So people love it when they actually use it.
SPEAKER: Well, on that note, I think we are going to have to wrap the session for time. I think I want to end with Twitter love.
JOHN MAGEE: [LAUGHS]
JABIN WHITE: Twitter love, it's a thing. Twitter hate is also a thing.
SPEAKER: But I'd like to thank our panelists for sharing their taxonomic journeys with us. Thank you. [APPLAUSE]