Name:
What Is Non-Consumptive Data
Description:
What Is Non-Consumptive Data
Thumbnail URL:
https://cadmoremediastorage.blob.core.windows.net/d615d032-65e7-45c2-af6b-0b311bef6832/videoscrubberimages/Scrubber_1.jpg
Duration:
T00H26M11S
Embed URL:
https://stream.cadmore.media/player/d615d032-65e7-45c2-af6b-0b311bef6832
Content URL:
https://cadmoreoriginalmedia.blob.core.windows.net/d615d032-65e7-45c2-af6b-0b311bef6832/What Is Non-Consumptive Data.mp4?sv=2019-02-02&sr=c&sig=6lcu4ii3xngigllLXjauebUZ2EMD2SNmqGwllR8HjmI%3D&st=2024-09-17T22%3A57%3A50Z&se=2024-09-18T01%3A02%3A50Z&sp=r
Upload Date:
2024-03-06T00:00:00.0000000
Transcript:
Language: EN.
Segment:0 .
OK, fans of granularity, here we are. First, I want to compliment and thank our panelists. By the way, I'm Peter Simon from NewsBank, and I was asked to help do a little mild moderating. But I want to applaud J. Stephen, Amy, Matthew, and Glenn for the presentations they did. They worked together to make a sort of seamless presentation that walked us all through this information.
And for those who thought non-consumptive data had something to do with the lunch break, it really didn't. And it was great. But I think it's for all of us to recognize how text and data mining has evolved into such an art form of its own, a genre and a brand, a process that requires tools, thought, and structure. And now we're talking about the data structures that might make it that much more effective than trying to hash it out with whatever form the content happens to be in.
So I think it gives us all a lot to think about in terms of standards and interoperability and all the different issues that were posed. When I was thinking about this session, I found an obituary for the fellow who actually founded the business that became my company, NewsBank. That was John Naisbitt, who wrote the book Megatrends, and he formulated his methodology by getting newspapers from all across the country and looking for articles that evoked social issues and trends.
It was sort of a form of sentiment analysis, circa the 1970s, done by digging into the content not to read the articles but to pull out the themes, the ethos of what they were about. It was very crude and manual. He put the articles onto, I don't think, I know, microfiche. And if you could find the right coordinates in the very hard-to-use printed index to the content, you might discover a trend.
So we've come a long way, but it strikes me that there's a fundamental research exercise here: to get at what this content is trying to say overall, which isn't about the individual articles but the collection. So let's start, in addition to whatever comments or questions anyone has, with the questions that J. Stephen asked of the group: thoughts on standards, on interoperability, on what we should all be working on and thinking about.
The floor is open. Don't be shy. Let's look in the chat. Maybe somebody... Yeah, yeah, I do chat. OK, well, thanks, Marybeth. That's about the doc, outside of the very complimentary comment that I kept the group on track.
Anybody? Does anybody have a question for the panel? Or maybe, J. Stephen, I'm going to throw one of those toss-up questions back out again. Yeah, please don't be shy. It's a rare opportunity that we get to be together, and the questions posed at the end of the talk were not rhetorical, actually. We are striving to develop some kind of standard, maybe with NISO.
All the fine groups at NISO. So, to find a way to do this so that we can start propagating it and get other collections involved, so that we can develop our data better too. And of course with the short-term goal of HTRC interacting at least with the Constellate world, because we can build our tools together. So I know a lot of people in this room have a lot of knowledge about standards and metadata.
So we're very, very interested in hearing that. And we have a raised hand. So, Jennifer, don't be shy. Just grab a mic and jump in. And thank you for jumping in. I actually put my hand down because I decided I didn't want to be first. But I was already first.
You're second, how about that? You're second. Well, I hope that this question is not too general or tangential, but I agree with you that it's so important that anything we're building today is designed to be extensible, to deal with whatever unknown we're going to have to deal with in, I don't know at this point,
six months, a year. How long does it take for the whole world to change? Do you have any tips, practical or philosophical, for how one designs things that will be able to deal with the known unknowns that we know are coming? I can just say some of the known unknowns that trouble me now, and that is that we created extracted features on this non-consumptive premise.
But now that we have GPT, everything can be expressed. Right? I haven't tried, and I don't know whether anyone has, to take extracted features, like a non-consumptive data set, give this to ChatGPT, and say, please make sense of this. Will it recreate the original text? I would like to think that it won't, but it's pretty easy to generate sophisticated-sounding text that might concern a copyright holder.
So, yeah, that's an unknown unknown, right? The new world of generative AI: does it make all of these points moot? Does it make our data more consumptive or more expressive than we had thought? Or does it just clear the stage, where we don't worry about it anymore because all bets are off?
That's kind of a sci-fi unknown unknown. I mean, the unknowns that I care more about are: what are the research interests, what are the research possibilities of the files that we're creating now? And I know I have some very good friends in the digital humanities world for whom our very carefully selected set of features is useless for their research topic.
So we certainly don't answer all research needs even now. And so we're guaranteed not to answer all of them next year or in the next five years. So I think that it's always a case of doing the best that we can with what we have. And what we have is copyright restrictions, an understanding of transformative fair use, a set of texts with variably imperfect OCR, and a set of features that we think are generally useful for a significant number of people.
We should never pretend that we have everything for everybody for all time. I think that's the bottom line. But we do know that we can do better. And one of our collective motivations in coming to the conference to talk with all of you is that one of the things we can do better is to work better together, all of us who are data providers or creators or manipulators. We think that we can.
We think that this is a winning approach, and we think that it's something that we could standardize a little more, if not with a standard, at least with a set of best practices or common understandings. I will say... Yeah, I was going to say, my reaction is very practical. I come at it sort of with my librarian hat on, and the best thing we can do now is have our data as clean as possible
now, and acknowledge that we're probably going to have to reprocess it again in the future. We might have to build new features in the future, and the best way to be able to do that is if we have content coming in as clean as we can. And we have to acknowledge that it's not going to be a one-off. We're going to have to do OCR and re-OCR and re-OCR every decade, because the tools are going to get better and the content is going to get better. So, to the extent that we can, we should have the metadata coming in as clean as possible, the images coming in as clean as possible, and the OCR as robust as possible.
That's going to give us the most flexibility over the long term. That's my very practical hat there. You heard it here first, folks: the content is going to get better. So, yes, I believe that's actually true. And so we start with better OCR, and then we're going to have better feature extractors, we're going to have more accurate language detectors, we're going to have more accurate entity recognizers, more accurate...
It'll just get context better and better. And so, you know, we're going to constantly evolve. And where this becomes a metadata problem, which I alluded to in the chat, is that three years from now, someone's going to develop, let's say, a new embedding, one of those magic embedding things. We need to be able to have extracted features, the whole extracted-features-verse. I don't know, it sounds like a movie, a whole movie chain.
But what we need is the ability for the people that create the new feature to properly describe it, so that when people grab it, they know what it is that they're grabbing. And this is why it's a metadata problem, right? Unlabeled data is the kiss of death, because you don't know what it is or how it came into being. So these are the great provenance issues that a lot of us in the digital curation world deal with, since we inherit those kinds of problems.
Yeah, and maybe I could jump in with a very practical example of this: what these new embeddings are, tying it to people coming up with new things in the future. When Stephen says embeddings, these are methods for taking free text or other types of data and producing a set of numbers, an array of numbers, that you can then do a lot of interesting computations with.
And there are very simple ways of doing that, such as: how many times does the word "and" show up in this document? How many times does the word "camera" show up in this document? You get a long list of those incidence numbers. And then there are much more complex ways of doing this. The steps that we take when we're engineering those features matter: how did we count the number of "and"s?
Did we cut off the trailing commas or periods on them? Did we lowercase them? Those decisions matter. And this is where data cleanliness is actually a very relative term. Certain researchers don't care about "and", and they don't want to see it. They're just interested in the content of the texts. And so as soon as they get that, they want to remove it.
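[Editorial illustration, not part of the session.] The counting decisions described here can be made concrete with a small Python sketch. It assumes simple whitespace tokenization, and the function name `count_tokens`, its two flags, and the sample sentence are all hypothetical; they are not taken from any tool mentioned by the speakers.

```python
import string
from collections import Counter


def count_tokens(text, lowercase=False, strip_punct=False):
    """Count whitespace-separated tokens under explicit normalization choices.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for token in text.split():
        if strip_punct:
            # Remove leading/trailing punctuation, e.g. trailing commas or periods.
            token = token.strip(string.punctuation)
        if lowercase:
            token = token.lower()
        if token:
            counts[token] += 1
    return counts


# Made-up sample text.
sample = "And the camera, and the lens. And then the camera and, finally, the tripod."

raw = count_tokens(sample)
clean = count_tokens(sample, lowercase=True, strip_punct=True)

print(raw["And"], raw["and"], raw["and,"])  # three separate entries in the raw counts
print(clean["and"])                         # one merged entry after normalization
```

The same text gives different counts for "and" depending on the two flags, which is why the choices need to be recorded alongside the numbers.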
For other researchers, those types of conjunctive utility terms are key to their research, along with things like punctuation or even OCR performance. They actually want our messy data as a kind of baseline: here's a baseline for this, we have a new type of OCR, and we want to compare against our baseline and see what it's done. And so from the engineering standpoint, we've got some users saying, OK, I want all of these tokens extracted out and normalized in this particular way for my research.
We have other users saying, why did you normalize this? Why did you lowercase all these things? This data is too clean. This is not close enough to the original raw data. And so this gets to that sort of matrix of usability and complexity of the data that I was talking about, where we have users coming with different skill sets, a different range of experience, and different research needs.
There's probably not one single feature set that is going to make everyone happy. There certainly is not. But we can figure out what a common baseline is. You know, if we provide people with the number of times "and" shows up with a capital A versus showing up all in lowercase, people who want them all lowercase can take that data and transform it into the data that they need.
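[Editorial illustration, not part of the session.] The point about a case-preserving baseline can be sketched in a few lines of Python. The helper name `fold_case` and the counts are invented for the example. Folding case-preserving counts down to lowercase is a simple sum, but two different case-preserving tables can fold to the same lowercase table, so the step cannot be undone.

```python
from collections import Counter


def fold_case(counts):
    """Merge case variants of each token into a single lowercase entry.

    Hypothetical helper for illustration only.
    """
    folded = Counter()
    for token, n in counts.items():
        folded[token.lower()] += n
    return folded


# Made-up case-preserving counts, as a provider might publish them.
case_preserving = Counter({"And": 2, "and": 5, "Camera": 1, "camera": 3})
lowercased = fold_case(case_preserving)

# A different made-up table that folds to exactly the same result,
# showing that the lowercase table does not determine the original.
other = Counter({"And": 7, "camera": 4})
same_after_folding = fold_case(other) == lowercased

print(lowercased)
print(same_after_folding)
```

A user who wants lowercase counts can derive them from the case-preserving table, while a user who is handed only the lowercase table cannot tell which of the two originals it came from.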
If we provide it all cleaned up, the people who need that raw data can't transform it back out. And so we're trying to think about what data we can provide, and then what skill sets we can provide our users, right? So it's more than just the content. It's also helping train them in how to use the content. I know certainly for us at Constellate, this is a major, major part of our work. But it gets right down to this level of metadata standards too: can we document it within the data that's coming out?
Here's exactly all the choices that were made. Here's where you can find definitions about those choices, because these are specialist terms. And as Glenn accurately notes, not even in 10 years: next year, two months from now, we're going to have new methods that are going to be applicable to this type of work. So, yeah, we're trying to keep in mind what a common denominator is, so that we can reduce the amount of customized work we have to do every single time while still meeting the needs of all of our users.
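[Editorial illustration, not part of the session.] One way to picture documenting the choices within the data is a small provenance record shipped alongside each feature file. The sketch below is only an assumption about what such a record might contain: the class name `FeatureProvenance`, every field name, the tool name, and the example.org URL are hypothetical placeholders, not an existing standard or any provider's actual schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class FeatureProvenance:
    """Hypothetical record of how a feature was produced (illustration only)."""

    feature_name: str
    description: str
    lowercased: bool
    punctuation_stripped: bool
    tool: str
    tool_version: str
    definitions_url: str  # where the specialist terms are defined


record = FeatureProvenance(
    feature_name="token_counts",
    description="Per-document counts of whitespace-separated tokens",
    lowercased=False,
    punctuation_stripped=True,
    tool="example-extractor",  # placeholder name, not a real tool
    tool_version="0.1.0",      # placeholder version
    definitions_url="https://example.org/feature-definitions",  # placeholder URL
)

# Serialize the record so it can travel with the data file it describes.
serialized = json.dumps(asdict(record), indent=2, sort_keys=True)
print(serialized)
```

With a record like this attached, someone who picks up the counts later can see whether tokens were lowercased or stripped of punctuation before deciding if the data suits their research.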
That's the tension I'm keeping in my head all the time. Any other questions or thoughts from the folks in the meeting here in the cyber seminar? I think, as was pointed out, the conference organizers created a Google Doc that I see already has in it the questions that J. Stephen posed and some thoughts and so forth.
And that is a place where one might go afterwards if we have thoughts or questions or ideas. I think one of the aims, as Todd Carpenter said this morning, is to try to see where the various conference programs might lend themselves to identifying topics or issues for working groups and future discussions. And hopefully, in my view, this would be one of them, because it's an area where,
you know, as the different factors were outlined, there's a lot that could be done to fine-tune what folks are wrestling with. So, any other questions for the speakers? Peter, I have a provocative question for people in the room, and that is... So Stephen and I work for an academic research center whose goal is to provide as much stuff for free as we can.
We absolutely acknowledge and make great use of things that are not free, things that are under license, that have business models, et cetera. My question is... I mean, in the past, at least for commercial data providers, there's been a need to rein in the use of their data so that people will have a motivation to buy it, which is perfectly acceptable in our economy.
My sense is that the profit motivation is not going away, but that it may be changing. So my provocative question is for those of you who have a business model that requires you to make use of licensed data, that is, you have to pay for it, or to sell it, or both. Is greater interoperability anathema to you, or is it the opposite case? Can you make your money in a different way?
And can we make it so that researchers can really mix and match the data that comes from a variety of sources, both free and paid? We know that researchers want to do that. Researchers don't care where their data comes from. Most of the time they don't even care that their library had to pay for it. They just want what they want.
So that seems like a utopian future, but I have doubts about how possible that is. That's my provocative question. And if no one wants to go on the record, there's a great question in the chat, so we could bypass this. But I'd love to hear your thoughts about that. Any thoughts out there from the squares?
That was a great question. Yeah, you know, I talk to content providers really frequently, and in some contexts, not all, ITHAKA is one. And we hear a variety of things. Especially for the smaller content providers, the smaller publishers, this stuff is expensive.
They do not have a path forward for monetizing data set delivery. They're struggling to put content up for their users, to publish on their schedule, and to add new content. So when we have conversations with smaller publishers, once we explain to them what we're doing, they're usually very interested. It's like, oh yeah, I get a couple of questions a year; now I can just send them to you. And that's good.
It can really vary by field. The STM guys are making substantial money selling their data to, say, pharma and industry. And it is important that we not interfere with that. We don't want to. So this is just to say that there's not one category of commercial content provider out there.
They are so varied, with different needs, from "Oh my god, I'd love to save on my staff time" to "I can do a lot more, because I can make some real money off of this." When you're selling stuff to pharma, it's a whole different field, a whole different world from the kind of numbers we typically talk about. So what we see is just a lot of variation, depending upon who you are and where your focus is.
And so to the extent that third parties like HathiTrust or ITHAKA can help in this space, I think that's what we both aim to do. We want to help all the parties if we can. Excuse me, I think Amy has really hit upon something here. I'm associate dean here at the School of Information Sciences, and I help my faculty negotiate data contracts of different kinds with different kinds of providers, not necessarily just text, but, you know, Twitter feeds and all that.
Pick some kind of data, and someone needs a data set. So it's up and down the stack. Everyone in the chain, from the person that gets contacted at the company all the way through sales, through the lawyers, up to the VP, whatever the VP title is, perhaps all the way up to the president, seems to need to stick their oar in, because they're all scared about something. They're scared about violating the terms of the people that they answer to, or losing the monetizability,
is that the right word, the monetization of their data, and future liability, all these things. And then we have it here on the university side, right? Because we might be dealing with some sensitive data, we have to make assurances, spend money, get our lawyers involved, get our people involved, get audits and all these kinds of things, report if data leaks or whatever, all through the university lawyers.
So one aspect of our extracted features model as it stands now that I like is that the thing that comes out lives in the wild. We're putting, I think, CC0 on it or something like that. And it's like, you're done, right? Everything that we've been putting out is open. So you do what you want with it. I'm not going to ask to negotiate a contract with you, whatever.
And on the other side, if we convince the commercial stakeholders, or not even commercial, the stakeholders of things that they have fiduciary responsibility over, that the product that comes out is all good, that it's not going to cause harm to them, then I think we can get more people to play. Yeah, and I think the notion of standardization of the non-consumptive data concept gets people's heads out of worrying about protecting the text, the articles or whatever, because it's really taking the data and putting it in a completely different context and environment and use.
And that's what a lot of content owners don't really understand. They say, "But you're taking my articles and you're putting them over here, and I don't want that." But what we're really talking about is not your articles. It's stuff that's embedded in there that we're trying to lift out. That's the challenge, I guess. So it's a strange place, because, you know, I feel a lot of obligations to our content providers.
They've invested in that content, and I don't want it to slip out of our control. At the same time, we live in a world where, in reality, it has slipped out of their control. And so it's an odd place to be, but you're like, no, we have contracts with those folks, and we've got to honor them. It doesn't matter that other things are happening
in the world. These are the contracts I have right now with these folks who have entrusted us with their content. It's a complicated space. Very complicated. Vincent had a less complicated question. Yeah, yeah.
Context, yeah. Contract trumps copyright. So I see that we're at, at least on my computer, the 1:46 mark. I think we're supposed to end quite soon so folks can be in their next session. But I want to thank the panelists. Hopefully everybody got a clear picture of all these issues and layers and so forth.
And I hope that the dialogue will continue, and that folks will seek out that Google Doc and seek out invitations to talk further, and so forth. That's what I hope comes out of this. So thank you all, and happy Valentine's Day. Thanks for attending. OK.