Name:
What is non-consumptive data and what can you do with it? Recording
Description:
What is non-consumptive data and what can you do with it? Recording
Thumbnail URL:
https://cadmoremediastorage.blob.core.windows.net/df6edf91-59f1-4dda-8bbb-eaddf783ea0e/videoscrubberimages/Scrubber_3.jpg
Duration:
T00H50M13S
Embed URL:
https://stream.cadmore.media/player/df6edf91-59f1-4dda-8bbb-eaddf783ea0e
Content URL:
https://cadmoreoriginalmedia.blob.core.windows.net/df6edf91-59f1-4dda-8bbb-eaddf783ea0e/What is non-consumptive data and what can you do with it-NIS.mp4?sv=2019-02-02&sr=c&sig=0QuhFX%2Bp76112id8LGerkRaGmHFh7jXHtUf1Oss6dIs%3D&st=2024-12-08T17%3A27%3A49Z&se=2024-12-08T19%3A32%3A49Z&sp=r
Upload Date:
2024-03-06T00:00:00.0000000
Transcript:
Language: EN.
Segment:0 .
Welcome! We are so happy to speak with you today about non-consumptive data, which I know is quite a mouthful and doesn't make a lot of sense on its face.
We're probably all looking for our cookies right now. I'm Amy Kirchhoff and I'm going to kick us off today. I'm joined by my HathiTrust colleagues, Stephen Downie and Glen Layne-Worthey, and by my Constellate colleague Matt Lincoln. And as you know or will know, our session is being moderated by Peter Simon of NewsBank. Constellate, where I work, is a service of ITHAKA, which is a not-for-profit with a mission to improve access to knowledge and education for people around the world.
We believe that education is key to the well-being of individuals and society, and we work to make it more effective and affordable. We do this through a number of sister organizations that include JSTOR, Constellate, Artstor, Portico, and Ithaka S+R. Constellate is a text and data analysis platform that integrates access to scholarly content and open educational resources into a cloud-based application and lab to help faculty, librarians, and other instructors easily teach text and data analysis.
Constellate's teaching and learning focus allows learners across all disciplines to apply text analysis methods to content, hone their skills with on-demand tutorials, attend live classes taught by the experts at Constellate, and engage with our inspiring user community. And now Stephen is going to tell you a little bit about HTRC. Thank you.
Hey hey. That was a great start there. Hi, I'm Stephen Downie. I'm the Illinois co-director of the HathiTrust Research Center. The Research Center is actually a collaboration between Illinois and Indiana, where Professor John Walsh at the Luddy School of Informatics is the director.
And it's a great collaboration between Illinois, Indiana, and the University of Michigan, where the HathiTrust mothership, as we like to call it, the main organization, resides. And our goal, our purpose at HTRC, is to provide non-consumptive access for text miners and scholars and researchers to the entire corpus of the HathiTrust, one of the largest digital library collections in the world.
Yes, so is there anything else I should be introducing about the HathiTrust other than that it's totally awesome? If you Google us, you'll be able to find our analytics page quite readily. Totally awesome, I concur. All right. I'm going to talk a little bit about text analysis. We find that sometimes it's not really clear.
And so before we even get to non-consumptive data, let's introduce some of the purposes that text can be put to. As I expect you all understand, text was written to be, well, read by people using their eyes and their brains. But in our modern world, we sometimes want to have computers perform text analysis on the written word rather than having a human being read it. Text analysis is the practice of extracting information from collections of text, both large and small, to discover new ideas and to answer research questions.
In text analysis, software is often used to classify, sort, and compile data to identify patterns, relationships, and sentiments, and to create new knowledge. Text and data analysis are really core competencies within data literacy as a whole, and they're connected to digital humanities, data mining, data analytics, and even big data.
Text analysis can help us answer questions such as: What are these texts about? How are these texts connected? What emotions are found within these texts? What names are used in these texts? And which of these texts are most similar? Many different methods, and this is just a sampling of them, can be used to answer these questions, and I'll note that often these methods build upon one another. For example, collocation analysis, which looks at how frequently words occur together, is necessary for named entity recognition, sentiment analysis, and topic modeling. So they rarely stand alone but are part of a spectrum of research.
If we turn back to the page from The Papers of Thomas Jefferson example that we looked at four slides ago (it was on the screen at any rate), what I put on the screen and what is up there now is an image, a picture, of a page. Before we can do any text analysis on it, we'll need text, not an image. Otherwise, we'd be talking about image analysis, which is definitely also a thing, but not our topic for today.
Constellate, HathiTrust, and many, many others have lots of text, not just images of text. Unfortunately, providing actual full text to faculty, students, librarians, or other researchers can be really complicated by both copyright and contract law. Those of us who have custodianship of content do not always have all the rights necessary to provide the text. For a live example of this complexity:
Most of the archival content held by JSTOR is in Constellate and available for analysis, because the licenses our publishers have signed with JSTOR are relatively broad. The one exception to this is our non-open-access book publishers, whose licenses with JSTOR are more narrow. It's not quite clear to us if we have the right to put that content in Constellate. So, to be safe, we've taken a different route.
Portico, a sister organization to JSTOR that provides preservation services, also has agreements with publishers. Portico's agreements are extremely narrow. They are just about preservation, nothing else. Portico provides very little access. So to include content from Portico in Constellate, and we have over 100 publishers from Portico participating, our Portico publishers must opt in.
Most of those non-open-access JSTOR book publishers are in Portico, and we've targeted them in our conversations so that we can try and get that content into Constellate, because in truth the end user really doesn't care where the content is coming from. They don't care about JSTOR. They don't care about Portico. Maybe they care about Constellate.
But this is why, for The Papers of Thomas Jefferson, which we were looking at, the content in those volumes in Constellate came from Portico: because we have an agreement with Princeton University Press at Portico. If you go to read that content, you're going to read it at JSTOR, because that's where it's hosted. And in truth, the OCR of the page, the actual text of the page (not the page image) that I showed you,
I went and got that from a federal government website. It was just easier for me than tracking down the right colleague at JSTOR to fish out the OCR. This describes the need for non-consumptive data, or one of the needs for non-consumptive data. Those of us with custodianship of content want to enable as much text analysis as we can. It's important and it's valuable.
And there are fantastic new research results happening every day, and we want to support that. So in those cases when we can't provide the full text, we want to provide non-consumptive data, or features, which is terminology we adopted from HTRC, that can be used for much of text analysis. I'll note that there are a few scenarios where Constellate can deliver full text, and where we do.
There are a lot of situations in which we can't, and so we can always provide those features, that non-consumptive data, at Constellate. We include things such as the bibliographic metadata, so you can see we've got some creators in there, a publication date, a series title. We include unigrams, which is every word in our documents and the number of times it occurs; bigrams, which is every two-word phrase in the text and the number of times that occurs; and then trigrams and their frequency. For example, in "Jack and Jill went up the hill," "Jack and" is a bigram that occurs once, "and Jill" is a bigram that occurs once, and so on.
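(To make the counting concrete, here is a minimal Python sketch of how unigram, bigram, and trigram counts could be produced for that example. This is illustrative only, not Constellate's actual pipeline, and the tokenization here is just a lowercased whitespace split.)

```python
from collections import Counter

def ngram_counts(text, n):
    """Count every contiguous n-word phrase in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)

sentence = "Jack and Jill went up the hill"
print(ngram_counts(sentence, 1))  # unigrams: {'jack': 1, 'and': 1, 'jill': 1, ...}
print(ngram_counts(sentence, 2))  # bigrams:  {'jack and': 1, 'and jill': 1, ...}
print(ngram_counts(sentence, 3))  # trigrams: {'jack and jill': 1, ...}
```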
The unit at which we provide these features in Constellate at the moment is the document. This tends to be a journal article, a book chapter, a newspaper issue. Sometimes it's a full book.
It varies based upon our content source. This differs a little bit from HTRC, where the unit of the features is the page. And that's something for us to think about as we consider possible standardization moving forward. My colleagues are going to continue to walk through this, and you're going to see that the features we provide at Constellate partially overlap with what HTRC provides.
We both have some core that's the same; some, around the edges, is different. And that's another thing for us to consider as we think about moving between content sources and providers. Oh, and I flipped to the end. Excuse me. Let me flip back to where I'm supposed to be.
Here we go. And now I'm going to pass over to Glen at HTRC. Thank you, Glen. Perfect, thank you very much, Amy, for that great introduction to text mining and to Constellate. So whereas Constellate and its partners are constrained primarily by contract law, HathiTrust and the HathiTrust Research Center are constrained primarily by copyright law.
These are overlapping but distinct areas of law. The ways that we've both dealt with these constraints converge on the idea of non-consumptive research, including the topic of our panel, non-consumptive data. So here in the HathiTrust portion of the panel, I'd like to start with some pertinent history and context, context that set us on the way and determined the path to where we are now, why we do what we do, and why we use these funny words for it.
Next slide, please, Amy. So, many of you will remember the December 2004 announcement by Google and five research libraries, four in the US and one in the UK. They announced their intention to engage in a massive book scanning project. This was called, at the time, Google's moonshot. The French called it Google's challenge to all of Europe. It aroused a lot of passion, both positive and negative, including legal passion.
Less than 9 months later, the Authors Guild of the US filed a lawsuit against Google and against all of its libraries for massive copyright infringement. So this is a really crucial step of our whole story. I suspect many people remember it, but you may not know all the connections. Next slide, please, Amy. So a few years after that, a group of libraries decided to form a consortium.
By that time, there were 23 libraries who were Google partners. We founded a consortium and called it HathiTrust. Hathi is a Hindi word for elephant, with the obvious implications of something that's very big and has a very long memory. So 23 research libraries, mainly in the US Midwest and the University of California system, created HathiTrust.
And it didn't take too long before the Authors Guild included HathiTrust in its massive lawsuit. I'd like to show this little celebration of the founding of HathiTrust in the New York Times: "An Elephant Backs Up Google's Library." I always like to point out, especially in a university context, but it's important for us here as well: this is not, was not, and never will be Google's library.
This is our collective library. This is the library that your kids' tuition dollars have gone to fund, and probably your tuition dollars as well. And it represents the work of centuries of library collecting and cataloging and preservation, and so on. So although I bristle a little bit at the term "Google's library," the thing to celebrate is that it's a very, very big collection, an academic version of what was the Google Library Project,
which then became the Google Books Project. Next slide, please. So that lawsuit, as big lawsuits tend to do, lasted many years. There were lots of phases of discovery, some of which I remember being involved in. And it was finally decided only in November 2013. So this is 9 years after the announcement of the Google Books project, and it was decided resoundingly in Google's favor, and by that time in HathiTrust's favor.
This is important to us because the decision was so strong and so strongly worded, and it cited so much research that is important to the world of text analysis. The digital humanities community in particular was quoted. Some of us had filed an amicus brief in the lawsuit, and it was relied upon pretty heavily in the judge's decision. It's a great piece of legal writing, also something that you can Google, something that I assign to students when I teach. Highly recommended.
The basic finding of the court was that the copying of in-copyright works, even without asking permission, is legal for purposes of text mining. This is tremendously, tremendously important for everything that we do, and it really has made the work of text mining of in-copyright materials possible. Next slide, please.
So since I know not everyone will go read this finding, I'm going to read a little bit from it, and a little bit from a piece of the litigation that didn't come to fruition, and that is the Authors Guild versus Google Books settlement agreement, which the two parties had worked on and which was rejected by the court. But a lot of good work went into it, including the crucial idea of non-consumptive data.
So here's the official definition from the court documents: non-consumptive research means research in which computational analysis is performed on one or more books, but not research in which a researcher reads or displays, that is, consumes, substantial portions of a book in order to understand the intellectual content presented within the book.
So this is the official definition that we use for non-consumptive research. By reading a book, we consume it; by text mining it, the decision of the court is, we don't consume it. Other people have used other terms, such as non-expressive research and non-expressive data. I prefer that, but it hasn't taken off, so we go with non-consumptive. And then, just to remember, that was from 2008; from the final finding of the court in 2013, there's another similarly strong argument, and that is that Google Books,
and by extension HathiTrust, does not supersede or supplant books because it is not a tool to be used to read books. Instead, it adds value to the original, and it allows for the creation of new information, new aesthetics, new insights and understandings. Hence, the use is transformative. Those phrases in the subsequent quotes are from one of the legal definitions of fair use and of transformative fair use.
That all adds up, holistically and legally, to the idea that non-consumptive research is a fair use, which means we don't have to ask permission from copyright holders even if the original is in copyright. Next, please. As Amy has mentioned, text analysis can be thought of as a statistical analysis of a text's features.
A feature is just what it sounds like. It's not the reading of the text; it's aspects of the text, facts about a particular text, not the text itself. And as facts, not only was the scanning of the work, the copying of the work, not an infringement of copyright, but the presentation of facts about the work is not an infringement of copyright, because facts are not subject to copyright restrictions.
So this may seem like a loophole, but in fact it's a very fundamental and deep truth about what we do. In spite of the fact that a human being can't read, or can't easily read, a set of extracted features, they are nonetheless very, very useful for many analytical purposes, including but not limited to topic modeling, genre detection by computer, tracking of linguistic trends over time, front matter detection, and detection of other aspects of a text that you may be surprised a computer can find.
Well, these days, nobody is surprised at what a computer can do. But it's still pretty impressive that from a set of statistical features, word counts essentially, algorithms have been trained to do pretty remarkable things. Next slide, please. Like Constellate, we at the HathiTrust Research Center present our extracted features in JSON format.
Our colleague Matt will talk in a minute about various formats; this is the only one we use. It has features at several different levels, including the volume level, which is library metadata created by a librarian or librarians. These include things like title, author, and language, and whatever other information is included in a library catalog, likely enhanced.
Also, of course, very important: unique identifiers for the volume, and things that make it addressable and accessible and reproducible. For each page in the volume, there's also page-level metadata. Since libraries don't catalog at the level of the page, this is metadata that we generate ourselves, and by ourselves I mean we and our powerful computers and the algorithms that we write for them.
The page-level metadata includes the sequence of the page in the volume as well as computationally inferred data, such as the number of words on the page, the number of lines, the number of sentences, the number of empty lines, as well as an automatically determined language of the page in question. So a book may be primarily in one language, but it may have substantial text within it in other languages.
And we have used automatic language detection to determine what those other languages are, so that we can allow researchers to dig into the page. It's probably too small to see, but the last line on this little code snippet is calculatedLanguage: "en", so that means the language of this page, not the volume (probably the volume also,
but in this case it's the page) is English. Next slide, please, Amy. The page level is further broken down into what I think of as the real meat of the extracted features. A page is broken down, again by algorithm, into sections of the page: header, body, and footer, essentially. Usually the header and the footer are very brief, maybe one line, and the body is the main portion of the text, which may or may not include footnotes.
That probably depends on the algorithm. This is an imperfect process, just like many of our processes are. But crucially, for text mining purposes, it also includes, as I said before, line counts and empty line counts and individual words and their counts on the page. And one thing we do, and I'm not sure whether Constellate does this or not, but it's been an important part of our work: we have automatic part-of-speech taggers so that we can distinguish between homonyms that may serve different purposes.
It's pretty good, not perfect. So one could also track, for example, the density of nouns or verbs over the course of a long text; you can see whether certain sections of the text are more verb-heavy or more adjective-heavy. So this is, again, probably too small to see, but the last two lines on this code snippet are "snow," which is identified as a noun, and it occurs twice on this page.
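(To make that structure concrete, here is a simplified page-level record written out as a Python dictionary. It is loosely modeled on the HTRC Extracted Features JSON being described, with a page sequence, computed counts, a calculated language, and per-section part-of-speech-tagged token counts; the real schema has more fields, and the exact field names here should be treated as approximations.)

```python
# A simplified, approximate page-level extracted-features record; field names
# are illustrative and may differ from the actual HTRC schema.
page = {
    "seq": "00000057",              # position of this page within the volume
    "tokenCount": 312,              # computed by machine, not by catalogers
    "lineCount": 34,
    "emptyLineCount": 3,
    "sentenceCount": 18,
    "calculatedLanguage": "en",     # automatically detected language of the page
    "header": {"tokenPosCount": {}},
    "body": {
        # word -> part-of-speech tag -> count on this page
        "tokenPosCount": {
            "snow": {"NN": 2},      # "snow" tagged as a noun, occurring twice
            "falls": {"VBZ": 1},
        }
    },
    "footer": {"tokenPosCount": {}},
}

# Example use: total nouns in the body section of this page.
nouns = sum(
    count
    for tags in page["body"]["tokenPosCount"].values()
    for tag, count in tags.items()
    if tag.startswith("NN")
)
print(nouns)  # 2
```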
So every single page of every single volume in HathiTrust has a section of a file like this. Next, please. What can you do with extracted features? Well, in spite of the fact that these are very abstract, basically statistical versions of what's on the page, we can do a lot of things. Even as human beings, we can help.
We can identify tables of contents, for example, or indexes or title pages. The little illustration here shows a page that happens to have a whole lot of capital letters as the first character in the line and a whole lot of numbers as the last character in the line. These are both features that we provide in the file. That's likely to be a table of contents page. If it only has capital letters in the first column of characters and not numbers at the end, it's likely to be poetry of a certain formal nature, right?
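(Here is a toy version of that kind of guess, written as a hedged Python sketch rather than HTRC's actual classifier: given the first and last characters of each line on a page, which are the kinds of features provided in the file, flag the page as a likely table of contents when most lines begin with a capital letter and end with a digit.)

```python
def looks_like_toc(first_chars, last_chars, threshold=0.7):
    """Toy heuristic: a page whose lines mostly begin with capital letters
    and mostly end with digits is likely a table of contents."""
    if not first_chars or not last_chars:
        return False
    caps = sum(c.isupper() for c in first_chars) / len(first_chars)
    digits = sum(c.isdigit() for c in last_chars) / len(last_chars)
    return caps >= threshold and digits >= threshold

# First and last characters of each line on a hypothetical page.
first_chars = ["C", "C", "C", "C", "I"]   # e.g. "Chapter ...", "Index ..."
last_chars = ["1", "9", "3", "7", "8"]    # e.g. ending page numbers
print(looks_like_toc(first_chars, last_chars))  # True
```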
So there are things that we can do that help us, that help the algorithm, to guess what kind of page this is. Next, we also feed these extracted features into a variety of research tools that people can use, including some that have a very low barrier to entry. One of our favorites, one of the first, is called the Bookworm tool. Those of you who've used the Google Ngram Viewer will recognize that it's the very same code.
At least the core of it was the same code as the Google Ngram Viewer. We've adapted it for our purposes to use extracted features, and in spite of the fact that it uses very abstract statistical versions of a text, it can show really interesting historical trends. This particular one, I believe, shows the use of two different words related to republican liberty in books published in Canada and in the US.
And you can sort of compare how the two national contexts affected the way those words were used. There are innumerable examples of what this can do. Next, please. One fascinating recent line of research is the detection of duplicates. The HathiTrust Digital Library includes everything that its member libraries choose to submit to it, which means that there are hundreds and hundreds of copies of Jane Austen novels and Shakespeare plays, many of which are from exactly the same printing, the same edition.
It's interesting to know that there are lots of those copies, but it's not that interesting to have them overwhelm your text mining results. And so we have scholars who have worked with us in the past and continue to be collaborators from elsewhere, working on things like text duplication. This in particular is just one little graph from a really great IMLS-funded research project by our colleague Peter Organisciak at the University of Denver to discover text duplication.
He does that through cosine similarity across a work: a certain amount of cosine similarity is indicative of an identical text, a slightly lesser amount is indicative of a similar text, possibly the same text but in a different edition, and a very low cosine similarity indicates that these are different works. So it's a very slick way, again, using just extracted features, not the full text.
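(As an illustration of the idea only, not Dr. Organisciak's actual method, cosine similarity can be computed directly from two works' word-count features; values near 1.0 suggest an identical text, somewhat lower values a similar text such as another edition, and low values a different work. The counts below are made up.)

```python
import math

def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two word-count dictionaries (extracted features)."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

work_1 = {"pride": 42, "prejudice": 40, "elizabeth": 300, "darcy": 280}
work_2 = {"pride": 41, "prejudice": 40, "elizabeth": 310, "darcy": 275}  # same text, different printing
work_3 = {"whale": 500, "ship": 320, "sea": 410}

print(cosine_similarity(work_1, work_2))  # close to 1.0: likely duplicates
print(cosine_similarity(work_1, work_3))  # 0.0 here: different works
```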
Next, please, Amy. Now I'll show you, in quick succession, some very, very recent work by our colleague Ben Schmidt, who currently works for Nomic, a company. Before that, he was at New York University. He has devised a way to reduce the dimensionality of these very, very large files in such a way that, he claims, all 17 million volumes of the HathiTrust extracted features data set can run in the browser.
We have only seen him do that with 7 million, a mere 7 million volumes. These are 300- or 400-page volumes running in his browser, and he can visualize it in an interactive visualization. So here we have 7 million text objects, expressed as extracted features files, arranged and therefore clustered by algorithm according to text similarity. This part of the algorithm knows nothing about what the book is.
It only knows how similar its statistics are to the next book over, and it clusters them. Additionally, this particular setting shows the languages, which are taken from metadata, by color. So, not surprisingly, all of the books in the same language have clustered together. The big orange cluster is the English blob; about two thirds, 60%, I think, is the current amount of the HathiTrust collection that is in English, and it all clusters together. No surprise there.
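(As a very rough sketch of the idea, not Dr. Schmidt's actual pipeline, and with made-up data, one could turn per-volume word counts into vectors, project them to two dimensions, and color the points by the cataloged language; volumes with similar statistics land near each other.)

```python
# Toy sketch: project word-count features to 2D and color by cataloged language.
# Illustrative only; the real work uses millions of volumes and different methods.
from sklearn.feature_extraction import DictVectorizer
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt

volumes = [
    ({"the": 120, "of": 80, "liberty": 2}, "en"),
    ({"the": 95, "and": 60, "freedom": 5}, "en"),
    ({"le": 90, "de": 100, "liberte": 3}, "fr"),
    ({"der": 85, "und": 70, "freiheit": 4}, "de"),
]
counts = [c for c, _ in volumes]
langs = [lang for _, lang in volumes]

X = DictVectorizer().fit_transform(counts)               # sparse volume-by-word matrix
coords = TruncatedSVD(n_components=2).fit_transform(X)   # 2D projection of the features

colors = {"en": "orange", "fr": "blue", "de": "green"}
plt.scatter(coords[:, 0], coords[:, 1], c=[colors[l] for l in langs])
plt.title("Volumes clustered from extracted features, colored by language")
plt.show()
```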
Next slide, please, Amy. So I tweaked one of the parameters here to show the same arrangement: the texts that are similar are close together on the chart, but now the colors express subject matter, again as expressed in the metadata.
And here we see that every language has a whole bunch of subjects in it. Also no surprise, but it's a confirmation that this algorithm is actually working; it's doing what it says it does. So English-language novels have clustered in the northeast corner, and there's a little random example highlighted there. And then, if you'll go to the next slide, the same visualization colored by subject shows that a government document is still in the English half of the cluster, the English sort of 60%, but it's somewhere down near the Champaign-Urbana area of the graph. Not that it's boring here, it's great here, but we do very important and very serious stuff.
None of this novel stuff. No, that's not really true. This is marvelous new, very experimental work. We haven't implemented it in any of our interfaces, but I think we hope to, and Dr. Schmidt is a great collaborator and a great researcher. So I would like to end with one more thing. Next slide, please.
Amy, one more quick set of numbers. This is what my boss, Dean Downie, calls wow numbers, so you're all to be wowed. How big is the extracted features data set? Well, the HathiTrust collection is 17-plus million volumes in over 400 languages. This includes about 60% of works that are in copyright, meaning that the extracted features approach, the non-consumptive approach, is absolutely essential to presenting it.
17 million is a big number, but not nearly as big as the almost 3 billion individual tokens, that is, individual words, represented and documented in this data set. The full download: some people want the full download, people like Ben Schmidt, and they are looking at almost 4 and 1/2 terabytes of data. For most research questions,
as you may imagine, the full download is not required, but it's there for your use. I think that's a more than adequate introduction to extracted features as implemented at the HathiTrust. So now I'm very happy to turn the mic over to our colleague Matt Lincoln to talk about similar things in Constellate. Yeah, thank you so much, Glen. And these numbers are a great segue into some of the technical aspects of what I'm talking about.
As they introduced me, I'm a senior software engineer working on Constellate. And so a lot of my time is spent thinking about the engineering challenges of how we produce these, how we share these, and whether what we share is usable by researchers. So, next slide, please. So Glen just showed us that extracted features can become massively large as you end up keeping multiple representations of that same underlying text sitting around.
And so when we're trying to problem-solve around this, we find ourselves pulled in a few directions at once by sometimes competing priorities. I have a little animated image; next, please. So we need these to be understandable for beginners, right? Delivered in a well-documented, easy-to-use format that's easy for users to work with. We also want to figure out how to optimize for large amounts of data.
Can we stream this data? Can it be randomly accessed? I'll talk about that a lot in a minute. Can it be read in parallel? But then I also think about myself and my colleagues working on the production side as well. In order for these formats to keep being produced, they need to be maintainable by us, the producers. They have to continue to develop and expand with new metadata and new feature types without our needing to spend weeks or months completely rewriting our pipeline every time.
And we need to, of course, make sure that our compute, storage, and transmission costs at least remain sustainable. So I'm going to spend a minute talking a bit about our current CSV- and JSON-based solutions and then talk about what role the Parquet format, which you see on this little triangular diagram I've made, may play for our future. Next slide, please.
So these are some screencaps of what extracted features look like. We've seen some detail about what the HathiTrust ones look like; HathiTrust and Constellate both provide JSON-based features. Constellate is also designed to provide CSV, comma-separated values, features. These end up looking like a tabular format, like you would see in an Excel spreadsheet.
We do this in large part because a lot of our focus is not just advanced researchers who already have a lot of understanding of this, but new learners who are trying to pick up data and text analysis for the first time. And so in the CSV example, we have columns for document ID, unigram, and count. So each document inside this large CSV file will have thousands of rows, one for each unique unigram of that document and then a count of how many times it appears in that document. Then you've seen the JSON, JavaScript Object Notation, format; this is more complicated.
You can't open it in Excel, but it allows for nesting of complex fields, so we can begin to include bibliographic metadata alongside those extracted feature counts. All the data just being in one file is a lot of the convenience factor there, but the biggest benefit of these formats is how widespread they are, both CSV and JSON. There are a lot of good graphical user interfaces for them, and there are libraries in every programming language you have, from Fortran to R to Python, for reading and writing them. This is really important.
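(For example, here is roughly how a learner might load such files in Python with pandas. The file names, and the JSON field names beyond the document ID, unigram, and count columns just described, are hypothetical, not Constellate's exact ones.)

```python
import json
import pandas as pd

# Tabular unigram counts: one row per (document, unigram) pair.
unigrams = pd.read_csv("unigrams.csv")  # columns: documentID, unigram, count
top = (unigrams[unigrams["documentID"] == "doc-001"]
       .sort_values("count", ascending=False)
       .head(10))
print(top)

# One nested JSON record per line: metadata and feature counts kept together.
with open("documents.jsonl", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        print(doc.get("title"), doc.get("publicationYear"), len(doc.get("unigramCount", {})))
```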
Next slide, please. But here's the downside: they are very inefficient for storage and transmission. You've already seen Glen's raw numbers. It's important to try and contextualize just how big even a relatively small snapshot of documents can get in this format.
In this screenshot from Constellate, we've put together a moderately large data set: about 22,000 newspapers from the Chronicling America archive. This is not an unreasonable size for a researcher to want to use for a research question; it's not every single document in Constellate. But you'll see the n-gram CSV file is 37 gigabytes.
That is larger than a normal person working with a normal, off-the-shelf laptop is going to be able to read into Python, or even read into Excel, and do anything useful with, until they gain the right kinds of programming skills. And so this problem is really about more than just storage and bandwidth. And bear with me as we get a little bit technical here, because for both CSV and JSON, all the data for a record is contained in that same line or neighboring lines in the file,
and so in neighboring bytes in the file. And as a researcher, often I'm just interested in maybe a few of the metadata fields and maybe one or two of the types of extracted features in that record. With CSV and JSON, not only do I have to download that entire file in order to do anything useful with it, my computer will still need to scan every byte of that record on disk just to pluck out the fields of interest that I want.
And this gets very slow as soon as you enter the dozens-of-gigabytes territory, much less the terabytes that Glen was talking about with the really large data sets. This forces users to do gymnastics with their code, to actually become pretty advanced programmers in order to do the work they're doing. And oftentimes they reach for the escape hatch: just let me get a bigger and more costly computer.
But that's obviously a big accessibility barrier as well for so many researchers and learners. Next slide, please. And so we're starting to pay more attention to that priority of data efficiency, not just usability but efficiency, by looking at an open, Apache-licensed file format called Parquet. Parquet is an emerging standard for storing data in a way that favors analytical queries.
It puts columns together physically on disk rather than rows, prioritizing having like data with like on the disk, and it includes index-like metadata to help the file operate like a database. A program I write, or you write, or any user writes, can skip right to those bytes of the file that contain the records and fields of interest. Potentially, you just need a couple of megabytes of memory to get useful data out of that 20-gigabyte file.
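(A brief sketch of what that looks like in practice with the pyarrow library; the file name and column names here are hypothetical.)

```python
import pyarrow.parquet as pq

# Read only the columns of interest: the reader uses the Parquet file's own
# metadata to jump to those column chunks instead of scanning every byte.
table = pq.read_table(
    "chronicling_america_features.parquet",
    columns=["documentID", "publicationYear", "unigram", "count"],
)
print(table.to_pandas().head())

# Filters can additionally skip whole row groups whose statistics rule them out.
table_1918 = pq.read_table(
    "chronicling_america_features.parquet",
    columns=["documentID", "unigram", "count"],
    filters=[("publicationYear", "=", 1918)],
)
```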
And this can even work over HTTP; you don't even need to download the entire file in order to start using it. Next slide, please. And this is just a little bit of data on what the real numbers actually look like: a test of file sizes for writing unigrams, bigrams, and trigrams to disk in different formats.
You can see how drastic the size differences are, especially in this middle chart where we're looking at Portico data. As we discussed, a lot of this features some book-length documents, very long documents; it doesn't take a lot to get an enormous number of words in them. Much larger data sets become feasible for users to access and manage from a bandwidth, storage, and memory standpoint. The trigrams for Portico, as you saw, took almost 22 gigabytes as a CSV, but they take less than six gigabytes in Parquet, and that's even before you get to the benefits in reading and RAM usage.
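(A small, hypothetical illustration of the kind of comparison behind these charts, not the actual benchmark: write the same trigram table to both formats and compare file sizes. On real, repetitive feature tables with millions of rows the Parquet file is typically several times smaller; on a toy table this tiny the format overhead can even go the other way.)

```python
import os
import pandas as pd

# A tiny, made-up trigram table; real feature tables run to millions of rows.
df = pd.DataFrame({
    "documentID": ["doc-001"] * 3 + ["doc-002"] * 3,
    "trigram": ["jack and jill", "and jill went", "jill went up",
                "once upon a", "upon a time", "a time there"],
    "count": [1, 1, 1, 3, 3, 2],
})

df.to_csv("trigrams.csv", index=False)
df.to_parquet("trigrams.parquet", index=False)  # requires pyarrow or fastparquet

print("CSV bytes:    ", os.path.getsize("trigrams.csv"))
print("Parquet bytes:", os.path.getsize("trigrams.parquet"))
```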
So this is likewise inspired by the work of Ben Schmidt, whom Glen was talking about earlier. And so we're starting to explore what a Parquet-based data schema would be for storing and sharing non-consumptive features with users. This would let us keep our data-set-level metadata, bibliographic metadata, and feature data together in one file, or one set of files, and allow users to access just the pieces of data relevant to their research that are useful to them.
It's useful for us as well, because it means we have to spend less computing power, and frankly less engineering time, making these customized files per user, tailored exactly to what they need, just to help cut down on their computing requirements. We need to think about how to build this with the future in mind. We've heard about word counts.
We've heard about a couple of other statistical features. Researchers are going to continue coming up with new types of features that are useful to have. And so we want to make sure our schema can standardize around accommodating future new types of features: numeric vector representations used in deep learning and AI models, for example (the colorful clouds that you saw earlier are powered by just such a representation), and other features we haven't come up with yet.
But as I mentioned at the start, efficiency alone can only be one of our priorities. This all sounds great, but the downside is that these files simply require more programming knowledge to work with right now than CSV or JSON does. So we need to think strategically about not just which format, but perhaps which formats, plural, we may want to standardize around, thinking about different user needs.
And so at that point, I will hand things over to Stephen, who will talk about these bigger strategic challenges. It helps if I turn on my microphone. Yes, thank you, colleagues, for fantastic presentations outlining the background, the work so far, and the challenges. And I'm going to more or less summarize and build upon that and then, you know, set the stage for the kind audience input when we get together in February.
Hello, February. Yes, actually, I believe today is Valentine's Day, so happy Valentine's Day, everybody. Thank you for spending it with us. So what we need to do as we move forward with extracted features, to make them even more useful and more impactful to the world, whether it be scholarship or data mining or pedagogy, is to come to some realizations and then work from those realizations towards solutions, right?
Solutions and implementations that will achieve our goal of improved research, pedagogy, and scholarship. So one of the first realizations is that we're not going to preordain a perfect set of features. There are some common threads throughout the different kinds of features, but we cannot predict new features, or the research question that will need to be answered by the creation of a certain set of features.
So there is no perfect set. Sometimes we use a lot of features; sometimes we use just one little tiny feature, like page counts. That's it: that's all we use for each book, a page count. All the parts of speech, all that other stuff, is not used. So yes, no perfect set of features, which is fine.
There is no perfect file format that meets all the needs for computing efficiency and ease of use. That is absolutely true; Matt just brilliantly laid out the differences between the different formats. I know my intuition is: give me a CSV, because I can read it; I have that human-readability aspect. But when it comes to size and crunching and stuff, then that's where we say,
OK, I'll use Parquet, because I need to get this computation done, even if I'm really not sure how it works. JSON is somewhere in the middle: sometimes I can get my browser to present it prettily, sometimes it doesn't, especially if it's super complex. So yeah, no perfect file format. So we're going to need to focus on the abstract nature of the extracted features, the schema of them, be format-agnostic as much as possible, and help our users, help our communities, move from one format to another as needed.
But we can, in the text mining and digital humanities and digital library communities, and the teaching community, start to come together on common features, communicate with each other about those and about the different formats and approaches, and motivate all of that with some standard use cases, the classroom being one, or other kinds of analysis, you know, for technical papers and language modeling and so on.
So if you go to the next slide, please, and thank you. So, non-consumptive data futures. I love this title; it's a great play on words, and I'm one that enjoys a good pun. So we have some key features of our future that we're going to be looking at: interoperability, extensibility, interpretability, and reproducibility. So interoperability: I think you've got the sense that, just between the Constellate world and the HTRC world, we have similar but not 100% interlinking data.
We've actually been in communication. We have slightly different fundamental units: we have the volume with pages; they have the article from different journals as a sort of fundamental unit, or the whole volume as a fundamental unit. So we need to find ways to interoperate. And there's more to the world than just HTRC and JSTOR and Constellate.
So we need to find a way, perhaps through some standardization, hence this workshop panel presentation at NISO Plus, to start thinking about creating standards for interoperability above and beyond our two great organizations. Intertwined with that is the notion that we do not know the future of features, so we need to build in extensibility.
Other researchers need to be able to create new features that we've never, ever heard of and have them compatible with whatever it is that we come up with as our standard, as our best practice. So it has to be hospitable to unknown future things, future features; sounds like a movie, right, Feature from the Future? And then, and this is where pedagogy comes in, but also just the life span of these documents, of these features: we need to ensure the interpretability of them.
And a lot of research can actually hinge on what we mean. What do we mean by page count? What is a page? For example, at the HTRC, we don't call our objects pages. The things that we call pages aren't pages; they're actually called sequences, because they start at zero on the very first surface of the volume,
so probably the cover, and that's sequence zero. And because of that, there's, you know, an interpretability question. What do we mean by page? What do we mean by sequence? We have to document all this so that people can make judgments about what it is that they're seeing in the data, and form a proper understanding of it. And this is where your simpler formats really can help.
So your CSVs and your JSONs with the nice corresponding labels can really help towards that. However, they're not terribly efficient, as we've come to know; scale is not our friend. And of course, we have reproducibility, and reproducibility along two facets. One is that we've got to make sure that we have the software stack such that we can feel confident that the data, the features, that we've generated are correct and we can rebuild them as needed,
and so on. So there's that reproducibility, and that gives confidence for researchers that the results that they're getting make sense and are publishable and interpretable and all that good stuff. We also need this notion of reproducibility so that we can go and extract features predictably outside of the stuff we've already done. So for future books, future items from different organizations, they need to have the tools, the suite of methods, and the understanding,
so that they can create, for their volumes and their journal articles, a compatible set of extracted features data, so that we can hit that other high point that we were talking about: our goal is that users don't care where the data comes from; they just want the data. So we're not trying to create a monopoly on data provision. Actually, we'd like to share that burden, because, as Amy and Matt were saying, there is a commitment of resources, a commitment of staff, a commitment of programming and computational time to keep these things vibrant and live and to avoid bit rot, which can happen quite quickly in this fast-moving world of future features.
And so we need to have shared data, shared tools, and then, of course, better scholarship, better support for scholarship. Next slide, please. So all of this fantastic set of presentations, if I do say so myself, channels towards engaging with you, the audience, on this fine Valentine's Day. So here is what we would like to talk about, either
in the cyber world with you, or online, or in email, or in future meetings; we are going to start convening some sessions where we'd like to talk to people that have different knowledge bases along this complex problem. How do you think we should be thinking about the tension between easy-to-use file formats and best-at-scale formats?
So this is our continuum from, let's call it, raw CSV all the way to Parquet, or even some other formats that are more compressible and more computationally efficient. We are open to all suggestions, all comments. How do we harmonize different data granularities? One granularity we hadn't really talked about is the actual collection granularity, where we take the extracted features that represent, say, all the novels from England published in 1800.
So that's a broad granularity, all the way down to the features on a given page or within a given paragraph, right? So long documents, long documents are a challenge, as you know, at the article level. So this also influences at what level we count the unigrams, the bigrams, the trigrams, and so on. We also need to think about options for when we get into trigrams and higher-level n-grams, because we have issues of non-consumptiveness:
if the number of grams gets too big, then we're not being as non-consumptive as we want to be, because you can reconstruct the text from overlapping n-grams and so on, and we want to avoid that, to stay within the parameters outlined earlier by Amy and Glen. And then the last question we want you to speak with us about: how can we achieve data compatibility in the context of multiple providers, business models, and stakeholders?
So yeah, we have a great partnership going on between the HathiTrust Research Center and Constellate, with our commercial players in this space. There are other institutional players, other libraries, other museums, other countries. So we need to negotiate the great Venn diagram of stakeholders in this space. And with that, I am finished.
And so I want, again, to thank the panelists for their great presentations and their work, and our great host and our great tech staff who are helping us record this as I speak. Again, there is our contact information, and we look forward to an ongoing conversation. Thank you.