Name:
Building a Unified Product Delivery Platform, part 1
Description:
Building a Unified Product Delivery Platform, part 1
Thumbnail URL:
https://cadmoremediastorage.blob.core.windows.net/2ba79777-6e9b-4373-a45b-85267b7d9571/videoscrubberimages/Scrubber_3240.jpg
Duration:
T00H56M02S
Embed URL:
https://stream.cadmore.media/player/2ba79777-6e9b-4373-a45b-85267b7d9571
Content URL:
https://cadmoreoriginalmedia.blob.core.windows.net/2ba79777-6e9b-4373-a45b-85267b7d9571/Platform Strategies 2018 - Building a Unified Product Delivery Platform%2c Pt 1.mp4?sv=2019-02-02&sr=c&sig=cJVYEHS3FLpyCX8u5%2BKDnj%2F2bMszb866ezn%2Fk9ZNs%2BM%3D&st=2024-11-23T09%3A09%3A25Z&se=2024-11-23T11%3A14%3A25Z&sp=r
Upload Date:
2020-11-18T00:00:00.0000000
Transcript:
Language: EN.
Segment:0 .
SPEAKER: Today, we have Dr. Paul Cleverly, who is at Robert Gordon University and GeoScienceWorld. And he's going to be talking about using new techniques to analyze data and go beyond traditional search. Pierre Montagano-- did I get that right? Awesome-- Code Ocean, recent ALPSP award winner-- and he's going to talk about how code and software are becoming an integral part of the scientific record, and of reproducibility as well.
SPEAKER: Dr. John Unsworth is here with us as dean of libraries at the University of Virginia. He's going to share with us the HathiTrust and its approach to big data. And from what I've seen of the statistics, it's more like massive, massive data that the HathiTrust is managing. And Tanya Vidrevich is from XSB and SWISS. And she's going to show us a really interesting new interface and analysis tool over technical documents and technical standards as well.
SPEAKER: So to get started, Paul Cleverly from Robert Gordon. Oh, here we go. Thank you so much. Take it away.
PAUL CLEVERLY: Thank you-- thank you. I'm delighted to be here to share some thoughts on text analytics and how they could be used to augment information discovery for geoscientists. We're going to downplay the importance of the article, or the paper, or the document to focus more on the concepts within those documents and articles-- so thinking of all of the journal collections as a giant text lake, if you like.
PAUL CLEVERLY: So setting the scene, we're all aware of cognitive biases. There was some discussion about that yesterday. Individual biases, such as confirmation bias, where we seek out and cherry-pick information to justify what we already believe; social biases, such as groupthink, where we reach a decision through consensus rather than independent critical thinking-- both of these have been shown to be detrimental to decision making in business and also in science.
PAUL CLEVERLY: So then we have this big data construct. On one hand, it's capable of overwhelming information overload; on the other, fascinating levels of information discovery. Some people see big data as being about compute power. Other people see big data as more about brainpower-- what are the questions that we choose to ask of this vast resource? But either way, big data is likely to be about small patterns and how we can exploit those for commercial or humanitarian benefit.
PAUL CLEVERLY: So in amongst this large text lake, shall we say, can we apply algorithms and surface patterns that challenge what we know and show us what we don't know? That doesn't mean that if we see something that contradicts our mental models we're wrong. But simply being aware of the stacked opinions of thousands of somewhat independent authors-- the averages, the outliers, in volumes of information too vast for us to ever read-- may just challenge what we think we know.
PAUL CLEVERLY: So text and data mining is very broadly divided into three areas. I'm going to focus predominantly on unsupervised machine learning; I'll just give a couple of examples around the others. Rule-based is where we input some of our own knowledge into the corpus. I'm sure most societies are doing some element of text and data mining to surface trends and patterns.
PAUL CLEVERLY: So these may be taxonomies. They may be rules to extract integer and float numerical data. And there are some great examples in geoscience-- for example, fossil stromatolites, which are mound-like structures created by bacteria. They're being created today, and they were created in the Precambrian Era, 600 million years ago.
PAUL CLEVERLY: So people had thought that the population of stromatolites through geological time was related to mass extinction events. But using TDM techniques over vast amounts of geoscience literature, the evidence points more toward seawater chemistry. There are some papers you can read about that, which show that TDM is capable of supporting new scientific discoveries.
PAUL CLEVERLY: The theory, obviously, is never in the data. But the data can help us draw different conclusions. So that's an example where the whole-- all of the body text across all the journals-- is more than the sum of the parts. It can surface patterns that you wouldn't find in, say, an individual paper, article, or journal. Some people call this dark data. You can mine information from this.
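To make the rule-based idea concrete: a minimal sketch in Python of the kind of numeric extraction Cleverly describes, pulling integer and float values out of body text with handwritten rules. The pattern, units, and sample sentence are illustrative assumptions, not from any actual geoscience pipeline.

```python
import re

# Toy rule: capture an integer or float followed by a unit of measure.
# The unit list is illustrative; a real pipeline would draw on a
# curated taxonomy of units and entity names.
MEASUREMENT = re.compile(r"(\d+(?:\.\d+)?)\s*(m|km|Ma|ppm)\b")

sentence = "The mound-like structures are roughly 2.5 m high and date to 600 Ma."
for value, unit in MEASUREMENT.findall(sentence):
    print(float(value), unit)
# -> 2.5 m
#    600.0 Ma
```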
PAUL CLEVERLY: But predominantly, it's rules that we program. It's a deductive way of reasoning. The second element is supervised machine learning. This is where we give labeled examples of something so that we can build a statistical model-- the rules are created through math. So for example, images: if we want to recognize the image of a cat or a dog, we just supply lots of labeled images.
PAUL CLEVERLY: We did some research in geoscience on sentiment in the oil and gas industry. That involved supplying thousands of sentences extracted from literature that were related to petroleum system elements. And then we had retired geologists label each of those sentences-- of course, it could be paragraphs or documents; we chose sentences-- as positive, negative, or neutral in sentiment.
PAUL CLEVERLY: So then we built a Bayesian classifier that we could then run through vast amounts of text that can never be read. And it was actually very close to human-like accuracy. The geologists agreed with themselves about 90% of the time, which is quite high for geologists, because they're not the most agreeable of people. And I say that as a geologist. And the classifier was quite close to that.
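To make this concrete: a minimal sketch of the kind of Bayesian sentence classifier described, assuming scikit-learn. The labeled sentences here are invented stand-ins for the thousands of geologist-labeled examples.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented stand-ins for geologist-labeled sentences about
# petroleum system elements.
sentences = [
    "Excellent reservoir quality in the upper sands.",
    "Poor source rock maturity limits charge in this basin.",
    "The seal is intact across the structure.",
    "Porosity is severely degraded by cementation.",
]
labels = ["positive", "negative", "positive", "negative"]

# Bag-of-words features feeding a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(sentences, labels)

# Classify an unseen sentence drawn from the same domain.
print(model.predict(["Reservoir quality is degraded near the fault."]))
```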
PAUL CLEVERLY: And we surfaced contradictions. So what you're seeing with the little pies there, with the red and green, is sentiment for particular aspects of the petroleum system per geological area, per basin. The red is negative and the green is positive. So it's showing you contradictions, where in the literature you've got two opinions which differ around the same element in the same area.
PAUL CLEVERLY: The contradictions are great because, when we're doing things like literature review, we're always interested in contradiction-- it provides fertile ground for new theory development. And an interface like that might stimulate people to go and click and read those articles, read those paragraphs. They would never necessarily have made that query had they not been stimulated by these sorts of techniques.
PAUL CLEVERLY: And the third example is unsupervised. So here, we're letting the latent structure within text surface patterns. We're not putting any of our own deductive knowledge into the area. People often ask, which is best? It's a bit of a false argument. You use all sorts of combinations of these, depending on the task-- and they can be used one after the other in different sequences.
PAUL CLEVERLY: So I'm going to focus on just a small example of the unsupervised approach. It's targeting analogs. I've done quite a bit of interviewing of geoscientists, and one of the things that often comes back is analogs. When you're working with sparse data, you need to draw on information for something that's similar. It's a very difficult search on traditional search systems, because you don't know what they are.
PAUL CLEVERLY: And they're similar to other things like case-based reasoning and so forth. So if we look at that area, I'm just going to work through a tiny example. I think it might just help understand how simple some of these techniques are. Obviously, they can get more sophisticated. But we look at three formations. So these are entities that appear in the body text-- green formation, the [? Beano ?] formation, the [? Nemo ?] formation.
PAUL CLEVERLY: They could be products. They could be people, places, names of technology. They just happen to be geological names here. And we can define a word vector for each of those entities by the words that surround it. So we can go through all of our text and aggregate the words that occur around those entities and how often they occur.
PAUL CLEVERLY: So we're turning text into a mathematical representation. Why would we want to do that? Because then we can perform operations like similarity. We can take one vector and another vector and measure the cosine of the angle between them. The closer it is to 1, the more similar those are. So if you had a question-- which is the most similar to the green formation?-- you see here I've got three. Imagine in text, it'd be thousands of these that you'd recognize.
PAUL CLEVERLY: And then the columns-- there's three, but again, there'd be thousands of these. So the word "shoreface" occurs five times around the green formation, not at all around the [? Beano, ?] and once around [? the Nemo. ?] We simply do the calculation: 5 times 0, plus 3 times 1, plus 2 times 2. We divide that by the square root of the sum of squares for the green formation-- 5 squared plus 3 squared plus 2 squared-- times the square root of the same for the [? Beano. ?] We end up with 0.2075. Now, if we had exactly the same words in exactly the same quantities, we would end up with 1.
PAUL CLEVERLY: So we now run through the same with the green and the [? Nemo ?] formation. We get 0.733. So what we can infer is that, if you're looking for an analog-- and it's a very trivial example-- the [? Nemo ?] is a better bet than the [? Beano ?] formation. So we can take those concepts and build search-based applications where, instead of delivering documents or articles as search results, we're delivering entities-- more like answers.
PAUL CLEVERLY: So it's still search. You're still searching exactly the same text that you currently have in a traditional digital library, except you're giving back answers about which entities are similar. And there are a lot more sophisticated things you can do. But more or less, I think that example shows the concepts of how you can do that.
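To make the arithmetic concrete: a minimal sketch of the cosine similarity calculation walked through above. The count vectors are illustrative assumptions-- the exact counts on the slide are not fully recoverable from the transcript.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two word-count vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Rows: counts of context words (e.g. "shoreface", ...) around each entity.
green = np.array([5, 3, 2])
beano = np.array([0, 1, 2])
nemo  = np.array([4, 3, 1])

print(round(cosine(green, beano), 4))  # few shared context words -> lower score
print(round(cosine(green, nemo), 4))   # heavily shared context words -> higher score
```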
PAUL CLEVERLY: I've tested it with some geoscientists. And it does appear that you can, with some statistical significance, surface unexpected, insightful, and valuable knowledge that previously was unknown using that same text collection. So that's all I have. I hope that's been useful. Thank you for listening. [APPLAUSE]
SPEAKER: How about math at 9 o'clock in the morning? All right.
PIERRE MONTAGANO: A hard act to follow. So hi, [AUDIO OUT] from Code Ocean. And today, I'm going to talk a little bit about how we can interconnect different research objects to really create a more interactive, more complete scholarly record. And that has a lot of implications around reproducibility and reuse. And yesterday was very interesting-- the speakers were fantastic.
PIERRE MONTAGANO: And some of the themes that I picked up on were: what is our competitive advantage? When to build, when to borrow? And what's really interesting, as I was thinking about it, is that most of our competitive advantage, very obviously, is the fact that we have very specific and targeted content that we deliver to a particular group of people who are interested in that content.
PIERRE MONTAGANO: And as we've moved away from being just repositories of PDFs, and our platforms are able to do a lot of very interesting and very dynamic things, Code Ocean, I think, plays an important role in that-- in creating a more complete scholarly record. So there's one quote I wanted to-- two, actually, that I wanted to read. It's from David Donoho.
PIERRE MONTAGANO: And it's all the way back in 1998. And he said, "An article about computational science in a scientific publication is not the scholarship itself. It's merely advertising of the scholarship. The actual scholarship is the complete set of instructions and data which generated the figures." Right? So I don't know if I agree that the article is merely advertising, but it's an interesting perspective.
PIERRE MONTAGANO: And a recent blog post by Ben Marwick also pointed this out-- he made the analogy that the article is like a label on a bottle. So very, very important-- you want to know what's in the bottle. But most of the time, when you find the bottle you want, you want to actually get a bottle opener and get inside to the actual contents. And recently at a reproducibility conference, Victoria Stodden-- this was at Columbia, along with Code Ocean-- made two claims.
PIERRE MONTAGANO: She said that virtually all published discoveries today have a computational component, right? Both in the social sciences and in the hard sciences. And that there is a mismatch between the traditional scientific process and computation, leading to reproducibility concerns, right? So now, I'm going to talk a little bit about why code is important. So if we're curating the data-- and a lot of us now are curating the data.
PIERRE MONTAGANO: We are linking out to the actual data behind the actual article. The other part of that is, as data becomes more and more complex, as it becomes larger, researchers are deploying code in order to analyze the data. So if we're only curating the data and not what was used to analyze it, then we're only actually curating half of what's important to researchers, I feel, right?
PIERRE MONTAGANO: If researchers want to actually use the data and generate the same results and actually build upon that work, they're also going to need the code. And if you actually do a search for code-- and I invite everyone to do this on your platforms. Go onto your platforms and do a search for code. I did a quick search on ScienceDirect for Matlab. I got 200,000 results. I did that same search on Taylor & Francis.
PIERRE MONTAGANO: I got 20,000 results; Oxford, 10,000 results. And that's only one programming language, right? You're going to run into trouble if you try to search for R, which is unfortunate, because it's one of the most common programming languages, right? You're going to get a massive amount of results. Anyway, so what Code Ocean allows people to do-- oops, I jumped ahead-- is curate the code in a way that's executable.
PIERRE MONTAGANO: How about that? So what we do is we allow you to curate the code so that it's executable. So what makes us different from, let's say, something like GitHub, or just putting your code into a flat repository?
PIERRE MONTAGANO: And these are the steps that I asked our founder, Simon Adar, who is a spectral imaging engineer, to explain-- what he had to do in order to get a piece of code up and running. He was doing his postdoc. They were actually building algorithms to detect soil pollution. And they were looking at past research.
PIERRE MONTAGANO: They would come across past research, get to the paper, and find some mention of code-- or they would see the actual figures, the actual diagrams, and know that these were created using code. And then they would try to get the other researchers' code up and running, right? It was a nightmare, because researchers, in a lot of ways, aren't coders.
PIERRE MONTAGANO: So a lot of times, they're not following proper coding practices, in the sense that they're not including things like what exact version of the language they used or what packages were included. And then, of course, there's dependency hell, right? One file depends on another file, which depends on another file. And it took these researchers months to get one algorithm up and running. So what he wanted to create was a system where you were able to containerize all the code and run it by just clicking a button.
PIERRE MONTAGANO: And it really speaks to this reproducibility issue, right? And this is a slide about all of science, not just computational science. But what really threw me for a loop here was not that researchers had trouble reproducing other people's science-- it was that researchers had trouble reproducing their own, right? That really threw me for a bit of a loop.
PIERRE MONTAGANO: And the reason, over time, is that researchers aren't curating all the proper things needed for their code to run, right? So if we look at Code Ocean, it is a very friendly user interface where researchers can deposit their code files, deposit their data files, and select the proper environment. And then others can simply hit a Run button and execute the code in real time.
PIERRE MONTAGANO: It also allows others to change the input values, play with the code, and rerun it-- so not only for reproducing, but also for reuse. So here are the code files, here are the data files, and here's a sandbox window where other researchers can go in, change the code, play around with the code. And they can actually run the actual code. The reason I picked this example-- one of our founding partners was IEEE.
PIERRE MONTAGANO: And a lot of the code that's in Code Ocean comes from IEEE. But I wanted to show this example because it's a social science example. It's from Political Science Research and Methods. So this is as true for the social sciences as it is for the hard sciences. And then here is the Run button. And researchers can just run the code, and they can get the results.
PIERRE MONTAGANO: I'm actually showing you the new interface from Code Ocean. So what's our technology stack built on? We're basically built on Docker. Just as GitHub is a really friendly interface over Git, we are a really friendly interface over Docker, right? So we're containerizing everything. We work on Linux and Docker. And then on top of that, we can support basically any open source programming language, plus Stata and Matlab, and soon others to come as well.
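To make the containerization idea concrete: a minimal sketch using the Docker SDK for Python. The image tag, mount paths, and entry point are assumptions for illustration, not Code Ocean's actual internals.

```python
import docker

# Run a deposited analysis inside a pinned container image, with the
# code and data mounted read-only and results written to a separate dir.
# Image, paths, and command are hypothetical.
client = docker.from_env()
logs = client.containers.run(
    image="python:3.10-slim",          # a pinned environment stands in for a capsule
    command="python /code/run.py",
    volumes={
        "/capsules/123/code": {"bind": "/code", "mode": "ro"},
        "/capsules/123/data": {"bind": "/data", "mode": "ro"},
        "/capsules/123/results": {"bind": "/results", "mode": "rw"},
    },
    remove=True,                       # the container is disposable; results persist
)
print(logs.decode())
```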
PIERRE MONTAGANO: We support deep learning frameworks as well. And we have GPUs as well. And right now, what we're working on is for researchers-- so we're a reproducibility platform. Researchers deposit their code and data. We containerize it. It's executable at the point of the article, within the article, giving you that competitive advantage, right?
PIERRE MONTAGANO: So researchers are reading the article. They're very interested in this particular article. Now, they no longer have to go to another platform to run the code, or find the code, or see the data. They can see everything within the article page, or on a separate tab within that page, right? And that, all of a sudden, makes the content, in my opinion, a lot more valuable, right? And that's really, I feel, our true competitive advantage.
PIERRE MONTAGANO: We are working on-- our next step is to work on how we can start integrating live machines, meaning people can actually pull the code from Code Ocean into a live Jupyter environment, into RStudio, and other places, and start augmenting and playing with the code in real time, and then saving it back into their own private dashboard. So moving further down into the researcher workflow, right? But right now, we're optimized as a reproducibility platform at the point of publication.
PIERRE MONTAGANO: So this is the workflow. And I'm running out of time. But basically, authors are asked a question: do you have associated code-- yes or no-- with your research? If they answer yes, they deposit a compute capsule in Code Ocean. If you want, as a publisher, we can actually provide a private instance of that compute capsule to be shared during the peer review process.
PIERRE MONTAGANO: Now, that's very, very valuable because, at present, peer reviewers a lot of times don't engage much with the code-- it's that whole trouble I was telling you about of getting code up and running, right? And then there's a notification back to the publisher, and then a widget appears on your platform. And here's our current partner list. Thank you very much. [APPLAUSE]
SPEAKER: That was great. Tatiana, are you next? Great.
SPEAKER: And just a reminder to everyone, you can text in questions as we go, and as you think of them, to the text number that's on your programs. And we will be doing a Q&A after these. I'm sure there's a lot of questions around these thought provoking presentations. Tatiana.
TANYA VIDREVICH: OK. So I'm going to talk today about technical specifications and standards. And it is a narrow slice of publications that's of interest here, but it is a very special slice. I think we are safe, and we have cheap, good, fast products, because we all use specifications and standards in development.
TANYA VIDREVICH: Now, if you talk to any engineer, when it comes to geometric models, they all use pretty sophisticated geometric models nowadays. I was at a news conference six months ago, and one engineer said, do you remember when people were using drawings? Well, not anymore. The 3D models allow people to analyze, simulate, and make sure the product is safe and doesn't have any problems before it gets produced.
TANYA VIDREVICH: However, next to a model, quite often you see this blob of basically text. And that text specifies nongeometric properties of a part-- it specifies material, surface treatment, how you test the part. And that text is just as important. It usually references standards, which is a good thing, because, if people use the same industry-accepted standards, you are assured of a good supply chain.
TANYA VIDREVICH: And you are assured of the quality of a product, usually. However, as you can see, this text is rather cryptic. This particular line, here, means you have to go and download a PDF file and find what this line refers to, which refers to something like that within a large PDF file. Aha. Actually, I have to download another PDF file, which is [INAUDIBLE] and interpret what the concept of class 1 means there, and so on.
TANYA VIDREVICH: And basically, a design engineer has to do that. A manufacturing engineer has to do that. A vendor who coordinates a job has to do that. Test people, QA people have to do that. So every time, this information is being interpreted, and you can imagine how many problems that creates. What's more? Usually, when we think about technical standards, we think about common standards development organizations.
TANYA VIDREVICH: However, if you think about Boeing, you think they produce airplanes. They have at least 70,000 active standards documents internally that they are using. So in order to produce a part, they and their vendors have to interpret their own standards, and they have to interpret standards from external organizations. And what's more?
TANYA VIDREVICH: Design usually takes a long time in this industry. If you think about pipelines for oil and gas, a large project might take a long time, too. Well, most standards development organizations update their standards once every two years, which can be a material change to what people are doing. So not only do they have to interpret this information, they often have to reinterpret it within those two years to understand if they still comply with the code they are supposed to comply with.
TANYA VIDREVICH: And the example that I just showed requires more than 500 pages of interpretation. So actually, we talked a lot to engineers to ask them, what do you hire a standard to do? And what we realized is that a lot of organizations are trying to tell them, you'll save a lot of time if you just improve your search. But if you go back and look at the search logs of these organizations, you'll find out that, 80% to 90% of the time, engineers know the specific document they're looking for.
TANYA VIDREVICH: They type in that specific standard number. And when they don't, they actually use Google. So at least for standards, there wasn't a lot of benefit in improving search. What you really need to improve is interpretation of the standards, integration into their internal systems, and navigation between those standards. On average, if we look at the government standards, they are about 10 pages long.
TANYA VIDREVICH: Although that's just an average-- there are very long and useful standards. The most downloaded standard is more than 1,000 pages long. And on average, they reference seven other standards. So you can imagine that finding related concepts between those standards is important for navigation and for interpretation. And an ability to identify and address a specific concept in a standard, and keep track of it, is also very important.
TANYA VIDREVICH: So an engineer usually identifies those concepts, and then saves the information that they are interested in-- in their work instructions, in their spreadsheets, in their internal PLM systems. When this information changes, they are interested to know whether the particular concept they referred to has changed. Maybe the whole document changed, but that particular concept has not changed.
TANYA VIDREVICH: So it's very important to know what specifically changed and how it affects me. So then where do we go? Of course, for years-- since 1993-- PDF was the prevalent format for distributing technical standards. Currently, the key players in this space have converted all their information to XML, and they basically publish out of XML into PDF, HTML, and so on.
TANYA VIDREVICH: And that's a very good step forward. It allows you to at least connect the documents internally and understand which document references which document. However, it's not good enough if you want to see whether my own document, in my own engineering system, references a whole network of other internal and external documents, and whether the change of a document might or might not affect me.
TANYA VIDREVICH: So we put together a consortium-- a technical working group that consists of large [? Ainslie ?] manufacturers, as well as key standards development organizations. And we basically asked that question of users of standards: what do you hire a standard to do? As a result, we came to a format called SWISS, which is Semantic Web for Interoperable Specs and Standards. It relies on a linked data model that knows whom I am referencing, why I'm referencing it, and the state of that reference.
TANYA VIDREVICH: And because it has an API, it integrates into major product lifecycle management systems, and it integrates with major tools that engineers use, so engineers find it very useful. So for example, I might have my proprietary spec for a part. I might reference an ASTM standard for a material, an SAE standard for a specific process, and an ASME standard for some dimensions.
TANYA VIDREVICH: If information in one of these standards changes, or a standard gets withdrawn, it's very important for me to know that. Through the linked data model, we are basically aware of what specifically changed, what concept changed, and how it affects all the information in my proprietary system. So we basically call these smart, connected documents. And-- [APPLAUSE]
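To make the linked-data idea concrete: a minimal sketch of a concept-level reference record, serialized from Python. The field names and identifiers are illustrative assumptions, not the actual SWISS schema.

```python
import json

# Hypothetical concept-level reference: my spec points at a specific,
# addressable concept inside an external standard, not just the document.
reference = {
    "@id": "urn:example:acme-spec-42#surface-treatment",
    "referencesConcept": "urn:example:astm-b117#salt-spray-test",
    "why": "corrosion test method for the coated part",
    "referencedVersion": "2016",
    "status": "current",  # would flip to "superseded" when the standard changes
}
print(json.dumps(reference, indent=2))
```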
JOHN UNSWORTH: [INAUDIBLE]
SPEAKER: Yes. The clicker.
JOHN UNSWORTH: Is it working or not? In typical academic fashion, I have twice as many slides as I should, and they're full of text. So you can download them. I'm going to blow through the first half of this presentation, which is mostly for the record, to get to the part that's interesting. So this is about the HathiTrust and its Research Center.
JOHN UNSWORTH: Hathi means elephant in Hindi. And their motto is, there's an elephant in the library. And there is an elephant in the library. That's the mission of the HathiTrust, which is the parent organization for the Research Center. Here is the data in the HathiTrust. It's a lot. Five billion pages is probably the salient figure here.
JOHN UNSWORTH: And that number is updated every day. The other salient number here is that about 65% of that content is in copyright. It covers all of the domains of knowledge in book form in research libraries. It has complex copyright conditions, and the HathiTrust has been managing those very carefully for its members.
JOHN UNSWORTH: The Research Center is the research arm of the HathiTrust. And the purpose of the Research Center is to provide computational access to all of that content that's on that previous slide. This effort started in 2008. I actually launched the Research Center when I was at the University of Illinois, as dean of their library school, and reached out to Indiana, which backs up the HathiTrust data from Michigan.
JOHN UNSWORTH: And so this is an Illinois and Indiana effort. And I'm still on the steering committee, because they can't figure out how to get rid of me. The Research Center was formally established in 2011. That's the mission of the Research Center. You may have heard the term non-consumptive research. This comes out of the attempted Google Books settlement with publishers. But we have actually carried it forward as the guidelines for what we're doing here.
JOHN UNSWORTH: Basically, it means you can do computational work with this, but you can't take the stuff away. You can't republish it. And you can't really use it outside of its environment. So in order to do that, we need to bring the computation to the data. We can't move the data. Since 2011, the Research Center has been developing services and tools that allow researchers to employ text and data mining methodologies on that collection.
JOHN UNSWORTH: And we've been working, so far, with the part of that collection that's out of copyright. But last week, that changed. So last week, we began to provide computational access to all of the content, including the copyrighted content, to people at member institutions of the HathiTrust. Basically, what that looks like is this. There are three approaches. So HTRC Analytics, if you go and find it on the web, gives you web-based tools that you can use to do things like Ngram analysis.
JOHN UNSWORTH: You don't get access to the data at all, really. And what's underneath those analytics is not the data, but something called extracted features, which I'll come back to. But basically, that's statistically derived information at the page level that cannot be reassembled into the underlying text. Feature extraction services are what provide those extracted features data sets.
JOHN UNSWORTH: And we can actually allow people to download those data sets, because they can't be reassembled into the underlying text. And what happened last week was we started to provide people with access to something we call the data capsule. This is a virtual machine on which you can load software to run against the data. And it operates in one of two modes.
JOHN UNSWORTH: It's either in Maintenance mode, where you can SSH into it and load it up with software and do things; or it's connected to the data. When it's connected to the data, you can't get anything off of that virtual machine. You have a terminal on the machine, a virtual terminal. But you're not-- you can't get from there to anywhere except the data. And when you're finished with your computation, your results go basically into a holding tank.
JOHN UNSWORTH: So when you turn the machine back into Maintenance mode, you don't have your results yet. Those results are being reviewed by a human being before they're released to you. And we're reviewing them, basically, to make sure that you're not expropriating large amounts of text. The web-based portal gives you algorithms that are click-and-run tools to provide [? canned ?] analysis.
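To make the two-mode design concrete: a toy model of the data capsule policy Unsworth describes. The class and method names are invented for illustration.

```python
from enum import Enum

class Mode(Enum):
    MAINTENANCE = "maintenance"  # SSH in, install software; no data attached
    SECURE = "secure"            # data attached; no way to copy anything out

class DataCapsule:
    def __init__(self):
        self.mode = Mode.MAINTENANCE
        self.pending_results = []  # held for human review, never released directly

    def connect_data(self):
        self.mode = Mode.SECURE

    def submit_results(self, files):
        # Results go to a holding tank; a reviewer checks that no large
        # spans of copyrighted text are being expropriated before release.
        if self.mode is not Mode.SECURE:
            raise RuntimeError("results can only be produced against the data")
        self.pending_results.extend(files)

capsule = DataCapsule()
capsule.connect_data()
capsule.submit_results(["term_counts.csv"])
print(capsule.pending_results)  # released only after human review
```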
JOHN UNSWORTH: The extracted features data set allows you to do more things with more actual data. And then we have some sample tools, like Bookworm, to do word trends. The analytics for members-- that started this past week-- is the secure computational environment that I've been talking about. It took three years to get that architecture approved by the information security people, the general counsels, and so on at Illinois, Indiana, and Michigan.
JOHN UNSWORTH: That was three years of pure bureaucracy to make this possible. Right now, that computational access through the data capsule is only available to HathiTrust member-affiliated researchers, partly because we expect significant demand. And our computational capacities-- well, based at Big Red in Indiana-- are still limited in some ways. And there is a lot of data, potentially, to work with.
JOHN UNSWORTH: So we're trying to keep a throttle on that. There are about 130 member institutions under this link, mostly in the US, but some in other countries-- Canada and Australia, and also, I think, in Europe now. So our purpose in the HathiTrust writ large is to allow lawful research and educational uses of this collection. And there is a non-consumptive research policy, which you can read to see the details of what we think we're allowing people to do and what we don't want people to do.
JOHN UNSWORTH: And you can see those services embodied in HTRC Analytics, which is available on the web. The next project-- and now I'm talking about something that is basically my own research project, self-funded with my startup funds at the University of Virginia. And I'm working with JSTOR and Portico staff to experiment with enabling distributed text mining across the book material that's in HTRC and the journal material that's in Portico.
JOHN UNSWORTH: We're working with only 10 publishers who've agreed to be part of this pilot. And we're working with just biological data-- just a subset of the data that people have approved. And we're doing this because the problem that HTRC has been working on-- how to provide computational access to copyrighted data-- is larger than HTRC's collection. So we have, out there beyond HTRC, lots of other copyrighted data.
JOHN UNSWORTH: And if I'm a researcher, I'm interested not only in book material, not only in journal material. I'm not only interested in one publisher's material. I'm interested in the subject area that cuts across those things. And so we need to figure out how to provide people computational access to distributed copyrighted data. Portico has a set of agreements with its publishers, which it can't-- doesn't want to renegotiate and doesn't want to violate.
JOHN UNSWORTH: And that involves not moving those publishers' data out of their environment. HTRC has the same requirement. The Hathi Trust is not going to let us ship data around the world. So what we're trying to figure out is, can we allow people effectively to do text mining when you can't bring the data together? Now, we're not the first people to worry about this problem.
JOHN UNSWORTH: Google's been here before. There's MapReduce. There are tools and techniques out there that I think can enable us to do this, but it's still experimental from our point of view. So I'm happy to report, actually, that, a couple of weeks ago, we managed to do extracted features that are interoperable across the journal and the book content, which means we had to harmonize our metadata across these two very different kinds of collections.
JOHN UNSWORTH: And that's actually the major accomplishment. To have done that means that we can now begin to think about this next step of distributed text mining. Thank you. [APPLAUSE]
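To make the distributed idea concrete: a minimal MapReduce-style sketch in which each repository computes local, non-consumptive aggregates, and only those aggregates travel. The repositories, documents, and counts are invented for illustration.

```python
from collections import Counter

# "Map" step: each repository computes word counts locally and ships out
# only the aggregates, never the underlying copyrighted text.
def local_counts(documents):
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return counts

htrc_books = ["the stromatolite record", "the fossil record"]
portico_journals = ["stromatolite growth and seawater chemistry"]

# "Reduce" step: merge the per-repository aggregates centrally.
merged = local_counts(htrc_books) + local_counts(portico_journals)
print(merged.most_common(3))
```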
SPEAKER: Great. Thank you all. We're now open for questions from the audience. I have a few, but I promise, if you raise your hand and go live, you get to jump to the front of the line. So it looks like Alison here. Do we have a microphone in the crowd? No? I'm moderating and passing microphones.
PIERRE MONTAGANO: Phil Donahue.
AUDIENCE: Thank you, Oprah. John, this question's for you, just about the Portico-JSTOR aggregated data mining. So our journal content is in Portico and JSTOR. I don't know exactly what our licensing agreement says with them. But how are you going to navigate the different licensing agreements for the different publishers' material that are in those repositories?
JOHN UNSWORTH: The point is actually doing this in a way that doesn't require renegotiating those agreements. The publishers who are participating in the pilot said, it's OK. This is not going to be open to the public at this point. This really is a science experiment. And so at this point, what we're doing is working with publishers who are interested in providing text and data mining services to their users, because they're getting requests for that from their users.
JOHN UNSWORTH: And what we're trying to figure out is, precisely because it would take an infinite amount of time to renegotiate every one of those agreements, and even to work with individual publishers-- so working with Portico and within their existing agreements is the point. If people don't want to participate, obviously they won't. But if we can figure out how to do this, then those who do want to participate, we're not taking their data out of Portico.
JOHN UNSWORTH: We're not, you know, allowing people to expropriate data from Portico. So the point is actually not putting ourselves in a position where those agreements would need to be renegotiated.
AUDIENCE: I have a question for Code Ocean. You mentioned, rightly, that dependency capture is a significant problem. When you were presenting, I was looking for where the dependency capture was happening. If you'll let me upload my code, I'll upload whatever I've got. The dependencies that go with that-- how do you capture those in a way that will survive a long enough time period?
PIERRE MONTAGANO: So when you upload your code-- you're talking about dependencies in terms of other files? Yeah, right. So there is an interface-- we also have an Environment tab that I didn't show, where you can go in and select all the different packages, startup scripts, anything that you're particularly using. So maybe I should have shown the Environment tab. We have, basically, a package manager as part of our Environment tab.
PIERRE MONTAGANO: And then they can select the version of that particular package as well.
AUDIENCE: [INAUDIBLE]
PIERRE MONTAGANO: All the different things, yeah. So anything that's open source, we support right now, along with some other vendors that we've made agreements with as well.
SPEAKER: Any other questions in the audience? I have a few here that have been texted in. We'll go to one of those. This is a question for Pierre and Paul. How do you see TDM machine learning and these new search techniques and data repositories converging to benefit researchers in the future? Any specific use cases that publishers here should be considering for these techniques on their platforms?
PAUL CLEVERLY: I think there's going to be a multitude of interfaces. I think it's probably unlikely that you'll be able to have one interface that will handle all the different questions that researchers or business users will want. So I see that, as well as the traditional mechanism to find articles, the digital library, there could be an explosion of other interfaces which address specific tasks, which are related to each discipline.
PAUL CLEVERLY: So I don't see it as an either/or. I think it's going to be an "and" for these.
PIERRE MONTAGANO: Yeah, there are a lot of data repositories out there, and a lot of them are subject-specific. But at least the numbers are indicating that there are not a lot of people going out and pulling those data sets or checking those data sets. So I do think tools like Code Ocean, where people can extract the data out of a repository and then bring it into a cloud-based environment, where you can then start to do some analysis around the data and make the data more operable--
PIERRE MONTAGANO: I think that that's key right now. I think that that's, in my opinion-- and it's just an opinion. Right now, I think a lot of data repositories are not getting a lot of usage only because the tools around analyzing the data aren't joined up with the actual data repository. And I think that the logical place for all that to live is actually at the point where the research is curated, which is in the digital libraries.
SPEAKER: Thank you. Still looking for questions from the audience. We have plenty here on the iPad. This one is for Tanya. How specific to technical standards or engineering standards is the SWISS technology? Or does it apply to all connected document systems?
TANYA VIDREVICH: There is something special about technical standards in that they change a lot. Usually, when you publish something-- an article in a journal-- it usually doesn't change. However, having said that, an ability to find connected concepts across multiple documents and navigate through those connected concepts, I think, is appropriate to a lot of publications.
SPEAKER: OK. Mark, is there a mic? You're getting this mic. You can keep it.
AUDIENCE: So I am a publisher that has a large corpus of copyrighted material that I don't want to just put out there. We actually are pretty generous with researchers who come and approach us. But let's say I wanted to go one step down and make available the content that would be good for text and data mining, but in a way that doesn't reconstruct the text. What set of files or whatever would I want to make available? Is it just Ngrams?
AUDIENCE: Is there a standard set of extracted things that would come from the text that you would want to see publishers just make available that could be pulled down? Or is it really, if you don't have the full text, then you're doing it wrong?
JOHN UNSWORTH: Well, one of the things that motivates the experiment with the JSTOR and HTRC content is, at that scale, if we can figure out methods that work and don't imperil copyrighted content, then your content, wherever it lives, could present itself in that way for those services. But basically, my first answer would be have a look at the extracted features data set.
JOHN UNSWORTH: And what we're extracting is part of speech information, individual words in alphabetical order, page level data, with which you can actually do a lot. You can't do sentence level things, obviously, because they're not assembled into sentences. So you can't do anything syntactic. It's limited semantic kinds of research, but you can still do some useful things with that extracted features data set.
JOHN UNSWORTH: And if this works, then other people can create this. And, in fact, you might upload data and have it returned to you as extracted features.
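To make the extracted-features idea concrete: a schematic sketch of what page-level features might look like, loosely modeled on the description above (per-word, per-part-of-speech counts in alphabetical order). The exact field names here are illustrative assumptions, not the actual HTRC format.

```python
# One page's worth of "extracted features": per-word counts with
# part-of-speech tags, alphabetized, with no word order preserved --
# enough for many analyses, but not reassemblable into the text.
page_features = {
    "seq": 42,
    "tokenCount": 310,
    "tokens": {  # word -> {POS tag -> count}; only a few tokens shown
        "chemistry": {"NN": 2},
        "seawater": {"NN": 3},
        "the": {"DT": 21},
    },
}

# You can do lexical work (frequencies, trends) with the counts...
print(sum(sum(tags.values()) for tags in page_features["tokens"].values()))
# ...but nothing syntactic: sentences cannot be rebuilt from this.
```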
SPEAKER: Great. One more question coming from the audience here on the text. This is probably relevant for a lot of our audience. And that is, how do small publishers and/or society publishers use these concepts to their and their users' advantage versus the big data or the more aggregated view? So I don't know who wants to take that one. Paul?
PAUL CLEVERLY: Yeah. I mean, with text and data mining, I always think of it as two main areas. One is you could do broad and thin, where you extract all of your text. And you can put that in the cloud, or you could put it somewhere else on a secure server, and then just allow people programmatic access to it. So you're not predefining the certain concepts that people want to see in co-occurrence.
PAUL CLEVERLY: A lot of text and data mining is counting how many times this occurs with this, or other things like that. So that would be a flexible broad and thin. But then you could also use open source tools to build your own applications to target repetitive tasks that are relevant for your specific domain. So I think those two-- one is a bit more involved, where there's a bit more technology play.
PAUL CLEVERLY: The other is more that you're just facilitating a set of hypotheses or questions that people may wish to query. But if you already have the XML-- and even if you have PDF, you can convert it to text-- it isn't that difficult to put all of that data there and call it a lake of sorts, and then just allow programmatic access to that, obviously either as part of existing subscriptions, or potentially as a new option to offer to subscribers.
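To make the broad-and-thin counting concrete: a minimal sketch of co-occurrence counting ("how many times this occurs with this") over a toy text lake. The terms and documents are invented.

```python
from itertools import combinations
from collections import Counter

# Toy "text lake": count how often pairs of terms co-occur in a document.
documents = [
    "stromatolite seawater chemistry precambrian",
    "stromatolite extinction record",
    "seawater chemistry record",
]

cooccurrence = Counter()
for doc in documents:
    terms = sorted(set(doc.split()))          # unique terms, stable pair order
    cooccurrence.update(combinations(terms, 2))

print(cooccurrence[("chemistry", "seawater")])  # -> 2 (documents 1 and 3)
```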
TANYA VIDREVICH: From the SWISS perspective, since we assign unique IDs to every concept in a document, we know how often a concept is addressed, basically, and checked. And we know, basically, the usage of that concept, which, I think, is very helpful for a publisher. And we also know who reaches out to understand the information for that concept, which means, if information is being copied out and then legally transmitted to somebody else, we know when it happens.
TANYA VIDREVICH: And as a result, you could basically increase the user base for those concepts.
SPEAKER: We have time for one more question. Thank you.
AUDIENCE: Thank you. I was curious. Code Ocean again, not surprising given that's what my background is-- I was curious. How often do you see people play with code snippets or segments that have been deposited along with papers? I'm just curious.
PIERRE MONTAGANO: So that's where I think it starts to get interesting, when you start to look at metrics that are indicators of real engagement-- not just time-in-environment and vanity metrics, right? And so one thing we are seeing is that, of the compute capsules that we do have on Code Ocean, 75% are capsules where someone has come along and changed or augmented the code in some way-- it's a little less for the data-- and rerun the code, right?
PIERRE MONTAGANO: And what's a really positive indicator that we're seeing now is we have a duplicate function that allows you to duplicate the code and keep, obviously, the attribution back to the original researcher and save it to your own dashboard. Now, that's an indicator-- for me, it's an indicator of reuse eventually down the line. And it's only something we're going to see materialize a year or two years down the line.
PIERRE MONTAGANO: But that number is 40%-- 40% of our compute capsules on Code Ocean have been duplicated by one or more researchers. So we are seeing metrics that are very promising, above and beyond, let's say, just static data repositories.
SPEAKER: There's a follow-up here from the iPad, which is, how does copyright impact Code Ocean? And if the code is linked to an article that's published in a standard subscription model, what are the rights and access around that code?
PIERRE MONTAGANO: We don't own the code. Researchers assign their own license to the code. We're an open access platform. Anyone can go in and view the code, they can download the code. But they can assign-- we default to CC0 and MIT for code and software. But researchers are more than welcome to associate any copyright agreement. And then we also have the ability-- anyone can port out.
PIERRE MONTAGANO: And it's made very explicit to the researcher. Anyone can download the code and data from our site.
SPEAKER: Great. Well, I'd just like to thank our panel again for such a thought provoking-- [APPLAUSE]