Name:
                                Building a Unified Product Delivery Platform, part 1
                            
                            
                                Description:
                                Building a Unified Product Delivery Platform, part 1
                            
                            
                                Thumbnail URL:
                                https://cadmoremediastorage.blob.core.windows.net/2ba79777-6e9b-4373-a45b-85267b7d9571/videoscrubberimages/Scrubber_3240.jpg
                            
                            
                                Duration:
                                T00H56M02S
                            
                            
                                Embed URL:
                                https://stream.cadmore.media/player/2ba79777-6e9b-4373-a45b-85267b7d9571
                            
                            
                                Content URL:
                                https://cadmoreoriginalmedia.blob.core.windows.net/2ba79777-6e9b-4373-a45b-85267b7d9571/Platform Strategies 2018 - Building a Unified Product Delivery Platform%2c Pt 1.mp4?sv=2019-02-02&sr=c&sig=NW2JzlfJDkVE7rzgiobdFstm7%2BZLshRaxDuDp1vvsYo%3D&st=2025-10-31T21%3A17%3A40Z&se=2025-10-31T23%3A22%3A40Z&sp=r
                            
                            
                                Upload Date:
                                2020-11-18T00:00:00.0000000
                            
                            
                                Transcript:
                                Language: EN. 
Segment:0 . 
  
SPEAKER: Today, we have Dr. Paul Cleverly,  who is at Robert Gordon University and GeoScienceWorld.  And he's going to be talking about using  new techniques to analyze data and go  beyond traditional search.  Pierre Montagano-- did I get that right?  Awesome-- Code Ocean, recent ALPSP award winner--  and he's going to talk about how code and software is becoming  an integral part of the scientific record,  and about reproducibility as well.   
SPEAKER: Dr. John Unsworth is here with us as dean of--  he's dean of libraries at University of Virginia.  He's going to share with us the Hathi Trust and its approach  to big data.  And from what I've seen from the statistics,  it's more like massive, massive data  that the Hathi Trust is managing.  And Tanya Vidrevich is from XSB and SWISS.  And she's going to show us a really interesting  new interface and analysis tool over technical documents  and technical standards as well.   
SPEAKER: So to get started, Paul cleverly from Robert Gordon.   Oh, here we go.   Thank you so much.  Take it away.   
PAUL CLEVERLY: Thank you--  thank you.  I'm delighted to be here to share some thoughts on text  analytics and how that could be used to augment information  discovery for geoscientists.  We're going to downplay the importance of the article,  or the paper, or the document to focus  more on the concepts within those documents and articles,  so thinking of all of the journal collections  like a giant text lake, if you like.   
PAUL CLEVERLY: So setting the scene, we're all aware of cognitive biases.  There was some discussion about that yesterday.  Individual biases, such as confirmation bias,  where we seek out, we cherrypick information  to justify what we already believe;  social biases, such as a groupthink, where we can reach  a decision through consensus rather  than stacked independent critical thinking--  both of these have been seen to be  detrimental to decision making in business  and also in science.   
PAUL CLEVERLY: So then we have this big data construct.  So on one hand, capable of overwhelming information  overload; on the other, fascinating levels  of information discovery--  some people see big data as being about compute power.  Other people see big data as more about brainpower.  What are the questions that we choose  to ask of this vast resource?  But either way, big data is likely to be  about small patterns and how we can  exploit those for commercial or humanitarian benefit.   
PAUL CLEVERLY: So in amongst this large text lake, shall we say,  can we apply algorithms and surface patterns  that challenge what we know and show us what we don't know?  That doesn't mean that if we see something that contradicts  our mental models that we're wrong, but just simply  being aware of stacked opinion of thousands of authors,  somewhat independent.  The averages, the outliers, and volumes of information  too vast for us to ever read may just  challenge what we think we know.   
PAUL CLEVERLY:  So text and data mining are very broadly  divided into three areas.  So I'm going to focus predominately  on unsupervised machine learning.  I'll just give a couple of examples around the others.  So rule-based is where we input some of our own knowledge  into the corpus.  I'm sure most societies are doing  some element of text and data mining  to surface trends and patterns.   
PAUL CLEVERLY: So these may be taxonomies.  They may be rules to extract integer and float  numerical data.  And there's some great examples in geoscience,  for example of fossil stromatolites, which  are mound-like structures created by bacteria.  They're being created today.  They were created in the Precambrian Era,  600 million years ago.   
PAUL CLEVERLY: So people had thought that the population of stromatolites  through geological time was related to mass extinction  events.  But by using TDM techniques and vast amounts  of geoscience literature, it points more  to potentially seawater chemistry.  So there's an example of some papers  that you can read about that, that  show that it's capable of supporting  new scientific discoveries.   
PAUL CLEVERLY: The theory, obviously, is never in the data.  But the data can help us draw different conclusions.  So there's an example where the whole--  all of the body text amongst all the journals--  is more than the sum of the parts.  It can surface patterns that you wouldn't find in, say,  an individual paper, article, or journal.  Some people call this dark data.  You can mine information from this.   
PAUL CLEVERLY: But predominately, it's rules that we program.  It's a deductive way of reasoning.  The second element is supervised machine learning.  So this is where we give labeled examples of something  so that we can build a statistical model for the rules  to be created through math.  So for example, images--  we want to recognize the image of a cat or a dog,  we just supply lots of labeled images.   
PAUL CLEVERLY: We did an example, some research, in geoscience  on sentiment in the oil and gas industries.  That was supplying thousands of sentences extracted  from literature that were related to petroleum system  elements.  And then we had retired geologists  label each of those sentences-- of course,  it could be paragraphs of documents.  We chose sentences-- whether they were positive, negative,  or neutral as a sentiment.   
PAUL CLEVERLY: So then we built a Bayesian classifier  that we could then run through vast amounts of text that  can never be read.  And it was actually very close to human-like accuracy.  The geologists agreed with themselves about 90%  of the time, which is quite high for geologists,  because they're not the most agreeable of people.  And I say that as a geologist.  And the classifier was quite close to that.   
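The classifier described above can be sketched in a few lines. This is an illustrative reconstruction, not the study's actual model: the training sentences and labels below are invented stand-ins for the expert-labeled petroleum-system data, and the classifier is a plain multinomial naive Bayes with add-one smoothing.

```python
import math
from collections import Counter, defaultdict

# Invented stand-ins for the expert-labeled petroleum-system sentences.
TRAIN = [
    ("excellent reservoir quality with high porosity", "positive"),
    ("good source rock maturity observed", "positive"),
    ("poor seal integrity and breached trap", "negative"),
    ("reservoir absent and no charge identified", "negative"),
    ("sandstone interval of uncertain thickness", "neutral"),
]

def train_nb(examples):
    """Count words per label for a multinomial naive Bayes model."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    label_counts = Counter()            # label -> number of sentences
    vocab = set()
    for text, label in examples:
        words = text.split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return word_counts, label_counts, vocab

def classify(text, word_counts, label_counts, vocab):
    """Pick the label maximizing log P(label) + sum of log P(word|label)."""
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total)
        n_words = sum(word_counts[label].values())
        for word in text.split():
            if word in vocab:  # add-one smoothing over the vocabulary
                score += math.log((word_counts[label][word] + 1)
                                  / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train_nb(TRAIN)
print(classify("poor seal integrity", *model))  # prints the predicted label
```

In practice such a model is trained on thousands of labeled sentences and then run over text collections far too large to read, as the talk describes.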
PAUL CLEVERLY: And we surface contradictions.  So what you're seeing with the little pies  there, with the red and green, is  for particular aspects of petroleum system  per geological area, per basin.  The red is negative and the green is positive.  So it's showing you contradictions  where in the literature you've got  two opinions which differ around the same element  in the same area.   
PAUL CLEVERLY: The contradictions are great, because, when  we're doing things like literature review,  we're always interested in contradiction,  because it provides good fertile ground for new theory  development.  And the interface might stimulate that, stimulate  people to go and click and read those articles,  read those paragraphs.  They would never necessarily  have made that query had they not been stimulated  with these sorts of techniques.   
PAUL CLEVERLY: And the third example is unsupervised.  So here, we're letting the latent structure  within text surface patterns.  So we're not putting any of our own deductive knowledge  into the area.  People often say, which is best?  It's a bit of a false argument.  You use all sorts of combinations of these, depending on--  and they can be used one after the other  in different sequences.   
PAUL CLEVERLY: So I'm going to focus-- we just have a little small example  of the unsupervised.  It's targeting analogs.  So I've done quite a bit of interviewing of geoscientists.  And one of the things that often comes back is analogs.  When you're working [INAUDIBLE] sparse data,  you need to draw on information for something that's similar.  It's a very difficult search on traditional search systems,  because you don't know what they are.   
PAUL CLEVERLY: And they're similar to other things  like case-based reasoning and so forth.  So if we look at that area, I'm just  going to work through a tiny example.  I think it might just help understand how simple some  of these techniques are.  Obviously, they can get more sophisticated.  But we look at three formations.  So these are entities that appear in the body text--  green formation, the [? Beano ?] formation,  the [? Nemo ?] formation.   
PAUL CLEVERLY: They could be products.  They could be people, places, names of technology.  They just happen to be geological names here.  And we can define a word vector for each of those entities  by the words that surround it.  And so we can go through all of our text  and build up and aggregate how many--  the words that occur around those entities  and how often they occur.   
PAUL CLEVERLY: So we're turning text into a mathematical representation.  Why would we want to do that?  Because then we can perform operations like similarity.  So we can say one vector, we can take another vector,  we can measure the cosine of the angle.  The closer it is to 1, then the more similar those are.  So if you had a question, which is the most similar  to the green formation?  You see here I've got three-- imagine in text,  it'd be thousands of these that you'd recognize.   
PAUL CLEVERLY: And then the columns, there's three.  But again, there'd be thousands of these.  So the word, "shoreface," occurs five times around the green  formation, not at all around the [? Beano, ?] or one around  [? the Nemo. ?] We simply just do the calculation--  5 times 0 plus 3 times 1 plus 2 times 2.  We divide that by the square root of the green formation,  5 squared plus 3 squared plus 2 squared  times the square root of the [? Beano. ?]  We end up with 0.2075.  Now, if we had exactly the same words and exactly  the same quantities, then we would end up with 1.   
PAUL CLEVERLY: So we now run through the same with the green  and the [? Nemo ?] formation.  We get 0.733.  So what we can infer is that, if you're looking for an analog--  and it's a very trivial example--  you would say that the [? Nemo ?] is a better bet than  the [? Beano ?] formation.  So we can take those concepts and we  can build search-based applications  where, instead of delivering documents or articles of search  results, we're delivering entities, so more like answers.   
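The arithmetic just described is cosine similarity between word-co-occurrence count vectors. Here is a minimal sketch; the transcript doesn't give the slide's full vectors, so the counts below are illustrative placeholders rather than the actual numbers.

```python
import math

# Illustrative co-occurrence vectors: counts of context words (e.g.,
# "shoreface") around each formation name.  Real collections would have
# thousands of entities and thousands of context-word columns.
vectors = {
    "Green": [5, 3, 2],
    "Beano": [0, 1, 2],
    "Nemo":  [4, 3, 1],
}

def cosine(a, b):
    """Cosine of the angle between two count vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Rank candidate analogs by similarity to the Green formation.
target = vectors["Green"]
ranked = sorted(
    (name for name in vectors if name != "Green"),
    key=lambda name: cosine(target, vectors[name]),
    reverse=True,
)
print(ranked)  # most similar analog first
```

With identical vectors the score is exactly 1; the ranking, not the absolute value, is what drives the "which formation is the better analog?" answer.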
PAUL CLEVERLY: So it's still search.  You're still searching exactly the same text  that you currently have in a traditional digital library,  except you're giving answers back of which  are the similar entities.  And there's a lot more sophistication things  you can do.  But more or less, I think that example shows just the concepts  of how you can do that.   
PAUL CLEVERLY: I've tested it with some geoscientists.  And it does appear that you can, with some statistical  significance, surface unexpected, insightful, and  valuable knowledge that previously was unknown using  that same text collection.  So that's all I have.  I hope that's been useful.  Thank you for listening.  [APPLAUSE]    
SPEAKER: How about math at 9 o'clock in the morning?  All right.   
PIERRE MONTAGANO: A hard act to follow.  So hi, [AUDIO OUT] from Code Ocean.  And today, I'm going to talk a little bit about how  we can interconnect different research objects to really  create a more interactive, a more complete scholarly record.  And then that has a lot of implications  around reproducibility and reuse.  And it was very interesting yesterday talking about--  the speakers were fantastic.   
PIERRE MONTAGANO: And some of the themes that I picked up  on are like, what is our competitive advantage?  When to build, when to borrow?  And what's really interesting is,  as I was thinking about it, I think most  of our competitive advantage, very obviously, is the fact  that we have very specific and targeted content  that we deliver to a particular group of people who  are interested in that content.   
PIERRE MONTAGANO: And as we've moved away from becoming  just repositories of PDFs and really our platforms  are being able to do a lot of very interesting and very  dynamic things, Code Ocean, I think, plays an important role  in that, in creating a more complete scholarly record.  So there's one quote I wanted to-- two, actually,  that I wanted to read.  This is from-- it's from David Donoho.   
PIERRE MONTAGANO: And it's all the way back in 1998.  And he said, "An article about computational science  in a scientific publication is not the scholarship itself.  It's merely advertising of the scholarship.  The actual scholarship is the complete set  of instructions and data which generated the figures."  Right?  So I don't know if I agree that the article is merely  advertising, but it's an interesting perspective.   
PIERRE MONTAGANO: And a recent blog post by Ben Marwick  also pointed out-- he made the analogy,  the article is like a label on a bottle.  So very, very important-- you want  to know what's in the bottle.  But most of the time when you find the bottle you want,  you want to actually get a bottle opener  and get inside to the actual contents.  And recently at a reproducibility conference,  Victoria Stodden--  this was at Columbia along with Code Ocean--  made two claims.   
PIERRE MONTAGANO: And she said, virtually all published discoveries  today have a computational component, right?  Both in the social sciences and in the hard sciences.  And that there was a mismatch between  the traditional scientific process and computation,  leading to reproducibility concerns, right?  So now, I'm going to talk a little bit  about why code is important.  So if we're curating the data--  and a lot of us now are curating the data.   
PIERRE MONTAGANO: We are linking out to the actual data  behind the actual article--  the other part of that is, as data becomes more and more  complex, as it becomes larger, researchers are deploying code  in order to analyze the data.  So if we're not also curating what  was used to analyze the data and only the data,  then we're only actually curating  half of what's important, I feel, to researchers, right?   
PIERRE MONTAGANO: If researchers want to actually use the data  and generate the same results and actually  build upon that work, they're also going to need the code.  And if you actually do a search for code--  and I invite everyone to do this on your platforms.  Go onto your platforms and do a search for code.  I did a quick search on ScienceDirect for MATLAB.  I got 200,000 results.  I did that same search on Taylor & Francis.   
PIERRE MONTAGANO: I got 20,000 results; Oxford, 10,000 results.  And that's only one programming language, right?  You're going to run into trouble if you try to search for R,  which is unfortunate, because it's one of the most common  programming languages, right?  You're going to get a massive amount of results.  Anyway, so what Code Ocean allows people to do--  oops, I jumped ahead--  is just-- what we allow you to do  is curate the code in a way that's executable.   
PIERRE MONTAGANO:  How about that?   So what we do is we allow you to curate the code so  that it's executable.  And so what makes us different from,  let's say, something like GitHub,  or just putting your code into a repository,  in a flat repository?   
PIERRE MONTAGANO: And these are the steps that I asked my founder, Simon Adar,  who is a spectral imaging engineer,  to explain what he had to do in order to get  a piece of code up and running.  So he was doing his postdoc.  He was studying-- he was trying--  they were actually building algorithms  to detect soil pollution.  And they were looking at past research.   
PIERRE MONTAGANO: They would come to past research,  they would get to the paper, they  would find some mention of code, or they would  see that the actual figures--  the actual diagrams-- they knew that these  were created by using code.  And then they tried to get the other researcher's code  up and running, right?  So it was a nightmare, because researchers, in a lot of ways,  aren't coders.   
PIERRE MONTAGANO: So a lot of times, they're not following  proper coding practices, in the sense that they're not  including things like what exact version of language they used,  what packages were included.  And then of course, there's the dependency hell, right?  One file depends on another file, depends on another file.  And it took these researchers months  to get one algorithm up and running.  And so what he wanted to create was a system  where you were able to create and containerize all the code  and run the code by just clicking a button.   
PIERRE MONTAGANO: And it really speaks to this reproducibility issue, right?  And this is a slide about all of science,  not just computational science.  But what really threw me for a loop here  was not that researchers had trouble reproducing  other people's science.  It was that researchers had trouble reproducing their own,  right?  That really threw me for a little bit of a loop, right?   
PIERRE MONTAGANO: And the reason is-- over time, the reason  is because researchers aren't curating  all the proper things needed for their code to run, right?  And so if we look at Code Ocean, it  is a very friendly user interface  where researchers can deposit their code files,  deposit their data files, select the proper environment.  And then others can simply hit a Run button  and execute the code in real time.   
PIERRE MONTAGANO: It also allows others to change the input values,  play with the code, and rerun it.  So not only for reproducing, but also for reuse.  So here is the code files, here are the data files,  and here's a sandbox window that other researchers can then  go in, change the code, play around with the code.  And they can actually run the actual code.  The reason I picked this example-- one of our founding  partners was IEEE.   
PIERRE MONTAGANO: And a lot of the code that's in Code Ocean comes from IEEE.  But I wanted to show this example,  because this is a social science example.  It's from political science and research methods.  So this is as true for the social sciences  as it is for the hard sciences as well.   And then here is this Run button.  And researchers can just run the code,  and they can get the results.   
PIERRE MONTAGANO: I'm actually showing you the new interface from Code Ocean.  So what's our technology stack built on?  We're basically built off Docker.  So you can also think of us as GitHub  is a really friendly interface over Git, right?  We are a really friendly interface over Docker, right?  So we're containerizing everything.  So we work on Linux and Docker.  And then on top of that, we can support basically  any open source programming languages plus Stata, Matlab,  and soon others to come as well.   
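The talk doesn't show Code Ocean's actual container recipes, but the "friendly interface over Docker" idea can be sketched with a hypothetical Dockerfile. Every name below (image tag, package versions, paths, script name) is an invented example; the point is that the environment itself is recorded explicitly, which is exactly the information researchers usually leave out.

```dockerfile
# Hypothetical compute-capsule environment, not Code Ocean's real recipe:
# pin the language version and every package so the run is repeatable.
FROM python:3.10-slim

# Exact package versions -- the detail most shared code omits.
RUN pip install numpy==1.24.4 pandas==2.0.3 matplotlib==3.7.2

# Copy in the author's code and data.
COPY code/ /capsule/code/
COPY data/ /capsule/data/

# The one-click "Run" button amounts to executing a single entry point.
WORKDIR /capsule/code
CMD ["python", "run_analysis.py"]
```

Anyone with the built image can then repeat the run with `docker run`, instead of reconstructing versions and dependencies by hand.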
PIERRE MONTAGANO: We support deep learning frameworks as well.  And we have GPUs as well.  And right now, what we're working on is for researchers--  so we're a reproducibility platform.  Researchers deposit their code and data.  We containerize it.  It's executable at the point of the article,  within the article, giving you that competitive advantage,  right?   
PIERRE MONTAGANO: So researchers are reading the article.  They're very interested in this particular article.  Now, they no longer have to go to another platform  to run the code, or find the code, or see the data.  They can see everything within the article page,  or on a separate tab within that page, right?  And that, all of a sudden, makes the content, in my opinion,  a lot more valuable, right?  And that's really, I feel, our true competitive advantage.   
PIERRE MONTAGANO: We are working on--  our next step is to work on how we  can start integrating live machines,  meaning people can actually pull the code from Code Ocean  into a live Jupyter environment, into RStudio, and other places  and start augmenting and playing with the code in real time,  and then saving it back into their own private dashboard.  So moving further down into the researcher workflow, right?  But right now, we're optimized as being  a reproducibility platform at the point of publication.   
PIERRE MONTAGANO:  So this is the workflow.  And I'm running out of time.  But basically, authors are asked a question.  Do you have associated code-- yes or no-- with your research?  If they answer yes, they deposit a compute capsule in Code Ocean.  If you want, as a publisher, we can actually  provide a private instance of that compute capsule  to be shared during the peer review process.   
PIERRE MONTAGANO: Now, that's very, very valuable, as at present peer  reviewers, a lot of times, don't engage a lot with the code.  It's this whole entire trouble I was telling you about, getting  code up and running, right?  And then there's a notification back to the publisher,  and then a widget appears on your platform.  And here's our current partner list.  Thank you very much.  [APPLAUSE]   
SPEAKER: That was great.   Tatiana, are you next?  Great.    
SPEAKER: And just a reminder to everyone, you  can text in questions as we go, and as you  think of them, to the text number that's on your programs.  And we will be doing a Q&A after these.  I'm sure there's a lot of questions around these thought  provoking presentations.  Tatiana.    
TANYA VIDREVICH: OK.  So I'm going to talk today about technical specifications  and standards.  And it is a narrow slice of publications  that are of interest here.  But it is a very special slice.  I think we are safe.  And we have cheap, good, fast products,  because we all use specifications and standards  in development.   
TANYA VIDREVICH: Now, if you talk to any engineer,  when it comes to geometric models,  they all use pretty sophisticated geometric models  usually, nowadays.  I was at a news conference six months ago.  And one engineer said, do you remember when  people were using drawings?  Well, not anymore.  The 3D models allow people to analyze,  simulate, make sure the product is safe,  it doesn't have any problems before it gets produced.   
TANYA VIDREVICH: However, next to a model, quite often  you see this blob of basically text.  And that text specifies nongeometric properties  of a part.  So it specifies material, surface treatment,  how you test the part.  And that text, basically, is as important--  and it usually references standards,  which is a good thing, because, if people  use the same industry-accepted standards,  you are assured of a good supply chain.   
TANYA VIDREVICH: And you are assured of the quality of a product, usually.  However, as you can see, this text is rather cryptic.  This particular line, here, means  you have to go and download a PDF file  and find what this line refers to, which  refers to something like that within a large PDF file.  Aha.  Actually, I have to download another PDF file, which  is [INAUDIBLE] and interpret what the concept of class 1  means there, and so on.   
TANYA VIDREVICH: And basically, a design engineer has to do that.  A manufacturing engineer has to do that.  A vendor who coordinates a job has to do that.  Test people, QA people have to do that.  So every time, this information is being interpreted.  And you can imagine how many problems it creates.   What's more?  Usually, when we think about technical standards,  we think about common standard development organizations.   
TANYA VIDREVICH: However, if you think about Boeing,  you think they produce airplanes.  They have at least 70,000 active standard documents  internally that they are using.  So in order to produce a part for them or for their vendors,  they have to interpret their own standards,  and they have to interpret standards  from external organizations.  And what's more?   
TANYA VIDREVICH: Design usually takes a long time in this industry.  If you think about pipelines for oil and gas,  a large project might take a long time, too.  Well, most standards development organizations  update their standards once every two years,  which can mean a material change to what people are doing.  So not only do they have to interpret this information,  they often have to reinterpret it again within the two years  to understand if they still comply with the code they  are supposed to comply with.   
TANYA VIDREVICH: And that example that I just showed  requires more than 500 pages of interpretation.  So actually, we talked a lot to engineers  and asked them, what do you hire a standard to do?  And what we realized is that a lot of organizations  are trying to tell them, you'll save a lot of time  if you just improve your search.  But if you go back and look in the search  logs of these organizations, you'll  find out that, 80% to 90% of the time, engineers  know the specific document they're looking for.   
TANYA VIDREVICH: They type in that specific standard number.  And when they don't, they actually use Google.   So at least for standards, there wasn't a lot  of benefit in improving search.  What you really need to improve is interpretation  of the standards, integration into their internal systems,  and navigation between those standards.  On average, if we look at government standards,  they are about 10 pages long.   
TANYA VIDREVICH: Although, that's just an average.  There are very long and useful standards.  The most downloaded standard is more than 1,000 pages long.  And on average, they reference seven other standards.  So you can imagine finding related concepts  between those standards is important for navigation,  for interpretation.  And an ability to identify, to address a specific concept  in that standard and keep track of it is also very important.   
TANYA VIDREVICH: So an engineer usually identifies those concepts,  and then saves the information that they are interested in,  in their work instructions, in their spreadsheets,  in their internal PLM systems.  When this information changes, they  are interested to know whether this particular concept that they  referred to has changed.  Maybe the whole document changed,  but that particular concept has not changed.   
TANYA VIDREVICH: So it's very important to know what specifically changed  and how it affects me.   So then where do we go?  Of course, for years, since 1984,  PDF was a prevalent format for distributing  technical standards.  Currently, the key players in this space  converted all their information to XML  and basically publish out of XML into PDF, HTML, and so on.   
TANYA VIDREVICH: And that's a very good step forward.  It allows you to at least connect the documents  internally and understand which document references  which document.   However, it's not good enough if you  want to see if my own document in an engineer--  my own engineering system references  a whole network of other internal and external  documents, and if the change of a document  might or might not affect me.   
TANYA VIDREVICH: So we put together a consortium, a technical working group, that  consists of large aerospace manufacturers,  as well as key standard development organizations.  And we basically asked that question  of users of standards: what do you hire a standard to do?  As a result, we came to a format called  SWISS, which is Semantic Web for Interoperable Specs  and Standards.  It relies on a linked data model that  knows what I am referencing, why I'm referencing it,  and the state of that reference.   
TANYA VIDREVICH: And because it has an API, it integrates into major product  lifecycle management systems, and it integrates with the major tools  that engineers use, so engineers find it very useful.  So for example, I might have my proprietary spec for a part.  I might mention an ASTM standard for a material,  an SAE standard for a specific process,  and an ASME standard for some dimensions.   
TANYA VIDREVICH: If information in one of these standards changes,  or a standard gets withdrawn, it's very important for me  to know that.  Through using linked data model, we  are basically aware what specifically changed,  what concept changed, and how does it  affect all information in my proprietary system.  So we basically call it smart-connected documents.  And--  [APPLAUSE]    
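The change-propagation idea behind SWISS can be illustrated with a toy linked-data model. Everything here is hypothetical: the spec names, standard designations, and revision labels are invented. The point is only that once each reference records what it cites and at which revision, finding every affected spec is a simple query rather than a 500-page reinterpretation.

```python
# Hypothetical references: (my document, referenced standard,
# concept referenced, revision it was cited at).
references = [
    ("ACME-SPEC-001", "ASTM-B209",   "material",   "rev2016"),
    ("ACME-SPEC-001", "SAE-AMS2700", "process",    "rev2018"),
    ("ACME-SPEC-001", "ASME-Y14.5",  "dimensions", "rev2009"),
    ("ACME-SPEC-002", "ASTM-B209",   "material",   "rev2016"),
]

# Current state of each standard; one has moved on since it was cited.
current_revisions = {
    "ASTM-B209":   "rev2019",
    "SAE-AMS2700": "rev2018",
    "ASME-Y14.5":  "rev2009",
}

def affected_specs(refs, current):
    """Return every (spec, standard, concept) whose cited revision is stale."""
    hits = []
    for spec, std, concept, cited_rev in refs:
        if current.get(std) != cited_rev:
            hits.append((spec, std, concept))
    return hits

for spec, std, concept in affected_specs(references, current_revisions):
    print(f"{spec}: {std} ({concept}) has changed since it was cited")
```

A real SWISS implementation tracks the referenced concept itself, so a spec is only flagged when the concept it relies on actually changed, not merely when the containing document was reissued.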
JOHN UNSWORTH: [INAUDIBLE]   
SPEAKER: Yes.   The clicker.   
JOHN UNSWORTH: Is it working or not?   In typical academic fashion, I have twice  as many slides as I should.  And they're full of text.  So you can download them.  I'm going to blow through the first half  of this presentation, which is mostly for the record,  to get to the part that's interesting.  So this is about the Hathi Trust and its research center.   
JOHN UNSWORTH: Hathi means elephant in Hindi.  And their motto is there's an elephant in the library.  And there is an elephant in a library.   That's the mission of the Hathi Trust,  which is the parent organization for the Research Center.  Here is the data in the Hathi Trust.  It's a lot.  Five billion pages is probably the salient figure here.   
JOHN UNSWORTH: And that number is updated every day.  Well, the other salient number here is that about 65%  of that content is in copyright.  It covers all of the domains of knowledge  in book form in research libraries.   It has complex copyright conditions.  And the Hathi Trust has been managing those very carefully  for its members.   
JOHN UNSWORTH: The Research Center is the research arm  of the Hathi Trust.  And the purpose of the Research Center  is to provide computational access  to all of that content that's on that previous slide.  This effort started in 2008.  I actually launched the Research Center  when I was at the University of Illinois,  as dean of their Library School, and reached out  to Indiana, which backs up the Hathi Trust from Michigan.   
JOHN UNSWORTH: And so this is an Illinois and Indiana effort.  And I'm still on the steering committee,  because they can't figure out how to get rid of me.  And the Research Center was formally established in 2011.  That's the mission of the Research Center.  You may have heard the term non-consumptive research.  This comes out of the attempted Google Books  settlement with publishers.  But we have actually carried it forward  as the guidelines for what we're doing here.   
JOHN UNSWORTH: Basically, it means you can do computational work with this,  but you can't take the stuff away.  You can't republish it.  And you can't really use it outside of its environment.  So in order to do that, we need to bring the computation  to the data.  We can't move the data.   Since 2011, the Research Center has  been developing services and tools that  allow researchers to employ text and data mining methodologies  on that collection.   
JOHN UNSWORTH: And we've been working, so far, with the part  of that collection that's out of copyright.  But last week, that changed.  So last week, we began to provide computational access  to all of the content, including the copyrighted content,  to people at member institutions of the Hathi Trust.  Basically, what that looks like is this.  There are three approaches.  So HTRC Analytics, if you go and find it on the web,  gives you web-based tools that you  can use to do things like Ngram analysis.   
JOHN UNSWORTH: You don't get access to the data at all, really.  And what's underneath those analytics is not the data,  but something called extracted features,  which I'll come back to.  But basically, those are statistically derived  information at the page level that cannot be reassembled  into the underlying text.  Feature extraction services are what  provide those extracted feature data sets.   
JOHN UNSWORTH: And we can actually allow people to download those data sets,  because they can't be reassembled into the underlying  text.  And what happened last week was we started  to provide people with access to something  we call the data capsule.  This is a virtual machine on which you can load software  to run against the data.  And it operates in one of two modes.   
JOHN UNSWORTH: It's either in Maintenance mode, where you can SSH into it  and load it up with software and do things;  or it's connected to the data.  When it's connected to the data, you can't get anything off  of that virtual machine.  You have a terminal on the machine, a virtual terminal.  But you're not-- you can't get from there  to anywhere except the data.  And when you're finished with your computation,  your results go basically into a holding tank.   
JOHN UNSWORTH: So when you turn the machine back into Maintenance mode, you don't have your results yet. Those results are being reviewed by a human being before they're released to you. And we're reviewing them, basically, to make sure that you're not expropriating large amounts of text. The web-based portal gives you algorithms that are click-and-run tools to provide canned analysis.
JOHN UNSWORTH: The extracted feature data set can allow you to do more things with more actual data. And then we have some sample tools like Bookworm to do word trends. The analytics for members-- that started this past week-- is the secure computational environment that I've been talking about. It took three years to get that architecture approved by the information security people, the general counsels, and so on at Illinois, Indiana, and Michigan.
JOHN UNSWORTH: That was three years of pure bureaucracy  to make this possible.  Right now, that computational access through the data capsule  is only available to Hathi Trust member-affiliated researchers,  partly because we expect significant demand.  And our computational capacities-- well,  based at Big Red in Indiana--  are still limited in some ways.  And there is a lot of data, potentially, to work with.   
JOHN UNSWORTH: So we're trying to keep a throttle on that. There are about 130 member institutions under this link, mostly in the US, but some in other countries-- Canada and Australia, and also, I think, in Europe now. So our purpose in the Hathi Trust writ large is to allow lawful research and educational uses of this collection. And there is a non-consumptive research policy, which you can read to see the details of what we think we're allowing people to do and what we don't want people to do.
JOHN UNSWORTH: And you can see those services embodied in the HTRC Analytics,  which is available on the web.  The next project-- and now, I'm talking  about something that is basically  my research project self-funded with my startup  at the University of Virginia.  And I'm working with JSTOR and Portico staff  to experiment with enabling distributed text  mining across the book material that's in HTRC and the journal  material that's in Portico.   
JOHN UNSWORTH: We're working only with 10 publishers who've agreed  to be part of this pilot.  And we're working with just biological data,  just a subset of the data that people have approved.  And we're doing this because the problem  that HTRC has been working on with how to provide access to--  computational access to copyrighted data  is larger than HTRC's collection.  So we have, out there beyond HTRC,  lots of other copyrighted data.   
JOHN UNSWORTH: And if I'm a researcher, I'm interested not only in book material, not only in journal material. I'm not only interested in one publisher's material. I'm interested in the subject area that cuts across those things. And so we need to figure out how to provide people computational access to distributed copyrighted data. Portico has a set of agreements with its publishers, which it doesn't want to renegotiate and doesn't want to violate.
JOHN UNSWORTH: And that involves not moving those publishers' data  out of their environment.  HTRC has the same requirement.  The Hathi Trust is not going to let  us ship data around the world.  So what we're trying to figure out is,  can we allow people effectively to do text mining when you  can't bring the data together?  Now, we're not the first people to worry about this problem.   
JOHN UNSWORTH: Google's been here before.  There's MapReduce.  There are tools and techniques out there  that I think can enable us to do this,  but it's still experimental from our point of view.  So I'm happy to report, actually,  that, a couple of weeks ago, we managed  to do extracted features that are  interoperable across the journal and the book content, which  means we had to harmonize our metadata across these two very  different kinds of collections.   
JOHN UNSWORTH: And that's actually the major accomplishment.  To have done that means that we can now  begin to think about this next step of distributed text  mining.  Thank you.  [APPLAUSE]    
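The distributed approach John describes-- bringing the computation to the data and letting only aggregate results travel-- follows the MapReduce pattern he cites. A minimal sketch, with corpora and function names invented for illustration (this is not HTRC's actual pipeline):

```python
from collections import Counter

def map_phase(document: str) -> Counter:
    """Runs locally, wherever the data lives: emit per-document word counts."""
    return Counter(document.lower().split())

def reduce_phase(partial_counts: list[Counter]) -> Counter:
    """Merge the partial counts; only these aggregates ever leave each site."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

# Two hypothetical "sites" (say, a book corpus and a journal corpus)
# counted in place; only the aggregates are combined centrally.
site_a = map_phase("text and data mining")
site_b = map_phase("data mining at scale")
merged = reduce_phase([site_a, site_b])
print(merged["data"])  # 2
```

Each site runs the map phase against its own holdings; only the merged counts, which cannot reconstruct the underlying text, cross institutional boundaries.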
SPEAKER: Great. Thank you all. We're now open for questions from the audience. I have a few, but I promise, if you raise your hand and go live, you get to jump to the front of the line. So it looks like Alison here. We have a microphone in the crowd? No? I'm moderating and passing microphones.
PIERRE MONTAGANO: Phil Donahue.   
AUDIENCE: Thank you, Oprah. John, this question's for you, just about the Portico-JSTOR aggregated data mining. So our journal content is in Portico and JSTOR. I don't know exactly what our licensing agreement says with them. But how are you going to navigate the different licensing agreements for the different publishers' material that is in those repositories?
JOHN UNSWORTH: The point is actually  doing this in a way that doesn't require  renegotiating those agreements.  The publishers who are participating in the pilot  said, it's OK.  This is not going to be open to the public at this point.  This really is a science experiment.  And so at this point, what we're doing  is working with publishers who are interested in providing  text and data mining services to their users,  because they're getting requests for that from their users.   
JOHN UNSWORTH: And what we're trying to figure out  is, precisely because it would take an infinite amount of time  to renegotiate every one of those agreements,  and even to work with individual publishers--  so working with Portico and within their existing  agreements is the point.  If people don't want to participate,  obviously they won't.  But if we can figure out how to do this, then those who do  want to participate, we're not taking  their data out of Portico.   
JOHN UNSWORTH: We're not, you know, allowing people to expropriate data from Portico. So the point is actually not putting ourselves in a position where those agreements would need to be renegotiated.
AUDIENCE: I have a question for Code Ocean.  You mentioned rightly that dependency capture  was a significant problem.  When you were presenting, I was looking  where the dependency capture was happening.  If you'll let me upload my code, I'll upload whatever I've got.  The dependencies that go with that,  how do you capture that in a way that will survive  a long enough time period?   
PIERRE MONTAGANO: So when you upload your code-- so you're talking about dependencies in terms of other files? Yeah, right. So there is an interface-- we also have an Environment tab that I didn't show. You can go into the Environment tab and select all the different packages, startup scripts, anything that you're particularly using. So maybe I should have shown the Environment tab. We have, basically, a package manager as part of our Environment tab.
PIERRE MONTAGANO:  And then they can select the version as well.  So they can select a version of that particular package.   
AUDIENCE: [INAUDIBLE]    
PIERRE MONTAGANO: All the different things, yeah.  So there are certain things that--  anything that's open source, we support right now along  with other vendors that we've made some agreements  with as well.    
SPEAKER: Any other questions in the audience?  I have a few here that have been texted in.  We'll go to one of those.  This is a question for Pierre and Paul.  How do you see TDM machine learning and these new search  techniques and data repositories converging to benefit  researchers in the future?  Any specific use cases that publishers here  should be considering for these techniques on their platforms?   
PAUL CLEVERLY: I think there's going to be  a multitude of interfaces.  I think it's probably unlikely that you'll  be able to have one interface that will handle all  the different questions that researchers or business  users will want.  So I see that, as well as the traditional mechanism  to find articles, the digital library,  there could be an explosion of other interfaces  which address specific tasks, which  are related to each discipline.   
PAUL CLEVERLY: So I don't see it as an either/or. I think it's going to be an "and" for these.
PIERRE MONTAGANO: Yeah, I think that-- yeah, there are a lot of data repositories out there. And a lot of them are subject-specific. But at least the numbers are indicating that there are not a lot of people going out and pulling those data sets, or checking those data sets. So I do see value in tools like Code Ocean, where people can extract the data from a repository and then bring it into a cloud-based environment, where you can start to do some analysis around the data and make the data more usable.
PIERRE MONTAGANO: I think that that's key right now.  I think that that's, in my opinion--  and it's just an opinion.  Right now, I think a lot of data repositories  are not getting a lot of usage only  because the tools around analyzing the data  aren't joined up with the actual data repository.  And I think that the logical place for all that to live  is actually at the point where the research is curated,  which is in the digital libraries.   
SPEAKER: Thank you.  Still looking for questions from the audience.  We have plenty here on the iPad.  This one is for Tanya.  How specific to technical standards or engineering  standards is the SWISS technology?  Or does it apply to all connected document systems?   
TANYA VIDREVICH: There is something special  about technical standards in that they change a lot.  Usually, when you publish something--  an article in a journal--  it usually doesn't change.  However, having said that, an ability  to find connected concepts across multiple documents  and navigate through those connected concepts, I think,  is appropriate to a lot of publications.    
SPEAKER: OK.  Mark, is there a mic?  You're getting this mic.  You can keep it.   
AUDIENCE: So I am a publisher that  has a large corpus of copyrighted material  that I don't want to just put out there.  We actually are pretty generous with researchers  who come and approach us.  But let's say I wanted to go one step down and make available  the content that would be good for text and data mining,  but in a way that doesn't reconstruct the text.  What set of files or whatever would I want to make available?  Is it just Ngrams?   
AUDIENCE: Is there a standard set of extracted things  that would come from the text that you would want  to see publishers just make available that  could be pulled down?  Or is it really, if you don't have the full text,  then you're doing it wrong?   
JOHN UNSWORTH: Well, one of the things  that motivates the experiment with the JSTOR and HTRC content  is, at that scale, if we can figure out methods that  work and don't imperil copyrighted content,  then your content, wherever it lives,  could present itself in that way for those services.  But basically, my first answer would  be have a look at the extracted features data set.   
JOHN UNSWORTH: And what we're extracting is part of speech information,  individual words in alphabetical order,  page level data, with which you can actually do a lot.  You can't do sentence level things, obviously,  because they're not assembled into sentences.  So you can't do anything syntactic.  It's limited semantic kinds of research,  but you can still do some useful things with that  extracted features data set.   
JOHN UNSWORTH: And if this works, then other people can create this.  And, in fact, you might upload data  and have it returned to you as extracted features.    
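The page-level extracted features John describes-- alphabetized token counts with word order discarded, so sentences cannot be rebuilt-- can be sketched as follows. The field names here are illustrative, not the actual HTRC Extracted Features schema:

```python
from collections import Counter

def extract_features(page_text: str) -> dict:
    """Reduce a page to a non-consumptive feature set: a token count
    and an alphabetized bag of words. Because word order is thrown
    away, the original page cannot be reassembled from the result."""
    tokens = page_text.lower().split()
    counts = Counter(tokens)
    return {
        "tokenCount": len(tokens),
        "tokens": dict(sorted(counts.items())),  # alphabetical order
    }

page = "the cat sat on the mat"
features = extract_features(page)
print(features["tokenCount"])     # 6
print(features["tokens"]["the"])  # 2
```

Counts like these still support word-frequency and limited semantic analysis, which is why the data set can be downloaded freely even for in-copyright volumes.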
SPEAKER: Great.  One more question coming from the audience here on the text.  This is probably relevant for a lot of our audience.  And that is, how do small publishers  and/or society publishers use these concepts to their  and their users' advantage versus the big data or the more  aggregated view?  So I don't know who wants to take that one.  Paul?    
PAUL CLEVERLY: Yeah.  I mean, with text and data mining,  I always think of it as two main areas.  One is you could do broad and thin, where  you extract all of your text.  And you can put that in the cloud,  or you could put it somewhere else on a secure server,  and then just allow people programmatic access to it.  So you're not predefining the certain concepts  that people want to see in concurrence.   
PAUL CLEVERLY: A lot of text and data mining is counting  how many times this occurs with this,  or other things like that.  So that would be a flexible broad and thin.  But then you could also use open source tools  to build your own applications to target repetitive tasks that  are relevant for your specific domain.  So I think those two--  one is a bit more involved, where there's  a bit more technology play.   
PAUL CLEVERLY: The other is more that you're just facilitating a set of hypotheses or questions that people may wish to query. But if you already have the XML-- even if you have PDF, you can convert it to text-- it isn't that difficult just to put all of that data there and call it a lake of sorts, and then just allow open access to that, obviously either as part of existing subscriptions, or it could be a new option to offer to subscribers potentially.
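Paul's "counting how many times this occurs with this" style of analysis can be sketched as a simple co-occurrence count over a document set. The example terms below are invented for illustration:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents: list[str]) -> Counter:
    """Count how often each pair of distinct terms appears together
    in the same document -- the basic co-occurrence statistic behind
    much text and data mining."""
    pairs = Counter()
    for doc in documents:
        terms = sorted(set(doc.lower().split()))  # dedupe, canonical order
        pairs.update(combinations(terms, 2))
    return pairs

docs = [
    "shale gas porosity",
    "shale porosity measurement",
]
pairs = cooccurrence_counts(docs)
print(pairs[("porosity", "shale")])  # 2
print(pairs[("gas", "shale")])       # 1
```

This is the "broad and thin" mode: no concepts are predefined, and users can ask for whatever concurrences interest them via programmatic access.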
TANYA VIDREVICH: From a SWISS perspective, since we assign unique IDs to every concept in a document, we know how often a concept is addressed and checked, basically. And we know the usage of that concept, which, I think, is very helpful for a publisher. And we also know who reaches out to understand the information for that concept, which means, if information is being copied out and then legally transmitted to somebody else, we know when it happens.
TANYA VIDREVICH: And as a result, you could basically increase the user  base for those concepts.    
SPEAKER: We have time for one more question.   Thank you.   
AUDIENCE: Thank you.  I was curious.  Code Ocean again, not surprising given  that's what my background is--  I was curious.  How often do you see people play with code snippets  or segments that have been deposited along with papers?  I'm just curious.   
PIERRE MONTAGANO: So that's where I think it starts to get interesting, when you start to look at metrics that are indicators of real engagement, not just time-in-environment and vanity metrics, right? And one thing we are seeing is that, of the compute capsules that we do have on Code Ocean, for 75% of them, someone has come along and changed or augmented the code in some way-- it's a little less for the data-- and rerun the code, right?
PIERRE MONTAGANO: And what's a really positive indicator that we're seeing now  is we have a duplicate function that  allows you to duplicate the code and keep, obviously,  the attribution back to the original researcher  and save it to your own dashboard.  Now, that's an indicator-- for me, it's an indicator of reuse  eventually down the line.  And it's only something we're going  to see materialize a year or two years down the line.   
PIERRE MONTAGANO: But that number is 40% of our compute capsules  on Code Ocean have been duplicated  by one or more researchers.  So we are seeing--  we are seeing metrics that are very promising  above and beyond, let's say, just static data repositories.   
SPEAKER: There's a follow-up here from the iPad, which is, how does copyright impact Code Ocean? And if the code is linked to an article that's public in a standard subscription model, what are the rights and access around that code?
PIERRE MONTAGANO: We don't own the code. Researchers assign their own license to the code. We're an open access platform. Anyone can go in and view the code, and they can download the code. We default to CC0 and MIT for code and software, but researchers are more than welcome to apply any license they choose. And then we also have the ability-- anyone can port out.
PIERRE MONTAGANO: And it's made very explicit to the researcher.  Anyone can download the code and data from our site.   
SPEAKER: Great.  Well, I'd just like to thank our panel again  for such a thought provoking--  [APPLAUSE]