Name:
Linked data and the future of information sharing
Description:
Linked data and the future of information sharing
Thumbnail URL:
https://cadmoremediastorage.blob.core.windows.net/f01e8d91-f1d9-48af-9332-787aa5e0cb19/videoscrubberimages/Scrubber_1.jpg?sv=2019-02-02&sr=c&sig=4M0aoLyAkwpjekV4ZbPRA2Y65N0S%2BR6CYMqVONDIlKI%3D&st=2024-05-04T14%3A16%3A01Z&se=2024-05-04T18%3A21%3A01Z&sp=r
Duration:
T00H34M25S
Embed URL:
https://stream.cadmore.media/player/f01e8d91-f1d9-48af-9332-787aa5e0cb19
Content URL:
https://cadmoreoriginalmedia.blob.core.windows.net/f01e8d91-f1d9-48af-9332-787aa5e0cb19/6 - Linked data and the future of information sharing-HD 108.mov?sv=2019-02-02&sr=c&sig=fszKbaGTR%2FC1%2FjMVOWvkg4FCKzovC%2B5TDPzedf%2BMCB4%3D&st=2024-05-04T14%3A16%3A02Z&se=2024-05-04T16%3A21%3A02Z&sp=r
Upload Date:
2021-08-23T00:00:00.0000000
Transcript:
Language: EN.
Segment:0 .
[MUSIC PLAYING]
ANA HEREDIA: Hello. Welcome, everyone, to this session on linked data. I am Ana Heredia, ORCID's engagement lead for South American publishers across the Americas. Before joining ORCID, I worked for Elsevier. And before that, I was a researcher. Now I'm here as part of the organizing committee for NISO Plus 2021. We are here today with Shelley Stall, senior director for the American Geophysical Union's data leadership program, and Christian Herzog, co-founder and MD of UberResearch, and CEO of Digital Science, Discovery & Analytics.
ANA HEREDIA: About this panel, our two speakers are very complementary. They will share their perspectives on the differences and similarities between the needs of their disciplines, as well as their vision of data linking and sharing in the research landscape as a whole. So let's first start with Christian, and then with Shelley.
ANA HEREDIA: Enjoy.
CHRISTIAN HERZOG: Hello, my name is Christian Herzog. I'm a medical doctor by training, and one of the co-founders of Dimensions, and currently, the CEO. Thank you very much for inviting me to that session today. For the next minute, I will talk a little bit about linked data, contextualization, and connections-- to which end? As one caveat up front, I'm not using the term linked data in its purest technical form.
CHRISTIAN HERZOG: I'm using it to describe the connections and relations between data elements which are useful to the end user, not the technical representation of the data as the triplets. In essence, I'm using linked data seen from the angle of the consumer of the data, not from the engineer who builds and provides the data infrastructure or the researcher interested in linked data, just to clarify that up front. The future of linked data is bright, and it will empower and simplify information sharing because otherwise we would not do it.
CHRISTIAN HERZOG: But getting on top of that mountain, that rock, is hard work. And jumping requires trust in the cape. Since a lot of players are involved in producing the data, so a lot of agreements need to be reached how the data should be linked, provided, which identifiers to be used. But afterwards, the data also needs to be in a certain infrastructure so that the user can have the confidence to jump into the analysis.
CHRISTIAN HERZOG: A little bit about Digital Science and linked data. For example, Altmetric is linking onto it. Altmetric associates more than 16 million research outputs with roughly 160 million mentions from social media, news, and blogs. GRID is helping to link it. We launched years ago, GRID as a database of persistent identifiers for research organizations, and cover currently close to 100,000 institution records.
CHRISTIAN HERZOG: The Dimensions API allows the resolution of affiliation data to GRID IDs, and GRID has been used over the past year as the seed data set for ROR and the more community-driven approach to that topic. You can find the details and download the entire data set. It's openly available at grid.ac. Symplectic Elements, the universities who lean on the research information management system from Symplectic and making use of the data which is served with as many links as possible, as interlinked as possible, to manage the information about their researchers and their activities.
CHRISTIAN HERZOG: And last but not least, Dimensions is aggregating large amounts of information, and is trying to link it as much as possible. And let me spend a little bit more time on that, because when we launched Dimensions, we actually said we want to be able to not only look at publications and citations, but create a research information infrastructure which covers the resource input from grant, the research activities, the data sets produced, the publications, the tweets, blog citations as outputs, and then clinical trials, patents, and policy documents as impacts.
CHRISTIAN HERZOG: So cover the entire trajectory and to establish as many links between these as possible to allow an increased understanding. And today, Dimensions covers 160 million publications with 1.3 billion citations, but we also cover 5.7 million grants with $1.8 trillion in funding. We cover clinical trials from multiple sources. We cover 55 million patents, which will soon increase to 120 million, policy documents, and data sets.
CHRISTIAN HERZOG: But we didn't only put all of these records in one database and then end up with a large number of documents. We invested quite a lot of work to actually make these-- the different data silos as interlinked as possible. For example, for 16 million publications, we were able to establish a link to one particular grant who has supported that publication. And for 22 million, we were at least able to extract which funder did support the research leading to these publications.
CHRISTIAN HERZOG: And all the other arrows are an awful lot of work as well, but they actually make the data speak. And these are the main processes with which we do this. Institution identification, categorization, concept extraction, researcher disambiguation, and the reference extraction with a couple of other enrichment processes end up with one harmonized database with all those different content types, but also the links between them.
CHRISTIAN HERZOG: Links as context matter, or context matters. So here I have one example of one record within Dimensions, where we have been able to establish as much context as possible. So for example, disambiguated authors taking GRID and their ORCID into account, affiliation resolution using GRID, citation count, derived indicators, Altmetric indicators and score. Then the document history, linking preprint publication dates, links to supporting funders through mining of the acknowledgments section, different machine learning-based classification systems based on the content of the article, MeSH terms for this particular article, related data sets, obviously, with a preview of the data, cited publications with the indicators, the supporting grants linked to the publication, references to clinical trials, citing publications with indicators, citing patents, and last but not least, related policy documents.
CHRISTIAN HERZOG: All of this context is openly available on the individual records in Dimensions, but the question is how to serve the linked data up to users with different use cases and different levels of technical skills. The discovery use case, finding relevant information, is served with a web application where we make the links I have just shown you earlier on that particular record available on the individual record level, but also provide a sophisticated search interface so that the user is able to connect the dots.
CHRISTIAN HERZOG: And since we actually think that this use case is of public interest for everybody, we are not taking any commercial considerations into account. We decided to make publications and data sets openly available so you just can go to app.dimensions.ai and try it. But the data matters. Easy access to all the links, that's why we actually released the Dimensions data also on Google BigQuery.
CHRISTIAN HERZOG: Because there we found a solution how we can make the underlying data virtually available to every user in a large relational database, which is immediately coming with the computational infrastructure and power to jump into the analysis right away, just by basically creating an account and starting, even without learning new skills if you happen to be able to use SQL queries already or if you connect it to standard BI tools.
CHRISTIAN HERZOG: So basically, lowering the barrier so that the user can access all the links which we established and which we put into the large database. You can even link your own data into that analysis as well. But here's an example of the most basic relations we have stored in that database. Taking only person relationships into account, citing relationships between documents, affiliation resolution, or who funded the activities represented in this document, and as you can see, we already then have 2.2 billion of the most basic relationships in that database, not taking any indexing or any classification systems yet into account.
CHRISTIAN HERZOG: You can look at the numbers and also in the distributed slides later, but that actually, as I said, covers only the most basic links. A few example use cases of what can be done with it: Simon Porter and Daniel Hook recently published a preprint on arXiv, which was about scaling scientometrics, Dimensions on Google BigQuery as an infrastructure for large-scale analysis. And what they did, they computed the center of mass shifting on a map through the centuries.
CHRISTIAN HERZOG: As you can see, 1671, it was still in the greater London area. 200 years later, it has shifted considerably west already. And in 1945, we are seeing a spring back, a move towards the east and a little bit to the south. And today we are basically at the point of Cyprus on the map. And my expectation is that it actually will swing further east, and perhaps even-- or hopefully, also more south.
CHRISTIAN HERZOG: That was based only on the affiliation data of the institutions where the researchers were located. If you take citations as weighting mechanism into account, it's basically the same paper. A little bit slower, as you can see, 200 years later, we were still overland in the UK. But then it shifted later, and actually shifted a little bit more east already. They did the same analysis also for COVID-19 related research from January to November last year.
CHRISTIAN HERZOG: And there you can see the opposite direction. We have a movement from east to the west. But I expect that in the coming years, we will also see the shift back. Why do I mention this particular example? I find it fascinating personally. But I also think it's a good example of how the linked data and the way it is prepared-- the cape actually matters.
CHRISTIAN HERZOG: What you see here on this slide is the SQL statement which actually is doing the calculation for the visualization we have seen on the previous three slides. It's just a few lines of code, but what's mind-blowing is that, although this basically runs through 1.4 billion relations, it only took 30 seconds with the data set of 160 million records to execute it and produce the data for these visualizations on GBQ.
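The SQL statement on the slide is not reproduced in this transcript. As a rough illustration only of the kind of calculation described, the following Python sketch computes a weighted geographic center of mass from latitude/longitude points. It assumes each affiliation contributes one point with a weight (for example, a publication or citation count); the function name and the input layout are hypothetical and are not taken from the Dimensions schema or from the preprint.

```python
import math


def center_of_mass(points):
    """Return (lat, lon) in degrees for weighted points on a sphere.

    points: iterable of (lat_deg, lon_deg, weight) tuples (hypothetical layout).
    Each point is converted to a 3D unit vector, the vectors are averaged
    by weight, and the mean vector is projected back to latitude/longitude.
    """
    x = y = z = total = 0.0
    for lat, lon, weight in points:
        phi, lam = math.radians(lat), math.radians(lon)
        x += weight * math.cos(phi) * math.cos(lam)
        y += weight * math.cos(phi) * math.sin(lam)
        z += weight * math.sin(phi)
        total += weight
    if total == 0:
        raise ValueError("total weight must be non-zero")
    x, y, z = x / total, y / total, z / total
    lon_deg = math.degrees(math.atan2(y, x))
    lat_deg = math.degrees(math.atan2(z, math.hypot(x, y)))
    return lat_deg, lon_deg
```

Averaging in 3D rather than averaging raw latitudes and longitudes avoids distortions near the poles and across the date line. Weighting by citations instead of publication counts only changes the weight passed in for each point.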
CHRISTIAN HERZOG: So making the links available in a technical way which can be easily understood opens up a lot of opportunities for these kinds of analyses, but also perhaps more serious ones. For example, economic impact of an institution is always a topic which is of high interest. And one way to describe it is to actually look at the patents which cite publications from a particular organization-- in this example, the Salk Institute for Biological Studies.
CHRISTIAN HERZOG: So what we did, we took the last 10 years of publications, we're identifying the publications which have been cited, the patents which were citing it, and then we were identifying who has filed these patents, which is an economic impact completely remote from the University. They might even not know about it. And it normally would take a lot of time to pull that information out. But again, with the ready-to-go linked information infrastructure, it's a few lines of SQL code, and then perhaps even setting up a monitoring dashboard which actually shows in real time where patents are popping up, leaning on the research of one particular organization-- or even the country.
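The steps described here (select an organization's recent publications, find the patents citing them, then group by who filed the patents) can be sketched in a few lines. This is a minimal Python illustration over in-memory records, not the SQL used on Dimensions; the record layout, the function name, and the identifier `org:salk` are all assumptions made up for the example.

```python
from collections import Counter


def assignees_citing_org(publications, patent_citations, patents, org_id, first_year):
    """Count patent filers whose patents cite an organization's publications.

    publications: list of dicts with "id", "org_ids", "year" (hypothetical layout)
    patent_citations: iterable of (patent_id, publication_id) pairs
    patents: dict mapping patent_id -> assignee name
    """
    # Step 1: publications from the organization within the time window.
    pub_ids = {
        pub["id"]
        for pub in publications
        if org_id in pub["org_ids"] and pub["year"] >= first_year
    }
    # Step 2: distinct patents that cite any of those publications.
    citing_patents = {
        patent_id for patent_id, pub_id in patent_citations if pub_id in pub_ids
    }
    # Step 3: group the citing patents by who filed them.
    return Counter(patents[patent_id] for patent_id in citing_patents)


# Toy records, invented for illustration only.
publications = [
    {"id": "pub1", "org_ids": ["org:salk"], "year": 2015},
    {"id": "pub2", "org_ids": ["org:salk"], "year": 2005},
    {"id": "pub3", "org_ids": ["org:other"], "year": 2018},
]
patent_citations = [("pat1", "pub1"), ("pat2", "pub1"), ("pat3", "pub2"), ("pat4", "pub3")]
patents = {"pat1": "Company A", "pat2": "Company B", "pat3": "Company A", "pat4": "Company C"}

print(assignees_citing_org(publications, patent_citations, patents, "org:salk", 2011))
```

The whole analysis reduces to two joins and a group-by only because the publication-to-organization and patent-to-publication links have already been resolved to identifiers.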
CHRISTIAN HERZOG: For the publishing process, for example, if you think about the process in three simple steps, manuscript processing, review process, post-publication, such an information infrastructure where the data is readily available to be queried or pulled into other applications. You can automatically create the context for an author via the ORCID ID or name matching. You can create a context for the particular research topic of the manuscript at the host institution to decide whether that's a standalone manuscript or whether it actually has a huge background.
CHRISTIAN HERZOG: You can look at the funded grants of the authors and see whether they hold some grants where also-- which make them eligible for APCs covered by the funder, journals with semantically similar publication, and so on and so forth. The sky's the limit. And the review process can help to identify reviewers to check for conflicts of interest and relations. And in the post-publication process, it's about monitoring reception and indicators.
CHRISTIAN HERZOG: You might be able to build up a network of your authors, monitor preprint activities, and monitor funding of your authors, which they receive after having published. But these are only examples. All is possible with a few persistent identifiers and the data readily available, which lowers the barrier for actually realizing these more micro-use cases because they can be almost done at no costs.
CHRISTIAN HERZOG: Also we have used Dimensions' data to create real linked data. So we worked with our sister company, Springer Nature, to help them release their content as the Springer Nature SciGraph. And the content ended up in a linked data graph with approximately 2 billion triplets using Dimensions as the basis.
CHRISTIAN HERZOG: And the barrier to do this for all the content is actually not technical. It's, of course, a cost issue, but it's even more so an issue about copyright and who actually wants the data represented in this way. I just listed this here so that it's clear that we have deliberately not gone to the linked data in the technical sense because that would have limited us greatly in terms of the scope which we could realize.
CHRISTIAN HERZOG: But actually, rather produce the relational linked information infrastructure with Dimensions because we actually think it can be more helpful. So while we've provided a linked open data set, we are continuing to focus on providing the versatile linked data cape which suits many, which is safe and easy to wear, in order to make the data speak so that we can understand complex processes better and facilitate information sharing.
CHRISTIAN HERZOG: Thank you very much, and I'm looking forward to the discussion.
SHELLEY STALL: Thank you, Ana. Thank you, Christian. I really appreciate the opportunity to speak today. So I've titled my talk Cite Data. Link Data. There's quite a lot of things we can do to make sure that data is well-documented, and preserved, and reusable, but one of the most critical things that we need to do is make sure it gets into the paper cited, and then links into all of the other research objects coming through from our research and the work of our colleagues.
SHELLEY STALL: So let me share with you, within the American Geophysical Union, we really care about Earth and space science data. To us, it's a world heritage when that particular tsunami or hurricane, the models around climate change, the data that goes into all of those-- that research, that work-- is unique.
SHELLEY STALL: Those events don't reoccur, and keeping that information about our complex systems for the Earth, and planets, and the universe is something that is incredibly valuable to us. And we want it to be preserved and documented, and those that have created it get the credit for that they deserve. And that really depends on us making sure that it's cited and linked.
SHELLEY STALL: So this is a really neat interactive diagram. I'll put the link in the chat, but this is a celebration that Nature did of 150 years of publishing starting back in the 1800s. And what you can see from this is how the papers were connected based on the references from one paper to the other. And the interactive nature shows you, like based on seminal papers, what it was based on and then what papers came from it over time.
SHELLEY STALL: And what I really find exciting is how interconnected we are. How our different sciences impact and support each other. Yellow, there on the right, are the geosciences. Orange is chemistry and green is biology. And it's just so interesting to see how discovery and phenomena that takes place impacts the work of other disciplines.
SHELLEY STALL: So wouldn't it be interesting if we took that structure and made it even more connected? What if we were able to instead of building it for just 150 years celebration, have that structure immediately? And we could see the connected data sets, and the connected software, the models, was there a clinical trial, et cetera. And we could really take this further.
SHELLEY STALL: Who are the authors of the paper? What are their affiliations? If we looked at that structure as an institution, we'd be able to see every single research product that came from all of the researchers affiliated with a particular college or international effort. And that's really exciting. And we are actually in the middle of doing this and building it.
SHELLEY STALL: So what do you need to actually create these links? Well, you need at least two entities. As a publisher, usually the publication is one of the primary ones, but that's not really required. It could be the data that comes from software. It could be the developer for that particular software. And we could go on, right? And then to uniquely connect them-- accurately connect them-- a persistent identifier is really valuable.
SHELLEY STALL: And then what's the relationship between the two? Does one reference the other? Is one cited by the other? What are those relationships? And then, gosh, if you're going to use a tool, it's got to be machine readable in a way that's consistent across all of us and how we do it. So it's not hard, but it does take us thinking about it. Just this morning, Helena Cousijn from DataCite gave a talk to a society seminar about how important it is to share data and to cite data.
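The ingredients listed here (two entities, a persistent identifier for each, a named relationship, and a machine-readable form) can be sketched as a small record type. This is an illustration only: the `Link` class and helper functions are hypothetical, and the relation names are modeled on the style of DataCite's relation types rather than taken from any system mentioned in the talk.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Link:
    """One directed relationship between two identified research objects."""
    source_pid: str   # persistent identifier of the first entity
    relation: str     # how the source relates to the target
    target_pid: str   # persistent identifier of the second entity


# Each relation has an inverse, so a link can be read from either end.
INVERSE = {
    "Cites": "IsCitedBy",
    "IsCitedBy": "Cites",
    "References": "IsReferencedBy",
    "IsReferencedBy": "References",
}


def inverse(link):
    """Return the same relationship as seen from the target entity."""
    return Link(link.target_pid, INVERSE[link.relation], link.source_pid)


def to_json(link):
    """Serialize a link with stable key order so any tool can consume it."""
    return json.dumps(asdict(link), sort_keys=True)


# Hypothetical identifiers, for illustration only.
paper_cites_data = Link("doi:10.1234/example-paper", "Cites", "doi:10.1234/example-dataset")
print(to_json(paper_cites_data))
print(to_json(inverse(paper_cites_data)))
```

Keeping the relation vocabulary small and shared is what lets links from different publishers and repositories be combined without translation.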
SHELLEY STALL: So some of her slides-- she said it was OK for me to share them with you-- they really drive home how important this is. DataCite is a registration entity for persistent identifiers for data and other things. And they really care about how connected things are. So as a digital object identifier, which is a particular kind of persistent identifier, is registered, they track how it's connected.
SHELLEY STALL: And the entities that hold that data like a repository make it discoverable. And there's ways that you can actually track these things. So the workflow actually is kind of complicated. Everything from a journal policy requiring a citation, the author selecting an appropriate repository for that data, making sure that there's coordination between getting the paper and getting the data preserved, linking it all, getting it published correctly with all of the right persistent identifiers in the right place, and then getting them distributed, and aggregated, and then available for others to find it.
SHELLEY STALL: This is actually a rather complicated workflow. But we already do a lot of this. As journals, we've already connected our authors. Most if not all of us require an ORCID for our authors-- and I highly recommend that you also do that for your coauthors. We know what those institutional affiliations are. And many of us are starting to-- and I highly encourage the rest of you to do this-- require data citation, and where it's appropriate, software citation.
SHELLEY STALL: And then we also reference other publications. So we're building this. We as journals are important contributors to the linking of all these products. Highlighting for you the organizational persistent identifier, ROR, many of us know GRID IDs, and some other organizational identifiers, but ROR is really trying to get adopted, and also works very closely with GRID ID.
SHELLEY STALL: So just wanting to make that available to you. Other groups have gone into depth on details about ROR, I won't do that here. And the work of FREYA-- so all of these things link together. And the European Commission funded a series of projects. The first one was called THOR-- I think it was the first one.
SHELLEY STALL: Then FREYA came after that. And they actually created this concept of a PID graph. And what it means is a way for you to take those links and relationships and actually show them in a way that's visually useful. And you can actually make inferences not only between two entities, but you can also infer things across links between three entities, which is really exciting.
SHELLEY STALL: This leads to possible collaborations. It leads to opportunities for you to see potentially that a particular sort of method that's used on atmospheric conditions might also be useful in hydrology, for instance. If you were able to see that that particular method was used by two different papers in two different disciplines, that would be an incredible awareness for you to explore that possibly.
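The inference across three entities described here (paper, method, paper) can be sketched as a two-hop lookup over links. This is a minimal Python illustration under stated assumptions: the input layout, the function name, and all identifiers are invented for the example and do not reflect how the FREYA PID graph or DataCite Commons is implemented.

```python
from collections import defaultdict
from itertools import combinations


def shared_methods(uses, discipline):
    """Find pairs of papers from different disciplines that use the same method.

    uses: iterable of (paper_pid, method_pid) links (hypothetical layout)
    discipline: dict mapping paper_pid -> discipline name
    Returns a list of (method_pid, paper_a, paper_b) tuples in sorted order.
    """
    # First hop: index papers by the method they link to.
    papers_by_method = defaultdict(set)
    for paper, method in uses:
        papers_by_method[method].add(paper)

    # Second hop: any two papers sharing a method are connected through it.
    found = []
    for method in sorted(papers_by_method):
        for paper_a, paper_b in combinations(sorted(papers_by_method[method]), 2):
            if discipline[paper_a] != discipline[paper_b]:
                found.append((method, paper_a, paper_b))
    return found


# Toy links, invented for illustration only.
uses = [
    ("paper:atmo1", "method:kriging"),
    ("paper:hydro1", "method:kriging"),
    ("paper:atmo2", "method:kriging"),
    ("paper:atmo1", "method:fft"),
]
discipline = {
    "paper:atmo1": "atmospheric science",
    "paper:atmo2": "atmospheric science",
    "paper:hydro1": "hydrology",
}
for match in shared_methods(uses, discipline):
    print(match)
```

Neither paper cites the other in this toy data; the connection only becomes visible because both link to the same identified method.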
SHELLEY STALL: It's exciting to see what would be possible. So the European Commission, through another opportunity, helped DataCite build what's called a DataCite Commons. And I'm proud to say that there was also a little bit of money from one of the projects that I'm working on called PARSEC that also contributed funds to DataCite Commons. So I feel that highlighting this for you is actually really valuable. And you can use-- we'll put the link in the chat.
SHELLEY STALL: You could use different persistent identifiers to explore what that looks like and actually bring to life the PID graph that you see on the right, connecting data sets to ORCIDs, to DOIs, to different funder IDs. So here's what it looks like if you're looking at a particular cite-- reference. For instance, if you were to bring up this paper, what you can see is that it has one citation.
SHELLEY STALL: And you can start to explore what other connections that it has. And what's really interesting is prior to this particular tool, it was really hard to see a data set. We knew we had referenced them, but it was really hard across all repositories to actually see what had been part of a paper or what was being registered in a particular repository in a way that it was connected to other research outputs.
SHELLEY STALL: So this is really exciting because you can see how it connects who's using it. This particular data set is housed in Dryad, and you can see that it has been cited once, downloaded 16 times, and 99 unique views. I believe that's correct in how I'm interpreting that, but 99 unique views on that data set. So it's something that might be interesting to others. Also important that we identify all of the different creators for a data set.
SHELLEY STALL: And again, thanking Helena Cousijn for these slides. It's really important to realize that authors of a paper are not necessarily the creators of the data sets the paper uses. There may be a different set, there may be different reasons for that. But really wanting to give credit to the folks that have generated that particular data. And Helena used my ORCID this morning in the talk, so I feel it appropriate to go ahead and continue that.
SHELLEY STALL: Looking at my particular ORCID, you can see all of the different papers that I've written, the different types of licensing that I've used, and learn more about what my work is, but then also explore those publications, coauthors, data sets-- although, honestly, I have not yet published a data set. So don't think you're going to find one there. But my coauthors have, so that's valuable to know.
SHELLEY STALL: So everything sounds great, right? All of the linking sounds fantastic, and wouldn't it be fantastic if we get that all to work? But in reality, it doesn't work that well. The folks at GBIF have partnered with AGU for a long, long time. And Daniel Noesgaard created this slide for me to demonstrate to you that things are not that great. They put a lot of work into reviewing every single paper in their disciplines.
SHELLEY STALL: And you can see by the counts here that there's quite a number of them. And they do this both in an automated way and a manual way. And they recommend the data citations that should be used in the papers. And what they're finding is-- now granted, it's trending pretty well, but still, the numbers are not good. The green bar are the correct compliant citations by paper, and then the blue bar is-- are those that mention GBIF, used GBIF data, but are not compliant.
SHELLEY STALL: So even with a lot of work in communicating with authors on what the citations should be, we still have a lot of trouble as journals helping our authors make those citations correctly. And this is one of the areas that I'd really like us all to work on together. There are a number of challenges. AGU's having a number of challenges, and these numbers from GBIF are just demonstrating that we still have a lot of work to do.
SHELLEY STALL: So what is it that we can work on? OK, first of all, hello, journals. I'd like to introduce myself. If you've not yet explored what policies-- framework policies are out there for data citation, there's a talk later on where Ian H. is going to give a review of data citation policies for journal-- a framework. So I highly recommend that you take some time to watch that or watch the recording, and he'll give you a sense for what's available to you for you to use, and maybe example policies for you to explore for your own journals.
SHELLEY STALL: And knowing it's a journey, knowing that we have to start somewhere and continue to proceed. And then as journals, specifically in our production process, you need to know that data and software citations are formatted a little differently than journal and book citations. That there are a lot of data and software PIDs being used that are not DOIs, and that they're being registered in registration entities that are not Crossref.
SHELLEY STALL: And there are URLs that are not persistent identifiers, but do have valid locations for where that data and software are located because how repositories are evolving, not all of them have the PIDs that we need, but they might actually be the right place for that data. So please, take a look at how you're validating these citations, and make sure those machine-readable XML entries that go to Crossref to populate the links for the PIDs are actually accurate.
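One way to act on this during production is to classify each cited identifier before it is written into the XML, so that DOIs and plain URLs are tagged differently instead of being forced into one shape. The Python sketch below is an illustration only: the function name and the category labels are hypothetical, the DOI pattern is a simplified check on syntax, and it does not verify that an identifier resolves or which registration entity issued it.

```python
import re
from urllib.parse import urlparse

# Simplified syntactic check for modern DOIs; an assumption, not a full validator.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def classify_identifier(value):
    """Return (kind, normalized) where kind is "doi", "url", or "unknown"."""
    text = value.strip()
    lowered = text.lower()
    # Strip a resolver or scheme prefix so bare and linked DOIs compare equal.
    for prefix in DOI_PREFIXES:
        if lowered.startswith(prefix):
            text = text[len(prefix):]
            break
    if DOI_PATTERN.match(text):
        return ("doi", text.lower())  # DOI names are case-insensitive
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        # Not a persistent identifier, but possibly a valid data location.
        return ("url", text)
    return ("unknown", text)
```

A citation classified as "url" is not necessarily wrong; as noted above, some repositories are the right home for the data but do not yet issue PIDs, so those entries need a human check rather than automatic rejection.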
SHELLEY STALL: And speaking of Crossref, new schema coming up later this year. Again, later today, there's a talk from Patricia Feeney following Ian's talk that will walk you through what is in the works for the new Crossref schema to do a better job identifying data and software citations, specifically, and what's going to be coming up. And then get your authors examples of what is a well-cited data set and a well-cited software package.
SHELLEY STALL: We haven't talked about availability statements. Those are important as well. But having all of those examples available to your authors is really critical. Ana, thank you so much for the opportunity to speak to everyone today. Belmont Forum is the funder for PARSEC. And some of my other work that is related to this is funded by the National Science Foundation, so I'm really grateful to them.
SHELLEY STALL: And then PARSEC is also funded by a number of other countries. FAPESP, and JST, ANR, and I'm just very grateful for that. And CESAB and FRB. So thank you so much.
ANA HEREDIA: Thank you so much for the great presentations. Thank you, Shelley. Thank you, Christian. [MUSIC PLAYING]