Research data: describing, sharing, protecting, saving
https://asa1cadmoremedia.blob.core.windows.net/asset-1a0b4630-313b-4c7e-bb8f-a5b1b5d56558/29 - Research data - describing%2c sharing%2c protecting%2c saving.mov
IAIN HRYNASZKIEWICZ: Hello, everybody. I'm Iain Hrynaszkiewicz giving a short presentation to introduce this session also on behalf of Natasha Simons, who will be chairing some of the proceedings. My topic is on policies, journal and funder data policies. And I like to remind myself and ourselves as often as possible what the goal of these policies is. And I think at the appropriate level of abstraction that is increasing the sharing of research data so that they can be reused.
IAIN HRYNASZKIEWICZ: And the good news is that policies of journals and of funders do incentivize and motivate data sharing by researchers, as found by a number of different surveys, for example, the 2019 state [INAUDIBLE] data report. More good news is that policies, when they're implemented effectively by journals and publishers, do increase data sharing by researchers.
IAIN HRYNASZKIEWICZ: The figure I'm showing here is from a study that explored the impact of data sharing policies at several hundred journals: [INAUDIBLE] publisher, A, and the PLOS publisher, B, on the right. The black line marks when both of these publishers, on all of their journals, began requiring statements on the availability of data in all of their publications as part of a new policy. After that point, there's a large increase in compliance with those policies.
IAIN HRYNASZKIEWICZ: By comparison, at BMC, when they only encouraged researchers to share information in their papers about the availability of data, there was about 5% compliance. It shows that when implemented consistently and rigorously, there is an effect on the amount of data available, the amount of transparency around data availability. So more good news. Beyond journals and publishers, there has been and continues to be a development in the research data policies of funding agencies.
IAIN HRYNASZKIEWICZ: This figure, these stats-- we can see Springer Nature-- go back a couple of years now, but they illustrate that there are, or appear to be, a growing number of funding agencies that have policies on sharing of research data or on research data management, up to around 25% of funders in this analysis, compared to a smaller number five years previously. So more good news in terms of raising awareness and engagement on these issues.
IAIN HRYNASZKIEWICZ: But there may well be unintended consequences of this explosion. And I would call it an explosion of policies of publishers, in particular, and of other stakeholders. So in the last five years, most of the major publishers covering tens of thousands of journals have introduced initiatives to standardize or introduce research data sharing policies for all of their titles. There have also been discipline or society initiatives, such as that of the American Geophysical Union, and independent initiatives such as the FAIR data principles and the Transparency and Openness Promotion guidelines.
IAIN HRYNASZKIEWICZ: So this is all a win for open science, a win for research data sharing, in terms of raising awareness. But it could have some unintended consequences of potentially conflicting requirements or confusion around similar but not identical terminologies. And there's also the need to consider that different research communities, different journals, and different institutions have different levels of support and different levels of readiness with regard to their approach to supporting or enforcing data sharing.
IAIN HRYNASZKIEWICZ: And all of this can and does cause some confusion for researchers and people who work with researchers. So getting in front of that issue, about five years ago, a group was formed in an organization called the Research Data Alliance, the RDA, involving representatives from funders and publishers and [INAUDIBLE] organization that provides infrastructure for research and institutions.
IAIN HRYNASZKIEWICZ: And so this group intends to tackle that problem of policies evolving and often lacking standards and cohesion and how we could try to work together to promote alignment in a way that helps us achieve that goal of more data sharing. The objectives of this group were to define common frameworks-- and that word "framework" is important, and I'll come back to that later-- to define frameworks for research data policies that allow for these different levels of commitment or readiness for different stakeholders, different communities.
IAIN HRYNASZKIEWICZ: A conscious decision was made to focus on journal and publisher policy first, acknowledging that institutions have policies, funders have policies, and other stakeholders as well. But for good reasons, in terms of developments at publishers, focusing on journals and publishers was the logical place to start. And then the intention was to engage and iteratively develop guidance and frameworks for the introduction of policies that were more harmonious, if not completely standardized.
IAIN HRYNASZKIEWICZ: So the main output of this work over the last four or five years is a set of guidelines, a policy framework, which is published as a peer-reviewed publication. The citation is there on the slide. Through this work, developing a set of guidelines, this policy framework, with community stakeholders, 14 features of a research data policy for journals were defined. So that's the column on the left-hand side.
IAIN HRYNASZKIEWICZ: And those were arranged into six different tiers or six different types, which allow for different levels of compliance and, therefore, different levels of effort, if you like, for the sake of implementing that policy. The features, I won't read them all out. They cover topics such as sharing data in repositories, the inclusion of data availability statements in all published articles, peer review of data, data management plans.
IAIN HRYNASZKIEWICZ: Just to call out some information on the key here, the light dots mean that a policy feature is enacted simply by providing information. That might be we encourage you to cite data in your publications, but we're not going to enforce it. Whereas, where there is a dark dot on the framework, that means that that's a requirement and that has to be done, which therefore means for the publisher implementing it, they have to ensure that there's a process in place to make sure that happens consistently.
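The light-dot and dark-dot key described above can be read as a small lookup table. What follows is a minimal sketch in Python, assuming hypothetical tier names and a purely illustrative assignment of levels; it is not the matrix from the published framework, and only the feature names are taken from the talk.

```python
from enum import Enum


class Level(Enum):
    """How a policy feature is enacted within a tier."""
    ABSENT = "absent"            # feature is not part of this tier
    INFORMATION = "information"  # light dot: encouraged, not enforced
    REQUIRED = "required"        # dark dot: enforced, needs a checking process


# Hypothetical, illustrative tiers; NOT the matrix from the published framework.
EXAMPLE_POLICY = {
    "example_low_tier": {
        "data repositories": Level.INFORMATION,
        "data citation": Level.INFORMATION,
        "data availability statements": Level.ABSENT,
    },
    "example_high_tier": {
        "data repositories": Level.REQUIRED,
        "data citation": Level.REQUIRED,
        "data availability statements": Level.REQUIRED,
    },
}


def features_needing_process(policy, tier):
    """Return the features a publisher must actively check for a given tier."""
    return sorted(
        feature
        for feature, level in policy[tier].items()
        if level is Level.REQUIRED
    )
```

Calling `features_needing_process(EXAMPLE_POLICY, "example_high_tier")` lists the features that would need an enforcement process, which is the cost side of choosing a stronger tier.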
IAIN HRYNASZKIEWICZ: And what we found is that all the existing publisher approaches to policy types and policy tools do map to this framework. So it's usable and adaptable by multiple publishers and organizations. To give you one example of these guidelines in practice, take the feature of data repositories and how that's captured in a policy. We have a definition of that feature, what it means, but also why it's there with evidence for why it should be there if it's available.
IAIN HRYNASZKIEWICZ: And then at the bottom of the slide here, we also provide, in this paper, template text, so that that policy feature-- that policy in its entirety-- could be copied and adapted for a journal as a headstart, if you like, on implementing that policy. So thinking about implementing this framework, and then I'll conclude with next steps and impact and adoption.
IAIN HRYNASZKIEWICZ: It's worth considering, particularly as a journal publisher, what your goals are, what your objectives are, with implementing a research data policy. And that's going to help you decide the type of policy, the stringency of policy, that would be more appropriate. So if your objective is to raise awareness and to signal the importance of an issue, then a lower tier policy might be more appropriate than if you're really trying to increase data citation or increase sharing of data in particular repositories.
IAIN HRYNASZKIEWICZ: But it does mean that if you're having stronger policies, there are greater costs associated with that. But then there are also greater benefits as well. For example, there is an association with more citations to one's papers if the research data is available and linked to those papers. Also, zooming out a little bit, consider that ultimately sharing research data, and making sure it's reusable, is an investment really in future research, in future knowledge, as illustrated by the economic analysis called out on this slide here with respect to FAIR data, creating opportunities of more than 10 billion euros.
IAIN HRYNASZKIEWICZ: As for the impact and adoption of this work so far, we've seen it used in the first couple of instances by some Springer Nature journals. And also PLOS journals have updated their policy language to include things like data management plans, which were absent before. There's a collaboration ongoing with an organization called the STM Association, which represents around 150 member publishers, working to use that policy framework as a tool to encourage more journals and publishers to promote data sharing.
IAIN HRYNASZKIEWICZ: The paper itself has been peer-reviewed, has been published. There are 12 citations in Google Scholar. Of course, that's always the highest number. So we like to use that one. But then I think what's more important is the organic impacts and adoption of this work. So we've seen more recently peer-reviewed publications that have reported on adoption of these recommendations.
IAIN HRYNASZKIEWICZ: So a group of regional journals in Slovenia, they've implemented that policy framework in their journals. There's a group of Earth Science and Biodiversity journals, which have also used that framework to inform some road mapping of improvements of their policies. And that all is very good to see, but it also means that as part of this effort, this group, which wants to look at research data policies, ultimately not just for journals but for all stakeholders.
IAIN HRYNASZKIEWICZ: It means we have begun to explore the next problem, which is what opportunities there are for policy alignment with funders and in particular alignment between funders and publishers. Because right at the start of the presentation, I mentioned that funder and publisher policy are both incentives for researchers to share data. So work is in the planning phase now, and will be further discussed at the next Research Data Alliance meeting, relating to what that funder and publisher policy alignment piece looks like.
IAIN HRYNASZKIEWICZ: And there's a draft workplan, summarized here as having three phases, in terms of doing some research to look at what's happening in that funder and publisher policy landscape, what are the opportunities for alignment, and how could those opportunities be enacted in terms of disseminating them or looking at [INAUDIBLE], et cetera. I do want to give a big acknowledgment here to Jeremy Geelen of the Canadian Institutes of Health Research for drafting that plan.
IAIN HRYNASZKIEWICZ: But for folks who are interested in that, there are opportunities to get involved. What I want to conclude this presentation with, for the discussion section that's going to come after the presentations, is questions for the NISO community to consider. And those three questions, I will read them out. So number 1 is, what and for whom are the problems that would be solved by creating formal standards for policies rather than a framework?
IAIN HRYNASZKIEWICZ: So what we have at the moment is a framework. It's not a formal standard. It's not machine readable. There are different policy types. It is something that comes up in discussion. So I think it'd be useful to discuss what problem we are solving, and for whom, by moving more towards formal standards for types of policies. The second question, relating to the publisher and funder policy alignment piece, is, for which features of policies would funder and publisher policy alignment have the greatest impact on data sharing?
IAIN HRYNASZKIEWICZ: Is it data management plans? Is it repositories and citations, et cetera? And then also, acknowledging that this policy framework is just that, a framework, within those many features, 14 features, there may well be more work that needs to be done to build out those features or build up capacity or scope in order to ultimately affect the goal, things like data and software citation, for example, which is the topic of the next presentations.
IAIN HRYNASZKIEWICZ: And I am going to hand over to Shelley Stall to begin the discussion on that topic.
SHELLEY STALL: Iain, thank you. This is a really great place to begin for our talk. It's a good set up for challenges as journals and publishers are looking to implement elements of the framework and considering what that means for them. So with me are members of the FORCE11 Software Citation Implementation Working Group, The Journal Task Force.
SHELLEY STALL: Are my slides looking OK to everybody?
IAIN HRYNASZKIEWICZ: Yes.
SHELLEY STALL: OK. Thank you. So we're going to talk about some of the challenges that journals are experiencing right now and give some recommendations for how to address those having to do with data and software citations. And I hope you find this worthwhile in helping to get credit to our authors for these citations, those data producers, those data creators, software developers.
SHELLEY STALL: And we're really excited to share this work. So I'll start off with a brief review of what's happening in the reference section. Patricia Feeney from Crossref is going to provide you with current and upcoming guidance for the Crossref schema. And then Rosemary Farmer from Wiley is going to give you the recommendations in brief and offer an opportunity to connect to the FORCE11 group for additional support.
SHELLEY STALL: So really exciting. Our group recently published recommendations to journals for software citation. So a plug for the group leader in this meeting who's going to be providing you information on that. And please don't miss their talk. And we're really excited. And it's in F1000. It's been fully peer-reviewed.
SHELLEY STALL: And here's my work. And note that I'm getting credit for this. It's really exciting. And both F1000 and Crossref have pushed this to my ORCID. And that's how it should work. Like, this is my paper and I get credit. So here's the paper. And we referenced a paper that was published in Scientific Data in 2018 called The Data Citation Roadmap for Scientific Publishers.
SHELLEY STALL: So this is in the reference section with a citation. And if you go look over at that paper-- so you're following the map, I'm leaving you the breadcrumbs-- if you look at that paper, you can see that there are 33 citations of that paper across the various articles. So if you dig even deeper-- so keep following the breadcrumbs-- if you dig even deeper, you can see that in the XML provided to Crossref, these linkages take place.
SHELLEY STALL: And it's beautifully down in the weeds, the nitty gritty. This is what fuels the citations, the linking, the math. Everyone's looking at the math. Iain just talked about the fact that Google Scholar has the highest numbers. Well, isn't it funny that we've got different numbers? Maybe we can figure out how to get the math to work better. So that's a paper referencing a paper.
SHELLEY STALL: Here's that math. So the new paper in F1000 references the Scientific Data paper and then plus 1 on crediting the authors of the Scientific Data paper. Very exciting. OK. Now, I, as a scientist, am elated because you've just cited the paper that I've worked very hard on and you found it valuable in your own work.
SHELLEY STALL: We're building on each other's work. This is brilliant. So now, we're going to take a look at what happens with the data citation. This isn't 100% across the board, but it is a significant issue for a lot of journals. I'll use one of [INAUDIBLE] journals as an example because I might as well look in my own backyard. So Earth's Future is one of our journals.
SHELLEY STALL: We had a really fantastic article published in 2019 on stresses on the river basins. There's a data citation in the references. It looks beautiful. You see the DOI. You note that it's in Dryad. This is very exciting. If you go look in Dryad, here is the actual data in Dryad. Note at the bottom, there are two citations.
SHELLEY STALL: Now, you see how some of the math is working. And let's go interrogate those two citations. So remember, the paper was in Earth's Future. And you note that here we've got Nature Sustainability. And we also have Scientific Data. And where's the Earth's Future paper? I don't see it. We cited this data, it's not there. All right.
SHELLEY STALL: Now, we've got some math issues. So let's head over and take a look at the Crossref record. So if you're savvy on the XML, what you're going to see is that our persistent identifier, that beautiful Dryad DOI, isn't there anymore. Uh-oh. So crazy. This is actually a black hole that a lot of our data citations and some of our software citations are falling into.
SHELLEY STALL: And I just want you to know that this is a representation of an enormous amount of data that created the very first image of a black hole in April of 2019. And don't forget I come from the American Geophysical Union where these are scientists. So connecting it all back to the science. So here are some slides that Kristian Garza from DataCite, who is just watching over these kinds of problems very closely, presented at RDA (Iain explained the Research Data Alliance) at the Helsinki meeting in 2019.
SHELLEY STALL: And I want you to be horrified. So here, the number of linkages of data to papers from inception to 2018 is just over 100,000. And we know that across all of our publications, we have published way more papers than that by 2018. And now, at the time these numbers were taken, which was August of 2019, we still only have 400,000 linkages.
SHELLEY STALL: So knowing that most papers, not all, but most, have a data-- are using data for their research, what we can see here is that the reported linkages are primarily coming from data repositories, reporting the linkages, not from the publishers, not coming from a reference section. And these are metrics of the problem I just showed you. And the goal is we need to fix this. For all of the papers that we're publishing, we want 100% of those citations to correctly make it through our production process in the machine readable version.
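The goal stated above, that every data citation makes it through production into the machine-readable version, can be checked with a simple comparison. What follows is a minimal sketch in Python, assuming the DOIs in the accepted manuscript's reference list and the DOIs in the deposited metadata have already been extracted into two lists; the function name and inputs are hypothetical, and the DOIs shown are made-up placeholders.

```python
def citation_survival(manuscript_dois, deposited_dois):
    """Compare DOIs cited in a manuscript with DOIs present in deposited metadata.

    Returns (fraction_surviving, sorted list of DOIs that were lost).
    DOIs are compared case-insensitively, since DOI names are case-insensitive.
    """
    cited = {doi.strip().lower() for doi in manuscript_dois}
    deposited = {doi.strip().lower() for doi in deposited_dois}
    lost = sorted(cited - deposited)
    if not cited:
        # Nothing was cited, so nothing could be lost.
        return 1.0, lost
    return (len(cited) - len(lost)) / len(cited), lost


# Made-up placeholder DOIs, not real records.
fraction, lost = citation_survival(
    ["10.5555/article-1", "10.5555/DATASET-1"],
    ["10.5555/article-1"],
)
print(fraction, lost)
```

A fraction below 1.0 means some cited identifiers were dropped between the reference list and the deposit, which is the loss illustrated with the Earth's Future example.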
SHELLEY STALL: And really, we need you all to pay attention to this issue because most of us have this problem. So I'm going to hand off to Patricia. Patricia, I'll be happy to advance your slide. And I will stop my video.
PATRICIA: All right. Thank you, Shelley. Yeah. So I'm going to talk specifically about Crossref and data citation. Citations, they are a very core part of our infrastructure, so for most publishers, Crossref metadata is really the key to have your citations identified and connected to research.
PATRICIA: And we've been on a bit of a long road to handling citations efficiently. Next slide. Thank you. We want to make this easy. And conceptually, it's very easy. Members, you gather citations of all kinds, data and software included.
PATRICIA: You send them to us. And then we pass them along. Yeah, that's really simple. Why isn't it happening? Let's go on to the next slide. There are a lot of ifs involved with this. If you send us a citation, we do pass it along to our REST API outputs and our XML outputs. But identifying whether a citation is a data citation or a software citation and what should be done with those, that can be very tricky.
PATRICIA: Currently, the step one here depends on members collecting data citations in the first place and understanding how and why those should be supplied to Crossref. That involves technical and cultural change, both are very hard. We've been discussing data citation with our membership a lot over the past few years. And currently, I'll be frank, our support and guidance for this hasn't been robust partially because citation practices community-wide weren't clearly defined for a long time.
PATRICIA: I think that's changed a lot recently, but they've been evolving. And more importantly, we have our own limitations as to what changes could be made. We're coming out from the heavy load of technical debt. We haven't been able to be as nimble as we would like. But I'll get to this in a bit, but hopefully that's going to be changing. So but for a while, we were making recommendations that worked within what we could support, not really the way we would ultimately like to support data and software citation.
PATRICIA: But the recommendations weren't clear or very easy for our members to follow. So there wasn't a ton of uptake. But for those who are sending us data citations, like Shelley mentioned, step two, we match citations to DOIs. That works great for journal articles. But it doesn't work so great if an item doesn't have a DOI, as many software and even data citations don't, or if the citation for something doesn't have a Crossref specific DOI.
PATRICIA: You can supply us a citation that has a DataCite DOI. And if you supply the DataCite DOI, and explicitly say, this is a DOI, we'll be able to pass that along. But if you don't have the DataCite DOI included in your citation, we can't add it in for you. We just aren't technically able right now. And many publishers don't collect those DOIs or do any matching on their end.
PATRICIA: And so those data citations do get lost. For step three, citations are available via Crossref APIs, and we send them along. We do send all citations that you send to us and opt to make public to our JSON and XML outputs. And they're available for downstream users. And those also include any DOI matches we can make. But data and software events are in our Event Data API and the Scholix API as well.
PATRICIA: And those are really key to making these connections between data and software downstream. So this essentially means that if we don't have a DataCite DOI for a data citation, we can't send these citations to our Event Data API or to the Scholix API that are so essential to connect data and software to research. It's a weak link. Next slide.
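The steps described above amount to a routing decision based on whether a citation carries an explicitly supplied DOI. What follows is a minimal sketch in Python, assuming a citation is a plain dictionary with an optional `doi` key; the field name and the destination labels are illustrative only and do not reflect Crossref's actual schema or API.

```python
def route_citation(citation):
    """Return the outputs a deposited citation would reach, as described in the talk.

    Every public citation is passed to the XML and JSON outputs. Only a
    citation with an explicitly supplied DOI can also be sent on to the
    event and Scholix style link services.
    """
    destinations = ["xml_output", "json_output"]
    doi = (citation.get("doi") or "").strip()
    if doi:
        destinations += ["event_data", "scholix"]
    return destinations


# Made-up placeholder citations, not real records.
with_doi = {"text": "Example dataset citation", "doi": "10.5555/dataset-1"}
without_doi = {"text": "Example dataset citation"}

print(route_citation(with_doi))
print(route_citation(without_doi))
```

The second call shows the weak link: the citation text is still passed along, but nothing downstream that looks for DOI-based links will see it.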
PATRICIA: So I do have just a few examples because I know some of you have looked at our reference data and it's always good to have an illustration. So this is a good data citation for us. It has a citation plus a DOI. We can send that where it needs to go. Next slide. This is the same citation without the DOI.
PATRICIA: This means we'll pass this along to our XML and JSON outputs, but we don't send it to Event Data, we don't send it to Scholix. Anyone looking specifically for DOI citations, they're going to miss this. And so it's essentially lost in that black hole that Shelley mentioned. Next slide. So this here is a well-formatted software citation.
PATRICIA: It has a software specific identifier, but we don't recognize it as being a software citation. We recognize it as being a citation that doesn't have a DOI attached to it and has an identifier that we don't do anything with. So I think this kind of illustrates what we're working with right now. Next slide. And it might sound a bit dire, but I'm not here to bring you down.
PATRICIA: We're making some changes to allow data and software to be identified and passed along. And these changes are actually pretty simple. We are allowing you to flag citation types. We are allowing you to define identifiers in your citations. And for those who supply structured references to us, you can actually add some data specific pieces of metadata. That will be very useful. The changes we're making will align with [INAUDIBLE] fairly well.
PATRICIA: So we're hoping this can easily be added to our member workflows, particularly for those of you who are collecting data and software citations already. And for the citation types, we're adding data as a citation type, software as a citation type. But we're also adding journal article, book, preprint, that sort of thing. And so hopefully, if we can get our members on board with this, that will really make our reference metadata a lot more useful.
PATRICIA: And we're hoping, since we are following, aligning with the [INAUDIBLE] recommendations that it is something that can be done without too much pain. We'll see what happens, but I'm feeling very good about this. Next slide. So this means you can explicitly say, this is a software citation, and we can work with that as can anyone downstream who uses our data.
PATRICIA: So we can, down the road, pass this on to our Event Data service. We are also explicitly marking up identifiers. And we're not going to be able to do anything with some of them, as in this example, where there's a Software Heritage identifier. But I think there are a lot of people who use our data who will be able to make use of this metadata. So I think there's a lot of value in providing this because there are a lot of organizations that really build on the metadata we provide.
PATRICIA: Next slide. So quickly, this is just an example, all marked up with the pieces of metadata broken up into different fields. You can tell by looking at this what is being cited. More importantly, machines can tell very clearly, and they can make these connections easily between citations and research. Next slide.
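The idea of a citation broken up into typed fields, with an explicit citation type and explicitly marked-up identifiers, can be sketched as a small data structure. This is a minimal illustration in Python, assuming hypothetical field names, type labels, and placeholder values; it does not reproduce the Crossref schema, which the talk does not spell out.

```python
from dataclasses import dataclass, field


@dataclass
class Identifier:
    scheme: str  # illustrative labels such as "doi" or "swhid"
    value: str


@dataclass
class Citation:
    citation_type: str  # illustrative labels such as "data" or "software"
    title: str
    identifiers: list = field(default_factory=list)


def select_by_type(citations, citation_type):
    """Return only the citations flagged with the given type."""
    return [c for c in citations if c.citation_type == citation_type]


# Made-up placeholder entries, not real records.
references = [
    Citation("journal-article", "Example article",
             [Identifier("doi", "10.5555/article-1")]),
    Citation("software", "Example software package",
             [Identifier("swhid", "placeholder-software-identifier")]),
    Citation("data", "Example dataset",
             [Identifier("doi", "10.5555/dataset-1")]),
]

software_only = select_by_type(references, "software")
print([c.title for c in software_only])
```

With an explicit type on every entry, a downstream consumer can pick out software citations even when the identifier is one the depositing service does not itself act on.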
PATRICIA: So just to sum up, these changes aren't quite in place, but I have a team actively working on this right now. So we hope to have this up and running in a few months. I don't-- we're not far enough along to be able to predict timelines. But I think you should all be looking for some good news from Crossref about this in the spring. And also, we'll be working to expand support for making connections in Event Data and other services going forward.
PATRICIA: But I think just being able to identify the software citations is a very important step. OK. Next, Rosemarie is going to share some of the steps we can take to work towards this.
ROSEMARIE: Thanks. So how do we fix the math? How do we make sure that data and software citations are properly linked and that they get counted? Next slide. So two key recommendations that we've come up with have to do with, first and foremost, making sure that our XML is consistent and aligned with industry standards for the most reliable output.
ROSEMARIE: And secondly, agreeing persistent identifiers for software and data and establishing validation mechanisms for them. These two things should help us ensure successful citation linking and give credit to the creators. So thank you for listening to our presentation. The FORCE11 Software Citation Group will continue its work. And we'll come back with lots more recommendations.
ROSEMARIE: Obviously, there's a lot more policy and operational discussions and decisions that need to be made. And we welcome questions at this time.