Name:
Maximizing Data Sharing Policy Impact through Implementation, Compliance, and Support Workflows
Description:
Maximizing Data Sharing Policy Impact through Implementation, Compliance, and Support Workflows
Thumbnail URL:
https://cadmoremediastorage.blob.core.windows.net/95cee71b-c27d-48cf-aaf1-5c5f3d73c490/videoscrubberimages/Scrubber_0.jpg
Duration:
T01H06M01S
Embed URL:
https://stream.cadmore.media/player/95cee71b-c27d-48cf-aaf1-5c5f3d73c490
Content URL:
https://cadmoreoriginalmedia.blob.core.windows.net/95cee71b-c27d-48cf-aaf1-5c5f3d73c490/session_1c__maximizing_data_sharing_policy_impact_through_im.mp4?sv=2019-02-02&sr=c&sig=xVEfO1amxWSKBXpVyodSa85tqiMRwKpRpa8aXms4QhI%3D&st=2024-11-20T02%3A37%3A35Z&se=2024-11-20T04%3A42%3A35Z&sp=r
Upload Date:
2024-02-23T00:00:00.0000000
Transcript:
Language: EN.
Segment:0 .
Thank you so much to everyone who opted to come to a session entitled Maximizing Data Sharing Policy Impact Through Implementation, Compliance and Support Workflows instead of a musical. When we saw that we were up against the musical for this session slot, we had a feeling that maybe two people would come. So I think I speak for all of us when I say that we're really excited to see a nice room of people who are excited about data sharing policies.
Before we get started, we're just going to go through the code of conduct. You've probably seen this already a good bit, but this is going to be a high-participation session, so we just want to be sure that everyone's engaging according to SSP's code of conduct. It's a little bit short, so I'm going to read it. The code of conduct states that the Society for Scholarly Publishing is committed to diversity, equity and providing a safe, inclusive and productive meeting environment that fosters open dialogue and the free expression of ideas, free of harassment, discrimination and hostile conduct.
So please keep that in mind when you're working in your groups and when you're reporting back to the rest of the room. All right. So first Tim is going to introduce us. Then we're going to have a group activity where you will talk about what you would do if you had to implement a data sharing policy, if you were a small publisher and that was suddenly required next year. Then we're each going to present a little bit on aspects of implementing and enforcing data sharing policies. Then we'll do another group activity where you'll talk about how you would respond to new funder data sharing requirements at your own organization, instead of at a hypothetical publisher. And then we'll have a panel discussion and a Q&A.
So with that, I'll hand it off to Tim. Thank you. Yeah, I'm doubly glad to be here because otherwise I was going to get roped into the musical. OK, so you probably have this feeling around open science that things are changing really quite fast. The OSTP memo, the Nelson memo, has really kicked into gear the need for all organizations to be thinking about open science. The memo states that all research data generated by US government funding needs to be available at publication, which implies that the work of making it available needs to be done before publication.
And because publishers are the ones that work with data before publication, that means us. We can also be fairly sure that policy changes aren't done. The European Union is going to do something too, and institutions may start to get involved. In particular, one of the changes we've seen with the NIH's new policy stems from the fact that the NIH awards contracts not just to individual researchers, but to the institution and the researchers together.
Under the new data management and sharing policy the NIH is bringing in, data management and sharing plans are formulated at initial submission, the researchers go out and collect the data, and then the institution is responsible for ensuring that the data the researchers have collected is made publicly available at the end of the process.
And this is a really hard problem. The institutions have to make this data available on pain of not getting further NIH funding, so this is a pretty terrifying prospect for an institution. Institutions are probably going to do something, and you can imagine that it's going to push back onto publishers as well. And the question that I think every organization needs to ask itself is: are you going to be reactive or proactive?
One good thing about open science is that the destination is fairly well known. That is, if you go fully proactive on open science, you know what you need to do: you need to ensure that authors share all of the data underlying their manuscripts in a FAIR repository at publication. And if you can get to that point consistently for all the manuscripts passing through your journal, then you're probably good.
It may be difficult to get there, but it's not impossible to know where to go. So that's the proactive stance: just getting there ahead of everything and then saying, yes, we've done what you want us to do. Or you can sit back, wait to see where things are going to go, and react as new policies come down the pipeline.
And so we want to highlight these options. Sorry, let me go back. The three aspects of being proactive here are, first, monitoring to establish the current situation: what proportion of your authors actually share any data right now, what proportion share it in a repository, and what proportion share data as supplemental material, for example. The next step is to implement open science policies, requiring data sharing at publication, for example, or code sharing, and then to promote compliance with those policies. As everyone probably knows, these can be quite a lot of extra work that can be hard to scale, especially if you're a large journal.
Finding the resources to do this is potentially challenging. So what we want to talk about here is a thought exercise. You're a manager. You're managing a small publisher. The funding agencies have just released a new policy: they will no longer cover publication costs, that is, they will no longer cover APCs, unless the publisher can prove that the data is shared alongside the published manuscript.
So they are shoving the responsibility for ensuring compliance with the Nelson memo down onto publishers. The publisher then has to prove that the authors have shared all the data underlying their article, and the publisher has to prove that to the funder in order to receive the APC.
And institutions have announced that they're not going to pay for subscriptions to any outlets that don't require data sharing, perhaps with an eye to this NIH mandate. So what are you going to do? This policy is coming into force in six months' time. We can leave this slide up. We want you to get into, let's say, four groups: the back five rows and the front five rows here, and the back four rows there, as four separate groups, and discuss.
We're going to send each group a piece of paper. Lay out the steps that you as an organization are going to take, and then we'll reconvene and talk through all the various ideas from the groups. OK, so I think we have a scant five minutes for everyone to report back, so I might invite the rapporteur from each group to come up to the mic in the middle, if you're all stoked about your data policies.
Um, we have to communicate that policy to authors and make sure that they understand it. And I fed in that I work for F1000, who kind of do most of this anyway. We mandate data sharing on submission for everything we publish, but even then, when it's clear in our policies, or we think it's clear in our policies, we still have a good percentage of authors who don't know.
And we get that back and forth to make sure that the data is provided, so there's a big education piece. Staff need to be trained to make sure that they know what they're looking for. Is there a way to help with that through some kind of AI tool that does automated checks on the links that are provided and where the data is stored? There was also a question about whether we need to train editors, or whether this is something that needs to be done before editors really get involved,
because it can be very cost heavy in terms of using those people to do that. Another question that came up was: what do we do about plagiarized data? Is the data actually what it says on the tin? And is that AI checking AI, and all of those questions. But I think a lot of it is also whether you can do this in a stepwise manner. Do you have to do all of these things at the start?
Or can you start by simply requiring a data availability statement, say that's good enough for now, and work on the rest in the future? The second part is where in the process you require that data. Do you do it right at submission? That's probably the easiest way to do it, because then you've got the longest time to have those conversations and make sure the data is there, and you haven't got to keep going backwards and forwards.
Or do you do it at first decision, or at revision, or at acceptance when peer review is complete, depending on the model? At F1000 we publish before peer review, which is why we absolutely have to have it at submission. So I think there are questions around that. And then the third part is: where does this data live? At the minute it can be shared in multiple different locations.
There are lots of different repositories, and different communities have different repositories that they use as standard. Is there a way to get some kind of consensus between funders, institutions and publishers on a set of preferred repositories, and how would that even work? From a publisher perspective, if there is a set of one, two or three, it makes it easier for us as publishers to build something like a technological integration with those repositories to ease this process for the author.
So there's a lot of change at all of the different stages. But I think education, understanding the specifics, and then really working together as a group of different stakeholders in the same flow will hopefully ease that and get that education out to authors and institutions more widely. So yeah, I think that was everything. Anything I missed? No? Good.
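One suggestion in that report was an automated check on the links authors provide in their data availability statements. As a minimal illustrative sketch only (the example statements and URLs below are invented, and a real checker would also need to assess repository suitability and metadata, not just whether a link resolves), such a check could look like this:

```python
# Minimal illustrative sketch only: test whether links in data availability
# statements resolve. The example statements and URLs below are invented.
import re
import urllib.request

statements = [
    "Data are available at https://doi.org/10.5281/zenodo.0000000.",
    "All data are available from the corresponding author on reasonable request.",
]

URL_PATTERN = re.compile(r"https?://\S+")

def check_statement(text):
    """Extract links from a data availability statement and test whether each resolves."""
    results = {}
    for url in URL_PATTERN.findall(text):
        url = url.rstrip(".,;)")  # strip trailing punctuation picked up by the regex
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                results[url] = response.status == 200
        except Exception:
            results[url] = False
    return results

for statement in statements:
    links = check_statement(statement)
    if not links:
        print("No link found (possibly 'available on request'):", statement)
    else:
        for url, ok in links.items():
            print("OK" if ok else "BROKEN", url)
```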
All right. Just talk loudly. I was going to say, I think it's fine. OK, fine. All right. We had a similar sort of conversation, with issues around all of the things that Andy's talked about. In terms of first steps or next steps, we talked about what a community needs.
Assessment would be an important thing to do, because where we start in trying to help authors and the communities we're working with will depend on where they're starting from. Different disciplines are really starting from different places in terms of data, including whether they even know or call what they're using data. So we would want to look at that, and also review our existing policies: if we have a policy, are we already complying, how do we need to modify what we're doing, and how do we communicate this external mandate to the authors and communities we work with, and then develop best practices out of that?
And in that process, track compliance and monitor uptake. And then also, assuming in this scenario that we're at a small publisher, how do we amplify the resources that we have? We're probably a bit resource limited. How do we connect with industry bodies to find existing resources, to understand things like certification for repositories and potentially existing shared materials, and how do we work out whether that's already there?
We can work at scale with other, similar publishers to make sure that we're not just reinventing the wheel with resources that we don't have. And out of that, really understand the policy and essentially define a minimal viable approach, such that we're complying, but with shared compassion, I think was the phrase that was used, because when you don't have a lot of resources as a small publisher, you really have to be a little kind to yourself and know that you're not going to be able to do everything.
But you, of course, want to serve your community as well as possible. So that's about as far as we got. Same, same. Thank you. Yeah, we also didn't solve the entire list of questions or problems. We had a variety of folks discussing this: publishers, librarians.
And we talked a bit about the checks and balances issue, and pre-publication versus post-publication where we have these policies. Do we actually check that the data said to be shared is actually being shared? Are we getting the right kinds of metadata with it? Do we know how long it has to be available? We also talked a bit about ownership and de-identification in some of the different communities, and the issues there.
Being from Canada: indigenous communities are entitled to own their own data, so what does open mean there? And we noted that the expertise around the data being shared generally sits with the authors, and not necessarily with us as publishers, as librarians, as folks working in industry. And authors are seeing that requirement to deposit the data as an administrative burden.
And we really do need that expertise to be able to ensure that we're sharing data that is usable and reusable. We also talked a bit about the costs of this. That's an issue for smaller publishers, and really anywhere: who actually has the money to be doing this, and who has the time and the people to put towards it?
So I have just a lot of question marks. So apologies. I don't know if there's anyone else who can answer all of the questions. Maybe the group back there? You did us an old-school favor, giving us pen and paper, and we're probably going to use it.
So, well, I'll just start. OK, the clock is ticking. It's one year they said we have to execute this in, and everybody's only at ground zero. So the first thing to take action on, we said, is really thinking about what this mandate is; a lot of it is TBD.
And you were all talking about needing to find additional resources. Two resources came up in our conversation, and I'll share their acronyms, so be prepared to write them down. One is the NIH GREI, a group of generalist repositories from several different organizations, including Figshare.
I represent Digital Science here, but for this purpose I represent this large group of publishers, librarians and a technologist or two back there. So you can follow GREI; it's going to be the NIH's sort of community outreach initiative in terms of education. So that is something to follow, to see if we can get clarity, or at least find a person to answer questions there. The other one is CARE, and I'll have you say what that is.
I wrote it down, but it's better for you to say it. The CARE principles for indigenous data; it stands for Collective benefit, Authority to control, Responsibility and Ethics. OK, so these are all things that need to go into your policy, which you're going to create in the next year by doing a lot of educational outreach for your organization and your industry, to the funders and the government, to get clarity. And then by year end, hopefully you'll have compassion, and you will come up with the policy that's your minimum viable approach, as you all suggested.
And at the beginning, it would be good to understand who ultimately is going to be responsible for compliance, because that's going to set the direction for who's willing to take on the burden of costs. We're all in this ecosystem together, but we all have varying stakes in it. And to the point about who's the best person to do this particular role in terms of data curation, data management, metadata collection and all of that: that's a big question, because that work is what's going to make the data reproducible and usable in the future.
And that's the point, so we probably need to get that right up front. So yeah, we had a plan: we thought that by year end we would have a policy, we would do the education internally, start to work on our workflows, and then work on educational outreach to our audiences and our authors. But the question was how many new subject matter experts we're going to need within our organization, and how much those jobs are worth.
And is that a new job opportunity for everybody? That also comes back to business models. We know from our experience of the last 20 years of open access that we're still working out new business models and trying to understand what works for everybody, and the implications. And ultimately, in the voice of past president Angela Cochran: it's going to cost money.
Who's paying? We need to have that in mind all along the way, to build models that will work for everybody. OK, that's it. Thanks so much for all of that; really good ideas were coming out of those discussions. And it's great to see that a lot of it is actually what we're going to be covering in the next section.
So the next section is going to involve the four speakers we have as panelists, running through a number of different topics around data policy implementation and data policy challenges, everything we've just been talking about. And it's great to see, like I said, so many of the same challenges. Hopefully we're able to speak to them now with our three publishers and Tim from DataSeer.
So, to introduce myself, I'm Graham Smith. I'm the Open Data Program Manager at Springer Nature, and I'm going to be talking about the supporting structure that's really needed to embed data policies, and also about the process of moving data policies up from relatively weaker standpoints onto stronger data policies that can actually have the impact we want to see, in line with things like the Nelson memo. Now, this is quite timely for us, because a few weeks ago, during the Research Data Alliance plenary, we announced that we were moving to a single data policy.
In fact, one for our journals and one for our books. And the important thing for this discussion is that that involves both rolling out a policy to a very large number of journals and moving up in terms of the strength of the policy itself. So this was our initial starting point, our previous policy structure in terms of data. It had four levels.
They moved up in terms of the strictness of what was required from authors. At the first level, really just basic encouragement to share data; at the top level, all data having to be shared. And what we really see is that in between there's this focus on transparency and on data availability statements, as well as an encouragement to share data, and some mandates around very specific data types that are embedded in community expectations.
Now, this was really groundbreaking at the time, because it gave us the ability to move any of the 3,500 journals in the portfolio onto a standardized data policy. Previously it had been much more scattered, almost like the Wild West: certain journals doing a lot, certain journals not really having anything around data, different definitions. When we introduced this back in 2016, it really gave us the ability to standardize what we were doing.
This allowed us to move any of our journals onto a data policy. So where we had got to: we had certain of our imprints, such as the Nature Portfolio and BMC, on policy level three, as it was called at the time, requiring transparency and requiring data availability statements; but other portfolios, the Springer and Palgrave titles, were really split between policy levels one to three, or not actually having a policy at all.
So that was the starting point, but we were progressively moving towards this type three as a standard: data availability statements, really the idea being to embed transparency as that supporter of integrity, but not going to the level of saying we're going to mandate sharing immediately, like we were just discussing in the exercise. The new single data policy really just consolidates that move.
We decided to take that step because, since these four types were developed, we found the same sort of thing happening as before we had data policies: different subtypes were being developed, different journals were taking things in slightly different directions, and it was getting confusing, as well as the fact that talking to authors about type two or type three isn't immediately accessible. So we've focused on data availability statements to underpin this new baseline data policy, while also enabling journals to do more if they want to, so the mandatory data policy is still in place, but for a much smaller number of journals.
Now, I'm only going to talk briefly about some of the key challenges and our solutions to them, but I think I can summarize the approach that we took. Key challenges: many of these you already covered within your groups. Standardizing guidance across journals was something we found challenging, as was the ability to present, in straightforward terms, what we want you to do with your data when submitting to one of our journals.
We would love it if research data management were such a well recognized and high-profile activity that we didn't need to frame it within the submission process that way, but that's really what we saw as the challenge for policy: summarizing this in a straightforward way. We also touched on which roles provide compliance checks. This was a big question for us.
Do the editorial support teams take on the burden of this? Is it something that's just added onto the plates of editors and reviewers, for example? What level of training is required, in terms of data repositories, data citation and sensitive data, to be able to get to a level of policy compliance? And can we actually make these policies work across all disciplines?
That was our aim: we wanted to have a policy that was applicable to any research area, and certain research areas may not even have considered that they were working with data. There have been a lot of moves in that area recently, especially in the humanities and social sciences; Rebecca's been quite a key part of that. And updating the systems: I've given that three words, and it is probably one of the biggest things that you can try to do in terms of data policy.
Just getting this embedded in your processes and your systems in an effective way, so that you can actually be compliant and show the right information to authors. Broadly speaking, our approach has focused on this expertise question, which came up: training support teams for a minimal level of compliance. So what is the minimal compliance with the policy? That is what we put onto our editorial support teams.
We train them from a data expertise perspective, but the idea hasn't been to embed a data steward or curator or specialist in every single journal. How we are using our data specialists, which is the team that I manage, is as trainers and as escalation points for the more complex queries that come up. For those more complex queries, we almost take an approach that's like case law: you build a library of the different cases that have come up, what the solutions were, and what the precedent is.
That way you're not looking to turn everything into a checklist. It can't just be that every single thing goes through a binary yes or no; in some cases that works, but you can't just put that in front of your editorial support teams, for example. So: an effective escalation process, and collaboration with editors as a two-way process. You have to bring editors into this conversation, acknowledge the challenges they have in their specific area, and fit that into how the policy is actually managed.
And that leads into the last point, which is understanding what's appropriate for policy and what's appropriate for practice. A lot of the talk we've already had was about repositories and practical solutions around this. What do you put into the policy, as opposed to what else can you do, for example, to nudge authors towards best practice?
That's something that we've really tried to do by making solutions more seamless and more accessible, but without necessarily going to the level of saying you must share this data at this specific point. So there are other tools, as well as the policy, that are available. It's really about taking a community approach: these things have grown, for example, in the genomics community, via different community levels of growth and action.
That's what we're trying to embed. So we embed these best practices and case studies, really telling the story in different communities, which we do, for example, through our helpdesk and our research data community, so that it's not just a checklist, essentially. OK, I'll hand over to Rebecca. Thanks so much, Graham.
So, am I popping? Thanks. So Graham was speaking from the perspective of a publisher that has this suite of data sharing policies, and I used to work with Graham, so I'm quite familiar with the various data sharing requirements across Springer Nature journals. And I think even now, with the strengthening of policy at Springer Nature, many journals are on a policy that's quite well aligned with what the OSTP announced, which is around making data available and having a data availability statement, but not necessarily requiring that the data is fully open or deposited in a data repository.
So I'm now speaking from the perspective of F1000, where we do have incredibly stringent open data policies across all of our journals, and I just wanted to speak briefly about what that means for us and our team, what the benefits are, and of course some of the challenges. For anyone who is not familiar with what a standard approach to a fully open data policy looks like, this is a graphic representation of what we require at F1000.
So we do require a data availability statement, as many publishers now do. But in addition to the data being described by the author, it has to be in a data repository. You cannot make your data available on request, and you can't upload data as supplementary files; you need that data to be in a repository, and the repository has to be one that's on our list of approved data repositories. We do have a bit of flexibility there, because some institutions have institutional repositories of a very high standard.
So we'll make an assessment of that if an author comes to us having used their own institutional repository. We focus on reuse with our policy: we want to ensure not only that the data is open and visible to the reader, but also that it can be reused, whether that's to build something new with it, to integrate it into a future experiment, or maybe to attempt to replicate the author's experiment.
So we only allow the data to be licensed as either CC BY or CC0. At a maximum, the author can require that people reusing the data cite them as the creator, but that's as strong as the license can be. We don't, for example, allow authors to prevent commercial reuse of their data sets. We do ask that the data is made FAIR to the extent that that's possible.
So we've aligned our policy with the FAIR principles to try to ensure that the data is findable, accessible, interoperable and reusable. Again, this is aiming for reuse of data. I'm not sure to what extent authors really understand what FAIR means, but we do try to explain to them what they could do to make their data more FAIR and more reusable. And then we also want to make sure that, having gone through this process of making the data accessible and reusable, authors receive credit for having done so.
So we also require data citation, and we tag data citations in our reference lists so that we can track that the thing being cited is a data set. We have, I guess, a slightly unusual publishing workflow at F1000: peer review happens after publication. So we have editorial checks by our internal editorial team, the paper is made public, and then peer review happens openly; the peer review reports are available, and all the versions are made available as the author goes through the versioning process.
This also means that the data is accessible to the peer reviewers, and we have some guidance for our peer reviewers on what they should be looking out for when they assess the data set as part of peer review. So you can see it's really comprehensive. I think there are a number of elements of this policy that authors may not have encountered before, so being ready to explain the policy and to support the author in complying with it is really important, and I'll come back to this.
That also means that our internal teams need to understand what all of this means and how it applies to different disciplines. So I thought I would start with, I didn't want to call it benefits and challenges, but on this slide I'm thinking a bit more about why this is something we do. Why is this core to our publishing model at F1000? And to me, it's really central to our mission and to our vision.
So, being a fully open access, open science publisher, I don't see how you can have transparent research being published if you can only read what the author says about the data and never see the data set itself. So I don't think we would be F1000 if we didn't have such a transparent and open data policy. And when you talk about open data, or even open science more generally,
there's an assumption that open data will lead to reproducibility or replicability of science, and I think that goes a bit too far. What we don't ask our peer reviewers to do is replicate an experiment, so I wouldn't want to say that, having published with us, you have replicable research, but it's certainly a step in the right direction. And if you're thinking about the baseline of what's needed to make research reproducible, the data has to be open. Thinking more broadly about what this means as a publisher,
we've been talking a bit internally about how this connects with research integrity. And if you were at Elisabeth Bik's amazing keynote last night, you'll know that even having the data open, accessible and peer reviewed doesn't necessarily mean you can avoid research misconduct or even paper mills. But again, it's a step in the right direction, and I suppose the amount of additional work an author has to go through to participate in a paper mill and submit fraudulent research is increased
if you're also asking them to share open data in a repository. And then finally, something that's come up a few times already this morning: alignment with those emerging stakeholder policies. When the NIH came out with their policy statement last year, I think the first thing that I did was an analysis of how it compares to what we're already doing.
And I think across the board we were doing a little bit more than the NIH was asking for, which is great. We can say to our authors: come publish with us; we know what the NIH expects of you, and if you publish with us, we're going to check that you've done everything right, and you should be compliant. I'm not going to make that promise to every author, but insofar as a publisher can support a funder's policy, our open data policy is really doing a lot towards that. And so, on to the challenges, and this is something that Graham already alluded to.
So because we publish across disciplines and across geographic regions, there are massive variances in ability. As Graham mentioned, in the humanities, and I'd say more in humanities than in social sciences but a little in both, there's a lack of familiarity with data sharing. You can't assume that a historian will even know what you mean when you ask them to share their data.
And we do ask historians to share their data. I also think there's a bit of an assumption, for example with the NIH, that US researchers really know what they're doing, but they may still never have done it before. So you have to be cognizant that, depending on who your author is, they may or may not ever have even thought about this. As F1000, with our open data policy, we're still an outlier.
You won't find many journals where you come to publish and are asked to do so much. And this ties into my next point, which is that authors can go elsewhere. Unless you're a strong advocate for open science publishing on F1000, if you don't like what you're being asked to do, you might say: I'm going to go to another journal. I think that will change
as publishers implement stronger and stronger policies, but at the moment we are a little bit unusual as a publisher in enforcing this. And then, as already mentioned, there's certainly a resourcing impact. Our editorial team do the initial checks before publication, and they are very skilled in identifying appropriate data sharing.
Does something have the right license? Has something been anonymized appropriately? That's quite a big burden on editorial, and they're almost acting as gatekeepers in saying whether you've shared this appropriately or not. And it's not just the editorial team who need to be upskilled, because this touches all of our teams at F1000. If you are in the publishing team like I am, or strategic partnerships, content acquisition, or any of our teams who intersect with our authors, you need to understand what this policy means.
You need to have a sense of what good data sharing is, and what open data sharing is as well. So I will leave it there, but I'm looking forward to the panel discussion and Q&A. Jenny? All right, I'm here to throw a wrench into what my colleagues have already been saying. But first, I'm going to force a little bit more participation again. So, here.
Sorry, I'm going to try to get the microphone right. OK, so in April 2019, Psychological Science had an issue with 14 papers in it, all of which were awarded open science badges for open data. I have a feeling that if you're at this session instead of the musical, you know what that means. But if you don't: presumably everyone who published in that issue deposited their data and their code in a repository, and someone checked that in order to award them the badge.
Then recently, a few months ago, another paper came out in Psychological Science, where Sophia Crüwell and other authors attempted to computationally reproduce all of those papers. So how many of those papers do you think were actually computationally reproducible? Let's see hands for all 14. We're not very optimistic in this room.
OK, let's see hands for eight. Got two hands for eight. Wow, you guys all read this paper. Let's see hands for four. All right, let's see hands for one.
I think a lot of the room either knew where I was going or read the paper, so you were correct: the answer is one. This has less shock factor than I think would be enjoyable, but it means we probably all kind of know that things are a little bit bleak in terms of computational reproduction. So I think this is a particularly interesting illustration of some of the issues in the gap between what we typically want data sharing initiatives to accomplish and what they actually sometimes do accomplish.
Because here we see that, first of all, this journal didn't have a data sharing requirement; they just encouraged it and had open science badges, so people didn't actually have to share their data. These authors went out of their way to try to share their data and to apply for an open data badge. And then there was actually a second level of opting in, because Crüwell and co-authors emailed all of the original authors and asked them if they would be willing to be part of this study.
And they were told that the team was going to try to computationally reproduce the results. So it's not as if they just did this and then alerted the authors: just so you know, we couldn't reproduce your results. They actually asked them. So the authors, I would bet, probably thought that their results would be computationally reproducible, and yet only one paper was fully computationally reproducible.
So why is this? Crüwell and co-authors, and the data sharing literature in general, list a few reasons why, even when people share their data, the results won't actually be computationally reproducible. Those are: missing or incomplete data and/or code; lack of documentation of data and/or code; and unclear or incorrect reporting of procedures in the article text.
So in that case, the data and code are shared, but they're just not actually usable in the way that the text describes for someone else trying to reproduce the work. Then there are minor discrepancies in individual results: again, the data and code were shared, but when you try to reproduce the analysis, you find different results from what's reported in the paper. And there are data storage issues such as file corruption.
It's worth noting that this was actually reported in the article, and the article was trying to reproduce data from 2019, so this isn't decades-old data; this is relatively recent data. Then there are software dependencies and differences in hardware or software configuration. It's worth noting, too, that even though I'm describing this in terms of computational reproduction, because that's what the paper I'm talking about addresses, these same issues come up when you're thinking about how data can simply be reused, and thus made more impactful beyond just one paper.
So when we're thinking about, for example, metadata being part of shared data sets: if I just give you an Excel sheet that has thousands and thousands of numbers in it and don't tell you what any of them are, that's not really that useful, even though I've shared my data. You might check, when I'm publishing with you, that I've deposited something in a repository, and you see that my data file is there, but you might not be able to use it if you ever actually wanted to.
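To make that metadata point concrete, here is a small, entirely hypothetical example: the same shared table becomes far more reusable when it ships with a simple data dictionary describing what each column means and its units. The file names, columns and values below are invented purely for illustration.

```python
# Hypothetical example only: a minimal data dictionary shipped alongside a shared
# CSV file, so that a column of bare numbers is interpretable by someone reusing it.
import csv
import json

# The shared data file: without documentation, "cond" and "rt" mean nothing to a reuser.
rows = [
    {"participant": 1, "cond": "A", "rt": 512},
    {"participant": 1, "cond": "B", "rt": 498},
]

# A simple data dictionary describing each column: meaning, type and units.
data_dictionary = {
    "participant": {"description": "Anonymised participant ID", "type": "integer"},
    "cond": {"description": "Experimental condition (A = control, B = treatment)", "type": "string"},
    "rt": {"description": "Response time", "type": "integer", "units": "milliseconds"},
}

with open("experiment_data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)

with open("experiment_data_dictionary.json", "w") as f:
    json.dump(data_dictionary, f, indent=2)
```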
And similarly, if I share my data and code, but the code is only usable in a very specific computing environment, and two years from now the software I was using no longer exists, then you might still not be able to do anything with it. So it's still not really that helpful. Before we go into how you can deal with those issues, we're just going to step back and think about why this matters.
Sometimes when we're talking about data sharing, and particularly when people start throwing around words like metadata and data environment, there's this inclination to disengage and to think: oh, I'm not a technical person, I'm not a data person, this isn't my area. But I think that really ignores how much this intersects with much bigger conversations we're having in scholarly publishing right now.
First of all, when we think about data reusability, and the issues that come up when shared data isn't actually reusable, it really gets at the issue of research waste. Every time someone has to pay support staff, and the myriad other things you have to pay for when you're collecting data, to collect data that already exists but that someone just can't access, that's waste.
And that's really problematic because we all know that research is underfunded; that money could go towards something more effective. Secondly, author experience is something that I think everyone is kind of obsessed with at conferences this year, and I think there's an author experience piece to this, because even if you share imperfect data, it's really time consuming to get your data into a shareable form.
For one, sometimes data files are just massive, so uploading them somewhere can be very time consuming. And two, anyone in here who's worked on a large project knows that your personal files are often just a mess before you actually give them to anyone else, so just getting them into a format where you can readily share them is really time consuming.
If you're asking authors to do that, but the result isn't actually usable, you're just giving them a burden without any value. And thirdly, I think there's a really big trust piece here, because when you have an open data badge on a paper, or a well-advertised open data policy at a journal, there's this feeling that not only did this person's research go through peer review, but they actually published the data and the code with it, and I could just go and reproduce it if I wanted to.
And I don't think that's always true. The Crüwell paper shows that even when people have really good intentions, it's oftentimes not true. So there can be this added level of trust that isn't always merited. So, what prevents reproducibility and reuse despite shared data, and what do we do about it? We already talked a little bit about what prevents reproducibility and reuse despite people sharing their data, in terms of the first few things listed here. One thing that people in the ecosystem are doing is requiring reproduction, not just sharing.
So the American Economic Association came out with a policy in 2019 in which they started requiring all papers to be computationally reproduced before they would be accepted. And if you go to the Center for Open Science's database, where they track the different open science policies that journals have, you can see some other journals that have also adopted this. But the issue is that, one, there are some disciplines for which that doesn't really make sense.
To Rebecca's earlier point about asking historians to share their data, there is some qualitative data that you're not going to be able to computationally reproduce the way you would with an economics paper. And two, it's just really resource intensive: having the personnel, the systems, and simply the capacity to reproduce every single paper can be really challenging, particularly if you're a high-volume publisher.
The other things listed here have a little bit more to do with infrastructure investment. And it's worth noting that sometimes, when you talk about infrastructure investment in scholarly communication, there's this hesitation of: oh no, there's yet another thing that we need to find money for, and we're already relatively under-resourced. But I do think it's worth considering that in some cases this isn't so much asking for more investment as a shift in investment,
so that instead of constantly funding more data collection, you're instead funding ways to make existing data more reusable. I just want to note that this talk was not intended to say that data sharing policies are useless and that you should stop having them unless you're fully reproducing everything, but rather that they're just the first step.
Crüwell and co-authors have a quote that I think describes this really well. They say: it is commendable when authors attempt to share their data. Data and code imperfectly shared are typically better than data and code perfectly kept to oneself. Indeed, our study would have been impossible without the introduction of the Open Data badge. The badge is a step in the right direction, but the corresponding policy needs to be improved to better support and incentivize transparent and reproducible research.
So I think that's the perfect sentiment to end on. Definitely keep implementing data policies, and keep trying to improve them and increase compliance, but just remember that this is the very beginning. For those of us interested in making science more transparent and more efficient, data sharing policies are just the first step.
All right. Thank you, Ginny. So this talk is essentially: what has technology ever done for us? What can it do to help journals have stronger data policies? This is going to be short and sweet. As I said at the beginning, one of the proactive steps a publisher or a journal can take around open science is to monitor: where are you now, and what effect do various policy changes and initiatives have?
What effect did they have on author behavior? There's not a lot of good information out there on how policy changes or various incentives to share data actually affect data sharing, and so this is something that we really need to get a handle on. This is part of the work we've been doing with PLOS on their Open Science Indicators, and here's an example result.
It shows how data sharing at PLOS, which is the darker bars, compares to data sharing in a comparator set of articles, using the fairly strict criterion of the data being in a FAIR repository. We also have another graph, which is off the page here, for data shared anywhere online, which is a higher proportion. But we can see that the proportion of data sets ending up in a repository has inched up to about 30% by 2022. That's still a long way from the 100% that they want, but it gives them the ability to decide how to plot a path forward: what kinds of papers are sharing their data, what kinds are not, and so on. So there's a lot of scope, once you've got this information, to work out what to do. And the other big challenge, which I think everyone has alluded to, is compliance.
How do you go from the broad policy that you've created (dear author, you must share your data at publication) to the point where the authors are actually sharing the data sets from their article? This is a big implementation gap in the open science workflow, because all too often the PI has to sit down and work out, for their particular paper, how the various policies out there apply to this article, and then work out which data sets need to be shared and where they need to go,
and so on. That is hard work, it is resented, and it is a burden on the researcher community. So what we actually need to do is help authors by jumping past that and giving them advice on what they need to do for their particular article right now. And AI can help with that: for example, by going through the article, finding the sentences where the authors describe data collection, working out what kind of data it is, and then recommending to the authors, based on their situation, their funding agency, their institution and the kind of data being collected, where it should go.
And this can be done automatically. That gets the authors to the point where they can hand this task to a more junior researcher and say: here's the list of things you have to do to comply with the journal or funder policy; just go and do them. And the other advantage of this list is that the stakeholder can also see what needs to be done, so they can monitor compliance.
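As a rough sketch of that idea only (the keyword rules and the repository mapping below are invented for illustration; tools like DataSeer use trained NLP models rather than simple patterns), a script could scan manuscript sentences for data collection language and turn each hit into a to-do item with a suggested destination:

```python
# Rough sketch of the idea only: the keyword rules and repository mapping below are
# invented for illustration; production tools such as DataSeer use trained NLP models.
import re

# Assumed mapping from a detected data type to a plausible repository suggestion.
REPOSITORY_HINTS = {
    "gene expression": "a domain repository such as GEO",
    "protein structure": "a domain repository such as PDB",
    "survey": "a generalist repository such as Zenodo or Dryad",
}

DATA_TYPE_PATTERNS = {
    "gene expression": re.compile(r"RNA[- ]seq|gene expression|microarray", re.I),
    "protein structure": re.compile(r"crystal structure|cryo-EM", re.I),
    "survey": re.compile(r"questionnaire|survey responses?", re.I),
}

# Very naive cue for "this sentence describes data collection".
COLLECTION_CUES = re.compile(
    r"\b(we (collected|measured|recorded|performed|sequenced)|data were collected)\b", re.I
)

def suggest_actions(manuscript_text):
    """Return one to-do item per sentence that appears to describe data collection."""
    actions = []
    for sentence in re.split(r"(?<=[.!?])\s+(?=[A-Z])", manuscript_text):
        if not COLLECTION_CUES.search(sentence):
            continue
        for data_type, pattern in DATA_TYPE_PATTERNS.items():
            if pattern.search(sentence):
                actions.append(f"Deposit the {data_type} data in {REPOSITORY_HINTS[data_type]}: \"{sentence.strip()}\"")
                break
        else:
            actions.append(f"Deposit this data set in a suitable repository: \"{sentence.strip()}\"")
    return actions

text = "We performed RNA-seq on liver samples. We collected survey responses from 120 participants."
for item in suggest_actions(text):
    print("-", item)
```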
At the same time, if it's clear to the authors what needs to be done, it is also clear to the stakeholder what needs to be done, and, as the corollary of that, it is clear to the authors that the stakeholder knows what needs to be done. One analogy here is that if you're running a railway, you could decide that you want everyone to have a ticket, and you could put a poster up on the platform that says: to ride on this train, you must have a ticket.
As you can imagine, if that's all you do, virtually nobody on the train will have a ticket, because they can see that all the enforcement amounts to is some general advice. And that is what a general policy is like. What railways actually do is have a conductor on the train, talking to individual passengers and making them individually responsible; in our case, that means making authors individually responsible for sharing their data sets.
And that's really important, because that's how compliance with policies is achieved in regular life: you make people individually responsible. You say, this is what you have to do, and then you check that they've done it. And this approach is really effective. We've been doing this with Aligning Science Across Parkinson's.
So the Z-shaped graph here shows DataSeer's assessment of the first version of the article, that is, when it's first put up as a preprint: the proportion of data sets generated for that particular article that had been shared, which a lot of the time was zero. And we gave the authors a report like the one on the previous screen, saying: here's the list of data sets, and here's where they need to go.
And by the second version, a very high proportion of those data sets had been shared. So this is a very effective approach: you say to the authors, this is what you need to do, and we can see whether or not you've done it. And from there they can actually move on and do a good job of data sharing. And that's it. That's all.
Thanks so much, everyone. So, in classic conference workshop style, we have run out of time for our second activity, so I think we're going to move on to our Q&A and panel discussion now. But if you want, you can use this question to frame some of the questions you have for the panel: what would your own organization do if stronger data sharing policies were suddenly required?
This session is being recorded, so if you have a question, please make your way to the mic in the middle so that your question can be recorded as well as our response. I did myself have some thoughts, having listened to the conversation in the groups and heard from our panelists. I think what came up a couple of times was around resources.
So if you're a publisher who's thinking about either implementing policies or strengthening policies, I was wondering if the panel had any suggestions about what resources publishers could look to. I have a couple of thoughts myself, but maybe I'll pass over to Graham first. This is my...
This sounds like it's working now. OK. So, sorry, was that barriers or resources? Resources. OK. I guess there is quite a lot out there already, but that's also part of the challenge, I would say. And the generalist resource, I'm going to get the name wrong now,
the NIH initiative, GREI I think is the acronym for it, is a very good example of a resource that's relatively straightforward to use that has come out around data sharing. And there's also a complementary domain-specific repository initiative. Likewise, there are quite a few other resources out there, like re3data and FAIRsharing, that can be used to provide, and this is something we're looking at, more seamless access to the information.
So I think where we've got to as publishers is almost a stage of trying to provide all of the information to authors in one go, and it's not manageable or sustainable. So we've had to take this real user-driven approach, which Jenny mentioned: how does this fit into the user experience, to use the development term for it, and what problems is it trying to solve at what point, before building in much more specific tools, which are now available?
OK, and Tim or Jenny, did you want to add anything there? I want to add very briefly that the community is of course not homogeneous. There are some researchers who are absolutely excellent at reproducibility and data sharing, and they can act as a very powerful resource. What should we be doing with this data? Oh, this is what we do with it already. That 10% or 20% of the community that is great at it
can really guide you on what to do, and they're also very powerful for establishing precedent and standards for the rest of the community to follow. On a similar note to Tim's, I was just going to add that if you're at a society or a publisher that's relatively specific to a discipline, your colleagues can be really good resources if they already have some kind of data sharing policy. If your authors typically publish with other publishers too, you can try to roll out something similar to what those publishers do, if you like what they're doing, so that your authors know roughly what to expect whenever they go to a publisher
and so they're not constantly relearning different author guidelines. And typically, too, a lot of us are pretty friendly with the folks we know at other publishers in the same discipline. So on top of all of the resources that everyone else is mentioning, which are a bit more official and which you'd find online and really be able to analyze, just using your network can be a really effective way to tackle this too.
Yeah, I was thinking along the same lines. So this is my first SSP, and personally, in my day-to-day work, I don't often talk to other publishers, or it feels like there's a kind of competitive disadvantage to talking about what problems you have with other publishers. But the two spaces that I would really recommend are the Research Data Alliance, which is a global organization focused on research data sharing, where there's some publisher representation, though I think we could do with having more, and then the STM Association's research data program.
So that one is focused on publishers, and there is a group chaired by Joris van Rossum that's talking about how publishers can increase the implementation of stronger data policies. So I think we have a question from the floor. Hi. So I wanted to ask a question about the comment about having a more junior researcher be able to support the implementation piece.
One thing I was thinking about is where we're dealing with students. I'm a librarian working in a university, so I'm getting researchers who are faculty members, and then I'm getting students who are going to be the future researchers, and I don't necessarily speak to them about open data. I'm wondering, as publishers and people in this space, is there a connection we're missing in terms of getting directly to that group of upcoming researchers?
I know, again, it's sort of the networking and mentoring, but I'm wondering if you've had any thoughts about how to reach those students who will be the next generation more directly. I'd be really interested if anyone in the audience also had thoughts on this. I suppose from my perspective, there's a bit of a sense that early career researchers are the ones who are very open minded, who would be more amenable to open science, who want things to be more equitable, but who at the same time don't necessarily have control over where they publish.
It's not necessarily just their data that they're being asked to share. So I would be interested in whether the panel has any suggestions. You obviously want to be speaking to these researchers and raising awareness, but how do you overcome the challenge of, firstly, getting to them, and then making sure that they have the capability to be more open in their own work?
Yeah, I can speak to a few different initiatives we've had in this area, some of which have had more success than others. And we getting to Akers is a big challenge within this because we tend to think about things in terms of workflows from a publisher's point of view, like we get submission said. But being able to actually drill down and make this distinction on one side is a data challenge there.
There's a technical challenge in being able to actually know who it is that you're talking to. We've run Nature Research Academies for a while, and part of the aim of those is really to try and move upstream. Our interaction point is quite late on, and we often say that part of the challenge with data sharing is that it comes so late: if good data management hasn't happened up to that point, it's very hard to fix it then, and it's very hard to just delay the publication and say,
well, now you have to do all of this. One of the sessions we had was actually with PhD candidates, so more early career researchers, and they reported back exactly what we're saying here. Their question was, how can we do anything about this? We're not the PIs. But the idea that they are the future, and actually trying to embed this through a community model, is probably the best way to think about it.
Again, it's the strong ties and weak ties that were discussed in the plenary session before. Question? Oh, do you want to answer that? No, I think you should go ahead. OK, so we've talked a lot about compliance, and we've talked a lot about barriers and burdens. I'm wondering if there are any carrots out there for the researchers, and whether you could talk a little bit about discoverability and data citation.
That was something I thought was a really great solution in your approach: giving them credit. One finding from the State of Open Data report, which Figshare has produced for several years, is that 76% of respondents said they feel they should get more credit for sharing their data. So that's one incentive, and I wondered if you could talk about implementing that.
Yeah, so I think I've already seen it on a slide at SSP, but what we often talk about when we think about incentives is a study published in PLOS in 2020, which reported a connection between sharing data openly in a repository and increased citations to the paper. So not just citations to the data set or the data being reused, but the paper that was supported by open data getting more citations than papers that don't share data openly.
So I feel like I really overuse that; it's the most compelling stat that I feel I can show on a slide to researchers. It's interesting that you brought up State of Open Data, which is a long-running study of researcher attitudes towards data sharing, because I was reanalyzing some of the data from last year's study just last week.
One of the other drivers that came up quite strongly was public good: the idea that there's something not connected to career progression or citations, but simply that it's the right thing to do. To me, that's always been the more compelling argument; if you have researchers who believe in openness, transparency, and open science, you don't need to convince them
so much with your nice stats about citation. Would the rest of the panel like to comment? Tim? So obviously with open data within your own journal, you can encourage researchers to share new data sets; it's another step to get them to cite data sets that they've used from elsewhere. However, the reuse of the new data sets that have been published in your journal is out of your control, because that reuse, the citations that should be being made, is going to be spread throughout the publication network.
So something we're working on with natural language processing is working out how to identify what we call ghost data citations, where the authors write something like "we downloaded the data set from Chave et al. and analyzed it," but Chave et al. is a bibliographic reference. It's clearly a data citation that should be recorded and credited to the authors.
But you can't catch that unless you analyze the whole sentence, so we're trying to come up with a tool that will help us track those ghost citations down and attribute them to the authors.
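To make the idea concrete, here is a minimal, hypothetical Python sketch of how a sentence might be flagged as a possible ghost data citation. The verb list, data terms, citation pattern, and the looks_like_ghost_data_citation helper are all assumptions made purely for illustration; the tooling described above relies on natural language processing over the whole sentence rather than simple pattern matching.

```python
import re

# Crude signals, assumed for illustration only.
REUSE_VERB = r"(downloaded|obtained|retrieved|reused|analysed|analyzed)"
DATA_TERM = r"(data set|dataset|data|database|corpus)"
BIB_CITATION = r"[A-Z][A-Za-z'-]+ et al\."  # e.g. "Chave et al."

def looks_like_ghost_data_citation(sentence: str) -> bool:
    """Flag sentences that describe obtaining or reusing a data set while
    pointing at it only through a bibliographic-style citation."""
    has_reuse = re.search(REUSE_VERB, sentence, re.IGNORECASE) is not None
    has_data = re.search(DATA_TERM, sentence, re.IGNORECASE) is not None
    has_bib = re.search(BIB_CITATION, sentence) is not None
    return has_reuse and has_data and has_bib

if __name__ == "__main__":
    example = "We downloaded the data set from Chave et al. and analyzed it."
    print(looks_like_ghost_data_citation(example))   # True
    print(looks_like_ghost_data_citation("Chave et al. proposed a new model."))  # False
```

A real system would also need to resolve the matched reference against the article's reference list to decide whether it actually points at a data set, which is where the natural language processing comes in.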
Yeah, so even just thinking about citation, tracking data citations is not a solved problem, and I have two comments on that. Sorry, I'm really struggling with the microphone for some reason. My first comment is that I think that's part of why having comprehensive data sharing policies is so important. Even if you have a journal where you're not really requiring all of your authors to share their data, typically even an entry-level data sharing policy does tell authors that they need to cite data that's included in their paper.
So just having something established, and that doesn't even need to be a data sharing policy, it could just be part of your editorial policies, but having something set at your own organization means you're helping to contribute to this norm that if you publish your data, you should be rewarded for it in a way that you can put on your CV.
I think that's really important. And then secondly, this really gets at, and I think the previous question got at this too, the fact that this whole conversation is really about cultural shift. Long term, we can create these policies and we can invest in various forms of enforcement, but it's all really complicated. It's much easier, or maybe not easier, but more sustainable long term, to try to create an environment where people want to share their data and where they are rewarded in their careers for sharing their data. Because as long as people feel that if they share their data someone's going to scoop their work and they're not actually going to get any credit for what they've shared, or as long as people feel that putting all this effort into sharing their data isn't a good use of their time because no one really cares about it.
Maybe their PI doesn't want them to do it anyway, and their institution isn't going to reward them for it. Then you're going to be fighting an uphill battle. So I think that really gets at the importance of data sharing policies and the importance of trying to create a cultural shift beyond just policies. Thanks, Jenny. So we're just at time.
One more. Yeah, I was going to add that I work with librarians, helping to figure out what their policies are, and I think that's a really great opportunity to talk about open data and include it in those policies. Instead of just saying, OK, we want you to publish, ask what that actually means, and start at the institutional level, because a lot of researchers are getting all of their cues from their institution.
Right, so we really need to collaborate with the institutions if we want to make this effective and implement it. Thank you. That's a great point, and we would agree. So we'll close there. I wrote down something yesterday, I can't remember which session it was in, but somebody was talking about perfect being the enemy of the good.
And I think in this session we've been talking about the same thing. So thank you all so much for being here and for participating, and thank you to the panel.