Name:
Designing A Metadata Fitness Program Recording
Description:
Designing A Metadata Fitness Program Recording
Thumbnail URL:
https://cadmoremediastorage.blob.core.windows.net/34794791-9fb2-4a04-9ce5-1f730648ac2b/videoscrubberimages/Scrubber_3.jpg
Duration:
T00H55M08S
Embed URL:
https://stream.cadmore.media/player/34794791-9fb2-4a04-9ce5-1f730648ac2b
Content URL:
https://cadmoreoriginalmedia.blob.core.windows.net/34794791-9fb2-4a04-9ce5-1f730648ac2b/Designing A Metadata Fitness Program-NISO Plus.mp4?sv=2019-02-02&sr=c&sig=jJngzTpuRY7mUb%2FCd1hEFCScPyqW1cPiHDLRTFnapNk%3D&st=2024-11-05T07%3A18%3A34Z&se=2024-11-05T09%3A23%3A34Z&sp=r
Upload Date:
2024-03-06T00:00:00.0000000
Transcript:
Language: EN.
Segment:0 .
CHRISTINE STOHN: OK Hello and welcome to the session about designing a metadata fitness program.
CHRISTINE STOHN: My name is Christine Stohn and I'm the moderator for this session. Data quality, especially the quality of metadata, is essential for our ecosystem to function, especially when data is shared between different platforms. I'm very excited about this session, so without further ado, I pass it on to Chris Kenneally, who will introduce our panel of experts.
CHRISTOPHER KENNEALLY: Well, thank you, Christine Stohn, and welcome, everyone, to this session, part of the NISO Plus 2023 conference. I'm Christopher Kenneally with CCC, where I host the Velocity of Content podcast series. Now, however you get your exercise, the goal of a sensible, practical and responsible fitness routine should be progress, not perfection. Likewise for designing a metadata fitness program. Indeed, efforts undertaken at CCC and by my guests today have helped to identify exercises that consistently raise essential metadata fitness levels by emphasizing data quality.
CHRISTOPHER KENNEALLY: Panelists today will share how they overcame specific data challenges for completeness, consistency, accuracy, currency, redundancy and reliability. At universities and within scholarly publishing, a transformative drive is underway, leading to open science. Stakeholders in the global research ecosystem, for example, have created transformative agreements under which authors, institutions, funders and publishers unite in the goal of open access for research.
CHRISTOPHER KENNEALLY: That effort is generating more data and more metadata than ever before. Yet the same stakeholders are united as well in their frustrations over the state of that data. Clearly metadata quality, data accuracy and the ability to link data must all be dramatically improved if we are to meet the challenges ahead. Our first speaker is Waylon Butler with AIP Publishing, a subsidiary of the American Institute of Physics and publisher of highly regarded journals, books and resources that cover all areas of the physical sciences.
CHRISTOPHER KENNEALLY: As director of data analytics, Waylon Butler drives system and process improvement, seeks ever higher levels of product quality and steers development of data and analytical capabilities. Following Waylon Butler is Ginny Hendricks, director of member and community outreach for Crossref. She helps guide several open infrastructure initiatives such as ROR and POSI and recently co-founded Force11's Upstream community blog for all things open research.
CHRISTOPHER KENNEALLY: In 2018, Ginny founded the Metadata 2020 collaboration. Our third presenter is Arjan Schalken, program manager for UKB SIS, the scholarly information services program of UKB, the network of Dutch university libraries and the Royal Library. Arjan's work focuses on designing and managing sustainable publish-and-read agreements. An important result of the UKB SIS program is the creation of a data warehouse to monitor, manage and analyze publication-related data as input for negotiations, contract fulfillment and portfolio strategy.
CHRISTOPHER KENNEALLY: Our final presenter is my colleague, Laura Cox, senior director, publishing industry data. Laura joined CCC following the acquisition of Ringgold in 2022, and she leads data operations for CCC's products and services. Laura is also treasurer of the International Standard Name Identifier board. You know it as ISNI. In the discussion period that follows the presentations, we encourage you to share your own experiences and insights about data fitness. For the next 45 minutes, though, don't just do something, sit there.
CHRISTOPHER KENNEALLY: Waylon Butler with AIP Publishing, you get to open the program.
WAYLON BUTLER: Great, thank you, Chris. So I will start; I've got a few slides to share. So, talking about fitness and getting in shape: we're not in the typical New Year's resolution window, but perhaps it's a lunar New Year's resolution to kind of get your data in shape, and here are some thoughts about a data fitness regimen for you.
WAYLON BUTLER: So as Chris mentioned, I'm Waylon Butler, Director of Data and Analytics with AIP Publishing. Just as a caveat to my presentation: we're an applied physics publisher, so we're focused on the physical sciences, no life science, no biomedical. So our needs might be a little different than yours or your customers'.
WAYLON BUTLER: Now, AIP Publishing is very excited, though, about this transition to open access. We really do think this will be beneficial to the researchers, the science and the world at large. So we're very excited about this move despite all of the work and anxiety that it produces. A starting place for us, where we have started to get in data shape, if you will, was really driven by those transformative arrangements, or read-and-publish agreements, that Chris mentioned earlier: institutions that have historically been paying for subscriptions leveraging that spend to publish content from their researchers.
WAYLON BUTLER: So in particular, over the last several years, we've been really working on tuning this connection between authors, manuscripts and institutions. We know funders are very important in this mix; we were not yet being driven there and we haven't really tackled the funders, so they're not fully tied in. But we did work to link authors to institutions and manuscripts to institutions, and we used Ringgold IDs. Chris mentioned the Ringgold database earlier; Ringgold IDs were pragmatic for us at first.
WAYLON BUTLER: We already had Ringgold IDs associated with all of our institutional customers. We've been using Ringgold for ages, for years. So because that already existed in one of our data sets, we pushed to leverage it in other places, so that we would understand which authors were affiliated with which institutions, normalized through a Ringgold ID, and which manuscripts were primarily associated with which institutions, using the corresponding author in particular.
WAYLON BUTLER: So we've got this running really well now. It supports all of our read-and-publish arrangements. Authors are able to pick validated institution affiliations with Ringgold IDs attached, so that we're able to administer these read-and-publish, these transformative agreements, in a fully automated way. And we were able to do that without adding staff, which is maybe not the case for all publishers. We were really thoughtful from the beginning about leveraging this data normalization to support automation, so that we don't have to solve this problem by throwing people at it, which is costly and doesn't scale well as the entire world moves to an open access model.
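To illustrate the kind of automation this enables, here is a minimal sketch of an eligibility check keyed on the corresponding author's Ringgold ID. It is purely illustrative: the field names, data structures and numbers are invented for the example, not AIP Publishing's actual systems.

```python
# Illustrative sketch only: a simplified read-and-publish eligibility check keyed
# on the corresponding author's Ringgold ID. Field names and record shapes are
# hypothetical, not AIP Publishing's actual data model.

from dataclasses import dataclass

@dataclass
class Agreement:
    ringgold_id: int          # normalized institution identifier
    institution_name: str
    articles_cap: int         # maximum OA articles covered per year
    articles_used: int        # articles already published under the deal

# Hypothetical table of active agreements, keyed by Ringgold ID.
AGREEMENTS = {
    12345: Agreement(12345, "Example University", articles_cap=100, articles_used=73),
}

def rp_eligible(manuscript: dict) -> bool:
    """Return True if the corresponding author's institution has an active
    read-and-publish agreement with remaining capacity."""
    ringgold_id = manuscript.get("corresponding_author", {}).get("ringgold_id")
    agreement = AGREEMENTS.get(ringgold_id)
    return agreement is not None and agreement.articles_used < agreement.articles_cap

# Example: a manuscript whose corresponding author picked a validated affiliation.
manuscript = {"id": "MS-001", "corresponding_author": {"ringgold_id": 12345}}
print(rp_eligible(manuscript))  # True under the illustrative data above
```

Because the affiliation is normalized to an identifier at submission time, the check is a simple lookup rather than fuzzy string matching, which is what makes fully automated administration feasible.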
WAYLON BUTLER: Now, it's not perfect, right? We certainly still have ongoing challenges. Much of the work that we're doing is in downstream environments, a data warehouse or data hub (we call it the data exchange internally), where we're doing normalization, cleansing, enhancement and enrichment of this data. So whatever the author has given us, whatever the institution or the submitter of a manuscript has given us, we're making improvements to that.
WAYLON BUTLER: So we're seeing improvements to the data, but getting that back upstream, so that it can be applied during the fulfillment of a manuscript, is a little tough. So we do still have some retroactive work, where we have to go back, clean up, and sweep back in articles that we missed on the first pass. Another challenge, of course, is getting better data from the start.
WAYLON BUTLER: Here's where we have to strike a balance between requirements, adding work for the author, who's already very busy and hopefully doing lots of research, to fill out forms and give us IDs and pick from the right list, versus how we incentivize them. How do we make it worth their while? So that's something we're still tuning, to try not to be so onerous on the author, because we do want the better data, but we have to make sure there's something in it for them.
WAYLON BUTLER: Now, of course, everything I've been describing, we've done all of this work internally. We're doing this in our manuscript tracking system, we're doing this in our subscription fulfillment system. We're doing this on our production systems, but it's all internal. We're doing this work in our transactional systems and our data warehouse.
WAYLON BUTLER: Extending this understanding out to the wider market and understanding what's happening throughout all of the physical sciences is key. That's where we need to be focused. So that's a challenge: how can we do that when we certainly don't have control over those systems? So thinking ahead, the next hurdles we know we need to tackle: author disambiguation, looking beyond our walls, understanding which author is the same author across publishers or papers, or whether that author has moved institutions as they progress through their career, and again thinking very heavily about ORCID.
WAYLON BUTLER: So persistent identifiers, I think, are a key part of your fitness program, call it the stretching, if you will; you really need those IDs to help you make those links. So ORCID is something we're really focusing on in terms of author disambiguation. We do need to layer funders into this mix as we transition to open access. Funder mandates and funder initiatives.
WAYLON BUTLER: And funder funds, of course, become very important. So we have FundRef IDs captured at the manuscript level, but we're not integrating that, so we're a little fuzzy on how good the quality of that data is. We're completely reliant on what the author has given us during submission. And again, looking beyond our walls, looking at authors who've published with us: where do they go?
WAYLON BUTLER: Looking at the articles coming from an institution with a read-and-publish deal, where else might they be publishing? Can we make the argument, or the pitch, to that institution to encourage more of those papers to come to AIP Publishing? Now, to dream a little: what would I love to have? This is kind of a wish list; this is the Mr. Olympia, championship-CrossFit level of data fitness.
WAYLON BUTLER: I wish we had PIDs everywhere. I think they are really critical to helping make those connections, so we're not trying to match text strings and names and institution names, you know, is it "University" spelled out or "Univ" with a dot. Identifiers: ORCID for authors, Ringgold and ROR for institutions, FundRef for funders, and DOIs for any piece of content that we could reference.
WAYLON BUTLER: But I don't want to get all of those IDs by throwing all of the work on the author. Right now it seems like that's the path: make the author give it to us. And that's not scalable, and it's not fair to the author. It's certainly a potential distraction and detracts from the research work that they could and should be doing. So how do we build in tools to make it much easier to provide these IDs?
WAYLON BUTLER: Suggestion tools, for example. If we have an ORCID ID, why do we then also ask the author for all of their affiliation information? Can we just pull that in off of an ORCID ID, for example? I also wish third-party data providers, if they have PIDs, those persistent identifiers, would just expose them in their outputs or their API feeds or their UIs. They do great work normalizing already, but in my experience it's often normalized only to a text string, so it would be great to get those IDs.
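As an illustration of "pulling it in off an ORCID iD": the sketch below assumes the public ORCID API at pub.orcid.org and uses a well-known example iD; the exact response structure should be verified against ORCID's own documentation, so it is parsed defensively.

```python
# Illustrative sketch: fetch employment affiliations for an ORCID iD from the
# public ORCID API. The iD below is a published example iD; the JSON is walked
# defensively because the exact shape should be checked against ORCID's docs.

import requests

def orcid_employments(orcid_id: str) -> list[str]:
    url = f"https://pub.orcid.org/v3.0/{orcid_id}/employments"
    resp = requests.get(url, headers={"Accept": "application/json"}, timeout=30)
    resp.raise_for_status()
    data = resp.json()

    names = []
    # Walk the affiliation groups and collect organization names where present.
    for group in data.get("affiliation-group", []):
        for summary in group.get("summaries", []):
            org = summary.get("employment-summary", {}).get("organization", {})
            if org.get("name"):
                names.append(org["name"])
    return names

print(orcid_employments("0000-0002-1825-0097"))  # illustrative example iD
```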
WAYLON BUTLER: That makes matching and integration of data points easier. And then thinking further ahead, beyond just making the article content itself open: thinking about the underlying research data, the figures, the software that's used, the code that's used, making all of that available and shareable with referenceable DOIs, so we can extend this and understand the full connection of this funder-funded research at this institution with this researcher.
WAYLON BUTLER: And it produced all of this output: articles, data sets, software, et cetera. So thank you very much. I will stop sharing, and we can head on.
CHRISTOPHER KENNEALLY: Waylon Butler with AIP Publishing, thank you so much. And I really enjoyed the way you played with our fitness image here, and you were talking about training.
CHRISTOPHER KENNEALLY: So I want to ask about that, because the point you raised with regard to authors is that you need to balance this carrot-and-stick approach when requiring and incentivizing them to provide these IDs. All of it is with the goal of getting better data from the start. So with these transformative agreements, the read-and-publish agreements and so forth, are universities and even funders providing training as well?
CHRISTOPHER KENNEALLY: Are they giving the kinds of assistance for authors that would help accelerate this?
WAYLON BUTLER: So I think, from what we've observed, there's a lot of good intention there, but I'm not sure that the infrastructure is there. What we've found and what we've experienced is that the larger, top-tier research institutions, the ones that are very large and very heavily research-focused, may have dedicated staff, and they may have funding all around making content open.
WAYLON BUTLER: They may have their own internal institutional repository. They may have staff within the library or within the centralized research office who will support this type of work, keeping the researchers at the institution informed, facilitating the process, helping them find funds, helping them navigate the process. But when you get to the smaller organizations, or ones that are maybe simply not organized in this way, it's not there.
WAYLON BUTLER: So there isn't that same follow-through. And I don't think it's a case of unwillingness. I think, in the same way that publishers are in some cases not prepared for an open access transition, the research institutions may not fully be there. And anecdotally, what we've heard about funders, again, is that funding can be applied to it, but there is a gap between the funding and how it is implemented: how do we make this easier, and how do we find compliant publishers who will publish according to that funder's mandate?
WAYLON BUTLER: Again, a lot of this work is still being left to the author, which is, I think, something that we'll have to solve as an industry.
CHRISTOPHER KENNEALLY: Well, it really means that everybody at the gym has to get together and support each other, it sounds like. Right, anyway. Waylon Butler with AIP Publishing, thank you very much. And Ginny Hendricks with Crossref.
CHRISTOPHER KENNEALLY: Over to you.
GINNY HENDRICKS: All right. Thank you very much. Does it all look good? OK, so I'm not going with fitness, I'm going with love, because this session is being aired on Valentine's Day. But I will talk about metadata quality and enrichment, and you could say healthy metadata.
GINNY HENDRICKS: So I work for Crossref. We're an organization that's been around for 20 years, started by a bunch of publishers and societies, and we really do love metadata. Very briefly: we're community governed, we're a non-profit, and we reached sustainability within the first three or four years. We charge for services and membership fees. We have grown from the original 12 founding members to over 18,000 organizations, and from those early ones, which were US/Western Europe-based, to more members now in Indonesia than in any other country.
GINNY HENDRICKS: So we have members in 150 countries. And what that means is these members are providing metadata with their records, up to 140 million records now. And it means that we have a much richer picture across the world of what research is out there. And it has of course expanded beyond articles as well: there's data, there's code, and all of these records contain information about lots of different research objects.
GINNY HENDRICKS: And we also see the usage of this metadata from thousands and thousands of systems. We don't even know how many thousands, because it's anonymous and open to use Crossref metadata; we have open APIs. We are smallish, we have 44 staff at the moment across seven time zones, so we've also grown slightly with our membership. And like I said, we really do love metadata.
GINNY HENDRICKS: We have actual artwork and t-shirts and badges, and I wish I could give some out; can't do that online. This isn't me, this is Rosa, who's our Communications Manager. And yeah, we talk about metadata a lot. Why do we love metadata? I'm basing this around the campaign from a few years ago, Metadata 2020. As Chris introduced, I was part of co-founding that, and over a couple of years we had about 500 volunteers from different stakeholder groups talking about metadata.
GINNY HENDRICKS: So there was a whole phase of a lot of talking, and there were funders and researchers and universities and librarians and of course publishers and service providers. They identified some common needs, and the work developed into very specific projects, and some of them had really good outcomes. Some of them were just too difficult to handle, like how do you educate researchers about metadata?
GINNY HENDRICKS: Or even, should they not have to know, to Waylon's point as well? So there's a lot of information on the Metadata 2020 campaign. Now we're in the phase of sort of handing it back to the community to use some of those outcomes. There's research that's been published, there are surveys among researchers, and there's a pledge to sign to advocate for richer metadata. And these are the three sort of benefits that the community came up with for why we should advocate for richer metadata.
GINNY HENDRICKS: So it fuels discovery and innovation, it bridges the gaps between systems and between communities, and it eliminates duplication of effort. And I think Waylon at AIP has given an excellent example of some of these challenges. Really, metadata is a form of communication, and so for an industry that calls itself scholarly communication, it really is the bedrock. Back to Crossref and our metadata and the metadata users.
GINNY HENDRICKS: This graphic shows, around the outside, some of the types of users of the metadata: lots of databases, reference management systems, posters, recommendation engines, all the kinds of systems that publishers and others work with, and certainly that researchers work with, and, in the inner circle, some of the things they do with that metadata. So they're filling in gaps.
GINNY HENDRICKS: They're using Crossref as one source, of course, not the only source. They're matching and linking. They're aggregating. They're developing search technologies and discovery tools. And these are only growing; we're only seeing the usage of this open metadata expanding.
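To make that concrete, anyone can pull a record from Crossref's public REST API and check which metadata fields it carries. The sketch below is only illustrative: the DOI is an arbitrary example, and the field list is just a sample of commonly requested elements.

```python
# Minimal sketch: retrieve a work record from Crossref's public REST API and
# report which commonly requested metadata fields are present. The DOI below is
# an example; any registered DOI can be substituted.

import requests

def crossref_field_report(doi: str) -> dict:
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
    resp.raise_for_status()
    work = resp.json()["message"]

    # Fields that downstream users frequently look for in a record.
    wanted = ["abstract", "reference", "funder", "license", "author"]
    return {field: field in work for field in wanted}

print(crossref_field_report("10.5555/12345678"))  # illustrative example DOI
```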
GINNY HENDRICKS: To be more specific about the types of metadata we're talking about: as I said, it really is information about a research object. And that object, in the Crossref world, could be an article (very commonly; I think almost 65% of those 140 million records are about articles). The rest are about book chapters or conference papers or research grants.
GINNY HENDRICKS: So funders are now contributing to this metadata ecosystem. But if we take a typical case, an article, perhaps in the biomedical field, we would ask for references. That's kind of why Crossref was started; the clue is in the name, cross-referencing. That's still huge for publishers: not having to have bilateral agreements with every other one of those 18,000 publishers to agree on linking.
GINNY HENDRICKS: So using DOIs to link their references through Crossref, it's really that collective effort. Abstracts are really important. It's one of the most asked-for types of metadata, and it's also one of the fields that we're seeing increase most from Crossref members; more members are adding abstracts than any other element of metadata at the moment.
GINNY HENDRICKS: That could be to do with the campaign, the Initiative for Open Abstracts. Abstracts are also really useful for analyzing things downstream. Obviously full text is useful for that as well, if the full text is open access. But abstracts are really useful downstream for things like plagiarism detection, or quite a lot of the research integrity issues that I know the community is looking at and concerned with at the moment. Related to that: retraction and correction information. It's really important to get these asserted and get them linked, so that future generations know that what they're reading, or even current users know that what they downloaded three years ago, has been corrected or retracted.
GINNY HENDRICKS: And there's a notice about that that has a DOI, and it's linked in the record, and ideally also linked online on the interfaces where they're finding the content as well. I won't go through this whole list. Somebody already mentioned identifiers: we like to link up contributors, we like to link up affiliations, funding information. We have a similarity check tool that checks for plagiarism.
GINNY HENDRICKS: So we have lots of links that people can add as well. And these are just, I think, eight or nine of the ones I would say are quite critical, in order of my wish list. But there are hundreds of fields that you could add to Crossref, so it's all possible. So what do we need to do? Maybe this is a little bit fitness.
GINNY HENDRICKS: We do need the right tools. We need the right equipment. So I see Crossref's job as just making it possible: whatever the community needs, we try to make it possible. Our next step is to try to make it easy. And that's really hard, because not everybody knows XML, not everybody has that expertise, not everybody is even aware of all the possibilities. So it's a little bit of a balance.
GINNY HENDRICKS: Like, you have to be aware of what the health benefits are to you and to society, and you also have to have access to the right tools. So I guess this is a bit of a wish list: I would like to ask that people really advocate for better data quality, convince budget holders to invest. I have seen it done where, you know, time, budget, strategy and awareness kind of come together at the same time, and you see a massive uptick with certain members, and then they in turn see increased traffic to their content, for example.
GINNY HENDRICKS: On quality: it IS important. But more and more, and this might be a shift we're seeing, the downstream users are saying, we can also parse those loose strings; if this doesn't have a persistent identifier yet, or if this isn't well structured, we can still use it. So funders are asking us for free-text acknowledgment statements that they can then analyze and parse.
GINNY HENDRICKS: So all the work doesn't have to come from the publishers. Maybe in the future we can actually collaborate, and that's something that we're thinking about a lot at Crossref too: how can we as a community actually collectively enhance the record? Maybe we'll talk a bit more about that in the discussion. And I also wanted to give a big, big shout-out to the members, because actually we know that metadata doesn't stand still.
GINNY HENDRICKS: You have to keep updating it. It's the same with your fitness: as soon as you leave it over Christmas or the holidays, you're like, oh my gosh, even two weeks of not paying attention makes a difference. So this is just a graph that shows that actually 80% of our records up to 2016 have been updated by members.
GINNY HENDRICKS: So publishers really do do what their obligations with Crossref ask of them. So I just wanted to give a bit of a shout-out for that, and I will leave it there, just with some more pictures. Really, this is our finance director's son, and even he, Jack, loves metadata, and that backpack is perfect for hiking snacks, apparently.
CHRISTOPHER KENNEALLY: Well, Ginny Hendricks with Crossref, we love that presentation.
CHRISTOPHER KENNEALLY: Thank you very much. And you know, I wanted to pick up on something, and I'll go with love and relationships here. The thing about relationships, of course, is sometimes it means you have to be honest about things. So you have told us about how Crossref has a lot of metadata providers and users. How does this community work together to address metadata errors and quality complaints? And in what other ways would you like help to improve this data quality?
GINNY HENDRICKS: Yeah, thank you. How do we do it at the moment? Very inefficiently, I would sum up. Even though we see lots of records being updated, we don't make it that easy to update; you often have to re-deposit the entire metadata file just to add some extra ORCID IDs or something. So we need to work on that. That's our part.
GINNY HENDRICKS: As I said, we've made some of it possible; we need to make it easier. We need to have more understanding between, I guess, the curators and the suppliers and the creators of the metadata and the consumers of that metadata. So, for example, that example I gave where funders are looking for free-text strings: how do we let publishers know that and open a conversation where something can be done programmatically?
GINNY HENDRICKS: We have an API called Event Data which takes, if you like, assertions from non-members, so we can report on when DOIs are in the news and things like that. How about if we had a simple form? Instead of someone emailing Crossref with, you know, "I'm an author and my name is misspelled," all that happens now is we email the publisher and say, please can you update your record in Crossref, which often they don't respond to, or they may do within six months or something.
GINNY HENDRICKS: If that was an open request, and if publishers had an easy way of saying, I accept or reject this change request, you know, there'd be a bit of development work involved, but it would also at least be a sort of a matter of public record. So I think there's more I can see that we can be responsible for, but I definitely see the willingness from downstream users and upstream creators to work on something together.
GINNY HENDRICKS: So I think that could be it. That's my sort of "fantasy", talking about love, you know: to have something like that, an open, sort of crowdsourced metadata assertion store.
CHRISTOPHER KENNEALLY: All right. Well, Ginny Hendricks with Crossref, thank you very much for that. Our third presenter is Arjan Schalken, program manager for UKB Scholarly Information Services at UKB, the network of Dutch university libraries and the Royal Library. Welcome, Arjan.
ARJAN SCHALKEN: Thanks, Christopher, great to be here. I'll start my presentation. I'm very glad to show you something: a part of the program I'm running, the UKB SIS data hub, as part of this panel discussion on designing a metadata fitness program. Some context about this consortium.
ARJAN SCHALKEN: It covers all Dutch universities and university medical centres, and our researchers are involved in almost all Dutch scientific output. Open access has been on our agenda since 2015; we had our first read-and-publish deal in 2015, and there is also a government policy regarding 100% open access. At this moment we have 18 read-and-publish deals, but we also have a strong additional policy based on copyright law, and we see open access as part of a broader open science strategy.
ARJAN SCHALKEN: In 2021, 82% of all our corresponding- and co-authored peer-reviewed articles were open access, and I hope that in 2022 the score is 85% or higher. So we are very determined to reach the 100% goal. However, that needs some effort, and also, given the complicated playing field and more and more money involved in the system, the consortium started a program to strengthen negotiations, contract management, portfolio management and open access uptake. That program needed to develop a data hub, which I will discuss in the second part of my presentation, but also toolkits with process descriptions and checklists, and I am the program manager.
ARJAN SCHALKEN: Metadata, why is it important? Maybe obvious, but I want to address it anyway. First of all, as a consortium, you want to bring your own data to the table and be able to audit publisher data. If you have a contract with a publisher for read-and-publish deals and open access services, you want to monitor the progress of that service and also the quality. But you also want to look beyond those deals.
ARJAN SCHALKEN: You want to look at the whole playing field of scholarly communication: what is published open access in the wild, and maybe even label it with what kind of money is in the system. And last but not least, more and more we see funder compliance as a management topic: is an article that is funded, for example by a particular funder, compliant? Is it open access, does it have the right license?
ARJAN SCHALKEN: So a lot of things are going on, and that meant we wanted to build a data warehouse centered around publications, institutions, contracts, publishers and journals. You'll notice I'm not mentioning authors. We do collect corresponding author names, but we don't collect other information about authors, partly for privacy reasons. The DOI, the unique ID of an article, is the centerpiece of that data warehouse, and we collect as many DOIs as possible from three main streams.
ARJAN SCHALKEN: One is the publishing reports, or publish reports, we get each month under the contracts we have. Then we use Scopus, for corresponding and co-authors; there are of course also other commercial and non-commercial providers of these kinds of services. And all the universities have their own systems, and there we get the data from the central harvester, NARCIS. We add additional information from our contracts, which we have from our consortium manager and the SURF journal catalog.
ARJAN SCHALKEN: And that adds additional information on which publishers we have a deal with, which journals are part of the deal, what we publish in those journals, which institutions are part of the deal, et cetera. And on top of that, we try to harvest as much as possible from other open sources, such as Crossref, OpenAPC, DOAJ and several others, to get as much metadata as possible around those topics.
ARJAN SCHALKEN: You can't do it in Excel; you need to have a dedicated tool. For us, it's Qlik software. And that helps us to use all this data, to present it in dashboards, and to reuse it through an API as well. So that is the core, let's say, of our data warehouse, the tooling.
ARJAN SCHALKEN: And then, of course, we have the dashboards. There are plenty of dashboards, and they all have their own roles. So we have dashboards, for example, to filter and report on output: what do we publish? This is an example of such a dashboard, and here you can say, OK, show me all publications in 2022, show me those from the universities or the university medical centers, then filter on a specific publisher, for example Wiley, then say, OK, now show me only the publications in full open access journals.
ARJAN SCHALKEN: So not the hybrid ones and also from only our corresponding authors. And then you have some idea of what you publish in full open access journals with Wiley. But you could also say and show me only the journals that are not part of a deal. Then you have the open access in the wild. So this is a very flexible dashboard where we get a lot of information out.
ARJAN SCHALKEN: It's then presented with all the metadata. You can use it in a dashboard, you can download it in Excel and other formats and reuse it in any way you want. Then we have dashboards to monitor and manage our output. For example, this is a dashboard: several of our deals have a cap, so there's a limit on the number of publications we can publish open access as part of the deal.
ARJAN SCHALKEN: And then it's very important to know where you are during the year and how many APCs you have left. And here you see, at the left, lines for the publishers, but this is one publisher, in the middle, where you see a straight, linear line: that is the average we can publish per month in the journals of this publisher. And in blue you see what we actually published.
ARJAN SCHALKEN: And then around April we see there is a bit of a gap: we have some APCs left relative to the average. Let's discuss a repair action with this publisher; the publisher does a repair, and then it hits the average mark. And then later during the year we see we have fallen behind again, we're missing some open access. We talk with the publisher and the publisher starts another repair action.
ARJAN SCHALKEN: And at the end of the year, we have 10 vouchers, 10 APCs, left, and those are transferred to the next year. So we use the monitor to manage the output and to discuss this with the publisher. We also have other dashboards, dashboards to compare and analyze multiple sources, for example for missed open access. Here I have presented it in a heatmap, a more intuitive way of presenting data.
ARJAN SCHALKEN: Here we try to analyze the publications we have in journals where we think we are eligible for corresponding authors, but somehow those DOIs are not popping up in the publisher reports. So why are they not there? Those are missed open access. And in a heatmap like this, we can see what our dark spots are: which publishers have relatively more missed open access, or which institutions have relatively more missed open access.
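The comparison behind a missed-open-access heatmap like this boils down to a set difference on DOIs. Here is a minimal, illustrative sketch; the file and column names are invented for the example and are not the actual UKB SIS data hub schema.

```python
# Illustrative sketch: find "missed open access" as eligible DOIs that never
# appear in the publisher's reports, then count them per publisher and
# institution to surface the dark spots of a heatmap. File and column names
# are made up for the example.

import pandas as pd

eligible = pd.read_csv("eligible_articles.csv")   # doi, publisher, institution
reported = pd.read_csv("publisher_reports.csv")   # doi, publisher

# Normalize DOIs before comparing: case-insensitive, no surrounding whitespace.
eligible["doi"] = eligible["doi"].str.strip().str.lower()
reported_dois = set(reported["doi"].str.strip().str.lower())

missed = eligible[~eligible["doi"].isin(reported_dois)]

# Pivot to a publisher-by-institution table of missed-OA counts (heatmap input).
heatmap = missed.pivot_table(index="publisher", columns="institution",
                             values="doi", aggfunc="count", fill_value=0)
print(heatmap)
```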
ARJAN SCHALKEN: And from there on, we can dive into the workflow with the publisher, or maybe the communication from an institution about the deals, and improve the uptake. And last but not least, because there's a lot of data going through the data hub, we also use it to audit and manage the data quality of the reports themselves. So this is a nice case from last week: we get year reports from publishers, and those reports are complementary to the monthly reports we get.
ARJAN SCHALKEN: And with this publisher, we saw a difference between the year report on 2022 and the monthly reports from 2022. And with this tool in the data hub, we can rather smoothly see that 15 of those DOIs in the year report were already represented in the 2020 and 2021 monthly reports of the publisher, and we want to prevent having to pay twice or maybe even more times.
ARJAN SCHALKEN: So then we can contact the publisher and say, well, we see some duplications in the DOIs. When you have 15,000 DOIs a year from publisher reports under consortium contracts, and 60,000 a year overall, you can't do that based on single Excel files; of course, you need to have a data hub.
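A minimal sketch of that kind of audit, with invented file and column names, checking whether DOIs in an annual report were already present in earlier monthly reports:

```python
# Illustrative sketch: flag DOIs in an annual publisher report that already
# appeared in earlier monthly reports, i.e. potential double counting. File and
# column names are invented for the example.

import pandas as pd

monthly = pd.read_csv("monthly_reports_2020_2022.csv")  # doi, report_month
annual = pd.read_csv("annual_report_2022.csv")          # doi

seen_before = set(monthly["doi"].str.strip().str.lower())
annual["doi_norm"] = annual["doi"].str.strip().str.lower()

duplicates = annual[annual["doi_norm"].isin(seen_before)]
print(f"{len(duplicates)} DOIs in the annual report were already reported monthly")
print(duplicates["doi"].tolist())
```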
ARJAN SCHALKEN: So with that much data, there are some, let's say, issues to deal with. And if you don't handle it correctly, then there is a risk that 1 plus 1 is less than 1. If you have one data set and you know its characteristics, its strengths and weaknesses, and you combine it with another data set whose strengths and weaknesses you also know, you are no longer sure what the strengths and weaknesses of that combined data set are. And then someone says, OK, I want the single data sets instead of the combined set, because, for example, there's too much duplication of data, there's conflicting data, and I don't know how to deal with it.
ARJAN SCHALKEN: It's so much data, it's becoming too complex. Or we built some business rules on how to act on the data, and suddenly one of the data sets is missing, and then there's an error, data fields are empty, metadata fields are empty, and suddenly it's not workable anymore. So with the data hub, we try to reverse that: from 1 plus 1, how can we make 3?
ARJAN SCHALKEN: If we have more data, of course there will be duplications, but we are also more complete. If we have conflicting data and we know that some metadata fields from one data source are better than those from another, we can build in a business rule so that only the best part of each data source is represented, so that we know for sure we present the best and most correct data. If we have more data, on one hand it can become too complex, but more data is also more knowledge.
ARJAN SCHALKEN: You have more data to process, and hopefully you can get more knowledge out of it. And of course you can have a domino effect from instability, but at the same time it can also create stability: if one source suddenly drops in quality for, let's say, the open access status, but you also have an open access status from another source, then you can use that, create another business rule, and improve the overall stability of the whole system.
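One common way to express such business rules is a per-field source precedence with fallback. The sketch below is illustrative only: the source names, fields and precedence order are made up for the example, not UKB's actual rules.

```python
# Illustrative sketch: merge per-DOI records from several sources, taking each
# metadata field from the most trusted source that actually supplies a value.
# Source names, fields and precedence are examples.

PRECEDENCE = {
    "oa_status": ["publisher_report", "open_oa_source", "crossref"],
    "license":   ["crossref", "publisher_report"],
    "journal":   ["journal_catalog", "crossref", "scopus"],
}

def merge_record(doi: str, per_source: dict[str, dict]) -> dict:
    """per_source maps source name -> {field: value} for one DOI."""
    merged = {"doi": doi}
    for field, sources in PRECEDENCE.items():
        for source in sources:
            value = per_source.get(source, {}).get(field)
            if value not in (None, ""):   # fall back if the preferred source is empty
                merged[field] = value
                break
    return merged

record = merge_record("10.5555/12345678", {
    "publisher_report": {"oa_status": "", "license": "cc-by"},
    "crossref": {"oa_status": "gold", "journal": "Example Journal"},
})
print(record)  # oa_status falls back to the Crossref value because the report left it empty
```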
ARJAN SCHALKEN: So you're not relying on just one pipeline. So that is what we try to accomplish, and the dashboards I'm showing you, this is also what we work with on a daily basis. And yeah, I can imagine that there are some questions, but this is my presentation. Thanks for listening.
CHRISTOPHER KENNEALLY: Well, thank you very much, Arjan Schalken with UKB Scholarly Information Services. It's really impressive work, and you've done some creative math there as well. I want to ask you about the effort behind designing and building a data hub like this for UKB. Do you have lessons to share with the audience on how to build support for a data hub within a consortium or a library?
ARJAN SCHALKEN: Yeah, thanks for the question, Christopher. For UKB, we were aware of some issues.
ARJAN SCHALKEN: For example, missed open access was already on the agenda in 2018, 2019, but we couldn't track it. So we knew we were missing data, but we were not aware of what we missed. Also, we were not that happy that in negotiations we always needed to trust the data from the publisher. We wanted to bring our own data, but that took a lot of effort, and it was always on a stop-and-go basis: there was one action, the data was created and used, and afterwards it was stored but never reused again.
ARJAN SCHALKEN: So it was very inefficient. And that was, for the library directors, the main driver to say: we need some kind of data hub, instead of everybody in the consortium doing this for themselves; we have to combine efforts. At this moment, you see that there are different incentives. For example, APCs in the wild: how much do we pay to full OA publishers, which are now skyrocketing?
ARJAN SCHALKEN: Their output is very prominent, or Plan S compliance is very prominent. So I would encourage consortia and libraries to interview different stakeholders and to get those different stories about the challenges on the table. And then, based on that, build a business case of the added value. And the simplest business case is missed open access.
ARJAN SCHALKEN: We have 15,000 articles a year that are published under the contracts. When we started the data hub, I think we had 20% missed open access; now we have 5% missed open access. I mean, you're talking about hundreds of thousands, even millions, of euros that are not used within a contract if you have that much missed open access. So the data hub is worth the money already.
ARJAN SCHALKEN: But I think it's key to see what are the main topics for your own library or consortium, and you will find them, I promise.
CHRISTOPHER KENNEALLY: All right. Well, thank you again, Arjan Schalken with the UKB scholarly information services. You're welcome. Our final presenter is my colleague, Laura Cox, Senior Director, Publishing Industry Data. Laura, welcome and tell us what you've got.
LAURA COX: Thanks, Chris. It's great to be here. And I'm going to focus a little bit more on the fitness, a little less on the love, because I hadn't noticed the date. So I'm going to talk about some more generic things. We've heard there were lots of challenges to be met regarding metadata.
LAURA COX: and at CCC, we hear from all sorts of stakeholders in the ecosystem and take a kind of macro level view of the landscape. So metadata is already all around us. It's one of the tools in our workflows. But as we've already heard as a tool, it's not necessarily doing what is needed for all concerned. So we're experiencing a collective push for open access publishing and moreover open science, which includes making research data and different types of output readily available.
LAURA COX: So this transition phase really needs quality metadata to fulfill this effort. We're starting to approach a point where we need metadata to be more robust, uniform and comprehensive in all parts of the research lifecycle; and it should be applied to a wide range of information, content, the people who create and review that content, the organizations that fund, enable, publish, curate and utilize that content.
LAURA COX: So we need metadata to be distinguishable, as we have heard, in the form of persistent IDs, but not in isolation. The metadata attached to those persistent IDs is important too, and all sorts of metadata can add meaning to all of the attributes of research, from initiation to publication and assessment. So what I'm going to do is go through a few ideas on how to build a metadata fitness program and how to assess data quality to aid decision making.
LAURA COX: So I'm going to start with six attributes. These are things that you can measure readily and easily. Completeness is essentially: where a data field exists, is it populated? Consistency is about conforming with expectations: is a date field displaying a date; where there's a numeric value expected, do we have numbers?
LAURA COX: Are syntaxes being applied consistently? Currency is simply when a record was last updated; as Ginny said, we've got some improvement in that area, which is great to hear. Redundancy evaluates duplicates, including within and between records. And accuracy evaluates whether it's a true description; there we get into a slightly harder area, depending on the nature of the data you're assessing.
LAURA COX: Reliability can be difficult. We often need comparisons to produce reliability statistics, and if all sources of the data originate in the same place, reliability can be compromised: if all origins are the same, the inaccuracies will agree. We can add additional measures. What I've already covered are objective measures, and there are some additional ones.
LAURA COX: We could look at coverage: what percentage of the events or objects of interest have records? If you care about all researchers in North America, what percentage of North American researchers are represented in a dataset? We can look at integrity: if there are relationships present in the data, how often does the related data behave as you would expect?
LAURA COX: So if you have numerical values in relational data, do they add up to the totals? Do external IDs that are relational point to something that's still live, that still exists? How exact is the data? Accuracy is one thing, but it's paired with precision. Accuracy is that I'm using a laptop; precision is that I'm using a particular brand, model, specification, even the serial number of the laptop.
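Several of these objective attributes can be computed directly over a metadata table. A minimal, illustrative sketch, with invented file and column names, might look like this:

```python
# Illustrative sketch: compute simple objective data quality measures,
# completeness, consistency, currency and redundancy, over a metadata table.
# Column names are invented for the example.

import pandas as pd

records = pd.read_csv("metadata_records.csv")  # doi, title, issued_date, last_updated

# Completeness: share of populated values per field.
completeness = records.notna().mean()

# Consistency: does the date field actually contain parseable dates?
parsed = pd.to_datetime(records["issued_date"], errors="coerce")
date_consistency = parsed.notna().mean()

# Currency: how long since each record was last updated?
age_days = (pd.Timestamp.today()
            - pd.to_datetime(records["last_updated"], errors="coerce")).dt.days

# Redundancy: duplicate DOIs within the data set.
duplicate_rate = records["doi"].str.lower().duplicated().mean()

print(completeness, date_consistency, age_days.median(), duplicate_rate, sep="\n")
```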
LAURA COX: And then we can also bring in the users and look at subjective data quality metrics, and this is where things like user surveys can be really illuminating. So we're looking to ask: how easily can the data be used? What value can be extracted from it to perform the desired function? How understandable is the data? Is it well documented?
LAURA COX: In the user's opinion, how impartial is the source of data? And then trust. Do users trust the data? Now, you can sometimes judge this by behavior. Are they using alternative sources for the same reason rather than the data that you intended? So here are just a few ideas, a few examples of what you could measure.
LAURA COX: There are loads of resources available on the internet, in blogs and all sorts of different information environments. There are entire books on data quality, and so you can apply these sorts of questions and many hundreds more. You could look at whether there are fields which, when combined, identify a duplicate; that helps with redundancy. Some persistent IDs use checksums.
LAURA COX: Do they validate? This is to do with syntax and validation, and there are lots of questions you can ask of data. It really comes down to the data in question and whether it's metadata about people, organizations, content, or references to content.
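As an example of the checksum point: ORCID iDs and ISNIs end in an ISO 7064 MOD 11-2 check character, so a mistyped identifier can often be caught before it ever enters a record. Here is a minimal sketch; the iDs shown are illustrative.

```python
# Minimal sketch: validate the ISO 7064 MOD 11-2 check digit used by ORCID iDs
# (and ISNIs). The example iDs below are illustrative.

import re

def checksum_ok(identifier: str) -> bool:
    """Return True if the 16-character identifier has a valid MOD 11-2 check digit."""
    digits = re.sub(r"[\s-]", "", identifier).upper()
    if not re.fullmatch(r"\d{15}[\dX]", digits):
        return False                      # wrong syntax: length or characters
    total = 0
    for ch in digits[:-1]:
        total = (total + int(ch)) * 2
    expected = (12 - total % 11) % 11
    expected_char = "X" if expected == 10 else str(expected)
    return digits[-1] == expected_char

print(checksum_ok("0000-0002-1825-0097"))  # True: a well-known example ORCID iD
print(checksum_ok("0000-0002-1825-0098"))  # False: last digit altered
```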
LAURA COX: So another aspect that's really important is context, and that's the role that the entity performed in connection to the research process, or the stage of the research process. Was this initiation? Was this during the research process? Is it a midway output, or is it actually a publication? Are we talking about assessment of the publication after the fact? And the more we can apply those sorts of taxonomies and ontologies, the better and more robust our data becomes. So this is quite an exercise.
LAURA COX: It sounds like a lot of work, so there are things to think about. You need to think about how the data you're looking at will be used in practice, in what settings, and what use cases are involved. We need to think about how we might benchmark against any of these measures and how to demonstrate improvement in data quality. Very importantly, how do we make metadata, as I think has been mentioned already, interoperable across different parts of workflows and between different data sets and systems?
LAURA COX: And we need rules to ensure data integrity. Something as simple to start with as syntax validation and deduplication can make a huge difference. If we build strength into our data by using taxonomies and ontologies, we can do vastly more. So we look at the current use cases. We look at what people are doing now. We look at what we're trying to achieve in the near future.
LAURA COX: But what might we want to achieve once we've met that point? What might we need in the future, and how might these data interact? So we need to look at whether issues are adding time; we've heard about costs, the potential for loss of data integrity in a workflow, or loss of trust. And so carefully thinking about the approach taken to future-proof the program that you're going through, with better data plans, is actually very important.
LAURA COX: Once data fitness is improved, and it's very much a process, as we've heard, it really drives better data-driven decisions, and this can be for a wide range of different things; I have just a few examples here. This could be anything from impact assessments and analysis of past outcomes to inform future decisions, to providing researchers with quality metadata, not the researchers providing the quality metadata, but providing them with quality metadata, to enable research to be shared and discovered more effectively.
LAURA COX: For publishers, it can be the provision of accurate publishing entitlements, or providing analytics back to institutions and funders, which they can then use in their decision making and feed back into the process, so that we're really creating a loop where the data fuels the workflow. And for everyone, from researcher to funder to institution to publisher and user of the research, it's about building quality data that fulfills open research and creates smooth workflows for the researchers themselves.
LAURA COX: Thank you, Chris.
CHRISTOPHER KENNEALLY: Well, thank you, Laura Cox, my colleague at CCC. And I'll just follow up on that last slide, Laura. You were emphasizing the ability to do better with our data-driven decision making. So does measuring data quality come into play for that? Does it help translate into that data-driven decision making?
LAURA COX: I think it does, thanks, Chris, because just having some data, partial data, incomplete data, potentially inaccurate data, doesn't build trust. And when you're going to use data to inform decisions, to push through workflows, and maybe to inform other people's decisions in that workflow, we really need to build trust. The more we trust the data, the more we have that data quality benchmarking, the more we can compare things, the more we can analyze with a degree of confidence.
LAURA COX: That means we can analyze outcomes and evaluate and track, and with more dependable metadata, we really have more information to drive those decisions. And I think this operates throughout the entire research lifecycle. And often the current situation is very much that metadata is either pushed upon the researcher to add or the publisher;
LAURA COX: and there are all sorts of different intermediaries as Waylon had suggested and funders and others who are involved in the whole process where this is important. And if we want to make this work smoothly, having high quality data to drive those decisions for everybody concerned makes quite a lot of difference.
CHRISTOPHER KENNEALLY: All right. Well, Laura Cox, thank you for that. And that brings us to the end of the presentations in our session, designing a metadata fitness program. We'll return in a moment for audience questions.