Name:
Curating a community registry of research organizations-NISO Plus
Description:
Curating a community registry of research organizations-NISO Plus
Thumbnail URL:
https://cadmoremediastorage.blob.core.windows.net/6ec45098-aa69-45b9-9963-3e9035378bc6/thumbnails/6ec45098-aa69-45b9-9963-3e9035378bc6.png?sv=2019-02-02&sr=c&sig=hCjJ6OC%2BRpD39BwtFjKSEE%2Ft1IOJplnYSdRkedGH9eA%3D&st=2024-11-21T21%3A44%3A15Z&se=2024-11-22T01%3A49%3A15Z&sp=r
Duration:
T00H29M54S
Embed URL:
https://stream.cadmore.media/player/6ec45098-aa69-45b9-9963-3e9035378bc6
Content URL:
https://cadmoreoriginalmedia.blob.core.windows.net/6ec45098-aa69-45b9-9963-3e9035378bc6/Curating a community registry of research organizations-NISO.mp4?sv=2019-02-02&sr=c&sig=xDDLELBBe2xuLm%2BdbK%2BCyQ0oBAngaZO6b6EpaeppCqM%3D&st=2024-11-21T21%3A44%3A15Z&se=2024-11-21T23%3A49%3A15Z&sp=r
Upload Date:
2022-08-26T00:00:00.0000000
Transcript:
Language: EN.
Segment:0 .
[MUSIC PLAYING]
MARIA GOULD: Hello. This presentation is about the process, successes, and challenges of implementing a community-based curation model for the Research Organization Registry, also known as ROR. I'm Maria Gould from California Digital Library, one of ROR's operating organizations. And I'm joined in this presentation by members of the ROR curation advisory board. You'll hear from them in a little bit.
MARIA GOULD: The presentation will start with an overview of ROR and how we initially developed our approach to curation. And then we'll discuss the implementation of the workflow, and finally close with personal reflections from curation board members about their experience with the process so far. So let's start with some background on ROR. ROR is a community-led registry of open, sustainable, usable, and unique identifiers for every research organization in the world.
MARIA GOULD: The registry launched in 2019. And since then, it has grown to include unique IDs and metadata for more than 100,000 research organizations around the world. The registry data is available under a CC0 waiver, and there is an open API and a public data dump. The original seed data for ROR came from GRID as a foundation for ROR to build upon. The question that ROR was created to help answer is how to connect research outputs to research institutions.
MARIA GOULD: For example, let's say we wanted to locate recent articles published by researchers at UC Berkeley. If we just search for these articles using a text string, we may find some articles, of course, but we also may miss a lot because there can be so many different ways to express an institution's name. Even if an official version of the name exists, people and publishers may still use different variations for various reasons.
MARIA GOULD: For more efficient tracking or discovery of research by institution, we need a standard way to identify this institution. For example, the human-readable metadata shown here for this published data set indicates that it comes from UC Berkeley. But under the hood, the metadata for the DOI associated with the data set includes a ROR ID for UC Berkeley.
MARIA GOULD: So instead of searching for research using an institution's name as a search term, we can query the ROR ID to get a more comprehensive and reliable set of results. When we have ROR identifiers embedded in metadata for research outputs or in metadata about researchers, it means we can establish meaningful connections between key components of the research ecosystem.
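As a rough illustration of what this looks like in practice (not part of the original presentation), here is a minimal sketch that maps free-text affiliation strings to a ROR ID using ROR's public affiliation-matching endpoint. It assumes the v1 API at api.ror.org and its "chosen" flag for high-confidence matches; UC Berkeley is used purely as an example.

```python
# Minimal sketch: resolve messy affiliation strings to a single ROR ID using
# ROR's public affiliation-matching endpoint (v1 API assumed).
from typing import Optional
import requests

def match_affiliation(affiliation: str) -> Optional[str]:
    """Return the ROR ID the matching service marks as 'chosen', if any."""
    resp = requests.get(
        "https://api.ror.org/organizations",
        params={"affiliation": affiliation},
        timeout=10,
    )
    resp.raise_for_status()
    for item in resp.json().get("items", []):
        if item.get("chosen"):  # high-confidence match flagged by the service
            return item["organization"]["id"]
    return None

# Several spellings of the same institution should resolve to one identifier.
for name in ["UC Berkeley", "University of California, Berkeley", "Univ. of California Berkeley"]:
    print(name, "->", match_affiliation(name))
```

Once a string resolves to a single ROR ID, that ID can be used as an exact key in systems that carry ROR IDs in their metadata, rather than relying on name variants.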
MARIA GOULD: To that end, ROR IDs are integrated or are being integrated in Crossref and DataCite metadata, in ORCID, and in various platforms and systems where affiliation data is collected or harvested. Now let's take a look at what is inside the registry. Here is an example of a ROR record in the registry's search interface. You can see the unique ROR ID for UC Berkeley, as well as additional metadata about the organization which is helpful for discovery and disambiguation.
MARIA GOULD: The record also includes crosswalks to other IDs for the organization when these exist, such as ISNI IDs and Wikidata. And here is the same record in the ROR API. Some additional metadata is available here, like related organizations and specific location data for the organization. So the registry got a jumpstart by launching with the seed data from GRID, and it is fairly comprehensive at this point with 100,000-plus organizations.
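For readers who want to see the shape of that API record, here is a small sketch (not from the talk) that retrieves the UC Berkeley record (https://ror.org/01an7q238) and prints the fields mentioned above. The field names reflect the v1 schema as of this writing and may evolve as the data model changes.

```python
# Sketch: pull one record from the ROR API and list its crosswalks to other
# identifier systems. Field names follow ROR's v1 schema.
import requests

record = requests.get(
    "https://api.ror.org/organizations/https://ror.org/01an7q238", timeout=10
).json()

print(record["name"])                      # primary label
print(record.get("aliases", []))           # alternate names
print(record.get("acronyms", []))          # e.g. "UCB"
for scheme, ids in record.get("external_ids", {}).items():
    print(scheme, ids)                     # ISNI, Wikidata, GRID, FundRef, ...
for rel in record.get("relationships", []):
    print(rel["type"], rel["label"])       # related organizations
```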
MARIA GOULD: But there's still a need to maintain records going forward and to add new organizations as they come up. In order for ROR to be a trusted and useful source of information about affiliations, we need to ensure that the registry records are comprehensive, up to date, and as usable as possible. Shortly after ROR launched, we began exploring different approaches to curating the registry. The goal was to come up with an approach that would help ROR address crucial gaps in the registry's coverage, reflect the changes that organizations go through over time like changing their names or merging with each other, fix inconsistencies in our metadata, optimize the metadata for discoverability and global usability, and develop policies and processes for maintaining ROR data with community input, given ROR's focus on being a community-driven registry of organizational data.
MARIA GOULD: We eventually settled on the notion of a community-based curation model to address these goals. A core aspect of this model is balancing the need for ROR to be a community-driven registry with the need to have consistent, usable, maintainable data. Early on in the process, we consulted with our advisory groups to discuss certain considerations, such as whether it was necessary for organizations to have the ability to manage their own records in ROR.
MARIA GOULD: Ultimately, the feedback we received from these conversations did not indicate that this was a priority. We determined that we could manage the registry more efficiently by centralizing our curation processes while remaining open to ongoing input. We decided to set up open feedback channels for anyone to suggest registry additions and updates, which allows us to leverage a wide range of expertise about institutional metadata, and to centralize this feedback through an open review process to make sure that we are making changes in a consistent and transparent way.
MARIA GOULD: And to develop this model, we were very fortunate to secure funding from the Institute of Museum and Library Services to bring this to fruition. A key to implementing this curation model for ROR has been working with a board of advisors from across the community to help develop consistent practices and policies for how the registry is managed and to help review proposed changes that come in via community feedback channels.
MARIA GOULD: In 2020, we began piloting this work with a curation advisory board of individuals representing different types of expertise about organizational metadata and different expertise about certain regions and sectors. We began piloting workflows that would allow these community curators to more or less work asynchronously to help develop policies and review proposed changes to ROR and also have the flexibility to take on different levels of involvement, depending on their time and interests.
MARIA GOULD: So this is a bird's-eye view of the process that we have in place now for reviewing proposed changes to the registry. You can see at the top that it starts with input coming in from various channels across the community. There is a preliminary triage step to confirm that it's in scope and can proceed further. And then it undergoes a review process by our community curators to determine whether it can proceed even further in the process and go through a final decision as to whether this organization can be added to ROR, or whether the proposed metadata changes can be approved.
MARIA GOULD: The final step in the process is preparing the final metadata records to be incorporated into the ROR production site. This is a key part of the technical work and technical infrastructure development that ROR has been undertaking, which we will be rolling out later this year. And now, I'm going to turn it over to Carly Robinson, who will go through this workflow in more detail.
CARLY ROBINSON: Hey, thanks, Maria. So again, my name is Carly Robinson. I'm with the US Department of Energy's Office of Scientific and Technical Information. And I'm one of the volunteer members on the curation advisory board. So in this section, we'll discuss the community-based curation workflows that we have been piloting. So if you could go to the next slide.
CARLY ROBINSON: So the current workflow for reviewing proposed registry updates includes a few different components. It includes a community feedback form, a GitHub repository, a GitHub project board, and the decision-making process. And I'll go through each of these in a little bit more detail. So if you can go to the next slide. Anyone can submit suggestions to add or update the registry data via a public request form.
CARLY ROBINSON: And you can see links here to that request form. It collects information about the request, such as whether it's a request for a new ROR ID for an organization or a request for an update. So you can use that form to do that. If you go to the next slide. Requests are automatically converted to GitHub issues in a public repository. And they go through an initial triage step, like Maria mentioned, to categorize the type of request, the complexity or effort involved to review it, and also to assign it a relative priority level.
CARLY ROBINSON: We also flag requests that come directly from a given organization as opposed to those that might be submitted by a non-affiliated user. All requests are very welcome, but we do flag those in that way. We also decided to use a Google request form as the first step because we didn't want to require folks to have a GitHub account. Google request forms are really easy to use, and for folks who might not have GitHub experience, it's a nice format to be able to use.
CARLY ROBINSON: But this is something that we do want to potentially consider changing in the future. It may evolve depending on community needs. Go to the next slide. So the triaged GitHub issue ends up on a public GitHub project board for curation board members to review. So depending on the type or the complexity of the feedback, some requests can be approved right away, whereas others might require a secondary review by another curator, or it might need a larger group discussion.
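Purely as an illustration of how such triage labels could be applied programmatically (this is not ROR's actual tooling), the following sketch uses the standard GitHub REST API; the repository name, label names, and token handling are assumptions for the example.

```python
# Illustrative sketch only: apply triage labels to an incoming request issue
# using the standard GitHub REST API. Repository and label names are assumed;
# a personal access token is required in the GITHUB_TOKEN environment variable.
import os
import requests

REPO = "ror-community/ror-updates"  # assumed name of the public feedback repository
HEADERS = {
    "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

def triage(issue_number: int, request_type: str, priority: str) -> None:
    """Attach a category label and a priority label to one request issue."""
    url = f"https://api.github.com/repos/{REPO}/issues/{issue_number}/labels"
    resp = requests.post(
        url,
        headers=HEADERS,
        json={"labels": [request_type, f"priority: {priority}"]},
        timeout=10,
    )
    resp.raise_for_status()

# e.g. mark issue 1234 as a new-record request with medium priority
triage(1234, "new record", "medium")
```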
CARLY ROBINSON: So curators that are part of the group work asynchronously as they have time to pick up issues. But we do have a monthly meeting where we review issues that might be flagged for discussion and how we want to handle them. So we have gone through a lot of these, and sometimes we still want to have discussions for specific areas that come up. Next slide.
CARLY ROBINSON: So as a group, we have defined documentation and criteria for reviewing requests to ensure that they are within scope. And this documentation is also publicly available on GitHub. This has been developed throughout the group working together, working through requests, and coming up with standard practices for how to evaluate what's within scope of ROR. Next slide.
CARLY ROBINSON: So requests that have been approved get queued up to be prepared for the registry's next release, including preparing the metadata files and deploying to ROR's production site. So historically, ROR's updates were synchronized to GRID, and we were on GRID's schedule, which was approximately four times per year. But ROR is now beginning to make its own updates independently.
CARLY ROBINSON: And we will be establishing our own schedule and the timeline moving forward. Next slide. So the project board also has a space for holding onto requests that can't be processed right away, either because they represent a long-term project, or they might be dependent on new technical functionality that doesn't quite exist yet.
CARLY ROBINSON: And so there's a space for holding those. Also, for requests that might not have been approved, they're tagged with the reason why. It's a public GitHub board, so you can follow along and see that. This could be because it might be a duplicate request, or the organization in question might not be within ROR's scope, or maybe we just received feedback that was a general question and not a specific request.
CARLY ROBINSON: So with that, I'm going to hand it over to Arthur.
ARTHUR SMITH: Thanks, Carly. So in this section, we'll cover what's coming up next in the implementation of the curation model for ROR. Next slide. So, as Carly covered, once something has been approved by the curation process, we need to go through the actual process of changing it in the registry. And we haven't actually done that independently yet.
ARTHUR SMITH: As Carly said, we coordinated through GRID up to now, but we need to start processing changes independently with our own infrastructure for doing that. So the initial version, the MVP, Minimum Viable Product, is a form that the curators are going to use to generate validated JSON. That's the metadata file, either for a new entry or changing an existing one. And those JSON files would then be added or replaced in the production registry.
ARTHUR SMITH: So the changes are available to whoever uses the API, the search interface, or the data that we provide. Next slide. So this is a screenshot of the JSON form that we're going to use to create these metadata records to feed into ROR. It produces automatically validated JSON that follows the schema that we've defined, and you can see the actual file on the right-hand side of the screen that it would be producing.
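To illustrate the general idea of schema-validated records (this is not ROR's actual schema or code), here is a small sketch using the Python jsonschema library with a toy schema; the field names and the ID pattern are simplified assumptions.

```python
# Simplified sketch of the idea behind the form: every generated record is
# validated against a JSON Schema before it can be queued for release.
# The schema below is a toy illustration, not ROR's actual schema.
from jsonschema import validate, ValidationError

RECORD_SCHEMA = {
    "type": "object",
    "required": ["id", "name", "types", "country"],
    "properties": {
        "id": {"type": "string", "pattern": r"^https://ror\.org/0[a-z0-9]{8}$"},
        "name": {"type": "string", "minLength": 1},
        "types": {"type": "array", "items": {"type": "string"}},
        "country": {
            "type": "object",
            "required": ["country_code"],
            "properties": {"country_code": {"type": "string"}},
        },
    },
}

candidate = {
    "id": "https://ror.org/01an7q238",
    "name": "University of California, Berkeley",
    "types": ["Education"],
    "country": {"country_code": "US"},
}

try:
    validate(instance=candidate, schema=RECORD_SCHEMA)
    print("record is valid and ready to queue for release")
except ValidationError as err:
    print("record rejected:", err.message)
```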
ARTHUR SMITH: Using this form means we don't have to actually write code. It's just filling in the fields, so hopefully it's pretty simple for our curators to use. And so this piece of it is actually pretty much complete and being tested right now. There's another piece to handle relationships. That's a little more complicated because you have to update more than one JSON file at the same time.
ARTHUR SMITH: That's still under development, and the process of deploying to the released ROR data set is also still in development. That will be coming soon. Next slide. So longer term, beyond that minimum basic infrastructure to update ROR records, there are a lot of other things that we would like to do. We need to better handle historical records.
ARTHUR SMITH: That is, records for organizations that are no longer active, either because of a merger or because the organization shut down. Up to now, we have basically followed the GRID model, which essentially removed those from the main repository, so we need to handle that better. We also need to handle bulk requests, such as when data managers have a large set of institutional data that they want to reconcile with ROR.
ARTHUR SMITH: We've had a couple of those come in already and basically handled them one by one. So it'd be nice to have a little bit more automation there. We also expect to evolve the data model a bit. Now that we're no longer tied to GRID, there's some changes that we know are probably going to be needed.
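As a sketch of the kind of bulk reconciliation just described (not an existing ROR tool), the following example checks a CSV of institution names against ROR via the public affiliation-matching endpoint; the file name, column name, and the v1 endpoint's "chosen" flag are assumptions for this illustration.

```python
# Sketch: bulk-reconcile a CSV of institution names against ROR using the
# public affiliation-matching endpoint (v1 API assumed). File and column
# names are placeholders for the example.
import csv
import requests

def best_ror_match(name: str):
    """Return the 'chosen' ROR ID for a name, or None if no confident match."""
    resp = requests.get(
        "https://api.ror.org/organizations",
        params={"affiliation": name},
        timeout=10,
    )
    resp.raise_for_status()
    chosen = [i for i in resp.json().get("items", []) if i.get("chosen")]
    return chosen[0]["organization"]["id"] if chosen else None

with open("institutions.csv", newline="", encoding="utf-8") as infile, \
     open("reconciled.csv", "w", newline="", encoding="utf-8") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["name", "ror_id"])
    for row in csv.DictReader(infile):
        writer.writerow([row["name"], best_ror_match(row["name"]) or "NO MATCH"])
```

Names without a confident match would still need the manual, one-by-one review described above; the automation only narrows down the pile.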
ARTHUR SMITH: And we also, in general, want to automate the processing to save time and focus curator efforts on the most complex and high-touch changes. I did want to mention that the most frequently repeated community request we've seen is some way to identify sub-organizations like university departments and institutes. That's not something that we're going to do real soon. But if we tackle some of these other things, handling that in some way or another should be possible over the longer term.
ARTHUR SMITH: So some challenges in the next slide-- challenges in curating. So yes, ROR is a research organization registry. It's not the only identifier for organizations. But it's focused on, as Maria said at the beginning, associating research outputs with the organizations that produce them.
ARTHUR SMITH: And so if you're not engaged in scholarly research, your organization is most likely outside of ROR's scope. Still, we somehow get many requests from businesses, such as real estate, consulting, media, and marketing firms. Those are not what ROR is about. ROR does not exist to promote or validate your organization. Its purpose is to help the research community identify research organizations.
ARTHUR SMITH: So making that clear is one challenge we have. There is also some tension between being a centralized system and serving a global community, since we're trying to cover all of the research organizations around the world. An example of that is the main label in ROR: we have usually used an English label no matter what language the organization's official name is in.
ARTHUR SMITH: But there are some organizations that do not have an official English translation of their name. And so that may be a case where we need to use the native language name. If their name is commonly used in affiliations, that would make sense. But in the longer term, we may need to update our metadata model or do something to better distinguish these different types of labels, official names, translations, old names, and so on.
ARTHUR SMITH: Another challenge is the range of updates that need to be made to the registry. In some cases, we need to make a change that is as simple as updating a URL. But other times, there may be a more in-depth proposed change that we need to figure out how to represent in the metadata, or we may need to consolidate or clean up a number of records altogether. So we need to have workflows that allow these different types of updates to be done efficiently.
ARTHUR SMITH: The simple changes should be quick, but we also need ways to efficiently review and process the more complex things so they don't get backlogged. And then finally, this is a volunteer effort. The curation board, myself, and the others are all doing this on our own time, pretty much. And it's important that we have ways for board members to contribute with flexibility, and also to focus their time and efforts on the most important tasks and decisions.
ARTHUR SMITH: I think we've been pretty lucky. We've been able to balance that OK up to now. We only have maybe a month or two's worth of requests that are still in the pipeline. We've gone through a lot already. But that may not last. We need to be able to continue balancing that and optimizing the time of our curators and not overwhelming those who contribute their free time to this effort.
ARTHUR SMITH: So with that, I'm turning the time over now to some personal reflections from members of the curation board on how this process has worked for them and what they think about it all.
MARTIN SPENGER: Hello from Munich, Germany. I'm Martin Spenger from the University Library at Ludwig Maximilian University of Munich, where I primarily work on topics related to research data management. In this context, using persistent identifiers is a very important part of our work. Organizational identifiers like ROR can help immensely in the publication process for research data and other publication types.
MARTIN SPENGER: Regarding ROR, I'm relatively new to the ROR curation advisory board. From what I have experienced in the last months, huge benefits of this group are the transparent and efficient workflows that come with the curation process. Everything from updating ROR metadata to creating entirely new ROR records is documented in workflows, and decisions are openly available. Especially for entries regarding European or German organizations, this can be very helpful.
MARTIN SPENGER: For example, many institutions in Germany do not have an English translation as the preferred organization name. This can be tricky sometimes. But the ROR curation guidelines also help organizations to identify the best ways to create suitable records and to support specific use cases. The curation process also helps to create consistency for similar organization types, for example, research clusters that consist of several entities, each of which is eligible for its own ROR ID.
MARTIN SPENGER: I can only encourage you to take a look at the work of the curation working group to learn more about ROR and to make sure that the entry for your organization is up to date. Thank you.
KELLY STATHIS: Hi, I'm Kelly Stathis, and I am the discovery and metadata coordinator for the Digital Research Alliance of Canada. On the ROR curation advisory board, I think that one of the biggest challenges we have grappled with is language. Because ROR is an international registry, it contains organizations from all over the world, and the names of these organizations are in many different languages.
KELLY STATHIS: ROR supports multiple languages and character sets. And the metadata schema can include alternate names or labels for a given organization, and this includes names in different languages. Even with this multilingual support, however, I think that we still have work to do to make our metadata and schema more reflective of our multilingual world and, by extension, less Anglocentric.
KELLY STATHIS: To give an example, here in Canada, we have two official languages, French and English, which means that there are many institutions with two primary names, one in each official language. Because the primary name field in ROR is not repeatable, there hasn't been a way to accurately reflect these two names in the metadata. So curators have to prioritize one name as the primary name. This has meant that for some organizations, the ROR metadata indicates that the English name is the primary name, and the French name is an alternate label.
KELLY STATHIS: But in reality, they are both primary names of the institution, and they are equally official. Relatedly, there are also cases where we have received requests to update an organization's primary name because the metadata had that name in English but there was never actually an official name in English to begin with. In some of these cases, it was because the original metadata in GRID was in English.
KELLY STATHIS: GRID tended to use more translated names. And so now, we are working to correct these names on an individual basis to make sure that organizations are accurately represented in their official languages in ROR. It's been very rewarding to engage in this work with the ROR team and the curation board. And I'm really excited to see how ROR will continue to grow as we see more adoption and begin to evolve the metadata schema.
NICK LUNDVICK: Hello, my name is Nick Lundvick, and I am a scholarly communications and curation librarian at Argonne National Laboratory. I joined the ROR curation team in early 2021. Reflecting on my experience as a member of the team, I found that the ROR curation model exhibits multiple unique strengths. The open and transparent process allows all interested parties to submit requests and be aware of the curation and decision-making process.
NICK LUNDVICK: And a requester can view the GitHub project board to see curator comments as well as review ROR metadata policies and curation workflow documentation. Also, the community-based curation approach brings in expertise and unique perspectives. This allows the team to navigate nuanced and complicated requests and ensures data integrity. These and other aspects make the ROR curation model a valuable tool for curating organizational metadata.
CARLY ROBINSON: Hello again. Carly Robinson from the US Department of Energy's Office of Scientific and Technical Information and also a member of the ROR curation advisory group. I've been involved with this group since ROR created it. And it has really been a joy to work with Maria and the rest of the curation team to create these curation processes and discuss questions that regularly come up. So there are many unique cases that regularly are discussed when organizations are requesting a new ROR ID or requesting an update to metadata associated with one of the existing ROR IDs.
CARLY ROBINSON: So having this group of community members to bounce thoughts and ideas off of has really been wonderful. Being in the US government, one of the areas that I have focused on is how government offices, labs, and facilities fit within ROR.
ARTHUR SMITH: Hi, I'm Arthur Smith. I've been with American Physical Society Publications for over 25 years. And part of my work involves encouraging our participation in common standards for identifiers, such as DOIs for articles, ORCID for authors, and now ROR for affiliations. The Physical Review journals have managed an internal database of several thousand institutions for authors and reviewers going back almost 50 years.
ARTHUR SMITH: In 2015, we merged this with the GRID database of institutions from Digital Science. And around the same time, I started looking at Wikidata as a crowdsourced alternative or option for institutional identifiers. And I have spent some time reconciling GRID and Wikidata identifiers. That turned out to be very useful in helping find duplicates and other kinds of errors in the GRID data, and, of course, in correcting things on the Wikidata site as well.
ARTHUR SMITH: So in the end, I sent hundreds of suggestions for additions or corrections to GRID, based on Wikidata and the institutions we needed for our database. Now GRID is being retired in favor of ROR. And I've been excited to be heavily involved in the curation process for ROR. My experience with Wikidata gives me a good background in tracking down identifiers for newly proposed institutions, not just Wikidata but ISNI, Crossref Funder IDs, and so on.
ARTHUR SMITH: And in turn, those help confirm that it really is a new institution, not just a different name for something that we already have. I also had the opportunity to work on a few bulk requests. For example, we received a list of universities and government agencies, and related things in Peru that were missing from ROR. It turned out we had one of those already, thanks to GRID. But I fleshed out and verified the others.
ARTHUR SMITH: Reviewing every curation request takes a few minutes at least: checking the website, looking for Wikipedia entries, looking for other identifiers, verifying how the name is used, particularly in affiliations. But it's the borderline cases that take the most time. ROR is currently scoped to include just top-level institutions. So does this proposed new institution qualify? Is it sufficiently well established? Is it really a research organization?
ARTHUR SMITH: The majority of requests we receive do easily meet the criteria, but there are plenty of borderline ones, and they do take some time. Interacting with the other curation team members has been a great experience. We have a diverse group with many different perspectives. And I look forward to continuing to help guide ROR into the future.
MARIA GOULD: Thank you so much for listening. And please get in touch if you would like to learn more or get involved. And thank you again to everyone on the curation board for helping to make this happen, and all of the great work that you do, and look forward to more. [MUSIC PLAYING]