Name:
Visualizing Institutional Research Activity using Persistent Identifier Metadata Recording
Description:
Visualizing Institutional Research Activity using Persistent Identifier Metadata Recording
Thumbnail URL:
https://cadmoremediastorage.blob.core.windows.net/1633944b-c1a4-4d88-8d29-8e4487676ec7/videoscrubberimages/Scrubber_3.jpg
Duration:
T00H37M33S
Embed URL:
https://stream.cadmore.media/player/1633944b-c1a4-4d88-8d29-8e4487676ec7
Content URL:
https://cadmoreoriginalmedia.blob.core.windows.net/1633944b-c1a4-4d88-8d29-8e4487676ec7/Visualizing Institutional Research Activity using Persistent.mp4?sv=2019-02-02&sr=c&sig=T57%2BQ0LD4Ihn6FADDrKpJzC0%2B56S%2FLAbCvD8SG0ojbE%3D&st=2025-01-20T06%3A22%3A46Z&se=2025-01-20T08%3A27%3A46Z&sp=r
Upload Date:
2024-03-06T00:00:00.0000000
Transcript:
Language: EN.
Segment:0 .
Hello and welcome to this session on visualizing institutional research activity using persistent identifier metadata. It's a topic I'm very interested in, and I'm delighted to have some expert speakers here with me today. I'm going to let them introduce themselves because I'm terrible at pronouncing names and reading them at the same time. So please enjoy this recording, and when it ends, watch for a link to join the question and answer session on Zoom. I hope you will have lots of questions and join in the conversation. So take it away.
OK, so thanks, everyone, for joining us today for our presentation. Our presentation on visualizing institutional research activity using persistent identifier metadata is a project funded by Lyrasis and the Institute of Museum and Library Services.
My name is Sheila Rabun. I'm the program leader for persistent identifier communities at Lyrasis, which includes the ORCID US Community Consortium and the Lyrasis DataCite US Community Consortium. My colleague Paolo Gujilde is our ORCID US community specialist and he's also on the call with us today. For those who might not be familiar, Lyrasis is a non-profit organization based in the United States that provides a variety of different services to libraries, archives, museums and related knowledge communities.
Lyrasis is the administrative home of the United States ORCID consortium. So I mentioned the ORCID US Community: this is an ORCID consortium for non-government nonprofits, and we also provide a DataCite consortium for nonprofit organizations in the US. But for our talk today, we're coming from our roles working with the ORCID US Community, which started in 2018 as a partnership between the Big Ten Academic Alliance, the Greater Western Library Alliance and the Northeast Research Libraries.
So currently we have over 185 member organizations, and our main goal is to support ORCID adoption and foster a community of practice around ORCID in the US. Over the last several months, we have been partnering with the Drexel University LEADING program, which stands for Library and Information Science Education and Data Science Integrated Network Group.
This program matches organizations with early career Fellows to work on data related projects, basically helping organizations to get work done and helping Fellows to build skills related to data science. So last year we submitted a proposal to the LEADING program to work on a data visualization project using information from ORCID records in the United States. And from July through December, we had the pleasure of working with two Fellows who are presenting with us today.
So I'm pleased to introduce Negeen Aghassibake, the data visualization librarian at University of Washington Libraries, and also Olivia Given Castello, who is the head of business, social sciences and education at Temple University Libraries. The inspiration for our project came from others who have already done some work on visualizing data from ORCID records. Melroy Almeida, who leads the Australian ORCID consortium, had done some work visualizing collaborations between researchers at Australian institutions and researchers in other parts of the world using information from ORCID records.
Also, an article by John Bohannon in Science Magazine from 2017 used ORCID data to create a visualization of how scientists migrate from one country to another. And last year, Simon Porter, the director of innovation at Digital Science, was writing about ORCID adoption and tweeting data visualizations based on information in ORCID records. So we had some good inspiration out there. After we surveyed our community and tried to determine what the interest might be, we settled on a focus for our project: creating resources to allow librarians and others to create visualizations of collaborations between researchers at their own institution and other institutions.
So basically using public ORCID data as well as Crossref DOI metadata. Over the course of just six months, Negeen and Olivia created an R script that can be used to pull collaborator information from ORCID records and Crossref DOIs, resulting in a data file that can be loaded into a custom Tableau Public dashboard template to produce the visualization as a collaboration map.
So I'm going to hand it over to Negeen now to tell us more about the project. Thanks, Sheila. I'm going to dive a little deeper, but I'll first start with some additional context. Could you please go to the next slide, Sheila? Thanks! So our main goal was to visualize collaborations between institutions within a certain time period, to look at patterns and trends.
One of the most important lessons we learned is that researcher collaborations are really complex and can be difficult to define. This wasn't really a surprise, but after working with the data, we had a better understanding of the scope and the scale of that complexity. But despite these barriers, we found that there was still a lot that we could work with, and we pulled data, modified the code, pulled more data, and went through the cycle a few times until we got to a point where we could more easily work with it.
From there, we built a dashboard that explores an institution's data, and conceptually that starts with what we call a home or an anchor institution. And from there, we pull data for a specific time period for that institution using ORCID and Crossref. The data include individual home authors and their collaborators at other institutions based on those collaborators' current institutions. And we'll demonstrate what this looks like in just a minute.
Before we look at the dashboard, though, I want to be transparent about the data, which are imperfect and contain gaps as well as user and machine errors. The data and the dashboard capture what we can using the data sources that we have, and they do give a really good view of that. But the numbers are not absolute nor are they perfect. Next slide, please.
All of this work led to a suite of tools that we've created, or that Lyrasis and ORCID have created and that we've curated. The first one is obviously the dashboard, which we've created in a visualization tool called Tableau, like Sheila mentioned earlier. Aside from what we'll cover here today, the key takeaway is that it is available for anyone to use and customize: you can replace the data in the template dashboard with your own institution's data for the time period of interest to explore collaborations wherever you are.
There's also an R script that creates a network visualization to further explore collaborations through a different lens. And this is aimed at those who are interested in more customized solutions than what the Tableau template can provide. I'll also say the script is just a start and can use a lot more refinement and exploration. So if you're interested in it, we really encourage building on it and would love to hear more about that.
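The network script itself is part of the materials on the Lyrasis GitHub. Purely as a hedged illustration of the general idea, and not the project's actual code, here is a minimal sketch in R using the igraph package, with hypothetical column names standing in for whatever the real data file provides:

```r
# Minimal sketch of a collaboration network, not the project's script.
# Assumes a data frame with hypothetical columns "home_author" and
# "collaborator_institution", e.g. read from the orcid-data CSV file.
library(igraph)

collabs <- data.frame(
  home_author = c("Author A", "Author A", "Author B"),
  collaborator_institution = c("University of Pennsylvania",
                               "University of Washington",
                               "University of Pennsylvania")
)

# Count how often each author-institution pair occurs, then build an
# undirected graph whose edge weights are those collaboration counts.
edges <- aggregate(list(weight = rep(1, nrow(collabs))),
                   collabs[, c("home_author", "collaborator_institution")],
                   FUN = sum)
g <- graph_from_data_frame(edges, directed = FALSE)

plot(g,
     edge.width = E(g)$weight,   # thicker edges = more collaborations
     vertex.size = 8,
     vertex.label.cex = 0.8)
```

Weighting edges by a simple collaboration count is just one choice; refining how edges and weights are defined is exactly the kind of further exploration mentioned above.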
Then, like Sheila mentioned, there's the R script that pulls ORCID and DOI data for a specific institution. If you're comfortable with running the script on your own, there's some good documentation embedded in the script to help you do that, or you can reach out to Lyrasis for help with pulling that data. We also came across already created ORCID outreach materials.
We recommend that you use these materials from ORCID and Lyrasis to encourage your users to adopt and more fully fill out an ORCID profile, if that's something that you're interested in doing. And we think the dashboard is something that you can consider adding to those materials as well. And last but certainly not least, Lyrasis is here to support users and supply data for institutions, and is really a key part of this process.
All right. So we are at the dashboard itself. And we've recorded a video that goes over some of the key components of the dashboard. This dashboard is currently displaying data for Temple University for the time period 2021 to around the present. The background page includes a lot of the information that we'll cover here today.
The most important pieces are the context and considerations for the data, which impact how the data and this dashboard should be used. Also important is the contact information for further support that is included in this section. The summary dashboard contains high-level stats, including the number of article collaborations, the number of collaborating cities, and the number of ORCID ID holders for the specific time period of the data pull.
There's also a bar chart at the bottom that notes the institutions with which Temple has had the highest number of collaborations based on the data for this time period. I want to note here that the color scheme is very neutral because, again, you're able to customize this to fit your own institution. When we share out this dashboard, it will likely contain mock data.
So that it takes less time to load and so that it's easier to swap out the data. You can navigate between the different sections at the very top by clicking on the buttons, which I'll be demonstrating throughout this video. The collaborations map shows collaborations at the institutional and city levels. You can search for an institution and if it shows up in the data, it will show you where that institution is located.
For this example, I'll search for a major collaborator of Temple University, which is the University of Pennsylvania, and you'll see that the map zooms in on Philadelphia, which is where U Penn is located. And below it, it lists the institutions that come up when you search for the University of Pennsylvania. One limitation here is that typos, whether on the search end or on the data end, could potentially cause issues in bringing up information.
But on the whole, it'll pull back a pretty good idea of collaborations with that institution. Also, if you click on one of these dots below, you'll see the institutions within that city and the number of collaborations with those institutions. This is especially useful if you're interested in diving deeper into the national and global reach of these collaborations.
Another thing that you can do here is exclude your own institution from the map. Many researchers at an institution will obviously collaborate with other researchers at that same institution, so if you're more interested in external collaborations, you can filter them out here. The individual search section lets researchers explore their own collaborations by searching with an ORCID ID.
I'll show an example here. You can now see this researcher's collaborations. We chose not to search by name, not only because there are cases of names that are not unique, which is really what ORCID is trying to solve in the first place, but also because we didn't want to make it easy to just look up individuals, especially if they want to lock down their ORCID profiles.
It's important to note that this data shouldn't be used to evaluate or compare researchers against one another because, again, the data are not perfect and do not give a full picture of collaborations and impact. This dashboard is just one angle through which to approach this information. However, if a researcher has filled out their ORCID profile fully and feels confident with the information listed here, they can easily take a screenshot or export an image from this dashboard and drop it right into their promotion package or CV or website.
Another one of the features that we're excited about is the "Why can't I find my ORCID ID?" button. If someone can't find their ORCID ID, and I'll just put a string here to show this, they can come up here and click on that button, which takes you to a page that includes steps to take to acquire or clean up an ORCID ID so that researchers, or the users that you work with, are counted in the data.
We're hoping that this can be a useful outreach tool to encourage researchers to either adopt or more fully fill out their ORCID profiles. We will likely be adding some more specifics to this page, but these are some of the key issues that can cause someone to not show up in the dashboard. I will also note that we've worked to build accessibility into this Tableau dashboard in every way that we can, working around Tableau's own limitations.
Some of these include adding captions, checking color contrast and using resizable layouts where possible. However, we encourage you to check for accessibility as you customize, especially if you will be changing any of the major visualizations or adding your own branding to the template. So I won't be spending time on this slide, because it just lists the main points of the video that was just covered, written out.
But please feel free to refer to the slide in the video again if you want to revisit it. Next slide, please. Like I mentioned earlier, you can actually make this dashboard your own. So to do that, first, you want to reach out to Lyrasis for your institution's collaboration data or run the script yourself to get that data.
From there, you'll need to create a Tableau Public account, go to the template dashboard, copy it, and then replace the data with your own institution's data. This can usually be done in the browser without actually having to install Tableau, and you don't have to be a Tableau expert to do it. There are just a couple of key things to keep in mind. Any data that's saved to Tableau Public is saved on their public server; even though the data we're working with is openly available and easy to access,
it's still something to consider. There are also more detailed steps in the documentation for the dashboard, which is available through the Lyrasis GitHub. So now I'll pass it off to Paolo, who will talk about some of the highlights of the project. Thank you. And again, we're very excited about this project
and, of course, the results and the products that came out of it. There are a lot of project highlights that we want to share with you, and I'll go through some of them. First, the timeline: I think we've talked a little bit about it in the earlier slides, but the timeline of this project was just six months.
Within that time span, we were able to create an R script and then, of course, the dashboard. So a really big kudos to Negeen and Olivia for doing all of that work; it's been wonderful so far. Second is the availability of these tools: right now, both the script and the dashboard are available for everyone to use. We are still doing a lot of beta testing, but basically, at this time,
they can already be used. Related to that is the usability of the products: both the R script and the dashboard are very easy to use. I'm not the most technical person at all, but I was able to follow and run the R script and also upload the data into Tableau Public, and that really makes it easier for everyone to use.
I also want to highlight that these tools are available for free. RStudio, for example, as well as Tableau Public are available online at no cost, which makes it easier for everyone; you don't have to purchase another product to be able to work with all of this data. And speaking of that, our fourth highlight is that there is a good amount of PID adoption, specifically looking at ORCID as a persistent identifier for individuals as well as DOIs, or digital object identifiers.
All of the data really builds on those persistent identifiers, and you can see in the video how they get pulled and put together. That's a great representation of the PIDs that are being used currently. And finally, it also tells the story of collaborations.
The dashboard gathers the data and shows all of those collaborations with researchers, organizations and places. So it really tells you a story and gives you a great visualization of how everything fits together. With that, I'm going to pass it on to Olivia to talk more about the project. Thanks, Paolo. So I'm going to go into a little more depth about how we're getting the data that is visualized.
We created an R script to pull data from ORCID about authors and from Crossref about the authors' work. It uses the rorcid and rcrossref packages developed by Scott Chamberlain and builds on code developed by Clark Iakovakis from Oklahoma State University for reporting ORCID adoption by institution. Next slide.
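The script itself is available through the Lyrasis GitHub. Purely as a hedged illustration of what the rorcid starting point can look like, and not the project's actual code, here is a minimal sketch; the organization name, the affiliation-org-name query field and the row limit are example values, and rorcid is assumed to be configured with an ORCID API token as its documentation describes:

```r
# Sketch only, not the project's script: find ORCID ID holders who list a
# given home institution in their affiliations, then pull their employment
# records. Assumes rorcid can reach the ORCID public API with a valid token.
library(rorcid)

home_org <- "Temple University"   # example home institution

# Search the ORCID registry for profiles naming the home institution.
hits <- rorcid::orcid(query = sprintf('affiliation-org-name:"%s"', home_org),
                      rows = 200)
ids <- hits$`orcid-identifier.path`   # the bare ORCID IDs

# Retrieve each profile's employment section to confirm current affiliations.
employments <- rorcid::orcid_employments(ids)
```

A real pull would need to page through more results than a single request returns, but the shape of the calls is similar.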
A challenge that the script deals with is messy, complex data. It does a lot of the data wrangling, cleaning and restructuring that's needed to generate the CSV file that can be uploaded to Tableau. Briefly, starting from the perspective of a home institution, it retrieves profile data on the current ORCID ID holders from that institution. Then it unpacks the works list from the ORCID profiles and retrieves the Crossref data for every DOI in the works list. For each DOI, it then unpacks the co-author list and retrieves the ORCID profile data for each co-author who has an ORCID ID.
It checks those co-authors' ORCID profiles' employment sections to fill in location information for the co-authors' current institutions, and then it repackages all of it into a CSV file of individual home-author and co-author collaborations to upload and visualize in Tableau.
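To make the per-DOI step a little more concrete, here is another hedged sketch, again not the project's actual code: it uses rcrossref to unpack the author list for one example DOI and rorcid to pull the employment records of the co-authors whose ORCID IDs were deposited with the Crossref metadata.

```r
# Sketch only: for one DOI, unpack the co-author list from Crossref and
# look up each co-author's ORCID employment record. The project's script
# loops this kind of lookup over every DOI in every home author's works list.
library(rcrossref)
library(rorcid)

doi <- "10.1038/nature12373"   # an arbitrary example DOI

work <- rcrossref::cr_works(dois = doi)
authors <- work$data$author[[1]]   # data frame of authors for this DOI

# Keep co-authors whose ORCID ID is part of the Crossref record.
if ("ORCID" %in% names(authors)) {
  coauthor_ids <- gsub("https?://orcid.org/", "", na.omit(authors$ORCID))
  # Employment records are used to locate each co-author's current institution.
  coauthor_employments <- rorcid::orcid_employments(coauthor_ids)
}
```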
Next slide, please. Here's a visualization of that workflow, representing a single home author, just to show how quickly it branches out into many works and then many co-authors and all their home institutions. In reality, a home institution has many authors, so imagine this diagram repeated manyfold. That's what the script is doing. And R is great for doing this type of data wrangling. But there are challenges. First of all, as Negeen mentioned, there are some potential blanks and gaps in the data.
And for schools with large numbers of authors and large research output, some portions of the script can take several minutes to run due to all of this branching and the need to communicate with external services over the ORCID and Crossref APIs. The more years of data you want to retrieve, the more time the script takes to run. As authors adopt ORCID IDs and complete their ORCID profiles, the data visualization will definitely improve;
by the same token, it also means the script has more data to pull, which makes it take longer. Next slide. The R script also presents opportunities and spaces for improvement. Next. First off, it exposes where ORCID profiles could be more complete or where ORCID adoption is needed in the first place.
This translates into the button on the Tableau dashboard that was shown in the video, labeled "Why can't I find my ORCID ID?" Next slide. The script and project documentation have been easy to share on Lyrasis' GitHub, and R, as Paolo mentioned, is free and open. Many people are familiar with running scripts, particularly in RStudio.
Those who are comfortable running it themselves can do so. Or Sheila and Paolo from Lyrasis are also available to run the script for you and can give you your organization's data file for your time period of interest. A potential improvement related to this that would help anyone who's not comfortable with R and RStudio and would save Sheila and Paolo's time would be to create a web interface for what the script does.
Next slide. Beyond this, there are many opportunities to improve the script code. But just to name a few of them, most DOIs are minted by Crossref, which this script queries, but the script could generate an even more complete data file by also retrieving metadata for DOIs minted by DataCite.
The script also already attempts to resolve personal name variations and fill in ORCID profile data for known authors, but it could be updated to do even more iterations of that, which would help with filling in blanks in the data. Right now, the script retrieves collaboration data since a given year, and an improvement would be to modify the code to allow more freedom in choosing the time period, like retrieving only discrete years or retrieving data between very specific start and end dates.
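As a rough sketch of what that kind of flexibility could look like, and not code from the current script, the filtering itself would be a small addition; the publication_date column name here is hypothetical:

```r
# Hypothetical sketch of more flexible time-period filtering, assuming a
# works data frame with a publication_date column (not the script's real
# field names). dplyr is used here only for readability.
library(dplyr)

filter_works <- function(works, start_date, end_date) {
  works %>%
    mutate(publication_date = as.Date(publication_date)) %>%
    filter(publication_date >= as.Date(start_date),
           publication_date <= as.Date(end_date))
}

# e.g. keep only works published within a specific window:
# recent_works <- filter_works(works, "2021-07-01", "2022-06-30")
```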
So Negeen and I have left some additional code that's not used in the current script for any future LEADING Fellows or employees of Lyrasis that would give them a head start on making some of these code improvements. Some other ideas for totally new functionality include using the script to try to resolve messy department affiliation data. So that the dashboard could also visualize collaborations by discipline or department.
We also think it could be useful to create a version of the script that individuals could run to retrieve and visualize only their own collaboration data. And then finally, we think that expanding the script to retrieve citation metrics for DOIs would add another dimension that would really enhance the data visualization.
And now Negeen is going to talk about the data in more detail. Thanks, Olivia. Next slide, please. So we are able to visualize a lot with existing ORCID and Crossref data, but there are still many potential sources of gaps and blanks. The R script attempts to fill in blank ORCID data for known researchers where possible, but there are still things that the script just can't fix.
The top issue is really just authors who do not have an ORCID profile. So this means home institution authors with no ORCID profile, or co-authors with no ORCID profile, or cases where a co-author's ORCID ID is not part of the Crossref data and the script just can't fill it in. Another challenge is authors who may have an ORCID profile but are missing the employment or works information needed by the script.
Those constitute the vast majority of the blanks, but there are two other, more minor sources of blanks. Occasionally profile data just includes some errors, such as typos or perhaps the wrong institution name. And currently the script only looks at DOIs issued by Crossref, so if there is a DOI that was not issued by Crossref, which Olivia discussed a little earlier, it will also lead to blanks.
Next slide, please. These are some examples of the types of errors that can lead to blanks or other issues in the data. The example on the far left demonstrates the complete lack of data if an author does not have an ORCID profile. In the middle, you can see an author who has their education and some works listed but doesn't have institution information, which is key for pulling data to be able to actually map them.
The example on the far right shows where a researcher probably mistakenly added the University of Washington as their institution when they meant to add Washington University instead. So again, these are errors that the script just can't fix, but I'll pass it to Paolo now to discuss opportunities that might help improve these issues. Thank you.
And again, as you already know, for every challenge there are a lot of opportunities that we can take. One of those is that this project motivates PID usage: the data and visualization show the value of persistent identifiers, which provides support for the use of PIDs in research-related workflows.
For example, if we take publications, persistent identifiers for individuals like ORCID IDs, as well as DOIs, both play an important role in pulling, gathering and visualizing the data, so this project really looks into those two persistent identifiers. In addition, PIDs can help us reconcile missing data, such as researchers who are missing an institutional affiliation, for example, or perhaps missing DOIs.
And we already know that there are a lot of additional PIDs available, for different purposes, and all of those are really an opportunity to look into how they are connected with each other and why they are important in our research landscape. Next slide, please.
Thank you. So now, as we talk about persistent identifiers for individuals, the data really motivates ORCID adoption. We know that there are a lot of missing researchers in the data, so this gives us an argument in support of ORCID adoption at your institution or organization. It is a way to provide a compelling story of collaborations, like we mentioned earlier, and of research impact for administrators, faculty and other stakeholders who may not be familiar with ORCID.
All of this gives you, as a librarian or as an individual working with ORCID in your organization, an opportunity to do promotion and outreach to a lot of people in your organization. This also encourages ORCID participation overall, not just from administrators but also from researchers and other stakeholders. So it really encourages the creation of an ORCID ID and record.
And we know that sometimes it is a challenge to populate and maintain records; we often hear from researchers that it can be a lot of work to maintain an ORCID record. So this will encourage ORCID users to populate and maintain their ORCID records, and that can be done by using existing ORCID integrations and connecting to systems or organizations with ORCID integrations.
It could be an integration at your institution, for example; it is an opportunity for organizations to encourage faculty, researchers and others to connect with them. Furthermore, the data can be used for tenure and promotion and other related activities. Researchers represented in the data can use the visualization to show their research and collaborations, not just across campus but external collaborations as well.
So now I'm going to pass it on to Sheila to talk more about opportunities. Yeah, so again, we're hoping this project will also motivate research collaborations and publishing collaborations, because organizations can use this tool to get a visual picture of the collaborations that are taking place, and they may start to notice patterns or trends in the collaboration activities of their researchers.
It can provide further insight into that activity, which could help show the impact that an organization is having across the global landscape. Looking more at the collaboration activity at an institution, some of the trends this could expose are, for example, whether there are certain areas of the country or areas of the globe that are missing from the collaboration map.
So for organizations that are looking to expand their reach and impact and representation, this might be a helpful tool for identifying collaboration areas of focus. So those are just a few ideas about how this could potentially provide more insight into collaborations and motivate more collaboration. And then also we're thinking this will help to motivate ORCID best practices too.
So with all of the data errors that can be uncovered through this tool, we're hoping that organizations will see that they can actually use the ORCID member API to mitigate some of those errors by writing data to researchers' ORCID records for them. That would help with institution name variations and spelling errors, location errors, and, for example, references to the whole institution versus just one department,
kind of standardizing the information that we're finding in ORCID records. These discrepancies can be avoided when institutions use the ORCID member API to write that information to researchers' ORCID records, which removes that chance of error. ORCID's best practice for organizations, which includes funders and publishers as well as universities, research institutions and other groups, is for the organization to write data to researchers' ORCID records.
So that way researchers can save time and the organizations can confirm the correct affiliation information on their researchers' ORCID records. So if you want to create your own visualization, we do have everything documented on GitHub. Just to recap, the basic steps for how someone would take advantage of this project and these tools would be, first, to go to our GitHub page, which has all of the information you need on it.
And you can see that the R script is included right down there on our GitHub page. So you would go to that R script, download RStudio, which is a free tool, open RStudio and basically execute the script. There are some values that need to be filled in based on what organization you want to search for, but you basically just go through the script in order.
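The variable names below are hypothetical stand-ins rather than the script's real ones, but conceptually the values you fill in and the final output step look something like this:

```r
# Hypothetical sketch of the values a user fills in before running the
# script in order, and of the final output step. The real script on the
# Lyrasis GitHub uses its own variable names and query logic.
library(readr)

org_name   <- "Thomas Jefferson University"  # home institution to search for
start_year <- 2021                           # retrieve collaborations since this year

# ... the script then pulls ORCID and Crossref data as described earlier,
# ending with a data frame of home-author / co-author collaborations ...

# Final step: write the CSV file that gets loaded into the Tableau template.
# write_csv(collaborations, "orcid-data.csv")
```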
When the script is done running, the final output is a CSV file called orcid-data, and it looks something like what you're seeing here. It can be a very large data file or a small one, depending again on how much data is available in the ORCID records of the researchers. You'll take that data file and go to Tableau Public, another free tool, and sign up for an account.
Then you'll go to our ORCID US Community template, make a copy of it, and then basically replace the template data with your own data file. We have detailed instructions on how to do this, again on GitHub. And then when you're done customizing your dashboard, you publish it and get to see the visualization for your organization. You can then share the link to your visualization
so other people can see it, and you can take screenshots to use in presentations, et cetera. The one you're seeing on the screen right now is just a test example that we did for Thomas Jefferson University. You can see all the little dots on the map where those collaborations are taking place, and hopefully you can see how having a visualization like that could be valuable in multiple contexts.
So again, everything's documented and we're hoping that people will be as excited about this as we are. We're really looking forward to having people use these tools. And we will be using them as well. And thank you again for attending our presentation. We really look forward to any questions and also discussion. And of course, feel free to contact any of us with the email addresses that are here.
Well, thank you very much. That was really wonderful, and I am certainly excited by everything I just saw. I want to dig into the R script and get UCSF's ORCID information and new data visualizations. So let's watch for the link to the next Zoom chat session. As soon as this video stops, you should see it coming up, and I look forward to hearing from you there.
Thank you very much.