Name:
MakingTheBusinessCaseForInvestingInMetadata
Description:
MakingTheBusinessCaseForInvestingInMetadata
Thumbnail URL:
https://cadmoremediastorage.blob.core.windows.net/5e101d96-19c9-4e46-bd45-cd22548d5c42/videoscrubberimages/Scrubber_1.jpg
Duration:
T00H33M50S
Embed URL:
https://stream.cadmore.media/player/5e101d96-19c9-4e46-bd45-cd22548d5c42
Content URL:
https://cadmoreoriginalmedia.blob.core.windows.net/5e101d96-19c9-4e46-bd45-cd22548d5c42/MakingTheBusinessCaseForInvestingInMetadata.mp4?sv=2019-02-02&sr=c&sig=RZlrQLpgiiS9%2BmCS2B0VhuoIeVl7OlE1ORSD9Nb2QSo%3D&st=2024-12-08T18%3A33%3A39Z&se=2024-12-08T20%3A38%3A39Z&sp=r
Upload Date:
2024-03-06T00:00:00.0000000
Transcript:
Language: EN.
Segment:0 .
So while you're all having a think - Josh, if I can, I had a question for you. Some of the research that you talked about is so compelling, right? I was wondering if you could expand a little more on what would be involved in implementing some of the strategies that potentially result in those time savings and efficiencies.
Thank you. That's a really interesting question, and I think there are a few key components there. One is identifying the priority areas where work is being done - whether there are pain points for researchers, or particular policy priorities such as monitoring an open access publishing mandate or, say, a large-scale research reporting process. Those kinds of things create a huge amount of effort and take up a lot of time.
If you can identify some of the key information processes that underpin or enable those and focus on them as your target integrations, then first, people are going to recognize that value - they're going to feel the benefits immediately - but it also tells you which systems you need to implement in. Making sure that the metadata is there, at the fingertips of researchers or administrators when they try to do those jobs, is the key to success. You alluded to the fact that when metadata works well, it's invisible, and I think a really important component here is actually saying to people: you know how long this used to take you; now we can show there were quantifiable time savings that we've delivered.
And, you know, you're welcome, basically. But the key thing is that folks get that benefit in the systems they need to use, doing something they hate doing, and it's over quicker as a result. So if you can put the pieces together and identify those systems, processes, and priorities, you will deliver real, tangible value to the largest possible population.
So are there any other questions? And please do keep in mind that the chat is open for you to put your questions and comments there.
The notes document is live and will remain live, so it's a great opportunity for us to continue to capture and document the discussion, and to keep adding thoughts. As always, NISO Plus is very keen on capturing and identifying particular pain points or concerns,
and then brainstorming about what solutions NISO might be able to offer. Even if NISO specifically can't be involved in that work, we might be able to connect with other communities and collaborators in our space to follow up. Just building on what Josh said - I can see someone's put in the notes that re-entering data is something people hate doing, and that's a great opportunity to save them from doing it.
You know, this is a theme that's coming out across a lot of the sessions. I was in one of the earlier sessions today, and there was a lot of discussion around the same kind of topic. The piece that I keep getting a little stuck on - where I would want to put the challenge out to this entire group - is: how do we address the reality that, at the moment, the author is often the only person who has all of the metadata upfront?
There's also the reality, as we know - I work in journals, so I'll use journal examples - that articles are often going to be rejected and flow through multiple journals. So that author is the only source of that knowledge, but they're having to re-enter it many, many times, even in a best-case scenario. Does anyone have any thoughts on that challenge - is there more that can be done so that the author isn't the only person who can answer a lot of these questions and provide a lot of the metadata upfront?
I'd like to answer that, Julia, because it shouldn't be the author who needs to be doing it. What needs to happen is that metadata is captured at different stages of the author's research. For example, if the author is using a particular instrument, then how that instrument is calibrated and what sort of instrument it is all needs to be captured right at the point where the author - or the researcher, for that matter - registers to use the instrument or is using it, and it needs to be stored in a certain place. Then if they are working with data sets, those data sets again need to be captured - at least the metadata relating to them, including different versions, depending on what is being performed on them or how they're being used.
All this can be done exactly at the point where they need to do it, or their systems can record it. And then, as and when the research is completed and the paper is being written, it can flow directly into whichever journal it goes to. All that's needed is the particular DOIs or instrument persistent identifiers associated with that journal article, and that's all that is actually required to trace the provenance and see if that research is FAIR, if it can be reproduced - if it's easy to do. At least, that's what I feel.
So it shouldn't be the author's responsibility; rather, infrastructure and service providers - even facility providers - need to implement systems at that stage, primarily because it makes the researcher's life easy. But more importantly for these providers, it also means they are then able to report back to whoever is funding them on the return on investment - because they can track how many researchers have used the facility, what sort of funds it has been used for, what the impact of that research is, and how that facility or those data sets have made a difference.
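The capture-at-source idea described here can be sketched in code. This is a hypothetical illustration: the dataclass, function, and identifiers below are invented for clarity and are not part of any real facility system (the ORCID iD shown is ORCID's documented example identifier).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class InstrumentSession:
    """Metadata captured automatically when a researcher uses an instrument."""
    instrument_pid: str     # persistent identifier for the instrument (placeholder)
    researcher_orcid: str   # the researcher's ORCID iD
    calibration: str        # identifier of the calibration record in effect
    started: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

def register_use(instrument_pid, researcher_orcid, calibration, log):
    """Facility-side hook: record the session metadata once, at source,
    so the author never has to re-enter it at submission time."""
    session = InstrumentSession(instrument_pid, researcher_orcid, calibration)
    log.append(session)
    return session

usage_log = []
register_use("https://example.org/instrument/123",
             "https://orcid.org/0000-0002-1825-0097",
             "calibration-2024-03", usage_log)
```

The point of the sketch is the design choice the speaker argues for: the facility's booking system, not the author, is the component that writes this record.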
So I guess it's a bit of both, but I'm happy to hear any other thoughts or views on this. No, I think that's a really good point. Just to follow up with a quick comment here as well: I worked at ORCID for about five years, and we worked with a lot of manuscript tracking systems and production systems on collecting ORCID iDs and passing them on.
And one of the biggest problems turned out to be exactly this: you can ask the researcher for that information - they are the source of it - but you should ask them once, right? And then make it available for all the subsequent systems downstream to reuse. But even within one university or one publishing house, there are multiple systems that don't speak to each other and that use different terminologies for the same things.
So that level of interoperability in data reuse and data flows is just not there. As a result, people end up giving the same information to the journal twice, or having to correct it within the publisher. There might be outsourced metadata creation or curation; there may be multiple systems involved before the PDF goes out onto the wild web. And then you factor in the different funder reporting systems, the different institutional research management systems, and so on that are partially pulling this data in.
As a result, you can see why there is good money to be made out of plugging some of these gaps in metadata and selling that on to the institutions or the funders who need it to do their jobs or assess their impact. There are huge gaps in the system, and a lot of the better data that is captured and created is not passed on effectively.
Thanks for that, Josh. Do we have any other questions relating to semantic metadata, or anything for Michelle or Julia? There was one question in the Q&A, which asked: do you think that there are too many different metadata standards and schemas in use today across different sectors, which is probably inhibiting optimal metadata quality, sharing, and portability?
Michelle, do you want to answer that? Yeah, sure - I'll put something in the chat if that helps. Great, yeah, that'd be great. There's a lot to that question. I feel like Josh already touched on a bunch of those pain points, right?
To make things portable and easily reusable is hard: different systems are ingesting things, different systems are creating metadata, and a given object will accrue different pieces of information as it moves through the pipeline. Then different pieces of information will fall away when it moves into another submission system or another discovery platform.
I feel like optimal shareability is intimately tied to having standard identifiers, for lack of a better place to put information - but not only standard identifiers, also standard repositories of information. Many people - many companies, publishers of data sets or of articles, books, et cetera - will have a business need to add metadata, subject analysis, and name analysis to make content discoverable inside their own platforms.
I don't think that will change. But what we could provide people submitting data sets, article submissions, book submissions, et cetera is a place to put the high-level metadata that doesn't change across their submissions. So they keep their institution updated, they keep their name information updated, there's a controlled vocabulary for the areas of expertise they specialize in, as well as a space to submit a title and a subtitle for whatever they're working on - be it a data set, a journal article, a book, or whatever.
I don't think there's anything like that. ORCID does some of that after the fact, but we don't have anything before the process starts. If there were a repository that many publishers and other providers could pull from and deposit into, and we used interoperable formats like JSON and XML, then it should be relatively easy to deposit into an existing system.
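A minimal sketch of what a record in such a shared repository might look like, assuming a JSON serialization. The field names and identifiers are invented for illustration (the ORCID shown is ORCID's documented example iD, and the ROR value is a placeholder); no such shared repository or schema currently exists.

```python
import json

# Hypothetical "source of truth" record of the kind described above:
# high-level metadata that does not change between submissions, keyed by
# persistent identifiers. Field names and values are illustrative only.
author_record = {
    "orcid": "https://orcid.org/0000-0002-1825-0097",  # ORCID's example iD
    "name": {
        "preferred": "Jane Q. Researcher",
        "variants": ["J. Q. Researcher"],
    },
    "affiliation": {
        "ror": "https://ror.org/example",  # placeholder, not a real ROR ID
        "name": "Example University",
    },
    # Terms drawn from a controlled vocabulary of areas of expertise
    "expertise": ["metadata", "scholarly communication"],
}

# Because the record is plain JSON, any submission system could pull it in
# and map the fields into its own schema instead of re-asking the author.
serialized = json.dumps(author_record, indent=2)
```

The design point is that the stable fields (name, affiliation, expertise) live in one place, and each submission system maps them in rather than collecting them again.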
But we'd have to build it first. Thanks for that, Michelle. Just on that - with regards to a system where metadata capture is embedded in the research process so researchers can put things in as they go - it'd be interesting to look at a persistent identifier that's not exactly new, called the Research Activity Identifier (RAiD).
What it does is associate all the elements of a research activity with one identifier: the researchers who are working on it, the grants that go into it, the publications that come out of it, as well as the data sets, software, and even instruments being used. So it's kind of like a keychain where everything is connected to it, and that might be useful.
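The keychain idea can be sketched as a simple data shape. Note this is an invented illustration of the concept, not the actual RAiD metadata schema; all identifiers below are placeholders (10.5555 is a common example DOI prefix).

```python
# Invented sketch of the "keychain" behind a Research Activity Identifier
# (RAiD): one identifier grouping the PIDs of everything attached to a
# research activity. Keys and values are illustrative only.
research_activity = {
    "raid": "https://example.org/raid/123",  # placeholder RAiD
    "contributors": ["https://orcid.org/0000-0002-1825-0097"],
    "grants": ["https://doi.org/10.5555/example-grant"],
    "datasets": ["https://doi.org/10.5555/example-dataset"],
    "instruments": ["https://example.org/instrument/42"],
    "outputs": ["https://doi.org/10.5555/example-article"],
}

def linked_pids(activity):
    """Flatten the keychain: every PID reachable from the one identifier."""
    return [pid
            for key, pids in activity.items() if key != "raid"
            for pid in pids]
```

A downstream system (journal, funder, repository) could then resolve the one RAiD and discover all the connected identifiers, rather than asking the author to re-list them.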
Having said that, there is also a session on this happening the next day at around 12:00 Brisbane time - so on the Thursday, in case people are interested in attending. But I'll move on to the next question now. Another question that's been asked: has anyone considered using blockchain technology to help replicate or standardize metadata? Is anyone in the audience aware of blockchain technology being used to replicate or standardize metadata?
It was never part of our workflows in any of the metadata-adjacent roles that I have had. Heather, have you? Heather might not be here. Julia, do you have any thoughts on that? Is it something that publishers have been considering, do you think?
Yeah, I mean, I think blockchain is one of those things where it feels like everybody has gone, 'Blockchain! How could we use it?' I don't think there's anything serious that I'm aware of in this space. But actually, going back a step - and obviously there's huge variability depending on the specifics of how a blockchain is administered - I think we have to be really conscious of the potential environmental impacts of blockchain.
There are massive energy demands for several of the major services, and all of us are thinking about how we improve our sustainability for the future. So I would throw that out there as a caveat when thinking about how we can make use of things like blockchain: what is the cost of doing so as well?
For sure, I agree with you on that, Julia. Now, Ted's put a comment in the notes, which says that when you're looking at augmenting human work first as a pathway to automation, you can have the lead researcher provide a list of IDs and see how much of the required metadata can be gathered from those IDs, and then build that amount over time.
So that's an interesting way to do it as well, I guess. Do we have more comments or questions from our audience? Just to chime in on Ted's observation there: I like that approach, and it relates very closely to Michelle's comment about where we can get this information further back in the system. It's things like looking at RAiDs, and looking at how ORCID encourages people to start to populate their ORCID record and make connections in it, if they can see it's going to be reused.
Over time, the incentive for them to keep that up to date grows, and they will start saying to other researchers: do this at the very beginning, make sure you keep this up to date from day one. It just becomes an accepted part of the process of becoming a professional researcher to build up this kind of online portfolio, which can then act as a source of truth for other systems. And that might address some of the challenges we have with interoperability, because people don't necessarily need to have all of their systems interacting with one another.
They can connect them all to the central repository and pull things in from there. It also gives us a pathway for corrections, tracking modifications over time, and things like that. Yeah, and just building on that, Josh: I think the other piece that came into my mind in one of the earlier sessions is thinking about how we round out the circle, right?
Because we keep coming back to the importance of the author keeping their records up to date for things like ORCID. But equally, if you're tracking all of the publications from that person - well, basically every journal requires you to list your institutional affiliation as part of the publication process, and we know when an article is published. So what are the mechanisms we could use to feed in some of the data that we have elsewhere in the system to try and automate this?
'We think your institutional affiliation has been updated - is that correct?' So I do think we could do with challenging ourselves a bit more on how we complete the circle, rather than just coming back to needing authors to keep things up to date themselves in the first place. Yeah, you know, that's interesting.
In the study that Letty and I have been working on, we've been playing around with our own Google Scholar profiles just to see what they look like, and - for better or for worse - they do have the publishing industry bested in this one way: they have a fairly reasonable idea of what people have published.
And they ask you: oh, do you want to add this to your profile? Because it looks like it's part of this. And Academia.edu, if people are still in that space, also has a recommender mechanism that is functional and seems to be pretty accurate. So the technology is there, and it's usable with the outputs that people in and adjacent to our industry are already creating.
So it would be worth it, I think, to explore that, if we can do it in an ethical way that doesn't upset the researchers in our space. Thanks for that, Michelle. Ted also mentions that many times the communities built up around repositories provide opportunities for reusing identifiers across those repositories,
and that internal reuse can also help build connections. So yeah, that's another comment. Does anybody else have any more questions or comments? Oh - Michelle's asked if you have a specific example, Ted.
Can you guys hear me? Yes. One of the examples I did a bunch of work with was a geophysical repository called UNAVCO, which is a repository of geodetic GPS data from around the world. That repository has existed for many decades, and we're now starting to find identifiers for people who have contributed to it over that time.
So a single person may have 25 or 15 data sets in this repository, and you find their ORCID once and can then spread it - something I call 'spreading' in curation. Curation is the process of adding new information to metadata over time. So yeah, that's the organization.
I have a series of blog posts on Metadata Game Changers about that process. We're doing the same thing now with field stations around the world and other kinds of facilities, because facilities and field stations also have groups of people that use them, and have used them over some long time period. Many of those field stations also have bibliographies of papers, and those can be sources of identifiers for authors who have written and provided ORCIDs in those papers.
You can bring those back into the data sets that are in the field station or the facility and start making those connections. These are real interpersonal connections that exist in these communities, and now we're trying to use identifiers to re-instantiate those connections in the digital world. It's working pretty well.
So from an article and book perspective, that would take having access to the equivalent data set. If the 'data set' for those outputs in, say, the social sciences and humanities is the article or the book, or both, then we would need a correlative repository of those outputs to begin making the connections, it sounds like. Yeah, there's a lot there - you can use things like Crossref,
which Jean mentioned earlier, or OpenAlex, or whatever. Another interesting thing we're doing is full-text searching for the names of these field stations. Rather than searching for DOIs - which are only slowly being populated for data sets - you can search for the names of field stations or facilities in some full-text search engines and find papers that reference those facilities which the managers of the facilities may not know about.
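The search Ted describes can be sketched against OpenAlex, whose works endpoint accepts a `search` parameter; the facility name below is just an example, and how deeply the search reaches into full text depends on what OpenAlex has indexed for each work.

```python
from urllib.parse import urlencode

# Build a query against the OpenAlex works endpoint for mentions of a
# facility by name, instead of hunting for data-set DOIs directly.
def facility_search_url(facility_name, per_page=25):
    params = urlencode({"search": f'"{facility_name}"', "per-page": per_page})
    return f"https://api.openalex.org/works?{params}"

url = facility_search_url("Toolik Field Station")
# Fetching this URL returns JSON whose results can be mined for author
# ORCIDs -- candidates for the identifier "spreading" described above.
```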
So using the global research infrastructure as a source of information that we can use to augment metadata is, I think, an interesting and potentially useful thing. Thanks for that, Ted.
And do we have any more questions or other comments? I'm just going to check the Google Doc to see if there are any other comments.
So there was - I'm not sure if we've spoken about this, but somebody asked whether there's a working group at NISO or elsewhere on how to define chapters consistently across platforms, as that is one of the major holdups on organizations creating DOIs for chapters. Has that been covered? I can say something about that.
There is the e-book metadata group, but I don't think they've taken up chapters. That's something Letty and I are looking to have a project or two take up. It may or may not be NISO - it could be a NISO collaboration with, say, the Books Interest Group - and there are other options as well, I think.
I think a cross-collaboration would probably be ideal, because it affects many different stakeholders in this industry. Thanks for that, Michelle. There's been another question in the Q&A window: have you seen how, or if, automated systems handle name changes and affiliation changes to link papers from the same person? Or is it the responsibility of the person to make the links?
I've just put that question in the chat window. Yeah, I'm happy to touch on this and for anybody else to build on it. I think name changes in particular are one of those issues we need to be really careful around. It goes to policies around name changes and ensuring that we're not accidentally revealing information about somebody -
for example, if their name has changed as a result of a gender transition, we want to be able to make some of those changes silently and not draw attention to them, in order to preserve that individual's privacy. But obviously there are many reasons why name changes can occur, and I think that speaks to the complexity of trying to address this automatically: you're trying to balance that tension between privacy
and, at the same time, the benefits for everybody of connecting more information. But Josh and Michelle, I'm interested if you've got anything to add from that perspective. ISNIs can be really helpful for this - the name identifier - because those can take a ton of data: your primary name, your secondary name, alternate names,
and also how your name is represented in other languages. In general, librarians love ISNIs, but they are very difficult to implement, certainly in cataloging. I think with XML formats and JSON tags, they would be easier to integrate into an automated system, if we could pull in
different iterations of the same person to make our picture of somebody more robust and complete. And then, Josh, a question for you: Kimberly has asked if you could tell us more about how you've collected some of the data points for the cost analysis, and whether it is shared as a data set or otherwise available.
I'd seen there was a message from Richard in the chat, and I just finished typing a response with the DOI, so I'll verbalize that and then hit send so you can all see it. For the time taken to enter metadata, we relied on some preexisting work, which is cited in our report. For the cost associated with that time, we calculated average salaries for a variety of roles, such as senior researcher, junior researcher, research manager, and so on.
We used sources such as the HESA and HERDC data collection systems in the UK and Australia, which pull in information about the number of people employed in institutions, the number of universities, and so on. We looked at Dimensions data from Digital Science to pull in information about, say, the number of grants that were issued, and complemented that with data from funders themselves and a variety of other sources.
We've cited pretty much all of these sources really clearly in the report. So what I'll do is click send here in the chat so you can see the sources we used. Richard, there's a link to the report - the DOI at the end of that message - which gives a bit more detail, and you can see all the citations there. I hope that's helpful.
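As a back-of-envelope illustration of the kind of calculation described here (time spent on metadata entry multiplied by an hourly rate derived from salary), the function below uses made-up placeholder numbers, not figures from the report.

```python
def rekeying_cost(minutes_per_record, records_per_year, annual_salary,
                  working_hours_per_year=1680):
    """Annual cost of manual metadata re-entry for one person.
    All inputs are illustrative; the real report derives its figures
    from cited salary and workload data."""
    hourly_rate = annual_salary / working_hours_per_year
    hours_spent = minutes_per_record * records_per_year / 60
    return hours_spent * hourly_rate

# e.g. 15 minutes per record, 40 records a year, on a 70,000 salary:
annual_cost = rekeying_cost(15, 40, 70_000)  # 10 hours at ~41.67/hour
```

Scaling a figure like this across every researcher in an institution is what turns "minutes of re-keying" into the institution-level costs the report quantifies.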
Fantastic, thanks for that, Josh. There's been another question, again from the Q&A section, thinking about Dr. Weinberger's opening keynote today on artificial intelligence, machine learning, and related metadata, and Julia's comment on how important metadata is for publishers to maintain.
Could you talk about - do you agree on - how important it is to have our foundational metadata and IDs in order to prepare for and enable that possibility? Or do you think it's inevitable? Just on this topic, there's a screenshot being shared around online at the moment from the new GPT-augmented Microsoft Bing search, which gives AI-generated answers to searches.
The question was 'What's the human population of Mars?', and the answer was 10 billion. That's what the AI offered, based on sci-fi. So I think if we're going to be allowing machines to generate metadata, we are going to have to deal with AI hallucination. There are services that use a lot of machine learning to infer relationships between, say, grants and publications, and you can go to different services and get radically different numbers.
This is something that we've really struggled with in doing various pieces of research built on those data sets. To my mind, it really emphasizes the importance of getting that foundational metadata right at the source and feeding that into machine learning, so that you do not have garbage going in - because otherwise we are going to end up with a lot of junk information.
You know, Microsoft's own FAQ for its Bing search says Bing may misrepresent the information it shows you. So I think the caveats around machine learning and AI-generated metadata are colossal at the moment, and we as a community need to be really aware that we need to get our own houses in order and fix this, not rely on speculative technology to fix it any time soon.
Because otherwise we risk polluting the future scholarly record with a lot of AI hallucinations, which will have entered the record while we wait to get it right. Yeah, I think there's a lot of truth to that. I do also think it goes back to the earlier question about whether we have too many different standards, and in some cases, yes, we do.
That is one of the challenges in such a complex environment with so many different stakeholders: how do we avoid that creep in the number of foundational pieces of metadata and the different standards used to capture them? Because, like you're saying, Josh, it's so important that we have good-quality information going in -
otherwise we could build sandcastles on top of stuff that just won't hold up over time, and I think that's a danger we've got to be really conscious of. Thanks for that, Julia and Josh, and thank you also, Michelle. To all our attendees: we have reached time, but I'm happy for this to continue
if there are more questions. Otherwise, thank you all once again for attending this session, and especially to our speakers for presenting and answering the questions afterwards. I hope to see you all at some of the other sessions at the conference. Take care now.