Name:
Publishing with meaning: which vocabularies are you using to enhance discovery and description?
Description:
Publishing with meaning: which vocabularies are you using to enhance discovery and description?
Thumbnail URL:
https://cadmoremediastorage.blob.core.windows.net/1d80f3f3-9620-474a-a344-9e5c6ecdb7b8/videoscrubberimages/Scrubber_1.jpg?sv=2019-02-02&sr=c&sig=K8qOLhqE5YJ1eJhCcYO7TbbjBhkZJcWTSGKGqp6z3%2FA%3D&st=2025-01-22T09%3A57%3A09Z&se=2025-01-22T14%3A02%3A09Z&sp=r
Duration:
T00H35M22S
Embed URL:
https://stream.cadmore.media/player/1d80f3f3-9620-474a-a344-9e5c6ecdb7b8
Content URL:
https://cadmoreoriginalmedia.blob.core.windows.net/1d80f3f3-9620-474a-a344-9e5c6ecdb7b8/53 - Publishing with meaning - which vocabularies are you us.mov?sv=2019-02-02&sr=c&sig=oDfHZ8Ks%2FGyWuAlDkdDWeD9Yvsx7GR%2FsgICaOipJtOE%3D&st=2025-01-22T09%3A57%3A10Z&se=2025-01-22T12%3A02%3A10Z&sp=r
Upload Date:
2021-08-20T00:00:00.0000000
Transcript:
Language: EN.
Segment:0 .
[MUSIC PLAYING]
SPEAKER 1: Hi, everyone. Thanks for joining us here at NISO PLUS 2021 at the session on publishing with meaning. We have three speakers for this session, Lesley Wyborn, Simon Cox, and Simon Hodson. And we hope that you stick around afterwards and join us for a discussion on the topic. Thank you and enjoy the presentation.
LESLEY WYBORN: Welcome to this session, which is about publishing with meaning, which vocabularies are you using to enhance discovery and description? This session will be run by myself, Lesley Wyborn, Simon Cox, and Simon Hodson, and covers some topics that the three of us have been working on pretty solidly in the last few months. We are focusing on emerging issues with vocabularies because they're actually proliferating like crazy.
LESLEY WYBORN: And so we'll have three presentations, proliferation of vocabularies-- how does the user know which one to use? 10 simple rules for making a vocabulary FAIR by Simon Cox, while Simon Hodson will finish off on international projects that are tackling the vocabulary issue. So first of all, let's start off with the definition of a vocabulary.
LESLEY WYBORN: And for this session, the definition is taken to mean any semantic asset containing terms and usually information about those terms, including value sets (i.e. a list of terms), concept sets, topics, vocabularies, glossaries, thesauri, concept maps, taxonomies, ontologies, and knowledge graphs. And these are actually ordered in increasing level of complexity and expressivity.
LESLEY WYBORN: Another way of showing this is this figure, which I have modified from a diagram of Leo Obrst from 2007 to 2008 in which he starts out with a list or a taxonomy. And as you move up towards the top right, you're going through thesauri, conceptual models, formal ontologies, and axiologies. The important thing is as you move from the weak semantics in the bottom left to the strong semantics in the top right, you increase your reasoning capability.
LESLEY WYBORN: But to set up these higher order vocabularies and ontologies, it actually takes a lot of work, and not everyone is able to do that at this stage. So why is it important to know whether you can trust a vocabulary? Publishing is the act of making data and information resources accessible to others. But unless we know what we mean by terms we are using, communication can become fuzzy and not understood well, particularly in transdisciplinary and/or multilingual environments.
LESLEY WYBORN: For effective and scholarly communication, adoption of shared vocabulary, terminologies, and semantic assets is critical to the discovery and understanding of published resources, helping to reduce ambiguity whilst at the same time increasing interoperability. How widely can I share my data and have it understood? Well, actually many disciplines and organizations have local lists of vocabularies that serve multiple functions, including enhancing discovery, annotation, and description.
LESLEY WYBORN: And hence the size of a group that can interact with any data set is only as large as the size of the group that understands the definitions, concepts, and languages being used to describe the data set. So what's the issue? With increasing globalization of data and information resources, particularly given events such as COVID and increasing acceptance of climate change, enabling a common understanding of any concept used to describe or define a thing in our world is becoming critical.
LESLEY WYBORN: And there's a growing need to develop vocabularies that can be used across borders and communities and support harmonization of information both within and across disciplines and languages, and then be able to communicate to users how reliable, usable, and persistent the semantic assets are that they are wanting to use; if the content is governed and endorsed by an authority, if the vocabulary is FAIR-- findable, accessible, interoperable, and reusable-- and above all, what the quality of the vocabulary is.
LESLEY WYBORN: So just coming back to this diagram again. As I said, really as you move from the bottom left to the top right, it's a growing maturity of the semantic asset that is being made available. But the important point is that in many cases all you actually have is a list or a simple taxonomy. And so you can't just say, "Well, for a resource to be usable, I have to have it as a formal ontology." In many cases, you have to be able to use the lower terms.
LESLEY WYBORN: And so that's one thing we need to communicate to our users: how mature a vocabulary is. Another important issue is we need to be able to tell people how sustainable it is. Is it governed? Is it authorized or endorsed by a recognized authority? What are the conditions of use? So I now want to borrow from some work that was done by Ramapriyan et al.
LESLEY WYBORN: on ensuring and improving information quality for Earth science data and products. And in this they argue that there are four dimensions of quality in science data. There is the science or the scientific content that we're trying to communicate and the product that we put that content into. Once it is produced, we hand it over to usually the data stewards or curators who maintain, preserve, and disseminate.
LESLEY WYBORN: And then in the top left hand corner, you have a service, that is, something that is put onto that data or vocabulary so it can be used. Taking these concepts into vocabularies, I've drawn it in this way and I've said, well, I've got a vocabulary: who's defined the terms? Has someone endorsed this vocabulary? If they have, are they authoritative? And I'll ask a question, which Simon Hodson will come back to: is there a role for science unions and their equivalents to start coming out and really helping us sort out which are the better and more scientific vocabularies?
LESLEY WYBORN: Moving now down to the bottom right. What is the vocabulary structure? Does it have terms? What is its expressiveness? Does it follow standards? And then moving over to the people that maintain the vocabularies over time, who's been managing the content and do they have persistent funding?
LESLEY WYBORN: And then when you have services put onto those vocabularies so you can access, are those services based on standards? Do we have a process for automatic updates with the content provider? How reliable are the services? And does it have persistent funding? So again, as I said earlier in the maturity diagram, you're not going to get everything in every vocabulary in the shorter term.
LESLEY WYBORN: And so is there a quick approach [INAUDIBLE] that can help us see at a glance, if we have two vocabularies in front of us, which one we should choose? What's something we can give to people fairly simply and fairly quickly? And so I looked at the work of Tim Berners-Lee, who developed the five-star open data scheme on the web, where you move from one star, which is a PDF (again, like a list that is accessible), up to fully blown linked data.
LESLEY WYBORN: And the more mature you are up in this top right, the more you can do with that vocabulary, the more you can reason with it and do something. And again, in this catalog I've got from Data.gov.au, you can see how for each data set you've got a four-star or five-star rating against it. And so again, we could do this with vocabularies so that when we find out there are a few available, this would be a quick way to let a user know which vocabulary would be the better choice.
LESLEY WYBORN: So here in this next slide, I offer you a five-star rating for vocabularies. It's very preliminary, but if someone would like to help me formalize this and make it more widely available, you're welcome. So one star is just somebody's simple list, two stars is machine readable, three stars is non-proprietary with definitions, building up to the five star, which is concept based, RDF, linked, endorsed, and multilingual.
LESLEY WYBORN: So I hope I've now covered some of the things that we're starting to realize are happening with vocabularies, and I for one accept that we need something more substantial than that simple five-star rating in my previous slide. And this is probably the fourth session that Simon Cox and Simon Hodson and others have been running since July 2020. And they were all very similar.
LESLEY WYBORN: I've got all the references and links to that. And at each session we ran, we all came to the same conclusion that we are aware of the problem, but are we really tackling it at the moment and what can we do. So the examples of what is being done, I'll now hand over to the next two presenters. Simon Cox will talk about 10 simple things to make vocabularies FAIR, and he will be followed by Simon Hodson, who will be raising awareness of international projects that are tackling the issues of sustainability, governance, trust, authority, and endorsement of semantic resources as an aid to enabling transdisciplinary interoperability of scientific data.
LESLEY WYBORN: Thank you.
SIMON COX: Thanks, Lesley. So now I'm going to focus in a bit on some recent work that we have done to define a set of guidelines around FAIR vocabulary. The first one being 10 simple rules to make a vocabulary FAIR. But before we describe the guidelines, we need to be clear what are the use cases, why should vocabularies be FAIR?
SIMON COX: Well, there are a series of reasons. First off, as a data user we want to be able to verify if terms used in different data sets mean the same thing. Next, as a data provider, we want to be able to accurately and efficiently mark up or annotate data, both column headings and field values using values from controlled vocabularies. We want to use vocabularies that we know are governed and trusted by the relevant community, and we want the terms to be described following standards, and for the definitions to be machine processable.
SIMON COX: Just to tease this out a little, here is an extract from a typical observational data set from an ecological survey exercise. And here we highlight how many different controlled vocabularies will be involved. A row in the table describes a single observation. Every field in the row and some column headings as well except for the actual number refers to an item from a controlled vocabulary.
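[Editorial sketch] The point that every field in an observation row, apart from the measured number, should come from a controlled vocabulary can be illustrated with a small check. This is only a minimal sketch under stated assumptions: the vocabulary names, the example.org term URIs, and the `unknown_terms` helper are all hypothetical and are not taken from the data set shown in the talk.

```python
# Hypothetical controlled vocabularies: each maps a term URI to its label.
# The URIs and terms are invented for illustration only.
VOCABULARIES = {
    "species": {
        "https://example.org/vocab/species/eucalyptus-regnans": "Eucalyptus regnans",
    },
    "observed_property": {
        "https://example.org/vocab/property/stem-diameter": "stem diameter",
    },
    "unit": {
        "https://example.org/vocab/unit/centimetre": "centimetre",
    },
}


def unknown_terms(row, vocabularies):
    """Return the fields whose value is not a term in the matching vocabulary.

    Fields without a vocabulary (for example the measured number) are skipped.
    """
    problems = {}
    for field, value in row.items():
        vocabulary = vocabularies.get(field)
        if vocabulary is not None and value not in vocabulary:
            problems[field] = value
    return problems


observation = {
    "species": "https://example.org/vocab/species/eucalyptus-regnans",
    "observed_property": "https://example.org/vocab/property/stem-diameter",
    "value": 42.5,
    "unit": "cm",  # free text rather than a term URI, so it is flagged
}

print(unknown_terms(observation, VOCABULARIES))
```

A check of this kind only works when the vocabularies themselves are available in machine readable form, which is the motivation for the guidelines that follow.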
SIMON COX: Looking at how publishing and maintaining a vocabulary might fit into the FAIR framework, to be findable means the vocabulary must have good metadata and be registered in a community service or portal. To be accessible the vocabulary must be on the web downloadable as a whole or one term at a time published as linked data.
SIMON COX: To be interoperable the vocabulary must be encoded using a standard model and syntax. These days that means RDF or SKOS alongside an HTML web page for each term and for the vocabulary as a whole. To be reusable the vocabulary must have an open license and be trustworthy, with good metadata and provenance information and an orderly maintenance program.
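[Editorial sketch] To give a feel for what encoding a term in SKOS involves, the snippet below writes one concept as RDF in Turtle syntax using only the Python standard library. It is a sketch under assumptions: the `skos_concept_turtle` helper, the example.org URIs, and the placeholder label and definition are hypothetical, and the labels are assumed to be single-line text. Only the SKOS namespace and the property names (`skos:prefLabel`, `skos:definition`, `skos:inScheme`, `skos:broader`) come from the SKOS standard itself.

```python
def skos_concept_turtle(uri, scheme_uri, label, definition, broader=None, lang="en"):
    """Serialize one vocabulary term as a SKOS concept in Turtle syntax."""

    def literal(text):
        # Escape backslashes and double quotes so the Turtle string stays valid.
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"@{lang}'

    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        "",
        f"<{uri}> a skos:Concept ;",
        f"    skos:inScheme <{scheme_uri}> ;",
        f"    skos:prefLabel {literal(label)} ;",
    ]
    if broader:
        # Optional link to the parent term in a hierarchy.
        lines.append(f"    skos:broader <{broader}> ;")
    lines.append(f"    skos:definition {literal(definition)} .")
    return "\n".join(lines)


turtle = skos_concept_turtle(
    uri="https://example.org/def/demo/example-term",
    scheme_uri="https://example.org/def/demo",
    label="Example term",
    definition='A placeholder definition for an "example" term.',
)
print(turtle)
```

In practice an RDF library would normally be used rather than string formatting; the hand-written version is shown only to keep the example self-contained.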
SIMON COX: So what do we find when we look at some standard vocabularies? The Australian standard soil and land survey vocabularies are printed in a book. Each vocabulary is a little table on a page with codes shown typographically in red. The geological time scale is one of the most important vocabularies used in historical geology, but the main way in which it is presented is as a beautiful color diagram in a PDF.
SIMON COX: There's a lot of structure and information on this chart with ordering, relationships, nesting, hierarchies, and chronometric calibrations and uncertainties all shown. And the color scheme matches the expectations of conventional geologists. But this is purely graphical, and the semantics or meaning is not machine readable at all.
SIMON COX: The SI units are described in the so-called SI brochure, which is a set of tables essentially on web pages hosted by the BIPM. Unfortunately, nothing about this is machine actionable except for rendering it in a browser. The last two of these, that is, the geological time scale and the units of measure or the SI, are some of the most central vocabularies used in science, but the custodians do not provide them as FAIR vocabularies.
SIMON COX: So what does a FAIR vocabulary look like, and how is it that we might take an unFAIR controlled vocabulary that's a legacy vocabulary like the ones that we've been looking at and make it FAIR? So here's a few examples of existing FAIR vocabularies. This one is a short list of organizational functions that was required for a single project.
SIMON COX: It's published and hosted in a vocabulary registry by CSIRO. This one is a science vocabulary related to observational sampling, which is hosted by the Australian Research Data Commons in their registry called Research Vocabularies Australia.
SIMON COX: And here is a general and very large resource of agricultural terms, which is hosted in a disciplinary repository. Here's a set of anatomy terms and relationships hosted in BioPortal. And this shows an extract of a part of the Environment Ontology, which is a very rigorous, highly axiomatized vocabulary encoded in OWL as part of the OBO Foundry suite of biomedical ontologies.
SIMON COX: So these examples show that it is possible to prepare, publish, and maintain a controlled vocabulary and make it available as linked data in a FAIR way. So on the basis of these examples, and having run through the process a few times, we're confident that it is possible to make an existing vocabulary FAIR.
SIMON COX: And we've attempted to write out the guidelines as a series of 10 simple rules. This format, 10 simple rules, is a standard series in the Public Library of Science. And here are the 10 simple rules. And it's worth pointing out that the actual encoding step, where you convert the non-machine-readable vocabulary to a machine-readable form, is just one of the 10 rules, rule number six.
SIMON COX: The rest of the rules focus on governance, maintenance, publishing, and licensing. Making a vocabulary FAIR, we find, is a lot more than just a technical exercise. Most of the effort goes into organizational processes, both in terms of vocabulary content and publishing it and preparing it for the linked data platform.
SIMON COX: Returning to a couple of the examples mentioned earlier, many of the soil vocabularies from the Yellow Book have been converted to FAIR. This snapshot shows one of the definitions together with a key to how they meet three of the rules. The vocabulary is registered in CSIRO's registry, it's encoded in SKOS, and it's downloadable in various serializations (TTL, XML, JSON, CSV) as well as being shown in this reasonably user-friendly web page.
SIMON COX: The URIs, which are the official identifiers shown under the label "identifier", are arranged so that when somebody dereferences or tries to resolve one of those URIs, it redirects to this page. And covering the other parts, we're liaising with the maintenance custodian and have a whole process for managing the machine readable content and uploading it to the registry when there are changes that mean the vocabulary needs to be republished.
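[Editorial sketch] The redirect behaviour described here, combined with the choice of serializations mentioned just before, can be sketched as a small function that maps a request for a term identifier to a registry page. Everything specific in it is an assumption: the registry address, the query parameters, the `redirect_for` helper, and the use of HTTP status 303 (a common convention for linked data identifiers) are not details given in the talk, and quality values in the Accept header are ignored for simplicity.

```python
from urllib.parse import quote

# Hypothetical registry landing address; not the real CSIRO registry URL.
REGISTRY = "https://registry.example.org/doc"

# Media types a client may ask for, mapped to the serialization served.
FORMATS = {
    "text/html": "html",
    "text/turtle": "ttl",
    "application/rdf+xml": "xml",
    "application/json": "json",
    "text/csv": "csv",
}


def redirect_for(term_uri, accept="text/html"):
    """Return (status, location) for a request to a term identifier."""
    # Take the first media type in the Accept header that the registry knows.
    for part in accept.split(","):
        media_type = part.split(";")[0].strip().lower()
        if media_type in FORMATS:
            extension = FORMATS[media_type]
            break
    else:
        # Nothing recognised: fall back to the human-readable page.
        extension = "html"
    location = f"{REGISTRY}?uri={quote(term_uri, safe='')}&format={extension}"
    return 303, location


print(redirect_for("https://example.org/def/soil/clay", "text/turtle"))
```

The design point is that the identifier stays stable while the page or file it resolves to can move, for example when the vocabulary is republished in the registry.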
SIMON COX: The geological time scale is available in a fully structured machine readable form in both SKOS and according to a specialized ontology, and it's updated routinely in collaboration with the content custodian, the International Commission on Stratigraphy. You can see how the structure of the Cambrian is shown in the SKOS representation here.
SIMON COX: So looking at the whole area of FAIR controlled vocabularies, we've been developing a workflow covering various aspects of the vocabulary management and use process, including finding vocabulary, selecting vocabularies, linking between data sets and vocabularies, remediation of vocabularies, creation of new vocabulary, and the conversion of vocabularies, which is what's covered by this guideline that I've described here and is shown in the circled area, the bottom right of the overall workflow.
SIMON COX: This guideline is just the first of a series of guidelines planned. So to summarize, we need FAIR vocabularies to support FAIR data. Many existing vocabularies are not FAIR, and as part of the suite of guidelines relating to FAIR vocabularies, we have prepared a paper describing 10 simple rules to make a vocabulary FAIR.
SIMON COX: Other guidelines covering these other areas, such as maintaining a FAIR vocabulary, aligning vocabularies, publishing vocabularies, and semantic data integration, are in preparation, following up on this initial guideline. Thank you.
SIMON HODSON: Thank you very much, Lesley and Simon. My name is Simon Hodson, the executive director of CODATA, and I'm going to talk about some work that CODATA is doing with and for FAIR vocabularies, and also in the context of a decadal program entitled Making Data Work for Cross-Domain Grand Challenges that we're preparing for our parent organization, the International Science Council. CODATA is the Committee on Data of the International Science Council, but of course, we're more than a committee.
SIMON HODSON: We are a global membership organization, and our mission is to connect data and people to advance science and improve our world. The International Science Council was itself formed in 2018 by the merger of two predecessor organizations which were very long standing. On the one hand ICSU, the International Council for Scientific Unions, which had a remit for the natural and physical sciences; and on the other hand, the ISSC, the International Social Science Council, which, of course, served the social sciences.
SIMON HODSON: And the merger was significant because it created this unified representative body for all branches of science, which has an explicit mission also for interdisciplinary and cross-domain research. Now that's important because addressing global grand challenges, those societal, environmental, and planetary challenges which we face in the 21st century, requires precisely cross-domain collaboration and the ability to combine and analyze data from many different sources.
SIMON HODSON: Over the past two decades the predecessor organizations of the International Science Council together sponsored global coordinating science programs on precisely these grand challenge research areas, for example, Future Earth on climate change adaptation and mitigation, and programs on disaster risk reduction and urban health and well-being. And again, what these programs have in common is the wish to coordinate internationally, but also to bring to bear an interdisciplinary or cross-domain methodology on these grand challenge research areas.
SIMON HODSON: What they recognize, of course, is the fundamental need to be able to access and combine data from many different sources. So the ISC action plan recognized the need for a global coordinating activity to assist data interoperability for cross-domain research. The preparation of that program has been tasked to CODATA as the Committee on Data of the International Science Council.
SIMON HODSON: That program is in preparation and will be launched later this year. But clearly, a key component must be work on the vocabularies for particular domains and vocabularies which allow cross-domain research and the combination of data. And so that's why we're interested as an organization in working with particular domain vocabularies, in making those vocabularies FAIR, and in exploring the opportunities for vocabularies which help these areas of cross-domain or interdisciplinary research.
SIMON HODSON: I'd like to introduce the next part of this presentation with this cartoon from XKCD on software dependency which I think can be equally well applied to vocabularies. Significant parts of the global semantic infrastructure are maintained by voluntary low-funded effort, and that is not necessarily a bad thing, but it brings risks and potentially some vulnerabilities.
SIMON HODSON: An example of that potential vulnerability can be found in the CASRAI research data management glossary. CASRAI, a Canadian organization, developed some great products and had a very strong vision for the role of semantic resources in research and in research administration. Unfortunately, CASRAI folded in 2020. It's worth mentioning, as this is the NISO Plus conference, that the CRediT Contributor Roles Taxonomy will be maintained by NISO.
SIMON HODSON: The CASRAI research data management glossary will continue to be maintained as a community resource by CODATA. We'll be applying the 10 simple rules as laid out by Simon in the second presentation, and we will restart community review cycles from roughly April 2021. Now everything has worked out OK, we hope, for that resource, but the fact that the organization originally maintaining it went under means that some of these vocabularies and resources and glossaries are vulnerable, and it's something which we as a community need to address.
SIMON HODSON: A few years ago CODATA conducted with the OECD a study and report on business models for sustainable research data repositories. We surveyed the income streams and business models of roughly 50 repositories, considering also their mission, their governance, and how they presented their value proposition. We created a typology of income streams and conducted an economic analysis of those business models, and these then resulted in recommendations for funders, for scientific communities, and for the repositories themselves.
SIMON HODSON: Our proposal now is to conduct a similar study for vocabularies and perhaps ontologies and other semantic artifacts. At the very least, we need a landscape study. There's a need for an empirical survey in how vocabularies are maintained, how they're governed and their sustainability because you cannot manage what you do not measure. We want to find out what the governance models are, what the approaches are to technical implementation, of course, and also what the approaches are to publication, maintenance, funding, and sustainability.
SIMON HODSON: And we think at least that landscape survey will be a valuable resource and addition to our knowledge, and we hope that it can be turned into a similar report and set of recommendations for vocabularies and semantic resources. CODATA is also involved in a number of activities to encourage good practice around FAIR vocabularies in domain and cross-domain research areas. For example, with IUSSP, the International Union for the Scientific Study of Population, CODATA has recently founded a joint working group on vocabularies in population research.
SIMON HODSON: And that group will work with owners of a set of key vocabularies in population science to apply the good practices laid out in the 10 simple rules for making a vocabulary FAIR. Also, CODATA has been involved in a community activity to develop a terminology for FAIR skills which defines the skills, competencies, and learning outcomes required to make data FAIR.
SIMON HODSON: This emanated from a series of workshops which CODATA hosted, and is now being developed into a terminology as part of a European Open Science Cloud co-creation project and also with support from the FAIRsFAIR project. CODATA, FAIRsharing, and other partners will maintain this as a community resource following good practice, and will also initiate review cycles in parallel with the CASRAI research data management glossary.
SIMON HODSON: Another example is the ISC-UNDRR Hazard Information Profiles. So the International Science Council and the UN Office for Disaster Risk Reduction convened a working group roughly 18 months ago with the goal of assembling a common set of hazard definitions for monitoring and reviewing Sendai implementation. So this is the list, the revised list of hazards against which member states that are signatories to the Sendai Framework for Disaster Risk Reduction must report their disaster loss.
SIMON HODSON: The effort, as I said, was convened by ISC and UNDRR and was chaired by Virginia Murray from Public Health England, who's also a member of the CODATA executive committee. The output of that activity is over 300 hazard information profiles, each of which contains the name of the hazard, the definition, and a series of annotations and additional information, including metrics and the numeric limits which are attached to that hazard.
SIMON HODSON: CODATA has made recommendations on how to make these HIPs FAIR, and on good practice again for maintenance and governance. CODATA will also encode the HIPs to make them FAIR such that they can be surfaced on the UNDRR PreventionWeb platform. Finally, I'd like to present a vision for an on-ramp certification for research vocabularies which would encourage good practice around the maintenance and use of the vocabularies and would help, I think, to build trust around their sustainability and engagement with research communities.
SIMON HODSON: Think of this, if you like, as something similar to CoreTrustSeal, which is for repositories, and this would be analogous as a basic level of certification but for vocabularies. The development of this would need to build on the landscape survey. It would build on the recommendations in "Ten Simple Things," and elsewhere, and also the case studies that I've mentioned with IUSSP, for example, and with the hazard information profiles.
SIMON HODSON: It would cover issues such as technical presentation and FAIR, how to make the vocabulary FAIR. It would also address things such as governance, maintenance, and sustainability. To discuss the landscape survey, the case studies that I've mentioned, and this proposal for an on-ramp certification, Lesley, Simon, and I have made a birds of a feather proposal for the next RDA plenary in April.
SIMON HODSON: So we'd like to invite participants in this session and others to come along to that birds of a feather session at RDA, if it's approved, to discuss this further. And in any case, we intend to take this work forward: the landscape survey, the case studies which are already underway, and this proposal for an on-ramp certification. We plan to take that forward with partners as part of the Decadal Programme with which I introduced this section of the joint presentation.
SIMON HODSON: I would like to close with a plug for International Data Week 2021, which combines the Research Data Alliance plenary meeting and the SciDataCon conference. The deadline for session proposals for the SciDataCon component is the 31st of March. So please think about submitting a session to that. There'll be lots of sessions about these issues of interoperability within particular domains and across domains, and we'll hear a lot more about the pilot activities of the Decadal Program, Making Data Work for Cross-Domain Grand Challenges.
SIMON HODSON: Thank you very much for your attention. [MUSIC PLAYING]