Name:
Document Semantic Support-NISO Plus
Description:
Document Semantic Support-NISO Plus
Thumbnail URL:
https://cadmoremediastorage.blob.core.windows.net/53aaf09a-f0de-4de0-96e9-a86c6993e038/thumbnails/53aaf09a-f0de-4de0-96e9-a86c6993e038.png?sv=2019-02-02&sr=c&sig=8AQlF0%2FPCrUa21blFB2RvuXxJgbQGLORpopapgPSFwY%3D&st=2024-04-29T13%3A22%3A13Z&se=2024-04-29T17%3A27%3A13Z&sp=r
Duration:
T00H37M43S
Embed URL:
https://stream.cadmore.media/player/53aaf09a-f0de-4de0-96e9-a86c6993e038
Content URL:
https://cadmoreoriginalmedia.blob.core.windows.net/53aaf09a-f0de-4de0-96e9-a86c6993e038/Document Semantic Support-NISO Plus.mp4?sv=2019-02-02&sr=c&sig=8vYUbEd26iMyzYQWhC99JRXid68dGfzbK83jLzxSlJs%3D&st=2024-04-29T13%3A22%3A13Z&se=2024-04-29T15%3A27%3A13Z&sp=r
Upload Date:
2022-08-26T00:00:00.0000000
Transcript:
Language: EN.
Segment:0 .
[MUSIC PLAYING]
WAYNE DE FREMERY: All right. So good afternoon, good evening, good morning, depending on where you are. My name is Wayne de Fremery. I'm the convener of ISO/IEC JTC 1/SC 34's Working Group 9, which is tasked with preparing a technical report about the ways that the semantic metadata in office documents can be captured and made useful. Francis Cave, chair of SC 34, will have just given an overview of SC 34 and how the working group fits into it.
WAYNE DE FREMERY: I'd like to present the basic and general description of what working group 9 is working on. So our working group grew out of an SC 34 study group that brought together experts interested in exploring the idea that it may be possible to better support the variety of ways that people create and make use of office documents by accounting for how elements of documents can mean differently, and be differently important to a variety of different kinds of communities.
WAYNE DE FREMERY: So the idea was, and here's one example, we might compare, for example, how publishers and data scientists might think about the significance of certain elements in office documents, for example, like italic type. For publishers using Word documents to create content, the representation of text in italics is considered extremely important and meaningful. And a great deal of time and energy is spent by publishers ensuring that italic text is consistently represented when Office documents are converted to other formats as part of their publishing workflows.
WAYNE DE FREMERY: Data scientists, on the other hand, using Office documents in machine learning workflows, such as various kinds of natural language processing workflows, will generally not care if any portion of the text is represented with italic type. Italics and other kinds of text formatting are really generally not very meaningful in their work. So this is the basic idea of how different elements of documents can be semantically meaningful in very different ways to different communities.
WAYNE DE FREMERY: So since Office documents have historically been viewed as unstructured documents, the study group began to investigate how semantically meaningful elements of documents could be captured to support diverse groups of document users who will, like publishers and data scientists, differently understand what should be considered meaningful in the Office documents with which they work. So based on the work of that SC 34 study group, in September of 2020 SC 34 decided to form a working group to create an ISO technical report.
WAYNE DE FREMERY: Technical reports, unlike international standards, are informational reports. They're meant to describe the state of the art in a given field, for example, as opposed to presenting some sort of normative statement that can be used as an international standard. So in our case, we're attempting to produce a technical report that describes use cases where associating semantic metadata with Office documents would substantially benefit communities that use Office documents.
WAYNE DE FREMERY: And so we've begun exploring use cases in publishing, artificial intelligence, document preservation, and search, but are eager to entertain others. So if you have ideas about additional use cases that we should consider, please let us know. We've also begun examining ways that semantic metadata can be associated with Office documents, either through linking the metadata or embedding it, for example.
WAYNE DE FREMERY: Lastly, we are considering what kinds of semantic metadata might be supported by Office documents, and how that data might be structured. To help us think through these issues and these ideas, we've been organizing what we've called document semantic support information sessions, kind of a mouthful. But these are informal information gathering sessions, where experts from around the world join us to speak on topics related to supporting semantic metadata in Office documents as that might serve again different sorts of communities.
WAYNE DE FREMERY: So people working in artificial intelligence, people thinking about document preservation, publishing, search, indexing, and the like. And we envision these sessions as opportunities for leaders to gather and essentially share their experience. So I'd like to wrap up my short remarks with an invitation to you to join us at these information sessions. We'd be super interested if you'd like to speak with our group about your thoughts on how best to support the integration of useful semantic metadata into Office documents.
WAYNE DE FREMERY: We'd also welcome you as audience members to listen to our experts and to join in. They really turn out to be interesting discussions after our speakers' presentations. So we hope that you'll join us. If you're interested in joining us, please reach out to me during the live portion of this conference or to the NISO Plus organizers, who will have my contact information.
WAYNE DE FREMERY: Thanks a bunch. Thanks for your interest.
SPEAKER: Good evening, and thank you for joining this session on document semantic support. I shall be relying on my colleagues to tell you in detail about the project and what we're hoping to achieve. So I propose to focus on the background of the project, as I believe this to be an important piece of work for all who are concerned with the creation, use, and preservation of documents. Subcommittee 34 is just one of many technical subcommittees that do the technical work of developing standards on behalf of two international standards bodies, the International Organization for Standardization, or ISO, and the International Electrotechnical Commission, or IEC.
SPEAKER: Back in the early 1990s, ISO and IEC set up a Joint Technical Committee, JTC 1, to take responsibility for standards in the information technology domain, a domain of shared interest, with the aim of avoiding duplication of effort, an aim that has been largely, if not entirely, successful. The technical subcommittee that I chair is concerned, among other things, with ensuring that there are robust international standards for the encoding of documents.
SPEAKER: And in this context, I am talking about revisable documents whose content is designed to be created, used, and modified by people rather than by machines. Many of our audience today will be creating, modifying, and otherwise using documents that rely upon one of the families of international standards for which we are responsible: either Office Open XML file formats, or OOXML, the file formats that are used by Microsoft Office in particular, or, alternatively, Open Document Format, or ODF, which is developed and maintained by OASIS, though we support them in maintaining ODF as an international standard.
SPEAKER: It's important to understand that our work is not going on in isolation. Ours is part of a broader program of work on the development and maintenance of international standards across the full gamut of information and communication technologies. That broader program provides much of the context for today's topic. My colleagues will no doubt mention machine learning, and I could also mention big data, autonomous vehicles, trustworthiness of IT, brain computer interface, all areas of current standardization activity that I believe will sooner or later lead to new requirements on Office systems.
SPEAKER: In the next three slides, I've highlighted the domains of current standardization work in JTC 1 that I personally believe could lead to new requirements on Office systems and their file formats. Other domains might be added to this tentative list. So in this first slide, you can see trustworthiness is one of the working groups that I think could lead to new requirements for Office systems.
SPEAKER: Information security, cybersecurity, and privacy protection. You can see us at the bottom of that list there. Learning technology is an area which I'm sure could lead to new requirements on Office file formats. Then there's the Internet of Things and digital twin; that's where a lot of the work on big data is being done. And artificial intelligence, obviously, with machine learning, and the newest committee of all, on brain-computer interface.
SPEAKER: I hope that I don't need to convince you of the importance of metadata in the context of documents in general and revisable documents in particular. Whenever we create Office documents, we create metadata alongside the text and other media that form the content of that document, without necessarily being aware of all the details. That metadata may be there for a variety of purposes.
SPEAKER: To control how the content is presented to the user is the most obvious purpose. But I'm sure you can think of many more. The Office file formats for which we are responsible use metadata in many different ways, not only to control how content is presented to the user, but for purposes supporting discovery, preservation, version control, collaborative editing, accessibility, reliability, and so on.
SPEAKER: Developers of Office systems have understandably focused on applications of metadata that meet the needs of the widest range of office system users, human users. But increasingly, there is a need to consider machines as Office document creators, modifiers, and consumers. With ICT encroaching on ever more aspects of our lives and our businesses, it is inevitable that demand will grow for defining how human and machine users can collaborate on the creation, modification, and consumption of documents.
SPEAKER: Human users need to be able to support machines in their use of documents, and vice versa. On the one hand, human creators of documents need to be able to ensure that machines using these documents understand their content and context. On the other hand, human consumers of documents created or modified by machines need to understand the decision processes that led the machine to create or modify the documents in a particular way.
SPEAKER: Metadata is going to play an increasingly crucial role in optimizing the exchange and use of documents in collaborations between people and machines. Our project is just one step along a road that we hope will ensure that standards remain central to the developments that take place, ensuring a level playing field for office system developers, vendors, and users around the world. Thank you and I'll now hand over to my colleagues.
LEE MING: Hello, ladies and gentlemen. I'm Lee Ming, the editor of ISO/IEC JTC 1/SC 34/WG 9. Currently, we are working on the project on document semantic support, to develop a standard for office documents to support semantic metadata. In order to deliver an intuitive impression, the document information processing team of Beijing Information Science and Technology University has prepared a demonstration to explain what we have done so far.
SPEAKER 2: Nowadays documents are mainly for people to read. Can we make the semantic information in those documents, such as people's names, names of places, product names, event names, and so on, understandable by machines so that this information can be automatically processed by the computer? To do so, we need to tag these entity names as semantic metadata and record them in the document format.
SPEAKER 2: Therefore, we added a microformat similar to RDFa to the OOXML format used by Microsoft Word. We indicate metadata from different sources through the RDFa attributes, including their namespaces, types, and so on. For example, here is a news report about an earthquake, which includes editorial information, keywords, named entities, time, and more.
SPEAKER 2: We tag this information as semantic metadata in the form of RDFa and embed it in the document XML file in the OOXML file package. In this way, the formatted data of the document looks like this. We can use namespaces to identify metadata from different sources. It could be metadata of the library domain, publishing domain, or business domain, et cetera.
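To make the idea concrete, here is a minimal sketch of extracting RDFa-style annotations from a document part. The element names, attribute placement, and vocabulary (schema:Event and so on) are illustrative assumptions, not the markup the working group has specified.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment: RDFa-style attributes (typeof/property) attached to
# runs of text inside a simplified OOXML-like body. The element names,
# namespace, and vocabulary are assumptions for demonstration only.
DOC = """
<document xmlns:w="http://example.org/wordml">
  <w:body>
    <w:p>
      <w:r typeof="schema:Event" property="schema:name">Example City earthquake</w:r>
      <w:r>struck on</w:r>
      <w:r typeof="schema:Place" property="schema:name">Example City</w:r>
      <w:r typeof="schema:Date" property="schema:startDate">2022-01-01</w:r>
    </w:p>
  </w:body>
</document>
"""

def extract_semantic_metadata(xml_text):
    """Collect (type, property, text) triples from RDFa-style attributes."""
    root = ET.fromstring(xml_text)
    found = []
    for elem in root.iter():
        if "typeof" in elem.attrib:
            found.append((elem.attrib["typeof"],
                          elem.attrib.get("property", ""),
                          (elem.text or "").strip()))
    return found

for entity in extract_semantic_metadata(DOC):
    print(entity)
```

A machine consumer can iterate these tuples directly, while Word-oriented tooling ignores the extra attributes.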
SPEAKER 2: In other words, it could be metadata used in any domain. Now here is a problem. Originally, the OOXML documents used by Word did not come with this semantic metadata. After embedding the metadata, Word can no longer open these documents normally. So what should we do? We found a solution. We designed a preprocessor to convert the semantic metadata into special annotations in the documents so that Word can open them.
SPEAKER 2: And the semantic metadata can also be edited and modified within the Word document. Note that when we save the OOXML document from Word, we also need to restore its semantic metadata. We therefore designed a post-processor to convert the semantic metadata in the form of special annotations back to RDFa tags. What benefits can we get from doing this?
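The pre-/post-processor round trip described here can be sketched as a pair of inverse text transformations. The [[type|text]] annotation syntax below is an invented stand-in for whatever Word-tolerated annotation mechanism the team actually uses.

```python
import re

# Sketch of the pre-/post-processing idea: Word cannot open a document whose
# XML carries unknown RDFa markup, so the preprocessor rewrites each tagged
# span into a plain annotation Word tolerates, and the postprocessor restores
# the RDFa form on save. The [[type|text]] syntax is an assumption.

RDFA = re.compile(r'<span typeof="([^"]+)">([^<]*)</span>')
NOTE = re.compile(r'\[\[([^|]+)\|([^\]]*)\]\]')

def preprocess(xml_text):
    """RDFa spans -> annotations that a stock Word build can round-trip."""
    return RDFA.sub(r'[[\1|\2]]', xml_text)

def postprocess(annotated_text):
    """Annotations -> RDFa spans, restoring the semantic metadata."""
    return NOTE.sub(r'<span typeof="\1">\2</span>', annotated_text)

src = 'An <span typeof="schema:Event">earthquake</span> struck.'
roundtrip = postprocess(preprocess(src))
assert roundtrip == src  # the save path must restore what the open path hid
```

The key design constraint is losslessness: the two transformations must be exact inverses or metadata silently degrades with each edit cycle.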
SPEAKER 2: The biggest advantage is that the computer can easily extract this metadata from documents and do corresponding processing, like adding it to a database or answering users' inquiries, and so on. What's more, we can transform this semantic metadata into knowledge represented by RDF to form a knowledge graph and then perform knowledge management.
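The step from tagged entities to an RDF knowledge graph can be sketched as emitting subject-predicate-object triples. The URIs and predicate names below are placeholders, not a vocabulary the project has defined.

```python
# Minimal sketch: turn extracted entity tags into subject-predicate-object
# triples, the form a knowledge graph ingests. URIs and predicates here are
# placeholders for illustration only.

def entities_to_triples(doc_uri, entities):
    """entities: list of (entity_type, text) pairs tagged in one document."""
    triples = []
    for i, (etype, text) in enumerate(entities):
        node = f"{doc_uri}#entity{i}"
        triples.append((node, "rdf:type", etype))
        triples.append((node, "rdfs:label", text))
        triples.append((doc_uri, "ex:mentions", node))
    return triples

triples = entities_to_triples(
    "http://example.org/doc/42",
    [("schema:Event", "earthquake"), ("schema:Place", "Example City")])

# Query the tiny graph: which entity nodes does the document mention?
mentioned = [o for s, p, o in triples if p == "ex:mentions"]
print(mentioned)
```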
SPEAKER 2: This method can be used not only in OOXML format documents, but also in ODF format documents. So it is a relatively general method. Of course, this is just a preliminary attempt. We believe the future standard will be even better. Thanks for watching, and welcome to join us. We look forward to working with you.
LESLIE JOHNSTON: Hi, everyone. This is Leslie Johnston from the National Archives and Records Administration. I am the Director of Digital Preservation there, which means that I am responsible for the infrastructure as well as the policy related to preserving our holdings, which are the permanent records of the United States government. So I'm going to talk today about developing, implementing, and maintaining our digital preservation framework and how we are moving to a linked data resource.
LESLIE JOHNSTON: So I'm going to start off by talking about what our digital preservation framework is. The first part of it is a collections format profile. We have been taking in born digital records since 1971. So we have 50 years of digital records, as well as digitized records that have both come from agencies or that we've created through our own digitization. This is federal records, congressional records, census records, presidential records, multiple systems over time.
LESLIE JOHNSTON: So we needed to assess what we actually had in our holdings across all of these systems. At this point this is approximately 2 billion files and yes, that's billion with a b. We used a manual process to combine the reports from all the systems and we have been managing this in a spreadsheet. There are lots of issues with the granularity across the systems.
LESLIE JOHNSTON: We have different tooling, different analysis, different reporting, which meant that what we have as a data set for this, in this spreadsheet, are reports of different granularities. Such as, we know these are PDF, we know these are PDF 1.4, 1.7, PDF/A. Which version of PDF/A? So we have had to do a lot of data normalization to be able to do a comparison across the holdings. This then led to risk assessment for us.
LESLIE JOHNSTON: So we have a very extensive risk matrix, also a spreadsheet, which is designed with a series of weighted factors related to the preservation sustainability of the formats, as well as being built on top of the data from that collection format profile. They all have different relative weightings. They map to different levels of risk. And they measure, to the extent that they can be defined, things like resource costs of staff time or budget.
LESLIE JOHNSTON: But also, what are our current capabilities vis-a-vis those formats? What kind of format is it? Is it open? Is it closed? Is there an official specification? Is there a national or international standard? Or is there a community reverse engineered specification? We are actually open to all of those because we can't do our work assessing risk unless we have a specification that we can take advantage of in characterizing those formats in our holdings.
LESLIE JOHNSTON: So we do calculate a numeric score, but we only map those to high, moderate, and low. And these thresholds are always open to review and revision over time. This is what the risk matrix looks like. You can see here, for example, different versions of Canon RAW or CSS or CGM files. How do we characterize it? What file extensions are associated?
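A weighted scoring scheme of the kind just described might look like the following sketch. The factor names, weights, and thresholds are invented for illustration; NARA's actual matrix has many more factors, and its thresholds are periodically reviewed.

```python
# Hypothetical weighted risk scoring in the spirit of the matrix described
# above. Factor names, weights, and thresholds are invented for illustration.

WEIGHTS = {
    "no_open_specification": 3,   # closed format with no published spec
    "no_available_tools": 2,      # nothing usable to render or migrate it
    "format_superseded": 1,       # newer versions have replaced it
}

def risk_level(factors):
    """Map a format's true/false risk factors to a numeric score,
    then to the published high/moderate/low bands."""
    score = sum(WEIGHTS[name] for name, flagged in factors.items() if flagged)
    if score >= 4:
        return "high"
    if score >= 2:
        return "moderate"
    return "low"

print(risk_level({"no_open_specification": True,
                  "no_available_tools": True,
                  "format_superseded": False}))  # scores 5 -> "high"
```

Keeping the numeric score internal and publishing only the banded level, as the talk describes, lets the thresholds be revised without invalidating published assessments.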
LESLIE JOHNSTON: And in this case, information about the proprietary nature and specifications for those files. We have then also needed to create preservation action plans. So we have two different types of plans. One which identifies the essential characteristics for the record types. This really means something like email, GIS, databases, word processing documents, still images, digital video and audio, because you can't actually compare characteristics from word processing files to still image or video files.
LESLIE JOHNSTON: They're very different types of records. But you can actually compare all the different versions of word processing files, or all of the different versions of still images, in a way that you can have a set of consistent essential characteristics. So we use these for different purposes. We track appearance, structure, behavior, and context. And these are the properties that should be retained when we do a format migration.
LESLIE JOHNSTON: And these are also the metrics we use for testing the tools that we need for preservation migrations. Then we also have the preservation action plans for the formats. This has close to 700 file formats listed in, again, a spreadsheet. The record category plans also are in Word documents.
LESLIE JOHNSTON: But the preservation action plan spreadsheet covers all of the record types. It contains the specs, resource links in the community, format information, and preservation actions, to ensure that it's actionable and extensible for us, but also for other institutions. So this is an example of a plan for a record type. This is for digital still images, with the characteristics for appearance, for structure, for behavior, so that we can actually use those as test and service metrics.
LESLIE JOHNSTON: This is an example from our preservation action plans. You can see this covers the same Canon RAW and CSS I showed earlier from the matrix. What are the extensions? What are the types? What are our unique identifiers? MIME types, if they exist, specifications, but links to other resources, such as PRONOM, or the Library of Congress, British Library, Wikidata, Archive Team, so any sort of community resource that is authoritative.
LESLIE JOHNSTON: It also includes our actions and what we will actually be using and doing to do the work. So we released this publicly in 2020. It's available on GitHub as CSV files for the action plans as well as for the matrix: a blank copy of the matrix and a filled-in copy of the matrix for other institutions to use, and PDFs from the Word documents for the record category plans. We do update these quarterly, but maintaining these is difficult.
LESLIE JOHNSTON: Links break. Links break quickly. In a single quarter, between, say, a June release and a September release, specifications go away, community resources go away, specifications get overwritten. I'm looking at you, Microsoft. So they're not there anymore. So we do actually have to web archive what we can actually get and keep those internally as an internal-only resource, as well as keeping licensed specifications that we have purchased licenses for.
LESLIE JOHNSTON: And we keep those close hold. We also need to add new formats every single quarter. Sometimes it's because we have received new formats. Sometimes it's because we have done some better identification and characterization of files that we already have in the holdings. Sometimes it's that there are new resources out there. There is more information on Wikipedia or Wikidata, or at Archive Team or PRONOM has released new signatures.
LESLIE JOHNSTON: We have to revise the available tools. Sometimes tools that were available are no longer available. Or there are new tools that are available. And for us, there's the added issue of being available for use by a federal agency, because we can't accept the terms of service that a lot of other organizations can. We might have new record categories.
LESLIE JOHNSTON: We've added calendars and navigational charts in the last year, year and a half. And now we know that we need to look at a new category of e-learning formats because we have had an agency tell us that they have SCORM files, which is an e-learning packaging format that they consider permanent records that they would like to transfer to us.
LESLIE JOHNSTON: We also have changes in our risk matrix, such as format age and the availability of tools. So we have now found ourselves in a situation: we are managing a data set having to do with the file formats and files in our holdings, we have two spreadsheets in which we are tracking the risks and the plans for our holdings, and we have 16 Word files. So what this means for us is that we are starting to look more and more at linked data.
LESLIE JOHNSTON: I mean, we have several goals related to this. One is that we need to document our decision-making processes, and the decisions made by NARA, in terms of the long-term treatment of the specific file formats in our holdings and in our digital preservation framework. This is actually a key point for us. We have sometimes had other organizations say, hey, can you add formats?
LESLIE JOHNSTON: At this point we are being a little more introspective in terms of our own holdings and what agencies have scheduled to transfer to us as permanent records. And we're not at a point where we are looking widely at every format variation that is perhaps out in the community. So we're already sharing our framework in a human-readable format, as Word files turned into PDFs, or spreadsheets turned into CSVs.
LESLIE JOHNSTON: But we really need machine-readable formats, because it's incredibly important that we make this available in a form that can be incorporated into other linked data resources. And linked data is that form. It both expands the scope of file format preservation information that's available to the community and allows us to more easily reference and maintain our links to resources that are out in the community and bring them into our digital preservation framework.
LESLIE JOHNSTON: Wikidata for digital preservation has rapidly become the hub for all of these resources. It documents file formats and related information. It includes information about software publishers, about software packages, the versions of those packages, the formats that they produce, and the documentation that's out there, such as the file signatures maintained by the National Archives, UK, in PRONOM.
LESLIE JOHNSTON: As well as information provided by the Library of Congress, by the Internet Archive, and by many other organizations. So our goal is to make sure that our linked data resource can be incorporated there as much as possible. So that's very important to us. Transitioning to a semantic resource requires a lot of thought. There are many other authoritative online resources on file format specifications and their reverse-engineered documentation.
LESLIE JOHNSTON: That's really the knowledge graph, if you think about a semantic resource. Our plans actually already have a lot of URIs in them, which connect to the resources out there and can then provide two-way linkages between those organizations and their resources and our resources, making up a larger graph. So those linkages and URIs are incredibly important for a semantic resource.
LESLIE JOHNSTON: Linked data for descriptive metadata and for preservation metadata, such as in PREMIS, is already in use. But what we found is that there is a lot of use out there for descriptive metadata, such as in Dublin Core or in some of the other sets of elements, properties, ontologies, and vocabularies that are out there. And we can make use of a lot of them from Wikidata and some from Dublin Core.
LESLIE JOHNSTON: But we have found that there are not a lot of elements, properties, ontologies, or controlled vocabularies that have to do with the documentation of risk. So that is something that we needed to create our own elements and vocabularies for. So we took our current spreadsheet structure and we turned it into a data model, internally. It clarifies what internal data relationships we have, such as: if we have a software publisher, it can relate to several different software packages, each of which can have several different versions, which can then have different file format extensions that are created by those versions.
LESLIE JOHNSTON: If you look at something like a long-lived program, like, say, FileMaker, which may have had FM 3, FM 5, FM 7, and so on, then you have a lot of one-to-many child relationships. And we discovered that was true of a lot of what was in our data resource. We then also had to identify what the external relationships are. We already track the PRONOM identifiers for their documentation and the format signatures that you use to characterize those formats.
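The one-to-many data model described here can be sketched with a few record types. The FileMaker example follows the talk; the publisher name and file extensions shown are illustrative, not taken from NARA's data.

```python
from dataclasses import dataclass, field

# Sketch of the one-to-many data model described above: one software
# publisher has many packages, each package many versions, and each version
# its own file extensions. Publisher name and extensions are illustrative.

@dataclass
class Version:
    number: str
    extensions: list

@dataclass
class SoftwarePackage:
    name: str
    versions: list = field(default_factory=list)

@dataclass
class SoftwarePublisher:
    name: str
    packages: list = field(default_factory=list)

publisher = SoftwarePublisher("Claris", [
    SoftwarePackage("FileMaker", [
        Version("3", [".fp3"]),
        Version("5", [".fp5"]),
        Version("7", [".fp7"]),
    ]),
])

# Walk the tree: every extension reachable from the publisher.
all_exts = [ext for pkg in publisher.packages
            for ver in pkg.versions
            for ext in ver.extensions]
print(all_exts)
```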
LESLIE JOHNSTON: We have many relationships to other resources, some of which are community and some of which are more commercial because that's what's available. And we don't want to say we won't use those because it's an open community resource. We want to link to the resources that are out there. We also needed our own unique identifier. We were already doing this for our work. But in terms of a semantic resource, it's incredibly important to have a unique identifier that other resources can actually link to and then take advantage of your elements and properties and ontologies.
LESLIE JOHNSTON: We also found that we needed to review and normalize our data. We have had an incredible team of processing archivists and standards experts create our resources over time, where they have researched the formats and the software packages and the information about them. But it was a multi-person team operating over time. And we did have some inconsistencies in how we documented what our official preservation actions were. And sometimes, because a format can actually exist across multiple record types, XML, for example, can exist in many places.
LESLIE JOHNSTON: In a couple of places, for different uses, formats had actually acquired different levels of risk. So we had to do a lot of data normalization to be able to make this a consistent semantic resource. So we went into a pilot for this in 2021. We mapped our preservation action plan, and that's where we started, with the preservation action plan for the formats, to other linked data elements and properties, and created our own elements.
LESLIE JOHNSTON: We converted that plan from CSV into RDF Turtle. I will say that we also looked at JSON-LD as a prime contender for this. But we found, in terms of tools that we could use and some services for validation, that JSON-LD looks great but it's not as widespread yet as RDF Turtle. So we ended up going with RDF Turtle. We're currently using OpenRefine as our internal tool for processing and validation.
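The CSV-to-Turtle conversion step might look like the following minimal sketch. The column names, namespace, and predicates are placeholders rather than NARA's published element set; the two format rows echo examples shown earlier in the talk.

```python
import csv
import io

# Minimal sketch of converting preservation-action-plan CSV rows into RDF
# Turtle. The column names, namespace, and predicates are placeholders.

CSV_TEXT = """format_name,extension,risk
Canon RAW,.crw,high
Cascading Style Sheets,.css,low
"""

def csv_to_turtle(csv_text, base="http://example.org/format/"):
    out = ["@prefix ex: <http://example.org/terms/> ."]
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        subject = f"<{base}{i}>"
        out.append(f'{subject} ex:formatName "{row["format_name"]}" ;')
        out.append(f'    ex:extension "{row["extension"]}" ;')
        out.append(f'    ex:riskLevel "{row["risk"]}" .')
    return "\n".join(out)

print(csv_to_turtle(CSV_TEXT))
```

In practice a library such as rdflib would handle escaping and serialization; the point here is only the shape of the mapping from rows to triples.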
LESLIE JOHNSTON: We are actually looking at another more extensive tool that was recommended to us by another national cultural heritage organization and we will move forward with that. Now we didn't want to just let this out in the world. Because what happens when you put draft linked data out into the world? As I know others in our community have found, people start linking to it immediately. And we knew that there was a chance that we were going to be making some major changes to our structure as well as to our ontologies, potentially.
LESLIE JOHNSTON: And, of course, we also needed to have a plan for how we were going to remove formats that we were no longer using because we had combined them or made changes. And we did not have a place at that time to make all of the individual URIs actionable, as published, for every single format. So we could not make this live, because it would, a, fail, and, b, be difficult to take anything back once it was out in the wild.
LESLIE JOHNSTON: So we shared the draft document with several community leaders, internationally, for feedback on our draft implementation. So this feedback includes expressing our RDF schema using the ShEx language. This was something that was new to us. That feedback came to us from the Wikidata for digital preservation team. And we are now really interested in using that. Identifying the language for the linked resources.
LESLIE JOHNSTON: A colleague in another country pointed out to us: your resources are almost all English language. You should actually identify that the language is English or, in those cases when it is not, French or Spanish, for example. We had some suggestions about the use of some specific elements, especially some of the Dublin Core elements. We discovered that we had a couple more use cases that we needed to support, in terms of filtering on the target formats as well as the preservation actions.
LESLIE JOHNSTON: Someone might want to see all of our high-risk formats, or all the formats that we retain versus all the formats that we transform. So some different querying use cases. And we need to look forward to creating more granular records. We currently have the format, say, Adobe Flash, and the version number all in one element. We know that we need to break this out in the way that Wikidata does: Adobe as the software publisher, Flash as the software package, and the version.
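Those querying use cases, filtering by risk level or by preservation action, can be sketched over plain records rather than a triple store. The field values below are illustrative.

```python
# Sketch of the querying use cases mentioned above: filter the framework
# by risk level or by preservation action. Records are illustrative.

FORMATS = [
    {"name": "Canon RAW", "risk": "high", "action": "transform"},
    {"name": "PDF/A-2", "risk": "low", "action": "retain"},
    {"name": "Adobe Flash", "risk": "high", "action": "transform"},
]

def by_risk(records, level):
    """All format names at a given risk level."""
    return [r["name"] for r in records if r["risk"] == level]

def by_action(records, action):
    """All format names with a given preservation action."""
    return [r["name"] for r in records if r["action"] == action]

print(by_risk(FORMATS, "high"))      # the high-risk formats
print(by_action(FORMATS, "retain"))  # the formats retained as-is
```

In the linked data version, the same filters become SPARQL patterns over the published risk and action properties.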
LESLIE JOHNSTON: And that's something that we will do looking forward but probably not with our first release. So our first release is tentatively planned for later this year. I can't give you an exact date yet because we have some internal work to do. But for us, this is really all about consulting closely with Wikidata for digital preservation and their team since that's what we want to most closely integrate with to make sure there's a path through Wikidata for digital preservation to our resources and ways that others can query against us.
LESLIE JOHNSTON: But we can also query against Wikidata for digital preservation as the hub for information, and then bring back updates through SPARQL queries against Wikidata for digital preservation, so that we're not constantly, manually, looking for updates. So this is both about making our information available, but also being able to bring others' information more easily back to us to maintain and create new resources. And that's the real promise of linked data for us: that ability to go out to a hub and go in all directions, including back to where we are.
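The update-by-query idea might be sketched as building a SPARQL query for the public Wikidata endpoint and diffing the results against local records. The property ID below is deliberately a placeholder (PXXXX), not a real Wikidata property; substitute the correct external-identifier property when building the query.

```python
# Sketch of periodic update-by-query: construct a SPARQL query listing file
# formats that carry a given external-identifier property on Wikidata.
# PXXXX is a placeholder property ID, deliberately not a real one.

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"  # public query service

def build_format_query(property_id, limit=100):
    """Return a SPARQL query listing items carrying the given
    external-identifier property, with their English labels."""
    return f"""
SELECT ?format ?formatLabel ?externalId WHERE {{
  ?format wdt:{property_id} ?externalId .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {limit}
""".strip()

query = build_format_query("PXXXX")
print(query)
```

The query string would then be POSTed to the endpoint on a schedule, with the result set compared against the local framework to surface new or changed formats.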
LESLIE JOHNSTON: So thank you very much for your time today. And I'm looking forward to the discussion afterwards. [MUSIC PLAYING]