Name:
Archiving and digital preservation -NISO Plus
Description:
Archiving and digital preservation -NISO Plus
Thumbnail URL:
https://cadmoremediastorage.blob.core.windows.net/5226b7da-8181-43ca-aa08-499beb2fa2b0/thumbnails/5226b7da-8181-43ca-aa08-499beb2fa2b0.png?sv=2019-02-02&sr=c&sig=s8FQptv4x2JCbsmkKMc9jZbJwK%2FY9DLSBEjCjMPmTqw%3D&st=2024-11-21T08%3A59%3A45Z&se=2024-11-21T13%3A04%3A45Z&sp=r
Duration:
T00H50M45S
Embed URL:
https://stream.cadmore.media/player/5226b7da-8181-43ca-aa08-499beb2fa2b0
Content URL:
https://cadmoreoriginalmedia.blob.core.windows.net/5226b7da-8181-43ca-aa08-499beb2fa2b0/Archiving and digital preservation -NISO Plus.mp4?sv=2019-02-02&sr=c&sig=Zcjw76r%2FheVF%2BNRHxpO5iaDGLNkZTYUgeRvd5Zakr5I%3D&st=2024-11-21T08%3A59%3A45Z&se=2024-11-21T11%3A04%3A45Z&sp=r
Upload Date:
2022-08-26T00:00:00.0000000
Transcript:
Language: EN.
Segment:0 .
[MUSIC PLAYING]
PETER MURRAY: Good day, everyone. And welcome to this session on archiving and digital preservation. My name is Peter Murray. And I will be the moderator for this session. During this session, we have three topics. The first is on the policy risk assessment and planning involved with digital preservation. And on this topic, we have Leslie Johnston. Leslie is the Director of Digital Preservation at the National Archives and Records Administration.
PETER MURRAY: Leslie, take it away.
LESLIE JOHNSTON: Thank you very much, Peter, for the introduction. I'm going to talk today about digital preservation generally as a discipline. But I'm going to focus on the work that we do at NARA, the National Archives and Records Administration, the US National Archives, on planning for digital preservation, risk assessment, and how we actually share our materials and research with the rest of the community.
LESLIE JOHNSTON: So what is the discipline of digital preservation? Any digital object is in scope for digital preservation whether it is born-digital or digitized. It really encompasses all format types. So it could be text, images, databases, spreadsheets, any type of image, software, email, social media, the list goes on. And with every IT innovation, digital preservation managers have to respond to this by devising effective strategies for ensuring the durability as well as the ongoing accessibility and usability of new digital materials.
LESLIE JOHNSTON: So digital preservation always remains an ever emerging challenge. I'm often asked, when will we be done with digital preservation? We will never be done with digital preservation. I'm not going to go through all of these for obvious reasons. But there are several digital preservation standards and models out there in the world. Some of them have been out there for a very long time, like OAIS.
LESLIE JOHNSTON: And some of them are more recent, like DiAGRAM. So which one is the best-- none of them, all of them, or the typical NARA answer, it depends? This is really about your goals for assessing your digital preservation program and the risks related to your collections. So in looking at your goals and deciding what your goals are, that's how you can perhaps choose one or, more often, more than one model for digital preservation program assessment that's best for your organization.
LESLIE JOHNSTON: So let me talk about the applied work at NARA. So the first step for us is a guiding policy and/or strategy. So we published our first digital preservation strategy in June of 2017. And this was meant to guide our internal operations. And this is publicly available on the archives.gov website. And it outlines the specific strategies that NARA uses in its digital preservation efforts. And it focuses on, what is the infrastructure?
LESLIE JOHNSTON: And that ranges from desktop tools to servers and enterprise level systems. Format and media sustainability and standards-- standards is obviously a huge issue in digital preservation. Data integrity-- so how we actually manage our digital collections or, as we call them, our holdings and ensure that they are unchanged over time. As well as information security-- who has access to this?
LESLIE JOHNSTON: Are the systems maintained? Is the infrastructure maintained? You notice it doesn't specifically mention access. Access is built into our mission at the National Archives. Our two-pronged mission is the preservation of our collections and enabling access to our collections. So that's really built into our mission overall. And this applies to born-digital agency electronic records, so the permanent records of the US government that are transferred to us; digitized records from those agencies, which is going to become even more important in a very short time frame.
LESLIE JOHNSTON: The government memorandum known as M-19-21 has stated that after the end of 2022, we will no longer accept physical transfers. There will always be exceptions. And that is not blanket. But it really is blanket. So we will only be accepting electronic materials, so that's either born-digital or digitized, as well as our own digitization for preservation reformatting and access of records that have come to us.
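The data integrity goal described in the strategy, ensuring that holdings are unchanged over time, is commonly implemented as a fixity check: record a checksum for every file, then recompute and compare on a schedule. The Python sketch below illustrates the idea under stated assumptions. The function names (`sha256_of`, `build_manifest`, `audit`) and the choice of SHA-256 are illustrative only; the talk does not describe NARA's actual tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(root: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare the files currently on disk against a stored manifest."""
    current = build_manifest(root)
    return {
        # Same path, different bytes: the file was altered or corrupted.
        "changed": sorted(k for k in manifest if k in current and current[k] != manifest[k]),
        # Listed in the manifest but no longer on disk.
        "missing": sorted(k for k in manifest if k not in current),
        # On disk but never recorded in the manifest.
        "unexpected": sorted(k for k in current if k not in manifest),
    }
```

A manifest would typically be built once at ingest and stored separately from the holdings, so that a later `audit` call can report anything that changed, disappeared, or appeared unexpectedly.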
LESLIE JOHNSTON: So this means that we need to look at file format risks. We have been taking in records since 1971. So we have a 50 year history of taking in born-digital records. So we have to first issue guidance to the agencies. There are over 200 federal agencies, from ones that literally have maybe a couple dozen employees to huge ones like the Department of Defense. So we provide guidance on metadata, on media types, on file formats.
LESLIE JOHNSTON: By regulations, we cannot be 100% prescriptive. This is different in some other countries. But we can just say, hey, here are our preferred and our acceptable formats. We hope you can meet those. Sometimes they can't because the work of federal agencies is hugely different. It could be just email and office applications, such as Word or Excel.
LESLIE JOHNSTON: Or it could be an agency with a scientific mission, such as NOAA or NASA. So it's very different types of work. So we need to accept or prefer formats but also take what they can give us. So there's always going to be some exceptions. So this meant that we had to create a collections format profile. This is something I recommend for every organization, especially organizations that actually have sort of a longitudinal history of bringing in electronic records, whether born-digital or digitized.
LESLIE JOHNSTON: We have several different systems that we have created over time, so, really, since the 1970s to now but primarily the '90s to now. And they all have different tooling. So this did require a manual process for us. We are slowly but surely automating that process because we have different systems under different regulations. So federal records, congressional records, census records-- they all have different restrictions and requirements on them.
LESLIE JOHNSTON: So they all had different systems that were created at different times. That meant that while one system could tell me, this PDF is a PDF 1.7, the other system could just tell me, I've got things with the extension PDF, and that's all I can tell you. So we needed to do some normalization. It is not perfect. We know it's not perfect.
LESLIE JOHNSTON: But it is good enough. And that's a big message that I think needs to get out more in digital preservation is that there is good enough, and you can start there. So we couldn't even characterize every format. I've got files that have the extension .POTATO. That's my team's personal favorite is .POTATO. What is that .POTATO file? I can't actually tell you.
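The normalization step described above, where one system reports "PDF 1.7" and another only knows the file extension, can be sketched as follows. This is a minimal illustration, not NARA's process: the record shapes (`format`/`version` versus `filename`), the `EXTENSION_TO_FORMAT` table, and the function names are all assumptions made for the example.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical extension lookup; a real profile would cover far more formats.
EXTENSION_TO_FORMAT = {
    ".pdf": "PDF",
    ".jpg": "JPEG",
    ".jpeg": "JPEG",
    ".tif": "TIFF",
    ".tiff": "TIFF",
    ".xml": "XML",
}


def normalize(record: dict) -> tuple[str, str]:
    """Reduce a system-specific inventory record to a (format, version) pair."""
    if record.get("format"):
        # Richer system: the format (and perhaps version) is already characterized.
        return record["format"].upper(), record.get("version") or "unknown"
    # Extension-only system: a lookup is the best available evidence.
    ext = PurePosixPath(record["filename"]).suffix.lower()
    fmt = EXTENSION_TO_FORMAT.get(ext)
    if fmt is None:
        # Keep unidentified extensions visible instead of dropping them.
        return f"unidentified ({ext or 'no extension'})", "unknown"
    return fmt, "unknown"


def format_profile(records: list[dict]) -> Counter:
    """Count holdings per normalized (format, version) pair."""
    return Counter(normalize(r) for r in records)


records = [
    {"format": "pdf", "version": "1.7"},   # from a system that characterizes files
    {"filename": "report.PDF"},            # from a system that only knows names
    {"filename": "scan.tif"},
    {"filename": "mystery.POTATO"},
]
profile = format_profile(records)
```

The result is deliberately "good enough": versions that cannot be determined are counted as `unknown`, and extensions such as `.potato` land in their own bucket so they can be investigated later.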
LESLIE JOHNSTON: We also discovered some past decisions, like in the Clinton records from that presidential administration. There had been decisions that were made at the time about normalization that we now need to look at and potentially revisit if we can. So this is just a quick graphic that shows the percentages of the largest things that we have in the holdings.
LESLIE JOHNSTON: That huge blue box is email. That is 776 million email messages followed closely by JPEGs, TIFFs, HTML, ASCII files because we have a lot of data sets from mainframes that go back to the 1970s, XML, PDFs. And it goes on into an infinitesimally small dot in the corner representing formats for which we have only a single file.
LESLIE JOHNSTON: When you think about versions of files, you can say I take in PDFs. We have 16 different types of PDFs. So it really, really runs the gamut in terms of what we have. The next requirement in preservation is assessing risk. So now that we knew what we had and we have this holdings data, we needed to assess the risks associated with those several hundred variant formats. We created a risk matrix, which applies weighted factors.
LESLIE JOHNSTON: And it is extensive. It has many dozen questions in different areas to answer questions like, is this well documented? Is it an open format? Is there a formal or community standard or specification for that format? And we will accept community reverse engineered specs because if that's what's available, we will treat that as authoritative because someone we trust in the community has done that.
LESLIE JOHNSTON: And we're always assessing this against our current environment. So this is another part of digital preservation that's a message to really take home is that you have to constantly reassess because it could be that five years ago there was a supported piece of software that could work with certain file formats. That piece of software may no longer exist in the marketplace.
LESLIE JOHNSTON: And so that means capabilities and capacity will have changed for us in terms of handling these formats. So this is an ongoing process. We map them to just high, moderate, and low risk. It's not worth it to be any more granular than that for this sort of work. So there's a lot of assumptions. I've already actually addressed several of them.
LESLIE JOHNSTON: But one of the key ones is the openness of a format, and the documentation available is huge if we want to actually be able to identify and characterize files as being a certain format. This feeds into things like PRONOM from the UK National Archives and tools like DROID and Siegfried that we use to characterize them. If we can find something, then we have a chance of identifying it.
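Format identification of the kind mentioned here generally works by matching documented byte signatures at known positions in a file. The Python sketch below shows the principle with a handful of widely documented leading signatures. It is a simplified illustration and does not reproduce PRONOM's signature format or the interfaces of DROID or Siegfried; the `SIGNATURES` table and `identify` function are hypothetical names.

```python
import re
from typing import Optional

# A few widely documented leading byte signatures ("magic numbers").
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF"),  # little-endian byte order
    (b"MM\x00*", "TIFF"),  # big-endian byte order
    (b"PK\x03\x04", "ZIP"),
]

# A PDF header carries its version directly after the signature, e.g. %PDF-1.7
PDF_VERSION = re.compile(rb"%PDF-(\d\.\d)")


def identify(header: bytes) -> tuple[str, Optional[str]]:
    """Return (format name, version or None) from the first bytes of a file."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            version = None
            if name == "PDF":
                match = PDF_VERSION.match(header)
                version = match.group(1).decode("ascii") if match else None
            return name, version
    # No documented signature matched, so the file cannot be characterized.
    return "unidentified", None
```

This is why openness and documentation matter so much: a format with no published or community-documented signature simply falls through to `unidentified`.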
LESLIE JOHNSTON: Level of adoption of formats is important. The ability to analyze them directly is important. Self-documentation-- does it have information in its headers that we can actually read? For us, a huge assumption-- if we can actually handle it. Do we have the software? Can we get the software? Can we get the software as a federal agency, which is also sometimes an issue for us?
LESLIE JOHNSTON: Are there licenses? Are there patents, things we need to be aware of? And the age-- the age of a format is always an additive risk factor. There are no negative impacts or positive impacts for age. The older a format gets and the older the documentation and the updates to that format get, the more difficult it is to work with it. And, therefore, there is more risk.
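A weighted risk matrix of the kind described above can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the factor names, the weights, the per-decade age increment, and the thresholds are invented, and the published NARA matrix asks many more questions. What the sketch does preserve from the talk is the structure: weighted factors, age as a purely additive risk, and a coarse mapping to high, moderate, and low.

```python
# Hypothetical factors and weights; a "True" answer means the risk is present.
WEIGHTS = {
    "undocumented": 3.0,
    "proprietary": 2.0,
    "no_standard": 2.0,
    "low_adoption": 1.5,
    "no_supported_software": 3.0,
    "license_or_patent_limits": 1.0,
}

# Age never reduces risk; each decade adds a fixed (assumed) increment.
AGE_WEIGHT_PER_DECADE = 0.5


def risk_score(answers: dict[str, bool], age_years: int) -> float:
    """Sum the weights of all risk factors present, then add the age term."""
    base = sum(weight for factor, weight in WEIGHTS.items() if answers.get(factor, False))
    return base + AGE_WEIGHT_PER_DECADE * (max(age_years, 0) / 10)


def risk_level(score: float) -> str:
    """Collapse a numeric score into three coarse levels (assumed thresholds)."""
    if score >= 6.0:
        return "high"
    if score >= 3.0:
        return "moderate"
    return "low"
```

Because the software environment changes, the answers (for example `no_supported_software`) would need to be re-evaluated periodically, which is the reassessment the talk emphasizes.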
LESLIE JOHNSTON: So this is a screenshot from our risk matrix that shows formats, for example, several different versions of RAW for cameras, CSS, ColdFusion, CGM files-- we have all of this. And this is just a small percentage. It shows what the extensions are, how we identify those as record types so that we can assign characteristics, and some of the questions we ask about risk.
LESLIE JOHNSTON: On top of that, we then can create Preservation Action plans, and there are two types of plans that an organization should think about. One is the broad type of record types. If you were trying to look at what the essential characteristics are of records, you can't compare word processing documents to images. They have completely different essential characteristics. But you can compare all types of word processing files versus all types of digital image files.
LESLIE JOHNSTON: So we have plans for all of the record types, or record categories, that we use in our transfer guidance for agencies, and these do change every year. Just in the last, say, 18 months we added both calendars and electronic navigational charts, and I now know from some recent communication with agencies that we are going to need to look at e-learning packages like SCORM in the very near future, because those are being identified as permanent records to be transferred to us.
LESLIE JOHNSTON: Our Preservation Action plans-- this slide says 500, it's now closer to 700 in terms of the types of formats that we actually need to track. So it's essential characteristics for the record types, but for the individual formats, this is, what's the risk? What are links to the specifications? What are the community resources out there like the Library of Congress or PRONOM or Internet Archive?
LESLIE JOHNSTON: What's the description of the format? What are our recommended actions and what are our recommended tools based on our environment? So here's an example of a plan for a record type. So this is digital images. It tracks the appearance, the structure, the navigation. These are the characteristics that we hope that we can preserve as we not only bring these in but maintain them over time.
LESLIE JOHNSTON: People ask us a lot about look and feel. Look and feel cannot be as important as the record content. And so that's another message that we have to get out there. It would be great if we could have these all look exactly as they did in the 1990s or 80s or 70s. That's not always possible. The content is what's important. This is an example of the format plans.
LESLIE JOHNSTON: So you can see some of those same things that we had in our matrix, but it's here's our Canon RAW, here's its MIME type, here's a specification, here's other resources out in the community that we think are authoritative that also describe those formats. And then, what is our plan? Do we retain it? How do we describe and justify it and what are our tools?
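The fields listed for a format-level plan (description, MIME type, specifications, community resources, risk, recommended action, justification, and tools) map naturally onto a small record type. The Python sketch below is one possible shape, not the schema NARA publishes; the class name `FormatPlan`, the field names, and all example values are placeholders.

```python
import json
from dataclasses import asdict, dataclass, field

ALLOWED_RISK_LEVELS = {"low", "moderate", "high"}


@dataclass
class FormatPlan:
    """One format-level preservation action plan (hypothetical schema)."""

    format_name: str
    mime_type: str
    risk_level: str
    description: str = ""
    specification_urls: list[str] = field(default_factory=list)
    community_resources: list[str] = field(default_factory=list)
    recommended_action: str = "retain"
    justification: str = ""
    recommended_tools: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject anything outside the three coarse risk levels.
        if self.risk_level not in ALLOWED_RISK_LEVELS:
            raise ValueError(f"unknown risk level: {self.risk_level!r}")


# Placeholder values only; these do not describe a real format or plan.
plan = FormatPlan(
    format_name="Example Camera RAW",
    mime_type="image/x-example-raw",
    risk_level="moderate",
    specification_urls=["https://example.org/spec"],
    recommended_action="retain",
    justification="Placeholder justification text.",
    recommended_tools=["example-converter"],
)
serialized = json.dumps(asdict(plan), indent=2)
```

Keeping plans as structured records instead of free-form documents makes it easier to publish them, compare them across quarterly updates, and later transform them into other representations.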
LESLIE JOHNSTON: So these are the important plans for us. We released these plans in 2020 to the public. So our digital preservation framework, which is our risk matrix, the record type and the Preservation Action plans are all available on GitHub, and we update them quarterly because we do the ongoing work to look at these and we always love to get feedback. We always love to hear if people and other organizations have used it in their work and it's really gratifying to find that they have.
LESLIE JOHNSTON: We are also preparing to release this as linked data. That's a really interesting pilot for us to think about how we track things in spreadsheets and Word documents and transform those into a data model and into linked data. We actually shared a pilot with about two dozen community experts. We've now gotten all the feedback in and we're in the process of actually getting that feedback together, making a new version and we're particularly working closely with the team at Wikidata for digital preservation because that's become the hub for our community, and where we want to make sure that we can plug-in and share our information.
LESLIE JOHNSTON: So anyone looking for information on a format or risk can find it from any of the organizations that are doing this work. We also assess our own program. We use the ISO standard. That's very important to us. We use the instrument that was developed by PTAB in the EU. We did an assessment in 2019. We then did it again in 2021.
LESLIE JOHNSTON: We know we have gaps in our documentation and in our systems. But when we repeated the self-assessment, we had actually improved. We actually have almost no metrics that are not met. We're still mostly at partially met and we have a lot more that we now have met, and a lot of that was through documentation, because documentation is a huge part of assessment. Do we know what we do?
LESLIE JOHNSTON: Do we know how we do it? Is it a documented and repeatable process? And we're going to continue to do this on an ongoing basis. So this is part of that lifecycle that we're going to continue to work through. So we're going to keep addressing our gaps. We're going to keep assessing our file formats. We're going to keep updating our systems and really never stop moving, kind of like a shark.
LESLIE JOHNSTON: We got to keep moving through the waters, we got to keep doing new things, we need to keep reviewing our work and making sure that our program maturity improves a little bit every single year. So thank you.
PETER MURRAY: Oh, thank you, Leslie, for sharing your experience and expertise. That was great. Our next topic is on the development of the NASIG Model Digital Preservation Policy, and on this topic we have Alicia Wise. Alicia is the executive director of CLOCKSS. Alicia?
ALICIA WISE: Thank you very much. Let's get some slides here. There we go. Right. Hello, everybody. We've just been hearing a terrific inspiring presentation from Leslie on the very sophisticated approaches to digital preservation at NARA. And this presentation will be at the other end of the spectrum, for organizations that are just getting started with digital preservation, and particularly, perhaps for smaller publishers, library publishers, scholarly publishers who have never before had a digital preservation strategy or policy at all.
ALICIA WISE: And I'm going to be presenting today on behalf of the NASIG working group that's responsible for having developed a model preservation policy. Let me just explain a little bit who's on that group. There we go. So we have a working group that consists of both librarians and publishers, and it's a little bit cheeky that I'm presenting on behalf of the group today because I'm actually the most recent member.
ALICIA WISE: I've come along after they produced the first draft of the model policy. Currently we are revising that based on community feedback to a shared draft, and it's really the very hard work over multiple years from this group of people in front of you that enables me to talk to you today on their behalf. So let's take a step back. If you're new to digital preservation, what is it at its most basic level?
ALICIA WISE: And if you're a scholarly publisher, how should you think about it? Oftentimes it seems quite overwhelming. There's a lot of technology that's discussed, file formats that change over time. But at its most basic, digital preservation is a commitment you're making to the future, to users in the future. You are committing your organization's time and attention and resources to actively managing your outputs, whether that's a scholarly publication or some other kind of resource that will have continuing value.
ALICIA WISE: And as a result, digital preservation is really a set of decisions made now and going forward into the future, if the content is to continue to be accessible and usable. So a preservation policy. And the reason we have developed a model preservation policy for organizations just getting started is because it's really helpful to just have some time to think and to think about what your organization's commitments are, what the decisions are that it's making now.
ALICIA WISE: It enables you to be thoughtful and deliberate, and also there are always trade-offs. You can't preserve everything. It's an opportunity to document what you can do, what you can't do and where you look to partner with other organizations, and to make those choices and decisions explicit, and hopefully intentional rather than unintentional. So digital preservation, in some ways, is a really long term answer to some short questions.
ALICIA WISE: In a scholarly publications context, those questions are if I publish or cite this article, this book, what will the future readers see in 150 or 500 years, and how close will their experience be to what I see now and what I intend now? OK. So first and foremost, I think Leslie made this point as well-- can we preserve everything?
ALICIA WISE: Probably not, especially if you're an organization just getting started in this field, and if you're a publisher, not primarily a memory organization like a library, archive, or museum. Digital preservation done well is expensive. Resources are finite and you just have to prioritize. And, crucially, decide what you're not going to do and then seek help and support from specialist organizations who can help you look after that stuff that you're not able to look after yourself.
ALICIA WISE: So some of the questions that have informed the development of the NASIG model digital preservation policy are these. There are a series of prompts and tools that enable you, if you are a digital preservation champion in your organization, to start a conversation more widely throughout the organization about what your future users might need, what they really need. As Leslie said, the content-- is that most important?
ALICIA WISE: Or is it the look and feel? Which version do you need to preserve? For a journal article, [INAUDIBLE], do you need only the final version? Do you need all the versions of preprints that might have been shared? Do you need an author's accepted manuscript? Do you need all of those? What do you need, what do you want?
ALICIA WISE: Does the content exist independently of the software through which it is delivered, or does that content-- say it's an enhanced e-book which we'll be hearing about in the next presentation-- does it call out to software on other platforms in other services, and if so, how are you going to capture those kinds of connections?
ALICIA WISE: Metadata-- totally important, because if you have full text content, it isn't really usable if you don't know who wrote it and how it was disseminated, and if you've got a software or a piece of code that is integral to that, you need to know all sorts of things about the file format in order to access it and continue to make that reformatable and usable over time. And emphasizing again, it's unlikely you'll be able to do everything, so what partners do you need, what skills do you need in-house and what skills do you need to look for in those partners?
ALICIA WISE: What resources can your organization bring, not only now, but what's realistic and sustainable and scalable into the future, and what partners do you need to make sure that things are sustainable? And libraries particularly have a role, I think, to encourage their authors on campus, for example, researchers, and the publishers who might disseminate those works, whether they're library publishers or university presses or scholarly societies or commercial publishers, to really think about making sure that those outputs are looked after properly for the very long term.
ALICIA WISE: OK. Let me tell you a little bit about the model policy itself. So it was born in 20-- well, it began to be born, it was conceived, in 2018, when the NASIG digital preservation committee issued a survey to see what its priority should be. And the results really suggested that many organizations involved in scholarly communications, particularly publishers, didn't have any policy at all for preservation of their outputs, but many libraries didn't either.
ALICIA WISE: So a key recommendation from that survey and its analysis was for the group to develop a model policy or template, and that's when the working group that I'm representing today was born. The working group has been very open. It's very diverse. It's very inclusive. Like I said, it involves libraries and publishers, particularly representatives from the Library Publishing Coalition and the Society for Scholarly Publishing.
ALICIA WISE: They actively began working in-- at the height of the first round of the COVID pandemic in the middle of 2020, and they've asked me to really acknowledge early presentations that they found inspirational from people like Heather Staines, Jeremy Morse, and Wendy Robinson, some of whom we've co-opted onto the group, which is just great. They reviewed lots of existing policies and samples and templates and resources and then have really kind of tried to coalesce this into an easy to use and accessible model policy.
ALICIA WISE: And we envisage that libraries and publishers will use this policy to get started in digital preservation, to really document their organization's mandate and how their commitment, scope, and goals for digital preservation support that organization's mandate. It's not going to go into too much detail. There will be additional need for platform level or repository level policies and procedures, and we also don't expect that the model policy will fit all needs.
ALICIA WISE: Context, resources, and content vary between institutions and so policies and approaches need to vary as well. So the policy is very much under construction. We had a terrific response to an open survey we did in the autumn of 2021. The committee is currently revising it with the aim of launching the model policy at this year's NASIG conference which takes place in Baltimore and online in early June.
ALICIA WISE: And this is our current outline, but it is a work in progress so things may change just a little bit. If I go through these different sections very quickly, this is really to outline for your institution what your executive summary would look like. Why do you have a digital preservation policy now? With some instructions and guidelines about how to use the document, actually how it's structured, and where to find what you need.
ALICIA WISE: And then, there's a place for your organizational context. Are you in a teaching intensive institution? Are you in a University press or library press? What are your strategic pressures and drivers? What's your mandate? The scope is really critical, and this is where reflection is required. What will you preserve, what won't you preserve, what's the scope of your preservation activities?
ALICIA WISE: And then we get into sections, and the text of the model policy will guide you through these discussions internally to help you identify your strategies, policies, different roles and responsibilities, both inside the organization and who you might collaborate with or partner with externally, how you administer and review the policy, and then there are ready sections that you could embed in your own policy-- related documents, external things like works from the Digital Preservation Coalition, for example, and a glossary-- so the sample text and the guide to decision making and conversation.
ALICIA WISE: And that's really it. If you are in a mad hurry, if you are tasked with developing your own preservation policy now, there's a consultation draft that's online, so you can have a look at it, but it will change quite a lot before it's finalized and launched in June at the NASIG conference in Baltimore. And with that, thanks very much.
ALICIA WISE: I'm looking forward to the discussion session. Back to you, Peter.
PETER MURRAY: Thank you for that preview of that work. Looking forward to that finishing up later this year. Our third topic is on COPIM's Work Package 7 on the archiving and preservation of open access books, and we have Miranda Barnes. She is the research associate for archiving and preserving open access books at Loughborough University. I might have mispronounced that.
PETER MURRAY: Miranda.
MIRANDA BARNES: Thank you very much, Peter. I'm just bringing my slides up. Hopefully you can all see those fine. It's actually Loughborough, but that is very much a stumbling block for many folks. So thank you very much again for the introduction, Peter, and also for Alicia and Leslie, for your wonderful sessions earlier. So yes.
MIRANDA BARNES: The COPIM project, as it's also known, is the Community-led Open Publication Infrastructures for Monographs. It's an internationally funded partnership of researchers, universities, and established open access presses. This is a scholar led consortium which includes Mattering Press, meson press, Open Humanities Press, and Open Book Publishers as well as punctum books.
MIRANDA BARNES: So there's also libraries, and other infrastructure providers that are a part of that. Sorry. I managed to lose my slides here. All right. Very sorry.
PETER MURRAY: Yeah. It seems to have gone out of presentation mode in PowerPoint.
MIRANDA BARNES: Right. Yes that's probably what happened. There we go. Sorry about that. And so COPIM is just dedicated to investigating the difficulties that impede the progress of smaller publishers interfacing with large scale organizations and processes. So we are looking at one to three person teams putting these monographs out rather than the larger publishers.
MIRANDA BARNES: So we're looking at these publishers that don't have some of the already existing infrastructure and support. And so it's not for profit. COPIM's focused around open infrastructures, and also the concept of scaling small. And in terms of work package 7, it's focused on archiving and preservation. There are seven work packages in COPIM.
MIRANDA BARNES: This is looking at funding models, business models, governance, all sorts of other things, but work package 7 is around the archiving and preservation of open access monographs. So it is looking at the key challenges associated with the archiving of research monographs in all their variation and complexity, but specifically looking at some of these monographs that are more complex than just straightforward texts.
MIRANDA BARNES: So some of the questions that we're looking at are: what are the boundaries of a book, how are complex digital monographs preserved, and what are the risks to archived content? So in terms of the policy landscape, it must be said that while COPIM is an international project, what I'm framing here is the UK context, because the UKRI has announced a new open access policy for all of their funded projects.
MIRANDA BARNES: That now includes monographs, so any monographs published on or after the 1st of January 2024 must be made open access within 12 months of publication. So this is very influential, and it is expected that the Research Excellence Framework may follow with their own open access policy that will require a similar open access mandate for monographs.
MIRANDA BARNES: So that's setting the scene here for the fact that any outputs that are submitted to the REF, which happens every seven years in the UK-- so any scholarly monograph that is submitted to that may need to be open access. So this includes complex digital monographs. So complex digital monographs are born digital publications that cannot be easily replicated in print form.
MIRANDA BARNES: So they may interact with different parts of the content in other parts of the internet, they may link to content on a video platform, they may have embedded content within them. So it's really just going beyond this traditional idea of what a book is, but they are an important part of particularly digital humanities or creative practice works where these boundaries are being pushed.
MIRANDA BARNES: So as I mentioned, embedded content-- these may include images, videos, audio files, geospatial data, 3D models. But this is by no means a full list of the possibilities. These are just some of the ones that we are aware of as well as linked content, QR codes. So what are the boundaries for preservation? Whose responsibility is it to preserve this content and how is it preserved, where is it preserved?
MIRANDA BARNES: This is particularly an issue for the smaller scholarly publishers who don't have a supporting organization to even begin to archive the content somewhere. So we're looking at it from that perspective. But these questions are going to affect all publishers including university publishers. So in terms of file formats, this is again, not a comprehensive list. This is just a very basic overview of common ones.
MIRANDA BARNES: So it won't be anywhere near as complicated as what Leslie deals with on a day to day basis. But in terms of the formats that publishers tend to use, the PDF is really the main one, particularly again for the smaller presses. It's most widely adopted. I mean in that way, that means the PDF will probably stick around for a long time, so that's a benefit, but it also closely mimics the familiar book format.
MIRANDA BARNES: However, there are some preservation implications for embedded content. One of the ones that we've come across is that as a container, PDF isn't actually recognized as a container by the software that scans the files in archives in many cases. So rather than an XML or a zip file which will be recognized as a container and all the files within it will be scanned, the PDF doesn't benefit from that.
MIRANDA BARNES: There's also, as Leslie mentioned earlier, many different PDF formats. So this PDF will return to haunt us later on, I'm sure, but that's something that's important to look at. XML is preferred often by key preservation players. It integrates data, metadata, and infrastructure. It is more ideal for long term preservation than PDF, and relationships between content can be defined.
MIRANDA BARNES: The problem that we note is that many small publishers don't have the tech resource or staff resource to convert their monographs into XML on a regular basis. So while it is preferred in many ways, it isn't as commonly used. And then EPUB, MOBI, they're often offered by some of the publishers, but they are proprietary formats, which means there are of course complications when it comes to open access and infrastructure.
MIRANDA BARNES: Another question that we're looking at is an access versus an archive copy. So because complex digital monographs may contain multiple file types, sometimes in large sizes, the discussion continues about whether you have an access copy versus an archive copy, and that you have both of them in play. So for many access users, a smaller, simpler file package could be sufficient.
MIRANDA BARNES: But for full preservation, to make sure all of those component parts continue to stay usable into the future in the longer term, a different option might be preferable. So just leading into some of the work we've been doing. So Open Book Publishers is a scholar-led open access press publishing monographs, but they are also a partner on the COPIM project. They publish about 45 new titles every year, and we have been using some of their monographs in our workflow experimentations, which I will talk about now.
MIRANDA BARNES: So Diderot's Rameau's Nephew and Rudy's Image, Knife, and Gluepot are two of the monographs that we've been using to test a manual input into a repository workflow. So while repositories are not preservation in the classical sense, as I mentioned earlier, some of these publishers have no active preservation workflows. So this is somewhere that their content could be placed as an archive possibility.
MIRANDA BARNES: So again, we tried this out. It was a manual ingest option. But we are also working with work package 5 colleagues on an API that could potentially automate ingest, because, again, that staff resource issue for small scholarly presses-- manual deposit is too time consuming when it comes to that limited staff resource. So Rameau's Nephew has-- this is a representation of the additional enhanced or additional content that goes with it.
MIRANDA BARNES: So there are 13 audio files that went along with it. So those are individual records in this collection where we did the ingest, and then there are also five additional text files that went with the collection outside of the four types of files that represented the book itself. So we had an XML, we had an EPUB and MOBI, but we also had the PDF.
MIRANDA BARNES: So that was the more complicated one that we put together. This again was the second one with Image, Knife, and Gluepot. This one is primarily-- the embedded content is imagery rather than the more complex audio files. But just to pop back to this other one. So one of the things we were talking about in terms of how this might all work together is, how do we put these things in a package, how do we indicate those different relationships and connections, do we assign it a DOI or not?
MIRANDA BARNES: So again, this is probably throwing up more questions than it is answers, but these are questions that are going to concern others in the community, certainly. So other questions are what other repositories might we experiment with and what are the metadata differences in the different repositories versus the original publishers. So just a little bit of background. Open Book Publishers archives its content in Portico by way of OAPEN.
MIRANDA BARNES: So in terms of how that works, every six months, new books and book chapters are added to Portico by OAPEN. The book files are packaged into zip files with the metadata. This is automated through a macro document, and then once material is sent to Portico, there's not actually a lot of interaction or knowledge about what happens once it's in there in terms of OAPEN. So this third party preservation option and archiving option has just also thrown up some additional questions for what happens to the metadata, what happens to the files, and that sort of thing.
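[Editor's illustration] The packaging step described above, book files zipped together with their metadata, can be sketched roughly as follows. This is only a minimal sketch and not the actual OAPEN or Portico workflow: the function name `package_book`, the flat layout, and the `metadata.json` file name are all assumptions made for illustration.

```python
import json
import zipfile
from pathlib import Path


def package_book(book_files, metadata, package_path):
    """Write the given book files plus a metadata record into one zip package.

    Hypothetical helper: the layout and the 'metadata.json' name are
    illustrative assumptions, not a documented deposit format.
    """
    with zipfile.ZipFile(package_path, "w", zipfile.ZIP_DEFLATED) as package:
        for file_path in book_files:
            # Store each file under its bare name so the package is flat.
            package.write(file_path, arcname=Path(file_path).name)
        # Keep the metadata alongside the files so the two travel together.
        package.writestr(
            "metadata.json", json.dumps(metadata, indent=2, sort_keys=True)
        )
    return package_path
```

Keeping the metadata record inside the same package as the files is one simple way to make sure the two are not separated when the package is handed to a third party.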
MIRANDA BARNES: So we are currently investigating all of that. But it leads to concerns about the leaky metadata pipeline. That's what we're calling it internally. Basically, we don't know if the metadata that OBP holds is exactly the same as what OAPEN has, and then what happens to that metadata once it's in Portico. So we just are going to be looking at ways of comparing those using Thoth, I believe.
MIRANDA BARNES: And seeing what is preserved, what is lost, and what the implications are for that. So I just mentioned Thoth. So Thoth is a part of work package 5. It is an open metadata management and dissemination system. It is open source software. It is currently being used by two publishers for their day to day metadata management, but we're also looking at it as a way of automating the ingest of content into repositories.
MIRANDA BARNES: So there's some more background in the links on the slides, so if you want to look at that in the future, please do. So in terms of what we've learned so far, which I think is important for the direction going forward, the first thing is around file formats: how they are packaged and how they are preserved are going to directly influence each other.
MIRANDA BARNES: And that PDFs, while widely used, present complications to preservation that we plan on digging into much more thoroughly, particularly the differences in functionality between the different PDF formats. That there can be a loss of metadata between parties, which is particularly a concern for small OA presses who depend on third parties for preservation-- so again, we're going to be looking in more granular detail at what the different metadata types are, what fields are lost and what are retained, and how that can be perhaps remedied or looked at in terms of improvement, perhaps via the use of Thoth in the future.
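[Editor's illustration] The field-level comparison described above, checking what metadata is lost and what is retained as a record passes between parties, can be sketched as below. This is a minimal sketch under stated assumptions: it is not Thoth's API, the function name `lost_fields` is hypothetical, and records are assumed to be flat dictionaries of field names to values.

```python
def lost_fields(source, downstream):
    """Compare two flat metadata records field by field.

    Hypothetical helper: returns the fields of `source` that are missing
    from `downstream`, and those present in both but with a different value.
    """
    missing = sorted(key for key in source if key not in downstream)
    changed = sorted(
        key
        for key in source
        if key in downstream and downstream[key] != source[key]
    )
    return {"missing": missing, "changed": changed}


# Invented example records for illustration only, not real publisher data.
publisher_record = {"title": "Example Book", "license": "CC BY", "funder": "X"}
aggregator_record = {"title": "Example Book", "license": "CC-BY"}

report = lost_fields(publisher_record, aggregator_record)
```

Running the same comparison at each hand-off (publisher to aggregator, aggregator to archive) would show at which stage a given field drops out.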
MIRANDA BARNES: And also there may be specific challenges for small publishers in preservation in terms of file formats, but also pathways to preservation and resource, and how can we look at that and see that option of scaling these needs and their solutions? And that we have learned that there does need to be a tiered approach to best practice guidelines with this in mind. So next steps.
MIRANDA BARNES: We need to determine what those tiered preservation solutions are for various levels of resource. Good, better, best. Again, that good enough-- we need to figure out what is good enough and what is better, and what people can choose depending on their level of resource or what publishers can choose. What formats are best for preservation and what the reasons are for this, what the pitfalls are for the well-used file formats, and perhaps work with other community partners on finding ways to resolve that, because the PDF isn't going anywhere.
MIRANDA BARNES: And again, I already mentioned finding out more about what metadata is lost and that leaky pipeline. And also, how content will appear, be accessed, or be uploaded into other repositories and what impact that will have on the tools that we're building for that potential option of automation. And also what more can we learn about repository archiving workflows through the practical experimentation that we're undertaking, and last but not least, how can we use this knowledge to advocate for scalable archiving solutions for scholarly monographs.
MIRANDA BARNES: And I think framing all of this is, what are the longer term solutions? What can the community build for all levels of monograph publishers, because this is a part of the scholarly record that we absolutely must preserve for future scholars and future publishers. So that is all for me today. So thank you very much for listening, and I also look forward to the conversation.
MIRANDA BARNES: Thank you.
PETER: Great. Thank you, Miranda, as well as Alicia and Leslie earlier. This concludes the prepared recording part of this session. We'll now move to the interactive part for reactions and discussions on these topics and related ideas. One of the things that we'll be doing at the end of the discussion session is to identify one or two main ideas to take forward. As NISO identifies areas of new standards, best practices, publications, or educational topics, we also want to find people interested in taking these ideas and drafting more concrete proposals for the NISO topic committees to consider.
PETER: With that, we'll close out this recording, and we'll see you in a moment.