Name: Archiving and preservation of unusual born-digital objects -NISO Plus

Description: Archiving and preservation of unusual born-digital objects -NISO Plus

Thumbnail URL: https://cadmoremediastorage.blob.core.windows.net/3fc56d7e-053c-434d-8d09-183eed8458e6/thumbnails/3fc56d7e-053c-434d-8d09-183eed8458e6.png?sv=2019-02-02&sr=c&sig=%2Fb1ze2tQvLAI9SAGhQfTIjN%2FvlpDXqUUTAY6FlBoi1s%3D&st=2024-05-05T18%3A29%3A02Z&se=2024-05-05T22%3A34%3A02Z&sp=r

Duration: T00H48M36S

Embed URL: https://stream.cadmore.media/player/3fc56d7e-053c-434d-8d09-183eed8458e6

Content URL: https://cadmoreoriginalmedia.blob.core.windows.net/3fc56d7e-053c-434d-8d09-183eed8458e6/Archiving and preservation of unusual born-digital objects -.mp4?sv=2019-02-02&sr=c&sig=5UO4kM7%2FA9z94Wamxv9Jwu2Umqt49nTSqMh0zNIv10I%3D&st=2024-05-05T18%3A29%3A02Z&se=2024-05-05T20%3A34%3A02Z&sp=r

Upload Date: 2022-08-26T00:00:00.0000000

Transcript: Language: EN.
Segment:0 .
[MUSIC PLAYING]
SALWA ISMAIL: Greetings to everyone joining us. We're delighted you're here attending NISO Plus 2022 and our session on Archiving and Preservation of Unusual Born-Digital Objects. I'm Salwa Ismail. My pronouns are she and her, and I work with University of California, Berkeley. I will be moderating the session. And we have two phenomenal speakers, Euan Cochrane, of Yale University Library, and Jane Winters, from University of London.
SALWA ISMAIL: The move to digital has enabled the development of all kinds of innovation and valuable forms of content. But preserving these new and unusual types of born-digital objects presents new and unusual challenges and a different way of normalized standard setting. In this session, our speakers will discuss, from their perspective, the expertise and nuances, along with the challenges, around preservation and then access, where we preserve to be able to access, discover, and use for posterity of these born-digital materials.
SALWA ISMAIL: Both Euan and Jane will speak for about 20 minutes each, and then we'll open up the Q&A session for you. Before I pass the virtual mic onto Euan and Jane Winters, let me introduce them to you. Euan is the Digital Preservation Manager and is responsible for the Library's Digital Repository, and provides related preservation services to the University.
SALWA ISMAIL: He established digital preservation services through Preservica and Emulation as a Service. He has particular interest in software preservation and the use of emulation to maintain access to born-digital content, and is currently involved in grant-funded work in both these areas. He's also on the Governance Group for the Software Preservation Network. Jane Winters is the Professor of Digital Humanities and Director of the DH Research Hub at the School of Advanced Study, University of London.
SALWA ISMAIL: She has published most recently on non-print legal deposit, the values of web archives, born-digital archives, and the problem of search and archiving, an analysis of national domains. Jane's pronouns are she and her. And with this, I will pass the virtual mic onto Euan.
EUAN COCHRANE: Hello, everyone. My name is Euan Cochrane. I'm the Digital Preservation Manager at Yale University Library, and the co-PI on the EaaSI Program of Work, which I'll be talking about today. So today what I'm going to do is give an overview of software preservation and what it is we do in the Emulation as a Service Infrastructure Program of Work to enable people to preserve software and do so at scale.
EUAN COCHRANE: I'll talk about some of the challenges we face and how the EaaSI Program is addressing those. So software archiving is an integral approach to archiving and preservation of unusual born-digital objects, and software is itself a unique object to preserve. For these reasons, I think it's a really relevant topic to this audience. So why do we want to preserve software at all?
EUAN COCHRANE: Well, I think most of us have experienced something like this. This is an older example, but what you're seeing here is a file-- in this case, a spreadsheet file-- that's opened in one piece of software and then moved to another computer and opened in a different piece of software. In the first piece of software, you don't see the comment that is visible in the second piece of software, because that software is taking the same file but presenting the information differently, adding different formatting and making it so that the user gets an actually different information experience.
EUAN COCHRANE: So it's not just the look and feel, but the information presented to the user is different. People experience this in modern times when they, say, take a file that was maybe created on a Windows computer and try to open it on a Macintosh computer. This is an older example from the Windows 98 era I believe, but it illustrates the same issue. What it illustrates is that it's really important to make sure you have access to the original software when you are presenting a digital object as evidence or as an historic item.
EUAN COCHRANE: Because without it, you may find that the information that gets presented to the user is wrong, changed, incorrect, or possibly even deceptive. I mean, here's another example. This is where we took a file that was created in WordPerfect for Windows 95 I think, and then opened it in modern software. And you can see the information is just completely changed. You're not getting the information that was in the original.
EUAN COCHRANE: But the file is identical. It's the exact same file that's opened in both pieces of software. And this problem gets worse the larger the gap in technology. And that gap could be because the two applications you're trying to open the thing in are very different but contemporaneous, or it could be that there's a huge amount of time between when the original application was published and used to create the object and the time when you're trying to open it in a more modern application.
EUAN COCHRANE: The bigger that gap, the more likely it is that something will have changed, such that the information that the user gets presented with has changed as well. And in many cases, it can also be completely inaccessible, insomuch as, over time, as that software is no longer available, the original software, you can find many file formats that there is no software, no contemporary software, to open the objects with.
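Illustrative sketch (not part of the talk): the point that "the file is identical" can be checked mechanically with a checksum comparison. The minimal Python example below assumes hypothetical file names and contents. It only confirms that two copies are byte-for-byte the same, which is exactly why it cannot detect the rendering differences described above: those come from the software, not from the bytes.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(first: Path, second: Path) -> bool:
    """True when both files have identical content (equal digests)."""
    return sha256_of(first) == sha256_of(second)


if __name__ == "__main__":
    # Hypothetical stand-ins for "the same file on two computers".
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "report_original.bin"
        copy = Path(tmp) / "report_copy.bin"
        original.write_bytes(b"hypothetical document bytes")
        copy.write_bytes(original.read_bytes())
        print("identical bytes:", same_bytes(original, copy))
```

A matching digest says nothing about what a given application will show on screen, so a check like this complements access to the original software rather than replacing it.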
EUAN COCHRANE: Here's just one more example. So what we see here is something like what you might get if you migrated a file between file formats, which is another approach that's used often in digital preservation to make sure you can still access the content in an object. You take the content from one file and save it into a different file that's more compatible with modern software.
EUAN COCHRANE: However, the interesting thing about this example is, in the original, there was information in the file that was related to the way the software is meant to format the content and present it on the page. And when we opened it in modern software-- this is from some research I conducted in the late, like around 2010, so we were using Microsoft Office 2007 as the modern software then-- anyway, the modern software interpreted that formatting information to say we should put private tags around the title.
EUAN COCHRANE: And that kind of added more information to the thing, and incorrect information. It implies that that object, and especially by putting it around the title, that that object should be private. And it was never there. That's not something that was in the file. It's the way the software interpreted the file that added that information to the experience that the user got to interact with.
EUAN COCHRANE: So one of the big reasons for preserving software is to be able to maintain access to the original, so you can open your objects in the original to ensure that nothing has changed in what is presented to a user in the future. Another big reason, of course, is to be able to keep the software around, and if anyone ever wants to do research using that software on the software itself, then we certainly want to be able to keep that information around for that.
EUAN COCHRANE: And just to drive the point home, here's another one. This is an interesting one, because often when you're taking migration, which is a strategy to keep your information available, you will try and look in the file and say, OK, this file has these features, maybe it has images in there, so let's check the migrated file to make sure that there are images in there and that they're in the same place.
EUAN COCHRANE: Well, in this case, the user has used the software in a way where they've created an equation in the document, and they've used just spacing and line breaks to put in exponentials on the line above the equation-- you can see 2, 3, 4, 5, B, 7, on the left as the exponentials for the equation below, and they line up with the gaps and the brackets, and so on. When you open this in any other software, contemporary software or modern software, that equation-- the lines move around, the formatting means that everything moves around, and the equation no longer makes sense.
EUAN COCHRANE: And why that's important is because it's not something we could have looked for. It's not a feature that a user used, so that we could say, OK, let's check that-- find all the features that were used in the original, say automatically, and then confirm that they're there in the migrated version. It's not possible, because this wasn't a specific feature. It was just general functionality of the software that was used to do this.
EUAN COCHRANE: And so the alternative would be to manually check every single item. And that is just not feasible. The amount of items that we have to preserve these days is in the hundreds of millions, billions even. At Yale, so far in our archive, we've got over 120 million files. If we had to migrate them all and check every single one, we just simply would not have the time.
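Illustrative arithmetic (not part of the talk): a rough sense of why manual checking does not scale. Only the figure of 120 million files comes from the talk. The 30 seconds per file, the 8-hour working day, and the 250 working days per year are assumptions chosen purely for illustration.

```python
def person_years(file_count: int,
                 seconds_per_file: float,
                 hours_per_day: float = 8,
                 days_per_year: float = 250) -> float:
    """Estimate the person-years needed to inspect every file by hand."""
    total_hours = file_count * seconds_per_file / 3600
    return total_hours / (hours_per_day * days_per_year)


# 120 million files (from the talk) at an assumed 30 seconds each.
print(person_years(120_000_000, 30))  # 500.0
```

Under these assumptions a single reviewer would need about 500 working years, and every added second per file increases the total proportionally.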
EUAN COCHRANE: And I mentioned software that-- sorry, objects where the software-- there's no longer any software that actually opens them. A lot of CAD, computer-aided design, files and software are like that, where the software is backwards compatible with the next generation-- usually is-- but the backwards compatibility has often been cut off at some point in time, so you lose access to the earlier versions.
EUAN COCHRANE: Another example is Microsoft Chart. There's no software anymore that opens Microsoft Chart files, which were a way of saving a chart as a single file rather than as part of a spreadsheet or workbook. So what we're seeing in this example is just a diagram of a building on Yale's campus, and this is the CAD drawing of it. Since we had the original software, we can zoom in. We can view the details.
EUAN COCHRANE: We can view it from different angles. There's a lot of options that we have when we're able to interact with the thing using the original software. So challenges. There are a lot of challenges to making sure we have access to the software over time. For one, just finding the software-- if you're in an archive library or any kind of memory institution, finding legacy software is difficult.
EUAN COCHRANE: And no single organization can collect all the software and hardware that would be needed to keep all of the digital objects that we have available over time. It's just too big of a job. So what we've been focusing on is figuring out how to address this. We really need a solution that enables a community-based approach to finding, acquiring, collecting, and then sharing the software that is out there and that we need to keep access to our digital objects over time.
EUAN COCHRANE: Another challenge we've come across with this endeavor is that of copyright culture and digital rights management, technologies associated with software that's been distributed on installation media. An example I came across when putting together an exhibit with some legacy CAD, computer-aided design, software in it was the software required you to have a hardware dongle. In this case, it was a USB drive.
EUAN COCHRANE: You plug it into your computer, and it had to be there at all times when you were running the software, otherwise the software wouldn't work at all. And many companies, regardless of how old the software is, still keep quite a tight control, or try to keep quite tight control, over who's allowed to access it and what people are allowed to do with it.
EUAN COCHRANE: And this is a big challenge for those of us that are just trying to use it as a tool to make sure the content is still available in the future. Thirdly, there's not much metadata out there about software, about its requirements and capabilities. So if we want to do things like ask questions, like, what software could open this particular file or what might have been the creating application for the software, there's not really a place we can go and look that up and easily get an answer to that.
EUAN COCHRANE: And even just what file formats can this particular software open, there's very little metadata about that out there. And then secondly, if we want to go and try and run that software in a virtual or emulated computer, knowing what virtual hardware we would need to configure to make that happen, it's difficult, because, again, there aren't good databases of that software out there-- or databases of that metadata, I should say.
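Illustrative sketch (not part of the talk): the kind of lookup that the missing software metadata would make possible. Everything here is hypothetical, including the record fields, the product names, and the format identifiers. It is not the EaaSI data model, only a minimal picture of the questions "what software opens this format?" and "what virtual hardware does that software need?".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoftwareRecord:
    """Hypothetical metadata about one version of one application."""
    name: str
    version: str
    operating_system: str
    opens_formats: frozenset
    min_ram_mb: int


# Invented entries; real records would need to be researched and verified.
REGISTRY = [
    SoftwareRecord("ExampleCalc", "1.0", "ExampleOS 3", frozenset({"calc-v1"}), 8),
    SoftwareRecord("ExampleCalc", "2.0", "ExampleOS 4",
                   frozenset({"calc-v1", "calc-v2"}), 16),
    SoftwareRecord("ExampleDraw", "5.1", "ExampleOS 4", frozenset({"draw-v5"}), 32),
]


def software_that_opens(format_id, registry=REGISTRY):
    """All records whose software claims to open the given format."""
    return [record for record in registry if format_id in record.opens_formats]


def formats_opened_by(name, version, registry=REGISTRY):
    """Sorted format identifiers for one application version, or []."""
    for record in registry:
        if record.name == name and record.version == version:
            return sorted(record.opens_formats)
    return []


def virtual_hardware_for(record):
    """Minimal emulated-machine settings implied by a record."""
    return {"operating_system": record.operating_system,
            "ram_mb": record.min_ram_mb}


if __name__ == "__main__":
    for record in software_that_opens("calc-v1"):
        print(record.name, record.version, virtual_hardware_for(record))
```

The difficulty described in the talk is not the query itself but populating a registry like this with trustworthy data for decades of software.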
EUAN COCHRANE: So a fourth challenge we've come across is that there is just so much complexity out there, so much variability, amongst the competing platforms that are needed to run the software, that's needed to access the content over time. And that means that while one group of practitioners may understand really well, say, how to run DOS programs and how to configure a virtual emulated computer to run a DOS program, they may not know how to run a Mac computer and the software on that from that period of time.
EUAN COCHRANE: So there's a lot of knowledge that is slowly being lost about how to run these things. And because there's so much complexity, there's a really long tail, which makes it a really big challenge. And we are trying to do things to address this by providing tools to capture as much of that knowledge over time as we can. So this is what EaaSI is about.
EUAN COCHRANE: EaaSI is about addressing those issues and making it so that we can preserve software and use it to access digital content over time. We started EaaSI in 2018. And we're trying to provide scalable emulation services, and services to enable people to share legacy software and pre-configured legacy computers between organizations that will run EaaSI. So as this slide says, EaaSI provides technology and services for the application of software emulation across a diverse spectrum of professional disciplines, organizations, and individual use cases.
EUAN COCHRANE: So just a note on emulation, in case you haven't come across the term before. It means creating a virtual computer in software on another computer, and a computer that may not be at all compatible with the physical hardware of the computer that's hosting that virtual computer. And what it allows you to do is create, say, an old Windows 95 computer, install Windows 95 on it-- and this is all just done in software-- and then use that emulated Windows 95 computer to open things that require Windows 95.
EUAN COCHRANE: So the core services of EaaSI are here. We enable people to go into this interface-- it's all web-based-- and look for a software or entire computers that have been pre-configured by the community. So you might say, I need Excel 95, so you can search for Excel. And you might find there's a few computers pre-configured, that already have it running and installed on them.
EUAN COCHRANE: It allows you to import your own content, software or objects, and then open those in that preserved and accessible and emulated software. And it allows you to create your own new environment. So you might take that Windows 95 environment with Excel on it and install AutoCAD, so that you can put the archives from an architect, say, into that environment and users will be able to browse both the spreadsheets that are on there and the CAD files that are on there.
EUAN COCHRANE: And you can save those environments. I mean, if you want to, you can publish them to the network of organizations that are running EaaSI, and they can then replicate them locally and use them themselves locally, solving that problem of finding and configuring legacy software. It also allows you to manage access and control how things are used, and provide end user access. One of the things we're almost ready to make available is providing being able to create an environment, add some content to it, and then provide a link to that to an end user.
EUAN COCHRANE: And they can just click the link, maybe there'll be a login in front, if you want to implement that, and they will see the content in their browser and be able to interact with it in that original software. And of course sharing and collaboration of both the software, the environments, and the metadata about them, that you can capture in the EaaSI software, is a key part of this.
EUAN COCHRANE: It really is. A big goal is making it so that it's really easy to get access to the thing you need to provide access to your digital content. And the sharing is what enables that. So yeah, benefits include: makes finding necessary software easy. So we pre-configure all the hardware emulators, so the things that provide the ability to install Windows or Mac OS, or whatever on them.
EUAN COCHRANE: So you don't need to understand that if you don't want to. There is the ability to customize that if you need to. But otherwise, you don't need to think about it. You just say, I need this software, and you can make it available. And it's all done by our web interface, and it makes all the processes involved much smoother than they've ever been. We're trying to make it as easy as possible and as approachable as possible for all sorts of different use cases.
EUAN COCHRANE: So this is a short video showing the interface. What you'll see here is an environment-- what we call an environment-- this is a Windows 95 environment. It is being loaded and accessed via a web browser. You can see, there's lots of ways you can interact with it. You can take screenshots. You can capture metadata about maybe if you've made a change to the environment and you want to save that as a new one.
EUAN COCHRANE: And if you need to restart the emulated computer or anything like that, you can do that. And I think what we're going to see here next is there's user management functionality in here. So if you have, say, been donated a computer from a famous person, an author, an artist or something, you could have taken a snapshot of the hard drive and [INAUDIBLE] it in there.
EUAN COCHRANE: And you could make it private to only authorized users. So users would be able to interact with it however they wanted, but they would have to be authorized to do so. All right, I think that's looped around once, so I'll move to the next one. And this is another one of the applications we've built using EaaSI. What this does-- I believe it's quite transformative-- it allows you to-- you can integrate this with your finding aids, access systems.
EUAN COCHRANE: This is just one example of how you might implement it. But what it does is you submit an object of some sort, it matches it to environments that are in EaaSI, that had the software that would be needed to open and interact with the object, and then it automatically opens that object in that software in the browser. And you can embed all of that in your access platforms so that they only see-- the end user can click on an object and have it automatically opened in the original software.
EUAN COCHRANE: We're seeing the first example I showed here, but interactively, which is that spreadsheet. So this is it being opened in Quattro Pro. But in the way we've configured this tool, in this example, you can just click to choose a different environment to open it in. And so if you wanted to in your organization, you could implement something where an end user could click on an object and then be given a list of options for how they open it.
EUAN COCHRANE: Because it may be that at the time that the object was created, there were multiple different ways the thing could have been opened, different applications. And it may be relevant for research to be able to try them, and see what might be different. Because we may not know what the original actually was. But as you see, opening it in Excel allows you to see that extra information that wasn't visible in the Quattro Pro version.
EUAN COCHRANE: So this is a [INAUDIBLE] of programming interfaces that can be integrated into your own access tools to make things available automatically in the original software. We're calling that the Universal Virtual Interactor. You could also save things as a new format and export them, which is what we saw at the end there. So you could use this functionality to do migration via emulation as well.
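Illustrative sketch (not part of the talk): the matching step described above, where a submitted object is paired with environments that contain software able to open it and the user may then pick among several options. All names, identifiers, and the matching rule are assumptions made for illustration. This is not the EaaSI or Universal Virtual Interactor API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    """Hypothetical pre-configured emulated computer."""
    env_id: str
    label: str
    installed_software: tuple


# Hypothetical map from a format identifier to applications that render it.
RENDERERS = {"sheet-v1": ("SheetApp A", "SheetApp B")}

ENVIRONMENTS = [
    Environment("env-1", "LegacyOS with SheetApp A", ("SheetApp A",)),
    Environment("env-2", "LegacyOS with SheetApp B and DrawApp",
                ("SheetApp B", "DrawApp")),
    Environment("env-3", "LegacyOS with DrawApp only", ("DrawApp",)),
]


def matching_environments(format_id, environments=ENVIRONMENTS,
                          renderers=RENDERERS):
    """List (env_id, label, application) options that can open the format."""
    wanted = renderers.get(format_id, ())
    options = []
    for env in environments:
        usable = [app for app in wanted if app in env.installed_software]
        if usable:
            options.append((env.env_id, env.label, usable[0]))
    return options


def default_environment(format_id):
    """First matching option, or None when nothing can open the format."""
    options = matching_environments(format_id)
    return options[0] if options else None


if __name__ == "__main__":
    print(default_environment("sheet-v1"))
    for option in matching_environments("sheet-v1"):
        print("alternative:", option)
```

Returning the full list rather than a single match mirrors the point that researchers may want to open the same object in more than one contemporaneous application and compare the results.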
EUAN COCHRANE: All right. So if I can jump slides. So I'm wrapping up now. The team, we've got quite a large team. These are the individuals that are directly involved. And then we have two or three different companies we're contracting with, PortalMedia for UI design, and Dual Labs, who have been doing a lot of the front end development.
EUAN COCHRANE: The main developers of the EaaSI emulation tooling are OpenSLX, and you see Klaus and Oleg there, who represent them. They're the geniuses that came out of the University of Freiburg, in Germany, who put together this whole platform for doing emulation in a web browser. But I should point out the rest of the team, who are just amazing and really have done all the work behind this. So yes, Seth, Ethan, Jessica, Katherine Cat.
EUAN COCHRANE: And we have a new one, who's not on here, sorry, Claire, who will have to get added next time around. All right. And thanks finally to our funders. So we've been funded for four years now by both the Mellon Foundation and the Sloan Foundation. And I believe it's one, if not the first, times they've worked together.
EUAN COCHRANE: And normally Mellon focuses more on the arts and Sloan on the sciences, but there are so many applications of this EaaSI software that they wanted to come together to fund us. In the sciences, there's a lot of things that this can be applied to in regards to reproducibility and research data management. And obviously in the arts, any kind of legacy objects that we need to make accessible for historic and cultural heritage purposes, and things like video games, this is a really good solution for.
EUAN COCHRANE: So I mean, there's obvious reasons why they were both interested in this, and we're really grateful for their support. And so that's it. This was meant to be presented by Seth, so his details are on there. But feel free to contact him. I'm sure he'd love to hear from you, if you have any questions about this.
EUAN COCHRANE: And I think we'll both be here for this session to answer your questions live. Thank you.
SALWA ISMAIL: Thank you, Euan. And with this, we'll now pass the virtual mic onto Jane Winters.
JANE WINTERS: Thank you very much. So I'm just going to share my screen, and start from the beginning. Hello, everyone. My name is Jane Winters and, as Salwa said, I'm a Professor of Digital Humanities at the School of Advanced Study in the University of London. And I'm also a historian by training as well. And that's really my starting point for engaging with web archives.
JANE WINTERS: I haven't presumed to talk about archiving and preservation because my role as a researcher is to benefit from the amazing work that goes on around archiving and preservation, so really I'm going to talk today about access and the access challenges around web archives in particular. And my starting point is this short quotation from Paul Koerbin, who manages the web archive at the National Library of Australia.
JANE WINTERS: And he wrote this in a fascinating edited volume, called Web 25, which was looking back on the first 25 years of the World Wide Web. Paul noted that, without access, preservation is little more than a costly and meaningless storage burden. I like to think that Paul is overstating here to make his case and provoke debate. Because actually, as researchers working with the archive, we've benefited enormously from the fact that it was archived before we knew what to do with it.
JANE WINTERS: So so much of the early years weren't lost while we waited around to try and work out how we could handle this particular kind of data. Most of you will be familiar with the Wayback Machine. And Brewster Kahle founded the Internet Archive in May 1996, and the first web pages were archived in October of that year. So we now have more than 25 years of web archives for researchers and others to work with.
JANE WINTERS: And you can see the 25 up at the top left of the screen there. But there have been frequent and often significant changes in web archiving during this period, both in the methods and technologies used and the institutions and organizations involved. The Internet Archive is no longer the only actor. Web archiving happens in a range of national libraries, and through philanthropic and community-based initiatives, which are growing all the time.
JANE WINTERS: And the IA's own practices have evolved and continue to change. So for a very long time, you could only find material if you knew the URL, but now you can do limited keyword searching across some pages in the site. So looking at some of the other players involved, just to give you a sense of the diversity of this landscape and some of the decisions that researchers have to make when deciding where to look for their data, these are just four examples.
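Illustrative sketch (not part of the talk): what URL-based lookup in the Wayback Machine amounts to in practice. The example builds, without making any network request, a snapshot address of the form `https://web.archive.org/web/<timestamp>/<url>` and a query for the Internet Archive's availability endpoint. The helper names and the example page are assumptions. Whether a capture exists for a given page and date can only be learned by actually requesting these addresses.

```python
from urllib.parse import urlencode


def wayback_snapshot_url(page_url: str, timestamp: str) -> str:
    """Address of the capture closest to a timestamp (YYYYMMDD or longer)."""
    return f"https://web.archive.org/web/{timestamp}/{page_url}"


def wayback_availability_query(page_url: str, timestamp: str = "") -> str:
    """Query string for the availability endpoint; timestamp is optional."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return "https://archive.org/wayback/available?" + urlencode(params)


if __name__ == "__main__":
    print(wayback_snapshot_url("http://example.com/", "19961231"))
    print(wayback_availability_query("example.com", "19961231"))
```

Both helpers require the researcher to already know the address of the page, which is the limitation that keyword search is meant to ease.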
JANE WINTERS: At the top left, there's the Archive Team, which describes itself as "a loose collective of rogue archivists, programmers, writers and loudmouths, dedicated to saving our digital heritage." They can mobilize a rapid response when a service or platform is suddenly threatened with closure in ways that a national library can't. It has to make sure that it's abiding by legislation. The International Internet Preservation Consortium, at the top right, is an important global network, which brings together web archiving organizations from over 45 countries, and talks about how common standards and so on can be developed to help researchers and archivists in this field.
JANE WINTERS: At the bottom left, Documenting the Now has developed tools and a community focused on the ethical archiving of social media. It began its work in 2014 and has been expanding and supporting that really key area, because social media very often falls outside the scope of national library collecting. And finally, there's the Common Crawl, which is a non-profit organization which has built and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.
JANE WINTERS: And these seem like discrete initiatives, but they're not. They all overlap. They duplicate each other. They contain different information, similar information. They all overlap with the Internet Archive at points as well. So it is really tricky trying to work out where versions of things are held, what's unique, what isn't, what your natural starting point should be as a researcher.
JANE WINTERS: Today, I'm going to focus on the role of national libraries, which host, develop, and maintain the majority of national web archives. They fall into two broad categories, those which are created primarily under a legal deposit regime, and those which are collected on a permissions or fair use basis. The IIPC list of countries or regions in which some form of [INAUDIBLE] based archiving takes place, includes 17 countries in the former category and 12 in the latter.
JANE WINTERS: And this has enormous implications for what web archives contain and what researchers can expect to find. Are they going to access the entirety of a country code top level domain, such as .uk or .fr, or will they only be able to see a much smaller subset of that data? And it also affects how this data can be accessed. Sometimes it can only be looked at on site, in a library at a dedicated terminal.
JANE WINTERS: Other times you might be able to access it remotely, or limited subsets remotely. There's a fractured international timeline for archiving the web as well, and that's dependent on changes in legislation, available resources, and a range of other factors, and changes in technology too. In France, legal deposit dates all the way back to Francis I and the Ordonnance de Montpellier, and legislation was extended to include multimedia, software, and databases in 1992, and then websites that fall within the .fr domain in 2006.
JANE WINTERS: In the UK, the British Library was only empowered to conduct an annual harvest of the UK web as late as April 2013. And then we have a truly bewildering array of access arrangements. I've talked about some of them, and I'll show you a few examples later on. So that's the complex international picture, but I'm going to narrow it down even more now and focus on the situation in the UK.
JANE WINTERS: And I think it's fair to say that this is another level of complexity in how this data is archived, preserved, and managed. At the British Library, there are three main data sets, which overlap partially with each other, but also have a large amount of unique material, and it's impossible really in many ways to judge how much unique material there is. There's what I've described here as the Jisc Domain dataset, which was funded by Jisc in the UK and purchased from the Internet Archive to backfill the collection.
JANE WINTERS: And that covers the period from 1996 to 2013. And then there's a legal deposit crawl which happens every year, which has been running since 2013. And then there are curated special collections derived from both of those datasets, which cover the period from 2004 onwards. And then we have a range of other archiving bodies which hold material of interest. There's a UK Parliamentary Web Archive.
JANE WINTERS: The Internet Archive has huge amounts of content related to the UK, some of which will be in the British Library holdings and some of which won't. The Common Crawl overlaps in similar ways. And there's an entirely separate National Archive, at the National Archives of the UK, which deals solely with government data. And again, some of that material will be held elsewhere too. So it's a very difficult picture to orientate yourself as a researcher.
JANE WINTERS: Where's your first port of call? And actually, this isn't even the full extent of it. We also have collections that are held in the Archive-It subscription service, provided by the Internet Archive. And because of the vagaries of web crawling processes, there'll be a large amount of .uk material which ends up, for example, in the Portuguese Web Archive or the Icelandic Archive, where it's been accidentally scooped up in the crawling processes.
JANE WINTERS: So just trying to track down what is in web archives, let alone how they can be described and standardized, is a really tricky challenge. So how do researchers find a pathway through all of this? I think it's really, in the UK, again, there are all those other options, but it's best to focus on the two main holdings at the British Library and the National Archives. They've been collected under two different legislative frameworks, the Legal Deposit Libraries Regulations which, as I mentioned, were expanded in 2013, and the much older Public Records Act, which dates from 1838.
JANE WINTERS: And interestingly, that was worded in such an open way in the 19th century that it didn't have to be adapted in order to allow for collection of digital materials. So the National Archives is on a much firmer footing from the start [AUDIO OUT] existing statutory obligations allowed it to archive the web. And these are the two national web archives really that we have.
JANE WINTERS: The others are subsets or they focus on particular areas, such as Parliament. And there are two primary modes of access, on-site only in the UK's six legal deposit libraries, or free to anyone online for the National Archives collection. And that's thanks again to the legislation in the Public Records Act, which allows citizens to have access to the records of government, so there are no restrictions on being able to view that remotely.
JANE WINTERS: This is what the access looks like, and the different forms that this can take, and the different challenges that come with the different forms of access. This is the main search interface to the UK Web Archive, which is relatively recent. A huge amount of energy went into working out the best way to make this really huge, billions and billions of URLs and terabytes of data, searchable for researchers. So it is a full keyword search index of the whole of the UK Web Archive.
JANE WINTERS: And then these smaller collections, which showcase particular areas of strength, and give people a sense of what the archive contains. And they have much tighter metadata associated with them than other aspects of the collections, because curators have become involved in that process. And you can see the range here, from format types, like blogs, to particular political events, like Brexit, or to Caribbean communities in the UK, disciplines, like Celtic studies, and so on.
JANE WINTERS: So it's a really diverse collection, and it's a good starting point for researchers, and particularly for students. But going back to the search, I think you can probably see here some of the challenges that people face. Just to search for NISO, for records that can be viewed online, gives us 14,424 results, which are not ranked in any way other than by the order of date in which they were archived.
JANE WINTERS: Not the date on which they were published, the date on which they were archived. And you might be able to see at the top left that there are a further 91,000 records that are only viewable on site in the library. So that is over 100,000 records just for NISO that are held in the Web Archive. And there's no easy way, really, to narrow that down. There are some options on the site.
JANE WINTERS: You could narrow it by domain or by document type. But you're still looking at really large amounts of information that have not been categorized in any other way. So the search is intriguing, and you can bounce off it to do qualitative analysis, but it really doesn't allow you to do very much more than that. The earlier data set that goes back to 1996, because it was derived from the Internet Archive, is open.
JANE WINTERS: So some tools have been built on top of that which allow a bit more flexibility for researchers. And again, looking for NISO in the N-grams, so just instances of the mention of the term, and you can see below a keyword-in-context view that allows you to see how that's been discussed, and the prevalence over time within the archive of the acronym. So there's a spike in 1997, when the collecting started.
JANE WINTERS: And I don't know what was going on in 2011 that was making everybody talk about NISO on the UK web, but you can see there's a peak in 2011, which would be interesting to explore further. So this is very suggestive of trends in the data. Other modes of access. There's a fantastic set of data on the British Library UK Web Archive GitHub account. And that shows you, for example, links between web pages, lists of domains that have been crawled, format profiles, and CDX files are available.
JANE WINTERS: If you're not familiar with those, they consist of individual lines of text, each of which summarizes a single web document. So a CDX file for a year will have summaries of all of the web pages that have been harvested in that year. And that summary might give you information about the time the page was archived, the actual URL crawled, the content type, the HTTP status code reported by the server, and so on.
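The line-per-document structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the British Library's own tooling: it assumes the common 11-field, space-separated CDX layout, and the sample lines, digests, and file names are invented for the example. Real CDX files declare their field order in a header line, which should be checked before parsing.

```python
from collections import Counter
from datetime import datetime

# Assumed field order (a common 11-field CDX layout); real files state
# their own order in a header line, so treat this list as an assumption.
FIELDS = ["urlkey", "timestamp", "original", "mimetype", "status",
          "digest", "redirect", "metatags", "length", "offset", "filename"]


def parse_cdx_line(line):
    """Turn one space-separated CDX line into a dict keyed by field name."""
    record = dict(zip(FIELDS, line.split()))
    # CDX timestamps are 14 digits: YYYYMMDDhhmmss.
    record["captured"] = datetime.strptime(record["timestamp"], "%Y%m%d%H%M%S")
    return record


# Invented sample lines, for illustration only.
SAMPLE_LINES = [
    "uk,co,example)/ 20120215103000 http://www.example.co.uk/ text/html 200 AAAA - - 2345 1024 example-2012.warc.gz",
    "uk,co,example)/about 20120215103005 http://www.example.co.uk/about text/html 404 BBBB - - 512 4096 example-2012.warc.gz",
    "uk,co,example)/logo.png 20120301090000 http://www.example.co.uk/logo.png image/png 200 CCCC - - 9000 8192 example-2012.warc.gz",
]

records = [parse_cdx_line(line) for line in SAMPLE_LINES]

# Metadata-only questions: which status codes and content types were harvested?
status_counts = Counter(r["status"] for r in records)
mime_counts = Counter(r["mimetype"] for r in records)

print(status_counts)
print(mime_counts)
```

Counting status codes or content types in this way needs only the metadata, never the archived content itself.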
JANE WINTERS: And that's something that I found in my research, that when you're dealing with these huge volumes of data, the metadata actually becomes an important object of study. And you can find out huge amounts about the nature of the UK web without having to get to the historian's Holy Grail of the content, as it were. The metadata is enormously revealing. And there are also various workbenches [AUDIO OUT] Jupyter Notebooks, which allow you to, for example, as here, compare different versions of an archived web page, or use screenshots to visualize changes over time, or find when particular pieces of a text first appeared on an archived web page.
JANE WINTERS: So you can start to track those changes that are a bit hidden at the moment, and what are duplicates, when was there substantive change of this particular website, for example. And I hope that just talking about some of those issues illustrates why web archiving context really matters, and understanding the data before you start researching it.
JANE WINTERS: Because there are a lot of pitfalls, and you can trip up very easily if you don't understand provenance, how this has been put together. They're patchwork collections, really, that have been pulled together from different sources over different times to different parameters with different technical infrastructure, and so there's huge embedded variety in web archives. Just some examples.
JANE WINTERS: There are changing rates and patterns of data collection over time. Sometimes that might be the result of limited resources or the absence of a team member in the library. None of that is apparent to a researcher. You can only really find that out by going and talking to the web archive teams. As I mentioned, data has been backfilled from a variety of other sources.
JANE WINTERS: And it's been stitched together seamlessly for the researcher, but actually I want to know what the data source was. Where did that come from? Why did you fill that gap with data from this archive rather than another one? And there are constant changes in the parameters and depth of web archiving, often in response to events. I mean, there's been a huge focus on COVID collecting among web archives across the world, and understandably.
JANE WINTERS: And that would have led to different kinds of websites suddenly appearing dominant in the web archives, where they wouldn't have been before. And you need to understand why that happened. I also mentioned the ongoing development of new tools and new functionality. Interpreting other people's research, you really need to know what was accessible to them at the time, what tools did they have available to them.
JANE WINTERS: Because most of the tools are still being developed by the cultural heritage institutions rather than researchers for in-browser activity and access. And one of the really intriguing things about archived web pages is that they won't necessarily have been captured as a single object at one point in time. They're often constructed from page elements that have been captured at different times.
JANE WINTERS: So I can show you an example of what that looks like, again, in relation to NISO. There's this wonderful tool, called Memento: Time Travel for the Web. And you can put in a date for when you want to find what a website looks like. And I've chosen 10 years ago today. And you can see what the NISO website might have looked like at that point in time.
JANE WINTERS: And you can see here, it's been captured in multiple archives, including at the bottom the Icelandic Web Archive. So information ends up in very strange places through crawling. And there isn't a capture on the date that I wanted, so you can see there's one that was 12 days before in the Internet Archive, or 222 days before, or 308 days after. So that's the sort of spread of what might be available if you're interested in 2012.
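The choice among captures described above amounts to finding the capture whose date is nearest the requested one. The sketch below is illustrative only and does not call the Memento service. The target date is a hypothetical stand-in, the three offsets are the ones mentioned in the talk, and the archive labels other than the Internet Archive are placeholders.

```python
from datetime import date, timedelta

# Hypothetical target date standing in for "10 years ago today".
target = date(2012, 2, 15)

# (archive label, capture date). The offsets come from the talk; the
# labels "Archive B" and "Archive C" are placeholders, not real results.
captures = [
    ("Internet Archive", target - timedelta(days=12)),
    ("Archive B", target - timedelta(days=222)),
    ("Archive C", target + timedelta(days=308)),
]


def closest_capture(target_date, candidates):
    """Return the (label, date) pair with the smallest absolute distance in days."""
    return min(candidates, key=lambda c: abs((c[1] - target_date).days))


best = closest_capture(target, captures)
# Signed distance in days: negative means captured before the target date.
offsets = {label: (captured - target).days for label, captured in captures}

print(best)
print(offsets)
```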
JANE WINTERS: When you go down to look at the page itself-- this used not to be the case, but the Internet Archive, really helpfully now, identifies the points in time at which different elements of the page were captured. So if you center on the date that you're interested in-- it's quite small here, but the very first line of this extra information tells you that the CSS style sheets were archived 13 days before the rest of the page.
JANE WINTERS: And further down, as well, there is a much longer period of time where another aspect of it was archived as well. But it's mostly kind of in the seconds, but there are still these chronological differences that mean that, effectively, no single person ever looked at this page, because it didn't exist in real time at any point. It's been artificially stitched together to look like a coherent web page.
JANE WINTERS: Oh, and I can see it now, yes. There was a Google tool that was captured 20 hours after the rest of the page as well. And if dealing with rapidly moving sites, like weather reporting or sports news, you might end up with inconsistent results, you know, showing you different information. It just doesn't look coherent, because it's changed during the archive process.
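The stitched-together nature of a replayed page can be made concrete by comparing each element's capture time with that of the main page. This is a minimal sketch under stated assumptions: the element names and the main capture time are invented, and only the sizes of the gaps (13 days before, 20 hours after, and a few seconds) follow the talk.

```python
from datetime import datetime, timedelta

# Hypothetical capture time for the main HTML document.
page_time = datetime(2012, 2, 3, 12, 0, 0)

# Invented element names; the gaps mirror those described in the talk.
element_times = {
    "style.css": page_time - timedelta(days=13),
    "logo.png": page_time + timedelta(seconds=4),
    "google-widget.js": page_time + timedelta(hours=20),
}


def capture_offsets(main_time, elements):
    """Map each element name to its capture time minus the main page's."""
    return {name: captured - main_time for name, captured in elements.items()}


def incoherent_elements(offsets, tolerance=timedelta(minutes=1)):
    """List elements captured more than `tolerance` away from the main page."""
    return sorted(name for name, delta in offsets.items() if abs(delta) > tolerance)


offsets = capture_offsets(page_time, element_times)
flagged = incoherent_elements(offsets)
# Total span between the earliest and latest captured element.
spread = max(offsets.values()) - min(offsets.values())

print(flagged)
print(spread)
```

The one-minute tolerance is an arbitrary choice for the example; a study of fast-changing sites might want a much tighter threshold.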
JANE WINTERS: My colleague Niels Brügger describes archived web pages as reborn digital, because it's an archival artifact that's been created. And he has a rather nice and it sort of feels slightly exasperated description of what this means in practice for a researcher that you can see here: The archived web must be regarded as a unique version, a version of a probably lost original, and very likely one version, among other versions, none of which can be identified as the original.
JANE WINTERS: The fundamental heterogeneity cannot be removed or resolved by technical means, because it's a constitutive part of the web and of the web being archived. And to add insult to injury, the heterogeneity of the web archive also has a history of its own, because the web and the web archive continue to develop, thus accumulating previous heterogeneities. So it's these layered artifacts on top of layered artifacts within the web archives, and trying to pick your way through that is not an easy thing.
JANE WINTERS: I'm just going to finish with talking about a few of the most recent developments in access, one of which actually is empowering individual researchers to create their own web archives. And this tool here, Conifer, which was formerly known as Webrecorder, which is open for people to use, has the great advantage of being able to record the logged-in social media experience of the person using it, which captures another kind of web archive context.
JANE WINTERS: If you're logged into a Facebook group, for example, you will see the view that you would see, with information that's not there on a public page that might end up in a national archive. And I think there are lots of researcher collections of really important interesting data that we don't know about that have been created using tools like this. There's a project based in Canada, Archives Unleashed, which is building a range of tools to allow people to do some more analysis at scale than has generally been possible.
JANE WINTERS: And you can see some of the word cloud, and so on, on there, and other forms of visualization and analysis. And that's trying to remove the technical barriers to people interacting with these carefully preserved websites. And finally, there are moves away from the keyword searching that I showed you at the start and thinking about visualization. You can't catalog this data. There's too much of it.
JANE WINTERS: Certainly, the technology isn't there to even do semi-automated cataloging at the moment, really. And so if we can't do that, how can we access it? And this is a visualization of links between websites in the 1996 .uk domain. And you begin to get a sense of the shape of it. There's this really dense cluster of connected websites at the center, and then this kind of penumbra of websites that don't link on to anything else.
JANE WINTERS: So that tells you something about how the web was developing in that period. Things were starting to link together, but there was still this whole different set of pages which weren't connected up to anything else. You can't do much more with it than that, but it is an interesting way of thinking about this data. And then, finally, this is using similar methods but applying topics, and trying to get down to the subject matter within the web pages.
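The shape of the 1996 link visualization described above, a dense connected core surrounded by sites that link to nothing, can be approximated by grouping sites into connected components. The sketch below is not the method used to produce that visualization. The domain names and links are invented, and links are treated as undirected for simplicity.

```python
from collections import defaultdict


def connected_components(nodes, links):
    """Group nodes into components, treating each link as undirected."""
    neighbours = defaultdict(set)
    for source, destination in links:
        neighbours[source].add(destination)
        neighbours[destination].add(source)

    seen = set()
    components = []
    for node in nodes:
        if node in seen:
            continue
        # Depth-first walk collecting everything reachable from this node.
        stack = [node]
        component = set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbours[current] - component)
        seen |= component
        components.append(component)
    return components


# Invented domains and links, for illustration only.
nodes = ["a.co.uk", "b.co.uk", "c.ac.uk", "d.org.uk", "e.co.uk"]
links = [("a.co.uk", "b.co.uk"), ("b.co.uk", "c.ac.uk"), ("c.ac.uk", "a.co.uk")]

components = connected_components(nodes, links)
core = max(components, key=len)  # the densely connected centre
isolated = sorted(next(iter(c)) for c in components if len(c) == 1)

print(sorted(core))
print(isolated)
```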
JANE WINTERS: This is produced by the team at the National Archives of the UK, looking at the blogs that are in their collections and trying to identify how different kinds of blog connected to each other. This is the government web estate. So there are lots of blogs related to justice in green at the center, and data on heritage, interestingly, the kind of pink color there is quite closely connected to the justice one.
JANE WINTERS: But over to the left, you might notice, there's a sort of pinkish cluster, which is actually about potatoes, which must have been about healthy eating and lifestyle on government websites, which is perhaps understandably not connected on to anything else. And in orange, you've got education which, rather more worryingly, is not connected onto anything else. So this is giving really nice insights into the shape of the government web in 2009, in this case.
JANE WINTERS: So there's lots of experimentation. We're starting to think much more about how this unstructured data can be structured, how standards can be applied in appropriate areas to make it accessible, and how researchers find their way through these really challenging but fascinating collections. And that's everything from me. Thank you very much.
SALWA ISMAIL: Thank you, Jane. And thank you again, Euan. Before we open up the Q&A session, please join me in giving a round of virtual applause to Euan and Jane. And with this, we'll open up the session for a live Q&A. Thank you. [MUSIC PLAYING]
