Unsustainability and Retrenchment in American University Web Archives Programs

May 12, 2023, International Internet Preservation Consortium (IIPC) 2023, Hilversum, Netherlands


This presentation, co-authored with Amanda Greenwood, overviewed the expansion and later retrenchment of UAlbany’s web archives program due to a lack of permanently funded staff. We found that web archives can be really wasteful and require increasing maintenance and scoping labor over time. While there is a rational case for increased staffing in web archives, we tried to connect our local case study with the broader resource environment in American colleges & universities, showing that this is a long-term crisis for research libraries.

Hi everyone. I’m Greg Wiedeman. I’m the University Archivist at Albany, and I should give a bit of a heads up that I’m probably going to kill the vibes at this conference with this talk. Hopefully you hear some more optimistic talks about web archives this week, but I think it’s useful to have an honest understanding of what’s happening in web archives collecting right now.

This presentation is about program sustainability in American University web archives programs. I also want to thank my co-author, Amanda Greenwood, who was unable to make it here for IIPC this year but is just as responsible for this work as I am.


First a bit of backstory about our web archives program at UAlbany. In 2012 we first started our Archive-It subscription with the Internet Archive, and the goal was to capture as much of our university website, albany.edu, as possible. This is because we are a public university, and there’s public records on our website that we have to collect and preserve permanently according to State records laws.

At first, no one was really assigned to manage the web archive. An initial monthly crawl was set up, and it did its work, for better and worse, and no one really touched it.

In 2015, when I was hired, web archiving was worked into the University Archivist position as a very minor bullet at the bottom of the job responsibilities, but I was able to at least better scope and schedule the crawls, which is when we first started actually getting all of albany.edu.

In 2016 we expanded to capture our outside collecting areas. The university archives is also part of our special collections department which collects outside manuscript collections and organizational records, mostly on New York State politics and death penalty activism. Much of that work now happens online, so when we get transfers of organizational records, we would start collecting their websites as well.

We also tried to do some topical collecting for the 2016 and 2018 State elections, but we realized we didn’t really have the resources to do that.

And we also noticed that our progress was creating a big maintenance backlog. It was feasible to crawl new collections one-by-one, but crawls would later time out and need rescoping periodically over time. So, while it was kind of easy to expand one-by-one, each new collection increased the long-term maintenance work that we really didn’t allocate for.


So I think we are pretty typical. There are many college & university archives collecting the web in the United States. The NDSA web archiving survey is a couple of years old now, but the majority of respondents in 2017 were college and university archives.

Archive-It has also reported, at least as late as 2018, that about 60% of their subscribers were college and university archives, and their next category was public libraries, which now might be a bit bigger, but is subsidized by their really cool Community Webs program. Archive-It’s efforts to diversify their partner orgs really underscore how dominant colleges and universities are there.

So college and university archives are really dominant in American web archives, particularly through Archive-It. The Internet Archive in practice uses us as kind of curators, as all of our collections get added to the Internet Archive, too. So, hopefully, we can do some more specialized collecting that they’re not able to do, and all of this gets collected and preserved.

Many of these organizations are public universities, and they’re collecting because of similar State records laws that require the preservation of public records on University websites.


It’s well understood that web archives are pathetically underfunded. This is a quote from David Rosenthal, and according to those NDSA surveys, over two thirds of respondents reported about a quarter FTE dedicated to web archives, and this was actually an increase from earlier surveys.


And if we look at the graph here, you can see that there are very few web archives run by dedicated staff, and I agree with David that this looks pretty pathetic.


So, there’s a one-year graduate assistantship at our library that we can apply for. It kind of goes from department to department, and the student gets tuition credit and a stipend. We applied for and received that position for 2020 in the hopes of making this program more effective.

We hired my co-author, who was later hired away by Union College. The goal was to get better management over what had by that point grown to 30 web archives collections, and to try to better document New York State politics and the death penalty. We knew there was stuff we were losing.

In February of 2020 Archive-It 7.0 came out, which incorporated youtube-dl, which was a major advancement that helped us better crawl video content. The downside was, it helped us better crawl video content.

We quickly saw really substantial data increases, and we were soon right at our data budget for the year.

I was scheduled to go on leave for March and April, so I had about two days at the end of February to roll back our crawls and try to get them to capture as minimally as possible, and we turned off many crawls for “the time being.”

Then a pandemic happened.

So, while some other web archives programs were able to shift more resources to web archives in a remote world, our web archive was kind of frozen. I was on leave, and I was the person managing the web archive program, so with this timing and no dedicated staff coverage, the result was significant loss of documentation of university policies and practices in the rapidly changing early period of the pandemic. So that wasn’t a great situation to be in.


So we refocused the GA position around sort of resetting the program. We really had to rescope all active collections, be more thoughtful about collecting priorities, and try to allocate for and limit long term maintenance.


We figured out that appraisal for web archives is pretty weird compared to traditional archives. When you’re working with paper, or even born-digital records, appraisal saves us time and labor. It reduces the time needed to acquire and process collections.

Appraisal for web archives is different. There are very few upfront time costs. It’s very easy to crawl a lot without much time dedicated to scoping. But it takes a lot more labor to crawl efficiently.

Lack of staff time doesn’t prevent you from crawling but it creates a lot of waste that is really hard to account for.


This makes web archives a really wasteful collecting method. There are so many crawler traps now. I’m not sure if it’s React or misconfigured WordPress sites or what, but sites will often just append random strings to URLs infinitely. Browser-based crawling helps with some of this, but there are still so many sites that need really labor-intensive scoping.
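To make the crawler-trap problem concrete, here is a minimal sketch of the kind of heuristic a crawl operator might use to flag trap-like URLs before writing scoping rules. This is not Archive-It’s actual logic; the thresholds and function name are assumptions for illustration only.

```python
# Illustrative sketch only (not Archive-It's real scoping logic):
# flag URLs with suspiciously deep paths, repeated path segments,
# or an unusually large number of query parameters.
from urllib.parse import urlparse, parse_qs

def looks_like_trap(url, max_depth=8, max_params=6):
    """Heuristically flag a URL as a likely crawler trap."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    # Very deep paths often come from infinitely appended strings.
    if len(segments) > max_depth:
        return True
    # The same segment repeating (e.g. /calendar/calendar/calendar/)
    # is a classic crawler-trap signature.
    if any(segments.count(s) > 2 for s in set(segments)):
        return True
    # Large numbers of query parameters often signal generated URLs.
    if len(parse_qs(parsed.query)) > max_params:
        return True
    return False

print(looks_like_trap("https://example.edu/news/2023/story"))        # False
print(looks_like_trap("https://example.edu/cal/cal/cal/cal/cal/x"))  # True
```

In practice, flagged URLs would still need human review before becoming scoping rules, since legitimate sites can trip any simple heuristic.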

And mistakes just happen. During our 2018 state election crawls, I was working with a grad student at the time, but since he was focused on a bunch of other work, he wasn’t really trained on web archives. He just spent some time going through and collecting all of the state candidates’ website URLs for me to crawl.

We had done some test crawls on an initial batch, but a couple of days before the election, he came with another batch, and we didn’t have time to do a full test crawl. I unfortunately did not review his seed list that thoroughly, and there was a YouTube link, because one of the candidates used YouTube as their main candidate website, which means our crawler spent the next week downloading as much of YouTube as possible.

So I think this actually happens very often. And you know how I know this?

Because most institutions dedicate about .25 FTE to web archives.

Also, Archive-It now has automated scoping rules that get applied in these situations to try and avoid this. So when you add a YouTube link, it’ll automatically apply some scoping rules to try and reduce the waste.

So web archives have some really significant costs that we don’t see. The compute and storage necessary to run web crawlers are relatively cheap for us, but we need to allocate for the cost of servers responding to the crawler.

And all of this is physical infrastructure, so when a crawler is stuck in a trap forever, getting a lot of useless data, it’s generating emissions. Even if it’s very efficient, even if you buy carbon offset credits, it’s still generating emissions.


There is also access waste that is even harder to measure. The less efficient our web archives are, the bigger our collections are, and the harder they are to use.

So we have some great progress with the Archives Unleashed project, which has made it feasible to use big data tools to work with these web archives, but collections would just be a lot easier to use, and the barrier to use would be lower, if they were more precise.

The lack of research use for web archives is a widely known problem. This is a direct reflection of the lack of staff we devote to web crawling and scoping.

I know this because some of the faculty I’ve tried to convince to use web archives say: if I scrape it myself, I have a lot more granular control over what I’m getting. I only get what I want, and I don’t have to do a lot of filtering and data cleanup afterwards. It’s actually more work for them to use web archives, either through us or through collections that already exist.

So this waste, the lack of opportunity for use is really hard to measure. And so we really need to further study this and try to measure and account for this waste.


We also tried to be really thoughtful about the experience of web archives, as it turns out web crawling can be a really time consuming job despite the powers of automated crawlers.

Multiple test crawls are now usually needed and sometimes these crawls could last up to 4 weeks.

And one of the hard things is that we have no idea when the crawls are actually going to finish. Sometimes they’ll finish in a day, sometimes they’ll finish in 5 days.

So that’s really hard to work around other duties, because you don’t really know when the crawl is going to finish. So you hope, oh, I might have some time on Thursday to scope this crawl, but it doesn’t finish on Thursday, it finishes on Friday, and we have a bunch of meetings on Friday, so it gets pushed to next week, and it gets pushed down the list.

So it’s actually really hard to manage web archives alongside other duties this way.

And you also have really no control over what will cause the maintenance you need, because you don’t know when websites are going to change.

So all of a sudden you’ll have a crawl that has been working well over time, but the site makes a major update and changes what it’s doing, and all of a sudden now our crawl is breaking, and we have to dedicate a lot of time to rescoping.

So the time commitment is really decided by website changes and Archive-It technology updates that we don’t have a lot of control over and can’t really plan for.


As a new web archivist, it was harder for my co-author to get up to speed than it was when I started crawling, as the web keeps getting more complex.

This required a lot of work with me, analyzing host reports that I struggled to explain.

So there’s a lot of technology you need to learn, like what HTTP codes and URL parameters are.

But also even despite the technology learning curve, it’s just hard to read a host report. Looking at lists of URLs and trying to figure out what is happening is really hard.
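As a rough illustration of the kind of triage a host report requires, here is a hedged sketch of how you might summarize a batch of crawled URLs: group them by host and count how many carry query strings, which is one crude signal of parameter-driven duplication. The URLs and function name here are made up for illustration; real Archive-It host reports contain more fields than this.

```python
# Minimal sketch of host-report-style triage: for each host, count
# total URLs and how many have query strings (a rough signal of
# parameter-driven URL explosion worth investigating for scoping).
from collections import Counter
from urllib.parse import urlparse

def summarize(urls):
    """Return {host: (total_urls, urls_with_query_params)}."""
    hosts, with_params = Counter(), Counter()
    for url in urls:
        parsed = urlparse(url)
        hosts[parsed.netloc] += 1
        if parsed.query:
            with_params[parsed.netloc] += 1
    return {h: (n, with_params[h]) for h, n in hosts.items()}

# Hypothetical example URLs, as might be exported from a host report.
urls = [
    "https://www.example.edu/about",
    "https://www.example.edu/events?id=1",
    "https://www.example.edu/events?id=2",
    "https://cdn.example.edu/img.png",
]
for host, (total, parameterized) in summarize(urls).items():
    print(host, total, parameterized)
```

Even with a summary like this, deciding whether a cluster of parameterized URLs is legitimate content or waste still takes human judgment, which is exactly the labor that is hard to staff.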

There’s some really good documentation about how to learn to use the Archive-It tool and crawl the web. But it’s not going to help with the problem of: what is this particular scoping issue?


Another thing we found out, which is kind of the biggest takeaway, is that the maintenance for our web archives is increasing quite significantly over time.

The web is getting more complex and dynamic day by day. I think this is something we all kind of know.

There’s really kind of an arms race between web technology and archiving technology. So now we have browser-based crawling, but even though we have more tools to collect with, more tools means more options, which takes more time.

So sometimes you do a traditional crawl, and then you do a browser based crawl on top of it to see if that solves the scoping issue you don’t understand.

We found out that the same websites take significantly more time to scope on average, even 3 to 5 years later.


So this is some of the results from our 2020 rescoping.

And you can see that the average was about 1.5 test crawls over the course of almost 30 days. So about a month here, among other duties. This was the time until we saved a final test crawl.

Only 3 to 5 years later, we found a significant increase in both the number of test crawls required, which went to almost 5, and the amount of time to successfully crawl, which reached over a hundred days.

One of the worst collections was CSEA, which is New York’s public workers union.


In February of 2016, it took 5 days and 2 test crawls to scope and capture, and we ended up with 0 scoping rules.

In 2020, over the fall and winter, it took 6 months from start to finish, juggling this among a bunch of other web collections and other duties.

It required 12 different test crawls, each about 3 and a half days, and we ended up with 10 scoping rules.

[The active time was actually shorter than this, as we did an initial test crawl in August but put it aside when we figured out that this wasn’t a good collection to learn scoping on.]


So there’s definitely a rational case for increasing staff dedicated to web archives, right?

As our maintenance load is growing significantly over time, even just to keep capturing what we’re already collecting without expanding.

There’s also another weird quirk of being in a university special collections, as usually collections come in years or decades after they’re created.

If you talk to our department head, who does a lot of collecting about New York politics, he’s getting great collections from the 1960s, 70s, 80s.

As people get older, they retire, and their political advocacy becomes maybe less controversial. Maybe they’re more protected from harm in some cases. So they’re more comfortable donating these and managing these collections.

So there’s always been kind of a lag time, but much of State politics and death penalty activism now happens on the web, or the born-digital records are hidden on Google Drive somewhere or in other tools.

So the weird thing is that we’re in this period of double collecting.

We need to collect both the 1970s and 1980s records and the current records at the same time, which is a kind of peculiar period.

Because that delay for the new stuff is really going to result in unacceptable loss. Much of that stuff is going to be gone. But it’s hard to do twice the work without twice the number of people collecting.


But while there’s a rational case for increasing staff dedicated to web archives, there’s been major staff loss in North American research libraries over the past 20 years.

Specifically, since 2008 you can see there’s a really significant drop in the median FTE in ARL libraries. Over this time it’s dropped from about 210 to about 160.

And this has created what Eira Tansey has described as a “poverty cycle,” where these losses compound, as libraries are unable to take strategic risks and focus only on short-term needs instead of things with long-term value.

This is why libraries are still more focused on collecting decades in the past, because they get more immediate value from those collections, as researchers are using them now.


And again, here UAlbany is pretty typical.

We had 150 FTE in the year 2000 at our library, and this year we’re reporting 75 FTE to ARL, a 50% decrease.

This makes it hard to impossible to maintain the current programs and services that are producing value for our university community, much less devote more staff to web archives or double our collecting staff.


So why is this happening? Why? Why are research libraries hemorrhaging staff?

Here again, UAlbany can be a really useful case study.

And the real truth is that administrators are reallocating staff from academic libraries to units that better demonstrate a direct impact on student recruitment and retention.

Our Provost’s office has made the last 10 years of staffing data open. These numbers are a bit different, as they don’t include temporary or adjunct lines.

In the last 10 years we’ve lost 26.7 positions, an over 27% drop, but meanwhile the university as a whole has grown in staff, and Academic Affairs has grown in staff.

And what’s most glaring is that in our university’s functional budgeting, we are budgeted as Academic Affairs Support.

Over this period, Academic Affairs Support has actually grown by over 10%, even including the Libraries’ losses, gaining over 47 positions, most of those in writing and critical inquiry and academic advising.

University administrators are pulling staff away from research libraries towards efforts that can get more students in the door and make sure they graduate.

And this problem might get worse. There’s the potential for enrollment drops starting around 2026 just from demographic changes. Those projections can be really noisy and really dependent on attendance rates, but the resource environment is more likely to get worse than better.

I support our university’s mission of providing an accessible education to as broad a set of New Yorkers as possible, but archives and research libraries have a different mission.


So this often gets attributed to austerity, and it is austerity and a crisis in how American colleges and universities are resourced.

But it’s also kind of worse than that. It’s a lack of value for the mission of research libraries, both by universities and by the public.

An example of this is that last year New York State, like many states, had a historically pretty unique budget surplus, and the state did invest in our university.

The “Albany AI Supercomputing Initiative” includes $75 million in permanent state funding and 49 new tenure track positions in every college & school in the university, most dedicated to AI.

Yet this resulted in 0 hires for the University Libraries.

Even when there’s funding for universities, there is not funding for our work.

Unlike AI apparently, the politicians behind this funding cannot run in their districts on support for research libraries.


So the really unsettling conclusions to this are that lack of staff makes web archives really wasteful in ways that we need to start trying to measure and account for.

Even when just maintaining current collections, we should budget for increases in maintenance over time.

Faced with this reality, we have slowly stopped collecting the web in our major subject areas. As crawls need rescoping we just turn them off.

We are now just crawling to meet our legal requirements.

I think our pressures and experience are typical. This is really a long-term existential crisis for research libraries.

Improvements in practice or technology won’t address these structural forces.

And it’s really hard in this current environment to see how this gets better.

For this community at IIPC, this means we really cannot rely on most college and university libraries to preserve the web.