Describing Web Archives with the Partner Data API

Gregory Wiedeman
University Archivist
University at Albany, SUNY

Web Archiving at UAlbany

Web Archives are Archives

We do not describe web archives as archives

If I had to enter detailed metadata for each Archive-It collection (or each seed!), I would collect less.

Describing Web Archives with DACS

DACS Statement of Principles

“3. Because archival description privileges intellectual content in context, descriptive rules apply equally to all records, regardless of format or carrier type.”

“4a. Records must be described in aggregate and may be described in parts.”

“10. Archivists must have a user-driven reason to enhance existing archival description.”

I tried this; it’s difficult

Now We Have the Tools


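from asnake.aspace import ASpace

# Connect using the credentials in the ArchivesSnake config file
AS = ASpace()
repo = AS.repositories(2)

collection = repo.resources(253)
for child in collection.tree.children:
    # Find the series that holds web archives
    if child.level == "Web Archives":
        series = child.record
        newArchivalObject = {
            "jsonmodel_type": "archival_object",
            # add description here (title, level, dates, ...)
            "parent": {"ref": series.uri},
            "resource": {"ref": collection.uri},
        }
        # Once described, post it, e.g.:
        # AS.client.post(repo.uri + "/archival_objects", json=newArchivalObject)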

Archive-It CDX

edu,albany)/president/about-the-president.php 20171218033537 text/html 200 VQR3D3JAD6BIB36O4ELJH7L6U2IESQOT - - 6977 871017071 ARCHIVEIT-3308-MONTHLY-JOB512297-20171217200124134-00000.warc.gz
edu,albany)/president/about-the-president.php 20180117191831 text/html 200 7BXF5EVUN52ZYR7THHFXKQXF7I3LBHNC - - 7157 309246547 ARCHIVEIT-3308-MONTHLY-JOB541502-20180117165511664-00000.warc.gz
edu,albany)/president/about-the-president.php 20180217194454 text/html 200 XD6ET6ZBLTNN6OLSCVSMIXU44TWN5W4M - - 8019 411195409 ARCHIVEIT-3308-MONTHLY-JOB550452-20180217165509323-00000.warc.gz
edu,albany)/president/about-the-president.php 20180317200843 text/html 200 KY54GB6LH24DRIZGBIO4R3VXXA2UHTN3 - - 8074 392459899 ARCHIVEIT-3308-MONTHLY-JOB557606-20180317165527317-00000.warc.gz
edu,albany)/president/about-the-president.php 20180417185510 text/html 200 O6SIJTKE2X565QJ5ZEWPZYSK3H53HZDH - - 8085 227334518 ARCHIVEIT-3308-MONTHLY-JOB566289-20180417155443995-00000.warc.gz
edu,albany)/president/about-the-president.php 20180517190232 warc/revisit - O6SIJTKE2X565QJ5ZEWPZYSK3H53HZDH - - 648 223149579 ARCHIVEIT-3308-MONTHLY-JOB576427-20180517155506038-00000.warc.gz
edu,albany)/president/about-the-president.php 20180617184722 text/html 200 F2CRCBL5TZRROHBN5GCF7IESJR44ABIQ - - 8081 59342345 ARCHIVEIT-3308-MONTHLY-JOB599287-20180617155438242-00000.warc.gz
edu,albany)/president/about-the-president.php 20180717204617 text/html 200 W6VTFG3HQYDTRLUEGMVA2M5BU75QT47N - - 7968 130376348 ARCHIVEIT-3308-MONTHLY-JOB647847-20180717155427452-00000.warc.gz
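The CDX lines above can be read programmatically to find when a page actually changed. The following is a minimal sketch, not the workflow used at UAlbany. It assumes each line has the ten whitespace-separated fields seen in the sample (URL key, timestamp, MIME type, status, digest, two unused fields, length, offset, WARC filename). The names Capture, parse_cdx_line, and changed_captures are made up for illustration. A capture counts as a new version when it is not a warc/revisit record and its digest differs from the previous capture's.

```python
from typing import List, NamedTuple


class Capture(NamedTuple):
    urlkey: str
    timestamp: str
    mimetype: str
    status: str
    digest: str
    filename: str


def parse_cdx_line(line: str) -> Capture:
    # Assumed field order, based only on the sample lines in this deck:
    # urlkey timestamp mimetype status digest - - length offset filename
    fields = line.split()
    return Capture(fields[0], fields[1], fields[2], fields[3], fields[4], fields[-1])


def changed_captures(lines: List[str]) -> List[Capture]:
    """Return only the captures whose content differs from the previous one."""
    changed = []
    last_digest = None
    for line in lines:
        if not line.strip():
            continue
        capture = parse_cdx_line(line)
        # A revisit record or a repeated digest means the page did not change.
        if capture.mimetype == "warc/revisit" or capture.digest == last_digest:
            continue
        changed.append(capture)
        last_digest = capture.digest
    return changed


SAMPLE = """\
edu,albany)/president/about-the-president.php 20180417185510 text/html 200 O6SIJTKE2X565QJ5ZEWPZYSK3H53HZDH - - 8085 227334518 ARCHIVEIT-3308-MONTHLY-JOB566289-20180417155443995-00000.warc.gz
edu,albany)/president/about-the-president.php 20180517190232 warc/revisit - O6SIJTKE2X565QJ5ZEWPZYSK3H53HZDH - - 648 223149579 ARCHIVEIT-3308-MONTHLY-JOB576427-20180517155506038-00000.warc.gz
edu,albany)/president/about-the-president.php 20180617184722 text/html 200 F2CRCBL5TZRROHBN5GCF7IESJR44ABIQ - - 8081 59342345 ARCHIVEIT-3308-MONTHLY-JOB599287-20180617155438242-00000.warc.gz
"""

if __name__ == "__main__":
    for capture in changed_captures(SAMPLE.splitlines()):
        print(capture.timestamp, capture.digest)
```

With the three sample lines, the May capture is skipped because it is a revisit of the April content, so only the April and June captures would need new description.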

Partner Data API

Content Storage

Scope and schedule crawls in Archive-It

Describe in ASpace with other records

Use CDX to post new description to ASpace when page is updated

Download WARCs with WASAPI?

Get crawl and scoping rules

Make all this discoverable

Two Processes

  1. Create ArchivesSpace records from CDX
  2. Create content records for each capture from Partner Data API
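The first process might be sketched as a function that turns one CDX capture into the JSON for an ArchivesSpace archival object. This is a hedged illustration only. The function name capture_to_archival_object is hypothetical, as is the parent URI in the usage. The URL key used as a title and the "item" level are placeholders for real description. The "single" date is one possible mapping, not a recommendation from this talk.

```python
from datetime import datetime


def capture_to_archival_object(urlkey, timestamp, parent_uri, resource_uri):
    """Build archival_object JSON for one capture (illustrative mapping only)."""
    # CDX timestamps are 14 digits: YYYYMMDDhhmmss
    captured = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    return {
        "jsonmodel_type": "archival_object",
        "title": urlkey,   # placeholder; a real record needs a meaningful title
        "level": "item",   # assumption: one record per changed capture
        "dates": [
            {
                "jsonmodel_type": "date",
                "date_type": "single",
                "label": "creation",
                "begin": captured.strftime("%Y-%m-%d"),
            }
        ],
        "parent": {"ref": parent_uri},
        "resource": {"ref": resource_uri},
    }


record = capture_to_archival_object(
    "edu,albany)/president/about-the-president.php",
    "20171218033537",
    "/repositories/2/archival_objects/1",  # hypothetical parent series URI
    "/repositories/2/resources/253",
)
```

The resulting dictionary could then be posted with ArchivesSnake, for example through the client's post method against the repository's archival_objects endpoint.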

I need help!