Describing Web Archives with the Partner Data API


Gregory Wiedeman
University Archivist
University at Albany, SUNY
@GregWiedeman

Web Archiving at UAlbany


Web Archives are Archives


We do not describe web archives as archives


If I had to enter detailed metadata for each Archive-It collection (or each seed!),

I would collect less.


Describing Web Archives with DACS


DACS Statement of Principles

https://github.com/saa-ts-dacs/dacs

“3. Because archival description privileges intellectual content in context, descriptive rules apply equally to all records, regardless of format or carrier type.”


DACS Statement of Principles

https://github.com/saa-ts-dacs/dacs

“4a. Records must be described in aggregate and may be described in parts.”


DACS Statement of Principles

https://github.com/saa-ts-dacs/dacs

“10. Archivists must have a user-driven reason to enhance existing archival description.”



I tried this. It’s difficult.


Now We Have the Tools


ArchivesSpace

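from asnake.aspace import ASpace

# Connect using the credentials in the ArchivesSnake configuration
AS = ASpace()
repo = AS.repositories(2)

collection = repo.resources(253)
for child in collection.tree.children:
    if child.level == "Web Archives":
        series = child.record
        # One way to create the record: POST plain JSON through the client
        newArchivalObject = {
            "jsonmodel_type": "archival_object",
            # add description here (title, dates, level, ...)
            "parent": {"ref": series.uri},
            "resource": {"ref": collection.uri},
        }
        AS.client.post(repo.uri + "/archival_objects", json=newArchivalObject)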

Archive-It CDX

edu,albany)/president/about-the-president.php 20171218033537 http://www.albany.edu/president/about-the-president.php text/html 200 VQR3D3JAD6BIB36O4ELJH7L6U2IESQOT - - 6977 871017071 ARCHIVEIT-3308-MONTHLY-JOB512297-20171217200124134-00000.warc.gz
edu,albany)/president/about-the-president.php 20180117191831 https://www.albany.edu/president/about-the-president.php text/html 200 7BXF5EVUN52ZYR7THHFXKQXF7I3LBHNC - - 7157 309246547 ARCHIVEIT-3308-MONTHLY-JOB541502-20180117165511664-00000.warc.gz
edu,albany)/president/about-the-president.php 20180217194454 https://www.albany.edu/president/about-the-president.php text/html 200 XD6ET6ZBLTNN6OLSCVSMIXU44TWN5W4M - - 8019 411195409 ARCHIVEIT-3308-MONTHLY-JOB550452-20180217165509323-00000.warc.gz
edu,albany)/president/about-the-president.php 20180317200843 https://www.albany.edu/president/about-the-president.php text/html 200 KY54GB6LH24DRIZGBIO4R3VXXA2UHTN3 - - 8074 392459899 ARCHIVEIT-3308-MONTHLY-JOB557606-20180317165527317-00000.warc.gz
edu,albany)/president/about-the-president.php 20180417185510 https://www.albany.edu/president/about-the-president.php text/html 200 O6SIJTKE2X565QJ5ZEWPZYSK3H53HZDH - - 8085 227334518 ARCHIVEIT-3308-MONTHLY-JOB566289-20180417155443995-00000.warc.gz
edu,albany)/president/about-the-president.php 20180517190232 https://www.albany.edu/president/about-the-president.php warc/revisit - O6SIJTKE2X565QJ5ZEWPZYSK3H53HZDH - - 648 223149579 ARCHIVEIT-3308-MONTHLY-JOB576427-20180517155506038-00000.warc.gz
edu,albany)/president/about-the-president.php 20180617184722 https://www.albany.edu/president/about-the-president.php text/html 200 F2CRCBL5TZRROHBN5GCF7IESJR44ABIQ - - 8081 59342345 ARCHIVEIT-3308-MONTHLY-JOB599287-20180617155438242-00000.warc.gz
edu,albany)/president/about-the-president.php 20180717204617 https://www.albany.edu/president/about-the-president.php text/html 200 W6VTFG3HQYDTRLUEGMVA2M5BU75QT47N - - 7968 130376348 ARCHIVEIT-3308-MONTHLY-JOB647847-20180717155427452-00000.warc.gz
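
Each CDX line above is one capture, with space-separated fields. A minimal sketch of reading one line in Python follows, assuming the common 11-column CDX layout (URL key, timestamp, original URL, MIME type, status, digest, redirect, meta tags, length, offset, WARC filename). The field names and the `parse_cdx_line` helper are illustrative and are not part of any Archive-It library.

```python
from datetime import datetime
from typing import NamedTuple


class Capture(NamedTuple):
    """One capture from a CDX line (field names are illustrative)."""
    urlkey: str
    timestamp: datetime
    original: str
    mimetype: str
    status: str      # kept as text because revisit records use "-"
    digest: str
    length: int
    offset: int
    filename: str


def parse_cdx_line(line: str) -> Capture:
    """Split an 11-column CDX line into named fields."""
    parts = line.split()
    if len(parts) != 11:
        raise ValueError(f"expected 11 fields, got {len(parts)}")
    (urlkey, ts, original, mimetype, status,
     digest, _redirect, _meta, length, offset, filename) = parts
    return Capture(
        urlkey,
        datetime.strptime(ts, "%Y%m%d%H%M%S"),  # 14-digit capture time
        original,
        mimetype,
        status,
        digest,
        int(length),
        int(offset),
        filename,
    )


# First line from the slide above
line = (
    "edu,albany)/president/about-the-president.php 20171218033537 "
    "http://www.albany.edu/president/about-the-president.php text/html 200 "
    "VQR3D3JAD6BIB36O4ELJH7L6U2IESQOT - - 6977 871017071 "
    "ARCHIVEIT-3308-MONTHLY-JOB512297-20171217200124134-00000.warc.gz"
)
capture = parse_cdx_line(line)
print(capture.timestamp.isoformat(), capture.digest)
```

The capture time and digest are the two fields most useful for description: the first gives a date, the second shows whether the page content changed.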



Partner Data API


Content Storage


Scope and schedule crawls in Archive-It


Describe in ASpace with other records


Use CDX to post new description to ASpace when page is updated
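
One possible way to decide when a page "is updated" is to compare CDX digests between consecutive captures and skip `warc/revisit` records, which mark unchanged content. The sketch below uses four captures from the CDX slide. The `changed_captures` helper is hypothetical, and treating a digest change as the trigger for new description is an assumption, not a documented workflow.

```python
from typing import Iterable, List, Tuple

# (timestamp, mimetype, digest) taken from the CDX lines shown earlier
captures = [
    ("20180317200843", "text/html", "KY54GB6LH24DRIZGBIO4R3VXXA2UHTN3"),
    ("20180417185510", "text/html", "O6SIJTKE2X565QJ5ZEWPZYSK3H53HZDH"),
    ("20180517190232", "warc/revisit", "O6SIJTKE2X565QJ5ZEWPZYSK3H53HZDH"),
    ("20180617184722", "text/html", "F2CRCBL5TZRROHBN5GCF7IESJR44ABIQ"),
]


def changed_captures(
    rows: Iterable[Tuple[str, str, str]]
) -> List[Tuple[str, str]]:
    """Return (timestamp, digest) for captures whose content differs
    from the previous capture of the same URL."""
    changed = []
    previous_digest = None
    for timestamp, mimetype, digest in sorted(rows):
        # Revisit records and repeated digests mean the page did not change
        if mimetype == "warc/revisit" or digest == previous_digest:
            continue
        changed.append((timestamp, digest))
        previous_digest = digest
    return changed


updates = changed_captures(captures)
for timestamp, digest in updates:
    print(timestamp, digest)
```

Each entry in `updates` is a candidate for one new archival object in ArchivesSpace, dated by its capture timestamp.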


Download WARCs with WASAPI?
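
A rough sketch of what a WASAPI request might look like, assuming Archive-It's `webdata` endpoint and the `collection` and `crawl-time-after` query parameters from the WASAPI specification. The collection ID 3308 is inferred from the WARC filenames on the CDX slide. The sample response is hypothetical and shaped after the specification's `files` list, and the example.org location is a placeholder. Real requests need Archive-It credentials, which this sketch leaves out.

```python
from typing import Dict, Optional
from urllib.parse import urlencode

# Assumed Archive-It WASAPI endpoint
WASAPI_BASE = "https://partner.archive-it.org/wasapi/v1/webdata"


def wasapi_url(collection_id: int, crawl_time_after: Optional[str] = None) -> str:
    """Build a WASAPI query URL for one collection."""
    params = {"collection": collection_id}
    if crawl_time_after:
        params["crawl-time-after"] = crawl_time_after
    return WASAPI_BASE + "?" + urlencode(params)


def warc_locations(response: dict) -> Dict[str, str]:
    """Map each WARC filename to its first download location."""
    return {
        item["filename"]: item["locations"][0]
        for item in response.get("files", [])
        if item.get("locations")
    }


url = wasapi_url(3308, "2018-07-01")
print(url)

# Hypothetical response body; only the filename comes from the CDX slide
sample_response = {
    "count": 1,
    "next": None,
    "files": [
        {
            "filename": "ARCHIVEIT-3308-MONTHLY-JOB647847-20180717155427452-00000.warc.gz",
            "locations": ["https://example.org/placeholder.warc.gz"],
        }
    ],
}
print(warc_locations(sample_response))
```

The same filenames appear in the last CDX column, so a CDX capture can be tied back to the WARC that holds it.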


Get crawl and scoping rules


Make all this discoverable


Two Processes

  1. Create ArchivesSpace records from CDX
  2. Create content records for each capture from Partner Data API

I need help!