Data Analytics Approaches for Web Archives
Gregory Wiedeman
University Archivist
M.E. Grenander Department of Special Collections & Archives
Collecting the Web at UAlbany
- Preserve and manage permanent public records
- Crawling and preserving albany.edu since 2012
- Began collecting sites outside albany.edu in 2016
- 1.5 TB of Web Data
- Collect, preserve, provide access, and encourage use
Working with WARCs
- .warc file format is an ISO standard (ISO 28500)
- Tremendous volume of data
- Mess of HTML, CSS, JavaScript
- Unclear provenance
- Standard derivative datasets
- Tools becoming easier to use
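
As a rough illustration of what sits inside a .warc file, the sketch below walks the records of a WARC stream using only the Python standard library and tallies them by WARC-Type. The names iter_warc_records, open_warc, count_record_types, and make_record, along with the example.org sample data, are hypothetical and not part of any UAlbany workflow. The parser is deliberately minimal and ignores parts of the standard that real archives rely on, so an established library such as warcio is likely a better fit for actual research use.

```python
import gzip
import io
from collections import Counter


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in a binary WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separate one record from the next
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length gives the size of the record's content block in bytes.
        length = int(headers.get("Content-Length", 0))
        yield headers, stream.read(length)


def open_warc(path):
    """Open a .warc or .warc.gz file as a binary stream (hypothetical helper)."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")


def count_record_types(stream):
    """Tally records by their WARC-Type header (response, request, ...)."""
    return Counter(h.get("WARC-Type", "unknown") for h, _ in iter_warc_records(stream))


def make_record(warc_type, uri, body):
    """Build one tiny hand-made record so the sketch runs without a real crawl."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {warc_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


# Made-up sample data, not taken from any real archive.
SAMPLE = (
    make_record("request", "http://example.org/", b"GET / HTTP/1.1\r\n\r\n")
    + make_record("response", "http://example.org/", b"<html>home</html>")
    + make_record("response", "http://example.org/about", b"<html>about</html>")
)

if __name__ == "__main__":
    print(count_record_types(io.BytesIO(SAMPLE)))
```

For a file on disk, the same tally would be count_record_types(open_warc(path)), where path is whatever .warc or .warc.gz file is at hand. Even this small count hints at why derivative datasets help: most analysis questions concern a small slice of the records rather than the raw file.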
Collecting the Web at UAlbany
- We need partners!
- Smaller research-focused collections
- Work with Internet Archive to get older data
- How can we improve?