Processing Born-Digital Images at Scale
Gregory Wiedeman
University Archivist
University at Albany, SUNY
NDSA Content Interest Group
Born-Digital Photography at UAlbany
Began in 1999
Campus Photographer in Digital Media Department
Photographer is state employee, so images are public records!
Two periods, with gap in-between
1999-2010 wrote files to DVDs, Access DB
2012-present SmugMug service
Growth of Born-Digital Photography 1999-2010
1999-2010 Period
4 Boxes of DVDs, about 600 discs
about 20 CD-Rs
1.8 TB of data
Camera Raw files (.NEF, .CR2) and .JPG derivatives of used images
1999-2010 Period
Images stored in job number folders; job number written on each disc
Metadata in Access DB
Date
Photographer's description of event
Costs, etc.
Previous Work
Student Assistant manually selecting images
Adding detailed item-level metadata
Upload to Luna DAMS
Only a small percentage processed over the years
2012-present
SmugMug service
Online photo web app
about 20,000 images
Photographer uploads a selection of images with metadata
SmugMug has an API
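A minimal sketch of querying the SmugMug v2 API with requests, assuming a bare API key and the albums endpoint; the account nickname, key, and response fields are illustrative, and non-public content would require OAuth:

```python
import requests

# Hypothetical API key and account nickname; public data only.
API_KEY = "YOUR_API_KEY"

resp = requests.get(
    "https://api.smugmug.com/api/v2/user/ualbany!albums",
    params={"APIKey": API_KEY},
    headers={"Accept": "application/json"},
)
resp.raise_for_status()

# List each album's name and URI from the JSON response.
for album in resp.json()["Response"]["Album"]:
    print(album["Name"], album["Uri"])
```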
Needs
Automation
Need to scale
No staff time for metadata creation; images must describe themselves
Transparency
Researchers need context
Access
No restrictions, immediate public access
Presentation within existing collections (EAD)
Support reference work now
Using Python
Python is great at working with data across systems
Requests library to query SmugMug API
os library to read filesystem, copy files
subprocess to call other command-line tools
TSK, ImageMagick
Bagit-python
lxml to work with XML and EAD
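A small sketch of the kind of glue this stack enables: os to walk a job number folder and subprocess to call ImageMagick's identify on each image. The directory name and extension filter are illustrative:

```python
import os
import subprocess

# Walk a job-number folder and ask ImageMagick for each image's dimensions.
for root, dirs, files in os.walk("job_12345"):
    for name in files:
        if name.lower().endswith((".jpg", ".nef", ".cr2")):
            path = os.path.join(root, name)
            result = subprocess.run(
                ["identify", "-format", "%w x %h", path],
                capture_output=True, text=True,
            )
            print(path, result.stdout.strip())
```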
Mass Image DVDs
Used BitCurator Forensic Machine
Ripping from discs ran into filesystem issues
different ISO formats
Running dd was most dependable
5 external disk drives
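A minimal sketch of the dd step driven from Python via subprocess; the device path and output naming are assumptions that depend on which external drive holds the disc:

```python
import subprocess
from pathlib import Path

def image_disc(device="/dev/sr0", out_dir="disk_images", label="disk001"):
    """Image one DVD to a .dd file with dd."""
    out_file = Path(out_dir) / f"{label}.dd"
    # bs=2048 matches the DVD sector size; conv=noerror,sync
    # keeps reading past damaged sectors instead of aborting.
    subprocess.run(
        ["dd", f"if={device}", f"of={out_file}",
         "bs=2048", "conv=noerror,sync"],
        check=True,
    )
    return out_file
```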
Mass Image DVDs
Appraisal in Born-Digital Processing
Archives manage materials at scale
Time-limited project, initially less than 2 months
Other collections need attention
Appraisal Decisions
Not to retain camera raw permanently
Large files are an access barrier
Not a final product, proprietary
Convert all files to JPG
1.8 TB down to a manageable 274 MB
Not spending time recovering all files
diminishing returns
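A minimal sketch of the conversion step, assuming ImageMagick's convert is installed with a raw delegate (dcraw/ufraw) for .NEF and .CR2 files; the directory names and quality setting are illustrative:

```python
import subprocess
from pathlib import Path

out_dir = Path("jpg_derivatives")
out_dir.mkdir(exist_ok=True)

# Convert each camera raw file in a job folder to a JPG derivative.
for raw in Path("job_12345").glob("*.NEF"):
    out_file = out_dir / (raw.stem + ".jpg")
    subprocess.run(
        ["convert", str(raw), "-quality", "85", str(out_file)],
        check=True,
    )
```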
JPEG compression!?
The edited, used pre-2010 images were already JPGs
Why go back to unedited raw?
All post-2010 images through SmugMug were JPGs
JPG compression visually looked the best
Purpose of collection was to document university events
Users happy with JPGs
Not using compression is not a preservation strategy
In the Spirit of OAIS
Does appraisal conflict with OAIS?
SIPs are non-permanent .dd images and Access DB exports
AIPs are processed bags with metadata
DIPs are metadata in EAD, linked to JPGs on web server
The SIPs/AIPs
Crawling SmugMug
Wrote a crawler for SmugMug
Download all images
Periodically crawl for updates
Hash index to see if already downloaded
Package into Bags with XML metadata file
Directory structure and descriptive metadata
Separate script to write metadata into EAD files
It broke
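A minimal sketch of the crawler's dedupe-and-package steps, assuming a SHA-256 hash index and bagit-python; the in-memory index and directory names are illustrative, and a real crawler would persist its index between runs:

```python
import hashlib
import bagit  # bagit-python

def sha256(path):
    """Stream a file through SHA-256 so large images never load into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

seen = set()  # hash index; persist this between crawls in practice

def is_new(path):
    """Return True only for files not seen on a previous crawl."""
    digest = sha256(path)
    if digest in seen:
        return False
    seen.add(digest)
    return True

# Package a completed image set as a bag with checksums and bag-info metadata.
bag = bagit.make_bag(
    "image_set_2017-05",
    {"Source-Organization": "University at Albany, SUNY"},
)
```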
Access
Listed DVD albums in CSV
Student arranged them into the SmugMug directory structure with a spreadsheet
Described image sets without Job numbers
Python script updated EAD-XML with directory structure
Each image set is a component with DAO link
Batch generated HTML files
Exposed as XTF finding aid
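A minimal sketch of the EAD update with lxml, attaching a DAO link to a matching component; the file name, unitid value, and URL are illustrative:

```python
from lxml import etree

EAD_NS = "urn:isbn:1-931666-22-9"
XLINK_NS = "http://www.w3.org/1999/xlink"
NSMAP = {"ead": EAD_NS, "xlink": XLINK_NS}

tree = etree.parse("ua600.xml")

# Find the component for a given job number and attach a <dao> link.
for did in tree.xpath(
    "//ead:c/ead:did[ead:unitid='Job 12345']", namespaces=NSMAP
):
    dao = etree.SubElement(did, f"{{{EAD_NS}}}dao")
    dao.set(f"{{{XLINK_NS}}}href",
            "https://media.archives.albany.edu/job_12345/")

tree.write("ua600.xml", encoding="utf-8", xml_declaration=True)
```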
Access
Things I’ve learned
This gets complicated
Just because you can script it doesn't mean it's a sustainable workflow
You can put any junk in XML
Maintenance is an issue
Infrastructure first
Building Infrastructure
Stopped collecting from SmugMug for now
Arclight for archival description
Hyrax repository for digital archives content
connects to Arclight API
backed by data model
New SIP/AIP Model
Validate to spec with bagit-profiles
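A minimal sketch of profile validation, assuming the bagit-profiles-validator package (bagit_profile) alongside bagit-python; the profile URL and bag path are illustrative:

```python
import bagit
from bagit_profile import Profile  # bagit-profiles-validator

# Load an existing AIP bag and a published BagIt profile.
bag = bagit.Bag("aips/image_set_2017-05")
profile = Profile("https://archives.albany.edu/static/bagitprofile.json")

# Check the bag against the spec before accepting it into storage.
if profile.validate(bag):
    print("Bag conforms to the profile")
```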
Future Plans
Network of open systems connected by REST APIs
Ingest utility
Make Bag according to Spec
Post to Hyrax, backed by Data Model
Post accession to ArchivesSpace
Processing utility
Any content processing
Post description to ArchivesSpace, exposed in Arclight
Post public content to Hyrax, linked to Arclight description
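A minimal sketch of what the ingest utility's ArchivesSpace step could look like over the REST API; the host, credentials, repository id, and accession fields are all illustrative:

```python
import requests

ASPACE = "http://localhost:8089"  # hypothetical backend address

# Authenticate and grab a session token.
session = requests.post(
    f"{ASPACE}/users/admin/login", params={"password": "admin"}
).json()["session"]
headers = {"X-ArchivesSpace-Session": session}

# Post a new accession record for an ingested image set.
accession = {
    "title": "Campus photographs, May 2017",
    "id_0": "2017",
    "id_1": "042",
    "accession_date": "2017-05-31",
}
resp = requests.post(
    f"{ASPACE}/repositories/2/accessions", headers=headers, json=accession
)
print(resp.json())
```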
Processing Born-Digital Images at Scale