EHRI-4Memory OAI-PMH Adapter¶
Overview¶
The EHRI-4Memory OAI-PMH Adapter acts as a bridge between EHRI data sources and the NFDI4Memory Data Space.
Test Endpoint (Temporary)
This is a test endpoint for development and validation. The final production URL will be provided after successful completion of Update API compliance tests.
Data Sources¶
Proxied EHRI Sets (passed through from EHRI OAI-PMH)¶
These sets are directly proxied from the EHRI OAI-PMH endpoint with minimal processing for harvesting archival descriptions from the EHRI Portal:
- Data source: EHRI OAI-PMH API
- Purpose: Harvesting archival descriptions from the EHRI Portal
- Caching: No local caching, direct passthrough
- Data freshness: Real-time - directly from EHRI, no delay
- Change frequency: Varies by institution
- Metadata formats: oai_dc and ead (Encoded Archival Description)
- Deletions: Managed and passed through by EHRI
Set Structure:
Two types of sets are supported:
1. Country Sets
- Lower-case ISO 3166 alpha-2 (2-letter) codes
- Examples: us (United States), de (Germany), il (Israel), at (Austria)
- Contains all archival descriptions from institutions within that country
2. Repository Sets
- Compound identifiers consisting of the country code, a colon, and the repository's EHRI ID
- Format: {country_code}:{ehri_repository_id}
- Examples: us:us-005578 (United States Holocaust Memorial Museum), de:de-002624 (Institut für Zeitgeschichte–Archiv)
- Contains archival descriptions from a specific institution
Currently 916+ sets available.
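The two naming rules can be checked mechanically before a set is requested. The following is a minimal sketch, not part of the adapter; set_kind and set_country are hypothetical helper names introduced here for illustration:

```shell
# Classify a proxied setSpec by the naming rules described above.
# Prints "country", "repository", or "other".
set_kind() (
  LC_ALL=C   # keep [a-z] limited to lower-case ASCII letters
  case "$1" in
    [a-z][a-z])    echo "country" ;;
    [a-z][a-z]:?*) echo "repository" ;;
    *)             echo "other" ;;
  esac
)

# Print the country code of a country or repository set
# (everything before the first colon).
set_country() {
  printf '%s\n' "${1%%:*}"
}
```

For example, `set_kind us:us-005578` prints `repository`, and `set_country us:us-005578` prints `us`.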
⚠️ Important Note for oai_dc format: Portal URLs are automatically added as additional dc:identifier elements with a url: prefix to ensure traceability back to EHRI's portal (e.g., url:https://portal.ehri-project.eu/units/fr-006203-frad080_dossier_c6). This enhancement applies only to oai_dc format. EAD formats remain unchanged to preserve EHRI's original structure.
For more details about the official EHRI OAI-PMH API that we are proxying, see the EHRI OAI-PMH Documentation.
Virtual Sets (from EHRI GraphQL API)¶
Virtual sets include controlled vocabularies, authority sets, country reports and archival institution descriptions not directly available through the original EHRI OAI-PMH API. These are compiled by querying the EHRI GraphQL API and transforming its responses into OAI-PMH compatible records:
- Data source: EHRI GraphQL API
- Caching: Stored locally in SQLite, synchronized daily
- Data freshness: Maximum 24 hours old
- Change frequency: Very stable - these datasets rarely change
- Metadata format: Only oai_dc (Dublin Core)
- Sets: ehri:camps, ehri:ghettos, ehri:terms, ehri:persons, ehri:corporatebodies, ehri:countries, ehri:repositories
- Deletions: Not tracked for virtual sets
Dublin Core enhancements for virtual sets:
Virtual set records include semantic prefixes in dc:coverage and dc:identifier elements:
- url: prefix on portal URLs in dc:identifier (e.g., url:https://portal.ehri-project.eu/keywords/ehri_camps-1)
- geo: prefix for geographic coordinates (e.g., geo:50.026199,19.204099)
- temporal: prefix for dates of existence (e.g., temporal:1889-1945)
- spatial: prefix for place names (e.g., spatial:Budapest, Hungary)
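A harvester that consumes these values usually needs to separate the prefix from the payload. A minimal sketch, assuming only the four prefixes listed above occur; split_dc_value is a hypothetical helper name, not something the adapter provides:

```shell
# Split a prefixed Dublin Core value into "<kind><TAB><value>".
# Only the four documented prefixes are recognised; anything else is
# reported as kind "plain" with the value unchanged.
split_dc_value() {
  case "$1" in
    url:*|geo:*|temporal:*|spatial:*)
      # Split on the FIRST colon only, so URLs keep their own colons.
      printf '%s\t%s\n' "${1%%:*}" "${1#*:}" ;;
    *)
      printf 'plain\t%s\n' "$1" ;;
  esac
}
```

`split_dc_value 'geo:50.026199,19.204099'` prints `geo` and `50.026199,19.204099` separated by a tab. Splitting on the first colon only keeps values such as url:https://... intact.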
OAI-PMH Verbs¶
Identify¶
Returns information about the repository.
Request:
Response contains:
- Repository name and description
- Base URL
- Earliest date: 2013-09-09 (EHRI service start date)
- Deletion policy: transient (EHRI tracks deletions for proxied sets; virtual sets don't track deletions)
- 4Memory API Advertisement per Blue Paper:
  - Update API endpoint and documentation URLs (may change after compliance testing)
  - Access API advertised with oai_dc prefix (Note: The Access API is not implemented. However, it is advertised to instruct harvesters to use the Update API with the oai_dc metadata prefix instead of the not-yet-available n4m-ds format.)
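A harvester can read these fields from the Identify response before deciding how to handle deletions. The sketch below assumes the test endpoint base URL used in the harvesting examples of this document and the standard OAI-PMH element names deletedRecord and earliestDatestamp; identify_field is a hypothetical helper, and the sed expression is a simple text match rather than a full XML parser:

```shell
# Test endpoint taken from the harvesting examples; it may change.
BASE_URL="http://ehrinfdi4memory.toolbox21.com/oai"

# Read an Identify response on stdin and print the text content of one
# simple child element, e.g. deletedRecord or earliestDatestamp.
identify_field() {
  sed -n "s:.*<$1>\([^<]*\)</$1>.*:\1:p" | head -n 1
}

# Usage (requires network access):
#   curl -s "$BASE_URL?verb=Identify" | identify_field deletedRecord
```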
ListSets¶
Lists all available sets (virtual + EHRI sets).
Request:
Important for 4Memory harvesters:
- The first page contains all 7 virtual sets followed by the first 200 proxied EHRI sets
- If there are more than 200 proxied EHRI sets (currently 916), a resumptionToken is returned for pagination
- Total of 923+ sets available (7 virtual sets + 916 proxied EHRI sets)
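The page count follows from the numbers above. As a rough sketch, assuming every follow-up page also carries up to 200 proxied sets (the text only states this for the first page), the number of ListSets requests can be estimated like this; listsets_pages is a hypothetical helper name:

```shell
# Estimate how many ListSets requests are needed.
# Arguments: number of proxied sets, page size.
listsets_pages() {
  proxied=$1
  page_size=$2
  # Ceiling division; the 7 virtual sets travel with the first page.
  pages=$(( (proxied + page_size - 1) / page_size ))
  # Even without proxied sets, one page (the virtual sets) is returned.
  if [ "$pages" -lt 1 ]; then
    pages=1
  fi
  echo "$pages"
}

listsets_pages 916 200   # prints 5
```

With 916 proxied sets this gives 5 requests: the first page with 7 virtual and 200 proxied sets, then four more pages for the remaining 716.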
ListMetadataFormats¶
Shows available metadata formats for a specific record or globally.
Request (global - lists all formats available in the repository):
Returns: oai_dc, ead, and ead3
Request (specific record):
Behavior by identifier type:
- Virtual record identifier (e.g., ehri_camps-38): Returns only oai_dc
- EHRI identifier (e.g., us-005578-irn501226): Proxied to EHRI, returns oai_dc, ead, and ead3
- Invalid identifier: Returns idDoesNotExist error
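The two request variants differ only in the optional identifier argument. A minimal sketch, assuming the test endpoint base URL from the harvesting examples in this document; list_formats_url and formats_in_response are hypothetical helper names, and the response is scanned with grep rather than parsed as XML:

```shell
# Test endpoint taken from the harvesting examples; it may change.
BASE_URL="http://ehrinfdi4memory.toolbox21.com/oai"

# Print the ListMetadataFormats URL; the identifier argument is optional.
list_formats_url() {
  if [ -n "${1:-}" ]; then
    printf '%s?verb=ListMetadataFormats&identifier=%s\n' "$BASE_URL" "$1"
  else
    printf '%s?verb=ListMetadataFormats\n' "$BASE_URL"
  fi
}

# Read a ListMetadataFormats response on stdin and print one
# metadataPrefix value per line.
formats_in_response() {
  grep -o '<metadataPrefix>[^<]*</metadataPrefix>' | sed 's/<[^>]*>//g'
}

# Usage (requires network access):
#   curl -s "$(list_formats_url ehri_camps-38)" | formats_in_response
```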
ListIdentifiers¶
Lists only identifiers (headers) without metadata.
Request:
Parameters:
- metadataPrefix (required): oai_dc or ead
- set (optional): Set specification. Warning: If no set is specified, the response defaults to EHRI OAI-PMH proxied sets only and does not include records from virtual sets
- from (optional): ISO 8601 date - see note below about date semantics
- until (optional): ISO 8601 date for date range
Important notes:
- Virtual sets only support oai_dc format. Using ead with virtual sets returns a cannotDisseminateFormat error.
- For virtual sets, dates reflect when records were synced to our cache, not when they were modified in EHRI. This is useful for tracking what's new in our cache but doesn't represent actual changes in the source data.
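Because an unsupported combination only surfaces as an error after a round trip, a client may want to check it locally first. A sketch under the assumption that every virtual set starts with ehri: (as in the set list of this document); format_allowed is a hypothetical helper:

```shell
# Client-side pre-check mirroring the rule above: virtual sets (ehri:*)
# are only available as oai_dc.
# Arguments: metadataPrefix, set specification (optional).
# Returns 0 if the combination is expected to work, 1 if the adapter
# is expected to answer with cannotDisseminateFormat.
format_allowed() {
  prefix=$1
  set_spec=${2:-}
  case "$set_spec" in
    ehri:*) [ "$prefix" = "oai_dc" ] ;;
    *)      return 0 ;;
  esac
}
```

For instance, `format_allowed ead ehri:camps` fails, while `format_allowed ead us` and `format_allowed oai_dc ehri:camps` succeed.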
ListRecords¶
Returns complete records with metadata.
Request:
Parameters:
- metadataPrefix (required): oai_dc or ead
- set (optional): Set specification. Warning: If no set is specified, the response defaults to EHRI OAI-PMH proxied sets only and does not include records from virtual sets
- from (optional): ISO 8601 date for incremental harvesting
- until (optional): ISO 8601 date for date range
Important notes:
- Virtual sets only support oai_dc format. Using ead with virtual sets returns a cannotDisseminateFormat error.
- For virtual sets, dates reflect when records were synced to our cache, not when they were modified in EHRI. This is useful for tracking what's new in our cache but doesn't represent actual changes in the source data.
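A ListRecords request built from these parameters might look like the following sketch. It assumes the test endpoint base URL used elsewhere in this document; list_records_url is a hypothetical helper that leaves out optional parameters when they are empty:

```shell
# Test endpoint taken from the harvesting examples; it may change.
BASE_URL="http://ehrinfdi4memory.toolbox21.com/oai"

# Build a ListRecords URL.
# Arguments: metadataPrefix, set, from, until (the last three optional).
list_records_url() {
  url="$BASE_URL?verb=ListRecords&metadataPrefix=$1"
  if [ -n "${2:-}" ]; then url="$url&set=$2"; fi
  if [ -n "${3:-}" ]; then url="$url&from=$3"; fi
  if [ -n "${4:-}" ]; then url="$url&until=$4"; fi
  printf '%s\n' "$url"
}

# Usage (requires network access):
#   curl -s "$(list_records_url oai_dc ehri:camps)" -o camps.xml
```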
GetRecord¶
Retrieves a single record.
Request:
Parameters:
- identifier (required): Record identifier
- metadataPrefix (required): oai_dc or ead
Pagination with Resumption Tokens¶
How it works¶
For large result sets, resumption tokens are used:
- First request returns:
  - Virtual sets: Maximum 200 records
  - EHRI proxied sets: Maximum 200 records
- Resumption token is provided at the end of the response
- Subsequent calls use only the verb and the token (no other parameters)
Example for complete harvesting¶
# First request
curl "http://ehrinfdi4memory.toolbox21.com/oai?verb=ListRecords&metadataPrefix=oai_dc&set=ehri:camps" \
-o batch1.xml
# Extract token from batch1.xml
TOKEN=$(grep -o '<resumptionToken[^>]*>[^<]*</resumptionToken>' batch1.xml | \
sed 's/<[^>]*>//g')
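# Retrieve the next batch; the token is URL-encoded because Base64 tokens may contain '+', '/' or '='
curl -G "http://ehrinfdi4memory.toolbox21.com/oai" \
--data-urlencode "verb=ListRecords" \
--data-urlencode "resumptionToken=$TOKEN" \
-o batch2.xml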
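The two-step example generalises to a loop that runs until no token is returned. The following sketch assumes the same test endpoint and that the last page carries an empty or absent resumptionToken element, as the OAI-PMH specification describes; extract_token and harvest_all are hypothetical helper names:

```shell
# Test endpoint taken from the example above; it may change.
BASE_URL="http://ehrinfdi4memory.toolbox21.com/oai"

# Print the resumption token found in a response file.
# Prints nothing useful when the element is absent or empty (last page).
extract_token() {
  sed -n 's:.*<resumptionToken[^>]*>\([^<]*\)</resumptionToken>.*:\1:p' "$1"
}

# Harvest every page into batch1.xml, batch2.xml, ... and print the
# number of pages fetched. Arguments: set specification, metadataPrefix.
harvest_all() {
  n=1
  curl -s -G "$BASE_URL" \
    --data-urlencode "verb=ListRecords" \
    --data-urlencode "metadataPrefix=$2" \
    --data-urlencode "set=$1" \
    -o "batch$n.xml" || return 1
  token=$(extract_token "batch$n.xml")
  while [ -n "$token" ]; do
    n=$((n + 1))
    # Follow-up requests carry only the verb and the token.
    curl -s -G "$BASE_URL" \
      --data-urlencode "verb=ListRecords" \
      --data-urlencode "resumptionToken=$token" \
      -o "batch$n.xml" || return 1
    token=$(extract_token "batch$n.xml")
  done
  echo "$n"
}
```

`harvest_all ehri:camps oai_dc` writes batch1.xml, batch2.xml, and so on into the current directory and prints the number of pages fetched.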
Token structure¶
- Virtual sets: Base64-encoded, contains set, offset, date range
- EHRI sets: Original EHRI token, directly passed through
Incremental Harvesting¶
For regular updates we recommend:
- Updates for virtual sets: Full harvest recommended (sets are small and completely replaced during sync)
- Updates for EHRI sets: Use the from parameter for incremental harvesting
Example for incremental harvesting¶
For macOS:
FROM_DATE=$(date -u -v-2m '+%Y-%m-%dT00:00:00Z')
curl "http://ehrinfdi4memory.toolbox21.com/oai?verb=ListRecords&metadataPrefix=oai_dc&from=$FROM_DATE"
For Linux:
FROM_DATE=$(date -u -d "2 months ago" '+%Y-%m-%dT00:00:00Z')
curl "http://ehrinfdi4memory.toolbox21.com/oai?verb=ListRecords&metadataPrefix=oai_dc&from=$FROM_DATE"
Supported date formats:
- Full ISO 8601: 2023-01-01T00:00:00Z
- Date only: 2023-01-01 (interpreted as start of day)
- Empty dates are ignored (useful for optional parameters in scripts)
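Scripts that take dates from user input can check the shape before sending a request. A minimal sketch; is_supported_date is a hypothetical helper that accepts exactly the two formats listed above and does not verify that the date exists in the calendar:

```shell
# Return 0 if the argument looks like YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ.
# Only the shape is checked: 2023-02-31 would still pass.
is_supported_date() {
  printf '%s\n' "$1" |
    grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}(T[0-9]{2}:[0-9]{2}:[0-9]{2}Z)?$'
}
```

An empty string is rejected by this check; since the adapter ignores empty dates, a script can simply skip the parameter in that case.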
Specifics of incremental harvesting¶
Virtual sets:
- from and until filter by sync date (when records were added/updated in our cache)
- All records from a sync batch receive the same timestamp
- Daily sync replaces all records, so dates indicate cache freshness, not actual EHRI changes
- Useful for: "Show me what was added to the cache since yesterday"
- Not useful for: "Show me what actually changed in EHRI since yesterday"
EHRI sets:
- from and until are passed through to EHRI
- EHRI tracks actual modification dates of records
- Provides true incremental harvesting of changed records
- Useful for: "Show me what actually changed in EHRI since yesterday"
Error Handling¶
Standard OAI-PMH error codes¶
| Code | Meaning | Typical cause |
|---|---|---|
| badVerb | Unknown verb | Typo in verb parameter |
| badArgument | Invalid parameters | Missing required parameters |
| cannotDisseminateFormat | Format not supported | e.g. ead for virtual sets |
| idDoesNotExist | Unknown identifier | Record doesn't exist |
| noRecordsMatch | No matches | Empty set or no matches in time range |
| badResumptionToken | Invalid token | Token expired or corrupted |
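In an OAI-PMH response, the code appears as the code attribute of an error element. A small sketch for pulling it out of a response in a script, assuming double-quoted attributes and using a text match instead of an XML parser; oai_error_code is a hypothetical helper:

```shell
# Read an OAI-PMH response on stdin and print the first error code,
# or nothing when the response carries no <error> element.
oai_error_code() {
  sed -n 's:.*<error code="\([A-Za-z]*\)".*:\1:p' | head -n 1
}

# Usage (requires network access):
#   code=$(curl -s "$URL" | oai_error_code)
#   [ -z "$code" ] || echo "request failed with $code" >&2
```

Here URL stands for any request URL against the adapter; it is a placeholder, not a variable defined by this document.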
Performance Considerations¶
Harvesting parameters¶
- Batch size: 200 records per page
- Timeout: 30 seconds per request
Monitoring and Status¶
Health Check Endpoint¶
Returns JSON with:
- status: "healthy" or error status
- version: Adapter version
- database: Connection status
- sync_in_progress: Boolean indicating if sync is running
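These fields can be used to gate a harvest run. The sketch below assumes the JSON body has already been fetched (the endpoint path is not given in this section) and that a harvester prefers to wait while a sync is running, which is a design choice rather than a documented requirement; is_ready is a hypothetical helper and matches text instead of parsing JSON:

```shell
# Read the health JSON on stdin. Succeeds only when status is "healthy"
# and sync_in_progress is not true.
is_ready() {
  body=$(cat)
  printf '%s' "$body" |
    grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"' &&
    ! printf '%s' "$body" |
      grep -Eq '"sync_in_progress"[[:space:]]*:[[:space:]]*true'
}

# Usage (HEALTH_URL is a placeholder for the health check endpoint):
#   curl -s "$HEALTH_URL" | is_ready && echo "ok to harvest"
```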