EHRI-4Memory OAI-PMH Adapter

Overview

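The EHRI-4Memory OAI-PMH Adapter bridges EHRI data sources and the NFDI4Memory Data Space. It exposes two kinds of content through a single OAI-PMH endpoint: sets proxied from the EHRI OAI-PMH API and virtual sets compiled from the EHRI GraphQL API.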

Test Endpoint (Temporary)

http://ehrinfdi4memory.toolbox21.com/oai
This is a test endpoint for development and validation. The final production URL will be provided after successful completion of Update API compliance tests.

Data Sources

Proxied EHRI Sets (passed through from EHRI OAI-PMH)

These sets are proxied directly from the EHRI OAI-PMH endpoint with minimal processing:

  • Data source: EHRI OAI-PMH API
  • Purpose: Harvesting archival descriptions from the EHRI Portal
  • Caching: No local caching, direct passthrough
  • Data freshness: Real-time - directly from EHRI, no delay
  • Change frequency: Varies by institution
  • Metadata formats: oai_dc and ead (Encoded Archival Description)
  • Deletions: Managed and passed through by EHRI

Set Structure:

Two types of sets are supported:

1. Country Sets

  • Lower-case ISO 3166 alpha-2 (2-letter) codes
  • Examples: us (United States), de (Germany), il (Israel), at (Austria)
  • Contains all archival descriptions from institutions within that country

2. Repository Sets

  • Compound identifiers consisting of the country code, a colon, and the repository's EHRI ID
  • Format: {country_code}:{ehri_repository_id}
  • Examples:
    • us:us-005578 (United States Holocaust Memorial Museum)
    • de:de-002624 (Institut für Zeitgeschichte–Archiv)
  • Contains archival descriptions from a specific institution

Currently 916+ proxied EHRI sets are available.
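
The two setSpec shapes described above can be told apart with a simple pattern check. The following is a minimal sketch in POSIX shell. The function name classify_set is hypothetical and not part of the adapter, and it tests only the shape of the string, not whether the set actually exists.

```shell
# classify_set SETSPEC: print "country", "repository", or "other".
# Hypothetical helper; it checks only the documented shape of the setSpec.
classify_set() {
  case "$1" in
    [[:lower:]][[:lower:]])     echo "country" ;;     # e.g. de
    [[:lower:]][[:lower:]]:?*)  echo "repository" ;;  # e.g. us:us-005578
    *)                          echo "other" ;;
  esac
}
```

For example, `classify_set us:us-005578` prints `repository`, while `classify_set de` prints `country`.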

⚠️ Important Note for oai_dc format: Portal URLs are automatically added as additional dc:identifier elements with a url: prefix to ensure traceability back to EHRI's portal (e.g., url:https://portal.ehri-project.eu/units/fr-006203-frad080_dossier_c6). This enhancement applies only to oai_dc format. EAD formats remain unchanged to preserve EHRI's original structure.

For more details about the official EHRI OAI-PMH API that we are proxying, see the EHRI OAI-PMH Documentation.

Virtual Sets (from EHRI GraphQL API)

Virtual sets include controlled vocabularies, authority sets, country reports and archival institution descriptions not directly available through the original EHRI OAI-PMH API. These are compiled by querying the EHRI GraphQL API and transforming its responses into OAI-PMH compatible records:

  • Data source: EHRI GraphQL API
  • Caching: Stored locally in SQLite, synchronized daily
  • Data freshness: Maximum 24 hours old
  • Change frequency: Very stable - these datasets rarely change
  • Metadata format: Only oai_dc (Dublin Core)
  • Sets: ehri:camps, ehri:ghettos, ehri:terms, ehri:persons, ehri:corporatebodies, ehri:countries, ehri:repositories
  • Deletions: Not tracked for virtual sets

Dublin Core enhancements for virtual sets:

Virtual set records include semantic prefixes in dc:coverage and dc:identifier elements:

  • url: prefix on portal URLs in dc:identifier (e.g., url:https://portal.ehri-project.eu/keywords/ehri_camps-1)
  • geo: prefix for geographic coordinates (e.g., geo:50.026199,19.204099)
  • temporal: prefix for dates of existence (e.g., temporal:1889-1945)
  • spatial: prefix for place names (e.g., spatial:Budapest, Hungary)
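
A harvester will usually want to separate these prefixes from the values they qualify. The sketch below shows one way to do that with POSIX parameter expansion. The helper names dc_prefix and dc_value are hypothetical, and the code assumes that only the four prefixes listed above occur.

```shell
# dc_prefix VALUE: print the semantic prefix (url, geo, temporal, spatial),
# or an empty line if the value carries none of them.
dc_prefix() {
  case "$1" in
    url:*|geo:*|temporal:*|spatial:*) printf '%s\n' "${1%%:*}" ;;
    *) printf '\n' ;;
  esac
}

# dc_value VALUE: print the value with a recognised prefix removed.
dc_value() {
  case "$1" in
    url:*|geo:*|temporal:*|spatial:*) printf '%s\n' "${1#*:}" ;;
    *) printf '%s\n' "$1" ;;
  esac
}
```

Only the text up to the first colon is removed, so the colons inside a portal URL remain intact.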

OAI-PMH Verbs

Identify

Returns information about the repository.

Request:

GET /oai?verb=Identify

Response contains:

  • Repository name and description
  • Base URL
  • Earliest date: 2013-09-09 (EHRI service start date)
  • Deletion policy: transient (EHRI tracks deletions for proxied sets; virtual sets don't track deletions)
  • 4Memory API Advertisement per Blue Paper:
    • Update API endpoint and documentation URLs (may change after compliance testing)
    • Access API advertised with the oai_dc prefix. Note: the Access API is not implemented. It is advertised only to instruct harvesters to use the Update API with the oai_dc metadata prefix instead of the not-yet-available n4m-ds format.

ListSets

Lists all available sets (virtual + EHRI sets).

Request:

GET /oai?verb=ListSets

Important for 4Memory harvesters:

  • The first page contains all 7 virtual sets followed by the first 200 proxied EHRI sets
  • If there are more than 200 proxied EHRI sets (currently 916), a resumptionToken is returned for pagination
  • Total of 923+ sets available (7 virtual sets + 916 proxied EHRI sets)
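
The number of requests needed for a full ListSets harvest follows from these figures. The sketch below assumes that every follow-up page also carries up to 200 proxied sets; the text above states the page size only for the first page. The function name listsets_requests is hypothetical.

```shell
# listsets_requests PROXIED_COUNT [PAGE_SIZE]
# Print how many ListSets requests are needed, assuming PAGE_SIZE proxied
# sets per page (default 200). The virtual sets ride along on the first page.
listsets_requests() {
  proxied=$1
  page_size=${2:-200}
  pages=$(( (proxied + page_size - 1) / page_size ))  # ceiling division
  if [ "$pages" -lt 1 ]; then pages=1; fi             # first page always exists
  echo "$pages"
}
```

With 916 proxied sets this gives 5 requests: the initial one plus four that use a resumptionToken.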

ListMetadataFormats

Shows available metadata formats for a specific record or globally.

Request (global - lists all formats available in the repository):

GET /oai?verb=ListMetadataFormats

Returns: oai_dc, ead, and ead3

Request (specific record):

GET /oai?verb=ListMetadataFormats&identifier={id}

Behavior by identifier type:

  • Virtual record identifier (e.g., ehri_camps-38): Returns only oai_dc
  • EHRI identifier (e.g., us-005578-irn501226): Proxied to EHRI, returns oai_dc, ead, and ead3
  • Invalid identifier: Returns idDoesNotExist error

ListIdentifiers

Lists only identifiers (headers) without metadata.

Request:

GET /oai?verb=ListIdentifiers&metadataPrefix={format}&set={setSpec}

Parameters:

  • metadataPrefix (required): oai_dc or ead
  • set (optional): Set specification. Warning: If no set is specified, the response defaults to EHRI OAI-PMH proxied sets only and does not include records from virtual sets
  • from (optional): ISO 8601 date - see note below about date semantics
  • until (optional): ISO 8601 date for date range

Important notes:

  • Virtual sets only support oai_dc format. Using ead with virtual sets returns a cannotDisseminateFormat error.
  • For virtual sets, dates reflect when records were synced to our cache, not when they were modified in EHRI. This is useful for tracking what's new in our cache but doesn't represent actual changes in the source data.

ListRecords

Returns complete records with metadata.

Request:

GET /oai?verb=ListRecords&metadataPrefix={format}&set={setSpec}

Parameters:

  • metadataPrefix (required): oai_dc or ead
  • set (optional): Set specification. Warning: If no set is specified, the response defaults to EHRI OAI-PMH proxied sets only and does not include records from virtual sets
  • from (optional): ISO 8601 date for incremental harvesting
  • until (optional): ISO 8601 date for date range

Important notes:

  • Virtual sets only support oai_dc format. Using ead with virtual sets returns a cannotDisseminateFormat error.
  • For virtual sets, dates reflect when records were synced to our cache, not when they were modified in EHRI. This is useful for tracking what's new in our cache but doesn't represent actual changes in the source data.
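
The parameter rules above can be checked on the client before a request is sent. The following sketch builds a ListRecords URL and rejects the one combination documented as unsupported (ead with a virtual set). The function name list_records_url is hypothetical, and the sketch assumes that every virtual set name starts with ehri:, as in the list of virtual sets.

```shell
BASE_URL="http://ehrinfdi4memory.toolbox21.com/oai"

# list_records_url PREFIX [SET] [FROM] [UNTIL]
# Print the request URL, or fail for ead on a virtual (ehri:*) set.
list_records_url() {
  prefix=$1
  set_spec=${2:-}
  from=${3:-}
  until_date=${4:-}
  case "$set_spec" in
    ehri:*)
      if [ "$prefix" != "oai_dc" ]; then
        echo "virtual sets only support oai_dc (cannotDisseminateFormat)" >&2
        return 1
      fi ;;
  esac
  url="$BASE_URL?verb=ListRecords&metadataPrefix=$prefix"
  # Optional parameters are appended only when they are non-empty
  if [ -n "$set_spec" ]; then url="$url&set=$set_spec"; fi
  if [ -n "$from" ]; then url="$url&from=$from"; fi
  if [ -n "$until_date" ]; then url="$url&until=$until_date"; fi
  printf '%s\n' "$url"
}
```

A call such as `curl "$(list_records_url oai_dc ehri:camps)"` then fetches the first page of the camps set.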

GetRecord

Retrieves a single record.

Request:

GET /oai?verb=GetRecord&identifier={id}&metadataPrefix={format}

Parameters:

  • identifier (required): Record identifier
  • metadataPrefix (required): oai_dc or ead

Pagination with Resumption Tokens

How it works

For large result sets, resumption tokens are used:

  • First request returns:
    • Virtual sets: Maximum 200 records
    • EHRI proxied sets: Maximum 200 records
  • A resumption token is provided at the end of the response
  • Subsequent calls use only the verb and the resumption token (no other parameters)

Example for complete harvesting

# First request
curl "http://ehrinfdi4memory.toolbox21.com/oai?verb=ListRecords&metadataPrefix=oai_dc&set=ehri:camps" \
  -o batch1.xml

# Extract token from batch1.xml
TOKEN=$(grep -o '<resumptionToken[^>]*>[^<]*</resumptionToken>' batch1.xml | \
        sed 's/<[^>]*>//g')

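# Retrieve the next batch. -G with --data-urlencode percent-encodes the token,
# since Base64 tokens can contain +, / and =
curl -G "http://ehrinfdi4memory.toolbox21.com/oai" \
  --data-urlencode "verb=ListRecords" \
  --data-urlencode "resumptionToken=$TOKEN" \
  -o batch2.xml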
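
The two-request example above can be extended into a loop that follows resumption tokens until none is returned. The following is a sketch, not a reference harvester. The function names extract_token, fetch_page and harvest_set are hypothetical, and the loop assumes that the last page carries either no resumptionToken element or an empty one.

```shell
BASE_URL="http://ehrinfdi4memory.toolbox21.com/oai"

# extract_token FILE: print the resumption token text, or nothing if absent.
extract_token() {
  grep -o '<resumptionToken[^>]*>[^<]*</resumptionToken>' "$1" | sed 's/<[^>]*>//g'
}

# fetch_page OUTFILE KEY=VALUE...: GET the endpoint with URL-encoded parameters.
fetch_page() {
  out=$1; shift
  # Rewrite each KEY=VALUE argument as "--data-urlencode KEY=VALUE"
  for kv in "$@"; do
    set -- "$@" --data-urlencode "$kv"
    shift
  done
  curl -sS -f -G -o "$out" "$@" "$BASE_URL"
}

# harvest_set SETSPEC: write batch1.xml, batch2.xml, ... into the current
# directory and print the number of pages fetched.
harvest_set() {
  set_spec=$1
  n=1
  fetch_page "batch$n.xml" "verb=ListRecords" "metadataPrefix=oai_dc" "set=$set_spec" || return 1
  while token=$(extract_token "batch$n.xml") && [ -n "$token" ]; do
    n=$((n + 1))
    fetch_page "batch$n.xml" "verb=ListRecords" "resumptionToken=$token" || return 1
  done
  echo "$n"
}
```

Calling `harvest_set ehri:camps` would fetch the whole camps set. Keeping the HTTP call in fetch_page makes it easy to swap in a different client or to add retries.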

Token structure

  • Virtual sets: Base64-encoded, contains set, offset, date range
  • EHRI sets: Original EHRI token, directly passed through

Incremental Harvesting

For regular updates we recommend:

  1. Updates for virtual sets: Full harvest recommended (sets are small and completely replaced during sync)
  2. Updates for EHRI sets: Use the from parameter for incremental harvesting

Example for incremental harvesting

For macOS:

FROM_DATE=$(date -u -v-2m '+%Y-%m-%dT00:00:00Z')
curl "http://ehrinfdi4memory.toolbox21.com/oai?verb=ListRecords&metadataPrefix=oai_dc&from=$FROM_DATE"

For Linux:

FROM_DATE=$(date -u -d "2 months ago" '+%Y-%m-%dT00:00:00Z')
curl "http://ehrinfdi4memory.toolbox21.com/oai?verb=ListRecords&metadataPrefix=oai_dc&from=$FROM_DATE"

Supported date formats:

  • Full ISO 8601: 2023-01-01T00:00:00Z
  • Date only: 2023-01-01 (interpreted as start of day)
  • Empty dates are ignored (useful for optional parameters in scripts)
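
The accepted date forms can be mirrored on the client so that scripts always send a predictable value. The sketch below is an illustration of the rules listed above, not the adapter's own parsing code, and the function name normalize_date is hypothetical. It checks only the shape of the string, not whether the date is a valid calendar date.

```shell
# normalize_date VALUE
#   empty string         -> prints nothing (parameter can be omitted)
#   YYYY-MM-DD           -> prints YYYY-MM-DDT00:00:00Z (start of day)
#   YYYY-MM-DDThh:mm:ssZ -> printed unchanged
#   anything else        -> error on stderr, non-zero exit status
normalize_date() {
  case "$1" in
    "")
      ;;
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9])
      printf '%sT00:00:00Z\n' "$1" ;;
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]:[0-9][0-9]Z)
      printf '%s\n' "$1" ;;
    *)
      echo "unsupported date format: $1" >&2
      return 1 ;;
  esac
}
```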

Specifics of incremental harvesting

Virtual sets:

  • from and until filter by sync date (when records were added/updated in our cache)
  • All records from a sync batch receive the same timestamp
  • Daily sync replaces all records, so dates indicate cache freshness, not actual EHRI changes
  • Useful for: "Show me what was added to the cache since yesterday"
  • Not useful for: "Show me what actually changed in EHRI since yesterday"

EHRI sets:

  • from and until are passed through to EHRI
  • EHRI tracks actual modification dates of records
  • Provides true incremental harvesting of changed records
  • Useful for: "Show me what actually changed in EHRI since yesterday"

Error Handling

Standard OAI-PMH error codes

| Code | Meaning | Typical cause |
|------|---------|---------------|
| badVerb | Unknown verb | Typo in verb parameter |
| badArgument | Invalid parameters | Missing required parameters |
| cannotDisseminateFormat | Format not supported | e.g. ead for virtual sets |
| idDoesNotExist | Unknown identifier | Record doesn't exist |
| noRecordsMatch | No matches | Empty set or no matches in time range |
| badResumptionToken | Invalid token | Token expired or corrupted |
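
OAI-PMH reports these codes in the code attribute of an error element in the response body, so a script can detect them with a small filter. The helpers below are a sketch with hypothetical names. Treating noRecordsMatch as non-fatal is a design choice that suits incremental harvesting, where an empty result only means that nothing changed.

```shell
# oai_error_code: read an OAI-PMH response on stdin and print the first
# error code found, or nothing if the response contains no error element.
oai_error_code() {
  sed -n 's/.*<error[^>]* code="\([^"]*\)".*/\1/p' | head -n 1
}

# is_fatal_error CODE: succeed for codes that should abort a harvest.
is_fatal_error() {
  case "$1" in
    ""|noRecordsMatch) return 1 ;;  # no error, or simply nothing to harvest
    *) return 0 ;;
  esac
}
```

A typical use is `code=$(oai_error_code < batch1.xml)` followed by `is_fatal_error "$code" && exit 1`.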

Performance Considerations

Harvesting parameters

  • Batch size: 200 records per page
  • Timeout: 30 seconds per request

Monitoring and Status

Health Check Endpoint

GET /health

Returns JSON with:

  • status: "healthy" or error status
  • version: Adapter version
  • database: Connection status
  • sync_in_progress: Boolean indicating if sync is running
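
A harvester can consult this endpoint before starting a run. The sketch below reads the JSON on standard input and succeeds only if status is "healthy" and no sync is running. Skipping a run while a sync is in progress is one possible policy, not a requirement of the adapter. The function name ready_to_harvest is hypothetical, and the pattern matching assumes the field names listed above; a JSON-aware tool would be more robust.

```shell
# ready_to_harvest: read the /health JSON on stdin.
# Succeeds when status is "healthy" and sync_in_progress is not true.
ready_to_harvest() {
  body=$(cat)
  printf '%s\n' "$body" |
    grep -q '"status"[[:space:]]*:[[:space:]]*"healthy"' || return 1
  if printf '%s\n' "$body" |
    grep -q '"sync_in_progress"[[:space:]]*:[[:space:]]*true'; then
    return 1
  fi
  return 0
}
```

Assuming the health endpoint lives on the same host as the test endpoint, usage would look like `curl -s http://ehrinfdi4memory.toolbox21.com/health | ready_to_harvest && echo "ok to harvest"`.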