Format Openness

Open formats (CSV, JSON, HDF5) vs. proprietary

Access (A)

Interoperable (I1) — placed in Access for practical reasons

Justification

Open formats determine whether users can access data without proprietary software. RDA-I1-01D (Important) requires that data be expressed in a standardised format, and both DataCite and Dublin Core include a Format property. Note: this signal maps to FAIR I1 (Interoperability) but is placed in Access because format determines practical accessibility.

Practical Guide

should-have

Use open formats (CSV, JSON, HDF5) to ensure long-term accessibility.

We could not measure the impact of format in Zenodo's record-level metadata because format is stored at the file level. Domain repositories nonetheless demonstrate the value: OpenNeuro requires NIfTI (open), SRA requires FASTQ (open), and these repositories show consistently higher SHARE scores. Open formats are a practical accessibility requirement.

Why this signal matters despite the numbers

No citation data are available for this signal because Zenodo stores format information at the file level, not in record-level metadata. However, domain repositories that enforce open formats (OpenNeuro, SRA, GEO) show consistently higher SHARE scores.

For Repositories

Accept and recommend open formats (CSV, JSON, HDF5, NIfTI, FASTQ)
Flag proprietary formats with a warning during upload
Map to DataCite #14 Format and Dublin Core Format
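The repository recommendations above can be sketched as a single upload check. This is a minimal illustration, not a repository's actual implementation: the function names `review_upload` and `to_metadata`, the two lookup tables, and the output key names are all hypothetical, and Excel and MATLAB files are treated as proprietary only because the depositor guidance on this page does so.

```python
from pathlib import PurePath

# Assumption: illustrative tables; a real repository would keep its own registry.
OPEN_MIME_TYPES = {".csv": "text/csv", ".json": "application/json"}
PROPRIETARY_ALTERNATIVES = {".xlsx": "CSV", ".xls": "CSV", ".mat": "HDF5"}


def review_upload(filenames):
    """Return (warnings, formats) for the files in one upload."""
    warnings, formats = [], []
    for name in filenames:
        ext = PurePath(name).suffix.lower()
        if ext in PROPRIETARY_ALTERNATIVES:
            alternative = PROPRIETARY_ALTERNATIVES[ext]
            warnings.append(f"{name}: proprietary format; consider {alternative}")
        # Use a MIME type when we know one, otherwise fall back to the extension.
        label = OPEN_MIME_TYPES.get(ext, ext.lstrip(".") or "unknown")
        if label not in formats:
            formats.append(label)
    return warnings, formats


def to_metadata(formats):
    """Copy the format list into DataCite-style and Dublin Core-style fields.

    Assumption: the key names below are placeholders for the two Format properties.
    """
    return {"datacite": {"formats": list(formats)},
            "dublin_core": {"format": list(formats)}}
```

Calling `review_upload(["a.csv", "b.xlsx"])` would yield one warning for `b.xlsx` and a format list that can be passed to `to_metadata`. The check only warns; it does not reject the upload, which matches the "flag with a warning" wording above.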

For Depositors

Convert proprietary formats to open alternatives before depositing
Prefer CSV over Excel, JSON over proprietary schemas, HDF5 over MATLAB .mat files
Include format documentation (codebook, data dictionary) with your files
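For the last recommendation, a data dictionary can be started automatically from a CSV file and then completed by hand. The sketch below is one possible approach using only the Python standard library; the function names and the four output columns are assumptions for illustration, and the type inference is deliberately crude.

```python
import csv
import io


def infer_type(values):
    """Guess a column type from its non-empty string values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "empty"
    for label, cast in (("integer", int), ("number", float)):
        try:
            for v in non_empty:
                cast(v)
            return label
        except ValueError:
            continue  # try the next, more permissive type
    return "string"


def data_dictionary(csv_text):
    """Return a CSV data-dictionary skeleton with one row per input column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["column", "type", "example", "description"])
    for column in reader.fieldnames or []:
        values = [row[column] or "" for row in rows]
        example = next((v for v in values if v), "")
        # The description is left blank for the depositor to fill in.
        writer.writerow([column, infer_type(values), example, ""])
    return out.getvalue()
```

The output is itself a CSV, so the documentation stays in an open format alongside the data.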

Three standards converge on this signal (DataCite, Dublin Core, RDA). It has not yet been measured empirically, but practice in domain repositories supports it.

Standards Sources

Convergence score: 3/4 independent sources (well justified)

Standard	Field / Property	Obligation Level
DataCite 4.6	#14 Format	Optional
Dublin Core	Format	Core Element
RDA FAIR	RDA-I1-01D	Important

FAIR Principle Alignment

Primary mapping: Interoperable (I1) — placed in Access for practical reasons

I1: (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation

RDA FAIR Data Maturity Model Indicators:

RDA-I1-01D: Data uses knowledge representation expressed in standardised format

How This Signal Is Measured

File format classification against open format list. Binary: at least one open format present.
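A minimal sketch of this classification, assuming formats are identified by file extension. The extension list is illustrative (the page names CSV, JSON, HDF5, NIfTI and FASTQ but gives no extensions), and the helper names are hypothetical.

```python
from pathlib import PurePath

# Assumption: illustrative extensions for the open formats named on this page.
OPEN_EXTENSIONS = {".csv", ".json", ".h5", ".hdf5", ".nii", ".fastq"}
# Compression suffixes to look past, e.g. "scan.nii.gz" is treated as NIfTI.
COMPRESSION_SUFFIXES = {".gz", ".bz2"}


def file_extension(filename: str) -> str:
    """Return the lower-cased extension, looking past one compression suffix."""
    suffixes = [s.lower() for s in PurePath(filename).suffixes]
    if len(suffixes) > 1 and suffixes[-1] in COMPRESSION_SUFFIXES:
        return suffixes[-2]
    return suffixes[-1] if suffixes else ""


def is_open(filename: str) -> bool:
    """True if the file's extension is on the open-format list."""
    return file_extension(filename) in OPEN_EXTENSIONS


def has_open_format(filenames: list[str]) -> bool:
    """Binary signal: True if at least one file is in an open format."""
    return any(is_open(name) for name in filenames)
```

Under this reading, a record containing `results.xlsx` and `readme.csv` passes because one open file is enough, while a record with no files fails.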

Empirical Evidence (Zenodo, n=1.3M)

Per-signal statistics use Zenodo as the primary validation source because it is the largest general-purpose repository with structured DataCite metadata, natural variance across all 25 signals, and available citation/usage data. Domain-specific repositories exhibit ceiling effects or restricted variance that preclude per-signal discrimination. Cross-repository validation is reported separately.

Data Source

Zenodo (CERN)

1,328,100 records analyzed

Interpretation: Format openness is not directly measurable in the Zenodo metadata schema because format is stored at the file level, not the record level.

Cross-repository note: Format openness is best measured in domain repositories: OpenNeuro requires NIfTI (open), SRA requires FASTQ (open). Dryad tracks file formats explicitly.

Quantitative Evidence

Scoring Formula

file_formats ⊆ open_formats → 4 pts

Contribution: 4 of 100 points · Access bucket (0–20)
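Read literally, the formula awards the points only when every format in the record is on the open list, which is stricter than the "at least one open format" description under How This Signal Is Measured. The sketch below follows the subset reading; the format labels, the function name, and the handling of empty records are assumptions.

```python
FORMAT_OPENNESS_POINTS = 4  # from the formula: 4 of 100 points

# Assumption: formats are compared as normalized labels; this list is illustrative.
OPEN_FORMATS = frozenset({"csv", "json", "hdf5", "nifti", "fastq"})


def format_openness_score(file_formats) -> int:
    """Return 4 if every format in the record is open, otherwise 0."""
    normalized = {f.strip().lower() for f in file_formats}
    # The empty set is a subset of any set, so a record with no files would
    # pass vacuously; scoring it 0 is our assumption, not the source's.
    if not normalized:
        return 0
    return FORMAT_OPENNESS_POINTS if normalized <= OPEN_FORMATS else 0
```

For example, `format_openness_score({"CSV", "json"})` gives 4, while adding `"xlsx"` to the set drops the score to 0.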

Data Gap

Empirical validation not yet available for this signal

File format information is stored at the individual file level in Zenodo, not in record-level metadata. The signal could be computed by analysing file extensions across the 1.3M records, but this has not yet been done. Domain repositories enforce open formats by design: OpenNeuro (NIfTI), SRA (FASTQ), GEO (CEL/TXT).

Method: Not yet computed · Source: Zenodo (format at file level)

A — Access Bucket

All signals in this bucket:

A1: Open Access Status

A2: License Clarity

A3: License Permissiveness

A4: No Embargo

A5: Format Openness
