Format Openness
Open formats (CSV, JSON, HDF5) vs. proprietary
Justification
Open formats determine whether users can access data without proprietary software. RDA-I1-01D (Important) calls for data in a standardised format, and both DataCite and Dublin Core define a Format property. Note: this signal maps to FAIR I1 (Interoperability) but is placed in the Access bucket because format determines practical accessibility.
Practical Guide
Use open formats (CSV, JSON, HDF5) to keep data accessible over the long term.
We could not measure the impact of format in Zenodo's record-level metadata because format is stored at the file level. Domain repositories nonetheless demonstrate the value: OpenNeuro requires NIfTI (open), SRA requires FASTQ (open), and these repositories show consistently higher SHARE scores. Open formats are a practical accessibility requirement.
Why this signal matters despite the numbers
No citation data is available for this signal because Zenodo stores format information at the file level, not in record-level metadata. Domain repositories that enforce open formats (OpenNeuro, SRA, GEO) show consistently higher SHARE scores.
For Repositories
- Accept and recommend open formats (CSV, JSON, HDF5, NIfTI, FASTQ)
- Flag proprietary formats with a warning during upload
- Map to DataCite #14 Format and Dublin Core Format
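The upload-time warning suggested above could be sketched as follows. This is a minimal illustration, not a description of any repository's actual upload pipeline: the function name `upload_warnings` and the extension table are hypothetical, and the table covers only the Excel and MATLAB cases named in this guide.

```python
from pathlib import PurePath

# Hypothetical mapping from proprietary extensions to the open alternative
# suggested in this guide; a real repository would maintain its own list.
PROPRIETARY_ALTERNATIVES = {
    ".xls": "CSV",
    ".xlsx": "CSV",
    ".mat": "HDF5",
}


def upload_warnings(filenames):
    """Return one warning string per file with a listed proprietary extension."""
    warnings = []
    for name in filenames:
        ext = PurePath(name).suffix.lower()
        alternative = PROPRIETARY_ALTERNATIVES.get(ext)
        if alternative is not None:
            warnings.append(
                f"{name}: proprietary format ({ext}); "
                f"consider also depositing {alternative}"
            )
    return warnings


if __name__ == "__main__":
    for message in upload_warnings(["results.xlsx", "scan.nii", "signals.mat"]):
        print(message)
```

The sketch only warns and never rejects a file, which matches the recommendation to flag proprietary formats rather than block them.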
For Depositors
- Convert proprietary formats to open alternatives before depositing
- Prefer CSV over Excel, JSON over proprietary schemas, HDF5 over MATLAB
- Include format documentation (codebook, data dictionary) with your files
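As one way to start the format documentation mentioned above, a depositor could generate a data dictionary template from a CSV header and fill in the descriptions by hand. The sketch below uses only the Python standard library; the function name and the template columns (`column`, `example_value`, `description`, `unit`) are assumptions for illustration, not a prescribed codebook layout.

```python
import csv
import io


def data_dictionary_skeleton(csv_text):
    """Build a data-dictionary template with one row per column of a CSV file."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])      # column names; empty list for an empty file
    first_row = next(reader, [])   # first data row, used for example values
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["column", "example_value", "description", "unit"])
    for i, column in enumerate(header):
        example = first_row[i] if i < len(first_row) else ""
        # description and unit are left blank for the depositor to complete
        writer.writerow([column, example, "", ""])
    return out.getvalue()


if __name__ == "__main__":
    sample = "subject_id,age,score\nS01,34,0.82\n"
    print(data_dictionary_skeleton(sample))
```

The resulting template is itself a CSV file, so the documentation stays in an open format alongside the data.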
Three standards converge on this signal (DataCite, Dublin Core, RDA). It has not yet been measured in Zenodo, but evidence from domain repositories supports it.
Standards Sources
Convergence score: 3/4 independent sources
| Standard | Field / Property | Obligation Level |
|---|---|---|
| DataCite 4.6 | #14 Format | Optional |
| Dublin Core | Format | Core Element |
| RDA FAIR | RDA-I1-01D | Important |
FAIR Principle Alignment
Primary mapping: Interoperable (I1), placed in Access for practical reasons
- I1: (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
RDA FAIR Data Maturity Model Indicators:
- RDA-I1-01D: Data uses knowledge representation expressed in standardised format
How This Signal Is Measured
Each file's format is classified against a list of open formats. The signal is binary: it is met when at least one open format is present.
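A minimal sketch of this classification step, assuming formats are inferred from file extensions. The extension table is an assumption that covers only the open formats named in this guide (the actual open format list is not reproduced here), and the helper names `classify` and `has_open_format` are hypothetical.

```python
from pathlib import PurePath

# Assumed extension-to-format map; the real open format list is not given here.
OPEN_FORMAT_BY_EXTENSION = {
    ".csv": "CSV",
    ".json": "JSON",
    ".h5": "HDF5",
    ".hdf5": "HDF5",
    ".nii": "NIfTI",
    ".fastq": "FASTQ",
    ".fq": "FASTQ",
}

# Compression wrappers are skipped so that scan.nii.gz is read as .nii.
COMPRESSION_SUFFIXES = {".gz"}


def classify(filename):
    """Return the open format name for a file, or None if it is not listed."""
    suffixes = [s.lower() for s in PurePath(filename).suffixes]
    while suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes.pop()
    if not suffixes:
        return None
    return OPEN_FORMAT_BY_EXTENSION.get(suffixes[-1])


def has_open_format(filenames):
    """Binary signal: True when at least one file is in a listed open format."""
    return any(classify(name) is not None for name in filenames)


if __name__ == "__main__":
    files = ["scan.nii.gz", "notes.docx", "table.xlsx"]
    print({name: classify(name) for name in files})
    print(has_open_format(files))
```

Extension matching is a heuristic: a file can carry an open extension without being valid in that format, so a stricter check would inspect file contents.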
Empirical Evidence (Zenodo, n=1.3M)
Per-signal statistics use Zenodo as the primary validation source because it is the largest general-purpose repository with structured DataCite metadata, natural variance across all 25 signals, and available citation/usage data. Domain-specific repositories exhibit ceiling effects or restricted variance that preclude per-signal discrimination. Cross-repository validation is reported separately.
Data Source
Zenodo (CERN)
1,328,100 records analyzed
Interpretation: Not directly measurable in Zenodo metadata schema (format stored at file level, not record level).
Cross-repository note: Format openness is best measured in domain repositories: OpenNeuro requires NIfTI (open), SRA requires FASTQ (open). Dryad tracks file formats explicitly.
Quantitative Evidence
Scoring Formula
file_formats ⊆ open_formats → 4 pts
Contribution: 4 of 100 points · Access bucket (0–20)
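Read literally, the formula is a subset test on sets of format names, which can be sketched as below. The open format list and the function name are assumptions, and so is the choice to give 0 points to a record with no files (an empty set is trivially a subset of anything). Under the "at least one open format" reading, the test would instead be a non-empty intersection.

```python
# Assumed open format list, limited to the formats named in this guide.
OPEN_FORMATS = frozenset({"CSV", "JSON", "HDF5", "NIfTI", "FASTQ"})
SIGNAL_POINTS = 4  # contribution of this signal to the 100-point total


def format_openness_points(file_formats, open_formats=OPEN_FORMATS):
    """Return 4 if every format in the record is open, otherwise 0."""
    formats = set(file_formats)
    if not formats:
        # Assumption: a record without files earns no points.
        return 0
    return SIGNAL_POINTS if formats <= set(open_formats) else 0


if __name__ == "__main__":
    print(format_openness_points({"CSV", "JSON"}))
    print(format_openness_points({"CSV", "XLSX"}))
```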
Empirical validation not yet available for this signal
File format information is stored at the individual file level in Zenodo, not in record-level metadata. The signal is computable from a file-extension analysis of the 1.3M records, but this analysis has not yet been run. Domain repositories enforce open formats by design: OpenNeuro (NIfTI), SRA (FASTQ), GEO (CEL/TXT).
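The pending file-extension analysis could look roughly like the sketch below. It assumes each harvested record is a dictionary with a `files` list whose entries hold the filename under `key`; that shape, the function name, and the extension list are all assumptions for illustration, and compressed suffixes such as `.gz` are ignored for brevity.

```python
from collections import Counter
from pathlib import PurePath

# Assumed open extensions, limited to formats named in this guide.
OPEN_EXTENSIONS = {".csv", ".json", ".h5", ".hdf5", ".nii", ".fastq"}


def summarize(records):
    """Count records per extension and records with at least one open file."""
    extension_counts = Counter()
    records_with_open = 0
    for record in records:
        # A set, so each extension is counted once per record.
        extensions = {
            PurePath(f["key"]).suffix.lower() for f in record.get("files", [])
        }
        extension_counts.update(extensions)
        if extensions & OPEN_EXTENSIONS:
            records_with_open += 1
    return extension_counts, records_with_open


if __name__ == "__main__":
    sample = [
        {"files": [{"key": "a.csv"}, {"key": "b.csv"}]},
        {"files": [{"key": "c.xlsx"}]},
        {"files": []},
    ]
    counts, n_open = summarize(sample)
    print(counts.most_common(), n_open)
```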
Method: Not yet computed · Source: Zenodo (format at file level)
A — Access Bucket
All signals in this bucket: