Subject Classification
Controlled vocabulary terms (MeSH, LCSH, FOR codes)
Justification
Controlled vocabularies enable cross-repository discovery and semantic interoperability. DataCite lists Subject as a Recommended property and supports classification-scheme attributes. schema.org includes the keywords property, which Google Dataset Search uses. Dublin Core includes Subject as a core element. The RDA FAIR Data Maturity Model addresses vocabulary use in indicator RDA-I2-01M (Important priority).
Practical Guide
Add keywords. 5.2x citation lift — the easiest high-impact action.
Subject keywords from controlled vocabularies are among the simplest metadata fields to add and among the most impactful. In the Zenodo sample, datasets with keywords receive 5.2x as many citations as datasets without (RR = 5.23, p < 0.001). At 85% prevalence the practice is already common, but the remaining 15% of datasets are much harder to find through keyword search.
For Repositories
- Make subject keywords a required or strongly prompted field
- Provide controlled vocabulary suggestions (MeSH, LCSH, FOR codes)
- Map to DataCite #6 Subject and schema.org keywords
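As a rough illustration of the mapping in the last bullet, the sketch below turns one internal keyword record into a DataCite-style subject entry and a schema.org keyword. The helper names, the tuple layout, and the sample terms are assumptions made for this example; the attribute names follow the DataCite JSON representation (`subjectScheme`, `schemeUri`, `valueUri`) and the schema.org `DefinedTerm` type, and should be checked against the versions a repository actually targets.

```python
import json

def to_datacite_subject(term, scheme=None, scheme_uri=None, value_uri=None):
    """Build one DataCite-style subject entry, dropping attributes that are empty."""
    entry = {
        "subject": term,
        "subjectScheme": scheme,
        "schemeUri": scheme_uri,
        "valueUri": value_uri,
    }
    return {key: value for key, value in entry.items() if value}

def to_schema_org_keyword(term, scheme_uri=None, value_uri=None):
    """Return plain text for free keywords, or a DefinedTerm for controlled terms."""
    if not scheme_uri:
        return term
    keyword = {"@type": "DefinedTerm", "name": term, "inDefinedTermSet": scheme_uri}
    if value_uri:
        keyword["url"] = value_uri
    return keyword

# Illustrative records: (term, scheme name, scheme URI, term URI).
records = [
    ("Neoplasms", "MeSH", "https://id.nlm.nih.gov/mesh/",
     "https://id.nlm.nih.gov/mesh/D009369"),
    ("tumor imaging", None, None, None),  # free-text keyword, no scheme
]

datacite_subjects = [to_datacite_subject(*record) for record in records]
schema_org_keywords = [
    to_schema_org_keyword(term, scheme_uri, value_uri)
    for term, _scheme, scheme_uri, value_uri in records
]

print(json.dumps({"subjects": datacite_subjects}, indent=2))
print(json.dumps({"@type": "Dataset", "keywords": schema_org_keywords}, indent=2))
```

Keeping the scheme and term URIs alongside the label is what lets another repository recognize the same concept even when the display text differs.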
For Depositors
- Add at least 3-5 subject keywords from your field's standard vocabulary
- Use controlled terms (MeSH for biomedical, LCSH for general) when possible
- Include both broad and specific terms to maximize discoverability
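A depositor (or a deposit form) could check these recommendations before submission. The following is a minimal sketch, assuming keywords are held as dictionaries with `subject` and optional `subjectScheme` keys; the function name `review_keywords` and the default threshold of 3 are choices made for this example, not part of any standard.

```python
def review_keywords(subjects, min_count=3):
    """Return suggestions for a keyword list; an empty list means nothing to flag."""
    suggestions = []
    # Ignore entries whose term is missing or blank.
    terms = [s for s in subjects if (s.get("subject") or "").strip()]
    if len(terms) < min_count:
        suggestions.append(
            f"Only {len(terms)} keyword(s) found; aim for at least {min_count}."
        )
    if not any(s.get("subjectScheme") for s in terms):
        suggestions.append(
            "No keyword names a controlled vocabulary (for example MeSH or LCSH)."
        )
    return suggestions

draft = [
    {"subject": "Neoplasms", "subjectScheme": "MeSH"},
    {"subject": "tumor imaging"},
]
for line in review_keywords(draft):
    print(line)
```

The check is deliberately advisory: it returns suggestions rather than rejecting the deposit, which matches the "strongly prompted" option for repositories.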
Strongest positive signal in the Stewardship bucket. Easy to implement, high impact, well-adopted (85.2%).
Standards Sources
Convergence score: 4/4 independent sources
| Standard | Field / Property | Obligation Level |
|---|---|---|
| DataCite 4.6 | #6 Subject | Recommended |
| schema.org | keywords | Recommended |
| Dublin Core | Subject | Core Element |
| RDA FAIR Data Maturity Model | RDA-I2-01M | Important |
FAIR Principle Alignment
Primary mapping: Findable (F2), Interoperable (I2)
- F2: Data are described with rich metadata
- I2: (Meta)data use vocabularies that follow FAIR principles
RDA FAIR Data Maturity Model Indicators:
- RDA-I2-01M: Metadata uses FAIR-compliant vocabularies
How This Signal Is Measured
Presence of subject keywords, ideally from controlled vocabularies with scheme identifiers. Binary: at least one subject term present.
Empirical Evidence (Zenodo, n=1.3M)
Per-signal statistics use Zenodo as the primary validation source because it is the largest general-purpose repository with structured DataCite metadata, natural variance across all 25 signals, and available citation/usage data. Domain-specific repositories exhibit ceiling effects or restricted variance that preclude per-signal discrimination. Cross-repository validation is reported separately.
- Prevalence: 85.2% of Zenodo datasets
- Citation lift: 5.2x vs. datasets without
- Data source: Zenodo (CERN), 1,328,100 records analyzed
Interpretation: Strong positive signal. Datasets with subject classification receive 5.2x as many citations as those without. Keywords enable cross-repository discovery and are one of the highest-impact metadata fields.
Quantitative Evidence
Scoring Formula
subject_keywords.length ≥ 1 → 4 pts
Contribution: 4 of 100 points · Stewardship bucket (0–20)
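The scoring rule can be sketched in a few lines. This is an illustration of the binary rule stated above, not the scorer's actual code; the function name and the decision to ignore blank or non-string entries are assumptions.

```python
SUBJECT_POINTS = 4  # contribution of this signal to the 100-point total

def score_subject_signal(subject_keywords):
    """Award 4 points if at least one non-empty subject keyword is present."""
    # Assumption: whitespace-only and non-string entries do not count as keywords.
    present = [
        keyword for keyword in (subject_keywords or [])
        if isinstance(keyword, str) and keyword.strip()
    ]
    return SUBJECT_POINTS if len(present) >= 1 else 0

print(score_subject_signal(["Neoplasms", "tumor imaging"]))
print(score_subject_signal([]))
```

Because the rule is binary, a record with one keyword scores the same as a record with ten; the depositor guidance on using 3-5 controlled terms goes beyond what the score itself rewards.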
| Group | Datasets | Share | Mean citations per dataset (μ) |
|---|---|---|---|
| With signal present | 1,132,179 | 85.2% | 0.277 |
| Without signal | 195,921 | 14.8% | 0.053 |

- Rate ratio: 5.23 (95% CI: 5.13–5.33)
- P-value: < 0.001 (z = 165.79)
- Significance test: Poisson rate ratio · Source: Zenodo (n = 1,328,100)
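For readers who want to see how a Poisson rate ratio and its interval can be derived from the reported group sizes and means, here is a minimal sketch using a Wald interval on the log rate ratio. This is one common approach, not necessarily the exact procedure used for the figures above. Citation totals are reconstructed from the rounded means, so the results are approximate.

```python
import math

# Group sizes and mean citations per dataset, as reported for this signal.
n_with, n_without = 1_132_179, 195_921
mean_with, mean_without = 0.277, 0.053

# Approximate citation totals, reconstructed from the rounded means.
cites_with = mean_with * n_with
cites_without = mean_without * n_without

# Rate ratio and the standard error of its logarithm (Poisson assumption).
rate_ratio = mean_with / mean_without
log_rr = math.log(rate_ratio)
se_log_rr = math.sqrt(1 / cites_with + 1 / cites_without)

# Wald z statistic and 95% confidence interval on the ratio scale.
z = log_rr / se_log_rr
ci_low = math.exp(log_rr - 1.96 * se_log_rr)
ci_high = math.exp(log_rr + 1.96 * se_log_rr)

print(f"RR = {rate_ratio:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}], z = {z:.2f}")
```

Under these assumptions the output lands close to the reported rate ratio, interval, and z statistic. The narrow interval reflects the sample size rather than the strength of any causal claim: a rate ratio describes an association between keywords and citations in this sample.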
S — Stewardship Bucket
All signals in this bucket: