iSamples Data Serializations
A catalog of the parquet files that back the iSamples query substrate
1 1. Purpose and scope
iSamples has roughly a dozen parquet files in circulation at any given moment — each with a specific role, a specific upstream parent, and a specific set of downstream consumers (the web Explorer, the Python reference notebook, the progressive globe, the PQG conformance work). Some are primary archival products; others are derived aggregates or caches; still others are source-specific variants published outside the data.isamples.org namespace.
This document is a catalog, not an ingestion guide: it tells you what each file is, where it came from, who consumes it, and where in the spec tree to look for its normative definition. For how to build these files, see the scripts in scripts/ and the converters in pqg/. For how to query them, see query-spec.qmd. For how to cite them, see the Zenodo deposition plan.
All sizes and row counts below were verified by DuckDB DESCRIBE + COUNT(*) against https://data.isamples.org/ on 2026-04-24.
2 2. The derivation DAG
Zenodo export (doi:10.5281/zenodo.15278211, ~300 MB, 6.7 M samples)
│ sample-centric, nested STRUCTs (PQG "export" format)
│
└─► isamples_202512_narrow.parquet (820 MB, 101 M rows)
│ graph-normalized, nodes + _edge_ rows (PQG "narrow")
│
└─► isamples_202601_wide.parquet (278 MB, 20.7 M rows)
│ entity-centric, p__* relationship arrays (PQG "wide")
│
├─► isamples_202604_wide.parquet (292 MB, 20.7 M rows)
│ = 202601 wide + ~47 K OpenContext thumbnails
│ (see scripts/enrich_wide_with_oc_thumbnails.py)
│
├─► isamples_202601_wide_h3.parquet (292 MB, 20.7 M)
│ = wide + h3_res4 / h3_res6 / h3_res8 columns
│
├─► isamples_202601_samples_map_lite.parquet (60 MB, 6.0 M)
│ display projection for map points
│
├─► isamples_202601_sample_facets_v2.parquet (63 MB, 6.0 M)
│ pid → facet-URI strings for multi-dim filtering
│
├─► isamples_202601_facet_summaries.parquet (2 KB, 56 rows)
│ baseline (facet_type, facet_value, count) tuples
│
├─► isamples_202601_facet_cross_filter.parquet (6 KB, 526 rows)
│ single-active-filter cross cache
│
└─► isamples_202601_h3_summary_res{4,6,8}.parquet
geospatial aggregates for the progressive globe
(38 K / 112 K / 176 K cells)
Source-specific variants (parallel to the substrate, not derived from it):
oc_isamples_pqg.parquet (GCS, 11.8 M, narrow, OC-only)
oc_isamples_pqg_wide.parquet (GCS, 2.5 M, wide, OC-only)
└─► serve as upstream for OpenContext thumbnails folded into 202604 wide
Vocabulary labels (parallel to the substrate, sourced from isamplesorg/vocabularies):
vocab_labels.parquet (58 KB, 537 SKOS concepts)
└─► consumed by Search Explorer to render facet URIs as prefLabels
Arrows indicate derivation, not containment. Every file in the left column can be rebuilt from its parent by a script in isamples-python/ or isamplesorg.github.io/scripts/.
3 3. Catalog
3.1 Tier: source of truth
| File | Role | Size | Rows | Upstream | Consumers | Spec |
|---|---|---|---|---|---|---|
zenodo.15278211 export |
Aggregated Zenodo export (all 4 sources, sample-centric, nested) | ~300 MB | 6.7 M | SESAR + OpenContext + GEOME + Smithsonian ingestion | PQG converters (narrow, wide) | PQG §3.3 (export format) |
3.2 Tier: graph normalization
| File | Role | Size | Rows | Upstream | Consumers | Spec |
|---|---|---|---|---|---|---|
isamples_202512_narrow.parquet |
Graph-normalized with explicit _edge_ rows; canonical archival form |
820 MB | 101.4 M | Zenodo export | Graph traversals, PQG tutorials, narrow→wide converter, Zenodo archive | PQG §3.1, §4.2 |
isamples_202601_wide.parquet |
Entity-centric, relationships as p__* arrays; primary analytic substrate |
278 MB | 20.7 M | narrow | Search Explorer, Python notebook, facet/h3/lite derivations | PQG §3.2, §4.5 |
isamples_202604_wide.parquet |
202601 wide + ~47 K OC thumbnails folded in | 292 MB | 20.7 M | 202601 wide + oc_isamples_pqg.parquet |
current/wide.parquet alias points here |
PQG §3.2 |
3.3 Tier: derived aggregates (progressive globe / H3)
| File | Role | Size | Rows | Upstream | Consumers | Spec |
|---|---|---|---|---|---|---|
isamples_202601_wide_h3.parquet |
Wide with h3_res{4,6,8} BIGINT columns pre-joined |
292 MB | 20.7 M | wide | Deep-Dive Analysis tutorial (H3 filtering without join) | QUERY_SPEC §2.4 |
isamples_202601_h3_summary_res4.parquet |
Continental tier: (h3_cell, sample_count, center_lat, center_lng, dominant_source, source_count, resolution) |
580 KB | 38 K | wide_h3 | Interactive Explorer globe (zoomed out), Python Explorer H3 tier mode | QUERY_SPEC §2.4 |
isamples_202601_h3_summary_res6.parquet |
Regional tier | 1.6 MB | 112 K | wide_h3 | Interactive Explorer globe (mid zoom) | QUERY_SPEC §2.4 |
isamples_202601_h3_summary_res8.parquet |
Neighborhood tier | 2.4 MB | 176 K | wide_h3 | Interactive Explorer globe (close zoom) | QUERY_SPEC §2.4 |
3.4 Tier: display projections
| File | Role | Size | Rows | Upstream | Consumers | Spec |
|---|---|---|---|---|---|---|
isamples_202601_samples_map_lite.parquet |
Minimum map projection; only MaterialSampleRecord rows with coordinates |
60 MB | 6.0 M | wide (filtered) | Interactive Explorer point-level rendering below ~120 km altitude | QUERY_SPEC §4.1 |
3.5 Tier: facet caches
| File | Role | Size | Rows | Upstream | Consumers | Spec |
|---|---|---|---|---|---|---|
isamples_202601_sample_facets_v2.parquet |
(pid, source, material, context, object_type, label, description, place_name) — all VARCHAR scalars; each facet column is a single URI per sample (not an array) |
63 MB | 6.0 M | wide | Search Explorer multi-dim facet filtering | QUERY_SPEC §3.3, §5.1 |
isamples_202601_facet_summaries.parquet |
Baseline (facet_type, facet_value, scheme, count) |
2 KB | 56 | wide | Every tutorial (instant initial facet counts) | QUERY_SPEC §3.3 tier 1 |
isamples_202601_facet_cross_filter.parquet |
Pre-computed counts for single-filter cross-facet queries | 6 KB | 526 | wide | Search Explorer cross-filter UI | QUERY_SPEC §3.3 tier 2a |
3.6 Tier: vocabulary labels
| File | Role | Size | Rows | Upstream | Consumers | Spec |
|---|---|---|---|---|---|---|
vocab_labels.parquet |
SKOS concept URI → human-readable pref_label map (plus definition, alt_labels, scheme); covers material, sample object type, and sampled feature type vocabularies |
58 KB | 537 | isamplesorg/vocabularies TTLs (built by scripts/build_vocab_labels.py) |
Search Explorer (renders facet URIs as prefLabels); any tutorial that surfaces controlled-vocabulary URIs | issue #148 |
3.7 Tier: alternative export formats (upstream of the aggregated Zenodo export)
The export_client can emit each source’s records in multiple formats; the aggregated Zenodo deposition archives the GeoParquet flavor, but JSONL and CSV are also emitted by the same pipeline and are useful for streaming or human inspection.
| File | Role | Size | Rows | Upstream | Consumers | Spec |
|---|---|---|---|---|---|---|
isamples_export_*.jsonl |
Streaming JSON export (one sample per line, nested structs) | per query | — | isamplesorg/export_client (isample export -f jsonl) |
Local DuckDB ingestion, STAC catalog generation | export_client docs |
isamples_export_*.csv |
Flat CSV export — convenience only, not authoritative for the query substrate | per query | — | isamplesorg/export_client (isample export -f csv) |
Human inspection | export_client docs |
stac.json / manifest.json |
STAC/discovery sidecars emitted with local exports | < 1 KB | — | isamplesorg/export_client |
STAC browser, local server, refresh workflow | export_client README |
3.8 Tier: legacy bindings and convenience copies
| File | Role | Size | Rows | Upstream | Consumers | Spec |
|---|---|---|---|---|---|---|
| Solr indexed documents | Legacy search-server binding for the same canonical query dimensions. Not a portable serialization; listed here because QUERY_SPEC §5.3 documents the Solr dialect bindings | N/A | ~6 M | isamplesorg/isamples_inabox harvest/index pipelines + schema mappings |
iSamples Central (API offline as of Aug 2025; Solr schema remains the authoritative precedent for dimension names) | QUERY_SPEC §5.3 |
| H3 + lite CSV twins | Human-readable CSV duplicates of samples_map_lite.parquet and h3_summary_res{4,6,8}.parquet |
~640 MB total | mirror | the corresponding parquet files | Manual inspection only | parquet copies are authoritative; CSV twins excluded from the Zenodo substrate deposition by design |
3.9 Tier: source-specific variants (not part of the substrate)
| File | Role | Size | Rows | Upstream | Consumers | Spec |
|---|---|---|---|---|---|---|
oc_isamples_pqg.parquet (GCS) |
OpenContext-only narrow; carries thumbnail_url values absent from the aggregated export |
~1.8 GB | 11.8 M | OpenContext ETL (Eric Kansa) | scripts/enrich_wide_with_oc_thumbnails.py → 202604 wide; PQG development |
PQG §3.1 |
oc_isamples_pqg_wide.parquet (GCS) |
OpenContext-only wide | ~600 MB | 2.5 M | OC narrow | OC-specific analyses, PQG benchmarks | PQG §3.2 |
No OpenContext sidecar file exists yet. Per the sidecar-pattern plan (Raymond endorsed 2026-04-17), thumbnails are currently merged directly into isamples_202604_wide.parquet rather than joined at query time from a sidecar. A future isamples_202601_oc_sidecar.parquet (keyed on pid, with thumbnail_url, is_public, license, media_url, harvested_at) is planned — see project_isamples_sidecar_pattern.md.
4 4. Per-file detail
URL convention: each file is available at https://data.isamples.org/<filename> (versioned, 1-yr immutable cache) and, where applicable, at https://data.isamples.org/current/<alias> (302 redirect, 5-min cache). Examples below use the versioned URL; swap for the alias when you want “latest.”
4.1 4.1 Zenodo export (source of truth)
- Role: The raw aggregated Zenodo export — all four sources, sample-centric, nested STRUCTs.
- DOI:
10.5281/zenodo.15278211 - Headline schema (PQG export, 19 cols):
sample_identifier,label,description,produced_by {sampling_site {sample_location {latitude, longitude, ...}}}, etc. - Query pattern: one row per sample; no JOINs needed for basic queries.
- DuckDB: download the parquet from Zenodo, then
SELECT * FROM read_parquet('isamples_export_*.parquet') LIMIT 10.
4.2 4.2 isamples_202512_narrow.parquet
Role: PQG narrow format — the canonical, lossless graph-normalized representation.
Headline schema (40 cols):
row_id, pid, otype, s, p, o, n, altids, geometry, ...entity-specific columns.... Edges are rows withotype='_edge_'and populateds/p/o.Query pattern: multi-hop JOIN via
_edge_rows (see PQG §2.2).DuckDB:
SELECT COUNT(*) FROM read_parquet('https://data.isamples.org/isamples_202512_narrow.parquet') WHERE otype = 'MaterialSampleRecord';
4.3 4.3 isamples_202601_wide.parquet
Role: PQG wide format — primary analytic substrate for Explorer + notebook.
Headline schema (49 cols): same core columns as narrow, plus
p__produced_by,p__sample_location,p__sampling_site,p__site_location,p__responsibility,p__registrant,p__has_material_category,p__has_context_category,p__has_sample_object_type,p__keywords,p__curation,p__related_resource— each an integer array of targetrow_ids. Exact DuckDB types are mixed:p__produced_by,p__sample_location,p__sampling_site,p__site_location,p__registrant,p__curationareINTEGER[];p__has_material_category,p__has_context_category,p__has_sample_object_type,p__keywords,p__responsibility,p__related_resourceareBIGINT[].Column name gotcha: the source column is
non wide/narrow (PQG convention), notsource. Alias it in projections (e.g.n AS source) to match what the lite and facet parquets already call it.Query pattern: entity-centric; relationships via array-element JOIN (see PQG §3.2).
DuckDB:
SELECT n AS source, COUNT(*) FROM read_parquet('https://data.isamples.org/isamples_202601_wide.parquet') WHERE otype = 'MaterialSampleRecord' GROUP BY n ORDER BY 2 DESC;
4.4 4.4 isamples_202604_wide.parquet
Role: 202601 wide enriched with ~47 K OpenContext thumbnails.
current/wide.parquet302-redirects here.Headline schema: identical to 202601 wide (49 cols). Only the
thumbnail_urlcolumn on OCMaterialSampleRecordrows is populated differently.Query pattern: drop-in replacement for 202601 wide; use
current/wide.parquetunless you need a pinned version.DuckDB:
SELECT COUNT(*) FROM read_parquet('https://data.isamples.org/current/wide.parquet') WHERE thumbnail_url IS NOT NULL;
4.5 4.5 isamples_202601_wide_h3.parquet
Role: Wide with H3 indices pre-joined, so H3 predicates don’t need a join.
Headline schema (52 cols): wide columns +
h3_res4,h3_res6,h3_res8(BIGINT).Query pattern: direct H3-cell filtering without an H3 UDF.
DuckDB:
SELECT COUNT(*) FROM read_parquet('https://data.isamples.org/isamples_202601_wide_h3.parquet') WHERE h3_res6 = 604932829406232575;
4.6 4.6 isamples_202601_h3_summary_res{4,6,8}.parquet
Role: Zoom-adaptive aggregates that back the Cesium progressive globe and the Python Explorer’s “H3 tier” rendering mode.
Headline schema (7 cols, identical across resolutions):
h3_cell(BIGINT),sample_count(INT),center_lat,center_lng(DOUBLE),dominant_source(VARCHAR),source_count(INT),resolution(INT).Query pattern: fetch the right resolution for the current zoom; no join needed.
DuckDB:
SELECT * FROM read_parquet('https://data.isamples.org/isamples_202601_h3_summary_res6.parquet') ORDER BY sample_count DESC LIMIT 20;
4.7 4.7 isamples_202601_samples_map_lite.parquet
Role: Display projection for point-level map rendering. Contains only
MaterialSampleRecordrows with valid coordinates.Headline schema (9 cols):
pid, label, source, latitude, longitude, place_name, result_time, h3_res8, h3_res8_hex. Nodescription— it’s in wide only.Query pattern: the Explorer reads this directly when altitude drops below the point-render threshold.
DuckDB:
SELECT source, COUNT(*) FROM read_parquet('https://data.isamples.org/isamples_202601_samples_map_lite.parquet') WHERE latitude BETWEEN 32 AND 42 GROUP BY 1;
4.8 4.8 isamples_202601_sample_facets_v2.parquet
Role: Cross-dimension facet filtering — one row per sample, each facet column holds a single controlled-vocabulary URI (the leaf concept the sample is tagged with at that dimension).
Headline schema (8 cols, all VARCHAR):
pid, source, material, context, object_type, label, description, place_name.material/context/object_typeare scalar URI strings, NOT arrays — the file’s grain is one row per sample, so a sample tagged with multiple material URIs is represented by a single chosen URI (currently the first/leaf). For multi-material accuracy, JOIN back towide.p__has_material_category.Query pattern:
WHERE material = '<uri>'for exact match;WHERE material ILIKE '%rock%'to substring-match URI fragments.DuckDB:
SELECT pid, label FROM read_parquet('https://data.isamples.org/isamples_202601_sample_facets_v2.parquet') WHERE material ILIKE '%rock%' LIMIT 10;
4.9 4.9 isamples_202601_facet_summaries.parquet
Role: Baseline (no-filter) facet counts. Loaded by every tutorial at startup.
Headline schema (4 cols, 56 rows):
facet_type(source|material|context|object_type),facet_value,scheme,count.Query pattern: sort by
count DESCto render a top-N facet list.DuckDB:
SELECT * FROM read_parquet('https://data.isamples.org/isamples_202601_facet_summaries.parquet') WHERE facet_type = 'material' ORDER BY count DESC;
4.10 4.10 isamples_202601_facet_cross_filter.parquet
Role: Cross-facet counts for the single-active-filter case (QUERY_SPEC §3.3 tier 2a). Avoids recomputing when one facet dimension is active.
Headline schema (7 cols, 526 rows):
filter_source, filter_material, filter_context, filter_object_type, facet_type, facet_value, count. Exactly onefilter_*column is non-NULL per row.Query pattern: lookup by the active filter to get counts for the remaining dimensions.
DuckDB:
SELECT facet_type, facet_value, count FROM read_parquet('https://data.isamples.org/isamples_202601_facet_cross_filter.parquet') WHERE filter_source = 'SESAR' ORDER BY facet_type, count DESC;
4.11 4.11 oc_isamples_pqg.parquet and oc_isamples_pqg_wide.parquet (OC variants)
- Role: OpenContext-specific PQG files maintained by Eric Kansa. Hosted at
https://storage.googleapis.com/opencontext-parquet/, not underdata.isamples.org. They are not part of the cross-source substrate — they carry OC-internal detail (notablythumbnail_url) that the aggregated Zenodo export drops. - Headline schema: PQG narrow (40 cols) and wide (47 cols). OC wide has slightly fewer
p__*columns than the unified wide — this is schema drift, not semantically meaningful for standard queries. - Consumer:
scripts/enrich_wide_with_oc_thumbnails.pyuses OC narrow to fill thumbnails into 202604 unified wide. Also used directly in PQG benchmark work. - Future: these become the prototype upstream for per-source sidecars (see §3, bottom row).
5 5. URL convention
All substrate files live under https://data.isamples.org/ — a Cloudflare Worker fronting an R2 bucket. The Worker provides:
- Versioned URLs
https://data.isamples.org/isamples_<YYYYMM>_<variant>.parquet— 1-year immutable cache. Safe to pin in papers, Zenodo manifests, reproducibility notebooks. - Alias URLs
https://data.isamples.org/current/<alias>— 302 redirect with 5-min cache; always resolves to the latest snapshot. Use for “always fresh” consumers. Currentlycurrent/wide.parquet → isamples_202604_wide.parquet.
Never reference the raw pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/... URL. It bypasses the Worker and defeats both the alias layer and the Cache-Control headers that DuckDB-WASM relies on for HTTP range requests.
OpenContext-specific variants live at https://storage.googleapis.com/opencontext-parquet/ and are maintained outside this convention.
6 6. Relationship to other documents
query-spec.qmd§5.1 — the DuckDB binding table, which maps query-spec dimensions (source,material,bbox,h3,time,text) to the specific parquet files above. This catalog says what the files are; the query spec says which dimension each file serves.ZENODO_DEPOSITION_PLAN.md(in the monorepo root) — specifies which subset of these files are archived in each Zenodo deposition. The 202601 deposition bundles the 10 R2-served files plus aMANIFEST.jsonandREADME.md. Source-specific OC variants and the raw Zenodo export are not part of the substrate deposition.pqg/docs/PQG_SPECIFICATION.md— defines the three canonical formats (export, narrow, wide) whose schemas the primary files conform to. §3.5 is the normative section.pqg/docs/conformance_matrix.md(planned) — will document, for each file above, exactly which clauses of the PQG spec it satisfies (required columns, allowedotypevalues, edge-type constraints, etc.). This catalog is the prose companion; the conformance matrix will be the machine-checkable companion.project_isamples_sidecar_pattern.md(memory) — planning for per-source sidecars that would sit alongside the unified wide file rather than being folded in at build time (as OC thumbnails currently are). When that lands, it adds a new tier to §3.
Last updated: 2026-04-24 by iSamples team. Row counts and sizes verified by DuckDB against https://data.isamples.org/ on the same date.