iSamples Data Files
Parquet snapshots, H3 tier aggregates, and facet caches served from data.isamples.org (+ Zenodo)
Quick start: every file on this page is queryable directly from a URL — no bulk download needed. DuckDB’s httpfs extension fetches only the byte ranges a query touches.
import duckdb
con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")
# Note: the wide parquet uses `n` for the source column (PQG `n` = source name).
# Other files (lite, sample_facets_v2) use the friendlier alias `source`.
con.sql("""
SELECT pid, label, n AS source, latitude, longitude
FROM read_parquet('https://data.isamples.org/current/wide.parquet')
WHERE n = 'GEOME' AND latitude BETWEEN -20 AND 20
LIMIT 10
""").df()
1. Where to get it
Two places depending on what you need:
- https://data.isamples.org/: a Cloudflare Worker in front of Cloudflare R2. HTTP range requests are supported, so DuckDB and DuckDB-WASM work directly against URLs. Two layers:
  - /<versioned-file>.parquet: 1-year immutable cache. Pin these in papers.
  - /current/<alias>.parquet: 302 redirect to the latest snapshot. Use for “always fresh.” Current aliases:
    - /current/wide.parquet → isamples_202604_wide.parquet
- iSamples Zenodo community — long-term archival with DOIs. The raw aggregated export is doi:10.5281/zenodo.15278211 (April 2025, all four sources, ~300 MB). A query-substrate deposition for snapshot 202601 is planned (see the deposition issue).
Never reference the raw pub-*.r2.dev URL. It bypasses the Cloudflare Worker and defeats the versioned/alias cache layer. Always cite https://data.isamples.org/<file>.
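A small guard can catch raw bucket URLs before they reach a paper or notebook. This is a minimal sketch: the function names are hypothetical, and `pub-example.r2.dev` in the usage note is a placeholder host, not a real bucket.

```python
from urllib.parse import urlparse

CANONICAL_HOST = "data.isamples.org"


def _host(url: str) -> str:
    """Lower-cased host name of a URL, or an empty string if there is none."""
    return (urlparse(url).hostname or "").lower()


def is_citable(url: str) -> bool:
    """True when the URL goes through the data.isamples.org Worker."""
    return _host(url) == CANONICAL_HOST


def is_raw_r2(url: str) -> bool:
    """True when the URL points straight at an *.r2.dev bucket host."""
    return _host(url).endswith(".r2.dev")
```

For example, `is_citable("https://data.isamples.org/current/wide.parquet")` is true, while `is_raw_r2("https://pub-example.r2.dev/wide.parquet")` flags the placeholder bucket URL.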
2. Quick-pick table
| If you want to… | Use this file | Size |
|---|---|---|
| Show samples on a map (display fields only) | samples_map_lite.parquet | 60 MB |
| Query all fields on all samples | current/wide.parquet | ~292 MB |
| Aggregate map clusters by zoom | h3_summary_res{4,6,8}.parquet | ≤ 2.4 MB each |
| Filter by material / context / object-type | sample_facets_v2.parquet | 63 MB |
| Walk relationships (graph queries) | isamples_202512_narrow.parquet | 820 MB |
| Translate vocabulary URIs to human-readable labels | vocab_labels.parquet | 58 KB |
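For the "aggregate by zoom" row, a map client has to choose one of the three H3 files per zoom level. The sketch below shows one way to do that. The zoom cut-offs are hypothetical values chosen for illustration, not thresholds used by any iSamples viewer, and the res-6 and res-8 file names are assumed to follow the same pattern as the res-4 file used elsewhere on this page.

```python
BASE_URL = "https://data.isamples.org"
SNAPSHOT = "202601"  # snapshot used by the H3 res-4 example on this page

# Hypothetical cut-offs: (zoom strictly below this value, H3 resolution to use).
ZOOM_BREAKS = ((5, 4), (9, 6))
FINEST_RES = 8


def h3_resolution_for_zoom(zoom: float) -> int:
    """Pick the coarsest H3 tier that still looks reasonable at this zoom."""
    for upper, res in ZOOM_BREAKS:
        if zoom < upper:
            return res
    return FINEST_RES


def h3_summary_url(zoom: float) -> str:
    """URL of the H3 summary file for a zoom level (name pattern assumed)."""
    res = h3_resolution_for_zoom(zoom)
    return f"{BASE_URL}/isamples_{SNAPSHOT}_h3_summary_res{res}.parquet"
```

Because each tier is at most a few megabytes, switching files as the user zooms is cheap compared with querying the wide file.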
3. Copy-pasteable DuckDB snippets
Each snippet runs on its own once a connection exists. Run these three setup lines once per session:
import duckdb
con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")
3.1 Map-lite: points near Kyoto
con.sql("""
SELECT pid, label, source, latitude, longitude, result_time
FROM read_parquet('https://data.isamples.org/isamples_202601_samples_map_lite.parquet')
WHERE latitude BETWEEN 34.9 AND 35.1
AND longitude BETWEEN 135.6 AND 135.9
LIMIT 10
""").df()
3.2 Wide: source breakdown
con.sql("""
SELECT n AS source, COUNT(*) AS n_samples
FROM read_parquet('https://data.isamples.org/current/wide.parquet')
WHERE otype = 'MaterialSampleRecord'
GROUP BY n
ORDER BY n_samples DESC
""").df()
3.3 H3 res-4 aggregates: densest continental cells
con.sql("""
SELECT h3_cell, sample_count, dominant_source, center_lat, center_lng
FROM read_parquet('https://data.isamples.org/isamples_202601_h3_summary_res4.parquet')
ORDER BY sample_count DESC
LIMIT 10
""").df()
3.4 Sample facets: OpenContext artifacts
# object_type is a URI; match on URI fragments (the concept leaf name).
con.sql("""
SELECT pid, label, place_name, object_type
FROM read_parquet('https://data.isamples.org/isamples_202601_sample_facets_v2.parquet')
WHERE source = 'OPENCONTEXT'
AND object_type ILIKE '%artifact%'
LIMIT 10
""").df()
3.5 Narrow (graph): count edges by predicate
con.sql("""
SELECT p AS predicate, COUNT(*) AS n_edges
FROM read_parquet('https://data.isamples.org/isamples_202512_narrow.parquet')
WHERE otype = '_edge_'
GROUP BY p
ORDER BY n_edges DESC
LIMIT 10
""").df()
3.6 Vocab labels: render facet URIs as human-readable text
# Join sample facets to vocabulary prefLabels so the UI shows
# "Ceramic Clay" instead of the raw concept URI.
con.sql("""
SELECT f.pid, f.label, v.pref_label AS material_label
FROM read_parquet('https://data.isamples.org/isamples_202601_sample_facets_v2.parquet') f
LEFT JOIN read_parquet('https://data.isamples.org/vocab_labels.parquet') v
ON f.material = v.uri
WHERE f.material IS NOT NULL
LIMIT 10
""").df()
5. Full catalog + companion docs
- Serialization catalog — every shipped file with role, schema headline, upstream, consumers, and size
- Query Specification — substrate-neutral query contract that these files bind to
- Zenodo deposition plan — planned 202601 snapshot deposition
- PQG Specification — property-graph parquet format semantics
- PQG conformance matrix — which QUERY_SPEC dimensions each file carries
6. Data sources and licensing
Four upstream sources contribute to the aggregated iSamples corpus:
- SESAR — geological samples (~4.6 M records)
- OpenContext — archaeological samples (~1 M records)
- GEOME — biological / genomic samples (~605 K records)
- Smithsonian — museum specimens (~322 K records)
Each source has its own license and use terms. Authoritative license information for any specific deposition is carried in the Zenodo record metadata — see the iSamples Zenodo community. When reusing these data, cite both the original source and the iSamples aggregation DOI.
Last updated: 2026-04-24. Sizes and row counts verified by DuckDB DESCRIBE + COUNT(*) against https://data.isamples.org/ on the same date. Every snippet on this page was executed successfully against the live files during authoring.
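To repeat that kind of check yourself, something along the following lines should work. It is a sketch of the DESCRIBE + COUNT(*) approach, not the script used during authoring. The function names are illustrative, and `duckdb` is imported lazily so the query builder can be used without it.

```python
def verification_queries(url: str) -> dict:
    """Build the schema and row-count queries for one remote parquet file."""
    safe_url = url.replace("'", "''")  # escape single quotes for the SQL literal
    src = f"read_parquet('{safe_url}')"
    return {
        "schema": f"DESCRIBE SELECT * FROM {src}",
        "rows": f"SELECT COUNT(*) AS n_rows FROM {src}",
    }


def verify(url: str):
    """Return (schema_rows, n_rows) for a remote parquet file.

    Requires the third-party duckdb package and network access.
    """
    import duckdb  # imported here so the builder above works without it

    con = duckdb.connect()
    con.sql("INSTALL httpfs; LOAD httpfs;")
    queries = verification_queries(url)
    schema_rows = con.sql(queries["schema"]).fetchall()
    n_rows = con.sql(queries["rows"]).fetchone()[0]
    return schema_rows, n_rows
```

For example, `verify("https://data.isamples.org/vocab_labels.parquet")` checks the smallest file in the quick-pick table. COUNT(*) on a parquet file can usually be answered from footer metadata, so the check should transfer only a small part of each file.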