iSamples Query Specification
A substrate-neutral contract for searching and filtering iSamples data
Field inventories are drawn from the Solr schema (authoritative precedent) and the PQG metadata model. v0.2 incorporates findings from the PQG conformance matrix (which parquet files actually carry which dimensions) to resolve naming drift, drop ghosts, and tighten substrate bindings. Comments and PRs welcome — see issue tracker.
1. Purpose and scope
iSamples data is reached today through at least three substrates — and potentially more in the future:
- DuckDB-WASM against parquet (this website’s Interactive Explorer)
- DuckDB / Ibis against parquet (the Python client and notebooks)
- Apache Solr (legacy iSamples Central; potentially revived)
Each substrate has its own query dialect. Users and maintainers shouldn’t have to relearn the facet vocabulary, the text-search semantics, or the spatial filter grammar when moving between them. This document specifies a substrate-neutral query model that each implementation can bind to.
What this spec covers:
- Canonical facet / filter dimensions and their names
- Filter grammar (an abstract syntax, not a wire format)
- Full-text search semantics (which fields participate)
- Spatial and temporal primitives
- Sample-card projection (what a clicked sample returns)
- Substrate binding tables (spec → DuckDB, spec → Solr)
What it does NOT cover:
- PQG graph traversal queries (edge walking, multi-hop joins). See QUERY_COMPARISON.md in the monorepo root for that work and the Eric-vs-Observable alignment notes.
- Bulk export / download mechanics. See how-to-use.
- Ingestion and metadata normalization.
Normative precedent. Where this spec names a field, the name mirrors the iSamples metadata model’s dotted-path form as used in the Solr schema (isamples_inabox/solr_schema_init/create_isb_core_schema.py), because that’s the most complete, externally-documented query vocabulary the project has shipped. Aliases for substrate-specific naming are provided in §5.
2. Canonical dimensions
A dimension is an attribute of a material sample record that users filter, facet, or search on. Every binding (§5) must provide at least the required dimensions.
2.1 Identity and provenance
| Dimension | Type | Required | Solr field | PQG path | Notes |
|---|---|---|---|---|---|
| pid | string | ✅ | id | MaterialSampleRecord.pid | Primary key |
| source | enum | ✅ | source | MaterialSampleRecord.source_name | SESAR\|OPENCONTEXT\|GEOME\|SMITHSONIAN |
| label | string | ✅ | label | MaterialSampleRecord.label | Display name |
| description | text | ✅ | description | MaterialSampleRecord.description | Free text |
| registrant | string | | registrant | MaterialSampleRecord.registrant | Who registered |
| sourceUpdatedTime | instant | | sourceUpdatedTime | MaterialSampleRecord.tmodified | Freshness; bind to tmodified (INTEGER epoch); see note below |
| thumbnailURL | string | | (none) | MaterialSampleRecord.thumbnail_url | Optional; shipped in wide today (OpenContext only). Expected to move to per-source sidecars over time (see §4.2 sample card, issue #131) |
sourceUpdatedTime binding: the wide parquet ships both last_modified_time (VARCHAR) and tmodified (INTEGER unix epoch). v0.2 picks tmodified as canonical because epoch is easier to filter and sort; last_modified_time is kept as a deprecated alias for backwards compatibility and will be removed in a future major release.
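To illustrate why an integer epoch is convenient to filter on, the sketch below turns an ISO 8601 date into a parameterized tmodified predicate. It assumes tmodified counts whole seconds since 1970-01-01 UTC (the spec only says "INTEGER unix epoch"), and the helper name tmodified_since is hypothetical.

```python
from datetime import datetime, timezone


def tmodified_since(iso_date: str) -> tuple[str, list[int]]:
    """Return a parameterized predicate for records modified on or after iso_date.

    Assumption: tmodified is seconds since the Unix epoch, in UTC.
    """
    moment = datetime.fromisoformat(iso_date)
    if moment.tzinfo is None:
        # Treat naive timestamps as UTC rather than as local time.
        moment = moment.replace(tzinfo=timezone.utc)
    return "tmodified >= ?", [int(moment.timestamp())]


predicate, params = tmodified_since("2024-01-01")
```

The predicate and its parameter list can be passed to any driver that accepts `?` placeholders. If the column turns out to hold milliseconds, the conversion needs a factor of 1000.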
2.2 Classification (the four facets)
| Dimension | Type | Required | Solr field | PQG path |
|---|---|---|---|---|
| material | enum | ✅ | hasMaterialCategory | MaterialSampleRecord.has_material_category.label |
| context | enum | ✅ | hasContextCategory | MaterialSampleRecord.has_context_category.label |
| objectType | enum | ⚠️ (see below) | hasSampleObjectType (alias hasSpecimenCategory) | MaterialSampleRecord.has_sample_object_type.label |
| keywords | multi-string | | keywords | MaterialSampleRecord.keywords[] |
Naming resolution (v0.2): v0.1 named this dimension specimen with Solr field hasSpecimenCategory. Every shipped parquet file uses object_type / hasSampleObjectType. v0.2 adopts the data-side name (objectType) as canonical and keeps hasSpecimenCategory as a Solr alias. See PQG conformance matrix §3.2 for the audit that prompted this rename.
objectType is in the blessed vocabulary but is not currently exposed in the web Explorer. Adding it is on the P1 stack.
Dropped from v0.2: informalClassification was named in v0.1 but no shipped parquet file carries it (it was a Solr-era remnant). It is removed from the canonical dimension list until/unless the pipeline adds it.
Each classification dimension has a paired confidence field (…Confidence, pfloat) in Solr. The spec allows filters to reference confidence (e.g. material.confidence >= 0.8), but implementations MAY omit confidence filtering if the substrate doesn’t carry the field.
2.3 Sampling event and site
| Dimension | Type | Solr field | PQG path |
|---|---|---|---|
| resultTime | instant | producedBy_resultTime (pdate) | SamplingEvent.result_time |
| samplingPurpose | string | samplingPurpose | SamplingEvent.sampling_purpose |
| featureOfInterest | string | producedBy_hasFeatureOfInterest | SamplingEvent.has_feature_of_interest |
| responsibility | multi-string | producedBy_responsibility | SamplingEvent.responsibility[] |
| siteLabel | string | producedBy_samplingSite_label | SamplingSite.label |
| siteDescription | text | producedBy_samplingSite_description | SamplingSite.description |
| placeName | string | producedBy_samplingSite_placeName | SamplingSite.place_name[] |
| elevation | float | producedBy_samplingSite_location_elevationInMeters | GeospatialCoordLocation.elevation |
Dropped from v0.2: resultTimeRange (Solr producedBy_resultTimeRange, a date_range field) was named in v0.1 but no shipped parquet carries an interval type. It was a Solr-era remnant that never migrated. Query a resultTime range with time BETWEEN t1 AND t2 (§3.1) instead.
2.4 Spatial
| Dimension | Type | Solr field | PQG path |
|---|---|---|---|
| latitude | float | producedBy_samplingSite_location_latitude | GeospatialCoordLocation.latitude |
| longitude | float | producedBy_samplingSite_location_longitude | GeospatialCoordLocation.longitude |
| bbox | bbox | producedBy_samplingSite_location_bb | derived |
| h3[resN] | h3-index | producedBy_samplingSite_location_h3_{0..13} | samples_wide.h3_res{N} |
H3 tier convention. Resolutions 4, 6, and 8 are the spec-recommended tier breakpoints for zoom-adaptive visualization. Other resolutions MAY be materialized but 4/6/8 are load-bearing.
H3 column availability across shipped parquet files (v0.2):
- wide_h3 ships three direct columns: h3_res4, h3_res6, h3_res8.
- h3_summary_res{4,6,8} tier files do NOT ship h3_res{N} columns; they ship a single h3_cell (UBIGINT) plus a resolution (INTEGER) column. Query them as WHERE h3_cell = X AND resolution = N.
- lite carries h3_res8 (and h3_res8_hex) only, not res4 / res6.
- Plain wide and narrow do not carry H3 columns. To filter at res 4 or res 6, query wide_h3 or the appropriate h3_summary tier file.
See PQG conformance matrix §3.4 for the full table.
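The availability rules above can be captured in a small dispatch function. This is a sketch only: the function name and the file_kind labels are hypothetical, and it assumes the h3_res{N} columns hold integer H3 indexes like h3_cell does (lite shipping a separate h3_res8_hex column suggests this, but the spec does not say so explicitly).

```python
def h3_predicate(file_kind: str, resolution: int, cells: list[int]) -> str:
    """Build a WHERE fragment for an H3 filter against one parquet flavor.

    file_kind is one of "wide_h3", "h3_summary", "lite", "wide", "narrow".
    cells are H3 indexes as unsigned 64-bit integers.
    """
    if not cells:
        raise ValueError("at least one H3 cell is required")
    # int() guards against non-numeric input being spliced into the SQL text.
    cell_list = ", ".join(str(int(c)) for c in cells)

    if file_kind == "wide_h3":
        if resolution not in (4, 6, 8):
            raise ValueError("wide_h3 ships h3_res4, h3_res6 and h3_res8 only")
        return f"h3_res{resolution} IN ({cell_list})"
    if file_kind == "h3_summary":
        if resolution not in (4, 6, 8):
            raise ValueError("h3_summary tier files exist for res 4, 6 and 8 only")
        return f"h3_cell IN ({cell_list}) AND resolution = {resolution}"
    if file_kind == "lite":
        if resolution != 8:
            raise ValueError("lite carries h3_res8 only")
        return f"h3_res8 IN ({cell_list})"
    raise ValueError(f"{file_kind} does not carry H3 columns")
```

For example, `h3_predicate("h3_summary", 6, cells)` yields the two-column form, while asking lite for res 4 fails early instead of producing a query against a column that does not exist.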
2.5 Curation
| Dimension | Type | Solr field |
|---|---|---|
| curationLocation | string | curation_location |
| curationResponsibility | string | curation_responsibility |
| curationAccessConstraints | string | curation_accessContraints |
3. Filter grammar
A query is a conjunction (AND) of filters. Each binding is responsible for translating the abstract filter into its dialect.
3.1 Filter primitives
Filter         := FieldFilter | TextFilter | SpatialFilter | TemporalFilter
FieldFilter    := dim IN (value, ...)
                | dim = value
                | dim >= value          ( numeric / date only )
                | dim <= value
                | dim CONTAINS token    ( multi-string / keywords )
TextFilter     := text MATCHES "phrase"
SpatialFilter  := bbox WITHIN (min_lat, min_lon, max_lat, max_lon)
                | h3 AT RES n IN (h3_cell, ...)
TemporalFilter := time BETWEEN t1 AND t2
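One way a client might hold this grammar in memory is as a handful of small immutable records, with a query being a plain list that is ANDed together. The class and attribute names below are hypothetical; the spec defines only the abstract syntax, not a data structure.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class FieldFilter:
    dim: str
    op: str        # one of "IN", "=", ">=", "<=", "CONTAINS"
    values: tuple  # a single value, or several for IN


@dataclass(frozen=True)
class TextFilter:
    phrase: str


@dataclass(frozen=True)
class BBoxFilter:
    min_lat: float
    min_lon: float
    max_lat: float
    max_lon: float


@dataclass(frozen=True)
class H3Filter:
    resolution: int
    cells: tuple


@dataclass(frozen=True)
class TemporalFilter:
    t1: str
    t2: str


Filter = Union[FieldFilter, TextFilter, BBoxFilter, H3Filter, TemporalFilter]

# A query is an implicit conjunction (AND) of its filters.
query: list[Filter] = [
    FieldFilter("source", "IN", ("SESAR", "GEOME")),
    TextFilter("basalt"),
    BBoxFilter(30.0, -120.0, 45.0, -100.0),
    TemporalFilter("2000-01-01", "2010-12-31"),
]
```

Keeping the filters substrate-neutral like this means each binding only needs a function from one of these records to its own dialect.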
3.2 Full-text search semantics
text MATCHES "phrase" searches the aggregate of these fields (the Solr searchText copy-field target, canonical list):
- source, label, description
- keywords
- producedBy_label, producedBy_description, producedBy_hasFeatureOfInterest, producedBy_responsibility
- producedBy_samplingSite_label, producedBy_samplingSite_description, producedBy_samplingSite_placeName
- registrant, samplingPurpose
- curation_label, curation_description, curation_location
Substrates that can’t index all 15 fields MUST document which subset they cover and surface the limitation in UI. (The current web Explorer covers label + description + place_name only — a known gap.)
Multi-term queries default to AND with relevance ranking where the substrate supports it (Solr, DuckDB FTS). See PR #95 for web-side FTS work.
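For a substrate without a real FTS index, the AND-across-terms default can be approximated with ILIKE: each term must match at least one covered column. The sketch below assumes terms are split on whitespace (the spec does not define tokenization), and the function name is hypothetical. It provides no relevance ranking and does not escape `%` or `_` inside terms.

```python
def text_match_sql(phrase: str, columns: list[str]) -> tuple[str, list[str]]:
    """AND across terms, OR across columns, using parameterized ILIKE.

    Column names are interpolated into the SQL, so they must come from a
    fixed list in the calling code, never from user input.
    """
    terms = phrase.split()
    if not terms:
        raise ValueError("empty search phrase")
    clauses, params = [], []
    for term in terms:
        per_column = " OR ".join(f"{col} ILIKE ?" for col in columns)
        clauses.append(f"({per_column})")
        params.extend([f"%{term}%"] * len(columns))
    return " AND ".join(clauses), params


# The three columns the web Explorer covers today, per the note above.
sql, params = text_match_sql("basalt core", ["label", "description", "place_name"])
```

A record matches only if both "basalt" and "core" appear somewhere in the covered columns, though not necessarily in the same column.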
3.3 Cross-filter counts
A faceted UI exposing a dimension SHOULD show, next to each facet value, the count of records matching the current query excluding that dimension’s own filter. This lets users see the effect of selecting additional values without shrinking the list to zero.
Substrates may pre-compute these counts (see isamples_202601_facet_cross_filter.parquet for the single-filter cache) or compute them on the fly.
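The "excluding that dimension's own filter" rule is easiest to see on a toy table. The sketch below computes the counts on the fly; SQLite stands in for DuckDB only so the example runs with the standard library, and the table, rows, and function name are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (pid TEXT, source TEXT, material TEXT)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [
        ("s1", "SESAR", "rock"),
        ("s2", "SESAR", "soil"),
        ("s3", "GEOME", "organic"),
        ("s4", "OPENCONTEXT", "rock"),
    ],
)


def cross_filter_counts(conn, dimension, active):
    """Count values of `dimension` under every active filter except its own.

    `active` maps a dimension name to its list of selected values.
    """
    allowed = {"source", "material"}  # column names are interpolated below
    if dimension not in allowed or not set(active) <= allowed:
        raise ValueError("unknown dimension")
    clauses, params = [], []
    for dim, values in active.items():
        if dim == dimension or not values:
            continue  # skip the facet's own filter
        marks = ", ".join("?" for _ in values)
        clauses.append(f"{dim} IN ({marks})")
        params.extend(values)
    where = f"WHERE {' AND '.join(clauses)}" if clauses else ""
    sql = (f"SELECT {dimension}, COUNT(*) FROM samples {where} "
           f"GROUP BY {dimension} ORDER BY {dimension}")
    return conn.execute(sql, params).fetchall()


active = {"source": ["SESAR"], "material": ["rock"]}
material_counts = cross_filter_counts(conn, "material", active)
source_counts = cross_filter_counts(conn, "source", active)
```

With source = SESAR and material = rock both selected, the material facet still shows soil with a count of 1, because the material filter is ignored when counting material values.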
4. Result projections
4.1 Map / globe point
Minimum projection for a point on a map:
{ pid, label, source, latitude, longitude }
This is what the web Explorer’s “lite parquet” already provides.
4.2 Sample card
Projection for a clicked / selected sample:
{
pid, label, source,
description,
latitude, longitude, placeName, elevation,
material, context, objectType, keywords,
resultTime, samplingPurpose,
registrant, responsibility,
curationLocation, curationResponsibility,
sourceRecordURL,
thumbnailURL // see §2.1; ships in `wide` today (OpenContext
// only), moving to per-source sidecars — issue #131
}
Fields MAY be null. The sample card UI in every binding SHOULD handle missing values gracefully.
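As one possible reading of "handle missing values gracefully", the sketch below renders a card as text lines and substitutes a placeholder for anything null, empty, or absent. The field order follows the projection above; the function name and the placeholder wording are hypothetical.

```python
from typing import Any, Mapping

CARD_FIELDS = (
    "pid", "label", "source", "description",
    "latitude", "longitude", "placeName", "elevation",
    "material", "context", "objectType", "keywords",
    "resultTime", "samplingPurpose", "registrant", "responsibility",
    "curationLocation", "curationResponsibility",
    "sourceRecordURL", "thumbnailURL",
)


def card_lines(record: Mapping[str, Any], placeholder: str = "(not provided)") -> list[str]:
    """Render every card field, tolerating null, empty, and absent values."""
    lines = []
    for name in CARD_FIELDS:
        value = record.get(name)  # absent keys behave like nulls
        if value is None or value == "" or value == []:
            text = placeholder
        elif isinstance(value, (list, tuple)):
            # Multi-valued fields such as keywords are joined for display.
            text = ", ".join(str(v) for v in value)
        else:
            text = str(value)
        lines.append(f"{name}: {text}")
    return lines


lines = card_lines({"pid": "example-pid", "keywords": ["basalt", "core"]})
```

A UI might instead hide empty rows entirely; the point is only that no field is assumed to be present.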
4.3 Facet counts
{ dimension, value, count }[]
5. Substrate bindings
5.1 DuckDB-WASM on parquet (web)
| Spec | Binding |
|---|---|
| source IN (…) | n IN (…) on wide / narrow (column is n per PQG); source IN (…) on lite / sample_facets_v2 (alias exposed) |
| material IN (…) | pid IN (SELECT pid FROM sample_facets WHERE material IN (…)) |
| text MATCHES "q" | (label ILIKE '%q%' OR description ILIKE '%q%' OR place_name ILIKE '%q%'); currently a subset of §3.2 |
| bbox WITHIN (…) | latitude BETWEEN … AND … AND longitude BETWEEN … AND … |
| h3 AT RES 6 IN (…) | h3_res6 IN (…) on wide_h3, or h3_cell IN (…) AND resolution = 6 on h3_summary_res6 (see §2.4 note) |
| time BETWEEN … | TRY_CAST(result_time AS TIMESTAMP) BETWEEN t1 AND t2; result_time ships as VARCHAR in lite, wide, and narrow |
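A few of these rows can be sketched as small builders that each return a SQL fragment plus its parameters, combined with AND. The function names are hypothetical, the explicit CAST on the time bounds is a defensive choice of this sketch rather than part of the binding, and the bbox form does not handle boxes that cross the antimeridian.

```python
def in_list(column: str, values: list) -> tuple[str, list]:
    marks = ", ".join("?" for _ in values)
    return f"{column} IN ({marks})", list(values)


def source_filter(values: list[str], table: str) -> tuple[str, list]:
    # wide / narrow name the column n; lite / sample_facets_v2 expose source.
    column = "n" if table in ("wide", "narrow") else "source"
    return in_list(column, values)


def bbox_filter(min_lat, min_lon, max_lat, max_lon) -> tuple[str, list]:
    return ("latitude BETWEEN ? AND ? AND longitude BETWEEN ? AND ?",
            [min_lat, max_lat, min_lon, max_lon])


def time_filter(t1: str, t2: str) -> tuple[str, list]:
    # result_time ships as VARCHAR, so it is cast before comparing.
    return ("TRY_CAST(result_time AS TIMESTAMP) "
            "BETWEEN CAST(? AS TIMESTAMP) AND CAST(? AS TIMESTAMP)", [t1, t2])


def where_clause(parts: list[tuple[str, list]]) -> tuple[str, list]:
    """AND the fragments together and flatten their parameters in order."""
    sql = " AND ".join(f"({fragment})" for fragment, _ in parts)
    params = [p for _, values in parts for p in values]
    return sql, params


sql, params = where_clause([
    source_filter(["SESAR", "GEOME"], table="wide"),
    bbox_filter(30.0, -120.0, 45.0, -100.0),
    time_filter("2000-01-01", "2010-12-31"),
])
```

The resulting pair would be appended to a `SELECT ... FROM read_parquet(...) WHERE` statement and executed with the driver's parameter binding.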
Canonical data URL base: https://data.isamples.org/ (Cloudflare Worker in front of the R2 bucket). Two layers:
- Versioned /isamples_YYYYMM_<file>.parquet: 1-yr immutable cache, safe to pin in papers, spec examples, or reproducibility notebooks.
- Alias /current/<alias>: 302 redirect with 5-minute cache; tracks whatever the latest snapshot is. Use for “always fresh” consumers.
Never reference the raw pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/... URL — it bypasses the Worker and defeats the alias layer.
Data files: see catalog in how-to-use.
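The two URL layers can be wrapped in helpers so client code never hard-codes the raw bucket address. This is a sketch: the helper names are hypothetical, and valid file and alias names must come from the catalog, which this spec does not enumerate.

```python
import re

BASE_URL = "https://data.isamples.org"


def versioned_url(snapshot: str, file_name: str) -> str:
    """Immutable, pinnable URL for one snapshot, e.g. snapshot "202601"."""
    if not re.fullmatch(r"\d{6}", snapshot):
        raise ValueError("snapshot must be YYYYMM, e.g. 202601")
    return f"{BASE_URL}/isamples_{snapshot}_{file_name}.parquet"


def alias_url(alias: str) -> str:
    """Moving URL that redirects to whatever the latest snapshot is."""
    return f"{BASE_URL}/current/{alias}"


# File name taken from the cross-filter cache mentioned in §3.3.
pinned = versioned_url("202601", "facet_cross_filter")
```

A paper or notebook that needs reproducibility would record the pinned form; a dashboard would use `alias_url` with an alias from the catalog.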
5.2 DuckDB / Ibis on parquet (Python)
| Spec | Binding |
|---|---|
| Same DuckDB SQL as §5.1 | Same URLs under https://data.isamples.org/ |
| Ibis expressions | t.source.isin([...]) and so on |
See isamples-python/examples/basic/isamples_explorer.ipynb for the reference implementation. An isamples_query.py module extracting the filter builder is planned.
5.3 Apache Solr (if Central returns)
| Spec | Binding |
|---|---|
| source IN (a, b) | fq=source:(a OR b) |
| material IN (…) | fq=hasMaterialCategory:(…) |
| text MATCHES "q" | q=searchText:q (relevance-ranked by default) |
| bbox WITHIN (…) | fq={!field f=producedBy_samplingSite_location_rpt}Intersects(ENVELOPE(...)) |
| time BETWEEN … | fq=producedBy_resultTime:[t1 TO t2] |
See isamples_inabox/isb_web/isb_solr_query.py for the full client.
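As a rough illustration of the non-spatial rows, the sketch below assembles Solr request parameters with the standard library. It is not the isb_solr_query.py client: the function names are hypothetical, values are wrapped in double quotes (a choice of this sketch so that values containing spaces stay intact, where the table shows them bare), and `*:*` is used as the match-all query when no text is given.

```python
from urllib.parse import urlencode


def solr_quote(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def solr_or(field: str, values: list[str]) -> str:
    return f"{field}:({' OR '.join(solr_quote(v) for v in values)})"


def solr_params(sources=None, materials=None, text=None, time_range=None):
    """Return (name, value) pairs; repeated fq entries are ANDed by Solr."""
    q = f"searchText:{solr_quote(text)}" if text else "*:*"
    params = [("q", q)]
    if sources:
        params.append(("fq", solr_or("source", sources)))
    if materials:
        params.append(("fq", solr_or("hasMaterialCategory", materials)))
    if time_range:
        t1, t2 = time_range  # ISO 8601 instants, e.g. 2000-01-01T00:00:00Z
        params.append(("fq", f"producedBy_resultTime:[{t1} TO {t2}]"))
    return params


params = solr_params(
    sources=["SESAR", "GEOME"],
    text="basalt",
    time_range=("2000-01-01T00:00:00Z", "2010-12-31T23:59:59Z"),
)
query_string = urlencode(params)
```

The query string would be appended to the core's select endpoint. The bbox row is left out here because its envelope argument order deserves a closer check against the Solr spatial documentation.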
6. Versioning and compatibility
This spec uses semantic-ish versioning:
- Major (1.0, 2.0): new required dimensions, renames, or grammar changes that break existing clients.
- Minor (0.2, 0.3): new optional dimensions, clarifications, additional binding rows.
- Patch: typo fixes.
Breaking changes MUST be accompanied by a migration note and a sunset window for the prior spec version.
7. Open questions (for v0.3)
- objectType filter in the web Explorer. Canonical vocabulary is now hasSampleObjectType (resolved in v0.2; see §2.2). The sample_facets_v2 parquet carries object_type as a denormalized URI string, so binding is straightforward. Which display labels should the UI surface, and should object_type be added to lite so specimen-type filters don’t require a second file fetch?
- Text-search field coverage in the web Explorer (currently 3 of 15 post-v0.2). Which of the remaining 12 are worth indexing in a browser FTS? See PR #95.
- Cross-filter cache shape for multi-dimension filter combinations (current cache handles single-filter only).
- Confidence thresholds: should the spec define a default for *.Confidence fields, or leave it per-client?
- H3 tier breakpoints: when filters are active, what zoom level triggers the switch from H3 clusters to individual points? The web Explorer currently uses ~120 km; the Python notebooks use viewport bounding box size.
- Sample-card thumbnail provenance: thumbnail_url is now named in §2.1 (v0.2) but lives in wide and is populated only for OpenContext. Move to per-source sidecars per issue #131 / the sidecar pattern memo.
7.1 Questions resolved in v0.2
- Specimen vs. objectType naming: resolved. Adopt the data-side name objectType (Solr hasSampleObjectType) as canonical. See §2.2 and conformance matrix §3.2.
- Time filter in lite parquet: resolved. result_time is already present in lite (as VARCHAR). The §5.1 binding now shows the DuckDB cast.
Appendix A. Metadata model at a glance
iSamples treats these as the core entity types (domain-agnostic):
- MaterialSampleRecord: the sample itself
- SamplingEvent: the act of collection
- SamplingSite: the place
- GeospatialCoordLocation: lat/lon/elevation
- MaterialSampleCuration: curation metadata
- IdentifiedConcept: vocabulary terms (materials, contexts, specimens)
- Agent: people / institutions
The canonical UML is in the isamplesorg-metadata repo. PQG (the parquet property-graph binding) is specified in pqg/docs/PQG_SPECIFICATION.md.