Rasteret 0.3 : EO image datasets should be tables, not folders

Index COG metadata once into Parquet, skip cold starts forever. Filter with DuckDB, train with TorchGeo, fetch pixels on demand. 20x faster runs


Intro

"Metadata is a love note to the future." (Jason Scott, Internet Archive)

In the database world, this has been gospel for decades. The catalog, the thing that tells you where data lives, what shape it has, and how to read it, is more valuable than any single query result. Entire careers in data engineering exist because someone figured out that the description of the data matters more than the data itself; think Iceberg.

We believe in building systems that stay true to this: metadata is better than more data!

We in the geospatial industry moved hundreds of petabytes of satellite imagery to the cloud. Then we kept re-parsing the table of contents. Millions of times. Every cold start.

If you've trained on cloud-hosted imagery, you know the cold-start problem: your stack parses piles of metadata before it sees a single pixel, and you pay that cost on every compute restart, every new Kubernetes environment, and more. You shrug and say "eh, the first run is always slow". We've written about this before in depth.

Caching that metadata was step one for us.

But caching alone doesn't change how you work with data; it just makes the same workflow faster.

What changes things is when you start treating that cache as a table you can work with. Filter it, join it, enrich it - then pair it with an I/O engine for object storage that gets you pixels on demand. That's the new Rasteret library!

Introducing Rasteret 0.3:

Index once. Work in tables. Stream pixels on demand.

At the heart of Rasteret is a Collection: a GeoParquet table holding each EO image's properties, its asset URLs, and cached tile-level byte offsets for every band, so each row already knows exactly where to read its pixels. Your imagery stays in COGs. The Collection is what you reload, share, version, and query.

Here's how to build one:

import rasteret

collection = rasteret.build(
    "earthsearch/sentinel-2-l2a",
    name="s2_bangalore",
    bbox=(77.5, 12.9, 77.7, 13.1),
    date_range=("2024-01-01", "2024-06-30"),
)

build() picks from a catalog of 12 pre-registered sources (Sentinel-2, Landsat, NAIP, Copernicus DEM, AEF embeddings, and more) and creates the Collection you want. Don't see your dataset? Use build_from_stac() for any STAC API, or build_from_table() for any existing Parquet that has TIFF URLs in it.

What changes in practice

That build() call hides a big shift: you're not working with files anymore, you're working with rows in a Parquet table.

sub = collection.subset(cloud_cover_lt=15, date_range=("2024-03-01", "2024-06-01"))

Once it's a row, you can filter it, join it, version it. Try doing that with files in a folder 👀

Append the columns you care about (split, label, qa_flag, model_score, aoi_wkb, ...), push down filters with DuckDB or PyArrow, materialize a new Parquet for an experiment, share it, and get the exact same subset back later. Keep working in Parquet, and play with the metadata as much as you want.

The selection logic you use lives next to the data references in Parquet. If you've ever had train_v7_final_final2/ folders of image chips lying around, you can feel why this matters.

When you need pixels, pick the output format that fits:

# NumPy arrays, lightweight, no extra deps
arr = collection.get_numpy(
    geometries=(77.55, 13.01, 77.58, 13.08),
    bands=["B04", "B08"],
)
# shape: [N, C, H, W]

# xarray Dataset, for analysis
bbox = (77.55, 13.01, 77.58, 13.08)
ds = collection.get_xarray(geometries=bbox, bands=["B04", "B08"])
ndvi = (ds.B08.astype(float) - ds.B04.astype(float)) / (ds.B08 + ds.B04)

# TorchGeo GeoDataset, for ML training
dataset = collection.to_torchgeo_dataset(bands=["B04", "B08"], chip_size=256)

All three paths share the same I/O engine: using cached TIFF metadata, it goes straight to fast ranged reads, with no header re-parsing and no GDAL (explained in detail in previous blogs). geometries accepts bbox tuples, Arrow arrays, Shapely objects, or raw WKB; Arrow columns from GeoParquet are the fastest input.

Tag patches, not files

To make this more concrete, let's walk through a simple example.

Most EO workflows eventually produce semantics: "these polygons are crop fields", "these boxes contain ships", "this tile has clouds", "these 50k patches are hard negatives."

Traditionally, those semantics turn into more files: export chips, name them, keep the CSV that explains them, and pray you can reproduce the CSV-to-files join logic six weeks later.

With Rasteret + Parquet, you keep the semantics as rows and columns:

  1. Build (or import) a scene-level Collection once
  2. Run a model/heuristic that emits detections as GeoParquet
  3. Join detections -> scenes to produce a patch-level Collection (one row per patch)
  4. Train from that patch table directly, no chip exports, no folder structure. It's just metadata with your logic in it.

import duckdb
import pyarrow.compute as pc
import pyarrow.parquet as pq
import rasteret

# load a collection already built before
scenes = rasteret.load("./s2_bangalore/")

# your ship detections parquet
detections = pq.read_table("./ship_detections.geoparquet")

con = duckdb.connect()
con.register("scenes", scenes.dataset.to_table())
con.register("det", detections)

# Join detections to scenes, geometry is now the patch polygon, not the scene footprint
patches = con.sql("""
  SELECT det.patch_id AS id, scenes.datetime, det.geometry,
         scenes.assets, det.label, det.score, det.split
  FROM det JOIN scenes ON det.scene_id = scenes.id
  WHERE det.score > 0.8
""").fetch_arrow_table()

patch_collection = rasteret.build_from_table(
    patches, name="ship_patches", enrich_cog=True
)

Now patch_collection is a Collection whose geometry column represents patches, not scenes. Each row carries a label, a score, and a split - all filterable with Arrow, DuckDB, or Pandas:

# Arrow-native filtering: high-confidence train patches
train = patch_collection.subset(split="train")
table = train.dataset.to_table(columns=["geometry", "label", "score"])
mask = pc.greater(table.column("score"), 0.9)
hard_patches = pc.filter(table.column("geometry"), mask)

# Arrow WKB geometries go directly to get_numpy -- no Shapely, no Python loop
arr = train.get_numpy(geometries=hard_patches, bands=["B04", "B08"])
# arr.shape: [N, C, H, W]

Use Parquet to decide and experiment with what to train on. Use Rasteret to fetch the pixels for those rows, on demand, whenever needed.

TorchGeo interop

Most teams already have samplers, transforms, and dataloaders built around TorchGeo.

Rasteret's Collection.to_torchgeo_dataset() returns a standard TorchGeo GeoDataset (requires TorchGeo >= 0.9.0), so the rest of your stack stays the same.

Rasteret replaces only the I/O backend - Obstore instead of rasterio/GDAL; everything downstream of the GeoDataset is untouched.

That same patch_collection? Here's what it looks like in TorchGeo. label_field is a Rasteret addition: it pulls labels from the Parquet column into each sample, so they flow through TorchGeo's stack_samples collate function into your batch:

from torch.utils.data import DataLoader
from torchgeo.datasets import stack_samples
from torchgeo.samplers import RandomGeoSampler

dataset = patch_collection.to_torchgeo_dataset(
    bands=["B04", "B08"],
    chip_size=256,
    split="train",
    label_field="label",     # each sample["label"] comes from the Parquet column
)
sampler = RandomGeoSampler(dataset, size=256, length=1000)
loader = DataLoader(dataset, sampler=sampler, batch_size=8, collate_fn=stack_samples)

for batch in loader:
    images = batch["image"]   # [B, C, H, W]
    labels = batch["label"]   # [B]
    ...

Cold-start benchmarks

Same AOIs, same scenes, same sampler, same DataLoader. Both paths output identical [batch, T, C, H, W] tensors. TorchGeo runs with its recommended GDAL settings for best-case remote COG performance.

| Scenario | TorchGeo (rasterio/GDAL) | Rasteret | Speedup |
| --- | --- | --- | --- |
| Single AOI, 15 scenes | 9.08 s | 1.14 s | 8x |
| Multi-AOI, 30 scenes | 42.05 s | 2.25 s | 19x |
| Cross-CRS boundary, 12 scenes | 12.47 s | 0.59 s | 21x |

As for correctness of outputs: we treat rasterio/GDAL alignment as a correctness contract for supported inputs (tiled GeoTIFFs/COGs) in our tests. And when Rasteret can't safely match the expected semantics, it fails loudly instead of letting bad images reach the GPU.

Major-TOM without the image-in-Parquet

Here's more proof that Rasteret's design, logic + metadata + a byte-range-aware I/O engine, works and is actually better for everyone.

Major TOM is a large-scale Sentinel-2 training dataset published on Hugging Face. It has a two-layer Parquet design: metadata.parquet is a global index (2.2M rows with grid cells, product IDs, coordinates), while images/*.parquet stores actual GeoTIFF image data directly inside Parquet cells. That's the "image-in-Parquet" pattern.

We think there's a simpler and better option to achieve the same dataset semantics. The metadata already has everything you need to reconstruct patch geometries and fetch the original source pixels on demand:

import rasteret

# Build a cache from Sentinel-2 scenes
collection = rasteret.build("earthsearch/sentinel-2-l2a",
                            name="major-tom-on-the-fly",
                            bbox=(-122.55, 37.65, -122.30, 37.90),
                            date_range=("2024-01-01", "2024-02-01"))

# Add Major TOM-style columns: grid cells, product IDs, and deterministic splits as Parquet columns
enriched = enrich_major_tom_columns(collection, grid_km=10)

# Bring your PyArrow table/dataset back into a Rasteret collection;
# it simply validates that the table still has the columns Rasteret internals care about
collection = rasteret.build_from_table(enriched, name="major-tom-on-the-fly")

# Fetch patches as NumPy from Arrow geometry, no payload Parquet needed
subset = collection.subset(split="train")
arr = subset.get_numpy(geometries=geometry_array, bands=["B02", "B08"])
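
The enrich_major_tom_columns helper here is a sketch rather than a shipped API, but its key ingredient, deterministic splits derived from grid-cell IDs so every machine reproduces the same partition, fits in a few lines. A hedged illustration (the hash choice, ratios, and cell-ID format are assumptions):

```python
import hashlib

def deterministic_split(cell_id: str, train: float = 0.8, val: float = 0.1) -> str:
    """Map a grid-cell ID to a stable train/val/test assignment.

    Hash-based, so the split never changes across runs, machines, or
    re-orderings of the table -- it can live as a plain Parquet column.
    """
    bucket = int(hashlib.sha256(cell_id.encode()).hexdigest(), 16) % 1000 / 1000
    if bucket < train:
        return "train"
    if bucket < train + val:
        return "val"
    return "test"

# Hypothetical cell IDs, purely for illustration
splits = [deterministic_split(c) for c in ["10U_432R", "10U_433R", "11U_001R"]]
```

Because the assignment is a pure function of the cell ID, the split column can be recomputed or verified by anyone holding only the index.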

The result is a few MBs' worth of Parquet index instead of multi-GB Parquet shards, with the same patch-grid semantics and deterministic splits.

We benchmarked this against the Hugging Face "Major-TOM Core" Parquets using the strongest datasets read path, which can take PyArrow filters and use HF's smart storage to its advantage (not the HF streaming generator):

| Patches | HF datasets Parquet filters | Rasteret index+COG | Speedup |
| --- | --- | --- | --- |
| 120 | 46.83 s | 12.09 s | 3.9x |
| 1000 | 771.59 s | 118.69 s | 6.5x |

The speedup grows with patch count because HF datasets re-scans the full Parquet shard for each keyed row, while Rasteret's index resolves directly to byte ranges.

We're not saying image-in-Parquet is wrong; Major TOM is a great project and a great resource. We're saying that for many EO ML workflows, it's worth remembering: your dataset is a table, and pixels stay where they already live.

Parquet is becoming the shipping container for EO ML

The ecosystem is already moving here in its own ways:

  • AlphaEarth Foundation publishes global dense embeddings as GeoParquet-indexed COG tiles — Rasteret already ships a built-in dataset entry for them. Index-first, payload-later.
  • GeoTessera ships a small registry.parquet to discover embedding tiles, then downloads .npy tiles on demand. Same pattern.
  • Many cloud providers — AWS Open Registry, Azure Planetary Computer, Google Public Datasets — have begun publishing Parquet/GeoParquet snapshots of their catalogs (essentially STAC JSONs converted to Parquet for bulk querying).

These projects use Parquet to describe where data lives.

A Rasteret Collection also caches how to read it: COG metadata sits right in the same Parquet. When you call get_numpy() or to_torchgeo_dataset(), Rasteret's fast I/O engine goes straight from table row to pixel array: no header re-parsing, no GDAL.
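
To see why cached offsets remove the cold start, consider what a tile fetch reduces to once the offsets are known. A toy sketch, not Rasteret's internals (Obstore issues the real requests):

```python
def tile_range_header(tile_offset: int, tile_nbytes: int) -> dict:
    # HTTP Range is inclusive on both ends, hence the -1
    return {"Range": f"bytes={tile_offset}-{tile_offset + tile_nbytes - 1}"}

# An index row already stores offset/length per tile per band, so a
# pixel read is a single ranged GET -- no header round-trip beforehand.
hdr = tile_range_header(1_048_576, 65_536)
```

Without the cached offsets, a reader must first fetch and parse the TIFF header on every cold start just to learn where the tiles live; with them, that step disappears entirely.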

The rest of the library's features (subset(), build_from_table(), DuckDB joins, Arrow filtering) amount to data science on the index itself, before you ever fetch a pixel. That's the freedom you get!

What we're building toward

Rasteret 0.3 is our answer to a specific but very real frustration: teams spend too much engineering on the seam between "I found the data" and "I can train on it." The Rasteret Collection is that seam. A Parquet table with enough metadata to skip cold starts, enough structure to query with standard tools, and enough context to fetch pixels on demand.

That's the core bet: the index is the dataset. Pixels stay where they already live, in large TIFF files. Everything else - splits, labels, patch geometries, quality flags - lives as columns in a table you can version, share, and reproduce, with your data science logic embedded right in it.

What's next for Rasteret

We're actively working on:

  • More catalog entries - every dataset we add is one fewer STAC endpoint someone has to remember. Community PRs for new entries are the highest-leverage contribution, and they're easy: about 20 lines of config.
  • Deeper TorchGeo collaboration - we've opened a conversation with the TorchGeo maintainers about how Rasteret can serve as a backend for their GeoDataset ecosystem. Early days, but the goal is interop that benefits both communities.
  • Format support - tiled GeoTIFF/COGs today, but the async byte-range reader architecture extends naturally to other tiled formats.

Further out: a dataset descriptor spec

There's a bigger gap we keep running into. STAC is great at discovery, TorchGeo is great at execution, Parquet registries are becoming the lingua franca - but every project (AllenAI rslearn, GeoTessera, BigEarthNet V2, FTW) independently invents its own manifest schema. We think there's room for a small, portable descriptor that says "here's what this dataset is - bands, CRS, resolution, splits - and here's where the records live." We're calling it DDS (data descriptor spec), inspired by DDL from the database world, and we'll write more about it in a future post. Tools first, spec claims later.
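
As a purely speculative illustration of what such a descriptor might carry, expressed here as a Python dict (every field name is an assumption on our part; no DDS schema has been published):

```python
# Speculative sketch only -- field names are illustrative, not the DDS spec
descriptor = {
    "name": "s2_bangalore",
    "bands": ["B02", "B04", "B08"],
    "crs": "EPSG:32643",
    "resolution_m": 10,
    "splits": ["train", "val", "test"],
    # where the record-level index (e.g. a GeoParquet Collection) lives
    "records": "s3://example-bucket/s2_bangalore/collection.parquet",
}
```

The point is less the exact fields than the contract: enough to interpret the dataset without opening a single pixel file.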

Come build with us

Rasteret is Apache-2.0, fully open source, with no strings attached to any paid product or commercial offering of Terrafloww. We want to build this in the open, for everyone.

We'd love your help, and it doesn't have to be big. File an issue when something feels wrong. Open a PR to add a dataset entry. Report a docs typo. Tell us about a TIFF layout that breaks. If you find a case where Rasteret's output differs from rasterio, that's a correctness bug and we want to hear about it.

Read more about our larger vision in our previous blog - Streaming Tensors not files

Check out terrafloww.com to see how our commercial offerings build on this to take your EO analysis experience and infrastructure to the next level. If you're convinced, join our waitlist.


This work stands on the shoulders of GDAL, rasterio, TorchGeo, PyArrow, obstore, and the broader open-source geospatial community. Thanks for building the foundations.