<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Terrafloww Labs]]></title><description><![CDATA[Next Gen Data Infrastructure]]></description><link>https://blog.terrafloww.com/</link><image><url>https://blog.terrafloww.com/favicon.png</url><title>Terrafloww Labs</title><link>https://blog.terrafloww.com/</link></image><generator>Ghost 5.88</generator><lastBuildDate>Sat, 28 Feb 2026 04:45:10 GMT</lastBuildDate><atom:link href="https://blog.terrafloww.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Rasteret 0.3 : EO image datasets should be tables, not folders]]></title><description><![CDATA[Index COG metadata once into Parquet, skip cold starts forever. Filter with DuckDB, train with TorchGeo, fetch pixels on demand. 20x faster runs]]></description><link>https://blog.terrafloww.com/eo-datasets-should-be-tables-not-folders/</link><guid isPermaLink="false">69714bb468b4e2fbb0aacb8e</guid><category><![CDATA[Announcement]]></category><dc:creator><![CDATA[Sid]]></dc:creator><pubDate>Fri, 27 Feb 2026 12:39:16 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1451187580459-43490279c0fa?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDl8fGRhdGF8ZW58MHx8fHwxNzcyMTMzNTEzfDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<h2 id="intro">Intro</h2><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-text"><i><b><strong class="italic" style="white-space: pre-wrap;">&quot;Metadata is a love note to the future.&quot;</strong></b></i><b><strong style="white-space: pre-wrap;">, Jason Scott, Internet Archive</strong></b></div></div><img 
src="https://images.unsplash.com/photo-1451187580459-43490279c0fa?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDl8fGRhdGF8ZW58MHx8fHwxNzcyMTMzNTEzfDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" alt="Rasteret 0.3 : EO image datasets should be tables, not folders"><p>In the database world, this has been gospel for decades. <em>The catalog, the thing that tells you where data lives, what shape it has, how to read it, is always more valuable than any single query result.</em> Entire careers in data engineering exist because someone figured out that the description of the data matters more than the data itself, <em>think Iceberg.</em></p><blockquote>We believe in building systems that stay true to this: metadata is better than more data!</blockquote><p>We in the geospatial industry moved hundreds of petabytes of satellite imagery to the cloud. Then we kept re-parsing the <em>table of contents</em>. Millions of times. Every cold start.</p><p>If you&apos;ve trained on cloud-hosted imagery, you know the cold-start problem: your stack parses lots of metadata before it sees a single pixel, and you pay that cost in every restart of compute, every new environment in K8s, and more. You shrug your shoulders and say &quot;<em>eh, first run is always slow</em>&quot;. <a href="https://blog.terrafloww.com/efficient-cloud-native-raster-data-access-an-alternative-to-rasterio-gdal/" rel="noreferrer">We&apos;ve written about this before in depth.</a></p><p>Caching that metadata was step <a href="https://blog.terrafloww.com/rasteret-a-library-for-faster-and-cheaper-open-satellite-data-access/" rel="noreferrer">one</a> for us. </p><p>But caching alone doesn&apos;t change how you work with data; it just makes the same workflow faster.</p><blockquote class="kg-blockquote-alt">What changes things is when you start looking at that cache as a table you can work with. 
Filter it, join it, enrich it - then pair that with an I/O engine for object storage that gets you pixels on demand. That&apos;s the new Rasteret library!</blockquote><h2 id="introducing-rasteret-03">Introducing&#xA0;<strong>Rasteret 0.3</strong>&#xA0;:</h2><blockquote><strong>Index once. Work in tables. Stream pixels on demand.</strong></blockquote><p>At the heart of Rasteret is a&#xA0;<strong>Collection</strong>: a GeoParquet table with the properties of EO images, their URLs, and cached tile-level byte offsets for every band, so each row already knows exactly where to read its pixels. Your imagery stays in COGs. The Collection is what you reload, share, version, and query.</p><p>Here&apos;s how to build one:</p><pre><code class="language-python">import rasteret

collection = rasteret.build(
    &quot;earthsearch/sentinel-2-l2a&quot;,
    name=&quot;s2_bangalore&quot;,
    bbox=(77.5, 12.9, 77.7, 13.1),
    date_range=(&quot;2024-01-01&quot;, &quot;2024-06-30&quot;),
)
</code></pre><p><code>build()</code>&#xA0;picks from a catalog of 12 pre-registered sources (Sentinel-2, Landsat, NAIP, Copernicus DEM, AEF embeddings, and more) and creates the Collection you want. Don&apos;t see your dataset? Use&#xA0;<code>build_from_stac()</code>&#xA0;for any STAC API, or&#xA0;<code>build_from_table()</code>&#xA0;for any existing Parquet that has TIFF URLs in it.</p><h3 id="what-changes-in-practice">What changes in practice</h3><p>That&#xA0;<code>build()</code>&#xA0;call hides a big shift. You&apos;re not working with files anymore; you&apos;re working with rows in a Parquet table.</p><pre><code class="language-python">sub = collection.subset(cloud_cover_lt=15, date_range=(&quot;2024-03-01&quot;, &quot;2024-06-01&quot;))
</code></pre><p>Once it&apos;s a row, you can filter it, join it, version it. <em>Try doing that with files in a folder&#x1F440;</em></p><p>Append columns that you care about (<code>split</code>,&#xA0;<code>label</code>,&#xA0;<code>qa_flag</code>,&#xA0;<code>model_score</code>,&#xA0;<code>aoi_wkb</code>...), push down filters with DuckDB or PyArrow, materialize a new Parquet for an experiment, share it, and get the exact same subset later. Keep working in Parquet; play with the metadata as much as you want.</p><p>The selection logic you use lives next to the data references in Parquet. If you&apos;ve ever had&#xA0;<code>train_v7_final_final2/</code>&#xA0;folders of image chips lying around, you can feel why this matters.</p><h3 id="when-you-need-pixels-pick-the-output-format-that-fits">When you need pixels, pick the output format that fits:</h3><pre><code class="language-python"># NumPy arrays, lightweight, no extra deps
arr = collection.get_numpy(
    geometries=(77.55, 13.01, 77.58, 13.08),
    bands=[&quot;B04&quot;, &quot;B08&quot;],
)
# shape: [N, C, H, W]

# xarray Dataset, for analysis
ds = collection.get_xarray(geometries=bbox, bands=[&quot;B04&quot;, &quot;B08&quot;])
ndvi = (ds.B08.astype(float) - ds.B04.astype(float)) / (ds.B08 + ds.B04)

# TorchGeo GeoDataset, for ML training
dataset = collection.to_torchgeo_dataset(bands=[&quot;B04&quot;, &quot;B08&quot;], chip_size=256)
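# Aside: the same NDVI math works on the NumPy path too. The array layout
# is [N, C, H, W], so with bands=["B04", "B08"] channel 0 is red and
# channel 1 is NIR. Synthetic array below, purely to illustrate the axes:
import numpy as np

demo = np.stack(
    [np.full((2, 4, 4), 1000.0),   # channel 0: B04 (red)
     np.full((2, 4, 4), 3000.0)],  # channel 1: B08 (NIR)
    axis=1,
)                                  # demo.shape == (2, 2, 4, 4)
red, nir = demo[:, 0], demo[:, 1]
demo_ndvi = (nir - red) / (nir + red)   # 0.5 everywhere for these values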
</code></pre><p>All three paths share the same I/O engine: using cached TIFF metadata, it goes straight to fast reads (no header re-parsing, no GDAL), as explained in detail in previous blogs.&#xA0;<code>geometries</code>&#xA0;accepts bbox tuples, Arrow arrays, Shapely objects, or raw WKB; Arrow columns from GeoParquet are the fastest input.</p><h2 id="tag-patches-not-files">Tag patches, not files</h2><p>To make this more concrete, let&apos;s walk through a simple example.</p><p>Most EO workflows eventually produce semantics: &quot;these polygons are crop fields&quot;, &quot;these boxes contain ships&quot;, &quot;this tile has clouds&quot;, &quot;these 50k patches are hard negatives.&quot;</p><p>Traditionally, those semantics turn into more files: export chips, name them, keep the CSV that explains them, and pray you can reproduce the CSV-to-files join logic six weeks later.</p><p>With Rasteret + Parquet, you keep the semantics as rows and columns:</p><ol><li>Build (or import) a scene-level Collection once</li><li>Run a model/heuristic that emits detections as GeoParquet</li><li>Join detections -&gt; scenes to produce a&#xA0;<strong>patch-level Collection</strong>&#xA0;(one row per patch)</li><li>Train from that patch table directly, no chip exports, no folder structure. It&apos;s just metadata with your logic in it.</li></ol><pre><code class="language-python">import duckdb
import pyarrow.compute as pc
import pyarrow.parquet as pq
import rasteret

# load a collection already built before
scenes = rasteret.load(&quot;./s2_bangalore/&quot;)

# your ship detections parquet
detections = pq.read_table(&quot;./ship_detections.geoparquet&quot;)

con = duckdb.connect()
con.register(&quot;scenes&quot;, scenes.dataset.to_table())
con.register(&quot;det&quot;, detections)

# Join detections to scenes, geometry is now the patch polygon, not the scene footprint
patches = con.sql(&quot;&quot;&quot;
  SELECT det.patch_id AS id, scenes.datetime, det.geometry,
         scenes.assets, det.label, det.score, det.split
  FROM det JOIN scenes ON det.scene_id = scenes.id
  WHERE det.score &gt; 0.8
&quot;&quot;&quot;).fetch_arrow_table()

patch_collection = rasteret.build_from_table(
    patches, name=&quot;ship_patches&quot;, enrich_cog=True
)
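# Aside: the same join-and-filter pattern on tiny in-memory tables, so you
# can see the result without real detections. Toy data only; column names
# mirror the query above:
import duckdb
import pyarrow as pa

toy_scenes = pa.table({"id": ["s1", "s2"],
                       "datetime": ["2024-03-01", "2024-03-11"]})
toy_det = pa.table({"patch_id": [1, 2, 3],
                    "scene_id": ["s1", "s1", "s2"],
                    "score": [0.95, 0.40, 0.85]})
toy_con = duckdb.connect()
toy_con.register("toy_scenes", toy_scenes)
toy_con.register("toy_det", toy_det)
toy = toy_con.sql("""
  SELECT d.patch_id, s.datetime, d.score
  FROM toy_det d JOIN toy_scenes s ON d.scene_id = s.id
  WHERE d.score > 0.8
""").fetch_arrow_table()   # 2 rows survive the score filter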
</code></pre><p>Now&#xA0;<code>patch_collection</code>&#xA0;is a Collection whose&#xA0;<code>geometry</code>&#xA0;column represents&#xA0;<strong>patches</strong>, not scenes. Each row carries a label, a score, and a split - all filterable with Arrow, DuckDB, or pandas:</p><pre><code class="language-python"># Arrow-native filtering: high-confidence train patches
train = patch_collection.subset(split=&quot;train&quot;)
table = train.dataset.to_table(columns=[&quot;geometry&quot;, &quot;label&quot;, &quot;score&quot;])
mask = pc.greater(table.column(&quot;score&quot;), 0.9)
hard_patches = pc.filter(table.column(&quot;geometry&quot;), mask)

# Arrow WKB geometries go directly to get_numpy -- no Shapely, no Python loop
arr = train.get_numpy(geometries=hard_patches, bands=[&quot;B04&quot;, &quot;B08&quot;])
# arr.shape: [N, C, H, W]
</code></pre><blockquote><strong>Use Parquet to decide and experiment with <em>what</em>&#xA0;to train on. Use Rasteret to fetch the pixels for those rows, on demand, whenever needed.</strong></blockquote><h2 id="torchgeo-interop">TorchGeo interop</h2><p>Most teams already have samplers, transforms, and dataloaders built around TorchGeo. </p><p>Rasteret&apos;s <code>Collection.to_torchgeo_dataset()</code>&#xA0;returns a standard TorchGeo&#xA0;<code>GeoDataset</code>&#xA0;(requires TorchGeo &gt;= 0.9.0), so the rest of your stack stays the same. </p><p>Rasteret replaces the I/O backend (its own reader instead of rasterio/GDAL), but everything downstream of the GeoDataset is unchanged. </p><p>That same&#xA0;<code>patch_collection</code>? Here&apos;s what it looks like in TorchGeo.&#xA0;<code>label_field</code>&#xA0;is a Rasteret addition: it pulls labels from the Parquet column into each sample, so they flow through TorchGeo&apos;s&#xA0;<code>stack_samples</code>&#xA0;collate function into your batch:</p><pre><code class="language-python">from torch.utils.data import DataLoader
from torchgeo.datasets import stack_samples
from torchgeo.samplers import RandomGeoSampler

dataset = patch_collection.to_torchgeo_dataset(
    bands=[&quot;B04&quot;, &quot;B08&quot;],
    chip_size=256,
    split=&quot;train&quot;,
    label_field=&quot;label&quot;,     # each sample[&quot;label&quot;] comes from the Parquet column
)
sampler = RandomGeoSampler(dataset, size=256, length=1000)
loader = DataLoader(dataset, sampler=sampler, batch_size=8, collate_fn=stack_samples)

for batch in loader:
    images = batch[&quot;image&quot;]   # [B, C, H, W]
    labels = batch[&quot;label&quot;]   # [B]
    ...
</code></pre><h3 id="cold-start-benchmarks">Cold-start benchmarks</h3><p>Same AOIs, same scenes, same sampler, same DataLoader. Both paths output identical&#xA0;<code>[batch, T, C, H, W]</code>&#xA0;tensors. TorchGeo runs with its recommended GDAL settings for best-case remote COG performance.</p><table>
<thead>
<tr>
<th>Scenario</th>
<th style="text-align:right">TorchGeo (rasterio/GDAL)</th>
<th style="text-align:right">Rasteret</th>
<th style="text-align:right">Speedup</th>
</tr>
</thead>
<tbody>
<tr>
<td>Single AOI, 15 scenes</td>
<td style="text-align:right">9.08 s</td>
<td style="text-align:right">1.14 s</td>
<td style="text-align:right"><strong>8x</strong></td>
</tr>
<tr>
<td>Multi-AOI, 30 scenes</td>
<td style="text-align:right">42.05 s</td>
<td style="text-align:right">2.25 s</td>
<td style="text-align:right"><strong>19x</strong></td>
</tr>
<tr>
<td>Cross-CRS boundary, 12 scenes</td>
<td style="text-align:right">12.47 s</td>
<td style="text-align:right">0.59 s</td>
<td style="text-align:right"><strong>21x</strong></td>
</tr>
</tbody>
</table>
<blockquote>As for correctness of outputs: we treat rasterio/GDAL alignment as a&#xA0;<strong>correctness contract</strong>&#xA0;for supported inputs (tiled GeoTIFF/COGs) in our tests. And when Rasteret can&apos;t safely match the expected semantics, it fails loudly instead of letting bad images reach the GPU.</blockquote><h2 id="major-tom-without-the-image-in-parquet">Major-TOM without the image-in-Parquet</h2><p>Here&apos;s more evidence that Rasteret&apos;s design (logic + metadata + a byte-range-aware I/O engine) works, and is better for everyone.</p><p><a href="https://huggingface.co/datasets/Major-TOM/Core-S2L2A?ref=blog.terrafloww.com">Major TOM</a>&#xA0;is a large-scale Sentinel-2 training dataset published on Hugging Face. It has a two-layer Parquet design:&#xA0;<code>metadata.parquet</code>&#xA0;is a global index (2.2M rows with grid cells, product IDs, coordinates), while&#xA0;<code>images/*.parquet</code>&#xA0;stores <em>actual GeoTIFF image data directly inside Parquet</em> cells. That&apos;s the &quot;image-in-Parquet&quot; pattern.</p><p>We think there&apos;s a simpler and better option to achieve the same dataset semantics. The metadata already has everything you need to reconstruct patch geometries and fetch the original source pixels on demand:</p><pre><code class="language-python">import rasteret

# Build a Collection from Sentinel-2 scenes
collection = rasteret.build(&quot;earthsearch/sentinel-2-l2a&quot;,
                            name=&quot;major-tom-on-the-fly&quot;,
                            bbox=(-122.55, 37.65, -122.30, 37.90),
                            date_range=(&quot;2024-01-01&quot;, &quot;2024-02-01&quot;))

# Add Major TOM-style columns (grid cells, product IDs, deterministic splits)
# as Parquet columns. enrich_major_tom_columns is a user-defined helper in
# this sketch, not a Rasteret API
enriched = enrich_major_tom_columns(collection, grid_km=10)

# Bring your PyArrow table/dataset back into a Rasteret collection;
# this simply validates that the table still has the columns Rasteret internals care about
collection = rasteret.build_from_table(enriched, name=&quot;major-tom-on-the-fly&quot;)

# Fetch patches as NumPy from Arrow geometry, no payload Parquet needed
# (geometry_array is an Arrow array of patch geometries built upstream)
subset = collection.subset(split=&quot;train&quot;)
arr = subset.get_numpy(geometries=geometry_array, bands=[&quot;B02&quot;, &quot;B08&quot;])
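# Aside: "deterministic splits" can be as simple as hashing a stable key
# (e.g. the product ID), so train/val assignment is reproducible across
# machines and reruns. Illustrative helper, not a Rasteret API:
import hashlib

def deterministic_split(key, val_fraction=0.1):
    bucket = hashlib.sha256(key.encode()).digest()[0] / 255.0
    return "val" if bucket < val_fraction else "train"

# the same key always lands in the same split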
</code></pre><p>The result is a few MBs&apos; worth of Parquet index instead of multi-GB Parquet shards, with the same patch-grid semantics and deterministic splits.</p><p>We benchmarked this against the Hugging Face &quot;Major-TOM Core&quot; Parquets using the strongest&#xA0;<code>datasets</code>&#xA0;read path, which can take PyArrow filters and use HF&apos;s smart storage to its advantage (not the HF streaming generator):</p><table>
<thead>
<tr>
<th style="text-align:right">Patches</th>
<th style="text-align:right">HF <code>datasets</code> parquet filters</th>
<th style="text-align:right">Rasteret index+COG</th>
<th style="text-align:right">Speedup</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right">120</td>
<td style="text-align:right">46.83 s</td>
<td style="text-align:right">12.09 s</td>
<td style="text-align:right"><strong>3.9x</strong></td>
</tr>
<tr>
<td style="text-align:right">1000</td>
<td style="text-align:right">771.59 s</td>
<td style="text-align:right">118.69 s</td>
<td style="text-align:right"><strong>6.5x</strong></td>
</tr>
</tbody>
</table>
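<p>The index can skip shard scans because each row stores a tile&apos;s byte offset and length, which map directly to one HTTP Range request against the source COG. A toy sketch of that mapping (illustrative only, not Rasteret&apos;s internals):</p><pre><code class="language-python">def range_header(offset: int, length: int) -> dict:
    # HTTP Range is inclusive on both ends
    return {"Range": f"bytes={offset}-{offset + length - 1}"}

# e.g. a 4 KiB tile starting at byte 65536:
hdr = range_header(65536, 4096)   # {"Range": "bytes=65536-69631"}
</code></pre>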
<p>The speedup grows with patch count because HF&#xA0;<code>datasets</code>&#xA0;re-scans the full Parquet shard for each keyed row, while Rasteret&apos;s index resolves directly to byte ranges.</p><p>We&apos;re not saying image-in-Parquet is wrong; Major TOM is a great project and a great resource. We&apos;re saying that for many EO ML workflows, it&apos;s better to treat your dataset as a table, with pixels staying where they already live.</p><h2 id="parquet-is-becoming-the-shipping-container-for-eo-ml">Parquet is becoming the shipping container for EO ML</h2><p>The ecosystem is already moving here in its own ways:</p><ul><li><strong>AlphaEarth Foundation</strong>&#xA0;publishes global dense embeddings as GeoParquet-indexed COG tiles &#x2014; Rasteret already ships a built-in dataset entry for them. Index-first, payload-later.</li><li><strong>GeoTessera</strong>&#xA0;ships a small&#xA0;<code>registry.parquet</code>&#xA0;to discover embedding tiles, then downloads&#xA0;<code>.npy</code>&#xA0;tiles on demand. Same pattern.</li><li>Many cloud providers &#x2014; the AWS Open Data Registry, Microsoft Planetary Computer, Google Public Datasets &#x2014; have begun publishing Parquet/GeoParquet snapshots of their catalogs (essentially STAC JSONs converted to Parquet for bulk querying).</li></ul><p>These projects use Parquet to describe where data lives. </p><p>A Rasteret Collection also caches&#xA0;<em>how</em>&#xA0;to read it: COG metadata sits right in the same Parquet. When you call&#xA0;<code>get_numpy()</code>&#xA0;or&#xA0;<code>to_torchgeo_dataset()</code>, Rasteret&apos;s fast I/O engine goes straight from table row to pixel array: no header re-parsing, no GDAL.</p><blockquote><strong>The rest of the library (<code>subset()</code>,&#xA0;<code>build_from_table()</code>, DuckDB joins, Arrow filtering) is data science on the index itself, before you ever fetch a pixel. That&apos;s the freedom you get! 
</strong></blockquote><h2 id="what-were-building-toward">What we&apos;re building toward</h2><p>Rasteret 0.3 is our answer to a specific but very real frustration: teams spend too much engineering effort on the seam between <em>&quot;I found the data&quot;</em> and <em>&quot;I can train on it.&quot;</em> The Rasteret Collection is that seam: a Parquet table with enough metadata to skip cold starts, enough structure to query with standard tools, and enough context to fetch pixels on demand.</p><p>That&apos;s the core bet:&#xA0;<strong>the index is the dataset</strong>. Pixels stay where they already live, in large TIFF files. Everything else - splits, labels, patch geometries, quality flags - lives as columns in a table you can version, share, reproduce, and embed your data science logic in. </p><h3 id="whats-next-for-rasteret">What&apos;s next for Rasteret</h3><p>We&apos;re actively working on:</p><ul><li><strong>More catalog entries</strong>&#xA0;- every dataset we add is one fewer STAC endpoint someone has to remember. Community PRs for new entries are the highest-leverage contribution, and they&apos;re easy: about 20 lines of config.</li><li><strong>Deeper TorchGeo collaboration</strong>&#xA0;- we&apos;ve opened a conversation with the TorchGeo maintainers about how Rasteret can serve as a backend for their GeoDataset ecosystem. Early days, but the goal is interop that benefits both communities.</li><li><strong>Format support</strong>&#xA0;- tiled GeoTIFF/COGs today, but the async byte-range reader architecture extends naturally to other tiled formats.</li></ul><h3 id="further-out-a-dataset-descriptor-spec">Further out: a dataset descriptor spec</h3><p>There&apos;s a bigger gap we keep running into. STAC is great at discovery, TorchGeo is great at execution, Parquet registries are becoming the lingua franca - but every project (AllenAI rslearn, GeoTessera, BigEarthNet V2, FTW) independently invents its own manifest schema. 
We think there&apos;s room for a small, portable descriptor that says &quot;here&apos;s what this dataset&#xA0;<em>is</em>&#xA0;- bands, CRS, resolution, splits - and here&apos;s where the records live.&quot; We&apos;re calling it DDS (data descriptor spec), inspired by DDL from the database world, and we&apos;ll write more about it in a future post. Tools first, spec claims later. </p><h2 id="come-build-with-us">Come build with us</h2><p>Rasteret is Apache-2.0, fully open source, with no strings attached to any paid product or commercial offering of Terrafloww. We want to build this in the open, for everyone.</p><p>We&apos;d love your help, and it doesn&apos;t have to be big. File an issue when something feels wrong. Open a PR to add a dataset entry. Report a docs typo. Tell us about a TIFF layout that breaks. If you find a case where Rasteret&apos;s output differs from rasterio, that&apos;s a correctness bug and we want to hear about it.</p><ul><li><strong>Docs</strong>:&#xA0;<a href="https://terrafloww.github.io/rasteret/?ref=blog.terrafloww.com">terrafloww.github.io/rasteret</a></li><li><strong>Benchmarks</strong>:&#xA0;<a href="https://terrafloww.github.io/rasteret/explanation/benchmark/?ref=blog.terrafloww.com">Benchmark write-up</a></li><li><strong>GitHub</strong>:&#xA0;<a href="https://github.com/terrafloww/rasteret?ref=blog.terrafloww.com">github.com/terrafloww/rasteret</a></li><li><strong>PyPI</strong>:&#xA0;<code>uv pip install rasteret</code></li><li><strong>Discord</strong>:&#xA0;<a href="https://discord.gg/V5vvuEBc?ref=blog.terrafloww.com">discord.gg/V5vvuEBc</a></li></ul><p>Read more about our larger vision in our previous blog - <a href="https://blog.terrafloww.com/streaming-tensors-not-files-for-geoai/" rel="noreferrer">Streaming Tensors not files</a></p><p>Check out <a href="https://terrafloww.com/?ref=blog.terrafloww.com" rel="noreferrer">terrafloww.com</a> to see how our commercial offerings combine and take your EO analysis experience and 
infrastructure to the next level. If you are convinced, join our waitlist.</p><hr><p><em>This work stands on the shoulders of GDAL, rasterio, TorchGeo, PyArrow, obstore, and the broader open-source geospatial community. Thanks for building the foundations.</em></p>]]></content:encoded></item><item><title><![CDATA[Stop selling files: The case for streaming tensors in GeoAI]]></title><description><![CDATA[Rasteret sped up geo image reads by 10x. But users still spend 80% of their time doing ETL before even feeding GPUs. It's like having a fast car on roads full of potholes. We investigate why, and share our solution]]></description><link>https://blog.terrafloww.com/streaming-tensors-not-files-for-geoai/</link><guid isPermaLink="false">69654b3268b4e2fbb0aac3f5</guid><dc:creator><![CDATA[Sid]]></dc:creator><pubDate>Wed, 21 Jan 2026 20:32:52 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1760344594784-60ff14035eb0?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDQ5fHxzdHJlYW0lMjBkYXRhfGVufDB8fHx8MTc2OTAxODUyMnww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<h2 id="intro">Intro</h2><img src="https://images.unsplash.com/photo-1760344594784-60ff14035eb0?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDQ5fHxzdHJlYW0lMjBkYXRhfGVufDB8fHx8MTc2OTAxODUyMnww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" alt="Stop selling files: The case for streaming tensors in GeoAI"><p>Last year we made the <a href="https://blog.terrafloww.com/rasteret-a-library-for-faster-and-cheaper-open-satellite-data-access/" rel="noreferrer"><strong>Rasteret</strong></a> library, which sped up access to Earth Observation (EO) images by 10x, while doing <strong>zero copy</strong> of the original data. We honestly thought this solved a lot of problems. 
But we realized users were still spending 80% of their time doing ceremonies before feeding the GPUs with EO images: surfing STAC catalogs, filling out forms to download images, drawing polygons on maps and waiting for email links, trying to find RAI information. All of this is still happening in the age of cloud-native formats and AI agents.</p><blockquote class="kg-blockquote-alt">It&apos;s like we built a faster car for everyone, but the roads are still full of potholes. So we set out to see why.</blockquote><h2 id="the-gap-for-eo-imagery"><strong>The gap for EO imagery</strong></h2><p>2025 was the year Defense &quot;ate&quot; everything. Governments signed 9-figure checks for dedicated satellite constellations and proprietary end-to-end intelligence systems. </p><blockquote>Honestly? This is good; it shows they buy tangible outcomes, no matter the cost.</blockquote><p>But what about the rest of the commercial markets, like insurance, agriculture, supply chains, and real estate? Commercial EO as an industry, whether it&apos;s raw images from space or insights from sophisticated science labs, is fractured between open data standards and special requirements when trying to work as one system.</p><p>Tabular geospatial data (vectors) is already moving to standards like Parquet and Iceberg, while images from satellites and drones are still tricky to work with. They are similar to data in other domains, like medical imagery and machinery CAD models, which are also hard to bring into data analytics for multimodal science. 
</p><h2 id="the-ceremonies-to-face"><strong>The ceremonies to face</strong> </h2><p>If you are an ML team building a risk model for real estate today and you want to use EO images of any kind, then before you even write a line of ML code you have to survive the ceremonies of Discover, Access, and Read:</p><ul><li><strong>Discovery is slow:</strong> Search across NASA, ESA, and private-company catalogs, plus public catalogs</li><li><strong>Access is full of diversions, with no strict standards:</strong> Fill forms for data, draw polygons in a map, get download links via email; every data provider&#x2019;s &quot;catalog&quot; is different</li><li><strong>Reading is heavy:</strong> EO images need specialised tools and boilerplate code to keep the GPU fed fast</li></ul><p>As Aravind from <a href="https://terrawatch.substack.com/?ref=blog.terrafloww.com" rel="noreferrer">Terrawatch</a> rightly said, for most companies using EO, a &quot;hidden product&quot; is just their internal data pipeline. They burn their seed money building custom infrastructure just to do the unsexy work of finding and handling imagery differences. </p><figure class="kg-card kg-image-card"><img src="https://blog.terrafloww.com/content/images/2026/01/Gemini_blog2.png" class="kg-image" alt="Stop selling files: The case for streaming tensors in GeoAI" loading="lazy" width="2000" height="1091" srcset="https://blog.terrafloww.com/content/images/size/w600/2026/01/Gemini_blog2.png 600w, https://blog.terrafloww.com/content/images/size/w1000/2026/01/Gemini_blog2.png 1000w, https://blog.terrafloww.com/content/images/size/w1600/2026/01/Gemini_blog2.png 1600w, https://blog.terrafloww.com/content/images/2026/01/Gemini_blog2.png 2000w" sizes="(min-width: 720px) 720px"></figure><p><strong>The sharing dead end</strong></p><p>Let&apos;s say you survive the ceremony and create great insights from your AI model. How do you share it? PDF reports. BI dashboards. Maps. 
Convert images to tables.</p><blockquote>You take high-dimensional, computable data and flatten it for human eyeballs. </blockquote><p>If you really wish to share it as images, you end up with STAC catalogs. Probably starting the above &quot;ceremonies&quot; again for each end user looking to build on your outputs.</p><h2 id="what-matters-in-the-age-of-agents"><strong>What matters, in the age of Agents</strong> </h2><p>If a human engineer takes weeks to collate and prepare EO datasets, it is impossible for a generic AI Agent.</p><p>We are moving to a world of MCP and A2A commerce. If an AI agent is trying to answer a prompt like &quot;estimate my supply chain&apos;s carbon emissions&quot; or &quot;should I buy my home here&quot;, the AI might think &quot;<em>location data might be insightful for this prompt, let me get it&quot;</em></p><p>Today, that agent hits a wall. It hits a &quot;Contact Sales&quot; form or a &quot;Download&quot; button. Just like a human does.</p><figure class="kg-card kg-image-card"><img src="https://blog.terrafloww.com/content/images/2026/01/Gemini_blog1.png" class="kg-image" alt="Stop selling files: The case for streaming tensors in GeoAI" loading="lazy" width="2000" height="1091" srcset="https://blog.terrafloww.com/content/images/size/w600/2026/01/Gemini_blog1.png 600w, https://blog.terrafloww.com/content/images/size/w1000/2026/01/Gemini_blog1.png 1000w, https://blog.terrafloww.com/content/images/size/w1600/2026/01/Gemini_blog1.png 1600w, https://blog.terrafloww.com/content/images/2026/01/Gemini_blog1.png 2000w" sizes="(min-width: 720px) 720px"></figure><p>Be it a satellite manufacturer publishing their images, non-profits distributing deforestation predictions, or high-end labs sharing Geo Foundation Model predictions, all these images need to be easy to find, read and attribute (<a href="https://www.nature.com/articles/sdata201618?ref=blog.terrafloww.com#:~:text=Box%202%3A-,The%20FAIR%20Guiding%20Principles,-To%20be%20Findable" 
rel="noreferrer">FAIR</a>).</p><p>We believe the commercial EO industry needs its own protocol for images: a connectivity layer that handles the handshake between a raw S3 bucket and a GPU tensor, and creates a better experience for both humans and AI agents.</p><h2 id="the-solution-stream-tensors-not-files"><strong>The Solution: Stream Tensors, not Files</strong></h2><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">YouTube</strong></b> solved distribution by standardizing the player, not the cameras or content. We at Terrafloww are doing the same for EO. </div></div><p>Instead of forcing you to download GeoTIFFs, or creating a new file format that &quot;shoe-horns&quot; your data by copying images into tables, <strong>the Rasteret SDK acts as a buffering protocol, while the Terrafloww Platform acts as the player.</strong></p><figure class="kg-card kg-image-card"><img src="https://blog.terrafloww.com/content/images/2026/01/Gemini_Generated_Image_xr7eydxr7eydxr7e.png" class="kg-image" alt="Stop selling files: The case for streaming tensors in GeoAI" loading="lazy" width="2000" height="960" srcset="https://blog.terrafloww.com/content/images/size/w600/2026/01/Gemini_Generated_Image_xr7eydxr7eydxr7e.png 600w, https://blog.terrafloww.com/content/images/size/w1000/2026/01/Gemini_Generated_Image_xr7eydxr7eydxr7e.png 1000w, https://blog.terrafloww.com/content/images/size/w1600/2026/01/Gemini_Generated_Image_xr7eydxr7eydxr7e.png 1600w, https://blog.terrafloww.com/content/images/2026/01/Gemini_Generated_Image_xr7eydxr7eydxr7e.png 2000w" sizes="(min-width: 720px) 720px"></figure><p>We are building on the open MLCommons Croissant standard to broadcast your datasets across the web better than STAC catalogs. 
Our SDK, meanwhile, speaks Arrow and DLPack.</p><p>So what does this mean for you?</p><p>For Data Providers -</p><ul><ul><li><strong>Keep your images </strong>in your own S3 bucket, zero copying.</li><li><strong>We index it</strong>, you add your attribution and license information</li><li><strong>Increased visibility</strong> of your data on the web, with correct metadata</li><li><strong>Get paid</strong> <strong>for every chip </strong>that flows out of your S3</li><li><strong>Set the price </strong>for very granular units, not the old &quot;$ per sq km&quot;<ul><li>Play with Excel-like formulas to tune prices for micro units of data</li></ul></li><li><strong>Track every byte</strong> you shared and the money you earn, on our dashboard</li></ul></ul><p>For Data Consumers (Humans/AI Agents) - </p><ul><ul><li>With the Rasteret SDK you get -<ul><li><strong>Easier data discovery</strong> and filtering of datasets</li><li><strong>10x faster reads of </strong>EO images and no boilerplate code to feed your GPUs</li></ul></li></ul><ul><ul><li><strong>Fully interoperable</strong> with PyTorch, JAX, DuckDB and more libraries via Apache Arrow and DLPack</li></ul><li><strong>Track every byte</strong> you queried and how much it costs, on our dashboard</li></ul></ul><blockquote class="kg-blockquote-alt">Whether an AI Agent reads as little as 5 chips of an EO image or a hedge fund reads images of a whole continent, both providers and consumers can <strong>speed up, monitor, and monetize every byte</strong> with Terrafloww</blockquote><blockquote class="kg-blockquote-alt">This is the foundation needed to make EO data truly ready for agentic commerce</blockquote><h3 id="what-if-you-dont-wish-to-sell-data"><strong>What if you don&apos;t wish to sell data?</strong> </h3><p>All the features listed above remain the same, even for your internal needs. 
</p><p>Filter, discover and accelerate internal analysis on your image datasets with our SDK, and monitor and bill internal teams in your enterprise with our console. Skip messing with your data platform; we integrate with you.</p><h3 id="will-it-devalue-my-data"><strong>Will it devalue my data?</strong></h3><p>Some might fear that if we make data this easy to mix and match, AI agents will just combine datasets to create something better, lowering the value of the original source.</p><p>To that, we say: it&apos;s possible, but when friction drops to zero, the volume of innovation skyrockets. We will accelerate towards better data products. We want to see what happens when a group of ML scientists can instantly monetize their inference model without building a sales team, or when a non-profit&#x2019;s dataset can be effortlessly consumed by a Fortune 500&#x2019;s AI agent.</p><h2 id="join-the-beta"><strong>Join the Beta</strong> </h2><p>The Defense market has its systems. It&#x2019;s time the Commercial EO market had its own. </p><p>Come test us out; we would love your feedback, and to build this future with you. </p><div class="kg-card kg-button-card kg-align-center"><a href="https://accounts.terrafloww.com/waitlist?ref=blog.terrafloww.com" class="kg-btn kg-btn-accent">Join the Waitlist!</a></div><p>If you&apos;d like to know more, or just wanna chat with us, join our Discord. 
We will be releasing more updates and blogs soon.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://discord.gg/hdR2UJ2Zdx?ref=blog.terrafloww.com" class="kg-btn kg-btn-accent">Join Discord</a></div>]]></content:encoded></item><item><title><![CDATA[Rasteret: A Library for Faster and Cheaper Open Satellite Data Access]]></title><description><![CDATA[<p><br><strong>Happy New Year, Everyone!</strong></p><p>It&#x2019;s been just over a month since we published our last <a href="https://blog.terrafloww.com/efficient-cloud-native-raster-data-access-an-alternative-to-rasterio-gdal/" rel="noreferrer">blog</a> describing a method that improves reading of Cloud-Optimized GeoTIFFs (COGs). Over the New Year break, we worked on open-sourcing not just the core logic but also creating a library with simple high-level</p>]]></description><link>https://blog.terrafloww.com/rasteret-a-library-for-faster-and-cheaper-open-satellite-data-access/</link><guid isPermaLink="false">67820f7168b4e2fbb0aabec0</guid><dc:creator><![CDATA[Sid]]></dc:creator><pubDate>Sun, 12 Jan 2025 06:12:17 GMT</pubDate><media:content url="https://blog.terrafloww.com/content/images/2025/01/L9SanFrancisco.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://blog.terrafloww.com/content/images/2025/01/L9SanFrancisco.jpg" alt="Rasteret: A Library for Faster and Cheaper Open Satellite Data Access"><p><br><strong>Happy New Year, Everyone!</strong></p><p>It&#x2019;s been just over a month since we published our last <a href="https://blog.terrafloww.com/efficient-cloud-native-raster-data-access-an-alternative-to-rasterio-gdal/" rel="noreferrer">blog</a> describing a method that improves reading of Cloud-Optimized GeoTIFFs (COGs). 
Over the New Year break, we worked on open-sourcing not just the core logic but also on building a library with simple high-level APIs, to make our approach easier for users to adopt.</p><p>While the library is still in its early stages, we believe it can benefit geospatial data scientists and engineers significantly. By sharing it now, we hope to receive feedback and contributions via GitHub Issues and PRs to make it even more useful for the community.</p><h3 id="performance-snapshot-rasterio-vs-rasteret">Performance Snapshot: Rasterio vs. Rasteret</h3><figure class="kg-card kg-image-card kg-width-wide"><img src="https://blog.terrafloww.com/content/images/2025/01/single_timeseries_request.png" class="kg-image" alt="Rasteret: A Library for Faster and Cheaper Open Satellite Data Access" loading="lazy" width="762" height="448" srcset="https://blog.terrafloww.com/content/images/size/w600/2025/01/single_timeseries_request.png 600w, https://blog.terrafloww.com/content/images/2025/01/single_timeseries_request.png 762w"></figure><p>Here&#x2019;s an exciting stat to illustrate the potential of Rasteret:</p><ul><li><strong>1-Year NDVI Time Series of a Farm Using Landsat 9</strong><ul><li><strong>Rasterio (Multiprocessing)</strong>: 32 seconds (first run), 24 seconds (subsequent runs with GDAL caching).</li><li><strong>Rasteret</strong>: Just <strong>3 seconds</strong>&#x2014;every time.</li><li>Even Google Earth Engine takes 10&#x2013;30 seconds for its first run and 3&#x2013;5 seconds on subsequent runs.</li></ul></li></ul><h3 id="how-was-it-made"><strong>How was it made?</strong></h3><p>Inspired by <a href="https://github.com/fsspec/kerchunk?ref=blog.terrafloww.com" rel="noreferrer">Kerchunk</a>, Rasteret uses a cache to speed up COG file processing.</p><p>We call this cache a &quot;Rasteret Collection&quot;, a name that will feel familiar to folks using STAC. 
A Collection only caches the COG file headers; it does not cache overviews or image data tiles.</p><p>We decided to go with <a href="https://stac-utils.github.io/stac-geoparquet/latest/?ref=blog.terrafloww.com" rel="noreferrer">STAC-GeoParquet </a>as its base, and extend it with &quot;per-band-metadata&quot; columns.</p><p><em>&quot;For example, a Landsat Rasteret Collection is an exact copy of the </em><a href="https://landsatlook.usgs.gov/stac-server/collections/landsat-c2l2-sr/?ref=blog.terrafloww.com" rel="noreferrer"><em>landsat-c2l2-sr</em></a><em> STAC Collection, with additional columns in the GeoParquet.</em> If you wish to know more, check out our first <a href="https://blog.terrafloww.com/efficient-cloud-native-raster-data-access-an-alternative-to-rasterio-gdal/" rel="noreferrer">blog</a>.</p><h3 id="what-does-this-cost">What does this cost?</h3><p></p><p>In this blog we focus specifically on paid datasets like Landsat on AWS to emphasise cost savings; free STAC endpoints like Earth Search&apos;s <a href="https://earth-search.aws.element84.com/v1/collections/sentinel-2-l2a/?ref=blog.terrafloww.com" rel="noreferrer">sentinel-2-l2a</a> can also be cached with Rasteret.</p><p><strong>Global Landsat Rasteret Collection (1 year):</strong></p><ul><li><strong>Time:</strong> ~30 minutes</li><li><strong>Cost:</strong> $1.8 (AWS S3 GET requests).</li></ul><p><strong>Regional Rasteret Collection (e.g., Karnataka, India, 1 year):</strong></p><ul><li><strong>Time:</strong> ~9 seconds</li><li><strong>Cost:</strong> Negligible.</li></ul><p>As of now, the Rasteret library defaults Landsat to the <a href="https://landsatlook.usgs.gov/stac-server/collections/landsat-c2l2-sr/?ref=blog.terrafloww.com" rel="noreferrer">landsat-c2l2-sr</a> collection.</p><p>Creating such Collections means you <strong>pay the cost once, upfront</strong>. 
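</p><p>To make those extra columns concrete, here is a rough sketch of what one cached row can carry. The column names and values below are illustrative only, not Rasteret&apos;s actual schema:</p><pre><code class="language-python"># Illustrative only: rough shape of one row in an extended
# STAC-GeoParquet Collection (column names are hypothetical)
row = {
    # standard STAC-GeoParquet fields
    "id": "LC09_L2SP_144051_20240101",
    "datetime": "2024-01-01T05:12:33Z",
    "eo:cloud_cover": 12.5,
    # extra per-band COG header metadata, cached once
    "red.tile_offsets": [8503, 91231, 184991],      # byte offset of each internal tile
    "red.tile_byte_counts": [82728, 93760, 88112],  # compressed size of each tile
    "red.dtype": "uint16",
    "red.compression": "deflate",
}

# With these columns cached, a reader can issue HTTP Range requests
# directly, skipping the header round-trips GDAL makes per new file
start = row["red.tile_offsets"][0]
end = start + row["red.tile_byte_counts"][0] - 1  # inclusive byte range
print((start, end))  # (8503, 91230)</code></pre><p>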
More on costs below.</p><h3 id="why-we-built-this-library">Why We Built This Library</h3><p></p><p><strong><u>Intro</u></strong></p><p>You can skip reading this intro section if you know the inner workings of Rasterio/GDAL.</p><p><strong>The Hidden Performance Bottleneck</strong>: For Rasterio, and the GDAL functions behind it, getting image data involves quite a few rounds of metadata retrieval. Here&apos;s what happens behind the scenes:</p><ol><li><strong>First HTTP Request</strong>: Fetch file header<ul><li>Retrieves critical metadata like CRS, Transform</li><li>Caches this information in-memory or on disk using GDAL&apos;s cache</li></ul></li><li><strong>Additional Requests</strong>: Grab COG overviews<ul><li>Useful for quick visualizations</li><li>But... it means more HTTP requests to the same file </li></ul></li></ol><p>Then, when you call -</p><pre><code class="language-python">from rasterio.mask import mask

# src: an open rasterio dataset; geometry: a GeoJSON-like mapping
data, transform = mask(src, [geometry], crop=True)</code></pre><p>Rasterio now uses its cached info in GDAL to:</p><ul><li>Determine which image tile is in which byte range</li><li>Make the final HTTP requests to fetch the numpy array</li></ul><p><strong>The Real Cost</strong>: Between 3 and 6 HTTP requests per .tif file, every time you start a new Python environment.</p><h4 id="effect-of-new-environments-on-gdal-cache"><u>Effect of New Environments on GDAL Cache</u></h4><p>It is important to note that the GDAL cache helps reduce HTTP requests to COG files if your code is re-run <strong>in the</strong> <strong>same environment/session</strong>.</p><p>Where the GDAL cache remains -</p><ul><li>Rerunning the same &apos;.py&apos; file</li><li>Consistent Jupyter notebook sessions</li><li>Long-running Cloud VMs</li><li>Local Python environments on your laptop</li></ul><p>Where the GDAL cache is lost/non-existent -</p><ul><li>Laptop or VM restarts</li><li>Cloud-native workflows with scaling VMs</li><li>Hosted Jupyter notebook kernels or VMs restarting</li></ul><h4 id="era-of-cloud-native-geospatial-workflows"><u>Era of cloud native geospatial workflows</u></h4><p>In today&apos;s cloud-based data analysis, we leverage cloud elasticity like never before:</p><ul><li>Multiple compute machines (VMs)</li><li>Serverless platforms (AWS Lambda, GCP Cloud Run)</li><li>Kubernetes orchestrators (Airflow, Kubeflow, Flyte)</li></ul><p><strong>The Catch</strong>: Each of these creates a <strong>NEW</strong> Python environment.</p><p><strong>Translation</strong>: Rasterio must re-fetch metadata for EVERY COG file, every single time.</p><p>With workflows that depend on thousands of files and highly parallel tasks, this inefficiency can quickly escalate.</p><blockquote>Rasteret addresses these limitations and ensures consistent performance and cost efficiency, even in distributed and ephemeral cloud environments.</blockquote><h2 id="cost-and-time-analysisrasteret-vs-rasteriogdal"><u>Cost and Time 
Analysis - Rasteret vs Rasterio/GDAL</u></h2><p></p><p>Let&#x2019;s break down the cost and time differences for a hypothetical project analyzing <strong>100,000 farms.</strong> The project needs time-series indices of various kinds, like NDVI, NDMI, etc.</p><p>Here&#x2019;s the scenario:</p><ul><li><strong>Scenes:</strong> 2 Landsat scenes covering all farms.</li><li><strong>Dates per scene:</strong> 45.</li><li><strong>Bands to be read per date:</strong> 4.</li><li><strong>Total files:</strong> 2 &#xD7; 45 &#xD7; 4 = 360 COG files.</li><li><strong>Files worth processing (due to cloud cover):</strong> 200 COG files.</li></ul><h4 id="workflow">Workflow</h4><ul><li><strong>Rasteret</strong> caches metadata for the 200 files in its Collection, and it&apos;s used across 100 parallel processes (dockers/lambdas).</li><li><strong>Rasterio/GDAL</strong> reads headers repeatedly in new environments due to the lack of a shared cache.</li></ul><figure class="kg-card kg-image-card kg-width-wide"><img src="https://blog.terrafloww.com/content/images/2025/01/actual_analysis_time.png" class="kg-image" alt="Rasteret: A Library for Faster and Cheaper Open Satellite Data Access" loading="lazy" width="762" height="448" srcset="https://blog.terrafloww.com/content/images/size/w600/2025/01/actual_analysis_time.png 600w, https://blog.terrafloww.com/content/images/2025/01/actual_analysis_time.png 762w"></figure><figure class="kg-card kg-image-card kg-width-wide"><img src="https://blog.terrafloww.com/content/images/2025/01/aws_service_wise_costs.png" class="kg-image" alt="Rasteret: A Library for Faster and Cheaper Open Satellite Data Access" loading="lazy" width="762" height="448" srcset="https://blog.terrafloww.com/content/images/size/w600/2025/01/aws_service_wise_costs.png 600w, https://blog.terrafloww.com/content/images/2025/01/aws_service_wise_costs.png 762w"></figure><figure class="kg-card kg-image-card kg-width-wide"><img src="https://blog.terrafloww.com/content/images/2025/01/total_vm_hours.png" 
class="kg-image" alt="Rasteret: A Library for Faster and Cheaper Open Satellite Data Access" loading="lazy" width="762" height="448" srcset="https://blog.terrafloww.com/content/images/size/w600/2025/01/total_vm_hours.png 600w, https://blog.terrafloww.com/content/images/2025/01/total_vm_hours.png 762w"></figure><h4 id="s3-get-costs-for-header-reads"><strong>S3 GET Costs for Header Reads</strong></h4><ul><li><strong>Rasteret:</strong><ul><li>Cache headers once: 200 requests (1 per file) &#xD7; $0.0004/1000 = <strong>$0.00008</strong>.</li></ul></li><li><strong>Rasterio/GDAL:</strong><ul><li>Headers read 100 times (100 processes): 100 &#xD7; $0.00008 = <strong>$0.008</strong>.</li></ul></li></ul><h4 id="s3-get-costs-for-image-tile-reads">S3 GET Costs for Image Tile Reads</h4><ul><li><strong>Both Rasteret and Rasterio:</strong><ul><li>100,000 farms &#xD7; 2 image tiles per farm &#xD7; 200 files &#xD7; $0.0004/1000 = <strong>$16</strong>.</li></ul></li></ul><h4 id="total-s3-costs">Total S3 Costs</h4><ul><li><strong>Rasterio:</strong> $16 + $0.008 = <strong>$16.008</strong>.</li><li><strong>Rasteret:</strong> $16 + $0.00008 = <strong>$16</strong>.</li></ul><h4 id="ec2-costs">EC2 Costs</h4><p>Assume t3.xlarge AWS Spot instances at $0.006 per hour:</p><ul><li><strong>Rasterio Processing Time:</strong><ul><li>Files processed per second: 1.8.</li><li>Total VM time: 200 files / 1.8 &#xD7; 100,000 farms = 11,100,000 seconds.</li><li>Time for 100 parallel processes = 11,100,000/100/3600 = ~31 hours</li><li>EC2 cost: 11,100,000 / 3600 &#xD7; $0.006 = <strong>$18</strong>.</li></ul></li><li><strong>Rasteret Processing Time:</strong><ul><li>Files processed per second: 11.</li><li>Total VM time: 200 files / 11 &#xD7; 100,000 farms = 1,818,000 seconds.</li><li>Time for 100 parallel processes = 1,818,000/100/3600 = ~5 hours</li><li>EC2 cost: 1,818,000 / 3600 &#xD7; $0.006 = <strong>$3</strong>.</li></ul></li></ul><h4 id="total-costs">Total Costs</h4><ul><li><strong>Rasterio:</strong> 
$16.008 (S3) + $18 (EC2) = <strong>$34.008</strong>.</li><li><strong>Rasteret:</strong> $16 (S3) + $3 (EC2) = <strong>$19</strong>.</li></ul><h3 id="scalability-and-repeatability">Scalability and Repeatability</h3><p>By running more such workflows in parallel, doing EDA on the same or different sets of farms, the long-term implication is that with Rasterio, every run without a GDAL cache repeats the header reads, so costs rise:</p><ul><li><strong>1,000 environments:</strong> $34.008 + $0.08 = <strong>$34.088</strong>.</li><li><strong>10,000 environments:</strong> $34.008 + $0.8 = <strong>$34.808</strong>.</li><li><strong>100,000 environments:</strong> $34.008 + $8 = <strong>$42.008</strong>.</li></ul><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">&#x1F440;</div><div class="kg-callout-text"><i><em class="italic" style="white-space: pre-wrap;">These &apos;</em></i><i><b><strong class="italic" style="white-space: pre-wrap;">new environments&apos; count</strong></b></i><i><em class="italic" style="white-space: pre-wrap;"> can rack up quite quickly across an organisation, like each employee&apos;s laptop, each project&apos;s Kubernetes cluster, Jupyter Lab with its kernels restarting, VMs restarting, and more such scenarios</em></i>.</div></div><p>Rasteret&#x2019;s cached metadata in its Collection ensures consistent costs across these scenarios.</p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://blog.terrafloww.com/content/images/2025/01/env_scaling_costs.png" class="kg-image" alt="Rasteret: A Library for Faster and Cheaper Open Satellite Data Access" loading="lazy" width="762" height="448" srcset="https://blog.terrafloww.com/content/images/size/w600/2025/01/env_scaling_costs.png 600w, https://blog.terrafloww.com/content/images/2025/01/env_scaling_costs.png 762w"></figure><h3 id="conclusion">Conclusion</h3><p>Rasteret is not intended to replace GDAL or Rasterio, which are amazingly feature-rich libraries. 
Rather, it is focused on addressing a critical aspect of making cloud-native satellite imagery workflows faster and cheaper.</p><p>This is an early release of the library; we are sharing it because its performance and efficiency look promising, and we want eyes on it early. Do give us your feedback and ideas as issues or PRs.</p><p>There are lots of improvements to be made: better deployment actions, clear release notes, more test coverage, support for Python 3.12 and 3.13, support for S3-based Rasteret Collections, and more.</p><p>In parallel, we plan on adding features too; one of the first that we think will be useful is a PyTorch <a href="https://torchgeo.readthedocs.io/en/latest/?ref=blog.terrafloww.com" rel="noreferrer">TorchGeo</a>-compatible on-the-fly <a href="https://torchgeo.readthedocs.io/en/stable/tutorials/custom_raster_dataset.html?ref=blog.terrafloww.com" rel="noreferrer">Dataset</a> creator, for which there&apos;s an open <a href="https://github.com/terrafloww/rasteret/issues/1?ref=blog.terrafloww.com" rel="noreferrer">issue</a> as well. 
Do share your thoughts there.</p><p>Do try it out in a <strong>Python 3.11</strong> environment - </p><pre><code>uv pip install rasteret</code></pre><div class="kg-card kg-callout-card kg-callout-card-green"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">Check out the Quick Start on <a href="https://github.com/terrafloww/rasteret?tab=readme-ov-file&amp;ref=blog.terrafloww.com#%EF%B8%8F-prerequisites" rel="noreferrer">GitHub</a></div></div><p>If you like it, do give us a &#x2B50; on GitHub; if you have already installed it, do upgrade.</p><p>Thanks for reading till the end, and we wish you a wonderful year ahead!</p><hr><p></p><p></p><h3 id="acknowledgements">Acknowledgements</h3><p>This work builds on the contributions of giants in the open-source geospatial community:</p><ul><li>GDAL and Rasterio, for pioneering geospatial data access</li><li>The Cloud Native Geospatial members for STAC-geoparquet and COG specifications</li><li>PyArrow, GeoArrow for efficient GeoParquet filtering</li><li>The broader open-source geospatial community</li></ul><p>We&apos;re grateful for the efforts and contributions of these projects and communities. 
Their dedication, expertise, and willingness to share knowledge have laid the foundation for approaches like the one outlined here.</p><hr><p><strong>Terrafloww is proud to sponsor the Cloud Native Geospatial forum as a small startup member</strong></p><figure class="kg-card kg-image-card"><img src="https://blog.terrafloww.com/content/images/2025/07/cnglogo.png" class="kg-image" alt="Rasteret: A Library for Faster and Cheaper Open Satellite Data Access" loading="lazy" width="1207" height="631" srcset="https://blog.terrafloww.com/content/images/size/w600/2025/07/cnglogo.png 600w, https://blog.terrafloww.com/content/images/size/w1000/2025/07/cnglogo.png 1000w, https://blog.terrafloww.com/content/images/2025/07/cnglogo.png 1207w" sizes="(min-width: 720px) 720px"></figure>]]></content:encoded></item><item><title><![CDATA[Efficient Cloud-Native Raster Data Access: An Alternative to Rasterio/GDAL]]></title><description><![CDATA[A journey of optimization of cloud-based geospatial data processing.
]]></description><link>https://blog.terrafloww.com/efficient-cloud-native-raster-data-access-an-alternative-to-rasterio-gdal/</link><guid isPermaLink="false">674aac1b38ed0319e870f017</guid><dc:creator><![CDATA[Aditya Giri]]></dc:creator><pubDate>Sat, 30 Nov 2024 07:51:43 GMT</pubDate><media:content url="https://blog.terrafloww.com/content/images/2024/12/media-gallery-banner.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://blog.terrafloww.com/content/images/2024/12/media-gallery-banner.jpg" alt="Efficient Cloud-Native Raster Data Access: An Alternative to Rasterio/GDAL"><p>The rapid growth of Earth observation data in cloud storage, which will keep growing exponentially as companies like SpaceX drive down rocket launch prices, has pushed us to rethink how we access and analyze satellite imagery. With major space agencies like ESA and NASA adopting Cloud-Optimized GeoTIFFs (COGs) as their standard format, we&apos;re seeing unprecedented volumes of data becoming available through public cloud buckets. </p><p>This accessibility brings new challenges around efficient data access. In this article, we introduce an alternative approach to cloud-based raster data access, building upon the foundational work of GDAL and Rasterio. </p><hr><h3 id="the-evolution-of-raster-storage">The Evolution of Raster Storage</h3><p>Traditional GeoTIFF files weren&apos;t designed with cloud storage in mind. Reading these files often required downloading entire datasets, even when only a small portion was needed. 
</p><p>The introduction of COGs marked a significant shift, enabling efficient partial reads through HTTP range requests.</p><p>COGs achieve this efficiency through their internal structure:</p><ul><li>An initial header containing the Image File Directory (IFD)</li><li>Tiled organization of the actual image data</li><li>Overview levels for multi-resolution access</li><li>Strategic placement of metadata for minimal initial reads</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.terrafloww.com/content/images/2024/11/image-4.png" class="kg-image" alt="Efficient Cloud-Native Raster Data Access: An Alternative to Rasterio/GDAL" loading="lazy" width="611" height="991" srcset="https://blog.terrafloww.com/content/images/size/w600/2024/11/image-4.png 600w, https://blog.terrafloww.com/content/images/2024/11/image-4.png 611w"><figcaption><span style="white-space: pre-wrap;">COG File Data Composition, courtesy of </span><a href="https://developers.planet.com/docs/planetschool/an-introduction-to-cloud-optimized-geotiffs-cogs-part-1-overview/?ref=blog.terrafloww.com" rel="noreferrer"><span style="white-space: pre-wrap;">Planet Labs blog</span></a></figcaption></figure><p>This structure allows tools to read specific portions of the file without downloading the entire dataset. </p><p>However, we feel that even with COGs, something like a time-series NDVI graph for a few polygons across a few regions is not as fast as it could be, especially given the latency caused by AWS S3 throttling.</p><h3 id="the-stac-ecosystem-and-geoparquet">The STAC Ecosystem and GeoParquet</h3><p>The SpatioTemporal Asset Catalog (STAC) specification has emerged as a crucial tool for discovering and accessing Earth observation data. 
While STAC APIs provide standardized ways to query satellite imagery, the Cloud Native Geospatial (CNG) community took this further by developing STAC GeoParquet.</p><p>STAC GeoParquet leverages Parquet&apos;s columnar format to enable efficient querying of STAC metadata. The columnar structure allows for:</p><ul><li>Filter pushdown for spatial and temporal fields</li><li>Efficient compression of repeated values</li><li>Reduced I/O through column pruning</li><li>Fast parallel processing capabilities with the right parquet reading libraries</li></ul><h3 id="current-access-patterns-and-challenges">Current access patterns and challenges</h3><p>The current approach to accessing COGs, using GDAL and Rasterio, typically involves:</p><ol><li>An initial GET request to read the file header</li><li>Additional requests, if the header was not fully read by the initial request</li><li>Final requests to read the actual data tiles</li></ol><p>For cloud-hosted public datasets, this pattern can lead to:</p><ul><li>Multiple HTTP requests per file access</li><li>Increased latency, especially across cloud regions</li><li>Potential throttling on public buckets</li><li>Higher costs from numerous small requests to paid buckets</li></ul><h3 id="a-new-approach-extending-stac-geoparquet-and-byte-range-calculations">A New Approach: Extending STAC GeoParquet and Byte-range calculations<br></h3><h4 id="extending-stac-geoparquet-with-new-columns"><u>Extending stac-geoparquet with new columns</u></h4><p>Building upon the excellent stac-geoparquet, we explored adding some of each COG&#x2019;s internal metadata directly into it:</p><ul><li>Tile offset info</li><li>Tile size info</li><li>Data type</li><li>Compression info</li></ul><p>We run a batch process, paying the cost and time to gather COG file metadata upfront.<br>In our case, we took 1 year&apos;s worth of Sentinel-2 items from the STAC API.</p><h4 id="get-data-from-urls-by-calculating-byte-ranges-just-in-time"><u>Get data from 
URLs by calculating byte ranges just-in-time</u></h4><p>Now that we have 1 year&apos;s worth of STAC items in GeoParquet along with each COG&apos;s internal metadata, we can do what GDAL does behind the scenes without querying the headers of COG files again and again.</p><p>We use the tile offset and tile size info to calculate the exact byte range to read for each AOI, for each COG URL in the STAC items.<br>We then use the Python <em>requests</em> module to fetch the data from the COG URLs, decompress the incoming bytes (the data is stored as deflate-compressed COGs), and create the <em>NumPy</em> array.</p><p>Below is the current approach to reading a remote COG URL using Rasterio.</p><figure class="kg-card kg-code-card"><pre><code class="language-python"># Current approach with GDAL/rasterio requires
# multiple http requests per file behind the scenes

# find scenes from STAC APIs:

import rasterio
from rasterio.mask import mask
from pystac_client import Client
from shapely.geometry import Polygon, mapping
from datetime import date

AOI_POLYGON = Polygon([(77.5, 13.0), (77.55, 13.0),
                (77.55, 13.05), (77.5, 13.05), (77.5, 13.0)])

# example 1-year search window
start_date, end_date = date(2024, 1, 1), date(2024, 12, 31)

client = Client.open(&quot;https://earth-search.aws.element84.com/v1&quot;)
search = client.search(
    collections=[&quot;sentinel-2-l2a&quot;],
    datetime=f&quot;{start_date.isoformat()}/{end_date.isoformat()}&quot;,
    intersects=mapping(AOI_POLYGON),
    query={&quot;eo:cloud_cover&quot;: {&quot;lt&quot;: 20}},
)

items = list(search.items())

for item in items:
    # First request fetches the IFD and headers of the file
    src = rasterio.open(item.assets[&apos;red&apos;].href)
    # A second or third request may happen behind the scenes in GDAL
    # if the headers are not completely read in 1 HTTP request;
    # once read, the opened rasterio dataset is usable

    epsg = item.properties[&quot;proj:epsg&quot;]
    # wgs84_polygon_to_utm: helper (not shown) that reprojects
    # the AOI into the item&apos;s UTM CRS
    utm_polygon = wgs84_polygon_to_utm(AOI_POLYGON, epsg)
    geojson = utm_polygon.__geo_interface__

    # using src now, GDAL performs the byte-range requests
    # to get only those internal tiles of the COG that cover our AOI
    # and returns a numpy array and its transform
    data, transform = mask(src, [geojson], crop=True)
</code></pre><figcaption><p><span style="white-space: pre-wrap;">Current Approach</span></p></figcaption></figure><p>And below is our approach.</p><figure class="kg-card kg-code-card"><pre><code class="language-python">from pathlib import Path
from shapely.geometry import Polygon
import xarray as xr

from rasteret import Rasteret
from rasteret.constants import DataSources
from rasteret.core.utils import save_per_geometry

# AOI polygon used below for the STAC search and pixel retrieval
AOI_POLYGON = Polygon([(77.5, 13.0), (77.55, 13.0),
                (77.55, 13.05), (77.5, 13.05), (77.5, 13.0)])

def main():
    # 1. Setup workspace folder and parameters
    workspace_dir = Path.home() / &quot;rasteret_workspace&quot;
    workspace_dir.mkdir(exist_ok=True)

    # custom name for local STAC collection to be created
    custom_name = &quot;bangalore&quot;
    date_range = (&quot;2024-01-01&quot;, &quot;2024-03-31&quot;)
    data_source = DataSources.SENTINEL2

    # get bounds of polygon above for stac search
    bbox = AOI_POLYGON.bounds

    # instantiate Rasteret class
    processor = Rasteret(
        custom_name=custom_name,
        data_source=data_source,
        output_dir=workspace_dir,
        date_range=date_range,
    )

    # create a local STAC geoparquet with COG metadata added to it
    processor.create_collection(
        bbox=bbox,
        date_range=date_range,
        cloud_cover_lt=20,
    )

    # use the processor to get the image data for polygons and bands
    # as a multidimensional xarray dataset
    
    # this uses the local geoparquet created above to optimize HTTP
    # requests to multiple COG files (1 request per required tile)

    # bands, geometries can be changed here
    # date_range too can be changed but must be within the date range
    # available in local geoparquet
    
    # pySTAC&apos;s &quot;search&quot; filters, like &quot;platform&quot;,&quot;cloud_cover_lt&quot; and 
    # others can be passed here too.
    
    xarray_ds = processor.get_xarray(
        geometries=[AOI_POLYGON],
        bands=[&quot;B4&quot;, &quot;B8&quot;],
        cloud_cover_lt=20,
        date_range=date_range,
    )


if __name__ == &quot;__main__&quot;:
    main()
</code></pre><figcaption><p><span style="white-space: pre-wrap;">Our New Approach</span></p></figcaption></figure><h3 id="performance-insights">Performance Insights</h3><p>Initial benchmarks show promising results for this approach. The aim was to improve <strong>time-to-first-tile.</strong></p><p><em><u>There are a few things to note before we get to some raster query speed tests -</u></em></p><p>It is important to tune GDAL configurations correctly to fully benefit from GDAL&#x2019;s own multi-range requests, and to use settings that avoid GDAL listing all S3 files, which wastes time and adds more S3 GET requests.</p><p>Rasterio with GDAL&#x2019;s virtual cache gets faster on subsequent reads of the <strong>same</strong> COG S3 file, but if the analysis spans time and space, we pay the latency cost of fetching headers for every new file that is not cached.</p><p>This is also true across new VMs/Lambdas/serverless compute: a new Rasterio &quot;Session/Environment&quot; is created for each run of your Python code in a new Python environment, so scaling out to multiple VMs means these new environments keep sending HTTP requests to every COG file they interact with, even ones used in an earlier session, just to complete the rasterio.mask task.</p><blockquote>Our approach of adding new columns to STAC GeoParquet avoids paying this time and cost of multiple HTTP requests for each COG file in each new Python process.</blockquote><p>We have written custom Python code for byte-range calculation, based on GDAL&#x2019;s C++ approach. 
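</p><p>A minimal sketch of that calculation (simplified; it assumes a row-major tile grid and ignores partial edge tiles, which the real code must handle):</p><pre><code class="language-python">def tile_byte_ranges(tile_offsets, tile_byte_counts, tiles_across,
                     col_min, col_max, row_min, row_max):
    """Byte ranges for the internal COG tiles covering an AOI window.

    tile_offsets / tile_byte_counts come from the COG's IFD, already
    cached in the extended STAC-GeoParquet, so no header request is
    needed. Each (start, end) is inclusive, ready for an HTTP
    'Range: bytes=start-end' header; the deflate-compressed tile
    bytes can then be decompressed with zlib.
    """
    ranges = []
    for row in range(row_min, row_max + 1):
        for col in range(col_min, col_max + 1):
            idx = row * tiles_across + col  # row-major tile index
            start = tile_offsets[idx]
            ranges.append((start, start + tile_byte_counts[idx] - 1))
    return ranges

# Toy example: a 4x4 tile grid with fake offsets and sizes
offsets = [1000 + i * 500 for i in range(16)]
counts = [500] * 16
print(tile_byte_ranges(offsets, counts, 4, 1, 2, 1, 2))
# [(3500, 3999), (4000, 4499), (5500, 5999), (6000, 6499)]</code></pre><p>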
We also created custom tile-merging code, in case 1 AOI intersects more than 1 internal tile of a COG.<br>This keeps our library pretty lightweight, with no GDAL dependencies except for the geometry_mask/rasterize functions.</p><p><strong><em><u>With all this context set, below are some initial test results -</u></em></strong></p><p><strong>Machine config - 2 CPU (4 threads), 2GB RAM machine</strong></p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-text"><i><em class="italic" style="white-space: pre-wrap;">Test scenario: Processing 20 Sentinel-2 scenes for NDVI calculation over a year for a single farmland in India, </em></i>with async Python functions</div></div><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.terrafloww.com/content/images/2024/12/rasteret_total.png" class="kg-image" alt="Efficient Cloud-Native Raster Data Access: An Alternative to Rasterio/GDAL" loading="lazy" width="590" height="490"><figcaption><span style="white-space: pre-wrap;">Total time taken - STAC filter + 40 TIF file queries + NDVI calculation</span></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.terrafloww.com/content/images/2024/12/rasteret_stac-1.png" class="kg-image" alt="Efficient Cloud-Native Raster Data Access: An Alternative to Rasterio/GDAL" loading="lazy" width="590" height="490"><figcaption><span style="white-space: pre-wrap;">Time for filtering 1 year of STAC items with AOI and Cloud filter</span></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.terrafloww.com/content/images/2024/12/rasteret_mem.png" class="kg-image" alt="Efficient Cloud-Native Raster Data Access: An Alternative to Rasterio/GDAL" loading="lazy" width="590" height="490"><figcaption><span style="white-space: pre-wrap;">Memory Usage for whole process</span></figcaption></figure><h4 
id="key-factors-to-speed-up-in-our-approach">Key Factors to speed up in our approach:</h4><ul><li>Reduced HTTP requests through pre-calculating byte ranges just-in-time for each AOI for each COG file</li><li>Pre-cached metadata in STAC-GeoParquet eliminating API calls to STAC API JSON endpoints</li><li>Optimized parallel processing for spatio-temporal analysis</li></ul><h3 id="current-scope-and-limitations">Current Scope and Limitations</h3><p>While these initial results are encouraging, it&apos;s important to note where this approach currently works best:</p><p><strong>Optimal Use Cases:</strong></p><ul><li>Time-series analysis of Sentinel 2 data across multiple small polygons</li><li>Optimize paid public bucket data access, to reduce GET requests costs</li></ul><p><strong>Areas of improvement:</strong></p><ul><li>Pure Python or Rust implementations of operations like rasterio.mask</li><li>Adding more data sources like USGS Landsat and others</li><li>LRU or other cache for repeated same tile queries</li><li>Reducing memory usage</li><li>Benchmark against Xarray and Dask workloads</li><li>Test on multiple polygons across the world for 1 year date range</li></ul><hr><h3 id="what-next">What next:</h3><p>As we continue to develop and refine this approach, we&apos;re excited to engage with the geospatial community to gather feedback, insights, and contributions. By collaborating and building upon each other&apos;s work, we can collectively push the boundaries of what&apos;s possible with cloud-based raster data access and analysis.</p><p>We&apos;re currently working on an <strong>open-source</strong> <strong>library</strong> which will be called &#x201C;<strong><em>Rasteret</em></strong>&#x201D; that implements these techniques, and we look forward to sharing the library and more technical details in an upcoming deep dive blog. 
Stay tuned!</p><hr><h3 id="acknowledgments">Acknowledgments</h3><p>This work stands on the shoulders of giants in the open-source geospatial community:</p><ul><li>GDAL and Rasterio, for pioneering geospatial data access</li><li>The Cloud Native Geospatial community for the STAC and COG specifications</li><li>PyArrow and GeoArrow for efficient Parquet filtering</li><li>The broader open-source geospatial community</li></ul><p>We&apos;re grateful for the tireless efforts and contributions of these projects and communities. Their dedication, expertise, and willingness to share knowledge have laid the foundation for approaches like the one outlined here.</p><hr><p>Terrafloww is proud to support the Cloud Native Geospatial forum by being a newbie startup member in their large established community.</p>
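<p><em>A footnote for the curious:</em> the byte-range calculation and tile-merging ideas described above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not Rasteret&apos;s actual code; the function names are hypothetical, and it assumes a COG with a row-major grid of fixed-size internal tiles whose byte offsets and sizes are already cached (e.g. as columns in STAC-GeoParquet).</p>

```python
# Sketch: which internal COG tiles does a pixel window touch, and how can
# their byte ranges be coalesced into fewer HTTP range requests?
# Hypothetical helpers; real offsets/sizes would come from the cached table.

def intersecting_tiles(window, tile_size, tiles_across):
    """Flat indices of tiles overlapped by a (col_off, row_off, width, height) window."""
    col_off, row_off, width, height = window
    first_col, last_col = col_off // tile_size, (col_off + width - 1) // tile_size
    first_row, last_row = row_off // tile_size, (row_off + height - 1) // tile_size
    return [r * tiles_across + c
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]

def merge_ranges(ranges, max_gap=0):
    """Coalesce (offset, nbytes) pairs that touch or sit within max_gap bytes,
    so one HTTP Range header can cover several internal tiles."""
    merged = []
    for off, cnt in sorted(ranges):
        if merged and off <= merged[-1][0] + merged[-1][1] + max_gap:
            end = max(merged[-1][0] + merged[-1][1], off + cnt)
            merged[-1] = (merged[-1][0], end - merged[-1][0])
        else:
            merged.append((off, cnt))
    return merged

# Example: a 600x300 window starting at pixel (300, 100), in a COG with
# 256x256 tiles and 43 tiles per row, touches rows 0-1 and columns 1-3.
print(intersecting_tiles((300, 100, 600, 300), 256, 43))   # [1, 2, 3, 44, 45, 46]
print(merge_ranges([(500, 10), (0, 100), (100, 50)]))      # [(0, 150), (500, 10)]
```

<p>Each merged tuple then maps to a single <code>Range: bytes=offset-(offset+nbytes-1)</code> request, which is how a handful of GETs can replace one request per tile.</p>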
<!--kg-card-begin: html-->
<div style="display: flex; justify-content: center; align-items: center; width: 100%; height: 100%;">
    <a href="https://cloudnativegeo.org/?ref=blog.terrafloww.com" style="text-decoration: none;">

  <svg id="Layer_2" data-name="Layer 2" style="width: 200px;" xmlns="http://www.w3.org/2000/svg" viewbox="0 0 702.61 311.75">
    <defs>
      <style>
        .cls-1 {
          fill: #2126f7;
          stroke-width: 0px;
        }
      </style>
    </defs>
    <path class="cls-1" d="M415.02,208.9c-21.69,0-28.11-10.42-28.11-47.87s6.42-48.46,28.11-48.46c14.26,0,19.28,4.4,21.89,19.03h25.1c-1.2-31.45-11.85-40.86-46.99-40.86-41.58,0-53.82,16.02-53.82,70.29s12.26,69.7,53.82,69.7c35.34,0,46.39-9.01,48.2-39.66h-25.1c-2.21,13.82-7.43,17.83-23.09,17.83Z"/>
    <polygon class="cls-1" points="557.1 171.06 559.9 199.49 559.3 199.49 510.43 92.75 479.19 92.75 479.19 228.93 503.83 228.93 503.83 150.63 501.02 122.19 501.63 122.19 550.49 228.94 581.73 228.94 581.73 92.76 557.1 92.76 557.1 171.06"/>
    <path class="cls-1" d="M649.66,174.05h28.28v30.65c-6.82,3-11.84,4.01-22.26,4.01-23.67,0-30.69-9.8-30.69-47.26s6.41-48.67,28.08-48.67c17.85,0,23.47,4.01,24.67,17.63h24.87c-1.61-30.65-13.04-39.66-49.74-39.66-40.92,0-53.15,16.02-53.15,70.09s12.83,69.9,56.56,69.9c17.65,0,28.08-2.41,45.93-10.62v-67.89h-52.54v21.83Z"/>
    <path class="cls-1" d="M291.6,85.81L189.84,2.36c-1.86-1.53-4.2-2.36-6.61-2.36h-35.76c-1.82,0-3.62.4-5.27,1.16L24.09,55.73c-4.5,2.08-7.59,6.37-8.14,11.3L.1,209.11c-.59,5.25,1.52,10.45,5.61,13.8l105.25,86.47c1.87,1.53,4.2,2.37,6.62,2.37h32.16c1.82,0,3.63-.4,5.28-1.16l120.56-55.85c4.22-1.96,7.11-5.98,7.63-10.6l15.57-140.72c.74-6.7-1.96-13.32-7.17-17.6ZM149.8,226.17c-69.08-19.75-67.08-121.75-.64-141.69,33.56,9.11,51.71,39.48,51.42,71.33-.27,30.38-18.2,61.67-50.78,70.36Z"/>
  </svg>
      </a>
</div>

<!--kg-card-end: html-->
]]></content:encoded></item></channel></rss>