<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Terrafloww Labs]]></title><description><![CDATA[Next Gen Data Infrastructure]]></description><link>https://blog.terrafloww.com/</link><image><url>https://blog.terrafloww.com/favicon.png</url><title>Terrafloww Labs</title><link>https://blog.terrafloww.com/</link></image><generator>Ghost 5.88</generator><lastBuildDate>Sat, 28 Feb 2026 04:45:10 GMT</lastBuildDate><atom:link href="https://blog.terrafloww.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Rasteret 0.3 : EO image datasets should be tables, not folders]]></title><description><![CDATA[Index COG metadata once into Parquet, skip cold starts forever. Filter with DuckDB, train with TorchGeo, fetch pixels on demand. 20x faster runs]]></description><link>https://blog.terrafloww.com/eo-datasets-should-be-tables-not-folders/</link><guid isPermaLink="false">69714bb468b4e2fbb0aacb8e</guid><category><![CDATA[Announcement]]></category><dc:creator><![CDATA[Sid]]></dc:creator><pubDate>Fri, 27 Feb 2026 12:39:16 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1451187580459-43490279c0fa?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDl8fGRhdGF8ZW58MHx8fHwxNzcyMTMzNTEzfDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<h2 id="intro">Intro</h2><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-text"><i><b><strong class="italic" style="white-space: pre-wrap;">&quot;Metadata is a love note to the future.&quot;</strong></b></i><b><strong style="white-space: pre-wrap;">, Jason Scott, Internet Archive</strong></b></div></div><img 
src="https://images.unsplash.com/photo-1451187580459-43490279c0fa?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDl8fGRhdGF8ZW58MHx8fHwxNzcyMTMzNTEzfDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" alt="Rasteret 0.3 : EO image datasets should be tables, not folders"><p>In the database world, this has been gospel for decades. <em>The catalog, the thing that tells you where data lives, what shape it has, how to read it, is always more valuable than any single query result.</em> Entire careers in data engineering exist because someone figured out that the description of the data matters more than the data itself, <em>think Iceberg.</em></p><blockquote>We believe in building systems that stay true to this: metadata is better than more data!</blockquote><p>We in the geospatial industry moved hundreds of petabytes of satellite imagery to the cloud. Then we kept re-parsing the <em>table of contents</em>. Millions of times. Every cold start.</p><p>If you&apos;ve trained on cloud-hosted imagery, you know the cold-start problem: your stack parses lots of metadata before it sees a single pixel, and you pay that cost in every restart of compute, every new environment in K8s, and more. You shrug your shoulders and say &quot;<em>eh, first run is always slow</em>&quot;. <a href="https://blog.terrafloww.com/efficient-cloud-native-raster-data-access-an-alternative-to-rasterio-gdal/" rel="noreferrer">We&apos;ve written about this before in depth.</a></p><p>Caching that metadata was step <a href="https://blog.terrafloww.com/rasteret-a-library-for-faster-and-cheaper-open-satellite-data-access/" rel="noreferrer">one</a> for us. </p><p>But caching alone doesn&apos;t change how you work with data; it just makes the same workflow faster.</p><blockquote class="kg-blockquote-alt">What changes things is when you start looking at that cache as a table you can work with. 
Filter it, join it, enrich it - then pair that with an I/O engine for object storage that gets you pixels on demand. That&apos;s the new Rasteret library!</blockquote><h2 id="introducing-rasteret-03">Introducing&#xA0;<strong>Rasteret 0.3</strong>&#xA0;:</h2><blockquote><strong>Index once. Work in tables. Stream pixels on demand.</strong></blockquote><p>At the heart of Rasteret is a&#xA0;<strong>Collection</strong>: a GeoParquet table with the properties of EO images, their URLs, and cached tile-level byte offsets for every band, so each row already knows exactly where to read its pixels. Your imagery stays in COGs. The Collection is what you reload, share, version, and query.</p><p>Here&apos;s how to build one:</p><pre><code class="language-python">import rasteret

collection = rasteret.build(
    &quot;earthsearch/sentinel-2-l2a&quot;,
    name=&quot;s2_bangalore&quot;,
    bbox=(77.5, 12.9, 77.7, 13.1),
    date_range=(&quot;2024-01-01&quot;, &quot;2024-06-30&quot;),
)
</code></pre><p><code>build()</code>&#xA0;picks from a catalog of 12 pre-registered sources (Sentinel-2, Landsat, NAIP, Copernicus DEM, AEF embeddings, and more) and creates the Collection you want. Don&apos;t see your dataset? Use&#xA0;<code>build_from_stac()</code>&#xA0;for any STAC API, or&#xA0;<code>build_from_table()</code>&#xA0;for any existing Parquet that has TIFF URLs in it.</p><h3 id="what-changes-in-practice">What changes in practice</h3><p>That&#xA0;<code>build()</code>&#xA0;call hides a big shift. You&apos;re not working with files anymore; you&apos;re working with rows in a Parquet table.</p><pre><code class="language-python">sub = collection.subset(cloud_cover_lt=15, date_range=(&quot;2024-03-01&quot;, &quot;2024-06-01&quot;))
</code></pre><p>Once it&apos;s a row, you can filter it, join it, version it. <em>Try doing that with files in a folder&#x1F440;</em></p><p>Append columns that you care about (<code>split</code>,&#xA0;<code>label</code>,&#xA0;<code>qa_flag</code>,&#xA0;<code>model_score</code>,&#xA0;<code>aoi_wkb</code>...), push down filters with DuckDB or PyArrow, materialize a new Parquet for an experiment, share it, and get the exact same subset later. Keep working in Parquet; play with the metadata as much as you want.</p><p>The selection logic you use lives next to the data references in Parquet. If you&apos;ve ever had&#xA0;<code>train_v7_final_final2/</code>&#xA0;folders of image chips lying around, you can feel why this matters.</p><h3 id="when-you-need-pixels-pick-the-output-format-that-fits">When you need pixels, pick the output format that fits:</h3><pre><code class="language-python"># NumPy arrays, lightweight, no extra deps
arr = collection.get_numpy(
    geometries=(77.55, 13.01, 77.58, 13.08),
    bands=[&quot;B04&quot;, &quot;B08&quot;],
)
# shape: [N, C, H, W]

# xarray Dataset, for analysis
ds = collection.get_xarray(geometries=bbox, bands=[&quot;B04&quot;, &quot;B08&quot;])
ndvi = (ds.B08.astype(float) - ds.B04.astype(float)) / (ds.B08 + ds.B04)

# TorchGeo GeoDataset, for ML training
dataset = collection.to_torchgeo_dataset(bands=[&quot;B04&quot;, &quot;B08&quot;], chip_size=256)
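# Aside: the same NDVI math works on the NumPy path too. The array layout
# is [N, C, H, W], so with bands=["B04", "B08"] channel 0 is red and
# channel 1 is NIR. Synthetic array below, purely to illustrate the axes:
import numpy as np

demo = np.stack(
    [np.full((2, 4, 4), 1000.0),   # channel 0: B04 (red)
     np.full((2, 4, 4), 3000.0)],  # channel 1: B08 (NIR)
    axis=1,
)                                  # demo.shape == (2, 2, 4, 4)
red, nir = demo[:, 0], demo[:, 1]
demo_ndvi = (nir - red) / (nir + red)   # 0.5 everywhere for these values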
</code></pre><p>All three paths share the same I/O engine: using cached TIFF metadata, it goes straight to fast reads (no header re-parsing, no GDAL), as explained in detail in previous blogs.&#xA0;<code>geometries</code>&#xA0;accepts bbox tuples, Arrow arrays, Shapely objects, or raw WKB; Arrow columns from GeoParquet are the fastest input.</p><h2 id="tag-patches-not-files">Tag patches, not files</h2><p>To make this more concrete, let&apos;s walk through a simple example.</p><p>Most EO workflows eventually produce semantics: &quot;these polygons are crop fields&quot;, &quot;these boxes contain ships&quot;, &quot;this tile has clouds&quot;, &quot;these 50k patches are hard negatives.&quot;</p><p>Traditionally, those semantics turn into more files: export chips, name them, keep the CSV that explains them, and pray you can reproduce the CSV-to-files join logic six weeks later.</p><p>With Rasteret + Parquet, you keep the semantics as rows and columns:</p><ol><li>Build (or import) a scene-level Collection once</li><li>Run a model/heuristic that emits detections as GeoParquet</li><li>Join detections -&gt; scenes to produce a&#xA0;<strong>patch-level Collection</strong>&#xA0;(one row per patch)</li><li>Train from that patch table directly, no chip exports, no folder structure. It&apos;s just metadata with your logic in it.</li></ol><pre><code class="language-python">import duckdb
import pyarrow.compute as pc
import pyarrow.parquet as pq
import rasteret

# load a collection already built before
scenes = rasteret.load(&quot;./s2_bangalore/&quot;)

# your ship detections parquet
detections = pq.read_table(&quot;./ship_detections.geoparquet&quot;)

con = duckdb.connect()
con.register(&quot;scenes&quot;, scenes.dataset.to_table())
con.register(&quot;det&quot;, detections)

# Join detections to scenes, geometry is now the patch polygon, not the scene footprint
patches = con.sql(&quot;&quot;&quot;
  SELECT det.patch_id AS id, scenes.datetime, det.geometry,
         scenes.assets, det.label, det.score, det.split
  FROM det JOIN scenes ON det.scene_id = scenes.id
  WHERE det.score &gt; 0.8
&quot;&quot;&quot;).fetch_arrow_table()

patch_collection = rasteret.build_from_table(
    patches, name=&quot;ship_patches&quot;, enrich_cog=True
)
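# Aside: the same join-and-filter pattern on tiny in-memory tables, so you
# can see the result without real detections. Toy data only; column names
# mirror the query above:
import duckdb
import pyarrow as pa

toy_scenes = pa.table({"id": ["s1", "s2"],
                       "datetime": ["2024-03-01", "2024-03-11"]})
toy_det = pa.table({"patch_id": [1, 2, 3],
                    "scene_id": ["s1", "s1", "s2"],
                    "score": [0.95, 0.40, 0.85]})
toy_con = duckdb.connect()
toy_con.register("toy_scenes", toy_scenes)
toy_con.register("toy_det", toy_det)
toy = toy_con.sql("""
  SELECT d.patch_id, s.datetime, d.score
  FROM toy_det d JOIN toy_scenes s ON d.scene_id = s.id
  WHERE d.score > 0.8
""").fetch_arrow_table()   # 2 rows survive the score filter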
</code></pre><p>Now&#xA0;<code>patch_collection</code>&#xA0;is a Collection whose&#xA0;<code>geometry</code>&#xA0;column represents&#xA0;<strong>patches</strong>, not scenes. Each row carries a label, a score, and a split - all filterable with Arrow, DuckDB, or pandas:</p><pre><code class="language-python"># Arrow-native filtering: high-confidence train patches
train = patch_collection.subset(split=&quot;train&quot;)
table = train.dataset.to_table(columns=[&quot;geometry&quot;, &quot;label&quot;, &quot;score&quot;])
mask = pc.greater(table.column(&quot;score&quot;), 0.9)
hard_patches = pc.filter(table.column(&quot;geometry&quot;), mask)

# Arrow WKB geometries go directly to get_numpy -- no Shapely, no Python loop
arr = train.get_numpy(geometries=hard_patches, bands=[&quot;B04&quot;, &quot;B08&quot;])
# arr.shape: [N, C, H, W]
</code></pre><blockquote><strong>Use Parquet to decide and experiment with <em>what</em>&#xA0;to train on. Use Rasteret to fetch the pixels for those rows, on demand, whenever needed.</strong></blockquote><h2 id="torchgeo-interop">TorchGeo interop</h2><p>Most teams already have samplers, transforms, and dataloaders built around TorchGeo. </p><p>Rasteret&apos;s <code>Collection.to_torchgeo_dataset()</code>&#xA0;returns a standard TorchGeo&#xA0;<code>GeoDataset</code>&#xA0;(requires TorchGeo &gt;= 0.9.0), so the rest of your stack stays the same. </p><p>Rasteret replaces the I/O backend (its own reader instead of rasterio/GDAL), but everything downstream of the GeoDataset is unchanged. </p><p>That same&#xA0;<code>patch_collection</code>? Here&apos;s what it looks like in TorchGeo.&#xA0;<code>label_field</code>&#xA0;is a Rasteret addition: it pulls labels from the Parquet column into each sample, so they flow through TorchGeo&apos;s&#xA0;<code>stack_samples</code>&#xA0;collate function into your batch:</p><pre><code class="language-python">from torch.utils.data import DataLoader
from torchgeo.datasets import stack_samples
from torchgeo.samplers import RandomGeoSampler

dataset = patch_collection.to_torchgeo_dataset(
    bands=[&quot;B04&quot;, &quot;B08&quot;],
    chip_size=256,
    split=&quot;train&quot;,
    label_field=&quot;label&quot;,     # each sample[&quot;label&quot;] comes from the Parquet column
)
sampler = RandomGeoSampler(dataset, size=256, length=1000)
loader = DataLoader(dataset, sampler=sampler, batch_size=8, collate_fn=stack_samples)

for batch in loader:
    images = batch[&quot;image&quot;]   # [B, C, H, W]
    labels = batch[&quot;label&quot;]   # [B]
    ...
</code></pre><h3 id="cold-start-benchmarks">Cold-start benchmarks</h3><p>Same AOIs, same scenes, same sampler, same DataLoader. Both paths output identical&#xA0;<code>[batch, T, C, H, W]</code>&#xA0;tensors. TorchGeo runs with its recommended GDAL settings for best-case remote COG performance.</p><table>
<thead>
<tr>
<th>Scenario</th>
<th style="text-align:right">TorchGeo (rasterio/GDAL)</th>
<th style="text-align:right">Rasteret</th>
<th style="text-align:right">Speedup</th>
</tr>
</thead>
<tbody>
<tr>
<td>Single AOI, 15 scenes</td>
<td style="text-align:right">9.08 s</td>
<td style="text-align:right">1.14 s</td>
<td style="text-align:right"><strong>8x</strong></td>
</tr>
<tr>
<td>Multi-AOI, 30 scenes</td>
<td style="text-align:right">42.05 s</td>
<td style="text-align:right">2.25 s</td>
<td style="text-align:right"><strong>19x</strong></td>
</tr>
<tr>
<td>Cross-CRS boundary, 12 scenes</td>
<td style="text-align:right">12.47 s</td>
<td style="text-align:right">0.59 s</td>
<td style="text-align:right"><strong>21x</strong></td>
</tr>
</tbody>
</table>
<blockquote>As for correctness of outputs: we treat rasterio/GDAL alignment as a&#xA0;<strong>correctness contract</strong>&#xA0;for supported inputs (tiled GeoTIFF/COGs) in our tests. And when Rasteret can&apos;t safely match the expected semantics, it fails loudly instead of letting bad images reach the GPU.</blockquote><h2 id="major-tom-without-the-image-in-parquet">Major-TOM without the image-in-Parquet</h2><p>Here&apos;s more evidence that Rasteret&apos;s design (logic + metadata + a byte-range-aware I/O engine) works, and is better for everyone.</p><p><a href="https://huggingface.co/datasets/Major-TOM/Core-S2L2A?ref=blog.terrafloww.com">Major TOM</a>&#xA0;is a large-scale Sentinel-2 training dataset published on Hugging Face. It has a two-layer Parquet design:&#xA0;<code>metadata.parquet</code>&#xA0;is a global index (2.2M rows with grid cells, product IDs, coordinates), while&#xA0;<code>images/*.parquet</code>&#xA0;stores <em>actual GeoTIFF image data directly inside Parquet</em> cells. That&apos;s the &quot;image-in-Parquet&quot; pattern.</p><p>We think there&apos;s a simpler and better option to achieve the same dataset semantics. The metadata already has everything you need to reconstruct patch geometries and fetch the original source pixels on demand:</p><pre><code class="language-python">import rasteret

# Build a Collection from Sentinel-2 scenes
collection = rasteret.build(&quot;earthsearch/sentinel-2-l2a&quot;,
                            name=&quot;major-tom-on-the-fly&quot;,
                            bbox=(-122.55, 37.65, -122.30, 37.90),
                            date_range=(&quot;2024-01-01&quot;, &quot;2024-02-01&quot;))

# Add Major TOM-style columns (grid cells, product IDs, deterministic splits)
# as Parquet columns. enrich_major_tom_columns is a user-defined helper in
# this sketch, not a Rasteret API
enriched = enrich_major_tom_columns(collection, grid_km=10)

# Bring your PyArrow table/dataset back into a Rasteret collection;
# this simply validates that the table still has the columns Rasteret internals care about
collection = rasteret.build_from_table(enriched, name=&quot;major-tom-on-the-fly&quot;)

# Fetch patches as NumPy from Arrow geometry, no payload Parquet needed
# (geometry_array is an Arrow array of patch geometries built upstream)
subset = collection.subset(split=&quot;train&quot;)
arr = subset.get_numpy(geometries=geometry_array, bands=[&quot;B02&quot;, &quot;B08&quot;])
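# Aside: "deterministic splits" can be as simple as hashing a stable key
# (e.g. the product ID), so train/val assignment is reproducible across
# machines and reruns. Illustrative helper, not a Rasteret API:
import hashlib

def deterministic_split(key, val_fraction=0.1):
    bucket = hashlib.sha256(key.encode()).digest()[0] / 255.0
    return "val" if bucket < val_fraction else "train"

# the same key always lands in the same split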
</code></pre><p>The result is a few MBs&apos; worth of Parquet index instead of multi-GB Parquet shards, with the same patch-grid semantics and deterministic splits.</p><p>We benchmarked this against the Hugging Face &quot;Major-TOM Core&quot; Parquets using the strongest&#xA0;<code>datasets</code>&#xA0;read path, which can take PyArrow filters and use HF&apos;s smart storage to its advantage (not the HF streaming generator):</p><table>
<thead>
<tr>
<th style="text-align:right">Patches</th>
<th style="text-align:right">HF <code>datasets</code> parquet filters</th>
<th style="text-align:right">Rasteret index+COG</th>
<th style="text-align:right">Speedup</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right">120</td>
<td style="text-align:right">46.83 s</td>
<td style="text-align:right">12.09 s</td>
<td style="text-align:right"><strong>3.9x</strong></td>
</tr>
<tr>
<td style="text-align:right">1000</td>
<td style="text-align:right">771.59 s</td>
<td style="text-align:right">118.69 s</td>
<td style="text-align:right"><strong>6.5x</strong></td>
</tr>
</tbody>
</table>
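<p>The index can skip shard scans because each row stores a tile&apos;s byte offset and length, which map directly to one HTTP Range request against the source COG. A toy sketch of that mapping (illustrative only, not Rasteret&apos;s internals):</p><pre><code class="language-python">def range_header(offset: int, length: int) -> dict:
    # HTTP Range is inclusive on both ends
    return {"Range": f"bytes={offset}-{offset + length - 1}"}

# e.g. a 4 KiB tile starting at byte 65536:
hdr = range_header(65536, 4096)   # {"Range": "bytes=65536-69631"}
</code></pre>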
<p>The speedup grows with patch count because HF&#xA0;<code>datasets</code>&#xA0;re-scans the full Parquet shard for each keyed row, while Rasteret&apos;s index resolves directly to byte ranges.</p><p>We&apos;re not saying image-in-Parquet is wrong; Major TOM is a great project and a great resource. We&apos;re saying that for many EO ML workflows, it&apos;s better to treat your dataset as a table, with pixels staying where they already live.</p><h2 id="parquet-is-becoming-the-shipping-container-for-eo-ml">Parquet is becoming the shipping container for EO ML</h2><p>The ecosystem is already moving here in its own ways:</p><ul><li><strong>AlphaEarth Foundation</strong>&#xA0;publishes global dense embeddings as GeoParquet-indexed COG tiles &#x2014; Rasteret already ships a built-in dataset entry for them. Index-first, payload-later.</li><li><strong>GeoTessera</strong>&#xA0;ships a small&#xA0;<code>registry.parquet</code>&#xA0;to discover embedding tiles, then downloads&#xA0;<code>.npy</code>&#xA0;tiles on demand. Same pattern.</li><li>Many cloud providers &#x2014; the AWS Open Data Registry, Microsoft Planetary Computer, Google Public Datasets &#x2014; have begun publishing Parquet/GeoParquet snapshots of their catalogs (essentially STAC JSONs converted to Parquet for bulk querying).</li></ul><p>These projects use Parquet to describe where data lives. </p><p>A Rasteret Collection also caches&#xA0;<em>how</em>&#xA0;to read it: COG metadata sits right in the same Parquet. When you call&#xA0;<code>get_numpy()</code>&#xA0;or&#xA0;<code>to_torchgeo_dataset()</code>, Rasteret&apos;s fast I/O engine goes straight from table row to pixel array: no header re-parsing, no GDAL.</p><blockquote><strong>The rest of the library (<code>subset()</code>,&#xA0;<code>build_from_table()</code>, DuckDB joins, Arrow filtering) is data science on the index itself, before you ever fetch a pixel. That&apos;s the freedom you get! 
</strong></blockquote><h2 id="what-were-building-toward">What we&apos;re building toward</h2><p>Rasteret 0.3 is our answer to a specific but very real frustration: teams spend too much engineering effort on the seam between <em>&quot;I found the data&quot;</em> and <em>&quot;I can train on it.&quot;</em> The Rasteret Collection is that seam: a Parquet table with enough metadata to skip cold starts, enough structure to query with standard tools, and enough context to fetch pixels on demand.</p><p>That&apos;s the core bet:&#xA0;<strong>the index is the dataset</strong>. Pixels stay where they already live, in large TIFF files. Everything else - splits, labels, patch geometries, quality flags - lives as columns in a table you can version, share, reproduce, and embed your data science logic in. </p><h3 id="whats-next-for-rasteret">What&apos;s next for Rasteret</h3><p>We&apos;re actively working on:</p><ul><li><strong>More catalog entries</strong>&#xA0;- every dataset we add is one fewer STAC endpoint someone has to remember. Community PRs for new entries are the highest-leverage contribution, and they&apos;re easy: about 20 lines of config.</li><li><strong>Deeper TorchGeo collaboration</strong>&#xA0;- we&apos;ve opened a conversation with the TorchGeo maintainers about how Rasteret can serve as a backend for their GeoDataset ecosystem. Early days, but the goal is interop that benefits both communities.</li><li><strong>Format support</strong>&#xA0;- tiled GeoTIFF/COGs today, but the async byte-range reader architecture extends naturally to other tiled formats.</li></ul><h3 id="further-out-a-dataset-descriptor-spec">Further out: a dataset descriptor spec</h3><p>There&apos;s a bigger gap we keep running into. STAC is great at discovery, TorchGeo is great at execution, Parquet registries are becoming the lingua franca - but every project (AllenAI rslearn, GeoTessera, BigEarthNet V2, FTW) independently invents its own manifest schema. 
We think there&apos;s room for a small, portable descriptor that says &quot;here&apos;s what this dataset&#xA0;<em>is</em>&#xA0;- bands, CRS, resolution, splits - and here&apos;s where the records live.&quot; We&apos;re calling it DDS (data descriptor spec), inspired by DDL from the database world, and we&apos;ll write more about it in a future post. Tools first, spec claims later. </p><h2 id="come-build-with-us">Come build with us</h2><p>Rasteret is Apache-2.0, fully open source, with no strings attached to any paid product or commercial offering of Terrafloww. We want to build this in the open, for everyone.</p><p>We&apos;d love your help, and it doesn&apos;t have to be big. File an issue when something feels wrong. Open a PR to add a dataset entry. Report a docs typo. Tell us about a TIFF layout that breaks. If you find a case where Rasteret&apos;s output differs from rasterio, that&apos;s a correctness bug and we want to hear about it.</p><ul><li><strong>Docs</strong>:&#xA0;<a href="https://terrafloww.github.io/rasteret/?ref=blog.terrafloww.com">terrafloww.github.io/rasteret</a></li><li><strong>Benchmarks</strong>:&#xA0;<a href="https://terrafloww.github.io/rasteret/explanation/benchmark/?ref=blog.terrafloww.com">Benchmark write-up</a></li><li><strong>GitHub</strong>:&#xA0;<a href="https://github.com/terrafloww/rasteret?ref=blog.terrafloww.com">github.com/terrafloww/rasteret</a></li><li><strong>PyPI</strong>:&#xA0;<code>uv pip install rasteret</code></li><li><strong>Discord</strong>:&#xA0;<a href="https://discord.gg/V5vvuEBc?ref=blog.terrafloww.com">discord.gg/V5vvuEBc</a></li></ul><p>Read more about our larger vision in our previous blog - <a href="https://blog.terrafloww.com/streaming-tensors-not-files-for-geoai/" rel="noreferrer">Streaming Tensors not files</a></p><p>Check out <a href="https://terrafloww.com/?ref=blog.terrafloww.com" rel="noreferrer">terrafloww.com</a> to see how our commercial offerings combine and take your EO analysis experience and 
infrastructure to the next level. If you are convinced, join our waitlist.</p><hr><p><em>This work stands on the shoulders of GDAL, rasterio, TorchGeo, PyArrow, obstore, and the broader open-source geospatial community. Thanks for building the foundations.</em></p>]]></content:encoded></item><item><title><![CDATA[Stop selling files: The case for streaming tensors in GeoAI]]></title><description><![CDATA[Rasteret sped up geo image reads by 10x. But users still spend 80% of their time doing ETL before even feeding GPUs. It's like having a fast car on roads full of potholes. We investigate why, and share our solution]]></description><link>https://blog.terrafloww.com/streaming-tensors-not-files-for-geoai/</link><guid isPermaLink="false">69654b3268b4e2fbb0aac3f5</guid><dc:creator><![CDATA[Sid]]></dc:creator><pubDate>Wed, 21 Jan 2026 20:32:52 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1760344594784-60ff14035eb0?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDQ5fHxzdHJlYW0lMjBkYXRhfGVufDB8fHx8MTc2OTAxODUyMnww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<h2 id="intro">Intro</h2><img src="https://images.unsplash.com/photo-1760344594784-60ff14035eb0?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDQ5fHxzdHJlYW0lMjBkYXRhfGVufDB8fHx8MTc2OTAxODUyMnww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" alt="Stop selling files: The case for streaming tensors in GeoAI"><p>Last year we made the <a href="https://blog.terrafloww.com/rasteret-a-library-for-faster-and-cheaper-open-satellite-data-access/" rel="noreferrer"><strong>Rasteret</strong></a> library, which sped up access to Earth Observation (EO) images by 10x, while doing <strong>zero copy</strong> of the original data. We honestly thought this solved a lot of problems. 
But we realized users were still spending 80% of their time doing ceremonies before feeding the GPUs with EO images: surfing STAC catalogs, filling out forms to download images, drawing polygons on maps and waiting for email links, trying to find RAI information. All of this is still happening in the age of cloud-native formats and AI agents.</p><blockquote class="kg-blockquote-alt">It&apos;s like we built a faster car for everyone, but the roads are still full of potholes. So we set out to see why.</blockquote><h2 id="the-gap-for-eo-imagery"><strong>The gap for EO imagery</strong></h2><p>2025 was the year Defense &quot;ate&quot; everything. Governments signed 9-figure checks for dedicated satellite constellations and proprietary end-to-end intelligence systems. </p><blockquote>Honestly? This is good; it shows they buy tangible outcomes, no matter the cost.</blockquote><p>But what about the rest of the commercial markets, like insurance, agriculture, supply chains, and real estate? Commercial EO as an industry, whether it&apos;s raw images from space or insights from sophisticated science labs, is fractured between open data standards and special requirements when trying to work as one system.</p><p>Tabular geospatial data (vectors) is already moving to standards like Parquet and Iceberg, while images from satellites and drones are still tricky to work with. They are similar to data in other domains, like medical imagery and machinery CAD models, which are also hard to bring into data analytics for multimodal science. 
</p><h2 id="the-ceremonies-to-face"><strong>The ceremonies to face</strong> </h2><p>If you are an ML team building a risk model for real estate today and you want to use EO images of any kind, then before you even write a line of ML code you have to survive the ceremonies of Discover, Access, and Read:</p><ul><li><strong>Discovery is slow:</strong> Search across NASA, ESA, and private-company catalogs, plus public catalogs</li><li><strong>Access is full of diversions, with no strict standards:</strong> Fill forms for data, draw polygons in a map, get download links via email; every data provider&#x2019;s &quot;catalog&quot; is different</li><li><strong>Reading is heavy:</strong> EO images need specialised tools and boilerplate code to keep the GPU fed fast</li></ul><p>As Aravind from <a href="https://terrawatch.substack.com/?ref=blog.terrafloww.com" rel="noreferrer">Terrawatch</a> rightly said, for most companies using EO, a &quot;hidden product&quot; is just their internal data pipeline. They burn their seed money building custom infrastructure just to do the unsexy work of finding and handling imagery differences. </p><figure class="kg-card kg-image-card"><img src="https://blog.terrafloww.com/content/images/2026/01/Gemini_blog2.png" class="kg-image" alt="Stop selling files: The case for streaming tensors in GeoAI" loading="lazy" width="2000" height="1091" srcset="https://blog.terrafloww.com/content/images/size/w600/2026/01/Gemini_blog2.png 600w, https://blog.terrafloww.com/content/images/size/w1000/2026/01/Gemini_blog2.png 1000w, https://blog.terrafloww.com/content/images/size/w1600/2026/01/Gemini_blog2.png 1600w, https://blog.terrafloww.com/content/images/2026/01/Gemini_blog2.png 2000w" sizes="(min-width: 720px) 720px"></figure><p><strong>The sharing dead end</strong></p><p>Let&apos;s say you survive the ceremony and create great insights from your AI model. How do you share it? PDF reports. BI dashboards. Maps. 
Convert images to tables.</p><blockquote>You take high-dimensional, computable data and flatten it for human eyeballs. </blockquote><p>If you really wish to share it as images, you end up with STAC catalogs. Probably starting the above &quot;ceremonies&quot; again for each end user looking to build on your outputs.</p><h2 id="what-matters-in-the-age-of-agents"><strong>What matters, in the age of Agents</strong> </h2><p>If a human engineer takes weeks to collate and prepare EO datasets, it is impossible for a generic AI Agent.</p><p>We are moving to a world of MCP and A2A commerce. If an AI agent is trying to answer a prompt like &quot;estimate my supply chain&apos;s carbon emissions&quot; or &quot;should I buy my home here&quot;, the AI might think &quot;<em>location data might be insightful for this prompt, let me get it&quot;</em></p><p>Today, that agent hits a wall. It hits a &quot;Contact Sales&quot; form or a &quot;Download&quot; button. Just like a human does.</p><figure class="kg-card kg-image-card"><img src="https://blog.terrafloww.com/content/images/2026/01/Gemini_blog1.png" class="kg-image" alt="Stop selling files: The case for streaming tensors in GeoAI" loading="lazy" width="2000" height="1091" srcset="https://blog.terrafloww.com/content/images/size/w600/2026/01/Gemini_blog1.png 600w, https://blog.terrafloww.com/content/images/size/w1000/2026/01/Gemini_blog1.png 1000w, https://blog.terrafloww.com/content/images/size/w1600/2026/01/Gemini_blog1.png 1600w, https://blog.terrafloww.com/content/images/2026/01/Gemini_blog1.png 2000w" sizes="(min-width: 720px) 720px"></figure><p>Be it a satellite manufacturer publishing their images, non-profits distributing deforestation predictions, or high-end labs sharing Geo Foundation Model predictions, all these images need to be easy to find, read and attribute (<a href="https://www.nature.com/articles/sdata201618?ref=blog.terrafloww.com#:~:text=Box%202%3A-,The%20FAIR%20Guiding%20Principles,-To%20be%20Findable" 
rel="noreferrer">FAIR</a>).</p><p>We believe the commercial EO industry needs its own protocol for images: a connectivity layer that handles the handshake between a raw S3 bucket and a GPU tensor, and creates a better experience for both humans and AI agents.</p><h2 id="the-solution-stream-tensors-not-files"><strong>The Solution: Stream Tensors, not Files</strong></h2><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">YouTube</strong></b> solved distribution by standardizing the player, not the cameras or content. We at Terrafloww are doing the same for EO. </div></div><p>Instead of forcing you to download GeoTIFFs, or creating a new file format that &quot;shoe-horns&quot; your data by copying images into tables, <strong>the Rasteret SDK acts as a buffering protocol, while the Terrafloww Platform acts as the player.</strong></p><figure class="kg-card kg-image-card"><img src="https://blog.terrafloww.com/content/images/2026/01/Gemini_Generated_Image_xr7eydxr7eydxr7e.png" class="kg-image" alt="Stop selling files: The case for streaming tensors in GeoAI" loading="lazy" width="2000" height="960" srcset="https://blog.terrafloww.com/content/images/size/w600/2026/01/Gemini_Generated_Image_xr7eydxr7eydxr7e.png 600w, https://blog.terrafloww.com/content/images/size/w1000/2026/01/Gemini_Generated_Image_xr7eydxr7eydxr7e.png 1000w, https://blog.terrafloww.com/content/images/size/w1600/2026/01/Gemini_Generated_Image_xr7eydxr7eydxr7e.png 1600w, https://blog.terrafloww.com/content/images/2026/01/Gemini_Generated_Image_xr7eydxr7eydxr7e.png 2000w" sizes="(min-width: 720px) 720px"></figure><p>We are building on the open MLCommons Croissant standard to broadcast your datasets across the web better than STAC catalogs. 
Our SDK, meanwhile, speaks Arrow and DLPack.</p><p>So what does this mean for you?</p><p>For Data Providers -</p><ul><ul><li><strong>Keep your images </strong>in your own S3 bucket, zero copying.</li><li><strong>We index it</strong>, you add your attribution and license information</li><li><strong>Increased visibility</strong> of your data on the web, with correct metadata</li><li><strong>Get paid</strong> <strong>for every chip </strong>that flows out of your S3</li><li><strong>Set the price </strong>for very granular units, not the old &quot;$ per sq km&quot;<ul><li>Play with Excel-like formulas to tune prices for micro units of data</li></ul></li><li><strong>Track every byte</strong> you shared and the money you earn, on our dashboard</li></ul></ul><p>For Data Consumers (Humans/AI Agents) - </p><ul><ul><li>With the Rasteret SDK you get -<ul><li><strong>Easier data discovery</strong> and filtering of datasets</li><li><strong>10x faster reads of </strong>EO images and no boilerplate code to feed your GPUs</li></ul></li></ul><ul><ul><li><strong>Fully interoperable</strong> with PyTorch, JAX, DuckDB and more libraries via Apache Arrow and DLPack</li></ul><li><strong>Track every byte</strong> you queried and how much it costs, on our dashboard</li></ul></ul><blockquote class="kg-blockquote-alt">Whether an AI Agent reads as little as 5 chips of an EO image or a hedge fund reads images of a whole continent, both providers and consumers can <strong>speed up, monitor, and monetize every byte</strong> with Terrafloww</blockquote><blockquote class="kg-blockquote-alt">This is the foundation needed to make EO data truly ready for agentic commerce</blockquote><h3 id="what-if-you-dont-wish-to-sell-data"><strong>What if you don&apos;t wish to sell data?</strong> </h3><p>All the features listed above remain the same, even for your internal needs. 
</p><p>Filter, discover and accelerate internal analysis on your image datasets with our SDK, and monitor and bill internal teams in your enterprise with our console. Skip messing with your data platform; we integrate with you.</p><h3 id="will-it-devalue-my-data"><strong>Will it devalue my data?</strong></h3><p>Some might fear that if we make data this easy to mix and match, AI agents will just combine datasets to create something better, lowering the value of the original source.</p><p>To that, we say: it&apos;s possible, but when friction drops to zero, the volume of innovation skyrockets. We will accelerate towards better data products. We want to see what happens when a group of ML scientists can instantly monetize their inference model without building a sales team, or when a non-profit&#x2019;s dataset can be effortlessly consumed by a Fortune 500&#x2019;s AI agent.</p><h2 id="join-the-beta"><strong>Join the Beta</strong> </h2><p>The Defense market has its systems. It&#x2019;s time the Commercial EO market had its own. </p><p>Come test us out; we would love your feedback, and to build this future with you. </p><div class="kg-card kg-button-card kg-align-center"><a href="https://accounts.terrafloww.com/waitlist?ref=blog.terrafloww.com" class="kg-btn kg-btn-accent">Join the Waitlist!</a></div><p>If you&apos;d like to know more, or just wanna chat with us, join our Discord. 
We will be releasing more updates and blogs soon.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://discord.gg/hdR2UJ2Zdx?ref=blog.terrafloww.com" class="kg-btn kg-btn-accent">Join Discord</a></div>]]></content:encoded></item><item><title><![CDATA[Rasteret: A Library for Faster and Cheaper Open Satellite Data Access]]></title><description><![CDATA[<p><br><strong>Happy New Year, Everyone!</strong></p><p>It&#x2019;s been just over a month since we published our last <a href="https://blog.terrafloww.com/efficient-cloud-native-raster-data-access-an-alternative-to-rasterio-gdal/" rel="noreferrer">blog</a> describing a method that improves reading of Cloud-Optimized GeoTIFFs (COGs). Over the New Year break, we worked on open-sourcing not just the core logic but also creating a library with simple high-level</p>]]></description><link>https://blog.terrafloww.com/rasteret-a-library-for-faster-and-cheaper-open-satellite-data-access/</link><guid isPermaLink="false">67820f7168b4e2fbb0aabec0</guid><dc:creator><![CDATA[Sid]]></dc:creator><pubDate>Sun, 12 Jan 2025 06:12:17 GMT</pubDate><media:content url="https://blog.terrafloww.com/content/images/2025/01/L9SanFrancisco.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://blog.terrafloww.com/content/images/2025/01/L9SanFrancisco.jpg" alt="Rasteret: A Library for Faster and Cheaper Open Satellite Data Access"><p><br><strong>Happy New Year, Everyone!</strong></p><p>It&#x2019;s been just over a month since we published our last <a href="https://blog.terrafloww.com/efficient-cloud-native-raster-data-access-an-alternative-to-rasterio-gdal/" rel="noreferrer">blog</a> describing a method that improves reading of Cloud-Optimized GeoTIFFs (COGs). 
Over the New Year break, we worked on open-sourcing not just the core logic but also on building a library with simple high-level APIs, to make our approach easier for users to adopt.</p><p>While the library is still in its early stages, we believe it can benefit geospatial data scientists and engineers significantly. By sharing it now, we hope to receive feedback and contributions via GitHub Issues and PRs to make it even more useful for the community.</p><h3 id="performance-snapshot-rasterio-vs-rasteret">Performance Snapshot: Rasterio vs. Rasteret</h3><figure class="kg-card kg-image-card kg-width-wide"><img src="https://blog.terrafloww.com/content/images/2025/01/single_timeseries_request.png" class="kg-image" alt="Rasteret: A Library for Faster and Cheaper Open Satellite Data Access" loading="lazy" width="762" height="448" srcset="https://blog.terrafloww.com/content/images/size/w600/2025/01/single_timeseries_request.png 600w, https://blog.terrafloww.com/content/images/2025/01/single_timeseries_request.png 762w"></figure><p>Here&#x2019;s an exciting stat to illustrate the potential of Rasteret:</p><ul><li><strong>1-Year NDVI Time Series of a Farm Using Landsat 9</strong><ul><li><strong>Rasterio (Multiprocessing)</strong>: 32 seconds (first run), 24 seconds (subsequent runs with GDAL caching).</li><li><strong>Rasteret</strong>: Just <strong>3 seconds</strong>&#x2014;every time.</li><li>Even Google Earth Engine takes 10&#x2013;30 seconds for its first run and 3&#x2013;5 seconds on subsequent runs.</li></ul></li></ul><h3 id="how-was-it-made"><strong>How was it made?</strong></h3><p>Inspired by <a href="https://github.com/fsspec/kerchunk?ref=blog.terrafloww.com" rel="noreferrer">Kerchunk</a>, Rasteret uses a cache to speed up COG file processing.</p><p>We call this cache a &quot;Rasteret Collection&quot;, a name that will feel familiar to folks using STAC. 
A Collection only caches the COG file headers; it does not cache overviews or image data tiles.</p><p>We decided to go with <a href="https://stac-utils.github.io/stac-geoparquet/latest/?ref=blog.terrafloww.com" rel="noreferrer">STAC-GeoParquet </a>as its base, and extend it with &quot;per-band-metadata&quot; columns.</p><p><em>&quot;For example, a Landsat Rasteret Collection is an exact copy of the </em><a href="https://landsatlook.usgs.gov/stac-server/collections/landsat-c2l2-sr/?ref=blog.terrafloww.com" rel="noreferrer"><em>landsat-c2l2-sr</em></a><em> STAC Collection, with additional columns in the GeoParquet.</em> If you wish to know more, check out our first <a href="https://blog.terrafloww.com/efficient-cloud-native-raster-data-access-an-alternative-to-rasterio-gdal/" rel="noreferrer">blog</a>.</p><h3 id="what-does-this-cost">What does this cost?</h3><p></p><p>In this blog we focus specifically on paid datasets like Landsat on AWS to emphasise cost savings; free STAC endpoints like Earth Search&apos;s <a href="https://earth-search.aws.element84.com/v1/collections/sentinel-2-l2a/?ref=blog.terrafloww.com" rel="noreferrer">sentinel-2-l2a</a> can also be cached with Rasteret.</p><p><strong>Global Landsat Rasteret Collection (1 year):</strong></p><ul><li><strong>Time:</strong> ~30 minutes</li><li><strong>Cost:</strong> $1.8 (AWS S3 GET requests).</li></ul><p><strong>Regional Rasteret Collection (e.g., Karnataka, India, 1 year):</strong></p><ul><li><strong>Time:</strong> ~9 seconds</li><li><strong>Cost:</strong> Negligible.</li></ul><p>As of now, the Rasteret library defaults Landsat to the <a href="https://landsatlook.usgs.gov/stac-server/collections/landsat-c2l2-sr/?ref=blog.terrafloww.com" rel="noreferrer">landsat-c2l2-sr</a> collection.</p><p>Creating such Collections means you <strong>pay the cost once, upfront</strong>. 
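</p><p>To make those extra columns concrete, here is a rough sketch of what one cached row can carry. The column names and values below are illustrative only, not Rasteret&apos;s actual schema:</p><pre><code class="language-python"># Illustrative only: rough shape of one row in an extended
# STAC-GeoParquet Collection (column names are hypothetical)
row = {
    # standard STAC-GeoParquet fields
    "id": "LC09_L2SP_144051_20240101",
    "datetime": "2024-01-01T05:12:33Z",
    "eo:cloud_cover": 12.5,
    # extra per-band COG header metadata, cached once
    "red.tile_offsets": [8503, 91231, 184991],      # byte offset of each internal tile
    "red.tile_byte_counts": [82728, 93760, 88112],  # compressed size of each tile
    "red.dtype": "uint16",
    "red.compression": "deflate",
}

# With these columns cached, a reader can issue HTTP Range requests
# directly, skipping the header round-trips GDAL makes per new file
start = row["red.tile_offsets"][0]
end = start + row["red.tile_byte_counts"][0] - 1  # inclusive byte range
print((start, end))  # (8503, 91230)</code></pre><p>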
More on costs below.</p><h3 id="why-we-built-this-library">Why We Built This Library</h3><p></p><p><strong><u>Intro</u></strong></p><p>You can skip reading this intro section if you know the inner workings of Rasterio/GDAL.</p><p><strong>The Hidden Performance Bottleneck</strong>: For Rasterio, and the GDAL functions behind it, getting image data involves quite a few rounds of metadata retrieval. Here&apos;s what happens behind the scenes:</p><ol><li><strong>First HTTP Request</strong>: Fetch file header<ul><li>Retrieves critical metadata like CRS, Transform</li><li>Caches this information in-memory or on disk using GDAL&apos;s cache</li></ul></li><li><strong>Additional Requests</strong>: Grab COG overviews<ul><li>Useful for quick visualizations</li><li>But... it means more HTTP requests to the same file </li></ul></li></ol><p>Then, when you call -</p><pre><code class="language-python">from rasterio.mask import mask

# src: an open rasterio dataset; geometry: a GeoJSON-like mapping
data, transform = mask(src, [geometry], crop=True)</code></pre><p>Rasterio now uses its cached info in GDAL to:</p><ul><li>Determine which image tile is in which byte range</li><li>Make the final HTTP requests to fetch the numpy array</li></ul><p><strong>The Real Cost</strong>: Between 3 and 6 HTTP requests per .tif file, every time you start a new Python environment.</p><h4 id="effect-of-new-environments-on-gdal-cache"><u>Effect of New Environments on GDAL Cache</u></h4><p>It is important to note that the GDAL cache helps reduce HTTP requests to COG files if your code is re-run <strong>in the</strong> <strong>same environment/session</strong>.</p><p>Where the GDAL cache remains -</p><ul><li>Rerunning the same &apos;.py&apos; file</li><li>Consistent Jupyter notebook sessions</li><li>Long-running Cloud VMs</li><li>Local Python environments on your laptop</li></ul><p>Where the GDAL cache is lost/non-existent -</p><ul><li>Laptop or VM restarts</li><li>Cloud-native workflows with scaling VMs</li><li>Hosted Jupyter notebook kernels or VMs restarting</li></ul><h4 id="era-of-cloud-native-geospatial-workflows"><u>Era of cloud native geospatial workflows</u></h4><p>In today&apos;s cloud-based data analysis, we leverage cloud elasticity like never before:</p><ul><li>Multiple compute machines (VMs)</li><li>Serverless platforms (AWS Lambda, GCP Cloud Run)</li><li>Kubernetes orchestrators (Airflow, Kubeflow, Flyte)</li></ul><p><strong>The Catch</strong>: Each of these creates a <strong>NEW</strong> Python environment.</p><p><strong>Translation</strong>: Rasterio must re-fetch metadata for EVERY COG file, every single time.</p><p>With workflows that depend on thousands of files and highly parallel tasks, this inefficiency can quickly escalate.</p><blockquote>Rasteret addresses these limitations and ensures consistent performance and cost efficiency, even in distributed and ephemeral cloud environments.</blockquote><h2 id="cost-and-time-analysisrasteret-vs-rasteriogdal"><u>Cost and Time 
Analysis - Rasteret vs Rasterio/GDAL</u></h2><p></p><p>Let&#x2019;s break down the cost and time differences for a hypothetical project analyzing <strong>100,000 farms.</strong> The project needs time-series indices of various kinds, like NDVI, NDMI, etc.</p><p>Here&#x2019;s the scenario:</p><ul><li><strong>Scenes:</strong> 2 Landsat scenes covering all farms.</li><li><strong>Dates per scene:</strong> 45.</li><li><strong>Bands to be read per date:</strong> 4.</li><li><strong>Total files:</strong> 2 &#xD7; 45 &#xD7; 4 = 360 COG files.</li><li><strong>Files worth processing (due to cloud cover):</strong> 200 COG files.</li></ul><h4 id="workflow">Workflow</h4><ul><li><strong>Rasteret</strong> caches metadata for the 200 files in its Collection, and it&apos;s used across 100 parallel processes (dockers/lambdas).</li><li><strong>Rasterio/GDAL</strong> reads headers repeatedly in new environments due to the lack of a shared cache.</li></ul><figure class="kg-card kg-image-card kg-width-wide"><img src="https://blog.terrafloww.com/content/images/2025/01/actual_analysis_time.png" class="kg-image" alt="Rasteret: A Library for Faster and Cheaper Open Satellite Data Access" loading="lazy" width="762" height="448" srcset="https://blog.terrafloww.com/content/images/size/w600/2025/01/actual_analysis_time.png 600w, https://blog.terrafloww.com/content/images/2025/01/actual_analysis_time.png 762w"></figure><figure class="kg-card kg-image-card kg-width-wide"><img src="https://blog.terrafloww.com/content/images/2025/01/aws_service_wise_costs.png" class="kg-image" alt="Rasteret: A Library for Faster and Cheaper Open Satellite Data Access" loading="lazy" width="762" height="448" srcset="https://blog.terrafloww.com/content/images/size/w600/2025/01/aws_service_wise_costs.png 600w, https://blog.terrafloww.com/content/images/2025/01/aws_service_wise_costs.png 762w"></figure><figure class="kg-card kg-image-card kg-width-wide"><img src="https://blog.terrafloww.com/content/images/2025/01/total_vm_hours.png" 
class="kg-image" alt="Rasteret: A Library for Faster and Cheaper Open Satellite Data Access" loading="lazy" width="762" height="448" srcset="https://blog.terrafloww.com/content/images/size/w600/2025/01/total_vm_hours.png 600w, https://blog.terrafloww.com/content/images/2025/01/total_vm_hours.png 762w"></figure><h4 id="s3-get-costs-for-header-reads"><strong>S3 GET Costs for Header Reads</strong></h4><ul><li><strong>Rasteret:</strong><ul><li>Cache headers once: 200 requests (1 per file) &#xD7; $0.0004/1000 = <strong>$0.00008</strong>.</li></ul></li><li><strong>Rasterio/GDAL:</strong><ul><li>Headers read 100 times (100 processes): 100 &#xD7; $0.00008 = <strong>$0.008</strong>.</li></ul></li></ul><h4 id="s3-get-costs-for-image-tile-reads">S3 GET Costs for Image Tile Reads</h4><ul><li><strong>Both Rasteret and Rasterio:</strong><ul><li>100,000 farms &#xD7; 2 image tiles per farm &#xD7; 200 files &#xD7; $0.0004/1000 = <strong>$16</strong>.</li></ul></li></ul><h4 id="total-s3-costs">Total S3 Costs</h4><ul><li><strong>Rasterio:</strong> $16 + $0.008 = <strong>$16.008</strong>.</li><li><strong>Rasteret:</strong> $16 + $0.00008 = <strong>$16</strong>.</li></ul><h4 id="ec2-costs">EC2 Costs</h4><p>Assume t3.xlarge AWS Spot instances at $0.006 per hour:</p><ul><li><strong>Rasterio Processing Time:</strong><ul><li>Files processed per second: 1.8.</li><li>Total VM time: 200 files / 1.8 &#xD7; 100,000 farms = 11,100,000 seconds.</li><li>Time for 100 parallel processes = 11,100,000/100/3600 = ~31 hours</li><li>EC2 cost: 11,100,000 / 3600 &#xD7; $0.006 = <strong>$18</strong>.</li></ul></li><li><strong>Rasteret Processing Time:</strong><ul><li>Files processed per second: 11.</li><li>Total VM time: 200 files / 11 &#xD7; 100,000 farms = 1,818,000 seconds.</li><li>Time for 100 parallel processes = 1,818,000/100/3600 = ~5 hours</li><li>EC2 cost: 1,818,000 / 3600 &#xD7; $0.006 = <strong>$3</strong>.</li></ul></li></ul><h4 id="total-costs">Total Costs</h4><ul><li><strong>Rasterio:</strong> 
$16.008 (S3) + $18 (EC2) = <strong>$34.008</strong>.</li><li><strong>Rasteret:</strong> $16 (S3) + $3 (EC2) = <strong>$19</strong>.</li></ul><h3 id="scalability-and-repeatability">Scalability and Repeatability</h3><p>By running more such workflows in parallel, doing EDA on the same or different sets of farms, the long-term implication is that with Rasterio, every run without a GDAL cache repeats the header reads, so costs rise:</p><ul><li><strong>1,000 environments:</strong> $34.008 + $0.08 = <strong>$34.088</strong>.</li><li><strong>10,000 environments:</strong> $34.008 + $0.8 = <strong>$34.808</strong>.</li><li><strong>100,000 environments:</strong> $34.008 + $8 = <strong>$42.008</strong>.</li></ul><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">&#x1F440;</div><div class="kg-callout-text"><i><em class="italic" style="white-space: pre-wrap;">These &apos;</em></i><i><b><strong class="italic" style="white-space: pre-wrap;">new environments&apos; count</strong></b></i><i><em class="italic" style="white-space: pre-wrap;"> can rack up quite quickly across an organisation, like each employee&apos;s laptop, each project&apos;s Kubernetes cluster, Jupyter Lab with its kernels restarting, VMs restarting, and more such scenarios</em></i>.</div></div><p>Rasteret&#x2019;s cached metadata in its Collection ensures consistent costs across these scenarios.</p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://blog.terrafloww.com/content/images/2025/01/env_scaling_costs.png" class="kg-image" alt="Rasteret: A Library for Faster and Cheaper Open Satellite Data Access" loading="lazy" width="762" height="448" srcset="https://blog.terrafloww.com/content/images/size/w600/2025/01/env_scaling_costs.png 600w, https://blog.terrafloww.com/content/images/2025/01/env_scaling_costs.png 762w"></figure><h3 id="conclusion">Conclusion</h3><p>Rasteret is not intended to replace GDAL or Rasterio, which are amazingly feature-rich libraries. 
Rather, it is focused on addressing a critical aspect of making cloud-native satellite imagery workflows faster and cheaper.</p><p>This is an early release of the library; we are sharing it because its performance and efficiency look promising, and we want eyes on it early. Do give us your feedback and ideas as issues or PRs.</p><p>There are lots of improvements to be made: better deployment actions, clear release notes, more test coverage, support for Python 3.12 and 3.13, support for S3-based Rasteret Collections, and more.</p><p>In parallel, we plan on adding features too; one of the first that we think will be useful is a PyTorch <a href="https://torchgeo.readthedocs.io/en/latest/?ref=blog.terrafloww.com" rel="noreferrer">TorchGeo</a>-compatible on-the-fly <a href="https://torchgeo.readthedocs.io/en/stable/tutorials/custom_raster_dataset.html?ref=blog.terrafloww.com" rel="noreferrer">Dataset</a> creator, for which there&apos;s an open <a href="https://github.com/terrafloww/rasteret/issues/1?ref=blog.terrafloww.com" rel="noreferrer">issue</a> as well. 
Do share your thoughts there.</p><p>Do try it out in a <strong>Python 3.11</strong> environment - </p><pre><code>uv pip install rasteret</code></pre><div class="kg-card kg-callout-card kg-callout-card-green"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">Check out the Quick Start on <a href="https://github.com/terrafloww/rasteret?tab=readme-ov-file&amp;ref=blog.terrafloww.com#%EF%B8%8F-prerequisites" rel="noreferrer">GitHub</a></div></div><p>If you like it, do give us a &#x2B50; on GitHub; if you have already installed it, do upgrade.</p><p>Thanks for reading till the end, and we wish you a wonderful year ahead!</p><hr><p></p><p></p><h3 id="acknowledgements">Acknowledgements</h3><p>This work builds on the contributions of giants in the open-source geospatial community:</p><ul><li>GDAL and Rasterio, for pioneering geospatial data access</li><li>The Cloud Native Geospatial members for STAC-geoparquet and COG specifications</li><li>PyArrow, GeoArrow for efficient GeoParquet filtering</li><li>The broader open-source geospatial community</li></ul><p>We&apos;re grateful for the efforts and contributions of these projects and communities. 
Their dedication, expertise, and willingness to share knowledge have laid the foundation for approaches like the one outlined here.</p><hr><p><strong>Terrafloww is proud to sponsor the Cloud Native Geospatial forum as a small startup member</strong></p><figure class="kg-card kg-image-card"><img src="https://blog.terrafloww.com/content/images/2025/07/cnglogo.png" class="kg-image" alt="Rasteret: A Library for Faster and Cheaper Open Satellite Data Access" loading="lazy" width="1207" height="631" srcset="https://blog.terrafloww.com/content/images/size/w600/2025/07/cnglogo.png 600w, https://blog.terrafloww.com/content/images/size/w1000/2025/07/cnglogo.png 1000w, https://blog.terrafloww.com/content/images/2025/07/cnglogo.png 1207w" sizes="(min-width: 720px) 720px"></figure>]]></content:encoded></item><item><title><![CDATA[Efficient Cloud-Native Raster Data Access: An Alternative to Rasterio/GDAL]]></title><description><![CDATA[A journey of optimization of cloud-based geospatial data processing.
]]></description><link>https://blog.terrafloww.com/efficient-cloud-native-raster-data-access-an-alternative-to-rasterio-gdal/</link><guid isPermaLink="false">674aac1b38ed0319e870f017</guid><dc:creator><![CDATA[Aditya Giri]]></dc:creator><pubDate>Sat, 30 Nov 2024 07:51:43 GMT</pubDate><media:content url="https://blog.terrafloww.com/content/images/2024/12/media-gallery-banner.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://blog.terrafloww.com/content/images/2024/12/media-gallery-banner.jpg" alt="Efficient Cloud-Native Raster Data Access: An Alternative to Rasterio/GDAL"><p>The rapid growth of Earth observation data in cloud storage, which will keep growing exponentially as companies like SpaceX drive down rocket launch prices, has pushed us to rethink how we access and analyze satellite imagery. With major space agencies like ESA and NASA adopting Cloud-Optimized GeoTIFFs (COGs) as their standard format, we&apos;re seeing unprecedented volumes of data becoming available through public cloud buckets. </p><p>This accessibility brings new challenges around efficient data access. In this article, we introduce an alternative approach to cloud-based raster data access, building upon the foundational work of GDAL and Rasterio. </p><hr><h3 id="the-evolution-of-raster-storage">The Evolution of Raster Storage</h3><p>Traditional GeoTIFF files weren&apos;t designed with cloud storage in mind. Reading these files often required downloading entire datasets, even when only a small portion was needed. 
</p><p>The introduction of COGs marked a significant shift, enabling efficient partial reads through HTTP range requests.</p><p>COGs achieve this efficiency through their internal structure:</p><ul><li>An initial header containing the Image File Directory (IFD)</li><li>Tiled organization of the actual image data</li><li>Overview levels for multi-resolution access</li><li>Strategic placement of metadata for minimal initial reads</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.terrafloww.com/content/images/2024/11/image-4.png" class="kg-image" alt="Efficient Cloud-Native Raster Data Access: An Alternative to Rasterio/GDAL" loading="lazy" width="611" height="991" srcset="https://blog.terrafloww.com/content/images/size/w600/2024/11/image-4.png 600w, https://blog.terrafloww.com/content/images/2024/11/image-4.png 611w"><figcaption><span style="white-space: pre-wrap;">COG File Data Composition, courtesy of </span><a href="https://developers.planet.com/docs/planetschool/an-introduction-to-cloud-optimized-geotiffs-cogs-part-1-overview/?ref=blog.terrafloww.com" rel="noreferrer"><span style="white-space: pre-wrap;">Planet Labs blog</span></a></figcaption></figure><p>This structure allows tools to read specific portions of the file without downloading the entire dataset. </p><p>However, we feel that even with COGs, something like a time-series NDVI graph for a few polygons across a few regions is not as fast as it could be, especially given the latency caused by AWS S3 throttling.</p><h3 id="the-stac-ecosystem-and-geoparquet">The STAC Ecosystem and GeoParquet</h3><p>The SpatioTemporal Asset Catalog (STAC) specification has emerged as a crucial tool for discovering and accessing Earth observation data. 
While STAC APIs provide standardized ways to query satellite imagery, the Cloud Native Geospatial (CNG) community took this further by developing STAC GeoParquet.</p><p>STAC GeoParquet leverages Parquet&apos;s columnar format to enable efficient querying of STAC metadata. The columnar structure allows for:</p><ul><li>Filter pushdown for spatial and temporal fields</li><li>Efficient compression of repeated values</li><li>Reduced I/O through column pruning</li><li>Fast parallel processing capabilities with the right parquet reading libraries</li></ul><h3 id="current-access-patterns-and-challenges">Current access patterns and challenges</h3><p>The current approach to accessing COGs, using GDAL and Rasterio, typically involves:</p><ol><li>An initial GET request to read the file header</li><li>Additional requests, if the header was not fully read by the initial request</li><li>Final requests to read the actual data tiles</li></ol><p>For cloud-hosted public datasets, this pattern can lead to:</p><ul><li>Multiple HTTP requests per file access</li><li>Increased latency, especially across cloud regions</li><li>Potential throttling on public buckets</li><li>Higher costs from numerous small requests to paid buckets</li></ul><h3 id="a-new-approach-extending-stac-geoparquet-and-byte-range-calculations">A New Approach: Extending STAC GeoParquet and Byte-range calculations<br></h3><h4 id="extending-stac-geoparquet-with-new-columns"><u>Extending stac-geoparquet with new columns</u></h4><p>Building upon the excellent stac-geoparquet, we explored adding some of each COG&#x2019;s internal metadata directly into it:</p><ul><li>Tile offset info</li><li>Tile size info</li><li>Data type</li><li>Compression info</li></ul><p>We run a batch process, paying the cost and time to gather COG file metadata upfront.<br>In our case, we took 1 year&apos;s worth of Sentinel-2 items from the STAC API.</p><h4 id="get-data-from-urls-by-calculating-byte-ranges-just-in-time"><u>Get data from 
URLs by calculating byte ranges just-in-time</u></h4><p>Now that we have 1 year&apos;s worth of STAC items in GeoParquet along with each COG&apos;s internal metadata, we can do what GDAL does behind the scenes without querying the headers of COG files again and again.</p><p>We use the tile offset and tile size info to calculate the exact byte range to read for each AOI, for each COG URL in the STAC items.<br>We then use the Python <em>requests</em> module to fetch the data from the COG URLs, decompress the incoming bytes (the data is stored as deflate-compressed COGs), and create the <em>NumPy</em> array.</p><p>Below is the current approach to reading a remote COG URL using Rasterio.</p><figure class="kg-card kg-code-card"><pre><code class="language-python"># Current approach with GDAL/rasterio requires
# multiple http requests per file behind the scenes

# find scenes from STAC APIs:

import rasterio
from rasterio.mask import mask
from pystac_client import Client
from shapely.geometry import Polygon, mapping
from datetime import date

AOI_POLYGON = Polygon([(77.5, 13.0), (77.55, 13.0),
                (77.55, 13.05), (77.5, 13.05), (77.5, 13.0)])

# example 1-year search window
start_date, end_date = date(2024, 1, 1), date(2024, 12, 31)

client = Client.open(&quot;https://earth-search.aws.element84.com/v1&quot;)
search = client.search(
    collections=[&quot;sentinel-2-l2a&quot;],
    datetime=f&quot;{start_date.isoformat()}/{end_date.isoformat()}&quot;,
    intersects=mapping(AOI_POLYGON),
    query={&quot;eo:cloud_cover&quot;: {&quot;lt&quot;: 20}},
)

items = list(search.items())

for item in items:
    # First request fetches the IFD and headers of the file
    src = rasterio.open(item.assets[&apos;red&apos;].href)
    # A second or third request may happen behind the scenes in GDAL
    # if the headers are not completely read in 1 HTTP request;
    # once read, the opened rasterio dataset is usable

    epsg = item.properties[&quot;proj:epsg&quot;]
    # wgs84_polygon_to_utm: helper (not shown) that reprojects
    # the AOI into the item&apos;s UTM CRS
    utm_polygon = wgs84_polygon_to_utm(AOI_POLYGON, epsg)
    geojson = utm_polygon.__geo_interface__

    # using src now, GDAL performs the byte-range requests
    # to get only those internal tiles of the COG that cover our AOI
    # and returns a numpy array and its transform
    data, transform = mask(src, [geojson], crop=True)
</code></pre><figcaption><p><span style="white-space: pre-wrap;">Current Approach</span></p></figcaption></figure><p>And below is our approach.</p><figure class="kg-card kg-code-card"><pre><code class="language-python">from pathlib import Path
from shapely.geometry import Polygon
import xarray as xr

from rasteret import Rasteret
from rasteret.constants import DataSources
from rasteret.core.utils import save_per_geometry

# AOI polygon used below for the STAC search and pixel retrieval
AOI_POLYGON = Polygon([(77.5, 13.0), (77.55, 13.0),
                (77.55, 13.05), (77.5, 13.05), (77.5, 13.0)])

def main():
    # 1. Setup workspace folder and parameters
    workspace_dir = Path.home() / &quot;rasteret_workspace&quot;
    workspace_dir.mkdir(exist_ok=True)

    # custom name for local STAC collection to be created
    custom_name = &quot;bangalore&quot;
    date_range = (&quot;2024-01-01&quot;, &quot;2024-03-31&quot;)
    data_source = DataSources.SENTINEL2

    # get bounds of polygon above for stac search
    bbox = AOI_POLYGON.bounds

    # instantiate Rasteret class
    processor = Rasteret(
        custom_name=custom_name,
        data_source=data_source,
        output_dir=workspace_dir,
        date_range=date_range,
    )

    # create a local STAC geoparquet with COG metadata added to it
    processor.create_collection(
        bbox=bbox,
        date_range=date_range,
        cloud_cover_lt=20,
    )

    # use the processor to get the image data for polygons and bands
    # as a multidimensional xarray dataset
    
    # this uses the local geoparquet created above to optimize HTTP
    # requests to multiple COG files (1 request per required tile)

    # bands, geometries can be changed here
    # date_range too can be changed but must be within the date range
    # available in local geoparquet
    
    # pySTAC&apos;s &quot;search&quot; filters, like &quot;platform&quot;,&quot;cloud_cover_lt&quot; and 
    # others can be passed here too.
    
    xarray_ds = processor.get_xarray(
        geometries=[AOI_POLYGON],
        bands=[&quot;B4&quot;, &quot;B8&quot;],
        cloud_cover_lt=20,
        date_range=date_range,
    )


if __name__ == &quot;__main__&quot;:
    main()
</code></pre><figcaption><p><span style="white-space: pre-wrap;">Our New Approach</span></p></figcaption></figure><h3 id="performance-insights">Performance Insights</h3><p>Initial benchmarks show promising results for this approach. The aim was to improve <strong>time-to-first-tile.</strong></p><p><em><u>There are a few things to note before we get to some raster query speed tests -</u></em></p><p>It is important to tune GDAL configurations correctly to fully benefit from GDAL&#x2019;s own multi-range requests, and to use settings that avoid GDAL listing all S3 files, which wastes time and adds more S3 GET requests.</p><p>Rasterio with GDAL&#x2019;s virtual cache gets faster on subsequent reads of the <strong>same</strong> COG S3 file, but if the analysis spans time and space, we pay the latency cost of fetching headers for every new file that is not cached.</p><p>This is also true across new VMs/Lambdas/serverless compute: a new Rasterio &quot;Session/Environment&quot; is created for each run of your Python code in a new Python environment, so scaling out to multiple VMs means these new environments keep sending HTTP requests to every COG file they interact with, even ones used in an earlier session, just to complete the rasterio.mask task.</p><blockquote>Our approach of adding new columns to STAC GeoParquet avoids paying this time and cost of multiple HTTP requests for each COG file in each new Python process.</blockquote><p>We have written custom Python code for byte-range calculation, based on GDAL&#x2019;s C++ approach. 
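</p><p>A minimal sketch of that calculation (simplified; it assumes a row-major tile grid and ignores partial edge tiles, which the real code must handle):</p><pre><code class="language-python">def tile_byte_ranges(tile_offsets, tile_byte_counts, tiles_across,
                     col_min, col_max, row_min, row_max):
    """Byte ranges for the internal COG tiles covering an AOI window.

    tile_offsets / tile_byte_counts come from the COG's IFD, already
    cached in the extended STAC-GeoParquet, so no header request is
    needed. Each (start, end) is inclusive, ready for an HTTP
    'Range: bytes=start-end' header; the deflate-compressed tile
    bytes can then be decompressed with zlib.
    """
    ranges = []
    for row in range(row_min, row_max + 1):
        for col in range(col_min, col_max + 1):
            idx = row * tiles_across + col  # row-major tile index
            start = tile_offsets[idx]
            ranges.append((start, start + tile_byte_counts[idx] - 1))
    return ranges

# Toy example: a 4x4 tile grid with fake offsets and sizes
offsets = [1000 + i * 500 for i in range(16)]
counts = [500] * 16
print(tile_byte_ranges(offsets, counts, 4, 1, 2, 1, 2))
# [(3500, 3999), (4000, 4499), (5500, 5999), (6000, 6499)]</code></pre><p>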
We also created custom tile-merging code, in case 1 AOI intersects more than 1 internal tile of a COG.<br>This keeps our library pretty lightweight, with no GDAL dependencies except for the geometry_mask/rasterize functions.</p><p><strong><em><u>With all this context set, below are some initial test results -</u></em></strong></p><p><strong>Machine config - 2 CPU (4 threads), 2GB RAM machine</strong></p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-text"><i><em class="italic" style="white-space: pre-wrap;">Test scenario: Processing 20 Sentinel-2 scenes for NDVI calculation over a year for a single farmland in India, </em></i>with async Python functions</div></div><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.terrafloww.com/content/images/2024/12/rasteret_total.png" class="kg-image" alt="Efficient Cloud-Native Raster Data Access: An Alternative to Rasterio/GDAL" loading="lazy" width="590" height="490"><figcaption><span style="white-space: pre-wrap;">Total time taken - STAC filter + 40 TIF file queries + NDVI calculation</span></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.terrafloww.com/content/images/2024/12/rasteret_stac-1.png" class="kg-image" alt="Efficient Cloud-Native Raster Data Access: An Alternative to Rasterio/GDAL" loading="lazy" width="590" height="490"><figcaption><span style="white-space: pre-wrap;">Time for filtering 1 year of STAC items with AOI and Cloud filter</span></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.terrafloww.com/content/images/2024/12/rasteret_mem.png" class="kg-image" alt="Efficient Cloud-Native Raster Data Access: An Alternative to Rasterio/GDAL" loading="lazy" width="590" height="490"><figcaption><span style="white-space: pre-wrap;">Memory Usage for whole process</span></figcaption></figure><h4 
id="key-factors-to-speed-up-in-our-approach">Key Factors to speed up in our approach:</h4><ul><li>Reduced HTTP requests through pre-calculating byte ranges just-in-time for each AOI for each COG file</li><li>Pre-cached metadata in STAC-GeoParquet eliminating API calls to STAC API JSON endpoints</li><li>Optimized parallel processing for spatio-temporal analysis</li></ul><h3 id="current-scope-and-limitations">Current Scope and Limitations</h3><p>While these initial results are encouraging, it&apos;s important to note where this approach currently works best:</p><p><strong>Optimal Use Cases:</strong></p><ul><li>Time-series analysis of Sentinel 2 data across multiple small polygons</li><li>Optimize paid public bucket data access, to reduce GET requests costs</li></ul><p><strong>Areas of improvement:</strong></p><ul><li>Pure Python or Rust implementations of operations like rasterio.mask</li><li>Adding more data sources like USGS Landsat and others</li><li>LRU or other cache for repeated same tile queries</li><li>Reducing memory usage</li><li>Benchmark against Xarray and Dask workloads</li><li>Test on multiple polygons across the world for 1 year date range</li></ul><hr><h3 id="what-next">What next:</h3><p>As we continue to develop and refine this approach, we&apos;re excited to engage with the geospatial community to gather feedback, insights, and contributions. By collaborating and building upon each other&apos;s work, we can collectively push the boundaries of what&apos;s possible with cloud-based raster data access and analysis.</p><p>We&apos;re currently working on an <strong>open-source</strong> <strong>library</strong> which will be called &#x201C;<strong><em>Rasteret</em></strong>&#x201D; that implements these techniques, and we look forward to sharing the library and more technical details in an upcoming deep dive blog. 
Stay tuned!</p><hr><h3 id="acknowledgments">Acknowledgments</h3><p>This work stands on the shoulders of giants in the open-source geospatial community:</p><ul><li>GDAL and Rasterio, for pioneering geospatial data access</li><li>The Cloud Native Geospatial community for the STAC and COG specifications</li><li>PyArrow and GeoArrow for efficient Parquet filtering</li><li>The broader open-source geospatial community</li></ul><p>We&apos;re grateful for the tireless efforts and contributions of these projects and communities. Their dedication, expertise, and willingness to share knowledge have laid the foundation for approaches like the one outlined here.</p><hr><p>Terrafloww is proud to support the Cloud Native Geospatial forum by being a newbie startup member in their large established community.</p>
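<p><em>A footnote for the curious:</em> the byte-range calculation and tile-merging ideas described above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not Rasteret&apos;s actual code; the function names are hypothetical, and it assumes a COG with a row-major grid of fixed-size internal tiles whose byte offsets and sizes are already cached (e.g. as columns in STAC-GeoParquet).</p>

```python
# Sketch: which internal COG tiles does a pixel window touch, and how can
# their byte ranges be coalesced into fewer HTTP range requests?
# Hypothetical helpers; real offsets/sizes would come from the cached table.

def intersecting_tiles(window, tile_size, tiles_across):
    """Flat indices of tiles overlapped by a (col_off, row_off, width, height) window."""
    col_off, row_off, width, height = window
    first_col, last_col = col_off // tile_size, (col_off + width - 1) // tile_size
    first_row, last_row = row_off // tile_size, (row_off + height - 1) // tile_size
    return [r * tiles_across + c
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]

def merge_ranges(ranges, max_gap=0):
    """Coalesce (offset, nbytes) pairs that touch or sit within max_gap bytes,
    so one HTTP Range header can cover several internal tiles."""
    merged = []
    for off, cnt in sorted(ranges):
        if merged and off <= merged[-1][0] + merged[-1][1] + max_gap:
            end = max(merged[-1][0] + merged[-1][1], off + cnt)
            merged[-1] = (merged[-1][0], end - merged[-1][0])
        else:
            merged.append((off, cnt))
    return merged

# Example: a 600x300 window starting at pixel (300, 100), in a COG with
# 256x256 tiles and 43 tiles per row, touches rows 0-1 and columns 1-3.
print(intersecting_tiles((300, 100, 600, 300), 256, 43))   # [1, 2, 3, 44, 45, 46]
print(merge_ranges([(500, 10), (0, 100), (100, 50)]))      # [(0, 150), (500, 10)]
```

<p>Each merged tuple then maps to a single <code>Range: bytes=offset-(offset+nbytes-1)</code> request, which is how a handful of GETs can replace one request per tile.</p>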
<!--kg-card-begin: html-->
<div style="display: flex; justify-content: center; align-items: center; width: 100%; height: 100%;">
    <a href="https://cloudnativegeo.org/?ref=blog.terrafloww.com" style="text-decoration: none;">

  <svg id="Layer_2" data-name="Layer 2" style="width: 200px;" xmlns="http://www.w3.org/2000/svg" viewbox="0 0 702.61 311.75">
    <defs>
      <style>
        .cls-1 {
          fill: #2126f7;
          stroke-width: 0px;
        }
      </style>
    </defs>
    <path class="cls-1" d="M415.02,208.9c-21.69,0-28.11-10.42-28.11-47.87s6.42-48.46,28.11-48.46c14.26,0,19.28,4.4,21.89,19.03h25.1c-1.2-31.45-11.85-40.86-46.99-40.86-41.58,0-53.82,16.02-53.82,70.29s12.26,69.7,53.82,69.7c35.34,0,46.39-9.01,48.2-39.66h-25.1c-2.21,13.82-7.43,17.83-23.09,17.83Z"/>
    <polygon class="cls-1" points="557.1 171.06 559.9 199.49 559.3 199.49 510.43 92.75 479.19 92.75 479.19 228.93 503.83 228.93 503.83 150.63 501.02 122.19 501.63 122.19 550.49 228.94 581.73 228.94 581.73 92.76 557.1 92.76 557.1 171.06"/>
    <path class="cls-1" d="M649.66,174.05h28.28v30.65c-6.82,3-11.84,4.01-22.26,4.01-23.67,0-30.69-9.8-30.69-47.26s6.41-48.67,28.08-48.67c17.85,0,23.47,4.01,24.67,17.63h24.87c-1.61-30.65-13.04-39.66-49.74-39.66-40.92,0-53.15,16.02-53.15,70.09s12.83,69.9,56.56,69.9c17.65,0,28.08-2.41,45.93-10.62v-67.89h-52.54v21.83Z"/>
    <path class="cls-1" d="M291.6,85.81L189.84,2.36c-1.86-1.53-4.2-2.36-6.61-2.36h-35.76c-1.82,0-3.62.4-5.27,1.16L24.09,55.73c-4.5,2.08-7.59,6.37-8.14,11.3L.1,209.11c-.59,5.25,1.52,10.45,5.61,13.8l105.25,86.47c1.87,1.53,4.2,2.37,6.62,2.37h32.16c1.82,0,3.63-.4,5.28-1.16l120.56-55.85c4.22-1.96,7.11-5.98,7.63-10.6l15.57-140.72c.74-6.7-1.96-13.32-7.17-17.6ZM149.8,226.17c-69.08-19.75-67.08-121.75-.64-141.69,33.56,9.11,51.71,39.48,51.42,71.33-.27,30.38-18.2,61.67-50.78,70.36Z"/>
  </svg>
      </a>
</div>

<!--kg-card-end: html-->
]]></content:encoded></item></channel></rss>