Efficient Cloud-Native Raster Data Access: An Alternative to Rasterio/GDAL
A journey of optimizing cloud-based geospatial data processing. Introducing a new approach to raster data access that harnesses STAC GeoParquet and cloud-native workflows to push the boundaries of satellite imagery analysis.
The rapid growth of Earth observation data in cloud storage, driven in part by falling launch costs from companies like SpaceX, has pushed us to rethink how we access and analyze satellite imagery. With major space agencies like ESA and NASA adopting Cloud-Optimized GeoTIFFs (COGs) as their standard format, unprecedented volumes of data are becoming available through public cloud buckets. This accessibility brings new challenges around efficient data access patterns and resource utilization. In this article, we introduce an alternative approach to cloud-based raster data access, building upon the foundational work of GDAL and Rasterio while exploring optimizations specifically for cloud-native workflows. We hope this article contributes to the geospatial community's collective effort to tackle the challenges of big data in the cloud era.
The Evolution of Raster Storage
Traditional GeoTIFF files weren't designed with cloud storage in mind. Reading these files often required downloading entire datasets, even when only a small portion was needed. The introduction of COGs marked a significant shift, enabling efficient partial reads through HTTP range requests.
COGs achieve this efficiency through their internal structure:
- An initial header containing the Image File Directory (IFD)
- Tiled organization of the actual image data
- Overview levels for multi-resolution access
- Strategic placement of metadata for minimal initial reads
This structure allows tools to read specific portions of the file without downloading the entire dataset. However, even with COGs, accessing cloud-based data still presents challenges, especially around managing multiple requests and minimizing latency across cloud regions.
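As a rough sketch of what a partial read looks like at the HTTP level (the URL is a placeholder, and real tooling such as GDAL's /vsicurl/ handles this internally):

```python
import urllib.request

def range_request(url: str, start: int, end: int) -> urllib.request.Request:
    """Build an HTTP request for bytes [start, end] of a remote file."""
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})

def read_cog_header(url: str, length: int = 16_384) -> bytes:
    """Fetch only the leading bytes of a remote COG, where the TIFF header
    and Image File Directories (IFDs) live, instead of the whole file."""
    with urllib.request.urlopen(range_request(url, 0, length - 1)) as resp:
        return resp.read()

# Usage (requires network access; the URL is a placeholder):
# header = read_cog_header("https://example.com/scene.tif")
# header[:2] is b"II" for little-endian TIFFs, b"MM" for big-endian.
```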
The STAC Ecosystem and GeoParquet
The SpatioTemporal Asset Catalog (STAC) specification has emerged as a crucial tool for discovering and accessing Earth observation data. While STAC APIs provide standardized ways to query satellite imagery, the Cloud Native Geospatial (CNG) community took this further by developing STAC GeoParquet.
STAC GeoParquet leverages Parquet's columnar format to enable efficient querying of STAC metadata. The columnar structure allows for:
- Filter pushdown for spatial and temporal fields
- Efficient compression of repeated values
- Reduced I/O through column pruning
- Fast parallel processing capabilities
Current Access Patterns and Challenges
The current approach to accessing COGs, exemplified by GDAL and Rasterio, typically involves:
- Initial GET request to read the file header
- Additional requests if the full header was not captured by the initial read
- Final requests to read the actual data tiles
For cloud-hosted public datasets, this pattern can lead to:
- Multiple HTTP requests per file access
- Increased latency, especially across cloud regions
- Potential throttling on public buckets
- Higher costs from numerous small requests to paid buckets
A New Approach: Extending STAC GeoParquet with Just-in-Time Byte-Range Compute
Building upon the excellent STAC GeoParquet format from the CNG community, we've explored adding COG structural information directly into the metadata:
The key extension to the GeoParquet data is a set of per-band metadata columns:
- Tile offset locations
- Tile size details
- Data type
- Compression details
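As an illustration, a single row carrying these columns might look like the following. The column names and values are our sketch of the extension, not a ratified specification, and the asset path is hypothetical:

```python
# One STAC item row with the sketched per-band metadata columns attached.
item_row = {
    "id": "S2A_0001",
    "assets.B04.href": "s3://bucket/scene_B04.tif",     # hypothetical path
    "assets.B04.tile_offsets": [8192, 73728, 139264],   # byte offset per tile
    "assets.B04.tile_byte_counts": [65536, 65536, 60211],  # compressed size per tile
    "assets.B04.dtype": "uint16",
    "assets.B04.compression": "deflate",
}

# With offsets and sizes in hand, a reader can compute the exact byte
# range of any tile without first fetching the COG header.
tile = 1
start = item_row["assets.B04.tile_offsets"][tile]
end = start + item_row["assets.B04.tile_byte_counts"][tile] - 1
print(f"bytes={start}-{end}")  # → bytes=73728-139263
```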
Below is the current standard approach, using Rasterio.
And here is a refined version that calculates the needed byte ranges and fetches only those.
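A sketch of that byte-range path, assuming the tile offsets, sizes, and compression are already known from the extended GeoParquet columns (the URL, offsets, and helper names here are illustrative):

```python
import urllib.request
import zlib

import numpy as np

def fetch_tile(url: str, offset: int, nbytes: int) -> bytes:
    """GET exactly one tile's bytes with a single HTTP range request;
    no header read is needed because the offsets are precomputed."""
    req = urllib.request.Request(
        url, headers={"Range": f"bytes={offset}-{offset + nbytes - 1}"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def decode_tile(raw: bytes, shape=(512, 512), dtype="uint16") -> np.ndarray:
    """Decompress a deflate-compressed tile into a NumPy array."""
    return np.frombuffer(zlib.decompress(raw), dtype=dtype).reshape(shape)

# Usage (requires network access; URL and offsets are placeholders):
# raw = fetch_tile("https://example.com/scene_B04.tif", 73728, 65536)
# tile = decode_tile(raw)
```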
Performance Insights
Initial benchmarks across cloud environments show promising results for this approach:
Rasterio and GDAL configurations must be tuned correctly to fully benefit from GDAL caching and to stop GDAL from listing and reading all sibling S3 files, which wastes time.
The aim was to improve time-to-first-tile. Rasterio with GDAL's cache gets faster on subsequent reads of the same COG file on S3, but if the analysis spans time and space, we pay the price of fetching TIFF metadata repeatedly as each Rasterio session touches new files.
The graphs above show the metrics for the following steps:
- Filter STAC for one AOI for one year's worth of Sentinel-2 scenes with a cloud-cover filter of less than 20%
- Get the TIFF file URLs of the B08 and B04 bands
- Query both bands for each date and create a mean NDVI array for the day
- Run the above steps asynchronously using Python parallel processing
- Combine the NDVI values, compute each month's average, and print the results
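The steps above can be mirrored in a sketch with a stubbed band reader. Here `read_band` is a stand-in that returns synthetic reflectance, and the bucket keys are hypothetical; the real pipeline would fetch B04/B08 tiles by byte range:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def read_band(url: str) -> np.ndarray:
    """Stub: returns synthetic reflectance in [0, 1); a real version
    would fetch and decode the band's tiles for the AOI."""
    rng = np.random.default_rng(abs(hash(url)) % 2**32)
    return rng.uniform(0.0, 1.0, size=(64, 64))

def daily_ndvi(date: str) -> float:
    """Read red (B04) and near-infrared (B08), return the day's mean NDVI."""
    red = read_band(f"s3://bucket/{date}_B04.tif")  # hypothetical key
    nir = read_band(f"s3://bucket/{date}_B08.tif")
    ndvi = (nir - red) / (nir + red + 1e-9)
    return float(ndvi.mean())

dates = [f"2024-01-{d:02d}" for d in (5, 15, 25)]
with ThreadPoolExecutor(max_workers=8) as pool:
    means = list(pool.map(daily_ndvi, dates))  # one task per date

monthly_mean = sum(means) / len(means)
```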
All approaches shown in the graph produced exactly the same results.
Key Performance Factors:
- Fewer HTTP requests, by pre-calculating byte ranges just-in-time for each new .tif file and each new AOI within it
- Metadata pre-cached in STAC GeoParquet, eliminating API calls to STAC JSON endpoints
- Optimized parallel processing for spatio-temporal analysis
Current Scope and Limitations
While these initial results are encouraging, it's important to note where this approach currently works best:
Optimal Use Cases:
- Time-series analysis of Sentinel-2 data across multiple small polygons
- Cross-cloud data access
- Public bucket access optimization
Areas for improvement:
- Pure Python or Rust implementations of raster operations like rasterio.mask
- Adding more data sources like USGS Landsat and others
- LRU or other cache for repeated same tile queries
- Reducing memory usage
- Benchmarking against Xarray and Dask workloads
- Testing on multiple polygons across the world over a one-year date range
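Of the items above, tile-level caching is straightforward to sketch with Python's `functools.lru_cache`. The fetch function here is a stub that records calls instead of doing I/O:

```python
from functools import lru_cache

calls = []  # records real fetches so cache hits are visible

def _fetch_range(url: str, offset: int, nbytes: int) -> bytes:
    """Stub for the real range GET; a production version would do I/O."""
    calls.append((url, offset, nbytes))
    return b"\x00" * nbytes

@lru_cache(maxsize=1024)
def cached_tile(url: str, offset: int, nbytes: int) -> bytes:
    """Tile fetch keyed by (url, offset, nbytes); repeated queries for
    the same tile are served from memory instead of the network."""
    return _fetch_range(url, offset, nbytes)

cached_tile("s3://bucket/a.tif", 8192, 65536)
cached_tile("s3://bucket/a.tif", 8192, 65536)  # second call hits the cache
print(len(calls))  # → 1
```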
What's Next
As we continue to develop and refine this approach, we're excited to engage with the geospatial community to gather feedback, insights, and contributions. By collaborating and building upon each other's work, we can collectively push the boundaries of what's possible with cloud-based raster data access and analysis.
We're currently working on an open-source library that implements these techniques, and we look forward to sharing the library and more technical details in an upcoming deep dive blog. Stay tuned!
Acknowledgments
This work stands on the shoulders of giants in the open-source geospatial community:
- GDAL and Rasterio, for pioneering geospatial data access
- The Cloud Native Geospatial community for STAC and COG specifications
- PyArrow and GeoArrow for efficient parquet filtering
- The broader open-source geospatial community
We're grateful for the tireless efforts and contributions of these projects and communities. Their dedication, expertise, and willingness to share knowledge have laid the foundation for approaches like the one outlined here.
We hope this article can contribute to the ongoing conversation around efficient cloud-based raster data access, and we look forward to learning from and collaborating with the community as we collectively explore new frontiers in geospatial data processing.
Terrafloww is proud to support the Cloud Native Geospatial forum as a new startup member of its large, established community.