Cloud native data formats

EO College

What are cloud native data formats

Cloud native formats, also called cloud-optimized formats, are file formats designed to optimize the storage, access, and processing of geospatial data in cloud computing environments. They are tailored to the scalability, flexibility, and parallel processing capabilities of cloud infrastructure, which makes large-scale datasets efficient to handle.

Video content in cooperation with Aimee Barciauskas (Development Seed) and Ryan Avery (Development Seed).

“Cloud-optimised means organizing so subsets of data can be read. Ideally, the data is also compressed. Both of these factors minimize the amount of data that has to be transferred across a network.”

Characteristics of cloud native data formats

Cloud-optimized mainly means optimized read access: a client can read only part of a file (partial reads) and can read several parts at the same time (parallel reads). Cloud-optimized formats share these main characteristics:

Data Chunking: Cloud native formats employ a chunk-based organization, where the data is divided into smaller chunks or blocks. This enables parallel processing and efficient retrieval of specific portions of the data, reducing the need to access the entire dataset.
Internal Indexing: These formats incorporate internal indexing structures that facilitate fast spatial and attribute queries. This enables efficient data access and retrieval operations without the need for extensive scanning or processing of the entire dataset.
Metadata Optimization: Cloud native formats store and index metadata so that all metadata associated with the data can be retrieved at once, without reading the data itself. This supports faster discovery and interpretation of data properties and characteristics.
Compression and Tiling: Cloud native formats often employ advanced compression techniques to reduce storage requirements while maintaining data quality. Additionally, they utilize tiling strategies to divide the data into smaller, manageable tiles that can be independently accessed and processed.
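The interplay of chunking and internal indexing can be illustrated with a minimal Python sketch. It is not taken from any specific format: the 4 x 4 tile grid, the fixed tile size of 1000 bytes, and the names `tiles_for_window`, `byte_ranges`, and `index` are assumptions made for illustration only. The sketch shows how a reader could work out which tiles overlap a requested pixel window and then look up where those tiles sit in the file.

```python
from typing import Dict, List, Tuple

Tile = Tuple[int, int]        # (tile_row, tile_col)
ByteRange = Tuple[int, int]   # (offset, length) in bytes


def tiles_for_window(col_off: int, row_off: int, width: int, height: int,
                     tile_size: int) -> List[Tile]:
    """Return the tiles that overlap a pixel window, in row-major order."""
    first_col = col_off // tile_size
    last_col = (col_off + width - 1) // tile_size
    first_row = row_off // tile_size
    last_row = (row_off + height - 1) // tile_size
    return [(r, c)
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]


def byte_ranges(tiles: List[Tile], index: Dict[Tile, ByteRange]) -> List[ByteRange]:
    """Look up each tile in the internal index to find its bytes in the file."""
    return [index[tile] for tile in tiles]


# Hypothetical internal index: a 4 x 4 grid of tiles, each assumed to be
# stored as 1000 compressed bytes, laid out one after another in the file.
TILE_BYTES = 1000
index = {(r, c): ((r * 4 + c) * TILE_BYTES, TILE_BYTES)
         for r in range(4) for c in range(4)}

wanted = tiles_for_window(col_off=300, row_off=100, width=300, height=200,
                          tile_size=256)
print(wanted)                      # [(0, 1), (0, 2), (1, 1), (1, 2)]
print(byte_ranges(wanted, index))  # [(1000, 1000), (2000, 1000), (5000, 1000), (6000, 1000)]
```

In this example only 4 of the 16 tiles need to be fetched, and because each tile is independent, the four reads could be issued in parallel.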

Partial reads rely on HTTP range requests, which allow a client to request only a specific byte range of a file instead of the complete file.
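As a rough sketch of what such a request looks like, the following Python snippet builds a `Range` header and parses the `Content-Range` header that a server returns with a 206 Partial Content response. The URL and the helper names `range_header`, `build_range_request`, and `parse_content_range` are hypothetical, and the request is only constructed, not sent.

```python
import re
from typing import Tuple
from urllib.request import Request


def range_header(offset: int, length: int) -> str:
    """Format a byte range; the end position in a Range header is inclusive."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return f"bytes={offset}-{offset + length - 1}"


def build_range_request(url: str, offset: int, length: int) -> Request:
    """Build (but do not send) a GET request for part of a remote file."""
    return Request(url, headers={"Range": range_header(offset, length)})


def parse_content_range(value: str) -> Tuple[int, int, int]:
    """Parse 'bytes first-last/total' as returned with a 206 response.

    Only the fully numeric form is handled here; a server may also
    report an unknown total size as '*'.
    """
    match = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+)", value)
    if match is None:
        raise ValueError(f"unsupported Content-Range value: {value!r}")
    first, last, total = (int(group) for group in match.groups())
    return first, last, total


# Hypothetical URL: ask for 4096 bytes starting at byte 16384.
request = build_range_request("https://example.com/data.tif", 16384, 4096)
print(request.get_header("Range"))                       # bytes=16384-20479
print(parse_content_range("bytes 16384-20479/1000000"))  # (16384, 20479, 1000000)
```

Passing the request to `urllib.request.urlopen` would send it; a server that supports range requests answers with status 206 and only the requested bytes.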

Examples of cloud native data formats

COG: Cloud-Optimized GeoTIFF (COG) is an optimized version of the GeoTIFF format. It organizes raster data into internal tiles, applies compression, and is read with HTTP range requests for efficient data access in the cloud.
Zarr: A format specifically designed for storing and accessing multidimensional arrays. It supports chunking, compression, and parallel processing, making it suitable for large-scale geospatial datasets such as weather data. Metadata is stored externally, in separate files alongside the chunk files rather than inside the data files themselves.
FlatGeobuf: A cloud-optimized vector data format. It is a binary encoding for geodata that holds a collection of Simple Features.
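To make the chunking of multidimensional arrays in Zarr more concrete, here is a small Python sketch that does not use the Zarr library itself. It assumes the default layout of Zarr version 2, in which each chunk is stored as its own object under a key made of the chunk indices joined by dots. The function name `chunk_key` and the array and chunk shapes are made up for illustration.

```python
from typing import Sequence


def chunk_key(index: Sequence[int], chunk_shape: Sequence[int]) -> str:
    """Return the key of the chunk that holds the array element at `index`.

    Assumes the Zarr version 2 default layout: chunk indices joined by '.'.
    """
    if len(index) != len(chunk_shape):
        raise ValueError("index and chunk_shape must have the same length")
    return ".".join(str(i // c) for i, c in zip(index, chunk_shape))


# Hypothetical weather array with dimensions (time, lat, lon),
# stored in chunks of 100 x 50 x 365 elements.
chunks = (100, 50, 365)
print(chunk_key((250, 10, 730), chunks))  # 2.0.2
print(chunk_key((0, 0, 0), chunks))       # 0.0.0
```

A reader that needs a single element therefore fetches one chunk object, and readers working on different chunks can proceed in parallel without touching the same object.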

Available Material

Ryan Avery, Aimee Barciauskas, Development Seed, United States (2023). Technologies used to Create, Store and Access Geospatial Data in the Cloud. https://2023.ieeeigarss.org/view_paper.php?PaperNum=5306
ESIP Talk on Cloud Native Formats: https://www.youtube.com/watch?v=ac_UKunUrNM
FOSS4G Talk On Cloud Native Formats (Matthew Hanson)
- https://talks.osgeo.org/foss4g-2023/talk/XBHYF9/
- https://space.cloud68.co/s/xExLwCmzzKEcoB9?dir=undefined&path=%2FLumbardhi%2F28.06.2023&openfile=1356611
OGC White Paper on Cloud Native Formats (Chris Holmes, Scott Simmons):
Cloud-Native Geospatial Foundation initiative of Radiant Earth:
- https://cloudnativegeo.org
Tweet from Chris Holmes (great example – postholer)