Cloud native data formats
What are cloud native data formats?
Cloud native formats, also called cloud-optimized formats, are file formats designed to optimize the storage, access, and processing of geospatial data in cloud computing environments. They take advantage of the scalability, flexibility, and parallel processing capabilities of cloud infrastructure, which makes it practical to work with large-scale datasets.
Characteristics of cloud native data formats
Cloud-optimized mainly means optimized for read access: a client can read only part of a file (partial reads) and can read several parts at the same time (parallel reads). The main characteristics common to cloud-optimized formats are:
- Data Chunking: Cloud native formats employ a chunk-based organization, where the data is divided into smaller chunks or blocks. This enables parallel processing and efficient retrieval of specific portions of the data, reducing the need to access the entire dataset.
- Internal Indexing: These formats incorporate internal indexing structures that facilitate fast spatial and attribute queries. This enables efficient data access and retrieval operations without the need for extensive scanning or processing of the entire dataset.
- Metadata Optimization: Cloud native formats optimize metadata storage and indexing so that all metadata associated with the data can be retrieved at once. This supports faster discovery and interpretation of data properties and characteristics.
- Compression and Tiling: Cloud native formats often employ advanced compression techniques to reduce storage requirements while maintaining data quality. Additionally, they utilize tiling strategies to divide the data into smaller, manageable tiles that can be independently accessed and processed.
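The chunking and tiling ideas above can be illustrated with a small sketch. It assumes a raster split into square tiles of a fixed size (512 pixels here, an assumed value) and works out which tiles a client would need to fetch for a given pixel window. The function name and parameters are hypothetical and do not belong to any particular library.

```python
from typing import List, Tuple


def tiles_for_window(row_off: int, col_off: int, height: int, width: int,
                     tile_size: int = 512) -> List[Tuple[int, int]]:
    """Return (tile_row, tile_col) indices of every tile overlapping a pixel window."""
    if height <= 0 or width <= 0:
        return []
    # Integer division maps a pixel coordinate to the tile that contains it.
    first_row = row_off // tile_size
    last_row = (row_off + height - 1) // tile_size
    first_col = col_off // tile_size
    last_col = (col_off + width - 1) // tile_size
    return [(r, c)
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]


# A 600 x 600 pixel window starting at row 1000, column 300 (assumed example values).
needed = tiles_for_window(1000, 300, 600, 600)
print(needed)  # [(1, 0), (1, 1), (2, 0), (2, 1), (3, 0), (3, 1)]
```

Only these six tiles have to be read, and because each tile is independent they could be requested in parallel.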
An HTTP range request allows a client to request only a specific portion, or byte range, of a file instead of downloading the complete file. Cloud native formats rely on this mechanism for partial reads.
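As a minimal sketch of what such a request looks like, the snippet below builds a Range header with Python's standard library. The URL is a placeholder and the helper names are hypothetical. The request is only constructed, not sent. A server that supports range requests would normally answer with status 206 (Partial Content) and only the requested bytes.

```python
import urllib.request


def range_header(offset: int, length: int) -> str:
    """Build a Range header value for `length` bytes starting at `offset`.

    HTTP byte ranges are inclusive on both ends, hence the -1.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return f"bytes={offset}-{offset + length - 1}"


def build_range_request(url: str, offset: int, length: int) -> urllib.request.Request:
    """Create (but do not send) a GET request for one byte range."""
    return urllib.request.Request(url, headers={"Range": range_header(offset, length)})


# Placeholder URL (an assumption); nothing is sent over the network here.
req = build_range_request("https://example.com/data/scene.tif", 0, 16384)
print(req.get_header("Range"))  # bytes=0-16383
```

A client would typically fetch the file header and index this way first, then issue further range requests for just the chunks it needs.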
Examples of cloud native data formats
- COG – Cloud-Optimized GeoTIFF is a variant of the GeoTIFF format optimized for cloud access. It organizes raster data into chunks, uses internal tiling and compression, and relies on HTTP range requests for efficient data access in the cloud.
- Zarr – A format specifically designed for storing and accessing multidimensional arrays. It supports chunking, compression, and parallel processing, making it suitable for large-scale geospatial datasets such as weather data. Metadata is stored in small JSON documents kept alongside the data chunks rather than inside the chunk files themselves.
- FlatGeobuf – A cloud-optimized vector data format. It is a binary encoding for geodata that holds a collection of Simple Features.
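To make the chunk-plus-metadata layout more concrete, here is a toy sketch in the spirit of a chunked array store. It is not the Zarr library API. The key names (`meta.json`, `chunk.N`) and the functions are hypothetical, and a plain dictionary stands in for cloud object storage. It shows how a reader can use a small metadata document to locate and decompress a single chunk instead of the whole array.

```python
import json
import zlib
from array import array


def write_store(values, chunk_len):
    """Split a 1-D list of ints into compressed chunks plus one JSON metadata entry."""
    meta = {"length": len(values), "chunk_len": chunk_len, "typecode": "i"}
    store = {"meta.json": json.dumps(meta)}  # metadata lives apart from the chunks
    for n, start in enumerate(range(0, len(values), chunk_len)):
        raw = array("i", values[start:start + chunk_len]).tobytes()
        store[f"chunk.{n}"] = zlib.compress(raw)  # each chunk is compressed on its own
    return store


def read_value(store, index):
    """Read one element by fetching the metadata and only the chunk that holds it."""
    meta = json.loads(store["meta.json"])
    if not 0 <= index < meta["length"]:
        raise IndexError(index)
    n, pos = divmod(index, meta["chunk_len"])
    chunk = array(meta["typecode"])
    chunk.frombytes(zlib.decompress(store[f"chunk.{n}"]))
    return chunk[pos]


store = write_store(list(range(1000)), chunk_len=100)
print(len(store))              # 11 (10 chunks + 1 metadata entry)
print(read_value(store, 457))  # 457
```

With real object storage, each key would be a separate object, so reading one value costs two small requests rather than a download of the full dataset.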
Available Material
- Ryan Avery, Aimee Barciauskas, Development Seed, United States (2023). Technologies used to Create, Store and Access Geospatial Data in the Cloud. https://2023.ieeeigarss.org/view_paper.php?PaperNum=5306
- ESIP Talk on Cloud Native Formats: https://www.youtube.com/watch?v=ac_UKunUrNM
- FOSS4G Talk On Cloud Native Formats (Matthew Hanson)
- OGC White Paper on Cloud Native Formats (Chris Holmes, Scott Simmons):
- Cloud-Native Geospatial Foundation initiative of Radiant Earth:
- Tweet from Chris Holmes (great example – postholer)