In machine learning, handling large datasets is a common yet challenging task. My passion for data science and machine learning drives me to seek out and develop innovative solutions for complex problems. At ASTERRA, we get to exercise these skills on Earth Observation SAR data in the form of gigapixel images (tens of thousands of pixels in each dimension). I also take great pride in ASTERRA’s work, because it helps clients protect the planet by providing critical insights into soil moisture and other environmental parameters.
Soil moisture estimation from satellite data is a notoriously difficult task, and large datasets are required to achieve accuracy with indirect machine learning approaches, such as transfer learning. These datasets often exceed the memory capacity of a single machine, posing significant obstacles to data processing and model training. In this blog post, we describe a practical solution using TensorFlow Records (TFRecord), a binary format optimized for TensorFlow, Google’s popular deep learning framework. We’ll explore the challenges, methodologies, and design choices that led us to this solution.
The primary challenge was managing a dataset that exceeds the memory capacity of our hardware. This can make data loading inefficient, slow down training, and even lead to out-of-memory errors. Addressing these issues is crucial for optimizing the performance and scalability of our machine learning models.
To ensure the effectiveness of our solution, we established clear, quantifiable success criteria:
1. Performance Improvement: A significant improvement in training speed and model performance compared to previous methods was essential.
2. Conversion Efficiency: We aimed to convert the entire image-patch dataset into TensorFlow Records in under an hour per 100 GB of data on an inexpensive EC2 instance type.
3. Data Loading Speed: The goal was to load and preprocess 5 batches of 1024 shuffled patches in under 2 seconds (GPU RAM is only a few GB), thereby shifting the bottleneck from data loading to GPU processing or model training.
4. Memory Management: The solution needed to prevent memory overflow and efficiently manage resources.
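To make the two numeric targets above concrete, here is a rough back-of-the-envelope calculation of the sustained throughput they imply. It is only a sketch: the decimal definition of a gigabyte and the patch shape and dtype (`HYPOTHETICAL_PATCH_SHAPE`, float32) are assumptions for illustration and are not taken from our actual dataset.

```python
# Throughput implied by the conversion and loading targets.
GB = 10**9  # assumption: decimal gigabytes

# Conversion target: 100 GB converted in under one hour.
conversion_mb_per_s = (100 * GB / 3600) / 10**6

# Loading target: 5 batches of 1024 patches in under 2 seconds.
patches_per_s = 5 * 1024 / 2.0

# Hypothetical patch: 64 x 64 pixels, 2 channels, float32 (4 bytes each).
HYPOTHETICAL_PATCH_SHAPE = (64, 64, 2)
BYTES_PER_VALUE = 4
patch_bytes = BYTES_PER_VALUE
for dim in HYPOTHETICAL_PATCH_SHAPE:
    patch_bytes *= dim

# Read bandwidth the loader would need for patches of that size.
loading_mb_per_s = patches_per_s * patch_bytes / 10**6

print(f"conversion: at least {conversion_mb_per_s:.1f} MB/s sustained")
print(f"loading:    at least {patches_per_s:.0f} patches/s")
print(f"loading:    about {loading_mb_per_s:.1f} MB/s for the hypothetical patch")
```

Numbers like these are useful mainly as a sanity check against the disk and network bandwidth of the chosen instance type.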
• Image Folder Layout (Kaggle-style): This method organizes each image patch into a separate file, grouped in folders by label. While straightforward, it can be inefficient for large-scale data processing (TensorFlow guide: Load and preprocess images).
• CSV and Flat Files: Traditional and easy to use, but not scalable for large datasets due to high memory consumption.
• HDF5 Format: Offers better storage efficiency and supports chunked data but lacks seamless integration with TensorFlow’s data pipeline (Loading NetCDFs in TensorFlow).
• Memory: The solution must ensure that no single operation requires loading the entire dataset into memory.
• Processing Speed: The data conversion and loading process should not become a bottleneck.
• Scalability: The solution should be scalable to larger datasets and adaptable to different types of data.
We evaluated several approaches before settling on TensorFlow Records:
• Data Generators: Initially considered for their ability to perform real-time data transformations (including augmentations), but they were found to be limiting in terms of performance.
• HDF5 with tf.data: Although efficient in storage, compatibility issues with TensorFlow’s data pipeline posed challenges. HDF5 is designed for efficient storage and access, but when dealing with very large datasets, it can lead to slower data throughput, particularly when data needs to be fetched and processed on-the-fly during training.
• Cache to disk using the tf.data API: While effective, it did not scale for us. For reasons we could not determine, we found it to be limited to 10 million samples, which is a significant limitation.
• TensorFlow Records: Ultimately chosen for its efficient data loading and preprocessing capabilities, leveraging TensorFlow’s tf.data API. TensorFlow I/O, an extension package to TensorFlow, also plays a crucial role by adding support for standard hierarchical and geospatial data structures.
While TensorFlow Records offered many advantages, it also had some drawbacks:
• Initial Setup Time: The process of converting data into TensorFlow Records can be time-consuming, requiring significant upfront effort.
• Flexibility: Once data is converted to TensorFlow Records, there is limited flexibility in experimenting with different data preparation techniques.
Step 1 – Data Preprocessing:
1. Clean and preprocess the raw data.
2. Crop the imagery into smaller patches.
3. Save the preprocessed patches as one HDF5 file per original image.
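The cropping step can be sketched as a generator of patch windows over an image. This is a minimal illustration, not our production code: the function name `patch_windows`, the patch size, the stride, and the choice to drop incomplete edge patches are all assumptions. Each window would then be used to slice the image array before the patch is written to HDF5 (for example with h5py, which is omitted here).

```python
from typing import Iterator, Tuple

Window = Tuple[int, int, int, int]


def patch_windows(height: int, width: int, patch: int, stride: int) -> Iterator[Window]:
    """Yield (row_start, row_stop, col_start, col_stop) for every full patch.

    Patches that would extend past the image border are skipped
    (an assumption; padding the image is another option).
    """
    for row in range(0, height - patch + 1, stride):
        for col in range(0, width - patch + 1, stride):
            yield row, row + patch, col, col + patch


# Toy example: a 10 x 8 image cut into non-overlapping 4 x 4 patches.
windows = list(patch_windows(height=10, width=8, patch=4, stride=4))
for window in windows:
    print(window)
```

Because the generator yields one window at a time, a gigapixel image can be cropped patch by patch without materializing the full list of patches in memory.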
Step 2 – Convert Data to TensorFlow Records:
1. Create a tf.data.Dataset of the patches from all the images using tensorflow_io.IODataset.from_hdf5.
2. Serialize the tensors using tf.io.serialize_tensor.
3. Write the data to TFRecord files using tf.io.TFRecordWriter.
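A minimal sketch of the serialization and writing steps follows. To keep it self-contained, random in-memory patches stand in for the dataset read from HDF5 with tensorflow_io.IODataset.from_hdf5; the patch shape, dtype, and output file name are hypothetical.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Stand-in for the patches read from HDF5 (assumption: 8 patches of
# shape 16 x 16 x 2, float32). The real pipeline would build this
# dataset from the HDF5 files instead.
patches = np.random.rand(8, 16, 16, 2).astype(np.float32)
dataset = tf.data.Dataset.from_tensor_slices(patches)

# Hypothetical output location for a single TFRecord file.
out_path = os.path.join(tempfile.mkdtemp(), "patches-00000.tfrecord")

# Each record holds one patch, serialized to a byte string.
with tf.io.TFRecordWriter(out_path) as writer:
    for patch in dataset:
        writer.write(tf.io.serialize_tensor(patch).numpy())

# Count the records that were written, as a quick sanity check.
num_records = sum(1 for _ in tf.data.TFRecordDataset(out_path))
print(f"wrote {num_records} records to {out_path}")
```

Iterating over the dataset one patch at a time means only a single patch needs to be held in memory while writing, regardless of the total dataset size.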
Step 3 – Data Loading and Preprocessing Pipeline:
1. Create a tf.data.Dataset from the TFRecord files using tf.data.TFRecordDataset.
2. Parse the serialized records using tf.io.parse_tensor.
3. Apply shuffling, batching, and prefetching using the tf.data.Dataset API.
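The loading pipeline can be sketched as follows. The example first writes a tiny TFRecord file so that it runs on its own; the patch shape, shuffle buffer size, and batch size are hypothetical and far smaller than in a real training run.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

PATCH_SHAPE = (16, 16, 2)  # hypothetical patch shape

# Write a small demo file with 10 serialized float32 patches.
path = os.path.join(tempfile.mkdtemp(), "demo.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    for _ in range(10):
        patch = np.random.rand(*PATCH_SHAPE).astype(np.float32)
        writer.write(tf.io.serialize_tensor(patch).numpy())


def parse_patch(serialized):
    """Decode one record back into a float32 tensor of known shape."""
    patch = tf.io.parse_tensor(serialized, out_type=tf.float32)
    # parse_tensor does not carry a static shape, so restore it here.
    return tf.ensure_shape(patch, PATCH_SHAPE)


dataset = (
    tf.data.TFRecordDataset([path])
    .map(parse_patch, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(buffer_size=10)   # buffer size is an illustrative value
    .batch(4)                  # batch size is an illustrative value
    .prefetch(tf.data.AUTOTUNE)
)

batch_shapes = [tuple(batch.shape) for batch in dataset]
print(batch_shapes)
```

Records are streamed from disk and only the shuffle buffer and the prefetched batches are held in memory, so the shuffle buffer size is the main knob trading shuffling quality against memory use.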
Efficiently handling large datasets is essential for modeling satellite-scale imagery, such as the SAR data we use for soil moisture estimation. By leveraging TensorFlow Records and the HDF5 support in TensorFlow I/O, we addressed key challenges related to data loading and memory management, ultimately improving our model’s training speed and performance. We continue to refine our methods and to train more complex models on larger datasets.
We look forward to exploring further optimizations and adapting our approach to various data types and use cases. My contributions, along with those of my team, are helping people in ways I never imagined, enabling us to provide valuable insights and solutions that aid in environmental conservation and resource management.
To learn more about ASTERRA’s AI technology and mission, watch this YouTube video featuring our Head of AI, Inon Sharony, or contact our talented team today.