Synthetic Aperture Radar (SAR) is an advanced remote sensing radar system that captures detailed images of the Earth’s surface using radio waves. SAR operates by sending out radio waves from an antenna on an aircraft or satellite towards the ground. These waves bounce back from the Earth’s surface and return to the antenna as echoes. As the antenna moves along a path, it collects echoes from various positions, constructing a large and detailed image over time.
(SAR image taken over USA)
One of the significant advantages of using radio waves is their ability to penetrate clouds, fog, and darkness, enabling SAR to operate under almost all-weather conditions and at any time of day. Additionally, different surfaces reflect radio waves differently, allowing SAR to detect fine details like surface roughness, moisture levels, and even slight changes in terrain (1).
(NASA SAR Handbook (9))
Radio frequencies have long wavelengths, so a trick was devised to keep spatial resolution high – the “synthetic aperture”. It refers to the method of combining data collected from the antenna’s successive positions along its path to create a synthetic aperture much larger than the actual physical antenna. This synthetic aperture allows SAR to capture very high-resolution images, much like having a huge camera lens, effectively as long as the stretch of the flight path over which the target stays in view.
This combination of high-resolution imaging and all-weather capability makes SAR an invaluable tool for various applications, including Earth observation and disaster monitoring.
At ASTERRA, the pursuit of excellence drives us to continuously seek improvements and innovations. As advancements are made, we strive to enhance existing methods and create extraordinary products for our customers. The inherent complexity of SAR imagery, with its unique challenges such as high noise levels, speckle, and the need for fine-grained analysis, demands the power of machine learning. In particular, deep learning is essential to extract meaningful insights and achieve precise detection.
For the past decade, convolutional neural networks (CNNs) have been the go-to approach for image classification. They are effective, produce excellent results, and are relatively straightforward to train and maintain. For example, numerous studies have demonstrated the success of CNNs in classifying ships and icebergs using SAR images (2)(3). The Transformer model, initially introduced in the paper “Attention Is All You Need” (6), revolutionized natural language processing. Later, this architecture was adapted for visual tasks in “An image is worth 16×16 words: Transformers for image recognition at scale” (10). The challenge with using ViTs for SAR images lies in the vast amount of data required for training: the original ViT model was trained on the JFT-300M dataset, and 300 million samples is a scale of data not commonly available for SAR imagery.
(From An image is worth 16×16 words: Transformers for image recognition at scale)
Recent studies have shown promising results in applying Swin Transformers (Shifted Windows) to SAR image classification (4) (5). For instance, integrating Swin Transformers into environmental monitoring systems has led to significant improvements in detecting changes in terrain and identifying objects such as ships and icebergs (11)(12). Quantitative analysis demonstrates that Swin Transformers achieve higher accuracy and robustness compared to traditional CNNs and ViTs, making them a valuable tool in SAR image processing.
(SAR Ship Detection Based on Swin Transformer and Feature Enhancement Feature Pyramid Network (4))
At ASTERRA, we’ve seen that detecting subterranean physical phenomena from space-borne SAR requires learning feature correlations at all ranges, and ViT offers a way to balance that need against the scarcity of labeled ground truth. Labeling SAR images is harder than labeling typical imagery: the human eye cannot always interpret a SAR scene, so we lack the visual cues that would normally guide annotation. Labeling often requires field work, which is expensive, physically demanding, and time consuming. As a result, the amount of ground truth labels we can acquire is limited, and we therefore need a model that learns efficiently from it.
The potential benefits of integrating advanced transformer-based architectures into SAR image analysis are significant. By leveraging these innovative methods, we can push the boundaries of what is possible in SAR image classification and detection and contribute to better environmental monitoring and protection.
When I first thought about a topic for this article, I came across Tokens-to-Token (T2T) Transformers. They seemed interesting and promising; I even wrote a few pages about them. But while investigating, I stumbled upon the actual (new) topic of this article – Swin Transformers. I was surprised at how few mentions T2T has on the Internet compared to Swin Transformers.
I will present Tokens-to-Token and then go into Swin Transformers, letting you see for yourself how Swin Transformers stand out.
Imagine a theatre stage with a performance taking place. The stage represents the image that needs to be analyzed, and the audience represents the neural network trying to understand the scene.
In ViT the entire audience is looking at the entire stage at once. Every detail from every part of the stage is equally visible to the audience. All parts of the stage are given equal attention. This global view allows all interactions and elements to be seen simultaneously but can be overwhelming and computationally expensive to process every detail at once.
In T2T, imagine a spotlight shining on the stage, initially casting a broad, unfocused beam. This broad spotlight represents the first stage in T2T, where the image is divided into small patches—basic tokens capturing the overall scene without much detail. As the spotlight narrows and sharpens its focus on specific parts of the stage, it begins to reveal finer details, like how T2T progressively refines and merges tokens to capture more complex patterns.
As the spotlight moves across the stage, continuing to refine its focus, it brings out clearer and more informative views of the performance. This mirrors how T2T processes and enhances tokens layer by layer, resulting in a final, detailed understanding of the image. The process ends with a thoroughly illuminated stage, just as T2T outputs refined tokens ready for classification or detection tasks.
In more technical terms, Vision Transformers (ViT) divide the image into fixed-size patches; each patch is projected into an embedding space using a linear transformation, and a positional encoding is added before the sequence is fed into a standard transformer encoder. The transformer processes these patches in a flat, non-hierarchical manner, treating them all equally.
ViTs require huge amounts of data because the model lacks the inductive biases inherent in CNNs. Additionally, the quadratic complexity of self-attention with respect to the number of patches poses a challenge, especially for high-resolution images like SAR.
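To make this concrete, here is a minimal sketch of a ViT-style patch embedding in PyTorch. The image size, patch size, and embedding dimension are illustrative, and the single input channel stands in for a single-band SAR chip; this is a sketch of the idea, not production code.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and project each to an embedding vector."""
    def __init__(self, img_size=224, patch_size=16, in_channels=1, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to cutting patches and applying a linear layer.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Learned absolute positional embeddings, added (not concatenated) to the tokens.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                        # x: (B, C, H, W)
        x = self.proj(x)                         # (B, embed_dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)         # (B, num_patches, embed_dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)           # prepend the [CLS] token
        return x + self.pos_embed                # add positional information

# Example: two single-channel 224x224 SAR chips
tokens = PatchEmbedding(in_channels=1)(torch.randn(2, 1, 224, 224))
print(tokens.shape)   # torch.Size([2, 197, 768])
```

The resulting flat sequence of 197 tokens is what the standard transformer encoder then processes with global self-attention, which is where the quadratic cost comes from.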
To apply T2T-ViTs to images, the process starts by dividing the image into small patches, which are then progressively refined and aggregated into larger, more informative tokens. These tokens are linearly embedded into high-dimensional vectors, with positional encodings added to retain spatial information. The embedded tokens are processed through transformer layers that use self-attention to capture global context and long-range dependencies. Finally, the refined tokens are used for classification, detection, or segmentation, with task-specific heads added to produce the desired output.
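Below is a rough sketch of one T2T-style “soft split” step, assuming PyTorch; the kernel size, stride, and feature dimensions are illustrative. The point it shows is that unfolding overlapping patches lets neighbouring tokens share pixels as they are merged into a shorter sequence of richer tokens.

```python
import torch
import torch.nn as nn

def soft_split(feature_map, kernel_size=3, stride=2, padding=1):
    """One T2T-style 'soft split': unfold overlapping patches so neighbouring
    tokens share pixels, then treat each patch as a new, larger token."""
    unfold = nn.Unfold(kernel_size=kernel_size, stride=stride, padding=padding)
    patches = unfold(feature_map)            # (B, C*k*k, L) where L = number of new tokens
    return patches.transpose(1, 2)           # (B, L, C*k*k): fewer, richer tokens

# Illustrative pipeline: token map -> soft split -> fewer, larger tokens
B, C, H, W = 2, 64, 56, 56
x = torch.randn(B, C, H, W)                  # pretend this is the re-structurized token map
tokens = soft_split(x)                       # (2, 784, 576): 28*28 tokens of dim 64*3*3
print(tokens.shape)
```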
In Swin Transformers a spotlight approach is also used. Imagine a spotlight moving around the stage, illuminating only one specific area at a time. This spotlight focuses the audience’s attention on a smaller, local region of the stage, allowing them to see the details and interactions within the illuminated area more clearly and efficiently. As the play progresses, the spotlight moves to different areas of the stage, shifting slightly to ensure that every part of the stage is eventually illuminated and observed as relevant to the events on stage. This shift allows the audience to piece together a coherent understanding of the entire play by combining insights from different local regions.
The spotlight not only moves around but also changes sizes. At the beginning it may focus on very small, detailed areas. Later it might cover larger areas, summarizing broader patterns and context.
Shifted Window Vision Transformers (Swin Transformers) build on and improve ViT. Swin Transformers introduce a hierarchical approach by dividing images into non-overlapping local windows, each processed independently. Within each window, patches are embedded and processed using self-attention. In subsequent layers, the window grid is shifted, typically by half the window size. This shifting mechanism ensures that connections and dependencies between adjacent windows are captured, allowing the model to build a more comprehensive understanding of the entire image.
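The following sketch, again in PyTorch with illustrative sizes, shows the two core operations just described: partitioning a feature map into non-overlapping windows, and cyclically shifting the map so that the next layer’s windows straddle the previous ones.

```python
import torch

def window_partition(x, window_size=7):
    """Split a feature map (B, H, W, C) into non-overlapping windows of shape
    (num_windows * B, window_size * window_size, C); attention runs inside each window."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

# In the next layer the feature map is cyclically shifted by half a window before
# partitioning, so each new window straddles four of the previous windows.
B, H, W, C, ws = 1, 56, 56, 96, 7
x = torch.randn(B, H, W, C)
windows = window_partition(x, ws)                                 # (64, 49, 96)
shift = ws // 2
shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
shifted_windows = window_partition(shifted, ws)                   # same shape, new neighbourhoods
print(windows.shape, shifted_windows.shape)
```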
By limiting self-attention computation to smaller windows, Swin Transformers significantly reduce the computational burden compared to the global self-attention in ViTs. Because each window holds a fixed number of patches, attention is quadratic only within a window, so the overall cost grows linearly with image size rather than quadratically. As the network goes deeper, Swin Transformers incorporate patch merging layers, which progressively merge adjacent patches, reducing the spatial dimensions and increasing the feature dimensions. This hierarchical reduction is similar to the pooling operations in CNNs but maintains the ability to capture long-range dependencies through shifted windows.
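Patch merging can be sketched as follows, with illustrative dimensions and loosely following the Swin design: each 2×2 group of neighbouring patches is concatenated channel-wise and linearly projected, halving the spatial resolution while increasing the feature dimension.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Swin-style downsampling: group each 2x2 neighbourhood of patches,
    concatenate their features (C -> 4C) and project to 2C, halving H and W."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                        # x: (B, H, W, C)
        x0 = x[:, 0::2, 0::2, :]                 # top-left of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]                 # bottom-left
        x2 = x[:, 0::2, 1::2, :]                 # top-right
        x3 = x[:, 1::2, 1::2, :]                 # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))      # (B, H/2, W/2, 2C)

out = PatchMerging(dim=96)(torch.randn(1, 56, 56, 96))
print(out.shape)   # torch.Size([1, 28, 28, 192])
```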
Swin Transformers use relative positional encoding, which is better at capturing the spatial relationships within small windows. This is more effective than the absolute positional encoding used in ViTs, which can struggle to model local details accurately. You may have encountered this in generative models, where fine details in an image come out blurry or unrealistic – fingers in synthetic images of people, for example. The self-attention within each window helps the model focus on prominent features in small regions, while the shifting windows ensure that connections between these regions are captured. This approach allows Swin Transformers to effectively balance the details within small areas and the overall context of the image.
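To make “relative” concrete, here is a small sketch of how a relative position bias can be built for a single window, roughly in the spirit of the Swin design; the window size and number of heads are arbitrary. Each pair of positions inside the window looks up a learnable bias based on their relative offset, and that bias is added to the attention logits.

```python
import torch
import torch.nn as nn

window_size, num_heads = 7, 3
num_positions = window_size ** 2                            # 49 positions per window

# One learnable bias per possible relative offset between two positions in a window.
bias_table = nn.Parameter(torch.zeros((2 * window_size - 1) ** 2, num_heads))

# Relative (dy, dx) offset between every pair of positions in the window.
coords = torch.stack(torch.meshgrid(torch.arange(window_size),
                                    torch.arange(window_size), indexing="ij"))
coords = coords.flatten(1)                                  # (2, 49)
rel = coords[:, :, None] - coords[:, None, :]               # (2, 49, 49)
rel = rel.permute(1, 2, 0) + (window_size - 1)              # shift offsets to start at 0
index = rel[..., 0] * (2 * window_size - 1) + rel[..., 1]   # (49, 49) lookup indices

# The bias is added to each head's attention logits inside the window.
bias = bias_table[index.view(-1)].view(num_positions, num_positions, num_heads)
bias = bias.permute(2, 0, 1)                                # (num_heads, 49, 49)
print(bias.shape)   # torch.Size([3, 49, 49])
```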
Now that you are familiar with Swin Transformers, we will talk a bit more about T2T (8).
T2T improves the standard Vision Transformer by refining tokenization, but it does not inherently encode the hierarchical structure. T2T-ViT focuses on progressively refining and merging tokens, but it processes the image as a flat sequence, which may limit its ability to capture multi-scale features as effectively as Swin Transformers.
Additionally, Swin Transformers are more computationally efficient because they process images in smaller, localized windows, reducing the complexity of handling high-resolution images like SAR. T2T-ViT, while improving tokenization, still processes the entire image at once, which can be less efficient.
In my opinion, Swin Transformers are the better fit for my uses.
The field of SAR image processing continues to evolve with ongoing research focusing on optimizing transformer architectures for enhanced performance on SAR data, particularly in applications like those of ASTERRA. Our use cases demand the ability to detect phenomena at the scale of a single pixel within a much larger context, often under constraints of limited labels. To make the most of the available SAR data, another key area of focus is the development of more efficient training algorithms and techniques, such as data augmentation, transfer learning, and semi-supervised learning. These innovations hold the potential to enhance our ability to monitor and respond to environmental changes, contributing to better management and protection of our planet.
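As a closing illustration of the transfer-learning idea, here is a hedged sketch of fine-tuning an ImageNet-pretrained Swin-T from torchvision on a small set of labeled SAR chips. The two-class head, the frozen early stages, and the channel replication are illustrative choices for working with limited labels, not a description of ASTERRA’s pipeline.

```python
import torch
import torch.nn as nn
from torchvision.models import swin_t, Swin_T_Weights

# Start from ImageNet weights and fine-tune only what a small label budget allows.
model = swin_t(weights=Swin_T_Weights.IMAGENET1K_V1)
model.head = nn.Linear(model.head.in_features, 2)   # two illustrative target classes

# Optionally freeze the patch embedding and first stage; train the rest plus the new head.
for name, param in model.named_parameters():
    if name.startswith("features.0") or name.startswith("features.1"):
        param.requires_grad = False

# Single-channel SAR chips can be replicated to three channels to match the pretrained stem.
sar_chips = torch.randn(4, 1, 224, 224).repeat(1, 3, 1, 1)
logits = model(sar_chips)
print(logits.shape)   # torch.Size([4, 2])
```

Combined with SAR-appropriate augmentation and semi-supervised pretraining on unlabeled scenes, this kind of setup is one way to stretch a limited pool of ground truth labels further.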