Post-Storm Event Assessment: Damaged Building Detection and Classification

Group Name: Double-Y | Public Leaderboard: 11/222 (Top 5%)
EY Open Science Data Challenge Program 2024
Figure: Inference samples predicted by our trained model.

Figure: Geospatial analysis.

Summary

Coastal regions are extremely vulnerable to storms and tropical cyclones, which have caused significant economic losses and numerous fatalities. This underscores the urgent need for action to protect the sustainability and resilience of coastal communities. In this paper, we present an AI-driven geospatial analysis pipeline for automating coastal disaster assessment by detecting building damage from satellite imagery. First, we propose an effective and scalable pipeline to train an artificial intelligence (AI) model for damaged building detection using a limited dataset. Specifically, we use Microsoft’s Building Footprint dataset as pretraining data, allowing our AI model to quickly adapt to and learn the Puerto Rico landscape. Subsequently, we fine-tune the model in a carefully engineered sequence using manually annotated and self-annotated data. After training, we use our AI model to generate geospatial heatmaps of damaged building counts and damage ratios, which are useful for assessing storm damage and coastal vulnerability. Our approach placed us in the top 5% of the public leaderboard, and we were shortlisted for the global semi-final rounds.
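
To make the heatmap step concrete, below is a minimal sketch, not our exact pipeline, of how georeferenced building detections could be binned into per-cell damaged-building counts and a damage ratio. The detection tuple format, grid resolution, and bounding box are illustrative assumptions.

```python
# Minimal sketch (not our exact pipeline) of turning georeferenced detections
# into a damaged-building count heatmap and a damage-ratio heatmap.
# Each detection is assumed to be a (lon, lat, is_damaged) tuple; the grid
# resolution and bounds are illustrative.
import numpy as np

def damage_heatmaps(detections, bounds, grid=(50, 50)):
    """bounds = (min_lon, min_lat, max_lon, max_lat); returns (counts, ratio)."""
    min_lon, min_lat, max_lon, max_lat = bounds
    damaged = np.zeros(grid)
    total = np.zeros(grid)
    for lon, lat, is_damaged in detections:
        # Map each detected building to a grid cell (clamped to the grid edge).
        i = min(int((lon - min_lon) / (max_lon - min_lon) * grid[0]), grid[0] - 1)
        j = min(int((lat - min_lat) / (max_lat - min_lat) * grid[1]), grid[1] - 1)
        total[i, j] += 1
        damaged[i, j] += int(is_damaged)
    # Damage ratio per cell; cells with no detected buildings are left as NaN.
    ratio = np.where(total > 0, damaged / np.maximum(total, 1), np.nan)
    return damaged, ratio
```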

Competition Overview

1. Objective: The goal of the challenge is to develop a machine learning model to identify and detect “damaged” and “undamaged” coastal infrastructure (residential and commercial buildings) impacted by natural calamities such as hurricanes and cyclones. Participants are given pre- and post-cyclone satellite images of a site impacted by Hurricane Maria in 2017 and must build a machine learning model to detect four different objects in a satellite image of a cyclone-impacted area (a class-configuration sketch follows the list):

  • Undamaged residential buildings
  • Damaged residential buildings
  • Undamaged commercial buildings
  • Damaged commercial buildings
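
These four classes map directly to an object-detection label space. Below is a minimal sketch of an Ultralytics-style data configuration listing them; the dataset root, split paths, and the "buildings.yaml" file name are illustrative assumptions, not the official challenge layout.

```python
# Minimal sketch of an Ultralytics data configuration for the four target
# classes. The dataset root and split paths are illustrative assumptions.
data_yaml = """\
path: datasets/san_juan        # dataset root (illustrative)
train: images/train            # training images, relative to 'path'
val: images/val                # validation images, relative to 'path'
names:
  0: undamaged_residential_building
  1: damaged_residential_building
  2: undamaged_commercial_building
  3: damaged_commercial_building
"""

# Write the configuration so it can later be passed to training as data="buildings.yaml".
with open("buildings.yaml", "w") as f:
    f.write(data_yaml)
```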

2. Mandatory Dataset:

  • High-resolution panchromatic satellite images before and after a tropical cyclone: Maxar GeoEye-1 (optical)

3. Optional Dataset (that we used):

  • Microsoft Building Footprints (Puerto Rico region only), used as pretraining data

Key Challenges

1. Dataset Collection: Manually annotating all four classes in the provided high-resolution satellite dataset from Maxar's GeoEye-1 mission, covering an area of 327 sq. km of San Juan, Puerto Rico, is a time-consuming task. With only one month of competition duration, this poses significant challenges in terms of time and effort.

2. Class Imbalance: The dataset contains four unique classes. However, our analysis indicates that damaged buildings are significantly underrepresented compared to undamaged ones. Moreover, residential buildings are far more prevalent than commercial ones. This imbalance may introduce bias into the model, causing it to favor the majority classes.

3. Out-of-distribution data: We noticed that the competition’s validation dataset comprises only buildings in rural settings, whereas the training dataset mixes images from rural settings, industrial zones, and urban areas. Our empirical study reveals that mixing in images from non-rural settings can severely hamper model learning.



Key Elements and Assumptions

Before delving into the proposed methodology, we introduce the key elements and assumptions as shown here:

    Key Element               Description
1   Target Region             Puerto Rico
2   Object Detection Model    YOLOv8n
3   Microsoft BF dataset      Only the Puerto Rico region
4   Puerto Rico dataset       5,690 unique images
5   Non-experts               Annotators with limited expertise in the given task
6   Experts                   Annotators with expertise in the given task
7   Crowdsourced dataset      Dataset annotated by non-experts (200 unique images)
8   Expert dataset            Dataset annotated by the experts (28 unique images)

Our assumptions:

  1. When labelling multiple versions of the provided post-disaster dataset, we observed that not all annotated data aligns with the expected outcomes on the EY validation images: some of our annotated datasets yield a high mAP, while others yield a low mAP.
  2. Consequently, we treat the datasets that perform exceptionally well as the “expert dataset,” annotated by “expert annotators.” Conversely, the datasets that do not perform as well as the expert dataset are referred to as the “crowdsourced dataset.”
  3. We assume expert annotators can effectively differentiate damaged/undamaged commercial and residential buildings. Logically, the expert dataset is high in quality but low in quantity.
  4. Meanwhile, the crowdsourced dataset is high in quantity but lower in annotation quality. We assume that, in a real-life scenario, it would be labelled by volunteer annotators rather than experts.
  5. We assume all buildings in the Microsoft BF dataset are undamaged residential buildings (since the majority of buildings are residential). The exact class is not important, since the dataset is only used for pretraining (see the sketch after this list).
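
To illustrate assumption 5 in practice, here is a minimal sketch, assuming the Microsoft Building Footprints polygons have already been reprojected into a tile's pixel coordinates, of how each footprint could be written out as a YOLO-format box of class 0 (undamaged residential building). The function name and argument layout are illustrative, not our exact preprocessing code.

```python
# Minimal sketch: convert footprint polygons (already in pixel coordinates of a
# tile) into YOLO-format pretraining labels. Per assumption 5, every footprint
# is assigned class 0 (undamaged residential building).
def footprints_to_yolo_labels(footprints_px, tile_w, tile_h, out_path):
    """footprints_px: list of polygons, each a list of (x, y) pixel vertices."""
    lines = []
    for poly in footprints_px:
        xs = [x for x, _ in poly]
        ys = [y for _, y in poly]
        # Axis-aligned bounding box of the footprint, clipped to the tile.
        x_min, x_max = max(min(xs), 0), min(max(xs), tile_w)
        y_min, y_max = max(min(ys), 0), min(max(ys), tile_h)
        if x_max <= x_min or y_max <= y_min:
            continue  # footprint lies outside this tile
        # YOLO format: class x_center y_center width height (all normalised).
        xc = (x_min + x_max) / 2 / tile_w
        yc = (y_min + y_max) / 2 / tile_h
        w = (x_max - x_min) / tile_w
        h = (y_max - y_min) / tile_h
        lines.append(f"0 {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}")
    with open(out_path, "w") as f:
        f.write("\n".join(lines))
```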

Methodology Overview

Figure 1. Overview of the proposed methodology.
Workflow illustrating the complete process from data acquisition to model training.


Model

The goal of Phase 1 is to identify and detect “damaged” and “undamaged” coastal infrastructure, which is an object detection task. To tackle this challenge, our team opted for Ultralytics YOLOv8, one of the state-of-the-art (SOTA) object detection models renowned for its speed and accuracy. Despite the availability of competitors like YOLOv9, we prefer Ultralytics YOLOv8 for its user-friendliness and well-documented workflows that streamline training and deployment. We chose the smallest variant, YOLOv8n, since it is unwise to use a larger model when dealing with a limited dataset, as it may lead to overfitting. Given more time and a bigger dataset, we would explore larger YOLOv8 variants and other SOTA models. Meanwhile, our empirical study revealed that the main factor influencing detection accuracy is the quantity and quality of the annotated dataset. Hence, we argue that the main focus of the challenge should be data annotation. We provide details on how we built our training dataset in the next section.
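
For reference, a minimal Ultralytics training sketch is shown below; the data configuration (the illustrative "buildings.yaml" from the earlier sketch) and the hyperparameters are placeholders rather than our exact competition settings.

```python
# Minimal sketch of training YOLOv8n with Ultralytics; the data YAML and
# hyperparameters are illustrative, not our exact competition settings.
from ultralytics import YOLO

# Start from COCO-pretrained weights of the smallest YOLOv8 variant.
model = YOLO("yolov8n.pt")

# Fine-tune on the building-damage dataset described by the data YAML.
model.train(data="buildings.yaml", epochs=100, imgsz=640, batch=16)

# Validate (reports mAP) and run inference on a post-storm tile.
metrics = model.val()
results = model.predict("post_storm_tile.png", conf=0.25)
```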



Submission Experiment

We conducted a comprehensive series of experiments, submitting a total of 30 entries. Here are select highlights:

Table I. Ablation study of the submission setups (✓ = component used).

Setup   Pretraining   Crowdsourced Dataset   Expert Dataset   MLOps   mAP
A       ✓                                                             0.10
B       ✓             ✓                                               0.44
C       ✓                                    ✓                        0.39
D       ✓             ✓                      ✓                        0.50
E                     ✓                      ✓                        0.24
F       ✓             ✓                      ✓                ✓       0.51

Details of each setup are given below:

  • Setup A: 0.10 - We pretrained a YOLOv8n model using the Puerto Rico dataset. Surprisingly, we achieved an mAP of 0.10 on the EY validation dataset, without any manual annotation from our side!
  • Setup B: 0.44 - When fine-tuning the pretrained model on the crowdsourced dataset, we achieved an mAP of 0.44, which exceeds the completion threshold for this challenge (mAP 0.40).
  • Setup C: 0.39 - When fine-tuning the pretrained model directly on the expert dataset, we achieved an mAP of 0.39, despite the dataset containing only 28 unique images (84 after augmentation). This shows that data quality is just as important as, if not more important than, data quantity.
  • Setup D: 0.50 - We first fine-tune the pretrained model on the large-scale crowdsourced dataset to quickly warm it up, and subsequently fine-tune it on the expert dataset, which has more accurate labels. With this approach, we achieved an mAP of 0.50 (see the sketch after this list).
  • Setup E: 0.24 - We demonstrate that without pretraining, performance is not satisfactory even when both the crowdsourced and expert datasets are utilised, achieving only an mAP of 0.24.
  • Setup F: 0.51 - Finally, we demonstrate that by employing the proposed MLOps cycle, we can raise the model’s mAP to 0.51. Notably, the sole human intervention in this MLOps cycle is verifying the data self-labelled by the baseline model from Setup E.
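
As referenced in Setup D above, the staged fine-tuning, together with the self-labelling step behind the MLOps cycle of Setup F, can be sketched as follows. The data YAML names, epoch counts, and run directories are illustrative assumptions rather than our exact configuration.

```python
# Minimal sketch of the staged fine-tuning in Setup D and the self-labelling
# step behind the MLOps cycle of Setup F. Data YAML names, epoch counts, and
# run directories are illustrative assumptions.
from ultralytics import YOLO

# Stage 1: pretrain on the Puerto Rico set derived from Microsoft Building Footprints.
model = YOLO("yolov8n.pt")
model.train(data="puerto_rico_bf.yaml", epochs=50, imgsz=640)

# Stage 2: warm up on the large but noisier crowdsourced dataset.
model = YOLO("runs/detect/train/weights/best.pt")
model.train(data="crowdsourced.yaml", epochs=50, imgsz=640)

# Stage 3: fine-tune on the small, high-quality expert dataset.
model = YOLO("runs/detect/train2/weights/best.pt")
model.train(data="expert.yaml", epochs=30, imgsz=640)

# Self-labelling step: predict on unlabelled tiles and export YOLO-format
# pseudo-labels, which a human then verifies before adding them to the training pool.
best = YOLO("runs/detect/train3/weights/best.pt")
for result in best.predict(source="unlabelled_tiles/", conf=0.5, stream=True):
    result.save_txt(result.path.rsplit(".", 1)[0] + ".txt")
```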


Key Takeaways

1. Dataset quality is what you need: There are two observations from our study. First, data quality is as important as data quantity. Second, having annotators with expertise in building damage assessment is crucial for producing the high-quality 'expert dataset'; non-experts tend to generate a lower-quality dataset, which we refer to as the 'crowdsourced dataset'. However, a high-quality dataset tends to be smaller in size because it takes time to annotate the data carefully, while a high-quantity dataset tends to have lower quality due to a lack of expertise and attention. This mirrors the real-world quality-quantity tradeoff. Fortunately, we found that we can combine the strengths of both datasets, as demonstrated in Setup D of our ablation study in Table I: fine-tune the pretrained model on the crowdsourced dataset first, then fine-tune it on the expert dataset.

2. Start with a small model: We recommend starting with a smaller model. It is unwise to use a larger model when dealing with a limited dataset, as it may lead to overfitting. Our empirical study supports this hypothesis: we failed to achieve a high mAP score using bigger YOLOv8 variants. Given more time and a larger dataset, we would explore the bigger YOLOv8 variants and other state-of-the-art (SOTA) models.



Technological Stack

Ultralytics · Roboflow · PyTorch · Jupyter · Python