Developing a Predictive Model to Assess How Weather Influences Urban Air Quality Using Amazon SageMaker

Air Pollution and Its Challenges


Air pollution poses a significant challenge in urban areas, adversely affecting human health, wildlife, vegetation, and infrastructure. As urban populations grow, the issue has drawn increasing attention, and the 2018 KDD Cup, hosted by ACM SIGKDD, was dedicated to this topic.

Fossil fuel combustion for transportation and heating contributes significantly to urban air pollution, primarily through the emission of nitrogen dioxide (NO2), a secondary pollutant formed by the oxidation of nitric oxide (NO). NO2 is a major factor in respiratory illnesses. In the European Union, the CAFE Directive 2008/50/EC sets a maximum hourly limit of 200 μg/m³ and an annual mean of 40 μg/m³ for NO2, allowing no more than 18 exceedances of the hourly limit each year.
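
As a concrete illustration, the following sketch counts hourly exceedances against these limits with pandas; the data here is synthetic and stands in for a real hourly NO2 series:

import numpy as np
import pandas as pd

# Synthetic hourly NO2 readings (μg/m³) as a stand-in for real monitoring data
idx = pd.date_range("2011-01-01", "2016-12-31 23:00", freq="H")
rng = np.random.default_rng(0)
no2_hourly = pd.Series(rng.gamma(2.0, 20.0, size=len(idx)), index=idx)

# Count hours above the 200 μg/m³ hourly limit, per calendar year
exceedances_per_year = (no2_hourly > 200).groupby(no2_hourly.index.year).sum()
print(exceedances_per_year)

# The directive allows at most 18 exceedances of the hourly limit per year
print(exceedances_per_year <= 18)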

Many cities worldwide report daily air quality levels. In this article, we explore air quality data using Amazon SageMaker, a fully managed service that simplifies the process for developers and data scientists to create, train, and deploy machine learning models at scale.

The Scenario

In this example, we focus on the connection between the air pollutant NO2 and weather patterns in Dublin, Ireland. The air quality data is sourced from a longstanding monitoring station operated by the Irish Environmental Protection Agency, located in Rathmines, a suburb about 3 kilometers from Dublin’s city center. Dublin, the capital of Ireland, has a population of roughly one million. The city is bordered by the sea to the east and mountains to the south, which influence wind patterns over the area.

Weather data is obtained from a long-established weather station at Dublin Airport, located about 12 kilometers north of the city center.

The Tools

  • Amazon SageMaker for data exploration and machine learning
  • Amazon S3 for storing the data to be analyzed

The Data

The hourly air pollution datasets from the Rathmines monitoring station cover the years 2011 to 2016 and are published by the Irish Environmental Protection Agency as Open Data, where more information and downloads are available. Additionally, a historical weather dataset for Dublin Airport dating back to 1942 is made available by the Irish Meteorological Service under a Creative Commons License.

For broader studies, OpenAQ provides a comprehensive repository of air quality data, accessible via the Registry of Open Data on AWS.

Preparing the Data for Analysis

Before uploading the data to Amazon S3, we performed several data wrangling steps (a code sketch follows the list):

  • Weather Data: The original dataset contained far more parameters than we needed. We removed the header, converted wind speed from knots to meters per second, and selected a subset of relevant parameters based on the scientific literature; ambiguous parameter names were clarified, such as renaming ‘rain’ to ‘rain_mm’ for precipitation in millimeters.
  • Air Quality Data: Each year’s data was published in a separate file, and we used only the years reported in SI units, which limited the study to 2011 through 2016. The yearly files were combined into a single dataset.
  • Sample Rate: The weather data consists of daily averages, while the air quality data is hourly, so we resampled the air quality data to 24-hour averages, renaming NO2 to NO2_avg to denote this.
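
The following pandas sketch illustrates these wrangling steps; the file names, column names, and preamble length are illustrative placeholders, not the exact layout of the published datasets:

import glob
import pandas as pd

# Weather data: skip the file preamble, keep a subset of relevant parameters,
# and convert wind speed from knots to meters per second (1 knot = 0.514444 m/s)
weather = pd.read_csv("dublin_airport_daily.csv", skiprows=1, parse_dates=["date"])
weather = weather[["date", "rain", "temp", "wdsp"]].rename(columns={"rain": "rain_mm"})
weather["wdsp"] = weather["wdsp"] * 0.514444

# Air quality data: combine the yearly files (2011-2016) into a single dataset
yearly_files = sorted(glob.glob("rathmines_no2_201[1-6].csv"))
air = pd.concat(pd.read_csv(f, parse_dates=["datetime"]) for f in yearly_files)

# Resample the hourly NO2 readings to 24-hour averages and rename accordingly
no2_daily = (air.set_index("datetime")["NO2"]
                .resample("D").mean()
                .rename("NO2_avg"))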

After transforming the data, we uploaded it to our S3 bucket, preparing for analysis with Amazon SageMaker’s notebook capabilities.
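
A minimal upload sketch using boto3 follows; the bucket and key names are placeholders:

import boto3

s3 = boto3.client("s3")
# Upload the prepared CSV files to a placeholder bucket; substitute your own
s3.upload_file("weather_daily.csv", "my-air-quality-bucket", "data/weather_daily.csv")
s3.upload_file("no2_daily.csv", "my-air-quality-bucket", "data/no2_daily.csv")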

Exploring Data with Amazon SageMaker

We utilized Amazon SageMaker’s notebook functionality to explore our data. To create a Jupyter Notebook, we accessed the Amazon SageMaker console and selected “Create notebook instance.” We named our instance and created a new IAM role to grant SageMaker access to our S3 data. After a brief setup period, we opened our Jupyter environment.
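
The same notebook instance can also be created programmatically with boto3; this sketch assumes an existing IAM role with S3 access (the ARN below is a placeholder):

import boto3

sm = boto3.client("sagemaker")
# Create a small notebook instance; replace the RoleArn with your own role
sm.create_notebook_instance(
    NotebookInstanceName="air-quality-exploration",
    InstanceType="ml.t2.medium",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerS3AccessRole",
)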

Next, we downloaded the companion notebook and uploaded it to the Jupyter console. Upon opening the notebook, we loaded essential libraries for data analysis:

%matplotlib inline
import pandas as pd
from datetime import datetime
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

These libraries let us analyze the data with pandas, a widely used data manipulation tool, alongside NumPy for scientific computing, while Seaborn and Matplotlib provide powerful visualization capabilities.

Loading Prepared Data into Amazon SageMaker

With the notebook set up and libraries imported, we proceeded to load our data using pandas, allowing us to explore and manipulate tabular data directly in Python. We utilized the pandas.read_csv command with the S3 data locations.
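
For example, assuming the s3fs package is installed so that pandas can read s3:// URLs directly (the bucket, keys, and date column name are placeholders matching the earlier sketches):

# Read the prepared datasets straight from S3
weather = pd.read_csv("s3://my-air-quality-bucket/data/weather_daily.csv",
                      parse_dates=["date"])
no2 = pd.read_csv("s3://my-air-quality-bucket/data/no2_daily.csv",
                  parse_dates=["date"])

# Join daily weather and daily-average NO2 on the date column
df = weather.merge(no2, on="date")
df.head()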

This process sets the stage for further analysis and model building.
