
5.2. Extreme Value Analysis 1 (Assignment)#

Sampling extremes#

Objective#

Apply sampling techniques to identify extremes within an historical database.

Introduction#

The first step in modelling the extreme observations of our loading variable is to extract them from our database. In this notebook, we will use the wave height dataset seen in the videos to put sampling techniques into practice. We will follow these steps:

  0. Load libraries and data

  1. Sampling extremes with Block Maxima

  2. Sampling extremes with Peak Over Threshold

  3. Let’s talk about threshold and declustering time selection

0. Load libraries and data#

Install dependencies if needed.

!pip install pandas numpy scipy statsmodels matplotlib
%matplotlib inline

Import necessary libraries (include more if needed).

import pandas as pd
import numpy as np
from scipy import stats
from scipy.signal import find_peaks
import os
import matplotlib.pyplot as plt
import matplotlib
import datetime

Load the database of buoy measurements from surfdrive.

file_key = ['2Z1OHYg5Gv9PIAK', 'o5CSK5igJiov4Lq', 'Xk3sdPJDb75OcNX']
dataset_url = ('https://surfdrive.surf.nl/files/index.php/s/'
               + file_key[0] + '/download')
data = pd.read_csv(dataset_url, sep = ';')
data.head()

The first column of the database corresponds to the date and time of each measurement, so we are going to give it the right format. The second column corresponds to the hourly measured significant wave height (\(H_s\)).

data['Date (GMT)'] = pd.to_datetime(data['Date (GMT)'])
data.head()

We can start the analysis of the data by plotting the time series.

fig, axs = plt.subplots(1, 1)
axs.plot(data['Date (GMT)'], data['Hs(m)'], 'k', label='${H_s(m)}$')
axs.set_title('Significant wave height time series')
axs.set_xlabel('Date')
axs.set_ylabel('${H_s (m)}$')
fig.set_size_inches(15, 4)
axs.grid()
plt.show()

In the figure, you can see some outliers. When the buoy is not able to measure, it records -9,999. Before going ahead with the analysis, we need to remove these values. Afterwards, we can plot the data again.

data = data[data['Hs(m)'] >= 0]
data = data.reset_index(drop=True)  # renumber the rows after dropping the invalid measurements
fig, axs = plt.subplots(1, 1)
axs.plot(data['Date (GMT)'], data['Hs(m)'], 'k', label = '${H_s(m)}$')
axs.set_title('Significant wave height time series')
axs.set_xlabel('Date')
axs.set_ylabel('${H_s (m)}$')
fig.set_size_inches(15, 4)
axs.grid()

Now we have a clean database to work with! As you can see, the buoy has been measuring from 1985 until 2005, so we have almost 20 years of measurements. However, these data alone are not enough to characterize the right tail of the distribution (high return periods), so we need to model it.

1. Sampling extremes with Block Maxima#

This first sampling technique consists of defining a time block and selecting the maximum observation within that time block.

Task #1: identify extreme events using Block Maxima. Define a function which implements Block Maxima with a time block equal to a year. The function must take as input the buoy data. The output of the function must be a dataframe with the two columns of the original data.

Hint: using pd.DatetimeIndex(data['Date (GMT)']).year, you can extract the year of each observation.

'''
def yearly_maxima(data):
    your function
    return max_list
'''
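If you are unsure how to start, below is a minimal sketch of one possible implementation (not necessarily the only valid approach): it groups the observations by calendar year and keeps the row with the largest significant wave height in each group, using the column names of the dataset above.

def yearly_maxima(data):
    # Extract the calendar year of each observation (see the hint above)
    years = pd.DatetimeIndex(data['Date (GMT)']).year
    # For each year, locate the row with the largest Hs(m) and keep it
    idx = data.groupby(years)['Hs(m)'].idxmax()
    max_list = data.loc[idx].reset_index(drop=True)
    return max_list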

Let’s apply your function to our data!

yearly_maxima_list = yearly_maxima(data)
yearly_maxima_list

We have obtained one maximum value per year, so 21 extreme values. We can also plot them on our time series to see how they look!

fig, axs = plt.subplots(1, 1)
axs.plot(data['Date (GMT)'], data['Hs(m)'], 'k', label='${H_s(m)}$ time series')
axs.scatter(yearly_maxima_list['Date (GMT)'], yearly_maxima_list['Hs(m)'], 40, 'r', label = 'Yearly Maxima')
axs.set_title('Time series and block maxima')
axs.set_xlabel('Date')
axs.set_ylabel('${H_s (m)}$')
fig.set_size_inches(15, 4)
axs.grid()
axs.legend()

2. Sampling extremes with Peak Over Threshold#

Let’s go with the second technique: Peak Over Threshold (POT). This sampling technique involves defining two parameters:

  • A threshold: value above which an observation is considered extreme.

  • Declustering time: minimum time distance between two extreme observations.

Based on these parameters, we select the observations that are above the threshold and separated from each other by at least the declustering time.

Task #2: identify extreme events using the POT method. Define a function which implements the POT method. The function must take as input (1) the buoy data, (2) the threshold, and (3) the declustering time. The output of the function must be a dataframe with the same columns as the data.

Hint: you can use the function find_peaks from the SciPy library.

'''
def pot_method(data, threshold, dtime):
    your function
    return pot_list
'''
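As a reference, here is one possible sketch (again, not the only valid approach). It relies on find_peaks and on the fact that the measurements are hourly, so a declustering time expressed in hours can be used directly as a minimum distance in samples.

def pot_method(data, threshold, dtime):
    # find_peaks returns the positions of the local maxima that are higher than
    # `threshold` and at least `dtime` samples apart (the data are hourly, so a
    # distance in samples corresponds to a distance in hours)
    peaks, _ = find_peaks(data['Hs(m)'], height=threshold, distance=dtime)
    # Keep the corresponding rows of the original dataframe
    pot_list = data.iloc[peaks].reset_index(drop=True)
    return pot_list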

Let’s apply your function to our data! Use a threshold of 2.5 m and a declustering time of 48 h.

pot_maxima = pot_method(data, threshold=2.5, dtime=48)
pot_maxima

Let’s see how many extreme values we got.

len(pot_maxima)

That is many more than the 21 extreme observations we got with Yearly Maxima, right?

Let’s compare both methods visually!

fig, axs = plt.subplots(1, 1)
axs.plot(data['Date (GMT)'], data['Hs(m)'], 'k', label = '${H_s(m)}$ time series')
axs.scatter(yearly_maxima_list['Date (GMT)'], yearly_maxima_list['Hs(m)'], 40, 'r', label = 'Yearly Maxima')
axs.scatter(pot_maxima['Date (GMT)'], pot_maxima['Hs(m)']+0.25, 40, 'b', label = 'POT Maxima')
axs.set_title('Yearly Maxima vs. POT')
axs.set_xlabel('Date')
axs.set_ylabel('${H_s (m)}$')
fig.set_size_inches(15, 4)
axs.grid()
axs.legend()

Note that for the following phases of Extreme Value Analysis, you need the excesses over the threshold when applying the POT method, not the value of the peak itself. That is, if you have identified an extreme observation of 3.5 m with a threshold of 2 m, you will model an excess of 3.5 - 2 = 1.5 m.
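For example, for the sample obtained in Task #2 the excesses could be computed as follows (a short sketch, assuming the threshold of 2.5 m used above):

threshold = 2.5
# Excesses over the threshold: these are the values used in the distribution fitting
excesses = pot_maxima['Hs(m)'] - threshold
excesses.head()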

3. Let’s talk about threshold and declustering time selection#

Now, we are going to discuss the appropriate selection of the parameters of the POT sampling technique.

Let’s sample our time series again using POT, but with different parameters. Let’s now use threshold=1.5 and declustering time=12h.

pot_maxima_2 = pot_method(data, threshold = 1.5, dtime = 12)
len(pot_maxima_2)

Now we have obtained many more extreme values. That seems better for our analysis. Let’s compare the selected extremes in a plot.

fig, axs = plt.subplots(1, 1)
axs.plot(data['Date (GMT)'], data['Hs(m)'], 'k', label = '${H_s(m)}$ time series')
axs.scatter(pot_maxima['Date (GMT)'], pot_maxima['Hs(m)'], 40, 'b', label = 'threshold=2.5 and dtime=48h')
axs.scatter(pot_maxima_2['Date (GMT)'], pot_maxima_2['Hs(m)']+0.25, 40, 'g', label = 'threshold=1.5 and dtime=12h')
axs.set_title('Comparing different parameters for POT')
axs.set_xlabel('Date')
axs.set_ylabel('${H_s (m)}$')
fig.set_size_inches(15, 4)
axs.grid()
axs.legend()

Let’s zoom in and analyze one wave storm.

fig, axs = plt.subplots(1, 1)
axs.plot(data['Date (GMT)'], data['Hs(m)'], 'k', label='${H_s(m)}$ time series')
axs.scatter(pot_maxima['Date (GMT)'], pot_maxima['Hs(m)'], 40, 'b', label = 'threshold=2.5 and dtime=48h')
axs.scatter(pot_maxima_2['Date (GMT)'], pot_maxima_2['Hs(m)']+0.25, 40, 'g', label = 'threshold=1.5 and dtime=12h')
axs.set_title('Significant wave height time series')
axs.set_xlabel('Date')
axs.set_ylabel('${H_s (m)}$')
fig.set_size_inches(15, 4)
axs.grid()
axs.set_xlim(datetime.datetime(2002, 3, 22, 0, 0), datetime.datetime(2002, 4, 9, 0, 0))
axs.legend()

In the previous plot, we can see that there was a wave storm between 28 March and 1 April 2002. Using threshold=2.5 and declustering time=48h, one extreme value is sampled from the wave storm. Using threshold=1.5 and declustering time=12h, three values are selected.

Task #3. What do you think is best? Explain your answer.

'''
Write your answer here
'''