devtools::install_github("hydrosolutions/riversCentralAsia") # download package from github
library('riversCentralAsia') # load package
4 Discharge Station Data
Here, we demonstrate essential data preparation steps that should always precede modeling. These preparatory steps focus on data visualization, cleaning, and gap-filling. Please note that many key data visualization techniques and their corresponding code have already been shown in Chapter 2 in Part I of the book. Feel free to revisit these sections for more details on data visualization.
4.1 Available Data
The riversCentralAsia
Package provides available data on the gauging and meteorological stations in the Chirchik River Basin (where other data are used, and their source and access options are indicated). This is the time to install and load the package with the following commands.
Before starting any type of modeling, it is important to get a good understanding of the data that we are dealing with and whether there exist problems with the raw data that need to be addressed prior to the modeling step. This is actually also one of the more hidden agendas when doing a basin characterization, i.e. to detect such possible issues present in the available data.
Problems in real-world data usually include data gaps and outliers as data records that one obtains are usually neither complete nor cleaned (of errors).
The steps performed here are thus required steps for any type of successful modeling and should be performed with great care prior to starting hydrological modeling.
The importance of good quality data for modeling cannot be overstated. It can very easily be summarized in the following way
- Data \(\rightarrow\) Model \(\rightarrow\) Results
If the underlying data is erroneous, then this translated into
- Garbage in \(\rightarrow\) Model \(\rightarrow\) Garbage out
We concentrate our efforts here on discharge records and data from meteorological stations in the Chirchik River Basin for demonstration purposes. The techniques shown here for decadal (10-days) data naturally extend to monthly data and also, to data from other basins and other sources.
4.2 Gap Filling Discharge Data
In the following, we will work with decadal discharge data from the two main tributaries of the Chirchik River, i.e. the Chatkal River (Gauge 16279) and the Pskem River (Gauge 16290) as well as on the data of the inflow to the Charvak reservoir (Gauge 16924). The goal is to analyze the data and prepare for modeling. First, let us load the relevant discharge data. The data are stored in the ChirchikRiverBasin
object.
data <- ChirchikRiverBasin # load data
q_dec_tbl <- data |> filter(code == '16279' | code == '16290' | code == '16924') # Note for the new name of the data object, we use snake notation. We choose to add periodicity (_dec_) and data type (_tbl for tibble/dataframe) to the data name. This just helps to stay organized and is good practice in R programming.
q_dec_tbl
# A tibble: 9,072 × 14
date data norm units type code station river basin resolution
<date> <dbl> <dbl> <chr> <fct> <chr> <chr> <chr> <chr> <fct>
1 1932-01-10 48.8 38.8 m3s Q 16279 Khudaydod Chatkal Chirch… dec
2 1932-01-20 48.4 37.5 m3s Q 16279 Khudaydod Chatkal Chirch… dec
3 1932-01-31 42.4 36.6 m3s Q 16279 Khudaydod Chatkal Chirch… dec
4 1932-02-10 43.7 36.4 m3s Q 16279 Khudaydod Chatkal Chirch… dec
5 1932-02-20 44.2 36.3 m3s Q 16279 Khudaydod Chatkal Chirch… dec
6 1932-02-29 47.7 36.9 m3s Q 16279 Khudaydod Chatkal Chirch… dec
7 1932-03-10 54.1 39.4 m3s Q 16279 Khudaydod Chatkal Chirch… dec
8 1932-03-20 63.2 47.6 m3s Q 16279 Khudaydod Chatkal Chirch… dec
9 1932-03-31 103 60.5 m3s Q 16279 Khudaydod Chatkal Chirch… dec
10 1932-04-10 103 86.4 m3s Q 16279 Khudaydod Chatkal Chirch… dec
# ℹ 9,062 more rows
# ℹ 4 more variables: lon_UTM42 <dbl>, lat_UTM42 <dbl>, altitude_masl <dbl>,
# basinSize_sqkm <dbl>
You can get more information about the available data by typing ?ChirchikRiverBasin
. Note that the original time series data has been packaged in this format by the riversCentralAsia::loadTabularData()
function which takes a simple .csv file as input.
It is advisable to check at this stage for missing data in time series and to fill gaps where present. Are there missing data? How can these be filled so as to arrive at complete time series that are required for hydrological modeling? These are the key questions that we will address in the following.
As can be seen in Figure 4.1, close inspection of the time series indeed reveals some missing data in the 1940ies. The missing data are indicated by gaps in the time series. The gaps are clearly visible in the time series of the Pskem River (Gauge 16290) and the Charvak reservoir (Gauge 16924). The inflow to the Chatkal River (Gauge 16279) time series appears complete and does not show any missing data. However, we will see that this is not the case.
Note, Figure 4.1 is an interactive figure where you can zoom in. Try it and zoom into the 1940ies to visualize the missing data fore clearly. You can zoom out again by clicking the Autoscale hover over. For the visualization of time series, we normally use the excellent timetk R Package. Check it out and try yourself!
q_dec_tbl |> plot_time_series(date,data,
.facet_vars = code,
.smooth = FALSE,
.interactive = TRUE,
.x_lab = "year",
.y_lab = "m^3/s",
.title = ""
)
Missing data are also confirmed by the warning that the function timetk::plot_time_series()
throws (suppressed here). Statistics of the missing data can be easily obtained. As the Table below shows, we can do this analysis for each discharge station separately.
# A tibble: 3 × 3
code n.na na.perc
<chr> <int> <dbl>
1 16279 15 0.496
2 16290 39 1.29
3 16924 42 1.39
Summarizing the number of observation with missing data reveals that 15 data points for station 16279 (0.5 % of total record length) and 39 for station 16290 (1.3 % of total record length) are missing. As there are only very few gaps in the existing time series, we use a simple method to fill these. Wherever there is a gap, we fill in the corresponding decadal norm as stored in the norm column in the object q_dec_tbl
at the time stamp of the missing data. The visualization of the results confirms that our simple gap filling approach is indeed satisfactory (see Figure 4.2).
# Make a copy of the original data
q_dec_filled_tbl <- q_dec_tbl
# Actual gap filling step
q_dec_filled_tbl$data[is.na(q_dec_filled_tbl$data)] =
q_dec_filled_tbl$norm[is.na(q_dec_filled_tbl$data)]
# Inspect results
q_dec_filled_tbl |> plot_time_series(date, data,
.facet_vars = code,
.smooth = FALSE,
.interactive = TRUE,
.x_lab = "year",
.y_lab = "m^3/s",
.title = ""
)
All missing data are gone now as can easily be validated. The number of missing data points is zero for all stations. The results are summarized in the Table below.
# A tibble: 3 × 3
code n.na na.perc
<chr> <int> <dbl>
1 16279 0 0
2 16290 0 0
3 16924 0 0
A note of caution here. This simple gap filling technique reduces variance in the time series. It should only be used when the percentage of missing data is low. As will be discussed in the next Section Section 4.3 below, more sophisticated techniques should be utilized when there exist substantial gaps and in the case of less regular data.
Finally, we discard the data that we no longer need, including the norm data, which we used for gap filling of the missing discharge data and convert the data to wide format (see Table 4.1 below) to add to it meteorological data in the next Section.
As a result, we now have a complete record of decadal discharge data for the two main tributaries of the Chirchik river and the inflow time series to Charvak Reservoir from the beginning of 1932 until and including 2015, i.e. 84 years. The same type of preparatory analysis will now be carried out for the meteorological data but in a slightly more sophisticated way.
4.3 Gap Filling Meteorological Data
Here, we use precipitation and temperature data from Pskem (38462), Chatkal (38471) and Charvak Reservoir (38464) Meteorological Stations (see Section 2.2 for more information on these stations). We also have data from Oygaing station (Station Code 38339) but the record only starts in 1962 and the time resolution is monthly. Therefore, we do not take this station into account here for the time being.
We start with precipitation and plot the available data.
p_dec_tbl <- data |> filter(type == "P" & code != "38339")
p_dec_tbl |> plot_time_series(date,data,
.facet_vars = code,
.interactive = TRUE,
.smooth = FALSE,
.title = "",
.y_lab = "mm/decade",
.x_lab = "year"
)
The precipitation data from these 3 stations shows some significant data gaps. The Chatkal Meteorological Station that is located in Kyrgyzstan apparently did not work in the post-transition years as continuous measurements were only resumed there in 1998.
Let us see what happens if we were to use the same simple gap filling technique that we introduced above for discharge.
p_dec_filled_tbl <- p_dec_tbl
p_dec_filled_tbl$data[is.na(p_dec_filled_tbl$data)] = p_dec_filled_tbl$norm[is.na(p_dec_filled_tbl$data)]
p_dec_filled_tbl |> plot_time_series(date,data,
.facet_vars = code,
.interactive = TRUE,
.smooth = FALSE,
.title = "",
.y_lab = "mm/decade",
.x_lab = "year"
)
Closely inspect the significant data gap in the 1990ies at Station 38741. Play around and zoom into the time series in the 1990ies in Figure 4.3 and compare it with the resulting gap-filled time series in Figure 4.4. We see that our technique of gap filling with long-term norms is not suitable for this type of data and the significant gap size. The effect of variance reduction is clearly visible.
Hence, we resort to a more powerful gap filling technique that uses a (regression) model to impute the missing values from existing ones at the neighboring stations, i.e. Stations 38462 and 38464. To do so, we utilize the simputation R package. Please note that if you do not have the required package installed locally, you should install it prior to its use with the following command install.packages('simputation')
library(simputation)
# First, we bring the data into the suitable format.
p_dec_wide_tbl <- p_dec_tbl |>
mutate(code = paste0('P',code |> as.character())) |>
dplyr::select(date,data,code) |>
pivot_wider(names_from = "code",values_from = "data")
# Second, we impute missing values.
p_dec_filled_wide_tbl <- p_dec_wide_tbl |>
impute_rlm(P38471 ~ P38462 + P38464) |> # Imputing precipitation at station 38471 using a robust linear regression model
impute_rlm(P38462 ~ P38471 + P38464) |> # Imputing precipitation at station 38462 using a robust linear regression model
impute_rlm(P38464 ~ P38462 + P38471) # Imputing precipitation at station 38464 using a robust linear regression model
p_dec_filled_long_tbl <- p_dec_filled_wide_tbl |> pivot_longer(c('P38462','P38464','P38471'))
p_dec_filled_long_tbl |> plot_time_series(date,value,
.facet_vars = name,
.interactive = TRUE,
.smooth = FALSE,
.title = '',
.y_lab = "mm/decade",
.x_lab = "year"
)
As you can see, we use simple linear regression models to impute missing value in the target time series using observations from the neighboring stations. This is of course only possible where data is not missing across the time series, as we will discuss below.
Through simple visual inspection, it becomes clear that this type of regression model for gap filling is better suited than the previous approach chosen. Let us check whether we could successfully fill all gaps with this robust linear regression approach.
p_dec_filled_long_tbl |>
group_by(name) |>
summarize(n.na = sum(is.na(value)), n.na.perc = n.na / n() * 100)
# A tibble: 3 × 3
name n.na n.na.perc
<chr> <int> <dbl>
1 P38462 12 0.402
2 P38464 12 0.402
3 P38471 3 0.100
It turns out that we still have very few gaps to deal with. We can see them by simply visualizing the wide tibble. The problem persisted at times when two or more values were missing across the available stations at the same time and where thus the linear regression could not be carried out. Let us look at the start of the record…
p_dec_filled_wide_tbl |>
head(10)
# A tibble: 10 × 4
date P38462 P38464 P38471
<date> <dbl> <dbl> <dbl>
1 1933-01-10 NA NA 2
2 1933-01-20 NA NA 10
3 1933-01-31 NA NA 5
4 1933-02-10 NA NA 33
5 1933-02-20 NA NA 8
6 1933-02-28 NA NA 10
7 1933-03-10 NA NA 31
8 1933-03-20 NA NA 50
9 1933-03-31 NA NA 6
10 1933-04-10 23 21.3 13
… and the end of the record. The missing values are easily spotted.
p_dec_filled_wide_tbl |>
tail()
# A tibble: 6 × 4
date P38462 P38464 P38471
<date> <dbl> <dbl> <dbl>
1 2015-11-10 72 81 19
2 2015-11-20 122 76 43
3 2015-11-30 7 2 3
4 2015-12-10 NA NA NA
5 2015-12-20 NA NA NA
6 2015-12-31 NA NA NA
We can solve the issues related to the missing values at the start of the observation record by using the same technique as above and by only regressing P38462 and P38464 on P38471.
p_dec_filled_wide_tbl <-
p_dec_filled_wide_tbl |>
impute_rlm(P38462 ~ P38471) |> # Imputing precipitation at station 38462 using a robust linear regression model
impute_rlm(P38464 ~ P38471) # Imputing precipitation at station 38464 using a robust linear regression model
p_dec_filled_wide_tbl |> head(10)
# A tibble: 10 × 4
date P38462 P38464 P38471
<date> <dbl> <dbl> <dbl>
1 1933-01-10 5.60 5.08 2
2 1933-01-20 18.3 16.7 10
3 1933-01-31 10.4 9.46 5
4 1933-02-10 54.9 50.3 33
5 1933-02-20 15.2 13.8 8
6 1933-02-28 18.3 16.7 10
7 1933-03-10 51.8 47.3 31
8 1933-03-20 82.0 75.0 50
9 1933-03-31 12.0 10.9 6
10 1933-04-10 23 21.3 13
Converse to this, the complete set of observations is missing for December 2015. We will thus remove these non-observations from our tibble. This can be done once and for all with na.omit()
as shown in the code block below.
# A tibble: 6 × 4
date P38462 P38464 P38471
<date> <dbl> <dbl> <dbl>
1 2015-10-10 5 1 0
2 2015-10-20 89 108 58
3 2015-10-31 34 40 12
4 2015-11-10 72 81 19
5 2015-11-20 122 76 43
6 2015-11-30 7 2 3
p_dec_filled_long_tbl <- p_dec_filled_wide_tbl |> pivot_longer(-date)
Inspecting the temperature data, we see similar data issues as in the precipitation data set and can proceed accordingly for gap filling.
t_dec_tbl <- data |> filter(type == "T")
t_dec_tbl |> plot_time_series(date,data,
.facet_vars = code,
.interactive = TRUE,
.smooth = FALSE,
.title = '',
.y_lab = "deg. Celsius",
.x_lab = "year"
)
# First, we bring the data into the suitable format.
t_dec_wide_tbl <- t_dec_tbl |>
mutate(code = paste0('T',code |> as.character())) |>
dplyr::select(date,data,code) |>
pivot_wider(names_from = "code",values_from = "data")
# Second, we impute missing values.
t_dec_filled_wide_tbl <- t_dec_wide_tbl |>
impute_rlm(T38471 ~ T38462) |> # Imputing precipitation at station 38471 using a robust linear regression model
impute_rlm(T38462 ~ T38471) # Imputing precipitation at station 38462 using a robust linear regression model
t_dec_filled_long_tbl <- t_dec_filled_wide_tbl |>
pivot_longer(c('T38462','T38471'))
t_dec_filled_long_tbl |>
plot_time_series(date,value,
.facet_vars = name,
.interactive = TRUE,
.smooth = FALSE,
.title = '',
.y_lab = "deg. Celsius",
.x_lab = "year"
)
There are some irregularities in the temperature time series of Chatkal Meteorological Station in the first decade of the 20th century (tip: zoom in to see these more clearly). Note that these were not introduced by the gap filling technique that we used but are most likely wrong temperature readings or recordings. We will return to these in the outlier analysis below in Section 4.4.
Any missing values left in the temperature time series? Let’s check!
t_dec_filled_long_tbl |>
group_by(name) |>
summarize(n.na = sum(is.na(value)), n.na.perc = n.na/n()*100)
# A tibble: 2 × 3
name n.na n.na.perc
<chr> <int> <dbl>
1 T38462 3 0.100
2 T38471 3 0.100
To see where the missing value are, we find them easily again by looking at the head and tail of the tibble.
t_dec_filled_wide_tbl |> head()
# A tibble: 6 × 3
date T38462 T38471
<date> <dbl> <dbl>
1 1933-01-10 -6.9 -16.6
2 1933-01-20 -6.1 -15.5
3 1933-01-31 -6.3 -15.6
4 1933-02-10 -2 -8.6
5 1933-02-20 -3.3 -12.5
6 1933-02-28 -0.1 -8.5
t_dec_filled_wide_tbl |> tail()
# A tibble: 6 × 3
date T38462 T38471
<date> <dbl> <dbl>
1 2015-11-10 2.4 -2.5
2 2015-11-20 2 -2.2
3 2015-11-30 4.6 -3.7
4 2015-12-10 NA NA
5 2015-12-20 NA NA
6 2015-12-31 NA NA
Finally, we remove these non observations again as above with the function na.omit()
.
t_dec_filled_wide_tbl <- t_dec_filled_wide_tbl |> na.omit()
t_dec_filled_long_tbl <- t_dec_filled_wide_tbl |> pivot_longer(-date)
To deal with the missing values at the end of the observational record, we could also have used any other technique. Using the norm values however would have artificially reduced the variance in both cases as explained above. Furthermore and at least in the case of temperature, it is also questionable to what extent a norm calculated over the last 84 years is still representative given global warming. We will look in this important and interesting topic in the next section.
4.4 Anomalies and Outliers
We use the function timetk::plot_anomaly_diagnostics()
to investigate these anomalies in the time series. For discharge, we first log-transform the raw data with the following transformation to reduce the variance of the original data.
\[
\hat{q}(t) = log(q(t) + 1)
\] where \(\hat{q}(t)\) denotes the transformed discharge. Prior to the log transformation, 1 is added so as to avoid cases where discharge would be 0 and the logarithmic transform thus undefined. The transformation can easily be done with the log1p()
function in R. Back-transformation is then via the function expm1()
simply involves taking the exponent and subtracting 1 from the result. Figure 4.8 shows the result.
Results are shown in Figure 4.8, Figure 4.9 and Figure 4.11 below.
The exceptionally wet year 19169 shows up as anomalous in the Chatkal River Basin and at the downstream Charvak Reservoir inflow gauge.
q_dec_filled_long_tbl |>
plot_anomaly_diagnostics(date,
value |> log1p(),
.facet_vars = name,
.frequency = 36,
.interactive = TRUE,
.title = "")
The investigation of precipitation anomalies shows a succession of regular anomalous wet events over time. It is interesting to see that the winter 1968/69 regularly anomalous at all three stations (Figure 4.9, zoom in to investigate).
p_dec_filled_long_tbl |>
plot_anomaly_diagnostics(date,
value,
.facet_vars = name,
.interactive = TRUE,
.title = "")
While intuitively, we would have expected an exceptionally mild winter in 1968/69 due to the precipitation excess, the corresponding anomaly does not show up in the temperature record as shown in Figure 4.11.
We have set a quite conservative range for the precipitation anomalies in Figure 4.9. We might want to experiment with a larger .alpha
value in the plot_anomaly_diagnostics()
function to see if the winter 1968/69 is still considered as an anomaly.
p_dec_filled_long_tbl |>
plot_anomaly_diagnostics(date,
value,
.facet_vars = name,
.interactive = TRUE,
.title = "",
.alpha = .03)
Now we look at the temperature anomalies.
t_dec_filled_long_tbl |>
plot_anomaly_diagnostics(date,value,
.facet_vars = name,
.interactive = TRUE,
.title = "")
Apart from the identification of extreme periods since as the 1969 discharge year in the Chatkal river basin, the diagnostics of anomalies also helps to identify likely erroneous data records. In Figure 4.11 for example, when we zoom into the data of the series T38471 in the first decade of the 21st century, problems in relation to positive anomalies during the winter are visible in 4 instances. One explanation would be that in at least some instances, the data are erroneously recorded as positive values when in fact they were negative temperatures (see dates ‘2002-01-31’, ‘2005-01-10’ and ‘2007-02-28’, Chatkal Station 38471).
Obvious errors can be spotted like this and corrected. However, non-obvious data errors should be communicated with the data producing agency and replacement strategy jointly defined. If this is not possible, the values could be set to NA
and then imputed as shown above.
The discharge data is now ready to be used for modelling and we can move on to the next Chapter on Geospatial Data.