4  Discharge Station Data

Here, we demonstrate essential data preparation steps that should always precede modeling. These preparatory steps focus on data visualization, cleaning, and gap-filling. Please note that many key data visualization techniques and their corresponding code have already been shown in Chapter 2 in Part I of the book. Feel free to revisit these sections for more details on data visualization.

4.1 Available Data

The riversCentralAsia Package provides available data on the gauging and meteorological stations in the Chirchik River Basin (where other data are used, and their source and access options are indicated). This is the time to install and load the package with the following commands.

devtools::install_github("hydrosolutions/riversCentralAsia") # download package from github
library('riversCentralAsia') # load package

Before starting any type of modeling, it is important to get a good understanding of the data that we are dealing with and whether there exist problems with the raw data that need to be addressed prior to the modeling step. This is actually also one of the more hidden agendas when doing a basin characterization, i.e. to detect such possible issues present in the available data.

Problems in real-world data usually include data gaps and outliers as data records that one obtains are usually neither complete nor cleaned (of errors).

The steps performed here are thus required steps for any type of successful modeling and should be performed with great care prior to starting hydrological modeling.

Garbage in - Garbage out

The importance of good quality data for modeling cannot be overstated. It can very easily be summarized in the following way

  • Data \(\rightarrow\) Model \(\rightarrow\) Results

If the underlying data is erroneous, then this translated into

  • Garbage in \(\rightarrow\) Model \(\rightarrow\) Garbage out

We concentrate our efforts here on discharge records and data from meteorological stations in the Chirchik River Basin for demonstration purposes. The techniques shown here for decadal (10-days) data naturally extend to monthly data and also, to data from other basins and other sources.

4.2 Gap Filling Discharge Data

In the following, we will work with decadal discharge data from the two main tributaries of the Chirchik River, i.e. the Chatkal River (Gauge 16279) and the Pskem River (Gauge 16290) as well as on the data of the inflow to the Charvak reservoir (Gauge 16924). The goal is to analyze the data and prepare for modeling. First, let us load the relevant discharge data. The data are stored in the ChirchikRiverBasin object.

data <- ChirchikRiverBasin # load data
q_dec_tbl <- data |> filter(code == '16279' | code == '16290' | code == '16924') # Note for the new name of the data object, we use snake notation. We choose to add periodicity (_dec_) and data type (_tbl for tibble/dataframe) to the data name. This just helps to stay organized and is good practice in R programming.
q_dec_tbl
# A tibble: 9,072 × 14
   date        data  norm units type  code  station   river   basin   resolution
   <date>     <dbl> <dbl> <chr> <fct> <chr> <chr>     <chr>   <chr>   <fct>     
 1 1932-01-10  48.8  38.8 m3s   Q     16279 Khudaydod Chatkal Chirch… dec       
 2 1932-01-20  48.4  37.5 m3s   Q     16279 Khudaydod Chatkal Chirch… dec       
 3 1932-01-31  42.4  36.6 m3s   Q     16279 Khudaydod Chatkal Chirch… dec       
 4 1932-02-10  43.7  36.4 m3s   Q     16279 Khudaydod Chatkal Chirch… dec       
 5 1932-02-20  44.2  36.3 m3s   Q     16279 Khudaydod Chatkal Chirch… dec       
 6 1932-02-29  47.7  36.9 m3s   Q     16279 Khudaydod Chatkal Chirch… dec       
 7 1932-03-10  54.1  39.4 m3s   Q     16279 Khudaydod Chatkal Chirch… dec       
 8 1932-03-20  63.2  47.6 m3s   Q     16279 Khudaydod Chatkal Chirch… dec       
 9 1932-03-31 103    60.5 m3s   Q     16279 Khudaydod Chatkal Chirch… dec       
10 1932-04-10 103    86.4 m3s   Q     16279 Khudaydod Chatkal Chirch… dec       
# ℹ 9,062 more rows
# ℹ 4 more variables: lon_UTM42 <dbl>, lat_UTM42 <dbl>, altitude_masl <dbl>,
#   basinSize_sqkm <dbl>

You can get more information about the available data by typing ?ChirchikRiverBasin. Note that the original time series data has been packaged in this format by the riversCentralAsia::loadTabularData() function which takes a simple .csv file as input.

It is advisable to check at this stage for missing data in time series and to fill gaps where present. Are there missing data? How can these be filled so as to arrive at complete time series that are required for hydrological modeling? These are the key questions that we will address in the following.

As can be seen in Figure 4.1, close inspection of the time series indeed reveals some missing data in the 1940ies. The missing data are indicated by gaps in the time series. The gaps are clearly visible in the time series of the Pskem River (Gauge 16290) and the Charvak reservoir (Gauge 16924). The inflow to the Chatkal River (Gauge 16279) time series appears complete and does not show any missing data. However, we will see that this is not the case.

Tip

Note, Figure 4.1 is an interactive figure where you can zoom in. Try it and zoom into the 1940ies to visualize the missing data fore clearly. You can zoom out again by clicking the Autoscale hover over. For the visualization of time series, we normally use the excellent timetk R Package. Check it out and try yourself!

q_dec_tbl |> plot_time_series(date,data,
                               .facet_vars  = code,
                               .smooth      = FALSE,
                               .interactive = TRUE,
                               .x_lab       = "year",
                               .y_lab       = "m^3/s",
                               .title       = ""
                               )
Figure 4.1: Discharge data of selected gauges in the upstream zone of runoff formation in the Chirchik River Basin. Data Source: Uzbek Hydrometeorological Service.

Missing data are also confirmed by the warning that the function timetk::plot_time_series() throws (suppressed here). Statistics of the missing data can be easily obtained. As the Table below shows, we can do this analysis for each discharge station separately.

q_dec_tbl |> group_by(code) |> 
  summarize(n.na = sum(is.na(data)), na.perc = n.na/n()*100)
# A tibble: 3 × 3
  code   n.na na.perc
  <chr> <int>   <dbl>
1 16279    15   0.496
2 16290    39   1.29 
3 16924    42   1.39 

Summarizing the number of observation with missing data reveals that 15 data points for station 16279 (0.5 % of total record length) and 39 for station 16290 (1.3 % of total record length) are missing. As there are only very few gaps in the existing time series, we use a simple method to fill these. Wherever there is a gap, we fill in the corresponding decadal norm as stored in the norm column in the object q_dec_tbl at the time stamp of the missing data. The visualization of the results confirms that our simple gap filling approach is indeed satisfactory (see Figure 4.2).

# Make a copy of the original data
q_dec_filled_tbl <- q_dec_tbl

# Actual gap filling step
q_dec_filled_tbl$data[is.na(q_dec_filled_tbl$data)] = 
  q_dec_filled_tbl$norm[is.na(q_dec_filled_tbl$data)] 

# Inspect results
q_dec_filled_tbl |> plot_time_series(date, data, 
                                      .facet_vars  = code, 
                                      .smooth      = FALSE,
                                      .interactive = TRUE,
                                      .x_lab       = "year",
                                      .y_lab       = "m^3/s",
                                      .title       = ""
                                      )
Figure 4.2: Gap filled Pskem and Chatkal river discharges.

All missing data are gone now as can easily be validated. The number of missing data points is zero for all stations. The results are summarized in the Table below.

q_dec_filled_tbl |> group_by(code) |> 
  summarize(n.na = sum(is.na(data)), na.perc = n.na/n()*100)
# A tibble: 3 × 3
  code   n.na na.perc
  <chr> <int>   <dbl>
1 16279     0       0
2 16290     0       0
3 16924     0       0

A note of caution here. This simple gap filling technique reduces variance in the time series. It should only be used when the percentage of missing data is low. As will be discussed in the next Section Section 4.3 below, more sophisticated techniques should be utilized when there exist substantial gaps and in the case of less regular data.

Finally, we discard the data that we no longer need, including the norm data, which we used for gap filling of the missing discharge data and convert the data to wide format (see Table 4.1 below) to add to it meteorological data in the next Section.

Table 4.1
q_dec_filled_wide_tbl <- q_dec_filled_tbl |> # again we use the name convention of objects as introduced above
  mutate(code = paste0('Q',code |> as.character())) |> # Since we convert everything to long form, we want to keep information as compact as possible. Hence, we paste the type identifier (Q for discharge here) in from of the 5 digit station code.
  dplyr::select(date,data,code) |> # ... and then ditch all the remainder information
  pivot_wider(names_from = "code",values_from = "data") # in order to pivot to the long format, we need to make a small detour via the wide format.

q_dec_filled_long_tbl <- q_dec_filled_wide_tbl |> pivot_longer(-date) # and then pivot back
q_dec_filled_wide_tbl
# A tibble: 3,024 × 4
   date       Q16279 Q16290 Q16924
   <date>      <dbl>  <dbl>  <dbl>
 1 1932-01-10   48.8   38.3   87.1
 2 1932-01-20   48.4   37.7   86.1
 3 1932-01-31   42.4   36.2   78.6
 4 1932-02-10   43.7   35.6   79.3
 5 1932-02-20   44.2   35     79.2
 6 1932-02-29   47.7   37.1   84.8
 7 1932-03-10   54.1   43.1   97.2
 8 1932-03-20   63.2   47    110  
 9 1932-03-31  103     72.1  175  
10 1932-04-10  103     73.2  176  
# ℹ 3,014 more rows

As a result, we now have a complete record of decadal discharge data for the two main tributaries of the Chirchik river and the inflow time series to Charvak Reservoir from the beginning of 1932 until and including 2015, i.e. 84 years. The same type of preparatory analysis will now be carried out for the meteorological data but in a slightly more sophisticated way.

4.3 Gap Filling Meteorological Data

Here, we use precipitation and temperature data from Pskem (38462), Chatkal (38471) and Charvak Reservoir (38464) Meteorological Stations (see Section 2.2 for more information on these stations). We also have data from Oygaing station (Station Code 38339) but the record only starts in 1962 and the time resolution is monthly. Therefore, we do not take this station into account here for the time being.

We start with precipitation and plot the available data.

p_dec_tbl <- data |> filter(type == "P" & code != "38339") 
p_dec_tbl |> plot_time_series(date,data,
                               .facet_vars  = code,
                               .interactive = TRUE,
                               .smooth      = FALSE,
                               .title       = "",
                               .y_lab       = "mm/decade",
                               .x_lab       = "year"
                               )
Figure 4.3: Raw decadal precipitation data from Pskem (38462), Charvak Reservoir (38471) and Chatkal Meteo Station (38471).

The precipitation data from these 3 stations shows some significant data gaps. The Chatkal Meteorological Station that is located in Kyrgyzstan apparently did not work in the post-transition years as continuous measurements were only resumed there in 1998.

Let us see what happens if we were to use the same simple gap filling technique that we introduced above for discharge.

p_dec_filled_tbl <- p_dec_tbl
p_dec_filled_tbl$data[is.na(p_dec_filled_tbl$data)] = p_dec_filled_tbl$norm[is.na(p_dec_filled_tbl$data)]
p_dec_filled_tbl |> plot_time_series(date,data,
                                      .facet_vars  = code,
                                      .interactive = TRUE,
                                      .smooth      = FALSE,
                                      .title       = "",
                                      .y_lab       = "mm/decade",
                                      .x_lab       = "year"
                                      )
Figure 4.4: Precipitation Data gap-filled with norms. The filled values from 1990 - 2000 in the case of the Station 38471 indicate that the norm-filling technique is not adequate for this type of data.

Closely inspect the significant data gap in the 1990ies at Station 38741. Play around and zoom into the time series in the 1990ies in Figure 4.3 and compare it with the resulting gap-filled time series in Figure 4.4. We see that our technique of gap filling with long-term norms is not suitable for this type of data and the significant gap size. The effect of variance reduction is clearly visible.

Hence, we resort to a more powerful gap filling technique that uses a (regression) model to impute the missing values from existing ones at the neighboring stations, i.e. Stations 38462 and 38464. To do so, we utilize the simputation R package. Please note that if you do not have the required package installed locally, you should install it prior to its use with the following command install.packages('simputation')

library(simputation)
# First, we bring the data into the suitable format. 
p_dec_wide_tbl <- p_dec_tbl |> 
  mutate(code = paste0('P',code |> as.character())) |> 
  dplyr::select(date,data,code) |> 
  pivot_wider(names_from = "code",values_from = "data")

# Second, we impute missing values.
p_dec_filled_wide_tbl <- p_dec_wide_tbl  |> 
  impute_rlm(P38471 ~ P38462 + P38464) |> # Imputing precipitation at station 38471 using a robust linear regression model
  impute_rlm(P38462 ~ P38471 + P38464) |> # Imputing precipitation at station 38462 using a robust linear regression model
  impute_rlm(P38464 ~ P38462 + P38471) # Imputing precipitation at station 38464 using a robust linear regression model

p_dec_filled_long_tbl <- p_dec_filled_wide_tbl |> pivot_longer(c('P38462','P38464','P38471')) 

p_dec_filled_long_tbl |> plot_time_series(date,value,
                                          .facet_vars  = name,
                                          .interactive = TRUE,
                                          .smooth      = FALSE,
                                          .title       = '',
                                          .y_lab       = "mm/decade",
                                          .x_lab       = "year"
                                          )
Figure 4.5: Precipitation Data gap filled with a robust linear regression modeling approach

As you can see, we use simple linear regression models to impute missing value in the target time series using observations from the neighboring stations. This is of course only possible where data is not missing across the time series, as we will discuss below.

Through simple visual inspection, it becomes clear that this type of regression model for gap filling is better suited than the previous approach chosen. Let us check whether we could successfully fill all gaps with this robust linear regression approach.

p_dec_filled_long_tbl |> 
  group_by(name) |> 
  summarize(n.na = sum(is.na(value)), n.na.perc = n.na / n() * 100)
# A tibble: 3 × 3
  name    n.na n.na.perc
  <chr>  <int>     <dbl>
1 P38462    12     0.402
2 P38464    12     0.402
3 P38471     3     0.100

It turns out that we still have very few gaps to deal with. We can see them by simply visualizing the wide tibble. The problem persisted at times when two or more values were missing across the available stations at the same time and where thus the linear regression could not be carried out. Let us look at the start of the record…

p_dec_filled_wide_tbl |> 
  head(10)
# A tibble: 10 × 4
   date       P38462 P38464 P38471
   <date>      <dbl>  <dbl>  <dbl>
 1 1933-01-10     NA   NA        2
 2 1933-01-20     NA   NA       10
 3 1933-01-31     NA   NA        5
 4 1933-02-10     NA   NA       33
 5 1933-02-20     NA   NA        8
 6 1933-02-28     NA   NA       10
 7 1933-03-10     NA   NA       31
 8 1933-03-20     NA   NA       50
 9 1933-03-31     NA   NA        6
10 1933-04-10     23   21.3     13

… and the end of the record. The missing values are easily spotted.

p_dec_filled_wide_tbl |> 
  tail()
# A tibble: 6 × 4
  date       P38462 P38464 P38471
  <date>      <dbl>  <dbl>  <dbl>
1 2015-11-10     72     81     19
2 2015-11-20    122     76     43
3 2015-11-30      7      2      3
4 2015-12-10     NA     NA     NA
5 2015-12-20     NA     NA     NA
6 2015-12-31     NA     NA     NA

We can solve the issues related to the missing values at the start of the observation record by using the same technique as above and by only regressing P38462 and P38464 on P38471.

p_dec_filled_wide_tbl <- 
  p_dec_filled_wide_tbl  |> 
  impute_rlm(P38462 ~ P38471) |> # Imputing precipitation at station 38462 using a robust linear regression model
  impute_rlm(P38464 ~ P38471) # Imputing precipitation at station 38464 using a robust linear regression model
p_dec_filled_wide_tbl |> head(10)
# A tibble: 10 × 4
   date       P38462 P38464 P38471
   <date>      <dbl>  <dbl>  <dbl>
 1 1933-01-10   5.60   5.08      2
 2 1933-01-20  18.3   16.7      10
 3 1933-01-31  10.4    9.46      5
 4 1933-02-10  54.9   50.3      33
 5 1933-02-20  15.2   13.8       8
 6 1933-02-28  18.3   16.7      10
 7 1933-03-10  51.8   47.3      31
 8 1933-03-20  82.0   75.0      50
 9 1933-03-31  12.0   10.9       6
10 1933-04-10  23     21.3      13

Converse to this, the complete set of observations is missing for December 2015. We will thus remove these non-observations from our tibble. This can be done once and for all with na.omit() as shown in the code block below.

p_dec_filled_wide_tbl <- p_dec_filled_wide_tbl |> na.omit()
p_dec_filled_wide_tbl |> tail()
# A tibble: 6 × 4
  date       P38462 P38464 P38471
  <date>      <dbl>  <dbl>  <dbl>
1 2015-10-10      5      1      0
2 2015-10-20     89    108     58
3 2015-10-31     34     40     12
4 2015-11-10     72     81     19
5 2015-11-20    122     76     43
6 2015-11-30      7      2      3
p_dec_filled_long_tbl <-  p_dec_filled_wide_tbl |> pivot_longer(-date)

Inspecting the temperature data, we see similar data issues as in the precipitation data set and can proceed accordingly for gap filling.

t_dec_tbl <- data |> filter(type == "T") 
t_dec_tbl |> plot_time_series(date,data,
                               .facet_vars  = code,
                               .interactive = TRUE,
                               .smooth      = FALSE,
                               .title       = '',
                               .y_lab       = "deg. Celsius",
                               .x_lab       = "year"
                               )
Figure 4.6: Raw temperature data from the meteorological stations Pskem (38462) and Chatkal (38471)
# First, we bring the data into the suitable format. 
t_dec_wide_tbl <- t_dec_tbl |> 
  mutate(code = paste0('T',code |> as.character())) |> 
  dplyr::select(date,data,code) |> 
  pivot_wider(names_from = "code",values_from = "data")

# Second, we impute missing values.
t_dec_filled_wide_tbl <- t_dec_wide_tbl  |> 
  impute_rlm(T38471 ~ T38462) |> # Imputing precipitation at station 38471 using a robust linear regression model
  impute_rlm(T38462 ~ T38471) # Imputing precipitation at station 38462 using a robust linear regression model

t_dec_filled_long_tbl <- t_dec_filled_wide_tbl |> 
  pivot_longer(c('T38462','T38471')) 

t_dec_filled_long_tbl |> 
  plot_time_series(date,value,
                   .facet_vars  = name,
                   .interactive = TRUE,
                   .smooth      = FALSE,
                   .title       = '',
                   .y_lab       = "deg. Celsius",
                   .x_lab       = "year"
                   )
Figure 4.7: Temperature data gap filled with robust linear regression modeling.

There are some irregularities in the temperature time series of Chatkal Meteorological Station in the first decade of the 20th century (tip: zoom in to see these more clearly). Note that these were not introduced by the gap filling technique that we used but are most likely wrong temperature readings or recordings. We will return to these in the outlier analysis below in Section 4.4.

Any missing values left in the temperature time series? Let’s check!

t_dec_filled_long_tbl |> 
  group_by(name) |> 
  summarize(n.na = sum(is.na(value)), n.na.perc = n.na/n()*100)
# A tibble: 2 × 3
  name    n.na n.na.perc
  <chr>  <int>     <dbl>
1 T38462     3     0.100
2 T38471     3     0.100

To see where the missing value are, we find them easily again by looking at the head and tail of the tibble.

t_dec_filled_wide_tbl |> head()
# A tibble: 6 × 3
  date       T38462 T38471
  <date>      <dbl>  <dbl>
1 1933-01-10   -6.9  -16.6
2 1933-01-20   -6.1  -15.5
3 1933-01-31   -6.3  -15.6
4 1933-02-10   -2     -8.6
5 1933-02-20   -3.3  -12.5
6 1933-02-28   -0.1   -8.5
t_dec_filled_wide_tbl |> tail()
# A tibble: 6 × 3
  date       T38462 T38471
  <date>      <dbl>  <dbl>
1 2015-11-10    2.4   -2.5
2 2015-11-20    2     -2.2
3 2015-11-30    4.6   -3.7
4 2015-12-10   NA     NA  
5 2015-12-20   NA     NA  
6 2015-12-31   NA     NA  

Finally, we remove these non observations again as above with the function na.omit().

t_dec_filled_wide_tbl <- t_dec_filled_wide_tbl |> na.omit()
t_dec_filled_long_tbl <- t_dec_filled_wide_tbl |> pivot_longer(-date)

To deal with the missing values at the end of the observational record, we could also have used any other technique. Using the norm values however would have artificially reduced the variance in both cases as explained above. Furthermore and at least in the case of temperature, it is also questionable to what extent a norm calculated over the last 84 years is still representative given global warming. We will look in this important and interesting topic in the next section.

4.4 Anomalies and Outliers

We use the function timetk::plot_anomaly_diagnostics() to investigate these anomalies in the time series. For discharge, we first log-transform the raw data with the following transformation to reduce the variance of the original data.

\[ \hat{q}(t) = log(q(t) + 1) \] where \(\hat{q}(t)\) denotes the transformed discharge. Prior to the log transformation, 1 is added so as to avoid cases where discharge would be 0 and the logarithmic transform thus undefined. The transformation can easily be done with the log1p() function in R. Back-transformation is then via the function expm1() simply involves taking the exponent and subtracting 1 from the result. Figure 4.8 shows the result.

Results are shown in Figure 4.8, Figure 4.9 and Figure 4.11 below.

The exceptionally wet year 19169 shows up as anomalous in the Chatkal River Basin and at the downstream Charvak Reservoir inflow gauge.

q_dec_filled_long_tbl |> 
  plot_anomaly_diagnostics(date,
                           value |> log1p(),
                           .facet_vars  = name,
                           .frequency = 36,
                           .interactive = TRUE,
                           .title = "")
Figure 4.8: Anomaly diagnostics of discharge data. The transparent grey band shows the width of the normal range. The highly anomalous wet year of 1969 is clearly visible in the discharge record of the Chatkal river basin (Station 16279).

The investigation of precipitation anomalies shows a succession of regular anomalous wet events over time. It is interesting to see that the winter 1968/69 regularly anomalous at all three stations (Figure 4.9, zoom in to investigate).

p_dec_filled_long_tbl |> 
  plot_anomaly_diagnostics(date,
                           value,
                           .facet_vars  = name,
                           .interactive = TRUE,
                           .title = "")
Figure 4.9: Anomaly diagnostics of precipitation data. The transparent grey band shows the width of the normal range. The winter 1968/69 is regularly anomalous at all three stations.

While intuitively, we would have expected an exceptionally mild winter in 1968/69 due to the precipitation excess, the corresponding anomaly does not show up in the temperature record as shown in Figure 4.11.

We have set a quite conservative range for the precipitation anomalies in Figure 4.9. We might want to experiment with a larger .alpha value in the plot_anomaly_diagnostics() function to see if the winter 1968/69 is still considered as an anomaly.

p_dec_filled_long_tbl |> 
  plot_anomaly_diagnostics(date,
                           value,
                           .facet_vars  = name,
                           .interactive = TRUE,
                           .title = "",
                           .alpha = .03)
Figure 4.10: Anomaly diagnostics of precipitation data. The width of the normal range has been much increased and only the very extrem decades are now considered as anomalous. The cluster of extremes in 1968/69 appears more clearly like this.

Now we look at the temperature anomalies.

t_dec_filled_long_tbl |>  
  plot_anomaly_diagnostics(date,value,
                           .facet_vars  = name,
                           .interactive = TRUE,
                           .title = "")
Figure 4.11: Anomaly diagnostics of temperature data.

Apart from the identification of extreme periods since as the 1969 discharge year in the Chatkal river basin, the diagnostics of anomalies also helps to identify likely erroneous data records. In Figure 4.11 for example, when we zoom into the data of the series T38471 in the first decade of the 21st century, problems in relation to positive anomalies during the winter are visible in 4 instances. One explanation would be that in at least some instances, the data are erroneously recorded as positive values when in fact they were negative temperatures (see dates ‘2002-01-31’, ‘2005-01-10’ and ‘2007-02-28’, Chatkal Station 38471).

Obvious errors can be spotted like this and corrected. However, non-obvious data errors should be communicated with the data producing agency and replacement strategy jointly defined. If this is not possible, the values could be set to NA and then imputed as shown above.

The discharge data is now ready to be used for modelling and we can move on to the next Chapter on Geospatial Data.

4.5 References