Data preparation

This is not a comprehensive explanation of any methods used to prepare data for use in Source. Although it refers specifically to calibration, it is suggested that these methods are used for data in general.

Note: This content is an excerpt from eWater Cooperative Research Centre 2010, Source Catchments User Guide, eWater Cooperative Research Centre, Canberra. ISBN 978-1-921543-29-6.

Data interpolation methods

This section introduces some interpolation methods and indicates how those methods can result in unexpected distortions to data. Care and experience are required to undertake effective interpolation, and the resulting data should be tested to avoid surprises.

The simplest spatial interpolation methods are usually based on deterministic concepts. These assume that one true, but unknown, spatial pattern exists, which is estimated using either ad-hoc assumptions or some optimality criterion.

Deterministic interpolation methods include:

The Thiessen (1911) method, or equivalently, nearest-neighbour method: each point to be interpolated is assigned the value of the data point that is closest in space. This results in patterns that have polygon-shaped patches of uniform values with sharp boundaries between the patches (Figure 1a). These are clearly very unrealistic for hydrological variables but the method is simple and robust. The method assumes that the data are error free (ie. the value of the interpolated surface at the measurement point and the measurement itself, are the same).

The inverse-distance-squared method: the interpolated value is estimated by a weighted mean of the data, and the weights are inversely proportional to the squared distance between the interpolated point and each data point. This results in a pattern that is smooth between the data points but can have a discontinuity in slope at the data points. For typical hydrologic data that are unevenly spaced, this method produces artefacts as illustrated in Figure 1b. It is clear that hydrological variables are very unlikely to vary in the way this method assumes. Even though this method is very easy to use, it cannot be recommended.

Moving polynomials: a trend surface is locally (ie for each interpolated value) fitted to the data within a small local (ie “moving”) window about each point to be interpolated. The fitting is usually done by least squares and the trend surface is represented as a polynomial. The higher the order of the polynomial, the closer the fit to the data but the more irregular the trend surface and the more likely the occurrence of “overfitting” will be. Overfitting occurs when the particular data points are very well represented but representation of the underlying (true) pattern is poor; ie more credibility is given to the data than is merited due to measurement error and the fact that the true surface may not vary like a polynomial. An example of overfitting is given in Figure 1c where a 6th-order polynomial has been fitted to the seven data points. Overfitting can be avoided by setting the order of the polynomial to be much smaller than the number of local data points, resulting in an interpolated (trend) surface that is smoother and does not exactly go through the data points. This is consistent with the implicit assumption that the data may be in error and need not be fitted exactly.

Thin plate (or Laplacian) splines: a continuously differentiable surface is fitted to all the data, ie this is a global rather than a local method. The name “thin plate” derives from the minimisation function used - this has a physical analogy in the average curvature or bending energy of a thin elastic sheet. A low-order polynomial is usually used and a minimum curvature criterion implies finding the smoothest possible function. This method therefore does not suffer from the overfitting or oscillation problems of the moving polynomial method. There are two main variants of this method. The simpler variant assumes that the data are error-free and hence the interpolated surface goes through the data points. It has one “smoothing” or “tension” parameter which can be used to control the smoothness of the interpolated surface (Figure 1d). The other variant allows for a measurement error by introducing an additional parameter representing the error term (eg Hutchinson, 1991). This parameter can be used to control the balance between the smoothness of the surface and how close the interpolated surface is to the data. Thin plate splines work very well in most hydrologic applications, unless the data are too widely spaced compared to the scale of the underlying phenomenon. They are robust and operationally straightforward to use. The thin-plate splines method is the recommended deterministic interpolation method. The only drawback, as compared to geostatistical methods, is that it is not as straightforward to explicitly consider measurement error and estimation error with spline interpolations.
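
The two simplest methods above can be sketched in a few lines of Python. This is a minimal illustration only, assuming planar coordinates in km and a hypothetical list of gauges given as (x, y, value) tuples; the function names are invented for this example and are not part of Source.

```python
import math

def nearest_neighbour(x, y, gauges):
    """Thiessen / nearest-neighbour estimate: take the value of the closest data point."""
    closest = min(gauges, key=lambda g: math.hypot(x - g[0], y - g[1]))
    return closest[2]

def inverse_distance_squared(x, y, gauges):
    """Weighted mean of the data with weights proportional to 1 / distance**2."""
    weighted_sum = 0.0
    weight_total = 0.0
    for gx, gy, value in gauges:
        d2 = (x - gx) ** 2 + (y - gy) ** 2
        if d2 == 0.0:
            return value  # exactly on a data point: return the measurement itself
        weight = 1.0 / d2
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total

# Hypothetical gauges: (x in km, y in km, average annual rainfall in mm)
gauges = [(0.0, 0.0, 800.0), (10.0, 0.0, 1000.0), (0.0, 10.0, 600.0)]

print(nearest_neighbour(2.0, 1.0, gauges))
print(inverse_distance_squared(5.0, 0.0, gauges))
```

Evaluating either function over a regular grid and plotting the result is a quick way to see the polygon patches and the artefacts around data points described above.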

Catchment Data

Source has different models for runoff generation, constituent generation, filtering, in-stream water quantity and in-stream water quality. Most of the runoff generation models used in Source Catchments are lumped and consequently only require “area”. However, other soil properties can be used to estimate bucket size and recession constants. The constituent generation and filter models generally require more detailed information on catchment characteristics. In-stream models may require characteristics such as channel dimensions or information used to compute estimates of dimensions. The catchment area is usually an easy parameter to obtain but should be used with caution. The area is dependent on the scale of maps or DEMs that it was derived from and in flatter areas there can be large uncertainty in catchment boundary delineation. A small error in catchment area can cause a large error in the estimated volume or load that runs off the catchment. This can be further exacerbated in estimating FU proportions where categorisations such as land use, soil type and topography are also used to define catchment boundaries.

Although slope, land use, soil profile, soil depth and hydraulic conductivity may not be used by a model this information is also worth considering. The type of land use will influence surface runoff characteristics, evapotranspiration rates and interception losses. The soil characteristics will influence the size of soil stores and seepage rates. This sort of information is invaluable for setting realistic bounds on model parameters as well as sanity checking the fluxes out of the model. This information may be used in the grouping of models and parameter sets in the calibration process. Generally the only catchment characteristic required by lumped rainfall runoff models is the catchment area. However, some models (eg. SWAT) need to know slope, land use, soil profile, soil depth, and hydraulic conductivity. The models operate in mm and to convert the model output from runoff depth to runoff volume, catchment area is required.
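
The depth-to-volume conversion mentioned above is simple arithmetic: 1 mm of runoff over 1 km² is 0.001 m × 1,000,000 m² = 1,000 m³, or 1 ML. The sketch below assumes area in km² and volume in ML; the function name and the sample values are hypothetical.

```python
def runoff_depth_to_volume_ml(depth_mm, area_km2):
    """Convert a runoff depth (mm) over a catchment area (km^2) to a volume (ML).

    1 mm over 1 km^2 = 0.001 m * 1,000,000 m^2 = 1,000 m^3 = 1 ML.
    """
    return depth_mm * area_km2

area_km2 = 250.0                      # hypothetical catchment area
daily_runoff_mm = [0.0, 1.2, 4.5]     # hypothetical model output (depth)

volumes_ml = [runoff_depth_to_volume_ml(d, area_km2) for d in daily_runoff_mm]
print(volumes_ml)

# An error in area carries straight through to volume: here the area is
# overestimated by 5%, so every volume is overestimated by 5% as well.
overestimated = [runoff_depth_to_volume_ml(d, area_km2 * 1.05) for d in daily_runoff_mm]
print(overestimated)
```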

Rainfall Data

The calibration of most Source models will be most sensitive to the rainfall data that are provided. If the volume of rainfall is incorrect or the rain days are not representative of the peaks in flow then good calibration may be difficult.

In preparing rainfall data, consider:

Catchment/sub-catchment average rainfall – ie. getting areal averages from specific rain gauges.
Selection of appropriate rainfall sites – to derive the areal averages and temporal signal.

Catchment/sub-catchment average rainfall

The catchment/sub-catchment average rainfall can be estimated by many different methods; two of the more common methods are discussed here. The first method is to draw an isohyetal map across the catchment and the second method is to sum grid squares from a rainfall surface (spline).

An isohyetal map is a contour map, typically of average annual rainfall. Drawing an isohyetal map is a relatively easy, manual process when there are a number of gauges in and surrounding the catchment. Take care to ensure all rainfall sites are gap filled and that the period selected is common to all sites. Climate databases, such as SILO, have splined surfaces that cover Australia. These surfaces take into account location, distance from coast and elevation, to derive average annual rainfall across grid squares. This can be summed for the grid squares in a catchment and averaged.
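
The grid-square approach can be sketched as follows. This is an illustration only: it assumes that every grid square has the same area, and that a hypothetical mask marks which squares fall inside the catchment (1) or outside it (0). Partial cells and real SILO file formats are not handled.

```python
def catchment_average(grid, mask):
    """Mean of the grid-square values for squares flagged as inside the catchment."""
    inside_values = [
        value
        for grid_row, mask_row in zip(grid, mask)
        for value, inside in zip(grid_row, mask_row)
        if inside
    ]
    if not inside_values:
        raise ValueError("mask selects no grid squares")
    return sum(inside_values) / len(inside_values)

# Hypothetical average annual rainfall surface (mm) and catchment mask
rainfall_grid = [
    [900.0, 950.0, 1000.0],
    [850.0, 900.0, 950.0],
    [800.0, 850.0, 900.0],
]
catchment_mask = [
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 1],
]

print(catchment_average(rainfall_grid, catchment_mask))  # 940.0
```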

Selecting rainfall sites

There are several considerations when selecting rainfall sites:

Difference in average annual rainfall as compared to the sub-catchment average annual rainfall
Proximity to the sub-catchment
Correlation with flow peaks
The number of sites used

If the difference in average annual rainfall is great (eg more than 20%) then the rainfall processes for the catchment and the selected site are probably quite different, and therefore this is probably not a good station to use.

Unfortunately studies have shown that the daily rainfall decorrelation distance is approximately 10 km and there are not many places in Australia where rainfall stations are this close together. In most cases stations in the catchment should have priority over ones outside the catchment. However, in many cases, there may not be any long-term stations in the sub-catchment, in which case short-term stations in the sub-catchment may be used to assess which long-term stations are most representative.

A good method for assessing how well a rainfall station represents the flow from the catchment is to plot the rainfall and flow on similar scales. The rainfall peaks can then be checked against the flow peaks to see whether the size of the peaks approximately correlates with the amount of rainfall and whether the peaks occur at about the same time, taking into account expected changes in this relationship with antecedent conditions.

The number of sites is a very important issue when multiple rainfall sites are available. When either Thiessen weighting or a nearest-neighbour spline disaggregation approach (as in SILO) is used to associate a portion of the catchment with each rainfall station, there are a few things to consider:

More is not always better. The more stations you have, the more rain days will occur on the catchment. Take care to make sure that the number of rain days per month is similar to the number of flow peaks per month.
When generating long flow records it is tempting to change the number of stations used as stations cut in and out. Be careful! Calibrating models to different groups of stations generally leads to different calibration parameters. Consequently, you will have no idea how robust the model is when the number of rainfall sites is considerably different from that in the period when the model was calibrated. This is an important point when automated interpolation (such as with SILO) is used, since the user is not aware of which stations are actually used in the final data set.
Be very careful when using rainfall surface data for the reasons mentioned above. It is not recommended that rainfall be used for every grid square that is available in the catchment. A better approach is to use the monthly surfaces at each grid square to estimate the average rainfall on the catchment each month and then disaggregate this data to daily data with selected rainfall sites. This ensures that you do not get more rain days; and more importantly, do not miss any daily peaks – which could occur on some, but not all, of the rainfall stations used.
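
The last point can be sketched as a simple scaling: the monthly catchment total from the surface is spread over the days of the month using the daily pattern at a selected site. This is a minimal sketch assuming a single pattern site; the function name and sample values are hypothetical.

```python
def disaggregate_month(catchment_month_total, site_daily):
    """Spread a monthly catchment rainfall total over days using one site's daily pattern."""
    site_total = sum(site_daily)
    if site_total == 0.0:
        if catchment_month_total == 0.0:
            return [0.0] * len(site_daily)
        # The site was dry, so it cannot supply a daily pattern for this month.
        raise ValueError("pattern site recorded no rain; choose another site")
    scale = catchment_month_total / site_total
    return [daily * scale for daily in site_daily]

# Hypothetical example: 60 mm on the catchment for the month (from the monthly
# surface), with a pattern site that recorded rain on only two days.
site_daily = [0.0, 10.0, 0.0, 30.0]
catchment_daily = disaggregate_month(60.0, site_daily)
print(catchment_daily)  # [0.0, 15.0, 0.0, 45.0]
```

The number of rain days in the result is the same as at the pattern site, while the monthly volume matches the surface.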

Infilling rainfall records

In many cases rainfall data records will have gaps. There are two types of data gaps that need to be considered:

Missing records
Accumulated gaps

Missing records occur when no rainfall data were collected. Accumulated gaps occur when data were not sampled for a few days and then a total reading for the period was taken. Accumulated gaps typically happen over weekends and public holidays when gauges are not read. Sometimes accumulated gaps are flagged in the data record with special codes. Where gaps are not flagged, special methods need to be incorporated to distinguish accumulated values from missing records.

Missing records are typically filled by neighbouring stations. A linear correlation on a monthly, seasonal or annual basis is established between the two sites and this is used to adjust the rainfall at the neighbouring site prior to infilling. The selection of the best neighbouring site can be on the basis of proximity or the correlation during the wettest month.

Accumulated gaps are disaggregated using neighbouring sites. The rainfall on each day at the neighbouring site is multiplied by the ratio of the accumulated values at the two sites and infilled into the accumulated gap. The selection of the best neighbouring site to use might be on the basis of proximity or similarity of accumulated rainfall. The neighbouring site chosen must have recorded some rainfall over the gap if values are to be disaggregated.
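
Both kinds of infilling can be sketched as below. As a simplifying assumption, the adjustment factor for missing records is taken as the ratio of totals over the days when both sites have data, rather than a fitted monthly or seasonal regression; missing values are represented by None. All names and values are hypothetical.

```python
def infill_missing(target, neighbour):
    """Fill None entries in target with neighbour values scaled by a simple ratio."""
    concurrent = [(t, n) for t, n in zip(target, neighbour)
                  if t is not None and n is not None]
    neighbour_total = sum(n for _, n in concurrent)
    if neighbour_total == 0.0:
        raise ValueError("no concurrent rainfall at the neighbour to derive a factor")
    factor = sum(t for t, _ in concurrent) / neighbour_total
    filled = []
    for t, n in zip(target, neighbour):
        if t is None and n is not None:
            t = n * factor  # adjusted neighbour value
        filled.append(t)    # stays None if the neighbour is also missing
    return filled

def disaggregate_accumulated(accumulated_total, neighbour_days):
    """Split an accumulated reading over the gap using the neighbour's daily pattern."""
    neighbour_total = sum(neighbour_days)
    if neighbour_total == 0.0:
        raise ValueError("neighbour recorded no rain over the gap")
    ratio = accumulated_total / neighbour_total
    return [daily * ratio for daily in neighbour_days]

# Hypothetical daily rainfall (mm)
print(infill_missing([5.0, None, 0.0, 10.0], [4.0, 8.0, 0.0, 8.0]))  # [5.0, 10.0, 0.0, 10.0]
print(disaggregate_accumulated(12.0, [0.0, 6.0, 2.0]))               # [0.0, 9.0, 3.0]
```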

Evapotranspiration data

There are many different methods of estimating evapotranspiration. Common methods include:

Average monthly ET surfaces in Climatic Atlas of Australia: Evapotranspiration
Evaporation pan (Class A, sunken tank, sunken tank with bird guard)
Lysimeter
Priestley Taylor equation
Morton equation
Penman Monteith equation
Reference crop evapotranspiration

The important issue with evaporation data is that all of these different methods will provide a different estimate of evaporation. It is important that whatever source is used, it is consistent with what the model requires. There are typically two different types of evapotranspiration data required by models: potential evapotranspiration (PET) and actual evapotranspiration (AET). There are factors that can be applied to each of these data sources to convert to the appropriate type of evapotranspiration. The Rainfall Runoff Library (freely available from www.toolkit.net.au/rrl) has a data scaling dialog that allows annual or monthly factors to be applied to evaporation data. In the context of rainfall-runoff modelling, the areal potential evapotranspiration (APET), rather than point potential evapotranspiration (PPET) should be used.

In general, the use of mean monthly APET is sufficient for most rainfall-runoff modelling applications, because the inter-annual variability of PET is relatively low and, compared to rainfall, the day-to-day variation in PET has little influence on the water balance at a daily time scale. For a description of the different forms of evapotranspiration refer to Fitzmaurice et al (2006).

Note: A popular approach for generating both rainfall and evapotranspiration time series, particularly for larger catchments, is to use the SILO daily ASCII gridded data as input to Source. For information regarding purchasing SILO data, see http://www.longpaddock.qld.gov.au/silo/.

Infilling evaporation records

As with rainfall records, evaporation records will have gaps that are due to missing records or accumulated readings. Because evaporation data have low variance both spatially and temporally, filling evaporation data is much simpler. Some of the methods for filling gaps include:

Using the values from a neighbouring station adjusted by a monthly, seasonal or annual relationship selected based on proximity or correlation
Using the long term average obtained from the site for each day of the year
Using separate monthly average evaporation rates for wet days and dry days, with the rainfall at a selected neighbouring station determining whether the wet or dry monthly average value is used
Accumulated gaps can be spread by dividing equally across the gap
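
Two of these methods are sketched below: filling from the long-term average for the same calendar day, and spreading an accumulated reading equally across the gap. The record layout (a list of date and value pairs, with None for missing values) and the function names are assumptions made for this example.

```python
from collections import defaultdict
from datetime import date

def infill_with_day_of_year_mean(records):
    """Fill None values with the mean of the same calendar day in other years.

    records is a list of (date, value-or-None) pairs.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for day, value in records:
        if value is not None:
            sums[(day.month, day.day)] += value
            counts[(day.month, day.day)] += 1
    filled = []
    for day, value in records:
        key = (day.month, day.day)
        if value is None and counts[key] > 0:
            value = sums[key] / counts[key]
        filled.append((day, value))
    return filled

def spread_accumulated(total, n_days):
    """Divide an accumulated evaporation reading equally across the gap."""
    return [total / n_days] * n_days

# Hypothetical pan evaporation (mm) on 1 January in three years
records = [(date(2000, 1, 1), 4.0), (date(2001, 1, 1), 6.0), (date(2002, 1, 1), None)]
print(infill_with_day_of_year_mean(records)[2])  # (datetime.date(2002, 1, 1), 5.0)
print(spread_accumulated(21.0, 3))               # [7.0, 7.0, 7.0]
```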

Flow data

Rainfall runoff and routing models are calibrated against flow data. When extracting gauged flow time series data from databases such as HYDSTRA (commercial software used across Australia for archiving of gauge data), it is important to align the time bases for flow with other climate data sets. For example, SILO data are collected each day at 9:00 am, so each data point is the total rainfall for the previous day. Therefore, when extracting flow data from HYDSTRA, ensure that the “end of day” option is selected, so that the flow data will align with the SILO rainfall data.

It is important to understand the conventions used by your organisation and by organisations that send you data - they may not be the same!

Do not assume that the flow data have no errors; good quality flow data leads to better calibration. There are several considerations:

How were the height data collected? Automatic logging equipment is more accurate than manual reading since the latter does not integrate over the day. In many Australian states prior to 1970, manual readings were taken once daily, so height data collected prior to automatic logging should be recognised as having a lower reliability and used with caution in calibration.
How stable is the rating at the site? Check by reviewing successive rating curves for the site.
How sensitive are flow estimates to a change in height? At low flows, how does the accuracy of the equipment relate to the flow? For example, a 1 mm change in height equates to how many m³/s of flow?
Are there looped ratings, ie where the ratings on the rising and falling limbs of hydrographs are different?
How believable is the gauge reading at low flows? Could the gauge be “stuck”? At low flows, the data should be free of step changes. If the servicing dates of the gauges are known, measured levels before and after can be compared to ensure there is no step change.
What is the highest rated flow compared to the highest flow in the data set? It is common for rating curves to flatten out at high flows, meaning that there can be considerable uncertainty about high-flow rates.

After assessing the flow data it may be appropriate to remove unrealistic data from the record prior to calibration. It is also important to know the errors across the flow range (typically worse in the higher flows).

Infilling flow records

Infilling flow records is much more difficult than rainfall and evaporation because of the autocorrelation between successive flow values, ie the flow today is similar to the flow yesterday. This means that methods such as factoring a neighbouring site’s flow records may cause discontinuities in recessions, which incorrectly indicates events that may not have actually occurred. This may not be an issue in some cases but should be taken into consideration. There are no accumulated values in flow records. In some situations, flow records can be in-filled by stations that may be upstream or downstream of the flow site. In such situations, there may be quite a strong relationship between the two sites. There are several ways of exploiting this relationship:

Regression relationships (monthly, annual or total record) between the flow at both sites
Regression relationships between the flow at both sites taking into account time lag
Regression relationships between the flow at both sites taking into account routing
Mapping between equivalent percentiles in flow duration curves that take into account lag or routing
Smaller gaps in recessions may be extrapolated based on a logarithmic relationship when there is no rainfall at nearby stations
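
The last item can be illustrated with a simple exponential recession. This sketch is one possible reading of "a logarithmic relationship": it interpolates linearly in log(flow) between the known flows on either side of the gap, which assumes a constant daily recession ratio and no rainfall during the gap. The function name and values are hypothetical.

```python
import math

def fill_recession_gap(flow_before, flow_after, n_missing):
    """Interpolate linearly in log(flow) across a gap of n_missing days."""
    if flow_before <= 0.0 or flow_after <= 0.0:
        raise ValueError("log interpolation needs positive flows on both sides")
    steps = n_missing + 1
    log_step = (math.log(flow_after) - math.log(flow_before)) / steps
    return [flow_before * math.exp(log_step * i) for i in range(1, n_missing + 1)]

# Hypothetical recession: 100 ML/d before the gap, 12.5 ML/d after, 2 days missing.
# The implied daily recession ratio is 0.5, so the filled values are about 50 and 25.
print(fill_recession_gap(100.0, 12.5, 2))
```

Because the filled values follow the same recession as the observations on either side, this approach does not introduce the discontinuities that factoring another site's record can cause.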

Where there are no upstream or downstream data to fill gaps then a neighbouring catchment may be used. Take care when selecting this catchment: ensure that the flow characteristics are similar by comparing flow records and flow duration curves. It is most likely that in-filling will cause discontinuities that can be smoothed if required. The methods available for infilling are:

Regression relationships between the flow at both sites
Mapping percentiles between flow duration curves
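
Percentile mapping between flow duration curves can be sketched as follows: find where the donor site's flow on a given day sits in the donor's own record, then read off the flow at the same position in the target site's record. This is a minimal sketch using linear interpolation between sorted values; it ignores lag and routing, and the function names and data are hypothetical.

```python
from bisect import bisect_right

def percentile_of(sorted_values, x):
    """Position of x in a sorted record, as a fraction between 0 and 1."""
    n = len(sorted_values)
    if x <= sorted_values[0]:
        return 0.0
    if x >= sorted_values[-1]:
        return 1.0
    i = bisect_right(sorted_values, x)  # sorted_values[i - 1] <= x < sorted_values[i]
    lower, upper = sorted_values[i - 1], sorted_values[i]
    return (i - 1 + (x - lower) / (upper - lower)) / (n - 1)

def value_at(sorted_values, p):
    """Value at fractional position p (0 to 1) in a sorted record."""
    position = p * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])

def map_flow(donor_flow, donor_record, target_record):
    """Estimate the target flow at the same percentile as donor_flow."""
    p = percentile_of(sorted(donor_record), donor_flow)
    return value_at(sorted(target_record), p)

# Hypothetical concurrent records (ML/d) at the donor and target sites
donor_record = [1.0, 2.0, 4.0, 8.0, 16.0]
target_record = [10.0, 20.0, 40.0, 80.0, 160.0]
print(map_flow(3.0, donor_record, target_record))  # 30.0
```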

Another way of filling gaps in flow records is by calibrating rainfall runoff models to the observed record and in-filling the gaps with the modelled data. This will also likely cause discontinuities in the record that may need to be smoothed.

Water quality data

Generally most time series flow and climate data are collected on a continuous or daily basis. This may be the case for some water quality data such as salinity and temperature and in more recent times turbidity, where this is typically sampled at the same location as flow data. However in most cases water quality data are collected discretely. This creates several issues when trying to calibrate models:

The data paucity may be a constraint on being used for calibration purposes.
Water quality data may not have been collected at the same location as flow data. This creates particular problems when converting between sampled concentrations and estimated loads.
Discrete samples are effectively “instantaneous” whereas if we are calibrating a daily model, we really need effective value over a day. These scale problems are more severe for small or flashy catchments where flows and water quality change rapidly within a day.
Samples may not be taken across a range of flows, which limits the understanding of the different processes, particularly at high flows where samples are typically not taken.
Changes in the collection methods and the associated difference in the results obtained by each method.

There are also issues with continuous data such as:

Instrument calibration and drift over time.
Location of sampling equipment, particularly where a water quality constituent is not fully mixed. Stratification for temperature and salinity may be an issue in deeper areas for weir pools and storages in periods of low flow.
The spatial distribution of a constituent, particularly where samples are taken in storages. The concentration is likely to vary both spatially and with depth.

Spurious correlation

Water quality data are most commonly collected as a concentration rather than a load. The conversion between concentration and load is achieved by multiplying by flow. This raises several issues:

The flow and concentration data may not be collected at the same location or the same point in time. This introduces uncertainty that is directly related to the amount of flow. The larger the flow, the larger the error in estimating loads.
The concentration will most likely be a point sample that implies that this concentration exists across the entire cross-section of the river. This will only be the case if the constituent is fully mixed.

During calibration, for example, correlations are made between observed and simulated data to get the best possible match. It is important to make comparisons between modelled and observed values that take into consideration what the model is to be used for, ie volume or concentration of a constituent. Take care when assessing statistical comparisons based on load: the variation in daily flow is generally orders of magnitude greater than that of concentration, so when flow is multiplied by concentration to get load, the dominant driver will be flow. A consequence of this is that the correlation between observed load and simulated load may be overpowered by the correlation of flows. This is known as spurious correlation.

When loads are being assessed over large periods of time (eg. years) then the variance in flows will be a similar order of magnitude to concentration and consequently comparisons of loads can be made. However when the variance in flow is substantially more than concentration it is best to ensure that comparisons are made based on concentration.
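
The effect can be demonstrated with a small synthetic example. All numbers below are invented for illustration: flow varies over three orders of magnitude, the simulated concentrations are deliberately a poor match to the observed ones, and the same flow is used for both observed and simulated loads. With flow in ML/d and concentration in mg/L, the load is in kg/d.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Synthetic daily values (hypothetical)
flow_ml_per_day = [1.0, 10.0, 100.0, 1000.0, 5.0, 50.0, 500.0, 2.0]
observed_conc = [30.0, 40.0, 35.0, 45.0, 32.0, 38.0, 42.0, 36.0]    # mg/L
simulated_conc = [45.0, 35.0, 40.0, 30.0, 42.0, 36.0, 32.0, 38.0]   # mg/L, poor match

# Load (kg/d) = flow (ML/d) * concentration (mg/L)
observed_load = [q * c for q, c in zip(flow_ml_per_day, observed_conc)]
simulated_load = [q * c for q, c in zip(flow_ml_per_day, simulated_conc)]

r_conc = pearson(observed_conc, simulated_conc)
r_load = pearson(observed_load, simulated_load)
print(f"concentration r = {r_conc:.2f}, load r = {r_load:.2f}")
```

The concentration correlation is negative, yet the load correlation is close to 1 because both load series are dominated by the shared flow.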

Infilling water quality records

The techniques for infilling continuous water quality records are similar to those mentioned for flow. There is also an additional method that can be applied when there is reasonable correlation between flow and concentration. For example, in salinity records high flows are associated with lower salinities and low flows with higher salinities; the opposite occurs with sediment and often with particulate nutrient concentrations. There may also be correlations with other constituents. For example, water temperature may correlate with solar radiation or air temperature, and total suspended solids may correlate with particulate nutrients or turbidity (which can be measured continuously).
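
Where flow and concentration are reasonably correlated, one way to exploit the relationship is a rating of concentration against flow. The sketch below assumes a power-law form, C = a × Q^b, fitted by least squares on the logarithms; this form is an assumption made for illustration, not a method prescribed here, and the data are synthetic.

```python
import math

def fit_power_law(flows, concentrations):
    """Least-squares fit of log(C) = log(a) + b * log(Q); returns (a, b)."""
    xs = [math.log(q) for q in flows]
    ys = [math.log(c) for c in concentrations]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b

def estimate_concentration(flow, a, b):
    """Concentration predicted by the fitted rating at the given flow."""
    return a * flow ** b

# Synthetic salinity samples: concentration falls as flow rises
flows = [1.0, 4.0, 16.0, 100.0]             # ML/d
salinities = [800.0, 400.0, 200.0, 80.0]    # mg/L

a, b = fit_power_law(flows, salinities)
print(a, b)                                  # approximately 800 and -0.5
print(estimate_concentration(25.0, a, b))    # approximately 160
```

A negative exponent b corresponds to the dilution behaviour described for salinity; a positive exponent would correspond to constituents such as sediment that increase with flow.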

Climate data

At present the only time series data required by Source Catchments’ rainfall runoff models are rainfall and potential evapotranspiration. Over time, other component models might require other climate data such as air temperature, solar radiation or relative humidity. Comments on these will be added as required.

References

Fitzmaurice, L, Beswick, A, Rayner, D, Kociuba G & Moodie K 2006, Calculation, verification and distribution of potential evapotranspiration (PET) data for Australia, Department of Natural Resources and Water, Queensland.

Grayson, RG & Blöschl, G (eds) 2001, Spatial patterns in catchment hydrology : observations and modelling. Cambridge University Press, Cambridge.

Note: This is documentation for version 5.0 of Source.