Data mining for water industry applications
The School of Engineering at Exeter University adopts a multidisciplinary approach to engineering education and research. Within the School this approach is exemplified by the Centre for Water Systems, which brings together researchers from various engineering disciplines. The Centre develops and applies innovative computing methods, techniques and advanced information technology to problems of water resource systems planning and management. Dr Dragan Savic explains its research programme into optimisation of pipe networks.
FIG 1: Basic structure of an artificial neural network
Water utilities possess large quantities of data derived from different sources (SCADA, asset and customer databases, GIS) and collected in different formats, which are often stored without being properly archived or fully understood. The industry now recognises that, in addition to making data available across a company, it is equally important to be able to extract information from data efficiently, that is, to have procedures for identifying logical, nontrivial, useful, and ultimately understandable patterns in data. The new technologies also promise to provide intelligent information-extraction capabilities to those water utilities that have often been accused of being ‘data rich but information poor.’ The problems associated with extracting information from diverse data sources could become a thing of the past. For example, it may be valuable to analyse data from different water companies and to produce a model of the natural rise in background leakage and the effect rehabilitation actions have on it. The process of discovering patterns in data, and ultimately knowledge, is commonly known as data mining. This article introduces some aspects of data mining, together with the two data mining technologies with probably the widest applicability in the water industry: artificial neural networks and genetic algorithms. The article also suggests many water industry applications.
Data mining appears under a multitude of names, including knowledge discovery in databases, data or information harvesting, data archaeology, functional dependency analysis, knowledge extraction, and data pattern analysis. In addition, a large number of definitions exist for this group of methods. Of the several related definitions of data mining, the one most appropriate for real-world applications is:
Data mining is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
In other words, data mining is the search for relationships and global patterns that exist among parameters in databases, but are hidden among vast amounts of data. Data mining can be divided into three major categories or levels:
· undirected or pure data mining. Here the data mining system is left relatively unconstrained to discover patterns (models) in the data, free of prejudices from the user;
· directed data mining. Here the user asks something much more specific, and the problem usually changes from a general pattern-detection problem to a rather better defined model-induction problem; and
· hypothesis testing and refinement. Here the user presents a hypothesis to the system to evaluate and refine it.
The process of data mining usually consists of the following four steps:
· Data Screening. One of the advantages of data mining is that it tolerates noisy data and missing values. However, every effort should be made to minimise errors and missing values in the data, because these can degrade the performance of the models obtained from the data;
· Selection of Training and Validation Datasets. To achieve robustness and generalisation of models, data mining is commonly done using the split record test to develop a model and validate it. This method consists of splitting data into a training set and a set for validation. Only the training data are used to evaluate the fitness of the model being developed. The training error is calculated as the error between the modelled and target output for the training data. Similarly, the test error is calculated as the error between the modelled and target output for the validation data;
· Selection of Relevant Parameters. It is important to include a number of parameters that may have some relevance to the problem being studied. The data mining system will then discover which are most useful, and any relationship that exists between these parameters. Omitting a highly relevant parameter from the analysis will cause deterioration in prediction/classification performance of the system; and
· Model Discovery and Encoding. This phase involves running the system (the learning phase), validating the newly discovered model and finally encoding the model in software that can be used for prediction or classification purposes in future.
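The split-record test described above can be sketched in a few lines of code. The function names and the toy records below are illustrative, not part of any specific package; a real application would substitute genuine utility data and a fitted model.

```python
import random

def split_records(records, train_fraction=0.8, seed=1):
    """Randomly split records into a training set and a validation set."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def mean_squared_error(model, data):
    """Error between the modelled and target output for a data set."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Toy records: (input, target) pairs following a known relationship
records = [(x, 2.0 * x) for x in range(100)]
training, validation = split_records(records)

model = lambda x: 2.0 * x            # stand-in for a model fitted on `training`
training_error = mean_squared_error(model, training)
test_error = mean_squared_error(model, validation)
```

Only the training set is used when fitting the model; comparing the training and test errors then indicates whether the model generalises or merely memorises its training data.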
Although other general-purpose data mining techniques exist, artificial neural networks (ANNs) and genetic algorithms (GAs) are probably the most promising techniques for water industry data mining problems. From its earliest history, researchers in the field of ANNs have been interested in identifying hidden patterns in data. The potential of ANNs to provide significant advances in the application of modelling to water industry systems rests on their ability to model complex physical phenomena, even in the presence of noisy data, and on their ability to tackle non-linear problems.
A neural network is a set of simple computational units called ‘neurons’, each of which tries to imitate the behaviour of a single human brain cell. Figure 1 (above) shows the basic structure of a neural network with two layers of neurons and the connections among them. Individual neurons receive input parameter values (signals), which propagate through the network, where they are amplified by weights along the connections. These signals are then summed at each individual neuron and the individual output signals are calculated.
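As a rough illustration of this forward propagation, the following sketch sums the weighted input signals at each neuron and passes the result through a sigmoid activation — one common choice, assumed here for concreteness. The weights are arbitrary illustrative values, not the result of any training:

```python
import math

def neuron(inputs, weights, bias):
    """Sum the weighted incoming signals and apply a sigmoid activation."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-total))

def forward(inputs, hidden_layer, output_layer):
    """Propagate input signals through two layers of neurons."""
    hidden = [neuron(inputs, w, b) for w, b in hidden_layer]
    return [neuron(hidden, w, b) for w, b in output_layer]

# Two inputs, two hidden neurons, one output neuron; the (weights, bias)
# pairs below are arbitrary -- in practice they are learned during training
hidden_layer = [([0.5, -0.4], 0.1), ([0.9, 0.2], -0.3)]
output_layer = [([1.0, -1.0], 0.0)]
output = forward([0.7, 0.3], hidden_layer, output_layer)
```

Training then consists of adjusting the weights and biases so that the network’s outputs match the target values in the training data.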
The crucial step of data mining is teaching the network the desired behaviour, i.e. training the ANN so that it outputs values similar to those in the training data sets when the input values match, or fall between, those of the training samples. For example, an ANN can be trained to use the values of a variety of parameters, such as turbidity, colour, temperature, flow and the concentrations of total nitrogen, as well as soluble and total phosphorus, to forecast algal blooms and to identify the factors that affect the blooms of a particular type of algae in a stream. Results of various studies demonstrate that ANNs have a promising capacity for learning complex, non-linear processes that are difficult to model using rigorous mathematical (mechanistic/first principles) models. Therefore, the development of an ANN appears justifiable for modelling purposes in cases where parameter estimates for model building are imprecise or difficult to obtain. Other examples of the use of ANNs include:
· time series analysis and forecasting (e.g. water demand predictions);
· assessment of the influence of changing conditions (e.g. influence of changing seasonal climates in modifying discharge and water quality parameters in river basins);
· water quality modelling predictions, for example learning the dynamics of chlorine decay in water distribution networks;
· estimation of physical and biological parameters (e.g. daily soil water evaporation, algal growth); and
· process modelling (e.g. modelling water treatment plant operation).
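As a minimal sketch of the time series forecasting idea, the illustrative code below trains a single linear ‘neuron’ — a drastic simplification of a full ANN, assumed here only to show the mechanics — to predict the next value of a series from its two previous values, by gradient descent on the training error. A synthetic sinusoid stands in for a real demand record:

```python
import math

def train(series, lags=2, rate=0.05, epochs=500):
    """Fit a linear predictor of the next value from `lags` previous values."""
    weights = [0.0] * lags
    bias = 0.0
    samples = [(series[i - lags:i], series[i]) for i in range(lags, len(series))]
    for _ in range(epochs):
        for inputs, target in samples:
            prediction = bias + sum(w * x for w, x in zip(weights, inputs))
            error = prediction - target           # training error for this sample
            bias -= rate * error                  # gradient-descent updates
            weights = [w - rate * error * x for w, x in zip(weights, inputs)]
    return weights, bias

# Synthetic stand-in for a demand series: a sinusoid, which obeys the exact
# linear recurrence x[t] = 2*cos(0.5)*x[t-1] - x[t-2]
series = [math.sin(0.5 * t) for t in range(60)]
weights, bias = train(series)
forecast = bias + sum(w * x for w, x in zip(weights, series[-2:]))
```

A genuine ANN adds hidden layers and non-linear activations to the same scheme, which is what lets it capture the non-linear processes described above.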
However, industrial concern regarding the use of ANNs as ‘black box’ models, and the associated extrapolation problems, has understandably limited their routine application. The complexity of signal processing and weight calculation within ANNs makes them black-box models because their inner workings remain concealed from the user even after the learning process is over. An additional drawback is the lack of any indicator for evaluating the accuracy and reliability of an ANN’s answer when extrapolation is required, such as when ‘never-seen’ patterns are presented to the ANN. Genetic algorithms and associated techniques promise to resolve these problems by giving more transparent solutions with non-linear capabilities similar to those of ANNs. Here we present genetic programming (GP) as an appropriate machine intelligence technology for water industry applications.
Genetic programming is an approach to the automated generation of computer programs based on genetic algorithms. Genetic algorithms are optimisation techniques that find optimal or near optimal solutions to problems by simulating the process of natural evolution. These algorithms begin the optimisation process with a collection of solutions to the particular problem produced by random generation.
The randomly generated starting solutions are generally very poor. Procedures similar to reproduction, mutation and natural selection then act on the population of solutions, which evolves through successive generations towards progressively better solutions. Genetic programming differs from genetic algorithms in the nature of the solutions sought: genetic programming seeks solutions that are themselves mathematical models or computer programs. The genetic material that evolves over successive generations represents executable computer code and is more complex than the data typically used with genetic algorithms. A simple GP solution corresponding to the algebraic expression “a + (2.3 x c)” would be represented as the tree in Figure 2. The root node is the first element of the tree, the interior nodes are the functions, and the leaf nodes are the constants (e.g. 2.3) and/or the variables (e.g. a and c). If the set of functions used is sufficiently rich, tree structures are capable of representing hierarchical programs of any complexity. For example, these functions may include arithmetic operators (+, x, -), mathematical functions (sin, cos, log), Boolean operators (AND, OR, NOT), logical expressions (IF-THEN-ELSE), iterative functions (DO-UNTIL), or any other user-defined function. The genetic programming technique must operate on evolving genetic material that is of variable length and contains the recursive structures typical of computer programs. All of the genetic programming operations must ensure that a meaningful and executable syntax exists in all the solution programs the evolutionary process creates.
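The tree representation described above can be sketched with nested tuples and a hypothetical recursive evaluator — an illustration of how a GP tree for “a + (2.3 x c)” might be encoded and executed, not part of any specific GP system:

```python
def evaluate(node, variables):
    """Recursively evaluate an expression tree of nested tuples."""
    if isinstance(node, tuple):                  # interior node: (function, children...)
        op, *children = node
        args = [evaluate(child, variables) for child in children]
        functions = {'+': lambda a, b: a + b,
                     '-': lambda a, b: a - b,
                     'x': lambda a, b: a * b}
        return functions[op](*args)
    if isinstance(node, str):                    # leaf node: a variable
        return variables[node]
    return node                                  # leaf node: a constant

# Tree for a + (2.3 x c): '+' at the root, with leaves a, 2.3 and c
tree = ('+', 'a', ('x', 2.3, 'c'))
result = evaluate(tree, {'a': 1.0, 'c': 2.0})    # a + 2.3*c
```

Because the genetic operators manipulate such trees directly — swapping, growing and pruning subtrees — every offspring remains a syntactically valid, executable program.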
An EPSRC-funded project is investigating the potential water industry applications of genetic programming. Dr James Davidson at the Centre for Water Systems is conducting the research, under the direction of Dr Savic and Dr Godfrey Walters, in collaboration with Optimal Solutions, a division of Ewan Associates Ltd. The investigation combines theoretical and practical work in the engineering and computer science fields. In addition to applications similar to those identified for ANNs, the project initially focuses on the following:
· development of optimal water supply control strategies;
· forecasting demands on water supply systems for both the short and long term; and
· the use of genetic programs to efficiently simulate the behaviour of complex water distribution systems as a replacement for more computationally intensive simulation models.
The aim of the project is to improve the modelling methods used by the water industry by developing and applying data mining techniques. The potential benefit of improved modelling, and of understanding the knowledge buried in large data sets, is more efficient operation of water supply reservoirs, water distribution systems and treatment plants, which would result in reduced costs for water utilities and increased consumer satisfaction.