Revolutionizing Water Management by Harnessing Machine Learning for River Water Quality Prediction
By Pankaj Kumar Sahu
Water conservation is imperative today, as freshwater sources are finite and shrinking by the day. According to a WHO report, about a billion people lack access to safe drinking water, and two million people die annually as a result of poor water quality, poor sanitation, and unhygienic conditions. Although 71% of the Earth’s surface is covered by water, about 97% of that water is salt water.
The rapid pace of industrialization and economic development over the decades has increased the use of chemicals, fertilizers, and pesticides, as well as the volume of harmful industrial waste. A melange of untreated household sewage, solid waste, electronic waste, and other pollutants continuously mixes with freshwater bodies such as rivers, lakes, and reservoirs. This rapidly degrades water quality and makes the water unfit for drinking and for other essential uses such as agriculture and industry.
It is imperative to test, analyze, and control water quality. While several techniques for assessing water quality have been in use for decades, most rely on a predominantly non-statistical, laboratory-based, single-dimension approach to testing. Water quality testing is largely based on three groups of parameters: physical, chemical, and biological.
- Physical water quality parameters include total dissolved solids (TDS), turbidity, temperature, color, electrical conductivity, salinity, taste, odor, etc.
- Chemical water quality parameters include dissolved oxygen, pH, hardness, chlorine, acidity, alkalinity, etc.
- Biological water quality parameters include bacteria load, algae, nutrients, viruses, etc.
Water quality assessment vs. prediction
Of the many methods available, the Water Quality Index (WQI) is the most widely used for measuring water quality. It condenses the physical, chemical, and biological parameters mentioned above into a single value, computed as their weighted average.
WQI prediction uses artificial intelligence (AI) and machine learning (ML) algorithms to train models on a variety of data, including historical raw river data, remote sensing data, sensor data, water quality parameters, seasonal river water data, and meteorological data. The trained models are evaluated for accuracy against test data in different scenarios before being deployed to predict future water quality from current conditions.
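As a minimal sketch of the weighted-average idea, the Python snippet below computes a WQI from per-parameter quality scores. The function name, the parameter names, the 0 to 100 scoring scale, and the weights are illustrative assumptions, not values taken from any official standard.

```python
def water_quality_index(sub_indices, weights):
    """Weighted arithmetic mean of per-parameter quality scores.

    sub_indices: dict mapping parameter name -> quality score (0-100 scale)
    weights:     dict mapping parameter name -> relative importance (> 0)
    """
    if set(sub_indices) != set(weights):
        raise ValueError("sub_indices and weights must cover the same parameters")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(sub_indices[name] * weights[name] for name in sub_indices)
    return weighted_sum / total_weight


# Hypothetical scores and weights, for illustration only.
scores = {"dissolved_oxygen": 80.0, "pH": 90.0, "turbidity": 60.0, "tds": 70.0}
weights = {"dissolved_oxygen": 4.0, "pH": 3.0, "turbidity": 2.0, "tds": 1.0}

wqi = water_quality_index(scores, weights)
print(round(wqi, 1))  # 78.0
```

Real WQI schemes differ in how they convert raw measurements into scores and in which weights they assign, so both inputs would come from the guideline being followed.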
Why prediction of river water quality is critical
The population explosion has led to the widespread use of chemicals such as fertilizers and pesticides, which, along with vast amounts of industrial and domestic waste, get dumped into rivers. This has serious detrimental effects on river water quality. Once river water is degraded and unfit for use, it directly affects the health of all living things. Further, cleaning degraded freshwater bodies is a massive and time-consuming task, causing huge inconvenience and deprivation to those who depend on this water.
Predicting water quality buys time to implement preventive measures and provides valuable insights into which measures are required. This can avert further contamination and make the cleaning process more effective, efficient, eco-friendly, and less time-consuming.
The traditional approach to measuring and predicting water quality is long-drawn-out and often inaccurate. This is mostly because it depends on an individual water expert’s knowledge to analyze the huge volume of historical data collected. Given the limits of human capacity, it is difficult to predict water quality both accurately and quickly this way.
Today, technologies such as IoT, big data, and cloud computing drive the collection, storage, and fast processing of large volumes of data, while artificial intelligence and machine learning give water experts and data scientists efficient methods for analyzing and predicting water quality quickly.
Methods of water quality prediction
Rivers, the largest source of freshwater supply, have become a significant repository for sewage discharges from domestic and industrial activities and are therefore either highly polluted or prone to pollution. Water treatment, water quality monitoring, and water quality control are necessary to ensure clean water at an affordable cost. Systematic analysis of data, along with water quality prediction, is therefore the need of the hour.
Methods such as multivariate statistical techniques are used to determine the correlation between different water quality parameters, whereas machine learning models such as regression and classification algorithms, and deep learning models such as ANN (artificial neural networks), are used for predicting the water quality with higher accuracy.
Some of the popular ML and deep learning models used for prediction of water quality include:
Machine Learning Models
• Linear regression
• Logistic regression
• Decision tree
• Random forest algorithm
• SVM algorithm
• Naive Bayes algorithm
• KNN algorithm
• K-means
• Dimensionality reduction algorithms
• Gradient boosting algorithm
Deep Learning Models
• Convolutional Neural Networks (CNNs)
• Recurrent Neural Networks (RNNs)
• Long Short-Term Memory Networks (LSTMs)
• Generative Adversarial Networks (GANs)
• Radial Basis Function Networks (RBFNs)
Depending on the type, volume, and quality of data, water quality experts and data scientists decide which model or combinations of models will best suit the purpose.
Overview of an ML Model for Water Quality Prediction
The diagram below depicts high-level architecture for applying a machine learning model to water quality prediction.
Figure 1: Machine learning framework for water quality prediction
The architecture consists of various elements such as data collection, data exploration, data processing, training and testing the model, model deployment, and model monitoring.
Data collection
All water parameters which determine the water quality index are collected from sample collection points. Additionally, various sensors fitted at strategic locations help gather a large volume of data. A minimum of three years of sample data is ideal for training the model. Seasonal and weather factors also need to be considered while collecting the sample. Data can be collected and represented in time series form, which can be used for prediction and forecasting purposes.
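To illustrate the time series representation, the sketch below turns a chronologically ordered series of readings into (past window, next value) pairs that a forecasting model could be trained on. The function name, the window length, and the sample values are assumptions made for illustration.

```python
def make_lagged_samples(series, window):
    """Split an ordered series into (inputs, target) pairs.

    Each sample uses `window` consecutive past readings as inputs and the
    reading that immediately follows them as the target.
    """
    if window < 1 or window >= len(series):
        raise ValueError("window must be >= 1 and shorter than the series")
    samples = []
    for start in range(len(series) - window):
        inputs = series[start:start + window]
        target = series[start + window]
        samples.append((inputs, target))
    return samples


# Hypothetical monthly WQI readings, oldest first.
monthly_wqi = [72.0, 70.5, 68.0, 65.5, 67.0, 71.0]
samples = make_lagged_samples(monthly_wqi, window=3)
for inputs, target in samples:
    print(inputs, "->", target)
```

With several years of data, the window could be widened to span a full seasonal cycle so that the model sees the same season from earlier years.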
Exploratory data analysis (EDA)
EDA is a method used to analyze the raw water data set with statistical and graphical tools in order to discover trends, patterns, and anomalies in the data. It helps identify the key features, that is, the water parameters that correlate most strongly with the target output.
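One simple way to look for such correlations is to rank each parameter by its Pearson correlation with the WQI. The sketch below does this in plain Python; the helper names and all sample readings are hypothetical, and in practice a data analysis library would typically be used instead.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_features(columns, target):
    """Return (feature, correlation) pairs, strongest absolute value first."""
    scored = [(name, pearson(values, target)) for name, values in columns.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical sample readings, for illustration only.
columns = {
    "dissolved_oxygen": [8.1, 7.4, 6.0, 5.2, 4.1],
    "turbidity": [2.0, 3.5, 6.0, 7.5, 9.0],
    "temperature": [21.0, 24.0, 22.0, 23.0, 21.5],
}
wqi = [85.0, 78.0, 64.0, 55.0, 43.0]

ranked = rank_features(columns, wqi)
for name, r in ranked:
    print(f"{name}: {r:+.2f}")
```

In this made-up data, dissolved oxygen and turbidity track the WQI closely while temperature barely does, so temperature would be the first candidate to drop. Pearson correlation only captures linear relationships, so plots remain useful alongside it.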
Data preprocessing (DP)
In data preprocessing, the raw water data set is cleansed, transformed, and normalized so that machine learning algorithms can use it for model training and testing. DP ensures that quality data is fed into the ML models and improves the efficiency of the overall training process.
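A minimal sketch of two common preprocessing steps follows: filling missing readings with the median of the observed values, then rescaling to the range 0 to 1. The choice of median imputation and min-max scaling, the function names, and the pH readings are all assumptions for illustration; other cleansing and normalization strategies are equally valid.

```python
import statistics


def impute_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("column has no observed values")
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant column: no spread to scale
    return [(v - low) / (high - low) for v in values]


# Hypothetical pH readings with one missing sample.
raw_ph = [6.5, 7.0, None, 8.5, 7.5]
clean_ph = impute_missing(raw_ph)
scaled_ph = min_max_scale(clean_ph)
print(clean_ph)   # [6.5, 7.0, 7.25, 8.5, 7.5]
print(scaled_ph)  # [0.0, 0.25, 0.375, 1.0, 0.5]
```

When a model is trained, the fill value and the minimum and maximum would normally be computed on the training data only and then reused for test data, so that no information leaks from the test set.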
Model training
The model for predicting the water quality index is trained on the key features of the water data set, which are finalized, cleansed, and transformed in the EDA and DP steps. These selected key features are known as “input variables,” and the feature “water quality index,” which actually determines the quality of the sample water, is called the “output variable” or “target variable.” Supervised learning uses the following two types of models:
- Regression models: These are used for predicting the water quality index for future dates as a continuous value. The quality of the water is then determined from the predicted value according to scientific guidelines.
- Classification models: These are used for predicting the water quality as a discrete category, for example, whether the water will be potable, palatable, contaminated, or infected.
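The difference between the two model types can be sketched with two of the simplest algorithms from the list above: a single-feature linear regression fitted by least squares, and a k-nearest-neighbors (KNN) classifier. This is an illustration under stated assumptions rather than a recommended setup; the function names, readings, and labels are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("feature is constant; cannot fit a slope")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def knn_classify(train_points, train_labels, query, k=3):
    """Majority label among the k training points closest to `query`."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query)),
    )
    votes = {}
    for _, label in by_distance[:k]:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


# Regression: hypothetical dissolved-oxygen readings (mg/L) and WQI values.
oxygen = [4.0, 5.0, 6.0, 7.0, 8.0]
wqi = [42.0, 52.0, 62.0, 72.0, 82.0]
slope, intercept = fit_line(oxygen, wqi)
predicted_wqi = slope * 6.5 + intercept  # continuous output
print(predicted_wqi)  # 67.0

# Classification: hypothetical (dissolved oxygen, turbidity) samples.
points = [(8.0, 2.0), (7.5, 2.5), (7.8, 3.0), (4.0, 8.0), (3.5, 9.0), (4.5, 7.5)]
labels = ["potable", "potable", "potable",
          "contaminated", "contaminated", "contaminated"]
predicted_label = knn_classify(points, labels, query=(7.0, 3.0), k=3)
print(predicted_label)  # potable
```

The regression model returns a number that still has to be interpreted against a guideline, whereas the classifier returns the category directly. Because KNN relies on distances, the features would normally be normalized before use.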
Model evaluation (testing)
The performance of each model in predicting the water quality is evaluated to find the best model to be deployed. Regression and classification models have different evaluation methods, as the output of the regression model is a continuous value, while the classification model yields a discrete value.
a. Regression model metrics
Three typical error metrics are used for assessing the performance of the regression ML model:
• Mean Square Error (MSE)
• Root Mean Square Error (RMSE)
• Mean Absolute Error (MAE)
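These three metrics can be computed directly from the differences between actual and predicted values. The sketch below shows one way to do so; the function name and the WQI values are hypothetical.

```python
import math


def regression_errors(actual, predicted):
    """Return MSE, RMSE, and MAE for paired actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    residuals = [a - p for a, p in zip(actual, predicted)]
    mse = sum(r * r for r in residuals) / len(residuals)
    rmse = math.sqrt(mse)
    mae = sum(abs(r) for r in residuals) / len(residuals)
    return {"mse": mse, "rmse": rmse, "mae": mae}


# Hypothetical actual and predicted WQI values.
actual_wqi = [70.0, 65.0, 80.0, 55.0]
predicted_wqi = [68.0, 69.0, 77.0, 55.0]

errors = regression_errors(actual_wqi, predicted_wqi)
print(errors["mse"])              # 7.25
print(round(errors["rmse"], 3))   # 2.693
print(errors["mae"])              # 2.25
```

RMSE and MAE are expressed in the same units as the WQI itself, which makes them easier to interpret than MSE. RMSE penalizes large individual errors more heavily than MAE does.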
b. Classification model metrics
The most popular tool for assessing classification model performance is the confusion matrix, which tabulates true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Below are four key metrics calculated from the confusion matrix.
- Precision is defined as the ratio of True Positive (TP) to the total number of predicted positive results, i.e., Precision = TP/(TP+FP)
- Recall is defined as the ratio of True Positive to the total number of actual positive cases, i.e., Recall = TP/(TP+FN)
- Accuracy = (TP+TN)/(TP+TN+FP+FN)
- F1 = 2 * Precision*Recall / (Precision + Recall)
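The four formulas above can be applied as follows. This is a sketch with hypothetical labels, treating "contaminated" as the positive class; the function names are assumptions for illustration.

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, FP, TN, FN for one class treated as the positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, accuracy, and F1; 0.0 when a ratio is undefined."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Hypothetical labels for ten water samples.
actual = ["contaminated"] * 5 + ["potable"] * 5
predicted = ["contaminated", "contaminated", "contaminated", "potable", "potable",
             "contaminated", "potable", "potable", "potable", "potable"]

tp, fp, tn, fn = confusion_counts(actual, predicted, positive="contaminated")
metrics = classification_metrics(tp, fp, tn, fn)
print(tp, fp, tn, fn)  # 3 1 4 2
print(metrics)
```

Here the model misses two of the five contaminated samples, so recall (0.6) is lower than precision (0.75). For water safety, a missed contaminated sample is usually the costlier error, which is why recall deserves attention alongside accuracy.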
Model deployment
The best-performing water quality model goes through many rounds of testing and is optimized and tuned thoroughly before being ready for deployment. Models need a deployment environment with all the necessary resources and data to function optimally. Below are the most widely used deployment methods.
- Web service deployment: Exposes the model output as a web service that web, mobile, and desktop applications can integrate with.
- Batch deployment: Used where real-time prediction is not a priority. It suits complex calculations and predictions and can handle a high volume of data.
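As a rough sketch of the batch style, the snippet below scores a whole file of samples in one pass. The CSV layout, the column names, and the stand-in `toy_model` function are all hypothetical; a real deployment would load the trained model and read from the actual data store.

```python
import csv
import io


def score_batch(rows, predict, feature_names):
    """Apply `predict` to every row and collect the predictions."""
    results = []
    for row in rows:
        features = [float(row[name]) for name in feature_names]
        results.append({
            "sample_id": row["sample_id"],
            "predicted_wqi": predict(features),
        })
    return results


def toy_model(features):
    """Stand-in for a trained model; the coefficients are made up."""
    dissolved_oxygen, turbidity = features
    return 10.0 * dissolved_oxygen - 2.0 * turbidity


# Hypothetical batch of samples; in practice this would be a file on disk.
batch_csv = io.StringIO(
    "sample_id,dissolved_oxygen,turbidity\n"
    "S1,8.0,2.0\n"
    "S2,4.5,7.5\n"
)
predictions = score_batch(
    csv.DictReader(batch_csv), toy_model, ["dissolved_oxygen", "turbidity"]
)
for item in predictions:
    print(item["sample_id"], item["predicted_wqi"])
```

A job like this would typically run on a schedule, for example once a day after new sensor readings have been collected.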
Many cloud hyperscalers, such as AWS, Azure, and Google Cloud, provide readily available services to deploy the model and integrate it with various applications. This is often the most convenient way to deploy the model if you are already using a cloud service.
Model monitoring
Constant monitoring of the model is required to evaluate its performance and accuracy over time. New data arriving over time may differ from the data the model was trained on, for example because of seasonal shifts, new pollution sources, or sensor changes, and this can degrade model performance.
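One simple monitoring approach is to track the prediction error on recent samples and compare it with the error measured at deployment time. The sketch below assumes this approach; the class name, the window size, the tolerance factor, and the readings are hypothetical.

```python
from collections import deque


class ErrorMonitor:
    """Track recent absolute errors and flag degradation against a baseline."""

    def __init__(self, baseline_mae, window=30, tolerance=1.5):
        self.baseline_mae = baseline_mae
        self.tolerance = tolerance
        self.recent = deque(maxlen=window)  # keeps only the newest errors

    def record(self, actual, predicted):
        """Store the absolute error of one prediction once the truth is known."""
        self.recent.append(abs(actual - predicted))

    def rolling_mae(self):
        """Mean absolute error over the current window, or None if empty."""
        if not self.recent:
            return None
        return sum(self.recent) / len(self.recent)

    def degraded(self):
        """True when the rolling MAE exceeds tolerance times the baseline."""
        current = self.rolling_mae()
        return current is not None and current > self.tolerance * self.baseline_mae


# Hypothetical (actual, predicted) WQI pairs arriving over time.
monitor = ErrorMonitor(baseline_mae=2.0, window=3, tolerance=1.5)
for actual, predicted in [(70.0, 69.0), (65.0, 67.0), (60.0, 61.0)]:
    monitor.record(actual, predicted)
before = monitor.degraded()

for actual, predicted in [(55.0, 62.0), (50.0, 58.0)]:
    monitor.record(actual, predicted)
after = monitor.degraded()

print(before, after)  # False True
```

When the flag is raised, the usual responses are to investigate the incoming data and, if needed, retrain the model on more recent samples.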
Toward sustainable solutions
The depletion of water as a resource is increasingly affecting every region of the world. We are facing the effects of rapid urbanization, climate change, aging infrastructure, and shrinking resources. At the same time, a growing number of innovative technologies can help overcome water-related challenges. Governments, water utilities, and IT and engineering service integrators should work together to build suitable solutions for problems such as water quality monitoring, distribution, leakage, and water scarcity. Ultimately, technology plays a pivotal role in delivering safe and reliable water for everyday use.
About the Author:
Author: Pankaj Kumar Sahu
Designation: Director, Technology Partner, Cyient
Bio: Pankaj Sahu spearheads the Enterprise Asset Management (EAM) practice for Electric, Gas, and Water Utility technology solutions at Cyient. With over 20 years of experience, he has an established track record in implementing and consulting on EAM and APM solutions for clients in the Utility, Transportation, and Energy sectors worldwide. The views expressed in this article are his own.