Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a machine learning-based flocculation air flotation algae removal process optimization method, which uses the H2O automatic machine learning platform (H2O AutoML) to optimize the flocculation air flotation algae removal process, thereby improving algae-laden water treatment efficiency and water quality safety.
In the machine learning-based flocculation air flotation algae removal process optimization method, the automatic machine learning model selection and parameter tuning can adapt to the water quality conditions which change in real time, and the addition amount and the operation parameters of the flocculant are scientifically optimized.
According to a first aspect of the present invention, a machine learning-based flocculation air flotation algae removal process optimization method is provided, comprising the following steps:
acquiring flocculation air flotation algae removal treatment historical data, wherein the historical data comprises historical water quality data, flocculation air flotation process parameters and algae treatment data, and preprocessing the historical data to obtain a training data set;
dividing the training data set into a training set and a test set according to a preset proportion;
using the H2O AutoML platform to automatically train a plurality of machine learning algorithms on the training set and the test set, and selecting an optimal model according to a preset evaluation index;
performing hyperparameter optimization on the optimal model to obtain a flocculation air flotation algae removal process optimization control model;
deploying the flocculation air flotation algae removal process optimization control model into a control system of a water treatment facility, dynamically predicting flocculation air flotation operation parameters according to water quality data acquired in real time, and adjusting the actual flocculant dosage, flocculation conditions and air flotation conditions; and
evaluating the algae removal effect of the algae removal treatment carried out with the actual flocculation air flotation operation parameters.
In a further embodiment, the flocculation air flotation algae removal treatment history data includes:
Historical water quality data: turbidity, algae species and quantity, pH, water temperature, and conductivity;
Flocculation air flotation process parameters: air flotation time, coagulant dosage, and stirring intensity;
Algae treatment data: algae species and quantity, turbidity, and physicochemical indicators of the algae, including dissolved extracellular polymeric substances (dEPS), bound extracellular polymeric substances (bEPS), and algal cell Zeta potential.
In a further embodiment, the machine learning algorithms executed on the H2O AutoML platform include a generalized linear model (GLM), a distributed random forest (DRF), extremely randomized trees (XRT), deep learning (DeepLearning), XGBoost, a gradient boosting machine (GBM), and stacked ensemble models (StackedEnsemble), and the optimal model is selected based on a predetermined evaluation index.
In a further embodiment, the configuration with which the H2O AutoML platform automatically trains the plurality of machine learning algorithms includes:
selecting the variable to be predicted by the model, i.e., setting the response_column value;
setting the maximum AutoML run time to 300 s, i.e., the max_runtime_secs parameter is set to 300;
setting the maximum number of models that AutoML explores before stopping to 10, i.e., the max_models parameter is set to 10;
providing a random seed for the AutoML run to ensure the repeatability of the experiment, the seed being 1234, i.e., the seed parameter is set to 1234;
selecting AUTO as the distribution function for each machine learning algorithm, i.e., the distribution parameter is set to AUTO;
setting the number of cross-validation folds to 5, i.e., the nfolds parameter is set to 5, which helps to evaluate the stability and generalization ability of the model;
setting the early-stopping parameter stopping_rounds to 3, i.e., training stops when the model performance has not improved within the specified number of rounds;
setting the keep_cross_validation_models parameter to True so that the cross-validated models are retained.
In a further embodiment, performing hyperparameter optimization on the optimal model includes:
automatically adjusting the model parameters using the grid search or random search techniques of the H2O AutoML platform.
In a further embodiment, the method further comprises the steps of:
analyzing the key factors that influence algae removal efficiency, and their weights, by using the model interpretation function of the H2O AutoML platform.
In a further embodiment, the model interpretation methods used include residual analysis, variable importance analysis, the Shapley interpretation method and partial dependence curves, which are used to interpret the predictions of the optimal model obtained by training on the H2O AutoML platform.
In a further embodiment, the method further comprises the steps of:
acquiring water quality data, actual flocculation air flotation operation parameters and algae removal effect at a preset period, dynamically retraining and updating the flocculation air flotation algae removal process optimization control model, and redeploying the updated model.
With the automatic machine learning-based flocculation air flotation algae removal process optimization method, the prediction model predicts indexes such as the algae quantity and turbidity of the water from the input water quality parameters and process parameters. The flocculant dosage and air flotation operating conditions are therefore no longer set blindly, rapidly changing water quality conditions can be accommodated, and the water quality continues to meet the safety standard. The method offers good social and economic benefits and helps water plant process managers make appropriate flocculation air flotation process decisions quickly.
According to the invention, an automatic machine learning module is introduced into flocculation air flotation process prediction, so that model selection and hyperparameter optimization are carried out automatically and a plurality of basic machine learning algorithms are integrated, improving the accuracy and generalization capability of the model; automated model development reduces errors caused by human factors, such as errors in manual data processing and parameter selection, and improves the repeatability and reliability of the whole model development process.
Compared with the prior art, the implementation of the flocculation air flotation algae removal process optimization method has the remarkable beneficial effects that:
1. Providing automated and intelligent decision support: the H2O AutoML platform automatically trains a plurality of machine learning algorithms and selects an optimal model from a large number of candidate models; this automation reduces the workload of technical personnel in model selection and parameter tuning and improves the accuracy and efficiency of decision making;
2. Providing a data-driven optimization process: by using historical data and real-time data, the method dynamically adjusts flocculation air flotation operation parameters such as the flocculant dosage, flocculation conditions and air flotation conditions; the data-driven approach makes the water treatment process more precise and allows a quick response to real-time changes in water quality;
3. Improving the treatment efficiency and economy: the scientifically and reasonably optimized flocculation air floatation operation parameters can improve the algae removal effect, reduce unnecessary chemical use and reduce the operation cost;
4. Enhancing the stability and reliability of the system: by continuously monitoring and adjusting the operation parameters, the method can cope with the fluctuation and change of the original water quality, keep the stability and reliability of the treatment effect, especially cope with seasonal change or water quality change in sudden events, timely and dynamically adjust the operation process parameters according to the water quality change in real time, and ensure the high efficiency and water quality safety of the water quality treatment;
5. Providing continuous performance monitoring and evaluation: by continuously evaluating the actual flocculation air flotation operation parameters and the algae removal effect, the method monitors the performance of the whole system in real time; this continuous monitoring helps to find problems in time and make the necessary adjustments, so that the water treatment effect is ensured;
6. Easy to expand and suitable for different scenarios: the optimization method can be easily adapted to water treatment facilities of different scales and different water quality conditions through online updating and deployment of the machine learning model. And as more data is accumulated, the predictive power and accuracy of the model will further increase.
It should be understood that all combinations of the foregoing concepts, as well as additional concepts described in more detail below, may be considered a part of the inventive subject matter of the present disclosure as long as such concepts are not mutually inconsistent. In addition, all combinations of claimed subject matter are considered part of the disclosed inventive subject matter.
The foregoing and other aspects, embodiments, and features of the present teachings will be more fully understood from the following description, taken together with the accompanying drawings. Other additional aspects of the invention, such as features and/or advantages of the exemplary embodiments, will be apparent from the description which follows, or may be learned by practice of the embodiments according to the teachings of the invention.
Detailed Description
For a better understanding of the technical content of the present invention, specific examples are set forth below, along with the accompanying drawings.
Aspects of the invention are described in this disclosure with reference to the drawings, in which are shown a number of illustrative embodiments. The embodiments of the present disclosure are not necessarily intended to include all aspects of the invention. It should be understood that the various concepts and embodiments described above, as well as those described in more detail below, may be implemented in any of a number of ways, as the disclosed concepts and embodiments are not limited to any implementation. Additionally, some aspects of the disclosure may be used alone or in any suitable combination with other aspects of the disclosure.
Machine learning-based flocculation air flotation algae removal process optimization method
With reference to the embodiment shown in Fig. 1, the machine learning-based flocculation air flotation algae removal process optimization method is implemented through the following steps:
acquiring flocculation air flotation algae removal treatment historical data, wherein the historical data comprises historical water quality data, flocculation air flotation process parameters and algae treatment data, and preprocessing the historical data to obtain a training data set;
dividing the training data set into a training set and a test set according to a preset proportion;
using the H2O AutoML platform to automatically train a plurality of machine learning algorithms on the training set and the test set, and selecting an optimal model according to a preset evaluation index;
performing hyperparameter optimization on the optimal model to obtain a flocculation air flotation algae removal process optimization control model;
deploying the flocculation air flotation algae removal process optimization control model into a control system of a water treatment facility, dynamically predicting flocculation air flotation operation parameters according to water quality data acquired in real time, and adjusting the actual flocculant dosage, flocculation conditions and air flotation conditions; and
evaluating the algae removal effect of the algae removal treatment carried out with the actual flocculation air flotation operation parameters.
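The deployment step can be pictured as a search over candidate operating settings: for each new water quality reading, the control model scores a set of candidate settings and the best-scoring one is applied. The following is only a minimal sketch under assumptions that the text does not state: the model is treated as a callable that returns a predicted algae removal rate, and the feature names, candidate grids and the toy_removal_model stand-in are hypothetical.

```python
from itertools import product


def choose_operating_point(predict, water_quality, dose_grid, time_grid):
    """Return (dose, flotation_time, predicted_removal) with the highest
    predicted removal rate over a grid of candidate settings.

    `predict` is a hypothetical callable standing in for the deployed model.
    """
    best = None
    for dose, flotation_time in product(dose_grid, time_grid):
        features = dict(water_quality, dose_mg_per_l=dose, flotation_time_s=flotation_time)
        removal = predict(features)
        # Strict comparison keeps the first (smallest) setting among ties.
        if best is None or removal > best[2]:
            best = (dose, flotation_time, removal)
    return best


def toy_removal_model(features):
    """Toy stand-in for a trained model (illustrative only, not fitted to data)."""
    dose_term = min(features["dose_mg_per_l"], 20) / 20
    time_term = min(features["flotation_time_s"], 60) / 60
    return 0.5 * dose_term + 0.5 * time_term


reading = {"turbidity_ntu": 12.0, "ph": 7.8}  # hypothetical real-time reading
best_setting = choose_operating_point(
    toy_removal_model, reading, dose_grid=[5, 10, 20, 30], time_grid=[20, 40, 60, 80]
)
```

Because ties are resolved in favor of the first candidate, listing the grids in ascending order makes the sketch prefer the lowest dosage and shortest flotation time that reach the best predicted removal.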
In a further embodiment, the flocculation air flotation algae removal treatment history data includes:
Historical water quality data: turbidity, algae species and quantity, pH, water temperature, and conductivity;
Flocculation air flotation process parameters: air flotation time, coagulant dosage, and stirring intensity;
Algae treatment data: algae species and quantity, turbidity, and physicochemical indicators of the algae, including dissolved extracellular polymeric substances (dEPS), bound extracellular polymeric substances (bEPS), and algal cell Zeta potential.
The acquired flocculation air flotation algae removal treatment historical data are preprocessed to obtain the training data set.
In an embodiment of the present invention, the preprocessing includes: data normalization and outlier processing.
Data normalization converts data of different dimensions or numerical ranges in the historical data to the same scale. This balances the contribution of each feature to the model, prevents convergence problems caused by differences in feature scale during training, and makes slight changes in the data easier to capture, thereby improving the accuracy and robustness of the trained model.
As an example, the normalization in this embodiment uses Min-Max normalization or Z-score standardization.
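The two normalization options can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the turbidity readings are hypothetical, and the Z-score variant here uses the population standard deviation, which is an assumption since the text does not say which estimator is used.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def z_score_normalize(values):
    """Center values on zero mean and scale to unit population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]


turbidity = [2.0, 4.0, 6.0, 8.0]  # hypothetical readings
scaled = min_max_normalize(turbidity)
standardized = z_score_normalize(turbidity)
```

Min-Max normalization bounds every feature to the same interval, whereas Z-score standardization is less sensitive to a single extreme value setting the scale.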
The foregoing outlier processing means identifying values that differ markedly from most of the data or are clearly unreasonable, and eliminating or mitigating their effect on model training by direct or conditional deletion, substitution, clipping, transformation, and the like. In the embodiment of the invention, direct deletion is selected: outlier records are removed to eliminate noise and errors in the data, which improves the overall quality and signal-to-noise ratio of the data set, improves the accuracy and reliability of model training, and removes the model and prediction errors caused by outliers.
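Direct deletion needs a rule for deciding which records count as outliers, and the text does not specify one. The sketch below assumes the common interquartile-range (IQR) rule purely for illustration; the column name, the factor k and the sample records are hypothetical.

```python
from statistics import quantiles


def drop_outlier_rows(rows, column, k=1.5):
    """Delete rows whose `column` value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = [row[column] for row in rows]
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the column
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [row for row in rows if low <= row[column] <= high]


# Hypothetical pH records; 13.9 is an implausible reading.
records = [{"ph": v} for v in (7.1, 7.3, 7.2, 7.4, 7.0, 7.5, 7.2, 13.9)]
cleaned = drop_outlier_rows(records, "ph")
```

Whole records are dropped rather than single values, so the remaining rows stay internally consistent across all features.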
In the embodiment of the invention, the data dividing ratio of the training set to the test set is 0.8:0.2.
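A 0.8:0.2 division can be sketched as a seeded shuffle followed by a cut. This is a minimal standard-library illustration, not the platform's own splitting routine; reusing the seed 1234 for the shuffle is an assumption made here for reproducibility.

```python
import random


def split_records(records, train_fraction=0.8, seed=1234):
    """Shuffle a copy of the records and cut it into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# 400 record indices stand in for a 400-row data set.
train_set, test_set = split_records(range(400))
```

Shuffling before cutting avoids a test set made up only of the most recent or last-entered records.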
In a further embodiment, the machine learning algorithms executed on the H2O AutoML platform include a generalized linear model (GLM), a distributed random forest (DRF), extremely randomized trees (XRT), deep learning (DeepLearning), XGBoost, a gradient boosting machine (GBM), and stacked ensemble models (StackedEnsemble), and the optimal model is selected based on a predetermined evaluation index.
In an embodiment of the invention, the training set obtained from the split is used as the model training data, i.e., it is specified by the training_frame parameter.
In a further embodiment, the configuration with which the H2O AutoML platform automatically trains the plurality of machine learning algorithms includes:
selecting the variable to be predicted by the model, i.e., setting the response_column value;
setting the maximum AutoML run time to 300 s, i.e., the max_runtime_secs parameter is set to 300;
setting the maximum number of models that AutoML explores before stopping to 10, i.e., the max_models parameter is set to 10;
providing a random seed for the AutoML run to ensure the repeatability of the experiment, the seed being 1234, i.e., the seed parameter is set to 1234;
selecting AUTO as the distribution function for each machine learning algorithm, i.e., the distribution parameter is set to AUTO;
setting the number of cross-validation folds to 5, i.e., the nfolds parameter is set to 5, which helps to evaluate the stability and generalization ability of the model;
setting the early-stopping parameter stopping_rounds to 3, i.e., training stops when the model performance has not improved within the specified number of rounds;
setting the keep_cross_validation_models parameter to True so that the cross-validated models are retained.
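The configuration above maps onto keyword arguments of the H2OAutoML class in the H2O Python package. The sketch below is one possible arrangement, not the inventors' script: the early-stopping setting is written with the library's argument name stopping_rounds, the response column is passed through the y argument of train (the Python counterpart of response_column), and the helper name train_automl is hypothetical. It assumes the h2o package is installed and a cluster has been started with h2o.init().

```python
# Settings from the configuration list, keyed by H2OAutoML argument names.
AUTOML_SETTINGS = {
    "max_runtime_secs": 300,               # stop after 300 s of total run time
    "max_models": 10,                      # explore at most 10 models
    "seed": 1234,                          # reproducible runs
    "distribution": "AUTO",                # let each algorithm pick its distribution
    "nfolds": 5,                           # 5-fold cross-validation
    "stopping_rounds": 3,                  # early stopping after 3 rounds without improvement
    "keep_cross_validation_models": True,  # retain the cross-validated models
}


def train_automl(train_frame, response_column):
    """Run H2O AutoML on an H2OFrame and return the fitted AutoML object."""
    # Imported lazily so the settings can be inspected without H2O installed.
    from h2o.automl import H2OAutoML

    features = [c for c in train_frame.columns if c != response_column]
    aml = H2OAutoML(**AUTOML_SETTINGS)
    aml.train(x=features, y=response_column, training_frame=train_frame)
    return aml
```

After training, the fitted object's leaderboard attribute lists the trained models ranked by the sort metric, and its leader attribute holds the top-ranked model.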
In an embodiment of the invention, the indexes used to evaluate model performance during model training are the root mean square error (RMSE), the mean absolute error (MAE), and the coefficient of determination (R²).
The root mean square error RMSE is the square root of the average of the squared differences between the predicted and actual values in a regression task, and is used to evaluate the accuracy of the model predictions.
The mean absolute error MAE is the average of the absolute values of the deviations between the individual predicted values and the true values.
The coefficient of determination R² indicates the degree to which the independent variables explain the dependent variable; it normally lies between 0 and 1, and the higher its value, the better the model fits.
In the method of the invention, the root mean square error, the mean absolute error and the coefficient of determination are selected to evaluate the performance of each model on the test set.
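The three evaluation indexes are simple to compute directly, which can help when checking values reported by the platform. The sketch below uses the standard textbook definitions; the observed and predicted removal rates are hypothetical and are not results from this disclosure.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean square error: square root of the mean squared difference."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error: mean of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


observed = [0.60, 0.75, 0.90, 0.95]   # hypothetical removal rates
predicted = [0.62, 0.70, 0.91, 0.93]  # hypothetical model outputs
scores = (rmse(observed, predicted), mae(observed, predicted), r_squared(observed, predicted))
```

On any single data set the RMSE is never smaller than the MAE, which gives a quick sanity check when reading reported metrics.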
In a further embodiment, performing hyperparameter optimization on the optimal model includes:
automatically adjusting the model parameters using the grid search or random search techniques of the H2O AutoML platform.
In a further embodiment, the method further comprises the steps of:
analyzing the key factors that influence algae removal efficiency, and their weights, by using the model interpretation function of the H2O AutoML platform.
In a further embodiment, the model interpretation methods include residual analysis, variable importance analysis, the Shapley interpretation method and partial dependence curves, which are used to interpret the predictions of the optimal model obtained by training on the H2O AutoML platform.
Residual analysis examines the residuals, i.e., the differences between the actual observed values and the model's predicted values; the information carried by the residuals can be used to analyze data reliability, periodicity or other disturbances.
Variable importance analysis measures the degree to which each input feature of the model influences the prediction.
The Shapley interpretation measures the average contribution of a single feature to the model prediction when interactions with other features are taken into account.
Partial dependence curve analysis shows how a feature variable affects the model prediction by calculating the marginal effect of one (or two) input parameters on the prediction model.
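The marginal effect behind a partial dependence curve can be computed by hand: the feature of interest is set to each grid value in every row, and the model's predictions are averaged. The sketch below illustrates this with a toy function in place of a trained model; the feature names, the grid and toy_removal_model are hypothetical, and the platform's own plotting routines may differ in detail.

```python
def partial_dependence(predict, rows, feature, grid):
    """For each grid value, overwrite `feature` in every row and average the predictions."""
    curve = []
    for value in grid:
        predictions = [predict(dict(row, **{feature: value})) for row in rows]
        curve.append((value, sum(predictions) / len(predictions)))
    return curve


def toy_removal_model(row):
    """Toy stand-in for a trained model (illustrative only): saturates at 60 s."""
    return 0.01 * min(row["flotation_time_s"], 60) + 0.02 * row["ph"]


rows = [
    {"flotation_time_s": 30, "ph": 7.0},
    {"flotation_time_s": 50, "ph": 8.0},
]
curve = partial_dependence(toy_removal_model, rows, "flotation_time_s", [20, 40, 60, 80])
```

In this toy case the curve rises up to 60 s and is flat afterwards, the kind of plateau that suggests further increases in a parameter bring no additional benefit.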
In a further embodiment, the method further comprises the steps of:
acquiring water quality data, actual flocculation air flotation operation parameters and algae removal effect at a preset period, dynamically retraining and updating the flocculation air flotation algae removal process optimization control model, and redeploying the updated model.
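The periodic update can be sketched as a small bookkeeping object that accumulates new operating records and retrains once the preset period has elapsed. This is an illustrative outline only: the class name PeriodicRetrainer, the seven-day period, the dates and the train_fn callable are all hypothetical, and redeployment is represented simply by replacing the stored model.

```python
from datetime import datetime, timedelta


class PeriodicRetrainer:
    """Collect new operating records and retrain once a preset period has elapsed."""

    def __init__(self, train_fn, period, start):
        self.train_fn = train_fn    # hypothetical callable: list of records -> model
        self.period = period        # retraining interval (timedelta)
        self.last_trained = start
        self.records = []
        self.model = None

    def add_record(self, record, now):
        """Store one record; retrain and return True if the period has elapsed."""
        self.records.append(record)
        if now - self.last_trained >= self.period:
            self.model = self.train_fn(self.records)  # updated model replaces the old one
            self.last_trained = now
            return True
        return False


start = datetime(2024, 1, 1)
# `len` stands in for a real training function so the sketch runs without H2O.
retrainer = PeriodicRetrainer(train_fn=len, period=timedelta(days=7), start=start)
first = retrainer.add_record({"removal": 0.91}, start + timedelta(days=1))
second = retrainer.add_record({"removal": 0.88}, start + timedelta(days=8))
```

Passing the training function in as an argument keeps the scheduling logic independent of the modeling platform.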
Example 1
To further illustrate the practice of the foregoing method of the present invention, we will now describe in further detail by way of specific embodiments thereof, with reference to the accompanying figures 2-9.
Step 1: 400 historical records are obtained from laboratory experiments; the records comprise historical water quality data, flocculation air flotation process parameters and algae treatment data.
Water quality data: turbidity, algae species and quantity, pH, water temperature, conductivity, and dissolved oxygen.
Flocculation air flotation process parameters: air flotation time, coagulant dosage, and stirring intensity.
Algae treatment data: algae species and quantity, turbidity, and physicochemical indicators of the algae, including dissolved extracellular polymeric substances (dEPS), bound extracellular polymeric substances (bEPS), and algal cell Zeta potential.
Step 2: data exploration. The seaborn package is imported in the Python script, and the pairplot and corr functions are called to examine the distribution of the original data set and the correlations within the data, as shown in Figs. 2 and 3.
Specifically, the packages required by seaborn (pandas and matplotlib) are downloaded through pip and configured in a conda environment; the CSV data are read with pandas into a DataFrame tabular data structure; exploratory data analysis is carried out with the pairplot function, with parameters such as kind="scatter" and diag_kind="kde"; the corr function is then used to calculate the Pearson correlation coefficients between all numerical columns of the DataFrame; finally, the figures are displayed and saved with the show and savefig functions of matplotlib.
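The exploration step can be sketched as a single helper. This is one possible arrangement rather than the script used in the example: the function name explore and the output file names are hypothetical, and drawing the correlation matrix as a seaborn heatmap is an assumption, since the text only states that corr is called. It requires pandas, seaborn and matplotlib to be installed.

```python
# pairplot settings named in the text
PAIRPLOT_KWARGS = {"kind": "scatter", "diag_kind": "kde"}


def explore(csv_path, pairplot_png="pairplot.png", heatmap_png="correlation.png"):
    """Save a pair plot and a Pearson correlation heatmap; return the correlation matrix."""
    # Imported lazily so the settings can be inspected without the plotting stack.
    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    data = pd.read_csv(csv_path)  # CSV -> DataFrame

    # Pairwise scatter plots with kernel density estimates on the diagonal.
    grid = sns.pairplot(data, **PAIRPLOT_KWARGS)
    grid.savefig(pairplot_png)

    # Pearson correlation between all numerical columns.
    correlations = data.select_dtypes("number").corr(method="pearson")
    fig, ax = plt.subplots()
    sns.heatmap(correlations, annot=True, ax=ax)  # assumed presentation of the matrix
    fig.savefig(heatmap_png)

    plt.close("all")
    return correlations
```

Saving the figures to files instead of calling show keeps the helper usable in non-interactive environments.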
Step 3: the H2O AutoML platform is initialized, the data read in step 2 are input and split into a training set and a validation set at a ratio of 0.8:0.2, and then the target column to be predicted (the blue-green algae removal rate) and the feature columns used for training (the water quality parameters, flocculation air flotation process parameters and algae treatment data mentioned in step 1) are designated.
The H2O AutoML platform automatically trains a plurality of models to establish the flocculation air flotation algae removal process prediction model; after training, the test set is used to compare the performance of the models, as shown in Fig. 4.
Specifically, the framework required for model training is downloaded and installed automatically through pip in the conda virtual environment; H2O is initialized in the Python script with the init function; the upload_file function reads the data, and the data set is split with the split_frame function at a training set to validation set ratio of 0.8:0.2, i.e., ratios=[0.8, 0.2]; the feature columns used for training are extracted (train.columns) and specified; the H2OAutoML function is then used to start building the models, and five-fold cross-validation is used to ensure the reliability of the model.
Here, the max_runtime_secs parameter of the H2OAutoML function is set to 300, so that the maximum AutoML run time is 300 s; max_models is set to 10, which limits the maximum number of models that AutoML explores before stopping; seed is set to 1234, providing a random seed for the AutoML run and ensuring that model building is repeatable; AUTO is selected as the distribution function for the algorithms, i.e., distribution is set to AUTO; the number of cross-validation folds is set to 5, i.e., nfolds is set to 5; stopping_rounds is set to 3; keep_cross_validation_models is set to True so that the cross-validated models are retained.
In this embodiment, the mean square error (equation 1), the mean absolute error (equation 2) and the coefficient of determination (equation 3) are selected to evaluate the performance of each trained model on the test set.
The best trained model was Stack_1, with a mean square error of 0.05164, a mean absolute error of 0.05308, and a coefficient of determination of 0.972.
Here, SS_res (residual sum of squares) is the residual variation, i.e., the sum of the squared differences between the observed and predicted values; SS_tot (total sum of squares) is the total variation, i.e., the sum of the squared differences between the observed values and their mean.
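Equations 1 to 3 are not reproduced in this text. The standard definitions, which agree with the descriptions of SS_res and SS_tot above, are given here as an assumption about what those equations contain. The symbols are introduced for illustration: y_i is the i-th observed value, \hat{y}_i the corresponding predicted value, \bar{y} the mean of the observed values, and n the number of samples. If equation 1 denotes the mean square error rather than its root, the square root is simply omitted.

```latex
\begin{align}
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^{2}} \tag{1}\\
\mathrm{MAE} &= \frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right| \tag{2}\\
R^{2} &= 1-\frac{SS_{res}}{SS_{tot}}
       = 1-\frac{\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^{2}}{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^{2}} \tag{3}
\end{align}
```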
Step 4: after the training in step 3 is finished, the best-performing model obtained by training on the H2O AutoML platform is explained through interpretation methods such as residual analysis, the learning curve, variable importance, the Shapley (SHAP) summary and the partial dependence plot (PDP).
Specifically, the aforementioned interpretation methods are run directly in the Python script.
The residual analysis (Fig. 5) shows the residual distribution of the test data set for the stacked ensemble model. The frequency of residuals around zero is high, especially for higher algae removal rates (60%-100%), indicating that the trained stacked ensemble model has captured sufficient predictive information.
The learning curve of the stacked ensemble model (Fig. 6) shows that the three curves, namely the training curve, the test curve and the cross-validation curve, drop rapidly in the first 20 iterations, indicating a fast initial learning rate and a rapid improvement in model performance. The small gap between the training curve and the cross-validation curve indicates that the model has no significant overfitting. The small difference between the test curve and the training curve indicates that the model generalizes well.
The variable importance of the optimal model (Fig. 7) evaluates the importance of each input variable to the algae removal rate. The variables are ranked, in descending order of importance, as air flotation time > dosage > pH > turbidity > Zeta potential > bEPS > dEPS, which shows that air flotation time has the greatest influence on the algae removal rate.
The SHAP summary of the optimal model (Fig. 8) further evaluates the marginal effect of each input variable on the algae removal rate. SHAP values are obtained on the test data set for the numerical variables air flotation time, dosage, bEPS, DO, algae density, pH, turbidity, Zeta potential and dEPS, and each varies within a certain range. Points with similar variable values are spread over different horizontal positions, which indicates that the effect of an input variable is determined not only by its own value but also by the other variables. The wide spread of air flotation time in the SHAP summary plot demonstrates its importance. Meanwhile, the low-value points of air flotation time, dosage and bEPS are mainly distributed on the left side, and the high-value points on the right side. These results further indicate that longer air flotation times and higher dosage and bEPS facilitate algae removal.
The partial dependence plot (Fig. 9) visualizes the dependence of the algae removal rate on the most important input variables (air flotation time, dosage, bEPS and DO, shown in four panels). The PDP is drawn by varying the value of the variable of interest while keeping the other variables fixed. Curves of different colors in the plot represent the results of different models. For air flotation time, the dependence rises markedly over the first 40-60 s and the slope then decreases gradually, showing that the air flotation time for algae removal can be optimized. For the dosage, the dependence rises markedly with increasing dosage up to 20 mg/L and more slowly with further dosing, showing that beyond a certain dosage, additional dosing has little effect on flocculation air flotation algae removal efficiency. For bEPS, the dependence increases markedly above 7.8 mg/L, indicating that a high bEPS concentration benefits the flocculation air flotation process. For DO, the dependence is strongest at low and high concentrations.
While the invention has been described with reference to preferred embodiments, it is not intended to be limiting. Those skilled in the art will appreciate that various modifications and adaptations can be made without departing from the spirit and scope of the present invention. Accordingly, the scope of the invention is defined by the appended claims.