Inner Mongolia grassland plant species classification based on ground object spectrum library
Technical Field
The invention relates to a method for classifying Inner Mongolia grassland plant species based on a ground object spectrum library, and belongs to the technical field of spectral classification.
Background
Grasslands are an important component of the terrestrial ecosystem. They provide essential ecological services such as climate regulation, water conservation, wind and sand control, biodiversity conservation, primary productivity and carbon fixation, and they provide habitat for vascular plants, birds, large mammals and soil microorganisms. China is one of the world's major grassland countries, with a total grassland area of nearly 400 million hectares, accounting for 41.7 percent of the national land area. According to incomplete statistics, grassland plants in China belong to 254 families, more than 4000 genera and more than 9700 species, of which 52 families, 138 genera and 316 species are poisonous plants. However, under the influence of long-term overgrazing, reclamation, climate change, environmental pollution and biological invasion, the biodiversity of China's grasslands has declined markedly, which in turn has weakened the functions of the grassland ecosystem. Strengthening the monitoring and evaluation of grassland biodiversity is therefore crucial to formulating grassland biodiversity protection policies and to the adaptive management of grassland ecosystems. Identifying and classifying grassland species is the basis of biodiversity monitoring in grassland ecosystems, so accurate classification of grassland species is an important step in protecting and restoring them.
Traditional grassland biodiversity monitoring relies mainly on ground surveys and fixed-station monitoring. It focuses on species-level mechanism research at the local scale, has poor spatial representativeness and temporal continuity, and is time-consuming and labor-intensive; field survey data are difficult to obtain for areas that are hard to reach, such as hot, high-altitude and cold regions. Satellite remote sensing offers wide coverage, short revisit periods and strong dynamic monitoring capability. It helps reveal biodiversity loss over large areas quickly, presents results in a continuous, borderless and repeatable way, and is well suited to biodiversity monitoring at different temporal and spatial scales. In recent years, remote sensing has been increasingly used for plant species identification. Hyperspectral data comprise dozens to hundreds of contiguous bands that carry many spectral features related to plant functional traits, which makes them suitable for identifying and classifying plant species.
Machine learning is commonly used to classify hyperspectral data, but the high dimensionality and multicollinearity of such data reduce the accuracy of the classification model, so reducing data redundancy is a key problem to be solved. Huang Yu et al. point out that feature extraction is an effective way to reduce data redundancy, and that dimensionality reduction and vegetation index extraction are its two most common strategies. Besides the redundancy of high-dimensional features, the imbalance in the number of spectra between species is another cause of inaccurate classification. In actual field spectrum acquisition, few spectra are obtained for rare species, and the gap between rare and common species can exceed two orders of magnitude. Because machine learning algorithms are biased toward the majority classes, rare species are likely to be identified as common species, which lowers the accuracy of species classification.
Disclosure of Invention
To solve the above technical problems, the invention provides a method for classifying Inner Mongolia grassland plant species based on a ground object spectrum library, achieved by the following technical means. The method comprises the following steps:
S1: constructing a grassland plant species spectrum library, including field collection of leaf samples and indoor measurement of leaf spectra;
S2: preprocessing the ground object spectrum library in four steps: data denoising, spectral feature enhancement by continuous wavelet transform, spectrum balancing with the SMOTE algorithm, and spectral dimensionality reduction with PCA;
S3: classifying the preprocessed ground object spectrum library using an MLP algorithm.
Further, constructing the grassland plant species spectrum library comprises leaf collection and spectral measurement, yielding a large spectrum library of 29212 spectra from 120 species. Leaf sample collection: sampling sites are selected along the environmental gradient in Xilingol League, Inner Mongolia, and leaves of different grassland species are collected and kept in a refrigerated box for indoor spectral measurement. Indoor leaf spectrum measurement: the refrigerated leaves are measured with an ASD field spectrometer according to the operating specification, ensuring both the measurement accuracy and the diversity of the spectral data.
Further, preprocessing the ground object spectrum library comprises four steps: data denoising, spectral feature enhancement by continuous wavelet transform, spectrum balancing with the SMOTE algorithm, and spectral dimensionality reduction with PCA. Data denoising consists of trimming the edge bands of each spectrum to reduce systematic error and filtering noise with a Savitzky-Golay filter. The continuous wavelet transform is used for further denoising and feature enhancement.
Further, to address the imbalance in the number of spectra between common and rare species, the invention uses the SMOTE algorithm for data balancing: taking the species with the most spectra as the reference, the number of spectra of every other species is expanded to that number. PCA is a common dimensionality reduction method that effectively reduces data redundancy and improves computation speed. Two parameters need to be set: n_components and svd_solver.
Further, classifying the preprocessed ground object spectrum library proceeds as follows: first, the PCA-processed spectral data are divided into a training set and a validation set in a fixed proportion; then, several common machine learning algorithms are trained on the training set; finally, evaluation shows that the MLP method achieves the highest accuracy, and that Quadratic Discriminant Analysis (QDA) is the most suitable when both accuracy and speed are considered.
Compared with the prior art, the invention has the following beneficial effects:
1. the method selects grassland species along different environmental gradients and constructs an Inner Mongolia grassland plant species spectrum library covering 120 species with 29212 spectra;
2. the invention provides an effective multi-species grassland spectral classification method based on a ground object spectrum library;
3. the invention uses the SMOTE algorithm to address the imbalance in the number of spectra between species and the PCA algorithm to address the redundancy of spectral data; in the embodiment, the classification accuracy of MLP reaches 97.17% after data preprocessing, which shows that the invention is highly reliable;
4. the invention preprocesses the ground object spectrum library with several methods and then classifies the spectra with the QDA method, which greatly improves speed while keeping accuracy at a high level, showing that the method is highly practical.
Drawings
FIG. 1 is a flow chart of the method for classifying Inner Mongolia grassland plant species based on the ground object spectrum library.
Detailed Description
The invention is further described below with reference to the accompanying drawings:
The invention discloses a method for classifying Inner Mongolia grassland plant species based on a ground object spectrum library, comprising the following steps:
S1: constructing a grassland plant species spectrum library;
S2: preprocessing the ground object spectrum library;
S3: classifying the preprocessed ground object spectrum library using an MLP algorithm.
Step S1 constructs the grassland plant species spectrum library, including field collection of leaf samples and indoor measurement of leaf spectra. Leaf sample collection: from June to August 2020, 18 sampling sites with clear environmental gradient differences were selected in Xilingol League, Inner Mongolia Autonomous Region. The grassland types cover meadow steppe and typical steppe (including shrub steppe), and fresh leaves of 120 plant species (see Table 1), including herbaceous plants and shrubs, were collected. To keep the subsequent spectral measurements accurate, the leaves were stored in a refrigerated box and all leaf spectra were measured in the laboratory on the same day. Indoor leaf spectrum measurement: leaf spectra were measured with an ASD field spectrometer. The specific operation steps are as follows: (1) switch on the instrument and preheat it for 15 minutes; (2) in the RS3 software, aim the lens at the white reference panel and click OPT to optimize; (3) in the RS3 software, acquire the white reference spectrum: with the lens on the white panel, take the white reference (WR), after which a straight line with reflectance equal to 1 appears; (4) in the RS3 software, aim the lens at the plant leaf to be measured and press the space bar to save the data; (5) compile the spectral data: the measured plant spectra are sorted by species, giving a spectrum library of 29212 spectra from 120 species. To ensure the diversity of the spectral data, leaves from the upper, middle and lower parts of the same plant, and different positions on a single leaf, were measured, and no position was measured twice. To ensure measurement accuracy, the instrument was recalibrated against the standard white panel every 4 minutes.
Step S2 preprocesses the ground object spectrum library in four steps: data denoising, spectral feature enhancement by continuous wavelet transform, spectrum balancing with the SMOTE algorithm, and spectral dimensionality reduction with PCA. Data denoising consists of trimming the edge bands of each spectrum to reduce systematic error and filtering noise with a Savitzky-Golay filter. The continuous wavelet transform is an effective noise reduction method that also enhances the fine spectral features of hyperspectral data. In this embodiment, the mother wavelet is the second derivative of a Gaussian and the scale parameter is set to 16.
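The denoising and wavelet steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's actual code: the number of trimmed edge bands (`trim`), the Savitzky-Golay window and polynomial order, the 350-2500 nm band range and the helper names `mexican_hat` and `denoise_and_enhance` are all hypothetical. Only the Gaussian second-derivative mother wavelet and the scale of 16 come from the text.

```python
import numpy as np
from scipy.signal import savgol_filter


def mexican_hat(scale, support=5):
    """Sampled second derivative of a Gaussian (sign-flipped), stretched by `scale`."""
    half = int(support * scale)
    t = np.arange(-half, half + 1) / scale
    return (1.0 - t**2) * np.exp(-(t**2) / 2.0) / np.sqrt(scale)


def denoise_and_enhance(reflectance, trim=50, window=11, polyorder=2, scale=16):
    """Trim edge bands, smooth with Savitzky-Golay, then apply a single-scale CWT."""
    spectrum = np.asarray(reflectance, dtype=float)
    core = spectrum[trim:spectrum.size - trim]  # drop noisy edge bands (trim is an assumption)
    smooth = savgol_filter(core, window_length=window, polyorder=polyorder)
    # One row of the CWT: convolve the spectrum with the scaled, symmetric mother wavelet.
    return np.convolve(smooth, mexican_hat(scale), mode="same")


# Synthetic stand-in for one leaf spectrum with 2151 bands (assumed 350-2500 nm); not real data.
rng = np.random.default_rng(0)
wavelengths = np.arange(350, 2501)
spectrum = 0.4 + 0.1 * np.sin(wavelengths / 150.0) + rng.normal(0, 0.01, wavelengths.size)
features = denoise_and_enhance(spectrum)
print(features.shape)  # (2051,)
```

Because the wavelet has approximately zero mean, a flat spectrum yields coefficients near zero away from the edges, so the transform emphasizes the shape of absorption features rather than the overall reflectance level.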
After denoising, the data are balanced with the SMOTE algorithm:
To address the imbalance in the number of spectra between common and rare species, the invention uses the SMOTE algorithm for data balancing. The basic idea of SMOTE is to analyze the minority-class samples, synthesize new samples by interpolating between each minority sample and its k nearest neighbors, and then merge the original and synthetic samples into the new minority class. In this embodiment, the species with the most spectra (444 spectra) is taken as the reference, and the number of spectra of each of the other 119 species is expanded to 444.
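The balancing step can be illustrated with a small NumPy-only sketch of the SMOTE idea. It is a simplified stand-in under stated assumptions, not the implementation used in the embodiment: the function name `smote_oversample`, the choice `k=5`, the random seed and the toy data are all hypothetical.

```python
import numpy as np


def smote_oversample(X, y, k=5, seed=0):
    """Expand every class to the size of the largest class by SMOTE-style interpolation."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [X], [y]
    for cls, count in zip(classes, counts):
        n_new = target - count
        if n_new == 0 or count < 2:  # nothing to add, or no neighbor to interpolate with
            continue
        Xc = X[y == cls]
        # Squared pairwise distances inside the class; a sample is not its own neighbor.
        sq = (Xc**2).sum(axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * Xc @ Xc.T
        np.fill_diagonal(d2, np.inf)
        k_eff = min(k, count - 1)
        neighbors = np.argsort(d2, axis=1)[:, :k_eff]
        base = rng.integers(0, count, size=n_new)  # random seed samples
        nb = neighbors[base, rng.integers(0, k_eff, size=n_new)]  # one random neighbor each
        gap = rng.random((n_new, 1))  # position along the connecting segment
        X_parts.append(Xc[base] + gap * (Xc[nb] - Xc[base]))
        y_parts.append(np.full(n_new, cls))
    return np.vstack(X_parts), np.concatenate(y_parts)


# Toy data: a "common" class with 20 samples and a "rare" class with 5 samples.
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0, 1, (20, 4)), rng.normal(3, 1, (5, 4))])
y_demo = np.array([0] * 20 + [1] * 5)
X_bal, y_bal = smote_oversample(X_demo, y_demo)
print(np.bincount(y_bal))  # [20 20]
```

With the embodiment's numbers, expanding all 120 species to 444 spectra each would give 120 × 444 = 53280 spectra after balancing.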
After data balancing, PCA spectral dimensionality reduction is carried out:
PCA (principal component analysis) is a common dimensionality reduction method. It effectively reduces data redundancy, improves computation speed, and extracts the most important features for subsequent processing. To improve the efficiency of the subsequent algorithms, this embodiment uses PCA to reduce the dimensionality of the spectral data, with two parameters set explicitly: n_components and svd_solver. n_components sets the number of feature dimensions kept after PCA and may be adjusted as needed; in this embodiment n_components is set to 100. svd_solver sets the singular value decomposition (SVD) method and has 4 possible values: {'auto', 'full', 'arpack', 'randomized'}; this embodiment uses svd_solver='auto'.
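A minimal sketch of this step with scikit-learn's `PCA`, using the two parameters named above. The input matrix is synthetic and deliberately collinear to mimic redundant bands; its size and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic, highly collinear "spectra": 300 samples, 500 bands driven by 10 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 10))
mixing = rng.normal(size=(10, 500))
X_balanced = latent @ mixing + rng.normal(scale=0.1, size=(300, 500))

# Parameters as stated in the embodiment.
pca = PCA(n_components=100, svd_solver="auto")
X_reduced = pca.fit_transform(X_balanced)

print(X_reduced.shape)                      # (300, 100)
print(pca.explained_variance_ratio_.sum())  # share of variance kept by 100 components
```

Checking `explained_variance_ratio_` is one way to judge whether 100 components are enough for a given library. Note that `n_components` cannot exceed the smaller of the number of samples and the number of bands.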
Step S3 classifies the preprocessed ground object spectrum library using an MLP algorithm, as follows: S3.1, the PCA-processed spectral data are divided into a training set and a validation set in a 7:3 ratio using the StratifiedShuffleSplit function, which ensures that the 7:3 ratio also holds within each class; S3.2, to select the best classification method, six common machine learning algorithms are compared and analyzed (the specific methods are shown in Table 2); S3.3, the classification results are evaluated, and the two most accurate algorithms are MLP (multilayer perceptron) and QDA (quadratic discriminant analysis), both with accuracy above 95%; S3.4, when only accuracy is considered, MLP is used to classify species from the spectral data, and when both accuracy and speed are considered, QDA is used.
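Steps S3.1 and S3.2 can be sketched as below. This is an illustration on synthetic data with mostly default hyperparameters, so the accuracies and timings it prints say nothing about the results in the embodiment; the dataset sizes, `random_state` values and the `results` dictionary are assumptions.

```python
import time

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC

# Synthetic stand-in for the PCA-reduced spectra and species labels.
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_redundant=0, n_classes=3, random_state=0)

# S3.1: one stratified 7:3 split, so each class keeps the same 7:3 ratio.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, val_idx = next(splitter.split(X, y))

# S3.2: the six classifiers compared in the embodiment.
models = {
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
    "RandomForest": RandomForestClassifier(random_state=0),
    "MLP": MLPClassifier(max_iter=500, random_state=0),
    "LinearSVC": LinearSVC(),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X[train_idx], y[train_idx])
    accuracy = model.score(X[val_idx], y[val_idx])
    results[name] = (accuracy, time.perf_counter() - start)  # accuracy, seconds

for name, (accuracy, seconds) in results.items():
    print(f"{name:12s} accuracy={accuracy:.3f} time={seconds:.2f}s")
```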
In step S3.3, five evaluation metrics are used: (1) Accuracy, (2) Recall, (3) Precision, (4) F1-score and (5) the Kappa coefficient. By comparison (see Table 1), MLP, implemented with the MLPClassifier function, has the highest classification accuracy at 97.17%, with a run time of 192.8 s. QDA, implemented with QuadraticDiscriminantAnalysis, is slightly less accurate at 96.35%, but its run time is far shorter at 2.456 s.
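The five metrics can be computed with scikit-learn as sketched below. The weighted averaging is an assumption: the text does not state the averaging mode, although weighted recall always equals accuracy, which is consistent with the identical accuracy and recall columns in Table 1. The helper name `evaluate` and the toy labels are hypothetical.

```python
from sklearn.metrics import (
    accuracy_score,
    cohen_kappa_score,
    f1_score,
    precision_score,
    recall_score,
)


def evaluate(y_true, y_pred):
    """Return the five metrics of step S3.3 (support-weighted averages, an assumption)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred, average="weighted"),
        "precision": precision_score(y_true, y_pred, average="weighted"),
        "f1": f1_score(y_true, y_pred, average="weighted"),
        "kappa": cohen_kappa_score(y_true, y_pred),
    }


# Toy labels for three species; one sample of species 1 is misclassified as species 2.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
metrics = evaluate(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this toy case accuracy is 5/6 and the Kappa coefficient is 0.75, because the chance agreement implied by the label frequencies is 1/3.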
In step S3.4, when only accuracy is considered, MLP is used to classify species from the spectral data. MLP (multilayer perceptron), a type of artificial neural network, consists of an input layer, one or more hidden layers and an output layer. It is robust and fault-tolerant to noise and can approximate complex nonlinear relationships, so it tends to give good results in data classification. This embodiment uses MLPClassifier with the parameters solver='sgd', activation='relu', max_iter=2000, alpha=1e-5 and hidden_layer_sizes=(256, 128). When both accuracy and speed are considered, QDA is used to classify plant species from the spectral data.
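The two final classifiers can be instantiated as in the sketch below, with the MLP hyperparameters taken from the text. The synthetic data, the `train_test_split` call and the `random_state` values are assumptions added so the example runs on its own; the printed accuracies are not the embodiment's results.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for PCA-reduced spectra (not real data).
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_redundant=0, n_classes=3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Accuracy-first option: MLP with the hyperparameters listed in the embodiment.
mlp = MLPClassifier(solver="sgd", activation="relu", max_iter=2000,
                    alpha=1e-5, hidden_layer_sizes=(256, 128), random_state=0)
mlp.fit(X_train, y_train)
mlp_acc = mlp.score(X_val, y_val)

# Speed-and-accuracy option: QDA with default settings.
qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)
qda_acc = qda.score(X_val, y_val)

print(f"MLP accuracy: {mlp_acc:.3f}")
print(f"QDA accuracy: {qda_acc:.3f}")
```

QDA has a closed-form fit (one mean and covariance per class), whereas the MLP is trained iteratively, which is consistent with the large run-time difference reported in Table 1.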
TABLE 1
| Method / evaluation index (%) | accuracy | precision | recall | F1-score | kappa | T/s |
| --- | --- | --- | --- | --- | --- | --- |
| LinearDiscriminantAnalysis() | 80.63% | 86.51% | 80.63% | 82.23% | 80.47% | 0.283 |
| QuadraticDiscriminantAnalysis() | 96.35% | 96.89% | 96.35% | 96.47% | 96.32% | 2.456 |
| KNeighborsClassifier() | 90.82% | 91.43% | 90.82% | 90.40% | 90.74% | 133.0 |
| RandomForestClassifier() | 84.02% | 85.01% | 84.02% | 83.86% | 83.89% | 9.585 |
| MLPClassifier() | 97.17% | 97.19% | 97.17% | 97.16% | 97.14% | 192.8 |
| LinearSVC() | 95.88% | 95.92% | 95.88% | 95.82% | 95.85% | 42.30 |
TABLE 2
The technical solutions of the present invention, and similar technical solutions designed by those skilled in the art based on the teachings of the present invention to achieve the above technical effects, all fall within the protection scope of the present invention.