WO2001067295A2 - Data analysis - Google Patents
Data analysis
- Publication number
- WO2001067295A2 (PCT/GB2001/001031)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- cluster
- point
- kernel
- searchable
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Definitions
- This invention relates to data analysis and has particular reference to the comparison of items, each of which is characterised by a large number of datapoints.
- The problem of handling such comparisons is well illustrated by the comparison of spectral data, in which each spectrum is characterised by a large number of datapoints.
- Spectral data presents difficulty in analysis since in the original analog spectral data, the intensities are not reproducible. In some spectra, the weak spectral peaks merge into the background "noise".
- MALDI-TOF-MS: matrix-assisted laser desorption/ionisation time-of-flight mass spectrometry.
- The precision of the MALDI-TOF-MS machine is such that the mass position of each spectral peak is not exactly reproducible, and a small element of "shift" is likely to occur for any given peak. This is particularly noticeable towards the high-mass end of the spectrum.
- Existing attempts to analyze the spectral data from MALDI-TOF-MS analysis have relied on the Jacquard method. According to this method, the spectral data is analyzed at a number of datapoints, typically at a number of datapoints greater than 16k. Each data point reports the presence or the absence of a peak at that particular point on the spectrum. The data point reports only the presence or the absence of a spectral peak and does not include any information whatsoever concerning the intensity or relative intensity of any peak located at that position.
- The reported information from the datapoint is stored as an absolute number within the database. Using this technique, there is no measure of relative intensity between the peaks and troughs within the spectrum being analyzed. Furthermore, because the spectral intensity is not reproducible, in some instances significant but low-intensity peaks will not be reported or considered. If the background noise level within the system is relatively high, significant data may be lost because it is simply discounted. Since the data set in any one particular spectrum is very large, and may be of the order of 16k, 32k or more datapoints, significant and critical amounts of characterizing information would simply be discounted, with the result that critical comparisons and analysis within the database cannot take place.
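A minimal sketch of the presence/absence comparison described above, assuming that the "Jacquard method" corresponds to the Jaccard similarity coefficient. The function name and the six-point example spectra are hypothetical; real spectra would have 16k or more datapoints.

```python
def jaccard_similarity(peaks_a, peaks_b):
    """Jaccard coefficient of two binary peak vectors (1 = peak present)."""
    if len(peaks_a) != len(peaks_b):
        raise ValueError("spectra must be sampled at the same datapoints")
    # Datapoints where both spectra report a peak.
    both = sum(1 for a, b in zip(peaks_a, peaks_b) if a and b)
    # Datapoints where at least one spectrum reports a peak.
    either = sum(1 for a, b in zip(peaks_a, peaks_b) if a or b)
    return both / either if either else 1.0


spectrum_a = [1, 0, 1, 1, 0, 0]
spectrum_b = [1, 1, 0, 1, 0, 0]
similarity = jaccard_similarity(spectrum_a, spectrum_b)
```

Because each datapoint carries only a 0 or 1, two spectra whose peaks differ greatly in relative intensity receive the same score, which is the limitation the text points out.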
- A data comparison search method of the kind described has been found to give exemplary results with a large database of items in which each item comprises many thousands of datapoints.
- The search engine incorporating this method is now referred to as MMUSE (Manchester Metropolitan University Search Engine).
- a method of improving the quality of a database for use with a search method of the kind described comprises determining the single searchable reference point for a plurality of replicate samples of each item, establishing the coordinates of the replicate reference points in high dimensional space and thereafter determining a single point reference coordinate of the cluster of replicate reference points for initial searching/comparison.
- The invention also includes a method of comparing data, which method comprises defining a plurality of data points in respect of each item to be compared across the complete range of data, converting groups of data points to a kernel function, said function being characteristic of the position, shape and/or relative intensity of the data forming that group, assembling the said kernel functions as a cluster and projecting said cluster in high dimensional space using Cover's Theorem to provide a single searchable reference point for the said plurality of data points, characterised in that the cluster kernel function is determined for a plurality of replicate samples of each item, establishing the coordinates of the replicate kernel functions in high dimensional space and thereafter determining a single point reference coordinate of the cluster of replicate kernel functions for initial searching/comparison.
- a further aspect of the invention includes a computer programmed to conduct a search within a data base, which program comprises the steps of:-
- the single searchable reference point for replicates of any sample is compared in a matrix or an array.
- Any replicate point with a distance characteristic significantly greater than the others in the cluster is regarded as an outlier and may be excluded from the determination of said single point reference coordinates for the replicate cluster. This overcomes the problem of averaging.
- The sum of the distances from any one replicate searchable point to each of the other replicate points in the cluster can be calculated. This is repeated for each of the replicate points and the results are compared; the replicate point(s) with the greatest summed distance is the one least likely to be typical of the cluster and can be designated as an "outlier". In practice, where a number of replicate points have summed distances significantly greater than the rest, more than one such replicate point may be designated as an "outlier".
- Once the outliers have been identified, they can be ignored and the remaining replicate points may be processed to produce a single data point reference coordinate for the replicate cluster, thus providing a single data point reference for the purpose of searching within the database. This enables a relatively simple comparison with the corresponding single data point coordinates for the replicate cluster of an unknown sample.
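The summed-distance procedure can be sketched as follows. The text does not quantify "significantly greater", so the threshold used here (a multiple of the median summed distance) is an assumption, as are the function names, the use of Euclidean distance and the choice of the coordinate-wise mean as the single reference coordinate.

```python
import math


def summed_distances(points):
    """For each replicate point, sum its distances to all points in the cluster."""
    return [sum(math.dist(p, q) for q in points) for p in points]


def reference_coordinate(points, factor=2.0):
    """Return (reference coordinate, outliers) for a cluster of replicate points.

    Assumption: a point is an outlier when its summed distance exceeds
    `factor` times the median summed distance.
    """
    totals = summed_distances(points)
    threshold = factor * sorted(totals)[len(totals) // 2]
    kept = [p for p, t in zip(points, totals) if t <= threshold]
    outliers = [p for p, t in zip(points, totals) if t > threshold]
    # Coordinate-wise mean of the remaining replicates (an assumed choice).
    centre = tuple(sum(axis) / len(kept) for axis in zip(*kept))
    return centre, outliers


replicates = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
centre, outliers = reference_coordinate(replicates)
```

In this example the replicate at (10, 10) is discarded, so the reference coordinate is the centre of the four remaining points rather than an average pulled towards the outlier.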
- The data point clusters may overlap, in some cases quite significantly. There may, therefore, be some difficulty in deciding to which reference coordinate cluster the unknown item belongs.
- The topographical distribution and/or the spread of kernel functions about each searchable point for a given replicate is characteristic of the data item to which the searchable point and hence the data point coordinates relate.
- The invention further includes the modelling of the topographical distribution of cluster kernel functions and/or the modelling of the spread of such searchable cluster kernel functions for each replicate searchable point and hence for the data point reference coordinates.
- the cluster of searchable points may themselves be further projected in high dimensional space.
- the data may be spectral data and the datapoints may be assembled into groups of data points across a range of spectral data. This range may extend across the whole of the spectral data or only a part or sub-set of the range.
- data is normalized to provide an intensity function which is a measure of the relative intensity of, say, each spectral peak, that is to say, by comparing all the peak intensities of the data with the highest which is rated at 1. All the other peaks in the data will then have a value of under 1.
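A minimal sketch of this normalisation, with a hypothetical function name and example intensities: every peak intensity is divided by the highest one, so the highest peak is rated at 1.

```python
def normalise_intensities(intensities):
    """Scale peak intensities so that the highest peak is rated at 1."""
    highest = max(intensities)
    if highest <= 0:
        raise ValueError("at least one positive intensity is required")
    return [value / highest for value in intensities]


relative = normalise_intensities([120.0, 30.0, 60.0])
```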
- the kernel function of the cluster in high dimensional space can be normalized to 1.
- the data points may be applied across a neural network.
- The neural network may also be employed to analyse pattern distributions of kernel functions of the local kernel clusters using the Cover Theorem (Ref: Thomas M. Cover (1965), "Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition").
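For context, Cover's function-counting theorem states that, of the 2^N ways of splitting N points in general position in d-dimensional space into two classes, C(N, d) = 2 Σ_{k=0}^{d-1} binom(N-1, k) can be separated by a hyperplane through the origin. The sketch below (function name hypothetical) computes the corresponding fraction. It only illustrates why projection into a higher-dimensional space makes separation more likely, and is not the procedure used by the invention.

```python
from math import comb


def separable_fraction(n_points, dim):
    """Fraction of the 2**n_points dichotomies that are linearly separable
    (hyperplane through the origin, points in general position)."""
    count = 2 * sum(comb(n_points - 1, k) for k in range(dim))
    return count / 2 ** n_points


low_dim = separable_fraction(4, 2)   # 4 points in 2 dimensions
high_dim = separable_fraction(4, 4)  # the same number of points in 4 dimensions
```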
- the datapoints defining any particular group in a dataset may be defined by vector spatial functions which may be displayed as a cluster or projected as a single point in high dimensional space.
- the local kernel of each cluster of replicate searchable points in high dimensional space can thus be determined by a single set of searchable parameters.
- All that is necessary is to compare the single set of searchable parameters of the local kernel clusters for each of the replicate spectra within the database with the single set of searchable parameters of the local replicate kernel cluster for the unknown sample.
- The use of an artificial neural network to assist in optimization of the search data has the advantage that prior knowledge of models and associated careful network design is unnecessary.
- A search engine has enabled the sorting and comparison of MALDI-TOF-MS spectral data to make available a high-performance mass spectral analysis tool, which may be operated by the non-specialist.
- the equipment required to perform the analysis is relatively inexpensive, and the search engine forming part of the invention enables rapid and easy searching of an extensive database of microorganisms.
- radial basis functions are used to fit or include each cluster kernel.
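The text does not say which radial basis function is used. As an illustration only, the sketch below assumes the common Gaussian form, whose value depends solely on the distance between a point and the centre of the kernel; the function name and the `width` parameter are hypothetical.

```python
import math


def gaussian_rbf(point, centre, width=1.0):
    """Gaussian radial basis function: 1 at the centre, decaying with distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(point, centre))
    return math.exp(-squared_distance / (2.0 * width ** 2))


at_centre = gaussian_rbf((1.0, 2.0), (1.0, 2.0))
near = gaussian_rbf((0.0, 0.0), (1.0, 0.0))
far = gaussian_rbf((0.0, 0.0), (3.0, 4.0))
```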
- the present invention also includes a computer programmed with a database comprising the radial basis functions of the kernel of each data set or group of data points of the spectral data in high dimensional space characterised in that the said kernel function is determined for a plurality of replicate samples of each item, establishing the coordinates of the replicate kernel functions in high dimensional space and thereafter determining a single point reference coordinate of the cluster of replicate kernels for initial searching/comparison.
- the information relating to the item is enhanced since outlier replicates are identified and discarded and the resultant single point kernel function is at a statistical optimum.
- spectral data may be recorded in digital form for ease of searching.
- the presence and availability of all the data points within the cluster for each item permits the reconstitution of each item from this information so that the original datapoints may be represented in graphic as well as digital or numeric form.
- the invention also includes a database comprising the radial basis functions of the known items for comparison with the unknown items.
- a method of improving the quality of a database for use with a search engine of the kind described which method comprises determining the cluster kernel function for a plurality of replicate samples of each data item, establishing the coordinates of the replicate kernel functions in high dimensional space and thereafter determining a single point reference coordinate of the cluster of replicate kernels for initial searching/comparison.
- A method of comparing data comprises defining as a group a plurality of data points within a range of data points for a data item, converting said group of data points to at least one kernel function, assembling the resultant plurality of kernel functions covering all the data points for the data item into a cluster, projecting said cluster of kernel functions in high dimensional space using Cover's Theorem to define a single searchable reference point for all the data points for said data item, and comparing the said single searchable point for a sample item with the single searchable point for similarly processed comparison items, characterised in that the cluster kernel function or single searchable point is determined for a plurality of replicate samples of each item, establishing the coordinates of the replicate kernel functions or searchable points in high dimensional space and thereafter determining a single data point reference coordinate of the cluster of replicate kernels or searchable points for initial searching/comparison.
- A computer programmed to conduct a search within a database, which program comprises the steps of defining a plurality of groups of data points in respect of each item to be searched across the complete range of data for said item, converting each data point within a group to a vector spatial function, said function being characteristic of the data at that point, assembling the vector spatial functions of each point in said group as a cluster, determining the kernel function in respect of that cluster, determining a radial basis function for each kernel which is characteristic of all the information in the plurality of data points constituting said group, projecting said kernel functions into high dimensional space and assembling them to form a single searchable point, and comparing the resultant searchable point with the corresponding point(s) of the other data items within the database, characterised in that the said searchable point is determined for a plurality of replicate samples of each data item, thereafter determining a single point reference coordinate of the cluster of replicate searchable points for initial searching/comparison.
- The individual replicates of any sample cluster of kernel searchable point functions for a data item may be compared in a matrix or an array, whereby any replicate with a distance characteristic significantly greater than the others in the cluster is regarded as an outlier and is excluded from the determination of said single point reference coordinates for the cluster.
- the spread of data points for each group cluster kernel functions for any particular replicate may be determined as a function for comparison with the spread of data points for the group cluster kernel functions of an unknown sample.
- the topographical distribution of data points for each group cluster kernel function may be determined as a function for comparison with the topographical distribution of data points for the cluster kernel function of an unknown sample.
- The data may be spectral data, wherein the groups of data points are selected across a range of spectral data.
- the data may be normalized to provide an intensity function which is a measure of the relative intensity of each spectral peak.
- the normalization procedure may compare all the peak intensities as a proportion of the highest peak which is rated at 1.
- the radial basis function of the datapoints may be applied across a neural network.
- a neural network may be employed to analyse pattern distributions of radial basis functions of local kernel clusters in accordance with the Cover Theorem.
- A database of data comprising the radial basis functions of the kernel of each group of data points in high dimensional space, whereby the radial basis function of the kernel cluster serves to determine the relative spatial position of the kernel in high dimensional space, characterised in that the kernel cluster function is determined for a plurality of replicate samples of each item, establishing the coordinates of the replicate kernel functions in high dimensional space and thereafter determining a single point reference coordinate of the cluster of replicate kernels for initial searching/comparison.
- A database may comprise spectral data obtained by MALDI-TOF-MS of microorganisms.
- A method of characterising microorganisms, which method comprises providing a database of spectral data for a range of known microorganisms, preparing a sample of an unidentified microorganism, obtaining corresponding spectral data relating thereto and comparing, using suitable comparison means, the spectral data so obtained with the spectral data contained in the database, thereby to identify the unidentified microorganism by comparison with a known microorganism having the same or similar spectral data.
- Apparatus for the screening of microorganisms, the apparatus comprising spectroscopic means for producing spectral data of the sample organism, database means containing spectral data for a range of microorganisms, and comparison means for comparing the spectral data of the sample with that of the database to permit classification/identification of the sample, characterized in that the spectroscopic means comprises means for producing spectral data of the sample organism by MALDI-TOF techniques, in that the database contains MALDI-TOF-MS spectral data, and in that the comparison means is by the comparison method of the invention.
- Spectral data in the database may be arranged in groups of data according to the genus of each microorganism with sub-divisions corresponding to each strain of microorganism.
- a sample of unidentified microorganism may be prepared either by taking cells from a culture and applying them to a sample plate comprising a matrix or by admixing the cells with the matrix prior to subjecting to MALDI-TOF-MS analysis in order to retain the cellular integrity of the sample.
- A sample matrix mixture may be prepared and bombarded with laser energy to create a gas phase ionic species which is then pulsed into a flight conduit or tube for identification of positive and/or negative ions.
- Each species present may be identified by its mass/charge ratio.
- the mass/charge ratio of each spectral peak may be determined from the centroid of the peak corresponding to the average molecular mass of the particular ion.
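A minimal sketch of a peak centroid, assuming it is taken as the intensity-weighted mean of the mass/charge values sampled across one peak; the function name and the sample values are hypothetical.

```python
def peak_centroid(mz_values, intensities):
    """Intensity-weighted mean mass/charge value across one spectral peak."""
    total = sum(intensities)
    if total <= 0:
        raise ValueError("peak must have positive total intensity")
    return sum(m * i for m, i in zip(mz_values, intensities)) / total


centroid = peak_centroid([999.0, 1000.0, 1001.0], [1.0, 2.0, 1.0])
```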
- the spectral data may be derived from a plurality of laser shots of the sample in which the positive and/or energy of the radiation impinging on the sample is varied between shots of the same sample.
- MMUSE can be used. It is possible to produce a matrix of distances for all members of a cluster using MMUSE. This matrix contains the distances among the fingerprint points. If the distances from one fingerprint point to the other fingerprint points (in a column or a row of the matrix) are much higher than those of the rest of the fingerprint points, then that point can be assumed to be an outlier.
- The outlier point is the point with the largest of the stored smallest values, provided that this largest value is much larger than the rest of the stored smallest values.
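The rule above can be sketched as follows, assuming that the stored smallest value for each column is the distance from that point to its nearest neighbour. MMUSE itself is not reproduced; a plain Euclidean distance matrix stands in for it, and the `ratio` used to decide "much larger" is a hypothetical parameter.

```python
import math


def distance_matrix(points):
    """Pairwise Euclidean distances (a stand-in for the MMUSE matrix)."""
    return [[math.dist(p, q) for q in points] for p in points]


def nearest_neighbour_outlier(matrix, ratio=3.0):
    """Return the index of the outlier, or None if no point stands out."""
    size = len(matrix)
    # Smallest off-diagonal distance in each column.
    smallest = [min(matrix[i][j] for j in range(size) if j != i) for i in range(size)]
    candidate = max(range(size), key=smallest.__getitem__)
    rest = [value for i, value in enumerate(smallest) if i != candidate]
    # Assumption: "much larger" means more than `ratio` times every other value.
    if smallest[candidate] > ratio * max(rest):
        return candidate
    return None


with_outlier = nearest_neighbour_outlier(
    distance_matrix([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (8.0, 8.0)])
)
without_outlier = nearest_neighbour_outlier(
    distance_matrix([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)])
)
```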
- An iterative method can be used, so that in each iteration a new representative fingerprint is calculated after a new fingerprint point has been added to the population.
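One way to sketch such an iteration, assuming (the text does not say so) that the representative fingerprint is the mean of the population, is a running mean that is updated each time a point is added. All names are hypothetical.

```python
def update_representative(representative, count, new_point):
    """Fold one new fingerprint point into a running mean of `count` points."""
    count += 1
    updated = tuple(r + (x - r) / count for r, x in zip(representative, new_point))
    return updated, count


# Start from a population holding a single fingerprint point.
representative, count = (0.0, 0.0), 1
for point in [(2.0, 0.0), (4.0, 3.0)]:
    representative, count = update_representative(representative, count, point)
```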
- A matrix of distances can be obtained with MMUSE. Then, in each column, the distance should be stored.
- The representative point is the point with the smallest of the stored largest values. The smallest of the largest values shows the maximum spread of the population.
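A minimal sketch of this selection, assuming that the stored value for each column is the largest distance from that point to any other member of the population. Euclidean distance stands in for the MMUSE matrix, and the function name is hypothetical.

```python
import math


def representative_index(points):
    """Index of the point whose largest distance to the others is smallest,
    together with that distance (the spread seen from the representative)."""
    largest = [max(math.dist(p, q) for q in points) for p in points]
    index = min(range(len(points)), key=largest.__getitem__)
    return index, largest[index]


fingerprints = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
index, spread = representative_index(fingerprints)
```

Here the point (1, 0) is chosen because no other member of the population lies further than 1.0 from it.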
- The representative point of each population can be determined by a second application of Cover's theorem to the population of replicate fingerprints. There are two points which are important in this patent.
- the representative point can be assigned to:-
- $\lVert x_{\text{representative}} - y \rVert \le \lVert y - x_i \rVert$
- The representative points of each population, say G, can be found such that a cost function or an objective function J of distance measures DO is minimised.
- $J = \sum_i J_i$, so J depends on the geometrical properties of each population and the location of its representative centre.
- $W_n(t+1) = W_n(t) + \eta\,\bigl(x(t) - W_n(t)\bigr) \rightarrow x_{\text{representative}}$
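A minimal sketch of one such weight update, assuming it is the standard competitive-learning (winner-take-all) rule in which only the weight vector nearest to the input moves a fraction of the way towards it. The function name, the learning rate of 0.5 and the example vectors are hypothetical.

```python
import math


def competitive_update(weights, sample, rate=0.5):
    """Move the weight vector nearest to `sample` towards it; return its index."""
    winner = min(range(len(weights)), key=lambda n: math.dist(weights[n], sample))
    # W(t+1) = W(t) + rate * (x(t) - W(t)), applied to the winner only.
    weights[winner] = [w + rate * (x - w) for w, x in zip(weights[winner], sample)]
    return winner


weights = [[0.0, 0.0], [10.0, 10.0]]
winner = competitive_update(weights, [1.0, 1.0])
```

Repeating the update over many samples from one population drives the winning weight vector towards a representative of that population.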
- Kohonen Feature Map or Topology-Preserving Map: this locates the output of a CLN in a geometrical manner (a two-dimensional array of higher-dimensional points); the weights of the winner (representative) as well as of the neighbouring losers are then updated such that a certain topological property of the input data is reflected in the output units' weights.
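A minimal sketch of one update step of such a map. For brevity it uses a one-dimensional row of units rather than the two-dimensional array mentioned in the text, and the two learning rates and the fixed neighbourhood of one unit on each side are assumptions.

```python
def som_step(weights, sample, rate=0.5, neighbour_rate=0.25):
    """One update of a 1-D Kohonen map: move the winner and its grid neighbours."""

    def squared_distance(unit):
        return sum((a - b) ** 2 for a, b in zip(unit, sample))

    winner = min(range(len(weights)), key=lambda n: squared_distance(weights[n]))
    for n in (winner - 1, winner, winner + 1):
        if 0 <= n < len(weights):
            # The winner moves further than the neighbouring losers.
            step = rate if n == winner else neighbour_rate
            weights[n] = [w + step * (x - w) for w, x in zip(weights[n], sample)]
    return winner


grid = [[0.0], [1.0], [5.0]]
winner = som_step(grid, [2.0])
```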
- In the first step, it uses a CLN or other methods to obtain representatives of the populations, which are called templates, reference vectors or codebook vectors.
- Supervised learning neural networks use a randomly selected training input vector X such that $\lVert X - W_n \rVert$ is a minimum.
- Improved LVQ2 and LVQ3 are also available techniques.
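For orientation, the sketch below shows one step of the basic LVQ1 rule, on which the LVQ2 and LVQ3 variants build: the codebook vector nearest to a labelled training vector is moved towards it when the labels agree and away from it when they differ. The function name, learning rate and example data are hypothetical, and the text itself names only the improved variants.

```python
import math


def lvq1_step(codebook, labels, sample, sample_label, rate=0.5):
    """Update the nearest codebook vector and return its index."""
    nearest = min(range(len(codebook)), key=lambda n: math.dist(codebook[n], sample))
    # Attract on a matching label, repel on a mismatch.
    sign = 1.0 if labels[nearest] == sample_label else -1.0
    codebook[nearest] = [
        w + sign * rate * (x - w) for w, x in zip(codebook[nearest], sample)
    ]
    return nearest


codebook = [[0.0, 0.0], [4.0, 4.0]]
labels = ["A", "B"]
lvq1_step(codebook, labels, [1.0, 1.0], "A")
after_attract = list(codebook[0])
lvq1_step(codebook, labels, [1.5, 1.5], "B")
after_repel = list(codebook[0])
```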
- The spread and topological distribution of a cluster is related to the physical quantities of the spectral data (fingerprints).
- the physical quantities of the fingerprints are also related to the identity of the samples. Therefore the identification and typing of fingerprints can be effected by dynamic variations of their physical quantities.
- Hopfield Networks: an alternative is to use Hopfield networks. These rely upon the concept of attractors in a non-linear dynamical system, as follows: "Any physical system whose dynamics in phase space is dominated by a substantial number of locally stable states to which it is attracted, can therefore be regarded as a general content-addressable memory".
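A minimal sketch of a Hopfield network acting as a content-addressable memory, assuming bipolar (+1/-1) units, Hebbian storage and asynchronous updates. The function names and the six-unit pattern are hypothetical and far smaller than a real fingerprint.

```python
def train_hopfield(patterns):
    """Hebbian weights for bipolar patterns, with a zero diagonal."""
    size = len(patterns[0])
    weights = [[0] * size for _ in range(size)]
    for pattern in patterns:
        for i in range(size):
            for j in range(size):
                if i != j:
                    weights[i][j] += pattern[i] * pattern[j]
    return weights


def recall(weights, state, max_sweeps=10):
    """Update units one at a time until the state stops changing."""
    state = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i, row in enumerate(weights):
            field = sum(w * s for w, s in zip(row, state))
            new = state[i] if field == 0 else (1 if field > 0 else -1)
            if new != state[i]:
                state[i] = new
                changed = True
        if not changed:
            break
    return state


stored = [1, -1, 1, -1, 1, -1]
weights = train_hopfield([stored])
noisy = [-1, -1, 1, -1, 1, -1]  # first unit flipped
restored = recall(weights, noisy)
```

The stored pattern is a locally stable state, so a probe that differs from it in one unit is attracted back to it.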
- Fuzzy set: by using fuzzy sets, overlap between clusters is automatically taken into account.
- A member (say 'a') of a fuzzy set (say A) has a degree of belongingness to its own set and also to other sets (say B and C), which can be expressed by the fuzzy membership functions mA(a), mB(a) and mC(a).
- 'a' only belongs to A, not B and C.
- Fuzzy membership functions for (or degrees of belongingness of) each point in a high dimensional space are defined according to its position and the distribution of the different clusters. This stage is called fuzzification.
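The text does not give the form of the membership functions. As an illustration only, the sketch below borrows the membership formula of fuzzy c-means, in which the degree of membership of a point in each cluster depends on its distance to every cluster centre and the memberships sum to 1. The function name, the `fuzziness` exponent and the two example centres are hypothetical.

```python
import math


def fuzzy_memberships(point, centres, fuzziness=2.0):
    """Fuzzy c-means style memberships of `point` in each named cluster."""
    distances = {name: math.dist(point, c) for name, c in centres.items()}
    exact = [name for name, d in distances.items() if d == 0.0]
    if exact:
        # A point sitting exactly on a centre belongs to that cluster only.
        return {name: (1.0 if name == exact[0] else 0.0) for name in distances}
    power = 2.0 / (fuzziness - 1.0)
    inverse = {name: d ** -power for name, d in distances.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


centres = {"A": (0.0, 0.0), "B": (3.0, 0.0)}
memberships = fuzzy_memberships((1.0, 0.0), centres)
```

A point lying between two overlapping clusters thus receives a graded membership in both, rather than being forced into one of them.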
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Debugging And Monitoring (AREA)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB0220671A GB2375862A (en) | 2000-03-10 | 2001-03-09 | Data analysis |
| AU2001237612A AU2001237612A1 (en) | 2000-03-10 | 2001-03-09 | Data analysis |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB0005875.0 | 2000-03-10 | | |
| GBGB0005875.0A GB0005875D0 (en) | 2000-03-10 | 2000-03-10 | Data analysis |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| WO2001067295A2 (fr) | 2001-09-13 |
| WO2001067295A3 (fr) | 2003-02-06 |
Family
ID=9887424
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/GB2001/001031 (Ceased) WO2001067295A2 (fr) | 2000-03-10 | 2001-03-09 | Data analysis |
Country Status (3)
| Country | Link |
|---|---|
| AU (1) | AU2001237612A1 (fr) |
| GB (2) | GB0005875D0 (fr) |
| WO (1) | WO2001067295A2 (fr) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119131436A (zh) * | 2024-09-04 | 2024-12-13 | 冠县嘉源纺织有限公司 | Intelligent processing method for operating data of automated textile equipment |
2000
- 2000-03-10 GB GBGB0005875.0A patent/GB0005875D0/en not_active Ceased
2001
- 2001-03-09 WO PCT/GB2001/001031 patent/WO2001067295A2/fr not_active Ceased
- 2001-03-09 GB GB0220671A patent/GB2375862A/en not_active Withdrawn
- 2001-03-09 AU AU2001237612A patent/AU2001237612A1/en not_active Abandoned
Non-Patent Citations (6)
| Title |
|---|
| COVER T. M.: "Geometrical and Statistical Properties of Systems of Linear Inequalities with Application in pattern recognition" IEEE TRANSACTIONS ON ELECTRONIC COMPUTERS , vol. EC14, 1965, pages 235-334, XP001074416 * |
| PAPADIMITRIOU S ET AL: "Nonlinear analysis of the performance and reliability of wavelet singularity detection based denoising for doppler ultrasound fetal heart rate signals" INTERNATIONAL JOURNAL OF MEDICAL INFORMATICS, ELSEVIER SCIENTIFIC PUBLISHERS, SHANNON, IR, vol. 53, no. 1, January 1999 (1999-01), pages 43-60, XP004158050 ISSN: 1386-5056 * |
| RAGUNATHAN N ET AL: "Multispectral detection for gas chromatography" JOURNAL OF CHROMATOGRAPHY A, ELSEVIER SCIENCE, NL, vol. 703, no. 1, 26 May 1995 (1995-05-26), pages 335-382, XP004023371 ISSN: 0021-9673 * |
| SINGHAL A ET AL: "Document length normalization" INFORMATION PROCESSING & MANAGEMENT, ELSEVIER, BARKING, GB, vol. 32, no. 5, 1 September 1996 (1996-09-01), pages 619-633, XP004012180 ISSN: 0306-4573 * |
| TIMOSZCZUK A P ET AL: "RBF neural networks and MTI for text independent speaker identification" NEURAL NETWORKS, 1998. PROCEEDINGS. VTH BRAZILIAN SYMPOSIUM ON BELO HORIZONTE, BRAZIL 9-11 DEC. 1998, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, 9 December 1998 (1998-12-09), pages 124-129, XP010313705 ISBN: 0-8186-8629-4 * |
| TONG C S ET AL: "Mass spectral search method using the neural network approach" NEURAL NETWORKS, 1999. IJCNN '99. INTERNATIONAL JOINT CONFERENCE ON WASHINGTON, DC, USA 10-16 JULY 1999, PISCATAWAY, NJ, USA,IEEE, US, 10 July 1999 (1999-07-10), pages 3962-3967, XP010372551 ISBN: 0-7803-5529-6 * |
Also Published As
| Publication number | Publication date |
|---|---|
| AU2001237612A1 (en) | 2001-09-17 |
| GB2375862A (en) | 2002-11-27 |
| GB0005875D0 (en) | 2000-05-03 |
| WO2001067295A3 (fr) | 2003-02-06 |
| GB0220671D0 (en) | 2002-10-16 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Firpi et al. | | Swarmed feature selection |
| CN112904299B (zh) | | Radar high-resolution range profile open-set target recognition method based on deep intra-class splitting |
| CN108694390B (zh) | | Modulation signal classification method using a support vector machine optimized by cuckoo-search-improved grey wolf optimization |
| Bao et al. | | Identification of wheat leaf diseases and their severity based on elliptical-maximum margin criterion metric learning |
| US20080101705A1 (en) | | System for pattern recognition with q-metrics |
| CN119380825B (zh) | | Metabolite prediction method with multi-scale feature enhancement and feature selection |
| CN113011346A (zh) | | Method for identifying unknown emitter signals based on metric learning |
| Li et al. | | A score-level fusion fingerprint indexing approach based on minutiae vicinity and minutia cylinder-code |
| US20020059151A1 (en) | | Data analysis |
| CN119091225B (zh) | | Individual ship target recognition method based on feature decoupling |
| CN118962600B (zh) | | Improved radar signal sorting algorithm based on the FCM algorithm, device and storage device |
| CN113780455A (zh) | | Moving target recognition method using C-SVM based on fuzzy membership functions |
| CN119580881A (zh) | | Method and system for identifying soil organic pollutants by combining AI with high-throughput screening |
| CN112257792A (zh) | | SVM-based dynamic classification method for real-time video targets |
| WO2001067295A2 (fr) | | Data analysis |
| Adinugroho et al. | | Leaves classification using neural network based on ensemble features |
| CN119089280A (zh) | | Method for balanced augmentation of imbalanced multimodal mental illness data |
| Jena et al. | | Elitist TLBO for identification and verification of plant diseases |
| Chang et al. | | An Efficient Hybrid Classifier for Cancer Detection. |
| Ardon et al. | | Aerial Radar Target Classification using Artificial Neural Networks. |
| CN118070099A (zh) | | Zero-shot incremental recognition method for sorted radar pulse sequences |
| CN110780270A (zh) | | Local regularized learning subspace feature extraction method for target library attribute discrimination |
| CN118277902A (zh) | | Machine-learning-based method for real-time sorting and recognition of radar signals |
| US20040042665A1 (en) | | Method and computer program product for automatically establishing a classifiction system architecture |
| Rolfe et al. | | Using geometric shape variations to create an inverse model for a sheet metal process |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | AK | Designated states | Kind code of ref document: A2. Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW |
| | AL | Designated countries for regional patents | Kind code of ref document: A2. Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
| | DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | |
| | ENP | Entry into the national phase in: | Ref country code: GB. Ref document number: 200220671. Kind code of ref document: A. Format of ref document f/p: F |
| | 122 | Ep: pct application non-entry in european phase | |
| | NENP | Non-entry into the national phase in: | Ref country code: JP |