
WO2018133340A1 - Data analysis method and device - Google Patents

Data analysis method and device

Info

Publication number
WO2018133340A1
WO2018133340A1 PCT/CN2017/092186 CN2017092186W WO2018133340A1 WO 2018133340 A1 WO2018133340 A1 WO 2018133340A1 CN 2017092186 W CN2017092186 W CN 2017092186W WO 2018133340 A1 WO2018133340 A1 WO 2018133340A1
Authority
WO
WIPO (PCT)
Prior art keywords
subspace
node
probability
score
medical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2017/092186
Other languages
English (en)
French (fr)
Inventor
张振中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BOE Technology Group Co Ltd
Original Assignee
BOE Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BOE Technology Group Co Ltd
Priority to EP17859359.6A (patent EP3572959A4)
Priority to US15/768,825 (patent US11195114B2)
Publication of WO2018133340A1 (zh)

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • G16H70/00 ICT specially adapted for the handling or processing of medical references
    • G16H70/20 ICT specially adapted for the handling or processing of medical references relating to practices or guidelines
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G06N20/00 Machine learning

Definitions

  • the present disclosure relates to a data analysis method and apparatus, and more particularly to a data analysis method and apparatus based on medical data and/or medical knowledge maps.
  • A common practice for predicting diseases from medical data is to extract features from the data, such as blood pressure values, blood sugar values and other test indicators, then train a predictive model (such as a logistic regression model) with machine learning methods, and finally use the model to perform prediction.
  • Such practices often use only test indicators as features and do not take advantage of other useful information, such as patient self-reported symptoms.
  • a method of medical data analysis comprising:
  • the probability P 1 of the object in its subspace is analyzed by judging the semantic consistency of the subspace in which the vector h is located.
  • locating the sub-space in which the object is located in the medical data set by the feature parameter further comprises:
  • the sparse solution comprises:
  • analyzing the semantic consistency of the subspace in which the vital parameter vector h is located further includes:
  • the above method further comprises:
  • the maximum value of C 1 , C 2 , ..., C M , C ⁇ is output as the probability P 1 that the object is in its subspace.
  • a medical data analysis method based on a medical knowledge map including:
  • the score p_0 = [p_{0,1}, p_{0,2}, ..., p_{0,N}], where each node v_i represents a vital parameter or a related category, and p_{0,i} represents the initial evidence score of the node v_i;
  • the probability P 2 of the object at the node to which it belongs is obtained based on the score of each node in the final evidence score p t .
  • determining the final evidence score p t for each node further comprises:
  • d is the damping coefficient, with a value between 0 and 1;
  • w_{i,j} denotes the weight of the edge e_{i,j} connecting the nodes v_i and v_j in V.
  • the iteration terminates when p_t no longer changes or when the iteration reaches a predetermined maximum number of iterations.
  • determining the final evidence score p t for each node further comprises:
  • the final evidence score for each node is determined by the following formula:
  • d is the damping coefficient, with a value between 0 and 1;
  • I is the N × N identity matrix, and w_{i,j} denotes the weight of the edge e_{i,j} connecting the nodes v_i and v_j in V.
  • obtaining the probability P_2 of the object at the node to which it belongs based on the score of each node in the final evidence score p_t further comprises:
  • determining the probability at each disease node by calculating the percentage of the final score of each disease node in V in the sum of the final scores of all disease nodes in V;
  • outputting the maximum of these probabilities as the probability P_2 that the object is at the node to which it belongs.
  • a medical data analysis method comprising:
  • the probability P 1 of the object in its subspace is analyzed by analyzing the semantic consistency of the subspace in which the subject's vital parameter is located;
  • a probability P 2 of the object at its associated node is analyzed based on an evidence transfer score of the subject's vital parameter on the medical knowledge map;
  • the probability P of the object in its subspace or node is determined based on the probability P 1 and the probability P 2 :
  • a medical data analysis device comprising:
  • a receiving unit configured to receive a physical parameter of an object and establish a vector h according to the physical parameter to represent the physical parameter
  • the first analysis unit is configured to analyze the probability P 1 of the object in its subspace by determining the semantic consistency of the subspace in which the vector h is located.
  • the subspace positioning unit is further configured to:
  • the sparse solution comprises:
  • the first analysis unit is further configured to:
  • the above device further comprises:
  • the first output unit is configured to output a maximum value of C 1 , C 2 , . . . , C M , C ⁇ as a probability P 1 that the object is in its subspace.
  • a medical data analysis device based on a medical knowledge map comprising:
  • a receiving unit configured to receive a physical parameter of an object
  • a second analysis unit configured to determine a final evidence score p t for each node based on evidence transfer scores on the partial medical knowledge map of the subject's vital parameters
  • a determining unit configured to obtain a probability P 2 that the object is at a node to which it belongs based on a score of each node in the final evidence score p t .
  • the second analyzing unit further comprises:
  • a calculation unit configured to perform an iterative operation by the following formula to determine a final evidence score for each node:
  • d is the damping coefficient, with a value between 0 and 1;
  • w_{i,j} denotes the weight of the edge e_{i,j} connecting the nodes v_i and v_j in V.
  • the iteration terminates when p_t no longer changes or when the iteration reaches a predetermined maximum number of iterations.
  • the second analyzing unit further comprises:
  • a calculation unit configured to determine a final evidence score for each node by the following formula:
  • d is the damping coefficient, with a value between 0 and 1;
  • I is the N × N identity matrix, and w_{i,j} denotes the weight of the edge e_{i,j} connecting the nodes v_i and v_j in V.
  • the determining unit is further configured to:
  • Determining the probability of being in each category node by calculating the percentage of the final score of each category node in V in the sum of the final scores of all category nodes in V;
  • the maximum probability of the probabilities is output as the probability P 2 at the class node to which the object belongs.
  • a medical data analysis device comprising:
  • a receiving unit configured to receive a physical parameter of an object
  • a subspace positioning unit configured to locate the subspace in which the object is located in the medical data set by using the physical parameter
  • a first analyzing unit configured to analyze a probability P 1 of the object in its subspace by analyzing semantic consistency of a subspace in which the subject's vital parameter is located;
  • a second analysis unit configured to analyze a probability P 2 of the object at a node to which it belongs based on an evidence transfer score on the medical knowledge map of the subject's vital parameter
  • a harmonic unit configured to determine a probability P that the object is in its subspace or node based on probability P 1 and probability P 2 :
  • a medical data analysis device comprising:
  • a memory configured to store computer executable instructions
  • a processor coupled to the memory is configured to execute the computer executable instructions such that the processor performs any of the methods described above.
  • a computer readable storage medium having stored thereon computer readable instructions that, when executed by a computing device, cause the computing device to perform any of the methods described above.
  • FIG. 1 is a flow chart of a data analysis method based on medical big data and medical knowledge maps, according to one embodiment
  • Figure 2 shows a schematic diagram of several subspace examples within a large data space
  • Figure 3 shows an example of graph-based evidence propagation in a medical knowledge map
  • Figure 4 shows a schematic diagram of a simplified example of a medical knowledge map
  • FIG. 5 is a schematic structural diagram of a data analysis device according to an embodiment
  • FIG. 6 illustrates an example computing device that can be utilized to implement one or more embodiments.
  • the object belongs to which subspace or type within the big data or which node or category in the knowledge map. It should also be noted that, as indicated above, the invention is not limited to the medical field, and therefore the solutions of all methods provided by the present disclosure are not directly used in the diagnosis and treatment of diseases.
  • FIG. 1 is a flow chart of a method of data analysis based on medical big data and medical knowledge maps, in accordance with one embodiment.
  • the medical big data herein is only an example; the present invention is not limited to medical big data or case big data, and can also be applied to other data sets that include multiple subspaces, including but not limited to other types of big data.
  • the embodiment comprehensively utilizes implicit knowledge in medical big data and explicit knowledge in medical knowledge maps for disease analysis.
  • the method includes three major parts: analysis based on medical big data (the part enclosed by the dashed frame in the middle of Fig. 1), analysis based on the medical knowledge map (the part enclosed by the dashed frame in the right part of Fig.
  • the basic idea of the module is that patients with similar symptoms may have the same disease, and patients with the same disease are likely to have similar characteristics. This is in line with real-world situations. For example, in 1998 Lloyd Minor, later dean of the Stanford University School of Medicine, and colleagues reported for the first time a rare disease, superior semicircular canal dehiscence syndrome. Patients with this disease may experience symptoms such as dizziness and abnormal sensitivity to sound. This was an ordinary academic finding, but many patients around the world who had gone years without finding a cause, or who had tried tests and treatment in other departments without success, were finally diagnosed and treated after searching for the relevant symptom information.
  • the "Medical Big Data-Based Analysis" portion includes four steps: S1, receiving the symptoms of a patient or subject; S2, searching for the subspace in which the patient or subject is located in the medical big data; S3, analyzing the semantic consistency of the subspace; S4, outputting the analysis result P_1 based on the medical big data, that is, the probability P_1 that the patient or subject is in a specific subspace (i.e., the subspace to which it belongs).
  • the patient's symptoms are received at step S1.
  • the main function of this step is to collect self-reported symptoms during the patient's visit, such as dizziness, headache, and the like.
  • the self-reported symptoms herein are merely an example; the present invention is not limited to self-reported symptoms and can also be applied to symptoms obtained by other means, such as observation.
  • the present invention is also applicable to other vital parameters including, but not limited to, vital signs that can be obtained without physical examination or vital signs obtained by very simple physical examination.
  • In step S2, the subspace in which the patient is located is searched for in the medical big data.
  • the main function of this step is to find the subspace in which the patient is in the medical big data based on the collected symptoms. Since this embodiment performs disease analysis based on medical big data, a large number of cases, such as confirmed cases of various hospitals over the years, are required, which corresponds to the "case big data" module in FIG.
  • D_i (1 ≤ i ≤ M) represents the i-th disease, and D_{i,j} (1 ≤ i ≤ M, 1 ≤ j ≤ K) represents the j-th case of the i-th disease.
  • Each case consists of a series of corresponding feature vectors (such as symptoms); the matrix D constitutes the semantic space of confirmed cases, and each D_i (1 ≤ i ≤ M) constitutes a subspace within that semantic space.
  • More generally, big data can include multiple subspaces, each representing a category; each category can have corresponding instances, and each instance can have multiple features.
  • the features of an object can be used as input to locate or search for the subspace in which the object is located, on exactly the same principle as locating the patient's subspace.
  • the invention is not limited to big data and can be applied to data sets comprising multiple subspaces.
  • Each case consists of a series of corresponding feature vectors, so D_{i,j} (1 ≤ i ≤ M, 1 ≤ j ≤ K) can be used to represent the j-th case of the i-th disease.
  • If the patient belongs to the subspace D_i, the vector h can be expressed as a linear combination of the cases in D_i: h = α_{i,1} D_{i,1} + α_{i,2} D_{i,2} + ... + α_{i,K} D_{i,K}, where α_{i,j} (1 ≤ j ≤ K) is a correlation coefficient.
  • the symptoms in case 1 are "dizziness, nausea, palpitations and shortness of breath".
  • the symptoms in case 2 are "palpitations and shortness of breath, tinnitus, limb numbness".
  • the symptoms in case 3 are "dizziness, nausea".
  • all the symptoms in the case set D are collected into a set S, and the number of symptoms in S gives the dimension of each case vector.
  • S = {dizziness, nausea, palpitations and shortness of breath, tinnitus, limb numbness}
  • S contains a total of 5 symptoms ("palpitations and shortness of breath" counts as a single symptom), with dizziness as the first dimension in the vector and limb numbness as the fifth, so the different symptoms in the cases correspond to different dimensions of the vector.
  • the dimension value corresponding to a symptom that appears in the case is set to 1, and the dimension value corresponding to a symptom that does not appear is set to 0.
  • case 1 = [1,1,1,0,0]^T
  • case 2 = [0,0,1,1,1]^T
  • case 3 = [1,1,0,0,0]^T
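  • The encoding described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the helper name `encode_case` is hypothetical, and the sketch assumes that "palpitations and shortness of breath" is treated as one symptom so that the vocabulary has five entries.

```python
# Symptom vocabulary in the order of the set S; names are illustrative.
SYMPTOMS = [
    "dizziness",
    "nausea",
    "palpitations and shortness of breath",
    "tinnitus",
    "limb numbness",
]

def encode_case(symptoms, vocabulary=SYMPTOMS):
    """Return a 0/1 vector with one dimension per symptom in `vocabulary`."""
    present = set(symptoms)
    unknown = present - set(vocabulary)
    if unknown:
        raise ValueError(f"symptoms not in vocabulary: {sorted(unknown)}")
    # A dimension is 1 when the symptom appears in the case, otherwise 0.
    return [1 if s in present else 0 for s in vocabulary]

case_1 = encode_case(["dizziness", "nausea", "palpitations and shortness of breath"])
case_2 = encode_case(["palpitations and shortness of breath", "tinnitus", "limb numbness"])
case_3 = encode_case(["dizziness", "nausea"])
```

  • A new patient's symptoms can be encoded with the same function to obtain the vector h, provided every reported symptom already appears in the vocabulary.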
  • each disease can be represented as a subspace composed of known cases contained therein, and a certain case belonging to the disease can be composed of a linear combination of the bases of the corresponding subspaces.
  • Figure 2 shows a schematic diagram of several subspace examples within a large data space;
  • the disease of the patient h can be determined by finding the subspace of D in which patient h is located.
  • Let D = [D_1, D_2, ..., D_M]; then the subspace in which the patient h is located can be obtained by formula (1).
  • X is a column vector, X = [x_1^T, x_2^T, ..., x_i^T, ..., x_M^T]^T, whose dimension is not M but the number of cases in D, i.e., the same as the dimension of [D_{1,1}, D_{1,2}, ..., D_{i,1}, D_{i,2}, ..., D_{M,K}].
  • For example, if D contains 2 diseases, where disease 1 contains a cases and disease 2 contains b cases, then the dimension of X is a + b.
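  • The relationship between D, X and h can be illustrated with a small sketch. Formula (1) is not reproduced in this text, so the sketch assumes it expresses h as the product D·X; the toy data and the helper name `reconstruct` are hypothetical.

```python
# Hypothetical toy data: each case is a 0/1 symptom vector (one column of D).
D1 = [[1, 1, 0], [1, 0, 0]]             # disease 1 with a = 2 cases
D2 = [[0, 0, 1], [0, 1, 1], [0, 0, 1]]  # disease 2 with b = 3 cases
columns = D1 + D2                        # D = [D_1, D_2], one entry per case

def reconstruct(columns, X):
    """Return D·X, the linear combination of case vectors weighted by X."""
    if len(X) != len(columns):
        raise ValueError("X needs one coefficient per case in D")
    dim = len(columns[0])
    return [sum(x * col[k] for x, col in zip(X, columns)) for k in range(dim)]

# X has length a + b = 5; here only the first case of disease 1 is used.
X = [1.0, 0.0, 0.0, 0.0, 0.0]
h = reconstruct(columns, X)
```

  • Because X carries one coefficient per case rather than per disease, the positions of its nonzero entries indicate which disease's cases were used to rebuild h.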
  • the present invention employs a sparse solution method (using the fewest cases to reconstruct h).
  • the advantage of using a sparse solution is to reduce the impact of "noise" data, making the model robust.
  • the specific solution is as follows:
  • ⁇ i, j may not be the optimal solution.
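  • A sparse reconstruction of h from a few cases can be sketched with a greedy method. The solver below is plain matching pursuit, chosen as a stand-in because the patent's own solution formula is not reproduced in this text; the function name, parameters and toy data are all assumptions made for illustration.

```python
def matching_pursuit(columns, h, max_cases=3, tol=1e-9):
    """Greedy sparse approximation of h by a few columns (cases) of D.

    Returns a coefficient list X with one entry per case; most entries stay 0.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    X = [0.0] * len(columns)
    residual = [float(v) for v in h]
    for _ in range(max_cases):
        if dot(residual, residual) <= tol:
            break  # h is already reconstructed closely enough
        # Pick the case whose direction explains the most of the residual.
        best, best_score = None, 0.0
        for j, col in enumerate(columns):
            norm_sq = dot(col, col)
            if norm_sq == 0:
                continue
            score = dot(residual, col) ** 2 / norm_sq
            if score > best_score:
                best, best_score = j, score
        if best is None:
            break  # no case correlates with what is left of h
        col = columns[best]
        step = dot(residual, col) / dot(col, col)
        X[best] += step
        residual = [r - step * c for r, c in zip(residual, col)]
    return X

# Toy data (hypothetical): three known cases over five symptom dimensions.
cases = [[1, 1, 1, 0, 0], [0, 0, 1, 1, 1], [1, 1, 0, 0, 0]]
X = matching_pursuit(cases, h=[1, 1, 0, 0, 0])
```

  • Limiting the number of selected cases is what keeps the solution sparse; a convex L1-based solver could be substituted without changing how X is used afterwards.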
  • The method then proceeds to step S3 to perform subspace semantic consistency analysis.
  • the main function of this step is to analyze the probability of being in a specific subspace (ie, the subspace to which it belongs) by analyzing the semantic consistency of the subspace in which h is located.
  • Let δ_i(X) denote the coefficient vector that keeps the coefficients of X belonging to subspace D_i and sets the remaining dimensions to 0; its dimension is the number of all cases in D, i.e., the same as the dimension of [D_{1,1}, D_{1,2}, ..., D_{i,1}, D_{i,2}, ..., D_{M,K}].
  • After step S3, the result of the medical big data analysis is output in step S4. That is, the maximum value among C 1 , C 2 , . . . , C M , C ⁇ is output as the probability P_1 that the patient is in the subspace corresponding to the maximum value.
  • the subspace or category corresponding to the maximum value is also finally determined as the subspace or category in which the patient is located.
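  • The selection of the subspace with the highest consistency can be sketched as follows. The consistency formula itself is not reproduced in this text, so the score used here (the share of the L2 norm of X carried by each subspace) is an assumption made for illustration, as are the function names; the additional final term listed after C_M in the text is not modelled.

```python
def delta(X, block_sizes, i):
    """Keep the coefficients of X that belong to subspace i; zero the rest."""
    start = sum(block_sizes[:i])
    stop = start + block_sizes[i]
    return [x if start <= k < stop else 0.0 for k, x in enumerate(X)]

def consistency_scores(X, block_sizes):
    """Assumed score: share of the L2 norm of X carried by each subspace."""
    def l2(v):
        return sum(x * x for x in v) ** 0.5

    norms = [l2(delta(X, block_sizes, i)) for i in range(len(block_sizes))]
    total = sum(norms)
    return [n / total if total else 0.0 for n in norms]

# Coefficients over a + b = 2 + 3 cases (hypothetical values).
X = [0.0, 0.0, 3.0, 4.0, 0.0]
scores = consistency_scores(X, block_sizes=[2, 3])
P1 = max(scores)                  # probability reported for the best subspace
best_subspace = scores.index(P1)  # 0-based index of that subspace
```

  • The subspace index returned alongside P1 corresponds to the disease or category finally assigned to the patient.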
  • step S4 The main function of step S4 is to output an analysis result based on medical big data.
  • Some embodiments of the present disclosure propose to utilize medical big data to analyze a patient's condition.
  • the protocol utilizes the patient's physical symptoms as a feature, and can find the subspace in which the patient is in the medical big data according to the patient's symptoms, and analyze the patient's condition through the semantic consistency of the subspace.
  • the solution does not need to establish a model for each disease, and the prediction efficiency is high.
  • the patient's condition can be analyzed according to the symptoms at the first time, and the patient can be inspected in a targeted manner, thereby reducing the cost and improving the efficiency.
  • Figure 3 shows an example of graph-based evidence propagation in a medical knowledge map.
  • nodes represent diseases or symptoms (in more general cases, categories and features), and the edges between nodes reflect the semantic relevance between nodes.
  • the basic idea of the analysis based on medical knowledge maps is that a symptom (or disease) passes a higher evidence score to a disease (or symptom) with high semantic relevance than to one with low semantic relevance.
  • the present invention is equally applicable to other knowledge maps having structures similar to medical knowledge maps for analyzing and determining the nodes or categories to which an object belongs.
  • the principle is exactly the same as the principle based on the medical knowledge map.
  • In step S5, the method performs graph-based evidence propagation. That is, based on the evidence transfer scores on the medical knowledge map, the probability P_2 that the patient is at a given disease node is analyzed.
  • the role of this step is to analyze the patient's condition by combining the patient's symptoms with the explicit medical knowledge in the medical knowledge map. Specifically, the symptoms are given initial evidence scores, and then, based on the basic idea that a symptom (or disease) passes a higher evidence score to a disease (or symptom) with high semantic relevance than to one with low semantic relevance, the evidence scores are propagated in the medical knowledge map until the evidence scores of all nodes no longer change or change very little.
  • the value of the initial evidence score does not matter, because according to the Markov chain convergence theorem the final score is independent of the initial value.
  • setting a "good" initial value, however, helps convergence.
  • a "bad" initial value may take 10,000 iterations to converge to the final result, while a "good" initial value may need only 1,000.
  • the end result is the same.
  • the "good" initial value is set using a person's prior knowledge. For example, based on prior knowledge, a greater weight can be set for the more significant initial evidence scores. If it is not known which evidence is more significant, that is, if there is no prior knowledge, then according to the principle of Occam's razor the same weight is generally given to each evidence score. This is the case in which the entropy is largest.
  • V = {v_1, v_2, ..., v_N} represents the set of vertices or nodes in the medical knowledge map, and E = {..., e_{i,j}, ...} represents the set of edges between the nodes, with e_{i,j} representing the edge between nodes v_i and v_j, where 1 ≤ i ≤ N, 1 ≤ j ≤ N, i ≠ j.
  • W is the set of edge weights, where w_{i,j} represents the weight on the edge e_{i,j}.
  • Let p_t = [p_{t,1}, p_{t,2}, ..., p_{t,N}] be the evidence scores of the nodes (symptoms or diseases) after t iterations of evidence propagation.
  • the evidence scores of each node after the (t+1)-th iteration are:
  • d is the damping coefficient, with a value between 0 and 1,
  • M(i) represents the set of nodes connected to the node v_i.
  • the evidence score propagation process will eventually converge, but in theory there is no guarantee of how many iterations convergence will take.
  • Therefore, a maximum number of iterations is set, for example 1 million.
  • Once that many iterations have been performed, the result is already very close to the converged result, and there is no need to spend much more time to reach full convergence. The maximum number of iterations therefore reflects a balance between time efficiency and result precision.
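  • The iteration with its two termination conditions can be sketched as follows. The update formula is not reproduced in this text, so the sketch assumes a damped update of the form p_{t+1} = (1 - d)·p_0 + d·W·p_t with weights already normalised per source node; the function name, default values and toy graph are hypothetical.

```python
def propagate(p0, W, d=0.85, max_iter=1000, tol=1e-10):
    """Iterate p_{t+1} = (1 - d) * p0 + d * W p_t until p_t stops changing.

    W[i][j] is the (assumed, already normalised) weight carried from node j
    to node i; p0 holds the initial evidence scores.
    Returns the final scores and the number of iterations performed.
    """
    n = len(p0)
    p = list(p0)
    iteration = 0
    for iteration in range(1, max_iter + 1):
        nxt = [(1 - d) * p0[i] + d * sum(W[i][j] * p[j] for j in range(n))
               for i in range(n)]
        change = max(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if change < tol:  # scores no longer change noticeably
            break
    return p, iteration

# Hypothetical 3-node graph: node 0 is a symptom linked to disease nodes 1 and 2.
W = [[0.0, 1.0, 1.0],
     [0.7, 0.0, 0.0],
     [0.3, 0.0, 0.0]]
scores, used = propagate(p0=[1.0, 0.0, 0.0], W=W)
```

  • The loop stops either when the largest score change falls below `tol` or when `max_iter` is reached, which mirrors the trade-off between time efficiency and precision described above.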
  • equation (4) can be changed to a matrix representation as follows:
  • I is an identity matrix of N x N
  • p i is the final evidence score of node v i .
  • the final evidence score of each node can be directly calculated.
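  • The direct calculation can be motivated by a short derivation. Neither the iterative nor the matrix formula is reproduced in this text, so the derivation below assumes the iteration has the damped form p_{t+1} = (1 - d) p_0 + d W p_t; at convergence p_{t+1} = p_t = p, which gives:

```latex
\begin{aligned}
p &= (1-d)\,p_0 + d\,W p \\
(I - dW)\,p &= (1-d)\,p_0 \\
p &= (1-d)\,(I - dW)^{-1}\,p_0
\end{aligned}
```

  • Under this assumption the inverse exists whenever the spectral radius of dW is below 1, for example when the columns of W each sum to 1 and d is strictly less than 1.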
  • the probability that the patient is at the node "without aura migraine" is The probability of being at the node "rabies" is
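  • Turning final evidence scores into probabilities, as described for the disease nodes, can be sketched as follows. The node names echo the example above, while the scores and the function name `disease_probabilities` are hypothetical.

```python
def disease_probabilities(final_scores, disease_nodes):
    """Share of each disease node's final score in the total over disease nodes."""
    total = sum(final_scores[n] for n in disease_nodes)
    if total == 0:
        raise ValueError("no evidence reached any disease node")
    return {n: final_scores[n] / total for n in disease_nodes}

# Hypothetical final scores; symptom nodes are ignored in the normalisation.
final_scores = {"headache": 0.5, "migraine without aura": 0.3, "rabies": 0.1}
probs = disease_probabilities(final_scores, ["migraine without aura", "rabies"])

# The largest probability is reported as P2, together with its node.
node, P2 = max(probs.items(), key=lambda kv: kv[1])
```

  • Only disease nodes enter the denominator, so the probabilities sum to 1 over the candidate diseases regardless of how much score remains on symptom nodes.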
  • Some embodiments of the present disclosure propose analyzing a patient's condition by propagating evidence related to the patient's symptoms through a medical knowledge map.
  • the knowledge map is established through medical knowledge.
  • the embodiment can utilize the explicit knowledge or information in the knowledge map for disease analysis, and can analyze the patient's condition according to the symptom at the first time, and let the patient check in a targeted manner, thereby reducing the cost and improving the efficiency.
  • Figure 4 shows a schematic diagram of a simplified example of a medical knowledge map.
  • the large circle in the center represents one of the categories in the knowledge map.
  • In this example it indicates a disease, and the nodes directly connected to it represent the relationships between the category and other features, in this case cause, symptom and treatment.
  • the outermost circles indicate the corresponding features, which in this case may be the symptoms, the causes, and the treatments.
  • the weights on the edges are omitted in the figure.
  • The method then proceeds to step S7, which outputs the probability P_2, obtained from the medical knowledge map analysis, that the patient is at a certain disease node or belongs to a certain category.
  • The method then proceeds to step S8 to output the final result. That is, the probability that the patient belongs to a certain category is determined based on P_1 obtained from step S4 and P_2 obtained from step S7.
  • the function of this step is to provide a final analysis result based on both the analysis of medical big data and the analysis of the medical knowledge map. Specifically, the two scores are combined by linear weighting: the probability that a patient belongs to a certain category is a weighted sum of P_1 and P_2.
  • the harmonic parameter, which takes a value between 0 and 1, adjusts the relative weight of the two analysis methods.
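  • The linear weighting can be sketched in a few lines. The exact formula is not reproduced in this text, so assigning the harmonic parameter (called `lam` here, a hypothetical name) to P1 and its complement to P2 is an assumption; the sketch also assumes both probabilities refer to the same category.

```python
def blend(p1, p2, lam=0.5):
    """Linear weighting of the two analysis results; lam is the harmonic parameter."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p1 + (1.0 - lam) * p2

# Hypothetical inputs: P1 from the big data analysis, P2 from the knowledge map.
P = blend(p1=0.8, p2=0.6, lam=0.5)
```

  • Setting `lam` to 1 relies only on the big data analysis and setting it to 0 relies only on the knowledge map, so the parameter can be tuned to reflect how much each source is trusted.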
  • this embodiment uses the patient's symptoms as features. On the one hand, it finds the subspace in which the patient is located in the medical big data based on the patient's symptoms and analyzes the patient's condition through the semantic consistency of the subspace. On the other hand, it can also analyze the patient's condition by propagating evidence related to the symptoms through the medical knowledge map. Finally, the two kinds of information can be combined for analysis to output the final conclusion or result.
  • the embodiment can comprehensively utilize the regularities, knowledge or information implicit in large-scale medical big data together with the explicit knowledge or information in the knowledge map for disease analysis, thereby improving the accuracy of the analysis.
  • By analyzing the patient's physiological phenomena (symptoms), the patient's condition can be analyzed from the symptoms at the earliest opportunity, and the patient can be examined in a targeted manner, thereby reducing cost and improving efficiency.
  • FIG. 5 is a schematic structural diagram of a data analysis device according to an embodiment.
  • the data analysis device performs analysis based on medical big data and medical knowledge maps. Similar to FIG. 1, the data analysis device 500 shown in FIG. 5 also includes three parts: a big data analysis device (shown by the dashed box on the left), a medical data analysis device (the part enclosed by the upper right dashed box plus the receiving unit 510), and a blending portion that reconciles the outputs of the above two parts (shown by the lower right dashed box).
  • the big data analysis device and the medical data analysis device can each be implemented independently as separate devices.
  • the big data analysis device can also be used to analyze other big data to determine the subspace or category in which an object is located in the big data space, and the medical data analysis device can also be configured to analyze other types of knowledge maps that are structurally similar to medical knowledge maps, to determine the class node or category in which an object is located in the knowledge map.
  • the big data analysis device may include a receiving unit 510, a subspace positioning unit 520, a first analyzing unit 530, and an optional first output unit 540.
  • the receiving unit 510 can be configured to receive the symptoms of the patient and to use the vector h to represent the symptoms.
  • the subspace positioning unit 520 can be configured to locate the subspace in which the patient is located in the medical big data, using the symptoms as features.
  • the first analysis unit 530 can be configured to analyze the probability P_1 of the patient being in a particular subspace or category (i.e., the subspace or category to which it belongs) by analyzing the semantic consistency of the subspace in which the symptom vector h is located.
  • the subspace positioning unit 520 may be further configured to determine, based on the values of the elements in the coefficient vector X, in which subspace most of h is located, thereby determining the subspace in which h is most probably located.
  • ‖·‖_2 is the L2 norm.
  • the cases corresponding to the dimensions in which the solution x* is nonzero constitute the subspace in which h is located.
  • the first analysis unit 530 may be further configured to calculate the consistency between the semantic subspace of h and the subspace D_i by the following formula:
  • the optional first output unit 540 can be configured to output the maximum of C 1 , C 2 , . . . , C M , C ⁇ as the probability P_1 that the patient is in the corresponding subspace or belongs to the corresponding category.
  • The medical data analysis device may include a receiving unit 510, an access unit 550, a second analysis unit 560, and a determining unit 570.
  • The second analysis unit 560 can be configured to determine a final evidence score p_t for each node based on evidence transfer scores of the patient's signs and symptoms on the partial medical knowledge graph.
  • The determining unit 570 can be configured to analyze the probability P_2 that the patient is at a particular node or category (i.e., the node or category to which the patient belongs) based on the scores of the individual nodes in the final evidence score p_t.
  • The second analysis unit 560 can further include a computing unit.
  • The computing unit can be configured to perform an iterative operation with the following formula to determine the final evidence score of each node:
  • The iteration terminates when p_t no longer changes during the iterative operation or when the maximum number of iterations is reached.
  • The computing unit can be configured to determine the final evidence score of each node by the following formula:
  • d is the damping coefficient, 0 < d < 1;
  • I is the N×N identity matrix, and w_{i,j} denotes the weight of the edge e_{i,j} connecting the nodes v_i and v_j in V.
  • The determining unit 570 can be further configured to: determine the probability of being at each node by calculating the percentage that the final score of each disease node in V contributes to the sum of the final scores of all disease nodes in V; and output the largest of these probabilities as the probability P_2 of being at the corresponding node or category. Accordingly, that node or category is also determined to be the node or category to which the patient belongs.
  • The blending portion may include a blending unit 580 and a final result output unit 590.
  • The blending unit 580 may be configured to determine, based on the probabilities P_1 and P_2, the probability P that the patient is in a particular subspace, node, or category (i.e., the subspace, node, or category to which the patient belongs):
  • The final result output unit 590 can be configured to output the probability P as the probability that the patient is in the corresponding subspace, node, or category.
  • The data analysis device 500 shown in FIG. 5 can perform any of the steps of the method shown in FIG. 1. Because the device works on the same principles as the analysis method, those skilled in the art can obtain further details about the data analysis device 500 from the description of the method, and the details of how the device and its components carry out the method and its steps are not repeated here.
  • FIG. 6 illustrates an example computing device 600 that can be used to implement one or more embodiments.
  • a device in accordance with some embodiments may be implemented at the example computing device 600.
  • The example computing device 600 includes one or more processors 610 or processing units, one or more computer readable media 620 that may include one or more memories 622, one or more displays 640 for displaying content to a user, one or more input/output (I/O) interfaces 650 for input/output (I/O) devices, one or more communication interfaces 660 for communicating with other computing devices or communication devices, and a bus 630 that allows the different components and devices to communicate with one another.
  • Computer readable medium 620, display 640, and/or one or more I/O devices may be included as part of computing device 600, or alternatively may be coupled to computing device 600.
  • Bus 630 represents one or more of several types of bus structures including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus of any structure using a variety of bus architectures.
  • Bus 630 can include wired and/or wireless buses.
  • the one or more processors 610 are not subject to any limitations in the materials from which they are formed or the processing mechanisms employed therein.
  • the processor can be comprised of one or more semiconductors and/or transistors, such as an electronic integrated circuit (IC).
  • the processor-executable instructions can be electrically executable instructions.
  • Memory 622 represents memory/storage capacity associated with one or more computer readable media.
  • the memory 622 can include volatile media (such as random access memory (RAM)) and/or nonvolatile media (such as read only memory (ROM), flash memory, optical disks, magnetic disks, and the like).
  • the memory 622 can include fixed media (eg, RAM, ROM, fixed hard drive, etc.) as well as removable media (eg, flash memory drives, removable hard drives, optical disks, and the like).
  • One or more input/output interfaces 650 allow a user to enter commands and information into computing device 600, and also allow presentation of information to the user and/or presentation to other components or devices using different input/output devices.
  • Examples of input devices include a keyboard, a touch screen display, a cursor control device (such as a mouse), a microphone, a scanner, and the like.
  • Examples of output devices include display devices (such as monitors or projectors), speakers, printers, network cards, and the like.
  • Communication interface 660 allows for communication with other computing devices or communication devices.
  • Communication interface 660 has no limitations in the communication technology it employs.
  • Communication interface 660 may include a wired communication interface such as a local area network communication interface and a wide area network communication interface, and may also include a wireless communication interface such as an infrared, Wi-Fi or Bluetooth communication interface.
  • Computing device 600 can be configured to execute specific instructions and/or functions corresponding to software and/or hardware modules implemented on a computer readable medium.
  • The instructions and/or functions may be performed/operated by a manufactured product (e.g., one or more computing devices 600 and/or processors 610) to implement the techniques described herein.
  • Such techniques include, but are not limited to, the example processes described herein.
  • a computer readable medium can be configured to store or provide instructions for implementing the various techniques described above when accessed by one or more devices described herein.
  • the foregoing embodiment is only exemplified by the division of the foregoing functional modules.
  • the foregoing functions may be allocated to different functional modules as needed.
  • The internal structure of the device can be divided into different functional modules to complete all or part of the functions described above.
  • the function of one module described above may be completed by multiple modules, and the functions of the above multiple modules may also be integrated into one module.
  • The phrases “first analysis unit” and “second analysis unit” do not necessarily mean that the first analysis unit operates or performs processing before the second analysis unit in time. In fact, these phrases are only used to identify different analysis units.
  • any reference signs placed in parentheses shall not be construed as limiting the claim.
  • The word “comprising” or “including” does not exclude the presence of elements or steps other than those listed in a claim.
  • The word “a” or “an” preceding an element does not exclude the presence of a plurality of such elements. In device or system claims enumerating several means, one or more of these means can be embodied in the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Theoretical Computer Science (AREA)
  • Pathology (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Bioethics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Algebra (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

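A medical data analysis method and device. The method comprises: receiving sign parameters of a subject; locating, with the sign parameters as features, the subspace of a medical data set in which the subject lies; and analyzing the probability P_1 that the subject is in the subspace to which it belongs by analyzing the semantic consistency of the subspace in which the subject's sign parameters lie. In addition, the probability P_2 that the subject is at the node to which it belongs can be analyzed based on the evidence transfer scores of the subject's sign parameters on a medical knowledge graph. The probability P that the subject is in the subspace or node to which it belongs can further be determined from P_1 and P_2: P = α × P_1 + (1 − α) × P_2, where α is a blending parameter, 0 < α < 1. These schemes not only improve the accuracy of the analysis but also allow the patient's condition to be analyzed from symptoms at the first opportunity, so that the patient undergoes targeted examinations, reducing cost and improving efficiency.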

Description

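Data analysis method and device
Related application
This application claims priority to Chinese patent application No. 201710038539.X, filed on January 19, 2017, the entire contents of which are incorporated herein by reference.
Technical field
The present disclosure relates to a data analysis method and device, and more particularly to a data analysis method and device based on medical data and/or a medical knowledge graph.
Background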
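Recent research by the Advisory Board Company indicates that over the next decade the public's spending at medical institutions will rise by 5% per year. To survive and grow, medical institutions should therefore find ways to cut costs by 20%. One effective way to achieve this is to use big data analytics to detect major diseases as early as possible.
It is well known that if the warning signs of a major disease are detected early, treatment is much simpler than with later detection, costs less, and leads to better recovery. According to an EMC report, 22% of medical institutions use data analytics to improve the chance of detecting major diseases early. Some US medical institutions have already saved considerable costs and improved the quality of patient care through data analysis: Meriter Health Services of Madison, Wisconsin deployed a business intelligence solution that integrates data from analytics systems and electronic health records (EHR), giving administrators and clinicians a large amount of information with practical guiding value. This information gave Meriter's orthopedic surgeons accurate benchmark data and provided a basis for doctors to choose implants better suited to each patient. With its help the hospital could use its medical spending more efficiently: Meriter saved nearly 1 million US dollars within just 8 months of adopting data analysis.
At present, the common practice for predicting disease from medical data is to extract features from the data, such as test indicators like blood pressure and blood glucose values, train a prediction model (for example a logistic regression model) with a machine learning method, and then use the model for prediction. This practice has two shortcomings: 1) a model must be built for each disease, and when a patient visits, every model must be run for prediction, which is very inefficient; 2) because predicting different diseases usually requires different test indicators, patients must undergo some unnecessary tests, which raises medical costs. Moreover, such practice usually uses test indicators as features and cannot make full use of other useful information such as the patient's self-reported symptoms.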
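Summary
To solve or alleviate at least one of the above deficiencies of the prior art, it is desirable to provide a new data prediction and analysis method and device.
According to one aspect, a medical data analysis method is provided, comprising:
establishing, from sign parameters of a subject, a vector h that represents the sign parameters;
locating, with the sign parameters as features, the subspace of a medical data set in which the subject lies, wherein a matrix D represents the set of cases in the data set, D = [D_1, D_2, ..., D_M], D_i denotes the i-th subspace, 1 ≤ i ≤ M, and the locating comprises obtaining X from the formula h = DX, X being a coefficient vector that represents the distribution of the vector h over the subspaces; and
analyzing the probability P_1 that the subject is in the subspace to which it belongs by judging the semantic consistency of the subspace in which the vector h lies.
In one embodiment, locating, with the sign parameters as features, the subspace of the medical data set in which the subject lies further comprises:
solving the formula h = DX with a sparse solution method to determine the subspace in which the subject lies; and
determining, from the values of the elements of the coefficient vector X, which subspace contains the largest part of h, and thereby which subspace h has the highest probability of lying in.
In one embodiment, the sparse solution method comprises:
solving x* = arg min ||X||_1, where X satisfies ||DX − h||_2 ≤ ε, ||·||_1 is the L1 norm, and ||·||_2 is the L2 norm,
wherein the cases corresponding to the dimensions of the obtained solution x* whose values are nonzero constitute the subspace in which h lies.
In one embodiment, analyzing the semantic consistency of the subspace in which the sign parameter vector h lies further comprises:
calculating the consistency between the semantic subspace in which h lies and the subspace D_i by the following formula:
Figure PCTCN2017092186-appb-000001
where h = h_1 + h_2 + ... + h_M + η, h_i = Dδ_i(X), η is the error, and δ_i(X) denotes the vector in which the dimensions of the coefficient vector X that belong to subspace D_i are 1 and the remaining dimensions are 0.
In one embodiment, the above method further comprises:
outputting the maximum of C_1, C_2, ..., C_M, C_η as the probability P_1 that the subject is in the subspace to which it belongs.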
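According to another aspect, a medical data analysis method based on a medical knowledge graph is provided, comprising:
receiving sign parameters of a subject;
accessing a medical knowledge graph to obtain a partial medical knowledge graph related to the subject, the partial medical knowledge graph comprising a plurality of nodes V = {v_1, v_2, ..., v_N} and an initial evidence score for each node, p_0 = [p_{0,1}, p_{0,2}, ..., p_{0,N}], where
Figure PCTCN2017092186-appb-000002
each node v_i represents a sign parameter or a related category, and p_{0,i} denotes the initial evidence score of node v_i;
determining a final evidence score p_t for each node based on evidence transfer of the subject's sign parameters on the partial medical knowledge graph; and
obtaining, from the scores of the individual nodes in the final evidence score p_t, the probability P_2 that the subject is at the node to which it belongs.
In one embodiment, determining the final evidence score p_t of each node further comprises:
performing an iterative operation with the following formula to determine the final evidence score of each node: p_t = (1 − d) × p_0 + d × p_{t−1} × W,
where d is the damping coefficient, 0 < d < 1;
Figure PCTCN2017092186-appb-000003
w_{i,j} denotes the weight of the edge e_{i,j} connecting nodes v_i and v_j in V.
In one embodiment, the iterative operation terminates when p_t no longer changes during the iteration or when a predetermined maximum number of iterations is reached.
In one embodiment, determining the final evidence score p_t of each node further comprises:
determining the final evidence score of each node by the following formula:
p = (1 − d) × p_0 × (I − d × W)^{-1}
where d is the damping coefficient, 0 < d < 1; I is the N×N identity matrix,
Figure PCTCN2017092186-appb-000004
w_{i,j} denotes the weight of the edge e_{i,j} connecting nodes v_i and v_j in V.
In one embodiment, analyzing, from the scores of the individual nodes in the final evidence score p_t, the probability P_2 that the subject is at the node to which it belongs further comprises:
determining the probability of being at each node by calculating the percentage that the final score of each disease node in V contributes to the sum of the final scores of all disease nodes in V; and
outputting the largest of these probabilities as the probability P_2 that the subject is at the node to which it belongs.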
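According to another aspect, a medical data analysis method is provided, comprising:
receiving sign parameters of a subject;
locating, with the sign parameters as features, the subspace of a medical data set in which the subject lies;
analyzing the probability P_1 that the subject is in the subspace to which it belongs by analyzing the semantic consistency of the subspace in which the subject's sign parameters lie;
analyzing the probability P_2 that the subject is at the node to which it belongs based on the evidence transfer scores of the subject's sign parameters on a medical knowledge graph; and
determining, from the probabilities P_1 and P_2, the probability P that the subject is in the subspace or node to which it belongs:
P = α × P_1 + (1 − α) × P_2, where α is a blending parameter, 0 < α < 1.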
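According to another aspect, a medical data analysis device is provided, comprising:
a receiving unit configured to receive sign parameters of a subject and to establish from them a vector h that represents the sign parameters;
a subspace positioning unit configured to locate, with the sign parameters as features, the subspace of a medical data set in which the subject lies, wherein a matrix D represents the set of cases in the data set, D = [D_1, D_2, ..., D_M], D_i denotes the i-th subspace, 1 ≤ i ≤ M, and the locating comprises obtaining X from the formula h = DX, X being a coefficient vector that represents the distribution of the vector h over the subspaces; and
a first analysis unit configured to analyze the probability P_1 that the subject is in the subspace to which it belongs by judging the semantic consistency of the subspace in which the vector h lies.
In one embodiment, the subspace positioning unit is further configured to:
solve the formula h = DX with a sparse solution method to determine the subspace in which the subject lies; and
determine, from the values of the elements of the coefficient vector X, which subspace contains the largest part of h, and thereby which subspace h has the highest probability of lying in.
In one embodiment, the sparse solution method comprises:
solving x* = arg min ||X||_1, where X satisfies ||DX − h||_2 ≤ ε, ||·||_1 is the L1 norm, and ||·||_2 is the L2 norm,
wherein the cases corresponding to the dimensions of the obtained solution x* whose values are nonzero constitute the subspace in which h lies.
In one embodiment, the first analysis unit is further configured to:
calculate the consistency between the semantic subspace in which h lies and the subspace D_i by the following formula:
Figure PCTCN2017092186-appb-000005
where h = h_1 + h_2 + ... + h_M + η, h_i = Dδ_i(X), η is the error, and δ_i(X) denotes the vector in which the dimensions of the coefficient vector X that belong to subspace D_i are 1 and the remaining dimensions are 0.
In one embodiment, the above device further comprises:
a first output unit configured to output the maximum of C_1, C_2, ..., C_M, C_η as the probability P_1 that the subject is in the subspace to which it belongs.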
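According to another aspect, a medical data analysis device based on a medical knowledge graph is provided, comprising:
a receiving unit configured to receive sign parameters of a subject;
an access unit configured to access a medical knowledge graph to obtain a partial medical knowledge graph related to the subject, the partial medical knowledge graph comprising a plurality of nodes V = {v_1, v_2, ..., v_N} and an initial evidence score for each node, p_0 = [p_{0,1}, p_{0,2}, ..., p_{0,N}], where
Figure PCTCN2017092186-appb-000006
each node v_i represents a sign parameter or a related category, and p_{0,i} denotes the initial evidence score of node v_i;
a second analysis unit configured to determine a final evidence score p_t for each node based on the evidence transfer scores of the subject's sign parameters on the partial medical knowledge graph; and
a determining unit configured to obtain, from the scores of the individual nodes in the final evidence score p_t, the probability P_2 that the subject is at the node to which it belongs.
In one embodiment, the second analysis unit further comprises:
a computing unit configured to perform an iterative operation with the following formula to determine the final evidence score of each node:
p_t = (1 − d) × p_0 + d × p_{t−1} × W,
where d is the damping coefficient, 0 < d < 1;
Figure PCTCN2017092186-appb-000007
w_{i,j} denotes the weight of the edge e_{i,j} connecting nodes v_i and v_j in V.
In one embodiment, the iterative operation terminates when p_t no longer changes during the iteration or when a predetermined maximum number of iterations is reached.
In one embodiment, the second analysis unit further comprises:
a computing unit configured to determine the final evidence score of each node by the following formula:
p = (1 − d) × p_0 × (I − d × W)^{-1}
where d is the damping coefficient, 0 < d < 1; I is the N×N identity matrix,
Figure PCTCN2017092186-appb-000008
w_{i,j} denotes the weight of the edge e_{i,j} connecting nodes v_i and v_j in V.
In one embodiment, the determining unit is further configured to:
determine the probability of being at each category node by calculating the percentage that the final score of each category node in V contributes to the sum of the final scores of all category nodes in V; and
output the largest of these probabilities as the probability P_2 of being at the category node to which the subject belongs.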
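According to another aspect, a medical data analysis device is provided, comprising:
a receiving unit configured to receive sign parameters of a subject;
a subspace positioning unit configured to locate, with the sign parameters as features, the subspace of a medical data set in which the subject lies;
a first analysis unit configured to analyze the probability P_1 that the subject is in the subspace to which it belongs by analyzing the semantic consistency of the subspace in which the subject's sign parameters lie;
a second analysis unit configured to analyze the probability P_2 that the subject is at the node to which it belongs based on the evidence transfer scores of the subject's sign parameters on a medical knowledge graph; and
a blending unit configured to determine, from the probabilities P_1 and P_2, the probability P that the subject is in the subspace or node to which it belongs:
P = α × P_1 + (1 − α) × P_2, where α is a blending parameter, 0 < α < 1.
According to another aspect, a medical data analysis device is provided, comprising:
a memory configured to store computer-executable instructions; and
a processor coupled to the memory and configured to execute the computer-executable instructions so that the processor performs any of the methods described above.
According to another aspect, a computer-readable storage medium is provided, on which computer-readable instructions are stored that, when executed by a computing device, cause the computing device to perform any of the methods described above.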
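Brief description of the drawings
To explain the technical solutions of the embodiments of the present disclosure more clearly, the drawings needed in the description of the embodiments are briefly introduced below. It should be appreciated that the drawings described below relate only to some embodiments; a person of ordinary skill in the art can derive other drawings from them without creative effort, and those other drawings also fall within the scope of the invention.
FIG. 1 is a flowchart of a data analysis method based on medical big data and a medical knowledge graph according to one embodiment;
FIG. 2 is a schematic diagram of several example subspaces within a big data space;
FIG. 3 shows an example of graph-based evidence propagation in a medical knowledge graph;
FIG. 4 is a schematic diagram of a simplified example of a medical knowledge graph;
FIG. 5 is a schematic structural diagram of a data analysis device according to one embodiment;
FIG. 6 illustrates an example computing device that can be used to implement one or more embodiments.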
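Detailed description
The technical solutions of some embodiments of the present disclosure are described clearly and completely below with reference to the drawings, so that the objects, technical solutions, and advantages of these embodiments can be understood more clearly. The described embodiments are only some, not all, of the embodiments of the invention. Based on the embodiments in this disclosure, a person of ordinary skill in the art can obtain other embodiments, all of which fall within the scope of protection of the invention. Although medical big data and a medical knowledge graph are used as examples here, those skilled in the art will appreciate that the invention is equally applicable to analyzing other types of big data and knowledge graphs, so as to determine which subspace or type within the big data, or which node or category in the knowledge graph, an object belongs to. As noted above, the invention is not limited to the medical field, and therefore none of the methods provided in this disclosure is a method used directly for the diagnosis and treatment of disease.
FIG. 1 is a flowchart of a data analysis method based on medical big data and a medical knowledge graph according to one embodiment. The medical big data here is only an example; the invention is not limited to medical big data or case big data and can also be applied to other data sets that comprise multiple subspaces, including but not limited to other types of big data. As shown in FIG. 1, the embodiment performs disease analysis by combining the implicit knowledge in medical big data with the explicit knowledge in the medical knowledge graph. Specifically, the method has three main parts: the analysis based on medical big data (the part inside the middle dashed box in FIG. 1), the analysis based on the medical knowledge graph (the part inside the right dashed box in FIG. 1), and the output that combines the information from the first two parts (step S8 in FIG. 1: final result output). Although the embodiment of FIG. 1 contains all three parts, those skilled in the art will understand that the analysis based on medical big data and the analysis based on the medical knowledge graph can each be implemented separately as independent technical solutions.
The principles and concrete implementation of these three parts are described in detail below with reference to FIG. 1.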
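Analysis based on medical big data
This subsection describes the basic idea and concrete implementation of the analysis module based on medical big data. The basic idea is that patients with similar symptoms may have the same disease, and patients with the same disease are very likely to show similar characteristics. This matches the real world. For example, in 1998 Lloyd Minor, dean of the Stanford University School of Medicine, and his colleagues published the world's first report of a rare disease, superior semicircular canal dehiscence syndrome, whose patients experience symptoms such as vertigo and abnormal sensitivity to sound. It was an ordinary academic finding, yet many patients around the world who had gone years without finding a cause, or who had struggled through trial treatments in other departments, were finally diagnosed and treated after searching for the relevant symptom information.
According to some embodiments of the invention, the "analysis based on medical big data" part comprises four steps: S1, receiving the signs and symptoms of the patient or subject; S2, searching the medical big data for the subspace in which the patient or subject lies; S3, analyzing the semantic consistency of the subspace; and S4, outputting the result P_1 of the analysis based on medical big data, that is, the probability P_1 that the patient or subject is in a particular subspace (the subspace to which it belongs).
First, the patient's signs and symptoms are received in step S1. The main function of this step is to collect the symptoms the patient reports during the visit, such as dizziness and headache. The self-reported signs and symptoms here are only an example; the invention is not limited to them and can also be applied to signs and symptoms obtained by inspection, listening and smelling, inquiry, and palpation. The invention can also be applied to other sign parameters, including but not limited to those obtainable without a physical examination or through a very simple one.
The method then proceeds to step S2, in which the medical big data is searched for the subspace in which the patient lies. The main function of this step is to find, from the collected symptoms, the subspace of the medical big data in which the patient lies. Because this embodiment performs disease analysis based on medical big data, it needs a large number of cases, for example the confirmed cases of each hospital over the years, which correspond to the "case big data" module in FIG. 1.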
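This disclosure uses the symbol D = [D_1, D_2, ..., D_M] to denote the case set. Assuming it contains M diseases in total, D_i (1 ≤ i ≤ M) denotes the i-th disease, where D_i can be written as D_i = [D_{i,1}, D_{i,2}, ..., D_{i,K}], meaning that the i-th disease contains K cases and D_{i,j} (1 ≤ i ≤ M, 1 ≤ j ≤ K) denotes the j-th case of the i-th disease. Each case consists of a series of corresponding feature vectors (such as symptoms), so the matrix D forms a semantic space of confirmed cases and each D_i (1 ≤ i ≤ M) forms a subspace of that semantic space. Although medical big data is used as the example here, the invention is equally applicable to analyzing other types of big data to determine which subspace or category within the big data an object belongs to. In a general example, the big data may comprise multiple subspaces that each represent a category, each category may have corresponding instances, and each instance may have multiple features. With the embodiments provided in this disclosure, the features of an input object can then be used to locate or search for the subspace in which the object lies, on exactly the same principle as locating the subspace of a patient. In a still more general example, the invention is not limited to big data and can be applied to any data set that comprises multiple subspaces.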
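As mentioned above, each case consists of a series of corresponding feature vectors, so D_{i,j} (1 ≤ i ≤ M, 1 ≤ j ≤ K) can denote the j-th case of the i-th disease. The same holds for a new patient, who also consists of a series of corresponding feature vectors and can therefore be represented by a vector h. For a new patient h assumed to have disease D_i, the basic idea of this module (patients with the same disease are very likely to show similar symptoms) implies that h can be expressed as a linear combination of the cases in D_i:
h = α_{i,1} × D_{i,1} + α_{i,2} × D_{i,2} + ... + α_{i,K} × D_{i,K}
where α_{i,j} (1 ≤ j ≤ K) are the associated coefficients. For example, for the disease "hypertension", the symptoms in case 1 are "dizziness, nausea, palpitations and shortness of breath", those in case 2 are "palpitations and shortness of breath, tinnitus, limb numbness", those in case 3 are "dizziness, nausea, tinnitus, palpitations and shortness of breath", and those of the new patient are "palpitations and shortness of breath, limb numbness", so that "new patient = case 1 + case 2 − case 3".
As a more general explanation, collect all symptoms in the case set D into a set S with |S| symptoms, and represent each case by an |S|-dimensional column vector. For cases 1, 2, and 3 above, S = {dizziness, nausea, palpitations and shortness of breath, tinnitus, limb numbness}^T contains 5 symptoms. Dizziness is assigned to the first dimension of the vector, and so on, with limb numbness as the fifth, so different symptoms correspond to different dimensions. The dimension of a symptom the patient shows is set to 1 and that of a symptom the patient does not show is set to 0. Thus case 1 = [1,1,1,0,0]^T, case 2 = [0,0,1,1,1]^T, and case 3 = [1,1,1,1,0]^T, where the superscript T denotes transposition. The new patient is h = [0,0,1,0,1]^T = [1,1,1,0,0]^T + [0,0,1,1,1]^T − [1,1,1,1,0]^T; that is, "new patient = case 1 + case 2 − case 3". If "hypertension" is the i-th disease, then D_{i,1} = [1,1,1,0,0]^T, D_{i,2} = [0,0,1,1,1]^T, D_{i,3} = [1,1,1,1,0]^T, and h = D_{i,1} + D_{i,2} − D_{i,3}. The case vectors in this example are column vectors; those skilled in the art will understand that they could equally be row vectors, in which case the case set and the other vectors must be transposed accordingly.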
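The symptom encoding in this example can be sketched in a few lines of Python. This is only an illustration: the English symptom names and the helper `encode` are assumptions introduced here, not part of the disclosure.

```python
# Symptom vocabulary S from the example; list order fixes the vector dimensions.
SYMPTOMS = ["dizziness", "nausea", "palpitations and shortness of breath",
            "tinnitus", "limb numbness"]

def encode(symptoms):
    """Return a 0/1 vector with a 1 for every symptom the case exhibits."""
    present = set(symptoms)
    return [1 if s in present else 0 for s in SYMPTOMS]

case1 = encode(["dizziness", "nausea", "palpitations and shortness of breath"])
case2 = encode(["palpitations and shortness of breath", "tinnitus", "limb numbness"])
case3 = encode(["dizziness", "nausea", "tinnitus",
                "palpitations and shortness of breath"])
h = encode(["palpitations and shortness of breath", "limb numbness"])

# "new patient = case 1 + case 2 - case 3", computed element by element
combo = [a + b - c for a, b, c in zip(case1, case2, case3)]
```

Here `combo` equals `h`, which is the linear combination stated in the text.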
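For brevity and convenience, the expression above can be written in matrix form. Let D_i = [D_{i,1}, D_{i,2}, ..., D_{i,K}] and x_i = [α_{i,1}, α_{i,2}, ..., α_{i,K}]^T, where the superscript T denotes transposition; then
h = D_i x_i
From the discussion above, each disease can be represented as the subspace formed by its known cases, and any case of that disease can be formed as a linear combination of the basis of the corresponding subspace. FIG. 2 is a schematic diagram of several example subspaces within the big data space.
Based on this discussion, given the case set D, the disease of patient h can be determined by finding the subspace of D in which h lies. Let D = [D_1, D_2, ..., D_M]; the subspace in which patient h lies can then be obtained from equation (1),
h = DX         (1)
where D = [D_1, D_2, ..., D_M] = [D_{1,1}, D_{1,2}, ..., D_{i,1}, D_{i,2}, ..., D_{M,K}] is a matrix containing all diseases D_i (1 ≤ i ≤ M) in D; each element D_{i,j} (1 ≤ i ≤ M, 1 ≤ j ≤ K) corresponds to one case in D, namely the j-th case of the i-th disease, and each case is itself a column vector.
X is a column vector, X = [x_1^T, x_2^T, ..., x_i^T, ..., x_M^T]^T. Its dimension is not M but the number of cases in D, that is, the number of columns of [D_{1,1}, D_{1,2}, ..., D_{i,1}, D_{i,2}, ..., D_{M,K}]. For example, if D contains 2 diseases, with a cases in disease 1 and b cases in disease 2, then X has dimension a + b.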
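However, because the space of D is very large, there are usually many X that satisfy equation (1); that is, several subspaces can reconstruct h. To resolve this, the invention uses a sparse solution (reconstructing h from the fewest cases). The benefit of a sparse solution is that it reduces the influence of noisy data and makes the model robust. The solution is as follows:
$$x^{*} = \arg\min \lVert X \rVert_{1} \quad \text{subject to} \quad \lVert DX - h \rVert_{2} \le \varepsilon \tag{2}$$
where ||·||_1 is the L1 norm, which gives the sum of the absolute values of all elements, and ||·||_2 is the L2 norm, used here to give the sum of the squares of all elements. ε is a parameter given in advance. X = [α_{1,1}, α_{1,2}, ..., α_{i,K}, ...]^T holds the coefficients to be solved. The cases corresponding to the dimensions of the x* obtained from equation (2) whose values are nonzero constitute the subspace in which h lies. In other words, the goal is a column vector X whose elements have the smallest sum of absolute values while every element of the matrix product DX stays very close to the patient's symptoms. The specific values of α_{i,j} are solved from equation (2) by stochastic gradient descent.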
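A minimal sketch of a sparse solver follows. It is not the stochastic gradient descent procedure the text mentions: as a stand-in it uses the iterative soft-thresholding algorithm (ISTA) on the penalized form 0.5·||DX − h||² + λ·||X||_1, a common relaxation of the constrained problem in equation (2). The function names, the penalty weight `lam`, and the iteration count are assumptions chosen for illustration.

```python
def soft_threshold(v, t):
    """Shrink v toward zero by t (the proximal step for the L1 penalty)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def sparse_solve(D, h, lam=0.01, iters=5000):
    """Approximately minimise 0.5*||DX - h||^2 + lam*||X||_1 with ISTA.

    D is a list of rows (symptoms x cases); h is the patient vector.
    """
    n_rows, n_cols = len(D), len(D[0])
    # Gram matrix D^T D; its largest absolute row sum bounds its largest
    # eigenvalue, so the reciprocal is a safe step size.
    gram = [[sum(D[r][i] * D[r][j] for r in range(n_rows))
             for j in range(n_cols)] for i in range(n_cols)]
    step = 1.0 / max(sum(abs(g) for g in row) for row in gram)
    x = [0.0] * n_cols
    for _ in range(iters):
        residual = [sum(D[r][c] * x[c] for c in range(n_cols)) - h[r]
                    for r in range(n_rows)]
        grad = [sum(D[r][c] * residual[r] for r in range(n_rows))
                for c in range(n_cols)]
        x = [soft_threshold(x[c] - step * grad[c], step * lam)
             for c in range(n_cols)]
    return x

# Columns are the three hypertension cases of the running example.
D = [[1, 0, 1],
     [1, 0, 1],
     [1, 1, 1],
     [0, 1, 1],
     [0, 1, 0]]
h = [0, 0, 1, 0, 1]
x = sparse_solve(D, h)
# Sum of squared deviations between the reconstruction D*x and h
residual_sq = sum((sum(D[r][c] * x[c] for c in range(3)) - h[r]) ** 2
                  for r in range(5))
```

With a small `lam` the coefficients land close to [1, 1, −1], the combination given in the example, and the reconstruction error stays well below the ε = 0.2 used later in the text.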
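For the example above, the new patient is h = [0,0,1,0,1]^T = [1,1,1,0,0]^T + [0,0,1,1,1]^T − [1,1,1,1,0]^T. This is the ideal case. In practice, limited numerical precision may yield α_{i,j} that are not optimal. Suppose, for instance, that α_{1,1} = 0.8 in this example. Then 0.8 × [1,1,1,0,0]^T + [0,0,1,1,1]^T − [1,1,1,1,0]^T = [−0.2,−0.2,0.8,0,1]^T, so DX − h = [−0.2,−0.2,0.8,0,1]^T − [0,0,1,0,1]^T = [−0.2,−0.2,−0.2,0,0]^T and ||DX − h||_2 = 0.12 (the sum of squares, 3 × 0.04). With ε = 0.2, ||DX − h||_2 ≤ ε still holds. ε is introduced precisely to reduce the influence of noise, so the computation still works when the α_{i,j} are not equal. Moreover, solving equation (2) does not in principle require the α_{i,j} to be equal; they are merely weights of the vectors.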
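The tolerance check in this example can be reproduced as follows. The sketch follows the text's convention of measuring the deviation by the sum of squares; the helper names are illustrative assumptions.

```python
def matvec(D, x):
    """Multiply the matrix D (list of rows) by the vector x."""
    return [sum(d * xi for d, xi in zip(row, x)) for row in D]

def sum_of_squares(v):
    return sum(e * e for e in v)

# Columns are case 1, case 2, and case 3 of the hypertension example.
D = [[1, 0, 1],
     [1, 0, 1],
     [1, 1, 1],
     [0, 1, 1],
     [0, 1, 0]]
h = [0, 0, 1, 0, 1]

x_noisy = [0.8, 1.0, -1.0]            # alpha_{1,1} perturbed from 1 to 0.8
reconstruction = matvec(D, x_noisy)   # about [-0.2, -0.2, 0.8, 0, 1]
deviation = [r - t for r, t in zip(reconstruction, h)]
error = sum_of_squares(deviation)     # about 0.12
within_tolerance = error <= 0.2       # epsilon = 0.2 as in the text
```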
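The method then proceeds to step S3, the subspace semantic consistency analysis. The main purpose of this step is to analyze the probability that h is in a particular subspace (the subspace to which it belongs) by analyzing the semantic consistency of the subspace in which h lies. Let δ_i(X) denote the column vector in which the dimensions of the coefficient vector X that belong to subspace D_i are 1 and the remaining dimensions are 0; its dimension is the number of all cases in D, the same as the number of columns of [D_{1,1}, D_{1,2}, ..., D_{i,1}, D_{i,2}, ..., D_{M,K}]. The semantic component of h corresponding to subspace D_i is then h_i = Dδ_i(X), and h can be written as h = h_1 + h_2 + ... + h_M + η, where η is the error. On this basis, the invention defines the consistency between the semantic subspace in which h lies and the subspace D_i as:
Figure PCTCN2017092186-appb-000010
where ||·||_2^2 is the square of the L2 norm.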
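After the subspace semantic consistency analysis of step S3, the result of the analysis based on medical big data is output in step S4. That is, the maximum of C_1, C_2, ..., C_M, C_η is output as the probability P_1 that the patient is in the subspace corresponding to that maximum. The subspace or category corresponding to the maximum is also finally determined to be the subspace or category in which the patient lies.
The main purpose of step S4 is to output the result of the analysis based on medical big data. Let C = [C_1, C_2, ..., C_M, C_η]. By the definition in equation (3), C_1 + C_2 + ... + C_M + C_η = 1. C_i reflects how likely h is to be in subspace D_i, which corresponds to the probability P_1 in FIG. 1, while C_η reflects the likelihood that h is in none of the subspaces D_1 to D_M. The larger C_i is, the more of the cases that make up h belong to subspace D_i, that is, the larger the part of h lying in D_i and the more likely h belongs to D_i. In FIG. 2, for example, suppose the known case space has three subspaces, drawn as different shapes: four-pointed star nodes, triangle nodes, and six-pointed star nodes, which correspond to the first three dimensions of C; the last dimension is the error C_η. The round node represents the new patient. The two circles in FIG. 2 show two linear combinations that represent the new patient: the first uses only the subspace of the four-pointed star nodes, and the second needs all three subspaces. For the left diagram C = [1,0,0,0], so the patient is probably in the subspace represented by the four-pointed stars. For the right diagram C = [0.25,0.375,0.375,0], and it is hard to tell which subspace or category the patient belongs to.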
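The consistency scores can be sketched as below, under two loudly stated assumptions, because the defining formula survives here only as an image placeholder. First, δ_i(X) is taken to keep the coefficients of X that belong to subspace D_i and zero the rest, as is usual in sparse-representation classification. Second, C_i is taken to be the squared L2 norm of h_i divided by the total squared norm of all components including η, which makes the scores sum to 1 as the text requires. The function name and toy data are illustrative only.

```python
def consistency(D, X, h, groups):
    """Return [C_1, ..., C_M, C_eta] under the assumptions stated above.

    groups[i] lists the column indices of D that belong to subspace D_i.
    """
    n_rows = len(D)

    def sq_norm(v):
        return sum(e * e for e in v)

    # h_i = D * delta_i(X): only the coefficients of subspace i contribute
    parts = [[sum(D[r][c] * X[c] for c in cols) for r in range(n_rows)]
             for cols in groups]
    rebuilt = [sum(p[r] for p in parts) for r in range(n_rows)]
    eta = [h[r] - rebuilt[r] for r in range(n_rows)]   # reconstruction error
    energies = [sq_norm(p) for p in parts] + [sq_norm(eta)]
    total = sum(energies)
    return [e / total for e in energies]

# Toy data: three symptoms, two cases in subspace 1 and one case in subspace 2.
D = [[1, 0, 0],
     [0, 1, 0],
     [0, 0, 1]]
groups = [[0, 1], [2]]
h = [1, 0, 1]
X = [1, 0, 1]
C = consistency(D, X, h, groups)
P1 = max(C)                    # step S4: report the largest score
best_subspace = C.index(P1)    # index of the winning subspace (ties: first)
```

In this toy case the patient vector is split evenly between the two subspaces, so C = [0.5, 0.5, 0.0], an ambiguous result of the kind shown on the right of FIG. 2.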
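Some embodiments of this disclosure thus analyze a patient's illness using medical big data. Using the patient's signs and symptoms as features, the scheme finds the subspace of the medical big data in which the patient lies and analyzes the patient's illness through the semantic consistency of that subspace. Compared with the prior art, it does not need a separate model for each disease, so prediction is efficient. Because it works from the patient's physiological phenomena (symptoms), the patient's condition can be analyzed from symptoms at the first opportunity and the patient can undergo targeted examinations, reducing cost and improving efficiency.
Analysis based on the medical knowledge graph
Next comes the second part of the flowchart in FIG. 1, the analysis based on the medical knowledge graph. This subsection describes the basic idea and concrete implementation of this analysis module. FIG. 3 shows an example of graph-based evidence propagation in a medical knowledge graph. In the graph, nodes represent diseases or symptoms (in a more general example, categories and features), and the edges between nodes reflect the degree of semantic relatedness between them. The basic idea of the analysis is that a symptom (or disease) propagates a larger evidence score to diseases (or symptoms) with high semantic relatedness than to those with low relatedness. As shown in FIG. 3, for example, if "migraine without aura" has a weight score of 1, it propagates an evidence score of 0.7 to the symptom "nausea" and 0.3 to the symptom "photophobia". Intuitively, a patient with migraine without aura has a 70% probability of nausea and a 30% probability of photophobia; that is, nausea is more common than photophobia among such patients. A "rabies" patient, in turn, has an 80% probability of photophobia.
The invention is equally applicable to other knowledge graphs with a structure similar to that of the medical knowledge graph, to analyze and determine the node or category to which an object belongs, on exactly the same principle as the scheme based on the medical knowledge graph.
The graph-based evidence propagation method is described in detail below with reference to the flowchart of FIG. 1.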
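In step S5, the method performs graph-based evidence propagation; that is, it analyzes the probability P_2 of the patient being at each disease node based on the evidence transfer scores on the medical knowledge graph. This step analyzes the patient's condition from the patient's symptoms combined with the explicit medical knowledge in the graph. Specifically, initial evidence scores are assigned to the symptoms, and then, following the basic idea that a symptom (or disease) propagates a larger evidence score to diseases (or symptoms) with high semantic relatedness than to those with low relatedness, evidence scores are propagated through the medical knowledge graph until the evidence scores of all nodes no longer change or change only slightly.
The values of the initial evidence scores do not matter, because by the Markov chain convergence theorem the final scores are independent of the initial values. In engineering practice, however, "good" initial values help convergence: "bad" initial values might need 10,000 iterations to reach the final result where "good" ones need only 1,000, although the final result is the same. "Good" initial values are usually set from prior human knowledge; for example, the initial evidence score of a more prominent symptom can be given a larger weight. Without any prior knowledge of which symptom is more prominent, Occam's razor suggests giving every evidence score the same weight, which is the maximum-entropy case.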
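In one embodiment, let V = {v_1, v_2, ..., v_N} denote the set of vertices or nodes in the medical knowledge graph and E = {..., e_{i,j}, ...} the set of edges between nodes, where e_{i,j} is the edge between nodes v_i and v_j, 1 ≤ i ≤ N, 1 ≤ j ≤ N, i ≠ j. W is the set of edge weights, where w_{i,j} is the weight on edge e_{i,j}. Let the initial evidence scores of the nodes be p_0 = [p_{0,1}, p_{0,2}, ..., p_{0,N}], where
Figure PCTCN2017092186-appb-000012
p_{0,i} denotes the initial evidence score of node v_i. For FIG. 3, for example, if the patient shows the symptoms "nausea" and "photophobia", the initial evidence scores can be set to p_0["nausea", "photophobia", "migraine without aura", "rabies"] = [0.5, 0.5, 0, 0]; that is, without any prior knowledge, the distribution that satisfies the condition and has maximum entropy is used as the initial evidence score.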
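The uniform initialization used in this example can be sketched as follows. The helper name is hypothetical, and the assumption that the scores total 1 is read off the [0.5, 0.5, 0, 0] example rather than from the formula image.

```python
def initial_scores(nodes, observed):
    """Spread a total score of 1 evenly over the observed symptom nodes.

    With no prior knowledge this is the maximum-entropy choice; all other
    nodes start at 0.
    """
    share = 1.0 / len(observed)
    return [share if node in observed else 0.0 for node in nodes]

nodes = ["nausea", "photophobia", "migraine without aura", "rabies"]
p0 = initial_scores(nodes, {"nausea", "photophobia"})
```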
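Let p_t = [p_{t,1}, p_{t,2}, ..., p_{t,N}] be the evidence scores of the nodes after t iterations of evidence propagation. Evidence propagation here means the score that one vertex passes to another. For example, if a vertex has a score of 1 and the edge between it and another vertex has weight 0.6, the evidence score it passes to the other vertex is 1 × 0.6 = 0.6. Following the basic idea that a symptom (or disease) propagates a larger evidence score to diseases (or symptoms) with high semantic relatedness than to those with low relatedness, the evidence score of each node after iteration t+1 is:
Figure PCTCN2017092186-appb-000013
where d is the damping coefficient (0 < d < 1) and M(i) denotes the set of nodes connected to node v_i. For the example of FIG. 3, with initial evidence scores p_0["nausea", "photophobia", "migraine without aura", "rabies"] = [0.5, 0.5, 0, 0] and the edge weights shown in FIG. 3, after one iteration the evidence score of "nausea" is (1 − d) × 0.5 + 0 × 0.7 = 0.1 (taking d = 0.8), that of "photophobia" is 0.2 × 0.5 + 0 × 0.8 = 0.1, and that of "migraine without aura" is
Figure PCTCN2017092186-appb-000014
Similarly, the score of "rabies" is
Figure PCTCN2017092186-appb-000015
Equation (4) is then iterated until the termination condition is met, that is, until p_t no longer changes or a predetermined maximum number of iterations is reached. The maximum number of iterations is an engineering safeguard against excessively long computation. Although the Markov chain convergence theorem guarantees that the propagation of evidence scores eventually converges, theory gives no guarantee on how many iterations (or how much time) convergence takes. To avoid very long runs, a maximum number of iterations, for example 1 million, is usually set; even if the process has not converged by then, the result is already very close to the converged one, and spending much more time to reach exact convergence is unnecessary. The maximum number of iterations therefore reflects a trade-off between time efficiency and accuracy.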
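The iteration with its two stopping conditions can be sketched as below, using the matrix form p_{t+1} = (1 − d) × p_0 + d × p_t × W with p treated as a row vector. The weight matrix is a hypothetical stand-in: it reuses the three weights quoted from FIG. 3 symmetrically, since the exact definition of W is given only in a formula image. The function name, tolerance, and iteration cap are likewise assumptions.

```python
def propagate(p0, W, d=0.8, tol=1e-9, max_iter=1000):
    """Iterate p_{t+1} = (1-d)*p0 + d*p_t*W until the scores stop changing
    (largest change below tol) or max_iter iterations have been run."""
    n = len(p0)
    p = list(p0)
    for _ in range(max_iter):
        nxt = [(1 - d) * p0[i] + d * sum(p[j] * W[j][i] for j in range(n))
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p

# Node order: nausea, photophobia, migraine without aura, rabies.
# Hypothetical symmetric weights built from the values quoted for FIG. 3.
W = [[0.0, 0.0, 0.7, 0.0],
     [0.0, 0.0, 0.3, 0.8],
     [0.7, 0.3, 0.0, 0.0],
     [0.0, 0.8, 0.0, 0.0]]
p0 = [0.5, 0.5, 0.0, 0.0]

p1 = propagate(p0, W, max_iter=1)   # scores after a single iteration
p_final = propagate(p0, W)          # scores after convergence
```

After one iteration the two symptom nodes each score about 0.1, matching the worked values in the text; the scores of the disease nodes depend on the assumed W and are not claimed to match the figures.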
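For brevity and efficiency, equation (4) can be rewritten in matrix form:
p_{t+1} = (1 − d) × p_0 + d × p_t × W         (5)
where
Figure PCTCN2017092186-appb-000016
Figure PCTCN2017092186-appb-000017
According to Markov theory, the limit $p = \lim_{t \to \infty} p_t$ must exist. Taking the limit of both sides of equation (5) therefore gives
$$p = (1-d)\, p_0\, (I - dW)^{-1} \tag{6}$$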
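where I is the N×N identity matrix and the value p_i is the final evidence score of node v_i. Equation (6) thus gives the final evidence score of every node directly. For example, the final evidence scores of the nodes in FIG. 3 are obtained directly as p["nausea", "photophobia", "migraine without aura", "rabies"] = [0.25, 0.30, 0.27, 0.18]. The higher the final evidence score of a disease node, the more likely the patient is at that node, so the scores of the disease nodes can be normalized to give the probability that the patient is at each node, which corresponds to P_2 in FIG. 1. For the example of FIG. 3, the probability that the patient is at the node "migraine without aura" is 0.27 / (0.27 + 0.18) = 0.6, and the probability of being at the node "rabies" is 0.18 / (0.27 + 0.18) = 0.4.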
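The closed form p = (1 − d) × p_0 × (I − d × W)^{-1} can be evaluated without forming the inverse by solving the equivalent linear system p × (I − d × W) = (1 − d) × p_0. The sketch below does this with Gauss-Jordan elimination in plain Python. The weight matrix is again a hypothetical symmetric stand-in built from the weights quoted for FIG. 3, so the result is not expected to reproduce the scores listed in the text.

```python
def closed_form(p0, W, d=0.8):
    """Solve p*(I - d*W) = (1-d)*p0 for the row vector p."""
    n = len(p0)
    # Transposed system (I - d*W)^T p^T = (1-d) p0^T, stored as an
    # augmented matrix with the right-hand side in the last column.
    A = [[(1.0 if i == j else 0.0) - d * W[j][i] for j in range(n)]
         + [(1 - d) * p0[i]] for i in range(n)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]

# Node order: nausea, photophobia, migraine without aura, rabies.
W = [[0.0, 0.0, 0.7, 0.0],
     [0.0, 0.0, 0.3, 0.8],
     [0.7, 0.3, 0.0, 0.0],
     [0.0, 0.8, 0.0, 0.0]]
p0 = [0.5, 0.5, 0.0, 0.0]
p = closed_form(p0, W)
```

The result is a fixed point of the iteration p = (1 − d) × p_0 + d × p × W, which is what taking the limit of the iterative formula requires.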
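Some embodiments of this disclosure thus analyze disease using a medical knowledge graph: evidence is propagated from the patient's symptoms through the graph in order to analyze the patient's illness. The knowledge graph is built from medical knowledge. These embodiments use the explicit knowledge or information in the knowledge graph for disease analysis, so the patient's condition can be analyzed from symptoms at the first opportunity and the patient can undergo targeted examinations, reducing cost and improving efficiency.
FIG. 4 is a schematic diagram of a simplified example of a medical knowledge graph. The large circle in the center represents a category in the knowledge graph, in this example a disease. The nodes directly connected to it represent the relations between the category and its features, which in this example may be cause, symptom, and treatment. The outermost circles represent the corresponding features, in this example symptoms, causes, and treatments. For simplicity, the edge weights are omitted from FIG. 4.
The method then proceeds to step S7 and outputs the probability P_2, obtained from the analysis based on the medical knowledge graph, that the patient is at a given disease node or belongs to a given category.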
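The normalization that turns final evidence scores into P_2 can be sketched as follows, using the final scores quoted in the text for FIG. 3. The function name and the dictionary layout are illustrative assumptions.

```python
def disease_probabilities(scores, disease_nodes):
    """Share of each disease node's final score in the total over all
    disease nodes."""
    total = sum(scores[node] for node in disease_nodes)
    return {node: scores[node] / total for node in disease_nodes}

# Final evidence scores quoted in the text for FIG. 3
final_scores = {"nausea": 0.25, "photophobia": 0.30,
                "migraine without aura": 0.27, "rabies": 0.18}
probs = disease_probabilities(final_scores, ["migraine without aura", "rabies"])
best_node = max(probs, key=probs.get)   # node with the largest probability
P2 = probs[best_node]                   # reported as P_2 in step S7
```

With these scores, "migraine without aura" receives a probability of about 0.6 and "rabies" about 0.4.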
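Finally, the method proceeds to step S8 and outputs the final result: the probability that the patient belongs to a given category, determined from the P_1 obtained in step S4 and the P_2 obtained in step S7. This step combines the analysis based on medical big data with the analysis based on the medical knowledge graph to give the final result, using a linear weighting of the two scores. Specifically, the probability that the patient belongs to a given category is calculated as:
P = α × P_1 + (1 − α) × P_2
where α is a blending parameter, 0 < α < 1, that adjusts the relative weight of the two analyses. When the medical big data is plentiful and of high quality, so that the analysis based on it is accurate, α can be raised (for example α = 0.7). Conversely, when the medical big data is insufficient or of low quality and the analysis based on it is not accurate, the medical knowledge in the knowledge graph should be relied on more, and α can be lowered (for example α = 0.2).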
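The blending step is a one-line weighted average; a minimal sketch with an illustrative function name and example inputs follows. The range check mirrors the stated condition 0 < α < 1.

```python
def blend(p1, p2, alpha):
    """Return P = alpha*P1 + (1 - alpha)*P2 for 0 < alpha < 1."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return alpha * p1 + (1.0 - alpha) * p2

# Trust the big data analysis more (alpha = 0.7) or less (alpha = 0.2)
p_data_heavy = blend(0.5, 0.6, alpha=0.7)
p_graph_heavy = blend(0.5, 0.6, alpha=0.2)
```

The inputs 0.5 and 0.6 are arbitrary example values for P_1 and P_2, not results from the disclosure.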
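Some embodiments of this disclosure thus analyze disease by combining medical big data with a medical knowledge graph. Using the patient's signs and symptoms as features, the embodiment can, on the one hand, find the subspace of the medical big data in which the patient lies and analyze the patient's illness through the semantic consistency of that subspace. On the other hand, it can propagate evidence from the patient's symptoms through the medical knowledge graph to analyze the patient's illness. Finally, it can combine both kinds of information and output the final conclusion or result. Compared with earlier practice, the embodiment uses both the patterns, knowledge, or information implicit in large-scale medical big data and the explicit knowledge or information in the knowledge graph, which improves the accuracy of the analysis. Because it works from the patient's physiological phenomena (symptoms), the patient's condition can be analyzed from symptoms at the first opportunity and the patient can undergo targeted examinations, reducing cost and improving efficiency.
FIG. 5 is a schematic structural diagram of a data analysis device according to one embodiment. In this example the device performs its analysis based on medical big data and a medical knowledge graph. Like FIG. 1, the data analysis device 500 of FIG. 5 comprises three parts: a big data analysis device (left dashed box), a medical data analysis device (the contents of the upper right dashed box plus the receiving unit 510), and a blending part that blends the outputs of the first two (lower right dashed box). Those skilled in the art will understand that the big data analysis device and the medical data analysis device can each be implemented as an independent device. The big data analysis device can also be used to analyze other big data to determine the subspace or category of the big data space in which an object lies, and the medical data analysis device can also analyze other types of knowledge graph with a structure similar to that of the medical knowledge graph to determine the category node or category of the knowledge graph at which an object lies.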
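As shown in FIG. 5, the big data analysis device may include a receiving unit 510, a subspace positioning unit 520, a first analysis unit 530, and an optional first output unit 540. The receiving unit 510 can be configured to receive the patient's signs and symptoms and to represent them by a vector h. The subspace positioning unit 520 can be configured to locate, with the signs and symptoms as features, the subspace of the medical big data in which the patient lies. A matrix D represents the set of cases in the big data, D = [D_1, D_2, ..., D_M], where D_i denotes the i-th subspace, 1 ≤ i ≤ M. The locating may include solving the formula h = DX, where X is a coefficient vector that represents the distribution of the vector h over the subspaces. The first analysis unit 530 can be configured to analyze the probability P_1 that the patient is in a particular subspace or category (the subspace or category to which the patient belongs) by analyzing the semantic consistency of the subspace in which the sign and symptom vector h lies.
In one embodiment, the subspace positioning unit 520 can be further configured to determine the subspace in which the patient lies by solving the formula h = DX with a sparse solution method. The subspace positioning unit 520 can be further configured to determine, from the values of the elements of the coefficient vector X, which subspace contains the largest part of h, and thereby which subspace h has the highest probability of lying in.
In one embodiment, the sparse solution method comprises solving x* = arg min ||X||_1, where X satisfies ||DX − h||_2 ≤ ε, ||·||_1 is the L1 norm, and ||·||_2 is the L2 norm. The cases corresponding to the dimensions of the obtained solution x* whose values are nonzero constitute the subspace in which h lies.
In one embodiment, the first analysis unit 530 can be further configured to calculate the consistency between the semantic subspace in which h lies and the subspace D_i by the following formula:
Figure PCTCN2017092186-appb-000022
where h = h_1 + h_2 + ... + h_M + η, h_i = Dδ_i(X), η is the error, and δ_i(X) denotes the column vector in which the dimensions of the coefficient vector X that belong to subspace D_i are 1 and the remaining dimensions are 0.
In one embodiment, the optional first output unit 540 can be configured to output the maximum of C_1, C_2, ..., C_M, C_η as the probability P_1 that the patient is in the corresponding subspace or belongs to the corresponding category.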
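The medical data analysis device may include a receiving unit 510, an access unit 550, a second analysis unit 560, and a determining unit 570. The access unit 550 can be configured to access a medical knowledge graph to obtain a partial medical knowledge graph related to the patient, the partial medical knowledge graph comprising a plurality of nodes V = {v_1, v_2, ..., v_N} and an initial evidence score for each node, p_0 = [p_{0,1}, p_{0,2}, ..., p_{0,N}], where
Figure PCTCN2017092186-appb-000023
each node v_i represents a symptom of the patient or a related disease (in a more general example, a feature or a category), and p_{0,i} denotes the initial evidence score of node v_i. The second analysis unit 560 can be configured to determine a final evidence score p_t for each node based on the evidence transfer scores of the patient's signs and symptoms on the partial medical knowledge graph. The determining unit 570 can be configured to analyze, from the scores of the individual nodes in the final evidence score p_t, the probability P_2 that the patient is at a particular node or category (the node or category to which the patient belongs). In one embodiment, the second analysis unit 560 can further include a computing unit. The computing unit can be configured to perform an iterative operation with the following formula to determine the final evidence score of each node:
p_t = (1 − d) × p_0 + d × p_{t−1} × W,
where d is the damping coefficient, 0 < d < 1;
Figure PCTCN2017092186-appb-000024
w_{i,j} denotes the weight of the edge e_{i,j} connecting nodes v_i and v_j in V. In one embodiment, the iterative operation terminates when p_t no longer changes during the iteration or when the maximum number of iterations is reached.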
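In another embodiment, the computing unit can be configured to determine the final evidence score of each node by the following formula:
p = (1 − d) × p_0 × (I − d × W)^{-1}
where d is the damping coefficient, 0 < d < 1; I is the N×N identity matrix,
Figure PCTCN2017092186-appb-000025
w_{i,j} denotes the weight of the edge e_{i,j} connecting nodes v_i and v_j in V.
In one embodiment, the determining unit 570 can be further configured to: determine the probability of being at each node by calculating the percentage that the final score of each disease node in V contributes to the sum of the final scores of all disease nodes in V; and output the largest of these probabilities as the probability P_2 of being at the corresponding node or category. Accordingly, that node or category is also determined to be the node or category to which the patient belongs.
The blending part may include a blending unit 580 and a final result output unit 590. The blending unit 580 can be configured to determine, from the probabilities P_1 and P_2, the probability P that the patient is in a particular subspace, node, or category (the subspace, node, or category to which the patient belongs):
P = α × P_1 + (1 − α) × P_2
where α is a blending parameter, 0 < α < 1. The final result output unit 590 can be configured to output the probability P as the probability that the patient is in the corresponding subspace, node, or category.
The data analysis device 500 shown in FIG. 5 can perform any of the steps of the method shown in FIG. 1. Because the device works on the same principles as the analysis method, those skilled in the art can obtain further details about the data analysis device 500 from the description of the method, and the details of how the device and its components carry out the method and its steps are not repeated here.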
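FIG. 6 illustrates an example computing device 600 that can be used to implement one or more embodiments. In particular, a device according to some embodiments can be implemented on the example computing device 600. As shown, the example computing device 600 includes one or more processors 610 or processing units, one or more computer readable media 620 that may include one or more memories 622, one or more displays 640 for displaying content to a user, one or more input/output (I/O) interfaces 650 for input/output (I/O) devices, one or more communication interfaces 660 for communicating with other computing devices or communication devices, and a bus 630 that allows the different components and devices to communicate with one another.
The computer readable media 620, the display 640, and/or one or more I/O devices can be included as part of the computing device 600 or, alternatively, can be coupled to it. The bus 630 represents one or more of several types of bus structure, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus of any structure using any of a variety of bus architectures. The bus 630 can include wired and/or wireless buses.
The one or more processors 610 are not limited by the materials from which they are formed or the processing mechanisms they employ. For example, a processor can consist of one or more semiconductors and/or transistors (such as electronic integrated circuits (ICs)). In that context, processor-executable instructions can be electronically executable instructions. The memory 622 represents the memory/storage capacity associated with one or more computer readable media. The memory 622 can include volatile media (such as random access memory (RAM)) and/or nonvolatile media (such as read only memory (ROM), flash memory, optical disks, magnetic disks, and the like). The memory 622 can include fixed media (for example RAM, ROM, fixed hard drives, and the like) as well as removable media (for example flash memory drives, removable hard drives, optical disks, and the like).
The one or more input/output interfaces 650 allow a user to enter commands and information into the computing device 600 and also allow information to be presented to the user and/or to other components or devices using different input/output devices. Examples of input devices include a keyboard, a touch screen display, a cursor control device (such as a mouse), a microphone, a scanner, and the like. Examples of output devices include display devices (such as monitors or projectors), speakers, printers, network cards, and the like.
The communication interface 660 allows communication with other computing devices or communication devices. The communication interface 660 is not limited in the communication technology it employs. It may include wired communication interfaces such as local area network and wide area network communication interfaces, and may also include wireless communication interfaces such as infrared, Wi-Fi, or Bluetooth communication interfaces.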
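The various techniques herein are described in the general context of software, hardware (fixed logic circuitry), or program modules. Generally, program modules include routines, programs, objects, elements, components, data structures, and so on that perform particular tasks or implement particular abstract data types. Implementations of these modules and techniques may be stored on or transmitted via some form of computer readable medium. Computer readable media can include a variety of available media that can be accessed by a computing device.
The particular modules, functions, components, and techniques described herein can be implemented in software, hardware, firmware, and/or combinations thereof. The computing device 600 can be configured to execute particular instructions and/or functions corresponding to software and/or hardware modules implemented on a computer readable medium. The instructions and/or functions can be performed/operated by a manufactured product (for example, one or more computing devices 600 and/or processors 610) to implement the techniques described herein. Such techniques include, but are not limited to, the example processes described herein. A computer readable medium can thus be configured to store or provide instructions that implement the various techniques described above when accessed by one or more of the devices described herein.
Although some embodiments of the invention have been described in detail above with reference to the drawings, a person of ordinary skill in the art will understand that the detailed description only serves to explain the invention, which is by no means limited to the specific embodiments described. Based on the detailed description and teaching of these embodiments herein, a person of ordinary skill in the art can make various modifications, additions, substitutions, and variations to them without departing from the scope of protection of the invention; that is, all such modifications, additions, substitutions, and variations fall within the scope of protection of the invention. The scope of protection of the invention is defined by the scope of the claims. The specific features and acts described above are disclosed as example forms of implementing the claims.
It should be noted that the above embodiments are illustrated only with the division into functional modules described above. In practice, the functions can be assigned to different functional modules as needed; the internal structure of the device can be divided into different functional modules to perform all or part of the functions described above. In addition, the function of one module described above can be performed by several modules, and the functions of several modules can be integrated into one module.
This application uses terms such as "first", "second", and "third". Absent additional context, such terms are not intended to imply an order; they are used only for identification. For example, the phrases "first analysis unit" and "second analysis unit" do not necessarily mean that the first analysis unit operates or performs processing before the second analysis unit in time. These phrases are only used to identify different analysis units.
In the claims, any reference signs placed in parentheses shall not be construed as limiting the claim. The term "comprising" or "including" does not exclude the presence of elements or steps other than those listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. In device or system claims enumerating several means, one or more of these means can be embodied in the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.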

Claims (16)

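  1. A medical data analysis method, comprising:
    establishing, from sign parameters of a subject, a vector h that represents the sign parameters;
    locating, with the sign parameters as features, the subspace of a medical data set in which the subject lies, wherein a matrix D represents the set of cases in the data set, D = [D_1, D_2, ..., D_M], D_i denotes the i-th subspace, 1 ≤ i ≤ M, and the locating comprises obtaining X from the formula h = DX, X being a coefficient vector that represents the distribution of the vector h over the subspaces; and
    analyzing the probability P_1 that the subject is in the subspace to which it belongs by judging the semantic consistency of the subspace in which the vector h lies.
  2. The method according to claim 1, wherein locating, with the sign parameters as features, the subspace of the medical data set in which the subject lies further comprises:
    solving the formula h = DX with a sparse solution method to determine the subspace in which the subject lies; and
    determining, from the values of the elements of the coefficient vector X, which subspace contains the largest part of h, and thereby which subspace h has the highest probability of lying in.
  3. The method according to claim 2, wherein the sparse solution method comprises:
    solving x* = arg min ||X||_1, where X satisfies ||DX − h||_2 ≤ ε, ||·||_1 is the L1 norm, and ||·||_2 is the L2 norm,
    wherein the cases corresponding to the dimensions of the obtained solution x* whose values are nonzero constitute the subspace in which h lies.
  4. The method according to claim 1, 2 or 3, wherein analyzing the semantic consistency of the subspace in which the sign parameter vector h lies further comprises:
    calculating the consistency between the semantic subspace in which h lies and the subspace D_i by the following formula:
    Figure PCTCN2017092186-appb-100001
    where h = h_1 + h_2 + ... + h_M + η, h_i = Dδ_i(X), η is the error, and δ_i(X) denotes the vector in which the dimensions of the coefficient vector X that belong to subspace D_i are 1 and the remaining dimensions are 0.
  5. The method according to claim 4, further comprising:
    outputting the maximum of C_1, C_2, ..., C_M, C_η as the probability P_1 that the subject is in the subspace to which it belongs.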
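  6. A medical data analysis method based on a medical knowledge graph, comprising:
    receiving sign parameters of a subject;
    accessing a medical knowledge graph to obtain a partial medical knowledge graph related to the subject, the partial medical knowledge graph comprising a plurality of nodes V = {v_1, v_2, ..., v_N} and an initial evidence score for each node, p_0 = [p_{0,1}, p_{0,2}, ..., p_{0,N}], where
    Figure PCTCN2017092186-appb-100002
    each node v_i represents a sign parameter or a related category, and p_{0,i} denotes the initial evidence score of node v_i;
    determining a final evidence score p_t for each node based on evidence transfer of the subject's sign parameters on the partial medical knowledge graph; and
    obtaining, from the scores of the individual nodes in the final evidence score p_t, the probability P_2 that the subject is at the node to which it belongs.
  7. The method according to claim 6, wherein determining the final evidence score p_t of each node further comprises:
    performing an iterative operation with the following formula to determine the final evidence score of each node:
    p_t = (1 − d) × p_0 + d × p_{t−1} × W,
    where d is the damping coefficient, 0 < d < 1;
    Figure PCTCN2017092186-appb-100003
    w_{i,j} denotes the weight of the edge e_{i,j} connecting nodes v_i and v_j in V.
  8. The method according to claim 7, wherein the iterative operation terminates when p_t no longer changes during the iteration or when a predetermined maximum number of iterations is reached.
  9. The method according to claim 6, wherein determining the final evidence score p_t of each node further comprises:
    determining the final evidence score of each node by the following formula:
    p = (1 − d) × p_0 × (I − d × W)^{-1}
    where d is the damping coefficient, 0 < d < 1; I is the N×N identity matrix,
    Figure PCTCN2017092186-appb-100004
    w_{i,j} denotes the weight of the edge e_{i,j} connecting nodes v_i and v_j in V.
  10. The method according to claim 6, wherein analyzing, from the scores of the individual nodes in the final evidence score p_t, the probability P_2 that the subject is at the node to which it belongs further comprises:
    determining the probability of being at each node by calculating the percentage that the final score of each disease node in V contributes to the sum of the final scores of all disease nodes in V; and
    outputting the largest of these probabilities as the probability P_2 that the subject is at the node to which it belongs.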
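  11. A medical data analysis method, comprising:
    receiving sign parameters of a subject;
    locating, with the sign parameters as features, the subspace of a medical data set in which the subject lies;
    analyzing the probability P_1 that the subject is in the subspace to which it belongs by analyzing the semantic consistency of the subspace in which the subject's sign parameters lie;
    analyzing the probability P_2 that the subject is at the node to which it belongs based on the evidence transfer scores of the subject's sign parameters on a medical knowledge graph; and
    determining, from the probabilities P_1 and P_2, the probability P that the subject is in the subspace or node to which it belongs:
    P = α × P_1 + (1 − α) × P_2, where α is a blending parameter, 0 < α < 1.
  12. A medical data analysis device, comprising:
    a receiving unit configured to receive sign parameters of a subject and to establish from them a vector h that represents the sign parameters;
    a subspace positioning unit configured to locate, with the sign parameters as features, the subspace of a medical data set in which the subject lies, wherein a matrix D represents the set of cases in the data set, D = [D_1, D_2, ..., D_M], D_i denotes the i-th subspace, 1 ≤ i ≤ M, and the locating comprises obtaining X from the formula h = DX, X being a coefficient vector that represents the distribution of the vector h over the subspaces; and
    a first analysis unit configured to analyze the probability P_1 that the subject is in the subspace to which it belongs by judging the semantic consistency of the subspace in which the vector h lies.
  13. A medical data analysis device based on a medical knowledge graph, comprising:
    a receiving unit configured to receive sign parameters of a subject;
    an access unit configured to access a medical knowledge graph to obtain a partial medical knowledge graph related to the subject, the partial medical knowledge graph comprising a plurality of nodes V = {v_1, v_2, ..., v_N} and an initial evidence score for each node, p_0 = [p_{0,1}, p_{0,2}, ..., p_{0,N}], where
    Figure PCTCN2017092186-appb-100005
    each node v_i represents a sign parameter or a related category, and p_{0,i} denotes the initial evidence score of node v_i;
    a second analysis unit configured to determine a final evidence score p_t for each node based on the evidence transfer scores of the subject's sign parameters on the partial medical knowledge graph; and
    a determining unit configured to obtain, from the scores of the individual nodes in the final evidence score p_t, the probability P_2 that the subject is at the node to which it belongs.
  14. A medical data analysis device, comprising:
    a receiving unit configured to receive sign parameters of a subject;
    a subspace positioning unit configured to locate, with the sign parameters as features, the subspace of a medical data set in which the subject lies;
    a first analysis unit configured to analyze the probability P_1 that the subject is in the subspace to which it belongs by analyzing the semantic consistency of the subspace in which the subject's sign parameters lie;
    a second analysis unit configured to analyze the probability P_2 that the subject is at the node to which it belongs based on the evidence transfer scores of the subject's sign parameters on a medical knowledge graph; and
    a blending unit configured to determine, from the probabilities P_1 and P_2, the probability P that the subject is in the subspace or node to which it belongs:
    P = α × P_1 + (1 − α) × P_2, where α is a blending parameter, 0 < α < 1.
  15. A medical data analysis device, comprising:
    a memory configured to store computer-executable instructions; and
    a processor coupled to the memory and configured to execute the computer-executable instructions so that the processor performs the method according to any one of claims 1-11.
  16. A computer-readable storage medium on which computer-readable instructions are stored that, when executed by a computing device, cause the computing device to perform the method according to any one of claims 1-11.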
PCT/CN2017/092186 2017-01-19 2017-07-07 数据分析方法和设备 Ceased WO2018133340A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP17859359.6A EP3572959A4 (en) 2017-01-19 2017-07-07 DATA ANALYSIS PROCESS AND DEVICE
US15/768,825 US11195114B2 (en) 2017-01-19 2017-07-07 Medical data analysis method and device as well as computer-readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710038539.XA CN108335755B (zh) 2017-01-19 2017-01-19 数据分析方法和设备
CN201710038539.X 2017-01-19

Publications (1)

Publication Number Publication Date
WO2018133340A1 true WO2018133340A1 (zh) 2018-07-26

Family

ID=62908252

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/092186 Ceased WO2018133340A1 (zh) 2017-01-19 2017-07-07 数据分析方法和设备

Country Status (4)

Country Link
US (1) US11195114B2 (zh)
EP (1) EP3572959A4 (zh)
CN (2) CN108335755B (zh)
WO (1) WO2018133340A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110459276A (zh) * 2019-08-15 2019-11-15 北京嘉和海森健康科技有限公司 一种数据处理方法及相关设备
CN112951441A (zh) * 2021-02-25 2021-06-11 平安科技(深圳)有限公司 基于多维度的监测预警方法、装置、设备及存储介质
CN113704496A (zh) * 2021-08-31 2021-11-26 平安科技(深圳)有限公司 医疗知识图谱的修复方法、装置、计算机设备及存储介质

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
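CN108335755B (zh) 2017-01-19 2022-03-04 京东方科技集团股份有限公司 Data analysis method and device
CN109326352B (zh) * 2018-10-26 2022-04-15 腾讯科技(深圳)有限公司 Disease prediction method, apparatus, terminal and storage medium
CN110335683B (zh) * 2019-06-14 2023-03-21 北京纵横无双科技有限公司 Health big data analysis method and apparatus
CN110335675B (zh) * 2019-06-20 2021-10-01 北京科技大学 Syndrome differentiation method based on a traditional Chinese medicine knowledge graph library
CN110391026B (zh) * 2019-07-25 2022-04-26 北京百度网讯科技有限公司 Information classification method, apparatus and device based on a medical probabilistic graph
CN113012803B (zh) * 2019-12-19 2024-08-09 京东方科技集团股份有限公司 Computer device, system, readable storage medium and medical data analysis method
CN112509693A (zh) * 2020-12-11 2021-03-16 北京目人生殖医学科技有限公司 Clinical data statistical analysis method, system, device and storage medium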
US11393475B1 (en) * 2021-01-13 2022-07-19 Artificial Solutions Iberia S.L Conversational system for recognizing, understanding, and acting on multiple intents and hypotheses
CN114550946B (zh) * 2022-02-28 2025-03-07 京东方科技集团股份有限公司 Medical data processing method, apparatus and storage medium
CN114913873B (zh) * 2022-05-30 2023-09-01 四川大学 Tinnitus rehabilitation music synthesis method and system
US20240047070A1 (en) * 2022-08-04 2024-02-08 Optum, Inc. Machine learning techniques for generating cohorts and predictive modeling based thereof

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103020454A (zh) * 2012-12-15 2013-04-03 中国科学院深圳先进技术研究院 Method and system for extracting key disease-onset factors and for disease early warning
US20140168246A1 (en) * 2012-12-19 2014-06-19 Industrial Technology Research Institute Health check path evaluation indicator building system, method thereof, device therewith, and computer program product therein
CN104915561A (zh) * 2015-06-11 2015-09-16 万达信息股份有限公司 Intelligent matching method for disease features

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0954854A4 (en) * 1996-11-22 2000-07-19 T Netix Inc PARTIAL VALUE-BASED SPEAKER VERIFICATION BY UNIFYING DIFFERENT CLASSIFIERS USING CHANNEL, ASSOCIATION, MODEL AND THRESHOLD ADAPTATION
WO2002087431A1 (en) * 2001-05-01 2002-11-07 Structural Bioinformatics, Inc. Diagnosing inapparent diseases from common clinical tests using bayesian analysis
US7624030B2 (en) * 2005-05-20 2009-11-24 Carlos Feder Computer-implemented medical analytics method and system employing a modified mini-max procedure
US8015136B1 (en) * 2008-04-03 2011-09-06 Dynamic Healthcare Systems, Inc. Algorithmic method for generating a medical utilization profile for a patient and to be used for medical risk analysis decisioning
US20110112380A1 (en) * 2009-11-12 2011-05-12 eTenum, LLC Method and System for Optimal Estimation in Medical Diagnosis
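CN102184314A (zh) * 2011-04-02 2011-09-14 中国医学科学院医学信息研究所 Automatic aided diagnosis method oriented to biased symptom descriptions
CN104156905A (zh) * 2014-08-15 2014-11-19 西安交通大学 Method for evaluating key monitored enterprises based on a taxpayer interest association network
CN104484844B (zh) * 2014-12-30 2018-07-13 天津迈沃医药技术股份有限公司 Self-diagnosis and treatment website platform based on disease circle data information
CN104834668B (zh) * 2015-03-13 2018-10-02 陈文� Knowledge-base-based job recommendation system
CN105653859A (zh) * 2015-12-31 2016-06-08 遵义医学院 System and method for automatic aided disease diagnosis based on medical big data
CN105738109B (zh) * 2016-02-22 2017-11-21 重庆大学 Bearing fault classification and diagnosis method based on sparse representation and ensemble learning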
US20170344711A1 (en) * 2016-05-31 2017-11-30 Baidu Usa Llc System and method for processing medical queries using automatic question and answering diagnosis system
CN106295186B (zh) * 2016-08-11 2019-03-15 中国科学院计算技术研究所 Intelligent-reasoning-based system for aided disease diagnosis
CN108335755B (zh) 2017-01-19 2022-03-04 京东方科技集团股份有限公司 Data analysis method and device
CN107153775B (zh) * 2017-06-13 2020-03-10 京东方科技集团股份有限公司 Intelligent triage method and apparatus

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103020454A (zh) * 2012-12-15 2013-04-03 中国科学院深圳先进技术研究院 Method and system for extracting key disease-onset factors and for disease early warning
US20140168246A1 (en) * 2012-12-19 2014-06-19 Industrial Technology Research Institute Health check path evaluation indicator building system, method thereof, device therewith, and computer program product therein
CN104915561A (zh) * 2015-06-11 2015-09-16 万达信息股份有限公司 Intelligent matching method for disease features

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LI YING, ZHANG SHU-QUANG, LIU YU-XIU: "Application of Knowledge Mapping in Analysis of the Discipline Development", JOURNAL OF MEDICAL POSTGRADUATES, vol. 26, no. 8, 15 August 2013 (2013-08-15), pages 875-877, XP055749112, DOI: 10.16571/j.cnki.1008-8199.2013.08.002 *
See also references of EP3572959A4 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
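CN110459276A (zh) * 2019-08-15 2019-11-15 北京嘉和海森健康科技有限公司 Data processing method and related device
CN110459276B (zh) * 2019-08-15 2022-05-24 北京嘉和海森健康科技有限公司 Data processing method and related device
CN112951441A (zh) * 2021-02-25 2021-06-11 平安科技(深圳)有限公司 Multi-dimension-based monitoring and early-warning method, apparatus, device and storage medium
CN112951441B (zh) * 2021-02-25 2023-05-30 平安科技(深圳)有限公司 Multi-dimension-based monitoring and early-warning method, apparatus, device and storage medium
CN113704496A (zh) * 2021-08-31 2021-11-26 平安科技(深圳)有限公司 Method and apparatus for repairing a medical knowledge graph, computer device and storage medium
CN113704496B (zh) * 2021-08-31 2024-01-26 平安科技(深圳)有限公司 Method and apparatus for repairing a medical knowledge graph, computer device and storage medium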

Also Published As

Publication number Publication date
US11195114B2 (en) 2021-12-07
US20190057316A1 (en) 2019-02-21
EP3572959A1 (en) 2019-11-27
EP3572959A4 (en) 2021-01-06
CN108335755A (zh) 2018-07-27
CN114141378B (zh) 2025-03-25
CN108335755B (zh) 2022-03-04
CN114141378A (zh) 2022-03-04

Similar Documents

Publication Publication Date Title
CN108335755B (zh) Data analysis method and device
CN109935336B (zh) Intelligent aided diagnosis system for pediatric respiratory diseases
Khanna et al. Radiologist-level two novel and robust automated computer-aided prediction models for early detection of COVID-19 infection from chest X-ray images
Shi et al. Medadapter: Efficient test-time adaptation of large language models towards medical reasoning
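JP7357614B2 (ja) Machine-assisted dialogue system, and medical condition interview apparatus and method therefor
JP2020518050A (ja) Learning and applying contextual similarity between entities
Ghaderzadeh et al. Efficient framework for detection of COVID‐19 omicron and delta variants based on two intelligent phases of CNN models
CN110598786A (zh) Neural network training method, semantic classification method and semantic classification apparatus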
Li et al. BCRAM: A social-network-inspired breast cancer risk assessment model
Wajgi et al. Optimized tuberculosis classification system for chest X‐ray images: Fusing hyperparameter tuning with transfer learning approaches
WO2024131025A1 (zh) Data processing method and apparatus, electronic device and storage medium
Hong et al. Predicting risk of mortality in pediatric ICU based on ensemble step-wise feature selection
Yenurkar et al. Effective detection of COVID-19 outbreak in chest X-Rays using fusionnet model
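CN114881124B (zh) Method and apparatus for constructing a causal relationship graph, electronic device and medium
Naveen et al. COVID salvation: A theoretical model for Predicting coronavirus from chest radiology imagery
CN116978549A (zh) Organ disease prediction method, apparatus, device and storage medium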
Park et al. Evaluating Advanced Large Language Models for Pulmonary Disease Diagnosis Using Portable Spirometer Data: A Comparative Analysis of Gemini 1.5 Pro, GPT 4o, and Claude 3.5 Sonnet
Alam et al. A Robust CNN Framework with Dual Feedback Feature Accumulation for Detecting Pneumonia Opacity from Chest X-ray Images
Rafi A holistic approach to identification of covid-19 patients from chest x-ray images utilizing transfer based learning
Panç et al. Predicting COVID-19 outcomes: Machine learning predictions across diverse datasets
Singh et al. A Healthcare Chatbot System Using Python And NLP
TW202203248A (zh) Apparatus and method for risk assessment of chronic diseases
Ramaphosa et al. Comparison of Convolutional Neural Networks in SARS-CoV-2 Identification
US12106196B2 (en) Method and system for generating synthetic time domain signals to build a classifier
Saheb et al. Review Of Machine Learning-Based Disease Diagnosis and Severity Estimation of Covid-19

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 17859359; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)