KR20070020463A

KR20070020463A - System and method for automatic generation of a hierarchical tree network and the use of two complementary learning algorithms, optimized for each leaf of the hierarchical tree network

Info

Publication number: KR20070020463A
Application number: KR1020067024811A
Authority: KR
Inventors: David H. Kil; David B. Pottschmidt
Original assignee: Humana Inc.
Priority date: 2004-04-27
Filing date: 2005-04-27
Publication date: 2007-02-21

Abstract

A system and method for creating a hierarchical tree network and using linear-plus-nonlinear learning algorithms to form a consensus view of a member's future health status. Each leaf in the hierarchical tree network is homogeneous in clinical characteristics, experience period, and available data assets. Optimization is performed on each leaf so that the features and learning algorithms can be tailored to the local characteristics specific to that leaf.

Description

SYSTEM AND METHOD FOR AUTOMATIC GENERATION OF A HIERARCHICAL TREE NETWORK AND THE USE OF TWO COMPLEMENTARY LEARNING ALGORITHMS, OPTIMIZED FOR EACH LEAF OF THE HIERARCHICAL TREE NETWORK

This application claims the benefit of U.S. Provisional Patent Application Serial No. 60/565,579, filed April 27, 2004, for a system and method for automatic generation of a hierarchical tree network and the use of two complementary learning algorithms optimized for each leaf of the hierarchical tree, which is incorporated herein by reference.

The present invention relates to a system and method for creating a hierarchical tree network that uses linear-plus-nonlinear learning algorithms to form a consensus view of a member's future health status. Each leaf in the hierarchical tree network is homogeneous in clinical characteristics, experience period, and available data assets. Optimization is performed on each leaf so that the features and learning algorithms can be tailored to the local characteristics specific to that leaf.

The present invention is directed to a method for predicting an individual's future health status. The method comprises:

a. establishing a hierarchical tree network that, for a plurality of members, assigns each member to at most one of a plurality of nodes as a function of a plurality of stratification variables;

b. providing to a computer-based system, for the plurality of members, member demographic data, available member medical claim data, and available member pharmacy claim data for each member;

c. performing feature selection for each of the plurality of nodes to identify, for each node, an optimal subset of features from a set comprising at least one of the member demographic data, available member medical claim data, and available member pharmacy claim data for all members assigned to that node;

d. training an MVLR algorithm and a BRN algorithm using at least one of the member demographic data, available member medical claim data, and available member pharmacy claim data, and storing the learned parameters in a database to create a learned-parameter database;

e. using the learned-parameter database and, for at least one member, that member's demographic data, available medical claim data, and available pharmacy claim data, calculating an MVLR future health status score using the MVLR algorithm and a BRN future health status score using the BRN algorithm, and determining a final score by calculating the arithmetic mean of the MVLR future health status score and the BRN future health status score.

The stratification variables may include the member's enrollment duration, for example 6 months or less; the presence or absence of a catastrophic condition; the presence or absence of diabetes; the presence of a pharmacy trigger or an inpatient admission trigger; and the presence of both medical and pharmacy claim data, pharmacy claim data only, or medical claim data only.

The method may also include a reporting step in which information about the member is presented together with the calculated final score. For example, the member information may include enrollment/eligibility information; clinical condition information including conditions, trigger type, and trigger date; and a representation of member cost over time for pharmacy and/or medical claims.

The invention will be more readily understood by reference to the following description taken in conjunction with the accompanying drawings.

FIG. 1 illustrates the N-dimensional vector space as it is processed and mapped, to support the explanation of feature optimization and learning, along with the relative strengths and weaknesses of various learning algorithms.

FIG. 2 shows the overall flow of hierarchical tree network generation, feature optimization, learning, and scoring.

FIGS. 3A-3H show the two selected learning algorithms (a combination of multivariate linear regression and a Bayesian regularization network) in action.

FIGS. 4A and 4B show a glimpse of memorization in a multilayer perceptron.

FIG. 5 shows that the two learning algorithms, the Bayesian regularization network (BRN) and multivariate linear regression (MVLR), agree in general but give different answers in some cases, as indicated by the numerous scatter points away from the line of perfect agreement.

FIGS. 6A-6F show plots of the square root of one-year future cost versus various covariates. These simple plots provide many useful insights. If a simple linear relationship existed, a straight, monotonically increasing line would appear. Unfortunately, none of these features shows such a clear trend, which indicates that no simple linear insight exists.

By way of background, predictive models have been used in the health-insurance industry to predict members' future health status. Such models can be applied to actuarial underwriting (AU) and to the identification of high-risk members for proactive clinical intervention.

Building a predictive model involves two stages: first, transforming claims data into a set of clinical features, and second, using historical claims data to learn the relationship between the clinical features (x) and future health status (y). Typical clinical features include binary clinical markers. Note that IHCIS uses several hundred or more binary clinical markers as covariates in its predictive model. Examples of binary markers include an inpatient episode in the last three months, a coronary artery bypass graft (CABG) procedure in the prior year, and a heart attack in the prior year; the other features listed are age, gender, prior-year Rx cost, prior-year medical cost, length of hospital stay, number of emergency-room (ER) visits, and the Rx-to-med cost ratio.

The learning algorithms currently in use span rule-based expert systems (Active Health), linear regression (IHCIS, Ingenix, DxCG, Active Health, and Biosignia), multilayer perceptrons (McKesson), classification-and-regression trees (CART) (Medical Scientists), and variants of instance-based learners such as k-nearest neighbor or discriminant adaptive nearest neighbor (Hastie and Tibshirani, 1996) (MedAI). A learning algorithm seeks the relationship between inputs and outputs by optimizing an objective function dominated by error terms of various forms, such as the L-1 (absolute error), L-2 (mean-squared error), and L-∞ (maximum error) norms.
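The three error terms named above can be illustrated with a short Python sketch. The function names and the sample costs are hypothetical and are not taken from the source; the L-1 term is assumed here to be the sum (not the mean) of absolute residuals.

```python
def absolute_error(actual, predicted):
    """L-1 term: sum of absolute residuals (assumed form)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted))


def mean_squared_error(actual, predicted):
    """L-2 term: mean of squared residuals."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared) / len(squared)


def maximum_error(actual, predicted):
    """L-infinity term: largest absolute residual."""
    return max(abs(a - p) for a, p in zip(actual, predicted))


# Hypothetical future costs for three members and one model's predictions.
actual_cost = [100.0, 200.0, 400.0]
predicted_cost = [110.0, 190.0, 300.0]

l1 = absolute_error(actual_cost, predicted_cost)
l2 = mean_squared_error(actual_cost, predicted_cost)
l_inf = maximum_error(actual_cost, predicted_cost)
```

The single large miss on the third member dominates the L-2 and L-∞ terms far more than the L-1 term, which is why the choice of norm changes which model an optimizer prefers.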

Conceptually, feature extraction and optimization facilitate learning by finding the smallest feature dimension that can adequately capture all of the useful information present in the raw data. Learning is analogous to finding the mapping function between the optimal features and the desired outputs. Learning algorithms can be broadly categorized into parametric, nonparametric, and boundary-decision types (Kil and Shin, 1996).

Parametric algorithms make strong parametric assumptions about the data distribution, such as linear Gaussian. Simple parametric assumptions, with a correspondingly small number of model parameters to tune, can be a double-edged sword. If the actual data distribution is far more complex than the original assumption, a parametric learner suffers from large model mismatch, which occurs when the learning algorithm is inadequate for capturing highly nonlinear relationships between x and y. Paradoxically, in the presence of large data mismatch, which often occurs when real-world data differ somewhat from the training data with which the model was tuned, simple parametric algorithms tend to outperform their nonparametric counterparts.

Nonparametric and boundary-decision learning algorithms attempt to learn the data distribution from the data. In general, neural networks and nonparametric learning algorithms are good at minimizing model-mismatch errors, but they tend to be esoteric and obscure. There is no simple way to explain the relationship between inputs and outputs. One cannot say that higher prior-year medical cost leads to higher future cost, because no such simple linear relationship exists. These learning algorithms have a large number of tunable parameters that can help capture complex, nonlinear relationships. Unfortunately, such fine-tuning can lead to overfitting or memorization, which can be detrimental in the presence of data mismatch. Furthermore, nonlinear algorithms are known for their tendency to produce wild results when operating outside the trained space. That is, they are ill-suited when a large discrepancy is expected between the real-world data and the training data with which the model was tuned.

Hidden Markov models are useful for modeling temporal transitions. In real-world applications, it is frequently observed that computationally expensive Baum-Welch optimization is inferior to segmental k-means (SKM) despite its higher advertised accuracy. The reason is the trade-off between model mismatch and data mismatch. In most real-world problems we have seen so far, data mismatch is far more important than model mismatch.

To illustrate the core concepts of feature optimization and learning, FIG. 1 shows, from left to right: the original N-dimensional vector space spanned by the raw data, with many overlaps; the M-dimensional vector space, with M << N, spanned by the good features determined after signal processing, feature ranking, and feature optimization; and, on the right, the final one-dimensional decision space produced by the classifier. The following summarizes the relative strengths and weaknesses of the various learning algorithms.

Parametric learning algorithms make strong parametric assumptions about the underlying class-conditional probability distribution, are very simple to train, and are susceptible to model mismatch. Examples are multivariate Gaussian classifiers, Gaussian mixture models, and linear regression.

Nonparametric learning algorithms make no parametric assumptions, learn the distribution from the data, are expensive to train in most instances, and are vulnerable to data mismatch between the training and test data sets. Examples are kernel estimators, histograms, functional forms, and k-nearest neighbor.

Boundary-decision learning algorithms construct (non)linear boundary functions that separate multiple classes, are very expensive to train, and have internal parameters that are determined heuristically in most instances. Examples are multilayer perceptrons, discriminant neural networks, and support vector machines.

The popularity of linear regression stems from its intuitive power. Let y denote the future cost and x the clinical features or covariates (x ∈ R^N). Linear regression estimates y using the equation

$$\hat{y} = \sum_{n=1}^{N} a_n x_n$$

where a_n represents the strength and direction of the correlation between x_n and y. This shows the intuitive nature of linear regression. If the regression coefficient for prior-year medical cost is +1.3, one can infer that, all other input variables being equal, the future cost will be 1.3 times the given prior-year medical cost.
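A minimal Python sketch of the linear estimate described above, in which each coefficient a_n weights one feature x_n. The feature order and the coefficient values are hypothetical and serve only to show how a coefficient of +1.3 on prior-year medical cost is read.

```python
def predict_future_cost(coefficients, features):
    """Linear estimate: the sum over n of a_n * x_n."""
    if len(coefficients) != len(features):
        raise ValueError("coefficients and features must have the same length")
    return sum(a * x for a, x in zip(coefficients, features))


# Hypothetical coefficients for
# [prior-year medical cost, prior-year Rx cost, age].
coefficients = [1.3, 0.8, 12.0]

member = [5000.0, 1200.0, 54.0]
member_higher_prior_cost = [6000.0, 1200.0, 54.0]  # only the first feature changes

baseline = predict_future_cost(coefficients, member)
shifted = predict_future_cost(coefficients, member_higher_prior_cost)

# With the other inputs held equal, an extra 1000 of prior-year medical cost
# moves the estimate by 1.3 * 1000.
difference = shifted - baseline
```

Reading a coefficient this way is only valid while the other inputs are held fixed, which is the "all other input variables being equal" condition in the text.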

Despite its ease of interpretation, linear regression cannot model complex, nonlinear relationships because of its simple mathematical formulation, which can lead to large model-mismatch errors. How to combine the strengths of linear and nonlinear models so as to minimize both data-mismatch and model-mismatch errors remains an open question.

In summary, modeling success hinges on the trade-off between model mismatch and data mismatch. There is an urgent need for an integrated, complementary set of algorithms for data transformation, extraction of derived features, optimization, and robust learning.

Instead of trying to solve the entire problem in one broad stroke, the present invention relies on a judicious divide-and-conquer approach. First, the problem space is divided into multiple logical subspaces using a hierarchical tree network. Each leaf of the hierarchical tree network is homogeneous in clinical characteristics, experience period, and available data assets. Moreover, the hierarchical tree construction method is flexible enough to accommodate additional dimensions, such as a clinical condition score representing a member's total disease burden, prior-experience characteristics (prior monthly cost), chronic versus total cost, and disease trajectory (increasing, decreasing, or steady over time). Next, feature ranking is performed at each leaf to exploit locally unique characteristics that are useful for prediction. Finally, multiple learning algorithms, each wearing its own thinking hat, investigate the relationship between the optimal feature subset and the output. FIG. 2 shows one embodiment of the divide-and-conquer approach.

Each step is described in detail below.

Hierarchical tree network generation: Design a hierarchical tree network in which each leaf represents a clinically homogeneous cluster whose members share similar Humana age and data flags. Clustering is performed in a 2-D space with the Clinical Condition Score (CCS) on the x axis and the prior monthly cost on the y axis. The clinical condition score for the m-th member with multiple comorbid conditions is defined as

$$\mathrm{CCS}(m) = \sum_{k} b_k(m)\,\mathrm{pppm}(k)$$

where b_k(m) is the k-th presence/absence flag for the m-th member and pppm(k) is the average monthly cost for all members with the k-th comorbid disease set. In this two-dimensional vector space, quadrant 1 represents the critically ill population with high per-member-per-month (PMPM) prior cost. Quadrant 3 represents a relatively healthy population with low PMPM. Quadrant 2 contains members who have relatively few critical clinical conditions but a large PMPM (perhaps a behavioral issue). Finally, quadrant 4 represents members who have numerous critical clinical conditions but a small PMPM, perhaps because they self-manage their conditions better.
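The score and the quadrant reading can be sketched in Python as follows. This is a sketch under stated assumptions: the score is taken to be the flag-weighted sum of the average monthly costs, and the quadrant thresholds, condition costs, and function names are hypothetical, since the source does not give numeric boundaries.

```python
def clinical_condition_score(flags, avg_monthly_costs):
    """Assumed form: sum over k of b_k(m) * average monthly cost of condition set k."""
    if len(flags) != len(avg_monthly_costs):
        raise ValueError("flags and costs must have the same length")
    return sum(b * cost for b, cost in zip(flags, avg_monthly_costs))


def quadrant(ccs, prior_pmpm, ccs_threshold, pmpm_threshold):
    """Quadrant in the (CCS, prior monthly cost) plane; thresholds are hypothetical."""
    high_ccs = ccs >= ccs_threshold
    high_cost = prior_pmpm >= pmpm_threshold
    if high_ccs and high_cost:
        return 1  # critically ill, high prior cost
    if not high_ccs and high_cost:
        return 2  # few critical conditions, high prior cost
    if not high_ccs and not high_cost:
        return 3  # relatively healthy, low prior cost
    return 4      # many critical conditions, low prior cost


# Hypothetical average monthly costs for three comorbid condition sets.
avg_monthly_costs = [900.0, 450.0, 300.0]
member_flags = [1, 0, 1]  # member has condition sets 1 and 3

ccs = clinical_condition_score(member_flags, avg_monthly_costs)
member_quadrant = quadrant(ccs, prior_pmpm=150.0,
                           ccs_threshold=1000.0, pmpm_threshold=500.0)
```

Here the member carries a heavy condition burden but a low prior monthly cost, so the sketch places the member in quadrant 4.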

a. Guidance for tree subdivision: the before-and-after entropy reduction measure (i.e., what is gained by partitioning?), the level of data mismatch (the MVLR performance difference between full-data-set training and ten-fold cross validation), a minimum population size of 300, and necessity (enrollment duration and data availability).

b. Branch recombination: Nodes in different branches are recombined based on a similarity metric such as the Kullback-Leibler divergence, defined as

$$D(i \,\|\, j) = \sum_{y} p(y \mid \mathrm{leaf} = i)\,\log\frac{p(y \mid \mathrm{leaf} = i)}{p(y \mid \mathrm{leaf} = j)}$$

where p(y | leaf = j) is the probability of the selected output associated with hierarchical tree leaf j.
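A minimal Python sketch of how a Kullback-Leibler comparison between two leaves might be computed, assuming the output y has been binned into a small number of discrete levels. The symmetrized sum and the recombination threshold are illustrative assumptions, not part of the source.

```python
import math


def kl_divergence(p, q):
    """Discrete Kullback-Leibler divergence D(p || q) in nats."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    total = 0.0
    for p_y, q_y in zip(p, q):
        if p_y == 0.0:
            continue  # a zero-probability bin in p contributes nothing
        if q_y == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_y * math.log(p_y / q_y)
    return total


def should_recombine(p, q, threshold):
    """Assumed rule: merge two leaves when the symmetrized divergence is small."""
    return kl_divergence(p, q) + kl_divergence(q, p) < threshold


# Hypothetical output distributions (low / medium / high future cost) per leaf.
leaf_a = [0.70, 0.20, 0.10]
leaf_b = [0.68, 0.22, 0.10]
leaf_c = [0.10, 0.20, 0.70]

merge_a_b = should_recombine(leaf_a, leaf_b, threshold=0.05)
merge_a_c = should_recombine(leaf_a, leaf_c, threshold=0.05)
```

Because the divergence is not symmetric, the sketch sums both directions before thresholding; a one-directional test would be an equally plausible reading of the text.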

c. Approximation: Given the realities of our data, we use the following five features to create the hierarchical tree leaves.

1. Humana membership duration: This indicates how long a member has been with Humana and determines the amount of available claims history. This field is necessary to account for the business reality of the high turnover rate in the health-insurance industry.

2. Catastrophic condition: This field indicates the presence of expensive chronic conditions that require coordinated care management and intensive clinical intervention.

3. Diabetes flag: The Personal Nurse (PN) program focuses on behavior modification for diabetics with other chronic conditions; members who have diabetes but no catastrophic condition are flagged separately.

4. Trigger type: This field indicates why members are brought into the predictive modeling queue. Members with an inpatient episode need special counseling and reminder messages on how they can stay out of further hospitalization. They are also more receptive to behavior-intervention messages.

5. Data availability: It goes without saying that the models must take the available data assets into account.

Feature subset selection: For each leaf, we perform feature optimization to select the optimal feature subset at the point of diminishing returns.

a. Feature correlation: To minimize redundancy and feature dimension, we combine highly correlated features (ρ ≥ 0.9) using principal component analysis, where

$$\rho_{ij} = \frac{E\left[(x_i - \mu_i)(x_j - \mu_j)\right]}{\sigma_i \sigma_j}$$

is the correlation coefficient between features x_i and x_j, with means μ_i and μ_j and standard deviations σ_i and σ_j.
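The following Python sketch illustrates this step for a single pair of features, using only the standard library. It assumes non-constant features and relies on the fact that, for two standardized features with positive correlation ρ, the first principal component is their normalized sum with variance 1 + ρ. The feature names and values are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def std(values):
    """Population standard deviation."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def correlation(x, y):
    """Pearson correlation coefficient between two features."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (std(x) * std(y))


def standardize(values):
    m, s = mean(values), std(values)
    return [(v - m) / s for v in values]


def combine_pair(x, y):
    """First principal component of two standardized, positively correlated features."""
    zx, zy = standardize(x), standardize(y)
    return [(a + b) / math.sqrt(2.0) for a, b in zip(zx, zy)]


# Hypothetical features for four members.
prior_medical_cost = [1000.0, 2500.0, 4000.0, 8000.0]
hospital_days = [1.0, 3.0, 4.0, 9.0]

rho = correlation(prior_medical_cost, hospital_days)
if rho >= 0.9:
    combined = combine_pair(prior_medical_cost, hospital_days)
else:
    combined = None
```

With more than two mutually correlated features, a general eigen-decomposition of the correlation matrix would be needed; the pairwise shortcut above only covers the two-feature case.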

b. Feature ranking: The purpose of feature ranking is to investigate how much each feature contributes to the overall prediction accuracy. If the features are completely orthogonal (i.e., ρ_ij = 0 for i ≠ j), feature ranking reduces to marginal, or one-dimensional, feature ranking using any of a number of suitable metrics, such as Fisher's discriminant ratio, the multi-modal overlap (MOM) measure, divergence, and the Bhattacharyya distance (Kil and Shin, 1996). If the features are not orthogonal, we can use stochastic or combinatorial optimization algorithms.
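As an illustration of marginal ranking, the Python sketch below ranks features by the two-class Fisher's discriminant ratio, (μ_a - μ_b)² / (σ_a² + σ_b²). The two classes (for example, high-cost versus other members), the feature names, and the values are hypothetical, and the sketch assumes the two class variances are not both zero.

```python
def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Population variance."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def fisher_ratio(class_a, class_b):
    """Two-class Fisher's discriminant ratio for a single feature."""
    separation = (mean(class_a) - mean(class_b)) ** 2
    spread = variance(class_a) + variance(class_b)
    return separation / spread


def rank_features(samples_by_feature):
    """Return feature names ordered from most to least discriminating."""
    ratios = {
        name: fisher_ratio(class_a, class_b)
        for name, (class_a, class_b) in samples_by_feature.items()
    }
    return sorted(ratios, key=ratios.get, reverse=True)


# Hypothetical per-feature values: (other members, high-cost members).
samples = {
    "age": ([40.0, 50.0, 60.0], [45.0, 55.0, 65.0]),
    "prior_cost": ([1.0, 2.0, 3.0], [7.0, 8.0, 9.0]),
}

ranking = rank_features(samples)
```

This one-feature-at-a-time ranking is only reliable when features are close to orthogonal, which is the condition stated in the text.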

Learning: Learning can take the form of regression (a continuous dependent variable, such as future cost) or classification (a discrete dependent variable, such as identifying the top 20% in future cost). The dependent variable can be clinically or actuarially oriented. We use the following learning algorithms:

a. Multivariate linear regression (MVLR):

$$\hat{y} = \sum_{n=1}^{N} a_n x_n$$

b. Bayesian regularization network (BRN): BRN can be used for both regression and classification (Foresee and Hagan, 1997).

Scoring: Using the learned-parameter database, score an unknown population batch with the multiple learning algorithms.

We now turn to the learning and scoring of this embodiment as shown in FIG. 2. The following reference numerals are used in FIG. 2: 1 - input of fused features, which are all available medical and pharmacy data and demographic data; 2 - feature subset selection; 3 - learn-or-score decision; 4 - learning path; 5 - MVLR learning; 6 - BRN learning; 7 - learned database; 8 - scoring path; 9 - MVLR scoring; 10 - BRN scoring; 11 - mean operator; 12 - clinical condition summary; and 13 - final score and condition report.
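The scoring path (numerals 8 to 11) can be sketched in Python as follows. The layout of the learned-parameter store, the coefficient values, and the stand-in used in place of a trained BRN are all hypothetical; only the final arithmetic mean of the two scores comes from the text.

```python
def mvlr_score(coefficients, features):
    """MVLR scoring: linear combination of the leaf's selected features."""
    return sum(a * x for a, x in zip(coefficients, features))


def final_score(features, leaf_parameters):
    """Mean operator: arithmetic mean of the MVLR and BRN scores."""
    linear = mvlr_score(leaf_parameters["mvlr_coefficients"], features)
    nonlinear = leaf_parameters["brn_predict"](features)
    return (linear + nonlinear) / 2.0


# Hypothetical learned-parameter store keyed by leaf number.
# The BRN entry is a stand-in callable, not a trained network.
learned_database = {
    22: {
        "mvlr_coefficients": [1.2, 0.5],
        "brn_predict": lambda f: 1.1 * f[0] + 0.7 * f[1],
    },
}

member_features = [1000.0, 200.0]  # hypothetical [prior medical cost, prior Rx cost]
score = final_score(member_features, learned_database[22])
```

Keeping the parameters per leaf mirrors the idea that each leaf has its own optimized feature subset and its own pair of trained models.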

Before the detailed description, the following table provides details on the 30 nodes in the hierarchical tree network.

In the table above, leaves 1 to 15 form a first set of nodes for members enrolled for less than 6 months, and leaves 16 to 30 form a second set of nodes for members enrolled for 6 months or more. Within the first set (nodes 1 to 15), leaves 1 to 3 form a first subset and leaves 4 to 15 form a second subset; the second subset is further divided into four groups, namely leaves 4 to 6, leaves 7 to 9, leaves 10 to 12, and leaves 13 to 15. Within the second set (nodes 16 to 30), leaves 16 to 18 form a first subset and leaves 19 to 30 form a second subset; the second subset is further divided into four groups, namely leaves 19 to 21, leaves 22 to 24, leaves 25 to 27, and leaves 28 to 30.

1. As shown above, to handle every situation in which the predictive model may be applied, we create a hierarchical tree network as a function of the following five stratification variables, which yields 30 nodes or leaves:

등록 기간: 구성원이 휴마나와 함께 한 기간. 상기 도시된 바와 같이, 구성원들은 0 - 내지 0.5 년 동안 휴마나와 함께 한 구성원들과 0.5 년 이상 휴마나와 함께 한 구성원들로 나뉜다. Enrollment Period: The length of time a member has with Humana. As shown above, the members are divided into members with Humana for 0- to 0.5 years and members with Humana for more than 0.5 years.

파국적 상태: 구성원이 엄격 임상 상태를 갖는가? 구성원이 다음 중 어느 것을 갖는 경우, 카테고리 플래그는 yes이고, 그렇지 않으면 no이다.Catastrophic status: Does the member have a strict clinical condition? If the member has any of the following, the category flag is yes, otherwise no.

암.cancer.

말기 질환(신부전증 또는 간부전증).Terminal illness (renal failure or liver failure).

이식.transplantation.

희귀병.Rare disease.

HIV.HIV.

CAD + CHF + 고혈압.CAD + CHF + hypertension.

당뇨병 플래그, yes 또는 no. 구성원이 상기 b에서 파국적 상태를 갖는 경우, 그들은 리프(1 내지 3 또는 16 내지 18)에 있으며, 당뇨병 플래그는 영향을 주지 않는다.Diabetes flag, yes or no. If the members have a catastrophic state in b above, they are on the leaves 1 to 3 or 16 to 18 and the diabetic flag has no effect.

Trigger type: new Rx claim or inpatient admission.

Available data: Rx only, medical only, or both.

As an example, from the 30-node table and the description above: if John Smith has been with Humana for 9 months and has a catastrophic condition, then, assuming he has both pharmacy and medical benefits, he will be in leaf #16. If Nancy Doe has been with Humana for 12 months and has no catastrophic condition but is diabetic and has an inpatient trigger, then, assuming she has both medical and pharmacy benefit plans with Humana, she will be in leaf #22. Furthermore, only members who have a catastrophic condition, an inpatient admission trigger, or a prescription trigger are assigned to one of the 30 nodes.
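The leaf assignment described above can be sketched as a small lookup. This is a minimal illustration, not the patented implementation: the `Member` class, the field names, and the string codes are hypothetical. The ordering of the three data categories follows the a/b/c listing given for leaves 16 to 18, and the ordering of the four non-catastrophic groups is an assumption; only the position of the diabetic, inpatient-trigger group is fixed by the Nancy Doe example.

```python
from dataclasses import dataclass
from typing import Optional

# Order of the available-data categories inside each group of three leaves
# (assumed to follow the a/b/c listing given for leaves 16 to 18).
DATA_ORDER = ("rx+med", "rx_only", "med_only")

# ASSUMPTION: order of the four non-catastrophic groups. Only the position of
# (diabetic, inpatient) is confirmed by the Nancy Doe example (leaf 22).
GROUP_ORDER = (
    (True, "rx"),          # diabetic, prescription trigger
    (True, "inpatient"),   # diabetic, inpatient admission trigger
    (False, "rx"),         # non-diabetic, prescription trigger
    (False, "inpatient"),  # non-diabetic, inpatient admission trigger
)


@dataclass
class Member:
    months_enrolled: float
    catastrophic: bool
    diabetic: bool
    trigger: Optional[str]  # "rx", "inpatient", or None
    data: str               # one of DATA_ORDER


def assign_leaf(m: Member) -> Optional[int]:
    """Return a leaf number from 1 to 30, or None if the member is not assigned."""
    if not m.catastrophic and m.trigger is None:
        return None  # only catastrophic or triggered members get a leaf
    base = 0 if m.months_enrolled < 6 else 15
    data_idx = DATA_ORDER.index(m.data)
    if m.catastrophic:
        # Diabetes flag and trigger type have no effect here.
        return base + 1 + data_idx
    group_idx = GROUP_ORDER.index((m.diabetic, m.trigger))
    return base + 4 + 3 * group_idx + data_idx


john = Member(9, True, False, None, "rx+med")
nancy = Member(12, False, True, "inpatient", "rx+med")
print(assign_leaf(john), assign_leaf(nancy))
```

Under these assumptions the two worked examples land in leaves 16 and 22, matching the text.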

For each leaf, we perform feature optimization to find the optimal feature subset using add-on combinatorial optimization.
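One common reading of "add-on" optimization is greedy forward selection: start with no features and repeatedly add the feature that improves a score the most, stopping at the point of diminishing returns. The sketch below assumes that reading; the function name, the `score` callback, and the `min_gain` threshold are hypothetical and not taken from the source.

```python
from typing import Callable, Sequence, Tuple


def add_on_selection(
    features: Sequence[str],
    score: Callable[[Tuple[str, ...]], float],
    min_gain: float = 1e-3,
) -> Tuple[str, ...]:
    """Greedy forward ("add-on") feature selection.

    `score` maps a feature subset to a figure of merit (higher is better),
    for example a cross-validated R_sq on the members of one leaf.
    """
    selected: Tuple[str, ...] = ()
    best = score(selected)
    remaining = list(features)
    while remaining:
        # Try adding each remaining feature and keep the best candidate.
        top_score, top_feat = max((score(selected + (f,)), f) for f in remaining)
        if top_score - best < min_gain:
            break  # diminishing returns: stop adding features
        selected += (top_feat,)
        best = top_score
        remaining.remove(top_feat)
    return selected


# Toy score: each feature contributes a fixed, independent gain.
gains = {"total_paid_sqrt": 0.3, "paid_trend": 0.2, "noise": 0.0005}
toy_score = lambda subset: sum(gains[f] for f in subset)
print(add_on_selection(list(gains), toy_score))
```

Running the selection separately on each leaf is what lets different leaves end up with different feature subsets.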

During learning, we train both the multivariate linear regression (MVLR) and Bayesian regularization network (BRN) algorithms. The learned parameters are stored in a local database. The two learning algorithms were chosen to cope with the data-mismatch and model-mismatch errors commonly encountered under real-world conditions. Model-mismatch errors occur when a learning algorithm is too simple to model the complex relationship between inputs and outputs adequately. Data-mismatch errors, on the other hand, occur when a supertuned learning algorithm cannot cope with differences in data characteristics between training data and test data.

During scoring, the two learning algorithms each provide an assessment of the member's future health status.

A simple arithmetic-mean operator outputs the final score, which is then combined with a summary of the member's clinical conditions in a report format. The clinical condition report consists of up to the top five MCC conditions, trigger type, trigger date, scoring date, person key, an insulin code, an oral code, or both if the member is diabetic, and a maternity flag.

For each leaf described above, separate learning is performed on the structurally homogeneous population segment belonging to that leaf. For example, we obtain the following statistics for leaves 16 to 18:

1. Performance measured in R_sq $= 1 - \operatorname{var}(y - \hat{y}) / \operatorname{var}(y)$, where $y$ and $\hat{y}$ denote the actual and predicted outputs, respectively, and $\operatorname{var}(\cdot)$ denotes the variance operator:

a. Rx + med: 0.41/0.39 (BRN/MVLR)

b. Rx only: 0.24/0.18

c. Med only: 0.36/0.35

2. The top five features are:

a. Rx + med

i. Square root of the total prior paid amount

ii. Mean utilization of other inpatient facilities

iii. Trend in prior paid amount over time

iv. Variation in paid amount: the more consistent, the easier to predict

v. Cost associated with rare diseases such as multiple sclerosis and hemophilia

b. Rx only

i. Period of nonzero Rx cost

ii. Number of unique drugs taken

iii. Ace drug cost

iv. Ratio of Rx chronic cost to Rx total cost: the higher the ratio, the easier to predict

v. GPI drug class ID

c. Med only

i. Square root of the paid amount for the most recent prior 6 months

ii. Mean utilization of other outpatient facilities

iii. Number of unique diagnoses in primary ICD-9

iv. Average Major Clinical Condition (MCC) cost per month

v. Average paid amount for inpatient physicians

3. During scoring, we compute a large number of features for each member, find the leaf # to which the member belongs, filter down to the feature subset needed for the selected leaf, load the learned parameters associated with the two learning algorithms for that leaf, and generate two outputs, one from MVLR and one from BRN. For consensus between the two learning algorithms, we take the arithmetic mean. For MVLR, the learned parameters consist of normalization and regression coefficients. For BRN, the learned parameters include the network architecture, the optimization function, and the weight/bias values for each network node and connection.

4. As output, we append to the mean score each individual's clinical conditions, weather map, trigger information, and other eligibility information. The member weather map provides a representation of member cost over a specified time period for that member's pharmacy and/or medical claims. By mapping ICD-9 codes to major clinical conditions, medical cost information can be shown for different major clinical condition categories (for example, coronary artery disease, congestive heart failure, other heart failure, circulatory, and cancer) and for their subsets and sub-subsets (for example, the major clinical condition coronary artery disease has the subclasses coronary artery bypass graft, percutaneous transluminal coronary angioplasty, myocardial infarction, angina, other ischemic heart disease, coronary atherosclerosis, and hyperlipidemia). Drug codes are likewise mapped to the various major clinical conditions and their subsets and sub-subsets. The member weather map can also show member utilization and medical claims by place of treatment (for example, hospital inpatient, hospital emergency room, physician office visit) over the same specified time period.
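The scoring step can be sketched as follows. This is an illustrative outline only: the parameter layout (`means`, `stds`, `coefs`, `intercept`) is a hypothetical reading of "normalization and regression coefficients", and the BRN is represented by an arbitrary callable because the source does not specify its architecture.

```python
from typing import Callable, Dict, Sequence


def mvlr_score(
    x: Sequence[float],
    means: Sequence[float],
    stds: Sequence[float],
    coefs: Sequence[float],
    intercept: float,
) -> float:
    """Normalize the leaf's feature subset, then apply the regression coefficients."""
    z = [(xi - m) / s for xi, m, s in zip(x, means, stds)]
    return intercept + sum(c * zi for c, zi in zip(coefs, z))


def consensus_score(
    x: Sequence[float],
    mvlr_params: Dict,
    brn_predict: Callable[[Sequence[float]], float],
) -> float:
    """Arithmetic mean of the MVLR and BRN outputs for one member."""
    return (mvlr_score(x, **mvlr_params) + brn_predict(x)) / 2.0


# Hypothetical learned parameters for one leaf (made-up numbers).
params = {"means": [1.0, 2.0], "stds": [1.0, 2.0], "coefs": [0.5, 1.0], "intercept": 10.0}
stub_brn = lambda x: 12.5  # stand-in for a trained network
print(consensus_score([2.0, 4.0], params, stub_brn))
```

In a full pipeline the parameters would be looked up by leaf number from the learned-parameter database before this function is called.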

Next, we provide additional algorithmic details.

Bayesian regularization network:

Traditional feedforward neural networks iteratively learn highly nonlinear relationships between inputs and outputs using multiple feedforward-connected neurons that perform weighted arithmetic summation and nonlinear activation. Unfortunately, in the presence of data mismatch, often described as "novel" inputs, their performance has been found to lack robustness. This is where regularization plays an important role.

Regularization means finding a meaningful approximate solution, not a meaningless exact one (Neumaier, 1998). A well-known regularization method in linear algebra is so-called diagonal loading, also known as Tikhonov regularization.

If we want to find the solution $x$ in $y = Ax$, the well-known standard L-2 solution is

$$\hat{x} = (A'A)^{-1} A' y.$$

For ill-posed problems in which the pseudo-inverse $A^{+} = (A'A)^{-1} A'$ does not exist (which can be verified using singular value decomposition, SVD), a meaningful and good approximate solution is

$$\hat{x} = (A'A + \lambda I)^{-1} A' y,$$

where we add a small amount $\lambda$ to each diagonal term for regularization. This feature is incorporated into our multivariate Gaussian classification algorithm.
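Diagonal loading can be illustrated on a deliberately ill-posed problem. The sketch below uses plain Python lists and a small Gauss-Jordan solver so that it is self-contained; the names `solve`, `diagonal_loading`, and `lam` are illustrative choices, and a production system would normally rely on a numerical library instead.

```python
def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(M, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * p for a, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def diagonal_loading(A, y, lam=0.0):
    """Return x = (A'A + lam*I)^-1 A'y for a list-of-rows matrix A."""
    n = len(A[0])
    AtA = [[sum(row[i] * row[j] for row in A) + (lam if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    Aty = [sum(row[i] * yi for row, yi in zip(A, y)) for i in range(n)]
    return solve(AtA, Aty)


# Two identical columns make A'A singular, so the plain L-2 solution fails.
A = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
y = [2.0, 4.0, 6.0]
try:
    diagonal_loading(A, y, lam=0.0)
except ValueError as err:
    print("unregularized:", err)
print("regularized:", diagonal_loading(A, y, lam=1e-3))
```

With a small positive `lam` the solver returns a stable answer that splits the weight evenly between the two duplicated columns, which is the "meaningful approximate solution" the text refers to.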

Similarly, in learning, instead of using the traditional objective function that minimizes mean squared error (the L-2 norm), one can use an optimization function conceptually similar to the Bayesian Information Criterion (BIC) or minimum description length (Rissanen, 1989), which takes the form

$$\log p(D \mid S) \approx \log p(D \mid \hat{\theta}_S, S) - \frac{d}{2} \log N,$$

where $D$ and $S$ denote the database and the model structure, respectively. $N$ is the sample size of the data, while $d$ is the number of parameters of model $S$, with $\hat{\theta}_S$ denoting the set of parameters associated with the given model structure. Intuitively, this equation says that the objective function can be maximized by maximizing the explanatory power of the model structure learned from the data while simultaneously minimizing model complexity.
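To make the trade-off concrete, the following sketch evaluates a BIC-style score for a regression model under the assumption of Gaussian residuals with maximum-likelihood variance. That distributional assumption and the function name `bic_score` are illustrative and not stated in the source.

```python
import math


def bic_score(residuals, d):
    """BIC-style score: Gaussian log-likelihood minus (d/2) * log(N).

    residuals: actual minus predicted values for the N training samples
    d: number of free parameters in the model structure
    Higher is better.
    """
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n  # ML estimate of noise variance
    log_lik = -0.5 * n * (math.log(2.0 * math.pi * sigma2) + 1.0)
    return log_lik - 0.5 * d * math.log(n)


res = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, -0.2]
# Same fit quality, different complexity: the smaller model scores higher.
print(bic_score(res, d=2), bic_score(res, d=10))
```

A more complex model is preferred only if it reduces the residuals enough to offset the extra penalty of half a log-sample-size per added parameter.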

In BRN, the objective function is

$$J = \alpha \cdot \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 + \beta \sum_{j} w_j^2,$$

with $\alpha + \beta = 1$. The first term of $J$ is the well-known mean squared prediction error, while the second term is the sum of squares of the network weights $w_j$. If $\beta \gg \alpha$, learning places much more emphasis on weight reduction at the expense of minimizing prediction error, so that the resulting network response is much smoother than it otherwise would be. Under Gaussian assumptions, Bayes' rule can be applied to derive closed-form solutions for $\alpha$ and $\beta$ (Foresee and Hagan, 1997).
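The effect of the weighting can be seen by evaluating the objective directly. This is a minimal sketch of the two-term objective described above with a fixed `alpha`; it does not implement the Bayesian update of α and β, and the function name is hypothetical.

```python
def brn_objective(y, y_hat, weights, alpha):
    """J = alpha * MSE + beta * sum of squared weights, with beta = 1 - alpha."""
    beta = 1.0 - alpha
    mse = sum((a - p) ** 2 for a, p in zip(y, y_hat)) / len(y)
    ssw = sum(w * w for w in weights)
    return alpha * mse + beta * ssw


y, y_hat, w = [1.0, 2.0], [1.0, 4.0], [1.0, 2.0]
print(brn_objective(y, y_hat, w, alpha=0.5))  # error and weights count equally
print(brn_objective(y, y_hat, w, alpha=0.1))  # beta >> alpha: weights dominate
```

As `alpha` shrinks, large weights are penalized more heavily relative to prediction error, which pushes training toward smoother network responses.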

In short, BRN relies on Bayesian regularization to avoid overfitting during learning while retaining the flexibility to learn and respond to novel patterns. In this case, we use three factors in the objective function for greater robustness: error, network weights, and model complexity in terms of the actual feature dimension.

Linear regression:

The function [equation] can be implemented as follows: [equation]

Again, we use Tikhonov regularization to find a robust solution to the matrix inversion problem.

Next, we summarize how we address the robustness issue.

1. An objective function that addresses prediction error, model complexity, and model parameters.

2. Transformation of the dependent variable to reduce dynamic range and scale down outrageous outliers, that is, the square root of cost instead of linear cost.

3. Performance analysis that includes 5- or 10-fold cross validation and full-data-set training.

4. Multiple-model combining.
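The square-root transformation in item 2 can be illustrated in a few lines. The helper names are hypothetical, and clamping negative costs to zero is an assumption added here so the transform is always defined.

```python
import math


def to_model_scale(cost: float) -> float:
    """Square-root transform of a cost (negative values clamped to 0 by assumption)."""
    return math.sqrt(max(cost, 0.0))


def to_cost(score: float) -> float:
    """Invert the transform to report a prediction on the original cost scale."""
    return score ** 2


costs = [100.0, 10_000.0, 1_000_000.0]
scaled = [to_model_scale(c) for c in costs]
# Dynamic range shrinks from 10,000:1 on the raw scale to 100:1 after the transform.
print(max(costs) / min(costs), max(scaled) / min(scaled))
```

Because the models are trained on the square-root scale, a single very expensive member pulls on the fit far less than it would on the raw cost scale.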

Example implementation:

Given the diversity of human disease progression, we expected a considerable amount of data mismatch, which implies that the learning algorithms must be robust to conditions not encountered during training. We also expected the problem to be nonlinear, as most real-world problems are.

After carefully considering a number of learning algorithms, we settled on a combination of multivariate linear regression and the Bayesian regularization network (BRN), seeking consensus between the two. Linear regression was chosen for its robustness in the presence of data mismatch. Nonparametric and boundary-decision learning algorithms based on L-2 objective functions are known to produce wild guesses when forced to operate outside their comfort zone (that is, extrapolation instead of interpolation, in engineering terms).

Despite its robustness to data mismatch, linear regression runs out of breath when modeling complex, nonlinear relationships. To overcome this shortcoming, we combine linear regression with the Bayesian regularization network (BRN). In addition, leveraging the concepts of regularization and BIC, we introduce an additional penalty term as a function of model complexity, thereby trading off the desire to minimize prediction error during learning against the need for robustness to novel, unseen data through generalization.

The robustness of learning can be checked by noting the performance difference between random cross validation and full-data-set training; the latter is generally preferred in real-time implementations because we want to model as much real data as possible. FIGS. 3A to 3H show the two selected learning algorithms in action on two different data sets.

FIGS. 3A and 3B show linear regression on the first data set, and FIGS. 3C and 3D show BRN on that data set. FIGS. 3B and 3D are for the full data set, and FIGS. 3A and 3C are for cross validation. Each figure is a graph of actual (x axis) versus predicted (y axis) output: FIG. 3A has R_sq = 0.86358, FIG. 3B has R_sq = 0.83052, FIG. 3C has R_sq = 0.91497, and FIG. 3D has R_sq = 0.92077.

FIGS. 3E and 3F show linear regression on the second data set, and FIGS. 3G and 3H show BRN on that data set. FIGS. 3F and 3H are for the full data set, and FIGS. 3E and 3G are for cross validation. Again each figure is a graph of actual (x axis) versus predicted (y axis) output: FIG. 3E has R_sq = 0.58165, FIG. 3F has R_sq = 0.56736, FIG. 3G has R_sq = 0.59826, and FIG. 3H has R_sq = 0.59819.

As expected, the performance discrepancy between random cross-validation training and full-data-set training on the two data sets is small, and even negative in some instances, which is a testament to learning robustness.

FIG. 4A is a graph of actual (x axis) versus predicted (y axis) output with R_sq = 0.90001, and FIG. 4B is the same kind of graph with R_sq = 0.96846. In this case the performance discrepancy is considerably more pronounced. If such a discrepancy is observed with a robust learning algorithm, the level of data mismatch is very high.
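The robustness check above amounts to comparing R_sq under full-data-set training with R_sq under cross validation. The sketch below assumes R_sq is defined as 1 − var(y − ŷ)/var(y), consistent with the variance-based description of the performance measure; the helper names are hypothetical, and the numeric values are the ones quoted for the figures.

```python
def variance(v):
    """Population variance of a sequence of numbers."""
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / len(v)


def r_squared(y, y_hat):
    """R_sq = 1 - var(y - y_hat) / var(y) (assumed definition)."""
    resid = [a - p for a, p in zip(y, y_hat)]
    return 1.0 - variance(resid) / variance(y)


def robustness_gap(r2_full, r2_cv):
    """Full-data-set R_sq minus cross-validation R_sq; a small gap suggests robustness."""
    return r2_full - r2_cv


# R_sq values quoted for the first data set and for the multilayer perceptron.
gaps = {
    "MVLR (FIG. 3B vs 3A)": robustness_gap(0.83052, 0.86358),
    "BRN (FIG. 3D vs 3C)": robustness_gap(0.92077, 0.91497),
    "MLP (FIG. 4B vs 4A)": robustness_gap(0.96846, 0.90001),
}
for name, gap in gaps.items():
    print(f"{name}: {gap:+.5f}")
```

The multilayer perceptron shows by far the largest gap of the three, which is the pattern the text associates with memorization rather than generalization.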

If a learning algorithm simply memorizes the solution instead of generalizing, we expect to see a larger performance discrepancy, as shown in FIGS. 4A and 4B, which illustrate the memorization effect for a multilayer perceptron. FIG. 4A is based on cross validation, while FIG. 4B is based on full-data-set training using a neural network that minimizes the L-2 objective function.

FIG. 5 is a graph of multivariate linear regression (MVLR) predictions (x axis) versus Bayesian regularization network (BRN) predictions (y axis). While the two learning algorithms generally agree, in some cases they give different answers, as can be seen from the many scatter points far from the line of perfect agreement. Here, the predicted variable is the square root of future one-year cost (medical and pharmacy), using the prior 9 months of claims together with demographic data.

In summary, to ensure that our solution is accurate and robust, we rely on the following criteria: error minimization; model-complexity minimization; model-weight minimization for a smooth, generalized response; feature ranking to find the point of diminishing returns; minimization of the discrepancy between full-data-set training and 5-/10-fold cross validation; and multiple-model combining.

As an example implementation of the hierarchical tree network with leaf-specific feature optimization, we now look at how some leaves behave differently.

For members with catastrophic conditions, highly useful features come from cost trends associated with chronic conditions such as end-stage renal disease/chronic kidney disease, cancer, and rare diseases. Another interesting feature deals with recent hospital outpatient trends, since dialysis is typically performed there. The ratio of medical to Rx cost is also important, because we believe that, in this population segment, higher Rx cost with correspondingly lower medical cost typically reflects better health management.

Interestingly, the features that matter for members with both medical and Rx benefits are not important for members with only Rx claims with Humana; for the latter, the ratio of chronic Rx medication cost to total Rx cost and the amount of pharmacy claims are both important. That is, depending on the type of benefit plan a member has, the feature optimization algorithm automatically selects the optimal subset.

Diabetic members, on the other hand, have major cost drivers such as mental disorders (people recently diagnosed with a chronic condition tend to suffer from depression and other mental anxiety), vent, congestive heart failure, endocrine, digestive, and other complications. One such indicator is the number of different International Classification of Diseases (ICD) codes over the last nine months. ICD codes are often referred to as ICD-9 codes, after the 9th Revision of the code.

Nevertheless, there is no clear rule of thumb that can be used a priori to guess which features will be more effective for a particular leaf. To investigate this issue, we look at a number of two-dimensional scatter plots, shown in FIGS. 6A to 6F, of the following covariates against the square root of one-year future cost: number of unique drugs taken (FIG. 6A); total Rx cost (FIG. 6B); age (FIG. 6C); transplant-related cost (FIG. 6D); chronic cost (FIG. 6E); and coronary artery disease (CAD) cost (FIG. 6F). If a simple linear relationship existed, a straight, monotonically increasing line would appear. Unfortunately, none of these features shows such a clear trend, which indicates that no simple linear insight exists.

The first myth proven wrong is that the more drugs a person takes, the more he or she will cost in the future. FIG. 6A shows no such relationship. In FIG. 6B, total Rx cost looks more promising, although the large portion of the population (over 90%) in the first half of the plot presents a less-than-conclusive picture. FIG. 6C shows that the highest-risk age group is about 50 to 65 years old, but it also suggests that age is not very useful in predicting future cost. The same observation holds for both chronic cost in FIG. 6E and CAD cost in FIG. 6F. That is, having multiple chronic comorbid conditions does not automatically lead to unfavorable future outcomes. This observation sums up the difficulty involved in building predictive models that forecast future health status.

Therefore, instead of imposing our frequently wrong intuition on the predictive model, or throwing everything into the model (including plenty of useless junk), we construct a hierarchical tree network and let the optimization algorithm sort through the large number of features and select the optimal subset for each leaf. In other words, we let the data tell us what to implement for each leaf.

The present invention comprises a computer-based system for predicting an individual's future health status, including: a. a hierarchical tree network that assigns each member to one of thirty nodes as a function of five stratification variables; b. feature-subset selection for each node using a number of stochastic and combinatorial optimization algorithms; c. separate learning and scoring modules, depending on the mode of operation; d. an arithmetic-mean operator that seeks consensus between the two learning algorithms; and e. a report generator that outputs pertinent parameters to aid clinical intervention. Because five meaningful stratification variables are used in designing the hierarchical tree network, a turnkey solution is delivered in terms of data and experience-period requirements. In other words, we are not limited to processing only members who have both medical and Rx data and six or more months of claims experience. Moreover, the two learning algorithms complement each other, so that we operate in a robust subspace of the vector space spanned by the model-mismatch-versus-data-mismatch and modeling-flexibility-versus-model-complexity spectra.

The foregoing detailed description is given primarily for clarity of understanding, and no unnecessary limitations are to be understood from it; those skilled in the art will appreciate upon reading this specification that various modifications may be made without departing from the spirit of the invention.

Claims

A method of predicting the future health status of an individual, the method comprising:

a. establishing, for a plurality of members, a hierarchical tree network that assigns each member to only one of a plurality of nodes as a function of a plurality of stratification variables;

b. providing to a computer-based system, for the plurality of members and for each of the members, member demographic data, available member medical claim data, and available member pharmacy claim data;

c. performing feature selection for each of the plurality of nodes to identify, for each node, an optimal subset of features from a set comprising at least one of the member demographic data, the available member medical claim data, and the available member pharmacy claim data for all members assigned to that node;

d. training an MVLR algorithm and a BRN algorithm using at least one of the member demographic data, the available member medical claim data, and the available member pharmacy claim data, and storing the learned parameters in a database to create a learned-parameter database; and

e. for one or more of the members, using the learned-parameter database together with the member demographic data, the available member medical claim data, and the available member pharmacy claim data of the one or more members, calculating an MVLR future health status score using the MVLR algorithm, calculating a BRN future health status score using the BRN algorithm, and determining a final score by calculating the arithmetic mean of the MVLR future health status score and the BRN future health status score.

The method of claim 1, further comprising, after determining the final score, generating a member clinical condition report, wherein the member clinical condition report includes member identification information and the final score.

The method of claim 2, wherein the member clinical condition report further comprises all or part of any of the member's medical claim information, the member's pharmacy claim information, and the member's clinical condition data.

The method of claim 1, wherein, in establishing the hierarchical tree network, the plurality of stratification variables comprise at least one of: the member's enrollment duration; the presence or absence of a catastrophic condition for the member; the presence or absence of a diabetes flag for the member; whether the member has a prescription trigger or an inpatient admission trigger; and the availability of at least one of the member claim data.

The method of claim 4, wherein if the member's enrollment duration is less than six months, the member is assigned to one of a first set of nodes; and if the member's enrollment duration is six months or more, the member is assigned to one of a second set of nodes.

The method of claim 5, wherein: if the member's enrollment duration is less than six months and the member has a catastrophic condition, the member is assigned to a first subset of the first set of nodes; if the member's enrollment duration is less than six months and the member has no catastrophic condition, the member is assigned to one of the nodes of the first set that is not in the first subset of the first set of nodes, those nodes constituting a second subset of the first set of nodes; if the member's enrollment duration is six months or more and the member has a catastrophic condition, the member is assigned to a first subset of the second set of nodes; and if the member's enrollment duration is six months or more and the member has no catastrophic condition, the member is assigned to one of the nodes of the second set that is not in the first subset of the second set of nodes, those nodes constituting a second subset of the second set of nodes.

The method of claim 6,

wherein:

if the member's enrollment duration is less than six months, the member has no catastrophic condition, the member has a diabetes flag, and the member has a prescription trigger, the member is assigned to a first subset of the second subset of the first set of nodes;

if the member's enrollment duration is less than six months, the member has no catastrophic condition, the member has a diabetes flag, and the member has an inpatient admission trigger, the member is assigned to a second subset of the second subset of the first set of nodes;

if the member's enrollment duration is less than six months, the member has no catastrophic condition, the member has no diabetes flag, and the member has a prescription trigger, the member is assigned to a third subset of the second subset of the first set of nodes;

if the member's enrollment duration is less than six months, the member has no catastrophic condition, the member has no diabetes flag, and the member has an inpatient admission trigger, the member is assigned to a fourth subset of the second subset of the first set of nodes;

if the member's enrollment duration is six months or more, the member has no catastrophic condition, the member has a diabetes flag, and the member has a prescription trigger, the member is assigned to a first subset of the second subset of the second set of nodes;

if the member's enrollment duration is six months or more, the member has no catastrophic condition, the member has a diabetes flag, and the member has an inpatient admission trigger, the member is assigned to a second subset of the second subset of the second set of nodes;

if the member's enrollment duration is six months or more, the member has no catastrophic condition, the member has no diabetes flag, and the member has a prescription trigger, the member is assigned to a third subset of the second subset of the second set of nodes; and

if the member's enrollment duration is six months or more, the member has no catastrophic condition, the member has no diabetes flag, and the member has an inpatient admission trigger, the member is assigned to a fourth subset of the second subset of the second set of nodes.

The method of claim 7, wherein

each of the following comprises a first node for members having both medication and medical billing data, a second node for members having medication billing data only, and a third node for members having medical billing data only: the first subset of the first set of nodes; the first, second, third, and fourth subsets of the second subset of the first set of nodes; the first subset of the second set of nodes; and the first, second, third, and fourth subsets of the second subset of the second set of nodes.
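The stratification recited in the preceding dependent claims can be pictured as a short routing function. The Python below is only an illustrative sketch and is not part of the claimed method: the `Member` record, its field names, and the string labels are hypothetical, and the three data-asset leaves are assumed to be "both", "medication only", and "medical only".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    # All field names are hypothetical; the claims name only the concepts.
    enrollment_months: int
    catastrophic: bool
    diabetes_flag: bool
    trigger: str                  # "prescription" or "inpatient"
    has_medication_claims: bool
    has_medical_claims: bool


def assign_leaf(m: Member) -> tuple:
    """Return the sequence of stratification decisions that identifies a leaf."""
    # Level 1: enrollment duration splits the tree into two sets of nodes.
    path = ["<6mo" if m.enrollment_months < 6 else ">=6mo"]

    # Level 2: catastrophic members form the first subset of each set.
    if m.catastrophic:
        path.append("catastrophic")
    else:
        path.append("non-catastrophic")
        # Levels 3 and 4 apply only to non-catastrophic members.
        path.append("diabetes" if m.diabetes_flag else "no-diabetes")
        if m.trigger not in ("prescription", "inpatient"):
            raise ValueError(f"unknown trigger: {m.trigger!r}")
        path.append(m.trigger)

    # Final level: available data assets (assumed three-way split).
    if m.has_medication_claims and m.has_medical_claims:
        path.append("both")
    elif m.has_medication_claims:
        path.append("medication-only")
    elif m.has_medical_claims:
        path.append("medical-only")
    else:
        raise ValueError("member has no billing data")
    return tuple(path)
```

Under these assumptions, a member enrolled for three months with no catastrophic condition, a diabetes flag, an inpatient admission trigger, and medication billing data only is routed to the path `("<6mo", "non-catastrophic", "diabetes", "inpatient", "medication-only")`.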

The method of claim 4, wherein

said member has a catastrophic condition if said member suffers from at least one of any cancer, terminal disease, transplantation, rare disease, HIV, coronary artery disease and a combination of chronic heart failure and hypertension.

The method of claim 1,

wherein the plurality of nodes of the hierarchical tree network comprises thirty nodes:

a. A first node for any member having an enrollment duration of less than six months, presence of a catastrophic condition, and both medication and medical billing data;

b. A second node for any member having an enrollment duration of less than six months, presence of a catastrophic condition, and medication billing data only;

c. A third node for any member having an enrollment duration of less than six months, presence of a catastrophic condition, and medical billing data only;

d. A fourth node for any member having an enrollment duration of less than six months, absence of a catastrophic condition, presence of a diabetes flag, presence of a prescription trigger, and both medication and medical billing data;

e. A fifth node for any member having an enrollment duration of less than six months, absence of a catastrophic condition, presence of a diabetes flag, presence of a prescription trigger, and medication billing data only;

f. A sixth node for any member having an enrollment duration of less than six months, absence of a catastrophic condition, presence of a diabetes flag, presence of a prescription trigger, and medical billing data only;

g. A seventh node for any member having an enrollment duration of less than six months, absence of a catastrophic condition, presence of a diabetes flag, presence of an inpatient admission trigger, and both medication and medical billing data;

h. An eighth node for any member having an enrollment duration of less than six months, absence of a catastrophic condition, presence of a diabetes flag, presence of an inpatient admission trigger, and medication billing data only;

i. A ninth node for any member having an enrollment duration of less than six months, absence of a catastrophic condition, presence of a diabetes flag, presence of an inpatient admission trigger, and medical billing data only;

j. A tenth node for any member having an enrollment duration of less than six months, absence of a catastrophic condition, absence of a diabetes flag, presence of a prescription trigger, and both medication and medical billing data;

k. An eleventh node for any member having an enrollment duration of less than six months, absence of a catastrophic condition, absence of a diabetes flag, presence of a prescription trigger, and medication billing data only;

l. A twelfth node for any member having an enrollment duration of less than six months, absence of a catastrophic condition, absence of a diabetes flag, presence of a prescription trigger, and medical billing data only;

m. A thirteenth node for any member having an enrollment duration of less than six months, absence of a catastrophic condition, absence of a diabetes flag, presence of an inpatient admission trigger, and both medication and medical billing data;

n. A fourteenth node for any member having an enrollment duration of less than six months, absence of a catastrophic condition, absence of a diabetes flag, presence of an inpatient admission trigger, and medication billing data only;

o. A fifteenth node for any member having an enrollment duration of less than six months, absence of a catastrophic condition, absence of a diabetes flag, presence of an inpatient admission trigger, and medical billing data only;

p. A sixteenth node for any member having an enrollment duration of six months or more, presence of a catastrophic condition, and both medication and medical billing data;

q. A seventeenth node for any member having an enrollment duration of six months or more, presence of a catastrophic condition, and medication billing data only;

r. An eighteenth node for any member having an enrollment duration of six months or more, presence of a catastrophic condition, and medical billing data only;

s. A nineteenth node for any member having an enrollment duration of six months or more, absence of a catastrophic condition, presence of a diabetes flag, presence of a prescription trigger, and both medication and medical billing data;

t. A twentieth node for any member having an enrollment duration of six months or more, absence of a catastrophic condition, presence of a diabetes flag, presence of a prescription trigger, and medication billing data only;

u. A twenty-first node for any member having an enrollment duration of six months or more, absence of a catastrophic condition, presence of a diabetes flag, presence of a prescription trigger, and medical billing data only;

v. A twenty-second node for any member having an enrollment duration of six months or more, absence of a catastrophic condition, presence of a diabetes flag, presence of an inpatient admission trigger, and both medication and medical billing data;

w. A twenty-third node for any member having an enrollment duration of six months or more, absence of a catastrophic condition, presence of a diabetes flag, presence of an inpatient admission trigger, and medication billing data only;

x. A twenty-fourth node for any member having an enrollment duration of six months or more, absence of a catastrophic condition, presence of a diabetes flag, presence of an inpatient admission trigger, and medical billing data only;

y. A twenty-fifth node for any member having an enrollment duration of six months or more, absence of a catastrophic condition, absence of a diabetes flag, presence of a prescription trigger, and both medication and medical billing data;

z. A twenty-sixth node for any member having an enrollment duration of six months or more, absence of a catastrophic condition, absence of a diabetes flag, presence of a prescription trigger, and medication billing data only;

aa. A twenty-seventh node for any member having an enrollment duration of six months or more, absence of a catastrophic condition, absence of a diabetes flag, presence of a prescription trigger, and medical billing data only;

bb. A twenty-eighth node for any member having an enrollment duration of six months or more, absence of a catastrophic condition, absence of a diabetes flag, presence of an inpatient admission trigger, and both medication and medical billing data;

cc. A twenty-ninth node for any member having an enrollment duration of six months or more, absence of a catastrophic condition, absence of a diabetes flag, presence of an inpatient admission trigger, and medication billing data only; and

dd. A thirtieth node for any member having an enrollment duration of six months or more, absence of a catastrophic condition, absence of a diabetes flag, presence of an inpatient admission trigger, and medical billing data only.
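The count of thirty follows from the structure of the list: two enrollment-duration branches, each holding one catastrophic group plus four non-catastrophic groups (diabetes flag present or absent, crossed with the two triggers), with three data-asset nodes per group, so 2 × (1 + 4) × 3 = 30. The sketch below, offered only as an illustration, maps a member's attributes to the ordinal of the node in the list above. The function name, parameter names, and data labels are hypothetical, and the within-group order (both, medication only, medical only) is an assumption.

```python
# Assumed order of the three data-asset nodes inside every group.
DATA_ORDER = ("both", "medication_only", "medical_only")


def node_number(six_months_or_more: bool,
                catastrophic: bool,
                diabetes_flag: bool,
                inpatient_trigger: bool,
                data: str) -> int:
    """Return the 1-based position (1..30) of the matching node in the list."""
    offset = DATA_ORDER.index(data)   # raises ValueError for an unknown label

    if catastrophic:
        # Catastrophic members skip the diabetes and trigger levels.
        group = 0
    else:
        # Groups 1..4: diabetes flag present comes first, and within each
        # diabetes branch the prescription trigger precedes the inpatient one.
        group = 1 + (0 if diabetes_flag else 2) + (1 if inpatient_trigger else 0)

    base = 15 if six_months_or_more else 0   # 5 groups x 3 nodes per branch
    return base + 3 * group + offset + 1


if __name__ == "__main__":
    numbers = {
        node_number(e, c, d, t, a)
        for e in (False, True)
        for c in (False, True)
        for d in (False, True)
        for t in (False, True)
        for a in DATA_ORDER
    }
    print(len(numbers))   # 30
```

For example, a member with an enrollment duration of six months or more, no catastrophic condition, no diabetes flag, an inpatient admission trigger, and medical billing data only maps to node 30.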

The method of claim 10,

further comprising, after determining the final score, generating a member clinical status report, wherein the member clinical status report includes member identification information and the final score.

The method of claim 11,

The method of claim 10,