JP2009003537A

JP2009003537A - Computer

Info

Publication number: JP2009003537A
Application number: JP2007161456A
Authority: JP
Inventors: Teruo Seki; 輝夫関
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2007-06-19
Filing date: 2007-06-19
Publication date: 2009-01-08

Abstract

[Problem] To provide a computer that offers reliability and a real-time fault-tolerant function when a failure occurs, while making effective use of its processors.
[Solution] Sequential execution units 121 and 122 execute the processes that require reliability and a real-time fault-tolerant function when a failure occurs, and a parallel processing unit 123 executes the other processes. In the sequential execution units, each processor is placed in either the active system 125 or the standby system 126, and the corresponding processors in the active and standby systems (101 and 103, and 102 and 104) execute the same processing. When an active-system processor of a sequential execution unit fails, the standby-system processor switches to the active system, which realizes hot standby. When the failed processor is restored, it operates as the standby system. In addition, the parallel processing unit processors process tasks in parallel, and when a parallel processing unit processor fails, another parallel processing unit processor executes the tasks of the failed processor.
[Selected drawing] FIG. 1

Description

The present invention relates to a distributed parallel processing technique whose main purpose is to ensure real-time performance and survivability in, for example, a system that demands fast processing of event inputs and cannot tolerate processing delays.

In general, systems that require both distributed processing and redundant processing do not adopt an active/standby scheme, from the viewpoint of making effective use of computing resources. Instead, they perform distributed processing using all CPUs (Central Processing Units).
As an invention for improving processing performance and reliability at the same time, Japanese Patent Laid-Open No. 2002-342300 (title: Distributed Processing System, Distributed Processing Method, and Distributed Processing Control Program) (hereinafter referred to as Patent Document 1) has been disclosed.

The technique described in Patent Document 1 uses redundant computer module resources and combines distributed processing with redundant processing, thereby improving processing performance and reliability at the same time.
Specifically, when a failure occurs, tasks of lower importance are stopped first so that function and performance are degraded in a controlled way. This maintains the reliability, function, and performance of the important tasks, so that limited resources can be allocated to the most important purposes.
The technique described in Patent Document 1 is therefore highly effective when tasks differ in importance.

However, if a CPU fails while executing a task with a processing-time limit, the processing must be reallocated to another CPU and the calculation performed again, so the execution-time limit may not be met.
Moreover, because the processing is driven by event inputs, the calculation result may not be guaranteed. For the computer of a system that cannot tolerate these possibilities, this is a fatal problem.

An example of a system that cannot tolerate the above problems is the computer of a combat system. To ensure real-time performance and redundancy for event inputs, such as the input of a threat target from a firing system, such a computer has divided its multiple CPUs into two groups, an active system and a standby system, in a hot standby configuration (the active and standby systems both operate at all times, and both perform the calculation for each event input).

In the above computer, the CPUs are divided into an active system and a standby system, and corresponding CPUs in the two systems perform the same processing.
When an abnormality occurs in any active-system CPU, all CPUs are switched over from the active system to the standby system.
Because such a system prioritizes the fault-tolerant function, half of the CPUs cannot be used effectively even when the computer is under heavy load.
In addition, when a single active-system CPU fails, all active-system CPUs are switched over to the standby system, so the healthy CPUs of the active system can no longer be used for calculation.
Patent Document 1: JP 2002-342300 A

The system of Patent Document 1 can use its CPUs effectively, but has problems with recovery time and reliability when a failure occurs. The active/standby system described above prioritizes the fault-tolerant function, and therefore cannot make effective use of half of its CPUs.

Accordingly, one of the main objects of the present invention is to provide a computer configuration that makes effective use of the CPUs mounted in the computer while still prioritizing switchover time and reliability when a failure occurs.

A computer according to the present invention comprises:
a first processing unit including an active processor and a standby processor that replaces the active processor when a failure occurs in the active processor;
a second processing unit including two or more parallel processing processors that cooperate to perform distributed parallel processing; and
a processor management unit that, when a failure occurs in any parallel processing processor of the second processing unit that was executing processing involving communication with a processor of the first processing unit, designates a takeover parallel processing processor to take over and execute the failed-processor execution processing that the failed parallel processing processor was executing, and notifies at least one of the active processor and the standby processor that was the communication destination of the failed parallel processing processor in that processing of the takeover parallel processing processor.

Even when a failure occurs in a parallel processing processor that was executing processing involving communication with a processor of the first processing unit, the processor management unit notifies the first-processing-unit processor that was its communication partner of the takeover parallel processing processor. This allows a first processing unit with a fault-tolerant function to coexist with a second processing unit that performs distributed parallel processing for effective use of processor resources. As a result, reliability and a real-time fault-tolerant function at the time of a failure can be ensured while processor resources are used effectively.

Embodiment 1.
FIG. 1 is a system configuration diagram showing an overview of the processor configuration of a computer 1 according to Embodiment 1.
To clarify the difference from the conventional system, the configuration of the conventional system is shown in FIG. 8.
In FIG. 1, reference numerals 101 to 110 denote CPUs mounted in the computer 1 and connected by a network 130.
The CPUs are classified into sequential execution units 121 and 122, a parallel processing unit 123, and a parallel management unit 124.
A CPU that operates as part of a sequential execution unit is referred to as a sequential execution unit CPU or a sequential execution unit processor.
A CPU that operates as part of the parallel processing unit is referred to as a parallel processing unit CPU or a parallel processing unit processor.
A CPU that operates as part of the parallel management unit is referred to as a parallel management unit CPU or a parallel management unit processor.
The CPUs of the sequential execution units 121 and 122 and of the parallel management unit 124 are further classified into an active system 125 and a standby system 126.
A CPU classified into the active system is referred to as an active-system CPU or an active-system processor.
A CPU classified into the standby system is referred to as a standby-system CPU or a standby-system processor.

The sequential execution units 121 and 122 include active-system processors 101 and 102 and standby-system processors 103 and 104, which replace the active-system processors when a failure occurs in them. The sequential execution units 121 and 122 are examples of the first processing unit.
The parallel processing unit 123 includes two or more parallel processing unit processors (parallel processing processors) 105 to 108, which cooperate to perform distributed parallel processing. The parallel processing unit 123 is an example of the second processing unit.
The parallel management unit 124 includes an active-system processor 109 and a standby-system processor 110, which replaces the active-system processor when a failure occurs in it. The parallel management unit 124 is an example of the processor management unit.
In the parallel management unit 124, when a failure occurs in any of the parallel processing unit processors 105 to 108 that was executing processing involving communication with a processor of the sequential execution units 121 and 122, the active-system processor 109 or the standby-system processor 110 designates a takeover parallel processing processor to take over and execute the failed-processor execution processing that the failed parallel processing processor was executing. It then notifies at least one of the active-system processor and the standby-system processor of the sequential execution units 121 and 122 that was the communication destination of the failed parallel processing processor in that processing of the takeover parallel processing processor.

In this embodiment, the sequential execution units 121 and 122 execute the processes that require reliability and a real-time fault-tolerant function when a failure occurs, and the parallel processing unit 123 executes the other processes. In the sequential execution units 121 and 122, each processor is placed in either the active system 125 or the standby system 126, and the corresponding processors in the active and standby systems (101 and 103, and 102 and 104) execute the same processing. When an active-system processor of a sequential execution unit fails, the standby-system processor switches to the active system, which realizes hot standby. When the failed processor is restored, it operates as the standby system. In addition, the parallel processing unit processors process tasks in parallel, and when a parallel processing unit processor fails, another parallel processing unit processor executes the tasks of the failed processor.
With this configuration, a computer 1 is realized that, in a system requiring high overall processing performance, provides reliability and a real-time fault-tolerant function for specific processes when a failure occurs.

In FIG. 1, reference numerals 111 to 114 denote tasks that are statically placed on and executed by the sequential execution unit CPUs 101 to 104.
Reference numerals 115 to 118 denote tasks that are dynamically allocated to and executed by the parallel processing unit CPUs 105 to 108.
Reference numerals 119 and 120 denote tasks processed by the parallel management unit CPUs 109 and 110: tasks for managing the parallel processing unit 123 (monitoring the operating status of the parallel processing unit, monitoring its load, calculating the number of parallel processing unit CPUs, and managing parallel processing unit tasks) and tasks for managing communication within the computer.

The number of CPUs mounted in the computer 1 and the number of CPUs in the sequential execution units, the parallel processing unit, and the parallel management unit need not be as shown in FIG. 1, but the configuration must satisfy the following conditions.
In the sequential execution units 121 and 122, the number of active-system CPUs equals the number of standby-system CPUs, and each is one or more.
The parallel processing unit 123 has two or more CPUs.
The parallel management unit 124 has one or more CPUs, with the same number in the active and standby systems; one CPU in each is recommended. A redundant active/standby configuration is desirable for the parallel management unit 124 but is not essential, and the parallel management unit 124 may consist of a single CPU.
The active system and the standby system in FIG. 1 have separate power supply systems. The two systems need not have independent power supplies, but without them the redundancy against a power system abnormality shown in FIG. 4 is lost. FIG. 4 is described in detail later.
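The configuration conditions above can be checked mechanically. The following is a minimal sketch, not part of the disclosed embodiment; the class `ComputerConfig`, its field names, and the function `config_violations` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputerConfig:
    """CPU counts per role. All field names are hypothetical."""
    seq_active: int    # active-system CPUs in the sequential execution units
    seq_standby: int   # standby-system CPUs in the sequential execution units
    parallel: int      # CPUs in the parallel processing unit
    mgmt_active: int   # active-system CPUs in the parallel management unit
    mgmt_standby: int  # standby-system CPUs in the parallel management unit


def config_violations(cfg: ComputerConfig) -> list[str]:
    """Return the violated conditions (an empty list means the layout is valid)."""
    problems = []
    if cfg.seq_active < 1 or cfg.seq_active != cfg.seq_standby:
        problems.append("sequential: active and standby counts must be equal and >= 1")
    if cfg.parallel < 2:
        problems.append("parallel: at least 2 CPUs are required")
    # The management unit is either redundant (equal counts) or a single CPU.
    single_mgmt = cfg.mgmt_active == 1 and cfg.mgmt_standby == 0
    redundant_mgmt = cfg.mgmt_active >= 1 and cfg.mgmt_active == cfg.mgmt_standby
    if not (single_mgmt or redundant_mgmt):
        problems.append("management: equal active/standby counts >= 1, or a single CPU")
    return problems


# The layout of FIG. 1: CPUs 101-104 sequential, 105-108 parallel, 109-110 management.
fig1 = ComputerConfig(seq_active=2, seq_standby=2, parallel=4,
                      mgmt_active=1, mgmt_standby=1)
```

Calling `config_violations(fig1)` returns an empty list, while a layout with a single parallel processing unit CPU would be reported as invalid.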

In the configuration shown in FIG. 1, the sequential execution units 121 and 122 execute the tasks for which the switchover time on CPU failure and the calculation result must be guaranteed.
In the sequential execution units 121 and 122, tasks are placed statically, and the corresponding CPUs of the active and standby systems perform the same processing at all times, which realizes fast switchover when a failure occurs.
All events input to the sequential execution units 121 and 122 are input to, and processed by, both the active system and the standby system.
Because a CPU switchover on failure changes the communication partner, it is desirable that the static task assignment keep communication between one sequential execution unit CPU and another to a minimum.
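As a rough illustration of the hot standby behavior just described, the sketch below feeds one event to both CPUs of a pair and emits only the active CPU's result. The function `process_event` and the role strings `"active"` and `"standby"` are assumptions made for this example and do not come from the patent.

```python
def process_event(event, task, cpus):
    """Run the same task for one event on every CPU of a hot standby pair.

    cpus maps a CPU id to its role, "active" or "standby" (hypothetical labels).
    Every CPU computes the result, but only the active CPU's result is emitted.
    """
    results = {cpu_id: task(event) for cpu_id in cpus}
    emitted = [results[cpu_id] for cpu_id, role in cpus.items() if role == "active"]
    return results, emitted


# CPU 101 is the active CPU and CPU 103 its standby, as in FIG. 1.
results, emitted = process_event(3, lambda x: x * 2, {101: "active", 103: "standby"})
```

Because the standby CPU has already computed the same result, a switchover does not require recomputation, which is the property the static placement is meant to provide.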

FIGS. 2 and 3 show examples of CPU and communication switching when a CPU fails in the system configured as shown in FIG. 1.
FIG. 2 shows an example of CPU switching when the sequential execution units and the parallel processing unit are not communicating with each other, and FIG. 3 shows an example of CPU switching while they are communicating.
As in FIG. 1, reference numerals 101 to 104 denote sequential execution unit CPUs, 105 to 108 denote parallel processing unit CPUs, and 109 and 110 denote parallel management unit CPUs.
Reference numerals 211, 212, 311, and 312 denote examples of inter-CPU communication before a failure, and 213, 214, 313, and 314 denote examples of inter-CPU communication after a failure.

In FIG. 2, when the sequential execution unit CPU 101 fails, the system uses the processing result of the corresponding sequential execution unit CPU 103. CPU 102 continues processing as an active-system CPU, and its communication partner is changed from CPU 101 to CPU 103.
CPU 104 is the standby resource for CPU 102. After CPU 101 recovers from the failure, the processing result of CPU 103 continues to be used, but if CPU 103 then fails, the calculation result of CPU 101 is used again.
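The switchover and recovery behavior of one sequential execution pair can be sketched as a small state holder. This is an illustrative model only; the class `HotStandbyPair` and its methods are hypothetical, and the handling of a recovery when both CPUs were down is an assumption that the text does not address.

```python
class HotStandbyPair:
    """Tracks which CPU of an active/standby pair supplies the result."""

    def __init__(self, active_id, standby_id):
        self.active = active_id    # CPU whose result the system uses
        self.standby = standby_id  # None when no standby CPU is running

    def fail(self, cpu_id):
        if cpu_id == self.active:
            # The standby takes over; if there is none, active becomes None,
            # which corresponds to degraded operation without this function.
            self.active, self.standby = self.standby, None
        elif cpu_id == self.standby:
            self.standby = None

    def recover(self, cpu_id):
        if self.active is None:
            # Assumption: with no running CPU, the recovered one becomes active.
            self.active = cpu_id
        elif self.standby is None and cpu_id != self.active:
            # A recovered CPU rejoins as the standby; no switch-back occurs.
            self.standby = cpu_id


pair = HotStandbyPair(active_id=101, standby_id=103)
pair.fail(101)     # CPU 103's result is now used
pair.recover(101)  # CPU 101 returns as the standby
```

After these calls, a later `pair.fail(103)` makes CPU 101 the source of the result again, matching the FIG. 2 description.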

In FIG. 2, when the parallel processing unit CPU 108 fails, the parallel processing unit continues operating with the normally functioning CPUs 105 to 107.
That is, when CPU 108 fails, the parallel management unit CPU 109 detects the failure and reallocates the tasks that had been allocated to CPU 108 to one of the CPUs 105 to 107.
In the example of FIG. 2, the parallel management unit CPU 109 designates CPU 105 as the CPU to which the processing of CPU 108 is reallocated. The failed CPU 108 is an example of the failed parallel processing processor, and CPU 105, to which the processing of CPU 108 is reallocated, is an example of the takeover parallel processing processor.
CPU 107, which was communicating with CPU 108 before the failure, then resumes communication with CPU 105, to which the tasks of CPU 108 have been assigned, in accordance with an instruction from the parallel management unit CPU 109.
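The reallocation performed by the parallel management unit CPU can be sketched as follows. The text does not state how the takeover CPU is chosen, so the least-loaded policy used here is purely an assumption, as are the function name `reassign_tasks`, the dictionary layout, and the task labels.

```python
def reassign_tasks(assignments, peers, failed, healthy):
    """Move the failed CPU's tasks to a takeover CPU and redirect its peers.

    assignments: CPU id -> list of tasks (hypothetical layout)
    peers:       CPU id -> set of CPU ids it communicates with
    Returns the takeover CPU id and the sorted list of CPUs that were notified.
    """
    # Assumed policy: pick the healthy CPU with the fewest tasks (ties: lowest id).
    takeover = min(healthy, key=lambda cpu: (len(assignments[cpu]), cpu))
    assignments[takeover].extend(assignments.pop(failed))
    notified = []
    for cpu, partners in peers.items():
        if cpu == failed or failed not in partners:
            continue
        partners.discard(failed)
        if cpu != takeover:
            partners.add(takeover)  # the peer now talks to the takeover CPU
            notified.append(cpu)
    return takeover, sorted(notified)


assignments = {105: ["t115"], 106: ["t116", "x"], 107: ["t117", "y"], 108: ["t118"]}
peers = {107: {108}, 108: {107}}
takeover, notified = reassign_tasks(assignments, peers, failed=108, healthy=[105, 106, 107])
```

With these inputs, CPU 105 takes over the task of CPU 108 and CPU 107 is redirected to CPU 105, which mirrors the FIG. 2 example.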

FIG. 3 shows an operation example in which a CPU fails while communication is in progress between a sequential execution unit and the parallel processing unit.
If the sequential execution unit CPU 101 fails while it is performing communication 311 with the parallel processing unit CPU 107, the communication partner of CPU 107 is switched to CPU 103 by instruction of the parallel management unit CPU 109, and communication 313 is performed.
Similarly, if the parallel processing unit CPU 108 fails while it is performing communication 312 with the sequential execution unit CPU 102, the parallel management unit CPU 109 reallocates the tasks of CPU 108 to one of the parallel processing unit CPUs 105 to 107. By instruction of the parallel management unit CPU 109, the communication partner of CPU 102 is switched to CPU 105, to which the tasks of CPU 108 have been assigned, and communication 314 is performed. In this case, CPU 104, which operates as the standby system for CPU 102, is also notified that CPU 105 has been assigned the tasks of CPU 108.
In FIG. 3 as well, the failed CPU 108 is an example of the failed parallel processing processor, and CPU 105, to which the processing of CPU 108 is reallocated, is an example of the takeover parallel processing processor.

In this way, in this embodiment, when a failure occurs in any of the parallel processing unit CPUs 105 to 108 that was executing processing involving communication with the sequential execution unit CPUs 101 to 104, the parallel management unit CPUs 109 and 110 designate a parallel processing unit CPU to take over and execute the processing that the failed CPU was executing. They then notify both the active-system CPU and the standby-system CPU of the sequential execution unit that was the communication destination of the failed CPU (or, if one of the two has itself failed, only the other) of the parallel processing unit CPU that takes over the processing.

Next, an operation example of the computer 1 according to this embodiment is described with reference to FIGS. 9 and 10.

FIGS. 9 and 10 are flowcharts showing the task and communication switching operations when a CPU abnormality occurs in Embodiment 1.
The operation differs depending on whether the CPU in which the abnormality occurred (hereinafter CPU (1)) is a sequential execution unit CPU, a parallel processing unit CPU, or a parallel management unit CPU (step 901).

When CPU (1) is a sequential execution unit CPU and the standby-system CPU is running (YES in step 902), the standby-system CPU switches to become the active-system CPU (step 903). Communication addressed to a sequential execution unit CPU is always sent to both the active-system and standby-system CPUs. For transmissions from a sequential execution unit CPU, the standby-system CPU does not send messages, but it knows the communication partner, so no communication switching process is needed. In other words, when a sequential execution unit CPU fails and the standby-system CPU is running, the standby-system CPU takes over the processing of the active-system CPU without the involvement of the parallel management unit CPU. Because the standby-system CPU knows the parallel processing unit CPU that is the communication partner, it can send messages to the parallel processing unit CPU that was communicating with the failed active-system CPU, and the parallel management unit CPU does not need to notify the parallel processing unit CPU of its new communication partner.
If, on the other hand, the standby-system CPU is stopped in step 902, no CPU is available to take over the processing of the active-system CPU, so the system enters degraded operation without the function of that CPU.

When CPU (1) is a parallel processing unit CPU in step 901, the parallel management unit CPU performs the entire flow described below.
In step 904, the parallel management unit CPU determines the CPU to which the tasks of CPU (1) will be assigned (hereinafter CPU (2)).
Next, in step 905, the parallel management unit CPU determines whether the task of CPU (1) includes communication with another CPU.
If the task does not include communication (NO in step 905), the task of CPU (1) is assigned to CPU (2) selected in step 904 (step 906).
If the task of CPU (1) includes communication (YES in step 905), the processing differs depending on whether the communication partner CPU (hereinafter CPU (3)) is a sequential execution unit CPU, a parallel processing unit CPU, or a parallel management unit CPU (step 907).
If the task communicates with a combination of sequential execution unit CPUs, parallel processing unit CPUs, and parallel management unit CPUs, the flow for each communication partner (steps 908 to 913), excluding the task switching process, is carried out for every partner, and the task switching process is then performed (step 906 or 921).
When assigning a task to a parallel processing unit CPU, the parallel management unit CPU determines whether the assigned task includes communication and, if so, records the attribute of the communication partner. Whether each task includes communication, and the attributes of its communication partners, are set in advance as parameters so that the parallel management unit CPU can check them.
The parallel management unit CPU can therefore determine in step 905 whether the task of CPU (1) includes communication, and can determine in step 907 the attribute of the communication partner CPU (3).

ＣＰＵ（３）が並列処理部ＣＰＵである場合、ＣＰＵ（３）に通信相手の変更（ＣＰＵ（１）からＣＰＵ（２））を並列管理部ＣＰＵが通知し（ステップ９０８）、ＣＰＵ（２）にＣＰＵ（１）のタスクを割り当てる（ステップ９０６）。
ＣＰＵ（３）が逐次実行部ＣＰＵである場合、並列管理部ＣＰＵは、ＣＰＵ（３）に該当する常用系ＣＰＵ及び待機系ＣＰＵが稼動中であるか確認する（ステップ９０９）。
When CPU (3) is a parallel processing unit CPU, the parallel management unit CPU notifies CPU (3) of the change of communication partner (from CPU (1) to CPU (2)) (step 908) and assigns the task of CPU (1) to CPU (2) (step 906).
When CPU (3) is a sequential execution unit CPU, the parallel management unit CPU checks whether the active CPU and the standby CPU corresponding to CPU (3) are operating (step 909).
When both the active CPU and the standby CPU are operating (YES in step 909), it notifies both of them of the change of communication partner (from CPU (1) to CPU (2)) (step 910) and assigns the task of CPU (1) to CPU (2) (step 906).
When either the active CPU or the standby CPU of CPU (3) is stopped (NO in step 909, YES in step 911), it notifies the CPU (3) that is still operating of the change of communication partner (from CPU (1) to CPU (2)) (step 912) and assigns the task of CPU (1) to CPU (2) (step 906).
When both the active CPU and the standby CPU of CPU (3) are stopped (NO in step 909, NO in step 911), it checks whether the communication with CPU (3) includes synchronous communication (step 913). If it does not, the task of CPU (1) is assigned to CPU (2) (step 906); if it does, the task of CPU (1) is discarded (step 914).
The task is discarded in the synchronous case because, if it were allocated to CPU (2) while both the active and standby CPU (3) are stopped, CPU (2) would wait indefinitely for a message from the stopped CPU (3) and fall into a deadlock.
When CPU (3) is the parallel management unit CPU, no communication switching is needed, because the parallel management unit CPU itself controls the communication of the parallel processing unit CPUs; only the task switching process of step 921 is performed.
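The branching in steps 908 to 914 can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the names `Unit` and `plan_takeover`, the string labels, and the return shape are all assumptions introduced here.

```python
from enum import Enum


class Unit(Enum):
    """Unit that CPU (3), the communication partner of the failed CPU (1), belongs to."""
    PARALLEL = "parallel processing unit"
    SEQUENTIAL = "sequential execution unit"
    MANAGEMENT = "parallel management unit"


def plan_takeover(partner_unit, active_up=True, standby_up=True, uses_sync=False):
    """Return (partners_to_notify, reassign_task) for the task of a failed CPU (1).

    Hypothetical helper: partners_to_notify lists which instances of CPU (3)
    are told that the partner changed from CPU (1) to CPU (2); reassign_task
    is False when the task is discarded.
    """
    if partner_unit is Unit.MANAGEMENT:
        # The management CPU already controls the communication: task switching only.
        return [], True
    if partner_unit is Unit.PARALLEL:
        return ["cpu3"], True                      # step 908, then step 906
    # Sequential execution unit: check the active/standby pair (steps 909, 911).
    running = [name for name, up in (("active", active_up), ("standby", standby_up)) if up]
    if running:
        return running, True                       # step 910 or 912, then step 906
    if uses_sync:
        return [], False                           # step 914: discard to avoid deadlock
    return [], True                                # step 906 without any notification


print(plan_takeover(Unit.SEQUENTIAL, active_up=False, standby_up=True))
```

Returning the notification targets and the reassignment decision together keeps the deadlock rule (synchronous communication with a fully stopped partner) in one place.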

When CPU (1) is the parallel management unit CPU in step 901 and the standby CPU of the parallel management unit is operating (YES in step 914), the standby CPU switches over to become the active CPU (step 915), and the parallel management unit CPU notifies all of the parallel processing unit CPUs of the switchover (step 916).
On the other hand, when both the active and standby CPUs of the parallel management unit are stopped in step 914, no CPU is operating in the parallel management unit, and each parallel processing unit CPU continues its task processing as long as it does not need to communicate with the parallel management unit CPU (step 917).
When a parallel processing unit CPU needs to communicate with the parallel management unit CPU (YES in step 918), it waits until one of the parallel management unit CPUs is restarted (step 919).
When one of the parallel management unit CPUs is restarted in step 919, the restarted CPU sends a restart notification to all of the parallel processing unit CPUs (step 916), and processing continues.
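As a rough illustration of steps 914 to 919, the following Python sketch lists the actions taken when the active parallel management unit CPU fails. The function name, its two boolean parameters, and the action strings are assumptions made for this example only.

```python
def management_failover(standby_running, needs_management_comm):
    """List the actions that follow a failure of the active management CPU.

    Hypothetical helper: standby_running says whether the standby management
    CPU is operating; needs_management_comm says whether a parallel processing
    CPU has reached a point where it must talk to the management unit.
    """
    if standby_running:
        return [
            "promote standby management CPU to active (step 915)",
            "notify all parallel processing CPUs of the switchover (step 916)",
        ]
    # Both management CPUs are stopped: workers carry on for as long as they can.
    actions = ["continue task processing (step 917)"]
    if needs_management_comm:
        actions.append("wait until a management CPU restarts (step 919)")
        actions.append("restarted management CPU notifies all parallel processing CPUs (step 916)")
    return actions


for action in management_failover(standby_running=False, needs_management_comm=True):
    print(action)
```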

FIG. 4 shows an example of the CPU switching operation when a power supply abnormality occurs in the system configured as shown in FIG. 1.
The active system 413, composed of CPUs 101, 102, 105, 106, and 109, and the standby system 414, composed of CPUs 103, 104, 107, 108, and 110, have independent power supplies: power supply 1 (411) and power supply 2 (412), respectively.
Note that the distinction between the active system 413 and the standby system 414 in FIG. 4 is a distinction by power supply system and does not necessarily coincide with the active system 125 and the standby system 126 in FIG. 1.
Specifically, in FIG. 1 the parallel processing unit CPUs 105 to 108 are not divided into active and standby, whereas in FIG. 4 CPUs 105 and 106 belong to the active power supply system and CPUs 107 and 108 belong to the standby power supply system.
When an abnormality occurs in power supply 1 (411), the active-side power supply, the processing of the active CPUs 101, 102, 105, 106, and 109 is switched to the standby CPUs 103, 104, 107, 108, and 110.
That is, when power supply 1 (411) fails, the standby CPUs 103 and 104 of the sequential execution units take over the processing that the active CPUs 101 and 102 were executing, the parallel processing unit CPUs 107 and 108 take over the processing that the parallel processing unit CPUs 105 and 106 were executing, and the standby CPU 110 of the parallel management unit takes over the processing that the active CPU 109 was executing.
For example, CPUs 107 and 108, which were performing periodic process C before the failure of power supply 1, take over periodic processes A and B from CPUs 105 and 106 and, after the failure, perform periodic processes A to C.
Switching from the active system 413 to the standby system 414 follows the processing shown in FIG. 9. Specifically, CPU (1) in step 901 of FIG. 9 is taken to cover all of the sequential execution unit CPUs, parallel processing unit CPUs, and parallel management unit CPU on the failed power supply: the processing from step 902 onward is performed for the sequential execution unit CPUs 101 and 102, the processing from step 904 onward for the parallel processing unit CPUs 105 and 106, and the processing of step 914 for the parallel management unit CPU 109.
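The power-supply failover can be pictured as moving every process on the failed supply to a CPU on the other supply. The Python sketch below is a simplified illustration: the one-to-one pairing in `ACTIVE_TO_STANDBY` (for example 105 to 107 and 106 to 108), the function name, and the workload layout are assumptions, since the text only states which groups of CPUs take over.

```python
# Assumed pairing between CPUs on power supply 1 and CPUs on power supply 2.
ACTIVE_TO_STANDBY = {101: 103, 102: 104, 105: 107, 106: 108, 109: 110}


def fail_power_domain(workload, takeover=ACTIVE_TO_STANDBY):
    """Return the workload after power supply 1 fails.

    workload maps a CPU reference numeral to the list of processes it runs.
    Processes of each failed CPU move to its paired CPU; a process the
    successor already runs (hot standby) is not duplicated.
    """
    result = {cpu: list(procs) for cpu, procs in workload.items() if cpu not in takeover}
    for failed, successor in takeover.items():
        for proc in workload.get(failed, []):
            if proc not in result.setdefault(successor, []):
                result[successor].append(proc)
    return result


before = {105: ["A"], 106: ["B"], 107: ["C"], 108: ["C"], 101: ["S1"], 103: ["S1"]}
after = fail_power_domain(before)
print(after)
```

With this hypothetical workload, CPUs 107 and 108 together end up running periodic processes A to C, and CPU 103 simply keeps the sequential process it was already mirroring.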

Next, an example of how the parallel management unit assigns tasks to the parallel processing unit is described.
Any method common in the field of distributed and parallel processing may be used for this assignment. Here, the number of CPUs to use is calculated from a prediction of the task's execution time, and the task is assigned to CPUs with a low load.

FIG. 5 shows a method that executes a task with as few CPUs as possible while still satisfying the user's requirements.
The following description assumes task assignment to the parallel processing unit CPUs by the parallel management unit CPU before a failure occurs, but the method shown in FIG. 5 may also be applied to task assignment after a failure has occurred.

First, the user defines an execution time threshold X and a reduction time threshold Y (both X and Y are greater than 0).
The execution time threshold X is the execution time that can be tolerated for one process; the task is executed with the smallest number of CPUs that can complete it within X. In other words, X is the upper allowable limit on the execution time of distributed parallel processing by the parallel processing unit processors.
The reduction time threshold Y is the reduction in execution time that is required in return for using one more CPU. If the reduction does not reach Y, the task is processed without adding CPUs even if its processing time does not fall within X. In other words, Y is the lower allowable limit on the reduction in execution time obtained when the number of parallel processing unit processors is increased by one.

The parallel management unit CPU predicts the task execution times P(1) to P(N), where N is the number of CPUs installed in the computer 1 (step 501). P(1) is the predicted execution time when the task is executed by one CPU, and P(N) is the predicted execution time when the task is executed by distributed parallel processing on N CPUs.
The parallel management unit CPU determines whether the predicted execution time with one CPU, P(1), is equal to or less than X (steps 502 and 503). If the condition is satisfied (YES in step 503), it assigns the task to the parallel processing unit CPU with the lowest load factor.
If the condition in step 503 is not satisfied (NO in step 503), the parallel management unit CPU determines whether using one more CPU would achieve the user-defined execution time reduction (reduction time threshold Y) (step 504). If it would (YES in step 504), the number of CPUs is increased by one (step 505) and the determination of step 503 is repeated, until the maximum number of CPUs is reached.
If the condition of step 504 is not satisfied (NO in step 504), or if the determination has reached the maximum number of CPUs (YES in step 506), the parallel management unit CPU refers to the load status of the parallel processing unit (step 507) and divides the task so that it can be executed by n CPUs (step 508).
The parallel management unit CPU then assigns the divided tasks to CPUs in ascending order of load factor (step 509). When the task is executed with the maximum number of CPUs, the load status is not consulted and the divided tasks are assigned to all CPUs.
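The flow of steps 501 to 509 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names, the list representation of P(1) to P(N), the load dictionary, and the numeric values are all hypothetical, and the maximum-CPU check is performed before the reduction is computed so that P(N+1) is never needed.

```python
def choose_cpu_count(predicted, x, y):
    """Pick the number of CPUs n for a task.

    predicted[i] is the predicted execution time P(i + 1) on i + 1 CPUs,
    x is the execution time threshold X, y is the reduction time threshold Y.
    """
    n_max = len(predicted)
    n = 1
    while True:
        if predicted[n - 1] <= x:                  # step 503: fast enough with n CPUs
            return n
        if n == n_max:                             # step 506: no more CPUs to add
            return n
        if predicted[n - 1] - predicted[n] < y:    # step 504: one more CPU gains too little
            return n
        n += 1                                     # step 505


def pick_least_loaded(loads, n):
    """Return the n CPU names with the lowest load factor (steps 507 and 509)."""
    return sorted(loads, key=loads.get)[:n]


# Hypothetical predictions in arbitrary time units, with X = 5 and Y = 2.
n = choose_cpu_count([10, 7, 4], x=5, y=2)
loads = {"cpu105": 0.7, "cpu106": 0.2, "cpu107": 0.5, "cpu108": 0.9}
print(n, pick_least_loaded(loads, n))
```

With these example numbers, three CPUs are needed before the predicted time drops to X or below, so the three least loaded CPUs are selected.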

A specific example of the process shown in FIG. 5 is given in FIG. 11.
In FIG. 11(A), with the number of CPUs n = 1, the predicted execution time P(1) is larger than the execution time threshold X, and the reduction from P(1) to P(2) is larger than the reduction time threshold Y.
The parallel management unit CPU therefore increases the number of CPUs by one, to n = 2. In this case the predicted execution time P(2) is still larger than X, and the reduction from P(2) to P(3) is larger than Y.
The parallel management unit CPU therefore increases the number of CPUs by one again, to n = 3. Since the predicted execution time P(3) is smaller than X, the parallel management unit CPU decides to execute the task with three CPUs.

In FIG. 11(B), with the number of CPUs n = 1, the predicted execution time P(1) is larger than the execution time threshold X, and the reduction from P(1) to P(2) is larger than the reduction time threshold Y.
The parallel management unit CPU therefore increases the number of CPUs by one, to n = 2. In this case the predicted execution time P(2) is larger than X, but the reduction from P(2) to P(3) is smaller than Y, so the parallel management unit CPU decides to execute the task with two CPUs.

In this way, the parallel management unit CPU obtains from the user, as the execution time threshold, the upper allowable limit on the execution time of distributed parallel processing by the parallel processing unit CPUs, and obtains, as the reduction time threshold, the lower allowable limit on the reduction in execution time when the number of parallel processing unit CPUs is increased by one. It calculates the predicted execution time of a specific task under distributed parallel processing while increasing the number of parallel processing unit CPUs one at a time, compares each predicted execution time with the execution time threshold, and compares the reduction from each predicted execution time to the predicted execution time with one more CPU against the reduction time threshold. Based on both comparison results, it determines the number of parallel processing unit CPUs to which the specific task is assigned.
Specifically, when the reduction is less than the reduction time threshold, the parallel management unit CPU adopts the number of CPUs of the current predicted execution time as the number of CPUs to which the task is assigned. When the reduction is equal to or greater than the reduction time threshold, it compares the predicted execution time with one more CPU against the execution time threshold. If that predicted execution time is equal to or less than the execution time threshold, it adopts that number of CPUs; if it exceeds the execution time threshold, it goes on to compare the reduction from that predicted execution time to the predicted execution time with one further CPU against the reduction time threshold.

As described above, the computer configuration of the first embodiment guarantees the execution time of specific tasks, ensures the resilience of the computer as a whole, and keeps the loss of processing performance when a failure occurs to a minimum.

This embodiment has described a distributed parallel processing device for a distributed processing system that is composed of a plurality of processors and distributes its processing, in which each processor selectively operates as an active system or a standby system, and which includes means for processing some of the tasks input to the system in parallel and fault tolerance processing means for performing fault tolerance processing together with the distributed parallel processing.

This embodiment has also described a distributed parallel processing device in which part of the fault tolerance means is provided in a management processor that does not itself perform distributed parallel processing, and which is composed of processors that execute both distributed parallel processing and the fault tolerance means, processors that execute distributed parallel processing, and processors that execute fault tolerance processing.

This embodiment has further described a distributed parallel processing device that physically has a plurality of power supply systems and includes means for placing processors on each power supply system, means for selecting and switching between power supply systems, means for detecting an abnormality in a power supply system, and means for switching to another power supply system when such an abnormality occurs.

Embodiment 2.
FIG. 6 is a system configuration diagram showing an overview of the processor configuration of the computer 1 according to the second embodiment.
The sequential execution unit CPUs 601 to 604, the parallel processing unit CPUs 605 to 608, the active system 613, and the standby system 614 are configured in the same way as in the first embodiment shown in FIG. 1.
The second embodiment differs in that the parallel processing unit tasks 609 to 612 shown in FIG. 6 are not the same as the parallel processing unit tasks 115 to 118 in FIG. 1, and in that the parallel management unit 124 shown in FIG. 1 does not exist.
That is, some or all of the tasks 119 to 120 of the parallel management unit CPU shown in FIG. 1 are processed by the parallel processing unit tasks 609 to 612 of FIG. 6.
All of the processing of the parallel management unit in FIG. 1 could be handled by the parallel processing unit in FIG. 6, but here, in consideration of the processing load, the monitoring of the load status of the parallel processing unit and the calculation of the number of parallel processing unit CPUs shown in FIG. 5 are not performed.

The operation when a failure occurs in a sequential execution unit CPU or a parallel processing unit CPU shown in FIG. 6 is the same as in the first embodiment shown in FIGS. 2 and 3, except that communication spanning the sequential execution units and the parallel processing unit is controlled by the parallel processing unit CPUs.

Thus, in this embodiment, a parallel processing unit processor included in the parallel processing unit (second processing unit) operates as the parallel management unit (processor management unit) described in the first embodiment. As with that parallel management unit, when a failure occurs in any parallel processing unit processor that was executing processing involving communication with a sequential execution unit processor, it designates a takeover parallel processing processor that takes over and executes the process the failed parallel processing processor was executing, and notifies at least one of the active processor and the standby processor that was the communication destination of the failed processor in that process of the takeover parallel processing processor.

Task allocation when a CPU failure occurs in the parallel processing unit may use any generally published method, such as the method disclosed in Patent Document 1.
Here, as shown in FIG. 7, the CPUs that take over the tasks of failed CPUs are selected sequentially in order of CPU number.
Reference numerals 701 to 708 in FIG. 7 denote parallel processing unit CPUs; the configuration differs from that in FIG. 2 to simplify the explanation.
When the first CPU failure 709 occurs in the parallel processing unit, the task of CPU 705, in which CPU failure 709 occurred, is assigned to CPU 701, which has CPU number #2.
Next, when CPU failure 710 occurs, the task of CPU 703, in which CPU failure 710 occurred, is assigned to CPU 702, which has CPU number #3.
Further, when CPU failure 711 occurs, CPU 703 with CPU number #4 has already failed, so the task of CPU 707, in which CPU failure 711 occurred, is assigned to CPU 704, which has CPU number #5.
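The sequential selection of FIG. 7 amounts to a cursor that walks through the CPUs in number order and skips any that have already failed. The Python sketch below illustrates this under assumptions: the function names are hypothetical, CPUs are identified by their reference numerals, and the takeover order 701, 702, 703, 704 reflects only the CPU numbers #2 to #5 named in the text.

```python
def make_takeover_allocator(order):
    """Create a function that picks the takeover CPU for each newly failed CPU.

    order lists the candidate takeover CPUs in ascending CPU-number order.
    A later failure of a CPU that has already taken over tasks is not
    modelled in this sketch.
    """
    failed = set()
    cursor = 0

    def on_failure(cpu):
        nonlocal cursor
        failed.add(cpu)
        # Skip candidates that have themselves failed.
        while cursor < len(order) and order[cursor] in failed:
            cursor += 1
        if cursor == len(order):
            return None                # no candidate left
        successor = order[cursor]
        cursor += 1
        return successor

    return on_failure


on_failure = make_takeover_allocator([701, 702, 703, 704])
for failed_cpu in (705, 703, 707):     # CPU failures 709, 710, 711 in the text
    print(failed_cpu, "->", on_failure(failed_cpu))
```

For the three failures described above, the sketch selects CPU 701, then CPU 702, then CPU 704, skipping the failed CPU 703.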

With the method of FIG. 7, each CPU periodically reports its own operating status to the other CPUs, so that every CPU knows the number of normally operating CPUs (or failed CPUs) and their CPU numbers. In addition, all CPUs share the same parallel task management information, and the programs are deployed so that every CPU can execute every process, which allows another CPU to take over the tasks of a failed CPU.

As described above, the computer configuration of the second embodiment (FIG. 6) guarantees the execution time of specific tasks, ensures the resilience of the computer as a whole, and keeps the loss of processing performance when a failure occurs to a minimum.
In addition, the number of CPUs can be reduced by two or more compared with the configuration of the first embodiment (FIG. 1).

This embodiment has thus described a distributed parallel processing device in which the fault tolerance means is provided in each processor, and which includes means for switching between the active system and the standby system when any processor fails, means for allocating tasks to other processors when a processor executing parallel processing fails, and means for having each processor run the fault tolerance means together with distributed parallel processing.

Finally, an example of the hardware configuration of the computer 1 described in the first and second embodiments is explained.
FIG. 12 is a diagram showing an example of the hardware resources of the computer 1 described in the first and second embodiments.
The configuration in FIG. 12 is only one example of the hardware configuration of the computer 1; the hardware configuration is not limited to that shown in FIG. 12 and may be different.

As shown in FIGS. 1 and 6 and elsewhere, the computer 1 includes a CPU 911 (Central Processing Unit; also called a processing device, arithmetic device, microprocessor, microcomputer, or processor) that executes programs. The CPU 911 is connected via a bus 912 to, for example, a ROM (Read Only Memory) 913, a RAM (Random Access Memory) 914, a communication board 915, a display device 901, a keyboard 902, a mouse 903, and a magnetic disk device 920, and controls these hardware devices. The CPU 911 may also be connected to an FDD 904 (Flexible Disk Drive), a compact disc device 905 (CDD), a printer device 906, and a scanner device 907. A storage device such as an optical disc device or a memory card reader/writer may be used in place of the magnetic disk device 920.
The RAM 914 is an example of volatile memory. The storage media of the ROM 913, the FDD 904, the CDD 905, and the magnetic disk device 920 are examples of nonvolatile memory. All of these are examples of storage devices.
The communication board 915, the keyboard 902, the scanner device 907, the FDD 904, and the like are examples of input devices.
The communication board 915, the display device 901, the printer device 906, and the like are examples of output devices.

The communication board 915 may be connected to, for example, a LAN (local area network), the Internet, or a WAN (wide area network).
The magnetic disk device 920 stores an operating system 921 (OS), a window system 922, a program group 923, and a file group 924. The programs in the program group 923 are executed by the CPU 911, the operating system 921, and the window system 922.

The program group 923 stores the programs that execute the functions described as the "sequential execution unit", "parallel processing unit", and "parallel management unit" in the first and second embodiments. These programs are read and executed by the CPU 911.

The file group 924 stores, as items of "... files" and "... databases", the information, data, signal values, variable values, and parameters that represent the results of the processing described in the first and second embodiments as "determination of ...", "calculation of ...", "comparison of ...", "computation of ...", "assignment of ...", "setting of ...", "registration of ...", and so on.
These "... files" and "... databases" are stored on a recording medium such as a disk or memory. The information, data, signal values, variable values, and parameters stored on such a medium are read into main memory or cache memory by the CPU 911 via a read/write circuit and used for CPU operations such as extraction, search, reference, comparison, calculation, processing, editing, output, printing, and display.
During these CPU operations, the information, data, signal values, variable values, and parameters are temporarily held in main memory, registers, cache memory, buffer memory, and the like.
The arrows in the flowcharts described in the first and second embodiments mainly indicate the input and output of data and signals. Data and signal values are recorded on recording media such as the memory of the RAM 914, the flexible disk of the FDD 904, the compact disc of the CDD 905, the magnetic disk of the magnetic disk device 920, and other optical discs, MiniDiscs, and DVDs. Data and signals are transmitted online over the bus 912, signal lines, cables, and other transmission media.

What is described as the operation of a "... unit" in the first and second embodiments may also be a "... step", a "... procedure", or a "... process".

As described above, the computer 1 of the first and second embodiments includes a CPU as a processing device; memory, a magnetic disk, and the like as storage devices; a keyboard, a mouse, a communication board, and the like as input devices; and a display device, a communication board, and the like as output devices, and it realizes the prescribed processing using these processing, storage, input, and output devices.

FIG. 1 is a block diagram showing the basic configuration of the computer in the first embodiment.
FIG. 2 is a diagram showing an example of the processor switching operation when a failure occurs in the computer in the first embodiment.
FIG. 3 is a diagram showing an example of the switching operation of inter-processor communication when a failure occurs in the computer in the first embodiment.
FIG. 4 is a diagram showing an example of the switching operation when the power supply system of the computer in the first embodiment is abnormal.
FIG. 5 is a flowchart showing the operation of allocating parallel execution tasks to processors in the computer in the first embodiment.
FIG. 6 is a block diagram showing the basic configuration of the computer in the second embodiment.
FIG. 7 is a diagram showing an example of the processor switching operation when a failure occurs in the computer in the second embodiment.
FIG. 8 is a block diagram showing the configuration of a conventional distributed system that emphasizes the fault tolerant function.
FIG. 9 is a flowchart showing an example of the operation when a failure occurs in the computer in the first embodiment.
FIG. 10 is a flowchart showing an example of the operation when a failure occurs in the computer in the first embodiment.
FIG. 11 is a diagram showing a specific example of the allocation of parallel execution tasks to processors in the computer in the first embodiment.
FIG. 12 is a diagram showing an example of the hardware configuration of the computer in the first and second embodiments.

Explanation of symbols

1 computer, 121 sequential execution unit, 122 sequential execution unit, 123 parallel processing unit, 124 parallel management unit, 125 active system, 126 standby system, 413 active system, 414 standby system, 613 active system, 614 standby system.

Claims

1. A computer comprising:
a first processing unit including an active processor and a standby processor that replaces the active processor when a failure occurs in the active processor;
a second processing unit including two or more parallel processors that cooperate to perform distributed parallel processing; and
a processor management unit that, when a failure occurs in any of the parallel processors of the second processing unit that is executing processing involving communication with a processor of the first processing unit, designates a takeover parallel processor to take over and execute the failed-processor execution process that the failed parallel processor was executing, and notifies the takeover parallel processor to at least one of the active processor and the standby processor that were communication destinations of the failed parallel processor in the failed-processor execution process.

2. The computer according to claim 1, wherein
the processor management unit includes an active processor and a standby processor that replaces the active processor when a failure occurs in the active processor, and
at least one of that active processor and that standby processor, when a failure occurs in any of the parallel processors that is executing processing involving communication with a processor of the first processing unit, designates a takeover parallel processor to take over and execute the failed-processor execution process that the failed parallel processor was executing, and notifies the takeover parallel processor to at least one of the active processor and the standby processor that were communication destinations of the failed parallel processor in the failed-processor execution process.

3. The computer according to claim 1 or 2, wherein
one or more of the parallel processors included in the second processing unit operate as the processor management unit and, when a failure occurs in any of the parallel processors that is executing processing involving communication with a processor of the first processing unit, designate a takeover parallel processor to take over and execute the failed-processor execution process that the failed parallel processor was executing, and notify the takeover parallel processor to at least one of the active processor and the standby processor that were communication destinations of the failed parallel processor in the failed-processor execution process.

4. The computer according to claim 1, wherein
the processor management unit, when the failed-processor execution process that the failed parallel processor was executing includes communication with any other parallel processor, notifies the takeover parallel processor to the parallel processor that was the communication destination of the failed parallel processor in the failed-processor execution process.
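The takeover and notification behavior recited in the claims above can be illustrated with a minimal Python sketch. It is not the patented implementation: the class and field names, the processor identifiers, and the rule of picking the first healthy parallel processor as the takeover processor are all assumptions made for illustration, since the claims do not state how the takeover processor is selected.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    name: str
    owner: str  # parallel processor currently executing the process
    peers: set[str] = field(default_factory=set)  # communication destinations


class ProcessorManager:
    """Hypothetical model of the processor management unit."""

    def __init__(self, parallel_processors: list[str]) -> None:
        self.healthy = list(parallel_processors)
        self.processes: list[Process] = []
        # Each entry is (notified peer, process name, takeover processor).
        self.notifications: list[tuple[str, str, str]] = []

    def on_failure(self, failed: str) -> None:
        """Reassign every process of the failed processor and notify its peers."""
        self.healthy.remove(failed)
        if not self.healthy:
            raise RuntimeError("no parallel processor left to take over")
        for proc in self.processes:
            if proc.owner != failed:
                continue
            takeover = self.healthy[0]  # assumption: first healthy processor
            proc.owner = takeover
            for peer in sorted(proc.peers):
                self.notifications.append((peer, proc.name, takeover))


manager = ProcessorManager(["par-1", "par-2", "par-3"])
manager.processes.append(
    Process("track", owner="par-1", peers={"seq-active", "seq-standby", "par-3"})
)
manager.on_failure("par-1")
```

In this example the process `track` moves from `par-1` to `par-2`, and both the sequential-side peers and the parallel peer `par-3` are told which processor now runs it, so that they can redirect their communication.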

5. The computer according to claim 2, wherein
the active processor of the first processing unit, some of the parallel processors included in the second processing unit, and the active processor of the processor management unit are connected to an active-system power supply and are powered by it,
the standby processor of the first processing unit, the remaining parallel processors included in the second processing unit, and the standby processor of the processor management unit are connected to a standby-system power supply and are powered by it, and
when a failure occurs in the active-system power supply, the standby processor of the first processing unit takes over and executes the processing that the active processor of the first processing unit was executing, the remaining parallel processors of the second processing unit take over and execute the processing that the parallel processors on the active-system power supply were executing, and the standby processor of the processor management unit takes over and executes the processing that the active processor of the processor management unit was executing.
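As a rough illustration of the power-supply arrangement in the claim above, the following Python sketch groups processors by the supply that feeds them and lists which ones remain available in each unit after one supply fails. The `Processor` record, the unit labels, and the processor names are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Processor:
    name: str
    unit: str    # "first", "second", or "manager" (hypothetical labels)
    supply: str  # "active" or "standby" power supply


def survivors_after_supply_failure(
    processors: list[Processor], failed_supply: str
) -> dict[str, list[str]]:
    """Return, per unit, the processors still powered after a supply fails."""
    alive: dict[str, list[str]] = {}
    for p in processors:
        if p.supply != failed_supply:
            alive.setdefault(p.unit, []).append(p.name)
    return alive


layout = [
    Processor("seq-active", "first", "active"),
    Processor("seq-standby", "first", "standby"),
    Processor("par-1", "second", "active"),
    Processor("par-2", "second", "active"),
    Processor("par-3", "second", "standby"),
    Processor("par-4", "second", "standby"),
    Processor("mgr-active", "manager", "active"),
    Processor("mgr-standby", "manager", "standby"),
]
remaining = survivors_after_supply_failure(layout, "active")
```

Because every unit keeps at least one processor on the standby-system supply, each unit still has a processor available to take over when the active-system supply fails.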

6. The computer according to claim 1, wherein
the processor management unit:
acquires, as an execution time threshold, the upper limit of the execution time of distributed parallel processing by the parallel processors;
acquires, as a reduction time threshold, the lower-limit allowable value of the reduction in execution time obtained when the number of parallel processors is increased by one;
calculates the predicted execution time of a specific task under distributed parallel processing while increasing the number of parallel processors one at a time, compares each predicted execution time with the execution time threshold, and compares the reduction from each predicted execution time to the predicted execution time with one more parallel processor against the reduction time threshold; and
determines the number of parallel processors to which the specific task is assigned based on the result of comparing the predicted execution time with the execution time threshold and the result of comparing the reduction in predicted execution time with the reduction time threshold.

7. The computer, wherein
the processor management unit:
when the comparison shows that the reduction in predicted execution time is less than the reduction time threshold, determines the number of parallel processors of that predicted execution time as the number of parallel processors to which the specific task is assigned; and
when the comparison shows that the reduction is equal to or greater than the reduction time threshold, compares the predicted execution time with one more parallel processor against the execution time threshold; if that predicted execution time is less than or equal to the execution time threshold, determines the number of parallel processors of that predicted execution time as the number of parallel processors to which the specific task is assigned; and if that predicted execution time exceeds the execution time threshold, compares the reduction from that predicted execution time to the predicted execution time with one further parallel processor against the reduction time threshold.
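The decision procedure in the claim above can be sketched in Python as follows. This is a hedged reading of the claim language, not the flowchart of FIG. 5: starting from one processor, the `max_processors` cap, and the example prediction model (a fixed serial part plus a perfectly divisible part) are assumptions introduced for illustration.

```python
from typing import Callable


def choose_processor_count(
    predict: Callable[[int], float],
    exec_time_threshold: float,
    reduction_threshold: float,
    max_processors: int,
) -> int:
    """Pick the number of parallel processors for one task.

    predict(n) is a hypothetical function returning the predicted
    execution time when the task runs on n parallel processors.
    """
    n = 1  # assumption: the search starts from a single processor
    while n < max_processors:
        reduction = predict(n) - predict(n + 1)
        if reduction < reduction_threshold:
            # Adding a processor no longer pays off: stay at n.
            return n
        n += 1
        if predict(n) <= exec_time_threshold:
            # Fast enough with n processors: stop here.
            return n
        # Still too slow: loop and test the next reduction.
    return n  # assumption: fall back to the cap when nothing else applies


def example_model(n: int) -> float:
    """Illustrative model only: 10 time units serial, 90 units divisible."""
    return 10.0 + 90.0 / n


count_deadline = choose_processor_count(example_model, 35.0, 5.0, 16)
count_diminishing = choose_processor_count(example_model, 20.0, 4.0, 16)
```

With this model the first call stops at four processors because the predicted time (32.5) drops below the execution time threshold, while the second call stops at five processors because going from five to six would save only 3 time units, less than the reduction time threshold of 4.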

8. A computer comprising:
two or more parallel processors that cooperate to perform distributed parallel processing; and
a processor management unit that determines the number of parallel processors to which a specific task is assigned, wherein
the processor management unit:
acquires, as an execution time threshold, the upper limit of the execution time of distributed parallel processing by the parallel processors;
acquires, as a reduction time threshold, the lower-limit allowable value of the reduction in execution time obtained when the number of parallel processors is increased by one;
calculates the predicted execution time of the specific task under distributed parallel processing while increasing the number of parallel processors one at a time, compares each predicted execution time with the execution time threshold, and compares the reduction from each predicted execution time to the predicted execution time with one more parallel processor against the reduction time threshold; and
determines the number of parallel processors to which the specific task is assigned based on the result of comparing the predicted execution time with the execution time threshold and the result of comparing the reduction in predicted execution time with the reduction time threshold.

9. The computer, wherein
the processor management unit:
when the comparison shows that the reduction in predicted execution time is less than the reduction time threshold, determines the number of parallel processors of that predicted execution time as the number of parallel processors to which the specific task is assigned; and
when the comparison shows that the reduction is equal to or greater than the reduction time threshold, compares the predicted execution time with one more parallel processor against the execution time threshold; if that predicted execution time is less than or equal to the execution time threshold, determines the number of parallel processors of that predicted execution time as the number of parallel processors to which the specific task is assigned; and if that predicted execution time exceeds the execution time threshold, compares the reduction from that predicted execution time to the predicted execution time with one further parallel processor against the reduction time threshold.

10. The computer according to claim 8 or 9, wherein
the processor management unit divides the specific task into as many parts as the determined number of parallel processors, investigates the load state of each parallel processor, and allocates the divided tasks to the parallel processors in ascending order of load.
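The division and load-ordered allocation in the claim above might look like the following Python sketch. The representation of a task as a list of work items, the load values, and the processor names are assumptions for illustration; the claim itself does not say how load is measured or how the task is divided.

```python
def split_task(items: list[int], parts: int) -> list[list[int]]:
    """Divide the work items into `parts` contiguous, near-equal chunks."""
    size, extra = divmod(len(items), parts)
    chunks: list[list[int]] = []
    start = 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def allocate(
    items: list[int], loads: dict[str, float], count: int
) -> dict[str, list[int]]:
    """Give one chunk to each of the `count` least-loaded processors."""
    if count > len(loads):
        raise ValueError("not enough parallel processors")
    # Sort by load, lightest first; ties are broken by name for determinism.
    targets = sorted(loads, key=lambda name: (loads[name], name))[:count]
    return dict(zip(targets, split_task(items, count)))


# Hypothetical load figures for four parallel processors.
loads = {"par-1": 0.7, "par-2": 0.2, "par-3": 0.5, "par-4": 0.1}
plan = allocate(list(range(7)), loads, 3)
```

Here the three lightest processors (`par-4`, `par-2`, `par-3`) each receive one chunk, and the most heavily loaded processor `par-1` receives none.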