
WO2020008392A2 - Predicting execution time of memory bandwidth intensive batch jobs - Google Patents


Info

Publication number: WO2020008392A2 (application PCT/IB2019/055684)
Authority: WIPO (PCT)
Other versions: WO2020008392A3 (en)
Other languages: French (fr)
Inventors: Dheeraj Chahal, Benny Mathew, Manoj Karunakaran Nambiar
Applicant and current assignee: Tata Consultancy Services Ltd
Legal status: Ceased

Classifications

    • G06F9/4881 — Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues (under G06F9/48: Program initiating; program switching, e.g. by interrupt)
    • G06F9/455 — Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines (under G06F9/44: Arrangements for executing specific programs)

Definitions

  • The disclosure herein generally relates to predicting execution time of multi-threaded and memory bandwidth intensive batch jobs, and, more particularly, to systems and methods for predicting execution time of multi-threaded and memory bandwidth intensive batch jobs.
  • Batch jobs typically process large volumes of data, and many of them are memory bandwidth intensive. These batch jobs perform, inter alia, data reconciliation and risk analysis, and carry out analytics that are critical for the business. Hence, it is imperative that the batch jobs complete within the available time frame.
  • New servers with a large number of cores and a Non-Uniform Memory Access (NUMA) architecture provide a large computing capacity. This means that multiple batch jobs may be executed concurrently to minimize the collective completion time of the batch jobs. However, excessive parallelism may create a memory bottleneck and adversely affect the completion time.
  • Batch jobs can exceed the expected completion time due to sharing of a Central Processing Unit (CPU) and memory resources with other concurrently running jobs, which may interfere with business critical operations. Predicting a priori the completion time of a batch job running concurrently with other batch jobs is a challenging but important task. Concurrent processing allows multiple batch jobs to run concurrently to minimize the total completion time.
  • A method for predicting execution time of multi-threaded and memory bandwidth intensive batch jobs in a concurrent batch job environment based upon a simulation of a Central Processing Unit (CPU) and memory contention comprises: identifying, by one or more hardware processors, a set of multi-threaded and memory bandwidth intensive batch jobs executing concurrently on a multi-core system, wherein the set of multi-threaded and memory bandwidth intensive batch jobs are identified based upon a concurrency level of one or more multi-threaded batch jobs; and performing, by the one or more hardware processors, a plurality of steps based upon the set of multi-threaded and memory bandwidth intensive batch jobs identified, wherein the plurality of steps comprise: (i) determining a distinct service demand for each of a total number of threads corresponding to the set of multi-threaded and memory bandwidth intensive batch jobs, wherein the distinct service demand comprises a CPU utilization of each of the total number of threads.
  • A system for predicting execution time of multi-threaded and memory bandwidth intensive batch jobs in a concurrent batch job environment based upon a simulation of a Central Processing Unit (CPU) and memory contention comprises a memory storing instructions; one or more communication interfaces; and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to: identify a set of multi-threaded and memory bandwidth intensive batch jobs executing concurrently on a multi-core system, wherein the set of multi-threaded and memory bandwidth intensive batch jobs are identified based upon a concurrency level of one or more multi-threaded batch jobs; and perform a plurality of steps based upon the set of multi-threaded and memory bandwidth intensive batch jobs identified, wherein the plurality of steps comprise: (i) determine a distinct service demand for each of a total number of threads corresponding to the set of multi-threaded and memory bandwidth intensive batch jobs, wherein the distinct service demand comprises a CPU utilization of each of the total number of threads.
  • One or more non-transitory machine readable information storage mediums comprise one or more instructions which when executed by one or more hardware processors cause the one or more hardware processors to perform a method for predicting execution time of multi-threaded and memory bandwidth intensive batch jobs in a concurrent batch job environment based upon a simulation of a Central Processing Unit (CPU) and memory contention, the method comprising: identifying, by one or more hardware processors, a set of multi-threaded and memory bandwidth intensive batch jobs executing concurrently on a multi-core system, wherein the set of multi-threaded and memory bandwidth intensive batch jobs are identified based upon a concurrency level of one or more multi-threaded batch jobs; and performing, by the one or more hardware processors, a plurality of steps based upon the set of multi-threaded and memory bandwidth intensive batch jobs identified, wherein the plurality of steps comprise: (i) determining a distinct service demand for each of a total number of threads corresponding to the set of multi-threaded and memory bandwidth intensive batch jobs, wherein the distinct service demand comprises a CPU utilization of each of the total number of threads.
  • FIG. 1 illustrates a block diagram of a system for predicting execution time of multi-threaded and memory bandwidth intensive batch jobs in a concurrent batch job environment based upon a simulation of a Central Processing Unit (CPU) and memory contention, in accordance with some embodiments of the present disclosure.
  • FIGS. 2A and 2B are a flow diagram illustrating the steps involved in the process of predicting the execution time of the multi-threaded and memory bandwidth intensive batch jobs in the concurrent batch job environment based upon the simulation of the Central Processing Unit (CPU) and memory contention, in accordance with some embodiments of the present disclosure.
  • FIG. 3 illustrates an architectural diagram or a high-level structural framework for predicting the execution time of the multi-threaded and memory bandwidth intensive batch jobs, in accordance with some embodiments of the present disclosure.
  • FIG. 4 illustrates a block diagram and an example of a job execution model to be simulated by implementing a discrete event simulation technique for predicting the execution time of the multi-threaded and memory bandwidth intensive batch jobs, in accordance with some embodiments of the present disclosure.
  • FIG. 5 is a block diagram and an example of partitioning total execution time of threads into a plurality of executing intervals, and computation of a busy time and an idle time of a thread amongst a total number of threads corresponding to the multi-threaded and memory bandwidth intensive batch jobs, in accordance with some embodiments of the present disclosure.
  • FIG. 6 illustrates a graphical representation and an example of memory bandwidth requirement identified for one or more threads amongst the total number of threads, in accordance with some embodiments of the present disclosure.
  • FIG. 7 illustrates a graphical representation and an example of the execution time predicted for the multi-threaded and memory bandwidth intensive batch jobs by implementing the proposed methodology in case of a bandwidth contention, wherein Hyper-Threading is turned-off and the memory binding is absent, in accordance with some embodiments of the present disclosure.
  • FIG. 8 illustrates a graphical representation and an example of the execution time predicted for the multi-threaded and memory bandwidth intensive batch jobs based upon a characterization of each of the multi-threaded and memory bandwidth intensive batch jobs with one thread each, wherein the Hyper-Threading is turned-off and the memory binding is absent, in accordance with some embodiments of the present disclosure.
  • FIG. 9 illustrates a graphical representation and an example of the execution time predicted for the multi-threaded and memory bandwidth intensive batch jobs based upon a binding between threads corresponding to each of the multi-threaded and memory bandwidth intensive batch jobs to Non-Uniform Memory Access (NUMA) nodes, when the Hyper-Threading is turned-off but the memory binding is present, in accordance with some embodiments of the present disclosure.
  • FIG. 10 illustrates a graphical representation and an example of the execution time predicted for the multi-threaded and memory bandwidth intensive batch jobs by oversubscribing the cores, wherein there is no binding between the threads and the NUMA nodes, and the Hyper-Threading is turned-on but the memory binding is absent, in accordance with some embodiments of the present disclosure.
  • FIG. 11 illustrates a graphical representation and an example of the execution time predicted based upon the binding of the multi-threaded and memory bandwidth intensive batch jobs to the NUMA nodes, wherein the Hyper-Threading is turned-on and the memory binding is present, in accordance with some embodiments of the present disclosure.
  • FIG. 12 illustrates a graphical representation and an example of the execution time predicted by changing a binding sequence of the multi-threaded and memory bandwidth intensive batch jobs, wherein the Hyper-Threading is turned-on and the memory binding is present, in accordance with some embodiments of the present disclosure.
  • the embodiments of the present disclosure provide systems and methods for predicting execution time of multi-threaded and memory bandwidth intensive batch jobs in a concurrent batch job environment based upon a simulation of a Central Processing Unit (CPU) and memory contention.
  • Batch processing constitutes a big portion of complex and large data processing in many organizations.
  • Batch jobs may be multi-threaded and memory intensive.
  • the use of multi-core systems for processing the multi-threaded and memory intensive batch jobs has increased significantly.
  • The increase in the number of cores available on a processor chip reduces the per-core memory bandwidth, as the available memory bandwidth is shared amongst the cores.
  • Although Non-Uniform Memory Access (NUMA) machines provide large bandwidth, it may still not be sufficient for some memory intensive multi-threaded applications. Insufficient available memory bandwidth affects the performance of the application.
  • Memory latency in a NUMA architecture depends upon the memory location relative to the processor.
  • The performance of a system using a NUMA architecture may be improved using processor affinity or data locality. Allocating memory close to the computing resource may reduce the latency, thereby improving the performance.
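  • As an illustration of processor affinity, the minimal Python sketch below pins the current process to the cores of one NUMA node; the core numbering is an assumption, and memory binding on Linux is typically applied externally, for example with the numactl utility:

      import os

      # pin the calling process (pid 0 = this process) to cores 0-13,
      # assumed here to be the cores of NUMA node 0
      node0_cores = set(range(14))
      os.sched_setaffinity(0, node0_cores)
      print(os.sched_getaffinity(0))  # the cores the process may now run on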
  • However, modeling such complex architectures for concurrent job execution behavior poses numerous challenges.
  • Hyper-Threading results in the execution of two threads on a single core and exploits the latencies due to data access.
  • Hyper-Threading does not provide the advantage of a multi-core CPU but offers an advantage over a single core by executing two threads simultaneously and filling unused stages in the functional pipeline. Since Hyper-Threading processor(s) behave differently than single core or multi-core systems, predicting the thread or job execution behavior based on physical core data becomes challenging.
  • Referring now to the drawings, and more particularly to FIGS. 1 through 12, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
  • FIG. 1 illustrates an exemplary block diagram of a system 100 for predicting execution time of multi-threaded and memory bandwidth intensive batch jobs in a concurrent batch job environment based upon a simulation of a Central Processing Unit (CPU) and memory contention, in accordance with an embodiment of the present disclosure.
  • the system 100 includes one or more processors 104, communication interface device(s) or input/output (I/O) interface(s) 106, and one or more data storage devices or memory 102 operatively coupled to the one or more processors 104.
  • the one or more processors 104 that are hardware processors can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions.
  • the processor(s) is configured to fetch and execute computer-readable instructions stored in the memory 102.
  • the system 100 can be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud and the like.
  • the I/O interface device(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite.
  • the I/O interface device(s) can include one or more ports for connecting a number of devices to one another or to another server.
  • the memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes.
  • FIGS. 2A and 2B illustrate an exemplary flow diagram of a method for predicting the execution time of the multi-threaded and memory bandwidth intensive batch jobs in the concurrent batch job environment based upon the simulation of the Central Processing Unit (CPU) and memory contention, in accordance with some embodiments of the present disclosure.
  • the system 100 comprises one or more data storage devices of the memory 102 operatively coupled to the one or more hardware processors 104 and is configured to store instructions for execution of steps of the method by the one or more processors 104.
  • the steps of the method of the present disclosure will now be explained with reference to the components of the system 100 as depicted in FIG. 1 and the flow diagram.
  • The hardware processors 104, when configured with the instructions, perform one or more methodologies described herein.
  • The one or more hardware processors 104 identify, based upon a concurrency level of one or more multi-threaded batch jobs, a set of multi-threaded and memory bandwidth intensive batch jobs executing concurrently on a multi-core system.
  • a multi-threaded architecture supports not only multiple processors but also multiple streams (or batch streams) executing simultaneously in each processor.
  • The processor(s) of the multi-threaded architecture computer are interconnected via an interconnection network. Each processor can communicate with every other processor through the interconnection network.
  • compute and memory resources required by batch jobs are managed by time sharing.
  • The completion time of a job in a concurrent run depends on the number of cores and the maximum memory bandwidth available. If the compute and memory bandwidth requirements of all concurrently running batch jobs are less than the maximum available cores and bandwidth of the multiprocessor system respectively, the total completion time can be derived directly from the clock time.
  • advanced CPU and memory contention models may be required when the number of cores or the available memory bandwidth is less than the cumulative requirement of the concurrent jobs.
  • a high-level architecture of a multi-core general purpose computer comprises an arbitrary number of cores and a memory controller.
  • One or more of the processor cores amongst the arbitrary number of cores sends one or more memory requests of one or more executing threads to the memory controller via a corresponding cache.
  • the memory controller then schedules the incoming requests of the various threads for access to various banks (e.g., memory banks 1, . . .,K) of a shared memory via a memory bus.
  • Referring to FIG. 3, an architecture or a high-level structural model for implementing the proposed methodology may be referred.
  • the implementation and execution of the architecture or the high-level structural model with the system 100 has been discussed in detail in steps 202 through 204 below. It may be noted that the embodiments of the proposed disclosure do not restrict the architecture or the high-level structural model to as shown in FIG. 3 only.
  • the embodiments of the proposed disclosure provide for adjusting modifications in the architecture or the high-level structural model or integrating a new architecture with the system 100 for implementing the proposed methodology.
  • As an example, a set of five jobs J1, J2, J3, J4 and J5 executing concurrently in a batch environment X may be identified as multi-threaded and memory bandwidth intensive, wherein although the total memory bandwidth available in the batch environment X is 50 gigabits per second (Gbps), each of the set of five jobs J1, J2, J3, J4 and J5 needs a memory bandwidth of 25 Gbps or more for executing concurrently in the batch environment X.
  • Thus, the memory bandwidth requirement of the set of five jobs J1, J2, J3, J4 and J5 saturates the system memory bandwidth.
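  • A minimal sketch of this identification step is given below, under the assumptions of the example above; the job names, the per-job bandwidth demands and the 50 Gbps system bandwidth are illustrative placeholders:

      # Flag a concurrent batch as memory bandwidth intensive when the
      # aggregate bandwidth demand of its jobs exceeds the system bandwidth.
      SYSTEM_BW_GBPS = 50.0  # total memory bandwidth of batch environment X

      # per-job bandwidth demand (Gbps), measured with each job in isolation
      job_bw_demand = {"J1": 25.0, "J2": 25.0, "J3": 26.0, "J4": 25.0, "J5": 27.0}

      def bandwidth_intensive_set(demands, system_bw):
          """Return the jobs whose concurrent run saturates system bandwidth."""
          total = sum(demands.values())
          return set(demands) if total > system_bw else set()

      print(bandwidth_intensive_set(job_bw_demand, SYSTEM_BW_GBPS))
      # prints all five jobs, since 128 Gbps of demand exceeds 50 Gbps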
  • the one or more hardware processors 104 perform a plurality of steps based upon the identified set of multi-threaded and memory bandwidth intensive batch jobs.
  • The one or more hardware processors 104 determine a distinct service demand for each of a total number of threads corresponding to the set of multi-threaded and memory bandwidth intensive batch jobs, wherein the distinct service demand comprises a CPU utilization of each of the total number of threads.
  • a difference in service demand of one or more threads amongst the total number of threads in a multi-threaded and memory bandwidth intensive batch job may affect completion time of each of the set of multi-threaded and memory bandwidth intensive batch jobs running concurrently.
  • fast running threads in a job with low service demand finish early and on completion do not compete for resources with slow running threads of the same batch job or other batch jobs.
  • a K-means clustering technique may be implemented for clustering the service demand of the total number of threads of an individual batch job amongst the set of multi-threaded and memory bandwidth intensive batch jobs executing concurrently.
  • The cluster centers are initialized using the k-means++ algorithm.
  • the distinct service demand comprises the distinct CPU utilization of each of the total number of threads and may be measured accordingly.
  • Referring to step 202(i), suppose the distinct service demand of the one or more threads is determined as below:
  • each of the total number of threads may be clustered based upon a similar service demand (or the CPU utilization) of the one or more threads.
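  • A minimal sketch of such clustering follows, assuming scikit-learn and hypothetical per-thread service demands (seconds of CPU time); the number of clusters is also an illustrative choice:

      import numpy as np
      from sklearn.cluster import KMeans

      # service demand of each thread of one batch job (hypothetical values)
      service_demands = np.array([[12.1], [11.8], [12.3], [24.6], [25.0], [24.8]])

      # cluster centers initialized with k-means++, as described above
      kmeans = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
      labels = kmeans.fit_predict(service_demands)  # cluster id per thread
      print(labels, kmeans.cluster_centers_.ravel())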
  • the one or more hardware processors 104 identify a memory bandwidth requirement of each of the total number of threads.
  • the memory bandwidth requirement of each of the total number of threads may be identified for a local access and a remote access, as a local memory bandwidth and a remote memory bandwidth available to each of the total number of threads are different, which may result in different latencies for memory access.
  • the different latencies are replicated while simulating an execution of a batch job amongst the set of multi threaded and memory bandwidth intensive batch jobs executing concurrently.
  • In an embodiment, an overhead factor due to a remote memory data access, γ_ro, may be computed as:

      γ_ro = (γ_r − γ_l) / γ_l ... equation (1)

    wherein γ_l and γ_r represent the data access time from the local memory and the remote memory respectively. Comparing with previous research, the proposed disclosure makes an assumption that when there is no memory to thread binding, almost 25% of the threads of any batch job (that is, any batch job in general or amongst the set of multi-threaded and memory bandwidth intensive batch jobs) access data from a remote node.
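  • A small sketch of equation (1) follows; the latencies are placeholders of the kind reported by a memory latency measurement tool:

      gamma_l = 80.0   # data access time from the local memory (ns, placeholder)
      gamma_r = 130.0  # data access time from the remote memory (ns, placeholder)

      gamma_ro = (gamma_r - gamma_l) / gamma_l  # equation (1)
      print(gamma_ro)  # 0.625, i.e. a remote access costs 62.5% extra time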
  • Referring to FIG. 6 and step 202(ii), the memory bandwidth requirement identified (in Megabytes per second (MB/s)) for the one or more threads amongst the total number of threads may be referred.
  • the memory bandwidth requirement of each of the total number of threads varies from 1700 MB/s to 3500 MB/s.
  • the one or more hardware processors 104 derive an instantaneous utilization of the CPU for each of the set of multi-threaded and memory bandwidth intensive batch jobs as a function of time for a set of time intervals corresponding to each of the total number of threads.
  • an instantaneous value may be derived based upon the instantaneous utilization of the CPU by one or more multi-threaded and memory bandwidth intensive batch jobs (amongst the set of multi-threaded and memory bandwidth intensive batch jobs).
  • The proposed disclosure provides for capturing a CPU utilization value of a batch job (or more specifically, of a multi-threaded and memory bandwidth intensive batch job amongst the set of multi-threaded and memory bandwidth intensive batch jobs) at a set of intervals, more specifically at a set of small intervals in isolation, and fitting the captured CPU utilization value into a regression function or distribution (for example, linear or non-linear, exponential, or a polynomial of degree n, etc.).
  • the CPU utilization of the one or more multi-threaded and memory bandwidth intensive batch jobs may be represented with the help of an appropriate function of time or by some distribution.
  • the instantaneous value(s) of the CPU may be derived with the function of time or a distribution of time for each of the set of smaller intervals during simulation.
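  • A minimal sketch of this fitting step follows, assuming a polynomial regression of degree 2 over hypothetical utilization samples; any of the other function families named above may be substituted:

      import numpy as np

      t = np.array([0.0, 4.0, 8.0, 12.0, 16.0, 20.0])            # sampling instants (s)
      cpu_util = np.array([0.90, 0.85, 0.80, 0.70, 0.60, 0.30])  # measured in isolation

      coeffs = np.polyfit(t, cpu_util, deg=2)  # fit the regression function C(t)
      C = np.poly1d(coeffs)
      print(C(10.0))  # instantaneous CPU utilization at t = 10 s during simulation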
  • The proposed disclosure provides for selecting the idle time and the execution time of the one or more threads from a uniform distribution with averages t_d and t_e respectively, to account for the fluctuations in the idle time and the execution time.
  • The one or more hardware processors 104 auto-design, based upon the instantaneous utilization of the CPU and the memory bandwidth requirement of each thread, a job execution model comprising a plurality of idle threads and a plurality of threads ready for execution amongst the total number of threads.
  • the job execution model is auto-designed to be simulated for facilitating the prediction of the execution time of the set of multi-threaded and memory bandwidth intensive batch jobs in the concurrent batch job environment, based upon the simulation of the Central Processing Unit (CPU) and the memory contention (simulation discussed in step 204 below).
  • Referring to FIG. 4, an example of the job execution model may be referred.
  • the step of auto-designing the job execution model may now be considered in detail.
  • the job execution model comprises a memory contention model designed on top of a CPU contention model.
  • A job thread migrates between different states (for example, a ready-for-execution state, an execution state) during the lifetime of the thread.
  • The proposed methodology assumes that when a thread (that is, any general thread) is ready for execution, it enters a ready queue. Threads may then be picked up from the ready queue on a First Come First Serve (FCFS) basis and may then be executed based upon the availability of a core.
  • A thread in a batch job, when executing, may retrieve some data from the local memory or the remote memory based upon the data placement of the thread in execution. Further, a thread migrates to an idle state for performing a number of functions (like Input/Output, remote procedure calls etc.) by releasing the core. Again, the thread returns to the ready queue when ready for execution.
  • The completion time of a thread is the sum of the time spent in a number of states (for example, a ready state, an idle state or an execution state etc.), while the completion time of a batch job may be determined based upon the slowest running thread of the batch job (that is, the batch job whose completion time is to be determined).
  • Let C_t denote the CPU utilization at time T. The busy time and the idle time of a thread in an executing interval of length T may then be computed as:

      t_busy = C_t × T ... equation (6)
      t_idle = (1 − C_t) × T ... equation (7)
  • When the bandwidth available to a thread is less than its requirement, the busy time of that thread in the interval, t_busy, gets incremented by a factor δ_bw as below:

      t_busy = C_t × T × (1 + δ_bw) ... equation (8)

    wherein BW_req is the bandwidth requirement of the thread when a corresponding job (to which the thread belongs) is executing in isolation, BW_available is the available bandwidth to the thread when the corresponding job is executing concurrently with other batch jobs, and the dilation factor δ_bw is derived from the two (for example, δ_bw = (BW_req − BW_available) / BW_available when BW_req exceeds BW_available).
  • The available bandwidth BW_available may be determined from the maximum bandwidth of a system node and the number of threads running on that node, that is:

      BW_available = BW_max_i / T_n_i ... equation (9)

    wherein BW_max_i is the maximum memory bandwidth available at node i and T_n_i denotes the maximum number of threads running at the node i.
  • the one or more threads (amongst the total number of threads) in a batch job may access data from the remote memory.
  • The busy time in an interval of the one or more threads accessing data from the remote memory may be derived based upon equation (1) and equation (8) as:

      t_busy = C_t × T × (1 + δ_bw × γ_ro) ... equation (11)
  • the step of computing the busy time and the idle time comprises computing an average busy time and an average idle time (denoted by t busy and t idle respectively) for each of a plurality of executing intervals, wherein the average busy time and the average idle time are computed based upon the standard deviation in the CPU utilization of the batch job.
  • Referring to FIG. 5, the process of computing the busy time and the idle time may be referred.
  • As an example, suppose the total execution time of a thread (amongst the total number of threads) is 20 seconds, and the total execution time is partitioned into five executing intervals of 4 seconds each.
  • The thread may then have a busy time of 3 seconds and an idle time of 1 second in the first interval, a busy time of 2 seconds and an idle time of 2 seconds in the second interval, and so on (wherein the five executing intervals comprise the plurality of executing intervals).
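  • A sketch of the busy-time computation of equations (6), (8) and (11) follows; the form of the dilation factor δ_bw is the assumption stated above, and all numeric inputs are placeholders:

      def busy_time(c_t, interval, bw_req, bw_available, remote=False, gamma_ro=0.0):
          """Busy time of a thread in one executing interval of given length."""
          t_busy = c_t * interval                        # equation (6)
          if bw_req > bw_available:                      # bandwidth contention
              delta_bw = (bw_req - bw_available) / bw_available
              factor = delta_bw * gamma_ro if remote else delta_bw
              t_busy *= 1.0 + factor                     # equations (8) and (11)
          return t_busy

      # a 4 s interval at 75% utilization, needing 3500 MB/s with 2500 MB/s available
      print(busy_time(0.75, 4.0, 3500.0, 2500.0))  # 4.2 s instead of 3.0 s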
  • the one or more hardware processors 104 predict by a discrete event simulation technique, the execution time for each of the set of multi-threaded and memory bandwidth intensive batch jobs by performing a plurality of steps.
  • the one or more hardware processors 104 simulate the auto-designed job execution model in a simulation environment by implementing the discrete event simulation technique.
  • The distinct service demand for each of the total number of threads, the identified memory bandwidth requirement of each of the total number of threads in a batch job (amongst the set of multi-threaded and memory bandwidth intensive batch jobs), the CPU utilization of the batch job (corresponding to which the memory bandwidth requirement of each of the total number of threads has been identified) and the maximum number of cores available in the computing system are given as inputs to the simulation environment.
  • the simulation environment predicts the minimum and maximum completion time of each batch job amongst the set of multi-threaded and memory bandwidth intensive batch jobs identified.
  • the process of simulating the job execution model in the simulation environment may now be considered in detail by referring to the discrete event simulation technique algorithm below.
  • Input: maximum memory bandwidth available on the system, service demand distribution of the threads, bandwidth requirement of the threads, and CPU utilization of the jobs in a batch.

      gapTime ← (Uniform) gapDistList.sample(); Schedule_after(gapTime);
      BW_avail ← BW_max[numa_bind] / total_threads[numa_bind]
      proc_time ← proc_time + proc_time × overhead
  • the step of simulating the auto-designed job execution model comprises an execution of one or more programming functions by each of the total number of threads to initialize the distinct service demand and the memory bandwidth requirement of each of the total number of threads.
  • The algorithm comprises five major functions (or programming functions), namely, interval_end, schedule_job_execution, interval_start, arrival and departure.
  • the one or more programming functions facilitate simulating the job execution model.
  • Each of the total number of threads, upon entering a server, invokes the arrival function, wherein the derived instantaneous CPU utilization, the identified memory bandwidth requirement and the distinct service demand determined for the thread entering the server are initialized by the one or more hardware processors 104.
  • The simulation environment maintains two queues, that is, a job_queue, which keeps track of the threads (amongst the total number of threads) that are in the ready-for-execution state, and an execution_queue, which keeps track of the threads that are in a run state.
  • Both the job_queue and the execution_queue are maintained for each NUMA node in each server for keeping track of the threads.
  • The total execution time of each of the total number of threads is partitioned into the set of small intervals of time, and in each of the set of small intervals of time, each of the total number of threads executes for a small interval of time and remains idle for the remaining time.
  • Each of the total number of threads may be migrated between the job_queue and the execution_queue using the functions interval_start and interval_end on the start and completion of the busy time in an interval amongst the plurality of executing intervals. Further, the schedule_job_execution function keeps a record of the percentage of the total execution time that a thread (amongst the total number of threads) has completed.
  • the available memory bandwidth and a remaining execution time of any thread are evaluated using the discrete event simulation technique at each of the plurality of executing intervals.
  • execution time of a thread (amongst the total number of threads) in each of the plurality of executing intervals is adjusted based upon a time dilation due to shortage of the available bandwidth.
  • In case of thread and memory binding, the node where the thread (that is, the bound thread) is pinned may be determined.
  • the available memory bandwidth requirement may be determined for a single thread, based upon a number of threads executing on a single node, wherein the single thread and the number of threads correspond to the total number of threads, and wherein the single node corresponds to the multi-core system. That is, the thread’s available memory bandwidth requirement is determined using the number of threads (amongst the total number of threads) running on that node. Further, in the absence of the thread and memory binding, the busy time in an interval (amongst the plurality of executing intervals) may be computed using equation (7). The thread may exit from the multi-core system upon completion of the execution time of the thread by invoking the departure function.
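  • The sketch below gives the flavor of such a discrete event simulation using the SimPy library: cores are a shared resource acquired on an FCFS basis, each thread alternates busy and idle phases over its executing intervals, and the predicted completion time of a job is that of its slowest thread. The thread counts, utilizations and interval length are placeholders, and the single shared resource is a simplification of the per-NUMA-node job_queue/execution_queue bookkeeping described above:

      import simpy

      NUM_CORES = 4    # cores of one node (placeholder)
      INTERVAL = 4.0   # length of one executing interval (s)

      def thread_proc(env, cores, num_intervals, c_t, completions):
          """One thread: alternate busy and idle phases over its intervals."""
          for _ in range(num_intervals):
              with cores.request() as req:   # wait in the ready queue (FCFS)
                  yield req
                  yield env.timeout(c_t * INTERVAL)        # busy time, eq. (6)
              yield env.timeout((1.0 - c_t) * INTERVAL)    # idle time, eq. (7)
          completions.append(env.now)

      env = simpy.Environment()
      cores = simpy.Resource(env, capacity=NUM_CORES)
      completions = []
      for _ in range(8):   # one job with 8 identical threads, 5 intervals each
          env.process(thread_proc(env, cores, 5, 0.75, completions))
      env.run()
      print(max(completions))  # job completion time = that of the slowest thread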
  • the one or more hardware processors 104 predict, based upon the simulation, the execution time of each of the concurrently executing set of multi-threaded and memory bandwidth intensive batch jobs.
  • Referring to the accompanying figures, the execution time of each of the concurrently executing set of multi-threaded and memory bandwidth intensive batch jobs predicted (in seconds) may be referred.
  • the step may now be considered in detail by analyzing the proposed methodology.
  • The proposed methodology was analyzed and evaluated using a synthetic benchmark called STREAM (or the STREAM benchmark).
  • the STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels.
  • the STREAM benchmark is suitable for studying NUMA effects.
  • the STREAM benchmark performs four different vector operations, namely, a vector copy operation, an add operation, a scalar operation and a triad operation.
  • A plurality of experiments were conducted on a 28 physical (56 logical) core Intel® Xeon machine running the CentOS 6.0 Linux® operating system. Further, the plurality of experiments were conducted on the machine with Hyper-Threading on and off, and two NUMA nodes on two sockets were used for conducting the plurality of experiments.
  • The maximum memory bandwidth available was computed theoretically, wherein the maximum memory bandwidth available is the product of the clock frequency, the number of data transfers per clock cycle, the memory bus width (32 or 64 bit) and the number of memory interfaces.
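  • As a worked illustration of this product, using placeholder DDR4-2400 figures rather than the figures of the machine above:

      clock_mhz = 1200          # memory clock frequency (MHz)
      transfers_per_cycle = 2   # double data rate: two transfers per clock cycle
      bus_width_bytes = 8       # 64-bit memory bus
      num_interfaces = 4        # number of memory channels

      bw_mb_s = clock_mhz * transfers_per_cycle * bus_width_bytes * num_interfaces
      print(bw_mb_s, "MB/s")    # 76800 MB/s, i.e. about 76.8 GB/s per socket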
  • The Intel® Memory Latency Checker (mlc) tool was used to observe the latency due to the local access and the remote access, that is, γ_l and γ_r respectively.
  • each of a plurality of STREAM jobs was executed in isolation initially, wherein each of the plurality of STREAM jobs is a multi-threaded and memory bandwidth intensive batch job (that is, each of the plurality of STREAM jobs corresponds to the set of multi-threaded and memory bandwidth intensive batch jobs).
  • The service demand of each of the total number of threads in each of the plurality of STREAM jobs, along with the instantaneous utilization of the CPU for each of the plurality of STREAM jobs at the plurality of executing intervals, were observed. Further, the memory bandwidth requirement of each of the total number of threads was also identified.
  • The number of iterations performed by each of the plurality of STREAM jobs was modified, the plurality of STREAM jobs were executed concurrently, and the corresponding completion time was predicted for each of the plurality of STREAM jobs by implementing the discrete event simulation technique.
  • the predicted completion time results were compared with a set of experimental data.
  • Each simulation was executed with three different seed values and an average execution time was calculated.
  • A prediction error was computed as:

      Error (%) = ((E_e − E_p) / E_e) × 100

    wherein E_e and E_p represent the experimental and the predicted completion time of a batch job amongst the plurality of STREAM jobs executing concurrently.
  • The plurality of experiments were conducted with 28 physical cores on the machine by turning off Hyper-Threading.
  • a plurality of STREAM benchmarks performing the vector copy operation, the add operation, the scalar operation and the triad operation were executed initially with 21, 14, 14 and 7 threads respectively.
  • Each of the plurality of STREAM benchmarks was initially executed in isolation and the distinct service demand, the instantaneous utilization of the CPU and an average memory bandwidth requirement of each of the total number of threads were obtained.
  • Each of the plurality of STREAM jobs was then executed concurrently without binding any threads to cores or memory.
  • A second experiment was conducted, wherein a set of five STREAM jobs was characterized with 4, 7, 14, 21, and 28 threads respectively. Each of the five STREAM jobs ran all the four workloads of the STREAM benchmark sequentially and iteratively. All the five STREAM jobs were then executed concurrently.
  • Referring to FIG. 8, a comparison of the completion time prediction for each of the five STREAM jobs with the predicted completion time generated by implementing the proposed methodology may be referred. It may be noted that the prediction error observed in the first and the second experiments arises because, in general, the service demand of the threads in a job changes with concurrency, whereas the proposed methodology assumes that the service demand of the threads does not change with concurrency.
  • The threads of each of the five STREAM jobs were pinned, wherein the five STREAM jobs comprise 14, 7, 14, 21 and 7 threads, and the threads were pinned to NUMA nodes 0, 1, 0, 1, and 0 respectively. No interference was observed amongst the threads executing concurrently on a plurality of distinct nodes. Resource sharing (memory and cores) was observed only amongst the one or more threads running on the same node.
  • Referring to FIG. 9, a comparison of the predicted and experimental job completion times may be referred.
  • The completion of jobs (amongst the five STREAM jobs) running on NUMA node 1 is predicted with a higher accuracy as compared with NUMA node 0, as NUMA node 1 has a smaller number of threads than NUMA node 0.
  • The higher accuracy for NUMA node 1 is because of less context switching amongst the threads running on NUMA node 1.
  • Hyper-Threading on and no memory binding: a third experiment was conducted, wherein Hyper-Threading was turned on, which resulted in the availability of 56 logical cores on the system, with no binding between the threads and the cores or the NUMA nodes.
  • The cores were over-subscribed by executing the five STREAM jobs, wherein each of the five STREAM jobs comprises 14 threads and a distinct number of iterations.
  • Each of the five STREAM jobs ran each of the plurality of STREAM benchmarks sequentially and iteratively. Referring to FIG. 10, a comparison of the predicted and experimental job completion times may be referred.
  • A fifth experiment was conducted, wherein the four STREAM benchmark jobs were executed with a different binding sequence, that is, Job1 and Job2 pinned to NUMA node 0, and Job3 and Job4 pinned to NUMA node 1.
  • Referring to FIG. 12, the effect on the completion time of the four STREAM benchmark jobs with the different binding sequence may be observed.
  • The different binding sequence affects the overall completion time, wherein the changes resulting from the different binding sequence get reflected in the predicted job completion time by implementing the proposed methodology.
  • the proposed methodology facilitates finding an optimum binding of batch jobs and memory.
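  • A sketch of such a search follows: every assignment of jobs to the two NUMA nodes is enumerated and the binding with the smallest predicted completion time is kept. The predict_completion() model below is a stand-in placeholder for the discrete event simulation described above, and the job names and thread counts are illustrative:

      from itertools import product

      def predict_completion(binding, job_threads):
          # placeholder model: completion grows with the thread count on the
          # most loaded node; the real predictor is the simulation above
          load = {0: 0, 1: 0}
          for job, node in binding.items():
              load[node] += job_threads[job]
          return max(load.values())

      def best_binding(job_threads, nodes=(0, 1)):
          jobs = sorted(job_threads)
          best, best_time = None, float("inf")
          for assignment in product(nodes, repeat=len(jobs)):
              binding = dict(zip(jobs, assignment))
              t = predict_completion(binding, job_threads)
              if t < best_time:
                  best, best_time = binding, t
          return best, best_time

      print(best_binding({"Job1": 14, "Job2": 7, "Job3": 14, "Job4": 21}))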
  • The proposed disclosure provides for predicting the execution time of the set of multi-threaded and memory bandwidth intensive batch jobs. None of the traditional systems and methods have proposed predicting the completion time of concurrently running memory bandwidth intensive batch workloads (or batch jobs) based upon the memory bandwidth requirement and the distinct service demand of each of the threads in the set of multi-threaded and memory bandwidth intensive batch jobs. Further, the proposed disclosure provides for predicting the execution time of the concurrently running memory bandwidth intensive batch workloads with a high level of prediction accuracy. Referring to FIGS. 7 through 12, it may be noted that the prediction accuracy of the proposed methodology is similar with logical as well as physical cores.
  • the proposed methodology may be implemented and integrated with other existing workflow systems, and may further optimize the other existing workflow systems.
  • the proposed disclosure may be implemented to test and predict a plurality of optimal job-memory binding techniques, thereby resulting in fastest overall completion time of batch workloads.
  • the memory 102 can be configured to store any data that is associated with predicting the execution time of the multi-threaded and memory bandwidth intensive batch jobs in the concurrent batch job environment based upon the simulation of the Central Processing Unit (CPU) and memory contention.
  • The information pertaining to the set of multi-threaded and memory bandwidth intensive batch jobs, the distinct service demand determined, the memory bandwidth requirement identified, the job execution model, the simulation of the job execution model, the execution time predicted for each of the concurrently executing set of multi-threaded and memory bandwidth intensive batch jobs, etc., and all information pertaining to predicting the execution time of the multi-threaded batch jobs is stored in the memory 102. Further, all information (inputs, outputs and so on) pertaining to predicting the execution time of the multi-threaded and memory bandwidth intensive batch jobs in the concurrent batch job environment based upon the simulation of the Central Processing Unit (CPU) and memory contention may also be stored in the memory 102.
  • the hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof.
  • The device may also include means which could be, e.g., hardware means like, e.g., an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means.
  • the means can include both hardware means and software means.
  • the method embodiments described herein could be implemented in hardware and software.
  • the device may also include software means.
  • the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
  • the embodiments herein can comprise hardware and software elements.
  • the embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc.
  • the functions performed by various modules described herein may be implemented in other modules or combinations of other modules.
  • a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • a computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored.
  • a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein.
  • The term "computer-readable medium" should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, non-volatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.


Abstract

Systems and methods for predicting execution time of multi-threaded and memory bandwidth intensive batch jobs based upon a simulation of a Central Processing Unit (CPU) and memory contention are provided. None of the traditional systems and methods provide for predicting the execution time of the multi-threaded and memory bandwidth intensive batch jobs based upon a memory bandwidth requirement and a distinct service demand of threads. Embodiments of the present disclosure provide for predicting the execution time of a set of multi-threaded and memory bandwidth intensive batch jobs executing concurrently by identifying the memory bandwidth requirement and the distinct service demand of each of the threads; auto-designing, based upon the identified memory bandwidth requirement and the distinct service demand, a job execution model; simulating the job execution model; and predicting, based upon the simulation, the execution time of each of the set of multi-threaded and memory bandwidth intensive batch jobs.

Description

PREDICTING EXECUTION TIME OF MEMORY BANDWIDTH
INTENSIVE BATCH JOBS
CROSS-REFERENCE TO RELATED APPLICATIONS AND PRIORITY
[001] This patent application claims priority to India Patent Application 201821022846, filed on July 03, 2018.
TECHNICAL FIELD
[002] The disclosure herein generally relates to predicting execution time of multi-threaded and memory bandwidth intensive batch jobs, and, more particularly, to systems and methods for predicting execution time of multi-threaded and memory bandwidth intensive batch jobs.
BACKGROUND
[003] Large data centers usually run thousands of batch jobs every day. Batch jobs typically process large volumes of data, and many of them are memory bandwidth intensive. These batch jobs perform, inter alia, data reconciliation and risk analysis, and carry out analytics that are critical for the business. Hence, it is imperative that the batch jobs complete within the available time frame. Of late, new servers with a large number of cores and a Non-Uniform Memory Access (NUMA) architecture provide a large computing capacity. This means that multiple batch jobs may be executed concurrently to minimize the collective completion time of the batch jobs. However, excessive parallelism may create a memory bottleneck and adversely affect the completion time.
[004] Batch jobs can exceed the expected completion time due to sharing of a Central Processing Unit (CPU) and memory resource with other concurrently running jobs, which may interfere with business critical operations. Predicting a priori the completion time of a batch job running concurrently with other batch jobs is a challenging but important task. Concurrent processing allows multiple batch jobs to run concurrently to minimize total completion time.
[005] In a multiprocessor system, compute and memory resources required by the batch jobs are managed by time sharing. The completion time of a job in a concurrent run depends on the number of cores and available maximum memory bandwidth. If the compute and memory bandwidth requirement of all the concurrently running batch jobs is less than the maximum available cores and bandwidth of the system respectively, the total completion time can be derived directly from the clock time. However, advanced CPU and memory contention models are required when the number of cores or available memory bandwidth is less than the cumulative requirement of the concurrent jobs.
[006] In view of the challenges discussed above, there is a need for an advanced model for predicting the execution time of the batch jobs, especially in the case of the memory bandwidth intensive jobs executing in parallel.
SUMMARY
[007] Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a method for predicting execution time of multi-threaded and memory bandwidth intensive batch jobs in a concurrent batch job environment based upon a simulation of a Central Processing Unit (CPU) and memory contention is provided, the method comprising: identifying, by one or more hardware processors, a set of multi-threaded and memory bandwidth intensive batch jobs executing concurrently on a multi-core system, wherein the set of multi-threaded and memory bandwidth intensive batch jobs are identified based upon a concurrency level of one or more multi-threaded batch jobs; performing, by the one or more hardware processors, a plurality of steps based upon the set of multi-threaded and memory bandwidth intensive batch jobs identified, wherein the plurality of steps comprise: (i) determining a distinct service demand for each of a total number of threads corresponding to the set of multi-threaded and memory bandwidth intensive batch jobs, wherein the distinct service demand comprises a CPU utilization of each of the total number of threads; (ii) identifying a memory bandwidth requirement of each of the total number of threads; and (iii) deriving an instantaneous utilization of the CPU for each of the set of multi-threaded and memory bandwidth intensive batch jobs as a function of time for a set of time intervals corresponding to each of the total number of threads; auto-designing, based upon the instantaneous utilization of the CPU and the identified memory bandwidth requirement of each thread, a job execution model comprising a plurality of idle threads and a plurality of threads ready for execution amongst the total number of threads; predicting, by a discrete event simulation technique, the execution time for each of the set of multi-threaded and memory bandwidth intensive batch jobs by performing a plurality of steps, wherein the plurality of steps comprise: (i) simulating the auto-designed job execution model in a simulation environment based upon the distinct service demand of each of the total number of threads and the identified memory bandwidth requirement of each of the total number of threads; and (ii) predicting, based upon the simulation, the execution time of each of the concurrently executing set of multi-threaded and memory bandwidth intensive batch jobs; executing one or more programming functions by each of the total number of threads to initialize the distinct service demand and the memory bandwidth requirement of each of the total number of threads; determining available memory bandwidth for a single thread based upon a number of threads executing on a single node, wherein the single thread and the number of threads correspond to the total number of threads, and wherein the single node corresponds to the multi-core system; computing an average busy time and an average idle time for each of the plurality of executing intervals of time based upon a standard deviation in CPU utilization of a batch job amongst the set of multi-threaded and memory bandwidth intensive batch jobs, and wherein the CPU utilization of the batch job is measured for a set of time intervals.
[008] In another aspect, there is provided a system for predicting execution time of multi-threaded and memory bandwidth intensive batch jobs in a concurrent batch job environment based upon a simulation of a Central Processing Unit (CPU) and memory contention, the system comprising a memory storing instructions; one or more communication interfaces; and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to: identify a set of multi-threaded and memory bandwidth intensive batch jobs executing concurrently on a multi-core system, wherein the set of multi-threaded and memory bandwidth intensive batch jobs are identified based upon a concurrency level of one or more multi-threaded batch jobs; perform a plurality of steps based upon the set of multi-threaded and memory bandwidth intensive batch jobs identified, wherein the plurality of steps comprise: (i) determine a distinct service demand for each of a total number of threads corresponding to the set of multi-threaded and memory bandwidth intensive batch jobs, wherein the distinct service demand comprises a CPU utilization of each of the total number of threads; (ii) identify a memory bandwidth requirement of each of the total number of threads; and (iii) derive an instantaneous utilization of the CPU for each of the set of multi-threaded and memory bandwidth intensive batch jobs as a function of time for a set of time intervals corresponding to each of the total number of threads; auto-design, based upon the instantaneous utilization of the CPU and the identified memory bandwidth requirement of each thread, a job execution model comprising a plurality of idle threads and a plurality of threads ready for execution amongst the total number of threads; predict, by a discrete event simulation technique, the execution time for each of the set of multi-threaded and memory bandwidth intensive batch jobs by performing a plurality of steps, wherein the plurality of steps comprise: (i) simulate the auto-designed job execution model in a simulation environment based upon the distinct service demand of each of the total number of threads and the identified memory bandwidth requirement of each of the total number of threads; and (ii) predict, based upon the simulation, the execution time of each of the concurrently executing set of multi-threaded and memory bandwidth intensive batch jobs; simulate the auto-designed job execution model by executing one or more programming functions via each of the total number of threads to initialize the distinct service demand and the memory bandwidth requirement of each of the total number of threads; determine the available memory bandwidth for a single thread based upon a number of threads executing on a single node, wherein the single thread and the number of threads correspond to the total number of threads, and wherein the single node corresponds to the multi-core system; compute the busy time and the idle time by computing an average busy time and an average idle time for each of the plurality of executing intervals of time to auto-design the job execution model, wherein the average busy time and the average idle time are computed based upon a standard deviation in CPU utilization of a batch job amongst the set of multi-threaded and memory bandwidth intensive batch jobs, and wherein the CPU utilization of the batch job is measured for a set of time intervals.
[009] In yet another aspect, there is provided one or more non-transitory machine readable information storage mediums comprising one or more instructions which when executed by one or more hardware processors cause the one or more hardware processors to perform a method for predicting execution time of multi-threaded and memory bandwidth intensive batch jobs in a concurrent batch job environment based upon a simulation of a Central Processing Unit (CPU) and memory contention, the method comprising: identifying, by one or more hardware processors, a set of multi-threaded and memory bandwidth intensive batch jobs executing concurrently on a multi-core system, wherein the set of multi-threaded and memory bandwidth intensive batch jobs are identified based upon a concurrency level of one or more multi-threaded batch jobs; performing, by the one or more hardware processors, a plurality of steps based upon the set of multi-threaded and memory bandwidth intensive batch jobs identified, wherein the plurality of steps comprise: (i) determining a distinct service demand for each of a total number of threads corresponding to the set of multi-threaded and memory bandwidth intensive batch jobs, wherein the distinct service demand comprises a CPU utilization of each of the total number of threads; (ii) identifying a memory bandwidth requirement of each of the total number of threads; and (iii) deriving an instantaneous utilization of the CPU for each of the set of multi-threaded and memory bandwidth intensive batch jobs as a function of time for a set of time intervals corresponding to each of the total number of threads; auto-designing, based upon the instantaneous utilization of the CPU and the identified memory bandwidth requirement of each thread, a job execution model comprising a plurality of idle threads and a plurality of threads ready for execution amongst the total number of threads; predicting, by a discrete event simulation technique, the execution time for each of the set of multi-threaded and memory bandwidth intensive batch jobs by performing a plurality of steps, wherein the plurality of steps comprise: (i) simulating the auto-designed job execution model in a simulation environment based upon the distinct service demand of each of the total number of threads and the identified memory bandwidth requirement of each of the total number of threads; and (ii) predicting, based upon the simulation, the execution time of each of the concurrently executing set of multi-threaded and memory bandwidth intensive batch jobs; executing one or more programming functions by each of the total number of threads to initialize the distinct service demand and the memory bandwidth requirement of each of the total number of threads; determining available memory bandwidth for a single thread based upon a number of threads executing on a single node, wherein the single thread and the number of threads correspond to the total number of threads, and wherein the single node corresponds to the multi-core system; computing an average busy time and an average idle time for each of the plurality of executing intervals of time based upon a standard deviation in CPU utilization of a batch job amongst the set of multi-threaded and memory bandwidth intensive batch jobs, and wherein the CPU utilization of the batch job is measured for a set of time intervals.
[010] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[011] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles.
[012] FIG. 1 illustrates a block diagram of a system for predicting execution time of multi-threaded and memory bandwidth intensive batch jobs in a concurrent batch job environment based upon a simulation of a Central Processing Unit (CPU) and memory contention, in accordance with some embodiments of the present disclosure.
[013] FIG. 2A through 2B is a flow diagram illustrating the steps involved in the process of predicting the execution time of the multi-threaded and memory bandwidth intensive batch jobs in the concurrent batch job environment based upon the simulation of the Central Processing Unit (CPU) and memory contention, in accordance with some embodiments of the present disclosure.
[014] FIG. 3 illustrates an architectural diagram or a high-level structural framework for predicting the execution time of the multi-threaded and memory bandwidth intensive batch jobs, in accordance with some embodiments of the present disclosure.
[015] FIG. 4 illustrates a block diagram and an example of a job execution model to be simulated by implementing a discrete event simulation technique for predicting the execution time of the multi-threaded and memory bandwidth intensive batch jobs, in accordance with some embodiments of the present disclosure.
[016] FIG. 5 is a block diagram and an example of partitioning total execution time of threads into a plurality of executing intervals, and computation of a busy time and an idle time of a thread amongst a total number of threads corresponding to the multi-threaded and memory bandwidth intensive batch jobs, in accordance with some embodiments of the present disclosure.
[017] FIG. 6 illustrates a graphical representation and an example of memory bandwidth requirement identified for one or more threads amongst the total number of threads, in accordance with some embodiments of the present disclosure.
[018] FIG. 7 illustrates a graphical representation and an example of the execution time predicted for the multi-threaded and memory bandwidth intensive batch jobs by implementing the proposed methodology in case of a bandwidth contention, wherein Hyper-Threading is turned-off and in an absence of memory binding, in accordance with some embodiments of the present disclosure.
[019] FIG. 8 illustrates a graphical representation and an example of the execution time predicted for the multi-threaded and memory bandwidth intensive batch jobs based upon a characterization of each of the multi-threaded and memory bandwidth intensive batch jobs with one thread each, wherein the Hyper-Threading is turned-off and in an absence of the memory binding, in accordance with some embodiments of the present disclosure.
[020] FIG. 9 illustrates a graphical representation and an example of the execution time predicted for the multi-threaded and memory bandwidth intensive batch jobs based upon a binding between threads corresponding to each of the multi-threaded and memory bandwidth intensive batch jobs to Non-Uniform Memory Access (NUMA) nodes, when the Hyper-Threading is turned-off but the memory binding is present, in accordance with some embodiments of the present disclosure.
[021] FIG. 10 illustrates a graphical representation and an example of the execution time predicted for the multi-threaded and memory bandwidth intensive batch jobs by oversubscribing cores, wherein there is no binding between the threads and the NUMA nodes, and the Hyper-Threading is turned-on but in an absence of the memory binding, in accordance with some embodiments of the present disclosure.
[022] FIG. 11 illustrates a graphical representation and an example of the execution time predicted based upon the binding of the multi-threaded and memory bandwidth intensive batch jobs to the NUMA nodes, wherein the Hyper-Threading is turned-on and the memory binding is present, in accordance with some embodiments of the present disclosure.
[023] FIG. 12 illustrates a graphical representation and an example of the execution time predicted by changing a binding sequence of the multi-threaded and memory bandwidth intensive batch jobs, wherein the Hyper-Threading is turned-on and the memory binding is present, in accordance with some embodiments of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
[024] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the spirit and scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope and spirit being indicated by the following claims.
[025] The embodiments of the present disclosure provide systems and methods for predicting execution time of multi-threaded and memory bandwidth intensive batch jobs in a concurrent batch job environment based upon a simulation of a Central Processing Unit (CPU) and memory contention. Batch processing constitutes a big portion of complex and large data processing in many organizations. Batch jobs may be multi-threaded and memory intensive. The use of multi-core systems for processing the multi-threaded and memory intensive batch jobs has increased significantly. In multi-core systems, the increase in the number of cores available on a processor chip reduces the per-core memory bandwidth, as the available memory bandwidth is shared amongst the cores. Although NUMA (Non-Uniform Memory Access) machines provide large bandwidth, it may still not be sufficient for some memory intensive multi-threaded applications. Insufficient available memory bandwidth affects the performance of the application.
[026] Moreover, memory latency in a NUMA architecture depends upon the memory location relative to the processor. The performance of a system using a NUMA architecture may be improved using processor affinity or data locality. Allocating memory close to the computing resource may reduce the latency, thereby improving the performance. However, modeling such complex architectures for concurrent job execution behavior comprises numerous challenges.
[027] As more and more enterprise systems migrate to multi-socket and multi-core nodes, memory modeling will become critical for performance prediction. Understanding the contention of a Central Processing Unit (CPU) and the memory access pattern is key to predicting the job execution behavior.
[028] Additionally, new architectures from Intel® and other organizations provide for Hyper-Threading (HT) in their processors. Hyper-Threading results in the execution of two threads on a single core and hides the latencies due to data access. Although Hyper-Threading does not provide the full advantage of a multi-core CPU, it offers an advantage over a single core by executing two threads simultaneously and filling unused stages in the functional pipeline. Since Hyper-Threading processor(s) behave differently than single core or multi-core systems, predicting the thread or job execution behavior based on physical core data becomes challenging.
[029] Hence, there is a need for a technology that provides for simulating and identifying a requirement of memory bandwidth of each thread in the batch jobs for predicting the overall completion time of the multi-threaded and memory bandwidth intensive batch jobs.
[030] Referring now to the drawings, and more particularly to FIGS. 1 through 12, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
[031] FIG. 1 illustrates an exemplary block diagram of a system 100 for predicting execution time of multi-threaded and memory bandwidth intensive batch jobs in a concurrent batch job environment based upon a simulation of a Central Processing Unit (CPU) and memory contention, in accordance with an embodiment of the present disclosure. In an embodiment, the system 100 includes one or more processors 104, communication interface device(s) or input/output (I/O) interface(s) 106, and one or more data storage devices or memory 102 operatively coupled to the one or more processors 104. The one or more processors 104, which are hardware processors, can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) is configured to fetch and execute computer-readable instructions stored in the memory 102. In an embodiment, the system 100 can be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud and the like.
[032] The I/O interface device(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface device(s) can include one or more ports for connecting a number of devices to one another or to another server.
[033] The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes.
[034] FIG. 2A through 2B, with reference to FIG. 1, illustrates an exemplary flow diagram of a method for predicting the execution time of the multi-threaded and memory bandwidth intensive batch jobs in the concurrent batch job environment based upon the simulation of the Central Processing Unit (CPU) and memory contention, in accordance with some embodiments of the present disclosure. In an embodiment, the system 100 comprises one or more data storage devices of the memory 102 operatively coupled to the one or more hardware processors 104 and is configured to store instructions for execution of steps of the method by the one or more processors 104. The steps of the method of the present disclosure will now be explained with reference to the components of the system 100 as depicted in FIG. 1 and the flow diagram. In the embodiments of the present disclosure, the hardware processors 104, when configured with the instructions, perform one or more methodologies described herein. [035] According to an embodiment of the present disclosure, at step 201, the one or more hardware processors 104 identify, based upon a concurrency level of one or more multi-threaded batch jobs, a set of multi-threaded and memory bandwidth intensive batch jobs executing concurrently on a multi-core system. In general, a multi-threaded architecture supports not only multiple processors but also multiple streams (or batch streams) executing simultaneously in each processor. The processors of the multi-threaded architecture computer are interconnected via an interconnection network. Each processor can communicate with every other processor through the interconnection network.
[036] Further, in a multiprocessor system, the compute and memory resources required by batch jobs are managed by time sharing. The completion time of a job in a concurrent run depends on the number of cores and the maximum memory bandwidth available. If the cores and memory bandwidth required by all concurrently running batch jobs are less than the maximum available cores and bandwidth of the multiprocessor system respectively, the total completion time can be derived directly from the clock time. However, advanced CPU and memory contention models may be required when the number of cores or the available memory bandwidth is less than the cumulative requirement of the concurrent jobs.
[037] In general, a high-level architecture of a multi-core general purpose computer comprises an arbitrary number of cores and a memory controller. One or more of the processor cores amongst the arbitrary number of cores send one or more memory requests of one or more executing threads to the memory controller via a corresponding cache. The memory controller then schedules the incoming requests of the various threads for access to various banks (e.g., memory banks 1, …, K) of a shared memory via a memory bus.
[038] Referring to FIG. 3, an architecture or a high-level structural model for implementing the proposed methodology may be referred. The implementation and execution of the architecture or the high-level structural model with the system 100 has been discussed in detail in steps 202 through 204 below. It may be noted that the embodiments of the proposed disclosure do not restrict the architecture or the high-level structural model to as shown in FIG. 3 only. The embodiments of the proposed disclosure provide for adjusting modifications in the architecture or the high-level structural model or integrating a new architecture with the system 100 for implementing the proposed methodology.
[039] Considering an example scenario (for the step 201), a set of five jobs J1, J2, J3, J4 and J5 executing concurrently in a batch environment X may be identified as multi-threaded and memory bandwidth intensive, wherein although the total memory bandwidth available in the batch environment X is 50 gigabits per second (Gbps), each of the five jobs J1, J2, J3, J4 and J5 needs a memory bandwidth of 25 Gbps or more for executing concurrently in the batch environment X. Thus, the memory bandwidth requirement of the set of five jobs J1, J2, J3, J4 and J5 saturates the system memory bandwidth.
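By way of illustration only, the saturation check in this example scenario may be sketched in Python as below; the function name and bandwidth figures are hypothetical and merely mirror the example above:

# Hedged sketch: flag a concurrent batch as bandwidth-saturated when the
# cumulative per-job memory bandwidth demand exceeds the system bandwidth.
def is_bandwidth_saturated(job_demands_gbps, system_bandwidth_gbps):
    return sum(job_demands_gbps) > system_bandwidth_gbps

demands = {"J1": 25, "J2": 25, "J3": 25, "J4": 25, "J5": 25}  # Gbps, illustrative
print(is_bandwidth_saturated(demands.values(), 50))  # True: contention expected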
[040] According to an embodiment of the present disclosure, at step 202, the one or more hardware processors 104 perform a plurality of steps based upon the identified set of multi-threaded and memory bandwidth intensive batch jobs. At step 202(i), the one or more hardware processors 104 determine a distinct service demand for each of a total number of threads corresponding to the set of multi-threaded and memory bandwidth intensive batch jobs, wherein the distinct service demand comprises a CPU utilization of each of the total number of threads.
[041] As is known in the art, a difference in the service demand of one or more threads amongst the total number of threads in a multi-threaded and memory bandwidth intensive batch job (amongst the set of multi-threaded and memory bandwidth intensive jobs) may affect the completion time of each of the set of multi-threaded and memory bandwidth intensive batch jobs running concurrently. Generally, fast running threads in a job with a low service demand finish early and, on completion, do not compete for resources with slow running threads of the same batch job or other batch jobs. Hence, it is very important to measure the service demand (or the distinct service demand) of each thread (amongst the total number of threads) for an accurate simulation and prediction of the job completion time of the set of multi-threaded and memory bandwidth intensive batch jobs running concurrently.
[042] In an embodiment, a K-means clustering technique may be implemented for clustering the service demand of the total number of threads of an individual batch job amongst the set of multi-threaded and memory bandwidth intensive batch jobs executing concurrently. The cluster centers are initialized using the k-means++ algorithm. As mentioned above, the distinct service demand comprises the distinct CPU utilization of each of the total number of threads and may be measured accordingly.
[043] In an example implementation of the step 202(i), suppose the distinct service demand of the one or more threads may be determined as below:
Job 1: No. of threads = 2, service demand of thread 1 = 10s, service demand of thread 2 = 12s, average CPU utilization of the job = 25%
Job 2: No. of threads = 3, service demand of thread 1 = 25s, service demand of thread 2 = 22s, service demand of thread 3 = 27s, average CPU utilization of the job = 35%
[044] Further, by implementing the K-means clustering technique, each of the total number of threads may be clustered based upon a similar service demand (or the CPU utilization) of the one or more threads.
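As a minimal sketch of this clustering step (assuming the scikit-learn library is available; the service demand values are the example figures above, and the choice of two clusters is illustrative only):

# Hedged sketch: cluster per-thread service demands (in seconds) with
# K-means, initializing the cluster centers with k-means++ as described.
import numpy as np
from sklearn.cluster import KMeans

service_demands = np.array([[10], [12], [25], [22], [27]])  # job 1 and job 2 threads
kmeans = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(service_demands)
print(labels)                   # cluster membership of each thread
print(kmeans.cluster_centers_)  # representative service demand per cluster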
[045] According to an embodiment of the present disclosure, at step 202(ii), the one or more hardware processors 104 identify a memory bandwidth requirement of each of the total number of threads. In an embodiment, the memory bandwidth requirement of each of the total number of threads may be identified for a local access and a remote access, as the local memory bandwidth and the remote memory bandwidth available to each of the total number of threads are different, which may result in different latencies for memory access. The different latencies are replicated while simulating an execution of a batch job amongst the set of multi-threaded and memory bandwidth intensive batch jobs executing concurrently.
[046] In an embodiment, an overhead factor γro due to remote memory data access may be computed as:
γro = γr / γl    equation (1)
wherein γr and γl represent the data access time from the remote memory and the local memory respectively. Comparing with previous research, the proposed disclosure makes the assumption that when there is no memory to thread binding, almost 25% of the threads of any batch job (that is, any batch job in general or amongst the set of multi-threaded and memory bandwidth intensive batch jobs) access data from a remote node.
[047] In an example implementation of step 202(ii), referring to FIG. 6, the memory bandwidth requirement identified (in Mega-bytes per second (MB/s)) for the one or more threads amongst the total number of threads may be referred. Referring to FIG. 6 again, it may be noted that the memory bandwidth requirement of each of the total number of threads varies from 1700 MB/s to 3500 MB/s.
[048] According to an embodiment of the present disclosure, at step 202(iii), the one or more hardware processors 104 derive an instantaneous utilization of the CPU for each of the set of multi-threaded and memory bandwidth intensive batch jobs as a function of time for a set of time intervals corresponding to each of the total number of threads. In an embodiment, an instantaneous value may be derived based upon the instantaneous utilization of the CPU by one or more multi-threaded and memory bandwidth intensive batch jobs (amongst the set of multi-threaded and memory bandwidth intensive batch jobs).
[049] The proposed disclosure provides for capturing a CPU utilization value of a batch job (or more specifically, of a multi-threaded and memory bandwidth intensive batch job amongst the set of multi-threaded and memory bandwidth intensive batch jobs) at a set of intervals, and more specifically, at a set of small intervals in isolation, and fitting the captured CPU utilization values into a regression function or distribution (for example, linear or non-linear, exponential, polynomial of degree n, etc.). Thus, the CPU utilization of the one or more multi-threaded and memory bandwidth intensive batch jobs may be represented with the help of an appropriate function of time or by some distribution. Further, instead of using a constant CPU utilization for the entire run of a thread, the instantaneous value(s) of the CPU may be derived with the function of time or a distribution of time for each of the set of smaller intervals during simulation.
[050] In an example implementation, let the total execution time of each thread i (amongst the total number of threads) be partitioned into n small intervals of time of size T such that n × T equals the total execution time of the thread. In an embodiment, within each interval, the idle time td and the execution time te of each thread may be determined as below:
te = T × Ct    equation (2)
td = T × (1 − Ct)    equation (3)
wherein Ct is the CPU utilization of the batch job (amongst the set of multi-threaded batch jobs) defined at time t of execution. The proposed disclosure provides for selecting the idle time and the execution time of the one or more threads from a uniform distribution with averages td and te for considering the fluctuations in the idle time and execution as below:
[(1 − s) × te, (1 + s) × te] = te    equation (4)
[(1 − s) × td, (1 + s) × td] = td    equation (5)
wherein s represents the variation or range around the mean CPU utilization of a batch job (amongst the set of multi-threaded batch jobs) measured at the set of small intervals of time.
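A minimal sketch of this per-interval derivation, assuming the CPU utilization has already been fitted to a function of time (the linear fit below is purely illustrative):

# Hedged sketch: partition a thread's run into n intervals of size T and
# sample its execution and idle time per interval (equations 2 through 5).
import random

T = 4.0    # interval size in seconds (illustrative)
n = 5      # number of intervals, so n * T is the total execution time
s = 0.1    # variation around the mean CPU utilization

def cpu_utilization(t):
    # Illustrative fitted regression function C(t); any fitted form works.
    return 0.6 + 0.05 * t / (n * T)

for i in range(n):
    c = cpu_utilization(i * T)
    te_mean = T * c            # equation (2): mean execution time
    td_mean = T * (1 - c)      # equation (3): mean idle time
    te = random.uniform((1 - s) * te_mean, (1 + s) * te_mean)  # equation (4)
    td = random.uniform((1 - s) * td_mean, (1 + s) * td_mean)  # equation (5)
    print(f"interval {i}: busy={te:.2f}s idle={td:.2f}s")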
[051] According to an embodiment of the present disclosure, at step 203, the one or more hardware processors 104 auto-design, based upon the instantaneous utilization of the CPU and the memory bandwidth requirement of each thread, a job execution model comprising a plurality of idle threads and a plurality of threads ready for execution amongst the total number of threads. The job execution model is auto-designed to be simulated for facilitating the prediction of the execution time of the set of multi-threaded and memory bandwidth intensive batch jobs in the concurrent batch job environment, based upon the simulation of the Central Processing Unit (CPU) and the memory contention (the simulation is discussed in step 204 below). In an example implementation of step 203, referring to FIG. 4, an example of the job execution model may be referred. The step of auto-designing the job execution model may now be considered in detail.
[052] In an embodiment, referring to FIG. 4 again, it may be noted that the job execution model comprises a memory contention model designed on top of a CPU contention model. In general, a job thread migrates between different states (for example, a ready for execution state and an execution state) during the lifetime of the thread. The proposed methodology assumes that when a thread (that is, any general thread) is ready for execution, it enters a ready queue. Threads may then be picked up from the ready queue on a First Come First Serve (FCFS) basis and may then be executed based upon the availability of a core. In the absence of any node binding, a batch job (any normal batch job) has an equal probability of assignment to the available Non-Uniform Memory Access (NUMA) nodes in a computing system.
[053] Generally, when a thread in a batch job is executing, it may retrieve some data from the local memory or the remote memory based upon the data placement of the thread in execution. Further, a thread migrates to an idle state for performing a number of functions (like Input/Output, a remote procedure call, etc.) by releasing the core. Again, the thread returns to the queue when ready for execution. The completion time of a thread is the sum of the time spent in a number of states (for example, a ready state, an idle state or an execution state), while the completion time of a batch job may be determined based upon the slowest running thread of the batch job (that is, the batch job whose completion time is to be determined).
[054] According to an embodiment of the present disclosure, referring to FIG. 5, it may be noted that the step of auto-designing initially comprises partitioning the total execution time of an individual thread (amongst the total number of threads) into a plurality of executing intervals, usually n small intervals of time T such that n × T equals the total execution time, wherein n represents the total number of intervals. If the CPU utilization at time t is denoted by Ct, then the busy time tbusy and the idle time tidle of a thread (amongst the total number of threads) in an interval are denoted by:
tbusy = Ct × T    equation (6)
tidle = (1 − Ct) × T    equation (7)
[055] According to an embodiment of the present disclosure, if the bandwidth required for executing a thread exceeds the bandwidth available to a thread on a core, then the busy time tbusy of that thread in the interval gets incremented by a factor δbw as below:
δbw = (BWreq − BWavailable) / BWavailable    equation (8)
wherein BWreq is the bandwidth requirement of the thread when the corresponding job (to which the thread belongs) is executing in isolation and BWavailable is the bandwidth available to the thread when the corresponding job is executing concurrently with other batch jobs. The available bandwidth BWavailable may be determined from the maximum bandwidth of a system node and the number of threads running on that node, that is:
BWavailable = BWmax_i / Tn_i    equation (9)
wherein BWmax_i is the maximum memory bandwidth available at node i and Tn_i denotes the maximum number of threads running at the node i. In an embodiment, in case there is a thread and memory binding, and some data is retrieved by one or more threads (amongst the total number of threads) from one or more local nodes, the busy time in an interval may be incremented based upon the memory bandwidth contention as:
tbusy = Ct × T × (1 + δbw)    equation (10)
[056] In an embodiment, in the absence of the thread and memory binding, the one or more threads (amongst the total number of threads) in a batch job may access data from the remote memory. The busy time in an interval for the one or more threads accessing data from the remote memory may be derived based upon equation (1) and equation (8) as:
tbusy = Ct × T × (1 + δbw × γro)    equation (11)
[057] In an embodiment, for considering fluctuation in the idle time and the busy time as in a real operating system, all the threads (amongst the total number of threads) must not start executing or go to an idle state simultaneously. Hence, for each interval the idle time and the busy time may be selected from a uniform distribution with averages tidle and tbusy as:
[(1 − s) × tbusy, (1 + s) × tbusy] = tbusy    equation (12); and
[(1 − s) × tidle, (1 + s) × tidle] = tidle    equation (13)
wherein s is the standard deviation in the CPU utilization of a batch job measured at small intervals of time. It may be noted that the cores are shared in a round robin fashion between the running threads, where the time slice is fixed at 0.1 seconds. Thus, the step of computing the busy time and the idle time comprises computing an average busy time and an average idle time (denoted by tbusy and tidle respectively) for each of a plurality of executing intervals, wherein the average busy time and the average idle time are computed based upon the standard deviation in the CPU utilization of the batch job.
[058] Referring to FIG. 5 again, the process of computing the busy time and the idle time may be referred. Considering an example scenario, suppose the total execution time of a thread (amongst the total number of threads) is 20 seconds, and the total execution time may be partitioned into five executing intervals of 4 seconds each. During each of the five executing intervals, the thread may comprise a busy time of 3 seconds and an idle time of 1 second for the first interval, a busy time of 2 seconds and an idle time of 2 seconds for the second interval, and so on (wherein the five executing intervals comprise the plurality of executing intervals).
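Continuing the example, a minimal sketch of the busy time computation under bandwidth contention is given below, assuming illustrative bandwidth figures; the helper function is hypothetical and merely mirrors equations (8) through (11):

# Hedged sketch: dilate an interval's busy time when the required memory
# bandwidth exceeds what the node can supply (equations 8 through 11).
def busy_time(c_t, T, bw_req, bw_max_node, threads_on_node,
              remote_access=False, gamma_r=2.0, gamma_l=1.0):
    bw_avail = bw_max_node / threads_on_node                 # equation (9)
    delta_bw = max(0.0, (bw_req - bw_avail) / bw_avail)      # equation (8)
    gamma_ro = gamma_r / gamma_l if remote_access else 1.0   # equation (1)
    return c_t * T * (1 + delta_bw * gamma_ro)               # equations (10)/(11)

# e.g., a thread needing 3500 MB/s on a node supplying 2000 MB/s per thread
print(busy_time(c_t=0.75, T=4.0, bw_req=3500.0, bw_max_node=28000.0,
                threads_on_node=14, remote_access=True))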
[059] According to an embodiment of the present disclosure, at step 204, the one or more hardware processors 104 predict by a discrete event simulation technique, the execution time for each of the set of multi-threaded and memory bandwidth intensive batch jobs by performing a plurality of steps. At step 204(i) the one or more hardware processors 104 simulate the auto-designed job execution model in a simulation environment by implementing the discrete event simulation technique.
[060] In an embodiment, the distinct service demand for each of the total number of threads, the identified memory bandwidth requirement of each of the total number of threads in batch job (amongst the set of multi-threaded and memory bandwidth intensive batch jobs), the CPU utilization of the batch job (corresponding to which the memory bandwidth requirement of each of the total number of threads has been identified) and maximum number of cores available in the computing system are given as an input to the simulation environment. Upon completing simulations, the simulation environment predicts the minimum and maximum completion time of each batch job amongst the set of multi-threaded and memory bandwidth intensive batch jobs identified. The process of simulating the job execution model in the simulation environment may now be considered in detail by referring to the discrete event simulation technique algorithm below.
Discrete Event Simulation Technique Algorithm
Input - Maximum memory bandwidth available on the system, service demand distribution of the threads, bandwidth requirement of the threads, and CPU utilization of the jobs in a batch
Result (Output) - Job completion time prediction (Ei) of each job in a batch
Initialization:
BWmax[numa_bind] <- maximum memory bandwidth on each NUMA node,
γr <- time to access data from the remote memory node,
γl <- time to access data from the local memory node,
nCPU <- number of CPUs,
cpu_idle <- CPU idle time,
interval_time <- time interval,
/* First arrival of the job */
Function arrival
for (each_job_in_the_batch) {
    Calculate the cpu_idle and cpu_busy time distribution for the interval interval_time;
    gap_Dist <- interval_time × cpu_idle × (1 ± s)
    busy_Dist <- interval_time × (1 − cpu_idle) × (1 ± s)
    Generate a uniform distribution of the idle time and the busy time for any interval:
    gapDistList(gap_Dist);
    busyDistList(busy_Dist);
    for (i = 0; i < job_threads; i = i + 1) {
        rem_time <- service demand of the thread;
        Add the job to job_queue();
        if execution_queue.size() < nCPU then
            schedule_job_execution();
        end
    }
    total_threads = total_threads + 1
}
/* Start of a new interval of a thread */
Function interval_start
Add the job to job_queue();
if execution_queue.size() < nCPU then
    schedule_job_execution();
end
/* Completion of an interval of a thread */
Function interval_end
Remove the job from execution_queue();
gapTime <- (Uniform) gapDistList.sample();
Schedule_after(gapTime);
if job_queue.size() > 0 then
    schedule_job_execution();
end
/* Start executing the job */
Function schedule_job_execution
Pick one thread from job_queue();
Get the residual time rem_time of the thread;
Add the thread to execution_queue();
proc_time <- (Uniform) busyDistList.sample();
if Thread_binding == TRUE then
    if total_threads[numa_bind] <= nCPUs then
        BWavail = BWmax[numa_bind] / total_threads[numa_bind]
    end
    else
        BWavail = BWmax[numa_bind] / nCPUs
    end
end
/* Time dilation due to inadequate memory bandwidth */
if BWavail < BWreq then
    overhead = (BWreq − BWavail) / BWavail
end
/* Latency overhead due to remote data access */
if thread_id_remote == TRUE then
    γro = γr / γl;
    overhead = overhead × γro
end
proc_time = proc_time + proc_time × overhead
if proc_time >= rem_time then
    departure();
    rem_time <- 0;
end
else
    rem_time <- (rem_time − proc_time);
    interval_end();
end
/* Removal of the thread from the system on completion */
Function departure
Release the job from execution_queue();
Print the job execution time and other statistics;
if job_queue.size() > 0 then
    schedule_job_execution();
end
[061] According to an embodiment of the present disclosure, referring to the algorithm above, it may be noted that the step of simulating the auto-designed job execution model comprises an execution of one or more programming functions by each of the total number of threads to initialize the distinct service demand and the memory bandwidth requirement of each of the total number of threads. The algorithm comprises five major functions (or the programming functions), namely, interval_end, schedule_job_execution, interval_start, arrival and departure.
[062] In an embodiment, the one or more programming functions facilitate simulating the job execution model. Each of the total number of threads, upon entering into a server, invokes the arrival function, wherein the derived instantaneous CPU utilization, the identified memory bandwidth requirement and the distinct service demand determined for that thread are initialized by the one or more hardware processors 104.
[063] In an embodiment, the simulation environment maintains two queues, that is, a job_queue, wherein the job_queue keeps track of the threads (amongst the total number of threads) that are ready for execution, and an execution_queue, wherein the execution_queue keeps track of the threads that are in a run state. Both the job_queue and the execution_queue are maintained for each NUMA node in each server for keeping track of the threads. As discussed above, the total execution time of each of the total number of threads is partitioned into the set of small intervals of time, and in each of the set of small intervals of time, each of the total number of threads executes for a small interval of time and remains idle for the remaining time.
[064] In an embodiment, by implementing the discrete event simulation technique, each of the total number of threads may be migrated between the job_queue and the execution_queue using the functions interval_start and interval_end on the start and completion of the busy time in an interval amongst the plurality of executing intervals. Further, the schedule_job_execution function keeps a record of the percentage of the total execution time that a thread (amongst the total number of threads) has completed.
[065] In an embodiment, the available memory bandwidth and a remaining execution time of any thread (amongst the total number of threads) are evaluated using the discrete event simulation technique at each of the plurality of executing intervals. Referring to equation (10), it may be noted that the execution time of a thread (amongst the total number of threads) in each of the plurality of executing intervals is adjusted based upon a time dilation due to a shortage of the available bandwidth. In case of a thread binding to memory, the node where the thread (that is, the binding thread) is pinned may be determined.
[066] In an embodiment, the available memory bandwidth may be determined for a single thread based upon the number of threads executing on a single node, wherein the single thread and the number of threads correspond to the total number of threads, and wherein the single node corresponds to the multi-core system. That is, the thread's available memory bandwidth is determined using the number of threads (amongst the total number of threads) running on that node. Further, in the absence of the thread and memory binding, the busy time in an interval (amongst the plurality of executing intervals) may be computed using equation (11). The thread may exit from the multi-core system upon completion of the execution time of the thread by invoking the departure function.
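For illustration only, the control flow of the algorithm above may be approximated by the simplified, self-contained Python sketch below; it retains the job_queue/execution_queue structure, the per-interval busy and idle sampling, and the bandwidth dilation of equations (8) through (10), but models a single node, omits the NUMA binding and remote access overhead paths, and uses hypothetical parameters and job figures throughout:

# Hedged sketch: simplified discrete event simulation of concurrent
# multi-threaded jobs sharing N_CPU cores and node memory bandwidth.
import heapq
import random

N_CPU = 4          # available cores (hypothetical)
BW_MAX = 20000.0   # node memory bandwidth in MB/s (hypothetical)
T = 1.0            # interval size in seconds
S = 0.1            # variation around the mean CPU utilization

def simulate(threads):
    # threads: list of dicts with keys job, demand (s), bw_req (MB/s), util.
    clock, seq, events = 0.0, 0, []
    job_queue = list(threads)   # threads ready for execution
    running = set()             # execution queue (thread ids)
    finish = {}                 # job -> completion time of its slowest thread

    def push(t, kind, th):
        nonlocal seq
        seq += 1
        heapq.heappush(events, (t, seq, kind, th))

    def schedule_job_execution():
        while job_queue and len(running) < N_CPU:
            th = job_queue.pop(0)
            running.add(id(th))
            bw_avail = BW_MAX / max(len(running), 1)                   # eq. (9)
            overhead = max(0.0, (th["bw_req"] - bw_avail) / bw_avail)  # eq. (8)
            work = min(random.uniform(1 - S, 1 + S) * T * th["util"],
                       th["demand"])          # busy work done this interval
            th["demand"] -= work
            push(clock + work * (1 + overhead), "interval_end", th)    # eq. (10)

    schedule_job_execution()
    while events:
        clock, _, kind, th = heapq.heappop(events)
        if kind == "interval_end":
            running.discard(id(th))
            if th["demand"] <= 1e-9:          # departure of the thread
                finish[th["job"]] = clock
            else:                             # idle gap, then ready again
                gap = random.uniform(1 - S, 1 + S) * T * (1 - th["util"])
                push(clock + gap, "interval_start", th)
            schedule_job_execution()
        else:                                 # interval_start
            job_queue.append(th)
            schedule_job_execution()
    return finish

jobs = [{"job": f"J{j}", "demand": 10.0, "bw_req": 6000.0, "util": 0.8}
        for j in range(1, 4) for _ in range(2)]   # three jobs, two threads each
print(simulate(jobs))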
[067] According to an embodiment of the present disclosure, at step 204(ii), the one or more hardware processors 104 predict, based upon the simulation, the execution time of each of the concurrently executing set of multi-threaded and memory bandwidth intensive batch jobs. Referring to FIGS. 7 through 12, the execution time of each of the concurrently executing set of multi-threaded and memory bandwidth intensive batch jobs predicted in seconds may be referred. The step may now be considered in detail by analyzing the proposed methodology.
[068] In an embodiment, the proposed methodology was analyzed and evaluated using a synthetic benchmark called STREAM or the STREAM benchmark. As is known in the art, the STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels. The STREAM benchmark is suitable for studying NUMA effects. In an embodiment, the STREAM benchmark performs four different vector operations, namely, a vector copy operation, an add operation, a scalar operation and a triad operation.
[069] In an embodiment, a plurality of experiments was conducted on a 28 physical (56 logical) core Intel® Xeon machine running the CentOS 6.0 Linux® system. Further, the plurality of experiments was conducted on the machine with Hyper-Threading on and off, and the two NUMA nodes present in the two sockets were used for conducting the plurality of experiments. The maximum memory bandwidth available was measured theoretically, wherein the maximum memory bandwidth available is a product of a clock frequency, a plurality of data transfers per clock cycle, the memory bus width (32 or 64 bit) and a plurality of memory interfaces. Still further, the Intel® Memory Latency Checker (mlc) tool was used to observe the latency due to the local access and the remote access, that is, γl and γr respectively. Finally, the numactl command was executed for processor and memory binding. [070] In an embodiment, each of a plurality of STREAM jobs was executed in isolation initially, wherein each of the plurality of STREAM jobs is a multi-threaded and memory bandwidth intensive batch job (that is, each of the plurality of STREAM jobs corresponds to the set of multi-threaded and memory bandwidth intensive batch jobs). The service demand of each of the total number of threads in each of the plurality of STREAM jobs, along with the instantaneous utilization of the CPU for each of the plurality of STREAM jobs at the plurality of executing intervals, was observed. Further, the memory bandwidth requirement of each of the total number of threads was also identified.
[071] In an embodiment, in order to determine the distinct service demand of the plurality of STREAM jobs running concurrently, the number of iterations performed by each of the plurality of STREAM jobs was modified, and the plurality of STREAM jobs were executed concurrently and the corresponding completion time was predicted for each of the plurality of STREAM jobs by implementing the discrete event simulation technique. The predicted completion time results were compared with a set of experimental data. Each simulation was executed with three different seed values and an average execution time was calculated. In an embodiment, a prediction error was computed as:
Error (%) = ((Ee − Ep) / Ee) × 100    equation (14)
wherein Ee and Ep represent the experimental and the predicted completion time of a batch job amongst the plurality of STREAM jobs executing concurrently.
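As a small worked check of equation (14) (the completion times below are hypothetical values, not measured data):

# Hedged sketch: prediction error per equation (14), in percent.
def prediction_error(e_exp, e_pred):
    return (e_exp - e_pred) / e_exp * 100.0

print(prediction_error(120.0, 111.0))  # 7.5% for these illustrative values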
[072] According to an embodiment of the present disclosure, the analysis of the results may now be considered in detail.
1. Hyper-Threading off and no memory binding - In an embodiment, the plurality of experiments was conducted with the 28 physical cores on the machine by turning off Hyper-Threading. In a first experiment, a plurality of STREAM benchmarks performing the vector copy operation, the add operation, the scalar operation and the triad operation was executed initially with 21, 14, 14 and 7 threads respectively. Each of the plurality of STREAM benchmarks was initially executed in isolation and the distinct service demand, the instantaneous utilization of the CPU and an average memory bandwidth requirement of each of the total number of threads were obtained. Each of the plurality of STREAM jobs was then executed concurrently without binding any threads to cores or memory. [073] Referring to the first experiment, the cores were over-subscribed since the total number of threads running in the system is more than the total number of available cores. Further, it may be noted that there is a contention for bandwidth as well, since the memory bandwidth requirement of the threads is more than the memory bandwidth available at each core. Referring to FIG. 7, the completion time prediction for all four workloads of the STREAM benchmark is comparable to the time observed experimentally.
[074] In an embodiment, a second experiment was conducted, wherein a set of five STREAM jobs was characterized with 4, 7, 14, 21, and 28 threads. Each of the five STREAM jobs ran all the four workloads of the STREAM benchmark sequentially and iteratively. All the five STREAM jobs were then executed concurrently. Referring to FIG. 8, a comparison of the experimentally observed completion time for each of the five STREAM jobs with the completion time predicted by implementing the proposed methodology may be referred. It may be noted that the prediction error was observed in the first and the second experiments because, in general, the service demand of the threads in a job changes with concurrency, whereas the proposed methodology assumes that the service demand of the threads does not change with the concurrency.
2. Hyper-Threading off with memory binding - In an embodiment, the threads of each of the five STREAM jobs were pinned, wherein the five STREAM jobs comprise 14, 7, 14, 21 and 7 threads, and the threads were pinned to the NUMA nodes 0, 1, 0, 1, and 0 respectively. No interference was observed amongst the threads when the threads were executing concurrently on a plurality of distinct nodes. Resource sharing (memory and cores) was observed only amongst the one or more threads running on a same node.
[075] Referring to FIG. 9, a comparison of the predicted and experimental job completion times may be referred. Referring to FIG. 9 again, it may be noted that the completion of the jobs (amongst the five STREAM jobs) running on the NUMA node 1 is predicted with a higher accuracy as compared with the NUMA node 0, as the NUMA node 1 has a smaller number of threads than the NUMA node 0. The higher accuracy for the NUMA node 1 is because of less context switching amongst the threads running on the NUMA node 1.
3. Hyper-Threading on and no memory binding - In an embodiment, a third experiment was conducted, wherein Hyper-Threading was turned on, which resulted in the availability of 56 logical cores on the system with no binding between the cores or the NUMA nodes and the threads. During the third experiment, the cores were over-subscribed by executing the five STREAM jobs, wherein the five STREAM jobs comprise 14 threads each and a number of distinct iterations. Further, each of the five STREAM jobs ran each of the plurality of STREAM benchmarks sequentially and iteratively. Referring to FIG. 10, a comparison of the predicted and experimental job completion times may be referred.
[076] Referring to FIG. 10 again, a higher job completion time was observed experimentally when compared to the predicted job completion time due to higher context switching amongst the threads, thereby resulting in a reduction of the maximum sustainable memory bandwidth and thus a larger completion time for the five STREAM jobs. In an embodiment, a plurality of prediction errors resulted from the discrepancy between the sustainable memory bandwidth during the actual run and the bandwidth measured theoretically for the prediction in the simulation.
4. Hyper-Threading on with memory binding - In an embodiment, a fourth experiment was conducted, wherein four STREAM benchmark jobs comprising 14 threads each and a number of distinct iterations were pinned to the NUMA nodes. Job1 and Job3 were executed on the NUMA node 0, while Job2 and Job4 were executed on the NUMA node 1. The available memory bandwidth on a single NUMA node (that is, either the NUMA node 1 or the NUMA node 0) was shared amongst the threads running on that NUMA node only. If a job finishes on any single NUMA node, the available memory bandwidth on the NUMA node is shared amongst the threads still running on that node. Referring to FIG. 11, a comparison of the predicted and experimental job completion times may be referred.
[077] In an embodiment, a fifth experiment was conducted, wherein the four STREAM benchmark jobs were executed with a different binding sequence, that is, Job1 and Job2 pinned to the NUMA node 0, and Job3 and Job4 pinned to the NUMA node 1. Referring to FIG. 12, the effect on the completion time of the four STREAM benchmark jobs with the different binding sequence may be observed. Referring to FIGS. 11 and 12 together, it may be noted that the different binding sequence affects the overall completion time, wherein the changes resulting from the different binding sequence get reflected in the predicted job completion time by implementing the proposed methodology. Thus, the proposed methodology facilitates finding an optimum binding of batch jobs and memory.
[078] The technical advantages of the proposed disclosure may now be considered in detail. As discussed above, the proposed disclosure provides for predicting the execution time of the set of multi-threaded and memory bandwidth intensive batch jobs. None of the traditional systems and methods has proposed predicting the completion time of concurrently running memory bandwidth intensive batch workloads (or batch jobs) based upon the memory bandwidth requirement and the distinct service demand of each of the threads in the set of multi-threaded and memory bandwidth intensive batch jobs. Further, the proposed disclosure provides for predicting the execution time of the concurrently running memory bandwidth intensive batch workloads with a high level of prediction accuracy. Referring to FIGS. 7 through 12, it may be noted that the prediction accuracy of the proposed methodology is similar with logical as well as physical cores.
[079] Still further, the proposed methodology may be implemented and integrated with other existing workflow systems, and may further optimize the other existing workflow systems. Finally, the proposed disclosure may be implemented to test and predict a plurality of optimal job-memory binding techniques, thereby resulting in the fastest overall completion time of batch workloads.
[080] In an embodiment, the memory 102 can be configured to store any data that is associated with predicting the execution time of the multi-threaded and memory bandwidth intensive batch jobs in the concurrent batch job environment based upon the simulation of the Central Processing Unit (CPU) and memory contention. In an embodiment, the information pertaining to the set of multi-threaded and memory bandwidth intensive batch jobs, the distinct service demand determined, the memory bandwidth requirement identified, the job execution model, the simulation of the job execution model, the execution time predicted for each of the concurrently executing set of multi-threaded and memory bandwidth intensive batch jobs, etc., and all other information pertaining to predicting the execution time of the multi-threaded batch jobs is stored in the memory 102. Further, all information (inputs, outputs and so on) pertaining to predicting the execution time of the multi-threaded and memory bandwidth intensive batch jobs in the concurrent batch job environment based upon the simulation of the Central Processing Unit (CPU) and memory contention is also stored in the memory 102.
[081] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[082] It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software modules located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
[083] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various modules described herein may be implemented in other modules or combinations of other modules. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[084] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[085] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term“computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[086] It is intended that the disclosure and examples be considered as exemplary only, with a true scope and spirit of disclosed embodiments being indicated by the following claims.


CLAIMS:
1. A method of predicting execution time of multi-threaded and memory bandwidth intensive batch jobs in a concurrent batch job environment based upon a simulation of a Central Processing Unit (CPU) and memory contention, the method comprising processor implemented steps of:
identifying, by one or more hardware processors, a set of multi-threaded and memory bandwidth intensive batch jobs executing concurrently on a multi-core system, wherein the set of multi-threaded and memory bandwidth intensive batch jobs are identified based upon a concurrency level of one or more multi-threaded batch jobs (201);
performing, by the one or more hardware processors, a plurality of steps based upon the set of multi-threaded and memory bandwidth intensive batch jobs identified, wherein the plurality of steps comprise (202):
(i) determining a distinct service demand for each of a total number of threads corresponding to the set of multi-threaded and memory bandwidth intensive batch jobs, wherein the distinct service demand comprises a CPU utilization of each of the total number of threads (202(i));
(ii) identifying a memory bandwidth requirement of each of the total number of threads (202(ii)); and
(iii) deriving an instantaneous utilization of the CPU for each of the set of multi-threaded and memory bandwidth intensive batch jobs as a function of time for a set of time intervals corresponding to each of the total number of threads (202(iii)); auto-designing, based upon the instantaneous utilization of the CPU and the identified memory bandwidth requirement of each thread, a job execution model comprising a plurality of idle threads and a plurality of threads ready for execution amongst the total number of threads (203); and
predicting, by a discrete event simulation technique, the execution time for each of the set of multi-threaded and memory bandwidth intensive batch jobs by performing a plurality of steps, wherein the plurality of steps comprise (204):
(i) simulating the auto-designed job execution model in a simulation environment based upon the distinct service demand of each of the total number of threads and the identified memory bandwidth requirement of each of the total number of threads (204(i)); and (ii) predicting, based upon the simulation, the execution time of each of the concurrently executing set of multi-threaded and memory bandwidth intensive batch jobs (204(ii)).
2. The method of claim 1, wherein the step of auto-designing the job execution model comprises:
(i) partitioning a total execution time for each of the total number of threads into a plurality of executing intervals;
(ii) computing, for each of the plurality of executing intervals, a busy time and an idle time for each of the total number of threads using the instantaneous utilization of the CPU; and
(iii) determining, based upon a comparison of the identified memory bandwidth requirement and an available memory bandwidth, whether the busy time for each of the total number of threads is to be incremented.
3. The method of claim 1, wherein the step of simulating the auto-designed job execution model comprises an execution of one or more programming functions by each of the total number of threads to initialize the distinct service demand and the memory bandwidth requirement of each of the total number of threads.
4. The method of claim 2, wherein the available memory bandwidth is determined for a single thread based upon a number of threads executing on a single node, wherein the single thread and the number of threads correspond to the total number of threads, and wherein the single node corresponds to the multi-core system.
5. The method of claim 2, wherein the step of computing the busy time and the idle time comprises computing an average busy time and an average idle time for each of the plurality of executing intervals of time to auto-design the job execution model.
6. The method of claim 5, wherein the average busy time and the average idle time are computed based upon a standard deviation in CPU utilization of a batch job amongst the set of multi-threaded and memory bandwidth intensive batch jobs, and wherein the CPU utilization of the batch job is measured for a set of time intervals.
7. A system (100) for predicting execution time of multi-threaded and memory bandwidth intensive batch jobs in a concurrent batch job environment based upon a simulation of a Central Processing Unit (CPU) and memory contention, the system (100) comprising:
a memory (102) storing instructions;
one or more communication interfaces (106); and
one or more hardware processors (104) coupled to the memory (102) via the one or more communication interfaces (106), wherein the one or more hardware processors (104) are configured by the instructions to:
identify a set of multi-threaded and memory bandwidth intensive batch jobs executing concurrently on a multi-core system, wherein the set of multi-threaded and memory bandwidth intensive batch jobs are identified based upon a concurrency level of one or more multi-threaded batch jobs;
perform a plurality of steps based upon the set of multi-threaded and memory bandwidth intensive batch jobs identified, wherein the plurality of steps comprise:
(i) determine a distinct service demand for each of a total number of threads corresponding to the set of multi-threaded and memory bandwidth intensive batch jobs, wherein the distinct service demand comprises a CPU utilization of each of the total number of threads;
(ii) identify a memory bandwidth requirement of each of the total number of threads; and
(iii) derive an instantaneous utilization of the CPU for each of the set of multi-threaded and memory bandwidth intensive batch jobs as a function of time for a set of time intervals corresponding to each of the total number of threads;
auto-design, based upon the instantaneous utilization of the CPU and the identified memory bandwidth requirement of each thread, a job execution model comprising a plurality of idle threads and a plurality of threads ready for execution amongst the total number of threads; and
predict, by a discrete event simulation technique, the execution time for each of the set of multi-threaded and memory bandwidth intensive batch jobs by performing a plurality of steps, wherein the plurality of steps comprise:
(i) simulate the auto-designed job execution model in a simulation environment based upon the distinct service demand of each of the total number of threads and the identified memory bandwidth requirement of each of the total number of threads; and
(ii) predict, based upon the simulation, the execution time of each of the concurrently executing set of multi-threaded and memory bandwidth intensive batch jobs.
8. The system (100) of claim 7, wherein the one or more hardware processors (104) are configured to:
(i) partition a total execution time for each of the total number of threads into a plurality of executing intervals;
(ii) compute, for each of the plurality of executing intervals, a busy time and an idle time for each of the total number of threads using the instantaneous utilization of the CPU; and
(iii) determine, based upon a comparison of the identified memory bandwidth requirement and an available memory bandwidth, whether the busy time for each of the total number of threads is to be incremented.
9. The system (100) of claim 7, wherein the one or more hardware processors (104) are configured to simulate the auto-designed job execution model by executing one or more programming functions via each of the total number of threads to initialize the distinct service demand and the memory bandwidth requirement of each of the total number of threads.
10. The system (100) of claim 8, wherein the one or more hardware processors (104) are configured to determine the available memory bandwidth for a single thread based upon a number of threads executing on a single node, wherein the single thread and the number of threads correspond to the total number of threads, and wherein the single node corresponds to the multi-core system.
11. The system (100) of claim 8, wherein the one or more hardware processors (104) are configured to compute the busy time and the idle time by computing an average busy time and an average idle time for each of the plurality of executing intervals of time to auto-design the job execution model.
12. The system (100) of claim 11, wherein the one or more hardware processors (104) are configured to compute the average busy time and the average idle time based upon a standard deviation in CPU utilization of a batch job amongst the set of multi-threaded and memory bandwidth intensive batch jobs, and wherein the CPU utilization of the batch job is measured for a set of time intervals.
PCT/IB2019/055684 2018-07-03 2019-07-03 Predicting execution time of memory bandwidth intensive batch jobs Ceased WO2020008392A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN201821022846 2018-07-03
IN201821022846 2018-07-03

Publications (2)

Publication Number Publication Date
WO2020008392A2 (en) 2020-01-09
WO2020008392A3 WO2020008392A3 (en) 2020-04-30

Family

ID=69060190

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2019/055684 Ceased WO2020008392A2 (en) 2018-07-03 2019-07-03 Predicting execution time of memory bandwidth intensive batch jobs

Country Status (1)

Country Link
WO (1) WO2020008392A2 (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7406689B2 (en) * 2005-03-22 2008-07-29 International Business Machines Corporation Jobstream planner considering network contention & resource availability
US8643656B2 (en) * 2010-09-30 2014-02-04 Nec Laboratories America, Inc. Energy-aware task consolidation on graphics processing unit (GPU)
US9367803B2 (en) * 2012-05-09 2016-06-14 Tata Consultancy Services Limited Predictive analytics for information technology systems

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112346866A (en) * 2020-11-05 2021-02-09 中国科学院计算技术研究所 A GPU scheduling method and system based on asynchronous data transmission
CN112346866B (en) * 2020-11-05 2023-09-01 中国科学院计算技术研究所 A GPU scheduling method and system based on asynchronous data transmission
CN112925697A (en) * 2021-03-30 2021-06-08 中国建设银行股份有限公司 Operation difference monitoring method, device, equipment and medium
CN112925697B (en) * 2021-03-30 2024-03-01 中国建设银行股份有限公司 Method, device, equipment and medium for monitoring job difference
US12056524B2 (en) 2021-07-28 2024-08-06 International Business Machines Corporation Predictive analysis on running batch jobs
US12242367B2 (en) 2022-05-15 2025-03-04 International Business Machines Corporation Feature importance based model optimization
CN117149447A (en) * 2023-10-31 2023-12-01 苏州元脑智能科技有限公司 Bandwidth adjustment method, device, equipment and storage medium
CN117149447B (en) * 2023-10-31 2024-02-13 苏州元脑智能科技有限公司 Bandwidth adjustment method, device, equipment and storage medium

Also Published As

Publication number Publication date
WO2020008392A3 (en) 2020-04-30

Similar Documents

Publication Publication Date Title
WO2020008392A2 (en) Predicting execution time of memory bandwidth intensive batch jobs
Chen et al. Deep learning research and development platform: Characterizing and scheduling with qos guarantees on gpu clusters
EP2879055B1 (en) System and method facilitating performance prediction of multi-threaded application in presence of resource bottlenecks
US9569262B2 (en) Backfill scheduling for embarrassingly parallel jobs
US11693706B2 (en) System and method for dynamic scheduling of distributed deep learning training jobs
US8739171B2 (en) High-throughput-computing in a hybrid computing environment
US10379883B2 (en) Simulation of high performance computing (HPC) application environment using virtual nodes
US10310908B2 (en) Dynamic usage balance of central processing units and accelerators
CN102902512B (en) A kind of multi-threading parallel process method based on multi-thread programming and message queue
US11861272B2 (en) Comprehensive contention-based thread allocation and placement
WO2019193570A1 (en) Batch jobs execution time prediction using distinct service demand of threads and instantaneous cpu utilization
US20170032487A1 (en) Pipelined approach to fused kernels for optimization of machine learning workloads on graphical processing units
Wang et al. An efficient and non-intrusive GPU scheduling framework for deep learning training systems
US11175940B2 (en) Scheduling framework for tightly coupled jobs
US12314851B2 (en) Microservice-based training systems in heterogeneous graphic processor unit (GPU) cluster and operating method thereof
US9507633B2 (en) Scheduling method and system
US20200125411A1 (en) Detection, modeling and application of memory bandwith patterns
CN105897805A (en) Method and device for cross-layer scheduling of resources of data center with multi-layer architecture
US8977752B2 (en) Event-based dynamic resource provisioning
CN118210615A (en) Resource allocation method and device
US20230185625A1 (en) Workload characterization-based capacity planning for cost-effective and high-performance serverless execution environment
Lu et al. InSTechAH: Cost-effectively autoscaling smart computing hadoop cluster in private cloud
Das et al. Energy-aware dynamic reconfiguration of communication-centric applications for reliable MPSoCs
Chen et al. Multiple CNN-based tasks scheduling across shared GPU platform in research and development scenarios
Thanh Chung et al. From reactive to proactive load balancing for task‐based parallel applications in distributed memory machines

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19831196; Country of ref document: EP; Kind code of ref document: A2)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19831196; Country of ref document: EP; Kind code of ref document: A2)