ISSN 2278-3091
International Journal of Advanced Trends in Computer Science and Engineering, 4(2), March - April 2015, 36 - 43
Available Online at http://www.warse.org/ijatcse/static/pdf/file/ijatcse06422015.pdf

Evaluation of Parallel Processing Systems through Queuing Model

Vikas Shinde
Department of Applied Mathematics, Madhav Institute of Technology & Science, Gwalior, India
ABSTRACT

Jackson queueing networks have been widely used to model and analyze the performance of complex parallel systems. In this investigation, an M/G/1 queueing system is used to model a parallel processing system that is expandable in both the vertical and the horizontal direction. Closed-form solutions are determined for the system performance metrics, such as the processors' waiting time and the system processing power.

Keywords: Queueing Network, Massive Parallel Processing, Shared Memory, Waiting Time.

1. INTRODUCTION

Parallel processing in computer systems has been widely studied due to its significant role in the fast day-to-day computing of jobs. As parallel computing systems proliferate, the need for effective performance evaluation grows and queueing techniques become ever more important. In fact, the performance of such systems depends on the hardware resources (CPU, memory, etc.), on the software (system programs, compilers, etc.) and on the organization and management of these resources. In view of the increasing complexity of computing systems, it is more and more difficult to predict their performance indices based on analytical queueing models. In such models, it is convenient to represent the resources as 'servers' and the programs as 'customers'. The parallel processing system modeled here is a system which is expandable in the vertical and horizontal directions and can be treated as a cluster serving a single queue of waiting jobs. A job is modeled as a sequence of independent stages which must be processed, where the number of processors desired by the job in each stage may be different. If, for some stage, the job in service requires fewer processors than the system provides, then the job occupies the processors according to its need and the other processors remain idle for that stage. If, for some other stage, the job in service requires more processors than the system provides, then it uses all the processors in the system for an extended period of time such that the total work served in that stage is conserved.

Many researchers have extensively investigated parallel processing systems via queue-theoretic approaches. Al-Saqabi et al. [1] established a distributed scheduling algorithm that tracks the available workstations in a network, i.e. the workstations not being used by their owners, and acts upon those workstations by scheduling the processes of parallel applications onto them. Guan and Cheung [2] constructed a massively parallel processing system which has drawn a lot of attention to an important feature affecting the performance and characteristics of the architecture with an interconnection of multiple processors.
Jean-Marie et al. [5] introduced a hybrid analytical approach using techniques from the theories of both stochastic task graphs and queueing networks. Jozwiak and Jan [6] discussed a quality-driven, model-based multiprocessor accelerator design method that adequately addresses the architecture design issues of hardware multiprocessors for modern, highly demanding embedded applications. Jan and Jozwiak [7] studied communication architectures for massively parallel hardware multiprocessors. A systematic framework and a corresponding methodology for workload modeling of parallel systems were proposed by Kotsis [8]. Mohapatra et al. [10] proposed a structure in which the processors are divided into groups or clusters and organized in several stages. Maheshwari and Shen [11] established a clustering algorithm wherein all the clusters have a balanced amount of computation load and there is only one communication path between any pair of clusters. Nassar [12] evaluated the throughput of several multibus systems as a discrete-time Markov chain under different working conditions. Reijns and van Gemund [13] considered the delay effect caused by memory interference in a parallel processing system with shared memory, implemented using a machine-repair queueing model. Tomic [14] gave the matrix representation of the linear evolution operator of a certain class of parallel processing systems, which can effectively be used as a performance prediction tool for modern parallel processing systems. Wasserman et al. [15] studied the problem of dynamic allocation of the resources of a general parallel processing system comprised of several heterogeneous processors.

The rest of the paper is organized as follows. The model description is given in section 2. Section 3 describes the governing equations and the performance analysis. The conclusion is presented in section 4.

2. MODEL DESCRIPTION

Every computer consists of a set of processors (CPUs) P1, P2, P3, ..., Pn and m ≥ 0 shared memory units M1, M2, M3, ..., Mm which communicate via an interconnection network N, as illustrated in figure 1. The memory units constitute a global main memory that provides a convenient message depository for processor-to-processor communication. A system with this arrangement is called a shared memory computer. A global shared memory can be a serious bottleneck, particularly when the processors share large amounts of information, since normally only one processor can access a given memory module at a time. If the processors have their own local memories, then the global memory can be reduced in size, or even eliminated completely. To separate the functions of processing and memory, a CPU with no associated main memory, but with other temporary storage units such as register files and caches, is referred to as a processing element (PE).
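To make the shared-memory arrangement concrete, the following minimal Python sketch models n processing elements and m memory modules joined by an interconnection network, with the one-processor-per-module restriction enforced by a simple occupancy table. The class and parameter names (SharedMemoryComputer, n_processors, and so on) are illustrative assumptions and do not come from the paper.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SharedMemoryComputer:
    """Toy view of the machine of figure 1: n PEs and m memory modules
    joined by an interconnection network; only one PE may hold a module
    at any time, which is the source of memory contention."""
    n_processors: int
    n_memory_modules: int
    busy: List[Optional[int]] = field(default_factory=list)  # busy[j] = PE holding module j

    def __post_init__(self):
        self.busy = [None] * self.n_memory_modules

    def request(self, pe: int, module: int) -> bool:
        """PE `pe` asks for memory module `module`; False means it must wait."""
        if self.busy[module] is None:
            self.busy[module] = pe
            return True
        return False

    def release(self, pe: int, module: int) -> None:
        if self.busy[module] == pe:
            self.busy[module] = None

if __name__ == "__main__":
    machine = SharedMemoryComputer(n_processors=4, n_memory_modules=2)
    print(machine.request(pe=0, module=1))   # True  -> access granted
    print(machine.request(pe=3, module=1))   # False -> module busy, PE 3 waits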
Figure 1: Shared memory computer (processing elements PE1, PE2, PE3, ..., PEn connected through the interconnection network N to the memory M)

As shown in figure 2, in the basic cluster each processing unit has a local memory for its own computation, and there is a shared memory for facilitating the communication between the processors. A horizontal communication network (HCN) is used for transmitting data between the processors and the shared memory. Moreover, the basic cluster includes a unit for I/O operations and a unit for supervising and managing the processors. A vertical communication network (VCN) is used for transmitting control signals and for vertical expansion of the system.

The basic cluster can be expanded in two ways: (i) by increasing the number of processing units, or (ii) by using several basic clusters with one additional memory that is shared by those clusters, which yields a two-stage system. It must be noted that in the second level of the system there is an HCN that connects the VCN of each basic cluster to SM2. The units that are located inside the basic clusters are indicated by (SM1, HCN1, ...), and the units that are located outside of the clusters are indicated by (SM2, HCN2, ...).
Figure 2: Basic Cluster (processors P1, ..., PN with local memories LM1, ..., LMN, the shared memory SM1, an I/O unit and a manager, connected by the Horizontal and Vertical Communication Networks)
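As a concrete illustration of how the units of the basic cluster fit together, the following minimal Python sketch groups them into one record. The class name BasicCluster and its field names are illustrative assumptions rather than notation from the paper.

from dataclasses import dataclass

@dataclass
class BasicCluster:
    """Units of the basic cluster of figure 2."""
    n_processing_units: int          # P1 ... PN, each with its own local memory LM
    has_shared_memory: bool = True   # SM1, reached over the horizontal network
    has_io_unit: bool = True         # unit for I/O operations
    has_manager: bool = True         # supervisory / managing unit

    def units(self) -> dict:
        """Summarize the cluster contents: data moves over HCN1,
        control signals and vertical expansion use VCN1."""
        return {
            "processors_with_local_memory": self.n_processing_units,
            "shared_memory_SM1": self.has_shared_memory,
            "horizontal_network_HCN1": True,
            "vertical_network_VCN1": True,
            "io_unit": self.has_io_unit,
            "manager": self.has_manager,
        }

if __name__ == "__main__":
    print(BasicCluster(n_processing_units=8).units())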
Figure 3: Two Stage System (basic clusters, each containing SM1, HCN1, VCN1, I/O and manager units, connected at the second level through SM2, HCN2 and VCN2)

This method can expand the system vertically, constructing an s-stage system. A cluster in the ith stage of the s-stage system is depicted in figure 4. Here a cluster includes some processing clusters (PCs), one I/O cluster and one managing cluster. There are two interconnection networks, HCNi and VCNi, which transmit data inside and outside of the cluster, respectively. Such systems can be expanded vertically by increasing the number of stages, or horizontally by increasing the number of PCs in each level. In a system based on a multistage clustering structure (MSCS), if the number of PCs that make up a cluster is equal for all clusters of the ith stage, the system is said to be homogeneous at level i. If the system is homogeneous at all levels it is called homogeneous; if, on the other hand, it is not homogeneous in at least one stage, it is recognized as non-homogeneous or heterogeneous (a small illustrative check of this property is sketched after figure 4).
Figure 4: Cluster in the ith stage of an s-stage system (SMi, Horizontal Communication Network i, processing clusters PCi-1, I/O cluster I/OCi-1, managing cluster MCi-1, Vertical Communication Network i, with vertical and horizontal expansion paths)
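The homogeneity property defined above can be checked mechanically. The short Python sketch below is a minimal illustration; the function name is_homogeneous and the nested-list representation of a multistage clustering structure (for each stage, the list of cluster sizes in PCs) are assumptions introduced here, not notation from the paper.

from typing import List

def is_homogeneous(stage_cluster_sizes: List[List[int]]) -> bool:
    """A system is homogeneous at level i when every cluster of stage i is
    built from the same number of PCs, and homogeneous overall when this
    holds at every stage; otherwise it is heterogeneous."""
    return all(len(set(sizes)) <= 1 for sizes in stage_cluster_sizes)

if __name__ == "__main__":
    # Stage 1: three clusters of 4 PCs each; stage 2: two clusters of 3 PCs each.
    print(is_homogeneous([[4, 4, 4], [3, 3]]))   # True  -> homogeneous
    # Stage 2 now mixes cluster sizes, so the system is heterogeneous.
    print(is_homogeneous([[4, 4, 4], [3, 2]]))   # False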
3. THE PERFORMANCE ANALYSIS

For evaluating the performance of the system, consider a system constructed as a homogeneous MSCS. In this system any processor performs a piece of the main program, which is called the processor's job. During the job execution, it is probable that a job needs to communicate with the other jobs. Therefore several queues can be constructed, one for each interconnection network and each basic cluster. Consider the following assumptions for analyzing the system (the parameters introduced here are collected in the sketch that follows the description of the solution technique):

- Ci is the number of PCs in the ith stage of the system, and C0 is the number of processors in each basic cluster that share its memories.
- The processors themselves generate the inter-job communication requests. The time between two consecutive requests is exponentially distributed with parameter λ.
- The access time to a memory in the ith stage is exponentially distributed with parameter µmi.
- The destination of each request is uniformly distributed among the processors' jobs, and the probability of an outgoing request from the ith stage is denoted by Pi.
- The service times of the interconnection networks in the ith stage are exponentially distributed with parameters µhi and µvi for HCNi and VCNi, respectively.
- Conflicts over the memory modules and the HCN and VCN interconnection networks are resolved by queueing; each such service center is modeled as an M/G/1 queue.
- Requesting processors must wait until they receive service according to the above scheme, and during the waiting period they cannot generate any other request.

For this parallel processing system, the input rate of each stage must be computed, and the queueing problem is analyzed by developing an M/G/1 model. When analyzing the design of MPPs with a large number of units, the computational effort of an exact closed queueing network analysis becomes very large. Instead, queueing network methodology is applied to the closed queueing network by determining the input rate of each service center as a function of the input rate of the previous center. This technique can reduce the calculation and simulation time.
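To keep the later formulas concrete, the following minimal Python sketch collects the quantities introduced by the assumptions above: the request rate λ, the routing probabilities Pi, the cluster counts Ci, and the service rates of the memories and networks, plus a generator for the exponential inter-request times. The class name ModelParameters, the field names, and the sampling helper are illustrative assumptions, not notation from the paper.

import random
from dataclasses import dataclass
from typing import List

@dataclass
class ModelParameters:
    """Parameters of the homogeneous MSCS queueing model of section 3."""
    lam: float            # λ: rate of the exponential inter-request times
    C: List[int]          # [C0, C1, ..., Cs]; C0 = processors per basic cluster
    P: List[float]        # [P1, ..., Ps]; Pi = probability a request leaves stage i
    mu_m: List[float]     # [µm1, ..., µms]: memory access rates
    mu_h: List[float]     # [µh1, ..., µhs]: HCN service rates
    mu_v: List[float]     # [µv1, ..., µvs]: VCN service rates

def sample_inter_request_times(lam: float, n: int, seed: int = 1) -> List[float]:
    """Draw n exponentially distributed gaps between consecutive requests,
    matching the arrival assumption of the model."""
    rng = random.Random(seed)
    return [rng.expovariate(lam) for _ in range(n)]

if __name__ == "__main__":
    params = ModelParameters(
        lam=0.2,
        C=[8, 4, 2],          # C0 = 8 processors per basic cluster, C1 = 4, C2 = 2
        P=[0.3, 0.0],         # P1 = 0.3, P2 = 0 (no request leaves the last stage)
        mu_m=[1.0, 1.0],
        mu_h=[2.0, 2.0],
        mu_v=[2.0, 2.0],
    )
    gaps = sample_inter_request_times(params.lam, 1000)
    print(sum(gaps) / len(gaps))   # sample mean, close to 1/λ = 5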
Figure 5: Multistage cluster MPP with s stages (queueing network of the processors (PROC) and of the service centers HCNi, VCNi and SMi of each stage)

As shown in figure 5, all the requests departing from HCNi pass through SMi with probability one. Therefore the input request rates of the VCNs and HCNs can be computed. The requests of a processor are directed to the service centers HCN1 and VCN1 with probabilities (1 - P1) and P1, respectively. If the request rate of a processor is λ, the input rates to HCN1 and VCN1 originating from that processor are λ(1 - P1) and λP1. Since there are (C0 - 1) further processors in each basic cluster, the requests arriving at HCN1 and VCN1 from the other processors of the same cluster contribute λ(1 - P1)(C0 - 1) and λP1(C0 - 1), respectively. So the total request rates γh1 and γv1 received by the service centers of the first stage can be computed by the following equations:

γv1 = λP1 + λ(C0 - 1)P1 = λC0P1    (1)

γm1 = γh1 = λ(1 - P1) + λ(C0 - 1)(1 - P1) = λC0(1 - P1)    (2)

The input request rate at the ith stage from each PC is γv(i-1), so that

γvi = Pi γv(i-1) + (C(i-1) - 1) Pi γv(i-1) = C(i-1) Pi γv(i-1)    (3)

γmi = γhi = (1 - Pi) γv(i-1) + (C(i-1) - 1)(1 - Pi) γv(i-1) = C(i-1) (1 - Pi) γv(i-1)    (4)

In the last stage there is no request directed to an outer cluster, so that

γvs = 0    (5)

γms = γhs = C(s-1) (1 - Ps) γv(s-1) + C(s-1) Ps γv(s-1) = C(s-1) γv(s-1)    (6)
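The recursion (1)-(6) is straightforward to evaluate numerically. The following minimal Python sketch (the function name stage_request_rates and its argument layout are assumptions made for illustration) returns, for each stage i, the pair (γvi, γhi), with γmi = γhi, starting from the per-processor request rate λ.

from typing import List, Tuple

def stage_request_rates(lam: float, C: List[int], P: List[float]) -> List[Tuple[float, float]]:
    """Evaluate equations (1)-(6).

    lam : per-processor request rate λ
    C   : [C0, C1, ..., C(s-1)]  (C0 = processors per basic cluster,
          C(i) = number of PCs per cluster of stage i)
    P   : [P1, ..., Ps]          (Ps does not enter γvs, which is zero)

    Returns [(γv1, γh1), ..., (γvs, γhs)], where γmi = γhi.
    """
    s = len(P)
    rates = []
    gamma_v_prev = lam                      # rate feeding stage 1 from one processor
    for i in range(1, s + 1):
        c_prev = C[i - 1]                   # C(i-1)
        if i < s:
            gamma_v = c_prev * P[i - 1] * gamma_v_prev          # eqs. (1) and (3)
            gamma_h = c_prev * (1.0 - P[i - 1]) * gamma_v_prev  # eqs. (2) and (4)
        else:
            gamma_v = 0.0                                       # eq. (5)
            gamma_h = c_prev * gamma_v_prev                     # eq. (6)
        rates.append((gamma_v, gamma_h))
        gamma_v_prev = gamma_v
    return rates

if __name__ == "__main__":
    # Two-stage example: C0 = 8 processors per basic cluster, C1 = 4 clusters,
    # and P1 = 0.25 of the first-stage requests leave the basic cluster.
    for i, (gv, gh) in enumerate(stage_request_rates(lam=0.2, C=[8, 4], P=[0.25, 0.0]), 1):
        print(f"stage {i}: gamma_v = {gv:.3f}, gamma_h = gamma_m = {gh:.3f}")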
Now consider the M/G/1 model to calculate the queue length at each node for all stages; the average number of waiting processors in the system can then be computed. Using the Pollaczek-Khinchine formula, this gives

L = ρ + (ρ² + λ²σs²) / (2(1 - ρ))    (7)

where ρ is the utilization of the service center and σs² is the variance of its service time.

The waiting processors are not able to generate requests. In this situation the effective request rate of a processor is lower than the nominal one; the effective request rate decreases in the same ratio as the fraction of processors that remain active in the system. L and λ are therefore calculated successively until their changes in two consecutive steps become negligible. After calculating the effective request rate, the waiting time can be determined by Little's formula as

W = L / λ = 1/µ + (ρ² + λ²σs²) / (2λ(1 - ρ))    (8)
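A minimal numerical sketch of equations (7) and (8), together with the successive-substitution loop on the effective request rate described above, is given below. The function names, the convergence tolerance, and the way the active-processor fraction is formed are illustrative assumptions rather than prescriptions of the paper.

def pk_queue_length(lam: float, mu: float, sigma_s2: float) -> float:
    """Pollaczek-Khinchine mean number in an M/G/1 center, eq. (7):
    L = rho + (rho**2 + lam**2 * sigma_s2) / (2 * (1 - rho))."""
    rho = lam / mu
    if rho >= 1.0:
        raise ValueError("unstable center: rho must be < 1")
    return rho + (rho ** 2 + lam ** 2 * sigma_s2) / (2.0 * (1.0 - rho))

def waiting_time(lam: float, mu: float, sigma_s2: float) -> float:
    """Little's formula applied to (7), i.e. eq. (8): W = L / lam."""
    return pk_queue_length(lam, mu, sigma_s2) / lam

def effective_rate(lam_nominal: float, n_procs: int, mu: float, sigma_s2: float,
                   tol: float = 1e-9, max_iter: int = 1000) -> float:
    """Successive substitution for the effective request rate: waiting
    processors generate no requests, so the offered rate is scaled by the
    fraction of processors that are not waiting (L of them are)."""
    lam_eff = lam_nominal
    for _ in range(max_iter):
        L = pk_queue_length(lam_eff, mu, sigma_s2)
        lam_new = lam_nominal * max(0.0, 1.0 - L / n_procs)
        if abs(lam_new - lam_eff) < tol:
            return lam_new
        lam_eff = lam_new
    return lam_eff

if __name__ == "__main__":
    lam, mu, var = 0.5, 1.0, 0.25
    print(round(pk_queue_length(lam, mu, var), 4))     # L from eq. (7)
    print(round(waiting_time(lam, mu, var), 4))        # W from eq. (8)
    print(round(effective_rate(lam, 8, mu, var), 4))   # converged effective rate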
Here Pvi, Pmi and Phi are the probabilities that a processor request is referred to VCNi, SMi and HCNi, respectively; they are computed by the following product-type solution:

Pvi = ∏_{j=0}^{i-1} Pj    (9)

Pmi = Phi = (1 - Pi) ∏_{j=0}^{i-1} Pj    (10)

By determining the average waiting time of a processor for each communication request, the processor utilization (PU) can be obtained as

PU = 1 / (1 + λ(w + (ρ² + λ²σs²) / (2λ(1 - ρ))))    (11)

where w is the mean service time of a communication request.

The total processing power of the system (TPP) is obtained from the single processor power (SPP). Thus

TPP = N × PU × SPP = PU × SPP × ∏_{i=0}^{s} Ci    (12)

where N = ∏_{i=0}^{s} Ci is the total number of processors in the system.
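Equations (11) and (12) can be sketched in the same style (processor_utilization and total_processing_power are assumed helper names; the waiting-time term simply reuses the Pollaczek-Khinchine expression, and SPP is taken as 1.0 in the example).

from math import prod

def processor_utilization(lam: float, w: float, mu: float, sigma_s2: float) -> float:
    """Eq. (11): PU = 1 / (1 + lam * (w + (rho**2 + lam**2 * sigma_s2) / (2 * lam * (1 - rho)))),
    i.e. the fraction of time a processor computes rather than waits for requests."""
    rho = lam / mu
    wq = (rho ** 2 + lam ** 2 * sigma_s2) / (2.0 * lam * (1.0 - rho))
    return 1.0 / (1.0 + lam * (w + wq))

def total_processing_power(C: list, pu: float, spp: float = 1.0) -> float:
    """Eq. (12): TPP = N * PU * SPP, with N = C0 * C1 * ... * Cs processors."""
    n_processors = prod(C)
    return n_processors * pu * spp

if __name__ == "__main__":
    pu = processor_utilization(lam=0.2, w=1.0, mu=1.0, sigma_s2=1.0)
    print(round(pu, 4))                                      # 0.8
    print(round(total_processing_power([8, 4, 2], pu), 2))   # 64 processors * PU = 51.2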
4. CONCLUSIONS

In this investigation, a parallel processing system has been modeled as a sequence of stages, each of which requires a certain integral number of processors for a certain interval of time. A new structure has been proposed and an analytical model has been developed for a massive parallel processing system based on queueing theory. The resulting system performance metrics may provide insights to system designers and decision makers for improving the system at optimal cost.
REFERENCES

1. Al-Saqabi, K., Sarwar, S. and Saleh, K.: Distributed gang scheduling in networks of heterogeneous workstations, J. Computer Communications, Vol. 20, No. 5, pp. 338-348. (1997)
2. Guan, H. and Cheung, To-Yat: Efficient approaches for constructing a massively parallel processing system, J. of Systems Architecture, Vol. 46, No. 13, pp. 1185-1190. (2000)
3. Hayes, J. P.: Computer Architecture and Organization, McGraw-Hill. (1998)
4. Hwang, K. and Xu, Z.: Scalable Parallel Computing, McGraw-Hill. (1998)
5. Jean-Marie, A., Lefebvre-Barbaroux, S. and Liu, Z.: An analytical approach to the performance evaluation of master-slave computational models, J. Parallel Computing, Vol. 24, No. 5-6, pp. 841-862. (1998)
6. Jozwiak, L. and Jan, Y.: Design of massively parallel hardware multi-processors for highly demanding embedded applications, J. of Microprocessors and Microsystems, Vol. 37, pp. 1155-1172. (2013)
7. Jan, Y. and Jozwiak, L.: Scalable communication architectures for massively parallel hardware multi-processors, J. of Parallel and Distributed Computing, Vol. 72, pp. 1450-1463. (2012)
8. Kotsis, G.: A systematic approach for workload modeling for parallel processing systems, J. Parallel Computing, Vol. 22, No. 13, pp. 1771-1787. (1997)
9. Kleinrock, L.: Queueing Systems, Vol. II: Computer Applications, New York, Wiley. (1975)
10. Mohapatra, P., Das, C. R. and Feng, T. Y.: Performance analysis of cluster-based multiprocessors, IEEE Trans. on Computers, Vol. 43, pp. 109-114. (1994)
11. Maheshwari, P. and Shen, H.: An efficient clustering algorithm for partitioning parallel programs, J. Parallel Computing, Vol. 24, No. 5-6, pp. 893-909. (1998)
12. Nassar, H.: A Markov model for multibus multiprocessor systems under asynchronous operation, J. Information Processing Letters, Vol. 54, No. 1, pp. 11-16. (1995)
13. Reijns, G. L. and van Gemund, J. C.: Analysis of a shared-memory multiprocessor via a novel queueing model, J. of Systems Architecture, Vol. 45, No. 14, pp. 1189-1193. (1999)
14. Tomic, D.: Spectral performance evaluation of parallel processing systems, J. Parallel Computing, Vol. 13, No. 1, pp. 25-38. (2002)
15. Wasserman, K. M., Michailidis, G. and Bambos, N.: Optimal processor allocation to differentiated job flows, J. Performance Evaluation, Vol. 63, No. 1, pp. 1-14. (2006)