Assessment of the Network Reliability


www.as-se.org/ccse

Communications in Control Science and Engineering (CCSE) Volume 1 Issue 4, October 2013

Al-hadsha F. A. H., Gaevoy S. V., Lukyanov V. S.
Computers and Computer Systems Department, Volgograd State Technical University, 400005, Volgograd, Lenin 50, Russia
alhadsha@mail.ru

Abstract

Due to the mass spread of distributed computer systems, reliability assessment of such systems is necessary. In this paper, an attempt is made to analyze a system consisting of two nodes (the source and the receiver) and a set of intermediate members. The system is considered functional when there is a connection between the two elements. One of the primary goals is to define which elements of the system can be considered perfectly reliable.

Keywords

Networks; Simulation; Reliability Indices; Messaging

Introduction

The problem is formulated as follows: the reliability indices of a network consisting of stations and interconnections need to be defined. Let's call one of the stations the source of the signal and the second the receiver. The system is considered operable if there is at least one transmission channel from the source to the receiver. It is assumed that interconnections can transmit the signal in both directions.

Assume that at the initial time the system is perfectly faultless. Then failures of the elements occur under the influence of Poisson flows of failures. Recovery of the failed elements is possible. The dependence of the probability of no-failure operation on time needs to be defined, that is, the probability of no-failure work within a certain period; the average time of system operation should also be defined. Let us consider the system shown in FIG 1. Take the following Poisson flows of failures: for the network elements "Proc" and "Dev", 0.1 failure/hour; for the interconnections "Cnn", 0.01 failure/hour; and for the source "A" and the receiver "B", 0.02 failure/hour. The average recovery time is the same for all the elements, 2 hours. The direction of the signal is shown with arrows. The solution can theoretically be found using continuous-time Markov chains, but that consumes much time because of the large number of states (each element can be broken or working). Also, every random variable must then be assumed to be exponentially distributed.

To get the solution, the discrete-event simulation method is utilized: random outcomes of the system operation are simulated many times to gather statistics and obtain sufficiently accurate estimates of the required parameters. In accordance with the Central Limit Theorem (CLT), the error is estimated. The confidence probability is taken to be 99.73%, in line with the 68-95-99.7 rule (three-sigma rule, empirical rule).

The simulation of the system will be conducted:

a) Without recovery (I)
b) With exponential recovery (E)
c) With Weibull recovery with shape parameter 3.3 (W); "3.3" makes the law close to normal.
d) With a deterministic recovery time (C)

The nodes are set independently. Other distributions can also be used; this set includes, but is not limited to, the normal, uniform, triangular, gamma, hyperexponential and some "custom" distributions. These distributions can easily be integrated in the model, but the example does not need them.

We consider the following variants of the system operation:

a) Only intermediate nodes fail (N)
b) Intermediate nodes and interconnections fail (NL)
c) All nodes, including the source and the receiver, and interconnections fail (All)
d) As in All, but the source and the receiver have redundancy (AllR)

As the redundancy we consider loaded redundancy: two sources and two receivers working simultaneously; when one of them fails, the other keeps the system operable.
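The four recovery laws above can be sampled as follows. This is an illustrative Python sketch, not the authors' code; the function name is assumed, and the Weibull scale is chosen so that the mean equals the stated average repair time of 2 hours.

```python
import math
import random

MEAN_REPAIR = 2.0   # average recovery time from the paper, hours
K = 3.3             # Weibull shape parameter used in the paper

def recovery_time(policy, rng):
    """Sample one repair time under policies E, W, C (I has no recovery)."""
    if policy == "E":                       # exponential recovery
        return rng.expovariate(1.0 / MEAN_REPAIR)
    if policy == "W":                       # Weibull recovery, shape 3.3
        # scale chosen so that mean = scale * Gamma(1 + 1/k) = MEAN_REPAIR
        scale = MEAN_REPAIR / math.gamma(1.0 + 1.0 / K)
        return rng.weibullvariate(scale, K)
    if policy == "C":                       # deterministic recovery time
        return MEAN_REPAIR
    raise ValueError("policy I has no recovery")
```

All three recoverable policies then share the same mean repair time, which makes their operation-time results in TABLE 1 directly comparable.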



FIG. 1 GRAPHICAL SYSTEM

The notation for the systems combines these labels, e.g. AllR-W, N-C. Let us bring in some extra designations: AB, the probability of no-failure operation of the source and the receiver without redundancy; ABx2, the same with a single redundancy of each.

Modeling

Description of the Model

The model works in the following way. Each element has two modes, operating status and failure, and there is a certain time of changing the status, counted from the beginning of the modeling. The special events of the model are failures and recoveries of the elements; at such a moment the status of the element changes to the opposite. The procedure returning a random no-failure operation time of occurrence i is defined as get_time_for_work(i), and a random failure-status time as get_time_for_break(i). m is the number of system elements, n is the total number of occurrences. The matrix at the initial moment:

    A = | 1   get_time_for_work(1)   |
        | 1   get_time_for_work(2)   |
        | 1   get_time_for_work(3)   |
        | ... ...                    |
        | 1   get_time_for_work(n-1) |
        | 1   get_time_for_work(n)   |

Special events occur at instants of time. Let's describe the handling of an event in pseudocode:

    if (A[i][1] == 1) {
        A[i][1] = 0;
        A[i][2] = A[i][2] + get_time_for_break(i);
    } else {
        A[i][1] = 1;
        A[i][2] = A[i][2] + get_time_for_work(i);
    }

"1" in the first column means the occurrence is operable, "0" the state of failure. To accelerate the calculations, the event times may be kept in a binary heap.

The operating elements are defined by the operating occurrences. Let us associate with each element a counter equal to the number of its operating occurrences, refreshed with every change of the occurrence flag A[i][1]. A sign of operability is a positive value of the counter; the signs of failure and recovery are, respectively, the transition of the counter to zero and from zero.

To accelerate the calculations, all the paths from the source to the receiver are calculated in advance; it was decided to use a recursive depth-first search. The same element can be included into different search branches, but not into the same one. All possible paths could be saved and then checked every time an element is recovered in a failed system or fails in an operating system, but there is an easier way. Let us associate a list of paths with each element; the paths themselves do not have to be saved. Each path is given its own counter, the number of its failed elements; a sign of a usable path is a counter equal to zero. With every element failure, the counters of its paths are increased, and when the element recovers, the counters are decreased. A counter of operating paths is also kept, and its value is changed as paths switch on and off; a positive value of this counter is the sign of system operability.

An example of this counter system is investigated: the AllR system is taken, but to reduce the size of the scheme, the interconnections are considered perfectly reliable. The result is shown in FIG 2, where the influence of the counters upon each other is indicated with arrows.

A transition to failure is the sign of the end of a modeling run. To graph the probability of system operability as a function of time, a standard procedure of statistical data processing is used. Complementing the unknown quantity to unity, we get the cumulative distribution function of the time to failure. A histogram of the times to failure is built and normalized so that the area under it equals unity. The probability of system operability at a certain moment of time then equals the area under the histogram to the right of this moment; the cumulative distribution function (the probability of failure) is the area to the left of it. If the dependence of the smoothness of the graph on the number of experiments is analyzed, the result is as given in FIG 3.
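The event handling and path counters described above can be sketched in Python. This is an illustrative reconstruction, not the authors' code: the function name is assumed, and for brevity a single failure rate and exponential recovery are used instead of per-element parameters.

```python
import heapq
import random

def simulate_time_to_failure(paths, fail_rate, mean_repair, rng):
    """One run: process failure/recovery events until no source-to-receiver
    path is usable; return the time of that moment (hours)."""
    elements = sorted({e for p in paths for e in p})
    state = {e: 1 for e in elements}          # 1 = working, 0 = failed
    on_paths = {e: [i for i, p in enumerate(paths) if e in p] for e in elements}
    broken = [0] * len(paths)                 # per-path counter of failed elements
    usable = len(paths)                       # counter of usable paths

    # event times kept in a binary heap, as the text suggests
    events = [(rng.expovariate(fail_rate), e) for e in elements]
    heapq.heapify(events)
    while True:
        t, e = heapq.heappop(events)
        if state[e]:                          # failure event
            state[e] = 0
            for i in on_paths[e]:
                if broken[i] == 0:
                    usable -= 1
                broken[i] += 1
            if usable == 0:                   # system failure
                return t
            events_time = t + rng.expovariate(1.0 / mean_repair)
            heapq.heappush(events, (events_time, e))
        else:                                 # recovery event
            state[e] = 1
            for i in on_paths[e]:
                broken[i] -= 1
                if broken[i] == 0:
                    usable += 1
            heapq.heappush(events, (t + rng.expovariate(fail_rate), e))
```

For example, the precomputed paths of a small two-branch network could be given as [("A", "Proc", "B"), ("A", "Dev", "B")]; averaging the returned times over many runs estimates the mean time of system operation.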

It is seen from the graphs that when the number of experiments is 1000 the graph becomes smooth, but the error of the average no-failure operation time is, according to the CLT, about 10%. The available computing power allows the number of experiments to be increased 100-fold, which reduces this error; that is why 100 000 experiments are conducted.

FIG. 3 THE DEPENDENCE OF THE GRAPH ON THE NUMBER OF EXPERIMENTS

FIG. 2 COUNTERS SYSTEM

Recovery Policy

Firstly, we consider a system where only nodes (N) fail (FIG 4). The graphs N-W and N-C coincide because the most probable repair times for the N-W law are close to the ensemble average, while for the N-C law they coincide with it; for the N-E law the most probable values are close to zero. That is why a gain of the operation probability is obtained in the N-E graph.
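The coincidence of the W and C curves and the different behaviour of E follow from where the most probable repair time (the mode) lies relative to the mean. A quick illustrative check in Python (the scale formula follows from the Weibull mean; the variable names are ours):

```python
import math

MEAN = 2.0          # mean repair time, hours
K = 3.3             # Weibull shape parameter

# Weibull scale giving the required mean: mean = scale * Gamma(1 + 1/k)
scale = MEAN / math.gamma(1.0 + 1.0 / K)

# Modes (most probable values) of the three recovery laws:
mode_exponential = 0.0                                # density peaks at zero
mode_weibull = scale * ((K - 1.0) / K) ** (1.0 / K)   # valid for k > 1
mode_constant = MEAN                                  # all mass at the mean
```

For k = 3.3 the Weibull mode comes out very close to the mean of 2 hours, so W behaves almost like the deterministic law C, whereas the exponential law concentrates its probability near zero.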


FIG. 4 THE PROBABILITY OF N SYSTEM OPERATION
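The quoted error figures (about 10% at 1000 runs, 0.92% at 100 000 runs) correspond to the three-sigma relative half-width of the CLT confidence interval for the sample mean. A sketch of this estimate (illustrative, not the authors' code):

```python
import math
import random

def three_sigma_relative_error(samples):
    """Relative error of the sample mean at confidence 99.73% (3-sigma)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    return 3.0 * math.sqrt(var / n) / mean
```

Since the half-width shrinks as 1/sqrt(n), increasing the number of experiments 100-fold reduces the error roughly 10-fold.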


Now we consider the systems NL, All and AllR. Here the curves W and C coincide, so to save space their graphs are not given; because this recovery law is close to the normal one, we consider the Weibull recovery curves. The average time of system operation is given in TABLE 1. In accordance with the CLT, the error is 0.92%. Redundancy and recovery together increase the no-failure operation time of the intermediate elements. However, the source and the receiver are not redundant, which means that, in spite of their low failure intensity in comparison with the other nodes, they are the weak link and must be made redundant.

TABLE 1. THE AVERAGE TIME OF SYSTEM OPERATION, HOURS

        N         NL        All       AllR
  I     9.66817   8.45199   6.77966   8.14088
  E     28.9417   23.0651   12.2829   21.9415
  W     27.0671   21.3864   11.7182   20.096
  C     26.9366   21.0882   11.6027   20.06

The Impact of the Absolute Reliability Assumptions

Now let us analyze how the absolute reliability assumptions may affect the result. First, we consider the systems without recovery (FIG 5).

FIG. 5 NON-RECOVERY SYSTEMS

It is obvious that the assumption of absolutely reliable interconnections is invalid. By multiplying the probability of no-failure operation of the source and the receiver by the probability of NL-I system operation, we get the probability of All-I system operation. It can be observed that at the very beginning of system functioning the dashed line "cuts" the probability of All-I system operation. Hence the necessity of redundancy of the source and the receiver arises; this variant is shown by the AllR-I curve. With one redundant element for the source and one for the receiver, their operation becomes almost perfectly reliable in comparison with the rest of the system. Now we consider the systems with Weibull recovery (FIG 6).

FIG. 6 SYSTEMS WITH WEIBULL RECOVERY

The case is the same: the source and the receiver need to be made redundant, and one redundant element per each is enough. Recovery without redundancy does not affect the source and receiver operability.

Conclusion

A new simulation model, based on the combined analysis of the failure and operation statuses of nodes and communication paths, is introduced. It yields new results for estimating different reliability parameters, the probability of no-failure operation and the average time to failure, taking into account communication path failures and the redundancy of separate nodes. The most vulnerable nodes in the system are the initial and the final ones (the source and the receiver) because of the absence of their redundancy; necessary measures have been introduced.


