
Integrated Intelligent Research (IIR)

International Journal of Data Mining Techniques and Applications, Volume: 07, Issue: 01, June 2018, Page No. 130-138, ISSN: 2278-2419

An Analysis of Road Accidental Data Using Clustering and Itemset Mining Algorithms

R. Siddthan (1), A. Nagarajan (2)
(1) Research Scholar, Department of Computer Application, Alagappa University, Karaikudi, Tamil Nadu, India
(2) Assistant Professor, Department of Computer Application, Alagappa University, Karaikudi, Tamil Nadu, India
Email: sithan314@gmail.com (1)

Abstract: Road accident analysis is an emerging issue that has drawn the attention of many researchers in recent years. Road accidents are a major and largely unpredictable cause of unnatural death and disability. Many existing works have therefore developed prediction approaches that analyze real-time datasets and predict future accident rates. However, these approaches suffer from drawbacks such as inefficient prediction, reduced accuracy, and increased time consumption. This paper proposes a new prediction model that combines several data mining techniques in three stages: preprocessing, clustering, and itemset mining. Initially, the dataset obtained from the UCI repository is preprocessed by eliminating the irrelevant attributes and filling the missing values. Then, the density based clustering technique is applied to group the filtered data into clusters. After that, rules are formed based on the support and confidence values for predicting future accidents. Finally, the frequent items are mined using the Apriori algorithm. In experiments, the performance of the proposed system is evaluated using accuracy, precision, recall, and time consumption.

Index Terms: Data Mining, Density based Clustering, Association Rule Mining (ARM), Support and Confidence Measures, Apriori Algorithm, Prediction Rate.

I. INTRODUCTION

Data mining is a widely used research field in which the patterns and relationships within large volumes of data are represented. The aim of data mining [1] is to extract useful information from a large dataset and transform it into an understandable structure [2] for valid predictions. Its application areas include customer relationship management, banking, production management, marketing, and traffic analysis [3].
Moreover, it draws on various technologies such as statistics, machine learning, database technology, information science, visualization, and other disciplines [4]. The major functionalities of data mining are as follows [2]:

• Data storage and maintenance
• Data extraction and transformation
• Providing data access

In this field, mining the appropriate entries in the database plays an essential role [5]. Data mining has two main objectives, prediction and description. Prediction [6] estimates future patterns based on the fields in the dataset, while description focuses on finding patterns that can be interpreted by the users [7]. Furthermore, data mining [8] allows users to search for interesting and relevant patterns.

Roadway traffic analysis is an emerging research area in the field of data mining [9]. In recent years, the usage of vehicles has increased drastically, so traffic monitoring has become a demanding task [10], because an accident can happen anywhere as a result of this traffic [11]. To analyze driver safety, data mining techniques are applied to the traffic dataset to extract valuable information [12, 13]. Association rule mining is a popular methodology [14, 15] in data mining for identifying the relations between the data stored in a large dataset, and it is highly useful for mining frequent itemsets. The major objectives of this paper are as follows:

• To eliminate the irrelevant data and to fill the missing values, the original dataset is preprocessed.


• To analyze the traffic data, the density based clustering algorithm is implemented, which groups the data into clusters.
• To identify the frequent itemsets, the ARM algorithm Apriori is used, which identifies the relationships between the attributes for future prediction.

Fig 1. Structure of the data mining framework to analyze the traffic data

The rest of the paper is organized as follows. Section II surveys the existing techniques and algorithms related to clustering and classification in data mining. Section III describes the proposed methodology and illustrates its flow. Section IV evaluates the results of the proposed accident rate prediction system using various measures. Finally, Section V concludes the paper and states the future work.

II. LITERATURE SURVEY (EXISTING SYSTEM)

This section investigates the existing techniques and algorithms related to accident rate prediction, and identifies their advantages and disadvantages.

Zhang, et al [16] developed a Distributed Frequent Itemset Mining Algorithm (DFIMA) for processing Big Data using the Spark framework. In this paper, a matrix-based pruning approach was used to reduce the number of candidate itemsets, and the efficiency of the iterative computation was improved by the Spark architecture.

Shetty, et al [17] analyzed data mining techniques for predicting accident rates based on frequent itemsets. Here, the associations between road accidents were predicted by applying the Apriori algorithm, an association rule mining technique, to find the patterns of road accidents. The main intention of this work was to support precautionary measures for reducing the number of accidents.

Li, et al [18] analyzed fatal road traffic accidents using various data mining techniques. Here, the relationship between the fatality rate and other attributes was evaluated based on surface condition, collision manner, light condition, and weather. Moreover, the Apriori algorithm was used to discover association rules based on the support and confidence measures.
During data preparation, the missing values were filled, and the numerical values were replaced with nominal values. In modeling, various algorithms such as Apriori association, Naïve Bayes classification, and k-means clustering were used.

Kumar and Toshniwal [19] suggested an Association Rule Mining (ARM) algorithm for analyzing road accident data. Here, the Emergency Management Research Institute (EMRI) dataset was used, which contains road accident details from different sources. The attributes considered in this paper were victims injured, victim age, gender, day, month, region, road type, and area around. The major stages involved in this work were data preprocessing, clustering, and association rule mining. In preprocessing, the multiple missing values were filled manually. Then, the partition based and density based clustering techniques were implemented to cluster the datasets. Moreover, the ARM technique was used to determine the correlation


between the attributes in the data.

Atnafu and Kaur [20] examined the most widely used data mining tools for predicting the severity level of road accidents. The techniques were surveyed under the categories of classification, association rule mining, and data mining tools, and the National Highway Authority of India (NHAI) dataset was considered for prediction. Severity prediction and the identification of contributory factors were the focus of this accident analysis.

Soysal [21] analyzed the problem of mining structured data with association rule mining. Here, a heuristic approach was developed to extract the associated patterns based on item combinations. The intention of this paper was to reduce the computational time and memory requirements for the patterns with the highest co-occurrence. The main reason for using ARM was to discover hidden patterns among the enormous number of combinations of individual and conditional frequencies.

Zhang, et al [22] evaluated High Utility Itemset Mining (HUIM) algorithms in the field of data mining. The aim of this paper was to discover sets of items by selecting the most suitable algorithm. Here, various row enumeration and column enumeration strategies were used to mine the itemsets.

Sanmiquel, et al [23] analyzed some of the major factors that caused Spanish mining accidents, and aimed to develop future prevention policies for minimizing occupational hazards. The main steps in this paper were mining data selection, attribute evaluation, association rule mining, and measuring severity.

Singh, et al [24] developed a new technique, the Distribution Preserving Kernel Support Vector Machine (DiP-SVM), for processing big data. This paper intended to attain a minimal loss in classification accuracy by applying the classification technique to various benchmark datasets.
This mechanism includes two phases, Distribution Preserving Partitioning (DPP) and a distributed learning phase. Here, the k-means clustering technique was used to partition the clusters from the balanced tree.

Chen, et al [25] investigated traffic accident data for predicting risks using deep architecture training. The stages involved in this system were data analysis, preprocessing, and model training.

The survey shows that the existing works have both benefits and demerits, but they mainly suffer from the following disadvantages:

• Accuracy is not at a satisfactory level
• Efficiency needs to be improved

Thus, the proposed work aims to develop a new system for predicting accident rates accurately.

III. PROPOSED SYSTEM

This section describes the proposed methodology and illustrates its flow. The main objective of this paper is to analyze the accident dataset and predict the future accident rate using data mining algorithms. First, the accident dataset obtained from the UCI repository is taken as the input and preprocessed to eliminate the unwanted attributes and fill the missing values. Then, the preprocessed data is clustered by year using the density based clustering technique. After that, the Apriori technique is applied to associate each accident area with the corresponding year. Finally, the Association Rule Mining (ARM) algorithm is applied to estimate the support and confidence values for mining the itemsets. On this basis, the performance of the proposed system is evaluated using various measures. As shown in Fig 2, the proposed accident prediction analysis system has the following stages:

• Preprocessing
• Density based clustering
• Apriori algorithm
• Association rule mining

A. Preprocessing

Data preprocessing is an important task that must be performed before analyzing the data.
In this stage, the irrelevant attributes are removed and the missing values are filled. Because real-world data is considered, manual filling leads to inefficient results. For this reason, a data transformation process is applied to some attributes such as traffic density and age. Moreover, data preprocessing is mainly performed to reduce complexity and dimensionality.
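The preprocessing stage can be illustrated with a minimal Python sketch. The paper does not state which attributes are dropped, how missing values are filled, or how attributes such as age are transformed, so the column names, the mean/mode fill rule, and the age bands below are all assumptions chosen only for illustration.

```python
from collections import Counter
from statistics import mean


def preprocess(records, irrelevant, numeric):
    """Drop irrelevant attributes, then fill missing values (None).

    Assumed fill rule: column mean for numeric attributes,
    most frequent value for nominal attributes.
    """
    kept = [{k: v for k, v in r.items() if k not in irrelevant} for r in records]
    columns = {k for r in kept for k in r}
    fill = {}
    for col in columns:
        present = [r[col] for r in kept if r.get(col) is not None]
        if not present:
            continue  # nothing to learn a fill value from
        if col in numeric:
            fill[col] = mean(present)
        else:
            fill[col] = Counter(present).most_common(1)[0][0]
    for r in kept:
        for col, value in fill.items():
            if r.get(col) is None:
                r[col] = value
    return kept


def age_band(age):
    """Transform a numeric age into a nominal band (hypothetical cut points)."""
    if age < 25:
        return "young"
    if age < 60:
        return "adult"
    return "senior"


# Hypothetical records; "id" stands in for an irrelevant attribute.
raw = [
    {"id": 1, "state": "Tamil Nadu", "age": 34, "weather": "clear"},
    {"id": 2, "state": "Tamil Nadu", "age": None, "weather": "rain"},
    {"id": 3, "state": "Maharashtra", "age": 50, "weather": None},
    {"id": 4, "state": "Maharashtra", "age": 27, "weather": "clear"},
]
clean = preprocess(raw, irrelevant={"id"}, numeric={"age"})
for row in clean:
    row["age"] = age_band(row["age"])
```

Converting numeric attributes to nominal bands after filling is one way to prepare the records for itemset mining, which operates on discrete items.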


B. Density based Clustering

After preprocessing, the filtered data is clustered using the density based clustering technique, an unsupervised algorithm that partitions the data into groups. Clustering in data mining must be scalable and capable of discovering various features. In this technique, a finite set of points is partitioned based on density and boundary. It can discover clusters of arbitrary shape, and it handles the relocation problem efficiently through the density property. Each cluster is defined as a dense region of objects.
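The density based grouping described in this section can be sketched in Python as follows. This is a minimal, unoptimized version of the standard DBSCAN procedure, which Algorithm I in this section appears to follow. It is not the authors' implementation, and the Euclidean distance, the sample points, and the parameter values are assumptions for illustration.

```python
from math import dist

NOISE = -1  # label for points that belong to no dense region


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        seeds = list(neighbors)
        k = 0
        while k < len(seeds):  # expand the cluster through dense neighbors
            j = seeds[k]
            k += 1
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds += [n for n in j_neighbors if n not in seeds]
    return labels


# Two dense groups and one isolated point (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The number of clusters is never passed in: it follows from the neighborhood radius and the minimum point count, and the isolated point is reported as noise.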

Fig 2. Flow of the proposed system

Fig 3. Density based clustering

The major advantage of this technique is that it can identify noisy data and does not require a priori specification of the clusters. It also provides the following benefits:

• It does not require the number of clusters in the data to be specified
• It can discover arbitrarily shaped clusters
• It uses only two parameters and is not affected by the ordering of points
• It is highly robust

Thus, the density based clustering technique is implemented in this work.

Algorithm I: Density based Clustering

Step 1: Density based clustering (input data D, neighborhood radius ε, and minimum points MinPts)
    C = 0;
Step 2: for each unvisited point P in the dataset D
    Mark P as visited;
    NeighborPts = RegionQuery (P, ε);
Step 3: if size of (NeighborPts) < MinPts
    Mark P as noise;
  else
    C = next cluster;
    ExpandCluster (P, NeighborPts, C, ε, MinPts);
    Add P to cluster C;
Step 4: (ExpandCluster) for each point P' in NeighborPts
    if P' is not visited
      Mark P' as visited;
      NeighborPts' = RegionQuery (P', ε);
      if size of (NeighborPts') >= MinPts
        NeighborPts = NeighborPts joined with NeighborPts';
    if P' is not yet a member of any cluster
      Add P' to cluster C;
Step 5: RegionQuery (P, ε);
Step 6: Return all points within the ε-neighborhood of P (including P);

C. Association Rule Mining

In this work, the standard Association Rule Mining (ARM) algorithm Apriori is used for finding the frequent itemsets. It is a widely used technique in data mining that extracts the correlations between the various sets of attributes in the data. Here, it analyzes the roadway traffic data for predicting the future accident rate. Because it finds the relationships between variables, it is also known as dependency modeling. Moreover, it generates the best rules to depict the associations between the various attributes in the dataset. In this technique, the support and confidence values are estimated for defining the strong rules. It is mainly applied to identify, for each cluster, the conditions under which an accident may occur. This algorithm includes the following steps:

• The minimum support value is identified and applied to find the frequent itemsets in the dataset.
• The rules are formed based on the frequent itemsets and the minimum confidence value.

Moreover, ARM searches for interesting patterns in huge datasets with the use of association rules. The major benefits of the Apriori algorithm are its ease of use and simplicity. It is applied to each cluster to generate the association rules.

IV. IMPLEMENTATION RESULTS

This section evaluates the experimental results of the proposed system using performance measures such as accuracy, precision, recall, and time.
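As background for the results that follow, the two Apriori steps of Section III-C (frequent itemsets from a minimum support, then rules from a minimum confidence) can be sketched in Python. Here the support of an itemset is the fraction of records that contain it, and the confidence of a rule X → Y is support(X ∪ Y) / support(X). The records, item names, and thresholds are hypothetical and do not come from the paper's dataset.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        level = {c: support(c) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep candidates whose (k-1)-subsets are all frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent


def association_rules(frequent, min_confidence):
    """Return (lhs, rhs, support, confidence) for every strong rule."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                confidence = sup / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, sup, confidence))
    return rules


# Hypothetical accident records from one cluster, one set of items per record.
transactions = [
    {"rain", "night", "fatal"},
    {"rain", "night", "fatal"},
    {"rain", "day", "minor"},
    {"clear", "night", "minor"},
    {"clear", "day", "minor"},
]
frequent = apriori(transactions, min_support=0.4)
strong_rules = association_rules(frequent, min_confidence=0.9)
for lhs, rhs, sup, conf in strong_rules:
    print(sorted(lhs), "->", sorted(rhs), f"support={sup:.2f} confidence={conf:.2f}")
```

In the paper's setting, a function like this would be run once per cluster, so that each cluster yields its own set of rules.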
For this analysis, the publicly available dataset obtained from the UCI repository is used.

A. Accident Detection

Fig 4 shows the accident detection rate with respect to different states and years. The states considered for this analysis are Maharashtra, Tamil Nadu, Uttar Pradesh, and Andhra Pradesh, and the analysis covers the years 2001 to 2010. From this evaluation, it is observed that the year 2010 has an increased death rate.

Fig 4 (a). Accident detection rate with respect to different states


Fig 4 (b). Accident detection rate with respect to different years

B. Precision and Recall

Precision and recall are widely used measures for evaluating the performance of data mining techniques. Precision, also called the Positive Predictive Value (PPV), indicates how many of the results returned by itemset mining are relevant. It is the ratio of true positives to the sum of true positives and false positives. Recall indicates how many of the relevant results are returned during itemset mining. It is the ratio of true positives to the sum of true positives and false negatives. Precision and recall are estimated as shown in the following equations:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$

where TP represents True Positive, TN represents True Negative, FP represents False Positive, and FN represents False Negative. Fig 5 shows the precision and recall values of the existing DiP-SVM and proposed Apriori techniques. From the analysis, it is evident that the proposed Apriori technique provides better results than the existing technique.

[Bar chart: y-axis Performance Value (70 to 100); x-axis Algorithms (DiP-SVM, Apriori); series Precision and Recall]

Fig 5. Precision and recall

C. Accuracy

The correctness and efficiency of data mining techniques are validated using accuracy, which states how accurately the data is mined by the clustering and rule mining algorithms. It is calculated as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$
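The three measures can be computed directly from the four confusion counts, using the standard definitions: precision is TP / (TP + FP), recall is TP / (TP + FN), and accuracy is (TP + TN) / (TP + TN + FP + FN). The short Python sketch below shows this; the counts are invented for illustration and are not the paper's experimental values.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of actual positives that are found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# Hypothetical confusion counts, for illustration only.
tp, tn, fp, fn = 45, 40, 5, 10
print(f"precision = {precision(tp, fp):.3f}")        # 0.900
print(f"recall    = {recall(tp, fn):.3f}")           # 0.818
print(f"accuracy  = {accuracy(tp, tn, fp, fn):.3f}")  # 0.850
```

The zero-denominator guards return 0.0 as a convention, so the functions stay defined when a technique makes no positive predictions.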


Fig 6 shows the accuracy analysis of the existing DiP-SVM and proposed Apriori techniques. From the results, it is evident that the proposed Apriori technique provides higher accuracy than the existing technique.

[Bar chart: y-axis Accuracy (%) (75 to 100); x-axis Algorithms (DiP-SVM, Apriori)]

Fig 6. Accuracy

D. Time Consumption

Time consumption is defined as the difference between the release time and the finishing time of a given task. Here, it is the time required to mine the frequent itemsets from a large dataset. The response time is calculated for both the existing and proposed techniques, as shown in Fig 7. From the results, it is evident that the proposed technique consumes less time than the existing technique.
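One simple way to obtain such a measurement is to record a clock reading before and after the task and take the difference. The Python sketch below shows this; the helper name `timed` and the placeholder task are assumptions, and the paper does not describe how its timings were actually collected.

```python
import time


def timed(task, *args, **kwargs):
    """Run task and return (result, elapsed seconds between release and finish)."""
    release = time.perf_counter()
    result = task(*args, **kwargs)
    finish = time.perf_counter()
    return result, finish - release


# Placeholder task standing in for a frequent itemset mining run.
result, seconds = timed(sum, range(1_000_000))
print(f"task finished in {seconds:.4f} s")
```

Because a single run is sensitive to background load, timings of this kind are usually repeated and averaged before two techniques are compared.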

[Bar chart: y-axis Time Consumption (s) (6.5 to 9.5); x-axis Algorithms (DiP-SVM, Apriori)]

Fig 7. Time consumption

V. CONCLUSION

This paper proposed a new prediction system based on clustering and rule mining algorithms. The dataset obtained from the UCI repository is taken as the input and preprocessed to eliminate the irrelevant attributes and fill the missing values. The resulting noise-free dataset is then clustered using the density based clustering technique. Furthermore, the standard association rule mining algorithm Apriori is applied to mine the frequent itemsets, with the support and confidence values calculated for frequent item mining. During evaluation, performance measures such as precision, recall, accuracy, and time consumption are considered, and the existing DiP-SVM technique is compared with the proposed Apriori technique. From the evaluation, it is observed that the proposed Apriori algorithm provides better results than the existing technique.


References

[1] S. Kumar and D. Toshniwal, "A data mining approach to characterize road accident locations," Journal of Modern Transportation, vol. 24, pp. 62-72, 2016.
[2] S.-h. Park, et al., "Highway traffic accident prediction using VDS big data analysis," The Journal of Supercomputing, vol. 72, pp. 2815-2831, 2016.
[3] A. Jain, et al., "Data mining approach to analyse the road accidents in India," in Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), 2016 5th International Conference on, 2016, pp. 175-179.
[4] S. Kumar and D. Toshniwal, "Analysis of hourly road accident counts using hierarchical clustering and cophenetic correlation coefficient (CPCC)," Journal of Big Data, vol. 3, p. 13, 2016.
[5] S. Kumar and D. Toshniwal, "A novel framework to analyze road accident time series data," Journal of Big Data, vol. 3, p. 8, 2016.
[6] F. L. Mannering, et al., "Unobserved heterogeneity and the statistical analysis of highway accident data," Analytic Methods in Accident Research, vol. 11, pp. 1-16, 2016.
[7] B. Sharma, et al., "Traffic accident prediction model using support vector machines with Gaussian kernel," in Proceedings of Fifth International Conference on Soft Computing for Problem Solving, 2016, pp. 1-10.
[8] I. Alam, et al., "Pattern mining from historical traffic big data," in IEEE Region 10 Symposium (TENSYMP), 2017, pp. 1-5.
[9] U. Lokala, et al., "Road Accidents Bigdata Mining and Visualization using Support Vector Machines," 2017.
[10] G. Kaur and E. H. Kaur, "Prediction of the cause of accident and accident prone location on roads using data mining techniques," in Computing, Communication and Networking Technologies (ICCCNT), 2017 8th International Conference on, 2017, pp. 1-7.
[11] Y.-R. Shiau, et al., "The application of data mining technology to build a forecasting model for classification of road traffic accidents," Mathematical Problems in Engineering, vol. 2015, 2015.
[12] C.-S. Li and M.-C. Chen, "A data mining based approach for travel time prediction in freeway with nonrecurrent congestion," Neurocomputing, vol. 133, pp. 74-83, 2014.
[13] S. Sarkar, et al., "Text mining based safety risk assessment and prediction of occupational accidents in a steel plant," in Computational Techniques in Information and Communication Technologies (ICCTICT), 2016 International Conference on, 2016, pp. 439-444.
[14] M. I. Sameen and B. Pradhan, "Severity Prediction of Traffic Accidents with Recurrent Neural Networks," Applied Sciences, vol. 7, p. 476, 2017.
[15] Y. Castro and Y. J. Kim, "Data mining on road safety: factor assessment on vehicle accidents using classification models," International Journal of Crashworthiness, vol. 21, pp. 104-111, 2016.
[16] F. Zhang, et al., "A distributed frequent itemset mining algorithm using Spark for Big Data analytics," Cluster Computing, vol. 18, pp. 1493-1501, 2015.
[17] P. Shetty, et al., "Analysis of road accidents using data mining techniques," International Research Journal of Engineering and Technology (IRJET), pp. 1494-1496, 2017.
[18] L. Li, et al., "Analysis of road traffic fatal accidents using data mining techniques," in Software Engineering Research, Management and Applications (SERA), 2017 IEEE 15th International Conference on, 2017, pp. 363-370.
[19] S. Kumar and D. Toshniwal, "Analysing road accident data using association rule mining," in Computing, Communication and Security (ICCCS), 2015 International Conference on, 2015, pp. 1-6.
[20] B. Atnafu and G. Kaur, "Survey on Analysis and Prediction of Road Traffic Accident Severity Levels using Data Mining Techniques in Maharashtra, India," International Journal of Current Engineering and Technology, vol. 7, pp. 1974-1978, 2017.
[21] Ö. M. Soysal, "Association rule mining with mostly associated sequential patterns," Expert Systems with Applications, vol. 42, pp. 2582-2592, 2015.
[22] C. Zhang, et al., "An Empirical Evaluation of High Utility Itemset Mining Algorithms," Expert Systems with Applications, 2018.
[23] L. Sanmiquel, et al., "Study of Spanish mining accidents using data mining techniques," Safety Science, vol. 75, pp. 49-55, 2015.
[24] D. Singh, et al., "Dip-svm: distribution preserving kernel support vector machine for big data," IEEE Transactions on Big Data, vol. 3, pp. 79-90, 2017.
[25] Q. Chen, et al., "Learning Deep Representation from Big and Heterogeneous Data for Traffic Accident Inference," in AAAI, 2016, pp. 338-344.