Classification of Malware based on Data Mining Approach

Page 1

IJSRD - International Journal for Scientific Research & Development| Vol. 1, Issue 2, 2013 | ISSN (online): 2321-0613

Classification of Malware based on Data Mining Approach 1

1

Ankita K Tiwari Department of Computer Science, IT Systems and Network Security 1 Gujarat Technological University, India

Abstract— In recent years, the number of malware families/variants has exploded dramatically. Automatic malware classification is becoming an important research area. Using data mining, we identify seven key features within the Microsoft PE file format that can be fed to machine learning algorithms to classify malware. In this paper, resting on the analysis of Windows API execution sequences called by PE files, we develop the Intelligent Malware Detection System (IMDS) using ObjectiveOriented Association (OOA) mining based classification. IMDS is an integrated system consisting of three major modules: PE parser, OOA rule generator, and rule based classifier. An OOA_Fast_FP Growth algorithm is adapted to efficiently generate OOA rules for classification. Promising experimental results demonstrate that the accuracy and efficiency of our IMDS system outperform popular antivirus software such as Norton Antivirus and McAfee Virus Scan, as well as previous data mining based detection systems which employed Naive Bayes, Support Vector Machine (SVM) and Decision Tree techniques. I. INTRODUCTION Malware refers to a broad class of malicious software that threatens computer and information systems and networks. Such software may modify, destroy or steal data, may obtain unauthorized access to confidential information and exploit vulnerabilities in applications. Authors of malware often obfuscate the executable components so as to make the malware more difficult to identify. In [2], the authors identify four ways in which this is commonly done. These are: the insertion of dead code that does not change the original code in any way, interchanging the uses of registers, replacing one sequence of instructions with an equivalent one, and permuting instruction sequences in the code without changing the code behavior. In this paper, we do not assume that the malware under analysis has been DE obfuscated. However, we do assume that the malware we consider has been unpacked, that is, it has not been compressed or encrypted in any way. Still, the traditional approach to analyze malware requires that a human analyst manually performs the tests and extracts the information in order to classify the sample (Moser et al., 2007). Unfortunately, due to the tremendous growth of malicious code, antivirus companies receive everyday thousands of suspicious files that have to be analyzed and classified as malware or benign software. Hence, a reliable and fast automation of the analysis and classification is a crucial point to be able to cope with this threat. With this scenario in mind, there are two malware analysis approaches: static analysis (Carrera and Erd´elyi, 2004) which is performed without actually executing the

file, only observing the binary looking for suspicious patterns, and dynamic analysis (Christodorescu et al., 2007; Rieck et al., 2008) which implies running the sample in an isolated and controlled environment monitoring its behavior. Our efforts for detecting polymorphic and new, previously unseen malicious executable lead to the Intelligent Malware Detection System (IMDS), which applies Objective-Oriented Association (OOA) mining based classification [10, 14]. The IMDS rests on the analysis of Windows Application Programming Interface (API) execution sequences which recollect the behavior of program code pieces. The associations among the APIs capture the underlying semantics for the data are essential to malware detection. Our malware detection is carried out directly on Windows Portable Executable (PE) code with three major steps: (1) first constructing the API execution sequences by developing a PE parser; (2) followed by extracting OOA rules using OOA Fast FP-Growth algorithm; (3) and finally conducting classification based on the association rules generated in the second step. So far, we have gathered 29580 executable, of which 12214 are referred to as benign executable and 17366 are malicious ones. This executable called 12964 APIs in total. The collected data in our work is significantly larger than those used in previous studies on data mining for malware detection [13, 9, 18]. The experimental results illustrate that the accuracy and efficiency of our IMDS outperform popular anti-virus software such as Norton Antivirus and McAfee Virus Scan, as well as previous systems using data mining approaches such as Naïve Bayes, SVM and Decision Tree. In summary, our main contributions are: (1) we develop an integrated IMDS system based on analysis of Windows API execution calls. The system consists of three components: PE parser, rule generator and classifier; (2) We adapt existing association based classification techniques to improve the system effectiveness and efficiency; (3)We evaluate our system on a large collection of executable including 12,214 benign samples and 17,366 malicious ones; (4)We provide a comprehensive experimental study on various anti-virus software as well as various data mining techniques for malware detection using our data collection; (5) Our system has been already incorporated into the KingSoft’s Antivirus software. II. RELATED WORK As mentioned earlier the signature based method , there is still some work to improve in this method [5,18,23,26] and also a few attempts to apply data mining and machine learning techniques, such as Naive Bayes method, support vector machine (SVM) and Decision Tree classifiers, to detect new malicious executable.

All rights reserved by www.ijsrd.com

183


Turn static files into dynamic content formats.

Create a flipbook
Issuu converts static files into: digital portfolios, online yearbooks, online catalogs, digital photo albums and more. Sign up and create your flipbook.