International Journal of Computer Trends and Technology (IJCTT) – volume 8 number 4– Feb 2014
A Proposed Methodology for Virus Detection Using Data Mining and Reverse Engineering Tools with Client-Server Model
Uday Babu P¹, Visakh R²
¹(Department of Computer Science & Engineering, Rajagiri School Of Engineering & Technology-Kochi, India)
²(Department of Computer Science & Engineering, Rajagiri School Of Engineering & Technology-Kochi, India)
ABSTRACT : Viruses are a class of malicious programs that adversely affect a computer system and thereby obstruct its normal operation. Their existence and execution within the system should be detected promptly to prevent irrecoverable and devastating problems, such as loss of performance and loss of confidentiality of sensitive information. To detect the presence of a virus within a system, the effects of various viruses on computer systems are first analysed by executing them one by one in a virtual environment. These effects are captured using reverse engineering tools. Data mining is applied to the data recorded by the reverse engineering tools to extract the significant patterns that characterize the respective viruses. These patterns are converted into a unique binary code, which can be used to detect viruses using a client-server model.
Keywords - Client-server model, Data mining, FP growth algorithm, Reverse engineering, Virus detection.
I. INTRODUCTION
Viruses are malware designed to damage computer systems, leaving them vulnerable to security threats and performance degradation. The growth of the Internet has spawned new malware, including viruses. Viruses have mutated into forms so sophisticated that detecting them with the major conventional methods, such as signature-based and heuristic detection, can be a laborious process. Cyber security is under threat, and cyber wars are forecast for the near future [1]. Hence a methodology capable of detecting the presence of malware in a system should be formulated, so that the malware and its effects can be removed from the infected system. Data mining is the process of extracting frequent and relevant patterns from a very large data set [2]. In the proposed system, reverse engineering tools and a data mining algorithm are applied in sequence in each virtual system, before and after the system is infected with a given virus, to extract the relevant patterns that characterize the effects of that virus. These patterns are exploited to discover the presence of an unidentified virus in a system on the basis of a
ISSN: 2231-2803
client-server model. The characteristic pattern of each known virus, after being transformed into a binary code, is saved in a database at the server. The binary code formulated for an unknown virus is compared with the binary codes of known viruses to identify the unknown virus.
In [3], Burji, Liszka and Chan stated that malware can be detected by integrating reverse engineering tools and data mining. Three virtual machines are created in each system and each of them is infected with a given malware. Reverse engineering tools such as a file monitor, a registry monitor and an API call tracer are executed in each of the virtual machines to record various aspects of the machine state of each infected virtual machine. Data mining is applied to the reverse engineered data of the virus to retrieve pertinent and frequent data patterns that characterize the virus efficiently. The output of the data mining step is supplied to a rough set theory based tool known as Blem2. Blem2 generates rules of the required confidence and strength that can be used to detect malware. In that approach, however, the machine state is captured only after infection; hence the observations may include effects that were not caused by the malware attack. The rules developed by the rough set based tool may therefore not be precise enough to catch the malware, and this may result in false positives.
Reverse engineering of a malware is the analysis of the malware in order to comprehend and capture its design, components, behaviour and effects by executing it in a controlled and isolated virtual environment. Reverse engineering tools such as a file system monitor and a registry monitor are used to trace the machine state of the system. Each reverse engineering tool captures a single aspect of the machine state. A virus changes one or more aspects of the machine state of a system when it infects that system.
A few of the reverse engineering tools available to capture the state of the machine are the following:
1.1 File System Monitor
When a process is executed in a system, it makes changes in the file system by adding, deleting or editing files. A file system monitor captures all the file system activity performed by all the processes running in the system. Changes made by the malware in the file system of the
www.internationaljournalssrg.org
infected system can be gathered using a file system monitor.
1.2 Registry Monitor
A system registry preserves the configuration details of the operating system and of all the programs installed in it. A registry monitor can be used to monitor the registry. Malware infects the system by making modifications in the registry, and these manipulations can be retrieved using a registry monitor.
1.3 Process Monitor
A process monitor keeps track of the processes that are currently executing in a given system. A virus is associated with one or more processes. A system is said to be infected with a given virus if processes associated with that virus are currently running in the system.
The paper is organized in the following manner. Section II gives an insight into the proposed system. Section III elaborates on the techniques available to evaluate the system. Section IV gives a conclusion and provides information regarding the scope for future work.
II. PROPOSED METHODOLOGY
The proposed methodology, as shown in Fig.1, includes the following steps:
2.1 Application Of Reverse Engineering Tools
The effects of viruses on a machine can be captured dynamically by executing them in controlled virtual machines created using Oracle VirtualBox [4]. Such a virtual machine keeps the host operating system protected from the ill effects of the malware when the virtual machine is infected. Two to three virtual machines can be created in a single system, depending on its configuration, each with its own operating system that is isolated from the operating systems of the other virtual machines and from the host operating system. Multiple virtual machines are created and reverse engineering tools are applied to retrieve the current machine state, which represents an infection-free condition. Each virtual machine is then infected with the same given virus and the reverse engineering tools are reapplied to capture the updated machine states. A symmetric difference is taken between the original and updated machine states to obtain the changes that were made by the infection. This removes observations that are probably not the result of the malware attack.
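The symmetric difference step can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each machine-state observation is recorded as a `(tool, detail)` pair; the paper does not prescribe a representation, and the sample paths and process names below are hypothetical.

```python
from typing import Set, Tuple

# Assumption for illustration: one observation = (tool name, observed detail).
Observation = Tuple[str, str]


def infection_changes(before: Set[Observation],
                      after: Set[Observation]) -> Set[Observation]:
    """Return the observations that appear in exactly one of the two snapshots."""
    return before ^ after  # built-in set symmetric difference


# Hypothetical snapshot of a clean virtual machine.
clean = {
    ("file", "C:/Windows/system32/kernel32.dll"),
    ("registry", "HKLM/Software/Example/Run=updater.exe"),
    ("process", "explorer.exe"),
}

# Hypothetical snapshot of the same virtual machine after infection.
infected = {
    ("file", "C:/Windows/system32/kernel32.dll"),
    ("file", "C:/Temp/dropper.tmp"),
    ("process", "explorer.exe"),
    ("process", "dropper.exe"),
}

changes = infection_changes(clean, infected)
for tool, detail in sorted(changes):
    print(tool, detail)
```

Observations common to both snapshots, such as `explorer.exe`, drop out, while both additions (the new file and process) and removals (the deleted registry entry) are retained.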
Fig.1. proposed methodology
2.2 Data Mining
A data mining algorithm, namely the FP growth algorithm [2], which has been shown to be efficient in terms of memory usage and execution time, can be applied to the data obtained after taking the symmetric difference. This extracts the most concise and relevant patterns that help to characterize the given virus. The FP growth algorithm does not perform repeated database scans. It uses a divide and conquer strategy [2]. First, it compresses the database of frequent items into an FP-tree that preserves the itemset association information. The compressed database thus obtained is partitioned into a set of conditional databases, each of which is associated with a unique frequent item, and mining is done independently on each partition.
2.3 Converting To Binary Code
The patterns that characterise each virus, fetched using data mining, can be converted into a binary code that represents the respective virus.
2.4 Server Creation
A server can be created on a cloud computing platform to store the binary codes corresponding to each known virus. The server is programmed to wait for requests from client machines that are suspected to be infected by one or more viruses.
2.5 Client Processing
Each computer system acts as a client machine. The machine state of each computer system is captured periodically at regular intervals. The length of the interval depends on the level of security that the system needs. The captured machine state of a system is compared with the machine state obtained during the previous capture. If the degree of difference between the two machine states is above a predetermined threshold, data mining is applied to the symmetric difference between the machine states to obtain the significant patterns that may characterize an unknown virus. The retrieved patterns are converted into a binary code, which is sent to the remote server for analysis. The server performs an analysis to detect the virus based on the stored binary codes. If the client system is a highly sensitive machine, it is put into a dormant state until a confirmation is received from the server. Actions are taken based on the server's response. Because the client sends the binary code to the server only when the degree of difference between consecutive machine states exceeds the threshold, the server is not overloaded unnecessarily.
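The paper does not specify how patterns are encoded or how the degree of difference is measured, so the following Python sketch rests on two stated assumptions: patterns are encoded with one bit per entry of a fixed, hypothetical vocabulary, and the degree of difference is taken to be the Jaccard distance between consecutive captures. The names `VOCABULARY`, `to_binary_code`, `degree_of_difference` and `should_report` are all illustrative.

```python
from typing import Iterable, List, Set

# Hypothetical fixed vocabulary of pattern items; position i maps to bit i.
VOCABULARY: List[str] = [
    "file:create:temp",
    "file:delete:system",
    "registry:modify:run_key",
    "process:spawn:unknown",
]


def to_binary_code(patterns: Iterable[str],
                   vocabulary: List[str] = VOCABULARY) -> str:
    """Encode mined patterns as a bit string, one bit per vocabulary entry."""
    present = set(patterns)
    return "".join("1" if item in present else "0" for item in vocabulary)


def degree_of_difference(previous: Set[str], current: Set[str]) -> float:
    """Jaccard distance: size of the symmetric difference over size of the union."""
    union = previous | current
    if not union:
        return 0.0  # two empty captures are treated as identical
    return len(previous ^ current) / len(union)


def should_report(previous: Set[str], current: Set[str],
                  threshold: float = 0.25) -> bool:
    """Contact the server only when the state changed by more than the threshold."""
    return degree_of_difference(previous, current) > threshold


# Hypothetical consecutive captures of a client machine state.
previous_state = {"a", "b", "c", "d"}
current_state = {"a", "b", "c", "e"}

if should_report(previous_state, current_state):
    code = to_binary_code(["registry:modify:run_key", "file:create:temp"])
    print("send to server:", code)
```

A fixed vocabulary keeps every code the same length, which a bitwise comparison at the server would require; the threshold value 0.25 is arbitrary and would need tuning to the security level of the system.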
2.6 Server Processing
As shown in Fig.3, when the server receives the binary code of an unknown virus from the client, it calculates the Hamming distance between that code and each of the binary codes of known viruses stored in its database. Hamming distance is the number of bit positions at which the two binary codes under consideration differ. The stored binary code with the minimum Hamming distance from the given binary code is selected, and the system is said to be infected with the virus corresponding to this best-matching code. The server notifies the client, and the client can resume accordingly. Viruses whose binary codes have a Hamming distance from the given binary code below a predetermined threshold can be considered to belong to the same family of viruses, which may have formed through mutation. If the minimum Hamming distance of the given binary code with respect to all binary codes in the database is above a predetermined threshold, then it may be a newly generated, unclassified virus, and its binary code is inserted into the database at the server to represent the new virus. Hence known viruses, their mutants and unknown viruses can be detected.
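The server-side matching can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: binary codes are assumed to be equal-length bit strings, a distance equal to the threshold is assumed to count as the same family (the text leaves this case open), and the virus names and codes are hypothetical.

```python
from typing import Dict, Optional, Tuple


def hamming_distance(a: str, b: str) -> int:
    """Number of bit positions at which two equal-length codes differ."""
    if len(a) != len(b):
        raise ValueError("binary codes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def classify(code: str, known: Dict[str, str],
             threshold: int) -> Tuple[str, Optional[str], Optional[int]]:
    """Return (verdict, closest known virus, distance to it).

    verdict is "known" for an exact match, "mutant" when the closest code is
    within the threshold, and "new" when every known code is farther away.
    """
    if not known:
        return ("new", None, None)
    name, distance = min(
        ((n, hamming_distance(code, c)) for n, c in known.items()),
        key=lambda pair: pair[1],
    )
    if distance == 0:
        return ("known", name, distance)
    if distance <= threshold:  # assumption: equality counts as same family
        return ("mutant", name, distance)
    return ("new", None, distance)


# Hypothetical database of known virus codes held at the server.
database = {"virus_a": "110010", "virus_b": "001101"}

print(classify("110010", database, threshold=2))
print(classify("110011", database, threshold=2))

verdict, _, _ = classify("011000", database, threshold=2)
if verdict == "new":
    # The paper inserts the code of a newly found virus into the database.
    database["unclassified_1"] = "011000"
```

Keeping `classify` free of side effects and performing the database insertion in the caller makes the decision logic easy to test separately from storage.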
Fig.2. client processing
Fig.3. server processing
III. EVALUATION
The methodology can be evaluated using parameters such as accuracy, detection rate, precision and false positive rate. Accuracy is the ratio of the sum of the number of true negatives and true positives to the sum of the number of true positives, true negatives, false positives and false negatives. Detection rate is the ratio of the number of true positives to the sum of the number of true positives and false negatives. Precision is the ratio of the number of true positives to the sum of the number of true positives and false positives. False positive rate is the ratio of the number of false positives to the sum of the number of true negatives and false positives.
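These four definitions translate directly into code. The Python sketch below assumes all denominators are non-zero; the counts in the usage example are made-up illustrative values, not results from the proposed system.

```python
from typing import Dict


def evaluation_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute the four evaluation parameters from confusion-matrix counts.

    Assumes every denominator is non-zero.
    """
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "detection_rate": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),
    }


# Hypothetical counts chosen only to exercise the formulas.
metrics = evaluation_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```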
IV. CONCLUSION
New viruses are being generated at an accelerating rate owing to the rapid development of the Internet. Complete dependence on antivirus software, which takes a significant amount of memory to run, is not a reliable way to detect recent sophisticated viruses that have devastating effects. Hence the proposed system provides a cost-effective and efficient method of detecting unknown virus infections, which may be better than antivirus software that needs to be updated periodically and whose detection power varies with its manufacturer. As an implementation, all the systems in a company can act as clients, except for one system which acts as the server. Detecting an infection alone is not sufficient; hence, as future work, techniques to remove the detected viruses from the system must be determined and integrated with the proposed system to complete the task of protecting computer systems from virus attacks. The proposed methodology may be extremely useful for systems that hold highly sensitive data, for which security is of prime importance and short periods of unavailability can be tolerated.
REFERENCES
[1] Bhavani Thuraisingham, Data mining for malicious code detection and security applications, IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology – Workshops, 2009.
[2] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques (Morgan Kaufmann Publishers, USA, Second Edition).
[3] Supreeth Burji, Kathy J. Liszka, Chan, Malware analysis using reverse engineering and data mining tools, International Conference on System Science and Engineering, 2010.
[4] Oracle VirtualBox, virtual machine software tool, www.virtualbox.org