CAAI Transactions on Intelligent Systems, Vol. 8 No. 1, Feb. 2013
DOI: 10.3969/j.issn.1673-4785.201209059
Network Publishing Address: http://www.cnki.net/kcms/detail/23.1538.TP.20130205.1834.001.html

Immune based computer virus detection approaches

TAN Ying 1,2, ZHANG Pengtao 1,2
(1. Department of Machine Intelligence, School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China; 2. Key Laboratory of Machine Perception, Ministry of Education, Peking University, Beijing 100871, China)

Abstract: The computer virus is considered one of the most serious threats to the security of computer systems worldwide. The rapid development of evasion techniques used in viruses has rendered signature based computer virus detection techniques ineffective. Many novel computer virus detection approaches have been proposed to cope with this ineffectiveness; they fall mainly into three categories: static, dynamic and heuristic techniques. Owing to the natural similarities between the biological immune system (BIS) and the computer security system (CSS), the artificial immune system (AIS) was developed as a new prototype in the anti-virus research community. The immune mechanisms in the BIS provide opportunities to construct computer virus detection models that are robust and adaptive, with the ability to detect unseen viruses. In this paper, a variety of classic computer virus detection approaches are introduced and reviewed against the background of computer virus history. Next, a variety of immune based computer virus detection approaches are discussed in detail. Promising experimental results suggest that the immune based computer virus detection approaches are able to detect new variants and unseen viruses at lower false positive rates, which has paved a new way for anti-virus research.
Keywords: computer virus detection; artificial immune system; immune algorithms; hierarchical model; negative selection algorithm with penalty factor
CLC Number: TP309.5  Document Code: A  Article ID: 1673-4785(2013)01-0080-15

Received Date: 2012-09-27. Network Publishing Date: 2013-02-05.
Foundation Item: National Natural Science Foundation of China (No. 61170057, 60875080).
Corresponding Author: TAN Ying. E-mail: ytan@pku.edu.cn

Due to the rapid development of computer technology and the Internet, the computer has become a part of daily life in the 21st century. Meanwhile, computer security is receiving more and more attention. Computer viruses, new variants and unseen viruses in particular, have become one of the most dreadful threats to computers worldwide. Today's viruses are becoming more complex, with faster propagation and a stronger capacity for latency, destruction and infection. At present a virus can spread all over the world in a matter of minutes and cause huge economic losses. Protecting computers from these various types of viruses has become priority number one.

Currently, several companies produce anti-virus products, most of which are based on signatures. These products usually detect known viruses effectively, with low false positive rates and low overheads. Unfortunately, the same products fail to detect new variants and unseen viruses. With metamorphic and polymorphic techniques, even a layman can easily develop new variants of known viruses using virus automatons. For example, more than 580 variants of Agobot have been observed since its initial release; it makes use of polymorphism to evade detection and disassembly. Thus, traditional signature based computer virus detection approaches are no longer suitable for the new environments, and dynamic and heuristic techniques have started to emerge.

Dynamic techniques, such as virtual machine,
keep watch over the execution of every program at run-time and stop the program once it tries to harm the system. Most of these techniques monitor the behaviors of a program by analyzing the application programming interface (API) call sequences generated at runtime. Because of the huge overheads of monitoring API calls, it is practically impossible to deploy the dynamic techniques on personal computers at this time.

Data mining approaches, among the most popular heuristics, try to mine frequent patterns or association rules and detect viruses using classic classifiers. These approaches have led to some success. However, data mining approaches lose the semantic information of the code and cannot easily recognize unseen viruses in the long term.

The computer virus is named after the biological virus because of the similarities between them, such as parasitism, propagation and infection. The biological immune system (BIS) has long protected organisms from antigens, successfully solving the problem of detecting unseen antigens. Inspired by the BIS, applying immune mechanisms to detect computer viruses has developed into a new anti-virus field in the past few years, attracting many researchers. Forrest et al. applied immune theory to computer anomaly detection for the first time in 1994 [3]. Since then, many researchers have proposed various kinds of virus detection approaches and achieved some success. Some of them are mainly derived from ARTIS [4-6].

As time goes on, more and more immune mechanisms are becoming clear, which lays a good foundation for the development of the AIS. On this basis, many immune based computer virus detection approaches have been proposed, in which more and more immune mechanisms are involved. The simulation of the BIS by the AIS keeps advancing, and the immune based computer virus detection approaches have paved a new way for anti-virus research.

The authors of this paper have done some related work in the anti-virus field and achieved some success [7]. They have tried to make full use of the relativity among different features in a virus sample by constructing an immune based hierarchical model [7]. On the basis of the traditional negative selection algorithm (NSA), a novel negative selection algorithm with penalty factor (NSAPF) was proposed to overcome the drawback of the traditional NSA in defining the harmfulness of "self" and "nonself". It focuses on the danger of the code and greatly improves the effectiveness of the NSAPF based virus detection model.

The rest of this paper is organized as follows. In Section 1, the background knowledge of computer viruses is introduced. Section 2 presents a variety of classic computer virus detection approaches. In Section 3, the artificial immune system and immune based computer virus detection approaches are briefly described. Our previous work and conclusions are presented in detail in Sections 4 and 5, respectively.

1 Computer virus

1.1 Definition and features

In a narrow sense, a computer virus is a program that can infect other programs by modifying them to include a possibly evolved copy of itself. In a broad sense, a computer virus denotes any malicious code, that is, a program designed to harm or secretly access a computer system without the owner's informed consent, such as viruses in the narrow sense, worms, backdoors and Trojans. As computer viruses have developed, the lines between the different types of viruses have become blurred. In this paper, all programs that are not authorized by users and that perform harmful operations in the background are referred to as viruses.

The features of computer viruses are listed below.

1) Infectivity: Infectivity is the fundamental and essential feature of the computer virus in the narrow sense, and it is the foundation for detecting a virus. When a virus intrudes into a computer system, it starts to scan for programs and computers on the Internet that can be infected. Next, through self-duplication, it spreads to those programs and computers.

2) Destruction: According to the extent of destruction, viruses are divided into "benign" viruses and malignant viruses. "Benign" viruses, such as GENP and W-BOOT, merely occupy system resources, while malignant viruses usually have clear purposes. They can destroy data, delete files, and even format diskettes.

3) Concealment: Computer viruses often attach
themselves to benign programs and start up with the host programs. They perform harmful operations in the background, hidden from users.

4) Latency: After intruding into a computer system, viruses usually hide themselves from users instead of attacking the system immediately. This feature gives the viruses longer lives. They spread themselves and infect other programs during this period.

5) Trigger: Most viruses have one or more trigger conditions. When these conditions are satisfied, the viruses begin to destroy the system.

Other features of viruses include illegality, expressiveness, and unpredictability.

1.2 Development phases of the viruses

Viruses have evolved along with computer technology. Their development has gone through approximately the following phases.

1) DOS boot phase: Fig.1 and Fig.2 illustrate the boot procedures of DOS without and with a boot sector virus, respectively. Before the computer system obtains control, the virus starts up, modifies the interrupt vector and copies itself to infect the diskette. These are the original infection procedures of viruses. Moreover, similar infection procedures can still be found in today's viruses.

Fig.1 Normal boot procedure of DOS (flowchart boxes: System power on; Enter ROM-BIOS; Read boot sector to 0:7C00H; System reset; Read in COMMAND.COM; Complete the disk bootstrap)

Fig.2 Boot procedure of DOS with boot sector virus (flowchart boxes, main sequence: System power on; Enter ROM-BIOS; Read boot sector to 0:7C00H; System reset; Read in COMMAND.COM; Complete the disk bootstrap. Virus steps: Read the boot sector virus; Run the virus; Modify interrupt vector; Copy the virus and infect the disk)

2) DOS executable phase: In this phase the viruses exist in a computer system in the form of executable files. They take control of the system when users run applications infected by the viruses. Most viruses today are executable files.

3) Virus generator phase: Virus generators, also called virus automatons, can generate new variants of known viruses with different signatures. Metamorphic techniques, including instruction reordering, code expansion, code shrinking and garbage code insertion, are used here to confuse scanners that rely on virus signatures.

4) Macro virus phase: Before macro viruses emerged, all viruses infected only executable files, as this was almost the only way for a virus to obtain the right of execution. When users run the host of a virus, the virus starts up and controls the system. Infecting data files could not help a virus run itself. The emergence of macro viruses changed this situation; their targets are data files, mainly Microsoft Office files.

5) Virus techniques merging with hacker techniques: Nowadays the merging of virus techniques and hacker techniques has become a trend. It gives viruses much stronger concealment and latency and much faster propagation than ever before.

2 Classic virus detection approaches

The computer virus has become a major threat to the security of computers and the Internet worldwide. A wide range of host-based anti-virus solutions have been proposed by many researchers and companies. These anti-virus techniques can be broadly classified into three categories: static techniques, dynamic techniques and heuristics.

The fight between viruses and anti-virus techniques is more violent now than ever before. Viruses disguise themselves by using various kinds of evasion techniques, such as metamorphic and polymorphic techniques, packers and encryption. To cope with the new situation, the anti-virus techniques unpack the suspicious programs, decrypt them
and try to be robust to those evasion techniques. Nevertheless, the viruses evolve anti-unpacking and anti-decryption abilities and once again confuse the anti-virus techniques. The fight will never stop, and the virus techniques will always be ahead of the anti-virus techniques. What we can do is to increase the difficulty of intrusion, decrease the losses caused by the viruses and react to them as soon as possible.

2.1 Static techniques

Static techniques usually work on the bit strings, assembly code, and application programming interface (API) calls of a program without running the program. One of the best known static techniques is the signature based virus detection technique.

The signature based virus detection technique is the mainstream anti-virus approach, and most commercial anti-virus products are based on it. A signature is usually a bit string extracted from a virus sample that identifies the virus uniquely. The signature based anti-virus products are referred to as scanners in this paper.

In order to extract a signature from a virus, anti-virus experts first disassemble the virus into assembly code. Then they analyze it at the semantic level to figure out the mechanisms and workflow of the virus. Finally, a signature is extracted to characterize the virus uniquely.

This technique detects known viruses very quickly, with low false positive rates and high true positive rates. It is one of the simplest approaches and has minimal overheads. Nevertheless, since experts can extract the signature of a new virus only after the virus has broken out, it takes a long time before the new virus can be detected effectively. The losses already caused by the virus cannot be recovered. Furthermore, with the development of virus techniques, many evasion techniques help viruses evade signature based scanners, such as metamorphic and polymorphic techniques, packers, and encryption. The signature based anti-virus techniques are easily defeated by these techniques. For example, simple program entry point modifications consisting of two extra jump instructions effectively defeat most signature based scanners.

To conclude, the signature based anti-virus techniques are vulnerable to unseen viruses and to the evasion techniques of viruses. As a result, a variety of dynamic and heuristic anti-virus approaches have been developed to cope with these situations.

2.2 Dynamic techniques

Computer viruses often show some special behaviors when they harm computer systems, for example writing to executable files, dangerous operations (e.g., formatting a diskette), and switching between a virus and its host. These behaviors give us an opportunity to recognize the viruses. Based on this idea, the dynamic techniques keep watch over the execution of every program at run-time and observe the behaviors of the program. They stop the program once it tries to harm the computer system. The dynamic techniques usually utilize the operating system's API sequences, system calls and other kinds of behavior characteristics to identify the purpose of a program [14].

There are two main types of dynamic techniques: the behavior monitoring approach and the virtual machine approach.

Based on the assumption that viruses have some special behaviors that identify them and never appear in benign programs, behavior monitors keep watch over every behavior of a virus and aim to prevent the dangerous operations from causing destruction.

This approach is considered able to detect known viruses, new variants and unseen viruses, but running viruses on a real computer in this way is very dangerous. If a behavior monitor fails to kill a virus, the virus takes control of the computer. Moreover, the overheads brought in by a behavior monitor are too large for personal computers. The false positive rate of this approach is inevitably high, and the approach cannot recognize the type and name of a virus, so it cannot eliminate the virus from a computer. Furthermore, it is very hard to implement a relatively perfect behavior monitor.

In order to separate the running program from the real computer, the virtual machine approach creates a virtual machine (VM) and runs the programs in the VM. The execution environment of a program here is the VM, which is software, instead of the physical machine. Hence the computer is safe even when the VM is crashed by a virus. It is very easy to collect all the information while a program is running in a VM. If the VM captures any dangerous operation, it warns the users. When it confirms that the running program is a virus, it kills the virus.

The virtual machine approach is very safe and can recognize almost all viruses, including encrypted and packed viruses. The VM approach has now become one of the most impressive virus detection approaches. However, the VM brings considerable overheads to the host computers. How to implement a relatively perfect VM is an open research topic. In addition, the VM simulates only a part of the computer's functions, which gives anti-VM techniques opportunities to evade the VM approach.

Anti-VM techniques have been used in many viruses recently. For example, inserting some special instructions into a virus may cause a VM to crash. Entry point obscuring is also used by viruses to evade the VM approach.

Ref. [15-20] proposed some new virus detection models based on dynamic techniques. Although these models have shown promising results, they can produce high false positive rates, an issue which has yet to be resolved [21].

2.3 Heuristics

Schultz et al., who pioneered the application of machine learning and data mining techniques to the anti-virus field, proposed a data mining framework to detect unseen viruses effectively and automatically [22]. Three approaches are taken to feature extraction. The first makes use of Bin-Utils [23] of GNU to extract resource information of a program. In the second approach, string sequences are extracted using the GNU strings program. The third approach, called hex dump [24], transforms binary files into byte sequences. However, DLL and function names are too unstable to detect viruses. This work lays a good foundation for the application of machine learning and data mining techniques in the anti-virus field.

Kolter et al. proposed a technique to detect viruses in the field based on the relevant N-Grams selected using the information gain (IG) [25]. They clearly showed that using techniques from machine learning and data mining to detect viruses is feasible. N-Gram is a concept from text categorization, meaning N consecutive words or phrases. In the anti-virus field, an N-Gram is usually defined as a binary string of length N bytes. The experimental results revealed that boosted decision trees outperformed the other classifiers, with an area under the receiver operating characteristic curve (AUC) of 0.996. Later they extended this technique to classify viruses according to the functions of their payloads [26].

A new feature selection criterion, class-wise document frequency (CDF), was proposed by Reddy et al. and applied to N-Gram selection [27]. Their experimental results suggested that the CDF outperformed the IG in the feature selection process. They guessed the reason might be that most of the relevant N-Grams selected using the IG came from benign programs. Moreover, since the CDF tries to select the features with the highest frequencies in a specific class, it is biased toward the information of that class. As a result, it could not select the discriminating features for the class effectively.

Stolfo et al. made use of N-Grams to identify file types and later to detect stealthy viruses. Their experimental results showed that the method was able to detect embedded viruses. However, this method was not a general virus detection method.

Sulaiman et al. proposed a static analysis framework for detecting variants of viruses, called the disassembled code analyzer for virus (DCAM). Unlike traditional static code analysis, which usually works on the binary string of a program, the authors extracted virus features from disassembled code. The programs that passed three steps of matching were considered benign; otherwise the DCAM classified the programs as viruses. The experimental results suggested that the DCAM worked very well and could prevent breakouts of previously identified viruses.

Henchiri and Japkowicz adopted a data mining approach to extract frequent patterns (FPs) to detect viruses. They filtered the FPs twice and tried to obtain general FPs based on the intra-family support and inter-family support. Several classifiers were involved in this work, such as the J48 decision tree and naive
Bayes. They verified the effectiveness of their model using 5-fold cross validation, showing good results.

A virus detection model using cosine similarity analysis to detect obfuscated viruses was proposed in Ref. [33]. This work was based on the premise that, given a variant of a virus, any obfuscated version of the virus can be detected with high probability. In fact this model only worked on the code transposition technique. The biggest issue with this model was that extracting the functions within a program cannot be completed in real time.

Ye et al. made use of associative classification and post-processing techniques to detect viruses [34]. Firstly, they extracted the API calls from Windows PE files as the features of the samples and stored them in a feature database. Then they extended a modified FP-Growth algorithm proposed in Ref. [35-36] to generate the association rules. Finally, the authors reduced the number of rules and obtained a concise classifier by using post-processing techniques. Promising results demonstrated that this model outperformed popular anti-virus software as well as previous data mining based virus detection systems.

Tabish et al. proposed a virus detection model using statistical analysis of the byte-level file content [37]. This model worked on 1-, 2-, 3- and 4-Grams, and 13 statistical features were computed on the basis of the N-Grams. This model was not based on signatures. It neither memorized specific strings appearing in the file content nor depended on prior knowledge of file types. However, the false positive rates were relatively high, and how to set the block size was not explained.

Very recently, many new virus detection methods have been proposed; for details, please refer to Ref. [38-46].

3 Immune based computer virus detection approaches

3.1 Artificial immune system

3.1.1 Features and applications of AIS

The AIS is a computational system inspired by the BIS, which is referred to as the second brain. The AIS is a dynamic, adaptive, robust and distributed learning system. As it is fault tolerant and noise resistant, it is suitable for applications in time-varying unknown environments.

The AIS has been applied to many complex problem fields, such as optimization, pattern recognition, fault and anomaly diagnosis, network intrusion detection, and virus detection. The steps of a general immune algorithm are shown in Algorithm 1.

Algorithm 1 A general immune algorithm
Step 1): Input antigens;
Step 2): Initialize antibody population;
Step 3): Calculate the affinities of the antibodies;
Step 4): Lifecycle events: update the antibodies (creation and destruction);
Step 5): If the termination criteria are satisfied, go to Step 6); otherwise, go to Step 3);
Step 6): Output the antibodies.

There are four main algorithms in the AIS field: the negative selection algorithm (NSA), the clonal selection algorithm, the immune network model, and danger theory based immune algorithms. The principle of the NSA is shown in Fig.3.

Fig.3 The principle of the negative selection algorithm (flowchart: detectors are generated randomly and matched against the "self" set; a detector that matches is discarded, otherwise it enters the detector set)

Let us take the computer virus detection problem as an example to introduce the NSA. Firstly, the NSA randomly generates virus detectors, which are referred to as "nonself", while the benign programs are taken as "self". Secondly, the detectors are matched against "self". If a detector matches a "self", it is considered "self" and discarded; otherwise, it is included in the detector set. Finally, the NSA obtains a detector set in which none of the detectors matches any "self", and which is then used to detect viruses.

Prior knowledge of "nonself" is not needed when extracting the detector set with the NSA, so the NSA based approaches are able to classify "self" and "nonself" without knowledge of "nonself". This feature makes the NSA based approaches able to recognize unseen "nonself". The NSA is now mainly used in the computer security and fault diagnosis fields.

3.1.2 Motivations of applying immune mechanisms to detect virus

As we know, the computer virus is named after the biological virus because of the similarities between them, such as parasitism, propagation, infection, and destruction. The BIS has successfully protected the body from antigens since the very beginning of life, solving the problem of defeating unseen antigens. The computer security system has functions similar to those of the BIS. Furthermore, the features of the AIS, such as being dynamic, adaptive and robust, are also needed in the computer anti-virus system (CAVS). To sum up, applying immune mechanisms to the computer security system, especially the CAVS, is reasonable and has developed into a new field in the past few years, attracting many researchers. The relationship of the BIS and CAVS is listed in Table 1.

Table 1 The relationship of the BIS and CAVS
BIS                                       CAVS
Antigens                                  Computer viruses
Antibodies                                Detectors for the viruses
Binding of an antigen and an antibody     Pattern matching of the viruses and detectors

Applying immune mechanisms to detect viruses enables the CAVS to recognize new variants and unseen viruses by using existing knowledge. A CAVS with immune mechanisms would have many desirable features, such as being dynamic, adaptive and robust. It is considered able to make up for the weaknesses of the signature based virus detection techniques. The immune based computer virus detection approaches have paved a new way for anti-virus research in the past few years.

3.2 Related works

With the development of immunology, immune mechanisms have begun to be applied in the field of computer security. Forrest et al. first proposed a negative selection algorithm to detect anomalous modification of protected data in 1994 and later applied it to UNIX process detection. This marked the beginning of applying immune theory to the computer security system.

Kephart et al. described a blueprint of a computer immune system. They set forth some criteria that must be met to provide real-world, functional protection from rapidly spreading viruses, including innate immunity, adaptive immunity, delivery and dissemination, high speed, scalability, safety and reliability as well as customer control. In fact, these criteria have become the standards for other computer immune systems from then on.

Based on the clonal selection theory of Burnet, the clone selection algorithm was proposed by Kim and Bentley.

Matzinger proposed the "danger theory" in 2002 [49]. The danger theory (DT) holds that the immune system is more concerned with danger than with nonself. It successfully explains many new findings and corrects the fault of the traditional "self" and "nonself" model in defining the harmfulness of "self" and "nonself". Many researchers have tried to introduce this new theory into the AIS, and it has developed into a new branch of the AIS.

Since then, more and more researchers have devoted themselves to the study of computer immune systems based on immune mechanisms, and various kinds of immune based computer virus detection approaches have been proposed.

Edge et al. introduced a new artificial immune system based on a retrovirus algorithm (REALGO), which was inspired by reverse transcription ribonucleic acid (RNA) as found in biological systems. In the learning phase, positive selection generated new antibodies using a genetic algorithm based on known virus signatures, and negative selection ensured that these antibodies did not trigger on "self". The REALGO provided a memory for each antibody in the genetic algorithm so that an antibody could remember its best situation. With the help of the memory, the REALGO was able to revert to the previous generation and mutate in a different "direction" to escape local extrema.

Li Zhou et al. proposed an immune based virus detection approach with process call arguments and user feedback. It collected the arguments of process calls instead of the sequences of process calls, and utilized these arguments to train detectors with real-valued negative selection algorithms. In the test phase, they adjusted the threshold between benign programs and viruses
through user feedback. The detection rate achieved was 0.7, which showed that the approach could cope with unseen viruses. However, asking users to distinguish a virus from normal files and give feedback was very difficult.

Li Tao proposed a dynamic detection model for computer viruses based on an immune system. Through the dynamic evolution of "self", an antibody gene library and detectors, this model reduced the size of the "self" set, raised the efficiency of generating detectors, and resolved the problem of detector training time being exponential with respect to the size of "self".

A DT inspired artificial immune algorithm for the online supervised two-class classification problem was proposed. The size of the danger zone in this algorithm decreases as the accumulated intensity of the antibody increases. The better antibodies proliferate and live longer through the clonal selection algorithm, while a suppression mechanism is utilized to control the antibody population. Experimental results suggested that this algorithm performed well, with good generalization capability.

Zhu and Tan proposed a DT based learning model for combining classifiers and applied it to spam detection. There are three components in this model: binary-valued signal 1, signal 2 and the danger zone. If signal 1 and signal 2 make the same classification for a sample, the sample is classified directly. Otherwise, a self-trigger process has to be carried out to resolve the signal conflict. The classifiers used to emit immune signals are supposed to be conditionally independent, in order to get different trained classifiers from the same data source.

The immune based computer virus detection approaches are able to detect new variants and unseen viruses. These approaches have developed into a new field for computer virus detection and have attracted more and more researchers. However, they lack a rigorous mathematical foundation. In addition, the simulation of the BIS by the AIS is still very simple. Studies of immune algorithms that take the characteristics of computer virus detection into account are needed. There is still a long way to go before immune based computer virus detection approaches can be applied in the real world.

4 Our work

4.1 An immune based virus detection system using affinity vectors

4.1.1 Overview

Aiming at building a lightweight early virus warning system that needs only limited computer resources, an immune based virus detection system using affinity vectors (IVDS) was proposed.

Firstly, the IVDS generates a detector set from a training set by using negative selection and clone selection. The negative selection eliminates autoimmune detectors and ensures that no detector in the detector set matches any "self", while the clone selection increases the diversity of the detectors in the detector set, which helps the IVDS obtain a stronger ability to recognize new variants and unseen viruses. Secondly, two novel hybrid distances, named hamming-max and shift r bit-continuous distance, are proposed to calculate the affinity vector of a program. Finally, based on the affinity vector, three classic classifiers, the support vector machine (SVM), the radial basis function (RBF) neural network and K-nearest neighbor (KNN), are used to estimate the performance of the proposed IVDS.

4.1.2 Experiments and discussions

The experiments were conducted on the CILPKU08 dataset, which was collected by the Computational Intelligence Laboratory at Peking University in 2008. There are 3 547 malware samples in this dataset. The three test datasets used here were obtained by randomly dividing the CILPKU08 dataset; the details are shown in Table 2.

Table 2 The test datasets used in the experiments
Datasets     Training set                   Test set                       Percentage of
             Benign programs   Viruses      Benign programs   Viruses      training set
Dataset 1    71                885          213               2662         0.25
Dataset 2    142               1773         142               1773         0.50
Dataset 3    213               2662         71                885          0.75

The percentage of training set given in Table 2 is set as NTS/(NTS + NDS), where NTS and NDS denote the number of programs in a training set and a test set, respectively. There is no overlap between a
test set and its corresponding training set. This setting makes the experiments credible. The experimental results are shown in Fig.4.

Fig.4 The accuracies of the SVM, RBF network and KNN, plotted against the percent of train set (0.25, 0.50, 0.75) in four panels: (a) Detect benign files in train set; (b) Detect virus files in train set; (c) Detect benign files in test set; (d) Detect virus files in test set

Fig.4 illustrates that the IVDS achieves the optimal accuracy when the percentage of the training set is 25%. The RBF network outperformed the SVM and KNN on all the training sets, while it got worse accuracies on the test sets. This indicates that the RBF network has weaker generalization ability in this context, whereas the SVM and KNN perform stably on all the training sets and test sets with different percentages of the training sets. The experimental results suggested that the proposed IVDS has a strong detection ability and good generalization performance.

4.2 A hierarchical artificial immune model for virus detection
4.2.1 Overview

A hierarchical artificial immune model for virus detection (HAIM) was proposed in Ref.[7]. The motivation of the HAIM is to make full use of the correlation among the different signatures in a virus sample. Generally speaking, a virus usually contains several heuristic signatures and a heuristic signature may appear in various viruses. It is reasonable to believe that these heuristic signatures are correlated and that a specific combination of some signatures makes up a virus. The HAIM, taking a virus as a unit, detects viruses by making full use of the simple correlation among signatures in a virus sample.

The HAIM is composed of two modules: a virus gene library generating module and a self-nonself classification module. The first module performs the training: it generates the detecting gene library from a training set. The second module performs the detection: it uses the results from the first module to classify suspicious programs. The processes of the two modules are given in Fig.5 and Fig.6, respectively.

The virus gene library generating module extracts a virus instruction library based on the statistics collected in a training set. Here an instruction is defined as a binary string of length 2 bytes. Then a candidate virus gene library and a benign virus-like gene library are obtained by traversing all the viruses and all the benign programs in the training set with a sliding window, respectively. Finally, according to the negative selection mechanism, the candidate virus library is upgraded to the detecting virus gene library.
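The library generation steps of the HAIM can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the implementation of Ref.[7]: the 2-byte instruction length comes from the text, while the gene length `GENE_LEN`, the frequency-ratio test `min_ratio` used to pick virus instructions, the window step of one byte, and the exact-match negative selection are all hypothetical choices made for the example.

```python
from collections import Counter

INSTR_LEN = 2  # bytes per instruction, as defined in the text
GENE_LEN = 3   # instructions per gene (hypothetical value)


def instructions(program: bytes):
    """Yield every 2-byte instruction seen by a sliding window with step 1."""
    for i in range(len(program) - INSTR_LEN + 1):
        yield program[i:i + INSTR_LEN]


def build_instruction_library(viruses, benign, min_ratio=2.0):
    """Keep instructions found in a clearly larger share of viruses than of
    benign programs (the ratio test is an assumed selection statistic)."""
    v_freq = Counter(ins for p in viruses for ins in set(instructions(p)))
    b_freq = Counter(ins for p in benign for ins in set(instructions(p)))
    v_n, b_n = max(len(viruses), 1), max(len(benign), 1)
    return {ins for ins, c in v_freq.items()
            if c / v_n > min_ratio * b_freq[ins] / b_n}


def extract_genes(program: bytes, library):
    """Collect windows whose GENE_LEN instructions all belong to the library."""
    width = INSTR_LEN * GENE_LEN
    genes = set()
    for i in range(len(program) - width + 1):
        window = program[i:i + width]
        parts = [window[j:j + INSTR_LEN] for j in range(0, width, INSTR_LEN)]
        if all(part in library for part in parts):
            genes.add(window)
    return genes


def build_detecting_library(viruses, benign):
    """Return the instruction library and one gene set per virus sample."""
    library = build_instruction_library(viruses, benign)
    # Candidate virus gene library: genes are kept per virus sample.
    candidate = [extract_genes(v, library) for v in viruses]
    # Benign virus-like gene library: virus-like genes found in benign programs.
    benign_like = set().union(*(extract_genes(b, library) for b in benign))
    # Negative selection: discard every candidate gene that matches "self".
    detecting = [genes - benign_like for genes in candidate]
    return library, [genes for genes in detecting if genes]
```

Keeping one gene set per virus sample, rather than a single flat set, is a design choice that preserves the "virus as a unit" idea, so that a later matching step can compare a suspicious program against each virus sample individually.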
Fig.5 Virus gene library generating process (the viruses and the benign programs in the training set are traversed and matched against the instruction library to give the candidate virus gene library and the benign virus-like gene library; negative selection then yields the detecting virus gene library)

Fig.6 Self-nonself classification process (a suspicious program is traversed and matched against the virus instruction library to give suspicious virus-like genes; these are matched with the detecting virus gene library, and the similarity value is compared with a threshold to label the program as a virus or a benign program)

In the self-nonself classification module, the suspicious virus-like genes are extracted from a suspicious program by using the same method that generates the candidate virus gene library. Then the virus-like genes in the suspicious program are matched with the detectors in the detecting virus gene library to get a matching value, which is taken as the affinity of the program. If it is larger than a chosen threshold, the program is regarded as a virus; otherwise it is regarded as a legal program.

The hierarchical matching method used to calculate the affinity of a suspicious program is illustrated in Fig.7. At the gene level, T-successive consistency matching is used to perform fuzzy matching. At the individual level, a suspicious program is compared to a virus sample in the detecting virus gene library. As the interrelated information of instructions is preserved as much as possible, the HAIM takes full advantage of the potential relevance between different signatures to detect viruses. Due to the similarity among different viruses, it is able to detect new variants and unseen viruses effectively. Finally, at the decision level, a classification decision is made based on the detecting virus gene library. After passing through the three levels of matching, the program is classified.

Fig.7 The hierarchical matching method (gene level, individual level and decision level)

4.2.2 Experiments and discussions

The experiments in this work are carried out on the CILPKU08 dataset. By randomly dividing the dataset nine times, nine test datasets are obtained and nine tests are carried out, respectively. The experimental results are listed in Table 3.
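To make the three matching levels of Section 4.2.1 concrete before reading the results, the following Python sketch shows one possible reading of the hierarchical matching. The details are assumptions rather than the method of Ref.[7]: T-successive consistency is interpreted here as "two genes agree on at least `t` consecutive aligned bytes", the individual-level affinity is taken as the share of a virus sample's genes matched by the program, and the decision level takes the best individual affinity and compares it with a hypothetical `threshold`.

```python
def t_successive_match(gene_a: bytes, gene_b: bytes, t: int) -> bool:
    """Gene level: True if the genes agree on at least t consecutive positions."""
    run = 0
    for x, y in zip(gene_a, gene_b):
        run = run + 1 if x == y else 0
        if run >= t:
            return True
    return False


def individual_affinity(suspicious_genes, virus_genes, t):
    """Individual level: share of one virus sample's genes found in the program."""
    if not virus_genes:
        return 0.0
    hits = sum(
        any(t_successive_match(gene, s, t) for s in suspicious_genes)
        for gene in virus_genes
    )
    return hits / len(virus_genes)


def classify(suspicious_genes, detecting_library, t=4, threshold=0.5):
    """Decision level: compare the best individual affinity with a threshold.

    detecting_library is a list with one gene set per virus sample.
    Returns (is_virus, affinity).
    """
    affinity = max(
        (individual_affinity(suspicious_genes, v, t) for v in detecting_library),
        default=0.0,
    )
    return affinity > threshold, affinity
```

Because the gene-level match tolerates differences outside the run of `t` agreeing bytes, a variant that changes a few bytes of a known gene can still contribute to the affinity, which is the kind of fuzzy matching the text attributes to the HAIM.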