IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. X, NO. X, XXX 200X

MILD: Multiple-Instance Learning via Disambiguation

Wu-Jun Li, Dit-Yan Yeung

Abstract

In multiple-instance learning (MIL), an individual example is called an instance and a bag contains a single or multiple instances. The class labels available in the training set are associated with bags rather than instances. A bag is labeled positive if at least one of its instances is positive; otherwise, the bag is labeled negative. Since a positive bag may contain some negative instances in addition to one or more positive instances, the true labels for the instances in a positive bag may or may not be the same as the corresponding bag label and, consequently, the instance labels are inherently ambiguous. In this paper, we propose a very efficient and robust MIL method, called MILD (Multiple-Instance Learning via Disambiguation), for general MIL problems. First, we propose a novel disambiguation method to identify the true positive instances in the positive bags. Second, we propose two feature representation schemes, one for instance-level classification and the other for bag-level classification, to convert the MIL problem into a standard single-instance learning (SIL) problem that can be solved by well-known SIL algorithms, such as the support vector machine. Third, an inductive semi-supervised learning method is proposed for MIL. We evaluate our methods extensively on several challenging MIL applications to demonstrate their promising efficiency, robustness and accuracy.

Index Terms

Multiple-instance learning, learning from ambiguity, CBIR, object recognition, co-training, drug activity prediction

Wu-Jun Li and Dit-Yan Yeung are with the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China. E-mail: liwujun@cse.ust.hk; dyyeung@cse.ust.hk

Manuscript received xxxx xx, 2008; revised xxxx xx, xxxx.
1 INTRODUCTION

1.1 Multiple-Instance Learning

The task of drug activity prediction [1] is to classify aromatic molecules according to whether or not they are "musky". Here, the same molecule can manifest several different steric configurations (i.e., molecular shapes), each with very different energy properties. A molecule is said to be musky if it binds itself to a particular receptor when it is in one or some of its configurations, although it cannot bind itself to the receptor in its other configurations. When a molecule cannot bind itself to the receptor in any of its configurations, it is said to be non-musky.

A new learning paradigm called multiple-instance learning (MIL) was proposed in [1] to model learning problems like drug activity prediction. In an MIL problem [1], an individual example is called an instance and a bag contains a single or multiple instances. A bag is labeled positive if at least one of its instances is positive; otherwise, the bag is labeled negative. In the example of drug activity prediction, a bag corresponds to a molecule, and an instance corresponds to a configuration. A configuration is said to be positive if it can make the molecule bind to the receptor. Otherwise, it is called negative. In MIL, the class labels available in the training set are associated with bags rather than instances. For example, we only know whether or not a molecule is musky, but do not know in which configuration a molecule can bind itself to the receptor.

As more and more applications have been formulated as MIL problems, some of them are slightly different from the original formulation of MIL in [1]. Recent works [2], [3] have pointed out that there are actually two different settings for MIL. One setting is based on the existential assumption, which assumes that a bag can be determined to be positive as long as one single positive instance exists in it. This setting is the same as the original formulation in [1]. The other setting is based on the collective assumption [3], which assumes that a bag's label is collectively determined by a set of instances or even all the instances in the corresponding bag. One representative application conforming to the collective assumption is the region-based image classification task [4], [5], where the label of an image always refers to some object in that image. Because current automatic image segmentation methods may cut the object responsible for the label of the image into multiple parts, any single part cannot represent the object satisfactorily. It is the collective property of multiple parts that determines the label of the image.
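As a minimal illustration of the existential assumption, the following Python sketch derives a bag label from hypothetical instance labels (such instance labels are, of course, unobserved in real MIL training data); under the collective assumption no such simple rule applies, since the instances determine the bag label jointly.

# A minimal sketch of the existential bag-labeling rule: a bag is positive
# if and only if at least one of its instances is positive. The instance
# labels used here are hypothetical; in MIL training data they are unobserved.
from typing import Sequence

def bag_label(instance_labels: Sequence[int]) -> int:
    """Return +1 if any instance label is +1, otherwise -1."""
    return +1 if any(y == +1 for y in instance_labels) else -1

print(bag_label([-1, -1, +1]))   # +1: one true positive instance suffices
print(bag_label([-1, -1, -1]))   # -1: all instances are negative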
From the formulation of MIL, we can easily see that a positive bag may contain some negative instances in addition to one or more positive instances. Hence, the true labels for the instances in a positive bag may or may not be the same as the corresponding bag label and, consequently, the instance labels are inherently ambiguous. In the MIL literature [2], true positive instances and false positive instances refer to the positive and negative instances, respectively, in the positive bags.

1.2 Motivation

Since labels are not available for the training instances, some methods [2], [6] simply assume that all instances in a bag have the same label as the bag label. However, this assumption can be very unreasonable for positive bags because a positive bag may contain as few as only one true positive instance. If the majority of the negative instances in a positive bag are mislabeled this way, the features learned for distinguishing positive instances from negative instances may end up being very misleading. Other methods [7], [8], [9] try to extend some standard single-instance learning (SIL) methods for multi-instance data by adding some constraints. Unfortunately, the resulting methods typically require solving non-convex optimization problems which suffer from the local minima problem and have high computation cost.

Considering the limitations of some previous methods, we advocate here the necessity of instance label disambiguation as a way to eventually improve the prediction accuracy of the bag labels. In the context of MIL, disambiguation essentially refers to identifying the true positive instances in the positive bags.

However, existing disambiguation methods cannot achieve promising performance. The APR (axis-parallel rectangle) method [1] tries to find an axis-parallel rectangle (or, more generally, hyper-rectangle) in the feature space to represent the area of the true positive instances. This rectangle should include at least one instance from each positive bag but exclude all instances from the negative bags. Although APR works quite well for the drug activity prediction problem, it is highly possible that no APR can be found for some other applications, such as image or text categorization, to satisfy the requirement that at least one instance from each positive bag is included while all instances from the negative bags are excluded. Moreover, APR is very sensitive to labeling noise. Suppose only one negative bag is mislabeled as a positive one. In order to include at least one instance from this mislabeled negative bag, the computed APR may
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,VOL.X,NO.X,XXX 200X 4 contain so many negative instances that it cannot represent the area of true positive instances at all (cf.Section 3.2.3).DD (diverse density)value [10]is proposed to measure how many different positive bags have instances near a point in the feature space and how far the negative instances are from that point.The DD method tries to find the point with the highest DD value as the target concept.The DD method is very sensitive to labeling noise too since the DD value at a point will be exponentially reduced if an instance from some negative bag is close to that point (cf.Section 3.2.3).This phenomenon has been validated empirically by [11].Moreover, since the DD landscape contains local maxima,searching for the point with the highest DD value generally requires multiple restarts and hence incurs high computation cost.Other methods,such as mi-SVM [7],adopt some heuristic methods to solve the optimization problem,which may lead to local minima and incur high computation cost. 1.3 Main Contributions In this paper,we propose a very efficient and robust MIL method,called MILD (Multiple- Instance Learning via Disambiguation),for general MIL problems.The main contributions of this paper can be summarized as follows: By investigating the properties of true positive instances in depth,we propose a novel disambiguation method for identifying the true positive instances in the positive bags.This method is not only very efficient but also very robust. Two feature representation schemes,one for instance-level classification and the other for bag-level classification,are proposed to convert the MIL problem into a standard single- instance learning problem that can be solved directly by well-known SIL algorithms,such as support vector machine (SVM). By combining the two feature representation schemes,a multi-view semi-supervised learning method based on co-training [12]is proposed for MIL.To the best of our knowledge,this is the first inductive semi-supervised learning method for MIL. To demonstrate the promising performance of our method,we extensively compare our method with many state-of-the-art MIL methods in diverse applications,including drug activity prediction,protein sequence classification and image classification. One of the most attractive advantages of our methods is that,after instance label disambigua- tion,the MIL problem is converted into a standard single-instance learning problem.As a result, March 1.2009 DRAFT
contain so many negative instances that it cannot represent the area of true positive instances at all (cf. Section 3.2.3). The DD (diverse density) value [10] is proposed to measure how many different positive bags have instances near a point in the feature space and how far the negative instances are from that point. The DD method tries to find the point with the highest DD value as the target concept. The DD method is very sensitive to labeling noise too, since the DD value at a point will be exponentially reduced if an instance from some negative bag is close to that point (cf. Section 3.2.3). This phenomenon has been validated empirically by [11]. Moreover, since the DD landscape contains local maxima, searching for the point with the highest DD value generally requires multiple restarts and hence incurs high computation cost. Other methods, such as mi-SVM [7], adopt some heuristic methods to solve the optimization problem, which may lead to local minima and incur high computation cost.

1.3 Main Contributions

In this paper, we propose a very efficient and robust MIL method, called MILD (Multiple-Instance Learning via Disambiguation), for general MIL problems. The main contributions of this paper can be summarized as follows:

• By investigating the properties of true positive instances in depth, we propose a novel disambiguation method for identifying the true positive instances in the positive bags. This method is not only very efficient but also very robust.

• Two feature representation schemes, one for instance-level classification and the other for bag-level classification, are proposed to convert the MIL problem into a standard single-instance learning problem that can be solved directly by well-known SIL algorithms, such as support vector machine (SVM).

• By combining the two feature representation schemes, a multi-view semi-supervised learning method based on co-training [12] is proposed for MIL. To the best of our knowledge, this is the first inductive semi-supervised learning method for MIL.

• To demonstrate the promising performance of our method, we extensively compare our method with many state-of-the-art MIL methods in diverse applications, including drug activity prediction, protein sequence classification and image classification.

One of the most attractive advantages of our methods is that, after instance label disambiguation, the MIL problem is converted into a standard single-instance learning problem. As a result,
many existing supervised and semi-supervised learning methods can be easily adapted for MIL applications.

2 RELATED WORK

Over the past few years, many applications have been formulated as MIL problems. These include drug activity prediction [1], [13], [14], [15], image classification [4], [11], [16], [17], [18], text classification [2], [7], stock selection [7], [10], protein family modeling [19], computer aided diagnosis [20], [21], and security applications [22]. To solve these problems, many MIL methods have been proposed. The first MIL method is APR [1], which represents the target concept by an axis-parallel rectangle (or hyper-rectangle) in the feature space. The rectangle includes at least one instance from each positive bag but excludes all instances from the negative bags. [10] proposed a measure called DD (diverse density), which essentially measures how many different positive bags have instances near a point in the feature space and how far the negative instances are from that point. A DD-based method tries to find the point with the highest DD value as the target concept. EM-DD [23], [24] combines expectation-maximization (EM) [25] with the DD formulation by using EM to search for the most likely concept.

Several other methods try to modify standard single-instance learning methods for MIL by introducing constraints derived from the MIL formulation. [7] proposed two methods based on SVM, one (mi-SVM) for instance-level classification and the other (MI-SVM) for bag-level classification. Both methods are formulated as mixed integer quadratic programming problems. [26] applied deterministic annealing to the SVM formulation to find better local minima compared to the heuristic methods. [6] proposed a kernel function directly for bags. With this multi-instance (MI) kernel, in principle any kernel-based classifier can be trained for classifying the bags. [27] extended this work by proposing a marginalized MI kernel to convert the MIL problem from an incomplete data problem to a complete data problem. [28] proposed an SVM-based method for sparse positive bags by directly enforcing the desired constraint that at least one of the instances in a positive bag is positive. [9] proposed to apply transductive SVM for solving MIL problems.

Recently, some methods try to convert MIL into a standard single-instance problem and then solve it with conventional methods by representing the bags as feature vectors. [4] proposed the DD-SVM method through a feature mapping defined on some instance prototypes learned by DD. [11] proposed another method, called MILES (Multiple Instance Learning via Embedded
instance Selection), based on a novel instance-based feature mapping and the 1-norm SVM model for simultaneous feature selection and classification.

However, most of the existing methods cannot answer the essential question giving rise to the gap between multiple-instance learning and single-instance learning: which are the true positive instances and what property do the true positive instances have? This motivates the work in this paper.

3 MILD: MULTIPLE-INSTANCE LEARNING VIA DISAMBIGUATION

In this section, our disambiguation method is described in detail, and then two feature representation schemes are presented for instance-level classification and bag-level classification, respectively, based on which the MIL problem is converted into a standard single-instance learning problem that can be directly solved by SIL algorithms, such as SVM. In addition, multi-view learning [12] can be easily adapted for MIL by combining the two views, the instance-level view and the bag-level view, of the bags.

3.1 Notations

$B_i^+$ denotes a positive bag and $B_i^-$ denotes a negative bag. When the label of a bag does not matter, we simply denote the bag as $B_i$. $B_{ij}^+$ denotes an instance in a positive bag $B_i^+$ and $B_{ij}^-$ is an instance in a negative bag $B_i^-$. Let $B = \{B_1^+, B_2^+, \ldots, B_{n^+}^+, B_1^-, B_2^-, \ldots, B_{n^-}^-\}$ denote the set of $n^+$ positive and $n^-$ negative training bags. $l(B_i) \in \{+1, -1\}$ is the bag label of $B_i$ and $l(B_{ij}) \in \{+1, -1\}$ is the instance label of $B_{ij}$. In general, we always represent all instances as feature vectors of the same dimensionality. Hence, in this paper, an instance also refers to a feature vector.

3.2 Disambiguation

According to the MIL formulation, all instances in the negative bags are negative and hence there exists no ambiguity in the negative bags if there is no labeling noise. As for the positive bags, the only thing we know is that each positive bag must contain at least one true positive instance, but it may also contain many negative instances. Thus, ambiguity arises in the positive bags since we do not know the labels of the instances there. The goal of disambiguation is to identify the true positive instances in the positive bags.
3.2.1 Some Properties of True Positive Instances

Assumption 1: We assume that given a true positive instance $t$, the probability that an instance $B_{ij}$ is positive is calculated as follows:

$$\Pr(l(B_{ij}) = +1 \mid t) = \exp\!\left(-\frac{\|t - B_{ij}\|^2}{\sigma_t^2}\right), \qquad (1)$$

where $\|x\| \triangleq \sqrt{\sum_k x_k^2}$ denotes the 2-norm of the vector $x$, and $\sigma_t$ is a parameter larger than 0 (this parameter will be learned from the training data by the algorithm introduced later).

Remark 1: The rationale of Assumption 1 is similar to that of kernel density estimation with the kernel being a Gaussian. Kernel density estimation can be thought of as placing a small "bump" at each observation, which will give higher probability density near the observations. Similarly, if we have already observed an instance $t$ to be truly positive, then the chance that an instance near $t$ is truly positive will be relatively high. However, one key point that should be noted is that, unlike kernel density estimation which tries to estimate a probability density function, Assumption 1 is used to compute a conditional probability. From (1), we can easily see that $0 \le \Pr(l(B_{ij}) = +1 \mid t) \le 1$, $\Pr(l(B_{ij}) = +1 \mid t) = 0$ when $\|t - B_{ij}\| = +\infty$, and $\Pr(l(B_{ij}) = +1 \mid t) = 1$ when $\|t - B_{ij}\| = 0$. Obviously, this is a reasonable probability function, but not a reasonable probability density function. Furthermore, the assumption in (1) also coincides well with our intuition. If we have known that $t$ is a true positive instance and $\|t - B_{ij}\| = 0$, $B_{ij}$ will definitely be a true positive instance because $B_{ij}$ is just equal to $t$, which means that $\Pr(l(B_{ij}) = +1 \mid t) = 1$ is reasonable. Similarly, if $\|t - B_{ij}\| = +\infty$, then $B_{ij}$ is infinitely far away from the true positive instance $t$, and $\Pr(l(B_{ij}) = +1 \mid t) = 0$ is also reasonable. The farther $B_{ij}$ is away from $t$, the lower is the probability that $B_{ij}$ is positive given $t$, which is reasonable based on our intuition.

Remark 2: We try to deal with the general case that all the true positive instances are not necessarily identical to a single point in the instance space, but are generated from some probability distribution over the whole instance space. Hence, each point in the instance space is associated with a probability value to indicate how confidently that point is truly positive.
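For concreteness, the following Python sketch evaluates the conditional probability of Assumption 1, i.e., Eq. (1); the two instance vectors and the value of $\sigma_t$ used here are hypothetical.

# A small sketch of Eq. (1): Pr(l(B_ij) = +1 | t) = exp(-||t - B_ij||^2 / sigma_t^2).
# The instance, the candidate true positive instance t, and sigma_t are hypothetical.
import numpy as np

def prob_instance_positive(t: np.ndarray, b: np.ndarray, sigma_t: float) -> float:
    """Conditional probability that instance b is positive given true positive t."""
    return float(np.exp(-np.sum((t - b) ** 2) / sigma_t ** 2))

t = np.array([0.2, 0.5])   # assumed true positive instance
b = np.array([0.3, 0.4])   # an instance from some positive bag
print(prob_instance_positive(t, b, sigma_t=0.5))   # close to 1 because b is near t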
Definition 1: The most-likely-cause estimator for estimating the probability that a bag $B_i$ is a positive bag given a true positive instance $t$ is defined as follows:

$$\Pr(l(B_i) = +1 \mid t) = \max_{B_{ij} \in B_i} \Pr(l(B_{ij}) = +1 \mid t) = \max_{B_{ij} \in B_i} \exp\!\left(-\frac{\|t - B_{ij}\|^2}{\sigma_t^2}\right) = \exp\!\left(-\frac{d^2(t, B_i)}{\sigma_t^2}\right), \qquad (2)$$

where

$$d(t, B_i) = \min_{B_{ij} \in B_i} \|t - B_{ij}\|. \qquad (3)$$

In other words, the distance $d(t, B_i)$ between an instance $t$ and a bag $B_i$ is simply equal to the distance between $t$ and the nearest instance in $B_i$.

The definition of the most-likely-cause estimator completely follows the underlying principle of the MIL formulation. Let $B_{ik}$ be the instance in bag $B_i$ which is the most likely one to be positive, i.e., $\Pr(l(B_{ik}) = +1 \mid t)$ is the largest among all instances in $B_i$. If $\Pr(l(B_{ik}) = +1 \mid t)$ is small enough, certainly we should assign $B_i$ to be negative. Otherwise, $B_i$ should be positive, even if all instances in $B_i$ other than $B_{ik}$ have low probabilities to be positive. Moreover, the definition is also consistent with intuition. If $d(t, B_i) = 0$, which implies that $t$ is an instance of $B_i$, $\Pr(l(B_i) = +1 \mid t)$ will be 1. This is reasonable, because $B_i$ contains a true positive instance $t$. In general, the larger $d(t, B_i)$ is, the lower is the probability that $B_i$ is positive given the true positive instance $t$.
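The most-likely-cause estimator can be computed directly from Eqs. (2) and (3), as the following Python sketch shows; a bag is stored as an array of instance feature vectors, and the toy bag and $\sigma_t$ are assumed for illustration only.

# A sketch of Eqs. (2)-(3): the instance-to-bag distance and the
# most-likely-cause estimate of Pr(l(B_i) = +1 | t). Toy data only.
import numpy as np

def bag_distance(t: np.ndarray, bag: np.ndarray) -> float:
    """d(t, B_i): distance from t to the nearest instance in the bag (Eq. (3))."""
    return float(np.min(np.linalg.norm(bag - t, axis=1)))

def prob_bag_positive(t: np.ndarray, bag: np.ndarray, sigma_t: float) -> float:
    """Pr(l(B_i) = +1 | t) = exp(-d(t, B_i)^2 / sigma_t^2) (Eq. (2))."""
    d = bag_distance(t, bag)
    return float(np.exp(-(d ** 2) / sigma_t ** 2))

t = np.array([0.2, 0.5])                                 # assumed true positive instance
bag = np.array([[0.9, 0.1], [0.25, 0.45], [0.7, 0.8]])   # one bag with three instances
print(bag_distance(t, bag), prob_bag_positive(t, bag, sigma_t=0.5))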
Theorem 1: Given a true positive instance $t$, there exists a threshold $\theta_t$ which makes the decision function defined in (4) label the bags according to the Bayes decision rule.

$$h_{\theta_t}^t(B_i) = \begin{cases} +1 & d(t, B_i) \le \theta_t \\ -1 & \text{otherwise} \end{cases} \qquad (4)$$

Proof: According to the Bayes decision rule [29, page 23], the label of $B_i$ given a true positive instance $t$ should be decided as follows:

$$l(B_i \mid t) = \begin{cases} +1 & \Pr(l(B_i) = +1 \mid t) \ge \Pr(l(B_i) = -1 \mid t) \\ -1 & \text{otherwise} \end{cases}$$

From Definition 1, we have

$$\Pr(l(B_i) = +1 \mid t) - \Pr(l(B_i) = -1 \mid t) = \exp\!\left(-\frac{d^2(t, B_i)}{\sigma_t^2}\right) - \left[1 - \exp\!\left(-\frac{d^2(t, B_i)}{\sigma_t^2}\right)\right] = 2\exp\!\left(-\frac{d^2(t, B_i)}{\sigma_t^2}\right) - 1.$$

Furthermore, we have

$$\Pr(l(B_i) = +1 \mid t) \ge \Pr(l(B_i) = -1 \mid t) \Leftrightarrow \Pr(l(B_i) = +1 \mid t) - \Pr(l(B_i) = -1 \mid t) \ge 0 \Leftrightarrow 2\exp\!\left(-\frac{d^2(t, B_i)}{\sigma_t^2}\right) - 1 \ge 0 \Leftrightarrow d(t, B_i) \le \sigma_t \sqrt{\ln 2}. \qquad (5)$$

Hence, if $\theta_t = \sigma_t \sqrt{\ln 2}$, the decision function defined in (4) will label the bags in accordance with the Bayes decision rule. □

Therefore, if $t$ is a true positive instance, there must exist a decision function as defined in (4) to label the bags well, meaning that the distances from $t$ to the positive bags are expected to be smaller than those to the negative bags.

For a negative instance (i.e., false positive instance), however, its distances to the positive and negative bags do not exhibit the same distribution as those from $t$. Since some positive bags may also contain negative instances just like the negative bags, the distances from the negative instance to the positive bags may be as random as those to the negative bags. This distributional difference provides an informative hint for identifying the true positive instances.
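The resulting bag classifier is simply a threshold on the instance-to-bag distance. The Python sketch below illustrates the decision function of Eq. (4) with the Bayes-optimal threshold $\theta_t = \sigma_t \sqrt{\ln 2}$ from Theorem 1; the bag data and $\sigma_t$ are hypothetical.

# A sketch of Eq. (4) with theta_t = sigma_t * sqrt(ln 2) (Theorem 1). Toy data only.
import numpy as np

def bag_distance(t: np.ndarray, bag: np.ndarray) -> float:
    """d(t, B_i): distance from t to the nearest instance in the bag."""
    return float(np.min(np.linalg.norm(bag - t, axis=1)))

def decide_bag_label(t: np.ndarray, bag: np.ndarray, theta_t: float) -> int:
    """h_{theta_t}(B_i): +1 if d(t, B_i) <= theta_t, else -1."""
    return +1 if bag_distance(t, bag) <= theta_t else -1

sigma_t = 0.5
theta_t = sigma_t * np.sqrt(np.log(2))     # Bayes-optimal threshold from Theorem 1
t = np.array([0.2, 0.5])
bag = np.array([[0.9, 0.1], [0.25, 0.45]])
print(decide_bag_label(t, bag, theta_t))   # +1: the nearest instance is close to t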
3.2.2 Disambiguation Method

Unlike in the previous subsection, $t$ does not necessarily refer to a true positive instance in this subsection. However, we still define a decision function as that in (4) even when $t$ is a negative instance. The difference is that if $t$ is a negative instance, the corresponding decision function cannot label the bags well. It is this very phenomenon that forms the basis of our disambiguation method.

Definition 2: The empirical precision of the decision function in (4) is defined as follows:

$$P_t(\theta_t) = \frac{1}{n^+ + n^-} \sum_{i=1}^{n^+ + n^-} \frac{1 + h_{\theta_t}^t(B_i)\, l(B_i)}{2}, \qquad (6)$$

where $B_i \in B$ is a training bag.

The empirical precision essentially measures how well the decision function $h_{\theta_t}^t(\cdot)$ with threshold $\theta_t$ mimics $l(\cdot)$ in predicting the bag labels.

Now, the question is that we do not know the scaling factor $\sigma_t^2$ of the function in (2) for a specific instance $t$. As such, we cannot compute the threshold $\theta_t$. It will be desirable to learn $\sigma_t^2$ from the training data automatically. To do so, we find the best threshold $\theta_t^*$ that maximizes the empirical precision to estimate $\sigma_t^2$: $\theta_t^* = \arg\max_{\theta_t} P_t(\theta_t)$. Thus the best (maximum) empirical precision $P^*(t)$ for $t$ is given by:

$$P^*(t) = \max_{\theta_t} P_t(\theta_t). \qquad (7)$$

Remark 3: The basic idea of finding $\theta_t^*$ is analogous to maximum likelihood estimation (MLE), which finds the parameters that best explain the observed data. However, the criterion used here is the empirical precision as defined in (6) rather than the likelihood. Furthermore, our method is totally different from the DD method. The objective function of DD is multiplicative, making DD very sensitive to noise. However, the objective function of our method is the empirical precision in (6) with an additive form, which guarantees the robustness of our method. The robustness of our method will be further discussed later.

In essence, $P^*(t)$ reflects the ability of instance $t$ in discriminating the training bags. The larger $P^*(t)$ is, the more likely is $t$ a true positive instance.

It is worth noting that although $\theta_t$ is a continuous-valued variable, we can find $\theta_t^*$ by checking a finite set of candidate values only. This is guaranteed by Theorem 2.

Theorem 2: The best empirical precision $P^*(t)$ for $t$ is achieved when $\theta_t$ is an element in the set $\{d(t, B_i^+) \mid i = 1, \ldots, n^+\}$.

Proof: Let $\theta_t^*$ be a value satisfying $P_t(\theta_t^*) = P^*(t)$ and $\theta_t'$ be the maximum value in $\{d(t, B_i^+) \mid i = 1, \ldots, n^+\}$ that is no larger than $\theta_t^*$.

If $B_i^+$ is correctly labeled when $\theta_t = \theta_t^*$, then $d(t, B_i^+) \le \theta_t^*$, and hence $d(t, B_i^+) \le \theta_t'$. Therefore, any correctly labeled positive bag when $\theta_t = \theta_t^*$ is also correctly labeled when $\theta_t = \theta_t'$. Similarly, if $B_i^-$ is correctly labeled when $\theta_t = \theta_t^*$, then $d(t, B_i^-) > \theta_t^*$, and hence $d(t, B_i^-) > \theta_t'$. Therefore, any correctly labeled negative bag when $\theta_t = \theta_t^*$ is also correctly labeled when $\theta_t = \theta_t'$. Thus we have $P_t(\theta_t') = P_t(\theta_t^*) = P^*(t)$.
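Putting the pieces together, the empirical precision of Eq. (6) can be maximized over the finite candidate set $\{d(t, B_i^+) \mid i = 1, \ldots, n^+\}$ guaranteed by Theorem 2 to obtain $P^*(t)$ in Eq. (7), as in the following Python sketch; the bags and labels are hypothetical toy data.

# A sketch of the disambiguation score: empirical precision (Eq. (6)) maximized
# over the candidate thresholds {d(t, B_i^+)} (Theorem 2) to give P*(t) (Eq. (7)).
import numpy as np

def bag_distance(t, bag):
    """d(t, B_i): distance from t to the nearest instance in the bag."""
    return float(np.min(np.linalg.norm(np.asarray(bag) - t, axis=1)))

def empirical_precision(t, bags, labels, theta_t):
    """P_t(theta_t): fraction of training bags labeled correctly by h_{theta_t}."""
    preds = [+1 if bag_distance(t, b) <= theta_t else -1 for b in bags]
    return float(np.mean([(1 + h * l) / 2 for h, l in zip(preds, labels)]))

def best_precision(t, bags, labels):
    """P*(t): maximum empirical precision over the candidate thresholds of Theorem 2."""
    candidates = [bag_distance(t, b) for b, l in zip(bags, labels) if l == +1]
    return max(empirical_precision(t, bags, labels, th) for th in candidates)

bags = [np.array([[0.2, 0.5], [0.9, 0.9]]),   # positive bag containing an instance equal to t
        np.array([[0.8, 0.1]]),               # negative bag
        np.array([[0.21, 0.48]])]             # another positive bag, close to t
labels = [+1, -1, +1]
t = np.array([0.2, 0.5])                      # candidate true positive instance
print(best_precision(t, bags, labels))        # 1.0 here: t separates the bags perfectly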