Recommender Systems for the Conference Paper Assignment Problem

Don Conry, Virginia Tech, Blacksburg, VA, USA (dconry@cs.vt.edu)
Yehuda Koren, Yahoo! Research, Haifa, Israel (yehuda@yahoo-inc.com)
Naren Ramakrishnan, Virginia Tech, Blacksburg, VA, USA (naren@cs.vt.edu)

ABSTRACT

We present a recommender systems approach to conference paper assignment, i.e., the task of assigning paper submissions to reviewers. We address both the modeling of reviewer-paper preferences (which can be cast as a learning problem) and the optimization of reviewing assignments to satisfy global conference criteria (which can be viewed as constraint satisfaction). Due to the paucity of preference data per reviewer or per paper (relative to other recommender systems applications) we show how we can integrate multiple sources of information to learn reviewer-paper preference models. Our models are evaluated not just in terms of prediction accuracy but in terms of end-assignment quality. Using a linear programming-based assignment optimization, we show how our approach better explores the space of unsupplied assignments to maximize the overall affinities of papers assigned to reviewers. We demonstrate our results on real reviewer bidding data from the IEEE ICDM 2007 conference.

Categories and Subject Descriptors

H.4.2 [Information Systems Applications]: Types of Systems—Decision support; J.4 [Computer Applications]: Social and Behavioral Sciences

General Terms

Algorithms, Human Factors

Keywords

Recommender systems, collaborative filtering, conference paper assignment, linear programming

1. INTRODUCTION

Modern conferences are beset with excessively high numbers of paper submissions. Assigning these papers to appropriate reviewers in the program committee (which can constitute a few hundred members) is a daunting task and hence motivates the use of recommender systems.

The primary input to the conference paper assignment problem (CPAP) is a papers × reviewers matrix of 'bids', expressing interest or disinterest of reviewers to review specific papers. The goal is to construct a set of reviewing assignments taking into account reviewer capacity constraints, adequate numbers of reviews for papers, expertise modeling, conflicts of interest, and other global conference criteria.

There are three key differences between traditional recommender applications and the CPAP problem. (i) In a traditional recommender, recommendations that meet the needs of one user do not affect the satisfaction of other users. In CPAP, on the other hand, multiple users (reviewers) are bidding to review the same papers and hence there is the possibility of one user's recommendations (assignments) affecting the satisfaction levels (negatively) of other users. Hence the design of reviewer preference models must be posed and studied in an overall optimization framework.
(ii) In a conventional recommender, the goal is often to recommend new entities that are likely to be of interest, whereas in CPAP, the goal is to ensure that reviewers are predominantly assigned their (most) preferred papers. Nevertheless, preference modeling is still crucial because it gives the assignment algorithm some degree of latitude in aiming to satisfy multiple users. Finally, (iii) recommender systems are used to working with sparse data, but the amount of 'signal' available to model preferences in the CPAP domain is exceedingly small; hence we must integrate multiple sources of information to build strong preference models.

We organize our framework into two stages: 'growing' the given bids by adapting recommendation techniques to predict unknown reviewer-paper preferences, and identifying a good assignment by optimizing conference criteria. Other approaches to CPAP (e.g., [1]) are surveyed elsewhere [2].

We apply our framework on bids and auxiliary information (see Fig. 1) gathered from the 7th IEEE Intl. Conf. on Data Mining (ICDM'07), for which the third author was a program chair. Datasets of similar scope from other conferences are not publicly available (also acknowledged in [5]) and we hope our research will spur greater availability. (The Cyberchair system used by the ICDM series has expressed interest in implementing our approach, and we plan to approach Easychair and other CMSs as well.) We emphasize that all datasets were anonymized before the modeling and analysis steps conducted here.

Figure 1: Data used in this paper for building paper-reviewer preference models.

2. MODELING REVIEW PREFERENCES

We are given ratings (henceforth, interchangeable with preferences) between m reviewers and n papers.
(Recall that these ratings are really bids/signs of interest to review papers, not the actual ratings reviewers assign to papers after reading and evaluating them.) A rating r_ui indicates the preference by reviewer u for paper i, where high values mean stronger preferences. Usually the vast majority of ratings are unknown; e.g., the ICDM data involves 529 papers, 203 reviewers, and only 6267 bids. In ICDM'07, the given bids are between 1 and 4, indicating preferences as follows: 4 = "High", 3 = "OK", 2 = "Low" and 1 = "No", and we aim to make predictions in the same space.

We distinguish predicted ratings from known ones by using the notation \hat{r}_{ui} for the predicted value of r_ui. To evaluate the models we assess RMSE over 100 random 90-10 training-test splits. We hasten to add that we do not advocate the myopic view of RMSE [4] as the primary criterion for recommender systems evaluation. We use it in this section primarily due to its convenience for constructing direct optimizers. In the next section we will evaluate performance according to criteria more natural to CPAP. We also note that small improvements in overall RMSE will typically translate into substantial improvements in bottom-line performance for predicting reviewer-paper preferences.

The model we learn is of the form:

\hat{r}_{ui} = \mu + b_u + b_i + p_u^T q_i + \sum_c \sigma_{ic} \theta_{uc} w_c
             + \gamma \frac{\sum_{j \in R(u)} s_{ij} r_{uj}}{\alpha + \sum_{j \in R(u)} s_{ij}}
             + \phi \frac{\sum_{v \in R(i)} s_{uv} r_{vi}}{\beta + \sum_{v \in R(i)} s_{uv}}     (1)

and we proceed to explain each of the terms below.

2.1 Baseline model

Much of the variability in the data is explained by global effects, which can be reviewer- or paper-specific. It is important to capture this variability by a separate component, thus letting the more involved models deal only with genuine reviewer-paper interactions. We model these global effects through the first three terms of Eq. 1, i.e., \mu + b_u + b_i. The constant \mu indicates a global bias in the data, which is taken to be the overall mean rating. The parameter b_u captures reviewer-specific bias, accounting for the fact that different reviewers use different rating scales. Finally, the paper bias, b_i, accounts for the fact that certain papers tend to attract higher (or lower) bids than others. We learn optimal values for b_u (u = 1, ..., m) and b_i (i = 1, ..., n) by minimizing the associated squared error function with just these three terms (along with some regularization to avoid overfitting). The resulting average test RMSE is 0.6286.

A separate analysis of each of the two biases shows the reviewer effect (\mu + b_u, with RMSE 0.6336) to be much more significant than the paper bias (\mu + b_i, RMSE 1.2943) in reducing the error. This indicates a tendency of reviewers to concentrate all ratings near their mean ratings, which is supported by examination of the data.

While the baseline model explains much of the data variability, as evidenced by its relatively low associated RMSE, it is useless for making actual assignments. After all, it gives all reviewers exactly the same order of paper preferences. Thus, we are really after the remaining unexplained variability, where reviewer-specific preferences get expressed. Uncovering these preferences is the subject of the next subsections.
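For concreteness, the following is a minimal sketch of how the baseline component \mu + b_u + b_i can be fit by regularized stochastic gradient descent. The bid triples, learning rate, and regularization constant below are illustrative assumptions, not the actual data or hyperparameters used in this paper.

```python
import numpy as np

def fit_baseline(bids, n_reviewers, n_papers, n_epochs=50, lr=0.01, reg=0.05):
    """Fit mu + b_u + b_i to (reviewer, paper, rating) triples via regularized SGD.

    `bids` holds (u, i, r) tuples on the 1..4 bid scale; lr and reg are
    illustrative guesses rather than values reported in the paper.
    """
    mu = np.mean([r for _, _, r in bids])      # global mean rating
    b_u = np.zeros(n_reviewers)                # reviewer-specific biases
    b_i = np.zeros(n_papers)                   # paper-specific biases
    for _ in range(n_epochs):
        for u, i, r in bids:
            err = r - (mu + b_u[u] + b_i[i])   # residual on this observed bid
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
    return mu, b_u, b_i

# Toy usage: 3 reviewers, 4 papers, a handful of bids.
toy_bids = [(0, 0, 4), (0, 1, 2), (1, 1, 3), (1, 2, 1), (2, 3, 4)]
mu, b_u, b_i = fit_baseline(toy_bids, n_reviewers=3, n_papers=4)
print(mu + b_u[0] + b_i[2])  # baseline prediction for reviewer 0 on paper 2
```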
2.2 A factor model

Latent factor models (e.g., [3]) comprise a common approach to collaborative filtering, with the goal of uncovering latent features that explain the observed ratings. The premise of such models is that both reviewers and papers can be characterized as vectors in a common f-dimensional space. The interaction between reviewers and papers is modeled by inner products in that space, the fourth term of Eq. 1. Here, p_u \in R^f and q_i \in R^f are the factor vectors of reviewer u and paper i, respectively. The resulting average test RMSE decreases slowly as the dimensionality of the latent factor space increases: e.g., for f = 50 it is 0.6240, and for f = 100 it is 0.6234. Henceforth, we use f = 100.

2.3 Subject categories

While latent factor models automatically infer suitable categories, much can be learned from known categories attributed to both papers and reviewers. ICDM'07 submissions specify a number of predefined categories as primary and secondary topics for a given paper. We model the entered matching between paper i and category c by:

\sigma_{ic} = \begin{cases} 1 & c \in \mathrm{primary}(i) \\ \frac{1}{2} & c \in \mathrm{secondary}(i) \\ 0 & \text{otherwise} \end{cases}

The value assignment (1 for "primary", 0.5 for "secondary") is derived by cross validation and is quite intuitive. Similarly, we use the following for matching reviewers with their desired categories:

\theta_{uc} = \begin{cases} 1 & c \in \mathrm{interest}(u) \\ -\frac{1}{2} & c \in \mathrm{no\_interest}(u) \\ 0 & \text{otherwise} \end{cases}

Notice that in ICDM'07, reviewers could specify lack of interest in (or inability to) review papers from certain categories (this is different from conflicts of interest, discussed later).

In the fifth term of Eq. 1, the weights w_c indicate the significance of each category in linking a reviewer to a paper, and are learnt automatically by minimizing the squared error on the training set. It is plausible that, e.g., a mutual interest in some category A will strongly link a reviewer to a paper, while a mutual interest in another category B is less influential on paper choice. Table 1 depicts the results of this analysis, showing differences of orders of magnitude in the ability of different categories to correctly predict associations of reviewers to papers. Note in particular that there is no obvious monotonic relationship between the weight imputed to a category and the number of papers/reviewers associated with the category. When adding subject categories to the baseline and factor models, the resulting RMSE is 0.6197.

Table 1: Subject categories, inferred weights, number of reviewers (with expertise in that category), and number of papers (assigned to the category). For brevity, only a few categories are shown.

Category | Weight (w_c) | # reviewers | # papers: primary (secondary)
Healthcare, epidemic modeling, and clinical research | 0.395121 | 31 | 7 (7)
Security, privacy, and data integrity | 0.334821 | 23 | 12 (6)
Handling imbalanced data | 0.284398 | 24 | 6 (10)
Mining textual and unstructured data | 0.245319 | 66 | 38 (30)
Mining in networked settings: web, social and computer networks, and online communities | 0.206318 | 62 | 44 (29)
Novel data mining algorithms in traditional areas (such as classification, regression, clustering, probabilistic modeling, and association analysis) | 0.089248 | 91 | 147 (71)
Dealing with cost sensitive data and loss models | 0.03453 | 12 | 4 (4)
Algorithms for new, structured, data types, such as arising in chemistry, biology, environment, and other scientific domains | 0.006015 | 60 | 21 (25)
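The category term \sum_c \sigma_{ic} \theta_{uc} w_c is simply a weighted dot product between a paper's category vector and a reviewer's interest vector. The sketch below illustrates this with hypothetical category sets and invented weights w_c; in the actual model the weights are learnt as described above.

```python
import numpy as np

def category_vectors(primary, secondary, interest, no_interest, n_categories):
    """Build the paper-side (sigma) and reviewer-side (theta) category vectors."""
    sigma = np.zeros(n_categories)
    sigma[list(primary)] = 1.0        # primary topics of the paper
    sigma[list(secondary)] = 0.5      # secondary topics of the paper
    theta = np.zeros(n_categories)
    theta[list(interest)] = 1.0       # categories the reviewer is interested in
    theta[list(no_interest)] = -0.5   # categories the reviewer declined
    return sigma, theta

# Hypothetical example with 5 categories and made-up learned weights w_c.
w = np.array([0.39, 0.33, 0.28, 0.09, 0.006])
sigma, theta = category_vectors(primary={0}, secondary={2}, interest={0, 3},
                                no_interest={1}, n_categories=5)
category_term = float(np.sum(sigma * theta * w))  # fifth term of Eq. 1 for this pair
print(category_term)  # 0.39 here: a shared primary interest in category 0
```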
2.4 Paper-paper similarities

We inject paper-paper similarities into our models in a way reminiscent of item-item recommenders [6]. The building blocks here are similarity values s_ij, which measure the similarity of paper i and paper j. The similarities could be derived from the ratings data, but those are already covered by the latent factor model. Rather, we derive the similarity of two papers by computing the cosine of their abstracts. Usually we work with the square of the cosine, which better contrasts the higher similarities against the lower ones. In the sixth term of Eq. 1, the set R(u) contains all papers on which u bid. The constant \alpha is for regularization: it penalizes cases where the weighted average has very low support, i.e., where \sum_{j \in R(u)} s_ij is very small. In our dataset it was determined by cross validation to be 0.001. The parameter \gamma sets the overall weight of the paper-paper component. It is learnt as part of the optimization process (cross-validation could have been used as well). Its final value is close to 0.7. When this term is combined with the overall scheme, the RMSE drops further to 0.6038.

2.5 Reviewer-reviewer similarities

We craft reviewer-reviewer similarities s_uv analogously to paper-paper similarities, measured as the number of commonly co-authored papers as reported in DBLP. We point out that DBLP data might be incomplete, and co-authorship does not imply similarity of research interests. Nevertheless, our main contribution here is to show how to incorporate reviewer-reviewer similarities in Eq. 1, and more sophisticated ways to define s_uv can be readily plugged in. By integrating this factor, the RMSE is 0.6015.

2.6 Conflicts of interest (CoI)

A final source of data is conflicts of interest for certain (paper, reviewer) combinations; e.g., the reviewer might be the former advisor of the author. Many conferences define what it means to have a CoI and solicit this information explicitly during the bidding phase. We do not aim to model/predict new CoIs but show in the next section how they are incorporated to avoid making erroneous assignments.
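To illustrate the similarity building blocks of Sections 2.4 and 2.5, the sketch below computes squared-cosine similarities over a few toy abstracts. The TF-IDF representation and the example texts are our own assumptions; the paper only states that cosines over abstracts are used, and that reviewer-reviewer similarities count co-authored papers in DBLP.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy abstracts standing in for the real submissions.
abstracts = [
    "We propose a matrix factorization model for collaborative filtering.",
    "A latent factor approach to recommender systems with implicit feedback.",
    "Frequent subgraph mining in large chemical compound databases.",
]

# TF-IDF is an assumption; the paper does not specify the text representation.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(abstracts)

# s_ij: squared cosine, which sharpens the contrast between high and low similarities.
s = cosine_similarity(X) ** 2
print(s.round(3))
```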
3. OPTIMIZING PAPER ASSIGNMENT

Our predicted preference matrix can now be supplied as input to any of the assignment algorithms discussed in [2]. We chose the Taylor algorithm [7] as a representative example because it was used during ICDM'07 and thus enables a baseline comparison with an approach that does not perform any preference modeling. It can incorporate global conference constraints such as the desired number of reviewers for each paper (k_p) and a desired maximum number of papers for each reviewer (k_r). (For ICDM'07, these values are 3 and 9, respectively.) Denoting the predicted ratings matrix as R, the goal is to optimize the assignment matrix A [7]:

\arg\max_A \; \mathrm{trace}(R^T A) = \arg\max_A \sum_u \sum_j R_{uj} A_{uj},     (2)

where A_{uj} \in [0, 1] \;\forall u, j, \quad \sum_j A_{uj} \le k_p \;\forall u, \quad \sum_u A_{uj} \le k_r \;\forall j.

Here the objective criterion, trace(R^T A), captures the global affinity of all reviewers across all their assigned papers. CoIs can be modeled by hardwiring the corresponding entries of A to zero and taking them 'out of play' in Eq. 2.

This integer programming problem is reformulated into an easier-to-manage linear programming problem by a series of steps, using the node-edge adjacency matrix, where every row corresponds to a node in A and every column represents an edge [7]. This reformulation is a bit more involved, but it renders the problem solvable via methods such as Simplex or interior point programming. In particular, as Taylor shows in [7], because the reformulated constraint matrix is totally unimodular, there exists at least one globally optimal assignment with integral (and, due to the constraints, Boolean) coefficients.
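As a concrete illustration of the assignment step, the sketch below solves the LP relaxation of Eq. 2 directly with scipy.optimize.linprog rather than through the node-edge reformulation of [7]. The matrix sizes, capacities, and random score matrix are toy values chosen for illustration; by total unimodularity an integral optimum exists, and the solver typically returns such a vertex, which we round for display.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n_papers, n_reviewers = 6, 4              # toy sizes for illustration
k_p, k_r = 2, 3                           # reviewers per paper, max papers per reviewer (toy values)
R = rng.random((n_papers, n_reviewers))   # stand-in for the predicted preference matrix

c = -R.ravel()                            # linprog minimizes, so negate to maximize total affinity

# Capacity constraints: each paper gets at most k_p reviewers,
# each reviewer gets at most k_r papers.
A_ub, b_ub = [], []
for p in range(n_papers):
    row = np.zeros(n_papers * n_reviewers)
    row[p * n_reviewers:(p + 1) * n_reviewers] = 1
    A_ub.append(row)
    b_ub.append(k_p)
for r in range(n_reviewers):
    row = np.zeros(n_papers * n_reviewers)
    row[r::n_reviewers] = 1
    A_ub.append(row)
    b_ub.append(k_r)

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=(0, 1), method="highs")
A = res.x.reshape(n_papers, n_reviewers).round().astype(int)
print(A)   # Boolean paper x reviewer assignment matrix
```

In the same sketch, a conflict of interest could be enforced by fixing the bounds of the corresponding variable to (0, 0), mirroring the 'out of play' treatment described above.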
4. EXPERIMENTAL RESULTS

We have already shown the ability of our modeling to better capture reviewer-paper preferences. But do the improved models translate into better assignments? Note the key distinction between recommendations and assignments. To evaluate assignment quality, we extend the train-test methodology from above. In other words, neither the prediction algorithm nor the assignment algorithm can see the originally given preferences within the test set. We use the training set to learn model (1), predict all ratings using this model, and feed these predictions as input to (2). While the resulting assignment will be spread across the training and test sets, we specifically evaluate those made from the test set and determine whether the reviewer had rated them as 'No,' 'Low,' 'OK,' or 'High.' This methodology mimics the real-life scenario where the given reviewer ratings (corresponding to the training set) limit the possibilities of the assignment algorithm, but by revealing more ratings through our prediction phase, we aim to gain the flexibility to provide better assignments. As the proportion of the test set increases, we take away more available preferences, which simulates an increasingly harsher assignment environment.

However, before using Taylor's model (2), it is important to balance the rating scales of the various reviewers. For example, some reviewers are very enthusiastic and tend to give mostly high ratings, while others are more cautious and give low ratings. While our preference modeling captures such variance, it is unnecessary for the assignment phase, since Taylor's model would concentrate only on reviewers with high ratings, which is undesirable. Thus, we suggest two alternative per-reviewer normalization strategies:

1. Subtract the per-reviewer mean from each predicted rating to find the residual rating for each potential assignment combination. (Henceforth dubbed Resid.)

2. Calculate normalized ratings for each reviewer, so that the sum of each reviewer's predicted ratings is 1. (Henceforth dubbed Norm.)

Regardless of the chosen normalization scheme, we add the normalized predicted rating to the original preferences (if it is part of the training data) or to the mean rating value of 2.5 (for test data; recall that this lies between the 'OK' and 'Low' ratings). This forms our final input matrix R, which we feed into Taylor's optimization algorithm.

We evaluate many train-test splits, averaging 100 random trials for each split. The baseline is Taylor's original algorithm, where all missing ratings, including those in the test set, are treated as "unknowns." We compare this baseline against the two aforementioned alternatives, Resid and Norm, with an identical handling of missing ratings. Specifically, we look at the proportion of assignments from the test set that fall in the 'No,' 'Low,' 'OK,' and 'High' categories.

Table 2: Evaluating assignments: observe the dramatic improvement from the baseline (Taylor) to our methods (Norm and Resid). Columns give the proportion of test assignments falling in each rating category.

Method | Test set | 'No' | 'Low' | 'OK' | 'High'
Taylor | 30% | 59.9% | 0.1% | 30.2% | 8.5%
Taylor | 40% | 63.3% | 1.4% | 22.5% | 12.7%
Taylor | 50% | 63.6% | 2.6% | 17.4% | 16.4%
Norm | 30% | 16.5% | 0.4% | 25.3% | 54.2%
Norm | 40% | 11.9% | 5.1% | 23.8% | 59.2%
Norm | 50% | 13.2% | 5.5% | 25.2% | 56.1%
Resid | 30% | 11.1% | 3.2% | 24.6% | 61.0%
Resid | 40% | 10.9% | 3.2% | 24.5% | 61.3%
Resid | 50% | 11.2% | 4.0% | 24.1% | 60.6%

The results presented in Table 2 were fairly consistent across different test set proportions. As illustrated here, the predominant share (around 60-65%) of test assignments made using the original preference matrix (Taylor) falls in the unpreferred ("No") category, mirroring experiences during ICDM'07 organization (the assignments were manually re-wired afterward). On the other hand, when imputing the missing ratings using either Resid or Norm, the balance completely changes in favor of higher quality preferences. Resid makes about 60% of test assignments out of the highest quality ratings ("High"), and only about 12% of test assignments are bad ("No"). Norm is close, but not quite as good as Resid, a difference that should be further investigated over additional datasets. Overall, we find that the results strongly support our contention that assignment quality can be increased by providing more flexibility with additional ratings from which to choose.
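The two per-reviewer normalization schemes above reduce to simple row operations on the predicted matrix. A minimal sketch, assuming rows index reviewers and using made-up predictions:

```python
import numpy as np

# Rows are reviewers, columns are papers; values are predicted ratings.
R_hat = np.array([[3.8, 3.5, 3.9],    # an enthusiastic reviewer
                  [1.4, 2.6, 1.2]])   # a cautious reviewer

resid = R_hat - R_hat.mean(axis=1, keepdims=True)  # 'Resid': subtract each reviewer's mean
norm = R_hat / R_hat.sum(axis=1, keepdims=True)    # 'Norm': each reviewer's row sums to 1

print(resid)
print(norm)
```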
5. DISCUSSION

Why does our approach work, especially with harsh train-test splits? If we view a reviewer's preferences as a partial order over papers, we can think of our approach as 'straightening out' the partial order into a total order that is consistent with multiple sources of data. We intend to provide theoretical justification for our empirical results using this viewpoint. The second new aspect of our work is the integration of recommendation and optimization/constraint satisfaction. In the future we seek to study how recommenders can aid optimization routines by providing additional 'cues' or flexibilities in constraint satisfaction/search. Besides CPAP, this has applications to combined recommendation-optimization scenarios such as targeted marketing and advertising under resource constraints. Finally, to gain qualitative user feedback, we intend to field the recommendation/assignment capabilities presented here in a real conference management system and gain further insights into the issues involved.

6. ACKNOWLEDGEMENTS

We thank Prof. Xindong Wu, Steering Committee chair of IEEE ICDM, for permission to use bidding and auxiliary data from ICDM'07.

7. REFERENCES

[1] C. Basu, H. Hirsh, W. Cohen, and C. Nevill-Manning. Technical paper recommendation: a study in combining multiple information sources. Journal of AI Research, pages 231-252, 2001.
[2] D. Conry, Y. Koren, and N. Ramakrishnan. Recommender systems for the conference paper assignment problem. Technical report, arXiv:0906.4044v1, 2009.
[3] T. Hofmann. Latent semantic models for collaborative filtering. ACM TOIS, 22:89-115, 2004.
[4] S. McNee, J. Riedl, and J. Konstan. Being accurate is not enough: how accuracy metrics have hurt recommender systems. In CHI Extended Abstracts, pages 1097-1101, 2006.
[5] D. Mimno and A. McCallum. Expertise modeling for matching papers with reviewers. In Proc. KDD'07, pages 500-509, 2007.
[6] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In WWW'01, pages 285-295, 2001.
[7] C. J. Taylor. On the optimal assignment of conference papers to reviewers. Technical Report MS-CIS-08-30, University of Pennsylvania, 2008.