A Demo Search Engine for Products

Beibei Li (bli@stern.nyu.edu), Anindya Ghose (aghose@stern.nyu.edu), Panagiotis G. Ipeirotis (panos@stern.nyu.edu)
Department of Information, Operations, and Management Sciences
Leonard N. Stern School of Business, New York University
New York, New York 10012, USA

ABSTRACT

Most product search engines today build on models of relevance devised for information retrieval. However, the decision mechanism that underlies the process of buying a product is different from the process of locating relevant documents or objects. We propose a theoretical model for product search based on expected utility theory from economics. Specifically, we propose a ranking technique in which we rank highest the products that generate the highest surplus after the purchase. We instantiate our research by building a demo search engine for hotels that takes into account heterogeneous consumer preferences and also accounts for varying hotel prices. Moreover, we achieve this without explicitly asking for the preferences or purchasing histories of individual consumers, using aggregate demand data instead. This new ranking system is able to recommend to consumers the products with the “best value for money” in a privacy-preserving manner. The demo is accessible at http://nyuhotels.appspot.com/

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms

Algorithms, Economics, Experimentation, Measurement

Keywords

Consumer Surplus, Economics, Product Search, Ranking, Text Mining, User-Generated Content, Utility Theory

1. INTRODUCTION

It is now widely acknowledged that online search for products is increasing in popularity, as more and more users search for and purchase products on the Internet. Most search engines for products today are based on models of relevance from “classic” information retrieval theory [9] or use variants of faceted search [11] to facilitate browsing.
However, the decision mechanism that underlies the process of buying a product is different from the process of finding a relevant document or object. Customers do not simply seek something relevant to their search, but also try to identify the “best” deal that satisfies their specific criteria. Today’s product search engines provide only rudimentary ranking facilities for search results, typically using a single ranking criterion such as price, best selling, or, more recently, customer review ratings. This approach has several shortcomings. First, it ignores the multidimensional preferences of consumers. Second, it fails to leverage the information generated by online communities beyond simple numerical ratings. Third, it hardly takes into account the heterogeneity of consumers. These drawbacks call for a recommendation strategy for products that better models consumers’ underlying purchase behavior and captures their multidimensional preferences and heterogeneous tastes.

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2011, March 28–April 1, 2011, Hyderabad, India. ACM 978-1-4503-0637-9/11/03.

Recommender systems [1] could fix some of these problems but, to the best of our knowledge, existing techniques still have limitations. First, most recommendation mechanisms require consumers to log into the system. However, in reality many consumers browse only anonymously. Due to the lack of any meaningful, personalized recommendations, consumers do not feel compelled to log in before purchasing. Even when they log in, before or after a purchase, consumers are reluctant to give out their individual demographic information for many reasons (e.g., time constraints, privacy issues, or lack of incentives). Therefore, most context information is missing at the individual consumer level.
Second, for goods with a low purchase frequency for an individual consumer, such as hotels, cars, or real estate, there are few repeated purchases we could leverage towards building a predictive model (e.g., models based on collaborative filtering). Third, and potentially more importantly, as privacy issues become increasingly prominent, marketers may not be able to observe the individual-level purchase history of each consumer (or consumer segment). Instead, the only information available is at an aggregate level (e.g., market share or units sold). As a consequence, many algorithms that rely on knowing individual-level behavior cannot derive consumer preferences from such aggregate data.

Alternative techniques try to identify the “Pareto optimal” set of results [2]. Unfortunately, the feasibility of this approach diminishes as the number of product characteristics increases. With more than five or six characteristics, the probability of a point being classified as “Pareto optimal” dramatically increases. As a consequence, the set of Pareto optimal results soon includes every product.

In our work, we design a new ranking system for recommendation that leverages economic modeling. We aim at making recommendations based on a better understanding of the underlying “causality” of consumers’ purchase decisions. Our algorithm learns consumer preferences based on the largely anonymous, publicly observed distributions of consumer demographics as well as the observed aggregate-level purchases (i.e., anonymous purchases and market shares in NYC and LA), not by learning from the identified behavior or demographics of each individual. We instantiate our research by building a demo search
engine for hotels, using a unique data set containing transactions from Nov. 2008 to Jan. 2009 for US hotels from a major travel web site. Our extensive user studies, using more than 15,000 user judgments, demonstrate an overwhelming preference for the ranking generated by our techniques, compared to a large number of existing strong baselines.

The major contributions of our research are: (1) We present a causal model, based on economic theory, to capture the decision-making process of consumers, leading to a better understanding of consumer preferences. The causal model relaxes the assumption of a “consistent environment” across training and testing data sets: we can now have changes in the environment and can predict what should happen under such changes. (2) We infer personal preferences from aggregate data, in a privacy-preserving manner. (3) We propose a ranking method using the notion of surplus, which is derived from a “generative” user behavior model. (4) We present an extensive experimental study: using six hotel markets and 15,000 user evaluations obtained through blind tests, we demonstrate that the generated rankings are significantly better than existing approaches.

2. THEORY MODEL

In this section, we first introduce the background of expected utility theory, characteristics-based theory, and economic surplus. Then we discuss how we apply these concepts to our setting and empirically estimate our model.

2.1 Background

Our model is derived from expected utility and rational choice theories. A fundamental notion in utility theory is that each consumer is endowed with an associated utility function $U$, which is “a measure of the satisfaction from consumption of various goods and services.” The rationality assumption states that each person tries to maximize his or her own utility.

More formally, assume that the consumer has a choice across products $X_1, \ldots, X_n$, and each product $X_j$ has a price $p_j$. Buying a product involves the exchange of money for a product.
Therefore, to analyze the purchasing behavior we need two components for the utility function: (1) Utility of Product: the utility that the consumer gains by buying the product $X_j$, and (2) Utility of Money: the utility that the consumer loses by paying the price $p_j$ for product $X_j$.

On one hand, the decision to purchase product $X_j$ generates a product utility $U(X_j)$. According to Lancaster’s characteristics theory [6] and Rosen’s hedonic price model [10], differentiated products are described by vectors of objectively measured characteristics. Let $x_j^k$ denote the $k$th observed characteristic of product $X_j$. Thus, the utility of a product can be defined as the aggregation of the weighted utilities of its observed individual characteristics and an unobserved characteristic, $\xi_j$, as follows:

$$U(X_j) = U(x_j^1, \ldots, x_j^k) = \sum_k \beta^k \cdot x_j^k + \xi_j. \tag{1}$$

On the other hand, assume that the consumer has some disposable income $I$ that generates a money utility $U(I)$. Paying the price $p_j$ decreases the money utility to $U(I - p_j)$. We typically assume that $p_j$ is relatively small compared to the disposable income $I$, and that the marginal utility of money remains constant in the interval $I - p_j$ to $I$ [8]. In this case,

$$U(I) - U(I - p_j) = \alpha I - \alpha (I - p_j) = \alpha p_j. \tag{2}$$

With the assumption of rationality, a consumer purchases product $X_j$ if and only if it provides him or her with the highest increase in utility. Let consumer surplus denote the “increase” in utility after purchasing a product. This idea naturally generates a ranking order: the products that generate the highest consumer surplus should be ranked on top.

2.2 The BLP Model

The key for our model is to identify the different product characteristics and estimate the corresponding weights assigned by consumers to the characteristics and the price of the product. However, different consumers evaluate the product characteristics and money differently.
To capture the consumer heterogeneity, we use the Random-Coefficient Logit Model [3] (also known as BLP). This model incorporates consumer heterogeneity by assuming that consumers have idiosyncratic tastes towards product characteristics. In other words, the coefficients $\beta$ and $\alpha$ in Equations 1 and 2 are different for each consumer. Based on this, we define the utility surplus for consumer $i$ to buy product $X_j$ as

$$\begin{aligned} US_j^i &= U_h(X_j) - \left[ U_m(I^i) - U_m(I^i - p_j) \right] + \varepsilon_j^i \\ &= \underbrace{\sum_k \beta^{ik} \cdot x_j^k + \xi_j}_{\text{Utility of product}} - \underbrace{\alpha^i p_j}_{\text{Utility of money}} + \underbrace{\varepsilon_j^i}_{\text{Stochastic error}}. \end{aligned} \tag{3}$$

Here, $I^i$ is the income of consumer $i$, $p_j$ is the price of product $X_j$, $U_m$ is the utility of money (parameterized by the user-specific weight scalar $\alpha^i$), and $U_h$ is the utility of the product purchased (parameterized by the user-specific weight vector $\beta^i$). Note that $\xi_j$ is a product-specific disturbance scalar summarizing unobserved characteristics of product $X_j$, whereas $\varepsilon_j^i$ is a stochastic choice error term that is assumed to be i.i.d. across products and consumers in the selection process. The parameters to be estimated are $\alpha^i$ and $\beta^i$, which represent the weights that consumer $i$ assigns towards “money” and towards different observed product characteristics, respectively.

The technical details for the model estimation are in [7]. To better understand our model, let’s consider an example.

Example 1. Suppose that we have two cities, A and B, and two types of consumers: business trip travelers and family trip travelers. City A is a business destination (e.g., New York City) with 80% of the travelers being business travelers and 20% families. City B is mainly a family destination (e.g., Orlando) with 10% business travelers and 90% family travelers. In city A, we have two hotels: Hilton (A1) and Doubletree (A2). In city B, we again have two hotels: Hilton (B1) and Doubletree (B2).
Hilton hotels (A1 and B1) have a conference center but not a pool, and Doubletree hotels (A2 and B2) have a pool but not a conference center. To keep the example simple, we assume that preferences of consumers do not change when they travel to different cities and that prices are the same.

By observing demand, we see that demand in city A (business destination) is 820 bookings per day for Hilton and 120 bookings for Doubletree. In city B (family destination) the demand is 540 bookings per day for Hilton and 460 bookings for Doubletree. Since the hotels are identical in the two cities, the changes in demand must be the result of different traveler demographics.

More specifically, for business travelers, the utility surplus from hotel A1 (conference center, no pool) is $US^B(A1) = \delta_{A1} + (\beta^B_{conf} \cdot 1 + \beta^B_{pool} \cdot 0) + \varepsilon$, and for family travelers, the corresponding utility surplus is $US^F(A1) = \delta_{A1} + (\beta^F_{conf} \cdot 1 + \beta^F_{pool} \cdot 0) + \varepsilon$. By $\beta^B_{\bullet}$ we denote the deviations from the population mean for business travelers towards “conference center” and “pool,” and by $\beta^F_{\bullet}$ we denote the respective deviations for family travelers; $\delta_{A1}$ is the population-mean component of the utility of hotel A1. Similarly, we can write down the utilities for hotels A2, B1, and B2. Following the estimation steps, we discover that family
travelers have $\beta^F_{conf} = \beta^F_{pool} = 0.5$. In other words, they have the same preferences regarding a pool and a conference center. On the other hand, for business travelers, the preference towards “conference center” is much higher than towards “pool,” with $\beta^B_{conf} = 0.9$ and $\beta^B_{pool} = 0.1$, respectively.

This estimation result can be further interpreted in monetary terms. For instance, we can infer that a business trip traveler is willing to pay $54 for the conference center and $6 for the pool, whereas a family trip traveler is willing to pay $30 for each of the two features.

3. SURPLUS-BASED RANKING

So far, we have described the economic model used for inferring the preferences of consumers using a utility model and aggregate demand data. This model uses the concept of surplus mainly as a conceptual tool to infer consumer preferences towards different product characteristics. In our work, the concept of surplus is directly used to find the product that is the “best value for money” for a given consumer.

We define Consumer Surplus for consumer $i$ from product $j$ as the “normalized utility surplus,” that is, the surplus $\overline{US}_j^{(i)}$ divided by the mean marginal utility of money $\bar{\alpha}$:

$$CS_j = \text{Normalized } US_j = \sum_t \frac{1}{\bar{\alpha}} \, \overline{US}_j^{(i)}. \tag{4}$$

We thereby use the estimated surplus for each product and rank the products in decreasing order of surplus. Therefore, at the top we will have the products that are the “best value” for consumers, for a given price. Furthermore, we extend our ranking to include a personalization component. To compute the personalized surplus, we ask the consumer to give the appropriate demographic characteristics and purchase context (e.g., 25-34 years old, male, $100K income, business traveler) and then use the corresponding deviation matrices $\beta_T$ and $\alpha_I$. It is then easy to compute the personalized “value for money” for this consumer, and rank products accordingly. Notice that the consumer has the incentive to reveal demographics in this scenario.
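The ranking rule described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Hotel` class, the function names, and the two nightly prices are hypothetical, the unobserved term $\xi_j$ and the stochastic error are set to zero, the summation index $t$ of Equation 4 (which this section does not define) is omitted, and every traveler is assumed to share the mean marginal utility of money. The value 1/60 per dollar is only what the Example 1 figures would imply if willingness to pay equals $\beta / \alpha$ (0.9 / 54 = 0.5 / 30 = 1/60), which is itself an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Hotel:
    """Hypothetical container for one search result."""
    name: str
    price: float                                   # nightly price in dollars (made up)
    features: dict = field(default_factory=dict)   # characteristic -> 0/1 indicator
    xi: float = 0.0                                # unobserved quality term, zero here


def utility_surplus(hotel, beta, alpha):
    """Deterministic part of Eq. 3: sum_k beta_k * x_k + xi - alpha * price."""
    product_utility = sum(beta.get(k, 0.0) * x for k, x in hotel.features.items())
    return product_utility + hotel.xi - alpha * hotel.price


def consumer_surplus(hotel, beta, alpha, alpha_mean):
    """Utility surplus divided by the mean marginal utility of money (dollar units)."""
    return utility_surplus(hotel, beta, alpha) / alpha_mean


def rank_by_surplus(hotels, beta, alpha, alpha_mean):
    """Return hotels in decreasing order of consumer surplus (best value first)."""
    return sorted(
        hotels,
        key=lambda h: consumer_surplus(h, beta, alpha, alpha_mean),
        reverse=True,
    )


# Weights from Example 1; ALPHA_MEAN is inferred under the assumption stated above.
ALPHA_MEAN = 1.0 / 60.0
BETA = {
    "business": {"conference_center": 0.9, "pool": 0.1},
    "family": {"conference_center": 0.5, "pool": 0.5},
}
# Prices are invented for illustration only.
HOTELS = [
    Hotel("A1 (Hilton)", 120.0, {"conference_center": 1, "pool": 0}),
    Hotel("A2 (Doubletree)", 110.0, {"conference_center": 0, "pool": 1}),
]

if __name__ == "__main__":
    for traveler, beta in BETA.items():
        ranked = rank_by_surplus(HOTELS, beta, ALPHA_MEAN, ALPHA_MEAN)
        print(traveler, [h.name for h in ranked])
```

With these made-up prices the business traveler sees A1 first and the family traveler sees A2 first. Because only two characteristics enter the utility, the absolute surplus values are not meaningful here; only the resulting order is.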
Example 2. For better understanding, let’s reconsider the previous setting of the two hotels A1 and A2 in city A from Example 1. Suppose that two consumers are traveling to city A on the same day: C1, a 25-34-year-old business traveler with an income of $50,000-100,000, and C2, a 35-64-year-old family traveler with an income of less than $50,000. Since these two travelers belong to different demographic groups and travel for different purposes, their preferences towards “conference center” and “pool” differ. Thus, the surplus they obtain from A1 and A2 varies. For example, the business traveler gets higher utility from A1 due to the specialized conference center services, whereas the family traveler finds A2 more valuable due to the pool and the price.

4. A DEMO SEARCH ENGINE FOR HOTELS

We instantiated our product search framework with hotel search as the target application. The demo is accessible at http://nyuhotels.appspot.com/.

4.1 Data

First, to simulate the online search environment, we created one exhaustive data set using multiple data sources.

Demand data: Travelocity, a large hotel booking system, provided us with the set of all hotel booking transactions for 2117 randomly selected hotels across the United States. The transactions covered the period from November 2008 until January 2009. Based on these transactions, we were able to compute the market share of each hotel in each local market (i.e., metropolitan area), for each day.

Consumer demographics: To measure the demographics of consumers in each market, we used data from the TripAdvisor web site. Consumers who write hotel reviews on TripAdvisor also identify their travel purpose (business, romance, family, friend, other) and their age group (13-17, 18-24, 25-34, 35-49, 50-64, 65+). Based on these data, we were able to identify the demographic distribution of travelers for each destination.
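As an illustration of the two aggregation steps just described, the following Python sketch computes daily market shares from booking records and a demographic distribution from reviewer profiles. The record layouts, function names, and toy data are hypothetical. The sketch also normalizes over observed bookings only, which is a simplifying assumption rather than a description of how the shares were actually computed.

```python
from collections import Counter, defaultdict


def daily_market_shares(transactions):
    """Compute each hotel's share of bookings per (market, day).

    transactions: iterable of (market, day, hotel_id) booking records.
    Returns {(market, day): {hotel_id: share}}; shares sum to 1 per key.
    """
    counts = defaultdict(Counter)
    for market, day, hotel_id in transactions:
        counts[(market, day)][hotel_id] += 1

    shares = {}
    for key, hotel_counts in counts.items():
        total = sum(hotel_counts.values())
        shares[key] = {h: n / total for h, n in hotel_counts.items()}
    return shares


def demographic_distribution(reviewer_profiles):
    """Compute the share of each (purpose, age group) pair per destination.

    reviewer_profiles: iterable of (destination, travel_purpose, age_group).
    """
    counts = defaultdict(Counter)
    for destination, purpose, age_group in reviewer_profiles:
        counts[destination][(purpose, age_group)] += 1

    distribution = {}
    for destination, group_counts in counts.items():
        total = sum(group_counts.values())
        distribution[destination] = {
            group: n / total for group, n in group_counts.items()
        }
    return distribution


# Toy records, for illustration only.
bookings = [
    ("New York", "2008-11-01", "H1"),
    ("New York", "2008-11-01", "H1"),
    ("New York", "2008-11-01", "H2"),
    ("New York", "2008-11-02", "H2"),
]
profiles = [
    ("New York", "business", "25-34"),
    ("New York", "business", "25-34"),
    ("New York", "family", "35-49"),
    ("New York", "romance", "25-34"),
]

shares = daily_market_shares(bookings)
demographics = demographic_distribution(profiles)
```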
Hotel location characteristics: We used geo-mapping search tools (in particular the Bing Maps API) and social geotags (from geonames.org) to identify the “external amenities” (such as shops and restaurants) and the available public transportation in the area around the hotel. We also used image classification together with Mechanical Turk to determine whether there is a nearby beach, a nearby lake, or a downtown area, and whether the hotel is close to a highway. We extracted these characteristics within radii of 0.25, 0.5, 1, and 2 miles.

Hotel service characteristics: We extracted the service-based characteristics from the reviews on TripAdvisor. Each review provides a general rating of the hotel, plus seven individual ratings on the following service characteristics: Value, Room, Location, Cleanliness, Service, Check-in, and Business Service. We computed the average rating of each hotel on each of these seven characteristics and used the averages in our data set, together with the general review rating. We also used the hotel descriptions from Travelocity, Orbitz, and Expedia to identify the “internal amenities” of the hotels (e.g., pool, spa).

Stylistic characteristics of online reviews: Finally, we extracted indicators that measure not the polarity of the reviews but rather some stylistic characteristics of the available reviews. We examined two text-style features: the “subjectivity” and the “readability” of reviews [5]. Also, since prior research suggested that disclosure of identity information is associated with changes in subsequent online product sales [4], we measured the percentage of reviewers for each hotel who reveal their real name or location on their profile web pages.

4.2 An Example: Personalized Hotel Search

Using the data described above, we are able to construct our economic model and create a system that generates hotel rankings.
We estimate the mean and variance of the weights that consumers assign to each hotel characteristic. Using these estimates, we can derive the consumer surplus from each hotel for a given customer.

We developed a prototype hotel search and ranking system and deployed it on Google App Engine. It consists of three basic components: a user search interface, a summary result page with the ranked hotels, and a set of explanatory web pages with the details of each individual hotel listed in the results. First, a customer is required to select the trip destination, the type of trip (e.g., business, family, romance, friend), and his/her income level via the search interface. Given the input search criteria and the demographic information, the system computes the personalized consumer surplus for each hotel in the specified location and ranks the search results in descending order of consumer surplus (i.e., best value on top). The customer can review the list of search results and click on a hotel to get more information. In the detailed explanatory page of each
hotel, we list the breakdown of the surplus computation, showing the value of each individual hotel characteristic. Moreover, to help customers interpret the meaning of those surplus values, the system provides not only the personalized surplus tailored to each customer but also the population average surplus as a baseline for comparison. This gives customers a better idea of the relative, personalized value they get from each hotel characteristic. To better illustrate this, let’s look at an example.

Example 3. We have the same setting as in Examples 1 and 2. To find the best-value hotel, customer C1 specifies the search criteria as “Location: New York, NY; Trip type: business; Income $80,000; Age group: 25-34.” Similarly, customer C2 specifies “Location: New York, NY; Trip type: family; Income $30,000; Age group: 35-64.” Figures 1 and 2 show the top three hotels in response to the two customized searches by C1 and C2. As we can see, “Affinia Dumont,” a 4-star hotel with a price of $249, appears at the top of the ranking list for customer C1, providing a “Value for Money” of $28. On the other hand, “Tudor Hotel at the United Nations,” a 4-star hotel with a lower price of $124, is ranked first for customer C2. The ranking results are adjusted dynamically based on the demographic information of the customers (e.g., for C2, who has a lower income, the top-ranked hotels are mainly in a lower class and price range than those for C1).

Figure 1: Ranking results for C1 (Business, $80,000, 25-34)

Figure 2: Ranking results for C2 (Family, $30,000, 35-64)

Figure 3: Hotel overall score and breakdown across individual hotel characteristics

Customers can click each hotel for details on how each individual hotel characteristic contributes to the total value for money of that hotel. Figure 3 illustrates, as an example, the breakdown of the personalized scores of “Affinia Dumont” for C1, paired with the population average scores.
For instance, we found that this hotel has a personalized score (27) for “public transportation” that is higher than the overall population score (16). This result demonstrates that business travelers have a stronger preference towards “public transportation” than the overall population.

References

[1] Adomavicius, G., and Tuzhilin, A. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE TKDE 17 (2005), 734–749.
[2] Balke, W.-T., and Güntzer, U. Multi-objective query processing for database systems. In VLDB (2004), pp. 936–947.
[3] Berry, S., Levinsohn, J., and Pakes, A. Automobile prices in market equilibrium. Econometrica 63 (1995), 841–890.
[4] Forman, C., Ghose, A., and Wiesenfeld, B. Examining the relationship between reviews and sales: the role of reviewer identity disclosure in electronic markets. ISR 19, 3 (2008), 291–313.
[5] Ghose, A., and Ipeirotis, P. G. Estimating the helpfulness and economic impact of product reviews: Mining text and reviewer characteristics. IEEE TKDE (2010).
[6] Lancaster, K. Consumer Demand: A New Approach. Columbia University Press, New York, 1971.
[7] Li, B., Ghose, A., and Ipeirotis, P. G. Towards a theory model for product search. In WWW (2011).
[8] Marshall, A. Principles of Economics, Eighth ed. Macmillan and Co., London, 1926.
[9] Nie, Z., Wen, J.-R., and Ma, W.-Y. Webpage understanding: beyond page-level search. SIGMOD Record 37, 4 (2008), 48–54.
[10] Rosen, S. Hedonic prices and implicit markets: Product differentiation in pure competition. J. of Political Econ. 82, 1 (1974), 34–55.
[11] Yee, K.-P., Swearingen, K., Li, K., and Hearst, M. Faceted metadata for image search and browsing. In CHI (2003), pp. 401–408.