MOBILE VISUAL CLOTHING SEARCH

George A. Cushen and Mark S. Nixon
University of Southampton
{gc505, msn}@ecs.soton.ac.uk

ABSTRACT

We present a mobile visual clothing search system whereby a smart phone user can either choose a social networking photo or take a new photo of a person wearing clothing of interest, and search for similar clothing in a retail database. From the query image, the person is detected, clothing is segmented, and clothing features are extracted and quantized. The information is sent from the phone client to a server, where the feature vector of the query image is used to retrieve similar clothing products from online databases. The phone's GPS location is used to re-rank results by retail store location. State of the art work focusses primarily on the recognition of a diverse range of clothing offline and pays little attention to practical applications. Evaluated on a challenging dataset, the system is relatively fast and achieves promising results.

Index Terms— Clothes Search, Mobile Search, Image Retrieval

1. INTRODUCTION

Clothing was the fastest growing segment in US e-commerce last year, predicted to have grown by 20% to $40.9 billion from 2011 to 2012. It is also expected to have been the second biggest segment by overall revenue [1]. Thus, an efficient mobile application to automatically recognize clothing in photos of people and retrieve similar clothing items that are available for sale from retailers could transform the way we shop, whilst giving retailers great potential for commercial gain. Tightly connected to this is the potential for an efficient clothing retrieval system to be employed for highly targeted mobile advertising which learns what clothing a person may wish to purchase given their social networking photos.

The problem of efficient and practical mobile clothing search appears relatively unexplored in the literature. Recently, the fields of clothing segmentation, recognition and parsing have started to gain much attention. Gallagher and Chen designed a graph cuts approach to segment clothing [2] to aid person recognition. Various priors have been proposed to segment clothing by Hasan and Hogg [3] and Wang and Ai [4]. Meanwhile, Yang and Yu [5] proposed to integrate tracking and clothing segmentation to recognize clothing in surveillance videos. Although their method is fast, they capture their dataset in a controlled lab with a simple white background. The clothing retrieval problem has been less extensively studied. One scenario is presented in [6, 7]. State of the art work focusses primarily on clothes parsing and semantic classification [8, 9]. Although Yamaguchi et al. achieve good performance, they only briefly demonstrate retrieval, and their method is very computationally intensive.

Current mobile image retrieval systems include Google Goggles (google.com/mobile/goggles), Kooaba (kooaba.com), and LookTel (looktel.com). However, these systems are developed for image retrieval on general objects in a scene. When applied to clothes search, they can provide visually and categorically less relevant results than our method for retrieving products based on a dressed person, and can have significantly longer response times.
The main contributions of this paper are as follows: (1) we present a novel mobile client-server framework for automatic visual clothes searching; (2) we propose an extension of GrabCut for the purpose of clothing segmentation; (3) we propose a dominant colour descriptor for the efficient and compact representation of clothing; and (4) we evaluate our approach on query images from a fashion social network dataset against a clothing product dataset, showing promising retrieval results with a relatively fast response time. The contributions in this paper thus reside in a mobile system for automated clothes search with proven capability.

2. SYSTEM OVERVIEW

The pipeline of our mobile visual clothing search system, which retrieves similar clothing products in nearby retail stores, is shown in Figure 1.

Fig. 1: Overview of our clothing retrieval pipeline. (Client: ROI detection, clothing segmentation, feature extraction, and histogram of visual words generation; the outer yellow box marks the grid of feature ROIs. Server: inverted file index search over the product database (DP) and location-based re-ranking using GPS coordinates, returning similar clothing products from nearby retailers.)

A smart phone user can either capture a photo of a person wearing clothing of interest or choose an existing photo, such as from a social network. The person is then detected in the image, and our clothing segmentation method attempts to select only the clothing pixels for the next step, feature extraction. Note that we only consider searching upper body clothing, since the images in our social networking dataset indicate that many people take only upper body fashion photos.
The segmented upper body clothing image is divided into non-overlapping patches, and dominant colour and HoG features are extracted. These sets of descriptors are quantized using vocabulary codebooks and concatenated to generate a histogram of visual words (HoVW). The HoVW defines the final query, which is compared to a database of HoVWs for clothing products from retailers. Finally, a similarity measure is applied to determine the most similar matches, and these are re-ranked based on the GPS location of the user (obtained from the smart phone) and the locations of the retailers, stored in the database.

It is not practical to store databases of a large number of clothing products from various retailers on the client. Thus, a client-server architecture is conceived for our mobile visual clothing search.

Our system is designed to be efficient, with short response times, and to offer an interactive graphical user experience. The client communicates with the server using compressed feature information rather than a large query image. This allows for fast transmission on typical 3G mobile networks and has the additional benefit of distributing processing between client and server, so that the server may handle more simultaneous search requests. Our contributions are described in the following sections.

3. CLOTHING SEGMENTATION

Clothing segmentation is a challenging field of research which can benefit numerous applications, including human detection [10], recognition for re-identification [2], pose estimation [11], and image retrieval. Although the fast segmentation of a person's clothing in a photo appears effortless for a human to perform, it remains challenging for a machine due to the wide diversity and ever-changing nature of fashion, uncontrolled scene lighting, dynamic backgrounds, variation in human pose, and self and third-party occlusions. Additionally, difficult sub-problems such as face detection are usually involved to initialize the segmentation procedure.

The main objectives of this stage of the system are to automatically crop the image to the region of interest (the regions of the body below the head where clothes are typically located) and to eliminate both the background and skin from the image, constraining the regions from which clothing features will be extracted.

We convert the query image Iq to the more perceptually relevant YCrCb colour space and normalize the corresponding illumination channel to help alleviate, to some extent, the non-uniform effects of uncontrolled lighting.

The Viola-Jones face detector is used to estimate the face size and location, which are fed as parameters to initialise a human detector based on [12]. The detector yields an ROI for the full body pose excluding the head, ROIp, and a smaller upper body only region, ROIu. We attempt to segment the person from the background within the bounding box ROIp by using the popular GrabCut algorithm. GrabCut is based on graph cuts, which have been shown to be reasonably efficient and to perform well at segmenting humans [13].
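As an illustration of this step, the following is a minimal sketch of bounding-box-initialised GrabCut using OpenCV. It shows only the standard algorithm, not the paper's extension; the file name and rectangle values are hypothetical stand-ins for the detector output of [12].

```cpp
#include <opencv2/opencv.hpp>

int main() {
    // Hypothetical query image and person bounding box from the detector.
    cv::Mat image = cv::imread("query.jpg");
    cv::Rect roiP(80, 60, 220, 420);

    // GrabCut initialised with the rectangle: pixels outside roiP are taken
    // as background, pixels inside as probable foreground.
    cv::Mat mask, bgdModel, fgdModel;
    cv::grabCut(image, mask, roiP, bgdModel, fgdModel, 5, cv::GC_INIT_WITH_RECT);

    // Keep pixels labelled definite or probable foreground (the person).
    cv::Mat person = (mask == cv::GC_FGD) | (mask == cv::GC_PR_FGD);
    cv::Mat segmented;
    image.copyTo(segmented, person);
    cv::imwrite("person.png", segmented);
    return 0;
}
```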
We attempt to eliminate the skin from the segmented person by employing an efficient thresholding method. Chai and Ngan [14] reported that skin pixels on the face can be identified by the presence of a certain set of chrominance values in the YCrCb colour space, and utilized this for face detection purposes. Based on this work, we propose a thresholding method for the purpose of clothing segmentation that takes into account the other skin pixels on the body. This can be more challenging, as we find illumination on the face tends to be more uniform. Consider Rr and Rb as the ranges of Cr and Cb values, respectively, that correspond to the colour of skin pixels. For a random sample of our social networking dataset, we found the ranges Rr = [140, 165] and Rb = [105, 135] to be optimal. In our experiments, these ranges prove to provide a good compromise between robustness against different types of skin colour and preserving clothing pixels of similar chrominance to the skin. Thus, we have the following equation:

$$\mathrm{skin}(x, y) = \begin{cases} 1 & \text{if } C_r(x, y) \in R_r \wedge C_b(x, y) \in R_b \\ 0 & \text{otherwise} \end{cases} \quad (1)$$

where x and y index pixels in ROIp. Morphological opening is then performed on the binary mask skin(x, y) to reduce noise.
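A minimal sketch of this skin elimination step with OpenCV follows, assuming 8-bit YCrCb values so that the ranges of Eq. (1) map directly onto cv::inRange; the kernel size of the morphological opening is a hypothetical choice, as the paper does not state one.

```cpp
#include <opencv2/opencv.hpp>

// Binary skin mask per Eq. (1): 255 where Cr in [140,165] and Cb in [105,135].
cv::Mat skinMask(const cv::Mat& bgr) {
    cv::Mat ycrcb;
    cv::cvtColor(bgr, ycrcb, cv::COLOR_BGR2YCrCb);  // channel order: (Y, Cr, Cb)

    // Y (luma) is left unconstrained; only chrominance is thresholded.
    cv::Mat skin;
    cv::inRange(ycrcb, cv::Scalar(0, 140, 105), cv::Scalar(255, 165, 135), skin);

    // Morphological opening suppresses small isolated false skin detections.
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(5, 5));
    cv::morphologyEx(skin, skin, cv::MORPH_OPEN, kernel);
    return skin;
}
```

Clothing pixels are then those of the segmented person for which this mask is zero.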
Finally, the segmented full body clothing is cropped to the upper body region ROIu and normalised in size. The area of segmented clothing is compared to the area of ROIu. If the percentage of clothing pixels is less than an empirically defined threshold τa, we perform the next stage (feature extraction) on the pre-processed image rather than the segmented image. This final step can increase the overall robustness of the system in the special case where the clothing and either the skin or the background are of a very similar colour. The resulting upper body clothing image is denoted Ic.

4. CLOTHING FEATURE EXTRACTION

Colour is one of the most distinguishing visual features of clothing. We propose an efficient method to describe the dominant colours in the segmented clothing image based on the MPEG-7 descriptor [15], and integrate this feature with the HoG texture/shape descriptor.

The upper body clothing image Ic is divided into a regular grid of 5 × 5 cells (depicted in the third column of Figure 3). We denote each column of the grid as ROI_c^k, where k = 1, ..., 5, and we provide robustness to layered clothing (e.g. jacket and top) by computing the dominant colours for each column and concatenating them.
A 3D histogram is computed over Ic within each ROI_c^k in HSV colour space. For clothing, hue quantization requires the most attention. We find that quantizing the hue circle in 20° steps sufficiently separates the hues such that red, green, blue, yellow, magenta and cyan are each represented with three subdivisions. Saturation and illumination are also each quantized into three subdivisions. Hence the colour is compactly represented with a vector of size 18 × 3 × 3 = 162.

The quantized colour of each colour bin is selected as its centroid. If we let Ci represent the quantized colour for bin i, X = (X^H, X^S, X^V) represent a pixel colour, and ni be the number of pixels in bin i, we can calculate the mean of the bin's colour distribution as follows:

$$C_i = \bar{X}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} X_{i,j}, \quad 1 \le i \le 162 \quad (2)$$

Ideally, the dominant colours would be given by the bins with the greatest percentage of image pixels. However, in practice, due to factors such as uncontrolled illumination, several bins of similar quantized colours often exist per perceived clothing colour. Therefore, the mutual polar distance between adjacent bin centres is iteratively calculated and compared with a threshold, τd, and similar colour bins are merged using weighted average agglomerative clustering. Considering X1 and X2 in the adjacent bins, we let P^E represent the pixel percentage of the colour component E and apply the following equation for each colour component, substituting E for the H, S, and V components respectively:

$$X^E = X_1^E \frac{P_1^E}{P_1^E + P_2^E} + X_2^E \frac{P_2^E}{P_1^E + P_2^E} \quad (3)$$

Bins with a pixel percentage less than τp are considered insignificant colours and are merged into their closest neighbouring bin. Since each set of worn upper body clothing in our product dataset is humanly perceived to generally have fewer than 3 dominant colours per ROI_c^k, the thresholds τd and τp are empirically defined to yield approximately this number of dominant colours. For the purpose of our similarity stage, we convert the polar HSV colours to the Euclidean LAB space and then represent the dominant colours F_k^c as:

$$F_k^c = \{(C_1^L, C_1^A, C_1^B, P_1), \ldots, (C_n^L, C_n^A, C_n^B, P_n)\} \quad (4)$$

where (C_1^L, C_1^A, C_1^B) is a vector of LAB dominant colour, P_1 gives the corresponding percentage of that colour in the clothing, and 0 < n ≤ 3 is the number of dominant colours on the clothing. For our application, we generate F^c = {F_k^c} (padding each F_k^c if necessary) to yield total dimensions of 4 × 3 × 5 = 60D.
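The following is a minimal sketch of the dominant colour computation under stated assumptions: OpenCV's 8-bit HSV convention (H in [0,180), S and V in [0,256)), one grid column with its cloth mask as input, and simple selection of the most populated bins in place of the full agglomerative merging of Eqs. (2)-(3). The function and type names are illustrative, not from the paper.

```cpp
#include <opencv2/opencv.hpp>
#include <algorithm>
#include <numeric>
#include <vector>

struct DominantColour { cv::Vec3f hsvCentroid; float percentage; };

// Quantize masked pixels of one grid column into 18x3x3 HSV bins and return
// the (up to) three most populated bin centroids with their percentages.
std::vector<DominantColour> dominantColours(const cv::Mat& bgrRoi,
                                            const cv::Mat& clothMask) {
    cv::Mat hsv;
    cv::cvtColor(bgrRoi, hsv, cv::COLOR_BGR2HSV);   // H in [0,180), S,V in [0,256)

    const int H = 18, S = 3, V = 3, N = H * S * V;  // 162 bins, as in the paper
    std::vector<cv::Vec3d> sum(N, cv::Vec3d(0, 0, 0));
    std::vector<int> count(N, 0);
    int total = 0;

    for (int r = 0; r < hsv.rows; ++r)
        for (int c = 0; c < hsv.cols; ++c) {
            if (clothMask.at<uchar>(r, c) == 0) continue;  // skip skin/background
            cv::Vec3b p = hsv.at<cv::Vec3b>(r, c);
            int bin = (p[0] * H / 180) * S * V + (p[1] * S / 256) * V
                      + (p[2] * V / 256);
            sum[bin] += cv::Vec3d(p[0], p[1], p[2]);
            ++count[bin];
            ++total;
        }

    // Rank bins by population; keep at most three (cf. the tau_d/tau_p merging).
    std::vector<int> order(N);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return count[a] > count[b]; });

    std::vector<DominantColour> result;
    for (int i = 0; i < 3 && total > 0 && count[order[i]] > 0; ++i) {
        int b = order[i];
        cv::Vec3d m = sum[b] * (1.0 / count[b]);  // bin centroid, Eq. (2)
        result.push_back({cv::Vec3f(float(m[0]), float(m[1]), float(m[2])),
                          float(count[b]) / float(total)});
    }
    return result;
}
```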
Texture/shape features based on the histogram of oriented gradients (HoG) are computed in each cell of Ic after it has been globally quantized to its dominant colours. Gradient orientations are quantized in 45° steps, giving 8 direction bins. The local histograms of the cells are then concatenated to form the 8 × 25 = 200D HoG feature F^h.
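A minimal sketch of such a grid HoG follows, under stated assumptions: Sobel gradients on a grayscale version of the (quantized) clothing image, unsigned orientations over [0°, 360°) binned in 45° steps, and magnitude-weighted voting. The paper does not specify these details, so they are illustrative choices.

```cpp
#include <opencv2/opencv.hpp>
#include <algorithm>
#include <vector>

// 200-D grid HoG: 5x5 cells, 8 orientation bins (45 degree steps) per cell.
std::vector<float> gridHog(const cv::Mat& gray) {
    cv::Mat gx, gy, mag, ang;
    cv::Sobel(gray, gx, CV_32F, 1, 0);
    cv::Sobel(gray, gy, CV_32F, 0, 1);
    cv::cartToPolar(gx, gy, mag, ang, true);  // angles in degrees, [0, 360)

    const int cells = 5, bins = 8;
    std::vector<float> feat(cells * cells * bins, 0.f);
    const int ch = gray.rows / cells, cw = gray.cols / cells;

    for (int r = 0; r < cells * ch; ++r)
        for (int c = 0; c < cells * cw; ++c) {
            int cell = (r / ch) * cells + (c / cw);
            int bin = std::min(bins - 1, int(ang.at<float>(r, c) / 45.f));
            // Magnitude-weighted vote into this cell's orientation histogram.
            feat[cell * bins + bin] += mag.at<float>(r, c);
        }
    return feat;
}
```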
5. CLOTHING SIMILARITY

A Bag of Words (BoW) representation of the features is employed to increase robustness to noise, wrinkles, folding, and illumination. For a query image Iq, we perform the clothing segmentation and feature extraction steps, and then the histogram of visual words, Hq, is generated as follows. For every image feature F^j, we locate its corresponding visual word w_n^j in every dictionary D_n. These visual words are accumulated into individual histograms H_n for each dictionary, and the unified histogram is given by concatenating the individual histograms: H_q = [H_1^T H_2^T]^T.

Finally, an inverted index is employed, minimizing the L1 distance between Hq and the codeword histogram Hj of the j-th clothing product in the product dataset to obtain the search result:

$$\hat{j} = \arg\min_j d_1(H_q, H_j) \quad (5)$$

where $d_1(H_q, H_j) = \|H_q - H_j\|_1 = \sum_i |H_q(i) - H_j(i)|$. This approach to searching is chosen as it allows for fast and efficient searching of large databases.
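For clarity, here is a linear-scan sketch of the matching criterion of Eq. (5). The paper's system uses an inverted file index rather than this exhaustive loop, which is shown only to make the L1 distance and the arg-min concrete; it assumes all histograms share the query's dimensionality.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Return the index j^ of the product BoVW histogram closest to the query
// histogram hq under the L1 distance, as in Eq. (5).
std::size_t bestMatch(const std::vector<float>& hq,
                      const std::vector<std::vector<float>>& products) {
    std::size_t best = 0;
    float bestDist = std::numeric_limits<float>::max();
    for (std::size_t j = 0; j < products.size(); ++j) {
        float d = 0.f;
        for (std::size_t i = 0; i < hq.size(); ++i)
            d += std::fabs(hq[i] - products[j][i]);  // accumulate |Hq(i) - Hj(i)|
        if (d < bestDist) { bestDist = d; best = j; }
    }
    return best;
}
```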
For training, the dominant colour and HoG features are extracted (as per our method for testing) from each image in the product database. A dictionary is built for each feature using Approximate K-Means. The codebook size is empirically set to 200 for F^c and 100 for F^h. Each clothing product image in dataset DP is then mapped to the codebooks in order to obtain its BoVW histogram.

6. EXPERIMENTAL RESULTS

6.1. Implementation

The server stage is implemented in C++ and deployed on a machine with a 2.93 GHz CPU and 8 GB of RAM. A graphical user application is designed for the client side, which is implemented in Java and C++ and is deployed on Android smart phones; specifically, we consider the popular Samsung SIII Mini (1 GHz dual-core ARM Cortex-A9) for demonstration and timing analysis. For demonstration, we design features such as photo querying, viewing top search results, product information (by linking to the retailer's website), and displaying similar products from nearby retailers on a map (refer to Figure 2 for screenshots). Also, products are set to arbitrary locations, whereas for evaluation, we set all products to one retail location so that the more important visual relevance is evaluated.

Fig. 2: Application screenshots: (a) home, (b) search, (c) product map.

Several clothing datasets exist, but none of them are suitable to evaluate our clothing retrieval task. Datasets mentioned in the current literature either do not solely contain frontal poses [8], do not feature a large range of clothing and people, do not feature adults [2], are low resolution [5], or are private. We collect two datasets: a query dataset (DQ) and a product dataset (DP). We primarily consider women's clothing since it generally exhibits a greater range of colours, textures and shapes than men's, and can also be more complex for retrieval due to clothing occlusions by long hair. Dataset DQ consists of a subset of 1000 images from the Fashionista dataset [8] featuring frontal poses suitable for our Viola-Jones face detector. This dataset contains real world images from a fashion based social network (chictopia.com) and is perhaps one of the most challenging for clothing segmentation and retrieval. Since we are concerned with clothing product search, we consider real-world e-commerce images from esprit.co.uk for Dataset DP. For this dataset, we collected 1500 images of models in frontal poses wearing women's tops, along with their associated product URLs (so visual retrieval results can link to further details).

6.2. Computational Time

Our system takes on average approximately 6.7 seconds for client processing.
Although we do not fully investigate transmission timing, our system can achieve a total response time of 9 seconds to retrieve results from the server across a 3G data network with excellent smart phone reception. Table 1 lists the computational times of the various stages of the system performed on the client and server. For reliability, the average timings consider a random sample of 10 images, with each image in the sample being processed 10 times.

Table 1: Computational Time

  Client                  Time (ms)
  Person Detection        138
  Clothing Segmentation   6040
  Feature Extraction      411
  Feature Quantization    35

  Server                  Time (ms)
  Search and re-ranking   19

These results show that the clothing segmentation is our biggest bottleneck. Our approach is slower than the real-time work of [5]; however, their approach targets a different application, is not implemented in a mobile framework, and their dataset is captured on a white background. Our approach is much faster than the work of [8], which works offline on our parent dataset (Fashionista) and requires 2-3 GB of memory.

6.3. Accuracy

We select a random sample of 30 images from Dataset DQ with variation in skin colour and manually segment ground truth to quantitatively analyse our clothing segmentation. Accuracy is reported using the best F-score criterion: F = 2RP/(P + R), where P and R are the precision and recall of pixels in the cloth segment relative to our manually segmented ground truth.
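As a concrete reference, a pixel-level F-score between a predicted clothing mask and a manually segmented ground-truth mask can be computed as in the following sketch (assuming OpenCV binary masks; the helper name is ours, not the paper's).

```cpp
#include <opencv2/opencv.hpp>

// Best F-score criterion F = 2RP/(P+R) over binary masks (non-zero = cloth).
double fscore(const cv::Mat& predicted, const cv::Mat& groundTruth) {
    cv::Mat overlap;
    cv::bitwise_and(predicted, groundTruth, overlap);
    double tp = cv::countNonZero(overlap);      // true positive pixels
    double np = cv::countNonZero(predicted);    // TP + FP
    double ng = cv::countNonZero(groundTruth);  // TP + FN
    if (np == 0.0 || ng == 0.0) return 0.0;
    double precision = tp / np, recall = tp / ng;
    return 2.0 * recall * precision / (precision + recall);
}
```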
We achieve an average F-score over this random sample of 0.857. Since the F-score reaches its best value at 1 and its worst at 0, our approach shows reasonable accuracy. This also compares favourably with the baseline (GrabCut only), which results in an F-score of 0.740, and with the skin elimination routine of Chai in place of our own, which achieves 0.808. Additionally, by visual inspection of Figure 3, we can see that our approach can segment the clothing of persons in various difficult uncontrolled scenes.

Our retrieval results are reported qualitatively in Figure 3. We can see that when the clothing is segmented accurately, the system appears promising, with relevant clothing results of a similar colour and shape retrieved. The clothing segmentation stage is important since, if it is inaccurate, errors are propagated forward to the rest of the system. Segmentation inaccuracies appear to be generally caused by inherent issues, such as when scenes contain a garment of a very similar colour to the background or skin, when poor illumination is present, or when excessive long hair covers the clothing. However, when clothing segmentation fails and our algorithm decides to instead use the unsegmented image to establish features, as in Figure 3q, we see that the results can still be reasonably relevant, although they may not be the most accurate.

7. CONCLUSIONS

In this paper, we present a novel mobile client-server framework for automatic visual clothes searching. Our system employs a Bag of Words (BoW) model and proposes an extension of GrabCut for clothing segmentation, along with a colour descriptor optimized for clothing. We demonstrate a novel application combining a photo captured on a smart phone (or taken from social networking) with GPS data to locate clothing of interest at nearby retailers. For future work, we aim to perform a more comprehensive evaluation and to integrate more features to train clothing classifiers and re-rank dominant colour results by predicted clothing labels.

8. REFERENCES

[1] eMarketer, "Apparel Drives US Retail Ecommerce Sales Growth," http://www.emarketer.com/newsroom/index.php/apparel-drives-retail-ecommerce-sales-growth, 2012.

[2] A. C. Gallagher and T. Chen, "Clothing cosegmentation for recognizing people," in CVPR. IEEE, 2008, pp. 1-8.

[3] B. Hasan and D. Hogg, "Segmentation using Deformable Spatial Priors with Application to Clothing," in BMVC, 2010, pp. 1-11.

[4] N. Wang and H. Ai, "Who Blocks Who: Simultaneous Clothing Segmentation for Grouping Images," in ICCV, Nov. 2011.

[5] M. Yang and K. Yu, "Real-time clothing recognition in surveillance videos," in IEEE ICIP, 2011, pp. 2937-2940.

[6] X. Wang and T. Zhang, "Clothes search in consumer photos via color matching and attribute learning," in MM. ACM, 2011, pp. 1353-1356.

[7] X. Chao, M. J. Huiskes, T. Gritti, and C. Ciuhu, "A framework for robust feature selection for real-time fashion style recommendation," in Workshop on IMCE. ACM, 2009.

[8] K. Yamaguchi, M. H. Kiapour, L. E. Ortiz, and T. L. Berg, "Parsing clothing in fashion photographs," in CVPR. IEEE, 2012.

[9] H. Chen, A. Gallagher, and B. Girod, "Describing Clothing by Semantic Attributes," in ECCV. Springer, 2012.

[10] J. Sivic, C. L. Zitnick, and R. Szeliski, "Finding people in repeated shots of the same scene," in BMVC, 2006, vol. 3, pp. 909-918.

[11] M. W. Lee and I. Cohen, "A model-based approach for estimating human 3D poses in static images," IEEE TPAMI, pp. 905-916, 2006.
[12] G. A. Cushen and M. S. Nixon, "Real-Time Semantic Clothing Segmentation," in ISVC. Springer, 2012, pp. 272-281.

[13] C. Rother, V. Kolmogorov, and A. Blake, "GrabCut: interactive foreground extraction using iterated graph cuts," ACM Trans. Graph., vol. 23, no. 3, pp. 309-314, Aug. 2004.

[14] D. Chai and K. N. Ngan, "Face segmentation using skin-color map in videophone applications," IEEE Trans. on CSVT, vol. 9, no. 4, pp. 551-564, 1999.

[15] T. Sikora, "The MPEG-7 visual standard for content description - an overview," IEEE Trans. on CSVT, vol. 11, no. 6, pp. 696-702, 2001.
Fig. 3: Qualitative results (panels (a)-(t)). Columns depict: (1) query image (Iq), (2) ROIp (magenta box) and ROIu (yellow box), (3) segmented clothing (Ic) overlaid with the feature extraction grid, and (4) top retrieval candidate (ĵ).