
The Weakening and Delayed Effects of Long Tail Distributions in Big Data Accesses

Big Data and Power Law
[Figure: number of hits to each data object (vertical axis) versus popularity rank of each data object (horizontal axis).]
To the right (the yellow region) is the long tail of the lower 80% of objects; to the left are the few that dominate (the top 20% of objects). With limited space to store objects and limited ability to search a large volume of objects, most attention and hits have to go to the top 20% of objects, ignoring the long tail.

The Change of Time (short search latency) and Space (unlimited storage capacity) for Big Data Creates Different Data Access Distributions
[Figure: the traditional long tail distribution compared with the flattened distribution after the long tail can be easily accessed.]
• The head is lowered and the tail drops off more and more slowly
• If the flattened distribution is no longer a power law, what is it?

Distribution Changes in DVDs in Netflix, 2000 to 2011
[Figure: percentage axis from 0% to 80% plotted against title rank; legend: 2000, 2005, 2011, predicted. Annotations: less demand for the top 500; more demand for the "middle"; longer tail (15% of demand comes from beyond rank 3,000, where brick-and-mortar retailers run out of inventory).]
• The growth of Netflix selections (today: 30 million US users, 40 million users total, 1/3 of the streaming traffic of the Internet)
– 2000: 4,500 DVDs
– 2005: 18,000 DVDs
– 2011: over 100,000 DVDs (the long tail would drop off even more slowly with more demand)
– Note: "brick-and-mortar retailers" are face-to-face shops

Amazon Case: Growth of Sales from the Changes of Time/Space
[Figure: sales in billions of dollars ($0.0 to $7.0) by year, 2002 to 2010, for Amazon North America Media Sales, Barnes & Noble Chain Store Sales, and Borders Chain Store Sales. Source: www.fonerbooks.com/booksale.htm. BN sales are shown without BN.com to contrast online vs. offline. Borders and BN sales: fiscal year ends Q1 2011, shown as 2010.]

We Must Find the New Distribution for Big Data Accesses
• The Internet stores all kinds of huge big data sets
– The rapid growth and wide distribution of Internet media content is a representative case study of big data
– The media content is carried by scalable distributed systems
• We hope the distribution model developed will
– Be general purpose for other applications of big data
– Reflect the scalable nature of both data and systems

Zipf distribution is believed to be the general model of data access patterns
• Zipf distribution (power law)
– Characterizes the property of scale invariance
– Heavy tailed, scale free
• 80-20 rule
– Income distribution: 80% of social wealth is owned by 20% of people (Pareto law)
– Web traffic: 80% of Web requests access 20% of pages (Breslau, INFOCOM'99)
• System implications
– Objectively caching the working set in a proxy
– Significantly reducing network traffic
• Model: $y_i \propto i^{-\alpha}$
– $i$: rank of objects
– $y_i$: number of references
– $\alpha$: 0.6~0.8
[Figure: $y$ versus $i$ in linear scale, showing the heavy tail; $\log y$ versus $\log i$ in log-log scale, a straight line with slope $-\alpha$.]
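The Zipf-like relation on this slide (number of references proportional to rank raised to the power -α, with α between 0.6 and 0.8) can be checked numerically. The following is a minimal Python sketch, not taken from the slides: the catalog size of 1,000 objects and the function names `zipf_weights`, `head_share`, and `loglog_slope` are assumptions made for illustration, and the reference counts are idealized (exactly Zipf-like, with no noise).

```python
import math


def zipf_weights(n_objects, alpha):
    """Normalized reference shares y_i proportional to i**(-alpha) for ranks 1..n."""
    raw = [i ** -alpha for i in range(1, n_objects + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def head_share(weights, head_fraction=0.2):
    """Fraction of all references that go to the top `head_fraction` of ranks."""
    head = max(1, int(len(weights) * head_fraction))
    return sum(weights[:head])


def loglog_slope(weights):
    """Least-squares slope of log(y_i) against log(i)."""
    xs = [math.log(i) for i in range(1, len(weights) + 1)]
    ys = [math.log(w) for w in weights]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


if __name__ == "__main__":
    # Hypothetical catalog of 1,000 objects; alpha values from the slide's range.
    for alpha in (0.6, 0.8):
        w = zipf_weights(1000, alpha)
        print(f"alpha={alpha}: top-20% share={head_share(w):.3f}, "
              f"log-log slope={loglog_slope(w):.3f}")
```

On idealized data the fitted slope recovers -α, which is why a straight line on a log-log plot is the usual visual test for Zipf-like behavior. The share captured by the top 20% depends on both α and the catalog size, so the 80-20 figure is best read as a rule of thumb rather than an exact consequence of the model.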

Does Internet media traffic follow Zipf's law?
• Web media systems
– Chesire, USITS'01: Zipf-like
– Cherkasova, NOSSDAV'02: non-Zipf
• VoD media systems
– Acharya, MMCN'00: non-Zipf
– Yu, EUROSYS'06: Zipf-like
• Live streaming and IPTV systems
– Veloso, IMW'02: Zipf-like
– Sripanidkulchai, IMC'04: non-Zipf
• P2P media systems
– Gummadi, SOSP'03: non-Zipf
– Iamnitchi, INFOCOM'04: Zipf-like

Inconsistent media access pattern models
• Still based on the Zipf model (heuristic assumptions)
– Zipf with exponential cutoff
– Zipf-Mandelbrot distribution
– Generalized Zipf-like distribution
– Two-mode Zipf distribution
– Fetch-at-most-once effect
– Parabolic fractal distribution
– ...
• All case studies
– Are based on one or two workloads
– Differ from or even conflict with each other
• An insightful understanding is essential to
– Content delivery system design
– Internet resource provisioning
– Performance optimization

Research Objectives
• Find a general distribution model of Internet media access patterns as a case for big data
– Comprehensive measurements and experiments
– Rigorous mathematical analysis and modeling
– Insights into media system designs