... of the tag size $\hat{n}_i$ is above/below the threshold $t$, the specified category can be identified as qualified/unqualified, respectively, because both the false positive and the false negative probability are less than $\beta$; otherwise, the specified category remains undetermined. According to the weighted statistical averaging method, as the number of repeated tests increases, the averaged standard deviation $\sigma_i$ for each category decreases, so the confidence interval for each category shrinks. Therefore, after a certain number of query cycles, all categories can be determined as qualified/unqualified for the population constraint.

Fig. 1. Histogram with confidence interval annotated (x-axis: categories C1 to C12; y-axis: tag size for each category; legend: qualified category, unqualified category, undetermined category; horizontal line: threshold $t$).

Note that when the estimated value $\hat{n}_i \gg t$ or $\hat{n}_i \ll t$, the variance required by the population constraint is much larger than that specified by the accuracy constraint. In this situation, these categories can be quickly identified as qualified/unqualified and wiped out immediately from the ensemble sampling for verifying the population constraint. The remaining undetermined categories then take part in the ensemble sampling with a much smaller overall tag size, so the population constraint is verified faster.

Sometimes the tag sizes of the various categories follow a skewed distribution with a "long tail". The long tail consists of categories that each account for only a rather small percentage of the total, but together account for a substantial proportion of the overall tag size. In an iceberg query, the categories in the long tail are conventionally unqualified for the population constraint.
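To make the notion of a long tail concrete, the following minimal sketch computes how many categories fall below a threshold and what share of all tags they hold together. The function name `long_tail_summary`, the example tag sizes, and the choice to treat "below the threshold $t$" as the definition of the tail are assumptions made for illustration; they are not taken from the paper.

```python
def long_tail_summary(tag_sizes, t):
    """Summarize the categories whose tag size is below the threshold t.

    tag_sizes: dict mapping a category name to its (true) tag size.
    t: the iceberg-query threshold (hypothetical value chosen by the caller).
    """
    total = sum(tag_sizes.values())
    tail = {c: n for c, n in tag_sizes.items() if n < t}
    tail_tags = sum(tail.values())
    return {
        # Number of categories in the tail.
        "tail_categories": len(tail),
        # Fraction of all tags that belong to tail categories.
        "tail_share": tail_tags / total,
        # Fraction of all tags held by the largest single tail category.
        "largest_tail_share": max(tail.values()) / total if tail else 0.0,
    }


# Hypothetical population: two major categories and 90 small ones.
sizes = {"C1": 1000, "C2": 800}
sizes.update({f"T{k}": 20 for k in range(90)})

summary = long_tail_summary(sizes, t=100)
print(summary)
```

In this made-up population each small category holds well under 1% of the tags, yet the 90 small categories together hold half of them, which is the situation the text describes.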
However, because of their small tag sizes, most of them may not get the opportunity to occupy even one singleton slot when contending with the major categories during the ensemble sampling. They remain undetermined instead of being wiped out immediately, which makes scanning the other categories inefficient. We rely on the following theorem to quickly wipe out the categories in the long tail.

Theorem 4: For any two categories $C_i$ and $C_j$ such that $n_{s,i} < n_{s,j}$ holds in each query cycle of ensemble sampling, if $C_j$ is determined to be unqualified for the population constraint, then $C_i$ is also unqualified.

Due to lack of space, we omit the proof of Theorem 4. The detailed proof is given in [17].

According to Theorem 4, after a number of query cycles of ensemble sampling, if a category $C_j$ is determined to be unqualified for the population constraint, then any category $C_i$ that has not appeared even once in the singleton slots, so that $n_{s,j} > n_{s,i} = 0$, can be wiped out immediately as an unqualified category.

B. Algorithm for the Iceberg Query Problem

Algorithm 2 Algorithm for Iceberg Query
 1: Initialize $R$ to all categories, set $Q$, $U$, $V$ to $\emptyset$. Set $l = 1$.
 2: while $R \neq \emptyset$ do
 3:   If $l = 1$, set the initial frame size $f$.
 4:   Issue a query cycle over the tags, add those relatively major categories into the set $S$. Set $S' = S$.
 5:   while $S \neq \emptyset$ do
 6:     Compute the frame size $f_i$ for each category $C_i \in S$ such that the standard deviation $\sigma_i = \frac{|t - \hat{n}_i|}{\Phi^{-1}(1-\beta)}$. If $f_i > \hat{n}_i \cdot e$, then remove $C_i$ from $S$ to $V$. If $f_i > f_{max}$, set $f_i = f_{max}$. Obtain the frame size $f$ as the mid-value among the series of $f_i$.
 7:     Select all tags in $S$, issue a query cycle with the frame size $f$, compute the estimated tag size $\hat{n}_i$ and the averaged standard deviation $\sigma_i$ for each category $C_i \in S$. Detect the qualified category set $Q$ and unqualified category set $U$. Set $S = S - Q - U$.
 8:     if $U \neq \emptyset$ then
 9:       Wipe out all categories unexplored in the singleton slots from $S$.
10:     end if
11:   end while
12:   $\hat{n} = \hat{n} - \sum_{C_i \in S'} \hat{n}_i$. $R = R - S'$, $l = l + 1$.
13: end while
14: Further verify the categories in $V$ and $Q$ for the accuracy constraint.

We propose the algorithm for the iceberg query problem in Algorithm 2. Steps 1-4 are quite similar to steps 1-4 in Algorithm 1; due to lack of space, we omit the detailed statements for these steps. Assume that the current set of categories is $R$. During the query cycles of ensemble sampling, the reader continuously updates the statistical value of $\hat{n}_i$ as well as the standard deviation $\sigma_i$ for each category $C_i \in R$. After each query cycle, the categories in $R$ can be further divided into the following categories according to the population constraint:
∙ Qualified categories $Q$: the categories whose tag sizes are determined to be over the specified threshold $t$. If $\hat{n}_i \geq t$ and $\sigma_i \leq \frac{\hat{n}_i - t}{\Phi^{-1}(1-\beta)}$, then category $C_i$ is identified as qualified for the population constraint.
∙ Unqualified categories $U$: the categories whose tag sizes are determined to be below the specified threshold $t$. If $\hat{n}_i < t$ and $\sigma_i \leq \frac{t - \hat{n}_i}{\Phi^{-1}(1-\beta)}$, then category $C_i$ is identified as unqualified for the population constraint.
∙ Undetermined categories $R$: the remaining categories, which still have to be verified.
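The three-way division above can be sketched in a few lines of Python. This is only an illustration of the two inequalities as stated, assuming $\beta < 0.5$ so that $\Phi^{-1}(1-\beta)$ is positive; the function name `classify` and the numeric inputs are hypothetical, and the standard normal quantile is taken from the standard library's `statistics.NormalDist`.

```python
from statistics import NormalDist


def classify(n_hat, sigma, t, beta):
    """Classify one category for the population constraint.

    n_hat: estimated tag size of the category
    sigma: averaged standard deviation of that estimate
    t:     population threshold
    beta:  tolerated error probability (assumed to be below 0.5)
    """
    # z = Phi^{-1}(1 - beta), the standard normal quantile.
    z = NormalDist().inv_cdf(1.0 - beta)
    if n_hat >= t and sigma <= (n_hat - t) / z:
        return "qualified"
    if n_hat < t and sigma <= (t - n_hat) / z:
        return "unqualified"
    # The estimate is still too noisy relative to its distance from t.
    return "undetermined"


# Hypothetical estimates with t = 100 and beta = 0.05.
print(classify(150, 20, 100, 0.05))
print(classify(60, 20, 100, 0.05))
print(classify(110, 20, 100, 0.05))
```

With these made-up numbers the first two estimates lie far enough from the threshold relative to their standard deviation to be decided, while the third is too close and stays undetermined until further query cycles reduce $\sigma_i$.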
Therefore, after each query cycle of ensemble sampling, the unqualified categories and the qualified categories can be wiped out immediately from the ensemble sampling. When at least one category is determined to be unqualified, all of the categories in the current group that have not been explored in the singleton slots are wiped out immediately. The query cycles are then issued continuously over the undetermined categories in $R$ until $R = \emptyset$.
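The per-cycle wipe-out just described can be sketched as follows. This is a simplified illustration rather than the paper's implementation: the function name `prune_after_cycle`, the dictionary-based inputs, and the use of a cumulative singleton-slot count of zero to stand for "not explored in the singleton slots" are all assumptions.

```python
def prune_after_cycle(active, labels, singleton_counts):
    """Apply the wipe-out rule after one query cycle.

    active:           set of categories still taking part in the sampling
    labels:           dict category -> "qualified" / "unqualified" /
                      "undetermined" for this cycle (missing = undetermined)
    singleton_counts: dict category -> number of singleton slots in which
                      the category has been observed so far
    Returns (remaining, qualified, unqualified) as three sets.
    """
    qualified = {c for c in active if labels.get(c) == "qualified"}
    unqualified = {c for c in active if labels.get(c) == "unqualified"}
    remaining = active - qualified - unqualified

    # Following Theorem 4: once some category is known to be unqualified,
    # every category never seen in a singleton slot is unqualified as well.
    if unqualified:
        unseen = {c for c in remaining if singleton_counts.get(c, 0) == 0}
        unqualified |= unseen
        remaining -= unseen

    return remaining, qualified, unqualified


# Hypothetical cycle: D has never occupied a singleton slot.
remaining, qualified, unqualified = prune_after_cycle(
    active={"A", "B", "C", "D"},
    labels={"A": "qualified", "B": "unqualified"},
    singleton_counts={"A": 12, "B": 3, "C": 5, "D": 0},
)
print(sorted(remaining), sorted(qualified), sorted(unqualified))
```

Returning fresh sets instead of mutating `active` keeps the sketch easy to test; a reader implementation would more likely update its category sets in place between query cycles.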