Web-log Mining: from Pages to relations Qiang Yang HKUST Hong Kong, China 2021/1/26
2021/1/26 1 Web-log Mining: from Pages to Relations Qiang Yang HKUST, Hong Kong, China
A Spectrum of Web-Log Miners Knowledge Knowledge rich Have logical relations over page contents Database generated pages Can make accurate predictions even without data Knowledge middle Some hierarchical representations of ontology and content Most cases Interesting! Can predict based on similarity Knowledge poor Have only page-level logs No relational knowledge Can predict observed pages only when data is plenty 2021/1/26
2021/1/26 2 A Spectrum of Web-Log Miners ◼ Knowledge Rich: ◼ Have logical relations over page contents ◼ Database generated pages ◼ Can make accurate predictions even without data! ◼ Knowledge Middle: ◼ Some hierarchical representations of ontology and content ◼ Most cases! ◼ Can predict based on similarity ◼ Knowledge Poor: ◼ Have only page-level logs ◼ No relational knowledge ◼ Can predict observed pages only when data is plenty Interesting! Knowledge
1. Knowledge-poor --Web-log mining Association rule based predictive model ABC D Left Hand Side Right Hand A B CF Side ABE AB C B. C B. CD.G B CD G C D 2 Current observations Window of prediction A B C Size=m 1 Extract rules 2 Select rules 2021/1/26
2021/1/26 3 1. Knowledge-poor -- Web-log mining Current Observations Window of Prediction A B C Size=m A,B,C,D A,B,C,F A,B,E B,C B,C,D,G C, D Left Hand Side Right Hand Side A,B C B,C,D G Association Rule based Predictive Model ? 1 2 1 Extract rules 2 Select rules
Rule-Representation Methods (min sup=2) Subset 2 {A,C}→>C PI Substring A C C BC→C B A C Latest Substring A C C C→C Subsequence Latest Subsequence 2021/1/26
2021/1/26 5 Rule-Representation Methods (min sup=2) ◼ Subset {A, C}→C ◼ Substring “BC”→C ◼ Latest Substring “C”→C ◼ Subsequence ◼ Latest Subsequence W1 W2 A1 A2 A3 P1 P2 A B C A C B C A C D C A C D C
Rule- selection criteria Among the rules whose LHs matches W1 Longest-Match Selection Select a rule whose left hand side is the longest to apply Corresponds to using the strongest signature to predict Most Confident Select the rule with highest confidence to apply Pessimistic selection UCEN is the upper bound on the estimated error for a given confidence value, assuming a normal distribution of error UCFE, N cO N 2021/1/26
2021/1/26 6 Rule-Selection Criteria ◼ Among the rules whose LHS matches W1, ◼ Longest-Match Selection ◼ Select a rule whose left hand side is the longest to apply ◼ Corresponds to using the strongest signature to predict ◼ Most Confident ◼ Select the rule with highest confidence to apply ◼ Pessimistic Selection ◼ UCF (E,N) is the upper bound on the estimated error for a given confidence value, assuming a normal distribution of error N U E N conf CF p ( , ) = 1−
Comparison matrix Rule Longest Match Most Confident Pessimistic Rep/Methods Selection Selection Selection Subset rules Subsequence rules ?? subsequence rules Substring rules atest-substring Best Com pared with c4. 5 as well Comparison Criteria: Precision Model size 2021/1/26 7
2021/1/26 7 • Compared with C4.5 as well. • Comparison Criteria: Precision & Model Size Rule Rep/Methods Longest Match Selection Most Confident Selection Pessimistic Selection Subset rules ? ? ? Subsequence rules ? ? ? Latestsubsequence rules ? ? ? Substring rules ? ? ? Latest-substring rules ? ? Best! Comparison Matrix
Integrating with Caching Cache replacement algorithm a key value k(p) is assigned to each object p ■[ Arlltt et a.USENⅨ1998,Cao& irani,97] K(P)=L+ F(P*C(p)/S(p) C(p: Cost of loading a page(e.g. amount of time S(p): Size of a page F(p): Frequency Count of a page L: An Inflation factor to reflect cache aging 2021/1/26
2021/1/26 8 Integrating with Caching ◼ Cache replacement algorithm ◼ A key value K(p) is assigned to each object p ◼ [Arlltt et al. USENIX 1998, Cao & Irani, 97] ◼ K(p) = L + F(p) * C(p) / S(p) ◼ C(p): Cost of loading a page (e.g., amount of time) ◼ S(p): Size of a page ◼ F(p): Frequency Count of a page ◼ L: An Inflation factor to reflect cache aging
Predicting future frequency O1:0.70 Session 1 o2:0.90 O2:0.30 Predicted Frequency W1=0.70+0.60+0.70=2.00 O1:0.60 W2=0.90+0.70+0.90=2.50 ession O2:0.70 W2=0.30+0.20 0.50 o3:0.20 W4=0.11+0.30 0.41 Os:0.42 W=0.42+0.33 =0.7 O1:0.70 O2:0.90 Session 3 O4:0.30 O:0.33 K=L+(W+Fi*C/s Wi: Future frequency Fi Past frequen 2021/1/26
2021/1/26 9 Predicting future frequency Session 1 O1: 0.70 O2: 0.90 O3: 0.30 O4: 0.11 Session 2 O1: 0.60 O2: 0.70 O3: 0.20 O5: 0.42 Session 3 O1: 0.70 O2: 0.90 O4: 0.30 O5: 0.33 W1 = 0.70+0.60+0.70 = 2.00 W2 = 0.90+0.70+0.90 = 2.50 W3 = 0.30+0.20 = 0.50 W4 = 0.11+0.30 = 0.41 W5 = 0.42+0.33 = 0.75 Predicted Frequency Ki = L + ( Wi+Fi ) * Ci / Si Wi : Future frequency Fi : Past frequency
Byte-hit Rate measures bandwidth reduction Byte Hit Rate=/ Bytes answered by cache I I Total bytes Byte Hit Rate vs Cache Size on NASA web log NGRAM GDSF GDSize LRU 0 0.002 0.004 0006 0008 0.01 Cache Size(%) 2021/1/26
2021/1/26 10 Byte-hit Rate measures bandwidth reduction Byte Hit Rate vs Cache Size on NASA web log 0 1 0 2 0 3 0 4 0 5 0 0 0.002 0.004 0.006 0.008 0.01 Cache Size (%) Byte Hit Rate % NGRAM GDSF GDSize LFUDA LRU Byte Hit Rate = | Bytes answered by cache | | Total bytes |
Prefetch VS No-prefetch Relative Network traffic NASA Fractional Latency (NASA) 85 NASA Relative Network Traffic 75 Fractionallatency (NASA) 65 55 E+UmL 45 000m004000 35 002004000001 00002000400060008001 -o-Ngram +-Ngramt+Prefetch Cache size(%g) +导的门 2021/1/26
2021/1/26 11 Relative Network Traffic (NASA) NASA Relative Network Traffic 0 10 20 30 40 50 60 70 80 90 100 0 0.002 0.004 0.006 0.008 0.01 Cache size(%) Relative Network Traffic (%) Ngram Ngram+Prefetch NASA Fractional Latency 25 35 45 55 65 75 85 0 0.002 0.004 0.006 0.008 0.01 Cache size(%) Fractional latency (%) Ngram Ngram+Prefetch Fractional latency (NASA) Prefetch vs. No-prefetch