569: Sampling Graphs (2013) - PDF 免费下载

569: Graphs (23) University of Windsor November 8, 23

Outline 2 3 4

Why sampling Applications DeepWeb OSN SouceCode SemanticWeb Properties Distribution Ranking Community Diameter CluterteringCoefficient We need to estimate the properties for two reasons: Data in its entirety is not available (e.g., Facebook), or without central control (e.g., WWW), or evolving. Data is big. Quadratic algorithms are not feasible.

Deep web graph model d2 d9 d3 d4 d7 d d6 d8 d5 q q3 q5 q6 q7 q4 q2 d A: bipartite graph d6 d8 q7 q4 d5 q2 d7 d4 q6 d2 d3 d9 q q3 q5 B: same graph as (A) in spring model layout Figure: Hidden data source as a bipartite graph

What to sample for data size; distributions; clustering coefficient; communities; influential bloggers (degree, pagerank, Katz centralities, etc.) Outliers (spammers, zombies, inflated followers)

How to sample 2 2 6 3 3 5 4 3 2 3 5 4 3 Original graph Random Node 6 3 5 3 2 6 3 5 3 3 Random Edge Random Walk Sample by random node; Sample by random edge; Sample by random walk; Combinations and modifications. e.g., random walk with uniform random restart.

What is different from traditional sampling Most of the networks are scale-free. s have very large variance. random sampling does not work. Precise sampling: quantities are digitalized, making the sampling process precise. e.g., know the exact degree, and can choose uniformly at random from the neighbouring nodes. Only possible for ONLINE social networks not real social networks. Access interface: provide interface APIs, many options. Can design new sampling schemes using APIs.

Applications of web, search engines of Online Social Network (Twitter, ) Number of bugs in programs...

Capture-Recapture Method The estimator often used When all the elements have equal probability of being sampled, N = n n 2 d where n and n 2 are the number of samples in two capture occasions, d is the duplicates. () Lawrence and C. Giles. Searching the world wide web. Science, 28(536):98-, 998. A. Broder and et al. Estimating corpus size via queries. In CIKM, pages 594-63. ACM, 26. L. Katzir, E. Liberty, and O. Somekh. Estimating sizes of social networks via biased sampling. In WWW, pages 597-66. ACM, 2. Petersson et al., Capture recapture in software inspections after years research -theory, evaluation and application, Journal of Systems and Software, 24. Lots of research on obtaining uniform random sampling, using methods such as Metropolis-Hasting Random Walk Bar-Yossef et al. Random sampling from a search engine s index, JACM 28.

Multiple Capture-Recapture Equal sampling probability N = n2 2C where n is total number of sampled elements, C is the number of collisions Unequal sampling probability (2) N = n2 2C Γ (3) where Γ is the normalized variance of the degrees of the graph How large is Γ?

Γ in various datasets Graph N( 3 ) γ or Γ Φ( 5 ) WikiTalk [?] 2,388 26.32 2,7 BerkStan [?] 654 4.5 5.3 EmailEu [?] 224 3.66 3 Stanford [?] 255.5 5.8 Skitter [?],694.46 56 Youtube [?],34 9.64 44 NotreDame [?] 325 6.4 9.4 Gowalla [?] 96 5.54,2 Epinion [?] 75 4.2 6 Google [?] 855 4. 62 Slashdot [?] 82 3.35,9 Facebook [?] 2,937 3.4 59 Flickr [?] 5 2.64 68 IMDB [?] 374 2.5 3 DBLP [?] 5.6 56 Amazon [?] 4.27 98 Gnutella [?] 62.2 9, CitePatents [?] 3,764.2, Table: Statistics of the 8 real-world graphs, sorted in descending order of the coefficient of degree variation γ. Φ is the conductance. For Twitter data, Γ 3.

Outline 2 3 4

Correction Theorem The relative bias of N can be approximated by the reciprocal of E(C), i.e., RB E(C), Dingding Li, Correction in Small Sample from Big Data, TKDE, 23....9.8 /E(C) RB (4).7 mean est N.6.4.3.2. 5 55 6 65 7 75 8 85 9 95 sample size n Figure: RB and /E(C) against sample sizes in simulation study. It shows that N is biased upwards, and the relative bias can be approximated by the reciprocal of E(C).

Our bias corrected estimators.2 x 6..8 N = N = n 2 2(C + ) (5) n 2 2(C + ) Γ (6) hat N hat N* mean est N.6.4.2.98 5 55 6 65 7 75 8 85 9 95 sample size n Figure: N and N S over 4 runs for various sample size in simulation study. Red dotted line is the true value.

Twitter data N=4million..6 /E(C) RB 4.8 x 7 hat N hat N*.4 4.7.2..8 mean est N 4.6 4.5 4.4.6 4.3.4 4.2.2 5 5 2 25 3 35 4 Expected bias 4. 5 5 2 25 3 35 4 sample size n Corrected

Outline 2 3 4

Random not Recommended Lemma (Variance of N N ) The estimated variance of RN estimator N N is Lemma (Variance of N E ) var( N N ) N2 E(C) 2N3 n 2 (7) The estimated variance of RE estimator N E is ( ) var( N E ) 2N3 n 2 + nγcv 2 (Γ), (8) Γ 2N where CV (Γ) is the coefficient of variation of Γ. Theorem (RN vs. RE) To achieve the same variance of N E, N N needs to use at most Γ times more samples.

Estimated vs. observed RSE.2.5.. RSE.5 ()WikiTalk (7)NotreDame.2.5..5. (2)BerkStan (8)Gowalla.2.5..5. (3)EmailEu (9)Epinions.2.5..5. (4)Stanford ()Google.2.5..5. (5)Skitter ()Slashdot.2.5..5. (6)Youtube (2)Facebook. RSE.5 (3)Flickr.5. (4)IMDB.5. (5)DBLP.5. (6)Amazon.5. (7)Gnutella.5. (8)Citation 5 2 Sample 5 2 Sample 5 2 Sample 5 2 Sample 5 2 Sample 5 2 Sample

RN and RE on Facebook Data RSE..9.8.7.6.4.3.2. Observed RN Estimated RN Observed RE Estimated RE 5 2 Sample ( 2N)

Comparison of RN, RE, and RW Figure: Comparison of three sampling methods. The sample size n = 2NC where C =. It shows that for RN sampling (red solid bars), the relative standard error is equal to / C =. across all the datasets. RE sampling is consistently smaller than RN sampling.rw sampling can approximate RE sampling for some datasets. For NotreDame etc. that have low conductance, RW is grossly wrong.

Why RW Varies ~+ ~99 2~9 Figure: Subgraphs obtained by RW sampling from Flickr, EmailEu, Stanford and Youtube. Each subgraph contains 6, nodes. Node colour represents its degree in the original graph. Green=; Blue=2 9; Orange= 99; Red=.

for average degree degree is an important metrics in any network In and out average degrees in are different. Naive method arithmetic sample mean Problem Variance is too large because of the power law Solution Use PPS (RE) sampling and harmonic mean estimator On Twitter, PPS sampling can be hundreds of times better

Estimation Theorem d RN = n n d xi (9) i= [ n ] d RE = d RW = n () d xi i= Suppose the degrees follow Zipf s law with exponent one, i.e., d i = of the random node estimator is ( [ var( d RN ) d 2 N (α + ) ln 2 N + α ] ) n + α ( var( d RE ) d 2 n 2 ln N + α + α A. The variance α+i. () ). (2) Main conclusion RE is better than RN in many cases; RW depends on graph conductance.

distribution Frequency Frequency 7 7 5 6 5 6 ()Twitter 6 (2)WikiTalk (3)BerkStan (4)EmailEu (5)Stanford (6)Skitter 6 4 5 4 5 5 5 4 4 3 4 3 4 3 3 3 3 2 2 2 2 2 2 2 3 4 5 6 7 2 3 4 5 6 2 3 4 5 2 3 4 2 3 4 5 2 3 4 5 6 6 (7)Youtube 5 5 4 4 3 3 2 2 2 3 4 5 2 3 4 5 (8)NotreDame 5 5 6 5 (9)Gowalla ()Epinion ()Google (2)Slashdot 5 4 4 4 4 3 3 3 3 2 2 2 2 2 3 4 5 2 3 4 2 3 4 2 3 4 Frequency 7 6 5 4 3 2 (3)Facebook- 2 3 4 degree 5 4 3 2 (4)Flickr 2 3 4 degree 4 3 2 (5)Facebook-2 2 3 4 degree 5 4 3 2 (6)Amazon 2 3 4 degree 6 5 4 3 2 (7)Citation 2 3 degree 6 5 4 3 2 (8)RoadNet 2 degree Plots are sorted in decreasing order of coefficient of variation γ. Most of them follow power-law, yet they are very different

Comparison of Three methods

Comparison of RN and RE Ratio of RN and RE 25 2 5 5 4 3 2 2 4 6 8 5 5 2 25 3 35 4 γ

Comparison of RN, RE, and RW s RRMSE RRMSE 8 6 4 2.5 ()Twitter (7)Youtube 4 3 2.5.5 (2)WikiTalk (8)NotreDame 2.5 2.5.5.8.6.4.2 (3)BerkStan (9)Gowalla 2.5.5.6.4.2 (4)EmailEu ()Epinion.5.5.6.4.2 (5)Stanford ()Google.5.5.6.4.2 (6)Skitter (2)Slashdot (3)Facebook (4)Flickr (5)Facebook 2 (6)Amazon (7)Citation (8)RoadNet RRMSE.5.5.5.6.4.2.3.2..6.4.2.5. 5 Sample 5 Sample 5 Sample 5 Sample 5 Sample 5 Sample Figure: RRMSEs of RN, RE, and RW samplings as a function of sample size for 8 graphs. The dotted, dashed, and solid lines are for RN(... ), RE( ), and RW( ) samplings respectively. It shows that in most cases the sample size does not change the relative positions of the sampling methods. The exceptions are the web graphs 3 and 5 where RW

Sample distribution 4 (2)WikiTalk 3 (3)BerkStan 4 (4)EmailEu 3 (5)Stanford 3 (6)Skitter Frequency 3 2 3 2 2 2 2 2 5 5 2 4 5 5 (7)Youtube (8)NotreDame (9)Gowalla ()Epinion ()Google (2)Slashdot 2 2 2 2 2 5 5 5 2 4 2 4 2 4 3 3 2 4 3 4 (3)Facebook (5)Facebook 2 (6)Amazon (7)Citation (8)RoadNet (4)Flickr Frequency 2 2 2 2 2 2 4 2 4 2 4 2 4 2 4 Figure: The degree distributions of the samples obtained from RE (Random Edge) samplings. n=8,. The log-log plots in the first two rows exhibit a V" shape, where the sampled small nodes resemble the distribution of the original graph, while the sampled large nodes have a tail pointing upwards. These plots in the first two rows indicate that both small and large nodes are well represented in the sample. The plots in the last row indicate that the sample distribution is similar to the original distribution, therefore the RRMSE of RE sampling is similar to that of RN sampling.

Graph conductance and RW sampling Ratio of RW and RE 8 6 4 2 8 6 4 Facebbok 2 Youtube Amazon Flickr NotreDame Stanford BerkStan Pearson Corr:.68652 2.5 2 2.5 3 3.5 4 4.5 log(φ) Figure: Standard error ratio between RW and RE vs. graph conductance Φ for 8 datasets. Sample size is 4.

Network Structure Flickr NotreDame Stanford Web Amazon Facebook-2 Youtube ~+ ~99 2~9 Figure: Random walks on six networks. Flickr, NotreDame and Stanford have loosely connected components while Amazon, Facebook and Youtube are well enmeshed. Each random walk contains 6 4 nodes except NotreDame which has 5 4 nodes. Node

Conductance Flickr NotreDame Stanford Conductance - -2-3 -4 2 3 4 5 Number of nodes in the cluster Conductance - -2-3 -4-5 2 3 4 5 6 Number of nodes in the cluster Conductance Number of nodes in the clust - -2-3 -4-5 2 3 4 5 Amazon Facebook-2 Youtube Conductance - -2-3 Conductance - Conductance - -2-4 2 3 4 5 6 Number of nodes in the cluster -2 2 3 4 5 Number of nodes in the cluster Number of nodes in the clust -3 2 3 4 5 Figure: Conductance Φ(S) over S, the size of the the components, for six networks. Plots are drawn using SNAP API described in [?].

Very important We can access only partial information What is the global picture? Distribution Most influential Overall topology (e.g. clustering coefficient) Message diffusion, Critical nodes Communities...

Star Select nodes uniformly at random (e.g., nodes and 6); Take all the neighbours as sample (nodes 2,3,4,5,5,7,9); It approximate PPS (probability proportional to size) sampling; More efficient than random walk by taking all the neighbours instead a random one; We sampled around one million stars; 4 5 6 7 9 8 2 3 2 4 5 6 7 9 8 2 3 2

s and Messages Figure: Estimated out-degree, in-degree, and message distributions of. in-degree and out-degree as 32. (CI 3.9, 32.29) and 54.39 (CI 49.2, 59.76), respectively.

Estimation of followers f i d i d i Difference Ratio 856 23,335,29 6,859,5 6,476,85.38 2 75243 5,945,36 4,92,69,24,237.6 3 747 5,247,64 4,62,354,85,25.7 4 3794 3,394,62 7,58,539 5,876,8.78 5 6962 3,278,6 2,287,38 99,78.8 6 6338 3,53,77 2,554,298 598,879.4 7 59969 2,99,4,892,58,97,883.9 8 57 2,64,27,323,22,28,5. 9 5946 2,97,22,78,52 36,6.2 54264 2,3,37,76,827,242,3. Table: Estimation for the top accounts. f i : capture frequency of account i; d i : claimed in-degree or number of followers; d i : estimated number of followers; Ratio = (d i d i )/ d i.

Estimated Followers Ratio 5 5 5 5 Accounts A Smoothed ratio.5.5.5 5 Accounts B Ratio 3 2 C 2 4 Accounts (large ratio) Followers 2.5 x 7 2.5.5 D Estimated Claimed 2 4 6 Accounts(top 5) Followers 7 6 5 Estimated Claimed 4 2 4 Accounts(all) E Difference 8 x 5 6 4 2 F 2 5 Accounts(all)

Estimated vs. Claimed Followers Estimated followers 2.5 x 7 2.5.5 Account when X=Y.5.5 2 2.5 Claimed followers x 7 Figure: Estimated followers vs. claimed followers in log-log scale. The Pearson correlation coefficient is.9797.

Whether is it correct... Relative standard deviation of the estimator is RSD(ˆd i ) = / f i. (3) 9 EmailEu True Value Est. Error Bound 3 x 4 youtube True Value Est. Error Bound 8 2.5 s 7 6 5 s 2.5 4 3 2.5 2 3 4 5 6 7 8 9 2 3 4 5 Top Users 2 3 4 5 6 7 8 9 2 3 4 5 Top Users Figure: accuracy on existing data