569: Sampling Graphs (2013)

Similar documents
Microsoft PowerPoint _代工實例-1

(baking powder) 1 ( ) ( ) 1 10g g (two level design, D-optimal) 32 1/2 fraction Two Level Fractional Factorial Design D-Optimal D

Improved Preimage Attacks on AES-like Hash Functions: Applications to Whirlpool and Grøstl

UDC Empirical Researches on Pricing of Corporate Bonds with Macro Factors 厦门大学博硕士论文摘要库

Microsoft Word - A doc

Microsoft PowerPoint - Aqua-Sim.pptx

Stochastic Processes (XI) Hanjun Zhang School of Mathematics and Computational Science, Xiangtan University 508 YiFu Lou talk 06/

Microsoft PowerPoint - Performance Analysis of Video Streaming over LTE using.pptx

Microsoft PowerPoint SSBSE .ppt [Modo de Compatibilidade]


08陈会广

Preface This guide is intended to standardize the use of the WeChat brand and ensure the brand's integrity and consistency. The guide applies to all d

12-2 プレート境界深部すべりに係る諸現象の全体像

THE APPLICATION OF ISOTOPE RATIO ANALYSIS BY INDUCTIVELY COUPLED PLASMA MASS SPECTROMETER A Dissertation Presented By Chaoyong YANG Supervisor: Prof.D

清 华 大 学

國立中山大學學位論文典藏.PDF

报 告 1: 郑 斌 教 授, 美 国 俄 克 拉 荷 马 大 学 医 学 图 像 特 征 分 析 与 癌 症 风 险 评 估 方 法 摘 要 : 准 确 的 评 估 癌 症 近 期 发 病 风 险 和 预 后 或 者 治 疗 效 果 是 发 展 和 建 立 精 准 医 学 的 一 个 重 要 前

untitled


<4D F736F F D205F FB942A5CEA668B443C5E9BB73A740B5D8A4E5B8C9A552B1D0A7F75FA6BFB1A4ACFC2E646F63>

[9] R Ã : (1) x 0 R A(x 0 ) = 1; (2) α [0 1] Ã α = {x A(x) α} = [A α A α ]. A(x) Ã. R R. Ã 1 m x m α x m α > 0; α A(x) = 1 x m m x m +

PowerPoint Presentation

Microsoft PowerPoint - NCBA_Cattlemens_College_Darrh_B

國立中山大學學位論文典藏

~ 10 2 P Y i t = my i t W Y i t 1000 PY i t Y t i W Y i t t i m Y i t t i 15 ~ 49 1 Y Y Y 15 ~ j j t j t = j P i t i = 15 P n i t n Y

2008年1月11日に岩手県釜石沖で発生した地震(M4.7)について

Microsoft Word - ChineseSATII .doc

ENGG1410-F Tutorial 6

The Development of Color Constancy and Calibration System

天 主 教 輔 仁 大 學 社 會 學 系 學 士 論 文 小 別 勝 新 婚? 久 別 要 離 婚? 影 響 遠 距 家 庭 婚 姻 感 情 因 素 之 探 討 Separate marital relations are getting better or getting worse? -Exp

2015年4月11日雅思阅读预测机经(新东方版)

論 文 摘 要 本 文 乃 係 兩 岸 稅 務 爭 訟 制 度 之 研 究, 蓋 稅 務 爭 訟 在 行 訴 訟 中 一 直 占 有 相 當 高 的 比 例, 惟 其 勝 訴 率 一 直 偏 低, 民 87 年 10 月 28 日 行 訴 訟 法 經 幅 修 正 後, 審 級 部 分 由 一 級 一

國家圖書館典藏電子全文

A VALIDATION STUDY OF THE ACHIEVEMENT TEST OF TEACHING CHINESE AS THE SECOND LANGUAGE by Chen Wei A Thesis Submitted to the Graduate School and Colleg

Abstract After over ten years development, Chinese securities market has experienced from nothing to something, from small to large and the course of

Construction of Chinese pediatric standard database A Dissertation Submitted for the Master s Degree Candidate:linan Adviser:Prof. Han Xinmin Nanjing

TA-research-stats.key

IPCC CO (IPCC2006) 1 : = ( 1) 1 (kj/kg) (kgc/gj) (tc/t)

%

Corporate Social Responsibility CSR CSR CSR 1 2 ~ CSR 6 CSR 7 CSR 8 CSR 9 10 ~ CSR 14 CSR CSR 2013 A A 23.

Microsoft PowerPoint - STU_EC_Ch08.ppt


第六篇

南華大學數位論文

地質調査研究報告/Bulletin of the Geological Survey of Japan

豐佳燕.PDF


% GIS / / Fig. 1 Characteristics of flood disaster variation in suburbs of Shang

國立中山大學學位論文典藏.PDF

國家圖書館典藏電子全文

168 健 等 木醋对几种小浆果扦插繁殖的影响 第1期 the view of the comprehensive rooting quality, spraying wood vinegar can change rooting situation, and the optimal concent

096STUT DOC

<4D F736F F D20A46AA4AFACECA7DEA46ABEC7B1D0AE76ACE3A873AD70B565A6A8AA47B3F8A769A4AFACE >

untitled

Microsoft PowerPoint - ATF2015.ppt [相容模式]

摘 要 近 年 來 Facebook 使 用 者 人 數 逐 漸 增 加, 各 項 資 料 都 顯 示 Facebook 已 成 了 世 界 上 最 熱 門 的 社 群 網 路 平 台 因 為 眾 多 的 使 用 者 在 Facebook 上 分 享 他 們 的 個 人 資 訊, 關 於 資 訊 隱

~ ~


受訪者編號:

論文集29-1_前6P.indd

Microsoft Word - 01李惠玲ok.doc


Microsoft Word - TIP006SCH Uni-edit Writing Tip - Presentperfecttenseandpasttenseinyourintroduction readytopublish

Microsoft PowerPoint - talk8.ppt

<4D F736F F D C4EAC0EDB9A4C0E04142BCB6D4C4B6C1C5D0B6CFC0FDCCE2BEABD1A15F325F2E646F63>

Microsoft Word - 第四組心得.doc


中山大學學位論文典藏

<4D F736F F D203037AB6EA578C657A467A661A4BDAFABB9B3B455AB61B379A7CEACE3A8732E646F63>

spss.doc

Public Projects A Thesis Submitted to Department of Construction Engineering National Kaohsiung First University of Science and Technology In Partial

國立中山大學學位論文典藏

small fire indd

<4D F736F F D203338B4C12D42A448A4E5C3C0B34EC3FE2DAB65ABE1>

WTO

Microsoft PowerPoint - CH 04 Techniques of Circuit Analysis

Microsoft Word - 論文封面 修.doc

Logitech Wireless Combo MK45 English

(Microsoft Word - 11-\261i\256m\253i.doc)

iml88-0v C / 8W T Tube EVM - pplication Notes. IC Description The iml88 is a Three Terminal Current Controller (TTCC) for regulating the current flowi

K301Q-D VRT中英文说明书141009

Introduction to Hamilton-Jacobi Equations and Periodic Homogenization

穨6街舞對抗中正紀念堂_林伯勳張金鶚_.PDF

中 國 學 研 究 期 刊 泰 國 農 業 大 學 บ นทอนเช นก น และส งผลก บการด ดแปลงจากวรรณกรรมมาเป นบทภาพยนตร และบทละคร โทรท ศน ด วยเช นก น จากการเคารพวรรณกรรมต นฉบ บเป นหล

1 SIGMA Lab Sydney IntelliGent MultimediA Laboratory Electrical and Information Engineering department EIE Ph.D. M.Phil. 1.QS (Associate Profess

McGraw-Hill School Education Group Physics : Principles and Problems G S 24

2005硕士论文模版

東吳大學

Microsoft Word - ED-774.docx

SVM OA 1 SVM MLP Tab 1 1 Drug feature data quantization table

输电线路智能监测系统通信技术应用研究

Microsoft PowerPoint - STU_EC_Ch02.ppt

(1) (2) (IVI) (2001) (IVI) 50% ~8% 1~30cm (IVI) Study on the Plant Succession of Slopeland Landslide Areas Following H


簡報技巧

States and capital package

θ 1 = φ n -n 2 2 n AR n φ i = 0 1 = a t - θ θ m a t-m 3 3 m MA m 1. 2 ρ k = R k /R 0 5 Akaike ρ k 1 AIC = n ln δ 2

1 引言

天 主 教 輔 仁 大 學 社 會 學 系 學 士 論 文 百 善 孝 為 先? 奉 養 父 母 與 接 受 子 女 奉 養 之 態 度 及 影 響 因 素 : 跨 時 趨 勢 分 析 Changes in attitude toward adult children's responsibilit

Japan He Bin Professor, School of Humanities and Social Sciences Tokyo Metropolitan University Abstract In daily life, the food on the table in the fa

Transcription:

569: Graphs (23) University of Windsor November 8, 23

Outline 2 3 4

Why sampling Applications DeepWeb OSN SouceCode SemanticWeb Properties Distribution Ranking Community Diameter CluterteringCoefficient We need to estimate the properties for two reasons: Data in its entirety is not available (e.g., Facebook), or without central control (e.g., WWW), or evolving. Data is big. Quadratic algorithms are not feasible.

Deep web graph model d2 d9 d3 d4 d7 d d6 d8 d5 q q3 q5 q6 q7 q4 q2 d A: bipartite graph d6 d8 q7 q4 d5 q2 d7 d4 q6 d2 d3 d9 q q3 q5 B: same graph as (A) in spring model layout Figure: Hidden data source as a bipartite graph

What to sample for data size; distributions; clustering coefficient; communities; influential bloggers (degree, pagerank, Katz centralities, etc.) Outliers (spammers, zombies, inflated followers)

How to sample 2 2 6 3 3 5 4 3 2 3 5 4 3 Original graph Random Node 6 3 5 3 2 6 3 5 3 3 Random Edge Random Walk Sample by random node; Sample by random edge; Sample by random walk; Combinations and modifications. e.g., random walk with uniform random restart.

What is different from traditional sampling Most of the networks are scale-free. s have very large variance. random sampling does not work. Precise sampling: quantities are digitalized, making the sampling process precise. e.g., know the exact degree, and can choose uniformly at random from the neighbouring nodes. Only possible for ONLINE social networks not real social networks. Access interface: provide interface APIs, many options. Can design new sampling schemes using APIs.

Applications of web, search engines of Online Social Network (Twitter, ) Number of bugs in programs...

Capture-Recapture Method The estimator often used When all the elements have equal probability of being sampled, N = n n 2 d where n and n 2 are the number of samples in two capture occasions, d is the duplicates. () Lawrence and C. Giles. Searching the world wide web. Science, 28(536):98-, 998. A. Broder and et al. Estimating corpus size via queries. In CIKM, pages 594-63. ACM, 26. L. Katzir, E. Liberty, and O. Somekh. Estimating sizes of social networks via biased sampling. In WWW, pages 597-66. ACM, 2. Petersson et al., Capture recapture in software inspections after years research -theory, evaluation and application, Journal of Systems and Software, 24. Lots of research on obtaining uniform random sampling, using methods such as Metropolis-Hasting Random Walk Bar-Yossef et al. Random sampling from a search engine s index, JACM 28.

Multiple Capture-Recapture Equal sampling probability N = n2 2C where n is total number of sampled elements, C is the number of collisions Unequal sampling probability (2) N = n2 2C Γ (3) where Γ is the normalized variance of the degrees of the graph How large is Γ?

Γ in various datasets Graph N( 3 ) γ or Γ Φ( 5 ) WikiTalk [?] 2,388 26.32 2,7 BerkStan [?] 654 4.5 5.3 EmailEu [?] 224 3.66 3 Stanford [?] 255.5 5.8 Skitter [?],694.46 56 Youtube [?],34 9.64 44 NotreDame [?] 325 6.4 9.4 Gowalla [?] 96 5.54,2 Epinion [?] 75 4.2 6 Google [?] 855 4. 62 Slashdot [?] 82 3.35,9 Facebook [?] 2,937 3.4 59 Flickr [?] 5 2.64 68 IMDB [?] 374 2.5 3 DBLP [?] 5.6 56 Amazon [?] 4.27 98 Gnutella [?] 62.2 9, CitePatents [?] 3,764.2, Table: Statistics of the 8 real-world graphs, sorted in descending order of the coefficient of degree variation γ. Φ is the conductance. For Twitter data, Γ 3.

Outline 2 3 4

Correction Theorem The relative bias of N can be approximated by the reciprocal of E(C), i.e., RB E(C), Dingding Li, Correction in Small Sample from Big Data, TKDE, 23....9.8 /E(C) RB (4).7 mean est N.6.4.3.2. 5 55 6 65 7 75 8 85 9 95 sample size n Figure: RB and /E(C) against sample sizes in simulation study. It shows that N is biased upwards, and the relative bias can be approximated by the reciprocal of E(C).

Our bias corrected estimators.2 x 6..8 N = N = n 2 2(C + ) (5) n 2 2(C + ) Γ (6) hat N hat N* mean est N.6.4.2.98 5 55 6 65 7 75 8 85 9 95 sample size n Figure: N and N S over 4 runs for various sample size in simulation study. Red dotted line is the true value.

Twitter data N=4million..6 /E(C) RB 4.8 x 7 hat N hat N*.4 4.7.2..8 mean est N 4.6 4.5 4.4.6 4.3.4 4.2.2 5 5 2 25 3 35 4 Expected bias 4. 5 5 2 25 3 35 4 sample size n Corrected

Outline 2 3 4

Random not Recommended Lemma (Variance of N N ) The estimated variance of RN estimator N N is Lemma (Variance of N E ) var( N N ) N2 E(C) 2N3 n 2 (7) The estimated variance of RE estimator N E is ( ) var( N E ) 2N3 n 2 + nγcv 2 (Γ), (8) Γ 2N where CV (Γ) is the coefficient of variation of Γ. Theorem (RN vs. RE) To achieve the same variance of N E, N N needs to use at most Γ times more samples.

Estimated vs. observed RSE.2.5.. RSE.5 ()WikiTalk (7)NotreDame.2.5..5. (2)BerkStan (8)Gowalla.2.5..5. (3)EmailEu (9)Epinions.2.5..5. (4)Stanford ()Google.2.5..5. (5)Skitter ()Slashdot.2.5..5. (6)Youtube (2)Facebook. RSE.5 (3)Flickr.5. (4)IMDB.5. (5)DBLP.5. (6)Amazon.5. (7)Gnutella.5. (8)Citation 5 2 Sample 5 2 Sample 5 2 Sample 5 2 Sample 5 2 Sample 5 2 Sample

RN and RE on Facebook Data RSE..9.8.7.6.4.3.2. Observed RN Estimated RN Observed RE Estimated RE 5 2 Sample ( 2N)

Comparison of RN, RE, and RW Figure: Comparison of three sampling methods. The sample size n = 2NC where C =. It shows that for RN sampling (red solid bars), the relative standard error is equal to / C =. across all the datasets. RE sampling is consistently smaller than RN sampling.rw sampling can approximate RE sampling for some datasets. For NotreDame etc. that have low conductance, RW is grossly wrong.

Why RW Varies ~+ ~99 2~9 Figure: Subgraphs obtained by RW sampling from Flickr, EmailEu, Stanford and Youtube. Each subgraph contains 6, nodes. Node colour represents its degree in the original graph. Green=; Blue=2 9; Orange= 99; Red=.

for average degree degree is an important metrics in any network In and out average degrees in are different. Naive method arithmetic sample mean Problem Variance is too large because of the power law Solution Use PPS (RE) sampling and harmonic mean estimator On Twitter, PPS sampling can be hundreds of times better

Estimation Theorem d RN = n n d xi (9) i= [ n ] d RE = d RW = n () d xi i= Suppose the degrees follow Zipf s law with exponent one, i.e., d i = of the random node estimator is ( [ var( d RN ) d 2 N (α + ) ln 2 N + α ] ) n + α ( var( d RE ) d 2 n 2 ln N + α + α A. The variance α+i. () ). (2) Main conclusion RE is better than RN in many cases; RW depends on graph conductance.

distribution Frequency Frequency 7 7 5 6 5 6 ()Twitter 6 (2)WikiTalk (3)BerkStan (4)EmailEu (5)Stanford (6)Skitter 6 4 5 4 5 5 5 4 4 3 4 3 4 3 3 3 3 2 2 2 2 2 2 2 3 4 5 6 7 2 3 4 5 6 2 3 4 5 2 3 4 2 3 4 5 2 3 4 5 6 6 (7)Youtube 5 5 4 4 3 3 2 2 2 3 4 5 2 3 4 5 (8)NotreDame 5 5 6 5 (9)Gowalla ()Epinion ()Google (2)Slashdot 5 4 4 4 4 3 3 3 3 2 2 2 2 2 3 4 5 2 3 4 2 3 4 2 3 4 Frequency 7 6 5 4 3 2 (3)Facebook- 2 3 4 degree 5 4 3 2 (4)Flickr 2 3 4 degree 4 3 2 (5)Facebook-2 2 3 4 degree 5 4 3 2 (6)Amazon 2 3 4 degree 6 5 4 3 2 (7)Citation 2 3 degree 6 5 4 3 2 (8)RoadNet 2 degree Plots are sorted in decreasing order of coefficient of variation γ. Most of them follow power-law, yet they are very different

Comparison of Three methods

Comparison of RN and RE Ratio of RN and RE 25 2 5 5 4 3 2 2 4 6 8 5 5 2 25 3 35 4 γ

Comparison of RN, RE, and RW s RRMSE RRMSE 8 6 4 2.5 ()Twitter (7)Youtube 4 3 2.5.5 (2)WikiTalk (8)NotreDame 2.5 2.5.5.8.6.4.2 (3)BerkStan (9)Gowalla 2.5.5.6.4.2 (4)EmailEu ()Epinion.5.5.6.4.2 (5)Stanford ()Google.5.5.6.4.2 (6)Skitter (2)Slashdot (3)Facebook (4)Flickr (5)Facebook 2 (6)Amazon (7)Citation (8)RoadNet RRMSE.5.5.5.6.4.2.3.2..6.4.2.5. 5 Sample 5 Sample 5 Sample 5 Sample 5 Sample 5 Sample Figure: RRMSEs of RN, RE, and RW samplings as a function of sample size for 8 graphs. The dotted, dashed, and solid lines are for RN(... ), RE( ), and RW( ) samplings respectively. It shows that in most cases the sample size does not change the relative positions of the sampling methods. The exceptions are the web graphs 3 and 5 where RW

Sample distribution 4 (2)WikiTalk 3 (3)BerkStan 4 (4)EmailEu 3 (5)Stanford 3 (6)Skitter Frequency 3 2 3 2 2 2 2 2 5 5 2 4 5 5 (7)Youtube (8)NotreDame (9)Gowalla ()Epinion ()Google (2)Slashdot 2 2 2 2 2 5 5 5 2 4 2 4 2 4 3 3 2 4 3 4 (3)Facebook (5)Facebook 2 (6)Amazon (7)Citation (8)RoadNet (4)Flickr Frequency 2 2 2 2 2 2 4 2 4 2 4 2 4 2 4 Figure: The degree distributions of the samples obtained from RE (Random Edge) samplings. n=8,. The log-log plots in the first two rows exhibit a V" shape, where the sampled small nodes resemble the distribution of the original graph, while the sampled large nodes have a tail pointing upwards. These plots in the first two rows indicate that both small and large nodes are well represented in the sample. The plots in the last row indicate that the sample distribution is similar to the original distribution, therefore the RRMSE of RE sampling is similar to that of RN sampling.

Graph conductance and RW sampling Ratio of RW and RE 8 6 4 2 8 6 4 Facebbok 2 Youtube Amazon Flickr NotreDame Stanford BerkStan Pearson Corr:.68652 2.5 2 2.5 3 3.5 4 4.5 log(φ) Figure: Standard error ratio between RW and RE vs. graph conductance Φ for 8 datasets. Sample size is 4.

Network Structure Flickr NotreDame Stanford Web Amazon Facebook-2 Youtube ~+ ~99 2~9 Figure: Random walks on six networks. Flickr, NotreDame and Stanford have loosely connected components while Amazon, Facebook and Youtube are well enmeshed. Each random walk contains 6 4 nodes except NotreDame which has 5 4 nodes. Node

Conductance Flickr NotreDame Stanford Conductance - -2-3 -4 2 3 4 5 Number of nodes in the cluster Conductance - -2-3 -4-5 2 3 4 5 6 Number of nodes in the cluster Conductance Number of nodes in the clust - -2-3 -4-5 2 3 4 5 Amazon Facebook-2 Youtube Conductance - -2-3 Conductance - Conductance - -2-4 2 3 4 5 6 Number of nodes in the cluster -2 2 3 4 5 Number of nodes in the cluster Number of nodes in the clust -3 2 3 4 5 Figure: Conductance Φ(S) over S, the size of the the components, for six networks. Plots are drawn using SNAP API described in [?].

Very important We can access only partial information What is the global picture? Distribution Most influential Overall topology (e.g. clustering coefficient) Message diffusion, Critical nodes Communities...

Star Select nodes uniformly at random (e.g., nodes and 6); Take all the neighbours as sample (nodes 2,3,4,5,5,7,9); It approximate PPS (probability proportional to size) sampling; More efficient than random walk by taking all the neighbours instead a random one; We sampled around one million stars; 4 5 6 7 9 8 2 3 2 4 5 6 7 9 8 2 3 2

s and Messages Figure: Estimated out-degree, in-degree, and message distributions of. in-degree and out-degree as 32. (CI 3.9, 32.29) and 54.39 (CI 49.2, 59.76), respectively.

Estimation of followers f i d i d i Difference Ratio 856 23,335,29 6,859,5 6,476,85.38 2 75243 5,945,36 4,92,69,24,237.6 3 747 5,247,64 4,62,354,85,25.7 4 3794 3,394,62 7,58,539 5,876,8.78 5 6962 3,278,6 2,287,38 99,78.8 6 6338 3,53,77 2,554,298 598,879.4 7 59969 2,99,4,892,58,97,883.9 8 57 2,64,27,323,22,28,5. 9 5946 2,97,22,78,52 36,6.2 54264 2,3,37,76,827,242,3. Table: Estimation for the top accounts. f i : capture frequency of account i; d i : claimed in-degree or number of followers; d i : estimated number of followers; Ratio = (d i d i )/ d i.

Estimated Followers Ratio 5 5 5 5 Accounts A Smoothed ratio.5.5.5 5 Accounts B Ratio 3 2 C 2 4 Accounts (large ratio) Followers 2.5 x 7 2.5.5 D Estimated Claimed 2 4 6 Accounts(top 5) Followers 7 6 5 Estimated Claimed 4 2 4 Accounts(all) E Difference 8 x 5 6 4 2 F 2 5 Accounts(all)

Estimated vs. Claimed Followers Estimated followers 2.5 x 7 2.5.5 Account when X=Y.5.5 2 2.5 Claimed followers x 7 Figure: Estimated followers vs. claimed followers in log-log scale. The Pearson correlation coefficient is.9797.

Whether is it correct... Relative standard deviation of the estimator is RSD(ˆd i ) = / f i. (3) 9 EmailEu True Value Est. Error Bound 3 x 4 youtube True Value Est. Error Bound 8 2.5 s 7 6 5 s 2.5 4 3 2.5 2 3 4 5 6 7 8 9 2 3 4 5 Top Users 2 3 4 5 6 7 8 9 2 3 4 5 Top Users Figure: accuracy on existing data