1 569: Graphs (23) University of Windsor November 8, 23
2 Outline 2 3 4
3 Why sampling Applications DeepWeb OSN SouceCode SemanticWeb Properties Distribution Ranking Community Diameter CluterteringCoefficient We need to estimate the properties for two reasons: Data in its entirety is not available (e.g., Facebook), or without central control (e.g., WWW), or evolving. Data is big. Quadratic algorithms are not feasible.
4 Deep web graph model d2 d9 d3 d4 d7 d d6 d8 d5 q q3 q5 q6 q7 q4 q2 d A: bipartite graph d6 d8 q7 q4 d5 q2 d7 d4 q6 d2 d3 d9 q q3 q5 B: same graph as (A) in spring model layout Figure: Hidden data source as a bipartite graph
5 What to sample for data size; distributions; clustering coefficient; communities; influential bloggers (degree, pagerank, Katz centralities, etc.) Outliers (spammers, zombies, inflated followers)
6 How to sample Original graph Random Node Random Edge Random Walk Sample by random node; Sample by random edge; Sample by random walk; Combinations and modifications. e.g., random walk with uniform random restart.
7 What is different from traditional sampling Most of the networks are scale-free. s have very large variance. random sampling does not work. Precise sampling: quantities are digitalized, making the sampling process precise. e.g., know the exact degree, and can choose uniformly at random from the neighbouring nodes. Only possible for ONLINE social networks not real social networks. Access interface: provide interface APIs, many options. Can design new sampling schemes using APIs.
8 Applications of web, search engines of Online Social Network (Twitter, ) Number of bugs in programs...
9 Capture-Recapture Method The estimator often used When all the elements have equal probability of being sampled, N = n n 2 d where n and n 2 are the number of samples in two capture occasions, d is the duplicates. () Lawrence and C. Giles. Searching the world wide web. Science, 28(536):98-, 998. A. Broder and et al. Estimating corpus size via queries. In CIKM, pages ACM, 26. L. Katzir, E. Liberty, and O. Somekh. Estimating sizes of social networks via biased sampling. In WWW, pages ACM, 2. Petersson et al., Capture recapture in software inspections after years research -theory, evaluation and application, Journal of Systems and Software, 24. Lots of research on obtaining uniform random sampling, using methods such as Metropolis-Hasting Random Walk Bar-Yossef et al. Random sampling from a search engine s index, JACM 28.
10 Multiple Capture-Recapture Equal sampling probability N = n2 2C where n is total number of sampled elements, C is the number of collisions Unequal sampling probability (2) N = n2 2C Γ (3) where Γ is the normalized variance of the degrees of the graph How large is Γ?
11 Γ in various datasets Graph N( 3 ) γ or Γ Φ( 5 ) WikiTalk [?] 2, ,7 BerkStan [?] Eu [?] Stanford [?] Skitter [?], Youtube [?], NotreDame [?] Gowalla [?] ,2 Epinion [?] Google [?] Slashdot [?] ,9 Facebook [?] 2, Flickr [?] IMDB [?] DBLP [?] Amazon [?] Gnutella [?] , CitePatents [?] 3,764.2, Table: Statistics of the 8 real-world graphs, sorted in descending order of the coefficient of degree variation γ. Φ is the conductance. For Twitter data, Γ 3.
12 Outline 2 3 4
13 Correction Theorem The relative bias of N can be approximated by the reciprocal of E(C), i.e., RB E(C), Dingding Li, Correction in Small Sample from Big Data, TKDE, /E(C) RB (4).7 mean est N sample size n Figure: RB and /E(C) against sample sizes in simulation study. It shows that N is biased upwards, and the relative bias can be approximated by the reciprocal of E(C).
14 Our bias corrected estimators.2 x 6..8 N = N = n 2 2(C + ) (5) n 2 2(C + ) Γ (6) hat N hat N* mean est N sample size n Figure: N and N S over 4 runs for various sample size in simulation study. Red dotted line is the true value.
15 Twitter data N=4million..6 /E(C) RB 4.8 x 7 hat N hat N* mean est N Expected bias sample size n Corrected
16 Outline 2 3 4
17 Random not Recommended Lemma (Variance of N N ) The estimated variance of RN estimator N N is Lemma (Variance of N E ) var( N N ) N2 E(C) 2N3 n 2 (7) The estimated variance of RE estimator N E is ( ) var( N E ) 2N3 n 2 + nγcv 2 (Γ), (8) Γ 2N where CV (Γ) is the coefficient of variation of Γ. Theorem (RN vs. RE) To achieve the same variance of N E, N N needs to use at most Γ times more samples.
19 RN and RE on Facebook Data RSE Observed RN Estimated RN Observed RE Estimated RE 5 2 Sample ( 2N)
20 Comparison of RN, RE, and RW Figure: Comparison of three sampling methods. The sample size n = 2NC where C =. It shows that for RN sampling (red solid bars), the relative standard error is equal to / C =. across all the datasets. RE sampling is consistently smaller than RN sampling.rw sampling can approximate RE sampling for some datasets. For NotreDame etc. that have low conductance, RW is grossly wrong.
21 Why RW Varies ~+ ~99 2~9 Figure: Subgraphs obtained by RW sampling from Flickr, Eu, Stanford and Youtube. Each subgraph contains 6, nodes. Node colour represents its degree in the original graph. Green=; Blue=2 9; Orange= 99; Red=.
22 for average degree degree is an important metrics in any network In and out average degrees in are different. Naive method arithmetic sample mean Problem Variance is too large because of the power law Solution Use PPS (RE) sampling and harmonic mean estimator On Twitter, PPS sampling can be hundreds of times better
23 Estimation Theorem d RN = n n d xi (9) i= [ n ] d RE = d RW = n () d xi i= Suppose the degrees follow Zipf s law with exponent one, i.e., d i = of the random node estimator is ( [ var( d RN ) d 2 N (α + ) ln 2 N + α ] ) n + α ( var( d RE ) d 2 n 2 ln N + α + α A. The variance α+i. () ). (2) Main conclusion RE is better than RN in many cases; RW depends on graph conductance.
24 distribution Frequency Frequency ()Twitter 6 (2)WikiTalk (3)BerkStan (4) Eu (5)Stanford (6)Skitter (7)Youtube (8)NotreDame (9)Gowalla ()Epinion ()Google (2)Slashdot Frequency (3)Facebook degree (4)Flickr degree (5)Facebook degree (6)Amazon degree (7)Citation 2 3 degree (8)RoadNet 2 degree Plots are sorted in decreasing order of coefficient of variation γ. Most of them follow power-law, yet they are very different
25 Comparison of Three methods
26 Comparison of RN and RE Ratio of RN and RE γ
27 Comparison of RN, RE, and RW s RRMSE RRMSE ()Twitter (7)Youtube (2)WikiTalk (8)NotreDame (3)BerkStan (9)Gowalla (4) Eu ()Epinion (5)Stanford ()Google (6)Skitter (2)Slashdot (3)Facebook (4)Flickr (5)Facebook 2 (6)Amazon (7)Citation (8)RoadNet RRMSE Sample 5 Sample 5 Sample 5 Sample 5 Sample 5 Sample Figure: RRMSEs of RN, RE, and RW samplings as a function of sample size for 8 graphs. The dotted, dashed, and solid lines are for RN(... ), RE( ), and RW( ) samplings respectively. It shows that in most cases the sample size does not change the relative positions of the sampling methods. The exceptions are the web graphs 3 and 5 where RW
28 Sample distribution 4 (2)WikiTalk 3 (3)BerkStan 4 (4) Eu 3 (5)Stanford 3 (6)Skitter Frequency (7)Youtube (8)NotreDame (9)Gowalla ()Epinion ()Google (2)Slashdot (3)Facebook (5)Facebook 2 (6)Amazon (7)Citation (8)RoadNet (4)Flickr Frequency Figure: The degree distributions of the samples obtained from RE (Random Edge) samplings. n=8,. The log-log plots in the first two rows exhibit a V" shape, where the sampled small nodes resemble the distribution of the original graph, while the sampled large nodes have a tail pointing upwards. These plots in the first two rows indicate that both small and large nodes are well represented in the sample. The plots in the last row indicate that the sample distribution is similar to the original distribution, therefore the RRMSE of RE sampling is similar to that of RN sampling.
29 Graph conductance and RW sampling Ratio of RW and RE Facebbok 2 Youtube Amazon Flickr NotreDame Stanford BerkStan Pearson Corr: log(φ) Figure: Standard error ratio between RW and RE vs. graph conductance Φ for 8 datasets. Sample size is 4.
30 Network Structure Flickr NotreDame Stanford Web Amazon Facebook-2 Youtube ~+ ~99 2~9 Figure: Random walks on six networks. Flickr, NotreDame and Stanford have loosely connected components while Amazon, Facebook and Youtube are well enmeshed. Each random walk contains 6 4 nodes except NotreDame which has 5 4 nodes. Node
31 Conductance Flickr NotreDame Stanford Conductance Number of nodes in the cluster Conductance Number of nodes in the cluster Conductance Number of nodes in the clust Amazon Facebook-2 Youtube Conductance Conductance - Conductance Number of nodes in the cluster Number of nodes in the cluster Number of nodes in the clust Figure: Conductance Φ(S) over S, the size of the the components, for six networks. Plots are drawn using SNAP API described in [?].
32 Very important We can access only partial information What is the global picture? Distribution Most influential Overall topology (e.g. clustering coefficient) Message diffusion, Critical nodes Communities...
33 Star Select nodes uniformly at random (e.g., nodes and 6); Take all the neighbours as sample (nodes 2,3,4,5,5,7,9); It approximate PPS (probability proportional to size) sampling; More efficient than random walk by taking all the neighbours instead a random one; We sampled around one million stars;
34 s and Messages Figure: Estimated out-degree, in-degree, and message distributions of. in-degree and out-degree as 32. (CI 3.9, 32.29) and (CI 49.2, 59.76), respectively.
35 Estimation of followers f i d i d i Difference Ratio ,335,29 6,859,5 6,476, ,945,36 4,92,69,24, ,247,64 4,62,354,85, ,394,62 7,58,539 5,876, ,278,6 2,287,38 99, ,53,77 2,554, , ,99,4,892,58,97, ,64,27,323,22,28, ,97,22,78,52 36, ,3,37,76,827,242,3. Table: Estimation for the top accounts. f i : capture frequency of account i; d i : claimed in-degree or number of followers; d i : estimated number of followers; Ratio = (d i d i )/ d i.
36 Estimated Followers Ratio Accounts A Smoothed ratio Accounts B Ratio 3 2 C 2 4 Accounts (large ratio) Followers 2.5 x D Estimated Claimed Accounts(top 5) Followers Estimated Claimed Accounts(all) E Difference 8 x F 2 5 Accounts(all)
37 Estimated vs. Claimed Followers Estimated followers 2.5 x Account when X=Y Claimed followers x 7 Figure: Estimated followers vs. claimed followers in log-log scale. The Pearson correlation coefficient is.9797.
38 Whether is it correct... Relative standard deviation of the estimator is RSD(ˆd i ) = / f i. (3) 9 Eu True Value Est. Error Bound 3 x 4 youtube True Value Est. Error Bound s s Top Users Top Users Figure: accuracy on existing data
4302 動態光散射儀 (Dynamic Light Scattering) 代工實例與結果解析 生醫暨非破壞性分析團隊 2016.10 updated Which Size to Measure? Diameter Many techniques make the useful and convenient assumption that every particle is a sphere. The
SSBSE 2015, Bergamo Transformed Search Based Software Engineering: A New Paradigm of SBSE He JIANG, Zhilei Ren, Xiaochen Li, Xiaochen Lai email@example.com School of Software, Dalian Univ. of Tech. Outline
A Study on Grading and Sequencing of Senses of Grade-A Polysemous Adjectives in A Syllabus of Graded Vocabulary for Chinese Proficiency 2002 I II Abstract ublished in 1992, A Syllabus of Graded Vocabulary
10384 070302 9825042 UDC 2001.6. 2001.7. 20016 THE APPLICATION OF ISOTOPE RATIO ANALYSIS BY INDUCTIVELY COUPLED PLASMA MASS SPECTROMETER A Dissertation Presented By Chaoyong YANG Supervisor: Prof.Dr. Xiaoru
Decision analysis 量化決策分析方法專論 2011/5/26 1 Problem formulation- states of nature In the decision analysis, decision alternatives are referred to as chance events. The possible outcomes for a chance event
Introduction to Genetics Darrh Bullock University of Kentucky The Model Trait = Genetics + Environment Genetics Additive Predictable effects that get passed from generation to generation Non-Additive Primarily
I II III IV The theories of leadership seldom explain the difference of male leaders and female leaders. Instead of the assumption that the leaders leading traits and leading styles of two sexes are the
* 35 4 2011 7 Vol. 35 No. 4 July 2011 3 Population Research 1950 ~ 1981 The Estimation Method and Its Application of Cohort Age - specific Fertility Rates Wang Gongzhou Hu Yaoling Abstract Based on the
2008 1 11 M4.7 On the M4.7 earthquake off Kamaishi, Iwate prefecture, Japan, on January 11, 2008. Graduate School of Science, Tohoku University 2008 1 11 M4.7 Matsuzawa et al. (2002) M-T M4.9 23Hz DD Waldhauser
Research Analysis MICHAEL BERNSTEIN CS 376 Last time What is a statistical test? Chi-square t-test Paired t-test 2 Today ANOVA Posthoc tests Two-way ANOVA Repeated measures ANOVA 3 Recall: hypothesis testing
Jianwen Zhao Department of Computer Science and Engineering The Chinese University of Hong Kong 1/16 Problem 1. Matrix Diagonalization Diagonalize the following matrix: A = [ ] 1 2 4 3 2/16 Solution The
The Development of Color Constancy and Calibration System The Development of Color Constancy and Calibration System LabVIEW CCD BMP ii Abstract The modern technologies develop more and more faster, and
2003 MBA 600795 SWOT Abstract After over ten years development, Chinese securities market has experienced from nothing to something, from small to large and the course of being standardized. To all securities
38 1 2014 1 Vol. 38No. 1 January 2014 51 Population Research 2010 2010 2010 65 100028 Changing Lineal Families with Three Generations An Analysis of the 2010 Census Data Wang Yuesheng Abstract In contemporary
The Digital Divide on the Remote Area: Regarding the community of Ta-Pang in Mt. A-li Abstract Base on the coming of information society, the digital science and technology usage suppose to be the basic
Shigeru Suto, Takayuki Inomata, Hisashi Sasaki and Sakae Mukoyama (2007) Data base of the volcanic ash fall distribution map of Japan. Bull. Geol. Surv. Japan, vol. 58(9/10), p.261-321, 8 figs, 2 tables,
Application of Information Literacy to firstname.lastname@example.org information literacy Theme-oriented teaching. Abstract Based on the definition of Information Literacy and Six core concepts of the problem
A Study on the Job Stress and the Ways of Coping for the Director of Elementary School in the Middle Area of Taiwan Abstract This study aims at probing the subject current status as related to stress and
i YouTube was established in 2005 until now only more than 3 years. Although it was established just more than 3 years, it has already become the one of multitudinous video shares website that most people
數 Quadratic Equations 數 Contents 錄 : Quadratic Equations Distinction between identities and equations. Linear equation in one unknown 3 ways to solve quadratic equations 3 Equations transformed to quadratic
Improving the Video Totalized Method of Stopwatch Calibration Samuel C.K. Ko, Aaron Y.K. Yan and Henry C.K. Ma The Government of Hong Kong Special Administrative Region (SCL) 31 Oct 2015 1 Contents Introduction
土木史研究論文集 Vol.29 2010 年 * ** Abstract Fukuchiyama Line, which locates in suburban of Osaka Megalopolis, has performed a part of Japanese trunk railway network. Because it was chosen as a national railway
2010 年 理 工 类 AB 级 阅 读 判 断 例 题 精 选 (2) Computer mouse How does the mouse work? We have to start at the bottom, so think upside down for now. It all starts with mouse ball. As the mouse ball in the bottom
-- IEMBA 3 ii 4W2H Who( ) When( ) What( ) Why( ) How much( ) How to do( ) iii Abstract Pharmaceutical industry can be regard as one of the knowledge-intensive industries. Designing a sales promotion for
Public Projects A Thesis Submitted to Department of Construction Engineering National Kaohsiung First University of Science and Technology In Partial Fulfillment of the Requirements For the Degree of Master
88 2 The Research of head-hunting Industry. 8741605 87 Wu Po-Hui Yeh, Kuang S. head-hunting executive search Transaction cost Agency theory 1 This study attempts to investigate and analyze the Executive
RReadingWritingArithmetic People with goals succeed because they know where they are going 2 Prepare our students for life Promote their thinking and broaden their horizons Owing to the infl ux of digital
Chap. 4 Techniques of Circuit Analysis Contents 4.1 Terminology 4.2 Introduction to the Node-Voltage Method 4.3 The Node-Voltage Method and Dependent Sources 4.4 The Node-Voltage Method: Some Special Cases
THE INSTALLING INSTRUCTION FOR CONCEALED TANK Important instuction:.. Please confirm the structure and shape before installing the toilet bowl. Meanwhile measure the exact size H between outfall and infall
Introduction to Hamilton-Jacobi Equations and Periodic Yu-Yu Liu NCKU Math August 22, 2012 Yu-Yu Liu (NCKU Math) H-J equation and August 22, 2012 1 / 15 H-J equations H-J equations A Hamilton-Jacobi equation
(1) (2) 921 7-200120022006 2008 (IVI)(2001) (IVI) 50% 2008 7~8% 1~30cm (IVI) Study on the Plant Succession of Slopeland Landslide Areas Following Hydroseeding Process Chiao-Shu Feng Graduate Student, Department
ABSTRACT ABSTRACT As we know the Sinology has a long history. As earily as 19 th century some works have already been done in this field. And among this the studies of lineages and folk beliefs in Southeast
: 1 Students are required to know 50 states and capitals and their geological locations. This is an independent working packet to learn about 50 states and capital. Students are asked to fulfill 4 activities
2007 10 177-196 Symbolic Meanings of the Food in the First Lunar Month between China and 177 Japan He Bin Professor, School of Humanities and Social Sciences Tokyo Metropolitan University Abstract In daily