
類別資料分析 (Categorical Data Analysis)

Instructor: 蔣國司 (Kuo-Szu Chiang), Ph.D.
E-mail: kucst@dragon.nchu.edu.tw
Phone: 22840777 ext. 301 or 305
Time: Monday, 9: AM to noon
Place: 2F Computer Room of Crop Science Building
Textbook: An Introduction to Categorical Data Analysis, 2nd ed., Alan Agresti, Wiley Series, 2007.
Reference book: Categorical Data Analysis, 2nd ed., Alan Agresti, Wiley Series, 2002.
Contents:
  Introduction
  Contingency tables
  Generalized linear model
  Logistic regression
  Loglinear model
  Other topics if time permits
Grade: Homework and Project (40%), Midterm (30%), Final (30%).

Categorical Data Analysis
(For this handout, the tables and figures used are from An Introduction to Categorical Data Analysis, 1st and 2nd eds., 1996 and 2007, Alan Agresti, Wiley Series)

Preliminary knowledge of Categorical Data Analysis

(1) Two-sample Test for Binomial Proportions (Normal-Theory Test)

$H_0: p_1 = p_2$ vs. $H_1: p_1 \neq p_2$

$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\,\hat{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}, \qquad \hat{p} = \frac{x_1 + x_2}{n_1 + n_2}, \qquad \hat{q} = 1 - \hat{p}$$

The two samples are independent and are assumed large enough that the normal approximation to the binomial distribution is valid.

(2) Contingency-Table Method

  o11  o12
  o21  o22

  1. Test whether or not the proportions are the same in the two independent samples.
  2. Test for the independence of two characteristics.

$$X^2 = \sum \frac{(o - e)^2}{e} \sim \chi^2_{(r-1)(c-1)}$$

if no expected value in the table is less than 5. A continuity correction (e.g. Yates) is sometimes applied.

(3) If the probability of a success is $p$, then the odds in favor of success are $\frac{p}{1-p}$. Let $p_1$, $p_2$ be the probabilities of success for two populations. The odds ratio (OR):
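The two-sample z statistic in (1) and the Pearson chi-square statistic in (2) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the counts (30 of 100 vs. 45 of 100) are hypothetical, not taken from the handout, and the P-values use the standard normal and 1-df chi-square tails computed with `math.erfc`.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-sample z test of H0: p1 = p2 (two-sided P-value)."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)          # pooled estimate under H0
    q_pool = 1.0 - p_pool
    se = math.sqrt(p_pool * q_pool * (1.0 / n1 + 1.0 / n2))
    z = (p1_hat - p2_hat) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # 2 * P(Z > |z|)
    return z, p_value

def pearson_chi_square_2x2(o11, o12, o21, o22):
    """Pearson X^2 = sum (o - e)^2 / e for a 2x2 table (1 df, no correction)."""
    observed = [[o11, o12], [o21, o22]]
    row_totals = [o11 + o12, o21 + o22]
    col_totals = [o11 + o21, o12 + o22]
    n = sum(row_totals)
    x2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            x2 += (observed[i][j] - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(x2 / 2.0))  # upper tail of chi-square, 1 df
    return x2, p_value

# Hypothetical data: 30 successes out of 100 vs. 45 successes out of 100.
z, p_z = two_proportion_z_test(30, 100, 45, 100)
x2, p_x2 = pearson_chi_square_2x2(30, 70, 45, 55)
print(z, p_z)
print(x2, p_x2)
```

For a 2×2 table without a continuity correction the two approaches agree: the square of the z statistic equals the Pearson statistic, so the two P-values coincide.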

$$OR = \frac{p_1/q_1}{p_2/q_2} = \frac{p_1 q_2}{p_2 q_1}, \qquad q_i = 1 - p_i, \qquad \text{estimated by } \widehat{OR} = \frac{\hat{p}_1 \hat{q}_2}{\hat{p}_2 \hat{q}_1}$$

                      disease
                   yes      no
  exposure  yes     a        b       a+b
            no      c        d       c+d
                   a+c      b+d

Estimation of the risk ratio for case-control studies: $RR \approx OR = \frac{ad}{bc}$, where

$$\widehat{OR} = \frac{\dfrac{a}{a+b} \cdot \dfrac{d}{c+d}}{\dfrac{c}{c+d} \cdot \dfrac{b}{a+b}} = \frac{ad}{bc}$$

Point and interval estimation for the odds ratio.

(4) Fisher's Exact Test (used when the standard chi-square test is not applicable because small expected values exist)

(5) R × C Contingency Tables; Trend

(6) Mantel-Haenszel Test: to assess the association between a dichotomous disease and a dichotomous exposure variable after controlling for one or more confounding variables.

(7) McNemar's test: two-sample test for binomial proportions for matched-pair data
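As a small illustration of point and interval estimation for the odds ratio, the sketch below computes ad/bc from a 2×2 table together with a large-sample interval on the log scale. The handout does not say which interval it has in mind; the log-based standard error sqrt(1/a + 1/b + 1/c + 1/d) used here is an assumption (the commonly used large-sample formula), and the counts are hypothetical.

```python
import math

def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Sample odds ratio ad/bc with a large-sample CI built on the log scale.

    Assumption: SE(log OR) = sqrt(1/a + 1/b + 1/c + 1/d); all cells must be > 0.
    """
    or_hat = (a * d) / (b * c)
    se_log = math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, lower, upper

def risk_ratio(a, b, c, d):
    """Ratio of row proportions a/(a+b) and c/(c+d).

    Only meaningful when rows are sampled as cohorts, not in a case-control design.
    """
    return (a / (a + b)) / (c / (c + d))

# Hypothetical table: exposed row (a, b) = (20, 80), unexposed row (c, d) = (10, 90).
or_hat, lower, upper = odds_ratio_with_ci(20, 80, 10, 90)
print(or_hat, lower, upper)
print(risk_ratio(20, 80, 10, 90))
```

In a case-control study the row proportions are fixed by the design, which is why the odds ratio, rather than the ratio of row proportions, is the quantity that can be estimated there.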

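Chapter 1 Introduction

The two main probability distributions for categorical data are the binomial distribution and the Poisson distribution.

1.1 Categorical response data
  Count data, not measurement data.
  Response variable, not explanatory variable.
  Two primary types of measurement scales: ordinal variables (the categories have a natural order) and nominal variables (the categories have no natural order).
  Methods for ordinal variables cannot be used with nominal variables, but methods for nominal variables can be used with both ordinal and nominal variables.

1.2 Sampling models
  Poisson sampling
    An important sampling model for categorical data treats the count in each category as an independent Poisson observation; this is called Poisson sampling. A property of the Poisson distribution is that the variance increases with the mean. In practice, the variance of observed counts often exceeds the mean; this phenomenon is called overdispersion.
  Binomial sampling
    Definition: N independent and identical trials, each of which results in either success or failure.
  As the means of the binomial and Poisson distributions grow, both become more bell-shaped and approach the normal distribution.

1.3 Inference for a proportion
  MLE: the parameter value under which the observed data have the highest probability of occurring.
  Estimator: before we see the data, the estimate is unknown. It is a random variable with a sampling distribution, and we call this random variable the estimator.
  The advantage of a confidence interval over a point estimate is that it also conveys the uncertainty of the estimate.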

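Data types (discrete)

1. Binomial distribution (e.g. germination rate)
2. Social science (e.g. whether teaching-assistant sessions help improve students' grades)
3. 2×2 table (contingency table)
     o11  o12
     o21  o22
   $X^2 = \sum \frac{(o - e)^2}{e} \sim \chi^2_{(r-1)(c-1)}$ (applicable in large samples, expected counts $\geq 5$)
   Fisher exact test (applicable in small samples)
   (1) case-control study; (2) cohort study
4. Whether seeds are infected (group testing)
5. Diagnosing whether a disease is present (Bayes calculation)
   (1) Sensitivity: a diseased subject tests positive; (2) Specificity: a disease-free subject tests negative.
   The two trade off against each other and cannot both be raised at the same time.

Generalized Linear Model

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$$

If the random component is binomial, with y coded 0 or 1, let $\pi = P(y = 1)$. The transformation $\ln\left(\frac{\pi}{1-\pi}\right)$ is called the logit transformation, where $\frac{\pi}{1-\pi}$ is called the odds. Because the logit transformation is used, the model is called logistic regression.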
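The relation between a probability, its odds, and its logit can be made concrete with a short sketch. The function names and the coefficient values below are hypothetical and serve only to show how a linear predictor on the logit scale maps back to a probability.

```python
import math

def odds(pi):
    """Odds pi / (1 - pi) for a probability 0 < pi < 1."""
    return pi / (1.0 - pi)

def logit(pi):
    """Logit transformation ln(pi / (1 - pi))."""
    return math.log(odds(pi))

def inverse_logit(eta):
    """Map a linear predictor eta back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def logistic_probability(x, beta0, beta1):
    """P(y = 1) under logit(pi) = beta0 + beta1 * x (one explanatory variable)."""
    return inverse_logit(beta0 + beta1 * x)

# Hypothetical coefficients, chosen only for illustration.
print(odds(0.8), logit(0.8))
print(logistic_probability(x=2.0, beta0=-1.0, beta1=0.5))
```

Because the inverse logit always returns a value strictly between 0 and 1, the fitted probabilities stay in range no matter how large or small the linear predictor is.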

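CH 1. Introduction

Probability distributions:
  1. Binomial dist.
  2. Multinomial dist.

Continuous data => quantitative data
Categorical data => qualitative data

Categorical data:
  1. Nominal data, e.g. political party, gender (methods for ordinal data cannot be used)
  2. Ordinal data, e.g. quality grade of car brands (ordered; methods for nominal data can be used, but they lose information)

What is a parameter? A quantity used to describe a population; it is unknown and fixed.
Parameter estimation: M.L.E.

$$y = \beta_0 + \beta_1 x$$

  y: response (dependent) variable
  x: explanatory (independent) variable

With categorical data it is sometimes impossible to distinguish the response variable from the explanatory variable (e.g. two-way ANOVA).

Assumptions of the binomial dist.: 1. the trials are independent of one another; 2. p is fixed => i.i.d.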

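Two major schools of statistical estimation:
  1. Frequentist approach: the parameter is fixed and unknown
  2. Bayesian approach: prior dist. and posterior dist.

$$p(\theta \mid x) = \frac{f(x \mid \theta)\, p(\theta)}{f(x)}$$

Why is the M.L.E. used so often? In large samples it is 1. unbiased and 2. of minimum variance.

How is the figure below drawn? [figure not reproduced]

How is the M.L.E. obtained? Set the first derivative equal to 0 and check that the second derivative is < 0.

A point estimate does not take the information in the sample size into account, so an interval estimate is better:

$$\bar{x} \pm t \cdot S.E.(\bar{x}) \ \text{(or } Z \text{ in place of } t\text{)}, \qquad \hat{p} \pm Z \cdot S.E.(\hat{p})$$

Meta analysis: combining the results of different studies.

Multinomial dist.: an extension of the binomial dist.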
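As a worked instance of the "first derivative = 0, second derivative < 0" recipe, the following derivation finds the M.L.E. of a binomial proportion. It assumes x successes in n trials; the symbol $\ell$ for the log-likelihood is notation introduced here for convenience.

```latex
\begin{align*}
L(p) &= \binom{n}{x} p^{x} (1-p)^{n-x} \\
\ell(p) = \ln L(p) &= \ln\binom{n}{x} + x \ln p + (n-x)\ln(1-p) \\
\frac{d\ell}{dp} &= \frac{x}{p} - \frac{n-x}{1-p} = 0
  \;\Longrightarrow\; \hat{p} = \frac{x}{n} \\
\frac{d^{2}\ell}{dp^{2}} &= -\frac{x}{p^{2}} - \frac{n-x}{(1-p)^{2}} < 0
\end{align*}
```

The second derivative is negative for every p between 0 and 1, so the stationary point x/n is a maximum.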

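Method 1: Wald C.I. for a binomial proportion

$$\hat{p} \pm Z_{\alpha/2} \cdot S.E., \qquad S.E. = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

Example: survey results on legalizing abortion, 400 "Yes" out of 893.

$$\hat{p} = \frac{400}{893} = 0.448, \qquad \text{C.I.: } 0.448 \pm 1.96\sqrt{\frac{0.448\,(1-0.448)}{893}} = (0.415,\ 0.481)$$

Limitation of this formula: it works well only when (1) p is close to 0.5, or (2) n is large. It is therefore not suitable when p is close to 0 or 1.

Method 2: Score

Example: $\hat{p} = 0.90$.

By Eq. 13 (Wald): $0.9 \pm 1.96\sqrt{\frac{0.9 \times 0.1}{10}} = (0.714,\ 1.086)$; the upper bound exceeds 1.

The score interval consists of the null values $\pi_0$ in $H_0: \pi = \pi_0$ for which

$$|Z| = \left| \frac{\hat{p} - \pi_0}{\sqrt{\pi_0 (1-\pi_0)/n}} \right| \leq 1.96$$

At the two endpoints:

$$\pi_0 = 0.596: \quad Z = \frac{0.9 - 0.596}{\sqrt{0.596 \times 0.404/10}} = 1.96$$

$$\pi_0 = 0.982: \quad Z = \frac{0.9 - 0.982}{\sqrt{0.982 \times 0.018/10}} = -1.96$$

so the 95% score interval is (0.596, 0.982).

Method 3: paper (1998), Agresti and Coull: add two successes and two failures before applying the Wald formula.

$$\tilde{p} = \frac{9 + 2}{10 + 4} = 0.786, \qquad S.E. = \sqrt{\frac{0.786 \times 0.214}{14}}$$

The 95% C.I. is (0.57, 1).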
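The three interval methods above can be compared side by side with a short sketch. It assumes that the $\hat{p} = 0.90$ example comes from 9 successes in n = 10 trials, and it uses the closed-form solution of the score equation (often called the Wilson interval) rather than a numerical search; the function names are hypothetical.

```python
import math

def wald_ci(x, n, z=1.96):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = x / n
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

def score_ci(x, n, z=1.96):
    """Score interval: closed-form roots of |p_hat - pi0| = z * sqrt(pi0 * (1 - pi0) / n)."""
    p_hat = x / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half

def add_two_ci(x, n, z=1.96):
    """Wald formula after adding 2 successes and 2 failures; clipped to [0, 1]."""
    p_tilde = (x + 2) / (n + 4)
    se = math.sqrt(p_tilde * (1.0 - p_tilde) / (n + 4))
    return max(0.0, p_tilde - z * se), min(1.0, p_tilde + z * se)

print(wald_ci(400, 893))   # survey example: 400 of 893
print(wald_ci(9, 10))      # upper limit exceeds 1
print(score_ci(9, 10))
print(add_two_ci(9, 10))
```

Only the Wald interval is left unclipped here, so that its overshoot past 1 for the small-sample example remains visible.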

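Statistical theory (Math. Stat.): probability theory, measurement theory; estimation, inference.

Wald, Score, and Likelihood-Ratio (testing $H_0: \pi = 0.5$ with $\hat{p} = 0.9$, $n = 10$)

$$\text{Wald:} \quad Z = \frac{0.9 - 0.5}{\sqrt{0.9 \times 0.1/10}}, \qquad P < 0.001$$

$$\text{Score:} \quad Z = \frac{0.9 - 0.5}{\sqrt{0.5 \times 0.5/10}}, \qquad P = 0.011$$

$$\text{Likelihood-Ratio:} \quad -2\ln\frac{0.5^{9} \times 0.5^{1}}{0.9^{9} \times 0.1^{1}} = 7.36, \qquad P = 0.007$$

Of the three methods, Wald is the least suitable for small samples; it gives good results only in large samples.

Small-sample binomial inference

The Z score test is best used only when $np \geq 5$.

For discrete distributions the P-value tends to be somewhat larger (the test is conservative).

[Table not recoverable from the extracted text: null binomial probabilities P(y) with the corresponding P-values and mid P-values.]

Example: $\text{mid } P\text{-value} = \frac{P(9)}{2} + P(10) = 0.006$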
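The Wald, score, and likelihood-ratio statistics and the mid P-value for this example can be checked numerically. The sketch below assumes the example is 9 successes in n = 10 trials with null value 0.5, uses two-sided normal and 1-df chi-square P-values, and computes the exact binomial tail one-sidedly (toward large counts); the function names are hypothetical.

```python
import math

def normal_two_sided_p(z):
    """Two-sided standard normal P-value."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def chi_square_1df_p(stat):
    """Upper-tail P-value of a chi-square statistic with 1 df."""
    return math.erfc(math.sqrt(stat / 2.0))

def wald_z(x, n, pi0):
    p_hat = x / n
    return (p_hat - pi0) / math.sqrt(p_hat * (1.0 - p_hat) / n)

def score_z(x, n, pi0):
    p_hat = x / n
    return (p_hat - pi0) / math.sqrt(pi0 * (1.0 - pi0) / n)

def likelihood_ratio_stat(x, n, pi0):
    """-2 ln(L0 / L1); assumes 0 < x < n so that both logs are finite."""
    p_hat = x / n
    log_l0 = x * math.log(pi0) + (n - x) * math.log(1.0 - pi0)
    log_l1 = x * math.log(p_hat) + (n - x) * math.log(1.0 - p_hat)
    return -2.0 * (log_l0 - log_l1)

def binomial_pmf(y, n, pi):
    return math.comb(n, y) * pi ** y * (1.0 - pi) ** (n - y)

def exact_upper_p(x, n, pi0):
    """One-sided exact P-value: P(Y >= x) under H0."""
    return sum(binomial_pmf(y, n, pi0) for y in range(x, n + 1))

def mid_p(x, n, pi0):
    """Mid P-value: half of P(Y = x) plus P(Y > x) under H0."""
    return exact_upper_p(x, n, pi0) - 0.5 * binomial_pmf(x, n, pi0)

print(wald_z(9, 10, 0.5), normal_two_sided_p(wald_z(9, 10, 0.5)))
print(score_z(9, 10, 0.5), normal_two_sided_p(score_z(9, 10, 0.5)))
stat = likelihood_ratio_stat(9, 10, 0.5)
print(stat, chi_square_1df_p(stat))
print(exact_upper_p(9, 10, 0.5), mid_p(9, 10, 0.5))
```

The mid P-value counts only half of the probability of the observed outcome, which is why it comes out smaller than the ordinary exact P-value and offsets some of the conservativeness of discrete tests.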

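Large-sample estimation methods: (1) Wald test (2) likelihood ratio test (3) score test

Wald test: its only advantage is that it is easy to compute.

$$\hat{p} = \frac{x}{n}, \qquad \mathrm{var}(\hat{p}) = \mathrm{var}\left(\frac{x}{n}\right) = \frac{np(1-p)}{n^2} = \frac{p(1-p)}{n}, \qquad \hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

Likelihood function

$$L(X_1, X_2, \ldots, X_n; \theta) = f(x_1) f(x_2) \cdots f(x_n)$$

e.g. for the binomial, $\binom{n}{x} p^x (1-p)^{n-x}$; if n = 1, this is $p^x (1-p)^{1-x}$, so for n independent trials

$$L = \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i} = p^{\sum_{i=1}^{n} x_i}(1-p)^{n - \sum_{i=1}^{n} x_i}$$

Supplement: two-stage binomial (in practice, this is group testing)

$$P_{group} = 1 - (1 - P_{individual})^k, \qquad k: \text{group size}$$

Uses of group testing in Taiwan: orchids, seed testing, seedling nurseries (potato), 農友.

How is P estimated in group testing?

Group testing serves two purposes:
  Classification: whether the group contains an infected unit.
  Estimation: the estimate of $P_{individual}$ is biased.

How should the groups be allocated? After estimation, test whether infection is present.

Dilution effect: similar to group testing; the difference is that the two are built on different distributions.
  group testing: binomial
  dilution effect: Poisson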
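The group-testing relation can be inverted to estimate the individual-level probability from the fraction of positive groups. The sketch below assumes equal group sizes, a perfectly accurate test, and independent units; the estimator is the plug-in inversion of the formula for the group-level probability, and the function names and numbers are hypothetical.

```python
def group_positive_probability(p_individual, k):
    """P_group = 1 - (1 - P_individual)^k for groups of size k."""
    return 1.0 - (1.0 - p_individual) ** k

def estimate_individual_p(positive_groups, n_groups, k):
    """Plug-in estimate: invert P_group = 1 - (1 - p)^k at the observed group rate.

    Assumes equal group size k and a test without classification errors.
    """
    group_rate = positive_groups / n_groups
    return 1.0 - (1.0 - group_rate) ** (1.0 / k)

# Hypothetical example: 12 positive groups out of 50, each group pooling 10 seeds.
print(group_positive_probability(0.02, 10))
print(estimate_individual_p(12, 50, 10))
```

Because the inversion is a nonlinear function of the observed group rate, the resulting estimate is biased in finite samples, which is the bias the notes refer to.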