Statistics & Data Analysis 3
Zhu Huaiqiu @Peking University

Get your facts first, then you can distort them as you please. (Mark Twain)

Our behavior, attitudes, and sometimes actions are based on samples.

Elements of Sampling Theory and Methods
3.1
(Image by MIT OpenCourseWare. Based on Gilbert, Norman. Statistics. W.B. Saunders Co., 1976.)
3.1.1 Statistical model (statistical structure): $\{\Omega, \mathcal{F}_\Omega, \Phi\}$
3.1.2 Population $X$ with parameter $\Theta$; sample $\{X_i\}$ (example: height)
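Sampling is that part of statistical practice concerned with the selection of individual observations intended to yield some knowledge about a population of concern, especially for the purposes of statistical inference.
Population: $X = \{x_i\}$, $i = 1, 2, \ldots, N$, with mean $\mu$.
Random sampling yields a random sample $\{X_1, X_2, \ldots, X_n\}$, $n < N$, that is, $\{X_i\}$, $i = 1, 2, \ldots, n$.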
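3.1.3 Population parameters
For a population $X$ of size $N$, $\{x_i\}$, $i = 1, 2, \ldots, N$:
population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
population total: $\tau = \sum_{i=1}^{N} x_i = N\mu$
population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$
population standard deviation: $\sigma$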
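The population parameters of 3.1.3 can be computed directly when the whole population is available. The sketch below assumes the variance uses the divisor $N$ (the convention that matches $E(s^2) = \frac{N}{N-1}\sigma^2$ later in the chapter); the function name and the toy population are illustrative, not from the slides.

```python
import math

def population_parameters(values):
    """Return (mean, total, variance, std) of a finite population."""
    N = len(values)
    total = sum(values)
    mean = total / N
    # Population variance: divisor N, not N - 1.
    variance = sum((x - mean) ** 2 for x in values) / N
    return mean, total, variance, math.sqrt(variance)

# Hypothetical toy population, for illustration only.
population = [2, 4, 4, 4, 5, 5, 7, 9]
mu, tau, sigma2, sigma = population_parameters(population)
print(mu, tau, sigma2, sigma)
```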
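Example 3.1. E. coli K-12 (Escherichia coli K-12 MG1655): GC content of genes. The genome is 4,639,221 bp long (bases C, G, A, T) and contains 4,289 genes. Taking the GC% of the DNA of each of the 4,289 genes as the population gives N = 4289, μ = 51.01, σ² = 23.91, σ = 4.89.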
3.2
Sample: $X_1, X_2, \ldots, X_n$, $n < +\infty$
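3.2.1 The χ², t, and F distributions

χ² (chi-square) distribution
Let $X \sim N(0, 1)$ and let $(X_1, X_2, \ldots, X_n)$ be a sample of $X$, with the $X_i$ ($i = 1, \ldots, n$) independent. Then
$\chi^2 = X_1^2 + X_2^2 + \cdots + X_n^2$
follows the χ² distribution with $n$ degrees of freedom, $\chi^2 \sim \chi^2(n)$, with density
$f(x) = \begin{cases} \dfrac{1}{2^{n/2}\,\Gamma(n/2)}\, x^{n/2-1} e^{-x/2}, & x > 0 \\ 0, & x \le 0 \end{cases}$
[Figure: density $f(x)$ and distribution function $F(x)$ of $\chi^2(n)$ for n = 1, 2, 3, 4, 5.]

Properties of the χ² distribution:
1. Additivity: if $\chi_1^2 \sim \chi^2(n_1)$ and $\chi_2^2 \sim \chi^2(n_2)$ are independent, then $\chi_1^2 + \chi_2^2 \sim \chi^2(n_1 + n_2)$.
2. If $\chi^2 \sim \chi^2(n)$, then $E(\chi^2) = n$ and $D(\chi^2) = 2n$.
3. Upper α point of $\chi^2(n)$: for a given α, $0 < \alpha < 1$, the point $\chi^2_\alpha(n)$ satisfying
$P\{\chi^2 > \chi^2_\alpha(n)\} = \int_{\chi^2_\alpha(n)}^{\infty} f(x)\,dx = \alpha$
is called the upper α point of the $\chi^2(n)$ distribution.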
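A short simulation can make the definition concrete: summing the squares of $n$ independent $N(0, 1)$ draws should give a variable whose mean is close to $n$ and whose variance is close to $2n$. This is a minimal sketch using only the standard library; the function names, the number of draws, and the seed are arbitrary choices for illustration.

```python
import random
import statistics

def chi_square_draw(n, rng):
    """One draw of X_1^2 + ... + X_n^2 with X_i independent N(0, 1)."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))

def simulate_chi_square(n, draws=20000, seed=0):
    """Return the empirical mean and variance of `draws` simulated values."""
    rng = random.Random(seed)
    sample = [chi_square_draw(n, rng) for _ in range(draws)]
    return statistics.fmean(sample), statistics.pvariance(sample)

mean, var = simulate_chi_square(n=5)
print(mean, var)  # expected to be near n = 5 and 2n = 10
```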
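History of the χ² distribution
First described by the German statistician Helmert in papers of 1875, where he computed the sampling distribution of the sample variance of a normal population. Thus in German this was traditionally known as the Helmert distribution. Independently rediscovered by the English mathematician Pearson in the context of goodness of fit, for which he developed his Pearson's chi-squared test, with a computed table of values published in 1902.
Karl Pearson (1857-1936); Friedrich Robert Helmert (1843-1917)

t distribution
Let $X \sim N(0, 1)$ and $Y \sim \chi^2(n)$, with $X$ and $Y$ independent. Then
$T = \dfrac{X}{\sqrt{Y/n}}$
has density
$f(t) = \dfrac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\left(\frac{n}{2}\right)} \left(1 + \dfrac{t^2}{n}\right)^{-\frac{n+1}{2}}, \quad -\infty < t < \infty$
and is said to follow the t (Student) distribution with $n$ degrees of freedom, $T \sim t(n)$.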
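William Sealy Gosset (1876-1937), who published under the name "Student".
[Figure: density of $t(n)$ for several values of $n$ (including 1, 4, 8) compared with $N(0, 1)$.]
[Figure: density of the t-distribution for 1, 2, 3, 5, 10, and 30 df compared to the standard normal distribution (blue).]

Properties of the $t(n)$ distribution:
1. $f(t)$ is symmetric about $t = 0$.
2. As $n \to +\infty$, the Student distribution tends to $N(0, 1)$:
$\lim_{n \to \infty} f(t) = \dfrac{1}{\sqrt{2\pi}}\, e^{-t^2/2}$
3. Upper α point of $t(n)$: for a given α, $0 < \alpha < 1$, the point $t_\alpha(n)$ satisfying
$P\{T > t_\alpha(n)\} = \int_{t_\alpha(n)}^{\infty} f(t)\,dt = \alpha$
is called the upper α point of the $t(n)$ distribution. By symmetry, $t_{1-\alpha}(n) = -t_\alpha(n)$.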
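The symmetry of the t density and its convergence to the standard normal density can be checked numerically. The sketch below evaluates the standard density of $t(n)$, $f(t) = \Gamma(\frac{n+1}{2}) / (\sqrt{n\pi}\,\Gamma(\frac{n}{2})) \cdot (1 + t^2/n)^{-(n+1)/2}$, on the log scale for numerical stability; the function names are illustrative.

```python
import math

def t_pdf(t, n):
    """Density of Student's t distribution with n degrees of freedom."""
    log_c = (math.lgamma((n + 1) / 2) - math.lgamma(n / 2)
             - 0.5 * math.log(n * math.pi))
    return math.exp(log_c - (n + 1) / 2 * math.log1p(t * t / n))

def normal_pdf(t):
    """Density of the standard normal distribution N(0, 1)."""
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

# The gap between t(n) and N(0, 1) at t = 1 shrinks as n grows.
for n in (1, 5, 30, 1000):
    print(n, t_pdf(1.0, n), normal_pdf(1.0))
```

For $n = 1$ the density at zero is $1/\pi$ (the Cauchy case), which is a convenient sanity check.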
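F distribution (Fisher-Snedecor)
Let $U \sim \chi^2(n)$ and $V \sim \chi^2(m)$, with $U$ and $V$ independent. Then
$F = \dfrac{U/n}{V/m}$
has density
$f(y) = \begin{cases} \dfrac{\Gamma\left(\frac{n+m}{2}\right)}{\Gamma\left(\frac{n}{2}\right)\Gamma\left(\frac{m}{2}\right)} \left(\dfrac{n}{m}\right)^{n/2} y^{n/2-1} \left(1 + \dfrac{n}{m}\,y\right)^{-\frac{n+m}{2}}, & y > 0 \\ 0, & y \le 0 \end{cases}$
and is said to follow the F distribution with $(n, m)$ degrees of freedom, $F \sim F(n, m)$.
[Figure: density of $F(n, m)$ for (n, m) = (1, 1), (2, 1), (5, 2), (100, 1), (100, 100).]

Properties of the $F(n, m)$ distribution:
1. If $F \sim F(n, m)$, then $1/F \sim F(m, n)$.
2. For $n = 1$: if $T \sim t(m)$, then $T^2 \sim F(1, m)$.
3. Upper α point of $F(n, m)$: for a given α, $0 < \alpha < 1$, the point $F_\alpha(n, m)$ satisfying
$P\{F > F_\alpha(n, m)\} = \int_{F_\alpha(n, m)}^{\infty} f(y)\,dy = \alpha$
is called the upper α point of the $F(n, m)$ distribution.
4. $F_{1-\alpha}(n, m) = \dfrac{1}{F_\alpha(m, n)}$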
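Two properties of the F distribution can be verified through densities. If $F \sim F(n, m)$ then $1/F \sim F(m, n)$, which is equivalent to $f_{m,n}(y) = f_{n,m}(1/y)/y^2$; and if $T \sim t(m)$ then $T^2 \sim F(1, m)$, which is equivalent to $f_{1,m}(y) = f_t(\sqrt{y})/\sqrt{y}$. The sketch below checks both identities at one point using the standard density formulas; the function names and the evaluation points are arbitrary.

```python
import math

def f_pdf(y, n, m):
    """Density of the F distribution with (n, m) degrees of freedom."""
    if y <= 0:
        return 0.0
    log_c = (math.lgamma((n + m) / 2) - math.lgamma(n / 2)
             - math.lgamma(m / 2) + (n / 2) * math.log(n / m))
    return math.exp(log_c + (n / 2 - 1) * math.log(y)
                    - (n + m) / 2 * math.log1p(n * y / m))

def t_pdf(t, m):
    """Density of Student's t distribution with m degrees of freedom."""
    log_c = (math.lgamma((m + 1) / 2) - math.lgamma(m / 2)
             - 0.5 * math.log(m * math.pi))
    return math.exp(log_c - (m + 1) / 2 * math.log1p(t * t / m))

y = 1.7
# Reciprocal property: density of 1/F at y, with F ~ F(5, 2), equals the F(2, 5) density.
print(f_pdf(y, 2, 5), f_pdf(1 / y, 5, 2) / y**2)
# Square of a t(7) variable has the F(1, 7) density.
print(f_pdf(y, 1, 7), t_pdf(math.sqrt(y), 7) / math.sqrt(y))
```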
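3.2.2 Statistics and sampling distributions
Let $X \sim N(\mu, \sigma^2)$ and let $(X_1, X_2, \ldots, X_n)$ be a sample of $X$, with the $X_i$ ($i = 1, \ldots, n$) independent. Then:
1. $\bar{X} \sim N\left(\mu, \dfrac{\sigma^2}{n}\right)$, where $\bar{X} = \dfrac{1}{n}\sum_{i=1}^{n} X_i$
2. $\dfrac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1)$, where $S^2 = \dfrac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$
3. $\bar{X}$ and $S^2$ are independent.

Under the same assumptions,
$T = \dfrac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n-1)$

Let $(X_1, X_2, \ldots, X_n)$ and $(Y_1, Y_2, \ldots, Y_m)$ be independent samples from $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$ with $\sigma_1^2 = \sigma_2^2$. Then
$T = \dfrac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{S_w \sqrt{\frac{1}{n} + \frac{1}{m}}} \sim t(n + m - 2)$
where
$S_w^2 = \dfrac{(n-1)S_1^2 + (m-1)S_2^2}{n + m - 2}$, $S_1^2 = \dfrac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$, $S_2^2 = \dfrac{1}{m-1}\sum_{j=1}^{m} (Y_j - \bar{Y})^2$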
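As a worked illustration of the two t statistics, the sketch below computes the one-sample statistic $(\bar{X} - \mu)/(S/\sqrt{n})$ and the pooled two-sample statistic with $S_w^2 = \frac{(n-1)S_1^2 + (m-1)S_2^2}{n+m-2}$. The function names, the parameter `delta0` (the hypothesized value of $\mu_1 - \mu_2$), and the data are assumptions made for the example.

```python
import math
import statistics

def one_sample_t(xs, mu0):
    """T = (mean - mu0) / (S / sqrt(n)); follows t(n - 1) under normality."""
    n = len(xs)
    return (statistics.fmean(xs) - mu0) / (statistics.stdev(xs) / math.sqrt(n))

def pooled_t(xs, ys, delta0=0.0):
    """Two-sample T with pooled variance; follows t(n + m - 2) when variances are equal."""
    n, m = len(xs), len(ys)
    s1_sq, s2_sq = statistics.variance(xs), statistics.variance(ys)
    sw = math.sqrt(((n - 1) * s1_sq + (m - 1) * s2_sq) / (n + m - 2))
    diff = statistics.fmean(xs) - statistics.fmean(ys) - delta0
    return diff / (sw * math.sqrt(1 / n + 1 / m))

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
print(one_sample_t(xs, mu0=3))
print(pooled_t(xs, ys))
```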
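Let $(X_1, X_2, \ldots, X_n)$ and $(Y_1, Y_2, \ldots, Y_m)$ be independent samples from $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$. Then
$F = \dfrac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F(n-1, m-1)$
where $S_1^2 = \dfrac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$ and $S_2^2 = \dfrac{1}{m-1}\sum_{j=1}^{m} (Y_j - \bar{Y})^2$.

3.3 Simple Random Sampling
Samples are drawn randomly from the population, ensuring that all parts of the population have a statistically equal chance of being selected. With the SRS method, each member of the population has a statistically equal chance of being selected as a sample, thus reducing bias in the sample. SRS is the best approach if little or nothing is known about the population.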
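Simple random sampling (SRS): a sample $(X_1, X_2, \ldots, X_n)$ drawn from the population $X$ whose components $X_i$ ($i = 1, 2, \ldots, n$) are independent and identically distributed (IID), each with the distribution of $X$.
Sampling with replacement from a population of size $N$ yields an SRS. Sampling without replacement can be treated as SRS when $n < 0.05N$.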
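Example 3.2. Let $X$ be a population of size $N$ and $(X_1, X_2, \ldots, X_n)$ a sample.
Sample mean: $\bar{X} = \dfrac{1}{n}\sum_{i=1}^{n} X_i$, which estimates the population mean $\mu = \dfrac{1}{N}\sum_{i=1}^{N} x_i$.
Sample variance: $s^2 = \dfrac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$, which estimates the population variance $\sigma^2$.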
3.3.1
Example 3.3. E. coli K-12 (Escherichia coli K-12 MG1655): GC% of the 4,289 genes, sampled with n = 10, 20, 30, 40, ..., 500.
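The sample mean $\bar{X} = \dfrac{1}{n}\sum_{i=1}^{n} X_i$ is an unbiased estimate of $\mu$: $E(\bar{X}) = \mu$.
Unbiased estimate: a statistic is said to be an unbiased estimate of a given parameter when the mean of the sampling distribution of that statistic can be shown to be equal to the parameter being estimated.
For sampling without replacement from a population of size $N$:
$E(\bar{X}) = \mu$
$D(\bar{X}) = \dfrac{\sigma^2}{n}\left(1 - \dfrac{n-1}{N-1}\right)$
As $N \to \infty$, or when $n \ll N$, $D(\bar{X}) \approx \dfrac{\sigma^2}{n}$.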
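For a small population, the two results $E(\bar{X}) = \mu$ and $D(\bar{X}) = \frac{\sigma^2}{n}\left(1 - \frac{n-1}{N-1}\right)$ can be checked exactly by enumerating every possible sample drawn without replacement, each being equally likely. The sketch below does this with the standard library; the toy population and the function name are illustrative assumptions.

```python
from itertools import combinations
from statistics import fmean, pvariance

def sample_mean_moments(population, n):
    """Exact mean and variance of the sample mean over all size-n samples
    drawn without replacement (all samples equally likely)."""
    means = [fmean(s) for s in combinations(population, n)]
    return fmean(means), pvariance(means)

population = [1.0, 3.0, 4.0, 8.0, 9.0]  # hypothetical toy population
N, n = len(population), 2
mu, sigma2 = fmean(population), pvariance(population)  # sigma^2 uses divisor N

e_mean, d_mean = sample_mean_moments(population, n)
formula = sigma2 / n * (1 - (n - 1) / (N - 1))
print(e_mean, mu)        # both 5.0
print(d_mean, formula)   # both 3.45, up to floating-point rounding
```

The factor $1 - \frac{n-1}{N-1}$ is the finite population correction; with $n = N$ it makes the variance vanish, as it should when the whole population is sampled.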
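3.3.2
Sample variance: $s^2 = \dfrac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$
$E(s^2) = \dfrac{N}{N-1}\,\sigma^2$, which tends to $\sigma^2$ as $N \to \infty$; for finite $N$, $s^2$ is a biased estimate of $\sigma^2$.
An unbiased estimate of $\sigma^2$ is
$\dfrac{N-1}{N}\, s^2 = \dfrac{N-1}{N} \cdot \dfrac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$
When $N$ is large, $\dfrac{N-1}{N} \approx 1$ and $s^2 = \dfrac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$ can be used directly.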
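The bias of the sample variance under sampling without replacement, $E(s^2) = \frac{N}{N-1}\sigma^2$ with $\sigma^2$ defined using the divisor $N$, can also be confirmed by exhaustive enumeration. A minimal sketch, assuming a small toy population that is not from the slides:

```python
from itertools import combinations
from statistics import fmean, pvariance, variance

def expected_sample_variance(population, n):
    """Exact E(s^2) over all equally likely size-n samples without replacement.
    s^2 uses the divisor n - 1."""
    return fmean([variance(s) for s in combinations(population, n)])

population = [1.0, 3.0, 4.0, 8.0, 9.0]  # hypothetical toy population
N, n = len(population), 3
sigma2 = pvariance(population)           # population variance, divisor N

e_s2 = expected_sample_variance(population, n)
print(e_s2, N / (N - 1) * sigma2)        # E(s^2) equals N/(N-1) * sigma^2
print((N - 1) / N * e_s2, sigma2)        # the corrected estimate is unbiased
```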
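3.3.3
Drawing a simple random sample from a population $X$ of size $N$, $\{x_i\}$, $i = 1, 2, \ldots, N$: random indices can be generated with rand(), seeded with srand().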
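The slide mentions rand() and srand(); a rough Python analogue uses a seeded `random.Random` generator to pick indices. The sketch below shows sampling without and with replacement; the function names are illustrative, and the seed plays the role of the srand() argument.

```python
import random

def srs_without_replacement(population, n, seed=None):
    """Draw n distinct elements; every subset of size n is equally likely."""
    rng = random.Random(seed)  # seeding makes the draw reproducible
    return rng.sample(population, n)

def srs_with_replacement(population, n, seed=None):
    """Draw n elements independently; the same element may appear twice."""
    rng = random.Random(seed)
    return [population[rng.randrange(len(population))] for _ in range(n)]

population = list(range(1, 101))  # hypothetical population labelled 1..100
print(srs_without_replacement(population, 10, seed=1))
print(srs_with_replacement(population, 10, seed=1))
```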
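3.3.4 Ratio estimation
For two populations $x$ and $y$ of size $N$, the population ratio is
$r = \dfrac{\mu_y}{\mu_x} = \dfrac{\sum_{i=1}^{N} y_i}{\sum_{i=1}^{N} x_i}$
From a sample $\{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$, $r$ is estimated by
$R = \dfrac{\bar{Y}}{\bar{X}} = \dfrac{\sum_{i=1}^{n} Y_i}{\sum_{i=1}^{n} X_i}$
With the population covariance $\sigma_{xy} = \dfrac{1}{N}\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)$ and correlation $\rho = \dfrac{\sigma_{xy}}{\sigma_x \sigma_y}$:
$D(R) \approx \dfrac{1}{n}\left(1 - \dfrac{n-1}{N-1}\right)\dfrac{1}{\mu_x^2}\left(r^2\sigma_x^2 + \sigma_y^2 - 2r\sigma_{xy}\right) = \dfrac{1}{n}\left(1 - \dfrac{n-1}{N-1}\right)\dfrac{1}{\mu_x^2}\left(r^2\sigma_x^2 + \sigma_y^2 - 2r\rho\sigma_x\sigma_y\right)$
$E(R) \approx r + \dfrac{1}{n}\left(1 - \dfrac{n-1}{N-1}\right)\dfrac{1}{\mu_x^2}\left(r\sigma_x^2 - \rho\sigma_x\sigma_y\right)$
1. Both $D(R)$ and the bias $E(R) - r$ carry the factor $1/\mu_x^2$: they are large when $\mu_x$ is small and small when $\mu_x$ is large.
2. In practice the population quantities are replaced by sample quantities:
$D(R) \approx \dfrac{1}{n}\left(1 - \dfrac{n-1}{N-1}\right)\dfrac{1}{\bar{X}^2}\left(R^2 s_x^2 + s_y^2 - 2R s_{xy}\right)$
$E(R) \approx r + \dfrac{1}{n}\left(1 - \dfrac{n-1}{N-1}\right)\dfrac{1}{\bar{X}^2}\left(R s_x^2 - s_{xy}\right)$
where $s_x^2 = \dfrac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$, $s_y^2 = \dfrac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2$, $s_{xy} = \dfrac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$.
3. The bias $E(R) - r$ is of order $1/n$.

Example 3.4. Herkson (1976): a population of 393 hospitals, January 1968.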
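A small sketch of the ratio estimate $R = \bar{Y}/\bar{X}$ and its estimated variance $\frac{1}{n}\left(1 - \frac{n-1}{N-1}\right)\frac{1}{\bar{X}^2}\left(R^2 s_x^2 + s_y^2 - 2R s_{xy}\right)$, with sample variances and covariance using the divisor $n - 1$. The function name and the data are hypothetical.

```python
from statistics import fmean, variance

def ratio_estimate(xs, ys, N):
    """Return (R, estimated variance of R) for paired samples from a
    population of size N."""
    n = len(xs)
    x_bar, y_bar = fmean(xs), fmean(ys)
    R = y_bar / x_bar
    s_x2, s_y2 = variance(xs), variance(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    fpc = 1 - (n - 1) / (N - 1)  # finite population correction
    var_R = fpc / n / x_bar**2 * (R**2 * s_x2 + s_y2 - 2 * R * s_xy)
    return R, var_R

# Hypothetical paired sample from a population of N = 10 units.
R, var_R = ratio_estimate([1, 2, 3], [2, 3, 7], N=10)
print(R, var_R)
```

When $y$ is exactly proportional to $x$ in the sample, the bracketed term vanishes and the estimated variance is zero, which is why ratio estimation pays off when the two variables are strongly correlated.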
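3.4
Why? Recall the results for simple random sampling:
$\bar{X} = \dfrac{1}{n}\sum_{i=1}^{n} X_i$, $s^2 = \dfrac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$
$E(\bar{X}) = \mu$, $D(\bar{X}) = \dfrac{\sigma^2}{n}\left(1 - \dfrac{n-1}{N-1}\right)$, $E(s^2) = \dfrac{N}{N-1}\,\sigma^2$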
Stratified Random Sampling
When the population exhibits regularly occurring or predictable strata, samples can be drawn randomly from each stratum. This method ensures that all individuals within a stratum have a statistically equal chance of being selected for the study.
[Figure: a population divided into subpopulations 1, 2, 3.]
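3.4.1
The population $X$ is divided into $L$ strata of sizes $N_l$, $l = 1, 2, \ldots, L$, with $N_1 + N_2 + \cdots + N_L = N$. The elements of stratum $l$ are $x_{il}$, $i = 1, 2, \ldots, N_l$.
Stratum mean: $\mu_l = \dfrac{1}{N_l}\sum_{i=1}^{N_l} x_{il}$
Population mean: $\mu = \dfrac{1}{N}\sum_{l=1}^{L}\sum_{i=1}^{N_l} x_{il} = \sum_{l=1}^{L} \dfrac{N_l}{N}\,\mu_l = \sum_{l=1}^{L} W_l \mu_l$, with weight $W_l = \dfrac{N_l}{N}$
Stratum variance: $\sigma_l^2 = \dfrac{1}{N_l}\sum_{i=1}^{N_l} (x_{il} - \mu_l)^2$
Sample from stratum $l$: $X_{1l}, X_{2l}, \ldots, X_{n_l l}$, with
$\bar{X}_l = \dfrac{1}{n_l}\sum_{i=1}^{n_l} X_{il}$, $s_l^2 = \dfrac{1}{n_l - 1}\sum_{i=1}^{n_l} (X_{il} - \bar{X}_l)^2$
Stratified estimate of the mean: $\bar{X}_s = \sum_{l=1}^{L} \dfrac{N_l}{N}\,\bar{X}_l = \sum_{l=1}^{L} W_l \bar{X}_l$
$E(\bar{X}_s) = \sum_{l=1}^{L} W_l E(\bar{X}_l) = \sum_{l=1}^{L} W_l \mu_l = \mu$
$D(\bar{X}_s) = \sum_{l=1}^{L} W_l^2\, \dfrac{\sigma_l^2}{n_l}\left(1 - \dfrac{n_l - 1}{N_l - 1}\right)$

3.4.2 Strata allocation scheme
$D(\bar{X}_s) = \sum_{l=1}^{L} W_l^2\, \dfrac{\sigma_l^2}{n_l}\left(1 - \dfrac{n_l - 1}{N_l - 1}\right) \approx \sum_{l=1}^{L} \dfrac{W_l^2 \sigma_l^2}{n_l}$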
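The stratified estimate $\bar{X}_s = \sum_l W_l \bar{X}_l$ with $W_l = N_l/N$, and its variance $\sum_l W_l^2 \frac{\sigma_l^2}{n_l}\left(1 - \frac{n_l - 1}{N_l - 1}\right)$, translate directly into code. This is a minimal sketch; the function names and the two-stratum numbers are made up for illustration.

```python
def stratified_mean(stratum_sizes, stratum_sample_means):
    """Weighted mean sum_l W_l * mean_l with W_l = N_l / N."""
    N = sum(stratum_sizes)
    return sum(N_l / N * m for N_l, m in zip(stratum_sizes, stratum_sample_means))

def stratified_mean_variance(stratum_sizes, stratum_variances, sample_sizes):
    """Variance of the stratified mean, including the finite population
    correction within each stratum."""
    N = sum(stratum_sizes)
    total = 0.0
    for N_l, var_l, n_l in zip(stratum_sizes, stratum_variances, sample_sizes):
        W_l = N_l / N
        total += W_l**2 * var_l / n_l * (1 - (n_l - 1) / (N_l - 1))
    return total

# Hypothetical two-stratum population.
sizes = [60, 40]
print(stratified_mean(sizes, [10.0, 20.0]))
print(stratified_mean_variance(sizes, [4.0, 9.0], [6, 4]))
```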
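Neyman optimal allocation scheme
Minimize $D(\bar{X}_s) \approx \sum_{l=1}^{L} \dfrac{W_l^2 \sigma_l^2}{n_l}$ subject to $n_1 + n_2 + \cdots + n_L = n$. The solution is
$n_l = n\,\dfrac{W_l \sigma_l}{\sum_{k=1}^{L} W_k \sigma_k}, \quad l = 1, 2, \ldots, L$
and the resulting variance is
$D(\bar{X}_s)_{\text{Neyman}} = \dfrac{1}{n}\left(\sum_{l=1}^{L} W_l \sigma_l\right)^2$

Proportional allocation scheme
$\dfrac{n_1}{N_1} = \dfrac{n_2}{N_2} = \cdots = \dfrac{n_L}{N_L}$, that is, $n_l = n W_l$, $l = 1, 2, \ldots, L$
$D(\bar{X}_s)_{\text{proportional}} = \dfrac{1}{n}\sum_{l=1}^{L} W_l \sigma_l^2$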
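To compare the two schemes numerically, the sketch below computes the Neyman allocation $n_l = n W_l \sigma_l / \sum_k W_k \sigma_k$ and the proportional allocation $n_l = n W_l$, then evaluates the approximate variance $\sum_l W_l^2 \sigma_l^2 / n_l$ for each. The weights and standard deviations are invented for the example, and the allocations are left as real numbers (in practice they would be rounded to integers).

```python
def neyman_allocation(n, weights, sigmas):
    """n_l proportional to W_l * sigma_l."""
    denom = sum(W * s for W, s in zip(weights, sigmas))
    return [n * W * s / denom for W, s in zip(weights, sigmas)]

def proportional_allocation(n, weights):
    """n_l proportional to W_l."""
    return [n * W for W in weights]

def approx_variance(weights, sigmas, allocation):
    """sum_l W_l^2 sigma_l^2 / n_l, ignoring finite population corrections."""
    return sum(W**2 * s**2 / n_l for W, s, n_l in zip(weights, sigmas, allocation))

# Hypothetical strata: weights sum to 1, sigmas are stratum standard deviations.
weights, sigmas, n = [0.5, 0.3, 0.2], [2.0, 5.0, 10.0], 100
ney = neyman_allocation(n, weights, sigmas)
prop = proportional_allocation(n, weights)
print(ney, approx_variance(weights, sigmas, ney))    # variance (sum W*sigma)^2 / n
print(prop, approx_variance(weights, sigmas, prop))  # variance sum W*sigma^2 / n
```

The Neyman variance is never larger than the proportional one, and the gap grows as the stratum standard deviations become more unequal.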
Example 3.5. KID (Kids' Inpatient Database): a database of hospital inpatient stays for children, used to identify, track, and analyze national trends in health care utilization, access, charges, quality, and outcomes.
Composed of more than 100 clinical and nonclinical variables for each hospital stay. These include: (1) primary and secondary diagnoses; (2) primary and secondary procedures; (3) admission and discharge status; (4) patient demographics; (5) expected payment source; (6) total charges; (7) length of stay; (8) hospital characteristics (e.g., ownership, size, teaching status).
For the purposes of calculating discharge weights, we stratified hospitals on six characteristics: (1) Geographic Region: Northeast, Midwest, West, and South; (2) Control: government nonfederal, private not-for-profit, and private investor-owned; (3) Location: urban or rural; (4) Teaching Status: teaching or nonteaching; (5) Bedsize: small, medium, and large; (6) Hospital Type: children's or other hospital.
3.5
3.5.1
1. Systematic random sampling: sampling interval $K = N/n$
Useful for sampling data that are mobile or dynamic, such as discharge from a process (e.g., taking a set number of samples every five minutes for one hour).
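Systematic sampling with interval $K = N/n$ is commonly implemented by picking a random start among the first $K$ units and then taking every $K$-th unit. The sketch below follows that common recipe, which is an assumption since the slide gives only the interval; it uses integer division, so when $n$ does not divide $N$ the last few units can never be selected.

```python
import random

def systematic_sample(population, n, seed=None):
    """Take every K-th element after a random start, with K = N // n."""
    N = len(population)
    k = N // n                                 # sampling interval
    start = random.Random(seed).randrange(k)   # random start in 0..k-1
    return [population[start + i * k] for i in range(n)]

# Hypothetical population: 60 time-ordered units, sample of 12.
print(systematic_sample(list(range(60)), 12, seed=3))
```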
2. Cluster random sampling: groups, or clusters
[Figure: Population -> simple random sampling on groups -> Sample]
3. Multi-stage random sampling: groups, or clusters
[Figure: Population -> simple random sampling on groups -> random sampling -> Sample]
3.5.2
1. Quota sampling: strata
2. Snowball sampling
A type of purposive sampling, it involves identifying one or more people from the population being researched who can then identify other members of the population who, in turn, can identify further members, and so on. A substantial number of people can be identified and approached to take part in the research. Snowball sampling is also known as network sampling and is particularly useful when it is difficult to identify members of a population, as may be the case when researching hard-to-reach groups.
3. Volunteer sampling
A common method of volunteer sampling is phone-in sampling, used mainly to gauge public opinion on current affairs issues such as preferred political party, capital punishment, etc. People are asked to telephone their vote on a particular issue within a certain time, with no limit to the number of people who can call. Unfortunately, there is also no limit to the number of times a person may phone through their vote. This is a major reason why it is highly unlikely that this type of sampling will produce a representative sample. As well, people who tend to call for these surveys may have quite different views from those who do not call.
Example 3.6. The 1936 U.S. presidential election, Alfred M. Landon versus Franklin D. Roosevelt. The Literary Digest sent out 10 million questionnaires, received about 2.3 million replies, and predicted 57% for Landon. The Literary Digest forecast a Landon victory; Roosevelt won with 62%.
3.6*
1. Weighting adjustment method: Politz-Simmons
2. Imputation method: mean imputation, deductive imputation, ratio imputation, regression imputation, nearest neighbor imputation, hot deck imputation
3.7*
3.7.1
3.7.2 Gibbs
Further reading
Govindarajulu Z., Elements of Sampling Theory and Methods, Pearson Education Asia Ltd., 1999.