CAP 5510: Introduction to Bioinformatics
Giri Narasimhan
ECS 254; Phone: x3748
giri@cis.fiu.edu
www.cis.fiu.edu/~giri/teach/bioinfs07.html
3/11/08 CAP5510 1
Reading
The following slides come from a series of talks by Rafael Irizarry of Johns Hopkins. Much of the material is covered in detail in the following papers, available from [http://www.biostat.jhsph.edu/~ririzarr/papers/]:
- Irizarry, RA, Hobbs, B, Collin, F, Beazer-Barclay, YD, Antonellis, KJ, Scherf, U, Speed, TP (2003). Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. Biostatistics 4(2):249-264.
- Bolstad, BM, Irizarry, RA, Astrand, M, Speed, TP (2003). A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Bias and Variance. Bioinformatics 19(2):185-193.
Inference Process
Affymetrix GeneChip Design
Workflow: Analyzing Affy data
Affy Files
- DAT file: image file, about 10 million pixels, 30-50 MB
- CEL file: cell intensity file with probe-level PM and MM values
- CDF file: chip description file describing which probes go in which probe sets and the location of probe-pair sets (genes, gene fragments, ESTs)
Image analysis & Background Correction
- Each probe cell: 10 x 10 pixels
- Gridding estimates the location of probe cell centers
- Signal is computed by:
  - ignoring the outer 36 pixels, leaving an 8 x 8 pixel area
  - taking the 75th percentile of the signal from the 8 x 8 pixel area
- Background signal is computed as the average of the lowest 2% of probe cell values, which is then subtracted from the individual signals
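The two steps above (per-cell signal, then background subtraction) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the percentile uses the simple nearest-rank rule, and the exact interpolation and background scheme used by the Affymetrix software may differ.

```python
import math


def cell_signal(pixels):
    """Signal for one probe cell given as a 10 x 10 grid of pixel values.

    Drops the one-pixel border (36 pixels) and returns the 75th percentile
    of the remaining 8 x 8 area. The nearest-rank percentile rule used here
    is an assumption, not taken from the slides.
    """
    inner = sorted(v for row in pixels[1:-1] for v in row[1:-1])
    rank = math.ceil(0.75 * len(inner))  # nearest-rank position, 1-based
    return inner[rank - 1]


def subtract_background(cell_values, fraction=0.02):
    """Subtract the mean of the lowest `fraction` of cell values from every cell.

    Returns (corrected_values, background).
    """
    ordered = sorted(cell_values)
    k = max(1, round(len(ordered) * fraction))  # number of lowest cells to average
    background = sum(ordered[:k]) / k
    return [v - background for v in cell_values], background
```

Note that cells at or below the background level come out zero or negative after subtraction, which matters later when log transforms are applied.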
Analyzing Affy data
- MAS 4.0
  - Works with PM-MM
  - Negative values result very often
  - Very noisy for low-expressed genes
  - Averages without log transformation
- dChip [Li & Wong, PNAS 98(1):31-36]
  - Accounts for probe effect
  - Uses non-linear normalization
  - Multi-chip analysis reveals outliers
- MAS 5.0
  - Improves on problems with MAS 4.0
Why use log transforms?
[Figure: two panels, each plotting SD (y-axis) against Average Intensity (x-axis)]
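One common argument for log transforms can be illustrated with a small sketch. It assumes that measurement noise is multiplicative (an assumption made here for illustration, not a claim from the slides), and the noise factors below are made-up values. Under that assumption the SD of replicates grows with intensity on the raw scale but stays constant on the log scale.

```python
import math
from statistics import stdev

# Hypothetical multiplicative noise factors applied to each replicate.
NOISE_FACTORS = [0.8, 0.9, 1.0, 1.1, 1.25]


def replicate_sd(true_intensity, log_scale):
    """SD across simulated replicates of one probe, on the raw or log2 scale."""
    replicates = [true_intensity * f for f in NOISE_FACTORS]
    if log_scale:
        replicates = [math.log2(r) for r in replicates]
    return stdev(replicates)


for intensity in (100, 10000):
    print(intensity,
          replicate_sd(intensity, log_scale=False),
          replicate_sd(intensity, log_scale=True))
```

On the raw scale the SD scales with the intensity; on the log2 scale the two SDs agree, because the intensity only adds a constant to every replicate.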
Problem with using (transformed) PM-MM
Bimodality for large expression values
MAS 5.0
- MAS 5.0 is Affymetrix software for microarray data analysis
- Ad hoc background procedure used
- For summarization, they use: $\mathrm{Signal} = \mathrm{TukeyBiweight}\{\log(PM_j - MM_j^{*})\}$
- Tukey biweight: $B(x) = \left(1 - (x/c)^2\right)^2$ if $|x| < c$, and $B(x) = 0$ otherwise
- Ad hoc scale normalization used
- See also the PhD thesis by Astrand
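The biweight function B(x) can be turned into a robust average as follows. This is a minimal one-step sketch, not the MAS 5.0 implementation: the function names, the tuning constant c = 5, the small eps guard, and the use of the median and MAD for centering and scaling are all assumptions made for illustration.

```python
from statistics import median


def biweight(x, c):
    """Tukey biweight B(x): smooth down-weighting, zero weight beyond |x| >= c."""
    return (1 - (x / c) ** 2) ** 2 if abs(x) < c else 0.0


def tukey_biweight_mean(values, c=5.0, eps=1e-4):
    """One-step biweight estimate of location.

    Distances from the median are scaled by the median absolute deviation
    (MAD); eps avoids division by zero when the MAD is 0.
    """
    m = median(values)
    mad = median(abs(v - m) for v in values)
    weights = [biweight((v - m) / (mad + eps), c) for v in values]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# The outlier 10.0 receives weight 0, so the estimate stays close to 1.0.
print(tukey_biweight_mean([1.0, 1.1, 0.9, 1.0, 10.0]))
```

In the MAS 5.0 setting the inputs would be the values log(PM_j - MM_j*) of one probe set, so a few badly behaved probe pairs do not dominate the summary.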
2 replicate arrays
- Expression values from corresponding probes are highly correlated
- Expression values are not correlated when probes are randomly partitioned
We have to deal with variations!
MvA Plots
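An MvA plot compares two arrays probe by probe: M is the log ratio of the two intensities and A is their average log intensity. The sketch below computes the plot coordinates; the function name is hypothetical and base-2 logs are assumed, as is conventional for microarray data.

```python
import math


def mva_values(array1, array2):
    """Return (M, A) lists for two arrays of positive intensities.

    M = log2(x1) - log2(x2)        (log ratio, plotted on the y-axis)
    A = (log2(x1) + log2(x2)) / 2  (average log intensity, x-axis)
    """
    m_values, a_values = [], []
    for x1, x2 in zip(array1, array2):
        l1, l2 = math.log2(x1), math.log2(x2)
        m_values.append(l1 - l2)
        a_values.append((l1 + l2) / 2)
    return m_values, a_values


m, a = mva_values([8, 4], [2, 4])
print(m, a)
```

For replicate arrays most points should scatter around M = 0; a trend in M as A changes suggests an intensity-dependent bias that normalization should remove.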
Spike-in Experiment
- Replicate RNA samples were hybridized to various arrays
- Some probe sets were spiked in at different concentrations across the different arrays
- Goal was to see if these spiked probe sets stood out as differentially expressed
Analyzing Spike-in data with MAS 5.0
Robust Multi-array Average (RMA)
- Background correction: performed separately for each array
  - Find $E[\mathrm{Sig} \mid \mathrm{Sig} + \mathrm{Bgd} = PM]$
  - Bgd is normal and Sig is exponential
- Normalization: uses quantile normalization to achieve identical empirical distributions of intensities on all arrays
- Summarization: performed separately for each probe set by fitting a probe-level additive model
  - Uses the median polish algorithm to robustly estimate expression on a specific chip
- Also see GCRMA [Wu, Irizarry et al., 2004] & PhD thesis by Astrand
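The summarization step can be sketched with a plain implementation of Tukey's median polish. This is an illustrative sketch, not the RMA code: the function names, iteration limit, and tolerance are assumptions, and the input is taken to be the background-corrected, normalized log2 intensities of one probe set, with one row per probe and one column per array.

```python
from statistics import median


def median_polish(table, max_iter=10, tol=1e-6):
    """Fit table[i][j] = overall + row_eff[i] + col_eff[j] + residual[i][j].

    Rows are probes and columns are arrays. Medians are swept out of rows
    and columns alternately until the sweeps no longer change anything.
    """
    resid = [list(row) for row in table]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep row medians into the row (probe) effects.
        row_med = [median(row) for row in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [v - row_med[i] for v in resid[i]]
        shift = median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Sweep column medians into the column (array) effects.
        col_med = [median(resid[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        shift = median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        if max(abs(v) for v in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid


def summarize_probe_set(table):
    """Expression estimate per array: overall level plus that array's effect."""
    overall, _, col_eff, _ = median_polish(table)
    return [overall + c for c in col_eff]


# Three probes (rows) measured on three arrays (columns), log2 scale.
probe_set = [[7, 9, 11], [8, 10, 12], [9, 11, 13]]
print(summarize_probe_set(probe_set))
```

Because medians are used instead of means, a single outlying probe on one array ends up in the residuals and has little influence on the per-array estimates.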
Analyzing Spike-in data with RMA
MvA and q-q plots
[Figure panels: MAS 4.0, MAS 5.0]
MvA and q-q plots
[Figure panels: MBEI, RMA]
Before and after quantile normalization
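Quantile normalization forces every array to have the same empirical distribution of intensities. A minimal sketch follows, assuming all arrays have the same number of probes; the function name is hypothetical, and ties are broken by probe position, which is a simplification.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of arrays (one list of intensities per array).

    The reference distribution is the mean, across arrays, of the k-th
    smallest value. Each value is then replaced by the reference value
    at its rank within its own array.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by rank
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized


print(quantile_normalize([[5, 2, 3], [4, 1, 9]]))
```

After normalization every array contains exactly the same set of values; only the assignment of values to probes differs between arrays.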
Bioconductor
- Bioconductor is an open source and open development software project for the analysis of biomedical and genomic data
- World-wide project started in 2001
- R and the R package system are used to design and distribute software
- A commercial version of the Bioconductor software is called ArrayAnalyzer
R: A Statistical Programming Language
- Try the tutorial at: [http://www.cyclismo.org/tutorial/r/]
- Also at: [http://www.math.ilstu.edu/dhkim/rstuff/rtutor.html]
Installing a package from Bioconductor
Let's consider LIMMA: Linear Models for Microarray Data. It is a software package for the analysis of gene expression microarray data, especially the use of linear models for analyzing designed experiments and the assessment of differential expression. The package includes pre-processing capabilities for two-color spotted arrays. The differential expression methods apply to all array platforms and treat Affymetrix, single-channel, and two-channel experiments in a unified way.
Here is an installation script:
  > source("http://www.bioconductor.org/biocLite.R")
  > biocLite("limma")
  > biocLite("statmod")
If you want to install some other package (say affy), then you type:
  > biocLite("affy")
Analyzing E. coli Lrp Data (Affymetrix)
- Follow the instructions in Section 8.3 of the LIMMA User's Guide (http://pbil.univ-lyon1.fr/library/limma/doc/usersguide.html)
- The data for the experiment are not available from the address given in Sec. 8.3; get them from: http://cybert.microarray.ics.uci.edu/tutorial/affy%20data/