Chapter 10 Expression Analysis
10.12 Summary, Internet Resources, Further Reading, and References
---
Summary
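This chapter provides a roadmap for the analysis of gene expression data; because the field changes so quickly, the roadmap is necessarily incomplete. Since the first genome-wide expression analyses, carried out in the early 1990s by sequencing expressed sequence tags, the field has been reshaped by changes in technology platforms, advances in analytical methods, and an explosion of ancillary data, such as the genome sequences of many species and their gene annotations.
Which software tool or analytical method to use is always open to debate, and the approach that looks best today may be superseded by a newer one tomorrow. The principles of good experimental design and sound, robust analytical practice do not change, however.
This chapter is therefore best read not as a cookbook to be followed step by step but as a roadmap: it points to an analysis path that is more likely to succeed and to increase confidence in the results. In that spirit, we hope the methods outlined here serve as a useful introduction and practical guide to expression analysis.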
---
Internet Resources
---
Further Reading
Brazma, A., Hingamp, P., Quackenbush, J. et al. (2001). Minimum information about a microarray experiment (MIAME) – toward standards for microarray data. Nat. Genet. 29 (4): 365–371. https://doi.org/10.1038/ng1201-365.
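This foundational paper established the basic principles for publicly sharing well-annotated genomic datasets, and those principles still hold up today. The reporting requirements set out in the MIAME standard should be considered carefully when designing and writing up any large-scale experiment.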
Conesa, A., Madrigal, P., Tarazona, S. et al. (2016). A survey of best practices for RNA-seq data analysis. Genome Biol. 17:13. https://doi.org/10.1186/s13059-016-0881-8.
Conesa, A., Madrigal, P. et al. (2016). Erratum to: A survey of best practices for RNA-seq data analysis. Genome Biol. 17 (1): 181. https://doi.org/10.1186/s13059-016-1047-4.
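This paper (together with its erratum) gives a good overview of RNA-seq data analysis methods and offers guidance on best practices. Although the field is still evolving rapidly, those best practices rest largely on the hard-won lessons of the DNA microarray era.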
Ching, T., Huang, S., and Garmire, L.X. (2014). Power analysis and sample size estimation for RNA-seq differential expression. RNA 20 (11): 1684–1696. https://doi.org/10.1261/rna.046011.114.
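Power calculations for gene expression analysis have always been difficult, mainly because both expression levels and variances differ widely from gene to gene. This paper offers useful methodological guidance on estimating statistical power and the sample size needed for RNA-seq studies.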
Paulson, J.N., Chen, C.Y., Lopes-Ramos, C.M. et al. (2017). Tissue-aware RNA-seq processing and normalization for heterogeneous and sparse data. BMC Bioinf. 18 (1): 437. https://doi.org/10.1186/s12859-017-1847-x.
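This paper describes a simple pipeline for data quality control and normalization that can be used to check (some) sample annotations and to normalize data from heterogeneous samples.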
Glass, K., Huttenhower, C., Quackenbush, J., and Yuan, G.C. (2013). Passing messages between biological networks to refine predicted interactions. PLoS One 8 (5): e64832. https://doi.org/10.1371/journal.pone.0064832.
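There are many methods for inferring gene regulatory networks. The authors developed this one on the basis of two observations: first, that transcription factors regulate gene expression; and second, that network structure itself changes between phenotypes.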
Hung, J.H., Yang, T.H., Hu, Z. et al. (2012). Gene set enrichment analysis: performance evaluation and usage guidelines. Briefings Bioinf. 13 (3): 281–291. https://doi.org/10.1093/bib/bbr049.
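Although gene set enrichment methods continue to evolve, this review gives a sound comparison of many of them, points out their respective strengths and weaknesses, and offers insightful advice on best practices.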
---
References
Alexa, A., Rahnenfuhrer, J., and Lengauer, T. (2006). Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics. 22 (13): 1600–1607. https://doi.org/10.1093/bioinformatics/btl140.
Anders, S. and Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biol. 11 (10): R106. https://doi.org/10.1186/gb-2010-11-10-r106.
Beck, A.H., Knoblauch, N.W., Hefti, M.M. et al. (2013). Significance analysis of prognostic signatures. PLoS Comput. Biol. 9 (1): e1002875. https://doi.org/10.1371/journal.pcbi.1002875.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Series B Methodol. 57 (1): 289–300.
Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Ann. Statist. 29 (4): 1165–1188. https://doi.org/10.1214/aos/1013699998.
Bolstad, B.M., Irizarry, R.A., Astrand, M., and Speed, T.P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19 (2): 185–193. https://doi.org/10.1093/bioinformatics/19.2.185.
Bolstad, B.M., Collin, F., Simpson, K.M. et al. (2004). Experimental design and low-level analysis of microarray data. Int. Rev. Neurobiol. 60: 25–58.
Bray, N.L., Pimentel, H., Melsted, P., and Pachter, L. (2016). Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34 (5): 525–527. https://doi.org/10.1038/nbt.3519.
Brettschneider, J., Collin, F., Bolstad, B.M., and Speed, T.P. (2008). Quality assessment for short oligonucleotide microarray data. Technometrics. 50 (3): 241–264.
Butler, A., Hoffman, P., Smibert, P. et al. (2018). Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36 (5): 411–420. https://doi.org/10.1038/nbt.4096.
Callow, M.J., Dudoit, S., Gong, E.L. et al. (2000). Microarray expression profiling identifies genes with altered expression in HDL-deficient mice. Genome Res. 10 (12): 2022–2029.
Chen, W., Gardeux, V., Meireles-Filho, A., and Deplancke, B. (2017). Profiling of single-cell transcriptomes. Curr. Protoc. Mouse Biol. 7 (3): 145–175. https://doi.org/10.1002/cpmo.30.
Cole, M.B., Risso, D., Wagner, A. et al. (2018). Performance assessment and selection of normalization procedures for single-cell RNA-seq. bioRxiv biorxiv.org/content/early/2018/05/18/235382.abstract.
De Jay, N., Papillon-Cavanagh, S., Olsen, C. et al. (2013). mRMRe: an R package for parallelized mRMR ensemble feature selection. Bioinformatics. 29 (18): 2365–2368. https://doi.org/10.1093/bioinformatics/btt383.
DeRisi, J., Penland, L., Brown, P.O. et al. (1996). Use of a cDNA microarray to analyse gene expression patterns in human cancer. Nat. Genet. 14 (4): 457–460.
Ding, C. and Peng, H. (2005). Minimum redundancy feature selection from microarray gene expression data. J. Bioinform. Comput. Biol. 3 (2): 185–205.
Dobin, A., Davis, C.A., Schlesinger, F. et al. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29 (1): 15–21. https://doi.org/10.1093/bioinformatics/bts635.
Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA. 95 (25): 14863–14868.
Gardeux, V., David, F.P.A., Shajkofci, A. et al. (2017). ASAP: a web-based platform for the analysis and interactive visualization of single-cell RNA-seq data. Bioinformatics. 33 (19): 3123–3125. https://doi.org/10.1093/bioinformatics/btx337.
Gautier, L., Cope, L., Bolstad, B.M., and Irizarry, R.A. (2004). affy—analysis of Affymetrix GeneChip data at the probe level. Bioinformatics. 20 (3): 307–315.
Golub, T.R., Slonim, D.K., Tamayo, P. et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 286 (5439): 531–537.
Haibe-Kains, B., Desmedt, C., Loi, S. et al. (2012). A three-gene model to robustly identify breast cancer molecular subtypes. J. Natl. Cancer Inst. 104 (4): 311–325. https://doi.org/10.1093/jnci/djr545.
Hashimshony, T., Wagner, F., Sher, N., and Yanai, I. (2012). CEL-Seq: single-cell RNA-seq by multiplexed linear amplification. Cell Rep. 2 (3): 666–673. https://doi.org/10.1016/j.celrep.2012.08.003.
Hastie, T., Tibshirani, R., and Friedman, J.H. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York, NY: Springer.
Hastie, T., Tibshirani, R., and Friedman, J.H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2e. New York, NY: Springer.
Hegde, P., Qi, R., Abernathy, K. et al. (2000). A concise guide to cDNA microarray analysis. Biotechniques 29 (3): 548–550, 552–554, 556 passim.
da Huang, W., Sherman, B.T., and Lempicki, R.A. (2009a). Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 37 (1): 1–13. https://doi.org/10.1093/nar/gkn923.
da Huang, W., Sherman, B.T., and Lempicki, R.A. (2009b). Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4 (1): 44–57. https://doi.org/10.1038/nprot.2008.211.
Ioannidis, J.P., Allison, D.B., Ball, C.A. et al. (2009). Repeatability of published microarray gene expression analyses. Nat. Genet. 41 (2): 149–155.
Irizarry, R.A., Bolstad, B.M., Collin, F. et al. (2003). Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 31 (4): e15.
Irizarry, R.A., Warren, D., Spencer, F. et al. (2005). Multiple-laboratory comparison of microarray platforms. Nat. Methods 2 (5): 345–350. https://doi.org/10.1038/nmeth756.
Ishmael, N., Dunning Hotopp, J.C., Ioannidis, P. et al. (2009). Extensive genomic diversity of closely related Wolbachia strains. Microbiology 155 (Pt 7): 2211–2222.
Jiang, L., Chen, H., Pinello, L., and Yuan, G.C. (2016). GiniClust: detecting rare cell types from single-cell gene expression data with Gini index. Genome Biol. 17 (1): 144. https://doi.org/10.1186/s13059-016-1010-4.
Johnson, W.E., Li, C., and Rabinovic, A. (2007). Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 8 (1): 118–127. https://doi.org/10.1093/biostatistics/kxj037.
Kahvejian, A., Quackenbush, J., and Thompson, J.F. (2008). What would you do if you could sequence everything? Nat. Biotechnol. 26 (10): 1125–1133. https://doi.org/10.1038/nbt1494.
Konstantinopoulos, P.A., Cannistra, S.A., Fountzilas, H. et al. (2011). Integrated analysis of multiple microarray datasets identifies a reproducible survival predictor in ovarian cancer. PLoS One 6 (3): e18202.
Lander, E.S., Linton, L.M., Birren, B. et al., International Human Genome Sequencing Consortium (2001). Initial sequencing and analysis of the human genome. Nature 409 (6822): 860–921. https://doi.org/10.1038/35057062.
Langmead, B. and Salzberg, S.L. (2012). Fast gapped-read alignment with Bowtie 2. Nat. Methods 9 (4): 357–359. https://doi.org/10.1038/nmeth.1923.
Langmead, B., Trapnell, C., Pop, M., and Salzberg, S.L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10 (3): R25. https://doi.org/10.1186/gb-2009-10-3-r25.
Larkin, J.E., Frank, B.C., Gavras, H. et al. (2005). Independence and reproducibility across microarray platforms. Nat. Methods. 2 (5): 337–344. https://doi.org/10.1038/nmeth757.
Leek, J.T., Johnson, W.E., Parker, H.S. et al. (2012). The SVA package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics. 28 (6): 882–883. https://doi.org/10.1093/bioinformatics/bts034.
Li, H. and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 25 (14): 1754–1760. https://doi.org/10.1093/bioinformatics/btp324.
Li, P., Piao, Y., Shon, H.S., and Ryu, K.H. (2015). Comparing the normalization methods for the differential analysis of Illumina high-throughput RNA-seq data. BMC Bioinf. 16: 347. https://doi.org/10.1186/s12859-015-0778-7.
Love, M.I., Huber, W., and Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15 (12): 550. https://doi.org/10.1186/s13059-014-0550-8.
van der Maaten, L. and Hinton, G.E. (2008). Visualizing high-dimensional data using t-SNE. J. Machine Learn. Res. 9: 2579–2605. prlab.tudelft.nl/sites/default/files/vandermaaten08a.pdf.
Macosko, E.Z., Basu, A., Satija, R. et al. (2015). Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell. 161 (5): 1202–1214. https://doi.org/10.1016/j.cell.2015.05.002.
Michaels, G.S., Carr, D.B., Askenazi, M. et al. (1998). Cluster analysis and data visualization of large-scale gene expression data. Pac. Symp. Biocomput 1998: 42–53.
Nagalakshmi, U., Wang, Z., Waern, K. et al. (2008). The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320 (5881): 1344–1349. https://doi.org/10.1126/science.1158441.
Oron, A.P., Jiang, Z., and Gentleman, R. (2008). Gene set enrichment analysis using linear models and diagnostics. Bioinformatics. 24 (22): 2586–2591. https://doi.org/10.1093/bioinformatics/btn465.
Patro, R., Mount, S.M., and Kingsford, C. (2014). Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat. Biotechnol. 32 (5): 462–464. https://doi.org/10.1038/nbt.2862.
Patro, R., Duggal, G., Love, M.I. et al. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods. 14 (4): 417–419. https://doi.org/10.1038/nmeth.4197.
Paulson, J.N., Chen, C.Y., Lopes-Ramos, C.M. et al. (2017). Tissue-aware RNA-seq processing and normalization for heterogeneous and sparse data. BMC Bioinf. 18 (1): 437. https://doi.org/10.1186/s12859-017-1847-x.
Perou, C.M., Jeffrey, S.S., van de Rijn, M. et al. (1999). Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc. Natl. Acad. Sci. USA. 96 (16): 9212–9217.
Pop, M., Paulson, J.N., Chakraborty, S. et al. (2016). Individual-specific changes in the human gut microbiota after challenge with enterotoxigenic Escherichia coli and subsequent ciprofloxacin treatment. BMC Genomics. 17: 440. https://doi.org/10.1186/s12864-016-2777-0.
Quackenbush, J. (2005). Extracting meaning from functional genomics experiments. Toxicol. Appl. Pharmacol. 207 (2 Suppl): 195–199.
Ramskold, D., Luo, S., Wang, Y.C. et al. (2012). Full-length mRNA-seq from single-cell levels of RNA and individual circulating tumor cells. Nat. Biotechnol. 30 (8): 777–782. https://doi.org/10.1038/nbt.2282.
Robinson, M.D., McCarthy, D.J., and Smyth, G.K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26 (1): 139–140. https://doi.org/10.1093/bioinformatics/btp616.
Schena, M., Shalon, D., Davis, R.W., and Brown, P.O. (1995). Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 270 (5235): 467–470.
Simon, R., Radmacher, M.D., and Dobbin, K. (2002). Design of studies using DNA microarrays. Genet. Epidemiol. 23 (1): 21–36.
Spellman, P.T., Sherlock, G., Zhang, M.Q. et al. (1998). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 9 (12): 3273–3297.
Subramanian, A., Tamayo, P., Mootha, V.K. et al. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA. 102 (43): 15545–15550. https://doi.org/10.1073/pnas.0506580102.
Toker, L., Feng, M., and Pavlidis, P. (2016). Whose sample is it anyway? Widespread misannotation of samples in transcriptomics studies. F1000Res. 5: 2103. https://doi.org/10.12688/f1000research.9471.2.
Tsoucas, D. and Yuan, G.C. (2018). GiniClust2: a cluster-aware, weighted ensemble clustering method for cell-type detection. Genome Biol. 19 (1): 58. https://doi.org/10.1186/s13059-018-1431-3.
Venet, D., Dumont, J.E., and Detours, V. (2011). Most random gene expression signatures are significantly associated with breast cancer outcome. PLoS Comput. Biol. 7 (10): e1002240. https://doi.org/10.1371/journal.pcbi.1002240.
Venter, J.C., Adams, M.D., Myers, E.W. et al. (2001). The sequence of the human genome. Science. 291 (5507): 1304–1351.
Wen, X., Fuhrman, S., Michaels, G.S. et al. (1998). Large-scale temporal gene expression mapping of central nervous system development. Proc. Natl. Acad. Sci. USA. 95 (1): 334–339.
Wilson, C.L. and Miller, C.J. (2005). Simpleaffy: a BioConductor package for Affymetrix Quality Control and data analysis. Bioinformatics. 21 (18): 3683–3685.
Yang, A., Troup, M., Lin, P., and Ho, J.W. (2017). Falco: a quick and flexible single-cell RNA-seq processing framework on the cloud. Bioinformatics. 33 (5): 767–769. https://doi.org/10.1093/bioinformatics/btw732.