Cai Tao - Bioinformatics as I See It: Bioinformatics = Data + Algorithm

肇雪兰

Published 2018/05/13 in the Technology category

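Cai Tao, director of the Sequencing Center at the National Institute of Biological Sciences, Beijing (NIBS), was the only speaker at this forum to discuss big-data practice from the perspective of bioinformatics. In his view, big data from the internet and data collected by traditional methods in biology, medicine, and the life sciences each have their own characteristics, and internet big data cannot yet match the data of those three fields in volume. Briefly: first, cells, tissues, and similar structures are alive; their functions, expression levels, and even molecular structures change continuously over time. Constrained by current acquisition technology and data curation, the data offered by today's databases are mostly static, which leaves researchers facing small sample sizes. That does not mean the real world is data-poor, because dynamics, or motion, is an intrinsic property of living processes. Second, even from a static point of view, internet big data falls far short of those three fields in feature combinations. These fields routinely face a combinatorial explosion of features; predicting the three-dimensional structures of the protein molecules of all known species, for example, already far exceeds humanity's current scientific computing capacity. Bioinformatics is not about to die out; it has only just begun. It is precisely because biological information is so important that bioinformatics has been absorbed into every aspect of the related fields. In his personal view, the field is still in the data-collection stage: the data are far from rich enough and still fall short of reality, so major results in bioinformatics will have to wait. As the saying goes, even the cleverest cook cannot make a meal without rice.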

Text content
1. Bioinformatics as I See It: Bioinformatics = Data + Algorithm. Cai Tao, caitao@nibs.ac.cn
2. Bioinformatics Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data. --NIH Bioinformatics Definition Committee
3. The Central dogma [Figure: DNA (A, G, C, T) → RNA (A, G, C, U) → protein (amino acids A R N D C E Q G H I L K M F P S T W Y V)]
4. Primary database • NCBI GenBank /USA • EMBL-EBI resource /EU • DDBJ /Japan
6. >gi|631226408|ref|NP_001278826.1| insulin preproprotein [Homo sapiens] MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
7. The “Big” Data • From 1982 to the present, the number of bases in GenBank has doubled approximately every 18 months http://www.ncbi.nlm.nih.gov/genbank/statistics
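Doubling every 18 months means the database grows by a factor of 2^(t/1.5) after t years. A small sketch of that arithmetic; the function name is illustrative and not from the slides.

```python
def growth_factor(years, doubling_months=18.0):
    """Fold increase after `years` if the size doubles every `doubling_months`."""
    return 2.0 ** (years * 12.0 / doubling_months)

# 3 years is two doublings; 15 years is ten doublings.
print(growth_factor(3))   # 4.0
print(growth_factor(15))  # 1024.0
```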
8. DNA Sequencer
9. IBM transistor sequencer
10. CryoEM
11. Mass Spec
12. The Central Dogma in 21st century
13. Secondary database • UCSC database • GPCR database
14. Database search “alignment” • Longest Common Subsequence • Smith-Waterman algorithm • Heuristic search (BLAST, BLAT, Burrows-Wheeler Aligner, etc.) • Example: Sequence 1 = A--CACACTA, Sequence 2 = AGCACAC-A
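A minimal sketch of Smith-Waterman local alignment, run on the two example sequences with their gaps removed. The scoring values (match +2, mismatch -1, linear gap -1) are assumptions chosen for illustration; the slide does not give a scoring scheme, so the alignment printed here need not match the one on the slide.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best score ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

score, top, bottom = smith_waterman("ACACACTA", "AGCACACA")
print(score)
print(top)
print(bottom)
```

The table fill costs O(len(a) × len(b)) time and memory, which is why database-scale search relies on the heuristic tools listed above.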
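15. Hidden Markov Model • GenScan • Pfam/HMMER [Figure: two-state HMM with states LGC and HGC; transition probabilities shown: 0.99, 0.01, 0.9, 0.1; emission probabilities: A 0.4, C 0.1, G 0.1, T 0.4 and A 0.05, C 0.4, G 0.5, T 0.05]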
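A sketch of Viterbi decoding for a two-state HMM using the numbers on this slide. The mapping of those numbers to the model is an assumption: LGC is read as a low-GC state that stays put with probability 0.99 and emits A/T at 0.4 each, HGC as a high-GC state that stays put with probability 0.9 and emits C at 0.4 and G at 0.5. The uniform start probabilities are also assumed; the slide does not show them.

```python
import math

STATES = ("L", "H")  # L = LGC, H = HGC (assumed reading of the figure)
START = {"L": 0.5, "H": 0.5}  # assumption: not shown on the slide
TRANS = {"L": {"L": 0.99, "H": 0.01}, "H": {"H": 0.9, "L": 0.1}}
EMIT = {
    "L": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "H": {"A": 0.05, "C": 0.4, "G": 0.5, "T": 0.05},
}

def viterbi(seq):
    """Most probable state path for a non-empty DNA string, in log space."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s]: best predecessor of state s at position t + 1
    for ch in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][ch])
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi("ATATATAT"))  # LLLLLLLL
print(viterbi("GCGCGCGC"))  # HHHHHHHH
```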
16. Data mining in biological data
17. Human genetic study -- a case study • Background – SNP (biomarker) • Candidate-gene based study – HIV opportunistic infection • Whole genome wide study – Hepatitis C
18. SNP (single nucleotide polymorphism) • Definition – DNA sequence variations that occur when a single nucleotide (A, T, C, or G) in the genome sequence is altered • Common human variation – 11 million (MAF>=1%) – SNP frequency varies between populations • 1000 Genomes Project – Genotyped 25 populations, ~2500 individuals • Detection methods – Sequencing – PCR-based methods – Chip (Illumina, Affy) – …
19. SNP detection
20. Genotyping platform on Chip • Affy – Genome-wide Human SNP array 6.0 • Illumina – Human 1M-duo • Coverage
22. Association study in infection • Infectious diseases exert evolutionary pressure on human populations – Malaria and the sickle-cell anaemia risk allele • Advantages of association studies – Linkage analysis needs multiple affected and unaffected relatives – Family-based, case-control or cohort data – Fine localization and identification of causative loci with high-throughput technology
23. Association study methodology • Chi-squared • Regression
            Disease   Unaffected
Allele 1    p1D       p1U
Allele 2    p2D       p2U
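A minimal sketch of the chi-squared test on a 2x2 allele-by-status table. The allele counts in the example are hypothetical, not data from the talk. For 1 degree of freedom the p-value can be written with the complementary error function, so no statistics library is needed.

```python
import math

def allelic_chi2(a1_disease, a1_unaffected, a2_disease, a2_unaffected):
    """Pearson chi-squared (1 df, no continuity correction) for a 2x2 allele count table."""
    a, b, c, d = a1_disease, a1_unaffected, a2_disease, a2_unaffected
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of chi-squared with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: allele 1 seen 20 times in cases and 10 in controls, allele 2 the reverse.
chi2, p = allelic_chi2(20, 10, 10, 20)
print(round(chi2, 3), p)
```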
24. Association study methodology • Confounding factors • Power – 1 - P(false negative) – Case-control study: genetic effect, allele frequency …
25. European population structure 1,387 samples ~200K SNPs Novembre et al, Nature, 2008
26. GWAS-basic analysis • Quality control – MAF>0.05, HWE>0.001, GENO>0.95 – Remove duplicates or other mistakes • Association analysis – Genetic model: Allelic (chisq 1df), Additive, Dominant, Recessive, Cochran-Armitage trend test, Genotypic test (chisq 2df) – QQ plot and Manhattan plot • Available software – PLINK, GenABEL (R package) …
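A sketch of the per-SNP quality-control filter using the thresholds on this slide. Two things here are assumptions: GENO>0.95 is read as a minimum genotype call rate, and the Hardy-Weinberg p-value comes from a 1-df chi-squared approximation rather than the exact test that tools such as PLINK implement. The function name and the count-based input are illustrative.

```python
import math

def snp_passes_qc(n_aa, n_ab, n_bb, n_missing,
                  maf_min=0.05, hwe_min=0.001, call_min=0.95):
    """Apply MAF, HWE and call-rate filters to one SNP given its genotype counts."""
    called = n_aa + n_ab + n_bb
    if called == 0:
        return False
    call_rate = called / (called + n_missing)

    p = (2 * n_aa + n_ab) / (2 * called)  # frequency of allele A
    q = 1 - p
    maf = min(p, q)
    if maf == 0:
        return False  # monomorphic SNP: fails MAF, and HWE is undefined

    # Chi-squared goodness of fit against Hardy-Weinberg expected counts (1 df).
    expected = (called * p * p, 2 * called * p * q, called * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    hwe_p = math.erfc(math.sqrt(chi2 / 2))

    return maf > maf_min and hwe_p > hwe_min and call_rate > call_min

print(snp_passes_qc(250, 500, 250, 0))    # True: common, in HWE, fully called
print(snp_passes_qc(500, 0, 500, 0))      # False: no heterozygotes, fails HWE
print(snp_passes_qc(990, 10, 0, 0))       # False: MAF is 0.005
print(snp_passes_qc(250, 500, 250, 100))  # False: call rate is about 0.91
```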
27. GWAS-advanced analysis • Population stratification – χ² divided by the genomic inflation factor – IBS clustering in PLINK – PCA in EIGENSTRAT • Imputation – MACH, IMPUTE… – MACH cutoff (>0.9) means free genotyping • Meta-analysis – Inverse-variance pooling method – Carefully prepare the data (same population, same reference allele, same phenotype unit, etc.)
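Two of the steps on this slide reduce to short formulas: genomic control divides each 1-df chi-squared statistic by the inflation factor λ (the median observed statistic over the median of the chi-squared distribution with 1 df, about 0.4549), and fixed-effect meta-analysis pools per-study effects with weights 1/SE². The sketch below assumes 1-df statistics and per-study effect sizes with standard errors as inputs; the function names and example numbers are hypothetical, and leaving statistics unchanged when λ < 1 is a common convention rather than something stated in the talk.

```python
import math
from statistics import median

CHI2_1DF_MEDIAN = 0.4549364  # median of the chi-squared distribution with 1 df

def genomic_control(chi2_stats):
    """Return (lambda, corrected statistics) for a list of 1-df chi-squared values."""
    lam = median(chi2_stats) / CHI2_1DF_MEDIAN
    lam = max(lam, 1.0)  # convention assumed here: do not inflate when lambda < 1
    return lam, [x / lam for x in chi2_stats]

def inverse_variance_meta(betas, standard_errors):
    """Fixed-effect pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, betas)) / total
    return pooled, math.sqrt(1.0 / total)

# Hypothetical inputs for illustration only.
lam, corrected = genomic_control([0.2, 0.6, 1.1, 3.9, 0.9])
print(lam, corrected)
print(inverse_variance_meta([0.2, 0.4], [0.1, 0.1]))
```

Pooling only makes sense when the effect sizes refer to the same allele and the same phenotype unit in every study, which is the data-preparation point the slide makes.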
28. TLR4 SNPs association study in HIV opportunistic infection
29. TLR (Toll-like receptors) history • Toll: functions in the embryonic dorsal-ventral development of Drosophila (1988, Cell 52:269) • Drosophila with a loss-of-function mutation in Toll exhibit high susceptibility to fungal infection (1996, Cell 86:973) • TLR: the so-called Toll-like receptors, human homologs of Toll (1997, 1998) • TLR4 is the LPS sensor in both mice and humans (1998, Science 282:2085) • Inflammatory caspases are innate immune receptors for intracellular LPS (2014, Nature 514:187)
30. The TLR4 D299G SNP Influences Susceptibility to Opportunistic Infections in the Swiss HIV Cohort Study • 1585 Caucasian patients are included from the SHCS • Poisson regression used to detect association – Neutral model – Additive model • Adjusted for cofactors such as age, sex, infection risk factors and year of SHCS entry • OIs – Fungal infection • severe candidiasis (mainly candida oesophagitis) • Pneumocystis jirovecii pneumonia (PCP) – Viral infection • HSV infection (mucocutaneous ulceration or HSV disease) • VZV infection (e.g. multidermatoma or relapsing zona) • CMV infection (CMV disease or retinitis) – Mycobacterial infection • tuberculosis – Parasite infection • toxoplasmosis • The permutation false discovery rate (FDR) – the SNP genotypes are randomly shuffled across all patients, and the same Poisson regression is rerun. The Q value is the fraction of 1000 shuffles in which the random p-value is less than the real one
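The permutation procedure described on this slide can be sketched as follows. The study reran an adjusted Poisson regression on every shuffle; to keep the example self-contained, `toy_pvalue` below is a stand-in (a 2x2 chi-squared test of carriers against non-carriers) and is not the model used in the study. All names and the example data are hypothetical.

```python
import math
import random

def toy_pvalue(genotypes, events):
    """Stand-in for the regression: chi-squared p-value, carriers vs non-carriers."""
    pairs = list(zip(genotypes, events))
    a = sum(1 for g, e in pairs if g > 0 and e)
    b = sum(1 for g, e in pairs if g > 0 and not e)
    c = sum(1 for g, e in pairs if g == 0 and e)
    d = sum(1 for g, e in pairs if g == 0 and not e)
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # no variation in genotype or outcome: nothing to test
    chi2 = (a + b + c + d) * (a * d - b * c) ** 2 / denom
    return math.erfc(math.sqrt(chi2 / 2))

def permutation_q(genotypes, events, pvalue_fn, n_perm=1000, seed=0):
    """Fraction of genotype shuffles whose p-value is below the real p-value."""
    rng = random.Random(seed)
    real_p = pvalue_fn(genotypes, events)
    shuffled = list(genotypes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks the genotype-outcome link, keeps both margins
        if pvalue_fn(shuffled, events) < real_p:
            hits += 1
    return hits / n_perm

# Hypothetical data: 0/1 carrier status and 0/1 event indicator per patient.
genotypes = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
events = [1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
print(permutation_q(genotypes, events, toy_pvalue))
```

Passing the test as a callback keeps the shuffling logic independent of the model, so a real regression could replace the stand-in without changing `permutation_q`.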
31. CD4 distribution of OIs [Figure: CD4 curve of patient 10190, CD4 (0-1000) against year (1986-2007); CD4 boxplot for different OIs: CAND_SEV N=179, PCP N=136, HSV N=49, VZV N=189, CMV N=45, TBC N=30, TOXO N=34]
32. Neutral model: Incidence of OIs under immune suppression by TLR4 SNP
OI        CD4+ below  TLR_D299G  Days at risk  Case of OI  Others  IR (per year)  IRR                   Pvalue
CAND_SEV  300         0/0        1021865       97          711     0.0347         -                     -
CAND_SEV  300         0/1        96420         11          72      0.0417         1.2018 (1.652 0.874)  0.563
CAND_SEV  300         1/1        3953          0           2       0              0 (Inf 0)             1
PCP       200         0/0        477977        54          403     0.0413         -                     -
PCP       200         0/1        49465         11          37      0.0812         1.9684 (2.74 1.414)   0.041
PCP       200         1/1        2662          0           2       0              0 (Inf 0)             1
HSV       200         0/0        477977        29          428     0.0222         -                     -
HSV       200         0/1        49465         4           44      0.0295         1.3328 (2.272 0.782)  0.59
HSV       200         1/1        2662          0           2       0              0 (Inf 0)             1
VZV       400         0/0        1841694       82          1083    0.0163         -                     -
VZV       400         0/1        190490        12          113     0.023          1.4149 (1.927 1.039)  0.262
VZV       400         1/1        4419          0           2       0              0 (Inf 0)             1
CMV       100         0/0        193765        32          239     0.0603         -                     -
CMV       100         0/1        21441         5           20      0.0852         1.4121 (2.284 0.873)  0.473
CMV       100         1/1        0             0           0       NaN            -
TBC       400         0/0        1841694       12          1153    0.0024         -                     -
TBC       400         0/1        190490        4           121     0.0077         3.2227 (5.741 1.809)  0.043
TBC       400         1/1        4419          0           2       0              0 (Inf 0)             1
TOXO      200         0/0        477977        21          436     0.016          -                     -
TOXO      200         0/1        49465         6           42      0.0443         2.7608 (4.386 1.738)  0.028
TOXO      200         1/1        2662          0           2       0              0 (Inf 0)             1
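The incidence rates and rate ratios in this table can be recomputed from the case counts and days at risk. The sketch below uses the PCP figures for the 0/0 and 0/1 genotypes; converting days to years with 365.25 days per year is an assumption, chosen because it reproduces the rounded values in the table. The interval and p-value come from the regression and are not reproduced here.

```python
DAYS_PER_YEAR = 365.25  # assumption: matches the table's rounded rates

def incidence_rate_per_year(cases, days_at_risk):
    """Events per person-year of follow-up."""
    return cases / days_at_risk * DAYS_PER_YEAR

def incidence_rate_ratio(cases_exposed, days_exposed, cases_ref, days_ref):
    """Crude rate ratio of the exposed group against the reference group."""
    return (incidence_rate_per_year(cases_exposed, days_exposed)
            / incidence_rate_per_year(cases_ref, days_ref))

# PCP with CD4 below 200: genotype 0/0 is the reference, 0/1 the exposed group.
ref_rate = incidence_rate_per_year(54, 477977)
het_rate = incidence_rate_per_year(11, 49465)
irr = incidence_rate_ratio(11, 49465, 54, 477977)
print(round(ref_rate, 4), round(het_rate, 4), round(irr, 4))  # 0.0413 0.0812 1.9684
```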
33. Additive model
OI        CD4 cutoff  Incidence Rate Ratio  95% CI   Pvalue  Qvalue
CAND_SEV  300         1.1                   0.8-1.5  0.7     0.7
PCP       200         2.0                   1.4-2.7  0.047   0.040
HSV       200         1.3                   0.8-2.2  0.6     0.6
VZV       400         1.4                   1.0-1.9  0.3     0.3
CMV       100         1.4                   0.9-2.2  0.3     0.3
TBC       400         2.6                   1.5-4.3  0.077   0.057
TOXO      200         2.4                   1.6-3.7  0.041   0.031
34. Genome wide association study in Hepatitis C [Figure: map with legend categories < 1%, 1.0-2.4%, 2.5-4.9%, 5-10%, > 10%, NA]
35. Method -- clinic • Chronic HCV infection – anti-HCV seropositivity (using ELISA/RIBA) and detectable HCV RNA by quantitative assays • Spontaneous clearance – HCV seropositivity and undetectable HCV RNA in patients without previous antiviral treatment • Response to treatment – at least 80% of the recommended dose of PEG-IFN/RBV during the first 12 weeks – Sustained viral response (SVR) • undetectable HCV RNA in serum >24 weeks after treatment termination – Non-response (NR) • Others
36. Method -- genotyping • Illumina 1M-Duo chip for SCCS • Illumina Humanhap650-Quad beadchips for SHCS study (including part of work using Illumina Humanhap550) • Illumina Beadstudio software used for genotype calling
37. Method -- association analysis • Quality control – MAF>0.01, HWE>0.001, GENO>0.95, mind>0.95 – Remove duplicates and other cryptically related samples • Basic association analysis – Allelic based analysis or Cochran-Armitage trend test – Logistic regression considering covariates – Significance cutoff 5E-8 – QQ plot and Manhattan plot • Applied software – PLINK and Haploview
38. Spontaneous Clearance demographic table
                                  SHCS                                            SCCS
Characteristics (N, proportion)   Chronic Infection  Spontaneous Clearance  P      Chronic Infection  Spontaneous Clearance  P
N                                 201                199                           828                87 (+73 DE)
Age (median, IQR)                 33.75 (8.93)       33.86 (9.65)           0.7    44.15 (14.03)      37.47 (8.59)           <0.001
Male sex                          105 (52.2%)        136 (68.3%)            0.001  516 (62.3%)        48 (55.2%)             0.2
HBV antigen positive              21 (10.4%)         8 (4%)                 0.01   8 (1%)             4 (4.6%)               0.03
Log HCV RNA (median, IQR)         6.086 (1.346)                                    5.877 (0.993)
HCV genotype 1                    78 (39.2%)                                       396 (47.8%)
HCV genotype 2                    5 (2.5%)                                         83 (10%)
HCV genotype 3                    60 (30.2%)                                       240 (29%)
HCV genotype 4                    22 (11.1%)                                       70 (8.5%)
HCV genotype Other/unknown        36 (18%)                                         39 (4.7%)
39. Response to Treatment demographic table
Characteristics (N, proportion)   NR             SVR            P
N                                 174            315
Age (median, IQR)                 19 (9)         20 (9)         0.04
Male sex                          119 (68.4%)    185 (58.7%)    0.04
HBV antigen positive              2 (1.1%)       3 (1%)         0.9
Log HCV RNA (median, IQR)         5.964 (0.844)  5.835 (1.226)  <0.001
HCV genotype 1                    105 (60.3%)    94 (29.8%)     Ref
HCV genotype 2                    8 (4.6%)       53 (16.8%)     <0.001
HCV genotype 3                    29 (16.7%)     142 (45.1%)    <0.001
HCV genotype 4                    19 (10.9%)     17 (5.4%)      1
HCV genotype Other/unknown        13 (7.5%)      9 (2.9%)       0.6
Heavy drinker                     31 (17.8%)     35 (11.1%)     0.03
Liver biopsy: Inflammation        23 (13.2%)     45 (14.3%)     0.5
Liver biopsy: steatosis           85 (48.9%)     150 (47.6%)    0.5
Liver biopsy: Severe fibrosis     55 (31.6%)     59 (18.7%)     0.003
40. IL28B identification in the GWA for response to treatment [Figure label: IL28B]
41. Region association plot of IL28B [Figure: -log10(observed p) and recombination rate (cM/Mb) against chromosome 19 position (44300-44500 kb); points shaded by r2 to rs8099917 (0.00-0.24, 0.25-0.49, 0.50-0.79, 0.80-1.00); labeled SNPs rs12980275 and rs8099917; genes shown: PAPL, PAK4, SYCN, NCCRP1, IL28B, IL28A, LRFN1, IL29, GMFG, PAF1, SAMD4B, MED29]
43. The future ▪ is already here
44. Diagnostically relevant facial gestalt information from ordinary photos (eLife, 2014) [Figure panels: control, Angelman, Apert, Cornelia de Lange, Down, Fragile X, Progeria, Treacher-Collins, Williams-Beuren]