Winner's curse in GWAS and its correction

  1. Winner's curse in GWAS
  2. Winner's curse (WC) correction
  3. The winnerscurse R package
  4. References

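Winner's curse in GWAS

In GWAS, the winner's curse refers to the systematic overestimation of genetic effect sizes caused by the selection step, in which lead SNPs are chosen because they pass the genome-wide significance threshold.

The term originally describes a similar phenomenon in auctions. Even if an item has the same value to every bidder and the bids are unbiased, the winner who finally obtains the item has most likely overestimated its intrinsic value. By analogy, a lead SNP in GWAS is the winner, and its estimated effect size is likely to overestimate the true genetic effect.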
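The effect of selection can be illustrated with a small simulation. This is only a sketch under assumed values: the true effect size, standard error and number of replicates below are hypothetical, and each replicate stands for an independent GWAS estimate of the same variant.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

beta_true = 0.04          # hypothetical true effect size
se = 0.01                 # hypothetical standard error
alpha = 5e-8              # genome-wide significance level
c = norm.isf(alpha / 2)   # two-sided Z-score threshold, about 5.45

# Simulate many independent estimates of the same variant.
beta_obs = rng.normal(loc=beta_true, scale=se, size=100_000)

# Keep only the replicates that pass the significance threshold.
selected = beta_obs[np.abs(beta_obs / se) >= c]

print(f"mean of all estimates:      {beta_obs.mean():.4f}")
print(f"mean of selected estimates: {selected.mean():.4f}")
```

The mean over all replicates stays close to the true value, while the mean over the selected replicates is necessarily larger than the threshold c × SE and therefore larger than the true effect in this setting.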

Winner's curse (WC) correction

Assume that the observed effect size \beta_{Observed} approximately follows:

\beta_{Observed} \sim N(\beta_{True},\sigma^2)

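[Figure: an example of the distribution of \beta_{Observed}]

  • c : the Z-score corresponding to the significance threshold

The expression above is equivalent to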

{{\beta_{Observed} - \beta_{True}}\over{\sigma}} \sim N(0,1)

[Figure: an example of the distribution of {{\beta_{Observed} - \beta_{True}}\over{\sigma}}]

After selection by the threshold, the approximate sampling distribution of \beta_{Observed} (in fact a truncated normal distribution) is:

f(x,\beta_{True}) ={{1}\over{\sigma}} {{\phi({{{x - \beta_{True}}\over{\sigma}}})} \over {\Phi({{{\beta_{True}}\over{\sigma}}-c}) + \Phi({{{-\beta_{True}}\over{\sigma}}-c})}}

where

|{{x}\over{\sigma}}|\geq c

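  • \phi(x) : the probability density function of the standard normal distribution
  • \Phi(x) : the cumulative distribution function of the standard normal distribution

From this approximate sampling distribution, the expected effect size of a SNP that passes selection is: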

E(\beta_{Observed}; \beta_{True}) = \beta_{True} + \sigma {{\phi({{{\beta_{True}}\over{\sigma}}-c}) - \phi({{{-\beta_{True}}\over{\sigma}}-c})} \over {\Phi({{{\beta_{True}}\over{\sigma}}-c}) + \Phi({{{-\beta_{True}}\over{\sigma}}-c})}}

  • \beta_{Observed} is biased.
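  • The size of the bias is determined by \beta_{True}, the standard error \sigma, and the significance threshold used for selection.

For the derivation, see Appendix A of Ghosh, A., Zou, F., & Wright, F. A. (2008). Estimating odds ratios in genome scans: an approximate conditional likelihood approach. The American Journal of Human Genetics, 82(5), 1064-1074.

These expressions can be used to correct effect sizes for the winner's curse, for example by finding the \beta_{True} that maximizes the conditional density f(\beta_{Observed}, \beta_{True}) at the observed estimate.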
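As a rough illustration of the formulas in this section, the sketch below evaluates the expected value of the selected estimate and computes a conditional maximum likelihood estimate numerically. The function names, the example numbers and the choice of a bounded numerical search are assumptions made for illustration; this is not the implementation used by any particular package.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm


def expected_beta_observed(beta_true, se, c):
    """E(beta_Observed; beta_True) for estimates selected with |beta/se| >= c."""
    mu = beta_true / se
    numerator = norm.pdf(mu - c) - norm.pdf(-mu - c)
    denominator = norm.cdf(mu - c) + norm.cdf(-mu - c)
    return beta_true + se * numerator / denominator


def conditional_mle(beta_obs, se, c):
    """beta_True that maximizes the truncated-normal density at beta_obs."""
    def neg_log_density(beta_true):
        mu = beta_true / se
        log_selection_prob = np.log(norm.cdf(mu - c) + norm.cdf(-mu - c))
        # The constant factor 1/se is dropped; it does not affect the maximum.
        return -(norm.logpdf((beta_obs - beta_true) / se) - log_selection_prob)

    # Search between 0 and the observed estimate (shrinkage toward zero).
    lower, upper = sorted((0.0, beta_obs))
    result = minimize_scalar(neg_log_density, bounds=(lower, upper),
                             method="bounded", options={"xatol": 1e-10})
    return result.x


c = norm.isf(5e-8 / 2)  # Z-score threshold for a two-sided P of 5e-8
print(expected_beta_observed(beta_true=0.04, se=0.01, c=c))
print(conditional_mle(beta_obs=0.06, se=0.01, c=c))
```

With these hypothetical inputs the expected selected estimate is larger than the true effect, and the corrected estimate is pulled from the observed value toward zero.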

The winnerscurse R package

The winnerscurse R package can be used to correct for the winner's curse:

https://amandaforde.github.io/winnerscurse/articles/winners_curse_methods.html

References

  • Bazerman, M. H., & Samuelson, W. F. (1983). I won the auction but don’t want the prize. Journal of conflict resolution, 27(4), 618-634.
  • Göring, H. H., Terwilliger, J. D., & Blangero, J. (2001). Large upward bias in estimation of locus-specific effects from genomewide scans. The American Journal of Human Genetics, 69(6), 1357-1369.
  • Zhong, H., & Prentice, R. L. (2008). Bias-reduced estimators and confidence intervals for odds ratios in genome-wide association studies. Biostatistics, 9(4), 621-634.
  • Ghosh, A., Zou, F., & Wright, F. A. (2008). Estimating odds ratios in genome scans: an approximate conditional likelihood approach. The American Journal of Human Genetics, 82(5), 1064-1074.

Also see reference: https://amandaforde.github.io/winnerscurse/articles/winners_curse_methods.html

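Hardy-Weinberg equilibrium (HWE) exact test

  1. Hardy-Weinberg equilibrium
  2. Principle of the HWE exact test
  3. Running the HWE test with PLINK
  4. References

Hardy-Weinberg equilibrium

Review: Hardy-Weinberg equilibrium

Principle of the HWE exact test

Suppose there are N unrelated samples (corresponding to 2N alleles).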

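Under Hardy-Weinberg equilibrium, the exact probability of observing n_{AB} samples with the AB genotype among N samples, given the allele count n_A, is:

P(N_{AB} = n_{AB} | N, n_A) = {{2^{n_{AB}}}N!\over{n_{AA}!n_{AB}!n_{BB}!}} \times {{n_A!n_B!}\over{(2N)!}}

where n_A = 2n_{AA} + n_{AB} and n_B = 2n_{BB} + n_{AB} are the counts of alleles A and B.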

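To compute the P value of the exact test, we sum the probabilities of all possible heterozygote counts n^{*}_{AB} whose probability is no larger than that of the observed count (n_{AB} samples with the AB genotype):

P_{HWE} = \sum_{n^{*}_{AB}} I[P(N_{AB} = n_{AB} | N, n_A) \geq P(N_{AB} = n^{*}_{AB} | N, n_A)] \times P(N_{AB} = n^{*}_{AB} | N, n_A)

I(x) is an indicator function: I(x) = 1 if x is true, and I(x) = 0 otherwise.

In practice, software uses computational shortcuts to avoid the heavy computation this sum would otherwise require; see the HWE algorithm in PLINK.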
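The two formulas above can be evaluated directly for moderate sample sizes. The following is a minimal sketch that works with log-factorials to avoid overflow; the function name is hypothetical and the code is a plain enumeration, not the optimized algorithm used by PLINK.

```python
from math import exp, lgamma, log


def hwe_exact_p(n_aa, n_ab, n_bb):
    """Exact HWE P value from genotype counts (AA, AB, BB)."""
    n = n_aa + n_ab + n_bb      # number of samples N
    n_a = 2 * n_aa + n_ab       # count of allele A
    n_b = 2 * n_bb + n_ab       # count of allele B

    def log_factorial(k):
        return lgamma(k + 1)

    def prob(het):
        """P(N_AB = het | N, n_A) under Hardy-Weinberg equilibrium."""
        hom_a = (n_a - het) // 2
        hom_b = (n_b - het) // 2
        log_p = (het * log(2) + log_factorial(n)
                 - log_factorial(hom_a) - log_factorial(het) - log_factorial(hom_b)
                 + log_factorial(n_a) + log_factorial(n_b) - log_factorial(2 * n))
        return exp(log_p)

    p_observed = prob(n_ab)
    # Possible heterozygote counts have the same parity as n_A.
    candidates = range(n_a % 2, min(n_a, n_b) + 1, 2)
    # A small relative tolerance keeps ties with the observed probability.
    total = sum(p for p in map(prob, candidates) if p <= p_observed * (1 + 1e-9))
    return min(1.0, total)


print(hwe_exact_p(1, 61, 442))
```

For the genotype counts 1/61/442 this enumeration gives a P value of about 0.711.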

Running the HWE test with PLINK

PLINK provides the --hardy option to compute the HWE exact test statistics, and the --hwe option to filter variants based on them:

plink \
    --bfile ${genotypeFile} \
    --hardy \
    --out plink_results

The output looks as follows; the P column gives the result of the HWE exact test:

$ head plink_results.hwe
 CHR              SNP     TEST   A1   A2                 GENO   O(HET)   E(HET)            P 
   1      1:13273:G:C  ALL(NP)    C    G             1/61/442    0.121   0.1172       0.7113
   1      1:14599:T:A  ALL(NP)    A    T             1/88/415   0.1746   0.1626       0.1625
   1      1:14604:A:G  ALL(NP)    G    A             1/88/415   0.1746   0.1626       0.1625
   1      1:14930:A:G  ALL(NP)    G    A             4/409/91   0.8115   0.4851    1.679e-61
   1      1:69897:T:C  ALL(NP)    T    C            7/111/386   0.2202   0.2173            1
   1      1:86331:A:G  ALL(NP)    G    A             0/88/416   0.1746   0.1594      0.02387
   1      1:91581:G:A  ALL(NP)    A    G          137/228/139   0.4524      0.5      0.03271
   1     1:122872:T:G  ALL(NP)    G    T            1/259/244   0.5139   0.3838     8.04e-19
   1     1:135163:C:T  ALL(NP)    T    C             1/91/412   0.1806   0.1675       0.1066

Alternatively, --hwe 1e-6 directly removes SNPs with P below 1e-6:

plink \
    --bfile ${genotypeFile} \
    --hwe 1e-6 \
    --out plink_results
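To inspect which SNPs would fall below such a threshold, the --hardy output can also be parsed directly. The sketch below is only an illustration: it embeds three rows copied from the output shown earlier instead of reading plink_results.hwe from disk, the function name is hypothetical, and skipping rows whose P value is NA is an assumption about how missing values appear.

```python
import io

# Three rows copied from the --hardy output shown earlier.
hwe_text = """ CHR              SNP     TEST   A1   A2                 GENO   O(HET)   E(HET)            P
   1      1:13273:G:C  ALL(NP)    C    G             1/61/442    0.121   0.1172       0.7113
   1      1:14930:A:G  ALL(NP)    G    A             4/409/91   0.8115   0.4851    1.679e-61
   1     1:122872:T:G  ALL(NP)    G    T            1/259/244   0.5139   0.3838     8.04e-19
"""


def snps_failing_hwe(handle, threshold=1e-6):
    """Return the IDs of SNPs whose HWE P value is below the threshold."""
    header = handle.readline().split()
    snp_col, p_col = header.index("SNP"), header.index("P")
    failing = []
    for line in handle:
        fields = line.split()
        # Assumption: rows without a numeric P value are marked NA and skipped.
        if not fields or fields[p_col] == "NA":
            continue
        if float(fields[p_col]) < threshold:
            failing.append(fields[snp_col])
    return failing


print(snps_failing_hwe(io.StringIO(hwe_text)))
```

To use it on a real file, pass an open file handle such as open("plink_results.hwe") instead of the in-memory text.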

References

https://www.cog-genomics.org/plink/1.9/dev#exact

https://www.cog-genomics.org/plink/1.9/basic_stats#hardy

Wigginton, J. E., Cutler, D. J., & Abecasis, G. R. (2005). A note on exact tests of Hardy-Weinberg equilibrium. The American Journal of Human Genetics, 76(5), 887-893.

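Power analysis for GWAS

Type I error, type II error, and statistical power

The table below shows the relationship between the null hypothesis H_0 and the outcome of a statistical test (whether H_0 is rejected):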

                  | H_0 is true                            | H_0 is false
Do not reject H_0 | True negative : 1 - \alpha             | Type II error (false negative) : \beta
Reject H_0        | Type I error (false positive) : \alpha | True positive : 1 - \beta

\alpha : significance level

By definition, statistical power is the probability that a test correctly rejects the null hypothesis, that is, the true positive rate in the table above.

Power = Pr(Reject\ H_0\ |\ H_0\ is\ False) = 1 - \beta


Factors affecting power

  • Total sample size
  • Case and control ratio
  • Effect size of the variant
  • Risk allele frequency
  • Significance threshold

Non-centrality parameter

The non-centrality parameter (NCP) describes the degree of difference between the null hypothesis H_0 and the alternative hypothesis H_1.

Consider the following linear model:

y = \mu +\beta x + \epsilon

The variance of the error term is:

\sigma^2 = Var(y) - Var(x)\beta^2

A single SNP usually explains only a very small fraction of the phenotypic variance, so we can approximate

\sigma^2  \thickapprox Var(y)

Under Hardy-Weinberg equilibrium,

Var(x) = 2f(1-f)

  • f : the allele frequency of the variant

The non-centrality parameter (NCP) of the \chi^2 distribution with 1 degree of freedom is then

\lambda = ({{\beta}\over{SE_{\beta}}})^2

Power for quantitative traits

\lambda = ({{\beta}\over{SE_{\beta}}})^2 \thickapprox N \times {{Var(x)\beta^2}\over{\sigma^2}} \thickapprox N \times {{2f(1-f) \beta^2 }\over {Var(y)}}

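Significance threshold: C = CDF_{\chi^2}^{-1}(1 - \alpha, df=1)

  • CDF_{\chi^2}^{-1}(x) : the inverse of the cumulative distribution function of the \chi^2 distribution

Power = Pr(X > C) = 1 - CDF_{\chi^2}(C, ncp = \lambda, df=1), where the test statistic X follows a non-central \chi^2 distribution with 1 degree of freedom and non-centrality parameter \lambda

  • CDF_{\chi^2}(x, ncp= \lambda) : the cumulative distribution function of the \chi^2 distribution with non-centrality parameter \lambda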
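The calculation for a quantitative trait can be sketched with SciPy's central and non-central \chi^2 distributions. The function name and the example inputs (sample sizes, allele frequency, effect size, and a phenotype variance of 1) are hypothetical, and the code relies on the approximation \sigma^2 \thickapprox Var(y) used above.

```python
from scipy.stats import chi2, ncx2


def quantitative_trait_power(n, maf, beta, var_y=1.0, alpha=5e-8):
    """Approximate power of a 1-df chi-square test for a quantitative trait."""
    ncp = n * 2 * maf * (1 - maf) * beta ** 2 / var_y   # lambda
    threshold = chi2.isf(alpha, df=1)                   # C
    return ncx2.sf(threshold, df=1, nc=ncp)             # 1 - CDF(C; ncp, df=1)


print(quantitative_trait_power(n=10_000, maf=0.3, beta=0.05))
print(quantitative_trait_power(n=50_000, maf=0.3, beta=0.05))
```

With all other inputs fixed, power increases with the sample size, because the non-centrality parameter grows linearly with N.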

Power for large-scale case-control genome-wide association studies

  • P_{case} : risk allele frequency in cases
  • N_{case} : number of cases. The total allele count for cases is then 2N_{case}.
  • P_{control} : risk allele frequency in controls
  • N_{control} : number of controls. The total allele count for controls is then 2N_{control}.

The null hypothesis here is P_{case} = P_{control}, that is, the risk allele frequency is the same in cases and in controls.

To test the difference between the two proportions under the normal approximation, the test statistic is

z = {{P_{case} - P_{control}}\over {\sqrt{ {{P_{case}(1 - P_{case})}\over{2N_{case}}} + {{P_{control}(1 - P_{control})}\over{2N_{control}}} }}}

Significance threshold: C = \Phi^{-1}(1 - \alpha / 2 )

Power = Pr(|Z|>C) = 1 - \Phi(C-z) + \Phi(-C-z), where Z \sim N(z, 1)
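The case-control calculation can be sketched in the same way. The function name and the example allele frequencies and sample sizes are hypothetical; the code follows the two-proportion statistic and the two-sided threshold defined above.

```python
from math import sqrt

from scipy.stats import norm


def case_control_power(p_case, p_control, n_case, n_control, alpha=5e-8):
    """Power of the two-sided test for an allele frequency difference."""
    se = sqrt(p_case * (1 - p_case) / (2 * n_case)
              + p_control * (1 - p_control) / (2 * n_control))
    z = (p_case - p_control) / se          # expected value of the test statistic
    c = norm.isf(alpha / 2)                # C = Phi^{-1}(1 - alpha / 2)
    return 1 - norm.cdf(c - z) + norm.cdf(-c - z)


print(case_control_power(0.32, 0.30, n_case=5_000, n_control=5_000))
print(case_control_power(0.32, 0.30, n_case=50_000, n_control=50_000))
```

When the two frequencies are equal, z is 0 and the function returns \alpha, which is a convenient sanity check.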

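GAS Power Calculator: a web tool for GWAS power calculation

The GAS Power Calculator implements the calculation described above. Parameters can be specified in the web tool to run the calculation.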


References

  • https://cloufield.github.io/GWASTutorial/20_power_analysis/
  • Skol, A. D., Scott, L. J., Abecasis, G. R., & Boehnke, M. (2006). Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nature genetics, 38(2), 209-213.
  • Johnson, J. L., & Abecasis, G. R. (2017). GAS Power Calculator: web-based power calculator for genetic association studies. BioRxiv, 164343.
  • Sham, P. C., & Purcell, S. M. (2014). Statistical power and significance testing in large-scale genetic studies. Nature Reviews Genetics, 15(5), 335-346.

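Polygenic risk score (PRS) series, part 10: building cross-ancestry PRS with PRS-CSx

In this post:

  1. Introduction to PRS-CSx
  2. How to use PRS-CSx
  3. A real-world application of PRScsx
  4. References

Previous posts

  1. GWASLab: Polygenic risk score (PRS) series, part 1: introduction to the concepts
  2. GWASLab: Polygenic risk score (PRS) series, part 2: calculating PRS with PLINK (C+T method)
  3. GWASLab: Polygenic risk score (PRS) series, part 3: calculating PRS with PRSice (C+T method)
  4. ldpred
  5. GWASLab: Polygenic risk score (PRS) series, part 5: calculating PRS with PRS-CS (beta-shrinkage method)
  6. GWASLab: Polygenic risk score (PRS) series, part 6: introduction to metaGRS
  7. GWASLab: Polygenic risk score (PRS) series, part 7: pathway-based PRS
  8. GWASLab: Polygenic risk score (PRS) series, part 8: the PGS Catalog database
  9. GWASLab: Polygenic risk score (PRS) series, part 9: calculating PRS per chromosome with PLINK2 and summing the scores

Introduction to PRS-CSx

Earlier posts introduced one major problem in PRS research: a PRS built in population A is hard to transfer directly to population B. To address this problem, Yunfeng Ruan and colleagues developed PRS-CSx.

PRS-CSx is a Bayesian framework for polygenic modeling and prediction that improves cross-population PRS prediction by integrating GWAS summary statistics from multiple ancestries. It is an extension of PRS-CS (see: GWASLab: Polygenic risk score (PRS) series, part 5: calculating PRS with PRS-CS (beta-shrinkage method)).

In principle, PRS-CSx couples SNP effects across populations through a shared continuous shrinkage prior. By sharing the prior across the GWAS summary statistics and using the LD information of the different populations, it obtains more accurate effect size estimates. The shared prior accounts for the fact that effect sizes are correlated across populations yet may differ, which keeps the modeling framework flexible.

[Figure: the prior used by PRS-CSx, in which the global and local shrinkage parameters do not vary with population k]

Given GWAS summary statistics and the LD reference panels of the corresponding populations, PRS-CSx computes a separate PRS for each population and obtains the final PRS as an optimal linear combination of them.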
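The last step, combining the population-specific PRS linearly, can be sketched as an ordinary least-squares fit in a validation dataset. Everything below is simulated and hypothetical (the scores, the phenotype and the variable names); it only illustrates the idea of learning combination weights and is not the procedure shipped with PRS-CSx.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000  # hypothetical number of validation samples

# Simulated population-specific PRS for the same individuals.
prs_eur = rng.normal(size=n)
prs_eas = 0.6 * prs_eur + 0.8 * rng.normal(size=n)
# Simulated phenotype that depends on both scores plus noise.
pheno = 0.5 * prs_eur + 0.3 * prs_eas + rng.normal(scale=0.5, size=n)


def standardize(x):
    """Scale a score to mean 0 and standard deviation 1."""
    return (x - x.mean()) / x.std()


# Design matrix: intercept plus the standardized PRS of each population.
X = np.column_stack([np.ones(n), standardize(prs_eur), standardize(prs_eas)])
weights, *_ = np.linalg.lstsq(X, pheno, rcond=None)

# Final PRS: weighted sum of the population-specific scores.
prs_combined = X[:, 1:] @ weights[1:]
print("weights (intercept, EUR, EAS):", weights)
```

In a real analysis the weights learned in the validation dataset would then be applied to an independent testing dataset, so that prediction accuracy is not evaluated on the samples used for fitting.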

How to use PRS-CSx

https://github.com/getian107/PRScsx

PRScsx is a Python-based command-line tool that requires the scipy and h5py packages. Download PRS-CSx from GitHub:

git clone https://github.com/getian107/PRScsx.git

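The LD reference panels are the same files used by PRS-CS (see: GWASLab: Polygenic risk score (PRS) series, part 5: calculating PRS with PRS-CS (beta-shrinkage method)).

Download link (an FTP site accessible from mainland China): https://personal.broadinstitute.org/hhuang//public//PRS-CSx/Reference

Remember to also download the SNP list of the corresponding panel: snpinfo_mult_1kg_hm3 (1kg) or snpinfo_mult_ukbb_hm3 (ukbb).

Options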

python PRScsx.py \
--ref_dir=PATH_TO_REFERENCE \
--bim_prefix=VALIDATION_BIM_PREFIX \
--sst_file=SUM_STATS_FILE \
--n_gwas=GWAS_SAMPLE_SIZE \
--pop=POPULATION \
--out_dir=OUTPUT_DIR \
--out_name=OUTPUT_FILE_PREFIX \
--a=PARAM_A \
--b=PARAM_B \
--phi=PARAM_PHI \
--n_iter=MCMC_ITERATIONS \
--n_burnin=MCMC_BURNIN \
--thin=MCMC_THINNING_FACTOR \
--chrom=CHROM \
--meta=META_FLAG \
--seed=SEED

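Required parameters:

  • PATH_TO_REFERENCE: path to the LD reference panels. The directory should contain the reference panel of each population as well as the SNP list. For example, if the populations are EUR and EAS and the path is ./ldref, the directory should contain the two folders ldblk_1kg_eas and ldblk_1kg_eur, plus the file snpinfo_mult_1kg_hm3.
  • VALIDATION_BIM_PREFIX: path and prefix of the bim file of the target dataset (without the .bim extension).
  • SUM_STATS_FILE: full paths to the summary statistics files, separated by commas.
  • GWAS_SAMPLE_SIZE: sample sizes of the summary statistics, separated by commas, in the same order as SUM_STATS_FILE.
  • POPULATION: the corresponding populations (AFR, AMR, EAS, EUR or SAS), separated by commas, in the same order as SUM_STATS_FILE.
  • OUTPUT_DIR: output directory.
  • OUTPUT_FILE_PREFIX: prefix of the output files.

The remaining parameters are optional:

META_FLAG: if True, PRS-CSx also outputs an inverse-variance-weighted meta-analysis of the population-specific posterior effect size estimates.

PARAM_A, PARAM_B, PARAM_PHI, MCMC_ITERATIONS, MCMC_BURNIN, MCMC_THINNING_FACTOR, SEED and CHROM are used in the same way as in PRScs (see: GWASLab: Polygenic risk score (PRS) series, part 5: calculating PRS with PRS-CS (beta-shrinkage method)).

Example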

python PRScsx.py \
--ref_dir=path_to_ref \
--bim_prefix=path_to_bim/test \
--sst_file=path_to_sumstats/EUR_sumstats.txt,path_to_sumstats/EAS_sumstats.txt \
--n_gwas=200000,100000 \
--pop=EUR,EAS \
--chrom=22 \
--phi=1e-2 \
--out_dir=path_to_output \
--out_name=test

Note: replace the paths with your own.

The computation finishes in about one minute.

The run log is as follows:

*** 2 discovery populations detected ***

##### process chromosome 22 #####
... parse reference file: /home/heyunye/tools/prscs/ldref/snpinfo_mult_1kg_hm3 ...
... 18944 SNPs on chromosome 22 read from /home/heyunye/tools/prscs/ldref/snpinfo_mult_1kg_hm3 ...
... parse bim file: /home/heyunye/tools/prscsx/PRScsx/test_data/test.bim ...
... 1000 SNPs on chromosome 22 read from /home/heyunye/tools/prscsx/PRScsx/test_data/test.bim ...
... parse EUR sumstats file: /home/heyunye/tools/prscsx/PRScsx/test_data/EUR_sumstats.txt ...
... 1000 SNPs read from /home/heyunye/tools/prscsx/PRScsx/test_data/EUR_sumstats.txt ...
... 1000 common SNPs in the EUR reference, EUR sumstats, and validation set ...
... parse EAS sumstats file: /home/heyunye/tools/prscsx/PRScsx/test_data/EAS_sumstats.txt ...
... 1000 SNPs read from /home/heyunye/tools/prscsx/PRScsx/test_data/EAS_sumstats.txt ...
... 901 common SNPs in the EAS reference, EAS sumstats, and validation set ...
... parse EUR reference LD on chromosome 22 ...
... parse EAS reference LD on chromosome 22 ...
... align reference LD on chromosome 22 across populations ...
... 1000 valid SNPs across populations ...
... MCMC ...
--- iter-100 ---
--- iter-200 ---
--- iter-300 ---
--- iter-400 ---
--- iter-500 ---
--- iter-600 ---
--- iter-700 ---
--- iter-800 ---
--- iter-900 ---
--- iter-1000 ---
--- iter-1100 ---
--- iter-1200 ---
--- iter-1300 ---
--- iter-1400 ---
--- iter-1500 ---
--- iter-1600 ---
--- iter-1700 ---
--- iter-1800 ---
--- iter-1900 ---
--- iter-2000 ---
... Done ...

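The output consists of the posterior SNP effect size estimates for EUR and EAS, which serve as the weights for the PRS: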

test_EAS_pst_eff_a1_b0.5_phi1e-02_chr22.txt

test_EUR_pst_eff_a1_b0.5_phi1e-02_chr22.txt

head test_EAS_pst_eff_a1_b0.5_phi1e-02_chr22.txt
22      rs9605903       17054720        C       T       8.694291e-04
22      rs5746647       17057138        G       T       -1.005430e-03
22      rs5747999       17075353        C       A       -2.499230e-04
22      rs2845380       17203103        A       G       6.037999e-04
22      rs2247281       17211075        G       A       4.780305e-04
22      rs2845346       17214252        C       T       7.767527e-04
22      rs2845347       17214669        C       T       1.671207e-03
22      rs1807512       17221495        C       T       -1.778397e-03
22      rs5748593       17227461        T       C       9.849030e-04
22      rs9606468       17273728        C       T       1.442600e-04

These files can then be used to calculate PRS with PLINK (see: GWASLab: Polygenic risk score (PRS) series, part 9: calculating PRS per chromosome with PLINK2 and summing the scores).
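Conceptually, the score is a weighted sum of allele dosages. The sketch below illustrates this with the first three rows of the EAS output shown above. It assumes that the columns are chromosome, SNP ID, position, A1 (effect allele), A2 and posterior effect size; the dosage matrix for two individuals is hypothetical. In practice PLINK performs this step on real genotype data.

```python
import io

import numpy as np

# First three rows of test_EAS_pst_eff_a1_b0.5_phi1e-02_chr22.txt.
posterior_text = """22      rs9605903       17054720        C       T       8.694291e-04
22      rs5746647       17057138        G       T       -1.005430e-03
22      rs5747999       17075353        C       A       -2.499230e-04
"""

snp_ids, betas = [], []
for line in io.StringIO(posterior_text):
    # Assumed column order: CHR, SNP, BP, A1, A2, posterior effect size.
    chrom, snp, pos, a1, a2, beta = line.split()
    snp_ids.append(snp)
    betas.append(float(beta))
betas = np.array(betas)

# Hypothetical A1 dosages (0, 1 or 2): rows are individuals, columns are SNPs.
dosages = np.array([
    [2, 0, 1],
    [0, 1, 1],
])

# PRS of each individual: sum over SNPs of dosage times effect size.
scores = dosages @ betas
print(dict(zip(["sample1", "sample2"], scores)))
```

The same weighted sum, computed per chromosome and then added up, is what the PLINK-based workflow referenced above produces.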

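A real-world application of PRScsx

The corresponding author of PRScsx, as first author, applied PRScsx to a trans-ancestry PRS study of type 2 diabetes. The study used PRScsx with European, African American and East Asian GWAS data to build a trans-ancestry PRS for type 2 diabetes.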

https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-022-01074-2

References

Ruan, Y., Lin, Y. F., Feng, Y. C. A., Chen, C. Y., Lam, M., Guo, Z., … & Ge, T. (2022). Improving polygenic prediction in ancestrally diverse populations. Nature Genetics, 54(5), 573-580.

Ge, T., Irvin, M. R., Patki, A., Srinivasasainagendra, V., Lin, Y. F., Tiwari, H. K., … & Karlson, E. W. (2022). Development and validation of a trans-ancestry polygenic risk score for type 2 diabetes in diverse populations. Genome Medicine, 14(1), 1-16.

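Getting started with GWAS: recommended reviews and reading guide

Preface

Inspired by experts in the field on Twitter, this post summarizes review articles that are especially helpful for GWAS beginners. Most were written by leading scientists in the field, have been cited more than a thousand times, and are highly influential. For anyone who wants a quick overview of how GWAS has developed over the past decade and more, they are not to be missed. The list is based on tweets by Abdel Abdellaoui and on the author's personal experience. Further recommendations are welcome.

Recommended reviews and reading guide

Review 1

Hirschhorn, J. N., & Daly, M. J. (2005). Genome-wide association studies for common diseases and complex traits. Nature Reviews Genetics, 6(2), 95-108.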

https://www.nature.com/articles/nrg1521

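We start with one of the earliest reviews of GWAS, published in 2005 when GWAS was just emerging. At that time the sequencing of the human genome had just been completed, dbSNP was being built, and the HapMap project was getting under way; these projects laid the foundation for GWAS. The review covers the advantages and disadvantages of GWAS compared with traditional genetic approaches, the high-throughput methods available at the time, and the core issues to watch for in GWAS. It is one of the pioneering works bridging traditional genetics and modern genomics, and well worth reading.

Personal recommendation score: 10

Review 2

Balding, D. J. (2006). A tutorial on statistical methods for population association studies. Nature Reviews Genetics, 7(10), 781-791.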

https://www.nature.com/articles/nrg1916

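This review introduces the statistical tools available for early GWAS. It briefly covers the core genetic and statistical principles of GWAS and outlines the basic statistical methods used at each stage of a study. It is very helpful for beginners trying to understand how GWAS tests work; later GWAS methods are largely extensions of and additions to these basic principles.

Personal recommendation score: 8

Review 3

McCarthy, M. I., Abecasis, G. R., Cardon, L. R., Goldstein, D. B., Little, J., Ioannidis, J., & Hirschhorn, J. N. (2008). Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature Reviews Genetics, 9(5), 356-369.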

https://www.nature.com/articles/nrg2344

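Published in 2008, right after the results of the first wave of GWAS appeared, this article summarizes the state of GWAS at the time based on those first-wave papers. It focuses on the shortcomings and challenges of GWAS at that point and sets out directions for subsequent research.

Personal recommendation score: 9

Review 4

Manolio, T. A., Collins, F. S., Cox, N. J., Goldstein, D. B., Hindorff, L. A., Hunter, D. J., … & Visscher, P. M. (2009). Finding the missing heritability of complex diseases. Nature, 461(7265), 747-753.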

https://www.nature.com/articles/nature08494

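Finding the "missing heritability" of complex diseases has always been a hot topic in GWAS research. This article summarizes the possible sources of missing heritability and suggests possible research strategies. It is a fairly influential paper.

Personal recommendation score: 8

Review 5

Ioannidis, J., Thomas, G., & Daly, M. J. (2009). Validating, augmenting and refining genome-wide association signals. Nature Reviews Genetics, 10(5), 318-329.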

https://www.nature.com/articles/nrg2544

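This article was published in 2009. By then GWAS had identified a large number of disease-associated loci, but most of them were only markers of the causal variants that actually alter function, so finding the causal variants among the many associations had become an unavoidable question. The article summarizes early methods for improving the reliability of GWAS results and for finding causal variants.

Personal recommendation score: 7

Review 6

Marchini, J., & Howie, B. (2010). Genotype imputation for genome-wide association studies. Nature Reviews Genetics, 11(7), 499-511.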

https://www.nature.com/articles/nrg2796

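This article reviews genotype imputation methods and introduces the basic concepts and commonly used metrics. It is very helpful for understanding the theoretical basis of genotype imputation.

Personal recommendation score: 6

Review 7

Price, A. L., Zaitlen, N. A., Reich, D., & Patterson, N. (2010). New approaches to population stratification in genome-wide association studies. Nature Reviews Genetics, 11(7), 459-463.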

https://www.nature.com/articles/nrg2813

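Population stratification has always been a problem that GWAS must handle properly. This article summarizes methods for dealing with it. It is short but clear and concise, and very helpful for understanding λgc, PCA, linear mixed models and related topics. Recommended reading.

Personal recommendation score: 9

Review 8

Visscher, P. M., Wray, N. R., Zhang, Q., Sklar, P., McCarthy, M. I., Brown, M. A., & Yang, J. (2017). 10 years of GWAS discovery: biology, function, and translation. The American Journal of Human Genetics, 101(1), 5-22.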

https://www.sciencedirect.com/science/article/pii/S0002929717302409?via%3Dihub

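A summary of the development and achievements of GWAS in the 10 years since its debut. The article introduces the scientific basis of GWAS, draws some general conclusions from a large number of GWAS, and gives three typical examples of widely studied complex diseases.

Personal recommendation score: 8

Review 9

Pasaniuc, B., & Price, A. L. (2017). Dissecting the genetics of complex traits using summary association statistics. Nature Reviews Genetics, 18(2), 117-127.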

https://www.nature.com/articles/nrg.2016.142

This article was published in 2017. As GWAS summary statistics accumulated, downstream methods based on summary statistics appeared in rapid succession. The article reviews post-GWAS methods that analyze diseases using GWAS summary statistics, such as gene-based analysis, fine-mapping and PRS.

Personal recommendation score: 8

Review 10

Tam, V., Patel, N., Turcotte, M., Bossé, Y., Paré, G., & Meyre, D. (2019). Benefits and limitations of genome-wide association studies. Nature Reviews Genetics, 20(8), 467-484.

https://www.nature.com/articles/s41576-019-0127-1

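This article summarizes the benefits and limitations of GWAS. It is helpful for deepening one's understanding of GWAS and of its future directions.

Personal recommendation score: 7

Review 11

Uffelmann, E., Huang, Q. Q., Munung, N. S., De Vries, J., Okada, Y., Martin, A. R., … & Posthuma, D. (2021). Genome-wide association studies. Nature Reviews Methods Primers, 1(1), 1-21.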

https://www.nature.com/articles/s43586-021-00056-9

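Published in 2021, this is currently the most recent walkthrough of the complete GWAS workflow. It is fairly comprehensive, useful for filling gaps in one's knowledge, and well worth reading.

Personal recommendation score: 9

References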

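Summary of major biobanks and cohorts v1

Preface

This post lists major biobanks and cohorts worldwide, for reference only. For now it gives only basic information for each biobank and cohort: approximate sample size, location, website link and a brief description. It will be updated over time. The next steps are to complete the abbreviations, add study type and ancestry information, split the sample size into total and genotyped sample sizes, and add links to the corresponding public data. (The list was compiled by hand, so errors are inevitable. Corrections of omissions or mistakes are welcome in the comments. Thank you!)

This post is part of CTGCatalog (Complex Trait Genetics Catalog), which collects commonly used reference data and resources, public sumstats, and commonly used tools in the field of complex trait genetics: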

https://cloufield.github.io/CTGCatalog/Reference_data_Biobanks_Cohorts_README/

Contents : Biobanks and Cohorts v1 (20221006)

  • Biobank of the Americas
  • Biobank Graz
  • Biobank Japan
  • BioMe
  • BioVU
  • CanPath – Ontario Health Study
  • China Kadoorie Biobank
  • Colorado Center for Personalized Medicine
  • deCODE Genetics
  • Estonian Biobank
  • FinnGen
  • Generation Scotland
  • Genes & Health
  • HUNT
  • IARC Biobank
  • Lifelines
  • Massachusetts General Brigham Biobank
  • Michigan Genomics Initiative
  • Million Veteran Program (MVP)
  • National Biobank of Korea
  • Nigerian 100K Genome Project
  • Penn Medicine Biobank
  • Qatar Biobank
  • QIMR Berghofer – QIMR Biobank (QSkin and GenEpi)
  • Taiwan Biobank
  • The Malaysian Cohort (TMC)
  • UCLA Precision Health Biobank
  • Uganda Genome Resource
  • UK Biobank

EUROPE

UK Biobank (UKB)

  • SAMPLE SIZE: ~500k
  • LOCATION: U.K.
  • URL: https://www.ukbiobank.ac.uk/
  • DESCRIPTION: UK Biobank is a large-scale biomedical database and research resource, containing in-depth genetic and health information from half a million UK participants. The database is regularly augmented with additional data and is globally accessible to approved researchers undertaking vital research into the most common and life-threatening diseases. It is a major contributor to the advancement of modern medicine and treatment and has enabled several scientific discoveries that improve human health.
  • CITATION: Bycroft, C., Freeman, C., Petkova, D., Band, G., Elliott, L. T., Sharp, K., … & Marchini, J. (2018). The UK Biobank resource with deep phenotyping and genomic data. Nature, 562(7726), 203-209.

FinnGen

  • SAMPLE SIZE: ~343k
  • LOCATION: Finland
  • URL: https://www.finngen.fi/en
  • DESCRIPTION: FinnGen study launched in Finland in the autumn of 2017 is a unique study that combines genome information with digital health care data. The FinnGen study is an unprecedented global research project representing one of the largest studies of this type. Project aims to improve human health through genetic research, and ultimately identify new therapeutic targets and diagnostics for treating numerous diseases. The collaborative nature of the project is exceptional compare to many ongoing studies, and all the partners are working closely together to ensure appropriate transparency, data security and ownership.
  • CITATION:Kurki, M. I., Karjalainen, J., Palta, P., Sipilä, T. P., Kristiansson, K., Donner, K., … & Nelis, M. (2022). FinnGen: Unique genetic insights from combining isolated population and national health register data. medRxiv.

Estonian Biobank

  • SAMPLE SIZE: ~200k
  • LOCATION: Estonia
  • URL: https://genomics.ut.ee/en/content/estonian-biobank
  • DESCRIPTION:The Estonian Biobank has established a population-based biobank of Estonia with a current cohort size of more than 200,000 individuals (genotyped with genome-wide arrays), reflecting the age, sex and geographical distribution of the adult Estonian population. Considering the fact that about 20% of Estonia’s adult population has joined the programme, it is indeed a database that is very important for the development of medical science both domestically and internationally.
  • CITATION:Leitsalu, L., Haller, T., Esko, T., Tammesoo, M. L., Alavere, H., Snieder, H., … & Metspalu, A. (2015). Cohort profile: Estonian biobank of the Estonian genome center, university of Tartu. International journal of epidemiology, 44(4), 1137-1147.

Lifelines

  • SAMPLE SIZE: ~167k
  • LOCATION: Netherlands
  • URL: https://www.lifelines.nl/researcher
  • DESCRIPTION: Lifelines is a large, multigenerational cohort study that includes over 167,000 participants (10%) from the northern population of the Netherlands. We included participants from three generations, who are followed for at least 30 years, to obtain insight into healthy ageing. The aim of Lifelines is to be a resource for the national and international scientific community.
  • CITATION: Scholtens, S., Smidt, N., Swertz, M. A., Bakker, S. J., Dotinga, A., Vonk, J. M., … & Stolk, R. P. (2015). Cohort Profile: LifeLines, a three-generation cohort study and biobank. International journal of epidemiology, 44(4), 1172-1180.

HUNT

  • SAMPLE SIZE: ~88k
  • LOCATION: Norway
  • URL: https://www.ntnu.edu/hunt/hunt-biobank
  • DESCRIPTION:HUNT Biobank is an established and modern research biobank with high-technology equipment for storage, analysis, sample handling and delivery of samples. Our samples satisfy high quality standards and are stored in accordance with the Data Inspectorates laws and regulations. HUNT Biobank engages in sample handling from The Nord-Trøndelag Health Study (HUNT), Cohort of Norway (CONOR), and can receive samples from other researchers and research projects for storage, analysis and processing of DNA. We do not store samples from private individuals.
  • CITATION: Brumpton, B. M., Graham, S., Surakka, I., Skogholt, A. H., Løset, M., Fritsche, L. G., … & Willer, C. J. (2021). The HUNT Study: a population-based cohort for genetic research. medRxiv.

Generation Scotland

  • SAMPLE SIZE: ~24k
  • LOCATION: Scotland
  • URL: https://www.ed.ac.uk/generation-scotland
  • DESCRIPTION: Generation Scotland is a research study looking at the health and well-being of volunteers and their families. Generation Scotland combines responses to questionnaires of health and well-being from birth through life. We combine this with NHS health records and innovative laboratory science to understand health trajectories. We work closely with researchers and our volunteers to create a rich evidence base for understanding health. Through this rigorous, ethical and safe approach to research, we seek to enable meaningful change in public health.  
  • CITATION: Smith, B. H., Campbell, A., Linksted, P., Fitzpatrick, B., Jackson, C., Kerr, S. M., … & Morris, A. D. (2013). Cohort Profile: Generation Scotland: Scottish Family Health Study (GS: SFHS). The study, its participants and their potential for genetic research on health and illness. International journal of epidemiology, 42(3), 689-700.

East London Genes & Health

  • SAMPLE SIZE: ~100k
  • LOCATION: U.K.
  • URL: https://www.genesandhealth.org/
  • DESCRIPTION: Genes & Health is a huge long-term study of 100,000 people of Bangladeshi and Pakistani origin. We will link genes with health records, to study disease and treatments. Some volunteers may be invited for further studies. We are inviting volunteers to take part in two regions of the UK: East London (East London Genes & Health) and Bradford (Bradford Genes & Health).
  • CITATION: Finer, S., Martin, H. C., Khan, A., Hunt, K. A., MacLaughlin, B., Ahmed, Z., … & van Heel, D. A. (2020). Cohort Profile: East London Genes & Health (ELGH), a community-based population genomics and health study in British Bangladeshi and British Pakistani people. International journal of epidemiology, 49(1), 20-21i.

deCODE Genetics

  • SAMPLE SIZE: ~250k
  • LOCATION: Iceland
  • URL: https://www.decode.com/
  • DESCRIPTION:deCODE leads the world in the discovery of genetic risk factors for common diseases. Our gene discovery engine is driven by our unique approach and resources, including detailed genetic and medical information on some 500,000 individuals from around the globe taking part in our discovery work and proprietary statistical algorithms and informatics tools for gathering, analyzing, visualizing and storing large amounts of data.

The International Agency for Research on Cancer (IARC) Biobank (IBB)

  • SAMPLE SIZE: ~560k
  • LOCATION: France
  • URL: https://ibb.iarc.fr/
  • DESCRIPTION: The IARC BioBank (IBB) is one of the largest, most varied and richest International collections of samples in the world. The Biobank is publicly funded, (approximately 60% of its budget is provided by IARC Participating States through the regular budget and the remainder is from research grants) and hosts over 50 different studies, led or coordinated by IARC scientists. The IBB contains both population-based collections from research projects focusing on gene-environment interactions (as in the European Prospective Investigation into Cancer and Nutrition (EPIC) study) and disease-based collections which focus on biomarkers (as in the International Head and Neck Cancer Epidemiology (INHANCE)). Study designs include case-series, prevalence studies, case-control and cohort studies, etc. The IBB contains 5.1 million biological samples from 562,000 individuals. 4 million of the samples are from the EPIC study (over 370,000 individuals) and about one million samples from other collections (close to 200,000 individuals). Most of the samples are body fluids, including plasma, serum and urine as well as extracted DNA samples.

Biobank Graz

  • SAMPLE SIZE: ~1200k
  • LOCATION: Austria
  • URL: https://biobank.medunigraz.at/en/
  • DESCRIPTION: Biobank Graz is one of the largest and most well-known clinical biobanks in the world. Around 20 million individual specimens of body fluids and human tissue are stored here. Biobank Graz allows access to these specimens and associated data for scientific research purposes. The common goal is to develop approaches to diagnosing and treating disease.
  • CITATION: Huppertz, B., Bayer, M., Macheiner, T., & Sargsyan, K. (2016). Biobank Graz: the hub for innovative biomedical research. Open journal of bioresources, 3(1).

ASIA

China Kadoorie Biobank (CKB)

  • SAMPLE SIZE: ~500k
  • LOCATION: China
  • URL: https://www.ckbiobank.org/
  • DESCRIPTION:The China Kadoorie Biobank is one of the world’s largest prospective cohort studies. A long-term collaboration between the UK and China, it aims to generate reliable evidence about the lifestyle, environmental and genetic determinants of a wide range of common diseases that can inform disease prevention, risk prediction and treatment worldwide.
  • CITATION:Chen, Z., Chen, J., Collins, R., Guo, Y., Peto, R., Wu, F., & Li, L. (2011). China Kadoorie Biobank of 0.5 million people: survey methods, baseline characteristics and long-term follow-up. International journal of epidemiology, 40(6), 1652-1666.

Taiwan Biobank (TWB)

  • SAMPLE SIZE: ~150k
  • LOCATION: China, Taiwan
  • URL: https://www.twbiobank.org.tw/
  • DESCRIPTION:The Taiwan Biobank (TWB) is an ongoing prospective study of over 150,000 individuals aged 30-70 recruited from across Taiwan beginning in 2012. A comprehensive list of phenotypes was collected for each consented participant at recruitment and follow-up visits through structured interviews and physical measurements. Biomarkers and genetic data were also generated for all participants from blood and urine samples.
  • CITATION:Feng, Y. C. A., Chen, C. Y., Chen, T. T., Kuo, P. H., Hsu, Y. H., Yang, H. I., … & Lin, Y. F. (2021). Taiwan Biobank: a rich biomedical research database of the Taiwanese population. medRxiv.

BioBank Japan (BBJ)

  • SAMPLE SIZE: ~200k
  • LOCATION: Japan
  • URL: https://biobankjp.org/
  • DESCRIPTION:In 2003, BioBank Japan (BBJ) started developing one of the world’s largest disease biobanks, creating a foundation for research aimed at achieving medical care tailored to the individual traits of each patient. From a total of 260,000 patients representing 440,000 cases of 51 primarily multifactorial (common) diseases, BBJ has collected DNA, serum, medical records (clinical information), etc. with their consent. No less than 5,800 items of screened information are available for research, including the patients’ survival information, with 95% of the patients tracked over an average of 10 years. In addition to large-scale genomic analyses, omics analyses including whole genome sequencing and metabolome/proteome analyses have been performed on the DNA, serum and other biological samples collected, producing significant research findings. The genomic information acquired through the analyses continues to be used as data. The biological samples and data are widely distributed and used by researchers.
  • CITATION:Nagai, A., Hirata, M., Kamatani, Y., Muto, K., Matsuda, K., Kiyohara, Y., … & Kubo, M. (2017). Overview of the BioBank Japan Project: study design and profile. Journal of epidemiology, 27(Supplement_III), S2-S8.

Tohoku Medical Megabank (TMM)

  • SAMPLE SIZE: ~157k
  • LOCATION: Japan
  • URL: https://www.megabank.tohoku.ac.jp/english/
  • DESCRIPTION:Tohoku University Tohoku Medical Megabank Organization was founded to establish an advanced medical system to foster the reconstruction from the Great East Japan Earthquake. The organization has been developing a biobank that combines medical and genome information during the process of rebuilding the community medical system and supporting health and welfare in the Tohoku area. The information from the brand-new biobank will create a new medical system, and, based on the findings of its analysis, the organization aims to attract more medical practitioners from all over the country to the area, promote industry-academic partnerships, create employment in related fields, and restore the medical system in Tohoku.
  • CITATION:Kuriyama, S., Yaegashi, N., Nagami, F., Arai, T., Kawaguchi, Y., Osumi, N., … & Tohoku Medical Megabank Project Study Group. (2016). The Tohoku medical megabank project: design and mission. Journal of epidemiology, 26(9), 493-511.

National Biobank of Korea

  • SAMPLE SIZE: ~80k
  • LOCATION: Korea
  • URL: https://nih.go.kr/NIH/cms/content/eng/14/65714_view.html
  • DESCRIPTION:The NBK is the national control center for the collection, management, and utilization of human bioresources in Korea. And NBK manages KBN, it contributes to the development of policies related to human bioresources, standardization of human bioresource management, and advancement of domestic biobanks through developing and providing support for human bioresource technologies. For guaranteeing the fairness in bioresource distribution and development of an efficient distribution system, the NBK also serves as the human bioresource supply hub that supports national healthcare and medical R&D.
  • CITATION:Cho, S. Y., Hong, E. J., Nam, J. M., Han, B., Chu, C., & Park, O. (2012). Opening of the national biobank of Korea as the infrastructure of future biomedical science in Korea. Osong public health and research perspectives, 3(3), 177-184.

Qatar Biobank

  • SAMPLE SIZE: ~80K
  • LOCATION: Qatar
  • URL : https://www.qatarbiobank.org.qa/
  • DESCRIPTION: Qatar Biobank, a center within Qatar Foundation, was created in collaboration with Hamad Medical Corporation and the Ministry of Public Health to enable local scientists to conduct medical research on prevalent health issues in Qatar.
  • CITATION:Al Kuwari, H., Al Thani, A., Al Marri, A., Al Kaabi, A., Abderrahim, H., Afifi, N., … & Elliott, P. (2015). The Qatar Biobank: background and methods. BMC public health, 15(1), 1-9.

The Malaysian Cohort (TMC)

  • SAMPLE SIZE: ~100k
  • LOCATION: Malaysia
  • URL:https://www.ukm.my/mycohort/ms/
  • DESCRIPTION:The Malaysian Cohort study was initiated in 2005 by the Malaysian government. The top-down approach to this population-based cohort study ensured the allocation of sufficient funding for the project which aimed to recruit 100 000 individuals aged 35–70 years. Participants were recruited from rural and urban areas as well as from various socioeconomic groups. The main objectives of the study were to identify risk factors, to study gene-environment interaction and to discover biomarkers for the early detection of cancers and other diseases.
  • CITATION:Jamal, R., Syed Zakaria, S. Z., Kamaruddin, M. A., Abd Jalal, N., Ismail, N., Mohd Kamil, N., … & Malaysian Cohort Study Group. (2015). Cohort profile: The Malaysian Cohort (TMC) project: a prospective study of non-communicable diseases in a multi-ethnic population. International journal of epidemiology, 44(2), 423-431.

AFRICA

Uganda Genome Resource

  • SAMPLE SIZE: ~6k
  • URL:https://ega-archive.org/studies/EGAS00001000545
  • DESCRIPTION:Genomic studies in African populations provide unique opportunities to understand disease aetiology, human genetic diversity and population history in a regional and a global context. To leverage the relative benefits of different strategies, we undertook a combined approach of genotyping and whole-genome sequencing (WGS) in a population-based study of 6,400 individuals from a geographically defined rural community in South-West Uganda. We present data from 4,778 individuals with genotypes for ~2.2 million SNPs from the Uganda GWAS resource (UGWAS), and sequence data on up to 1,978 individuals spanning 41.5M SNPs and 4.5M indels (UG2G); 343 individuals overlap between the two datasets. We highlight the value of the largest sequence panel from Africa to date as a global resource for variant discovery, imputation and understanding the mutational spectrum and its clinical relevance in African populations. Alongside phenotype data, we provide a rich new genomic resource for researchers in Africa and globally
  • CITATION:Gurdasani, D., Carstensen, T., Fatumo, S., Chen, G., Franklin, C. S., Prado-Martinez, J., … & Sandhu, M. S. (2019). Uganda genome resource enables insights into population history and genomic discovery in Africa. Cell, 179(4), 984-1002.

Nigerian 100K Genome Project (coming soon)

  • CITATION:Fatumo, S., Yakubu, A., Oyedele, O., Popoola, J., Attipoe, D. A., Eze-Echesi, G., … & Ene-Obong, A. (2022). Promoting the genomic revolution in Africa through the Nigerian 100K Genome Project. Nature Genetics, 54(5), 531-536.

NORTH AMERICA

Michigan Genomics Initiative

  • SAMPLE SIZE: ~55k
  • LOCATION: U.S.
  • URL:https://precisionhealth.umich.edu/our-research/michigangenomics/
  • DESCRIPTION:The Michigan Genomics Initiative (MGI) is a collaborative research effort among physicians, researchers, and patients at the University of Michigan (U-M) with the goal of combining patient electronic health record (EHR) data with corresponding genetic data to gain novel biomedical insights. There are currently ~84K consented participants through the MGI and partner studies and the addition of ~10K new participants per year is anticipated. Currently, all MGI participants with available genetic data have received care at the University of Michigan Health System.
  • CITATION:Zawistowski, M., Fritsche, L. G., Pandit, A., Vanderwerff, B., Patil, S., Schmidt, E. M., … & Zoellner, S. (2021). The Michigan Genomics Initiative: a biobank linking genotypes and electronic clinical records in Michigan Medicine patients. medRxiv.

Penn Medicine Biobank

  • SAMPLE SIZE: ~40k
  • LOCATION: U.S.
  • URL:https://pmbb.med.upenn.edu/
  • DESCRIPTION:The Penn Medicine BioBank (PMBB) is a research program created to study the causes and treatments of many diseases. Any Penn Medicine patient (age 18 and up) can sign up. The PMBB is a collection of biological samples, such as blood or tissue, that are donated by patient volunteers. These samples are then connected to clinical information, such as diseases or lab measures. These data are then used by researchers to discover new ways to detect, treat, and maybe even prevent or cure disease. Some of these studies may be about how genes affect health and disease. Other studies look at how genes affect response to medicines.

UCLA Precision Health Biobank

  • SAMPLE SIZE: ~27k
  • LOCATION: U.S.
  • URL:https://www.uclahealth.org/precision-health/programs/ucla-atlas-community-health-initiative/ucla-atlas-precision-health-biobank
  • DESCRIPTION:The UCLA ATLAS Precision Health Biobank, under the supervision of the Translational Pathology Core Laboratory (TPCL), collects biological samples from patients who have consented to participate in the UCLA ATLAS Community Health Initiative. As a collaborator with UCLA ATLAS Community Health Initiative, the UCLA ATLAS Precision Health Biobank manages the collection and distribution of biological samples by removing the personally identifiable information.
  • CITATION:Johnson, R. D., Ding, Y., Bhattacharya, A., Chiu, A., Lajonchere, C., Geschwind, D. H., & Pasaniuc, B. (2022). The UCLA ATLAS Community Health Initiative: promoting precision health research in a diverse biobank. medRxiv.

BioMe

  • SAMPLE SIZE: ~32k
  • LOCATION: U.S.
  • URL:https://icahn.mssm.edu/research/ipm/programs/biome-biobank
  • DESCRIPTION:The Institute for Personalized Medicine at the Icahn School of Medicine at Mount Sinai is leading the movement toward diagnosis and classification of disease according to the patient’s molecular profile. This approach accommodates differences at all possible levels of exposure (genome, environment, and lifestyle) and at all stages of the process, from prevention to post-treatment follow-up. At the center of this effort is BioMe, an electronic medical record-linked biobank that enables researchers to rapidly and efficiently conduct genetic, epidemiologic, molecular, and genomic studies on large collections of research specimens linked with medical information.

BioVU

  • SAMPLE SIZE: ~120k
  • LOCATION: U.S.
  • URL:https://www.vumc.org/dbmi/biovu
  • DESCRIPTION:Planning for BioVU began in mid-2004 and the first samples were collected in February 2007. Prior to collecting DNA samples, all aspects of the BioVU project were extensively tested. BioVU now accrues 500-1000 samples per week, totaling more than 275,000 DNA samples as of January 2022. Vanderbilt clinic patients may sign the BioVU Consent Form if they wish to donate their excess blood samples, or not sign the form if they do not wish to participate.
  • CITATION:Roden, D. M., Pulley, J. M., Basford, M. A., Bernard, G. R., Clayton, E. W., Balser, J. R., & Masys, D. R. (2008). Development of a large‐scale de‐identified DNA biobank to enable personalized medicine. Clinical Pharmacology & Therapeutics, 84(3), 362-369.

Biobank of the Americas

  • SAMPLE SIZE: ~20k
  • LOCATION: U.S.
  • URL:https://bbofa.org/
  • URL: https://www.galatea.bio/#main-biobank
  • DESCRIPTION: Biobank consented samples with associated clinical data from diverse populations from throughout the United States and Latin America via healthcare and biopharma partnerships.

Colorado Center for Personalized Medicine

  • SAMPLE SIZE: ~34k
  • LOCATION: U.S.
  • URL:https://medschool.cuanschutz.edu/cobiobank
  • DESCRIPTION:Established in 2014 as a partnership between UCHealth and University of Colorado Anschutz Medical Campus, the Colorado Center for Personalized Medicine (CCPM) brings together multiple disciplines and institutions to uncover advancements in genomics that can improve diagnosis and treatment of disease, and identify more tailored approaches to population health management. To facilitate discoveries in personalized medicine, CCPM has created a Biobank that aims to be one of the largest academic medicine biospecimen repositories in the mountain and midwest regions of the U.S. The CCPM Biobank is able to link biospecimens and genotype information with patient health information from electronic medical records in an enterprise data warehouse (Health Data Compass) to support a broad range of research, operational, and clinical quality improvement agendas.

CanPath – Ontario Health Study

  • SAMPLE SIZE: ~7.3k
  • LOCATION: Canada
  • URL:https://canpath.ca/cohort/ontario-health-study/
  • DESCRIPTION:The Ontario Health Study (OHS) is a resource for investigating the ways in which lifestyle, the environment and genetics affect people’s health. It is one of the regional cohorts that collectively form the Canadian Partnership for Tomorrow’s Health (CanPath)—a pan-Canadian cohort with >330 000 participants. The linking of Canada’s rich collection of administrative health data with the cohort’s data represents a powerful means to disseminate high-quality, timely data.
  • CITATION:Kirsh, V. A., Skead, K., McDonald, K., Kreiger, N., Little, J., Menard, K., … & Awadalla, P. (2022). Cohort Profile: The Ontario Health Study (OHS). International Journal of Epidemiology.

Massachusetts General Brigham Biobank

  • SAMPLE SIZE: ~26K
  • LOCATION: U.S.
  • URL:https://www.massgeneralbrigham.org/en/research-and-innovation/participate-in-research/biobank
  • DESCRIPTION: The Mass General Brigham Biobank is a large research program designed to help researchers understand how people’s health is affected by their genes, lifestyle, and environment. By participating in the Mass General Brigham Biobank, you can help us better understand, treat, and even prevent the diseases that might affect your health and the health of future generations. 
  • CITATION: Boutin, N. T., Schecter, S. B., Perez, E. F., Tchamitchian, N. S., Cerretani, X. R., Gainer, V. S., … & Smoller, J. W. (2022). The Evolution of a Large Biobank at Mass General Brigham. Journal of Personalized Medicine, 12(8), 1323.
  • CITATION:Castro, V. M., Gainer, V., Wattanasin, N., Benoit, B., Cagan, A., Ghosh, B., … & Murphy, S. N. (2022). The Mass General Brigham Biobank Portal: an i2b2-based data repository linking disparate and high-dimensional patient data to support multimodal analytics. Journal of the American Medical Informatics Association, 29(4), 643-651.

Million Veteran Program (MVP)

  • SAMPLE SIZE: ~900k
  • LOCATION: U.S.
  • URL:https://www.mvp.va.gov/pwa/
  • DESCRIPTION: The Million Veteran Program (MVP) is a national research program to learn how genes, lifestyle, and military exposures affect health and illness. Since launching in 2011, over 900,000 Veteran partners have joined one of the world’s largest programs on genetics and health.
  • CITATION:Gaziano, J. M., Concato, J., Brophy, M., Fiore, L., Pyarajan, S., Breeling, J., … & O’Leary, T. J. (2016). Million Veteran Program: A mega-biobank to study genetic influences on health and disease. Journal of clinical epidemiology, 70, 214-223.

OCEANIA

QIMR Berghofer – QIMR Biobank (QSkin and GenEpi)

References:

Home | Global Biobank Meta

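Polygenic risk score (PRS) series, part 9: calculating PRS per chromosome with PLINK2 and summing the results

Contents:

  1. Download the PGS scoring file
  2. Step 1: calculate the raw scores for each chromosome
  3. Step 2: extract the per-chromosome results
  4. Step 3: sum the scores

This post walks through calculating a PRS chromosome by chromosome and then summing the results.

Download the PGS scoring file

The example scoring file is downloaded from the PGS Catalog.

Download the harmonized file: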

wget https://ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/PGS000012/ScoringFiles/Harmonized/PGS000012_hmPOS_GRCh37.txt.gz
gunzip PGS000012_hmPOS_GRCh37.txt.gz
head PGS000012_hmPOS_GRCh37.txt
###PGS CATALOG SCORING FILE - see <https://www.pgscatalog.org/downloads/#dl_ftp_scoring> for additional information
#format_version=2.0
##POLYGENIC SCORE (PGS) INFORMATION
#pgs_id=PGS000012
#pgs_name=GRS49K
#trait_reported=Coronary artery disease
#trait_mapped=coronary artery disease
#trait_efo=EFO_0001645
#genome_build=hg19
#variants_number=49310

# Remove the comment lines at the top of the file so that PLINK2 can read it
awk '$1!~/#/{print $0}' PGS000012_hmPOS_GRCh37.txt > PGS000012_hmPOS_GRCh37_plink2.txt

head PGS000012_hmPOS_GRCh37_plink2.txt
rsID    effect_allele   effect_weight   hm_source       hm_rsID hm_chr  hm_pos  hm_inferOtherAllele
rs1333045       C       0.187251        ENSEMBL rs1333045       9       22119195        T
rs1537370       T       0.17296 ENSEMBL rs1537370       9       22084310        C
rs9863247       T       0.136328        ENSEMBL rs9863247       3       161125373       C
rs11203077      T       0.131725        ENSEMBL rs11203077      10      91097085        G
rs10268558      C       0.12966 ENSEMBL rs10268558      7       18924927        T
rs7148203       C       0.128366        ENSEMBL rs7148203       14      45062275        T
rs12747328      T       0.127016        ENSEMBL rs12747328      1       159429702       C
rs17810947      G       0.126626        ENSEMBL rs17810947      18      70754886        A/T
rs2412710       A       0.125677        ENSEMBL rs2412710       15      42683787        G
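The same clean-up can be sketched in Python, which is convenient when the weights are needed later in a script. This is a minimal sketch, not part of the original workflow: the function names `strip_comment_lines` and `load_weights` are made up for illustration, and the input is assumed to be tab-separated with the column names shown above.

```python
def strip_comment_lines(lines):
    """Keep the column header and data rows; drop blank lines and lines starting with '#'."""
    return [line for line in lines if line.strip() and not line.startswith("#")]


def load_weights(lines):
    """Return {rsID: (effect_allele, effect_weight)} from comment-free, tab-separated lines."""
    header = lines[0].split("\t")
    id_col = header.index("rsID")
    allele_col = header.index("effect_allele")
    weight_col = header.index("effect_weight")
    weights = {}
    for line in lines[1:]:
        fields = line.split("\t")
        weights[fields[id_col]] = (fields[allele_col], float(fields[weight_col]))
    return weights


# A few lines in the layout of the harmonized file shown above.
raw = [
    "#pgs_id=PGS000012",
    "#genome_build=hg19",
    "rsID\teffect_allele\teffect_weight\thm_chr\thm_pos",
    "rs1333045\tC\t0.187251\t9\t22119195",
    "rs1537370\tT\t0.17296\t9\t22084310",
]
weights = load_weights(strip_comment_lines(raw))
print(weights["rs1333045"])
```

Unlike the awk filter, which drops any line whose first field contains a `#`, this sketch only drops lines that start with `#`; for PGS Catalog files the two behave the same.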

The genotype data are the EUR files provided by https://choishingwan.github.io/PRS-Tutorial/plink/:

EUR.QC.bed
EUR.QC.bim
EUR.QC.fam

Note: the results in this tutorial have no real-world meaning; they only demonstrate the calculation.

Step 1: calculate the raw scores for each chromosome

score=PGS000012_hmPOS_GRCh37_plink2.txt

for chr in $(seq 1 22)
do
plink2 \
    --bfile ./EUR.QC \
    --score ${score} 1 2 3 header list-variants cols=+scoresums \
    --chr ${chr} \
    --out EUR_CAD.chr${chr} 
done

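--score takes the path to the scoring file, followed by the column numbers of the variant ID, the effect allele, and the effect weight (here columns 1, 2, and 3).

header: the scoring file has a header line, so the first row is skipped.

list-variants: writes the IDs of all variants used in the calculation to a .sscore.vars file (plink2.sscore.vars by default).

--chr: restricts the calculation to the given chromosome.

Note 1: cols=+scoresums is needed to obtain the raw score sums; by default only the averaged score is reported. From the PLINK2 documentation:

--score applies one or more linear scoring systems to each sample, and reports results to plink2.sscore. More precisely, if G is the full genotype/dosage matrix (rows = alleles, columns = samples) and a is a scoring-system vector with one coefficient per allele, --score computes the vector-matrix product a^T G, and then divides by the number of variants when reporting score-averages.

Note 2: a scoring file may contain no variants on some chromosomes, in which case the run for those chromosomes fails and no .sscore file is produced. Keep this in mind when summing.

Output files: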

EUR_CAD.chr10.log          EUR_CAD.chr15.log          EUR_CAD.chr1.log           EUR_CAD.chr3.log          EUR_CAD.chr8.log
EUR_CAD.chr10.sscore       EUR_CAD.chr15.sscore       EUR_CAD.chr1.sscore        EUR_CAD.chr3.sscore       EUR_CAD.chr8.sscore
EUR_CAD.chr10.sscore.vars  EUR_CAD.chr15.sscore.vars  EUR_CAD.chr1.sscore.vars   EUR_CAD.chr3.sscore.vars  EUR_CAD.chr8.sscore.vars
EUR_CAD.chr11.log          EUR_CAD.chr16.log          EUR_CAD.chr20.log          EUR_CAD.chr4.log          EUR_CAD.chr9.log
EUR_CAD.chr11.sscore       EUR_CAD.chr16.sscore       EUR_CAD.chr20.sscore       EUR_CAD.chr4.sscore       EUR_CAD.chr9.sscore
EUR_CAD.chr11.sscore.vars  EUR_CAD.chr16.sscore.vars  EUR_CAD.chr20.sscore.vars  EUR_CAD.chr4.sscore.vars  EUR_CAD.chr9.sscore.vars
EUR_CAD.chr12.log          EUR_CAD.chr17.log          EUR_CAD.chr21.log          EUR_CAD.chr5.log          
EUR_CAD.chr12.sscore       EUR_CAD.chr17.sscore       EUR_CAD.chr21.sscore       EUR_CAD.chr5.sscore       
EUR_CAD.chr12.sscore.vars  EUR_CAD.chr17.sscore.vars  EUR_CAD.chr21.sscore.vars  EUR_CAD.chr5.sscore.vars  
EUR_CAD.chr13.log          EUR_CAD.chr18.log          EUR_CAD.chr22.log          EUR_CAD.chr6.log          
EUR_CAD.chr13.sscore       EUR_CAD.chr18.sscore       EUR_CAD.chr22.sscore       EUR_CAD.chr6.sscore       
EUR_CAD.chr13.sscore.vars  EUR_CAD.chr18.sscore.vars  EUR_CAD.chr22.sscore.vars  EUR_CAD.chr6.sscore.vars  
EUR_CAD.chr14.log          EUR_CAD.chr19.log          EUR_CAD.chr2.log           EUR_CAD.chr7.log
EUR_CAD.chr14.sscore       EUR_CAD.chr19.sscore       EUR_CAD.chr2.sscore        EUR_CAD.chr7.sscore
EUR_CAD.chr14.sscore.vars  EUR_CAD.chr19.sscore.vars  EUR_CAD.chr2.sscore.vars   EUR_CAD.chr7.sscore.vars

Contents of a score file:

head EUR_CAD.chr1.sscore
#FID    IID     ALLELE_CT       NAMED_ALLELE_DOSAGE_SUM SCORE1_AVG      SCORE1_SUM
HG00096 HG00096 1046    486     0.00757595      7.92444
HG00097 HG00097 1046    484     0.00783352      8.19386
HG00099 HG00099 1046    505     0.00798495      8.35226
HG00101 HG00101 1046    491     0.00791584      8.27997
HG00102 HG00102 1046    474     0.00769178      8.04561
HG00103 HG00103 1046    502     0.00799328      8.36097
HG00105 HG00105 1046    500     0.00793847      8.30364
HG00107 HG00107 1046    468     0.00770316      8.0575
HG00108 HG00108 1046    469     0.00732639      7.66341
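To make the difference between the two score columns concrete, here is a small sketch of how a sum and an average relate for one sample. It assumes diploid genotypes with no missing calls, and `score_sample` is a hypothetical helper, not a PLINK2 function. In the output shown above, SCORE1_AVG equals SCORE1_SUM divided by ALLELE_CT, which the last lines check for the first sample.

```python
def score_sample(dosages, weights):
    """Return (allele_ct, score_sum, score_avg) for one sample.

    dosages: effect-allele dosage (0 to 2) per variant, no missing genotypes.
    weights: effect weight per variant, in the same order.
    """
    allele_ct = 2 * len(dosages)  # two alleles per diploid variant
    score_sum = sum(d * w for d, w in zip(dosages, weights))
    return allele_ct, score_sum, score_sum / allele_ct


# Toy example with three variants.
allele_ct, score_sum, score_avg = score_sample([0, 1, 2], [0.1, 0.2, 0.3])
print(allele_ct, score_sum, score_avg)

# Check against the first row of EUR_CAD.chr1.sscore shown above.
reported_sum, reported_avg, reported_allele_ct = 7.92444, 0.00757595, 1046
print(abs(reported_sum / reported_allele_ct - reported_avg) < 1e-7)
```

Because the average depends on the number of alleles scored on each chromosome, per-chromosome averages cannot simply be added; the sums can.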

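The reason for working per chromosome is that, in practice, the genotypes are usually stored as very large per-chromosome pgen or bgen files, and merging them is impractical. In that case the scores are calculated separately for each chromosome and summed manually: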

for chr in $(seq 1 22)
do
plink2 \
    --pfile chr${chr} \
    --score ${score} 1 2 3 header list-variants cols=+scoresums \
    --out chr${chr}
done

Step 2: extract the raw scores into a single file

scoreFile=EUR_CAD

count=1
for chr in $(seq 1 22)
do
    file=${scoreFile}.chr${chr}.sscore
    if [ -s "$file" ] ; then 
        if [ ${count} -eq 1 ] ;then
        echo "ID        chr${chr}" > ./${scoreFile}.chrALL.score
        awk 'NR>1{print $1,$6}' ./${file} >> ./${scoreFile}.chrALL.score
        else
        paste ./${scoreFile}.chrALL.score <(cut -f6 ./${file} | awk -v chr=chr${chr} 'NR==1 {$0=chr} 1') >./${scoreFile}.temp&&mv ./${scoreFile}.temp ./${scoreFile}.chrALL.score
        fi
        count=`expr $count + 1`
     fi
done


Resulting file:

head EUR_CAD.chrALL.score
ID      chr1    chr2    chr3    chr4    chr5    chr6    chr7    chr8    chr9    chr10   chr11   chr12   chr13   chr14   chr15   chr16   chr17   chr18   chr19   chr20       chr21   chr22
HG00096 7.92444 6.41628 4.9392  3.9937  5.08024 7.50241 4.61922 3.59464 4.24131 4.56312 5.03334 5.55227 2.95436 2.32917 3.39488 3.61625 3.40913 2.02032 1.95575 2.74394     0.79493 1.33408
HG00097 8.19386 6.33598 5.0713  4.0514  4.92887 7.30242 4.19193 3.66247 4.514   4.42716 5.26378 5.11748 2.7589  2.47905 3.82365 3.83448 3.44519 2.09788 1.80506 2.7079      0.811121        1.27481
HG00099 8.35226 6.61542 5.199   3.94708 4.88824 7.72775 4.31098 3.58159 4.30785 4.81232 5.2327  6.05063 2.8803  2.3651  3.36827 3.54378 3.29725 2.08449 1.88188 2.52587     0.767397        1.21001
HG00101 8.27997 6.13397 5.36465 4.04592 4.88809 7.86323 4.49118 3.60872 4.27056 4.6258  5.42121 5.26046 2.70375 2.52093 3.56581 3.45794 3.30534 2.1327  1.70916 2.66579     0.818805        1.3808
HG00102 8.04561 6.32975 5.20324 3.86051 4.67724 7.78206 4.69662 3.38095 4.40059 4.67174 5.2312  5.04913 2.69084 2.48935 3.62454 3.61557 3.3044  2.12043 1.96028 2.82496     0.808453        1.46265
HG00103 8.36097 6.1202  5.31402 4.24136 4.91736 7.34715 4.55809 3.60716 4.15705 4.66235 5.04477 5.53387 2.70537 2.36537 3.76842 3.64108 3.33273 2.11081 2.00393 2.68089     0.763437        1.32735
HG00105 8.30364 6.13937 5.15716 4.19656 5.06392 7.8889  4.33995 3.78158 4.24975 4.58436 5.32105 5.72797 2.84577 2.43968 3.39822 3.60722 3.44641 1.9986  1.775   2.7736      0.793725        1.41736
HG00107 8.0575  6.51259 5.19973 4.06067 4.78711 7.14954 4.66686 3.81154 4.42637 4.45144 5.26257 5.46242 2.82289 2.28362 3.33811 3.58244 3.25041 2.11758 2.00922 2.70618     0.917279        1.43502
HG00108 7.66341 6.18669 4.98761 4.05088 4.7294  7.39041 4.08688 3.53173 4.43728 4.80822 5.02485 5.52436 3.12156 2.35199 3.54311 3.49076 2.97043 2.12883 1.91326 2.51056     0.817221        1.47096
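The extraction in step 2 can also be sketched in Python. This is a sketch under stated assumptions: `parse_sscore` and `merge_chromosomes` are hypothetical helper names, the input is the whitespace-delimited .sscore layout shown earlier, and chromosomes without an output file are simply left out rather than treated as errors.

```python
def parse_sscore(lines, score_column="SCORE1_SUM"):
    """Return {IID: score} from the lines of one .sscore file."""
    header = lines[0].lstrip("#").split()
    iid_col = header.index("IID")
    score_col = header.index(score_column)
    scores = {}
    for line in lines[1:]:
        fields = line.split()
        if fields:
            scores[fields[iid_col]] = float(fields[score_col])
    return scores


def merge_chromosomes(per_chrom):
    """Turn {chrom: {IID: score}} into {IID: {chrom: score}}."""
    merged = {}
    for chrom, scores in per_chrom.items():
        for iid, value in scores.items():
            merged.setdefault(iid, {})[chrom] = value
    return merged


chr1_lines = [
    "#FID\tIID\tALLELE_CT\tNAMED_ALLELE_DOSAGE_SUM\tSCORE1_AVG\tSCORE1_SUM",
    "HG00096\tHG00096\t1046\t486\t0.00757595\t7.92444",
    "HG00097\tHG00097\t1046\t484\t0.00783352\t8.19386",
]
merged = merge_chromosomes({"chr1": parse_sscore(chr1_lines)})
print(merged["HG00096"])
```

In a real run each list of lines would come from reading `EUR_CAD.chr${chr}.sscore` only when that file exists, which mirrors the `-s` test in the shell loop.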

Step 3: sum the scores with awk

scoreFile=EUR_CAD
# Sum the per-chromosome scores for each sample
awk '{if(NR==1){
                print "ID","score"
                  }
                else{
                        t=0   # reset the running total for every sample
                        for(i=2;i<=NF;i++) t+=$i
                        print $1,t
                    }
        }' ./${scoreFile}.chrALL.score > ./${scoreFile}.summary.score

Resulting file:

head EUR_CAD.summary.score
ID score
HG00096 88.013
HG00097 88.0987
HG00099 88.9502
HG00101 88.5148
HG00102 88.2301
HG00103 88.5637
HG00105 89.2498
HG00107 88.3111
HG00108 86.7404

The sum is complete. The result is the raw score (no further processing, no standardization, no averaging: simply the sum over all variants of effect weight times dosage) and can be used for downstream PRS analysis.
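The per-sample total can be cross-checked in Python. A minimal sketch, assuming the whitespace-delimited table produced in step 2; `sum_scores` is a hypothetical helper, and each sample's total is computed independently of the other rows.

```python
def sum_scores(table_lines):
    """Return {ID: total score} with a fresh total for every row."""
    totals = {}
    for line in table_lines[1:]:  # skip the header row
        fields = line.split()
        if not fields:
            continue
        totals[fields[0]] = sum(float(x) for x in fields[1:])
    return totals


# Header and first sample of EUR_CAD.chrALL.score as shown in step 2.
table = [
    "ID " + " ".join(f"chr{i}" for i in range(1, 23)),
    "HG00096 7.92444 6.41628 4.9392 3.9937 5.08024 7.50241 4.61922 3.59464 "
    "4.24131 4.56312 5.03334 5.55227 2.95436 2.32917 3.39488 3.61625 3.40913 "
    "2.02032 1.95575 2.74394 0.79493 1.33408",
]
totals = sum_scores(table)
print(round(totals["HG00096"], 3))
```

For this sample the 22 per-chromosome values add up to about 88.013.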

References:

https://www.cog-genomics.org/plink/2.0/score

https://www.pgscatalog.org/browse/scores/

https://choishingwan.github.io/PRS-Tutorial/plink/

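Polygenic risk score (PRS) series, part 8: the PGS Catalog

Contents

  1. Introduction to the PGS Catalog
  2. Inclusion criteria of the PGS Catalog
  3. Finding a PGS in the PGS Catalog
  4. PGS scoring file format and download
  5. Calculating a PGS with PLINK after download
  6. References

Previous posts in this series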

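Introduction to the PGS Catalog

This post briefly introduces the PGS Catalog and how to use it. The database is likely to be indispensable for future PGS research: sooner or later you will need it to look up or download existing PGS, or to submit your own PGS model.

Like the GWAS Catalog, the PGS Catalog is an open database of published polygenic scores. Each PGS in the catalog is consistently annotated with metadata: the scoring file (variants, effect alleles/weights), annotations of how the PGS was developed and applied, and evaluations of its predictive performance. The trait of each PGS is mapped to the EFO (Experimental Factor Ontology, https://www.ebi.ac.uk/efo/) to keep studies consistent (the GWAS Catalog also uses EFO).

The PGS Catalog aims to index PGS and to distribute the key information for each one (variants, results, study design, etc.) in a standardized form, to facilitate evaluation of the validity of PGS analyses.

The database is developed by Samuel Lambert in Michael Inouye's group at the University of Cambridge (Inouye is very active on Twitter and worth following), in collaboration with HDR UK and the NHGRI-EBI GWAS Catalog.

The PGS Catalog home page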

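Inclusion criteria of the PGS Catalog

There are two main kinds of eligible entries:

  1. Newly developed PGS, with the essential information on the score and its predictive ability (which must be evaluated in independent samples).
  2. Evaluations of previously developed PGS in new populations.

Once included, each PGS is assigned an identifier, for example PGS000001.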

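Finding a PGS in the PGS Catalog

To find a PGS, either search by keyword in the search box or browse the database by trait, publication, and so on.

Taking breast cancer as an example, the search shows that the database contained 112 breast-cancer-related PGS at the time of writing:

Clicking through shows a summary of these PGS:

The PGS can be filtered by ancestry. The ancestry distribution column in the list shows the ancestry composition of the samples used.

After choosing a PGS of interest, click through to view its details or download the PGS model file directly.

Each PGS page contains detailed information about the score: the construction method and parameters, the source GWAS data, the evaluation metrics, the samples used for evaluation, and so on.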

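PGS scoring file format and download

The file formats used by the PGS Catalog are documented at https://www.pgscatalog.org/downloads/

As shown below, the Scoring File Format has two parts: a header and the data.

The header mainly holds the file format version, basic information about the PGS, and information about the source study. The data part lists the variants together with the alleles and weights used to calculate the PGS, and can usually be used directly.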

###PGS CATALOG SCORING FILE - see <https://www.pgscatalog.org/downloads/#dl_ftp_scoring> for additional information
#format_version=2.0
##POLYGENIC SCORE (PGS) INFORMATION
#pgs_id=PGS000001
#pgs_name=PRS77_BC
#trait_reported=Breast Cancer
#trait_mapped=breast carcinoma
#trait_efo=EFO_0000305
#weight_type=NR
#genome_build=NR
#variants_number=77
##SOURCE INFORMATION
#pgp_id=PGP000001
#citation=Mavaddat N et al. J Natl Cancer Inst (2015). doi:10.1093/jnci/djv036
rsID	chr_name	effect_allele	other_allele	effect_weight	locus_name	OR
rs78540526	11	T	C	0.16220387987485377	CCND1	1.1761
...
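Because the header is a series of `#key=value` lines, it is easy to read programmatically, for example to check the genome build before scoring. The following is a minimal sketch; `parse_pgs_header` is a hypothetical helper and it assumes the header layout shown above.

```python
def parse_pgs_header(lines):
    """Collect the '#key=value' header lines of a scoring file into a dict."""
    metadata = {}
    for line in lines:
        if not line.startswith("#"):
            break  # the data section starts here
        key, sep, value = line.lstrip("#").partition("=")
        if sep:  # section titles such as '##SOURCE INFORMATION' carry no '='
            metadata[key.strip()] = value.strip()
    return metadata


header_lines = [
    "###PGS CATALOG SCORING FILE",
    "#format_version=2.0",
    "##POLYGENIC SCORE (PGS) INFORMATION",
    "#pgs_id=PGS000001",
    "#genome_build=NR",
    "#variants_number=77",
    "rsID\tchr_name\teffect_allele\tother_allele\teffect_weight",
]
metadata = parse_pgs_header(header_lines)
print(metadata["pgs_id"], metadata["genome_build"], metadata["variants_number"])
```

In this example file the genome build is given as NR, so positions cannot be trusted without further checks, and matching by rsID or using the harmonized files is the safer route.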

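Calculating a PGS with PLINK after download

After confirming that details such as the genome build are correct, the PGS can be calculated with PLINK using your own genotype data. (If the variant IDs differ from those in your genotype files, they need to be matched first.)

The calculation is described at https://www.cog-genomics.org/plink/2.0/score and in the second half of GWASLab: Polygenic risk score (PRS) series, part 2: calculating PRS with PLINK (C+T method).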

plink2 --score <filename> [i] [j] [k] [{header | header-read}]
                   [{center | variance-standardize | dominant | recessive}]
                   ['no-mean-imputation'] ['se'] ['zs'] ['ignore-dup-ids']
                   [{list-variants | list-variants-zs}]
                   ['cols='<column set descriptor>]

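Note that imputed dosage files are often split by chromosome, whereas a PGS model file usually contains variants on all chromosomes. In that case, calculate the plain score sum for each chromosome (with the cols=+scoresums option of --score) and then add up the scores of the 22 chromosomes to obtain the total score.

Column headers of the PLINK2 --score output file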

Header	Column set	Contents
FID	maybefid, fid	Family ID
IID	(required)	Individual ID
SID	maybesid, sid	Source ID
PHENO1	pheno1	All-missing phenotype column, if none loaded
<Pheno name>, ...	pheno1, phenos	Phenotype value(s) (only first if just 'pheno1')
ALLELE_CT	nallele	Number of alleles across scored variants
DENOM	denom	Denominator used for score average
NAMED_ALLELE_DOSAGE_SUM	dosagesum	Sum of named allele dosages
<Score name>_AVG, ...	scoreavgs	Score averages
<Score name>_SUM, ...	scoresums	Score sums (sum this column when scoring per chromosome)

References

https://www.pgscatalog.org/

Samuel A. Lambert, Laurent Gil, Simon Jupp, Scott C. Ritchie, Yu Xu, Annalisa Buniello, Aoife McMahon, Gad Abraham, Michael Chapman, Helen Parkinson, John Danesh, Jacqueline A. L. MacArthur and Michael Inouye.

The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation

Nature Genetics doi: 10.1038/s41588-021-00783-5 (2021).

https://www.ebi.ac.uk/efo/

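Polygenic risk score (PRS) series, part 5: calculating PRS with PRS-CS (beta shrinkage method)

Contents

  1. Recap of the PRS series
  2. Introduction to PRS-CS
  3. Conceptual framework
  4. Usage
  5. References

Series recap:

Introduction to PRS-CS

This post introduces PRS-CS, a method that infers posterior SNP effect sizes from GWAS summary statistics and an external LD reference panel.

The method uses a high-dimensional Bayesian regression framework. Its main difference from earlier work of this kind is that it places a continuous shrinkage (CS) prior on beta, which adapts to a wider range of genetic architectures and greatly improves computational efficiency, making multivariate modeling of local LD patterns feasible.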

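Conceptual framework

  1. Consider a Bayesian high-dimensional regression framework.
  2. Commonly used priors on beta can be expressed as scale mixtures of normals.
  3. To build more flexible models that better reflect the genetic architecture, some methods use a discrete mixture of two or more point masses or distributions. Compared with a plain normal prior, these can model a wider range of effect size distributions. LDpred is an example.
  4. However, posterior inference for such methods is computationally very demanding and may not model local LD patterns accurately.
  5. To address these problems, the PRS-CS authors use a prior that is conceptually different from the earlier discrete mixtures: a continuous shrinkage prior, that is, a global-local scale mixture of normals.
  6. PRS-CS estimates the global scaling parameter in one of two ways: by searching over fixed values, or by placing a prior on the global parameter and estimating it from the data.

For the full derivation, see the original paper: T Ge, CY Chen, Y Ni, YCA Feng, JW Smoller. Polygenic Prediction via Bayesian Regression and Continuous Shrinkage Priors. Nature Communications, 10:1776, 2019.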
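For orientation, the global-local scale mixture in item 5 can be written out as follows. This is a summary from my reading of Ge et al. (2019), so check the notation against the paper: \beta_j is the effect of SNP j, N the GWAS sample size, \sigma^2 the residual variance, \phi the global scaling parameter shared by all SNPs, \psi_j the local, SNP-specific scaling parameter, and a and b the gamma-gamma prior parameters.

```latex
\beta_j \mid \psi_j \sim N\!\left(0,\ \frac{\sigma^2}{N}\,\phi\,\psi_j\right), \qquad
\psi_j \sim \mathrm{Gamma}(a,\ \delta_j), \qquad
\delta_j \sim \mathrm{Gamma}(b,\ 1)

% Second option in item 6 (PRS-CS-auto): a half-Cauchy prior on the global parameter
\phi^{1/2} \sim C^{+}(0,\ 1)
```

A small \phi shrinks all effects strongly toward zero, while a heavy-tailed \psi_j lets individual SNPs with large effects escape that shrinkage.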

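Usage

PRS-CS is written in Python and is compatible with both Python 2 and 3. It requires the scipy and h5py packages.

Download PRS-CS itself and the external LD reference panels from: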

https://github.com/getian107/PRScs

git clone https://github.com/getian107/PRScs.git

Once the download is complete, check that it runs:

./PRScs.py --help

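Download the LD reference panel for the relevant population. The authors provide reference panels built from 1000 Genomes and UK Biobank data: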

https://github.com/getian107/PRScs#getting-started

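Input preparation

  1. The LD reference panel downloaded above (distributed as HDF5 files, hence the h5py requirement)
  2. The bim file of the target sample: it is only used to supply the list of SNPs in the target dataset
  3. GWAS sumstats

The sumstats format is as follows: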

SNP          A1   A2   BETA      P
rs4970383    C    A    -0.0064   4.7780e-01
rs4475691    C    T    -0.0145   1.2450e-01
rs13302982   A    G    -0.0232   2.4290e-01

or:

SNP          A1   A2   OR        P
rs4970383    A    C    0.9825    0.5737                 
rs4475691    T    C    0.9436    0.0691
rs13302982   A    G    1.1337    0.0209
...

Note that A1 is the effect allele and A2 is the non-effect allele.
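Before running PRS-CS it can be worth checking that the sumstats file really has this layout. The sketch below is only a pre-flight check, not part of PRS-CS, which reads either format directly; `read_sumstats` is a hypothetical helper that assumes whitespace-delimited input and converts OR to log(OR) so that both formats end up on the same scale.

```python
import math

REQUIRED = ["SNP", "A1", "A2", "P"]


def read_sumstats(lines):
    """Parse whitespace-delimited summary statistics into a list of dicts.

    Effects given as OR are converted to log(OR) and stored under 'BETA'.
    """
    header = lines[0].split()
    missing = [col for col in REQUIRED if col not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if "BETA" not in header and "OR" not in header:
        raise ValueError("need a BETA or OR column")
    records = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != len(header):
            continue  # skip blank or truncated lines such as '...'
        row = dict(zip(header, fields))
        beta = float(row["BETA"]) if "BETA" in row else math.log(float(row["OR"]))
        records.append({"SNP": row["SNP"], "A1": row["A1"], "A2": row["A2"],
                        "BETA": beta, "P": float(row["P"])})
    return records


or_lines = [
    "SNP          A1   A2   OR        P",
    "rs4970383    A    C    0.9825    0.5737",
    "rs4475691    T    C    0.9436    0.0691",
    "...",
]
records = read_sumstats(or_lines)
print(len(records), records[0]["SNP"], round(records[0]["BETA"], 4))
```

An OR below 1 becomes a negative BETA for A1, which is a quick way to confirm that the effect allele column is the one intended.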

Calculating the PRS model

Usage:

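python PRScs.py \
  --ref_dir=PATH_TO_REFERENCE \
  --bim_prefix=VALIDATION_BIM_PREFIX \
  --sst_file=SUM_STATS_FILE \
  --n_gwas=GWAS_SAMPLE_SIZE \
  --out_dir=OUTPUT_DIR

# Required arguments:
#   --ref_dir      directory containing the LD reference panel
#   --bim_prefix   path prefix of the target sample's PLINK bim file
#   --sst_file     path to the GWAS sumstats
#   --n_gwas       GWAS sample size
#   --out_dir      output directory

# Optional arguments:
#   --a=PARAM_A                   parameter a of the gamma-gamma prior, default 1
#   --b=PARAM_B                   parameter b of the gamma-gamma prior, default 0.5
#   --phi=PARAM_PHI               global scaling parameter; estimated automatically if not given (PRS-CS-auto); a small grid search (phi=1e-6, 1e-4, 1e-2, 1) is also possible
#   --n_iter=MCMC_ITERATIONS      number of MCMC iterations, default 1000
#   --n_burnin=MCMC_BURNIN        number of MCMC burn-in iterations, default 500
#   --thin=MCMC_THINNING_FACTOR   thinning factor of the Markov chain, default 5
#   --chrom=CHROM                 restrict the calculation to a single chromosome
#   --beta_std=BETA_STD           if True, output standardized posterior SNP effect sizes
#   --seed=SEED                   random seed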

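PRS-CS relies on scipy, which by default uses all available CPU cores. On a shared server this can interfere with other jobs; the number of threads can be limited with the following shell settings: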

export MKL_NUM_THREADS=$N_THREADS
export NUMEXPR_NUM_THREADS=$N_THREADS
export OMP_NUM_THREADS=$N_THREADS

For example, to use a single core, set N_THREADS=1.
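The same limits can be set from Python, for instance in a wrapper script that launches PRScs.py as a child process. A minimal sketch, with `N_THREADS` as an illustrative value; the variables only take effect if they are set before numpy or scipy is imported for the first time, and child processes inherit them.

```python
import os

N_THREADS = "1"  # illustrative value; choose what your server allows

# Set the limits before numpy/scipy are imported anywhere in this process.
for var in ("MKL_NUM_THREADS", "NUMEXPR_NUM_THREADS", "OMP_NUM_THREADS"):
    os.environ[var] = N_THREADS

print({var: os.environ[var] for var in
       ("MKL_NUM_THREADS", "NUMEXPR_NUM_THREADS", "OMP_NUM_THREADS")})
```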

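Output

For each SNP the output file contains the chromosome, rsID, base position, A1, A2, and the posterior effect size estimate.

This file can then be used with PLINK to calculate the PRS for the target sample: https://www.cog-genomics.org/plink/1.9/score

When running per chromosome, simply sum the per-chromosome PRS at the end.

Test example

Using the EUR reference panel, the GWAS data in the test_data folder, and a bim file with 1,000 SNPs on chromosome 22: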

python PRScs.py \
--ref_dir=path_to_ref/ldblk_1kg_eur \
--bim_prefix=path_to_bim/test \
--sst_file=path_to_sumstats/sumstats.txt \
--n_gwas=200000 \
--chrom=22 \
--phi=1e-2 \
--out_dir=path_to_output/eur

References

T Ge, CY Chen, Y Ni, YCA Feng, JW Smoller. Polygenic Prediction via Bayesian Regression and Continuous Shrinkage Priors. Nature Communications, 10:1776, 2019.

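An introduction to rsIDs and the pitfalls of converting between rsID and chr:pos

Many people find converting positions to rsIDs tedious and are tempted to take a shortcut by matching rsIDs using only the chr:pos information in the file at hand. The problems this causes are rarely discussed. This post explains what an rsID is and covers some common problems in using and converting rsIDs.

Contents
What is an rsID
Main advantage
Variant types that an rsID can represent
(Key point) Common errors when converting between rsID and chr:pos
Solutions
References

What is an rsID

An rsID is a dbSNP Reference SNP ID (abbreviated rs or RefSNP): a numeric identifier assigned by dbSNP to identify a variant site. rsIDs are designed to be non-redundant, that is, globally unique. Submitted variants are classified, curated, and annotated, and duplicate variants are merged.

Main advantage

An rsID does not depend on the reference genome build. Unlike chr:pos, which changes from build to build, an rsID stays the same across builds. This gives a stable way to refer to variants, which is convenient for large-scale population genetics and precision medicine studies. (This is the official description. In my own experience, converting to rsIDs sometimes causes more problems than not converting; there are pros and cons, but by convention the conversion is still expected.)

Variant types that an rsID can represent

Although rs stands for Reference SNP, other types of variants are also assigned rsIDs (usually variants shorter than 50 bp):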

  • Single nucleotide variation (SNV)
  • Short multi-nucleotide changes (MNV)
  • Small deletions or insertions (INDEL)
  • Small STR repeats
  • Retrotransposable element insertions

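rsIDs are a double-edged sword: pitfalls of converting between chr:pos and rsID using position alone

  • Problems when converting to rsID using only chr:pos:
  1. The rsID for the site may not exist, for example because the variant is new; in that case chr:pos:ref:alt can usually be used as the ID instead. A related problem is that the alt allele may not be recorded: for example, rs123456 corresponds to T>C,A at chr1:123456, while your data has T>G at chr1:123456. Should they get the same rsID? Judging by position and variant type alone, yes, and a future dbSNP release may well add this allele, but I do not have a definitive answer (discussion is welcome in the comments). In practice I lean conservative and use chr:pos:ref:alt rather than the rsID.
  2. As described above, rsIDs are not limited to single-nucleotide SNPs; they also represent other variant types, so several rsIDs can describe variants at the same position. The most common case is a SNP and an INDEL at the same site. Matching on chr:pos alone, ignoring alleles, confuses the two and mismatches SNP and INDEL rsIDs in large numbers, which causes serious trouble for downstream functional analysis. For example, rs123456 corresponds to T>C at chr1:123456, while rs987654 is also located at chr1:123456 but is an INDEL, T>TA. Although the position is the same, the consequences of the two variants can be completely different, and an effect that is due to the INDEL rs987654 would be wrongly attributed to the SNP rs123456. This should be avoided: it breaks the uniqueness that rsIDs are designed to provide and defeats their purpose.
  3. Have the variants in your data been normalized? The chr:pos of an unnormalized variant is unreliable: after normalization (left-alignment and parsimony) the position may shift, so matching with unnormalized positions can hit a neighboring site. For example, chr1:123456:AA:AT becomes chr1:123457:A:T after normalization, one base further along. The raw 1000 Genomes data contain many such cases, so take care (see GWASLab: Variant Normalization).
  4. 0-based versus 1-based coordinates: keep this in mind when processing data; it is not covered in detail here (see GWASLab: LiftOver genome coordinate conversion and the 0-based and 1-based coordinate systems).

Problems when converting an rsID to chr:pos on a given reference genome build:

  1. By design an rsID corresponds to exactly one variant, but in practice, because of differences between dbSNP releases or for other reasons, an rsID in GWAS sumstats may correspond to two positions, and several rsIDs may correspond to the same variant at the same position.
  2. The variant may have no position on the chosen reference genome build, and so on.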

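Solutions

When converting rsID to chr:pos, identify the dbSNP version of the original data whenever possible and use that version. When it cannot be identified, set a single standard (for reproducibility) and use one dbSNP version throughout the conversion.

When converting chr:pos to rsID, do not cut corners: first confirm that the variants are normalized, then annotate, assigning an rsID only when the position (chr:pos) on the matching genome build and both the ref and alt alleles all match.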
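The allele-aware matching described here can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a replacement for a proper annotation tool: `trim_alleles` only applies the parsimony step (full normalization also needs left-alignment against the reference sequence), `annotate_rsid` and the lookup table are hypothetical, and the IDs are the made-up examples used in the text.

```python
def trim_alleles(pos, ref, alt):
    """Parsimony-trim shared bases; pos is 1-based. No left-alignment is done."""
    # Trim shared trailing bases first, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then trim shared leading bases, shifting the position right each time.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def annotate_rsid(chrom, pos, ref, alt, lookup):
    """Return the rsID only if chr, pos, ref and alt all match; else chr:pos:ref:alt."""
    pos, ref, alt = trim_alleles(pos, ref, alt)
    return lookup.get((chrom, pos, ref, alt), f"{chrom}:{pos}:{ref}:{alt}")


# Hypothetical lookup table keyed on (chr, pos, ref, alt).
lookup = {
    ("1", 123456, "T", "C"): "rs123456",
    ("1", 123456, "T", "TA"): "rs987654",
}
print(annotate_rsid("1", 123456, "T", "C", lookup))   # SNP
print(annotate_rsid("1", 123456, "T", "TA", lookup))  # INDEL at the same position
print(annotate_rsid("1", 123456, "T", "G", lookup))   # allele not in the table
print(trim_alleles(123456, "AA", "AT"))
```

Keying the lookup on all four fields keeps the SNP and the INDEL at the same position apart, and a variant whose alt allele is not in the table falls back to chr:pos:ref:alt instead of borrowing another variant's rsID.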

See also:

GWASLab: Functional annotation of variants with ANNOVAR

GWASLab: Matching SNP rsIDs and positions (rsID / chr:pos conversion)

References:

https://www.ncbi.nlm.nih.gov/snp/do