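Effective sample size in GWAS

The effective sample size of a GWAS matters for downstream analyses: a wrong or imprecise estimate distorts the interpretation of GWAS results. This post briefly introduces the definition of effective sample size in case-control GWAS and distinguishes it from several concepts that are easily confused with it.

Effective sample size in case-control GWAS

In a case-control GWAS, the effective sample size (ESS) is usually defined as the sample size of a balanced study (case : control = 1:1, i.e. 50% cases and 50% controls) that would have the same statistical power. Defining it this way makes sample sizes comparable across studies.

Note: in other contexts "effective sample size" has a more general definition: the size of a simple random sample that would estimate a given quantity with the same precision as the sample at hand. Examples include survey sampling, Markov chain Monte Carlo, and time series analysis. This post only covers the effective sample size of case-control GWAS.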

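Let v be the proportion of cases in the sample, N the total sample size (cases plus controls), and N_eff the effective sample size. For a SNP with allele frequency p (and q = 1 - p), the standard error of the effect size beta can be approximated as

$SE^2 \approx {{1}\over{2pq \times N v (1-v)}}$

The effective sample size is then

$N_{eff} = 4v(1-v)N$

Equivalent expressions are often seen in the literature: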

N_eff = 4 * Ncase * Ncontrol / (Ncase + Ncontrol)

or

N_eff = 4/(1/ncase + 1/ncontrol)

Other conventions also exist, which assume a different reference case:control ratio, for example

https://www.nature.com/articles/nprot.2014.071

N_eff = 2/(1/ncase + 1/ncontrol)

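Effective sample size in meta-analysis

Caveat: the sum of the per-study effective sample sizes is not equal to the effective sample size computed from the overall case proportion of the pooled sample.

Two recommended approaches:

Simple approach: sum the effective sample sizes of the individual studies.
When per-study effective sample sizes are not available, approximate it from the meta-analysed sumstats: N_eff ≈ 4/(2pq × SE^2), where pq = MAF × (1 - MAF).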
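The two approaches above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names and the two cohorts are hypothetical, and the sumstats-based estimate is only the per-SNP approximation quoted above.

```python
def neff_case_control(n_case, n_control):
    """Effective sample size 4 / (1/n_case + 1/n_control), i.e. 4 * v * (1 - v) * N."""
    return 4.0 / (1.0 / n_case + 1.0 / n_control)


def neff_from_sumstats(maf, se):
    """Per-SNP approximation N_eff ~ 4 / (2 * p * q * SE^2), with p = MAF and q = 1 - MAF."""
    return 4.0 / (2.0 * maf * (1.0 - maf) * se ** 2)


# Hypothetical cohorts given as (cases, controls): one unbalanced, one balanced.
cohorts = [(1000, 9000), (5000, 5000)]

# Recommended: sum the per-cohort effective sample sizes.
sum_of_neff = sum(neff_case_control(ca, co) for ca, co in cohorts)

# Not equivalent: pool the counts first, then compute a single N_eff.
pooled_neff = neff_case_control(
    sum(ca for ca, _ in cohorts), sum(co for _, co in cohorts)
)

print(sum_of_neff, pooled_neff)
```

Here the pooled calculation gives a larger number than the sum of per-cohort values, which illustrates the caveat above: the pooled case proportion hides how unbalanced the individual cohorts are.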

For a detailed derivation and analysis, see

Grotzinger, A. D., de la Fuente, J., Privé, F., Nivard, M. G., & Tucker-Drob, E. M. (2023). Pervasive downward bias in estimates of liability-scale heritability in genome-wide association study meta-analysis: a simple solution. Biological psychiatry, 93(1), 29-36.

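Difference from effective population size

Effective population size (Ne) is an important concept in population genetics. It is the size of an idealized Wright–Fisher population that would show the same amount of change due to genetic drift as the population under study. "The same" is always defined with respect to a particular genetic quantity, for example the variance in allele frequency (variance effective population size) or the inbreeding coefficient (inbreeding effective population size). The idea resembles effective sample size, but the two refer to different things and should not be confused.

Effective sample size in linear mixed model GWAS of quantitative traits

Most GWAS now use linear mixed models, which have the advantage of allowing related individuals to be included. Relatedness, however, changes the effective sample size. For a discussion, see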

Ziyatdinov, A., Kim, J., Prokopenko, D., Privé, F., Laporte, F., Loh, P. R., … & Aschard, H. (2021). Estimating the effective sample size in association studies of quantitative traits. G3, 11(6), jkab057.

References

https://github.com/GenomicSEM/GenomicSEM/wiki/2.1-Calculating-Sum-of-Effective-Sample-Size-and-Preparing-GWAS-Summary-Statistics

https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_power/bs704_power_print.html

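The debate over clinical use of polygenic risk scores (PRS)

Background

A recent article in BMJ Medicine on the use of PRS in disease screening, prediction, and risk stratification has sparked heated discussion.

Article: Hingorani, A. D., Gratton, J., Finan, C., Schmidt, A. F., Patel, R., Sofat, R., … & Wald, N. J. (2023). Performance of polygenic risk scores in screening, prediction, and risk stratification: secondary analysis of data in the Polygenic Score Catalog. BMJ medicine, 2(1).

Hingorani et al. took the published performance estimates (hazard ratios, odds ratios, etc.) of the polygenic scores in the PGS Catalog and converted them into a metric commonly used for clinical tests, DR5 (the detection rate for a 5% false positive rate), for a secondary analysis. The article reached a negative conclusion: PRS perform poorly in screening, prediction, and risk stratification, and the emphasis placed on PRS is out of proportion to their actual effect in health systems.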

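The article was published on 17 October 2023 and was immediately and strongly challenged by Professor D Gareth Evans of the University of Manchester. In paraphrase: their argument against using PRS is like saying that no risk factor should be used to assess disease risk. A woman with a twofold PRS who has no family history of breast cancer, an early first pregnancy, and late menarche has only average risk. Combining PRS with other risk factors can substantially reduce the false positives discussed in the article and identify individuals who are truly at high risk. On the same day Harry Hill of the University of Sheffield raised similar objections. A few days later Padrik et al. of the University of Tartu used breast cancer as a counterexample, arguing that the potential clinical impact of PRS has to be assessed within the specific clinical context of each disease.

A few weeks later the first author, Hingorani, replied to each of these objections, mainly by giving data and citing other studies to show that, even in combination with other risk factors, PRS improve screening performance only slightly, and that the objections do not affect the main conclusions of the article.

At the end of last month, Samuel A. Lambert and a group of well-known researchers in the field (including the main contributors to the PGS Catalog) published a joint rebuttal. They stressed that a PRS is neither a diagnostic test nor a stand-alone risk factor, gave examples of PRS applications and their cost-effectiveness, and pointed to flawed modelling and incomplete premises in the article. In their letter to the editor they argued, in paraphrase, that the conclusions rest on incomplete premises and ignore clinical context, that the article does not advance the discussion of potential clinical uses of PGS, and that it may mislead clinicians and the public.

The editor has not yet replied. The original responses are available at https://bmjmedicine.bmj.com/content/2/1/e000554.responses for readers who want to see them in full.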

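Some personal views

(Mainly based on Lewis, C. M., & Vassos, E. (2020). Polygenic risk scores: from research tools to clinical instruments. Genome medicine, 12(1), 1-11.)

A PRS reflects the part of an individual's disease risk that is attributable to genetic factors. By definition, an individual's PRS is the weighted sum of risk-allele counts over independent risk loci. The model is simple and practical but far from perfect: it usually considers only additive genetic architecture and ignores possible gene-gene and gene-environment interactions. Seen the other way round, these are all possible directions for future PRS methods.

For potential clinical applications of PRS (disease prediction, risk stratification, and so on), keep in mind that PRS performance is affected by many things, such as the multifactorial nature of disease and inaccuracy or error in measuring the genetic signal. For complex diseases a PRS used alone will not perform very well, and it should be interpreted as a complement to prediction models built on traditional risk factors, not a replacement for them.

Understanding PRS also means avoiding the false impression of genetic determinism. Using a PRS alone to predict a complex disease ignores everything around it that keeps changing. Conceptually, an individual's genetic liability is fixed, but the risk it gives rise to is dynamic and depends on changing factors such as age, environmental exposure, family history, and personal medical history. The simplest example: someone may carry a high genetic risk of alcohol dependence, but if for some reason they never encounter alcohol, that genetic risk never comes into play.

Many problems remain to be solved before PRS can be used in practice, and research on them needs to start from a sound scientific understanding of PRS in order to avoid bias.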

References

(Cited critically) Hingorani, A. D., Gratton, J., Finan, C., Schmidt, A. F., Patel, R., Sofat, R., … & Wald, N. J. (2023). Performance of polygenic risk scores in screening, prediction, and risk stratification: secondary analysis of data in the Polygenic Score Catalog. BMJ medicine, 2(1). https://bmjmedicine.bmj.com/content/2/1/e000554

Lewis, C. M., & Vassos, E. (2020). Polygenic risk scores: from research tools to clinical instruments. Genome medicine, 12(1), 1-11.

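scDRS: single-cell disease relevance score

Combining genomics with single-cell RNA sequencing

One of the current hot topics in complex disease methodology is the combination of multiple omics layers and methods. Among recent approaches, scDRS is a representative method that combines genomics with single-cell RNA sequencing; sc-linker is a similar one. These methods belong to a new interdisciplinary area, single-cell genetics.

Earlier ways of combining genomics with single-cell sequencing, such as MAGMA or LDSC-SEG, are in essence enrichment tests on sets of genes specifically expressed in a cell type. Methods like scDRS instead use the scRNA-seq expression matrix to go down to the level of individual cells. The higher resolution is valuable for dissecting disease heterogeneity and identifying disease-associated cell subpopulations (in other words, making precision medicine more precise).

scDRS stands for single-cell disease relevance score. As the name suggests, it resembles the polygenic risk score (PRS), and its construction and calculation are structurally similar to a PRS, except that the "individuals" are single cells. A PRS assesses an individual's disease risk, whereas scDRS assesses whether a cell highly expresses disease-associated genes. Note the conceptual difference in the R: in PRS it is risk (which has a direction), in scDRS it is relevance (which does not).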

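Overview of the scDRS method

First, build a disease gene set from GWAS results: MAGMA is run on the GWAS sumstats of the target disease to perform gene-level association tests, which gives a Z-score for each gene's association with the disease; the top 1,000 genes are taken as putative disease genes. (This step is not the focus of the method; other similar approaches can be used instead of MAGMA.)
Next, compute a disease score for each cell: scDRS quantifies the aggregate expression of the putative disease genes in every cell. To maximize power, genes are weighted by their MAGMA Z-scores and by their gene-specific technical noise in the single-cell data.
Finally, scDRS normalizes the raw disease scores and the raw control scores (computed from matched control gene sets) across gene sets and cells, and then computes a P value for each cell from the empirical distribution of the normalized control scores pooled over all control gene sets and cells.
The resulting scores support several downstream analyses, including (1) association tests at the level of individual cells, (2) cell-type association tests, and (3) gene association tests. The P values for these analyses are all obtained with Monte Carlo (MC) tests.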
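To make the Monte Carlo idea concrete, here is a generic sketch of an MC P value: the observed score is compared with scores obtained from control gene sets. It is not the scDRS implementation (which normalizes scores and pools control scores across cells); the function name and the toy numbers are hypothetical.

```python
def mc_pvalue(observed, control_scores):
    """One-sided Monte Carlo P value.

    The +1 in numerator and denominator counts the observed score itself,
    so the P value is never exactly zero.
    """
    n_at_least = sum(1 for s in control_scores if s >= observed)
    return (1 + n_at_least) / (1 + len(control_scores))


# Toy example: one control score out of four is at least as large as the observed score.
print(mc_pvalue(3.0, [0.0, 1.0, 2.0, 4.0]))
```

With B control scores the smallest attainable P value is 1/(B + 1), so the number of control gene sets bounds the resolution of the test.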

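Use cases (downstream analyses)

Identifying cell types associated with a phenotype

Unlike the earlier methods LDSC and MAGMA, scDRS can test not only cell-type association but also heterogeneity within a cell type.

The authors show a heatmap of association results for 22 phenotypes and 19 cell types. The Y axis is the phenotype and the X axis the cell type; squares mark significant associations, × marks heterogeneity within a cell type, and color intensity shows the proportion of significantly associated cells.

Finding heterogeneous cell subpopulations

The authors studied T-cell heterogeneity in autoimmune disease: among 11 T-cell clusters (panel a), the T cells associated with IBD formed 4 new clusters (panel b).

Gene-score association analysis

scDRS tests whether a gene is co-expressed with the genes in the GWAS-derived gene set.

According to the paper, scDRS identifies disease-relevant genes more accurately than MAGMA.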

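Typical analysis workflow

All that is needed is GWAS sumstats and single-cell RNA-seq data, and the good news is that both are easy to obtain from public databases.

The official documentation and existing blog posts already describe the workflow in detail.

For analysis code, see https://martinjzhang.github.io/scDRS/

and https://zhuanlan.zhihu.com/p/592128325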

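Points to note

scDRS assumes that the genes in the gene set are associated with the disease and that disease-associated genes are highly expressed in disease-relevant cell populations (which may be diseased or healthy cells), regardless of the direction of the genes' effects. scDRS does not assume that the genes in the gene set are highly expressed in diseased cells. Note this subtle conceptual difference. (https://github.com/martinjzhang/scDRS/issues/42)
Basic data requirements: for adequate power, the GWAS should preferably have a heritability z-score above 5 or a sample size above 100,000. (https://martinjzhang.github.io/scDRS/faq.html#which-gwas-and-scrna-seq-data-to-use)
Seurat data must be converted to the h5ad format used by scanpy. The expression matrix must not contain negative values; the matrix in Seurat's scaled.data does contain negative values, so take care when converting. (https://github.com/martinjzhang/scDRS/issues/44) The conversion can be done with SeuratDisk: https://mojaveazure.github.io/seurat-disk/articles/convert-anndata.html
To increase power, the single-cell RNA-seq data can be imputed beforehand. (https://github.com/martinjzhang/scDRS/issues/32)
--adj-prop adjusts for cell-type proportions (useful when some cell types are heavily over-represented). (https://github.com/martinjzhang/scDRS/issues/32)
Very few significant cells: this is normal, and group-level analysis can still be run (it usually has higher power). (https://martinjzhang.github.io/scDRS/faq.html#scdrs-detected-few-significant-cells-fdr-0-2)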

References

Zhang, M. J., Hou, K., Dey, K. K., Sakaue, S., Jagadeesh, K. A., Weinand, K., … & Price, A. L. (2022). Polygenic enrichment distinguishes disease associations of individual cells in single-cell RNA-seq data. Nature genetics, 54(10), 1572-1580.

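Race, ethnicity, and ancestry: usage in population genetics

Common English terms for human groups include race, ethnicity, and ancestry, but in Chinese all three are usually translated with the same word (种族). This post briefly introduces and discusses how to tell them apart.

Distinguishing the concepts

First, population. This is the broadest term and can refer to any group of people, large or small. Its meaning usually depends on context and has no strict boundaries. For example, a genome-wide association study in Chinese people can be described as a GWAS in a Chinese population.

Race (人种) is a socially constructed classification system that rests on mistaken beliefs about inherent biological traits or differences. Typical bases are physical characteristics (such as skin color) and sociocultural characteristics. Usage example: racial discrimination should be eliminated.

Ethnicity (民族) is a sociopolitical concept for a group that usually shares a geographic location and a common heritage or similar culture, such as language or religion. For example, China has 56 ethnic groups, and Han Chinese is one ethnic group. Ethnicity is easily confused with ancestry because in most cases a group defined by ethnicity also shares lineage or genetic inheritance, but in some regions ethnicity denotes a purely sociocultural entity with no genetic basis.

Ancestry (族裔/祖先) is a more complex concept with both biological and sociological components. In the West the term usually reflects a group's sociocultural background and continental origin, whereas in the East and the Southern Hemisphere it usually reflects lineage or genetic inheritance. In most cases ancestry is the term that population genetics papers should use. Frequently used examples include European ancestry, East Asian ancestry, and South Asian ancestry.

An example that ties these together: a research group collects genetic data from people in China for a GWAS. The group as a whole can be called a Chinese population. It includes Han and Dai individuals, which are ethnic groups (Han Chinese and Chinese Dai), while in population genetic terms the whole group is of East Asian ancestry.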

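Usage in population genetics

The core distinction is subjective versus objective: race and ethnicity contain subjective components, whereas ancestry is an objective, descriptive term that reflects certain fixed features of the genome. In biology or genetics papers, when describing groups purely in a genetic sense, use the objective term, ancestry.

English terms for cross-group analyses in population genetics

In short, use cross-population, cross-ancestry, multi-population, or multi-ancestry rather than trans-ethnic.

Reasons

"trans" has several meanings; "cross" or "multi" is more precise and unambiguous.
"ethnic" carries sociological, changeable, and subjective components; use ancestry, or the broader population.

For historical reasons earlier papers often mixed the terms. For example,

Brown, B. C., Ye, C. J., Price, A. L., & Zaitlen, N. (2016). Transethnic genetic-correlation estimates from summary statistics. The American Journal of Human Genetics, 99(1), 76-88.

uses "transethnic" where the intended meaning is cross-ancestry. A more appropriate usage is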

Momin, M. M., Shin, J., Lee, S., Truong, B., Benyamin, B., & Lee, S. H. (2023). A method for an unbiased estimate of cross-ancestry genetic correlation using individual-level data. Nature Communications, 14(1), 722.

References

Kachuri, L., Chatterjee, N., Hirbo, J. et al. Principles and methods for transferring polygenic risk scores across global populations. Nat Rev Genet (2023). https://doi.org/10.1038/s41576-023-00637-2

Kamariza, M., Crawford, L., Jones, D., & Finucane, H. (2021). Misuse of the term ‘trans-ethnic’in genomics research. Nature Genetics, 53(11), 1520-1521.

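The winner's curse in GWAS and its correction

The winner's curse in GWAS

In GWAS, the winner's curse means that genetic effect sizes are systematically overestimated because of the selection step (lead SNPs are selected by the genome-wide significance threshold).

The term originally describes a similar phenomenon in auctions. Even if an item has the same value for all bidders (bids are unbiased), the winner who ends up with the item has probably overestimated its intrinsic value. By analogy, in GWAS the lead SNP is the winner, and its estimated effect size probably overestimates the true genetic effect.

Winner's curse (WC) correction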

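Assume the observed effect $\beta_{Observed}$ is approximately distributed as:

$\beta_{Observed} \sim N(\beta_{True},\sigma^2)$

(Figure: an example distribution of $\beta_{Observed}$)

$c$ : the Z-score corresponding to the significance threshold

The expression above is equivalent to

${{\beta_{Observed} - \beta_{True}}\over{\sigma}} \sim N(0,1)$

(Figure: an example distribution of ${{\beta_{Observed} - \beta_{True}}\over{\sigma}}$)

After selection by the threshold, the approximate sampling distribution of $\beta_{Observed}$ (a truncated normal distribution) is: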

$f(x,\beta_{True}) ={{1}\over{\sigma}} {{\phi({{{x - \beta_{True}}\over{\sigma}}})} \over {\Phi({{{\beta_{True}}\over{\sigma}}-c}) + \Phi({{{-\beta_{True}}\over{\sigma}}-c})}}$

where

$|{{x}\over{\sigma}}|\geq c$

$\phi(x)$ : probability density function of the standard normal distribution
$\Phi(x)$ : cumulative distribution function of the standard normal distribution

From this approximate sampling distribution, the expected effect size of a selected SNP is:

$E(\beta_{Observed}; \beta_{True}) = \beta_{True} + \sigma {{\phi({{{\beta_{True}}\over{\sigma}}-c}) - \phi({{{-\beta_{True}}\over{\sigma}}-c})} \over {\Phi({{{\beta_{True}}\over{\sigma}}-c}) + \Phi({{{-\beta_{True}}\over{\sigma}}-c})}}$

$\beta_{Observed}$ is biased.
The size of the bias is determined by $\beta_{True}$ , the standard error $\sigma$ , and the significance threshold used for selection.

For the derivation, see Appendix A of Ghosh, A., Zou, F., & Wright, F. A. (2008). Estimating odds ratios in genome scans: an approximate conditional likelihood approach. The American Journal of Human Genetics, 82(5), 1064-1074.

This expression can be used to correct effect sizes for the winner's curse.
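One simple way to use the expectation formula is to look for the $\beta_{True}$ whose conditional expectation equals the observed estimate. The sketch below does this by bisection using only the Python standard library. It is an illustration under stated assumptions, not the method of any particular package: the function names are hypothetical, and published estimators (for example the conditional likelihood estimators in the papers cited here) differ in detail.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def expected_observed(beta_true, se, c):
    """E(beta_observed | |beta_observed / se| >= c), from the truncated normal."""
    a = beta_true / se - c
    b = -beta_true / se - c
    numerator = _STD.pdf(a) - _STD.pdf(b)
    denominator = _STD.cdf(a) + _STD.cdf(b)
    return beta_true + se * numerator / denominator


def correct_winners_curse(beta_obs, se, alpha=5e-8, n_iter=200):
    """Solve E(beta_observed; beta_true) = beta_obs for beta_true by bisection."""
    c = _STD.inv_cdf(1.0 - alpha / 2.0)  # Z-score of the significance threshold
    sign = 1.0 if beta_obs >= 0 else -1.0
    target = abs(beta_obs)
    lo, hi = 0.0, target  # the corrected estimate is shrunk towards zero
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if expected_observed(mid, se, c) < target:
            lo = mid
        else:
            hi = mid
    return sign * 0.5 * (lo + hi)


# A SNP just past the threshold (Z = 6) is shrunk noticeably;
# a very strong signal (Z = 20) is left essentially unchanged.
print(correct_winners_curse(0.06, 0.01))
print(correct_winners_curse(0.20, 0.01))
```

The search interval relies on the bias having the same sign as the effect, so the corrected value lies between zero and the observed estimate.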

The winnerscurse R package

This R package can be used to correct for the winner's curse.

https://amandaforde.github.io/winnerscurse/articles/winners_curse_methods.html

References

Bazerman, M. H., & Samuelson, W. F. (1983). I won the auction but don’t want the prize. Journal of conflict resolution, 27(4), 618-634.
Göring, H. H., Terwilliger, J. D., & Blangero, J. (2001). Large upward bias in estimation of locus-specific effects from genomewide scans. The American Journal of Human Genetics, 69(6), 1357-1369.

Zhong, H., & Prentice, R. L. (2008). Bias-reduced estimators and confidence intervals for odds ratios in genome-wide association studies. Biostatistics, 9(4), 621-634.
Ghosh, A., Zou, F., & Wright, F. A. (2008). Estimating odds ratios in genome scans: an approximate conditional likelihood approach. The American Journal of Human Genetics, 82(5), 1064-1074.

Also see reference: https://amandaforde.github.io/winnerscurse/articles/winners_curse_methods.html

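The Hardy-Weinberg equilibrium (HWE) exact test

Hardy-Weinberg equilibrium

See the earlier post for a review of Hardy-Weinberg equilibrium.

Principle of the HWE exact test

Assume N unrelated samples (that is, 2N alleles).

Under Hardy-Weinberg equilibrium, the exact probability of observing $n_{AB}$ samples with genotype AB among N samples is: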

$P(N_{AB} = n_{AB} | N, n_A) = {{2^{n_{AB}}N!}\over{n_{AA}!n_{AB}!n_{BB}!}} \times {{n_A!n_B!}\over{(2N)!}}$

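To compute the P value of the exact test, sum the probabilities of all heterozygote counts $n^{*}_{AB}$ that are no more probable than the observed one ( $n_{AB}$ samples with genotype AB):

$P_{HWE} = \sum_{n^{*}_{AB}} I[P( N_{AB} = n_{AB}|N, n_A) \geq P(N_{AB} = n^{*}_{AB} | N, n_A)] \times P(N_{AB} = n^{*}_{AB} | N, n_A)$

$I(x)$ is an indicator function: if x is true, $I(x) = 1$ ; otherwise $I(x) = 0$ .

In practice software avoids evaluating the factorials directly and uses more efficient computation; see the HWE algorithm in PLINK.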
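The exact test can be written out directly from the two formulas above. The following is a minimal sketch, assuming a biallelic variant and genotype counts as input; it works with log-factorials to stay numerically stable and is not the optimized algorithm used by PLINK. The function name is hypothetical.

```python
from math import exp, lgamma, log


def hwe_exact_p(n_aa, n_ab, n_bb):
    """Exact HWE P value from genotype counts (AA, AB, BB), without mid-p correction."""
    n = n_aa + n_ab + n_bb
    n_a = 2 * n_aa + n_ab  # number of A alleles
    n_b = 2 * n_bb + n_ab  # number of B alleles

    def log_prob(het):
        # log P(N_AB = het | N, n_A); lgamma(k + 1) = log(k!)
        hom_a = (n_a - het) // 2
        hom_b = (n_b - het) // 2
        return (het * log(2.0) + lgamma(n + 1)
                - lgamma(hom_a + 1) - lgamma(het + 1) - lgamma(hom_b + 1)
                + lgamma(n_a + 1) + lgamma(n_b + 1) - lgamma(2 * n + 1))

    # Possible heterozygote counts have the same parity as the allele counts.
    het_counts = range(n_a % 2, min(n_a, n_b) + 1, 2)
    log_observed = log_prob(n_ab)
    # Sum the configurations that are no more probable than the observed one
    # (a small tolerance guards against floating-point ties).
    p = sum(exp(lp) for lp in map(log_prob, het_counts) if lp <= log_observed + 1e-9)
    return min(1.0, p)


# Genotype counts in the same AA/AB/BB order as the GENO column of PLINK's --hardy output.
print(hwe_exact_p(1, 61, 442))
print(hwe_exact_p(4, 409, 91))
```

A large excess of heterozygotes, as in the second call, gives an extremely small P value, which is the pattern usually attributed to genotyping problems.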

Running the HWE test with PLINK

PLINK provides --hardy to compute the HWE exact test statistics and --hwe to filter variants based on them:

plink \
    --bfile ${genotypeFile} \
    --hardy \
    --out plink_results

The output looks like this; the P column is the result of the HWE exact test:

$ head plink_results.hwe
 CHR              SNP     TEST   A1   A2                 GENO   O(HET)   E(HET)            P 
   1      1:13273:G:C  ALL(NP)    C    G             1/61/442    0.121   0.1172       0.7113
   1      1:14599:T:A  ALL(NP)    A    T             1/88/415   0.1746   0.1626       0.1625
   1      1:14604:A:G  ALL(NP)    G    A             1/88/415   0.1746   0.1626       0.1625
   1      1:14930:A:G  ALL(NP)    G    A             4/409/91   0.8115   0.4851    1.679e-61
   1      1:69897:T:C  ALL(NP)    T    C            7/111/386   0.2202   0.2173            1
   1      1:86331:A:G  ALL(NP)    G    A             0/88/416   0.1746   0.1594      0.02387
   1      1:91581:G:A  ALL(NP)    A    G          137/228/139   0.4524      0.5      0.03271
   1     1:122872:T:G  ALL(NP)    G    T            1/259/244   0.5139   0.3838     8.04e-19
   1     1:135163:C:T  ALL(NP)    T    C             1/91/412   0.1806   0.1675       0.1066

Alternatively, --hwe 1e-6 directly removes SNPs with P below 1e-6:

plink \
    --bfile ${genotypeFile} \
    --hwe 1e-6 \
    --out plink_results

References

https://www.cog-genomics.org/plink/1.9/dev#exact

https://www.cog-genomics.org/plink/1.9/basic_stats#hardy

Wigginton, J. E., Cutler, D. J., & Abecasis, G. R. (2005). A note on exact tests of Hardy-Weinberg equilibrium. The American Journal of Human Genetics, 76(5), 887-893.

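Power analysis for GWAS

Type I error, type II error, and statistical power

The table shows the relationship between the null hypothesis $H_0$ and the outcome of a statistical test (whether $H_0$ is rejected).

| Decision | $H_0$ is true | $H_0$ is false |
| --- | --- | --- |
| Fail to reject $H_0$ | True negative: $1 - \alpha$ | Type II error (false negative): $\beta$ |
| Reject $H_0$ | Type I error (false positive): $\alpha$ | True positive: $1 - \beta$ |

$\alpha$ : significance level

$\beta$ : type II error rate (in this table only; in the sections below $\beta$ denotes the effect size)

By definition, statistical power is the probability that a test correctly rejects the null hypothesis, that is, the true positive cell of the table.

$Power = Pr ( Reject\ | H_0\ is\ False) = 1 - \beta$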

Factors affecting power

Total sample size
Case and control ratio
Effect size of the variant
Risk allele frequency
Significance threshold

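Non-centrality parameter

The non-centrality parameter (NCP) describes how far the alternative hypothesis $H_1$ is from the null hypothesis $H_0$ .

Consider the linear model:

$y = \mu +\beta x + \epsilon$

The variance of the error term is:

$\sigma^2 = Var(y) - Var(x)\beta^2$

A single SNP usually explains only a very small fraction of the phenotypic variance, so we can approximate

$\sigma^2 \thickapprox Var(y)$

Under Hardy-Weinberg equilibrium,

$Var(x) = 2f(1-f)$

$f$ : allele frequency of the variant

The NCP of the $\chi^2$ distribution with 1 degree of freedom is then

$\lambda = ({{\beta}\over{SE_{\beta}}})^2$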

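Power for quantitative traits

$\lambda = ({{\beta}\over{SE_{\beta}}})^2 \thickapprox N \times {{Var(x)\beta^2}\over{\sigma^2}} \thickapprox N \times {{2f(1-f) \beta^2 }\over {Var(y)}}$

Significance threshold: $C = CDF_{\chi^2}^{-1}(1 - \alpha,df=1)$

$CDF_{\chi^2}^{-1}(x)$ : inverse of the cumulative distribution function of the $\chi^2$ distribution

$Power = Pr(X > C ) = 1 - CDF_{\chi^2}(C, ncp = \lambda,df=1)$ , where $X$ is the test statistic, which under $H_1$ follows a $\chi^2$ distribution with 1 degree of freedom and NCP $\lambda$

$CDF_{\chi^2}(x, ncp= \lambda)$ : cumulative distribution function of the $\chi^2$ distribution with NCP $\lambda$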
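The calculation for a quantitative trait can be sketched with the Python standard library alone. The sketch uses the fact that a $\chi^2$ variable with 1 degree of freedom and NCP $\lambda$ is the square of a normal variable with mean $\sqrt{\lambda}$ and unit variance, so no non-central $\chi^2$ routine is needed. The function name and the example parameters are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def power_quantitative(n, f, beta, var_y=1.0, alpha=5e-8):
    """Power to detect a variant with allele frequency f and per-allele effect beta."""
    ncp = n * 2.0 * f * (1.0 - f) * beta ** 2 / var_y  # lambda
    z_crit = _STD.inv_cdf(1.0 - alpha / 2.0)           # sqrt of the chi-square threshold C
    root = sqrt(ncp)
    # Pr(X > C) for X = (Z + sqrt(lambda))^2 with Z standard normal
    return _STD.cdf(root - z_crit) + _STD.cdf(-root - z_crit)


# Hypothetical setting: 100,000 samples, allele frequency 0.3,
# effect of 0.03 phenotypic standard deviations per allele.
print(power_quantitative(100_000, 0.3, 0.03))
```

With beta set to 0 the function returns alpha, which is a quick sanity check that the threshold and the distribution are consistent.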

Power for large-scale case-control genome-wide association studies

Let

$P_{case}$ : risk allele frequency in cases
$N_{case}$ : number of cases. The total allele count for cases is then $2N_{case}$ .
$P_{control}$ : risk allele frequency in controls
$N_{control}$ : number of controls. The total allele count for controls is then $2N_{control}$ .

The null hypothesis here is $P_{case} = P_{control}$ , that is, the risk allele frequency is the same in cases and controls.

The statistic for testing the difference between two proportions (using the normal approximation) is

$z = {{P_{case} - P_{control}}\over {\sqrt{ {{P_{case}(1 - P_{case})}\over{2N_{case}}} + {{P_{control}(1 - P_{control})}\over{2N_{control}}} }}}$

Significance threshold: $C = \Phi^{-1}(1 - \alpha / 2 )$

$Power = Pr(|Z|>C) = 1 - \Phi(C-z) + \Phi(-C-z)$ , where $Z \sim N(z, 1)$ under the alternative hypothesis
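The case-control calculation follows the same pattern. This is a minimal sketch that takes the allele frequencies in cases and controls as given; tools such as the GAS Power Calculator derive those frequencies from a disease model (prevalence, relative risk), which is not reproduced here. The function name and example values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def power_case_control(p_case, p_control, n_case, n_control, alpha=5e-8):
    """Power of the two-proportion allelic test; each sample contributes two alleles."""
    se = sqrt(p_case * (1.0 - p_case) / (2.0 * n_case)
              + p_control * (1.0 - p_control) / (2.0 * n_control))
    z = (p_case - p_control) / se
    c = _STD.inv_cdf(1.0 - alpha / 2.0)
    # Pr(|Z| > C) with Z ~ N(z, 1): 1 - Phi(C - z) + Phi(-C - z)
    return _STD.cdf(z - c) + _STD.cdf(-z - c)


# Hypothetical setting: risk allele frequency 0.32 in cases and 0.30 in controls.
print(power_case_control(0.32, 0.30, 10_000, 10_000))
```

The expression `cdf(z - c)` is the same quantity as `1 - cdf(c - z)`; writing it this way avoids subtracting two numbers close to 1.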

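GAS Power Calculator: a web tool for GWAS power calculation

The GAS Power Calculator implements the calculation described above. It is a web tool: specify the parameters on the page and it computes the power.

References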

https://cloufield.github.io/GWASTutorial/20_power_analysis/
Skol, A. D., Scott, L. J., Abecasis, G. R., & Boehnke, M. (2006). Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nature genetics, 38(2), 209-213.
Johnson, J. L., & Abecasis, G. R. (2017). GAS Power Calculator: web-based power calculator for genetic association studies. BioRxiv, 164343.
Sham, P. C., & Purcell, S. M. (2014). Statistical power and significance testing in large-scale genetic studies. Nature Reviews Genetics, 15(5), 335-346.

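Polygenic risk score (PRS) series, part 10: building cross-ancestry PRS with PRS-CSx

In this post:

Introduction to PRS-CSx
How to use PRS-CSx
An example application of PRScsx
References

Introduction to PRS-CSx

An earlier post described a major problem in PRS research: a PRS built in population A is hard to transfer directly to population B. To address this, Yunfeng Ruan et al. developed PRS-CSx.

PRS-CSx is a Bayesian framework for polygenic modelling and prediction that improves cross-population PRS prediction by integrating GWAS summary statistics from multiple ancestries. It is an extension of PRS-CS (see GWASLab: Polygenic risk score (PRS) series, part 5: computing PRS with PRS-CS (beta-shrinkage method)).

In principle, PRS-CSx couples SNP effects across populations through a shared continuous shrinkage prior. Sharing the prior across the GWAS summary statistics, while using the LD information of each population, yields more accurate effect size estimates. The shared prior reflects the fact that effect sizes are correlated across populations but still differ between them, which keeps the model flexible.

Given GWAS summary statistics and an LD reference panel for each population, PRS-CSx computes a separate PRS for each population and then obtains the final PRS as their optimal linear combination.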

How to use PRS-CSx

https://github.com/getian107/PRScsx

PRScsx is a Python command-line tool that requires the scipy and h5py packages. Download PRS-CSx from GitHub:

git clone https://github.com/getian107/PRScsx.git

The LD reference panels are the same files used by PRS-CS (see GWASLab: Polygenic risk score (PRS) series, part 5: computing PRS with PRS-CS (beta-shrinkage method)).

Download link (an FTP mirror accessible from mainland China): https://personal.broadinstitute.org/hhuang//public//PRS-CSx/Reference

Remember to also download the SNP list that matches the panel: snpinfo_mult_1kg_hm3 (1kg) or snpinfo_mult_ukbb_hm3 (ukbb).

Options

python PRScsx.py \
--ref_dir=PATH_TO_REFERENCE \
--bim_prefix=VALIDATION_BIM_PREFIX \
--sst_file=SUM_STATS_FILE \
--n_gwas=GWAS_SAMPLE_SIZE \
--pop=POPULATION \
--out_dir=OUTPUT_DIR \
--out_name=OUTPUT_FILE_PREFIX \
--a=PARAM_A \
--b=PARAM_B \
--phi=PARAM_PHI \
--n_iter=MCMC_ITERATIONS \
--n_burnin=MCMC_BURNIN \
--thin=MCMC_THINNING_FACTOR \
--chrom=CHROM \
--meta=META_FLAG \
--seed=SEED

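Required parameters:

PATH_TO_REFERENCE: path to the LD reference panels. The directory should contain the panel of each population plus the SNP list. For example, if the populations are EUR and EAS and the path is ./ldref , that directory should contain the folders ldblk_1kg_eas and ldblk_1kg_eur and the file snpinfo_mult_1kg_hm3.
VALIDATION_BIM_PREFIX: path and prefix of the bim file of the target dataset (without the .bim extension).
SUM_STATS_FILE: full paths of the sumstats files, comma-separated.
GWAS_SAMPLE_SIZE: sample sizes of the sumstats, comma-separated, in the same order as SUM_STATS_FILE.
POPULATION: the corresponding populations, which can be AFR, AMR, EAS, EUR, or SAS, comma-separated, in the same order as SUM_STATS_FILE.
OUTPUT_DIR: output directory.
OUTPUT_FILE_PREFIX: prefix of the output files.

The remaining parameters are optional:

META_FLAG: if True, also output the inverse-variance-weighted meta-analysis of the population-specific posterior effect size estimates.

PARAM_A, PARAM_B, PARAM_PHI, MCMC_ITERATIONS, MCMC_BURNIN, MCMC_THINNING_FACTOR, SEED, and CHROM are used in the same way as in PRScs (see GWASLab: Polygenic risk score (PRS) series, part 5: computing PRS with PRS-CS (beta-shrinkage method)).

Example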

python PRScsx.py \
--ref_dir=path_to_ref \
--bim_prefix=path_to_bim/test \
--sst_file=path_to_sumstats/EUR_sumstats.txt,path_to_sumstats/EAS_sumstats.txt \
--n_gwas=200000,100000 \
--pop=EUR,EAS \
--chrom=22 \
--phi=1e-2 \
--out_dir=path_to_output \
--out_name=test

Note: replace the paths with your own.

The calculation finishes in about one minute.

The run log looks like this:

*** 2 discovery populations detected ***

##### process chromosome 22 #####
... parse reference file: /home/heyunye/tools/prscs/ldref/snpinfo_mult_1kg_hm3 ...
... 18944 SNPs on chromosome 22 read from /home/heyunye/tools/prscs/ldref/snpinfo_mult_1kg_hm3 ...
... parse bim file: /home/heyunye/tools/prscsx/PRScsx/test_data/test.bim ...
... 1000 SNPs on chromosome 22 read from /home/heyunye/tools/prscsx/PRScsx/test_data/test.bim ...
... parse EUR sumstats file: /home/heyunye/tools/prscsx/PRScsx/test_data/EUR_sumstats.txt ...
... 1000 SNPs read from /home/heyunye/tools/prscsx/PRScsx/test_data/EUR_sumstats.txt ...
... 1000 common SNPs in the EUR reference, EUR sumstats, and validation set ...
... parse EAS sumstats file: /home/heyunye/tools/prscsx/PRScsx/test_data/EAS_sumstats.txt ...
... 1000 SNPs read from /home/heyunye/tools/prscsx/PRScsx/test_data/EAS_sumstats.txt ...
... 901 common SNPs in the EAS reference, EAS sumstats, and validation set ...
... parse EUR reference LD on chromosome 22 ...
... parse EAS reference LD on chromosome 22 ...
... align reference LD on chromosome 22 across populations ...
... 1000 valid SNPs across populations ...
... MCMC ...
--- iter-100 ---
--- iter-200 ---
--- iter-300 ---
--- iter-400 ---
--- iter-500 ---
--- iter-600 ---
--- iter-700 ---
--- iter-800 ---
--- iter-900 ---
--- iter-1000 ---
--- iter-1100 ---
--- iter-1200 ---
--- iter-1300 ---
--- iter-1400 ---
--- iter-1500 ---
--- iter-1600 ---
--- iter-1700 ---
--- iter-1800 ---
--- iter-1900 ---
--- iter-2000 ---
... Done ...

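The output is one file per population (EUR and EAS) containing the posterior SNP effect size estimates, which serve as the PRS weights: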

test_EAS_pst_eff_a1_b0.5_phi1e-02_chr22.txt

test_EUR_pst_eff_a1_b0.5_phi1e-02_chr22.txt

head test_EAS_pst_eff_a1_b0.5_phi1e-02_chr22.txt
22      rs9605903       17054720        C       T       8.694291e-04
22      rs5746647       17057138        G       T       -1.005430e-03
22      rs5747999       17075353        C       A       -2.499230e-04
22      rs2845380       17203103        A       G       6.037999e-04
22      rs2247281       17211075        G       A       4.780305e-04
22      rs2845346       17214252        C       T       7.767527e-04
22      rs2845347       17214669        C       T       1.671207e-03
22      rs1807512       17221495        C       T       -1.778397e-03
22      rs5748593       17227461        T       C       9.849030e-04
22      rs9606468       17273728        C       T       1.442600e-04
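To show how these weights turn into a score, here is a minimal Python sketch that parses lines in the format above and computes a weighted sum of allele dosages for one individual. It assumes the six columns are chromosome, SNP ID, position, A1, A2, and posterior effect, with the effect referring to A1. The function names and the dosage dictionary are hypothetical; for real data, a tool such as PLINK is the practical choice.

```python
def read_posterior_effects(lines):
    """Parse posterior-effect lines into {snp_id: (effect_allele, weight)}."""
    weights = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        _chrom, snp, _pos, a1, _a2, beta = fields
        weights[snp] = (a1, float(beta))
    return weights


def polygenic_score(weights, dosages):
    """Weighted sum of A1 dosages (0 to 2) over SNPs present in both inputs."""
    return sum(beta * dosages[snp]
               for snp, (_a1, beta) in weights.items() if snp in dosages)


example_lines = [
    "22      rs9605903       17054720        C       T       8.694291e-04",
    "22      rs5746647       17057138        G       T       -1.005430e-03",
]
weights = read_posterior_effects(example_lines)

# Hypothetical individual: two copies of C at rs9605903, one copy of G at rs5746647.
print(polygenic_score(weights, {"rs9605903": 2, "rs5746647": 1}))
```

The per-population scores obtained this way would still need to be combined, for example by the linear combination described in the introduction, with weights learned in a validation dataset.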

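These files can then be used with PLINK to compute the PRS:

GWASLab: Polygenic risk score (PRS) series, part 9: computing PRS per chromosome with PLINK2 and summing the results

An example application of PRScsx

The corresponding author of PRScsx, as first author, applied PRScsx to a cross-ancestry PRS study of type 2 diabetes. The study used PRScsx with GWAS data from European, African American, and East Asian populations to build a cross-ancestry PRS for type 2 diabetes.

https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-022-01074-2

References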

Ruan, Y., Lin, Y. F., Feng, Y. C. A., Chen, C. Y., Lam, M., Guo, Z., … & Ge, T. (2022). Improving polygenic prediction in ancestrally diverse populations. Nature Genetics, 54(5), 573-580.

Ge, T., Irvin, M. R., Patki, A., Srinivasasainagendra, V., Lin, Y. F., Tiwari, H. K., … & Karlson, E. W. (2022). Development and validation of a trans-ancestry polygenic risk score for type 2 diabetes in diverse populations. Genome medicine, 14(1), 1-16.

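Getting started with GWAS: recommended reviews and reading guide

Preface

Inspired by a leading researcher on Twitter, this post collects review articles that are especially helpful for GWAS beginners. Most of them were written by leading scientists in the field, have been cited more than a thousand times, and have been highly influential. For anyone who wants a quick overview of how GWAS has developed over the past decade or more, they are not to be missed. The list is based on tweets by Abdel Abdellaoui and on the author's own experience. Further recommendations are welcome.

Recommended reviews and reading guide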

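Review 1

Hirschhorn, J. N., & Daly, M. J. (2005). Genome-wide association studies for common diseases and complex traits. Nature reviews genetics, 6(2), 95-108.

https://www.nature.com/articles/nrg1521

One of the earliest reviews introducing GWAS, published in 2005 when GWAS was just emerging. At that time the human genome sequence had just been completed, dbSNP was being built, and the HapMap project was getting under way; these projects laid the foundation for GWAS. The review covers the advantages and disadvantages of GWAS compared with traditional genetic approaches, the high-throughput technologies available at the time, and the core issues to watch for in GWAS. It is one of the pioneering papers bridging traditional genetics and modern genomics and is well worth reading.

Personal recommendation score: 10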

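Review 2

Balding, D. J. (2006). A tutorial on statistical methods for population association studies. Nature reviews genetics, 7(10), 781-791.

https://www.nature.com/articles/nrg1916

This review introduces the statistical tools available for early GWAS. It briefly covers the core genetic and statistical principles of GWAS and outlines the basic statistical methods used at each stage of a study. It helps beginners understand how GWAS tests work; later methods are largely extensions of, or additions to, these basic principles.

Personal recommendation score: 8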

Review 3

McCarthy, M. I., Abecasis, G. R., Cardon, L. R., Goldstein, D. B., Little, J., Ioannidis, J., & Hirschhorn, J. N. (2008). Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature reviews genetics, 9(5), 356-369.

https://www.nature.com/articles/nrg2344

Published in 2008, just after the results of the first wave of GWAS had appeared. Based on those first-wave papers, it summarizes the state of GWAS research at the time, focuses on its shortcomings and challenges, and points out directions for subsequent studies.

Personal recommendation score: 9

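Review 4

Manolio, T. A., Collins, F. S., Cox, N. J., Goldstein, D. B., Hindorff, L. A., Hunter, D. J., … & Visscher, P. M. (2009). Finding the missing heritability of complex diseases. Nature, 461 (7265), 747-753.

https://www.nature.com/articles/nature08494

The search for the "missing heritability" of complex diseases has been a hot topic throughout the history of GWAS. This paper summarizes the possible sources of missing heritability and suggests ways to study them. It is a highly influential article.

Personal recommendation score: 8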

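Review 5

Ioannidis, J., Thomas, G., & Daly, M. J. (2009). Validating, augmenting and refining genome-wide association signals. Nature Reviews Genetics, 10(5), 318-329.

https://www.nature.com/articles/nrg2544

Published in 2009, when GWAS had already found a large number of disease-associated loci. Most of these loci are only markers for the causal variants that actually produce the functional change, so identifying causal variants among the many associations had become unavoidable. The review summarizes early methods for improving the reliability of GWAS results and for finding causal variants.

Personal recommendation score: 7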

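Review 6

Marchini, J., & Howie, B. (2010). Genotype imputation for genome-wide association studies. Nature Reviews Genetics, 11(7), 499-511.

https://www.nature.com/articles/nrg2796

This review summarizes genotype imputation methods and introduces the basic concepts and commonly used metrics. It is very helpful for understanding the theory behind genotype imputation.

Personal recommendation score: 6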

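Review 7

Price, A. L., Zaitlen, N. A., Reich, D., & Patterson, N. (2010). New approaches to population stratification in genome-wide association studies. Nature reviews genetics, 11(7), 459-463.

https://www.nature.com/articles/nrg2813

Population stratification has always been a problem that GWAS must handle properly, and this paper summarizes the approaches for dealing with it. The article is short but clear and concise, and it helps a great deal in understanding λgc, PCA, and linear mixed models. Recommended.

Personal recommendation score: 9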

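Review 8

Visscher, P. M., Wray, N. R., Zhang, Q., Sklar, P., McCarthy, M. I., Brown, M. A., & Yang, J. (2017). 10 years of GWAS discovery: biology, function, and translation. The American Journal of Human Genetics, 101(1), 5-22.

https://www.sciencedirect.com/science/article/pii/S0002929717302409?via%3Dihub

A summary of the development and achievements of GWAS in its first 10 years. The paper introduces the scientific basis of GWAS, draws some general conclusions from a large body of GWAS results, and gives three typical examples of widely studied complex diseases.

Personal recommendation score: 8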

Review 9

Pasaniuc, B., & Price, A. L. (2017). Dissecting the genetics of complex traits using summary association statistics. Nature reviews genetics, 18(2), 117-127.

https://www.nature.com/articles/nrg.2016.142

Published in 2017. As GWAS summary statistics accumulated, downstream methods that use them multiplied rapidly. This review summarizes post-GWAS methods that analyse disease using GWAS summary statistics, such as gene-based analysis, fine-mapping, and PRS.

Personal recommendation score: 8

Review 10

Tam, V., Patel, N., Turcotte, M., Bossé, Y., Paré, G., & Meyre, D. (2019). Benefits and limitations of genome-wide association studies. Nature Reviews Genetics, 20(8), 467-484.

https://www.nature.com/articles/s41576-019-0127-1

This review summarizes the strengths and limitations of GWAS. It helps deepen one's understanding of GWAS and gives a sense of future directions.

Personal recommendation score: 7

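Review 11

Uffelmann, E., Huang, Q. Q., Munung, N. S., De Vries, J., Okada, Y., Martin, A. R., … & Posthuma, D. (2021). Genome-wide association studies. Nature Reviews Methods Primers, 1(1), 1-21.

https://www.nature.com/articles/s43586-021-00056-9

Published in 2021, this is the most recent walkthrough of the complete GWAS workflow at the time of writing. It is comprehensive, useful for filling gaps in one's knowledge, and well worth reading.

Personal recommendation score: 9

References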

And @NatureRevGenet papers were essential to me in learning about GWAS:
– https://t.co/UGn702hBJZ
– https://t.co/c3BNVmvGSp
– https://t.co/c95mZiyc4I
– https://t.co/JKOzExpa60
– https://t.co/2yphDKzaep
– https://t.co/P3PSMi3yl6
– https://t.co/Hv2H57j1NI
– https://t.co/5a2b56ImCD pic.twitter.com/eQ8Aj4KDy2
— Abdel Abdellaoui (@dr_appie) October 9, 2022

Summary of major biobanks and cohorts v1

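Preface

This post lists the major biobanks and cohorts worldwide, for reference only. For now it gives only basic information for each biobank or cohort: approximate sample size, location, website, and a brief description. It will be updated over time; the next steps are to spell out abbreviations and to add study type, ancestry information, separate figures for total and genotyped sample size, and links to the corresponding public data. (The list was compiled by hand, so mistakes are hard to avoid. Corrections of omissions or errors are welcome in the comments. Thank you!)

This post is part of CTGCatalog (Complex Trait Genetics Catalog, which collects commonly used reference data and resources, public sumstats, and common tools in the field of complex trait genetics):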

https://cloufield.github.io/CTGCatalog/Reference_data_Biobanks_Cohorts_README/

Contents : Biobanks and Cohorts v1 (20221006)

Biobank of the Americas
Biobank Graz
Biobank Japan
BioMe
BioVU
CanPath – Ontario Health Study
China Kadoorie Biobank
Colorado Center for Personalized Medicine
deCODE Genetics
Estonian Biobank
FinnGen
Generation Scotland
Genes & Health
HUNT
IARC Biobank
Lifelines
Massachusetts General Brigham Biobank
Michigan Genomics Initiative
Million Veteran Program (MVP)
National Biobank of Korea
Nigerian 100K Genome Project
Penn Medicine Biobank
Qatar Biobank
QIMR Berghofer – QIMR Biobank (QSkin and GenEpi)
Taiwan Biobank
The Malaysian Cohort (TMC)
UCLA Precision Health Biobank
Uganda Genome Resource
UK Biobank

EUROPE

UK Biobank (UKB)

SAMPLE SIZE: ~500k
LOCATION: U.K.
URL: https://www.ukbiobank.ac.uk/
DESCRIPTION: UK Biobank is a large-scale biomedical database and research resource, containing in-depth genetic and health information from half a million UK participants. The database is regularly augmented with additional data and is globally accessible to approved researchers undertaking vital research into the most common and life-threatening diseases. It is a major contributor to the advancement of modern medicine and treatment and has enabled several scientific discoveries that improve human health.
CITATION: Bycroft, C., Freeman, C., Petkova, D., Band, G., Elliott, L. T., Sharp, K., … & Marchini, J. (2018). The UK Biobank resource with deep phenotyping and genomic data. Nature, 562(7726), 203-209.

FinnGen

SAMPLE SIZE: ~343k
LOCATION: Finland
URL:https://www.finngen.fi/en
DESCRIPTION: FinnGen study launched in Finland in the autumn of 2017 is a unique study that combines genome information with digital health care data. The FinnGen study is an unprecedented global research project representing one of the largest studies of this type. Project aims to improve human health through genetic research, and ultimately identify new therapeutic targets and diagnostics for treating numerous diseases. The collaborative nature of the project is exceptional compare to many ongoing studies, and all the partners are working closely together to ensure appropriate transparency, data security and ownership.
CITATION:Kurki, M. I., Karjalainen, J., Palta, P., Sipilä, T. P., Kristiansson, K., Donner, K., … & Nelis, M. (2022). FinnGen: Unique genetic insights from combining isolated population and national health register data. medRxiv.

Estonian Biobank

SAMPLE SIZE: ~200k
LOCATION: Estonia
URL:https://genomics.ut.ee/en/content/estonian-biobank
DESCRIPTION:The Estonian Biobank has established a population-based biobank of Estonia with a current cohort size of more than 200,000 individuals (genotyped with genome-wide arrays), reflecting the age, sex and geographical distribution of the adult Estonian population. Considering the fact that about 20% of Estonia’s adult population has joined the programme, it is indeed a database that is very important for the development of medical science both domestically and internationally.
CITATION:Leitsalu, L., Haller, T., Esko, T., Tammesoo, M. L., Alavere, H., Snieder, H., … & Metspalu, A. (2015). Cohort profile: Estonian biobank of the Estonian genome center, university of Tartu. International journal of epidemiology, 44(4), 1137-1147.

Lifelines

SAMPLE SIZE: ~167k
LOCATION: Netherlands
URL: https://www.lifelines.nl/researcher
DESCRIPTION: Lifelines is a large, multigenerational cohort study that includes over 167,000 participants (10%) from the northern population of the Netherlands. We included participants from three generations, who are followed for at least 30 years, to obtain insight into healthy ageing. The aim of Lifelines is to be a resource for the national and international scientific community.
CITATION: Scholtens, S., Smidt, N., Swertz, M. A., Bakker, S. J., Dotinga, A., Vonk, J. M., … & Stolk, R. P. (2015). Cohort Profile: LifeLines, a three-generation cohort study and biobank. International journal of epidemiology, 44(4), 1172-1180.

HUNT

SAMPLE SIZE: ~88k
LOCATION: Norway
URL: https://www.ntnu.edu/hunt/hunt-biobank
DESCRIPTION:HUNT Biobank is an established and modern research biobank with high-technology equipment for storage, analysis, sample handling and delivery of samples. Our samples satisfy high quality standards and are stored in accordance with the Data Inspectorates laws and regulations. HUNT Biobank engages in sample handling from The Nord-Trøndelag Health Study (HUNT), Cohort of Norway (CONOR), and can receive samples from other researchers and research projects for storage, analysis and processing of DNA. We do not store samples from private individuals.
CITATION: Brumpton, B. M., Graham, S., Surakka, I., Skogholt, A. H., Løset, M., Fritsche, L. G., … & Willer, C. J. (2021). The HUNT Study: a population-based cohort for genetic research. medRxiv.

Generation Scotland

SAMPLE SIZE: ~24k
LOCATION: Scotland
URL: https://www.ed.ac.uk/generation-scotland
DESCRIPTION: Generation Scotland is a research study looking at the health and well-being of volunteers and their families. Generation Scotland combines responses to questionnaires of health and well-being from birth through life. We combine this with NHS health records and innovative laboratory science to understand health trajectories. We work closely with researchers and our volunteers to create a rich evidence base for understanding health. Through this rigorous, ethical and safe approach to research, we seek to enable meaningful change in public health.  
CITATION: Smith, B. H., Campbell, A., Linksted, P., Fitzpatrick, B., Jackson, C., Kerr, S. M., … & Morris, A. D. (2013). Cohort Profile: Generation Scotland: Scottish Family Health Study (GS: SFHS). The study, its participants and their potential for genetic research on health and illness. International journal of epidemiology, 42(3), 689-700.

East London Genes & Health

SAMPLE SIZE: ~100k
LOCATION: U.K.
URL: https://www.genesandhealth.org/
DESCRIPTION: Genes & Health is a huge long-term study of 100,000 people of Bangladeshi and Pakistani origin. We will link genes with health records, to study disease and treatments. Some volunteers may be invited for further studies. We are inviting volunteers to take part in two regions of the UK: East London (East London Genes & Health) and Bradford (Bradford Genes & Health).
CITATION: Finer, S., Martin, H. C., Khan, A., Hunt, K. A., MacLaughlin, B., Ahmed, Z., … & van Heel, D. A. (2020). Cohort Profile: East London Genes & Health (ELGH), a community-based population genomics and health study in British Bangladeshi and British Pakistani people. International journal of epidemiology, 49(1), 20-21i.

deCODE Genetics

SAMPLE SIZE: ~250k
LOCATION: Iceland
URL: https://www.decode.com/
DESCRIPTION: deCODE leads the world in the discovery of genetic risk factors for common diseases. Our gene discovery engine is driven by our unique approach and resources, including detailed genetic and medical information on some 500,000 individuals from around the globe taking part in our discovery work and proprietary statistical algorithms and informatics tools for gathering, analyzing, visualizing and storing large amounts of data.

The International Agency for Research on Cancer (IARC) Biobank (IBB)

SAMPLE SIZE: ~560k
LOCATION: France
URL: https://ibb.iarc.fr/
DESCRIPTION: The IARC BioBank (IBB) is one of the largest, most varied and richest International collections of samples in the world. The Biobank is publicly funded, (approximately 60% of its budget is provided by IARC Participating States through the regular budget and the remainder is from research grants) and hosts over 50 different studies, led or coordinated by IARC scientists. The IBB contains both population-based collections from research projects focusing on gene-environment interactions (as in the European Prospective Investigation into Cancer and Nutrition (EPIC) study) and disease-based collections which focus on biomarkers (as in the International Head and Neck Cancer Epidemiology (INHANCE)). Study designs include case-series, prevalence studies, case-control and cohort studies, etc. The IBB contains 5.1 million biological samples from 562,000 individuals. 4 million of the samples are from the EPIC study (over 370,000 individuals) and about one million samples from other collections (close to 200,000 individuals). Most of the samples are body fluids, including plasma, serum and urine as well as extracted DNA samples.

Biobank Graz

SAMPLE SIZE: ~1200k
LOCATION: Austria
URL: https://biobank.medunigraz.at/en/
DESCRIPTION: Biobank Graz is one of the largest and most well-known clinical biobanks in the world. Around 20 million individual specimens of body fluids and human tissue are stored here. Biobank Graz allows access to these specimens and associated data for scientific research purposes. The common goal is to develop approaches to diagnosing and treating disease.
CITATION: Huppertz, B., Bayer, M., Macheiner, T., & Sargsyan, K. (2016). Biobank Graz: the hub for innovative biomedical research. Open journal of bioresources, 3(1).

ASIA

China Kadoorie Biobank (CKB)

SAMPLE SIZE: ~500k
LOCATION: China
URL: https://www.ckbiobank.org/
DESCRIPTION:The China Kadoorie Biobank is one of the world’s largest prospective cohort studies. A long-term collaboration between the UK and China, it aims to generate reliable evidence about the lifestyle, environmental and genetic determinants of a wide range of common diseases that can inform disease prevention, risk prediction and treatment worldwide.
CITATION:Chen, Z., Chen, J., Collins, R., Guo, Y., Peto, R., Wu, F., & Li, L. (2011). China Kadoorie Biobank of 0.5 million people: survey methods, baseline characteristics and long-term follow-up. International journal of epidemiology, 40(6), 1652-1666.

Taiwan Biobank (TWB)

SAMPLE SIZE: ~150k
LOCATION: China, Taiwan
URL: https://www.twbiobank.org.tw/
DESCRIPTION:The Taiwan Biobank (TWB) is an ongoing prospective study of over 150,000 individuals aged 30-70 recruited from across Taiwan beginning in 2012. A comprehensive list of phenotypes was collected for each consented participant at recruitment and follow-up visits through structured interviews and physical measurements. Biomarkers and genetic data were also generated for all participants from blood and urine samples.
CITATION:Feng, Y. C. A., Chen, C. Y., Chen, T. T., Kuo, P. H., Hsu, Y. H., Yang, H. I., … & Lin, Y. F. (2021). Taiwan Biobank: a rich biomedical research database of the Taiwanese population. medRxiv.

BioBank Japan (BBJ)

SAMPLE SIZE: ~200k
LOCATION: Japan
URL: https://biobankjp.org/
DESCRIPTION:In 2003, BioBank Japan (BBJ) started developing one of the world’s largest disease biobanks, creating a foundation for research aimed at achieving medical care tailored to the individual traits of each patient. From a total of 260,000 patients representing 440,000 cases of 51 primarily multifactorial (common) diseases, BBJ has collected DNA, serum, medical records (clinical information), etc. with their consent. No less than 5,800 items of screened information are available for research, including the patients’ survival information, with 95% of the patients tracked over an average of 10 years. In addition to large-scale genomic analyses, omics analyses including whole genome sequencing and metabolome/proteome analyses have been performed on the DNA, serum and other biological samples collected, producing significant research findings. The genomic information acquired through the analyses continues to be used as data. The biological samples and data are widely distributed and used by researchers.
CITATION:Nagai, A., Hirata, M., Kamatani, Y., Muto, K., Matsuda, K., Kiyohara, Y., … & Kubo, M. (2017). Overview of the BioBank Japan Project: study design and profile. Journal of epidemiology, 27(Supplement_III), S2-S8.

Tohoku Medical Megabank (TMM)

SAMPLE SIZE: ~157k
LOCATION: Japan
URL: https://www.megabank.tohoku.ac.jp/english/
DESCRIPTION:Tohoku University Tohoku Medical Megabank Organization was founded to establish an advanced medical system to foster the reconstruction from the Great East Japan Earthquake. The organization has been developing a biobank that combines medical and genome information during the process of rebuilding the community medical system and supporting health and welfare in the Tohoku area. The information from the brand-new biobank will create a new medical system, and, based on the findings of its analysis, the organization aims to attract more medical practitioners from all over the country to the area, promote industry-academic partnerships, create employment in related fields, and restore the medical system in Tohoku.
CITATION:Kuriyama, S., Yaegashi, N., Nagami, F., Arai, T., Kawaguchi, Y., Osumi, N., … & Tohoku Medical Megabank Project Study Group. (2016). The Tohoku medical megabank project: design and mission. Journal of epidemiology, 26(9), 493-511.

National Biobank of Korea

SAMPLE SIZE: ~80k
LOCATION: Korea
URL: https://nih.go.kr/NIH/cms/content/eng/14/65714_view.html
DESCRIPTION: The NBK is the national control center for the collection, management, and utilization of human bioresources in Korea. The NBK also manages the KBN and contributes to the development of policies related to human bioresources, the standardization of human bioresource management, and the advancement of domestic biobanks by developing and providing support for human bioresource technologies. To guarantee fairness in bioresource distribution and to develop an efficient distribution system, the NBK also serves as the human bioresource supply hub that supports national healthcare and medical R&D.
CITATION:Cho, S. Y., Hong, E. J., Nam, J. M., Han, B., Chu, C., & Park, O. (2012). Opening of the national biobank of Korea as the infrastructure of future biomedical science in Korea. Osong public health and research perspectives, 3(3), 177-184.

Qatar Biobank

SAMPLE SIZE: ~80k
LOCATION: Qatar
URL: https://www.qatarbiobank.org.qa/
DESCRIPTION: Qatar Biobank, a center within Qatar Foundation, was created in collaboration with Hamad Medical Corporation and the Ministry of Public Health to enable local scientists to conduct medical research on prevalent health issues in Qatar.
CITATION:Al Kuwari, H., Al Thani, A., Al Marri, A., Al Kaabi, A., Abderrahim, H., Afifi, N., … & Elliott, P. (2015). The Qatar Biobank: background and methods. BMC public health, 15(1), 1-9.

The Malaysian Cohort (TMC)

SAMPLE SIZE: ~100k
LOCATION: Malaysia
URL: https://www.ukm.my/mycohort/ms/
DESCRIPTION:The Malaysian Cohort study was initiated in 2005 by the Malaysian government. The top-down approach to this population-based cohort study ensured the allocation of sufficient funding for the project which aimed to recruit 100 000 individuals aged 35–70 years. Participants were recruited from rural and urban areas as well as from various socioeconomic groups. The main objectives of the study were to identify risk factors, to study gene-environment interaction and to discover biomarkers for the early detection of cancers and other diseases.
CITATION:Jamal, R., Syed Zakaria, S. Z., Kamaruddin, M. A., Abd Jalal, N., Ismail, N., Mohd Kamil, N., … & Malaysian Cohort Study Group. (2015). Cohort profile: The Malaysian Cohort (TMC) project: a prospective study of non-communicable diseases in a multi-ethnic population. International journal of epidemiology, 44(2), 423-431.

AFRICA

Uganda Genome Resource

SAMPLE SIZE: ~6k
LOCATION: Uganda
URL: https://ega-archive.org/studies/EGAS00001000545
DESCRIPTION: Genomic studies in African populations provide unique opportunities to understand disease aetiology, human genetic diversity and population history in a regional and a global context. To leverage the relative benefits of different strategies, we undertook a combined approach of genotyping and whole-genome sequencing (WGS) in a population-based study of 6,400 individuals from a geographically defined rural community in South-West Uganda. We present data from 4,778 individuals with genotypes for ~2.2 million SNPs from the Uganda GWAS resource (UGWAS), and sequence data on up to 1,978 individuals spanning 41.5M SNPs and 4.5M indels (UG2G); 343 individuals overlap between the two datasets. We highlight the value of the largest sequence panel from Africa to date as a global resource for variant discovery, imputation and understanding the mutational spectrum and its clinical relevance in African populations. Alongside phenotype data, we provide a rich new genomic resource for researchers in Africa and globally.
CITATION:Gurdasani, D., Carstensen, T., Fatumo, S., Chen, G., Franklin, C. S., Prado-Martinez, J., … & Sandhu, M. S. (2019). Uganda genome resource enables insights into population history and genomic discovery in Africa. Cell, 179(4), 984-1002.

Nigerian 100K Genome Project (coming soon)

CITATION:Fatumo, S., Yakubu, A., Oyedele, O., Popoola, J., Attipoe, D. A., Eze-Echesi, G., … & Ene-Obong, A. (2022). Promoting the genomic revolution in Africa through the Nigerian 100K Genome Project. Nature Genetics, 54(5), 531-536.

NORTH AMERICA

Michigan Genomics Initiative

SAMPLE SIZE: ~55k
LOCATION: U.S.
URL: https://precisionhealth.umich.edu/our-research/michigangenomics/
DESCRIPTION:The Michigan Genomics Initiative (MGI) is a collaborative research effort among physicians, researchers, and patients at the University of Michigan (U-M) with the goal of combining patient electronic health record (EHR) data with corresponding genetic data to gain novel biomedical insights. There are currently ~84K consented participants through the MGI and partner studies and the addition of ~10K new participants per year is anticipated. Currently, all MGI participants with available genetic data have received care at the University of Michigan Health System.
CITATION: Zawistowski, M., Fritsche, L. G., Pandit, A., Vanderwerff, B., Patil, S., Schmidt, E. M., … & Zoellner, S. (2021). The Michigan Genomics Initiative: a biobank linking genotypes and electronic clinical records in Michigan Medicine patients. medRxiv.

Penn Medicine Biobank

SAMPLE SIZE: ~40k
LOCATION: U.S.
URL: https://pmbb.med.upenn.edu/
DESCRIPTION:The Penn Medicine BioBank (PMBB) is a research program created to study the causes and treatments of many diseases. Any Penn Medicine patient (age 18 and up) can sign up. The PMBB is a collection of biological samples, such as blood or tissue, that are donated by patient volunteers. These samples are then connected to clinical information, such as diseases or lab measures. These data are then used by researchers to discover new ways to detect, treat, and maybe even prevent or cure disease. Some of these studies may be about how genes affect health and disease. Other studies look at how genes affect response to medicines.

UCLA Precision Health Biobank

SAMPLE SIZE: ~27k
LOCATION: U.S.
URL: https://www.uclahealth.org/precision-health/programs/ucla-atlas-community-health-initiative/ucla-atlas-precision-health-biobank
DESCRIPTION: The UCLA ATLAS Precision Health Biobank, under the supervision of the Translational Pathology Core Laboratory (TPCL), collects biological samples from patients who have consented to participate in the UCLA ATLAS Community Health Initiative. As a collaborator with the UCLA ATLAS Community Health Initiative, the UCLA ATLAS Precision Health Biobank manages the collection and distribution of biological samples by removing the personally identifiable information.
CITATION:Johnson, R. D., Ding, Y., Bhattacharya, A., Chiu, A., Lajonchere, C., Geschwind, D. H., & Pasaniuc, B. (2022). The UCLA ATLAS Community Health Initiative: promoting precision health research in a diverse biobank. medRxiv.

BioMe

SAMPLE SIZE: ~32k
LOCATION: U.S.
URL: https://icahn.mssm.edu/research/ipm/programs/biome-biobank
DESCRIPTION:The Institute for Personalized Medicine at the Icahn School of Medicine at Mount Sinai is leading the movement toward diagnosis and classification of disease according to the patient’s molecular profile. This approach accommodates differences at all possible levels of exposure (genome, environment, and lifestyle) and at all stages of the process, from prevention to post-treatment follow-up. At the center of this effort is BioMe, an electronic medical record-linked biobank that enables researchers to rapidly and efficiently conduct genetic, epidemiologic, molecular, and genomic studies on large collections of research specimens linked with medical information.

BioVU

SAMPLE SIZE: ~120k
LOCATION: U.S.
URL: https://www.vumc.org/dbmi/biovu
DESCRIPTION:Planning for BioVU began in mid-2004 and the first samples were collected in February 2007. Prior to collecting DNA samples, all aspects of the BioVU project were extensively tested. BioVU now accrues 500-1000 samples per week, totaling more than 275,000 DNA samples as of January 2022. Vanderbilt clinic patients may sign the BioVU Consent Form if they wish to donate their excess blood samples, or not sign the form if they do not wish to participate.
CITATION:Roden, D. M., Pulley, J. M., Basford, M. A., Bernard, G. R., Clayton, E. W., Balser, J. R., & Masys, D. R. (2008). Development of a large‐scale de‐identified DNA biobank to enable personalized medicine. Clinical Pharmacology & Therapeutics, 84(3), 362-369.

Biobank of the Americas

SAMPLE SIZE: ~20k
LOCATION: U.S.
URL: https://bbofa.org/
URL: https://www.galatea.bio/#main-biobank
DESCRIPTION: Biobank consented samples with associated clinical data from diverse populations from throughout the United States and Latin America via healthcare and biopharma partnerships.

Colorado Center for Personalized Medicine

SAMPLE SIZE: ~34k
LOCATION: U.S.
URL: https://medschool.cuanschutz.edu/cobiobank
DESCRIPTION: Established in 2014 as a partnership between UCHealth and University of Colorado Anschutz Medical Campus, the Colorado Center for Personalized Medicine (CCPM) brings together multiple disciplines and institutions to uncover advancements in genomics that can improve diagnosis and treatment of disease, and identify more tailored approaches to population health management. To facilitate discoveries in personalized medicine, CCPM has created a Biobank that aims to be one of the largest academic medicine biospecimen repositories in the mountain and midwest regions of the U.S. The CCPM Biobank is able to link biospecimens and genotype information with patient health information from electronic medical records in an enterprise data warehouse (Health Data Compass) to support a broad range of research, operational, and clinical quality improvement agendas.

CanPath – Ontario Health Study

SAMPLE SIZE: ~7.3k
LOCATION: Canada
URL: https://canpath.ca/cohort/ontario-health-study/
DESCRIPTION:The Ontario Health Study (OHS) is a resource for investigating the ways in which lifestyle, the environment and genetics affect people’s health. It is one of the regional cohorts that collectively form the Canadian Partnership for Tomorrow’s Health (CanPath)—a pan-Canadian cohort with >330 000 participants. The linking of Canada’s rich collection of administrative health data with the cohort’s data represents a powerful means to disseminate high-quality, timely data.
CITATION:Kirsh, V. A., Skead, K., McDonald, K., Kreiger, N., Little, J., Menard, K., … & Awadalla, P. (2022). Cohort Profile: The Ontario Health Study (OHS). International Journal of Epidemiology.

Massachusetts General Brigham Biobank

SAMPLE SIZE: ~26k
LOCATION: U.S.
URL: https://www.massgeneralbrigham.org/en/research-and-innovation/participate-in-research/biobank
DESCRIPTION: The Mass General Brigham Biobank is a large research program designed to help researchers understand how people’s health is affected by their genes, lifestyle, and environment. By participating in the Mass General Brigham Biobank, you can help us better understand, treat, and even prevent the diseases that might affect your health and the health of future generations. 
CITATION: Boutin, N. T., Schecter, S. B., Perez, E. F., Tchamitchian, N. S., Cerretani, X. R., Gainer, V. S., … & Smoller, J. W. (2022). The Evolution of a Large Biobank at Mass General Brigham. Journal of Personalized Medicine, 12(8), 1323.
CITATION: Castro, V. M., Gainer, V., Wattanasin, N., Benoit, B., Cagan, A., Ghosh, B., … & Murphy, S. N. (2022). The Mass General Brigham Biobank Portal: an i2b2-based data repository linking disparate and high-dimensional patient data to support multimodal analytics. Journal of the American Medical Informatics Association, 29(4), 643-651.

Million Veteran Program (MVP)

SAMPLE SIZE: ~900k
LOCATION: U.S.
URL: https://www.mvp.va.gov/pwa/
DESCRIPTION: The Million Veteran Program (MVP) is a national research program to learn how genes, lifestyle, and military exposures affect health and illness. Since launching in 2011, over 900,000 Veteran partners have joined one of the world’s largest programs on genetics and health.
CITATION:Gaziano, J. M., Concato, J., Brophy, M., Fiore, L., Pyarajan, S., Breeling, J., … & O’Leary, T. J. (2016). Million Veteran Program: A mega-biobank to study genetic influences on health and disease. Journal of clinical epidemiology, 70, 214-223.