多基因风险分数 PRS( Polygenic risk score)系列之九: 使用PLINK2分染色体计算PRS并加和

本文内容:

  1. 下载PGS分数文件
  2. 第一步 计算每条染色体的原始分数
  3. 第二步 将各个染色体的结果提取
  4. 第三步 将分数进行简单加和

本文主要演示分染色体计算PRS并求和的过程。

下载PGS分数文件 Download PGS score

演示PGS分数文件下载自PGScatalog

下载harmonise后的文件:

wget https://ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/PGS000012/ScoringFiles/Harmonized/PGS000012_hmPOS_GRCh37.txt.gz
gunzip PGS000012_hmPOS_GRCh37.txt.gz
head PGS000012_hmPOS_GRCh37.txt
###PGS CATALOG SCORING FILE - see <https://www.pgscatalog.org/downloads/#dl_ftp_scoring> for additional information
#format_version=2.0
##POLYGENIC SCORE (PGS) INFORMATION
#pgs_id=PGS000012
#pgs_name=GRS49K
#trait_reported=Coronary artery disease
#trait_mapped=coronary artery disease
#trait_efo=EFO_0001645
#genome_build=hg19
#variants_number=49310

#下载后去除文件头部的comment,方便Plink2使用
awk '$1!~/#/{print $0}' PGS000012_hmPOS_GRCh37.txt > PGS000012_hmPOS_GRCh37_plink2.txt

head PGS000012_hmPOS_GRCh37_plink2.txt
rsID    effect_allele   effect_weight   hm_source       hm_rsID hm_chr  hm_pos  hm_inferOtherAllele
rs1333045       C       0.187251        ENSEMBL rs1333045       9       22119195        T
rs1537370       T       0.17296 ENSEMBL rs1537370       9       22084310        C
rs9863247       T       0.136328        ENSEMBL rs9863247       3       161125373       C
rs11203077      T       0.131725        ENSEMBL rs11203077      10      91097085        G
rs10268558      C       0.12966 ENSEMBL rs10268558      7       18924927        T
rs7148203       C       0.128366        ENSEMBL rs7148203       14      45062275        T
rs12747328      T       0.127016        ENSEMBL rs12747328      1       159429702       C
rs17810947      G       0.126626        ENSEMBL rs17810947      18      70754886        A/T
rs2412710       A       0.125677        ENSEMBL rs2412710       15      42683787        G

基因组文件使用 https://choishingwan.github.io/PRS-Tutorial/plink/ 所提供的的EUR文件

EUR.QC.bed
EUR.QC.bim
EUR.QC.fam

注:本教程计算结果无实际意义,仅为演示计算过程使用。

第一步 计算每条染色体的原始分数 step1 calculate the raw scores for each chromosome

score=PGS000012_hmPOS_GRCh37_plink2.txt

for chr in $(seq 1 22)
do
plink2 \
    --bfile ./EUR.QC \
    --score ${score} 1 2 3 header list-variants cols=+scoresums \
    --chr ${chr} \
    --out EUR_CAD.chr${chr} 
done

—score 后跟 分数文件路径, 并指定 文件的 ID, effect allele 以及 effect_weight的列数。

header :分数文件有header,计算时会自动跳过第一行

list-variants : 将计算所用的所有variant ID写进plink2.sscore.vars

—chr 指定所计算的染色体编号

注1: 这里要加上cols=+scoresums 来获取原始结果,默认输出的仅为平均后的结果

--score applies one or more linear scoring systems to each sample, and reports results to plink2.sscore. More precisely, if G is the full genotype/dosage matrix (rows = alleles, columns = samples) and a is a scoring-system vector with one coefficient per allele, --score computes the vector-matrix product aTG, and then divides by the number of variants when reporting score-averages.

注2:有时候分数文件中不包括某些染色体的variants,所以有些染色会计算失败,求和时应注意

所得文件

EUR_CAD.chr10.log          EUR_CAD.chr15.log          EUR_CAD.chr1.log           EUR_CAD.chr3.log          EUR_CAD.chr8.log
EUR_CAD.chr10.sscore       EUR_CAD.chr15.sscore       EUR_CAD.chr1.sscore        EUR_CAD.chr3.sscore       EUR_CAD.chr8.sscore
EUR_CAD.chr10.sscore.vars  EUR_CAD.chr15.sscore.vars  EUR_CAD.chr1.sscore.vars   EUR_CAD.chr3.sscore.vars  EUR_CAD.chr8.sscore.vars
EUR_CAD.chr11.log          EUR_CAD.chr16.log          EUR_CAD.chr20.log          EUR_CAD.chr4.log          EUR_CAD.chr9.log
EUR_CAD.chr11.sscore       EUR_CAD.chr16.sscore       EUR_CAD.chr20.sscore       EUR_CAD.chr4.sscore       EUR_CAD.chr9.sscore
EUR_CAD.chr11.sscore.vars  EUR_CAD.chr16.sscore.vars  EUR_CAD.chr20.sscore.vars  EUR_CAD.chr4.sscore.vars  EUR_CAD.chr9.sscore.vars
EUR_CAD.chr12.log          EUR_CAD.chr17.log          EUR_CAD.chr21.log          EUR_CAD.chr5.log          
EUR_CAD.chr12.sscore       EUR_CAD.chr17.sscore       EUR_CAD.chr21.sscore       EUR_CAD.chr5.sscore       
EUR_CAD.chr12.sscore.vars  EUR_CAD.chr17.sscore.vars  EUR_CAD.chr21.sscore.vars  EUR_CAD.chr5.sscore.vars  
EUR_CAD.chr13.log          EUR_CAD.chr18.log          EUR_CAD.chr22.log          EUR_CAD.chr6.log          
EUR_CAD.chr13.sscore       EUR_CAD.chr18.sscore       EUR_CAD.chr22.sscore       EUR_CAD.chr6.sscore       
EUR_CAD.chr13.sscore.vars  EUR_CAD.chr18.sscore.vars  EUR_CAD.chr22.sscore.vars  EUR_CAD.chr6.sscore.vars  
EUR_CAD.chr14.log          EUR_CAD.chr19.log          EUR_CAD.chr2.log           EUR_CAD.chr7.log
EUR_CAD.chr14.sscore       EUR_CAD.chr19.sscore       EUR_CAD.chr2.sscore        EUR_CAD.chr7.sscore
EUR_CAD.chr14.sscore.vars  EUR_CAD.chr19.sscore.vars  EUR_CAD.chr2.sscore.vars   EUR_CAD.chr7.sscore.vars

分数文件内容

head EUR_CAD.chr1.sscore
#FID    IID     ALLELE_CT       NAMED_ALLELE_DOSAGE_SUM SCORE1_AVG      SCORE1_SUM
HG00096 HG00096 1046    486     0.00757595      7.92444
HG00097 HG00097 1046    484     0.00783352      8.19386
HG00099 HG00099 1046    505     0.00798495      8.35226
HG00101 HG00101 1046    491     0.00791584      8.27997
HG00102 HG00102 1046    474     0.00769178      8.04561
HG00103 HG00103 1046    502     0.00799328      8.36097
HG00105 HG00105 1046    500     0.00793847      8.30364
HG00107 HG00107 1046    468     0.00770316      8.0575
HG00108 HG00108 1046    469     0.00732639      7.66341

分染色体的计算的目的是,实际操作中,通常会使用巨大的pgen或bgen进行计算,合并文件并不现实, 这是一般需要分染色体计算并手动求和:

for chr in $(seq 1 22)
do
plink2 \
    --pfile chr${chr} \
    --score ${score} 1 2 3 header list-variants cols=+scoresums \
    --out chr${chr}
done

第二步 将各个染色体的结果提取 step2 extract raw scores into a single file

scoreFile=EUR_CAD

count=1
for chr in $(seq 1 22)
do
    file=${scoreFile}.chr${chr}.sscore
    if [ -s "$file" ] ; then 
        if [ ${count} -eq 1 ] ;then
        echo "ID        chr${chr}" > ./${scoreFile}.chrALL.score
        awk 'NR>1{print $1,$6}' ./${file} >> ./${scoreFile}.chrALL.score
        else
        paste ./${scoreFile}.chrALL.score <(cut -f6 ./${file} | awk -v chr=chr${chr} 'NR==1 {$0=chr} 1') >./${scoreFile}.temp&&mv ./${scoreFile}.temp ./${scoreFile}.chrALL.score
        fi
        count=`expr $count + 1`
     fi
done


所得文件

head EUR_CAD.chrALL.score
ID      chr1    chr2    chr3    chr4    chr5    chr6    chr7    chr8    chr9    chr10   chr11   chr12   chr13   chr14   chr15   chr16   chr17   chr18   chr19   chr20       chr21   chr22
HG00096 7.92444 6.41628 4.9392  3.9937  5.08024 7.50241 4.61922 3.59464 4.24131 4.56312 5.03334 5.55227 2.95436 2.32917 3.39488 3.61625 3.40913 2.02032 1.95575 2.74394     0.79493 1.33408
HG00097 8.19386 6.33598 5.0713  4.0514  4.92887 7.30242 4.19193 3.66247 4.514   4.42716 5.26378 5.11748 2.7589  2.47905 3.82365 3.83448 3.44519 2.09788 1.80506 2.7079      0.811121        1.27481
HG00099 8.35226 6.61542 5.199   3.94708 4.88824 7.72775 4.31098 3.58159 4.30785 4.81232 5.2327  6.05063 2.8803  2.3651  3.36827 3.54378 3.29725 2.08449 1.88188 2.52587     0.767397        1.21001
HG00101 8.27997 6.13397 5.36465 4.04592 4.88809 7.86323 4.49118 3.60872 4.27056 4.6258  5.42121 5.26046 2.70375 2.52093 3.56581 3.45794 3.30534 2.1327  1.70916 2.66579     0.818805        1.3808
HG00102 8.04561 6.32975 5.20324 3.86051 4.67724 7.78206 4.69662 3.38095 4.40059 4.67174 5.2312  5.04913 2.69084 2.48935 3.62454 3.61557 3.3044  2.12043 1.96028 2.82496     0.808453        1.46265
HG00103 8.36097 6.1202  5.31402 4.24136 4.91736 7.34715 4.55809 3.60716 4.15705 4.66235 5.04477 5.53387 2.70537 2.36537 3.76842 3.64108 3.33273 2.11081 2.00393 2.68089     0.763437        1.32735
HG00105 8.30364 6.13937 5.15716 4.19656 5.06392 7.8889  4.33995 3.78158 4.24975 4.58436 5.32105 5.72797 2.84577 2.43968 3.39822 3.60722 3.44641 1.9986  1.775   2.7736      0.793725        1.41736
HG00107 8.0575  6.51259 5.19973 4.06067 4.78711 7.14954 4.66686 3.81154 4.42637 4.45144 5.26257 5.46242 2.82289 2.28362 3.33811 3.58244 3.25041 2.11758 2.00922 2.70618     0.917279        1.43502
HG00108 7.66341 6.18669 4.98761 4.05088 4.7294  7.39041 4.08688 3.53173 4.43728 4.80822 5.02485 5.52436 3.12156 2.35199 3.54311 3.49076 2.97043 2.12883 1.91326 2.51056     0.817221        1.47096

第三步 将分数进行简单加和 step3 combine the scores using awk

scoreFile=EUR_CAD
#sum the scores from each chr
awk 'BEGIN{t=0}
        {if(NR==1){
                print "ID","score"
                  }
                else{
                        for(i=2;i<=NF;i++) t+=$i 
                        print $1,t
                    }
        }' ./${scoreFile}.chrALL.score > ./${scoreFile}.summary.score

所得文件

head EUR_CAD.summary.score
ID score
HG00096 88.013
HG00097 176.112
HG00099 265.062
HG00101 353.577
HG00102 441.807
HG00103 530.37
HG00105 619.62
HG00107 707.931
HG00108 794.672

求和完成,获得了原始分数(没有进行任何操作,没有标准化,没有平均,只是所有variant的效应量与dosage乘积的和),可以用于后续PRS分析

参考:

https://www.cog-genomics.org/plink/2.0/score

https://www.pgscatalog.org/browse/scores/

https://choishingwan.github.io/PRS-Tutorial/plink/

发表评论

Fill in your details below or click an icon to log in:

WordPress.com 徽标

您正在使用您的 WordPress.com 账号评论。 注销 /  更改 )

Facebook photo

您正在使用您的 Facebook 账号评论。 注销 /  更改 )

Connecting to %s