背景介绍

在进行meta分析之前，我们首先要对gwas的sumstats进行预处理，这一步看似简单，但却是高质量meta分析中必不可少的一步，本文将简单介绍预处理过程中需要注意的事项，并简单介绍一款实用的预处理软件。

不同GWAS sumstats通常格式千差万别，单单统一格式就已经让人烦恼，而更令人万念俱灰的是，不同sumstats的参考的基因组版本并不相同，allele的正负链也不统一，SNPID有的用rsID（rsID的版本又是个问题~_~），有的又用chrpos…

针对以上种种的不统一，我们都要逐一进行处理，要注意事项主要包括：

参考基因组的版本：hg19还是hg38 或其他
表示变异等位的正负链是否统一
如果是a1，a2这种表示形式，则需要明确effect allele与non-efffect allele
对于palindromic SNP （A/T, 或是G/C如何处理？）
表示效应量的是beta还是OR

Pipeline

预处理简要的pipeline：

确定参考所用的VCF文件, 最好包含相应群体的allele frequency（AF）数据
对于基因组版本不一致的sumstats首先liftover转换成参考vcf的基因组版本
对sumstats进行整理并统一beta与OR的使用，
1. 数量表型用beta与SE
2. 二分类表型用OR与OR_95L，OR_95U
对不同群体的sumstats，使用相应的AF,利用harmoniser工具分别进行harmonise
1. 设定依据AF判断正负链的阈值
对于处理后的gwas sumstats进行统一的质控 QC
1. INFO
2. MAF等

Genetics-Sumstats-Harmoniser工具使用教程

下载与安装

harmonise可以使用的工具

该工具由opentargets项目组成员开发

github链接：https://github.com/opentargets/genetics-sumstat-harmoniser

# Download 从github下载
git clone <https://github.com/opentargets/sumstat_harmoniser.git>
cd sumstat_harmoniser

# Install dependencies into isolated environment 创建环境
conda env create -n sumstat_harmoniser --file environment.yaml

# Activate environment 激活环境
source activate sumstat_harmoniser

# Run tests 检验是否安装成功
cd tests
./run_tests.py

环境要求 Requirements

python3
HTSlib (for tabix)

参考数据 Reference data

可以使用的参考数据：

以global biobank meta-analysis initiative为例，其参考数据使用了gnomAD的数据：

https://gnomad.broadinstitute.org/downloads

示例:

INFO字段中包含了不同族裔中的allele frequency的信息：

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
1	10108	rs62651026	CAACCCT	C	46514.3	RF	AF_eas=0;AF_nfe=0;AF_fin=0.0227273
1	10109	rs376007522	AACCCT	A	89837.3	RF	AF_eas=0;AF_nfe=0.0512821;AF_fin=0.0588235

使用方法

输入选项

需要两个文件：一是gwas的sumstats，二是参考的vcf文件

Input files:
  --sumstats <file>     GWAS summary statistics file
  --vcf <file>          Reference VCF file. Use # as chromosome wildcard.

其中gwas sumstats还需要指定对应的列名：

Input column names:
  --chrom_col <str>     Chromosome column 染色体
  --pos_col <str>       Position column 碱基位置
  --effAl_col <str>     Effect allele column 效应等位
  --otherAl_col <str>   Other allele column 非效应等位
  --beta_col <str>      beta column
  --or_col <str>        Odds ratio column 
  --or_col_lower <str>  Odds ratio lower CI column
  --or_col_upper <str>  Odds ratio upper CI column
  --eaf_col <str>       Effect allele frequency column
  --rsid_col <str>      rsID column in the summary stat file

输出选项

输出3个文件，Harmonise后的文件，Harmonise过程中的统计文件，以及forward/reverse/palindromic variants的数量的文件，通过以下选项指定：

Output files:
  --hm_sumstats <file>  Harmonised sumstat output file (use 'gz' extension to
                        gzip)
  --hm_statfile <file>  Statistics from harmonisation process output file.
                        Should only be used in conjunction with --hm_sumstats.
  --strand_counts <file>
                        Output file showing number of variants that are
                        forward/reverse/palindromic

其他选项

Other args:
  --only_chrom <str>    Only process this chromosome
  --in_sep <str>        Input file column separator (default: tab)
  --out_sep <str>       Output file column separator (default: tab)
  --na_rep_in <str>     How NA are represented in the input file (default: "")
  --na_rep_out <str>    How to represent NA values in output (default: "")
  --chrom_map <str> [<str> ...]
                        Map summary stat chromosome names, e.g. `--chrom_map
                        23=X 24=Y`

Harmonisation mode 模式选择:

可以选择以下四种模式：[infer|forward|reverse|drop]

对于 palindromic variants 的处理方式为:

(a) infer strand from effect-allele freq, 根据AF推测

(b) assume forward strand, 假设全部都为正链

(d) drop palindromic variants 舍弃所有palindromic variants

AF字段名与阈值

如果选择通过AF推测正负链（–palin_mode infer），那么还需要制定vcf文件中af的字段名与推测的阈值 –af_vcf_field <str> VCF info field containing alt allele freq (default:AF_NFE)

指定VCF文件中INFO里包含AF信息的字段

–infer_maf_threshold <float>Max MAF that will be used to infer palindrome strand (default: 0.42)

推测palindrome strand的最大MAF的阈值

使用例

如果输入的sumstats为以下格式：

CHROMOSOME      POSITION        MARKERNAME      NEA     EA      EAF     INFO    N       BETA    SE      P       OR      OR_95L  OR_95U

那么harmonise时，可以使用如下的代码：

sumstat_harmoniser \
 --sumstats ${sumstats} \
 --vcf gnomad.vcf.gz \    #以gnomad为参考
 --eaf_col EAF \
 --af_vcf_field AF_eas \
 --chrom_col CHROMOSOME \
 --pos_col POSITION \
 --effAl_col EA \
 --otherAl_col NEA \
 --beta_col BETA \
 --or_col OR \
 --or_col_lower OR_95L \
 --or_col_upper OR_95U \
 --palin_mode infer \
 --hm_sumstats ${out}.hm \
 --hm_statfile ${out}.stats

harmonise后的结果为原文件前加上以下harmonise后的列

hm_varid hm_rsid hm_chrom hm_pos hm_other_allele hm_effect_allele hm_beta hm_OR hm_OR_lowerCI hm_OR_upperCI hm_eaf hm_code

参考

https://gnomad.broadinstitute.org/downloads

GWAS Sumstats Harmonization GWAS数据的协调统一

背景介绍

Pipeline

Genetics-Sumstats-Harmoniser工具使用教程

下载与安装

环境要求 Requirements

参考数据 Reference data

使用方法

输入选项

输出选项

其他选项

Harmonisation mode 模式选择:

AF字段名与阈值

使用例

参考

《GWAS Sumstats Harmonization GWAS数据的协调统一》有2个想法

留下评论取消回复

背景介绍

Pipeline

Genetics-Sumstats-Harmoniser工具使用教程

下载与安装

环境要求 Requirements

参考数据 Reference data

使用方法

输入选项

输出选项

其他选项

Harmonisation mode 模式选择:

AF字段名与阈值

使用例

参考

共享此文章：

相关

《GWAS Sumstats Harmonization GWAS数据的协调统一》有2个想法

留下评论 取消回复

留下评论取消回复