SNP的rsID与位置信息的相互匹配 rsID/ chr:pos conversion

本文内容:

1 个别SNP位点或rsID查询 
	dbsnp网页查询
2 少量转换可用的网页工具 (100,000 snp以内)
	SNP-nexus网页工具 
	vep注释工具网页版
3 大量转换使用的命令行工具 
	ANNOVAR命令行
	VEP命令行
4 参考

rsID与染色体位置信息的匹配或转换是生物信息学研究中必不可少的,但却又是十分繁琐的一项步骤,很多同学都在纠结这个问题,本文将总结常用的转换方法,以供参考:

转换前需要考虑的问题:

  1. 变异的标准化 :Variant Normalization 变异的标准化
  2. 参考基因组的版本 :人类参考基因组 Human reference genome hg19/hg38/GRCh37/GRCh38/b37/hs375d

1.个别SNP的转换:

直接使用dbsnp(详见:SNP数据库 – dbSNP)查询,查询个别SNP时方便快捷,可以直接搜rsID

也可以以chr:pos的形式搜rsID

2.少量转换可用的网页工具 (100,000 snp以内)

2.1 SNPnexus : https://www.snp-nexus.org/v4/

首先输入用户信息,学术用途是免费的,使用自己的edu邮箱即可

之后选择assembly的版本

选择后可以通过多种方式提交自己要查询的SNP

2.1.1 这里我们通过上传文件的方式对rsID进行注释

输入文件 snp.txt 格式如下 (其他格式的具体描述详见:https://www.snp-nexus.org/v4/guide/

dbsnp   rs293794
dbsnp   rs1052133
dbsnp   rs3136820
dbsnp   rs2272615
dbsnp   rs2953993
dbsnp   rs1799782
dbsnp   rs25487
dbsnp   rs2248690
dbsnp   rs4918
dbsnp   rs1071592

然后是注释用的数据选择,各数据库版本见:https://www.snp-nexus.org/v4/guide/#data_source

因为我们只查询位置信息,就不勾选其他数据库,只使用默认的Ensembl

点击submit query后,稍作等待,结果就会显示出来,可以导出为VCF或txt文本格式。

2.1.2 位点转换为rsID时,输入文件如下:

格式为:< Type Name Position Allele1 Allele2 Strand > # Genomic position data for novel SNPs

chromosome	1	9262273	G	A	1
chromosome	1	45015006	G	A	1
chromosome	1	226982947	C	T	1
chromosome	2	189001450	G	A	1
chromosome	17	46010375	T	C	1

提交后即可查询,结果如下所示

2.2 VEP网页版

VEP注释工具使用详见(预留链接)

VEP网页版:https://asia.ensembl.org/info/docs/tools/vep/online/index.html

点击launch VEP

2.2.1 可以在input data处直接粘贴要查询的rsID,或是上传文件

点击输入框下方的example,可以查看可用的输入格式

这里以Variant identifiers 为例:

选择合适数据库,提交后可以看到查询状态

完成后点击View results

可查看基本信息,也可以导出

2.2.2 位点转换rsID时,使用如下Ensembl默认的输入格式即可:

3.大量转换时的命令行工具

3.1 可使用如上所述VEP工具的命令行版本

下载地址与文档见:https://asia.ensembl.org/info/docs/tools/vep/index.html

3.2 也可以使用ANNOVAR

安装与使用具体使用方法见:使用ANNOVAR 对Variants进行功能注释 Annotation POST-GWAS analysis

这里只简单介绍rsID与chr:pos互相转换的方法

3.2.1 rsID转chr:pos

使用ANNOVAR的convert2annovar.pl 程序与-format rsid选项即可注释位置信息,使用例如下所示,

(注意参考基因组的版本)

[kaiwang@biocluster ~/]$ cat example/snplist.txt 
rs74487784
rs41534544
rs4308095
rs12345678

[kaiwang@biocluster ~/]$ convert2annovar.pl -format rsid example/snplist.txt -dbsnpfile humandb/hg19_snp138.txt > snplist.avinput 
NOTICE: Scanning dbSNP file humandb/hg19_snp138.txt...
NOTICE: input file contains 4 rs identifiers, output file contains information for 4 rs identifiers
WARNING: 1 rs identifiers have multiple records (due to multiple mapping) and they are all written to output

[kaiwang@biocluster ~/]$ cat snplist.avinput 
chr2 186229004 186229004 C T rs4308095
chr7 6026775 6026775 T C rs41534544
chr7 6777183 6777183 G A rs41534544
chr9 3901666 3901666 T C rs12345678
chr22 24325095 24325095 A G rs74487784

3.2.2 chr:pos转rsID

位点信息转化为rsID时需要使用基于筛选的注释:

这里使用avsnp150,具体描述可见:https://annovar.openbioinformatics.org/en/latest/user-guide/download/

使用方法如下:

#首先下载注释用数据库:
annotate_variation.pl -downdb -webfrom annovar -buildver hg19 avsnp150 humandb/

#完成后即可进行转换
annotate_variation.pl ex1.avinput humandb/ -filter -build hg19 -dbtype avsnp150

输入文件使用上述的(类bed文件),ex1.avinput

chr2 186229004 186229004 C T
chr7 6026775 6026775 T C
chr7 6777183 6777183 G A
chr9 3901666 3901666 T C
chr22 24325095 24325095 A G 

输出文件如下,ex1.avinput.hg19_avsnp150_dropped,

(注意某些SNP的rsID有合并等原因,版本不同rsID注释结果不一定相同)

avsnp150        rs4308095       chr2 186229004 186229004 C T
avsnp150        rs140195        chr22 24325095 24325095 A G
avsnp150        rs2228006       chr7 6026775 6026775 T C
avsnp150        rs766725467     chr7 6777183 6777183 G A
avsnp150        rs12345678      chr9 3901666 3901666 T C

注释rsID可能会遇到的问题:

https://annovar.openbioinformatics.org/en/latest/articles/dbSNP/

参考:

dbsnp:https://www.ncbi.nlm.nih.gov/snp/

annovar:https://annovar.openbioinformatics.org/en/latest/

vep:https://asia.ensembl.org/info/docs/tools/vep/index.html

snpnexus:https://www.snp-nexus.org/v4/

使用ANNOVAR 对Variants进行功能注释 Annotation POST-GWAS analysis

在GWAS分析中,或是进行针对rare variants的gene-based test时,对SNP进行功能注释是必不可少的一个步骤,本文将简单介绍一款最为常用的SNP注释软件 ANNOVAR url: https://annovar.openbioinformatics.org/en/latest/

ANNOVAR是一款对基因变异进行功能注释的软件,可以对多种生物的变异进行注释(包括 human genome hg18, hg19, hg38, 以及 mouse, worm, fly, yeast 等)。 给定一个包含染色体,起点,终点,参考核苷酸与观测核苷酸, ANNOVAR可以进行如下的功能注释:

  1. 基于基因的注释 Gene-based annotation:主要针对SNP或CNV是否引起蛋白编码改变进行注释,可以灵活选用 RefSeq genes, UCSC genes, ENSEMBL genes, GENCODE genes, AceView genes等多种不同来源的基因定义系统。
  2. 基于区域的注释 Region-based annotation:针对基因组某一特定区域的变异进行注释,例如44个物种的保守区域,预测的转录因子结合位点,GWAS hit, ENCODE H3K4Me1/H3K4Me3/H3K27Ac/CTCF sites,ChIP-Seq peaks, RNA-Seq peaks等
  3. 基于筛选的注释 Filter-based annotation:使用某一特定的数据库进行筛选注释,例如注释变异的rs id,1000基因组项目中的MAF,或是ExAC、gnomAD等,再例如SIFT/ PolyPhen/ LRT/ MutationTaster/ MutationAssessor/ FATHMM/ MetaSVM/ MetaLR 分数等。

实战操作

ANNOVAR是由perl编写的程序,首先通过下载页面,填写注册表格(你还可以给作者留言,作者貌似指着这个留言写基金申请,靠用户感动评审~),学术用途可以免费试用,几分钟后邮件会发来下载链接。

下载并解压(将路径添加到环境中),

我们还需要下载注释用的参考数据库:

目前可用的参考库:https://annovar.openbioinformatics.org/en/latest/user-guide/download/

注意:annovar只能使用官方提供的注释数据库

对于初学者来说,使用ANNOVAR最简单的方法就是使用 table_annovar.pl 程序。 我们通过以下例子来解释annovar的用法,

# 1.下载注释用数据库
annotate_variation.pl -buildver hg19 -downdb -webfrom annovar refGene humandb/
annotate_variation.pl -buildver hg19 -downdb cytoBand humandb/
annotate_variation.pl -buildver hg19 -downdb -webfrom annovar exac03 humandb/ 
annotate_variation.pl -buildver hg19 -downdb -webfrom annovar avsnp147 humandb/ 
annotate_variation.pl -buildver hg19 -downdb -webfrom annovar dbnsfp30a humandb/

# 2.使用table_annovar.pl对 example/ex1.avinput输入文件进行注释
table_annovar.pl example/ex1.avinput humandb/ -buildver hg19 -out myanno -remove -protocol refGene,cytoBand,exac03,avsnp147,dbnsfp30a -operation gx,r,f,f,f -nastring . -csvout -polish -xref example/gene_xref.txt

如上所示,第一部分代码,下载了对应hg19的各种数据库到humandb/ 文件夹(可以自定义)里,数据库内容可以参考以下链接:https://annovar.openbioinformatics.org/en/latest/user-guide/filter/

包含了对各个数据库详细的介绍:

第二段代码使用table_annovar.pl主程序对example/ex1.avinput输入文件同时进行多数据库的注释,

  • -buildver hg19 : 参考基因组使用hg19
  • -out myanno : 输出前缀为 myanno
  • -remove : 注释完成后删除缓存文件
  • -protocol refGene,cytoBand,exac03,avsnp147,dbnsfp30a : 所用的数据库包括 ExAC version 0.3 (exac03) dbNFSP version 3.0a (dbnsfp30a), dbSNP version 147 with left-normalization (avsnp147) 数据库
  • -operation gx,r,f,f,f :指定针对protocol的操作(与-protocol一一对应),可选的操作包括g (gene-based), gx (gene-based with cross-reference annotation from -xref argument), r (region-based) 以及 f(filter-based).
  • -nastring “.” : 缺失的注释用 “.”替代
  • -csvout 输出为csv格式
  • -xref example/gene_xref.txt 交叉引用文件,例如 已知的由这个基因变异引起的疾病

其中annovar的标准输入格式.avinput的具体内容如下:

第一至五列为必须,分别是染色体号,起始位点,结束位点,参考等位基因 reference allele 以及 替代等位基因 alternative allele,第五列之后可自由添加所需要的信息

1 948921 948921 T C comments: rs15842, a SNP in 5' UTR of ISG15
1 1404001 1404001 G T comments: rs149123833, a SNP in 3' UTR of ATAD3C
1 5935162 5935162 A T comments: rs1287637, a splice site variant in NPHP4
1 162736463 162736463 C T comments: rs1000050, a SNP in Illumina SNP arrays
1 84875173 84875173 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
1 13211293 13211294 TC - comments: rs59770105, a 2-bp deletion
1 11403596 11403596 - AT comments: rs35561142, a 2-bp insertion
1 105492231 105492231 A ATAAA comments: rs10552169, a block substitution
1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
16 50745926 50745926 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
16 50763778 50763778 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
13 20763686 20763686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
13 20797176 21105944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss

当然,除了自有的avinput格式外,ANNOVAR还支持VCF等多种常用格式输入文件(-vcfinput)。

table_annovar.pl example/ex2.vcf humandb/ -buildver hg19 -out myanno -remove -protocol refGene,cytoBand,exac03,avsnp147,dbnsfp30a -operation g,r,f,f,f -nastring . -vcfinput -polish

另外,我们也可以利用ANNOVAR的核心程序 annotate_variation.pl,快速简便的完成单一类型的注释

# 基于基因
annotate_variation.pl -geneanno -dbtype refGene -buildver hg19 example/ex1.avinput humandb/
#基于区域
annotate_variation.pl -regionanno -dbtype cytoBand -buildver hg19 example/ex1.avinput humandb/ 
#基于筛选
annotate_variation.pl -filter -dbtype exac03 -buildver hg19 example/ex1.avinput humandb/

以基于基因的注释为例(用法三者类似),

第一步下载数据库 refGene

annotate_variation.pl -downdb -buildver hg19 -webfrom annovar refGene humandb/
NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg19_refGene.txt.gz ... OK
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg19_refLink.txt.gz ... OK
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg19_refGeneMrna.fa.gz ... OK
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg19 build version, with files saved at the 'humandb' directory

第二步 注释 (这里没有指定注释的数据库,因为 annotate_variation.pl 默认的参数是 –geneanno -dbtype refGene)

annotate_variation.pl -out ex1 -build hg19 example/ex1.avinput humandb/
NOTICE: The --geneanno operation is set to ON by default
NOTICE: Reading gene annotation from humandb/hg19_refGene.txt ... Done with 48660 transcripts (including 10375 without coding sequence annotation) for 25588 unique genes
NOTICE: Reading FASTA sequences from humandb/hg19_refGeneMrna.fa ... Done with 14 sequences
WARNING: A total of 333 sequences will be ignored due to lack of correct ORF annotation
NOTICE: Finished gene-based annotation on 15 genetic variants in example/ex1.avinput
NOTICE: Output files were written to ex1.variant_function, ex1.exonic_variant_function

第三步 查看结果

除了log文件外,还有三个后缀分别为.variant_function ,.exonic_variant_function 以及 .invalid_input 的文件生成:

文件1 .variant_function

cat ex1.variant_function 
UTR5 ISG15(NM_005101:c.-33T>C) 1 948921 948921 T C comments: rs15842, a SNP in 5' UTR of ISG15
UTR3 ATAD3C(NM_001039211:c.*91G>T) 1 1404001 1404001 G T comments: rs149123833, a SNP in 3' UTR of ATAD3C
splicing NPHP4(NM_001291593:exon19:c.1279-2T>A,NM_001291594:exon18:c.1282-2T>A,NM_015102:exon22:c.2818-2T>A) 1 5935162 5935162 A T comments: rs1287637, a splice site variant in NPHP4
intronic DDR2 1 162736463 162736463 C T comments: rs1000050, a SNP in Illumina SNP arrays
intronic DNASE2B 1 84875173 84875173 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
intergenic LOC645354(dist=11566),LOC391003(dist=116902) 1 13211293 13211294 TC - comments: rs59770105, a 2-bp deletion
intergenic UBIAD1(dist=55105),PTCHD2(dist=135699) 1 11403596 11403596 - AT comments: rs35561142, a 2-bp insertion
intergenic LOC100129138(dist=872538),NONE(dist=NONE) 1 105492231 105492231 A ATAAA comments: rs10552169, a block substitution
exonic IL23R 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
exonic ATG16L1 2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
exonic NOD2 16 50745926 50745926 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
exonic NOD2 16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
exonic NOD2 16 50763778 50763778 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
exonic GJB2 13 20763686 20763686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
exonic CRYL1,GJB6 13 20797176 21105944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss

第一列就表示这个变异的功能注释,是位于外显子还是内含子等等,可能的注释结果如下图所示:

第二个文件,.exonic_variant_function:包含了外显子上的变异所引起的具体的氨基酸变化。如下所示,

cat ex1.exonic_variant_function 
line9 nonsynonymous SNV IL23R:NM_144701:exon9:c.G1142A:p.R381Q, 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
line10 nonsynonymous SNV ATG16L1:NM_001190267:exon9:c.A550G:p.T184A,ATG16L1:NM_017974:exon8:c.A841G:p.T281A,ATG16L1:NM_001190266:exon9:c.A646G:p.T216A,ATG16L1:NM_030803:exon9:c.A898G:p.T300A,ATG16L1:NM_198890:exon5:c.A409G:p.T137A, 2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
line11 nonsynonymous SNV NOD2:NM_022162:exon4:c.C2104T:p.R702W,NOD2:NM_001293557:exon3:c.C2023T:p.R675W, 16 50745926 50745926 C comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
line12 nonsynonymous SNV NOD2:NM_022162:exon8:c.G2722C:p.G908R,NOD2:NM_001293557:exon7:c.G2641C:p.G881R, 16 50756540 50756540 G comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
line13 frameshift insertion NOD2:NM_022162:exon11:c.3017dupC:p.A1006fs,NOD2:NM_001293557:exon10:c.2936dupC:p.A979fs, 16 50763778 5076377comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
line14 frameshift deletion GJB2:NM_004004:exon2:c.35delG:p.G12fs, 13 20763686 20763686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
line15 frameshift deletion GJB6:NM_001110221:wholegene,GJB6:NM_001110220:wholegene,GJB6:NM_001110219:wholegene,CRYL1:NM_015974:wholegene,GJB6:NM_006783:wholegene, 13 20797176 21105944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss

可能的取值如下所示,如果我们需要进行gene-based test来找到rare variants,那我们就可以依据下表选中我们想要纳入研究的variants,制作相应的group file。

第三个文件, .invalid_input 则是注释失败的输入文件中的变异。

参考:

https://www.nature.com/articles/nprot.2015.105

https://annovar.openbioinformatics.org/en/latest/user-guide/gene/