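In GWAS analysis, or when running gene-based tests on rare variants, functional annotation of SNPs is an essential step. This post briefly introduces ANNOVAR, one of the most widely used SNP annotation tools. URL: https://annovar.openbioinformatics.org/en/latest/
ANNOVAR is a tool for functionally annotating genetic variants. It supports many organisms (human genome builds hg18, hg19 and hg38, as well as mouse, worm, fly, yeast and others). Given a list of variants with chromosome, start position, end position, reference nucleotide and observed nucleotide, ANNOVAR can perform the following kinds of annotation:
- Gene-based annotation: determines whether a SNP or CNV changes the protein-coding sequence. The gene definition system can be chosen flexibly from RefSeq genes, UCSC genes, ENSEMBL genes, GENCODE genes, AceView genes and other sources.
- Region-based annotation: annotates variants that fall in specific genomic regions, for example regions conserved among 44 species, predicted transcription factor binding sites, GWAS hits, ENCODE H3K4Me1/H3K4Me3/H3K27Ac/CTCF sites, ChIP-Seq peaks, RNA-Seq peaks and so on.
- Filter-based annotation: looks variants up in a specific database, for example to report the rs ID, the MAF in the 1000 Genomes Project, ExAC or gnomAD, or scores such as SIFT, PolyPhen, LRT, MutationTaster, MutationAssessor, FATHMM, MetaSVM and MetaLR.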
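Hands-on
ANNOVAR is written in Perl. To get it, fill in the registration form on the download page (you can also leave the author a message; apparently he counts on these messages for grant applications, hoping that grateful users will move the reviewers~). It is free for academic use, and a download link arrives by email within a few minutes.
Download and unpack the archive, then add the ANNOVAR directory to your PATH.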

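We also need to download the reference databases used for annotation.
The databases currently available are listed here: https://annovar.openbioinformatics.org/en/latest/user-guide/download/
Note: -downdb -webfrom annovar only retrieves the pre-built databases listed on that page. Data from other sources must first be in a format that ANNOVAR can read before it can be used for annotation.
For beginners, the easiest way to use ANNOVAR is the table_annovar.pl program. The following example shows how it works: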
# 1. Download the annotation databases
annotate_variation.pl -buildver hg19 -downdb -webfrom annovar refGene humandb/
annotate_variation.pl -buildver hg19 -downdb cytoBand humandb/
annotate_variation.pl -buildver hg19 -downdb -webfrom annovar exac03 humandb/
annotate_variation.pl -buildver hg19 -downdb -webfrom annovar avsnp147 humandb/
annotate_variation.pl -buildver hg19 -downdb -webfrom annovar dbnsfp30a humandb/
# 2. Annotate the input file example/ex1.avinput with table_annovar.pl
table_annovar.pl example/ex1.avinput humandb/ -buildver hg19 -out myanno -remove -protocol refGene,cytoBand,exac03,avsnp147,dbnsfp30a -operation gx,r,f,f,f -nastring . -csvout -polish -xref example/gene_xref.txt
As shown above, the first block of commands downloads several hg19 databases into the humandb/ folder (any folder name can be used). Each database is described in detail at: https://annovar.openbioinformatics.org/en/latest/user-guide/filter/

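The second command runs table_annovar.pl, the main program, to annotate the input file example/ex1.avinput against several databases at once:
- -buildver hg19 : use hg19 as the reference genome build
- -out myanno : write output with the prefix myanno
- -remove : delete temporary files once annotation finishes
- -protocol refGene,cytoBand,exac03,avsnp147,dbnsfp30a : the databases to use, here RefSeq genes (refGene), cytogenetic bands (cytoBand), ExAC version 0.3 (exac03), dbSNP version 147 with left-normalization (avsnp147) and dbNSFP version 3.0a (dbnsfp30a)
- -operation gx,r,f,f,f : the operation to run for each protocol, matched one-to-one with -protocol; the options are g (gene-based), gx (gene-based with cross-reference annotation from the -xref argument), r (region-based) and f (filter-based)
- -nastring . : fill missing annotations with "."
- -csvout : write the output in CSV format
- -xref example/gene_xref.txt : a cross-reference file, for example listing the diseases known to be caused by variants in a given gene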
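Because -protocol and -operation have to line up one-to-one, it can help to assemble the command in a script and check the two lists before running anything. The following is a minimal Python sketch of that idea. The function name build_table_annovar_cmd and the VALID_OPERATIONS set are illustrative and not part of ANNOVAR, and only the four operation codes described above are accepted.

```python
from typing import List, Optional, Sequence

# Operation codes described in this post (assumption: no other codes are needed).
VALID_OPERATIONS = {"g", "gx", "r", "f"}


def build_table_annovar_cmd(
    avinput: str,
    dbdir: str,
    buildver: str,
    out_prefix: str,
    protocols: Sequence[str],
    operations: Sequence[str],
    xref: Optional[str] = None,
) -> List[str]:
    """Return an argv list for table_annovar.pl after basic sanity checks."""
    # Each protocol needs exactly one operation, in the same order.
    if len(protocols) != len(operations):
        raise ValueError(
            f"{len(protocols)} protocols but {len(operations)} operations"
        )
    unknown = [op for op in operations if op not in VALID_OPERATIONS]
    if unknown:
        raise ValueError(f"unknown operation codes: {unknown}")

    cmd = [
        "table_annovar.pl", avinput, dbdir,
        "-buildver", buildver,
        "-out", out_prefix,
        "-remove",
        "-protocol", ",".join(protocols),
        "-operation", ",".join(operations),
        "-nastring", ".",
        "-csvout",
        "-polish",
    ]
    if xref is not None:
        cmd += ["-xref", xref]
    return cmd
```

The returned list can be handed to subprocess.run(cmd, check=True) once table_annovar.pl is on your PATH; building a list rather than a single string avoids shell quoting problems.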
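The standard ANNOVAR input format, .avinput, looks like this:
The first five columns are required: chromosome, start position, end position, reference allele and alternative allele. Any further information can be added freely after the fifth column.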
1 948921 948921 T C comments: rs15842, a SNP in 5' UTR of ISG15
1 1404001 1404001 G T comments: rs149123833, a SNP in 3' UTR of ATAD3C
1 5935162 5935162 A T comments: rs1287637, a splice site variant in NPHP4
1 162736463 162736463 C T comments: rs1000050, a SNP in Illumina SNP arrays
1 84875173 84875173 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
1 13211293 13211294 TC - comments: rs59770105, a 2-bp deletion
1 11403596 11403596 - AT comments: rs35561142, a 2-bp insertion
1 105492231 105492231 A ATAAA comments: rs10552169, a block substitution
1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
16 50745926 50745926 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
16 50763778 50763778 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
13 20763686 20763686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
13 20797176 21105944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss
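To make the column layout concrete, here is a small Python sketch that reads one .avinput line and classifies the variant from its allele columns. It is not part of ANNOVAR: the names Variant, parse_avinput_line and variant_kind are made up for illustration, and the classification rules are a simplification inferred from the example lines above ("-" as the reference allele marks an insertion, "-" as the alternative allele marks a deletion).

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    start: int
    end: int
    ref: str
    alt: str
    extra: str  # everything after the fifth column, kept verbatim


def parse_avinput_line(line: str) -> Variant:
    """Split one whitespace-delimited .avinput line into its fields."""
    fields = line.strip().split(None, 5)  # at most 5 splits -> up to 6 fields
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 columns, got {len(fields)}")
    chrom, start, end, ref, alt = fields[:5]
    extra = fields[5] if len(fields) > 5 else ""
    start_i, end_i = int(start), int(end)
    if start_i > end_i:
        raise ValueError(f"start {start_i} is greater than end {end_i}")
    return Variant(chrom, start_i, end_i, ref, alt, extra)


def variant_kind(v: Variant) -> str:
    """Rough classification based only on the ref/alt columns."""
    if v.ref == "-":
        return "insertion"
    if v.alt == "-":
        return "deletion"
    if len(v.ref) == 1 and len(v.alt) == 1:
        return "SNV"
    return "block substitution"
```

For example, variant_kind(parse_avinput_line("1 13211293 13211294 TC -")) classifies the 2-bp deletion from the sample file as a deletion.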
Besides its own avinput format, ANNOVAR also accepts other common input formats such as VCF (with -vcfinput).
table_annovar.pl example/ex2.vcf humandb/ -buildver hg19 -out myanno -remove -protocol refGene,cytoBand,exac03,avsnp147,dbnsfp30a -operation g,r,f,f,f -nastring . -vcfinput -polish
In addition, annotate_variation.pl, the core ANNOVAR program, offers a quick and simple way to run a single type of annotation:
# Gene-based
annotate_variation.pl -geneanno -dbtype refGene -buildver hg19 example/ex1.avinput humandb/
# Region-based
annotate_variation.pl -regionanno -dbtype cytoBand -buildver hg19 example/ex1.avinput humandb/
# Filter-based
annotate_variation.pl -filter -dbtype exac03 -buildver hg19 example/ex1.avinput humandb/
Taking gene-based annotation as an example (the other two work similarly):
Step 1: download the refGene database
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar refGene humandb/
NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg19_refGene.txt.gz ... OK
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg19_refLink.txt.gz ... OK
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg19_refGeneMrna.fa.gz ... OK
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg19 build version, with files saved at the 'humandb' directory
Step 2: annotate (no database is specified here because annotate_variation.pl defaults to -geneanno -dbtype refGene)
annotate_variation.pl -out ex1 -build hg19 example/ex1.avinput humandb/
NOTICE: The --geneanno operation is set to ON by default
NOTICE: Reading gene annotation from humandb/hg19_refGene.txt ... Done with 48660 transcripts (including 10375 without coding sequence annotation) for 25588 unique genes
NOTICE: Reading FASTA sequences from humandb/hg19_refGeneMrna.fa ... Done with 14 sequences
WARNING: A total of 333 sequences will be ignored due to lack of correct ORF annotation
NOTICE: Finished gene-based annotation on 15 genetic variants in example/ex1.avinput
NOTICE: Output files were written to ex1.variant_function, ex1.exonic_variant_function
Step 3: inspect the results
Besides the log file, three files are generated, with the suffixes .variant_function, .exonic_variant_function and .invalid_input:
File 1: .variant_function
cat ex1.variant_function
UTR5 ISG15(NM_005101:c.-33T>C) 1 948921 948921 T C comments: rs15842, a SNP in 5' UTR of ISG15
UTR3 ATAD3C(NM_001039211:c.*91G>T) 1 1404001 1404001 G T comments: rs149123833, a SNP in 3' UTR of ATAD3C
splicing NPHP4(NM_001291593:exon19:c.1279-2T>A,NM_001291594:exon18:c.1282-2T>A,NM_015102:exon22:c.2818-2T>A) 1 5935162 5935162 A T comments: rs1287637, a splice site variant in NPHP4
intronic DDR2 1 162736463 162736463 C T comments: rs1000050, a SNP in Illumina SNP arrays
intronic DNASE2B 1 84875173 84875173 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
intergenic LOC645354(dist=11566),LOC391003(dist=116902) 1 13211293 13211294 TC - comments: rs59770105, a 2-bp deletion
intergenic UBIAD1(dist=55105),PTCHD2(dist=135699) 1 11403596 11403596 - AT comments: rs35561142, a 2-bp insertion
intergenic LOC100129138(dist=872538),NONE(dist=NONE) 1 105492231 105492231 A ATAAA comments: rs10552169, a block substitution
exonic IL23R 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
exonic ATG16L1 2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
exonic NOD2 16 50745926 50745926 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
exonic NOD2 16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
exonic NOD2 16 50763778 50763778 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
exonic GJB2 13 20763686 20763686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
exonic CRYL1,GJB6 13 20797176 21105944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss
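The first column gives the functional category of each variant, that is, whether it falls in an exon, an intron and so on. The second column names the gene or genes involved, with the neighboring genes and their distances for intergenic variants. Possible values of the first column include exonic, splicing, ncRNA, UTR5, UTR3, intronic, upstream, downstream and intergenic.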
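A quick way to get an overview of a .variant_function file is to tally the first column. The sketch below is a hypothetical helper, not an ANNOVAR feature, and it assumes the file is tab-delimited with the functional category in the first field. Combined labels are counted exactly as they appear.

```python
from collections import Counter
from typing import Iterable


def count_functional_categories(lines: Iterable[str]) -> Counter:
    """Tally the first tab-delimited column of .variant_function lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        category = line.split("\t", 1)[0]
        counts[category] += 1
    return counts


def summarize_file(path: str) -> Counter:
    """Convenience wrapper that reads the file at `path`."""
    with open(path, encoding="utf-8") as handle:
        return count_functional_categories(handle)
```

Calling summarize_file("ex1.variant_function") on the output above would return a Counter keyed by labels such as exonic, intronic and intergenic.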

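The second file, .exonic_variant_function, lists the specific amino acid change caused by each exonic variant. Its first column gives the line number of the variant in the input file, as shown below: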
cat ex1.exonic_variant_function
line9 nonsynonymous SNV IL23R:NM_144701:exon9:c.G1142A:p.R381Q, 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
line10 nonsynonymous SNV ATG16L1:NM_001190267:exon9:c.A550G:p.T184A,ATG16L1:NM_017974:exon8:c.A841G:p.T281A,ATG16L1:NM_001190266:exon9:c.A646G:p.T216A,ATG16L1:NM_030803:exon9:c.A898G:p.T300A,ATG16L1:NM_198890:exon5:c.A409G:p.T137A, 2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
line11 nonsynonymous SNV NOD2:NM_022162:exon4:c.C2104T:p.R702W,NOD2:NM_001293557:exon3:c.C2023T:p.R675W, 16 50745926 50745926 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
line12 nonsynonymous SNV NOD2:NM_022162:exon8:c.G2722C:p.G908R,NOD2:NM_001293557:exon7:c.G2641C:p.G881R, 16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
line13 frameshift insertion NOD2:NM_022162:exon11:c.3017dupC:p.A1006fs,NOD2:NM_001293557:exon10:c.2936dupC:p.A979fs, 16 50763778 50763778 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
line14 frameshift deletion GJB2:NM_004004:exon2:c.35delG:p.G12fs, 13 20763686 20763686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
line15 frameshift deletion GJB6:NM_001110221:wholegene,GJB6:NM_001110220:wholegene,GJB6:NM_001110219:wholegene,CRYL1:NM_015974:wholegene,GJB6:NM_006783:wholegene, 13 20797176 21105944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss
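The second column gives the exonic consequence. Possible values include frameshift insertion, frameshift deletion, stopgain, stoploss, nonframeshift insertion, nonframeshift deletion, nonsynonymous SNV, synonymous SNV and unknown. If we want to run a gene-based test to find rare variants, we can use these categories to select the variants to include and build the corresponding group file.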
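The selection step described above can be sketched in Python as follows. This is only an illustration under several assumptions: the .exonic_variant_function file is tab-delimited with the columns shown above, the set of consequences to keep (DEFAULT_KEEP) is an arbitrary example, and the group file layout (one gene per line followed by variant IDs written as chrom:start_ref/alt) is hypothetical, so adapt it to whatever your gene-based test software expects.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List

# Example selection only (assumption); choose the categories your study needs.
DEFAULT_KEEP: FrozenSet[str] = frozenset({
    "nonsynonymous SNV", "stopgain", "stoploss",
    "frameshift insertion", "frameshift deletion",
})


def build_groups(
    lines: Iterable[str], keep: FrozenSet[str] = DEFAULT_KEEP
) -> Dict[str, List[str]]:
    """Map gene name -> list of variant IDs for the selected consequences."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete record
        consequence, transcripts = fields[1], fields[2]
        chrom, start, _end, ref, alt = fields[3:8]
        if consequence not in keep:
            continue
        # Transcript entries look like GENE:NM_xxx:exonN:...; keep unique genes.
        genes: List[str] = []
        for entry in transcripts.split(","):
            gene = entry.split(":", 1)[0]
            if gene and gene not in genes:
                genes.append(gene)
        variant_id = f"{chrom}:{start}_{ref}/{alt}"
        for gene in genes:
            groups[gene].append(variant_id)
    return dict(groups)


def format_group_file(groups: Dict[str, List[str]]) -> str:
    """Render one tab-delimited line per gene (hypothetical layout)."""
    return "\n".join(
        "\t".join([gene] + ids) for gene, ids in sorted(groups.items())
    )
```

Passing the lines of ex1.exonic_variant_function through build_groups and then format_group_file would give one line per gene, which can be written to disk as a starting point for a group file.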

The third file, .invalid_input, contains the variants from the input file that could not be annotated.
References:
https://www.nature.com/articles/nprot.2015.105
https://annovar.openbioinformatics.org/en/latest/user-guide/gene/