$ head plink_results.hwe
CHR SNP TEST A1 A2 GENO O(HET) E(HET) P
1 1:13273:G:C ALL(NP) C G 1/61/442 0.121 0.1172 0.7113
1 1:14599:T:A ALL(NP) A T 1/88/415 0.1746 0.1626 0.1625
1 1:14604:A:G ALL(NP) G A 1/88/415 0.1746 0.1626 0.1625
1 1:14930:A:G ALL(NP) G A 4/409/91 0.8115 0.4851 1.679e-61
1 1:69897:T:C ALL(NP) T C 7/111/386 0.2202 0.2173 1
1 1:86331:A:G ALL(NP) G A 0/88/416 0.1746 0.1594 0.02387
1 1:91581:G:A ALL(NP) A G 137/228/139 0.4524 0.5 0.03271
1 1:122872:T:G ALL(NP) G T 1/259/244 0.5139 0.3838 8.04e-19
1 1:135163:C:T ALL(NP) T C 1/91/412 0.1806 0.1675 0.1066
Wigginton, J. E., Cutler, D. J., & Abecasis, G. R. (2005). A note on exact tests of Hardy-Weinberg equilibrium. The American Journal of Human Genetics, 76(5), 887-893. Link
#CHROM POS ID REF ALT QUAL FILTER INFO
1 10108 rs62651026 CAACCCT C 46514.3 RF AF_eas=0;AF_nfe=0;AF_fin=0.0227273
1 10109 rs376007522 AACCCT A 89837.3 RF AF_eas=0;AF_nfe=0.0512821;AF_fin=0.0588235
使用方法
输入选项
需要两个文件:一是gwas的sumstats,二是参考的vcf文件
Input files:
--sumstats <file> GWAS summary statistics file
--vcf <file> Reference VCF file. Use # as chromosome wildcard.
其中gwas sumstats还需要指定对应的列名:
Input column names:
--chrom_col <str> Chromosome column 染色体
--pos_col <str> Position column 碱基位置
--effAl_col <str> Effect allele column 效应等位
--otherAl_col <str> Other allele column 非效应等位
--beta_col <str> beta column
--or_col <str> Odds ratio column
--or_col_lower <str> Odds ratio lower CI column
--or_col_upper <str> Odds ratio upper CI column
--eaf_col <str> Effect allele frequency column
--rsid_col <str> rsID column in the summary stat file
Output files:
--hm_sumstats <file> Harmonised sumstat output file (use 'gz' extension to
gzip)
--hm_statfile <file> Statistics from harmonisation process output file.
Should only be used in conjunction with --hm_sumstats.
--strand_counts <file>
Output file showing number of variants that are
forward/reverse/palindromic
其他选项
Other args:
--only_chrom <str> Only process this chromosome
--in_sep <str> Input file column separator (default: tab)
--out_sep <str> Output file column separator (default: tab)
--na_rep_in <str> How NA are represented in the input file (default: "")
--na_rep_out <str> How to represent NA values in output (default: "")
--chrom_map <str> [<str> ...]
Map summary stat chromosome names, e.g. `--chrom_map
23=X 24=Y`
Harmonisation mode 模式选择:
可以选择以下四种模式:[infer|forward|reverse|drop]
对于 palindromic variants 的处理方式为:
(a) infer strand from effect-allele freq, 根据AF推测
(b) assume forward strand, 假设全部都为正链
(c) assume reverse strand, 假设全部为负链
(d) drop palindromic variants 舍弃所有palindromic variants
AF字段名与阈值
如果选择通过AF推测正负链(–palin_mode infer),那么还需要制定vcf文件中af的字段名与推测的阈值 –af_vcf_field <str> VCF info field containing alt allele freq (default:AF_NFE)
指定VCF文件中INFO里包含AF信息的字段
–infer_maf_threshold <float>Max MAF that will be used to infer palindrome strand (default: 0.42)
推测palindrome strand的最大MAF的阈值
使用例
如果输入的sumstats为以下格式:
CHROMOSOME POSITION MARKERNAME NEA EA EAF INFO N BETA SE P OR OR_95L OR_95U
UMAP(uniform manifold approximation and projection)是近年来新出现的一种相对灵活的非线性降维算法,目前在统计遗传学等领域也有了较为广泛的应用。
UMAP的理论基础基于流形理论(manifold theory)与拓扑分析,
主要基于以下假设:
存在一个数据均一分布的流形。
这个目标流形是局部相连的。
该算法的主要目标是保存此流形的拓扑结构。
总体来看,UMAP利用了局部流形近似,并拼接模糊单纯集合表示(local fuzzy simplicial set representation),以构建高维数据的拓扑表示。在给定低维数据时,UMAP会采取相似的手法构建一个等价的拓扑表示。最后UMAP会对低维空间中数据表示的布局进行最优化,以最小化高维和低维两种拓扑表示之间的交叉熵(cross-entropy)。
Sakaue, S., Hirata, J., Kanai, M. et al. Dimensionality reduction reveals fine-scale structure in the Japanese population with consequences for polygenic risk prediction. Nat Commun 11, 1569 (2020). https://doi.org/10.1038/s41467-020-15194-z
可以参考:Sakaue, S., Hirata, J., Kanai, M. et al. Dimensionality reduction reveals fine-scale structure in the Japanese population with consequences for polygenic risk prediction. Nat Commun 11, 1569
该文章对日本人群结构做了分析,并对比了各种降维方法的结果,强烈推荐。
参考
McInnes, L, Healy, J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv e-prints 1802.03426, 2018
Sakaue, S., Hirata, J., Kanai, M. et al. Dimensionality reduction reveals fine-scale structure in the Japanese population with consequences for polygenic risk prediction. Nat Commun 11, 1569
Diaz-Papkovich, A., Anderson-Trocmé, L. & Gravel, S. A review of UMAP in population genetics. J. Hum. Genet. 66, 85–91 (2021).
[kaiwang@biocluster ~/]$ cat example/snplist.txt
rs74487784
rs41534544
rs4308095
rs12345678
[kaiwang@biocluster ~/]$ convert2annovar.pl -format rsid example/snplist.txt -dbsnpfile humandb/hg19_snp138.txt > snplist.avinput
NOTICE: Scanning dbSNP file humandb/hg19_snp138.txt...
NOTICE: input file contains 4 rs identifiers, output file contains information for 4 rs identifiers
WARNING: 1 rs identifiers have multiple records (due to multiple mapping) and they are all written to output
[kaiwang@biocluster ~/]$ cat snplist.avinput
chr2 186229004 186229004 C T rs4308095
chr7 6026775 6026775 T C rs41534544
chr7 6777183 6777183 G A rs41534544
chr9 3901666 3901666 T C rs12345678
chr22 24325095 24325095 A G rs74487784
chr2 186229004 186229004 C T
chr7 6026775 6026775 T C
chr7 6777183 6777183 G A
chr9 3901666 3901666 T C
chr22 24325095 24325095 A G
输出文件如下,ex1.avinput.hg19_avsnp150_dropped,
(注意某些SNP的rsID有合并等原因,版本不同rsID注释结果不一定相同)
avsnp150 rs4308095 chr2 186229004 186229004 C T
avsnp150 rs140195 chr22 24325095 24325095 A G
avsnp150 rs2228006 chr7 6026775 6026775 T C
avsnp150 rs766725467 chr7 6777183 6777183 G A
avsnp150 rs12345678 chr9 3901666 3901666 T C
PLINK v1.90b6.21 64-bit (19 Oct 2020) www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to HapMap_3_r3_1.log.
Options in effect:
--bfile HapMap_3_r3_1
--check-sex
--extract HapMap_3_r3_1.prune.in
--out HapMap_3_r3_1
191875 MB RAM detected; reserving 95937 MB for main workspace.
Allocated 3037 MB successfully, after larger attempt(s) failed.
1457897 variants loaded from .bim file.
165 people (80 males, 85 females) loaded from .fam.
112 phenotype values loaded from .fam.
--extract: 166240 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 112 founders and 53 nonfounders present.
Calculating allele frequencies... done.
Warning: 35 het. haploid genotypes present (see HapMap_3_r3_1.hh ); many
commands treat these as missing.
Total genotyping rate is 0.996691.
166240 variants and 165 people pass filters and QC.
Among remaining phenotypes, 56 are cases and 56 are controls. (53 phenotypes
are missing.)
--check-sex: 3993 Xchr and 0 Ychr variant(s) scanned, 1 problem detected.
Report written to HapMap_3_r3_1.sexcheck .
–check-sex: 3993 Xchr and 0 Ychr variant(s) scanned, 1 problem detected.从日志文件中可以看出我们检测出了1个不一致的样本。上述代码运行结果如下:
head HapMap_3_r3_1.sexcheck
FID IID PEDSEX SNPSEX STATUS F
1328 NA06989 2 2 OK 0.02151
1377 NA11891 1 1 OK 1
1349 NA11843 1 1 OK 1
1330 NA12341 2 2 OK 0.009813
1444 NA12739 1 1 OK 1
1344 NA10850 2 2 OK 0.0533
1328 NA06984 1 1 OK 0.9989
1463 NA12877 1 1 OK 1
1418 NA12275 2 2 OK -0.0747
1.ped,pedigree file。ped文件一般包含6+2N列,第一至六列分别为 1. Family ID 2. Individual ID 3.Mother ID 4.Father ID 5.Sex 6.Phenotype。第六列以后为各个SNP的等位基因,两列一组,可以使用具体的碱基,也可以使用拷贝数(0,1)。
2. map,与ped文件相伴随的文件,主要包含ped文件中SNP的位置信息。一般包含4列。分别是1. 染色体号 2.SNP ID 3.遗传图距(单位为摩根或厘摩,通常分析不需要这一列,使用哑值(dummy value) 0 填充) 4.碱基对坐标。每行一个SNP,顺序与ped文件中的SNP相对应。