$ head plink_results.hwe
CHR SNP TEST A1 A2 GENO O(HET) E(HET) P
1 1:13273:G:C ALL(NP) C G 1/61/442 0.121 0.1172 0.7113
1 1:14599:T:A ALL(NP) A T 1/88/415 0.1746 0.1626 0.1625
1 1:14604:A:G ALL(NP) G A 1/88/415 0.1746 0.1626 0.1625
1 1:14930:A:G ALL(NP) G A 4/409/91 0.8115 0.4851 1.679e-61
1 1:69897:T:C ALL(NP) T C 7/111/386 0.2202 0.2173 1
1 1:86331:A:G ALL(NP) G A 0/88/416 0.1746 0.1594 0.02387
1 1:91581:G:A ALL(NP) A G 137/228/139 0.4524 0.5 0.03271
1 1:122872:T:G ALL(NP) G T 1/259/244 0.5139 0.3838 8.04e-19
1 1:135163:C:T ALL(NP) T C 1/91/412 0.1806 0.1675 0.1066
Wigginton, J. E., Cutler, D. J., & Abecasis, G. R. (2005). A note on exact tests of Hardy-Weinberg equilibrium. The American Journal of Human Genetics, 76(5), 887-893. Link
UMAP(uniform manifold approximation and projection)是近年来新出现的一种相对灵活的非线性降维算法,目前在统计遗传学等领域也有了较为广泛的应用。
UMAP的理论基础基于流形理论(manifold theory)与拓扑分析,
主要基于以下假设:
存在一个数据均一分布的流形。
这个目标流形是局部相连的。
该算法的主要目标是保存此流形的拓扑结构。
总体来看,UMAP利用了局部流形近似,并拼接模糊单纯集合表示(local fuzzy simplicial set representation),以构建高维数据的拓扑表示。在给定低维数据时,UMAP会采取相似的手法构建一个等价的拓扑表示。最后UMAP会对低维空间中数据表示的布局进行最优化,以最小化高维和低维两种拓扑表示之间的交叉熵(cross-entropy)。
Sakaue, S., Hirata, J., Kanai, M. et al. Dimensionality reduction reveals fine-scale structure in the Japanese population with consequences for polygenic risk prediction. Nat Commun 11, 1569 (2020). https://doi.org/10.1038/s41467-020-15194-z
可以参考:Sakaue, S., Hirata, J., Kanai, M. et al. Dimensionality reduction reveals fine-scale structure in the Japanese population with consequences for polygenic risk prediction. Nat Commun 11, 1569
该文章对日本人群结构做了分析,并对比了各种降维方法的结果,强烈推荐。
参考
McInnes, L, Healy, J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv e-prints 1802.03426, 2018
Sakaue, S., Hirata, J., Kanai, M. et al. Dimensionality reduction reveals fine-scale structure in the Japanese population with consequences for polygenic risk prediction. Nat Commun 11, 1569
Diaz-Papkovich, A., Anderson-Trocmé, L. & Gravel, S. A review of UMAP in population genetics. J. Hum. Genet. 66, 85–91 (2021).
[kaiwang@biocluster ~/]$ cat example/snplist.txt
rs74487784
rs41534544
rs4308095
rs12345678
[kaiwang@biocluster ~/]$ convert2annovar.pl -format rsid example/snplist.txt -dbsnpfile humandb/hg19_snp138.txt > snplist.avinput
NOTICE: Scanning dbSNP file humandb/hg19_snp138.txt...
NOTICE: input file contains 4 rs identifiers, output file contains information for 4 rs identifiers
WARNING: 1 rs identifiers have multiple records (due to multiple mapping) and they are all written to output
[kaiwang@biocluster ~/]$ cat snplist.avinput
chr2 186229004 186229004 C T rs4308095
chr7 6026775 6026775 T C rs41534544
chr7 6777183 6777183 G A rs41534544
chr9 3901666 3901666 T C rs12345678
chr22 24325095 24325095 A G rs74487784
chr2 186229004 186229004 C T
chr7 6026775 6026775 T C
chr7 6777183 6777183 G A
chr9 3901666 3901666 T C
chr22 24325095 24325095 A G
输出文件如下,ex1.avinput.hg19_avsnp150_dropped,
(注意某些SNP的rsID有合并等原因,版本不同rsID注释结果不一定相同)
avsnp150 rs4308095 chr2 186229004 186229004 C T
avsnp150 rs140195 chr22 24325095 24325095 A G
avsnp150 rs2228006 chr7 6026775 6026775 T C
avsnp150 rs766725467 chr7 6777183 6777183 G A
avsnp150 rs12345678 chr9 3901666 3901666 T C
PLINK v1.90b6.21 64-bit (19 Oct 2020) www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to HapMap_3_r3_1.log.
Options in effect:
--bfile HapMap_3_r3_1
--check-sex
--extract HapMap_3_r3_1.prune.in
--out HapMap_3_r3_1
191875 MB RAM detected; reserving 95937 MB for main workspace.
Allocated 3037 MB successfully, after larger attempt(s) failed.
1457897 variants loaded from .bim file.
165 people (80 males, 85 females) loaded from .fam.
112 phenotype values loaded from .fam.
--extract: 166240 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 112 founders and 53 nonfounders present.
Calculating allele frequencies... done.
Warning: 35 het. haploid genotypes present (see HapMap_3_r3_1.hh ); many
commands treat these as missing.
Total genotyping rate is 0.996691.
166240 variants and 165 people pass filters and QC.
Among remaining phenotypes, 56 are cases and 56 are controls. (53 phenotypes
are missing.)
--check-sex: 3993 Xchr and 0 Ychr variant(s) scanned, 1 problem detected.
Report written to HapMap_3_r3_1.sexcheck .
–check-sex: 3993 Xchr and 0 Ychr variant(s) scanned, 1 problem detected.从日志文件中可以看出我们检测出了1个不一致的样本。上述代码运行结果如下:
head HapMap_3_r3_1.sexcheck
FID IID PEDSEX SNPSEX STATUS F
1328 NA06989 2 2 OK 0.02151
1377 NA11891 1 1 OK 1
1349 NA11843 1 1 OK 1
1330 NA12341 2 2 OK 0.009813
1444 NA12739 1 1 OK 1
1344 NA10850 2 2 OK 0.0533
1328 NA06984 1 1 OK 0.9989
1463 NA12877 1 1 OK 1
1418 NA12275 2 2 OK -0.0747
downloaded data (VCF.gz and tab-indices), ~ 15.5 GB
converted BCF files and their indices, ~14 GB
binary PLINK files, ~53 GB
pruned PLINK binary files, ~ <1 Gb
2.2 下载所用数据 (本教程数据主要来自1000 Genomes Phase III imputed genotypes)
2.3 下载对应每条染色体的vcf文件
prefix="ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr" ;
suffix=".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz" ;
for chr in {1..22}; do
wget "${prefix}""${chr}""${suffix}" "${prefix}""${chr}""${suffix}".tbi ;
done
2.4 下载 1000 Genomes Phase III 的 ped文件,包含了个体性别,群体等基本信息
#Populations
This file describes the population codes where assigned to samples collected for the 1000 Genomes project. These codes are used to organise the files in the data_collections' project data directories and can also be found in column 11 of many sequence index files.
There are also two tsv files, which contain the population codes and descriptions for both the sub and super populations that were used in phase 3 of the 1000 Genomes Project:
<ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/20131219.populations.tsv>
<ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/20131219.superpopulations.tsv>
###Populations and codes
CHB Han Chinese Han Chinese in Beijing, China
JPT Japanese Japanese in Tokyo, Japan
CHS Southern Han Chinese Han Chinese South
CDX Dai Chinese Chinese Dai in Xishuangbanna, China
KHV Kinh Vietnamese Kinh in Ho Chi Minh City, Vietnam
CHD Denver Chinese Chinese in Denver, Colorado (pilot 3 only)
CEU CEPH Utah residents (CEPH) with Northern and Western European ancestry
TSI Tuscan Toscani in Italia
GBR British British in England and Scotland
FIN Finnish Finnish in Finland
IBS Spanish Iberian populations in Spain
YRI Yoruba Yoruba in Ibadan, Nigeria
LWK Luhya Luhya in Webuye, Kenya
GWD Gambian Gambian in Western Division, The Gambia
MSL Mende Mende in Sierra Leone
ESN Esan Esan in Nigeria
ASW African-American SW African Ancestry in Southwest US
ACB African-Caribbean African Caribbean in Barbados
MXL Mexican-American Mexican Ancestry in Los Angeles, California
PUR Puerto Rican Puerto Rican in Puerto Rico
CLM Colombian Colombian in Medellin, Colombia
PEL Peruvian Peruvian in Lima, Peru
GIH Gujarati Gujarati Indian in Houston, TX
PJL Punjabi Punjabi in Lahore, Pakistan
BEB Bengali Bengali in Bangladesh
STU Sri Lankan Sri Lankan Tamil in the UK
ITU Indian Indian Telugu in the UK
Should you have any queries, please contact info@1000genomes.org.
下面演示用PYTHON进行绘图:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Marees, A.T., de Kluiver, H., Stringer, S., Vorspan, F., Curis, E., Marie-Claire, C., and Derks, E.M. (2018). A tutorial on conducting genome-wide association studies: Quality control and statistical analysis. International Journal of Methods in Psychiatric Research 27.