Dimensionality reduction of genomic data with UMAP, compared with PCA

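In this post:
Introduction to UMAP
Applications in population genetics (and how UMAP differs from PCA)
How to use UMAP
Reducing example 1000 Genomes data with UMAP
References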

Keywords: UMAP, PCA, t-SNE, PCA-UMAP, genomic dimensionality reduction


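Introduction to UMAP

UMAP (uniform manifold approximation and projection) is a relatively flexible nonlinear dimensionality reduction algorithm that has emerged in recent years and is now fairly widely used in fields such as statistical genetics.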

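UMAP's theoretical foundations lie in manifold theory and topological analysis.

It rests on the following assumptions:

  1. There exists a manifold on which the data are uniformly distributed.
  2. The manifold of interest is locally connected.
  3. The primary goal of the algorithm is to preserve the topological structure of this manifold.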

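At a high level, UMAP uses local manifold approximations and patches together their local fuzzy simplicial set representations to build a topological representation of the high-dimensional data. Given a low-dimensional representation of the data, UMAP builds an equivalent topological representation in the same way. Finally, UMAP optimizes the layout of the data in the low-dimensional space to minimize the cross-entropy between the high-dimensional and low-dimensional topological representations.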
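To make the final optimization step more concrete, here is a minimal sketch of the fuzzy-set cross-entropy between two sets of edge membership strengths. The function name and the toy weights are illustrative assumptions and are not part of the umap-learn API; the library minimizes this kind of objective with stochastic gradient descent rather than evaluating it directly.

```python
import math


def fuzzy_cross_entropy(high, low, eps=1e-12):
    """Cross-entropy between two fuzzy sets of edge memberships.

    high[i] and low[i] are the membership strengths (between 0 and 1) of the
    same edge in the high- and low-dimensional graphs (toy inputs here).
    """
    total = 0.0
    for v, w in zip(high, low):
        # Clamp away from 0 and 1 to avoid log(0)
        v = min(max(v, eps), 1 - eps)
        w = min(max(w, eps), 1 - eps)
        # Attractive term: edges present in high-dim should also be present in low-dim
        total += v * math.log(v / w)
        # Repulsive term: edges absent in high-dim should also be absent in low-dim
        total += (1 - v) * math.log((1 - v) / (1 - w))
    return total


high = [0.9, 0.8, 0.1]            # hypothetical high-dimensional edge weights
good_layout = [0.85, 0.75, 0.15]  # a layout that keeps neighbors together
bad_layout = [0.1, 0.2, 0.9]      # a layout that scrambles the neighborhoods

print(fuzzy_cross_entropy(high, good_layout))
print(fuzzy_cross_entropy(high, bad_layout))
```

A layout whose edge weights match the high-dimensional ones gives a value near zero, and the value grows as the low-dimensional neighborhoods diverge from the original ones.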

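[Figure: overview of the UMAP algorithm]

UMAP compared with other dimensionality reduction methods:

Fast and accurate: as the dimensionality of the problem grows, the run time of t-SNE increases roughly exponentially, while that of UMAP increases roughly linearly.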


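Applications in population genetics:

When working with large-sample genomic data, we usually reduce the dimensionality of the data in order to visualize it and to identify related samples.

The most commonly used method is principal component analysis (PCA): GWASLab: Population stratification and PCA tutorial (Population structure & PCA)

How UMAP differs from PCA:

PCA projects the data onto the directions of maximum variance and largely ignores the variance in other directions. As a result, PCA tends to reveal large-scale population structure while missing fine-scale structure. UMAP takes a different approach: as described in the first section, its main goal is to preserve the topological relationships between each sample and its neighbors, that is, the finer local population structure rather than the global structure.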
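The point about PCA can be illustrated with a toy calculation. The sketch below uses made-up 2D points (two distant groups, each with two slightly separated subgroups) and the closed-form eigenvalues of a 2x2 covariance matrix; the function name and data are assumptions for illustration only.

```python
import math


def pca_2d_variances(points):
    """Return (variance along PC1, variance along PC2) for 2D points,
    using the closed-form eigenvalues of the 2x2 covariance matrix."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    mean = (sxx + syy) / 2
    spread = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    return mean + spread, mean - spread


# Two groups far apart along x; within each group, two subgroups differ slightly in y
points = [(-10, -0.5), (-10, 0.5), (10, -0.5), (10, 0.5)]
var1, var2 = pca_2d_variances(points)
print(var1 / (var1 + var2))  # share of total variance captured by PC1
```

Nearly all of the variance lies along the axis separating the two large groups, so a plot of the leading component alone would hide the subgroups. This is the kind of fine-scale structure that a neighbor-based method such as UMAP is designed to keep.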

UMAP is now widely used in population genetics research.


How to use UMAP:

UMAP is available as a Python package (umap-learn).

Install it with pip or conda:

pip install umap-learn
# or
conda install -c conda-forge umap-learn

For the documentation, see: https://umap-learn.readthedocs.io/en/latest/parameters.html

A minimal example:

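# Import the umap package
import umap

# Create a UMAP instance
reducer = umap.UMAP()

# Fit the instance to your data and get the embedded result
# (your_data: a samples x features array with no missing values)
embedding = reducer.fit_transform(your_data)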

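Main parameters of UMAP() (see the algorithm overview above):

  • n_neighbors: number of neighboring samples used to approximate the local structure; default 15
  • min_dist: minimum distance between points in the embedding, which controls how tightly the layout is packed; range 0-1, default 0.1
  • n_components: number of dimensions of the embedding; default 2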
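One way to get a feel for these parameters is to embed the same data under several settings and compare the plots. The sketch below is one possible way to organize that; the helper names and the particular grid values are assumptions, not recommendations from the umap-learn documentation.

```python
from itertools import product


def umap_param_grid(n_neighbors=(5, 15, 30), min_dist=(0.0, 0.1, 0.5)):
    """Enumerate keyword-argument dicts to pass to umap.UMAP()."""
    return [
        {"n_neighbors": k, "min_dist": d, "n_components": 2}
        for k, d in product(n_neighbors, min_dist)
    ]


def embed_with_grid(data, grid):
    """Fit one UMAP embedding per parameter set (requires umap-learn)."""
    import umap  # imported here so the grid helper works without umap-learn

    return {
        (p["n_neighbors"], p["min_dist"]): umap.UMAP(**p).fit_transform(data)
        for p in grid
    }


grid = umap_param_grid()
print(len(grid))  # number of parameter combinations
# embeddings = embed_with_grid(your_data, grid)  # your_data: complete numeric matrix
```

Smaller n_neighbors values emphasize local structure and larger values emphasize broader structure, while min_dist mainly changes how tightly each cluster is drawn.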

A practical demonstration with genotype data:

The tutorial and data are adapted from:

Sakaue, S., Hirata, J., Kanai, M. et al. Dimensionality reduction reveals fine-scale structure in the Japanese population with consequences for polygenic risk prediction. Nat Commun 11, 1569 (2020). https://doi.org/10.1038/s41467-020-15194-z

Code and data download:

https://github.com/saorisakaue/Genotype-dimensionality-reduction

The data are in data_umap: 1KG.selected.bed/bim/fam

Software used:

PLINK, Python (umap, pandas, matplotlib)

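Notes:

  1. UMAP only accepts complete data. Genotype data with missing values must first be imputed at the missing SNPs to obtain a complete matrix. For imputation methods, see: GWASLab: Eagle2 haplotype phasing tool (Haplotype phasing)
  2. PLINK-format data generally needs to be LD-pruned and then converted to a genotype matrix.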
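Before deciding how to handle missing data, it helps to know how much is missing. The following is a minimal sketch that counts missing calls per SNP in a 0/1/2 genotype matrix in which missing genotypes are written as NA, as PLINK does with --recode A. The function name and toy matrix are illustrative assumptions.

```python
def missing_rate_per_snp(rows, missing="NA"):
    """Fraction of missing calls in each SNP column.

    rows: one list of genotype strings ("0", "1", "2" or "NA") per sample.
    """
    n_samples = len(rows)
    counts = [0] * len(rows[0])
    for row in rows:
        for j, call in enumerate(row):
            if call == missing:
                counts[j] += 1
    return [c / n_samples for c in counts]


# Toy matrix: 4 samples x 3 SNPs
rows = [
    ["0", "1", "2"],
    ["1", "NA", "2"],
    ["0", "1", "NA"],
    ["2", "NA", "0"],
]
rates = missing_rate_per_snp(rows)
print(rates)
```

SNPs with a high missing rate can then be imputed or excluded before the matrix is passed to UMAP.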

Preprocessing the input files:

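# If the data contain missing genotypes, impute them first

# 1) LD pruning (with MAF and HWE filters)
GENOTYPE="./data_umap/1KG.selected"

plink \
 --bfile "${GENOTYPE}" \
 --indep-pairwise 50 5 0.2 \
 --maf 0.01 \
 --hwe 1E-6 \
 --out "${GENOTYPE}"

# 2) Convert the binary PLINK files to a 0/1/2 text file,
#    keeping only the variants retained by pruning
plink \
 --bfile "${GENOTYPE}" \
 --extract "${GENOTYPE}.prune.in" \
 --recode A \
 --out "${GENOTYPE}.pruned"

# 3) Extract the genotype matrix: drop the six leading sample columns and the header line
cut -d " " -f7- "${GENOTYPE}.pruned.raw" | awk 'NR>1' > "${GENOTYPE}.pruned.geno.txt"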
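The extraction step can also be done in Python, which keeps the sample IDs aligned with the matrix rows so they can be used later for labeling plots. This is a sketch that assumes the standard .raw layout (FID IID PAT MAT SEX PHENOTYPE followed by one column per SNP); the function name and example lines are made up for illustration.

```python
def read_raw_genotypes(lines):
    """Parse PLINK .raw text into (sample_ids, snp_names, matrix).

    The first six columns are sample metadata; genotype counts start at
    column 7. Missing calls ("NA") are returned as None.
    """
    header = lines[0].split()
    snp_names = header[6:]
    sample_ids, matrix = [], []
    for line in lines[1:]:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        sample_ids.append(fields[1])  # IID column
        matrix.append([None if g == "NA" else int(g) for g in fields[6:]])
    return sample_ids, snp_names, matrix


# Toy input standing in for the contents of a .raw file
example = [
    "FID IID PAT MAT SEX PHENOTYPE rs1_A rs2_G",
    "F1 S1 0 0 1 -9 0 2",
    "F2 S2 0 0 2 -9 1 NA",
]
ids, snps, geno = read_raw_genotypes(example)
print(ids, snps, geno)
```

For a real file, the lines can be obtained with open(path).read().splitlines().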

The log output is as follows:

PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ./data_umap/1KG.selected.log.
Options in effect:
  --bfile ./data_umap/1KG.selected
  --hwe 1E-6
  --indep-pairwise 50 5 0.2
  --maf 0.01
  --out ./data_umap/1KG.selected

191875 MB RAM detected; reserving 95937 MB for main workspace.
Allocated 3037 MB successfully, after larger attempt(s) failed.
120489 variants loaded from .bim file.
50 people (25 males, 25 females) loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 50 founders and 0 nonfounders present.
Calculating allele frequencies... done.
--hwe: 14 variants removed due to Hardy-Weinberg exact test.
861 variants removed due to minor allele threshold(s)
(--maf/--max-maf/--mac/--max-mac).
119614 variants and 50 people pass filters and QC.
Note: No phenotypes present.
Pruned 5127 variants from chromosome 1, leaving 4909.
Pruned 5171 variants from chromosome 2, leaving 4680.
Pruned 4052 variants from chromosome 3, leaving 3919.
Pruned 3398 variants from chromosome 4, leaving 3517.
Pruned 3594 variants from chromosome 5, leaving 3586.
Pruned 4346 variants from chromosome 6, leaving 3628.
Pruned 3298 variants from chromosome 7, leaving 3193.
Pruned 3421 variants from chromosome 8, leaving 3027.
Pruned 2854 variants from chromosome 9, leaving 2895.
Pruned 3445 variants from chromosome 10, leaving 3244.
Pruned 3235 variants from chromosome 11, leaving 3021.
Pruned 2936 variants from chromosome 12, leaving 3020.
Pruned 2281 variants from chromosome 13, leaving 2326.
Pruned 1916 variants from chromosome 14, leaving 2070.
Pruned 1875 variants from chromosome 15, leaving 1943.
Pruned 1848 variants from chromosome 16, leaving 2005.
Pruned 1589 variants from chromosome 17, leaving 1839.
Pruned 1744 variants from chromosome 18, leaving 1925.
Pruned 947 variants from chromosome 19, leaving 1208.
Pruned 1528 variants from chromosome 20, leaving 1620.
Pruned 874 variants from chromosome 21, leaving 887.
Pruned 759 variants from chromosome 22, leaving 914.
Pruning complete.  60238 of 119614 variants removed.
Marker lists written to ./data_umap/1KG.selected.prune.in and
./data_umap/1KG.selected.prune.out .
PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ./data_umap/1KG.selected.pruned.log.
Options in effect:
  --bfile ./data_umap/1KG.selected
  --extract ./data_umap/1KG.selected.prune.in
  --out ./data_umap/1KG.selected.pruned
  --recode A

191875 MB RAM detected; reserving 95937 MB for main workspace.
Allocated 3037 MB successfully, after larger attempt(s) failed.
120489 variants loaded from .bim file.
50 people (25 males, 25 females) loaded from .fam.
--extract: 59376 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 50 founders and 0 nonfounders present.
Calculating allele frequencies... done.
59376 variants and 50 people pass filters and QC.
Note: No phenotypes present.
--recode A to ./data_umap/1KG.selected.pruned.raw ... done.

Next, reduce the dimensionality with UMAP's default parameters:

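import sys

import pandas as pd
import umap

# Prefix of the PLINK output, passed as the first command-line argument
prefix = sys.argv[1]
filename = prefix + ".pruned.geno.txt"

# Read the 0/1/2 genotype matrix (rows: samples, columns: SNPs)
pre_data = pd.read_table(filename, sep=r"\s+", header=None, low_memory=False)
# Drop any SNP column that still contains missing values
data = pre_data.dropna(how='any', axis=1)

# UMAP
outfile = prefix + ".umap.txt"
d_embedding = umap.UMAP().fit_transform(data)
pd.DataFrame(d_embedding).to_csv(outfile, sep='\t', header=False, index=False)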

Inspecting d_embedding shows that the genomic data have been reduced to 2 dimensions:

array([[ 29.74708  ,   3.0977843],
       [ 29.213293 ,   3.1712177],
       [  8.758959 ,  -1.9003359],
       [  8.377356 ,  -1.4676492],
       [ 29.458044 ,   3.3701472],
       [ 29.87717  ,   3.4556358],
       [ 29.3555   ,   3.8484392],
       [ 28.955818 ,   3.5634878],
       [ 29.105188 ,   2.8024642],
       [ 29.197195 ,   3.7704961],
       [ 29.347229 ,   4.2163496],
       [ 29.865662 ,   3.1569576],
       [  8.503107 ,  -1.3564333],
       [-24.176699 , -14.017474 ],
       [-23.141544 , -13.751009 ],
       [  8.204682 ,  -1.8065634],
       [ 29.94206  ,   2.8537552],
       [-23.501389 , -13.829134 ],
       [ 29.065151 ,   4.255709 ],
       [-23.86171  , -13.711441 ],
       [-23.297297 , -13.290715 ],
       [ 30.9667   , -10.477952 ],
       [-22.91213  , -13.484747 ],
       [-23.249172 , -14.22486  ],
       [-22.951435 , -13.851592 ],
       [ 30.514542 , -11.441013 ],
       [ 30.728407 , -11.229405 ],
       [ 31.13736  , -11.319468 ],
       [ 30.189419 , -10.865067 ],
       [ 30.20413  , -11.283942 ],
       [ 30.89431  , -11.590417 ],
       [ 29.568645 ,   2.9131896],
       [ 29.632095 ,   2.6519256],
       [-23.177608 , -14.538039 ],
       [-23.335829 , -13.95216  ],
       [  8.488959 ,  -2.0645304],
       [  8.731548 ,  -1.8205943],
       [  8.5880165,  -1.6314453],
       [-23.663996 , -13.476801 ],
       [-23.71588  , -14.627886 ],
       [-23.864864 , -14.236027 ],
       [-23.965855 , -14.178329 ],
       [-23.203787 , -14.796338 ],
       [-23.587828 , -14.443732 ],
       [-22.973204 , -14.444413 ],
       [ 30.451714 , -10.90268  ],
       [ 31.011879 , -11.050354 ],
       [ 30.863823 , -10.923892 ],
       [ 30.679998 , -11.574213 ],
       [ 30.77399  , -10.824773 ]], dtype=float32)

After plotting, the samples clearly separate into 4 clusters:
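The plotting code is not shown in this post. A minimal sketch with matplotlib (listed among the software above) might look like the following; the function names, axis labels and output file name are assumptions, and the input is the tab-separated, header-less file written by the UMAP script.

```python
import csv


def load_embedding(path):
    """Read a tab-separated, header-less 2D embedding into a list of (x, y)."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        return [(float(row[0]), float(row[1])) for row in reader if row]


def plot_embedding(points, out_png):
    """Save a scatter plot of the embedding (requires matplotlib)."""
    import matplotlib

    matplotlib.use("Agg")  # write to a file without needing a display
    import matplotlib.pyplot as plt

    xs, ys = zip(*points)
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.scatter(xs, ys, s=20)
    ax.set_xlabel("UMAP1")
    ax.set_ylabel("UMAP2")
    fig.savefig(out_png, dpi=150, bbox_inches="tight")
    plt.close(fig)


# Example usage, assuming the prefix ./data_umap/1KG.selected was used:
# points = load_embedding("./data_umap/1KG.selected.umap.txt")
# plot_embedding(points, "umap.png")
```

If population labels are available for the samples, coloring the points by label makes the clusters easier to interpret.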

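You can try changing the parameters of UMAP() yourself to see how the result changes; this is not demonstrated in detail here (the figure below shows the result of UMAP(min_dist=1)).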

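UMAP can of course also be run on the principal components (PCs) of each sample obtained from PCA (PCA-UMAP):

PCA method: GWASLab: Population stratification and PCA tutorial (Population structure & PCA)

See: Sakaue, S., Hirata, J., Kanai, M. et al. Dimensionality reduction reveals fine-scale structure in the Japanese population with consequences for polygenic risk prediction. Nat Commun 11, 1569 (2020).

This paper analyzes the population structure of Japan and compares the results of various dimensionality reduction methods; it is highly recommended.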
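As a rough sketch of PCA-UMAP, the code below reads principal components from a PLINK 1.9 .eigenvec file (FID, IID, then one column per PC, with no header by default) and passes the leading PCs to UMAP. The function names, the choice of 10 PCs and the example lines are assumptions for illustration, not the procedure used in the paper.

```python
def read_eigenvec(lines, n_pcs=10):
    """Parse .eigenvec text into (sample_ids, pcs), keeping the first n_pcs PCs."""
    ids, pcs = [], []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        ids.append(fields[1])  # IID column
        pcs.append([float(v) for v in fields[2:2 + n_pcs]])
    return ids, pcs


def pca_umap(pcs, **umap_kwargs):
    """Run UMAP on a samples x PCs matrix (requires numpy and umap-learn)."""
    import numpy as np
    import umap

    return umap.UMAP(**umap_kwargs).fit_transform(np.asarray(pcs))


# Toy lines standing in for the contents of a .eigenvec file
example = [
    "F1 S1 0.1 0.2 0.3",
    "F2 S2 -0.1 0.0 0.5",
]
ids, pcs = read_eigenvec(example, n_pcs=2)
print(ids, pcs)
# embedding = pca_umap(pcs_from_real_file)  # needs a realistic number of samples
```

Running UMAP on a handful of PCs instead of tens of thousands of SNPs is faster and first removes some noise, at the cost of discarding whatever structure the dropped PCs carried.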

References

McInnes, L, Healy, J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv e-prints 1802.03426, 2018

Sakaue, S., Hirata, J., Kanai, M. et al. Dimensionality reduction reveals fine-scale structure in the Japanese population with consequences for polygenic risk prediction. Nat Commun 11, 1569 (2020). https://doi.org/10.1038/s41467-020-15194-z

Diaz-Papkovich, A., Anderson-Trocmé, L. & Gravel, S. A review of UMAP in population genetics. J. Hum. Genet. 66, 85–91 (2021).

https://umap-learn.readthedocs.io
