This document provides an example of creating a dendrogram from a VCF file. The VCF file can be altered using VCFtools to look at a specific area of the genome. The lines starting with ## are what you would expect to see for the output
For a more in depth tutorial of SNPRelate see http://corearray.sourceforge.net/tutorials/SNPRelate/
This document was created using Rmarkdown http://rmarkdown.rstudio.com.
**Thanks to Thomas Quiroz Monnens for catching an error in a previous version of this tutorial!
```
#if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
#BiocManager::install("SNPRelate")
The packages will have to be loaded each time you open R or Rstudio.
library(gdsfmt)
library(SNPRelate)
## SNPRelate -- supported by Streaming SIMD Extensions 2 (SSE2)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.2
There are two ways that the VCF file can be loaded into Rstudio if it already on your computer. The first is directly specifying the file path and name, the second is using the file.choose option which will open a browser so you can search for the file. The second option is commented out with # (putting a # before any code means it will be treated as text, to run it, just remove the #).
If you want to follow along with the example data, the data (Approx 1.4 MB) is available for download https://github.com/BenbowLab/BenbowLab.github.io/blob/master/VCFExampleData.vcf
vcf.fn <- "VCFExampleData.vcf"
#vcf.fn<- file.choose()
The next command will turn the VCF file into a less data intensive form (GDS) for easier computing. If you have loaded an entire genome, expect this command to take an hour or more.
snpgdsVCF2GDS(vcf.fn,"data.gds",method ="biallelic.only")
## Start file conversion from VCF to SNP GDS ...
## Method: exacting biallelic SNPs
## Number of samples: 16
## Parsing "VCFExampleData.vcf" ...
## import 2148 variants.
## + genotype { Bit2 16x2148, 8.4K } *
## Optimize the access efficiency ...
## Clean up the fragments of GDS file:
## open the file 'data.gds' (22.4K)
## # of fragments: 46
## save to 'data.gds.tmp'
## rename 'data.gds.tmp' (22.1K, reduced: 312B)
## # of fragments: 20
These commands prepare the data so it is formatted correctly to create a Identity by State matrix (fraction of identity by state for each pair of samples).
genofile<-snpgdsOpen("data.gds")
set.seed(100)
ibs.hc<-snpgdsHCluster(snpgdsIBS(genofile,num.thread=2, autosome.only=FALSE))
## Identity-By-State (IBS) analysis on genotypes:
## Excluding 46 SNPs (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
## # of samples: 16
## # of SNPs: 2,102
## using 2 threads
## IBS: the sum of all selected genotypes (0,1,2) = 48494
## Fri Aug 07 18:26:24 2020 (internal increment: 65536)
##
[..................................................] 0%, ETC: ---
[==================================================] 100%, completed, 0s
## Fri Aug 07 18:26:24 2020 Done.
This step takes the clustering results from before and turns the numerical values into a dendrogram
rv <- snpgdsCutTree(ibs.hc)
## Determine groups by permutation (Z threshold: 15, outlier threshold: 5):
## Create 1 groups.
plot(rv$dendrogram,main="Dendrogram based on IBS")
This command creates an dissimilarity matrix between all the samples. If you are looking at the X chromosome, make sure the autosome.only= code is changed to autosome.only=False.
*Note: A previous version of this example incorrectly used the snpgdsIBS command here instead of snpgdsDiss. We apologize for any confusion.
dissMatrix = snpgdsDiss(genofile , sample.id=NULL, autosome.only=TRUE,remove.monosnp=TRUE, maf=NaN, missing.rate=NaN, num.thread=2, verbose=TRUE)
## Individual dissimilarity analysis on genotypes:
## Excluding 0 SNP on non-autosomes
## Excluding 46 SNPs (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
## # of samples: 16
## # of SNPs: 2,102
## using 2 threads
## Dissimilarity: the sum of all selected genotypes (0,1,2) = 48494
## Dissimilarity: Fri Aug 07 18:26:24 2020 0%
## Dissimilarity: Fri Aug 07 18:26:24 2020 100%
This step performs a clustering analysis similar to the IBS analysis above but using dissimilarity. The next line creates a tree file based on dissimilarity rather than relatedness.
snpHCluster = snpgdsHCluster(dissMatrix, sample.id=NULL, need.mat=TRUE, hang=0.01)
cutTree = snpgdsCutTree(snpHCluster, z.threshold=15, outlier.n=5, n.perm = 5000, samp.group=NULL,
col.outlier="red", col.list=NULL, pch.outlier=4, pch.list=NULL,label.H=FALSE, label.Z=TRUE,
verbose=TRUE)
## Determine groups by permutation (Z threshold: 15, outlier threshold: 5):
## Create 1 groups.
cutTree
## $sample.id
## [1] "AA0035-C" "AA0036-C" "AA0037-C" "AA0038-C" "AB0286-C" "AB0290-C"
## [7] "AB0292-C" "AX0001-C" "AX0002-C" "AY0001-C" "AY0002-C" "BY0001-C"
## [13] "BY0002-C" "BZ0001-C" "BZ0003-C" "BZ0004-C"
##
## $z.threshold
## [1] 15
##
## $outlier.n
## [1] 5
##
## $samp.order
## [1] 9 13 8 12 15 11 7 16 5 10 6 14 1 3 2 4
##
## $samp.group
## [1] G001 G001 G001 G001 G001 G001 G001 G001 G001 G001 G001 G001 G001 G001 G001
## [16] G001
## Levels: G001
##
## $dmat
## G001
## G001 1.039722
##
## $dendrogram
## 'dendrogram' with 2 branches and 16 members total, at height 1.262116
##
## $merge
## z n1 n2
## 1 0.0000000 1 1
## 2 1.3832952 1 2
## 3 1.7783401 1 3
## 4 1.9990924 1 4
## 5 1.5520638 1 5
## 6 0.0000000 1 1
## 7 0.0000000 1 1
## 8 1.8316610 1 6
## 9 1.2139331 1 2
## 10 0.7770847 1 2
## 11 1.4885736 1 7
## 12 2.9061354 3 8
## 13 1.6057339 1 3
## 14 1.0663228 1 11
## 15 4.6598993 4 12
##
## $clust.count
## G001
## 16
snpgdsClose(genofile)
snpgdsDrawTree(cutTree, main = "Dendrogram based on dissimilarity",edgePar=list(col=rgb(0.5,0.5,0.5,0.75),t.col="black"),
y.label.kinship=T,leaflab="perpendicular",yaxis.kinship=F)
sessionInfo()
## R version 4.0.1 (2020-06-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.3.2 SNPRelate_1.22.0 gdsfmt_1.24.1
##
## loaded via a namespace (and not attached):
## [1] knitr_1.29 magrittr_1.5 tidyselect_1.1.0 munsell_0.5.0
## [5] colorspace_1.4-1 R6_2.4.1 rlang_0.4.7 dplyr_1.0.1
## [9] stringr_1.4.0 tools_4.0.1 grid_4.0.1 gtable_0.3.0
## [13] xfun_0.16 withr_2.2.0 htmltools_0.5.0 ellipsis_0.3.1
## [17] yaml_2.2.1 digest_0.6.25 tibble_3.0.3 lifecycle_0.2.0
## [21] crayon_1.3.4 purrr_0.3.4 vctrs_0.3.2 glue_1.4.1
## [25] evaluate_0.14 rmarkdown_2.3 stringi_1.4.6 compiler_4.0.1
## [29] pillar_1.4.6 generics_0.0.2 scales_1.1.1 pkgconfig_2.0.3