Creating a Dendrogram from a downloaded VCF file

This document provides an example of creating a dendrogram from a VCF file. The VCF file can be altered using VCFtools to look at a specific area of the genome. The lines starting with ## are what you would expect to see for the output

For a more in depth tutorial of SNPRelate see http://corearray.sourceforge.net/tutorials/SNPRelate/

This document was created using Rmarkdown http://rmarkdown.rstudio.com.

**Thanks to Thomas Quiroz Monnens for catching an error in a previous version of this tutorial!

```

Installing Packages

#if (!requireNamespace("BiocManager", quietly = TRUE))
#    install.packages("BiocManager")

#BiocManager::install("SNPRelate")

Loading Packages

The packages will have to be loaded each time you open R or Rstudio.

library(gdsfmt)
library(SNPRelate)

## SNPRelate -- supported by Streaming SIMD Extensions 2 (SSE2)

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.0.2

Loading VCF file

There are two ways that the VCF file can be loaded into Rstudio if it already on your computer. The first is directly specifying the file path and name, the second is using the file.choose option which will open a browser so you can search for the file. The second option is commented out with # (putting a # before any code means it will be treated as text, to run it, just remove the #).

If you want to follow along with the example data, the data (Approx 1.4 MB) is available for download https://github.com/BenbowLab/BenbowLab.github.io/blob/master/VCFExampleData.vcf

vcf.fn <- "VCFExampleData.vcf"
#vcf.fn<- file.choose()

Parsing

The next command will turn the VCF file into a less data intensive form (GDS) for easier computing. If you have loaded an entire genome, expect this command to take an hour or more.

snpgdsVCF2GDS(vcf.fn,"data.gds",method ="biallelic.only")

## Start file conversion from VCF to SNP GDS ...
## Method: exacting biallelic SNPs
## Number of samples: 16
## Parsing "VCFExampleData.vcf" ...
##  import 2148 variants.
## + genotype   { Bit2 16x2148, 8.4K } *
## Optimize the access efficiency ...
## Clean up the fragments of GDS file:
##     open the file 'data.gds' (22.4K)
##     # of fragments: 46
##     save to 'data.gds.tmp'
##     rename 'data.gds.tmp' (22.1K, reduced: 312B)
##     # of fragments: 20

Formatting for IBS matrix

These commands prepare the data so it is formatted correctly to create a Identity by State matrix (fraction of identity by state for each pair of samples).

genofile<-snpgdsOpen("data.gds")
set.seed(100)
ibs.hc<-snpgdsHCluster(snpgdsIBS(genofile,num.thread=2, autosome.only=FALSE))

## Identity-By-State (IBS) analysis on genotypes:
## Excluding 46 SNPs (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
##     # of samples: 16
##     # of SNPs: 2,102
##     using 2 threads
## IBS:    the sum of all selected genotypes (0,1,2) = 48494
## Fri Aug 07 18:26:24 2020    (internal increment: 65536)
## 
[..................................................]  0%, ETC: ---        
[==================================================] 100%, completed, 0s
## Fri Aug 07 18:26:24 2020    Done.

Turn the clustering into a tree file and plotting tree

This step takes the clustering results from before and turns the numerical values into a dendrogram

rv <- snpgdsCutTree(ibs.hc)

## Determine groups by permutation (Z threshold: 15, outlier threshold: 5):
## Create 1 groups.

plot(rv$dendrogram,main="Dendrogram based on IBS")

Dissimilarity matrices

This command creates an dissimilarity matrix between all the samples. If you are looking at the X chromosome, make sure the autosome.only= code is changed to autosome.only=False.

*Note: A previous version of this example incorrectly used the snpgdsIBS command here instead of snpgdsDiss. We apologize for any confusion.

dissMatrix  =  snpgdsDiss(genofile , sample.id=NULL, autosome.only=TRUE,remove.monosnp=TRUE,  maf=NaN, missing.rate=NaN, num.thread=2, verbose=TRUE)

## Individual dissimilarity analysis on genotypes:
## Excluding 0 SNP on non-autosomes
## Excluding 46 SNPs (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
##     # of samples: 16
##     # of SNPs: 2,102
##     using 2 threads
## Dissimilarity:    the sum of all selected genotypes (0,1,2) = 48494
## Dissimilarity:   Fri Aug 07 18:26:24 2020    0%
## Dissimilarity:   Fri Aug 07 18:26:24 2020    100%

Clustering Analysis

This step performs a clustering analysis similar to the IBS analysis above but using dissimilarity. The next line creates a tree file based on dissimilarity rather than relatedness.

snpHCluster =  snpgdsHCluster(dissMatrix, sample.id=NULL, need.mat=TRUE, hang=0.01)
cutTree = snpgdsCutTree(snpHCluster, z.threshold=15, outlier.n=5, n.perm = 5000, samp.group=NULL, 
                        col.outlier="red", col.list=NULL, pch.outlier=4, pch.list=NULL,label.H=FALSE, label.Z=TRUE, 
                        verbose=TRUE)

## Determine groups by permutation (Z threshold: 15, outlier threshold: 5):
## Create 1 groups.

cutTree

## $sample.id
##  [1] "AA0035-C" "AA0036-C" "AA0037-C" "AA0038-C" "AB0286-C" "AB0290-C"
##  [7] "AB0292-C" "AX0001-C" "AX0002-C" "AY0001-C" "AY0002-C" "BY0001-C"
## [13] "BY0002-C" "BZ0001-C" "BZ0003-C" "BZ0004-C"
## 
## $z.threshold
## [1] 15
## 
## $outlier.n
## [1] 5
## 
## $samp.order
##  [1]  9 13  8 12 15 11  7 16  5 10  6 14  1  3  2  4
## 
## $samp.group
##  [1] G001 G001 G001 G001 G001 G001 G001 G001 G001 G001 G001 G001 G001 G001 G001
## [16] G001
## Levels: G001
## 
## $dmat
##          G001
## G001 1.039722
## 
## $dendrogram
## 'dendrogram' with 2 branches and 16 members total, at height 1.262116 
## 
## $merge
##            z n1 n2
## 1  0.0000000  1  1
## 2  1.3832952  1  2
## 3  1.7783401  1  3
## 4  1.9990924  1  4
## 5  1.5520638  1  5
## 6  0.0000000  1  1
## 7  0.0000000  1  1
## 8  1.8316610  1  6
## 9  1.2139331  1  2
## 10 0.7770847  1  2
## 11 1.4885736  1  7
## 12 2.9061354  3  8
## 13 1.6057339  1  3
## 14 1.0663228  1 11
## 15 4.6598993  4 12
## 
## $clust.count
## G001 
##   16

snpgdsClose(genofile)

Displaying a tree based on dissimilarity

snpgdsDrawTree(cutTree, main = "Dendrogram based on dissimilarity",edgePar=list(col=rgb(0.5,0.5,0.5,0.75),t.col="black"),
               y.label.kinship=T,leaflab="perpendicular",yaxis.kinship=F)

Session Info

sessionInfo()

## R version 4.0.1 (2020-06-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.3.2    SNPRelate_1.22.0 gdsfmt_1.24.1   
## 
## loaded via a namespace (and not attached):
##  [1] knitr_1.29       magrittr_1.5     tidyselect_1.1.0 munsell_0.5.0   
##  [5] colorspace_1.4-1 R6_2.4.1         rlang_0.4.7      dplyr_1.0.1     
##  [9] stringr_1.4.0    tools_4.0.1      grid_4.0.1       gtable_0.3.0    
## [13] xfun_0.16        withr_2.2.0      htmltools_0.5.0  ellipsis_0.3.1  
## [17] yaml_2.2.1       digest_0.6.25    tibble_3.0.3     lifecycle_0.2.0 
## [21] crayon_1.3.4     purrr_0.3.4      vctrs_0.3.2      glue_1.4.1      
## [25] evaluate_0.14    rmarkdown_2.3    stringi_1.4.6    compiler_4.0.1  
## [29] pillar_1.4.6     generics_0.0.2   scales_1.1.1     pkgconfig_2.0.3

Creating a phylogeny from the 1000 Genome Database

JReceveur

Aug 6, 2020