We will explore different ways of assessing population structure (differentiation). You can think of population structure as identifying clusters or groups of more closely related individuals resulting from reduced gene flow among these groups. Populations can be studied to determine if they are structured by using, for example, population differentiation summary statistics (e.g. Fst), clustering or phylogenetic trees. Today, we will utilize a dataset including populations from Ivory Coast and Ethiopia. Let’s first look at an example of population differentiation based on Fst
Assessing genetic diversity almost always starts with an analysis of a parameter such as FST. We will use the data set named CIV_ETH.vcf.gz containing 70 individuals from Ivory Coast and 34 from Ethiopia. Note that only bi-allelic loci was selected and low quality markers discarded from this training dataset. We expect to see some population differentiation between these two populations. In terms of these diversity measures, an index of FST=0 indicates no differentiation, whereas FST=1 indicates that populations are segregating for differing alleles.
Let’s load package and the example data set:
library(mmod)
library(vcfR)
library(poppr)
library(ape)
library(magrittr)
library(hierfstat)
library(tidyverse)
library(adegenet)
library(reshape2)
input <- read.vcfR('CIV_ETH.vcf.gz')
genind.object <- vcfR2genind(input, return.alleles = TRUE, NA.char = "./.")
# Extract IDs from genind object
indiv <- indNames(genind.object)
# Extract Species
pop <- c(rep("IvoryCoast", length(grep("QQ", indiv))),
rep("Ethiopia", length(grep("QS", indiv))))
pop(genind.object) <- as.factor(pop)
#===========================
# Calculate heterozygosity
# per site
#==========================
# Calculate basic stats using hierfstat
basic_gene = basic.stats(genind.data, diploid = TRUE, digits=4)
# Mean observed heterozygosity
Ho_gene = apply(basic_gene$Ho, MARGIN = 2, FUN = mean, na.rm = TRUE) %>%
round(digits = 3)
# Mean expected heterozygosity
He_gene = apply(basic_gene$Hs, MARGIN = 2, FUN = mean, na.rm = TRUE) %>%
round(digits = 2)
#===========================
# Visualize heterozygosity
#===========================
# Create a data.frame of site names, Ho and He and then convert to long format
Het_gene_df = data.frame(Site = names(Ho_gene), Ho = Ho_gene, He = He_gene) %>%
melt(id.vars = "Site")
# Italic label
hetlab.o <- expression(italic("H")[O])
hetlab.e <- expression(italic("H")[E])
ylimits <- c(0, max(Het_gene_df$value)+0.01)
ggplot(data = Het_gene_df, aes(x = Site, y = value, fill = variable)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.9), colour = "black") +
scale_y_continuous(expand = c(0,0), limits = ylimits) +
scale_fill_manual(values = c("royalblue", "#bdbdbd"), labels = c(hetlab.o, hetlab.e)) +
ylab("Heterozygosity")
#==========================
# Compute pairwise FST
# (Weir & Cockerham 1984)
#==========================
gene_fst <- genet.dist(genind.data, method = "WC84")
#=========================
# Visualize pairwise FST.
#=========================
# Convert dist object to data.frame
fst.matrix <- as.matrix(gene_fst)
## Sort column names
fst.matrix <- fst.matrix[order(rownames(fst.matrix)), order(colnames(fst.matrix))]
ind <- which( upper.tri(fst.matrix), arr.ind = TRUE)
fst.df <- data.frame(Site1 = dimnames(fst.matrix)[[2]][ind[,2]],
Site2 = dimnames(fst.matrix)[[1]][ind[,1]],
Fst = fst.matrix[ ind ] %>% round(digits = 3))
# Convert minus values to zero
fst.df$Fst[fst.df$Fst < 0] <- 0
# Fst italic label
fst.label <- expression(italic("F")[ST])
# Extract middle Fst value for gradient argument
mid <- max(fst.df$Fst) / 2
# Plot heatmap
ggplot(data = fst.df, aes(x = Site1, y = Site2, fill = Fst)) +
geom_tile(colour = "black") +
geom_text(aes(label = Fst), color="black", size = 3)+
scale_fill_gradient2(low = "blue", mid = "pink", high = "red",
midpoint = mid, name = fst.label,
limits = c(0, max(fst.df$Fst)))+
scale_x_discrete(expand = c(0,0))+
scale_y_discrete(expand = c(0,0), position = "right")+
theme(axis.text = element_text(colour = "black", size = 10, face = "bold"),
axis.title = element_blank(),
panel.grid = element_blank(),
panel.background = element_blank(),
legend.position = "right",
legend.title = element_text(size = 14, face = "bold"),
legend.text = element_text(size = 10))
If we wanted to analyze the relationship between individuals or populations, we would use genetic distance measures which calculate the “distance” between samples based on their genetic profile. These distances can be visualized with heatmaps, dendrograms, or minimum spanning networks. In the package poppr, there are several distances available here
One common way to visualize a genetic distance is with a dendrogram. For this example, we will use the Plasmodium falciparum data set containing information on 70 samples from Ivory Coast and 34 samples from Ethiopia. We can create a dendrogram over all 104 samples, but that would be difficult to visualize. For learning purpose, let’s take 20 random samples and calculate Provesti’s distance, which will return the fraction of the number of differences between samples:
set.seed(10)
ten_samples <- sample(nInd(genind.object), 20)
mic10 <- genind.object[ten_samples]
(micdist <- provesti.dist(mic10))
theTree <- micdist %>%
nj() %>% # calculate neighbor-joining tree
ladderize() # organize branches by clade
plot(theTree)
add.scale.bar(length = 0.05) # add a scale bar showing 5% difference.
set.seed(999)
aboot(mic10, dist = provesti.dist, sample = 200, tree = "nj", cutoff = 50, quiet = TRUE)
Now that we have a fully filtered VCF, we can start do some cool analyses with it. First of all we will investigate population structure using principal components analysis. Examining population structure can give us a great deal of insight into the history and origin of populations. Model-free methods for examining population structure and ancestry, such as principal components analysis are extremely popular in population genomic research. This is because it is typically simple to apply and relatively easy to interpret.
Essentially, PCA aims to identify the main axes of variation in a dataset with each axis being independent of the next (i.e. there should be no correlation between them). The first component summarizes the major axis variation and the second the next largest and so on, until cumulatively all the available variation is explained. In the context of genetic data, PCA summarizes the major axes of variation in allele frequencies and then produces the coordinates of individuals along these axes.
To perform a PCA on our Plasmodium falciparum data, we will use plink - specifically version 1.9 (although be aware older and newer versions are available). Note that plink was originally written with human data in mind and has also subsequently been extended to include some model species. As a result, we need to provide a bit of extra info to get it to work on our dataset.
One of the major assumptions of PCA is that the data we use is independent - i.e. there are no spurious correlations among the measured variables. This is obviously not the case for most genomic data as allele frequencies are correlated due to physical linkage and linkage disequilibrium. So as a first step, we need to prune our dataset of variants that are in linkage.
First things first, we will make a directory called plink
We will also simplify our code using some environmental variables. Primarily we set one for our filtered VCF.
This will make it very easy for plink to read in our data. Next we run the linkage pruning. Run the command and we will breakdown what all the arguments mean.
bcftools annotate --set-id +'%CHROM:%POS' $VCF > $VCF
# perform linkage pruning - i.e. identify prune sites
plink --vcf $VCF --double-id --allow-extra-chr \
--set-missing-var-ids @:# \
--indep-pairwise 50 10 0.1 --out civ_eth
So for our plink command, we did the following:
As well as being versatile, plink is very fast. It will quickly produce a linkage analysis for all our data and write plenty of information to the screen. When complete, it will write out two files civ_eth.prune.in and civ_eth.prune.out. The first of these is a list of sites which fell below our linkage threshold - i.e. those we should retain. The other file is the opposite of this. In the next step, we will produce a PCA from these linkage-pruned sites.
Next we rerun plink with a few additional arguments to get it to conduct a PCA. We will run the command and then break it down as it is running.
# prune and create pca
plink --vcf $VCF --double-id --allow-extra-chr --set-missing-var-ids @:# \
--extract civ_eth.prune.in \
--make-bed --pca --out civ_eth
This is very similar to our previous command. What did we do here?
PCA output:
civ_eth.eigenval - the eigenvalues from our analysis
civ_eth.eigenvec- the eigenvectors from our analysis
plink binary output
civ_eth.bed - the civ_eth bed file - this is a binary file necessary for admixture analysis. It is essentially the genotypes of the pruned dataset recoded as 1s and 0s.
civ_eth.bim - a map file (i.e. information file) of the variants contained in the bed file.
civ_eth.fam - a map file for the individuals contained in the bed file.
Next we turn to R to plot the analysis we have produced!
Setting up the R environment First load the tidyverse package and ensure you have moved the plink output into the working directory you are operating in. You may want to set up an RStudio Project to manage this analysis. See here for a guide on how to do this.
Then we will use a combination of readr and the standard scan function to read in the data.
Unfortunately, we need to do a bit of legwork to get our data into reasonable shape. First we will remove a nuisance column (plink outputs the individual ID twice). We will also give our pca data.frame proper column names.
# sort out the pca data
# remove nuisance column
pca <- pca[,-1]
# set names
names(pca)[1] <- "ind"
names(pca)[2:ncol(pca)] <- paste0("PC", 1:(ncol(pca)-1))
Next we can add a species, location and if required, a species x location vector. We will do this using the R version of grep. We then use paste0 to combine the columns.
# sort out the individual populations
population <- rep(NA, length(pca$ind))
population[grep("QQ", pca$ind)] <- "IvoryCoast"
population[grep("QS", pca$ind)] <- "Ethiopia"
## location
# loc <- rep(NA, length(pca$ind))
# loc[grep("Mak", pca$ind)] <- "makobe"
# loc[grep("Pyt", pca$ind)] <- "python"
# combine - if you want to plot each in different colors
# spp_loc <- paste0(spp, "_", loc)
With these variables created, we can remake our data.frame like so. Note the use of as.tibble to ensure that we make a tibble for easy summaries etc.
Now that we have done our housekeeping, we have everything in place to actually visualize the data properly. First we will plot the eigenvalues. It is quite straightforward to translate these into percentage variance explained (although note, you could just plot these raw if you wished).
# first convert to percentage variance explained
pve <- data.frame(PC = 1:16, pve = eigenval/sum(eigenval)*100)
With that done, it is very simple to create a bar plot showing the percentage of variance each principal component explains.
# make plot
a <- ggplot(pve, aes(PC, pve)) + geom_bar(stat = "identity")
a + ylab("Percentage variance explained") + theme_light()
Cumulatively, they explain 100% of the variance but PC1, PC2 and possible PC3 together explain about 30% of the variance. We could calculate this with the cumsum function, like so:
Next we move on to actually plotting our PCA. Given the work we did earlier to get our data into shape, this doesn’t take much effort at all.
# plot pca
b <- ggplot(pca, aes(PC1, PC2, col = population)) + geom_point(size = 3)
b <- b + scale_colour_manual(values = c("red", "blue"))
b <- b + coord_equal() + theme_light()
b + xlab(paste0("PC1 (", signif(pve$pve[1], 3), "%)")) + ylab(paste0("PC2 (", signif(pve$pve[2], 3), "%)"))
> Note that this R code block also includes arguments to display the percentage of variance explained on each axis. Here we only plot PC1 and PC2. From this figure, we can see PC1 separates out the geographical locations and PC2 separates out the species.