Introduction

We will explore different ways of assessing population structure (differentiation). You can think of population structure as identifying clusters or groups of more closely related individuals resulting from reduced gene flow among these groups. Populations can be studied to determine if they are structured by using, for example, population differentiation summary statistics (e.g. Fst), clustering or phylogenetic trees. Today, we will utilize a dataset including populations from Ivory Coast and Ethiopia. Let’s first look at an example of population differentiation based on Fst

Population Differentiation index FST

Assessing genetic diversity almost always starts with an analysis of a parameter such as FST. We will use the data set named CIV_ETH.vcf.gz containing 70 individuals from Ivory Coast and 34 from Ethiopia. Note that only bi-allelic loci was selected and low quality markers discarded from this training dataset. We expect to see some population differentiation between these two populations. In terms of these diversity measures, an index of FST=0 indicates no differentiation, whereas FST=1 indicates that populations are segregating for differing alleles.

Let’s load package and the example data set:

library(mmod)
library(vcfR)
library(poppr)
library(ape)
library(magrittr)
library(hierfstat)
library(tidyverse)
library(adegenet)
library(reshape2)

input <- read.vcfR('CIV_ETH.vcf.gz')
genind.object <- vcfR2genind(input, return.alleles = TRUE, NA.char = "./.")

# Extract IDs from genind object
indiv <- indNames(genind.object)

# Extract Species
pop <- c(rep("IvoryCoast", length(grep("QQ", indiv))), 
         rep("Ethiopia", length(grep("QS", indiv))))

pop(genind.object) <- as.factor(pop)

#===========================
# Calculate heterozygosity 
# per site
#==========================

# Calculate basic stats using hierfstat
basic_gene = basic.stats(genind.data, diploid = TRUE, digits=4)

# Mean observed heterozygosity
Ho_gene = apply(basic_gene$Ho, MARGIN = 2, FUN = mean, na.rm = TRUE) %>%
    round(digits = 3)

# Mean expected heterozygosity
He_gene = apply(basic_gene$Hs, MARGIN = 2, FUN = mean, na.rm = TRUE) %>%
    round(digits = 2)

#===========================
# Visualize heterozygosity
#===========================

# Create a data.frame of site names, Ho and He and then convert to long format
Het_gene_df = data.frame(Site = names(Ho_gene), Ho = Ho_gene, He = He_gene) %>%
    melt(id.vars = "Site")

# Italic label
hetlab.o <- expression(italic("H")[O])
hetlab.e <- expression(italic("H")[E])

ylimits <- c(0, max(Het_gene_df$value)+0.01)

ggplot(data = Het_gene_df, aes(x = Site, y = value, fill = variable)) +
    geom_bar(stat = "identity", position = position_dodge(width = 0.9), colour = "black") +
    scale_y_continuous(expand = c(0,0), limits = ylimits) + 
    scale_fill_manual(values = c("royalblue", "#bdbdbd"), labels = c(hetlab.o, hetlab.e)) +
    ylab("Heterozygosity")

#==========================
# Compute pairwise FST 
# (Weir & Cockerham 1984)
#==========================
gene_fst <- genet.dist(genind.data, method = "WC84")

#=========================
# Visualize pairwise FST.
#=========================

# Convert dist object to data.frame
fst.matrix <- as.matrix(gene_fst)

## Sort column names
fst.matrix <- fst.matrix[order(rownames(fst.matrix)), order(colnames(fst.matrix))]
ind <- which( upper.tri(fst.matrix), arr.ind = TRUE)
fst.df <- data.frame(Site1 = dimnames(fst.matrix)[[2]][ind[,2]],
                     Site2 = dimnames(fst.matrix)[[1]][ind[,1]],
                     Fst = fst.matrix[ ind ] %>% round(digits = 3))

# Convert minus values to zero
fst.df$Fst[fst.df$Fst < 0] <- 0

# Fst italic label
fst.label <- expression(italic("F")[ST])

# Extract middle Fst value for gradient argument
mid <- max(fst.df$Fst) / 2

# Plot heatmap
ggplot(data = fst.df, aes(x = Site1, y = Site2, fill = Fst)) +
    geom_tile(colour = "black") +
    geom_text(aes(label = Fst), color="black", size = 3)+
    scale_fill_gradient2(low = "blue", mid = "pink", high = "red",
                         midpoint = mid, name = fst.label,
                         limits = c(0, max(fst.df$Fst)))+ 
    scale_x_discrete(expand = c(0,0))+
    scale_y_discrete(expand = c(0,0), position = "right")+
    theme(axis.text = element_text(colour = "black", size = 10, face = "bold"),
          axis.title = element_blank(),
          panel.grid = element_blank(),
          panel.background = element_blank(),
          legend.position = "right",
          legend.title = element_text(size = 14, face = "bold"),
          legend.text = element_text(size = 10))

Phylogenetic tree (Genetic Distance)

If we wanted to analyze the relationship between individuals or populations, we would use genetic distance measures which calculate the “distance” between samples based on their genetic profile. These distances can be visualized with heatmaps, dendrograms, or minimum spanning networks. In the package poppr, there are several distances available here

One common way to visualize a genetic distance is with a dendrogram. For this example, we will use the Plasmodium falciparum data set containing information on 70 samples from Ivory Coast and 34 samples from Ethiopia. We can create a dendrogram over all 104 samples, but that would be difficult to visualize. For learning purpose, let’s take 20 random samples and calculate Provesti’s distance, which will return the fraction of the number of differences between samples:

set.seed(10)
ten_samples <- sample(nInd(genind.object), 20)
mic10       <- genind.object[ten_samples]
(micdist    <- provesti.dist(mic10))

theTree <- micdist %>%
    nj() %>%    # calculate neighbor-joining tree
    ladderize() # organize branches by clade

plot(theTree)
add.scale.bar(length = 0.05) # add a scale bar showing 5% difference.

set.seed(999)
aboot(mic10, dist = provesti.dist, sample = 200, tree = "nj", cutoff = 50, quiet = TRUE)

Population structure: PCA

Now that we have a fully filtered VCF, we can start do some cool analyses with it. First of all we will investigate population structure using principal components analysis. Examining population structure can give us a great deal of insight into the history and origin of populations. Model-free methods for examining population structure and ancestry, such as principal components analysis are extremely popular in population genomic research. This is because it is typically simple to apply and relatively easy to interpret.

Essentially, PCA aims to identify the main axes of variation in a dataset with each axis being independent of the next (i.e. there should be no correlation between them). The first component summarizes the major axis variation and the second the next largest and so on, until cumulatively all the available variation is explained. In the context of genetic data, PCA summarizes the major axes of variation in allele frequencies and then produces the coordinates of individuals along these axes.

To perform a PCA on our Plasmodium falciparum data, we will use plink - specifically version 1.9 (although be aware older and newer versions are available). Note that plink was originally written with human data in mind and has also subsequently been extended to include some model species. As a result, we need to provide a bit of extra info to get it to work on our dataset.

Linkage pruning

One of the major assumptions of PCA is that the data we use is independent - i.e. there are no spurious correlations among the measured variables. This is obviously not the case for most genomic data as allele frequencies are correlated due to physical linkage and linkage disequilibrium. So as a first step, we need to prune our dataset of variants that are in linkage.

First things first, we will make a directory called plink

# make a plink directory
mkdir plink
# move into it
cd plink

We will also simplify our code using some environmental variables. Primarily we set one for our filtered VCF.

VCF=CIV_ETH.vcf.gz

This will make it very easy for plink to read in our data. Next we run the linkage pruning. Run the command and we will breakdown what all the arguments mean.

bcftools annotate --set-id +'%CHROM:%POS' $VCF > $VCF

# perform linkage pruning - i.e. identify prune sites
plink --vcf $VCF --double-id --allow-extra-chr \
--set-missing-var-ids @:# \
--indep-pairwise 50 10 0.1 --out civ_eth

So for our plink command, we did the following:

–vcf - specified the location of our VCF file.
–double-id - told plink to duplicate the id of our samples (this is because plink typically expects a family and individual id - i.e. for pedigree data - this is not necessary for us.
–allow-extra-chr - allow additional chromosomes beyond the human chromosome set. This is necessary as otherwise plink expects chromosomes 1-22 and the human X chromosome.
–set-missing-var-ids - also necessary to set a variant ID for our SNPs. Human and model organisms often have annotated SNP names and so plink will look for these. We do not have them so instead we set ours to default to chromosome:position which can be achieved in plink by setting the option @:# - see here for more info.
–indep-pairwise - finally we are actually on the command that performs our linkage pruning! The first argument, 50 denotes we have set a window of 50 Kb. The second argument, 10 is our window step size - meaning we move 10 bp each time we calculate linkage. Finally, we set an r2 threshold - i.e. the threshold of linkage we are willing to tolerate. Here we prune any variables that show an r2 of greater than 0.1.
–out Produce the prefix for the output data.

As well as being versatile, plink is very fast. It will quickly produce a linkage analysis for all our data and write plenty of information to the screen. When complete, it will write out two files civ_eth.prune.in and civ_eth.prune.out. The first of these is a list of sites which fell below our linkage threshold - i.e. those we should retain. The other file is the opposite of this. In the next step, we will produce a PCA from these linkage-pruned sites.

Perform a PCA

Next we rerun plink with a few additional arguments to get it to conduct a PCA. We will run the command and then break it down as it is running.

# prune and create pca
plink --vcf $VCF --double-id --allow-extra-chr --set-missing-var-ids @:# \
--extract civ_eth.prune.in \
--make-bed --pca --out civ_eth

This is very similar to our previous command. What did we do here?

–extract - this just lets plink know we want to extract only these positions from our VCF - in other words, the analysis will only be conducted on these.
–make-bed - this is necessary to write out some additional files for another type of population structure analysis - a model based approach with admixture.
–pca - fairly self explanatory, this tells plink to calculate a principal components analysis. Once the command is run, we will see a series of new files. We will break these down too:

PCA output:

civ_eth.eigenval - the eigenvalues from our analysis
civ_eth.eigenvec- the eigenvectors from our analysis
plink binary output
civ_eth.bed - the civ_eth bed file - this is a binary file necessary for admixture analysis. It is essentially the genotypes of the pruned dataset recoded as 1s and 0s.
civ_eth.bim - a map file (i.e. information file) of the variants contained in the bed file.
civ_eth.fam - a map file for the individuals contained in the bed file.

Plotting the PCA output

Next we turn to R to plot the analysis we have produced!

Setting up the R environment First load the tidyverse package and ensure you have moved the plink output into the working directory you are operating in. You may want to set up an RStudio Project to manage this analysis. See here for a guide on how to do this.

# load tidyverse package
library(tidyverse)

Then we will use a combination of readr and the standard scan function to read in the data.

# read in data
pca <- read_table2("./civ_eth.eigenvec", col_names = FALSE)
eigenval <- scan("./civ_eth.eigenval")

Cleaning up the data

Unfortunately, we need to do a bit of legwork to get our data into reasonable shape. First we will remove a nuisance column (plink outputs the individual ID twice). We will also give our pca data.frame proper column names.

# sort out the pca data
# remove nuisance column
pca <- pca[,-1]
# set names
names(pca)[1] <- "ind"
names(pca)[2:ncol(pca)] <- paste0("PC", 1:(ncol(pca)-1))

Next we can add a species, location and if required, a species x location vector. We will do this using the R version of grep. We then use paste0 to combine the columns.

# sort out the individual populations

population <- rep(NA, length(pca$ind))
population[grep("QQ", pca$ind)] <- "IvoryCoast"
population[grep("QS", pca$ind)] <- "Ethiopia"

## location
# loc <- rep(NA, length(pca$ind))
# loc[grep("Mak", pca$ind)] <- "makobe"
# loc[grep("Pyt", pca$ind)] <- "python"

# combine - if you want to plot each in different colors
# spp_loc <- paste0(spp, "_", loc)

With these variables created, we can remake our data.frame like so. Note the use of as.tibble to ensure that we make a tibble for easy summaries etc.

# remake data.frame
pca <- as.tibble(data.frame(pca, population))

Plotting the data

Now that we have done our housekeeping, we have everything in place to actually visualize the data properly. First we will plot the eigenvalues. It is quite straightforward to translate these into percentage variance explained (although note, you could just plot these raw if you wished).

# first convert to percentage variance explained
pve <- data.frame(PC = 1:16, pve = eigenval/sum(eigenval)*100)

With that done, it is very simple to create a bar plot showing the percentage of variance each principal component explains.

# make plot
a <- ggplot(pve, aes(PC, pve)) + geom_bar(stat = "identity")
a + ylab("Percentage variance explained") + theme_light()

Cumulatively, they explain 100% of the variance but PC1, PC2 and possible PC3 together explain about 30% of the variance. We could calculate this with the cumsum function, like so:

# calculate the cumulative sum of the percentage variance explained
cumsum(pve$pve)

Next we move on to actually plotting our PCA. Given the work we did earlier to get our data into shape, this doesn’t take much effort at all.

# plot pca
b <- ggplot(pca, aes(PC1, PC2, col = population)) + geom_point(size = 3)
b <- b + scale_colour_manual(values = c("red", "blue"))
b <- b + coord_equal() + theme_light()
b + xlab(paste0("PC1 (", signif(pve$pve[1], 3), "%)")) + ylab(paste0("PC2 (", signif(pve$pve[2], 3), "%)"))

> Note that this R code block also includes arguments to display the percentage of variance explained on each axis. Here we only plot PC1 and PC2. From this figure, we can see PC1 separates out the geographical locations and PC2 separates out the species.

Population Structure: Phylogenetic tree and PCA

Archibald Worwui, Mouhamadou F. DIOP, Lucas A. Etego & Alfred A. NGWA

December 01, 2022