Introduction to Neutrality Statistics and Signature of Selection

Readings:

Why use R?

Until recently, one of the more tedious aspects of conducting a population genetic analysis was the need for repeated reformatting data to conduct different, complimentary analyses in different programs. Often, these programs only ran on one platform. Now, R provides a toolbox with its packages that allows analysis of most data conveniently without tedious reformatting on all major computing platforms including Microsoft Windows, Linux, and Apple’s OS X. R is an open source statistical programming and graphing language that includes tools for statistical, population genetic, genomic, phylogenetic, and comparative genomic analyses.

Note that the R user community is very active and that both R and its packages are regularly updated, critically modified, and noted as deprecated (no longer updated) as appropriate.

Population genetics

Traditional population genetics is based on analysis of observed allele frequencies compared to frequencies expected, assuming a population genetic model. For example, under a Wright-Fisher model you might expect to see populations of diploid individuals that reproduce sexually, with non-overlapping generations. This model ignores effects such as mutation, recombination, selection or changes in population size or structure. More complex models can incorporate different aspects of effects observed in real populations. However, most of these models assume that populations reproduce sexually.

Here, we we’ll take our first look at neutrality statistics, which are statistical tests designed to measure if whether SNPs appear to be undergoing selection within a population. We’ll go over what each of the statistics are, what they mean, and how to interpret them, and then we’ll use them on our own data.

First, let’s get an understanding of what each of these statistics might mean when calculated for our populations…

Tajima’s D

This is a fairly commonly used neutrality statistic. You can read the original paper by Tajima, which details the development of his D statistic as a way of testing for the neutral mutation hypothesis (or Kimura’s theory of neutral mutation). Tajima’s D is a test statistic that compares the average number of pairwise differences in among base pairs in a population sample with the total number of variant sites in that sample. Most simply put, Tajima’s D indicates whether the pattern of mutations found in a particular region of the genome in a population is following the assumptions of neutrality; in other words, if there has been selection for or against a particular allele in the population, the pattern seen should violate the assumptions of the neutral model.

Tajima’s D is calculated using the relationship between the total number of variable loci in a given sequence in a population (\(\theta\)T which is related to S, stands for the number of segregating sites) and the pairwise differences in that same sequence between each individual in the population (\(\theta\)\(\theta\)T, or \(\pi\)\(\pi\)).

When there are very few segregating sites, overall, along with very few pairwise differences between individuals, Tajima’s D will be positive; such a lack of variation suggests that purifying selection is acting on this region of the genome. When there are more pairwise differences than segregating sites, Tajima’s D is also positive, but in this case it indicates balancing selection.

An excess of low frequency polymorphisms will lead to a negative Tajima’s D. This is a violation of neutrality suggesting that positive selection may have occurred, or perhaps a recent bottleneck or founder’s events. In this case, we have multiple rare of an alleles against a background of very few segregating sites. This is often seen when one haplotype takes over a population and begins to accrue random mutations (as happens in a selective sweep or a founder/bottleneck event).

With Tajima’s D, the closer to 0 the D statistic is, the closer the locus being tested is to meeting the assumptions of neutrality.

Today, we’ll calculate Tajima’s D values for Cote d’Ivoire populations in order to determine if any directional selection is happening in Plasmodium falciparum genes.

Fu and Li’s D and F

Fu and Li’s D and F are also fairly straightforward tests derived from the same Θ statistics as Tajima’s D. Like Tajima’s D, these statistics are also based upon Kimura’s neutrality statistic, again meaning that anything deviating from 0 is a violation of neutrality. What these statistics do, specifically, is measure the number of singleton or private mutations (i.e., the number of people within a sample that have a novel mutation specific to themselves). While the D statistic is based on the difference between the number of singletons in the population and the total number of mutations in population, the F statistic is based on the difference between the number of singletons and the average number of nucleotide differences between pairs of sequences. This difference in how they’re derived makes it possible for one statistic to be positive while the other is negative in the same population.

We can interpret these statistics like so:

  • If the statistics are strongly negative, this shows an excess of singletons in the population. This means there are quite a few people whose genomic variants don’t match each other. Such a situation could be from rapid population growth (meaning everyone is closely related and so there’s been no time for mutation to catch up to the increased numbers) or a selective sweep (see below for a formal definition) further back in the population’s history, meaning that strong directional selection for a certain mutation led to an over-representation of that mutation (at the expense of other variants at that SNP).

  • If the statistics are positive, this indicates an excess of old/ancestral variants, that have been selected for in the past. In other words, there are very few unique variants, and the variants that are present are carried by a lot of individuals.

We will calculate Fu and Li’s D and F statistics for our data today.

Integrated Haplotype Score (iHS)

Recent positive selection usually happens by selective sweeps (i.e. the rapid spread of a particular haplotype, driven by selection for particular alleles within that haplotype), which not only act upon the SNP that corresponds to the phenotype under selection, but the entire haplotype or linkage region in which that SNP is found. When this happens, the alleles in these selected haplotypes become almost entirely homozygous, because selection on this region of the genome is so strong. An iHS score, or an Integrated Haplotype Score, is a test that can be used to measure the amount of recent positive selection experienced by an allele indirectly by looking for evidence of selective sweeps, while we can also directly measure extended haplotype homozygosity (EHH) in the population to assess hard selective sweeps. These tests identify extended haplotypes (large sets of alleles on a single region of the chromosome that are identical across individuals) and looking at how many of the haplotypes are homozygous within the population. Mass homozygosity across individuals for a particular haplotype is recorded as a high iHS score, while low levels of haplotype homozygosity is recorded as a low iHS score. We can (and will) also visualize EHH directly.

Learning Outcomes

  • Learn what Fu and Li’s D and F, Tajima’s D, and iHS scores mean, and how to interpret them.

  • Learn to use the PopGenome package to calculate neutrality statistics such as Fu and Li’s D and F, and Tajima’s D.

  • Learn to use the pegas package to explore our Tajima’s D statistic further.

  • Learn about iHS by calculate iHS and EHH in R, and reflect on what the iHS score for our population’s P. falciparum haplotype looks like.

  • Learn about whether or not selection is happening in our populations based on these statistics, and relate it to environments that may cause these alleles in this region to be selected for or against.

Practical

Part 1: Getting Started

Overview of Data

In this practical session, we will use a real-world data set consisting of bi-allelic loci for Cote d’Ivoire population taken from Pf3k. We will be using a simplified version of this data set where low quality sites, non-polymorphic variants, markers and individuals with lot of missing data are removed.

Getting ready to use R

R provides a unique environment for performing population genetic analyses. You will particularly enjoy not having to switch data formats and operating systems to execute a series of analyses, as was the case until now. Furthermore, R provides graphing capabilities that are ready for use in publications, with only a little bit of extra effort. But first, let’s install R, install an integrated development environment, open R, and load R packages.

Installing the required packages

Next, we need to install these following packages in our R Studio space.

  1. vcfR
  2. PopGenome
  3. pegas
  4. tidyverse
  5. rehh

Please copy and paste the below code chunk in it’s entirety to your console or in Rstudio to download R package libraries needed for this practical. If you are having trouble installing any of the R packages, please ask an instructor for help.

# Here we have a list of packages we want to install
load.lib <- c("PopGenome", "vcfR", "pegas", "tidyverse", "rehh")

# Then we select only the packages that aren't currently installed.
install.lib <- load.lib[!load.lib %in% installed.packages()]

# And finally we install the missing packages, including their dependency.
for(lib in install.lib) install.packages(lib, repos = "http://cranr eval=FALSE, warning=FALSEstudio.com", dependencies = TRUE)

Next, load all of those libraries into this session using the code chunk below. Please copy and paste it in its entirety.:

library(vcfR)
library(PopGenome)
library(pegas)

Finally, we need to format our data for analysis. PopGenome is a bit funky about how they look for the files in your directory, so we need to move some files around in our folders to make the GENOME object that the PopGenome package will work with. Specifically, if you haven’t already done so, we need to make a new directory within our working directory, and put a copy of our data into that folder. Do this using the mkdir and cp commands:

mkdir data
cp cotedIvoire_DP10_Q20_miss_maf5.vcf data/cotedIvoire_DP10_Q20_miss_maf5.vcf

Now, we can make our GENOME object by directing the function to the folder we just created. Additionally, we’ll make a DNAbin object to save for later…

Here we’ll make our GENOME object:

ICgenome <- readData("data", format = "VCF")
ICgenome

And here we’ll make our DNAbin object:

IC.data <- read.vcfR("data/cotedIvoire_DP10_Q20_miss_maf5.vcf")
ICdna <- vcfR2DNAbin(IC.data)
ICdna

Part 2: Fu and Li’s D and F

The function neutrality.stats from PopGenome conveniently calculates several different neutrality statistics for a given population. We’ll use it now to calculate Fu and Li’s D and F:

# Remember to replace the variable name with your own GENOME object!
neut <- neutrality.stats(ICgenome)
get.neutrality(neut)
# To see results: 
neut@Fu.Li.F
neut@Fu.Li.D

As we can see here, the Cote d’Ivoire population has D and F statistics that are positive. This suggests an excess of old/ancestral variants in this population.
Before we move on, think about what the results for your population means, based on how we interpret Fu and Li’s D. Is there an excess of singletons? Is there an excess of old/ancestral mutations? What does that tell you about your population’s history? We will come back to this at the end of class.

Part 3: Tajima’s D

Now we’ll look at Tajima’s D! When we ran the neutrality.stats function in PopGenome, it also calcuated our Tajima’s D statistic. We can preview our Tajima’s D results from the PopGenome output…

neut@Tajima.D

What we see from P.falciparum for Tajima’s D is that this population appears to be subject to selection in Cote d’Ivoire. This suggests that the only evolution (remember, that’s a change in allele frequencies over generations, and does not have to be due to selection) happening in the population is selectively neutral.

Now, you can repeat the same analysis on your own population. Then, interpret your own population’s Tajima’s D value. What is the D statistic for your population, and what does it mean in terms of whether and what kind of selection might be occurring? We will revisit this question at the end of the training.

Part 4: Positive selective sweeps using iHS and EHH

Now, we will run our iHS and EHH analyses to assess recent positive selective sweeps in an expanded area surrounding the Pfcrt region on chromosome 7 (NOTE: you MUST download a new VCF file to run this analysis). iHS is intended for application on a single (presumably homogeneous) population, while XP-EHH and Rsb are targeted to differential selection between two populations. All three statistics are based on the concept of Extended Haplotype Homozygosity (EHH) as formulated by (Sabeti et al. 2002). To run this example, I’m using the rehh package, which is specifically designed to look for Extended Haplotype Homozygosity, or evidence of hard positive selective sweeps.

Input files

The package rehh requires as input:

  • Two tiny invented examples, each in our “standard” haplotype format and in variant call format (vcf). The first example was used in (Gautier et al. 2017) for the explanation of the various statistics calculated by this package. The second example is an extension of the former including multi-allelic markers and missing values. Both sets are discussed in depth in a second vignette.

  • Further three tiny examples used for the supplement on Site Frequency Spectrum-based methods of (Klassmann and Gautier 2020).

  • An output file of the program ms containing two small simulated haplotype samples.

  • A data set in various formats that originated from a study on the “Creole cattle breed from Guadeloupe” (CGU) (Gautier and Naves 2011). All files contain the same set of phased SNPs of Bos taurus chromosome 12 from 140 individuals.

All of the above files are copied into the current working directory via the command

make.example.files()

In this session, we will use the previous Ivorian data set which is in VCF format. You can use your own data and try to re-run the analysis using the other file formats cited above.

Ok… now let’s open Rstudio and try it!

Once you’re in R, load the rehh library (if it throws an error, you may need to install it):

library(rehh)

Then, we need to import our newly made files for the Finns:

hap <- data2haplohh(hap_file = "cotedIvoire_DP10_Q20_miss_maf5.vcf",
                    polarize_vcf = FALSE,
                    vcf_reader = "vcfR",
                    chr.name = "Pf3D7_07_v3")

To see the different parameters of data2haplohh function, you can run this command from your R console:

?data2haplohh

Note that the default values were used for the other parameters for thgis practical.

Once those are read in, you should see that our data set consists of 1413 SNPs split into 130 haplotypes.

Now we can run our extended haplotype homozygosity assessment (this step might take a while, be patient):

# perform scan on a single chromosome (calculate iHH values)
ehh <- scan_hh(hap)

Retrieve our iHS scores:

ihs <- ihh2ihs(ehh, freqbin = 0.025)

Note: To retrieve iHS across the whole P.falciparum genome, you might run scan_hh() on VCF data from each chromosome and concatenate the resulting data frames before standardization. In the following example, we use our vcf input data and loop through the 14 chromosomes of the P.falciparum species. The data frame wgscan contains the iHHA and iHHD values for the whole genome and serves as input for the ihh$ihs() function which calculates iHS, i.e. the standardized log ratio of the two iHH values, for each marker.

# DO NOT RUN THESE LINES OF CODE
data <- read.table(VCF, header = FALSE, stringsAsFactors = FALSE)

chromosomes <- data %>% pull(V1) %>% unique()

for(c in 1:length(chromosomes)) {
  # create internal representation
  hh <- data2haplohh(hap_file = vcf_file,
                     polarize_vcf = FALSE,
                     chr.name = chromosomes[c],
                     vcf_reader = "vcfR")
        
  # perform scan on a single chromosome (calculate iHH values)
  scan <- scan_hh(hh)
        
  # concatenate chromosome-wise data frames to
  # a data frame for the whole genome
      
  if (c == 1) {
      wgscan <- scan
  } else {
      wgscan <- rbind(wgscan, scan)
  }
}
     
  # calculate genome-wide iHS values
  wgscan.ihs <- ihh2ihs(wgscan)

We can then look for candidate regions potentially under selection using the iHS score. Here we’re considering a piHS (on the y-axis) greater than 2 as a significant score:

cr.se <- calc_candidate_regions(ihs, threshold = 2, pval = TRUE, 
                                window_size = 3000, overlap = 300, 
                                min_n_extr_mrk = 1)
cr.se

Looks like, in the Plasmodium falciparum from Ivorian population, we have perhaps 6 regions that are potentially under selection.

Let’s plot them:

manhattanplot(ihs, pval = TRUE, threshold = 2, pch = 16, cr = cr.se, 
              chr.name = "Pf3D7_07_v3", main = "iHS (Pf3D7_07_v3)")

In this figure, the only spots of selection that are relevant are between Pf3D7_07_v3:423900-617400, which may include the potential selective sweeps identified in the figure as numbers 2-9. These appear to be immediately up-stream of the region of interest.

Let’s actually check the SNPs making up these significant regions:

library(tidyverse)
pihs <- as_tibble(ihs$ihs[,c(2,4)])

sig.pihs<-
  pihs %>%
  filter(LOGPVALUE >= 2)
sig.pihs

Now we can take a closer look at the EHH of an actual SNP of interest.

We can look at some of these significant SNPs within the region of interest. Let’s take a look at Pf3D7_07_v3:507272, which has among the higher piHS (piHS = 6.114087).

ehh_507272 <- calc_ehh(hap, mrk = "Pf3D7_07_v3:507272")

plot(ehh_507272,
     main="EHH at Pf3D7_07_v3:507272",
     col=c("blue2","gold2"))

We can see here that EHH (on the y-axis) for the haplotypes surrounding the ancestral allele is consistently higher than the EHH surrounding the derived allele, with that EHH getting higher as we approach the SNP along the genome. This suggests that recombination has not yet broken up the linkage block/haplotype surrounding the ancestral allele. In other words, it would appear that the SNP is in a large ancestral haplotype with extensive haplotype homozygosity, suggesting that the ancestral SNP may have experienced a positive selective sweep.

Let’s actually visualize the haplotype structure around our SNP of interest to give us a better idea of the observed effect of selection. The original use of this kind of graph - called a bifurcation graph - is in a paper by Sabeti et al. (2002), in our readings for this module. This will look like a bifurcating tree, with the thicker branches representing the relative number of individuals maintaining that particular haplotype (thicker = more people in the sample have that haplotype) surrounding the core allele (which is at the vertical dashed line). The nodes of the tree are where the haplotype/linkage is broken by a variant outside the original haplotype.

Let’s check it out:

furcation_507272 <- calc_furcation(hap, mrk = "Pf3D7_07_v3:507272")

plot(furcation_507272,
     lwd = 0.1,
     col = c("blue2","gold2"),
     cex.lab = 0.3,
     main = "Bifurcation Pf3D7_07_v3:507272",
     legend.xy.coords = "none")

Here, we can see that there are a lot of individuals in the P.falciparum population that have the derived allele, but that there’s some extensive haplotype homozygosity around the ancestral allele. We can see this because the thicker the branch, the more individuals share haplotypes along that branch. So when we see a very thick branch extend many Mb along the genome, this suggests a selective sweep. In this case, it perhaps looks like a relatively limited selective sweep, given that not eveyrone has this haplotype. For a very clear selective sweep, we would see almost the entire population showing the extended haplotype (e.g., a very large, thick line extending across a large portion of the genomic region).

Now, our most significant iHS hit in this region is Pf3D7_07_v3:507272, which is just outside ………… Let’s take a look:

ehh_507272 <- calc_ehh(hap, mrk = "Pf3D7_07_v3:507272")

plot(ehh_507272,
     main = "EHH at Pf3D7_07_v3:507272",
     col = c("blue2","gold2"))

Ok, this looks like another case where the ancestral allele appears to have more extensive haplotype homozygosity than the derived allele (e.g., the blue line is higher than the gold line across most of the region). Let’s look at the bifurcation diagram:

furcation_46150577 <- calc_furcation(hap, mrk= "3-46150577")

plot(furcation_46150577,
     lwd = 0.1,
     col = c("blue2","gold2"),
     cex.lab = 0.3,
     main = "Bifurcation 3:46150577",
     legend.xy.coords = "none")

That’s more like it! Although linkage isn’t preserved well in all individuals, particularly on the left (as the we follow the genome towards the centromere), the majority of individuals have preserved the haplotype associated with the ancestral allele. For a subset of the population, this haplotype is very well preserved for about 0.4 Mb downstream of the SNP. The previous SNP looks more like a clear case of positive selection around the SNP itself based on iHS, but this one looks like there has been some clearer preservation of the extended haplotype via selection.

Try this with your population. Are there any SNPs that look like clear positive selective sweeps in the P.falciparum GWAS region?

Part 5: What Do Your Results Mean? Discuss with a partner from class:

As we wrap up this lab, think about the results you’ve generated. You have already been prompted to think about what your results might mean as we’ve gone along; now you have the opportunity to talk to a partner about what you think your results might mean.

  • Think about what your results mean, based on how we interpret Fu and Li’s D. Is there an excess of singletons in your population? Is there an excess of old/ancestral mutations? What might that tell you about your population’s history?

  • What is the Tajima’s D statistic and accompanying P-value for your population, and what does that mean regarding potential selection happening on your population?

  • What do you think of the iHS scores in the Malaria risk region in your data? Does this fit with your results based on Tajima’s D or Fu & Li’s D and F? Based on what we’ve discovered today about selection happening in our populations, do you think that a selective sweep may have happened in P.falciparum region for you population? Is there anything that you know about the disease ecology and history of your population that you can use to aid in understanding why there may or may not have been a recent positive selective sweep in your population in this region?





Extra Credit: Sliding window Tajima’s D!

A perhaps more helpful method for assessing Tajima’s D can be done using vcftools. This version analyses Tajima’s D in small sliding windows (here, I’ve chosen 500 bp windows) along the whole chromosome or within the gene region of interest, allowing us to better pinpoint exactly where in the gene region selection may be occurring:

# We chose 500 as the window size around the chromosome 7
vcftools --vcf data/cotedIvoire_DP10_Q20_miss_maf5.vcf --chr Pf3D7_07_v3 --TajimaD 500 
mv out.Tajima.D cotedIvoire.Tajima.D #rename the output file 
mv out.log cotedIvoire.log #rename log file 
less cotedIvoire.Tajima.D 

With this sliding window, we can see that certain sections of P.falciparum region have very different values for Tajima’s D. A good rule of thumb is that a Tajima’s D value above 2 or below -2 is significant. For example, the 500 bp windows starting at position 45823000 has 4 SNPs in it and is negative but not quite significant (D = -1.56823000).

Let’s take a closer look by bringing the output from this analysis into R so that we can filter it according to this significance threshold:

slide.covid <- read.table("cotedIvoire.Tajima.D",header = TRUE)
head(slide.covid)

Now we can filter it by our significance threshold using the tidyverse package:

library(tidyverse)

slide.covid.sig <- slide.covid %>%
  filter(TajimaD >= 2 | TajimaD <= -2)

slide.covid.sig

Here, we can see that we have 12 500-bp windows in P.falciparum region that have significant Tajima’s D values for the Yoruban population. All of these values are positive, suggesting purifying selection on at least one of the SNPs in each of these regions. The most significant region is Pf3D7_07_v3 (row 9 in our table), and there are 5 SNPs in that region that could be the potential focus for selection. Going back to Ensembl, it looks like this 500 bp window is in an intergenic region (between genes), so if it is experiencing purifying selection, it is most likely on an as-yet-unannotated regulatory region for a gene (although, of course, it may also simply be in high LD with a functional region elsewhere). I found this functional information by searching for the 500 bp window, 3:46101500-46102000, in the human reference genome on Ensembl and looking at the Variant Table. Remember, if your population had no variants at certain SNPs, then those SNPs were dropped from your VCF file; this means there may be more SNPs listed on Ensembl than are actually in your sliding window for your population.

The remaining windows listed in our table include intronic variants in the gene LZTFL1 (row 1), intronic variants in the gene FYCO1 (row 2), known regulatory region variants (part of a CTCF binding domain) and transcription factor binding sites (rs535236997, rs553123352, and rs374572137) for the gene NFATC1 (this plays a role in cytokine expression in T-cells) (row 10), an intergenic region containing a variant (rs13063635) known to significantly decrease the eosinophil percentage among granulocytes (these are white blood cells important to modulating inflammatory processes) (row 12), and more functionally unknown intergenic variants (rows 3, 4, 5, 6, 7, 8, 9, 11).

These are some very interesting phenotypes, with potential importance to Malaria response (P.falciparum has, after all, been characterized as a ‘cytokine storm’, and in part a disease of inflammation)! Because of this, it might be interesting to see if there could be any evidence of haplotype homozygosity around the loci found within these windows to bolster support for selection in these regions.

Let’s check out rs6772051 (3:46101567), which is an intergenic variant in our window (row 9; 3:46101500-46102000) that shows the highest value for Tajima’s D:

ehh_46101567 <- calc_ehh(hap, mrk = "3-46101567")

plot(ehh_46101567,
     main="EHH at 3:46101567",
     col=c("blue2","gold2"))

Definitely looks like the ancestral allele has some more extensive haplotype homozygosity than the derived allele (e.g., the blue line is higher than the gold line across most of the region). Let’s look at the bifurcation diagram:

furcation_46101567 <- calc_furcation(hap, mrk = "3-46101567")

plot(furcation_46101567,
     lwd=0.1,
     col=c("blue2","gold2"),
     cex.lab=0.3,
     main="Bifurcation Diagram\nrs6772051 in YRI",
     legend.xy.coords="none")

And yes, we can see some very clear extended haplotype homozygosity for the haplotype containing the ancestral allele! Particularly downstream of the variant.

What regions show significantly positive or negative Tajima’s D in your population? Are they the same as in the YRI population? Are they implicated in similar traits or genomic functions?