in Bioinformatics, Data Science

Using PCA and adegenet library for a population alleles analysis.

First exam is passed, and now I can show, what’ve been doing during it. The data provided with the knitr file, so the research is reproducible.

Data for analysis

Knitr file

Resulting html

PCA for population analysis

Okay, today we use 5 textfiles, containing information about different alleles sequences to see how they differ from one population to another. We will load all the data sets, make one data frame for further interactions. Than we will convert this data to genind object to save some time and just get directly to the Principal Component Analysis.

We begin with loading datasets and required library. Than we add information about from which population, genotypes part were aquired. EastAsians already have a column “population”, so we just skip it.

## Warning: package 'adegenet' was built under R version 2.15.3
## Loading required package: ade4
## Warning: package 'ade4' was built under R version 2.15.3
##    ==========================
##     adegenet 1.3-9.2 is loaded
##    ==========================
##  - to start, type '?adegenet'
##  - to browse adegenet website, type 'adegenetWeb()'
##  - to post questions/comments:
# start reading the data
ASW <- read.delim("ASW.txt")
CEU <- read.delim("CEU.txt")
CLM <- read.delim("CLM.txt")
EastAsians <- read.delim("EastAsians.txt")
YRI <- read.delim("YRI.txt")
# Name populations
ASW$population = "ASW"
CEU$population = "CEU"
CLM$population = "CLM"
YRI$population = "YRI"

Than, we merge these datasets into one dataframe by using rbind function (which adds data by row)

# Start merging the population
test <- rbind(ASW, CEU)
test <- rbind(test, CLM)
test <- rbind(test, EastAsians)
test <- rbind(test, YRI)
# Showing first 6 columns of the reslting dataset
head(test[, 1:6])
##       ind rs35744813 rs4920552 rs3935570 rs12137730 rs7534618
## 1 NA19625        T/T       T/T       G/G        A/A       G/G
## 2 NA19700        T/T       T/T       G/G        A/A       G/G
## 3 NA19701        T/T       C/C       G/G        C/C       T/T
## 4 NA19703        T/T       T/T       G/G        A/A       G/G
## 5 NA19704        T/T       T/T       T/T        A/A       T/T
## 6 NA19707        T/T       C/C       T/T        A/A       G/G

Now we have resulting dataset. Rows are individuals, for which we have the information about the allels. Allels presented in the dataframe itself.

In the next chunk we convert data to genind format for easy data handling. We are replacing the missing the values with calculated mean, because otherwise we won’t be able to run the PCA.

# Converting the data to genind format
data4PCA = df2genind(test, sep = "/", pop = test$population, missing = "mean", 
    ploidy = 2, type = c("codom", "PA"))
data.x = scaleGen(data4PCA)
data_prepared = scaleGen(data4PCA)
# Scaling the data, before perforing the PCA
data_prepared = scaleGen(data4PCA)

We have also scaled the data to prepare it for the PCA.

# PCA itself
data_pca = dudi.pca(data_prepared, cent = F, scale = F, scannf = FALSE, nf = 4)

Now we want to know, how many percents of variance are explained by each variable. They are already sorted, so all we have to do is to calculate it and then check that we have got 100% of sum.

eig = data.frame(data_pca$eig)
eig$percentage = (eig[, 1]/sum(eig$data_pca.eig))
# Check that we've got percentage correctly
## [1] 1
# How much % of variance do first to Principal Components explain?
## [1] 0.1425

Okay, check how much entrys we have per each population

# Now get the number of samples for each population with table function
##   61   85   97  100   60   88
# Check that sum is 491
## [1] 491

And now we plot the results for first two principal components:

# Plotting the data
plot(data_pca$li[, 1:2], type = "pch", pch = c(rep(21, 61), rep(22, 85), rep(23, 
    97), rep(24, 100), rep(25, 60)), bg = c(rep("orange", 61), rep("violet", 
    85), rep("red", 97), rep("blue", 100), rep("green", 60), rep("yellow", 88)))
## Warning: ??? ??????? 'pch' ????? ??????? ?? ??????? ?????
# Plotting the legend
legend(7, 6, legend = c("ASW", "CEU", "CHB", "CHS", "CLM", "YRI"), col = c("orange", 
    "violet", "red", "blue", "green", "yellow"), pch = 16)


This plot shows us, that people from European population (CEU) have a lot of common with people chinese populations (CHB,CHS). Though they are not really far from the african genotype (YRI). CLM is the population from the latin america, so, the plot is only approving the migration theory, showing how differences in allels were spread all other the world.

This is the image from wikipedia, showing the spread of human population: