| tags: [ Genomic Data Linux PCA PLINK R ] categories: [Coding Experiments ]

Performing PCA on genomic data

Introduction

The final step of the second part of aim 1 is to perform a PCA on the genomic data to investigate population structure. This is important because it will show whether there is any underlying genetic structure in the data, so that I can account for it, and for relatedness among individuals, in subsequent analyses.

Methods and Results

1. Run PCA

plink1.9 --bfile merged_nies --pca 180 var-wts --out plink_output/nies_final_pca

The number of components requested with --pca had to be raised from the default of 20 to 180 because more than 100 components have an eigenvalue >1.

Including the var-wts modifier makes PLINK write an additional .eigenvec.var file, which expresses each principal component as variant weights rather than sample weights.
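Alongside the eigenvalues, the same run writes a .eigenvec file with one row per sample, which is the file needed to actually look at population structure. The sketch below is only an illustration under stated assumptions: it builds a small simulated stand-in for that file (the sample IDs and values are made up) and assumes the usual PLINK 1.9 layout of FID, IID, then one column per component, with no header line because the header modifier was not used.

```r
# Simulated stand-in for a PLINK .eigenvec file (hypothetical data, not NIES):
# FID, IID, then one column per principal component, no header line.
set.seed(1)
n_samples <- 10
n_pcs <- 3
demo_eigenvec <- data.frame(
  FID = paste0("FAM", seq_len(n_samples)),
  IID = paste0("IND", seq_len(n_samples)),
  matrix(rnorm(n_samples * n_pcs, sd = 0.05), nrow = n_samples)
)
demo_path <- tempfile(fileext = ".eigenvec")
write.table(demo_eigenvec, demo_path,
            quote = FALSE, row.names = FALSE, col.names = FALSE)

# Read it back the way the real file would be read, then name the columns
eigenvec <- read.table(demo_path, header = FALSE)
colnames(eigenvec) <- c("FID", "IID",
                        paste0("PC", seq_len(ncol(eigenvec) - 2)))

# Scatter of the first two components; clusters would suggest structure
plot(eigenvec$PC1, eigenvec$PC2,
     xlab = "PC1", ylab = "PC2",
     main = "Samples on the first two principal components",
     pch = 19, col = "lightskyblue4")
```

For the real data, demo_path would be replaced with the path to nies_final_pca.eigenvec.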

2. Load eigenvalue results

# The .eigenval file has one eigenvalue per line and no header
nies_final_pca_eigenval <- read.table('C:/Users/Martha/Documents/Honours/Project/honours.project/Data/plink_output/nies_final_pca.eigenval', header = FALSE)

head(nies_final_pca_eigenval)
##        V1
## 1 4.25322
## 2 3.95956
## 3 3.18262
## 4 2.86084
## 5 2.80460
## 6 2.73255
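Before plotting, it can be worth checking that the file loaded as expected. The following is a minimal sketch, not part of the original analysis: read_eigenval is a hypothetical helper, and the demo runs on a temporary file holding only the six values printed above. It assumes the eigenvalues are listed from largest to smallest, one per line.

```r
# Hypothetical helper: read a .eigenval file and run basic sanity checks
read_eigenval <- function(path, expected_n = NULL) {
  stopifnot(file.exists(path))
  vals <- read.table(path, header = FALSE, col.names = "eigenvalue")
  # Assumption: eigenvalues are listed from largest to smallest
  stopifnot(!is.unsorted(rev(vals$eigenvalue)))
  # Optionally confirm the number of components matches what was requested
  if (!is.null(expected_n)) stopifnot(nrow(vals) == expected_n)
  vals
}

# Demo on a temporary file containing the six values shown above
demo_path <- tempfile(fileext = ".eigenval")
writeLines(c("4.25322", "3.95956", "3.18262",
             "2.86084", "2.80460", "2.73255"), demo_path)
demo_vals <- read_eigenval(demo_path, expected_n = 6)
```

On the real file, expected_n would be 180 to match the count passed to --pca.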

3. Produce scree plot

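# barplot() returns the x positions of the bar midpoints; reuse them so the
# line overlay sits on the bars rather than drifting off at x = 1, 2, 3, ...
bar_mids <- barplot(nies_final_pca_eigenval$V1,
                    names.arg = 1:nrow(nies_final_pca_eigenval),
                    main = "NIES PCA Eigenvalue",
                    xlab = "Principal Components",
                    ylab = "Eigenvalue",
                    col = "lightskyblue2")
lines(x = bar_mids, y = nies_final_pca_eigenval$V1,
      type = "b", pch = 19, col = "red")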

There are 124 principal components that have an eigenvalue >1.
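The count of components with an eigenvalue >1 can be read off the eigenvalue vector directly rather than from the plot. A minimal sketch, using the six eigenvalues printed earlier plus two made-up values below 1 (these two are hypothetical and only there so the threshold has something to exclude):

```r
# Six eigenvalues from the head() output above, followed by two
# hypothetical values below 1 for illustration only
eigenvalues <- c(4.25322, 3.95956, 3.18262, 2.86084, 2.80460, 2.73255,
                 0.98, 0.91)

# Comparing against 1 gives a logical vector; summing it counts the TRUEs
n_above_one <- sum(eigenvalues > 1)
n_above_one
```

On the real data the equivalent call would be sum(nies_final_pca_eigenval$V1 > 1).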