| tags: [ Genomic Data Linux PLINK R ] categories: [Coding Experiments ]

Generating basic statistics for merged data set

Introduction

Error when extracting common variant IDs needs to be resolved.

After successfully exporting the list of common IDs, the relevant genomic data will need to be extracted and basic statistics will need to be generated.

Methods and results

I believe the error in PLINK when extracting data is due to the formatting of the IDs in the text file.

1.Try removing quotation marks from the list of IDs

write.table(merged_rsID, file = 'C:/Users/Martha/Documents/Honours/Project/honours.project/Data/merged_rsID.txt', row.names = F, col.names = F, quote = F)

2.Check that the quotation marks were removed

head merged_rsID.txt

3.Use exported ID list to extract data

plink1.9 --bfile plink_output/var_out_merge --extract merged_rsID.txt --make-bed --out plink_output/common_id_merge

Note: var_out_merge is the file set that contains 30million+ variants and had the variants with 3+ alleles removed

This was successful, and the output indicates that new file set contains 5,789,188 variants with 614 individuals.

4.Find allele frequency for new file set.

plink1.9 --bfile plink_output/common_id_merge --freq --out plink_output/merged_allele_freq

Output: genotyping rate is 97.02%.

5.Generate missingness stats

plink1.9 --bfile plink_output/common_id_merge --missing --out plink_output/merged_missing

Visualise allele frequency distributions.

6.Load allele frequency files for the original file sets

wgs_allele_freq <- read.table('C:/Users/Martha/Documents/Honours/Project/honours.project/Data/plink_output/wgs_allele_freq.frq', header = T)

snp_allele_freq <- read.table('C:/Users/Martha/Documents/Honours/Project/honours.project/Data/plink_output/snp_allele_freq.frq', header = T)

7.Generate histograms

hist(wgs_allele_freq$MAF, main = "Allele Frequency Distributions of WGS Data", xlab = "Minor Allele Frequency", xlim = c(0.0, 0.5), col = "darkseagreen1")

hist(snp_allele_freq$MAF, main = "Allele Frequency Distribution of SNP Data", xlab = "Minor Allele Frequency", xlim = c(0.0, 0.5), col = "cadetblue1")

8.Load allele frequency file for merged data set

merged_allele_freq <- read.table('C:/Users/Martha/Documents/Honours/Project/honours.project/Data/plink_output/merged_allele_freq.frq', header = T)

9.Generate histogram for allele freq distributions

hist(merged_allele_freq$MAF, main = "Allele Frequency Distributions of Merged Data", xlab = "Minor Allele Frequency", xlim = c(0.0, 0.5), col = "lightpink")