| tags: [ Data Cleaning Genomic Data Linux PLINK PCA R ] categories: [Coding Experiments ]

Extracting common variants from both data sets

Introduction

The merged data set did not produce the expected output. I expected that the merging would only retain data from common variants between both genomic data sets. To resolve this, I need to compare the two data sets and extract the list of common variant IDs and use this to extract the relevant data.

Methods

1.Load .bim files for both data sets

wgs_bim_data <- read.table('C:/Users/Martha/Documents/Honours/Project/honours.project/Data/NI_wgs_merged_snps.bim', header = F)

head(wgs_bim_data)
##   V1          V2 V3    V4 V5 V6
## 1  1     1:11075  0 11075  A  G
## 2  1 rs546169444  0 14464  T  A
## 3  1   rs2691315  0 15820  T  G
## 4  1 rs375964566  0 15922  G  A
## 5  1 rs112448831  0 15956  A  G
## 6  1 rs372319358  0 16068  C  T
snp_bim_data <- read.table('C:/Users/Martha/Documents/Honours/Project/honours.project/Data/NImerged.impute2.chrAllX.2014-Oct-02.bim', header = F)

head(snp_bim_data)
##   V1          V2 V3    V4 V5 V6
## 1  1  rs58108140  0 10583  A  G
## 2  1 rs189107123  0 10611  G  C
## 3  1 rs180734498  0 13302  T  C
## 4  1 rs144762171  0 13327  C  G
## 5  1 rs151276478  0 13980  C  T
## 6  1 rs140337953  0 30923  G  T

2.Merge files to obtain list of common variants

merged_bim_data <- merge(wgs_bim_data, snp_bim_data, by = 'V2')

head(merged_bim_data)
##           V2 V1.x V3.x      V4.x V5.x V6.x V1.y V3.y      V4.y V5.y V6.y
## 1  rs1000000   12    0 126406434    A    G   12    0 126890980    A    G
## 2 rs10000003    4    0  56695481    A    G    4    0  57561647    A    G
## 3 rs10000005    4    0  84240405    G    A    4    0  85161558    G    A
## 4 rs10000006    4    0 107905227    C    T    4    0 108826383    C    T
## 5 rs10000010    4    0  21617051    C    T    4    0  21618674    T    C
## 6 rs10000012    4    0   1363537    G    C    4    0   1357325    G    C

3.Extract list of common variant IDs

merged_rsID <- merged_bim_data$V2

head(merged_rsID)
## [1] rs1000000  rs10000003 rs10000005 rs10000006 rs10000010 rs10000012
## 9343280 Levels: 1:100002796 1:100003411 1:100003413 1:100003414 ... rs9999998

4.Export list to a .txt file

write.table(merged_rsID, file = 'C:/Users/Martha/Documents/Honours/Project/honours.project/Data/merged_rsID.txt', sep = "\t", row.names = F, col.names = F)

5.Extract data based on list of IDs

plink1.9 --bfile plink_output/var_out_merge --extract merged_rsID.txt --make-bed --out plink_output/merged_common_data

Last line of code in PLINK produces an error message - Error: No variants remaining after –extract. I’ve tried changing separators from tab to spaces and removing column and row names to match the format of the files. PLINK manual suggests that the ‘extract’ function will accept a text file with a list of variant IDs. Recurring error may be a formatting issue or the commands in PLINK may be in the wrong order or missing a step.