| tags: [ Data Cleaning Genomic Data Linux PLINK ] categories: [Coding Experiments ]

Merging genomic data sets

Introduction

Merging the two genomic data sets caused an error because my laptop has inadequate RAM to perform the merge in a single command. Therefore, I will split the data and merge them by chromosome and stitch these together to create a unified merged file set.

Method

1. Use ‘for’ loop to split SNP data by chromosome

for i in {1..23}
  do plink1.9 --bfile plink_output/snp_exclude --chr $i --make-bed --out merged_chr_data/snp_chr$i
done

2. Repeat ‘for’ loop to split WGS data by chromosome

for i in {1..23}
  do plink1.9 --bfile plink_output/wgs_exclude --chr $i --make-bed --out merged_chr_data/wgs_chr$i
done

3. Merge SNP and WGS data by chromosome

for i in {1..23}
  do plink1.9 --bfile merged_chr_data/wgs_chr$i --bmerge merged_chr_data/snp_chr$i.bed merged_chr_data/snp_chr$i.bim merged_chr_data/snp_chr$i.fam --make-bed --out merged_chr_data/merged_chr$i 
done

5. Create .txt file with the file names of all the merged file sets

For example: merged_chr_data/merged_chr2.bed merged_chr_data/merged_chr2.bim merged_chr_data/merged_chr2.fam merged_chr_data/merged_chr3.bed merged_chr_data/merged_chr3.bim merged_chr_data/merged_chr3.fam

4. Stitch together all merged files (chr 1 -23)

plink1.9 --bfile merged_chr_data/merged_chr1 --merge-list merged_chr_data/all_merged.txt --make-bed --out plink_output/var_out_merge

This produced the same ‘out of memory’ error that I received when I attempted to merge the original file sets. To try and resolve the issue, I increased the RAM for PLINK workspace from 8GB to 12GB by:

plink1.9 --bfile merged_chr_data/merged_chr1 --merge-list merged_chr_data/all_merged_chr.txt --memory 12000 --make-bed --out plink_output/var_out_merge

I did not let this run finish as it was producing numerous ‘multiple chromosomes’ warnings. Miles has realised that the SNP array data was imputed using the hg19 build whereas the WGS was with hg38. SNPs can change (move or disappear) between builds, thus the ‘multiple locations/chromosomes’ warning.

Results

I took note of the number of variants and genotyping rates that resulted from splitting and merging the data sets.

chr_var_merge <- read.csv('C:/Users/Martha/Documents/Honours/Project/honours.project/Data/variant_chr_split.csv', header = T)

chr_var_merge
##    Chr WGS_var WGS_geno SNP_var SNP_geno merge_var merge_geno
## 1    1 1471938   0.9982 2045396   0.9866   2794250     0.6876
## 2    2 1539904   0.9983 2230433   0.9885   2987480     0.6986
## 3    3 1245134   0.9986 1868276   0.9886   2460194     0.7076
## 4    4 1230061   0.9986 1866123   0.9868   2427397     0.7142
## 5    5 1149159   0.9983 1717607   0.9891   2271635     0.7052
## 6    6 1142942   0.9985 1661762   0.9877   2223964     0.6985
## 7    7 1059807   0.9983 1516898   0.9859   2036064     0.6967
## 8    8  949817   0.9986 1485583   0.9891   1924936     0.7157
## 9    9  782401   0.9983 1129601   0.9854   1513331     0.6970
## 10  10  782201   0.9982 1292199   0.9881   1684297     0.7063
## 11  11  890397   0.9981 1288324   0.9883   1720229     0.7008
## 12  12  891005   0.9983 1246651   0.9880   1687093     0.6944
## 13  13  646982   0.9983  934764   0.9887   1233829     0.7094
## 14  14  608194   0.9982  857735   0.9845   1151687     0.6970
## 15  15  536960   0.9983  769800   0.9845   1039657     0.6914
## 16  16  582842   0.9983  828336   0.9833   1125027     0.6876
## 17  17  563589   0.9979  714075   0.9836   1033747     0.6555
## 18  18  515967   0.9978  742614   0.9872    990748     0.7012
## 19  19  521449   0.9970  569788   0.9761    881924     0.6234
## 20  20  440425   0.9981  587911   0.9870    822813     0.6751
## 21  21  272203   0.9977  356500   0.9849    503652     0.6694
## 22  22  303125   0.9974  346997   0.9792    521489     0.6389
## 23  23  764933   0.9633 1241272   0.9798   1721025     0.6573

I will abandon this exercise because the SNP array data has to be changed to the hg38 build.