| tags: [ Data Cleaning Genomic Data Linux PLINK ] categories: [Coding Experiments ]
Merging genomic data sets
Introduction
Merging the two genomic data sets in a single command failed because my laptop does not have enough RAM. Instead, I will split both data sets by chromosome, merge them chromosome by chromosome, and then stitch the per-chromosome results together into one unified merged file set.
Method
1. Use a ‘for’ loop to split the SNP data by chromosome
for i in {1..23}
do plink1.9 --bfile plink_output/snp_exclude --chr $i --make-bed --out merged_chr_data/snp_chr$i
done
2. Repeat the ‘for’ loop to split the WGS data by chromosome
for i in {1..23}
do plink1.9 --bfile plink_output/wgs_exclude --chr $i --make-bed --out merged_chr_data/wgs_chr$i
done
3. Merge SNP and WGS data by chromosome
for i in {1..23}
do plink1.9 --bfile merged_chr_data/wgs_chr$i --bmerge merged_chr_data/snp_chr$i.bed merged_chr_data/snp_chr$i.bim merged_chr_data/snp_chr$i.fam --make-bed --out merged_chr_data/merged_chr$i
done
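The three loops above can also be folded into one loop that stops at the first chromosome that fails. The following is only a sketch under assumptions: `PLINK_BIN`, `DRY_RUN`, `run` and `process_chr` are hypothetical names introduced for this example, and the script defaults to a dry run that prints the commands instead of executing them.

```shell
# Sketch: split both data sets and merge them, one chromosome at a time.
# PLINK_BIN and DRY_RUN are hypothetical variables introduced for this example.
PLINK_BIN="${PLINK_BIN:-plink1.9}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = actually run them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Split SNP data, split WGS data, then merge the two for chromosome $1.
# Each step only runs if the previous one succeeded.
process_chr() {
  run "$PLINK_BIN" --bfile plink_output/snp_exclude --chr "$1" \
      --make-bed --out "merged_chr_data/snp_chr$1" &&
  run "$PLINK_BIN" --bfile plink_output/wgs_exclude --chr "$1" \
      --make-bed --out "merged_chr_data/wgs_chr$1" &&
  run "$PLINK_BIN" --bfile "merged_chr_data/wgs_chr$1" \
      --bmerge "merged_chr_data/snp_chr$1.bed" "merged_chr_data/snp_chr$1.bim" "merged_chr_data/snp_chr$1.fam" \
      --make-bed --out "merged_chr_data/merged_chr$1"
}

mkdir -p merged_chr_data   # make sure the output directory exists
for i in $(seq 1 23); do
  process_chr "$i" || { echo "chromosome $i failed, stopping" >&2; break; }
done
```

Running it with `DRY_RUN=0` would execute the same PLINK commands as steps 1 to 3.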
4. Create a .txt file listing the file names of all the merged file sets, one file set per line
For example:
merged_chr_data/merged_chr2.bed merged_chr_data/merged_chr2.bim merged_chr_data/merged_chr2.fam
merged_chr_data/merged_chr3.bed merged_chr_data/merged_chr3.bim merged_chr_data/merged_chr3.fam
5. Stitch together all merged files (chr 1-23)
plink1.9 --bfile merged_chr_data/merged_chr1 --merge-list merged_chr_data/all_merged.txt --make-bed --out plink_output/var_out_merge
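The list file passed to --merge-list does not have to be typed by hand. As a minimal sketch, assuming the file name all_merged.txt from the command above and that chromosome 1 is left out of the list because it is already given to --bfile (which the example entries starting at chromosome 2 suggest), a loop could write it. The variables `out_dir`, `list_file` and `prefix` are names introduced for this example.

```shell
# Sketch: write one line per chromosome (2-23) to the merge list.
# Chromosome 1 is skipped on the assumption that it is the --bfile base set.
out_dir="merged_chr_data"
list_file="$out_dir/all_merged.txt"

mkdir -p "$out_dir"
: > "$list_file"   # start from an empty file

for i in $(seq 2 23); do
  prefix="$out_dir/merged_chr$i"
  # One file set per line: .bed, .bim and .fam separated by spaces
  echo "$prefix.bed $prefix.bim $prefix.fam" >> "$list_file"
done
```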
The stitching command produced the same ‘out of memory’ error that I received when I attempted to merge the original file sets. To try to resolve this, I increased PLINK's workspace memory from 8 GB to 12 GB with the --memory flag:
plink1.9 --bfile merged_chr_data/merged_chr1 --merge-list merged_chr_data/all_merged_chr.txt --memory 12000 --make-bed --out plink_output/var_out_merge
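The value given to --memory is a workspace size in MiB. Rather than guessing a number, it could be derived from the memory that is currently free. This is a sketch under assumptions: it relies on the Linux /proc/meminfo format, and the function name `workspace_mib` and the 75 percent share in the usage comment are hypothetical choices made for this example.

```shell
# Sketch: turn a percentage of the available memory into a MiB value for --memory.
# Usage: workspace_mib MEMINFO_FILE PERCENT
workspace_mib() {
  # MemAvailable is reported in kB; scale by the percentage, then convert to MiB
  awk -v pct="$2" '/^MemAvailable:/ { printf "%d\n", $2 * pct / 100 / 1024 }' "$1"
}

# Example use on Linux (not executed here):
#   mem=$(workspace_mib /proc/meminfo 75)
#   then add: --memory "$mem" to the plink1.9 stitching command
```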
I did not let the 12 GB run finish because it was producing numerous ‘multiple chromosomes’ warnings. Miles realised that the SNP array data was imputed using the hg19 build, whereas the WGS data uses hg38. A variant can move or disappear between builds, so the same variant ID can carry different coordinates in the two data sets, hence the ‘multiple locations/chromosomes’ warnings.
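One way to see how widespread the build mismatch is would be to compare the .bim files of the two data sets directly. The sketch below is an illustration, not something that was run for this post: it assumes the standard six-column .bim layout (chromosome, variant ID, genetic distance, base-pair position, allele 1, allele 2), and the function name `bim_mismatches` is hypothetical.

```shell
# Sketch: list variant IDs found in both .bim files whose chromosome or
# base-pair position differs between the two files.
# Usage: bim_mismatches FIRST.bim SECOND.bim
bim_mismatches() {
  awk 'NR == FNR { chr[$2] = $1; pos[$2] = $4; next }       # remember first file
       ($2 in chr) && (chr[$2] != $1 || pos[$2] != $4) {    # shared ID, new location
         print $2, chr[$2] ":" pos[$2], $1 ":" $4
       }' "$1" "$2"
}

# Example use (not executed here):
#   bim_mismatches merged_chr_data/wgs_chr1.bim merged_chr_data/snp_chr1.bim
```

Each output line gives the variant ID, then its chromosome:position in the first file and in the second file.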
Results
I took note of the number of variants and genotyping rates that resulted from splitting and merging the data sets.
chr_var_merge <- read.csv('C:/Users/Martha/Documents/Honours/Project/honours.project/Data/variant_chr_split.csv', header = TRUE)
chr_var_merge
##    Chr WGS_var WGS_geno SNP_var SNP_geno merge_var merge_geno
## 1    1 1471938   0.9982 2045396   0.9866   2794250     0.6876
## 2    2 1539904   0.9983 2230433   0.9885   2987480     0.6986
## 3    3 1245134   0.9986 1868276   0.9886   2460194     0.7076
## 4    4 1230061   0.9986 1866123   0.9868   2427397     0.7142
## 5    5 1149159   0.9983 1717607   0.9891   2271635     0.7052
## 6    6 1142942   0.9985 1661762   0.9877   2223964     0.6985
## 7    7 1059807   0.9983 1516898   0.9859   2036064     0.6967
## 8    8  949817   0.9986 1485583   0.9891   1924936     0.7157
## 9    9  782401   0.9983 1129601   0.9854   1513331     0.6970
## 10  10  782201   0.9982 1292199   0.9881   1684297     0.7063
## 11  11  890397   0.9981 1288324   0.9883   1720229     0.7008
## 12  12  891005   0.9983 1246651   0.9880   1687093     0.6944
## 13  13  646982   0.9983  934764   0.9887   1233829     0.7094
## 14  14  608194   0.9982  857735   0.9845   1151687     0.6970
## 15  15  536960   0.9983  769800   0.9845   1039657     0.6914
## 16  16  582842   0.9983  828336   0.9833   1125027     0.6876
## 17  17  563589   0.9979  714075   0.9836   1033747     0.6555
## 18  18  515967   0.9978  742614   0.9872    990748     0.7012
## 19  19  521449   0.9970  569788   0.9761    881924     0.6234
## 20  20  440425   0.9981  587911   0.9870    822813     0.6751
## 21  21  272203   0.9977  356500   0.9849    503652     0.6694
## 22  22  303125   0.9974  346997   0.9792    521489     0.6389
## 23  23  764933   0.9633 1241272   0.9798   1721025     0.6573
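As a rough reading of the table, and assuming merge_var counts every distinct variant from either data set exactly once, the number of variants shared by both data sets on a chromosome would be WGS_var + SNP_var - merge_var. For chromosome 1 that is 1471938 + 2045396 - 2794250 = 723084. A minimal sketch of the same calculation for every row, assuming the CSV has a header line and the same column order as the printed data frame (`shared_variants` is a hypothetical function name):

```shell
# Sketch: estimate the number of variants shared by both data sets per chromosome.
# Assumed CSV columns: Chr,WGS_var,WGS_geno,SNP_var,SNP_geno,merge_var,merge_geno
# Usage: shared_variants FILE.csv
shared_variants() {
  # Skip the header, then print: chromosome, WGS_var + SNP_var - merge_var
  awk -F, 'NR > 1 { print $1, $2 + $4 - $6 }' "$1"
}

# Example use (not executed here):
#   shared_variants variant_chr_split.csv
```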
I will abandon this exercise for now, because the SNP array data first has to be converted to the hg38 build before the two data sets can be merged.