| tags: [ Data Cleaning Genomic Data Linux PLINK R ] categories: [Coding Experiments ]
Extracting data for variants common in both file sets
Introduction
The merge function in PLINK appears to exclude variants that are common in both data sets, which is unexpected.The following work around extracts the variants common in both file sets and merges the data.
NOTE: this exercise was performed on taurus.
Methods and Results
1. Extract SNP IDs from both .bim files and sort them numerically.
awk '{print $2}' NI.snp.hg38_final.bim | sort > array_snp_hg38_sorted.txt
awk '{print $2}' NI_wgs_merged_snps.bim | sort > wgs_snp_hg38_sorted.txt
2. Find and extract the IDs common in both files
comm -12 array_snp_hg38_sorted.txt wgs_snp_hg38_sorted.txt > intersecting_snps.txt
3. Check the number of overlapping SNPs
wc -l intersecting_snps.txt
Number of overlaps: 9,165,945 variants
4. Generate random sample of SNPs
shuf -n 10 intersecting_snps.txt
Example output: rs61301417 rs1028805 rs130007 rs75556236 rs60684469
5. ‘grep’ random SNPs and check that they are the same (same position) between files
grep -w 'rs61301417' *.bim
Example of output: NI.snp.hg38_final.bim: 3 rs61301417 0 143378975 NI_wgs_merged_snps.bim:3 rs61301417 0 143378975
This indicates that the SNPs have the same position - checks that the reference genome builds are the same (confirms that the conversion from previous exercise was successful).
6. Extract data using intersecting_snps.txt from origninal file sets
plink --bfile NI_wgs_merged_snps --extract intersecting_snps.txt --make-bed --out plink_output/wgs_intersect_snps
plink --bfile NI.snp.hg38_final --extract intersecting_snps.txt --make-bed --out plink_output/array_intersect_snps
Note: this step creates new file sets
Output (WGS): - 19,208,586 variants loaded from original file - after extracting: 9,165,945 variants remain - total genotyping rate is 99.99% - 7 het. haploid genotype warning remains
Output (SNP): - 26,055,300 variants loaded from original file - after extracting: 9,165,945 variants remain - total genotyping rate ir=s 97.10%
7. Merge the file sets to obtain a list of variants with multiple positions/chromosomes and multiple alleles
plink --bfile plink_output/wgs_intersect_snps --bmerge plink_output/array_intersect_snps.bed plink_output/array_intersect_snps.bim plink_output/array_intersect_snps.fam --make-bed --out plink_output/trial_intersect_merge
Output: - 700+ variants have multiple position warnings - 10,453 variants have 3+ alleles (caused merge fail)
8. Obtain list of variants with multiple positions/chromosomes
grep -Eo 'rs[0-9]{1,}' plink_output/trial_intersect_merge.log | sort | uniq > plink_output/multpos_var.txt
9. Remove variants with multiple positions/chromosomes
plink --bfile plink_output/wgs_intersect_snps --exclude plink_output/multpos_var.txt --make-bed --out plink_output/wgs_multpos_out
plink --bfile plink_output/array_intersect_snps --exclude plink_output/multpos_var.txt --make-bed --out plink_output/array_multpos_out
Note: this step creates new file sets
Output (both): - 9,165,174 variants remain (771 variants removed)
10. Remove variants with 3+ alleles
plink --bfile plink_output/wgs_multpos_out --exclude plink_output/trial_intersect_merge-merge.missnp --make-bed --out plink_output/wgs_filtered
plink --bfile plink_output/array_multpos_out --exclude plink_output/trial_intersect_merge-merge.missnp --make-bed --out plink_output/array_filtered
Note: this step creates new file sets
Output (both): - 9,155,053 variants remain (10,121 variants removed) - this number is less than expected (10,453), assuming the difference was filtered out from mult_pos
11. Merge these file sets - this should create the final set containing all variants for subsequent analysis
plink --bfile plink_output/wgs_filtered --bmerge plink_output/array_filtered.bed plink_output/array_filtered.bim plink_output/array_filtered.fam --make-bed --out plink_output/filtered_snp_merge
Output: - 9,155,053 variants and 614 people pass filters and QC - total genotyping rate is 97.62%
Currently, the files contain data for 614 individuals (108 from WGS, 506 from SNP-array). However, there are individuals with both. WGS data has a different ID system to SNP-array, therefore all individuals are recognised as unique. The IDs in WGS need to be corrected before proceeding - this can be done with the wgs_filtered file sets before merging.