| tags: [ Data Cleaning Genomic Data Linux PLINK R ] categories: [Coding Experiments ]

Extracting data for variants common in both file sets

Introduction

The merge function in PLINK appears to exclude variants that are common in both data sets, which is unexpected.The following work around extracts the variants common in both file sets and merges the data.

NOTE: this exercise was performed on taurus.

Methods and Results

1. Extract SNP IDs from both .bim files and sort them numerically.

awk '{print $2}' NI.snp.hg38_final.bim | sort > array_snp_hg38_sorted.txt

awk '{print $2}' NI_wgs_merged_snps.bim | sort > wgs_snp_hg38_sorted.txt

2. Find and extract the IDs common in both files

comm -12 array_snp_hg38_sorted.txt wgs_snp_hg38_sorted.txt > intersecting_snps.txt

3. Check the number of overlapping SNPs

wc -l intersecting_snps.txt

Number of overlaps: 9,165,945 variants

4. Generate random sample of SNPs

shuf -n 10 intersecting_snps.txt

Example output: rs61301417 rs1028805 rs130007 rs75556236 rs60684469

5. ‘grep’ random SNPs and check that they are the same (same position) between files

grep -w 'rs61301417' *.bim

Example of output: NI.snp.hg38_final.bim: 3 rs61301417 0 143378975 NI_wgs_merged_snps.bim:3 rs61301417 0 143378975

This indicates that the SNPs have the same position - checks that the reference genome builds are the same (confirms that the conversion from previous exercise was successful).

6. Extract data using intersecting_snps.txt from origninal file sets

plink --bfile NI_wgs_merged_snps --extract intersecting_snps.txt --make-bed --out plink_output/wgs_intersect_snps

plink --bfile NI.snp.hg38_final --extract intersecting_snps.txt --make-bed --out plink_output/array_intersect_snps

Note: this step creates new file sets

Output (WGS): - 19,208,586 variants loaded from original file - after extracting: 9,165,945 variants remain - total genotyping rate is 99.99% - 7 het. haploid genotype warning remains

Output (SNP): - 26,055,300 variants loaded from original file - after extracting: 9,165,945 variants remain - total genotyping rate ir=s 97.10%

7. Merge the file sets to obtain a list of variants with multiple positions/chromosomes and multiple alleles

plink --bfile plink_output/wgs_intersect_snps --bmerge plink_output/array_intersect_snps.bed plink_output/array_intersect_snps.bim plink_output/array_intersect_snps.fam --make-bed --out plink_output/trial_intersect_merge

Output: - 700+ variants have multiple position warnings - 10,453 variants have 3+ alleles (caused merge fail)

8. Obtain list of variants with multiple positions/chromosomes

grep -Eo 'rs[0-9]{1,}' plink_output/trial_intersect_merge.log | sort | uniq > plink_output/multpos_var.txt

9. Remove variants with multiple positions/chromosomes

plink --bfile plink_output/wgs_intersect_snps --exclude plink_output/multpos_var.txt --make-bed --out plink_output/wgs_multpos_out

plink --bfile plink_output/array_intersect_snps --exclude plink_output/multpos_var.txt --make-bed --out plink_output/array_multpos_out

Note: this step creates new file sets

Output (both): - 9,165,174 variants remain (771 variants removed)

10. Remove variants with 3+ alleles

plink --bfile plink_output/wgs_multpos_out --exclude plink_output/trial_intersect_merge-merge.missnp --make-bed --out plink_output/wgs_filtered

plink --bfile plink_output/array_multpos_out --exclude plink_output/trial_intersect_merge-merge.missnp --make-bed --out plink_output/array_filtered

Note: this step creates new file sets

Output (both): - 9,155,053 variants remain (10,121 variants removed) - this number is less than expected (10,453), assuming the difference was filtered out from mult_pos

11. Merge these file sets - this should create the final set containing all variants for subsequent analysis

plink --bfile plink_output/wgs_filtered --bmerge plink_output/array_filtered.bed plink_output/array_filtered.bim plink_output/array_filtered.fam --make-bed --out plink_output/filtered_snp_merge

Output: - 9,155,053 variants and 614 people pass filters and QC - total genotyping rate is 97.62%

Currently, the files contain data for 614 individuals (108 from WGS, 506 from SNP-array). However, there are individuals with both. WGS data has a different ID system to SNP-array, therefore all individuals are recognised as unique. The IDs in WGS need to be corrected before proceeding - this can be done with the wgs_filtered file sets before merging.