| tags: [ Data Cleaning Genomic Data PLINK R ] categories: [Coding Experiments ]

Re-trying default merge

Introduction

The missing-genotype filtering is still removing more variants than expected. We suspect that PLINK may not be handling the overlapping data (duplicate samples) well. I will therefore remove the SNP-array data for individuals who have duplicate data before merging the data sets with PLINK's default setting (consensus calls).

Methods and Results

1. Identify individuals with SNP-array only and duplicate data.

Using the master spreadsheet containing UUIDs, I 1) isolated all NIES individuals, 2) isolated NIES individuals with SNP data (GWAS_NIES), and 3) separated individuals with potential duplicate data from those with only SNP-array data.

From this:

  • 363 UUIDs with SNP-data

  • 73 UUIDs with WGS and/or SNP-array (I do not know for sure which individuals have duplicate data because I do not have the list of IDs for individuals with SNP-array data)

  • 290 UUIDs with ONLY SNP-array data (presumably)
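The split above is a set difference between two ID lists (363 - 73 = 290). A minimal sketch of how it could be done on the command line, assuming each list holds the UUID in its first whitespace-delimited column; the function name and the wgs_uuid.txt file name are hypothetical and not part of the original workflow:

```shell
# Hypothetical helper: print every line of list $1 whose first column
# does not appear in the first column of list $2.
ids_only_in_first() {
    # FILENAME == ARGV[1] is true while awk reads the exclusion list ($2),
    # which is passed to awk first; its IDs are stored, then list $1 is filtered.
    awk 'FILENAME == ARGV[1] { seen[$1] = 1; next } !($1 in seen)' "$2" "$1"
}
```

For example, `ids_only_in_first gwas_niesID.txt wgs_uuid.txt > snp-array_only_uuid.txt` would write the SNP-array-only IDs. Note that PLINK's --keep expects family ID and within-family ID in the first two columns, so the lists should already be in that layout.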

2. Extract SNP-array data for 290 individuals

The array_filtered file contains more than 9 million variants extracted from the original SNP-array file.

plink --bfile plink_output/array_filtered --keep snp-array_only_uuid.txt --make-bed --out NIES.array.hg38

Note: data were extracted for only 288 of the 290 individuals.
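To see which requested individuals were not extracted, the keep list can be compared against the source .fam file. This is a sketch only, assuming the keep file has family ID and within-family ID in its first two columns (the layout --keep expects); the function name is hypothetical:

```shell
# Hypothetical helper: list entries of keep file $1 (FID IID) that have
# no matching FID/IID pair in the PLINK .fam file $2.
missing_from_fam() {
    # The .fam file is read first (ARGV[1]); its first two columns are FID and IID.
    awk 'FILENAME == ARGV[1] { have[$1 FS $2] = 1; next }
         !(($1 FS $2) in have)' "$2" "$1"
}
```

Running `missing_from_fam snp-array_only_uuid.txt plink_output/array_filtered.fam` would print the IDs that are absent from the array data.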

3. Try default merging with new array data set

plink --bfile NIES.array.hg38 --bmerge plink_output/wgs_filtered.bed plink_output/wgs_filtered.bim plink_output/wgs_filtered.fam --make-bed --out merged_def2_nies

I accidentally used the wgs_filtered file, which contains data for all 108 individuals with WGS data, not all of whom are NIES individuals.

4. Extract merged data for NIES

plink --bfile merged_def2_nies --keep gwas_niesID.txt --make-bed --out merged_def3_nies

gwas_niesID.txt contains UUIDs of all NIES individuals with SNP data.

Output:

  • 361 people remain
  • 9,155,053 variants

5. Fix paternal and maternal IDs in merged_def3_nies fam file (change all to 0)
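No command was recorded for this step. One possible way to do it, shown as a sketch rather than the method actually used: in a .fam file the third and fourth columns hold the paternal and maternal IDs, so both can be overwritten with 0. The function name is hypothetical:

```shell
# Hypothetical helper: print .fam file $1 with paternal (column 3) and
# maternal (column 4) IDs set to 0. Output is space-delimited.
zero_parents() {
    awk '{ $3 = 0; $4 = 0; print }' "$1"
}
```

A cautious usage would write to a temporary file first, for example `zero_parents merged_def3_nies.fam > tmp.fam`, check it, and then replace the original .fam while keeping a backup.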

6. Filter variants based on HWE p-value

plink1.9 --bfile merged_def3_nies --hwe 1.8e-7 --make-bed --out plink_output/merged_nies_hwefilter

Output:

  • 8,811 variants removed
  • 9,146,242 variants remain

7. Filter based on missing genotypes

plink1.9 --bfile plink_output/merged_nies_hwefilter --mind 0.05 --geno 0.01 --make-bed --out merged_nies_filter

Output:

  • 2 people removed
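  • 3,158,627 variants removed

Repeating the filter with more lenient thresholds (this run writes to the same output prefix, so it overwrites the previous merged_nies_filter files):

plink1.9 --bfile plink_output/merged_nies_hwefilter --mind 0.1 --geno 0.05 --make-bed --out merged_nies_filter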

Output:

  • 0 people removed
  • 1,092,421 variants removed
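To look at how the removed variants are distributed, per-variant missingness can be written out with `plink1.9 --bfile plink_output/merged_nies_hwefilter --missing --out <prefix>`, which produces a .lmiss file whose fifth column (F_MISS) is the missing call rate. The sketch below counts variants above a chosen threshold; the function name is hypothetical, and the counts may differ slightly from the --geno output when --mind removes samples in the same run:

```shell
# Hypothetical helper: count variants in .lmiss file $1 whose missing
# call rate (column 5, F_MISS) exceeds threshold $2.
count_above_fmiss() {
    # NR > 1 skips the header line; "+ 0" forces numeric comparison.
    awk -v t="$2" 'NR > 1 && ($5 + 0) > (t + 0) { n++ } END { print n + 0 }' "$1"
}
```

Calling it with thresholds 0.01 and 0.05 on the same .lmiss file gives a quick comparison of the two filter settings without rerunning the full filter.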

The default merge yielded results very similar to those from merge mode 3 with the overlapping data.