| tags: [ Genomic Data Linux PLINK ] categories: [Coding Experiments ]
Generating basic statistics for genomic data sets
Introduction
In preparation for merging the two sets of genomic data, some basic statistics must first be generated. These will include allele frequencies, genotyping rates, missingness, and hardy-weinberg equilibrium tests.
Methods
1.Generate allele frequencies for both files
plink1.9 --bfile NI_wgs_merged_snps --freq --out plink_output/wgs_allele_freq
plink1.9 --bfile NImerged.impute2.chrAllX-2014-Oct-02 --freq --out plink_output/snp_allele_freq
View head of output
2.Plot histograms of allele frequency distributions in both files
3.Generate missingness statistics
plink1.9 --bfile NI_wgs_merged_snps --missing --out plink_output/wgs_missing
plink1.9 --bfile NImerged.impute2.chrAllX-2014-Oct-02 --missing --out plink_output/snp_missing
4.Generate hardy-weinberg test statistics
plink1.9 --bfile NI_wgs_merged_snps --missing --out plink_output/wgs_hardy
plink1.9 --bfile NImerged.impute2.chrAllX-2014-Oct-02 --hardy --out plink_output/snp_hardy
5.Filter out variants based on HWE p-value
plink1.9 --bfile NI_wgs_merged_snps --hwe 1.8e-7 --make-bed --out plink_output/wgs_hwe_filter
plink1.9 --bfile NImerged.impute2.chrAllX-2014-Oct-02 --hwe 1.8e-7 --make-bed --out plink_output/snp_hwe_filter
Note: new bed, bim, fam files created.
6.Try merging to find variants with 3+ alleles
plink1.9 --bfile plink_output/wgs_hwe_filter --bmerge plink_output/snp_hwe_filter.bed plink_output/snp_hwe_filter.bim plink_output/snp_hwe_filter.fam --make-bed --out plink_output/trial_merge
Failed merge generated a file that has a list of variant IDs that have 3+ alleles. This can be used to exlude these variants from each file and the merge can be attempted again. Note: 6528 variants were found to have multiple alleles. ##7.Exclude variants with 3+ alleles from both files
plink1.9 --bfile plink_output/wgs_hwe_filter --exclude plink_output/trial_merge-merge.missnp --make-bed --out plink_output/wgs_exclude_out
plink1.9 --bfile plink_output/snp_hwe_filter --exclude plink_output/trial_merge-merge.missnp --make-bed --out plink_output/snp_exclude_out
Output from exclude: WGS = 9,288,730 variants remain (from 9,295,158); SNP = 27,303,763 variants remain (from 27,310,391). Note: new bed, bim, fam files created.
8.Try merge again with new binary files (multi-allelic variants excluded)
plink1.9 --bfile plink_output/wgs_exclude_out --bmerge plink_output/snp_exclude_out.bed plink_output/snp_exclude_out.bim plink_output/snp_exclude_out.fam --make-bed --out plink_output/var_out_merge
Input: 9,288,730 variants loaded from WGS_exclude_out; 27,303,763 variants loaded from SNP_excluded_out 5,787,699 variants are common in both, 21,516,064 variants are new Warnings: 5,731,755 variants with multiple-positions; 69,831 variants with same-positions Output: Total genotyping rate is 77.35%; 30,804,794 variants and 614 individuals pass filters and QC
9.Generate allele frequency statistics for merged data set
plink1.9 --bfile plink_ouput/var_out_merge --freq --out plink_output/merged_allele_freq
Output: 30,804,794 variants loaded, genotyping rate is 77.35%. I thought only common variants across both sets will be retained for the merge?