| tags: [ Data Cleaning Imputation PCA R ] categories: [Coding Experiments ]
Multiple PCA Imputation of Quantitative Phenotypic Data - updated
Introduction
Multiple imputation with PCA performed on larger data set.
Methods and Results
1. Load data
phen.data.age <- read.csv('C:/Users/Martha/Documents/Honours/Project/honours.project/Data/NIES_master_database-age.csv')
#include logMAR and base-out prism test values
phen.data.adults<-phen.data.age[phen.data.age$Age.excel>17,]
quant.variables<- c("Logmar.VA.Right", "RVA.with.PH", "Logmar.VA.Left", "LVA.with.PH", "R.Sph..pre.dilate.", "R.Cyl..pre.dilate.",
"R.Axis..pre.dilate.", "L.Sph..pre.dilate.", "L.Cyl..pre.dilate.", "L.Axis..pre.dilate.",
"R.K.value.H", "R.K.Value.H.Axis", "R.K.value.V", "R.K.value.V.Axis", "L.K.value.H",
"L.K.value.H.Axis", "L.K.value.V", "L.K.value.V.Axis", "R.Pachimetry", "L.Pachimetry", "R.Axial.Length",
"L.Axial.Length", "AC.Depth.R", "AC.Depth.L", "R.IOP.mmHg", "L.IOP.mmHg", "CDR.RE", "CDR.LE")
quant.data.adults<- phen.data.adults[quant.variables]
summary(quant.data.adults)
## Logmar.VA.Right RVA.with.PH Logmar.VA.Left
## Min. :-0.30000 Min. :-0.2400 Min. :-0.30000
## 1st Qu.:-0.08000 1st Qu.:-0.0600 1st Qu.:-0.10000
## Median : 0.02000 Median : 0.0400 Median : 0.02000
## Mean : 0.06608 Mean : 0.0492 Mean : 0.06359
## 3rd Qu.: 0.14000 3rd Qu.: 0.1000 3rd Qu.: 0.13000
## Max. : 2.00000 Max. : 2.0000 Max. : 1.70000
## NA's :3 NA's :200 NA's :6
## LVA.with.PH R.Sph..pre.dilate. R.Cyl..pre.dilate.
## Min. :-0.26000 Min. :-8.0000 Min. :-11.0000
## 1st Qu.:-0.08000 1st Qu.: 0.0000 1st Qu.: -0.7500
## Median : 0.04000 Median : 0.5000 Median : -0.5000
## Mean : 0.05562 Mean : 0.5455 Mean : -0.6497
## 3rd Qu.: 0.10000 3rd Qu.: 1.2500 3rd Qu.: -0.2500
## Max. : 1.06000 Max. : 8.5000 Max. : 0.0000
## NA's :204 NA's :10 NA's :10
## R.Axis..pre.dilate. L.Sph..pre.dilate. L.Cyl..pre.dilate.
## Min. : 0.00 Min. :-9.5000 Min. :-9.7500
## 1st Qu.: 25.50 1st Qu.: 0.0000 1st Qu.:-0.7500
## Median : 85.00 Median : 0.5000 Median :-0.5000
## Mean : 80.23 Mean : 0.5887 Mean :-0.6522
## 3rd Qu.:119.00 3rd Qu.: 1.2500 3rd Qu.:-0.2500
## Max. :180.00 Max. : 8.0000 Max. : 0.0000
## NA's :10 NA's :9 NA's :9
## L.Axis..pre.dilate. R.K.value.H R.K.Value.H.Axis R.K.value.V
## Min. : 0.00 Min. :36.00 Min. : 0.00 Min. :37.00
## 1st Qu.: 13.00 1st Qu.:42.00 1st Qu.: 42.00 1st Qu.:42.75
## Median : 70.00 Median :43.00 Median : 91.00 Median :43.75
## Mean : 72.25 Mean :42.97 Mean : 93.74 Mean :43.80
## 3rd Qu.:113.00 3rd Qu.:44.00 3rd Qu.:156.00 3rd Qu.:45.00
## Max. :180.00 Max. :48.25 Max. :180.00 Max. :55.00
## NA's :9 NA's :16 NA's :16 NA's :16
## R.K.value.V.Axis L.K.value.H L.K.value.H.Axis L.K.value.V
## Min. : 1.00 Min. :32.75 Min. : 0.0 Min. :39.00
## 1st Qu.: 62.00 1st Qu.:42.00 1st Qu.: 24.0 1st Qu.:42.81
## Median : 89.00 Median :43.00 Median : 88.5 Median :43.75
## Mean : 91.19 Mean :42.97 Mean : 87.8 Mean :43.87
## 3rd Qu.:127.00 3rd Qu.:44.00 3rd Qu.:147.0 3rd Qu.:44.75
## Max. :189.00 Max. :47.25 Max. :180.0 Max. :52.00
## NA's :16 NA's :15 NA's :15 NA's :15
## L.K.value.V.Axis R.Pachimetry L.Pachimetry R.Axial.Length
## Min. : 0.00 Min. :428.0 Min. :424.0 Min. :20.95
## 1st Qu.: 65.25 1st Qu.:527.0 1st Qu.:526.0 1st Qu.:22.90
## Median : 90.00 Median :546.0 Median :547.0 Median :23.50
## Mean : 92.22 Mean :546.2 Mean :546.3 Mean :23.56
## 3rd Qu.:119.75 3rd Qu.:570.0 3rd Qu.:568.0 3rd Qu.:24.09
## Max. :180.00 Max. :656.0 Max. :658.0 Max. :27.66
## NA's :15 NA's :14 NA's :16 NA's :13
## L.Axial.Length AC.Depth.R AC.Depth.L R.IOP.mmHg
## Min. :21.29 Min. :2.090 Min. :2.000 Min. : 6.00
## 1st Qu.:22.87 1st Qu.:3.058 1st Qu.:3.040 1st Qu.:14.00
## Median :23.46 Median :3.310 Median :3.280 Median :16.00
## Mean :23.54 Mean :3.320 Mean :3.306 Mean :15.88
## 3rd Qu.:24.11 3rd Qu.:3.550 3rd Qu.:3.553 3rd Qu.:18.00
## Max. :34.43 Max. :4.950 Max. :5.130 Max. :28.00
## NA's :13 NA's :13 NA's :13 NA's :2
## L.IOP.mmHg CDR.RE CDR.LE
## Min. : 8.00 Min. :0.0000 Min. :0.0000
## 1st Qu.:14.00 1st Qu.:0.3000 1st Qu.:0.3000
## Median :16.00 Median :0.4000 Median :0.4000
## Mean :16.06 Mean :0.4057 Mean :0.4022
## 3rd Qu.:18.00 3rd Qu.:0.5000 3rd Qu.:0.5000
## Max. :33.00 Max. :0.9900 Max. :0.9000
## NA's :3 NA's :19 NA's :17
2. Subset relevant data (quantitative only, including logMAR and prism test values, and excluding information from minors)
phen.data.adults<-phen.data.age[phen.data.age$Age.excel>17,]
quant.variables<- c("Logmar.VA.Right", "RVA.with.PH", "Logmar.VA.Left", "LVA.with.PH", "R.Sph..pre.dilate.", "R.Cyl..pre.dilate.",
"R.Axis..pre.dilate.", "L.Sph..pre.dilate.", "L.Cyl..pre.dilate.", "L.Axis..pre.dilate.",
"R.K.value.H", "R.K.Value.H.Axis", "R.K.value.V", "R.K.value.V.Axis", "L.K.value.H",
"L.K.value.H.Axis", "L.K.value.V", "L.K.value.V.Axis", "R.Pachimetry", "L.Pachimetry", "R.Axial.Length",
"L.Axial.Length", "AC.Depth.R", "AC.Depth.L", "R.IOP.mmHg", "L.IOP.mmHg", "CDR.RE", "CDR.LE")
quant.data.adults<- phen.data.adults[quant.variables]
head(quant.data.adults)
## Logmar.VA.Right RVA.with.PH Logmar.VA.Left LVA.with.PH
## 1 0.02 0.02 -0.04 -0.04
## 2 0.10 0.10 0.16 0.10
## 3 0.00 0.00 0.00 0.00
## 4 0.30 0.10 0.08 0.08
## 5 0.00 0.00 -0.10 -0.10
## 6 0.24 0.16 0.36 0.14
## R.Sph..pre.dilate. R.Cyl..pre.dilate. R.Axis..pre.dilate.
## 1 0.25 0.00 0
## 2 0.00 -0.75 28
## 3 1.25 -1.25 148
## 4 1.25 -0.25 97
## 5 1.25 -0.25 37
## 6 4.00 -0.50 46
## L.Sph..pre.dilate. L.Cyl..pre.dilate. L.Axis..pre.dilate. R.K.value.H
## 1 0.25 -0.50 79 42.00
## 2 -0.50 -0.25 164 41.25
## 3 1.25 -0.25 24 NA
## 4 1.50 -0.50 81 44.75
## 5 1.25 -0.75 164 44.75
## 6 3.75 -0.75 180 42.00
## R.K.Value.H.Axis R.K.value.V R.K.value.V.Axis L.K.value.H
## 1 2 43.00 92 42.50
## 2 7 42.25 97 41.50
## 3 NA NA NA NA
## 4 5 45.00 95 45.00
## 5 0 44.75 90 44.25
## 6 8 43.25 98 42.25
## L.K.value.H.Axis L.K.value.V L.K.value.V.Axis R.Pachimetry L.Pachimetry
## 1 5 43.50 95 532 554
## 2 168 42.00 78 608 612
## 3 NA NA NA 507 510
## 4 60 45.25 150 560 559
## 5 178 44.75 88 556 562
## 6 177 43.25 87 498 501
## R.Axial.Length L.Axial.Length AC.Depth.R AC.Depth.L R.IOP.mmHg
## 1 24.31 24.10 3.09 3.03 14
## 2 25.02 25.21 3.38 3.92 16
## 3 22.78 22.80 3.40 3.45 26
## 4 23.02 22.98 3.00 2.85 14
## 5 21.75 22.04 2.60 2.53 22
## 6 23.06 23.17 2.94 3.04 18
## L.IOP.mmHg CDR.RE CDR.LE
## 1 14 0.9 0.9
## 2 15 0.9 0.7
## 3 22 0.7 0.7
## 4 14 0.2 0.2
## 5 21 0.3 0.3
## 6 20 0.6 0.6
3. View summary statistics of data subset
summary(quant.data.adults)
## Logmar.VA.Right RVA.with.PH Logmar.VA.Left
## Min. :-0.30000 Min. :-0.2400 Min. :-0.30000
## 1st Qu.:-0.08000 1st Qu.:-0.0600 1st Qu.:-0.10000
## Median : 0.02000 Median : 0.0400 Median : 0.02000
## Mean : 0.06608 Mean : 0.0492 Mean : 0.06359
## 3rd Qu.: 0.14000 3rd Qu.: 0.1000 3rd Qu.: 0.13000
## Max. : 2.00000 Max. : 2.0000 Max. : 1.70000
## NA's :3 NA's :200 NA's :6
## LVA.with.PH R.Sph..pre.dilate. R.Cyl..pre.dilate.
## Min. :-0.26000 Min. :-8.0000 Min. :-11.0000
## 1st Qu.:-0.08000 1st Qu.: 0.0000 1st Qu.: -0.7500
## Median : 0.04000 Median : 0.5000 Median : -0.5000
## Mean : 0.05562 Mean : 0.5455 Mean : -0.6497
## 3rd Qu.: 0.10000 3rd Qu.: 1.2500 3rd Qu.: -0.2500
## Max. : 1.06000 Max. : 8.5000 Max. : 0.0000
## NA's :204 NA's :10 NA's :10
## R.Axis..pre.dilate. L.Sph..pre.dilate. L.Cyl..pre.dilate.
## Min. : 0.00 Min. :-9.5000 Min. :-9.7500
## 1st Qu.: 25.50 1st Qu.: 0.0000 1st Qu.:-0.7500
## Median : 85.00 Median : 0.5000 Median :-0.5000
## Mean : 80.23 Mean : 0.5887 Mean :-0.6522
## 3rd Qu.:119.00 3rd Qu.: 1.2500 3rd Qu.:-0.2500
## Max. :180.00 Max. : 8.0000 Max. : 0.0000
## NA's :10 NA's :9 NA's :9
## L.Axis..pre.dilate. R.K.value.H R.K.Value.H.Axis R.K.value.V
## Min. : 0.00 Min. :36.00 Min. : 0.00 Min. :37.00
## 1st Qu.: 13.00 1st Qu.:42.00 1st Qu.: 42.00 1st Qu.:42.75
## Median : 70.00 Median :43.00 Median : 91.00 Median :43.75
## Mean : 72.25 Mean :42.97 Mean : 93.74 Mean :43.80
## 3rd Qu.:113.00 3rd Qu.:44.00 3rd Qu.:156.00 3rd Qu.:45.00
## Max. :180.00 Max. :48.25 Max. :180.00 Max. :55.00
## NA's :9 NA's :16 NA's :16 NA's :16
## R.K.value.V.Axis L.K.value.H L.K.value.H.Axis L.K.value.V
## Min. : 1.00 Min. :32.75 Min. : 0.0 Min. :39.00
## 1st Qu.: 62.00 1st Qu.:42.00 1st Qu.: 24.0 1st Qu.:42.81
## Median : 89.00 Median :43.00 Median : 88.5 Median :43.75
## Mean : 91.19 Mean :42.97 Mean : 87.8 Mean :43.87
## 3rd Qu.:127.00 3rd Qu.:44.00 3rd Qu.:147.0 3rd Qu.:44.75
## Max. :189.00 Max. :47.25 Max. :180.0 Max. :52.00
## NA's :16 NA's :15 NA's :15 NA's :15
## L.K.value.V.Axis R.Pachimetry L.Pachimetry R.Axial.Length
## Min. : 0.00 Min. :428.0 Min. :424.0 Min. :20.95
## 1st Qu.: 65.25 1st Qu.:527.0 1st Qu.:526.0 1st Qu.:22.90
## Median : 90.00 Median :546.0 Median :547.0 Median :23.50
## Mean : 92.22 Mean :546.2 Mean :546.3 Mean :23.56
## 3rd Qu.:119.75 3rd Qu.:570.0 3rd Qu.:568.0 3rd Qu.:24.09
## Max. :180.00 Max. :656.0 Max. :658.0 Max. :27.66
## NA's :15 NA's :14 NA's :16 NA's :13
## L.Axial.Length AC.Depth.R AC.Depth.L R.IOP.mmHg
## Min. :21.29 Min. :2.090 Min. :2.000 Min. : 6.00
## 1st Qu.:22.87 1st Qu.:3.058 1st Qu.:3.040 1st Qu.:14.00
## Median :23.46 Median :3.310 Median :3.280 Median :16.00
## Mean :23.54 Mean :3.320 Mean :3.306 Mean :15.88
## 3rd Qu.:24.11 3rd Qu.:3.550 3rd Qu.:3.553 3rd Qu.:18.00
## Max. :34.43 Max. :4.950 Max. :5.130 Max. :28.00
## NA's :13 NA's :13 NA's :13 NA's :2
## L.IOP.mmHg CDR.RE CDR.LE
## Min. : 8.00 Min. :0.0000 Min. :0.0000
## 1st Qu.:14.00 1st Qu.:0.3000 1st Qu.:0.3000
## Median :16.00 Median :0.4000 Median :0.4000
## Mean :16.06 Mean :0.4057 Mean :0.4022
## 3rd Qu.:18.00 3rd Qu.:0.5000 3rd Qu.:0.5000
## Max. :33.00 Max. :0.9900 Max. :0.9000
## NA's :3 NA's :19 NA's :17
4. Duplicate data set
ocular_data <- quant.data.adults
5. Convert to double matrix to ensure that all data is ‘numeric’
ocular_data <- as.matrix(ocular_data)
6.Perform imputation
require(missMDA)
## Loading required package: missMDA
require(FactoMineR)
## Loading required package: FactoMineR
nbdim = estim_ncpPCA(ocular_data, method = 'EM', method.cv="Kfold")
##
|
| | 0%
|
|= | 1%
|
|= | 2%
|
|== | 3%
|
|=== | 4%
|
|=== | 5%
|
|==== | 6%
|
|===== | 7%
|
|===== | 8%
|
|====== | 9%
|
|======= | 10%
|
|======= | 11%
|
|======== | 12%
|
|========= | 13%
|
|========= | 14%
|
|========== | 15%
|
|=========== | 16%
|
|=========== | 17%
|
|============ | 18%
|
|============ | 19%
|
|============= | 20%
|
|============== | 21%
|
|============== | 22%
|
|=============== | 23%
|
|================ | 24%
|
|================ | 25%
|
|================= | 26%
|
|================== | 27%
|
|================== | 28%
|
|=================== | 29%
|
|==================== | 30%
|
|==================== | 31%
|
|===================== | 32%
|
|====================== | 33%
|
|====================== | 34%
|
|======================= | 35%
|
|======================== | 36%
|
|======================== | 37%
|
|========================= | 38%
|
|========================== | 39%
|
|========================== | 40%
|
|=========================== | 41%
|
|============================ | 42%
|
|============================ | 43%
|
|============================= | 44%
|
|============================== | 45%
|
|============================== | 46%
|
|=============================== | 47%
|
|================================ | 48%
|
|================================ | 49%
|
|================================= | 51%
|
|================================= | 52%
|
|================================== | 53%
|
|=================================== | 54%
|
|=================================== | 55%
|
|==================================== | 56%
|
|===================================== | 57%
|
|===================================== | 58%
|
|====================================== | 59%
|
|======================================= | 60%
|
|======================================= | 61%
|
|======================================== | 62%
|
|========================================= | 63%
|
|========================================= | 64%
|
|========================================== | 65%
|
|=========================================== | 66%
|
|=========================================== | 67%
|
|============================================ | 68%
|
|============================================= | 69%
|
|============================================= | 70%
|
|============================================== | 71%
|
|=============================================== | 72%
|
|=============================================== | 73%
|
|================================================ | 74%
|
|================================================= | 75%
|
|================================================= | 76%
|
|================================================== | 77%
|
|=================================================== | 78%
|
|=================================================== | 79%
|
|==================================================== | 80%
|
|===================================================== | 81%
|
|===================================================== | 82%
|
|====================================================== | 83%
|
|====================================================== | 84%
|
|======================================================= | 85%
|
|======================================================== | 86%
|
|======================================================== | 87%
|
|========================================================= | 88%
|
|========================================================== | 89%
|
|========================================================== | 90%
|
|=========================================================== | 91%
|
|============================================================ | 92%
|
|============================================================ | 93%
|
|============================================================= | 94%
|
|============================================================== | 95%
|
|============================================================== | 96%
|
|=============================================================== | 97%
|
|================================================================ | 98%
|
|================================================================ | 99%
|
|=================================================================| 100%
nbdim
## $ncp
## [1] 5
##
## $criterion
## 0 1 2 3 4 5
## 795754.6 797499.3 793740.9 795657.1 782813.8 769001.8
res.comp = MIPCA(ocular_data, ncp = nbdim$ncp, nboot = 1000)
res.comp$ncp
## NULL
7. Plot individuals plot to view uncertainty of predicted values
png('indiv_supp_plot.png', width = 3200, height = 3200, res = 150)
plot.MIPCA(res.comp, choice = "ind.supp")
dev.off()
## png
## 2
8. Plot variable factor plot
png('var_factor_plot.png', width = 3200, height = 3200, res = 150)
plot.MIPCA(res.comp, choice = "var")
dev.off()
## png
## 2
Discussion
From the variable factor plot, it is apparent that the logMAR with pinhole correction values (RVA and LVA with PH) have a large spread, much more scattered than all the other variables. This indicates a relatively large uncertainty for the imputed values for those variables. From the summary statistics, these variables have around 200 missing values, whereas all other variables only have up to 16 missing values. For this reason, we have decided to exclude these variables from subsequent analyses.
Note: this multiple PCA found five components.