| tags: [ R ] categories: [Coding ]

Cleaning phenotypic data

Introduction

The phenotypic data collected from the eye examinations in 2008 were collated in a single excel file. The file contains data from all 801 study participants, including their demographic, environmental, and quantitative phenotypic data. There are also comments on clinical observations. Prior to analysing the phenotypic data, it is important to identify any missing or non-sense data so as not to interfere with the analyses or its results.

Aim

To identify missing and non-sense data from the phenotypic dataset.

Methods and Results

  1. Create new R Project and set the working directory to a local folder
  2. Create new R script
  3. Load phenotypic data (NIES_master_database) into R
phen.data <- read.csv('C:/Users/Martha/Documents/Honours/Project/honours.project/Data/NIES_master_database.csv')
  1. Data cleaning step
  1. Exploration of how much missing data – run summary(); will give box plot values 
#summary() will give number of missing data for each variable
summary <- summary(phen.data)

#produce the table of results as a .csv file
write.table(summary, file="C:/Users/Martha/Documents/Honours/Project/honours.project/summary.table.3.csv")

#results 
summary
##       UUID            LAB_ID          PID            SID     
##  Min.   :110150   Min.   :1001          :391           :391  
##  1st Qu.:310050   1st Qu.:1181   10050  :  1   10012610:  1  
##  Median :327190   Median :2509   10071  :  1   10014214:  1  
##  Mean   :323827   Mean   :3644   10121  :  1   10014215:  1  
##  3rd Qu.:400396   3rd Qu.:6858   10280  :  1   10014216:  1  
##  Max.   :400805   Max.   :7000   10291  :  1   10014218:  1  
##                   NA's   :449    (Other):405   (Other) :405  
##  Ped_ID_.BK_no.      NIES       NIES_GWAS      Miles_coreped   
##  Min.   :   2   ES_001 :  1   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 812   ES_002 :  1   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :1921   ES_003 :  1   Median :0.0000   Median :1.0000  
##  Mean   :2239   ES_004 :  1   Mean   :0.1498   Mean   :0.5468  
##  3rd Qu.:3449   ES_005 :  1   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :7080   ES_006 :  1   Max.   :1.0000   Max.   :1.0000  
##  NA's   :288    (Other):795                                    
##     NIES_DNA         DNA_Box      DNA_Postition    Gender   
##  Min.   :0.0000   Min.   :1.000          :566   Female:443  
##  1st Qu.:0.0000   1st Qu.:1.000   A1     :  4   Male  :358  
##  Median :0.0000   Median :2.000   A2     :  4               
##  Mean   :0.2934   Mean   :2.315   A4     :  4               
##  3rd Qu.:1.0000   3rd Qu.:3.000   A5     :  4               
##  Max.   :1.0000   Max.   :4.000   A6     :  4               
##                   NA's   :566     (Other):215               
##          DOB      Current.height  Current.weight     Smoker       
##  11/12/1995:  2   Min.   :127.0   Min.   : 31.00   Mode :logical  
##  12/03/1965:  2   1st Qu.:162.4   1st Qu.: 65.00   FALSE:359      
##  15/11/1990:  2   Median :170.0   Median : 75.00   TRUE :442      
##  18/04/1958:  2   Mean   :169.7   Mean   : 76.84                  
##  19/01/1971:  2   3rd Qu.:177.6   3rd Qu.: 86.00                  
##  19/12/1995:  2   Max.   :195.0   Max.   :140.00                  
##  (Other)   :789   NA's   :137     NA's   :163                     
##  Glaucoma.medications               Eye.Colour  Logmar.VA.Right   
##          :769         Brown              :249   Min.   :-0.30000  
##  Xalatan :  6         Blue               :242   1st Qu.:-0.10000  
##  Timoptol:  3         Hazel / Light Brown: 85   Median : 0.02000  
##  xalatan :  3         Green              : 81   Mean   : 0.06266  
##  Xalacom :  2                            : 64   3rd Qu.: 0.12000  
##  Alphagan:  1         Hazel              : 33   Max.   : 2.00000  
##  (Other) : 17         (Other)            : 47   NA's   :3         
##   RVA.with.PH       Logmar.VA.Left      LVA.with.PH     
##  Min.   :-0.24000   Min.   :-0.30000   Min.   :-0.2600  
##  1st Qu.:-0.08000   1st Qu.:-0.10000   1st Qu.:-0.0800  
##  Median : 0.02000   Median : 0.02000   Median : 0.0400  
##  Mean   : 0.04614   Mean   : 0.05989   Mean   : 0.0529  
##  3rd Qu.: 0.10000   3rd Qu.: 0.12000   3rd Qu.: 0.1000  
##  Max.   : 2.00000   Max.   : 1.70000   Max.   : 1.0600  
##  NA's   :210        NA's   :6          NA's   :214      
##  R.Sph..pre.dilate. R.Cyl..pre.dilate. R.Axis..pre.dilate.
##  Min.   :-8.0000    Min.   :-11.0000   Min.   :  0.00     
##  1st Qu.: 0.0000    1st Qu.: -0.7500   1st Qu.: 25.00     
##  Median : 0.5000    Median : -0.5000   Median : 85.00     
##  Mean   : 0.5411    Mean   : -0.6389   Mean   : 80.22     
##  3rd Qu.: 1.2500    3rd Qu.: -0.2500   3rd Qu.:119.75     
##  Max.   : 8.5000    Max.   :  0.0000   Max.   :180.00     
##  NA's   :11         NA's   :11         NA's   :11         
##  L.Sph..pre.dilate. L.Cyl..pre.dilate. L.Axis..pre.dilate.  IPDprecyclo   
##  Min.   :-9.5000    Min.   :-9.7500    Min.   :  0.00      Min.   : 2.00  
##  1st Qu.: 0.0000    1st Qu.:-0.7500    1st Qu.: 12.50      1st Qu.:57.50  
##  Median : 0.5000    Median :-0.5000    Median : 70.00      Median :59.50  
##  Mean   : 0.5781    Mean   :-0.6403    Mean   : 71.96      Mean   :56.36  
##  3rd Qu.: 1.2500    3rd Qu.:-0.2500    3rd Qu.:112.00      3rd Qu.:61.50  
##  Max.   : 8.0000    Max.   : 0.0000    Max.   :180.00      Max.   :69.00  
##  NA's   :10         NA's   :10         NA's   :10          NA's   :787    
##      R.Sph            R.Cyl             R.Axis           L.Sph       
##  Min.   :0.2500   Min.   :-1.0000   Min.   :  0.00   Min.   :0.2500  
##  1st Qu.:0.2500   1st Qu.:-0.6875   1st Qu.: 50.25   1st Qu.:0.3125  
##  Median :0.3750   Median :-0.3750   Median : 86.00   Median :0.5000  
##  Mean   :0.5000   Mean   :-0.4583   Mean   : 71.50   Mean   :0.5417  
##  3rd Qu.:0.6875   3rd Qu.:-0.2500   3rd Qu.: 97.75   3rd Qu.:0.6875  
##  Max.   :1.0000   Max.   : 0.0000   Max.   :117.00   Max.   :1.0000  
##  NA's   :795      NA's   :795       NA's   :795      NA's   :795     
##      L.Cyl             L.Axis        IPDpostcyclo       Glasses   
##  Min.   :-1.0000   Min.   : 36.00   Min.   : 0.0000         :  3  
##  1st Qu.:-0.6875   1st Qu.: 78.75   1st Qu.: 0.0000   c CL's:  2  
##  Median :-0.3750   Median :112.00   Median : 0.0000   c gls :234  
##  Mean   :-0.5000   Mean   :100.67   Mean   : 0.9776   cCl's :  1  
##  3rd Qu.:-0.2500   3rd Qu.:125.00   3rd Qu.: 0.0000   cCL's :  1  
##  Max.   :-0.2500   Max.   :147.00   Max.   :64.0000   s gls :560  
##  NA's   :795       NA's   :795      NA's   :489                   
##     R.Sph.gl         R.Cyl.gl        R.Axis.gl        Add         
##  Min.   :-0.500   Min.   :-2.000   Min.   : 15.00   Mode:logical  
##  1st Qu.: 0.250   1st Qu.:-1.375   1st Qu.: 67.50   NA's:801      
##  Median : 1.000   Median :-0.750   Median :120.00                 
##  Mean   : 2.250   Mean   :-1.167   Mean   : 90.67                 
##  3rd Qu.: 3.625   3rd Qu.:-0.750   3rd Qu.:128.50                 
##  Max.   : 6.250   Max.   :-0.750   Max.   :137.00                 
##  NA's   :798      NA's   :798      NA's   :798                    
##     L.Sph.gl          L.Cyl.gl         L.Axis.gl       R.K.value.H   
##  Min.   :-0.2500   Min.   :-1.0000   Min.   : 41.00   Min.   :36.00  
##  1st Qu.: 0.1250   1st Qu.:-0.8750   1st Qu.: 54.50   1st Qu.:42.00  
##  Median : 0.5000   Median :-0.7500   Median : 68.00   Median :43.00  
##  Mean   : 0.6667   Mean   :-0.6667   Mean   : 91.33   Mean   :42.94  
##  3rd Qu.: 1.1250   3rd Qu.:-0.5000   3rd Qu.:116.50   3rd Qu.:44.00  
##  Max.   : 1.7500   Max.   :-0.2500   Max.   :165.00   Max.   :48.25  
##  NA's   :798       NA's   :798       NA's   :798      NA's   :17     
##  R.K.Value.H.Axis  R.K.value.V    R.K.value.V.Axis  L.K.value.H   
##  Min.   :  0.0    Min.   :37.00   Min.   :  1.00   Min.   :32.75  
##  1st Qu.: 42.0    1st Qu.:42.75   1st Qu.: 61.00   1st Qu.:42.00  
##  Median : 92.0    Median :43.75   Median : 89.00   Median :43.00  
##  Mean   : 93.5    Mean   :43.77   Mean   : 90.69   Mean   :42.94  
##  3rd Qu.:155.0    3rd Qu.:44.75   3rd Qu.:126.25   3rd Qu.:44.00  
##  Max.   :180.0    Max.   :55.00   Max.   :189.00   Max.   :47.25  
##  NA's   :17       NA's   :17      NA's   :17       NA's   :16     
##  L.K.value.H.Axis  L.K.value.V    L.K.value.V.Axis Cover.Test.D  
##  Min.   :  0.00   Min.   :39.00   Min.   :  0.00   Mode:logical  
##  1st Qu.: 26.00   1st Qu.:42.75   1st Qu.: 65.00   NA's:801      
##  Median : 89.00   Median :43.75   Median : 90.00                 
##  Mean   : 88.33   Mean   :43.84   Mean   : 92.25                 
##  3rd Qu.:147.00   3rd Qu.:44.75   3rd Qu.:121.00                 
##  Max.   :180.00   Max.   :52.00   Max.   :180.00                 
##  NA's   :16       NA's   :16      NA's   :16                     
##    Lang.score              Ocular.Motility      IOP.time    R.Pachimetry  
##  Min.   : -99.0                    :726             : 17   Min.   :428.0  
##  1st Qu.: 550.0   CT N & D Straight:  6    0/01/1900:784   1st Qu.:527.0  
##  Median : 550.0   lang 0/3         :  6                    Median :546.0  
##  Mean   : 560.3   Lang 0/3         :  3                    Mean   :546.4  
##  3rd Qu.: 550.0   LANG 0/3         :  2                    3rd Qu.:570.0  
##  Max.   :1600.0   ? CN IV palsy    :  1                    Max.   :656.0  
##  NA's   :98       (Other)          : 57                    NA's   :15     
##   L.Pachimetry   R.Axial.Length  L.Axial.Length    AC.Depth.R   
##  Min.   :424.0   Min.   :20.95   Min.   :21.29   Min.   :2.090  
##  1st Qu.:526.0   1st Qu.:22.90   1st Qu.:22.87   1st Qu.:3.060  
##  Median :547.0   Median :23.49   Median :23.46   Median :3.320  
##  Mean   :546.7   Mean   :23.56   Mean   :23.53   Mean   :3.327  
##  3rd Qu.:569.0   3rd Qu.:24.09   3rd Qu.:24.10   3rd Qu.:3.560  
##  Max.   :658.0   Max.   :27.66   Max.   :34.43   Max.   :4.950  
##  NA's   :17      NA's   :16      NA's   :16      NA's   :14     
##    AC.Depth.L      R.IOP.mmHg      L.IOP.mmHg           Anterior.segment
##  Min.   :2.000   Min.   : 6.00   Min.   : 8.00   NAD OU         :271    
##  1st Qu.:3.045   1st Qu.:14.00   1st Qu.:14.00   NAD            :174    
##  Median :3.290   Median :16.00   Median :16.00   IOL OU         : 20    
##  Mean   :3.314   Mean   :15.89   Mean   :16.06   pterygium OU   : 20    
##  3rd Qu.:3.560   3rd Qu.:18.00   3rd Qu.:18.00                  : 18    
##  Max.   :5.130   Max.   :28.00   Max.   :33.00   pseudophakia OU: 13    
##  NA's   :14      NA's   :3       NA's   :4       (Other)        :285    
##  Pterygium       Disc.size.RE Disc.size.LE     CDR.RE     
##  Mode :logical     : 41         : 43       Min.   :0.000  
##  FALSE:684       L : 82       L : 82       1st Qu.:0.300  
##  TRUE :117       M :493       M :495       Median :0.400  
##                  NR:112       NR:109       Mean   :0.402  
##                  S : 73       s :  1       3rd Qu.:0.500  
##                               S : 71       Max.   :0.990  
##                                            NA's   :21     
##      CDR.LE         FDT.MD_RE         FDT.MD_LE         FDT.PSD_RE    
##  Min.   :0.0000   Min.   :-17.790   Min.   :-17.120   Min.   :-3.070  
##  1st Qu.:0.3000   1st Qu.: -2.240   1st Qu.: -1.855   1st Qu.: 2.540  
##  Median :0.4000   Median :  0.000   Median : -0.220   Median : 3.100  
##  Mean   :0.3988   Mean   : -1.020   Mean   : -1.094   Mean   : 3.566  
##  3rd Qu.:0.5000   3rd Qu.:  1.657   3rd Qu.:  1.660   3rd Qu.: 4.000  
##  Max.   :0.9000   Max.   :  6.480   Max.   :  5.200   Max.   :12.030  
##  NA's   :19       NA's   :693       NA's   :694       NA's   :696     
##    FDT.PSD_LE                        Comments   R.L.handed
##  Min.   : 2.010                          :534     : 26    
##  1st Qu.: 2.638   LANG 0/3               : 13   \\:  1    
##  Median : 3.020   Fields OK.             :  2   L : 56    
##  Mean   : 3.663   LANG 0/3.              :  2   NR: 95    
##  3rd Qu.: 3.940   No diabetic retinopathy:  2   R :623    
##  Max.   :14.720   No K's                 :  2             
##  NA's   :697      (Other)                :246             
##  R.L.eye.dominance                                        Other.disease
##    :787                                                          :731  
##  LE:  2            PPA OU                                        :  3  
##  RE: 12            Glaucoma                                      :  2  
##                    -                                             :  1  
##                    ??glaucoma suspect but fields ok. See comments:  1  
##                    1 drusen for each eye                         :  1  
##                    (Other)                                       : 62  
##  UV.questionnaire_NIES.code UV.questionnaire_First.name Maiden.name   
##  Min.   :  1.0              Mode:logical                Mode:logical  
##  1st Qu.:199.5              NA's:801                    NA's:801      
##  Median :398.0                                                        
##  Mean   :398.3                                                        
##  3rd Qu.:598.5                                                        
##  Max.   :805.0                                                        
##  NA's   :10                                                           
##  UV.questionnaire_Surname UV.questionnaire_Age UV.questionnaire_Sex
##  Mode:logical             Min.   : 7.00         : 10               
##  NA's:801                 1st Qu.:41.00        ,:  1               
##                           Median :53.00        F:434               
##                           Mean   :52.39        M:356               
##                           3rd Qu.:65.00                            
##                           Max.   :89.00                            
##                           NA's   :13                               
##  Lived.in.Norfolk.Island   Location.   From.age.in.years
##  Min.   : 0.00                  :126   Min.   : 0.000   
##  1st Qu.:12.00           NZ     :118   1st Qu.: 0.000   
##  Median :26.00           Sydney : 69   Median : 0.000   
##  Mean   :27.52           AUS    : 65   Mean   : 8.367   
##  3rd Qu.:40.00           NSW    : 44   3rd Qu.:16.000   
##  Max.   :89.00           QLD    : 21   Max.   :61.000   
##  NA's   :32              (Other):358   NA's   :131      
##  Duration.outside.Norfolk.Island  Hair.colour                 Burns.Tans 
##  Min.   : 0.00                   Brown  :493   burns then tans     :462  
##  1st Qu.:10.00                   Black  :122   never burns but tans:162  
##  Median :23.00                   Blonde :115   burns not tan       :110  
##  Mean   :25.39                   Red    : 25   don't know          : 24  
##  3rd Qu.:39.00                   NR     : 18                       : 18  
##  Max.   :84.00                          : 14   not applicable      : 12  
##  NA's   :112                     (Other): 14   (Other)             : 13  
##            Sunburn.and.pain               Time.outside
##  2-10 times        :412                         : 17  
##  never             :135     1/2 of the day      :288  
##  once              : 95     cannot judge        : 47  
##  more than 10 times: 94     greater than 3/4 day:162  
##  don't know        : 45     less than 1/4day    :276  
##                    : 15     none                :  9  
##  (Other)           :  5     u                   :  2  
##        Outdoors.and.hats    Outdoors.and.sunglasses
##                 : 15                    : 15       
##  1/2 of the time:116     1/2 of the time: 85       
##  always         :147     always         :218       
##  can't judge    : 14     can't judge    : 11       
##  never          :107     never          :146       
##  seldom         :168     seldom         :161       
##  usually        :234     usually        :165       
##   Winter..Indoors.and.Outdoors                At.school  
##                 : 16                               : 21  
##  1/2 and 1/2    :424           don't know          : 53  
##  don't know     : 12           neither             :376  
##  mostly indoors :186           yes, both           :133  
##  mostly outdoors:163           yes, hat only       :101  
##                                yes, sunglasses only:117  
##                                                          
##     Feel.colder.than.people     Arc.Welding  If.yes..eye.protection
##                 : 16                  : 18             :616        
##  1/2 of the time: 82        don’t know: 23   don’t know:  8        
##  always         : 32        No        :562   No        : 62        
##  cannot judge   : 69        Yes       :198   y         :  2        
##  never          :221                         Yes       :113        
##  seldom         :280                                               
##  usually        :101                                               
##  If.yes..flash.burn.                 Reason.for.sunglasses
##            :615      glare                      :380      
##  don’t know:  9                                 :175      
##  No        :126      driving                    : 75      
##  Yes       : 51      protection from eye disease: 56      
##                      medical requirement        : 26      
##                      fashion                    : 20      
##                      (Other)                    : 69      
##            Do.not.wear.sunglasses
##                       :442       
##  Inconvenient         : 98       
##  Prescription glasses": 83       
##  Not necessary        : 72       
##  Uncomfortable"       : 45       
##  Decrease vision      : 32       
##  (Other)              : 29
  1. Presence of outliers and non-sense data and plotting histograms
#plot histogram of a variable 
hist(phen.data$Current.height)

#produce boxplot statistics which will identify outliers
outliers.height <- boxplot.stats(phen.data$Current.height)
#show boxplot stats 
outliers.height
## $stats
## [1] 141.00 162.25 170.00 177.75 195.00
## 
## $n
## [1] 664
## 
## $conf
## [1] 169.0496 170.9504
## 
## $out
## [1] 138 130 127
  1. Check that there are no individuals that only have one or two phenotypes (systematic missing data)

Discussion

This exercise shows basic summaries of the phenotypic data collected from the eye examinations. The results of the ‘summary’ function includes the means and medians of quantitative data and freqencies of qualitative data. In addition, it also shows the number of cells that are empty (‘NA’) for each variable. This is important for the data imputation step that will follow.

The histogram plots and the box plot statistics were aimed to determine whether there were any non-sense data, which will also need to be removed. The histograms and boxplot stats are yet to be completed for all quantitative variables. Step (d) will be performed to identify any individuals that have systematic missing data, which will be removed for subsequent analysis.