1.2.7 Performance measures for imputation methods

After imputation of the "missing values", the performance of the estimates will be examined using four summary measures. Two measures of accuracy, root mean square deviation (RMSD) and mean absolute deviation (MAD), will be used.

Root mean square deviation is defined as

\[ \mathrm{RMSD} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)^2} \]

where \(y_i\) is the true value, \(\hat{y}_i\) is the imputed value and m is the number of missing values.

Mean absolute deviation (MAD) is defined as

\[ \mathrm{MAD} = \frac{1}{m}\sum_{i=1}^{m}\left|\hat{y}_i - y_i\right| \]

RMSD and MAD will be used to determine how close the imputed values are to the true values. They do not always give the same result: large errors are penalised more heavily by RMSD because the difference term is squared. Another summary measure is bias, which is defined as

\[ \mathrm{Bias} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i) \]

Underdispersion of each imputation method will be assessed by computing the variance of all estimated missing values and comparing it with the variance of the known values that were set to missing. The proportionate variance (PV) will be calculated for each imputation method.
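The four summary measures can be illustrated with a short Python sketch. The function names and the toy data are hypothetical, and the definition of PV used here (variance of the imputed values divided by variance of the true values) is an assumption, since the text does not give a formula for it.

```python
import math
import statistics


def rmsd(true, imputed):
    """Root mean square deviation between imputed and true values."""
    m = len(true)
    return math.sqrt(sum((yh - y) ** 2 for y, yh in zip(true, imputed)) / m)


def mad(true, imputed):
    """Mean absolute deviation between imputed and true values."""
    return sum(abs(yh - y) for y, yh in zip(true, imputed)) / len(true)


def bias(true, imputed):
    """Mean signed difference (imputed minus true)."""
    return sum(yh - y for y, yh in zip(true, imputed)) / len(true)


def proportionate_variance(true, imputed):
    """Assumed definition: sample variance of imputed / sample variance of true."""
    return statistics.variance(imputed) / statistics.variance(true)


# Toy example: values that were set to missing and their imputed replacements.
true_values = [2.0, 4.0, 6.0, 8.0]
imputed_values = [3.0, 4.0, 5.0, 10.0]

print("RMSD:", rmsd(true_values, imputed_values))
print("MAD: ", mad(true_values, imputed_values))
print("Bias:", bias(true_values, imputed_values))
print("PV:  ", proportionate_variance(true_values, imputed_values))
```

In this toy example RMSD (about 1.22) exceeds MAD (1.0) because the single large error of 2 is squared, which is the penalisation effect described in the text.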

1.2.6 Imputation diagnostics

In order to assess how well the imputation performed, the following measures will be used:

a) Relative increase in variance (RIV)

b) Fraction of missing information (FMI) (Rubin 1987)

c) Degrees of freedom (DF)

d) Relative efficiency (RE) (Rubin 1987)

e) Examples of how standard errors are calculated

Let Q be the quantity of interest and let the subscript j = 1, ..., m denote the imputation in MI, so that \(\hat{Q}_j\) and \(U_j\) are the estimate and its variance from the jth imputed data set. The total variance is

\[ T = \bar{U} + \left(1 + \frac{1}{m}\right) B \]

where B is the between-imputation variance and \(\bar{U}\) is the within-imputation variance, defined by

\[ B = \frac{1}{m-1}\sum_{j=1}^{m}(\hat{Q}_j - \bar{Q})^2, \qquad \bar{U} = \frac{1}{m}\sum_{j=1}^{m} U_j, \qquad \bar{Q} = \frac{1}{m}\sum_{j=1}^{m}\hat{Q}_j \]

The RIV is defined by the equation

\[ r = \frac{(1 + 1/m)\,B}{\bar{U}} \]

The degrees of freedom are defined by the equation

\[ v = (m-1)\left(1 + \frac{1}{r}\right)^2 \]

This estimate of the degrees of freedom in the imputation model is not influenced by sample size. DF increases as the number of imputations increases.

The fraction of missing information for a limited number of imputations in MI is estimated by

\[ \hat{\lambda} = \frac{r + 2/(v+3)}{r+1} \]

where r is the relative increase in variance (RIV) due to nonresponse and v is the degrees of freedom (DF). The FMI depends on how strongly a variable is correlated with the other variables in the imputation model and on the percentage of missing values for that variable. If the FMI is high for any variable, then increasing the number of imputations should be considered.

The relative (variance) efficiency (RE) of MI is defined as

\[ \mathrm{RE} = \left(1 + \frac{\lambda}{m}\right)^{-1} \]

where \(\lambda\) is the fraction of missing information and m is the number of imputations. The RE of an imputation is an indicator of how well the true population parameters are estimated. It is related to the fraction of missing information as well as to the number (m) of imputations performed.
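The diagnostics in this section follow Rubin's (1987) combining rules, and a worked sketch in Python may make the arithmetic concrete. The function name `pool_rubin`, its arguments and the toy numbers are illustrative assumptions; the sketch also assumes the m estimates are not all identical, so that the between-imputation variance is non-zero.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool m estimates and their variances with Rubin's combining rules."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)        # pooled point estimate
    u_bar = statistics.fmean(variances)        # within-imputation variance
    b = statistics.variance(estimates)         # between-imputation variance (divides by m - 1)
    total = u_bar + (1 + 1 / m) * b            # total variance T
    riv = (1 + 1 / m) * b / u_bar              # relative increase in variance r
    df = (m - 1) * (1 + 1 / riv) ** 2          # degrees of freedom v
    fmi = (riv + 2 / (df + 3)) / (riv + 1)     # fraction of missing information
    re = 1 / (1 + fmi / m)                     # relative efficiency
    return {
        "estimate": q_bar,
        "se": math.sqrt(total),                # pooled standard error
        "total_variance": total,
        "riv": riv,
        "df": df,
        "fmi": fmi,
        "re": re,
    }


# Toy example: a coefficient estimated in m = 5 imputed data sets.
result = pool_rubin([1.0, 1.2, 0.8, 1.1, 0.9], [0.04] * 5)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

The pooled standard error is the square root of the total variance, which is how the standard errors in item e) would be obtained under these rules.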

1.2.5 Visual inspection of imputed data

After performing multiple imputation, it is useful first to look at the means, frequencies and box plots comparing the observed and imputed values to assess whether the range appears reasonable. This will be followed by examination of plots of residuals and outliers for each individual imputed data set to see whether there are anomalies. Evidence of anomalies, no matter how few, is an indication of a problem with the imputation model (White et al 2010). Next, trace plots will be used to assess convergence of each imputed variable. Trace plots are plots of estimated parameters against iteration numbers.

Autocorrelation plots will also be useful in assessing convergence. Autocorrelation measures the correlation between predicted values at different iterations.
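The quantity shown in an autocorrelation plot can be computed directly. The sketch below is a minimal illustration, assuming the values traced across iterations (for example the mean of the imputed values at each iteration) are available as a list; the name `autocorrelation` and the toy chain are hypothetical.

```python
def autocorrelation(chain, lag):
    """Sample autocorrelation of a sequence of per-iteration values at a given lag."""
    n = len(chain)
    mean = sum(chain) / n
    denominator = sum((value - mean) ** 2 for value in chain)
    numerator = sum(
        (chain[t] - mean) * (chain[t + lag] - mean) for t in range(n - lag)
    )
    return numerator / denominator


# Toy chain that alternates between two values: strongly negative lag-1 correlation.
chain = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
for lag in range(3):
    print(lag, autocorrelation(chain, lag))
```

For a chain that has converged and mixes well, the autocorrelation would be expected to fall towards zero as the lag increases.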

1.2.4 Multiple imputation

Multiple imputation (MI) of missing values starts from the core idea of regression. It then adds further steps to obtain more realistic estimates of standard errors or uncertainty. These involve creating multiple sets of artificial observations in which missing values are replaced by regression predictions plus noise. A final step then pools the information from these multiple imputations to estimate the regression model along with its standard errors. In multiple imputation each missing value is replaced with multiple values that represent a distribution of possibilities (Allison 2001). The MI procedure is simulation based, and its main purpose, according to Schafer (1997), is not to create imputed values that are very close to the true ones, but to handle missing data in a way that achieves valid inference.

In this project, the first step in MI will involve (i) identifying variables with missing values, (ii) computing the proportion of missing values for each variable and (iii) assessing the missing-value pattern (monotone or arbitrary) in the data. The second step will involve analysing the m completed data sets using standard procedures. In the third step, the parameter estimates from each imputed data set are combined to obtain a final set of parameter estimates.
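The idea of replacing missing values with regression predictions plus noise can be sketched as follows. This is a simplified illustration with one predictor, not the procedure used in the project: the function names are hypothetical, missing values are represented by `None`, and the regression coefficients are held fixed across imputations, whereas a proper MI procedure would also draw them from their posterior distribution.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def impute_once(x, y, rng):
    """Fill each missing y (None) with its regression prediction plus random noise."""
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    xs = [xi for xi, _ in observed]
    ys = [yi for _, yi in observed]
    intercept, slope = fit_line(xs, ys)
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in observed)
    sigma = (rss / (len(observed) - 2)) ** 0.5  # residual standard deviation
    return [
        yi if yi is not None else intercept + slope * xi + rng.gauss(0, sigma)
        for xi, yi in zip(x, y)
    ]


def multiply_impute(x, y, m=5, seed=0):
    """Create m completed copies of y, each with different random noise."""
    rng = random.Random(seed)
    return [impute_once(x, y, rng) for _ in range(m)]


x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, None, 8.2, 9.8, None]
for completed in multiply_impute(x, y, m=3):
    print([round(value, 2) for value in completed])
```

Each completed data set would then be analysed separately and the results pooled, as described in the second and third steps above.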

1.2.3 FIML

In FIML, data are not imputed or filled in as in multiple imputation; rather, the model parameters are estimated directly from the available information (raw data) (Enders 2001).

1.2.2 Single imputation

The approach for single or deterministic imputation involves using predicted scores from a regression equation to replace missing values. The advantage of this imputation method is that it uses the complete information to impute values. The disadvantage is that the fitted (statistical) model cannot distinguish between observed and imputed values; as a result, the error or uncertainty associated with the imputed values is not incorporated into the model.

1.2.1 Complete case analysis

In this project, using the NCDS data, the conceptual approach will begin with complete case analysis, or listwise deletion, under the assumption that the event rate in the group with missing data is the same as the event rate in the group without missing data.

This project will be centered on the impact of imputation methods on the following issues:

· Effects of imputation on measures of relationship between variables

· Are the characteristics of subjects who provide complete information (completers) different from those who don’t (non-completers)?

· Different problems arise when data are missing in binary or categorical variables. Some procedures may handle these types of missing data better than others, and this area requires further research

· Data that are missing not at random

Key questions to be examined are:

I. What impact do various imputation methods have on the degree of relationship between variables?

II. How significant are the relationships between variables with reference to imputation methods?

III. How do relationships between variables from imputed datasets compare with those in other similar studies?

IV. Which imputation method is best suited to the problems that may arise when data are missing in binary or categorical variables?

Because the nature and properties of missing data can be very different from those of the originally observed data, it is important to analyse various methods of treating missing data in order to determine which methods work best under a given set of conditions (Cheema 2014).

To determine the best method of handling missing data, it is beneficial to first consider the context in which the data are missing. When MNAR data are incorrectly treated as MCAR or MAR, the missing data process is not being modeled correctly, and parameter estimates will not be accurate. Similarly, when MCAR or MAR data are incorrectly treated as MNAR, the researcher is introducing unnecessary complexity into the handling of missing data. Finally, when MAR data are incorrectly treated as MCAR, the researcher is oversimplifying the handling of missing data and will generate parameter estimates that are not generalizable to the population (reference).

1.2 Missing data handling strategy

The goal of statistical methods is to ensure that inferences made about the population of interest are valid and efficient. A good missing data handling method, as observed by Allison (2001), should reduce bias, maximise the use of available information and yield good estimates of uncertainty. Based on the results of the scoping review that compared missing data handling methods in epidemiology, the impact of four popular methods derived from the review will be assessed and evaluated using NCDS data. The missing data methods to be compared are listwise deletion, single imputation, multiple imputation and full information maximum likelihood.

1.1.1.5 Testing significance of predictors of missing values

The Wald test and the likelihood ratio test (LRT) of different models will be used to test the statistical significance of each predictor. The Wald statistic is computed as the ratio of the parameter estimate for each variable in the model to its corresponding standard error. The null hypothesis for the Wald test is that the parameter coefficient is zero. Rejection of the null hypothesis indicates that the effect of the variable is significant. The effects of multiple predictors can also be tested simultaneously by the Wald test. The LRT compares the -2LL of different models. A significant LRT means that the set of variables included in the fitted model makes a significant contribution to the model.
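The Wald statistic for a single coefficient can be computed with a few lines of Python. This is a minimal sketch: the function name and the example numbers are hypothetical, and the p-value assumes the statistic is approximately standard normal under the null hypothesis.

```python
import math


def wald_test(estimate, std_error):
    """Return the Wald z statistic and its two-sided p-value."""
    z = estimate / std_error
    # Two-sided p-value under a standard normal reference distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical coefficient of 0.9 with a standard error of 0.3.
z, p = wald_test(0.9, 0.3)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value would lead to rejection of the null hypothesis that the coefficient is zero.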

1.1.1.4 Information criteria indices

The Akaike information criterion (AIC) (Akaike, 1974) and the Bayesian information criterion (BIC) (Schwarz, 1978), which are based on the deviance statistic, will be used to compare non-nested models (models with different sets of independent variables). AIC and BIC will be useful for this project because of variations in the number of missing values in each variable of the fitted models. The smaller the AIC, the better the model fit.

AIC uses the number of predictors in a model and the sample size to adjust the deviance:

\[ \mathrm{AIC} = \frac{D + 2k}{n} \]

where k is the number of parameters (the number of independent variables plus the intercept), n is the sample size, and D or -2LL is the deviance of the fitted model.

The BIC adjusts the deviance by its degrees of freedom and the sample size:

\[ \mathrm{BIC} = D - df \times \ln(n) \]

where df is the degrees of freedom associated with the deviance, n is the sample size and D is the deviance of the fitted model.
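Both indices can be computed from the deviance alone. The sketch below assumes the sample-size-scaled form of AIC, (D + 2k) / n, and the deviance-based form of BIC, D - df x ln(n), which match the symbol definitions given in this section; other software may report differently scaled versions. The function names and numbers are hypothetical.

```python
import math


def aic(deviance, k, n):
    """AIC assumed as (D + 2k) / n, with k parameters and sample size n."""
    return (deviance + 2 * k) / n


def bic(deviance, df, n):
    """BIC assumed as D - df * ln(n), with df the deviance degrees of freedom."""
    return deviance - df * math.log(n)


# Hypothetical model: deviance 100, 3 parameters, 50 observations (df = 50 - 3).
print("AIC:", aic(100.0, 3, 50))
print("BIC:", bic(100.0, 47, 50))
```

When two candidate models are fitted to the same data, the one with the smaller value of the chosen index would be preferred.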

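1.1.1.3 Pseudo R2

Pseudo R2 will be used to compare different fitted models with the same outcome. A higher pseudo R2 indicates the model that better predicts the outcome. The table below shows the three pseudo R2 measures that will be employed to compare models in this project.

Pseudo R2 measures

Pseudo R2                                     Formula
Likelihood ratio R2 (McFadden R2)             \( 1 - \ln L_M / \ln L_0 \)
Cox and Snell R2 (maximum likelihood R2)      \( 1 - (L_0 / L_M)^{2/n} \)
Nagelkerke R2 (Cragg and Uhler’s R2)          \( [1 - (L_0 / L_M)^{2/n}] / [1 - L_0^{2/n}] \)

Here \(L_M\) is the likelihood of the fitted model, \(L_0\) is the likelihood of the intercept-only (null) model and n is the sample size.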
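The three pseudo R2 measures can be computed from the log-likelihoods of the fitted and intercept-only models. The sketch below uses the standard textbook definitions of these measures; the function names and the example log-likelihoods are hypothetical.

```python
import math


def mcfadden_r2(ll_model, ll_null):
    """Likelihood ratio (McFadden) R2: 1 - lnL_M / lnL_0."""
    return 1 - ll_model / ll_null


def cox_snell_r2(ll_model, ll_null, n):
    """Cox and Snell R2: 1 - (L_0 / L_M)^(2/n), computed on the log scale."""
    return 1 - math.exp(2 * (ll_null - ll_model) / n)


def nagelkerke_r2(ll_model, ll_null, n):
    """Nagelkerke R2: Cox and Snell R2 divided by its maximum, 1 - L_0^(2/n)."""
    return cox_snell_r2(ll_model, ll_null, n) / (1 - math.exp(2 * ll_null / n))


# Hypothetical log-likelihoods for a sample of 100 observations.
ll_null, ll_model, n = -60.0, -45.0, 100
print("McFadden:     ", mcfadden_r2(ll_model, ll_null))
print("Cox and Snell:", cox_snell_r2(ll_model, ll_null, n))
print("Nagelkerke:   ", nagelkerke_r2(ll_model, ll_null, n))
```

Because the Cox and Snell measure cannot reach 1 for a binary outcome, the Nagelkerke rescaling always gives a value at least as large.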

1.1.1.2 Likelihood ratio chi-square test

In logistic regression for a model with one predictor, the likelihood ratio chi-square test will be used to compare the deviance of the model with only an intercept and the model with one independent variable. If the likelihood ratio chi-square test is significant, the null hypothesis will be rejected, with the conclusion that the model with one independent variable fits the data better than the model with only the intercept (null model). In models with multiple predictors, the likelihood ratio chi-square test will be used to decide which model fits the data better.
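For the one-predictor case, the likelihood ratio chi-square statistic is the drop in deviance between the intercept-only model and the model with the predictor, referred to a chi-square distribution with one degree of freedom. The sketch below covers only that one-degree-of-freedom case; the function names and deviances are hypothetical.

```python
import math


def lr_chi_square(deviance_null, deviance_model):
    """Likelihood ratio statistic: reduction in deviance from the null model."""
    return deviance_null - deviance_model


def p_value_df1(chi_square):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2))


# Hypothetical deviances for the null model and the one-predictor model.
statistic = lr_chi_square(110.0, 100.0)
print("LR chi-square:", statistic)
print("p-value:", p_value_df1(statistic))
```

Comparing models that differ by more than one parameter would need the chi-square distribution with the corresponding degrees of freedom, which this sketch does not implement.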

1.1.1.1 The deviance

The deviance is a goodness-of-fit statistic that compares a fitted model with a saturated model, that is, a model that fits the data perfectly. If the difference between the saturated model and the fitted model is small, the model is a good fit. On the other hand, if the deviance is large, the model has a poor fit; smaller deviances mean better fit (Zhu 2014).
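For an ungrouped binary outcome the saturated model has a log-likelihood of zero, so the deviance reduces to -2 times the log-likelihood of the fitted model. The sketch below computes it from predicted probabilities; the function name and the toy data are hypothetical, and the probabilities are assumed to lie strictly between 0 and 1.

```python
import math


def binary_deviance(y, p):
    """Deviance (-2LL) of predicted probabilities p for a 0/1 outcome y."""
    log_likelihood = sum(
        math.log(pi) if yi == 1 else math.log(1 - pi) for yi, pi in zip(y, p)
    )
    return -2 * log_likelihood


y = [1, 0, 1, 0]
poor_fit = [0.5, 0.5, 0.5, 0.5]   # uninformative predictions
good_fit = [0.9, 0.1, 0.8, 0.2]   # predictions close to the observed outcomes
print("Deviance, poor fit:", binary_deviance(y, poor_fit))
print("Deviance, good fit:", binary_deviance(y, good_fit))
```

The predictions that track the observed outcomes more closely give the smaller deviance, in line with the interpretation above.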

1.1.1 Model validation

The following statistics will be used to assess whether the model fits the data well: the deviance, the likelihood ratio test, pseudo R2, AIC and BIC.

Let X denote the n x p matrix of predictors (covariates). This can be partitioned as X = (Xobs, Xmiss), where Xobs is the observed part and Xmiss is the missing part. R may be defined as the matrix of missing data indicators with (i, j)th element Rij = 1 if Xij, the value of the jth predictor for the ith subject, is observed and 0 if it is missing.

In this project, consideration is given to the realistic situation where Rij and Rik (j ≠ k) are not independent. That is, there are patterns in which two covariates tend to have data missing together, or there are covariates that may influence whether data are missing.
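The indicator matrix R can be built directly from a data matrix. The sketch below is illustrative only: it represents X as a list of rows with `None` marking a missing value, and the function names are hypothetical. It also counts how often two covariates are missing together, which is one simple way to look at dependence between Rij and Rik.

```python
def missing_indicators(X):
    """Return R with R[i][j] = 1 if X[i][j] is observed and 0 if it is missing."""
    return [[0 if value is None else 1 for value in row] for row in X]


def proportion_missing(R):
    """Proportion of missing values in each column of the indicator matrix."""
    n = len(R)
    return [1 - sum(row[j] for row in R) / n for j in range(len(R[0]))]


def jointly_missing(R, j, k):
    """Number of subjects for whom covariates j and k are both missing."""
    return sum(1 for row in R if row[j] == 0 and row[k] == 0)


# Toy data matrix: 4 subjects, 2 covariates.
X = [[1.0, None], [2.0, 3.0], [None, None], [4.0, 5.0]]
R = missing_indicators(X)
print(R)
print(proportion_missing(R))
print(jointly_missing(R, 0, 1))
```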

In this case the outcome Y is binary and X1, …, Xp are the prespecified socioeconomic and health predictors. The data will be used to fit a standard logistic model

\[ \operatorname{logit}(\pi) = \ln\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_j X_j \]

The above logistic model is fitted to single predictors to determine the crude estimates for each of the socioeconomic and health covariates (predictors).

For multiple predictors, a standard linear logistic model will be fitted. This can be expressed as

\[ \operatorname{logit}(\pi) = \beta_0 + \sum_{j=1}^{p} \beta_j X_j \]

where \(\pi = E(Y \mid X_1, \ldots, X_p)\) and \(\beta_j\) is the jth regression coefficient, to predict the influence of missing data in the predictor variables on the outcome (response) variable.

Unlike discriminant function analysis, logistic regression does not assume that the predictor variables follow a multivariate normal distribution with equal covariance matrices; instead, it assumes that the binomial distribution describes the distribution of the errors, which equal the actual y minus the predicted y.

Traditionally, this type of research question was addressed by either ordinary least squares (OLS) regression or linear discriminant function analysis. Both techniques were subsequently found to be less than ideal for handling dichotomous outcomes because of their strict statistical assumptions, i.e. linearity, normality and continuity for OLS regression, and multivariate normality with equal variances and covariances for discriminant analysis.

After investigating the crude association between the missingness indicator variable and other variables, the next stage will be to investigate the independent contribution of each variable to the probability of the indicator (dependent) variable being missing. This will be carried out by fitting a logistic regression model with all selected variables as covariates simultaneously. Given the significance of the other variables in the model, the logistic regression model will help to assess whether the missing values in the indicator variable are plausibly MCAR, MAR or MNAR.
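The crude (single-covariate) version of this analysis can be sketched in plain Python. The code below fits a logistic regression of a missingness indicator on one covariate by Newton-Raphson; it is a simplified illustration rather than the project's procedure, the function name and toy data are hypothetical, and it assumes the data are not perfectly separated.

```python
import math


def fit_logistic(x, y, iterations=25):
    """Newton-Raphson fit of logit(pi) = b0 + b1 * x for a 0/1 outcome y."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1 / (1 + math.exp(-(b0 + b1 * xi)))
            w = p * (1 - p)
            g0 += yi - p            # gradient with respect to b0
            g1 += (yi - p) * xi     # gradient with respect to b1
            h00 += w                # elements of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system with the explicit inverse.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
    return b0, b1


# Toy data: covariate x (0/1) and indicator y = 1 if the variable of interest is missing.
x = [0] * 10 + [1] * 10
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0] + [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
b0, b1 = fit_logistic(x, y)
print("intercept:", b0)
print("slope:", b1, "odds ratio:", math.exp(b1))
```

In this toy example the odds of a missing value are about six times higher when x = 1 than when x = 0; a covariate that predicts missingness in this way would argue against MCAR.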