Introduction

Data description

3154 healthy young men aged 39-59 from the San Francisco area were assessed for their personality type. All were free from coronary heart disease at the start of the research. Eight and a half years later change in this situation was recorded.

Format

age: Age in years
height: Height in inches
weight: Weight in pounds
sdp: Systolic blood pressure in mm Hg
dbp: Diastolic blood pressure in mm Hg
chol: Fasting serum cholesterol in mm %
behave: Behavior type which is a factor with levels A1 A2 B3 B4
cigs: Number of cigarettes smoked per day
dibep: Behavior type which is a factor with levels A (Agressive) B (Passive)
chd: Coronary heat disease developed which is a factor with levels no yes
typechd: type of coronary heart disease which is a factor with levels angina infdeath none silent
timechd: Time of CHD event or end of follow-up
arcus: Arcus senilis is a factor with levels absent present

Details

The WCGS began in 1960 with 3,524 male volunteers who were employed by 11 California companies. Subjects were 39 to 59 years old and free of heart disease as determined by electrocardiogram. After the initial screening, the study population dropped to 3,154 and the number of companies to 10 because of various exclusions. The cohort comprised both blue- and white-collar employees. At baseline the following information was collected: socio-demographic including age, education, marital status, income, occupation; physical and physiological including height, weight, blood pressure, electrocardiogram, and corneal arcus; biochemical including cholesterol and lipoprotein fractions; medical and family history and use of medications; behavioral data including Type A interview, smoking, exercise, and alcohol use. Later surveys added data on anthropometry, triglycerides, Jenkins Activity Survey, and caffeine use. Average follow-up continued for 8.5 years with repeat examinations.

Source

Statistics for Epidemiology by N. Jewell (2004)

Goal

To find out the association between diseased variable and each exposure variable that is categorized. And to figure out the causal relationship and make conlusion about the inference. And fit some models to predict the outcome by those variables.

Preprocessing

Import data

# Load library
library(faraway)
data(wcgs)

# Present the structure of data
str(wcgs)

Output:

## 'data.frame': 3154 obs. of 13 variables:
## $ age : int 49 42 42 41 59 44 44 40 43 42 ...
## $ height : int 73 70 69 68 70 72 72 71 72 70 ...
## $ weight : int 150 160 160 152 150 204 164 150 190 175 ...
## $ sdp : int 110 154 110 124 144 150 130 138 146 132 ...
## $ dbp : int 76 84 78 78 86 90 84 60 76 90 ...
## $ chol : int 225 177 181 132 255 182 155 140 149 325 ...
## $ behave : Factor w/ 4 levels "A1","A2","B3",..: 2 2 3 4 3 4 4 2 3 2 ...
## $ cigs : int 25 20 0 20 20 0 0 0 25 0 ...
## $ dibep : Factor w/ 2 levels "A","B": 2 2 1 1 1 1 1 2 1 2 ...
## $ chd : Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 1 1 1 ...
## $ typechd: Factor w/ 4 levels "angina","infdeath",..: 3 3 3 3 2 3 3 3 3 3 ...
## $ timechd: int 1664 3071 3071 3064 1885 3102 3074 3071 3064 1032 ...
## $ arcus : Factor w/ 2 levels "absent","present": 1 2 1 1 2 1 1 1 1 2 ...
# Create a new variable named "status"
status <- rep(1, length(wcgs$chd))       # status = 1 for all
status[which(wcgs$chd == "no")] <- 0     # status = 0 if chd == "no"

# Add status to data
wcgs <- cbind(wcgs, status)

Missing value

# Check the number of NA
sum(is.na(wcgs))

Output:

## [1] 14

As we can see that we have 14 missing value (NA) in our raw data. Furthermore, if we observe those value type of each column, it is easy to find out that we have 2 types of value in total (factor and integer). Hence, in order to get rid of those missing value, we should replace NA of different value type by using different method. Here we use the missForest package to impute missing values. missForest runs a random forests on each variable using the observed part and predicts the na values.

# Load library
library(missForest)

# Impute missing values
wcgs.mis <- missForest(wcgs, maxiter = 5, ntree = 250)
wcgs <- wcgs.mis$ximp

Output:

## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
# Check the number of NA again
sum(is.na(wcgs))

Output:

## [1] 0

Split Data

In order to check the performance of our models, we need to build a training data set and a testing data set. Here I decided to split our original data into two parts by half and half. And let one of them to be the train and the other one will be the test.

train <- wcgs[   1:1576, ] # Set row 1 to 1576 as train
test  <- wcgs[1577:3152, ] # Set the rest as test

test.chd <- test[(names(test) %in% c("chd"))]
test <- test[!(names(test) %in% c("typechd", "status", "chd"))]
train <- train[!(names(train) %in% c("typechd", "status"))]

Data visualization

Kaplan-Meier Plot

# Load libraries
library(survival)
library(survminer)
library(ggplot2)

# fit a model
fit <- survfit(Surv(timechd, status) ~ typechd, data = wcgs)

# KM curves
ggsurvplot(
    fit                       ,
    size = 0.3                ,
    pval = TRUE               ,
    conf.int = TRUE           , # Show confidence interval
    surv.median.line = "hv"   , # Specify median survival
    ggtheme = theme_light()   , # Change ggplot2 theme
    xlab = "Days"             ,
    legend.title = "Type of Coronary heart disease",
    legend.labs = c("Angina","Infdeath", "None", "Silent")
)

# log–rank test
data <- subset(wcgs, typechd != "none")
survdiff(Surv(timechd, status) ~ typechd, data = data)

Output:

## Call:
## survdiff(formula = Surv(timechd, status) ~ typechd, data = data)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## typechd=angina 51 51 52.3 0.0304 0.0386
## typechd=infdeath 135 135 137.8 0.0573 0.1245
## typechd=silent 71 71 66.9 0.2478 0.3408
##
## Chisq= 0.3 on 2 degrees of freedom, p= 0.8

From the Kaplan-Meier plot above, we can easily see that there are 3 types of Coronary heart disease. The survival probability of all of them are decreasing with time. Angina has relatively lower survival probability at the begining, and has relatively higher survival probability at the end.

We did the log-rank test with the data set which without level "none" in typechd since "none" is not a type of CHD, and our goal is to compare difference in survival between differet types of CHD.

The log-rank test resulted a p-value of p = 0.9, indicatiing that Coronary heart disease types differ insignificantly in survival. This outcome is obvious, since as we can see that although there is some difference between those three types of Coronary Heart Disease in KM curve presented above, in general, those differences are not notable.

Kernel Density Plots

# Load library
library(ggpubr)

# Kernel density plots
chd.age <- wcgs[,c("chd", "age")]
chd.height <- wcgs[,c("chd", "height")]
chd.weight <- wcgs[,c("chd", "weight")]
chd.sdp <- wcgs[,c("chd", "sdp")]

# SalePrice - KitchenQual
age.plt <- ggplot(chd.age, aes(x = age, fill = chd)) +
  geom_density(alpha = 0.4) + 
  labs(title = "CHD distribution by Age")

# SalePrice - LotConfig
height.plt <- ggplot(chd.height, aes(x = height, fill = chd)) +
  geom_density(alpha = 0.4) + 
  labs(title = "CHD distribution by Height")

# SalePrice - SaleType
weight.plt <- ggplot(chd.weight, aes(x = weight, fill = chd)) +
  geom_density(alpha = 0.4) + 
  labs(title = "CHD distribution by Weight")

# SalePrice - SaleType
sdp.plt <- ggplot(chd.sdp, aes(x = sdp, fill = chd)) +
  geom_density(alpha = 0.4) + 
  labs(title = "CHD distribution by sdp")

ggarrange(age.plt, height.plt, weight.plt, sdp.plt, ncol = 2, nrow = 2)

From these kernel density plots presented above we can see that only age has overt effects on CHD. The plot is showing that aged people tend to be more likely to have CHD. And the sdp might have slightly effects on CHD. The height and the weight does not have any association with chd due to the distribution of chd patient similarly through different height/weight.

Associations

Test for association between diseased (Response) variable and each exposure variable that is categorized.

# Behavior
behavior <- table(wcgs$chd, wcgs$behave)
behavior; chisq.test(behavior)

Output:

##
## A1 A2 B3 B4
## no 234 1177 1155 331
## yes 30 148 61 18
##
## Pearson's Chi-squared test
##
## data: behavior
## X-squared = 39.916, df = 3, p-value = 1.11e-08
# Behavior type
behavior_type <- table(wcgs$chd, wcgs$dibep)
behavior_type; chisq.test(behavior_type)

Output:

##
## A B
## no 1486 1411
## yes 79 178
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: behavior_type
## X-squared = 39.08, df = 1, p-value = 4.069e-10
# Arcus senilis
arcus_senilis <- table(wcgs$chd, wcgs$arcus)
arcus_senilis; chisq.test(arcus_senilis)

Output:

##
## absent present
## no 2058 839
## yes 154 103
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: arcus_senilis
## X-squared = 13.402, df = 1, p-value = 0.0002514

From the result of chi-square test above, we can clonclude that all of them have significant relationship with CHD due to the extremely low p-value.

Model Diagnosis

Logistic Regression

# Build logistic regression model
fit.glm <- glm(
  chd ~ . - weight - height, 
  family = binomial(link = "logit") ,
  data = train
)

# Check the detail of the model
summary(fit.glm)

Output:

##
## Call:
## glm(formula = chd ~ . - weight - height, family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1540 -0.2840 -0.1880 -0.1211 3.1893
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.4808822 1.3633689 -4.754 2.00e-06 ***
## age 0.0593623 0.0199352 2.978 0.0029 **
## sdp 0.0261042 0.0107949 2.418 0.0156 *
## dbp -0.0132608 0.0176858 -0.750 0.4534
## chol 0.0124000 0.0025709 4.823 1.41e-06 ***
## behaveA2 -0.4747303 0.3451883 -1.375 0.1690
## behaveB3 -0.8120439 0.3822421 -2.124 0.0336 *
## behaveB4 -0.5967909 0.4874605 -1.224 0.2208
## cigs 0.0209861 0.0074927 2.801 0.0051 **
## dibepB NA NA NA NA
## timechd -0.0017381 0.0001282 -13.553 < 2e-16 ***
## arcuspresent 0.3571030 0.2389478 1.494 0.1350
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 926.31 on 1575 degrees of freedom
## Residual deviance: 581.74 on 1565 degrees of freedom
## AIC: 603.74
##
## Number of Fisher Scoring iterations: 6

From the model summary presented above, it is not hard to see that some variables have high significant level. height, chol, timechd have 3 stars, and Intercept, sdp, cigs have 2 stars.

Since Intercept has 2 stars significant level, which means some interception between variables still valuable, we cannot arbitrarily chose variables with high significant level. Hence, we need the variable selection part as following.

The best decision is to leave the variable selection to computer. Also, we may call it "Subset selection". And it will go both "Forward" and "Backward" directions to select a subset which has the lowest AIC to be the best subset.

# Load package
library(caret)

# Build another model
fit.glm.step <- step(fit.glm, direction = "both")

Output:

## Start: AIC=603.74
## chd ~ (age + height + weight + sdp + dbp + chol + behave + cigs +
## dibep + timechd + arcus) - weight - height
9
##
##
## Step: AIC=603.74
## chd ~ age + sdp + dbp + chol + behave + cigs + timechd + arcus
##
## Df Deviance AIC
## - behave 3 586.27 602.27
## - dbp 1 582.31 602.31
## <none> 581.74 603.74
## - arcus 1 583.95 603.95
## - sdp 1 587.53 607.53
## - cigs 1 589.45 609.45
## - age 1 590.68 610.68
## - chol 1 605.32 625.32
## - timechd 1 810.54 830.54
##
## Step: AIC=602.27
## chd ~ age + sdp + dbp + chol + cigs + timechd + arcus
##
## Df Deviance AIC
## - dbp 1 586.67 600.67
## + dibep 1 583.81 601.81
## <none> 586.27 602.27
## - arcus 1 588.84 602.84
## + behave 3 581.74 603.74
## - sdp 1 591.73 605.73
## - cigs 1 595.06 609.06
## - age 1 596.66 610.66
## - chol 1 609.63 623.63
## - timechd 1 823.49 837.49
##
## Step: AIC=600.67
## chd ~ age + sdp + chol + cigs + timechd + arcus
##
## Df Deviance AIC
## + dibep 1 584.29 600.29
## <none> 586.67 600.67
## - arcus 1 589.32 601.32
## + dbp 1 586.27 602.27
## + behave 3 582.31 602.31
## - sdp 1 596.04 608.04
## - cigs 1 596.47 608.47
## - age 1 596.94 608.94
## - chol 1 609.79 621.79
## - timechd 1 823.88 835.88
##
## Step: AIC=600.29
## chd ~ age + sdp + chol + cigs + timechd + arcus + dibep
##
## Df Deviance AIC
## <none> 584.29 600.29
## - dibep 1 586.67 600.67
## - arcus 1 586.87 600.87
## + dbp 1 583.81 601.81
10
## + behave 2 582.31 602.31
## - age 1 593.31 607.31
## - cigs 1 593.36 607.36
## - sdp 1 593.45 607.45
## - chol 1 607.33 621.33
## - timechd 1 812.89 826.89
# Visualize model
par(mfrow = c(2,2))
plot(fit.glm.step)

# Make prediction
test.prob <- predict(fit.glm.step, newdata = test, type = "response")
pred.glm <- rep("no", length(test.prob))
pred.glm[test.prob >= 0.5] <- "yes"

# Evaluate the model obtained on the test set
postResample(pred.glm, test.chd$chd)

Output:

## Accuracy Kappa
## 0.9238579 0.3247499

KNN

This study is a classification problem which reminds me that KNN could be used for this problem.

# Set to a constant result
set.seed(123)

# Set up grid and trControl
grid_knn <- expand.grid(.k = seq(2, 20, by = 1))
tr.control <- trainControl(method = "cv", number = 15)

# Build model
fit.knn <- train(
  chd ~ . - weight - height  , 
  train                      , 
  method = "knn"             , 
  trControl = tr.control     ,
  tuneGrid = grid_knn        ,
  preProcess = c("center", "scale")
)

# Present the model
fit.knn

Output:

## k-Nearest Neighbors
##
## 1576 samples
## 11 predictor
## 2 classes: 'no', 'yes'
##
## Pre-processing: centered (11), scaled (11)
## Resampling: Cross-Validated (15 fold)
## Summary of sample sizes: 1471, 1471, 1471, 1471, 1471, 1471, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 2 0.9022761 0.3564266
## 3 0.9232225 0.3955646
## 4 0.9213118 0.3804881
## 5 0.9295837 0.4169642
## 6 0.9245043 0.3505174
## 7 0.9245043 0.3545086
## 8 0.9175202 0.2629168
## 9 0.9187841 0.2634132
## 10 0.9149745 0.2241198
## 11 0.9162444 0.2189774
## 12 0.9156095 0.2173461
## 13 0.9156095 0.2035675
## 14 0.9181491 0.2223683
## 15 0.9168793 0.2001922
## 16 0.9162444 0.1829645
## 17 0.9194190 0.2195936
## 18 0.9213237 0.2395027
## 19 0.9194190 0.2199010
## 20 0.9213237 0.2309750
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
# Plot the results
plot(fit.knn)

# Make prediction
pred.knn <- predict(fit.knn, newdata = test)

# Evaluate the model obtained on the test set
postResample(pred.knn, test.chd$chd)

Output:

## Accuracy Kappa
## 0.9276650 0.2909698

Boosting

# Set to a constant result
set.seed(123)

objControl <- trainControl(
  method = 'cv', 
  number = 20  , 
  returnResamp = 'none', 
  summaryFunction = twoClassSummary, 
  classProbs = TRUE
)

# Fit a model
fit.gbm <- train(
  train[,!(names(train) %in% c("weight", "height", "chd"))] ,
  train[,"chd"]                        ,
  method = "gbm"                       ,
  trControl = objControl               ,  
  metric = "ROC"                       ,
  preProc = c("center", "scale")
)
# Check detail of model
summary(fit.gbm)

Output:

## var rel.inf
## timechd timechd 64.1175247
## chol chol 13.1085644
## age age 8.0388713
## sdp sdp 4.6793517
## dbp dbp 3.9532382
## cigs cigs 3.9362107
## behave behave 1.2377050
## arcus arcus 0.8109386
## dibep dibep 0.1175955
# Make prediction
pred.gbm <- predict(
  fit.gbm, 
  newdata = test, 
  n.trees = which.min(fit.gbm$cv.error)
)

# Evaluate the model obtained on the test set
postResample(pred.gbm, test.chd$chd)

Output:

## Accuracy Kappa
## 0.9244924 0.3637141

Prediction

Ensemble

Since this is a classification problem so I decided to use majority voting to ensemble three models.

# The majority vote

prediction <- ifelse(pred.glm == "no" & pred.knn == "no", "no",
              ifelse(pred.glm == "no" & pred.gbm == "no", "no",
              ifelse(pred.knn == "no" & pred.gbm == "no", "no", 
                     "yes")))

# Evaluate the model obtained on the test set
postResample(prediction, test.chd$chd)

Output:

## Accuracy Kappa
## 0.9257614 0.3189232

Conclusion

No matter what type of Coronary heart disease you get, the consquences would be similar. Although there are three types of Coronary heart disease presented in the data provided by western collaborative group study, their survival probability almost equal during the period after getting diseased till die. In another word, the difference between different types of Coronary heart disease (Angina, Infdeath, Silent) is not notable.

Furthermore, behaviors, behavior types, arcus senilis are all highly related with Coronary heart disease.

After all, we built a simple model for predicting CHD.

Citation

http://search.r-project.org/library/faraway/html/wcgs.html

Last modification:January 12, 2024
如果觉得我的文章对你有用,请随意赞赏