Recent blogs - Things that make me go hmm
Graphs are generated dynamically through the R package knitr.
The source code to these posts are available on GitHub.
Latest posts
- 17 Jan 2016 » Stockmomentum.3
- 17 Jan 2016 » Stockmomentum.2
- 01 Nov 2015 » Rest 1
- 31 Aug 2015 » Titanic Kaggle
- 28 Jun 2015 » Boosting Algo
- Read More...
Titanic Kaggle
0.1 titantic - kaggle
[Source blog]https://github.com/wehrley/wehrley.github.io/blob/master/SOUPTONUTS.md
0.1.0.1 Titanic Survival Prediction
Titanic survival prediction challenge presented by Kaggle.
0.1.1 Background
hypothesize - a PASSENGERRS???s survival was heavily dependent upon two factors: 1. Recognition of the possibility that the ship could sink 2. Access to a lifeboat
0.1.2 Getting the Data Into R
Kaggle packaged the data for the Titanic challenge into two csv-format files: - train.csv (data containing attributes and known outcomes [survived or perished] for a subset of the passengers) - test.csv (data containing attributes without outcomes for a subset of passengers)
readData <- function(path.name, file.name, column.types, missing.types) {
read.csv( url( paste(path.name, file.name, sep="") ),
colClasses=column.types,
na.strings=missing.types )
}
I???ve pushed the Titanic csv files to my GitHub account so that I can access the data from anywhere and, more importantly, demonstrate here the reading of data from a web source. Here are the arguments I will pass into this custom file reading function for the train.csv file:
Titanic.path <- "https://raw.github.com/wehrley/Kaggle_Titanic/master/"
train.data.file <- "train.csv"
test.data.file <- "test.csv"
missing.types <- c("NA", "")
train.column.types <- c('integer', # PassengerId
'factor', # Survived
'factor', # Pclass
'character', # Name
'factor', # Sex
'numeric', # Age
'integer', # SibSp
'integer', # Parch
'character', # Ticket
'numeric', # Fare
'character', # Cabin
'factor' # Embarked
)
test.column.types <- train.column.types[-2] # # no Survived column in test.csv
Specifying missing types up front should make the data munging process a bit easier, and while I may have to change the class type for a data frame column or two along the way, I???ve specified class definitions in a manner which should be most conducive to modeling later. This leaves me with much cleaner code for reading the csv files.
train.raw <- readData(Titanic.path, train.data.file,
train.column.types, missing.types)
df.train <- train.raw
test.raw <- readData(Titanic.path, test.data.file,
test.column.types, missing.types)
df.infer <- test.raw
0.1.3 Data Munging
data preparation accounts for the bulk of the effort.
Sometimes my greatest insights have come from this phase, often referred to as data pre-processing.
missing data in the training set. - use the missmap
function from the Amelia package to display those.
## map missing data by provided feature
require(Amelia)
## Loading required package: Amelia
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.3, built: 2014-11-14)
## ## Copyright (C) 2005-2015 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
missmap(df.train, main="Titanic Training Data - Missings Map",
col=c("yellow", "black"), legend=FALSE)
20 percent of the Age data is missing, 70 percent of the passengers cannot be linked to a specific cabin number.
0.1.4 simple data visualization tools to work
predict whether or not a passenger survived.
barplot(table(df.train$Survived),
names.arg = c("Perished", "Survived"),
main="Survived (passenger fate)", col="black")
barplot(table(df.train$Pclass),
names.arg = c("first", "second", "third"),
main="Pclass (passenger traveling class)", col="firebrick")
barplot(table(df.train$Sex), main="Sex (gender)", col="darkviolet")
hist(df.train$Age, main="Age", xlab = NULL, col="brown")
barplot(table(df.train$SibSp), main="SibSp (siblings + spouse aboard)",
col="darkblue")
barplot(table(df.train$Parch), main="Parch (parents + kids aboard)",
col="gray50")
hist(df.train$Fare, main="Fare (fee paid for ticket[s])", xlab = NULL,
col="darkgreen")
barplot(table(df.train$Embarked),
names.arg = c("Cherbourg", "Queenstown", "Southampton"),
main="Embarked (port of embarkation)", col="sienna")
the dominant categories in the first three graphs: * more passengers perished than survived * about twice as many passengers in 3rd class than in either 1st or 2nd * male passengers far outnumbered females
Perhaps these are the first clues that the two themes discussed earlier ??? women and children first policy, and location on the ship ??? could dictate the feature set. Although the fact that Southampton was the port of embarkation for most passengers doesn???t make for a very balanced Embarked factor, it might mean something in the final analysis.
Mosaic plots offer an interesting ??? and arguably under-utilized ??? way to summarize data. The vcd package includes the mosaicplot
function for creating those. The following mosaic suggests that traveling class did influence the odds of a passenger???s survival.
mosaicplot(df.train$Pclass ~ df.train$Survived,
main="Passenger Fate by Traveling Class", shade=FALSE,
color=TRUE, xlab="Pclass", ylab="Survived")
Do you recall the earlier bar graph showing that some 2/3 of passengers were males? That is taken into account by the width of the two rectangles labeled ???male??? in the mosaic below. Now look at the height of the leftmost light gray rectangle [representing the proportion of females who survived] and compare it to the much shorter light gray rectange [representing proportion of males who survived]. Gender should certainly prove to be a prominent feature in the final model.
mosaicplot(df.train$Sex ~ df.train$Survived,
main="Passenger Fate by Gender", shade=FALSE, color=TRUE,
xlab="Sex", ylab="Survived")
Is it possible that ???survival of the fittest??? dictated the fate of passengers in certain parts of the ship? Perhaps, though it isn???t apparent at first glance from the boxplot of Age by Survival.
boxplot(df.train$Age ~ df.train$Survived,
main="Passenger Fate by Age",
xlab="Survived", ylab="Age")
While passenger survival didn???t vary as much across the three ports of embarkation as it did between genders and traveling classes, perhaps the Embarked feature will prove useful at some point.
mosaicplot(df.train$Embarked ~ df.train$Survived,
main="Passenger Fate by Port of Embarkation",
shade=FALSE, color=TRUE, xlab="Embarked", ylab="Survived")
Just one more graph, then we???ll get to those missing ages. The corrgram package is the source of a function for creating what is sometimes referred to as a correlogram. The one shown below confirms a couple of observations already made ??? namely, that survival odds drop with class, and age may not prove to be a significant predictor. Given that the upper class ranks tend to be represented by an older demographic, an inverse correlation between age and traveling class is to be expected. Although fare and class are closely related, it might be worth throwing the Fare feature into the mix as another way to define a
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
##
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
require(corrgram)
## Loading required package: corrgram
##
## Attaching package: 'corrgram'
##
## The following object is masked from 'package:plyr':
##
## baseball
corrgram.data <- df.train
## change features of factor type to numeric type for inclusion on correlogram
corrgram.data$Survived <- as.numeric(corrgram.data$Survived)
corrgram.data$Pclass <- as.numeric(corrgram.data$Pclass)
corrgram.data$Embarked <- revalue(corrgram.data$Embarked,
c("C" = 1, "Q" = 2, "S" = 3))
## generate correlogram
corrgram.vars <- c("Survived", "Pclass", "Sex", "Age",
"SibSp", "Parch", "Fare", "Embarked")
corrgram(corrgram.data[,corrgram.vars], order=FALSE,
lower.panel=panel.ellipse, upper.panel=panel.pie,
text.panel=panel.txt, main="Titanic Training Data")
Time to tackle those missing ages. A common approach to this type of situation is to replacing the missings with the average of the available values. In this case, that would mean replacing 177 missing Age values with 29.7.
summary(df.train$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.42 20.12 28.00 29.70 38.00 80.00 177
Taking that approach would be fine if only a small fraction of the ages were missing. However, with missings accounting for 20 percent of all Age data in a relatively small data set (<900 records), one could justify a search for a more refined method of imputation. Let???s peek again at the list of currently available features:
names(df.train)
## [1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Cabin" "Embarked"
PassengerId is merely a record number, and we already know that splitting the ages solely by Survived doesn???t reveal much. A boxplot of ages by passenger traveling class looks interesting
This makes intuitive sense: Passengers in the upper classes (first and second) would tend to be wealthier, and in that period of U.S. history, acquiring wealth usually required a good deal of time (no dot-com kings in their 20s were aboard the Titanic on her maiden voyage). There are no missing values in Pclass, so we could replace the missing age for, say, a third class passenger with the average or median of the available ages for those in Pclass="3"
. Doing so would be an improvement over assigning 29.7 to all Age missings.
Inspection of the next feature ??? Name ??? reveals what could be an even better approach???
head(df.train$Name, n=10L)
## [1] "Braund, Mr. Owen Harris"
## [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
## [3] "Heikkinen, Miss. Laina"
## [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"
## [5] "Allen, Mr. William Henry"
## [6] "Moran, Mr. James"
## [7] "McCarthy, Mr. Timothy J"
## [8] "Palsson, Master. Gosta Leonard"
## [9] "Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)"
## [10] "Nasser, Mrs. Nicholas (Adele Achem)"
Notice the titles ??? Mr., Mrs., Miss., Master. ??? following each of the surnames. The Wikipedia entry for the English honorific ???Master??? explains that, By the late 19th century, etiquette dictated that men be addressed as Mister, and boys as Master."
The title ???Miss??? should help with differentiation betweeen younger and older females. Also, note the way the title appears in the name: The format ???Surname, Title. Firstname?????? is consistent in Name across all records. I used that pattern to create a custom function which employs a regular expression and the regexpr
function to extract the title from each name:
## function for extracting honorific (i.e. title) from the Name feature
getTitle <- function(data) {
# data <- data.frame(Name = tail(df.train$Name, 5))
title.dot.start <- regexpr("\\,[A-Z ]{1,20}\\.", data$Name, TRUE)
title.comma.end <- title.dot.start + attr(title.dot.start, "match.length")-1
substr(data$Name, title.dot.start+2, title.comma.end-1)
}
Let???s fetch the titles, given them their own column in the df.train data frame, and look at the uniques.
str(df.train$Name)
## chr [1:891] "Braund, Mr. Owen Harris" ...
df.train$Title <- getTitle(df.train)
unique(df.train$Title)
## [1] "Mr" "Mrs" "Miss" "Master"
## [5] "Don" "Rev" "Dr" "Mme"
## [9] "Ms" "Major" "Lady" "Sir"
## [13] "Mlle" "Col" "Capt" "the Countess"
## [17] "Jonkheer"
To identify the titles which have at least one record with an age missing, I???ll use the bystats
function from the Hmisc package.
options(digits=2)
require(Hmisc)
## Loading required package: Hmisc
## Loading required package: grid
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
##
## The following objects are masked from 'package:plyr':
##
## is.discrete, summarize
##
## The following objects are masked from 'package:dplyr':
##
## combine, src, summarize
##
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
bystats(df.train$Age, df.train$Title,
fun=function(x)c(Mean=mean(x),Median=median(x)))
##
## c(4, 13, 4, 55, 13, 55, 4, 4) of df.train$Age by df.train$Title
##
## N Missing Mean Median
## Capt 1 0 70.0 70.0
## Col 2 0 58.0 58.0
## Don 1 0 40.0 40.0
## Dr 6 1 42.0 46.5
## Jonkheer 1 0 38.0 38.0
## Lady 1 0 48.0 48.0
## Major 2 0 48.5 48.5
## Master 36 4 4.6 3.5
## Miss 146 36 21.8 21.0
## Mlle 2 0 24.0 24.0
## Mme 1 0 24.0 24.0
## Mr 398 119 32.4 30.0
## Mrs 108 17 35.9 35.0
## Ms 1 0 28.0 28.0
## Rev 6 0 43.2 46.5
## Sir 1 0 49.0 49.0
## the Countess 1 0 33.0 33.0
## ALL 714 177 29.7 28.0
# N Missing Mean Median
# Capt 1 0 70.0 70.0
# Col 2 0 58.0 58.0
# Don 1 0 40.0 40.0
# Dr 6 1 42.0 46.5
# Jonkheer 1 0 38.0 38.0
# Lady 1 0 48.0 48.0
# Major 2 0 48.5 48.5
# Master 36 4 4.6 3.5
# Miss 146 36 21.8 21.0
# Mlle 2 0 24.0 24.0
# Mme 1 0 24.0 24.0
# Mr 398 119 32.4 30.0
# Mrs 108 17 35.9 35.0
# Ms 1 0 28.0 28.0
# Rev 6 0 43.2 46.5
# Sir 1 0 49.0 49.0
# the Countess 1 0 33.0 33.0
# ALL 714 177 29.7 28.0
Now I can assign the titles with at least one missing Age value to a list???
## list of titles with missing Age value(s) requiring imputation
titles.na.train <- c("Dr", "Master", "Mrs", "Miss", "Mr")
???then pass that list to the following custom function I created for imputing the missing ages:
imputeMedian <- function(impute.var, filter.var, var.levels) {
for (v in var.levels) {
impute.var[ which( filter.var == v)] <- impute(impute.var[
which( filter.var == v)])
}
return (impute.var)
}
I apply the impute
function from the Hmisc package on a per-title basis to assign the median of the available ages to the missing age(s). For example, the single record with a missing Age value and Title="Dr"
will be assigned the median of the ages from the 6 records with Title="Dr"
which do have age data.
df.train$Age[which(df.train$Title=="Dr")]
## [1] 44 54 23 32 50 NA 49
After doing the age imputations, I check the Age data and find that the function seems to have done its job.
df.train$Age <- imputeMedian(df.train$Age, df.train$Title,
titles.na.train)
df.train$Age[which(df.train$Title=="Dr")]
## [1] 44 54 23 32 50 46 49
summary(df.train$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 21 30 29 35 80
You may recall that the Embarked
feature also had at least one missing value. A summary of that data???
summary(df.train$Embarked)
## C Q S NA's
## 168 77 644 2
???reveals just two missings. It should be fine to replace those missings with ???S???, the most common value.
df.train$Embarked[which(is.na(df.train$Embarked))] <- 'S'
While there are no missing Fare values, a summary does show at least one Fare=0
???
summary(df.train$Fare)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 8 14 32 31 512
(That exceptionally high fare of $512.30 suggests that some tickets were purchased in groups. We???ll address that later.) A zero fare might have been assigned to a baby. However, a closer look at records where Fare = 0
suggests otherwise???
subset(df.train, Fare < 7)[order(subset(df.train, Fare < 7)$Fare,
subset(df.train, Fare < 7)$Pclass),
c("Age", "Title", "Pclass", "Fare")]
## Age Title Pclass Fare
## 264 40 Mr 1 0.0
## 634 30 Mr 1 0.0
## 807 39 Mr 1 0.0
## 816 30 Mr 1 0.0
## 823 38 Jonkheer 1 0.0
## 278 30 Mr 2 0.0
## 414 30 Mr 2 0.0
## 467 30 Mr 2 0.0
## 482 30 Mr 2 0.0
## 675 30 Mr 2 0.0
## 733 30 Mr 2 0.0
## 180 36 Mr 3 0.0
## 272 25 Mr 3 0.0
## 303 19 Mr 3 0.0
## 598 49 Mr 3 0.0
## 379 20 Mr 3 4.0
## 873 33 Mr 1 5.0
## 327 61 Mr 3 6.2
## 844 34 Mr 3 6.4
## 819 43 Mr 3 6.5
## 203 34 Mr 3 6.5
## 372 18 Mr 3 6.5
## 144 19 Mr 3 6.8
## 655 18 Miss 3 6.8
## 412 30 Mr 3 6.9
## 826 30 Mr 3 7.0
## 130 45 Mr 3 7.0
## 805 27 Mr 3 7.0
# Age Title Pclass Fare
# 264 40 Mr 1 0.0
# 634 30 Mr 1 0.0
# 807 39 Mr 1 0.0
# 816 30 Mr 1 0.0
# 823 38 Noble 1 0.0
# 278 30 Mr 2 0.0
# 414 30 Mr 2 0.0
# 467 30 Mr 2 0.0
# 482 30 Mr 2 0.0
# 675 30 Mr 2 0.0
# 733 30 Mr 2 0.0
# 180 36 Mr 3 0.0
# 272 25 Mr 3 0.0
# 303 19 Mr 3 0.0
# 598 49 Mr 3 0.0
# 379 20 Mr 3 4.0
# 873 33 Mr 1 5.0
# 327 61 Mr 3 6.2
# 844 34 Mr 3 6.4
# 819 43 Mr 3 6.4
# 203 34 Mr 3 6.5
# 372 18 Mr 3 6.5
# 144 19 Mr 3 6.8
# 655 18 Miss 3 6.8
# 412 30 Mr 3 6.9
# 826 30 Mr 3 7.0
# 130 45 Mr 3 7.0
# 805 27 Mr 3 7.0
The jump in fares from 0 to the 4-7 range suggests errors. I replaced the zero Fare values with the median fare from the respective passenger class using the imputMedian function introduced earlier.
## impute missings on Fare feature with median fare by Pclass
df.train$Fare[ which( df.train$Fare == 0 )] <- NA
df.train$Fare <- imputeMedian(df.train$Fare, df.train$Pclass,
as.numeric(levels(df.train$Pclass)))
I see the titles as more than merely a guide for imputation of missing ages. A passenger???s title can reflect gender, his/her position on the ship (officers & royalty), and access to a lifeboat (where ???Master??? superceded ???Mr???). Making the effort to get the Title feature model-ready seems worthwhile.
Recall from the bystats
results above that the training data contains 17 different titles. We already know that ???Master??? and ???Mr??? should separate the males into roughly two groups by age. The following script???
df.train$Title <- factor(df.train$Title,
c("Capt","Col","Major","Sir","Lady","Rev",
"Dr","Don","Jonkheer","the Countess","Mrs",
"Ms","Mr","Mme","Mlle","Miss","Master"))
boxplot(df.train$Age ~ df.train$Title,
main="Passenger Age by Title", xlab="Title", ylab="Age")
???produces this boxplot (too wide for display here) showing passenger age by title, including shading which illustrates the manner in which I consolidated the titles. I created and applied a custom function for revaluing the titles, then reclassified Title to a factor type, as follows:
## function for assigning a new title value to old title(s)
changeTitles <- function(data, old.titles, new.title) {
for (honorific in old.titles) {
data$Title[ which( data$Title == honorific)] <- new.title
}
return (data$Title)
}
## Title consolidation
df.train$Title <- changeTitles(df.train,
c("Capt", "Col", "Don", "Dr",
"Jonkheer", "Lady", "Major",
"Rev", "Sir"),
"Noble")
## Warning in `[<-.factor`(`*tmp*`, which(data$Title == honorific), value =
## structure(c(13L, : invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, which(data$Title == honorific), value =
## structure(c(13L, : invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, which(data$Title == honorific), value =
## structure(c(13L, : invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, which(data$Title == honorific), value =
## structure(c(13L, : invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, which(data$Title == honorific), value =
## structure(c(13L, : invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, which(data$Title == honorific), value =
## structure(c(13L, : invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, which(data$Title == honorific), value =
## structure(c(13L, : invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, which(data$Title == honorific), value =
## structure(c(13L, : invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, which(data$Title == honorific), value =
## structure(c(13L, : invalid factor level, NA generated
df.train$Title <- changeTitles(df.train, c("the Countess", "Ms"),
"Mrs")
df.train$Title <- changeTitles(df.train, c("Mlle", "Mme"), "Miss")
df.train$Title <- as.factor(df.train$Title)
I assigned the Countess of Rothes, a woman in first class and the sole passenger with a ???Countess??? title, to the ???Mrs??? group. In retrospect, I could have placed her under the ???Noble??? umbrella. Given that 91 of the 94 female first-class passengers in the training set survived, I was willing to live with that choice.
All of the work done designing the new Title column can be considered a part of feature engineering. The other features I chose to add are generated using custom function featureEngrg
, which can be applied to both the training data in df.train and the Kaggle-provided test data in df.infer.
require(plyr) # for the revalue function
require(stringr) # for the str_sub function
## Loading required package: stringr
## test a character as an EVEN single digit
isEven <- function(x) x %in% c("0","2","4","6","8")
## test a character as an ODD single digit
isOdd <- function(x) x %in% c("1","3","5","7","9")
## function to add features to training or test data frames
featureEngrg <- function(data) {
## Using Fate ILO Survived because term is shorter and just sounds good
data$Fate <- data$Survived
## Revaluing Fate factor to ease assessment of confusion matrices later
data$Fate <- revalue(data$Fate, c("1" = "Survived", "0" = "Perished"))
## Boat.dibs attempts to capture the "women and children first"
## policy in one feature. Assuming all females plus males under 15
## got "dibs' on access to a lifeboat
data$Boat.dibs <- "No"
data$Boat.dibs[which(data$Sex == "female" | data$Age < 15)] <- "Yes"
data$Boat.dibs <- as.factor(data$Boat.dibs)
## Family consolidates siblings and spouses (SibSp) plus
## parents and children (Parch) into one feature
data$Family <- data$SibSp + data$Parch
## Fare.pp attempts to adjust group purchases by size of family
data$Fare.pp <- data$Fare/(data$Family + 1)
## Giving the traveling class feature a new look
data$Class <- data$Pclass
data$Class <- revalue(data$Class,
c("1"="First", "2"="Second", "3"="Third"))
## First character in Cabin number represents the Deck
data$Deck <- substring(data$Cabin, 1, 1)
data$Deck[ which( is.na(data$Deck ))] <- "UNK"
data$Deck <- as.factor(data$Deck)
## Odd-numbered cabins were reportedly on the port side of the ship
## Even-numbered cabins assigned Side="starboard"
data$cabin.last.digit <- str_sub(data$Cabin, -1)
data$Side <- "UNK"
data$Side[which(isEven(data$cabin.last.digit))] <- "port"
data$Side[which(isOdd(data$cabin.last.digit))] <- "starboard"
data$Side <- as.factor(data$Side)
data$cabin.last.digit <- NULL
return (data)
}
## add remaining features to training data frame
df.train <- featureEngrg(df.train)
Some color on the features I???ve added: * Boat.dibs - assumes all females plus males under 15 get ???dibs??? on access to a lifeboat. Filtering by Title=???Master??? was considered, but the highest age in the training data for males addressed as???Master" was just 12, and I wanted to account for male teens with Title=???Mr??? who could pass for a child. * Deck - levels are as shown in the Titanic cross-section displayed previously. Cabin data provided for just 23 percent of training data records, so it???s tough to give this one much emphasis. * Side - subject to the same concern (dearth of data) expressed for Deck
I finish the data munging process by paring down the data frame to the columns I will use in model building.
train.keeps <- c("Fate", "Sex", "Boat.dibs", "Age", "Title",
"Class", "Deck", "Side", "Fare", "Fare.pp",
"Embarked", "Family", "Survived")
df.train.munged <- df.train[train.keeps]
0.2 Fitting a Model
Later, I will be conducting the predictive modeling effort using the caret
package. Created by Max Kuhn of Pfizer Global R&D, caret
provides a unified interface for modeling & prediction, and streamlines the model tuning process using resampling. The package includes a createDataPartition
function for splitting data into a training set and a test set (sometimes referred to as a validation set) via stratified random sampling. In this presentation, Kuhn delivered the best explanation I???ve seen of the decision on how to ???spend??? the available training data. His conclusion: Statistically, the best course of action would be to use all the data for model building and use statistical methods to get good estimates of error. From a non-statistical perspective, many consumers of these models emphasize the need for an untouched set of samples to evaluate performance.
I selected an 80/20 split for training data and testing data. The code:
nn <- nrow(df.train.munged)
training.rows <- ind <- sample(x = nn, size = nn*.8)
train.batch <- df.train.munged[+ind, ]
test.batch <- df.train.munged[-ind, ]
# ## split training data into train batch and test batch
# set.seed(23)
# library(caret)
# training.rows <- createDataPartition(df.train.munged$Survived,
# p = 0.8, list = FALSE)
# train.batch <- df.train.munged[training.rows, ]
# test.batch <- df.train.munged[-training.rows, ]
Before I go pouring features into the popular Random Forest method, I???m going to give one of the simplest classification methods a crack at the Titanic prediction challenge. Logistic regression, which surfaced about 70 years ago, has been used extensively in multiple fields. I???ll start simple by passing essentially the features provided in the raw training data (remember that we combined SibSp
and Parch
to form Family
) through the R function for fitting general linearized models. When entering the model formula, I typically have a habit of listing the features in an order roughly corresponding to what I initially believe their importance will be. In this case, I???ve ordered them roughly by the two main themes I discussed earlier (women & children first policy and location on the ship). By setting the argument family
to binomial
with a logit
link, I???m asking glm( )
to produce a logistic regression.
Titanic.logit.1 <- glm(Fate ~ Sex + Class + Age + Family + Embarked + Fare,
data = train.batch, family=binomial("logit")
)
Titanic.logit.1
##
## Call: glm(formula = Fate ~ Sex + Class + Age + Family + Embarked +
## Fare, family = binomial("logit"), data = train.batch)
##
## Coefficients:
## (Intercept) Sexmale ClassSecond ClassThird Age
## 4.33126 -2.63852 -1.17861 -2.29533 -0.03863
## Family EmbarkedQ EmbarkedS Fare
## -0.21155 -0.13795 -0.62868 0.00156
##
## Degrees of Freedom: 711 Total (i.e. Null); 703 Residual
## Null Deviance: 951
## Residual Deviance: 634 AIC: 652