We will walk through how to perform an exploratory data analysis (EDA) on a small subset of data.
Load the appropriate libraries
# Install and load necessary packages if not already installed
# install.packages("tidyverse")
library(tidyverse)
Warning: package 'dplyr' was built under R version 4.3.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.3 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)  # optional: ggplot2 is already attached by library(tidyverse)
Load the dataset
PassData <- read.csv("Titanic Passengers.csv")
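Column names such as Passenger.Class suggest that the headers in the CSV file contain spaces or punctuation, because read.csv() rewrites headers into syntactically valid R names by default (check.names = TRUE). The sketch below shows that behavior on a two-column CSV string; the string and its header "Passenger Class" are hypothetical stand-ins, not the contents of the real file.

```r
# Hypothetical CSV text standing in for the real file (illustration only)
csv_text <- "Passenger Class,Fare\n1,211.3375\n1,151.55"

# Default behavior: headers are converted to valid R names (spaces become dots)
default_names <- names(read.csv(text = csv_text))

# check.names = FALSE keeps the headers exactly as written in the file
raw_names <- names(read.csv(text = csv_text, check.names = FALSE))

print(default_names)
print(raw_names)
```

Keeping the default is usually more convenient, since names like PassData$Passenger.Class can then be used without backticks.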
Basic Exploration of the data
The code below previews the first six rows with head() and then summarizes every column with summary().
head(PassData)
Name Survived Passenger.Class
1 Allen, Miss. Elisabeth Walton 1 1
2 Allison, Master. Hudson Trevor 1 1
3 Allison, Miss. Helen Loraine 0 1
4 Allison, Mr. Hudson Joshua Creighton 0 1
5 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) 0 1
6 Anderson, Mr. Harry 1 1
Sex Age SiblingsandSpouses ParentsandChildren Fare Port
1 female 29.0000 0 0 211.3375 S
2 male 0.9167 1 2 151.5500 S
3 female 2.0000 1 2 151.5500 S
4 male 30.0000 1 2 151.5500 S
5 female 25.0000 1 2 151.5500 S
6 male 48.0000 0 0 26.5500 S
Home...Destination Validation
1 St Louis, MO 1
2 Montreal, PQ / Chesterville, ON 1
3 Montreal, PQ / Chesterville, ON 0
4 Montreal, PQ / Chesterville, ON 0
5 Montreal, PQ / Chesterville, ON 0
6 New York, NY 0
summary(PassData)
Name Survived Passenger.Class Sex
Length:1309 Min. :0.000 Min. :1.000 Length:1309
Class :character 1st Qu.:0.000 1st Qu.:2.000 Class :character
Mode :character Median :0.000 Median :3.000 Mode :character
Mean :0.382 Mean :2.295
3rd Qu.:1.000 3rd Qu.:3.000
Max. :1.000 Max. :3.000
Age SiblingsandSpouses ParentsandChildren Fare
Min. : 0.1667 Min. :0.0000 Min. :0.000 Min. : 0.000
1st Qu.:21.0000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.: 7.896
Median :28.0000 Median :0.0000 Median :0.000 Median : 14.454
Mean :29.8811 Mean :0.4989 Mean :0.385 Mean : 33.295
3rd Qu.:39.0000 3rd Qu.:1.0000 3rd Qu.:0.000 3rd Qu.: 31.275
Max. :80.0000 Max. :8.0000 Max. :9.000 Max. :512.329
NA's :263 NA's :1
Port Home...Destination Validation
Length:1309 Length:1309 Min. :0.0000
Class :character Class :character 1st Qu.:0.0000
Mode :character Mode :character Median :0.0000
Mean :0.3048
3rd Qu.:1.0000
Max. :1.0000
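The summary reports 263 missing values in Age and 1 in Fare. A minimal sketch for tabulating missing values per column is shown below. The helper name count_missing and the small demo data frame are assumptions made for illustration (the NA positions are invented); with the real data, pass PassData instead.

```r
# Stand-in data frame with invented NA positions (illustration only);
# replace `demo` with PassData to run this on the real data.
demo <- data.frame(
  Age  = c(29, NA, 2, 30, NA, 48),
  Fare = c(211.3375, 151.55, 151.55, NA, 151.55, 26.55)
)

# Hypothetical helper: count and percentage of missing values in each column
count_missing <- function(df) {
  n_missing <- colSums(is.na(df))
  data.frame(
    column      = names(n_missing),
    n_missing   = as.integer(n_missing),
    pct_missing = round(100 * as.integer(n_missing) / nrow(df), 1)
  )
}

missing_report <- count_missing(demo)
print(missing_report)
```

Knowing which columns have gaps matters for the later steps, since functions such as mean() return NA unless na.rm = TRUE is set.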
# Create a histogram of the Fare variable
ggplot(PassData, aes(x = Fare)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Fare") +
  xlab("Fare")
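The summary reports fares ranging from 0 to about 512, so a bin width of 1 produces several hundred narrow bins. The sketch below shows the same histogram with coarser bins. The value binwidth = 10 is an assumption chosen only for illustration, and the demo data frame holds just the six fares printed by head(PassData); swap in PassData for the full picture.

```r
library(ggplot2)

# The six Fare values shown by head(PassData); use PassData for the full data
demo <- data.frame(Fare = c(211.3375, 151.55, 151.55, 151.55, 151.55, 26.55))

# binwidth = 10 is an assumed value; boundary = 0 starts the first bin at zero
fare_hist <- ggplot(demo, aes(x = Fare)) +
  geom_histogram(binwidth = 10, boundary = 0,
                 fill = "skyblue", color = "black") +
  labs(title = "Histogram of Fare (binwidth = 10)", x = "Fare", y = "Count")

print(fare_hist)
```

Trying a few bin widths is worthwhile because the apparent shape of a skewed variable like Fare depends heavily on this choice.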
# Box plot for the first numeric variable (Fare)
boxplot(PassData$Fare)
# Box plot for the second numeric variable (Age)
boxplot(PassData$Age)
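By default, boxplot() draws as individual points any values lying more than 1.5 times the interquartile range beyond the hinges. To list those values rather than read them off the plot, one option is boxplot.stats() from base R, sketched here on the six Age and Fare values printed by head(PassData). With the real data, pass PassData$Age or PassData$Fare instead.

```r
# Values taken from the head(PassData) output; use the full columns in practice
age  <- c(29, 0.9167, 2, 30, 25, 48)
fare <- c(211.3375, 151.55, 151.55, 151.55, 151.55, 26.55)

# $out holds the points a box plot would draw beyond the whiskers
# (default coef = 1.5, matching boxplot()'s default range = 1.5)
age_outliers  <- boxplot.stats(age)$out
fare_outliers <- boxplot.stats(fare)$out

print(age_outliers)
print(fare_outliers)
```

Missing values are ignored by boxplot.stats(), so the 263 NAs in Age do not need to be removed first.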
Statistics
Review the statistics of the data set
# Summary statistics
summary(PassData)
Name Survived Passenger.Class Sex
Length:1309 Min. :0.000 Min. :1.000 Length:1309
Class :character 1st Qu.:0.000 1st Qu.:2.000 Class :character
Mode :character Median :0.000 Median :3.000 Mode :character
Mean :0.382 Mean :2.295
3rd Qu.:1.000 3rd Qu.:3.000
Max. :1.000 Max. :3.000
Age SiblingsandSpouses ParentsandChildren Fare
Min. : 0.1667 Min. :0.0000 Min. :0.000 Min. : 0.000
1st Qu.:21.0000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.: 7.896
Median :28.0000 Median :0.0000 Median :0.000 Median : 14.454
Mean :29.8811 Mean :0.4989 Mean :0.385 Mean : 33.295
3rd Qu.:39.0000 3rd Qu.:1.0000 3rd Qu.:0.000 3rd Qu.: 31.275
Max. :80.0000 Max. :8.0000 Max. :9.000 Max. :512.329
NA's :263 NA's :1
Port Home...Destination Validation
Length:1309 Length:1309 Min. :0.0000
Class :character Class :character 1st Qu.:0.0000
Mode :character Mode :character Median :0.0000
Mean :0.3048
3rd Qu.:1.0000
Max. :1.0000
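The overall summary gives a mean of 0.382 for Survived, which is the overall survival rate because the column is coded 0/1. A natural next step is to break that rate down by group. The sketch below does so with dplyr on the six rows printed by head(PassData); the grouping columns chosen here (Passenger.Class and Sex) are one possible choice, not the only one.

```r
library(dplyr)

# The six rows shown by head(PassData); replace `demo` with PassData
demo <- data.frame(
  Survived        = c(1, 1, 0, 0, 0, 1),
  Sex             = c("female", "male", "female", "male", "female", "male"),
  Passenger.Class = c(1, 1, 1, 1, 1, 1)
)

# Group size and survival rate for each class/sex combination
survival_by_group <- demo %>%
  group_by(Passenger.Class, Sex) %>%
  summarise(n = n(), survival_rate = mean(Survived), .groups = "drop")

print(survival_by_group)
```

Reporting n alongside the rate helps flag groups that are too small for the rate to be meaningful, as is the case with this six-row sample.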