In this post, I look at the chi-square test of independence. The data set I use is published on https://smartcities.data.gov.in, a Government of India project under the National Data Sharing and Accessibility Policy.

I want to find the safest and deadliest ways to travel on Bangalore roads. The Injuries_and_Fatalities_Bengaluru_from_2016to2018.csv data set contains the total number of injuries and fatalities in Bangalore from 2016 to 2018 for each type of transport. I take injuries as a proxy for the number of incidents that took place.

As I want to test whether the proportion of fatalities differs significantly across types of transport, the null and alternative hypotheses are as follows:
H0: The outcome (fatality or injury) is independent of the type of transport
H1: The outcome (fatality or injury) depends on the type of transport

Sample data set:

##                                                                   instance
## 1                                         2018 - Total Injuries - Bicycles
## 2 2017 - Total Injuries - Other modes of road transport (auto, bus, lorry)
## 3                                   2016 - Total Fatalities - Two-wheelers
## 4                                       2017 - Total Injuries - Pedestrian
## 5                                     2017 - Total Fatalities - Pedestrian
##   count year             type
## 1    43 2018   Total Injuries
## 2  1380 2017   Total Injuries
## 3   374 2016 Total Fatalities
## 4  1346 2017   Total Injuries
## 5   284 2017 Total Fatalities
##                                          transport
## 1                                         Bicycles
## 2 Other modes of road transport (auto, bus, lorry)
## 3                                     Two-wheelers
## 4                                       Pedestrian
## 5                                       Pedestrian

The contingency table for the year 2017 is

library(dplyr)
library(tidyr)
library(kableExtra)

# Keep the 2017 rows and spread injuries and fatalities into separate columns
contingency_table <- data %>% filter(year == 2017) %>% 
  dplyr::select(type, transport, count) %>% 
  spread(type, count)

# Render the contingency table
kable(contingency_table,
      caption = 'Contingency Table') %>% 
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  collapse_rows(columns = 1:2, valign = "middle") %>% 
  scroll_box()
Contingency Table

transport                                          Total Fatalities  Total Injuries
Bicycles                                                          8              31
Other modes of road transport (auto, bus, lorry)                252            1380
Pedestrian                                                      284            1346
Two-wheelers                                                     98            1499
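
To follow along without downloading the CSV, the 2017 table can be typed in by hand. The sketch below is a minimal base R version, with the counts copied from the table above. The names `counts_2017` and `row_share` are assumptions made for this illustration and are not part of the original analysis.

```r
# 2017 counts copied from the contingency table above.
# `counts_2017` is a hypothetical name used only in this sketch.
counts_2017 <- matrix(
  c(  8,   31,
    252, 1380,
    284, 1346,
     98, 1499),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    transport = c("Bicycles",
                  "Other modes of road transport (auto, bus, lorry)",
                  "Pedestrian",
                  "Two-wheelers"),
    type = c("Total Fatalities", "Total Injuries")
  )
)

# Add row and column totals to the table
addmargins(counts_2017)

# Share of fatalities and injuries within each type of transport
row_share <- prop.table(counts_2017, margin = 1)
round(row_share, 3)
```

The row shares are what the mosaic plot displays: within each type of transport, the fraction of recorded casualties that were fatal.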

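A mosaic plot of injuries and fatalities by type of transport (note that the code below uses the full data set, 2016 to 2018, rather than only 2017):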

library(ggmosaic)
ggplot(data = data) +
  geom_mosaic(aes(weight = count, x = product(transport), fill = type), na.rm = TRUE) +
  labs(x = 'Type of transport', y = '%', title = 'What type of transport to use') +
  theme_minimal() +
  theme(legend.position = "bottom")

From the above plot, there appears to be a difference in the share of fatalities across the types of transport. To check whether this difference is statistically significant, I will conduct a chi-square test of independence.
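
Before running the test with library functions, it may help to see what it computes. Under independence, the expected count in each cell is (row total × column total) / grand total, and the test statistic sums (observed − expected)² / expected over all cells. The following is a minimal base R sketch of that arithmetic for the 2017 counts. The variable names are assumptions made for illustration only.

```r
# 2017 counts copied from the contingency table above
# (`counts_2017` is a hypothetical name used only in this sketch).
counts_2017 <- matrix(
  c(  8,   31,
    252, 1380,
    284, 1346,
     98, 1499),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    transport = c("Bicycles",
                  "Other modes of road transport (auto, bus, lorry)",
                  "Pedestrian",
                  "Two-wheelers"),
    type = c("Total Fatalities", "Total Injuries")
  )
)

row_totals <- rowSums(counts_2017)
col_totals <- colSums(counts_2017)
n <- sum(counts_2017)

# Expected counts if transport and outcome were independent
expected <- outer(row_totals, col_totals) / n

# Pearson chi-square statistic: sum of per-cell contributions
chi_sq <- sum((counts_2017 - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(counts_2017) - 1) * (ncol(counts_2017) - 1)

round(expected, 3)
chi_sq
df
```

These values can be compared with the "Expected N" rows and the test summary in the CrossTable output further down.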

library(gmodels)
# Convert the contingency table to a flat table with one row per casualty.
# Two vectors hold the values of the two columns.
caseType <- c();  conditionType <- c()

# For each cell, repeat the (row name, column name) combination
# as many times as the count in that cell
for(i in 1:nrow(contingency_table)) {
  for(j in 2:ncol(contingency_table)) {
    # [[i, j]] returns a plain value for both data frames and tibbles
    numRepeats <- contingency_table[[i, j]]
    
    caseType <- append(caseType, 
                       rep(contingency_table[[i, 1]],
                           numRepeats))
    conditionType <- append(conditionType, 
                        rep(colnames(contingency_table)[j],
                            numRepeats))
  }
}

# Construct the table from the vectors
flatTable <- data.frame(caseType, conditionType)
CrossTable(flatTable$caseType, flatTable$conditionType,
           dnn=c("Transportation Type", "Accident type"),
           expected=TRUE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |              Expected N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  4898 
## 
##  
##                                                  | Accident type 
##                              Transportation Type | Total Fatalities |   Total Injuries |        Row Total | 
## -------------------------------------------------|------------------|------------------|------------------|
##                                         Bicycles |                8 |               31 |               39 | 
##                                                  |            5.112 |           33.888 |                  | 
##                                                  |            1.632 |            0.246 |                  | 
##                                                  |            0.205 |            0.795 |            0.008 | 
##                                                  |            0.012 |            0.007 |                  | 
##                                                  |            0.002 |            0.006 |                  | 
## -------------------------------------------------|------------------|------------------|------------------|
## Other modes of road transport (auto, bus, lorry) |              252 |             1380 |             1632 | 
##                                                  |          213.913 |         1418.087 |                  | 
##                                                  |            6.782 |            1.023 |                  | 
##                                                  |            0.154 |            0.846 |            0.333 | 
##                                                  |            0.393 |            0.324 |                  | 
##                                                  |            0.051 |            0.282 |                  | 
## -------------------------------------------------|------------------|------------------|------------------|
##                                       Pedestrian |              284 |             1346 |             1630 | 
##                                                  |          213.650 |         1416.350 |                  | 
##                                                  |           23.164 |            3.494 |                  | 
##                                                  |            0.174 |            0.826 |            0.333 | 
##                                                  |            0.442 |            0.316 |                  | 
##                                                  |            0.058 |            0.275 |                  | 
## -------------------------------------------------|------------------|------------------|------------------|
##                                     Two-wheelers |               98 |             1499 |             1597 | 
##                                                  |          209.325 |         1387.675 |                  | 
##                                                  |           59.206 |            8.931 |                  | 
##                                                  |            0.061 |            0.939 |            0.326 | 
##                                                  |            0.153 |            0.352 |                  | 
##                                                  |            0.020 |            0.306 |                  | 
## -------------------------------------------------|------------------|------------------|------------------|
##                                     Column Total |              642 |             4256 |             4898 | 
##                                                  |            0.131 |            0.869 |                  | 
## -------------------------------------------------|------------------|------------------|------------------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  104.4776     d.f. =  3     p =  1.692478e-22 
## 
## 
## 
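
The nested loop above expands the contingency table into one row per recorded casualty. As a hedged alternative, the same expansion can be sketched in base R without explicit loops by repeating row indices. The counts are typed in from the 2017 table, and the names `counts_2017`, `long`, `flat` and `rebuilt` are assumptions made for this illustration.

```r
# 2017 counts copied from the contingency table
# (`counts_2017` is a hypothetical name used only in this sketch).
counts_2017 <- matrix(
  c(  8,   31,
    252, 1380,
    284, 1346,
     98, 1499),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    transport = c("Bicycles",
                  "Other modes of road transport (auto, bus, lorry)",
                  "Pedestrian",
                  "Two-wheelers"),
    type = c("Total Fatalities", "Total Injuries")
  )
)

# One row per cell, with columns transport, type and Freq
long <- as.data.frame(as.table(counts_2017), stringsAsFactors = FALSE)

# Repeat each row index Freq times to get one row per casualty
flat <- long[rep(seq_len(nrow(long)), times = long$Freq),
             c("transport", "type")]
rownames(flat) <- NULL

# Cross-tabulating the flat table recovers the original counts
rebuilt <- table(flat$transport, flat$type)
rebuilt
```

The expansion is only needed when a function expects one observation per row. `chisq.test()` accepts the table of counts directly.
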
chi.test <- chisq.test(contingency_table[,2:3])
print(chi.test)
## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table[, 2:3]
## X-squared = 104.48, df = 3, p-value < 2.2e-16
# chi.sq.plot() is not part of base R; it appears to be a custom plotting helper defined elsewhere
chi.sq.plot(chi.sq = chi.test$statistic, df = chi.test$parameter, title = 'Null hypothesis to test independence')
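
The decision can also be checked numerically against the chi-square distribution with 3 degrees of freedom. The sketch below uses base R's `qchisq()` and `pchisq()` with the statistic reported above. The variable names are assumptions made for illustration.

```r
# Statistic and degrees of freedom as reported in the test output above
chi_sq <- 104.4776
df <- 3
alpha <- 0.05

# Critical value: reject H0 if the statistic exceeds this
critical_value <- qchisq(1 - alpha, df = df)

# p-value: upper-tail probability of a value at least as large as chi_sq
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

reject_null <- chi_sq > critical_value

critical_value
p_value
reject_null
```

Computing the upper tail with `lower.tail = FALSE` avoids the loss of precision that `1 - pchisq(...)` would suffer for such a small p-value.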

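As p < α, where α = 0.05, I reject the null hypothesis. The proportion of fatalities among recorded casualties differs significantly across types of transport. In the 2017 data, two-wheelers have the lowest share of fatalities (6.1% of recorded casualties) and bicycles the highest (20.5%), although the bicycle figure rests on only 39 cases. On this measure, travelling on a two-wheeler is the safest while a bicycle is the most dangerous.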
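
The test itself only says that the outcome and the type of transport are not independent; it does not say which types differ. One hedged way to look into that is to inspect the Pearson residuals, (observed − expected) / √expected, which `chisq.test()` returns in its `residuals` component. The sketch below uses the 2017 counts typed in by hand. The names `counts_2017`, `pearson_resid` and `contribution` are assumptions made for illustration.

```r
# 2017 counts copied from the contingency table
# (`counts_2017` is a hypothetical name used only in this sketch).
counts_2017 <- matrix(
  c(  8,   31,
    252, 1380,
    284, 1346,
     98, 1499),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    transport = c("Bicycles",
                  "Other modes of road transport (auto, bus, lorry)",
                  "Pedestrian",
                  "Two-wheelers"),
    type = c("Total Fatalities", "Total Injuries")
  )
)

test <- chisq.test(counts_2017)

# Pearson residuals: negative means fewer cases than expected under H0
pearson_resid <- test$residuals

# Squared residuals are the per-cell contributions to the statistic
contribution <- pearson_resid^2

round(pearson_resid, 2)

# Percentage of the chi-square statistic contributed by each cell
round(100 * contribution / sum(contribution), 1)
```

The squared residuals correspond to the "Chi-square contribution" rows in the CrossTable output, so the cell with the largest value is the one that departs most from independence.
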
Created on 24th June 2019, Achyuthuni Sri Harsha
