Introduction

Multivariate EDA techniques show the relationship between two or more variables and the dependent variable, in the form of cross-tabulation, summary statistics or visualization. In the current problem, they will help us look at the relationships in our data.

This blog is part of the in-time analysis problem. I want to analyse my entry time at the office and understand which factors affect it.
After integrating Google Maps data with the attendance dataset, I currently use the following factors:
1. date (month / week day / season etc)
2. main_activity (means of transport)
3. hours.worked (of the previous day)
4. travelling.time (time it took to travel from house to office)
5. home.addr (the place of residence)
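
The month, week day and season mentioned in factor 1 can be derived from the date column. Below is a minimal sketch using the dates from the sample rows in this post. The names `month_num`, `week_day` and `season_of` are my own for illustration, and the season mapping (meteorological, northern hemisphere) is an assumption that should be adjusted to the local climate.

```r
# Dates taken from the sample rows shown in this post
dates <- as.Date(c("2018-04-13", "2018-01-12", "2018-05-09",
                   "2018-10-24", "2018-07-10"))

# Month number (1-12) and ISO week day number (1 = Monday ... 7 = Sunday)
month_num <- as.integer(format(dates, "%m"))
week_day  <- as.integer(format(dates, "%u"))

# Hypothetical season mapping, indexed by month number (an assumption)
season_of <- function(m) {
  c("winter", "winter", "spring", "spring", "spring", "summer",
    "summer", "summer", "autumn", "autumn", "autumn", "winter")[m]
}

date_factors <- data.frame(date = dates, month = month_num,
                           week_day = week_day,
                           season = season_of(month_num))
print(date_factors)
```

Using numeric codes instead of `weekdays()` or `months()` keeps the result independent of the system locale.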

The dependent variable is diff.in.time (the difference between my actual in-time and the policy in-time). A sample of the data is shown below.
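
To make the sign of diff.in.time concrete, here is a small sketch of how such a difference could be computed. The policy in-time of 09:00 and the two arrival times are made-up values for illustration only, not taken from my data. The sign convention (positive means arriving before the policy in-time) matches how the values are read later in this post.

```r
# Hypothetical policy in-time and made-up arrival times (illustration only)
policy_in <- as.POSIXct("2018-04-13 09:00:00", tz = "UTC")
actual_in <- as.POSIXct(c("2018-04-13 08:37:00",
                          "2018-04-13 09:13:00"), tz = "UTC")

# Positive = arrived before the policy in-time, negative = arrived after it
diff_in_time <- as.numeric(difftime(policy_in, actual_in, units = "mins"))
print(diff_in_time)
```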

Sample Data
     diff.in.time  date        main_activity  hours.worked  travelling.time  home.addr  diff.out.time
88   23            2018-04-13  ON_BICYCLE     9.316667      918.324          Old House  -4
31   38            2018-01-12  ON_FOOT        2.083333      1477.822         Old House  -453
103  20            2018-05-09  ON_FOOT        9.716667      749.121          Old House  23
208  -13           2018-10-24  ON_BICYCLE     9.800000      1683.958         New House  61
143  -3            2018-07-10  IN_VEHICLE     9.766667      574.590          Old House  49

Cross-tabulation

For categorical data cross-tabulation is very useful. For two variables, cross-tabulation is performed by making a two-way table with column headings that match the levels of one variable and row headings that match the levels of the other variable, then filling in the counts of all subjects that share a pair of levels. The two variables might be both explanatory, both outcome, or one of each.
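
The two-way table of counts described above can be sketched with base R's `table()`. This is only an illustration on the five sample rows from this post (the name `travel_sample` is mine), not the analysis on the full dataset.

```r
# The five sample rows from this post (two categorical columns only)
travel_sample <- data.frame(
  main_activity = c("ON_BICYCLE", "ON_FOOT", "ON_FOOT",
                    "ON_BICYCLE", "IN_VEHICLE"),
  home.addr     = c("Old House", "Old House", "Old House",
                    "New House", "Old House")
)

# Rows = levels of home.addr, columns = levels of main_activity,
# cells = number of observations sharing that pair of levels
counts <- table(travel_sample$home.addr, travel_sample$main_activity)
print(counts)

# Same table with row and column totals added
print(addmargins(counts))
```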

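I am using kable (from knitr, styled with kableExtra) to make cool tables. Instead of plain counts, each cell group here holds summary statistics for a pair of levels.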

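library(dplyr)       # %>%, group_by(), summarise(), arrange()
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), column_spec(), collapse_rows(), scroll_box()

# Summarise travelling time and in-time difference for each
# combination of residence and mode of transport
cross_table <- travel %>%
  group_by(home.addr, main_activity) %>%
  summarise(avg.travel.time = mean(travelling.time),
            avg.in.time.diff = mean(diff.in.time),
            median.in.time.diff = median(diff.in.time)) %>%
  arrange(home.addr, main_activity)

kable(cross_table, caption = 'Cross Tabulation') %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  collapse_rows(columns = 1:2, valign = "middle") %>%
  scroll_box()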
Cross Tabulation
home.addr  main_activity  avg.travel.time  avg.in.time.diff  median.in.time.diff
New House  IN_VEHICLE     1334.0650        -1.500000         -1.0
           ON_BICYCLE     1547.5557        -4.000000         -6.0
           ON_FOOT        1695.7091        5.285714          5.0
Old House  IN_VEHICLE     771.1752         2.857143          -4.0
           ON_BICYCLE     997.2439         15.555556         19.5
           ON_FOOT        1176.9413        17.357143         17.0

Scatter plots

Scatter plots show how one numeric variable changes with another.

Let’s see how the travelling time affects the in-time.

library(ggplot2)

# In-time difference on x, travelling time on y, coloured by mode of transport
ggplot(travel, aes(x = diff.in.time, y = travelling.time, color = main_activity)) +
  geom_point(show.legend = TRUE) +
  labs(x = 'In-time difference (Minutes)', y = 'Travelling time (seconds)',
       title = "Travelling time vs in-time", color = 'Mode of transport') +
  theme_minimal() + theme(legend.position = "bottom")

From the above graph, I can see that:
1. On a bicycle, as travelling time decreases (low traffic), the in-time difference increases (I arrive earlier at the office).
2. On foot, there seems to be no relationship between travelling time (traffic) and in-time difference.
3. In a vehicle, travelling time has little effect on the in-time difference.

Let’s see how the hours worked (on the previous day) affect the in-time.
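
A plot of this kind could be drawn in the same way as the previous scatter plot. The sketch below is an assumption rather than the original code: it uses only the five sample rows from this post (in a data frame I call `travel_sample`), and the choice of hours worked on the x axis is mine.

```r
library(ggplot2)

# The five sample rows from this post; the full `travel` data frame
# would be used in practice
travel_sample <- data.frame(
  diff.in.time  = c(23, 38, 20, -13, -3),
  main_activity = c("ON_BICYCLE", "ON_FOOT", "ON_FOOT",
                    "ON_BICYCLE", "IN_VEHICLE"),
  hours.worked  = c(9.316667, 2.083333, 9.716667, 9.800000, 9.766667)
)

# Hours worked on x, in-time difference on y, coloured by mode of transport
p <- ggplot(travel_sample,
            aes(x = hours.worked, y = diff.in.time, color = main_activity)) +
  geom_point() +
  labs(x = "Hours worked on previous day", y = "In-time difference (minutes)",
       title = "Hours worked vs in-time", color = "Mode of transport") +
  theme_minimal() + theme(legend.position = "bottom")
print(p)
```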

From the above graph, I can observe that, irrespective of the mode of transport, my in-time difference increases (I arrive earlier at the office) as the hours worked on the previous day increase.

Box plots

Similarly, I want to see how the mode of transport affects the in-time difference. For a categorical variable, box plots are well suited to displaying this information.

ggplot(travel, aes(x = main_activity, y = diff.in.time, group = main_activity)) +
  geom_boxplot() +
  labs(x = 'Mode of transport', y = 'In time difference (min)') +
  theme_minimal()

From the above graph, I can observe that:
1. By vehicle, I reached the office on average ~12 minutes after the policy in-time (in-time difference is -12).
2. By bicycle, I reached the office close to the policy in-time.
3. On foot, I almost always arrived before the policy in-time.

The same can be done for place of residence.
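
A box plot by residence could be drawn with the same approach as the box plot for mode of transport. The sketch below is an assumption rather than the original code, and it uses only the five sample rows from this post (in a data frame I call `travel_sample`), so the boxes it draws are not the ones discussed here.

```r
library(ggplot2)

# The five sample rows from this post; the full `travel` data frame
# would be used in practice
travel_sample <- data.frame(
  diff.in.time = c(23, 38, 20, -13, -3),
  home.addr    = c("Old House", "Old House", "Old House",
                   "New House", "Old House")
)

# One box per place of residence
p <- ggplot(travel_sample, aes(x = home.addr, y = diff.in.time)) +
  geom_boxplot() +
  labs(x = "Place of residence", y = "In time difference (min)") +
  theme_minimal()
print(p)
```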

From this graph, I can understand that from the New House I arrived ~5 minutes after the policy in-time, while I used to be on time when living in the Old House.

Created using R Markdown.

Credits:
Thinkstats
Experimental Design and Analysis
