This blog is a continuation of in-time analysis problem. I want to analyse my entry time at office and understand what factors effect it.

I want to integrate my Google Location history to my analysis. This to find the factors
1. Travel time
2. Mode of transportation

It’s no secret that Google stores a data on all of us. But at least they allow us to access quite a lot of the data they have collected on us. Among this is the Google location history.

The downloaded data is a .json file and can be loaded with the jsonlite package. Loading this file into R might take a few minutes because it can be quite big, depending on how many location points Google had saved.

library(jsonlite)
system.time(
x <- fromJSON("Takeout/Location History/Location History.json")
)
##    user  system elapsed
##   42.33    0.94   43.68

Converting to data frame

locations <- as.data.frame(x$locations) dplyr::sample_n(locations, 5) ## timestampMs latitudeE7 longitudeE7 accuracy ## 103365 1536096790010 129768902 777331000 26 ## 35910 1540915392804 129831967 777385539 25 ## 122524 1533614092634 129768885 777331206 24 ## 250121 1514781948632 129762573 777340167 30 ## 172682 1526581945449 129768248 777331578 26 ## activity ## 103365 NULL ## 35910 NULL ## 122524 NULL ## 250121 NULL ## 172682 1526581948974, STILL, IN_VEHICLE, IN_ROAD_VEHICLE, UNKNOWN, UNKNOWN, ON_FOOT, WALKING, IN_RAIL_VEHICLE, UNKNOWN, ON_BICYCLE, 61, 26, 26, 13, 13, 5, 5, 4, 3, 1 ## altitude verticalAccuracy velocity heading ## 103365 NA NA NA NA ## 35910 NA NA NA NA ## 122524 NA NA NA NA ## 250121 NA NA NA NA ## 172682 NA NA NA NA The date and time column is in the POSIX milliseconds format, so I converted it to human readable format. # The Unix epoch is 1970-01-01 00:00:00 UTC* locations$time <- as.POSIXct(as.numeric(locations$timestampMs)/1000, origin = "1970-01-01") Similarly, longitude and latitude are saved in 10^7 format and were converted to GPS coordinates. locations$lat = locations$latitudeE7 / 1e7 locations$lon = locations$longitudeE7 / 1e7 The modified dataFrame will look like this ## time lat lon accuracy ## 52943 2018-10-15 20:35:10 12.98313 77.73848 25 ## 261208 2017-12-10 01:28:14 12.97703 77.73333 20 ## 222407 2018-02-20 11:54:46 12.97704 77.73311 23 ## 31771 2018-11-04 20:15:41 12.97459 77.75759 16 ## 135606 2018-07-18 22:39:30 12.98714 77.73598 26 ## activity ## 52943 NULL ## 261208 NULL ## 222407 1519107857104, TILTING, 100 ## 31771 1541342741418, STILL, 100 ## 135606 NULL ### Data stats Before I get to actually plotting maps, I want to explore a few basic statistics of the data. 1. How many data points did Google record over what period of time? nrow(locations) ## [1] 268248 1. From when till when is this data recorded min(locations$time)
## [1] "2017-11-22 13:20:07 IST"
max(locations$time) ## [1] "2018-12-05 15:14:54 IST" 1. How many days were recorded? length(unique(date(locations$time)))
## [1] 379

### Integrating with attendance data set

The data set contains the time when I swiped into office and the time when I swiped out of office. Data from 4th October 2017 to 29th November 2018 is considered. For more details on attendance data set, please read here

Finding how many working days are captured in Google Maps.

working_days <- locations %>%
filter((date(time) %in% date(attendance$Attendance.Date))) length(unique(date(working_days$time)))
## [1] 246

A total of 246 working days were captured of 266 working days.

library(ggmap)
qmplot(lon, lat, data = working_days,zoom = 5)

Filtering for data for Bangalore

# Bangalore North Eastern border is 13.071393, 77.787015 (Hoskote)
# Bangalore South Western border is 12.889629, 77.454102 (Kumbalgodu)
bangalore <- locations %>% filter(lon < 77.787015, lon > 77.454102, lat < 13.071393,lat > 12.889629 )
qmplot(lon, lat, data = bangalore, geom = c("point","density2d"))

Considering only the time when I will be leaving the house
out.of.house.time = Clustering the longitudes and latitudes near my house. Then finding the maximum time (before the office in time) when I was in this cluster.
in.time = office in time taken from attendance data set

travelling.time = out.of.house.time – in.time

##           date travelling.time
## 124 2018-06-04   2823.197 secs
## 168 2018-08-09   1347.491 secs
## 70  2018-03-13   1266.811 secs
## 194 2018-09-24   1038.792 secs
## 74  2018-03-19    791.593 secs

Plotting histogram of travel time

ggplot(travel,aes(x = as.numeric(travelling.time))) +
geom_histogram(aes(y=..density..), bins = 50, fill = "cornflowerblue", alpha = 0.8) +
labs(x = 'Travel Time (seconds)', y='Density', title = "Travel time") +
theme_minimal()

### Mode of transport

Google also guesses my activity based on distance traveled per time.
In the Bangalore data frame, the column activity has activities data.
Removing null values and creating a data frame for activities (in Bangalore)
Capturing the activity in main_activity

activities <- bangalore$activity list.condition <- sapply(activities, function(x) !is.null(x[[1]])) activities <- activities[list.condition] activities.df <- do.call("rbind", activities) activities.df <- activities.df %>% mutate(main_activity = sapply(activity, function(x) x[[1]][1][[1]][1]), time = as.POSIXct(as.numeric(timestampMs)/1000, origin = "1970-01-01")) %>% dplyr::select(time, main_activity) dplyr::sample_n(activities.df, 5) ## time main_activity ## 31039 2018-08-17 11:57:32 IN_VEHICLE ## 13259 2018-10-09 09:35:16 TILTING ## 40059 2018-07-17 11:41:28 TILTING ## 90643 2018-01-14 23:39:23 ON_FOOT ## 79169 2018-03-08 20:15:39 STILL Plotting the frequency of Main Activities in Bangalore. ggplot(activities.df, aes(x = main_activity, group = main_activity, fill = main_activity)) + geom_bar() + guides(fill = FALSE) + theme_minimal() + labs( x = "", y = "Count", title = "Main activities in Bangalore", caption = "Associated activity for recorded positions in Bangalore. Because Google records activity probabilities for each position, only the activity with highest likelihood were chosen for each position." ) Filtering for transportation during coming to office only 1. Filtering for activities during working days only 2. Merging the travel data set(contains travel time) 3. Filtering for time between out.of.house.time and in.time (office in time) 4. On days where more than one means of transport is used, considering the one which was observed maximum number of times or using some logic. transport <- activities.df %>% filter((date(time) %in% date(attendance$Attendance.Date))) %>%
mutate(date = date(time))
transport <- merge(transport, travel, by = 'date') %>%
filter(time > out.of.house.time, time < in.time.dmy.hm) %>%
group_by(date, main_activity) %>%
summarise(count = n()) %>%
group_by(date) %>%
filter(main_activity %in% c("ON_BICYCLE", "EXITING_VEHICLE", 'IN_VEHICLE',"ON_FOOT")) %>%
top_n(1, count) %>%
top_n(1, match(main_activity, c("ON_BICYCLE", 'IN_VEHICLE', "EXITING_VEHICLE", "ON_FOOT"))) %>%
arrange(date)

The final data set will look as follows:

##           date main_activity travelling.time
## 168 2018-08-17    IN_VEHICLE    731.287 secs
## 35  2018-01-18       ON_FOOT   1029.621 secs
## 30  2018-01-10       ON_FOOT    308.578 secs
## 190 2018-09-25    IN_VEHICLE    616.699 secs
## 198 2018-10-10    IN_VEHICLE   1501.639 secs

Plotting the frequencies of the means of transportation.

ggplot(transport, aes(x = main_activity, group = main_activity, fill = main_activity)) +
geom_bar()  +
guides(fill = FALSE) +
theme_minimal() +
labs(
x = "",
y = "Count",
title = "Mode of transport"
)

These variables travel time and mode of transportation will be used in another blog post to do analysis on office in-time.

### Extra

Scatter plot of travelling time with different means of transportation