Integrating Google location history

This blog is a continuation of in-time analysis problem. I want to analyse my entry time at office and understand what factors effect it.

I want to integrate my Google Location history to my analysis. This to find the factors
1. Travel time
2. Mode of transportation

It’s no secret that Google stores a data on all of us. But at least they allow us to access quite a lot of the data they have collected on us. Among this is the Google location history.

The Google location history can be downloaded from your Google account under https://takeout.google.com/settings/takeout. I have only selected “location history” for download, otherwise it will take super long to get all my Google data.

The downloaded data is a .json file and can be loaded with the jsonlite package. Loading this file into R might take a few minutes because it can be quite big, depending on how many location points Google had saved.

library(jsonlite)
system.time(
  x <- fromJSON("Takeout/Location History/Location History.json")
)
##    user  system elapsed 
##   42.33    0.94   43.68

Converting to data frame

locations <- as.data.frame(x$locations)
dplyr::sample_n(locations, 5)
##          timestampMs latitudeE7 longitudeE7 accuracy
## 103365 1536096790010  129768902   777331000       26
## 35910  1540915392804  129831967   777385539       25
## 122524 1533614092634  129768885   777331206       24
## 250121 1514781948632  129762573   777340167       30
## 172682 1526581945449  129768248   777331578       26
##                                                                                                                                                              activity
## 103365                                                                                                                                                           NULL
## 35910                                                                                                                                                            NULL
## 122524                                                                                                                                                           NULL
## 250121                                                                                                                                                           NULL
## 172682 1526581948974, STILL, IN_VEHICLE, IN_ROAD_VEHICLE, UNKNOWN, UNKNOWN, ON_FOOT, WALKING, IN_RAIL_VEHICLE, UNKNOWN, ON_BICYCLE, 61, 26, 26, 13, 13, 5, 5, 4, 3, 1
##        altitude verticalAccuracy velocity heading
## 103365       NA               NA       NA      NA
## 35910        NA               NA       NA      NA
## 122524       NA               NA       NA      NA
## 250121       NA               NA       NA      NA
## 172682       NA               NA       NA      NA

The date and time column is in the POSIX milliseconds format, so I converted it to human readable format.

# The Unix epoch is 1970-01-01 00:00:00 UTC*
locations$time <- as.POSIXct(as.numeric(locations$timestampMs)/1000, origin = "1970-01-01")

Similarly, longitude and latitude are saved in 10^7 format and were converted to GPS coordinates.

locations$lat = locations$latitudeE7 / 1e7
locations$lon = locations$longitudeE7 / 1e7

The modified dataFrame will look like this

##                       time      lat      lon accuracy
## 52943  2018-10-15 20:35:10 12.98313 77.73848       25
## 261208 2017-12-10 01:28:14 12.97703 77.73333       20
## 222407 2018-02-20 11:54:46 12.97704 77.73311       23
## 31771  2018-11-04 20:15:41 12.97459 77.75759       16
## 135606 2018-07-18 22:39:30 12.98714 77.73598       26
##                           activity
## 52943                         NULL
## 261208                        NULL
## 222407 1519107857104, TILTING, 100
## 31771    1541342741418, STILL, 100
## 135606                        NULL

Data stats

Before I get to actually plotting maps, I want to explore a few basic statistics of the data.
1. How many data points did Google record over what period of time?

nrow(locations)
## [1] 268248
  1. From when till when is this data recorded
min(locations$time)
## [1] "2017-11-22 13:20:07 IST"
max(locations$time)
## [1] "2018-12-05 15:14:54 IST"
  1. How many days were recorded?
length(unique(date(locations$time)))
## [1] 379

Integrating with attendance data set

The data set contains the time when I swiped into office and the time when I swiped out of office. Data from 4th October 2017 to 29th November 2018 is considered. For more details on attendance data set, please read here

Finding how many working days are captured in Google Maps.

working_days <- locations %>% 
  filter((date(time) %in% date(attendance$Attendance.Date)))
length(unique(date(working_days$time)))
## [1] 246

A total of 246 working days were captured of 266 working days.

library(ggmap)
qmplot(lon, lat, data = working_days,zoom = 5)

Filtering for data for Bangalore

# Bangalore North Eastern border is 13.071393, 77.787015 (Hoskote)
# Bangalore South Western border is 12.889629, 77.454102 (Kumbalgodu)
bangalore <- locations %>% filter(lon < 77.787015, lon > 77.454102, lat < 13.071393,lat > 12.889629 )
qmplot(lon, lat, data = bangalore, geom = c("point","density2d"))

Considering only the time when I will be leaving the house
out.of.house.time = Clustering the longitudes and latitudes near my house. Then finding the maximum time (before the office in time) when I was in this cluster.
in.time = office in time taken from attendance data set

travelling.time = out.of.house.time – in.time

##           date travelling.time
## 124 2018-06-04   2823.197 secs
## 168 2018-08-09   1347.491 secs
## 70  2018-03-13   1266.811 secs
## 194 2018-09-24   1038.792 secs
## 74  2018-03-19    791.593 secs

Plotting histogram of travel time

ggplot(travel,aes(x = as.numeric(travelling.time))) +
  geom_histogram(aes(y=..density..), bins = 50, fill = "cornflowerblue", alpha = 0.8) +
  labs(x = 'Travel Time (seconds)', y='Density', title = "Travel time") +
  theme_minimal()

Mode of transport

Google also guesses my activity based on distance traveled per time.
In the Bangalore data frame, the column activity has activities data.
Removing null values and creating a data frame for activities (in Bangalore)
Capturing the activity in main_activity

activities <- bangalore$activity

list.condition <- sapply(activities, function(x) !is.null(x[[1]]))

activities  <- activities[list.condition]

activities.df <- do.call("rbind", activities)

activities.df <- activities.df %>%
  mutate(main_activity = sapply(activity, function(x) x[[1]][1][[1]][1]),
         time = as.POSIXct(as.numeric(timestampMs)/1000, origin = "1970-01-01")) %>% 
  dplyr::select(time, main_activity)
dplyr::sample_n(activities.df, 5)
##                      time main_activity
## 31039 2018-08-17 11:57:32    IN_VEHICLE
## 13259 2018-10-09 09:35:16       TILTING
## 40059 2018-07-17 11:41:28       TILTING
## 90643 2018-01-14 23:39:23       ON_FOOT
## 79169 2018-03-08 20:15:39         STILL

Plotting the frequency of Main Activities in Bangalore.

ggplot(activities.df, aes(x = main_activity, group = main_activity, fill = main_activity)) + 
  geom_bar()  + 
  guides(fill = FALSE) +
  theme_minimal() +
  labs(
    x = "",
    y = "Count",
    title = "Main activities in Bangalore",
    caption = "Associated activity for recorded positions in Bangalore. 
    Because Google records activity probabilities for each position, 
    only the activity with highest likelihood were chosen for each position."
  )

Filtering for transportation during coming to office only
1. Filtering for activities during working days only
2. Merging the travel data set(contains travel time)
3. Filtering for time between out.of.house.time and in.time (office in time)
4. On days where more than one means of transport is used, considering the one which was observed maximum number of times or using some logic.

transport <- activities.df %>% 
  filter((date(time) %in% date(attendance$Attendance.Date))) %>% 
  mutate(date = date(time))
transport <- merge(transport, travel, by = 'date') %>% 
  filter(time > out.of.house.time, time < in.time.dmy.hm) %>% 
  group_by(date, main_activity) %>% 
  summarise(count = n()) %>% 
  group_by(date) %>%
  filter(main_activity %in% c("ON_BICYCLE", "EXITING_VEHICLE", 'IN_VEHICLE',"ON_FOOT")) %>%
  top_n(1, count) %>% 
  top_n(1, match(main_activity, c("ON_BICYCLE", 'IN_VEHICLE', "EXITING_VEHICLE", "ON_FOOT"))) %>%
  arrange(date)

The final data set will look as follows:

##           date main_activity travelling.time
## 168 2018-08-17    IN_VEHICLE    731.287 secs
## 35  2018-01-18       ON_FOOT   1029.621 secs
## 30  2018-01-10       ON_FOOT    308.578 secs
## 190 2018-09-25    IN_VEHICLE    616.699 secs
## 198 2018-10-10    IN_VEHICLE   1501.639 secs

Plotting the frequencies of the means of transportation.

ggplot(transport, aes(x = main_activity, group = main_activity, fill = main_activity)) + 
  geom_bar()  + 
  guides(fill = FALSE) +
  theme_minimal() +
  labs(
    x = "",
    y = "Count",
    title = "Mode of transport"
  )

These variables travel time and mode of transportation will be used in another blog post to do analysis on office in-time.

Extra

Scatter plot of travelling time with different means of transportation