Univariate Analysis

Introduction

Univariate analysis is the simplest form of exploratory data analysis (EDA). “Uni” means “one”, so the data being analyzed has only one variable.

It doesn’t deal with causes or relationships; its main purpose is to describe. It takes the data, summarizes it, and finds patterns in it.

In describing or characterizing the observations of an individual variable, there are three basic properties that are of interest:

1. The location of observations, or how large or small the values of the individual observations are
2. The dispersion (sometimes called scale or spread) of the observations
3. The distribution of the observations
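As a minimal sketch of these three properties, base R can summarize a single numeric vector. The vector `x` below is made-up data for illustration, not the attendance data.

```r
# Made-up sample of one variable (minutes); not the real attendance data
x <- c(-12, -9, -8, 5, 7, 21, 33, 46, 63, 122)

# 1. Location: where the values sit
loc <- c(mean = mean(x), median = median(x))

# 2. Dispersion: how spread out the values are
disp <- c(sd = sd(x), iqr = IQR(x), range = diff(range(x)))

# 3. Distribution: shape, summarized here by a few quantiles
quants <- quantile(x, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))

print(loc)
print(disp)
print(quants)
```

The mean and median differ noticeably here because the single large value (122) pulls the mean upward, which is one reason to look at more than one measure of location.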

Univariate plots provide one way to find out about those properties. There are two basic kinds of univariate plots:

1. Enumeration plots, or plots that show every observation
2. Summary plots, or plots that generalize the data into a simplified representation

For this tutorial, I will use my office attendance data set. It contains the time when I swiped into the office and the time when I swiped out, from 4th October 2017 to 29th November 2018.

After some manipulation of the data set, I get the difference between the policy out-time and my actual out-time. I am allowed to leave up to 15 minutes before the policy out-time. A sample of the data after manipulation is shown below. (The actual data is not shown for security reasons. This is mock data that is very similar to the actual data, and the analysis is the same.)

##     Attendance.Date diff.in.time diff.out.time
## 109      2018-06-18     -11 mins       63 mins
## 38       2018-10-01      -9 mins       46 mins
## 128      2018-05-17       8 mins       -9 mins
## 260      2017-11-03     122 mins       33 mins
## 179      2018-03-02       7 mins       -8 mins
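The manipulation itself is not shown in this tutorial. A minimal sketch of one way to derive such columns with `difftime()` follows. The column names `swipe.in` and `swipe.out`, the helper `to_time()`, and the policy times 09:30 and 18:30 are hypothetical placeholders, not the actual values.

```r
# Hypothetical raw swipe data; names and times are assumptions for illustration
raw <- data.frame(
  Attendance.Date = as.Date(c("2018-06-18", "2018-10-01", "2018-05-17")),
  swipe.in  = c("09:19", "09:21", "09:38"),
  swipe.out = c("19:33", "19:16", "18:21"),
  stringsAsFactors = FALSE
)

# Assumed policy times (placeholders, not the real office policy)
policy.in  <- "09:30"
policy.out <- "18:30"

# Build full timestamps so difftime() can subtract them
to_time <- function(date, hhmm) {
  as.POSIXct(paste(date, hhmm), format = "%Y-%m-%d %H:%M", tz = "UTC")
}

attendance <- data.frame(
  Attendance.Date = raw$Attendance.Date,
  # Positive values mean arriving or leaving after the policy time
  diff.in.time  = difftime(to_time(raw$Attendance.Date, raw$swipe.in),
                           to_time(raw$Attendance.Date, policy.in),
                           units = "mins"),
  diff.out.time = difftime(to_time(raw$Attendance.Date, raw$swipe.out),
                           to_time(raw$Attendance.Date, policy.out),
                           units = "mins")
)

print(attendance)
```

Because the columns are `difftime` objects, they print with their unit and need `as.numeric()` before being passed to functions that expect plain numbers.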

Summary Statistics

Before further analysis, some basic summary statistics are the mean and standard deviation of the data. For this tutorial, I will use diff.out.time (the difference between my actual out-time and the policy out-time).

mean(as.numeric(attendance$diff.out.time)) # Mean in minutes
##  21.25623
sd(as.numeric(attendance$diff.out.time)) # Standard deviation in minutes
##  67.3828
nrow(attendance) # Number of rows in the data set
##  281

Enumerative plots

“Enumerative plots” are so called because they enumerate, or show, every individual data point.

Index Plot/Univariate Scatter Diagram

Displays the values of a single variable for each observation using symbols plotted relative to the observation number.

library(ggplot2) # plotting functions used throughout
ggplot(attendance, aes(x=Attendance.Date, y= as.numeric(diff.out.time), color = (diff.out.time >= -15))) +
geom_point(show.legend = FALSE) +
labs(x = 'Time', y='Out-time difference (Minutes)') +
theme_minimal()

Just looking at this plot I can say the following:

1. The data could be clustered into three parts. The first cluster is before December 2017, when I used to leave the office well after my out-time. The second is from December 2017 to June 2018, when I used to leave about 15 minutes before my out-time. The third is after June 2018, when I was again leaving well after my out-time.
2. The red dots mark the days when I left more than 15 minutes before the policy out-time. They are anomalies, such as days when I took a half day, and can be excluded from the current analysis.

Y Zero High-Density Plot

Another way to look at the same data is with a Y Zero High-Density Plot. It displays the values of a single variable as thin vertical lines, which highlights the magnitude of the observations.

ggplot(attendance, aes(x=Attendance.Date, y = 0, color = (diff.out.time >= -15),
xend = Attendance.Date, yend = as.numeric(diff.out.time))) +
geom_segment(show.legend = FALSE) +
labs(x = 'Time', y='Out-time difference (Minutes)') +
theme_minimal()

Removing half-days as outliers

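library(dplyr) # provides %>% and filter()
# Keep only the days within the 15-minute early-leave allowance
attendance <- attendance %>%
filter(diff.out.time >= -15)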
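The effect of such a filter on the summary statistics can be sketched with a small made-up data frame. The values below are invented for illustration, and base R subsetting is used in place of `dplyr` so the sketch has no dependencies.

```r
# Made-up data: two half-days (large negative values) among normal days
att <- data.frame(diff.out.time = c(-240, -8, -9, 33, 46, 63, -230, 12))

# Keep only days within the 15-minute early-leave allowance
kept <- att[att$diff.out.time >= -15, , drop = FALSE]

# Compare mean and standard deviation before and after filtering
before <- c(mean = mean(att$diff.out.time), sd = sd(att$diff.out.time))
after  <- c(mean = mean(kept$diff.out.time), sd = sd(kept$diff.out.time))
print(rbind(before, after))
```

A handful of extreme values is enough to shift the mean and inflate the standard deviation, which is why the half-days are dropped before looking at the shape of the distribution.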

Strip Plot/Strip Chart (univariate scatter diagram)

Displays the values of a single variable as symbols plotted along a line. This is a basic plot where we can see the spread of the data.
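A minimal sketch of a strip chart, using base R's `stripchart()` on simulated values. The simulated vector `out.diff` and the jitter settings are assumptions for illustration, not the original data or plotting code.

```r
set.seed(42)
# Simulated out-time differences in minutes (stand-in for diff.out.time)
out.diff <- round(rexp(200, rate = 1 / 40) - 15)

# Draw to a null PDF device so the sketch also runs non-interactively;
# remove the pdf() and dev.off() lines to see the plot on screen
pdf(NULL)
stripchart(out.diff,
           method = "jitter",   # spread overlapping points apart
           jitter = 0.2,
           pch = 16,
           col = "cornflowerblue",
           xlab = "Out-time difference (minutes)")
invisible(dev.off())
```

Jittering is a design choice: with the default `method = "overplot"`, repeated values are drawn on top of each other and the density of points is harder to judge.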

Next, the density of the data is compared with a normal distribution:

ggplot(attendance, aes(x = as.numeric(diff.out.time))) +
stat_function(fun = dnorm, color = "darkred", size = 1,
args = list(mean = mean(as.numeric(attendance$diff.out.time)),
sd = sd(as.numeric(attendance$diff.out.time)))) +
geom_density(aes(y=..density..), color="darkblue", size = 1)+
geom_histogram(aes(y=..density..), bins = 50, fill = "cornflowerblue", alpha = 0.2) +
labs(x = 'Out-time difference (minutes)', y='Density') +
theme_minimal()

In the above graph, the red line is a normal distribution (with the same mean and sd), while the blue line is the density plot of the out-time difference.

Q-Q plot

In statistics, a Q-Q (quantile-quantile) plot is a probability plot, which is a graphical method for comparing two probability distributions by plotting their quantiles against each other. If the two distributions being compared are similar, the points in the Q-Q plot will approximately lie on the line y = x. If the distributions are linearly related, the points in the Q-Q plot will approximately lie on a line, but not necessarily on the line y = x.

Q-Q plots can also be used as a graphical means of estimating parameters in a location-scale family of distributions. A Q-Q plot is used to compare the shapes of distributions, providing a graphical view of how properties such as location, scale, and skewness are similar or different in the two distributions.

Below is a Q-Q plot against a normal distribution:

ggplot(attendance,aes(sample = as.numeric(diff.out.time))) +
stat_qq() +
stat_qq_line() +
ggtitle("Normal distribution Q-Q plot") +
theme_minimal()

We can clearly see that the distribution is not a normal distribution.

Trying to check with an exponential distribution:

# Rate parameter for qexp, estimated from a simulated exponential sample;
# the rate only rescales the theoretical axis of the Q-Q plot
params <- as.list(MASS::fitdistr(rexp(nrow(attendance), rate = 2), "exponential")$estimate)
ggplot(attendance,aes(sample = as.numeric(diff.out.time))) +
stat_qq(distribution = qexp, dparams = params) +
stat_qq_line(distribution = qexp, dparams = params) +
ggtitle("Exponential distribution Q-Q plot") +
theme_minimal()

From the above graph, I am approximating my distribution with an exponential distribution.
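To make the quantile pairing behind a Q-Q plot concrete, here is a minimal sketch that builds the Q-Q points by hand for simulated data. The simulated sample and the variable names are assumptions for illustration, and using the correlation as a rough measure of linearity is my own shortcut, not a formal test.

```r
set.seed(1)
# Simulated right-skewed sample (stand-in for the out-time differences)
x <- rexp(300, rate = 1 / 30)

n <- length(x)
p <- ppoints(n)          # plotting positions strictly between 0 and 1
sample.q <- sort(x)      # empirical quantiles of the sample

# Theoretical quantiles under each candidate distribution
norm.q <- qnorm(p)       # standard normal
exp.q  <- qexp(p)        # exponential with rate 1

# If a candidate fits, the paired quantiles fall close to a straight line,
# so their correlation is close to 1
r.norm <- cor(norm.q, sample.q)
r.exp  <- cor(exp.q, sample.q)
print(c(normal = r.norm, exponential = r.exp))
```

Plotting `exp.q` against `sample.q` gives the same kind of picture as the exponential Q-Q plot. The unit rate is enough here because a different rate only stretches the theoretical axis and leaves the straightness of the points unchanged.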

# Rate of the exponential: lambda = 1/mean, with the mean measured from the
# -15 minute cutoff used in exp.curve below
lamda <- 1/(mean(as.numeric(attendance$diff.out.time)) + 15)
exp.curve <- function(x){
lamda*exp(-lamda*(x +15))
}
ggplot(attendance,aes(x = as.numeric(diff.out.time))) +
stat_function(fun = exp.curve, color="darkred",  size = 1) +
geom_density(aes(y=..density..), color="darkblue",  size = 1)+
geom_histogram(aes(y=..density..), bins = 50, fill = "cornflowerblue", alpha = 0.2) +
labs(x = 'Out-time difference (minutes)', y='Density') +
theme_minimal()

In the above graph, the red line is an exponential distribution (lambda = 1/mean, where mean is the mean of the distribution), while the blue line is the density plot of the out-time difference.
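For an exponential distribution that starts at an offset instead of zero, the rate can be estimated as one over the mean distance from that offset. The sketch below checks this on simulated data. It assumes a known offset of -15 minutes (matching the filter applied earlier); the true rate, the sample size, and the helper `shifted.dexp()` are made up for illustration.

```r
set.seed(7)
offset    <- -15        # earliest allowed value, in minutes
true.rate <- 1 / 40     # made-up rate for the simulation

# Simulate shifted exponential data: offset plus an exponential amount
x <- offset + rexp(5000, rate = true.rate)

# Maximum-likelihood estimate of the rate when the offset is known
rate.hat <- 1 / mean(x - offset)

# Density of the fitted shifted exponential (zero below the offset)
shifted.dexp <- function(x, rate, offset) {
  dexp(x - offset, rate = rate)
}

print(c(true = true.rate, estimated = rate.hat))
```

Passing `shifted.dexp` to `stat_function()` through its `args` argument would draw the same kind of curve as `exp.curve`, with the added property that it is zero to the left of the offset.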

Created using RMarkdown
