
Univariate Analysis (R)

Introduction

A univariate analysis is the simplest form of exploratory data analysis (EDA). "Uni" means "one", so the analysis covers only one variable.
It does not deal with causes or relationships; its primary purpose is to describe. It takes data, summarizes it, and finds patterns in it.

In describing or characterizing the observations of an individual variable, three basic properties are of interest:
1. The location of observations, or how large or small the values of the individual observations are
2. The dispersion (sometimes called scale or spread) of the observations
3. The distribution of the observations
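
As a rough illustration, each of the three properties has simple numeric summaries in base R. The vector `x` below is made-up data, not the attendance data.

```r
# Mock observations (hypothetical values, in minutes)
x <- c(-9, -6, -4, 5, 11, 18, 26, 35, 42, 226)

# 1. Location: where the values sit
loc <- c(mean = mean(x), median = median(x))

# 2. Dispersion: how spread out the values are
disp <- c(sd = sd(x), iqr = IQR(x), range = diff(range(x)))

# 3. Distribution: the shape, summarised here by quantiles
shape <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))

print(loc)
print(disp)
print(shape)
```

In this mock vector the mean sits well above the median, which hints at a right-skewed distribution driven by one large value.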

Univariate plots provide one way to find out about those properties. There are two basic kinds of univariate plots:
1. Enumeration plots, or plots that show every observation
2. Summary plots that generalize the data into a simplified representation.

For the current tutorial, I will be using my office attendance data set. It contains the times when I swiped into and out of the office, from 4th October 2017 to 29th November 2018.
After some manipulation, I get the difference between my actual out-time and the policy out-time (positive values mean I left after the policy out-time). I can leave up to 15 minutes before the policy out-time. After manipulation, a sample of the data is as follows. (Actual data is not shown for security reasons. This is mock data that is very similar to the actual one.)

##   Attendance.Date diff.in.time diff.out.time
## 1      2018-03-22      18 mins      226 mins
## 2      2018-08-14      -9 mins        5 mins
## 3      2017-12-04      42 mins       11 mins
## 4      2018-03-01      26 mins       -6 mins
## 5      2018-01-23      35 mins       -4 mins
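
The manipulation itself is not shown here. A minimal sketch of how such a column could be derived with `difftime()`, assuming hypothetical columns `swipe.out` and `policy.out` (neither name comes from the real data set) and assuming that positive values mean leaving late:

```r
# Hypothetical raw data: the column names swipe.out and policy.out are assumptions
raw <- data.frame(
  Attendance.Date = as.Date(c("2018-03-22", "2018-08-14")),
  swipe.out  = as.POSIXct(c("2018-03-22 21:46:00", "2018-08-14 18:05:00"), tz = "UTC"),
  policy.out = as.POSIXct(c("2018-03-22 18:00:00", "2018-08-14 18:00:00"), tz = "UTC")
)

# Actual minus policy out-time, so positive values mean leaving late
raw$diff.out.time <- difftime(raw$swipe.out, raw$policy.out, units = "mins")

print(raw[, c("Attendance.Date", "diff.out.time")])
```

Fixing `units = "mins"` keeps the column in minutes regardless of how large the differences are; otherwise `difftime()` picks the units automatically.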

Summary Statistics

Before further analysis, some basic summary statistics show the mean and standard deviation of the data. For this tutorial, I will use diff.out.time (the difference between actual out-time and policy out-time).

mean(as.numeric(attendance$diff.out.time)) # Mean in minutes
## [1] 20.3227
sd(as.numeric(attendance$diff.out.time)) # Standard Deviation in minutes
## [1] 69.06549
nrow(attendance) # Length of the data set
## [1] 282

Enumerative plots

"Enumerative plots" are called such because they enumerate or show every individual data point.

Index Plot/Univariate Scatter Diagram

Displays the values of a single variable for each observation using symbols plotted relative to the observation number.

ggplot(attendance, aes(x=Attendance.Date, y= as.numeric(diff.out.time), color = (diff.out.time >= -15))) + 
  geom_point(show.legend = FALSE) +
  labs(x = 'Time', y='Out-time difference (Minutes)') + 
  theme_minimal()

Just looking at this plot, I can say the following:
1. The data could be clustered into three periods.
- The first cluster is before December 2017, when I used to leave the office well after my out-time.
- The second cluster is from December 2017 to June 2018, when I used to leave the office about 15 minutes before my out-time.
- The third cluster is after June 2018, when I was again leaving well after my out-time.
2. The red dots mark the days when I left the office more than 15 minutes before the policy out-time. They are anomalies, such as days when I took a half day. We can exclude them from the current analysis.
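
The three periods read off the plot can be made explicit by cutting the dates at the approximate boundaries. This is a sketch on made-up values; the break dates are assumptions taken from the description above, not fitted change points.

```r
# Mock data standing in for the attendance table (values are made up)
mock <- data.frame(
  Attendance.Date = as.Date(c("2017-10-10", "2017-11-20", "2018-02-05",
                              "2018-04-17", "2018-07-09", "2018-10-30")),
  diff.out.time = c(95, 120, -12, -8, 60, 140)
)

# Approximate period boundaries read off the plot (assumed, not exact)
breaks <- as.Date(c("2017-10-01", "2017-12-01", "2018-07-01", "2018-12-01"))
mock$period <- cut(mock$Attendance.Date, breaks = breaks,
                   labels = c("Oct-Nov 2017", "Dec 2017-Jun 2018", "Jul-Nov 2018"))

# Median out-time difference per period
period.median <- tapply(mock$diff.out.time, mock$period, median)
print(period.median)
```

Comparing a per-period median is one simple way to check whether the visual clusters really differ in location.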

Y Zero High-Density Plot

Another way to look at the same data is by using a Y Zero High-Density Plot. It displays the values of a single variable plotted as thin vertical lines. Here the magnitude of the observations is highlighted.

ggplot(attendance, aes(x=Attendance.Date, y = 0, color = (diff.out.time >= -15),
                       xend = Attendance.Date, yend = as.numeric(diff.out.time))) + 
  geom_segment(show.legend = FALSE) +
  labs(x = 'Time', y='Out-time difference (Minutes)') + 
  theme_minimal()

Removing half-days as outliers

attendance <- attendance %>% 
  filter(diff.out.time >= -15)
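
This filter relies on `diff.out.time` being stored in minutes: as far as I know, comparing a `difftime` with a bare number uses whatever units the column currently has. A small sketch on mock values that makes the units explicit before applying the threshold:

```r
# Mock difftime values in minutes (made up)
d <- as.difftime(c(226, 5, -6, -240), units = "mins")

# Make sure the threshold of -15 is interpreted in minutes
units(d) <- "mins"
keep <- as.numeric(d) >= -15

print(d[keep])     # retained days
print(sum(!keep))  # number of days dropped as half-days
```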

Strip Plot/Strip Chart (univariate scatter diagram)

Displays the values of a single variable as symbols plotted along a line. This is a basic plot where we can see the spread of the data.

stripchart(x = as.numeric(attendance$diff.out.time),xlab = 'Out-time difference (minutes)')

Sometimes it is more visually apparent when the points are stacked.

ggplot(attendance, aes(x = as.numeric(diff.out.time), y=..density..))+
  geom_dotplot(binwidth = 3,method = 'histodot') +
  labs(x = 'Out-time difference (minutes)') + 
  theme_minimal()

We can observe that the observations are most concentrated near the lower end of the range and thin out gradually as the out-time difference increases.

Dot Plot/Dot Chart

Displays the values of a single variable as symbols plotted along a line, with a separate line for each observation. It is generally constructed after sorting the rows of the data table.

# Sort by out-time difference and record each row's position after sorting
df <- attendance %>%
  arrange(as.numeric(diff.out.time)) %>%
  mutate(rank = row_number())
ggplot(df, aes(x = rank, y = as.numeric(diff.out.time))) + 
  geom_point() +
  labs(x = 'count', y='Out time difference (min)') + 
  theme_minimal()

From the graph, the sorted values appear to follow an exponential-like curve.
A sample normal distribution is plotted for reference.

We can see that the distribution looks nothing like a normal distribution. Instead, I suspect that it is close to an exponential distribution.
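
The code for the reference plot is not shown. A sketch of how such a reference could be generated, assuming a simulated normal sample (the sample size, mean, standard deviation and seed below are arbitrary illustrative choices):

```r
library(ggplot2)

set.seed(1)  # arbitrary seed, for reproducibility only
ref <- data.frame(value = sort(rnorm(250, mean = 20, sd = 30)))
ref$rank <- seq_len(nrow(ref))

# Sorted normal sample: an S-shaped curve, steep at both tails
p.ref <- ggplot(ref, aes(x = rank, y = value)) +
  geom_point() +
  labs(x = 'count', y = 'Simulated value') +
  theme_minimal()
print(p.ref)
```

A sorted normal sample rises steeply at both ends, whereas a sorted exponential-like sample stays flat at the low end and rises steeply only at the high end.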

Univariate Summary Plots

Summary plots display an object or graph that gives a more concise expression of a variable's location, dispersion, and distribution than an enumerative plot. This comes at the expense of some loss of information: in a summary plot, it is no longer possible to retrieve the individual data values. This loss is usually outweighed by the gain in understanding that results from the efficient representation of the data. Summary plots are generally much better than enumerative plots at revealing the distribution of the data.

Box plot

A simple way of representing statistical data on a plot in which a rectangle is drawn from the first quartile to the third quartile (the interquartile range, IQR), with a line inside to indicate the median. Whiskers extend from either side of the rectangle to the most extreme observations within 1.5 times the IQR, and points beyond the whiskers are drawn individually.

ggplot(attendance, aes(x="Box Plot", y= as.numeric(diff.out.time), group = 123)) + 
  geom_boxplot() +
  labs(y='Out time difference (min)') + 
  theme_minimal()
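
The quantities behind a box plot can be computed by hand. The sketch below uses made-up values and the common 1.5 × IQR rule; quartile conventions differ slightly between functions, so the numbers may not match every plotting routine exactly.

```r
# Mock out-time differences in minutes (made up)
x <- c(-12, -8, -5, 0, 3, 7, 12, 20, 35, 150)

q <- quantile(x, probs = c(0.25, 0.5, 0.75))
iqr <- q[["75%"]] - q[["25%"]]

# Fences: points beyond them are drawn individually as outliers
lower.fence <- q[["25%"]] - 1.5 * iqr
upper.fence <- q[["75%"]] + 1.5 * iqr
outliers <- x[x < lower.fence | x > upper.fence]

print(q)
print(outliers)
```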

Histograms

Two other common summary plots are:

  1. Histograms: Histograms are bar charts that display the counts or relative frequencies of values falling in different class intervals or ranges.
  2. Density Plots: A density plot shows the local relative frequency or density of points along the number line or x-axis of a plot. The local density is determined by summing the individual "kernel" densities for each point. Where points occur more frequently, this sum, and consequently the local density, will be greater.

legendcols <- c("Normal distribution"="darkred","Density"="darkBlue","Histogram"="lightBlue")
ggplot(attendance,aes(x = as.numeric(diff.out.time))) +
  geom_histogram(aes(y=..density.., fill ="Histogram"), bins = 50) +
    stat_function(fun = dnorm, aes(color="Normal distribution"),  size = 1,
                args = list(mean = mean(as.numeric(attendance$diff.out.time)),
                                         sd = sd(as.numeric(attendance$diff.out.time)) )) +
  geom_density(aes(y=..density.., color="Density"),  size = 1)+
  scale_colour_manual(name="Distribution",values=legendcols) +
  scale_fill_manual(name="Bar",values=legendcols) +
  labs(x = 'Out-time difference (minutes)', y='Density') +
  theme_minimal() + theme(legend.position="bottom")

In the above graph, the red line is a normal distribution (with the same mean and standard deviation as the data), while the blue line is the density plot of the out-time difference.
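
The description of a density plot as a sum of kernels can be sketched directly. The observations and the bandwidth of 5 below are made up; the hand-built estimate is compared against R's `density()` using the same Gaussian kernel and bandwidth.

```r
# Mock observations (made up) and an assumed bandwidth
x <- c(-10, -5, 0, 2, 30)
bw <- 5
grid <- seq(-40, 60, by = 0.5)

# Sum one Gaussian kernel per observation, then divide by n so the area is 1
kde <- sapply(grid, function(g) sum(dnorm(g, mean = x, sd = bw)) / length(x))

# Built-in estimator evaluated on the same grid
builtin <- density(x, bw = bw, kernel = "gaussian",
                   from = -40, to = 60, n = length(grid))

print(max(abs(kde - builtin$y)))  # largest gap between the two estimates
```

The two curves should nearly coincide; `density()` uses a binned approximation internally, so tiny differences are expected.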

Q-Q plot

In statistics, a Q-Q (quantile-quantile) plot is a probability plot, a graphical method for comparing two probability distributions by plotting their quantiles against each other.

If the two distributions being compared are similar, the points in the Q-Q plot will approximately lie on the line y = x. If the distributions are linearly related, the points in the Q-Q plot will approximately lie on a line, but not necessarily on the line y = x. Q-Q plots can also be used as a graphical means of estimating parameters in a location-scale family of distributions.

A Q-Q plot is used to compare the shapes of distributions, providing a graphical view of how properties such as location, scale, and skewness are similar or different in the two distributions.
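
The construction can be sketched by hand: sort the sample, then pair each value with the theoretical quantile at the same probability point. The data below are simulated (the rate and seed are arbitrary), and the correlation is used only as a rough measure of how straight each Q-Q relationship is.

```r
set.seed(42)  # arbitrary seed
sample.x <- rexp(200, rate = 0.05)  # mock right-skewed data

# Empirical quantiles are just the sorted sample
emp <- sort(sample.x)

# Matching theoretical quantiles at the same probability points
probs <- ppoints(length(emp))
theo.norm <- qnorm(probs)
theo.exp  <- qexp(probs)

# The straighter the relationship, the better the match in shape
cor.norm <- cor(theo.norm, emp)
cor.exp  <- cor(theo.exp, emp)
print(c(normal = cor.norm, exponential = cor.exp))
```

Because only the straightness matters, the theoretical distribution's location and scale do not need to match the data for this comparison.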

Below is a Q-Q plot against a normal distribution:

ggplot(attendance,aes(sample = as.numeric(diff.out.time))) +
  stat_qq() + stat_qq_line() +
  ggtitle("Normal distribution Q-Q plot") +
  theme_minimal()

We can see that the distribution is not normal.
Next, I check the data against an exponential distribution.

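# Fit the exponential rate to the data itself. The values are shifted by
# 15 minutes so that they are non-negative, as fitdistr() requires.
shifted <- as.numeric(attendance$diff.out.time) + 15
params <- as.list(MASS::fitdistr(shifted, "exponential")$estimate)
ggplot(attendance, aes(sample = as.numeric(diff.out.time) + 15)) +
  stat_qq(distribution = qexp, dparams = params) + 
  stat_qq_line(distribution = qexp, dparams = params) +
  ggtitle("Exponential distribution Q-Q plot") +
  theme_minimal()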

Based on the above graph, I approximate the distribution as an exponential distribution.

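# For an exponential distribution the standard deviation equals 1/rate,
# and it is unaffected by the 15-minute shift.
lamda <- 1 / sd(as.numeric(attendance$diff.out.time))
# Density of an exponential distribution shifted to start at -15 minutes
exp.curve <- function(x){
  ifelse(x >= -15, lamda * exp(-lamda * (x + 15)), 0)
}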
legendcols <- c("Exponential distribution"="darkred","Density"="darkBlue","Histogram"="lightBlue")
ggplot(attendance,aes(x = as.numeric(diff.out.time))) +
  geom_histogram(aes(y=..density.., fill ="Histogram"), bins = 50) +
  stat_function(fun = exp.curve, aes(color="Exponential distribution"),  size = 1) +
  geom_density(aes(y=..density.., color="Density"),  size = 1)+
  scale_colour_manual(name="Distribution",values=legendcols) +
  scale_fill_manual(name="Bar",values=legendcols) +
  labs(x = 'Out-time difference (minutes)', y='Density') +
  theme_minimal() + theme(legend.position="bottom")

In the above graph, the red line is the exponential distribution, while the blue line is the density plot of the out-time difference.
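
For an exponential distribution shifted to start at -15 minutes, the mean is 1/λ - 15 and the standard deviation is 1/λ, so the rate can be estimated from either. The sketch below checks both estimators on simulated data; the true rate, sample size and seed are assumptions chosen only for illustration.

```r
set.seed(7)  # arbitrary seed
shift <- 15
true.rate <- 1 / 40  # assumed value, used only for the simulation

# Simulated out-time differences: exponential, shifted to start at -15 minutes
x <- rexp(5000, rate = true.rate) - shift

rate.from.mean <- 1 / (mean(x) + shift)  # uses the known shift
rate.from.sd   <- 1 / sd(x)              # unaffected by the shift

print(c(true = true.rate, mean.based = rate.from.mean, sd.based = rate.from.sd))
```

If the two estimates disagree strongly on real data, that is itself a hint that the shifted exponential is only a rough approximation.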

Created using RMarkdown.
