In this post, I would like to look into Anova hypothesis testing. The dataset I am going to use is published in https://smartcities.data.gov.in which is a Government of India project under the National Data Sharing and Accessibility Policy.

I want to analyse if the unemployment across Bangalore is similar or are there pockets with high unemployment. The Unemployment_Rate_Bengaluru.csv dataset has total employed and unemployed people in Bangalore at a Ward level. For simplicity sake, I want to concentrate on three zones, Bangalore-east, Mahadevpura and Rajeshwari Nagar.

As I want to test that there is significant difference between unemployment rate in different zones in Bangalore, the null and alternate hypothesis will be as follows:
H0: μBangalore-east = μMahadevpura = μRajeshwari-Nagar
H1: Not all μ are equal

Sample unemployment data set:

## # A tibble: 5 x 8
##   cityName zoneName wardName wardNo unemployed.no employed.no total.labour
##   <chr>    <chr>    <chr>     <int>         <int>       <int>        <int>
## 1 Bengalu~ East     Jayacha~     46         19148       12301        31449
## 2 Bengalu~ Rajaraj~ HMT Ward    105         16096       12259        28355
## 3 Bengalu~ East     SanJaya~     19         19564       12927        32491
## 4 Bengalu~ East     Kadugon~     97         20156       15565        35721
## 5 Bengalu~ East     Maruthi~    126         16381       12938        29319
## # ... with 1 more variable: rate <dbl>

Visualizing the difference of unemployment in each zone.

ggplot(unemployment, aes(x=zoneName, y= rate, group = zoneName)) + 
  geom_boxplot() +
  labs(x='Zone', y='Unemployment %') + 
  theme_minimal()

The summary statistics of unemployment across different zones are as follows:

data.summary <- unemployment %>% 
  rename(group = zoneName) %>% 
  group_by(group) %>% 
  summarise(count = n(), 
            mean = mean(rate),
            sd = sd(rate),
            skewness = skewness(rate),
            kurtosis = kurtosis(rate))
data.summary
## # A tibble: 3 x 6
##   group               count  mean     sd skewness kurtosis
##   <chr>               <int> <dbl>  <dbl>    <dbl>    <dbl>
## 1 East                   81 0.574 0.0361 -0.228      0.747
## 2 Mahadevapura           24 0.567 0.0449 -0.812      1.69 
## 3 Rajarajeswari Nagar    18 0.548 0.0499  0.00421   -1.24

From the above table I observe that the unemployment rate is between 54% and 57% in all the three zones.

One of the conditions to perform anova is that the population response variable follows a normal distribution in each group. The distribution of unemployment rate in diffeent zones are:

ggplot() + 
  geom_density(data=unemployment, aes(x=rate, group=zoneName, color=zoneName), adjust=2) +
  labs(x = 'Unemployment Rate', y='Density',  title = 'Testing normality among response variables') +
  theme_minimal()+theme(legend.position="bottom")

From the above graph, it looks like the three groups are normally distributed. To test for normality, we could use Shapiro-Wilk test.
The test for each group would have the following hypothesis:
H0: The unemployment rate in the group is normally distributed
H1: The unemployment rate in the group is not normally distributed

for(i in 1:nrow(data.summary)){
  test.dist <- (unemployment %>% dplyr::filter(zoneName == data.summary$group[i]))$rate
  cat('Testing for group ', data.summary$group[i], '\n')
  print(shapiro.test(test.dist))
  norm.plot <- ggplot() + 
    geom_qq(aes(sample = test.dist)) +
    stat_qq_line(aes(sample = test.dist)) +
    ggtitle(paste0("Normal distribution Q-Q plot for group ",data.summary$group[i])) +
    theme_minimal()
  plot(norm.plot)
}
## Testing for group  East 
## 
##  Shapiro-Wilk normality test
## 
## data:  test.dist
## W = 0.97147, p-value = 0.06734

## Testing for group  Mahadevapura 
## 
##  Shapiro-Wilk normality test
## 
## data:  test.dist
## W = 0.90815, p-value = 0.03216

## Testing for group  Rajarajeswari Nagar 
## 
##  Shapiro-Wilk normality test
## 
## data:  test.dist
## W = 0.95852, p-value = 0.5734

As p > α, where α = 0.01, retaining the Null hypothesis in all the three groups. Therefore we can assume that unemployment rate is normally distributed among all groups.

Another condition for anova is that the population variances are assumed to be same. This is an assumption I am willing to take at this point.

Taking these assumptions, the ideal distributions of the sample are as follows. These distributions can be compared to a simulation that I created where change in F value (and significance of anova) can be visualised by increasing the between variance (increasing the distance between group means)

plot.normal.groups <- function(data.summary, mean, sd, label, title){
  common.group.sd <- mean(data.summary$sd)
  range <- seq(mean-3*sd, mean+3*sd, by = sd*0.001)
  norm.dist <- data.frame(range = range, dist = dnorm(x = range, mean = mean, sd = sd))
  # Plotting sampling distribution and x_bar value with cutoff
  norm.aov.plot <- ggplot(data = norm.dist, aes(x = range,y = dist))
  for (i in 1:nrow(data.summary)) {
    norm.aov.plot <- norm.aov.plot + 
      stat_function(fun = dnorm, color=colors()[sample(50:100, 1)],  size = 1, 
                    args = list(mean = data.summary$mean[i], sd = common.group.sd))
  }
  norm.aov.plot + labs(x = label, y='Density',  title = title) +
      theme_minimal()+theme(legend.position="bottom")
}
set.seed(9)
mean <- mean(unemployment$rate)
sd <- sd(unemployment$rate)
plot.normal.groups(data.summary, mean, sd, 'Travel time (sec)', 'Assuming normality among response variables')

Performing Anova with
H0: μBangalore-east = μMahadevpura = μRajeshwari-Nagar
H1: Not all μ are equal

# Functions used in anova-test

f.plot <- function(pop.mean=0, alpha = 0.05, f, df1, df2,
                              label = 'F distribution',title = 'Anova test'){
  # Creating a sample F distribution
  range <- seq(qf(0.0001, df1, df2), qf(0.9999, df1, df2), by = (qf(0.9999, df1, df2)-qf(0.0001, df1, df2))*0.001)
  f.dist <- data.frame(range = range, dist = df(x = range, ncp = pop.mean, df1 = df1, df2 = df2)) %>% 
    dplyr::mutate(H0 = if_else(range <= qf(p = 1-alpha, ncp = pop.mean, df1 = df1,df2 = df2),'Retain', 'Reject'))
  # Plotting sampling distribution and F value with cutoff
  plot.test <- ggplot(data = f.dist, aes(x = range,y = dist)) +
    geom_area(aes(fill = H0)) +
    scale_color_manual(drop = TRUE, values = c('Retain' = "#00BFC4", 'Reject' = "#F8766D"), aesthetics = 'fill') +
    geom_vline(xintercept = f, size = 2) +
    geom_text(aes(x = f, label = paste0('F = ', round(f,3)), y = mean(dist)), colour="blue", vjust = 1.2) +
    labs(x = label, y='Density',  title = title) +
    theme_minimal()+theme(legend.position="bottom")
  plot(plot.test)
}
anva <- aov(rate ~ zoneName, unemployment)
anova.summary <- summary(anva)
print(anova.summary)
##              Df  Sum Sq  Mean Sq F value Pr(>F)  
## zoneName      2 0.01065 0.005323   3.315 0.0397 *
## Residuals   120 0.19269 0.001606                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
f.plot(f = anova.summary[[1]]$F[1], df1 = anova.summary[[1]]$Df[1], df2 = anova.summary[[1]]$Df[2])

Here p < α, where α = 0.05 Hence rejecting the Null Hypothesis. From this test we can observe that not all groups have same mean. But to find out which groups are similar and which are different, I am conducting a TukeyHSD test.

tukey.test <- TukeyHSD(anva)
print(tukey.test)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = rate ~ zoneName, data = unemployment)
## 
## $zoneName
##                                          diff         lwr          upr
## Mahadevapura-East                -0.007071899 -0.02917267  0.015028872
## Rajarajeswari Nagar-East         -0.026768634 -0.05154855 -0.001988723
## Rajarajeswari Nagar-Mahadevapura -0.019696735 -0.04934803  0.009954561
##                                      p adj
## Mahadevapura-East                0.7285538
## Rajarajeswari Nagar-East         0.0309193
## Rajarajeswari Nagar-Mahadevapura 0.2597730
# Plot pairwise TukeyHSD comparisons and color by significance level
tukey.df <- as.data.frame(tukey.test$zoneName)
tukey.df$pair = rownames(tukey.df)
ggplot(tukey.df, aes(colour=cut(`p adj`, c(0, 0.01, 0.05, 1), 
                           label=c("p<0.01","p<0.05","Non-Sig")))) +
  geom_hline(yintercept=0, lty="11", colour="grey30") +
  geom_errorbar(aes(pair, ymin=lwr, ymax=upr), width=0.2) +
  geom_point(aes(pair, diff)) +
  labs(x = 'Groups', y='Density',  colour="", title = 'Tukey HSD Test') +
    theme_minimal()+theme(legend.position="bottom")

From this test we can see that there is significant difference (with α = 0.05 confidence) between Rajeshwari Nagar and Bangalore East.

Created on 17th June 2019, Achyuthuni Sri Harsha

One Reply to “Analysis of variance (Anova)”

Leave a Reply