Discussed in this notebook:
1. Class size paradox
2. Web scraping in R

When I joined my first job after college, the classmates who joined along with me were of great help. The alumni of my college also assisted me, especially during the initial days, in finding accommodation and adjusting to the culture. Even today, if I need support from a different department in the organisation, I ask them first.

The number of college alumni an employee will have in their company depends on, among other factors, the size of the university they come from. Therefore, one metric to rank colleges could be the average number of classmates (or alumni) a student will have in their future workplace.

Amrita placements data set

I graduated from Amrita University in 2017. I am interested in seeing how my classmates are distributed across various companies and the average number per company.

For the NIRF ranking, my college publishes the number of students placed each year. We can scrape this data set from the web to find the average and the distribution.

For web scraping in R, I am using the ‘rvest’ package, authored by Hadley Wickham. You can access the documentation for the rvest package here.

#Loading the rvest package
library(rvest)

#Specifying the url for desired website to be scraped
url <- ''

#Reading the HTML code from the website
webpage <- read_html(url)

On this website, I want to scrape the table in the 2016-17 tab. If I right-click on any table element and select Inspect, the Elements tab shows that the table is a child of an element with the table-responsive class.

#Using CSS selectors to select the tables nested inside the table-responsive class
node <- html_nodes(webpage, '.table-responsive .table')
## {xml_nodeset (3)}
## [1] <table class="table table-bordered table-striped"><tbody>\n<tr>\n<td ...
## [2] <table class="table table-bordered table-striped"><tbody>\n<tr>\n<td ...
## [3] <table class="table table-bordered table-striped"><tbody>\n<tr>\n<td ...

There are three tables in node, one for each tab on the website. I need to select the 2016-17 tab and pull its table into a data frame. The first few rows of the data frame are also displayed.

# Pulling the table from the third element (for the third tab) in 'node'
company.list <- html_table(node[3], fill = TRUE, header = TRUE)[[1]]
##        S.No. Academic Year     Name of the Company
## 1 UG IV YEAR    UG IV YEAR              UG IV YEAR
## 2          1       2016-17 Hinduja Global Services
## 3          2       2016-17                CSS Corp
## 4          3       2016-17         Sonata Software
## 5          4       2016-17            TPF Software
## 6          5       2016-17          B&R Automation
##   No of students recruited Minimum salary Offered Maximum salary offered
## 1               UG IV YEAR             UG IV YEAR             UG IV YEAR
## 2                        1                   3LPA               23.45LPA
## 3                        0                   3LPA               23.45LPA
## 4                        5                   3LPA               23.45LPA
## 5                        0                   3LPA               23.45LPA
## 6                        2                   3LPA               23.45LPA
##   Average salary offered Median salary offered
## 1             UG IV YEAR            UG IV YEAR
## 2                4.21LPA                5.8LPA
## 3                4.21LPA                5.8LPA
## 4                4.21LPA                5.8LPA
## 5                4.21LPA                5.8LPA
## 6                4.21LPA                5.8LPA
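
To experiment with the selector and html_table() without hitting the live site, here is a minimal sketch on an inline HTML snippet. The markup, company names and counts below are made up for illustration and are not the actual page; only the class names mirror the ones seen above.

```r
library(rvest)

# Hypothetical HTML standing in for the placements page (not the real markup)
demo.html <- '
<div class="table-responsive">
  <table class="table"><tbody>
    <tr><td>S.No.</td><td>Name of the Company</td><td>No of students recruited</td></tr>
    <tr><td>1</td><td>Company A</td><td>1</td></tr>
    <tr><td>2</td><td>Company B</td><td>5</td></tr>
  </tbody></table>
</div>
<div class="table-responsive">
  <table class="table"><tbody>
    <tr><td>S.No.</td><td>Name of the Company</td><td>No of students recruited</td></tr>
    <tr><td>1</td><td>Company C</td><td>2</td></tr>
  </tbody></table>
</div>'

# read_html() also accepts a literal HTML string
demo.page <- read_html(demo.html)

# Select every element with class "table" nested inside "table-responsive"
demo.tables <- html_nodes(demo.page, '.table-responsive .table')

# Convert the first matched table, using its first row as the header
demo.first <- html_table(demo.tables[1], header = TRUE)[[1]]
```

Here demo.tables holds one node per table, and indexing it picks the tab of interest, in the same way node[3] is used above.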

So I have successfully scraped the table into a data frame. After some cleaning and converting the columns to the proper data types, the distribution of the number of students is:

The average from the table is:

## [1] 14.78723
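
The cleaning step is not shown in this post, so the following is only a sketch of what it might look like, run on a small made-up data frame with the same quirks as the scraped one (a label row repeated across columns, and counts stored as text). The names company.demo, company and recruited are hypothetical, and keeping the zero-recruit companies in the average is an assumption.

```r
# Hypothetical miniature of the scraped table; values are illustrative only
company.demo <- data.frame(
  company   = c("UG IV YEAR", "Company A", "Company B", "Company C", "Company D"),
  recruited = c("UG IV YEAR", "1", "0", "5", "2"),
  stringsAsFactors = FALSE
)

# Drop the section-label row that is repeated across every column
company.demo <- company.demo[company.demo$recruited != "UG IV YEAR", ]

# Convert the counts from character to integer
company.demo$recruited <- as.integer(company.demo$recruited)

# Average number of students per company: (1 + 0 + 5 + 2) / 4 = 2
avg.per.company <- mean(company.demo$recruited)
```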

On average, a company has ~15 students from my batch.

But what if I did not have the above table and instead ran a survey, asking all my classmates during our annual alumni meet? What would I get then?

A sample of the survey dataset is created using the above table itself. This dataset will look as follows (example is shown for 2 companies):

##   student.number                 company no.of.alumni
## 1  Hinduja_emp_1 Hinduja Global Services            1
## 2   Sonata_emp_1         Sonata Software            5
## 3   Sonata_emp_2         Sonata Software            5
## 4   Sonata_emp_3         Sonata Software            5
## 5   Sonata_emp_4         Sonata Software            5
## 6   Sonata_emp_5         Sonata Software            5

Hinduja Global Services has one Amrita student, but Sonata Software has 5. So one student will report the number of alumni as 1, while 5 different students from Sonata will each report the number of alumni as 5, and so on.
The distribution of the responses of no.of.alumni and the mean will look as follows:

## [1] 150.2345

From the survey, I get that the average number of college mates a student will have is ~150. The estimate is biased upwards because the larger companies are overweighted in the average: a company with n students contributes n responses, each reporting n.
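
To see the bias on a small scale, here is a sketch that builds a survey from three made-up companies with 1, 5 and 2 students. The variable names and counts are hypothetical; the construction (one response per student, each reporting their company's count) follows the description above.

```r
# Hypothetical counts of students per company
recruited <- c(1, 5, 2)
companies <- c("Company A", "Company B", "Company C")

# One survey row per student; each student reports their company's count
survey.demo <- data.frame(
  company      = rep(companies, times = recruited),
  no.of.alumni = rep(recruited, times = recruited)
)

# Per-company average from the table: (1 + 5 + 2) / 3
table.mean <- mean(recruited)

# Average of the survey responses: (1*1 + 5*5 + 2*2) / 8 = 3.75
survey.mean <- mean(survey.demo$no.of.alumni)

# The survey mean equals sum(n^2) / sum(n), which weights each company by its size
weighted.mean <- sum(recruited^2) / sum(recruited)
```

Even in this toy case the survey mean (3.75) is well above the per-company mean (about 2.67), because the company with 5 students is counted five times.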

This is called the class size paradox.
The right estimate to use in the second case is the harmonic mean of the responses. While doing any survey, one should keep this paradox in mind.
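
As a sketch of why the harmonic mean works, take made-up survey responses from three companies with 1, 5 and 2 students (hypothetical numbers). Each company with n students contributes n responses of 1/n to the sum of reciprocals, so that sum simply counts the companies.

```r
# Hypothetical survey responses: each student reports their company's count
company.sizes <- c(1, 5, 2)
responses <- rep(company.sizes, times = company.sizes)

# Harmonic mean = number of responses / sum of reciprocals
# sum(1 / responses) = 1 + 5 * (1/5) + 2 * (1/2) = 3, the number of companies
harmonic.mean <- length(responses) / sum(1 / responses)

# Matches the per-company average: 8 / 3
per.company.mean <- mean(company.sizes)
```

Note that companies with zero students never show up in a survey, so the harmonic mean recovers the average over companies with at least one student.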


You can find the code for the above example here

For better understanding, a good start would be Think Stats section 3.4

Created using RMarkdown.
