Data Science — Creating Google Sheets Explore feature for Descriptive Statistics with Outliers

A practical approach to understand how to do the right data and descriptive statistics and analysis by factoring Outliers in your data

Apurav Chauhan
Towards Data Science

--

Photo by Giorgio Tomassetti on Unsplash

What is Descriptive Analysis ?

Descriptive analysis or statistics means the summary of a dataset showing the central tendency and the spread of the data using things like mean, median, standard deviation etc

I remember studying these concepts when I was in school but I never understood the use in real life until recently.

Motivation behind this exercise?

I was recently playing with some salary data to learn machine learning. I created some normally distributed data. In the end I appended two huge salaries to play around with different statistics concept.

I uploaded the data in Google sheets and to my surprise, it had an Explore feature which had cleaned up these outliers and presented a very clean and crisp descriptive analysis with a neat histogram.

Google Sheet’s Explore feature showing summary analysis

That really made me curious and made me to think about below questions:

  1. What Math did Google use to remove the two wrong values/outliers and show me the min and max salary from the remaining data?
  2. What Math was used to come to the centre value of 50300?
  3. What Math was used to come to the spread value of 8320?

And thats when my real journey began into the wonderful world of data analysis.

Outliers?

You would rarely find ideal datasets in real life. The data points which are abnormal or are different from most of the data are called as Outliers. Outliers could tell you about an anomaly, a special event or simply mean something doctored. It really depends on use-cases.

When you want descriptive analysis about a dataset, it becomes really important to factor in the outliers otherwise your results could be skewed.

Example: Consider the following data set and descriptive summary

[10, 15, 15, 20, 20, 20, 25, 25, 30, 400]

Min: 10 | Max: 400 | Average: 58 | Median: 20

Now if you will look carefully, the data point 400 could be a possible anomaly which is giving us the wrong average and wrong max. Most of the data is between 10 and 30 so how can the average be 58? Thats because a single outlier of 400 skewed our summary. However, if you will look at median that somehow shows the right centre value and is not effected by the outlier value of 400.

Lets further skew the data to below and see the stats:

[10, 15, 15, 20, 20, 20, 25, 25, 30, 800]

Min: 10 | Max: 400 | Average: 98 | Median: 20

As you can see, if we change 400 to 800, the average increases but the median remains same and still gives us a better picture of the centre of the data. So the question here is how do we detect outliers and get the right descriptive summary of the data?

Quartiles and Inter Quartile Range (IQR)

Quartiles are values that divide a (part of a) sorted data table into four groups containing an approximately equal number of observations. The total of 100% is split into four equal parts: 25%, 50%, 75% and 100%.

The first quartile (Q1) is defined as the middle number between the smallest number and the media of the data set. The second quartile (Q2) is the median of the data. The third quartile (Q3) is the middle value between the median and the highest value of the data set.

Now as per above diagram here are the ranges, you need to look out to get the correct data points:

IQR (Inter Quartile Range) = Q3 — Q1

UIF (Upper Inner Fence) = Q3+1.5*IQR

LIF (Lower Inner Fence) = Q1–1.5*IQR

UOF (Upper Outer Fence) = Q3+3*IQR

LOF (Upper Outer Fence) = Q1–3*IQR

Now, you can easily find suspicious data OR Outliers based on below fence rules :

Anything that lies between UIF and UOF are called mild upper outliers.

Anything that lies between LIF and LOF are called mild lower outliers.

Anything that lies beyond UOF are called upper extreme outliers.

Anything that lies below LOF are called lower extreme outliers.

Now if we apply the above calculations on our sample data set, this is what we get

[10, 15, 15, 20, 20, 20, 25, 25, 30, 800]

Q1:16.25 | Q2:20| Q3:25| IQR:8.75 | LIF: 3.125 | LOF: -10| UIF: 38.125 | UOF: 51.25

Since 800>UOF, this is a possible outlier and it needs to be removed to get correct descriptive analysis.

So lets proceed and remove the Outlier 800 and see our new analysis on new data set

[10, 15, 15, 20, 20, 20, 25, 25, 30]

Min: 10 | Max: 30| Average: 20| Median: 20 | Spread/SD: 6

Which brings us to our final summary containing the centre value as 20(Average) , plus or minus 6(Standard Deviation(SD) or spread of data) :

The data ranges from 10 to 30. Most of the data will be centred around 20, plus or minus 6.”

I could be wrong, but I believe thats the Maths Google is using to spit out such beautiful and simple descriptive statistics about a dataset.

Coding our own Google Explore feature

Custom Google’s Style Descriptive Statistics for data analysis with outlier correction

To take the above concepts to the next level, it is time to code our Google Sheet’s like explore analysis

The sample data sheet which is use for analysis can be referred or downloaded here.

Below is a codepen that spits out a similar analysis using Google charts, data from above link and a bit of Javascript Maths.

Our custom Google Explore feature

The code is very simple. We calculate Q1, Q2, Q3, IQR, UOF, LOF on sorted data and filter the outliers before getting average and standard deviation of data.

Downloads

  1. The Google sheet for sample data
  2. Executable Codepen
  3. Github Code Repo

Feedback and followup

Any feedback that you have is welcomed because I believe thats how we all grow and mature. Feel free to follow up with me on LinkedIn or Twitter.

--

--