Nexus Archive

GROUP BY

This is gonna be a tricky one 🤦🏼‍♂️... Well at least took me quite a bit to understand this.


What GROUP BY does is that groups data of a column into groups. Duplicates are removed and data are grouped. This sentence does not make sense but will once we look at an example.

lets consider a dataset age with a few numeric data containing ages. Then we try to GROUP BY them

age = { 20, 20, 25, 30, 35, 35 }

GROUP BY age

What this will do is take the dataset and create a group for each unique data.

age
---
20
25
30
35

What it did is create a group for each unique data. There are 2 20s and 2 35s but it will just create group for each distinct value so 20 is a distinct value and 35 is a distinct value.

Now this is the core of GROUP BY and its used alongside aggregate Functions Like this ->

SELECT age, COUNT(*) FROM customers GROUP BY age; 

What this will do is create a group for the age column. Each unique age will be displayed in its group. Then the aggregate function will run on each group. So COUNT(*) returns the count of rows in each group.

Lets insert a duplicate age for a customer and see the output of the query.

INSERT INTO customers VALUES(7, 'Jin Sakai', 30, 'Tshushima');

This makes a new entry . Now lets display ages by doing GROUP BY.

SELECT age, COUNT(*) FROM customers GROUP BY age;

This is output:

age    COUNT(*)
---------------
22       1    /* Nexus */
24       1    /* CJ */
30       2    /* Niko and Jin Sakai */
35       1    /* Tommy Vercetti */
1001     1    /* Kratos */

Lets dissect everything that happened here !

  1. First the columns are selected from the table.
  2. Next, GROUP BY is ran on the age column. This gives us age groups without any duplicates - exactly how its in the output.
  3. Next the aggregate function (COUNT()) is executed on each group so for group 22 COUNT() runs and returns 1 as there is only one record for that group, similarly for age group 30, COUNT() runs again and it finds 2 records one of Niko and one of Jin Sakai and returns 2, and follows.
  4. Lastly age is displayed (No duplicates because of GROUP BY) and the aggregate results for each group.

Now this can still be very unclear and confusing lemme try to clarify again.

Now if we do something like this lets see what happens :

SELECT name FROM customers GROUP BY age;

We are displaying names but creating groups by age. What happens is that the groups are still created but as discussed duplicates are not displayed, in our case both Niko and Jin Sakai have age 30 but only Niko is displayed as its the first to come and is of age 30 therefore any other record with age 30 will not be displayed.

SQL standard requires all non-aggregated SELECT columns to be in GROUP BY

One thing to note : GROUP BY does NOT modify the table, it just is used to display data in groups.

The main issue I faced is that I read queries from left to right and like an English sentence and expecting a left to right execution but SQL does not work like that it does have a execution precedence i.e. what executes first.

So in the case of GROUP BY the grouping is done before any aggregate functions run even though we write GROUP BY column_name after aggregate functions.

So here is the trick - First look for a GROUP BY Clause and mentally think that groups are created and the aggregate functions will run for each group and...well...its exactly how its executed.


❓ Why use GROUP BY in the first place ?

For many cases we would need specific aggregate data but based on specific group. For example, We want to see the count of users for each age group and only the count and not other details we can use GROUP BY here by doing :

SELECT age, COUNT(*) FROM customers GROUP BY age;

This displays each age, without any duplicate and the number of users for each.

There can be many use cases like these.


❔ More Questions

Lets see questions on GROUP BY with aggregate functions.

Q. Display each age and sum for each age group.

S. We can combine GROUP BY with SUM() to solve this.

SELECT age, SUM(age) FROM customers GROUP BY age;

Q. Display number of customers living in each address.

S.

SELECT address, COUNT(*) FROM customers GROUP BY address;

Q. What will this Query Output ?

SELECT name, address, SUM(age)
FROM customers
GROUP BY address;

S. This will return the table as is, specifically in alphabetical order based on address. The first thing is to find a GROUP BY column_name and we found that. Since each of the address are distinct we have 1 group per customer. Display their address and name as is and the age too because since there is only 1 record per group nothing is there to sum with.


Q. Display average ages per address.

S.

SELECT address, AVG(age) 
FROM customers 
GROUP BY address

Similar result because we try to group by address which is distinct for each record so we have group for each address and average will be the age as is.


Q. Display count of customers from the oldest 2 age groups.

S. This actually mix many concepts lets see:

SELECT age, COUNT(*) FROM customers 
GROUP BY age 
ORDER BY age DESC 
LIMIT 2; 

We group ages then order them by descending order so that the groups are arranged Highest First then we limit the output by 2 and finally display the age groups and aggregate results !

We used GROUP BY, ORDER BY, Aggregate Functions and LIMIT in one question !


And...we Have finished GROUP BY..... At least I think 😐, It still confuses me...

Follow The HAVING Clause to continue from here.

Browse Full SQL Page.

#SQL on the go