GROUP BY, DISTINCT, HAVING Commands in SQL

This post will focus on several SQL commands that are often used for aggregation purposes. These commands are

  • DISTINCT
  • GROUP BY
  • HAVING

The data that will be used as examples is the Seasons_Stats data from Kaggle which is a database of NBA statistics.

DISTINCT

The DISTINCT command allows you to find unique items in a column. As an example, we want to find all of the unique positions in our database. Below is the code and output to do this.

SELECT DISTINCT Pos
from Seasons_Stats

In this dataset we can see that there are more than 5 positions in basketball because some of the players were coded as playing more than one position

GROUP BY

The GROUP BY command allows use to aggregate data by a group or a category in a categorical variable. In the example below we sum the points scored by position and we sort them by descending order. The code and output is below.

SELECT SUM(PTS), Pos
from Seasons_Stats
Group BY Pos
Order by SUM(PTS) DESC

We can see that shooting guards are the most prolific scorers followed by small forwards. By aggregating the points and then grouping them by position we are able to achieve a more nuance insight into our data.

HAVING

The HAVING command is a filter command similar to the WHERE command but the HAIVNG command is used in the context of aggregation. In the example below, we want to filter our previous code so that we only see positions that scored more than 2 million points. If we use a WHERE command we will get an error message as shown below.

SELECT SUM(PTS), Pos
from Seasons_Stats
WHERE SUM(PTS) >= 2000000
Group BY Pos
Order by SUM(PTS) DESC

The WHERE command cannot deal with aggregate data. Below is the same code but with the using of the HAVING command.

SELECT SUM(PTS), Pos
FROM Seasons_Stats
GROUP BY Pos
HAVING SUM(PTS) >= 2000000
ORDER BY SUM(PTS) DESC

By adding the HAVING command and stating we only wanted positions that scored more than 2 million points everything now works.

Conclusion

Hopefully, after learning about this commands, aggregating data is a little easier in SQL. There is always more than one way to do anything when using code. However, at least one way has been explained to deal with the problems that were addressed here.

Leave a Reply