Tag Archives: SQL


In this video, we will look at more commands that are used for aggregation in SQL. The three commands we will look at include the DISTINCT, GROUP BY, and the HAVING command. The DISTINCT command pulls unique values, the GROUP BY command allows you to aggregate data by categories, and the HAVING command allows you to filter results as you decide.


This post will focus on several SQL commands that are often used for aggregation purposes. These commands are


The data that will be used as examples is the Seasons_Stats data from Kaggle which is a database of NBA statistics.


The DISTINCT command allows you to find unique items in a column. As an example, we want to find all of the unique positions in our database. Below is the code and output to do this.

from Seasons_Stats

In this dataset we can see that there are more than 5 positions in basketball because some of the players were coded as playing more than one position


The GROUP BY command allows use to aggregate data by a group or a category in a categorical variable. In the example below we sum the points scored by position and we sort them by descending order. The code and output is below.

from Seasons_Stats
Group BY Pos
Order by SUM(PTS) DESC

We can see that shooting guards are the most prolific scorers followed by small forwards. By aggregating the points and then grouping them by position we are able to achieve a more nuance insight into our data.


The HAVING command is a filter command similar to the WHERE command but the HAIVNG command is used in the context of aggregation. In the example below, we want to filter our previous code so that we only see positions that scored more than 2 million points. If we use a WHERE command we will get an error message as shown below.

from Seasons_Stats
WHERE SUM(PTS) >= 2000000
Group BY Pos
Order by SUM(PTS) DESC

The WHERE command cannot deal with aggregate data. Below is the same code but with the using of the HAVING command.

FROM Seasons_Stats
HAVING SUM(PTS) >= 2000000

By adding the HAVING command and stating we only wanted positions that scored more than 2 million points everything now works.


Hopefully, after learning about this commands, aggregating data is a little easier in SQL. There is always more than one way to do anything when using code. However, at least one way has been explained to deal with the problems that were addressed here.

SQL and CASE WHEN Statements


CASE WHEN statements are the SQL version of If-Then statements. Essentially, SQL will perform certain behaviors based on the the criteria you set at the data analyst. In this post, we will go through several examples of the use of CASE WHEN statements using the NBA_salary database that is available on Kaggle.

Example 1

In the first example, we are going to use a CASE WHEN statement to identify if a basketball player is a shooting guard or not below is the code followed by the output.

CASE WHEN Pos = 'SG' THEN 'Yes' END AS Shooting_Guard
FROM Seasons_Stats

In line one we SELECT two columns which are Pos and Tm. In the second line is the CASE WHEN statement. We are telling SQL that when Pos = ‘SG’ then place a yes in a new column called Shooting_Guard. You can see in the second row that there is a shooting guard there and this is why there is a yes. All the other rows have a NULL there. In the next example we will learn how to deal with the NULL values.

Example 2

In the second example we will replace the NULL values with the word No. Below is the code and output

CASE WHEN Pos = 'SG' THEN 'Yes'ELSE 'No' END AS Shooting_Guard
FROM Seasons_Stats

The only change to the code is found in the second line. We added the ELSE command and place the word No in single quotes. This tells SQL that if a player is not a shooting guard then put the word No in the new column.

Example 3

In this example we are going to employ an OR command. As with If-Then statements, CASE WHEN Statements can use AND, OR, IN commands as well. Below we are going to have SQL say yes if a player is a shooting guard or a member of the CHS team.

CASE WHEN Pos = 'SG' OR Tm = 'CHS' THEN 'Yes'ELSE 'No' END AS Shooting_Guard
FROM Seasons_Stats

The only change is found in line 2 again. We added the OR command and told SQL to say yes if Tm = CHS. This why if you look at the results we have a yes in row 3. Even though this player is a SF SQL still marked yes because he was on the CHS team.

Example 4

In our last example, we are going to make several categories by using several WHEN statements. Specifically, we are going to divide players based on how many points they scored. Below is the code and output.

CASE WHEN PTS > 2500 THEN 'Over 2500'
	WHEN PTS > 1500 AND PTS < 2499 THEN 'Over 1500'
	WHEN PTS > 500 AND PTS < 1499 THEN 'Over 500'
	ELSE 'Less than 500' END AS Pts_groups
FROM Seasons_Stats 

Here is an explanation

  1. We call the PTS column
  2. We create our CASE WHEN statements
    • If a player scored more than 2500 then say ‘Over 2500’
    • If a player scored more than 1500 and less then 2499 the say ‘Over 1500’
    • If a player scored more than 500 and less then 1499 the say ‘Over 500’
    • If a player scored less than 1500 then say ‘Less than 500’


This was just several simple examples of the use of CASE WHEN statements. The purpose of this command is to perform certain actions when certain criteria are set. When this happens powerful things can be done in SQL.

Aggregating Data in SQL

Aggregation is focused on summarizing data in order to develop some insights for taking action. This post will provide some simple ways to aggregate data in SQL. Specifically, we will learn how to use the following commands

  • SUM
  • MIN
  • MAX
  • AVG

The data we are using is nba_salary which is available at the Kaggle website.


The COUNT command allows you to count the number of rows that contain data in a table. In the code below we are going to count the number of data containing rows in the Seasons_Stats table.

FROM Seasons_Stats

As you can see the output is not much only one cell. In this first example we took into consideration all of the columns in the table. That is why there is an asterisk inside the COUNT function parentheses. However, you can also limit this to one or more columns as shown below.

FROM Seasons_Stats

In the code above, we limited are query to only the “Player” column in the Table. You can see there is a differece in the count for this example and the previuous one. This indicates that some of the rows do not have data for the Player’s name.


The sum gives a total for numerical values. For our example, we are going to add up all the points scored by the players in our table.

FROM Seasons_Stats

In this table, you can see the number of points scored. This command is rather straightforward in how to use it,


The MIN and MAX commands do what they say. To make this a little more interesting we are going to use these two commands together to find the range of the data. In the table we are going to use the Age column which shares the age of the player. We are going to do three things in the code below

  • calculate the minimum age
  • Calculate the maximum age
  • Find the difference between the MAX and the MIN to get the range of the age

The code and output are below

	   MAX(Age) - MIN(Age)
FROM Seasons_Stats


The AVG command calculates the average. this is also rather simple. Therefore, the code is below for the average age of a basketball player.

FROM Seasons_Stats


This post share some simple ways in which aggregation can happening when using SQL. There are naturally other ways that these tools can be used. However, for informative purposes this will provide a foundation for someone who is new to using SQL.


This video explains the NOT, AND, & BETWEEN Commands in SQL. These are commonly used commands in SQL that any data scientist needs to be familiar with. The majority of an analyst’s day is actually spent getting and cleaning data. The video below provides a basic explanation of a basic concept that all analysts need to know.


In this video, ERT will explain how to filter data in SQL. Often, we want results but we don’t want everything the database can provide us. This can be because its too much information but also because getting everything can take a long time with larger database. For these and other reasons we will learn how to use some of the Filtering commands in SQL.

WHERE, LIKE and IN Commands for SQL

This post will explore the use and application of the WHERE, LIKE, and IN commands in SQL.


The WHERE command is generally used as a way to filter data in SQL. The database we are using in this post contains data on basketball players from 1950-2017. What we want to do is filter the data so that we only see data from players who played in 2017. In the example below, we will filter our basketball players by year be set to 2017

FROM Seasons_Stats
WHERE year = 2017

If you look at the year column you can clearly see that the value is set to 2017 just as we wanted it.

The WHERE command is not limited to numbers as it can also be used with text. In the example below, we filter our data with the WHERE command so that we only see players who played for the Golden State Warriors (GSW).

FROM Seasons_Stats

If you look closely at the “Tm” column you will only see the initials for the Golden State Warriors.


The WHERE command is often teamed with the LIKE command when you are looking for a text but are not sure of an exact match. You can specify a pattern you are looking for similar to regular expressions. In the code below we use the LIKE command with the WHERE command searching for any player whose name begins with L.

FROM Seasons_Stats
WHERE Player LIKE 'L%'

The percentage sign (%) after the letter L tells SQL that anything can be after the letter L in the search and meet the criteria. We can also put the % at the end of or text or put one on both sides are some other combination as shown in the examples below.

In the example below, the last letter must be an L. After this example, is one in which an L must appear anywhere in the name.

FROM Seasons_Stats
WHERE Player LIKE '%L'
FROM Seasons_Stats
WHERE Player LIKE '%L%'


The IN command allows you to filter your data based on several values. This command is also combined with the WHERE command. In the example below, we filter our data so that we only see players whose position is small forward or shooting guard.

FROM Seasons_Stats
WHERE Pos in  ('SG', 'SF')

You could also insert numerical values in the parentheses when using the IN command.


The command shared in this post provide more information on basic tools you can use and apply in SQL. Filtering using the WHERE, LIKE, and IN commands get helpo you to focus the breath of your research to find answers to your questions.

Intro to SQL

SQL (structured query language) is a programming language used with databases. A database is essentially a collection of tables that all contain information in a text or numerical form. SQL can be used for developing and maintaining databases for various purposes. Our goal is not to learn how to become database administrators. Instead, we are going to focus on using SQL for data analysis.

The data used in this example isĀ NBA Salary and Statistics 2016-17 (SQLite version), a database file available at Kaggle. This database contains information on basketball statistics from 1950-2017. There is more that could be said, but for simplicity, we will learn as we go along.

The software used to explore the database in these examples is DBeaver, a free, open-source software database manager. You can run SQL from a terminal, but it is a little bit easier for beginners to have a GUI interface most of the time. If you want to use this database, you can download it and connect this database to whatever database manager you are using.

In this post, we will focus on using the following commands.

  • FROM


The SELECT and FROM commands almost always work together at the beginning of an SQL query. The SELECT command tables which column(s) you want SQL to pull from a table. The FROM command indicates which table you want to pull data from. Below is code and a screenshot of using these two commands together to pull data from the database.


FROM Seasons_Stats

This is what the output looks like in DBeaver. What you see is different. The asterisk after SELECT tells SQL to pull all columns in the table. You can also see a combination of text and numeric data in the columns we can see.


The next command that we will learn is the LIMIT command. This command limits how many rows of data are returned for an output. The code and output are below.


FROM Seasons_Stats


By using the LIMIT command, only five rows of data were pulled. How many rows you want is up to you, but this example was set to 5.


The ORDER BY command allows you to sort the data from highest to lowest or vice versa based on one or more columns in the table. In the example below, we will sort the data by the age column descending to determine who are some of the oldest players to play in the NBA since 1950, which is the first year data is available in this data set. Also, note that only the first five rows are listed because of the LIMIT command. The code and output is below.


FROM Seasons_Stats 

order by Age DESC 


From the output, it appears that Kevin Willis is the oldest player to play in the NBA since 1950. The DESC argument next to ORDER BY sorts the results from highest to lowest.


There is much more to learn about SQL than what was shared here. However, the most important thing to know whenever doing a data analyst is to make sure you know what you want to know. This involves developing straightforward questions that can be answered with available tools.