CASE WHEN statements are similar to if-then statements in other programming languages. These statements are used to have SQL execute certain behaviors as determine by the criteria that is set in the statement. In the video below, we will go through several examples of how to use CASE WHEN statements in SQL.
This post wuill provide an explanation of how to use the following commands in SQL
The NOT command is used to exclude data based on some criteria. In the example, below we will use the NOT command to exclude basketball players who play the power forward or shooting guard position. The code is next followed by the output.
SELECT * FROM Seasons_Stats WHERE Pos NOT IN ('PF','SG')
Here is a breakdown of the code.
- Line 1 tells SQL to select all columns in the dataset
- Line 2 explain which table to use
- Line 3 Use the WHERE command first which serves as a filter. Next, the NOT command means to exclude what comes next. The IN command indicates specifically what will be excluded. The information in the parentheses is the categories to remove from the column called “POS”
IF you look at the first ten rows shown above you will see there are no PF or SG in the POS column which is what we wanted.
The example above excluded text data but we can also exclude numerical values as well. In the code below we replace the “Pos” column with the “Year” and we exclude the Years of 1950 and 1051. The code and output are below.
SELECT * FROM Seasons_Stats WHERE "Year" NOT IN (1950,1951)
If you look at the “Year” colum you can see it does not start until 1952. In addition. you can clearly see that there are PF and SG in the “Pos” column which used to be what we exlcuded in the previous example.
The AND command is used whe n you have multiple criteria for including or excluding information. In the code below we continue to exlcude plauers from 1950 and 1951 but then with the AND command we exclude players whose first name starts with an A. Below is the code and output.
SELECT * FROM Seasons_Stats WHERE "Year" NOT IN (1950,1951) AND NOT Player LIKE 'A%' ORDER BY Player
What’s new this time is the use of the LIKE command which is used when you are not exactly sure what you are looking for. THis is way the “A” is in sinle quotes followed by the percent sign. THis tells SQL to exclude anything that starts with an A.
The BETWEEN command is when you are searching to include values that fit a certain range. In the code below, we remove the NOT command and include the BETWEEN and the AND command. THis means that we want to include rows that meet our critria rather than exlcude them.
SELECT * FROM Seasons_Stats WHERE Age BETWEEN 41 AND 43 ORDER BY Age
In this code we want any player who is between 41 and 43 years of age. Below is the same output but written with the use of the NOT command
SELECT * FROM Seasons_Stats WHERE Age NOT BETWEEN 18 AND 40 ORDER BY Age
Which is better probably comes down to preference
The commands mentioned here are among some of the basic tools anybody who wants to use SQL needs to know. In the future we will deal with other concepts related to SQL
In this video, we will look at more commands that are used for aggregation in SQL. The three commands we will look at include the DISTINCT, GROUP BY, and the HAVING command. The DISTINCT command pulls unique values, the GROUP BY command allows you to aggregate data by categories, and the HAVING command allows you to filter results as you decide.
This post will focus on several SQL commands that are often used for aggregation purposes. These commands are
- GROUP BY
The data that will be used as examples is the Seasons_Stats data from Kaggle which is a database of NBA statistics.
The DISTINCT command allows you to find unique items in a column. As an example, we want to find all of the unique positions in our database. Below is the code and output to do this.
SELECT DISTINCT Pos from Seasons_Stats
In this dataset we can see that there are more than 5 positions in basketball because some of the players were coded as playing more than one position
The GROUP BY command allows use to aggregate data by a group or a category in a categorical variable. In the example below we sum the points scored by position and we sort them by descending order. The code and output is below.
SELECT SUM(PTS), Pos from Seasons_Stats Group BY Pos Order by SUM(PTS) DESC
We can see that shooting guards are the most prolific scorers followed by small forwards. By aggregating the points and then grouping them by position we are able to achieve a more nuance insight into our data.
The HAVING command is a filter command similar to the WHERE command but the HAIVNG command is used in the context of aggregation. In the example below, we want to filter our previous code so that we only see positions that scored more than 2 million points. If we use a WHERE command we will get an error message as shown below.
SELECT SUM(PTS), Pos from Seasons_Stats WHERE SUM(PTS) >= 2000000 Group BY Pos Order by SUM(PTS) DESC
The WHERE command cannot deal with aggregate data. Below is the same code but with the using of the HAVING command.
SELECT SUM(PTS), Pos FROM Seasons_Stats GROUP BY Pos HAVING SUM(PTS) >= 2000000 ORDER BY SUM(PTS) DESC
By adding the HAVING command and stating we only wanted positions that scored more than 2 million points everything now works.
Hopefully, after learning about this commands, aggregating data is a little easier in SQL. There is always more than one way to do anything when using code. However, at least one way has been explained to deal with the problems that were addressed here.
CASE WHEN statements are the SQL version of If-Then statements. Essentially, SQL will perform certain behaviors based on the the criteria you set at the data analyst. In this post, we will go through several examples of the use of CASE WHEN statements using the NBA_salary database that is available on Kaggle.
In the first example, we are going to use a CASE WHEN statement to identify if a basketball player is a shooting guard or not below is the code followed by the output.
SELECT Pos,Tm, CASE WHEN Pos = 'SG' THEN 'Yes' END AS Shooting_Guard FROM Seasons_Stats
In line one we SELECT two columns which are Pos and Tm. In the second line is the CASE WHEN statement. We are telling SQL that when Pos = ‘SG’ then place a yes in a new column called Shooting_Guard. You can see in the second row that there is a shooting guard there and this is why there is a yes. All the other rows have a NULL there. In the next example we will learn how to deal with the NULL values.
In the second example we will replace the NULL values with the word No. Below is the code and output
SELECT Pos,Tm, CASE WHEN Pos = 'SG' THEN 'Yes'ELSE 'No' END AS Shooting_Guard FROM Seasons_Stats
The only change to the code is found in the second line. We added the ELSE command and place the word No in single quotes. This tells SQL that if a player is not a shooting guard then put the word No in the new column.
In this example we are going to employ an OR command. As with If-Then statements, CASE WHEN Statements can use AND, OR, IN commands as well. Below we are going to have SQL say yes if a player is a shooting guard or a member of the CHS team.
SELECT Pos,Tm, CASE WHEN Pos = 'SG' OR Tm = 'CHS' THEN 'Yes'ELSE 'No' END AS Shooting_Guard FROM Seasons_Stats
The only change is found in line 2 again. We added the OR command and told SQL to say yes if Tm = CHS. This why if you look at the results we have a yes in row 3. Even though this player is a SF SQL still marked yes because he was on the CHS team.
In our last example, we are going to make several categories by using several WHEN statements. Specifically, we are going to divide players based on how many points they scored. Below is the code and output.
SELECT PTS, CASE WHEN PTS > 2500 THEN 'Over 2500' WHEN PTS > 1500 AND PTS < 2499 THEN 'Over 1500' WHEN PTS > 500 AND PTS < 1499 THEN 'Over 500' ELSE 'Less than 500' END AS Pts_groups FROM Seasons_Stats
Here is an explanation
- We call the PTS column
- We create our CASE WHEN statements
- If a player scored more than 2500 then say ‘Over 2500’
- If a player scored more than 1500 and less then 2499 the say ‘Over 1500’
- If a player scored more than 500 and less then 1499 the say ‘Over 500’
- If a player scored less than 1500 then say ‘Less than 500’
This was just several simple examples of the use of CASE WHEN statements. The purpose of this command is to perform certain actions when certain criteria are set. When this happens powerful things can be done in SQL.
Aggregation is focused on summarizing data in order to develop some insights for taking action. This post will provide some simple ways to aggregate data in SQL. Specifically, we will learn how to use the following commands
The data we are using is nba_salary which is available at the Kaggle website.
The COUNT command allows you to count the number of rows that contain data in a table. In the code below we are going to count the number of data containing rows in the Seasons_Stats table.
SELECT COUNT(*) FROM Seasons_Stats
As you can see the output is not much only one cell. In this first example we took into consideration all of the columns in the table. That is why there is an asterisk inside the COUNT function parentheses. However, you can also limit this to one or more columns as shown below.
SELECT COUNT(Player) FROM Seasons_Stats
In the code above, we limited are query to only the “Player” column in the Table. You can see there is a differece in the count for this example and the previuous one. This indicates that some of the rows do not have data for the Player’s name.
The sum gives a total for numerical values. For our example, we are going to add up all the points scored by the players in our table.
SELECT SUM(PTS) FROM Seasons_Stats
In this table, you can see the number of points scored. This command is rather straightforward in how to use it,
MIN and MAX
The MIN and MAX commands do what they say. To make this a little more interesting we are going to use these two commands together to find the range of the data. In the table we are going to use the Age column which shares the age of the player. We are going to do three things in the code below
- calculate the minimum age
- Calculate the maximum age
- Find the difference between the MAX and the MIN to get the range of the age
The code and output are below
SELECT MIN(Age), MAX(Age), MAX(Age) - MIN(Age) FROM Seasons_Stats
The AVG command calculates the average. this is also rather simple. Therefore, the code is below for the average age of a basketball player.
SELECT AVG(Age) FROM Seasons_Stats
This post share some simple ways in which aggregation can happening when using SQL. There are naturally other ways that these tools can be used. However, for informative purposes this will provide a foundation for someone who is new to using SQL.
Joining data is perform for many different reasons. Often it is done because two or more tables have information that needs to be in one place for analysis. In the video that is embedded below, we will learn how to do simple JOINS in SQL.
This video explains the NOT, AND, & BETWEEN Commands in SQL. These are commonly used commands in SQL that any data scientist needs to be familiar with. The majority of an analyst’s day is actually spent getting and cleaning data. The video below provides a basic explanation of a basic concept that all analysts need to know.
In this video, ERT will explain how to filter data in SQL. Often, we want results but we don’t want everything the database can provide us. This can be because its too much information but also because getting everything can take a long time with larger database. For these and other reasons we will learn how to use some of the Filtering commands in SQL.
This post will explore the use and application of the WHERE, LIKE, and IN commands in SQL.
The WHERE command is generally used as a way to filter data in SQL. The database we are using in this post contains data on basketball players from 1950-2017. What we want to do is filter the data so that we only see data from players who played in 2017. In the example below, we will filter our basketball players by year be set to 2017
SELECT * FROM Seasons_Stats WHERE year = 2017
If you look at the year column you can clearly see that the value is set to 2017 just as we wanted it.
The WHERE command is not limited to numbers as it can also be used with text. In the example below, we filter our data with the WHERE command so that we only see players who played for the Golden State Warriors (GSW).
SELECT * FROM Seasons_Stats WHERE Tm = 'GSW'
If you look closely at the “Tm” column you will only see the initials for the Golden State Warriors.
The WHERE command is often teamed with the LIKE command when you are looking for a text but are not sure of an exact match. You can specify a pattern you are looking for similar to regular expressions. In the code below we use the LIKE command with the WHERE command searching for any player whose name begins with L.
SELECT * FROM Seasons_Stats WHERE Player LIKE 'L%'
The percentage sign (%) after the letter L tells SQL that anything can be after the letter L in the search and meet the criteria. We can also put the % at the end of or text or put one on both sides are some other combination as shown in the examples below.
In the example below, the last letter must be an L. After this example, is one in which an L must appear anywhere in the name.
SELECT * FROM Seasons_Stats WHERE Player LIKE '%L'
SELECT * FROM Seasons_Stats WHERE Player LIKE '%L%'
The IN command allows you to filter your data based on several values. This command is also combined with the WHERE command. In the example below, we filter our data so that we only see players whose position is small forward or shooting guard.
SELECT * FROM Seasons_Stats WHERE Pos in ('SG', 'SF')
You could also insert numerical values in the parentheses when using the IN command.
The command shared in this post provide more information on basic tools you can use and apply in SQL. Filtering using the WHERE, LIKE, and IN commands get helpo you to focus the breath of your research to find answers to your questions.
For those who may not be familiar with this, this video will provide a brief intro to SQL. We will look at a few commands as we begin our journey learning how to navigate databases and the SQL programming language.
SQL (structured query language) is a programming language used with databases. A database is essentially a collection of tables that all contain information in a text or numerical form. SQL can be used for developing and maintaining databases for various purposes. Our goal is not to learn how to become database administrators. Instead, we are going to focus on using SQL for data analysis.
The data used in this example is NBA Salary and Statistics 2016-17 (SQLite version), a database file available at Kaggle. This database contains information on basketball statistics from 1950-2017. There is more that could be said, but for simplicity, we will learn as we go along.
The software used to explore the database in these examples is DBeaver, a free, open-source software database manager. You can run SQL from a terminal, but it is a little bit easier for beginners to have a GUI interface most of the time. If you want to use this database, you can download it and connect this database to whatever database manager you are using.
In this post, we will focus on using the following commands.
- ORDER BY
SELECT and FROM
The SELECT and FROM commands almost always work together at the beginning of an SQL query. The SELECT command tables which column(s) you want SQL to pull from a table. The FROM command indicates which table you want to pull data from. Below is code and a screenshot of using these two commands together to pull data from the database.
This is what the output looks like in DBeaver. What you see is different. The asterisk after SELECT tells SQL to pull all columns in the table. You can also see a combination of text and numeric data in the columns we can see.
The next command that we will learn is the LIMIT command. This command limits how many rows of data are returned for an output. The code and output are below.
By using the LIMIT command, only five rows of data were pulled. How many rows you want is up to you, but this example was set to 5.
The ORDER BY command allows you to sort the data from highest to lowest or vice versa based on one or more columns in the table. In the example below, we will sort the data by the age column descending to determine who are some of the oldest players to play in the NBA since 1950, which is the first year data is available in this data set. Also, note that only the first five rows are listed because of the LIMIT command. The code and output is below.
order by Age DESC
From the output, it appears that Kevin Willis is the oldest player to play in the NBA since 1950. The DESC argument next to ORDER BY sorts the results from highest to lowest.
There is much more to learn about SQL than what was shared here. However, the most important thing to know whenever doing a data analyst is to make sure you know what you want to know. This involves developing straightforward questions that can be answered with available tools.