In the video below we will look at how to use UNION and INTERSECT clauses in SQL. Both of these clauses have to do with joining data in ways that are an alternative to JOIN statements.
Tag Archives: SQL
Subqueries in SQL VIDEO
Window Function in SQL
SQL window functions are calculation functions based on individual rows that create a new column that contains the new calculation. They are similar to aggregate functions but are different in that normal aggregate functions like “group by,” will provide a single number as the output. As mentioned earlier, with a window function, the results are placed in a new column for each row.
Window functions allow users to perform many analytical tasks, from ranking and partitioning data to calculating moving averages, cumulative sums, and differences between consecutive rows. Again, what is important here is that widow functions can be used to calculate values between rows.
Basic Example
In this first example, we will figure out how many customers we have from Texas and California and put this in a new column. Below is the code followed by the output.
SELECT first_name, last_name,state ,
COUNT(*) OVER () as total_customers
FROM customers
WHERE state in ('CA','TX')
ORDER BY customer_id;
In the select statement, we pull first_name, last_name, and state, and we then have our window function. In this window function, we are counting the number of customers in the customer table who are from CA and TX. The OVER() function is used to define the window. Since it is blank it is telling SQL that the entire table is the window. This will not make sense right now but is critical information for future examples. Lastly, we are ordering the data by customer_id.
The output indicates that 9,903 customers are from CA or TX. We can confirm this by running a separate line of code.
SELECT COUNT(*)
FROM customers
where state in('CA','TX')
The output from the window function is repeated in every row called total_customers. Repeating this information doesn’t make sense but it shows us what the window function does. For example, in row 1, the function sees that this person is from TX and then outputs that the total number of customers is 9,903 for somebody from TX or CA. This is a basic example of what window functions can do. Of course, there are things much more insightful than this that can be calculated.
Intermediate Example
We are now going to run mostly the same code with one slight difference. We want to know not just how many total customers there are but how many are from TX and how many are from CA. To do this we will have to use PARTITION BY which is the group by clause for window functions.
SELECT customer_id, first_name, last_name, state,
COUNT(*) OVER (PARTITION BY state) as total_customers
FROM customers
where state in ('CA','TX')
ORDER BY customer_id;
In the output, we have all of the same rows from the previous SELECT clause and the total_customers columns are different. When a person is from TX this column shows how many people are from TX but when a person is from CA it shows how many people are from CA. If you add up the number of people from TX and CA you get the following
4,865 + 5,038 = 9,903
This is the same amount as the total number of customers in our previous example. The PARTITION BY clause breaks the number of customers into two groups, those from TX and those from CA, and assigns the appropriate value based on where the customer in that row is from.
Advanced Examples
The next example will involve using the SUM aggregation function. We are going to add customer_id in a new column. This will be a running total. In other words, SQL will keep adding the customer_id until they get through all of the data. Below is the code followed by the output and explanation.
SELECT customer_id, title, first_name, last_name, state,
Sum(customer_id) OVER (ORDER BY customer_id) as total_customers
FROM customers
where state in('CA','TX')
ORDER BY customer_id;
Here is what is happening. In the total_customer column, a running total of customer_id is being created. For example, row 1 has the value 10 because that is the first customer id of someone from TX. Row 2 has a customer ID of 13. When you add 10 + 13 = 23 which is the value in row 2 of total_customers. This continues for the rest of the table.
Here is another example this time with the RANK() function. The RANK() function allows you to create a new column that ranks the data based on a criteria. For our example, we will rank the data based on their customer_id with the lower the id number the higher the ranking. To make this even more interesting, we will partition the data by state so that the lowest value customer_id will be number 1 for TX and the lowest ranking customer_id will be number 1 for CA. Below is the code
SELECT customer_id, title, first_name, last_name, state,
rank () OVER (PARTITION BY state ORDER BY customer_id) as total_customers
FROM customers
where state in ('CA','TX')
ORDER BY customer_id;
Row number 1 is rank 1 because it is the lowest value customer_id of all TX. Row 2 is also ranked 1 because it is the lowest value customer_id of all CA.
Conclusion
The possibilities are almost endless with window functions. These tools allow you to get into the data and find insightful answers to complex questions. The examples here are only scratching the surface of what can be achieved.
UNION and INTERSECT in SQL
UNION and INTERSECT are two useful statements used in SQL for specific purposes. In this post, we will define, provide examples of each statement, and compare and contrast UNION, INTERSECT, with the JOIN command.
UNION
The UNION statement is used to append rows together from different select statements. Below is the code followed by the output and explanation.
(
SELECT street_address, city, state, postal_code
FROM customers
WHERE street_address IS NOT null
)
union
(
SELECT street_address, city, state, postal_code
FROM dealerships
WHERE street_address IS NOT null
)
Notice how the select statements are both in their own set of parentheses with the union statement between them. In addition, you can see that both select statements used the same columns. Remember we are trying to combine data from different places that have the same columns. You can ignore this but the output will be hard to comprehend. For example, if you have different columns in each select statement you will get an output but it will be hard to interpret.
Essentially, in the example above, we took the same columns from two different tables and created one table. UNION removes duplicates. If you want duplicates you must use UNION ALL
In contrast, joins are used to append columns together based on criteria. If we were to join two or more tables the joined table would increase in its number of columns and possibly its rows. A UNION will have a specified number of columns while growing in terms of the number of rows that are present in the output.
INTERSECT
INTERSECT finds common rows between select statements. It is highly similar to JOIN with the main difference being INTERSECT removes duplicates while JOIN will not. Below is an example.
SELECT state,postal_code FROM customers
INTERSECT
SELECT state,postal_code FROM dealerships;
The code is simple. You make your select statements and place INTERSECT in between them. The results above show us what data these two select statements share. When state and postal_code are the criteria for these two tables only these three rows are in common. If we did a JOIN we would get every instance of these three state and postal codes rather than just the unique ones.
Conclusion
UNION and INTERSECT have their place in the data analyst toolbelt. UNION is for appending rows together. INTERSECT is for finding commonalities between different select statements. Both UNION and INTERESECT remove duplicates which is not done when using a JOIN.
Subqueries in SQL
Subqueries allow you to use the tables produced by SELECT queries. One advantage of this is instead of referencing an existing table in your database you can pull from the table you are making when making your SQL query. This is best explained with an example.
WHERE Clause
I have two tables I want to pull data from called salespeople and dealerships. The titles of these two tables explain what they contain. One column these two tables have in common is dealership_id which identifies the dealership in the dealership table and where the salesperson works at in the salespeople table.
Now the question we want to know is “Which of my salespeople work in Texas.” This can only be determined by using my two tables together the salespeople table does not have the state the person is in while the dealerships table has the state but does not have the salesperson’s information.
This problem can be solved with a subquery. Below is the code followed by the output and an explanation.
SELECT *
FROM salespeople
WHERE dealership_id IN (
SELECT dealership_id FROM dealerships
WHERE dealerships.state = 'TX'
)
The first two lines are standard SQL coding. In line three, we have our subquery in the WHERE clause. What is happening is that we are filtering the data to include only dealership_id that matches the state of Texas inside the parentheses. This leads to a note that subqueries are always inside parentheses.
SELECT Clause
Using subqueries in the WHERE clause is most common but we can also do them in the SELECT clause as well. In our first example we learned where the employees work but let’s say we want to know the city and not just the state. Below is a way to pull the city data for one dealership with the salespeople data.
SELECT *,
(SELECT city
FROM dealerships d
WHERE d.dealership_id = 19) as City
FROM salespeople
WHERE salespeople.dealership_id =19;
In this example, we pull all data from salespeople while pulling only data related to the city from the dealership while also filtering for only dealership_id 19.
FROM Clause
We can also place subqueries in a FROM clause. In the example below, I want the first and last name of the salesperson followed by the city and state they work in. The name info is on the salespeople table and the city and state are on the dealership table. Below is the code and results.
SELECT *
FROM
(SELECT s.first_name, s.last_name,d.city,d.state
FROM salespeople s , dealerships d
WHERE s.dealership_id= d.dealership_id);
The code should be self-explanatory. Inside the parentheses, we are creating a table we want to pull data from. The subquery is essentially a join of the two tables based on the criteria. This brings up an important point. Subqueries and joins (inner joins in particular) serve the same purpose. Joins are better for large amounts of data while suqueries are better for smaller amounts of data.
Conclusion
Subqueries are another tool that can be used to deal with data questions. They can be used in the WHERE, SELECT, or FROM clauses. When to use them is a matter of preference and efficiency.
CASE WHEN Statements in SQL VIDEO
CASE WHEN statements are similar to if-then statements in other programming languages. These statements are used to have SQL execute certain behaviors as determine by the criteria that is set in the statement. In the video below, we will go through several examples of how to use CASE WHEN statements in SQL.
NOT, AND & BETWEEN Commands in SQL
This post wuill provide an explanation of how to use the following commands in SQL
- NOT
- AND
- BETWEEN
NOT
The NOT command is used to exclude data based on some criteria. In the example, below we will use the NOT command to exclude basketball players who play the power forward or shooting guard position. The code is next followed by the output.
SELECT *
FROM Seasons_Stats
WHERE Pos NOT IN ('PF','SG')
Here is a breakdown of the code.
- Line 1 tells SQL to select all columns in the dataset
- Line 2 explain which table to use
- Line 3 Use the WHERE command first which serves as a filter. Next, the NOT command means to exclude what comes next. The IN command indicates specifically what will be excluded. The information in the parentheses is the categories to remove from the column called “POS”
IF you look at the first ten rows shown above you will see there are no PF or SG in the POS column which is what we wanted.
The example above excluded text data but we can also exclude numerical values as well. In the code below we replace the “Pos” column with the “Year” and we exclude the Years of 1950 and 1051. The code and output are below.
SELECT *
FROM Seasons_Stats
WHERE "Year" NOT IN (1950,1951)
If you look at the “Year” colum you can see it does not start until 1952. In addition. you can clearly see that there are PF and SG in the “Pos” column which used to be what we exlcuded in the previous example.
AND Command
The AND command is used whe n you have multiple criteria for including or excluding information. In the code below we continue to exlcude plauers from 1950 and 1951 but then with the AND command we exclude players whose first name starts with an A. Below is the code and output.
SELECT *
FROM Seasons_Stats
WHERE "Year" NOT IN (1950,1951) AND NOT Player LIKE 'A%'
ORDER BY Player
What’s new this time is the use of the LIKE command which is used when you are not exactly sure what you are looking for. THis is way the “A” is in sinle quotes followed by the percent sign. THis tells SQL to exclude anything that starts with an A.
BETWEEN Command
The BETWEEN command is when you are searching to include values that fit a certain range. In the code below, we remove the NOT command and include the BETWEEN and the AND command. THis means that we want to include rows that meet our critria rather than exlcude them.
SELECT *
FROM Seasons_Stats
WHERE Age BETWEEN 41 AND 43
ORDER BY Age
In this code we want any player who is between 41 and 43 years of age. Below is the same output but written with the use of the NOT command
SELECT *
FROM Seasons_Stats
WHERE Age NOT BETWEEN 18 AND 40
ORDER BY Age
Which is better probably comes down to preference
Conclusion
The commands mentioned here are among some of the basic tools anybody who wants to use SQL needs to know. In the future we will deal with other concepts related to SQL
DISTINCT, GROUP BY, HAVING in SQL VIDEO
In this video, we will look at more commands that are used for aggregation in SQL. The three commands we will look at include the DISTINCT, GROUP BY, and the HAVING command. The DISTINCT command pulls unique values, the GROUP BY command allows you to aggregate data by categories, and the HAVING command allows you to filter results as you decide.
GROUP BY, DISTINCT, HAVING Commands in SQL
This post will focus on several SQL commands that are often used for aggregation purposes. These commands are
- DISTINCT
- GROUP BY
- HAVING
The data that will be used as examples is the Seasons_Stats data from Kaggle which is a database of NBA statistics.
DISTINCT
The DISTINCT command allows you to find unique items in a column. As an example, we want to find all of the unique positions in our database. Below is the code and output to do this.
SELECT DISTINCT Pos
from Seasons_Stats
In this dataset we can see that there are more than 5 positions in basketball because some of the players were coded as playing more than one position
GROUP BY
The GROUP BY command allows use to aggregate data by a group or a category in a categorical variable. In the example below we sum the points scored by position and we sort them by descending order. The code and output is below.
SELECT SUM(PTS), Pos
from Seasons_Stats
Group BY Pos
Order by SUM(PTS) DESC
We can see that shooting guards are the most prolific scorers followed by small forwards. By aggregating the points and then grouping them by position we are able to achieve a more nuance insight into our data.
HAVING
The HAVING command is a filter command similar to the WHERE command but the HAIVNG command is used in the context of aggregation. In the example below, we want to filter our previous code so that we only see positions that scored more than 2 million points. If we use a WHERE command we will get an error message as shown below.
SELECT SUM(PTS), Pos
from Seasons_Stats
WHERE SUM(PTS) >= 2000000
Group BY Pos
Order by SUM(PTS) DESC
The WHERE command cannot deal with aggregate data. Below is the same code but with the using of the HAVING command.
SELECT SUM(PTS), Pos
FROM Seasons_Stats
GROUP BY Pos
HAVING SUM(PTS) >= 2000000
ORDER BY SUM(PTS) DESC
By adding the HAVING command and stating we only wanted positions that scored more than 2 million points everything now works.
Conclusion
Hopefully, after learning about this commands, aggregating data is a little easier in SQL. There is always more than one way to do anything when using code. However, at least one way has been explained to deal with the problems that were addressed here.
SQL and CASE WHEN Statements
CASE WHEN statements are the SQL version of If-Then statements. Essentially, SQL will perform certain behaviors based on the the criteria you set at the data analyst. In this post, we will go through several examples of the use of CASE WHEN statements using the NBA_salary database that is available on Kaggle.
Example 1
In the first example, we are going to use a CASE WHEN statement to identify if a basketball player is a shooting guard or not below is the code followed by the output.
SELECT Pos,Tm,
CASE WHEN Pos = 'SG' THEN 'Yes' END AS Shooting_Guard
FROM Seasons_Stats
In line one we SELECT two columns which are Pos and Tm. In the second line is the CASE WHEN statement. We are telling SQL that when Pos = ‘SG’ then place a yes in a new column called Shooting_Guard. You can see in the second row that there is a shooting guard there and this is why there is a yes. All the other rows have a NULL there. In the next example we will learn how to deal with the NULL values.
Example 2
In the second example we will replace the NULL values with the word No. Below is the code and output
SELECT Pos,Tm,
CASE WHEN Pos = 'SG' THEN 'Yes'ELSE 'No' END AS Shooting_Guard
FROM Seasons_Stats
The only change to the code is found in the second line. We added the ELSE command and place the word No in single quotes. This tells SQL that if a player is not a shooting guard then put the word No in the new column.
Example 3
In this example we are going to employ an OR command. As with If-Then statements, CASE WHEN Statements can use AND, OR, IN commands as well. Below we are going to have SQL say yes if a player is a shooting guard or a member of the CHS team.
SELECT Pos,Tm,
CASE WHEN Pos = 'SG' OR Tm = 'CHS' THEN 'Yes'ELSE 'No' END AS Shooting_Guard
FROM Seasons_Stats
The only change is found in line 2 again. We added the OR command and told SQL to say yes if Tm = CHS. This why if you look at the results we have a yes in row 3. Even though this player is a SF SQL still marked yes because he was on the CHS team.
Example 4
In our last example, we are going to make several categories by using several WHEN statements. Specifically, we are going to divide players based on how many points they scored. Below is the code and output.
SELECT PTS,
CASE WHEN PTS > 2500 THEN 'Over 2500'
WHEN PTS > 1500 AND PTS < 2499 THEN 'Over 1500'
WHEN PTS > 500 AND PTS < 1499 THEN 'Over 500'
ELSE 'Less than 500' END AS Pts_groups
FROM Seasons_Stats
Here is an explanation
- We call the PTS column
- We create our CASE WHEN statements
- If a player scored more than 2500 then say ‘Over 2500’
- If a player scored more than 1500 and less then 2499 the say ‘Over 1500’
- If a player scored more than 500 and less then 1499 the say ‘Over 500’
- If a player scored less than 1500 then say ‘Less than 500’
Conclusion
This was just several simple examples of the use of CASE WHEN statements. The purpose of this command is to perform certain actions when certain criteria are set. When this happens powerful things can be done in SQL.
Aggregating Data in SQL
Aggregation is focused on summarizing data in order to develop some insights for taking action. This post will provide some simple ways to aggregate data in SQL. Specifically, we will learn how to use the following commands
- COUNT
- SUM
- MIN
- MAX
- AVG
The data we are using is nba_salary which is available at the Kaggle website.
COUNT
The COUNT command allows you to count the number of rows that contain data in a table. In the code below we are going to count the number of data containing rows in the Seasons_Stats table.
SELECT COUNT(*)
FROM Seasons_Stats
As you can see the output is not much only one cell. In this first example we took into consideration all of the columns in the table. That is why there is an asterisk inside the COUNT function parentheses. However, you can also limit this to one or more columns as shown below.
SELECT COUNT(Player)
FROM Seasons_Stats
In the code above, we limited are query to only the “Player” column in the Table. You can see there is a differece in the count for this example and the previuous one. This indicates that some of the rows do not have data for the Player’s name.
SUM
The sum gives a total for numerical values. For our example, we are going to add up all the points scored by the players in our table.
SELECT SUM(PTS)
FROM Seasons_Stats
In this table, you can see the number of points scored. This command is rather straightforward in how to use it,
MIN and MAX
The MIN and MAX commands do what they say. To make this a little more interesting we are going to use these two commands together to find the range of the data. In the table we are going to use the Age column which shares the age of the player. We are going to do three things in the code below
- calculate the minimum age
- Calculate the maximum age
- Find the difference between the MAX and the MIN to get the range of the age
The code and output are below
SELECT MIN(Age),
MAX(Age),
MAX(Age) - MIN(Age)
FROM Seasons_Stats
AVG
The AVG command calculates the average. this is also rather simple. Therefore, the code is below for the average age of a basketball player.
SELECT AVG(Age)
FROM Seasons_Stats
Conclusion
This post share some simple ways in which aggregation can happening when using SQL. There are naturally other ways that these tools can be used. However, for informative purposes this will provide a foundation for someone who is new to using SQL.
Intro to JOINS in SQL VIDEO
Joining data is perform for many different reasons. Often it is done because two or more tables have information that needs to be in one place for analysis. In the video that is embedded below, we will learn how to do simple JOINS in SQL.
NOT, AND, & BETWEEN Commands in SQL VIDEO
This video explains the NOT, AND, & BETWEEN Commands in SQL. These are commonly used commands in SQL that any data scientist needs to be familiar with. The majority of an analyst’s day is actually spent getting and cleaning data. The video below provides a basic explanation of a basic concept that all analysts need to know.
WHERE, LIKE, IN Commands SQL VIDEO
In this video, ERT will explain how to filter data in SQL. Often, we want results but we don’t want everything the database can provide us. This can be because its too much information but also because getting everything can take a long time with larger database. For these and other reasons we will learn how to use some of the Filtering commands in SQL.
WHERE, LIKE and IN Commands for SQL
This post will explore the use and application of the WHERE, LIKE, and IN commands in SQL.
WHERE
The WHERE command is generally used as a way to filter data in SQL. The database we are using in this post contains data on basketball players from 1950-2017. What we want to do is filter the data so that we only see data from players who played in 2017. In the example below, we will filter our basketball players by year be set to 2017
SELECT * FROM Seasons_Stats WHERE year = 2017
If you look at the year column you can clearly see that the value is set to 2017 just as we wanted it.
The WHERE command is not limited to numbers as it can also be used with text. In the example below, we filter our data with the WHERE command so that we only see players who played for the Golden State Warriors (GSW).
SELECT * FROM Seasons_Stats WHERE Tm = 'GSW'
If you look closely at the “Tm” column you will only see the initials for the Golden State Warriors.
LIKE
The WHERE command is often teamed with the LIKE command when you are looking for a text but are not sure of an exact match. You can specify a pattern you are looking for similar to regular expressions. In the code below we use the LIKE command with the WHERE command searching for any player whose name begins with L.
SELECT * FROM Seasons_Stats WHERE Player LIKE 'L%'
The percentage sign (%) after the letter L tells SQL that anything can be after the letter L in the search and meet the criteria. We can also put the % at the end of or text or put one on both sides are some other combination as shown in the examples below.
In the example below, the last letter must be an L. After this example, is one in which an L must appear anywhere in the name.
SELECT * FROM Seasons_Stats WHERE Player LIKE '%L'
SELECT * FROM Seasons_Stats WHERE Player LIKE '%L%'
IN
The IN command allows you to filter your data based on several values. This command is also combined with the WHERE command. In the example below, we filter our data so that we only see players whose position is small forward or shooting guard.
SELECT *
FROM Seasons_Stats
WHERE Pos in ('SG', 'SF')
You could also insert numerical values in the parentheses when using the IN command.
Conclusion
The command shared in this post provide more information on basic tools you can use and apply in SQL. Filtering using the WHERE, LIKE, and IN commands get helpo you to focus the breath of your research to find answers to your questions.
Intro to SQL VIDEO
Intro to SQL
SQL (structured query language) is a programming language used with databases. A database is essentially a collection of tables that all contain information in a text or numerical form. SQL can be used for developing and maintaining databases for various purposes. Our goal is not to learn how to become database administrators. Instead, we are going to focus on using SQL for data analysis.
The data used in this example is NBA Salary and Statistics 2016-17 (SQLite version), a database file available at Kaggle. This database contains information on basketball statistics from 1950-2017. There is more that could be said, but for simplicity, we will learn as we go along.
The software used to explore the database in these examples is DBeaver, a free, open-source software database manager. You can run SQL from a terminal, but it is a little bit easier for beginners to have a GUI interface most of the time. If you want to use this database, you can download it and connect this database to whatever database manager you are using.
In this post, we will focus on using the following commands.
- SELECT
- FROM
- LIMIT
- ORDER BY
SELECT and FROM
The SELECT and FROM commands almost always work together at the beginning of an SQL query. The SELECT command tables which column(s) you want SQL to pull from a table. The FROM command indicates which table you want to pull data from. Below is code and a screenshot of using these two commands together to pull data from the database.
SELECT *
FROM Seasons_Stats
This is what the output looks like in DBeaver. What you see is different. The asterisk after SELECT tells SQL to pull all columns in the table. You can also see a combination of text and numeric data in the columns we can see.
LIMIT
The next command that we will learn is the LIMIT command. This command limits how many rows of data are returned for an output. The code and output are below.
SELECT *
FROM Seasons_Stats
LIMIT 5
By using the LIMIT command, only five rows of data were pulled. How many rows you want is up to you, but this example was set to 5.
ORDER BY
The ORDER BY command allows you to sort the data from highest to lowest or vice versa based on one or more columns in the table. In the example below, we will sort the data by the age column descending to determine who are some of the oldest players to play in the NBA since 1950, which is the first year data is available in this data set. Also, note that only the first five rows are listed because of the LIMIT command. The code and output is below.
SELECT *
FROM Seasons_Stats
order by Age DESC
LIMIT 5
From the output, it appears that Kevin Willis is the oldest player to play in the NBA since 1950. The DESC argument next to ORDER BY sorts the results from highest to lowest.
Conclusion
There is much more to learn about SQL than what was shared here. However, the most important thing to know whenever doing a data analyst is to make sure you know what you want to know. This involves developing straightforward questions that can be answered with available tools.
