CIT 111: Introduction to Databases

Prepare 5: Joining and Summarizing Data

This week you will...

Instructions

This material will help you build a foundation for this week's topic. Test your understanding by taking this week's prepare quiz in ICanvas. The quiz is open book and open note. You may retake the quiz up to 3 times to improve your score.

Terms to Know

JOIN:
A clause used to combine rows from two or more tables based on a related column between them.
INNER JOIN: 
The most common join that combines two tables and only returns the rows that have matching values in both tables.
ON:
A clause that specifies the relationship between the two tables, which are the columns that each table has in common.
OUTER JOIN:
 There are two types of outer joins. LEFT OUTER JOIN (or LEFT JOIN) and RIGHT OUTER JOIN or (RIGHT JOIN) Returns all rows of one table and just the matching rows from another table.
LEFT JOIN:
An outer join that returns every row of the first table, or table on the left, in the query syntax and only the matching rows of the other table.
RIGHT JOIN:
An outer join that returns every row of the second table, or table on the right, in the query syntax and only the matching rows of the other table.
FULL OUTER JOIN:
A join that returns rows from both tables, even if there is not a related key. In MySQL you can use the UNION operator to combine multiple result sets of SELECT statements to simulate it.
Summary Query:
 Used to summarize the contents of a table. These queries contain one or more aggregate functions. (Also called Group-By queries)
Aggregate Function:
A function that operates on a set of rows to calculate and return a single value. AVG, SUM and COUNT are examples of aggregate functions.
GROUP BY:
A clause used when you have an aggregate function that you want to use on a group of data.
HAVING:
A clause that works with aggregate functions and allows you to filter the groups.
ROLLUP:
An operator that can be used with grouping and aggregates, allowing you to add one or more summary rows to your results.

Watch:

(12:42 mins, "Joins" Transcript

Retrieving Data from Multiple Tables

Because a relational database consists of many tables linked with related columns, each individual table is usually incomplete for most business purposes. Different table's data usually needs to be combined to give complete results for business purposes. In order to retrieve data from more than one table you use a JOIN. The data that had been split apart during normalization can be joined back together in one result set. The JOIN is based on the related column between the tables. A table is related to another table with a foreign key. It is usually with this relationship of the primary key of one table and the foreign key of the other table, which are the matching columns that are used with the JOIN operator.

There are different types of JOINS:

INNER JOIN       
    Returns rows that have matching values in both tables	
LEFT JOIN        
    Returns all rows from the left table, and matched rows from the right table
RIGHT JOIN       
    Returns all rows from the right table, and matching rows from the left table
FULL OUTER JOIN  
    Returns all rows from both tables, even if there is not a related key

Let's take a look at the ERD and the same 3 tables with data.

photo of the erd for the magazine database
Magazine database ERD
magKey magName magPrice
1 Fishing in the Mojave 13.95
2 Car Racing Made Easy 15.45
3 Pine Cone Computing 17.50
4 Cooking Like Mad 18.00
5 If Only I Could Sing 12.45
6 Beautiful Birds 12.45
7 Corn Shucking for Fun and Profit 15.05
8 MySQL Magic 10.95
subKey magKey scribKey subStartDate SubLength
1 1 1 2011-03-01 12
2 2 2 2011-03-01 14
3 6 3 2012-02-01 12
4 6 5 2012-02-01 12
5 4 3 2011-09-01 12
6 7 5 2012-07-01 24
7 7 4 2012-08-01 12
8 1 3 2011-05-01 12
9 1 4 2011-09-01 12
10 5 3 2011-12-01 12
11 3 3 2011-05-01 18
scribKey lastName firstName address city state zip
1 Johnston Julie 10336 NE 187th St Bothell WA 98012
2 Anderson Albert 220 K Street Southeast Auburn WA 98002
3 Sanders Samantha 316 Union Ave Snohomish WA 98290
4 Jimenez Jose 187 27th Ave Seattle WA 98122
5 Lamont Lucy 175 Smokey Point Dr Lakewood WA 98409
6 Wong Walter 1073 South 323rd Street Federal Way WA 98003

Using single table queries, you could get information like a list of the magazine subscribers or a list of the magazines. But to get useful information like all the subscribers of a certain magazine, you would need to use a JOIN clause.

INNER JOIN

The most common way to join tables is with an INNER JOIN. There are different ways to write the syntax to join tables but we will be using the explicit syntax, which is considered the 'best practice' way to write the syntax for a join.

SELECT select_list
   FROM table_1
       JOIN table_2
           ON join_condition_1
       JOIN table_3
           ON join_condition_2 [. . .]

To create the join you specify all the tables that will have data that will be included in the result set or the filtering condition and specify how they are related to each other. Nothing in the database table definitions can instruct MySQL how to join the tables; we have to do that ourselves.

Lets join two of the tables together. Say we need to have a list of the subscribers names and subscription start dates that belong to each person. Some people have more than one magazine that they subscribe to, so there might be people listed more than once with different subscription start dates.

SELECT lastName, firstName, subStartDate

We know that for this query's SELECT we will have columns from more than one table. The first and last name from the subscriber table and the subscription start date from the subscription table. We don't have a filter, or WHERE clause in this case, but if we did, we'd also want to take in to consideration what table column was being used there as well. But, in our case, we know we need to join the subscriber and the subscription tables together.

With an INNER JOIN only people who actually have a subscription, and only subscription start dates that correspond to subscribers will show up. Remember: INNER JOINS will only return rows that have matching values in both tables.

The INNER JOIN keyword follows the first table and specifies which table you are joining to the first table. Then the ON clause specifies the relationship between the two tables; which are the columns that each table has in common. In our case the scribKey as the primary key from the subscriber table and the scribKey as the foreign key from the subscription table.

FROM subscriber 
    INNER JOIN subscription
        ON subscriber.scribKey = subscription.scribKey

Because, in our case, the primary and foreign key have the exact same name, we must qualify the column name by preceding it with the table name and a period. Each of the two tables is represented, this is important because MySQL needs to know from which table the column name came from.

Here is the whole query with an ORDER BY to sort by last name.

SELECT lastName, firstName, subStartDate
FROM subscriber 
    INNER JOIN subscription
        ON subscriber.subKey = subscription.subKey
ORDER BY lastName;
lastName firstName subStartDate
Anderson Albert 2011-03-01
Jimenez Jose 2012-08-01
Jimenez Jose 2011-09-01
Johnston Julie 2011-03-01
Lamont Lucy 2012-07-01
Lamont Lucy 2012-02-01
Sanders Samantha 2011-05-01
Sanders Samantha 2011-12-01
Sanders Samantha 2012-02-01
Sanders Samantha 2011-05-01
Sanders Samantha 2011-09-01

Now we have a result set with data from two different tables. We can see that most of our subscribers have a subscription to more than one magazine and therefore have a number of different start dates.

Let's look at another INNER JOIN. In this example we will be joining all 3 tables. We will be using table aliases for each table that we can then use to qualify the tables in the ON clause. 'sr' to represent 'subscriber', 'sn' to represent 'subscription', and m to represent magazine. You can use the keyword AS before each alias, but it is optional.

Once you give a table an alias you must refer to that alias throughout the query if you need to qualify any column names.

We will also be using just the keyword JOIN instead of INNER JOIN, they keyword INNER is optional. Both JOIN and INNER JOIN mean exactly the same thing. When you use the keyword JOIN it is assumed that you are using INNER JOIN.

This query will get a list of people who subscribe to the magazine called 'Beautiful Birds'.

SELECT lastName Last_Name, firstName First_Name
FROM subscriber sr
    JOIN subscription sn
        ON sr.subKey = sn.subKey
    JOIN magazine m
        ON sn.magKey = m.magKey
WHERE magName = 'Beautiful Birds' 
ORDER BY last_Name;
Last_Name First_Name
Lamont Lucy
Sanders Samantha

We can see that we want the first and last names again, which is only from one table but in order to see who has subscribed to 'Beautiful Birds' we have to include a WHERE clause that uses magName from the magazine table. So we are using data from multiple tables. Because the subscriber table that lists the people and the magazine table that lists the magazines do not have a common key that related the two we will have to use the linking subscription table in our join as well. So we join the subscriber table to the subscription table using the related column of subKey. Then we join the magazine table to the subscription table using the related column of magKey.

Order of Execution

Let's take a minute to look at the Order of Execution with joins. Remember: The order that we write the syntax of our queries is not the order that the clauses actually execute. All of the SQL clauses have complex interdependencies on each other, and they need to execute in a certain order to behave properly. The execution order is:

1. FROM and JOIN
2. WHERE
3. SELECT
4. ORDER BY

Using the last query we looked at, the FROM is the first thing that executes. Before anything else all the rows from all the tables are retrieved. This gets us the total working set of data that is being queried. JOIN is an operator of FROM, not an independent clause so it executes at the same time as FROM.

scribKey lastName firstName address city state zip subKey magKey scribKey subStartDate subLength magKey magName magPrice
1 Johnston Julie 10336 NE 187th St Bothell WA 98012 1 1 1 2011-03-01 12 1 Fishing in the Mojave 13.95
3 Sanders Samantha 316 Union Ave Snohomish WA 98290 8 1 3 2011-05-01 12 1 Fishing in the Mojave 13.95
4 Jimenez Jose 187 27th Ave Seattle WA 98122 9 1 4 2011-09-01 12 1 Fishing in the Mojave 13.95
2 Anderson Albert 220 K Street Southeast Auburn WA 98002 2 2 2 2011-03-01 14 2 Car Racing Made Easy 15.45
3 Sanders Samantha 316 Union Ave Snohomish WA 98290 11 3 3 2011-05-01 18 3 Pine Cone Computing 17.50
3 Sanders Samantha 316 Union Ave Snohomish WA 98290 5 4 3 2011-09-01 12 4 Cooking Like Mad 18.00
3 Sanders Samantha 316 Union Ave Snohomish WA 98290 10 5 3 2011-12-01 12 5 If Only I Could Sing 12.45
3 Sanders Samantha 316 Union Ave Snohomish WA 98290 3 6 3 2012-02-01 12 6 Beautiful Birds 12.45
5 Lamont Lucy 175 Smokey Point Dr Lakewood WA 98409 4 6 5 2012-02-01 12 6 Beautiful Birds 12.45
5 Lamont Lucy 175 Smokey Point Dr Lakewood WA 98409 6 7 5 2012-07-01 24 7 Corn Shucking for Fun and Profit 15.05
4 Jimenez Jose 187 27th Ave Seattle WA 98122 7 7 4 2012-08-01 12 7 Corn Shucking for Fun and Profit 15.05

So now that we have our set of data, let's filter it down to exactly what we want. That's done with WHERE next. With WHERE we can throw away the data we don't need by setting up conditions on what it is exactly we want or don't want. The WHERE then narrows down the set of data.

scribKey lastName firstName address city state zip subKey magKey subKey subStartDate subLength magKey magName magPrice
3 Sanders Samantha 316 Union Ave Snohomish WA 98290 3 6 3 2012-02-01 12 6 Beautiful Birds 12.45
5 Lamont Lucy 175 Smokey Point Dr Lakewood WA 98409 4 6 5 2012-02-01 12 6 Beautiful Birds 12.45

The data is filtered but all columns included at this point. SELECT now executes and we can specify which columns we want to show up.

Last_Name First_Name
Sanders Samantha
Lamont Lucy

Then the ORDER BY executes and sorts the results

Last_Name First_Name
Lamont Lucy
Sanders Samantha
OUTER JOIN

Another type of join is the OUTER JOIN. This type of join returns all rows from one table and just the matched rows from another table.

Let's take a look at the ERD and the same 3 tables with data again.

photo of the erd for the magazine database
Magazine database ERD
magKey magName magPrice
1 Fishing in the Mojave 13.95
2 Car Racing Made Easy 15.45
3 Pine Cone Computing 17.50
4 Cooking Like Mad 18.00
5 If Only I Could Sing 12.45
6 Beautiful Birds 12.45
7 Corn Shucking for Fun and Profit 15.05
8 MySQL Magic 10.95
subKey magKey scribKey subStartDate SubLength
1 1 1 2011-03-01 12
2 2 2 2011-03-01 14
3 6 3 2012-02-01 12
4 6 5 2012-02-01 12
5 4 3 2011-09-01 12
6 7 5 2012-07-01 24
7 7 4 2012-08-01 12
8 1 3 2011-05-01 12
9 1 4 2011-09-01 12
10 5 3 2011-12-01 12
11 3 3 2011-05-01 18
scribKey lastName firstName address city state zip
1 Johnston Julie 10336 NE 187th St Bothell WA 98012
2 Anderson Albert 220 K Street Southeast Auburn WA 98002
3 Sanders Samantha 316 Union Ave Snohomish WA 98290
4 Jimenez Jose 187 27th Ave Seattle WA 98122
5 Lamont Lucy 175 Smokey Point Dr Lakewood WA 98409
6 Wong Walter 1073 South 323rd Street Federal Way WA 98003

Notice there is one subscriber 'Walter Wong' that is not subscribed to any magazine yet. There is also one magazine 'MySQL Magic' that has no subscribers yet. When we do an inner join as we did above, Walter never showed up because there was no matching foreign key that matched his primary scribKey. In other words, there was no scribKey 6 in the linking table as a foreign key. What if we wanted to show all subscribers last and first name and their subscription start dates even if that subscriber hadn't subscribed to a magazine yet. We would have to use an OUTER JOIN.

SELECT lastName, firstName, subStartDate
FROM subscriber 
    LEFT OUTER JOIN subscription
        ON subscriber.scribKey = subscription.scribKey
ORDER BY lastName;
lastName firstName subStartDate
Anderson Albert 2011-03-01
Jimenez Jose 2012-08-01
Jimenez Jose 2011-09-01
Johnston Julie 2011-03-01
Lamont Lucy 2012-07-01
Lamont Lucy 2012-02-01
Sanders Samantha 2011-05-01
Sanders Samantha 2011-12-01
Sanders Samantha 2012-02-01
Sanders Samantha 2011-05-01
Sanders Samantha 2011-09-01
Wong Walter NULL

The keywords LEFT OUTER JOIN specified that all records from the table on the left would be returned and only the matching rows from the other table are returned. Because Walter Wong was from the subscriber table (the table on the left in the query) he showed up. But notice that the subscription start date is null (there is no foreign key in the subscription table and therefore no subscription start date. So all the rows are returned from the subscriber table and only the matching rows of the subscription table.

RIGHT JOIN is used the same way but refers to the table on the right as the table that will have all the rows in the result set. So the query above could be rewritten as follows and get the same results.

SELECT lastName, firstName, subStartDate
FROM subscription 
    RIGHT OUTER JOIN subscriber
        ON subscriber.scribKey = subscription.scribKey
ORDER BY lastName;

Usually a LEFT JOIN is used placing the table you want all the results from first (or on the left) so RIGHT JOIN doesn't have to ever be used.

Let's look at a join that uses all 3 tables.

If we wanted all the magazines and their associated subscribers, even if they didn't have any subscribers, we'd use an outer join. This time we will use the keywords LEFT JOIN. The keyword OUTER is optional. Using just LEFT JOIN will be exactly the same.

SELECT magName, magPrice, lastName, firstName
FROM magazine 
    LEFT JOIN subscription
        ON magazine.magKey = subscription.magKey
    LEFT JOIN subscriber
        ON subscription.scribKey = subscriber.scribKey
ORDER BY lastName;
magName magPrice lastName firstName
MySQL Magic 10.95 null null
Car Racing Made Easy 15.45 Anderson Albert
Fishing in the Mojave 13.95 Jimenez Jose
Corn Shucking for Fun and Profit 15.05 Jimenez Jose
Fishing in the Mojave 13.95 Johnston Julie
Beautiful Birds 12.45 Lamont Lucy
Corn Shucking for Fun and Profit 15.05 Lamont Lucy
Cooking Like Mad 18.00 Sanders Samantha
If Only I Could Sing 12.45 Sanders Samantha
Beautiful Birds 12.45 Sanders Samantha
Fishing in the Mojave 13.95 Sanders Samantha
Pine Cone Computing 17.50 Sanders Samantha

Now all the rows from the magazine table show up even if the magazine has no subscriptions associate with it. 'MySQL Magic' is a magazine that has no associated foreign key in another table. Therefore it shows null for both the last and first name, since no one has subscribed to it yet.

FULL OUTER JOIN

Full outer joins are not very common but worth mentioning here. A full outer join will return all rows from both tables, even if there is not a related key. So if we wanted all magazines and all subscribers even if they didn't have related records we would use the keyword UNION to combine the rows of two different results sets together to simulate a full outer join. Both result sets must return the same number of columns and the corresponding columns must have compatible data types.

SELECT magName, magPrice, lastName, firstName, subStartDate
FROM magazine m 
    LEFT JOIN subscription sn
        ON m.magKey = sn.magKey
    LEFT JOIN subscriber sr
        ON sn.subKey = sr.subKey
UNION
SELECT magName, magPrice, lastName, firstName, subStartDate
FROM magazine m 
    RIGHT JOIN subscription sn
        ON m.magazineKey = sn.magazineKey
    RIGHT JOIN subscriber sr
        ON sn.subscriberKey = sr.subscriberKey;
magName magPrice lastName firstName subStartDate
Fishing in the Mojave 13.95 Johnston Julie 2011-03-01
Fishing in the Mojave 13.95 Sanders Samantha 2011-05-01
Fishing in the Mojave 13.95 Jimenez Jose 2011-09-01
Car Racing Made Easy 15.45 Anderson Albert 2011-03-01
Pine Cone Computing 17.50 Sanders Samantha 2011-05-01
Cooking Like Mad 18.00 Sanders Samantha 2011-09-01
If Only I Could Sing 12.45 Sanders Samantha 2011-12-01
Beautiful Birds 12.45 Sanders Samantha 2012-02-01
Beautiful Birds 12.45 Lamont Lucy 2012-02-01
Corn Shucking for Fun and Profit 15.05 Lamont Lucy 2012-07-01
Corn Shucking for Fun and Profit 15.05 Jimenez Jose 2012-08-01
MySQL Magic 10.95 null null null
null null Wong Walter null

Notice how MySQL Magic shows up even though no one has subscribed to it and how Walter Wong shows up even though he has not subscribed to any magazines. Again, full outer joins are rare but might be used with something like exception reports, which are used to discover abnormal data that might need attention.

Summarizing Data

Summary Queries

Summary queries are queries that contain one or more aggregate functions.

Aggregate Functions

Aggregate functions operate on a set of rows to calculate and return a single value. For example, determining how many rows are returned; adding up values; or finding the highest, lowest, or average values in a column or group of the result set.

Common SQL Aggregate Functions
AVG() Returns a column's average value
SUM() Returns the sum of a column's values
MAX() Returns a column's highest value
MIN() Returns a column's lowest value
COUNT() Returns the number of rows in a column
The AVG() Function Syntax:

AVG([ALL| DISTINCT] expression)

SELECT AVG(salary) AS 'Average Salary'
FROM salaries
Average Salary
64238.8072
THE SUM() Function Syntax:

SUM([ALL| DISTINCT] expression)

SELECT SUM(salary) AS 'Total of all Salaries'
FROM salaries
Total of all Salaries
609497803
The MAX() Function Syntax:

MAX([ALL| DISTINCT] expression)

THE MIN() Function Syntax:

MIN([ALL| DISTINCT] expression)

SELECT MAX(salary) AS 'Highest Salary', MIN(salary) As 'Lowest Salary'
FROM salaries
Highest Salary Lowest Salary
136004 39264
THE COUNT() Function Syntax:

COUNT([ALL| DISTINCT] expression)

SELECT COUNT(*) AS 'Number of Employees'
FROM employees
Number of Employees
9488

The ALL keyword is assumed with aggregate functions. You don't have to include it. This means all rows are included in the aggregate function, which is the default.

SELECT AVG(ALL prod_price) AS 'Average Price'
FROM products

All product prices are included in the average calculation.

You can use the DISTINCT keyword instead to include only unique values.

SELECT AVG(DISTINCT prod_price) AS 'Average Price'
FROM products

So if there are multiple products with the same price, that price will only be included once in the average calculation.

Creating Groups

In previous examples the aggregate functions summarized data across all rows in the given table. Creating groups allows you to summarize subsets of table contents.

We will use the GROUP BY clause to do this. The GROUP BY clause is used when you have an aggregate function that you want to use on the grouped data.


SELECT statement with GROUP BY and HAVING clause syntax:

SELECT select_list
FROM table_source
[WHERE search_condition]
[GROUP BY group-by_list]
[HAVING search_condition]
[ORDER BY order_by_list]

SELECT vendor_id, COUNT(*) AS 'Number of Products'
FROM products 
GROUP BY vendor_id

The GROUP BY clause sorts the data and groups by vendors, then the COUNT function is used to count the number of products for each vendor. So you'd have a count once per vendor rather than for the entire table.

vendor_id Number of Products
021 15
044 23
121 31

Filtering Groups

The HAVING clause also works with aggregate functions and allows you to filter the groups (or in other words specify which groups to include or not include). It works a lot like the WHERE clause but the WHERE clause filters rows and the HAVING clause filters groups.

SELECT vendor_id, COUNT(*) AS 'Number of Products'
FROM products
GROUP BY vendor_id
HAVING COUNT(*) > 20
vendor_id Number of Products
044 23
121 31

The HAVING clause filters the vendors after they are grouped and shows only the vendors with the number of products over 20.

You can use WHERE and HAVING together. If you want to only include product prices greater than $10 and then show the number of products that result grouped by vendors for those vendors who have more than 20 products.

SELECT vendor_id, COUNT(*) AS 'Number of Products'
FROM products
WHERE product_price > 10
GROUP BY vendor_id
HAVING COUNT(*)  > 20
vendor_id Number of Products
121 25

The ROLLUP operator can be used with grouping and aggregates allowing you to add one or more summary rows to your results. A summary row will show up for every group you have.

SELECT vendor_id, SUM(product_price) AS 'Total of Product Prices'
FROM products
WHERE vendor_id < 15
GROUP BY vendor_id WITH ROLLUP
vendor_id Total of Product Prices
010 121.87
011 242.21
012 450.43
013 312.12
014 576.94
NULL 1703.57

The summary row shows up as the bottom row and totals all the totals into a grand total.

SELECT Clause Syntax Order

The order of how you must write the code.

SELECT

FROM

WHERE

GROUP BY

HAVING

ORDER BY

LIMIT

SELECT Order of Execution

The written order, or the syntax order of how we write a query differs from the order of how the query is actually executed. It's important to understand the order of execution so you know what results are accessible where.

  1. FROM
  2. WHERE
  3. GROUP BY
  4. HAVING
  5. SELECT
  6. ORDER BY
  7. LIMIT

The data from the entire table(s) is retrieved, then that is filtered by the WHERE criteria. Then that resulting data is grouped and filtered by the HAVING criteria. The columns, calculations or functions that will show up are determined in SELECT. Then that resulting data is sorted and then limited.

Submission

Take the Week 5 Prepare quiz in Canvas

You may use this quizlet to help you study the terms.

Useful Links: