THE PERSONAL PORTFOLIO WEBSITE OF MGA

subqueryimage

SUBQUERIES: A lengthy primer

-- INPUT PARAMETERS
DECLARE @FCT_DT DATE		= '5/6/2016'
DECLARE @table  VARCHAR(50)	= 'Person.Person' 

DECLARE @strFactDate varchar(8) = convert(varchar, @FCT_DT ,112)
DECLARE @newtable NVARCHAR(MAX) = @table + '_' + @strFactDate
DECLARE @sql  NVARCHAR(MAX)
DECLARE @cols NVARCHAR(MAX) = N''
DECLARE @debug INT = 1

IF @debug = 0
BEGIN
  IF OBJECT_ID(@newtable) IS NOT NULL 
  DROP TABLE Person.Person_20160506
END

SELECT @cols += N', [' + name + '] '+  system_type_name + case is_nullable when 1 then ' NULL' else ' NOT NULL' end
FROM  sys.dm_exec_describe_first_result_set('SELECT * FROM ' + @table, NULL, 1)

SET @cols = STUFF(@cols, 1, 2, N'');

SET @sql = N'CREATE TABLE '+ @newtable + '(' + @cols + ') '  -- newtable = Person.Person_20160506

IF NOT EXISTS (SELECT 1 FROM sys.tables where object_name(object_id) = @newtable)
  BEGIN
        
    IF @DEBUG = 1 
      BEGIN 
        PRINT @sql
      END
        ELSE 
      BEGIN 
        EXEC sp_executesql @sql
      END 
  END

    Select @newtable as NewTable,
    @strFactDate as PPNDate

 

SELECT * FROM Table
WHERE RN>2

 

 

What exactly are Subqueries?

  • Subqueries allow you to execute two or more SELECT queries at the same time while returning one result set.
  • There two main types of subqueries: Nested Scalar and Correlated.

Rules to writing any subquery:

      • A subquery is contained in parentheses.
  • Subqueries are separated by a comma like any other column.
  • Subqueries are presented as an alias.

Nested Scalar Subqueries:

  • Nested scalar subqueries return a single row and column in their result. NULL is returned if the subquery returns no value.
  • Example using the Northwind sample database: Here, we are to find out all orders that were placed on the last day an order was recorded. So how do we accomplish this? Well, first we need to know the date the last order was recorded, then we have to compare the order date of each order to this last order date. Here is the query:
USE Northwind;

SELECT * 
FROM orders
WHERE orderdate=
(--This query within the parentheses is a subquery
    SELECT MAX(orderdate) --This subquery will return only one value
    FROM orders
);

 

  • The above subquery used the a WHERE clause along with SELECT, but it’s also possible to find meaningful nested scalar subqueries off a SELECT list via expressions. Expressions are formulas or mathematical statements( ex. product.price * product.quantity). If the operator (such as multiplication) expects only one value, then be sure the subquery returns only one value. Here is an example using the AdventureWorks database:   
  • It is certainly possible to create multiple subqueries. The following is an example of using two subqueries using the AdventureWorks database:
USE AdventureWorks2016CTP3;

SELECT product id
color,
listprice,
(
SELECT AVG(listprice)
FROM Production.Product
) AS AverageListPrice
(
SELECT AVG(weight)
FROM Production.Product
) AS AverageWeight
FROM Production.Products; 
  • Nested subqueries can also be used with the IN predicate. Predicated are expressions that return TRUE, FALSE, or UNKNOWN.
  • Using IN is a more elegant substitute to using a bunch of OR predicates.
  • For example, suppose we are tasked with finding the number of transactions attributed to customers from the U.K. on the Northwind database. The first step is to list all customerIDs from the U.K. , and then find the number of matches using the IN query. Here is the query that accomplishes this:

USE NorthWind;

SELECT COUNT(*) 
FROM orders
WHERE customerID IN
(
    SELECT customerID
    FROM customers
    WHERE country = 'UK'
);

 

  • Using the IN predicate as a subquery: Predicates are expressions that return TRUE, FALSE, or UNKNOWN. The IN predicate is a more elegant alternative to a series of OR predicates. The following example, let’s say we were given a task to find the number of transactions attributed to customers from the U.K. The first step is to find all CustomerIDs from the U.K. , and then use the IN predicate to find the number of matches. Here is how the query would go:

USE NorthWind;

SELECT COUNT(*)
FROM Orders
WHERE CustomerID IN
(
    SELECT customerID
    FROM customers
    WHERE country = 'UK'
);

The above results of the above query can also be obtained using an equivalent join, as shown here:


USE NorthWind;

SELECT COUNT(*)
FROM Orders o
     JOIN Customers c on o.customerid=c.customerid
WHERE c.country='UK';

Both these queries have exactly the same performance in SQL Server because the optimizer generates the same query plan for each.

  • The opposite to the IN predicate is NOT IN. Replacing IN with NOT IN for the above IN query will return the number of all transactions where Customer.Country is NOT IN UK. It’s important to be careful of using the NOT IN predicate when the inner query contains NULL values. IN can be written as a bunch of OR x=y clauses, whereas NOT IN applies to the entire IN predicate, in other words into AND x<>y clauses. If any y is NULL, then the entire NOT IN

    predicate is UNKNOWN, because x<>NULL returns UNKNOWN. For this reason, it is quintessential to filter out the NULLS inside the inner query when using NOT IN. The functions ISNULL() and COALESCE() can be used in such cases.

     

Correlated Subqueries:

  • In correlated subqueries, the inner “SELECT” query is dependent on the values of the outer “SELECT.”
  • Using Correlated Subqueries in the WHERE clause:Let’s go through an example of correlated subqueries using the Northwind database. Let’s assume we’re given the task of finding the orders that included more than two dozen orders of product 17. The inner SELECT will retrieve records where ProductID = 17, while the outer query correlates to the Order table through the OrderID. Here’s the actual query:

USE NorthWind;

SELECT o.*
FROM Orders o
WHERE 24 &lt;
(
    SELECT od.quantity
    FROM [Order Details] od
    WHERE od.productid = 17
          AND od.orderid=o.orderid -- here is the correlation
);

 

  • Using Correlated subqueries in the SELECT statement:A correlated subquery (where the inner subquery is dependent on the outer query) can just as easily appear using the SELECT clause. For instance, suppose we’re using the Northwind database and we’re seeking to determine the productID, price, and the difference between price and average price for the same category. Here’s how the query would go:

USE NorthWind;

SELECT p.productid
       p.unitprice
       p.unitprice -
(   
    SELECT AVG(unitprice)
    FROM Products a
    WHERE a.categoryid = p.categoryid
    GROUP BY a.categoryid
) AS Difference
FROM products p;

 

  • Using Correlated Subqueries in the HAVING predicate: HAVING is the equivalent of the WHERE clause but only in GROUP BY situations or the result of an aggregate function. Suppose we’re using the Northwind database, and are tasked with finding the orders where the quantity ordered for a particular product exceeds three times the average order for the product. Here’s how that query would go:

USE NorthWind;

SELECT od1.productid
       od1.orderid
FROM [Order Details] AS od1
GROUP BY od1.productid
         od1.orderid
HAVING SUM(od1.quantity) &gt; 3*
(
   SELECT AVG(od2.quantity)
   FROM [Order Details] AS od2
   WHERE od1.productid=od2.productid

 

  • Using the EXISTS predicate:  The EXISTS predicate allows us to return a row if a given condition exists. Using the Northwind database, let’s suppose product 63 has been recalled and we need the contact information of all customers who have ordered the product. Here is how to run the query:

USE NorthWind;

SELECT c.contactname
From Customers AS C
WHERE EXISTS
(
    SELECT *
    FROM Orders AS o
         JOIN [Order Details] AS od on od.orderid=o.orderid
    WHERE od.productid = 64
          AND o.customerid = c.customerid
);

 

    • By using an EXISTS predicate, SQL server has to do less work compared to when using a COUNT (*)> 0 predicate. Once one row has been found to satisfy the EXISTS predicate, there’s no need to search any further. The EXISTS version of the query is less resource-intensive than the COUNT(*) version. Failure to use EXISTS instead of COUNT(*) >0 on real-world databases (having tens of thousands to millions rows of data) could take hours to process the query.

      • Using the NOT EXIST predicate: This is used to return rows that do not match the query. For example, to find all rows of customers from Customer table on Northwind not existing in the Orders table, this would be the query:

USE NorthWind;

SELECT c.*
FROM Customers AS c
WHERE NOT EXISTS
( 
    SELECT *
    FROM orders AS o
    WHERE o.customerid=c.customerid
);

The above query can also be accomplished using the NOT IN predicate, as shown here:


USE NorthWind;

SELECT c.*
FROM Customers 
WHERE customerid NOT IN
( 
    SELECT customerid
    FROM orders
);

USE NorthWind;

SELECT c.*
FROM Customers AS c

WHERE NOT EXISTS

( 
    
    SELECT *
    
    FROM orders AS o
    
    WHERE o.customerid=c.customerid

);

USE NorthWind;

SELECT c.*
FROM Customers 
WHERE customerid NOT IN
( 
    SELECT customerid
    FROM orders
);

While the Query Analyzer reports the query cost to about the same for the NOT EXISTS and NOT IN queries, the I/O statistics are doubled for the Orders table in the NOT IN version. The Profiler reports a tripling of the duration of the query, and the total reads goes up by a factor of 15. (Show proof here)

      • Handling Data Feeds:  Sometimes we are tasked with updating or inserting depending on where data in the existing table already exists. Using the NOT EXISTS predicate, we can insert new row while not violating referential integrity. For example, let’s imagine we have a feed of car license-plate numbers and the owner names. We want to update the owner if the car exists, and add the car license-plate and owner if it doesn’t exist. Here we can use the at least one UPDATE and one INSERT statement, and the INSERT statement adds only the new cars. Here’s how we would run such a query:

--Update the existing cars
UPDATE C
  SET
     owner=f.owner
FROM feed AS f
     JOIN cars AS c ON f.license=c.licence

--Insert the new cars
INSERT INTO Cars
       SELECT licence,
              owner
       FROM feed AS f 
       WHERE NOT EXISTS
       (
           SELECT *
           FROM cars AS c
           WHERE c.license=f.license
        );
      • Using Quantified Predicates: Quantified predicated use comparison operators (=, <>, !=, >, >=, <=) to compare the outer query values to the inner query values. There two types of quantified predicates: ANY (or SOME) and ALL.
        • Using the ANY (or SOME) predicate: The ANY (or SOME) predicate returns TRUE if any value in the subquery satisfied the comparison operation, or FALSE if none of the values satisfy the comparison operation or if there are no rows in the subquery. Suppose we’re working on the pubs database, and we’re tasked with finding all the authors who wrote the book with title ID of BU1032. We would first have to determine the author IDs for the book from the titleauthor table, and then compare these to the author IDs in the authors table. This is how the query would go:

USE pubs;

SELECT *
FROM authors
WHERE au_id= ANY
(
    SELECT au_id
    FROM titleauthor
    WHERE title_id = 'BU1032'
);

This type of subquery is essentially a set of values to which we are comparing the value from the outer query to see if it is a member of the set.

The ANY predicate can be substituted for the equivalent IN predicate. The same rules regarding the handling of NULLS apply. We can also flatten this into a join.

      • Using the ALL predicate: The ALL predicate works similarly to the ANY predicate, except that it’s more restrictive in that all of the values in the subquery must satisfy the comparison condition. For example, let’s say we’re using the pubs database, and we’re tasked with finding all books that were co-authored. In other words, the titleauthor table for a given book must return more than 1 row. This is how the query would go:

USE pubs;

SELECT *
FROM title as t
WHERE 1&lt; ALL
(
   SELECT COUNT(*)
   FROM titleauthor AS ta
   WHERE ta.title_id=t.title_id
);
      • Comparing ALL and MAX(): MAX() is used when we wish to find the highest value of something. The ALL predicate can accomplish the same thing. Suppose we’re using the Northwind database, and we’re tasked with finding all orders that were shipped on the most recent shipping date. We will use a nested subquery where the inner query finds the last shipping date, and the outer query is used to compare shipping dates of all orders with the result of the inner query. This is how the query would go: 

USE Northwind;

SELECT *
FROM orders
WHERE shippeddate =
(
    SELECT MAX(shippeddate)
    FROM orders
);

Using the ALL predicate:


USE Northwind;

SELECT *
FROM orders
WHERE shippeddate &gt;= ALL
(
    SELECT shippeddate
    FROM orders
);

As you can see, all literally means all! The Max() aggregate function discards NULLs and returns the most recent non-NULL ShippedDate. The ALL predicate compares all values of ShippedDate in the outer query to all of the values of ShippedDate in the inner query. Unfortunately here, ALL literally means “all.” All values of ShippedDate in the outer query must be greater than or equal to (>=) all values of ShippedDate in the inner query. So, if there is at least one NULL in the ShippedDate column, then the ALL predicate is FALSE, because any comparison with a FALSE is FALSE.

The fix to this is to filter out any NULLS inside the inner query, Here is how the query would go:


USE Northwind;

SELECT *
FROM orders
WHERE shippeddate &gt;= ALL
(
    SELECT shippeddate
    FROM orders
    WHERE shippeddate IS NOT NULL
);

      • Comparing ALL and MIN(): Just like we can use the ALL predicate to find the maximum value of something, we can also use it to find the minimum. Here, MAX() changes to MIN(), and >= sign changes to <= sign. Say we’re using the Northwind database still, and need the orders from the earliest shipping date. Here’s how the query would go:

USE Northwind;

SELECT *
FROM orders
WHERE shippeddate &gt;= ALL
(
    SELECT MIN(shippeddate)
    FROM orders
);

Using the ALL predicate:


USE Northwind;

SELECT *
FROM orders
WHERE shippeddate &lt;= ALL
(
    SELECT shippeddate
    FROM orders
);

Once again, no rows are returned because of the presence of NULLs. Both MAX()and MIN() functions discard NULLs, the ALL predicate does not. Adding the WHERE…IS NOT NULL will from the past

example will bring the correct result set.

Even though MAX() vs ALL>= and MIN() vs ALL<= return identical rows (assuming no NULLs in the subquery), the performance can differ.

      • Calculating Running Totals:

USE Northwind;

SELECT CONVERT(CHAR(11), o1.orderdate) AS orderdate
       SUM(od1.unitprice * od1.quantity) AS sales
(
   SELECT SUM(od2.unitprice * od2.quantity) AS sale
   FROM orders AS o2
        JOIN [Order Details] AS od2 ON od2.orderid = o2.orderid
   WHERE o2.customerid = 'BOTTM'
         AND o2.orderdate &lt;= o1.orderdate
)AS 'Sales to Date'
FROM orders as o1
     JOIN [Order Details] AS od1 ON od1.orderid = o1.orderid
WHERE o1.customerid = 'BOTTM'
GROUP BY o1.orderdate;

The inner query takes the same sum as the outer query, but it doesn’t use grouping, and it sums all orders with an order date greater or equal to the order date of the outer query. If we run
this against tables with many rows, we will get a lot of I/O, which can have a performance impact.

      • Creating Derived Tables: A table is a collection of zero or more rows and one or more columns, with all columns having unique names. A table can also have a name or alias. Derived tables are SELECT statements in the FROM clause of a query, and they’re referred to with alias.
        • Using a SELECT with a GROUP BY as a Derived Table: One of the best uses of a derived table is one that uses a GROUP BY clause. For instance, say we’re using the NorthWind database and we want to return a list of products with their prices as well as the average price for each product category, and then present the product list with the averages by joining the product table onto the derived table. Here’s how the query would go:

USE Northwind;

SELECT p.productid,
       p.unitprice,
       a.averageprice
FROM products AS p
     JOIN
(
     SELECT categoryid,
            AVG(unitprice) AS AveragePrice
     FROM products
     GROUP BY categoryid AS a on a.categoryid = p.categoryid;

The above query could also be returned using a correlated subquery:


USE Northwind;

SELECT p.productid,
       p.unitprice,
(
   SELECT AVG(unitprice)
   FROm products AS a
   WHERE a.categoryid = p.categoryid
   GROUP BY a.categoryid
) AS AveragePrice
FROM products AS p;

TIP: When join a derived table with another table, make the name of any aliased key columns within the derived table the same as those for the table to which you are joining. This will prevent any potential confusion. 

  • Finding duplicate rows in a Table: We may run across situations where we need to determine entire duplicate rows where the primary key has been duplicated. We can accomplish this by using a derived table to find the duplicate keys, and then join the original table to find the final result set. This is how such a query would go: 

SELECT T.*
FROM MyTable AS T
     JOIN
(
    SELECT keycol1,
           keycol2
    FROM MyTable
    GROUP BY keycol1,
             keycol2
    HAVING COUNT(*)&gt; 1
) AS D ON D.keycol1=T.keycol2
          AND D.keycol2=T.keycol2;


The above example uses two columns for the key, but we can adapt to our particular
situation.

  • Updating Rows with a GROUP BY clause: We cannot make an update with a GROUP BY clause, unless it’s with a derived table. For instance, let’s say we wish to raise the price of the top 5 selling products by 20%. This is how the query would go: 

USE Northwind;

UPDATE p
  SET
     p.unitprice= p.unitprice * 1.2
FROM products AS p
     JOIN
(
    SELECT TOP 5 productid,
                 SUM(quantity) as quantity
    FROM [Order Details]
    GROUP BY productid
    ORDER BY quantity DESC
) AS s ON s.productid=p.productid;


Note: DO NOT try this on the sample Northwind database unless it’s inside a transaction with a ROLLBACK. Otherwise, the changes will be permanent.
 

    • Using Unions in Derived Tables: The classic example of UNION in a derived table comes in the case of horizontal partitioning. This is when we have a massive table, and we decide to distribute the rows in a number of tables for performance reasons. A row does not appear in more than one table. For example, we may have sales information dating back a number of years, so we may put each year’s information into a separate Sales table. To reconstitute the original logical table, we would use the UNION clause, or more specifically the UNION ALL clause. The difference is that UNION removes duplicates, while UNION ALL does not. The UNION clause has more overhead associated with it because it has to create a work table to eliminate the duplicates.

    For example, let’s say we broken the Sales table into two physical tables: CurrentSales and PastSales, where CurrentSales has the current calendar year, while the PastSales calendar table has the previous six calendar years. The fiscal year ends October 31st. Assuming the calendar year is 2000, and that an index exists on SalesDate on both tables, this is how the code would go to get sales broken down by ProductID for the current fiscal year:


USE Northwind;

SELECT productid,
       SUM(amount) AS sales
FROM
( 
   SELECT salesdate,
          productid,
          amount
    FROM CurrentSales
    UNION ALL
    SELECT salesdate,
           productid,
           amount
    FROM PastSales
) AS s
WHERE salesdate &gt;='19991101'
      AND salesdate &lt;='20001031'
GROUP BY productid;
  • Creating Histograms: One scenario where derived tables are useful is when we require histograms, which depicts the distribution of counts of values. Suppose we’re using the Northwind database, and we wish to have a histogram showing the number of orders placed by a customer. Once we have the counts, we need to get a “count of counts” which is generated by another GROUP BY clause. Here is how the code would go:

USE Northwind;

SELECT cnt,
       COUNT(*) AS frequency
FROM
(
   SELECT customerid,
          COUNT(*) AS cnt
   FROM orders
   GROUP BY customerid
) AS x
GROUP BY cnt;

If we want a quasi-graphical representation of the histogram, we just have to change COUNT(*) to REPLICATE(*,COUNT(*)), which produces an instant bar chart! Here is the code for this:


USE Northwind;

SELECT cnt,
       REPLICATE('*', COUNT(*)) AS frequency
FROM
(
   SELECT customerid,
          COUNT(*) AS cnt
   FROM orders
   GROUP BY customerid
) AS x
GROUP BY cnt;
  • Comparing Performance: In SQL, there many ways of doing the same thing. This is good because it brings flexibility, while also bad because we need to be aware of all the ways to cast a query, as well as know how things work. Let’s say we wish to look at the last row of the Products table in the Northwind database. We fist need to find the highest ProductID value and then use that retrieve the row corresponding to that ProductID. Here is the query: 

USE Northwind;

SET STATISTICS IO ON

SELECT *
FROM products
WHERE productid =
(
    SELECT MAX(productid)
    FROM products
)

As we can see, the query gives a scan count of 1 and a logical read count of 2. Let’s see what happens when we rewrite the query with an equivalent ALL predicate:


USE Northwind;

SET STATISTICS IO ON

SELECT *
FROM products
WHERE productid &gt;= ALL
(
    SELECT productid
    FROM products
)

The above query gives us a scan count of 78 and 156 logical reads! The Query Analyzer reports a relative query cost of 11.25% in the MAX() query versus 88.75% in the ALL query.

We can also see performance differences on a correlated subquery. Let’s say we need the first order row (ie. Lowest order ID) for each customer. This is the query using the MIN() function:


USE Northwind;

SET STATISTICS IO ON

SELECT o1.*
FROM orders AS o1
WHERE o1.orderid=
(
    SELECT MIN(o2.orderid)
    FROM orders AS o2
    WHERE o2.orderid=o1.orderid
);

The query gives us a scan count of 2, and logical reads as 44. Changing the above query by using the ALL predicate we get:


USE Northwind;

SET STATISTICS IO ON

SELECT o1.*
FROM orders AS o1
WHERE o1.orderid &lt;= ALL
(
    SELECT o2.orderid
    FROM orders AS o2
    WHERE o2.customerid=o1.customerid
);

Although the numbers are down for the Orders table(2 scan counts and 27 logical reads), the worktable that was created behind the scenes shows a considerable amount of I/O (88 scan counts and 3367 logical reads). Another bad score for the ALL predicate. Let’s recast the previous query as a derived table:


USE Northwind;

SET STATISTICS IO ON

SELECT o1.customerid
       SUM(od.quantity) as quantity
FROM orders AS o1
     JOIN[Order Details] AS od ON od.orderid=o1.orderid
WHERE o1.orderid=
(
    SELECT MAX(o2.order)
    FROM orders AS o2
    WHERE o2.customerid = o1.customerid
)
GROUP BY o1.customerid;

As we can see, the MIN() or MAX() subquery produces similar I/O statistics and query plan to the equivalent derived table.

The final scenario involves aggregation. Here, we wish to sum the quantity of items for each customer’s most recent order. First, we find the last order date for each customer and the sum the quantity for all orders placed by that customer on that date. Using the MAX() version, here is how our correlated subquery would look like:


USE Northwind;

SET STATISTICS IO ON

SELECT o1.customerid
       SUM(od.quantity) as quantity
FROM orders AS o1
     JOIN[Order Details] AS od ON od.orderid=o1.orderid
WHERE o1.orderdate=
(
    SELECT MAX(o2.orderdate)
    FROM orders AS o2
    WHERE o2.customerid = o1.customerid
)
GROUP BY o1.customerid;

^^ See above for the I/O statistics.

Here is the same query as above using the ALL predicate:


USE Northwind;

SET STATISTICS IO ON

SELECT o1.customerid
       SUM(od.quantity) as quantity
FROM orders AS o1
     JOIN[Order Details] AS od ON od.orderid=o1.orderid
WHERE o1.orderdate&gt;= ALL
(
    SELECT o2.orderdate
    FROM orders AS o2
    WHERE o2.customerid = o1.customerid
)
GROUP BY o1.customerid;

This time, the Orders table was busier.

The same query can also be recast as a derived table. Here’s how that query would go:


USE Northwind;

SET STATISTICS IO ON

SELECT o1.customerid
       SUM(od.quantity) as quantity
FROM orders AS o1
     JOIN[Order Details] AS od ON od.orderid=o1.orderid
     JOIN
(
     SELECT customerid,
            MAX(orderdate) as OrderDate
     FROM orders
     GROUP BY customerid
) AS o2 ON o2.customerid = o1.customerid
           AND o2.orderdate = o1.orderdate
GROUP BY o1.customerid;

 

1 Comment

  1. A WordPress Commenter on August 23, 2018 at 12:01 pm

    Hi, this is a comment.
    To get started with moderating, editing, and deleting comments, please visit the Comments screen in the dashboard.
    Commenter avatars come from Gravatar.

Leave a Comment