The except operator in t-sql
When I first read about the except operator in t-sql and saw how simple it was to use, it knew I would be using it a lot. And I have.
I first came across it when studying for the 70-461 exam (Querying Microsoft SQL Server 2012). I keep trying to tell my colleagues about it but it seems they already know. So the only logical thing to do was write a blog post about it.
What does it do?
The EXCEPT OPERATOR compares two sets of data and performs set difference. In other words, it returns all of the rows from query 1 EXCEPT when an identical row is returned from query 2. This makes EXCEPT a quick and easy way to compare two sets of data without using 3rd party tools or more complicated JOIN syntax.
What is the syntax?
Ok, this is why I love EXCEPT. It is not difficult to learn. In fact it is simple. As long as you can write your 2 queries, all you need to do is stick the word EXCEPT between them and execute the full block of code at the same time. That really is all there is to it.
This example below over simplifies the problem – I’ll include a more complex one later in the post however this is really just to show the syntax. Using the Adventureworks2014 database, we select BusinessEntityID from tables [HumanResources].[Employee] and [Sales].[SalesPerson]. It stands to reason that every sales person will be an employee but not every employee will be a sales person. In this database, there are 290 rows in [HumanResources].[Employee] and 17 rows in [Sales].[SalesPerson]. If we get the query correct, we should expect to return 273 rows for employees who are not sales persons.
[sql]
— Determine which Employees are not sales persons
SELECT DISTINCT BusinessEntityID
FROM [HumanResources].[Employee]
EXCEPT
SELECT DISTINCT BusinessEntityID
FROM [Sales].[SalesPerson]
[/sql]
The order of the 2 queries matters.
You must write the query like you would say it English. It’s important because the other way round gives completely different results. Assuming you have data integrity, you wouldn’t expect to have any sales persons who are not employees – and this is confirmed when we change the order of query 1 and query 2.
[sql]
— Query1 and Query2 are the wrong way round and return 0 rows.
SELECT DISTINCT BusinessEntityID
FROM [Sales].[SalesPerson]
EXCEPT
SELECT DISTINCT BusinessEntityID
FROM [HumanResources].[Employee]
[/sql]
Is it expensive?
The first thing to say here as a note of caution is don’t worry about the Clustered Index Scan. As we discussed earlier, these tables are only 290 and 17 rows in size – so they are tiny. When a result set is so small that seeking the index key provides no benefit, the query optimizer will simply scan the index.
The second item of note is the Merge Join. Here is what TechNet say about Merge Join.
Merge join itself is very fast, but it can be an expensive choice if sort operations are required. However, if the data volume is large and the desired data can be obtained presorted from existing B-tree indexes, merge join is often the fastest available join algorithm.
So, in this query plan – we have done quite well. If it’s something you will be running regularly, it is worth reviewing your plan for performance purposes however for ad hoc problem solving, you don’t need to worry about this unless your query takes longer than you think is reasonable.
When do I use it?
As part of my job as a DBA, I manage a large amount of transactional replication. Hundreds of publications, some with more than 1 subscriber and this can change with each release as new tables are added. I can run the query below against production and our Dev/QA environments to make sure that our replication is consistent across the estate.
Note: I collect the results of the query below from prod and our Dev/QA instances and store the results in a central collection server.
[sql]
SELECT DISTINCT
p.Publication,
a.publisher_db AS Source_db,
s.subscriber_db AS Dest_db,
a.destination_object AS Dest_table,
a.source_object AS Source_table
FROM Distribution..msarticles a
JOIN Distribution..mssubscriptions s
ON a.publication_ID = s.publication_ID
JOIN Distribution..mspublications p
ON s.publication_ID = p.publication_ID
where s.subscriber_db <>’virtual’
GROUP BY
a.article,
a.publisher_db,
s.subscriber_db,
p.Publication,
a.source_object,
a.destination_object
[/sql]
So in order to do the comparison, each of these results is inserted into its own table. Let’s call them repl_qa and repl_prod. Then I run the query below but I purposely run with both queries so I can check both if something is missing in QA/DEV AND if something is missing from prod. (Either it hasn’t been released yet or it has been dropped)
[sql]
— Missing in QA
SELECT DISTINCT
Publication,
Source_db,
Dest_db,
Dest_table,
Source_table
FROM repl_prod
EXCEPT
SELECT DISTINCT
Publication,
Source_db,
Dest_db,
Dest_table,
Source_table
FROM repl_qa
— Not released to Prod – Or dropped from Prod
SELECT DISTINCT
Publication,
Source_db,
Dest_db,
Dest_table,
Source_table
FROM repl_qa
EXCEPT
SELECT DISTINCT
Publication,
Source_db,
Dest_db,
Dest_table,
Source_table
FROM repl_prod
[/sql]
Please let me know in the comments if you have any other good uses for the EXCEPT operator.
Leave a Reply