Hashbytes and Checksum
HASHBYTE and CHECKSUM
CHECKSUM and HASHBYTES are both commonly used to determine whether field values in two records are the same. HASHBYTES and CHECKSUM are typically used in data warehouse loading scenarios, especially when we need to compare many columns from the source table to many columns in the destination table. This allows us to really scale our solution as far as controlling the number of columns that we need to maintain, since we only need to maintain the one function that we’ve created. We also don’t have to bring as much data into our ETL tool as well.
CHECKSUM is significantly syntactically easier to write as far as the code goes than HASHBYTES are. It also handles NULLs automatically for us, whereas with HASHBYTES there many little intricacies that we have to worry about, and requires a lot of additional coding. HASHBYTES will not handle NULLS automatically, and unlike CHECKSUMs they’re case sensitive.
We use one over the other based on something called collision. Collision gives us a false positive or false negative. So if we’re trying to compare ten columns in one table and ten columns in another table, we don’t want a false positive or a false negative. Unfortunately, CHECKSUM is more likely to give us that false positive or false negative than HASHBYTES. Another reason for using one over the other is the easiness of writing the syntax. If we’re querying on some relatively small tables where collision is not a factor, then we would definitely prefer to use CHECKSUM over HASHBYTES since it involves significantly less code and it’s easier to troubleshoot and debug and maintain over time. It’s just an easier and simpler solution. But HASHBYTES will give us additional functionality and guarantee that we’re not getting any false positives and false negatives.
(Include syntax for HASHBYTES and CHECKSUM)
HASHBYTES give us a significantly different result set in the column that we’re modifying. It cannot handle NULLss; this is something we will have to do on our own using ISNULL() or COALESCE(), and it is case sensitive. If we’re trying to compare a FirstName column to another FirstName column, we must be aware that if the name is spelled with lowercase or uppercase, it’s going to make a difference as far as the algorithm and output of that algorithm.
(Explain the Syntax around the 3:56 mark)
--CHECKSUM SELECT CHECKSUM(Name, ProductNumber, Color) AS CheckColumn, Name, ProductNumber, Color FROM Production.Product;
The CheckColumn is a value column that we use for comparison. So if we want to compare the Name, ProductNumber, and Color columns from our source table to a destination table, we can so and that would be easier and better for performance compared to the alternative. Notice in some cases that even when Color has a NULL value, the CheckColumn is still returning a value that is valid.
--CHECKSUM is not case-sensitive (unlike HASHBYTES) --CHECKSUM will return the same value even though first query is uppercase while second is lowercase. SELECT CHECKSUM('Moe', 'G', 'Ahad') SELECT CHECKSUM('moe', 'g,''ahad')
The is no difference in the value returned from either table.
--HASHBYTES: Here we're using the hashing algorithm 'SHA' --We will not use comma separators and instead add the columns. SELECT HASHBYTES('SHA', 'Moe', +'G', +'Ahad');
^^ As we can see, there’s a difference between the two HASHBYTES values when we use different casing.
--HASHBYTES are case-sensitive SELECT HASHBYTES('Moe', 'G', 'Ahad') SELECT HASHBYTES('moe', 'g,''ahad')
^^ When we add a NULL, the HASHBYTES is no longer returning any value but rather just a NULL. This is because anything plus a NULL will return a NULL.
If we are working with tables that are nullable, then we need to handle those ahead of time. We need to replace NULL with a value.
--How to handle NULLs with HASHBYTES SELECT HASHBYTES('SHA', 'Moe' + 'G' + 'Ahad'); SELECT HASHBYTES('SHA', 'Moe', 'G', + 'Ahad' + COALESCE(NULL, 'NA'));
Another scenario we need to watch out for is when the columns are combined. Since we’re combining columns, the HASHBYTES algorithm doesn’t know how many columns are appended in the code. It just combines the strings together and then run the HASHBYTES algorithm on it. This is important because what if we have something like two of the columns already combined instead of originally working with three columns? What will happen whenever we have this scenario we still have the exact same strings appended to one final string, and the HASHBYTES algorithm will still return the same value:
For HASHBYTES, we need to watch out for case sensitivity, NULLs, scenarios where columns blend with one another, and converting integers to strings.
A work around to the above query is to use a delimiter.
SELECT HASHBYTES('SHA', 'Moe' + '|' + 'G' + '|' + 'Ahad'); --'Moe|G|Ahad' SELECT HASHBYTES('SHA', 'M' + '|' + 'oeG' + '|' + 'Ahad'); --'M|oeG|Ahad'
Now we get two different result sets.
What if we add in another column? We will get an error because we cannot have an integer as a second argument for the HASHBYTES function. HASHBYTES work by concatenating strings. We will need to explicitly cast the integer to a varchar values.
If we are working with data coming from user-entered forms, we can convert all the columns to uppercase.
--Demo of HASHBYTES against a table SELECT HASHBYTES('SHA', ISNULL(Name, 'NA') + '|' + ISNULL(ProductNumber, 'NA') + '|' + ISNULL(Color, 'NA')) AS HashColumn, Name, ProductNumber, Color FROM Production.Product;
We have our hashing algorithm on the left hand and even though we have NULLs we’re still returning back the hashing algorithm values each time. One last thing we can do is add an additional function either around all the columns in the HASHBYTES function such as make all strings uppercase, and if anything is an integer we explicitly cast them into varchar values.
Integer values are not allowed, the first argument is the algorithm type and the second argument will not allow any integer data types.