Intellectually I knew that query results can vary depending on where we decide to “filter” our data. In the JOIN condition, or in the predicate WHERE clause? The following example of this revealed itself recently. It struck me as such a simple and clear case I thought others might benefit from it.
The situation arose where an end user provided a csv file with a list of users, and wanted all the data in our dimension and fact tables for each of those users. If no record existed for a particular user, he still wanted that user to be shown. This would make matching things up in his original spreadsheet easier. No problem… first I imported the csv to a one off table in the DB. The source (driving) table that resulted by importing the list of users had 180,760 rows. Then I LEFT joined that table to the other tables of interest and exported the data. LEFT JOIN is important since we want to make sure our result output shows one row for every original record in the .csv that we were provided with. And we only want the most current row for each seller from the dimension table, so we need to include a “RowIsCurrent” flag condition.
If we choose to apply the current flag filter at the wrong part of the query, rows we want are actually filtered out IN SPITE OF using all LEFT JOINS. See below:
SELECT COUNT(*) FROM [DataMartETL].[oneoffs].[jp_match_201801229] am WITH(NOLOCK) LEFT JOIN [EntDW].[Merchant].[dimMerchant] dm WITH(NOLOCK) ON (dm.SourceSystemAccountID = am.sellerid AND dm.SourceSystemName = 'Gamma') LEFT JOIN [EntDW].[Merchant].[factSales] fct WITH(NOLOCK) ON (fct.MerchantDimKey = dm.MerchantDimKey) WHERE RowIsCurrent = 'Y'
The result of this query only returns a rowcount = 177,058. Hmm… we’re missing 3700+ rows. Now what happens if instead we move our “RowIsCurrent = Y” filter to be part of the JOIN condition?:
SELECT COUNT(*) FROM [DataMartETL].[oneoffs].[jp_match_201801229] am WITH(NOLOCK) LEFT JOIN [EntDW].[Merchant].[dimMerchant] dm WITH(NOLOCK) ON (dm.SourceSystemAccountID = am.sellerid AND dm.SourceSystemName = 'Gamma' AND dm.RowIsCurrent = 'Y') LEFT JOIN [EntDW].[Merchant].[factSales] fct WITH(NOLOCK) ON (fct.MerchantDimKey = dm.MerchantDimKey)
Ta da! We get 180,760 rows returned! Now the results reflect what I expected originally because of the LEFT JOINS; a row for every row in the driving table. As you can see, simply moving the criteria for our “RowIsCurrent = Y” from the WHERE to the JOIN results in the desired outcome.