JOIN versus WHERE filtering

Intellectually I knew that query results can vary depending on where we decide to “filter” our data. In the JOIN condition, or in the predicate WHERE clause? The following example of this revealed itself recently. It struck me as such a simple and clear case I thought others might benefit from it.

The situation arose where an end user provided a csv file with a list of users, and wanted all the data in our dimension and fact tables for each of those users. If no record existed for a particular user, he still wanted that user to be shown. This would make matching things up in his original spreadsheet easier. No problem… first I imported the csv to a one off table in the DB. The source (driving) table that resulted by importing the list of users had 180,760 rows. Then I LEFT joined that table to the other tables of interest and exported the data. LEFT JOIN is important since we want to make sure our result output shows one row for every original record in the .csv that we were provided with. And we only want the most current row for each seller from the dimension table, so we need to include a “RowIsCurrent” flag condition.

If we choose to apply the current flag filter at the wrong part of the query, rows we want are actually filtered out IN SPITE OF using all LEFT JOINS. See below:

SELECT COUNT(*)
FROM [DataMartETL].[oneoffs].[jp_match_201801229] am WITH(NOLOCK)
LEFT JOIN [EntDW].[Merchant].[dimMerchant] dm WITH(NOLOCK) ON (dm.SourceSystemAccountID = am.sellerid
                                                              AND dm.SourceSystemName = 'Gamma')
LEFT JOIN [EntDW].[Merchant].[factSales] fct WITH(NOLOCK) ON (fct.MerchantDimKey = dm.MerchantDimKey)
WHERE RowIsCurrent = 'Y'

The result of this query only returns a rowcount = 177,058. Hmm… we’re missing 3700+ rows. Now what happens if instead we move our “RowIsCurrent = Y” filter to be part of the JOIN condition?:

SELECT COUNT(*)
FROM [DataMartETL].[oneoffs].[jp_match_201801229] am WITH(NOLOCK)
LEFT JOIN [EntDW].[Merchant].[dimMerchant] dm WITH(NOLOCK) ON (dm.SourceSystemAccountID = am.sellerid
                                                              AND dm.SourceSystemName = 'Gamma'
                                                              AND dm.RowIsCurrent = 'Y')
LEFT JOIN [EntDW].[Merchant].[factSales] fct WITH(NOLOCK) ON (fct.MerchantDimKey = dm.MerchantDimKey)

Ta da! We get 180,760 rows returned! Now the results reflect what I expected originally because of the LEFT JOINS; a row for every row in the driving table. As you can see, simply moving the criteria for our “RowIsCurrent = Y” from the WHERE to the JOIN results in the desired outcome.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s