Dimension ProcessAdd with speed and integrity

There’s an SSAS Dimension ProcessAdd tuning tip that no one seems to consider, judging by several blog posts and workplaces. I’m not sure if that’s correlation or causation, but the conversation I’ve witnessed goes like this:

Old Timer: “We’ve got a very large dimension that takes forever to ProcessUpdate, and we’re not even benefiting from the ProcessUpdate because existing rows never change – we always create new versions via type-2 update in the warehouse.”

New Team Member: “There’s this great feature, ProcessAdd, we can use to only process the new rows.  Let’s do it!”

Old Timer: “Yeah we looked at that but the MDX query to get the max value in the dimension to find new rows from the table takes longer than the time saved in the ProcessAdd itself. All these blogs are written with such tiny cubes – I mean, why do you even need the speed of ProcessAdd when your dimension has less than a million records?”

Newbie Team Member: “Why not just store the highest value you processed in a config table in the warehouse ready to use the next time you need to process? That will solve all your performance dramas!”

OT: “I’d rather have a slow ProcessUpdate than a flaky ProcessAdd”

Noob: “wha?”

OT: “What if rows are being loaded to the warehouse when dimension processing starts? The value stored in your config table will be inconsistent with what’s actually in the dimension. What if you successfully store that value, but then the dimension processing fails?  What if someone does a ProcessFull without remembering to update the maximum ID? The whole thing is error-prone and I’m not getting up in the middle of the night to fix it”

Me: “What if I told you you could have the performance of a database config table with the transactional purity of the MDX approach?”

Both give me the rolling side-eye.

Does this sound familiar?

The solution I’ve used is to create a related dimension attribute to store the maximum ID. It might as well be non-visible and non-aggregatable, but it will need AttributeHierarchyEnabled set to True. The attribute will have very low cardinality (every row processed in the same run gets the same value), so it’s super fast to get the max via MDX, and it’s populated from the very same query that processes the dimension (whether Full, Add, etc.). In a simple proof of concept the ProcessAdd query might change from this:

SELECT 
 TradeID
,TradeAttribute01
,TradeAttribute02
--etc
,TradeAttribute33
FROM mart.DimTrade
WHERE TradeID > 103122403

to this:

SELECT 
 TradeID
,TradeAttribute01
,TradeAttribute02
--etc
,TradeAttribute33
,(SELECT MAX(TradeID) FROM mart.DimTrade) AS ProcessAddRangeHigh
FROM mart.DimTrade
WHERE TradeID > 103122403

Yes, this will yield equivalent performance, assuming the table is clustered on TradeID.
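To make that assumption concrete, here’s a sketch of the physical design I have in mind (the index name is hypothetical; your warehouse DDL will differ):

-- Sketch only: with the ascending surrogate as the clustered key, both the
-- MAX(TradeID) subquery and the TradeID > <literal> filter resolve as cheap
-- index seeks rather than full scans.
CREATE UNIQUE CLUSTERED INDEX CIX_DimTrade_TradeID
    ON mart.DimTrade (TradeID);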

It’s not actually that simple, because instead of hardcoding that literal you will need to obtain it dynamically via MDX, for example:

WITH MEMBER [Measures].[MaxKey] AS
   MAX([Dim Trade].[Process Add Range High].ALLMEMBERS
   , StrToValue([Dim Trade].[Process Add Range High].CurrentMember.MEMBER_KEY))
SELECT
   {[Measures].[MaxKey]} ON 0
FROM
   [Trades]

If your workplace automates cube processing via SSIS, C# and static relational views (probably the least convenient architecture for what I’m trying to achieve in this blog post), then you could obtain the value from MDX and store it in a transient config table for immediate use. Your database view defining the dimension source will look more like this:

SELECT 
 TradeID
,TradeAttribute01
,TradeAttribute02
--etc
,TradeAttribute33
,(SELECT MAX(TradeID) FROM dbo.DimTrade) AS ProcessAddRangeHigh
FROM dbo.DimTrade
WHERE 
	TradeID > ( SELECT ConfigValue FROM dbo.Config 
	            WHERE ConfigName = 'DimTradeProcessAddRangeHigh' )
	OR EXISTS
	 ( SELECT ConfigValue FROM dbo.Config 
	   WHERE ConfigName = 'DimTradeNeedsFullProcess' AND ConfigValue = 1 )

So yes, we’re still using a value from a relational config table, but crucially (for reliability) we’re populating that table from the cube immediately before processing and using the value straight away. Compare that to populating the value during the previous cube processing run, where anything that happens in between can put it out of sync.
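For illustration, the refresh step could look something like this sketch. The dbo.Config shape matches the view above, and @MaxKeyFromCube is a hypothetical variable assumed to hold the [Measures].[MaxKey] value your SSIS/C# automation just read via the MDX query:

-- Sketch only: persist the cube's current max key immediately before
-- processing, so the view's WHERE clause picks up exactly the new rows.
DECLARE @MaxKeyFromCube bigint = 103122403;  -- placeholder; comes from MDX

UPDATE dbo.Config
SET    ConfigValue = @MaxKeyFromCube
WHERE  ConfigName  = 'DimTradeProcessAddRangeHigh';

IF @@ROWCOUNT = 0
    INSERT dbo.Config (ConfigName, ConfigValue)
    VALUES ('DimTradeProcessAddRangeHigh', @MaxKeyFromCube);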

As an added bonus, this also gives you a definitive record within the dimension of how many rows were processed each time, which can be useful for troubleshooting suspicious elapsed processing times. No, the timestamp or lineage data in your warehouse is not a reliable proxy for this!
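If you want to inspect that record, a sketch along these lines will do it, assuming the dimension and attribute names used earlier and that the key attribute’s hierarchy is enabled. Each range-high member corresponds to one processing run, so counting the trade keys under it gives the rows per run:

WITH MEMBER [Measures].[RowsInBatch] AS
   -- Autoexists within the dimension restricts the Trade ID members
   -- to those stamped with the current range-high value.
   COUNT(EXISTING [Dim Trade].[Trade ID].[Trade ID].MEMBERS)
SELECT
   {[Measures].[RowsInBatch]} ON 0
,  [Dim Trade].[Process Add Range High].[Process Add Range High].MEMBERS ON 1
FROM
   [Trades]

Note that a subsequent ProcessFull re-reads the whole view and restamps every row with the then-current max, so this history only accumulates across ProcessAdd runs.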

This technique is best used when your dimension is keyed by an ascending surrogate, and there’s no other attribute readily available to denote the incremental processing grain.

There is of course the downside that this marginally increases the size of your dimension, and the time to fully process it. But that is likely to be a small price to pay for enabling a fast and more resilient ProcessAdd.
