Updating Statistics on Steroids

Updating statistics on large tables can take a while. A page-compressed warehouse fact table with 19 billion rows, 1000 partitions, a sensible clustered index, and 35 columns occupying in excess of 1 TB of disk space on my server takes around 22 hours to update its 33 stats objects using the default sample size. Sure, you can do several tables in parallel, but you can’t update the various stats on a single table in parallel using supported functionality.

So the next step is usually to tinker with sample size, spread the update across multiple maintenance windows, switch older partitions out to an archive table, or perhaps leverage filtered stats (conditions apply) or incremental stats (from SQL 2014). This blog post looks at a more radical approach – attempting to update those 33 stats “in parallel” to get that 22 hour binge down to the length of a reasonable coffee break.

For simplicity I’m going to limit this post to automatic stats on columns and on the clustered index, operating on the AdventureWorksDW2014 FactInternetSales table in my example. A relatively low-risk, future-proof (and, let’s face it, lazy) approach will reuse rather than reinvent existing stats functionality where possible:

  1. Clone an empty stats donor table from the DDL of the patient, dbo.FactInternetSales
  2. Trigger the natural creation of auto stats objects on every column and index of the donor
  3. Seed the donor with a similar number of sample rows from FactInternetSales as are used by the default stats update process.
  4. Fake the rowcount of the donor table to simulate the patient, and then update its stats
  5. Transplant the stats histograms from the donor to the patient table using the unsupported STATS_STREAM features of DBCC SHOW_STATISTICS and UPDATE STATISTICS

At step 3, I’m going to use the same pseudo-random, page-oriented sampling mechanism underlying the standard update stats command.  Extracting this sample of 10m rows from 19bn is hugely time consuming, but now we’re doing it just once rather than 33 times.  Step 4 still updates the stats in series, but is super-fast because it scans only 10m rows vs 19bn in the original.

I will wade through the code in detail below, but I’m sure most of you just want to know the results:  I can now update stats on my 22 hour table at the same sample rate in 40 minutes, using proportionately less IO and CPU too.

Disclaimer: This approach is about as safe to use as steroids. It uses unsupported functionality, and should therefore be applied to production systems with extreme caution, and only by expert programmers who understand their tables and how they are used. On the other hand, it’s not dependent on the internal structure of the STATS_STREAM binary, and if it all falls in a heap (well, not that kind of heap) you can heroically resuscitate the patient by running the conventional update stats command on your production table (or if you had the foresight to back up the previous histograms, you could reinstate them using the techniques illustrated below, saving you 22 hours for the restore).

There are a few downsides (or at least points of difference) to be aware of:

  1. It requires some initial coding effort, and increases your overall complexity and code base.
  2. Any subsequent performance difficulties will doubtless be attributed to this “hack”, so you may need to exercise your people skills to ensure your team buys into this approach.
  3. All statistics on the table will use the same sample size. With the conventional approach you get some relatively minor variation (by design) because (I believe) SQL pre-samples the data to determine the sample size to use per statistic update, and also because the table may be growing as statistics are updated.
  4. How do you determine an appropriate sample size? As the table gradually grows, or is abruptly archived, do you want to vary the sample size? One conservative approach might be to continue to update statistics on the primary key conventionally, and then detect and reuse that same sample rate for the other stats via the donor table (see the sketch after this list). However, because we’re only sampling once for many statistics, it makes sense to benefit from a higher sample rate than the default. I’m considering a value of 15 million, compared to the current default of around 8 million for my table, which will be good for a couple of years of natural growth and provide greater resolution in the shorter term.
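For the last point, here is a minimal sketch of detecting the sample rate last used on the clustered primary key, so it can be reused (or inflated) when seeding the donor. This is my own addition, and it assumes SQL 2012 SP1 or later for sys.dm_db_stats_properties and that the PK statistic is still maintained conventionally:

DECLARE @SampleRows bigint;

SELECT @SampleRows = sp.rows_sampled  -- rows the engine actually sampled last time round
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('dbo.FactInternetSales')
  AND s.name = 'PK_FactInternetSales_SalesOrderNumber_SalesOrderLineNumber';

SELECT @SampleRows AS RowsToRequestFromTABLESAMPLE;  -- feed this (or a larger figure) into step 3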

Thomas Kejser blogged a similar approach to manipulate histograms via transplant, but I’m not aware of anyone transplanting to accelerate the update stats process.

Thanks also to my colleague Paul Hunt for early brainstorming on this topic.

Code walkthrough

1. Clone an empty stats donor table from the DDL of the patient, dbo.FactInternetSales

I’m going to use a permanent donor table for now to eliminate any theoretical risk of a binary stats stream from a temp table being incompatible with a permanent table.  A permanent table is also handy for troubleshooting across multiple sessions.

USE AdventureWorksDW2014;
CREATE TABLE dbo.donor(
	[ProductKey] [int] NOT NULL,
	[OrderDateKey] [int] NOT NULL, 
	[DueDateKey] [int] NOT NULL,
	[ShipDateKey] [int] NOT NULL,
	[CustomerKey] [int] NOT NULL,
	[PromotionKey] [int] NOT NULL,
	[CurrencyKey] [int] NOT NULL,
	[SalesTerritoryKey] [int] NOT NULL,
	[SalesOrderNumber] [nvarchar](20) NOT NULL,
	[SalesOrderLineNumber] [tinyint] NOT NULL,
	[RevisionNumber] [tinyint] NOT NULL,
	[OrderQuantity] [smallint] NOT NULL,
	[UnitPrice] [money] NOT NULL,
	[ExtendedAmount] [money] NOT NULL,
	[UnitPriceDiscountPct] [float] NOT NULL,
	[DiscountAmount] [float] NOT NULL,
	[ProductStandardCost] [money] NOT NULL,
	[TotalProductCost] [money] NOT NULL,
	[SalesAmount] [money] NOT NULL,
	[TaxAmt] [money] NOT NULL,
	[Freight] [money] NOT NULL,
	[CarrierTrackingNumber] [nvarchar](25) NULL,
	[CustomerPONumber] [nvarchar](25) NULL,
	[OrderDate] [datetime] NULL,
	[DueDate] [datetime] NULL,
	[ShipDate] [datetime] NULL,
 CONSTRAINT [PK_Donor] PRIMARY KEY CLUSTERED 
(
	[SalesOrderNumber] ASC,
	[SalesOrderLineNumber] ASC
));
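As an aside, if you’d rather not hand-maintain the donor DDL, here is a rough sketch of cloning just the shape of the patient. Note that SELECT ... INTO copies columns and nullability but not constraints or indexes, so the clustered primary key must be added separately (dbo.donor2 is a hypothetical name used purely for illustration):

SELECT * INTO dbo.donor2
FROM dbo.FactInternetSales
WHERE 1 = 0;  -- copies the column definitions, no rows

ALTER TABLE dbo.donor2 ADD CONSTRAINT PK_Donor2
PRIMARY KEY CLUSTERED (SalesOrderNumber, SalesOrderLineNumber);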

2. Trigger the natural creation of auto stats objects on every column of the donor

We’re not going to get any automatic column stats while the table is empty, so add a row:

insert dbo.Donor select top 1 * from dbo.FactInternetSales;

Run some queries with predicates covering every column. I’ve chosen one query per column because it was simple to generate and manage the code.  Although we’ve only got 1 record in the table, this pattern is fast on any size table, and is good with columns of any data type:

declare @sinkCount int;
SELECT @sinkCount=count(1) from (SELECT TOP 1 ProductKey FROM dbo.donor WHERE ProductKey = (SELECT TOP 1 ProductKey FROM dbo.donor)) X
SELECT @sinkCount=count(1) from (SELECT TOP 1 OrderDateKey FROM dbo.donor WHERE OrderDateKey = (SELECT TOP 1 OrderDateKey FROM dbo.donor)) X
SELECT @sinkCount=count(1) from (SELECT TOP 1 DueDateKey FROM dbo.donor WHERE DueDateKey = (SELECT TOP 1 DueDateKey FROM dbo.donor)) X
SELECT @sinkCount=count(1) from (SELECT TOP 1 ShipDateKey FROM dbo.donor WHERE ShipDateKey = (SELECT TOP 1 ShipDateKey FROM dbo.donor)) X
SELECT @sinkCount=count(1) from (SELECT TOP 1 CustomerKey FROM dbo.donor WHERE CustomerKey = (SELECT TOP 1 CustomerKey FROM dbo.donor)) X
SELECT @sinkCount=count(1) from (SELECT TOP 1 PromotionKey FROM dbo.donor WHERE PromotionKey = (SELECT TOP 1 PromotionKey FROM dbo.donor)) X
SELECT @sinkCount=count(1) from (SELECT TOP 1 CurrencyKey FROM dbo.donor WHERE CurrencyKey = (SELECT TOP 1 CurrencyKey FROM dbo.donor)) X
SELECT @sinkCount=count(1) from (SELECT TOP 1 SalesTerritoryKey FROM dbo.donor WHERE SalesTerritoryKey = (SELECT TOP 1 SalesTerritoryKey FROM dbo.donor)) X
SELECT @sinkCount=count(1) from (SELECT TOP 1 SalesOrderNumber FROM dbo.donor WHERE SalesOrderNumber = (SELECT TOP 1 SalesOrderNumber FROM dbo.donor)) X
SELECT @sinkCount=count(1) from (SELECT TOP 1 SalesOrderLineNumber FROM dbo.donor WHERE SalesOrderLineNumber = (SELECT TOP 1 SalesOrderLineNumber FROM dbo.donor)) X
SELECT @sinkCount=count(1) from (SELECT TOP 1 RevisionNumber FROM dbo.donor WHERE RevisionNumber = (SELECT TOP 1 RevisionNumber FROM dbo.donor)) X
SELECT @sinkCount=count(1) from (SELECT TOP 1 OrderQuantity FROM dbo.donor WHERE OrderQuantity = (SELECT TOP 1 OrderQuantity FROM dbo.donor)) X
SELECT @sinkCount=count(1) from (SELECT TOP 1 UnitPrice FROM dbo.donor WHERE UnitPrice = (SELECT TOP 1 UnitPrice FROM dbo.donor)) X
SELECT @sinkCount=count(1) from (SELECT TOP 1 ExtendedAmount FROM dbo.donor WHERE ExtendedAmount = (SELECT TOP 1 ExtendedAmount FROM dbo.donor)) X
SELECT @sinkCount=count(1) from (SELECT TOP 1 UnitPriceDiscountPct FROM dbo.donor WHERE UnitPriceDiscountPct = (SELECT TOP 1 UnitPriceDiscountPct FROM dbo.donor)) X
SELECT @sinkCount=count(1) from (SELECT TOP 1 DiscountAmount FROM dbo.donor WHERE DiscountAmount = (SELECT TOP 1 DiscountAmount FROM dbo.donor)) X
SELECT @sinkCount=count(1) from (SELECT TOP 1 ProductStandardCost FROM dbo.donor WHERE ProductStandardCost = (SELECT TOP 1 ProductStandardCost FROM dbo.donor)) X
SELECT @sinkCount=count(1) from (SELECT TOP 1 TotalProductCost FROM dbo.donor WHERE TotalProductCost = (SELECT TOP 1 TotalProductCost FROM dbo.donor)) X
SELECT @sinkCount=count(1) from (SELECT TOP 1 SalesAmount FROM dbo.donor WHERE SalesAmount = (SELECT TOP 1 SalesAmount FROM dbo.donor)) X
SELECT @sinkCount=count(1) from (SELECT TOP 1 TaxAmt FROM dbo.donor WHERE TaxAmt = (SELECT TOP 1 TaxAmt FROM dbo.donor)) X
SELECT @sinkCount=count(1) from (SELECT TOP 1 Freight FROM dbo.donor WHERE Freight = (SELECT TOP 1 Freight FROM dbo.donor)) X
SELECT @sinkCount=count(1) from (SELECT TOP 1 CarrierTrackingNumber FROM dbo.donor WHERE CarrierTrackingNumber = (SELECT TOP 1 CarrierTrackingNumber FROM dbo.donor)) X
SELECT @sinkCount=count(1) from (SELECT TOP 1 CustomerPONumber FROM dbo.donor WHERE CustomerPONumber = (SELECT TOP 1 CustomerPONumber FROM dbo.donor)) X
SELECT @sinkCount=count(1) from (SELECT TOP 1 OrderDate FROM dbo.donor WHERE OrderDate = (SELECT TOP 1 OrderDate FROM dbo.donor)) X
SELECT @sinkCount=count(1) from (SELECT TOP 1 DueDate FROM dbo.donor WHERE DueDate = (SELECT TOP 1 DueDate FROM dbo.donor)) X
SELECT @sinkCount=count(1) from (SELECT TOP 1 ShipDate FROM dbo.donor WHERE ShipDate = (SELECT TOP 1 ShipDate FROM dbo.donor)) X
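If hand-writing one probe per column feels unwieldy on a wider table, here is a sketch of generating the same batch from sys.columns instead (my own addition; it assumes simple column names that don’t need quoting):

DECLARE @Probe nvarchar(max) = N'declare @sinkCount int;' + NCHAR(13) + NCHAR(10);

SELECT @Probe = @Probe
 + N'SELECT @sinkCount=count(1) from (SELECT TOP 1 ' + c.name
 + N' FROM dbo.donor WHERE ' + c.name
 + N' = (SELECT TOP 1 ' + c.name + N' FROM dbo.donor)) X;' + NCHAR(13) + NCHAR(10)
FROM sys.columns AS c
WHERE c.object_id = OBJECT_ID('dbo.donor');

EXEC sp_executesql @Probe;  -- runs every probe in one batch, triggering the auto stats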

3. Seed the donor with a similar number of sample rows from FactInternetSales as are used by the default stats update process

Inserting a large number of rows into the donor might indirectly trigger an auto update of stats, so I’m going to set all stats to NORECOMPUTE, just to be safe:

update statistics dbo.Donor WITH NORECOMPUTE;
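If you want to confirm the flag took effect, here is a quick check (my own addition) against sys.stats:

SELECT name, no_recompute
FROM sys.stats
WHERE object_id = OBJECT_ID('dbo.Donor');  -- expect no_recompute = 1 for every stat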

That update is practically instantaneous because dbo.Donor only contains one row. I now need to remove that row to prevent a primary key constraint violation when we insert the bulk of the rows:

truncate table dbo.Donor;

Now add the rows, using the same sampling mechanism underlying the conventional UPDATE STATISTICS command. dbo.FactInternetSales only has about 60k records and by default samples 57k of them – not different enough to demonstrate the scaling up of cardinality estimates, so I’ll choose 30k for the donor. This will be adequate to inspect the statistics and show the cardinality estimates on query plans, but these numbers are all too small to demonstrate the UPDATE STATISTICS time saving:

insert dbo.donor 
SELECT * FROM dbo.FactInternetSales
TABLESAMPLE (30000 ROWS) with (nolock);

(31112 row(s) affected)

Note that we don’t get precisely the number of rows we asked for because SQL samples by page granularity.

4. Fake the rowcount of the donor to simulate the patient, and then update its stats

-- bigint rather than int: row_count in sys.dm_db_partition_stats is bigint,
-- and the real 19bn-row patient table would overflow an int
DECLARE @PageCount bigint;
DECLARE @RowCount bigint;

SELECT 
 @PageCount = sum(in_row_data_page_count) 
, @RowCount = sum(row_count) 
FROM sys.dm_db_partition_stats 
WHERE object_id = OBJECT_ID('dbo.FactInternetSales'); 

DECLARE @SQL nvarchar(max) = 'UPDATE STATISTICS dbo.donor WITH ROWCOUNT = ' + CONVERT(varchar(20),@RowCount) 
 + ', PAGECOUNT = ' + CONVERT(varchar(20),@PageCount) + ';';

EXEC sp_executesql @SQL;

UPDATE STATISTICS dbo.donor WITH 
SAMPLE 99 PERCENT --to be more accurate, could use "@RowCount-1 ROWS" 
, NORECOMPUTE; 
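Before transplanting anything, it’s worth sanity-checking that the faked rowcount made it into the donor’s statistics (my own addition; exact figures will vary, but the header’s Rows should reflect the patient’s rowcount while Rows Sampled reflects the rows actually sitting in the donor):

DBCC SHOW_STATISTICS ('dbo.Donor', PK_Donor) WITH STAT_HEADER, NO_INFOMSGS;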

 

5. Transplant the stats histograms from the donor to the patient table using the unsupported STATS_STREAM features of DBCC SHOW_STATISTICS and UPDATE STATISTICS

While we can extract the histogram from automatic column stats by specifying the column name, when we come to load this back into dbo.FactInternetSales we’ll need to use the current auto-generated statistics name.  So we’ll need a mapping table:

CREATE TABLE #PatientColumnToStatMapping (columnName SYSNAME, StatsName SYSNAME);

INSERT #PatientColumnToStatMapping (columnName, statsName)
SELECT
	c.name 'column',
	ss.name AS 'Statistic'
FROM sys.stats ss
INNER JOIN sys.stats_columns AS sc  ON ss.object_id = sc.object_id AND ss.stats_id = sc.stats_id
INNER JOIN sys.columns AS c ON sc.object_id = c.object_id AND c.column_id = sc.column_id
INNER JOIN sys.objects so ON ss.object_id = so.object_id
INNER JOIN sys.schemas sch ON so.schema_id = sch.schema_id
WHERE sch.name = 'dbo' AND so.name = 'FactInternetSales'
  AND ss.auto_created = 1 AND ss.user_created = 0  AND ss.has_filter = 0;

Let’s check this table:

select * from #PatientColumnToStatMapping;
columnName StatsName
CurrencyKey _WA_Sys_00000007_276EDEB3
CustomerKey _WA_Sys_00000005_276EDEB3
OrderDateKey _WA_Sys_00000002_276EDEB3
DueDateKey _WA_Sys_00000003_276EDEB3
ShipDateKey _WA_Sys_00000004_276EDEB3
ProductKey _WA_Sys_00000001_276EDEB3
PromotionKey _WA_Sys_00000006_276EDEB3
SalesTerritoryKey _WA_Sys_00000008_276EDEB3
SalesOrderLineNumber _WA_Sys_0000000A_276EDEB3
OrderDate _WA_Sys_00000018_276EDEB3
DueDate _WA_Sys_00000019_276EDEB3
ShipDate _WA_Sys_0000001A_276EDEB3

We’ll also need a table (#stream) to programmatically receive the single-row output from DBCC SHOW_STATISTICS, and for convenience a table (#histogram) to hold all the histogram streams converted to varchar, keyed by column name. For brevity only a few examples are shown here (two column stats and the PK stats); a loop covering the rest is sketched after this block:

CREATE TABLE #stream(
stream VARBINARY(MAX) NOT NULL
, rows INT NULL
, pages INT NULL);

CREATE TABLE #histogram (
ColumnName SYSNAME PRIMARY KEY
, StreamVarchar varchar(max));

insert #stream EXEC ('DBCC SHOW_STATISTICS (''dbo.Donor'',ProductKey) WITH STATS_STREAM,NO_INFOMSGS');
insert #histogram SELECT 'ProductKey',CONVERT(varchar(max),stream,1) FROM #Stream;
truncate table #stream;

insert #stream EXEC ('DBCC SHOW_STATISTICS (''dbo.Donor'',CustomerKey) WITH STATS_STREAM,NO_INFOMSGS');
insert #histogram SELECT 'CustomerKey',CONVERT(varchar(max),stream,1) FROM #Stream;
truncate table #stream;

insert #stream EXEC ('DBCC SHOW_STATISTICS (''dbo.Donor'',PK_Donor) WITH STATS_STREAM,NO_INFOMSGS');
insert #histogram SELECT 'PK_Donor',CONVERT(varchar(max),stream,1) FROM #Stream;
drop table #stream;
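For the real 33-statistic table you won’t want one hand-coded block per column, so here is a sketch of looping over the mapping table instead (my own addition; it picks up whichever mapped columns haven’t been extracted yet, and again assumes simple column names):

CREATE TABLE #streamLoop (stream VARBINARY(MAX) NOT NULL, rows INT NULL, pages INT NULL);

DECLARE @col sysname, @cmd nvarchar(max);
DECLARE colCur CURSOR LOCAL FAST_FORWARD FOR
	SELECT columnName
	FROM #PatientColumnToStatMapping
	WHERE columnName NOT IN (SELECT ColumnName FROM #histogram);

OPEN colCur;
FETCH NEXT FROM colCur INTO @col;
WHILE @@FETCH_STATUS = 0
BEGIN
	TRUNCATE TABLE #streamLoop;
	SET @cmd = N'DBCC SHOW_STATISTICS (''dbo.Donor'',' + @col + N') WITH STATS_STREAM, NO_INFOMSGS';
	INSERT #streamLoop EXEC (@cmd);
	INSERT #histogram SELECT @col, CONVERT(varchar(max), stream, 1) FROM #streamLoop;
	FETCH NEXT FROM colCur INTO @col;
END;
CLOSE colCur;
DEALLOCATE colCur;
DROP TABLE #streamLoop;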

Let’s take a look at #histogram:

select * from #histogram;
ColumnName StreamVarchar
CustomerKey 0x0100000001000000000000000000000053D9EDA90000000….
PK_Donor 0x010000000200000000000000000000004432204E0000000…
ProductKey 0x01000000010000000000000000000000024F35100000000B…

Finally, implant the histograms back into the original table using UPDATE STATISTICS via dynamic SQL. Rather than looping, I’ve used a single semicolon-delimited string to store all the UPDATE STATISTICS commands. The NORECOMPUTE option is required (again) because this property isn’t part of the imported STATS_STREAM, and we want the stats on dbo.FactInternetSales to be NORECOMPUTE. This is because the “Rows” setting on the statistic is transferred and may otherwise trigger an expensive auto stats update (although I think it defers to the modification counter for this behaviour). In any case, when scheduling the update of stats on large tables, you probably don’t want them to recompute outside your schedule.

DECLARE @SQL NVARCHAR(MAX) = ';'; 
SELECT @SQL = @SQL + N'UPDATE STATISTICS dbo.FactInternetSales(' + Map.StatsName + N') WITH NORECOMPUTE
, STATS_STREAM=' + StreamVarchar + N';' 
FROM #histogram Hist 
join #PatientColumnToStatMapping Map on Hist.ColumnName = Map.columnName ; 
EXECUTE sp_executesql @SQL; 
SELECT @SQL; 

where @SQL looks like this:

;UPDATE STATISTICS dbo.FactInternetSales(_WA_Sys_00000005_276EDEB3) WITH NORECOMPUTE, STATS_STREAM = 0x0100000001000000000000000000000053D9EDA90000000005170...
 UPDATE STATISTICS dbo.FactInternetSales(_WA_Sys_00000001_276EDEB3) WITH NORECOMPUTE, STATS_STREAM = 0x01000000010000000000000000000000024F351000000000B81400...

We should also update stats on the primary key (although I’m not going to use it in my example):

DECLARE @SQL_PK NVARCHAR(MAX);

SELECT @SQL_PK =  N'UPDATE STATISTICS dbo.FactInternetSales 
(PK_FactInternetSales_SalesOrderNumber_SalesOrderLineNumber) WITH NORECOMPUTE, STATS_STREAM=' + StreamVarchar + ';'
FROM #histogram WHERE ColumnName = 'PK_Donor';
EXECUTE sp_executesql @SQL_PK;
SELECT @SQL_PK;

where @SQL_PK looks like:

UPDATE STATISTICS dbo.FactInternetSales
(PK_FactInternetSales_SalesOrderNumber_SalesOrderLineNumber) WITH NORECOMPUTE, 
STATS_STREAM=0x010000000200000000000000000000004432204E00000000EB4F00...
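At this point it’s worth confirming the transplants landed on the patient. A quick check (my own addition, assuming SQL 2012 SP1 or later for sys.dm_db_stats_properties): you should see rows reflecting the full table and rows_sampled reflecting the donor sample (roughly 31k in this example).

SELECT s.name, sp.last_updated, sp.rows, sp.rows_sampled, sp.steps
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('dbo.FactInternetSales')
ORDER BY sp.last_updated DESC;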

 

How do the new stats impact query plans?

DBCC FREEPROCCACHE;
select count(1) 'Actual number of rows'  from dbo.FactInternetSales;
select count(1) 'Rows for product 217' from dbo.FactInternetSales where ProductKey = 217;

[Screenshot: the transplanted histogram for ProductKey (showing 1121 rows for ProductKey 217) and the query plan estimating 2176.21 rows]

So this is where it gets interesting. The stats are showing 1121 rows, but the query plan is estimating 2176.21 rows. That’s because the optimizer scales up from the number of rows sampled to the total number of rows in the table (using the alternative metadata I mentioned earlier), i.e.:

2176.21 = 1121 * 60398 / 31112

Finally, let’s demonstrate that query plans are not unduly impacted by the faster stats update, and at the same time demonstrate the (slow version of the) rollback process:

update statistics dbo.FactInternetSales;
DBCC SHOW_STATISTICS ('dbo.FactInternetSales',ProductKey);

[Screenshot: DBCC SHOW_STATISTICS output for dbo.FactInternetSales after the conventional update, showing the original sample rate and the new histogram rowcount]

Note in the output that SQL has reverted to the original stats sample rate, and that the histogram rowcount reflects the new sample.

DBCC FREEPROCCACHE; 
select count(1) from dbo.FactInternetSales where ProductKey = 217; 
[Screenshot: query plan for the ProductKey = 217 query after the conventional update, estimating 1988.45 rows]

As you can see, the estimated number of rows has changed from 2176.21 to 1988.45 (compared to an actual of 2085), due to the randomness of the sampling rather than the sample size. Fully accurate estimates can only be obtained by updating stats with FULLSCAN, which is impractical for some of the large tables found in a typical data warehouse.

Clean up:

drop table dbo.Donor;
drop table #histogram;
drop table #PatientColumnToStatMapping;

[End]
