Median Strip

SQL Server has a number of aggregate functions, including MAX, COUNT and AVG, but if you want to calculate MEDIAN you will be rolling your own, probably with a little help from Google. If your data is appropriately indexed and performance is critical then Paul White has your back: an OFFSET-FETCH method for SQL Server 2012 onwards and a dynamic cursor method for earlier versions, each calculated both singly and per group. It’s hard to imagine faster solutions, as they jump almost directly to the median values.

For data without a supporting index it gets a lot more interesting, as the established methods are quite memory- and time-consuming: they all involve sorting the data. Dwain Camps seems to have developed the leading approach and applied it per group. This built on Aaron Bertrand’s work documenting the techniques and running performance tests across a single dataset.

I have a few ideas to present that massively strip back the amount of sorting required to calculate medians without an index, resulting in huge performance gains. These solutions are, unfortunately, quite “elaborate”, but I’ve managed to avoid unsupported features, query hints and trace flags (for a change).

The first idea is one I mentioned in a comment on Aaron’s post above; it is best suited to calculating the median across an entire large table (just one column, though!). In a nutshell, the algorithm is:

1. Examine the statistics histogram for the column to determine a small range of values that is likely to encompass the median. This histogram doesn’t need to be accurate, because we will run one or two quick scans of the entire table to firm up the counts for the critical ranges in the middle of the histogram and identify the range that definitely contains our median. The stats histogram just tells us where to start looking.

2. You have now identified a small set of rows (around 1/200th of the entire table) to be sorted using one of the more elegant median techniques, remembering to apply an adjusted offset (subtracting the number of rows below this range), because the median of this subset is almost certainly not the median of the table.
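
In outline (each step is spelled out for real in the example that follows):

-- 1. Read the stats histogram for the column and find the step whose
--    running row count straddles the middle of the table (~row N/2).
-- 2. Scan the table once or twice: COUNT the rows below and within that
--    step to confirm which step really holds rows (N+1)/2 and (N+2)/2,
--    and to pin down the exact row count N.
-- 3. Compute the median over that step's rows only, using an offset
--    reduced by the confirmed count of rows sorting below the step.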

As clear as custard? Let’s jump into an example. First, create and populate a (heavily skewed) heap of sample data (thanks, Aaron):

CREATE TABLE dbo.obj 
(
    id  integer NOT NULL IDENTITY(1,1), 
    val integer NOT NULL
);
 
INSERT dbo.obj WITH (TABLOCKX) 
    (val)
SELECT TOP (10000000) 
    AO.[object_id]
FROM sys.all_columns AS AC
CROSS JOIN sys.all_objects AS AO
CROSS JOIN sys.all_objects AS AO2
WHERE    AO.[object_id] > 0
ORDER BY AC.[object_id];

Obviously no stats exist yet, so create some:

CREATE STATISTICS valStats ON dbo.obj(val) WITH SAMPLE 10 PERCENT; -- we don't need the accuracy of a full scan
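
As an aside, if your build has sys.dm_db_stats_properties (SQL Server 2008 R2 SP2 / 2012 SP1 onwards), you can read back the row count the statistic carries; we’ll lean on that estimate shortly:

-- 'rows' here is the estimate consistent with the histogram we're about to read
SELECT  sp.last_updated, sp.rows, sp.rows_sampled, sp.steps
FROM    sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.[object_id], s.stats_id) AS sp
WHERE   s.[object_id] = OBJECT_ID('dbo.obj')
AND     s.[name] = 'valStats';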

Dump the newly created histogram to a temp table for analysis:

CREATE TABLE #Histogram 
(   [Range] INT IDENTITY(1,1), -- useful for demo, not really required
    RANGE_HI_KEY INT,
    RANGE_ROWS BIGINT,          -- DBCC outputs these row counts as real;
    EQ_ROWS BIGINT,             -- implicit conversion to BIGINT is fine here
    DISTINCT_RANGE_ROWS BIGINT, 
    AVG_RANGE_ROWS BIGINT); 

INSERT INTO #Histogram 
EXEC ('DBCC SHOW_STATISTICS([dbo.obj],valStats) WITH HISTOGRAM');

SELECT  thisRange.[Range], thisRange.RANGE_HI_KEY, thisRange.RANGE_ROWS, thisRange.EQ_ROWS, thisRange.DISTINCT_RANGE_ROWS, thisRange.AVG_RANGE_ROWS 
, ISNULL(SUM(prevRange.RANGE_ROWS) + SUM(prevRange.EQ_ROWS), 0) 'RunningCountBelowRange' -- ISNULL: the first range has no predecessors
, ISNULL(SUM(prevRange.RANGE_ROWS) + SUM(prevRange.EQ_ROWS), 0) + MAX(thisRange.RANGE_ROWS) + MAX(thisRange.EQ_ROWS) 'RunningCountIncludingRange' 
FROM #Histogram thisRange
LEFT JOIN  #Histogram prevRange ON prevRange.RANGE_HI_KEY < thisRange.RANGE_HI_KEY
GROUP BY thisRange.[Range], thisRange.RANGE_HI_KEY, thisRange.RANGE_ROWS, thisRange.EQ_ROWS, thisRange.DISTINCT_RANGE_ROWS, thisRange.AVG_RANGE_ROWS
ORDER BY thisRange.RANGE_HI_KEY;

 

[Figure: histogram analysis output, with range 101 highlighted as the first range whose RunningCountIncludingRange passes row 5,000,000]

Since we know there are about 10 million rows in the table (not because we just created the test data, but because an estimated row count is available in the statistic itself – we should use that value because it will at least be consistent with the histogram), we are looking for rows 5,000,000 and 5,000,001 to determine the median. According to this (possibly inaccurate) histogram, we’re most likely to find those rows somewhere in range 101, with the median value between 661,577,395 and 677,577,452. However, we need to check the actual table, and we also need to establish exactly how many rows it contains:

SELECT  count(1) 'CountLowerThanRange101' FROM dbo.obj WHERE   Val <= 661577395
SELECT  count(1) 'CountInRange101' FROM dbo.obj WHERE 661577395 < Val AND Val <= 677577452
SELECT  count(1) 'TotalRowCount' FROM dbo.obj

 

[Results: CountLowerThanRange101 = 4,935,830; CountInRange101 = 47,775; TotalRowCount = 10,000,000]

 

Unfortunately these counts prove the median lies in a higher range: only 4,935,830 + 47,775 = 4,983,605 rows sort at or below range 101, short of row 5,000,000. So we’ll run another scan to try the next bucket:

SELECT COUNT(1) 'Count_In_Range_102'
FROM dbo.obj WHERE 677577452 < Val AND Val <= 693577509 

 

[Result: Count_In_Range_102 = 47,776]

Bingo! We now know that range 102 contains the rows sorted at positions 4,935,830 + 47,775 + 1 onwards, that is positions 4,983,606 to 5,031,381, and therefore includes our median rows. Reviewing the histogram tells us the value of our median will be between 677,577,452 and 693,577,509:

[Figure: histogram analysis output, with range 102 (RANGE_HI_KEY 693,577,509) highlighted]

We can use this to construct a final median calculation over the stripped-down set of rows to be sorted – it’s the range filter in the WHERE clause that’s responsible for the great performance. Here I contrast it with the standard (SQL 2012) OFFSET-FETCH approach:

[Figure: the two queries and their run times]
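
Stripped of the timing scaffolding in the full script below, the pair look like this; the faster version sorts only the ~48,000 rows of range 102 and shifts the OFFSET down by the 4,983,605 rows (4,935,830 + 47,775) known to sort below that range:

DECLARE @c INT = 10000000;  -- exact row count, established above

SELECT AVG(1.0 * val) AS FasterMedian
FROM (
    SELECT val FROM dbo.obj
    WHERE 677577452 < val AND val <= 693577509   -- range 102 only
    ORDER BY val
    OFFSET (@c - 1) / 2 - 4983605 ROWS           -- less the rows below range 102
    FETCH NEXT 1 + (1 - @c % 2) ROWS ONLY        -- one row if @c is odd, two if even
) AS x;

SELECT AVG(1.0 * val) AS SimplerMedian
FROM (
    SELECT val FROM dbo.obj
    ORDER BY val
    OFFSET (@c - 1) / 2 ROWS
    FETCH NEXT 1 + (1 - @c % 2) ROWS ONLY
) AS x;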

I’ve used the OFFSET-FETCH technique for clarity. You might want to substitute Dwain’s faster approach (particularly if you’re not on SQL Server 2012 yet), but the overall effect will be marginal, as we’ve already narrowed down the search.

If we were writing real code for this, we’d also need to cater for the rare case where the two middle values (with an even number of rows) fall in different ranges, and we’d probably need to be more careful with <, <= and the general arithmetic, but you get the idea.

The good news is that my performance tests indicate this is significantly faster than the vanilla OFFSET-FETCH method on Aaron’s 10m-row table, even if you have to create statistics on the column because none previously existed. Because the sort is so small, the memory requirements are minimal too. Performance comparisons are inevitably more dramatic for larger data sets, but will erode, and eventually reverse, for small data sets due to the overhead of reading the statistics and doing some quite tortuous arithmetic.

Here’s my entire script:

USE [AdventureWorksDW2014];
IF OBJECT_ID('dbo.obj') IS NOT NULL DROP TABLE dbo.obj; -- make the script rerunnable
CREATE TABLE dbo.obj -- as a heap 
(
    id  INTEGER  NOT NULL IDENTITY(1,1), 
    val INTEGER NOT NULL
);
 
INSERT dbo.obj WITH (TABLOCKX) 
    (val)
SELECT TOP (10000000) 
    AO.[object_id]
FROM sys.all_columns AS AC
CROSS JOIN sys.all_objects AS AO
CROSS JOIN sys.all_objects AS AO2
WHERE    AO.[object_id] > 0
ORDER BY AC.[object_id];

IF EXISTS (SELECT 1 FROM sys.stats WHERE [object_id] = OBJECT_ID('dbo.obj') AND [name] = 'valStats')
    DROP STATISTICS dbo.obj.valStats;
CREATE STATISTICS valStats ON dbo.obj(val) WITH SAMPLE 10 PERCENT; -- we don't need the accuracy of a full scan; the default sample rate would also do
GO
SET STATISTICS TIME ON

DECLARE @sink INT, @FasterMedian DECIMAL(19,2), @SimplerMedian DECIMAL (19,2)
,@ms1 INT, @ms2 INT

DECLARE @Start DATETIME2 = SYSUTCDATETIME();

CREATE TABLE #Histogram 
(   [Range] INT IDENTITY(1,1), 
    RANGE_HI_KEY INT,
    RANGE_ROWS BIGINT, 
    EQ_ROWS BIGINT, 
    DISTINCT_RANGE_ROWS BIGINT, 
    AVG_RANGE_ROWS BIGINT); 

INSERT INTO #Histogram 
EXEC ('DBCC SHOW_STATISTICS([dbo.obj],valStats) WITH HISTOGRAM');

SELECT  thisRange.[Range], thisRange.RANGE_HI_KEY, thisRange.RANGE_ROWS, thisRange.EQ_ROWS, thisRange.DISTINCT_RANGE_ROWS, thisRange.AVG_RANGE_ROWS 
, ISNULL(SUM(prevRange.RANGE_ROWS) + SUM(prevRange.EQ_ROWS), 0) 'RunningCountBelowRange' -- ISNULL: the first range has no predecessors
, ISNULL(SUM(prevRange.RANGE_ROWS) + SUM(prevRange.EQ_ROWS), 0) + MAX(thisRange.RANGE_ROWS) + MAX(thisRange.EQ_ROWS) 'RunningCountIncludingRange' 
INTO #HistogramAnalysis
FROM #Histogram thisRange
LEFT JOIN #Histogram prevRange ON prevRange.RANGE_HI_KEY < thisRange.RANGE_HI_KEY
GROUP BY thisRange.[Range], thisRange.RANGE_HI_KEY, thisRange.RANGE_ROWS, thisRange.EQ_ROWS, thisRange.DISTINCT_RANGE_ROWS, thisRange.AVG_RANGE_ROWS
ORDER BY thisRange.RANGE_HI_KEY;


-- Firm up the counts around the candidate range (boundaries taken from the
-- histogram analysis above); the results are discarded into @sink because
-- we've already inspected them, but the scans are included in the timing.
SELECT  @sink = COUNT(1) FROM dbo.obj WHERE Val <= 661577395
SELECT  @sink = COUNT(1) FROM dbo.obj WHERE 661577395 < Val AND Val <= 677577452
SELECT  @sink = COUNT(1) FROM dbo.obj


DECLARE @c INT = 10000000 -- exact row count, as returned by the full-table scan above
SELECT @FasterMedian = AVG(1.0 * val) 
FROM (
    SELECT val FROM dbo.obj WHERE 677577452 < Val AND Val <= 693577509 -- range 102 only
    ORDER BY val
    OFFSET (@c - 1) / 2 - 4983605 ROWS -- less the 4,935,830 + 47,775 rows sorting below range 102
    FETCH NEXT 1 + (1 - @c % 2) ROWS ONLY
) AS x;

SELECT @ms1 = DATEDIFF(MILLISECOND, @Start, SYSUTCDATETIME()); 

-- Baseline: the standard OFFSET-FETCH median over the whole table
SELECT @SimplerMedian = AVG(1.0 * val) 
FROM (
    SELECT val FROM dbo.Obj 
     ORDER BY val
     OFFSET (@c - 1) / 2 ROWS
     FETCH NEXT 1 + (1 - @c % 2) ROWS ONLY
) AS x;

SELECT @ms2 = DATEDIFF(MILLISECOND, @Start, SYSUTCDATETIME()); 

SELECT @ms1 'Faster median ms',  @ms2-@ms1 'Simpler median ms', @FasterMedian '@FasterMedian', @SimplerMedian '@SimplerMedian'
DROP TABLE #Histogram
DROP TABLE #HistogramAnalysis

DROP TABLE dbo.obj

and the outcome (using pre-existing stats, and needing two bites at the histogram):

[Results grid: Faster median ms, Simpler median ms, @FasterMedian, @SimplerMedian]

This method isn’t sensible for use on a suitably indexed column. The established techniques are low-maintenance and extremely fast, and of course why bother going to all this trouble to reduce the number of rows to be sorted, when the data are already sorted!?

However, this approach can be extended to derive the median across a single group of rows in a heap. Instead of using the raw column statistics, I would extract a small pseudo-random sample of rows from the group into a temp table and calculate stats on that table with a full scan. Scale this histogram up, apply it to the original filtered table, and voilà!
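
A minimal sketch of that sampling step, assuming a hypothetical grouping column grp on the heap and an arbitrary 1-in-100 sample rate (both would need tuning in real code):

DECLARE @grp INT = 1;  -- hypothetical group key

-- pull a pseudo-random ~1% sample of the group into a temp table...
SELECT  val
INTO    #grpSample
FROM    dbo.obj
WHERE   grp = @grp
AND     ABS(CHECKSUM(NEWID()) % 100) = 0;  -- roughly 1 row in 100

-- ...and build an accurate histogram on the sample
CREATE STATISTICS sampleStats ON #grpSample(val) WITH FULLSCAN;

-- scale the sample histogram's counts up by ~100x, then proceed as before
-- against dbo.obj filtered to grp = @grp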

As you can see, this is going to get quite unwieldy if we need median values across multiple groups within a table or query. So in my next post I will apply a similar method to the more general problem. I won’t be using the stats histogram, but the method will still cover the simple case of a single median, so I’m hoping it will make the statistics histogram approach redundant.

 

[End]

 

 

 
