SQL Server has a number of aggregate functions, including MAX, COUNT and AVG, but if you want to calculate a MEDIAN you will be rolling your own, probably with a little help from Google. If your data is appropriately indexed and performance is critical, Paul White has your back: the OFFSET-FETCH method for SQL Server 2012 onwards and a dynamic cursor method for earlier versions, each calculated both singly and per group. It’s hard to imagine a faster solution, as those provided pretty much jump directly to the median values.

For data without a supporting index it gets a lot more interesting, as the established methods are quite memory- and time-consuming: they all involve sorting the data. Dwain Camps seems to have developed the leading approach and applied it per group, building on Aaron Bertrand’s work documenting and running performance tests across a single dataset.

I have a few ideas to present that will massively strip back the amount of sorting required to calculate medians without an index, resulting in huge performance gains. Unfortunately these solutions are quite “elaborate”, but I’ve managed to avoid unsupported features, query hints and trace flags (for a change).

I mentioned the first idea in a comment on Aaron’s post above; it is best suited to calculating the median across an entire large table (just one column, though!). In a nutshell the algorithm is:

1. Examine the statistics histogram for the column to determine a small range of values that is likely to encompass the median. This histogram doesn’t need to be accurate, because we will run one or two quick scans of the entire table to firm up the counts for critical ranges in the middle of the histogram and identify the range that definitely contains our median. The stats histogram just tells us where to start looking.

2. You have now identified a small set of records (around 1/200th of the entire table) that needs to be sorted using one of the more elegant median techniques, remembering to apply an adjusted offset (using the number of records less than this range), because the median of this set is almost certainly not the median of the table.
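The two steps above can be sketched in a few lines of Python (Python rather than T-SQL, purely to make the control flow concrete). The 200-bucket equal-width histogram here is an illustrative stand-in for SQL Server’s stats histogram, and `bucket_of` stands in for the RANGE_HI_KEY boundaries; the even-row-count edge case (where the two middle rows straddle a bucket boundary) is deliberately glossed over, just as it is in the article’s happy path.

```python
def bucket_of(v, lo, width, num_buckets):
    """Map a value to its histogram bucket index."""
    return min(int((v - lo) / width), num_buckets - 1)

def approx_median(values, num_buckets=200):
    n = len(values)
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets or 1.0

    # Cheap pass: per-bucket counts (the "histogram").
    counts = [0] * num_buckets
    for v in values:
        counts[bucket_of(v, lo, width, num_buckets)] += 1

    # Walk the running counts to find the bucket holding the middle rank.
    target = (n - 1) // 2          # 0-based rank of the (lower) median row
    below, b = 0, 0
    while below + counts[b] <= target:
        below += counts[b]
        b += 1

    # Sort ONLY that bucket, adjusting the rank by the rows below it.
    subset = sorted(v for v in values
                    if bucket_of(v, lo, width, num_buckets) == b)
    return subset[target - below]  # exact for odd n; even n also needs
                                   # the two-bucket edge case handled
```

Only roughly 1/200th of the values ever reach the sort, which is the whole point of the technique.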

As clear as custard? Let’s jump into an example. First create and populate a (heavily skewed) heap of sample data (thanks, Aaron):

```sql
CREATE TABLE dbo.obj
(
    id  integer NOT NULL IDENTITY(1,1),
    val integer NOT NULL
);

INSERT dbo.obj WITH (TABLOCKX) (val)
SELECT TOP (10000000) AO.[object_id]
FROM sys.all_columns AS AC
CROSS JOIN sys.all_objects AS AO
CROSS JOIN sys.all_objects AS AO2
WHERE AO.[object_id] > 0
ORDER BY AC.[object_id];
```

Obviously no stats exist yet, so create some:

```sql
-- we don't need the accuracy of a full scan
CREATE STATISTICS valStats ON dbo.obj(val) WITH SAMPLE 10 PERCENT;
```

Dump the newly created histogram to a temp table for analysis:

```sql
CREATE TABLE #Histogram
(
    [Range] INT IDENTITY(1,1),  -- useful for demo, not really required
    RANGE_HI_KEY INT,
    RANGE_ROWS BIGINT,
    EQ_ROWS BIGINT,
    DISTINCT_RANGE_ROWS BIGINT,
    AVG_RANGE_ROWS BIGINT
);

INSERT INTO #Histogram
EXEC ('DBCC SHOW_STATISTICS([dbo.obj], valStats) WITH HISTOGRAM');

SELECT thisRange.[Range], thisRange.RANGE_HI_KEY, thisRange.RANGE_ROWS,
       thisRange.EQ_ROWS, thisRange.DISTINCT_RANGE_ROWS, thisRange.AVG_RANGE_ROWS,
       SUM(prevRange.RANGE_ROWS) + SUM(prevRange.EQ_ROWS) AS 'RunningCountBelowRange',
       SUM(prevRange.RANGE_ROWS) + SUM(prevRange.EQ_ROWS)
         + MAX(thisRange.RANGE_ROWS) + MAX(thisRange.EQ_ROWS) AS 'RunningCountIncludingRange'
FROM #Histogram thisRange
LEFT JOIN #Histogram prevRange
       ON prevRange.RANGE_HI_KEY < thisRange.RANGE_HI_KEY
GROUP BY thisRange.[Range], thisRange.RANGE_HI_KEY, thisRange.RANGE_ROWS,
         thisRange.EQ_ROWS, thisRange.DISTINCT_RANGE_ROWS, thisRange.AVG_RANGE_ROWS
ORDER BY thisRange.RANGE_HI_KEY;
```

Since we know there are about 10 million rows in the table (not because we just created the test data, but because an estimated row count is available in the statistic – we should use this value because it will at least be consistent with the histogram), we are looking for rows 5,000,000 and 5,000,001 to determine the median. According to this (possibly inaccurate) histogram, we’re most likely to find these rows somewhere in range 101, with the median value between 661,577,395 and 677,577,452. However, we need to check the actual table, and we also need to establish exactly how many rows it contains:
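The running-count logic of the histogram query boils down to a simple scan: accumulate RANGE_ROWS + EQ_ROWS per histogram step and stop at the first step whose running total reaches the target rank. A Python restatement, using small made-up histogram steps (the tuple values are illustrative, not from the demo table):

```python
def find_candidate_step(histogram, target_rank):
    """Each step is (RANGE_HI_KEY, RANGE_ROWS, EQ_ROWS): rows strictly
    below the key, and rows equal to the key. Returns (index,
    previous_hi_key, hi_key, rows_below) for the first step whose
    running count reaches target_rank (1-based)."""
    running = 0
    prev_hi = None
    for i, (hi_key, range_rows, eq_rows) in enumerate(histogram):
        step_total = range_rows + eq_rows
        if running + step_total >= target_rank:
            return i, prev_hi, hi_key, running
        running += step_total
        prev_hi = hi_key
    raise ValueError("target_rank exceeds histogram total")

# Three steps covering 230 rows; which step should hold rank 150?
hist = [(100, 0, 40), (200, 60, 30), (300, 90, 10)]
print(find_candidate_step(hist, 150))   # -> (2, 200, 300, 130)
```

The returned key pair is exactly the value range the article then confirms with a real scan of the table, since the sampled histogram counts are only estimates.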

```sql
SELECT COUNT(1) AS 'CountLowerThanRange101'
FROM dbo.obj
WHERE Val <= 661577395;

SELECT COUNT(1) AS 'CountInRange101'
FROM dbo.obj
WHERE 661577395 < Val AND Val <= 677577452;

SELECT COUNT(1) AS 'TotalRowCount'
FROM dbo.obj;
```

Unfortunately this proves the median is in a higher range, so we’ll run another scan to try the next bucket:

```sql
SELECT COUNT(1) AS 'Count_In_Range_102'
FROM dbo.obj
WHERE 677577452 < Val AND Val <= 693577509;
```

Bingo! We now know that range 102 contains the rows sequenced from 4935830 + 47775 + 1, that is from 4,983,606 to 5,031,381, and therefore includes our median. Reviewing the histogram tells us the value of our median will be between 677,577,452 and 693,577,509.
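The adjusted-offset arithmetic for this example is worth spelling out, because it is where off-by-one mistakes creep in. Using the counts from the two scans above (OFFSET takes the number of rows to skip, i.e. the 0-based rank of the first row wanted):

```python
# Check the offset arithmetic for the 10M-row example. The two counts
# come from the confirming scans of the table; the rest is arithmetic.

n = 10_000_000
rows_below_range_101 = 4_935_830   # val <= 661577395
rows_in_range_101    = 47_775      # 661577395 < val <= 677577452
rows_below_range_102 = rows_below_range_101 + rows_in_range_101

lower_median_rank = (n - 1) // 2            # rows OFFSET must skip overall
offset_in_subset = lower_median_rank - rows_below_range_102
rows_to_fetch = 1 + (1 - n % 2)             # 2 rows when n is even

print(rows_below_range_102, offset_in_subset, rows_to_fetch)
# -> 4983605 16394 2
```

So within range 102 we skip 16,394 rows and fetch two, landing on global rows 5,000,000 and 5,000,001.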

We can use this to construct a final median calculation over the stripped-down set of rows to be sorted – it’s the WHERE clause restricting the scan to range 102, together with the adjusted OFFSET, that’s responsible for the great performance. In the full script below I contrast it with the standard (SQL 2012) OFFSET-FETCH approach:

I’ve used the OFFSET-FETCH technique for clarity. You might want to substitute Dwain’s faster approach (particularly if you’re not on SQL Server 2012 yet), but the overall effect will be marginal as we’ve already narrowed down the search.

If we were writing real code for this, we’d also need to cater for the rare case where the two median values (for an even number of rows) fall in different ranges, and we’d probably be more careful with <, <= and the general arithmetic, but you get the idea.
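That even-row edge case is easy to handle once stated precisely: if the upper of the two middle ranks falls past the end of the chosen range, widen the subset to include the next range (or ranges) before sorting. A sketch, with illustrative per-range counts rather than anything from the demo table:

```python
def median_ranks(n):
    """0-based ranks of the one (odd n) or two (even n) middle rows."""
    return ((n - 1) // 2, n // 2)

def ranges_needed(running_below, counts, n):
    """Indices of the consecutive ranges covering both middle ranks.
    running_below: rows below the first candidate range;
    counts: row counts of that range and its successors."""
    k1, k2 = median_ranks(n)
    last = 0
    covered = running_below + counts[0]
    while covered <= k2:           # upper middle rank not yet covered
        last += 1
        covered += counts[last]
    return list(range(last + 1))

# n = 10 rows, 4 below the candidate range, which holds exactly 1 row:
# ranks 4 and 5 straddle two ranges, so both must be sorted.
print(ranges_needed(4, [1, 3, 2], 10))   # -> [0, 1]
```

In the T-SQL version this simply means widening the WHERE clause to the next RANGE_HI_KEY boundary; the sort is still tiny.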

The good news is that my performance tests indicate this to be significantly faster than the vanilla OFFSET-FETCH method for Aaron’s 10m row table, even if you have to calculate stats on that column because they didn’t previously exist. Because the sort is so small, the memory requirements are minimised too. Performance comparisons are inevitably more dramatic for larger data sets, but will erode (and eventually turn negative) for small data sets due to the overhead of reading the stats and doing some quite tortuous arithmetic.

Here’s my entire script:

```sql
USE [AdventureWorksDW2014];

DROP TABLE dbo.obj;
CREATE TABLE dbo.obj -- as a heap
(
    id  INTEGER NOT NULL IDENTITY(1,1),
    val INTEGER NOT NULL
);

INSERT dbo.obj WITH (TABLOCKX) (val)
SELECT TOP (10000000) AO.[object_id]
FROM sys.all_columns AS AC
CROSS JOIN sys.all_objects AS AO
CROSS JOIN sys.all_objects AS AO2
WHERE AO.[object_id] > 0
ORDER BY AC.[object_id];

DROP STATISTICS dbo.obj.valStats;
CREATE STATISTICS valStats ON dbo.obj(val) WITH SAMPLE 10 PERCENT;
-- we don't need the accuracy of a full scan, the default sample rate will do
GO

SET STATISTICS TIME ON;

DECLARE @sink INT, @FasterMedian DECIMAL(19,2), @SimplerMedian DECIMAL(19,2),
        @ms1 INT, @ms2 INT;
DECLARE @Start DATETIME2 = SYSUTCDATETIME();

CREATE TABLE #Histogram
(
    [Range] INT IDENTITY(1,1),
    RANGE_HI_KEY INT,
    RANGE_ROWS BIGINT,
    EQ_ROWS BIGINT,
    DISTINCT_RANGE_ROWS BIGINT,
    AVG_RANGE_ROWS BIGINT
);

INSERT INTO #Histogram
EXEC ('DBCC SHOW_STATISTICS([dbo.obj], valStats) WITH HISTOGRAM');

SELECT thisRange.[Range], thisRange.RANGE_HI_KEY, thisRange.RANGE_ROWS,
       thisRange.EQ_ROWS, thisRange.DISTINCT_RANGE_ROWS, thisRange.AVG_RANGE_ROWS,
       SUM(prevRange.RANGE_ROWS) + SUM(prevRange.EQ_ROWS) AS 'RunningCountBelowRange',
       SUM(prevRange.RANGE_ROWS) + SUM(prevRange.EQ_ROWS)
         + MAX(thisRange.RANGE_ROWS) + MAX(thisRange.EQ_ROWS) AS 'RunningCountIncludingRange'
INTO #HistogramAnalysis
FROM #Histogram thisRange
LEFT JOIN #Histogram prevRange
       ON prevRange.RANGE_HI_KEY < thisRange.RANGE_HI_KEY
GROUP BY thisRange.[Range], thisRange.RANGE_HI_KEY, thisRange.RANGE_ROWS,
         thisRange.EQ_ROWS, thisRange.DISTINCT_RANGE_ROWS, thisRange.AVG_RANGE_ROWS
ORDER BY thisRange.RANGE_HI_KEY;

SELECT @sink = COUNT(1) FROM dbo.obj WHERE Val <= 661577395;
SELECT @sink = COUNT(1) FROM dbo.obj WHERE 661577395 < Val AND Val <= 677577452;
SELECT @sink = COUNT(1) FROM dbo.obj;

DECLARE @c INT = 10000000;

SELECT @FasterMedian = AVG(1.0 * val)
FROM
(
    SELECT val
    FROM dbo.Obj
    WHERE 677577452 < Val AND Val <= 693577509
    ORDER BY val
    OFFSET (@c - 1) / 2 - 4983605 ROWS -- skip rank minus the 4,935,830 + 47,775 rows below range 102
    FETCH NEXT 1 + (1 - @c % 2) ROWS ONLY
) AS x;

SELECT @ms1 = DATEDIFF(MILLISECOND, @Start, SYSUTCDATETIME());

SELECT @SimplerMedian = AVG(1.0 * val)
FROM
(
    SELECT val
    FROM dbo.Obj
    ORDER BY val
    OFFSET (@c - 1) / 2 ROWS
    FETCH NEXT 1 + (1 - @c % 2) ROWS ONLY
) AS x;

SELECT @ms2 = DATEDIFF(MILLISECOND, @Start, SYSUTCDATETIME());

SELECT @ms1 'Faster median ms', @ms2 - @ms1 'Simpler median ms',
       @FasterMedian '@FasterMedian', @SimplerMedian '@SimplerMedian';

DROP TABLE #Histogram;
DROP TABLE #HistogramAnalysis;
DROP TABLE dbo.obj;
```

and the outcome (using pre-existing stats, and needing two bites at the histogram):

This method isn’t sensible for use on a suitably indexed column. The established techniques are low-maintenance and extremely fast there, and in any case why go to all this trouble to reduce the number of rows to be sorted when the index means the data is already sorted!?

However, this approach can be extended to derive the median across a single group of rows in a heap. Instead of using the raw column statistics, I would extract a small pseudo-random sample of rows from the group into a temp table, and create statistics on that table with a full scan. Scale this histogram up, apply it to the original filtered table, and voilà!
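As a rough sketch of that idea (this is my speculative restatement, not the author’s tested code): sample the group, use the sample’s order statistics as a stand-in histogram, and pick a value range around the sample median to count and sort precisely. The `slack` parameter is a guess at how inaccurate a small sample may be; if the confirming counts show the true median falls outside the returned range, you widen the slack and take another bite, just like the second histogram scan above.

```python
import random

def candidate_range(group_values, sample_size=1000, slack=0.05):
    """Return a (low, high) value pair likely to bracket the group's
    median, from a pseudo-random sample. slack widens the bracket to
    absorb sampling error; it is a tuning guess, not derived."""
    sample = sorted(random.sample(group_values,
                                  min(sample_size, len(group_values))))
    m = len(sample)
    lo_idx = max(0, int(m * (0.5 - slack)))
    hi_idx = min(m - 1, int(m * (0.5 + slack)))
    return sample[lo_idx], sample[hi_idx]

# The SQL equivalent would then COUNT rows below `low` and rows in
# (low, high], verify the median rank falls inside, and sort only that
# slice with the adjusted OFFSET.
```

The sort still touches only around 10% of the group (2 × slack), rather than all of it.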

As you can see, this is going to get quite unwieldy if we require median values across multiple groups within a table or query. So in my next post I will apply a similar method to the more general problem. I won’t be using the stats histogram, but the method will still cover the simple case of an individual median, so I’m hoping it will make the statistics histogram approach redundant.
