So you have a large SQL dataset? Say 1,000,000+ records? Need to convey simple trends in data over vast historical periods on-demand? Sounds like a great candidate for data decimation (aka Down Sampling). In some recent Silverlight development, this very issue was raised and solved using examples provided in this post. The visualization project in question required that users be able to visualize trends in statistics data over significant historical periods (5+ years, 200+ points a day), support user selectable date ranges, and refresh UI visualization controls within an application mandated 3-5 second response window. Decimation proved to be both valueable in meeting project spec and allowed users to target specific ranges based on trends for further investigation (drill down) that are generally not available using
For this post, we'll generate 1,000,000+ records of sample data and walk through the steps of creating a simple data decimation solution that accepts custom date range information and evenly displays decimated data for the requested date range. The T-SQL syntax can easily be ported into Stored Procedures, Views and more. Time for data generation!
Creating The Sample Data Set
For the purpose of this post, we'll add the following table schema to a stock
AdventureWorks2008R2 SQL database, though any SQL database will do. We'll use this new schema to house 1,000,000 sample
float value records for use with our sample data decimation solution.
CREATE TABLE [dbo].[SampleStats] ( [ID] [int] IDENTITY(1,1) NOT NULL, [MeanDate] [date] NULL, [MeanValue] [float] NULL )
Now that we have our T-SQL table in place, we'll make use of Redgate's SQL Data Generator utility to rapidly fill our table with 1,000,000 randomly generated records for a five year period (Jan 1, 2006 - Dec 31, 2011).
Defining The Base Query
Over the course of the example, we'll make use of
EXPR2 table aliases as we add additional logic. For now, our base query is meat of our syntax designed to query
Mean value data from our
SampleStats schema via what would be user-supplied
@endDate parameter values. This currently returns our 1,000,000 records.
DECLARE @startDate DATETIME DECLARE @endDate DATETIME SET @startDate = '2006-01-01' SET @endDate = '2011-12-31' SELECT MeanDate , MeanValue FROM SampleStats WHERE MeanDate >= @startDate AND MeanDate <= @endDate
Augmenting with ROW_NUMBER()
Next we'll add the all important
RowNumber field ordering by our
MeanDate field to ensure decimated data remains evenly spaced out by
MeanDate for the queried period. The
RowNumber field will be used in combination with a SQL Modulo operator in our next step.
... MeanValue , ROW_NUMBER() OVER ( ORDER BY MeanDate ) AS RowNumber FROM SampleStats ...
Integrating SQL Modulo (%) and Nth Record Logic
In this step, we add a few things:
EXPR1 alias, an @nthRecord parameter and a SQL Modulo operator. A Modulo (%) provides the remainder of one number divided by another. In this example, we return the MOD of each
RowNumber using our @nthRecord value (1000) as the divisor.
DECLARE @startDate DATETIME DECLARE @endDate DATETIME DECLARE @nthRecord INT SET @startDate = '2006-01-01' SET @endDate = '2011-12-31' SET @nthRecord = 1000 SELECT EXPR1.MeanDate , EXPR1.MeanValue , EXPR1.RowNumber , (EXPR1.RowNumber % @nthRecord) AS RowMod FROM (SELECT MeanDate , MeanValue , ROW_NUMBER() OVER ( ORDER BY MeanDate ) AS RowNumber FROM SampleStats WHERE MeanDate >= @startDate AND MeanDate <= @endDate ) EXPR1
@nthRecord value controls the amount of data returned. Our example makes use of a fairly divisible value (1000). This value would normally be the result of a custom algorithm based on any number of factors (i.e.: total table record count, max records returned, etc.). Our query now returns the following (note the RowMod field change):
Querying by the RowMod Field
Finally, we alias our current
SELECT clause (
EXPR2) and query for records where our aliased
RowMod field equals
0 and order by
RowNumber. The final example now looks like the following:
DECLARE @startDate DATETIME DECLARE @endDate DATETIME DECLARE @nthRecord INT SET @startDate = '2006-01-01' SET @endDate = '2011-12-31' SET @nthRecord = 1000 SELECT EXPR2.* FROM (SELECT EXPR1.MeanDate , EXPR1.MeanValue , EXPR1.RowNumber , (EXPR1.RowNumber % @nthRecord) AS RowMod FROM (SELECT MeanDate , MeanValue , ROW_NUMBER() OVER ( ORDER BY MeanDate ) AS RowNumber FROM SampleStats WHERE MeanDate >= @startDate AND MeanDate <= @endDate ) EXPR1 ) EXPR2 WHERE EXPR2.RowMod = 0 ORDER BY EXPR2.RowNumber
With this final step, our data decimation solution is now complete as we're querying for a mere fraction of our 1,000,000 records (every 1000th record) spanning a period of five years evenly spaced out thanks in part to the ordering by
MeanDate syntax. One million records decimated down to 1,000 records in a mere two seconds. Extreme decimation, but you get the idea.
From an application perspective, we're returning a smaller number of records processed ideally, by a more powerful SQL host rather than selecting all records an iterating through them in application code. This is great for QUICKLY identifying trends in larger data sets prompting the user for more focused querying that is an advantage over regular Skip() / Take() methods of reviewing data in grids, etc. Cheers!