So you have a large SQL dataset? Say 1,000,000+ records? Need to convey simple trends in the data over vast historical periods, on demand? Sounds like a great candidate for data decimation (aka downsampling). In some recent Silverlight development, this very issue was raised and solved using the examples provided in this post. The visualization project in question required that users be able to visualize trends in statistics data over significant historical periods (5+ years, 200+ points a day), that date ranges be user selectable, and that UI visualization controls refresh within an application-mandated 3-5 second response window. Decimation proved valuable in meeting the project spec, and it lets users spot trends and target specific ranges for further investigation (drill down), something that paging through data with the Skip() and Take() methods doesn't readily offer.

For this post, we'll generate 1,000,000+ records of sample data and walk through the steps of creating a simple data decimation solution that accepts a custom date range and returns evenly spaced, decimated data for that range. The T-SQL syntax can easily be ported into stored procedures, views, and more. Time for data generation!

Creating The Sample Data Set

For the purpose of this post, we'll add the following table schema to a stock AdventureWorks2008R2 SQL database, though any SQL Server database will do. We'll use this new table to house 1,000,000 sample date and float value records for use with our data decimation solution.

CREATE TABLE [dbo].[SampleStats] (
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [MeanDate] [date] NULL,
    [MeanValue] [float] NULL
)

Now that we have our T-SQL table in place, we'll make use of Redgate's SQL Data Generator utility to rapidly fill our table with 1,000,000 randomly generated records spanning a six-year period (Jan 1, 2006 - Dec 31, 2011).

[Figure: Using Redgate's SQL Data Generator]
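No SQL Data Generator license handy? A set-based T-SQL script along these lines should produce a comparable data set (the 0-100 value range is my own assumption; any float distribution will do):

-- A set-based alternative to the generator tool (a sketch; the 0-100
-- value range is an assumption). sys.all_objects cross joined with
-- itself supplies well over 1,000,000 rows to draw from.
INSERT INTO [dbo].[SampleStats] ( MeanDate , MeanValue )
SELECT TOP ( 1000000 )
    -- Random date within Jan 1, 2006 - Dec 31, 2011 (2,191 days)
    DATEADD(DAY, ABS(CHECKSUM(NEWID())) % 2191, '2006-01-01') ,
    -- Random float value between 0 and 100
    RAND(CHECKSUM(NEWID())) * 100
FROM
    sys.all_objects a
    CROSS JOIN sys.all_objects b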

Defining The Base Query

Over the course of the example, we'll introduce the EXPR1 and EXPR2 table aliases as we layer on additional logic. For now, our base query is the meat of the syntax: it selects the date and mean value data from our SampleStats table, filtered by what would be user-supplied @startDate and @endDate parameter values. As written, it returns all 1,000,000 records.

DECLARE @startDate DATETIME
DECLARE @endDate DATETIME
SET @startDate = '2006-01-01'
SET @endDate = '2011-12-31'

SELECT
    MeanDate ,
    MeanValue 
FROM
    SampleStats
WHERE
    MeanDate >= @startDate AND 
    MeanDate <= @endDate
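Since the range filter (and the ordering we'll add next) keys on MeanDate, a supporting index along these lines helps keep larger ranges inside that 3-5 second window. This is an assumption on my part; the original setup's indexing isn't specified:

-- An assumed supporting index (indexing wasn't specified above):
-- covers the MeanDate range filter and the upcoming ROW_NUMBER()
-- ORDER BY, and includes MeanValue to avoid key lookups.
CREATE NONCLUSTERED INDEX IX_SampleStats_MeanDate
    ON [dbo].[SampleStats] ( [MeanDate] )
    INCLUDE ( [MeanValue] )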

Augmenting with ROW_NUMBER()

Next we'll add the all-important RowNumber field, ordered by our MeanDate field, to ensure the decimated data remains evenly spaced by MeanDate across the queried period. The RowNumber field will be used in combination with the SQL modulo operator in our next step.

...
    MeanValue ,
    ROW_NUMBER() OVER ( ORDER BY MeanDate ) AS RowNumber
FROM
    SampleStats
...

Integrating SQL Modulo (%) and Nth Record Logic

In this step, we add a few things: the EXPR1 alias, an @nthRecord parameter, and the SQL modulo operator. Modulo (%) returns the remainder of one number divided by another; for example, 2500 % 1000 = 500, while 3000 % 1000 = 0. In this example, we return each RowNumber modulo our @nthRecord value (1000).

DECLARE @startDate DATETIME
DECLARE @endDate DATETIME
DECLARE @nthRecord INT
SET @startDate = '2006-01-01'
SET @endDate = '2011-12-31'
SET @nthRecord = 1000

SELECT
    EXPR1.MeanDate ,
    EXPR1.MeanValue ,
    EXPR1.RowNumber ,
    (EXPR1.RowNumber % @nthRecord) AS RowMod
FROM
    (SELECT
        MeanDate ,
        MeanValue ,
        ROW_NUMBER() OVER ( ORDER BY MeanDate ) AS RowNumber
    FROM      
        SampleStats
    WHERE
        MeanDate >= @startDate AND 
        MeanDate <= @endDate
    ) EXPR1

The @nthRecord value controls the amount of data returned. Our example uses a value (1000) that divides our 1,000,000 rows evenly. In practice, this value would be the result of a custom algorithm based on any number of factors (e.g., total table record count, maximum records to return, etc.); a sketch of one such algorithm follows the result below. Our query now returns the following (note the new RowMod field):

[Figure: Sample Data with RowMod Values]
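For illustration, here's a minimal sketch of one such algorithm. It derives @nthRecord from the record count in the requested range and a target point count; @maxPoints is a hypothetical application-supplied value, not part of the original project:

-- A sketch of one possible @nthRecord algorithm. @maxPoints is a
-- hypothetical cap on how many points the UI should receive.
DECLARE @maxPoints INT
DECLARE @totalRecords INT
SET @maxPoints = 1000

SELECT
    @totalRecords = COUNT(*)
FROM
    SampleStats
WHERE
    MeanDate >= @startDate AND 
    MeanDate <= @endDate

-- Divide the range into roughly @maxPoints buckets; never drop below 1
SET @nthRecord = CASE
    WHEN @totalRecords > @maxPoints
        THEN CEILING(@totalRecords / CAST(@maxPoints AS FLOAT))
    ELSE 1
END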

Querying by the RowMod Field

Finally, we alias our current SELECT clause as EXPR2 and query for records where the aliased RowMod field equals 0, ordering by RowNumber. Any RowNumber that is an exact multiple of @nthRecord yields a remainder of 0, so this filter keeps exactly every 1,000th record. The final example now looks like the following:

DECLARE @startDate DATETIME
DECLARE @endDate DATETIME
DECLARE @nthRecord INT
SET @startDate = '2006-01-01'
SET @endDate = '2011-12-31'
SET @nthRecord = 1000

SELECT 
    EXPR2.*
FROM
    (SELECT
        EXPR1.MeanDate ,
        EXPR1.MeanValue ,
        EXPR1.RowNumber ,
        (EXPR1.RowNumber % @nthRecord) AS RowMod
    FROM
        (SELECT
            MeanDate ,
            MeanValue ,
            -- Number each row in MeanDate order for even spacing
            ROW_NUMBER() OVER ( ORDER BY MeanDate ) AS RowNumber
        FROM      
            SampleStats
        WHERE
            MeanDate >= @startDate AND 
            MeanDate <= @endDate
        ) EXPR1
    ) EXPR2
WHERE
    EXPR2.RowMod = 0    -- Keep only every @nthRecord-th row
ORDER BY
    EXPR2.RowNumber

With this final step, our data decimation solution is complete: we're querying for a mere fraction of our 1,000,000 records (every 1,000th record) spanning six years, evenly spaced thanks to the ROW_NUMBER() ordering by MeanDate. One million records decimated down to 1,000 in a mere two seconds. Extreme decimation, but you get the idea.

[Figure: Final Example Query Result]
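As mentioned at the outset, this syntax ports readily into a stored procedure. Here's a minimal sketch of such a port (the procedure name and the @nthRecord default are illustrative assumptions):

-- A minimal stored-procedure port (the name GetDecimatedStats and
-- the @nthRecord default are illustrative assumptions).
CREATE PROCEDURE dbo.GetDecimatedStats
    @startDate DATETIME ,
    @endDate   DATETIME ,
    @nthRecord INT = 1000
AS
BEGIN
    SET NOCOUNT ON

    SELECT 
        EXPR2.*
    FROM
        (SELECT
            EXPR1.MeanDate ,
            EXPR1.MeanValue ,
            EXPR1.RowNumber ,
            (EXPR1.RowNumber % @nthRecord) AS RowMod
        FROM
            (SELECT
                MeanDate ,
                MeanValue ,
                ROW_NUMBER() OVER ( ORDER BY MeanDate ) AS RowNumber
            FROM      
                SampleStats
            WHERE
                MeanDate >= @startDate AND 
                MeanDate <= @endDate
            ) EXPR1
        ) EXPR2
    WHERE
        EXPR2.RowMod = 0
    ORDER BY
        EXPR2.RowNumber
END
GO

-- Example call: decimate the 2008 calendar year to every 200th record
EXEC dbo.GetDecimatedStats @startDate = '2008-01-01', @endDate = '2008-12-31', @nthRecord = 200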

From an application perspective, we return a much smaller number of records, processed ideally by a more powerful SQL host rather than selecting all records and iterating through them in application code. This is great for QUICKLY identifying trends in larger data sets and prompting the user toward more focused querying, an advantage over the usual Skip() / Take() methods of reviewing data in grids, etc. Cheers!