How To Create SQL Custom User-Defined Aggregate Functions

Posted by Ahmed Tarek Hasan on 3/08/2013 03:59:00 PM with No comments
How To Create SQL Custom User-Defined Aggregate Functions


Code samples on this post can be downloaded from here


Sometimes you need to perform some complex SQL queries depending on some aggregate operations or functions. In most cases, these complex operations could be achieved by series of queries with joins and groupings, but, sometimes it is almost impossible to achieve this by the basic queries and functions natively supported by SQL.

For example, assume you have a table of buyers called "Buyers" and a table of products called "Products". Also, each product has some properties like size which may have multiple values per product. Now, the business says that we need to list all the products but not with all the size values instead we will only list the most commonly purchased size.

To be more clear, lets assume that 10 buyers bought a certain mobile phone with size 5, while other 2 buyers bought the same phone but with size 6. Then, when we list this mobile phone product, we will only list mobile phone with the size 5 cause it is the most common purchased one.

So, to achieve this using only native SQL queries you will find it so complex especially when you need to apply the same logic on other columns (in our example, not only the size column). This leads us to custom user-defined aggregate functions.

SQL Server doesn’t have so many aggregate functions to use, only the basics such as COUNT, MIN, MAX and few others are implemented but still the list is quite small. However, SQL Server includes CLR integration which we can use to add our own custom user-defined aggregate functions.

To do this, we need to:
  1. Create a class library project targeting .NET Framework 3.5 or below (as the maximum supported .NET Framework version for SQL Server 2008 is 3.5)
  2. Implement the necessary structures
  3. Register the assembly into SQL
  4. Register the aggregate functions into SQL
These are the main basic steps to make your own SQL custom user-defined aggregate functions. So, now lets explain each step.

1. Create a class library project targeting .NET Framework 3.5 or below
You can do so by creating a new visual studio project of type class library. This is straight forward.

2. Implement the necessary structures
To do this, you need to know that:
  • An aggregate is created by defining a "Struct" not a "Class"
  • This struct must be decorated with the "SqlUserDefinedAggregate" attribute.
  • This attribute has some options which we need to set properly to achieve the desired behavior
    • Format:
      • Serialization format for the struct
      • The possible values are Native, Unknown and UserDefined
      • This attribute decides how SQL will serialize and deserialize your struct members
      • In case of "Native" format, the framework handles all the necessary steps to serialize and deserialize the structure.
      • In case of "UserDefined" format, the framework will use the serialize and deserialize methodlogy you provide for your struct through implementing the "IBinarySerialize" interface
      • Also, in case of "UserDefined" format, we must set the "MaxByteSize" option, find it explained below
    • IsInvariantToDuplicates (bool): 
      • Decides whether receiving the same value twice or more affects the result or not
      • Sometimes you need to work on unique values, like computing the max value in a column, you don't need to re-compute or re-compare an already processed value as it is already decided if it is the max value or another value is greater
    • IsInvariantToNulls (bool):
      • Decides whether receiving a NULL value changes the result or not
      • Sometimes you need to ignore the NULL values in your operation, then, you need to set this to true, else, false
    • IsInvariantToOrder (bool):
      • Decides whether the order of values affects the result or not
    • IsNullIfEmpty (bool):
      • Decides whether an empty set results to a NULL value or not
    • Name (string):
      • Name of the aggregate method which will be used in SQL to call our aggregate function
    • MaxByteSize (int):
      • Sets the maximum size of the aggregate instance.
      • -1 represents a value larger than 8000 bytes, up to 2 gigabytes
      • This is required when using the "UserDefined" format and providing our own serialization and de-serialization logic
  • To define this struct, we must provide implementation for some methods
    • Init():
      • Used in order to initialize any needed resources or variables
      • Called only once when the aggregate is called
    • Accumulate():
      • Decides what action to take when receiving a new row value
      • Called once for each row
      • It is mistakenly thought that this method should provide the actual implementation of the aggregate but this is not a must as we will see later
    • Merge():
      • Used because sometimes if the the column to aggregate contains a large number of data the SQL CLR may split those rows into groups and aggregate each group individually and then merge the results of those groups again using the Merge() method
    • Terminate():
      • Used to free any used resources and to return the result
      • Called at the end of the evaluation procedure
    • The sequence of invocation should be like this

3. Register the assembly into SQL
This can be done as follows:
  1. After creating your class library, build your DLL
  2. Run an SQL query to register this DLL into SQL for SQL to be able to recognize and use your struct
  3. This SQL query will be provided in the examples section

4. Register the aggregate functions into SQL
This can be done by running an SQL query which will register our new aggregate function into SQL to be used later. This SQL query will be provided in the examples section.

Now, it is time to see some examples


Example 1
We want to define a new aggregate function to calculate the count of NULLs in a column.

CountNulls.cs
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Collections.ObjectModel;
using System.Linq;
using System.Xml.Serialization;
using System.IO;
using System.Text;
using System.Collections;

namespace DevelopmentSimplyPut.SQLAggregateFunctions
{
    [System.Serializable]
    [Microsoft.SqlServer.Server.SqlUserDefinedAggregate(
       Microsoft.SqlServer.Server.Format.UserDefined,
       IsInvariantToDuplicates = false, // receiving the same value again changes the result
       IsInvariantToNulls = false,      // receiving a NULL value changes the result
       IsInvariantToOrder = true,       // the order of the values doesn't affect the result
       IsNullIfEmpty = true,            // if no values are given the result is null
       MaxByteSize = -1,
       Name = "CountNulls"              // name of the aggregate
    )]
    public struct CountNulls
 {
  private int counter;
  public void Init()
  {
   counter = 0;
  }
  public void Accumulate(object Value)
  {
   if (Value == DBNull.Value)  // count just the rows that are not equal to NULL
    counter++;
  }
  public void Merge(CountNulls Group)
  {
   this.counter += Group.Counter; // when merge is needed the counter of other groups should be added
  }

  public SqlString Terminate()
  {
   return new SqlString(counter.ToString()); //returning the results
  }
 }
}

RegisterAssemblyAndCreatingAggregate.sql
--Turning on CLR functionality
--By default, CLR is disabled in SQL Server so to turn it on
--we need to run this command against our database
EXEC sp_configure 'clr enabled', 1
GO
RECONFIGURE
GO

-- Creating the SQL assembly and linking it to the C# library DLL we created
CREATE ASSEMBLY SQLAggregateFunctions
AUTHORIZATION dbo
FROM 'C:\SQLAggregateFunctions.dll'
WITH PERMISSION_SET = SAFE
GO

CREATE AGGREGATE dbo.CountNulls (@value nvarchar(MAX)) RETURNS nvarchar(MAX)
EXTERNAL NAME SQLAggregateFunctions.[DevelopmentSimplyPut.SQLAggregateFunctions.CountNulls]
--EXTERNAL NAME SQLAssemblyName.[C#NameSpace".C#ClassName].C#MethodName


/*
DROP AGGREGATE dbo.CountNulls
DROP ASSEMBLY SQLAggregateFunctions
*/

Testing.sql
SELECT dbo.CountNulls(MyColumn)
FROM MyTable


Example 2
We want to define a new aggregate function to get the most common value in a column

MostCommon.cs
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Collections.ObjectModel;
using System.Linq;
using System.Xml.Serialization;
using System.IO;
using System.Text;
using System.Collections;

namespace DevelopmentSimplyPut.SQLAggregateFunctions
{
    [System.Serializable]
    [Microsoft.SqlServer.Server.SqlUserDefinedAggregate(
       Microsoft.SqlServer.Server.Format.UserDefined,
       IsInvariantToDuplicates = false, // receiving the same value again changes the result
       IsInvariantToNulls = false,      // receiving a NULL value changes the result
       IsInvariantToOrder = true,       // the order of the values doesn't affect the result
       IsNullIfEmpty = true,            // if no values are given the result is null
       MaxByteSize = -1,
       Name = "MostCommon"              // name of the aggregate
    )]
    public struct MostCommon : IBinarySerialize
    {
        private Collection<string> pool;
        private string result;
        private int count;

        public void Init()
        {
            pool = new Collection<string>();
            result = string.Empty;
            count = -1;
        }
        public void Accumulate(string value)
        {
            if (null != value)
            {
                pool.Add(value.ToString());
            }
        }
        public void Merge(MostCommon group)
        {
            if (null != group.pool && group.pool.Count > 0)
            {
                foreach (string entry in group.pool)
                {
                    pool.Add(entry);
                }
            }
        }
        public SqlString Terminate()
        {
            string[] distinctValues = pool.Distinct().ToArray();

            foreach (string distinctValue in distinctValues)
            {
                int tempCount = pool.Count(s => s.Equals(distinctValue, StringComparison.OrdinalIgnoreCase));

                if (tempCount > count)
                {
                    count = tempCount;
                    result = distinctValue;
                }
            }

            return new SqlString(result.ToString());
        }

        #region IBinarySerialize Members
        public void Read(System.IO.BinaryReader reader)
        {
            this.result = reader.ReadString();
            this.count = reader.ReadInt32();
            int itemsCount = reader.ReadInt32();
            
            pool = new Collection<string>();

            for (int i = 0; i < itemsCount; i++)
         {
             this.pool.Add(reader.ReadString());
         } 
        }
        public void Write(System.IO.BinaryWriter writer)
        {
            writer.Write(this.result);
            writer.Write(this.count);
            writer.Write(pool.Count);
            
            foreach (string entry in pool)
            {
                writer.Write(entry);
            } 
        }
        #endregion
    }
}

RegisterAssemblyAndCreatingAggregate.sql
--Turning on CLR functionality
--By default, CLR is disabled in SQL Server so to turn it on
--we need to run this command against our database
EXEC sp_configure 'clr enabled', 1
GO
RECONFIGURE
GO

-- Creating the SQL assembly and linking it to the C# library DLL we created
CREATE ASSEMBLY SQLAggregateFunctions
AUTHORIZATION dbo
FROM 'C:\SQLAggregateFunctions.dll'
WITH PERMISSION_SET = SAFE
GO

CREATE AGGREGATE dbo.MostCommon (@value nvarchar(MAX)) RETURNS nvarchar(MAX)
EXTERNAL NAME SQLAggregateFunctions.[DevelopmentSimplyPut.SQLAggregateFunctions.MostCommon]
--EXTERNAL NAME SQLAssemblyName.[C#NameSpace".C#ClassName].C#MethodName


/*
DROP AGGREGATE dbo.MostCommon
DROP ASSEMBLY SQLAggregateFunctions
*/

CreateTestDB.sql
USE [TestDB01]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[T1](
 [Col1] [nvarchar](100) NOT NULL,
 [Col2] [int] NOT NULL,
 [Col3] [int] NOT NULL,
 CONSTRAINT [PK_T1] PRIMARY KEY CLUSTERED 
(
 [Col1] ASC,
 [Col2] ASC,
 [Col3] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

GO


INSERT INTO T1(Col1, Col2, Col3)
VALUES('Ahmed', 1, 1)
, ('Ahmed', 1, 2)
, ('Ahmed', 1, 5)
, ('Ahmed', 1, 6)
, ('Ahmed', 2, 2)
, ('Ahmed', 3, 4)
, ('Ahmed', 4, 2)
, ('Ahmed', 5, 2)
, ('Ahmed', 6, 2)
, ('Tarek', 2, 1)
, ('Tarek', 2, 3)
, ('Tarek', 2, 4)
, ('Tarek', 2, 5)
, ('Tarek', 2, 7)
, ('Tarek', 2, 9)
, ('Tarek', 3, 5)
, ('Tarek', 4, 5)
, ('Tarek', 6, 5)
, ('Hasan', 2, 3)
, ('Hasan', 2, 7)
, ('Hasan', 5, 0)
, ('Hasan', 5, 1)
, ('Hasan', 5, 4)
, ('Hasan', 8, 6)

Test.sql
USE [TestDB01]
GO

SELECT
Col1
, (SELECT dbo.MostCommon(Col2) FROM T1 where main.Col1 = Col1 GROUP BY Col1) AS CommonCol2
, (SELECT dbo.MostCommon(Col3) FROM T1 where main.Col1 = Col1 GROUP BY Col1) AS CommonCol3
FROM T1 AS main
GROUP BY Col1


That's it for now. You can read more about this topic on the internet and know more about what you can and can't do using this technique. You can make use of the resources below.


Resources
  1. Custom Aggregates in SQL Server - CodeProject
  2. How to implement your own aggregate function in SQLCLR, SQL Server 2005 - Bashar Kokash' Blog
  3. System.Data.SqlTypes Namespace ()
  4. IBinarySerialize Interface (Microsoft.SqlServer.Server)



Categories: , ,