How To Create SQL Custom User-Defined Aggregate Functions
Posted by 3/08/2013 03:59:00 PM with No comments
on Code samples on this post can be downloaded from here
Sometimes you need to perform some complex SQL queries depending on some aggregate operations or functions. In most cases, these complex operations could be achieved by series of queries with joins and groupings, but, sometimes it is almost impossible to achieve this by the basic queries and functions natively supported by SQL.
For example, assume you have a table of buyers called "Buyers" and a table of products called "Products". Also, each product has some properties like size which may have multiple values per product. Now, the business says that we need to list all the products but not with all the size values instead we will only list the most commonly purchased size.
To be more clear, lets assume that 10 buyers bought a certain mobile phone with size 5, while other 2 buyers bought the same phone but with size 6. Then, when we list this mobile phone product, we will only list mobile phone with the size 5 cause it is the most common purchased one.
So, to achieve this using only native SQL queries you will find it so complex especially when you need to apply the same logic on other columns (in our example, not only the size column). This leads us to custom user-defined aggregate functions.
SQL Server doesn’t have so many aggregate functions to use, only the basics such as COUNT, MIN, MAX and few others are implemented but still the list is quite small. However, SQL Server includes CLR integration which we can use to add our own custom user-defined aggregate functions.
To do this, we need to:
- Create a class library project targeting .NET Framework 3.5 or below (as the maximum supported .NET Framework version for SQL Server 2008 is 3.5)
- Implement the necessary structures
- Register the assembly into SQL
- Register the aggregate functions into SQL
1. Create a class library project targeting .NET Framework 3.5 or below
You can do so by creating a new visual studio project of type class library. This is straight forward.
2. Implement the necessary structures
To do this, you need to know that:
- An aggregate is created by defining a "Struct" not a "Class"
- This struct must be decorated with the "SqlUserDefinedAggregate"
- This attribute has some options which we need to set properly to achieve the desired behavior
- Format:
- Serialization format for the struct
- The possible values are Native, Unknown and UserDefined
- This attribute decides how SQL will serialize and deserialize your struct members
- In case of "Native" format, the framework handles all the necessary steps to serialize and deserialize the structure.
- In case of "UserDefined" format, the framework will use the serialize and deserialize methodlogy you provide for your struct through implementing the "IBinarySerialize" interface
- Also, in case of "UserDefined" format, we must set the "MaxByteSize" option, find it explained below
- IsInvariantToDuplicates (bool):
- Decides whether receiving the same value twice or more affects the result or not
- Sometimes you need to work on unique values, like computing the max value in a column, you don't need to re-compute or re-compare an already processed value as it is already decided if it is the max value or another value is greater
- IsInvariantToNulls (bool):
- Decides whether receiving a NULL value changes the result or not
- Sometimes you need to ignore the NULL values in your operation, then, you need to set this to true, else, false
- IsInvariantToOrder (bool):
- Decides whether the order of values affects the result or not
- IsNullIfEmpty (bool):
- Decides whether an empty set results to a NULL value or not
- Name (string):
- Name of the aggregate method which will be used in SQL to call our aggregate function
- MaxByteSize (int):
- Sets the maximum size of the aggregate instance.
- -1 represents a value larger than 8000 bytes, up to 2 gigabytes
- This is required when using the "UserDefined" format and providing our own serialization and de-serialization logic
- To define this struct, we must provide implementation for some methods
- Init():
- Used in order to initialize any needed resources or variables
- Called only once when the aggregate is called
- Accumulate():
- Decides what action to take when receiving a new row value
- Called once for each row
- It is mistakenly thought that this method should provide the actual implementation of the aggregate but this is not a must as we will see later
- Merge():
- Used because sometimes if the the column to aggregate contains a large number of data the SQL CLR may split those rows into groups and aggregate each group individually and then merge the results of those groups again using the Merge() method
- Terminate():
- Used to free any used resources and to return the result
- Called at the end of the evaluation procedure
- The sequence of invocation should be like this
3. Register the assembly into SQL
This can be done as follows:
- After creating your class library, build your DLL
- Run an SQL query to register this DLL into SQL for SQL to be able to recognize and use your struct
- This SQL query will be provided in the examples section
4. Register the aggregate functions into SQL
This can be done by running an SQL query which will register our new aggregate function into SQL to be used later. This SQL query will be provided in the examples section.
Now, it is time to see some examples
Example 1
We want to define a new aggregate function to calculate the count of NULLs in a column.
CountNulls.cs
using System; using System.Data; using System.Data.SqlClient; using System.Data.SqlTypes; using Microsoft.SqlServer.Server; using System.Collections.ObjectModel; using System.Linq; using System.Xml.Serialization; using System.IO; using System.Text; using System.Collections; namespace DevelopmentSimplyPut.SQLAggregateFunctions { [System.Serializable] [Microsoft.SqlServer.Server.SqlUserDefinedAggregate( Microsoft.SqlServer.Server.Format.UserDefined, IsInvariantToDuplicates = false, // receiving the same value again changes the result IsInvariantToNulls = false, // receiving a NULL value changes the result IsInvariantToOrder = true, // the order of the values doesn't affect the result IsNullIfEmpty = true, // if no values are given the result is null MaxByteSize = -1, Name = "CountNulls" // name of the aggregate )] public struct CountNulls { private int counter; public void Init() { counter = 0; } public void Accumulate(object Value) { if (Value == DBNull.Value) // count just the rows that are not equal to NULL counter++; } public void Merge(CountNulls Group) { this.counter += Group.Counter; // when merge is needed the counter of other groups should be added } public SqlString Terminate() { return new SqlString(counter.ToString()); //returning the results } } }
RegisterAssemblyAndCreatingAggregate.sql
--Turning on CLR functionality --By default, CLR is disabled in SQL Server so to turn it on --we need to run this command against our database EXEC sp_configure 'clr enabled', 1 GO RECONFIGURE GO -- Creating the SQL assembly and linking it to the C# library DLL we created CREATE ASSEMBLY SQLAggregateFunctions AUTHORIZATION dbo FROM 'C:\SQLAggregateFunctions.dll' WITH PERMISSION_SET = SAFE GO CREATE AGGREGATE dbo.CountNulls (@value nvarchar(MAX)) RETURNS nvarchar(MAX) EXTERNAL NAME SQLAggregateFunctions.[DevelopmentSimplyPut.SQLAggregateFunctions.CountNulls] --EXTERNAL NAME SQLAssemblyName.[C#NameSpace".C#ClassName].C#MethodName /* DROP AGGREGATE dbo.CountNulls DROP ASSEMBLY SQLAggregateFunctions */
Testing.sql
SELECT dbo.CountNulls(MyColumn) FROM MyTable
Example 2
We want to define a new aggregate function to get the most common value in a column
MostCommon.cs
using System; using System.Data; using System.Data.SqlClient; using System.Data.SqlTypes; using Microsoft.SqlServer.Server; using System.Collections.ObjectModel; using System.Linq; using System.Xml.Serialization; using System.IO; using System.Text; using System.Collections; namespace DevelopmentSimplyPut.SQLAggregateFunctions { [System.Serializable] [Microsoft.SqlServer.Server.SqlUserDefinedAggregate( Microsoft.SqlServer.Server.Format.UserDefined, IsInvariantToDuplicates = false, // receiving the same value again changes the result IsInvariantToNulls = false, // receiving a NULL value changes the result IsInvariantToOrder = true, // the order of the values doesn't affect the result IsNullIfEmpty = true, // if no values are given the result is null MaxByteSize = -1, Name = "MostCommon" // name of the aggregate )] public struct MostCommon : IBinarySerialize { private Collection<string> pool; private string result; private int count; public void Init() { pool = new Collection<string>(); result = string.Empty; count = -1; } public void Accumulate(string value) { if (null != value) { pool.Add(value.ToString()); } } public void Merge(MostCommon group) { if (null != group.pool && group.pool.Count > 0) { foreach (string entry in group.pool) { pool.Add(entry); } } } public SqlString Terminate() { string[] distinctValues = pool.Distinct().ToArray(); foreach (string distinctValue in distinctValues) { int tempCount = pool.Count(s => s.Equals(distinctValue, StringComparison.OrdinalIgnoreCase)); if (tempCount > count) { count = tempCount; result = distinctValue; } } return new SqlString(result.ToString()); } #region IBinarySerialize Members public void Read(System.IO.BinaryReader reader) { this.result = reader.ReadString(); this.count = reader.ReadInt32(); int itemsCount = reader.ReadInt32(); pool = new Collection<string>(); for (int i = 0; i < itemsCount; i++) { this.pool.Add(reader.ReadString()); } } public void Write(System.IO.BinaryWriter writer) { writer.Write(this.result); writer.Write(this.count); writer.Write(pool.Count); foreach (string entry in pool) { writer.Write(entry); } } #endregion } }
RegisterAssemblyAndCreatingAggregate.sql
--Turning on CLR functionality --By default, CLR is disabled in SQL Server so to turn it on --we need to run this command against our database EXEC sp_configure 'clr enabled', 1 GO RECONFIGURE GO -- Creating the SQL assembly and linking it to the C# library DLL we created CREATE ASSEMBLY SQLAggregateFunctions AUTHORIZATION dbo FROM 'C:\SQLAggregateFunctions.dll' WITH PERMISSION_SET = SAFE GO CREATE AGGREGATE dbo.MostCommon (@value nvarchar(MAX)) RETURNS nvarchar(MAX) EXTERNAL NAME SQLAggregateFunctions.[DevelopmentSimplyPut.SQLAggregateFunctions.MostCommon] --EXTERNAL NAME SQLAssemblyName.[C#NameSpace".C#ClassName].C#MethodName /* DROP AGGREGATE dbo.MostCommon DROP ASSEMBLY SQLAggregateFunctions */
CreateTestDB.sql
USE [TestDB01] GO SET ANSI_NULLS ON GO SET QUOTED_IDENTIFIER ON GO CREATE TABLE [dbo].[T1]( [Col1] [nvarchar](100) NOT NULL, [Col2] [int] NOT NULL, [Col3] [int] NOT NULL, CONSTRAINT [PK_T1] PRIMARY KEY CLUSTERED ( [Col1] ASC, [Col2] ASC, [Col3] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] ) ON [PRIMARY] GO INSERT INTO T1(Col1, Col2, Col3) VALUES('Ahmed', 1, 1) , ('Ahmed', 1, 2) , ('Ahmed', 1, 5) , ('Ahmed', 1, 6) , ('Ahmed', 2, 2) , ('Ahmed', 3, 4) , ('Ahmed', 4, 2) , ('Ahmed', 5, 2) , ('Ahmed', 6, 2) , ('Tarek', 2, 1) , ('Tarek', 2, 3) , ('Tarek', 2, 4) , ('Tarek', 2, 5) , ('Tarek', 2, 7) , ('Tarek', 2, 9) , ('Tarek', 3, 5) , ('Tarek', 4, 5) , ('Tarek', 6, 5) , ('Hasan', 2, 3) , ('Hasan', 2, 7) , ('Hasan', 5, 0) , ('Hasan', 5, 1) , ('Hasan', 5, 4) , ('Hasan', 8, 6)
Test.sql
USE [TestDB01] GO SELECT Col1 , (SELECT dbo.MostCommon(Col2) FROM T1 where main.Col1 = Col1 GROUP BY Col1) AS CommonCol2 , (SELECT dbo.MostCommon(Col3) FROM T1 where main.Col1 = Col1 GROUP BY Col1) AS CommonCol3 FROM T1 AS main GROUP BY Col1
That's it for now. You can read more about this topic on the internet and know more about what you can and can't do using this technique. You can make use of the resources below.
Resources