Friday, June 25, 2010

What are Ranking Functions?

This is in continuation of my last post Sql Server Rownumber

They are used within the column list of a query to create a new column containing an integer value for each row in a set of results. For example, the Row_Number() function generates a unique, sequential number for each row. Other functions may generate duplicate values depending upon their usage and the data they are operating against.

MSDN says, "Ranking functions return a ranking value for each row in a partition. Depending on the function that is used, some rows might receive the same value as other rows. Ranking functions are nondeterministic."

In simple terms, ranking functions allow you to sequentially number your result set. Your result set can be partitioned so the numbering essentially resets for each partition for example you can get the sales rank of employees partitioned by their department, or manager etc..

Transact-SQL provides the following ranking functions:
  • RANK
  • NTILE
  • DENSE_RANK
  • ROW_NUMBER
Row_Number
This function generates a unique integer value for every row in the results. The first row is assigned a value of one and each subsequent row increments the number. The syntax for this function is interesting as it introduces a new clause that will be seen in all of the ranking functions:

SELECT row_number() OVER (ORDER BY column-name), columns FROM table-name

The OVER clause specifies the order of the rows that should be used when generating values with the ranking function. Following the OVER keyword, the sort order to apply is provided in parentheses.

Rank
The results are ranked according to the sort order, with each row being given a number that is one greater than the number of results that appear before it. Where there are two or more items that are equivalent in the sort order they are each given the same rank. For example, if the first two items in the list are equivalent, they are both given a rank of one. The next item in the list will have a rank of three.

RANK ()    OVER ( [ < partition_by_clause > ] < order_by_clause > )

The rank of a row is one plus the number of ranks that come before the row in question.

Dense_Rank
The Dense_Rank function is very similar to Rank. The key difference is that there are no gaps between the rank values when duplicates are encountered. If the first two values in the sort order are identical, each will be given a value of one. The third value would have a dense rank of two, not three.

DENSE_RANK ()    OVER ( [ <partition_by_clause> ] < order_by_clause > )

The rank of a row is one plus the number of distinct ranks that come before the row in question.

Ntile
Is used to distribute the rows in an ordered partition into x number of groups. Each row receives the group number it belongs to.

The Ntile function splits the results of a query into a number of groups. Each group is then given a number that is applied to every row within the group. The numbering starts with group one and is incremented for every additional group. The syntax for the statement is similar to that of the previous functions with the addition of a parameter to specify the number of groups that should be generated.

TILE (integer_expression)    OVER ( [ <partition_by_clause> ] < order_by_clause > )

integer_expression

Is a positive integer constant expression that specifies the number of groups into which each partition must be divided. integer_expression can be of type int, or bigint.
Where possible, each group will contain exactly the same number of rows. However, if the number of rows is not divisible by the number of groups, the later groups will be one row smaller than the first group.

All together
the ROW_NUMBER function assigns a unique (and ascending) value to each row without regard for ties in the ORDER BY values, the RANK and DENSE_RANK functions assign the same value to rows that have the same ORDER BY value.  The difference between the RANK and DENSE_RANK functions is in how values are assigned to rows following a tie.

The easiest way to illustrate the difference between all of these functions is with a simple example:

CREATE TABLE T (PK INT IDENTITY, A INT, B INT, C INT)
CREATE UNIQUE CLUSTERED INDEX TPK ON T(PK)

INSERT T VALUES (0, 1, 6)
INSERT T VALUES (0, 1, 4)
INSERT T VALUES (0, 3, 2)
INSERT T VALUES (0, 3, 0)
INSERT T VALUES (1, 0, 7)
INSERT T VALUES (1, 0, 5)
INSERT T VALUES (0, 2, 3)
INSERT T VALUES (0, 2, 1)

SELECT *,
    ROW_NUMBER() OVER (ORDER BY B) AS RowNumber,
    RANK() OVER (ORDER BY B) AS Rank,
    DENSE_RANK() OVER (ORDER BY B) AS DenseRank
FROM T

PK    A     B     C     RowNumber  Rank       DenseRank
----- ----- ----- ----- ---------- ---------- ----------
5     1     0     7     1          1          1
6     1     0     5     2          1          1
1     0     1     6     3          3          2
2     0     1     4     4          3          2
7     0     2     3     5          5          3
8     0     2     1     6          5          3
3     0     3     2     7          7          4
4     0     3     0     8           7         4

Notice how the ROW_NUMBER function ignores the duplicate values for column B and assigns the unique integers from 1 to 8 to the 8 rows while the RANK and DENSE_RANK functions assigns the same value to each of the pairs of duplicate rows.  Moreover, notice how the RANK function counts the duplicate rows even while it assigns the same value to each duplicate row whereas the DENSE_RANK function does not count the duplicate rows.  For example, both the RANK and DENSE_RANK functions assign a rank of 1 to the first two rows, but the RANK function assigns a rank of 3 to the third row - as it is the third row - while the DENSE_RANK function assigns a rank of 2 to the third row - as it contains the second distinct value for column B.  Note that the maximum value returned by the DENSE_RANK function is exactly equal to the number of distinct values in column B.
Download Article

No comments:

Post a Comment