Sunday, February 23, 2014

Partitioning in SQL Server - Part 1

Partitioning in SQL Server is one of the interesting concepts and also a wonderful concept. It is available only in the Enterprise Edition of SQL Server. In this article, I will narrate my experience with Partitioning. 

Partitioning means distributing a table across different file groups. These file groups can be present on the same disk or different disks. Generally Partition is done on the table which has large number of records in order to have the faster retrieval of data and facilitate easier maintenance of index. To have a better performance due to partitioning, it is recommended to have the file groups on different disks. To search a record, Query Optimizer goes to the respective partition instead of scanning the whole table.

Partitioning can be done on an existing table or a table can be created with partitions at the creation time. To master this concept, one should be aware of the following components:

  1. Partition Function.
  2. Partition Scheme.
  3. Partition Column.
  4. Aligned Index.
  5. Non-Aligned Index.
  6. Partition Elimination.

Let's know about the above specified components in order,

Partition Function


Creating a Partition Function is the first task a user has to do for partitioning a table. It specifies how many partitions a table can have, with what kind of boundary and also specifies the boundary values.


CREATE PARTITION FUNCTION SAMPLE_PF (INT)
AS RANGE LEFT/RIGHT FOR VALUES (500000,1000000)

In the above T-SQL code, a partition function is created for an INT datatype. There are two values specified namely 500000 and 1000000 which are called Boundary Values. As two values are specified, this means there will be partitions. Two ranges can be specified like LEFT and RIGHT. I have written the both to specify there are two. If LEFT range is specified then the value for INT datatype column <= 500000 reside in first partition and the value for INT datatype <= 1000000 reside in second partition. If RIGHT range is specified then the value for INT datatype column < 500000 resides in first partition which doesn't include 500000. The value for INT datatype column < 1000000 which doesn't include 1000000.

Partition Scheme


After creating a Partition Function, partition scheme has to be created. This is an important part of partition concept. Here the file groups across which the table has to be distributed is specified. While creating Partition Scheme, the Partition Function which is to be used should be specified. It maps the partition function to the specified file groups.

For the above partition column, the partition scheme can be created as follows,


CREATE PARTITION SCHEME SAMPLE_PS
AS PARTITION SAMPLE_PF
TO ([PRIMARY],[Secondary_FileGroup],[PRIMARY])

Here, the reason for specifying three file groups though two boundary values are specified is, specifying two boundary values only. Let me be more clear on this. In the above partition function, we specified two boundary values 500000 and 1000000. So the Query Optimizer thinks there exist some values <= 500000 if it is a left boundary or <500000 if it is right boundary. Hence it needs one file group to be specified. The next boundary value is 1000000 and the same case will be with Query optimizer. Next, though one more boundary is not specified, Query Optimizer thinks there may be some values beyond 1000000. We know that there doesn't exist any value beyond this value but Query Optimizer doesn't. So it asks to specify one more file group. If there are any rows beyond 1000000 then rowcount will be more than '1' or else '1'. I will explain about this in my next article where I will show you practically with examples. If only two file groups are specified in the above Partition Scheme then SQL Server throws an error and insists you to specify one more file group.

Partition Column


Next comes, choosing a column on which partition has to be made. Choosing a Partition Columns is a critical choice because once partition is done based on a column it cannot be changed. If it has to be changed, the table should be dropped and re-created and all the indexes should be re-built. There are some strategies to be followed for a Partition Column:

  • A partitioning column should not have columns with data types like text, ntext, image, xml, timestamp, varchar(max), nvarchar(max), varbinary(max).
  • It should be a part of Primary Key because a Primary Key doesn't allow NULL values and facilitates uniqueness of data. If there are NULL values in Partition Column then that record is stored in the left most partition.
  • It should be a Clustered Indexed column. This is because Clustered Index sorts the physical storage order of the rows. As partition effects the physical storage of table, it is recommended to have a Clustered Indexed column as the Partition Column.
  • A Partition Column can be a Unique Indexed Column. Like Primary Key it also ensures uniqueness of data but allows a single NULL value. So this strategy is the last recommended one after above two strategies.
  • Partition Column should be a single column. This means partition cannot be done on two different columns. If a combination of columns can make the best partition then they both should be declared as PERSISTED computed column.
  • To let the Query Optimizer go into a correct partition for data retrieval, it should be used in the predicate of queries.


Aligned Index


If a table is partitioned it has a Partition Function as well as Partition Scheme. If there is an index on the table and it is also partitioned with the same partition scheme then such index is called Aligned Index. 

Non-Aligned Index


If the table and the index have different partition schemes then such index is called Non-Aligned Index.

Partition Elimination


This is one of the advantages with Partitioning. As said earlier, it is better if the partition column is chosen in the query predicate. This facilitates Partition Elimination. It means, when the partition column is used in the query predicate, Query Optimizer goes to the partition where the particular record exists. Therefore all the remaining partitions are removed from its execution plan. Hence query performance increases. So it is always recommended to choose a partition column which can be used in query predicate often.

             These are the concepts one should be aware of, while knowing about partitioning. Read the continuation post about partitioning at Partitioning in SQL Server - Part 2

No comments:

Post a Comment