This is a presentation that I gave at the Austin SQL Server Users group.
We upgraded to SQL 2016 in part to take advantage of the updateable nonclustered columnstore indexes for our web based Credit Risk product for D&B. I will show you the before and after for our product as well as some general exercises using columnstore in a BI over OLTP .
This is a brief overview of how SQL Server Query Optimizer uses statistics to estimate the cost of a query plan. I intend to write a deeper dive into the new statistics of 2016 soon. If you are curious about how statistics are used, I hope this helps.
What is the Query Optimizer?
One of the best Query Optimizer descriptions I have read is in Ben Navarez’ book “Inside the SQL Server Query Optimizer” If you do not own it, buy it. His blog is: http://www.benjaminnevarez.com
“The SQL Server Query Optimizer is a cost-based optimizer. It analyzes a number of candidate execution plans for a given query, estimates the cost of each of these plans, and selects the plan with the lowest cost of the choices considered. Indeed, given that the Query Optimizer cannot consider every possible plan for every query, it actually has to find a balance between the optimization time and the quality of the selected plan.” – Ben Navarez, “Inside the SQL Server Query Optimizer”
What are Statistics?
SQL Server statistics are statistical information about the distribution of values in one or more columns of a table or an index. Each statistics object contains a histogram displaying the distribution of values of the column.
You can see the statistics for an object by using the DBCC command DBCC SHOW_STATISTICS
Here is an example of the statistics histogram for the person.person table in the adventureworks database from Microsoft.
The first result set shows general info about the index. The date that the statistics were updated, number of rows sampled, density, and average key length.
The second result set shows the density and average length of the columns.
The third result set is the statistics histogram. The columns in the histogram are as follows:
RANGE_HI_KEY – This is also called a key value and represents the upper bound column value for a histogram step. In the example above it is the last name in the index.
RANGE_ROWS – This represents the estimated number of rows whose column value falls within a histogram step, excluding the upper bound.
DISTINCT_RANGE_ROWS – This represents the estimated number of rows with a distinct column value within a histogram step, excluding the upper bound.
EQ_ROWS – This represents the estimated number of rows whose column value equals the upper bound of the histogram step.
AVG_RANGE_ROWS (RANGE_ROWS / DISTINCT_RANGE_ROWS for DISTINCT_RANGE_ROWS > 0) – This represents the average number of rows with duplicate column values within a histogram step, excluding the upper bound.
In the above example, Select * From person.person where LastName = ‘Abbas’, returns 1 row. Whereas where LastName = ‘Alexander’, returns 123 rows. You can see this reflected in the EQ_ROWS value of the histogram
How does Query Optimizer come up with the estimate? In part, Statistics.
Query optimizer uses statistics to determine much of the cost base plan analysis. I want to list some of the steps that optimizer takes during this analysis.
The uniqueness of the values stored in the columns of the tables. Simply put, it is the ability of the cardinality estimator to determine the number of rows affected by the query. The cardinality estimator has gone through some improvements in SQL 2012, 2014, and 2016. It is essential that the query optimizer can correctly determine cardinality.
This is also a measure of uniqueness in a way. A high value means high uniqueness. Low selectivity means that the values are not terribly unique. Optimizer may see that a parameter is less than or equal to the RANGE_HI_KEY and that the selectivity is high for that table. This will affect the join order, with the most selective join first.
Selectivity = # rows that pass the predicate/# rows in the table
This is how often a duplicate value occurs in a column. It is the opposite of selectivity, but equally valuable to the optimizer to determine the cost of a plan. High density equals less unique data. Low density equals more unique data. If you think of a person table. The last name column will be high density, and the social security number column will be low density.
Density = 1/# of distinct values in a column OR Density = 1/100 = 0.01
What happens when statistics are out of date?
When data changes in a table, statistics must change as well. If data is added, updated or deleted, the statistics of that object is now different. Take the example of the person table. What if there were an import of a data feed, that included 10,000 people with the last name of Abbas. That could affect the cost of existing or new query plans. Statistics should be updated to reflect the changes. Optimizer makes choices based on bad data, and that can result in a bad query plan.
Here is an example of a query plan that was created using stale statistics.
For the most expensive operator, optimizer estimated 491,338 rows with an operator cost of 88%. The actual rows were 6. This is a big difference in cost. Upon investigation, I noticed that the statistics for that index had not been updated in months.
Here is the plan after updating statistics:
The estimated rows are now 1,349 with a cost of 40%. This is much closer to the actual rows of 6. But why aren’t the estimated rows closer to the actual rows? Well, there are a few reasons. Remember that an estimate by definition will be wrong. And in this case, the predicates for this operator were not SARGable. There was a cast function in the AND clause, and optimizer couldn’t clearly determine cardinality. Here is a brief description of what SARG means. I will publish another article on this later.
What is SARG?
Searchable arguments. This refers to the ability of the optimizer to understand the arguments passed as predicates. If you use functions in your predicate clause, optimizer will have to evaluate that value for every row making it difficult to determine cardinality.
Where ExpiryDate <= GetDate()
Declare @Now datetime = GetDate()
Where ExpiryDate <= @Now
How do I find the date of my statistics?
By using DBCC SHOW_STATISTICS, you can see that date updated for that index in the first result set.
This query will show you the statistics and dates by column:
SELECT OBJECT_NAME(stats.object_id) AS TableName
, stats.NAME AS StatisticsName
FROM sys.stats stats
OUTER APPLY sys.dm_db_stats_properties(stats.object_id, stats.stats_id) AS stats_properties
WHERE OBJECT_NAME(stats.object_id) ='company'ORDERBY stats.NAME;
The moral to this story is, keep your statistics fresh. Your database will be happier, and that will make you happier.
I am a data architect, mentor and a development manager. Obsessed with architecture, optimization and internals, along with the desire to teach and mentor others.