This is a presentation that I gave at the Austin SQL Server Users group.
We upgraded to SQL Server 2016 in part to take advantage of the updateable nonclustered columnstore indexes for our web-based Credit Risk product for D&B. I will show you the before and after for our product, as well as some general exercises using columnstore for BI over OLTP.
This is a brief overview of how SQL Server Query Optimizer uses statistics to estimate the cost of a query plan. I intend to write a deeper dive into the new statistics of 2016 soon. If you are curious about how statistics are used, I hope this helps.
What is the Query Optimizer?
One of the best Query Optimizer descriptions I have read is in Benjamin Nevarez’s book “Inside the SQL Server Query Optimizer.” If you do not own it, buy it. His blog is: http://www.benjaminnevarez.com
“The SQL Server Query Optimizer is a cost-based optimizer. It analyzes a number of candidate execution plans for a given query, estimates the cost of each of these plans, and selects the plan with the lowest cost of the choices considered. Indeed, given that the Query Optimizer cannot consider every possible plan for every query, it actually has to find a balance between the optimization time and the quality of the selected plan.” – Benjamin Nevarez, “Inside the SQL Server Query Optimizer”
What are Statistics?
SQL Server statistics are statistical information about the distribution of values in one or more columns of a table or an index. Each statistics object contains a histogram displaying the distribution of values of the column.
You can see the statistics for an object by using the DBCC command DBCC SHOW_STATISTICS
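As a minimal sketch, assuming the AdventureWorks sample database and its real IX_Person_LastName_FirstName_MiddleName index:

```sql
-- Show the statistics header, density vector, and histogram
-- for the last-name index on Person.Person (AdventureWorks)
DBCC SHOW_STATISTICS ('Person.Person', 'IX_Person_LastName_FirstName_MiddleName');
```

The three result sets this returns are described below.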
Here is an example of the statistics histogram for the Person.Person table in the AdventureWorks database from Microsoft.
The first result set shows general information about the statistics object: the date the statistics were last updated, the number of rows sampled, the density, and the average key length.
The second result set shows the density and average length of the columns.
The third result set is the statistics histogram. The columns in the histogram are as follows:
RANGE_HI_KEY – This is also called a key value and represents the upper bound column value for a histogram step. In the example above it is the last name in the index.
RANGE_ROWS – This represents the estimated number of rows whose column value falls within a histogram step, excluding the upper bound.
DISTINCT_RANGE_ROWS – This represents the estimated number of rows with a distinct column value within a histogram step, excluding the upper bound.
EQ_ROWS – This represents the estimated number of rows whose column value equals the upper bound of the histogram step.
AVG_RANGE_ROWS (RANGE_ROWS / DISTINCT_RANGE_ROWS for DISTINCT_RANGE_ROWS > 0) – This represents the average number of rows with duplicate column values within a histogram step, excluding the upper bound.
In the above example, Select * From Person.Person Where LastName = 'Abbas' returns 1 row, whereas LastName = 'Alexander' returns 123 rows. You can see this reflected in the EQ_ROWS value of the histogram.
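You can check those counts against the histogram yourself; a sketch against AdventureWorks (exact counts may vary by sample-database version):

```sql
-- Compare actual row counts to the EQ_ROWS values in the histogram
SELECT COUNT(*) AS AbbasRows     FROM Person.Person WHERE LastName = 'Abbas';
SELECT COUNT(*) AS AlexanderRows FROM Person.Person WHERE LastName = 'Alexander';
```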
How does Query Optimizer come up with the estimate? In part, Statistics.
Query optimizer uses statistics to determine much of the cost-based plan analysis. I want to list some of the factors that the optimizer considers during this analysis.
Cardinality: the uniqueness of the values stored in the columns of the tables. Simply put, it is the ability of the cardinality estimator to determine the number of rows affected by the query. The cardinality estimator has gone through improvements in SQL Server 2012, 2014, and 2016, and it is essential that the query optimizer can correctly determine cardinality.
Selectivity: this is also a measure of uniqueness, in a way. A high value means high uniqueness; low selectivity means that the values are not terribly unique. The optimizer may see that a parameter is less than or equal to the RANGE_HI_KEY and that the selectivity is high for that table. This will affect the join order, with the most selective join first.
Selectivity = (# rows that pass the predicate) / (# rows in the table)
Density: this is how often a duplicate value occurs in a column. It is the opposite of selectivity, but equally valuable to the optimizer in determining the cost of a plan. High density means less unique data; low density means more unique data. Think of a person table: the last name column will have high density, and the social security number column will have low density.
Density = 1 / (# of distinct values in a column). For example, a column with 100 distinct values has Density = 1/100 = 0.01.
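Both numbers can be computed by hand. A sketch against AdventureWorks, using the Person.Person table and a sample predicate:

```sql
-- Density = 1 / (# of distinct values in the column)
SELECT 1.0 / COUNT(DISTINCT LastName) AS LastNameDensity
FROM Person.Person;

-- Selectivity = rows passing the predicate / rows in the table
SELECT
    (SELECT COUNT(*) FROM Person.Person WHERE LastName = 'Alexander') * 1.0
    / (SELECT COUNT(*) FROM Person.Person) AS Selectivity;
```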
What happens when statistics are out of date?
When data changes in a table, statistics must change as well. If data is added, updated, or deleted, the statistics of that object are now out of date. Take the example of the person table: what if an import of a data feed included 10,000 people with the last name of Abbas? That could affect the cost of existing or new query plans, so statistics should be updated to reflect the changes. Otherwise, the optimizer makes choices based on bad data, and that can result in a bad query plan.
Here is an example of a query plan that was created using stale statistics.
For the most expensive operator, optimizer estimated 491,338 rows with an operator cost of 88%. The actual rows were 6. This is a big difference in cost. Upon investigation, I noticed that the statistics for that index had not been updated in months.
Here is the plan after updating statistics:
The estimated rows are now 1,349 with a cost of 40%. This is much closer to the actual rows of 6. But why aren’t the estimated rows closer to the actual rows? There are a few reasons. Remember that an estimate, by definition, can be wrong. And in this case, the predicates for this operator were not SARGable: there was a cast function in the AND clause, and the optimizer couldn’t clearly determine cardinality. Here is a brief description of what SARG means; I will publish another article on this later.
What is SARG?
Search arguments. This refers to the ability of the optimizer to understand the arguments passed as predicates. If you apply a function to a column in your predicate clause, the optimizer will have to evaluate that function for every row, making it difficult to determine cardinality.
Where ExpiryDate <= GetDate()
Declare @Now datetime = GetDate()
Where ExpiryDate <= @Now
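To make the contrast concrete, here is a sketch (dbo.Contracts and ExpiryDate are hypothetical names): wrapping the column in a function, like the cast mentioned earlier, defeats cardinality estimation, while leaving the column bare keeps the predicate SARGable.

```sql
-- Not SARGable: the function is applied to the column,
-- so it must be evaluated for every row (scan, poor estimate)
SELECT * FROM dbo.Contracts WHERE CAST(ExpiryDate AS date) <= GETDATE();

-- SARGable: the column is left bare and compared to a value,
-- so the optimizer can seek an index and use the histogram
DECLARE @Now datetime = GETDATE();
SELECT * FROM dbo.Contracts WHERE ExpiryDate <= @Now;
```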
How do I find the date of my statistics?
By using DBCC SHOW_STATISTICS, you can see the date the statistics were last updated for that index in the first result set.
This query will show you the statistics and dates by column:
SELECT OBJECT_NAME(stats.object_id) AS TableName
, stats.NAME AS StatisticsName
, stats_properties.last_updated AS LastUpdated
FROM sys.stats stats
OUTER APPLY sys.dm_db_stats_properties(stats.object_id, stats.stats_id) AS stats_properties
WHERE OBJECT_NAME(stats.object_id) = 'company'
ORDER BY stats.NAME;
The moral to this story is, keep your statistics fresh. Your database will be happier, and that will make you happier.
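The refresh itself is a one-liner; a sketch, with the table name as a placeholder:

```sql
-- Refresh all statistics on one table, reading every row
UPDATE STATISTICS dbo.Company WITH FULLSCAN;

-- Or refresh statistics across the whole database in one pass
EXEC sp_updatestats;
```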
So, you are doing some harmless updates and all of a sudden, you realize that you just updated the entire table. NOT GOOD!
Hopefully this happens in a DEV environment, and the damage is at worst an inconvenience. Here is a tip to help you make data changes in a NON-production environment without accidental mishaps: use an explicit transaction.
You need to update some rows to create data that fits a scenario that you want to test.
In AdventureWorks, I select from the Person.Person table where Suffix = 'Jr.' and get 39 rows.
I want to update that suffix to 'II' after making a change to my webform to use 'II' instead of 'Jr.' Innocent enough; this should take no time at all.
I begin my script with a “Begin Tran”
I run my update, but it is taking WAYYYYYYYYY too long. OH NO! I didn’t have a WHERE clause, and now every row in the person table has the Suffix of 'II'!!!! AGONY!
No problem, I have an open transaction. I can simply rollback, and add a DADGUM WHERE CLAUSE!
There now, that’s better. 39 rows updated. But wait, there’s more. You have an OPEN TRANSACTION!!! That can do all kinds of damage. Let’s commit. And as a bonus, we can use a built-in function to check the number of open transactions for our spid: Select @@TRANCOUNT. The answer should be 0 before we go on about our day.
Begin Tran
Update Person.Person
Set [Suffix] = 'II'
--add where clause
Where Suffix = 'Jr.'
--Rollback tran
--Commit tran
-- BONUS!
Select @@TRANCOUNT
SQL Server stored procedures vs. application ad-hoc queries or ORMs
If you have worked for any significant amount of time in the tech industry, you have probably been involved in or witness to a passionate debate about where Domain Logic belongs.
As with most architectural decisions, many compromises are made weighing the pros and cons of each possible approach. These decisions can only be made by those with very deep technical expertise in both the object-oriented and relational paradigms. Since that expertise can rarely be found in one person, it is usually a synergy between application, data, and systems architects.
Software engineering in all tiers is frequently full of uncertainties. These uncertainties are one of the causes of missed deadlines and future performance problems, and those are bad for business. The best way to deal with these sorts of uncertainties is to trust those best positioned to eliminate them. In the case of the middle tier, that is the expert in data structures, algorithms, memory management, and APIs. In the data tier, the expertise lies with the data developers and the DBAs. Each tier relies on its abstractions in order to maintain its guarantees for correctness and performance. Humans with expertise in their specific areas are always superior to the tools, compiler optimizations aside.
Object Relational Impedance Mismatch
Much has been written about the differences between the Object Model and the Relational Model. You can learn quite a bit by simply googling the term. The Wikipedia page has quite a bit of information. The best explanation that I have heard is this: Objects model behavior, and Relations model data.
While the two have some similarities, including forms of normalization (3ONF and 3NF), there are distinct differences. These can make the interaction between the two a little challenging.
There are many tried and true techniques that exist to overcome these challenges, and the key to successful interaction lies in the deep knowledge of the technologies being used. That is essential to understanding the compromises being made throughout the development process.
“There is no performance difference between an inline query and a stored procedure”
I have heard this many times. I even read one paper that stated incorrectly that “stored procedures are compiled at runtime like any other SQL statement.” This is a clear example of the complications that assumptive ignorance can cause in the architectural dialog.
Stored procedures are compiled the first time they are executed, and SQL Server then stores an execution plan. Subsequent executions can skip this step altogether.
It is true that inline queries can also have saved and reusable execution plans, but this (via simple parameterization) is only available to simple queries. Anything much beyond selecting some columns by the clustered key will not get this benefit. If you add an additional predicate (AND, OR, etc.) and/or a join, you will not get this benefit, and SQL Server will have to create an execution plan for every execution of the query.
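You can watch this happen in the plan cache. A sketch using the standard plan-cache DMVs: a 'Proc' entry whose usecounts keeps climbing shows the stored procedure's plan being reused, while repeated 'Adhoc' entries stuck at usecounts = 1 show inline queries compiling every time.

```sql
-- Inspect cached plans and how often each one has been reused
SELECT cp.objtype,    -- 'Proc' vs. 'Adhoc'
       cp.usecounts,  -- number of times this cached plan was reused
       st.text        -- the statement the plan belongs to
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
ORDER BY cp.usecounts DESC;
```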
Let’s take a look at some examples.
Fig. 1 used a saved execution plan, and the rather complex procedure ran in 1ms.
Fig. 2: I will force the stored procedure to recompile and create a new execution plan to demonstrate the overhead. This procedure ran in 321ms, with significant CPU usage compared to Fig. 1.
Fig. 2: The procedure is forced to recompile the plan to simulate an inline query.
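The recompile in Fig. 2 can be forced like this (the procedure name here is a placeholder):

```sql
-- Force a fresh compilation for this one execution only;
-- the cached plan is neither reused nor replaced
EXEC dbo.usp_ComplexReport WITH RECOMPILE;
```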
It is true that sometimes execution plans become out of date due to changing statistics, but this is easily discovered and remedied by a good DBA. Again, expertise is the key.
More than just speed…stability
There are many more advantages to using a stored procedure, but for the sake of brevity, I will leave those for another document.
“There is no good source control for SQL Server”
This one speaks to the differences in the object-oriented and relational cultures more than anything else. Yes, source control as well as deployment tools exist, of course they do. Microsoft Team Foundation Server has great support for database source control as well as automated deployment. In fact, it is the very same source control integrated with Visual Studio. There are also several add-on solutions from vendors like Redgate.
“There is no way to unit test stored procedures”
Of course there are ways to unit test stored procedures, and in fact some are integrated into TFS. If the technology stack is all Microsoft, the application code as well as the database objects can be sourced in the same repository and branched, built, unit tested, and deployed together. Like any other source control, this requires some time, discipline, and effort to set up correctly. The effort is multiplied if source control is introduced after many databases are already in production with active development going on.
Companies would never begin building an application without source control for the application code, yet somehow they fail to realize the need for database source control until the issue becomes dire, and the effort is then much more difficult.
No matter how good your ORM is, using any single-sided solution ignores the strengths of the other technologies. Keeping all of the logic in the application layer omits an entire layer of optimization in the database. An ORM is a compromise, with decisions made on one side of the playing field. It is just as lopsided to put application logic in the database. The trade-offs made to get an application up very quickly using an ORM rarely sustain growth or scale without significant cost to the business. Modeling the data to support application behaviors, rather than the data itself, limits the uses of the data for any other business needs.
Stable and sound architecture comes only from making decisions with an understanding of the strengths and weaknesses of the technologies being considered. There are things that OOP was designed to do very well, and there are things that TSQL does very well. It is expertise in the two languages that overcomes the mismatch. Don’t use a screwdriver as a hammer just because you have no true understanding of how a hammer works. Let the screwdriver expert and the hammer expert decide what trade-offs make sense for the business.
OK, let’s talk about a few bad habits that we have when we are first using SQL on any Relational Database. I will give you some tips below that will have everyone fooled into thinking you really know what you are doing!
Select *: this is a tough one. We have all used it. But do you know the harm it can cause?
What it does:
It simply returns all the rows and all the columns from a table. This will scan the entire table to return the data and very likely cause lock escalation, which can block other users from reading or writing to the table. There are other issues, like schema locks, but that is another lesson. Just know that it is NOT GOOD.
What are you trying to do?
Are you trying to see a sample of the data in the table?
Use a TOP Clause : Select TOP (100) * from tableA
Are you trying to see the columns that are in the table?
Use built-in functions: exec sp_help to see the columns, datatypes, indexes, and much more. EXEC sp_help ‘tablename’
The NoLock hint: this hint will nicely ask SQL Server to forgo locks and bring you dirty data. You can use this hint, or set the transaction isolation level to READ UNCOMMITTED.
Select TOP (100) * from tableA with(NoLock)
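The isolation-level alternative mentioned above looks like this; unlike the table hint, it applies to every statement in the session:

```sql
-- Session-wide equivalent of the NoLock hint:
-- reads skip shared locks and may return uncommitted (dirty) data
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT TOP (100) * FROM tableA;
```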
This is just the beginning, and I hope this helps. I will continue to post some tips for beginners all the way to architects.
I am a data architect, mentor and a development manager. Obsessed with architecture, optimization and internals, along with the desire to teach and mentor others.