I was recently fortunate enough to attend some Microsoft SQL Server training on my employer’s dime. The training had been billed to me as performance training, but it ended up being a SQL Server developer deep-dive instead. Performance was certainly a major topic, but we covered a lot of other things as well.

Most of my “training” with relational databases (and SQL Server in particular) has been pretty ad hoc and cargo-cult oriented. I did take an Oracle class in college, so I was at least familiar with thinking in sets when I started working here, but the particular idioms and characteristics of SQL Server have mostly come from watching what the more experienced programmers in my workplace have been doing.

Which is to say: I had some things to learn. I’m not entirely sure that I took away the right things from this training, though. Previously, my relationship with SQL Server has been friendly. We’ve been colleagues working together to retrieve data correctly and quickly for our customers. Now, though? I see SQL Server as a devious force working to undermine me at every turn.

I want to share a few examples. Partly to cement these things in my head and partly to whinge. These are in no particular order.

Transactions

SQL Server does not support nested transactions. It looks like it supports nested transactions. It tells you it can support nested transactions. But it cannot support nested transactions. Observe.

create table dbo.A ([KEY] int, [VALUE] nvarchar(100));
go
create table dbo.B ([KEY] int, [VALUE] nvarchar(100));
go

begin transaction FIRST
insert into dbo.A ([KEY], [VALUE]) 
	values (10, 'hello'), (11, 'goodbye'), (42, null);

--Doesn't actually start a new transaction 
begin transaction SECOND 
insert into dbo.B ([KEY], [VALUE]) 
	values (1, 'one'), (2, 'two'), (3, 'four'); 
--Rolls back the inserts to A and B
rollback 

--Returns an empty result set because the transaction 
--was rolled back 
select * from dbo.A 

--Throws an error message because there's no 
--longer a transaction running. It's been rolled back. 
commit 

The good news, if you want to call it that, is that it works the other way: if you commit the “first” transaction and roll back the “second”, both inserts will be rolled back. Take a look at the docs for @@TRANCOUNT to see how it all works (“begin transaction” increments @@TRANCOUNT, “commit transaction” decrements it, and “rollback” undoes everything and sets @@TRANCOUNT to 0). It’s a mess. The take-away, I think, is that nested transactions must be considered harmful. You can probably get away with them as long as you only ever commit in the “inner” transactions and keep your error handling and rolling back in your outermost code, but even that seems too fragile.
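A minimal sketch of the counter in action, if you want to watch it yourself:

```sql
begin transaction FIRST
select @@TRANCOUNT;  -- 1

begin transaction SECOND  -- just increments the counter
select @@TRANCOUNT;  -- 2

commit  -- decrements the counter; nothing is actually committed yet
select @@TRANCOUNT;  -- 1

rollback  -- undoes *everything* and sets the counter to 0
select @@TRANCOUNT;  -- 0
```

The “commit” in the middle is the trap: it looks like it finished the inner transaction, but all it did was tick a counter down.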

Temp Tables

Unfortunately, it gets worse. Take a look at this:

create procedure dbo.A 
as
create table #temptable (thing1 int, thing2 int); 
insert into #temptable values (42, 43);
exec dbo.B;
go

create procedure dbo.B
as
create table #temptable (anotherThing nvarchar(100));
insert into #temptable values ('This One Here');
select * from #temptable 
go

--This won't work (depending on what's found its way into your plan cache):
exec dbo.A 

When I ran this, SQL Server threw error 213: “Procedure B, Line 5: Column name or number of supplied values does not match table definition.” But we can clearly see that the insert in dbo.B is fine. It gets worse, though. If I clear the procedure cache (dbcc freeproccache) and then run

exec dbo.B;
exec dbo.A;

everything works fine. When you execute A first, SQL Server creates a plan for inserting into #temptable. Then, when it reaches A’s call to B, it looks for a plan for inserting into #temptable, and even though B’s #temptable is a completely different table, it decides the cached plan for A’s #temptable will do. But the number of columns is wrong, so the insert fails.

The second time around, I called B first. Since SQL Server already had a plan in cache for running B, it didn’t try to generate another one. So that worked. Of course, if in the fullness of time B’s plan ends up dropping out of the cache…well, B will break again. Randomly. Through no fault of its own (or its developer’s).

So temp tables are pretty unsafe as well. I think the only sane way to use them is to enforce strict naming discipline. If I’d named them #dbo_a_temptable_1 and #dbo_b_temptable_2, there wouldn’t have been an issue unless A were actively malicious (and if you’re letting someone write actively malicious code on your database, this is the least of your worries). Still, this sort of discipline isn’t something I see documented very often. That’s pretty unfortunate.
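To illustrate the naming discipline (a sketch; the prefixed names are just a convention I’m suggesting, not anything SQL Server enforces):

```sql
create procedure dbo.A 
as
create table #dbo_A_temptable (thing1 int, thing2 int); 
insert into #dbo_A_temptable values (42, 43);
exec dbo.B;
go

create procedure dbo.B
as
--A different name, so B can never collide with a cached plan for A's table
create table #dbo_B_temptable (anotherThing nvarchar(100));
insert into #dbo_B_temptable values ('This One Here');
select * from #dbo_B_temptable;
go

--Now this works regardless of what's in the plan cache
exec dbo.A;
```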

Miscellaneous Evils

Then there are the smaller things that I already knew in the back of my mind, but seeing them all at once in a formal class sort of drove home just how programmer-unfriendly SQL Server can be.

I had a vague idea that scalar functions were slow and that inline (or “single statement”) table-valued functions were fast. It was driven home to me just how slow scalars are, though: don’t use them. (I’d actually just spent a week or two converting a really slow query that used a scalar function to a nice and speedy inline table function, so this didn’t really come from the class.) The instructor was also not fond of multi-statement table functions, preferring SPs instead (since that’s how these functions are actually run in the database). Personally, I think there’s a lot to be said for being able to join directly to a table-valued function, though. That said, they can only return table variables, and those can have at most one index (a clustered primary key declared on the table variable becomes a clustered index). So if your function returns a lot of data and an index would be helpful, the only real option is to use an SP to toss the data into a real table (or a temp table, I suppose) and join to that instead.
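As a sketch of the kind of conversion I mean (all the names here are made up for illustration): the scalar function is called once per row and is opaque to the optimizer, while the inline version gets expanded into the calling query.

```sql
--Slow: invoked row by row, invisible to the optimizer
create function dbo.FULL_NAME_SCALAR(@FIRST nvarchar(50), @LAST nvarchar(50))
returns nvarchar(101)
as
begin
	return @FIRST + ' ' + @LAST;
end
go

--Fast: a single-statement ("inline") table-valued function; the optimizer
--treats it as if its body were pasted into the calling query
create function dbo.FULL_NAME_INLINE(@FIRST nvarchar(50), @LAST nvarchar(50))
returns table
as
return (select @FIRST + ' ' + @LAST as FULL_NAME);
go

--Usage: cross apply instead of a scalar call in the select list
select p.NAME, f.FULL_NAME
from dbo.PEOPLE p
cross apply dbo.FULL_NAME_INLINE(p.FIRSTNAME, p.LASTNAME) f;
```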

Parameter sniffing came up a lot in the class. If you have something like

create procedure dbo.FIND_ENTITY(@ENTITYTYPE tinyint, @ENTITYNAME nvarchar(100))
as
if @ENTITYTYPE = 1 
	select * from dbo.PEOPLE where NAME like @ENTITYNAME;
if @ENTITYTYPE = 2 
	select * from dbo.COMPANIES where NAME like @ENTITYNAME;

you’ll almost certainly end up running one of the queries with the wrong plan, depending on what’s in the cache. If the procedure was first compiled for finding people, the companies query will be optimized for a name like “John Smith”. If it was first compiled for finding companies, the people query will be optimized for a name like “Microsoft”. Since the optimizer depends on a statistical analysis of what the results are likely to be (in order to decide, say, between a merge join and a nested loop join), optimizing based on the wrong parameters can be devastating. And it all comes down to which branch is used when the SP isn’t in the cache.

The solution is to do something like

create procedure dbo.FIND_ENTITY(@ENTITYTYPE tinyint, @ENTITYNAME nvarchar(100))
as
if @ENTITYTYPE = 1 
	exec dbo.FIND_PEOPLE @ENTITYNAME;
if @ENTITYTYPE = 2 
	exec dbo.FIND_COMPANIES @ENTITYNAME;

since the separate SPs won’t have plans generated for them until they’re actually executed (and when they’re executed, they’ll have been called with the “right” @ENTITYNAME for that use). You could also use option (recompile) to make the statements recompile every time, but that might not be a great solution for procedures you expect to get called hundreds of times a second. Still, it just feels sneaky.
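For completeness, the recompile variant looks something like this (a sketch: it trades a compilation on every call for a plan tailored to that call’s parameters):

```sql
create procedure dbo.FIND_ENTITY(@ENTITYTYPE tinyint, @ENTITYNAME nvarchar(100))
as
if @ENTITYTYPE = 1 
	select * from dbo.PEOPLE where NAME like @ENTITYNAME
	option (recompile); --fresh plan for this statement on every execution
if @ENTITYTYPE = 2 
	select * from dbo.COMPANIES where NAME like @ENTITYNAME
	option (recompile);
```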

You might think that you can be clever and combine this trick with the fact that inline table functions are fast. So you try to do

create procedure dbo.FIND_ENTITY(@ENTITYTYPE tinyint, @ENTITYNAME nvarchar(100))
as
if @ENTITYTYPE = 1 
	select * from dbo.FIND_PEOPLE(@ENTITYNAME); --Inline table-valued function
if @ENTITYTYPE = 2 
	select * from dbo.FIND_COMPANIES(@ENTITYNAME); --Inline table-valued function

But the key to inline table functions is that they’re just that: inline. SQL Server sees that procedure exactly as if you’d copied and pasted the text of the functions into it, so you end up with the same parameter-sniffing problem as before. SQL Server defeats your cleverness again.

Multi-statement functions are fine, though, allowing for the fact that the data they return won’t be indexed. I’m not sure it’s fair that the function can look identical to the caller, but have such completely different performance characteristics.

Conclusion

And that’s really my problem with most of this. I didn’t come up in the mainframe days. I’ve never punched cards. I don’t worry too much about what my processor is doing. I’m a high-level programmer raised in a high-level programming world. My entire education and most of my career has taught me the value of black box programming. I should be able to call a function and, as long as it obeys its contract, everything should be fine. Anyone should be able to call the function I’m writing, and as long as they’re obeying their contract, everything should be fine. I don’t have to know the details of my callers and I don’t have to know the details of what I’m calling.

SQL Server doesn’t really give me that though. I have to be familiar with everything that’s happening throughout the call stack. I have to know if it’s safe to begin or rollback a transaction (because someone else in the chain might want to start or commit one). I have to know what the safe names for my temp tables are. I have to know how everything I call was written so I can consider its performance correctly.

If I have something that’s best “wrapped up” as a scalar function (and doesn’t really lend itself to being a table function), I have to copy and paste that code into all my queries that can use it (or pay an outrageous performance penalty). So any bug fixes have to be made in all my queries instead of just one function. (Alternatively, I guess, I can create a table function that returns one row and cross apply it in my other queries. But I try really hard to not confuse the guy coming after me.)
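For what it’s worth, that cross apply workaround looks something like this (a sketch with made-up names):

```sql
--An inline function standing in for what would naturally be a scalar function
create function dbo.AGE_IN_YEARS(@BIRTHDATE date)
returns table
as
return (select datediff(year, @BIRTHDATE, getdate()) as AGE);
go

--Joined via cross apply; the optimizer inlines it, so it stays fast,
--but the intent is less obvious than a plain scalar call would be
select p.NAME, a.AGE
from dbo.PEOPLE p
cross apply dbo.AGE_IN_YEARS(p.BIRTHDATE) a;
```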

And that, really, seems to be the best way to achieve good performance in SQL: forget everything we’ve learned about programming in the last few decades and do all the things we thought we’d left behind.

I can understand a lot of the reasons why: I can understand the difficulties of ACID. I can understand the mishmash between procedural languages and declarative ones. I can understand the tradeoffs that have to be made between maintainability and speed. And because I can understand all of that, I can learn it and I can practice with it, and I can (hopefully) write professional applications built on this tool with a minimum of whinging.

That said, I can still wish it would warn me when I did things that aren’t so good. SQL Server demands perfection, but I’m just not there yet.
