Facing a Full TempDB? Here’s What Happened – and How to Prevent It From Happening Again


William Durkin


Uncategorized


May 30, 2020

Ahh, TempDB space issues. We’ve all come across the following message at one time or another:

Could not allocate a new page for database 'tempdb' because of insufficient disk space in filegroup 'PRIMARY'.
CREATE the necessary space by dropping objects in the filegroup, adding additional files to the filegroup,or setting autogrowth on for existing files in the filegroup.

When investigating a TempDB issue like this, most simply restart the SQL Server instance. It’s easy to see why they do – the issue quite often locks up completely, and if a customer wants their server to work again ASAP, then a restart is almost inevitable.

A restart will tackle the symptom, but not the cause. It also removes the evidence of the transaction that caused TempDB to fill up in the first place.

What to Do When a TempDB (Or Other Database) Fills up

Most customers haven’t thought of a solution to counteract a disk/database filling up. To deal with a suddenly full TempDB or other database – or to avoid catastrophe in future – there are two main strategies.

If you have a volume size limitation and need to ensure that you’ll be able to arrange for more disk space very quickly, one option is to use “dummy” or “buffer” files. These are files that are a particular size, but have no actual content. Created on the disk as “insurance”, you keep these files around until you reach a pre-defined threshold of “low disk space” or when the disk is actually full. When that happens, you simply delete one or more of these “dummy” files and you immediately have more disk space.

Alternatively, you can employ max size limits. Database settings let you control in what size “steps” a database can grow by when auto-growth is activated. You may also want to set a maximum size for the database file(s) to grow to, which will ensure it never grows so large that it completely fills a disk. Think of it like a disk quota on a database file level. I prefer setting an upper limit on the database file(s) rather than allowing unlimited growth because it’s always easier to change the maximum growth limitation than it is to extend a disk.

But Who Filled My TempDB?

As mentioned, the restart you may have used in case of a disk fill up erases evidence of what caused the fill up. To ensure the cause of the problem can actually be pinpointed, I have used Extended Events.

Extended Events is an event processing framework that has been built inside SQL Server since SQL 2008. It runs very lightweight background monitoring of system events, and can capture information about which command/transaction caused a file growth event to occur.

For a full introduction to Extended Events, please take a look at the 31 Days of Extended Events series from Jonathan Kehayias.

Catching the culprit

To be able to catch the offending process/user, create an Extended Events session on the server with the following code:

CREATE EVENT SESSION [Database_Growth_Watchdog] ON SERVER
ADD EVENT sqlserver.database_file_size_change (
ACTION ( sqlserver.client_app_name, sqlserver.client_hostname, sqlserver.database_name, sqlserver.session_nt_username, sqlserver.sql_text )
WHERE ( [database_id] = ( 2 ) ) -- We filter on database_id=2 to get TempDB growth only
)
ADD TARGET package0.event_file ( SET filename = 'D:\Temp\Database_Growth_Wathdog.xel',
max_file_size = ( 10 ) )
WITH ( MAX_MEMORY = 4096 KB,
EVENT_RETENTION_MODE = ALLOW_SINGLE_EVENT_LOSS,
MAX_DISPATCH_LATENCY = 1 SECONDS,
MAX_EVENT_SIZE = 0 KB,
MEMORY_PARTITION_MODE = NONE,
TRACK_CAUSALITY = OFF,
STARTUP_STATE = ON )
GO
ALTER EVENT SESSION [Database_Growth_Watchdog] ON SERVER STATE = START

Important notes about this code:

We are monitoring for data file and log file growth. The event sqlserver.database_file_size_change fires for any file size change. We want to be informed of any and all file growth to ensure nothing is missed.
If you have multiple data files for TempDB (such as in a multi-core environment) you will see one event fire for each file that is growing. E.g. You have 4 data files and the database grows, you will see 4 entries in the extended events output.
The session is set to flush events to the output file in 1 second intervals (MAX_DISPATCH_LATENCY). This is done to ensure we lose as few entries to the output file as possible. If TempDB fills up, the entire instance can often stop working completely. We want to catch as much information as possible before this happens, so we flush to the output file in very short intervals.
We start the session at instance startup (STARTUP_STATE). This ensures we have the event active immediately on server startup. Because we are monitoring file growth events, we should remain very lightweight so we won’t have to worry about swamping our system with extended events processing.
We limit the output file to 10MB and allow it to deliver to a total of 5 files. This means we have to have 50MB for the files in the output directory and won’t spam the folder with too much data.

When the event fires, we collect information about the query/command that caused the file growth to occur. This includes NT username, Hostname, origin database of the query, command text and application name.

This session collects vital information that will help pinpoint what’s causing the TempDB to fill up. However, there is the possibility of false positives in this setup. TempDB may have been almost completely filled by another previous transaction and the transaction causing the growth event is an innocent bystander. This is unavoidable, and needs to be kept in mind when analysing the data. If you don’t catch the exact transaction this way, you are still on the right track.

Analyse the data

Once the data has been collected, we need to load and parse the output files to make sense of what has happened. The following code will parse the xml output that is in the (up to) 5 files.

DECLARE @TraceFileLocation nvarchar(255) = N'D:\Temp\Database_Growth_Watchdog*.xel';
WITH FileSizeChangedEvent AS
(SELECT object_name AS Event,
CONVERT (xml, event_data) AS Data
FROM sys.fn_xe_file_target_read_file (@TraceFileLocation, NULL, NULL, NULL) )
SELECT FileSizeChangedEvent.Data.value ('(/event/@timestamp)[1]', 'DATETIME') AS EventTime,
FileSizeChangedEvent.Data.value ('(/event/data/value)[7]', 'BIGINT') AS GrowthInKB,
FileSizeChangedEvent.Data.value ('(/event/action/value)[2]', 'VARCHAR(MAX)') AS ClientUsername,
FileSizeChangedEvent.Data.value ('(/event/action/value)[4]', 'VARCHAR(MAX)') AS ClientHostname,
FileSizeChangedEvent.Data.value ('(/event/action/value)[5]', 'VARCHAR(MAX)') AS ClientAppName,
FileSizeChangedEvent.Data.value ('(/event/action/value)[3]', 'VARCHAR(MAX)') AS ClientAppDBName,
FileSizeChangedEvent.Data.value ('(/event/action/value)[1]', 'VARCHAR(MAX)') AS SQLCommandText,
FileSizeChangedEvent.Data.value ('(/event/data/value)[1]', 'BIGINT') AS SystemDuration,
FileSizeChangedEvent.Data.value ('(/event/data/value)[2]', 'BIGINT') AS SystemDatabaseId,
FileSizeChangedEvent.Data.value ('(/event/data/value)[8]', 'VARCHAR(MAX)') AS SystemDatabaseFileName,
FileSizeChangedEvent.Data.value ('(/event/data/text)[1]', 'VARCHAR(MAX)') AS SystemDatabaseFileType,
FileSizeChangedEvent.Data.value ('(/event/data/value)[5]', 'VARCHAR(MAX)') AS SystemIsAutomaticGrowth,
FileSizeChangedEvent.Data
FROM FileSizeChangedEvent;

Please take note of the variable @TraceFileLocation. The example uses a wildcard to allow loading/parsing of multiple files, this is particularly useful if you really do rollover into multiple files.

The query results provide a range of information, including which process caused growth, what command was submitted last and a set of information about the client.

Once you have an idea of who/what is causing TempDB to grow, you can discuss it – and possible solutions – with application owners.

Final Thoughts

It is extremely simple to extend the session to monitor all databases and run a separate monitoring solution to inform you of such growth events. Gianluca Sartori devoted a session to (and demonstrated) streaming Extended Event data and processing them in near real time at his SQLSat Oslo session. I am certain that this would be a fantastic way of doing some sort of real time monitoring of growth events.

Some homework for you: What could you use this Extended Events session for? How could you modify/improve on it to help with your system monitoring and administration? Let us know in the comments.