Reporting Services – SSRS, Reporting Services, SQL Server, Business Inteligence, Pentaho, T-SQL

August 9, 2020August 9, 2020

Multi-thread batch inserts to one table without wait

Multi-thread parallel inserts of multiple rows into one table without locks causing waits? I was convinced, that this task can be realized without deeper thinking about its implementation. But I am wrong, even though there are lots of posts on internet, describing that it is easy task.

I simulated this scenario on my laptop environment with SQL Server 2017 developer edition.

I have one table and I would like to insert data into the table from two sessions in parallel fashion. See structure of the heap table below.

CREATE TABLE dbo.ConcurentTable (b INT)

Execute long running insert to the table which should cause wait of another insert running in next query window.

BEGIN TRAN
INSERT INTO dbo.ConcurentTable
SELECT TOP 20000000 a.object_id
FROM sys.objects a WITH (NOLOCK)
JOIN sys.objects b WITH (NOLOCK) ON 1=1
JOIN sys.objects c WITH (NOLOCK) ON c.object_id=a.object_id

COMMIT

Here we have the short one insert.

BEGIN TRAN
INSERT INTO dbo.ConcurentTable  (b )
SELECT TOP 100000 a.object_id
FROM sys.objects a WITH (NOLOCK)
JOIN sys.objects b WITH (NOLOCK) ON b.object_id=a.object_id
COMMIT

Now run the long running query in first query window and then the query in the second one window.

In my environment I see that the second one query is waiting till the first one is finished.

What is the reason? Let’s look on the lock type first. Use dynamic views or sp_lock proc to get following output.

You can see that there is an exclusive lock as expected according to documentation defining that DML operations use exclusive locks to protect more processes trying to change the same data. Used for data-modification operations, such as INSERT, UPDATE, or DELETE. Ensures that multiple updates cannot be made to the same resource at the same time. Seehttps://bit.ly/39trAiq
The lock escalated to Table Resource level, because it is default behaviour of SQL server. The table level is more convenient for SQL server in our scenario because it spares some more resources. If SQL server would use lower level of locking mechanism, for example row resource level, for such a huge amount of inserted data, there will be lots of locks on lower level which is very time and resource consuming with impact on transaction log, etc…

I played with count of inserted rows to get locks on lower resources. When I get smaller batch for insert (about 2000 on my environment), SQL Server used Lock escalation on row level. For smaller amount of data, SQL Server would prefer to use lower resources to lock. Now you could see RID lock (because we don’t have any index on the table) which means that the whole table is not locked exclusively but its ranges of rows only.

Notice: There is defined threshold when SQL Server decides to use lock escalation. I found information in SQL Server 2008 Internals book.

The number of locks held by a single statement on one object, or on one partition of one object, exceeds a threshold. Currently that threshold is 5,000 locks, but it might change in future service packs. The lock escalation does not occur if the locks are spread over multiple objects in the same statement—for example, 3,000 locks in one index and 3,000 in another.
Memory taken by lock resources exceeds 40 percent of the non-AWE (32-bit) or regular (64-bit) enabled memory and the locks configuration option is set to 0. (In this case, the lock memory is allocated dynamically as needed, so the 40 percent value is not a constant.) If the locks option is set to a nonzero value, memory reserved for locks is statically allocated when SQL Server starts. Escalation occurs when SQL Server is using more than 40 percent of the reserved lock memory for lock resources.

But as I understood from other blog posts and discussions, above mentioned could change from distribution to distribution. It would be better to make some tests find the threshold on your environment. I am going to write another post about this topic.

Is there a way to manage above mentioned for huge batches of inserted data?

Let’s try.

We can use hint to force query to use lower resource level – ROW or PAGE. We can use ROWLOCK hint to force lock escalation on low resource level. But as states in documentation SQL Server still can decide which locking level will be used.
```
INSERT INTO dbo.ConcurentTable WITH (ROWLOCK)
```
To prevent such a behaviour, we can disable lock escalation on our table. Run following command. (there are other ways to change the lock escalation behaviour even on SQL Server level, but let’s describe in another post).
```
ALTER TABLE dbo.ConcurentTable SET (LOCK_ESCALATION = DISABLE)
```
Run the queries to simulate our scenario again. From sp_lock output you can see now exclusive lock on RID resource since we have heap table.

RID locks

Now you should be successful with your second one query. The data were inserted before the first one query is finished, without waiting. In my environment I have occasionally problems with RID lock and PAGE lock. Seems to me that these two kinds of locks cannot guarantee concurrent batch inserts in the same time. I extend the table about additional attribute to split the inserted data and support concurrent inserts with KEY lock, see below.

Notice you can check waiting behaviour with NOWAIT hint which raises error in case the query is waiting.

Msg 1222, Level 16, State 45, Line 13

Lock request time out period exceeded.

Let’s add an attribute to distinguish batches of data based on SPID and sort he tables according to it.
```
ALTER TABLE dbo.ConcurentTable ADD spidFilter SMALLINT NOT NULL DEFAULT @@spid

ALTER TABLE dbo.ConcurentTable ADD CONSTRAINT chSPidFIlter CHECK (SPIDFILTER = @@SPID)

CREATE CLUSTERED INDEX CX_ConcurentTable ON dbo.ConcurentTable (SPIDFILTER )
```
Here we got it.

The second one query finished while the first “blocking” one is still running. In sp_lock we can see that there is KEY lock resource type used because of added clustered index.

KEY locks

When I tried to put lock escalation back on the table, I returned at the beginning – table lock and the second query was waiting till the first one was finsihed.

Closure

I was successful in concurrency scenario when 2 batches were inserted data to the same table from multiple threads with:
- Lock escalation on the table was disabled
- Query was forced to use lock on low resources level
- Clustered index design to separate inserted rows helped with count of succesfull experiments
- Locks on lower level resources hepled to increase concurrency of inserts in general
Notices
- Still not sure if above mentioned will work in all cases
- This forced lower level escalation approach does not block the second query but from point of view of resources it could be better strategy to leave the first one query finish and after that to execute the second one query ( the behaviour we tried at the beginning of this post). But this could be subject of other tests and play I will realize in one of my next post
- If you know any better way how to manage above mentioned scenario, please share it
Stay tuned

July 14, 2020July 14, 2020

SSAS – make cube in 5 minutes

To support my current database team with additional database skills I decided to start some series focused on BI area. This POST describes quick start with Analysis Services – Multidimensional Databases.

I already wrote post regarding Analysis Services (Create SSAS database) and its settings for multidimensional mode. What should be prepared in advance are some relation data (Adventure Works) and Data Tools studio supporting Analysis Services. You can find in my previous post.

GOAL:

Let’s make a simple cube including 1 measure and 2 dimensions.

In short:

Let’s create new project first – Multidimensional SSAS – Create Analysis Services Database
Create Data Sources
Create Data Source View
Create Dimensions
Create CUBE
1. Define measures
2. Create dimension relationships
Deploy solution to server
To be continued

Create Analysis Services Multidimensional and Data Mining project first.

Define Data Sources. In our case, there is one Data Source to relation database. There could be defined more Data Sources from different connections. In Solution Explorer, right click on Data Sources section and click on New Data Source, define connection info in dialog box and confirm to create Data Source. Solution Explorer with defined Data Source should look like on picture bellow.

As next step we must define objects for logical database model. Right click on Data Source Views section in Solution explorer and click on new Data Source Views and Wizard to create Data Source Views appears. Select Data Source in next window, Adventure Works DW in our case and you should see dialog with list of objects form your Data Source on left side. Choose objects for your data model and continue with next.

After selection you should have chosen objects on the right side of Select Tables and Views window. We selected DimCustomer, FactInternetSales and DimDate in our case.

As last step name Data source View and Data Source View Wizard is finished.

In case you have relational layer defined with referential integrity, it should be transferred to the Data Source View. In case that primary and foreign keys are missing you can create logical relationships between tables in Data Source Views. It is possible to create more Data Source Views, in case you have lots of database objects it is good practice to split them according to their logical area.

When above mentioned steps are done you can see the same Solution Explorer as on picture bellow.

Solution explorer - Data Source View — Solution explorer – Data Source View

Let’s check data model in Data Source Views on picture bellow. You can see that relationships were created according to relation layer by default.

Data Source View - model — Data Source View – model

To create our cube, we use Cube Wizard where measures and dimensions are defined.

To speed up our process, check Use existing tables. In next window, select Fact table to define measures. You can click on Suggest button to check it automatically (based on defined Data Source View).

Cube Wizard - measure — Cube Wizard – measure

To finish Cube Wizard choose dimensions you use in the cube.

Cube Wizard - dimension — Cube Wizard – dimension

Now you should see created Cube and Dimensions in Solution explorer.

By double click on cube you see cube structure work space, with information regarding Measures, Dimensions, Data Source View etc.

On the top menu of the Cube Structure Workspace you can see menu yo can navigate to other sections, like Dimension usage, KPIs, Actions etc. What is interegsting now is to check Dimension usage you defined relationship between facts and dimensions. As you can see Cube Wizard did this task automatically.

Now we can deploy our solution to server. Go to Project -> Properties see Configuraion dialog as on picture bellow. In Deployment section set Target Server and Database name. In other options you can set for example Processing Option to specify whether Analysis Services objects should be processed with deployment of the project. Leave Default in our case to Process cube after deployment.

Project Configuration - Deployment — Project Configuration – Deployment

After you confirm above mentoned seting go to solution explorer, right click on project and click on Deploy in context menu on picture bellow.

After deployment we can Browse deployed and processed cube in Cube Browser – top right menu.

As next steps we could extend dimensions by other attributes, try to create Hierarchy, add another dimension, etc.
Lots of scenarios you can try now, but to be continued in next posts.

July 8, 2020July 14, 2020

Partitioning

In this post I would like to continue with partitioning series. Previous post we created partitioned view and look how it looks like in execution plan. In this post we will create partitioned table.

In case of simple partitioned table, we will need to do following:

Create table
Create partitioning function
Create partitioned schema
Relationship between partitioned schema and the table
Let’s check created partition
Let’s verify partition query in execution plan

As in the partitioned view post there will be created 3 partitions within the same filegroup (without separated disks per partitioned). This solution will not profit from parallel reading from partitions separated on different disks. For simple demonstration it will fit.

Let’s create the table first

DROP TABLE [dbo].[PartitionedTable]
CREATE TABLE [dbo].[PartitionedTable] (id INT NOT NULL , booking_date DATE NOT NULL, data  SYSNAME)

ALTER TABLE [dbo].[PartitionedTable] ADD PRIMARY KEY CLUSTERED ( [id] ASC, [booking_date] ASC)

Create partitioned function. It defines partition data according to following boundaries

<
'2019-10-01' , >= '2019-10-01' <
'2019-11-01' , >= '2019-11-01' <
'2019-12-01', >='2019-12-01'.

The side of boundaries values is defined by LEFT or RIGHT key word.

CREATE PARTITION FUNCTION PartitionedFunction (DATE)
AS RANGE  RIGHT FOR VALUES ('2019-10-01', '2019-11-01', '2019-12-01');

Create partition scheme to specify which partition of table or index belongs to which filegroup. All partitions are defined for Primary filegroup in our case.

CREATE PARTITION SCHEME PartitionedScheme 
AS PARTITION PartitionedFunction ALL TO ([PRIMARY])

Now let’s create relationship between table and partitioned schema.

ALTER TABLE [dbo].[PartitionedTable] DROP CONSTRAINT PK_PartitionedTable__booking_date

ALTER TABLE [dbo].[PartitionedTable] ADD CONSTRAINT PK_PartitionedTable__booking_date PRIMARY KEY CLUSTERED ([booking_date] ASC) ON [PartitionedScheme]([booking_date])

Here we have few queries to check our partitioned objects.

Let’s check connection between partition schema and partitioned table.

SELECT * 
FROM sys.tables AS t
JOIN sys.indexes AS i ON t.[object_id] = i.[object_id] AND i.[type] IN (0,1)
JOIN sys.partition_schemes ps ON i.data_space_id = ps.data_space_id
WHERE t.name = 'PartitionedTable';

View to check created partitions.

SELECT t.name AS TableName,
i.name AS IndexName,
p.partition_number,
p.partition_id,
i.data_space_id,
f.function_id,
f.type_desc,
r.boundary_id,
r.value AS BoundaryValue
FROM      sys.tables AS t
JOIN      sys.indexes AS i           ON t.object_id = i.object_id
JOIN     sys.partitions  AS p      ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN      sys.partition_schemes AS s ON i.data_space_id = s.data_space_id
JOIN      sys.partition_functions AS f ON s.function_id = f.function_id
LEFT JOIN sys.partition_range_values AS r ON f.function_id = r.function_id and r.boundary_id = p.partition_number
WHERE t.name = 'PartitionedTable' AND i.type <= 1
ORDER BY p.partition_number;

You can check partitioned column as well with following query.

SELECT
t.[object_id] AS ObjectID
, t.name AS TableName
, ic.column_id AS PartitioningColumnID
, c.name AS PartitioningColumnName
FROM sys.tables         AS t
JOIN sys.indexes         AS i ON t.[object_id] = i.[object_id] AND i.[type] <= 1 -- clustered index or a heap
JOIN sys.partition_schemes AS ps      ON ps.data_space_id = i.data_space_id
JOIN sys.index_columns AS ic ON ic.[object_id] = i.[object_id] AND ic.index_id = i.index_id AND ic.partition_ordinal >= 1 -- because 0 = non-partitioning column
JOIN sys.columns         AS c ON t.[object_id] = c.[object_id] AND ic.column_id = c.column_id
WHERE t.name = 'PartitionedTable'
;

Check Execution plan

If we run following query check execution plan how it is look like.

SELECT * FROM [dbo].[PartitionedTable] WHERE booking_date >=
CAST ('2019-11-01' AS DATE) AND booking_date <= CAST ('2019-11-30' AS DATE)

Execution plan - Actual partition count — Execution plan – Actual partition count

We can see Actual Partition Count = 1 which is ok since we would like to get data just for one moth= one partition in our case.

Let’s notice RangePartitionNew function which contains ranges from all defined partitions. I would expect to see our predicate values, so why optimizer show this? The reason is that we use simple predicate in our query which leads to simple parametrization.

SELECT * FROM [dbo].[PartitionedTable] WHERE booking_date >= CAST ('2019-11-01' AS DATE) AND booking_date <= CAST ('2019-11-30' AS DATE)

Modify the query as follows

SELECT * FROM [dbo].[PartitionedTable] WHERE booking_date >= CAST ('2019-11-01' AS DATE) AND booking_date <= CAST ('2019-11-30'  AS DATE) AND 1<>2.

It eliminates simple parametrization and we get in execution plan what we expected, see plan bellow.

Execution plan - Partition predicate — Execution plan – Partition predicate

Let’s try to modify query now u can see in execution plan that partition count value is 2.

SELECT * FROM [dbo].[PartitionedTable] WHERE booking_date >= CAST ('2019-11-01' AS DATE) AND booking_date < CAST ('2019-12-01' AS DATE)

It is important how the predicate is set. Since above mentioned query touches two partitions instead of 1.

That was very quick introduction to SQL Sever partitioning.

It would be nice to look at partitioning little bit deeper in one of my next post and try to compare some scenarios with partitioned views approach or how RangePartitionNew function works internally.

Stay tuned.