This might make me the odd one out but I actually really like replication. It took me a while to get comfortable with it but when I did and when I learned how to troubleshoot transactional replication confidently, I became a fan. Since I exclusively use transactional replication and not snapshot replication or merge replication, this post is only about transactional replication and in particular, how to troubleshoot transactional replication errors.
In the production system I work on, replication is highly reliable and rarely if ever causes the DBA’s headaches. It can be less so in our plethora of dev and qa boxes, probably down to rate of change in these environments with regular refreshes. Due to this, I’ve had to fix it many times. As I explain how I troubleshoot replication errors, I assume you know the basics of how replication works. If you don’t, a really good place to start is books online. It describes how replication uses a publishing metaphor and describes all the component parts in detail.
If you don’t currently use replication and you want to set up for the first time, I recommend having a playground instance, not related to any of your production, QA or dev instances that you can play about with and try things out. As I say in a previous post, “This is an ideal place to set up mirroring, replication and Always On Availability Groups.”
Look for errors here first
Replication errors are logged to the table MSRepl_errors in the distribution database. Run the query below to see what your problems are and start at the most recent.
[sql]
SELECT *
FROM distribution..msrepl_errors
ORDER BY ID DESC
[/sql]
If you have any errors, your output will be similar to below:
Regular Errors
Error code: 21074: The subscriptions(s) have been marked inactive. If this occurs, you need to reinitialize the subscription. For beginners, you want to do this in SSMS. Expand Replication -> Local publications. Then either right click on the publication name and choose Reinitialize All Subscriptions or expand the publication and right click on the subscription you want to reinitialize and choose Reinitialize. This can also be done in replication monitor.
Error Code: 3729: Cannot DROP TABLE ‘dbo.X’ because it is being referenced by an object ‘vw_x’ (This occurs when a table you are trying to drop and recreate via reinitializing is schemabound to a view). This indexed view is getting in the way. In a non-production environment, I would script out the view and indexes, drop them, then fix replication and recreate the indexed view. If this happens in production, I’m open to suggestions. You wouldn’t normally want to drop and recreate an indexed view without having a solid understanding of the consequences.
Error Code: 1205: Message: Transaction (Process ID x) was deadlocked on lock resources with another process… (Here you have the deadlock victim – unable to complete because of the said deadlock). This is more likely when setting up replication during an environment refresh and can be easily sorted by reinitializing.
I can’t find a reliable list of all of these errors. If someone could provide one in the comments, I can edit the post.
XACT_SEQNO
You may have noticed the XACT_SEQNO column in the msrepl_errors table. This might appear when you have an error such as ‘row not found at subscriber’. From the results pane, copy the XACT_SEQNO value in question and add it to the query below.
[sql]
select * from distribution.dbo.MSarticles
where article_id in
(
select article_id from distribution.dbo.MSrepl_commands
where xact_seqno = 0x000002B9000430B6000E00000000
)
[/sql]
To find which publication the article belongs to, replace the value of dest_table in the WHERE clause below with the article name from the query above:
[sql]
SELECT sp.name, sa.* FROM sysarticles sa
JOIN syspublications sp
ON sa.pubid = sp.pubid
WHERE dest_table = ‘<Name of table here>’
[/sql]
Replication monitor – Green is good
Green is good
It’s a great feeling when you can say, “replication is green now” after working on a problem. Replication monitor gives you a nice visual interface and immediately tells you if something is up by displaying a red X. There is also a warning icon, a yellow exclamation mark or bang as I’ve heard it called from American DBAs. The exclamation mark is usually an indicator of latency and can even be caused by the overhead of running replication monitor so I wouldn’t worry too much about these, especially if they come and go with any frequency in replication monitor.
Replication monitor also provides the visual interface for reinitializing subscriptions as well as other useful features like tracer tokens.
Being proactive
Alerts
First of all, you want a process alerting you to errors in your replication set up. Whether this be your own custom solution like an SQL agent job which regularly polls Distribution..MSRepl_Errors or a 3rd party monitoring solution, you should have something set up to alert you immediately to errors.
Tracer tokens – are they any good?
Tracer tokens can be good for confirming everything is running well when they succeed. They provide you with information on the full latency of a test transaction. The trouble is when they don’t succeed. They just hang around and don’t report back until they have ultimately failed (which won’t be anytime soon). So you can use tracer tokens with caution but there are other ways. I read a good post by Kendra Little on Monitoring SQL Server Transactional Replication and she describes the use of canary tables. I won’t explain it all here as it would only be repeating someone else’s work but take a look and try them out in your own dev environment.
Comparing row counts as a method of checking
Sometimes, replication can break or just stall and not report any errors. In this scenario, we need another solution for alerting the DBAs to any potential problems. This can be achieved by checking the row count on tables involved in replication at different time intervals. You would take a count and then take another count, say 5 or 10 minutes later and compare the results. If the row difference per article have stayed the same or increased, it can be a sign of an error. It’s not definitive but it can be an effective double check.
Leave a Reply