Monitoring MySQL backups with Datadog and TwinDB Backup Tool

01-10 09:24

Monitoring MySQL backups is a vital part of a good backup solution. Recovery Time Objective and Recovery Point Objective are the most common disaster recovery metrics. TwinDB Backup Tool together with Datadog lets you monitor both of them.

Recovery Point Objective

Basically, Recovery Point Objective, aka RPO, is how much data you can lose if a disaster happens. If you take backups every hour, you can lose up to an hour of data. If you take backups every day, you can lose a day. Sometimes people come to us for the data recovery service. They have a backup copy from yesterday, but they can’t tolerate a day of data loss. Unfortunately, they realised that when it was too late.

Recovery Time Objective

Recovery Time Objective, aka RTO, is the time needed to fully restore the database. It’s important to measure it because that way you can check whether your backups are usable at all. I’ll refer to our data recovery customers’ cases. Most of our clients thought they had backups, but when they needed to restore the database it turned out that a backup job didn’t run, or it produced corrupt backups, or the full copies were OK but the incremental ones weren’t, and so on. After a decade in the data recovery business I’ve seen thousands of cases where backups were supposed to be available, but they weren’t. So, do verify your backups.

Needless to say, downtime hurts business. If you know your RTO, you can be prepared and get, for example, insurance that would cover losses in case of a disaster.

As with any other SLA metric, it’s not enough to just record it; you have to alert if the SLA is broken. Thus, if no backup has been taken for longer than the expected RPO, or the RTO exceeds its threshold value, a human must get a notification and take appropriate action to remediate the problem.

How we measure disaster recovery metrics

Technically, Recovery Point Objective is not measured. Rather, it is configured as a desired threshold, and an alert is sent if the threshold is exceeded. If the RPO is an hour, then we take backups every hour and send an alert if the most recent copy is older than that.
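The RPO rule above boils down to a single comparison. Here is a minimal Python sketch of it. The function name `rpo_violated` and the sample timestamps are made up for illustration; in the setup described in this post the check is done by a Datadog monitor, not by a script like this.

```python
from datetime import datetime, timedelta, timezone


def rpo_violated(last_backup_at, rpo, now=None):
    """Return True if the newest backup copy is older than the RPO threshold."""
    now = now or datetime.now(timezone.utc)
    return now - last_backup_at > rpo


# Example: the RPO is one hour and the last copy was taken 90 minutes ago.
now = datetime(2017, 1, 10, 9, 0, tzinfo=timezone.utc)
last_copy = now - timedelta(minutes=90)
print(rpo_violated(last_copy, timedelta(hours=1), now=now))  # True
```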

To measure Recovery Time Objective, we restore the database from the latest copy and record the time it took to do that.
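Measuring the restore time is just a matter of wrapping the restore in a timer. The sketch below shows the idea in Python. `fake_restore` is a hypothetical stand-in for a real restore job, and `timed` is an illustrative helper, not part of TwinDB Backup Tool.

```python
import time


def timed(action):
    """Run `action` and return (result, elapsed_seconds)."""
    started = time.monotonic()
    result = action()
    return result, time.monotonic() - started


def fake_restore():
    # Hypothetical placeholder: a real job would restore the latest copy here.
    time.sleep(0.05)
    return "restored"


result, restore_time = timed(fake_restore)
print(f"restore took {restore_time:.2f}s")
```

A monotonic clock is used so that a system clock adjustment during a long restore does not distort the measurement.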

When TwinDB Backup Tool takes or restores a backup copy, it sends the respective metric to Datadog. In Datadog we put the metrics on a chart to get a historical perspective and configure monitors to alert if our SLA is broken.
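To give an idea of what reporting such a metric involves, here is a hedged Python sketch that builds a payload for Datadog’s v1 metric submission endpoint and posts it with the standard library. This is not TwinDB Backup Tool’s own code. The helper names, the `host:db1` tag and the sample value are assumptions, and the value is assumed to be in seconds.

```python
import json
import time
import urllib.request

# Datadog v1 metric submission endpoint.
SERIES_URL = "https://api.datadoghq.com/api/v1/series"


def build_series(metric, value, tags=(), timestamp=None):
    """Build the JSON body for a single gauge data point."""
    ts = int(timestamp if timestamp is not None else time.time())
    return {
        "series": [
            {
                "metric": metric,
                "points": [[ts, value]],
                "type": "gauge",
                "tags": list(tags),
            }
        ]
    }


def send_series(payload, api_key):
    """POST the payload to Datadog; returns the HTTP status code."""
    request = urllib.request.Request(
        SERIES_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "DD-API-KEY": api_key},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Illustrative values only; send_series() is not called here.
payload = build_series(
    "twindb.mysql.backup_time", 842.5, tags=["host:db1"], timestamp=1484040240
)
print(json.dumps(payload, indent=2))
```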

How to configure monitoring MySQL backups

In TwinDB Backup you need to export the metrics; in Datadog you accept the metrics and configure monitors for alerting.

TwinDB Backup Tool

TwinDB Backup installs a cron configuration where it runs a backup job every hour by default:

 
# cat /etc/cron.d/twindb-backup
@hourly root twindb-backup backup hourly
@daily root twindb-backup backup daily
@weekly root twindb-backup backup weekly
@monthly root twindb-backup backup monthly
@yearly root twindb-backup backup yearly
 

If you need to take backups more often, change the cron config accordingly. Don’t forget to check how often the tool takes full copies: if the database is too big, there may not be enough time to take a full copy.

 
# cat /etc/twindb/twindb-backup.cfg
 
...
 
[mysql]
full_backup=daily
...
 

In the example above, full copies will be taken every day and incremental copies will be taken every hour.

Now you need to configure metrics export from TwinDB Backup to Datadog. Every time TwinDB Backup takes or restores a backup, it will report the respective metrics to Datadog.

 
# cat twindb-backup.cfg
...
[export]
transport=datadog
app_key=***
api_key=***
...
 

Where app_key and api_key are the credentials of your Datadog account.
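A quick sanity check of the [export] section can save a silent failure later. Here is a small Python sketch using the standard `configparser` module. The function `export_problems` and the sample key values are hypothetical, and the check assumes only the three options shown above.

```python
import configparser

REQUIRED_EXPORT_KEYS = ("transport", "app_key", "api_key")


def export_problems(config_text):
    """Return a list of human-readable problems with the [export] section."""
    # Interpolation is disabled so that a '%' in a key is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    if not parser.has_section("export"):
        return ["missing [export] section"]
    problems = []
    for key in REQUIRED_EXPORT_KEYS:
        if not parser.get("export", key, fallback="").strip():
            problems.append(f"missing or empty {key}")
    if parser.get("export", "transport", fallback="") not in ("", "datadog"):
        problems.append("transport is not datadog")
    return problems


# Made-up configs for illustration; real keys come from your Datadog account.
GOOD = "[export]\ntransport=datadog\napp_key=abc\napi_key=def\n"
BAD = "[export]\ntransport=datadog\napi_key=def\n"

print(export_problems(GOOD))  # []
print(export_problems(BAD))   # ['missing or empty app_key']
```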

Datadog

On the Datadog side you need to enable the Python integration, create keys, and create graphs and monitors. Let’s illustrate the whole process step by step.

1. Enable the Python integration on https://app.datadoghq.com/account/settings.

Code usage example.

2. Generate API and APP keys.

The generated keys should be used in the twindb-backup config as shown above.

Note: Steps 1 and 2 are prerequisites for the export feature in TwinDB Backup Tool.

3. Create your dashboard with new graphs or add new graphs to the existing dashboard.

Disaster recovery metrics will be recorded in twindb.mysql.backup_time and twindb.mysql.restore_time.

MySQL disaster recovery metrics

TwinDB Backup Tool reports backup and restore time for file backups, too.

Files disaster recovery metrics

4. Create Datadog monitors that alert when the RPO or RTO SLA is broken.

We will create two monitors: “Backup time is too high” and “Restore time is too high”. Each monitor has two functions. The first is to alert if a backup/restore threshold is exceeded, and the second is to alert if TwinDB Backup hasn’t reported the backup or restore time metric for a long time.

Backup time exceeds threshold.

 
{
    "name": "Backup time exceeds threshold",
    "type": "query alert",
    "query": "max(last_1h):max:twindb.mysql.backup_time{*} > 3600",
    "message": "Backup monitor @pagerduty",
    "tags": [
        "*"
    ],
    "options": {
        "timeout_h": 0,
        "notify_no_data": true,
        "no_data_timeframe": 120,
        "notify_audit": false,
        "require_full_window": false,
        "new_host_delay": 300,
        "include_tags": true,
        "escalation_message": "",
        "locked": false,
        "renotify_interval": 60,
        "evaluation_delay": "",
        "thresholds": {
            "critical": 3600,
            "warning": 1800
        }
    }
}
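
To make the two functions of the monitor concrete, here is a simplified local illustration in Python. It is not how Datadog evaluates monitors internally. It only mirrors the settings above: the worst value in the last hour is compared to the thresholds, and a "no data" state is raised after 120 minutes without reports (no_data_timeframe is expressed in minutes). The function name and the sample points are made up, and treating an empty evaluation window as "ok" is an assumption of this sketch.

```python
def monitor_state(points, now, critical=3600, warning=1800,
                  window_s=3600, no_data_s=120 * 60):
    """Classify (unix_timestamp, seconds) samples like the monitor above."""
    # Second function: nothing was reported within the no-data timeframe.
    if not any(0 <= now - ts <= no_data_s for ts, _ in points):
        return "no data"
    # First function: compare the worst value in the window to the thresholds.
    recent = [value for ts, value in points if 0 <= now - ts <= window_s]
    worst = max(recent, default=0)  # assumption: an empty window counts as OK
    if worst > critical:
        return "alert"
    if worst > warning:
        return "warn"
    return "ok"


now = 1_000_000
print(monitor_state([(now - 600, 4000)], now))       # alert
print(monitor_state([(now - 600, 2000)], now))       # warn
print(monitor_state([(now - 600, 900)], now))        # ok
print(monitor_state([(now - 3 * 3600, 900)], now))   # no data
```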
 

Restore time is higher than threshold.

 
{
    "name": "Restore time exceeds threshold",
    "type": "metric alert",
    "query": "max(last_1h):max:twindb.mysql.restore_time{*} > 3600",
    "message": "Verify monitor @pagerduty",
    "tags": [
        "*"
    ],
    "options": {
        "timeout_h": 0,
        "notify_no_data": true,
        "no_data_timeframe": 120,
        "notify_audit": false,
        "require_full_window": false,
        "new_host_delay": 300,
        "include_tags": false,
        "escalation_message": "",
        "locked": false,
        "renotify_interval": "0",
        "evaluation_delay": "",
        "thresholds": {
            "critical": 3600,
            "warning": 1800
        }
    }
}
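
Since the two definitions differ mostly in the metric name and message, they can be generated from one template. The following Python sketch does that. The helper `build_monitor` is hypothetical and emits only a subset of the options shown above. It keeps the threshold in the query and the critical threshold in the options in sync, since Datadog expects the two to match.

```python
import json


def build_monitor(name, metric, critical, warning, message,
                  no_data_minutes=120):
    """Build a monitor definition in the same shape as the JSON above."""
    return {
        "name": name,
        "type": "metric alert",
        # The threshold in the query mirrors the critical threshold below.
        "query": f"max(last_1h):max:{metric}{{*}} > {critical}",
        "message": message,
        "options": {
            "notify_no_data": True,
            "no_data_timeframe": no_data_minutes,
            "require_full_window": False,
            "thresholds": {"critical": critical, "warning": warning},
        },
    }


restore = build_monitor(
    "Restore time exceeds threshold",
    "twindb.mysql.restore_time",
    critical=3600,
    warning=1800,
    message="Verify monitor @pagerduty",
)
print(json.dumps(restore, indent=4))
```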
 
Original link: https://twindb.com/monitoring-mysql-backups/?utm_source=tuicool&utm_medium=referral