May 152011
 

Bacula is a backup and restore solution which I’ve been using and contributing to since 2004. Recently, I started backing up to disk and then copying to tape later. This approach, combined with a scratch pool, has proven to be very low maintenance, from a sysadmin point of view. I think all sysadmin will agree: the less you have to do, the better.

At this time of year, I’m away running BSDCan and PGCon. 2011 is also the first year that I’ve been doing this backup and copy solution. I now know how long my tape magazine takes to fill up when I’m not there: 16 days. That includes one set of full backups. 17 tapes or about 475GB.

Looking at The amazing Bacula time and space machine, I noticed that one full tape contained only 4665 MB of data. The volume contained only one job: 61256. I took the job id and searched my Bacula email (I keep it all, just for this situation). I found this in the output for that job:

10-May 10:09 kraken-sd JobId 61255: Wrote label to prelabeled Volume “ETU081” on device “DTL01” (/dev/nsa0)
10-May 10:09 kraken-sd JobId 61255: New volume “ETU081” mounted on device “DTL01” (/dev/nsa0) at 10-May-2011 05:09.
10-May 10:22 kraken-sd JobId 61255: End of Volume “ETU081” at 4:13832 on device “DTL01” (/dev/nsa0). Write of 64512 bytes got 0.
10-May 10:22 kraken-sd JobId 61255: Error: dev.c:1745 ioctl MTWEOF error on “DTL01” (/dev/nsa0). ERR=Input/output error.
10-May 10:22 kraken-sd JobId 61255: End of medium on Volume “ETU081” Bytes=4,892,073,984 Blocks=75,831 at 10-May-2011 05:22.

The key here is the ‘Input/output error’. Various messages can appear after the MTWEOF error, including permission denied. I think this particular situation indicates a problem with the tape media itself.

While away this trip, the tape library started acting up. I started seeing card dumps on a regular basis. I sent an email to the SCSI mailing list, and here is the output I included:

May 14 10:04:04 kraken kernel: ahc0: Recovery Initiated
May 14 10:04:04 kraken kernel: >>>>>>>>>>>>>>>>>> Dump Card State Begins < <<<<<<<<<<<<<<<< May 14 10:04:04 kraken kernel: ahc0: Dumping Card State while idle, at SEQADDR 0x184 May 14 10:04:04 kraken kernel: Card was paused May 14 10:04:04 kraken kernel: ACCUM = 0x0, SINDEX = 0x20, DINDEX = 0xa0, ARG_2 = 0x3f May 14 10:04:04 kraken kernel: HCNT = 0x20 SCBPTR = 0x0 May 14 10:04:04 kraken kernel: SCSISIGI[0x0] ERROR[0x40]:(PCIERRSTAT) SCSIBUSL[0x0] May 14 10:04:04 kraken kernel: LASTPHASE[0x1]:(P_BUSFREE) SCSISEQ[0x12]:(ENAUTOATNP|ENRSELI) May 14 10:04:04 kraken kernel: SBLKCTL[0x2]:(SELWIDE) SCSIRATE[0x0] SEQCTL[0x10]:(FASTMODE) May 14 10:04:04 kraken kernel: SEQ_FLAGS[0xc0]:(NO_CDB_SENT|NOT_IDENTIFIED) SSTAT0[0x4]:(SDONE) May 14 10:04:04 kraken kernel: SSTAT1[0x0] SSTAT2[0x0] SSTAT3[0x0] SIMODE0[0x0] SIMODE1[0xa4]:(ENSCSIPERR|ENSCSIRST|ENSELTIMO) May 14 10:04:04 kraken kernel: SXFRCTL0[0x80]:(DFON) DFCNTRL[0xc]:(DIRECTION|HDMAEN) May 14 10:04:04 kraken kernel: DFSTATUS[0x71]:(FIFOEMP|MREQPEND|FIFOQWDEMP|DFCACHETH) May 14 10:04:04 kraken kernel: STACK: 0xe 0xe 0xe 0x179 May 14 10:04:04 kraken kernel: SCB count = 254 May 14 10:04:04 kraken kernel: Kernel NEXTQSCB = 240 May 14 10:04:04 kraken kernel: Card NEXTQSCB = 252 May 14 10:04:04 kraken kernel: QINFIFO entries: 252 May 14 10:04:04 kraken kernel: Waiting Queue entries: May 14 10:04:04 kraken kernel: Disconnected Queue entries: May 14 10:04:04 kraken kernel: QOUTFIFO entries: May 14 10:04:04 kraken kernel: Sequencer Free SCB List: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 May 14 10:04:04 kraken kernel: Sequencer SCB Info: May 14 10:04:04 kraken kernel: 0 SCB_CONTROL[0x40]:(DISCENB) SCB_SCSIID[0x57] SCB_LUN[0x0] May 14 10:04:04 kraken kernel: SCB_TAG[0xff] May 14 10:04:04 kraken kernel: 1 SCB_CONTROL[0xc0]:(DISCENB|TARGET_SCB) SCB_SCSIID[0x57] May 14 10:04:04 kraken kernel: SCB_LUN[0x0] SCB_TAG[0xff] May 14 10:04:04 kraken kernel: 2 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) May 14 10:04:04 kraken kernel: SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] May 14 10:04:04 kraken kernel: 3 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) May 14 10:04:04 kraken kernel: SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] May 14 10:04:04 kraken kernel: 4 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) May 14 10:04:04 kraken kernel: SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] May 14 10:04:04 kraken kernel: 5 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) May 14 10:04:04 kraken kernel: SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] May 14 10:04:04 kraken kernel: 6 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) May 14 10:04:04 kraken kernel: SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] May 14 10:04:04 kraken kernel: 7 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) May 14 10:04:04 kraken kernel: SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] May 14 10:04:04 kraken kernel: 8 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) May 14 10:04:04 kraken kernel: SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] May 14 10:04:04 kraken kernel: 9 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) May 14 10:04:04 kraken kernel: SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] May 14 10:04:04 kraken kernel: 10 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) May 14 10:04:04 kraken kernel: SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] May 14 10:04:04 kraken kernel: 11 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) May 14 10:04:04 kraken kernel: SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] May 14 10:04:04 kraken kernel: 12 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) May 14 10:04:04 kraken kernel: SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] May 14 10:04:04 kraken kernel: 13 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) May 14 10:04:04 kraken kernel: SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] May 14 10:04:04 kraken kernel: 14 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) May 14 10:04:04 kraken kernel: SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] May 14 10:04:04 kraken kernel: 15 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) May 14 10:04:04 kraken kernel: SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] May 14 10:04:04 kraken kernel: Pending list: May 14 10:04:04 kraken kernel: 252 SCB_CONTROL[0x50]:(MK_MESSAGE|DISCENB) SCB_SCSIID[0x7] May 14 10:04:04 kraken kernel: SCB_LUN[0x0] May 14 10:04:04 kraken kernel: Kernel Free SCB list: 238 241 251 250 247 246 245 244 243 242 239 248 249 253 237 236 235 234 233 232 231 230 229 228 227 226 225 224 223 222 221 220 219 218 217 216 215 214 213 212 211 210 209 208 207 206 205 204 203 202 201 200 199 198 197 196 195 194 193 192 191 190 189 188 187 186 185 184 183 182 181 180 179 178 177 176 175 174 173 172 171 170 169 168 167 166 165 164 163 162 161 160 159 158 157 156 155 154 153 152 151 150 149 148 147 146 145 144 143 142 141 140 139 138 137 136 135 134 133 132 131 130 129 128 127 126 125 124 123 122 121 120 119 118 117 116 115 114 113 112 111 110 109 108 107 106 105 104 103 102 101 100 99 98 97 96 95 94 93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 73 72 71 70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 May 14 10:04:04 kraken kernel: Untagged Q(0): 252 May 14 10:04:04 kraken kernel: May 14 10:04:04 kraken kernel: <<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
May 14 10:04:04 kraken kernel: (pass11:ahc0:0:0:0): SCB 0xfc – timed out
May 14 10:04:04 kraken kernel: sg[0] – Addr 0xa501040 : Length 56
May 14 10:04:04 kraken kernel: (pass11:ahc0:0:0:0): SCB 252: Immediate reset. Flags = 0x660
May 14 10:04:04 kraken kernel: (pass11:ahc0:0:0:0): no longer in timeout, status = 25b
May 14 10:04:04 kraken kernel: ahc0: Issued Channel A Bus Reset. 1 SCBs aborted
May 14 10:04:04 kraken kernel: ahc0: Timedout SCBs already complete. Interrupts may not be functioning.

This would occur every 30 minutes. I decided to disable my copy-to-tape jobs using the disable command in bconsole. This will last only until the next restart of bacula-dir.

Then I noticed that some of the jobs were taking a very long time to finish. They were waiting on something. This I realized they were waiting on the tape drive. This came to light when I noticed this in the job emails:

Termination: Backup OK — with warnings

14-May 16:05 bacula-dir JobId 61600: Begin pruning Jobs older than 3 years .
14-May 16:05 bacula-dir JobId 61600: No Jobs found to prune.
14-May 16:05 bacula-dir JobId 61600: Begin pruning Jobs.
14-May 16:05 bacula-dir JobId 61600: No Files found to prune.
14-May 16:05 bacula-dir JobId 61600: End auto prune.

14-May 16:05 bacula-dir JobId 61600: shell command: run AfterJob “/home/dan/bin/dlt-stats-kraken”
14-May 16:35 bacula-dir JobId 61600: AfterJob: mtx: Request Sense: Long Report=yes
14-May 16:35 bacula-dir JobId 61600: AfterJob: mtx: Request Sense: Valid Residual=no
14-May 16:35 bacula-dir JobId 61600: AfterJob: mtx: Request Sense: Error Code=0 (Unknown?!)
14-May 16:35 bacula-dir JobId 61600: AfterJob: mtx: Request Sense: Sense Key=No Sense
14-May 16:35 bacula-dir JobId 61600: AfterJob: mtx: Request Sense: FileMark=no
14-May 16:35 bacula-dir JobId 61600: AfterJob: mtx: Request Sense: EOM=no
14-May 16:35 bacula-dir JobId 61600: AfterJob: mtx: Request Sense: ILI=no
14-May 16:35 bacula-dir JobId 61600: AfterJob: mtx: Request Sense: Additional Sense Code = 00
14-May 16:35 bacula-dir JobId 61600: AfterJob: mtx: Request Sense: Additional Sense Qualifier = 00
14-May 16:35 bacula-dir JobId 61600: AfterJob: mtx: Request Sense: BPV=no
14-May 16:35 bacula-dir JobId 61600: AfterJob: mtx: Request Sense: Error in CDB=no
14-May 16:35 bacula-dir JobId 61600: AfterJob: mtx: Request Sense: SKSV=no
14-May 16:35 bacula-dir JobId 61600: AfterJob: INQUIRY command Failed
14-May 17:05 bacula-dir JobId 61600: AfterJob: mtx: Request Sense: Long Report=yes
14-May 17:05 bacula-dir JobId 61600: AfterJob: mtx: Request Sense: Valid Residual=no
14-May 17:05 bacula-dir JobId 61600: AfterJob: mtx: Request Sense: Error Code=0 (Unknown?!)
14-May 17:05 bacula-dir JobId 61600: AfterJob: mtx: Request Sense: Sense Key=No Sense
14-May 17:05 bacula-dir JobId 61600: AfterJob: mtx: Request Sense: FileMark=no
14-May 17:05 bacula-dir JobId 61600: AfterJob: mtx: Request Sense: EOM=no
14-May 17:05 bacula-dir JobId 61600: AfterJob: mtx: Request Sense: ILI=no
14-May 17:05 bacula-dir JobId 61600: AfterJob: mtx: Request Sense: Additional Sense Code = 00
14-May 17:05 bacula-dir JobId 61600: AfterJob: mtx: Request Sense: Additional Sense Qualifier = 00
14-May 17:05 bacula-dir JobId 61600: AfterJob: mtx: Request Sense: BPV=no
14-May 17:05 bacula-dir JobId 61600: AfterJob: mtx: Request Sense: Error in CDB=no
14-May 17:05 bacula-dir JobId 61600: AfterJob: mtx: Request Sense: SKSV=no
14-May 17:05 bacula-dir JobId 61600: AfterJob: INQUIRY command Failed
14-May 17:05 bacula-dir JobId 61600: AfterJob: camcontrol: error sending command
14-May 17:05 bacula-dir JobId 61600: AfterJob: (pass12:ahc0:0:5:0): LOG SENSE. CDB: 4d 0 42 0 0 0 0 0 40 0
14-May 17:05 bacula-dir JobId 61600: AfterJob: (pass12:ahc0:0:5:0): CAM status: Unconditionally Re-queue Request
14-May 17:05 bacula-dir JobId 61600: AfterJob: (standard_in) 1: syntax error
14-May 17:05 bacula-dir JobId 61600: AfterJob: [: !=: unexpected operator
14-May 17:05 bacula-dir JobId 61600: AfterJob: camcontrol: error sending mode sense command
14-May 17:05 bacula-dir JobId 61600: AfterJob: camcontrol: error sending command
14-May 17:05 bacula-dir JobId 61600: AfterJob: (pass12:ahc0:0:5:0): LOG SENSE. CDB: 4d 0 72 0 0 0 0 0 4c 0
14-May 17:05 bacula-dir JobId 61600: AfterJob: (pass12:ahc0:0:5:0): CAM status: Unconditionally Re-queue Request
14-May 17:05 bacula-dir JobId 61600: AfterJob: camcontrol: error sending command
14-May 17:05 bacula-dir JobId 61600: AfterJob: (pass12:ahc0:0:5:0): LOG SENSE. CDB: 4d 0 42 0 0 0 0 0 40 0
14-May 17:05 bacula-dir JobId 61600: AfterJob: (pass12:ahc0:0:5:0): CAM status: Unconditionally Re-queue Request
14-May 17:05 bacula-dir JobId 61600: AfterJob: (standard_in) 1: syntax error
14-May 17:05 bacula-dir JobId 61600: AfterJob: [: !=: unexpected operator
14-May 17:06 bacula-dir JobId 61600: AfterJob: camcontrol: error sending mode sense command
14-May 17:06 bacula-dir JobId 61600: AfterJob: camcontrol: error sending command
14-May 17:06 bacula-dir JobId 61600: AfterJob: (pass12:ahc0:0:5:0): LOG SENSE. CDB: 4d 0 72 0 0 0 0 0 4c 0
14-May 17:06 bacula-dir JobId 61600: AfterJob: (pass12:ahc0:0:5:0): CAM status: Unconditionally Re-queue Request
14-May 17:06 bacula-dir JobId 61600: AfterJob: Total corrected errors / GB : N/A
14-May 17:06 bacula-dir JobId 61600: AfterJob: STATUS is
14-May 17:06 bacula-dir JobId 61600: AfterJob: STATUS is
14-May 17:06 bacula-dir JobId 61600: AfterJob:

Much of that ‘AfterJob’ output is the result of my script: /home/dan/bin/dlt-stats-kraken

This script is pretty simple:

#!/bin/sh
ssh -i /home/dan/bin/ssh-kraken-dlt-stats dan@kraken /home/dan/bin/dlt-stats

It sshs to the server that runs the bacula-sd with the massive disk backup and tape library. It them asks the tape drive for the number of corrected errors, and adds it to a text file.

The problem: not all of these jobs use the tape drive. Only those going to certain pools. However, for it to be run during the copy to tape, it needs to be specified in the original job. Or, at least, that it the way the jobs are set up now.

However, since the tape drive was acting up, and no jobs were being copied to tape anyway, I decided now was an ideal time to adjust the script.

This is the new version of that script:

#!/bin/sh

VOLUME=${1} 

if [ "${VOLUME}" = 'Incrementals' -o "${VOLUME}" = 'Differentials' -o "${VOLUME}" = 'Fulls' ]
then
  # only record DLT stats for tape pools
  ssh -i /home/dan/bin/ssh-kraken-dlt-stats dan@kraken /home/dan/bin/dlt-stats
fi

Combined with the above change, was a change to bacula-dir.conf:

RunAfterJob = “/home/dan/bin/dlt-stats-kraken %v”

This stopped the queries to the tape drive and allowed the jobs to proceed in their regular timely fashion.

However….

This solution is not correct. The RunAfterJob directive is passing %v to the script. But the script is looking at pool names (‘Incrementals’, ‘Differentials’, and ‘Fulls’ are the names of my Pools uses for tape volumes). What the script needs to do is query the database for a the volume provided and see what pool it is in.

NOTE: after I changed the script, the SCSI card dumps stopped.

Website Pin Facebook Twitter Myspace Friendfeed Technorati del.icio.us Digg Google StumbleUpon Premium Responsive