====== Creating a watchdog in Linux ======
===== Intro =====
I've written a simple script that checks whether certain processes are running and, if they aren't, restarts them.
I originally wrote it to monitor two Minecraft servers running on my machine, so a few details are specific to that case. However, they are easy to change to fit other purposes.
===== The code =====
  #!/bin/bash
  #
  # watchdog - monitors a list of processes and restarts them when they die
  #
  # one pidfile per monitored process...
  pidfileList[0]="/home/minecraft/pidfile"
  pidfileList[1]="/home/minecraft2/pidfile"
  # ...and the command that starts it (same index)
  startcmd[0]="/etc/init.d/minecraft start"
  startcmd[1]="/etc/init.d/minecraft2 start"
  logfile=/var/log/watchdog.log
  # restart attempts so far: shared by all processes and never reset
  tries=0
  umask 022
  PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
  # first we delete the log file (-f: no error if it doesn't exist yet)
  rm -f "$logfile"
  # to write a message to the log
  function log()
  {
      now=$(date +"%Y-%m-%d %H:%M:%S")
      echo "$now $1" >> "$logfile"
  } # log
  # do a lazy start: the first time it will wait
  # 20 minutes to let the system stabilize
  log "Waiting 20 minutes to let the system stabilize..."
  sleep 20m
  log "Recovering..."
  while [ true ] ; do
      for (( i = 0; i < 50; i++ )) ; do
          its_ok_to_launch=0
          if [ -n "${pidfileList[i]}" ] ; then
              # get the pidfile value
              pidfile="${pidfileList[i]}"
              log "Checking pidfile $pidfile..."
              # check the existence of this pidfile
              if [ -e "$pidfile" ] ; then
                  # get the pid value
                  pidvalue=$(cat "$pidfile")
                  log "The file exists and contains the value $pidvalue"
                  # check existence of this pidvalue: the line must also
                  # mention "minecraft", in case the PID was reused
                  line=$(ps aux | grep $pidvalue | grep minecraft | grep -v grep)
                  if [ -z "$line" ] ; then
                      # the process doesn't exist
                      # or the pid number doesn't correspond
                      # to a minecraft server
                      log "There is no process with id: $pidvalue"
                      its_ok_to_launch=1
                  fi # -z "$line"
              else
                  # if the pidfile doesn't exist,
                  # it is correct to launch it
                  log "The file doesn't exist"
                  its_ok_to_launch=1
              fi # -e $pidfile
          fi # -n pidfileList[i]
          if [ $its_ok_to_launch -eq 1 ] ; then
              tries=$((tries+1))
              if [ $tries -le 6 ]; then
                  # attempt to start the process
                  # if the maximum number of attempts
                  # hasn't been reached yet
                  log "Attempting to start command ${startcmd[i]} (attempt $tries of 6)"
                  # left unquoted on purpose so "script start" splits into words
                  ${startcmd[i]}
              else
                  # log only once, on the first attempt past the limit
                  if [ $tries -eq 7 ] ; then
                      log "Reached the limit of 6 attempts, giving up"
                  fi # tries -eq 7
              fi # tries -le 6
          fi # its_ok_to_launch -eq 1
      done # for
      log "Sleeping for 10 minutes...."
      sleep 10m
  done # true
==== This is what the program does ====
First, it waits 20 minutes. This covers the case where the watchdog is started at boot, before the monitored processes have had a chance to start:
  # do a lazy start: the first time it will wait
  # 20 minutes to let the system stabilize
  log "Waiting 20 minutes to let the system stabilize..."
  sleep 20m
  log "Recovering..."
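A fixed 20-minute sleep also applies when the watchdog is restarted long after boot. A possible refinement, sketched here as an assumption rather than what my script does, is to wait only until the system has been up for 20 minutes, reading the uptime from ''/proc/uptime'' (Linux-specific). ''uptime_seconds'', ''remaining_delay'' and ''WATCHDOG_BOOT_DELAY'' are hypothetical names:

```shell
#!/bin/bash
# Sketch (assumption, not the original script): wait only until the system
# has been up for a given number of seconds, instead of always sleeping 20m.
# uptime_seconds, remaining_delay and WATCHDOG_BOOT_DELAY are hypothetical names.

# Print the whole number of seconds since boot (Linux: /proc/uptime).
uptime_seconds() {
    local up rest
    read -r up rest < /proc/uptime   # e.g. "12345.67 45678.90"
    echo "${up%.*}"                  # drop the fractional part
}

# Print how many seconds remain until the uptime reaches $1 (0 if already past).
remaining_delay() {
    local wanted=$1 up
    up=$(uptime_seconds)
    if [ "$up" -ge "$wanted" ]; then
        echo 0
    else
        echo $(( wanted - up ))
    fi
}

# Possible use in the watchdog, replacing "sleep 20m" (1200 s = 20 minutes):
#   sleep "$(remaining_delay "${WATCHDOG_BOOT_DELAY:-1200}")"
```

If the machine has already been up for more than 20 minutes, the watchdog starts checking immediately.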
Next, it runs forever, waking up every ten minutes:
  while [ true ] ; do
      # ...
      log "Sleeping for 10 minutes...."
      sleep 10m
  done # true
Inside that loop, a ''for'' loop traverses the array ''pidfileList'' (indices 0 to 49):
  for (( i = 0; i < 50; i++ )) ; do
      # ...
  done # for
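Probing a fixed range of 50 indices works, but bash can also list only the indices that were actually assigned, with ''${!pidfileList[@]}''. A minimal sketch of that alternative; ''list_entries'' is a hypothetical helper, and an index assigned an empty string would still be visited, so the ''-n'' test remains useful:

```shell
#!/bin/bash
# Sketch: visit only the indices that are actually assigned in pidfileList,
# instead of probing a fixed range of 50. list_entries is a hypothetical helper.
pidfileList[0]="/home/minecraft/pidfile"
pidfileList[1]="/home/minecraft2/pidfile"

# Print "index pidfile" for every assigned entry, in ascending index order.
list_entries() {
    local i
    for i in "${!pidfileList[@]}"; do   # expands to the assigned indices: 0 1
        echo "$i ${pidfileList[i]}"
    done
}
```

In the watchdog, the loop header would become ''for i in "${!pidfileList[@]}" ; do'' and the body could stay the same.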
For every index, the launch flag is first cleared, and only the elements that are not empty are checked:
  its_ok_to_launch=0
  if [ -n "${pidfileList[i]}" ] ; then
      # ...
  fi # -n pidfileList[i]
Now comes the real work. If the pidfile exists, its content is read into the ''pidvalue'' variable:
  # get the pidfile value
  pidfile="${pidfileList[i]}"
  log "Checking pidfile $pidfile..."
  # check the existence of this pidfile
  if [ -e "$pidfile" ] ; then
      # get the pid value
      pidvalue=$(cat "$pidfile")
      log "The file exists and contains the value $pidvalue"
      # ...
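The script trusts whatever the pidfile contains. If the file is empty or holds something other than a number, the ''grep $pidvalue'' check that follows is no longer meaningful, so it may be worth validating the value first. A sketch, assuming the PID is on the first line of the file; ''read_pid'' is a hypothetical helper:

```shell
#!/bin/bash
# Sketch: read a PID from a pidfile and accept it only if it is a positive
# integer. read_pid is a hypothetical helper, not part of the original script.
read_pid() {
    local pidfile=$1 pid=
    [ -r "$pidfile" ] || return 1        # missing or unreadable pidfile
    read -r pid < "$pidfile" || true     # first line; a missing final newline is fine
    case $pid in
        ''|*[!0-9]*|0) return 1 ;;       # empty, not all digits, or 0
    esac
    echo "$pid"
}
```

''pidvalue=$(read_pid "$pidfile")'' could then replace the ''cat'', with a failure treated like a missing pidfile (''its_ok_to_launch=1'').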
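Then it verifies that this ''pidvalue'' corresponds to a real, running process whose ''ps'' line also mentions ''minecraft''. If it doesn't, or if the pidfile doesn't exist at all, the process is marked for launching: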
          # check existence of this pidvalue
          line=$(ps aux | grep $pidvalue | grep minecraft | grep -v grep)
          if [ -z "$line" ] ; then
              # the process doesn't exist
              # or the pid number doesn't correspond
              # to a minecraft server
              log "There is no process with id: $pidvalue"
              its_ok_to_launch=1
          fi # -z "$line"
      else
          # if the pidfile doesn't exist,
          # it is correct to launch it
          log "The file doesn't exist"
          its_ok_to_launch=1
      fi # -e $pidfile
  fi # -n pidfileList[i]
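''grep $pidvalue'' matches the PID anywhere in the ''ps aux'' output, so PID 123 also matches PID 1234, or any other field that happens to contain 123. On Linux, a stricter check (a swapped-in technique, not what my script does) is to look the PID up directly in ''/proc/<pid>/cmdline''. ''is_running'' is a hypothetical helper:

```shell
#!/bin/bash
# Sketch (swapped-in technique): check that PID $1 exists and that its command
# line contains the text $2, by reading /proc/<pid>/cmdline (Linux only).
# is_running is a hypothetical helper, not part of the original script.
is_running() {
    local pid=$1 pattern=$2 cmdline
    case $pid in
        ''|*[!0-9]*) return 1 ;;                # not a numeric PID
    esac
    [ -r "/proc/$pid/cmdline" ] || return 1     # no process with this PID
    # arguments are separated by NUL bytes; turn them into spaces
    cmdline=$(tr '\0' ' ' < "/proc/$pid/cmdline")
    [ -n "$cmdline" ] || return 1               # zombie or kernel thread
    case $cmdline in
        *"$pattern"*) return 0 ;;               # literal substring match
        *) return 1 ;;
    esac
}
```

In the watchdog, the ''ps aux'' line and the ''-z "$line"'' test would then become something like ''if ! is_running "$pidvalue" minecraft ; then''.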
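Finally, if the process isn't running for whatever reason (the pidfile doesn't exist, or the PID doesn't correspond to a running Minecraft process), the watchdog runs the matching start command, up to a limit of six attempts. Note that the ''tries'' counter is shared by all monitored processes and is never reset, so after six restarts in total the watchdog stops restarting anything: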
  if [ $its_ok_to_launch -eq 1 ] ; then
      tries=$((tries+1))
      if [ $tries -le 6 ]; then
          # attempt to start the process
          # if the maximum number of attempts
          # hasn't been reached yet
          log "Attempting to start command ${startcmd[i]} (attempt $tries of 6)"
          ${startcmd[i]}
      else
          # log only once, on the first attempt past the limit
          if [ $tries -eq 7 ] ; then
              log "Reached the limit of 6 attempts, giving up"
          fi # tries -eq 7
      fi # tries -le 6
  fi # its_ok_to_launch -eq 1
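Because ''tries'' is a single counter that is never reset, six restarts in total, spread over weeks or across both servers, use up the budget for good. If that isn't what you want, a per-process counter that is reset whenever the process is seen alive is one option. A sketch with hypothetical names (''try_restart'', ''mark_running'', ''max_tries''); it prints with ''echo'' where the real script would call ''log'':

```shell
#!/bin/bash
# Sketch: one restart counter per monitored entry instead of the single global
# "tries", reset whenever the process is seen running again.
# try_restart, mark_running and max_tries are hypothetical names; messages go
# to stdout here, where the real script would call its log function.
max_tries=6
declare -a tries=()

# Run startcmd[$1] unless that entry has used up its attempts.
# Returns 0 if a start was attempted, 1 if we have given up on this entry.
try_restart() {
    local i=$1
    tries[i]=$(( ${tries[i]:-0} + 1 ))
    if (( tries[i] <= max_tries )); then
        echo "Attempting to start ${startcmd[i]} (attempt ${tries[i]} of $max_tries)"
        ${startcmd[i]}   # unquoted on purpose, as in the original script
        return 0
    fi
    if (( tries[i] == max_tries + 1 )); then
        echo "Entry $i failed $max_tries times, giving up"   # printed only once
    fi
    return 1
}

# Call when the process for entry $1 has been found running.
mark_running() {
    tries[$1]=0
}
```

''mark_running "$i"'' would be called when the check finds the server alive, and ''try_restart "$i"'' where the script currently increments ''tries''.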
===== Configuration =====
You have to configure the pidfiles to be monitored (here is my example with the Minecraft servers):
  pidfileList[0]="/home/minecraft/pidfile"
  pidfileList[1]="/home/minecraft2/pidfile"
The commands that start each process in the event of a failure (each one at the same index as its pidfile):
  startcmd[0]="/etc/init.d/minecraft start"
  startcmd[1]="/etc/init.d/minecraft2 start"
The location of the logfile:
  logfile=/var/log/watchdog.log
And, if you use it for other purposes, how each process is identified:
  line=$(ps aux | grep $pidvalue | grep minecraft | grep -v grep)
I had to add the ''grep minecraft'' filter to avoid false positives: on one occasion the server had failed and another program had started with the same process number, so checking the PID alone made it look as if the server were still running.
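To reuse the script for other programs, the text passed to the second ''grep'' can become part of the per-process configuration instead of being hard-coded. A sketch using a hypothetical ''matchList'' array alongside the existing ones, plus a hypothetical ''check_config'' that catches entries whose arrays don't line up:

```shell
#!/bin/bash
# Sketch: a hypothetical matchList array holds, for each entry, the text the
# process line must contain (instead of the hard-coded "minecraft"), and a
# hypothetical check_config verifies that the three arrays line up.
pidfileList[0]="/home/minecraft/pidfile"
startcmd[0]="/etc/init.d/minecraft start"
matchList[0]="minecraft"
pidfileList[1]="/home/minecraft2/pidfile"
startcmd[1]="/etc/init.d/minecraft2 start"
matchList[1]="minecraft"

# Report every pidfile without a start command or match text; return 1 if any.
check_config() {
    local i status=0
    for i in "${!pidfileList[@]}"; do
        if [ -z "${startcmd[i]}" ] || [ -z "${matchList[i]}" ]; then
            echo "Entry $i (${pidfileList[i]}) lacks a start command or match text" >&2
            status=1
        fi
    done
    return "$status"
}
```

The check line would then read ''line=$(ps aux | grep $pidvalue | grep "${matchList[i]}" | grep -v grep)'', and ''check_config || exit 1'' could run once at the top of the script.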
===== Installation =====
I saved the script under ''/usr/local/sbin'' and start it from ''/etc/rc.local'', but you can pick whatever directory best suits you.
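Since the script never exits, it has to be started in the background from ''/etc/rc.local''; otherwise ''rc.local'' itself would never finish. A minimal sketch, assuming the script was saved as ''/usr/local/sbin/watchdog'' (the file name is my assumption) and made executable; whether ''rc.local'' runs at all depends on the distribution:

```shell
#!/bin/sh -e
# Sketch of /etc/rc.local starting the watchdog in the background.
# /usr/local/sbin/watchdog is an assumed file name for the script above.
if [ -x /usr/local/sbin/watchdog ]; then
    # "&" is essential: the watchdog loops forever and would block rc.local
    /usr/local/sbin/watchdog > /dev/null 2>&1 &
fi
exit 0
```

The watchdog writes its own messages to ''/var/log/watchdog.log'', so discarding its standard output here only hides the output of the start commands it runs.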