====== Creating a watchdog in Linux ======

===== Intro =====

I've created a simple script that monitors whether certain processes are running and, if they aren't, restarts them. My first use for it is monitoring the two Minecraft servers I run on my server, so a few details are specific to that case. However, they are easy to change to fit other purposes.

===== The code =====

  #!/bin/bash
  #
  # watchdog - monitors a process
  #
  #
  #
  pidfileList[0]="/home/minecraft/pidfile"
  pidfileList[1]="/home/minecraft2/pidfile"
  startcmd[0]="/etc/init.d/minecraft start"
  startcmd[1]="/etc/init.d/minecraft2 start"
  logfile=/var/log/watchdog.log
  tries=0
  umask 022
  PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
  # first we delete the log file
  # (-f: no error if it doesn't exist yet)
  rm -f "$logfile"
  # to write a message to the log
  function log() {
    now=$(date +"%Y-%m-%d %H:%M:%S")
    echo "$now $1" >> "$logfile"
  } # log
  # do a lazy start: the first time it will wait
  # 20 minutes to let the system stabilize
  log "Waiting 20 minutes to let the system stabilize..."
  sleep 20m
  log "Recovering..."
  while [ true ] ; do
    for(( i = 0; i < 50; i++ )) ; do
      its_ok_to_launch=0
      if [ -n "${pidfileList[i]}" ] ; then
        # get the pidfile value
        pidfile="${pidfileList[i]}"
        log "Checking pidfile $pidfile..."
        # check the existence of this pidfile
        if [ -e "$pidfile" ] ; then
          # get the pid value
          pidvalue=$(cat "$pidfile")
          log "The file exists and contains the value $pidvalue"
          # check existence of this pidvalue
          line=$(ps aux | grep $pidvalue | grep minecraft | grep -v grep)
          if [ -z "$line" ] ; then
            # the process doesn't exist
            # or the pid number doesn't correspond
            # to a minecraft server
            log "There is no process with id: $pidvalue"
            its_ok_to_launch=1
          fi # -z "$line"
        else
          # if the pidfile doesn't exist,
          # it is correct to launch it
          log "The file doesn't exist"
          its_ok_to_launch=1
        fi # -e $pidfile
      fi # -n pidfileList[i]
      if [ $its_ok_to_launch -eq 1 ] ; then
        tries=$((tries+1))
        if [ $tries -le 6 ]; then
          # attempt to start the process
          # if the maximum number of attempts
          # hasn't been reached
          log "Attempting to start command ${startcmd[i]} (this is time $tries out of 6)"
          ${startcmd[i]}
        else
          # this branch is only reached when tries is above 6,
          # so log the message once, on the first attempt past the limit
          if [ $tries -eq 7 ] ; then
            log "Reached the limit of 6 attempts, giving up"
          fi # tries -eq 7
        fi # tries -le 6
      fi # its_ok_to_launch -eq 1
    done # for
    log "Sleeping for 10 minutes...."
    sleep 10m
  done # true

==== This is what the program does ====

First, it waits 20 minutes. This covers the case where the watchdog is configured to run at boot and the monitored processes haven't been started yet:

  # do a lazy start: the first time it will wait
  # 20 minutes to let the system stabilize
  log "Waiting 20 minutes to let the system stabilize..."
  sleep 20m
  log "Recovering..."

Next, it runs forever, waking up every ten minutes:

  while [ true ] ; do
    ....
    log "Sleeping for 10 minutes...."
    sleep 10m
  done # true

Inside that loop, a for loop traverses the array ''pidfileList''. It checks indexes 0 to 49, so up to 50 processes can be monitored:

  for(( i = 0; i < 50; i++ )) ; do
    ....
  done # for

For every element in the array that is not empty:

  its_ok_to_launch=0
  if [ -n "${pidfileList[i]}" ] ; then
    ....
  fi # -n pidfileList[i]

Now comes the real work.
Get the content of the pidfile and store it in the ''pidvalue'' variable:

    # get the pidfile value
    pidfile="${pidfileList[i]}"
    log "Checking pidfile $pidfile..."
    # check the existence of this pidfile
    if [ -e "$pidfile" ] ; then
      # get the pid value
      pidvalue=$(cat "$pidfile")
      log "The file exists and contains the value $pidvalue"
      ....

Verify that this pidvalue corresponds to a real, existing process:

      # check existence of this pidvalue
      line=$(ps aux | grep $pidvalue | grep minecraft | grep -v grep)
      if [ -z "$line" ] ; then
        # the process doesn't exist
        # or the pid number doesn't correspond
        # to a minecraft server
        log "There is no process with id: $pidvalue"
        its_ok_to_launch=1
      fi # -z "$line"
    else
      # if the pidfile doesn't exist,
      # it is correct to launch it
      log "The file doesn't exist"
      its_ok_to_launch=1
    fi # -e $pidfile
  fi # -n pidfileList[i]

And if the process doesn't exist for whatever reason (the pidfile doesn't exist, or the pid number doesn't correspond to a real process), try to restart the process, up to a limit of six attempts. Note that ''tries'' is a single counter shared by all the monitored processes and it is never reset, so the limit is six start attempts in total:

  if [ $its_ok_to_launch -eq 1 ] ; then
    tries=$((tries+1))
    if [ $tries -le 6 ]; then
      # attempt to start the process
      # if the maximum number of attempts
      # hasn't been reached
      log "Attempting to start command ${startcmd[i]} (this is time $tries out of 6)"
      ${startcmd[i]}
    else
      # this branch is only reached when tries is above 6,
      # so log the message once, on the first attempt past the limit
      if [ $tries -eq 7 ] ; then
        log "Reached the limit of 6 attempts, giving up"
      fi # tries -eq 7
    fi # tries -le 6
  fi # its_ok_to_launch -eq 1

===== Configuration =====

You have to configure the pid files to be monitored (here is my example with the Minecraft servers):

  pidfileList[0]="/home/minecraft/pidfile"
  pidfileList[1]="/home/minecraft2/pidfile"

How the processes are started in the event of a failure:

  startcmd[0]="/etc/init.d/minecraft start"
  startcmd[1]="/etc/init.d/minecraft2 start"

The location of the log file:

  logfile=/var/log/watchdog.log

And, in case you want to use it for other purposes, how each process is identified:

  line=$(ps aux | grep $pidvalue | grep minecraft | grep -v grep)

I needed to add this ''grep minecraft''
to avoid errors: on some occasions the process had died and another program had taken over its process number.

===== Installation =====

I start the script from ''/etc/rc.local'' and I've put it under ''/usr/local/sbin'', but you can pick whatever directory suits you best.
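
Coming back to the process check from the Configuration section: ''grep $pidvalue'' matches any ''ps'' line that contains those digits anywhere, so a longer PID or a number in another column can produce a false match. Below is a minimal sketch of a stricter check. It is not part of the script above: it assumes a Linux ''/proc'' filesystem, and the function name ''is_running'' and its two arguments are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original watchdog script.
# Usage: is_running PIDFILE PATTERN
# Succeeds only if PIDFILE contains a numeric PID and the command line
# of that exact process contains PATTERN.
is_running() {
    local pidfile=$1 pattern=$2 pid cmdline

    # No readable pidfile: treat the process as not running
    [ -r "$pidfile" ] || return 1

    # Read the PID and drop any whitespace or trailing newline
    pid=$(tr -d '[:space:]' < "$pidfile")

    # Reject empty or non-numeric content
    case $pid in
        ''|*[!0-9]*) return 1 ;;
    esac

    # Assumption: Linux, where /proc/PID exists only while the process exists
    [ -r "/proc/$pid/cmdline" ] || return 1

    # The arguments in cmdline are NUL-separated; turn them into spaces
    cmdline=$(tr '\0' ' ' < "/proc/$pid/cmdline")

    # Plain substring match against the command line of that PID only
    case $cmdline in
        *"$pattern"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

In the watchdog loop, the ''ps aux | grep'' line and the ''-z "$line"'' test could then be replaced by something like ''if ! is_running "$pidfile" minecraft ; then its_ok_to_launch=1 ; fi''. The pattern plays the same role as ''grep minecraft'': it protects against a PID that has been reused by an unrelated program.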