Creating a watchdog in Linux

Intro

I've written a simple script that checks whether certain processes are running and, if they aren't, restarts them.

My original purpose was to monitor two Minecraft servers I run on my own server, so the script contains a few quirks that only make sense for that case. They are easy to change to fit other purposes, though.

The code

#!/bin/bash
#
# watchdog - monitors a process
#
#
#

pidfileList[0]="/home/minecraft/pidfile"
pidfileList[1]="/home/minecraft2/pidfile"

startcmd[0]="/etc/init.d/minecraft start"
startcmd[1]="/etc/init.d/minecraft2 start"

logfile=/var/log/watchdog.log

tries=0

umask 022

PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin


# first we delete the log file
# (-f: no error if it doesn't exist yet)
rm -f "$logfile"

# to write a message to the log
function log()
{
    local now
    now=$(date +"%Y-%m-%d %H:%M:%S")
    echo "$now $1" >> "$logfile"
} # log


# do a lazy start: the first time it will wait
# 20 minutes to let the system stabilize
log "Waiting 20 minutes to let the system stabilize..."
sleep 20m
log "Recovering..."

while true ; do


  for(( i = 0; i < 50; i++ )) ; do

    its_ok_to_launch=0
    if [ -n "${pidfileList[i]}" ] ; then

      # get the pidfile value
      pidfile="${pidfileList[i]}"
      log "Checking pidfile $pidfile..."
      # check the existence of this pidfile
      if [ -e "$pidfile" ] ; then
        # get the pid value
        pidvalue=$(cat "$pidfile")
        log "The file exists and contains the value $pidvalue"
        # check existence of this pidvalue
        line=$(ps aux | grep $pidvalue | grep minecraft | grep -v grep)
        if [ -z "$line" ] ; then
          # the process doesn't exist
          # or the pid number doesn't correspond
          # to a minecraft server
          log "There is no process with id: $pidvalue"
          its_ok_to_launch=1
        fi # -z "$line"
      else
        # if the pidfile doesn't exist,
        # it is correct to launch it
        log "The file doesn't exist"
        its_ok_to_launch=1
      fi # -e $pidfile

    fi # -n pidfileList[i]

    if [ $its_ok_to_launch -eq 1 ] ; then

      tries=$((tries+1))
      if [ $tries -le 6 ]; then
        # attempt to start the process
        # if the maximum reach attempts
        # haven't been reached
        log "Attempting to start command ${startcmd[i]} (this is time $tries out of 6)"
        ${startcmd[i]}
      else
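       # this branch is only reached when tries > 6,
       # so log the message once, on the 7th time
       if [ $tries -eq 7 ] ; then
         log "The limit of 6 attempts has been reached, giving up"
       fi # tries -eq 7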
      fi # tries -le 6

    fi # its_ok_to_launch -eq 1

  done # for

  log "Sleeping for 10 minutes...."
  sleep 10m
done # true

This is what the program does

First, it waits 20 minutes. This covers the case where the watchdog is started at boot and the monitored processes have not finished starting yet:

# do a lazy start: the first time it will wait
# 20 minutes to let the system stabilize
log "Waiting 20 minutes to let the system stabilize..."
sleep 20m
log "Recovering..."

Next, it runs forever, waking up every ten minutes:

while true ; do

  ....

  log "Sleeping for 10 minutes...."
  sleep 10m
done # true

Inside that loop, a for loop traverses the first 50 slots of the pidfileList array:

  for(( i = 0; i < 50; i++ )) ; do

  ....

  done # for
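
The fixed limit of 50 works as long as there are fewer than 50 entries. As an alternative sketch (not what the script above does), bash can expand the indices that are actually set, so no upper bound and no check for empty slots is needed. The list_entries function name is made up for this illustration:

```bash
#!/bin/bash
# Same two entries as in the script's configuration.
pidfileList[0]="/home/minecraft/pidfile"
pidfileList[1]="/home/minecraft2/pidfile"

# Hypothetical helper, only for this example.
list_entries() {
    local i
    # "${!pidfileList[@]}" expands to the indices that are set (0 and 1 here).
    for i in "${!pidfileList[@]}" ; do
        echo "entry $i: ${pidfileList[i]}"
    done
}

list_entries
```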

For every element in the array that is not empty:

    its_ok_to_launch=0
    if [ -n "${pidfileList[i]}" ] ; then

    ....

    fi # -n pidfileList[i]

Now comes the real work. The script reads the content of the pidfile into the pidvalue variable:

      # get the pidfile value
      pidfile="${pidfileList[i]}"
      log "Checking pidfile $pidfile..."
      # check the existence of this pidfile
      if [ -e "$pidfile" ] ; then
        # get the pid value
        pidvalue=$(cat "$pidfile")
        log "The file exists and contains the value $pidvalue"
        ....
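
The cat above trusts whatever the pidfile contains. A slightly more defensive way to read it could look like the sketch below, which rejects an empty file or one that doesn't hold a plain number. The read_pid function is a hypothetical helper, not part of the original script:

```bash
#!/bin/bash
# Hypothetical helper: prints the PID stored in file $1, or fails if the
# file is unreadable, empty, or doesn't contain a plain number.
read_pid() {
    local file=$1 pid
    [ -r "$file" ] || return 1
    # read takes the first line and strips surrounding whitespace; it
    # returns non-zero when the last line has no newline, so also accept
    # the case where it still managed to fill the variable.
    read -r pid < "$file" || [ -n "$pid" ] || return 1
    [[ $pid =~ ^[0-9]+$ ]] || return 1
    echo "$pid"
}
```

In the watchdog it could be used as: if pidvalue=$(read_pid "$pidfile") ; then ... ; else its_ok_to_launch=1 ; fi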

Then it verifies that this pidvalue corresponds to a real, existing process:

        # check existence of this pidvalue
        line=$(ps aux | grep $pidvalue | grep minecraft | grep -v grep)
        if [ -z "$line" ] ; then
          # the process doesn't exist
          # or the pid number doesn't correspond
          # to a minecraft server
          log "There is no process with id: $pidvalue"
          its_ok_to_launch=1
        fi # -z "$line"
      else
        # if the pidfile doesn't exist,
        # it is correct to launch it
        log "The file doesn't exist"
        its_ok_to_launch=1
      fi # -e $pidfile

    fi # -n pidfileList[i]
    

If the process doesn't exist for whatever reason (the pidfile is missing, or the pid number doesn't correspond to a running Minecraft process), the script tries to restart it, up to a limit of six attempts:

    if [ $its_ok_to_launch -eq 1 ] ; then

      tries=$((tries+1))
      if [ $tries -le 6 ]; then
        # attempt to start the process
        # if the maximum reach attempts
        # haven't been reached
        log "Attempting to start command ${startcmd[i]} (this is time $tries out of 6)"
        ${startcmd[i]}
      else
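       # this branch is only reached when tries > 6,
       # so log the message once, on the 7th time
       if [ $tries -eq 7 ] ; then
         log "The limit of 6 attempts has been reached, giving up"
       fi # tries -eq 7
      fi # tries -le 6

    fi # its_ok_to_launch -eq 1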
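
In the script above, tries is a single counter shared by all the monitored processes and it is never reset, so six failed restarts of one server also stop the restarts of the other. If that is not what you want, a per-entry counter is one possible variation. This is only a sketch; max_tries, may_restart and reset_tries are names invented for the example:

```bash
#!/bin/bash
max_tries=6        # same limit as the script above
declare -a tries   # hypothetical: tries[i] counts the restarts of entry i

# Counts one more attempt for entry $1.
# Succeeds while the entry is still within its limit.
may_restart() {
    local i=$1
    tries[i]=$(( ${tries[i]:-0} + 1 ))
    (( tries[i] <= max_tries ))
}

# Call this when entry $1 is found running again.
reset_tries() {
    tries[$1]=0
}
```

Inside the for loop, the start command would then be guarded with: if may_restart "$i" ; then ${startcmd[i]} ; fi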
      

Configuration

You have to configure the pid files to be monitored (here is my example with the Minecraft servers):

pidfileList[0]="/home/minecraft/pidfile"
pidfileList[1]="/home/minecraft2/pidfile"

The commands to run in the event of a failure (each index matches the pidfile with the same index):

startcmd[0]="/etc/init.d/minecraft start"
startcmd[1]="/etc/init.d/minecraft2 start"

The location of the logfile:

logfile=/var/log/watchdog.log

And, if you want to use it for other purposes, how each process is identified:

line=$(ps aux | grep $pidvalue | grep minecraft | grep -v grep)

I had to add this grep minecraft to avoid false positives: on some occasions the server died and another program was later given the same process number.
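
Note that grepping the whole ps aux output for the number can also match it in other columns, or as part of a longer pid. As a possible alternative on Linux, the command line of exactly that pid can be read from /proc. The sketch below assumes /proc is mounted, and pid_matches is a hypothetical helper name:

```bash
#!/bin/bash
# Hypothetical helper: succeeds only if PID $1 is alive and its command
# line contains the string $2 (for example "minecraft").
pid_matches() {
    local pid=$1 pattern=$2 cmdline
    # Reject empty or non-numeric values (for example an empty pidfile).
    [[ $pid =~ ^[0-9]+$ ]] || return 1
    # /proc/<pid>/cmdline only exists while the process exists.
    [ -r "/proc/$pid/cmdline" ] || return 1
    # The arguments are NUL-separated, so turn the NULs into spaces.
    cmdline=$(tr '\0' ' ' 2>/dev/null < "/proc/$pid/cmdline") || return 1
    [[ $cmdline == *"$pattern"* ]]
}
```

In the watchdog, the check would become: if ! pid_matches "$pidvalue" minecraft ; then its_ok_to_launch=1 ; fi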

Installation

I've put the script under /usr/local/sbin and I start it from /etc/rc.local, but you can pick whatever directory suits you best.
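
Because the watchdog never returns, it should be started in the background so that rc.local can finish. A minimal fragment, assuming the script was saved as /usr/local/sbin/watchdog (the file name is my assumption here) and made executable:

```bash
# /etc/rc.local (fragment)
# "watchdog" is an assumed file name for the script above.
# The trailing & sends it to the background, so rc.local doesn't block.
/usr/local/sbin/watchdog &

exit 0
```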

linux/watchdoginlinux.txt · Last modified: 2022/12/02 21:02 by 127.0.0.1