Bash error handling

In this article, I present a few tricks to handle error conditions. Some of them are not strictly error handling (a reactive way to handle the unexpected) but rather techniques to avoid errors before they happen.

Case study: Simple script that downloads a hardware report from multiple hosts and inserts it into a database.

Say that you have a cron job on each one of your Linux systems, and you have a script to collect the hardware information from each:

#!/bin/bash
# Script to collect the status of lshw output from home servers
# Dependencies:
# * LSHW: http://ezix.org/project/wiki/HardwareLiSter
# * JQ: http://stedolan.github.io/jq/
#
# On each machine you can run something like this from cron (Don't know CRON, no worries: https://crontab-generator.org/)
# 0 0 * * * /usr/sbin/lshw -json -quiet > /var/log/lshw-dump.json
# Author: Jose Vicente Nunez
#
declare -a servers=(
dmaf5
)

DATADIR="$HOME/Documents/lshw-dump"

/usr/bin/mkdir -p -v "$DATADIR"
for server in ${servers[*]}; do
    echo "Visiting: $server"
    /usr/bin/scp -o logLevel=Error ${server}:/var/log/lshw-dump.json ${DATADIR}/lshw-$server-dump.json &
done
wait
for lshw in $(/usr/bin/find $DATADIR -type f -name 'lshw-*-dump.json'); do
    /usr/bin/jq '.["product","vendor", "configuration"]' $lshw
done

If everything goes well, then you collect your files in parallel because you don’t have more than ten systems. You can afford to ssh to all of them at the same time and then show the hardware details of each one.

Visiting: dmaf5
lshw-dump.json                                                                                         100%   54KB 136.9MB/s   00:00    
"DMAF5 (Default string)"
"BESSTAR TECH LIMITED"
{
  "boot": "normal",
  "chassis": "desktop",
  "family": "Default string",
  "sku": "Default string",
  "uuid": "00020003-0004-0005-0006-000700080009"
}

Here are some possibilities of why things went wrong:

  • Your report didn’t run because the server was down
  • You couldn’t create the directory where the files need to be saved
  • The tools you need to run the script are missing
  • You can’t collect the report because your remote machine crashed
  • One or more of the reports is corrupt

The current version of the script has a problem. It runs from beginning to end, errors or not:

./collect_data_from_servers.sh 
Visiting: macmini2
Visiting: mac-pro-1-1
Visiting: dmaf5
lshw-dump.json                                                                                         100%   54KB  48.8MB/s   00:00    
scp: /var/log/lshw-dump.json: No such file or directory
scp: /var/log/lshw-dump.json: No such file or directory
parse error: Expected separator between values at line 3, column 9

Next, I demonstrate a few things to make your script more robust and, in some cases, able to recover from failure.

The nuclear option: Failing hard, failing fast

The proper way to handle errors is to check if the program finished successfully or not, using return codes. It sounds obvious, but return codes, integers that bash stores in the $? variable, sometimes have a broader meaning. (Don’t confuse $? with $!, which holds the PID of the most recent background job.) The bash man page tells you:

For the shell’s purposes, a command which exits with a zero exit
status has succeeded. An exit status of zero indicates success.
A non-zero exit status indicates failure. When a command
terminates on a fatal signal N, bash uses the value of 128+N as
the exit status.

As usual, you should always read the man page of the scripts you’re calling, to see what the conventions are for each of them. If you’ve programmed with a language like Java or Python, then you’re most likely familiar with their exceptions, different meanings, and how not all of them are handled the same way.

If you add set -o errexit to your script, from that point forward it will abort the execution if any command exits with a code != 0. But errexit is ignored while executing functions inside an if condition, so instead of remembering that exception, I prefer explicit error handling.
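
The exception is easier to remember once you have seen it. The following is a minimal sketch, not part of the collection script: risky is a hypothetical function whose first command fails, and each case runs in a child bash so that errexit cannot stop the demonstration itself.

```shell
#!/bin/bash
# Hypothetical function: its first command fails, so under errexit the echo should never run
risky_def='risky() { false; echo "survived the failure"; }'

# Case 1: risky is called as a plain command, so errexit stops the child shell at "false"
plain=$(bash -c "set -o errexit; $risky_def; risky" || true)

# Case 2: risky is called as an if condition, so errexit is ignored inside the function body
guarded=$(bash -c "set -o errexit; $risky_def; if risky; then echo 'condition true'; fi" || true)

echo "plain call printed:   '${plain}'"
echo "guarded call printed: '${guarded}'"
```

The plain call prints nothing because the child shell stops at the failing command. The guarded call runs to the end of the function, which is exactly the behavior that is easy to forget.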

Take a look at version two of the script. It’s slightly better:

1 #!/bin/bash
2 # Script to collect the status of lshw output from home servers
3 # Dependencies:
4 # * LSHW: http://ezix.org/project/wiki/HardwareLiSter
5 # * JQ: http://stedolan.github.io/jq/
6 #
7 # On each machine you can run something like this from cron (Don't know CRON, no worries: https://crontab-generator.org/)
8 # 0 0 * * * /usr/sbin/lshw -json -quiet > /var/log/lshw-dump.json
9 # Author: Jose Vicente Nunez
10 #
11 set -o errtrace # Enable the err trap, code will get called when an error is detected
12 trap "echo ERROR: There was an error in ${FUNCNAME-main context}, details to follow" ERR
13 declare -a servers=(
14 macmini2
15 mac-pro-1-1
16 dmaf5
17 )
18  
19 DATADIR="$HOME/Documents/lshw-dump"
20 if [ ! -d "$DATADIR" ]; then 
21    /usr/bin/mkdir -p -v "$DATADIR" || { echo "FATAL: Failed to create $DATADIR"; exit 100; }
22 fi 
23 declare -A server_pid
24 for server in ${servers[*]}; do
25    echo "Visiting: $server"
26    /usr/bin/scp -o logLevel=Error ${server}:/var/log/lshw-dump.json ${DATADIR}/lshw-$server-dump.json &
27   server_pid[$server]=$! # Save the PID of the scp  of a given server for later
28 done
29 # Iterate through all the servers and:
30 # Wait for the return code of each
31 # Check the exit code from each scp
32 for server in ${!server_pid[*]}; do
33    wait ${server_pid[$server]}
34    test $? -ne 0 && echo "ERROR: Copy from $server had problems, will not continue" && exit 100
35 done
36 for lshw in $(/usr/bin/find $DATADIR -type f -name 'lshw-*-dump.json'); do
37    /usr/bin/jq '.["product","vendor", "configuration"]' $lshw
38 done

Here’s what changed:

  • Lines 11 and 12: I enable error tracing and add a ‘trap’ to tell the user there was an error and there is turbulence ahead. You may want to kill your script here instead; I’ll show you why that may not be the best option.
  • Line 20: If the directory doesn’t exist, then try to create it on line 21. If directory creation fails, then exit with an error.
  • Line 27: After starting each background job, I capture the PID and associate it with the machine (1:1 relationship).
  • Lines 32-35: I wait for each scp task to finish, get the return code, and if it’s an error, abort.
  • Line 37: jq parses each file. If parsing fails, the ERR trap reports it, but the script does not exit.
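
One detail of the trap on line 12 is worth a closer look. Its string is in double quotes, so ${FUNCNAME-main context} is expanded once, when the trap is installed, and the message always says "main context". The sketch below is an assumption about how you might prefer it to behave, not the article’s code: single quotes delay the expansion until the trap fires. fetch_report is a hypothetical stand-in for a failing step.

```shell
#!/bin/bash
set -o errtrace   # functions and command substitutions inherit the ERR trap

# Single quotes: $? and FUNCNAME are expanded each time the trap fires
trap 'echo "ERROR: exit status $? in ${FUNCNAME[0]:-main context}"' ERR

# Hypothetical stand-in for a step that fails, such as the scp call
fetch_report() {
    false          # simulated failure; the ERR trap fires here
    return 0       # carry on so the demonstration can show the message
}

report=$(fetch_report)   # the trap's message is captured from stdout
echo "$report"

trap - ERR               # remove the trap again
```

With this quoting, the message names the function where the failure happened and the exit status of the failing command.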

So how does the error handling look now?

Visiting: macmini2
Visiting: mac-pro-1-1
Visiting: dmaf5
lshw-dump.json                                                                                         100%   54KB 146.1MB/s   00:00    
scp: /var/log/lshw-dump.json: No such file or directory
ERROR: There was an error in main context, details to follow
ERROR: Copy from mac-pro-1-1 had problems, will not continue
scp: /var/log/lshw-dump.json: No such file or directory

As you can see, this version is better at detecting errors but it’s very unforgiving. Also, it doesn’t detect all the errors, does it?
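
A less unforgiving variation is to wait for every background job and report all the failures together instead of exiting on the first one. This is only a sketch under assumptions: fake_copy is a hypothetical stand-in for the scp call, and the server names are made up.

```shell
#!/bin/bash
# Hypothetical stand-in for the scp call: succeeds only for names starting with "ok-"
fake_copy() {
    sleep 0.1
    [[ $1 == ok-* ]]
}

declare -A job_pid              # server name -> PID of its background job
for server in ok-alpha bad-beta ok-gamma; do
    fake_copy "$server" &
    job_pid[$server]=$!
done

failed=()
for server in "${!job_pid[@]}"; do
    # wait PID returns the exit status of that specific job
    if ! wait "${job_pid[$server]}"; then
        failed+=("$server")
    fi
done

if [ "${#failed[@]}" -gt 0 ]; then
    echo "ERROR: Copy failed for: ${failed[*]}"
else
    echo "All copies succeeded"
fi
```

Whether to stop at the first failure or to collect them all depends on what the next step needs. Collecting gives you the full picture in one run.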

When you get stuck and you wish you had an alarm

The code looks better, except that sometimes the scp could get stuck on a server (while trying to copy a file) because the server is too busy to respond or just in a bad state.

Another example is to try to access a directory through NFS where $HOME is mounted from an NFS server:

/usr/bin/find $HOME -type f -name '*.csv' -print -fprint /tmp/report.txt

And you discover hours later that the NFS mount point is stale and your script is stuck.

A timeout is the solution, and GNU timeout comes to the rescue:

/usr/bin/timeout --kill-after 20.0s 10.0s /usr/bin/find $HOME -type f -name '*.csv' -print -fprint /tmp/report.txt

Here you try to stop the process nicely (TERM signal) 10.0 seconds after it has started. If it’s still running 20.0 seconds after that first signal, then timeout sends a KILL signal (kill -9). If in doubt, check which signals are supported in your system (kill -l, for example).

If this isn’t clear from my description, then look at this example for more clarity (note that the two durations are swapped here, so TERM arrives after 20 seconds):

/usr/bin/time /usr/bin/timeout --kill-after=10.0s 20.0s /usr/bin/sleep 60s
real    0m20.003s
user    0m0.000s
sys     0m0.003s
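
A script usually needs to know whether a command failed on its own or was stopped by timeout. GNU timeout exits with status 124 when the time limit is reached and 137 when it had to send KILL. The following sketch wraps that check in run_with_deadline, a hypothetical helper name, and assumes GNU coreutils timeout is on the PATH.

```shell
#!/bin/bash
# run_with_deadline LIMIT COMMAND [ARGS...]
# Assumes GNU coreutils timeout is available on the PATH
run_with_deadline() {
    local limit=$1; shift
    local status=0
    # TERM after "$limit"; KILL 2 seconds later if the command ignores TERM
    timeout --kill-after=2s "$limit" "$@" || status=$?
    case $status in
        0)   echo "finished within $limit" ;;
        124) echo "timed out after $limit" ;;
        137) echo "had to be killed after ignoring TERM" ;;
        *)   echo "failed on its own with status $status" ;;
    esac
    return "$status"
}

run_with_deadline 1s sleep 5 || true
run_with_deadline 5s true
```

Keep in mind that a command can also exit with 124 by itself, so the check is a convention rather than a guarantee.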

Back to the original script to add a few more options and you have version three:

 1 #!/bin/bash
  2 # Script to collect the status of lshw output from home servers
  3 # Dependencies:
  4 # * Open SSH: http://www.openssh.com/portable.html
  5 # * LSHW: http://ezix.org/project/wiki/HardwareLiSter
  6 # * JQ: http://stedolan.github.io/jq/
  7 # * timeout: https://www.gnu.org/software/coreutils/
  8 #
  9 # On each machine you can run something like this from cron (Don't know CRON, no worries: https://crontab-generator.org/)
 10 # 0 0 * * * /usr/sbin/lshw -json -quiet > /var/log/lshw-dump.json
 11 # Author: Jose Vicente Nunez
 12 #
 13 set -o errtrace # Enable the err trap, code will get called when an error is detected
 14 trap "echo ERROR: There was an error in ${FUNCNAME-main context}, details to follow" ERR
 15 
 16 declare -a dependencies=(/usr/bin/timeout /usr/bin/ssh /usr/bin/jq)
 17 for dependency in ${dependencies[@]}; do
 18     if [ ! -x $dependency ]; then
 19         echo "ERROR: Missing $dependency"
 20         exit 100
 21     fi
 22 done
 23 
 24 declare -a servers=(
 25 macmini2
 26 mac-pro-1-1
 27 dmaf5
 28 )
 29 
 30 function remote_copy {
 31     local server=$1
 32     echo "Visiting: $server"
 33     /usr/bin/timeout --kill-after 25.0s 20.0s \
 34         /usr/bin/scp \
 35             -o BatchMode=yes \
 36             -o logLevel=Error \
 37             -o ConnectTimeout=5 \
 38             -o ConnectionAttempts=3 \
 39             ${server}:/var/log/lshw-dump.json ${DATADIR}/lshw-$server-dump.json
 40     return $?
 41 }
 42 
 43 DATADIR="$HOME/Documents/lshw-dump"
 44 if [ ! -d "$DATADIR" ]; then
 45     /usr/bin/mkdir -p -v "$DATADIR" || { echo "FATAL: Failed to create $DATADIR"; exit 100; }
 46 fi
 47 declare -A server_pid
 48 for server in ${servers[*]}; do
 49     remote_copy $server &
 50     server_pid[$server]=$! # Save the PID of the scp  of a given server for later
 51 done
 52 # Iterate through all the servers and:
 53 # Wait for the return code of each
 54 # Check the exit code from each scp
 55 for server in ${!server_pid[*]}; do
 56     wait ${server_pid[$server]}
 57     test $? -ne 0 && echo "ERROR: Copy from $server had problems, will not continue" && exit 100
 58 done
 59 for lshw in $(/usr/bin/find $DATADIR -type f -name 'lshw-*-dump.json'); do
 60     /usr/bin/jq '.["product","vendor", "configuration"]' $lshw
 61 done

What are the changes?

  • Lines 16-22 check that all the required tools are present and executable. If one of them cannot be executed, then ‘Houston, we have a problem.’
  • Created a remote_copy function, which uses a timeout to make sure the scp finishes no later than 45.0s (line 33): TERM after 20 seconds, then KILL 25 seconds later.
  • Added a connection timeout of 5 seconds instead of the TCP default (line 37).
  • Added a retry to scp on line 38: 3 connection attempts, one per second.
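
The dependency check above uses absolute paths, which can differ between distributions. As an alternative sketch (an assumption, not what the script does), the bash builtin command -v can look the tools up on the PATH instead. missing_dependencies and no-such-tool-xyz are hypothetical names used only for illustration.

```shell
#!/bin/bash
# Print the name of every command that cannot be found on the PATH.
# Returns 0 when everything is present, 1 otherwise.
missing_dependencies() {
    local tool status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

# no-such-tool-xyz is a made-up name that should not exist anywhere
if missing=$(missing_dependencies bash ls no-such-tool-xyz); then
    echo "All dependencies found"
else
    echo "ERROR: Missing: $missing"
fi
```

Reporting every missing tool in one pass saves a round of fix-and-rerun when more than one is absent.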

There are other ways to retry when there’s an error.

Waiting for the end of the world: How and when to retry

You noticed there’s an added retry to the scp command. But that retries only failed connections. What if the command fails in the middle of the copy?

Sometimes you just want to fail because there’s very little chance of recovering from an issue, for example a system that requires hardware fixes. Other times you can fall back to a degraded mode, meaning that you’re able to continue your work without the updated data. In those cases, it makes no sense to wait forever, only for a specific amount of time.

Here are the changes to the remote_copy, to keep this brief (version four):

#!/bin/bash
# Omitted code for clarity...
declare REMOTE_FILE="/var/log/lshw-dump.json"
declare MAX_RETRIES=3

# Blah blah blah...

function remote_copy {
    local server=$1
    local retries=$2
    local now=1
    status=0
    while [ $now -le $retries ]; do
        echo "INFO: Trying to copy file from: $server, attempt=$now"
        /usr/bin/timeout --kill-after 25.0s 20.0s \
            /usr/bin/scp \
                -o BatchMode=yes \
                -o logLevel=Error \
                -o ConnectTimeout=5 \
                -o ConnectionAttempts=3 \
                ${server}:$REMOTE_FILE ${DATADIR}/lshw-$server-dump.json
        status=$?
        if [ $status -ne 0 ]; then
            sleep_time=$(((RANDOM % 60)+ 1))
            echo "WARNING: Copy failed for $server:$REMOTE_FILE. Waiting '${sleep_time} seconds' before re-trying..."
            /usr/bin/sleep ${sleep_time}s
        else
            break # All good, no point on waiting...
        fi
        ((now=now+1))
    done
    return $status
}

DATADIR="$HOME/Documents/lshw-dump"
if [ ! -d "$DATADIR" ]; then
    /usr/bin/mkdir -p -v "$DATADIR" || { echo "FATAL: Failed to create $DATADIR"; exit 100; }
fi
declare -A server_pid
for server in ${servers[*]}; do
    remote_copy $server $MAX_RETRIES &
    server_pid[$server]=$! # Save the PID of the scp  of a given server for later
done

# Iterate through all the servers and:
# Wait for the return code of each
# Check the exit code from each scp
for server in ${!server_pid[*]}; do
    wait ${server_pid[$server]}
    test $? -ne 0 && echo "ERROR: Copy from $server had problems, will not continue" && exit 100
done

# Blah blah blah, process the files you just copied...

How does it look now? In this run, I have one system down (mac-pro-1-1) and one system without the file (macmini2). You can see that the copy from server dmaf5 works right away, but for the other two, there’s a retry for a random time between 1 and 60 seconds before exiting:

INFO: Trying to copy file from: macmini2, attempt=1
INFO: Trying to copy file from: mac-pro-1-1, attempt=1
INFO: Trying to copy file from: dmaf5, attempt=1
scp: /var/log/lshw-dump.json: No such file or directory
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for macmini2:/var/log/lshw-dump.json. Waiting '60 seconds' before re-trying...
ssh: connect to host mac-pro-1-1 port 22: No route to host
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for mac-pro-1-1:/var/log/lshw-dump.json. Waiting '32 seconds' before re-trying...
INFO: Trying to copy file from: mac-pro-1-1, attempt=2
ssh: connect to host mac-pro-1-1 port 22: No route to host
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for mac-pro-1-1:/var/log/lshw-dump.json. Waiting '18 seconds' before re-trying...
INFO: Trying to copy file from: macmini2, attempt=2
scp: /var/log/lshw-dump.json: No such file or directory
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for macmini2:/var/log/lshw-dump.json. Waiting '3 seconds' before re-trying...
INFO: Trying to copy file from: macmini2, attempt=3
scp: /var/log/lshw-dump.json: No such file or directory
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for macmini2:/var/log/lshw-dump.json. Waiting '6 seconds' before re-trying...
INFO: Trying to copy file from: mac-pro-1-1, attempt=3
ssh: connect to host mac-pro-1-1 port 22: No route to host
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for mac-pro-1-1:/var/log/lshw-dump.json. Waiting '47 seconds' before re-trying...
ERROR: There was an error in main context, details to follow
ERROR: Copy from mac-pro-1-1 had problems, will not continue
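
In the run above, each job still sleeps after its last failed attempt before giving up. A generic retry wrapper can skip that final pause. The sketch below is a hypothetical helper, not part of the article’s script; the name retry and its argument order are assumptions.

```shell
#!/bin/bash
# retry MAX_ATTEMPTS MAX_SLEEP_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
    local max_attempts=$1 max_sleep=$2; shift 2
    local attempt pause status=0
    for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
        status=0
        "$@" || status=$?
        if [ "$status" -eq 0 ]; then
            return 0
        fi
        # Only pause when another attempt will follow
        if (( attempt < max_attempts )); then
            pause=$(( (RANDOM % max_sleep) + 1 ))
            echo "WARNING: attempt $attempt failed (status $status), retrying in ${pause}s" >&2
            sleep "$pause"
        fi
    done
    echo "ERROR: giving up after $max_attempts attempts" >&2
    return "$status"
}

# "false" always fails, so this shows one warning, one pause, and then the final error
retry 2 1 false || echo "retry returned status $?"
```

The random pause keeps many hosts from retrying at the same moment, the same idea as the RANDOM-based sleep in version four.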

If I fail, do I have to do this all over again? Using a checkpoint

Suppose that the remote copy is the most expensive operation of this whole script and that you’re willing or able to re-run this script, maybe using cron or doing so by hand two times during the day to ensure you pick up the files if one or more systems are down.

You could, for the day, create a small ‘status cache’, where you record only the successful processing operations per machine. If a system is in there, then don’t bother to check again for that day.

Some programs, like Ansible, do something similar and allow you to retry a playbook on a limited number of machines after a failure (--limit @/home/user/site.retry).

A new version (version five) of the script has code to record the status of the copy (lines 15-33):

15 declare SCRIPT_NAME=$(/usr/bin/basename $BASH_SOURCE)|| exit 100
16 declare YYYYMMDD=$(/usr/bin/date +%Y%m%d)|| exit 100
17 declare CACHE_DIR="/tmp/$SCRIPT_NAME/$YYYYMMDD"
18 # Logic to clean up the cache dir on daily basis is not shown here
19 if [ ! -d "$CACHE_DIR" ]; then
20   /usr/bin/mkdir -p -v "$CACHE_DIR"|| exit 100
21 fi
22 trap "/bin/rm -rf $CACHE_DIR" INT TERM
23
24 function check_previous_run {
25  local machine=$1
26  test -f $CACHE_DIR/$machine && return 0|| return 1
27 }
28
29 function mark_previous_run {
30    machine=$1
31    /usr/bin/touch $CACHE_DIR/$machine
32    return $?
33 }

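Did you notice the trap on line 22? If the script is interrupted, I want to make sure the whole cache is invalidated. Keep in mind that SIGKILL (kill -9) can never be trapped, so only catchable signals such as INT or TERM can trigger this cleanup.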
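
A common pattern for this kind of cleanup is to put the removal in an EXIT trap and turn the catchable signals into a normal exit, so one handler covers every exit path. The sketch below illustrates the pattern in a subshell. scratch_dir is a hypothetical stand-in for the cache directory, and the kill line only simulates someone stopping the script.

```shell
#!/bin/bash
# Hypothetical scratch directory standing in for the article's CACHE_DIR
scratch_dir=$(mktemp -d)

(
    # EXIT fires on every exit path of this subshell, including the 'exit N' calls below
    trap 'rm -rf -- "$scratch_dir"' EXIT
    # Convert catchable signals into a normal exit so the EXIT trap still runs
    trap 'exit 130' INT
    trap 'exit 143' TERM

    kill -TERM "$BASHPID"      # simulate an operator stopping the script
    sleep 1
    echo "this line is not reached"
) || echo "worker stopped with status $?"

[ -d "$scratch_dir" ] || echo "scratch directory was removed"
```

The exit codes 130 and 143 follow the 128+N convention quoted earlier from the bash man page (INT is signal 2, TERM is signal 15).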

And then, add this new helper logic into the remote_copy function (lines 52-81):

52 function remote_copy {
53    local server=$1
54    check_previous_run $server
55    test $? -eq 0 && echo "INFO: $1 ran successfully before. Not doing again" && return 0
56    local retries=$2
57    local now=1
58    status=0
59    while [ $now -le $retries ]; do
60        echo "INFO: Trying to copy file from: $server, attempt=$now"
61        /usr/bin/timeout --kill-after 25.0s 20.0s \
62            /usr/bin/scp \
63                -o BatchMode=yes \
64                -o logLevel=Error \
65                -o ConnectTimeout=5 \
66                -o ConnectionAttempts=3 \
67                ${server}:$REMOTE_FILE ${DATADIR}/lshw-$server-dump.json
68        status=$?
69        if [ $status -ne 0 ]; then
70            sleep_time=$(((RANDOM % 60)+ 1))
71            echo "WARNING: Copy failed for $server:$REMOTE_FILE. Waiting '${sleep_time} seconds' before re-trying..."
72            /usr/bin/sleep ${sleep_time}s
73        else
74            break # All good, no point on waiting...
75        fi
76        ((now=now+1))
77    done
78    test $status -eq 0 && mark_previous_run $server
79    test $? -ne 0 && status=1
80    return $status
81 }

The first time it runs, a new message for the cache directory is printed out:

./collect_data_from_servers.v5.sh
/usr/bin/mkdir: created directory '/tmp/collect_data_from_servers.v5.sh'
/usr/bin/mkdir: created directory '/tmp/collect_data_from_servers.v5.sh/20210612'
ERROR: There was an error in main context, details to follow
INFO: Trying to copy file from: macmini2, attempt=1
ERROR: There was an error in main context, details to follow

If you run it again, then the script knows that dmaf5 is good to go, no need to retry the copy:

./collect_data_from_servers.v5.sh
INFO: dmaf5 ran successfully before. Not doing again
ERROR: There was an error in main context, details to follow
INFO: Trying to copy file from: macmini2, attempt=1
ERROR: There was an error in main context, details to follow
INFO: Trying to copy file from: mac-pro-1-1, attempt=1

Imagine how this speeds up when you have more machines that should not be revisited.
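
The same checkpoint idea can be wrapped around any command. The following is a minimal sketch under assumptions: run_once_per_day is a hypothetical helper, and the cache lives in a fresh temporary directory so the example is self-contained (a real script would use a stable location, as version five does).

```shell
#!/bin/bash
# Hypothetical per-day cache; a temporary directory keeps this demo self-contained
cache_dir="$(mktemp -d)/$(date +%Y%m%d)"
mkdir -p "$cache_dir"

# run_once_per_day KEY COMMAND [ARGS...]
# Runs COMMAND only if no success marker exists for KEY today.
run_once_per_day() {
    local key=$1; shift
    if [ -f "$cache_dir/$key" ]; then
        echo "INFO: $key already done today, skipping"
        return 0
    fi
    # The marker is written only when the command succeeds
    "$@" && touch "$cache_dir/$key"
}

run_once_per_day dmaf5 echo "copying from dmaf5"
run_once_per_day dmaf5 echo "copying from dmaf5"
```

The second call finds the marker left by the first one and skips the work. A failed command leaves no marker, so it is tried again on the next run.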

Leaving crumbs behind: What to log, how to log, and verbose output

If you’re like me, you like a bit of context to correlate with when something goes wrong. The echo statements in the script are nice, but what if you could add a timestamp to them?
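
Adding the timestamp yourself takes only a few lines. This is a minimal sketch, with log as a hypothetical helper name and the message layout chosen just for illustration.

```shell
#!/bin/bash
# log LEVEL MESSAGE...: prefix the message with an ISO 8601 style timestamp and a level
log() {
    local level=$1; shift
    printf '%s %-7s %s\n' "$(date '+%Y-%m-%dT%H:%M:%S%z')" "$level" "$*"
}

log INFO "Trying to copy file from: dmaf5, attempt=1"
log ERROR "Copy failed for macmini2" >&2
```

A fixed-width level column keeps the messages aligned, which helps when you grep through a long run.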

If you use logger, you can save the output in the systemd journal for later review with journalctl (and even aggregate it with other tools out there). The best part is that you get the power of journalctl right away.

So instead of just doing echo, you can also add a call to logger like this using a new bash function called ‘message’:

SCRIPT_NAME=$(/usr/bin/basename $BASH_SOURCE)|| exit 100
FULL_PATH=$(/usr/bin/realpath ${BASH_SOURCE[0]})|| exit 100
set -o errtrace # Enable the err trap, code will get called when an error is detected
trap "echo ERROR: There was an error in ${FUNCNAME[0]-main context}, details to follow" ERR
declare CACHE_DIR="/tmp/$SCRIPT_NAME/$YYYYMMDD"

function message {
    message="$1"
    func_name="${2-unknown}"
    priority=6
    if [ -z "$2" ]; then
        echo "INFO:" $message
    else
        echo "ERROR:" $message
        priority=0
    fi
    /usr/bin/logger --journald<<EOF
MESSAGE_ID=$SCRIPT_NAME
MESSAGE=$message
PRIORITY=$priority
CODE_FILE=$FULL_PATH
CODE_FUNC=$func_name
EOF
}

You can see that you can store separate fields as part of the message, like the priority, the script that produced the message, etc.

So how is this useful? Well, you could get the messages between 1:26 PM and 1:27 PM, only errors (priority=0) and only for our script (collect_data_from_servers.v6.sh) like this, output in JSON format:

journalctl --since 13:26 --until 13:27 --output json-pretty PRIORITY=0 MESSAGE_ID=collect_data_from_servers.v6.sh
{
        "_BOOT_ID" : "dfcda9a1a1cd406ebd88a339bec96fb6",
        "_AUDIT_LOGINUID" : "1000",
        "SYSLOG_IDENTIFIER" : "logger",
        "PRIORITY" : "0",
        "_TRANSPORT" : "journal",
        "_SELINUX_CONTEXT" : "unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023",
        "__REALTIME_TIMESTAMP" : "1623518797641880",
        "_AUDIT_SESSION" : "3",
        "_GID" : "1000",
        "MESSAGE_ID" : "collect_data_from_servers.v6.sh",
        "MESSAGE" : "Copy failed for macmini2:/var/log/lshw-dump.json. Waiting '45 seconds' before re-trying...",
        "_CAP_EFFECTIVE" : "0",
        "CODE_FUNC" : "remote_copy",
        "_MACHINE_ID" : "60d7a3f69b674aaebb600c0e82e01d05",
        "_COMM" : "logger",
        "CODE_FILE" : "/home/josevnz/BashError/collect_data_from_servers.v6.sh",
        "_PID" : "41832",
        "__MONOTONIC_TIMESTAMP" : "25928272252",
        "_HOSTNAME" : "dmaf5",
        "_SOURCE_REALTIME_TIMESTAMP" : "1623518797641843",
        "__CURSOR" : "s=97bb6295795a4560ad6fdedd8143df97;i=1f826;b=dfcda9a1a1cd406ebd88a339bec96fb6;m=60972097c;t=5c494ed383898;x=921c71966b8943e3",
        "_UID" : "1000"
}

Because this is structured data, other log collectors can go through all your machines and aggregate your script logs, so you have not only data but also information.

You can take a look at the whole version six of the script.

Don’t be so eager to replace your data until you’ve checked it

If you noticed from the very beginning, I’ve been copying a corrupted JSON file over and over:

Parse error: Expected separator between values at line 4, column 11
ERROR parsing '/home/josevnz/Documents/lshw-dump/lshw-dmaf5-dump.json'

That’s easy to prevent. Copy the file into a temporary location and, if the file is corrupted, don’t attempt to replace the previous version (and leave the bad one for inspection). See lines 99-107 of version seven of the script:

function remote_copy {
    local server=$1
    check_previous_run $server
    test $? -eq 0 && message "$1 ran successfully before. Not doing again" && return 0
    local retries=$2
    local now=1
    status=0
    while [ $now -le $retries ]; do
        message "Trying to copy file from: $server, attempt=$now"
        /usr/bin/timeout --kill-after 25.0s 20.0s \
            /usr/bin/scp \
                -o BatchMode=yes \
                -o logLevel=Error \
                -o ConnectTimeout=5 \
                -o ConnectionAttempts=3 \
                ${server}:$REMOTE_FILE ${DATADIR}/lshw-$server-dump.json.$$
        status=$?
        if [ $status -ne 0 ]; then
            sleep_time=$(((RANDOM % 60)+ 1))
            message "Copy failed for $server:$REMOTE_FILE. Waiting '${sleep_time} seconds' before re-trying..." ${FUNCNAME[0]}
            /usr/bin/sleep ${sleep_time}s
        else
            break # All good, no point on waiting...
        fi
        ((now=now+1))
    done
    if [ $status -eq 0 ]; then
        /usr/bin/jq '.' ${DATADIR}/lshw-$server-dump.json.$$ > /dev/null 2>&1
        status=$?
        if [ $status -eq 0 ]; then
            /usr/bin/mv -v -f ${DATADIR}/lshw-$server-dump.json.$$ ${DATADIR}/lshw-$server-dump.json && mark_previous_run $server
            test $? -ne 0 && status=1
        else
            message "${DATADIR}/lshw-$server-dump.json.$$ Is corrupted. Leaving for inspection..." ${FUNCNAME[0]}
        fi
    fi
    return $status
}
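
The validate-then-replace step can also be isolated into a small reusable function. The sketch below is an illustration under assumptions: replace_if_valid is a hypothetical helper, and grep is used as a stand-in validator only so the example runs without jq. In the article’s case the validator would be /usr/bin/jq '.'.

```shell
#!/bin/bash
# replace_if_valid CANDIDATE TARGET VALIDATOR [ARGS...]
# Moves CANDIDATE over TARGET only if "VALIDATOR ARGS... CANDIDATE" succeeds.
replace_if_valid() {
    local candidate=$1 target=$2; shift 2
    if "$@" "$candidate" >/dev/null 2>&1; then
        mv -f -- "$candidate" "$target"
    else
        echo "WARNING: $candidate failed validation, keeping $target" >&2
        return 1
    fi
}

workdir=$(mktemp -d)
echo '{"product": "demo"}' > "$workdir/current.json"
echo 'not json at all' > "$workdir/incoming.json"

# grep is only a stand-in validator for this demo
replace_if_valid "$workdir/incoming.json" "$workdir/current.json" grep -q '^{' || true
cat "$workdir/current.json"
```

The good copy survives and the rejected file stays next to it for inspection, which mirrors what remote_copy does with the .$$ suffix.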

Choose the right tools for the task and prep your code from the first line

One very important aspect of error handling is proper coding. If you have bad logic in your code, no amount of error handling will make it better. To keep this short and bash-related, here are a few hints.

You should ALWAYS check for syntax errors before running your script:

bash -n my_bash_script.sh

Seriously. It should be as automatic as performing any other test.
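
To make the check automatic, you can wrap it in a function and call it from a pre-commit hook or a deployment step. This is only a sketch: syntax_ok is a hypothetical helper, and the two temporary files exist just to show both outcomes.

```shell
#!/bin/bash
# Return 0 when bash can parse the file, non-zero otherwise; nothing is executed
syntax_ok() {
    bash -n "$1" 2>/dev/null
}

good=$(mktemp)
broken=$(mktemp)
printf 'echo "hello"\n' > "$good"
printf 'if true; then\n    echo "missing fi"\n' > "$broken"

for script in "$good" "$broken"; do
    if syntax_ok "$script"; then
        echo "$script: syntax OK"
    else
        echo "$script: syntax error"
    fi
done
```

Remember that bash -n only parses the file. It does not catch a misspelled command or a wrong path, which is where the other checks in this section come in.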

Read the bash man page and get familiar with must-know options, like:

set -xv
my_complicated_instruction1
my_complicated_instruction2
my_complicated_instruction3
set +xv

Use ShellCheck to check your bash scripts

It’s very easy to miss simple issues when your scripts start to grow large. ShellCheck is one of those tools that saves you from making mistakes.

shellcheck collect_data_from_servers.v7.sh

In collect_data_from_servers.v7.sh line 15:
for dependency in ${dependencies[@]}; do
                  ^----------------^ SC2068: Double quote array expansions to avoid re-splitting elements.


In collect_data_from_servers.v7.sh line 16:
    if [ ! -x $dependency ]; then
              ^---------^ SC2086: Double quote to prevent globbing and word splitting.

Did you mean: 
    if [ ! -x "$dependency" ]; then
...
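
If you are wondering why those two warnings matter, the following sketch shows what an unquoted array expansion does to an element that contains spaces. The file names are hypothetical.

```shell
#!/bin/bash
# Print how many arguments were received
count_args() {
    echo "$#"
}

# Hypothetical file names; the first one contains spaces
reports=("lshw dmaf5 dump.json" "lshw-macmini2-dump.json")

unquoted=$(count_args ${reports[@]})     # SC2068: elements are split on whitespace
quoted=$(count_args "${reports[@]}")     # each element stays one argument

echo "unquoted: $unquoted arguments, quoted: $quoted arguments"
```

Two array elements turn into four arguments without the quotes, and a loop over them would visit file names that do not exist.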

If you’re wondering, the final version of the script, after passing ShellCheck is here. Squeaky clean.

You noticed something with the background scp processes

You probably noticed that if you kill the script, it leaves some forked processes behind. That isn’t good and this is one of the reasons I prefer to use tools like Ansible or Parallel to handle this type of task on multiple hosts, letting the frameworks do the proper cleanup for me. You can, of course, add more code to handle this situation.

This bash script could potentially create a fork bomb. It has no control of how many processes to spawn at the same time, which is a big problem in a real production environment. Also, there is a limit on how many concurrent ssh sessions you can have (let alone consume bandwidth). Again, I wrote this fictional example in bash to show you how you can always improve a program to better handle errors.
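
If you do stay in bash, one way to cap the number of simultaneous jobs is wait -n, which returns as soon as any one background job finishes (it needs bash 4.3 or later). The sketch below is an assumption about how that could look, with fake_copy as a hypothetical stand-in for remote_copy.

```shell
#!/bin/bash
# Hypothetical stand-in for remote_copy
fake_copy() {
    sleep 0.2
    echo "done: $1"
}

# run_limited MAX_JOBS ITEM...: never keep more than MAX_JOBS copies running
run_limited() {
    local max_jobs=$1; shift
    local item
    for item in "$@"; do
        # While the limit is reached, block until any one job finishes
        while [ "$(jobs -rp | wc -l)" -ge "$max_jobs" ]; do
            wait -n || true
        done
        fake_copy "$item" &
    done
    wait    # collect the remaining jobs
}

run_limited 2 macmini2 mac-pro-1-1 dmaf5
```

This sketch does not collect per-job exit codes. Combining it with the PID bookkeeping shown earlier is left out to keep it short.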

Let’s recap

1.  You must check the return code of your commands. That could mean deciding to retry until a transitory condition improves or to short-circuit the whole script.
2.  Speaking of transitory conditions, you don’t need to start from scratch. You can save the status of successful tasks and then retry from that point forward.
3.  Bash ‘trap’ is your friend. Use it for cleanup and error handling.
4.  When downloading data from any source, assume it’s corrupted. Never overwrite your good data set with fresh data until you have done some integrity checks.
5.  Take advantage of journalctl and custom fields. You can perform sophisticated searches looking for issues, and even send that data to log aggregators.
6.  You can check the status of background tasks (including sub-shells). Just remember to save the PID and wait on it.
7.  And finally: Use a Bash lint helper like ShellCheck. You can install it in your favorite editor (like VIM or PyCharm). You will be surprised how many errors go undetected in Bash scripts…

If you enjoyed this content or would like to expand on it, contact the team at enable-sysadmin@redhat.com.

Error handling is a very important part of any programming language. Bash offers no better facilities for handling script errors than other programming languages do, but it is still important that a Bash script runs without errors when executed from the terminal. Error handling can be implemented in a Bash script in several ways. This article shows different methods of handling errors in a Bash script.

Example 1: Error handling using a conditional statement

Create a Bash file with the following script, which shows the use of a conditional statement for error handling. The first "if" statement checks the total number of command-line arguments and prints an error message if the count is less than 2. Next, the dividend and divisor values are taken from the command-line arguments. If the divisor is 0, an error is generated, and the error message is written to the error.txt file. The second "if" statement checks whether the error.txt file is empty. An error message is printed if the error.txt file is not empty.

#!/bin/bash
#Check the argument values
if [ $# -lt 2 ]; then
   echo "One or more argument is missing."
   exit
fi
#Read the dividend value from the first command-line argument
dividend=$1
#Read the divisor value from the second command-line argument
divisor=$2
#Divide the dividend by the divisor
result=`echo "scale=2; $dividend/$divisor"|bc 2>error.txt`
#Read the content of the error file
content=`cat error.txt`
if [ -n "$content" ]; then
  #Print the error message if the error.txt file is non-empty
  echo "Divisible by zero error occurred."
else
  #Print the result
  echo "$dividend/$divisor = $result"
fi

Output:

The following output appears after executing the previous script without any arguments:

andreyex@andreyex:~/Desktop/bash$ bash error1.bash
One or more argument is missing.
andreyex@andreyex:~/Desktop/bash$

The following output appears after executing the previous script with one argument value:

andreyex@andreyex:~/Desktop/bash$ bash error1.bash 75
One or more argument is missing.
andreyex@andreyex:~/Desktop/bash$

The following output appears after executing the previous script with two valid argument values:

andreyex@andreyex:~/Desktop/bash$ bash error1.bash 75 8
75/8 = 9.37
andreyex@andreyex:~/Desktop/bash$

The following output appears after executing the previous script with two argument values where the second argument is 0. The error message is printed:

andreyex@andreyex:~/Desktop/bash$ bash error1.bash 75 0
Divisible by zero error occurred.
andreyex@andreyex:~/Desktop/bash$
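
Another way to handle the same error is to validate the input before dividing and to signal failure through the function’s exit status, without a temporary error file. The following is a sketch under assumptions, not the script above: divide is a hypothetical function, it accepts integers only, and it uses awk instead of bc, so results are rounded rather than truncated.

```shell
#!/bin/bash
# divide DIVIDEND DIVISOR: print the quotient with two decimals,
# or print a message on stderr and return non-zero
divide() {
    local dividend=$1 divisor=$2
    local integer='^-?[0-9]+$'
    if ! [[ $dividend =~ $integer && $divisor =~ $integer ]]; then
        echo "ERROR: both arguments must be integers" >&2
        return 2
    fi
    if [[ $divisor =~ ^-?0+$ ]]; then
        echo "ERROR: division by zero" >&2
        return 1
    fi
    awk -v a="$dividend" -v b="$divisor" 'BEGIN { printf "%.2f\n", a / b }'
}

divide 10 4
divide 75 0 || echo "divide failed with status $?"
```

Distinct return codes let the caller tell a bad argument (2) from a division by zero (1).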

Example 2: Error handling using the exit status code

Create a Bash file with the following script, which shows Bash error handling based on the exit status code. Any Bash command is taken as the input value, and that command is then executed. If the exit status code is not zero, an error message is printed. Otherwise, a success message is printed.

#!/bin/bash

#Take a Linux command name
echo -n "Enter a command: "
read cmd_name
#Run the command
$cmd_name
#Check whether the command is valid
if [ $? -ne 0 ]; then
   echo "$cmd_name is a invalid command."
else
   echo "$cmd_name is a valid command."
fi

Output:

The following output appears after executing the previous script with a valid command. Here, "date" is taken as the command in the input value, and it is valid:

andreyex@andreyex:~/Desktop/bash$ bash error2.bash
Enter a command: date
Tue Dec 27 19:18:39 +06 2022
date is a valid command.
andreyex@andreyex:~/Desktop/bash$

The following output appears after executing the previous script with an invalid command. Here, "cmd" is taken as an invalid command in the input value:

andreyex@andreyex:~/Desktop/bash$ bash error2.bash
Enter a command: cmd
error2.bash: line 7: cmd: command not found
cmd is a invalid command.
andreyex@andreyex:~/Desktop/bash$
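
A non-zero status does not always mean the command is invalid. A valid command can also fail, for example ls on a missing directory. Bash reserves 127 for "command not found" and 126 for "found but not executable", and uses 128+N for a command terminated by signal N. The sketch below, with describe_status as a hypothetical helper, tells these cases apart.

```shell
#!/bin/bash
# Translate an exit status into a short description, following bash conventions
describe_status() {
    local status=$1
    case $status in
        0)   echo "success" ;;
        126) echo "command found but not executable" ;;
        127) echo "command not found" ;;
        *)
            if [ "$status" -gt 128 ]; then
                echo "terminated by signal $((status - 128))"
            else
                echo "command ran but failed with status $status"
            fi
            ;;
    esac
}

# no-such-command-xyz is a made-up name; the last entry is a valid command that fails
for cmd in "date" "no-such-command-xyz" "ls /no/such/directory"; do
    status=0
    $cmd >/dev/null 2>&1 || status=$?
    echo "$cmd: $(describe_status "$status")"
done
```

The variable is left unquoted on purpose so that "ls /no/such/directory" is split into a command and its argument, the same way the example script runs its input.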

Example 3: Stop execution on the first error

Create a Bash file with the following script, which shows how to stop execution when the first error occurs. Two invalid commands are used in the following script, so two errors would be produced. The script stops executing after the first invalid command because of the "set -e" command.

#!/bin/bash
#Set the option to exit the script on the first error
set -e
echo 'Current date and time: '
#Valid command
date
echo 'Current Working Directory: '
#Invalid command
cwd
echo 'Username: '
#Valid command
whoami
echo 'List of files and folders: '
#Invalid command
list

Output:

The following output appears after running the previous script. The script stops executing after the invalid command “cwd” fails:

andreyex@andreyex:~/Desktop/bash$ bash error3.bash
Current date and time:
Tue Dec 27 19:19:38 +06 2022
Current Working Directory:
error3.bash: line 9: cwd: command not found
andreyex@andreyex:~/Desktop/bash$

Example 4: Stop Execution for an Uninitialized Variable

Create a Bash file with the following script, which shows how to stop the script when an uninitialized variable is used. The username and password values are taken from the command-line arguments. If either argument is missing, the corresponding positional parameter is unset, “set -u” stops the script, and an error message is printed. If both are given, the script checks whether the username and password are valid.

#!/bin/bash

#Set the option to terminate the script on an uninitialized variable
set -u

#Set the first command-line argument value as the username
username=$1
#Set the second command-line argument value as the password
password=$2
#Check whether the username and password are valid or invalid
if [[ $username == 'admin' && $password == 'hidenseek' ]]; then
    echo "Valid user."
else
    echo "Invalid user."
fi

Output:

The following output appears if the script is run without any command-line argument value. The script stops executing at the first uninitialized variable:

andreyex@andreyex:~/Desktop/bash$ bash error4.bash 

error4.bash: line 7: $1: unbound variable 

andreyex@andreyex:~/Desktop/bash$

The following output appears if the script is run with one command-line argument value. The script stops executing at the second uninitialized variable:

andreyex@andreyex:~/Desktop/bash$ bash error4.bash admin

error4.bash: line 9: $2: unbound variable

andreyex@andreyex:~/Desktop/bash$

The following output appears if the script is run with two command-line argument values, “admin” and “hide”. Here, the username is valid but the password is not, so the message “Invalid user.” is printed:

andreyex@andreyex:~/Desktop/bash$ bash error4.bash admin hide

Invalid user.

andreyex@andreyex:~/Desktop/bash$

The following output appears if the script is run with two command-line argument values, “admin” and “hidenseek”. Here, both the username and the password are valid, so the message “Valid user.” is printed:

andreyex@andreyex:~/Desktop/bash$ bash error4.bash admin hidenseek

Valid user.

andreyex@andreyex:~/Desktop/bash$
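As a complementary sketch (not part of the original example), the ${1:?message} parameter expansion lets a script fail with its own message when an argument is missing, instead of the generic “unbound variable” text. The child bash -c call and the name demo are assumptions used only to keep the illustration self-contained.

```shell
#!/bin/bash
# Sketch: fail with a custom message when the first argument is missing.
# The inner script runs in a child bash so that its exit does not end this shell.
status=0
message=$(bash -c 'set -u
username=${1:?"username is required"}
echo "user: $username"' demo 2>&1) || status=$?

echo "exit status: $status"
echo "message: $message"
```

The exact prefix of the message depends on the shell, so only the custom text and a non-zero status should be relied on.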

Conclusion

This article shows different ways of handling errors in a Bash script using several examples. We hope it helps Bash users implement error handling in their own Bash scripts.


Writing a robust, bug-free bash script is always a challenge. Even if you write a perfect bash script, it can still fail because of external factors such as invalid input or network problems.

The bash shell has no exception-handling mechanism such as try/catch constructs. Some bash errors may be silently ignored but still have consequences later on.

Catching and handling errors in bash

Checking the exit status of a command

It is always a good idea to check the exit status of a command, since a non-zero exit status usually indicates an error:

if ! command; then
    echo "command returned an error"
fi

Another (more compact) way to trigger error handling based on the exit status is to use OR:

<command_1> || <command_2>

With the OR operator, <command_2> is executed if and only if <command_1> returns a non-zero exit status.

As the second command, you can use your own Bash error-handling function:

error_exit()
{
    echo "Error: $1" >&2
    exit 1
}

bad-command || error_exit "Some error"

Bash has a built-in variable, $?, which tells you the exit status of the last executed command.

After a bash function is called, $? holds the exit status of the last command executed inside the function. Since some non-zero exit codes have special meanings, you can handle them selectively:

status=$?

case "$status" in
"1") echo "General error";;
"2") echo "Misuse of shell builtins";;
"126") echo "Command invoked cannot execute";;
"128") echo "Invalid argument";;
esac
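Because every command overwrites $?, the value has to be saved right after the command of interest. The sketch below shows one way to do that; describe_status and no_such_command_xyz are hypothetical names used only for illustration, and status 127 (command not found) is an addition to the list above.

```shell
#!/bin/bash
# Hypothetical helper: map an exit status to a readable message.
describe_status() {
    case "$1" in
        0)   echo "Success" ;;
        126) echo "Command invoked cannot execute" ;;
        127) echo "Command not found" ;;
        *)   echo "Failed with status $1" ;;
    esac
}

status=0
# Save $? immediately: the next command would overwrite it.
no_such_command_xyz 2>/dev/null || status=$?

describe_status "$status"
```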

Exiting a script on error in Bash

When an error occurs in a bash script, by default the shell prints an error message to stderr but continues executing the rest of the script. Even a mistyped command does not terminate the script; you just see a “command not found” error.
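A minimal sketch of this default behavior, run in a child bash process so that the failure stays contained; no_such_command_xyz is a made-up name that is assumed not to exist on the system.

```shell
#!/bin/bash
# The child script hits an unknown command, yet the following echo still runs.
output=$(bash -c 'no_such_command_xyz; echo "still running"' 2>/dev/null)
echo "$output"
```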

This default shell behavior may be undesirable for some bash scripts. For example, if a script contains a critical block of code in which no errors are allowed, you want the script to exit immediately when any error occurs inside that block. To activate this “exit on error” behavior in bash, use the set command as follows:

set -e
# some critical block of code where no error is allowed
set +e

When called with the -e option, the set command makes the bash shell exit immediately if any subsequent command exits with a non-zero status (caused by an error condition). The +e option returns the shell to its default mode. set -e is equivalent to set -o errexit. Likewise, set +e is shorthand for set +o errexit.

However, set -e looks only at the exit status of a pipeline as a whole, which by default is the status of its last command, so a failure earlier in the pipeline goes unnoticed:

set -e
true | false | true
echo "This will be printed" # the "false" inside the pipeline is not detected

If you want any failure inside a pipeline to terminate the bash script as well, add the -o pipefail option:

set -o pipefail -e
true | false | true # the "false" inside the pipeline is detected correctly
echo "This will not be printed"
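The difference can also be observed directly from the exit status of the pipeline. A small sketch, using child bash processes so that the option does not leak into the calling shell (the variable names are arbitrary):

```shell
#!/bin/bash
# Without pipefail the pipeline reports the status of its last command.
status_default=0
bash -c 'true | false | true' || status_default=$?

# With pipefail it reports the status of the rightmost failing command.
status_pipefail=0
bash -c 'set -o pipefail; true | false | true' || status_pipefail=$?

echo "without pipefail: $status_default"
echo "with pipefail:    $status_pipefail"
```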

To protect a critical block in a script against any kind of command error or pipeline error, use the following combination of set commands:

set -o pipefail -e
# some critical block of code in which no command error or pipeline error is allowed
set +o pipefail +e

Contents

  • 1 Problem
  • 2 Solutions
    • 2.1 Executed in subshell, exit on error
    • 2.2 Executed in subshell, trap error
  • 3 Caveat 1: `Exit on error’ ignoring subshell exit status
    • 3.1 Solution: Generate error yourself if subshell fails
      • 3.1.1 Example 1
      • 3.1.2 Example 2
  • 4 Caveat 2: `Exit on error’ not exiting subshell on error
    • 4.1 Solution: Use logical operators (&&, ||) within subshell
      • 4.1.1 Example
  • 5 Caveat 3: `Exit on error’ not exiting command substitution on error
    • 5.1 Solution 1: Use logical operators (&&, ||) within command substitution
    • 5.2 Solution 2: Enable posix mode
  • 6 The tools
    • 6.1 Exit on error
      • 6.1.1 Specify `bash -e’ as the shebang interpreter
        • 6.1.1.1 Example
      • 6.1.2 Set ERR trap to exit
        • 6.1.2.1 Example
  • 7 Solutions revisited: Combining the tools
    • 7.1 Executed in subshell, trap on exit
      • 7.1.1 Rationale
    • 7.2 Sourced in current shell
      • 7.2.1 Todo
      • 7.2.2 Rationale
        • 7.2.2.1 `Exit’ trap in sourced script
        • 7.2.2.2 `Break’ trap in sourced script
        • 7.2.2.3 Trap in function in sourced script without `errtrace’
        • 7.2.2.4 Trap in function in sourced script with ‘errtrace’
        • 7.2.2.5 `Break’ trap in function in sourced script with `errtrace’
  • 8 Test
  • 9 See also
  • 10 Journal
    • 10.1 20210114
    • 10.2 20060524
    • 10.3 20060525
  • 11 Comments

Problem

I want to catch errors in bash script using set -e (or set -o errexit or trap ERR). What are best practices?

Solutions

See #Solutions revisited: Combining the tools for detailed explanations.

If the script is executed in a subshell, it’s relatively easy: you don’t have to worry about backing up and restoring shell options and shell traps, because they’re automatically restored when you exit the subshell.
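A small sketch of that automatic restore, using errtrace and an ERR trap as the example settings. The variable names are arbitrary, and the outer shell is assumed to start with neither setting active.

```shell
#!/bin/bash
# Change an option and a trap inside a subshell, then report them from there.
inside=$( ( set -o errtrace; trap 'echo trapped' ERR; shopt -po errtrace; trap -p ERR ) )

# Back in the outer shell nothing has changed.
outside=$(trap -p ERR; shopt -po errtrace || true)

printf 'inside:\n%s\n' "$inside"
printf 'outside:\n%s\n' "$outside"
```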

Executed in subshell, exit on error

Example script:

#!/bin/bash -eu
# -e: Exit immediately if a command exits with a non-zero status.
# -u: Treat unset variables as an error when substituting.
 
(false)                   # Caveat 1: If an error occurs in a subshell, it isn't detected
(false) || false          # Solution: If you want to exit, you have to detect the error yourself
(false; true) || false    # Caveat 2: The return status of the ';' separated list is `true'
(false && true) || false  # Solution: If you want to control the last command executed, use `&&'

See also #Caveat 1: `Exit on error’ ignoring subshell exit status

Executed in subshell, trap error

#!/bin/bash -Eu
# -E: ERR trap is inherited by shell functions.
# -u: Treat unset variables as an error when substituting.
# 
# Example script for handling bash errors.  Exit on error.  Trap exit.
# This script is supposed to run in a subshell.
# See also: http://fvue.nl/wiki/Bash:_Error_handling

    #  Trap non-normal exit signals: 1/HUP, 2/INT, 3/QUIT, 15/TERM, ERR
trap onexit 1 2 3 15 ERR


#--- onexit() -----------------------------------------------------
#  @param $1 integer  (optional) Exit status.  If not set, use `$?'

function onexit() {
    local exit_status=${1:-$?}
    echo Exiting $0 with $exit_status
    exit $exit_status
}


# myscript


    # Always call `onexit' at end of script
onexit

Caveat 1: `Exit on error’ ignoring subshell exit status

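With the bash version this page was written for (3.x, where `-e’ only applies to simple commands), the `-e’ setting does not exit if an error occurs within a subshell such as (false). Bash 4.0 and later also apply `-e’ to failing compound commands, so there the script below exits right after (false).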

Example script caveat1.sh:

#!/bin/bash -e
echo begin
(false)
echo end

Executing the script above gives:

$ ./caveat1.sh
begin
end
$

Conclusion: the script didn’t exit after (false).

Solution: Generate error yourself if subshell fails

( SHELL COMMANDS ) || false

In the line above, the exit status of the subshell is checked. The subshell must exit with a zero status — indicating success, otherwise `false’ will run, generating an error in the current shell.

Note that within a bash `list’, with commands separated by a `;’, the return status is the exit status of the last command executed. Use the control operators `&&’ and `||’ if you want to control the last command executed:

$ (false; true) || echo foo
$ (false && true) || echo foo
foo
$

Example 1

Example script example.sh:

#!/bin/bash -e
echo begin
(false) || false
echo end

Executing the script above gives:

$ ./example.sh
begin
$

Conclusion: the script exits after false.

Example 2

Example bash commands:

$ trap 'echo error' ERR       # Set ERR trap
$ false                       # Non-zero exit status will be trapped
error
$ (false)                     # Non-zero exit status within subshell will not be trapped
$ (false) || false            # Solution: generate error yourself if subshell fails
error
$ trap - ERR                  # Reset ERR trap

Caveat 2: `Exit on error’ not exiting subshell on error

The `-e’ setting doesn’t always immediately exit the subshell `(…)’ when an error occurs. It appears the subshell behaves as a simple command and has the same restrictions as `-e’:

Exit immediately if a simple command exits with a non-zero status, unless the subshell is part of the command list immediately following a `while’ or `until’ keyword, part of the test in an `if’ statement, part of the right-hand-side of a `&&’ or `||’ list, or if the command’s return status is being inverted using `!’

Example script caveat2.sh:

#!/bin/bash -e
(false; echo A)                        # Subshell exits after `false'
!(false; echo B)                       # Subshell doesn't exit after `false'
true && (false; echo C)                # Subshell exits after `false'
(false; echo D) && true                # Subshell doesn't exit after `false'
(false; echo E) || false               # Subshell doesn't exit after `false'
if (false; echo F); then true; fi      # Subshell doesn't exit after `false'
while (false; echo G); do break; done  # Subshell doesn't exit after `false'
until (false; echo H); do break; done  # Subshell doesn't exit after `false'

Executing the script above gives:

$ ./caveat2.sh
B
D
E
F
G
H

Solution: Use logical operators (&&, ||) within subshell

Use logical operators `&&’ or `||’ to control execution of commands within a subshell.

Example

#!/bin/bash -e
(false && echo A)
!(false && echo B)
true && (false && echo C)
(false && echo D) && true
(false && echo E) || false
if (false && echo F); then true; fi
while (false && echo G); do break; done
until (false && echo H); do break; done

Executing the script above gives no output:

$ ./example.sh
$

Conclusion: the subshells do not output anything because the `&&’ operator is used instead of the command separator `;’ as in caveat2.sh.

Caveat 3: `Exit on error’ not exiting command substitution on error

The `-e’ setting doesn’t immediately exit command substitution when an error occurs, except when bash is in posix mode:

$ set -e
$ echo $(false; echo A)
A

Solution 1: Use logical operators (&&, ||) within command substitution

$ set -e
$ echo $(false || echo A)

Solution 2: Enable posix mode

When posix mode is enabled via set -o posix, command substitution will exit if `-e’ has been set in the parent shell.

$ set -e
$ set -o posix
$ echo $(false; echo A)

Note that posix mode also changes other aspects of bash’s behaviour, so enabling it may have side effects elsewhere in the script.
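A less invasive alternative, assuming bash 4.4 or newer: the inherit_errexit shell option makes command substitutions inherit -e without switching the whole shell to posix mode. A sketch comparing both cases in child bash processes (the variable names are arbitrary):

```shell
#!/bin/bash
# Default: the command substitution ignores -e, so "A" is still printed.
without=$(bash -c 'set -e; echo "$(false; echo A)"')

# With inherit_errexit the substitution stops at `false' and prints nothing.
with=$(bash -c 'set -e; shopt -s inherit_errexit; echo "$(false; echo A)"')

echo "without inherit_errexit: '$without'"
echo "with inherit_errexit:    '$with'"
```

Note that in both cases the outer echo itself still succeeds; to make the script exit, assign the substitution to a variable first.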

The tools

Exit on error

Bash can be told to exit immediately if a command fails. From the bash manual («set -e»):

«Exit immediately if a simple command (see SHELL GRAMMAR above) exits with a non-zero status. The shell does not exit if the command that fails is part of the command list immediately following a while or until keyword, part of the test in an if statement, part of a && or || list, or if the command’s return value is being inverted via !. A trap on ERR, if set, is executed before the shell exits.»

To let bash exit on error, different notations can be used:

  1. Specify `bash -e’ as shebang interpreter
  2. Start shell script with `bash -e’
  3. Use `set -e’ in shell script
  4. Use `set -o errexit’ in shell script
  5. Use `trap exit ERR’ in shell script

Specify `bash -e’ as the shebang interpreter

You can add `-e’ to the shebang line, the first line of your shell script:

#!/bin/bash -e

This will execute the shell script with `-e’ active. Note `-e’ can be overridden by invoking bash explicitly (without `-e’):

$ bash shell_script

Example

Create this shell script example.sh and make it executable with chmod u+x example.sh:

#!/bin/bash -e
echo begin
false     # This should exit bash because `false' returns error
echo end  # This should never be reached

Example run:

$ ./example.sh
begin
$ bash example.sh
begin
end
$

Set ERR trap to exit

By setting an ERR trap you can catch errors as well:

trap command ERR

By setting the command to `exit’, bash exits if an error occurs.

trap exit ERR

Example

Example script example.sh:

#!/bin/bash
trap exit ERR
echo begin
false
echo end

Example run:

$ ./example.sh
begin
$

The non-zero exit status of `false’ is caught by the error trap. The trap runs `exit’, so `echo end’ is never reached.
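The trap command can do more than exit. As a sketch, the handler below reports which command failed and with what status before exiting. BASH_COMMAND is a standard bash variable, while the message wording is arbitrary. The script is fed to a child bash so that its `exit' does not end the calling shell.

```shell
#!/bin/bash
report=$(bash -s <<'EOF'
trap 'echo "ERR trap: \"$BASH_COMMAND\" exited with status $?"; exit 1' ERR
echo begin
false
echo end
EOF
) || true

echo "$report"
```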

Solutions revisited: Combining the tools

Executed in subshell, trap on exit

#!/bin/bash
# --- subshell_trap.sh -------------------------------------------------
# Example script for handling bash errors.  Exit on error.  Trap exit.
# This script is supposed to run in a subshell.
# See also: http://fvue.nl/wiki/Bash:_Error_handling
 
    # Let shell functions inherit ERR trap.  Same as `set -E'.
set -o errtrace 
    # Trigger error when expanding unset variables.  Same as `set -u'.
set -o nounset
    #  Trap non-normal exit signals: 1/HUP, 2/INT, 3/QUIT, 15/TERM, ERR
    #  NOTE1: - 9/KILL cannot be trapped.
    #+        - 0/EXIT isn't trapped because:
    #+          - with ERR trap defined, trap would be called twice on error
    #+          - with ERR trap defined, syntax errors exit with status 0, not 2
    #  NOTE2: Setting an ERR trap does not enable `set -o errexit' (`set -e');
    #+        the script exits on error only because `onexit' itself calls `exit'.
trap onexit 1 2 3 15 ERR
 
 
#--- onexit() -----------------------------------------------------
#  @param $1 integer  (optional) Exit status.  If not set, use `$?'
 
function onexit() {
    local exit_status=${1:-$?}
    echo Exiting $0 with $exit_status
    exit $exit_status
}
 
 
 
# myscript
 
 
 
    # Always call `onexit' at end of script
onexit

Rationale

+-------+   +----------+  +--------+  +------+
| shell |   | subshell |  | script |  | trap |
+-------+   +----------+  +--------+  +------+
     :           :            :           :
    +-+         +-+          +-+  error  +-+
    | |         | |          | |-------->| |
    | |  exit   | |          | !         | |
    | |<-----------------------------------+
    +-+          :            :           :
     :           :            :           :

Figure 1. Trap in executed script
When a script is executed from a shell, bash will create a subshell in which the script is run. If a trap catches an error, and the trap says `exit’, this will cause the subshell to exit.

Sourced in current shell

If the script is sourced (included) in the current shell, you have to worry about restoring shell options and shell traps. If they aren’t restored, they might cause problems in other programs which rely on specific settings.

#!/bin/bash
#--- listing6.inc.sh ---------------------------------------------------
# Demonstration of ERR trap being reset by foo_deinit() with the use
# of `errtrace'.
# Example run:
#
#    $ set +o errtrace         # Make sure errtrace is not set (bash default)
#    $ trap - ERR              # Make sure no ERR trap is set (bash default)
#    $ . listing6.inc.sh       # Source listing6.inc.sh
#    $ foo                     # Run foo()
#    foo_init
#    Entered `trap-loop'
#    trapped
#    This is always executed - with or without a trap occurring
#    foo_deinit
#    $ trap                    # Check if ERR trap is reset.
#    $ set -o | grep errtrace  # Check if the `errtrace' setting is...
#    errtrace        off        # ...restored.
#    $
#
# See: http://fvue.nl/wiki/Bash:_Error_handling
 
function foo_init {
    echo foo_init 
    fooOldErrtrace=$(set +o | grep errtrace)
    set -o errtrace
    trap 'echo trapped; break' ERR   # Set ERR trap 
}
function foo_deinit {
    echo foo_deinit
    trap - ERR                # Reset ERR trap
    eval $fooOldErrtrace      # Restore `errtrace' setting
    unset fooOldErrtrace      # Delete global variable
}
function foo {
    foo_init
        # `trap-loop'
    while true; do
        echo Entered \`trap-loop\'
        false
        echo This should never be reached because the \`false\' above is trapped
        break
    done
    echo This is always executed - with or without a trap occurring
    foo_deinit
}

Todo

  • an existing ERR trap must be restored and called
  • test if the `trap-loop’ is reached if the script breaks from a nested loop

Rationale

`Exit’ trap in sourced script

When the script is sourced in the current shell, it’s not possible to use `exit’ to terminate the program: This would terminate the current shell as well, as shown in the picture underneath.

+-------+                 +--------+  +------+
| shell |                 | script |  | trap |
+-------+                 +--------+  +------+
    :                         :           :
   +-+                       +-+  error  +-+
   | |                       | |-------->| |
   | |                       | |         | |
   | | exit                  | |         | |
<------------------------------------------+
    :                         :           :

Figure 2. `Exit’ trap in sourced script
When a script is sourced from a shell, bash will run the script in the current shell. If a trap catches an error, and the trap says `exit’, this will cause the current shell to exit.

`Break’ trap in sourced script

A solution is to introduce a main loop in the program, which is terminated by a `break’ statement within the trap.

+-------+    +--------+  +--------+   +------+
| shell |    | script |  | `loop' |   | trap |
+-------+    +--------+  +--------+   +------+
     :           :            :          :  
    +-+         +-+          +-+  error +-+
    | |         | |          | |------->| |
    | |         | |          | |        | |
    | |         | |  break   | |        | |
    | |  return | |<----------------------+
    | |<----------+           :          :
    +-+          :            :          :
     :           :            :          :

Figure 3. `Break’ trap in sourced script
When a script is sourced from a shell, e.g. . ./script, bash will run the script in the current shell. If a trap catches an error, and the trap says `break’, this will cause the `loop’ to break and to return to the script.

For example:

#!/bin/bash
#--- listing3.sh -------------------------------------------------------
# See: http://fvue.nl/wiki/Bash:_Error_handling

trap 'echo trapped; break' ERR;  # Set ERR trap

function foo { echo foo; false; }  # foo() exits with error

    # `trap-loop'
while true; do
    echo Entered \`trap-loop\'
    foo
    echo This is never reached
    break
done

echo This is always executed - with or without a trap occurring

trap - ERR  # Reset ERR trap

Listing 3. `Break’ trap in sourced script

Example output:

$> source listing3.sh
Entered `trap-loop'
foo
trapped
This is always executed - with or without a trap occurring
$>

Trap in function in sourced script without `errtrace’

A problem arises when the trap is reset from within a function of a sourced script. From the bash manual, set -o errtrace or set -E:

If set, any trap on `ERR’ is inherited by shell functions, command substitutions, and commands executed in a subshell environment. The `ERR’ trap is normally not inherited in such cases.

So with errtrace not set, a function does not know of any `ERR’ trap set, and thus the function is unable to reset the `ERR’ trap. For example, see listing 4 underneath.

#!/bin/bash
#--- listing4.inc.sh ---------------------------------------------------
# Demonstration of ERR trap not being reset by foo_deinit()
# Example run:
# 
#    $> set +o errtrace     # Make sure errtrace is not set (bash default)
#    $> trap - ERR          # Make sure no ERR trap is set (bash default)
#    $> . listing4.inc.sh   # Source listing4.inc.sh
#    $> foo                 # Run foo()
#    foo_init
#    foo
#    foo_deinit             # This should've reset the ERR trap...
#    $> trap                # but the ERR trap is still there:
#    trap -- 'echo trapped' ERR
#    $> trap

# See: http://fvue.nl/wiki/Bash:_Error_handling

function foo_init   { echo foo_init 
                      trap 'echo trapped' ERR;} # Set ERR trap 

function foo_deinit { echo foo_deinit
                      trap - ERR             ;} # Reset ERR trap

function foo        { foo_init
                      echo foo
                      foo_deinit             ;}

Listing 4. Trap in function in sourced script
foo_deinit() is unable to unset the ERR trap, because errtrace is not set.

Trap in function in sourced script with ‘errtrace’

The solution is to set -o errtrace. See listing 5 underneath:

#!/bin/bash
#--- listing5.inc.sh ---------------------------------------------------
# Demonstration of ERR trap being reset by foo_deinit() with the use
# of `errtrace'.
# Example run:
#
#    $> set +o errtrace         # Make sure errtrace is not set (bash default)
#    $> trap - ERR              # Make sure no ERR trap is set (bash default)
#    $> . listing5.inc.sh       # Source listing5.inc.sh
#    $> foo                     # Run foo()
#    foo_init
#    foo
#    foo_deinit                 # This should reset the ERR trap...
#    $> trap                    # and it is indeed.
#    $> set +o | grep errtrace  # And the `errtrace' setting is restored.
#    $>
#
# See: http://fvue.nl/wiki/Bash:_Error_handling

function foo_init   { echo foo_init 
                      fooOldErrtrace=$(set +o | grep errtrace)
                      set -o errtrace
                      trap 'echo trapped' ERR   # Set ERR trap 
                    }
function foo_deinit { echo foo_deinit
                      trap - ERR                # Reset ERR trap
                      eval "$fooOldErrtrace"    # Restore `errtrace' setting
                      unset fooOldErrtrace      # Delete global variable
                    }
function foo        { foo_init
                      echo foo
                      foo_deinit             ;}

`Break’ trap in function in sourced script with `errtrace’

Everything combined in listing 6 underneath:

#!/bin/bash
#--- listing6.inc.sh ---------------------------------------------------
# Demonstration of ERR trap being reset by foo_deinit() with the use
# of `errtrace'.
# Example run:
#
#    $> set +o errtrace         # Make sure errtrace is not set (bash default)
#    $> trap - ERR              # Make sure no ERR trap is set (bash default)
#    $> . listing6.inc.sh       # Source listing6.inc.sh
#    $> foo                     # Run foo()
#    foo_init
#    Entered `trap-loop'
#    trapped
#    This is always executed - with or without a trap occurring
#    foo_deinit
#    $> trap                    # Check if ERR trap is reset.
#    $> set -o | grep errtrace  # Check if the `errtrace' setting is...
#    errtrace        off        # ...restored.
#    $>
#
# See: http://fvue.nl/wiki/Bash:_Error_handling

function foo_init {
    echo foo_init 
    fooOldErrtrace=$(set +o | grep errtrace)
    set -o errtrace
    trap 'echo trapped; break' ERR   # Set ERR trap 
}
function foo_deinit {
    echo foo_deinit
    trap - ERR                # Reset ERR trap
    eval $fooOldErrtrace      # Restore `errtrace' setting
    unset fooOldErrtrace      # Delete global variable
}
function foo {
    foo_init
        # `trap-loop'
    while true; do
        echo Entered \`trap-loop\'
        false
        echo This should never be reached because the \`false\' above is trapped
        break
    done
    echo This is always executed - with or without a trap occurring
    foo_deinit
}

Test

#!/bin/bash

    # Tests

    # An erroneous command should have exit status 127.
    # The erroneous command should be trapped by the ERR trap.
#erroneous_command

    #  A simple command exiting with a non-zero status should have exit status
    #+ <> 0, in this case 1.  The simple command is trapped by the ERR trap.
#false

    # Manually calling 'onexit'
#onexit

    # Manually calling 'onexit' with exit status
#onexit 5

    #  Killing a process via CTRL-C (signal 2/SIGINT) is handled via the SIGINT trap
    #  NOTE: `sleep' cannot be killed via `kill' plus 1/SIGHUP, 2/SIGINT, 3/SIGQUIT
    #+       or 15/SIGTERM.
#echo $$; sleep 20

    #  Killing a process via 1/SIGHUP, 2/SIGINT, 3/SIGQUIT or 15/SIGTERM is
    #+ handled via the respective trap.
    #  NOTE: Unfortunately, I haven't found a way to retrieve the signal number from
    #+       within the trap function.
echo $$; while true; do :; done

    # A syntax error is not trapped, but should have exit status 2
#fi

    # An unbound variable is not trapped, but should have exit status 1
    # thanks to 'set -u'
#echo $foo

     # Executing `false' within a function should exit with 1 because of `set -E'
#function foo() {
#    false
#    true
#} # foo()
#foo

echo End of script
   # Always call 'onexit' at end of script
onexit

See also

Bash: Err trap not reset
Solution for trap - ERR not resetting ERR trap.

Journal

20210114

Another caveat: exit (or an error-trap) executed within «process substitution» doesn’t end outer process. The script underneath keeps outputting «loop1»:

#!/bin/bash
# This script outputs "loop1" forever, while I hoped it would exit all while-loops
set -o pipefail
set -Eeu
 
while true; do
    echo loop1
    while read FOO; do
        echo loop2
        echo FOO: $FOO
    done < <( exit 1 )
done

The ‘< <()’ notation is called process substitution.

See also:

  • https://mywiki.wooledge.org/ProcessSubstitution
  • https://unix.stackexchange.com/questions/128560/how-do-i-capture-the-exit-code-handle-errors-correctly-when-using-process-subs
  • https://superuser.com/questions/696855/why-doesnt-a-bash-while-loop-exit-when-piping-to-terminated-subcommand

Workaround: Use «Here Strings» ([n]<<<word):

#!/bin/bash
# This script will exit correctly if building up $rows results in an error
 
set -Eeu
 
rows=$(exit 1)
while true; do
    echo loop1
    while read FOO; do
        echo loop2
        echo FOO: $FOO
    done <<< "$rows"
done

20060524

#!/bin/bash
#--- traptest.sh --------------------------------------------
# Example script for trapping bash errors.
# NOTE: Why doesn't this script catch syntax errors?

    # Exit on all errors
set -e
    # Trap exit
trap trap_exit_handler EXIT


    # Handle exit trap
function trap_exit_handler() {
        # Backup exit status if you're interested...
    local exit_status=$?
        # Change value of $?
    true
    echo $?
    #echo trap_handler $exit_status
} # trap_exit_handler()


    # An erroneous command will trigger a bash error and, because
    # of 'set -e', will 'exit 127' thus falling into the exit trap.
#erroneous_command
    # The same goes for a command with a false return status
#false

    # A manual exit will also fall into the exit trap
#exit 5

    # A syntax error isn't caught?
fi

    # Disable exit trap
trap - EXIT
exit 0

Normally, a syntax error exits with status 2, but when both ‘set -e’ and ‘trap EXIT’ are defined, my script exits with status 0. How can I have both ‘errexit’ and ‘trap EXIT’ enabled, *and* catch syntax errors via exit status? Here’s an example script (test.sh):

set -e
trap 'echo trapped: $?' EXIT
fi

$> bash test.sh; echo \$?: $?
test.sh: line 3: syntax error near unexpected token `fi'
trapped: 0
$?: 0

More trivia:

  • With the line ‘#set -e’ commented, bash traps 258 and returns an exit status of 2:
trapped: 258
$?: 2
  • With the line ‘#trap ‘echo trapped $?’ EXIT’ commented, bash returns an exit status of 2:
$?: 2
  • With a bogus function definition on top, bash returns an exit status of 2, but no exit trap is executed:
function foo() { foo=bar }
set -e
trap 'echo trapped: $?' EXIT
fi
fred@linux:~>bash test.sh; echo \$?: $?
test.sh: line 4: syntax error near unexpected token `fi'
test.sh: line 4: `fi'
$?: 2
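One way to catch syntax errors independently of `set -e' and the EXIT trap is to let bash parse the script without executing it, using the -n (noexec) option, before the real run. A sketch that recreates test.sh in a temporary file (the file and variable names are arbitrary):

```shell
#!/bin/bash
# Recreate the three-line test.sh from above in a temporary file.
script=$(mktemp)
printf '%s\n' 'set -e' "trap 'echo trapped: \$?' EXIT" 'fi' > "$script"

# bash -n only parses the script; nothing in it is executed.
syntax_status=0
bash -n "$script" 2>/dev/null || syntax_status=$?
rm -f "$script"

echo "bash -n exit status: $syntax_status"
```

A non-zero status from this check signals a syntax error regardless of which traps the script itself installs.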

20060525

Example of a ‘cleanup’ script

trap

Writing Robust Bash Shell Scripts

#!/bin/bash
#--- cleanup.sh ---------------------------------------------------------------
# Example script for trapping bash errors.
# NOTE: Use 'cleanexit [status]' instead of 'exit [status]'

    # Trap not-normal exit signals: 1/HUP, 2/INT, 3/QUIT, 15/TERM
    # @see catch_sig()
trap catch_sig 1 2 3 15
    # Trap errors (simple commands exiting with a non-zero status)
    # @see catch_err()
trap catch_err ERR


#--- cleanexit() --------------------------------------------------------------
#  Wrapper around 'exit' to cleanup on exit.
#  @param $1 integer  Exit status.  If $1 not defined, exit status of global
#+                    variable 'EXIT_STATUS' is used.  If neither $1 or
#+                    'EXIT_STATUS' defined, exit with status 0 (success).
function cleanexit() {
    echo "Exiting with ${1:-${EXIT_STATUS:-0}}"
    exit ${1:-${EXIT_STATUS:-0}}
} # cleanexit()


#--- catch_err() --------------------------------------------------------------
#  Catch ERR trap.
#  This traps simple commands exiting with a non-zero status.
#  See also: info bash | "Shell Builtin Commands" | "The Set Builtin" | "-e"
function catch_err() {
    local exit_status=$?
    echo "Inside catch_err"
    cleanexit $exit_status
} # catch_err()


#--- catch_sig() --------------------------------------------------------------
# Catch signal trap.
# Trap not-normal exit signals: 1/HUP, 2/INT, 3/QUIT, 15/TERM
# @NOTE1: Non-trapped signals are 0/EXIT, 9/KILL.
function catch_sig() {
    local exit_status=$?
    echo "Inside catch_sig"
    cleanexit $exit_status
} # catch_sig()


    # An erroneous command should have exit status 127.
    # The erroneous command should be trapped by the ERR trap.
#erroneous_command

    # A command returning false should have exit status <> 0
    # The false returning command should be trapped by the ERR trap.
#false

    # Manually calling 'cleanexit'
#cleanexit

    # Manually calling 'cleanexit' with exit status
#cleanexit 5

    # Killing a process via CTRL-C is handled via the SIGINT trap
#sleep 20

    # A syntax error is not trapped, but should have exit status 2
#fi

    # Always call 'cleanexit' at end of script
cleanexit
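
To see the ERR trap from cleanup.sh in action without editing the script, a reduced version can be run in a child shell. This is only a sketch: the wrapper `demo_err_trap` is a hypothetical name, and the handler is a shortened form of `catch_err()` that exits directly instead of calling `cleanexit`.

```shell
# demo_err_trap is a hypothetical wrapper; it runs the given command in a
# child bash that has an ERR trap installed, then reports the child's status.
demo_err_trap() {
    local status=0
    bash -c '
        catch_err() {
            local exit_status=$?   # status of the command that triggered ERR
            echo "Inside catch_err: status $exit_status"
            exit "$exit_status"
        }
        trap catch_err ERR
        "$@"                       # run the command passed by the caller
        echo "command succeeded"
    ' demo "$@" 2>/dev/null || status=$?
    echo "script exit status: $status"
}
```

Calling `demo_err_trap false` goes through the trap, while `demo_err_trap true` reaches the "command succeeded" line. A command that does not exist is trapped as well, with status 127.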


Error handling is an important part of any programming language. Bash does not offer richer error-handling facilities than other programming languages, but a script should still deal with errors gracefully when it is executed from the terminal. Error handling can be implemented in a Bash script in several ways, and this tutorial shows the different techniques.

Example 1: Error Handling Using a Conditional Statement

Create a Bash file with the following script that shows the use of the conditional statement for error handling. The first “if” statement checks the total number of command-line arguments and prints an error message if fewer than 2 are given. Next, the dividend and divisor values are taken from the command-line arguments. If the divisor is 0, the division fails and the error message is redirected to the error.txt file. The second “if” statement checks whether the error.txt file is empty. An error message is printed if the error.txt file is non-empty.

#!/bin/bash
#Check the argument values
if [ $# -lt 2 ]; then
   echo "One or more arguments are missing."
   exit 1
fi
#Read the dividend value from the first command-line argument
dividend=$1
#Read the divisor value from the second command-line argument
divisor=$2
#Divide the dividend by the divisor and save any bc error message in error.txt
result=$(echo "scale=2; $dividend/$divisor" | bc 2>error.txt)
#Read the content of the error file
content=$(cat error.txt)
if [ -n "$content" ]; then
  #Print the error message if the error.txt is non-empty
  echo "Division by zero error occurred."
else
  #Print the result
  echo "$dividend/$divisor = $result"
fi

Output:

The following output appears after executing the previous script without any argument:


The following output appears after executing the previous script with one argument value:


The following output appears after executing the previous script with two valid argument values:


The following output appears after executing the previous script with two argument values where the second argument is 0. The error message is printed:
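
As an alternative to inspecting error.txt afterwards, the divisor can be validated before the division is attempted. The following is a minimal sketch under stated assumptions: it uses shell integer arithmetic instead of bc, so it accepts only non-negative integers and does not reproduce the two-decimal output of the script above. The function names `is_uint` and `divide` are hypothetical.

```shell
# is_uint and divide are hypothetical helpers, not part of the script above.
# is_uint succeeds only for non-negative integers without a leading zero.
is_uint() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
        0|[1-9]*)    return 0 ;;   # zero, or digits not starting with 0
        *)           return 1 ;;   # leading zero such as 08
    esac
}

divide() {
    if [ "$#" -lt 2 ]; then
        echo "One or more arguments are missing." >&2
        return 2
    fi
    if ! is_uint "$1" || ! is_uint "$2"; then
        echo "Both arguments must be non-negative integers." >&2
        return 2
    fi
    if [ "$2" -eq 0 ]; then
        echo "Division by zero error occurred." >&2
        return 1
    fi
    # Integer division: the fractional part is discarded
    echo "$1/$2 = $(( $1 / $2 ))"
}
```

Here each failure mode gets its own return status, so a caller can tell a usage mistake (2) apart from a division by zero (1).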

Example 2: Error Handling Using the Exit Status Code

Create a Bash file with the following script that shows error handling based on the exit status code. A command name is read from the user and then executed. If the exit status code is not equal to zero, an error message is printed. Otherwise, a success message is printed.

#!/bin/bash

#Take a Linux command name
echo -n "Enter a command: "
read -r cmd_name
#Run the command
$cmd_name
#Check whether the command is valid or invalid
if [ $? -ne 0 ]; then
   echo "$cmd_name is an invalid command."
else
   echo "$cmd_name is a valid command."
fi

Output:

The following output appears after executing the previous script with the valid command. Here, the “date” is taken as the command in the input value that is valid:


The following output appears after executing the previous script for the invalid command. Here, the “cmd” is taken as the command in the input value that is invalid:
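
A non-zero exit status does not always mean that the command is invalid: the shell returns 127 when a command cannot be found, while other non-zero values come from a command that ran and failed. The sketch below separates these cases. The function name `run_and_report` is hypothetical, and the command's own output is discarded to keep the report short.

```shell
# run_and_report is a hypothetical helper, not part of the script above.
run_and_report() {
    local status=0
    # Run the command given as arguments and remember its exit status
    "$@" >/dev/null 2>&1 || status=$?
    case $status in
        0)   echo "$1 succeeded." ;;
        127) echo "$1 was not found." ;;
        *)   echo "$1 failed with status $status." ;;
    esac
    return "$status"
}
```

For example, `run_and_report date` reports success, while `run_and_report cmd` reports that the command was not found on a system that has no command with that name.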

Example 3: Stop the Execution on the First Error

Create a Bash file with the following script that shows how to stop the execution when the first error of the script appears. Two invalid commands are used in the following script, so two errors would be generated without “set -e”. Because “set -e” is enabled, the script stops at the first invalid command and the rest of the script is never executed.

#!/bin/bash
#Set the option to terminate the script on the first error
set -e
echo 'Current date and time: '
#Valid command
date
echo 'Current working Directory: '
#Invalid command
cwd
echo 'login username: '
#Valid command
whoami
echo 'List of files and folders: '
#Invalid command
list

Output:

The following output appears after executing the previous script. The script stops the execution after executing the invalid command which is “cwd”:
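
The effect of “set -e” can also be observed from a calling script by looking at the exit status of a child shell. This is a small sketch with a hypothetical function name, `demo_errexit`. The child shell is used only so that the demonstration does not terminate the shell that calls it.

```shell
# demo_errexit is a hypothetical helper, not part of the script above.
demo_errexit() {
    local status=0
    # The child shell stops at 'false', so "step 2" is never printed
    bash -c 'set -e; echo "step 1"; false; echo "step 2"' || status=$?
    echo "child exit status: $status"
}
```

Running `demo_errexit` prints “step 1” followed by the non-zero exit status of the child shell.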

Example 4: Stop the Execution for Uninitialized Variable

Create a Bash file with the following script that shows how to stop the execution of the script when an uninitialized variable is used. The username and password values are taken from the command-line arguments. If either argument is missing, Bash reports an unbound variable and the script stops. If both variables are initialized, the script checks whether the username and password are valid.

#!/bin/bash
#Set the option to terminate the script for an uninitialized variable
set -u
#Set the first command-line argument value to the username
username=$1
#Set the second command-line argument value to the password
password=$2
#Check the username and password are valid or invalid
if [[ $username == 'admin' && $password == 'hidenseek' ]]; then
    echo "Valid user."
else
    echo "Invalid user."
fi

Output:

The following output appears if the script is executed without using any command-line argument value. The script stops the execution after getting the first uninitialized variable:


The following output appears if the script is executed with one command-line argument value. The script stops the execution after getting the second uninitialized variable:


The following output appears if the script is executed with two command-line argument values – “admin” and “hide”. Here, the username is valid but the password is invalid. So, the “Invalid user” message is printed:


The following output appears if the script is executed with two command-line argument values – “admin” and “hidenseek”. Here, the username and password are valid. So, the “Valid user” message is printed:
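
When a more descriptive message than the default “unbound variable” error is wanted, the `${parameter:?message}` expansion is one option: it stops the shell with the given message if the parameter is unset or empty. The sketch below wraps the same check in a hypothetical function, `check_login`, whose body runs in a subshell so that a missing argument ends only the subshell and not the caller.

```shell
# check_login is a hypothetical helper; the parentheses make its body a subshell.
check_login() (
    # Abort this subshell with a message if an argument is unset or empty
    username=${1:?username is required}
    password=${2:?password is required}
    if [ "$username" = "admin" ] && [ "$password" = "hidenseek" ]; then
        echo "Valid user."
    else
        echo "Invalid user."
    fi
)
```

With both arguments present the function behaves like the script above. With a missing argument it prints the custom message to standard error and returns a non-zero status.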

Conclusion

The different ways to handle the errors in the Bash script are shown in this tutorial using multiple examples. We hope that this will help the Bash users to implement the error-handling feature in their Bash script.

