In this article, I present a few tricks for handling error conditions. Some of them don't strictly fall under the category of error handling (a reactive way to deal with the unexpected); others are techniques to avoid errors before they happen.
Case study: Simple script that downloads a hardware report from multiple hosts and inserts it into a database.
Say that you have a cron job on each of your Linux systems, and you have a script to collect the hardware information from each:
#!/bin/bash
# Script to collect the status of lshw output from home servers
# Dependencies:
# * LSHW: http://ezix.org/project/wiki/HardwareLiSter
# * JQ: http://stedolan.github.io/jq/
#
# On each machine you can run something like this from cron (Don't know CRON, no worries: https://crontab-generator.org/)
# 0 0 * * * /usr/sbin/lshw -json -quiet > /var/log/lshw-dump.json
# Author: Jose Vicente Nunez
#
declare -a servers=(
  dmaf5
)
DATADIR="$HOME/Documents/lshw-dump"
/usr/bin/mkdir -p -v "$DATADIR"
for server in "${servers[@]}"; do
  echo "Visiting: $server"
  # Copy each report in the background so all servers are contacted in parallel
  /usr/bin/scp -o logLevel=Error "${server}:/var/log/lshw-dump.json" "${DATADIR}/lshw-$server-dump.json" &
done
wait
for lshw in $(/usr/bin/find "$DATADIR" -type f -name 'lshw-*-dump.json'); do
  /usr/bin/jq '.["product","vendor", "configuration"]' "$lshw"
done
If everything goes well, you collect your files in parallel (you have fewer than ten systems, so you can afford to ssh to all of them at the same time) and then show the hardware details of each one:
Visiting: dmaf5
lshw-dump.json 100% 54KB 136.9MB/s 00:00
"DMAF5 (Default string)"
"BESSTAR TECH LIMITED"
{
"boot": "normal",
"chassis": "desktop",
"family": "Default string",
"sku": "Default string",
"uuid": "00020003-0004-0005-0006-000700080009"
}
Here are some reasons why things could go wrong:
- Your report didn’t run because the server was down
- You couldn’t create the directory where the files need to be saved
- The tools you need to run the script are missing
- You can’t collect the report because your remote machine crashed
- One or more of the reports is corrupt
The current version of the script has a problem: it runs from beginning to end, errors or not:
./collect_data_from_servers.sh
Visiting: macmini2
Visiting: mac-pro-1-1
Visiting: dmaf5
lshw-dump.json 100% 54KB 48.8MB/s 00:00
scp: /var/log/lshw-dump.json: No such file or directory
scp: /var/log/lshw-dump.json: No such file or directory
parse error: Expected separator between values at line 3, column 9
Next, I demonstrate a few things you can do to make your script more robust and, in some cases, recover from failure.
The nuclear option: Failing hard, failing fast
The proper way to handle errors is to check whether each program finished successfully, using return codes. It sounds obvious, but return codes (the integer stored in bash's $? variable) sometimes have a broader meaning. Don't confuse $? with $!, which holds the PID of the most recent background job. The bash man page tells you:
For the shell’s purposes, a command which exits with a zero exit
status has succeeded. An exit status of zero indicates success.
A non-zero exit status indicates failure. When a command
terminates on a fatal signal N, bash uses the value of 128+N as
the exit status.
As usual, read the man page of each command you call to learn its exit-code conventions. If you've programmed in a language like Java or Python, you're most likely familiar with exceptions, their different meanings, and how not all of them are handled the same way.
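You can see these conventions in action with the following sketch, which is not part of the author's script. It starts a background sleep, saves its PID from $!, terminates it with SIGTERM (signal 15), and then reads the status that wait reports in $?. According to the man page rule above, that status should be 128+15:

```shell
#!/bin/bash
# Sketch: $! holds the PID of the last background job; $? holds the last exit status.
sleep 30 &                   # hypothetical long-running job
job_pid=$!                   # remember the PID so we can wait on (or kill) it later

kill -TERM "$job_pid"        # SIGTERM is signal 15
wait "$job_pid"              # wait returns the job's exit status...
status=$?                    # ...which we read from $?

echo "status=$status"        # 128 + 15 = 143 for a job killed by SIGTERM
if (( status > 128 )); then
  echo "terminated by signal $((status - 128))"
fi
```

Bash may also print a "Terminated" notice on stderr when it reaps the killed job.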
If you add set -o errexit to your script, bash aborts execution from that point forward whenever a command exits with a code != 0. However, errexit is ignored while a function runs inside an if condition. Instead of keeping track of that exception, I prefer explicit error handling.
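The errexit exception is easy to demonstrate. In this minimal sketch, check_report is a hypothetical function whose first command fails. Because the function is called as an if condition, errexit is suspended and execution continues past the failure:

```shell
#!/bin/bash
# Sketch: errexit is suspended while a function runs as an "if" condition.
set -o errexit

check_report() {
  false                          # under errexit this would normally abort the script...
  echo "still running after false"
}

# ...but check_report is the condition of an if, so the failure inside it is ignored.
# The function's status is that of its last command (the echo), so the test succeeds.
if check_report; then
  echo "check_report returned success"
fi
```

If you call check_report on a line of its own instead, the same false aborts the script. That context-dependent behavior is why explicit checks are easier to reason about.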
Take a look at version two of the script. It’s slightly better:
1 #!/bin/bash
2 # Script to collect the status of lshw output from home servers
3 # Dependencies:
4 # * LSHW: http://ezix.org/project/wiki/HardwareLiSter
5 # * JQ: http://stedolan.github.io/jq/
6 #
7 # On each machine you can run something like this from cron (Don't know CRON, no worries: https://crontab-generator.org/ )
8 # 0 0 * * * /usr/sbin/lshw -json -quiet > /var/log/lshw-dump.json
9 # Author: Jose Vicente Nunez
10 #
11 set -o errtrace # Enable the err trap, code will get called when an error is detected
12 trap "echo ERROR: There was an error in ${FUNCNAME-main context}, details to follow" ERR
13 declare -a servers=(
14   macmini2
15   mac-pro-1-1
16   dmaf5
17 )
18
19 DATADIR="$HOME/Documents/lshw-dump"
20 if [ ! -d "$DATADIR" ]; then
21   /usr/bin/mkdir -p -v "$DATADIR" || { echo "FATAL: Failed to create $DATADIR"; exit 100; }
22 fi
23 declare -A server_pid
24 for server in "${servers[@]}"; do
25   echo "Visiting: $server"
26   /usr/bin/scp -o logLevel=Error "${server}:/var/log/lshw-dump.json" "${DATADIR}/lshw-$server-dump.json" &
27   server_pid[$server]=$! # Save the PID of the scp of a given server for later
28 done
29 # Iterate through all the servers and:
30 # Wait for the return code of each
31 # Check the exit code from each scp
32 for server in "${!server_pid[@]}"; do
33   wait "${server_pid[$server]}"
34   test $? -ne 0 && echo "ERROR: Copy from $server had problems, will not continue" && exit 100
35 done
36 for lshw in $(/usr/bin/find "$DATADIR" -type f -name 'lshw-*-dump.json'); do
37   /usr/bin/jq '.["product","vendor", "configuration"]' "$lshw" || { echo "ERROR parsing '$lshw'"; exit 100; }
38 done
Here's what changed:
- Lines 11 and 12: I enable error tracing and add a 'trap' to tell the user there was an error and there is turbulence ahead. You may want to kill your script here instead; I'll show you why that may not be the best idea.
- Line 20: if the directory doesn't exist, the script tries to create it on line 21. If directory creation fails, it exits with an error.
- Line 27: after starting each background job, I capture its PID and associate it with the machine (1:1 relationship).
- Lines 33-35: I wait for each scp task to finish, get its return code, and abort if it's an error.
- Line 37: I check that each file can be parsed; otherwise, I exit with an error.
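You can try the PID bookkeeping from lines 23-35 without any remote hosts. In this sketch, fake_copy is a hypothetical stand-in for scp that fails for a host named badhost. Unlike version two, it reports every failed job instead of exiting at the first one:

```shell
#!/bin/bash
# Sketch: record the PID of every background job and check each exit status with wait.
# fake_copy is a hypothetical stand-in for scp; it fails for any host named "badhost".
fake_copy() {
  local host=$1
  sleep 0.2
  [[ $host != badhost ]]
}

declare -A job_pid
for host in goodhost badhost; do
  fake_copy "$host" &
  job_pid[$host]=$!              # one PID per host
done

failed=0
for host in "${!job_pid[@]}"; do
  if wait "${job_pid[$host]}"; then
    echo "OK: copy from $host"
  else
    echo "ERROR: copy from $host failed with status $?"   # $? is the status from wait
    failed=$((failed + 1))
  fi
done
echo "failed=$failed"
```

The order of the OK/ERROR lines may vary, because bash does not guarantee an iteration order for associative arrays.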
So how does the error handling look now?
Visiting: macmini2
Visiting: mac-pro-1-1
Visiting: dmaf5
lshw-dump.json 100% 54KB 146.1MB/s 00:00
scp: /var/log/lshw-dump.json: No such file or directory
ERROR: There was an error in main context, details to follow
ERROR: Copy from mac-pro-1-1 had problems, will not continue
scp: /var/log/lshw-dump.json: No such file or directory
As you can see, this version is better at detecting errors but it’s very unforgiving. Also, it doesn’t detect all the errors, does it?
When you get stuck and you wish you had an alarm
The code looks better, except that scp can sometimes get stuck on a server while copying a file, because the server is too busy to respond or is in a bad state.
Another example is accessing a directory through NFS, when $HOME is mounted from an NFS server:
/usr/bin/find $HOME -type f -name '*.csv' -print -fprint /tmp/report.txt
And you discover hours later that the NFS mount point is stale and your script is stuck.
A timeout is the solution, and GNU timeout comes to the rescue:
/usr/bin/timeout --kill-after 20.0s 10.0s /usr/bin/find $HOME -type f -name '*.csv' -print -fprint /tmp/report.txt
Here you ask timeout to stop the process nicely (TERM signal) 10.0 seconds after it starts. If the process is still running 20.0 seconds after that (30 seconds after it started), timeout sends a KILL signal (kill -9). If in doubt, check which signals your system supports (kill -l, for example).
If that isn't clear, try the following example. The TERM signal at 20 seconds is enough to stop sleep, so the kill-after delay never comes into play:
/usr/bin/time /usr/bin/timeout --kill-after=10.0s 20.0s /usr/bin/sleep 60s
real 0m20.003s
user 0m0.000s
sys 0m0.003s
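A script may need to react differently to a timeout than to an ordinary failure. The exit status of GNU timeout tells you which one happened:
- 124 when the TERM signal had to be sent.
- 137 (128+9) when KILL was needed as well.
- Otherwise, the command's own status.

The helper name run_with_timeout and the 1-second limit below are illustrative assumptions:

```shell
#!/bin/bash
# Sketch: distinguish a timeout from an ordinary failure using GNU timeout's exit status.
# run_with_timeout is a hypothetical helper; the limits are illustrative.
run_with_timeout() {
  local status=0
  timeout --kill-after=5s 1s "$@" || status=$?
  case $status in
    0)   echo "ok" ;;
    124) echo "timed out" ;;           # TERM was sent after 1s
    137) echo "killed" ;;              # KILL was sent 5s later (128 + 9)
    *)   echo "failed with status $status" ;;
  esac
  return "$status"
}

run_with_timeout sleep 5    # prints: timed out
run_with_timeout false      # prints: failed with status 1
run_with_timeout true       # prints: ok
```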
Going back to the original script and adding a few more options gives you version three:
1 #!/bin/bash
2 # Script to collect the status of lshw output from home servers
3 # Dependencies:
4 # * Open SSH: http://www.openssh.com/portable.html
5 # * LSHW: http://ezix.org/project/wiki/HardwareLiSter
6 # * JQ: http://stedolan.github.io/jq/
7 # * timeout: https://www.gnu.org/software/coreutils/
8 #
9 # On each machine you can run something like this from cron (Don't know CRON, no worries: https://crontab-generator.org/)
10 # 0 0 * * * /usr/sbin/lshw -json -quiet > /var/log/lshw-dump.json
11 # Author: Jose Vicente Nunez
12 #
13 set -o errtrace # Enable the err trap, code will get called when an error is detected
14 trap "echo ERROR: There was an error in ${FUNCNAME-main context}, details to follow" ERR
15
16 declare -a dependencies=(/usr/bin/timeout /usr/bin/ssh /usr/bin/scp /usr/bin/jq)
17 for dependency in "${dependencies[@]}"; do
18   if [ ! -x "$dependency" ]; then
19     echo "ERROR: Missing $dependency"
20     exit 100
21   fi
22 done
23
24 declare -a servers=(
25   macmini2
26   mac-pro-1-1
27   dmaf5
28 )
29
30 function remote_copy {
31   local server=$1
32   echo "Visiting: $server"
33   /usr/bin/timeout --kill-after 25.0s 20.0s \
34     /usr/bin/scp \
35       -o BatchMode=yes \
36       -o logLevel=Error \
37       -o ConnectTimeout=5 \
38       -o ConnectionAttempts=3 \
39       "${server}:/var/log/lshw-dump.json" "${DATADIR}/lshw-$server-dump.json"
40   return $?
41 }
42
43 DATADIR="$HOME/Documents/lshw-dump"
44 if [ ! -d "$DATADIR" ]; then
45   /usr/bin/mkdir -p -v "$DATADIR" || { echo "FATAL: Failed to create $DATADIR"; exit 100; }
46 fi
47 declare -A server_pid
48 for server in "${servers[@]}"; do
49   remote_copy "$server" &
50   server_pid[$server]=$! # Save the PID of the scp of a given server for later
51 done
52 # Iterate through all the servers and:
53 # Wait for the return code of each
54 # Check the exit code from each scp
55 for server in "${!server_pid[@]}"; do
56   wait "${server_pid[$server]}"
57   test $? -ne 0 && echo "ERROR: Copy from $server had problems, will not continue" && exit 100
58 done
59 for lshw in $(/usr/bin/find "$DATADIR" -type f -name 'lshw-*-dump.json'); do
60   /usr/bin/jq '.["product","vendor", "configuration"]' "$lshw"
61 done
What changed?
- Lines 16-22: check that all the required tools are present. If any of them can't be executed, then 'Houston, we have a problem.'
- I created a remote_copy function. It uses a timeout to make sure the scp finishes no later than 45.0 seconds (line 33): a TERM signal after 20 seconds, then a KILL signal 25 seconds later if needed.
- Line 37: added a connection timeout of 5 seconds instead of the TCP default.
- Line 38: added a retry to scp, with 3 attempts that wait 1 second between each.
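Lines 16-22 hard-code the path of each tool. If your tools live in different directories on different machines, you could look them up on PATH with the command -v builtin instead. This sketch uses a hypothetical check_dependencies helper. It reports every missing tool before failing, instead of stopping at the first:

```shell
#!/bin/bash
# Sketch: check that required tools are available on PATH using "command -v",
# instead of hard-coding paths such as /usr/bin/jq. check_dependencies is hypothetical.
check_dependencies() {
  local dependency missing=0
  for dependency in "$@"; do
    if ! command -v "$dependency" > /dev/null 2>&1; then
      echo "ERROR: Missing $dependency" >&2
      missing=1
    fi
  done
  return "$missing"
}

# In the collection script you could call it like this:
#   check_dependencies timeout scp jq || exit 100
```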
There are other ways to retry when there's an error.
Waiting for the end of the world: how and when to retry
You may have noticed the retry added to the scp command. But that only retries failed connections. What if the command fails in the middle of the copy?
Sometimes you just want to fail, because there's very little chance of recovering from the issue (a system that needs a hardware fix, for example). Other times you can fall back to a degraded mode, meaning that your system keeps working without the updated data. In those cases, it makes no sense to wait forever, only for a specific amount of time.
To keep this brief, here are only the changes to the remote_copy function (version four):
#!/bin/bash
# Omitted code for clarity...
declare REMOTE_FILE="/var/log/lshw-dump.json"
declare MAX_RETRIES=3
# Blah blah blah...
function remote_copy {
  local server=$1
  local retries=$2
  local now=1
  local status=0
  local sleep_time
  while [ "$now" -le "$retries" ]; do
    echo "INFO: Trying to copy file from: $server, attempt=$now"
    /usr/bin/timeout --kill-after 25.0s 20.0s \
      /usr/bin/scp \
        -o BatchMode=yes \
        -o logLevel=Error \
        -o ConnectTimeout=5 \
        -o ConnectionAttempts=3 \
        "${server}:$REMOTE_FILE" "${DATADIR}/lshw-$server-dump.json"
    status=$?
    if [ "$status" -ne 0 ]; then
      sleep_time=$(( (RANDOM % 60) + 1 ))   # random wait between 1 and 60 seconds
      echo "WARNING: Copy failed for $server:$REMOTE_FILE. Waiting '${sleep_time} seconds' before re-trying..."
      /usr/bin/sleep "${sleep_time}s"
    else
      break # All good, no point on waiting...
    fi
    ((now=now+1))
  done
  return "$status"
}
DATADIR="$HOME/Documents/lshw-dump"
if [ ! -d "$DATADIR" ]; then
  /usr/bin/mkdir -p -v "$DATADIR" || { echo "FATAL: Failed to create $DATADIR"; exit 100; }
fi
declare -A server_pid
for server in "${servers[@]}"; do
  remote_copy "$server" "$MAX_RETRIES" &
  server_pid[$server]=$! # Save the PID of the scp of a given server for later
done
# Iterate through all the servers and:
# Wait for the return code of each
# Check the exit code from each scp
for server in "${!server_pid[@]}"; do
  wait "${server_pid[$server]}"
  test $? -ne 0 && echo "ERROR: Copy from $server had problems, will not continue" && exit 100
done
# Blah blah blah, process the files you just copied...
How does it look now? In this run, I have one system down (mac-pro-1-1) and one system without the file (macmini2). You can see that the copy from server dmaf5 works right away, but for the other two, there’s a retry for a random time between 1 and 60 seconds before exiting:
INFO: Trying to copy file from: macmini2, attempt=1
INFO: Trying to copy file from: mac-pro-1-1, attempt=1
INFO: Trying to copy file from: dmaf5, attempt=1
scp: /var/log/lshw-dump.json: No such file or directory
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for macmini2:/var/log/lshw-dump.json. Waiting '60 seconds' before re-trying...
ssh: connect to host mac-pro-1-1 port 22: No route to host
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for mac-pro-1-1:/var/log/lshw-dump.json. Waiting '32 seconds' before re-trying...
INFO: Trying to copy file from: mac-pro-1-1, attempt=2
ssh: connect to host mac-pro-1-1 port 22: No route to host
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for mac-pro-1-1:/var/log/lshw-dump.json. Waiting '18 seconds' before re-trying...
INFO: Trying to copy file from: macmini2, attempt=2
scp: /var/log/lshw-dump.json: No such file or directory
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for macmini2:/var/log/lshw-dump.json. Waiting '3 seconds' before re-trying...
INFO: Trying to copy file from: macmini2, attempt=3
scp: /var/log/lshw-dump.json: No such file or directory
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for macmini2:/var/log/lshw-dump.json. Waiting '6 seconds' before re-trying...
INFO: Trying to copy file from: mac-pro-1-1, attempt=3
ssh: connect to host mac-pro-1-1 port 22: No route to host
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for mac-pro-1-1:/var/log/lshw-dump.json. Waiting '47 seconds' before re-trying...
ERROR: There was an error in main context, details to follow
ERROR: Copy from mac-pro-1-1 had problems, will not continue
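You can also factor the retry loop inside remote_copy out into a generic helper. This sketch differs from the script in two ways:
- It replaces the random 1-60 second wait with exponential backoff, which is a different technique from the one used in the script.
- It does not sleep after the final attempt.

The name retry and its arguments are hypothetical:

```shell
#!/bin/bash
# Sketch: a generic retry helper with a fixed number of attempts and exponential backoff.
# retry is a hypothetical helper. Usage: retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
retry() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=1 status
  while true; do
    "$@" && return 0
    status=$?                          # status of the failed command
    if (( attempt >= max_attempts )); then
      echo "ERROR: '$*' failed after $attempt attempts (status $status)" >&2
      return "$status"                 # give up without a final, pointless sleep
    fi
    echo "WARNING: attempt $attempt of '$*' failed, retrying in ${delay}s..." >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))               # e.g. 1s, 2s, 4s, ... between attempts
  done
}

# Example, with the host and paths used in the article's script:
#   retry 3 1 scp -o BatchMode=yes dmaf5:/var/log/lshw-dump.json "$DATADIR/lshw-dmaf5-dump.json"
```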
If I fail, do I have to do this all over again? Using a checkpoint
Suppose that the remote copy is the most expensive operation of this whole script, and that you're willing or able to re-run the script twice a day (maybe using cron, maybe by hand) to make sure you pick up the files if one or more systems are down.
You could, for the day, create a small ‘status cache’, where you record only the successful processing operations per machine. If a system is in there, then don’t bother to check again for that day.
Some programs, like Ansible, do something similar and allow you to retry a playbook on a limited number of machines after a failure (--limit @/home/user/site.retry).
A new version (version five) of the script has code to record the status of the copy (lines 15-33):
15 SCRIPT_NAME=$(/usr/bin/basename "${BASH_SOURCE[0]}") || exit 100
16 YYYYMMDD=$(/usr/bin/date +%Y%m%d) || exit 100
17 declare CACHE_DIR="/tmp/$SCRIPT_NAME/$YYYYMMDD"
18 # Logic to clean up the cache dir on daily basis is not shown here
19 if [ ! -d "$CACHE_DIR" ]; then
20   /usr/bin/mkdir -p -v "$CACHE_DIR" || exit 100
21 fi
22 trap '/bin/rm -rf "$CACHE_DIR"; exit 1' INT TERM
23
24 function check_previous_run {
25   local machine=$1
26   test -f "$CACHE_DIR/$machine"   # 0 if this machine was already processed today
27 }
28
29 function mark_previous_run {
30   local machine=$1
31   /usr/bin/touch "$CACHE_DIR/$machine"
32   return $?
33 }
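Did you notice the trap on line 22? If the script is interrupted (INT, for example with Ctrl+C) or terminated (TERM), I want to make sure the whole cache is invalidated. Note that KILL (kill -9) cannot be trapped, so a script killed that way leaves its cache behind.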
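The same trap idea applies to any temporary state a script creates. In this sketch, the hypothetical run_in_scratch_dir helper runs a command inside a throwaway directory in a subshell. The EXIT trap removes the directory. The INT and TERM traps call exit, so the EXIT trap also runs when the script is interrupted. As noted above, nothing can run on KILL:

```shell
#!/bin/bash
# Sketch: clean up a scratch directory on normal exit and on INT/TERM.
# run_in_scratch_dir is a hypothetical helper; SIGKILL still cannot be trapped.
run_in_scratch_dir() {
  (
    scratch=$(mktemp -d) || exit 100
    trap 'rm -rf -- "$scratch"' EXIT     # runs whenever this subshell exits
    trap 'exit 130' INT                  # turn Ctrl+C into a normal exit so EXIT fires
    trap 'exit 143' TERM
    cd "$scratch" || exit 100
    "$@"
  )
}

run_in_scratch_dir pwd    # prints the scratch path, which is removed right afterwards
```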
Then add this new helper logic to the remote_copy function (lines 52-81):
52 function remote_copy {
53   local server=$1
54   check_previous_run "$server"
55   test $? -eq 0 && echo "INFO: $1 ran successfully before. Not doing again" && return 0
56   local retries=$2
57   local now=1
58   local status=0
59   while [ "$now" -le "$retries" ]; do
60     echo "INFO: Trying to copy file from: $server, attempt=$now"
61     /usr/bin/timeout --kill-after 25.0s 20.0s \
62       /usr/bin/scp \
63         -o BatchMode=yes \
64         -o logLevel=Error \
65         -o ConnectTimeout=5 \
66         -o ConnectionAttempts=3 \
67         "${server}:$REMOTE_FILE" "${DATADIR}/lshw-$server-dump.json"
68     status=$?
69     if [ "$status" -ne 0 ]; then
70       local sleep_time=$(( (RANDOM % 60) + 1 ))
71       echo "WARNING: Copy failed for $server:$REMOTE_FILE. Waiting '${sleep_time} seconds' before re-trying..."
72       /usr/bin/sleep "${sleep_time}s"
73     else
74       break # All good, no point on waiting...
75     fi
76     ((now=now+1))
77   done
78   test "$status" -eq 0 && mark_previous_run "$server"   # record success in today's cache
79   test $? -ne 0 && status=1
80   return "$status"
81 }
The first time the script runs, it prints a new message about the cache directory:
./collect_data_from_servers.v5.sh
/usr/bin/mkdir: created directory '/tmp/collect_data_from_servers.v5.sh'
/usr/bin/mkdir: created directory '/tmp/collect_data_from_servers.v5.sh/20210612'
ERROR: There was an error in main context, details to follow
INFO: Trying to copy file from: macmini2, attempt=1
ERROR: There was an error in main context, details to follow
If you run it again, the script knows that dmaf5 is good to go, so there's no need to retry the copy:
./collect_data_from_servers.v5.sh
INFO: dmaf5 ran successfully before. Not doing again
ERROR: There was an error in main context, details to follow
INFO: Trying to copy file from: macmini2, attempt=1
ERROR: There was an error in main context, details to follow
INFO: Trying to copy file from: mac-pro-1-1, attempt=1
Imagine how this speeds up when you have more machines that should not be revisited.
Leaving crumbs behind: What to log, how to log, and verbose output
If you're like me, you like a bit of context to correlate with when something goes wrong. The echo statements in the script are nice, but what if you could add a timestamp to them?
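For the timestamp alone, bash 4.2 or later can format the current time itself with printf's %(...)T conversion, so no external date call is needed. Here is a minimal sketch with a hypothetical log function:

```shell
#!/bin/bash
# Sketch: prefix each message with a timestamp using bash's built-in printf %(...)T.
# "log" is a hypothetical helper; the argument -1 means "the current time".
log() {
  local level=$1
  shift
  printf '%(%Y-%m-%d %H:%M:%S)T %s: %s\n' -1 "$level" "$*"
}

log INFO "Trying to copy file from: dmaf5, attempt=1"
```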
If you use logger, you can save the output in the systemd journal and review it later with journalctl (you can even aggregate it with other tools out there). The best part is that you get the power of journalctl right away.
So instead of just using echo, you can also add a call to logger in a new bash function called 'message', like this:
SCRIPT_NAME=$(/usr/bin/basename "${BASH_SOURCE[0]}") || exit 100
FULL_PATH=$(/usr/bin/realpath "${BASH_SOURCE[0]}") || exit 100
set -o errtrace # Enable the err trap, code will get called when an error is detected
trap "echo ERROR: There was an error in ${FUNCNAME[0]-main context}, details to follow" ERR
declare CACHE_DIR="/tmp/$SCRIPT_NAME/$YYYYMMDD"
function message {
  local message="$1"
  local func_name="${2-unknown}"
  local priority=6
  if [ -z "$2" ]; then
    echo "INFO: $message"
  else
    echo "ERROR: $message"
    priority=0
  fi
  # Send structured fields to the journal; the here-document lines must not be indented
  /usr/bin/logger --journald <<EOF
MESSAGE_ID=$SCRIPT_NAME
MESSAGE=$message
PRIORITY=$priority
CODE_FILE=$FULL_PATH
CODE_FUNC=$func_name
EOF
}
As you can see, you can store separate fields as part of the message, such as the priority and the script that produced the message.
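So how is this useful? Well, you can get only the messages logged between 1:26 PM and 1:27 PM, only errors (priority=0), and only for our script (collect_data_from_servers.v6.sh), in JSON format. Strictly speaking, priority 0 is "emerg" in syslog terms, while "err" is 3, so you may prefer 3 for ordinary errors. The query looks like this: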
journalctl --since 13:26 --until 13:27 --output json-pretty PRIORITY=0 MESSAGE_ID=collect_data_from_servers.v6.sh
{
"_BOOT_ID" : "dfcda9a1a1cd406ebd88a339bec96fb6",
"_AUDIT_LOGINUID" : "1000",
"SYSLOG_IDENTIFIER" : "logger",
"PRIORITY" : "0",
"_TRANSPORT" : "journal",
"_SELINUX_CONTEXT" : "unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023",
"__REALTIME_TIMESTAMP" : "1623518797641880",
"_AUDIT_SESSION" : "3",
"_GID" : "1000",
"MESSAGE_ID" : "collect_data_from_servers.v6.sh",
"MESSAGE" : "Copy failed for macmini2:/var/log/lshw-dump.json. Waiting '45 seconds' before re-trying...",
"_CAP_EFFECTIVE" : "0",
"CODE_FUNC" : "remote_copy",
"_MACHINE_ID" : "60d7a3f69b674aaebb600c0e82e01d05",
"_COMM" : "logger",
"CODE_FILE" : "/home/josevnz/BashError/collect_data_from_servers.v6.sh",
"_PID" : "41832",
"__MONOTONIC_TIMESTAMP" : "25928272252",
"_HOSTNAME" : "dmaf5",
"_SOURCE_REALTIME_TIMESTAMP" : "1623518797641843",
"__CURSOR" : "s=97bb6295795a4560ad6fdedd8143df97;i=1f826;b=dfcda9a1a1cd406ebd88a339bec96fb6;m=60972097c;t=5c494ed383898;x=921c71966b8943e3",
"_UID" : "1000"
}
Because this is structured data, log collectors can go through all your machines and aggregate your script logs, so you end up with information, not just data.
You can take a look at the whole version six of the script.
Don’t be so eager to replace your data until you’ve checked it.
You may have noticed that, from the very beginning, I've been copying a corrupted JSON file over and over:
Parse error: Expected separator between values at line 4, column 11
ERROR parsing '/home/josevnz/Documents/lshw-dump/lshw-dmaf5-dump.json'
That's easy to prevent. Copy the file to a temporary location, and if the file is corrupted, don't replace the previous version; leave the bad one for inspection. This is lines 99-107 of version seven of the script:
function remote_copy {
  local server=$1
  check_previous_run "$server"
  test $? -eq 0 && message "$1 ran successfully before. Not doing again" && return 0
  local retries=$2
  local now=1
  local status=0
  local tmp_file="${DATADIR}/lshw-$server-dump.json.$$"   # temporary copy, validated before replacing
  while [ "$now" -le "$retries" ]; do
    message "Trying to copy file from: $server, attempt=$now"
    /usr/bin/timeout --kill-after 25.0s 20.0s \
      /usr/bin/scp \
        -o BatchMode=yes \
        -o logLevel=Error \
        -o ConnectTimeout=5 \
        -o ConnectionAttempts=3 \
        "${server}:$REMOTE_FILE" "$tmp_file"
    status=$?
    if [ "$status" -ne 0 ]; then
      local sleep_time=$(( (RANDOM % 60) + 1 ))
      message "Copy failed for $server:$REMOTE_FILE. Waiting '${sleep_time} seconds' before re-trying..." "${FUNCNAME[0]}"
      /usr/bin/sleep "${sleep_time}s"
    else
      break # All good, no point on waiting...
    fi
    ((now=now+1))
  done
  if [ "$status" -eq 0 ]; then
    # Only replace the previous report if the new copy is valid JSON
    /usr/bin/jq '.' "$tmp_file" > /dev/null 2>&1
    status=$?
    if [ "$status" -eq 0 ]; then
      /usr/bin/mv -v -f "$tmp_file" "${DATADIR}/lshw-$server-dump.json" && mark_previous_run "$server"
      test $? -ne 0 && status=1
    else
      message "$tmp_file is corrupted. Leaving for inspection..." "${FUNCNAME[0]}"
    fi
  fi
  return "$status"
}
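The validate-then-replace pattern above does not depend on jq. Here it is as a small reusable sketch. replace_if_valid is a hypothetical helper that takes the name of any validation command. If you keep the candidate in the same directory as the target, mv is a simple rename on the same file system:

```shell
#!/bin/bash
# Sketch: replace the previous copy of a file only after the new copy passes a check.
# replace_if_valid is hypothetical; the validator is any command that takes a file name.
# With jq, a validator could be: validate_json() { jq '.' "$1" > /dev/null 2>&1; }
replace_if_valid() {
  local validator=$1 candidate=$2 target=$3
  if "$validator" "$candidate"; then
    mv -f -- "$candidate" "$target"
  else
    echo "ERROR: $candidate failed validation, leaving it for inspection" >&2
    return 1
  fi
}
```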
Choose the right tools for the task and prep your code from the first line
One very important aspect of error handling is proper coding. If your code has bad logic, no amount of error handling will make it better. To keep this short and bash-related, here are a few hints.
You should ALWAYS check for syntax errors before running your script:
bash -n my_bash_script.sh
Seriously. It should be as automatic as performing any other test.
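To make the syntax check automatic, you could run bash -n over every script in a repository, for example from a pre-commit hook. Here is a sketch, with syntax_check_all as a hypothetical helper name:

```shell
#!/bin/bash
# Sketch: run "bash -n" on every *.sh file under a directory and report all failures.
# syntax_check_all is a hypothetical helper name.
syntax_check_all() {
  local dir=${1:-.} script failed=0
  while IFS= read -r -d '' script; do
    if ! bash -n "$script"; then
      echo "SYNTAX ERROR: $script" >&2
      failed=1
    fi
  done < <(find "$dir" -type f -name '*.sh' -print0)   # NUL-separated, safe for any file name
  return "$failed"
}
```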
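Read the bash man page and get familiar with must-know options, such as -x (print each command, after expansion, before it runs) and -v (print shell input lines as they are read). You can enable them around just the part of the script you are debugging: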
set -xv
my_complicated_instruction1
my_complicated_instruction2
my_complicated_instruction3
set +xv
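The -x trace is easier to follow if each line shows where it came from. Bash expands the PS4 variable before every traced command, so a sketch like this one prefixes each trace line with the file name and line number:

```shell
#!/bin/bash
# Sketch: include the source file and line number in every "set -x" trace line.
# Single quotes delay the expansion of PS4 until each traced command runs.
PS4='+ ${BASH_SOURCE[0]}:${LINENO}: '

set -x
servers=(dmaf5 macmini2)
echo "Servers: ${#servers[@]}"
set +x
```

The trace lines go to stderr, next to the script's normal output.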
Use ShellCheck to check your bash scripts
It’s very easy to miss simple issues when your scripts start to grow large. ShellCheck is one of those tools that saves you from making mistakes.
shellcheck collect_data_from_servers.v7.sh
In collect_data_from_servers.v7.sh line 15:
for dependency in ${dependencies[@]}; do
^----------------^ SC2068: Double quote array expansions to avoid re-splitting elements.
In collect_data_from_servers.v7.sh line 16:
if [ ! -x $dependency ]; then
^---------^ SC2086: Double quote to prevent globbing and word splitting.
Did you mean:
if [ ! -x "$dependency" ]; then
...
If you’re wondering, the final version of the script, after passing ShellCheck is here. Squeaky clean.
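To see why SC2068 matters, this small sketch passes an array with a space inside one element to a function that simply counts its arguments. The file names are made up:

```shell
#!/bin/bash
# Sketch: why ShellCheck asks you to quote array expansions (SC2068).
files=("lshw dmaf5 dump.json" "lshw-macmini2-dump.json")   # hypothetical file names

count_args() { echo "$#"; }

# Deliberately unquoted to show the problem ShellCheck warns about:
# shellcheck disable=SC2068
count_args ${files[@]}     # prints 4: the first element is split on its spaces
count_args "${files[@]}"   # prints 2: one argument per element
```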
Did you notice something about the background scp processes?
You probably noticed that if you kill the script, it leaves some forked processes behind. That isn’t good and this is one of the reasons I prefer to use tools like Ansible or Parallel to handle this type of task on multiple hosts, letting the frameworks do the proper cleanup for me. You can, of course, add more code to handle this situation.
This bash script could potentially behave like a fork bomb: it has no control over how many processes it spawns at the same time, which is a big problem in a real production environment. There is also a limit on how many concurrent ssh sessions you can have, not to mention the bandwidth they consume. Again, I wrote this fictional example in bash to show you how you can always improve a program to better handle errors.
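One way to add that extra code is sketched below; it is a starting point, not a production-ready pattern. It caps the number of simultaneous jobs with wait -n (bash 4.3 or later) and kills any still-running jobs when the script exits or is interrupted. fake_copy and MAX_JOBS are hypothetical placeholders for the scp call and your concurrency limit:

```shell
#!/bin/bash
# Sketch: cap the number of simultaneous background jobs and kill leftovers on exit.
# Requires bash 4.3+ for "wait -n". fake_copy and MAX_JOBS are hypothetical.
MAX_JOBS=2

fake_copy() {
  local host=$1
  sleep 0.3                     # stand-in for the scp of lshw-$host-dump.json
}

cleanup_jobs() {
  local pids
  pids=$(jobs -pr)              # PIDs of jobs that are still running
  if [ -n "$pids" ]; then
    # Word splitting is intended here: one PID per word.
    # shellcheck disable=SC2086
    kill $pids 2>/dev/null || true
  fi
}
trap cleanup_jobs EXIT
trap 'exit 130' INT             # exit normally so the EXIT trap runs
trap 'exit 143' TERM

run_limited() {
  local host running=0
  for host in "$@"; do
    if (( running >= MAX_JOBS )); then
      wait -n                   # block until any one job finishes
      running=$((running - 1))
    fi
    fake_copy "$host" &
    running=$((running + 1))
  done
  wait                          # wait for the remaining jobs
}

run_limited host1 host2 host3 host4 host5
```

Note that wait -n only returns the status of whichever job finished. It does not tell you which host that job belonged to, so the per-PID bookkeeping shown earlier is still needed if you want to report failures per server.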
Let’s recap
1. You must check the return code of your commands. That could mean deciding to retry until a transitory condition improves or to short-circuit the whole script.
2. Speaking of transitory conditions, you don’t need to start from scratch. You can save the status of successful tasks and then retry from that point forward.
3. Bash ‘trap’ is your friend. Use it for cleanup and error handling.
4. When downloading data from any source, assume it’s corrupted. Never overwrite your good data set with fresh data until you have done some integrity checks.
5. Take advantage of journalctl and custom fields. You can perform sophisticated searches looking for issues, and even send that data to log aggregators.
6. You can check the status of background tasks (including sub-shells). Just remember to save the PID and wait on it.
7. And finally: Use a Bash lint helper like ShellCheck. You can install it in your favorite editor (like VIM or PyCharm). You will be surprised how many errors go undetected in Bash scripts…
If you enjoyed this content or would like to expand on it, contact the team at enable-sysadmin@redhat.com.
Error handling is a very important part of any programming language. Bash does not offer better error-handling facilities than other programming languages do, but it is still essential that a Bash script runs without errors when it is executed from the terminal. Error handling can be implemented in a Bash script in several ways. This article shows various techniques for handling errors in a Bash script.
Example 1: Error handling using a conditional statement
Create a Bash file with the following script, which shows how to use a conditional statement to handle errors. The first "if" statement checks the total number of command-line arguments and prints an error message if there are fewer than 2. Then the dividend and divisor values are taken from the command-line arguments. If the divisor is 0, bc generates an error, and its error message is written to the error.txt file. The second "if" statement checks whether error.txt is empty, and an error message is printed if it is not.
#!/bin/bash

#Check the argument values
if [ $# -lt 2 ]; then
    echo "One or more argument is missing."
    exit 1
fi

#Read the dividend value from the first command-line argument
dividend=$1
#Read the divisor value from the second command-line argument
divisor=$2

#Divide the dividend by the divisor; bc's error messages go to error.txt
result=$(echo "scale=2; $dividend/$divisor" | bc 2>error.txt)
#Read the content of the error file
content=$(cat error.txt)
if [ -n "$content" ]; then
    #Print the error message if error.txt is not empty
    echo "Divisible by zero error occurred."
else
    #Print the result
    echo "$dividend/$divisor = $result"
fi
Output:
The following output appears after running the previous script without any arguments:
andreyex@andreyex:~/Desktop/bash$ bash error1.bash
One or more argument is missing.
andreyex@andreyex:~/Desktop/bash$
The following output appears after running the previous script with one argument value:
andreyex@andreyex:~/Desktop/bash$ bash error1.bash 75
One or more argument is missing.
andreyex@andreyex:~/Desktop/bash$
The following output appears after running the previous script with two valid argument values:
andreyex@andreyex:~/Desktop/bash$ bash error1.bash 75 8
75/8 = 9.37
andreyex@andreyex:~/Desktop/bash$
The following output appears after running the previous script with two argument values, where the second argument is 0. An error message is printed:
andreyex@andreyex:~/Desktop/bash$ bash error1.bash 75 0
Divisible by zero error occurred.
andreyex@andreyex:~/Desktop/bash$
Example 2: Error handling using the exit status code
Create a Bash file with the following script, which shows how to handle errors in Bash using the exit status code. A command name is read as input and then executed. If the exit status code is nonzero, an error message is printed. Otherwise, a success message is printed.
#!/bin/bash

#Take a Linux command name
echo -n "Enter a command: "
read -r cmd_name
#Run the command
$cmd_name   # deliberately unquoted so that a command with arguments, such as "ls -l", is split into words
#Check whether the command is valid
if [ $? -ne 0 ]; then
    echo "$cmd_name is a invalid command."
else
    echo "$cmd_name is a valid command."
fi
Output:
The following output appears after running the previous script with a valid command. Here, "date" is taken as the input command, and it is valid:
andreyex@andreyex:~/Desktop/bash$ bash error2.bash
Enter a command: date
Tue Dec 27 19:18:39 +06 2022
date is a valid command.
andreyex@andreyex:~/Desktop/bash$
The following output appears after running the previous script with an invalid command. Here, "cmd" is taken as the input, and it is not a valid command:
andreyex@andreyex:~/Desktop/bash$ bash error2.bash
Enter a command: cmd
error2.bash: line 7: cmd: command not found
cmd is a invalid command.
andreyex@andreyex:~/Desktop/bash$
Example 3: Stop execution on the first error
Create a Bash file with the following script, which shows how to stop execution when the first error occurs in a script. The script contains two invalid commands, so it would raise two errors. The "set -e" command makes the script stop right after the first invalid command fails.
#!/bin/bash
#Set the option to terminate the script on the first error
set -e
echo 'Current date and time: '
#Valid command
date
echo 'Current Working Directory: '
#Invalid command
cwd
echo 'Username: '
#Valid command
whoami
echo 'List of files and folders: '
#Invalid command
list
Output:
The following output appears after running the previous script. The script stops executing after the invalid "cwd" command:
andreyex@andreyex:~/Desktop/bash$ bash error3.bash
Current date and time:
Tue Dec 27 19:19:38 +06 2022
Current Working Directory:
error3.bash: line 9: cwd: command not found
andreyex@andreyex:~/Desktop/bash$
Example 4: Stop execution for an uninitialized variable
Create a Bash file with the following script, which shows how to stop a script when it uses an uninitialized variable. The username and password values are taken from the command-line arguments. If either of these variables is uninitialized, an error message is printed. If both variables are initialized, the script checks whether the username and password are valid.
#!/bin/bash

#Set the option to terminate the script for an uninitialized variable
set -u

#Set the first command-line argument value as the username
username=$1
#Set the second command-line argument value as the password
password=$2
#Check whether the username and password are valid
if [[ $username == 'admin' && $password == 'hidenseek' ]]; then
    echo "Valid user."
else
    echo "Invalid user."
fi
Вывод:
Следующий вывод появляется, если сценарий выполняется без использования какого-либо значения аргумента командной строки. Скрипт останавливает выполнение после получения первой неинициализированной переменной:
andreyex@andreyex:~/Desktop/bash$ bash error4.bash
error4.bash: line 7: $1: unbound variable
andreyex@andreyex:~/Desktop/bash$
The following output appears if the script is executed with one command-line argument. The script stops execution at the second uninitialized variable:
andreyex@andreyex:~/Desktop/bash$ bash error4.bash admin
error4.bash: line 9: $2: unbound variable
andreyex@andreyex:~/Desktop/bash$
The following output appears if the script is executed with two command-line arguments, “admin” and “hide”. Here the username is valid but the password is not, so the message “Invalid user.” is printed:
andreyex@andreyex:~/Desktop/bash$ bash error4.bash admin hide
Invalid user.
andreyex@andreyex:~/Desktop/bash$
The following output appears if the script is executed with two command-line arguments, “admin” and “hidenseek”. Here both the username and the password are valid, so the message “Valid user.” is printed:
andreyex@andreyex:~/Desktop/bash$ bash error4.bash admin hidenseek
Valid user.
andreyex@andreyex:~/Desktop/bash$
Conclusion
This article used several examples to show different ways of handling errors in a Bash script. We hope it helps Bash users add error handling to their own scripts.
Writing a robust, error-free bash script is always a challenge. Even a perfectly written bash script can still fail because of external factors such as invalid input or network problems.
The bash shell has no exception-handling mechanism such as try/catch constructs. Some bash errors can be silently ignored, yet still cause problems later on.
Checking a command's exit status
It is always a good idea to check a command's exit status, because a non-zero exit status usually indicates an error:
if ! command; then
echo "command returned an error"
fi
Another, more compact way to trigger error handling based on the exit status is to use the OR operator:
<command_1> || <command_2>
With the OR operator, <command_2> runs if and only if <command_1> returns a non-zero exit status.
As the second command, you can use your own Bash error-handling function:
error_exit()
{
echo "Error: $1"
exit 1
}
bad-command || error_exit "Some error"
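This is a slightly more general variant of error_exit(). It is a sketch that assumes error messages should go to stderr and that callers may want to choose the exit status. The name die() is hypothetical. The failing case runs in a subshell so that its `exit` ends only the subshell.

```bash
#!/bin/bash
# Sketch: a reusable handler for the `command || handler` pattern.
# die() is a hypothetical name: it prints to stderr and exits with an
# optional status (default 1).
die() {
  local message=$1
  local status=${2:-1}
  echo "Error: $message" >&2
  exit "$status"
}

# The handler runs only when the command on its left fails.
test -d / || die "the root directory is missing"
echo "first check passed"

# Failing case, run in a subshell so that `exit` ends only the subshell.
( ls /nonexistent-dir-for-demo >/dev/null 2>&1 || die "listing failed" 3 )
echo "subshell exit status: $?"
```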
Bash has a built-in variable, $?, that holds the exit status of the last executed command.
After a bash function call, $? holds the exit status of the last command executed inside the function, unless the function ends with an explicit return N. Because some non-zero exit codes have special meanings, you can handle them selectively:
status=$?
case "$status" in
"1") echo "General error";;
"2") echo "Misuse of shell builtins";;
"126") echo "Command invoked cannot execute";;
"127") echo "Command not found";;
"128") echo "Invalid argument to exit";;
esac
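The same idea can be packaged as a reusable function. This sketch follows the conventional meanings listed above, including 127 for "command not found" and the 128+N convention for a command killed by signal N. The name describe_status() is hypothetical.

```bash
#!/bin/bash
# Sketch: describe an exit status. describe_status() is a hypothetical name.
describe_status() {
  local status=$1
  case $status in
    0)   echo "success" ;;
    1)   echo "general error" ;;
    2)   echo "misuse of shell builtins" ;;
    126) echo "command invoked cannot execute" ;;
    127) echo "command not found" ;;
    128) echo "invalid argument to exit" ;;
    *)
      if (( status > 128 && status <= 255 )); then
        # 128+N: the command was terminated by signal N
        echo "terminated by signal $(( status - 128 ))"
      else
        echo "command-specific error ($status)"
      fi
      ;;
  esac
}

no_such_command_for_demo 2>/dev/null
describe_status "$?"      # command not found
```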
Exiting a script on error in Bash
When an error occurs in a bash script, bash by default prints an error message to stderr and then continues executing the rest of the script. Even a mistyped command does not terminate the script; you just see a “command not found” error.
This default behavior may be undesirable for some bash scripts. For example, if a script contains a critical block of code in which errors are not acceptable, you want the script to exit immediately when any error occurs inside that block. To enable this “exit on error” behavior in bash, use the set command as follows:
set -e
# some critical block of code where errors are not acceptable
set +e
Called with the -e option, the set command makes bash exit immediately if any subsequent command exits with a non-zero status (caused by an error condition). The +e option returns the shell to its default mode. set -e is equivalent to set -o errexit; likewise, set +e is shorthand for set +o errexit. Note, however, that by default the exit status of a pipeline is the status of its last command, so set -e does not notice a failure earlier in the pipeline:
set -e
true | false | true
echo "This will be printed" # "false" inside the pipeline is not detected
If a failure anywhere in a pipeline should also terminate the bash script, add the -o pipefail option:
set -o pipefail -e
true | false | true # "false" inside the pipeline is now detected
echo "This will not be printed"
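To see exactly what pipefail changes, the sketch below runs the same pipeline three ways. PIPESTATUS is bash's built-in array that holds the exit status of each command in the most recent pipeline. pipefail is enabled only inside a subshell so the option does not leak into the rest of the script.

```bash
#!/bin/bash
# Sketch: a pipeline's exit status with and without pipefail.
true | false | true
echo "default:    $?"                  # 0: status of the last stage only

true | false | true
echo "PIPESTATUS: ${PIPESTATUS[*]}"    # 0 1 0: one status per stage

# Enable pipefail in a subshell so the option does not leak out.
( set -o pipefail; true | false | true; echo "pipefail:   $?" )   # 1
```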
To “protect” a critical block of a script from any kind of command error or pipeline error, use the following combination of set commands:
set -o pipefail -e
# some critical block of code where neither a command error nor a pipeline error is acceptable
set +o pipefail +e
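The closing set +o pipefail +e switches both options off unconditionally, even if they were on before the block. One alternative, assuming bash 4.4 or later, is `local -` inside a function. It makes changes made with set local to that function, so the earlier option state comes back when the function returns. The names show_opts() and critical_section() are hypothetical.

```bash
#!/bin/bash
# Sketch (assumes bash >= 4.4): scope errexit/pipefail to one function.
# show_opts() and critical_section() are hypothetical names.
show_opts() {
  local on=()
  [[ -o errexit ]] && on+=(errexit)
  [[ -o pipefail ]] && on+=(pipefail)
  echo "$1 ${on[*]:-none}"
}

critical_section() {
  local -                        # undo `set` changes when the function returns
  set -o errexit -o pipefail
  show_opts "inside:"
  printf '%s\n' "critical work" | tr '[:lower:]' '[:upper:]'
}

show_opts "before:"
critical_section
show_opts "after: "
```

Note that show_opts is called directly rather than through $( ), because bash clears -e inside command substitutions.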
Contents
- 1 Problem
- 2 Solutions
- 2.1 Executed in subshell, exit on error
- 2.2 Executed in subshell, trap error
- 3 Caveat 1: `Exit on error’ ignoring subshell exit status
- 3.1 Solution: Generate error yourself if subshell fails
- 3.1.1 Example 1
- 3.1.2 Example 2
- 4 Caveat 2: `Exit on error’ not exiting subshell on error
- 4.1 Solution: Use logical operators (&&, ||) within subshell
- 4.1.1 Example
- 5 Caveat 3: `Exit on error’ not exiting command substitution on error
- 5.1 Solution 1: Use logical operators (&&, ||) within command substitution
- 5.2 Solution 2: Enable posix mode
- 6 The tools
- 6.1 Exit on error
- 6.1.1 Specify `bash -e’ as the shebang interpreter
- 6.1.1.1 Example
- 6.1.2 Set ERR trap to exit
- 6.1.2.1 Example
- 7 Solutions revisited: Combining the tools
- 7.1 Executed in subshell, trap on exit
- 7.1.1 Rationale
- 7.2 Sourced in current shell
- 7.2.1 Todo
- 7.2.2 Rationale
- 7.2.2.1 `Exit’ trap in sourced script
- 7.2.2.2 `Break’ trap in sourced script
- 7.2.2.3 Trap in function in sourced script without `errtrace’
- 7.2.2.4 Trap in function in sourced script with `errtrace’
- 7.2.2.5 `Break’ trap in function in sourced script with `errtrace’
- 8 Test
- 9 See also
- 10 Journal
- 10.1 20210114
- 10.2 20060524
- 10.3 20060525
Problem
I want to catch errors in a bash script using set -e (or set -o errexit or trap ERR). What are best practices?
Solutions
See Solutions revisited: Combining the tools for detailed explanations.
If the script is executed in a subshell, it’s relatively easy: you don’t have to worry about backing up and restoring shell options and shell traps, because they’re automatically restored when the subshell exits.
Executed in subshell, exit on error
Example script:
#!/bin/bash -eu
# -e: Exit immediately if a command exits with a non-zero status.
# -u: Treat unset variables as an error when substituting.

(false)                   # Caveat 1: If an error occurs in a subshell, it isn't detected
(false) || false          # Solution: If you want to exit, you have to detect the error yourself
(false; true) || false    # Caveat 2: The return status of the ';' separated list is `true'
(false && true) || false  # Solution: If you want to control the last command executed, use `&&'
See also Caveat 1: `Exit on error’ ignoring subshell exit status
Executed in subshell, trap error
#!/bin/bash -Eu
# -E: ERR trap is inherited by shell functions.
# -u: Treat unset variables as an error when substituting.
#
# Example script for handling bash errors.  Exit on error.  Trap exit.
# This script is supposed to run in a subshell.
# See also: http://fvue.nl/wiki/Bash:_Error_handling

# Trap non-normal exit signals: 1/HUP, 2/INT, 3/QUIT, 15/TERM, ERR
trap onexit 1 2 3 15 ERR

#--- onexit() -----------------------------------------------------
#  @param $1 integer  (optional) Exit status.  If not set, use `$?'

function onexit() {
    local exit_status=${1:-$?}
    echo Exiting $0 with $exit_status
    exit $exit_status
}

# myscript

# Always call `onexit' at end of script
onexit
Caveat 1: `Exit on error’ ignoring subshell exit status
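The `-e’ setting does not exit if an error occurs within a subshell, for example with these subshell commands: (false) or bash -c false. This describes the bash versions this page was written for. In current bash releases (5.x, for example), a failing (false) counts as a failing compound command, so `-e’ does exit the script there.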
Example script caveat1.sh:
#!/bin/bash -e
echo begin
(false)
echo end
Executing the script above gives:
$ ./caveat1.sh
begin
end
$
Conclusion: the script didn’t exit after (false).
Solution: Generate error yourself if subshell fails
( SHELL COMMANDS ) || false
In the line above, the exit status of the subshell is checked. The subshell must exit with a zero status, indicating success; otherwise `false’ runs, generating an error in the current shell.
Note that within a bash `list’, with commands separated by a `;’, the return status is the exit status of the last command executed. Use the control operators `&&’ and `||’ if you want to control the last command executed:
$ (false; true) || echo foo
$ (false && true) || echo foo
foo
$
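The difference can also be checked from a script by printing each subshell's exit status. This sketch only restates the session above in script form and uses nothing beyond standard bash syntax.

```bash
#!/bin/bash
# Sketch: a subshell's exit status is the status of the last command it ran,
# so `;` can hide an earlier failure while `&&` preserves it.
(false; true)
echo "(false; true)   -> $?"    # 0: the last command, true, succeeded

(false && true)
echo "(false && true) -> $?"    # 1: true never ran, false's status is kept

# Checking the subshell's status explicitly:
(false && true) || echo "subshell failed, handling the error here"
```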
Example 1
Example script example.sh:
#!/bin/bash -e
echo begin
(false) || false
echo end
Executing the script above gives:
$ ./example.sh
begin
$
Conclusion: the script exits after false.
Example 2
Example bash commands:
$ trap 'echo error' ERR   # Set ERR trap
$ false                   # Non-zero exit status will be trapped
error
$ (false)                 # Non-zero exit status within subshell will not be trapped
$ (false) || false        # Solution: generate error yourself if subshell fails
error
$ trap - ERR              # Reset ERR trap
Caveat 2: `Exit on error’ not exiting subshell on error
The `-e’ setting doesn’t always immediately exit the subshell `(…)’ when an error occurs. It appears the subshell behaves as a simple command and has the same restrictions as `-e’:
- Exit immediately if a simple command exits with a non-zero status, unless the subshell is part of the command list immediately following a `while’ or `until’ keyword, part of the test in an `if’ statement, part of the right-hand-side of a `&&’ or `||’ list, or if the command’s return status is being inverted using `!’
Example script caveat2.sh:
#!/bin/bash -e
(false; echo A)                        # Subshell exits after `false'
! (false; echo B)                      # Subshell doesn't exit after `false'
true && (false; echo C)                # Subshell exits after `false'
(false; echo D) && true                # Subshell doesn't exit after `false'
(false; echo E) || false               # Subshell doesn't exit after `false'
if (false; echo F); then true; fi      # Subshell doesn't exit after `false'
while (false; echo G); do break; done  # Subshell doesn't exit after `false'
until (false; echo H); do break; done  # Subshell doesn't exit after `false'
Executing the script above gives:
$ ./caveat2.sh
B
D
E
F
G
H
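The same "ignored context" rule covers shell functions too. When a function is called as the test of an `if`, -e is ignored for every command in its body, so the function keeps running after `false`. The function name step() in this sketch is hypothetical.

```bash
#!/bin/bash
# Sketch: -e is ignored inside a function used as an `if` condition.
# step() is a hypothetical name.
step() {
  false
  echo "still running after false"
}

(
  set -e
  if step; then
    echo "step reported success"
  fi
)
```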
Solution: Use logical operators (&&, ||) within subshell
Use logical operators `&&’ or `||’ to control execution of commands within a subshell.
Example
#!/bin/bash -e
(false && echo A)
! (false && echo B)
true && (false && echo C)
(false && echo D) && true
(false && echo E) || false
if (false && echo F); then true; fi
while (false && echo G); do break; done
until (false && echo H); do break; done
Executing the script above gives no output:
$ ./example.sh
$
Conclusion: the subshells do not output anything because the `&&’ operator is used instead of the command separator `;’ as in caveat2.sh.
Caveat 3: `Exit on error’ not exiting command substitution on error
The `-e’ setting doesn’t immediately exit command substitution when an error occurs, except when bash is in posix mode:
$ set -e
$ echo $(false; echo A)
A
Solution 1: Use logical operators (&&, ||) within command substitution
$ set -e
$ echo $(false && echo A)
Solution 2: Enable posix mode
When posix mode is enabled via set -o posix, command substitution will exit if `-e’ has been set in the parent shell.
$ set -e
$ set -o posix
$ echo $(false; echo A)
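Enabling posix mode does change other bash behaviors as well, though. Since bash 4.4, shopt -s inherit_errexit makes command substitutions inherit `-e’ without switching on the rest of posix mode.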
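The following sketch compares command substitution with and without inherit_errexit. It assumes bash 4.4 or later. Each case runs in its own `bash -c` process, so neither option leaks into the calling shell.

```bash
#!/bin/bash
# Sketch (assumes bash >= 4.4): inherit_errexit vs. the default.
# Without it, the failing `false` inside $( ) is ignored.
without=$(bash -c 'set -e; out=$(false; echo A); echo "got [$out]"')

# With it, the substitution fails, and the assignment's non-zero status
# stops the `set -e` script before its final echo.
with=$(bash -c 'shopt -s inherit_errexit; set -e; out=$(false; echo A); echo "got [$out]"') || true

echo "without inherit_errexit: '$without'"   # 'got [A]'
echo "with inherit_errexit:    '$with'"      # ''
```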
The tools
Exit on error
Bash can be told to exit immediately if a command fails. From the bash manual («set -e»):
- «Exit immediately if a simple command (see SHELL GRAMMAR above) exits with a non-zero status. The shell does not exit if the command that fails is part of the command list immediately following a while or until keyword, part of the test in an if statement, part of a && or || list, or if the command’s return value is being inverted via !. A trap on ERR, if set, is executed before the shell exits.»
To let bash exit on error, different notations can be used:
- Specify `bash -e’ as shebang interpreter
- Start shell script with `bash -e’
- Use `set -e’ in shell script
- Use `set -o errexit’ in shell script
- Use `trap exit ERR’ in shell script
Specify `bash -e’ as the shebang interpreter
You can add `-e’ to the shebang line, the first line of your shell script:
#!/bin/bash -e
This will execute the shell script with `-e’ active. Note `-e’ can be overridden by invoking bash explicitly (without `-e’):
$ bash shell_script
Example
Create this shell script example.sh and make it executable with chmod u+x example.sh:
#!/bin/bash -e
echo begin
false      # This should exit bash because `false' returns error
echo end   # This should never be reached
Example run:
$ ./example.sh
begin
$ bash example.sh
begin
end
$
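The shebang option is lost when a script is started as `bash script`, so one alternative is to put set -e inside the script itself. This sketch writes two throwaway scripts to mktemp files, which is done only for the demonstration, and runs both with bash.

```bash
#!/bin/bash
# Sketch: shebang `-e` vs. `set -e` inside the script, both run as `bash file`.
shebang_only=$(mktemp)
set_inside=$(mktemp)
trap 'rm -f "$shebang_only" "$set_inside"' EXIT

printf '%s\n' '#!/bin/bash -e' 'echo begin' 'false' 'echo end' >"$shebang_only"
printf '%s\n' '#!/bin/bash' 'set -e' 'echo begin' 'false' 'echo end' >"$set_inside"

echo "shebang -e:    $(bash "$shebang_only" | tr '\n' ' ')"   # begin end
echo "set -e inside: $(bash "$set_inside" | tr '\n' ' ')"     # begin
```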
Set ERR trap to exit
By setting an ERR trap you can catch errors as well:
trap command ERR
By setting the command to `exit’, bash exits if an error occurs.
trap exit ERR
Example
Example script example.sh
#!/bin/bash
trap exit ERR
echo begin
false
echo end
Example run:
$ ./example.sh
begin
$
The non-zero exit status of `false’ is caught by the error trap. The error trap exits, and `echo end’ is never reached.
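Instead of a bare `exit`, the ERR trap can first report where the failure happened. In this sketch, the handler name on_error() is hypothetical. The trap string passes $? and $LINENO, so the handler receives the failing command's status and line number. The demo runs inside a subshell so the trap's `exit` ends only that subshell.

```bash
#!/bin/bash
# Sketch: an ERR trap that reports the failing line, then exits with its status.
# on_error() and demo() are hypothetical names.
on_error() {
  local status=$1 line=$2
  echo "error: command on line $line failed with status $status" >&2
  exit "$status"
}

demo() {
  (
    trap 'on_error $? $LINENO' ERR
    echo begin
    false            # triggers the ERR trap
    echo end         # never reached
  )
  echo "subshell exited with status $?"
}

demo
```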
Solutions revisited: Combining the tools
Executed in subshell, trap on exit
#!/bin/bash
# --- subshell_trap.sh -------------------------------------------------
# Example script for handling bash errors.  Exit on error.  Trap exit.
# This script is supposed to run in a subshell.
# See also: http://fvue.nl/wiki/Bash:_Error_handling

# Let shell functions inherit ERR trap.  Same as `set -E'.
set -o errtrace
# Trigger error when expanding unset variables.  Same as `set -u'.
set -o nounset
# Trap non-normal exit signals: 1/HUP, 2/INT, 3/QUIT, 15/TERM, ERR
# NOTE1: - 9/KILL cannot be trapped.
#+       - 0/EXIT isn't trapped because:
#+         - with ERR trap defined, trap would be called twice on error
#+         - with ERR trap defined, syntax errors exit with status 0, not 2
# NOTE2: Because onexit() calls `exit', this ERR trap has much the same
#+       effect as `set -o errexit' (`set -e').
trap onexit 1 2 3 15 ERR

#--- onexit() -----------------------------------------------------
#  @param $1 integer  (optional) Exit status.  If not set, use `$?'

function onexit() {
    local exit_status=${1:-$?}
    echo Exiting $0 with $exit_status
    exit $exit_status
}

# myscript

# Always call `onexit' at end of script
onexit
Rationale
[Diagram: lifelines for shell, subshell, script and trap. An error in the script invokes the trap, whose `exit’ ends the subshell and returns control to the shell.]
Figure 1. Trap in executed script
When a script is executed from a shell, bash will create a subshell in which the script is run. If a trap catches an error, and the trap says `exit’, this will cause the subshell to exit.
Sourced in current shell
If the script is sourced (included) in the current shell, you have to worry about restoring shell options and shell traps. If they aren’t restored, they might cause problems in other programs which rely on specific settings.
#!/bin/bash
#--- listing6.inc.sh ---------------------------------------------------
# Demonstration of ERR trap being reset by foo_deinit() with the use
# of `errtrace'.
# Example run:
#
#    $ set +o errtrace         # Make sure errtrace is not set (bash default)
#    $ trap - ERR              # Make sure no ERR trap is set (bash default)
#    $ . listing6.inc.sh       # Source listing6.inc.sh
#    $ foo                     # Run foo()
#    foo_init
#    Entered `trap-loop'
#    trapped
#    This is always executed - with or without a trap occurring
#    foo_deinit
#    $ trap                    # Check if ERR trap is reset.
#    $ set -o | grep errtrace  # Check if the `errtrace' setting is...
#    errtrace        off       # ...restored.
#    $
#
# See: http://fvue.nl/wiki/Bash:_Error_handling

function foo_init {
    echo foo_init
    fooOldErrtrace=$(set +o | grep errtrace)
    set -o errtrace
    trap 'echo trapped; break' ERR   # Set ERR trap
}

function foo_deinit {
    echo foo_deinit
    trap - ERR                 # Reset ERR trap
    eval $fooOldErrtrace       # Restore `errtrace' setting
    unset fooOldErrtrace       # Delete global variable
}

function foo {
    foo_init
        # `trap-loop'
    while true; do
        echo Entered \`trap-loop\'
        false
        echo This should never be reached because the \`false\' above is trapped
        break
    done
    echo This is always executed - with or without a trap occurring
    foo_deinit
}
Todo
- an existing ERR trap must be restored and called
- test if the `trap-loop’ is reached if the script breaks from a nested loop
Rationale
`Exit’ trap in sourced script
When the script is sourced in the current shell, it’s not possible to use `exit’ to terminate the program: This would terminate the current shell as well, as shown in the picture underneath.
[Diagram: lifelines for shell, script and trap. An error in the sourced script invokes the trap, whose `exit’ terminates the shell itself.]
Figure 2. `Exit’ trap in sourced script
When a script is sourced from a shell, bash will run the script in the current shell. If a trap catches an error, and the trap says `exit’, this will cause the current shell to exit.
`Break’ trap in sourced script
A solution is to introduce a main loop in the program, which is terminated by a `break’ statement within the trap.
[Diagram: lifelines for shell, script, `loop’ and trap. An error invokes the trap, whose `break’ ends the `loop’ and returns control to the script, which then returns to the shell.]
Figure 3. `Break’ trap in sourced script
When a script is sourced from a shell, e.g. . ./script, bash will run the script in the current shell. If a trap catches an error, and the trap says `break’, this will cause the `loop’ to break and to return to the script.
For example:
#!/bin/bash
#--- listing3.sh -------------------------------------------------------
# See: http://fvue.nl/wiki/Bash:_Error_handling

trap 'echo trapped; break' ERR;    # Set ERR trap

function foo { echo foo; false; }  # foo() exits with error

    # `trap-loop'
while true; do
    echo Entered \`trap-loop\'
    foo
    echo This is never reached
    break
done

echo This is always executed - with or without a trap occurring

trap - ERR                         # Reset ERR trap
Listing 3. `Break’ trap in sourced script
Example output:
$> source listing3.sh
Entered `trap-loop'
foo
trapped
This is always executed - with or without a trap occurring
$>
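Code that is meant to be sourced can also report failures with `return` instead of `exit`. That ends the function but leaves the sourcing shell running, without needing a `trap-loop’. The helper name require_file() in this sketch is hypothetical.

```bash
#!/bin/bash
# Sketch: in sourced code, signal errors with `return`, not `exit`.
# require_file() is a hypothetical helper name.
require_file() {
  local path=$1
  if [[ ! -r $path ]]; then
    echo "require_file: cannot read '$path'" >&2
    return 1            # `exit 1` here would close the sourcing shell
  fi
  echo "ok: $path"
}

require_file /nonexistent/settings.conf || echo "caller handled the failure and keeps going"
```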
Trap in function in sourced script without `errtrace’
A problem arises when the trap is reset from within a function of a sourced script. From the bash manual, set -o errtrace or set -E:
If set, any trap on `ERR’ is inherited by shell functions, command substitutions, and commands executed in a subshell environment. The `ERR’ trap is normally not inherited in such cases.
So with errtrace not set, a function does not know of any `ERR’ trap set, and thus the function is unable to reset the `ERR’ trap. For example, see listing 4 underneath.
#!/bin/bash
#--- listing4.inc.sh ---------------------------------------------------
# Demonstration of ERR trap not being reset by foo_deinit()
# Example run:
#
#    $> set +o errtrace     # Make sure errtrace is not set (bash default)
#    $> trap - ERR          # Make sure no ERR trap is set (bash default)
#    $> . listing4.inc.sh   # Source listing4.inc.sh
#    $> foo                 # Run foo()
#    foo_init
#    foo
#    foo_deinit             # This should've reset the ERR trap...
#    $> trap                # but the ERR trap is still there:
#    trap -- 'echo trapped' ERR
#    $> trap

# See: http://fvue.nl/wiki/Bash:_Error_handling

function foo_init {
    echo foo_init
    trap 'echo trapped' ERR;}   # Set ERR trap

function foo_deinit {
    echo foo_deinit
    trap - ERR ;}               # Reset ERR trap

function foo {
    foo_init
    echo foo
    foo_deinit ;}
Listing 4. Trap in function in sourced script
foo_deinit() is unable to unset the ERR trap, because errtrace is not set.
Trap in function in sourced script with ‘errtrace’
The solution is to set -o errtrace. See listing 5 underneath:
#!/bin/bash
#--- listing5.inc.sh ---------------------------------------------------
# Demonstration of ERR trap being reset by foo_deinit() with the use
# of `errtrace'.
# Example run:
#
#    $> set +o errtrace         # Make sure errtrace is not set (bash default)
#    $> trap - ERR              # Make sure no ERR trap is set (bash default)
#    $> . listing5.inc.sh       # Source listing5.inc.sh
#    $> foo                     # Run foo()
#    foo_init
#    foo
#    foo_deinit                 # This should reset the ERR trap...
#    $> trap                    # and it is indeed.
#    $> set +o | grep errtrace  # And the `errtrace' setting is restored.
#    $>
#
# See: http://fvue.nl/wiki/Bash:_Error_handling

function foo_init {
    echo foo_init
    fooOldErrtrace=$(set +o | grep errtrace)
    set -o errtrace
    trap 'echo trapped' ERR    # Set ERR trap
}

function foo_deinit {
    echo foo_deinit
    trap - ERR                 # Reset ERR trap
    eval $fooOldErrtrace       # Restore `errtrace' setting
    fooOldErrtrace=            # Delete global variable
}

function foo {
    foo_init
    echo foo
    foo_deinit ;}
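The effect of errtrace on an ERR trap can be measured directly by counting how often the trap fires. In this sketch, inner() fails on its first line but still returns 0, so without errtrace the failure goes unnoticed. The names inner and trap_count are hypothetical and used only for the demo.

```bash
#!/bin/bash
# Sketch: the ERR trap fires inside a function only with errtrace (set -E).
# inner() and trap_count are hypothetical names.
inner() {
  false      # fails...
  true       # ...but the function's own status is 0
}

trap_count=0
trap 'trap_count=$((trap_count + 1))' ERR

set +o errtrace
inner
echo "without errtrace: ERR trap ran $trap_count time(s)"   # 0

trap_count=0
set -o errtrace
inner
echo "with errtrace:    ERR trap ran $trap_count time(s)"   # 1

set +o errtrace
trap - ERR    # reset the trap again
```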
`Break’ trap in function in sourced script with `errtrace’
Everything combined in listing 6 underneath:
#!/bin/bash
#--- listing6.inc.sh ---------------------------------------------------
# Demonstration of ERR trap being reset by foo_deinit() with the use
# of `errtrace'.
# Example run:
#
#    $> set +o errtrace         # Make sure errtrace is not set (bash default)
#    $> trap - ERR              # Make sure no ERR trap is set (bash default)
#    $> . listing6.inc.sh       # Source listing6.inc.sh
#    $> foo                     # Run foo()
#    foo_init
#    Entered `trap-loop'
#    trapped
#    This is always executed - with or without a trap occurring
#    foo_deinit
#    $> trap                    # Check if ERR trap is reset.
#    $> set -o | grep errtrace  # Check if the `errtrace' setting is...
#    errtrace        off        # ...restored.
#    $>
#
# See: http://fvue.nl/wiki/Bash:_Error_handling

function foo_init {
    echo foo_init
    fooOldErrtrace=$(set +o | grep errtrace)
    set -o errtrace
    trap 'echo trapped; break' ERR   # Set ERR trap
}

function foo_deinit {
    echo foo_deinit
    trap - ERR                 # Reset ERR trap
    eval $fooOldErrtrace       # Restore `errtrace' setting
    unset fooOldErrtrace       # Delete global variable
}

function foo {
    foo_init
        # `trap-loop'
    while true; do
        echo Entered \`trap-loop\'
        false
        echo This should never be reached because the \`false\' above is trapped
        break
    done
    echo This is always executed - with or without a trap occurring
    foo_deinit
}
Test
#!/bin/bash

# Tests

# An erroneous command should have exit status 127.
# The erroneous command should be trapped by the ERR trap.
#erroneous_command

# A simple command exiting with a non-zero status should have exit status
#+ <> 0, in this case 1.  The simple command is trapped by the ERR trap.
#false

# Manually calling 'onexit'
#onexit

# Manually calling 'onexit' with exit status
#onexit 5

# Killing a process via CTRL-C (signal 2/SIGINT) is handled via the SIGINT trap
# NOTE: `sleep' cannot be killed via `kill' plus 1/SIGHUP, 2/SIGINT, 3/SIGQUIT
#+      or 15/SIGTERM.
#echo $$; sleep 20

# Killing a process via 1/SIGHUP, 2/SIGINT, 3/SIGQUIT or 15/SIGTERM is
#+ handled via the respective trap.
# NOTE: Unfortunately, I haven't found a way to retrieve the signal number from
#+      within the trap function.
echo $$; while true; do :; done

# A syntax error is not trapped, but should have exit status 2
#fi

# An unbound variable is not trapped, but should have exit status 1
#+ thanks to 'set -u'
#echo $foo

# Executing `false' within a function should exit with 1 because of `set -E'
#function foo() {
#    false
#    true
#} # foo()
#foo

echo End of script

# Always call 'onexit' at end of script
onexit
See also
- Bash: Err trap not reset: solution for trap - ERR not resetting the ERR trap.
Journal
20210114
Another caveat: exit (or an error trap) executed within «process substitution» doesn’t end the outer process. The script underneath keeps outputting «loop1»:
#!/bin/bash

# This script outputs "loop1" forever, while I hoped it would exit all while-loops
set -o pipefail
set -Eeu

while true; do
    echo loop1
    while read FOO; do
        echo loop2
        echo FOO: $FOO
    done < <( exit 1 )
done
In the ‘< <()’ notation, ‘<()’ is the process substitution and the first ‘<’ redirects the loop’s input from it.
See also:
- https://mywiki.wooledge.org/ProcessSubstitution
- https://unix.stackexchange.com/questions/128560/how-do-i-capture-the-exit-code-handle-errors-correctly-when-using-process-subs
- https://superuser.com/questions/696855/why-doesnt-a-bash-while-loop-exit-when-piping-to-terminated-subcommand
Workaround: Use «Here Strings» ([n]<<<word):
#!/bin/bash

# This script will exit correctly if building up $rows results in an error
set -Eeu

rows=$(exit 1)
while true; do
    echo loop1
    while read FOO; do
        echo loop2
        echo FOO: $FOO
    done <<< "$rows"
done
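The here-string workaround comes down to three steps: capture the producer's output first, check its exit status, and only then loop over the lines. In this sketch, produce_rows() and process_rows() are hypothetical names. produce_rows() takes an optional status argument so it can simulate a failing producer. The `local` declaration is kept separate from the assignment so that `local` does not mask the substitution's exit status.

```bash
#!/bin/bash
# Sketch: check the producer's status before looping over its output.
# produce_rows() and process_rows() are hypothetical names.
produce_rows() {
  printf '%s\n' "row 1" "row 2"
  return "${1:-0}"     # pass a non-zero argument to simulate a failure
}

process_rows() {
  local rows line
  rows=$(produce_rows "$@") || { echo "producer failed with status $?" >&2; return 1; }
  while IFS= read -r line; do
    echo "got: $line"
  done <<< "$rows"
}

process_rows                                  # prints both rows
process_rows 3 || echo "processing aborted"   # the loop never runs
```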
20060524
#!/bin/bash
#--- traptest.sh --------------------------------------------
# Example script for trapping bash errors.
# NOTE: Why doesn't this script catch syntax errors?

# Exit on all errors
set -e
# Trap exit
trap trap_exit_handler EXIT

# Handle exit trap
function trap_exit_handler() {
    # Backup exit status if you're interested...
    local exit_status=$?
    # Change value of $?
    true
    echo $?
    #echo trap_handler $exit_status
} # trap_exit_handler()

# An erroneous command will trigger a bash error and, because
# of 'set -e', will 'exit 127' thus falling into the exit trap.
#erroneous_command
# The same goes for a command with a false return status
#false

# A manual exit will also fall into the exit trap
#exit 5

# A syntax error isn't caught?
fi

# Disable exit trap
trap - EXIT
exit 0
Normally, a syntax error exits with status 2, but when both ‘set -e’ and ‘trap EXIT’ are defined, my script exits with status 0. How can I have both ‘errexit’ and ‘trap EXIT’ enabled, *and* catch syntax errors via exit status? Here’s an example script (test.sh):
set -e
trap 'echo trapped: $?' EXIT
fi

$> bash test.sh; echo \$?: $?
test.sh: line 3: syntax error near unexpected token `fi'
trapped: 0
$?: 0
More trivia:
- With the line ‘#set -e’ commented, bash traps 258 and returns an exit status of 2:
trapped: 258
$?: 2
- With the line ‘#trap ‘echo trapped $?’ EXIT’ commented, bash returns an exit status of 2:
$?: 2
- With a bogus function definition on top, bash returns an exit status of 2, but no exit trap is executed:
function foo() { foo=bar }
set -e
trap 'echo trapped: $?' EXIT
fi
fred@linux:~>bash test.sh; echo \$?: $?
test.sh: line 4: syntax error near unexpected token `fi'
test.sh: line 4: `fi'
$?: 2
20060525
Example of a ‘cleanup’ script using trap (see also: Writing Robust Bash Shell Scripts):
#!/bin/bash
#--- cleanup.sh ---------------------------------------------------------------
# Example script for trapping bash errors.
# NOTE: Use 'cleanexit [status]' instead of 'exit [status]'

# Trap not-normal exit signals: 1/HUP, 2/INT, 3/QUIT, 15/TERM
# @see catch_sig()
trap catch_sig 1 2 3 15
# Trap errors (simple commands exiting with a non-zero status)
# @see catch_err()
trap catch_err ERR

#--- cleanexit() --------------------------------------------------------------
#  Wrapper around 'exit' to cleanup on exit.
#  @param $1 integer  Exit status.  If $1 not defined, exit status of global
#+                    variable 'EXIT_STATUS' is used.  If neither $1 nor
#+                    'EXIT_STATUS' defined, exit with status 0 (success).
function cleanexit() {
    echo "Exiting with ${1:-${EXIT_STATUS:-0}}"
    exit ${1:-${EXIT_STATUS:-0}}
} # cleanexit()

#--- catch_err() --------------------------------------------------------------
#  Catch ERR trap.
#  This traps simple commands exiting with a non-zero status.
#  See also: info bash | "Shell Builtin Commands" | "The Set Builtin" | "-e"
function catch_err() {
    local exit_status=$?
    echo "Inside catch_err"
    cleanexit $exit_status
} # catch_err()

#--- catch_sig() --------------------------------------------------------------
#  Catch signal trap.
#  Trap not-normal exit signals: 1/HUP, 2/INT, 3/QUIT, 15/TERM
#  @NOTE1: Non-trapped signals are 0/EXIT, 9/KILL.
function catch_sig() {
    local exit_status=$?
    echo "Inside catch_sig"
    cleanexit $exit_status
} # catch_sig()

# An erroneous command should have exit status 127.
# The erroneous command should be trapped by the ERR trap.
#erroneous_command

# A command returning false should have exit status <> 0
# The false returning command should be trapped by the ERR trap.
#false

# Manually calling 'cleanexit'
#cleanexit

# Manually calling 'cleanexit' with exit status
#cleanexit 5

# Killing a process via CTRL-C is handled via the SIGINT trap
#sleep 20

# A syntax error is not trapped, but should have exit status 2
#fi

# Always call 'cleanexit' at end of script
cleanexit
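Many current scripts handle cleanup with an EXIT trap instead of a cleanexit() wrapper. The EXIT trap runs however the shell ends: at the normal end of the script, on `exit N`, or when `set -e` stops it. In this sketch, the work runs in a subshell only so that the demo can check afterwards that the temporary directory was removed. The variable workdir_file is a hypothetical helper used for that check.

```bash
#!/bin/bash
# Sketch: EXIT-trap cleanup that also runs when `set -e` aborts the work.
workdir_file=$(mktemp)      # records the temp dir path for the later check

(
  set -e
  workdir=$(mktemp -d)
  echo "$workdir" >"$workdir_file"
  trap 'rm -rf "$workdir"' EXIT    # cleanup runs on every exit path
  touch "$workdir/partial-output"
  false                            # simulated failure: set -e exits here
  echo "not reached"
)
echo "subshell exit status: $?"

workdir=$(cat "$workdir_file")
rm -f "$workdir_file"
if [[ -e $workdir ]]; then
  echo "left behind: $workdir"
else
  echo "cleaned up: $workdir"
fi
```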
Error handling is an important part of any programming language. Bash lacks the dedicated error-handling constructs that many other programming languages provide, yet it is essential to guard a Bash script against errors when it is executed from the terminal. Error handling can be implemented in a Bash script in several ways, and this tutorial shows different techniques for handling errors in Bash scripts.
Example 1: Error Handling Using a Conditional Statement
Create a Bash file with the following script, which uses conditional statements for error handling. The first “if” statement checks the total number of command-line arguments and prints an error message if there are fewer than 2. Next, the dividend and divisor values are taken from the command-line arguments. If the divisor is 0, bc reports an error, which is written to the error.txt file. The second “if” statement checks whether the error.txt file is empty, and an error message is printed if it is not.
#!/bin/bash
#Check the argument values
if [ $# -lt 2 ]; then
echo "One or more arguments are missing."
exit 1
fi
#Read the dividend value from the first command-line argument
dividend=$1
#Read the divisor value from the second command-line argument
divisor=$2
#Divide the dividend by the divisor
result=$(echo "scale=2; $dividend/$divisor" | bc 2>error.txt)
#Read the content of the error file
content=$(cat error.txt)
if [ -n "$content" ]; then
#Print the error message if the error.txt is non-empty
echo "Division by zero error occurred."
else
#Print the result
echo "$dividend/$divisor = $result"
fi
Output:
The following output appears after executing the previous script without any argument:
The following output appears after executing the previous script with one argument value:
The following output appears after executing the previous script with two valid argument values:
The following output appears after executing the previous script with two argument values where the second argument is 0. The error message is printed:
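An alternative to inspecting bc's error output afterwards is to validate the arguments before dividing. In this sketch, safe_divide() is a hypothetical name. awk is used instead of bc to print the quotient with two decimals; that substitution is an assumption of the sketch, not part of the original script.

```bash
#!/bin/bash
# Sketch: validate inputs up front, then divide. safe_divide() is hypothetical;
# awk replaces bc here only for this illustration.
safe_divide() {
  if (( $# < 2 )); then
    echo "One or more arguments are missing." >&2
    return 2
  fi
  local dividend=$1 divisor=$2
  local number='^-?[0-9]+([.][0-9]+)?$'
  local zero='^-?0+([.]0+)?$'
  if [[ ! $dividend =~ $number || ! $divisor =~ $number ]]; then
    echo "Both arguments must be numbers." >&2
    return 2
  fi
  if [[ $divisor =~ $zero ]]; then
    echo "Division by zero error occurred." >&2
    return 1
  fi
  awk -v a="$dividend" -v b="$divisor" 'BEGIN { printf "%.2f\n", a / b }'
}

safe_divide 10 4                    # 2.50
safe_divide 10 0 || echo "exit status: $?"
```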
Example 2: Error Handling Using the Exit Status Code
Create a Bash file with the following script, which shows error handling based on the exit status code. The script reads a command name as input and then executes it. If the exit status code is not zero, an error message is printed; otherwise, a success message is printed.
#!/bin/bash
#Take a Linux command name
echo -n "Enter a command: "
read -r cmd_name
#Run the command
$cmd_name
#Check whether the command is valid or invalid
if [ $? -ne 0 ]; then
echo "$cmd_name is an invalid command."
else
echo "$cmd_name is a valid command."
fi
Output:
The following output appears after executing the previous script with the valid command. Here, the “date” is taken as the command in the input value that is valid:
The following output appears after executing the previous script for the invalid command. Here, the “cmd” is taken as the command in the input value that is invalid:
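A non-zero status does not always mean the command is invalid, because a valid command can also fail. This sketch separates the two cases. It uses the `command -v` builtin to check whether the command exists (status 127 if not) before running it. run_checked() is a hypothetical name, and it passes any arguments through with "$@".

```bash
#!/bin/bash
# Sketch: tell "not found" apart from "ran but failed".
# run_checked() is a hypothetical name.
run_checked() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "$1 is not a valid command." >&2
    return 127
  fi
  "$@"
  local status=$?
  if (( status != 0 )); then
    echo "$1 ran but failed with status $status." >&2
  fi
  return "$status"
}

run_checked date
run_checked no_such_command_for_demo || echo "status: $?"   # status: 127
```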
Example 3: Stop the Execution on the First Error
Create a Bash file with the following script that shows how to stop the execution at the first error. The script contains two invalid commands, but because the "set -e" option makes Bash exit as soon as a command returns a non-zero status, only the first invalid command runs and the rest of the script is skipped.
#!/bin/bash
#Set the option to terminate the script on the first error
set -e
echo 'Current date and time: '
#Valid command
date
echo 'Current working directory: '
#Invalid command
cwd
echo 'login username: '
#Valid command
whoami
echo 'List of files and folders: '
#Invalid command
list
Output:
The script prints the current date and time and the working-directory label. Bash then reports that "cwd" is not found, and the script exits immediately with status 127, so the remaining commands, including the second invalid command, never run.
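Apart from Bash's own "command not found" message, set -e does not say where the script stopped. One common addition is an ERR trap. The following is a sketch under stated assumptions: demo is a hypothetical function name, and its body runs in a subshell only so the example can be called without ending the current shell:

```bash
#!/bin/bash
# Sketch: set -e plus an ERR trap, so the script reports which command
# failed, and on which line, before it exits. demo is a hypothetical name;
# "cwd" is the same deliberately invalid command used in Example 3.
demo() (
  # The parentheses run the body in a subshell, so set -e ends the
  # subshell rather than the shell that calls demo
  set -e
  # BASH_COMMAND holds the command that was running when the trap fired
  trap 'echo "Error: \"$BASH_COMMAND\" failed on line $LINENO" >&2' ERR
  echo 'Current date and time:'
  date
  echo 'Current working directory:'
  cwd
  echo 'This line is never reached'
)

# Call demo as a plain command: inside "demo || ..." or "if demo",
# Bash ignores set -e for the whole function body
demo
echo "demo exited with status $?"
```

Bash ignores set -e for commands tested by if, while or until and for commands on the left of && or ||, and this extends to the whole body of a function called in such a position. That is why demo is called on its own line and its status is read afterwards.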
Example 4: Stop the Execution for Uninitialized Variable
Create a Bash file with the following script that shows how to stop the execution when an uninitialized variable is used. The username and password are read from the first and second command-line arguments. Because of "set -u", referencing a missing argument ($1 or $2) makes Bash print an "unbound variable" error and exit. If both arguments are given, the script checks whether the username and password are valid.
#!/bin/bash
#Set the option to terminate the script for an uninitialized variable
set -u
#Set the first command-line argument value to the username
username=$1
#Set the second command-line argument value to the password
password=$2
#Check whether the username and password are valid
if [[ $username == 'admin' && $password == 'hidenseek' ]]; then
echo "Valid user."
else
echo "Invalid user."
fi
Output:
Without any command-line argument, the script stops at the first assignment, and Bash reports that $1 is an unbound variable.
With one command-line argument, the script stops at the second assignment, and Bash reports that $2 is an unbound variable.
With the arguments "admin" and "hide", the username is valid but the password is not, so the "Invalid user." message is printed.
With the arguments "admin" and "hidenseek", both values are valid, so the "Valid user." message is printed.
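set -u catches a missing argument, but the "unbound variable" message it prints is terse. A sketch of a common refinement uses the ${parameter:?message} expansion, which stops a non-interactive shell with a custom message when the parameter is unset or empty. check_user is a hypothetical name, and the body is wrapped in a subshell only so the example can be called several times:

```bash
#!/bin/bash
# Sketch: keep set -u, but use ${N:?message} so a missing argument
# produces a clear usage message instead of a bare "unbound variable" error.
# check_user is a hypothetical name; the parentheses run the body in a
# subshell so that a failed check ends only that subshell.
check_user() (
  set -u
  # ${1:?...} stops with the given message if $1 is unset or empty
  username=${1:?"usage: check_user USERNAME PASSWORD (username missing)"}
  password=${2:?"usage: check_user USERNAME PASSWORD (password missing)"}
  if [[ $username == 'admin' && $password == 'hidenseek' ]]; then
    echo "Valid user."
  else
    echo "Invalid user."
  fi
)

check_user admin hidenseek    # prints "Valid user."
check_user admin hide         # prints "Invalid user."
check_user admin || echo "check_user failed with status $?"
```

Unlike set -u alone, :? also rejects an empty argument such as check_user admin "", which is usually what a login check wants.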
Conclusion
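This tutorial showed several ways to handle errors in Bash scripts: checking the number of arguments, inspecting a command's error output and exit status, and using "set -e" and "set -u" to stop a script at the first failed command or unset variable. We hope these examples help Bash users add error handling to their own scripts.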
About the author
I am a trainer of web programming courses. I like to write articles and tutorials on various IT topics. I have a YouTube channel, Tutorials4u Help, where many types of tutorials on Ubuntu, Windows, Word, Excel, WordPress, Magento, Laravel, etc. are published.