This tutorial is intended to give you a foundation in writing Bash scripts, so that you can get the computer to do complex, repetitive tasks for you. You won’t be a Bash expert by the end, but we hope you will be well on your way, with the knowledge and skills to get there if that’s what you want.
Bash scripts are used by Systems Administrators, Programmers, Network Engineers, Scientists and just about anyone else who uses a Linux / Unix system regularly. No matter what you do or what your general level of computer proficiency is, you can generally find a way to use Bash scripting to make your life easier. Bash is a command line language and the default shell on most Linux distributions. It was also the default shell on Apple’s macOS (formerly OS X) until version 10.15 (Catalina), which switched to zsh; Bash is still installed on macOS.
Since it is a programming language, more often than not we will be writing bash scripts. We will deal with this shortly but beforehand, it is worth learning about some fundamental features of the bash language. These will help us build towards creating efficient and effective scripts.
Variables in programming languages are just things that we refer to in the environment we are working in. One way to think of them is like naming objects in real life. In other words, you can think of a variable as short hand for something you want your code to refer to. In bash, we declare a variable like so:
VARIABLE="Hello world"
Note that the variable name doesn’t have to be lowercase or all caps, but all caps is the convention in bash. It is worth sticking to this convention because command line tools almost always have lowercase names, so uppercase variables stand out from them. This makes your code much easier to read.
Recalling variables is also simple: we just precede our variable name with $. To print it to the screen, we use the utility echo:
echo $VARIABLE
We can also declare multiple variables and combine them together:
NAME="my name is Hal"
echo ${VARIABLE} ${NAME}
Note that we wrap the variable names here in curly brackets to mark exactly where each name begins and ends. This is not always necessary but it does ensure your code is interpreted properly. This makes more sense if we combine them in a string, i.e. to make them a full sentence.
echo "${VARIABLE}, ${NAME}"
Although these examples are actual words, more often than not, you will use variables to store the names of files. Using variables in your scripting is a really useful way to make your code very efficient. Doing this means you can change what an entire script does just by changing a single variable.
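To see why the curly brackets matter when building filenames, here is a minimal sketch. PREFIX is a hypothetical variable introduced only for this illustration. Without the brackets, bash reads the underscore and digits as part of the variable name and looks up a variable that was never set.

```shell
# PREFIX is a hypothetical variable used only for this illustration
PREFIX="sample"

# Without braces, bash looks for a variable called PREFIX_01, which is unset,
# so only the ".txt" part survives
WITHOUT_BRACES="$PREFIX_01.txt"

# With braces, the variable name ends at the closing bracket
WITH_BRACES="${PREFIX}_01.txt"

echo "Without braces: $WITHOUT_BRACES"
echo "With braces: $WITH_BRACES"
```

The second form is the one you almost always want when a variable sits directly next to other text.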
One last note on variables: we have actually already encountered one before this section of the tutorial, i.e. the $HOME variable. This is one of many environment variables which are stored in our Unix environment. It is best not to set any variables with the same names as these.
You can see all of the environment variables using the env command. Let’s have a look; please type env:
env
Now that we have learned to create variables, we can also explore how to manipulate them. This is not always straightforward in bash, but again it is really worth learning how to do this as it can make script writing much more straightforward. In most cases we will use string manipulation on filenames and paths, so we will use this as an example now.
First, let’s declare a variable. We’ll make a dummy filename in this instance:
FILE="$HOME/an_example_file.txt"
Let’s echo this back to the screen:
echo $FILE
An important point to note here is that the $HOME variable has been interpreted so that we now have the entire file path.
What if we just want the actual filename, i.e. without the directory or path? We can use basename:
basename $FILE
Alternatively, we could remove the filename and keep only the directory or path:
dirname $FILE
For now though, we want to operate on the filename itself, so let’s redeclare the variable so it is only the filename.
FILE=$(basename $FILE)
echo $FILE
Note that here, we have to wrap the basename $FILE command in $() because we want to store the output of a command in the variable, not the literal text of the command.
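As a small sketch of the same idea, the output of both basename and dirname can be captured in separate variables. The path /home/user is a hypothetical directory chosen for this illustration; neither tool needs the file to exist.

```shell
# /home/user is a hypothetical directory used only for this illustration
FILE="/home/user/an_example_file.txt"

# $() runs the command inside it and hands back whatever that command printed
NAME=$(basename "$FILE")
DIR=$(dirname "$FILE")

echo "File name: $NAME"
echo "Directory: $DIR"
```

Quoting "$FILE" is a precaution so that paths containing spaces are passed to the command as a single argument.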
OK, so on to some proper string manipulation. Let’s remove the .txt suffix.
echo ${FILE%.*}
What did we do here? First we have to wrap the entire variable name in curly brackets; this will not work without them. The % denotes that we want to delete the shortest match of the pattern that follows it from the end of the value. The pattern here is .*, i.e. a period followed by anything, so everything from the last period onwards is removed.
Note that the following would have also worked:
echo ${FILE%.txt}
We don’t need to limit ourselves to the suffix. We could also delete everything from the last underscore onwards, like so:
echo ${FILE%_*}
We could also delete everything from the first underscore onwards:
echo ${FILE%%_*}
We can also delete from the front of the string. For example:
echo ${FILE#*.}
This deletes everything up to and including the period character. We could also do the same with the underscores:
echo ${FILE#*_}
echo ${FILE##*_}
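Here a single # deletes the shortest match from the front of the string, i.e. everything up to and including the first underscore, while a double ## deletes the longest match, i.e. everything up to and including the last underscore. This mirrors % and %%, which do the same from the end of the string.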
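The four operators are easier to remember side by side. This sketch applies each of them to the same filename used above; the comments show what each expansion leaves behind.

```shell
FILE="an_example_file.txt"

# % and %% trim from the end of the value
echo "${FILE%.*}"     # an_example_file  (shortest match of .* removed)
echo "${FILE%_*}"     # an_example       (shortest match of _* removed)
echo "${FILE%%_*}"    # an               (longest match of _* removed)

# # and ## trim from the front of the value
echo "${FILE#*.}"     # txt              (shortest match of *. removed)
echo "${FILE#*_}"     # example_file.txt (shortest match of *_ removed)
echo "${FILE##*_}"    # file.txt         (longest match of *_ removed)
```

None of these expansions change FILE itself; they only change what is printed.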
You might be wondering, what exactly is the point of this? Well, altering filenames is very important in most bioinformatics pipelines. For example, with simple string manipulation you can change the suffix of a filename quickly and easily:
echo $FILE
echo ${FILE%.*}.jpg
One last point here; string manipulation in bash is not straightforward. It takes a lot of practice to get right and remember properly. We google this excellent tutorial nearly all the time!
Control flow is an important part of many different programming languages. It is essentially a way of controlling how code is carried out.
Imagine you have to perform the same operation on many different files - do you want to type out a command for each and every one of them? Of course not! This is why you might use control flow to repeat a command multiple times. There are many different types of control flow, but for now we will focus on the most common one - a for loop.
Now, let’s have a look at a simple example:
for i in {1..10}
do
echo "This is $i"
done
Indentation within the loop is not required but helps legibility.
This chunk of code is saying that for each number between 1 and 10, echo “This is 1”, “This is 2” and so on to the screen. do and done start and stop the loop respectively. Here, the variable i is used within the loop but this is completely arbitrary: you can use whatever variable name you would like. Indeed, it is often much more convenient to use a name that makes sense to you. For example:
for NUMBER in {1..10}
do
echo "This is $NUMBER"
done
It doesn’t just have to be numbers either. You can use a loop to iterate across multiple strings too. For example:
for NAME in Tony Luc Natacha Patrice Amanda
do
echo "My name is $NAME"
done
Of course, this is a silly example, but you could easily substitute this with filenames - making it quite clear why control flow is an essential skill for effective bash programming in bioinformatics.
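As a rough sketch of that substitution, the loop below walks over three filenames and uses the string manipulation from earlier to work out a new name for each. The filenames are hypothetical and no files are touched; the loop only prints text.

```shell
# reads_A.txt and friends are hypothetical filenames; nothing is created or renamed
for FILE in reads_A.txt reads_B.txt reads_C.txt
do
    echo "${FILE} would become ${FILE%.txt}.jpg"
done
```

After the loop finishes, FILE still holds the last value it was given.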
Imagine we want to run a for loop on some text files. We’ll make five of them to demonstrate - and we can actually do this with a for loop too:
for i in {1..5}
do
touch file_${i}.txt
done
Use ls after running this code and you’ll see five text files.
Now, what if we want to do something simple like go through all of them and print their names to the screen? We could do it like this:
for FILE in *.txt
do
echo $FILE
done
This works really well for this simple example. However, as your code becomes more advanced, it is easy for something like this to become quite dangerous. The *.txt pattern is expanded once, when the loop starts, to every .txt file in the current directory, so the loop will also pick up any .txt files that earlier steps created or that you never meant to process.
For this reason, it is good practice in bash to use arrays. These are essentially predefined lists of values. They are easy to make too. Let’s try a simple example.
ARRAY=(Natacha Patrice Amanda)
Now we can try printing this to the screen:
echo $ARRAY
This only prints the first value of our array. Actually, arrays have indexes, so we can print any value we specify like so:
echo ${ARRAY[0]}
echo ${ARRAY[1]}
echo ${ARRAY[2]}
Notice that, like Python (and unlike R), bash arrays are zero-indexed, i.e. the first element has index zero and so on.
What if we want to print everything in the array?
echo ${ARRAY[@]}
echo ${ARRAY[*]}
Either of these will work fine.
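Two related expansions are often useful alongside these. The sketch below counts the elements with ${#ARRAY[@]} and uses that count to reach the last element. LAST_INDEX is a name introduced only for this illustration.

```shell
ARRAY=(Natacha Patrice Amanda)

# ${#ARRAY[@]} gives the number of elements in the array
echo "Number of elements: ${#ARRAY[@]}"

# Because indexes start at zero, the last index is the count minus one
# (LAST_INDEX is a name chosen for this illustration)
LAST_INDEX=$(( ${#ARRAY[@]} - 1 ))
echo "First element: ${ARRAY[0]}"
echo "Last element: ${ARRAY[$LAST_INDEX]}"
```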
We can also loop through the array, like so:
for CHARACTER in "${ARRAY[@]}"  # the quotes keep each element intact
do
echo $CHARACTER
done
The purpose of the array here is that it ensures the scope of our loop is limited and that it doesn’t get carried away, operating on things it shouldn’t.
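One detail worth knowing when looping over arrays is quoting. In this sketch the first element contains a space; writing the expansion as "${NAMES[@]}" in double quotes keeps it as a single item. NAMES and COUNT are names made up for this illustration.

```shell
# NAMES and COUNT are hypothetical names used only for this illustration
NAMES=("Jean Luc" "Natacha" "Amanda")
COUNT=0

# Quoting the expansion hands each element to the loop as one item,
# even when the element contains a space
for NAME in "${NAMES[@]}"
do
    echo "Element: $NAME"
    COUNT=$((COUNT + 1))
done

echo "Looped $COUNT times"
```

Without the double quotes, bash would split "Jean Luc" into two words and the loop would run four times instead of three.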
One last point about arrays - it is often quite cumbersome to define them by hand. Imagine you wanted to make an array for hundreds of files. Luckily you can also declare them from for loops:
ARRAY2=($(for i in *.txt
do
echo $i
done))
This doesn’t look very neat though - you can actually write a for loop like this on a single line - i.e.:
ARRAY2=($(for i in *.txt ;do echo $i; done))
Here ; separates commands, taking the place of a line break.
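A shorter alternative, offered here as a sketch rather than as the approach used in the rest of this tutorial, is to put the file pattern directly inside the array brackets. The example works in a scratch directory made with mktemp so that it does not depend on the files in your current directory; WORKDIR is a name chosen for this illustration.

```shell
# Work in a throwaway directory (WORKDIR is a name chosen for this illustration)
WORKDIR=$(mktemp -d)
cd "$WORKDIR" || exit 1
touch file_1.txt file_2.txt file_3.txt

# The pattern is expanded to the matching filenames, one array element per file
ARRAY2=(*.txt)

echo "Found ${#ARRAY2[@]} files: ${ARRAY2[*]}"
```

Because each matching filename becomes its own element, this form also copes with filenames that contain spaces.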
The convenience of arrays, like much of this tutorial, will become much more apparent as you become more experienced in using Unix for bioinformatics.
So far we have learned a lot about bash as a programming language - but can we use it to write a program? Well actually… yes! This is extremely easy and it is exactly what we set out to do when we write a bash script.
Let’s start with a really basic example. Type nano into the command line in order to open the nano text editor. Note that you can also use one of your GUI editors like Notepad++, Gedit or Atom.
Then we can write a simple bash script, like so:
#!/bin/bash
# a simple bash script
echo "Hello world"
exit
Save it as hello_world.sh. You can take another look at the file with cat or less if you want to check you saved it properly. Let’s break down some of the script.
- First of all there is the #!/bin/bash line. You don’t need to worry too much about that; it tells the system which program should run the script, and pointing it at bash ensures the script is run in the bash language. (The more generic #!/bin/sh is not guaranteed to be bash on every system.)
- We also have another line starting with #; this is just a comment. Here it explains something about the script. Comments are really important and you should fill your script with them: they are a good way of letting yourself know what you have done. They can be invaluable when you come back to your scripts after some time away…
Now we can actually run the script. We do that like so:
bash hello_world.sh
You just wrote your first program.
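Besides passing the file to the shell by name, a common alternative is to mark the script as executable with chmod and then run it by its path. The sketch below writes the script into a scratch directory so that it is self-contained; SCRIPT_DIR and OUTPUT are names introduced for this illustration, and it assumes bash is installed at /bin/bash.

```shell
# SCRIPT_DIR is a throwaway directory used only for this illustration
SCRIPT_DIR=$(mktemp -d)

# Write the script to a file (this stands in for saving it from nano)
cat > "$SCRIPT_DIR/hello_world.sh" << 'EOF'
#!/bin/bash
# a simple bash script
echo "Hello world"
EOF

# Give the file execute permission, then run it directly by its path
chmod +x "$SCRIPT_DIR/hello_world.sh"
OUTPUT=$("$SCRIPT_DIR/hello_world.sh")
echo "$OUTPUT"
```

When a script is run this way, the first line of the file decides which program interprets it.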
You can also write a script that is interactive. Let’s write another script to take an input from the command line.
Open nano and create a script with the following:
#!/bin/bash
# a simple bash script with input
echo "My name is ${1} and my best friend is ${2}"
exit
Save it as name_script.sh. In this script, the ${1} and ${2} variables just specify that the script will take the first and second arguments given to it on the command line. Let’s see it in action.
bash name_script.sh David Chris
Feel free to add whatever combination you want in here. You can actually try running this without the arguments too and see that it still works, just the output doesn’t make much sense.
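If you would rather have the script complain when arguments are missing, one possible approach is to check $#, which holds the number of arguments supplied. The sketch below wraps the logic in a hypothetical function called greet so that it can be tried directly at the prompt; inside a function, $#, ${1} and ${2} refer to the function's arguments in the same way they refer to a script's arguments. It also uses an if statement, which this tutorial has not covered.

```shell
# greet is a hypothetical function standing in for the body of name_script.sh
greet() {
    # $# is the number of arguments that were supplied
    if [ "$#" -ne 2 ]
    then
        echo "Usage: greet NAME FRIEND"
        return 1
    fi
    echo "My name is ${1} and my best friend is ${2}"
}

greet David Chris
greet David || echo "(the second call failed because an argument was missing)"
```

In a script file you would use exit 1 in place of return 1 to stop the script with an error status.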
Now we have learned a little about how to script, let’s write one that will do something for us. We’ll create five files (again using a for loop) and then convert them all from .txt to .jpg.
Firstly, let’s make those files:
for i in {1..5}
do
touch file_${i}.txt
done
Now, we can open up nano and write our script. We would do this like so:
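#!/bin/bash
# a script to rename files
# (arrays are a bash feature, so this script must be run with bash)
# declare an array
ARRAY=($(for i in file*.txt; do echo $i; done))
# loop over array
for FILE in "${ARRAY[@]}"
do
echo "Creating ${FILE%.*}.jpg"
# swap the .txt suffix for .jpg
mv "$FILE" "${FILE%.*}.jpg"
done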
Now save this script as rename_file.sh. If you run this as bash rename_file.sh, it will print the name of each file it converts to the screen. You can then use ls to see that it has indeed converted all the .txt files to .jpg.
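If you want to try the renaming logic without touching any real files, one option is to run it inside a scratch directory. The sketch below does this with mktemp; SCRATCH and RENAMED are names chosen for this illustration, and the array is filled directly from the file pattern instead of from a for loop.

```shell
# SCRATCH is a throwaway directory so that no real files are renamed
SCRATCH=$(mktemp -d)
cd "$SCRATCH" || exit 1

# make five empty text files
for i in {1..5}
do
    touch "file_${i}.txt"
done

# collect the files in an array, then rename each one
ARRAY=(file*.txt)
for FILE in "${ARRAY[@]}"
do
    echo "Creating ${FILE%.*}.jpg"
    mv "$FILE" "${FILE%.*}.jpg"
done

# check the result (RENAMED is a name chosen for this illustration)
RENAMED=(*.jpg)
echo "Renamed ${#RENAMED[@]} files: ${RENAMED[*]}"
```

Note that mv only changes the name of each file; the contents are not converted into an image.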
With all the skills we have learned in this tutorial, it is now time for you to put them to the test. Return to the unix_exercises directory and write a short script to do the following:
To learn more about Unix Command Line:
This tutorial is adapted from Mark Ravinet & Joana Meier’s course.