Computational Biology Resources FTW
This is my ever-growing collection of links, solutions and sources I have discovered and used when trying to learn and teach computational biology. I often use it as a one-stop resource page for whoever asks me about a good book, a website, or that command that lets you execute line 45 from history. It is also a place to learn about handling data in shell and R.
A bunch of papers
Use these if you need a good reference, or just want to persuade your colleague or supervisor that she really needs to get to where the puck is going to be.
- Loman, N. & Watson, M. So you want to be a computational biologist? Nat Biotechnol 31, 996–998 (2013).
- Wilson, G. et al. Best Practices for Scientific Computing. PLoS Biol 12, e1001745 (2014).
- Wilson, G. et al. Good Enough Practices in Scientific Computing. PLoS Comput Biol 13, e1005510 (2017).
- Tippmann, S. Programming tools: Adventures with R. Nature 517, 109–110 (2015).
- Lindsay Barone, Jason Williams, David Micklos Unmet Needs for Analyzing Biological Big Data: A Survey of 704 NSF Principal Investigators (2017) PLoS Comput Biol 13(10): e1005755.
- Melissa A. Wilson Sayres et al. Bioinformatics Core Competencies for Undergraduate Life Sciences Education PLoS ONE 13, e0196878–20 (2018).
- For a very contrarian view, be sure to read this blog post from Vicki Boykis Data science is different now and plan accordingly.
R For The Win
- If you want just one or at most two things to explain to someone why R is super awesome, show them Paul Campbell’s presentation A whirlwind tour of working with data in R and Gina Reynolds’ The Tidyverse in Action. You’re welcome.
- While in certain fields SPSS is still holding up (see The Popularity of Data Science Software and Popularity of statistical softwares in epidemiology), R is poised to take over in number of citations by 2020: SPSS is dying. It’s time to change. See also The Top Programming Languages 2019 at the IEEE Spectrum website.
Also, bioinformatics != computational biology.
Highly recommended books on computational biology and data science
- Practical Computing for Biologists by Steven H.D. Haddock and Casey W. Dunn. It covers command line, Python, installing software and manipulation of graphics.
- Bioinformatics Data Skills by Vince Buffalo. Shell, R, Git with emphasis on life science data analysis, including next-generation sequencing file handling.
- R for Data Science by Garrett Grolemund and Hadley Wickham. Solid introduction to `tidyverse` ways of handling data and analysis by the creators and evangelists :-)
- STAT 545: Data wrangling, exploration, and analysis with R by Jenny Bryan, the `tidyverse` expanded :-)
And in particular these about visualisations:
- R Graphics Cookbook by Winston Chang. `ggplot2` explained using clear examples akin to recipes (“if you want to plot this, do this and that”).
- Fundamentals of Data Visualization by Claus O. Wilke
- Data Visualization: A practical introduction by Kieran Healy
- Also check The R Graph Gallery
A more thorough list is available at bookdown.org.
A good book to learn Python
- Automate the Boring Stuff with Python by Al Sweigart. The link leads to a free online version, but there are also a hard copy and an ebook version available.
Do not use Excel for handling dates and gene identifiers!
In particular, do not export gene IDs and dates to Excel and then import them back to R or other programming tools. You have been warned.
- Zeeberg, B. R. et al. Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics. BMC Bioinformatics 5, 80 (2004). Also check this blog post (with comments), from 2012: Gene name errors and Excel: lessons not learned.
- Ziemann, M. et al. Gene name errors are widespread in the scientific literature Genome Biology (2016) (sic) 17:177
- Mallona, I. & Peinado, M. A. Truke, a web tool to check for and handle excel misidentified gene symbols. 1–3 (2017). doi:10.1186/s12864-017-3631-8
If you have to use Excel for dates, split your date into three numerical columns: year, month and day and use package lubridate to handle the dates after importing to R. Also, here is a good website with tricks for power users.
However, Excel is often good enough for many things, and sometimes it is inevitable. Before you go for it, have a look at this paper: Karl W. Broman & Kara H. Woo (2018) Data Organization in Spreadsheets, The American Statistician, 72:1, 2-10, DOI: 10.1080/00031305.2017.1375989
Get a good text editor
This is essential. A good text editor has to support regular expressions and understand different line ending conventions. All the software below is free to use.
- Notepad++ on Windows
- BBEdit on Macs (free version is powerful enough and entirely sufficient for a start)
- Gedit on Linux (available by default on Ubuntu)
- Visual Studio Code on everything, made by Microsoft and actually very good ;-)
- Atom on everything (it runs as a Chrome-based browser)
Do it in style
Code style guides for R. I recommend tidyverse style guide but it’s not the only one (see also Google R code style guide and Jean Fan’s R code style guide). The important part is to pick one and stick to it. There is also a package to help you with adhering to the tidyverse style guide: styler.
Also important:
- Naming things Jenny Bryan’s definitive slides on how to name things FTW
- Project structure slide deck by Danielle Navarro. Also covers file paths (very important and very confusing to students these days!) and naming things. Make sure to check other talks/slides by Danielle on her website.
- 365 useful functions in R - a Twitter thread turned into a book.
- Full R documentation online (including 13k+ packages)
- How to write a reproducible example. If you need to ask for R help online, this is how you do it. Now in a form of R package: reprex
- Reserved words in R. The list is short: `if`, `else`, `repeat`, `while`, `function`, `for`, `in`, `next`, `break`, `TRUE`, `FALSE`, `NULL`, `Inf`, `NaN`, `NA`, `NA_integer_`, `NA_real_`, `NA_complex_`, `NA_character_`
Tools useful in teaching or just for mucking about
- rstudio.cloud - what it says on the tin. Free to use (for now), and it seems the RStudio people think this is going to be big for teaching R: https://resources.rstudio.com/webinars/rstudio-cloud-in-the-classroom by Mine Cetinkaya-Rundel.
- learnr Interactive tutorials with R Notebook and Shiny.
Collaborative notebooks and recorders
- etherpad for collaborative real time editing (a la Google Docs). This is what Software and Data Carpentry use, but you need to host it (there are free public hosts available).
- HackMD a possibly better alternative to etherpad. Does not require hosting and uses Markdown (it formats the text automatically).
- asciinema Recording your shell sessions is useful for your students, and this system lets you select the text in the recording and copy/paste it! What would be super useful, though, is a real-time shell recording system that would output the recording as-is (both commands and their output) to an accessible location like a website or even a file.
Make your website, blog, paper or a presentation with R
- R Blogdown is a fantastic way to set up your website from within R (this Twitter thread from Dan Quintana is rather useful as well).
- Do not miss Alison Hill’s excellent tutorial on setting up your Blogdown website. She also developed a nice Hugo template for personal websites.
- If you want to write a book or a paper within R, try R Bookdown and rticles.
- If you want to prepare your presentation within R, with dynamic R code, use xaringan. (BTW, blogdown, bookdown and xaringan are all made by the amazing Yihui Xie). And definitely check Gina Reynolds’ ggplot flipbook concept, where each element and layer of the ggplot plot is revealed step-by-step.
- Hugo + Netlify seem to be the new Jekyll + GitHub Pages. But as of 2021, I use and recommend Distill as the most straightforward way of putting your website or blog online, as long as you don’t mind sticking largely to the default aesthetics.
Regular expressions in R
- regexplain by Garrick Aden-Buie is an R Studio addin that enables interactive construction of regular expressions with real-time preview of their workings.
- RVerbalExpressions by Tyler Littlefield is an R package that uses natural-language-like expressions to construct regular expression patterns.
Other stuff
- Choose an open source license: great source to figure out in plain English what license to use for your open source project.
Some teaching ideology, with emphasis on R
- Software Carpentry’s founder Greg Wilson’s book on teaching programming: How to Teach Programming (And Other Things). Free versions available on his site, as an epub, mobi or as a low-cost hard copy.
- Brown, N. and Wilson, G. Ten quick tips for teaching programming, PLoS Comput Biol 14(4): e1006023 (2018).
- David Robinson’s Teach the tidyverse to beginners. Very sensible, but do check the comments that point out the advantages of `base` R. The complementary `Tidyverse` vs `base` R philosophies are actually a result of the evolution of R and its users, which Roger Peng expertly summarised in his talk Teaching R to New Users - From tapply to the Tidyverse.
- For a contrarian view of the Tidyverse, read An alternate view of the Tidyverse “dialect” of the R language, and its promotion by RStudio by Norm Matloff and Why R is Hard to Learn by Robert A. Muenchen.
- Mine Cetinkaya-Rundel used to teach stats with R and Git at Duke and is at the forefront of implementing these tools in a high-throughput teaching context. Check out her paper Infrastructure and tools for teaching computing throughout the statistical curriculum, her talk at the useR! conference Teaching data science to new useRs and the course itself: http://www2.stat.duke.edu/courses/Spring18/Sta199/.
- How to Help Someone Use a Computer, a very insightful list by Phil Agre (from 1996! - I guess nothing ever changes)
- A list of R courses: university, online, workshops, etc. All of the university-level courses listed so far are based in North America.
R tutorials/codethroughs I like
There has been a recent proliferation of tutorials on various aspects of R, so the below list has been expanded.
- Introduction to R for Statistics and Data Science by Dr Kelly Bodwin: “On this site, you will find materials for a full, 8-week, college-level course focused on learning to use R for Data Science and Statistical Analysis”.
- Learning Statistics with R and Data science with R, both by Danielle Navarro.
- Teaching Statistics and Data Science Online by Mine Çetinkaya-Rundel. There is now an open textbook: Introduction to Modern Statistics by Mine Çetinkaya-Rundel and Johanna Hardin.
- Data Visualization: Use R, ggplot2, and the principles of graphic design to create beautiful and truthful visualizations of data by Andrew Heiss
- Pretty much anything Jenny Bryan does, but in particular her UBC course mentioned above Data wrangling, exploration, and analysis with R (now as an online book!) and her tutorial on purrr.
- David Robinson’s step-by-step demonstrations of exploratory data analysis: Modeling gene expression with broom: a case study in tidy analysis and Cleaning and visualizing genomic data: a case study in tidy analysis. I also very much like his Tidy Tuesday code-through YouTube channel and the accompanying GitHub repository.
- Julia Silge’s amazing text mining walkthrough. She also has a book: Text Mining with R (free online version, paid hardcopy).
- Mara Averick’s collection of purrr tutorials.
- Suzan Baert’s crystal clear, in-depth four-part tutorial on dplyr and her 10 dplyr tips.
- A list of online resources for learning R from Martin Skarzynski: Free online #Rstats resources
- Practical Data Science: an introduction to the PeerJ collection “contains a series of short papers focused on the practical side of data science workflows and statistical analysis”
- Software and Data Carpentry R lessons are a bit inconsistent in their depth and scope, but I think the Data Carpentry R Ecology Lesson is the best one to start with.
- Also: dplyr vs data.table.
- Also also: Fastest way to edit multiple lines of code at the same time
Do not let Jenny Bryan set your computer on fire!
The only two things that make @JennyBryan 😤😠🤯. Instead use projects + here::here() #rstats pic.twitter.com/GwxnHePL4n
— Hadley Wickham (@hadleywickham) December 11, 2017
…use the right way to organise your R work:
- Prime Hints For Running A Data Project In R by Kasia Kulma, with tips from commenters incorporated into her post. The best post on the topic that I know of.
- Project-oriented workflow where Jenny Bryan explains what’s up with the burning of computers.
- File organisation best practices by Andrew Tran that summarises and builds on Jenny’s and Joris Muller’s solutions.
- Sandve, G. K., Nekrutenko, A., Taylor, J. & Hovig, E. Ten Simple Rules for Reproducible Computational Research PLoS Comput Biol 9, e1003285 (2013).
Shell-fu
Some history of Unix
Apart from Wikipedia, this is a very nice overview of how the pipe and the Unix philosophy came about:
-
Pipe: How the System Call That Ties Unix Together Came About by David Cassell at The New Stack.
-
If you are a video person, try this: AT&T Archives: The UNIX Operating System on YouTube. 27 minutes of UNIX history, including amazing piping demo by the guy who invented it.
A nifty tool
This little Mac utility by Jay Tuley will install an icon in Finder that opens the current folder in Terminal: CDto
Recommended general tutorials and tools on command line
- The Missing Semester of Your CS Education, this is a top-notch one-stop-shop for learning shell stuff - if you can only try one thing about shell from this site, this is it
- http://ryanstutorials.net/linuxtutorial/navigation.php
- http://korflab.ucdavis.edu/Unix_and_Perl/
- Software Carpentry Unix Shell lesson
- explainshell.com will try to give you explanation for every element of a command line expression that you type (try it, it’s really cool)
- The Best Keyboard Shortcuts for Bash
- A great BASH scripting cheatsheet from @rstacruz at devhints.io
Overview of some second-generation command-line tools: `fd` and `fzf` look particularly cool.
How to install Bash shell on Windows 10
Three very useful and inexpensive or free books on command line
- Take Control: Command Line by Joe Kissell (aimed at Mac users, but good for everyone - as usual ;-)
- The UNIX workbench by Sean Kross (donationware); now with a Coursera course!
- Data Science at the Command Line by Jeroen Janssens
Shell prompt
Take time to make your terminal window and the font big enough!
- Default (at least on my machine): `\h:\W \u\$`
- How to check what’s your current prompt: `echo $PS1`
- How to change your prompt: `PS1="yournewprompt"`. A nice trick is to use `PS1="\n\W \u-$ "` so that you have a new line before your prompt - it’s visually separated from the output of a previous command.
Useful link with options to modify your prompt: https://www.cyberciti.biz/tips/howto-linux-unix-bash-shell-setup-prompt.html
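A minimal sketch of trying out a new prompt in the current Bash session. It uses the prompt string suggested above and does not write anything to your startup files, so closing the terminal undoes it; the variable name `old_prompt` is just an illustration.

```shell
# Remember the current prompt so it can be restored (empty in a non-interactive shell)
old_prompt="${PS1-}"

# Blank line, current folder, user name, then "-$ "
# Single quotes keep the backslash sequences literal until Bash draws the prompt
PS1='\n\W \u-$ '

# Inspect the new setting; to go back, run: PS1="$old_prompt"
echo "$PS1"
```

To keep the prompt between sessions, the same `PS1=...` line would go into one of the startup files discussed in the next section.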
Difference between .bash_profile and .bashrc
This is relevant for modifying the `$PATH`:
- https://www.joshstaiger.org/archives/2005/07/bash_profile_vs.html
- http://stackoverflow.com/questions/9832770/where-is-the-default-terminal-path-located-on-mac
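As a hedged illustration, this is the kind of line you might put in `.bash_profile` (or `.bashrc`) to extend `$PATH`. The folder `$HOME/bin` and the variable `my_bin` are assumptions made up for the example; use whichever folder holds your own scripts.

```shell
# Hypothetical folder for your own scripts; adjust to taste
my_bin="$HOME/bin"

# Prepend it to PATH only if it is not there already,
# so re-reading the startup file does not keep growing PATH
case ":$PATH:" in
  *":$my_bin:"*) ;;                    # already present, do nothing
  *) export PATH="$my_bin:$PATH" ;;
esac

# Show the first few PATH entries, one per line
echo "$PATH" | tr ':' '\n' | head -n 3
```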
How to move around shell
- `control-a`: move cursor to beginning of line
- `control-e`: move cursor to end of line
- `control-c`: cancel input or stop a running command
- `control-k`: delete all text from cursor to end of line
- `control-d`: deletes a character in place
- `option-delete`: delete an entire word (may not work depending on whether your option key is reassigned; this is a preference in your Terminal settings)
- `option-b`: move cursor backwards an entire word (as above)
- `option-f`: move cursor forwards an entire word (as above)
- `up arrow`: access last entered command
- `control-r`: start searching shell history (start typing to search; enter will run the current command; `command-.` will cancel)
- `control-v + [some key]`: will literally print `[some key]` - useful if you want to enter a tab and `\t` doesn’t work
- `history`, then `![some number]`: where `[some number]` is the number of the history command you want to execute (no need to copy and paste)
- You can also narrow down the last command selection by including the first letter of the last command you want to use, e.g.: `!d` (if your favourite last command starts with `d`)
- `!$`: retrieves the last word of the last command
Clear your screen
How to really clear the terminal
- `clear`: clears the screen
- `control-l`: works just like `clear`
- `command-k`: clears the screen and prevents scrolling back
- `exit`: exit shell (it closes the terminal window)
Listing stuff (`ls`)
- `ls [a-z]*.txt`: list every .txt file whose name starts with a lowercase letter
- `ls {pear,peach}.txt`: lists pear.txt and peach.txt
- `ls -1`: show output in a single column
- `ls -alh`: show output including hidden files (`-a`), in a long format (`-l`) and human-readable file sizes (`-h`)
- `history`: displays history of the commands (can be piped into a file). If you don’t want the terminal to remember the history between sessions, start with this thread on Stack Overflow.
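The listing commands above can be tried safely in a throwaway folder. This is only a sketch: the file names are made up, Bash is assumed for the brace expansion, and the C locale is set as an assumption so that `[a-z]` means lowercase ASCII letters only (in some locales it can also match uppercase letters).

```shell
export LC_ALL=C            # assumption: C locale, so [a-z] is lowercase ASCII only
demo_dir=$(mktemp -d)      # throwaway folder with made-up file names
cd "$demo_dir"
touch pear.txt peach.txt Plum.txt .hidden

ls -1 {pear,peach}.txt     # brace expansion: exactly these two files, one per line
ls [a-z]*.txt              # names starting with a lowercase letter (skips Plum.txt)
ls -alh                    # long listing, hidden files included, readable sizes
```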
How to move around your folders
- `cd -`: go to the last folder
- `cd .`: stay in the current folder
- `cd ..`: go to the parent folder
Four ways to go home:
- `cd`
- `cd ~`
- `cd /Users/Jarek`
- `cd -` (if you were in your home folder in a previous command)
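A small sketch of how `cd -` hops back to the previous folder. The two folders are temporary ones created just for this demonstration, and the variable names are made up.

```shell
first_dir=$(mktemp -d)     # two throwaway folders to hop between
second_dir=$(mktemp -d)

cd "$first_dir"
cd "$second_dir"
cd -                       # back to first_dir (cd - also prints where it lands)
here=$(pwd)
echo "$here"
```

Running `cd -` once more would take you to `second_dir` again, so it works as a toggle between two folders.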
If your folder or file names include spaces
- `\`: will escape the space character (e.g. “My\ folder”)
- If you drag your folder from Finder to a Terminal window, it will automatically recognise the path to this folder and escape spaces
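For illustration, here are the two usual ways of handling spaces, quoting and escaping, in a throwaway folder. The folder and file names are invented for the example.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"

mkdir "My folder"                    # quotes keep the space inside a single name
touch My\ folder/field\ notes.txt    # a backslash escapes each space instead
ls -1 "My folder"
```

Without the quotes or backslashes, `mkdir My folder` would create two folders, `My` and `folder`.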
To repeat last command
- `!!`: works just like the `up arrow`, but you can modify it by adding stuff in front or behind it, e.g.: `!! -h` or `sudo !!`
- You can also narrow down the last command selection by including the first letter of the last command you want to use, e.g.: `!d` (if your favourite last command starts with “d”)
Reading/displaying text files
- `cat`
- `less`: space to move forward, B to move back, Q to quit
- `more`: `more` on a Mac is the same as `less`
- `head`: show first few lines of the file; parameter -n specifies number of lines to show
- `tail`: as above, but for the end of the file
- `(head -n5; tail -n5) < inputfile`: display the first and last 5 lines of the input file
- `touch newfilename`: will create an empty file with a name newfilename
- `touch existingfilename`: will update modification date of the existingfilename
- `head -n[line number]`: to display [line number] number of lines (if you want a range use pipes and `tail` after head -n)
- `wc`: word count (displays line, word and character count); `-l -w -c` limits display to line, word or character only
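The "range" trick mentioned above (piping `head` into `tail`) can be sketched like this, assuming a made-up ten-line file generated with `seq`:

```shell
demo_file=$(mktemp)
seq 1 10 > "$demo_file"              # ten lines containing 1 to 10

head -n 3 "$demo_file"               # first three lines
tail -n 2 "$demo_file"               # last two lines

# Lines 3 to 5: take the first five lines, then keep the last three of those
middle=$(head -n 5 "$demo_file" | tail -n 3)
echo "$middle"

# Reading from standard input makes wc print the count without the file name
line_count=$(wc -l < "$demo_file")
echo "$line_count"
```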
Wildcards in shell (to do stuff on more than one file at a time)
- `*`: a wildcard for “zero or more” characters (*og would catch anything that ends with “og” including just “og”)
- `?`: a wildcard for “any single” character (?og would catch: dog, fog, log etc.)
- `{}`: braces will select a list or a range of stuff ({A..Z}, {1..3}, {apple,pear,watermelon}; note there are no spaces after the commas) (this is called “brace expansion”)
Regular expressions and grep
Everything you wanted to know about regular expressions
Two useful regular expression testers
…but remember that regular expressions in Notepad++, Ruby, JavaScript or `grep` in the Mac terminal can have slightly different implementations (i.e. not all functions will work or not all functions will work the same way). When stuff doesn’t work, try `egrep` (extended grep, the same as `grep -E`) and always RTFM.
A cool regular expression recognition web app - you put in your input and it tries to automatically find a regexp pattern to match it. When it works, it’s like magic.
There is now also a way of testing and visualising regular expressions inside R studio: Regexplain by Garrick Aden-Buie. And if you want a very nerdy regular expressions’ testing site, try regexcrossword.com (this site tests you).
Wildcards for regular expression pattern matching
- `\w`: Letters, numbers and _
- `.`: Any character except \n \r
- `\d`: Numerical digits
- `\t`: Tab
- `\r`: Return character. Also used as the generic end-of-line character in BBEdit
- `\n`: Line-feed character. Also used as the generic end-of-line character in Notepad++
- `\s`: Space, tab, or end of line
- `[A-Z]`: A single character of the ranges indicated in square brackets
- `[^A-Z]`: A single character including all characters not in the brackets. Note that this will include \n unless otherwise specified, and may cause you to match across lines
- `\`: Used to escape punctuation characters so they are searched for as themselves, not interpreted as wildcards or special symbols
- `\\`: The \ symbol itself, escaped
Boundaries
- `^`: Match the start of the line, i.e., the position before the first character
- `$`: Match the last position before the end-of-line character
Quantifiers, used in combination with characters and wildcards
- `+`: Look for the longest possible match of one or more occurrences of the character, wildcard, or bracketed character range immediately preceding. The match will extend as far as it can while still allowing the entire expression to match.
- `*`: As above, matches as many occurrences of the previous character as possible, but allows for the character not to occur at all if the match still succeeds
- `?`: Modifies greediness of + or * to match the shortest possible match instead of longest
- `{}`: Specify a range of numbers to repeat the match of the previous character. For example: `\d{2,4}` matches between 2 and 4 digits in a row; `[AC]{4,}` matches 4 or more of the letter A or C in a row
Capturing and replacing
- `()`: Capture the search results between the parentheses for use in the replacement term
- `\1` or `$1`: Insert the captured text into the replacement term, with captures numbered in the order of their parentheses. Syntax depends on the text editor or language that you are using.
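Capturing and quantifiers can be tried straight from the shell. This is a sketch using `sed -E` and `grep -E` (extended regular expressions); these tools generally do not understand `\d`, so `[0-9]` stands in for it here. The input strings are made up.

```shell
# Swap "Surname, Name" using two capture groups and the \1 and \2 back-references
swapped=$(echo "Franklin, Rosalind" | sed -E 's/([A-Za-z]+), ([A-Za-z]+)/\2 \1/')
echo "$swapped"

# Quantifier: keep only the lines that contain a run of at least two digits
matches=$(printf 'a1\nb22\nc4444\n' | grep -E '[0-9]{2,4}')
echo "$matches"
```

Here `sed` uses the `\1` style; many text editors and some languages use `$1` for the same thing.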
Basic grep commands
- `grep "@" [file name]`: search for lines that contain “@”
- `grep -c "@" [file name]`: count matching lines
- `grep -v "@" [file name]`: find non-matching lines
- `grep -v -c "@" [file name]`: count non-matching lines
- `grep -c "^CGATA" [file name]`: count lines beginning with CGATA
- `grep "0\.98"`: greps literal dot
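To see these commands at work, here is a sketch on a tiny, made-up FASTQ-like file with two records. The file contents and variable names are assumptions for illustration; note that in real FASTQ files quality lines can also start with “@”, so counting “@” lines is only a rough check.

```shell
reads=$(mktemp)
# Hypothetical two-record FASTQ file
printf '@read1\nCGATAGGC\n+\nIIIIIIII\n@read2\nTTGACGATA\n+\nIIIIIIIII\n' > "$reads"

headers=$(grep -c "^@" "$reads")       # lines starting with @
starts=$(grep -c "^CGATA" "$reads")    # lines beginning with CGATA
others=$(grep -v -c "@" "$reads")      # lines without any @
echo "$headers $starts $others"
```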
Other bits that didn’t fit anywhere else
- `mkdir -p`: make nested directories at once (any missing parent directories are created too)
- `tr`: to substitute one set of characters with another or delete characters from a string
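A short sketch of both commands; the folder layout and the sequences are invented for the example.

```shell
demo_dir=$(mktemp -d)
# Creates project, project/data and project/data/raw in one go
mkdir -p "$demo_dir/project/data/raw"

rna=$(echo "GATTACA" | tr 'T' 'U')         # substitute every T with U
no_gaps=$(echo "GAT-TA-CA" | tr -d '-')    # delete every dash
echo "$rna $no_gaps"
```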
Extracting columns and sorting
- `cut`: will cut out characters or columns from a delimited file
- `cut -d":" -f2`: will first split each line into columns delimited with the “:” and then extract -f2 (second) column from each line
- `sort`: can use column numbers: `sort -k[number of the column]n` (n is for numerical, r is for reverse). You can combine sorting by column, i.e. first by column 3 then by 2: `sort -k 3 -k 2nr`
- `uniq`: will collapse multiple matches, but they have to be next to each other, so the file has to be sorted by `sort` first
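The three tools combine naturally in a pipeline. The sketch below assumes a made-up colon-delimited table (sample:gene:count). Two details are additions to the notes above: `sort` needs `-t":"` to split on colons rather than whitespace, and `-k3,3` restricts the sort key to column 3 only.

```shell
table=$(mktemp)
# Hypothetical colon-delimited table: sample:gene:count
printf 's1:BRCA2:10\ns2:TP53:7\ns3:BRCA2:3\ns4:TP53:12\n' > "$table"

# How often does each gene appear? uniq needs sorted input to collapse duplicates
cut -d":" -f2 "$table" | sort | uniq -c

# Row with the highest count: sort column 3 numerically, in reverse
top=$(sort -t":" -k3,3nr "$table" | head -n 1)
echo "$top"
```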
Prevent accidental deletion or overwriting files or folders
- `rm -i`: flag `-i` will prompt you to confirm before proceeding to remove. It can be used with other commands, such as `mv`.
Some less basic stuff
Git basics
Jenny Bryan’s book about Git for R users is great: Happy Git and GitHub for the useR.
- `git init`: to initialise repository (a tracked directory)
- `git remote add origin https://github.com/jarekbryk/example_repository.git`: to add remote repository link for local tracking
- `git add [files]`: to explicitly add [files] to tracking (files can also be explicitly ignored by listing them in a `.gitignore` file)
- `git commit`: to “save” the tracked version to the local repository, always with a [comment] on what was done: `git commit -m "[your comment here]"`
- `git status`: to check, er, status
- `git diff`: to check differences between the committed version and the current version (run it before `git add`; once changes are added, use `git diff --staged`)
- `git log`: to list all commits in reverse chronological order
- `git push -u origin master`: to upload local changes (“master”) to GitHub (“origin”)
- `git remote -v`: to list the remotes and their URLs, i.e. to check where your pushes go
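The local part of this workflow can be rehearsed in a throwaway folder. This is a sketch only: the file names, the ignored folder, the commit message and the user details are all made up, and the push step is left out because it needs a real remote.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q                                   # start tracking this folder

echo "raw_data/" > .gitignore                 # ignore a hypothetical data folder
echo "# Example analysis" > README.md

git add .gitignore README.md                  # stage both files
git status --short                            # "A" marks files staged for commit

# -c sets a made-up identity for this one command only
git -c user.name="Example User" -c user.email="user@example.com" \
    commit -q -m "Add README and ignore rules"

git log --oneline                             # one line per commit, newest first
```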
Another book on bioinformatics
- Computational Biology - A Practical Introduction to BioData Processing and Analysis with Linux, MySQL, and R by Röbbe Wünschiers (Amazon.co.uk), which includes good coverage of awk and sed. The book’s website is at http://www.staff.hs-mittweida.de/~wuenschi/doku.php?id=rwbook2.
The extensive “missing manuals” for awk and sed
And a very good tutorial that lets you use Awk right away: Why you should learn just a little Awk: An Awk tutorial by Example by Greg Grothaus.
Utilities to handle fastq files etc.
Extract sequences from the fastq file
- https://www.biostars.org/p/72433/
- http://linuxcommando.blogspot.co.uk/2008/04/using-awk-to-extract-lines-in-text-file.html
- http://bioinformatics.cvr.ac.uk/blog/essential-awk-commands-for-next-generation-sequence-analysis/
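# Tally read lengths (the sequence is line 2 of every 4-line fastq record)
awk '{if(NR%4==2) print length($1)}' reads.fastq | sort -n | uniq -c > read_length.txt
# Print every odd-numbered line of a file
awk '0 == (NR + 1) % 2' inputfile.txt
# Strip the sequences from "count sequence" lines and sum the counts above 10000
cat barcount.txt | sed -E -e 's/^ +([0-9]+) [ACGTN]+/\1/' | awk 'BEGIN{total=0} {if ($1>10000) total+=$1} END{print total}'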
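To make the first one-liner concrete, here is a sketch on a tiny, invented fastq file. It assumes the common layout of exactly four lines per record (no wrapped sequences), so the sequence is always on line 2 of each group of four.

```shell
reads=$(mktemp)
# Hypothetical fastq file with two reads of different lengths
printf '@read1\nCGATAGGC\n+\nIIIIIIII\n@read2\nTTGAC\n+\nIIIII\n' > "$reads"

# Extract just the sequences: line 2 of every 4-line record
sequences=$(awk 'NR % 4 == 2' "$reads")
echo "$sequences"

# Tally read lengths: first column is the number of reads, second is the length
lengths=$(awk 'NR % 4 == 2 {print length($0)}' "$reads" | sort -n | uniq -c)
echo "$lengths"
```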
Enable NTFS read/write in macOS
This will let you read and write to a Windows partition from macOS:
- http://www.makeuseof.com/tag/write-ntfs-drives-el-capitan-free/
- http://osxdaily.com/2013/10/02/enable-ntfs-write-support-mac-os-x/
open /Volumes
echo "LABEL=DRIVE_NAME none ntfs rw,auto,nobrowse" | sudo tee -a /etc/fstab
Enable ext4 read in macOS
This will let you read from a Linux partition on macOS:
- Install FUSE for macOS
- Install ext4fuse
Setting up ftp proxy via command line
This assumes you cannot modify or don’t trust the system-wide settings in Ubuntu/Mac.
- HowTo: Use a Proxy on the Linux Command Line
- How to change proxy setting using Command line in Mac OS?
How to use screen
- `Ctrl-a d`: to disconnect from the screen
- `screen -ls`: list of screens
- `screen -r [id of the screen]`: to reconnect to the screen