1-Quality controlling your reads

<aside> 💡 This may take a while. Remember to start a screen session first, so the job keeps running if you get disconnected!

</aside>

# Activate the conda environment that has the required packages installed; the default is 'rnaseq' or 'envs/rnaseq'
conda activate rnaseq       # change 'rnaseq' if necessary

# Move to the directory that contains your raw reads
cd /path/to/your/dir

# Run FastQC (assuming your raw reads are named ...fq.gz)
# -t sets the number of threads (CPUs) to use; be realistic, especially if other users are running jobs at the same time
# *.fq.gz is a wildcard; it matches every file whose name ends in .fq.gz
fastqc -t 24 *.fq.gz
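
As a rough guide for choosing -t: FastQC processes one file per thread, so asking for more threads than there are input files brings no benefit. A minimal sketch of that rule follows. The helper name pick_threads and the "use at most half the cores" limit are assumptions for illustration, not part of FastQC, so adjust them to what is acceptable on your server.

```shell
# Hypothetical helper (not part of FastQC): choose a thread count for -t.
# Arguments: number of input files, number of CPU cores on the machine.
pick_threads() {
    n_files=$1
    n_cores=$2
    half=$(( n_cores / 2 ))        # assumption: leave half the cores for other users
    if [ "$half" -lt 1 ]; then half=1; fi
    if [ "$n_files" -lt 1 ]; then n_files=1; fi
    if [ "$n_files" -lt "$half" ]; then
        echo "$n_files"            # fewer files than allowed threads
    else
        echo "$half"               # cap at half the cores
    fi
}

pick_threads 6 24    # prints 6
pick_threads 40 24   # prints 12
```

On a Linux server you could then run something like `fastqc -t "$(pick_threads "$(ls *.fq.gz | wc -l)" "$(nproc)")" *.fq.gz`, where nproc reports the number of available cores.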

This may take a while. FastQC writes its output files, named ...fastqc..., to the same directory. Let's tidy them up:

mkdir fastqc          # create a sub-directory called fastqc
mv *_fastqc.* fastqc  # move the FastQC outputs (..._fastqc.html and ..._fastqc.zip) into the sub-directory; a plain *fastqc* would also match the new directory itself and make mv complain
cd fastqc             # go into the sub-directory fastqc
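
Before moving on, it can be worth confirming that every input file actually produced a report. FastQC names its outputs after the input with the .fq.gz ending dropped (for instance, a file called sample.fq.gz would give sample_fastqc.html and sample_fastqc.zip; "sample" is a made-up name). The sketch below relies on that naming. The helper name missing_reports is hypothetical.

```shell
# Hypothetical helper: print every *.fq.gz file in READS_DIR that has no
# matching <name>_fastqc.zip in QC_DIR.
missing_reports() {
    reads_dir=$1
    qc_dir=$2
    for f in "$reads_dir"/*.fq.gz; do
        [ -e "$f" ] || continue                  # glob matched nothing
        base=$(basename "$f" .fq.gz)             # strip directory and .fq.gz
        if [ ! -e "$qc_dir/${base}_fastqc.zip" ]; then
            echo "$f"
        fi
    done
}

# From inside the fastqc sub-directory: reads are one level up (..), reports are here (.)
missing_reports .. .
```

No output means every read file has a report. Any file name that is printed is worth re-running through FastQC.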

You can then inspect the QC files for each read file separately, or use another tool called MultiQC to combine them into a single QC report:

multiqc .      # run multiqc in the current directory (.)

It will then produce a MultiQC report (multiqc_report.html) so you can view the QC results for all read files at once. You can open the .html file in any common web browser on your desktop.

Example

multiqc_report.html

The QC reports may flag some warnings or even errors. Most of them are safe to ignore because the analysis tolerates a small proportion of "dirty" reads. However, consult a bioinformatician if any of the errors look substantial.
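
To see at a glance which checks were flagged, you can read the summary.txt that FastQC stores inside each ..._fastqc.zip. Each line holds a status (PASS, WARN or FAIL), the module name and the file name, separated by tabs. A minimal sketch, assuming that layout; the helper name flagged_modules and the example input are hypothetical.

```shell
# Hypothetical helper: read FastQC summary.txt lines on standard input and
# print only the modules flagged WARN or FAIL.
flagged_modules() {
    awk -F'\t' '$1 == "WARN" || $1 == "FAIL"'
}

# Small made-up example input (status, module, file name):
printf 'PASS\tBasic Statistics\texample.fq.gz\nFAIL\tPer base sequence content\texample.fq.gz\n' | flagged_modules
# prints the FAIL line only
```

From inside the fastqc sub-directory, something like `for z in *_fastqc.zip; do unzip -p "$z" '*/summary.txt'; done | flagged_modules` would scan every report, assuming unzip is installed.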

What are some important QC metrics I should look at?