<aside> 💡 Make sure that you have a good-quality reference transcriptome (.fasta) and an annotation file (.gff or .gtf). Both usually accompany a reference genome; the genome itself is not needed in this guide.

Discuss with a bioinformatician before proceeding. They can assess the quality of a transcriptome using metrics such as N50 and BUSCO completeness.

</aside>

<aside> 💡 Have you started your screen session and activated your conda environment?

</aside>

Downloading reference transcriptome

You will need to find a reference transcriptome and annotation for your own species; NCBI Assembly is a reliable source. Below is an example using Chenopodium quinoa (quinoa):

# wget is a tool for downloading files directly over HTTP(S) or FTP
conda install -c anaconda wget

# Create a directory for the references
mkdir ~/reference
cd ~/reference

# Below are the links to the transcriptome (CDS) sequences and GTF annotation of the ASM168347v1 assembly (Chenopodium quinoa)
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/683/475/GCF_001683475.1_ASM168347v1/GCF_001683475.1_ASM168347v1_cds_from_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/683/475/GCF_001683475.1_ASM168347v1/GCF_001683475.1_ASM168347v1_genomic.gtf.gz

# Decompress these files
gunzip *.gz           # if they are gzipped only
#tar -xzvf *.tar.gz   # if they are gzipped tarballs

# You may want to rename them
mv GCF_001683475.1_ASM168347v1_cds_from_genomic.fna quinoa_cds.fna
mv GCF_001683475.1_ASM168347v1_genomic.gtf quinoa_annt.gtf
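
After decompressing and renaming, a quick record count can help catch a truncated or mislabelled download. The sketch below is only an illustration: the helper functions `count_fasta_records` and `count_gtf_features` are hypothetical names, and the toy files stand in for quinoa_cds.fna and quinoa_annt.gtf.

```shell
# count_fasta_records FILE: number of header lines (one per sequence)
count_fasta_records() {
  grep -c '^>' "$1"
}

# count_gtf_features FILE FEATURE: number of non-comment GTF rows
# whose third column equals FEATURE (e.g. CDS)
count_gtf_features() {
  awk -F '\t' -v feat="$2" '!/^#/ && $3 == feat { n++ } END { print n + 0 }' "$1"
}

# Toy files (made up for illustration) standing in for the real references
tmpdir=$(mktemp -d)
printf '>seq1\nATGC\n>seq2\nATGG\n' > "$tmpdir/toy_cds.fna"
printf '#gtf-version 2.2\nchr1\tsrc\tgene\t1\t9\t.\t+\t.\tgene_id "g1";\nchr1\tsrc\tCDS\t1\t9\t.\t+\t0\tgene_id "g1"; protein_id "p1";\n' > "$tmpdir/toy_annt.gtf"

n_records=$(count_fasta_records "$tmpdir/toy_cds.fna")
n_cds=$(count_gtf_features "$tmpdir/toy_annt.gtf" CDS)
echo "FASTA records: $n_records"
echo "GTF CDS rows:  $n_cds"
```

On the real files, call the two functions with quinoa_cds.fna and quinoa_annt.gtf instead. The two numbers are not expected to match exactly, because one coding sequence can span several CDS rows in the GTF, but a count of zero in either file suggests something went wrong.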

Chenopodium quinoa transcriptome

quinoa.fna

Let's inspect the transcriptome file.

head quinoa_cds.fna

>lcl|NW_018742204.1_cds_XP_021760211.1_1 [gene=LOC110725042] [db_xref=GeneID:110725042] [protein=uncharacterized protein LOC110725042] [protein_id=XP_021760211.1] [location=66397..67071] [gbkey=CDS]
ATGCCAAACTACTCTAAGTTTCTTAAGGAAATTTTGAGCGGCAAGAGAGATTGCAATCTGGTTGAACCAGTGAGTTTGGG
GGATTGTTGTAGTGCCTTTATCCATAATGACTTGCCCCCAAAGATGAAAGACCTGGGGATTTTCTCCATCCCTTTCAATA
TTAAAGGAAAATTGTTCCAAAATTCCCTTTGTGATCTTGGTGCTAGTGTTAGCATCATGCCTTATTCCGTCTTCAAGAGA

As you can see, the sequence header (the line starting with >lcl|NW_018742204.1_cds_XP_021760211.1_1) is quite messy. This information is already contained in the GTF annotation file, so we only need the protein_id as the identifier.

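# The following sed command uses a regex to keep only the protein_id; adjust the pattern to your own header format
# (-i without a suffix is GNU sed syntax; BSD/macOS sed needs -i '')
sed -E -i 's|>.*protein_id=([^]]*)].*|>\1|g' quinoa_cds.fna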

# Inspect the transcriptome file again; the headers are now much simpler
head quinoa_cds.fna
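
Before trusting the simplified headers, it may be worth checking that every header was actually rewritten and that no identifier appears twice. The sketch below runs the protein_id substitution on a toy file. The second record, with a made-up gene name and no protein_id, is a hypothetical case included only to show what an unmatched header looks like.

```shell
# Toy FASTA (file names and second record are made up for illustration)
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy_cds.fna" <<'EOF'
>lcl|NW_018742204.1_cds_XP_021760211.1_1 [gene=LOC110725042] [protein_id=XP_021760211.1] [location=66397..67071] [gbkey=CDS]
ATGCCAAACTACTCTAAG
>lcl|NW_018742204.1_cds_2 [gene=LOC000000000] [pseudo=true] [location=1..18] [gbkey=CDS]
ATGGATTGTTGTAGTGCC
EOF

# Keep only the protein_id in each header, writing to a new file
# so the input stays untouched
sed -E 's|>.*protein_id=([^]]*)].*|>\1|' "$tmpdir/toy_cds.fna" > "$tmpdir/toy_clean.fna"

# Headers that still contain a space were not matched by the regex
unmatched=$(grep '^>' "$tmpdir/toy_clean.fna" | grep -c ' ')

# Identifiers that occur more than once
duplicates=$(grep '^>' "$tmpdir/toy_clean.fna" | sort | uniq -d | wc -l | tr -d ' ')

grep '^>' "$tmpdir/toy_clean.fna"
echo "unmatched headers: $unmatched"
echo "duplicate IDs: $duplicates"
```

Writing to a separate file rather than using -i keeps the original available if the pattern needs adjusting. How to handle unmatched or duplicated records depends on your data, so discuss that with a bioinformatician.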

Creating a transcriptome index

cd /path/to/your/reads         # go back to where your reads are

kallisto index -i quinoa.kallisto ~/reference/quinoa_cds.fna    # create a transcriptome index
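
Building the index fails if kallisto is not on your PATH (for example, when the conda environment is not active) or if the FASTA path is wrong. Below is a minimal sketch of a guarded version of the same command. The `fasta`, `index` and `status` variables are illustrative names, and the paths assume the layout used in this guide.

```shell
fasta="$HOME/reference/quinoa_cds.fna"   # cleaned transcriptome from the previous step
index="quinoa.kallisto"                  # index file name used in this guide

if ! command -v kallisto >/dev/null 2>&1; then
  status="kallisto not found; activate your conda environment first"
elif [ ! -s "$fasta" ]; then
  status="missing or empty FASTA: $fasta"
elif [ -s "$index" ]; then
  status="index already exists: $index"
elif kallisto index -i "$index" "$fasta"; then
  status="index built: $index"
else
  status="kallisto index failed; see the messages above"
fi
echo "$status"
```

Skipping the build when the index file already exists avoids repeating the work on a second run. Delete the index file first if you have changed the transcriptome.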

Pseudo-aligning reads against the indexed transcriptome