<aside> 💡 Make sure that you have a good-quality reference transcriptome (.fasta) and an annotation file (.gff or .gtf), which usually both come with a reference genome (not needed in this guide).
Discuss with a bioinformatician before proceeding. They can determine the quality of a transcriptome by looking at metrics such as N50, BUSCO etc.
</aside>
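For a first rough look before that conversation, N50 can be computed directly from a FASTA file. The sketch below is only an illustration: `fasta_n50` is a hypothetical helper name, and it assumes a plain (uncompressed) FASTA file. N50 alone says little about completeness, which is what tools such as BUSCO assess.

```shell
# Hypothetical helper: print the N50 of the sequences in a FASTA file.
# Step 1 (first awk): print one length per sequence; sequences may span several lines.
# Step 2 (sort): order the lengths from longest to shortest.
# Step 3 (second awk): walk down the list until half of the total length is covered.
fasta_n50() {
  awk '/^>/ { if (len) print len; len = 0; next }
       { len += length($0) }
       END { if (len) print len }' "$1" |
  sort -rn |
  awk '{ lens[NR] = $1; total += $1 }
       END { for (i = 1; i <= NR; i++) {
               run += lens[i]
               if (run * 2 >= total) { print lens[i]; exit }
             } }'
}
# Example: fasta_n50 ~/reference/quinoa_cds.fna
```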
<aside>
💡 Have you started your screen session and activated your conda environment?
</aside>
You will have to find your own source of reference transcriptome and annotation. NCBI Assembly is a reliable source. Below is an example using Chenopodium quinoa (quinoa):
# wget is a tool for downloading files directly over HTTP(S) or FTP
conda install -c anaconda wget
# Create a directory for the references
mkdir ~/reference
cd ~/reference
# Below are the links to the transcriptome (CDS) and GTF annotation of the quinoa assembly ASM168347v1 (*Chenopodium quinoa*)
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/683/475/GCF_001683475.1_ASM168347v1/GCF_001683475.1_ASM168347v1_cds_from_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/683/475/GCF_001683475.1_ASM168347v1/GCF_001683475.1_ASM168347v1_genomic.gtf.gz
# Decompress these files
gunzip *.gz # if they are gzip-compressed only
#tar -xzvf *.tar.gz # if they are gzip-compressed tarballs
# You may want to rename them
mv GCF_001683475.1_ASM168347v1_cds_from_genomic.fna quinoa_cds.fna
mv GCF_001683475.1_ASM168347v1_genomic.gtf quinoa_annt.gtf
Chenopodium quinoa transcriptome
Let's inspect the transcriptome file.
head quinoa_cds.fna
>lcl|NW_018742204.1_cds_XP_021760211.1_1 [gene=LOC110725042] [db_xref=GeneID:110725042] [protein=uncharacterized protein LOC110725042] [protein_id=XP_021760211.1] [location=66397..67071] [gbkey=CDS]
ATGCCAAACTACTCTAAGTTTCTTAAGGAAATTTTGAGCGGCAAGAGAGATTGCAATCTGGTTGAACCAGTGAGTTTGGG
GGATTGTTGTAGTGCCTTTATCCATAATGACTTGCCCCCAAAGATGAAAGACCTGGGGATTTTCTCCATCCCTTTCAATA
TTAAAGGAAAATTGTTCCAAAATTCCCTTTGTGATCTTGGTGCTAGTGTTAGCATCATGCCTTATTCCGTCTTCAAGAGA
As you can see, the sequence header >lcl|NW_018742204.1_cds_XP_021760211.1_1 [gene=LOC110725042] [db_xref=GeneID:110725042] [protein=uncharacterized protein LOC110725042] [protein_id=XP_021760211.1] [location=66397..67071] [gbkey=CDS] is quite messy. In fact, this information is already contained in the GTF annotation file, so we only need the protein_id as the identifier.
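Because an in-place edit cannot be undone, it may help to preview the result on a few headers first. The sketch below prints the simplified headers without touching the file. `preview_ids` is a hypothetical helper name, and the pattern assumes NCBI-style headers containing a `[protein_id=...]` field like the one shown above.

```shell
# Hypothetical helper: show what the first few headers would look like
# after simplification, without modifying the FASTA file.
preview_ids() {
  # $1: FASTA file, $2: number of headers to show (defaults to 5)
  grep '^>' "$1" | head -n "${2:-5}" |
    sed -E 's|>.*protein_id=([^]]*)].*|>\1|'
}
# Example: preview_ids ~/reference/quinoa_cds.fna 3
```

A header that still appears in its long form in the preview did not match the pattern, which suggests the regex needs adjusting for that file.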
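# The following sed command uses a regex to keep only the protein_id; the pattern will differ for other header formats
# (-i edits the file in place with GNU sed; on macOS/BSD sed use -i '' instead)
sed -E -i 's|>.*protein_id=([^]]*)].*|>\1|g' quinoa_cds.fna
# Now inspect the transcriptome file again; the headers should look simpler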
head quinoa_cds.fna
cd /path/to/your/reads # go back to where your reads are
kallisto index -i quinoa.kallisto ~/reference/quinoa_cds.fna # create a transcriptome index
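If the index step or later quantification behaves unexpectedly, the simplified headers are a good first thing to check. The sketch below counts the records, the identifiers that occur more than once, and the headers that still contain spaces (that is, headers the regex did not match, for example records without a protein_id field). `check_fasta_ids` is a hypothetical helper, not a kallisto command, and it assumes that a simplified header contains no whitespace.

```shell
# Hypothetical helper: summarise the headers of a FASTA file.
check_fasta_ids() {
  # $1: FASTA file whose headers have already been simplified
  # uniq -c prefixes each distinct header with its number of occurrences;
  # a header that still contains spaces yields more than two fields.
  grep '^>' "$1" | sort | uniq -c |
    awk '{ n += $1
           if ($1 > 1) dup++
           if (NF > 2) unsimplified++ }
         END { printf "records: %d\n", n
               printf "duplicated identifiers: %d\n", dup
               printf "headers not simplified: %d\n", unsimplified }'
}
# Example: check_fasta_ids ~/reference/quinoa_cds.fna
```

If either of the last two counts is non-zero, it is probably worth correcting the FASTA file and rebuilding the index.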