1. Introduction

Integrative Short Reads Navigator (ISRNA) is an online toolkit for searching, analyzing and visualizing short sequence read data generated by high-throughput sequencing technologies. Major functions of the toolkit include:

  • Fast and accurate mapping of short reads to a given genome.
  • General bioinformatic analysis of the input sequence reads, including nucleotide composition, length distribution, sequence annotation, miRNA identification, RNA secondary structure prediction and short read cluster discovery.
  • Versatile searching functions for curation of sequence reads, including searching the dataset by sequence or sequence motif, by known miRNA name, by read count, by genomic location, and by the name or annotation of adjacent protein coding genes.
  • A genome browser for visualizing the distribution of sequencing reads and corresponding genes.
  • Multiple datasets management and cross-datasets comparison.

2. Exemplification

1) Datasets preparation and uploading

ISRNA accepts FASTQ files and customized read count files in the following format:

TGAACGGCGTTCACGGCAGACT     9452
TTCCGTAAGTTCACGGCAGAC         9018
ATCCGGGCGTTCACGGCAGAC       8774
...

Each line of the file should contain two tab-delimited columns: the first column is the sequence, and the second column is the read count of that sequence.

  • File formatting

Though raw FASTQ files are supported, we strongly recommend that users convert them to the customized format. The raw FASTQ file (.fastq or .fq, often several gigabytes in size) of one sequencing reaction can be converted to the required format using this script.
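
As an illustration of what such a conversion involves, the following Python sketch collapses identical reads in a FASTQ stream and writes them in the two-column format described above. It is not the ISRNA conversion script; the function names are hypothetical, and it assumes plain four-line FASTQ records with no adapter trimming or quality filtering.

```python
from collections import Counter
import io

def collapse_fastq(handle):
    """Count identical read sequences in a FASTQ stream (assumes 4 lines per record)."""
    counts = Counter()
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the sequence is the second line of each record
            seq = line.strip().upper()
            if seq:
                counts[seq] += 1
    return counts

def write_collapsed(counts, out):
    """Write 'sequence<TAB>count' lines, most abundant sequence first."""
    for seq, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        out.write(f"{seq}\t{n}\n")

# Tiny in-memory demonstration with three reads, two of them identical.
demo = io.StringIO(
    "@r1\nTGAACGGC\n+\nIIIIIIII\n"
    "@r2\nTTCCGTAA\n+\nIIIIIIII\n"
    "@r3\nTGAACGGC\n+\nIIIIIIII\n"
)
counts = collapse_fastq(demo)
buf = io.StringIO()
write_collapsed(counts, buf)
print(buf.getvalue(), end="")
```

For a real file, the same functions could be called with `open("reads.fastq")` and an output file handle in place of the in-memory buffers.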

Correctly formatted sequence files can be uploaded to ISRNA via the file upload module on the homepage.

  • Genome mapping

ISRNA uses Bowtie to map sequences to reference genomes. Users can set mapping parameters such as permitted mismatches (default: 0) and maximum loci (default: 10). The 'maximum loci' parameter is the maximum number of genomic loci to which a sequence read may map. If a sequence read maps to more loci than this threshold, it is discarded after mapping and does not appear in further analysis results.
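
The effect of the 'maximum loci' threshold can be sketched in Python as follows. This is only an illustration of the rule stated above, not ISRNA's internal code; the function name and the data layout (a dictionary from sequence to a list of mapped loci) are assumptions.

```python
def filter_by_max_loci(loci_per_read, max_loci=10):
    """Keep reads whose number of mapped genomic loci does not exceed max_loci."""
    return {
        seq: loci
        for seq, loci in loci_per_read.items()
        if len(loci) <= max_loci
    }

# Hypothetical mapping results: (chromosome, position, strand) per locus.
hits = {
    "TGAACGGCGTTCACGGCAGACT": [("chr1", 100, "+")],
    "AAAAAAAAAAAAAAAAAAAA": [("chr1", pos, "+") for pos in range(11)],  # 11 loci
}
kept = filter_by_max_loci(hits, max_loci=10)
print(sorted(kept))
```

With the default threshold of 10, the second read maps to 11 loci and is therefore dropped.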



2) Sequence analysis functions

The sequence analysis functions of ISRNA include:

  • Data overview

This module characterizes some basic features of the uploaded dataset, including: 1) the counts of total uploaded reads and non-redundant sequence reads; 2) the counts and percentages of total reads and non-redundant sequence reads with or without genomic matches; 3) the length distribution of total and non-redundant sequence reads with or without genomic matches.
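
To make the distinction between total and non-redundant reads concrete, here is a minimal Python sketch that computes both counts and both length distributions from a sequence-to-count table. The function name is hypothetical and genomic match status is left out for brevity.

```python
from collections import Counter

def overview(read_counts):
    """Summarize a {sequence: read count} table.

    Returns total reads, non-redundant sequences, and the length
    distribution of each (length -> number of reads or sequences).
    """
    total = sum(read_counts.values())
    non_redundant = len(read_counts)
    len_total = Counter()
    len_nr = Counter()
    for seq, n in read_counts.items():
        len_total[len(seq)] += n  # every copy of the read counts
        len_nr[len(seq)] += 1     # each distinct sequence counts once
    return total, non_redundant, dict(len_total), dict(len_nr)

reads = {
    "TGAACGGCGTTCACGGCAGACT": 9452,
    "TTCCGTAAGTTCACGGCAGAC": 9018,
    "ATCCGGGCGTTCACGGCAGAC": 8774,
}
total, non_redundant, len_total, len_nr = overview(reads)
print(total, non_redundant, len_total, len_nr)
```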

  • Nucleotide composition

This module enables users to calculate the nucleotide composition of both the total reads and non-redundant sequence reads.
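
The two views of nucleotide composition differ only in whether each sequence is weighted by its read count. A small Python sketch under that assumption (the function name and the `weighted` flag are illustrative, not part of ISRNA):

```python
from collections import Counter

def composition(read_counts, weighted=True):
    """Fraction of each nucleotide in a {sequence: read count} table.

    weighted=True  -> total reads (each sequence counted read-count times)
    weighted=False -> non-redundant sequences (each sequence counted once)
    """
    tally = Counter()
    for seq, n in read_counts.items():
        weight = n if weighted else 1
        for base in seq:
            tally[base] += weight
    total = sum(tally.values())
    return {base: tally[base] / total for base in sorted(tally)}

reads = {"AAGT": 3, "CCGT": 1}
print(composition(reads, weighted=True))
print(composition(reads, weighted=False))
```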

  • Sequence annotation

ISRNA uses the BLAST+ program and the Rfam database to annotate the short sequence reads. Sequence reads classified into major non-coding RNA groups can be viewed and downloaded via the "Export" function.

  • MiRNA identification

ISRNA identifies sequences that are identical to known miRNAs deposited in miRBase or highly homologous to them (no more than 2 nt mismatches). miRNA expression abundances are normalized by the total read count and presented as RPM (Reads Per Million). The analysis results can be downloaded via the "Export" function.
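
The RPM normalization and the 2 nt mismatch tolerance can be illustrated with a short Python sketch. Both helper names are hypothetical, and the mismatch check only compares sequences of equal length position by position; how ISRNA actually aligns reads to miRBase entries is not described here.

```python
def rpm(count, total_reads):
    """Reads Per Million: read count scaled to a library of one million reads."""
    return count * 1_000_000 / total_reads

def within_mismatches(read, mirna, max_mismatches=2):
    """True if two equal-length sequences differ at no more than max_mismatches positions."""
    if len(read) != len(mirna):
        return False
    mismatches = sum(a != b for a, b in zip(read, mirna))
    return mismatches <= max_mismatches

print(rpm(250, 5_000_000))                       # 250 reads out of 5 million
print(within_mismatches("ACGTACGT", "ACGAACGA"))  # differs at 2 positions
print(within_mismatches("ACGTACGT", "TCGAACGA"))  # differs at 3 positions
```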

  • RNA secondary structure prediction

ISRNA uses CentroidFold to predict the secondary structures of sequences together with their flanking genomic sequences. The length of the flanking sequence can be defined by users. Two predicted secondary structures will be displayed: one with the flanking sequence added to the 5' end of the examined sequence and the other with it added to the 3' end. The examined sequence is shown in capital letters and the flanking sequence in lowercase letters. Each predicted base pair is colored on a heat gradient from blue to red, corresponding to a base-pairing probability from 0 to 1. The base-pairing probability is the probability that a pair of bases forms a base pair via hydrogen bonds in the secondary structure, and can be interpreted as a confidence measure for the predicted base pair. A higher resolution picture is available in the PS file.

  • Short sequence read cluster discovery

This function module identifies genomic regions with clustered short sequence reads, which may represent certain classes of known ncRNAs (e.g. piRNAs) or new classes of ncRNAs.
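
The clustering criteria used by ISRNA are not specified in this document. As one possible way to picture the idea, the Python sketch below merges mapped reads on the same chromosome when the gap between them is small and reports groups with enough reads. The function name and both thresholds (`max_gap`, `min_reads`) are assumptions chosen purely for illustration.

```python
def find_clusters(hits, max_gap=100, min_reads=3):
    """Group mapped reads into clusters.

    hits: iterable of (chromosome, start, end) tuples.
    Reads on the same chromosome are merged while the gap to the
    current cluster end is at most max_gap; clusters with fewer than
    min_reads reads are dropped.
    Returns a list of (chromosome, start, end, number_of_reads).
    """
    clusters = []
    current = None  # [chromosome, start, end, number_of_reads]
    for chrom, start, end in sorted(hits):
        if current and chrom == current[0] and start - current[2] <= max_gap:
            current[2] = max(current[2], end)
            current[3] += 1
        else:
            if current and current[3] >= min_reads:
                clusters.append(tuple(current))
            current = [chrom, start, end, 1]
    if current and current[3] >= min_reads:
        clusters.append(tuple(current))
    return clusters

hits = [
    ("chr1", 100, 121), ("chr1", 130, 151), ("chr1", 160, 181),
    ("chr1", 5000, 5021), ("chr2", 10, 31),
]
clusters = find_clusters(hits)
print(clusters)
```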

3) Searching functions

ISRNA provides robust searching functions to help users make discoveries among millions of sequences. The following searches are available:

  • Search by sequence
  • Search by miRNA ID
  • Search by gene ID/key words
  • Search by genomic region
  • Search by read count
  • Search by sequence read coverage

Matching sequences are returned by these searching functions whether or not they have perfect genomic matches. For sequences with genomic matches, the genomic location and the neighboring genes can be reviewed by clicking the "View" button.

  • Search by sequence

ISRNA allows users to search the input dataset with any nucleotide combination of at least 5 nt. The program will identify sequences that are identical to the query sequence or contain it as a substring. Users can also set a required sequence similarity (90%-100%) as the search threshold, so that homologous matches of the query sequence are identified as well. In the fuzzy searching results, perfect matches are highlighted in green and mismatches are highlighted in red. Black letters (not highlighted) are nucleotides outside the region matched to the query string.
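
A rough Python sketch of substring and similarity-threshold searching is given below. It slides the query along each read without gaps and scores the best placement by the fraction of identical positions. This ungapped scoring and the function names are assumptions for illustration; the similarity measure ISRNA actually uses is not described here.

```python
def best_identity(query, read):
    """Highest fraction of matching positions over all ungapped placements of query within read."""
    q, r = len(query), len(read)
    if q > r:
        return 0.0
    best = 0
    for offset in range(r - q + 1):
        window = read[offset:offset + q]
        matches = sum(a == b for a, b in zip(query, window))
        best = max(best, matches)
    return best / q

def search(query, reads, min_identity=1.0):
    """Return reads containing the query at or above min_identity (1.0 means exact substring)."""
    if len(query) < 5:
        raise ValueError("query must be at least 5 nt long")
    return [read for read in reads if best_identity(query, read) >= min_identity]

reads = [
    "TGAACGGCGTTCACGGCAGACT",
    "TTCCGTAAGTTCACGGCAGAC",
    "ATCCGGGCGTTCACGGCAGAC",
]
exact = search("CGTTCACGGC", reads)                    # exact substring only
fuzzy = search("CGTTCACGGC", reads, min_identity=0.9)  # allows 1 mismatch in 10 nt
print(len(exact), len(fuzzy))
```

In this example the second read differs from the 10 nt query at a single position, so it is found only when the threshold is lowered to 90%.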

  • Search by miRNA ID

This function allows users to identify sequences in the dataset that are identical or homologous (up to 2 mismatches) to the query miRNA. isomiRNAs with terminal extensions or truncations are also included in the results. For users interested in isomiRNAs, we recommend isomiRex for further research.

  • Search by read counts

This function allows users to select sequence reads with read counts greater than a defined threshold. It is especially useful for selecting highly expressed sequences. A link to a summary of the read counts of all sequences is also provided to help users choose an appropriate threshold.

  • Search by genomic region

This function allows users to examine the distribution of short sequence reads within a given genomic region. A genome browser will be displayed for users to navigate along the genome. Detailed functions of the genome browser are described below.

  • Search by gene ID/key words

This function allows users to identify sequences derived from given protein coding genes. Sequences matching the genomic region of the query gene will be returned. If the query key word is present in the annotation of multiple genes, all of the matching genes will be displayed, which allows the distribution of short sequence reads to be compared among homologous genes or different members of the same gene family.

  • Search by sequence read coverage on genes

This function enables users to identify genes for which a certain proportion of the gene length is covered by short sequences, providing a way to identify potential small RNA precursors or to discover unique degradation patterns. The read coverage ratio is defined as the length covered by short sequences divided by the total length of the gene.
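
The read coverage ratio can be sketched in Python as follows: reads overlapping the gene are clipped to the gene boundaries, overlapping reads are merged so that no base is counted twice, and the covered length is divided by the gene length. The function name and the half-open coordinate convention are assumptions made for this example.

```python
def coverage_ratio(gene_start, gene_end, read_intervals):
    """Fraction of the gene [gene_start, gene_end) covered by at least one read.

    read_intervals: iterable of (start, end) pairs in half-open coordinates.
    """
    # Keep only reads overlapping the gene and clip them to its boundaries.
    clipped = sorted(
        (max(start, gene_start), min(end, gene_end))
        for start, end in read_intervals
        if start < gene_end and end > gene_start
    )
    covered = 0
    cur_start = cur_end = None
    for start, end in clipped:
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start  # close the previous merged block
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)         # extend the current merged block
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / (gene_end - gene_start)

reads = [(1000, 1100), (1050, 1200), (1900, 2100), (500, 600)]
ratio = coverage_ratio(1000, 2000, reads)
print(ratio)
```

Here the first two reads merge into a 200 nt block, the third contributes 100 nt after clipping, and the fourth lies outside the gene, giving 300 covered bases out of 1000.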

4) Short sequence reads browser

In the display pages of all search results, a genome browser is embedded to show the genomic mapping results of short sequence reads and genes. Users can therefore easily identify the positional relationship among short sequence reads, as well as between short sequence reads and annotated genes. The IDs, genomic positions and annotations of genes, together with the information on short sequence reads, are displayed when the mouse pointer hovers over the corresponding gene or sequence read. The zoom in/out and move left/right functions allow users to navigate along each chromosome. Sequence reads are displayed as stacked short segments with color gradients representing different ranges of read counts. The genome browser also enables users to examine the overall distribution of sequence reads on each chromosome, and thereby to identify regions enriched in or depleted of short sequence reads.

5) Multiple datasets management and comparison

ISRNA supports storage of multiple datasets and comparative analysis between them. The names of the datasets are listed on the right panel of the user index page. Users can click a name, and the subsequent analysis will apply to the selected dataset. Users can also use this index page to add another dataset or remove an existing one.

  • Differentially expressed sequence reads between two datasets

When users have multiple datasets in one project, they can compare the expression of reads between datasets and find the differentially expressed ones. P-values are calculated with the edgeR package. RPM values and fold changes are also presented to measure the differences.
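
The RPM and fold change columns can be illustrated with the Python sketch below. It does not reproduce the edgeR P-value calculation, and the pseudocount added before taking the ratio is an assumption introduced here to avoid division by zero, not something stated for ISRNA. The function name is hypothetical.

```python
import math

def compare(counts_a, counts_b, pseudo=1.0):
    """Per-sequence RPM in two datasets and log2 fold change (B relative to A).

    counts_a, counts_b: {sequence: read count} tables.
    pseudo: value added to both RPMs before taking the ratio (assumption).
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    rows = []
    for seq in sorted(set(counts_a) | set(counts_b)):
        rpm_a = counts_a.get(seq, 0) * 1e6 / total_a
        rpm_b = counts_b.get(seq, 0) * 1e6 / total_b
        log2fc = math.log2((rpm_b + pseudo) / (rpm_a + pseudo))
        rows.append((seq, rpm_a, rpm_b, log2fc))
    return rows

dataset_a = {"AAAAA": 300, "CCCCC": 700}
dataset_b = {"AAAAA": 600, "CCCCC": 400}
rows = compare(dataset_a, dataset_b)
for seq, rpm_a, rpm_b, log2fc in rows:
    print(f"{seq}\t{rpm_a:.1f}\t{rpm_b:.1f}\t{log2fc:.3f}")
```

In this toy example the first sequence doubles in relative abundance between the two datasets, so its log2 fold change is close to 1.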

  • Sequence loci and reads distribution along chromosomes of multiple datasets

6) Customized data export

Some analysis results of ISRNA can be exported as plain text files for further curation or advanced analysis.