General description
  • This Perl program converts a raw fastq file into a compact file of clean read counts.
  • It runs on Linux/Unix/Mac without any prerequisites, and on MS-Windows once Perl is installed. MS-Windows users, please follow this link to install Perl.


Why do I need it?
  • High-throughput sequencing often produces huge files in fastq format, which contain a large amount of redundant data and information that ISRNA does not need.
  • Uploading such large files takes a long time and is prone to network problems, especially for users with a slow connection.
  • This Perl program reduces the file size dramatically while preserving most of the useful information.
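
To illustrate why removing redundancy shrinks the file so much, here is a minimal Python sketch that collapses identical reads into one line per distinct sequence with its count. It is not the code of format_fastq.pl: the function name collapse_reads and the tab-separated "sequence, count" layout are assumptions made for this example, and the real output format may differ.

```python
from collections import Counter

def collapse_reads(sequences):
    """Count identical sequences and return (sequence, count) pairs,
    most frequent first. Hypothetical helper, for illustration only."""
    return Counter(sequences).most_common()

# Four reads, but only two distinct sequences.
reads = ["TGAGGTAGTAGGTTGTATAGTT"] * 3 + ["ACTGGACTTGGAGTCAGAAGG"]

for seq, count in collapse_reads(reads):
    # Assumed layout: one distinct sequence per line, followed by its count.
    print(f"{seq}\t{count}")
```

In a deeply sequenced small RNA sample the same sequence can occur many thousands of times, so one line per distinct sequence is far smaller than four fastq lines per read.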


How to use it?
  • First, download the program and save it to the folder that contains the fastq file.
  • Download it now
  • Then change to that folder on the command line and launch the program like this:
  • perl format_fastq.pl -i sample.fq -o clean.txt
    Or:
    perl format_fastq.pl -i sample.fq -x "GTTCAGAGTTCTACAGTCCGACGATC" -y "TCGTATGCCGTCTTCTGCTTG" -l 18 -f 2 -o clean.txt
  • For detailed usage of this script, please run:
  • perl format_fastq.pl -h
  • When the program finishes, the output file will be ready for ISRNA.


Detailed description
  • This script filters low-quality short reads, removes polyA, trims 3' / 5' adapters and reports general information about the input fastq file.
  • Options:
  • -i <file>  Short reads file in fastq format
    -x <str>   5' adaptor sequence, default="GTTCAGAGTTCTACAGTCCGACGATC"
    -y <str>   3' adaptor sequence, default="TCGTATGCCGTCTTCTGCTTG"
    -l <int>   The minimal length of the reads, default=18
    -f <int>   Fastq file format: 1=Sanger format; 2=Solexa/Illumina 1.0 format;
               3=Illumina 1.3+ format; default=1
    -o <str>   Output file
    -h         Help
  • Trim polyA and adapters:
  • Users can define the adapter sequences. By default, this script uses the standard Illumina adapters.
    Raw sequences containing polyA or 3' / 5' adapters will be trimmed clean.
  • Filter sequences with length cutoff:
  • Raw sequences shorter than the cutoff will be discarded. The default minimal length is 18.
  • Choose the fastq format:
  • Before running this script, users should determine the fastq format of their file.
    Asking the sequencing service is the most reliable way, although recent output files are usually in type 1 (Sanger) format.
  • Criteria to filter low quality reads:
  • For Sanger format, the quality value is calculated as Q = (ASCII character code) - 33. If Q < 15, the read is considered low quality and discarded.
    For Solexa/Illumina 1.0 format, the quality value is calculated as Q = (ASCII character code) - 64. If Q < 9, the read is considered low quality and discarded.
    For Illumina 1.3+ format, the quality value is calculated as Q = (ASCII character code) - 64. If Q < 10, the read is considered low quality and discarded.
  • Report file:
  • After running, a report file named "report.txt" is generated automatically; it records the program settings and a summary of the read counts.
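
The quality, trimming and length rules described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code of format_fastq.pl. All function names are hypothetical. Adapters are matched only as exact, full-length strings. A read is treated as low quality as soon as any single base falls below the cutoff, and quality is checked before trimming; the description does not say whether the script works this way. PolyA removal is left out because the description does not say how polyA is detected. guess_format applies a common rule of thumb based on the lowest quality character and does not replace asking the sequencing service.

```python
# Illustrative sketch only; the real logic lives in format_fastq.pl and may differ.

# ASCII offset and low-quality cutoff for each -f value, as listed above.
FORMATS = {
    1: (33, 15),  # Sanger
    2: (64, 9),   # Solexa/Illumina 1.0
    3: (64, 10),  # Illumina 1.3+
}

ADAPTER_5 = "GTTCAGAGTTCTACAGTCCGACGATC"  # default of -x
ADAPTER_3 = "TCGTATGCCGTCTTCTGCTTG"       # default of -y

def quality_values(qual, fmt=1):
    """Q = (ASCII character code) - offset, for every character."""
    offset, _ = FORMATS[fmt]
    return [ord(ch) - offset for ch in qual]

def is_low_quality(qual, fmt=1):
    """Assumption: one base below the cutoff rejects the whole read."""
    _, cutoff = FORMATS[fmt]
    return any(q < cutoff for q in quality_values(qual, fmt))

def trim_adapters(seq, adapter5=ADAPTER_5, adapter3=ADAPTER_3):
    """Assumption: exact matches only. Cut from the 3' adapter onwards,
    then drop everything up to and including the 5' adapter."""
    pos3 = seq.find(adapter3)
    if pos3 != -1:
        seq = seq[:pos3]
    pos5 = seq.rfind(adapter5)
    if pos5 != -1:
        seq = seq[pos5 + len(adapter5):]
    return seq

def clean_read(seq, qual, fmt=1, min_len=18):
    """Return the trimmed sequence, or None if the read is discarded."""
    if is_low_quality(qual, fmt):
        return None
    seq = trim_adapters(seq)
    return seq if len(seq) >= min_len else None

def guess_format(quality_strings):
    """Rule of thumb: characters below ';' (ASCII 59) occur only in Sanger
    files; ';' to '?' (59-63) point to Solexa/Illumina 1.0. If every
    character is '@' (64) or above, 3 is returned, although a Sanger file
    containing only very high qualities would look the same."""
    lowest = min(ord(ch) for qual in quality_strings for ch in qual)
    if lowest < 59:
        return 1
    if lowest < 64:
        return 2
    return 3
```

For example, clean_read(ADAPTER_5 + "TGAGGTAGTAGGTTGTATAGTT" + ADAPTER_3, "I" * 69) keeps the 22-base insert, while the same call with a 10-base insert returns None because the trimmed read is shorter than 18.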


I still have a question
  • If you still have questions about the program or have any suggestions, don't hesitate to contact us at: gzluo@genetics.ac.cn