So what is SCALCE?
SCALCE (/skeɪlz/, a.k.a. boosting Sequence Compression Algorithms using Locally Consistent Encoding) is a tool for compressing FASTQ files. It is designed specifically for Illumina-generated FASTQ files, but supports any valid FASTQ file with consistent read lengths. SCALCE was published in Bioinformatics in October 2012.
How do I get SCALCE?
Just clone our repository and issue:
git clone https://github.com/sfu-compbio/scalce.git
cd scalce
make download
make
If you have issues with compiling, please try our CentOS 7-compiled x86_64 binary.
git clone https://github.com/sfu-compbio/scalce.git binary
If you don’t have git, you can always fetch pre-packaged SCALCE archives:
Note: You will need zlib >= 1.2.6 and the libbzip2 library to compile the sources. Unfortunately, RHEL/CentOS 5.x and older come with antiquated versions of zlib, so we recommend downloading a newer version via make download. pigz is also recommended for multi-threaded mode. See Usage for an explanation.
Note: SCALCE v2.7 and earlier does not support variable read lengths. Starting with v2.8, EXPERIMENTAL (AND VERY BUGGY) support for variable read lengths has been added. To use it, compile with make -j pacbio and run SCALCE through the scalce-pacbio binary. All SCALCE options are also valid for scalce-pacbio. Please note that SCALCE is not designed for very long reads (e.g. PacBio, Nanopore), so the compression performance might not be ideal. Also make sure to double-check that long reads decompress correctly and that the output is valid.
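One way to double-check a round trip is to compare an order-independent fingerprint of the original and the decompressed FASTQ. The sketch below is not part of SCALCE: the helper name fastq_fingerprint is hypothetical, it assumes plain 4-line FASTQ records and lossless settings, it ignores read names (which -n discards), and it does not assume that the decompressed records come back in their original order.

```shell
# Order-independent fingerprint of the sequence and quality lines of a FASTQ file.
# Assumes 4 lines per record (no wrapped sequences); read names are ignored.
fastq_fingerprint() {
    # Join line 2 (sequence) and line 4 (qualities) of each record with a tab,
    # sort the records bytewise, and checksum the sorted stream.
    awk 'NR % 4 == 2 { seq = $0 } NR % 4 == 0 { print seq "\t" $0 }' "$1" |
        LC_ALL=C sort | cksum
}

# Hypothetical usage after compressing and decompressing (placeholder file names):
#   [ "$(fastq_fingerprint original.fastq)" = "$(fastq_fingerprint restored.fastq)" ] \
#       && echo "round trip OK"
```

Matching fingerprints mean both files hold the same multiset of sequence and quality pairs; a mismatch means the files differ and deserves a closer look.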
How do I use SCALCE?
SCALCE is invoked as follows:
scalce [input_1.fastq] -o [result]
This compresses input_1.fastq into output files whose names start with the prefix result.
scalce [input_1.fastq] -r -o [result] -n [library]
This compresses input_1.fastq together with its paired end input_2.fastq, discarding the read names and setting the library name to library.
scalce [input_1.scalcen] -d -o [something.fastq]
This decompresses the input_1.scalce* SCALCE files to something.fastq.
Input and output
SCALCE is a FASTQ compression tool designed specifically for Illumina-generated FASTQ files. SCALCE compresses the provided FASTQ files and generates three output files with the following extensions:
- .scalcen for read names,
- .scalcer for reads, and
- .scalceq for quality scores.
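To illustrate the idea of keeping names, reads, and qualities in separate streams, the following sketch splits a plain FASTQ file into three text files. It is only an illustration under stated assumptions: the helper split_fastq_streams and the .names, .reads, and .quals suffixes are hypothetical, the input must use 4 lines per record, and the real .scalcen, .scalcer, and .scalceq files are compressed and use SCALCE's own format.

```shell
# Split a 4-line-per-record FASTQ file into three plain-text streams.
# $1 = input FASTQ, $2 = output prefix (suffixes are hypothetical, not SCALCE's).
split_fastq_streams() {
    awk -v p="$2" '
        NR % 4 == 1 { print > (p ".names") }   # read names
        NR % 4 == 2 { print > (p ".reads") }   # nucleotide sequences
        NR % 4 == 0 { print > (p ".quals") }   # quality scores
    ' "$1"
}

# Hypothetical usage:
#   split_fastq_streams input_1.fastq demo
```

Each stream is far more homogeneous than the interleaved FASTQ, which is the general reason such streams are handled separately by compressors.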
Within a single run, the read length must be fixed for each mate. For example, if you pass 3 paired-end libraries, all reads in the first mate must have the same length (e.g. 50 bp). The read length of the second mate may differ from that of the first mate.
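Before compressing, it can be useful to confirm that a file really has a single read length. The helper below is a minimal sketch, not part of SCALCE; the name has_fixed_read_length is hypothetical and the check assumes 4 lines per FASTQ record.

```shell
# Exit status 0 if every sequence line in the FASTQ file has the same length,
# 1 otherwise. Assumes 4 lines per record.
has_fixed_read_length() {
    awk 'NR % 4 == 2 {
             if (!seen) { len = length($0); seen = 1 }        # remember first length
             else if (length($0) != len) { bad = 1; exit }    # mismatch: stop early
         }
         END { exit bad + 0 }' "$1"
}

# Hypothetical usage:
#   has_fixed_read_length input_1.fastq || echo "variable read lengths found"
```

For paired-end data, run the check on each mate file separately, since the two mates are allowed to have different lengths.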
Specifies the prefix for the output file names. Extensions and basic information will be appended.
Standard output is supported in decompression mode. Use - to indicate standard output (i.e. -o -). Standard output is not supported if you use the -r parameter during decompression.
Shared arguments (both for compression and decompression)
Use paired-end FASTQ files when the two ends are in separate files. The files should be named with _1 and _2. When passing them as input, give only the _1 file; SCALCE will replace _1 with _2 and read the second file.
File name example:
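As a sketch of this naming rule (not SCALCE's own code), the hypothetical shell helper mate_file below derives the second file name from the first by replacing the last _1 in the name. Which occurrence SCALCE itself replaces is an assumption made here for illustration.

```shell
# Print the name of the second mate file for a given first mate file.
# Replaces the LAST "_1" in the name with "_2" (an assumption of this sketch).
mate_file() {
    case "$1" in
        *_1*) printf '%s_2%s\n' "${1%_1*}" "${1##*_1}" ;;
        *)    echo "no _1 in file name: $1" >&2; return 1 ;;
    esac
}

# Hypothetical usage: check that the second mate exists before running scalce -r.
#   [ -f "$(mate_file input_1.fastq)" ] || echo "second mate file is missing"
```

With this rule, input_1.fastq maps to input_2.fastq, and a name without _1 is rejected.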
-n, --skip-names [library]
Discard the original read names and rename each read with the library prefix, such as library.1, library.2, etc. This option can improve the compression ratio considerably.
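The effect of this renaming on the name lines can be pictured with a small awk sketch. This is an illustration only: rename_reads is a hypothetical helper, the numbering starting at 1 and the leading @ are assumptions, and the input is assumed to use 4 lines per record.

```shell
# Replace each read name with "<library>.<running number>".
# $1 = input FASTQ, $2 = library prefix. Output goes to standard output.
rename_reads() {
    awk -v lib="$2" '
        NR % 4 == 1 { print "@" lib "." (++n); next }   # name line: renumber
        { print }                                        # other lines: unchanged
    ' "$1"
}

# Hypothetical usage:
#   rename_reads input_1.fastq library | head -n 4
```

Sequential names like these carry almost no information, which is why discarding the original names tends to help compression.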
Prints short usage information.
Prints the current version of SCALCE.
Decompress SCALCE files. Provide just one file name (the scalceq file, for example), and the program will take care of the other files.
-S, --split-reads [count]
Split the output into multiple files, where each file contains the given number of reads.
Default: 0 (do not split)
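The splitting rule (a fixed number of reads per file, with 0 meaning no split) can be sketched on plain FASTQ as follows. This is not how SCALCE writes its own output: split_fastq and the part naming scheme are hypothetical, and the input is assumed to use 4 lines per record.

```shell
# Write consecutive chunks of at most N reads to <prefix>.00.fastq, <prefix>.01.fastq, ...
# $1 = input FASTQ, $2 = reads per chunk (0 = single file), $3 = output prefix.
split_fastq() {
    awk -v n="$2" -v p="$3" '
        NR == 1 || (n > 0 && (NR - 1) % (4 * n) == 0) {
            if (out != "") close(out)                      # finish previous chunk
            out = sprintf("%s.%02d.fastq", p, part++)      # hypothetical naming
        }
        { print > out }
    ' "$1"
}

# Hypothetical usage: one million reads per chunk.
#   split_fastq input_1.fastq 1000000 chunk
```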
-B, --bucket-set-size [size][MG]
Set the bucket set size in (M)egabytes or (G)igabytes. This parameter limits the main memory accessible to SCALCE. Swap files will be used to keep all necessary data.
-c, --compression [mode]
Select compression mode. Currently available modes are:
- no - No compression
- gz - gzip compression level 6
- pigz - parallelized gzip
- bz - bzip2 compression
Default: gz, or pigz if number of threads is greater than 1
Disable arithmetic coding for the quality compression and use default compression mode. This helps reduce both compression and decompression time, but the compression ratio may suffer.
Default: not activated
-p, --lossy-percentage [percentage]
Set lossy error percentage.
Default: 0 (lossless)
-s, --sample-size [count]
Specifies how many quality values should be used for statistical analysis when the lossy transformation table is created.
-t, --temp-directory [directory]
Set directory for holding temporary files.
-T, --threads [num]
Specify the number of worker threads.
Default: 4 (if the system offers less than 4 cores, number of threads will be automatically adjusted)
Note: To take advantage of multi-threading, the pigz binary must be located within the PATH. Otherwise, you should use SCALCE with the -T1 (single thread) option.
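A wrapper script can apply this rule automatically. The sketch below is an assumption-laden convenience, not part of SCALCE: scalce_threads is a hypothetical helper that prints the requested thread count when pigz is found in the PATH and falls back to 1 otherwise.

```shell
# Print the thread count to pass to -T: the requested value if pigz is available
# in the PATH, otherwise 1. $1 = requested threads (defaults to 4).
scalce_threads() {
    requested=${1:-4}
    if command -v pigz >/dev/null 2>&1; then
        echo "$requested"
    else
        echo 1
    fi
}

# Hypothetical usage:
#   scalce input_1.fastq -o result -T "$(scalce_threads 8)"
```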
Contact & Support
SCALCE has been brought to you by:
Copyright (c) 2011–2012, Simon Fraser University. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
- Neither the name of the Simon Fraser University nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- (10-Jan-2016) SCALCE version 2.8 release
  - Bugfixes (arithmetic decoding bugfix)
  - Fixed a decompression bug when the number of reads was greater than 2^32. Compression was not affected.
  - New: support for variable length reads via scalce-pacbio
- (20-May-2013) SCALCE version 2.7 release
  - Bugfixes (no-arithmetic fix)
- (13-May-2013) SCALCE version 2.6 release
- (02-Apr-2013) SCALCE version 2.5 release
  - Read splitting supported
  - Standard output during decompression supported
- (25-Mar-2013) SCALCE version 2.4 release
  - Auto-pigz detection
- (10-Sep-2012) SCALCE version 2.3 release
  - Decompression speed improvements
- (25-Jul-2012) SCALCE version 2.2 release
  - Speed improvements
  - Arithmetic coding for qualities is now optional
  - Multiple bug fixes
- (06-Jun-2012) SCALCE version 2.1 release
  - Better compression of reads
  - Arithmetic coding for qualities
  - Multiple bug fixes
- (02-Mar-2012) SCALCE version 1.4 release
  - Fixed a serious data loss bug when using multithreading
- (20-Feb-2012) SCALCE version 1.3 release
  - Various bug fixes
- (17-Feb-2012) SCALCE version 1.2 release
  - Various bug fixes
- (08-Feb-2012) SCALCE version 1.1 release
  - OpenMP support
  - pigz support
- (06-Dec-2011) SCALCE version 1.0 release
  - Initial release