NeoFuse is a user-friendly pipeline for the prediction of fusion neoantigens from tumor RNA-seq data.
NeoFuse takes single-sample FASTQ files of RNA-seq reads (single- or paired-end) as input and predicts putative fusion neoantigens through five main analytical modules based on state-of-the-art computational tools.
We advise using paired-end data to increase sensitivity and accuracy of gene fusion detection.
NeoFuse can be installed through the following four steps.
Instructions for Docker installation
Instructions for Singularity installation
The script is freely available in the Downloads section.
Unzip the archive and add it to PATH:
$ export PATH=$PATH:/path/to/NeoFuse
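To confirm that the PATH change took effect, you can ask the shell where it finds the script. The helper below is a minimal sketch; the function name on_path is made up for this example and is not part of NeoFuse.

```shell
# Report whether a command is reachable through PATH.
# "on_path" is a hypothetical helper name used only in this sketch.
on_path() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1 found at: $(command -v "$1")"
  else
    echo "$1 not found on PATH" >&2
    return 1
  fi
}

# Check the NeoFuse script; print a hint instead of aborting if it is missing.
on_path NeoFuse || echo "Re-check the directory that was appended to PATH."
```

Note that the export only lasts for the current shell session unless it is also added to a shell startup file.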
The NeoFuse image can be automatically generated using the NeoFuse script:
Docker:
$ NeoFuse -B --docker
Singularity:
$ NeoFuse -B --singularity
The NeoFuse script can also be used to generate the genomes and indexes required by the analysis:
Docker:
Singularity:
<Arguments>
-o: Output directory
[Options]
-n: Number of cores (default: 1)
-V: Genome version, either “GRCh37” or “GRCh38” (default: GRCh38)
Note: this process may take more than 1 hour, depending on the internet connection and the processing power.
NeoFuse can process single samples with the following command:
$ NeoFuse <arguments> [options] --singularity (or --docker)
<Arguments>
-o: Path to output directory (default: "./")
-d: Run ID - name of the output files (default: input filename)
-1: Path to read 1 FASTQ file (mandatory)
-2: Path to read 2 FASTQ file (optional for single-end reads)
-s: Path to STAR index directory (mandatory)
-g: Path to reference genome FASTA file (mandatory)
-a: Path to annotation GTF file (mandatory)
Note: All input files passed as arguments must be unzipped.
[Options]
-m: Minimum peptide length (values: 8, 9, 10, or 11; default: 8)
-M: Maximum peptide length (values: 8, 9, 10, or 11; default: 8) *
-n: Number of cores (default: 1)
-t: IC50 binding affinity threshold (default: 500)
-T: Percentile rank threshold (default: Inf)
-c: Minimum confidence score (values: H, M, or L; default: L) **
-l: Maximum available RAM (bytes) for sorting BAM. If -l is set to 0, it will be set to the genome index size. (values: 0 - Inf; default: 0)
--singularity: NeoFuse will use the Singularity image
--docker: NeoFuse will use the Docker image
* NeoFuse will compute the binding affinity for all possible peptide lengths between the minimum and maximum input. For example, if a user specifies '-m 8' and '-M 11', NeoFuse will compute the binding affinity for all peptides of length 8, 9, 10, and 11. To consider just one specific length, use only the '-m' argument.
** The minimum Arriba confidence score can be set to: H (to return only high confidence fusions), M (for high and medium confidence fusions), or L (for high, medium, and low confidence fusions).
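As an illustration, the arguments and options above can be combined as in the following sketch. Every path and the run ID "Sample1" are placeholders, and the helper name neofuse_cmd and the variable run are invented for this example. With run=echo the command is only printed, so it can be reviewed before anything is executed.

```shell
# Placeholder inputs: replace with real, uncompressed files.
READ1=/path/to/Sample1_read_1.fastq
READ2=/path/to/Sample1_read_2.fastq
STAR_INDEX=/path/to/STAR_idx
GENOME=/path/to/genome.fa
GTF=/path/to/annotation.gtf
OUTDIR=/path/to/output

# Dry-run switch (hypothetical): "echo" prints the command,
# an empty value (run=) executes it.
run=echo

# Paired-end sample, peptides of length 8 to 11, 4 cores, Singularity image.
neofuse_cmd() {
  $run NeoFuse -1 "$READ1" -2 "$READ2" -d Sample1 -o "$OUTDIR" \
    -s "$STAR_INDEX" -g "$GENOME" -a "$GTF" \
    -m 8 -M 11 -n 4 --singularity
}

neofuse_cmd
```

For single-end data, drop the -2 argument; to use the Docker image, replace --singularity with --docker.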
For multiple-sample analysis, a TSV input file listing the sample identifiers and the paths to the input files has to be prepared. Format:
Paired-end reads:
Single-end reads:
#ID Read1
Sample1 /path/to/Sample1_read_1.fastq
Sample2 /path/to/Sample2_read_1.fastq
Notes: The first line of the TSV should start with a hash sign (#). There should always be one blank row at the end of the TSV file.
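The single-end layout above can be generated with a few lines of shell. This is a sketch under stated assumptions: the helper name make_tsv and the output name samples.tsv are invented here, sample IDs are derived from file names ending in "_read_1.fastq", and the "blank row" is read literally as one empty line after the last sample.

```shell
# Write a single-end sample sheet: a "#ID<TAB>Read1" header,
# one tab-separated row per FASTQ file, and one trailing empty line.
# The ID is the file name with "_read_1.fastq" removed (assumed naming scheme).
make_tsv() {
  printf '#ID\tRead1\n'
  for fq in "$@"; do
    id=$(basename "$fq" _read_1.fastq)
    printf '%s\t%s\n' "$id" "$fq"
  done
  printf '\n'
}

# Placeholder paths taken from the layout shown above.
make_tsv /path/to/Sample1_read_1.fastq /path/to/Sample2_read_1.fastq > samples.tsv
```

Using printf with explicit \t avoids the common problem of editors silently replacing tabs with spaces.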
Once the TSV file is created, the samples can be analyzed with the following command:
$ NeoFuse <arguments> [options] --singularity (or --docker)
<Arguments>
-o: Path to output directory (default: "./")
-d: Run ID - name of the output files (default: input filename)
-i: Path to the input TSV file (mandatory)
-s: Path to STAR index directory (mandatory)
-g: Path to reference genome FASTA file (mandatory)
-a: Path to annotation GTF file (mandatory)
Note: All input files passed as arguments must be unzipped.
[Options]
-m: Minimum peptide length (values: 8, 9, 10, or 11; default: 8)
-M: Maximum peptide length (values: 8, 9, 10, or 11; default: 8) *
-n: Number of cores (default: 1)
-t: IC50 binding affinity threshold (default: 500)
-T: Percentile rank threshold (default: Inf)
-c: Minimum confidence score (values: H, M, or L; default: L) **
-l: Maximum available RAM (bytes) for sorting BAM. If -l is set to 0, it will be set to the genome index size. (values: 0 - Inf; default: 0)
--singularity: NeoFuse will use the Singularity image
--docker: NeoFuse will use the Docker image
* NeoFuse will compute the binding affinity for all possible peptide lengths between the minimum and maximum input. For example, if a user specifies '-m 8' and '-M 11', NeoFuse will compute the binding affinity for all peptides of length 8, 9, 10, and 11. To consider just one specific length, use only the '-m' argument.
** The minimum Arriba confidence score can be set to: H (to return only high confidence fusions), M (for high and medium confidence fusions), or L (for high, medium, and low confidence fusions).
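Because a multi-sample run can take a long time, it may be worth checking the TSV file before starting. The sketch below verifies that every listed file exists and is not gzipped, in line with the note that inputs must be unzipped. The helper name check_tsv is invented for this example, and treating an optional third column as a second read file is an assumption, since only the single-end layout is shown above.

```shell
# Pre-flight check for a sample sheet (hypothetical helper).
# Prints one line per problem and returns non-zero if any were found.
check_tsv() {
  status=0
  tab=$(printf '\t')
  # Column 1 is the sample ID; read2 stays empty for single-end sheets.
  # A third column holding read 2 is an assumption of this sketch.
  while IFS="$tab" read -r id read1 read2; do
    # Skip the "#" header line and empty rows.
    case "$id" in '#'*|'') continue ;; esac
    for f in "$read1" "$read2"; do
      [ -n "$f" ] || continue
      case "$f" in
        *.gz) echo "$id: $f is still gzipped"; status=1 ;;
      esac
      [ -f "$f" ] || { echo "$id: $f not found"; status=1; }
    done
  done < "$1"
  return "$status"
}

# Run the check only if a sheet named samples.tsv (placeholder) is present.
if [ -f samples.tsv ]; then
  check_tsv samples.tsv || echo "Fix the entries above before running NeoFuse."
fi
```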
Due to license compatibility issues, netMHCpan is fully integrated but not distributed as part of NeoFuse.
If there is an existing local installation of netMHCpan, peptide-HLA binding affinity (IC50 and rank) can be predicted with netMHCpan instead of MHCflurry using the following command:
NeoFuse will create an output directory with the following structure:
/NeoFuse/output/directory/
├── Sample1
│ ├── Arriba
│ ├── LOGS
│ ├── NeoFuse
│ ├── OptiType
│ └── TPM
├── Sample2
│ ├── Arriba
│ ├── LOGS
│ ├── NeoFuse
│ ├── OptiType
│ └── TPM
…
└── SampleN
  ├── Arriba
  ├── LOGS
  ├── NeoFuse
  ├── OptiType
  └── TPM
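After a run, the layout above can be checked automatically. The following sketch reports any sample directory that lacks one of the five expected subdirectories. The helper name check_sample_dir is invented for this example, and OUTDIR stands for whatever directory was passed to -o.

```shell
# Report which of the five per-sample subdirectories are missing.
# Returns non-zero if at least one is absent (hypothetical helper).
check_sample_dir() {
  missing=0
  for sub in Arriba LOGS NeoFuse OptiType TPM; do
    if [ ! -d "$1/$sub" ]; then
      echo "missing: $1/$sub"
      missing=1
    fi
  done
  return "$missing"
}

# OUTDIR is a placeholder for the directory given to -o.
OUTDIR=${OUTDIR:-/NeoFuse/output/directory}
for sample in "$OUTDIR"/*/; do
  [ -d "$sample" ] || continue
  check_sample_dir "${sample%/}" || echo "incomplete: ${sample%/}"
done
```

A missing subdirectory usually points to the step that failed, so the matching files in LOGS are a reasonable next place to look.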
Sample.fusions.tsv contains a list of gene fusions sorted from highest to lowest confidence.
Sample.fusions.discarded.tsv contains all events that Arriba classified as artifacts or that are also observed in healthy tissues.
/Arriba
├── Sample1.fusions.discarded.tsv
└── Sample1.fusions.tsv
The standard output and standard error (stdout and stderr) of every tool used in the run are stored in the LOGS directory. File names may differ depending on the tools, peptide length, etc.
/LOGS
├── Sample1_10_MHCFlurry.log
├── Sample1_11_MHCFlurry.log
├── Sample1_8_MHCFlurry.log
├── Sample1_9_MHCFlurry.log
├── Sample1.arriba.err
├── Sample1.arriba.log
├── Sample1.cleave_peptides.log
├── Sample1.counts_to_tpm.log
├── Sample1.featureCounts.log
├── Sample1.final.log
├── Sample1.Log.final.out
├── Sample1.Log.out
├── Sample1.Log.std.out
├── Sample1.optitype.log
├── Sample1.razer1.log
├── Sample1.razer2.log
├── Sample1.STAR.err
├── Sample1.STAR.log
└── Sample1.association.log
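When a run stops early, a quick first step is to list the error files that actually contain something. The sketch below does this with find; the helper name nonempty_err_logs is invented for this example. A non-empty .err file is only a hint, since many tools also write warnings and progress messages to stderr.

```shell
# Print every non-empty *.err file under a LOGS directory.
# "-size +0" matches files larger than zero 512-byte blocks, i.e. non-empty files.
nonempty_err_logs() {
  find "$1" -type f -name '*.err' -size +0
}

# LOGS is a placeholder path; the check is skipped if it does not exist.
LOGS=${LOGS:-/NeoFuse/output/directory/Sample1/LOGS}
if [ -d "$LOGS" ]; then
  nonempty_err_logs "$LOGS"
fi
```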
HLA_Optitype.txt contains the HLA types predicted by OptiType.
coverage_plot.pdf contains the read coverage plots of the HLA alleles.
/OptiType
├── Sample1_HLA_Optitype.txt
└── Sample1_coverage_plot.pdf
The TPM directory contains the TPM expression values for all genes.
/TPM
└── Sample1.tpm.txt
The NeoFuse directory contains the final output of the pipeline, which consists of three files:
/NeoFuse
├── Sample1_filtered.tsv
├── Sample1_unfiltered.tsv
└── Sample1_unsupported.txt
Sample_unsupported.txt contains the HLA types predicted by OptiType that are not supported by MHCflurry. Note: if netMHCpan is used instead of MHCflurry, this file is not generated.
Sample_unfiltered.tsv contains all the predicted fusion peptides and their annotations.
Sample_filtered.tsv contains a list of putative fusion neoantigens (selected according to the user-defined IC50/rank and confidence score thresholds). For each putative neoantigen, this file reports: confidence score, binding HLA type, expression of the fusion and HLA genes in TPM, and information about the presence of a premature stop codon that might cause nonsense-mediated decay of the fusion transcript. Example format:
Dobin,A. et al. (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29, 15–21.
Jurtz,V. et al. (2017) NetMHCpan-4.0: Improved Peptide–MHC Class I Interaction Predictions Integrating Eluted Ligand and Peptide Binding Affinity Data. J. Immunol., 199, 3360–3368.
Liao,Y. et al. (2014) featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics, 30, 923–930.
O’Donnell,T.J. et al. (2018) MHCflurry: Open-Source Class I MHC Binding Affinity Prediction. Cell Syst, 7, 129–132.e4.
Szolek,A. et al. (2014) OptiType: precision HLA typing from next-generation sequencing data. Bioinformatics, 30, 3310–3316.