.. role:: bash(code) :language: bash Welcome to the TIminer documentation! ************************************* TIminer encloses a panel of computational tools for the investigation of tumor-immune cell interactions [4]_: - **Gene expression**: `Kallisto `_ is used to efficiently quantify gene expression levels [2]_. - **Immune infiltrates**: the enrichment of immune subpopulations is assessed through gene set enrichment analysis (`GSEA `_) [9]_ as in [1]_ [3]_. - **Tumor immunogenicity**: the expression of genes related to Human Leukocite Antigens (HLA) genes, co-inhibitory or co-stimulatory signaling, and infiltration of immune cells, is used to compute an “immunophenogram” and an “immmunophenoscore” (`IPS `_), describing the tumor antigenicity and the mechanisms of immune evasion [3]_. - **HLA typing**: `Optitype `_ is used to predict class-I HLA types, at 4-digit resolution, from RNA or DNA sequencing data [10]_. - **Neoantigens**: candidate neoantigens are predicted in three steps: (1) mutated proteins are inferred from somatic mutations using the Variant Effect Predictor (`VEP `_) [6]_; (2) mutated peptides 8-to-11 amino acids long are extracted and analyzed with `NetMHcpan `_ 3.0 to predict their binding affinity to class-I HLAs [5]_ [7]_ [8]_; (3) neoantigens are selected considering gene expression. Similar methods were used to generate the data of the The Cancer Immunome Database (`TCIA `_) [3]_. The pipeline is a Python script executable from the command line or through a user interface. The :ref:`user interface ` allows the analysis of single-subject data on standard desktop PCs, whereas the :ref:`command-line ` version allows the analysis of single- or multiple-subject data in a single run. The :ref:`input files ` must contain RNA sequencing reads and somatic DNA mutations. In addition to the pipeline, each analytical tool is available through a unified Python API that can be executed independently on data from :ref:`single ` or :ref:`multiple subjects `. The Python API masks the complexity arising from different programming languages, handles necessary data pre- and post-processing internally, and introduces parallel execution traces to optimize for cluster units. TIminer is available as a Docker-based image (`www.docker.com `_) and is distributed under BSD 3-Clause License. The third party tools included come with their own :ref:`licensing scheme `. Indices and tables ================== * :ref:`genindex` * :ref:`modindex` * :ref:`search` * :ref:`Sensitive Filtering ` Requirements ============ - HLA-typing requires the same amount of memory as the total input file size (e.g. for two paired-end FASTQ files of 6Gb each, 12Gb of free memory are required). - All tools can leverage on multicore architectures (the more CPU cores are available, the faster the analysis). - Data from single subjects can be analyzed on modern desktop PCs or laptops. - 22Gb of free disk space is required to download the genome reference databases. - The API is available for Python 2.7. Install ======= **Mac OS X** - Download the `TIminer installer `_. - Execute :bash:`install_mac.command` inside the 'install' directory of the tool and follow the instructions. - For maximal performance, increase dockers CPU and Memory count to the maximum (Docker->Preferences->General). Note: TIminer requires OS X Yosemite or newer. **Red Hat derivatives** - Download the `TIminer installer `_. - Excute :bash:`sh install_red_hat.sh` inside the 'install' directory of the tool and follow the instructions. Note: TIminer requires Fedora 23 / Centos 7 or newer. **Singularity** - Download the `TIminer installer `_. 1. optional but recommended: :bash:`conda create -n timiner python=2.7` 2. optional but recommended: :bash:`conda activate timiner` 3. :bash:`mkdir timiner-singularity` 4. :bash:`cd timiner-singularity` 5. :bash:`tar -xzvf ../timiner_Singularity_installer.tar.gz` 6. :bash:`cd install` 7. :bash:`sh install_singularity.sh` 8. optional but recommended: :bash:`python TIminerPipeline.py --input ../samples/inputInfo.txt --out ../samples/out` .. **Manuall installation** .. Although, Windows and other Linux systems are not supported by now, the pipeline can be used but requires manual installation. .. - Install `Docker `_ and load our image from the Docker hub :bash:`docker pull elitap/tcia`. .. - If not available yet, install `Python `_. .. - Download our wrapper code containing the documentation, sample files, the TIminerPipeline.py and TIminerUI.py Python script and the Python based API. The latter can be installed by executing :bash:`pip install --user path_to_src_dir`. .. - Download `the genome database `_ for the generation of mutated proteins. The API allows to set the path to the database individually. .. - Verify the installation by executing :bash:`python TIminerTest.py`. TIminer Pipeline ================ An overview of the TIminer pipeline is provided below. .. figure:: _static/tciaScheme_new.png The pipeline considers two types of :ref:`input data `: RNA-seq data reads and DNA mutations. RNA-seq data are used to compute gene raw counts, Transcripts Per Millions (TPM), and normalized log2(TPM+1), with Kallisto and to predict class-I HLA types with Optitype. Normalized TPM are used to estimate the enrichment of different immune subpopulations through gene set enrichment analysis, which returns the normalized enrichment scores (NES) and the associated false-discovery-rate (FDR)-corrected p-values. Normalized TPM are also used to compute the immunophenoscore (IPS) and to chart the immonophenogram. Single-point DNA mutations are used to predict mutated proteins with VEP. The mutated proteins and the predicted class-I HLA types are then subjected to NetMHCPan to predict mutated peptides binding to HLAs. These mutated peptides can be filtered considering only peptides arising from expressed genes, identified from normalized TPM, to finally select candidate neoantigens. .. _input-files: Input files ----------- RNA-seq reads must be provided as FASTQ files (one file in case of single-end sequencing, two files in case of paired-end sequencing). DNA mutations can be provided in `VCF format `_. All RNA-seq and DNA mutations files can be processed also in gzipped format. .. _command-line: Running the pipeline from command line --------------------------------------- The *TIminerPipeline.py* is a single Python script executing the wrapped tools over the defined API and forms a full data analysis pipeline. :bash:`usage: python TIminerPipeline.py [-h] --input INPUT --out OUT [--database DATABASE] [--threadcount THREADCOUNT] [--sensitivefiltering]` - **input** (str) \- Path to the :download:`input information file <_static/inputInfo.txt>`: a tab-delimited file specifying, for each subject: the subject ID, the cancer type, the path to the input RNA-seq files (:download:`RNAseq1.fastq <_static/rnaseq_sample_1.fastq.txt>` and :download:`RNAseq2.fastq <_static/rnaseq_sample_2.fastq.txt>`) and the path to the mutation files (:download:`mutation1 <_static/sample_vep_mut.txt>`, :download:`mutation2 <_static/sample_vep_mut2.txt>`, etc.). In case of single-end data, the second RNA-seq file must be simply specified as “None”. One mutation file is mandatory; multiple mutation files can be specified in additional columns. - **out** (str) \- Path to the output directory. - **threadcount** (int) \- Number of threads to be used (default=2). - **sensitivefiltering** \- acitvates the :ref:`sensitive filtering ` option After the installation, small example data can be analyzed by executing from the :bash:`scripts` directory the following command: :bash:`python TIminerPipeline.py --input ../samples/inputInfo.txt --out ../samples/out` At the end of the computation, all :ref:`output files ` are saved in the specified output directory. .. _user-interface: Running the pipeline from the user interface --------------------------------------------- The analysis tools for a single subject are also accessible over an easy-to-use user interface, which can be used on standard hardware desktop PCs and laptops. The user interface can be started by executing the following command from the script directory: :bash:`usage: python TIminerUI.py [--database DATABASE]` The '—database' option specifies the path to the Kallisto reference index and VEP genomic annotation (default = directory defined at installation time). .. _fig-ui: .. figure:: _static/ui_scaled.png The user interface allows loading the input data, selecting the set of analyses to be performed, and specifying the output directory were the results will be saved. Alternatively, the “Load examples” button allows automatically loading mock example data and selecting all the analyses to be run. When available, pre-computed HLA types can be specified by the user to be used for neoantigen prediction (“Use given HLA types”). Up to the six HLA types have to be specified through in the available text fields, using the following format: HLA-A01:01. While the analysis is being performed, the current state and possible issues are reported at the top of the user interface window (“State ...”). The selected analyses can be run clicking the “Run” button, the results are saved in the specified output directory, which can be opened directely from the user interface. .. _output-files: Output files ------------ The pipeline generates the following sub-directories and files: The *gene-expression* directory contains the output files generated by Kallisto: - :download:`expression.txt <_static/output/gene-expression/expression.txt>`: main file of gene expression values, computed as log2(TPM+1). It is a tab-delimited file with genes in its rows and subjects in its columns; the first column contains the HGNC gene symbols. - :download:`tpm_expression.txt <_static/output/gene-expression/tpm_expression.txt>`: file of gene TPM, with the same format of expression.txt. - :download:`raw_count_expression.txt <_static/output/gene-expression/raw_count_expression.txt>`: file of raw gene counts, with the same format of expression.txt. The *immunophenoscore* directory contains the following files: - :download:`IPS_Sample.pdf<_static/output/immunophenoscore/IPS_Sample.pdf>`: the plot of the immunophenogram (one for each subject). - :download:`IPS.txt<_static/output/immunophenoscore/IPS.txt>`: tab-delimited file with subjects inside the rows and scores inside the columns. The considered scores are: z-scores for major histocompatibility complex genes (MHC), checkpoint blockers and immunomodulators (CP), effector immune cells (EC), suppressor cells (SC), aggregated z-score (AZ) and immunophenoscore (IPS). The *gene-set-enrichment-analysis* directory contains: - :download:`NES<_static/output/gene-set-enrichment-analysis/NES.txt>`: a tab-delimited file of normalized enrichment scores (NES) computed for each subject (inside the rows) and each immune cell type (inside the columns). The first column contains the subject ID. - :download:`FDR_q-val.txt<_static/output/gene-set-enrichment-analysis/FDR_q-val.txt>`: a tab-delimited file of false-discovery-rate (FDR)-adjusted p-values (i.e. q-values), with the same format of the *NES.txt* file. - SubjectID.GseaPreranked: a subdirectory (for each subject) containing all GSEA original output files. The *hla-types* directory contains: - :download:`hlas.tsv<_static/output/HLA-types/hlas.tsv>`: a tab-delimited file of 4-digits HLA types, with subjects inside the rows and two alleles for each HLA gene (HLA-A, HLA-B, and HLA-C) inside the columns. The *mutated-proteins* directory contains output files generated by VEP: - :download:`subjectID_mutprotein_info.txt<_static/output/mutated-proteins/mutprotein_info.txt>`: a tab-delimited file (one file for each subject) with the annotated somatic DNA mutations. The information inside the columns is: - Uploaded_variation: subject ID - Location: genomic position of the mutation - Allele: reference allele - Gene: Ensemble gene ID - Feature: feature ID - Feature_type: feature type - Consequence: mutation type - cDNA_position: position of the mutation in the cDNA - CDS_position: position of the mutation in the CDS - Protein_position: position of the mutation in the protein - Amino_acids: amino acid change - Codons: codon change - Existing_variation: information about existing variation - Extra: additional annotations, among which the ID of the affected (i.e. mutated) protein. - :download:`subjectID_mutprotein_seq.txt <_static/output/mutated-proteins/mutprotein_seq.txt>`: a FASTA file (one file for each subject) with the amino acid sequence of all mutated proteins reported in the file of annotated mutations. - :download:`vep.tsv <_static/output/mutated-proteins/vep.tsv>`: text file reporting, for each subject, the names of the two output files generated by VEP for the *mutprotein_info.txt* path on the first column and the *mutprotein_seq.txt* path on the second column. The *binding-peptides* directory contains the output files generated by NetMHCpan: - :download:`subjectID_binding_peptides.txt <_static/output/binding-peptides/binding_peptides.txt>`: a tab-delimited file for each subject containing the information about all HLA-binding peptides. The information inside the columns is: - SubjectID: subject ID - Pos: genomic position of the mutation - GeneID: Ensemble gene ID - TranscriptID: Ensemble transcript ID - GeneSymbol: Gene Symbol - Protein: mutated protein - ProteinPos: position of the mutated amino acid in the protein - Mut: amino acid change - VariantType: type of variant - HLA: HLA type - MutPeptide: mutated peptide - MutAFF: affinity of the mutated peptide in nM - MutRank: rank of the mutated peptide - RefPeptide: unmutated peptide - RefAFF: affinity of the unmutated peptide in nM - RefRank: rank of the unmutated peptide - :download:`netmhcpan.tsv <_static/output/binding-peptides/netmhcpan.tsv>`: text file reporting, for each subject, the name of the output file generated by NetMHCpan *binding_peptides.txt*. The *neoantigens* directory contains the files: - :download:`subjectID_neoantigens <_static/output/filtered-neoantigens/neoantigens.txt>`: a file for each subject containing only the expressed neoantigens; it has the same columns as the files generated by NetMHCpan, plus an additional field reporting the normalized expression of the mutated gene: log2(TPM + 1). Result interpretation --------------------- By running the pipeline from command line on the example files :ref:`(as explained here) `, several :ref:`output files ` are generated for a subject called *Sample*. Excluding the files with straightforward interpretation and some intermediate files, such as the file of annotated DNA mutations, we can briefly interpret the output data giving some insights on tumor-immune cell interactions. **HLA typing** Optitype predicted heterozygous HLA alleles for the HLA-A gene (HLA-A31:01 and HLA-A26:01) and homozygous alleles for the HLA-B and HLA-C genes (HLA-B38:01 and HLA-C12:03, respectively). **GSEA** From GSEA results we can see that activated CD4+ T cells (ACT_CD4) are significantly enriched at a false-discovery reate of 5% (NES=2.99, q-val=0.00<0,.05). Contrariwise, the enrichment of effector memory CD8+ T cells (TEM_CD8) is not significant (NES=1.35, q-val=0.19>0.05). Activated B cells (ACT_B_CELL) are instead depleted, as their NES score is negative (NES=-2.32, q-val=0.01). **Immunophenoscore** From the immunophenoscore plot we can see that HLA-related genes are strongly up-regulated (top-left outersection, in dark red), while genes related to both effector immune cells (EC, top-right) and suppressor immune cells (SC, bottom-right) are slightly up-regulated. Most of the checkpoints molecules (CP) with immunoinhibitory effects (identified by the “-“ sign) are down-regulated, while co-stimulators (“+” sign) are both up- (CD27) and down-regulated (ICOS). Taken all together, these positive and negative contributions can be summarized in an immunophenoscore of 10 (on a [0-10] scale), representing a good immunophenotype, i.e. a tumor which is likely to elicit an effective immune response. **Neoantigens** The expressed neoantigens consist of seven mutated peptides arising from two genes: NCOA6, and TP53. The original pool of mutated peptides comprised eight peptides from three genes, but one of them was not expressed (TP53TG3D). Among the expressed neoantigens, the MNRRPILTI peptide, arising from an R>G missense mutation in TP53, was predicted to bind to HLA-C12:03 with high affinity (407.5 nM), whereas a lower affinity was predicted for its wild-type version MNRGPILTI (1059.7 nM). TIminer API =========== .. _single-sample: **Single-Subject data analysis** .. automodule:: TIminer.TIminerAPI :members: executeKallisto, executeIPS, executeGSEA, executeOptitype, executeVep, executeNetmhcpan, filterNeoantigenFile, convertExpressionFile .. _multiple-sample: **Multiple-Subject data analysis** .. automodule:: TIminer.TIminerAPI :members: executeKallistoDir, executeIPS, executeGSEA, executeOptitypeDir, executeVepDir, executeNetmhcpanDir, filterNeoantigenDir .. _licensing-scheme: Example ------- The TIminer API is available as a Python module installed during the installation process. The :ref:`TIminer Pipeline ` script is a good guideline on how to use the API. Additionally, a simple example generating the IPS for a single subject is shown here: .. literalinclude:: _static/ips_sample.py :linenos: :language: python Licensing ========= Kallisto: `http://pachterlab.github.io/kallisto/ `_. Kallisto is distributed under a non-commercial `license `_ with permission to use, copy, modify, and distribute the software and its documentation for educational and research not-for-profit purposes Gene Set Enrichment Analysis: `http://software.broadinstitute.org/gsea/index.jsp `_. The GSEA `license `_ allows free access to individuals in both academia and industry for internal research purposes. However, it is requested that every unique user registers at the GSEA website. Optitype: `https://github.com/FRED-2/OptiType `_. OptiType is released under a three-clause BSD `license `_. Variant Effect Predictor: `http://www.ensembl.org/info/docs/tools/vep/index.html `_. VEP is released under the Apache 2.0 `license `_. NetMHCPan: `http://www.cbs.dtu.dk/services/NetMHCpan/ `_. NetMHCPan is released under an `Academic Software License Agreement `_ and is free for non-profit educational, academic and/or research purpose. Immunophenoscore: `https://github.com/MayerC-imed/Immunophenogram `_. The Immunophenoscore script is released under a BSD `license `_. Extended filtering: `https://software.broadinstitute.org/gatk/download/auth?package=GATK `_. GATK tools are release under an ACADEMIC NON-COMMERCIAL RESEARCH license. Picard `https://github.com/broadinstitute/picard/blob/master/LICENSE.txt `_ is under MIT License. Hisat2 `https://ccb.jhu.edu/software/hisat2/index.shtml `_ is under GNU License. References ========== .. [1] Angelova, Mihaela, et al. "Characterization of the immunophenotypes and antigenomes of colorectal cancers reveals distinct tumor escape mechanisms and novel targets for immunotherapy." Genome biology 16.1 (2015): 1. .. [2] Bray, Nicolas L., et al. "Near-optimal probabilistic RNA-seq quantification." Nature biotechnology 34.5 (2016): 525-527. .. [3] Charoentong, P., Finotello, F., Angelova, M., Mayer, C., Efremova, M., Rieder, D., Hackl, H. and Trajanoski, Z. (2017) Pan-cancer immunogenomic analyses reveals genotype-immunophenotype relationships and predictors of response to checkpoint blockade. Cell Rep, 18, 1-15. .. [4] Hackl, Hubert, et al. "Computational genomics tools for dissecting tumour-immune cell interactions." Nature Reviews Genetics 17.8 (2016): 441-458. .. [5] Hoof, Ilka, et al. "NetMHCpan, a method for MHC class I binding prediction beyond humans." Immunogenetics 61.1 (2009): 1-13. .. [6] McLaren, William, et al. "The Ensembl Variant Effect Predictor." Genome Biology Jun 6;17(1):122. (2016). .. [7] Nielsen, Morten, et al. "NetMHCpan, a method for quantitative predictions of peptide binding to any HLA-A and-B locus protein of known sequence." PloS one 2.8 (2007): e796. .. [8] Nielsen, Morten, and Massimo Andreatta. "NetMHCpan-3.0; improved prediction of binding to MHC class I molecules integrating information from multiple receptor and peptide length datasets." Genome medicine 8.1 (2016): 1. .. [9] Subramanian, Aravind, et al. "Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles." Proceedings of the National Academy of Sciences 102.43 (2005): 15545-15550. .. [10] Szolek, A., Schubert, B., Mohr, C., Sturm, M., Feldhahn, M. and Kohlbacher, O. (2014) OptiType: precision HLA typing from next-generation sequencing data. Bioinformatics, 30, 3310-3316.