Analysis¶

This analysis is performed inside the docker container condaforge/mambaforge:4.10.1-0.

The entire Snakefile can be viewed at Snakefile.

Download Gencode v26 Reference¶

This step is handled entirely by the snakemake shell directive. wget is used to fetch the annotations, which are then grepped to keep only those features marked gene. These features are then piped through a serious of cut and tr commands to extract the relevant fields.

Download MANE Reference¶

Download the summary file from the MANE project.

Matched annotation from NCBI and EMBL-EBI, or MANE, is an initiative to “provide a minimal set of matching RefSeq and Ensembl transcripts of human protien-coding genes.” Though it represents only a minimal set and has not yet achieved full genome coverage, it is an incredibly valuable resource for matching these two references together.

Here, the MANE.GRCh38.v0.93.summary.txt.gz file is retrieved with the ever useful pandas.read_csv. After renaming several columns for compatibility with GTEx and BioMart data, the verion numbers are stripped from the ENSG, ENST, and RefSeq IDs, as GTEx does not use the most recent Gencode.

Make GTEx Requests¶

Make a GET request to GTEx medianTranscriptExpression.

This step queries the GTEx API for transcript expression data in the region specified by the user, using a user provided list of genes names. To simplify the use experience, the user should provide common gene names. These are then automatically converted to the GTEx-required Ensembl IDs by referencing Gencode v26 as this is the version used by GTEx. If a gene name is not found in Gencode, then a warning is dumped to the logs, and a blank file created to propagate this error downstream.

As the GTEx API is quite straightforward, these queries can be made using the standard requests.session object. Data were pulled from the gtex_v8 dataset limited to the region specified by the user.

There are a number of bugs in the GTEx API that necesitated some work arounds. The data are returned in a TSV format, as attempts to return in a CSV format produced a tab-delimited first line and comma-delimited body. Additionally, the hclust option cannot be passed as it is force-set to true, no matter the user-passed value. This is circumvented by not passing the value in the first place.

Make BioMart Requests¶

Use BioMart to convert ENST IDs to RefSeq IDs.

Once the data has been retrieved from GTEx, the RefSeq IDs need to be added. The most straightforward way to achieve this is to query BioMart, using the ENST as a filter.

This is achieved using the BioMart class from the bioservices package. The query is made against the hsapiens_gene_ensembl dataset, filtered by ENST, and the ENSG ID, ENST ID, HGNC symbol, and RefSeq ID are returned.

Due to the differing standards used by Ensembl and RefSeq, not all ENST IDs have a corresponding RefSeq ID and, confusingly, some ENST IDs map to multiple RefSeq IDs. It was the latter case that necesitated the addition of data from MANE to provide some clarity to the situation.

Process Final Data¶

Process data to xlsx.

Here, the input data from the previous three steps is combined and written to XLSX. The GTEx data is merged BioMart data using an outer merge, so as to keep all entries. Then, the MANE data is added using a left merge, so as to only keep the data from the GTEx query.