Working offline

Introduction

piomart is a command-line tool designed to make annotating differential expression results easier. It can be used in a notebook, on the command line, in a Snakefile, etc. Below is an example of using it on the command line to annotate a csv file of differential expression results. At the end is an example of how it looks integrated into a Snakefile.

Downloading the gtf file

To download the gtf file, use the gtf command. The gtf file needs to be unzipped before the json file can be created, so pass -u as well:

piomart gtf --species homo_sapiens -u

While this is all you need to download and unzip the gtf file over ftp, it is a good idea to specify the release as well, using the --release <ver> option, to ensure reproducibility. If no release is specified, the data is pulled from the http://ftp.ensembl.org/pub/current_gtf/ directory, which is updated every time a new release is published. It is also a good idea to name the file using --output; otherwise you will be left with whatever the file is called in that directory. Currently it's Homo_sapiens.GRCh38.93.gtf.gz.
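For scripted or notebook use, the download step can be wrapped so that the release and output name are always pinned. The sketch below only assembles the argument list from the options described above (--species, -u, --release, --output). The helper name gtf_command is hypothetical, and the release string release-93 follows the form used in the Snakefile example at the end of this page.

```python
import shlex


def gtf_command(species, release=None, output=None, unzip=True):
    """Assemble a `piomart gtf` argument list (hypothetical helper)."""
    cmd = ["piomart", "gtf", "--species", species]
    if unzip:
        cmd.append("-u")  # unzip after download; the json step needs this
    if release is not None:
        cmd += ["--release", release]  # pin the Ensembl release
    if output is not None:
        cmd += ["--output", output]  # choose the file name explicitly
    return cmd


cmd = gtf_command("homo_sapiens", release="release-93",
                  output="homo_sapiens_grch38.gtf")
print(shlex.join(cmd))

# To actually run the download (requires piomart and network access):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems if a file name ever contains spaces.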

Creating the json file

After the gtf file for the species of interest is downloaded, it's time to use it to create the json file, which is parsed to get the gene symbols and other information. The command

piomart json -f Homo_sapiens.GRCh38.93.gtf -o homo_sapiens.json

will parse the gtf file Homo_sapiens.GRCh38.93.gtf and create the json file homo_sapiens.json. If no output file is specified, .json will be added to the input file name, creating Homo_sapiens.GRCh38.93.gtf.json, so it is recommended to specify an output.

Using the info Command

The info command is useful for getting information on one or two genes. Using our previously created homo_sapiens.json,

piomart info ENSG00000278384 --offline -f homo_sapiens.json

will produce the output in the terminal

gene_id ENSG00000278384

gene_version 1

gene_name AL354822.1

gene_source ensembl

gene_biotype protein_coding

seqname GL000218.1

source ensembl

feature gene

start 51867

end 54893

score .

strand -

frame .

Any number of genes can be specified, separated by spaces. info can also take a text file that contains ids.
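If the terminal output needs to be used from a script, it can be turned into a dictionary. The following is a minimal sketch, assuming the output for a single gene is a series of whitespace-separated "field value" lines as shown above. The function name parse_info is hypothetical, and the layout for several genes is not shown here, so it is not handled.

```python
def parse_info(text):
    """Turn 'field value' lines into a dict, skipping blank lines."""
    fields = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if not parts:
            continue  # blank line
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        fields[key] = value
    return fields


# A few lines copied from the example output above.
sample = """\
gene_id ENSG00000278384
gene_name AL354822.1
gene_biotype protein_coding
start 51867
end 54893
strand -
"""

info = parse_info(sample)
length = int(info["end"]) - int(info["start"]) + 1  # inclusive coordinates
print(info["gene_name"], info["gene_biotype"], length)
```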

Appending data to a csv file using dataframe

When used in offline mode, the dataframe command can append any or all of the information available for each gene in the gtf file. The most common way to use it is:

piomart dataframe MyCsv.csv --offline -f homo_sapiens.json --columns=gene_name,gene_id --output=Mycsv_with_symbols.csv

piomart assumes that the index column is the first column and that it contains Ensembl ids. If the index column is not the first column, it can be specified with --index using either the column name or its integer position.

If the Ensembl ids in your csv file contain versions, information will only be appended when the version of the gene in the csv file matches the version in the json file. If they do not match, the original Ensembl id with its version is returned. If all you are interested in is getting the information regardless of version, --inexact can be passed on the command line. --inexact converts all Ensembl ids with versions to plain Ensembl ids. If the id is in the json file, the information is appended. If it is not in the json file, this usually means the Ensembl id has been deprecated; every specified field after gene_id and gene_name will contain 'deprecated', and instead of a gene_name the id is returned with '_d' appended to it.

There is one special case involving paralogs. ENSG00000197976.11 and ENSG00000197976.11_PAR_Y are the same gene, but one is a paralog. Both genes will be present in the returned file, but instead of "_d" being appended to the gene, "_PAR_Y" will be returned, so you will see "ENSG00000197976.11_PAR_Y" in your csv file. All other fields will contain the word "paralog".
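The matching rules above can be summarised in a few lines of Python. This is an illustrative sketch of the described behaviour for the gene_name column only, not piomart's own code. The function name resolve_gene_name and the records lookup (unversioned id mapped to gtf fields) are hypothetical, and the handling of ids that carry no version in exact mode is an assumption.

```python
def resolve_gene_name(ensembl_id, records, inexact=False):
    """Return the gene_name value for one id, following the rules above.

    `records` is a hypothetical lookup (unversioned id -> gtf fields);
    it is not piomart's actual json layout.
    """
    # Paralog ids are returned unchanged, e.g. ENSG00000197976.11_PAR_Y.
    if ensembl_id.endswith("_PAR_Y"):
        return ensembl_id

    base, _, version = ensembl_id.partition(".")
    record = records.get(base)

    if inexact:
        # The version is ignored; a missing id is treated as deprecated.
        return record["gene_name"] if record else base + "_d"

    # Exact mode: a versioned id must match the stored gene_version.
    # Assumption: an id without a version matches on the id alone.
    if record and (not version or version == record["gene_version"]):
        return record["gene_name"]
    return ensembl_id  # mismatch: the original id is returned as-is


# One record taken from the info example; the second id below is made up.
records = {"ENSG00000278384": {"gene_version": "1", "gene_name": "AL354822.1"}}

for gene, flag in [("ENSG00000278384.1", False),
                   ("ENSG00000278384.2", False),
                   ("ENSG00000278384.2", True),
                   ("ENSG00000000001.1", True),
                   ("ENSG00000197976.11_PAR_Y", False)]:
    print(gene, "inexact" if flag else "exact", resolve_gene_name(gene, records, flag))
```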

Bonus: Using piomart in a Snakefile

Below is an example of using piomart on some deseq2 results to generate annotated csv files.

deseq_files = ["onset_deseq2_results", "cortical_deseq2_results", "cag_repeat_deseq2_results"]

deseq_out = expand("{file}_annotated.csv", file=deseq_files)

rule all:
    input:
        deseq_out

rule download_gtf:
    output:
        "homo_sapiens_grch38.json"
    shell:
        "piomart gtf -u --species homo_sapiens --release release-93 --output homo_sapiens_grch38.gtf "
        "&& piomart json -f homo_sapiens_grch38.gtf -o {output[0]}"

rule annotated_deseq_results:
    input:
        "homo_sapiens_grch38.json",
        "{file}.csv"
    output:
        "{file}_annotated.csv"
    shell:
        "piomart dataframe {input[1]} --offline -f {input[0]} --columns=gene_name --output={output[0]}"