Extended Cross-Population (ExP) Heatmap

Welcome to the ExP Heatmap python package and command-line tool. Our software is focused on displaying multidimensional data, expecially the so called cross-population data - differences/similarities/p-values/or any other parameters of your choice between several groups.

We aim our tool to display the cross-population data from 1000 Genomes Project, phase 3 (all data freely downloadable from 1000genomes.org). This data-set consists of 26 populations that are further grouped into 5 super-populations. We have developed the basis of ExP heatmap method while analysing the selection pressure patern on the 1000 Genomes Project, phase 3, data. The greatest obstacle in this study was not getting the data or computing the selection tests, but it was to sort the huge number of results and select those who might be important to anwer our research questions. Here, we are talking about tens or hunderds of milions of p-values. A lot of them will be important, but another bunch will be just false-positives and random noise. The tests, we are talking about are the above-mentioned cross-population tests. In case of 1000 Genomes dataset, from the 26 populations we can create 325 unique pairs (combinations), and for each of this population pair, we can compute the appropriate test for many thousands loci (usually SNPs) in a chosen genomic window.

The ExP Heatmap manual is divided into following sections:

  1. ExP Heatmap - methodological outline

  2. Python package install and description

  3. Examples, usage, and prepared scripts

  4. Licence and final remarks


1. ExP Heatmap - methodological outline

The core of our approach to displaying mutlidimensional cross-population data is based on a heatmap graph. Here, the numerical value to be displayed is coded by a color or color shade. As noted above, our method focuses on displaying cross-population values, 325 combinations in the case of 26 population data-set from 1000 Genomes Project, phase 3 data. So, as an input data for our heatmap we have table with 325 rows and variable number of columns. Here, the columns are the variable positions in our genomic data (e.g. SNPs) and the values themselves are for example the p-values of cross-populational tests (e.g. XP-EHH) or pairwise parameters (genetic distance, Fst).

2. Python package install and description

To install the expheatmap package from the provided wheel file:

pip install expheatmap-0.9.0-py3-none-any.whl

3. Examples, usage, and prepared scripts

The expheatmap can be used as standalone command-line tool, or imported directly into your python script/jupyter notebook.

Here we outline a solution to 3 possible and most common scenarios where the ExP is being used. Possible model scenarios:

a) you have values to display ready

Your data are in a *.tsv file, tab-delimited text file (table), where the pvalues or results are stored (327 * Y), first column is 'CHROM', second column 'POS', followed by 325 columns of pairwise parameters (i.e. rank p-values).

command-line usage:

expheatmap_cli.py [-h] -i INPUT -o OUTPUT [-d DPI] [-x XLABEL]
                  [-y YLABEL] [-t TITLE] [-c CUTOFF] [-mx MARKERX]
                  [-l MARKERLABEL]

optional arguments:
  -h, --help            show this help message and exit

I/O options:
  -i INPUT, --input INPUT
                        input file, tab-delimited table, first columns are
                        CHROM, POS, follow by the population pair-wise
                        pvalues/statistic
  -o OUTPUT, --output OUTPUT
                        output picture file, supported formats are png, pdf,
                        ps, eps and svg.
  -d DPI, --dpi DPI     dpi of the saved picture

graph layout:
  -x XLABEL, --xlabel XLABEL
                        x-axis label
  -y YLABEL, --ylabel YLABEL
                        y-axis label
  -t TITLE, --title TITLE
                        title of the graph
  -c CUTOFF, --cutoff CUTOFF
                        limit the range of the displayed values
  -mx MARKERX, --markerx MARKERX
                        position of a vertical line marker on x-axis, use POS
                        value from vcf
  -l MARKERLABEL, --markerlabel MARKERLABEL
                        label of the markerx position

Alternatively, you can import the ExP Heatmap module in your Python 3 script:

import expheatmap
import pandas as pd

# input data in the form of pandas DataFrame, expected shape (x, 327)
# 327 columns consisting of CHROM, POS and 325 columns of pairwise p-values
# x represents the number of SNPs to display
# column names are expected to include the 1000 Genomes population abbreviations
data = pd.read_csv("test_results_pvalues.csv", sep="\t")


expheatmap.draw_heatmap(data, pic_name="data_expheatmap.png", title="My results",
                        y_label="1000 Genomes Project (phase 3) populations",
                        colormap="expheatmap", cutoff=(1.301, 4.262),
                        marker_x=None, marker_label="", display=False,
                        display_xticks=False, dpi=300)

b) you have some kind of parameters/test results, need to compute the p-values and display them

Here, you will need to compute the p-values using a prepared function in expheatmap python package.

import expheatmap
import pandas as pd

# input data in the form of pandas DataFrame, expected shape (x, 327)
# 327 columns consisting of CHROM, POS and 325 columns of pairwise p-values
# x represents the number of SNPs to display
# column names are expected to include the 1000 Genomes population abbreviations
data = pd.read_csv("test_results.csv", sep="\t")

# compute the rank p-values using expheatmap.rank_pvalues() function
# first two columns (CHROM, POS) in input pd.DataFrame not included in computations
# but are added later to the pvalues df
pvalues = data.iloc[:, 2:].apply(lambda x: expheatmap.rank_pvalues(x, source="series"), axis=0)
pvalues.insert(loc=0, column="CHROM", value=data["CHROM"])
pvalues.insert(loc=1, column="POS", value=data["POS"])

# lets save the p-values to a file
pvalues.to_csv("test_results_pvalues.csv", sep="\t", index=False)

expheatmap.draw_heatmap(pvalues, pic_name="data_expheatmap.png", title="My results",
                        y_label="1000 Genomes Project (phase 3) populations",
                        colormap="expheatmap", cutoff=(1.301, 4.262),
                        marker_x=None, marker_label="", display=False,
                        display_xticks=False, dpi=300)

c) you only have the input data (vcf)...

...and need to compute the parameters/tests, turn them into p-values and display them as ExP heatmap. Here the process will differ depending on what kind test you want to run. Below we give different examples using common tools (VCFtools) and pythonic library scikit-allel.

##############################################################
# c) you only have the input data (vcf)                      #
#    and need to compute the parameters/tests                #
#    turn them into p-values and display them as ExP heatmap.#
##############################################################

# Fst computation for all 325 population pairs using VCFtools

vcf_in = "LCT_1Mbp_win.vcf"  # your input data
scripts_file = "test_pop_pairwise_fst.txt"  # temporary file, that holds up the vcftools commands for parallel execution
n_jobs = 6  # number of jobs to run in parallel (with 325 population combinations, the computation will take some time)
temp_dir = "test_out"  # VCFtools results temporary directory

# VCFtools command
command = "vcftools --vcf {} --out {} --weir-fst-pop {} --weir-fst-pop {}"


# 1. Create the file with commands to be executed by vcftools
with open(scripts_file, "w", encoding="utf-8") as file:
    for i in expheatmap.pop_doublets:
        file.write(command.format(vcf_in,
                                  "{}/Fst_{}_{}".format(temp_dir, i[0], i[1]),
                                  "expheatmap/data/pop_samples/{}.txt".format(i[0]), # need the sample names for each 1000genomes population
                                  "expheatmap/data/pop_samples/{}.txt".format(i[1])  # dtto
                                 ))
        file.write("\n")
# 2. Run vcftools commands from the scripts_file in parallel right from the jupyter notebook (i.e. compute the pairwise Fst values)
cat $scripts_file | parallel --jobs $n_jobs --no-run-if-empty
# 3. Gather the Fst results from VCFtools output files
# cycle through the files from vcftools and add the Fst results into a pd.DataFrame
data = expheatmap.prepare_data(pos_file=vcf_in)

# for all population combinations (325), read the file and append the Fst results to master dataframe
for p1, p2 in expheatmap.pop_doublets:
    filename = "{}/Fst_{}_{}.weir.fst".format(temp_dir, p1, p2)
    
    try:
        x = pd.read_csv(filename, sep="\t", header=0)
    
    except FileNotFoundError:
        print("{}_{} not found".format(p1, p2))
        data["{}_{}".format(p1, p2)] = x["WEIR_AND_COCKERHAM_FST"] = np.nan
        
    else: # only if no error in block 1 (i.e. file exists)
        data["{}_{}".format(p1, p2)] = x["WEIR_AND_COCKERHAM_FST"]



# 4. Calculate the p-values
pvalues = data.iloc[:, 2:].apply(lambda x: expheatmap.rank_pvalues(x, source="series"), axis=0)
pvalues.insert(loc=0, column="CHROM", value=data["CHROM"])
pvalues.insert(loc=1, column="POS", value=data["POS"])


# 5. Draw the heatmap
# marker at the 2:136,608,646 of the rs4988235 SNP
expheatmap.draw_heatmap(pvalues, pic_name=None, title="LCT Fst results, variant c)",
                        y_label="1000 Genomes Project (phase 3) populations",
                        colormap="expheatmap", cutoff=(2.000, 4.262),
                        marker_x=136608646, marker_label="rs4988235", display=True,
                        display_xticks=False, dpi=300)

4. Licence and final remarks

The ExP Heatmap package is available under the GNU Affero General Public License v3.0. (link)

If you would be interested in using this method in your commercial software under another licence, please, contact us at edvard.ehler@img.cas.cz.