For detailed information about GQT please see our Nature Methods paper at:
http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.3654.html
For questions and discussion about GQT please visit the forum at:
https://groups.google.com/forum/#!forum/gqt-discuss
GQT
Genotype Query Tools (GQT) is command line software and a C API for indexing and querying large-scale genotype data sets like those produced by 1000 Genomes, the UK100K, and forthcoming datasets involving millions of genomes. GQT represents genotypes as compressed bitmap indices, which reduce computational burden of variant queries based on sample genotypes, phenotypes, and relationships by orders of magnitude over standard "variant-centric" indexing strategies. This index can significantly expand the capabilities of population-scale analyses by providing interactive-speed queries to data sets with millions of individuals.
Table of Contents
- [Installation] (#installation)
- [Example work flow] (#example-workflow)
- [Extra detail] (#extra-detail)
Installation
GQT depends on htslib, sqlite3, and lex (flex).
Step 1. Install htslib.
git clone https://github.com/samtools/htslib.git
cd htslib
./configure --disable-bz2 --disable-lzma --enable-libcurl
make
cd ..
Step 2. Download sqlite amalgamation source.
wget http://www.sqlite.org/2014/sqlite-amalgamation-3080701.zip
unzip sqlite-amalgamation-3080701.zip
Step 3. Check to see if your system has lex/flex installed. If not, install.
lex -V
flex -V
If both fail then install from source, otherwise skip to step 4
wget http://downloads.sourceforge.net/project/flex/flex-2.5.39.tar.bz2
bunzip2 flex-2.5.39.tar.bz2
tar xvf flex-2.5.39.tar
cd flex-2.5.39
./configure
make
make install
cd ..
Step 4. Get GQT source then modify the GQT Makefile by setting the
HTS_ROOT and SQLITE_ROOT variable in src/Makfile to reflect their
locations based on the directories in which they were placed during steps 1 and 2. Compile.
git clone https://github.com/ryanlayer/gqt.git
cd gqt/
make
At this point, it is recommended that you copy the gqt binary to a directory that is on your PATH.
Step 5 (Optional) Run the GQT functional tests.
5a. Install bcftools (not necessary for GQT to function, but useful for the functional tests below).
cd ..
git clone https://github.com/samtools/bcftools
cd bcftools
make
5b. Install plink (v1.9) (not necessary for GQT to function, but useful for the functional tests below)..
# Download the appropriate binary from:
https://www.cog-genomics.org/plink2
# Now copy the plink binary to a directory on your PATH
5c. In addition, after you install bcftools,
you need to also update the directory assigned to BCFTOOLS_PLUGIN in your gqt/test/func/functional_tests.sh
file to be the plugins directory in the bcftools source tree. For example, if you compiled bcftool
in ~/src, the correct path for BCFTOOLS_PLUGIN would be ~/src/bcftools/plugins. That is:
export BCFTOOLS_PLUGINS="$HOME/src/bcftools/plugins"
5d. Lastly, you may need to also upate the directories assigned to either LD_LIBRARY_PATH (linux)
or DYLD_LIBRARY_PATH (Mac OS X) in your gqt/test/func/functional_tests.sh file to include the directory where libhts.so.1 is located. For example, if you compiled htslib in ~/src, the correct settings would be:
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$HOME/src/htslib"
export DYLD_LIBRARY_PATH="$DYLD_LIBRARY_PATH:$HOME/src/htslib"
Okay, now you are ready to run the tests. Make sure you are back in the make gqt source directory.
cd ../gqt/test/func
./functional_tests.sh
cd ../../..
Example workflow
Create a GQT index from a BCF
GQT starts with a target VCF, VCF.GZ, or BCF file. Download an example BCF file.
wget --trust-server-names http://bit.ly/gqt_bcf
To index the BCF, GQT needs to know the number of columns and rows in the BCF. This information can be either supplied with command line parameters or GQT can extract this info from the CSI index produced by bcftools.
bcftools index chr11.11q14.3.bcf
With the BCF index created, now create the GQT index.
gqt/bin/gqt convert bcf -i chr11.11q14.3.bcf
This will create three files:
chr11.11q14.3.bcf.gqtis the GQT indexchr11.11q14.3.bcf.bimis the compressed variant metadatachr11.11q14.3.bcf.vidis the variant order information
Create a GQT sample database
GQT requires a database that describes the samples in the target BCF file.
This database is based on the samples in the BCF, and can be optionally
augmented by file (typically called PED file) that contains tab separated
fields, one line that gives the field name, and the subsequent lines give the
field values for each sample. One column in this file must correspond to a
sample in the BCF file. By default column 2 is used, but this can be modified
by using the -c option. A warning will be generated for each record in the
PED file that does not have a corresponding sample in the VCF/BCF. NOTE: Each
field in the first line will become a database column. Given the naming
restriction, many special characters are not allowed (e.g., %, $, :), and all
spaces will be converted to underscores ("_").
For example, the following file includes Family ID, Population and Gender information. Using the default sample column (2), the each Individual ID will be matched with the Sample ID in the VCF/BCF.
Family ID Individual ID Population Gender
HG00096 HG00096 GBR 1
HG00171 HG00171 FIN 2
Without a PED file the database can be created from the BCF file:
gqt/bin/gqt convert ped -i chr11.11q14.3.bcf
This creates a database named chr11.11q14.3.bcf.db
When a PED file is available it can be included, thus allowing more sophisticated queries based on the phenotypes, ancestries, and relationships of the samples. More on this below.
wget --trust-server-names http://bit.ly/gqt_ped
gqt/bin/gqt convert ped -i chr11.11q14.3.bcf -p 1kg.phase3.ped
This creates a database named 1kg.phase3.ped.db
Query the GQT index
When the database is based only on the BCF file (chr11.11q14.3.bcf.db),
queries can be based on the sample names. For example, you can find the
variants where at least two of the three listed individuals is heterozygous.
gqt/bin/gqt query \
-i chr11.11q14.3.bcf.gqt \
-p "BCF_Sample in ('NA21126','NA21127','NA21128')" \
-g "count(HET)>1" \
> NA21126-8.HET.vcf