The DOE Joint Genome Institute (JGI), a national user facility that supports the management and analysis of complex genomic data, has been working for two years to improve its user interface and infrastructure. The Genome Portal (http://genome.jgi.doe.gov), the massive genomic database and data management system operated by the JGI, now boasts significant upgrades to support efficient handling of the rapidly growing and increasingly diverse genomic data stored there.
The JGI provides high-throughput sequencing and computational analysis in support of DOE missions related to clean energy generation and environmental characterization and cleanup. The Genome Portal allows users to search, download and explore multiple data sets. All DOE JGI sequencing projects are available, as well as the status, assemblies and annotations of sequenced genomes.
The DOE JGI and its partners are no strangers to big data. As a recent paper in Nucleic Acids Research highlights, JGI completed 2,635 projects in 2012, a three-fold increase over 2011. The JGI generated more than 56 trillion nucleotides of genome-sequence data in 2012 and over 70 trillion nucleotides in 2013. Over the past year (2013), JGI has added 650 genomes to the public databases. Because of the increased amount and complexity of data, it became necessary to upgrade the Genome Portal. The main focus of the upgrade was expanding computational resources to enable efficient storage, access, download and analysis of data.
Among the updates are new tools designed to make it easier to locate a specific genome, including a detailed list of all JGI projects, an interactive “Tree of Life” and domain-specific comparative resources. Enhanced search functionality supports searching for genomes and projects by keyword (e.g. plants, algae, single cell, water), name and other categories of data.
The Genome Portal website was built using Apache HTTPD, Tomcat and MySQL, and most of the Genome Portal components have been developed using Java and open source tools. The more robust infrastructure includes four load-balanced Web servers that talk to two back-end database servers. An automated build system uses Jenkins to allow updates to be applied without disrupting users.
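To illustrate what load balancing across such a setup can look like, here is a minimal Python sketch. It is not the JGI's implementation: the article gives only the server counts, so the host names, the `RoundRobinBalancer` class, the `route_requests` helper and the choice of a simple round-robin policy are all assumptions made for illustration.

```python
from itertools import cycle

# Hypothetical host names; the article states only the counts
# (four Web servers, two back-end database servers).
WEB_SERVERS = ["web1", "web2", "web3", "web4"]
DB_SERVERS = ["db1", "db2"]


class RoundRobinBalancer:
    """Hands out servers from a fixed pool in rotating order.

    Round-robin is an assumed policy, chosen for simplicity.
    """

    def __init__(self, servers):
        if not servers:
            raise ValueError("server pool must not be empty")
        self._pool = cycle(list(servers))

    def next_server(self):
        # Return the next server in the rotation.
        return next(self._pool)


def route_requests(n):
    """Assign n requests to (web server, database server) pairs."""
    web = RoundRobinBalancer(WEB_SERVERS)
    db = RoundRobinBalancer(DB_SERVERS)
    return [(web.next_server(), db.next_server()) for _ in range(n)]


if __name__ == "__main__":
    for web_host, db_host in route_requests(5):
        print(web_host, "->", db_host)
```

In this sketch, each incoming request goes to the next Web server in the rotation, so no single machine handles all the traffic. A production deployment would more likely rely on a dedicated load balancer with health checks than on application code like this.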
Partnerships have also been instrumental to the upgrade effort. A strong alliance with the National Energy Research Scientific Computing Center (NERSC) has led to increased HPC-level capabilities, according to the paper’s authors. NERSC hosts the servers that run the Genome Portal and provides access to ESnet (Energy Sciences Network), which facilitates high-speed data transfers.
According to JGI’s Inna Dubchak, JGI’s alliance with NERSC will enable “faster and smoother access for users tapping into the Genome Portal’s resources.”