HDF5 in Bioinformatics

DNA sequencing workflows can be very complex and face a number of data management challenges. Typical workflows are characterized by diverse file formats, highly redundant data, multiple levels of information, complex associations among data objects, repeated file processing, non-scalable storage, and a lack of persistence. Recent work has investigated the use of HDF5 to manage such data.

These studies exploit two strengths of HDF5 in particular: its ability to store and access very large arrays efficiently, and its ability to serve as a container for heterogeneous data. A candidate data model was developed to describe the objects involved in a genome experiment, and experiments were conducted to investigate the use of HDF5 for three applications. The first is the use of HDF5 as a project file containing all data involved in a genome experiment. The second is storing very large tables of haplotype data. The third is creating, storing, and accessing a very large linkage disequilibrium matrix.
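
To make the three applications concrete, the following is a minimal sketch and not the data model or code used in the studies. It assumes the Python h5py and NumPy packages, which the abstract does not mention. The file name, the dataset paths "haplotypes/alleles" and "ld/r2", and the helper build_project_file are hypothetical, and the squared correlation r^2 is assumed here as the linkage disequilibrium measure. The sketch uses one HDF5 file as a project container, stores a haplotype table as a chunked, compressed array, and writes a pairwise LD matrix that can then be read back one block at a time.

```python
import numpy as np
import h5py


def build_project_file(haplotypes, name="genome_project.h5"):
    """Store a haplotype table and its r^2 LD matrix in one HDF5 file.

    The file name and dataset paths are hypothetical, chosen for illustration.
    """
    # driver="core" with backing_store=False keeps the file in memory only,
    # so this sketch leaves nothing on disk.
    f = h5py.File(name, "w", driver="core", backing_store=False)
    f.attrs["description"] = "hypothetical genome experiment"

    # Haplotype table: rows are chromosomes, columns are SNPs (0/1 alleles).
    # Intermediate groups in the path are created automatically.
    hap = f.create_dataset("haplotypes/alleles", data=haplotypes,
                           dtype="i1", chunks=True, compression="gzip")

    # LD matrix (assumed measure): squared Pearson correlation between
    # every pair of SNP columns.
    r = np.corrcoef(hap[...].astype(float), rowvar=False)
    n_snps = haplotypes.shape[1]
    ld = f.create_dataset("ld/r2", shape=(n_snps, n_snps), dtype="f4",
                          chunks=True, compression="gzip")
    ld[...] = r ** 2
    return f


# Synthetic data only: 200 chromosomes by 50 SNPs of random alleles.
rng = np.random.default_rng(0)
haplotypes = rng.integers(0, 2, size=(200, 50), dtype=np.int8)

project = build_project_file(haplotypes)
# Read a 10 x 10 block of the LD matrix without loading the whole array.
block = project["ld/r2"][10:20, 10:20]
project.close()
```

For a matrix too large for memory, the same chunked dataset could be filled block by block rather than in a single assignment. The full in-memory computation above is used only to keep the sketch short.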

Last modified: 06/02/2017
Sponsored by Subcontract number 4400528183 under Raytheon Contract number NNG15HZ39C, funded by NASA / Maintained by The HDF Group