HDF5 in Bioinformatics
DNA sequencing workflows can be very complex and face a number of data management challenges. Typical workflows are characterized by diverse formats, highly redundant data, multiple levels of information, complex associations, repeated file processing, non-scalable storage, and a lack of persistence. Recent work has investigated the use of HDF5 to manage such data.
Two strengths of HDF5 in particular are exploited in these studies: its ability to store and access very large arrays efficiently, and its ability to serve as a container for heterogeneous data. A possible data model was developed to describe the objects involved in a genome experiment, and experiments were conducted to investigate the use of HDF5 in three applications. The first uses HDF5 as a project file containing all data involved in a genome experiment. The second stores very large tables of haplotype data. The third creates, stores, and accesses a very large linkage disequilibrium matrix.
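
As a rough illustration of how the three applications might fit together, the Python sketch below uses the h5py and NumPy packages to build a single in-memory project file that holds descriptive metadata, a chunked haplotype table, and a linkage disequilibrium (LD) matrix that is filled one block at a time. It is not the data model developed in the studies: the group and dataset names, the tiny array sizes, the chunk shapes, and the helper functions build_project_file and fill_ld_block are all hypothetical, and LD is assumed here to be measured as the squared correlation (r^2) between 0/1 allele columns.

```python
import numpy as np
import h5py


def build_project_file(n_haplotypes=8, n_snps=16, seed=0):
    """Create an in-memory HDF5 file holding heterogeneous experiment data.

    All names and sizes are illustrative assumptions, not a published schema.
    """
    rng = np.random.default_rng(seed)
    # driver="core" with backing_store=False keeps the file in memory only.
    f = h5py.File("genome_project.h5", "w", driver="core", backing_store=False)

    # Project-level metadata lives alongside the arrays as attributes.
    f.attrs["description"] = "hypothetical genome experiment"

    # Haplotype table: one row per haplotype, one column per SNP (0/1 alleles).
    # Chunking by blocks of columns lets a reader fetch a few SNPs at a time.
    alleles = rng.integers(0, 2, size=(n_haplotypes, n_snps), dtype=np.uint8)
    f.create_dataset(
        "haplotypes/alleles",
        data=alleles,
        chunks=(n_haplotypes, 4),
        compression="gzip",
    )

    # LD matrix: allocated up front and left as NaN until a block is computed.
    f.create_dataset(
        "ld/r_squared",
        shape=(n_snps, n_snps),
        dtype="f4",
        chunks=(4, 4),
        fillvalue=np.nan,
    )
    return f


def fill_ld_block(f, start, stop):
    """Compute r^2 among SNPs in [start, stop) and write that diagonal block."""
    # Only the requested columns are read from the file.
    block = f["haplotypes/alleles"][:, start:stop].astype(np.float64)
    # A SNP with no variation has undefined correlation; it is stored as NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = np.corrcoef(block, rowvar=False)
    f["ld/r_squared"][start:stop, start:stop] = corr ** 2


if __name__ == "__main__":
    project = build_project_file()
    fill_ld_block(project, 0, 4)
    print(project["haplotypes/alleles"].shape)
    print(project["ld/r_squared"][0:4, 0:4])
    project.close()
```

Because both datasets are chunked, a reader can pull out a band of SNP columns or a single LD block without loading the whole array, which is the property that makes very large tables and matrices practical. A real project file would likely use much larger chunks, tuned to the dominant access pattern.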