Chapter 2
The HDF5 Library and Programming Model

1. Introduction

The HDF5 Library implements the HDF5 abstract data model and storage model described in the preceding chapter, “The HDF5 Data Model”. The library exports a set of application programming interfaces (APIs) as its external interface. These APIs perform several categories of operations, as listed in Table 1, “The HDF5 API naming scheme.”

Two major objectives of the HDF5 products are to provide tools that can be used on as many computational platforms as possible (portability) and to provide a reasonably object-oriented data model and programming interface. These objectives are somewhat in conflict, as cross-platform portability is still a weak point of true object-oriented programming languages.

To be as portable as possible, the HDF5 Library is implemented in portable C. C is not an object-oriented language, but the library uses several mechanisms and conventions to implement an object model.

Table 1. The HDF5 API naming scheme

Prefix   Operates on
H5A      Attributes
H5D      Datasets
H5E      Error reports
H5F      Files
H5G      Groups
H5I      Identifiers
H5L      Links
H5O      Objects
H5P      Property lists
H5R      References
H5S      Dataspaces
H5T      Datatypes
H5Z      Filters

First, the HDF5 Library implements objects as data structures. To refer to an object, the library implements its own pointers, called identifiers. An identifier is used to invoke operations on a specific instance of an object. For example, when a group is opened, the API returns a group identifier. This identifier is a reference to that specific group and is used to invoke subsequent operations on that group. The identifier is valid only within the context in which it is created and remains valid until it is closed or the file is closed. This mechanism is essentially the same as the one C++ and other object-oriented languages use to refer to objects, except that the syntax is C.

Similarly, object-oriented languages collect all the methods for an object in a single name space, e.g., the methods of a C++ class. The C language has no such mechanism, but the HDF5 Library simulates it through its API naming scheme: all operations on a particular class of objects have names that begin with a common prefix. Table 1 lists the HDF5 objects and the standard prefixes used by the corresponding HDF5 APIs. For example, functions that operate on datatype objects all have names beginning with H5T.
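Because identifiers behave as opaque handles rather than C pointers, a typical interaction with any HDF5 object follows an open-use-close pattern. The fragment below is a minimal sketch of that convention; the group name /Data and the already-open file identifier file are assumptions for illustration only.

    hid_t grp;                        /* an identifier, not a C pointer */

    grp = H5Gopen(file, "/Data");     /* H5G prefix: a group operation */
    if (grp < 0)
        printf("Could not open the group.\n");
    status = H5Gclose(grp);           /* the identifier becomes invalid */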

2. The HDF5 Programming Model

In this section we introduce the HDF5 programming model by means of a series of short code samples illustrating a broad selection of common HDF5 tasks. These are merely illustrative examples; full details are provided in the following chapters and in the HDF5 Reference Manual.

2.1 Creating an HDF5 file

Before an HDF5 file can be used or referred to in any manner, it must be explicitly created or opened. When using the default property lists, as we will for now, this is a simple matter. When the need for access to a file ends, the file must be closed. Figure 1 provides a C code fragment illustrating these steps.

If there is a possibility that a file of the declared name already exists and you wish to open a new file regardless of that possibility, the flag H5F_ACC_TRUNC will cause the operation to overwrite the previous file. If you instead want the operation to fail in such a circumstance, use the flag H5F_ACC_EXCL.


    hid_t       file;                 /* declare file identifier */
    herr_t      status;
    /*
    * Create a new file using H5F_ACC_TRUNC 
    * to truncate and overwrite any file of the same name,
    * default file creation properties, and 
    * default file access properties.
    * Then close the file.
    */
    file = H5Fcreate(FILE, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    status = H5Fclose(file); 
    
Figure 1. Creating and closing an HDF5 file.
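If the application should never clobber an existing file, a hedged variation is to attempt an exclusive create and fall back to opening the file; the fragment below assumes the same FILE name macro as Figure 1.

    /*
    * Try to create the file only if it does not already exist;
    * H5Fcreate returns a negative identifier on failure.
    */
    file = H5Fcreate(FILE, H5F_ACC_EXCL, H5P_DEFAULT, H5P_DEFAULT);
    if (file < 0) {
        /* the file exists; open it for read/write access instead */
        file = H5Fopen(FILE, H5F_ACC_RDWR, H5P_DEFAULT);
    }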

2.2 Creating and initializing the essential components of a dataset

The datatype and the dataspace (i.e., the dimensionality of the array containing the dataset's raw data) are independent objects and are created separately from any dataset to which they may be attached. Hence, creating a dataset requires, at a minimum, the following steps:

  1. Create and initialize a dataspace defining the dimensions of the dataset array.
  2. Define the dataset datatype.
  3. Create and initialize the dataset itself.

The code in Figure 2 illustrates the execution of these steps.


    hid_t    dataset, datatype, dataspace;  /* declare identifiers */
    hsize_t  dimsf[2];                      /* dataset dimensions */
    herr_t   status;
    
    /* 
     * Create a dataspace: Describe the size of the array and 
     * create the dataspace for a fixed-size dataset. 
     */
    dimsf[0] = NX;
    dimsf[1] = NY;
    dataspace = H5Screate_simple(RANK, dimsf, NULL); 
    /*
     * Define a datatype for the data in the dataset.
     * We will store little endian integers.
     */
    datatype = H5Tcopy(H5T_NATIVE_INT);
    status = H5Tset_order(datatype, H5T_ORDER_LE);
    /*
     * Create a new dataset within the file using the defined 
     * dataspace and datatype and default dataset creation
     * properties.
     * NOTE: H5T_NATIVE_INT can be used as the datatype if 
     * conversion to little endian is not needed.
     */
    dataset = H5Dcreate(file, DATASETNAME, datatype, dataspace,
                        H5P_DEFAULT); 
    
Figure 2. The most basic steps in creating an HDF5 dataset.

2.3 Closing an object once it is no longer needed

An application should close a datatype, dataspace, or dataset object once it is no longer needed. Since each is an independent object, each must be released (or closed) separately. This action is frequently referred to as releasing the object's identifier. The code in Figure 3 closes the datatype, dataspace, and dataset that were created in the preceding section.


    H5Tclose(datatype); 
    H5Dclose(dataset); 
    H5Sclose(dataspace);
    
Figure 3. Closing objects in an HDF5 file, or releasing their identifiers.

2.4 Writing or reading a dataset to or from a file

Having created the dataset, the actual data can be written with a call to H5Dwrite, as illustrated in Figure 4.


    /*
    * Write the data to the dataset using default transfer
    * properties.
    */
    status = H5Dwrite(dataset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL,
              H5P_DEFAULT, data);
    
Figure 4. Writing the dataset.

Note that the third and fourth H5Dwrite parameters in the above example describe the dataspaces in memory and in the file, respectively. For now, these are both set to H5S_ALL, indicating that the entire dataset is to be written. The selection of partial datasets and the use of differing dataspaces in memory and in storage will be discussed later in this chapter and in more detail elsewhere in this guide.

Reading the dataset from storage is analogous to writing. If we wished to read an entire dataset, we could simply substitute H5Dread for H5Dwrite in the above example.
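For example, the following fragment, a sketch assuming the buffer data is large enough to hold the entire dataset, reads everything back with default transfer properties:

    /*
    * Read the entire dataset into the buffer "data" using
    * default transfer properties.
    */
    status = H5Dread(dataset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL,
              H5P_DEFAULT, data);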

2.5 Reading and writing a portion of a dataset

In the previous discussion, we described writing or reading an entire dataset. HDF5 also supports access to selected portions of a dataset, known as selections, without having to read or write the entire dataset.

The simplest type of selection is a simple hyperslab, an n-dimensional rectangular subset of a dataset where n is equal to the dataset’s rank. Other available selections include a more complex hyperslab with user-defined stride and block size, a list of independent points, and the union of any of these.

Figure 5 illustrates several sample instances of selections.

Figure 5. HDF5 dataspace selections can take the form of a simple hyperslab, a hyperslab with user-defined stride and block, a selection of points, or a union of any of the above.

Selections and hyperslabs are portions of a dataset. As described above, a simple hyperslab is a rectangular array of data elements with the same rank as the dataset’s dataspace. Thus, a simple hyperslab is a logically contiguous collection of points within the dataset.

A more general hyperslab can also be a regular pattern of points or blocks within the dataspace. Four parameters are required to describe a general hyperslab: the starting coordinates, the block size, the stride or space between blocks, and the number of blocks. Each of these parameters is expressed as a one-dimensional array whose length is equal to the rank of the dataspace. The parameters are described in Table 2.

Table 2. The parameters required to fully define a general hyperslab.

Parameter  Definition
start      The coordinates of the starting location of the hyperslab in the dataset’s dataspace.
block      The size of each block to be selected from the dataspace. If the block parameter is set to NULL, the block size defaults to a single element in each dimension, as if the block array were set to all 1s (all ones). This results in the selection of a uniformly spaced set of count points starting at start and spaced at the interval defined by stride.
stride     The number of elements separating the starting point of each element or block to be selected. If the stride parameter is set to NULL, the stride size defaults to 1 (one) in each dimension and no elements are skipped.
count      The number of elements or blocks to select along each dimension.
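To illustrate how the four parameters work together, the following sketch selects a pattern of 2x2 blocks from a two-dimensional dataspace, with block starting points three elements apart in each dimension; the dataspace identifier space is assumed to have been obtained earlier (Figure 9 below gives a complete example).

    hsize_t start[2]  = {0, 0};   /* begin at the dataspace origin */
    hsize_t stride[2] = {3, 3};   /* block starting points 3 elements apart */
    hsize_t count[2]  = {2, 2};   /* select 2 blocks along each dimension */
    hsize_t block[2]  = {2, 2};   /* each block is 2 x 2 elements */

    status = H5Sselect_hyperslab(space, H5S_SELECT_SET,
                                 start, stride, count, block);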

Hyperslab example without strides or blocks. For maximum flexibility in user applications, a selection in storage can be mapped into a differently-shaped selection in memory; all that is required is that the two selections contain the same number of data elements. In this example, we will first define the selection to be read from the dataset in storage; we will then define the selection as it will appear in application memory.

Suppose we want to read a 3x4 hyperslab from a two-dimensional dataset in a file, beginning at the dataset element <1,2>. As illustrated in Figure 6, we must create the dataspace that describes the overall rank and dimensions of the dataset in the file, as well as the position and size of the in-file hyperslab that we are extracting from that dataset.


    /* 
    * Define dataset dataspace in file.
    */
    dataspace = H5Dget_space(dataset);    /* dataspace identifier */
    rank      = H5Sget_simple_extent_ndims(dataspace);
    status_n  = H5Sget_simple_extent_dims(dataspace, dims_out, NULL);

    /* 
    * Define hyperslab in the dataset. 
    */
    offset[0] = 1;
    offset[1] = 2;
    count[0]  = 3;
    count[1]  = 4;
    status = H5Sselect_hyperslab(dataspace, H5S_SELECT_SET, offset, NULL, 
             count, NULL);
    
Figure 6. Define the selection to be read from storage.

The next task is to define an analogous dataspace in memory. Suppose, for instance, that we have in memory a three-dimensional 7x7x3 array into which we wish to read the two-dimensional 3x4 hyperslab described above, and that we want the memory selection to begin at the element <3,0,0> and reside in the plane of the first two dimensions of the array. Since the in-memory dataspace is three-dimensional, we have to describe the in-memory selection as three-dimensional. Since we are keeping the selection in the plane of the first two dimensions of the in-memory array, the in-memory selection will be a 3x4x1 array, defined as <3,4,1>.

Notice that we must describe two things: the dimensions of the in-memory array, and the size and position of the hyperslab that we wish to read in. Figure 7 illustrates how this would be done.


    /*
    * Define memory dataspace.
    */
    dimsm[0] = 7;
    dimsm[1] = 7;
    dimsm[2] = 3;
    memspace = H5Screate_simple(RANK_OUT,dimsm,NULL);   

    /* 
    * Define memory hyperslab. 
    */
    offset_out[0] = 3;
    offset_out[1] = 0;
    offset_out[2] = 0;
    count_out[0]  = 3;
    count_out[1]  = 4;
    count_out[2]  = 1;
    status = H5Sselect_hyperslab(memspace, H5S_SELECT_SET, offset_out, NULL, 
             count_out, NULL);
    
Figure 7. Defining the memory dataspace and selection.

The hyperslab in the above figure has the following parameters: start=(3,0,0), count=(3,4,1), stride and block size are NULL.
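With both selections defined, a single H5Dread call moves the data from the file selection into the memory selection. The following sketch assumes data_out is an integer buffer with the 7x7x3 shape of the memory dataspace.

    /*
    * Read the 3x4 hyperslab in the file into the 3x4x1 selection
    * in memory.
    */
    status = H5Dread(dataset, H5T_NATIVE_INT, memspace, dataspace,
              H5P_DEFAULT, data_out);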

A second example goes in the other direction, writing a selection from memory to a selection in a dataset in a file. Suppose that the source dataspace in memory is a 50-element, one-dimensional array called vector, as illustrated in Figure 8, and that the source selection is a 48-element simple hyperslab selection that starts at the second element of vector.

  -1     1     2     3     ...     49     50     -1  
 
Figure 8. The one-dimensional array vector in memory.

Further suppose that we wish to write this data to the file as a series of 3x2-element blocks in a 2-dimensional dataset, skipping one row and one column between blocks. Since the source selection contains 48 data elements and each block in the destination selection contains 6 data elements, we must define the destination selection with 8 blocks; we'll write 2 blocks in the first dimension and 4 in the second. Figure 9 provides sample code to achieve this objective.


    /* Select the hyperslab for the dataset in the file, using 3x2 blocks, 
     * a (4,3) stride, a (2,4) count, and starting at the position (0,1).
     */
    start[0]  = 0; start[1]  = 1;
    stride[0] = 4; stride[1] = 3;
    count[0]  = 2; count[1]  = 4;    
    block[0]  = 3; block[1]  = 2;
    ret = H5Sselect_hyperslab(fid, H5S_SELECT_SET, start, stride, count, block);

    /*
     * Create the dataspace for the source selection in memory.
     */
    mid1 = H5Screate_simple(MSPACE1_RANK, dim1, NULL);

    /*
     * Select hyperslab. 
     * We will use 48 elements of the vector buffer starting at the second element.
     * Selected elements are 1 2 3 . . . 48
     */
    start[0]  = 1;
    stride[0] = 1;
    count[0]  = 48;
    block[0]  = 1;
    ret = H5Sselect_hyperslab(mid1, H5S_SELECT_SET, start, stride, count, block);
 
    /*
     * Write the selection from the vector buffer to the dataset in the file.
     */
    ret = H5Dwrite(dataset, H5T_NATIVE_INT, mid1, fid, H5P_DEFAULT, vector);
    
Figure 9. Writing a selection from memory to a selection in a dataset in the file.

2.6 Getting information about a dataset

Although reading is analogous to writing, it is often first necessary to query a file to obtain information about the dataset to be read. For instance, we often need to determine the datatype associated with a dataset, or its dataspace (i.e., rank and dimensions). As illustrated in Figure 10, there are several get routines for obtaining this information.


    /*
    * Get datatype and dataspace identifiers,  
    * then query datatype class, order and size, and 
    * dataspace rank and dimensions.
    */

    datatype  = H5Dget_type(dataset);     /* datatype identifier */ 
    class     = H5Tget_class(datatype);
    if (class == H5T_INTEGER) printf("Dataset has INTEGER type \n");
    order     = H5Tget_order(datatype);
    if (order == H5T_ORDER_LE) printf("Little endian order \n");

    size  = H5Tget_size(datatype);
    printf("Data size is %d \n", (int)size);

    dataspace = H5Dget_space(dataset);    /* dataspace identifier */
    rank      = H5Sget_simple_extent_ndims(dataspace);
    status_n  = H5Sget_simple_extent_dims(dataspace, dims_out, NULL);
    printf("rank %d, dimensions %lu x %lu \n", rank,
           (unsigned long)dims_out[0], (unsigned long)dims_out[1]);
    
Figure 10. Getting information about a dataset.

2.7 Creating and defining compound datatypes

An HDF5 compound datatype is similar to a C struct or a Fortran common block. Though not originally designed with databases in mind, HDF5 compound datatypes are sometimes used in a manner analogous to a database record.

HDF5 defines a compound datatype as a collection of one or more data elements. Each element is an atomic type, a small array, or another compound datatype. The provision for nested compound datatypes allows these structures to become quite complex. Compound datatypes thus become either a powerful tool or a complex and difficult-to-debug construct; reasonable caution is advised.

To create and use a compound datatype, you need to create a datatype with class compound (H5T_COMPOUND) and specify the total size of the data element in bytes. A compound datatype consists of zero or more uniquely named members. Members can be defined in any order but must occupy non-overlapping regions within the datum. Table 3 lists the properties of compound datatype members.

Table 3. The properties of compound datatype members.

Parameter  Definition
Index      An index number between zero and N-1, where N is the number of members in the compound datatype. The members are indexed in the order of their location in the array of bytes.
Name       A string that must be unique among the members of the same datatype.
Datatype   An HDF5 datatype.
Offset     A fixed byte offset that defines the location of the first byte of that member in the compound datatype.

Properties of the members of a compound datatype are defined when the member is added to the compound type and cannot be subsequently modified.

Defining compound datatypes.

Compound datatypes must be built out of other datatypes. First, one creates an empty compound datatype and specifies its total size. Members are then added to the compound datatype in any order.

Each member must have a descriptive name, which is the key used to uniquely identify the member within the compound datatype. A member name in an HDF5 datatype does not necessarily have to be the same as the name of the corresponding member in the C struct in memory, although this is often the case. Nor does one need to define all members of the C struct in the HDF5 compound datatype (or vice versa).

Usually a C struct is defined to hold a data point in memory, and the offsets of the members in memory are the offsets of the struct members from the beginning of an instance of the struct. The library defines the macro HOFFSET(s,m), which computes the offset of member m within a struct variable s.

Figure 11 shows an example in which a compound datatype is created to describe complex numbers whose type is defined by the complex_t struct.


    typedef struct {
       double re;   /*real part */
       double im;   /*imaginary part */
    } complex_t;

    complex_t tmp;  /*used only to compute offsets */
    hid_t complex_id = H5Tcreate (H5T_COMPOUND, sizeof tmp);
    H5Tinsert (complex_id, "real", HOFFSET(tmp,re),
               H5T_NATIVE_DOUBLE);
    H5Tinsert (complex_id, "imaginary", HOFFSET(tmp,im),
               H5T_NATIVE_DOUBLE);
    
Figure 11. Creating a compound datatype to describe complex numbers.
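Once built, the compound datatype can be used wherever an atomic datatype could appear. The fragment below is a sketch that creates a dataset of complex numbers and writes a buffer of complex_t values, reusing the file, dataspace, and DATASETNAME names from the earlier examples as assumptions.

    complex_t data[NX][NY];   /* buffer of complex values to be written */

    /*
    * Create the dataset with the compound datatype, then write the
    * buffer using the same datatype in memory and in the file.
    */
    dataset = H5Dcreate(file, DATASETNAME, complex_id, dataspace,
                        H5P_DEFAULT);
    status  = H5Dwrite(dataset, complex_id, H5S_ALL, H5S_ALL,
                       H5P_DEFAULT, data);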

2.8 Creating and writing extendible and chunked datasets

An extendible dataset is one whose dimensions can grow. One can define an HDF5 dataset to have certain initial dimensions, with the capacity to later increase the size of any of the initial dimensions.

For example, Figure 12 shows a 3x3 dataset (a), which is later extended to be a 10x3 dataset by adding 7 rows (b), and further extended to be a 10x5 dataset by adding two columns (c).


  1     1     1  
  1     1     1  
  1     1     1  
a) Initially, 3 x 3
 
  1     1     1  
  1     1     1  
  1     1     1  
  2     2     2  
  2     2     2  
  2     2     2  
  2     2     2  
  2     2     2  
  2     2     2  
  2     2     2  
b) Extend to 10 x 3
      
  1     1     1     3     3  
  1     1     1     3     3  
  1     1     1     3     3  
  2     2     2     3     3  
  2     2     2     3     3  
  2     2     2     3     3  
  2     2     2     3     3  
  2     2     2     3     3  
  2     2     2     3     3  
  2     2     2     3     3  
c) Extend to 10 x 5
 
Figure 12. Extending a dataset: a) initially 3 x 3; b) extended to 10 x 3; c) extended to 10 x 5.

HDF5 requires the use of chunking when defining extendible datasets. Chunking makes it possible to extend datasets efficiently without having to reorganize contiguous storage excessively.

To summarize, an extendible dataset requires two conditions:

  1. The dataspace of the dataset must be defined as unlimited in all dimensions that might eventually be extended.
  2. The dataset creation properties must enable chunking.

For example, suppose we wish to create a dataset similar to the one shown in Figure 12. We want to start with a 3x3 dataset, then later extend it.

Declare the dataspace to have unlimited dimensions with the following code. Note the use of the predefined constant H5S_UNLIMITED to specify that a dimension is unlimited.


    hsize_t dims[2] = {3, 3};      /* dataset dimensions at creation time */ 
    hsize_t maxdims[2] = {H5S_UNLIMITED, H5S_UNLIMITED};
    /*
     * Create the data space with unlimited dimensions. 
     */
    dataspace = H5Screate_simple(RANK, dims, maxdims); 
    
Figure 13. Declaring a dataspace with unlimited dimensions.

Next set the dataset creation property list to enable chunking:


    hid_t cparms; 
    hsize_t chunk_dims[2] ={2, 5};
    /* 
     * Modify dataset creation properties to enable chunking.
     */
    cparms = H5Pcreate (H5P_DATASET_CREATE);
    status = H5Pset_chunk( cparms, RANK, chunk_dims);
    
Figure 14. Enabling chunking in the dataset creation property list.

Then create the dataset:


    /*
     * Create a new dataset within the file using cparms
     * creation properties.
     */
    dataset = H5Dcreate(file, DATASETNAME, H5T_NATIVE_INT, dataspace,
                     cparms);
    
Figure 15. Creating the extendible, chunked dataset.

Finally, when the time comes to extend the dataset, invoke H5Dextend. Extending the dataset along the first dimension by seven rows leaves it with new dimensions of <10,3>:


    /*
     * Extend the dataset. Dataset becomes 10 x 3.
     */
    hsize_t size[2];

    dims[0] = dims[0] + 7;
    size[0] = dims[0]; 
    size[1] = dims[1]; 
    status = H5Dextend (dataset, size);
    
Figure 16. Extending the dataset.
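Newly extended rows contain no application data until they are written. One common pattern, sketched below under the assumption that new_data holds the 7x3 values being appended, is to re-read the dataset's now-larger file dataspace, select the new rows, and write to that selection.

    hsize_t offset[2] = {3, 0};   /* the new rows begin at row 3 */
    hsize_t dims2[2]  = {7, 3};   /* seven new rows of three columns */

    /* the file dataspace must be re-read after the extension */
    filespace = H5Dget_space(dataset);
    status    = H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset,
                                    NULL, dims2, NULL);
    memspace  = H5Screate_simple(RANK, dims2, NULL);
    status    = H5Dwrite(dataset, H5T_NATIVE_INT, memspace, filespace,
                         H5P_DEFAULT, new_data);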

2.9 Creating and working with groups in a file

Groups provide a mechanism for organizing meaningful and extendible sets of datasets within an HDF5 file. The H5G API provides several routines for working with groups.

Creating a group.

With no datatype, dataspace or storage layout to define, creating a group is considerably simpler than creating a dataset. For example, the following code creates a group called Data in the root group of file.


    /*
      *  Create a group in the file.
      */
     grp = H5Gcreate(file, "/Data", 0);
    
Figure 17. Creating a group in a file.

A group may be created in another group by providing the absolute name of the group to the H5Gcreate function or by specifying its location. For example, to create the group Data_new in the group Data, one can use the following sequence of calls:


    /*
      * Create group "Data_new" in the group "Data" by specifying
      * absolute name of the group.
      */
     grp_new = H5Gcreate(file, "/Data/Data_new", 0);

    or 

     /*
      * Create group "Data_new" in the "Data" group.
      */
     grp_new = H5Gcreate(grp, "Data_new", 0);
    
Figure 18. Creating a group within a group, using an absolute or a relative name.

The first parameter of H5Gcreate is a location identifier: file in the first example specifies only the file, while grp in the second example specifies a particular group in a particular file. Note that in the second instance the group identifier grp is used as the first parameter in the H5Gcreate call so that the relative name Data_new can be used.

The third parameter of H5Gcreate optionally specifies how much file space to reserve to store the names of objects that will be created in this group. If a non-positive value is supplied, the library provides a default size.

H5Gclose closes the group and releases the group identifier.

Creating a dataset in a particular group. As with groups, a dataset can be created in a particular group by specifying either its absolute name in the file or its relative name with respect to that group. The next code excerpt uses the absolute name:


  /*
    * Create the dataset "Compressed_Data" in the group Data using the 
    * absolute name. The dataset creation property list is modified 
    * to use GZIP compression with the compression effort set to 6.
    * Note that compression can be used only when the dataset is 
    * chunked.
    */
   dims[0] = 1000;
   dims[1] = 20;
   cdims[0] = 20;
   cdims[1] = 20;
   dataspace = H5Screate_simple(RANK, dims, NULL);
   plist     = H5Pcreate(H5P_DATASET_CREATE);
   H5Pset_chunk(plist, 2, cdims);
   H5Pset_deflate(plist, 6);
   dataset   = H5Dcreate(file, "/Data/Compressed_Data", 
                         H5T_NATIVE_INT, dataspace, plist);
    
Figure 19. Creating a compressed dataset in a group, using an absolute name.

Alternatively, one can first obtain an identifier for the group in which the dataset is to be created, then create the dataset with a relative name:


  /* 
    * Open the group.
    */
   grp = H5Gopen(file, "Data");

   /*
    * Create the dataset "Compressed_Data" in the "Data" group
    * by providing a group identifier and a relative dataset 
    * name as parameters to the H5Dcreate function.
    */
   dataset = H5Dcreate(grp, "Compressed_Data", H5T_NATIVE_INT,
                       dataspace, plist);
    
Figure 20. Creating a dataset in a group, using a relative name.

Accessing an object in a group. Any object in a group can be accessed by its absolute or relative name. The first code snippet below illustrates the use of the absolute name to access the dataset Compressed_Data in the group Data created in the examples above. The second code snippet illustrates the use of the relative name.


    /*
     * Open the dataset "Compressed_Data" in the "Data" group. 
     */
    dataset = H5Dopen(file, "/Data/Compressed_Data");
    
Figure 21. Opening a dataset by its absolute name.


    /*
     * Open the group "Data" in the file.
     */
    grp  = H5Gopen(file, "Data");
 
    /*
     * Access the "Compressed_Data" dataset in the group.
     */
    dataset = H5Dopen(grp, "Compressed_Data");
    
Figure 22. Opening a dataset by its relative name.

2.10 Working with attributes

An attribute is a small dataset that is attached to a normal dataset or group. Attributes share many of the characteristics of datasets, so the programming model for working with attributes is analogous in many ways to the model for working with datasets. The primary differences are that an attribute must be attached to a dataset or a group, and subsetting operations cannot be performed on attributes.

To create an attribute belonging to a particular dataset or group, first create a dataspace for the attribute with a call to H5Screate, then create the attribute using H5Acreate. For example, the following code creates an attribute called "Integer attribute" that is a member of a dataset whose identifier is dataset. The attribute identifier is attr2. H5Awrite then sets the value of the attribute to that of the integer variable point. H5Aclose then releases the attribute identifier.


    int point = 1;                         /* Value of the scalar attribute */ 

    /*
     * Create scalar attribute.
     */
    aid2  = H5Screate(H5S_SCALAR);
    attr2 = H5Acreate(dataset, "Integer attribute", H5T_NATIVE_INT, aid2,
                      H5P_DEFAULT);

    /*
     * Write scalar attribute.
     */
     ret = H5Awrite(attr2, H5T_NATIVE_INT, &point); 

    /*
     * Close attribute dataspace.
     */
    ret = H5Sclose(aid2); 

    /*
     * Close attribute.
     */
    ret = H5Aclose(attr2); 
    
Figure 23. Creating and writing a scalar attribute.

To read a scalar attribute whose name and datatype are known, first open the attribute using H5Aopen_name, then use H5Aread to get its value. For example, the following code reads a scalar attribute called "Integer attribute" whose datatype is a native integer and whose parent dataset has the identifier dataset.


    /*
     * Attach to the scalar attribute using attribute name, then read and 
     * display its value.
     */
    attr = H5Aopen_name(dataset,"Integer attribute");
    ret  = H5Aread(attr, H5T_NATIVE_INT, &point_out);
    printf("The value of the attribute \"Integer attribute\" is %d \n", point_out); 
    ret =  H5Aclose(attr);
    
Figure 24. Reading a known scalar attribute.

Reading an attribute whose characteristics are not known. It may be necessary to query a file to obtain information about an attribute, namely its name, datatype, rank, and dimensions. The following code opens an attribute by its index value using H5Aopen_idx, then reads in information about its datatype.


    /*
     * Attach to the string attribute using its index, then read and display the value.
     */
    attr =  H5Aopen_idx(dataset, 2);
    atype = H5Tcopy(H5T_C_S1);
    H5Tset_size(atype, 4);
    ret   = H5Aread(attr, atype, string_out);
    printf("The value of the attribute with the index 2 is %s \n", string_out);
    
Figure 25. Reading an attribute whose characteristics are not known in advance.

In practice, if the characteristics of attributes are not known, the code involved in accessing and processing the attribute can be quite complex. For this reason, HDF5 includes a function called H5Aiterate, which applies a user-supplied function to each of a set of attributes. The user-supplied function can contain the code that interprets, accesses and processes each attribute.
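The following sketch shows the shape of that pattern; attr_info is a user-defined operator function, not part of the library, and here it merely prints each attribute's name.

    /*
    * Operator function: called by H5Aiterate once for each attribute.
    * Returning zero continues the iteration.
    */
    herr_t attr_info(hid_t loc_id, const char *name, void *opdata)
    {
        printf("Found attribute: %s \n", name);
        return 0;
    }

    /*
    * Apply attr_info to every attribute of the dataset, starting
    * from the first attribute.
    */
    unsigned idx = 0;
    ret = H5Aiterate(dataset, &idx, attr_info, NULL);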

3. The Data Transfer Pipeline

The HDF5 Library implements data transfers between different storage locations. At the lowest levels, the HDF5 Library reads and writes blocks of bytes to and from storage using calls to the virtual file layer (VFL) drivers. In addition, the HDF5 Library manages caches of metadata and a data I/O pipeline that applies compression to data blocks, transforms data elements, and implements selections.

As a data management library, a substantial portion of the HDF5 Library's work is in transferring data from one environment or medium to another. This most often involves a transfer between system memory and a storage medium. Because of compression, encryption, machine-dependent differences in numerical representation, and the like, the bit-by-bit representation of a given dataset is often substantially different in the two environments.

Consider the representation on disk of a compressed and encrypted little-endian array as compared to the same array after it has been read from disk, decrypted, decompressed, and loaded into memory on a big-endian system. HDF5 performs all of the operations necessary to make that transition during the I/O process, with many of the operations being handled by the VFL and the data transfer pipeline.

Figure 26 provides a simplified view of a sample data transfer with four stages. Note that the modules are used only when needed, e.g., if the data is not compressed, the compression stage is omitted.

Figure 26. A simplified view of the data transfer pipeline.

For a given I/O request, different combinations of actions may be performed by the pipeline. The library automatically sets up the pipeline and passes data through the processing steps. For example, for a read request (from disk to memory), the library must determine which logical blocks contain the requested data elements and fetch each block into the library's cache. If the data needs to be decompressed, then the decompression algorithm is applied to the block after it is read from disk. If the data is a selection, the selected elements are extracted from the data block after it is decompressed. If the data needs to be transformed (e.g., byte swapped), then the data elements are transformed after decompression and selection.

While an application must sometimes set up some elements of the pipeline, e.g., the use of a custom error-checking algorithm, use of the pipeline is normally transparent to the user program; the library determines what must be done based on the metadata for the file, the object, and the specific request.

In some cases it is necessary to pass parameters to and from modules in the pipeline, or among other parts of the library that are not directly called through the programming API. This is accomplished through the use of dataset transfer and data access property lists.
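For example, an application can enlarge the buffer that the library uses for datatype conversion during a transfer. The fragment below is a sketch: it creates a dataset transfer property list, requests a one-megabyte conversion buffer with H5Pset_buffer, and passes the list to a write call.

    hid_t xfer;   /* dataset transfer property list */

    xfer   = H5Pcreate(H5P_DATASET_XFER);
    status = H5Pset_buffer(xfer, 1024 * 1024, NULL, NULL);
    status = H5Dwrite(dataset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL,
                      xfer, data);
    status = H5Pclose(xfer);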

The VFL, or virtual file layer, provides an interface whereby user applications can add custom modules to the data transfer pipeline. For example, a custom compression algorithm can be used with the HDF5 Library by linking an appropriate module into the pipeline through the VFL. This requires creating an appropriate wrapper for the compression module and registering it with the library with H5Zregister. It can then be applied to a dataset with an H5Pset_filter call, which adds it to the selected dataset's creation property list.
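Once such a filter has been registered, applying it follows the same pattern as the built-in filters. In the sketch below, MY_FILTER_ID is a hypothetical application-chosen filter number and the registration wrapper itself is omitted; note that, like compression, a custom filter requires a chunked dataset.

    /*
    * Add a previously registered custom filter (MY_FILTER_ID is
    * hypothetical) to the dataset creation property list of a
    * chunked dataset.
    */
    cparms = H5Pcreate(H5P_DATASET_CREATE);
    status = H5Pset_chunk(cparms, RANK, chunk_dims);
    status = H5Pset_filter(cparms, MY_FILTER_ID, H5Z_FLAG_OPTIONAL,
                           0, NULL);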







