... durch planmässiges Tattonieren.
[... through systematic, palpable experimentation.]
In this chapter, you will get a deeper knowledge of PyTables internals. PyTables has several places where you can improve the performance of your application. If you are planning to deal with really large data, you should read this section carefully in order to learn how to get an important efficiency boost for your code. But if your datasets are small (say, up to 10 MB), you should not worry about that, as the default parameters in PyTables are already tuned for those sizes.
The underlying HDF5 library used by PyTables allows certain datasets (the so-called chunked datasets) to take the data in bunches of a certain length, named chunks, and write them on disk as a whole, i.e. the HDF5 library treats chunks as atomic objects and disk I/O is always done in terms of complete chunks. This allows data filters to be defined by the application to perform tasks such as compression, encryption, checksumming, etc. on entire chunks.
An in-memory B-tree is used to map chunk structures on disk. The more chunks that are allocated for a dataset the larger the B-tree. Large B-trees take memory and cause file storage overhead as well as more disk I/O and higher contention for the metadata cache. Consequently, it's important to balance between memory and I/O overhead (small B-trees) and time to access data (big B-trees).
PyTables can determine an optimum chunk size that makes the B-tree adequate to your dataset size if you help it by providing an estimation of the final number of rows for an extensible leaf[11]. This must be done at leaf creation time, by passing this value to the expectedrows argument of the createTable() method (see description) or the createEArray() method (see description). For VLArray leaves, pass the expected size in MBytes by using the expectedsizeinMB argument of createVLArray() (see description) instead.
When your leaf size is bigger than 10 MB (take this figure only as a reference, not strictly), providing this guess of the number of rows will optimize the access to your data. When the table or array size is larger than, say, 100 MB, you are strongly suggested to provide such a guess; failing to do so may cause your application to perform very slow I/O operations and to demand huge amounts of memory. You have been warned!
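As a quick illustration, here is a minimal sketch of how those estimates can be passed at creation time, using the camelCase API of the PyTables 2.x series. The file name, node names, record description and figures are made up for the example; the relevant parts are the expectedrows and expectedsizeinMB arguments mentioned above.
from tables import openFile, IsDescription, Int32Col, Float64Col, Float64Atom

class Particle(IsDescription):       # hypothetical record, just for this example
    identity = Int32Col()
    pressure = Float64Col()

fileh = openFile("hints.h5", mode="w")

# About 10 million rows will eventually be appended to this table, so say so
# now and let PyTables derive an adequate chunk size right from the start.
table = fileh.createTable(fileh.root, "particles", Particle,
                          "Particle table", expectedrows=10*1000*1000)

# The same hint applies to extensible arrays...
earray = fileh.createEArray(fileh.root, "pressures", Float64Atom(), (0,),
                            "Pressure log", expectedrows=10*1000*1000)

# ...while VLArray leaves take an estimate of their final size in MB instead.
vlarray = fileh.createVLArray(fileh.root, "events", Float64Atom(),
                              "Event data", expectedsizeinMB=100)

fileh.close()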
If you are going to use a lot of searches like the next one:
result = [row['var2'] for row in table if row['var1'] > 2 and row['var1'] <= 20 and row['var2'] > 3]
(for future reference, we will call this the regular selection mode) and you want to improve the time taken to run it, keep reading.
PyTables provides a way to accelerate data selections relating to a single table, through the use of the Table.where() iterator and related query methods (see Section 4.6.4). We will call this mode of selecting data in-kernel. Let's see an example of an in-kernel selection based on the regular selection mentioned above:
result = [row['var2'] for row in table.where('var1 <= 20')]
This simple change of selection mode can account for an improvement in search times of more than 30x for large tables, as you can see in Figure 5.1.
Figure 5.1. Times for different sequential selection modes over Float64 values. Benchmark made on a machine with AMD Opteron (AMD64) @ 2 GHz processors with IDE disk @ 7200 RPM.
So, where is the trick? It's easy. In the regular selection mode the data for column var1 has to be carried up into Python space so as to evaluate the condition and decide if the var2 value should be added to the result list. On the contrary, in the in-kernel mode, the condition is passed to the PyTables kernel (hence the name), written in C, and evaluated there at C speed (with the help of the integrated Numexpr package, see [11]), so that the only values brought into Python space are the rows that fulfilled the condition. Hence, for selections that only have a tiny number of hits (compared with the total number of rows), the savings are huge.
Incidentally, in-kernel searches are not only much faster than regular PyTables queries, but also much faster than sequential queries in relational databases coded in pure C (see the Postgres line in Figure 5.1). I'm not completely sure about the reasons why in-kernel queries work so fast in comparison with relational engines written in C, but a couple of reasons come to mind:
PyTables implements adaptive buffers (i.e. their size grows as the table grows) for doing I/O. This fact, combined with the use of the highly optimized HDF5 hyperslice reads, allows for a very high data input speed.
In addition, PyTables uses the powerful Numexpr computing kernel, written in C, for evaluating mathematical expressions (see [11]). This kernel takes full advantage of on-die caches in modern processors for achieving a very high computing throughput (and, in particular, allowing ultra-fast condition evaluations).
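To get a feel for the second point, here is a standalone sketch (independent of PyTables, with made-up array names and sizes) of the kind of work that gets delegated to Numexpr: the whole boolean expression is compiled once and evaluated over the column buffers at C speed, instead of row by row in Python.
import numpy
import numexpr

# Two fake "column buffers", similar to what PyTables feeds to Numexpr.
var1 = numpy.random.randint(0, 100, 1000000)
var2 = numpy.random.uniform(0, 10, 1000000)

# The condition is evaluated as a whole, at C speed, producing a boolean mask...
mask = numexpr.evaluate("(var1 > 2) & (var1 <= 20) & (var2 > 3)")
# ...and only the selected values ever reach Python space.
result = var2[mask]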
Furthermore, you can mix the in-kernel and regular selection modes for evaluating arbitrarily complex conditions making use of external functions. Look at this example:
result = [ row['var2'] for row in table.where('(var3 == "foo") & (var1 <= 20)') if your_function(row['var2']) ]
Here, we use an in-kernel selection to choose rows according to the values of the var3 and var1 fields. Then, we apply a regular selection to complete the query. Of course, when you mix the in-kernel and regular selection modes you should pass the most restrictive condition to the in-kernel part, i.e. to the where() iterator. In situations where it is not clear which is the most restrictive condition, you might want to experiment a bit in order to find the best combination.
However, since in-kernel condition strings allow rich expressions with multiple columns, variables, arithmetic operations and some functions, it is unlikely that you will be forced to use external regular selections in conditions of small to medium complexity. See Appendix B for more information on in-kernel condition syntax.
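For instance, the regular selection at the beginning of this section can be expressed entirely in-kernel, and variables can be supplied to the condition through the condvars argument of where() (assuming the where() signature of the 2.x series; the lim name below is made up just for illustration):
# The three clauses of the regular selection, now evaluated in-kernel:
result = [row['var2'] for row in
          table.where('(var1 > 2) & (var1 <= 20) & (var2 > 3)')]

# Variables can be passed explicitly through condvars instead of being
# looked up in the local namespace:
lim = 20
result = [row['var2'] for row in
          table.where('(var1 > 2) & (var1 <= lim) & (var2 > 3)',
                      condvars={'lim': lim})]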
Note: Indexing is only available in PyTables Pro.
When you need more speed than in-kernel selections can offer, PyTables provides a third selection method, the so-called indexed mode (based on the highly efficient OPSI indexing engine, oriented towards extremely large datasets [20]). In this mode, you have to decide which column(s) you are going to do your selections on, and index them. Indexing is just a kind of sort operation over a column, so that subsequent searches along such a column will look at this sorted information using a kind of binary search, which is much faster than a typical sequential search.
You can index the columns you want by calling the Column.createIndex() method (see description) on an already created table. For example:
indexrows = table.cols.var1.createIndex()
indexrows = table.cols.var2.createIndex()
indexrows = table.cols.var3.createIndex()
will create indexes for the var1, var2 and var3 columns.
After you have indexed a column, PyTables will try to use the index in your queries. However, as the query optimizer is not very sophisticated right now[12], it is recommended that you avoid placing comparisons involving indexed columns too deep in the condition expression, in order to maximize the chances of actually using the indexes. See below for examples where the optimizer can safely determine whether an index can be used or not.
Example conditions where an index can be used:
var1 >= "foo"  (var1 is used)
var1 >= mystr  (var1 is used)
(var1 >= "foo") & (var3 > 10)  (var1 is used)
(var1 >= "foo") & (var4 > 0.0)  (var1 is used)
("bar" <= var1) & (var1 < "foo")  (var1 is used)
(("bar" <= var1) & (var1 < "foo")) & (var4 > 0.0)  (var1 is used)
Example conditions where an index cannot be used:
var4 > 0.0  (var4 is not indexed)
var1 != 0.0  (range has two pieces)
(var1 >= "foo") | (var3 > 10)  (conditions are ORed)
Note: If you want to know for sure whether a particular query will use indexing or not (without actually running it), you are advised to use the Table.willQueryUseIndexing() method (see description).
One important aspect of indexing in PyTables is that it has been designed from the ground up with the goal of being capable of effectively managing very large tables. In Figure 5.2, you can see that the times to index columns in tables are pretty short. In particular, the time to index a column with 1 billion rows (1 gigarow) with no index optimization is roughly 15 min., while indexing the same column with full optimization takes around 100 min.; these are quite small figures compared with a relational database (in this case, Postgres 8), which takes more than 500 min. to do the same job. This is because PyTables has chosen an algorithm that does a partial sort of the columns in order to ensure bounded indexing times. On the contrary, most relational databases try to do a complete sort of the columns, which makes the time to create an index much longer.
Figure 5.2. Times for indexing a Float64 column. Benchmark made on a machine with AMD Opteron (AMD64) @ 2 GHz processors with IDE disk @ 7200 RPM.
Another important feature to point out is the aforementioned capability of PyTables to support a customizable quality for indexes, allowing the user to select the one that best suits her needs. This quality can be specified by passing the desired optlevel argument to the createIndex() method (see description). In figures 5.3 and 5.4 you can see the effect on lookup time of specifying several optimization levels.
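For example, a sketch of requesting a higher-quality (and slower to build) index could look like the following; the chosen optlevel value is arbitrary here (levels typically range from 0, no optimization, to 9, full optimization):
# A fully optimized (but slower to create) index on var1, and a cheaper,
# default-quality one on var3.  The exact optlevel is a trade-off between
# lookup speed and the indexing time you can afford.
indexrows = table.cols.var1.createIndex(optlevel=9)
indexrows = table.cols.var3.createIndex()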
Figure 5.3. Times for querying a Float64 column with a cold cache (mean of 10 first queries). Benchmark made on a machine with AMD Opteron (AMD64) @ 2 GHz processors with IDE disk @ 7200 RPM.
Figure 5.4. Times for querying a Float64 column with a warm cache (mean of 500 queries). Benchmark made on a machine with AMD Opteron (AMD64) @ 2 GHz processors with IDE disk @ 7200 RPM.
It is worth noting that, in figures 5.3 and 5.4, the query was chosen so as to return 0 rows (a 0-hits query). This has been done in order to measure the pure lookup time, without taking into account the time to retrieve the actual data (see Figure 5.6 for a graph measuring retrieval times for range queries with different numbers of hits).
As an aside, let's point out that, although the fact that relational databases use a complete sorting algorithm for indexes might lead the reader to think that they would be faster for searching purposes than the PyTables approach, this is not necessarily the case (or, at least, not always), as you can see in figures 5.3 and 5.4. This is because, when one chooses a high optimization level for medium to large table sizes (< 10 billion rows), the PyTables indexing engine can effectively achieve a complete sort, making its lookup times usually faster than those of a relational database (especially if compression is used). For very large tables (> 10 billion rows), the PyTables indexing engine cannot normally achieve a completely sorted index. However, this is not a grave limitation in practice, because the time required for such a complete sort in a traditional relational database can be huge, so it is normally not worth the effort, while PyTables, for its part, can still perform a partial sort in a reasonable amount of time and keep very good lookup times. In other words, the PyTables engine scales much better than relational databases, so don't worry if you have extremely large columns to index: PyTables is designed to cope with that perfectly.
In addition, PyTables implements an efficient cache for (small) results of queries. Such a cache can be very important in cases where you repeat the same query interspersed with others. In that case, the repeated queries will be done very fast as you can see in Figure 5.5, effectively accelerating the whole query process.
Figure 5.5. Times for doing a query that is already in cache for a Float64 column. Benchmark made on a machine with AMD Opteron (AMD64) @ 2 GHz processors with IDE disk @ 7200 RPM.
Finally, let's conclude this section with a look at how PyTables indexed searches perform for queries with different result sizes (or numbers of hits). In Figure 5.6 you can see the mean time of the first 5 distinct queries (having approximately the same number of hits) over a table with one billion rows (a gigarow), for different query ranges (and hence, different numbers of hits), for the following query:
results = [ row[col4] for row in table.where("(inf<=col4) & (col4<=sup) & (sqrt(col1+3.1*col2+col3*col4) > 3)") ]
where col4 is an indexed float64 column and col1, col2 and col3 are unindexed ones (of types int32, float64 and int32 respectively). The inf and sup values define the query range. You can see how, for a number of hits greater than 10, PyTables performs between 2x and 50x faster than a traditional relational database (in this case, Postgres 8). Again, this is the consequence of combining highly optimized access to the data through the HDF5 library with the use of Numexpr ([11]) for quick evaluation of complex expressions on each row.
Figure 5.6. Times for doing a query with different numbers of hits on an indexed table with one gigarow. Benchmark made on a machine with AMD Opteron (AMD64) @ 2 GHz processors with IDE disk @ 7200 RPM.
You can find a more complete description of, and benchmarks for, OPSI, the indexing engine of PyTables Pro, in [20].
One of the beauties of PyTables is that it supports compression on tables and arrays[13], although it is not used by default. Compressing big amounts of data might be a somewhat controversial feature, because compression has a reputation of being a very big consumer of CPU time. However, if you are willing to check whether compression can help not only by reducing your dataset file size but also by improving I/O efficiency, especially when dealing with very large datasets, keep reading.
The compression library used by default is Zlib (see [12]). Since HDF5 requires it, you can safely use it and expect that your HDF5 files will be readable on any other platform that has the HDF5 libraries installed. Zlib provides good compression ratios, although somewhat slowly, and reasonably fast decompression. Because of that, it is a good candidate for compressing your data.
However, in some situations it is critical to have very good decompression speed (at the expense of lower compression ratios or more CPU spent on compression, as we will see soon). In others, the emphasis is put on achieving the maximum compression ratio, no matter what read speed results. This is why support for two additional compressors has been added to PyTables: LZO (see [13]) and bzip2 (see [14]). According to the author of LZO (and checked by the author of this section, as you will see soon), LZO offers pretty fast compression and extremely fast decompression. In fact, LZO is so fast when compressing/decompressing that it may well happen (depending on your data, of course) that writing or reading a compressed dataset is sometimes faster than if it is not compressed at all (especially when dealing with extremely large datasets). This fact is very important, especially if you have to deal with very large amounts of data. Regarding bzip2, it has a reputation of achieving excellent compression ratios, but at the price of spending much more CPU time, which results in very low compression/decompression speeds.
Be aware that the LZO and bzip2 support in PyTables is not standard in HDF5, so if you are going to use your PyTables files in contexts other than PyTables you will not be able to read them. Still, see Section D.2 (where the ptrepack utility is described) to find a way to free your files from LZO or bzip2 dependencies, so that you can use these compressors locally with the guarantee that you can replace them with Zlib (or even remove compression completely) if you want to use these files with other HDF5 tools or platforms afterwards.
In order to allow you to grasp what amount of compression can be achieved, and how this affects performance, a series of experiments has been carried out. All the results presented in this section (and in the next one) have been obtained with synthetic data and using PyTables 1.3. Also, the tests have been conducted on an IBM OpenPower 720 (e-series) with a PowerPC G5 at 1.65 GHz and a hard disk spinning at 15K RPM. As your data and platform may be totally different, take this just as a guide, because your mileage may vary. Finally, and to be able to play with tables with as many rows as possible, the record size has been chosen to be small (16 bytes). Here is its definition:
class Bench(IsDescription):
    var1 = StringCol(length=4)
    var2 = IntCol()
    var3 = FloatCol()
With this setup, you can look at the compression ratios that can be achieved in Figure 5.7. As you can see, LZO is the compressor that performs worst in this sense but, curiously enough, there is not much difference between Zlib and bzip2.
Also, PyTables lets you select different compression levels for Zlib and bzip2, although you may get a bit disappointed by the small improvement that these compressors show when dealing with a combination of numbers and strings as in our example. As a reference, see plot 5.8 for a comparison of the compression achieved by selecting different levels of Zlib. Very oddly, the best compression ratio corresponds to level 1 (!). See later for an explanation and more figures on this subject.
Also have a look at Figure 5.9. It shows how the speed of writing rows evolves as the size (number of rows) of the table grows. Even though in these graphs the size of a single row is 16 bytes, you can most probably extrapolate these figures to other row sizes.
In Figure 5.10 you can see how compression affects reading performance. In fact, what you see in the plot is an in-kernel selection speed, but provided that this operation is very fast (see Section 5.2.1), we can accept it as an actual read test. Compared with the reference line without compression, the general trend here is that LZO does not affect the reading performance very much (and in some points it is actually better), Zlib makes the speed drop to half, while bzip2 performs very slowly (up to 8x slower).
Also, in the same Figure 5.10 you can notice some strange peaks in the speed that we might be tempted to attribute to the libraries on which PyTables relies (HDF5, compressors...), or to PyTables itself. However, Figure 5.11 reveals that, if we put the file in the filesystem cache (by reading it several times beforehand, for example), the evolution of the performance is much smoother. So the most probable explanation is that such peaks are a consequence of the underlying OS filesystem, rather than a flaw in PyTables (or any other library behind it). Another conclusion that can be drawn from the aforementioned plot is that LZO decompression performance is much better than Zlib's, allowing an improvement in overall speed of more than 2x and, perhaps more importantly, the read performance for really large datasets (i.e. when they do not fit in the OS filesystem cache) can actually be better than not using compression at all. Finally, one can see that reading performance is very badly affected when bzip2 is used (it is 10x slower than LZO and 4x slower than Zlib), but this was somewhat expected anyway.
So, generally speaking and looking at the experiments above, you can expect that LZO will be the fastest at both compressing and decompressing, but the one that achieves the worst compression ratio (although that may be just fine for many situations, especially when used with shuffling; see Section 5.4). bzip2 is by far the slowest at both compressing and decompressing and, besides, it does not achieve any better compression ratio than Zlib. Zlib represents a balance between them: it is somewhat slower at compressing (2x) and decompressing (3x) than LZO, but it normally achieves better compression ratios.
Finally, by looking at plots 5.12, 5.13, and the aforementioned 5.8, you can see why the recommended compression level to use for all compression libraries is 1. This is the lowest level of compression, but as the underlying HDF5 chunk size is normally rather small compared with the size of the compression buffers, there is not much point in increasing the latter (i.e. increasing the compression level). Nonetheless, in some situations (like for example, in extremely large tables or arrays, where the computed chunk size can be rather large) you may want to check, on your own, how the different compression levels actually affect your application.
You can select the compression library and level by setting the complib and complevel keywords of the Filters class (see Section 4.14.1). A compression level of 0 will completely disable compression (the default), 1 is the least memory- and CPU-demanding level, while 9 is the maximum level, the most memory-demanding and CPU-intensive one. Finally, bear in mind that LZO does not accept a compression level right now, so, when using LZO, 0 means that compression is not active, and any other value means that LZO is active.
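As an illustration, here is a minimal sketch of attaching filters to a new table, reusing the Bench description defined earlier in this section (the file name is made up; only the Filters arguments matter here):
from tables import openFile, Filters

# Zlib at the recommended level 1; to try LZO or bzip2 instead, just change
# complib (remember that LZO ignores the level: any non-zero value enables it).
filters = Filters(complevel=1, complib="zlib")

fileh = openFile("compressed.h5", mode="w")
table = fileh.createTable(fileh.root, "bench", Bench, "Compressed table",
                          filters=filters, expectedrows=10*1000*1000)
fileh.close()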
So, in conclusion, if your ultimate goal is writing and reading as fast as possible, choose LZO. If you want to reduce your data as much as possible, while retaining acceptable read speed, choose Zlib. Finally, if portability is important to you, Zlib is your best bet. So, when would you want to use bzip2? Well, looking at the results, it is difficult to recommend its use in general, but you may want to experiment with it in those cases where you know that it is well suited to your data pattern (for example, for dealing with repetitive string datasets).
Figure 5.13. Selecting values in tables with different levels of compression. The file is in the OS cache.
The HDF5 library provides an interesting filter that can leverage the results of your favorite compressor. Its name is shuffle, and because it can greatly benefit compression while not taking many CPU resources (see below for a justification), it is active by default in PyTables whenever compression is activated (independently of the chosen compressor). It is deactivated when compression is off (which is the default, as you should already know). Of course, you can deactivate it if you want, but this is not recommended.
So, how does this mysterious filter exactly work? From the HDF5 reference manual: “The shuffle filter de-interlaces a block of data by reordering the bytes. All the bytes from one consistent byte position of each data element are placed together in one block; all bytes from a second consistent byte position of each data element are placed together a second block; etc. For example, given three data elements of a 4-byte datatype stored as 012301230123, shuffling will re-order data as 000111222333. This can be a valuable step in an effective compression algorithm because the bytes in each byte position are often closely related to each other and putting them together can increase the compression ratio.”
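The following toy sketch (plain NumPy, not the actual HDF5 implementation) reproduces that example, just to make the byte reordering explicit:
import numpy

# Three 4-byte elements whose bytes are laid out as 012301230123 in memory
# (the "<u4" dtype forces a little-endian layout, so the example is portable).
data = numpy.array([0x03020100] * 3, dtype="<u4")
raw = data.view(numpy.uint8)             # [0 1 2 3 0 1 2 3 0 1 2 3]
# Group the bytes occupying the same position within each element together:
shuffled = raw.reshape(3, 4).transpose().ravel()
print raw        # 012301230123
print shuffled   # 000111222333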
In Figure 5.14 you can see a benchmark that shows how the shuffle filter can help the different libraries in compressing data. In this experiment, shuffle has made LZO compress almost 3x more (!), while Zlib and bzip2 see improvements of 2x. Once again, the data for this experiment is synthetic, and shuffle seems to do a great job with it, but in general, the results will vary in each case[14].
Figure 5.14. Comparison between different compression libraries with and without the shuffle filter.
At any rate, the most remarkable fact about the shuffle filter is the relatively high level of compression that compressor filters can achieve when used in combination with it. A curious thing to note is that the Bzip2 compression ratio does not seem to improve much (less than 40%) and, what is more striking, Bzip2+shuffle compresses quite a bit less than the Zlib+shuffle or LZO+shuffle combinations, which is kind of unexpected. What seems clear is that Bzip2 is not very good at compressing the patterns that result from applying shuffle. As always, you may want to experiment with your own data before widely applying the Bzip2+shuffle combination, in order to avoid surprises.
Now, how does shuffling affect performance? Well, if you look at plots 5.15, 5.16 and 5.17, you will get a somewhat unexpected (but pleasant) surprise. Roughly, shuffle makes the writing process (shuffling+compressing) faster (approximately 15% for LZO, 30% for Bzip2 and 80% for Zlib), which is an interesting result by itself. But perhaps more exciting is the fact that the reading process (unshuffling+decompressing) is also accelerated by a similar extent (roughly 20% for LZO, 60% for Zlib and 75% for Bzip2).
Figure 5.16. Reading with different compression libraries with the shuffle filter. The file is not in OS cache.
Figure 5.17. Reading with different compression libraries with and without the shuffle filter. The file is in OS cache.
You may wonder why introducing another filter in the write/read pipelines effectively accelerates the throughput. Well, maybe data elements are more similar or related column-wise than row-wise, i.e. contiguous elements in the same column are more alike, so shuffling makes the job of the compressor easier (faster) and more effective (greater ratios). As a side effect, compressed chunks fit better in the CPU cache (at least, the chunks are smaller!) so that the unshuffle/decompress process can make better use of the cache (i.e. reducing the number of CPU cache misses).
So, given the potential gains (faster writing and reading, but especially a much improved compression level), it is a good thing to have such a filter enabled by default in the battle for discovering redundancy when you want to compress your data, just as PyTables does.
Psyco (see [16]) is a kind of specialized compiler for Python that typically accelerates Python applications with no change in source code. You can think of Psyco as a kind of just-in-time (JIT) compiler, a little bit like Java's, that emits machine code on the fly instead of interpreting your Python program step by step. The result is that your unmodified Python programs run faster.
Psyco is very easy to install and use, so in most scenarios it is worth giving it a try. However, it only runs on Intel 386 architectures, so if you are using other architectures, you are out of luck (and, moreover, it seems that there are no plans to support other platforms). Besides, with the addition of flexible (and very fast) in-kernel queries (which, by the way, cannot be optimized at all by Psyco), the use of Psyco will only help in rather few scenarios. In fact, the only important situation where you might benefit from Psyco right now (in PyTables contexts, I mean) is speeding up writes to tables when using the Row interface (see Section 4.6.7, “The Row class”). But again, this latter case can also be accelerated by using the Table.append() method (see description) and building your own buffers.
As an example, imagine that you have a small script that reads and selects data over a series of datasets, like this:
from tables import openFile

def readFile(filename):
    "Select data from all the tables in filename"
    fileh = openFile(filename, mode="r")
    result = []
    # File objects are callable: fileh("/", 'Table') iterates over all the
    # Table nodes hanging from the root group (an alias of File.walkNodes()).
    for table in fileh("/", 'Table'):
        result += [p['var3'] for p in table if p['var2'] <= 20]
    fileh.close()
    return result

if __name__ == "__main__":
    print readFile("myfile.h5")
In order to accelerate this piece of code, you can rewrite your main program to look like:
if __name__ == "__main__":
    import psyco
    psyco.bind(readFile)
    print readFile("myfile.h5")
That's all! From now on, each time you execute your Python script, Psyco will deploy its sophisticated algorithms so as to accelerate your calculations.
You can see in graphs 5.18 and 5.19 how much I/O speed improvement you can get by using Psyco. By looking at these figures you can get an idea of whether these improvements are of interest to you or not. In general, if you are not going to use compression you will take advantage of Psyco if your tables are medium sized (from a thousand to a million rows), and this advantage will disappear progressively as the number of rows grows well over one million. However, if you use compression, you will probably see improvements even beyond this limit (see Section 5.3). As always, there is no substitute for experimentation with your own dataset.
One limitation of the initial versions of PyTables was that they needed to load all nodes in a file completely before being ready to deal with them, making the opening times for files with a lot of nodes very high and unacceptable in many cases.
Starting with PyTables 1.2, a new LRU cache was introduced that avoids loading all the nodes of the object tree into memory. This cache is responsible for loading up to a certain number of nodes and discarding the least recently used ones when there is a need to load new ones. This represents a big advantage over the old scheme, especially in terms of memory usage (as there is no need to load every node in memory), but it also adds very convenient optimizations for working interactively, for example speeding up the opening times of files with lots of nodes, allowing almost any kind of file to be opened in typically less than one tenth of a second (compare this with the more than 10 seconds for files with more than 10000 nodes in the pre-1.2 era). See [19] for more info on the advantages (and also the drawbacks) of this approach.
One thing that deserves some discussion is the choice of the parameter that sets the maximum number of nodes to be held in memory at any time. As PyTables is meant to be deployed on machines that may have relatively little memory, the default is quite conservative (you can look at its actual value in the NODE_CACHE_SIZE parameter in the module tables/parameters.py). However, if you usually have to deal with files that have many more nodes than the maximum default, and you have a lot of free memory in your system, then you may want to experiment in order to find the value of NODE_CACHE_SIZE that better fits your needs.
As an example, look at the next code:
def browse_tables(filename):
    fileh = openFile(filename, 'a')
    group = fileh.root.newgroup
    for j in range(10):
        for tt in fileh.walkNodes(group, "Table"):
            title = tt.attrs.TITLE
            for row in tt:
                pass
    fileh.close()
We will be running the code above against a couple of files having a /newgroup containing 100 tables and 1000 tables respectively. We will run this small benchmark for different values of the LRU cache size, specifically 256 and 1024 slots. You can see the results in Table 5.1.
| Node is coming from... | Cache size | 100 nodes: Memory (MB) | 100 nodes: Time (ms) | 1000 nodes: Memory (MB) | 1000 nodes: Time (ms) |
|---|---|---|---|---|---|
| Disk | 256 | 14 | 1.24 | 51 | 1.33 |
| Disk | 1024 | 14 | 1.24 | 66 | 1.31 |
| Cache | 256 | 14 | 0.53 | 65 | 1.35 |
| Cache | 1024 | 14 | 0.52 | 73 | 0.68 |
Table 5.1. Retrieval speed and memory consumption depending on the number of nodes in LRU cache.
From the data in Table 5.1, one can see that when the number of objects you are dealing with fits in cache, you will get better access times to them. Also, increasing the node cache size effectively consumes more memory only if the total number of nodes exceeds the slots in the cache; otherwise the memory consumption remains the same. It is also worth noting that increasing the node cache size so that all your nodes fit in cache does not take much more memory than being too conservative. On the other hand, it might happen that the speed-up you can achieve by allocating more slots in your cache is not worth the amount of memory used.
Anyway, if you feel that this issue is important to you, set up your own experiments and proceed to fine-tune the NODE_CACHE_SIZE parameter.
Note: PyTables Pro sports an optimized LRU node cache written in C, so you should expect significantly faster LRU cache operations when working with it.
Let's suppose that you have a file where you have made a lot of row deletions on one or more tables, or deleted many leaves or even entire subtrees. These operations might leave holes (i.e. space that is not used anymore) in your files, which may potentially affect not only the size of the files but, more importantly, the performance of I/O. This is because when you delete a lot of rows in a table, the space is not automatically recovered on the fly. In addition, if you add many more rows to a table than specified in the expectedrows keyword at creation time, this may affect performance as well, as explained in Section 5.1.
In order to cope with these issues, you should be aware that PyTables includes a handy utility called ptrepack which can be very useful not only to compact fragmented files, but also to adjust some internal parameters in order to use better buffer and chunk sizes for optimum I/O speed. Please check Section D.2 for a brief tutorial on its use.
Another thing that you might want to use ptrepack for is changing the compression filters or compression levels of your existing data for different goals, like checking how this can affect both final size and I/O performance, or getting rid of the optional compressors like LZO or bzip2 in your existing files, in case you want to use them with generic HDF5 tools that do not have support for these filters.
[11] CArray nodes, though not extensible, are chunked and have their optimum chunk size automatically computed at creation time, since their final shape is known.
[12] We plan to address this limitation in the near future.
[13] Except for Array objects.
[14] Some users reported that the typical improvement with real data is between a factor 1.5x and 2.5x over the already compressed datasets.