Discussion:
[pytables-users] Possible performance regression between 2.3.1 and 3.1.1
Tom Aldcroft
2015-02-17 11:50:51 UTC
I posted this as an issue on github earlier [1], but I have a better
demonstration now and Francesc suggested that I post to the mailing list.

I have about 100 files, each containing a very simple HDF5 data structure with
a data node of ~10^9 float values. These data files are hosted on a NetApp
server (by external requirements). My application requires reading small
slices of this data from all 100 files. Here is a code snippet that is
representative of the real code:

import os
import tables

vals = {}
for filename in filenames:
    h5 = tables.openFile(os.path.join(filename))
    vals[filename] = h5.root.data[10000:10050]
    h5.close()

This is running on a CentOS 6 machine with a local network NetApp server,
using Python 2.7 and numpy 1.8. Using PyTables 2.3.1 + HDF5 1.8.9, the 100
files are consistently read in a couple of seconds, which is acceptable for
our needs. Using PyTables 3.1.1 + HDF5 1.8.9 (or 1.8.13), the first time it
runs through, each file read takes 2 to 10 seconds. Once a file has
been read, subsequent reads of it are fast, perhaps because the
file is cached by the NetApp.

To demonstrate this with numbers, I took one of my files, which is about 600
MB (with zlib compression). Then I ran the following script using 2.3.1
and 3.1.1. Note that 2.3.1 was built from source (as was the
accompanying HDF5 library), while the 3.1.1 version is from Anaconda. I am
not sure whether that might make a difference.

import tables
import time
import shutil

def read_slice(filename, i0):
    t0 = time.time()
    h5 = tables.openFile(filename)
    vals = h5.root.data[i0:i0 + 50]
    h5.close()
    print('Reading {} values took {:.4f} seconds with tables {}'
          .format(len(vals), time.time() - t0, tables.__version__))

# Make sure we use a fresh uncached file
filename = 'test2.h5'
shutil.copy('test.h5', filename)

read_slice(filename, 10000)
read_slice(filename, 1000000)
read_slice(filename, 2000000)

The results are as follows. As you can see, with 3.1.1 the initial read is
about 100 times slower than with 2.3.1.

PyTables 2.3.1
--------------

In [1]: run test_read
Reading 50 values took *0.0840* seconds with tables 2.3.1
Reading 50 values took 0.0043 seconds with tables 2.3.1
Reading 50 values took 0.0042 seconds with tables 2.3.1

In [2]: run test_read
Reading 50 values took *0.0993* seconds with tables 2.3.1
Reading 50 values took 0.0154 seconds with tables 2.3.1
Reading 50 values took 0.0142 seconds with tables 2.3.1

In [3]: run test_read
Reading 50 values took *0.0849* seconds with tables 2.3.1
Reading 50 values took 0.0037 seconds with tables 2.3.1
Reading 50 values took 0.0034 seconds with tables 2.3.1

PyTables 3.1.1
--------------

In [1]: run test_read
Reading 50 values took *16.3195* seconds with tables 3.1.1
Reading 50 values took 0.0356 seconds with tables 3.1.1
Reading 50 values took 0.0256 seconds with tables 3.1.1

In [2]: run test_read
Reading 50 values took *8.1044* seconds with tables 3.1.1
Reading 50 values took 0.0459 seconds with tables 3.1.1
Reading 50 values took 0.0273 seconds with tables 3.1.1

In [3]: run test_read
Reading 50 values took *10.2829* seconds with tables 3.1.1
Reading 50 values took 0.0246 seconds with tables 3.1.1
Reading 50 values took 0.0250 seconds with tables 3.1.1
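To narrow down where the time goes, the file open and the slice read can be
timed separately — if the open dominates, the regression is likely in how the
library walks the file's metadata on open rather than in the slice read
itself. This is a sketch I have not run against the real data; it assumes the
PyTables 3.x naming (open_file / create_array) and builds a small stand-in
file so it runs without the 600 MB test file:

```python
import time
import tables

def read_slice_split(filename, i0, n=50):
    """Time the file open and the slice read separately (a debugging sketch)."""
    t0 = time.time()
    h5 = tables.open_file(filename)   # 3.x spelling of openFile
    t_open = time.time() - t0
    t1 = time.time()
    vals = h5.root.data[i0:i0 + n]
    t_read = time.time() - t1
    h5.close()
    print('open: {:.4f} s, slice read: {:.4f} s ({} values)'
          .format(t_open, t_read, len(vals)))
    return t_open, t_read

if __name__ == '__main__':
    import numpy as np
    # Build a small stand-in file so the sketch runs without the real data.
    with tables.open_file('demo.h5', 'w') as h5:
        h5.create_array('/', 'data', np.arange(100000.0))
    read_slice_split('demo.h5', 10000)
```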

Thanks for any help!
Tom


[1]: https://github.com/PyTables/PyTables/issues/402
--
You received this message because you are subscribed to the Google Groups "pytables-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pytables-users+***@googlegroups.com.
To post to this group, send an email to pytables-***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Tom Aldcroft
2015-02-17 17:23:41 UTC
As an update to this issue, it seems that the performance regression
appeared in 2.4.0:

In [3]: run test_read
Reading 50 values took 28.2938 seconds with tables 2.4.0
Reading 50 values took 0.0371 seconds with tables 2.4.0
Reading 50 values took 0.0332 seconds with tables 2.4.0

In [4]: run test_read
Reading 50 values took 22.7698 seconds with tables 2.4.0
Reading 50 values took 0.0363 seconds with tables 2.4.0
Reading 50 values took 0.0295 seconds with tables 2.4.0

Is it possible that I need to re-write the files using the same PyTables
version that I'm reading with?
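
If re-writing does turn out to be needed, it should not require a custom
script: PyTables ships the ptrepack command-line utility, and the same copy
can be done from Python. This is a sketch, not something I have tried on the
real files; it assumes the 3.x naming (copy_file), and 'test.h5' stands in
for one of the real data files:

```python
import os
import tables

def rewrite(src, dst):
    # Re-write src with the currently installed PyTables/HDF5 stack,
    # overwriting any existing dst. The equivalent from the command line
    # (with the ptrepack utility shipped with PyTables) is:
    #     ptrepack src.h5 dst.h5
    tables.copy_file(src, dst, overwrite=True)

if os.path.exists('test.h5'):   # 'test.h5' from the script above
    rewrite('test.h5', 'test_rewritten.h5')
```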

- Tom