Tom Aldcroft
2015-02-17 11:50:51 UTC
I posted this as an issue on GitHub earlier [1], but I have a better
demonstration now and Francesc suggested that I post to the mailing list.
I have about 100 files, each containing a very simple HDF5 data structure with
a data node of ~10^9 float values. These data files are hosted on a NetApp
server (by external requirements). My application requires reading small
slices of this data node from all 100 files. Here is a code snippet that is
representative of the real code:
import os
import tables

vals = {}
for filename in filenames:
    h5 = tables.openFile(os.path.join(filename))
    vals[filename] = h5.root.data[10000:10050]
    h5.close()
This is running on a CentOS-6 machine with a local-network NetApp server,
using Python 2.7 and numpy 1.8. Using PyTables 2.3.1 + HDF5 1.8.9, the 100
files are consistently read in a couple of seconds, which is acceptable for
our needs. Using PyTables 3.1.1 + HDF5 1.8.9 (or 1.8.13), the first time it
runs through, each file read takes 2 to 10 seconds. Once a file has been
read, subsequent reads of that file are fast, maybe due to the file being
cached by the NetApp.
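A quick way to narrow this down would be to time the open and the slice read
separately, to see whether the cold-file cost is in opening the file (metadata
reads) or in reading the data chunks themselves. Here is a sketch of that
diagnostic (not something I've run yet; it just assumes the same file layout as
above, and uses the PyTables 3.x open_file() name, which replaces the
deprecated openFile()):

import time
import tables

def time_open_and_slice(filename, i0, n=50):
    # Time the file open separately from the slice read to see which
    # step dominates on a cold (uncached) file.
    t0 = time.time()
    h5 = tables.open_file(filename)   # tables.openFile() in the 2.x API
    t_open = time.time() - t0

    t1 = time.time()
    vals = h5.root.data[i0:i0 + n]
    t_read = time.time() - t1
    h5.close()

    print('open: {:.4f} s, slice read: {:.4f} s'.format(t_open, t_read))
    return vals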
To demonstrate this with numbers, I took one of my files, which is about 600
MB (with zlib compression). Then I ran the following script using 2.3.1
and 3.1.1. Note that 2.3.1 is built from source (as is the accompanying
HDF5 library), while the 3.1.1 version is from Anaconda. Not sure if that
might make a difference (see the version-check sketch after the script).
import tables
import time
import shutil

def read_slice(filename, i0):
    t0 = time.time()
    h5 = tables.openFile(filename)
    vals = h5.root.data[i0:i0 + 50]
    h5.close()
    print('Reading {} values took {:.4f} seconds with tables {}'
          .format(len(vals), time.time() - t0, tables.__version__))

# Make sure we use a fresh uncached file
filename = 'test2.h5'
shutil.copy('test.h5', filename)

read_slice(filename, 10000)
read_slice(filename, 1000000)
read_slice(filename, 2000000)
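As an aside on the built-from-source vs. Anaconda question above: a simple way
to record exactly which libraries each environment is using would be something
like the following (tables.print_versions() exists in both versions as far as
I know; the HDF5 version attribute was renamed between the 2.x and 3.x APIs):

import tables

# Print the PyTables, HDF5, NumPy and compression library versions
# for the currently imported build.
tables.print_versions()

# tables.hdf5Version in the 2.x API, tables.hdf5_version in 3.x.
hdf5_version = (getattr(tables, 'hdf5_version', None)
                or getattr(tables, 'hdf5Version', None))
print('PyTables {} linked against HDF5 {}'
      .format(tables.__version__, hdf5_version))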
The results are as follows. As you can see, with 3.1.1 the initial (cold-file)
read is about 100 times slower than with 2.3.1.
PyTables 2.3.1
--------------
In [1]: run test_read
Reading 50 values took *0.0840* seconds with tables 2.3.1
Reading 50 values took 0.0043 seconds with tables 2.3.1
Reading 50 values took 0.0042 seconds with tables 2.3.1
In [2]: run test_read
Reading 50 values took *0.0993* seconds with tables 2.3.1
Reading 50 values took 0.0154 seconds with tables 2.3.1
Reading 50 values took 0.0142 seconds with tables 2.3.1
In [3]: run test_read
Reading 50 values took *0.0849* seconds with tables 2.3.1
Reading 50 values took 0.0037 seconds with tables 2.3.1
Reading 50 values took 0.0034 seconds with tables 2.3.1
PyTables 3.1.1
--------------
In [1]: run test_read
Reading 50 values took *16.3195* seconds with tables 3.1.1
Reading 50 values took 0.0356 seconds with tables 3.1.1
Reading 50 values took 0.0256 seconds with tables 3.1.1
In [2]: run test_read
Reading 50 values took *8.1044* seconds with tables 3.1.1
Reading 50 values took 0.0459 seconds with tables 3.1.1
Reading 50 values took 0.0273 seconds with tables 3.1.1
In [3]: run test_read
Reading 50 values took *10.2829* seconds with tables 3.1.1
Reading 50 values took 0.0246 seconds with tables 3.1.1
Reading 50 values took 0.0250 seconds with tables 3.1.1
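One experiment I have not tried yet is varying the PyTables caching parameters
when opening the file, in case any of them is related to the cold-read cost (I
genuinely don't know whether they are). In PyTables 3.x the uppercase
parameters from tables/parameters.py can be overridden per file as keyword
arguments to open_file(); a sketch of that kind of test, reusing the same
fresh-copy trick as in the script above, would be:

import time
import shutil
import tables

def read_slice_with(filename, i0, **params):
    # Keyword arguments override the defaults in tables/parameters.py
    # for this open_file() call only (PyTables 3.x).
    t0 = time.time()
    h5 = tables.open_file(filename, mode='r', **params)
    vals = h5.root.data[i0:i0 + 50]
    h5.close()
    print('{:.4f} seconds with {}'.format(time.time() - t0, params))
    return vals

# Copy a fresh uncached file before each trial so the comparison is fair.
shutil.copy('test.h5', 'test2.h5')
read_slice_with('test2.h5', 10000)                          # defaults
shutil.copy('test.h5', 'test2.h5')
read_slice_with('test2.h5', 10000, NODE_CACHE_SLOTS=0)      # no node cache
shutil.copy('test.h5', 'test2.h5')
read_slice_with('test2.h5', 10000, CHUNK_CACHE_SIZE=2**25)  # 32 MB chunk cache

The specific parameter values here are just guesses to vary, not
recommendations.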
Thanks for any help!
Tom
[1]: https://github.com/PyTables/PyTables/issues/402