Discussion:
[pytables-users] Bit-for-bit reproducibility with PyTables
Alex Cobb
2017-08-21 14:20:20 UTC
Hi All,

I have been using PyTables to store large simulations as part of my
scientific work. Using PyTables dramatically reduced the I/O bottleneck
issues I was having, with minimal coding. Thanks for making this great
tool!

Usually, the HDF5 files I generate with PyTables provide the data for
plots, etc. I like to put everything in a makefile-type graph of build
rules (using SCons). SCons decides what to rebuild from content hashes, and
computing a hash is much faster than most of the further computation and
analysis I do on these simulation outputs, so it would be very helpful to
be able to turn off timestamps on objects created by PyTables:
https://support.hdfgroup.org/HDF5/hdf5-quest.html#bit-for-bit
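
For context on why a single stray timestamp matters in a hash-based build, here is a minimal sketch of a content check. It assumes SHA-256, and the helper names file_digest and outputs_match are hypothetical; they are not part of SCons or PyTables.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file's bytes, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large HDF5 files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outputs_match(path_a, path_b):
    """True only when the two files are byte-for-byte identical."""
    return file_digest(path_a) == file_digest(path_b)
```

Any byte that differs between two runs, including an embedded creation time, changes the digest, so two files holding identical data would still compare as different.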

Currently (as of commit 30a6420075befafe8ff87f7a7249496d2519a68b in the
development branch), I can't find a way to turn off table or array
timestamps. This means that my HDF5 output files always have different
checksums even if all the data is identical, because of different object
timestamps, making regression testing more difficult. This issue was
raised on PyTables-users some time back
https://sourceforge.net/p/pytables/mailman/message/24835507/
I also can't find a way to access object timestamps from Python via
PyTables, i.e., the object timestamps are written but can't be easily read.

I have a workaround for Arrays etc. using h5py --- those folks addressed the
timestamp issue with a patch a while ago, so I can create an array in h5py
with timestamps turned off, then open it in PyTables and populate the array.
https://github.com/h5py/h5py/issues/225
However, I can't seem to do the same thing with PyTables's Tables, which
are the killer feature of PyTables for my purposes.
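
A rough sketch of the h5py workaround for plain arrays, assuming h5py's track_times keyword on create_dataset and a NumPy array as input. The function name create_untracked_array is hypothetical, and whether the resulting file is fully bit-for-bit reproducible should be checked on your own setup.

```python
def create_untracked_array(path, name, values):
    """Create an untimestamped dataset with h5py, then fill it via PyTables.

    `values` is assumed to be a NumPy array; the function name is hypothetical.
    """
    # Imported inside the function so the sketch loads even where these
    # optional packages are not installed.
    import h5py
    import tables

    # Step 1: let h5py create an empty dataset with time tracking disabled.
    with h5py.File(path, "w") as h5py_file:
        h5py_file.create_dataset(
            name, shape=values.shape, dtype=values.dtype, track_times=False
        )

    # Step 2: reopen the same file with PyTables and write the data into the
    # existing node.
    with tables.open_file(path, mode="a") as pt_file:
        node = pt_file.get_node("/" + name)
        node[:] = values
```

This covers only array-like datasets; it does not help with Tables, which is the gap described above.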

I wrote a patch to
1. allow disabling object time tracking when creating Tables, Arrays,
CArrays, EArrays, or VLArrays using H5Pset_obj_track_times, via an optional
keyword argument ("track_times") in constructors (passed in through the
Leaf class, the way filters are, to the functions in the C sources), and
2. allow access to HDF5 object timestamps via the H5O_info_t struct with a
new method of Nodes (._get_obj_timestamps()).
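
A sketch of how the first part of the patch might be used, assuming the track_times keyword is accepted by create_table as proposed above. The function name, node name, and column layout are hypothetical, and a PyTables build without the patch would raise TypeError on the unknown keyword.

```python
def write_table_without_timestamps(path, rows):
    """Write `rows` (a list of (ident, energy) tuples) to a Table.

    Relies on the proposed `track_times` keyword; hypothetical helper.
    """
    # Imported inside the function so the sketch loads even where PyTables
    # is not installed.
    import tables

    # A dictionary is one of the description forms create_table accepts.
    description = {
        "ident": tables.Int64Col(pos=0),
        "energy": tables.Float64Col(pos=1),
    }

    with tables.open_file(path, mode="w") as h5file:
        # track_times=False asks HDF5 not to record object timestamps
        # for this table.
        table = h5file.create_table(
            "/", "particles", description, track_times=False
        )
        table.append(rows)
```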

I wrote new tests for the changes introduced by the patch, and existing
tests all pass on my machine with the patch. The main possible interaction
would be if anyone is accessing the C-level (not Cython-level) functions,
like H5ARRAYmake(), etc. I'm not sure whether these C-level functions are
considered part of the public API of PyTables and should be kept
backwards-compatible.

Is this something the development team would be open to discussing? May I
submit a pull request?

I'm new here --- I searched the archive for similar issues, but apologies
in advance if I missed something already in-progress or am otherwise
stepping on toes.

Thanks and regards

Alex
--
You received this message because you are subscribed to the Google Groups "pytables-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pytables-users+***@googlegroups.com.
To post to this group, send an email to pytables-***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Francesc Alted
2017-08-21 15:49:32 UTC
Hi Alex,
Post by Alex Cobb
I wrote new tests for the changes introduced by the patch, and existing
tests all pass on my machine with the patch. The main possible interaction
would be if anyone is accessing the C-level (not Cython-level) functions,
like H5ARRAYmake(), etc. Not sure if these C-level functions are considered
part of the public API of PyTables and should be kept backwards-compatible.
To my understanding, no one should be using the H5ARRAYmake() sort of
functions because there has never been an intention of making them public.
Post by Alex Cobb
Is this something the development team would be open to discussing? May I
submit a pull request?
Yes, submitting a pull request is the thing to do. Someone can then
review your changes and eventually accept the patch.

Francesc
--
Francesc Alted
Alex Cobb
2017-08-22 08:07:50 UTC
Hi Francesc,
Post by Francesc Alted
Yes, submitting a pull request is the thing to do. Someone can then
review your changes and eventually accept the patch.
Thank you Francesc! I have done so and will look forward to feedback.

Alex