Discussion: [pytables-users] padding field data (or not)

Ken Walker
2018-10-30 00:28:59 UTC
This is related to my previous question about H5Tpack().
I am working through a problem reading/writing data with PyTables. I read some
data rows from one HDF5 file/dataset into a NumPy record array, then write
that array to a dataset in a different HDF5 file (no change to the data).
The data in the new file looks fine when interrogated with PyTables or
viewed with HDFView. However, a downstream C++ app can't read the PyTables
data.
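Here is a minimal sketch of the round trip (the file/table names are made up;
my real schema is a compound type that includes an S4 field):

    import tables as tb

    # Read all rows from the source table into a NumPy record array
    # (the dtype comes straight from the file)...
    with tb.open_file("upstream.h5", mode="r") as fin:
        recs = fin.root.mytable.read()

    # ...then write that array, unchanged, to a new file.
    with tb.open_file("copy.h5", mode="w") as fout:
        fout.create_table("/", "mytable", obj=recs)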
I am told (by the developers) that the compiler for the upstream program is
set to pad the data when it writes the original file (that I am reading),
and the pad is expected by the downstream reader (that reads the file I
created). Padding adds 4 pad bytes after a 4-byte S4 field so that the
next field starts at an 8-byte memory boundary. Based on observed behavior,
they have inferred that PyTables removes the pad bytes when reading the
dataset, and does not add a pad when writing the new dataset (all
perfectly legal in HDF5, and it does not affect data integrity). However,
the missing pad is expected by the downstream reader and causes an error (I
know, bad code design).
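To make the layout concrete, here is my understanding of the two record
layouts in NumPy terms (the field names "tag" and "value" are made up):

    import numpy as np

    # Padded layout: 4 pad bytes follow the 4-byte S4 field, so the
    # f8 field starts on an 8-byte boundary and the itemsize grows.
    padded = np.dtype({"names": ["tag", "value"],
                       "formats": ["S4", "f8"],
                       "offsets": [0, 8],
                       "itemsize": 16})

    # Packed layout: no pad bytes, fields are contiguous.
    packed = np.dtype([("tag", "S4"), ("value", "f8")])

    print(padded.itemsize, packed.itemsize)   # -> 16 12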

So... I'm wondering: is there something in PyTables that controls padding
when reading/writing datasets like this?

FYI, I recreated this read/write process with h5py, and the output file is
compatible with my downstream app. Apparently h5py retains the pad
characters. This is confirmed when I print the dataset's dtype: h5py reports
itemsize: 384, vs. itemsize: 380 when PyTables reads the dataset.
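Roughly what I did on the h5py side (again, made-up names):

    import h5py

    with h5py.File("upstream.h5", "r") as fin:
        dset = fin["mytable"]
        print(dset.dtype.itemsize)   # 384 in my case: the pad bytes survive
        arr = dset[...]              # compound dtype preserved, padding included

    with h5py.File("copy_h5py.h5", "w") as fout:
        fout.create_dataset("mytable", data=arr)  # writes the padded dtype as-is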
I could rewrite my utility with h5py, but I hope to avoid that (if
possible) because I leverage a lot of PyTables-unique functionality.
Thanks in advance for any insights into this quirky padding behavior.
-Ken
Francesc Alted
2018-10-30 08:10:20 UTC
Yes, I remember that PyTables does not implement padding (although I do not
remember the reason, it was probably just a matter of simplicity). Having
said this, introducing padding should not be that difficult, so pull
requests are welcome.

Francesc
--
Francesc Alted
Ken Walker
2018-10-30 15:11:56 UTC
Hi Francesc,
I thought you might say that. :-) Thanks for confirming. At least I'm not
assuming anymore.
I have never submitted a pull request. Do I need to join GitHub to do that?

I can use h5py... I just wish it had some of PyTables' handy functions:
walk_nodes(), read_where(), etc. :-)
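For example, these are the kinds of one-liners I'd miss (made-up
file/table/field names):

    import tables as tb

    with tb.open_file("data.h5", mode="r") as f:
        # Recurse over every Table in the file, wherever it lives.
        for tbl in f.walk_nodes("/", classname="Table"):
            print(tbl._v_pathname, tbl.nrows)
        # In-kernel query: only the matching rows come back.
        hits = f.root.mytable.read_where("value > 0")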
I recall an announcement that h5py and PyTables had plans to work together.
What's the status of that effort?
-Ken
Francesc Alted
2018-11-01 08:57:12 UTC
Post by Ken Walker
I have never submitted a pull request. Do I need to join GitHub to do that?

Yes, joining GitHub is the simplest approach.

Post by Ken Walker
I recall an announcement that h5py and PyTables had plans to work together.
What's the status of that effort?
Well, there have been two major attempts to make PyTables work on top of
h5py. The first one took place during a hackfest in Perth back in 2016
(thanks to Curtin University funds, and most especially to Andrea Bedini's
enthusiasm), where different maintainers gathered to start the porting
process. We made quite a bit of progress, but there was still a long way to
go; you can read the final report here:
https://github.com/PyTables/PyTables/blob/pt4/doc/New-Backend-Interface.rst.
The other important push happened last year (2017), using a small grant from
NumFOCUS. Alberto Sabater, the recipient of the grant, also made a lot of
progress on top of the existing 2016 work, especially on the *Array (EArray,
CArray, VLArray) front; you can find his contribution in this pull request:
https://github.com/PyTables/PyTables/pull/634.

My perception from both attempts is that the amount of work remaining to
complete the port is still very significant, and that small grants (like the
NumFOCUS ones, which are 3000 USD max) are not really suitable for getting
the job done. So for this year's small grant from NumFOCUS I suggested to
Javier Sancho (the recipient of the grant) that he concentrate on fixing
bugs, applying pending pull requests and doing a new release of PyTables,
and, with the remaining time, on implementing a web interface for
visualizing Table objects remotely; you can see the outcome of this effort
here: https://github.com/PyTables/datasette-pytables. I have to say that I
am really happy about the outcome of this latest grant.

From all of this experience, and frankly speaking, I am unsure about the
feasibility of the PyTables/h5py merge, because we would require quite a bit
more than a small grant for it, and I am not sure the users/foundations
would ever really pay that cost. So what I'd like to do instead is to
continue applying for small NumFOCUS grants in order to do maintenance work
on PyTables, and perhaps some small improvements; for example, I recently
applied for a NumFOCUS grant to extend the support of advanced indexing and
sorting to general compound datatypes in generic HDF5 files (rings a bell to
you?). I do think this approach would result in a better use of the (scarce)
resources that we currently have for PyTables maintenance, and that users
will benefit the most from it (but in case we get bigger funds, the
PyTables/h5py merge would still be an option, of course).

Hope this clarifies the current status of PyTables a bit more.

Francesc
--
Francesc Alted
Ken Walker
2018-11-12 20:38:07 UTC
Hi Francesc,
Thanks for the update on the PyTables/h5py merge. As we would say in the US:
"Don't hold your breath".
I will continue to use h5py when I have to deal with padded data fields (or
until my downstream application knows how to read packed data -- maybe in
2019?).
Thanks!
-Ken