Discussion:
searching for group names
Nyirő Gergő
2013-08-05 09:11:32 UTC
Hello,


We develop a measurement evaluation tool, and we'd like to use
pytables/hdf5 as a middle layer for signal accessing.

We have to deal with the silly structure of the recorder device
measurement format.



The signals can be accessed via two identifiers:

* device name: <source of the signal>-<channel of the
message>-<another tag>-<yet another tag>

* signal name



The first identifier says the source information of the signal, which
can be quite long.

Therefore I grouped the device name into two layers:

/<source of the signal>
    /<channel of the message>...
        /<signal name>



So if you have the same message from two channels, then you will get

/foo-device-name
    /channel-1
        /bar
        /baz
    /channel-2
        /bar
        /baz



Besides signal loading, we have to search for signal name as fast as
possible, and return with the shortest unique device name part and the
signal name.

Using the structure above, iterating over the group names is quite
slow, so I built a table of device and signal names.

As far as I know, PyTables queries do not support string searching
(e.g. startswith, *foo[0-9]ch*, etc.), so searching this table leads us
to a pure Python loop, which is slow again.

Therefore I build a Python dictionary from the table, which is much
faster to iterate over than the table itself, but the init time
increased from 100 ms to 3-4 sec (we have more than 40 000 signals).
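
A minimal sketch of the dictionary approach described above, using only the standard library. The rows are hypothetical stand-ins for the contents of the device/signal table, and `shortest_unique_parts` reflects one assumed reading of "shortest unique device name part": the fewest leading `-`-separated tags that distinguish a device from the other devices carrying the same signal.

```python
from collections import defaultdict

# Hypothetical (device name, signal name) rows, standing in for the
# contents of the device/signal table stored in the HDF5 file.
ROWS = [
    ("foo-1-a-x", "bar"),
    ("foo-1-a-x", "baz"),
    ("foo-2-a-x", "bar"),
    ("qux-1-b-y", "bar"),
]


def build_index(rows):
    """Map each signal name to the list of device names that carry it."""
    index = defaultdict(list)
    for device, signal in rows:
        index[signal].append(device)
    return dict(index)


def shortest_unique_parts(device, others):
    """Return the fewest leading '-'-separated tags of `device` that no
    other device name in `others` starts with (assumed interpretation)."""
    tags = device.split("-")
    other_tags = [o.split("-") for o in others if o != device]
    for n in range(1, len(tags) + 1):
        prefix = tags[:n]
        if all(o[:n] != prefix for o in other_tags):
            return "-".join(prefix)
    return device  # no shorter unique prefix exists


def find_signal(index, signal):
    """Return (shortest unique device part, signal name) pairs."""
    devices = index.get(signal, [])
    return [(shortest_unique_parts(d, devices), signal) for d in devices]


INDEX = build_index(ROWS)
RESULT = find_signal(INDEX, "bar")
```

Each lookup is a single dictionary access followed by a scan of only the devices that share the signal name, so the cost no longer depends on the total number of signals.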



Do you have any advice on how to search for group names in HDF5 with
PyTables in an efficient way?

PS: I would be most happy with a glob interface.



Thanks in advance for your advice,

gergo
Anthony Scopatz
2013-08-05 14:50:20 UTC
Hi Gergő,

Searching through group names, like accessing all HDF5 metadata, is slow.
For group names this is because rather than searching through a list you
are traversing a B-tree, IIRC. So you have to use the couple of tricks
that you used: 1) have another Table / Array of all table names, 2) read
this in once to a native Python data structure (dict here).

However, 4 sec to read in this table seems excessive for data of this size.
You are probably not reading this in properly. You should be using:

raw_grps = f.root.grp_names[:]

or similar.
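
Once the names are in memory, the glob-style search requested in the original post can be done with Python's standard `fnmatch` module. This is only a minimal sketch: `RAW_NAMES` is a hypothetical stand-in for the result of a single slice read such as `f.root.grp_names[:]`, and the `device/signal` layout of each name is an assumption.

```python
import fnmatch
import re

# Hypothetical stand-in for names read from the file in one slice;
# string columns typically come back as bytes, so decode them once.
RAW_NAMES = [
    b"foo1ch-channel-1/bar",
    b"foo1ch-channel-2/bar",
    b"foo2ch-channel-1/baz",
    b"qux-channel-1/bar",
]
NAMES = [raw.decode("utf-8") for raw in RAW_NAMES]


def glob_names(names, pattern):
    """Return the names matching a shell-style pattern (case-sensitive)."""
    # fnmatch.translate turns the glob into a regular expression, so the
    # pattern is compiled once rather than re-parsed for every name.
    matcher = re.compile(fnmatch.translate(pattern)).match
    return [name for name in names if matcher(name)]


MATCHES = glob_names(NAMES, "*foo[0-9]ch*/bar")
```

Compiling the pattern once keeps the per-name cost to a single regular-expression match, which should be acceptable for a list of a few tens of thousands of names.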

Maybe other people have some other ideas.

Be Well
Anthony
_______________________________________________
Pytables-users mailing list
https://lists.sourceforge.net/lists/listinfo/pytables-users
Gabriel J.L. Beckers
2013-08-07 11:39:13 UTC
Hi,

I don't know if this is related in any way to Gergő's problem, but I
get slow responses when querying which children a group contains, if
that group contains large leaves. I am using PyTables 2.5 and HDF5 1.8.9
on 64-bit Linux.

Specifically, I found that the _g_get_objinfo method (which is
used by other methods that I use) is slow when called on a large leaf.
The slowness is proportional to the size of the leaf. It is almost as
if some process is actually reading the data instead of just info on
the type of data. I am noticing this because my data is on an external
USB 3 disk. To give you an idea: that method takes almost 80 seconds to
return the string 'Leaf' when called on a 5 GB EArray. That should
roughly correspond to reading the complete disk-based array. The info
is cached somehow, because if I run the method a second time in the
same Python session it is very fast.

If I copy my HDF5 file to my SSD, things are much faster, but
running the method still takes 2 seconds or so on a 5 GB leaf.

Is this expected behavior and should I just avoid this method in my
applications, or is something wrong?

Best, Gabriel
Anthony Scopatz
2013-08-07 17:13:39 UTC
Hi Gabriel,

Are you using compression on this EArray? This method is basically a thin
wrapper over some HDF5 functions. I think that the data that you are asking
for (inadvertently, maybe) is just expensive to get.

Be Well
Anthony
Gabriel J.L. Beckers
2013-08-08 15:25:18 UTC
Post by Anthony Scopatz
Are you using compression on this EArray? This method is basically a thin
wrapper over some HDF5 functions. I think that the data that you are asking
for (inadvertently, maybe) is just expensive to get.

No, no compression. But I saw that this is one of the first PyTables
datasets I created years ago. The chunk size was not chosen well. I
have improved that now (better chunk size/shape, transposed axes, and
using CArray) and things are roughly 50% faster.

But I still don't understand why so much data is apparently being read
when I only want to know which children (i.e. the leaf names) a group
contains. To do this in my program I loop over _v_children.items(),
like this:

d = {}
for label, node in f.root.recordings.AB_5000._v_children.items():
    d[label] = node

I would have expected code like this to yield a dictionary of node
objects, without reading or inspecting the data that the nodes
contain. But apparently under the hood HDF5 is looking at the contents
of the nodes, which takes a while if they are large, especially over a
USB 3 connection. It is not reading the full array into RAM, because
the memory footprint of the Python session doesn't increase
appreciably if I run the code above.

Thanks, all the best, Gabriel
