Discussion:
[pytables-users] Pytables AttributeSet functions relatively slow, Is there anything I can do?
Sean Mackay
2016-10-31 17:34:52 UTC
Hi,

The short version is I'm using PyTables for a task it might not actually be
perfectly suited for:

I've got a dataset which has ~200 Groups of ~400 Groups of 4 Arrays of
500-16000 elements, or about 320,000 nodes. However, this whole set is only
1.5 GB compressed and probably could have been implemented in Pandas or
something else designed to run and live in memory. I chose to implement it
in PyTables, and I ended up storing quite a bit of information with each
node as HDF5 Attributes (20-30 per Group).
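For reference, the layout is built roughly like this (the group names,
attribute names, and loop counts below are placeholders for illustration;
the real file has ~200 x ~400 groups and 20-30 attributes per group):

    import numpy as np
    import tables as tb

    # Sketch of the layout; loop counts are tiny here, ~200 x ~400 in reality.
    with tb.open_file('data.h5', mode='w', filters=tb.Filters(complevel=5)) as h5:
        for i in range(2):
            outer = h5.create_group('/', 'run_%03d' % i)
            for j in range(3):
                inner = h5.create_group(outer, 'channel_%03d' % j)
                inner._v_attrs.kind = 'raw'    # one of 20-30 attributes per group
                inner._v_attrs.gain = 1.25
                for k in range(4):             # 4 arrays of 500-16000 elements
                    h5.create_carray(inner, 'arr%d' % k,
                                     obj=np.random.rand(500).astype('float32'))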

Unfortunately, this seems to manifest as slow operation. When I iterate
over all the data and apply some calculation to subsets of it, I end up
spending 160 seconds in the AttributeSet class alone, out of 220 seconds of
total execution time.

   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
  1844745   51.754    0.000   51.754    0.000  {method '_g_getattr' of 'tables.hdf5extension.AttributeSet' objects}
  2954006   35.901    0.000   35.901    0.000  {method 'reduce' of 'numpy.ufunc' objects}
  1868775   14.883    0.000  126.966    0.000  attributeset.py:282(__getattr__)
  2953157   10.546    0.000   48.355    0.000  fromnumeric.py:2395(prod)
  3488246    8.723    0.000   62.905    0.000  attributeset.py:61(issysattrname)
  2011526    5.781    0.000   16.974    0.000  file.py:395(cache_node)
    83307    5.645    0.000  162.927    0.002  attributeset.py:200(__init__)
  2011528    5.562    0.000    9.716    0.000  file.py:382(register_node)
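For context, the hot loop is shaped roughly like this (the attribute names
and the calculation are placeholders); the first access to a group's
attributes builds a fresh AttributeSet, which is where the time above goes:

    import tables as tb

    with tb.open_file('data.h5', mode='r') as h5:
        total = 0.0
        for group in h5.walk_groups('/'):
            attrs = group._v_attrs                       # first touch loads *all* attrs
            if getattr(attrs, 'kind', None) != 'raw':    # placeholder selection
                continue
            for arr in h5.iter_nodes(group, classname='Array'):
                total += arr[:].sum() * attrs.gain       # placeholder calculation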

Reading through the attributeset.py file I see that when a node is
accessed, it immediately loads all of that node's attributes into a
dictionary so that attribute names can be tab-completed in interactive
mode. Unfortunately, this doesn't help me much, as I am writing a
non-interactive application.

Loading the whole file into RAM with the `driver='H5FD_CORE'` option during
file open didn't change the execution time; it all seems to be the overhead
of the AttributeSet object.
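(For reference, this is roughly what I tried; the in-memory driver only
changes where HDF5 reads the bytes from, not how many AttributeSet objects
get built.)

    import tables as tb

    # Open through the HDF5 in-memory (core) driver; I/O happens from RAM,
    # but the Python-side AttributeSet overhead is unchanged.
    h5 = tb.open_file('data.h5', mode='r', driver='H5FD_CORE')
    try:
        print(h5.root._v_nchildren)
    finally:
        h5.close()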

So my questions:
1) Is there any mode I can enable which might disable some of this
overhead and only load the attributes I access directly?

2) Where and when does the AttributeSet initialize? Are there any best
practices I can use to avoid accidentally loading the same attribute set
into memory more than once? It only inits 83,000 times, so I might not
actually be doubling up.

Thank you to anyone who can offer insight.
Sean
Sean Mackay
2016-11-04 03:19:00 UTC
Found a solution which minimizes the problem outside of PyTables: importing
the functools lru_cache decorator
(https://docs.python.org/3/library/functools.html#functools.lru_cache),
which memoizes a function's arguments and results. The first call to select
a subset of groups based on an attribute is still very slow (as it reads the
file and imports all attributes for all child nodes), but any repeated call
with the same parameters is effectively instant, since the returned list of
nodes is cached and returned without ever touching the .h5 file.

I may even be able to optimize this further by applying the same decorator
to the function that gets all child nodes, or by loading all nodes into
memory once and keeping the loaded list in a variable.
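As a rough sketch of what I mean (the selection criterion and attribute
name are just examples, and as a slight variation I cache node paths rather
than the Node objects themselves, so the cached value stays valid even if
PyTables drops a node from its own cache):

    from functools import lru_cache

    import tables as tb

    h5 = tb.open_file('data.h5', mode='r')

    @lru_cache(maxsize=None)
    def groups_with_kind(kind):
        # First call walks the whole file and touches every AttributeSet (slow);
        # repeated calls with the same argument come straight from the cache.
        return tuple(g._v_pathname for g in h5.walk_groups('/')
                     if getattr(g._v_attrs, 'kind', None) == kind)

    raw_groups = groups_with_kind('raw')   # slow the first time
    raw_groups = groups_with_kind('raw')   # instant thereafter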

Sean
Francesc Alted
2016-11-04 08:45:03 UTC
Another thing that you can try is to create your files without the system
attributes, so as to prevent PyTables from loading too many attrs. This is
achieved by passing the `pytables_sys_attrs=False` parameter during file
creation (open_file()). These system attributes are specific to PyTables,
but it can read files without them just fine.
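Something along these lines (the file and attribute names are just an
example):

    import tables as tb

    # Create the file without PyTables' own system attributes (CLASS, TITLE,
    # VERSION, ...); user attributes keep working as usual.
    h5 = tb.open_file('data_nosys.h5', mode='w', pytables_sys_attrs=False)
    grp = h5.create_group('/', 'run_000')
    grp._v_attrs.gain = 1.25
    h5.close()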

Hope this helps
--
Francesc Alted
Sean Mackay
2016-11-04 15:32:44 UTC
Thank you, I might give that a try.

Also, I re-read this part of the Optimization tips, and it could be helpful
as well:
http://www.pytables.org/usersguide/optimization.html#getting-the-most-from-the-node-lru-cache

Essentially, PyTables keeps an internal LRU cache of nodes, and making it
larger could help here. I would have to test it, though, since the cache
would need to be significantly larger than the default given that I'm
iterating over thousands of nodes.
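Presumably something like this, with the slot count tuned to the working
set:

    import tables as tb

    # Raise the node LRU cache well above the default so that groups revisited
    # during the iteration are not evicted and re-opened.
    h5 = tb.open_file('data.h5', mode='r', NODE_CACHE_SLOTS=100000)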

If only there were a flag for the pre-1.2 behaviour, where the entire list
of nodes was loaded into memory upon opening the file; that might actually
be preferable in my scenario.

Sean
Francesc Alted
2016-11-04 15:45:09 UTC
Indeed. And if you want everything loaded in memory, you can do that by
passing a large negative NODE_CACHE_SLOTS. From the docs:

"""
Also worth noting is that if you have a lot of memory available and
performance is absolutely critical, you may want to try out a negative
value for parameters.NODE_CACHE_SLOTS. This will cause that all the touched
nodes will be kept in an internal dictionary and this is the faster way to
load/retrieve nodes. However, and in order to avoid a large memory
consumption, the user will be warned when the number of loaded nodes will
reach the -NODE_CACHE_SLOTS value.
"""

Francesc