Sean Mackay
2016-10-31 17:34:52 UTC
Hi,
The short version is I'm using PyTables for a task it might not actually be
perfectly suited for:
I've got a dataset which has ~200 Groups of ~400 Groups of 4 Arrays of
500-16000 elements, or about 320,000 nodes. However, this whole set is only
1.5 GB compressed and probably could have been implemented in Pandas or
something else designed to run and live in memory. I chose to implement it
in PyTables, and I ended up storing quite a bit of information with each
node as HDF5 Attributes (20-30 per Group).
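For concreteness, the layout looks roughly like this (the names here are made up, but the shape and attribute counts match my file):

    import numpy as np
    import tables

    # Rough sketch: ~200 Groups of ~400 Groups, each sub-Group holding
    # 4 Arrays plus 20-30 HDF5 attributes of metadata.
    f = tables.open_file('data.h5', mode='w')
    for i in range(200):
        outer = f.create_group('/', 'outer%03d' % i)
        for j in range(400):
            inner = f.create_group(outer, 'inner%03d' % j)
            for k in range(4):
                f.create_array(inner, 'arr%d' % k, obj=np.arange(500))
            for n in range(25):
                setattr(inner._v_attrs, 'meta%02d' % n, n)
    f.close()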
Unfortunately, this seems to manifest as slow operation. When I iterate
over all the data and apply some calculation to subsets of it, I end up
spending 160 seconds in the AttributeSet class alone, out of 220 seconds
of total execution time.
ncalls    tottime  percall  cumtime  percall  filename:lineno(function)
1844745   51.754   0.000    51.754   0.000    {method '_g_getattr' of 'tables.hdf5extension.AttributeSet' objects}
2954006   35.901   0.000    35.901   0.000    {method 'reduce' of 'numpy.ufunc' objects}
1868775   14.883   0.000    126.966  0.000    attributeset.py:282(__getattr__)
2953157   10.546   0.000    48.355   0.000    fromnumeric.py:2395(prod)
3488246   8.723    0.000    62.905   0.000    attributeset.py:61(issysattrname)
2011526   5.781    0.000    16.974   0.000    file.py:395(cache_node)
83307     5.645    0.000    162.927  0.002    attributeset.py:200(__init__)
2011528   5.562    0.000    9.716    0.000    file.py:382(register_node)
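(For reference, I collected the profile with the standard cProfile module, along these lines; process_all here is a stand-in for my actual analysis loop:)

    import cProfile
    import pstats

    # Dump stats to a file, then print the top offenders by tottime
    cProfile.run('process_all(h5file)', 'attr_profile.out')
    pstats.Stats('attr_profile.out').sort_stats('tottime').print_stats(10)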
Reading through the attributeset.py file, I see that when a node is
accessed it immediately loads all attributes into a dictionary, to enable
tab completion of attribute names in interactive mode. Unfortunately, this
doesn't help me much, since I'm writing a non-interactive application.
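One workaround I've been considering is reading single attributes through a second h5py handle, since as far as I can tell h5py fetches attributes lazily, one at a time, instead of loading the whole set up front. A rough, untested sketch:

    import h5py

    h5 = h5py.File('data.h5', 'r')  # read-only handle alongside PyTables

    def get_one_attr(node_path, name):
        # h5py's .attrs reads one HDF5 attribute on demand, without
        # PyTables' eager AttributeSet initialization.
        return h5[node_path].attrs[name]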
Loading the file into RAM with the `driver='H5FD_CORE'` option at file
open didn't change the execution time; it all seems to be the overhead of
the AttributeSet object.
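(The open call looked roughly like this:)

    import tables

    # H5FD_CORE loads the whole file image into memory at open time
    h5file = tables.open_file('data.h5', mode='r', driver='H5FD_CORE')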
So my questions:
1) Is there a mode I can enable that would skip some of this overhead and
load only the attributes I access directly?
2) Where and when does the AttributeSet initialize? Are there any best
practices I can use to avoid accidentally loading the same attribute set
into memory more than once? It only inits 83,000 times (roughly once per
Group), so I may not actually be doubling up.
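For what it's worth, the best idea I have so far for (2) is to memoize each Group's user attributes in a plain dict keyed by node path, roughly like this (untested):

    attr_cache = {}

    def cached_attrs(node):
        # Pay the AttributeSet cost once per node, then reuse a plain
        # dict for every later lookup.
        path = node._v_pathname
        if path not in attr_cache:
            attrs = node._v_attrs
            attr_cache[path] = dict((name, getattr(attrs, name))
                                    for name in attrs._f_list('user'))
        return attr_cache[path]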
Thank you in advance to anyone who can offer insight.
Sean