Giovanni Luca Ciampaglia
2014-09-15 15:30:15 UTC
Hi all,
I have a table with 60 billion rows and a CSI index on it. Individual
indexed reads (i.e. read_where) are typically super fast (amazing job,
guys), but when I run a script that does a batch of them (just 466 in the
example below), the runtime goes through the roof. Most of the time seems
to be spent reading group information: in the profiler trace below,
_g_get_objinfo alone accounts for 303.6 of the 542.1 seconds. My code is
organized so that the function that does an individual indexed read takes
an instance of tables.file.File, gets the table instance (/pageview in
this case), and then calls its read_where method. Perhaps I should instead
get a handle to the table instance once and pass that around, rather than
the file instance. Any ideas?
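A minimal sketch of the alternative I have in mind, assuming the lookup is done with File.get_node and the query is a simple equality on the indexed id column. The names extract_one, extract_many and page_id are illustrative, not the ones in my script, and I have not yet checked whether this removes the _g_get_objinfo time:

```python
def extract_one(table, page_id):
    """Indexed read of all rows for one id from an already-resolved Table."""
    # condvars binds the name used in the condition string to a Python value.
    return table.read_where("id == page_id", condvars={"page_id": page_id})


def extract_many(h5file, page_ids, where="/pageview"):
    """Resolve the table node once, then reuse the handle for every read."""
    table = h5file.get_node(where)  # single lookup, outside the loop
    return {page_id: extract_one(table, page_id) for page_id in page_ids}


# Hypothetical usage (file name is a placeholder):
#   import tables
#   with tables.open_file("pageviews.h5", mode="r") as h5file:
#       series = extract_many(h5file, [12, 34, 56])
```

The functions only rely on get_node and read_where, so they can be exercised with any object that provides those two methods.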
Thanks!
Giovanni
/pageview (Table(60203733729,), shuffle, blosc(5)) ''
  description := {
  "id": Int64Col(shape=(), dflt=0, pos=0),
  "timestamp": Int64Col(shape=(), dflt=0, pos=1),
  "count": Int64Col(shape=(), dflt=0, pos=2)}
  byteorder := 'little'
  chunkshape := (174762,)
  autoindex := True
  colindexes := {
    "id": Index(9, full, shuffle, blosc(5)).is_csi=True}
  /pageview._v_attrs (AttributeSet), 10 attributes:
   [CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'id',
    FIELD_1_FILL := 0,
    FIELD_1_NAME := 'timestamp',
    FIELD_2_FILL := 0,
    FIELD_2_NAME := 'count',
    NROWS := 60203733729,
    TITLE := '',
    VERSION := '2.7']
2053378 function calls (1977969 primitive calls) in 542.064 seconds

Ordered by: internal time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         2435  303.620    0.125  303.620    0.125 {method '_g_get_objinfo' of 'tables.hdf5extension.Group' objects}
          638   40.425    0.063  226.299    0.355 index.py:2016(get_chunkmap)
          538   38.388    0.071  428.361    0.796 table.py:1552(read_where)
            1   34.320   34.320  537.231  537.231 extractseries.py:295(main)
           42   32.146    0.765   32.146    0.765 {method '_read_elements' of 'tables.tableextension.Table' objects}
         3436   22.413    0.007   22.413    0.007 {method '_read_index_slice' of 'tables.indexesextension.IndexArray' objects}
         6854   16.104    0.002   16.104    0.002 {method 'astype' of 'numpy.ndarray' objects}
          638   13.425    0.021  151.245    0.237 index.py:1831(search)
          392    6.147    0.016    6.152    0.016 {method '_search_bin_na_ll' of 'tables.indexesextension.IndexArray' objects}
         9411    4.429    0.000    4.429    0.000 {numpy.core.multiarray.empty}
          869    4.133    0.005    4.133    0.005 {method 'nonzero' of 'numpy.ndarray' objects}
          479    4.041    0.008    4.041    0.008 {method '_read_records' of 'tables.tableextension.Table' objects}
        39368    3.788    0.000    3.789    0.000 {numpy.core.multiarray.array}
          466    3.021    0.006  384.721    0.826 extractseries.py:112(extractone)
         1930    1.388    0.001    1.388    0.001 {numpy.core.multiarray.zeros}
         4890    0.958    0.000    1.082    0.000 conditions.py:437(call_on_recarr)
          538    0.389    0.001    0.432    0.001 necompiler.py:662(evaluate)
        15733    0.288    0.000    0.288    0.000 {method 'reduce' of 'numpy.ufunc' objects}
210012/210011    0.250    0.000    0.311    0.000 {isinstance}
           60    0.232    0.004    3.172    0.053 __init__.py:1(<module>)
            2    0.223    0.112    0.223    0.112 {method '_close_file' of 'tables.hdf5extension.File' objects}
            6    0.182    0.030    1.887    0.315 __init__.py:3(<module>)
           45    0.172    0.004    0.172    0.004 {method '_g_read_slice' of 'tables.hdf5extension.Array' objects}
        16079    0.140    0.000    0.231    0.000 file.py:382(register_node)
          202    0.138    0.001    0.138    0.001 {method '_read_index_slice' of 'tables.indexesextension.LastRowArray' objects}
          423    0.134    0.000    0.134    0.000 {pandas.tslib.array_to_datetime}
          538    0.134    0.000  381.098    0.708 table.py:1512(_where)
         4160    0.121    0.000    0.121    0.000 {numpy.core.multiarray.arange}
            6    0.118    0.020    2.483    0.414 api.py:1(<module>)
133968/129588    0.108    0.000    0.122    0.000 {len}
         8702    0.100    0.000    0.100    0.000 {tables.utilsextension.get_nested_field}
  67207/67041    0.098    0.000    3.917    0.000 {getattr}
  16043/15751    0.097    0.000    2.508    0.000 file.py:408(get_node)
         1121    0.095    0.000    0.095    0.000 {compile}
        16077    0.091    0.000    0.321    0.000 file.py:395(cache_node)
           26    0.085    0.003    0.085    0.003 {posix.listdir}
          230    0.080    0.000    0.232    0.001 doccer.py:12(docformat)
            1    0.078    0.078    0.822    0.822 __init__.py:20(<module>)
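For anyone who wants to produce a comparable trace, here is a rough sketch using the standard library's cProfile and pstats, sorted by internal time (tottime) like the listing above. The helper name profile_by_internal_time and its limit parameter are made up for illustration; this is one way to get such a report, not necessarily how the one above was generated:

```python
import cProfile
import io
import pstats


def profile_by_internal_time(func, *args, limit=40, **kwargs):
    """Run func under cProfile; return (result, report sorted by tottime)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    # Render the statistics into a string instead of printing to stdout.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("tottime").print_stats(limit)  # keep the top `limit` rows
    return result, buffer.getvalue()


# Hypothetical usage, with main() standing in for the script's entry point:
#   _, report = profile_by_internal_time(main)
#   print(report)
```

Sorting by tottime puts the functions that spend the most time in their own body first, which is how _g_get_objinfo ends up at the top of the listing.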
--
You received this message because you are subscribed to the Google Groups "pytables-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pytables-users+***@googlegroups.com.
To post to this group, send an email to pytables-***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.