r***@gmail.com
2015-10-05 17:47:04 UTC
Hello,
I've recently started using pytables, and have a program that queries a
table in this simple manner:
def get_element(table, somevar):
    rows = table.where("colname == somevar")
    row = next(rows, None)
    if row:
        return elem_from_row(row)
To reduce the query time, I decided to try to sort the table with
table.copy(sortby='colname'). According to my profiler, this indeed
improved the query time (spent in where), but it increased the time spent
in the next() built-in function. And by increased I mean a lot (see figures
below). What could be the reason?
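As a rough cross-check of the profiler's split between where() and next(), the two calls can also be timed separately by hand. This is only a sketch: time_first_match and fake_where are hypothetical helper names, and fake_where is a plain-Python stand-in so that the snippet runs without an HDF5 file. With a real table, one would instead pass a callable that wraps the table.where(...) call.

```python
import time


def time_first_match(make_iter):
    """Time iterator creation and the first next() call separately.

    make_iter is a zero-argument callable that returns an iterator
    (hypothetical helper, not part of PyTables).
    """
    t0 = time.perf_counter()
    rows = make_iter()          # cost of building the query iterator
    t1 = time.perf_counter()
    row = next(rows, None)      # cost of fetching the first match
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1, row


def fake_where(values, target):
    # Stand-in for a table query: a lazy generator over a plain list.
    return (v for v in values if v == target)


values = list(range(500))
create_s, next_s, row = time_first_match(lambda: fake_where(values, 42))
print("create: %.6fs, first next(): %.6fs, row: %r" % (create_s, next_s, row))
```

Summing the two durations over all queries gives totals that can be compared with the per-function times reported by the profiler.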
This slowdown occurs only when there is another column in the table, and
the slowdown increases with the element size of that other column. To help
me understand the problem and make sure this was not related to something
else in my program, I made this minimum working example reproducing the
problem:
import tables
import time
import sys

def create_set(sort, withdata):
    #Table description with or without data
    tabledesc = {
        'id': tables.UIntCol()
    }
    if withdata:
        tabledesc['data'] = tables.Float32Col(2000)
    #Create table with CSI'ed id
    fp = tables.open_file('tmp.h5', mode='w')
    table = fp.create_table('/', 'myset', tabledesc)
    table.cols.id.create_csindex()
    #Fill the table with sorted ids
    row = table.row
    for i in xrange(500):
        row['id'] = i
        row.append()
    #Force a sort if asked for
    if sort:
        newtable = table.copy(newname='sortedset', sortby='id')
        table.remove()
        newtable.rename('myset')
    fp.flush()
    return fp

def get_element(table, i):
    #By construction, i always exists in the table
    rows = table.where('id == i')
    row = next(rows, None)
    if row:
        return {'id': row['id']}
    return None

sort = sys.argv[1] == 'sort'
withdata = sys.argv[2] == 'withdata'
fp = create_set(sort, withdata)
start_time = time.time()
table = fp.root.myset
for i in xrange(500):
    get_element(table, i)
print("Queried the set in %.3fs" % (time.time() - start_time))
fp.close()
And here is some console output on my laptop showing the figures:
$ ./timedset.py nosort nodata
Queried the set in 0.718s
$ ./timedset.py sort nodata
Queried the set in 0.003s
$ ./timedset.py nosort withdata
Queried the set in 0.597s
$ ./timedset.py sort withdata
Queried the set in 5.846s
Accessing an element from a table by some sort of id (sorted or not) sounds
like a basic feature, so I must be missing the typical way of doing it with
pytables. What is it? And why such a terrible slowdown after copying the
table in a sorted manner (even though the rows were already sorted from the
beginning)?
Thanks for your support,
Romain
--
You received this message because you are subscribed to the Google Groups "pytables-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pytables-users+***@googlegroups.com.
To post to this group, send an email to pytables-***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.