[pytables-users] Why is iterating over a single row so slow after sorting?

Discussion:

r***@gmail.com

2015-10-05 17:47:04 UTC

Hello,

I've recently started using pytables, and have a program that queries a
table in this simple manner:

    def get_element(table, somevar):
        rows = table.where("colname == somevar")
        row = next(rows, None)
        if row:
            return elem_from_row(row)

To reduce the query time, I decided to try to sort the table with
table.copy(sortby='colname'). According to my profiler, this indeed
improved the query time (spent in where), but it increased the time spent
in the next() built-in function. And by increased I mean a lot (see figures
below). What could be the reason?

This slowdown occurs only when there is another column in the table, and
the slowdown increases with the element size of that other column. To help
me understand the problem and make sure this was not related to something
else in my program, I made this minimum working example reproducing the
problem:

    import sys
    import time

    import tables


    def create_set(sort, withdata):
        # Table description with or without data
        tabledesc = {
            'id': tables.UIntCol()
        }
        if withdata:
            tabledesc['data'] = tables.Float32Col(2000)

        # Create table with CSI'ed id
        fp = tables.open_file('tmp.h5', mode='w')
        table = fp.create_table('/', 'myset', tabledesc)
        table.cols.id.create_csindex()

        # Fill the table with sorted ids
        row = table.row
        for i in range(500):
            row['id'] = i
            row.append()

        # Force a sort if asked for
        if sort:
            newtable = table.copy(newname='sortedset', sortby='id')
            table.remove()
            newtable.rename('myset')
        fp.flush()
        return fp


    def get_element(table, i):
        # By construction, i always exists in the table
        rows = table.where('id == i')
        row = next(rows, None)
        if row:
            return {'id': row['id']}
        return None


    sort = sys.argv[1] == 'sort'
    withdata = sys.argv[2] == 'withdata'
    fp = create_set(sort, withdata)

    start_time = time.time()
    table = fp.root.myset
    for i in range(500):
        get_element(table, i)
    print("Queried the set in %.3fs" % (time.time() - start_time))
    fp.close()

And here is some console output on my laptop showing the figures:

    $ ./timedset.py nosort nodata
    Queried the set in 0.718s

    $ ./timedset.py sort nodata
    Queried the set in 0.003s

    $ ./timedset.py nosort withdata
    Queried the set in 0.597s

    $ ./timedset.py sort withdata
    Queried the set in 5.846s

Accessing an element from a table by some sort of id (sorted or not) sounds
like a basic feature, so I must be missing the typical way of doing it with
pytables. What is it? And why is there such a terrible slowdown after copying
the table in a sorted manner (even though the rows were already sorted from
the beginning)?

Thanks for your support,
Romain


r***@gmail.com

2015-10-06 17:45:42 UTC

I found the problem and posted an analysis of it on SO:
http://stackoverflow.com/questions/32921916/why-is-querying-a-table-so-much-slower-after-sorting-it
PyTables developers, feel free to have a look and check that it is right!

As written there, there is something I don't really understand though.
From what I could read in the PyTables code, it seems that when a query
could use indexing, but the index indicates that no result matches the
query, the Row iterator falls back to an "in-kernel" search. This causes the
entire table to be read to find nothing, even though it was already known
that there was no result.
Am I right? Why this behaviour?

Thanks for your time.
Romain


Andrea Bedini

2015-10-06 23:36:25 UTC

Hi Romain,

fantastic analysis.

Falling back on a linear scan makes sense if the index is partial but not for a complete index. Perhaps there’s a bug there?

Andrea

--
Andrea Bedini
@andreabedini, http://www.andreabedini.com

See the impact of my research at https://impactstory.org/AndreaBedini
use https://keybase.io/andreabedini to send me encrypted messages
Key fingerprint = 17D5 FB49 FA18 A068 CF53 C5C2 9503 64C1 B2D5 9591

r***@gmail.com

2015-10-07 08:27:37 UTC

Thanks Andrea for the reply,

Ah OK, yes, good point (I didn't know about the existence of partial
indexes).
Nevertheless, I am using a full index, so the fallback should not happen,
and indeed, after a closer inspection of the code, it turns out it is (was) a
bug.

In table._where_indexed(), a None chunkmap was returned when no result
could be found, which made table._where() fall back to an in-kernel search.
This bug is present in PyTables 3.1.1, the version I am using, but the good
news is that it has since been fixed by this commit (I think):
https://github.com/PyTables/PyTables/commit/3be75ef3a5598e3c99cc351d04892884d3289e64
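To illustrate the mechanism described above, here is a minimal toy model in plain Python. It is only a sketch: the names lookup_indexed and query, the dict-based index, and the buggy flag are all hypothetical and do not mirror the actual PyTables internals. It only shows how reporting "no match" as None, the same value used for "index not usable", can turn an indexed miss into a full scan.

```python
# Toy model only: names are hypothetical and do not mirror PyTables internals.

def lookup_indexed(index, value, buggy):
    """Return matching row numbers from a complete index.

    With buggy=True, "no match" is reported as None, which the caller
    cannot tell apart from "the index could not be used".
    """
    hits = index.get(value, [])
    if not hits and buggy:
        return None
    return hits


def query(rows, index, value, buggy):
    """Return (matches, rows_read) for an equality query on the 'id' field."""
    hits = lookup_indexed(index, value, buggy)
    if hits is None:
        # Fallback: linear scan that reads every row.
        matches = [r for r in rows if r["id"] == value]
        return matches, len(rows)
    # Indexed path: only the rows named by the index are read.
    return [rows[n] for n in hits], len(hits)


rows = [{"id": i} for i in range(500)]
index = {r["id"]: [n] for n, r in enumerate(rows)}

print(query(rows, index, 42, buggy=True)[1])     # rows read for a hit
print(query(rows, index, 9999, buggy=True)[1])   # rows read for a miss (buggy)
print(query(rows, index, 9999, buggy=False)[1])  # rows read for a miss (fixed)
```

In this model a hit reads one row, a miss with the None return reads all 500 rows, and a miss with an empty result reads none, which is the distinction the fix relies on.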

Now would not be a good time for me to migrate to 3.2, but fortunately
queries with no result should not happen in my current use cases.
I'll just have to keep that in mind.

End of the story I guess :)

Thanks
Romain

