Discussion:
efficiency of table row access by coordinates
Ken Walker
2018-10-04 21:07:23 UTC
Looking for guidance to optimize a process that modifies some elements of a
2D dataset (table).
This is about efficient order of operations, not code details.
Question: does the performance of read_coordinates()
and modify_coordinates() scale with the number of coordinate values? (search
time vs. data access time) Is there a point where the number of coordinate
values becomes inefficient?

Background:
I have to modify SOME columns/fields on SOME rows, so I will do
multiple slices on both rows and columns.
The extracted numpy dtypes look something like this (simplified, there are
more fields):
dtype([('ID', '<i8'), ('X', '<f8'), ('Y', '<f8'), ('Z', '<f8'), ('FREQ',
'<f8')])
'X','Y','Z' are the values to be modified (by a dot product w/ another
vector)

It's not clear (to me) which is better/faster:
1) Extract a small number of rows 1 ID value at a time (say 100 out of 10e7
rows), modify the data, update the rows, then repeat this 1000 times. (ID
can appear multiple times with different FREQ.)
This method is simpler b/c I can easily get the row coordinates w/
get_where_list(), extract the row data w/ read_coordinates(), modify the
array data, then update row data w/ modify_coordinates().
(I have coded this process as a working prototype.)
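A minimal sketch of Method 1 using a plain NumPy structured array as a stand-in for the PyTables table (the dataset, ID values, and transform matrix here are made-up placeholders; the comments map each step to the PyTables call it stands in for):

```python
import numpy as np

# Simplified dtype from the post; the data and transform are stand-ins.
dt = np.dtype([('ID', '<i8'), ('X', '<f8'), ('Y', '<f8'),
               ('Z', '<f8'), ('FREQ', '<f8')])
data = np.zeros(6, dtype=dt)
data['ID'] = [1, 1, 2, 2, 3, 3]          # IDs repeat with different FREQ
data['X'], data['Y'], data['Z'] = 1.0, 2.0, 3.0

transform = np.diag([2.0, 2.0, 2.0])     # placeholder for the real matrix/vector

# Method 1: process one ID at a time.
for id_val in (1, 2):                    # the subset of IDs to modify
    coords = np.nonzero(data['ID'] == id_val)[0]  # ~ tbl.get_where_list('ID == id_val')
    rows = data[coords]                           # ~ tbl.read_coordinates(coords)
    xyz = np.stack([rows['X'], rows['Y'], rows['Z']], axis=1)
    xyz = xyz @ transform                         # the dot-product modification
    rows['X'], rows['Y'], rows['Z'] = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    data[coords] = rows                           # ~ tbl.modify_coordinates(coords, rows)

print(data['X'][:4])  # rows belonging to IDs 1 and 2 were scaled
```

The per-ID loop keeps each read/modify/write small, at the cost of one search per ID.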

2) Extract a large number of rows (for 1000 IDs, say 100,000 out of 10e7
rows), modify the data and update the rows, but only do it once.
This method follows the same process, but first iterates thru
get_where_list() to collect all 100,000 coordinates for all 1000 ID values,
then does a single read_coordinates(), modifies the array, and writes it back
with a single modify_coordinates().
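Method 2 can be sketched the same way on the NumPy stand-in (same placeholder data and transform as above; only the order of operations changes): gather every coordinate first, then do one bulk read and one bulk write.

```python
import numpy as np

dt = np.dtype([('ID', '<i8'), ('X', '<f8'), ('Y', '<f8'),
               ('Z', '<f8'), ('FREQ', '<f8')])
data = np.zeros(6, dtype=dt)
data['ID'] = [1, 1, 2, 2, 3, 3]
data['X'], data['Y'], data['Z'] = 1.0, 2.0, 3.0

transform = np.diag([2.0, 2.0, 2.0])

# Method 2: collect coordinates for every ID first, then one read/modify pass.
ids_to_modify = (1, 2)
coords = np.concatenate([np.nonzero(data['ID'] == i)[0]   # ~ get_where_list() per ID
                         for i in ids_to_modify])
rows = data[coords]                                       # ~ read_coordinates(), once
xyz = np.stack([rows['X'], rows['Y'], rows['Z']], axis=1) @ transform
rows['X'], rows['Y'], rows['Z'] = xyz[:, 0], xyz[:, 1], xyz[:, 2]
data[coords] = rows                                       # ~ modify_coordinates(), once
```

Note the searches still happen per ID; only the read and write are batched into one call each.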

So...is there any performance benefit to one method over the other?
--
You received this message because you are subscribed to the Google Groups "pytables-users" group.
Ken Walker
2018-10-08 18:47:59 UTC
After writing that post, I realized I had already outlined the changes
required to measure which method is fastest.
I wrote a second version that extracts a large number of rows (Method 2
above).
(Frankly, it took longer to create a large prototype dataset for testing
than it did to rework the loop logic.)

I then ran both versions and compared timing data. What did I learn?
For both methods, the majority of time is spent on get_where_list().
In a 10 minute procedure, about 9:30 is spent on 3 get_where_list() calls.

Also, it's more efficient to call get_where_list() once to find all matching
rows than it is to call get_where_list() repeatedly, once per ID.