Discussion:
Performance of list of arrays vs one table?
'Hawk Anonymous' via pytables-users
2017-02-07 14:43:04 UTC
Hello,

I have a question about how to store data efficiently in PyTables.

At the moment, I have multiple variables, each stored in its own CArray.
I read one CArray back into a numpy array and then apply a cut (say, all
values above 5) to generate a Boolean mask, which I then apply to all variables.

Naively, I would have thought that this is the fastest approach to my
problem (assuming the data is stored contiguously in memory), but multiple
people told me that I should use a table instead.
Using a table would also make my life a lot easier, as I would only have to
apply the mask to the table, not to every array individually.
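
For illustration, here is a minimal sketch of the mask-across-arrays approach I mean (the file name and the variable names "energy" and "time" are made up, and the data is tiny just for the demo):

```python
import numpy as np
import tables

# Build a small example file with two equal-length CArrays.
with tables.open_file("mask_demo.hdf", "w") as hdf:
    hdf.create_carray(hdf.root, "energy", obj=np.array([1.0, 6.0, 3.0, 9.0]))
    hdf.create_carray(hdf.root, "time", obj=np.array([0.1, 0.2, 0.3, 0.4]))

with tables.open_file("mask_demo.hdf", "r") as hdf:
    # Cut on one variable, then apply the same mask to every variable.
    mask = hdf.root.energy[:] > 5
    selected = {name: getattr(hdf.root, name)[:][mask]
                for name in ("energy", "time")}

print(selected["energy"])  # -> [6. 9.]
print(selected["time"])    # -> [0.2 0.4]
```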

What I do not get is how a table is supposed to be faster. Tables are
row-major, aren't they?
But for example this here:
claims that they are indeed faster.
So how is this possible, if the row-major point is true in the first place?
Can I just use a table, or will I lose a ton of speed because of it?

--
You received this message because you are subscribed to the Google Groups "pytables-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pytables-users+***@googlegroups.com.
To post to this group, send an email to pytables-***@googlegroups.com.
'Hawk Anonymous' via pytables-users
2017-03-03 12:49:31 UTC
Hello,

as I did not get any answers up to now, I tried to investigate this myself
using a very simple example: two numpy arrays filled with random values.
Task: get all random numbers which are bigger in the first array than in
the second one.
I wrote this code to test it:

import tables
import numpy

class tdesc(tables.IsDescription):
    r0 = tables.Float64Col()
    r1 = tables.Float64Col()

r0 = numpy.random.rand(10**8)
r1 = numpy.random.rand(10**8)
hdf = tables.open_file("test.hdf", "w")
hdf.create_carray(hdf.root, "r0", obj=r0)
hdf.create_carray(hdf.root, "r1", obj=r1)
hdf.create_table(hdf.root, "rr", tdesc)
row = hdf.root.rr.row
for i in range(len(r0)):
    row["r0"] = r0[i]
    row["r1"] = r1[i]
    row.append()
# data prepared in hdf file
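
As an aside, the row-by-row loop is probably the slowest way to fill the table; a sketch of a bulk write via a structured array instead (file name made up, smaller size for a quick demo):

```python
import numpy as np
import tables

# Build all rows in memory as a structured array, then write them
# to the table in one call instead of appending row by row.
dtype = np.dtype([("r0", np.float64), ("r1", np.float64)])
data = np.empty(10**5, dtype=dtype)
data["r0"] = np.random.rand(len(data))
data["r1"] = np.random.rand(len(data))

with tables.open_file("bulk_demo.hdf", "w") as hdf:
    table = hdf.create_table(hdf.root, "rr", obj=data)  # bulk write
    table.flush()
    nrows = table.nrows

print(nrows)  # -> 100000
```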

#now, get all r0s which are bigger than r1

#read arrays, let numpy do the search
n = hdf.root.r0[:][hdf.root.r0[:] > hdf.root.r1[:]]
%timeit n = hdf.root.r0[:][hdf.root.r0[:] > hdf.root.r1[:]]
# ->1 loop, best of 3: 2.53 s per loop

#read arrays from table, use numpy for search
m = hdf.root.rr.col("r0")[hdf.root.rr.col("r0") > hdf.root.rr.col("r1")]
%timeit m = hdf.root.rr.col("r0")[hdf.root.rr.col("r0") > hdf.root.rr.col("r1")]
# -> 1 loop, best of 3: 3.65 s per loop

#use in-kernel search
o = [ x['r0'] for x in hdf.root.rr.where("""(r0 > r1)""") ]
%timeit o = [ x['r0'] for x in hdf.root.rr.where("""(r0 > r1)""") ]
# -> 1 loop, best of 3: 6.38 s per loop
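
The list comprehension iterates the matching rows in Python; a sketch of the same in-kernel query via read_where(), which returns all matches as one structured array (file name made up, small data for the demo):

```python
import numpy as np
import tables

dtype = np.dtype([("r0", np.float64), ("r1", np.float64)])
data = np.empty(10**5, dtype=dtype)
data["r0"] = np.random.rand(len(data))
data["r1"] = np.random.rand(len(data))

with tables.open_file("where_demo.hdf", "w") as hdf:
    table = hdf.create_table(hdf.root, "rr", obj=data)
    table.flush()
    # In-kernel query, result delivered as a single numpy array.
    hits = table.read_where("r0 > r1")["r0"]
    expected = int((data["r0"] > data["r1"]).sum())

print(len(hits) == expected)  # -> True
```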

print(len(n))
#50002016
print(len(m))
#49973491
print(len(o))
#49973491

As you can see, the arrays + numpy approach wins: roughly a factor of 1.4
over reading the table columns and about 2.5 over the in-kernel search.
What I do not understand is why n, m, and o are not the same.
I think n is correct, because it is very close to 10**8/2, which is the
result I would expect, but why does the table mess up my results?
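
For what it's worth, one thing I would double-check in a script like the one above is flushing: rows written through table.row go through an internal I/O buffer, and reads may miss rows that have not been flushed yet. A minimal sketch (file name made up):

```python
import numpy as np
import tables

# Rows written via table.row are buffered; flush() guarantees they
# are all written out before the table is read back.
with tables.open_file("flush_demo.hdf", "w") as hdf:
    table = hdf.create_table(hdf.root, "t",
                             np.dtype([("x", np.float64)]))
    row = table.row
    for v in np.random.rand(1000):
        row["x"] = v
        row.append()
    table.flush()  # without this, trailing rows may be missed
    nrows = table.nrows

print(nrows)  # -> 1000
```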