[pytables-users] Confused how PyTables indexing compares to pandas HDFStore (PyTables) indexing

Discussion:

Evan

2016-10-12 04:11:22 UTC

The PyTables library and the HDFStore object (based on PyTables) both
provide indexing for the user.

For the latter case, users instantiate an HDFStore object and then chose
which columns to index.

store = HDFStore('file1.hd5')
key = "key_name"
index_columns = ["column1", "column2"]
store.append(key,... data_columns=index_columns)

Here we index on two columns, which should optimize our search.

For PyTables alone, we create an HDF5 file as follows (from the
documentation):

from tables import *
class Particle(IsDescription):
identity = StringCol(itemsize=22, dflt=" ", pos=0) # character String
idnumber = Int16Col(dflt=1, pos = 1) # short integer
speed = Float32Col(dflt=1, pos = 2) # single-precision
# Open a file in "w"rite modefileh = open_file("objecttree.h5", mode = "w")
# Get the HDF5 root grouproot = fileh.root
# Create the groupsgroup1 = fileh.create_group(root, "group1")group2 = fileh.create_group(root, "group2")
# Now, create an array in root grouparray1 = fileh.create_array(root, "array1", ["string", "array"], "String array")
# Create 1 new tables in group1table1 = fileh.create_table(group1, "table1", Particle)
# Get the record object associated with the table:row = table1.row
# Fill the table with 10 recordsfor i in xrange(10):
# First, assign the values to the Particle record
row['identity'] = 'This is particle: %2d' % (i)
row['idnumber'] = i
row['speed'] = i * 2.

# This injects the Record values
row.append()
# Flush the table bufferstable.flush()
# Finally, close the file (this also will flush all the remaining buffers!)fileh.close()

Users index columns by using "Column.create_index()"

For example:

indexrows = table.cols.var1.create_index()
indexrows = table.cols.var2.create_index()
indexrows = table.cols.var3.create_index()

Two questions:

(1) I'm afraid in our PyTables example above, it is not clear to me how to
set the indices (indexes). There are no columns defined. To my mind, there
are three fields: identity, idnumber, speed. Let's say I wanted to place an
index on speed and identity. How would one do this?

(2) Are there any benchmarks between the the pandas based indexing and
PyTables based indexing? Is one faster than the other? Does one take up
more disk space (i.e. larger HDF5 file) than the other?

Thank you for any help! Apologies for so many questions recently

Thanks, Evan

--
You received this message because you are subscribed to the Google Groups "pytables-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pytables-users+***@googlegroups.com.
To post to this group, send an email to pytables-***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Francesc Alted

2016-10-12 07:33:28 UTC

Permalink

Post by Evan
The PyTables library and the HDFStore object (based on PyTables) both
provide indexing for the user.
For the latter case, users instantiate an HDFStore object and then chose
which columns to index.
store = HDFStore('file1.hd5')
key = "key_name"
index_columns = ["column1", "column2"]
store.append(key,... data_columns=index_columns)
Here we index on two columns, which should optimize our search.
For PyTables alone, we create an HDF5 file as follows (from the
from tables import *
identity = StringCol(itemsize=22, dflt=" ", pos=0) # character String
idnumber = Int16Col(dflt=1, pos = 1) # short integer
speed = Float32Col(dflt=1, pos = 2) # single-precision
# Open a file in "w"rite modefileh = open_file("objecttree.h5", mode = "w")
# Get the HDF5 root grouproot = fileh.root
# Create the groupsgroup1 = fileh.create_group(root, "group1")group2 = fileh.create_group(root, "group2")
# Now, create an array in root grouparray1 = fileh.create_array(root, "array1", ["string", "array"], "String array")
# Create 1 new tables in group1table1 = fileh.create_table(group1, "table1", Particle)
# Get the record object associated with the table:row = table1.row
# First, assign the values to the Particle record
row['identity'] = 'This is particle: %2d' % (i)
row['idnumber'] = i
row['speed'] = i * 2.
# This injects the Record values
row.append()
# Flush the table bufferstable.flush()
# Finally, close the file (this also will flush all the remaining buffers!)fileh.close()
Users index columns by using "Column.create_index()"
indexrows = table.cols.var1.create_index()
indexrows = table.cols.var2.create_index()
indexrows = table.cols.var3.create_index()
(1) I'm afraid in our PyTables example above, it is not clear to me how to
set the indices (indexes). There are no columns defined. To my mind, there
are three fields: identity, idnumber, speed. Let's say I wanted to place an
index on speed and identity. How would one do this?

indexedrows = table.cols.identity.create_index()

Post by Evan
(2) Are there any benchmarks between the the pandas based indexing and
PyTables based indexing? Is one faster than the other? Does one take up
more disk space (i.e. larger HDF5 file) than the other?

pandas just uses PyTables for doing the indexing, so they are going to
behave very close.

Post by Evan
Thank you for any help! Apologies for so many questions recently

No problem!

Post by Evan
Thanks, Evan
--
You received this message because you are subscribed to the Google Groups
"pytables-users" group.
To unsubscribe from this group and stop receiving emails from it, send an
For more options, visit https://groups.google.com/d/optout.

--
Francesc Alted
--
You received this message because you are subscribed to the Google Groups "pytables-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pytables-users+***@googlegroups.com.
To post to this group, send an email to pytables-***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Francesc Alted

2016-10-12 07:37:59 UTC

Permalink

Also, you might be interested in a comparison between PyTables and other
data container (including relational databases):

https://github.com/FrancescAlted/DataContainersTutorials/blob/master/4-On-Disk-Tables.ipynb

This is for a relatively small database (1 million entries), but it could
give you an idea on what to expect.

Post by Francesc Alted

Post by Evan
The PyTables library and the HDFStore object (based on PyTables) both
provide indexing for the user.
For the latter case, users instantiate an HDFStore object and then chose
which columns to index.
store = HDFStore('file1.hd5')
key = "key_name"
index_columns = ["column1", "column2"]
store.append(key,... data_columns=index_columns)
Here we index on two columns, which should optimize our search.
For PyTables alone, we create an HDF5 file as follows (from the
from tables import *
identity = StringCol(itemsize=22, dflt=" ", pos=0) # character String
idnumber = Int16Col(dflt=1, pos = 1) # short integer
speed = Float32Col(dflt=1, pos = 2) # single-precision
# Open a file in "w"rite modefileh = open_file("objecttree.h5", mode = "w")
# Get the HDF5 root grouproot = fileh.root
# Create the groupsgroup1 = fileh.create_group(root, "group1")group2 = fileh.create_group(root, "group2")
# Now, create an array in root grouparray1 = fileh.create_array(root, "array1", ["string", "array"], "String array")
# Create 1 new tables in group1table1 = fileh.create_table(group1, "table1", Particle)
# Get the record object associated with the table:row = table1.row
# First, assign the values to the Particle record
row['identity'] = 'This is particle: %2d' % (i)
row['idnumber'] = i
row['speed'] = i * 2.
# This injects the Record values
row.append()
# Flush the table bufferstable.flush()
# Finally, close the file (this also will flush all the remaining buffers!)fileh.close()
Users index columns by using "Column.create_index()"
indexrows = table.cols.var1.create_index()
indexrows = table.cols.var2.create_index()
indexrows = table.cols.var3.create_index()
(1) I'm afraid in our PyTables example above, it is not clear to me how
to set the indices (indexes). There are no columns defined. To my mind,
there are three fields: identity, idnumber, speed. Let's say I wanted to
place an index on speed and identity. How would one do this?

indexedrows = table.cols.identity.create_index()

pandas just uses PyTables for doing the indexing, so they are going to
behave very close.

Post by Evan
Thank you for any help! Apologies for so many questions recently

No problem!

--
Francesc Alted

Evan

2016-10-12 14:12:36 UTC

Permalink

Thank you for the prompt response.

So just to be clear, users should index the columns using

indexedrows = table.cols.identity.create_index()

before filling the tables with records? That is, users how instantiate the
`tables` and `rows` object, then use

indexedrows = table.cols.identity.create_index(), e.g. with the example
below

...
...

# Create 1 new tables in group1
table1 = fileh.create_table(group1, "table1", Particle)
# Get the record object associated with the table:
row = table1.row

# now index columns
indexedrows = table.cols1.identity.create_index()
indexedrows = table.cols2.identity.create_index()

Does the order when columns are indexed matter?

Question 2:

So, `indexedrows` is not a traditional python variable?

If I were to index on threes columns 'col1', 'col2', 'col3', I would use

indexedrows = table.cols1.identity.create_index()
indexedrows = table.cols2.identity.create_index()
indexedrows = table.cols3.identity.create_index()

`indexedrows` must automatically execute these commands. For a traditional
"variable", the only thing saved is
`table.cols3.identity.create_index()`---the other two have been written
over.

Thanks for the help!

Post by Francesc Alted

Post by Evan
The PyTables library and the HDFStore object (based on PyTables) both
provide indexing for the user.
For the latter case, users instantiate an HDFStore object and then chose
which columns to index.
store = HDFStore('file1.hd5')
key = "key_name"
index_columns = ["column1", "column2"]
store.append(key,... data_columns=index_columns)
Here we index on two columns, which should optimize our search.
For PyTables alone, we create an HDF5 file as follows (from the
from tables import *
identity = StringCol(itemsize=22, dflt=" ", pos=0) # character String
idnumber = Int16Col(dflt=1, pos = 1) # short integer
speed = Float32Col(dflt=1, pos = 2) # single-precision
# Open a file in "w"rite modefileh = open_file("objecttree.h5", mode = "w")
# Get the HDF5 root grouproot = fileh.root
# Create the groupsgroup1 = fileh.create_group(root, "group1")group2 = fileh.create_group(root, "group2")
# Now, create an array in root grouparray1 = fileh.create_array(root, "array1", ["string", "array"], "String array")
# Create 1 new tables in group1table1 = fileh.create_table(group1, "table1", Particle)
# Get the record object associated with the table:row = table1.row
# First, assign the values to the Particle record
row['identity'] = 'This is particle: %2d' % (i)
row['idnumber'] = i
row['speed'] = i * 2.
# This injects the Record values
row.append()
# Flush the table bufferstable.flush()
# Finally, close the file (this also will flush all the remaining buffers!)fileh.close()
Users index columns by using "Column.create_index()"
indexrows = table.cols.var1.create_index()
indexrows = table.cols.var2.create_index()
indexrows = table.cols.var3.create_index()
(1) I'm afraid in our PyTables example above, it is not clear to me how
to set the indices (indexes). There are no columns defined. To my mind,
there are three fields: identity, idnumber, speed. Let's say I wanted to
place an index on speed and identity. How would one do this?

indexedrows = table.cols.identity.create_index()

pandas just uses PyTables for doing the indexing, so they are going to
behave very close.

Post by Evan
Thank you for any help! Apologies for so many questions recently

No problem!

Post by Evan
Thanks, Evan
--
You received this message because you are subscribed to the Google Groups
"pytables-users" group.
To unsubscribe from this group and stop receiving emails from it, send an
<javascript:>.
For more options, visit https://groups.google.com/d/optout.

--
Francesc Alted

Francesc Alted

2016-10-12 14:40:04 UTC

Permalink

Post by Evan
Thank you for the prompt response.
So just to be clear, users should index the columns using
indexedrows = table.cols.identity.create_index()
before filling the tables with records?

Its use is the same than relational databases, you can index before or
after, although the time to create the index is faster if you use it
after. Also, there are some cases where indexing after allows better index
optimization. So yes, creating the index after is recommended.

Post by Evan
That is, users how instantiate the `tables` and `rows` object, then use
indexedrows = table.cols.identity.create_index(), e.g. with the example
below
...
...
# Create 1 new tables in group1
table1 = fileh.create_table(group1, "table1", Particle)
row = table1.row
# now index columns
indexedrows = table.cols1.identity.create_index()
indexedrows = table.cols2.identity.create_index()
Does the order when columns are indexed matter?

Nope. The columns are independent, the same way than the indexes.

Post by Evan
So, `indexedrows` is not a traditional python variable?
If I were to index on threes columns 'col1', 'col2', 'col3', I would use
indexedrows = table.cols1.identity.create_index()
indexedrows = table.cols2.identity.create_index()
indexedrows = table.cols3.identity.create_index()
`indexedrows` must automatically execute these commands. For a traditional
"variable", the only thing saved is `table.cols3.identity.create_index()`---the
other two have been written over.

I don't totally understand the question. IIRC indexedrows is the number of
rows indexed, but this is indeed not documented:

http://www.pytables.org/usersguide/libref/structured_storage.html?highlight=create_index#tables.Column.create_index

In general, you should not care about this, because the number of rows
indexed will be the same than the number of rows in the table. I suppose
you have got this usage from
http://www.pytables.org/usersguide/optimization.html?highlight=create_index#indexed-searches,
and it would be nice if these lines would be replaced by just:

table.cols.var1.create_index()table.cols.var2.create_index()table.cols.var3.create_index()

Hope this helps,

Francesc

Post by Evan

Post by Evan
The PyTables library and the HDFStore object (based on PyTables) both
provide indexing for the user.
For the latter case, users instantiate an HDFStore object and then chose
which columns to index.
store = HDFStore('file1.hd5')
key = "key_name"
index_columns = ["column1", "column2"]
store.append(key,... data_columns=index_columns)
Here we index on two columns, which should optimize our search.
For PyTables alone, we create an HDF5 file as follows (from the
from tables import *
identity = StringCol(itemsize=22, dflt=" ", pos=0) # character String
idnumber = Int16Col(dflt=1, pos = 1) # short integer
speed = Float32Col(dflt=1, pos = 2) # single-precision
# Open a file in "w"rite modefileh = open_file("objecttree.h5", mode = "w")
# Get the HDF5 root grouproot = fileh.root
# Create the groupsgroup1 = fileh.create_group(root, "group1")group2 = fileh.create_group(root, "group2")
# Now, create an array in root grouparray1 = fileh.create_array(root, "array1", ["string", "array"], "String array")
# Create 1 new tables in group1table1 = fileh.create_table(group1, "table1", Particle)
# Get the record object associated with the table:row = table1.row
# First, assign the values to the Particle record
row['identity'] = 'This is particle: %2d' % (i)
row['idnumber'] = i
row['speed'] = i * 2.
# This injects the Record values
row.append()
# Flush the table bufferstable.flush()
# Finally, close the file (this also will flush all the remaining buffers!)fileh.close()
Users index columns by using "Column.create_index()"
indexrows = table.cols.var1.create_index()
indexrows = table.cols.var2.create_index()
indexrows = table.cols.var3.create_index()
(1) I'm afraid in our PyTables example above, it is not clear to me how
to set the indices (indexes). There are no columns defined. To my mind,
there are three fields: identity, idnumber, speed. Let's say I wanted to
place an index on speed and identity. How would one do this?

indexedrows = table.cols.identity.create_index()

pandas just uses PyTables for doing the indexing, so they are going to
behave very close.

Post by Evan
Thank you for any help! Apologies for so many questions recently

No problem!

--
Francesc Alted

Post by Evan
You received this message because you are subscribed to the Google Groups
"pytables-users" group.
To unsubscribe from this group and stop receiving emails from it, send an
For more options, visit https://groups.google.com/d/optout.

Evan

2016-10-12 22:38:13 UTC

Permalink

Let me provide a concrete example why I'm asking:

""""

class MyTable(IsDescription):

COL1 = Int16Col()

COL2 = Int16Col()

COL3= StringCol(64)

COL4= StringCol(64)

COL5= StringCol(64)

COL6= StringCol(64)

COL7 = Int32Col()

# Open a file in write mode

h5file = open_file("file1.h5", mode = "w")

my_key = "key"

# Create group

group = h5file.create_group("/", "my_table")

table = h5file.create_table(group, my_key, MyTable, "table of values")

row = table.row

# user decides which indices to create

field1 = "COL1" # create index on column 1, COL1

field2 = "COL2" # create index on column 2, COL2

# import dictionary 'dictionary1"
for dict in dictionary1:

row["COL1"] = dict["COL1"]

row["COL2"] = dict["COL2"]

row["COL3"] = dict["COL3"]

row["COL4"] = dict["COL4"]

row["COL5"] = dict["COL5"]

row["COL6"] = dict["COL6"]

row["COL7"] = dict["COL7"]

# This injects the Record values

table.cols.field1.create_index()

table.cols.field2.create_index()

row.append()

# Flush the table buffers

table.flush()
"""

Now, when I run this, I get the following error:

"ValueError: Index(6, medium, shuffle, zlib(1)).is_csi=False for column
'COL1' already exists. If you want to re-create it, please, try with
reindex() method better"

Where above am I indexing on all columns? Surely I would have most faster
queries if I only indexed on one/two columns, and queried those, right?

So...I'm making a mistake somewhere.

Thanks for the help, we'll soon have this figured out, Evan

Post by Francesc Alted

Post by Evan
Thank you for the prompt response.
So just to be clear, users should index the columns using
indexedrows = table.cols.identity.create_index()
before filling the tables with records?

Nope. The columns are independent, the same way than the indexes.

Post by Evan
So, `indexedrows` is not a traditional python variable?
If I were to index on threes columns 'col1', 'col2', 'col3', I would use
indexedrows = table.cols1.identity.create_index()
indexedrows = table.cols2.identity.create_index()
indexedrows = table.cols3.identity.create_index()
`indexedrows` must automatically execute these commands. For a
traditional "variable", the only thing saved is
`table.cols3.identity.create_index()`---the other two have been written
over.

I don't totally understand the question. IIRC indexedrows is the number
http://www.pytables.org/usersguide/libref/structured_storage.html?highlight=create_index#tables.Column.create_index
In general, you should not care about this, because the number of rows
indexed will be the same than the number of rows in the table. I suppose
you have got this usage from
http://www.pytables.org/usersguide/optimization.html?highlight=create_index#indexed-searches,
table.cols.var1.create_index()table.cols.var2.create_index()table.cols.var3.create_index()
Hope this helps,
Francesc

Post by Evan

Post by Evan
The PyTables library and the HDFStore object (based on PyTables) both
provide indexing for the user.
For the latter case, users instantiate an HDFStore object and then chose
which columns to index.
store = HDFStore('file1.hd5')
key = "key_name"
index_columns = ["column1", "column2"]
store.append(key,... data_columns=index_columns)
Here we index on two columns, which should optimize our search.
For PyTables alone, we create an HDF5 file as follows (from the
from tables import *
identity = StringCol(itemsize=22, dflt=" ", pos=0) # character String
idnumber = Int16Col(dflt=1, pos = 1) # short integer
speed = Float32Col(dflt=1, pos = 2) # single-precision
# Open a file in "w"rite modefileh = open_file("objecttree.h5", mode = "w")
# Get the HDF5 root grouproot = fileh.root
# Create the groupsgroup1 = fileh.create_group(root, "group1")group2 = fileh.create_group(root, "group2")
# Now, create an array in root grouparray1 = fileh.create_array(root, "array1", ["string", "array"], "String array")
# Create 1 new tables in group1table1 = fileh.create_table(group1, "table1", Particle)
# Get the record object associated with the table:row = table1.row
# First, assign the values to the Particle record
row['identity'] = 'This is particle: %2d' % (i)
row['idnumber'] = i
row['speed'] = i * 2.
# This injects the Record values
row.append()
# Flush the table bufferstable.flush()
# Finally, close the file (this also will flush all the remaining buffers!)fileh.close()
Users index columns by using "Column.create_index()"
indexrows = table.cols.var1.create_index()
indexrows = table.cols.var2.create_index()
indexrows = table.cols.var3.create_index()
(1) I'm afraid in our PyTables example above, it is not clear to me how
to set the indices (indexes). There are no columns defined. To my mind,
there are three fields: identity, idnumber, speed. Let's say I wanted to
place an index on speed and identity. How would one do this?

indexedrows = table.cols.identity.create_index()

pandas just uses PyTables for doing the indexing, so they are going to
behave very close.

Post by Evan
Thank you for any help! Apologies for so many questions recently

No problem!

Post by Evan
Thanks, Evan
--
You received this message because you are subscribed to the Google
Groups "pytables-users" group.
To unsubscribe from this group and stop receiving emails from it, send
For more options, visit https://groups.google.com/d/optout.

--
Francesc Alted

Post by Evan
You received this message because you are subscribed to the Google Groups
"pytables-users" group.
To unsubscribe from this group and stop receiving emails from it, send an
<javascript:>.
For more options, visit https://groups.google.com/d/optout.

--
Francesc Alted

Evan

2016-10-12 17:28:58 UTC

Permalink

Thanks for the response.

One further clarification based on the recent reply:

""""
In general, you should not care about this, because the number of rows
indexed will be the same than the number of rows in the table. I suppose
you have got this usage from
http://www.pytables.org/usersguide/optimization.html?highlight=create_index#indexed-searches,
and it would be nice if these lines would be replaced by just:

table.cols.var1.create_index()
table.cols.var2.create_index()
table.cols.var3.create_index()

""""

So, in the example shown previously from the documentation with fields
"identity", "idnumber", and "speed", the correct way to index on (as an
example) both "idnumber" and "speed" would be for users to execute

table.cols.idnumber.create_index()
table.cols.speed.create_index()

After `row.append()` and before `table.flush()` for optimal use and
performance of indexes? I'm still slightly unclear.

Thank you for the help. I appreciate it, Evan