[pytables-users] PyTables: how to index on certain columns, not all; ValueError

Discussion:

Evan

2016-10-16 05:00:15 UTC

I appears I am indexing on all columns instead of one or two---I believe
the latter would be far more optimized for searches.

With the following code, I am trying to index only on columns COL1 and
COL2: (based off the example in the
introduction: http://www.pytables.org/usersguide/introduction.html)

class MyTable(IsDescription):
COL1 = Int16Col()
COL2 = Int16Col()
COL3= StringCol(64)
COL4= StringCol(64)
COL5= StringCol(64)
COL6= StringCol(64)
COL7 = Int32Col()

# Open a file in write mode

h5file = open_file("file1.h5", mode = "w")

my_key = "key"

# Create group

group = h5file.create_group("/", "my_table")

table = h5file.create_table(group, my_key, MyTable, "table of values")

row = table.row

# user decides which indices to create

field1 = "COL1" # create index on column 1, COL1

field2 = "COL2" # create index on column 2, COL2

# import dictionary 'dictionary1"
for dict in dictionary1:
row["COL1"] = dict["COL1"]
row["COL2"] = dict["COL2"]
row["COL3"] = dict["COL3"]
row["COL4"] = dict["COL4"]
row["COL5"] = dict["COL5"]
row["COL6"] = dict["COL6"]
row["COL7"] = dict["COL7"]

# This injects the Record values

table.cols.field1.create_index()
table.cols.field2.create_index()
row.append()

# Flush the table buffers
table.flush()

"ValueError: Index(6, medium, shuffle, zlib(1)).is_csi=False for
column 'COL1' already exists. If you want to re-create it, please, try
with reindex() method better"

Where above am I indexing on all columns? Surely I would have most faster
queries if I only indexed on one/two columns, and queried those, right?

--
You received this message because you are subscribed to the Google Groups "pytables-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pytables-users+***@googlegroups.com.
To post to this group, send an email to pytables-***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Francesc Alted

2016-10-22 15:34:56 UTC

Permalink

Post by Evan
I appears I am indexing on all columns instead of one or two---I believe
the latter would be far more optimized for searches.
With the following code, I am trying to index only on columns COL1 and
COL2: (based off the example in the introduction: http://www.
pytables.org/usersguide/introduction.html)
COL1 = Int16Col()
COL2 = Int16Col()
COL3= StringCol(64)
COL4= StringCol(64)
COL5= StringCol(64)
COL6= StringCol(64)
COL7 = Int32Col()
# Open a file in write mode
h5file = open_file("file1.h5", mode = "w")
my_key = "key"
# Create group
group = h5file.create_group("/", "my_table")
table = h5file.create_table(group, my_key, MyTable, "table of values")
row = table.row
# user decides which indices to create
field1 = "COL1" # create index on column 1, COL1
field2 = "COL2" # create index on column 2, COL2
# import dictionary 'dictionary1"
row["COL1"] = dict["COL1"]
row["COL2"] = dict["COL2"]
row["COL3"] = dict["COL3"]
row["COL4"] = dict["COL4"]
row["COL5"] = dict["COL5"]
row["COL6"] = dict["COL6"]
row["COL7"] = dict["COL7"]
# This injects the Record values
table.cols.field1.create_index()

You are creating an index inside the loop, but the index only needs to be
created *once*, so you need to do that once the loop is finished (typically
after the flush(), as we discussed already).

Post by Evan
table.cols.field2.create_index()

Also, if the column name is called 'COL1' or 'COL2', the way to access the
columns is:

table.cols.COL1.create_index()

Post by Evan
row.append()
# Flush the table buffers
table.flush()

"ValueError: Index(6, medium, shuffle, zlib(1)).is_csi=False for
column 'COL1' already exists. If you want to re-create it, please, try
with reindex() method better"

Where above am I indexing on all columns? Surely I would have most faster
queries if I only indexed on one/two columns, and queried those, right?

Hope this helps

--
Francesc Alted
--
You received this message because you are subscribed to the Google Groups "pytables-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pytables-users+***@googlegroups.com.
To post to this group, send an email to pytables-***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Evan

2016-10-23 11:54:25 UTC

Permalink

Post by Francesc Alted

Post by Evan
I appears I am indexing on all columns instead of one or two---I believe
the latter would be far more optimized for searches.
With the following code, I am trying to index only on columns COL1 and
http://www.pytables.org/usersguide/introduction.html)
COL1 = Int16Col()
COL2 = Int16Col()
COL3= StringCol(64)
COL4= StringCol(64)
COL5= StringCol(64)
COL6= StringCol(64)
COL7 = Int32Col()
# Open a file in write mode
h5file = open_file("file1.h5", mode = "w")
my_key = "key"
# Create group
group = h5file.create_group("/", "my_table")
table = h5file.create_table(group, my_key, MyTable, "table of values")
row = table.row
# user decides which indices to create
field1 = "COL1" # create index on column 1, COL1
field2 = "COL2" # create index on column 2, COL2
# import dictionary 'dictionary1"
row["COL1"] = dict["COL1"]
row["COL2"] = dict["COL2"]
row["COL3"] = dict["COL3"]
row["COL4"] = dict["COL4"]
row["COL5"] = dict["COL5"]
row["COL6"] = dict["COL6"]
row["COL7"] = dict["COL7"]
# This injects the Record values
table.cols.field1.create_index()

You are creating an index inside the loop, but the index only needs to be
created *once*, so you need to do that once the loop is finished (typically
after the flush(), as we discussed already).

Thanks for this. I understand now the optimal way to create keys is after
the loop is finished with `flush()`---it felt strange to create keys after
~TB of data had been written to HDF5.

Post by Francesc Alted

Post by Evan
table.cols.field2.create_index()

Also, if the column name is called 'COL1' or 'COL2', the way to access the
table.cols.COL1.create_index()

Let's say users were to input the column names via argparse, eg.
`---column_names ="COL1","COL2"`. Argparse would take this user input and
define them as variables, e.g. `variable1 = "COL1"`, `variable2 = "COL2"`

Would there be a difference in terms of the PyTables API if the source code
read

`table.cols.variable1.create_index()`
`table.cols.variable.2.create_index()` ?

Post by Francesc Alted

Post by Evan
row.append()
# Flush the table buffers
table.flush()

"ValueError: Index(6, medium, shuffle, zlib(1)).is_csi=False for
column 'COL1' already exists. If you want to re-create it, please, try
with reindex() method better"

Where above am I indexing on all columns? Surely I would have most faster
queries if I only indexed on one/two columns, and queried those, right?

Hope this helps
--
Francesc Alted