Discussion:
[pytables-users] Is it possible to upload a +TB csv dataframe into HDF5 via PyTables? Use only one key?
EB
2016-10-08 00:25:39 UTC
I have a very large amount of data in csv (data.table) format. I would like
to put this into HDF5 format and query it. However, I am not sure how to do
this with only one key.

My approach so far has been:

import csv
import pandas as pd

store = pd.HDFStore("pathname/file.h5")

key1 = "key"

columns1 = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
reader = csv.reader(csvfile)  # csvfile: an already-open file object
for i, line in enumerate(reader):
    values = line  # csv.reader already yields a list of fields
    # additional parsing on 'values'
    dictionary1 = dict(zip(columns1, values))
    df = pd.DataFrame(dictionary1, index=[i])  # save as pandas dataframe
    store.append(key1, df, data_columns=columns1, index=False)

columns_to_index = ["COL1", "COL2"]  # only the first two

store.create_table_index(key1, columns=columns_to_index, optlevel=9, kind='full')

store.close()

However, I get various errors doing this. For instance: `Attribute
'superblocksize' does not exist in node: '/hdf5_key/_i_table/index'`.
This seems like a very easy thing to do, but I'm stuck. Am I mistaken? Is
this possible to do with PyTables?
--
You received this message because you are subscribed to the Google Groups "pytables-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pytables-users+***@googlegroups.com.
To post to this group, send an email to pytables-***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
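The loop in the post above calls store.append once per CSV line. A minimal sketch of one alternative, assuming the rows can be grouped into batches before each append, is shown below. The names batched_rows, csv_to_hdf, batch_size, csv_path and h5_path are hypothetical and not from the thread, and nothing here is claimed to explain or fix the 'superblocksize' error.

```python
import csv


def batched_rows(reader, n_columns, batch_size):
    """Yield lists of at most `batch_size` rows from a csv.reader.

    Hypothetical helper: raises ValueError when a row does not have
    `n_columns` fields, so malformed lines are caught before writing.
    """
    batch = []
    for line_no, row in enumerate(reader, start=1):
        if len(row) != n_columns:
            raise ValueError(
                f"line {line_no}: expected {n_columns} fields, got {len(row)}"
            )
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter, batch
        yield batch


def csv_to_hdf(csv_path, h5_path, key, columns, index_columns,
               batch_size=100_000):
    """Append a CSV file to a single HDFStore key, one batch at a time.

    Assumption: only `index_columns` need to be queryable, so only they
    are passed as data_columns. Values arrive as strings; string widths
    are fixed by the first append, so real data may need dtype conversion
    or the min_itemsize argument of HDFStore.append.
    """
    import pandas as pd  # imported here so batched_rows works without pandas

    with open(csv_path, newline="") as f, pd.HDFStore(h5_path) as store:
        for batch in batched_rows(csv.reader(f), len(columns), batch_size):
            df = pd.DataFrame(batch, columns=columns)
            store.append(key, df, data_columns=index_columns, index=False)
        # Build the index once, after all rows are written.
        store.create_table_index(key, columns=index_columns,
                                 optlevel=9, kind="full")
```

With this shape, everything still lands under one key, but each append writes many rows at once and the index is built a single time at the end.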
Francesc Alted
2016-10-10 07:40:26 UTC
Hi Evan,

Could you please provide a self-contained example, using just PyTables,
that shows the problem? Then again, this is something you may get better
help with by asking the pandas crew.
--
Francesc Alted
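For readers wondering what a self-contained, PyTables-only example of the kind requested above might look like, here is a minimal sketch. It is not code from the thread: the Record description, the two column types, the node name "key", and the helpers build_indexed_table and query are all assumptions made for illustration, and the sketch only mirrors the append, index, and query steps, not the reported error.

```python
import tables


class Record(tables.IsDescription):
    # Assumed column types; the thread does not say what COL1/COL2 hold.
    COL1 = tables.Int64Col(pos=0)
    COL2 = tables.Float64Col(pos=1)


def build_indexed_table(h5_path, rows):
    """Write `rows` (a list of (COL1, COL2) tuples) to one table and index COL1."""
    with tables.open_file(h5_path, mode="w") as h5:
        table = h5.create_table("/", "key", Record)
        table.append(rows)
        table.flush()
        # Same index settings as in the original post.
        table.cols.COL1.create_index(optlevel=9, kind="full")


def query(h5_path, condition):
    """Return the rows matching a PyTables condition string, e.g. 'COL1 > 1'."""
    with tables.open_file(h5_path, mode="r") as h5:
        return h5.root.key.read_where(condition)
```

A script like this, with real column types and a small sample of the data, is the sort of thing that can be posted to either the PyTables or the pandas list to narrow down where an error comes from.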
Evan
2016-10-11 21:19:17 UTC