Discussion:
dates and space
Oleksandr Huziy
2013-08-05 18:38:35 UTC
Permalink
Hi Pytables users and developers:

I have a few questions to which I could not find the answer in the
documentation. Thank you in advance for any help.

1. If I store dates in Pytables, does it mean I could write queries like
table.where('date.month == 5')? Is there a common way to pass from python's
datetime to pytable's datetime and inversely?

2. I have several variables stored in the same file in a separate table for
each variable. And I use separate columns year, month, day, hour, minute,
second - to mark the time for a record (the records are not necessarily
ordered in time) and this is for each variable. I was thinking to put all
the variables in the same table and put missing values for the variables
which do not have outputs for a given time step. Is it possible to put None
as a default value into a table (so I could easily filter dummy rows).
But then again the data comes in chunks, does this mean I would have to
check if a row with the same date already exist for a different variable?


I don't really like the ideas in 2, which are intended to save space, but
maybe all I need is a good compression level? Can somebody advise me on
this?



Cheers
--
Oleksandr (Sasha) Huziy
Anthony Scopatz
2013-08-05 18:54:30 UTC
Permalink
Post by Oleksandr Huziy
I have a few questions to which I could not find the answer in the
documentation. Thank you in advance for any help.
1. If I store dates in Pytables, does it mean I could write queries like
table.where('date.month == 5')? Is there a common way to pass from python's
datetime to pytable's datetime and inversely?
Hello Sasha,

Pytables times are the actual based off of C time, not Python's date times.
This is because they use the HDF5 time types. So unfortunately you can't
write queries like the one above. (You'd need to talk to numexpr about
getting that kind of query implemented ~_~.)

Instead I would suggest that you store your times as Float64Atoms and
Float64Cols and then use arithmetic to figure out the query:

table.where("(x / 3600 / 24)%12 == 5")

This is not perfect...
Post by Oleksandr Huziy
2. I have several variables stored in the same file in a separate table
for each variable. And I use separate columns year, month, day, hour,
minute, second - to mark the time for a record (the records are not
necessarily ordered in time) and this is for each variable. I was thinking
to put all the variables in the same table and put missing values for the
variables which do not have outputs for a given time step. Is it possible
to put None as a default value into a table (so I could easily filter dummy
rows).
It is not possible to use "None" since that is a Python object of a
different type than the other integers you are trying to stick in the
column. I would suggest that you use values with no actual meaning. If
you are using normal ints you can use -1 to represent missing values. If
you are using unsigned ints you have to pick other values, like 13 for
month on the Julian calendar.
Post by Oleksandr Huziy
But then again the data comes in chunks, does this mean I would have to
check if a row with the same date already exist for a different variable?
No you wouldn't you can store the same data multiple times in different
rows.
Post by Oleksandr Huziy
I don't really like the ideas in 2, which are intended to save space, but
maybe all I need is a good compression level? Can somebody advise me on
this?
Compression would definitely help here since the date numbers are all
fairly similar. Probably even a compression level of 1 would work. Keep
in mind that sometime using compression actually speeds things up (see the
starving CPU problem). You might just need to experiment with a few
different compression level to see how things go. 0, 1, 5, 9 gives you a
good spread.

Be Well
Anthony
Post by Oleksandr Huziy
Cheers
--
Oleksandr (Sasha) Huziy
------------------------------------------------------------------------------
Get your SQL database under version control now!
Version control is standard for application code, but databases havent
caught up. So what steps can you take to put your SQL databases under
version control? Why should you start doing it? Read more to find out.
http://pubads.g.doubleclick.net/gampad/clk?id=49501711&iu=/4140/ostg.clktrk
_______________________________________________
Pytables-users mailing list
https://lists.sourceforge.net/lists/listinfo/pytables-users
Jeff Reback
2013-08-05 19:02:38 UTC
Permalink
Here is a pandas solution for doing just this (which uses PyTables under the hood):

# create a frame
In [45]: df = DataFrame(randn(1000,2),index=date_range('20000101',periods=1000))

In [53]: df
Out[53]: 
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1000 entries, 2000-01-01 00:00:00 to 2002-09-26 00:00:00
Freq: D
Data columns (total 2 columns):
0    1000  non-null values
1    1000  non-null values
dtypes: float64(2)

# store it as a table
In [46]: store = pd.HDFStore('test.h5',mode='w')

In [47]: store.append('df',df)

# select out the index (a datetimeindex in this case)
In [48]: c = store.select_column('df','index')

# get the coordinates of matching index
In [49]: coords = c[pd.DatetimeIndex(c).month==5]

# select those rows
In [51]: from pandas.io.pytables import Coordinates

In [50]: store.select('df',where=Coordinates(coords.index,None,None))
Out[50]: 
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 93 entries, 2000-05-01 00:00:00 to 2002-05-31 00:00:00
Data columns (total 2 columns):
0    93  non-null values
1    93  non-null values
dtypes: float64(2)



________________________________
From: Anthony Scopatz <***@gmail.com>
To: Discussion list for PyTables <pytables-***@lists.sourceforge.net>
Sent: Monday, August 5, 2013 2:54 PM
Subject: Re: [Pytables-users] dates and space
I have a few questions to which I could not find the answer in the documentation. Thank you in advance for any help.
1. If I store dates in Pytables, does it mean I could write queries like table.where('date.month == 5')? Is there a common way to pass from python's datetime to pytable's datetime and inversely?
Hello Sasha, 

Pytables times are the actual based off of C time, not Python's date times.  This is because they use the HDF5 time types.  So unfortunately you can't write queries like the one above.  (You'd need to talk to numexpr about getting that kind of query implemented ~_~.)

Instead I would suggest that you store your times as Float64Atoms and Float64Cols and then use arithmetic to figure out the query:

table.where("(x / 3600 / 24)%12 == 5") 

This is not perfect...
 
2. I have several variables stored in the same file in a separate table for each variable. And I use separate columns year, month, day, hour, minute, second  - to mark the time for a record (the records are not necessarily ordered in time) and this is for each variable. I was thinking to put all the variables in the same table and put missing values for the variables which do not have outputs for a given time step. Is it possible to put None as a default value into a table (so I could easily filter dummy rows).
It is not possible to use "None" since that is a Python object of a different type than the other integers you are trying to stick in the column.  I would suggest that you use values with no actual meaning.  If you are using normal ints you can use -1 to represent missing values.  If you are using unsigned ints you have to pick other values, like 13 for month on the Julian calendar.
 
But then again the data comes in chunks, does this mean I would have to check if a row with the same date already exist for a different variable?

No you wouldn't you can store the same data multiple times in different rows.
 
I don't really like the ideas in 2, which are intended to save space, but maybe all I need is a good compression level? Can somebody advise me on this?
Compression would definitely help here since the date numbers are all fairly similar.  Probably even a compression level of 1 would work.  Keep in mind that sometime using compression actually speeds things up (see the starving CPU problem).  You might just need to experiment with a few different compression level to see how things go. 0, 1, 5, 9 gives you a good spread.

Be Well
Anthony
 
Cheers
--
Oleksandr (Sasha) Huziy   
------------------------------------------------------------------------------
Get your SQL database under version control now!
Version control is standard for application code, but databases havent
caught up. So what steps can you take to put your SQL databases under
version control? Why should you start doing it? Read more to find out.
http://pubads.g.doubleclick.net/gampad/clk?id=49501711&iu=/4140/ostg.clktrk
_______________________________________________
Pytables-users mailing list
https://lists.sourceforge.net/lists/listinfo/pytables-users
Oleksandr Huziy
2013-08-05 19:43:33 UTC
Permalink
Thank you Anthony and Jeff:

I will try compression and if that won't be enough, I'll try using pandas
for working with dates.

Cheers
--
Sasha
Post by Jeff Reback
# create a frame
In [45]: df =
DataFrame(randn(1000,2),index=date_range('20000101',periods=1000))
In [53]: df
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1000 entries, 2000-01-01 00:00:00 to 2002-09-26 00:00:00
Freq: D
0 1000 non-null values
1 1000 non-null values
dtypes: float64(2)
# store it as a table
In [46]: store = pd.HDFStore('test.h5',mode='w')
In [47]: store.append('df',df)
# select out the index (a datetimeindex in this case)
In [48]: c = store.select_column('df','index')
# get the coordinates of matching index
In [49]: coords = c[pd.DatetimeIndex(c).month==5]
# select those rows
In [51]: from pandas.io.pytables import Coordinates
In [50]: store.select('df',where=Coordinates(coords.index,None,None))
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 93 entries, 2000-05-01 00:00:00 to 2002-05-31 00:00:00
0 93 non-null values
1 93 non-null values
dtypes: float64(2)
------------------------------
*Sent:* Monday, August 5, 2013 2:54 PM
*Subject:* Re: [Pytables-users] dates and space
I have a few questions to which I could not find the answer in the
documentation. Thank you in advance for any help.
1. If I store dates in Pytables, does it mean I could write queries like
table.where('date.month == 5')? Is there a common way to pass from python's
datetime to pytable's datetime and inversely?
Hello Sasha,
Pytables times are the actual based off of C time, not Python's date
times. This is because they use the HDF5 time types. So unfortunately you
can't write queries like the one above. (You'd need to talk to numexpr
about getting that kind of query implemented ~_~.)
Instead I would suggest that you store your times as Float64Atoms and
table.where("(x / 3600 / 24)%12 == 5")
This is not perfect...
2. I have several variables stored in the same file in a separate table
for each variable. And I use separate columns year, month, day, hour,
minute, second - to mark the time for a record (the records are not
necessarily ordered in time) and this is for each variable. I was thinking
to put all the variables in the same table and put missing values for the
variables which do not have outputs for a given time step. Is it possible
to put None as a default value into a table (so I could easily filter dummy
rows).
It is not possible to use "None" since that is a Python object of a
different type than the other integers you are trying to stick in the
column. I would suggest that you use values with no actual meaning. If
you are using normal ints you can use -1 to represent missing values. If
you are using unsigned ints you have to pick other values, like 13 for
month on the Julian calendar.
But then again the data comes in chunks, does this mean I would have to
check if a row with the same date already exist for a different variable?
No you wouldn't you can store the same data multiple times in different rows.
I don't really like the ideas in 2, which are intended to save space, but
maybe all I need is a good compression level? Can somebody advise me on
this?
Compression would definitely help here since the date numbers are all
fairly similar. Probably even a compression level of 1 would work. Keep
in mind that sometime using compression actually speeds things up (see the
starving CPU problem). You might just need to experiment with a few
different compression level to see how things go. 0, 1, 5, 9 gives you a
good spread.
Be Well
Anthony
Cheers
--
Oleksandr (Sasha) Huziy
------------------------------------------------------------------------------
Get your SQL database under version control now!
Version control is standard for application code, but databases havent
caught up. So what steps can you take to put your SQL databases under
version control? Why should you start doing it? Read more to find out.
http://pubads.g.doubleclick.net/gampad/clk?id=49501711&iu=/4140/ostg.clktrk
_______________________________________________
Pytables-users mailing list
https://lists.sourceforge.net/lists/listinfo/pytables-users
------------------------------------------------------------------------------
Get your SQL database under version control now!
Version control is standard for application code, but databases havent
caught up. So what steps can you take to put your SQL databases under
version control? Why should you start doing it? Read more to find out.
http://pubads.g.doubleclick.net/gampad/clk?id=49501711&iu=/4140/ostg.clktrk
_______________________________________________
Pytables-users mailing list
https://lists.sourceforge.net/lists/listinfo/pytables-users
------------------------------------------------------------------------------
Get your SQL database under version control now!
Version control is standard for application code, but databases havent
caught up. So what steps can you take to put your SQL databases under
version control? Why should you start doing it? Read more to find out.
http://pubads.g.doubleclick.net/gampad/clk?id=49501711&iu=/4140/ostg.clktrk
_______________________________________________
Pytables-users mailing list
https://lists.sourceforge.net/lists/listinfo/pytables-users
Loading...