Discussion:
[pytables-users] Efficient storing/querying of big time series
Miguel Dardenne
2015-01-08 14:17:22 UTC
Hi all,

I need to store and be able to query some relatively big time series data.

Properties of the data are as follows:

- number of series: around 12,000
- number of data points, globally: around 500,000,000 (five hundred
million) per month
- the majority of data points are floating-point values; the rest are
strings
- sampling period: variable between series and within a series
- data retention period: several years
- data archives need to be built in near real time, but a reasonable delay
(~1 hour) is acceptable
- past data can be rebuilt if needed, but at a high cost
- sometimes, though quite rarely, some past data needs to be updated
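For concreteness, here is a minimal sketch of what one such series might
look like in pandas (all values and names invented; the sampling is
irregular):

import numpy as np
import pandas as pd

# Invented example of one series: float values on an irregular
# DatetimeIndex (the sampling period varies within the series).
gaps = np.random.randint(1, 120, size=1000)  # seconds between points
ts = pd.Timestamp("2015-01-01") + pd.to_timedelta(gaps.cumsum(), unit="s")
series = pd.DataFrame({"value": np.random.randn(1000)}, index=ts)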

Properties of envisioned queries:

- most of them will be timestamp-range queries, spanning from one day to
several months or years; 90%+ will be queries on the most recent data

I am thinking of using PyTables/Pandas instead of an SQL database.

The questions I have are:

1. Should I create and manage, say, one file per month, or is it better to
let a single HDF file grow as big as possible, and why? (See the first
sketch after the questions.)

2. Should I prefer the fixed or the table format? To me, the fixed format
looks OK if I keep one HDF file per month, since a whole month of a series
should then fit in RAM and I can slice it in memory without needing a
table-format index. Am I correct? (See the second sketch below.)
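Regarding question 1, here is a minimal sketch of the per-month layout I
have in mind (paths, key names, and helper functions are my own invention,
and error handling for missing month files is omitted):

import pandas as pd

def month_path(ts, root="archive"):
    # One HDF5 file per calendar month, e.g. archive/2015-01.h5.
    return "%s/%s.h5" % (root, ts.strftime("%Y-%m"))

def append_points(series_id, df):
    # Route each new chunk of points to its month file; the 'table'
    # format supports appending and on-disk range queries.
    for month, chunk in df.groupby(pd.Grouper(freq="M")):
        if chunk.empty:
            continue
        with pd.HDFStore(month_path(month)) as store:
            store.append(series_id, chunk, data_columns=True)

def read_range(series_id, start, end):
    # A date-range query only opens the month files overlapping
    # [start, end], so recent-data queries touch very few files.
    where = 'index >= "%s" & index <= "%s"' % (start, end)
    parts = [pd.read_hdf(month_path(p.to_timestamp()), series_id,
                         where=where)
             for p in pd.period_range(start, end, freq="M")]
    return pd.concat(parts)

With a layout like this, updating or rebuilding past data would mean
rewriting only the affected month files rather than one huge archive.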
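Regarding question 2, here is a small sketch contrasting the two pandas
HDF formats (file and key names invented). As far as I understand, 'fixed'
is faster and more compact but supports neither appends nor on-disk
queries, while 'table' lets PyTables select only the requested rows:

import numpy as np
import pandas as pd

# Invented sample: one month of one float-valued series.
idx = pd.date_range("2015-01-01", "2015-01-31 23:59", freq="1min")
df = pd.DataFrame({"value": np.random.randn(len(idx))}, index=idx)

# Fixed format: the whole node is read back; slicing happens in RAM.
df.to_hdf("2015-01_fixed.h5", "series_0001", format="fixed")
whole = pd.read_hdf("2015-01_fixed.h5", "series_0001")
window = whole.loc["2015-01-05":"2015-01-10"]

# Table format: the where clause is evaluated on disk, so only the
# matching rows are ever loaded.
df.to_hdf("2015-01_table.h5", "series_0001", format="table")
window = pd.read_hdf(
    "2015-01_table.h5", "series_0001",
    where='index >= "2015-01-05" & index <= "2015-01-10"')

If a month of one series really fits in RAM, the fixed read-then-slice
path seems plausible to me; the trade-off is losing appends and in-place
updates without rewriting the whole node.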

Many thanks in advance for any useful insights.

Miguel