t***@gmail.com
2016-10-17 18:21:51 UTC
Dear group,
I am using PyTables to write HDF5 data row by row during the analysis of a larger data set.
So far, when I used this for small to medium analyses, everything was written
correctly. However, the output files seem to stay at 0 bytes until my
program terminates, and only then end up stored correctly.
Now I have started an analysis that runs for two weeks or more. After 5 days,
all my output files are still 0 bytes. This is on a 16-CPU machine (AWS instance)
with 122 GB of memory and 16 processes running in parallel (each writing into
its own HDF5 files, with no parallel writing to the same file).
Does this mean that if a process is terminated by a fault, or I kill one (or there
is a power outage or some other problem), I will end up with no data at all?
Where does all that data get cached or buffered in the meantime? With 122
GB of memory, it seems impossible that five days' worth of analysis data is still
kept in memory...
I would like to understand PyTables' buffering mechanisms and the risks
involved, and to avoid ending up with no data after running
multiple instances of a big analysis.
The code I use is the class HDF5FeatureWriter from
https://github.com/tuwien-musicir/rp_extract/blob/master/rp_feature_io.py
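For illustration, here is a stripped-down sketch of the kind of row-by-row writing I am doing (my own simplification with made-up column names, not the actual HDF5FeatureWriter code), including the explicit flush() calls I am wondering whether I should be adding:

import numpy as np
import tables

class FeatureRow(tables.IsDescription):
    # Assumed layout: one source file name plus a fixed-size feature vector per row
    filename = tables.StringCol(256)
    features = tables.Float64Col(shape=(24,))

h5file = tables.open_file('features.h5', mode='w')
table = h5file.create_table('/', 'features', FeatureRow)

row = table.row
for i in range(1000):              # stands in for the long-running analysis loop
    row['filename'] = 'file_%d.wav' % i
    row['features'] = np.random.rand(24)
    row.append()
    if i % 100 == 0:
        table.flush()              # push PyTables' internal row buffer to the HDF5 layer
        h5file.flush()             # ask HDF5 to write its caches to the file on disk

h5file.close()

My assumption is that table.flush() writes out PyTables' row buffer and h5file.flush() pushes the HDF5-level caches to disk, but I am not sure how much that actually protects against a crash or kill.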
I am using PyTables 3.1.1 and NumPy 1.8.2 with Python 2.7 under Ubuntu
14.04 LTS.
thanks, Thomas