Discussion:
[pytables-users] Output files are 0 bytes before HDF5 file is closed - buffering?
t***@gmail.com
2016-10-17 18:21:51 UTC
Dear group,

I am using PyTables to write HDF5 data rows sequentially, row by row, during
the analysis of a larger data set.

So far, when I used this for small to medium analyses, everything was written
correctly. However, it seems that the output files are 0 bytes until my
program terminates, but then they end up stored correctly.

Now I have started an analysis that runs for two weeks or more. After 5 days,
all my output files are still 0 bytes, on a 16-CPU machine (AWS instance)
with 122 GB of memory and 16 processes running in parallel (all writing to
multiple different HDF5 files; there is no parallel writing to the same file).

Does this mean that if a process is terminated by a fault, or I kill one (or
there is a power outage or some other problem), I will end up with 0 data?

Where does all that data get cached or buffered in the meantime? With 122 GB
of memory, it is impossible that the data from 5 days of analysis is still
kept in memory...

I would like to understand PyTables' buffering mechanisms and the risks
involved, so that I can avoid ending up with 0 data after running multiple
instances of a big data analysis.

The code I use is the class HDF5FeatureWriter from
https://github.com/tuwien-musicir/rp_extract/blob/master/rp_feature_io.py
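
Roughly, the write pattern boils down to something like this (a simplified sketch only, not the actual HDF5FeatureWriter code; the file name, array name and feature dimension are made up):

import numpy as np
import tables

N_FEATURES = 64  # made-up feature vector length

with tables.open_file("features.h5", mode="w") as h5:
    # extendable array whose first dimension grows with every append
    feats = h5.create_earray(h5.root, "features",
                             atom=tables.Float32Atom(),
                             shape=(0, N_FEATURES))
    for i in range(100000):                        # one result row per analysed item
        row = np.random.rand(1, N_FEATURES).astype(np.float32)
        feats.append(row)                          # buffered by PyTables / HDF5
        # without an explicit h5.flush() here, the on-disk file can stay at 0 bytes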

I am using PyTables 3.1.1 and NumPy 1.8.2 with Python 2.7 under Ubuntu
14.04 LTS.

thanks, Thomas
Francesc Alted
2016-10-22 15:47:01 UTC
Hi Tom,

In the code you are showing us, I see that you do a flush() when you call
HDF5FeatureWriter.write_features() (as long as you set the parameter
flush=True). Are you sure that you are passing this flag from your programs?

BTW, in determining the root of the problem, it always helps if you send a
minimal, self-contained example showing the behavior you are describing.
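
For instance, something along these lines would already be a good starting point (just a sketch with made-up names, not taken from rp_feature_io.py): append a bunch of rows and compare the on-disk size before and after an explicit flush().

import os
import numpy as np
import tables

class Feature(tables.IsDescription):
    vector = tables.Float64Col(shape=(16,))    # made-up 16-dim feature row

fname = "buffer_demo.h5"
h5 = tables.open_file(fname, mode="w")
table = h5.create_table(h5.root, "features", Feature)

row = table.row
for i in range(1000):
    row["vector"] = np.random.rand(16)
    row.append()                 # rows accumulate in PyTables' internal I/O buffer

print("before flush: %d bytes" % os.path.getsize(fname))   # often 0 or just a small header
h5.flush()                                                 # push the buffered rows to disk
print("after flush:  %d bytes" % os.path.getsize(fname))   # now reflects the appended data
h5.close()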
--
Francesc Alted
Thomas Lidy
2016-10-24 14:20:40 UTC
Dear Francesc,

Thank you for your answer. In the meantime I use flush as the default; in the original implementation this was not the case.

In fact, my problem was due to another bug on my side: in a for loop I overwrote my data in each iteration, and that's why the PyTables data never increased. It seems that the amount of data stayed below the file buffer size, and therefore the files remained at 0 bytes at all times.
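
Schematically, the mistake was of this kind (made-up names and data, purely for illustration; this is not my actual analysis code):

import numpy as np

buggy, fixed = [], []
for i in range(5):
    features = np.full(3, i, dtype=np.float64)   # stand-in for one analysis result
    buggy = [features]        # BUG: re-binds the list, keeping only the last row
    fixed.append(features)    # intended: accumulate every row before writing it out

print(len(buggy))   # 1  -> the data handed to PyTables never grows
print(len(fixed))   # 5  -> grows with each iteration, as expected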

Once the bug was resolved, I observed the files growing from time to time (i.e. buffering); with the flush option set to True, they are written instantly every time I add data.

Thanks for your reply though.
best
Thomas