'Stefan Braun' via pytables-users
2017-11-13 13:12:47 UTC
Hi, I have an HDF5 file that I have to read with multiple processes forked
from a main process.
The target dataset has ~10k samples. A single sample is requested at a time
and needs further processing (hence the multiprocessing).
The main process creates a dataset class instance; the multiprocessing fork
happens after the dataset class has been initialized. Two dataset classes are
available (see the attached file; a sketch of both variants follows below):
(a) Open the HDF5 file during class initialization and pass the opened HDF5
object to the forked processes. The get_data() method works directly with the
HDF5 file already opened in the main process.
(b) Open the HDF5 file inside the get_data() method, so each forked process
manages the HDF5 state itself.
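To make the two variants concrete (the attached file is not reproduced here), below is a minimal sketch of what they could look like, assuming PyTables and the default fork start method on Linux. The file name samples.h5, the node /data, and the names DatasetA, DatasetB and process_sample are placeholders for illustration, not the actual code from the attachment:

import multiprocessing as mp
import tables

H5_PATH = "samples.h5"   # hypothetical file name
NODE = "/data"           # hypothetical array node holding the ~10k samples


class DatasetA:
    # Variant (a): open the file once at initialization; forked workers
    # reuse the handle inherited from the main process.
    def __init__(self, filename):
        self._file = tables.open_file(filename, mode="r")
        self._node = self._file.get_node(NODE)

    def get_data(self, index):
        return self._node[index]


class DatasetB:
    # Variant (b): reopen the file on every call, so each forked worker
    # manages its own HDF5 state.
    def __init__(self, filename):
        self._filename = filename

    def get_data(self, index):
        with tables.open_file(self._filename, mode="r") as f:
            return f.get_node(NODE)[index]


dataset = None  # set in the parent before forking; workers inherit it via fork


def process_sample(index):
    # Stand-in for the "further processing" done on each sample.
    return dataset.get_data(index).sum()


if __name__ == "__main__":
    dataset = DatasetA(H5_PATH)         # or DatasetB(H5_PATH)
    with mp.Pool(processes=4) as pool:  # fork happens after the dataset init
        results = pool.map(process_sample, range(10000))
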
Both variants have disadvantages:
(a) works only with one forked process; with more processes, errors are thrown
because the forked processes seem to disagree about the state of the shared
open HDF5 object.
(b) works with more than one forked process, but is ~10x slower, since each
forked process has to reopen the file every time a new sample is requested.
Is there a way to use more than one process while avoiding the slowdown of
reopening the HDF5 file for every sample? Perhaps I am missing something
obvious?
Thanks in advance
Best
Stefan