Discussion:
What is the most efficient way to store raw data in PyTables?
(too old to reply)
juan jose gomez cadenas
2016-07-17 20:23:29 UTC
Permalink
Dear users,
I have to store large arrays of raw data corresponding to photomultiplier
waveforms that are not yet zero suppressed. Specifically, I have waveforms
from 12 PMTs to store, each containing about 1.2 million integers. My
detector records millions of such events. I am looking for the most
efficient way to store the data.

One possibility is to use a nested array. For example:

class PMTRD(tables.IsDescription):
    # event_id = tables.Int32Col(pos=1, indexed=True)
    event_id = tables.Int32Col(pos=1)
    pmtrd = tables.Int32Col(shape=LEN_PMT, pos=2)

then fill like this:

pmt = table.row  # Row accessor of the PMTRD table
for i in range(NEVENTS):
    for j in range(NPMTS):
        pmt['event_id'] = i
        pmt['pmtrd'] = raw_data(LEN_PMT)
        # This injects the row values.
        pmt.append()
table.flush()

here NPMTS = 12, LEN_PMT = 1.2e6, and NEVENTS is a large number. This
scheme works and gives a table with two columns, one for the event id and
the other for the PMT raw data. The column for the PMT raw data is nested
(it contains the large raw-data vector). I am not sure this is very
efficient. The number of rows is NPMTS*NEVENTS.
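For scale, a back-of-envelope estimate (assuming 4-byte integers; the numbers are illustrative) of what one row of this nested layout carries:

```python
# Back-of-envelope size of one row in the nested-table layout.
NPMTS = 12
LEN_PMT = int(1.2e6)        # samples per waveform
BYTES_PER_SAMPLE = 4        # Int32Col

row_bytes = 4 + LEN_PMT * BYTES_PER_SAMPLE   # event_id + one waveform
event_bytes = NPMTS * row_bytes              # NPMTS rows per event
print(row_bytes)    # ~4.8 MB per row
print(event_bytes)  # ~57.6 MB per event
```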

Alternatively, one could store the raw data using CArrays:

hcnt = h5file.create_carray(root, name, atom, raw_data.shape,
                            filters=filters)
hcnt[:] = raw_data_per_pmt

One then uses the tree structure of PyTables to store the data, something
like:

/root/rawData/event1/PMT1
carray
/root/rawData/event1/PMT2
carray
...
/root/rawData/event1/PMT12
carray
/root/rawData/event2/PMT1
carray
and so on...

My question: what is the best (most efficient) way to store such raw data?
Are there any limitations on the number of events that can be stored per
file using method "a" or method "b"? Any tips?

Thanks a lot,
--
You received this message because you are subscribed to the Google Groups "pytables-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pytables-users+***@googlegroups.com.
To post to this group, send an email to pytables-***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Maarten Sneep
2016-07-17 20:57:36 UTC
Permalink
Hi,
Post by juan jose gomez cadenas
My question: what is the best (most efficient) way to store such raw data? Any limitations on the number of events that can be stored per file using method "a" or method "b"? Any tips?
If you are going the carray route, go all the way and use some additional dimensions:

/root/rawData/pmts[event, detector, :]
/root/rawData/event_id[event]

You can use an extensible array for the events dimension, or pre-set a maximum and start a new file for each sequence of events. Play with the chunk-size to optimise for speed. It doesn’t look like you are going to use the indexing functionality of PyTables, so this will just work. When writing, you may want to dump complete slices into the file, writing per detector will be less efficient.
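A minimal sketch of that layout with an extensible event axis (the file name, chunk shape, and toy LEN_PMT are illustrative, not from this thread; the real waveforms are ~1.2e6 samples):

```python
import numpy as np
import tables

NPMTS, LEN_PMT = 12, 1200   # toy length; the real one is ~1.2e6

with tables.open_file("rawdata.h5", "w") as h5:
    grp = h5.create_group("/", "rawData")
    # Extensible along the event axis; one chunk holds one detector trace.
    pmts = h5.create_earray(grp, "pmts", atom=tables.Int32Atom(),
                            shape=(0, NPMTS, LEN_PMT),
                            chunkshape=(1, 1, LEN_PMT),
                            filters=tables.Filters(complib="blosc",
                                                   complevel=5))
    event_id = h5.create_earray(grp, "event_id",
                                atom=tables.Int32Atom(), shape=(0,))
    for i in range(3):  # a few dummy events
        event = np.zeros((1, NPMTS, LEN_PMT), dtype=np.int32)
        pmts.append(event)       # write the whole event slice at once
        event_id.append([i])
    final_shape = pmts.shape     # grows to (3, 12, 1200)
    ids = [int(x) for x in event_id[:]]
```

Reading event i back is then just pmts[i] (or pmts[i, j] for a single PMT).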

Make sure that your file system can handle the data rate, it seems you are trying to write quite a lot of data. 12*1.2e6*4 bytes/sample gives a base rate of 56 MB/sample-time. If you really need to multiply that by “millions”, then either your run-time must be very long indeed, or the data collection rate is going to be high. What kind of system is this?

You probably don’t want a group per event, in case of millions of events. That is why we have multi-dimensional arrays. Same is true for each PMT.

Best,

Maarten
juan jose gomez cadenas
2016-07-17 21:18:27 UTC
Permalink
Hi Maarten,
Thanks for the tip. That sounds like a good way to handle the data. Do you think it is more efficient than the nested-array approach? (Naively I would think so, but I am not sure.)

The detector is a high-pressure xenon TPC searching for neutrinoless double beta decay. see:

http://next.ific.uv.es/next/

Our time window is 1.2 ms, and we sample in steps of 25 ns. We can take millions of events when calibrating the detector with radioactive sources. For the Monte Carlo simulation of the events we generate photoelectrons in bins of 1 ns, to be able to simulate the electronics and DAQ properly; hence the large size of the (Monte Carlo) events. The real data events are 25 times less dense.

Indexing will be useful at a later stage, once we reduce the raw data and start producing high-level objects. We come from the “particle physics” culture, meaning C++ and the ROOT package (root.cern.ch). I am looking for alternatives involving Python and a suitable database. HDF5 and PyTables seem to be a good solution.

Thanks!
Post by Maarten Sneep
/root/rawData/pmts[event, detector, :]
/root/rawData/event_id[event]
You can use an extensible array for the events dimension, or pre-set a maximum and start a new file for each sequence of events.
Maarten Sneep
2016-07-17 22:38:37 UTC
Permalink
Post by juan jose gomez cadenas
Hi Maarten,
Thanks for the tip. That sounds as a good way to handle the data. Do you think is more efficient than the nested array approach (naively I would think so, but not sure).
Not 100% sure either. You do have a bit more overhead when using groups, even if it is only for the object-creation.

Whatever you do: add metadata/attributes (units, descriptive data, …). You may want to record some metadata in a standard table to allow for quick indexing. I think your main worry should be finding back the data you really want to look at. Or, keeping in tune with the particle physics background: do triage and throw out anything you can’t use. But I guess for the initial calibration that just isn’t an option.

Of course you can group similar experiments in a group in a file, just for organisation. One other tip: store your dimensions as well, e.g. a carray with the time axis and other relevant dimensions. In netCDF all dimensions must be declared beforehand. They may be unlimited, but at least they have a name, and are therefore connected with all data. It is customary to declare an array with the same name as the dimension to store the time (and, given the application space of netCDF: latitude, longitude, …).

This gives you a sample space, and helps a lot in making the data self-describing. PyTables probably has better performance than netCDF (even though both use HDF5 under the hood), because the C-netCDF layer is somewhat inefficient. Nevertheless, it helps to use good ideas from elsewhere.
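For instance (a sketch; the names and toy length are illustrative), the shared time axis can live next to the waveforms, with its units recorded as attributes:

```python
import numpy as np
import tables

SAMPLE_NS, LEN_PMT = 25, 1200   # toy length; real waveforms are longer

with tables.open_file("axes.h5", "w") as h5:
    # One shared time axis for every waveform, netCDF-style.
    t = h5.create_carray("/", "time",
                         obj=np.arange(LEN_PMT, dtype=np.int64) * SAMPLE_NS)
    t.attrs.units = "ns"
    t.attrs.long_name = "sample time within the event window"
    second_tick = int(t[1])   # second sample sits at 25 ns
```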
Post by juan jose gomez cadenas
http://next.ific.uv.es/next/
Our time window is 1.2 ms, and we sample at steps of 25 ns. We can take million of events when calibrating the detector with radioactive sources. For the Monte Carlo simulation of the event we generate photoelectrons in bins of 1 ns, to be able to simulate properly the electronics and DAQ, thus the large size of the (Monte Carlo) events, the real data events are 25 times less dense.
1.2 ms time window, 25 ns sample interval, 12 detectors, 4 bytes/sample. That looks like a leisurely 2½ MB/s. Even the simulations come to about 62 MB/s. That is quite doable, not exotic.
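Spelled out (a rough check assuming one event per second and 4-byte samples; the exact figures depend on overheads):

```python
WINDOW_S = 1.2e-3    # event time window
REAL_DT_S = 25e-9    # real-data sampling step
MC_DT_S = 1e-9       # Monte Carlo binning
NPMTS, BYTES = 12, 4

real_event = NPMTS * (WINDOW_S / REAL_DT_S) * BYTES  # ~2.3 MB per event
mc_event = NPMTS * (WINDOW_S / MC_DT_S) * BYTES      # ~57.6 MB per event
```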
Post by juan jose gomez cadenas
Indexing will be useful at a later stage, once we reduce the raw data and start producing high level objects. We come from the “Particle physics” culture, meaning C++ and ROOT package (root.cern.ch). I am looking for alternatives involving python and a suitable data base. Hdf5 and pytables seem to be a good solution.
Plenty of people at CERN are using Python. But I have a somewhat different background, as you may have guessed by now.

Maarten
juan jose gomez cadenas
2016-07-18 07:16:54 UTC
Permalink
Thanks again! Most useful!
Indeed, Python is being heavily used at CERN these days (I am an old hand: first Fortran, then C, then C++, and then I started the Python movement, along with other “rebels” at CERN, more than 10 years ago). The problem is that Python is heavily used with ROOT via the PyROOT interface. ROOT itself is a bit of a nightmare, and the combination with Python ruins many of the nicest Python features. PyTables, by contrast, seems very well thought out, so I decided to give it a try. Thanks again, Maarten!
Post by Maarten Sneep
Not 100% sure either. You do have a bit more overhead when using groups, even if it is only for the object-creation.
Francesc Alted
2016-07-18 16:19:12 UTC
Permalink
Hola Juanjo,

Your question is actually a FAQ that has been asked now and then almost
since the beginning of the project, so I took some time to create an
example (heavily based on your setup) that tries to explain when a single
Table is not enough and a combination of an EArray+Table works much better:
https://github.com/PyTables/PyTables/blob/develop/examples/Single_Table-vs-EArray_Table.ipynb

Spoiler: at speeds of ~1 GB/s for writing and ~2.5 GB/s for reading, the
EArray+Table combination is a very powerful one when speed is needed
(during acquisition peaks, for example).
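Condensed, the EArray+Table idea looks roughly like this (a sketch with illustrative names, not the notebook's exact code): the bulky waveforms go into an EArray, while a small Table keeps the per-event metadata and the row offset into the array.

```python
import numpy as np
import tables

NPMTS, LEN_PMT = 12, 1200   # toy length; real waveforms are ~1.2e6 samples

class EventIndex(tables.IsDescription):
    event_id = tables.Int32Col(pos=1)
    row = tables.Int64Col(pos=2)   # row of this event in the EArray

with tables.open_file("earray_table.h5", "w") as h5:
    waveforms = h5.create_earray("/", "waveforms", atom=tables.Int32Atom(),
                                 shape=(0, NPMTS, LEN_PMT))
    index = h5.create_table("/", "index", EventIndex)
    r = index.row
    for i in range(3):
        r["event_id"] = 1000 + i
        r["row"] = waveforms.nrows   # where this event will land
        waveforms.append(np.zeros((1, NPMTS, LEN_PMT), dtype=np.int32))
        r.append()
    index.flush()
    n_events = int(index.nrows)
    last_id = int(index[-1]["event_id"])
```

Queries on the small table (e.g. index.read_where('event_id == 1001')) then give the row to slice out of the EArray.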
Post by juan jose gomez cadenas
Thanks again! Most useful!
--
Francesc Alted
Jon Wilson
2016-07-18 16:46:48 UTC
Permalink
Hi Juan,
I'm on SuperCDMS and formerly CDF, and have encountered many of the same
sorts of issues as you. Our raw data (not so different from PMT pulses)
is stored in a custom binary format that is actually quite fast to read
in Python if you're smart about allocating memory in advance, doing
block reads, etc. I'm fairly satisfied with this solution.

For reading data that is already locked up in ROOT files, by all means
avoid PyROOT if you possibly can. The indirection involved in the
ROOT-to-Python interface incurs huge overhead costs at every PyROOT
call. It's far better to snarf big chunks of data into memory and then
use the scientific Python tools to do the analysis. You can take a look at
"rootpy" and "root_numpy" for this. If that still isn't fast enough,
then I'm happy to share some code I wrote that does block reads from the
TBaskets of the TTree, with which I got some additional speedup beyond
root_numpy.

Any further replies along these lines should probably go off-list,
since this is getting a bit far afield from PyTables.
Regards,
Jon
Post by juan jose gomez cadenas
The problem is that Python is heavily used with ROOT via the PyROOT interface. ROOT itself is a bit of a nightmare, and the combination with Python ruins many of the nicest Python features.