Discussion: Pytables-users Digest, Vol 86, Issue 8
Pushkar Raj Pande
2013-07-18 02:04:46 UTC
Thanks Antonio and Anthony. I will give this a try.

-Pushkar


Date: Wed, 17 Jul 2013 16:59:16 -0500
Subject: Re: [Pytables-users] Pytables bulk loading data
To: Discussion list for PyTables
Hi Pushkar,
I agree with Antonio. You should load your data with NumPy functions and
then write back out to PyTables. This is the fastest way to do things.
Be Well
Anthony
On Wed, Jul 17, 2013 at 2:12 PM, Antonio Valentino wrote:
Hi Pushkar,
Post by Pushkar Raj Pande
Hi all,
I am trying to figure out the best way to bulk load data into PyTables.
This question may already have been answered, but I couldn't find what I
was looking for.

The source data is CSV, which may require parsing, type checking, and
setting default values when a field doesn't conform to the type of its
column. There are over 100 columns in a record. Doing this in a Python
loop for each row is very slow, almost a factor of ~50 slower than just
fetching the rows from one PyTables file and writing them to another.

I believe that if I load the data using a C procedure that does the
parsing and builds the records to write into PyTables, I can get close to
the speed of just copying rows from one PyTables file to another. But
maybe something simple and better already exists. Can someone please
advise? And if a C procedure is what I should write, can someone point me
to examples or snippets that I can refer to when putting this together?
Thanks,
Pushkar
NumPy has some tools for loading data from CSV files, like loadtxt [1],
genfromtxt [2], and other variants. Would none of them work for you?
[1]
http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html#numpy.loadtxt
[2]
http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html#numpy.genfromtxt
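For illustration, a minimal sketch of this approach: parse the whole CSV
with genfromtxt, then append the resulting structured array to a PyTables
table in one call. The file names and the three-column layout are
hypothetical stand-ins for the real ~100-column record.

import numpy as np
import tables

# Hypothetical layout; extend to the full ~100-column record.
dtype = np.dtype([("id", "i8"), ("value", "f8"), ("flag", "i8")])

# genfromtxt parses the whole file at once; filling_values supplies
# per-column defaults for fields that are missing.
data = np.genfromtxt("input.csv", delimiter=",", dtype=dtype,
                     filling_values={0: -1, 1: 0.0, 2: 0})

with tables.open_file("records.h5", mode="w") as h5:
    table = h5.create_table("/", "records", description=dtype)
    table.append(np.atleast_1d(data))  # handles the single-row case too
    table.flush()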
cheers
--
Antonio Valentino
Pushkar Raj Pande
2013-07-18 06:45:47 UTC
Both loadtxt and genfromtxt read the entire file into memory, which is not
desirable. Is there a way to achieve streaming writes?

Thanks,
Pushkar
Andreas Hilboll
2013-07-18 07:36:36 UTC
Post by Pushkar Raj Pande
Both loadtxt and genfromtxt read the entire file into memory, which is not
desirable. Is there a way to achieve streaming writes?
Thanks,
Pushkar
You could use pandas_ and its read_table function. It has nrows and
skiprows parameters with which you can easily do your own 'streaming'.

.. _pandas: http://pandas.pydata.org/
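A minimal sketch of that idea, with hypothetical file names and a
three-column layout. Instead of tracking skiprows by hand, this uses
read_table's chunksize parameter, which yields the file in DataFrame
blocks and does the same bookkeeping internally:

import numpy as np
import pandas as pd
import tables

dtype = np.dtype([("id", "i8"), ("value", "f8"), ("flag", "i8")])
CHUNK = 100000  # rows per block; tune to available memory

with tables.open_file("records.h5", mode="w") as h5:
    table = h5.create_table("/", "records", description=dtype)
    # Only CHUNK rows are in memory at any time.
    for block in pd.read_table("input.csv", sep=",", header=None,
                               names=list(dtype.names), chunksize=CHUNK):
        table.append(block.to_records(index=False))
    table.flush()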

-- Andreas
Antonio Valentino
2013-07-18 08:00:56 UTC
Hi Pushkar,
Post by Pushkar Raj Pande
Both loadtxt and genfromtxt read the entire file into memory, which is not
desirable. Is there a way to achieve streaming writes?
OK, probably fromfile [1] can help you cook up something that works
without loading the entire file into memory (and without too many
iterations over the file).

In any case, I strongly recommend that you not perform read/write cycles
on single lines; rather, define a reasonable data block size (number of
rows) and process the file in chunks (see the sketch after the references
below).

If you find a reasonably simple solution, it would be nice to include it
in our documentation as an example or a "recipe" [2].

[1]
http://docs.scipy.org/doc/numpy/reference/generated/numpy.fromfile.html#numpy.fromfile
[2] http://pytables.github.io/latest/cookbook/index.html
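A minimal sketch of this chunked approach, with hypothetical file names
and column layout. Because fromfile's text mode is awkward for CSV that
mixes commas and newlines, this sketch feeds genfromtxt successive slices
of lines instead; the block-at-a-time structure is the point.

import itertools
import numpy as np
import tables

dtype = np.dtype([("id", "i8"), ("value", "f8"), ("flag", "i8")])
CHUNK = 100000  # rows per block, as recommended above

with open("input.csv") as csvfile, \
        tables.open_file("records.h5", mode="w") as h5:
    table = h5.create_table("/", "records", description=dtype)
    while True:
        # Pull the next CHUNK lines without reading the whole file.
        lines = list(itertools.islice(csvfile, CHUNK))
        if not lines:
            break
        # genfromtxt accepts a list of strings; filling_values supplies
        # per-column defaults for missing fields.
        block = np.genfromtxt(lines, delimiter=",", dtype=dtype,
                              filling_values={0: -1, 1: 0.0, 2: 0})
        table.append(np.atleast_1d(block))  # one write per chunk
    table.flush()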

best regards

antonio