Discussion:
[pytables-users] values (unicode) missing when loaded from Pandas HDF5 file
Kevad
2015-09-15 12:45:27 UTC
Hello,

I have some Twitter feeds loaded in a Pandas Series that I would like to
store in an HDF5 file. Some of the tweets are Norwegian and hence need to be
encoded. Since I am using Python 3.3.x, where strings are Unicode by
default, I guess I need not worry about that (?). Assuming PyTables supports
Unicode columns (even though they are 'str') in Python 3, I saved them in
an HDF5 file. But when loading them back, some of the values are missing.
# A sample of the data
feeds[80:90]
80 BØR MAN STARTE en tweet med store bokstaver? F...
81 @NRKSigrid @audunlysbakken Har du husket Per S...
82 Lurer på om IS har fått med seg kaoset ved Eur...
83 synes han hørte på P3 at Opoku uttales Opoko. ...
84 De statsbærende partiene Ap og Høyre må ta sky...
85 April 2014. Blir MDG det nye arbeider @partiet...
86 MDG: Hasj for kjøtt. #valg2015
87 Grønt skifte.. https://t.co/OuM8quaMz0
88 Kinderegg https://t.co/AsECmw2sV9
89 MDG for honning, frukt og grønt. https://t.co/...
Name: feeds, dtype: object
store = pd.HDFStore('feed.hd5')
store.append('feed', feeds[84:86], min_itemsize=200)
store.close()
pd.read_hdf('feed.hd5', 'feed')
84
85 April 2014. Blir MDG det nye arbeider @partiet...
Name: feeds, dtype: object
feeds[84:86].to_hdf('feed.hd5', 'feed', format='table',
data_columns=True)
pd.read_hdf('feed.hd5', 'feed')
But if I change the slice to, say, *[84:87]*, the value of row *84* is
now loaded.
feeds[84:87].to_hdf('feed.hd5', 'feed', format='table',
data_columns=True)
res = pd.read_hdf('feed.hd5', 'feed')
res
84 De statsbærende partiene Ap og Høyre må ta sky...
85 April 2014. Blir MDG det nye arbeider @partiet...
86 MDG: Hasj for kjøtt. #valg2015
Name: feeds, dtype: object

But the loaded string is missing some characters when compared with the
original:
# Original tweet (Length: 140)
print(feeds[84])
De statsbærende partiene Ap og Høyre må ta skylda for Miljøpartiets
fremgang. Velgerne har sett at SV og V ikke vinner frem i miljøspørsmål.
# Tweet after loading from HDF5 file (Length: 134)
print(res[84])
De statsbærende partiene Ap og Høyre må ta skylda for Miljøpartiets
fremgang. Velgerne har sett at SV og V ikke vinner frem i miljøspø

Can anyone explain this and let me know how can I avoid it ?

I am using
Pandas: 0.16.2, Python: 3.3.5, PyTables: 3.2.0

Thanks.
--
You received this message because you are subscribed to the Google Groups "pytables-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pytables-users+***@googlegroups.com.
To post to this group, send an email to pytables-***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Kevad
2015-09-15 15:10:26 UTC
This is also posted as an SO question here:
http://stackoverflow.com/questions/32553207/values-missing-when-loaded-from-pandas-hdf5-file

For reference, the feeds file discussed above (the one that contains
those tweets) is attached below.
Load it with:
feeds = np.load('feeds.npy')

- Kevad.
Andrea Bedini
2015-09-16 23:44:52 UTC
Hi Kevad,

thank you very much for the detailed report, I will look into this.

Cheers,
Andrea

--
Andrea Bedini
@andreabedini, http://www.andreabedini.com

See the impact of my research at https://impactstory.org/AndreaBedini
use https://keybase.io/andreabedini to send me encrypted messages
Key fingerprint = 17D5 FB49 FA18 A068 CF53 C5C2 9503 64C1 B2D5 9591
Kevad
2015-09-16 23:49:03 UTC
Thanks Andrea... :) I look forward to it. Please don't forget to follow up on
the SO question as well, if possible. :)
Andrea Bedini
2015-09-18 01:34:42 UTC
I have only just started looking at this, but it seems that PyTables stores the tweets correctly:

In [69]: h5.root.feeds.table[:]
Out[69]:
array([ (84, b'De statsb\xc3\xa6rende partiene Ap og H\xc3\xb8yre m\xc3\xa5 ta skylda for Milj\xc3\xb8partiets fremgang. Velgerne har sett at SV og V ikke vinner frem i milj\xc3\xb8sp\xc3\xb8rsm\xc3\xa5l.'),
(85, b'April 2014. Blir MDG det nye arbeider @partiet? http://t.co/TpJwUhVyVM')],
dtype=[('index', '<i8'), ('feeds', 'S200')])

In [71]: h5.root.feeds.table[0][1].decode('utf-8')
Out[71]: 'De statsbærende partiene Ap og Høyre må ta skylda for Miljøpartiets fremgang. Velgerne har sett at SV og V ikke vinner frem i miljøspørsmål.'

so something is going wrong in the roundtrip through HDFStore.
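
A quick check that may help narrow this down: the two lengths quoted in the original post (140 characters before, 134 after) are consistent with the UTF-8 encoded tweet being cut at 140 bytes somewhere in the round trip and then decoded. The sketch below only reproduces that arithmetic in plain Python. It does not call pandas or PyTables, and the 140-byte limit (`ITEMSIZE`, a name made up here) is an assumption inferred from the numbers, not a confirmed diagnosis.

```python
# Tweet 84, as decoded from the raw PyTables table above.
tweet = (
    "De statsbærende partiene Ap og Høyre må ta skylda for Miljøpartiets "
    "fremgang. Velgerne har sett at SV og V ikke vinner frem i miljøspørsmål."
)

encoded = tweet.encode("utf-8")

# Each of æ, ø, å takes two bytes in UTF-8, so bytes outnumber characters.
n_chars = len(tweet)
n_bytes = len(encoded)

# Assumption: somewhere the field width is taken from the character count
# instead of the encoded byte count.
ITEMSIZE = n_chars

# Cut the byte string at that width and decode what is left.
# errors="ignore" drops a trailing partial character if the cut splits one.
truncated = encoded[:ITEMSIZE].decode("utf-8", errors="ignore")

print(n_chars, n_bytes, len(truncated))  # 140 147 134
print(truncated[-8:])                    # miljøspø
```

The truncated string ends in "miljøspø", exactly where the value read back from the HDF5 file stops, so a width computed in characters but applied to bytes would explain the six missing characters.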

Also, I am a bit confused by the output of h5dump:

DATASET "/feed/table" {
   DATATYPE H5T_COMPOUND {
      H5T_STD_I64LE "index";
      H5T_STRING {
         STRSIZE 200;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      } "feeds";
   }
   DATASPACE SIMPLE { ( 2 ) / ( H5S_UNLIMITED ) }
   DATA {
   (0): {
         84,
         "De statsb\37777777703\37777777646rende partiene Ap og H\37777777703\37777777670yre m\37777777703\37777777645 ta skylda for Milj\37777777703\37777777670partiets fremgang. Velgerne har sett at SV og V ikke vinner frem i milj\37777777703\37777777670sp\37777777703\37777777670rsm\37777777703\37777777645l."
      },
   (1): {
         85,
         "April 2014. Blir MDG det nye arbeider @partiet? http://t.co/TpJwUhVyVM"
      }
   }
   ...
}

The datatype says the string is ASCII (which it is not), and I have no idea what the \37... sequences are.
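
On the \37... sequences: they look like the individual UTF-8 bytes printed in octal after being sign-extended to 32 bits, which is what one would expect if each byte were treated as a signed char (an assumption about h5dump's behaviour, not something checked against its source). The sketch below reverses that in plain Python: it masks each octal escape back to its low 8 bits and decodes the result as UTF-8. The helper name `undo_octal_escapes` is made up for this illustration.

```python
import re


def undo_octal_escapes(dumped: str) -> str:
    """Turn h5dump-style octal escapes such as \\37777777703 back into text.

    Assumption: each escape is one byte, sign-extended to 32 bits and
    printed in octal, so the original byte is the low 8 bits of the value.
    """
    out = bytearray()
    pos = 0
    for match in re.finditer(r"\\([0-7]{11})", dumped):
        # Copy the plain ASCII text that precedes this escape.
        out += dumped[pos:match.start()].encode("ascii")
        # Keep only the low 8 bits of the 32-bit value.
        out.append(int(match.group(1), 8) & 0xFF)
        pos = match.end()
    # Copy whatever ASCII text follows the last escape.
    out += dumped[pos:].encode("ascii")
    return out.decode("utf-8")


# Fragments copied from the h5dump output above (raw strings keep the backslashes).
print(undo_octal_escapes(r"statsb\37777777703\37777777646rende"))  # statsbærende
print(undo_octal_escapes(r"H\37777777703\37777777670yre"))         # Høyre
print(oct(0xFFFFFFC3))  # 0o37777777703, i.e. byte 0xC3 sign-extended to 32 bits
```

Read this way, the dump shows the same bytes as the `b'...\xc3\xa6...'` values in the PyTables output: `\37777777703\37777777646` is `\xc3\xa6`, the UTF-8 encoding of "æ".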

I should add that I reckon the state of Unicode support in PyTables and HDF5 is a bit sad (I'm speaking for myself here). Basically, we still live in the fantasy world of "plain text", amended with some support for UTF-8 (which requires almost none by design).

Please have a look at the bugs labelled "strings" if you feel like helping out: https://github.com/PyTables/PyTables/labels/strings .


Andrea

Andrea Bedini
2015-09-18 05:12:44 UTC
Permalink
Jeff Reback has opened a bug on Pandas. See https://github.com/pydata/pandas/issues/11126

Andrea
