Decompressing raw vectors in memory

Hadley Wickham-2
Hi all,

I'm struggling to decompress a gzip'd raw vector in memory:

content <- readBin("http://httpbin.org/gzip", "raw", 1000)

memDecompress(content, type = "gzip")
# Error in memDecompress(content, type = "gzip") :
#  internal error -3 in memDecompress(2)

I'm reasonably certain that the file is correctly compressed, because
if I save it out to a file, I can read the uncompressed data:

tmp <- tempfile()
writeBin(content, tmp)
readLines(tmp)

So that suggests I'm using memDecompress incorrectly.  Any hints?

Thanks!

Hadley

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: Decompressing raw vectors in memory

Prof Brian Ripley
On 02/05/2012 14:24, Hadley Wickham wrote:

> Hi all,
>
> I'm struggling to decompress a gzip'd raw vector in memory:
>
> content<- readBin("http://httpbin.org/gzip", "raw", 1000)
>
> memDecompress(content, type = "gzip")
> # Error in memDecompress(content, type = "gzip") :
> #  internal error -3 in memDecompress(2)
>
> I'm reasonably certain that the file is correctly compressed, because
> if I save it out to a file, I can read the uncompressed data:
>
> tmp<- tempfile()
> writeBin(content, tmp)
> readLines(tmp)
>
> So that suggests I'm using memDecompress incorrectly.  Any hints?

Headers.

> Thanks!
>
> Hadley
>


--
Brian D. Ripley,                  [hidden email]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


Re: Decompressing raw vectors in memory

Hadley Wickham-2
>> I'm struggling to decompress a gzip'd raw vector in memory:
>>
>> content<- readBin("http://httpbin.org/gzip", "raw", 1000)
>>
>> memDecompress(content, type = "gzip")
>> # Error in memDecompress(content, type = "gzip") :
>> #  internal error -3 in memDecompress(2)
>>
>> I'm reasonably certain that the file is correctly compressed, because
>> if I save it out to a file, I can read the uncompressed data:
>>
>> tmp<- tempfile()
>> writeBin(content, tmp)
>> readLines(tmp)
>>
>> So that suggests I'm using memDecompress incorrectly.  Any hints?
>
> Headers.

Looking at http://tools.ietf.org/html/rfc1952:

* the first two bytes are id1 and id2, which are 1f 8b as expected

* the third byte is the compression: deflate (as.integer(content[3]))

* the fourth byte is the flag

  rawToBits(content[4])
  [1] 00 00 00 00 00 00 00 00

  which indicates no extra header fields are present

So the header looks ok to me (with my limited knowledge of gzip)
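For reference, the header checks above can be sketched as a small helper. This is only a sketch under stated assumptions: `parse_gzip_header()` is a hypothetical name, the sample data is a locally written gzip file rather than the httpbin response, and only the fixed 10-byte part of the RFC 1952 header is decoded.

```r
# Build a small gzip file so the example has bytes to inspect
# (a stand-in for the httpbin response, which is not fetched here).
tmp <- tempfile(fileext = ".gz")
con <- gzfile(tmp, "wb")
writeLines("hello gzip", con)
close(con)

# raw = TRUE asks file() not to decompress transparently, so we get the
# compressed bytes exactly as stored on disk.
con <- file(tmp, "rb", raw = TRUE)
content <- readBin(con, what = "raw", n = file.info(tmp)$size)
close(con)
unlink(tmp)

# Hypothetical helper: decode the fixed 10-byte gzip header (RFC 1952).
parse_gzip_header <- function(x) {
  stopifnot(is.raw(x), length(x) >= 10)
  flags <- as.integer(x[4])
  list(
    magic_ok = identical(x[1:2], as.raw(c(0x1f, 0x8b))),  # ID1, ID2
    method   = as.integer(x[3]),                # CM: 8 means deflate
    ftext    = bitwAnd(flags, 1L)  != 0L,       # FLG bit 0
    fhcrc    = bitwAnd(flags, 2L)  != 0L,       # FLG bit 1
    fextra   = bitwAnd(flags, 4L)  != 0L,       # FLG bit 2
    fname    = bitwAnd(flags, 8L)  != 0L,       # FLG bit 3
    fcomment = bitwAnd(flags, 16L) != 0L,       # FLG bit 4
    xfl      = as.integer(x[9]),   # 2 = max compression, 4 = fastest
    os       = as.integer(x[10])
  )
}

hdr <- parse_gzip_header(content)
str(hdr)
```

If any of the optional flags were set, the compressed data would start later than byte 11, so blindly dropping the first 10 bytes would not be a safe way to strip the header in general.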

Stripping off the header doesn't seem to help either:

memDecompress(content[-(1:10)], type = "gzip")
# Error in memDecompress(content[-(1:10)], type = "gzip") :
#  internal error -3 in memDecompress(2)

I've read the help for memDecompress but I don't see anything there to help me.

Any more hints?

Thanks!

Hadley


Re: Decompressing raw vectors in memory

Prof Brian Ripley
On 02/05/2012 16:43, Hadley Wickham wrote:

>>> I'm struggling to decompress a gzip'd raw vector in memory:
>>>
>>> content<- readBin("http://httpbin.org/gzip", "raw", 1000)
>>>
>>> memDecompress(content, type = "gzip")
>>> # Error in memDecompress(content, type = "gzip") :
>>> #  internal error -3 in memDecompress(2)
>>>
>>> I'm reasonably certain that the file is correctly compressed, because
>>> if I save it out to a file, I can read the uncompressed data:
>>>
>>> tmp<- tempfile()
>>> writeBin(content, tmp)
>>> readLines(tmp)
>>>
>>> So that suggests I'm using memDecompress incorrectly.  Any hints?
>>
>> Headers.
>
> Looking at http://tools.ietf.org/html/rfc1952:
>
> * the first two bytes are id1 and id2, which are 1f 8b as expected
>
> * the third byte is the compression: deflate (as.integer(content[3]))
>
> * the fourth byte is the flag
>
>    rawToBits(content[4])
>    [1] 00 00 00 00 00 00 00 00
>
>    which indicates no extra header fields are present
>
> So the header looks ok to me (with my limited knowledge of gzip)
>
> Stripping off the header doesn't seem to help either:
>
> memDecompress(content[-(1:10)], type = "gzip")
> # Error in memDecompress(content[-(1:10)], type = "gzip") :
> #  internal error -3 in memDecompress(2)
>
> I've read the help for memDecompress but I don't see anything there to help me.
>
> Any more hints?

Well, it seems what you get there depends on the client, but I did

tystie% curl -o foo "http://httpbin.org/gzip"
tystie% file foo
foo: gzip compressed data, last modified: Wed May  2 17:06:24 2012, max
compression

and the final part worried me: I do not know if memDecompress() knows
about that format.  The help page does not claim it can do anything
other than de-compress the results of memCompress() (although past
experience has shown that it can in some cases).  gzfile() supports a
much wider range of formats.
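To illustrate the documented case, here is a minimal round trip through memCompress() and memDecompress(). It is a sketch with made-up sample data; the comment about the leading byte reflects my understanding that memCompress(type = "gzip") emits a zlib-wrapped stream rather than a gzip file with the 1f 8b magic bytes.

```r
# Sample payload (invented for the example).
x <- charToRaw(paste(rep("decompress me ", 50), collapse = ""))

# Round trip: this is the case the help page actually promises.
packed   <- memCompress(x, type = "gzip")
unpacked <- memDecompress(packed, type = "gzip")
identical(unpacked, x)

# The stream starts with a zlib header byte (0x78), not the gzip
# magic bytes 1f 8b that a .gz file or an HTTP gzip body begins with.
packed[1:2]
```

So a gzip file header in front of the deflate data is something memDecompress() may not expect, which would be consistent with the error reported above.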



Re: Decompressing raw vectors in memory

Hadley Wickham-2
> Well, it seems what you get there depends on the client, but I did
>
> tystie% curl -o foo "http://httpbin.org/gzip"
> tystie% file foo
> foo: gzip compressed data, last modified: Wed May  2 17:06:24 2012, max
> compression
>
> and the final part worried me: I do not know if memDecompress() knows about
> that format.  The help page does not claim it can do anything other than
> de-compress the results of memCompress() (although past experience has shown
> that it can in some cases).  gzfile() supports a much wider range of
> formats.

Ah, ok.  Thanks.  Then in that case it's probably just as easy to save
it to a temp file and read that.

  con <- file(tmp) # R automatically detects compression
  open(con, "rb")
  on.exit(close(con), TRUE)

  readBin(con, raw(), file.info(tmp)$size * 10)

The only challenge is figuring out what n to give readBin, since the
decompressed size isn't known in advance. Is there a good general
strategy for this?  Guess based on the file size and then iterate
until the result of readBin has length less than n?

  n <- file.info(tmp)$size * 2
  content <- readBin(con, raw(),  n)
  n_read <- length(content)
  while(n_read == n) {
    more <- readBin(con, raw(),  n)
    content <- c(content, more)
    n_read <- length(more)
  }

Which is not great style, but there shouldn't be many reads.
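One alternative to guessing n from the compressed size is to read fixed-size chunks until readBin() returns nothing. The following is a sketch only: `read_all_raw()` and its `chunk` argument are hypothetical names, and the sample file is generated locally for the example.

```r
# Hypothetical helper: read the whole decompressed content of a gzip
# file as a raw vector, without knowing its size in advance.
read_all_raw <- function(path, chunk = 65536L) {
  con <- gzfile(path, "rb")
  on.exit(close(con), add = TRUE)
  chunks <- list()
  repeat {
    piece <- readBin(con, what = "raw", n = chunk)
    if (length(piece) == 0L) break          # nothing left to read
    chunks[[length(chunks) + 1L]] <- piece  # collect, combine once at the end
  }
  if (length(chunks)) do.call(c, chunks) else raw(0)
}

# Usage: write a payload larger than one chunk, then read it back.
payload <- as.raw(rep(0:255, length.out = 200000))
tmp <- tempfile(fileext = ".gz")
con <- gzfile(tmp, "wb")
writeBin(payload, con)
close(con)

result <- read_all_raw(tmp)
unlink(tmp)
length(result)
```

Collecting the pieces in a list and combining them once avoids growing the raw vector with c() on every iteration, and stopping on a zero-length read means the loop does not depend on the compression ratio.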

Hadley



Re: Decompressing raw vectors in memory

Duncan Temple Lang
I understand the desire not to have any dependency on additional
packages, and I have no desire to engage in any "mine's better" exchanges.
So I write this just for the record.
The gunzip() function handles this.

> library(RCurl); library(Rcompression)
> val = getURLContent("http://httpbin.org/gzip")
> cat(gunzip(val))
{
  "origin": "24.5.119.171",
  "headers": {
    "Content-Length": "",
    "Host": "httpbin.org",
    "Content-Type": "",
    "Connection": "keep-alive",
    "Accept": "*/*"
  },
  "gzipped": true,
  "method": "GET"
}


Just FWIW, as I really don't like writing to temporary files,
mostly so that we might move towards security in R.
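For completeness, base R may also be able to do this in memory by wrapping a raw connection in gzcon(). This is a sketch, not something tested against the httpbin response: `gunzip_raw()` is a hypothetical name, and the temporary file below is used only to manufacture sample gzip bytes for the example.

```r
# Hypothetical helper: decompress gzip-format bytes held in a raw vector.
gunzip_raw <- function(x, chunk = 65536L) {
  stopifnot(is.raw(x))
  con <- gzcon(rawConnection(x))   # decompress while reading from memory
  on.exit(close(con), add = TRUE)  # closing gzcon also closes the raw connection
  chunks <- list()
  repeat {
    piece <- readBin(con, what = "raw", n = chunk)
    if (length(piece) == 0L) break
    chunks[[length(chunks) + 1L]] <- piece
  }
  if (length(chunks)) do.call(c, chunks) else raw(0)
}

# Sample data only: produce gzip-format bytes to feed the helper.
tmp <- tempfile(fileext = ".gz")
con <- gzfile(tmp, "wb")
writeBin(charToRaw("in-memory gzip"), con)
close(con)
con <- file(tmp, "rb", raw = TRUE)   # raw = TRUE: keep the bytes compressed
gz_bytes <- readBin(con, what = "raw", n = file.info(tmp)$size)
close(con)
unlink(tmp)

text <- rawToChar(gunzip_raw(gz_bytes))
text
```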

   D.



Re: Decompressing raw vectors in memory

Hadley Wickham-2
> I understand the desire not to have any dependency on additional
> packages, and I have no desire to engage in any "mine's better" exchanges.
> So I write this just for the record.
> The gunzip() function handles this.

Funnily enough I just discovered that RCurl already handles this: you
just need to set encoding = "gzip".  No extra dependencies, and yours
is better ;)

Hadley
