|
Hi all,
I'm struggling to decompress a gzip'd raw vector in memory:

  content <- readBin("http://httpbin.org/gzip", "raw", 1000)

  memDecompress(content, type = "gzip")
  # Error in memDecompress(content, type = "gzip") :
  #   internal error -3 in memDecompress(2)

I'm reasonably certain that the file is correctly compressed, because
if I save it out to a file, I can read the uncompressed data:

  tmp <- tempfile()
  writeBin(content, tmp)
  readLines(tmp)

So that suggests I'm using memDecompress incorrectly. Any hints?

Thanks!

Hadley

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel |
|
On 02/05/2012 14:24, Hadley Wickham wrote:
> So that suggests I'm using memDecompress incorrectly. Any hints?

Headers.

--
Brian D. Ripley, [hidden email]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595 |
|
>> So that suggests I'm using memDecompress incorrectly. Any hints?
>
> Headers.

Looking at http://tools.ietf.org/html/rfc1952:

* the first two bytes are id1 and id2, which are 1f 8b as expected

* the third byte is the compression: deflate (as.integer(content[3]))

* the fourth byte is the flag

    rawToBits(content[4])
    [1] 00 00 00 00 00 00 00 00

  which indicates no extra header fields are present

So the header looks ok to me (with my limited knowledge of gzip).

Stripping off the header doesn't seem to help either:

  memDecompress(content[-(1:10)], type = "gzip")
  # Error in memDecompress(content[-(1:10)], type = "gzip") :
  #   internal error -3 in memDecompress(2)

I've read the help for memDecompress but I don't see anything there to help me.

Any more hints?

Thanks!

Hadley |
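The RFC 1952 header checks listed in this message can be sketched in base R. This is only an illustration and not code from the thread: it assumes a small gzip file written locally with gzfile() rather than the bytes fetched from http://httpbin.org/gzip, and the names `id`, `method`, `flags` and `flag_set` are made up for the example.

```r
# Build a small gzip stream locally so the example needs no network access.
tmp <- tempfile(fileext = ".gz")
out <- gzfile(tmp, "wb")
writeLines("hello gzip", out)
close(out)

# readBin() on a file name opens the file in binary mode, so these are the
# compressed bytes as stored on disk.
content <- readBin(tmp, "raw", file.info(tmp)$size)
unlink(tmp)

id     <- content[1:2]            # ID1, ID2: the gzip magic number 1f 8b
method <- as.integer(content[3])  # CM: 8 means deflate
flags  <- rawToBits(content[4])   # FLG, least significant bit first

# RFC 1952 assigns bits 0 to 4 of FLG to optional header parts.
flag_set <- as.logical(flags[1:5])
names(flag_set) <- c("FTEXT", "FHCRC", "FEXTRA", "FNAME", "FCOMMENT")

print(id)
print(method)
print(flag_set)
```

If every entry of `flag_set` is FALSE, the fixed part of the header is 10 bytes long, which is the assumption behind `content[-(1:10)]` in the message.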
|
On 02/05/2012 16:43, Hadley Wickham wrote:
> Any more hints?

Well, it seems what you get there depends on the client, but I did

  tystie% curl -o foo "http://httpbin.org/gzip"
  tystie% file foo
  foo: gzip compressed data, last modified: Wed May 2 17:06:24 2012, max compression

and the final part worried me: I do not know if memDecompress() knows about
that format. The help page does not claim it can do anything other than
de-compress the results of memCompress() (although past experience has shown
that it can in some cases). gzfile() supports a much wider range of formats.

--
Brian D. Ripley |
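The one guarantee mentioned here, that memDecompress() reverses memCompress(), can be checked with a short round trip. This is a sketch rather than anything posted in the thread; the test string and variable names are illustrative. The final comparison is an added observation: the output of memCompress() is not expected to begin with the gzip magic number 1f 8b that the downloaded content starts with, which may be part of why the two do not mix.

```r
original <- charToRaw(paste(rep("gzip me ", 100), collapse = ""))

# Round trip through the in-memory compressor and decompressor.
packed   <- memCompress(original, type = "gzip")
unpacked <- memDecompress(packed, type = "gzip")
round_trip_ok <- identical(unpacked, original)

# Compare the leading bytes with the gzip magic number from RFC 1952.
gzip_magic     <- as.raw(c(0x1f, 0x8b))
has_gzip_magic <- identical(packed[1:2], gzip_magic)

print(round_trip_ok)
print(packed[1:2])
print(has_gzip_magic)
```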
|
> The help page does not claim it can do anything other than
> de-compress the results of memCompress() (although past experience has shown
> that it can in some cases). gzfile() supports a much wider range of
> formats.

Ah, ok. Thanks. Then in that case it's probably just as easy to save
it to a temp file and read that.

  con <- file(tmp)  # R automatically detects compression
  open(con, "rb")
  on.exit(close(con), TRUE)

  readBin(con, raw(), file.info(tmp)$size * 10)

The only challenge is figuring out what n to give readBin. Is there a
good general strategy for this? Guess based on the file size and then
iterate until result of readBin has length less than n?

  n <- file.info(tmp)$size * 2
  content <- readBin(con, raw(), n)
  n_read <- length(content)
  while (n_read == n) {
    more <- readBin(con, raw(), n)
    content <- c(content, more)
    n_read <- length(more)
  }

Which is not great style, but there shouldn't be many reads.

Hadley |
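One possible shape for the "read until exhausted" strategy asked about here is to read fixed-size chunks until readBin() returns nothing, collecting them in a list so the result is only concatenated once. This is a sketch under assumptions, not a recommendation from the thread: `read_all_raw` and `chunk_size` are hypothetical names, and the gzip file is generated locally instead of being downloaded.

```r
# Hypothetical helper: read everything from an open binary connection.
read_all_raw <- function(con, chunk_size = 65536L) {
  chunks <- list()
  repeat {
    chunk <- readBin(con, raw(), chunk_size)
    if (length(chunk) == 0L) break   # nothing left to read
    chunks[[length(chunks) + 1L]] <- chunk
  }
  if (length(chunks) == 0L) raw(0) else unlist(chunks)
}

# Stand-in for the downloaded file: a locally written gzip file.
payload <- "some repetitive payload"
tmp <- tempfile(fileext = ".gz")
out <- gzfile(tmp, "wb")
writeLines(rep(payload, 1000), out)
close(out)

# gzfile() decompresses on the fly, so the chunks are uncompressed bytes.
con <- gzfile(tmp, "rb")
content <- read_all_raw(con, chunk_size = 1024L)
close(con)
unlink(tmp)

print(length(content))
```

Stopping on an empty read rather than a short one avoids having to guess n from the compressed file size.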
|
I understand the desire not to have any dependency on additional
packages, and I have no desire to engage in any "mine's better" exchanges.
So I write this just for the record.

The gunzip() function handles this.

  > library(RCurl); library(Rcompression)
  > val = getURLContent("http://httpbin.org/gzip")
  > cat(gunzip(val))
  {
    "origin": "24.5.119.171",
    "headers": {
      "Content-Length": "",
      "Host": "httpbin.org",
      "Content-Type": "",
      "Connection": "keep-alive",
      "Accept": "*/*"
    },
    "gzipped": true,
    "method": "GET"
  }

Just FWIW, as I really don't like writing to temporary files, mostly so that
we might move towards security in R.

D. |
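For readers who want to avoid temporary files without extra packages, one base-R possibility is to wrap the raw vector in a rawConnection() and decompress through gzcon(). This is a different technique from the Rcompression call shown in this message and was not proposed in the thread; it is offered only as a sketch. The temp file below exists solely to manufacture gzip bytes for the example and stands in for the downloaded `content`.

```r
# Manufacture gzip bytes for the example (stand-in for the downloaded data).
tmp <- tempfile(fileext = ".gz")
out <- gzfile(tmp, "wb")
writeLines(c("first line", "second line"), out)
close(out)
content <- readBin(tmp, "raw", file.info(tmp)$size)
unlink(tmp)

# Decompress in memory: gzcon() inflates whatever the wrapped connection
# supplies, and rawConnection() serves the bytes of the raw vector.
con <- gzcon(rawConnection(content))
lines <- readLines(con)
close(con)

print(lines)
```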
|
> I understand the desire not to have any dependency on additional
> packages, and I have no desire to engage in any "mine's better" exchanges.
> So I write this just for the record.
> The gunzip() function handles this.

Funnily enough I just discovered that RCurl already handles this: you
just need to set encoding = "gzip". No extra dependencies, and yours
is better ;)

Hadley |
