A warning in gzcon but not in gzfile

Wang Jiefei
Hi all,

I used `gzfile` and `gzcon` to read the same compressed data, but `gzcon`
gave me a different result from `gzfile`. It looks as if `gzcon` does not
handle the data correctly. I have posted an example below. In the example,
up to 4 MiB from the start of a compressed file is downloaded from Google
Cloud as a raw vector, and the data is saved to a temp file. If I use
`gzfile` to read the file, it returns the first 1000 lines successfully.
However, if I wrap the raw vector in a raw connection and use `gzcon` to
read from that connection, it returns only the first 884 lines along with a
warning (see the output).

code:

> # install.packages("BiocManager")
> # BiocManager::install("GCSConnection", version = "devel")
> library(GCSConnection)
> ## Download data from cloud
> uri <-
> "gs://gnomad-public/release/3.0/vcf/genomes/gnomad.genomes.r3.0.sites.chr1.vcf.bgz"
> con <- gcs_connection(uri)
> data <- readBin(con, raw(), 4*1024*1024)
> close(con)
>
> ## Write data to a file
> file_path <- tempfile()
> writeBin(data, file_path)
>
> ## Read the data using `gzfile`
> con1 <- gzfile(file_path)
> str(readLines(con1, 1000))
>
> ## Read the data using `gzcon`
> ## We create a raw connection from the raw vector
> con2 <- gzcon(rawConnection(data))
> str(readLines(con2, 1000))


output:

> > str(readLines(con1, 1000))
>  chr [1:1000] "##fileformat=VCFv4.2" "##hailversion=0.2.24-9cd88d97bedd" ...
> > str(readLines(con2, 1000))
>  chr [1:884] "##fileformat=VCFv4.2" "##hailversion=0.2.24-9cd88d97bedd" ...
> Warning message:
> In readLines(con2, 1000) : incomplete final line found on 'gzcon(data)'
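
One possible factor, which I have not confirmed: a `.bgz` file is block-gzipped (BGZF), i.e. a series of concatenated gzip members rather than a single gzip stream. The sketch below is a way to check this locally without any cloud access. It builds a small file from two gzip members and compares how many lines `gzfile` and `gzcon` return. The file names and the `line N` contents are made up for illustration, and I am assuming that two concatenated members are a fair stand-in for the structure of the real file.

```r
## Sketch: compare gzfile() and gzcon() on a file made of two
## concatenated gzip members (assumed stand-in for a .bgz file).
part <- tempfile()
path <- tempfile(fileext = ".gz")

## First gzip member: lines 1 to 5
con <- gzfile(part, "wb")
writeLines(sprintf("line %d", 1:5), con)
close(con)
m1 <- readBin(part, raw(), file.size(part))

## Second gzip member: lines 6 to 10
con <- gzfile(part, "wb")
writeLines(sprintf("line %d", 6:10), con)
close(con)
m2 <- readBin(part, raw(), file.size(part))

## Concatenate the two members into one file
writeBin(c(m1, m2), path)
data <- readBin(path, raw(), file.size(path))

## Read through gzfile()
con1 <- gzfile(path)
n_file <- length(readLines(con1))
close(con1)

## Read through gzcon() on a raw connection
con2 <- gzcon(rawConnection(data))
n_con <- length(readLines(con2))
close(con2)

cat("gzfile:", n_file, "lines; gzcon:", n_con, "lines\n")
```

If the two counts differ here as well, the discrepancy would not be specific to GCSConnection or to this particular file.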


I am not sure whether this is caused by a bug in `gzcon` or by my misuse of
the function. The same result can be observed in R 4.0 and R-devel (4.1) on
Windows. My session info is below in case it is helpful. Any suggestions and
help would be appreciated.

> R Under development (unstable) (2020-06-27 r78747)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 10 x64 (build 18363)
>
> Matrix products: default
>
> locale:
> [1] LC_COLLATE=English_United States.1252
> [2] LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
> system code page: 65001



Best,
Jiefei


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel