Using read.table for importing gz file

Spencer Brackett
Hello,

I am trying to read the following Xena dataset into R for data analysis:
https://tcga.xenahubs.net/download/TCGA.GBMLGG.sampleMap/HumanMethylation450.gz

I tried to run the following read.table(gzfile("HumanMethylation450.gz")),
but R ended up crashing as a result.

Is there perhaps a way to use read.table together with fread to do this?

Many thanks,

Spencer

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Using read.table for importing gz file

Wetherby
Unsubscribe


Re: Using read.table for importing gz file

David Winsemius
In reply to this post by Spencer Brackett
Have you tried using readLines in the manner illustrated on the ?gzfile
help page?
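
To make that suggestion concrete, here is a minimal sketch of reading only the first few lines through a gzfile connection. The temporary file and its three lines are made up for illustration; they stand in for the real download:

```r
# Write a tiny gzip-compressed, tab-separated file (made-up content).
tmp <- tempfile(fileext = ".gz")
con <- gzfile(tmp, "w")
writeLines(c("sample\tA\tB", "cg1\t0.1\t0.2", "cg2\t0.3\t0.4"), con)
close(con)

# Read back only the first two lines instead of the whole file.
con <- gzfile(tmp, "r")
first_lines <- readLines(con, n = 2)
close(con)

first_lines
```

Limiting n lets you peek at the layout of a large file before committing to a full read.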


David.


Re: Using read.table for importing gz file

David Winsemius
Well, let's see about "rules" ... you posted in HTML when this is a
plain-text mailing list, and then you replied only to me when you are
supposed to reply to the list (so I'm putting the list address back in
my reply).


When I copied your code and attempted a bit of debugging, I got:


 > z <- readLines(gzcon(url(“https://TCGA.xenahubs.net/download/TCGA.GBMLGG.sampleMap/HumanMethylation450.gz”)), n = 100)
Error: unexpected input in "z <- readLines(gzcon(url(�"

# that was because you had "smart-quotes" rather than ASCII quotes:


 > z <- readLines(gzcon(url('https://TCGA.xenahubs.net/download/TCGA.GBMLGG.sampleMap/HumanMethylation450.gz')), n = 100)
 > z[1:10]
  [1]
"sample\tTCGA-E1-5319-01\tTCGA-HT-7693-01\tTCGA-CS-6665-01\tTCGA-S9-A7J2-01\tTCGA-FG-A6J3-01\tTCGA-FG-6688-01\tTCGA-S9-A6TX-01\tTCGA-VM-A8C8-01\tTCGA-74-6577-01\tTCGA-06-AABW-11\tTCGA-06-0125-02\tTCGA-HT-A74L-01\tTCGA-26-A7UX-01\tTCGA-DU-A5TS-01\tTCGA-06-6388-01\tTCGA-DB-A4XA-01\tTCGA-06-A7TL-01\tTCGA-HT-A4DV-01\tTCGA-TQ-A7RP-01\tTCGA-E1-5311-01\tTCGA-28-5213-01\tTCGA-E1-A7YI-01\tTCGA-E1-5305-01\tTCGA-F6-A8O4-01\tTCGA-HT-8113-01\tTCGA-DH-A66G-01\tTCGA-76-4932-01\t

Snipped hundreds of lines. The "\t" between entries indicates that this
is a tab-separated file. Don't you have some documentation to refer to?
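
One way to check the separator rather than eyeball it is to split the header on tabs and count fields per line. This is only a sketch: the header and data line below are made-up stand-ins for z[1] and z[2], not values from the real file:

```r
# Made-up stand-ins for the first two lines returned by readLines().
header <- "sample\tTCGA-AA-0001-01\tTCGA-AA-0002-01"
lines  <- c(header, "cg00000001\t0.40\t0.93")

# Split the header on literal tabs and count the resulting fields.
fields   <- strsplit(header, "\t", fixed = TRUE)[[1]]
n_fields <- length(fields)

# count.fields() reports the number of fields on every line;
# a constant count is a good sign that the separator guess is right.
tc <- textConnection(lines)
per_line <- count.fields(tc, sep = "\t")
close(tc)

n_fields
per_line
```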


This seems possibly useful:


 > z <- read.table(text=readLines(gzcon(url('https://TCGA.xenahubs.net/download/TCGA.GBMLGG.sampleMap/HumanMethylation450.gz')), n = 100), header=TRUE, sep="\t")
 > str(z)
'data.frame':    99 obs. of  686 variables:
  $ sample         : Factor w/ 99 levels "cg00036732","cg00651829",..: 53 2 60 41 16 13 37 20 70 21 ...
  $ TCGA.E1.5319.01: num  0.4019 0.0215 0.053 0.0453 0.515 ...
  $ TCGA.HT.7693.01: num  0.9364 0.0216 0.0547 0.0819 0.6129 ...
  $ TCGA.CS.6665.01: num  0.0345 0.0164 0.0719 0.0497 0.6648 ...
  $ TCGA.S9.A7J2.01: num  0.0295 0.0168 0.0421 0.0867 0.1657 ...
  $ TCGA.FG.A6J3.01: num  0.0248 0.0161 0.0556 0.0902 0.5042 ...
  $ TCGA.FG.6688.01: num  0.0203 0.0179 0.0321 0.0513 0.1075 ...
  $ TCGA.S9.A6TX.01: num  0.0378 0.0199 0.0623 0.0992 0.7662 ...
  $ TCGA.VM.A8C8.01: num  0.0271 0.0172 0.0466 0.0564 0.3478 ...
  $ TCGA.74.6577.01: num  0.0237 0.0193 0.0196 0.0961 0.1242 ...
  $ TCGA.06.AABW.11: num  0.0323 0.0156 0.0395 0.0708 0.1136 ...
  $ TCGA.06.0125.02: num  0.0238 0.0181 0.039 0.068 0.0796 ...
  $ TCGA.HT.A74L.01: num  0.7409 0.0221 0.0596 0.0765 0.8157 ...

#snipped the output

# there seemed to be 686 columns
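
Note that the column names above show dots where the header had hyphens, and the first column came in as a factor. Both are read.table defaults (check.names = TRUE, and stringsAsFactors = TRUE in R versions before 4.0). A small sketch with made-up sample ids and values, in case you want the original names and a character column:

```r
# Made-up lines in the same layout as the real file.
sample_lines <- c(
  "sample\tTCGA-AA-0001-01\tTCGA-AA-0002-01",
  "cg00000001\t0.40\t0.93",
  "cg00000002\t0.02\t0.02"
)

# check.names = FALSE keeps the hyphens in the column names;
# stringsAsFactors = FALSE keeps the probe ids as character.
z_small <- read.table(text = sample_lines, header = TRUE, sep = "\t",
                      check.names = FALSE, stringsAsFactors = FALSE)

names(z_small)
str(z_small)
```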


--

David.



On 8/10/19 3:07 PM, Spencer Brackett wrote:

> I’ve tried z <-
> readLines(gzcon(url(“https://TCGA.xenahubs.net/download/TCGA.GBMLGG.sampleMap/HumanMethylation450.gz”)),
> n = 100)
>
> Which prints out the indicated 10 rows, but I can not seem to run the
> same code excluding the n = 100 without R stalling and me being forced
> to close the program. All I am trying to do is ensure that the whole
> file is imported into R so that I can proceed with a survival analysis.
>
> Also, what particular rule of the mailing list did I break? I
> apologize in advance, as I thought that code specific queries like the
> one I asked were acceptable.
>
> Many thanks,
>
> Spencer

Re: Using read.table for importing gz file

David Winsemius
Further note:


After three minutes of waiting  ... not a particularly long wait in my
opinion, I get this:


 > z <- read.table(text=readLines(gzcon(url('https://TCGA.xenahubs.net/download/TCGA.GBMLGG.sampleMap/HumanMethylation450.gz'))), header=TRUE, sep="\t")
 > dim(z)
[1] 485577    686

So almost half a million lines of data in a rather wide dataset for an
incompletely described file.
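
For a sense of scale, a back-of-the-envelope estimate of the memory the numeric part alone needs. It assumes every column except the first is stored as a double (8 bytes each), which matches the columns shown by str() earlier but which I have not checked for all 685:

```r
n_rows           <- 485577      # from dim(z)
n_numeric_cols   <- 686 - 1     # all columns except "sample" (assumed numeric)
bytes_per_double <- 8

n_values <- n_rows * n_numeric_cols
size_gib <- n_values * bytes_per_double / 1024^3

n_values             # number of numeric cells
round(size_gib, 2)   # approximate size in GiB
```

That is roughly 2.5 GiB before counting the character vector from readLines or any temporary copies made while parsing, which may explain why a machine with limited RAM appears to stall.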


I'd say R seems to be "working" properly.

data.table::fread was more informative about the process but achieved
basically the same result in about 1/6th the time:


?fread
system.time( z <- fread('https://TCGA.xenahubs.net/download/TCGA.GBMLGG.sampleMap/HumanMethylation450.gz', sep="\t") )

#-----------

[100%] Downloaded 597770433 bytes...
    user  system elapsed
  20.682   3.322  29.292

 > dim(z)
[1] 485577    686
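
If you want to try fread on something small first, here is a sketch on a made-up, uncompressed temporary file (sample ids and values are invented). It only runs if data.table is installed. Depending on your data.table version, reading a .gz file directly may also need the R.utils package, so check ?fread on your setup:

```r
if (requireNamespace("data.table", quietly = TRUE)) {
  # Made-up tab-separated file in the same layout as the real data.
  tmp <- tempfile(fileext = ".tsv")
  writeLines(c("sample\tTCGA-AA-0001-01\tTCGA-AA-0002-01",
               "cg00000001\t0.40\t0.93",
               "cg00000002\t0.02\t0.02"), tmp)

  # fread returns a data.table and leaves the column names unchanged.
  dt <- data.table::fread(tmp, sep = "\t")
  print(dim(dt))
  print(names(dt))
}
```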

--

David.
