Reading data from a worksheet on the Internet

Reading data from a worksheet on the Internet

Nilza BARROS
Dear R-users,

I have to read data from a worksheet that is available on the Internet. I
have been doing this by copying the worksheet from the browser, but I would
like to be able to copy the data automatically using the url() command.

When I use url(), however, the result is the page's source code, i.e. HTML.
I can see the data I need in that source code, but before trying to read
the data out of the HTML myself I wonder if there is a package or another
way to extract these data, since reading them from the code would demand
much work and might not be accurate.

Below one can see the URL from which I am trying to export the data:

dados <- url("http://www.mar.mil.br/dhn/chm/meteo/prev/dados/pnboia/sc1201_arquivos/sheet002.htm", "r")

I am looking forward to any help.

Thanks in advance ,

Nilza Barros


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: Reading data from a worksheet on the Internet

pr3d4t0r
Hi Nilza,

The URL that you posted points at a document that has another document within it, in a frame.  These files are Excel dumps into HTML.  To view the actual data you need the URIs for each data set.  Those appear at the bottom of the listing, under sc1201_arquivos/sheet001.htm and sheet002.htm.  Your code must fetch these files, not the one at http://www.mar.mil.br/dhn/chm/meteo/prev/dados/pnboia/sc1202.htm, which only "wraps" them.  Most of what you see in the file that you linked isn't HTML; it's JavaScript and style information for the data living in the two separate HTML documents.

You can do this in R using the RCurl and XML libraries, by pulling the specific file for each data source.  If this is a one-time thing, I'd suggest just coding something simple that loads the data from each file.  If this is something you'll execute periodically, you'll need a bit more code: first extract the links to the internal data sheets (e.g. the "planilhas" at the bottom), then extract the actual data.
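To make the idea concrete, here is a minimal sketch of that RCurl + XML approach.  The URL is the one from your message; treating the first parsed <table> as the data sheet is an assumption on my part, so check which element of the result actually holds your values:

```r
# Sketch only: fetch one inner sheet and parse its HTML tables.
library(RCurl)
library(XML)

# URL taken from the original message; adjust for sheet001.htm as needed.
sheet_url <- "http://www.mar.mil.br/dhn/chm/meteo/prev/dados/pnboia/sc1201_arquivos/sheet002.htm"

raw_html <- getURL(sheet_url)                   # download the page source
doc      <- htmlParse(raw_html, asText = TRUE)  # build a DOM from the raw text
tables   <- readHTMLTable(doc)                  # every <table> becomes a data frame
dados    <- tables[[1]]                         # assumption: first table is the data
head(dados)
```

readHTMLTable() returns one data frame per <table> in the page, so if the sheet has header or navigation tables you may need a different index.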

Let me know if you want this as a one-time thing, or as a reusable program.  If you don't know how to use RCurl and XML to parse HTML I'll be happy to help with that too.  I'd just like to know more about the scope of your question.

Cheers,

pr3d4t0r
Re: [R-sig-DB] Reading data from a worksheet on the Internet

Nilza BARROS
In reply to this post by Nilza BARROS
Hi,

I really appreciate your help. I definitely need a reusable program, since
I have been asking someone to extract these data from the Internet every
day; that is the reason why I am trying to write a program to do it.
Regarding the URL I sent, I have just realized that although I wrote the
address of only one worksheet (PLANILHA2), when I copy it into my browser
it shows the link with both worksheets.

I am going to read about the RCurl and XML libraries, but I hope you can
help me too.

Thanks in advance
Nilza Barros



--
Abraço,
Nilza Barros

Re: [R-sig-DB] Reading data from a worksheet on the Internet

Henrique Dallazuanna
Try the readHTMLTable function in package XML:

sheet2 <- readHTMLTable("http://www.mar.mil.br/dhn/chm/meteo/prev/dados/pnboia/sc1201_arquivos/sheet002.htm",
                        skip.rows = 2)

head(sheet2[[1]])


--
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O

Re: [R-sig-DB] Reading data from a worksheet on the Internet

CIURANA EUGENE (R)
In reply to this post by Nilza BARROS
 

On Sun, 12 Feb 2012 16:24:58 -0200, Nilza BARROS wrote:

> I really appreciate your help. I definitively need a reusable program since
> I have been asking to someone to extract these data from the Internet
> everyday. That's the reason why I am trying to do a program to do that
> Related to the url I sent, I have just realized that although I had written
> the one related to only worksheet (PLANILHA2) when I copy it to my browse
> it is showed the link with both worksheets.
>
> I am going to read about Rcurl and XML libraries but I hope you can help me
> too.

Hi again, Nilza.

I looked over this to see if there was some simpler way of doing it; I couldn't find one.

The main issue I see is that this is "HTML" generated from Excel. That means it's got a lot of "features" for navigation, formatting, and such built into the script that make it a pain to parse. I tried parsing it with both the XML R package (look at the htmlTreeParse() function) and with non-R tools like Scrapy and lxml in Python.

If you read this post after you have had a peek at the XML package, my next explanation will make more sense.

The main issue is that the DOM you're analyzing has child nodes that are either generated or repopulated from the Excel data dumped onto the file system. Walking this DOM requires not only XML but perhaps even a JavaScript interpreter to resolve some of the nodes before you get the information you want. That's why doing it from the main page is a nasty issue.

My suggestion at this time would be to focus on seeing if you can parse the individual sub-sheets. Although you may have to load each one manually, their DOM appears simpler, with less cruft specific to JavaScript/CSS/Excel/Internet Explorer. Being cleaner, they should be easier to parse with XML or any other tool. I'll take another look at the individual sheets to check whether they in fact have a simpler document model.
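In that spirit, here is a sketch of walking one sub-sheet's DOM directly, assuming the values sit in ordinary <td> cells (the XPath expression is illustrative and not verified against the actual page):

```r
# Sketch only: parse a single sub-sheet and pull its cell text via XPath.
library(XML)

sheet_url <- "http://www.mar.mil.br/dhn/chm/meteo/prev/dados/pnboia/sc1201_arquivos/sheet002.htm"

# useInternalNodes = TRUE gives a document we can query with XPath.
doc <- htmlTreeParse(sheet_url, useInternalNodes = TRUE)

# Assumption: the values live in plain <td> nodes with no script indirection.
cells <- xpathSApply(doc, "//td", xmlValue)
head(cells)
```

If the cells come back empty, that would confirm the values are script-populated and an XML-only approach won't be enough for the wrapper page.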

Cheers!

pr3d4t0r

--
pr3d4t0r at #R, ##java, #awk, #python
irc.freenode.net
 