Grab Element from Web Page


Grab Element from Web Page

Sparks, John James
Dear R Helpers,

I would like to pull the CIK number from the web page

http://www.sec.gov/cgi-bin/browse-edgar?CIK=MSFT&Find=Search&owner=exclude&action=getcompany

If you put this web page into your browser you will see the CIK number in
red on the left side of the page near the top.

When I try the basic approach
require(scrapeR)
require(XML)
require(RCurl)
doc <- htmlTreeParse("http://www.sec.gov/cgi-bin/browse-edgar?CIK=MSFT&Find=Search&owner=exclude&action=getcompany")
str(doc)

I get a large number of items in the resulting parsed document that I don't
know how to interpret. Both
tables <- readHTMLTable(doc)

and

list<-xmlToList(doc)

result in errors.

Any (positive) guidance would be much appreciated.

--John J. Sparks, Ph.D.


Re: Grab Element from Web Page

Jeffrey Dick
Hi,

There are many occurrences of the CIK number in the page source. This pulls
out the first node containing it:

node <- getNodeSet(doc[[1]], "//link[@rel='alternate']" )

From there you can extract the number. Here's one way to do it:

strsplit(strsplit(unlist(node)[[5]], "CIK=")[[1]][2], "&type")[[1]][1]
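
To make the string handling concrete, here is the same split worked through on
the href value this query returns for MSFT (the value as it appears in the page
source):

# href value from the <link rel="alternate"> element of the MSFT results page
href <- "/cgi-bin/browse-edgar?action=getcompany&CIK=0000789019&type=&dateb=&owner=exclude&count=40&output=atom"
# split at "CIK=" and keep what follows, then split at "&type" and keep what precedes
strsplit(strsplit(href, "CIK=")[[1]][2], "&type")[[1]][1]
# [1] "0000789019"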

Jeff



Re: Grab Element from Web Page

Sparks, John James
Thanks so much for looking into this for me.

Unfortunately, I get an error when I execute your code.  Is there a
library that you loaded that I haven't?

require(scrapeR)
require(XML)
require(RCurl)
doc <- htmlTreeParse("http://www.sec.gov/cgi-bin/browse-edgar?CIK=MSFT&Find=Search&owner=exclude&action=getcompany")
node <- getNodeSet(doc[[1]], "//link[@rel='alternate']" )
Error in UseMethod("xpathApply") :
  no applicable method for 'xpathApply' applied to an object of class
"character"


Guidance would be much appreciated.

--JJS




Re: Grab Element from Web Page

Jeffrey Dick
Sorry, I can't reproduce that error when running those commands in R on 64-bit
Linux. But if I move to Windows (R version 3.0.1, XML_3.98-1.1), I get a
different error ...

> require(XML)
Loading required package: XML
> doc <- htmlTreeParse("http://www.sec.gov/cgi-bin/browse-edgar?CIK=MSFT&Find=Search&owner=exclude&action=getcompany")
> node <- getNodeSet(doc[[1]], "//link[@rel='alternate']" )
Input is not proper UTF-8, indicate encoding !
Bytes: 0xC2 0x0A 0x20 0x20
Error: 1: Input is not proper UTF-8, indicate encoding !
Bytes: 0xC2 0x0A 0x20 0x20
> node <- getNodeSet(doc, "//link[@rel='alternate']" )
Error in UseMethod("xpathApply") :
  no applicable method for 'xpathApply' applied to an object of class
"XMLDocumentContent"

... note that I've tried both doc[[1]] and doc in the function call. Also,
only the XML library is required. I'm not sure what's going on with the
character encoding error; it might be my system settings. Reading the help
page (?htmlTreeParse) provides a clue: use the htmlParse function instead,
which is equivalent to setting the useInternalNodes parameter to TRUE ...
"These can then be searched using XPath expressions via 'xpathApply' and
'getNodeSet'." That seems to be relevant to this case.

> doc <- htmlParse("http://www.sec.gov/cgi-bin/browse-edgar?CIK=MSFT&Find=Search&owner=exclude&action=getcompany")
> node <- xpathSApply(doc, "//link[@rel='alternate']", xmlAttrs)
> node
      [,1]
rel   "alternate"
type  "application/atom+xml"
title "ATOM"
href  "/cgi-bin/browse-edgar?action=getcompany&CIK=0000789019&type=&dateb=&owner=exclude&count=40&output=atom"
> strsplit(strsplit(node[[4]], "CIK=")[[1]][2], "&type")[[1]][1]
[1] "0000789019"

Perhaps that approach is less prone to error.
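
An alternative that avoids the positional indexing is to take the href
attribute directly and pull out the digits with a regular expression (just a
sketch, using the same parsed doc as above):

# get the href attribute of the first matching <link> element
href <- xpathSApply(doc, "//link[@rel='alternate']", xmlGetAttr, "href")[1]
# keep only the digits that follow "CIK=" in the query string
sub(".*CIK=([0-9]+).*", "\\1", href)
# [1] "0000789019"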



Re: Grab Element from Web Page

Sparks, John James
Thanks, the second approach worked fine on Windows.
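
For the record, the working sequence rolls up into a small helper (just a
sketch; get_cik is only an illustrative name, and it assumes the EDGAR URL
format used above):

library(XML)

# Look up the zero-padded CIK that EDGAR reports for a ticker
get_cik <- function(ticker) {
  url <- paste0("http://www.sec.gov/cgi-bin/browse-edgar?CIK=", ticker,
                "&Find=Search&owner=exclude&action=getcompany")
  doc <- htmlParse(url)   # internal nodes, so XPath queries work directly
  href <- xpathSApply(doc, "//link[@rel='alternate']", xmlGetAttr, "href")[1]
  strsplit(strsplit(href, "CIK=")[[1]][2], "&type")[[1]][1]
}

get_cik("MSFT")
# [1] "0000789019"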

--JJS
