prevent XML::readHTMLTable from suppressing <br/>

Spencer Graves-4
Hello, All:


       Thanks to Rasmus Liland, William Michels, and Luke Tierney for
their help with my earlier web scraping question.  I've made progress,
but I still have a problem:  one field contains "<br/>", which gets
suppressed by XML::readHTMLTable:


sosURL <-
"https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
sosChars <- RCurl::getURL(sosURL)
MOcan <- XML::readHTMLTable(sosChars)
MOcan[[2]][1, 2]
[1] "4476 FIVE MILE RDSENECA MO 64865"


(Seneca <- regexpr('SENECA', sosChars))
substring(sosChars, Seneca-22, Seneca+14)


[1] "4476 FIVE MILE RD<br/>SENECA MO 64865"


       How can I get essentially the same result but without having
XML::readHTMLTable suppress "<br/>"?


NOTE:  I get something very similar with xml2::read_html and
rvest::html_table:


sosPointers <- xml2::read_html(sosChars)
MOcan2 <- rvest::html_table(sosPointers)
MOcan2[[2]][1, 2]
[1] "4476 FIVE MILE RDSENECA MO 64865"


       MOcan2 does not have names, and some of the fields are
automatically converted to integers, which I think is not smart in this
application.


       Thanks,
       Spencer Graves

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: prevent XML::readHTMLTable from suppressing <br/>

Rasmus Liland-3
On 2020-07-24 22:59 -0500, Spencer Graves wrote:

> Hello, All:
>
> Thanks to Rasmus Liland, William
> Michels, and Luke Tierney for their
> help with my earlier web scraping
> question.  I've made progress, but I
> still have a problem:  one field
> contains "<br/>", which gets
> suppressed by XML::readHTMLTable:
>
> sosURL <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> sosChars <- RCurl::getURL(sosURL)
> MOcan <- XML::readHTMLTable(sosChars)
> MOcan[[2]][1, 2]
> [1] "4476 FIVE MILE RDSENECA MO 64865"
>
> (Seneca <- regexpr('SENECA', sosChars))
> substring(sosChars, Seneca-22, Seneca+14)
>
> [1] "4476 FIVE MILE RD<br/>SENECA MO 64865"
>
> How can I get essentially the same
> result but without having
> XML::readHTMLTable suppress "<br/>"?
>
> NOTE:  I get something very similar with xml2::read_html and
> rvest::html_table:
>
> sosPointers <- xml2::read_html(sosChars)
> MOcan2 <- rvest::html_table(sosPointers)
> MOcan2[[2]][1, 2]
> [1] "4476 FIVE MILE RDSENECA MO 64865"
>
> MOcan2 does not have names, and some
> of the fields are automatically
> converted to integers, which I think
> is not smart in this application.

Yes, I observed this too; see the
challenge I posed to you in the old
thread.

You could just edit it yourself by
finding all the string separators:

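        # Assumes `tab` is the scraped candidate table, with a
        # character column named "Mailing Address".
        cities <-
          c("KANSAS CITY",
            "SENECA MO")
        for (city in cities) {
          # Rows whose address contains this city
          idx <- grepl(city, tab[,"Mailing Address"])
          # Split each address at the city and rejoin the pieces
          # with a newline in front of the city name
          tab[idx,"Mailing Address"] <-
            sapply(strsplit(tab[idx,"Mailing Address"], city), paste,
              collapse=paste0("\n", city))
        }
        # Report how many addresses still lack a newline
        cat(sum(!grepl("\n", tab[,"Mailing Address"])),
            "addresses left to hard-code a newline char into!", "\n")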

... I'm sure the post office can deliver
your snail mail letters correctly even
if you put the addresses in without the
newline char; after all, the ZIP code is
correct ...
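
Another option that avoids listing city names, offered only as a sketch and not tested against the Missouri page: replace every "<br/>" in the raw HTML with a newline before handing the text to the parser, so the break is ordinary text by the time the table is read. The helper name br_to_newline and the small sample table below are made up for illustration. The substitution itself needs only base R; the parse step runs only if the XML package is installed.

```r
# Hypothetical helper: turn <br>, <br/>, <br /> (any case) into "\n"
# so the line break is plain text when the table is parsed.
br_to_newline <- function(html) {
  gsub("<br[[:space:]]*/?>", "\n", html, ignore.case = TRUE)
}

# Made-up stand-in for the downloaded page (not the real sos.mo.gov HTML).
html <- paste0(
  "<table><tr><th>Name</th><th>Mailing Address</th></tr>",
  "<tr><td>A CANDIDATE</td>",
  "<td>4476 FIVE MILE RD<br/>SENECA MO 64865</td></tr></table>")

marked <- br_to_newline(html)

# Parse only if the XML package is available.
if (requireNamespace("XML", quietly = TRUE)) {
  tabs <- XML::readHTMLTable(marked, stringsAsFactors = FALSE)
  print(tabs[[1]][1, 2])
}
```

With the objects from the original post, the call would presumably be XML::readHTMLTable(br_to_newline(sosChars)). If a newline turns out to be awkward downstream, any other separator string could be substituted in its place.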
