help with web scraping


help with web scraping

Spencer Graves-4
Hello, All:


       I've failed with multiple attempts to scrape the table of
candidates from the website of the Missouri Secretary of State:


https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975


       I've tried base::url, base::readLines, xml2::read_html, and
XML::readHTMLTable; see summary below.


       Suggestions?
       Thanks,
       Spencer Graves


sosURL <-
"https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"

str(baseURL <- base::url(sosURL))
# this might give me something, but I don't know what

sosRead <- base::readLines(sosURL) # 404 Not Found
sosRb <- base::readLines(baseURL) # 404 Not Found

sosXml2 <- xml2::read_html(sosURL) # HTTP error 404.

sosXML <- XML::readHTMLTable(sosURL)
# List of 0;  does not seem to be XML

sessionInfo()

R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.5

Matrix products: default
BLAS:
/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK:
/Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets
[6] methods   base

loaded via a namespace (and not attached):
[1] compiler_4.0.2 tools_4.0.2    curl_4.3
[4] xml2_1.3.2     XML_3.99-0.3

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: help with web scraping

R help mailing list-2
Hi Spencer,

I tried the code below on an older R-installation, and it works fine.
Not a full solution, but it's a start:

> library(RCurl)
Loading required package: bitops
> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> M_sos <- getURL(url)
> print(M_sos)
[1] "\r\n<!DOCTYPE html>\r\n\r\n<html
lang=\"en-us\">\r\n<head><title>\r\n\tSOS, Missouri - Elections:
Offices Filed in Candidate Filing\r\n</title><meta name=\"viewport\"
content=\"width=device-width, initial-scale=1.0\" [...remainder
truncated].

HTH, Bill.

W. Michels, Ph.D.



On Thu, Jul 23, 2020 at 2:55 PM Spencer Graves
<[hidden email]> wrote:

>
> Hello, All:
>
>
>        I've failed with multiple attempts to scrape the table of
> candidates from the website of the Missouri Secretary of State:
>
>
> https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975
>
>
>        I've tried base::url, base::readLines, xml2::read_html, and
> XML::readHTMLTable; see summary below.
>
>
>        Suggestions?
>        Thanks,
>        Spencer Graves


Re: help with web scraping

Spencer Graves-4
Hi Bill et al.:


       That broke the dam:  It gave me a character vector of length 1
consisting of 218 KB.  I fed that to XML::readHTMLTable and
purrr::map_chr, both of which returned lists of 337 data.frames. The
former retained names for all the tables, absent from the latter.  The
columns of the former are all character;  that's not true for the latter.


       Sadly, it's not quite what I want:  It's one table for each
office-party combination, but it's lost the office designation. However,
I'm confident I can figure out how to hack that.


       Thanks,
       Spencer Graves


On 2020-07-23 17:46, William Michels wrote:

> Hi Spencer,
>
> I tried the code below on an older R-installation, and it works fine.
> Not a full solution, but it's a start:
>
>> library(RCurl)
> Loading required package: bitops
>> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
>> M_sos <- getURL(url)
>> print(M_sos)
> [1] "\r\n<!DOCTYPE html>\r\n\r\n<html
> lang=\"en-us\">\r\n<head><title>\r\n\tSOS, Missouri - Elections:
> Offices Filed in Candidate Filing\r\n</title><meta name=\"viewport\"
> content=\"width=device-width, initial-scale=1.0\" [...remainder
> truncated].
>
> HTH, Bill.
>
> W. Michels, Ph.D.


Re: [External] Re: help with web scraping

luke-tierney
Maybe try something like this:

url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
h <- xml2::read_html(url)
tbl <- rvest::html_table(h)

Best,

luke

On Fri, 24 Jul 2020, Spencer Graves wrote:

> Hi Bill et al.:
>
>
>       That broke the dam:  It gave me a character vector of length 1
> consisting of 218 KB.  I fed that to XML::readHTMLTable and purrr::map_chr,
> both of which returned lists of 337 data.frames. The former retained names
> for all the tables, absent from the latter.  The columns of the former are
> all character;  that's not true for the latter.
>
>
>       Sadly, it's not quite what I want:  It's one table for each
> office-party combination, but it's lost the office designation. However, I'm
> confident I can figure out how to hack that.
>
>
>       Thanks,
>       Spencer Graves

--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:   [hidden email]
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu

Re: [External] Re: help with web scraping

Spencer Graves-4


On 2020-07-24 08:20, [hidden email] wrote:
> Maybe try something like this:
>
> url <-
> "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> h <- xml2::read_html(url)


Error in open.connection(x, "rb") : HTTP error 404.


       Thanks for the suggestion, but this failed for me on the platform
described in "sessionInfo" below.


> tbl <- rvest::html_table(h)


       As I previously noted, RCurl::getURL returned a single character
string of roughly 218 KB, from which I've so far gotten most but not all
of what I want.  Unfortunately, when I fed that character vector to
rvest::html_table, I got:


Error in UseMethod("html_table") :
   no applicable method for 'html_table' applied to an object of class
"character"


       I don't know for sure yet, but I believe I'll be able to get what
I want from the single character string using, e.g., gregexpr and other
functions.
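For what it's worth, the "no applicable method" error arises because rvest::html_table dispatches on a parsed document or node, not on a character vector, so the fetched string needs to be parsed first. A minimal sketch, assuming the xml2 and rvest packages are installed; the inline HTML below is a hypothetical stand-in for the ~218 KB string returned by RCurl::getURL:

```r
library(xml2)
library(rvest)

# Hypothetical stand-in for the string returned by RCurl::getURL()
M_sos <- "<html><body><table>
  <tr><th>Name</th><th>Party</th></tr>
  <tr><td>Jane Doe</td><td>Example</td></tr>
</table></body></html>"

# Parse the string first; html_table() then has a method to dispatch on
h <- xml2::read_html(M_sos)
tbl <- rvest::html_table(h)  # list of data frames, one per <table>
```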


       Thanks again,
       Spencer Graves



Re: [External] Re: help with web scraping

Rasmus Liland-3
On 2020-07-24 08:20 -0500, [hidden email] wrote:

>
> Maybe try something like this:
>
> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> h <- xml2::read_html(url)
> tbl <- rvest::html_table(h)
Dear Spencer,

I unified the party tables after the
first summary table like this:

        url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
        M_sos <- RCurl::getURL(url)
        saveRDS(object=M_sos, file="dcp.rds")
        dat <- XML::readHTMLTable(M_sos)
        idx <- 2:length(dat)
        cn <- unique(unlist(lapply(dat[idx], colnames)))
        dat <- do.call(rbind,
          sapply(idx, function(i, dat, cn) {
            x <- dat[[i]]
            x[,cn[!(cn %in% colnames(x))]] <- NA
            x <- x[,cn]
            x$Party <- names(dat)[i]
            return(list(x))
          }, dat=dat, cn=cn))
        dat[,"Date Filed"] <-
          as.Date(x=dat[,"Date Filed"],
                  format="%m/%d/%Y")
        write.table(dat, file="dcp.tsv", sep="\t",
                    row.names=FALSE,
                    quote=TRUE, na="N/A")

Best,
Rasmus


Re: [External] Re: help with web scraping

Spencer Graves-4
Dear Rasmus:


On 2020-07-24 09:16, Rasmus Liland wrote:

> Dear Spencer,
>
> I unified the party tables after the
> first summary table like this:
>
> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> M_sos <- RCurl::getURL(url)
> saveRDS(object=M_sos, file="dcp.rds")
> dat <- XML::readHTMLTable(M_sos)
> idx <- 2:length(dat)
> cn <- unique(unlist(lapply(dat[idx], colnames)))

        This is useful for this application.

> dat <- do.call(rbind,
>  sapply(idx, function(i, dat, cn) {
>    x <- dat[[i]]
>    x[,cn[!(cn %in% colnames(x))]] <- NA
>    x <- x[,cn]
>    x$Party <- names(dat)[i]
>    return(list(x))
>  }, dat=dat, cn=cn))
> dat[,"Date Filed"] <-
>  as.Date(x=dat[,"Date Filed"],
>          format="%m/%d/%Y")

        This misses something extremely important for this application:
the political office.  That's buried in the HTML or whatever it is.  I'm
using something like the following to find that:


str(LtGov <- gregexpr('Lieutenant Governor', M_sos)[[1]])


        After I figure this out, I will use something like your code to
combine it all into separate tables for each office, and then probably
combine those into one table for the offices I'm interested in.  For my
present purposes, I don't want all the offices in Missouri, only the
executive positions and those representing parts of the Kansas City
metro area in the Missouri legislature.


        Thanks again,
        Spencer Graves

> write.table(dat, file="dcp.tsv", sep="\t",
>            row.names=FALSE,
>            quote=TRUE, na="N/A")
>
> Best,
> Rasmus
>




Re: [External] Re: help with web scraping

Rasmus Liland-3
On 2020-07-24 10:28 -0500, Spencer Graves wrote:

> Dear Rasmus:
>
> This misses something extremely
> important for this application: the
> political office.  That's buried in
> the HTML or whatever it is.  I'm
> using something like the following
> to find that:
>
> str(LtGov <- gregexpr('Lieutenant Governor', M_sos)[[1]])
Dear Spencer,

I came up with a solution, but it is
not very elegant.  Instead of showing
you the solution, hoping you
understand everything in it, I instead
want to give you some emphatic hints
to see if you can come up with a
solution on your own.

- XML::htmlTreeParse(M_sos)
  - *Gandalf voice*: climb the tree
    until you find the content you are
    looking for flat out at the level of
    «The Children of the Div», *uuuUUU*
  - you only want to keep the table and
    header tags at this level
- Use XML::xmlValue to extract the
  values of all the headers (the
  political positions)
- Observe that all the tables on the
  page you were able to extract
  previously using XML::readHTMLTable,
  are at this level, shuffled between
  the political position header tags,
  this means you extract the political
  position and party affiliation by
  using a for loop, if statements,
  typeof, names, and [] and [[]] to grab
  different things from the list
  (content or the bag itself).
  XML::readHTMLTable strips away the
  line break tags from the Mailing
  address, so if you find a better way
  of extracting the tables, tell me,
  e.g. you get

        8805 HUNTER AVEKANSAS CITY MO 64138

  and not

        8805 HUNTER AVE<br/>KANSAS CITY MO 64138

When you've completed this «programming
quest», you're back at the level of the
previous email, i.e. you have the same
tables, but with political position
and party affiliation added to them.

Best,
Rasmus
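[Editorial sketch of the tree walk hinted at above, not Rasmus's withheld solution: office names sit in header tags interleaved with the candidate tables, so walk those siblings in document order, remembering the last header seen.  It assumes the xml2 and rvest packages; the inline snippet is a hypothetical stand-in for the real page, whose structure may differ.]

```r
library(xml2)
library(rvest)

# Hypothetical stand-in: <h3> office headers interleaved with <table>s
snippet <- '<div>
  <h3>Governor</h3>
  <table><tr><th>Name</th></tr><tr><td>Jane Doe</td></tr></table>
  <h3>Lieutenant Governor</h3>
  <table><tr><th>Name</th></tr><tr><td>John Roe</td></tr></table>
</div>'

doc <- xml2::read_html(snippet)
# Select the header and table children in document order
nodes <- xml2::xml_find_all(doc, "//div/h3 | //div/table")

office <- NA_character_
out <- list()
for (i in seq_along(nodes)) {
  nd <- nodes[[i]]
  if (xml2::xml_name(nd) == "h3") {
    office <- xml2::xml_text(nd)   # remember the current office header
  } else {
    tb <- rvest::html_table(nd)    # parse this candidate table
    tb$Office <- office            # attach the office it sits under
    out[[length(out) + 1L]] <- tb
  }
}
dat <- do.call(rbind, out)
```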


Re: [External] Re: help with web scraping

Spencer Graves-4
Dear Rasmus et al.:


On 2020-07-25 04:10, Rasmus Liland wrote:

> When you've completed this «programming
> quest», you're back at the level of the
> previous email, i.e. you have the same
> tables, but with political position
> and party affiliation added to them.

 ����� Please excuse:� Before my last post, I had written code to do all
that.� In brief, the political offices are "h3" tags.� I used "strsplit"
to split the string at "<h3>".� I then wrote a function to find "</h3>",
extract the political office and pass the rest to "XML::readHTMLTable",
adding columns for party and political office.


       However, this suppressed "<br/>" everywhere.  I thought there
should be an option with something like "XML::readHTMLTable" that would
not delete "<br/>" everywhere, but I couldn't find it.  If you aren't
aware of one, I can gsub("<br/>", "\n", ...) on the string for each
political office before passing it to "XML::readHTMLTable".  I just
tested this:  It works.
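
That recipe can be sketched on a made-up fragment shaped like the SoS
page; the HTML string, the candidate data, and the wrapping <div> here
are illustrative only, not the live page:

```r
library(XML)

# Made-up fragment mimicking the SoS page: an <h3> office heading
# followed by a candidate table, with <br/> inside the address cell.
html <- paste0(
  "<div><h3>Governor</h3>",
  "<table><caption>Republican</caption>",
  "<tr><th>Name</th><th>Mailing Address</th></tr>",
  "<tr><td>Mike Parson</td>",
  "<td>1458 E 464 RD<br/>BOLIVAR MO 65613</td></tr>",
  "</table></div>")

# Keep the line breaks: readHTMLTable drops <br/>, so gsub it first.
html2 <- gsub("<br/>", "\n", html, fixed = TRUE)

# Split at the office headings; drop the chunk before the first <h3>.
chunks <- strsplit(html2, "<h3>", fixed = TRUE)[[1]][-1]

candidates <- lapply(chunks, function(ch) {
  office <- sub("</h3>.*", "", ch)  # text up to </h3> is the office
  tab <- XML::readHTMLTable(paste0("<div>", ch, "</div>"))[[1]]
  tab$Office <- office              # carry the office along as a column
  tab
})
candidates[[1]]
```

The party then comes from the table's caption, which readHTMLTable uses
as the name of the list element, as discussed below.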


       I have other web scraping problems in my work plan for the next
few days.  I will definitely try XML::htmlTreeParse, etc., as you suggest.


       Thanks again.
       Spencer Graves
>
> Best,
> Rasmus
>
> ______________________________________________
> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



Re: [External] Re: help with web scraping

Rasmus Liland-3
On 2020-07-25 09:56 -0500, Spencer Graves wrote:
> Dear Rasmus et al.:

It is LILAND et al., is it not?  I do
not belong to a large Confucian family
structure (putting the hunter-gatherer
horse-rider tribe name first in all-caps
in the email), else it's customary to
put a comma in there, isn't it? ...
right, moving on:

On 2020-07-25 04:10, Rasmus Liland wrote:
>
>  ?????

It might be a better idea to write the
reply in plain-text utf-8 or at least
Western or Eastern-European ISO euro
encoding instead of us-ascii (maybe
KOI8, ¯\_(ツ)_/¯) ...  something in your
email got string-replaced by "?????" and
also "«" got replaced by "?".

Please research using Thunderbird, Claws
mail, or some other sane e-mail client;
they are great, I promise.

> Please excuse:  Before my last post, I
> had written code to do all that.

Good!

> In brief, the political offices are
> "h3" tags.

Yes, some type of header element at
least, in-between the various tables,
everything children of the div in the
element tree.

> I used "strsplit" to split the string
> at "<h3>".  I then wrote a
> function to find "</h3>", extract the
> political office and pass the rest to
> "XML::readHTMLTable", adding columns
> for party and political office.

Yes, doing that for the political office
is also possible, but the party is
inside the table's caption tag, which
end up as the name of the table in the
XML::readHTMLTable list ...

> However, this suppressed "<br/>"
> everywhere.

Why is that, please explain.

> I thought there should be
> an option with something like
> "XML::readHTMLTable" that would not
> delete "<br/>" everywhere, but I
> couldn't find it.

No, there is not, AFAIK.  Please, if
anyone else knows, please say so *echoes
in the forest*

> If you aren't aware of one, I can
> gsub("<br/>", "\n", ...) on the string
> for each political office before
> passing it to "XML::readHTMLTable".  I
> just tested this:  It works.

Such a great hack!  IMHO, this is much
more flexible than using
xml2::read_html, rvest::read_table,
dplyr::mutate like here[1]

> I have other web scraping problems in
> my work plan for the next few days.

Maybe, idk ...

> I will definitely try
> XML::htmlTreeParse, etc., as you
> suggest.

I wish you good luck,
Rasmus

[1] https://stackoverflow.com/questions/38707669/how-to-read-an-html-table-and-account-for-line-breaks-within-cells



Re: [External] Re: help with web scraping

Spencer Graves-4
Dear Rasmus Liland et al.:


On 2020-07-25 11:30, Rasmus Liland wrote:
> On 2020-07-25 09:56 -0500, Spencer Graves wrote:
>> Dear Rasmus et al.:
>
> It is LILAND et al., is it not?  ... else it's customary to
> put a comma in there, isn't it? ...


The APA Style recommends "Sharp et al., 2007":


https://blog.apastyle.org/apastyle/2011/11/the-proper-use-of-et-al-in-apa-style.html


          Regarding Confucius, I'm confused.



> right, moving on:
>
> On 2020-07-25 04:10, Rasmus Liland wrote:
>>

<snip>

>
> Please research using Thunderbird, Claws
> mail, or some other sane e-mail client;
> they are great, I promise.


Thanks.  I researched it and turned off HTML.  Please excuse:  I noticed
it was a problem, but hadn't prioritized time to research and fix it
until your comment.  Thanks.

>
>> Please excuse:  Before my last post, I
>> had written code to do all that.
>
> Good!
>
>> In brief, the political offices are
>> "h3" tags.
>
> Yes, some type of header element at
> least, in-between the various tables,
> everything children of the div in the
> element tree.
>
>> I used "strsplit" to split the string
>> at "<h3>".  I then wrote a
>> function to find "</h3>", extract the
>> political office and pass the rest to
>> "XML::readHTMLTable", adding columns
>> for party and political office.
>
> Yes, doing that for the political office
> is also possible, but the party is
> inside the table's caption tag, which
> end up as the name of the table in the
> XML::readHTMLTable list ...
>
>> However, this suppressed "<br/>"
>> everywhere.
>
> Why is that, please explain.
>

          I don't know why the Missouri Secretary of State's web site includes
"<br/>" to signal a new line, but it does.  I also don't know why
XML::readHTMLTable suppressed "<br/>" everywhere it occurred, but it did
that.  After I used gsub to replace "<br/>" with "\n", I found that
XML::readHTMLTable did not replace "\n", so I got what I wanted.
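
That behavior is easy to reproduce on a one-cell toy table (the address
below is made up for illustration; only the XML package is assumed):

```r
library(XML)

# Toy table with a <br/> inside the single data cell
tbl <- paste0("<table><tr><th>Mailing Address</th></tr>",
              "<tr><td>PO BOX 343<br/>CAMERON MO 64429</td></tr></table>")

# readHTMLTable silently drops the <br/>, fusing the two lines:
XML::readHTMLTable(tbl)[[1]][1, 1]

# gsub-ing it to "\n" first preserves the break in the cell value:
XML::readHTMLTable(gsub("<br/>", "\n", tbl, fixed = TRUE))[[1]][1, 1]
```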


>> I thought there should be
>> an option with something like
>> "XML::readHTMLTable" that would not
>> delete "<br/>" everywhere, but I
>> couldn't find it.
>
> No, there is not, AFAIK.  Please, if
> anyone else knows, please say so *echoes
> in the forest*
>
>> If you aren't aware of one, I can
>> gsub("<br/>", "\n", ...) on the string
>> for each political office before
>> passing it to "XML::readHTMLTable".  I
>> just tested this:  It works.
>
> Such a great hack!  IMHO, this is much
> more flexible than using
> xml2::read_html, rvest::read_table,
> dplyr::mutate like here[1]
>
>> I have other web scraping problems in
>> my work plan for the next few days.
>
> Maybe, idk ...
>
>> I will definitely try
>> XML::htmlTreeParse, etc., as you
>> suggest.
>
> I wish you good luck,
> Rasmus
>
> [1] https://stackoverflow.com/questions/38707669/how-to-read-an-html-table-and-account-for-line-breaks-within-cells


          And I added my solution to this problem to this Stackoverflow thread.


          Thanks again,
          Spencer
>
>
>


Re: [External] Re: help with web scraping

R help mailing list-2
In reply to this post by Spencer Graves-4
Dear Spencer Graves (and Rasmus Liland),

I've had some luck just using gsub() to alter the offending "<br/>"
tags, appending a "___" marker at each instance of "<br/>" (first I
checked the text to make sure it didn't contain any pre-existing
instances of "___"). See the output snippet below:

> library(RCurl)
> library(XML)
> sosURL <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> sosChars <- getURL(sosURL)
> sosChars2 <- gsub("<br/>", "<br/>___", sosChars)
> MOcan <- readHTMLTable(sosChars2)
> MOcan[[2]]
                  Name                          Mailing Address Random Number Date Filed
1       Raleigh Ritter      4476 FIVE MILE RD___SENECA MO 64865           185  2/25/2020
2          Mike Parson         1458 E 464 RD___BOLIVAR MO 65613           348  2/25/2020
3 James W. (Jim) Neely            PO BOX 343___CAMERON MO 64429           477  2/25/2020
4     Saundra McDowell 3854 SOUTH AVENUE___SPRINGFIELD MO 65807                3/31/2020
>

It's true, there's one 'section' of MOcan output that contains
odd-looking characters (see the "Total" line of MOcan[[1]]). But my
guess is you'll be deleting this 'line' anyway--and recalculating totals in R.

Now that you have a comprehensive list object, you should be able to
pull out districts/races of interest. You might want to take a look at
the "rlist" package, to see if it can make your work a little easier:

https://CRAN.R-project.org/package=rlist
https://renkun-ken.github.io/rlist-tutorial/index.html
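
For instance, races of interest can be pulled out by list name in base
R, or non-empty tables kept with rlist; the MOcan list below is a toy
stand-in with hypothetical names, not the real scraped object:

```r
library(rlist)

# Toy stand-in for the MOcan list of per-race candidate tables
MOcan <- list(
  "Governor - Republican"   = data.frame(Name = "Mike Parson"),
  "Lieutenant Governor - D" = data.frame(Name = character(0)))

# Base R: subset by matching on the list names
MOcan[grep("^Governor", names(MOcan))]

# rlist: keep only non-empty tables ("." is the current element)
list.filter(MOcan, nrow(.) > 0)
```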

HTH, Bill.

W. Michels, Ph.D.









On Sat, Jul 25, 2020 at 7:56 AM Spencer Graves
<[hidden email]> wrote:

>
> Dear Rasmus et al.:
>
>
> On 2020-07-25 04:10, Rasmus Liland wrote:
> > On 2020-07-24 10:28 -0500, Spencer Graves wrote:
> >> Dear Rasmus:
> >>
> >>> Dear Spencer,
> >>>
> >>> I unified the party tables after the
> >>> first summary table like this:
> >>>
> >>>     url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> >>>     M_sos <- RCurl::getURL(url)
> >>>     saveRDS(object=M_sos, file="dcp.rds")
> >>>     dat <- XML::readHTMLTable(M_sos)
> >>>     idx <- 2:length(dat)
> >>>     cn <- unique(unlist(lapply(dat[idx], colnames)))
> >> This is useful for this application.
> >>
> >>>     dat <- do.call(rbind,
> >>>       sapply(idx, function(i, dat, cn) {
> >>>         x <- dat[[i]]
> >>>         x[,cn[!(cn %in% colnames(x))]] <- NA
> >>>         x <- x[,cn]
> >>>         x$Party <- names(dat)[i]
> >>>         return(list(x))
> >>>       }, dat=dat, cn=cn))
> >>>     dat[,"Date Filed"] <-
> >>>       as.Date(x=dat[,"Date Filed"],
> >>>               format="%m/%d/%Y")
> >> This misses something extremely
> >> important for this application:  The
> >> political office.  That's buried in
> >> the HTML or whatever it is.  I'm using
> >> something like the following to find
> >> that:
> >>
> >> str(LtGov <- gregexpr('Lieutenant Governor', M_sos)[[1]])
> > Dear Spencer,
> >
> > I came up with a solution, but it is not
> > very elegant.  Instead of showing you
> > the solution, hoping you understand
> > everything in it, I instead want to give
> > you some emphatic hints to see if you
> > can come up with a solution on your own.
> >
> > - XML::htmlTreeParse(M_sos)
> >    - *Gandalf voice*: climb the tree
> >      until you find the content you are
> >      looking for flat out at the level of
> >      «The Children of the Div», *uuuUUU*
> >    - you only want to keep the table and
> >      header tags at this level
> > - Use XML::xmlValue to extract the
> >    values of all the headers (the
> >    political positions)
> > - Observe that all the tables on the
> >    page you were able to extract
> >    previously using XML::readHTMLTable,
> >    are at this level, shuffled between
> >    the political position header tags,
> >    this means you extract the political
> >    position and party affiliation by
> >    using a for loop, if statements,
> >    typeof, names, and [] and [[]] to grab
> >    different things from the list
> >    (content or the bag itself).
> >    XML::readHTMLTable strips away the
> >    line break tags from the Mailing
> >    address, so if you find a better way
> >    of extracting the tables, tell me,
> >    e.g. you get
> >
> >       8805 HUNTER AVEKANSAS CITY MO 64138
> >
> >    and not
> >
> >       8805 HUNTER AVE<br/>KANSAS CITY MO 64138
> >
> > When you've completed this «programming
> > quest», you're back at the level of the
> > previous email, i.e.  you have the
> > same tables, but with political position
> > and party affiliation added to them.
>
>
>        Please excuse:  Before my last post, I had written code to do all
> that.  In brief, the political offices are "h3" tags.  I used "strsplit"
> to split the string at "<h3>".  I then wrote a function to find "</h3>",
> extract the political office and pass the rest to "XML::readHTMLTable",
> adding columns for party and political office.
>
>
>        However, this suppressed "<br/>" everywhere.  I thought there
> should be an option with something like "XML::readHTMLTable" that would
> not delete "<br/>" everywhere, but I couldn't find it.  If you aren't
> aware of one, I can gsub("<br/>", "\n", ...) on the string for each
> political office before passing it to "XML::readHTMLTable".  I just
> tested this:  It works.
>
>
>        I have other web scraping problems in my work plan for the next few
> days.  I will definitely try XML::htmlTreeParse, etc., as you suggest.
>
>
>        Thanks again.
>        Spencer Graves
> >
> > Best,
> > Rasmus
> >
>
>


Re: [External] Re: help with web scraping

Rasmus Liland-3
In reply to this post by Spencer Graves-4
Dear GRAVES et al.,

On 2020-07-25 12:43 -0500, Spencer Graves wrote:

> Dear Rasmus Liland et al.:
>
> On 2020-07-25 11:30, Rasmus Liland wrote:
> > On 2020-07-25 09:56 -0500, Spencer Graves wrote:
> > > Dear Rasmus et al.:
> >
> > It is LILAND et al., is it not?  ... else it's customary to
> > put a comma in there, isn't it? ...
>
> The APA Style recommends "Sharp et al., 2007":
>
> https://blog.apastyle.org/apastyle/2011/11/the-proper-use-of-et-al-in-apa-style.html
If "Sharp et al., 2007" is an APA
citation of this book[*], Sharp is John A
Sharp's surname, Liland is my surname.  
Q.E.D.

I have not used APA before (as I am not
a Psychiatrist), as the minimalism of
IEEE[**] always seemed more desirable.  

> Regarding Confucius, I'm confused.

Nevermind, just fooling around, that's
all.

> > On 2020-07-25 04:10, Rasmus Liland wrote:
> > >
> > > However, this suppressed "<br/>"
> > > everywhere.
> >
> > Why is that, please explain.
>
> I don't know why the Missouri
> Secretary of State's web site includes
> "<br/>" to signal a new line, but it
> does.
Me neither!  On top of that, <br /> is
actually[***] an XHTML tag, not an HTML
tag.

> I also don't know why
> XML::readHTMLTable suppressed "<br/>"
> everywhere it occurred, but it did
> that.

Yes, I know, I also observed this.  But
now we swiftly solved this by gsubbing it
with the newline char, "\n", which carries
no meaning for an HTML parser anyway.

> > > If you aren't aware of one, I can
> > > gsub("<br/>", "\n", ...) on the string
> > > for each political office before
> > > passing it to "XML::readHTMLTable".  I
> > > just tested this:  It works.
> >
> > Such a great hack!  IMHO, this is much
> > more flexible than using
> > xml2::read_html, rvest::read_table,
> > dplyr::mutate like here[1]
> >
> > [1] https://stackoverflow.com/questions/38707669/how-to-read-an-html-table-and-account-for-line-breaks-within-cells
>
> And I added my solution to this
> problem to this Stackoverflow thread.
I wish you many upvotes, alas the
political competition is obviously not
tough there, as the other guy just got
one downvote.

[*] https://www.amazon.co.uk/Management-Student-Research-Project/dp/0566084902 
[**] https://pitt.libguides.com/citationhelp/ieee
[***] https://stackoverflow.com/questions/1946426/html-5-is-it-br-br-or-br



Re: [External] Re: help with web scraping

Rasmus Liland-3
In reply to this post by R help mailing list-2
Dear William Michels,

On 2020-07-25 10:58 -0700, William Michels wrote:

>
> Dear Spencer Graves (and Rasmus Liland),
>
> I've had some luck just using gsub()
> to alter the offending "</br>"
> characters, appending a "___" tag at
> each instance of "<br>" (first I
> checked the text to make sure it
> didn't contain any pre-existing
> instances of "___"). See the output
> snippet below:
>
> > library(RCurl)
> > library(XML)
> > sosURL <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> > sosChars <- getURL(sosURL)
> > sosChars2 <- gsub("<br/>", "<br/>___", sosChars)
> > MOcan <- readHTMLTable(sosChars2)
> > MOcan[[2]]
>                   Name
> 1       Raleigh Ritter
> 2          Mike Parson
> 3 James W. (Jim) Neely
> 4     Saundra McDowell
>                            Mailing Address
> 1      4476 FIVE MILE RD___SENECA MO 64865
> 2         1458 E 464 RD___BOLIVAR MO 65613
> 3            PO BOX 343___CAMERON MO 64429
> 4 3854 SOUTH AVENUE___SPRINGFIELD MO 65807
>   Random Number Date Filed
> 1           185  2/25/2020
> 2           348  2/25/2020
> 3           477  2/25/2020
> 4                3/31/2020
> >
>
> It's true, there's one a 'section' of
> MOcan output that contains odd-looking
> characters (see the "Total" line of
> MOcan[[1]]). But my guess is you'll be
> deleting this 'line' anyway--and
> recalulating totals in R.
Perhaps this is the table you mean?

                        Offices Republican
        1              Governor          4
        2   Lieutenant Governor          4
        3    Secretary of State          1
        4       State Treasurer          1
        5      Attorney General          1
        6   U.S. Representative         24
        7         State Senator         28
        8  State Representative        187
        9         Circuit Judge         18
        10                Total 268\r\n___
           Democratic Libertarian    Green
        1           5           1        1
        2           2           1        1
        3           1           1        1
        4           1           1        1
        5           2           1        0
        6          16           9        0
        7          22           2        1
        8         137           6        2
        9           1           0        0
        10 187\r\n___   22\r\n___ 7\r\n___
           Constitution      Total
        1             0         11
        2             0          8
        3             1          5
        4             0          4
        5             0          4
        6             0         49
        7             0         53
        8             1        333
        9             0         19
        10     2\r\n___ 486\r\n___

Yes, somehow the Windows[1] carriage-return
character "0xD" gets rendered as "\r\n"
after your gsub, while "<br/>" is still
ignored.

There is not a "0xD" inside the
td.AddressCol cells in the tables we are
interested in.
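
The stray carriage returns and "___" markers in that totals row can be
stripped afterwards, e.g. (toy values copying the residue seen above):

```r
# Values copying the "\r\n___" residue from the Total row
totals <- c("268\r\n___", "187\r\n___", "22\r\n___", "486\r\n___")

# Remove CR/LF and the appended underscores in one pass
cleaned <- gsub("[\r\n_]+", "", totals)
cleaned  # "268" "187" "22" "486"
```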

> Now that you have a comprehensive list
> object, you should be able to pull out
> districts/races of interest. You might
> want to take a look at the "rlist"
> package, to see if it can make your
> work a little easier:
>
> https://CRAN.R-project.org/package=rlist
> https://renkun-ken.github.io/rlist-tutorial/index.html

Thank you, this package seems useful.  

Can you please provide a hint (maybe) as
to which of the many functions you were
thinking of?  E.g. instead of using a for
loop over the index of the list of headers
and tables, with if statements on typeof
(list or character), updating variables to
write the political position into each
table.

V/r,
Rasmus
[1] https://stackoverflow.com/questions/5843495/what-does-m-character-mean-in-vim

