Remove superscripts from HTML objects

Chris Stubben
Is there some way to remove superscripts from objects returned by html/xmlParse (XML package)?

library(XML)
h <- "<html><p>Cat<sup>a</sup></p><p>Dog</p></html>"
doc <- htmlParse(h)
xpathSApply(doc, "//p", xmlValue)
[1] "Cata" "Dog"

I could probably strip the <sup> tags from the raw string "h" above, but I'd rather work directly with the result of htmlParse if possible (and not have to load the raw HTML with readLines first).

Thanks,
Chris Stubben
 
Re: Remove superscripts from HTML objects

M. Lell
Hi,

h <- "<html><p>Cat<sup>a</sup></p><p>Dog</p></html>"
sub("<sup.*sup>","",h)

see http://en.wikibooks.org/wiki/R_Programming/Text_Processing for more
information.

Regards!

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Remove superscripts from HTML objects

S Ellison-2
> h <- "<html><p>Cat<sup>a</sup></p><p>Dog</p></html>"
> sub("<sup.*sup>","",h)

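Probably safer to use gsub with a non-greedy match:

gsub("<sup.*?sup>","",h)

so that each superscript is removed on its own, rather than everything between the first <sup and the last sup>.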

e.g.
h2 <- "<html><p>Cat<sup>a</sup></p><p>Dog</p><p>Mouse<sup>a</sup></p><p>Raccoon</p></html>"
sub("<sup.*sup>","",h2)      # drops everything between the first <sup and the last sup>
gsub("<sup.*?sup>","",h2)    # drops each <sup>xxx</sup> separately
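
To make the difference concrete, here is a small sketch that stores both results so they can be compared. The tighter pattern "<sup[^>]*>.*?</sup>" and the use of perl = TRUE are assumptions added for illustration, not part of the suggestion above; they anchor the match on complete opening and closing tags and make the lazy quantifier explicit to the PCRE engine.

```r
# Two superscripts with different content, to show what each call removes
h2 <- "<html><p>Cat<sup>a</sup></p><p>Dog</p><p>Mouse<sup>b</sup></p><p>Raccoon</p></html>"

# Greedy: a single match running from the first "<sup" to the last "sup>",
# so the Dog and Mouse paragraphs are swallowed as well
greedy <- sub("<sup.*sup>", "", h2)

# Lazy, anchored on the full tags: each <sup>...</sup> is removed on its own
lazy <- gsub("<sup[^>]*>.*?</sup>", "", h2, perl = TRUE)

print(greedy)
# [1] "<html><p>Cat</p><p>Raccoon</p></html>"
print(lazy)
# [1] "<html><p>Cat</p><p>Dog</p><p>Mouse</p><p>Raccoon</p></html>"
```

Note that any regex approach still works on the raw string, not on the parsed document.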


Re: Remove superscripts from HTML objects

Chris Stubben
Sorry if I was not clear. I wanted to remove the superscripts using XPath queries if possible. For example, this gets the p nodes that contain superscripts, but how do I remove the superscripts when there are many matching nodes with different superscripts?

xpathSApply(doc, "//p[sup]", xmlValue)
[1] "Cata"


Chris
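
One possible way to do this with the parsed document, offered as a sketch and not as an answer given in the thread: select every <sup> node with an XPath query and detach the nodes with removeNodes() from the XML package, then extract the paragraph text as before. The three-paragraph input string is made up for illustration. Note that removeNodes() modifies "doc" in place, so re-parse the HTML if the original tree is still needed.

```r
library(XML)

# Hypothetical input with two different superscripts
h <- "<html><p>Cat<sup>a</sup></p><p>Dog</p><p>Mouse<sup>b</sup></p></html>"
doc <- htmlParse(h, asText = TRUE)

# Find every <sup> element anywhere in the tree ...
sups <- getNodeSet(doc, "//sup")

# ... and detach them all from the document (this changes doc itself)
removeNodes(sups)

# The paragraph text no longer includes the superscript content
vals <- xpathSApply(doc, "//p", xmlValue)
print(vals)
# [1] "Cat"   "Dog"   "Mouse"
```

Because the query is "//sup", it does not matter how many paragraphs match or what each superscript contains.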