Analyzing Publications from Pubmed via XML


Analyzing Publications from Pubmed via XML

Farrel Buchinsky
I would like to track in which journals articles about a particular disease
are being published. Creating a PubMed search is trivial. The search
provides data, but obviously not as an R dataframe. I can get the search to
export the data as an XML feed, and the XML package seems to be able to read
it.

xmlTreeParse("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-", isURL = TRUE)

But getting from there to a dataframe in which one column would be the name
of the journal and another column would be the year (to keep things simple)
seems to be beyond my capabilities.

Has anyone ever done this, and could you share your script? Are there any
published examples where the end result is a dataframe?

I guess what I am looking for is an easy and simple way to parse the feed
and extract the data. Alternatively, how does one turn an RSS feed into a
CSV file?

--
Farrel Buchinsky
GrandCentral Tel: (412) 567-7870


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Analyzing Publications from Pubmed via XML

Rajarshi Guha-3


If you're comfortable with Python (or Perl, Ruby, etc.), it'd be easier
to just extract the required stuff from the raw feed - using
ElementTree in Python makes this a trivial task.

Once you have the raw data you can read it into R.

-------------------------------------------------------------------
Rajarshi Guha  <[hidden email]>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04  06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
A committee is a group that keeps the minutes and loses hours.
        -- Milton Berle


Re: Analyzing Publications from Pubmed via XML

Farrel Buchinsky
I am afraid not! The only thing I know about Python (or Perl, Ruby, etc.) is
that they exist and that I have been able to download some amazing freeware
or open source software thanks to their existence.
The XML package, and specifically the xmlTreeParse function, looks as if it
is begging to do the task for me. Is that not true?



--
Farrel Buchinsky
GrandCentral Tel: (412) 567-7870



Re: Analyzing Publications from Pubmed via XML

Gabor Grothendieck
In reply to this post by Farrel Buchinsky

Try this:

library(XML)
doc <- xmlTreeParse("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-",
                    isURL = TRUE, useInternalNodes = TRUE)
sapply(c("//author", "//category"), xpathApply, doc = doc, fun = xmlValue)


Re: Analyzing Publications from Pubmed via XML

Rajarshi Guha-3
In reply to this post by Farrel Buchinsky

On Dec 13, 2007, at 9:16 PM, Farrel Buchinsky wrote:

> I am afraid not! The only thing I know about Python (or Perl, Ruby  
> etc) is that they exist and that I have been able to download some  
> amazing freeware or open source software thanks to their existence.
> The XML package and specifically the xmlTreeParse function looks as  
> if it is begging to do the task for me. Is that not true?


Certainly - though as probably a better Python programmer than an R
programmer, it's faster and neater for me to do it in Python:

from elementtree.ElementTree import XML
import urllib

url = ("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?"
       "rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-")
con = urllib.urlopen(url)
dat = con.read()
root = XML(dat)
items = root.findall("channel/item")
for item in items:
    category = item.find("category")
    print category.text

The problem is that the RSS feed you linked to does not contain the
year of the article in an easily accessible XML element. Rather, you
have to process the HTML content of the description element - which
is something R could do, but you'd be using the wrong tool for the job.

In general, if you're planning to analyze article data from Pubmed
I'd suggest going through the Entrez CGIs (ESearch and EFetch),
which will give you all the details of the articles in an XML format
that can then be easily parsed in your language of choice.

This is something that can be done in R (the rpubchem package
contains functions to process XML files from Pubchem, which might
provide some pointers).
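[To make the EFetch route concrete, here is a minimal ElementTree sketch, not anything from this thread: the record below is a trimmed, hypothetical stand-in for an EFetch response, and element names may differ across PubMed DTD versions. It pulls the journal title and year out of each record.]

```python
from xml.etree.ElementTree import fromstring

# Trimmed, hypothetical EFetch-style record; a real response wraps many
# PubmedArticle elements in one PubmedArticleSet.
record = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <Article>
        <Journal>
          <Title>Rev Bras Otorrinolaringol (Engl Ed)</Title>
          <JournalIssue><PubDate><Year>2007</Year></PubDate></JournalIssue>
        </Journal>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

root = fromstring(record)
# Walk each record and read out journal title and year as plain text.
for journal in root.findall("PubmedArticle/MedlineCitation/Article/Journal"):
    title = journal.findtext("Title")
    year = journal.findtext("JournalIssue/PubDate/Year")
    print(title, year)
```

A list of (title, year) pairs like this is trivially written to CSV, which R can then read with read.csv().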

-------------------------------------------------------------------
Rajarshi Guha  <[hidden email]>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04  06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
Writing software is more fun than working.


Re: Analyzing Publications from Pubmed via XML

Robert Gentleman
In reply to this post by Gabor Grothendieck
or just try looking in the annotate package from Bioconductor



--
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
[hidden email]


Re: Analyzing Publications from Pubmed via XML

Farrel Buchinsky
In reply to this post by Rajarshi Guha-3
> The problem is that the RSS feed you linked to, does not contain the
> year of the article in an easily accessible XML element. Rather you
> have to process the HTML content of the description element - which,
> is something R could do, but you'd be using the wrong tool for the job.
>

Yes. I have noticed that there are two sorts of XML that PubMed will
provide. The kind I had hooked into was an RSS feed, which provides a
lot of the information simply as a formatted table for viewing in an
RSS reader. There is another way to get the XML to come out with more
tags. However, I found the best way to do this is probably through the
Bioconductor annotate package:

library(XML)
library(annotate)  # Bioconductor; provides pubmed() and buildPubMedAbst()

x <- pubmed("18046565", "17978930", "17975511")
a <- xmlRoot(x)
numAbst <- length(xmlChildren(a))
absts <- list()
for (i in 1:numAbst) {
  absts[[i]] <- buildPubMedAbst(a[[i]])
}

I am now trying to work through that approach to see what I can come up with.
--
Farrel Buchinsky


Re: Analyzing Publications from Pubmed via XML

Farrel Buchinsky
In reply to this post by Robert Gentleman
On Dec 13, 2007 11:35 PM, Robert Gentleman <[hidden email]> wrote:
> or just try looking in the annotate package from Bioconductor
>

Yip. annotate seems to be the most streamlined way to do this.
1) How does one turn the list that is created into a dataframe whose
column names are along the lines of date, title, journal, authors, etc.?
2) I have already created a standing search in PubMed using MyNCBI.
There are many ways I can feed those results to the pubmed() function.
The most brute-force way of doing it is by running the search and
outputting the data as a UI list and getting that into the pubmed
brackets. A way that involves more finesse would allow me to create an
RSS feed based on my search and then give the RSS feed URL to the
pubmed function. Or perhaps one could just plop the query inside the
pubmed function:

pubmed(somefunction("Laryngeal Neoplasms"[MeSH] AND "Papilloma"[MeSH])
OR ((("recurrence"[TIAB] NOT Medline[SB]) OR "recurrence"[MeSH Terms]
OR recurrent[Text Word]) AND respiratory[All Fields] AND
(("papilloma"[TIAB] NOT Medline[SB]) OR "papilloma"[MeSH Terms] OR
papillomatosis[Text Word])))

Does "somefunction" exist?

If there are any further questions, do you think I should migrate this
conversation to the Bioconductor mailing list?



Farrel Buchinsky


Re: Analyzing Publications from Pubmed via XML

Gabor Grothendieck
In reply to this post by Farrel Buchinsky

Note that the lines after a <- xmlRoot(x) could be reduced to:

xmlSApply(a, buildPubMedAbst)


Re: Analyzing Publications from Pubmed via XML

Duncan Temple Lang
In reply to this post by Farrel Buchinsky




You can simplify the final 5 lines to

   absts = xmlApply(a, buildPubMedAbst)

which is shorter, fractionally faster and handles cases where there are
no abstracts.




Re: Analyzing Publications from Pubmed via XML

David Winsemius
In reply to this post by Farrel Buchinsky
"Farrel Buchinsky" <[hidden email]> wrote in
news:[hidden email]:

> On Dec 13, 2007 11:35 PM, Robert Gentleman <[hidden email]> wrote:
>> or just try looking in the annotate package from Bioconductor
>>
>
> Yip. annotate seems to be the most streamlined way to do this.
> 1) How does one turn the list that is created into a dataframe whose
> column names are along the lines of date, title, journal, authors etc

Gabor's example already did that task.

> 2) I have already created a standing search in pubmed using MyNCBI.
> There are many ways I can feed those results to the pubmed() function.
> The most brute force way of doing it is by running the search and
> outputing the data as a UI List and getting that into the pubmed
> brackets. A way that involved more finesse would allow me to create a
> rss feed based on my search and then give the rss feed url to the
> pubmed function. Or perhaps once could just plop the query inside the
> pubmed functions
> pubmed(somefunction("Laryngeal Neoplasms"[MeSH] AND "Papilloma"[MeSH])
> OR ((("recurrence"[TIAB] NOT Medline[SB]) OR "recurrence"[MeSH Terms]
> OR recurrent[Text Word]) AND respiratory[All Fields] AND
> (("papilloma"[TIAB] NOT Medline[SB]) OR "papilloma"[MeSH Terms] OR
> papillomatosis[Text Word])))
>
> Does "somefunction" exist?

I could not find it. The pubmed function appears to assume that you will
already have a list of PMIDs. When I set up a function to take an
arbitrary PubMed search string (quoted by the user) and return the
PMIDs, I had success by following Gabor's example:

pm.srch <- function() {
  srch.stem <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term="
  query <- as.character(scan(file = "", what = "character"))
  doc <- xmlTreeParse(paste(srch.stem, query, sep = ""), isURL = TRUE,
                      useInternalNodes = TRUE)
  sapply(c("//Id"), xpathApply, doc = doc, fun = xmlValue)
}
> pm.srch()
1: "laryngeal neoplasms[mh]"
2:
Read 1 item
      //Id      
 [1,] "18042931"
 [2,] "18038886"
 [3,] "17978930"
 [4,] "17974987"
 [5,] "17972507"
 [6,] "17970149"
 [7,] "17967299"
 [8,] "17962724"
 [9,] "17954109"
[10,] "17942038"
[11,] "17940076"
[12,] "17848290"
[13,] "17848288"
[14,] "17848287"
[15,] "17848278"
[16,] "17938330"
[17,] "17938329"
[18,] "17918311"
[19,] "17910347"
[20,] "17908862"

Emboldened by that minor success, I pushed on. Pubmed said your example
was malformed and I took their suggested modification:
("Laryngeal Neoplasms"[MeSH] AND "Papilloma"[MeSH]) OR (("recurrence"[TIAB] NOT Medline[SB]) OR "recurrence"[MeSH Terms] OR recurrent[Text Word]) AND respiratory[All Fields] AND (("papilloma"[TIAB] NOT Medline[SB]) OR "papilloma"[MeSH Terms] OR papillomatosis[Text Word])

That returned 400+ citations, and I put it into a text file.

After quite a bit of hacking (in the sense of ineffective chopping with
a dull ax), I finally came up with:

pm.srch <- function() {
  srch.stem <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term="
  query <- readLines(con = file.choose())
  query <- gsub("\\\"", "", x = query)
  doc <- xmlTreeParse(paste(srch.stem, query, sep = ""), isURL = TRUE,
                      useInternalNodes = TRUE)
  return(sapply(c("//Id"), xpathApply, doc = doc, fun = xmlValue))
}

pm.srch()  #choosing the search-file
      //Id      
 [1,] "18046565"
 [2,] "17978930"
 [3,] "17975511"
 [4,] "17935912"
 [5,] "17851940"
 [6,] "17765779"
 [7,] "17688640"
 [8,] "17638782"
 [9,] "17627059"
[10,] "17599582"
[11,] "17589729"
[12,] "17585283"
[13,] "17568846"
[14,] "17560665"
[15,] "17547971"
[16,] "17428551"
[17,] "17419899"
[18,] "17419519"
[19,] "17385606"
[20,] "17366752"

--
David Winsemius


Re: Analyzing Publications from Pubmed via XML

David Winsemius
David Winsemius <[hidden email]> wrote in
news:Xns9A077F740B4A0dNOTwinscomcast@80.91.229.13:

> "Farrel Buchinsky" <[hidden email]> wrote in
> news:[hidden email]:
>
>> On Dec 13, 2007 11:35 PM, Robert Gentleman <[hidden email]>
>> wrote:
>>> or just try looking in the annotate package from Bioconductor
>>>
>>
>> Yip. annotate seems to be the most streamlined way to do this.
>> 1) How does one turn the list that is created into a dataframe whose
>> column names are along the lines of date, title, journal, authors etc
>
> Gabor's example already did that task.
>

Actually the object returned by Gabor's method was a list of lists. Here
is one way (probably very inefficient) of getting "doc" into a
data.frame:

colvals <- sapply(c("//title", "//author", "//category"), xpathApply,
                  doc = doc, fun = xmlValue)

titles <- as.vector(unlist(colvals[1])[3:17])

# needed to drop extraneous titles for the search name and an NCBI header
# > str(colvals)
# List of 3
#  $ //title   :List of 17
#   ..$ : chr "PubMed: (\"Laryngeal Neoplasm..."
#   ..$ : chr "NCBI PubMed"

authors <- colvals[[2]]
jrnls <- colvals[[3]]

# not sure why, but trying to do it in one step failed:
#   cites <- data.frame(titles = as.vector(unlist(colvals[1])[3:17]),
#                       authors = colvals[[2]], jnrls = colvals[[3]])
# Error in data.frame(titles = as.vector(unlist(colvals[1])[3:17]),
# authors = colvals[[2]],  :
#   arguments imply differing number of rows: 15, 1
# but the following worked

cites <- data.frame(titles = as.vector(titles))
cites$author <- authors
cites$jrnls <- jrnls
cites

I am still wondering how to extract material that does not have an XML
tag.  Each item looks like:

 <item>
   <title>Gastroesophageal reflux in patients with recurrent laryngeal
papillomatosis.</title>
   <link>http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?
tmpl=NoSidebarfile&amp;db=PubMed&amp;cmd=Retrieve&amp;list_uids=17589729
&amp;dopt=Abstract</link>
   <description>
    <![CDATA[
    <table border="0" width="100%"><tr><td align="left"><a
href="http://www.scielo.br/scielo.php?script=sci_arttext&amp;pid=S0034-
72992007000200011&amp;lng=en&amp;nrm=iso&amp;tlng=en"><img
src="http://www.ncbi.nlm.nih.gov/entrez/query/egifs/http:--www.scielo.br-
img-scielo_en.gif" border="0"/></a> </td><td align="right"><a
href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?
db=PubMed&amp;cmd=Display&amp;dopt=PubMed_PubMed&amp;from_uid=17589729">
Related Articles</a></td></tr></table>
        <p><b>Gastroesophageal reflux in patients with recurrent
laryngeal papillomatosis.</b></p>
        <p>Rev Bras Otorrinolaringol (Engl Ed). 2007 Mar-Apr;73(2):210-4
</p>
        <p>Authors:  Pignatari SS, Liriano RY, Avelino MA, Testa JR,
Fujita R, De Marco EK</p>
        <p>Evidence of a relation between gastroesophaeal reflux and
pediatric respiratory disorders increases every year. Many respiratory
symptoms and clinical conditions such as stridor, chronic cough, and
recurrent pneumonia and bronchitis appear to be related to
gastroesophageal reflux. Some studies have also suggested that
gastroesophageal reflux may be associated with recurrent laryngeal
papillomatosis, contributing to its recurrence and severity. AIM: the aim
of this study was to verify the frequency and intensity of
gastroesophageal reflux in children with recurrent laryngeal
papillomatosis. MATERIAL AND METHODS: ten children of both genders, aged
between 3 and 12 years, presenting laryngeal papillomatosis, were
included in this study. The children underwent 24-hour double-probe pH-
metry. RESULTS: fifty percent of the patients had evidence of
gastroesophageal reflux at the distal sphincter; 90% presented reflux at
the proximal sphincter. CONCLUSION: the frequency of proximal
gastroesophageal reflux is significantly increased in patients with
recurrent laryngeal papillomatosis.</p>
        <p>PMID: 17589729 [PubMed - in process]</p>    ]]>
   </description>
   <author>Pignatari SS, Liriano RY, Avelino MA, Testa JR, Fujita R, De
Marco EK</author>
   <category>Rev Bras Otorrinolaringol (Engl Ed)</category>
   <guid isPermaLink="false">PubMed:17589729</guid>
  </item>

I would like to access, for instance, the PMID or the abstract within the
<description> element, but I do not think that they have names in the
same way that <author> or <category> have named XML nodes. I suspect that
getting the output in a different format, say as MEDLINE, might produce
output that was tagged more completely.
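[Editorially, one partial workaround can be sketched here: the PMID at least is recoverable from the <guid> element, whose text is "PubMed:<id>", and the description's HTML payload can be run through a second, cruder pass. A rough Python sketch of both ideas, following Rajarshi's ElementTree example and using a cut-down, hypothetical item string rather than the live feed:]

```python
import re
from xml.etree.ElementTree import fromstring

# Cut-down stand-in for one <item> of the feed; the real description
# holds a large CDATA block of HTML rather than this escaped stub.
item = """<item>
  <guid isPermaLink="false">PubMed:17589729</guid>
  <description>&lt;p&gt;PMID: 17589729 [PubMed - in process]&lt;/p&gt;</description>
</item>"""

root = fromstring(item)

# guid text is "PubMed:<id>"; the PMID is everything after the colon.
pmid = root.findtext("guid").split(":", 1)[1]

# Second pass over the description's (decoded) HTML: a regex pull of
# the "PMID: nnnn" line, since the paragraphs carry no XML tags.
match = re.search(r"PMID:\s*(\d+)", root.findtext("description"))

print(pmid, match.group(1))
```

The same two-pass idea works in R: extract the <description> value with xmlValue, then parse that string as HTML, which is what Gabor's htmlTreeParse approach does.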

--
David Winsemius


Re: Analyzing Publications from Pubmed via XML

Gabor Grothendieck
If we can assume that the abstract is always the 4th paragraph then we
can try something like this:

library(XML)
doc <- xmlTreeParse("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-",
isURL = TRUE, useInternalNodes = TRUE, trim = TRUE)

out <- cbind(
        Author = unlist(xpathApply(doc, "//author", xmlValue)),
        PMID = gsub(".*:", "", unlist(xpathApply(doc, "//guid", xmlValue))),
        Abstract = unlist(xpathApply(doc, "//description",
                function(x) {
                        on.exit(free(doc2))
                        doc2 <- htmlTreeParse(xmlValue(x)[[1]], asText = TRUE,
                                useInternalNodes = TRUE, trim = TRUE)
                        xpathApply(doc2, "//p[4]", xmlValue)
                }
        )))
free(doc)
substring(out, 1, 25) # display first 25 chars of each field


The last line produces (it may look messed up in this email):

> substring(out, 1, 25) # display it
      Author                      PMID       Abstract
 [1,] " Goon P, Sonnex C, Jani P" "18046565" "Human papillomaviruses (H"
 [2,] " Rad MH, Alizadeh E, Ilkh" "17978930" "Recurrent laryngeal papil"
 [3,] " Lee LA, Cheng AJ, Fang T" "17975511" "OBJECTIVES:: Papillomas o"
 [4,] " Gerein V, Schmandt S, Ba" "17935912" "BACKGROUND: Human papillo"
 [5,] " Hopp R, Natarajan N, Lew" "17908862" ""
 [6,] " Preuss SF, Klussmann JP," "17851940" "CONCLUSIONS: The presente"
 [7,] " Mouadeb DA, Belafsky PC"  "17765779" "OBJECTIVES: The 585nm pul"
 [8,] " Thompson L"               "17702311" ""
 [9,] " Schaffer A, Brotherton J" "17688640" ""
[10,] " Stephen JK, Vaught LE, C" "17638782" "OBJECTIVE: To investigate"
[11,] " Shah KV, Westra WH"       "17627059" ""
[12,] " Koufman JA, Rees CJ, Fra" "17599582" "BACKGROUND: Unsedated off"
[13,] " Akst LM, Broadhurst MS, " "17592395" ""
[14,] " Pignatari SS, Liriano RY" "17589729" "Evidence of a relation be"


On Dec 15, 2007 10:13 PM, David Winsemius <[hidden email]> wrote:

> David Winsemius <[hidden email]> wrote in
> news:Xns9A077F740B4A0dNOTwinscomcast@80.91.229.13:
>
> > "Farrel Buchinsky" <[hidden email]> wrote in
> > news:[hidden email]:
> >
> >> On Dec 13, 2007 11:35 PM, Robert Gentleman <[hidden email]>
> >> wrote:
> >>> or just try looking in the annotate package from Bioconductor
> >>>
> >>
> >> Yip. annotate seems to be the most streamlined way to do this.
> >> 1) How does one turn the list that is created into a dataframe whose
> >> column names are along the lines of date, title, journal, authors etc
> >
> > Gabor's example already did that task.
> >
>
> Actually the object returned by Gabor's method was a list of lists. Here
> is one way (probably very inefficient) of getting "doc" into a
> data.frame:
>
> colvals <-sapply(c("//title", "//author", "//category"), xpathApply,
>           doc = doc, fun = xmlValue)
>
> titles=as.vector(unlist(colvals[1])[3:17])
>
> # needed to drop extraneous titles for search name and an NCBI header
> #>str(colvals)
> #List of 3
> # $ //title   :List of 17
> #  ..$ : chr "PubMed: (\"Laryngeal Neoplasm..."
> #  ..$ : chr "NCBI PubMed"
>
> authors=colvals[[2]]
> jrnls=colvals[[3]]
>
> # not sure why, but trying to do it in one step failed:
> #  cites<-data.frame(titles=as.vector(unlist(colvals[1])[3:17]),
> #                     authors=colvals[[2]],jnrls=colvals[[3]])
> # Error in data.frame(titles = as.vector(unlist(colvals[1])[3:17]),
> # authors = colvals[[2]],  :
> #  arguments imply differing number of rows: 15, 1
> # but the following worked
>
>  cites<-data.frame(titles=as.vector(titles))
>  cites$author<-authors
>  cites$jrnls<-jrnls
>  cites
>
> I am still wondering how to extract material that does not have an XML
> tag.  Each item looks like:
>
>  <item>
>   <title>Gastroesophageal reflux in patients with recurrent laryngeal
> papillomatosis.</title>
>   <link>http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?
> tmpl=NoSidebarfile&amp;db=PubMed&amp;cmd=Retrieve&amp;list_uids=17589729
> &amp;dopt=Abstract</link>
>   <description>
>    <![CDATA[
>    <table border="0" width="100%"><tr><td align="left"><a
> href="http://www.scielo.br/scielo.php?script=sci_arttext&amp;pid=S0034-
> 72992007000200011&amp;lng=en&amp;nrm=iso&amp;tlng=en"><img
> src="http://www.ncbi.nlm.nih.gov/entrez/query/egifs/http:--www.scielo.br-
> img-scielo_en.gif" border="0"/></a> </td><td align="right"><a
> href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?
> db=PubMed&amp;cmd=Display&amp;dopt=PubMed_PubMed&amp;from_uid=17589729">
> Related Articles</a></td></tr></table>
>        <p><b>Gastroesophageal reflux in patients with recurrent
> laryngeal papillomatosis.</b></p>
>        <p>Rev Bras Otorrinolaringol (Engl Ed). 2007 Mar-Apr;73(2):210-4
> </p>
>        <p>Authors:  Pignatari SS, Liriano RY, Avelino MA, Testa JR,
> Fujita R, De Marco EK</p>
>        <p>Evidence of a relation between gastroesophaeal reflux and
> pediatric respiratory disorders increases every year. Many respiratory
> symptoms and clinical conditions such as stridor, chronic cough, and
> recurrent pneumonia and bronchitis appear to be related to
> gastroesophageal reflux. Some studies have also suggested that
> gastroesophageal reflux may be associated with recurrent laryngeal
> papillomatosis, contributing to its recurrence and severity. AIM: the aim
> of this study was to verify the frequency and intensity of
> gastroesophageal reflux in children with recurrent laryngeal
> papillomatosis. MATERIAL AND METHODS: ten children of both genders, aged
> between 3 and 12 years, presenting laryngeal papillomatosis, were
> included in this study. The children underwent 24-hour double-probe pH-
> metry. RESULTS: fifty percent of the patients had evidence of
> gastroesophageal reflux at the distal sphincter; 90% presented reflux at
> the proximal sphincter. CONCLUSION: the frequency of proximal
> gastroesophageal reflux is significantly increased in patients with
> recurrent laryngeal papillomatosis.</p>
>        <p>PMID: 17589729 [PubMed - in process]</p>    ]]>
>   </description>
>   <author>Pignatari SS, Liriano RY, Avelino MA, Testa JR, Fujita R, De
> Marco EK</author>
>   <category>Rev Bras Otorrinolaringol (Engl Ed)</category>
>   <guid isPermaLink="false">PubMed:17589729</guid>
>  </item>
>
> I would like to access, for instance, the PMID or the abstract within the
> <description> element, but I do not think that they have names in the
> same way that <author> or <category> have named XML nodes. I suspect that
> getting the output in a different format, say as MEDLINE, might produce
> output that was tagged more completely.
>
>
> --
> David Winsemius
>


Re: Analyzing Publications from Pubmed via XML

David Winsemius
On 15 Dec 2007, you wrote in gmane.comp.lang.r.general:

> If we can assume that the abstract is always the 4th paragraph then we
> can try something like this:
>
> library(XML)
> doc <-
> xmlTreeParse("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss
> _guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-", isURL = TRUE,
> useInternalNodes = TRUE, trim = TRUE)
>
> out <- cbind(
>      Author = unlist(xpathApply(doc, "//author", xmlValue)),
>      PMID = gsub(".*:", "", unlist(xpathApply(doc, "//guid",
>      xmlValue))),
>      Abstract = unlist(xpathApply(doc, "//description",
>           function(x) {
>                on.exit(free(doc2))
>                doc2 <- htmlTreeParse(xmlValue(x)[[1]], asText = TRUE,
>                     useInternalNodes = TRUE, trim = TRUE)
>                xpathApply(doc2, "//p[4]", xmlValue)
>           }
>      )))
> free(doc)
> substring(out, 1, 25) # display first 25 chars of each field
>
>
> The last line produces (it may look messed up in this email):
>
>> substring(out, 1, 25) # display it
>       Author                      PMID       Abstract
 [1,] " Goon P, Sonnex C, Jani P" "18046565" "Human papillomaviruses (H"
 [2,] " Rad MH, Alizadeh E, Ilkh" "17978930" "Recurrent laryngeal papil"
 [3,] " Lee LA, Cheng AJ, Fang T" "17975511" "OBJECTIVES:: Papillomas o"
 [4,] " Gerein V, Schmandt S, Ba" "17935912" "BACKGROUND: Human papillo"
snip
>
>

It looked beautifully regular in my newsreader. It is helpful to see an
example showing indexed access to nodes, and also to see the use of
substring for column display. Thank you (for this and all of your other
contributions).

I find upon further browsing that the pmfetch access point is obsolete.
Experimentation with the PubMed eFetch server access point results in fully
xml-tagged results:

e.fetch.doc <- function () {
   fetch.stem <-
        "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?"
   src.mode <- "db=pubmed&retmode=xml&"
   request  <- "id=11045395"
   doc <- xmlTreeParse(paste(fetch.stem, src.mode, request, sep = ""),
                       isURL = TRUE, useInternalNodes = TRUE)
   return(doc)
}
# in the debugging phase I needed to set useInternalNodes = TRUE to see the
# tags. Never did find a way to "print" them when internal.

doc<-e.fetch.doc()
get.info<- function(doc){
         df<-cbind(
  Abstract = unlist(xpathApply(doc, "//AbstractText", xmlValue)),
  Journal =  unlist(xpathApply(doc, "//Title", xmlValue)),
  Pmid =  unlist(xpathApply(doc, "//PMID", xmlValue))
                   )
   return(df)
   }

# this works
> substring(get.info(doc), 1, 25)
     Abstract                    Journal                     Pmid      
[1,] "We studied the prevalence" "Pediatric nephrology (Ber" "11045395"


--
David Winsemius


Re: Analyzing Publications from Pubmed via XML

Gabor Grothendieck
On Dec 16, 2007 2:53 PM, David Winsemius <[hidden email]> wrote:
> # in the debugging phase I needed to set useInternalNodes = TRUE to see the
> tags. Never did find a way to "print" them when internal.

I assume you mean FALSE.  See:
?saveXML
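For the record, saveXML() is what serializes an internal document or node back to XML text, so an internal tree can be inspected without re-parsing. A minimal sketch on a toy document:

```r
library(XML)

# Parse a tiny document into internal (C-level) nodes.
doc <- xmlTreeParse("<a><b>text</b></a>", asText = TRUE,
                    useInternalNodes = TRUE)

cat(saveXML(doc))                  # whole document printed as XML text

node <- xpathApply(doc, "//b")[[1]]
cat(saveXML(node))                 # a single node, e.g. <b>text</b>

free(doc)                          # release the internal document
```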


Re: Analyzing Publications from Pubmed via XML

David Winsemius
"Gabor Grothendieck" <[hidden email]> wrote in
news:[hidden email]:

> On Dec 16, 2007 2:53 PM, David Winsemius <[hidden email]>
> wrote:
>> # in the debugging phase I needed to set useInternalNodes = TRUE to
>> see the tags. Never did find a way to "print" them when internal.
>
> I assume you mean FALSE.  See:
> ?saveXML

You're correct, yet again; I did a copy/paste/forget-to-edit. And thanks
for the further tip.

--
David


Re: Analyzing Publications from Pubmed via XML

David Winsemius
In reply to this post by Gabor Grothendieck
"Gabor Grothendieck" <[hidden email]> wrote in
news:[hidden email]:

> On Dec 16, 2007 2:53 PM, David Winsemius <[hidden email]>
> wrote:
>> # Never did find a way to "print" them when internal.

> ?saveXML

And now I understand where that odd "\n      <text>" originated before I
changed the searched-for node name from //Abstract to //AbstractText. It's
a remnant of the pretty-printing of the XML tree, left over after excising
the intervening node name.

--
David Winsemius


Re: Analyzing Publications from Pubmed via XML

Duncan Temple Lang
In reply to this post by David Winsemius

David Winsemius wrote:

> # in the debugging phase I needed to set useInternalNodes = TRUE to see the
> tags. Never did find a way to "print" them when internal.

saveXML(node)

will return a string giving the XML content of that node as text.




Re: Analyzing Publications from Pubmed via XML

Armin Goralczyk
In reply to this post by David Winsemius
On Dec 15, 2007 6:31 PM, David Winsemius <[hidden email]> wrote:

> After quite a bit of hacking (in the sense of ineffective chopping with
> a dull ax), I finally came up with:
>
> pm.srch<- function (){
>   srch.stem<-"http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term="
>   query<-readLines(con=file.choose())
>   query<-gsub("\\\"","",x=query)
>   doc<-xmlTreeParse(paste(srch.stem,query,sep=""),isURL = TRUE,
>                      useInternalNodes = TRUE)
>   return(sapply(c("//Id"), xpathApply, doc = doc, fun = xmlValue) )
>      }
>
> pm.srch()  #choosing the search-file
>       //Id
>  [1,] "18046565"
>  [2,] "17978930"
>  [3,] "17975511"
>  [4,] "17935912"
>  [5,] "17851940"
>  [6,] "17765779"
>  [7,] "17688640"
>  [8,] "17638782"
>  [9,] "17627059"
> [10,] "17599582"
> [11,] "17589729"
> [12,] "17585283"
> [13,] "17568846"
> [14,] "17560665"
> [15,] "17547971"
> [16,] "17428551"
> [17,] "17419899"
> [18,] "17419519"
> [19,] "17385606"
> [20,] "17366752"

I tried the example above, but only the first 20 PMIDs were
returned. How can I circumvent this (I guess it's a restriction
imposed by PubMed)?
--
Armin Goralczyk, M.D.
--
Universitätsmedizin Göttingen
Abteilung Allgemein- und Viszeralchirurgie
Rudolf-Koch-Str. 40
39099 Göttingen
--
Dept. of General Surgery
University of Göttingen
Göttingen, Germany
--
http://www.chirurgie-goettingen.de

Re: Analyzing Publications from Pubmed via XML

Martin Morgan
Hi Armin --

See the help page for esearch

http://www.ncbi.nlm.nih.gov/entrez/query/static/esearch_help.html

especially the 'retmax' key.
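Concretely, appending retmax (and, for paging through large result sets, retstart) to the esearch URL lifts the default 20-ID cap. A hedged sketch modeled on the pm.srch function earlier in the thread; the retmax value of 500 is just an example, and the query is passed in directly rather than read from a file:

```r
library(XML)

# Search PubMed, asking the server for up to `retmax` IDs instead of
# the default 20, and return the PMIDs as a character vector.
pm.srch <- function(query, retmax = 500) {
   srch.stem <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed"
   url <- paste(srch.stem, "&retmax=", retmax, "&term=", query, sep = "")
   doc <- xmlTreeParse(url, isURL = TRUE, useInternalNodes = TRUE)
   on.exit(free(doc))
   # full path, per the advice above, rather than a bare //Id
   unlist(xpathApply(doc, "/eSearchResult/IdList/Id", xmlValue))
}

# ids <- pm.srch("laryngeal+papillomatosis")
```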

A couple of other thoughts on this thread...

1) using the full path, e.g.,

ids <- xpathApply(doc, "/eSearchResult/IdList/Id", xmlValue)

is likely to lead to less grief in the long run, as you'll only select
elements of the node you're interested in, rather than any element,
anywhere in the document, labeled 'Id'.

2) From a different post in the thread, things like

On Dec 16, 2007 2:53 PM, David Winsemius <dwinsemius at comcast.net> wrote:
[snip]
> get.info<- function(doc){
>          df<-cbind(
>   Abstract = unlist(xpathApply(doc, "//AbstractText", xmlValue)),
>   Journal =  unlist(xpathApply(doc, "//Title", xmlValue)),
>   Pmid =  unlist(xpathApply(doc, "//PMID", xmlValue))
>                    )
>    return(df)
>    }

will lead to more trouble, because they assume that AbstractText, etc.
occur exactly once in each record. It would seem better to extract the
relevant node, and query that, probably defining appropriate
defaults. I started with

xpath_or_na <- function(doc, q) {
    res <- xpathApply(doc, q, xmlValue)
    if (length(res)==1) res[[1]]
    else NA_character_
}

citn <- function(citation){
  Abstract <- xpath_or_na(citation,
                           "/MedlineCitation/Article/Abstract/AbstractText")
  Journal <- xpath_or_na(citation,
                          "/MedlineCitation/Article/Journal/Title")
  Pmid <- xpath_or_na(citation,
                       "/MedlineCitation/PMID")
    c(Abstract=Abstract, Journal=Journal, Pmid=Pmid)
}

medline_q <- "/PubmedArticleSet/PubmedArticle/MedlineCitation"
res <- xpathApply(doc, medline_q, citn)

One would still have to coerce res into a data.frame. Also worth
thinking about each of the lines in citn -- e.g., the Title query clearly
only applies to Journals. Eventually one wants to consult the DTD
(basically, the contract spelling out the content) of the document, confirm
that the xpath queries will perform correctly, and verify that the document
actually conforms to its DTD.
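The coercion into a data.frame can be done with do.call; a minimal sketch, assuming res is the list returned by the xpathApply call above, i.e. one named character vector per citation from citn:

```r
# res: list of named character vectors (Abstract, Journal, Pmid).
# rbind them into a character matrix, then convert to a data.frame,
# keeping the columns as character rather than factor.
cites <- as.data.frame(do.call(rbind, res),
                       stringsAsFactors = FALSE)
```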

Following my own advice, I quickly found that doing things 'more
right' becomes quite complicated, and suddenly became satisfied with
the information I can get out of the 'annotate' package.

Martin

"Armin Goralczyk" <[hidden email]> writes:

> I tried the example above, but only the first 20 PMIDs will be
> returned. How can I circumvent this (I guess it's a restriction
> imposed by PubMed)?

--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793
