Parsing large XML documents in R - how to optimize the speed?

Parsing large XML documents in R - how to optimize the speed?

Frederic Fournier
Hello everyone,

I would like to parse very large XML files from MS/MS experiments and
create R objects from their content. (By very large, I mean up to
5-10 GB, although I am using a 'small' 40 MB file to test my code.)

My first attempt at parsing the 40 MB file with the XML package took more
than 2200 seconds, which left me quite disappointed.
I managed to cut that down to around 40 seconds by:
    -using the 'useInternalNodes' option of the XML package when parsing
     the XML tree;
    -vectorizing the parsing (i.e., replacing loops like "for(node in
     group.of.nodes) {...}" with "sapply(group.of.nodes, function(node) {...})").
I gained another 5 seconds by making small changes to the functions used
(like replacing 'getNodeSet' with 'xmlElementsByTagName' when I don't need to
navigate to the children nodes).
I am now stuck at around 35 seconds and would still like to cut this
time by a factor of 5, but I have no idea how to achieve that. I'll try
to lay out as briefly as possible the relevant structure of the XML file I
am parsing, the structure of the R object I want to create, and the kind of
functions I am using to do it. I hope that one of you will be able to point
me towards a better and quicker way of doing the parsing!
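
For readers following along, the loop-to-apply change described above might look roughly like the sketch below. The inline document, the element name 'node' and the attribute 'id' are made up for illustration and are not taken from the real MS/MS files; vapply() is used as a stricter stand-in for sapply().

```r
library(XML)

# Tiny hypothetical document standing in for the real file.
doc <- xmlParse("<root><node id='a'/><node id='b'/><node id='c'/></root>",
                asText = TRUE)
nodes <- getNodeSet(doc, "//node")

# Loop version: the result vector is grown one element at a time.
ids.loop <- character(0)
for (node in nodes) {
  ids.loop <- c(ids.loop, xmlGetAttr(node, "id"))
}

# Apply version: one call over the whole node set; vapply() also checks
# that every result is a single character string.
ids.vec <- vapply(nodes, xmlGetAttr, character(1), name = "id")
```

Both versions collect the same attribute values; the apply form avoids repeatedly copying the growing vector, which matters once there are many nodes.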


Here is the (simplified) structure of the relevant nodes of the xml file:

<model> (many many nodes)
  <protein> (a couple of proteins per model node)
    <peptide> (1 per protein node)
      <domain> (1 or more per peptide node)
        <aa> (0 or more per domain node)
        </aa>
      </domain>
    </peptide>
  </protein>
</model>

Here is the basic structure of the R object that I want to create:

'result' object that contains:
  -various attributes
  -a list of 'protein' objects, each of which contains:
      -various attributes
      -a list of 'peptide' objects, each of which contains:
        -various attributes
        -a list of 'aa' objects, each of which consists of a couple of
         attributes.

Here is the basic structure of the code:

library(XML)

xml.doc <- xmlTreeParse("file", getDTD=FALSE, useInternalNodes=TRUE)
result <- new('S4_result_class')
result@proteins <- xpathApply(xml.doc, "//model/protein",
                              function(protein.node) {
  protein <- new('S4_protein_class')
  ## fill in a couple of attributes of the protein object using xmlValue
  ## and xmlAttrs(protein.node)
  protein@peptides <- xpathApply(protein.node, "./peptide",
                                 function(peptide.node) {
    peptide <- new('S4_peptide_class')
    ## fill in a couple of attributes of the peptide object using xmlValue
    ## and xmlAttrs(peptide.node)
    ## note: xmlElementsByTagName matches direct children only unless
    ## recursive=TRUE is given
    peptide@aas <- lapply(xmlElementsByTagName(peptide.node, name="aa"),
                          function(aa.node) {
      aa <- new('S4_aa_class')
      ## fill in a couple of attributes of the 'aa' object using xmlValue
      ## and xmlAttrs(aa.node)
      aa       ## return the filled-in object
    })
    peptide    ## return the object, not the value of the last assignment
  })
  protein
})
free(xml.doc)


Does anyone know a better and quicker way of doing this?

Sorry for the very long message and thank you very much for your time and
help!

Frederic

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: Parsing large XML documents in R - how to optimize the speed?

Martin Morgan
On 08/10/2012 03:46 PM, Frederic Fournier wrote:
> Hello everyone,
>
> I would like to parse very large xml files from MS/MS experiments and
> create R objects from their content. (By very large, I mean going up to
> 5-10Gb, although I am using a 'small' 40M file to test my code.)

I'm not 100% sure of its relevance, but see the MSnbase package:

   http://bioconductor.org/packages/2.10/bioc/html/MSnbase.html

There is a vignette here, for instance:

   http://bioconductor.org/packages/2.10/bioc/vignettes/MSnbase/inst/doc/MSnbase-io.pdf

If this is useful, then further questions might be directed to the
Bioconductor mailing list.

   http://bioconductor.org/help/mailing-list/

Martin



--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


Re: Parsing large XML documents in R - how to optimize the speed?

Duncan Temple Lang
In reply to this post by Frederic Fournier

Hi Frederic,

You definitely want to be using xmlParse() (or, equivalently,
xmlTreeParse( , useInternalNodes = TRUE)).

This then allows the use of getNodeSet().
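
As a rough illustration of that combination, the sketch below parses a small inline document with xmlParse() and pulls nodes and attributes out with XPath. The root element 'results' and the attribute names 'id' and 'start' are invented for the example; only the model/protein/peptide nesting comes from the original post.

```r
library(XML)

# Hypothetical miniature of the file layout described in the question.
xml.text <- "<results>
  <model><protein id='p1'><peptide start='1'/></protein></model>
  <model><protein id='p2'><peptide start='7'/></protein></model>
</results>"

# xmlParse() builds the tree as internal (C-level) nodes.
doc <- xmlParse(xml.text, asText = TRUE)

# getNodeSet() returns the matching nodes as a list.
protein.nodes <- getNodeSet(doc, "//model/protein")

# xpathSApply() applies a function to every match in one call, so a whole
# column of attribute values is collected without an explicit loop.
protein.ids <- xpathSApply(doc, "//model/protein", xmlGetAttr, "id")
peptide.starts <- as.integer(
  xpathSApply(doc, "//model/protein/peptide", xmlGetAttr, "start"))
```

Collecting each attribute as a vector over all nodes, and only then building the R objects, is one possible way to reduce the number of per-node R function calls; whether it helps in this case would need to be measured.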

I would suggest you use Rprof() to find out where the bottlenecks arise,
e.g. in the XML functions, in the S4 code, or in your code that assembles
the R objects from the XML.
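
A minimal profiling session might look like the sketch below. The function parse.stub() is a made-up placeholder for the real parsing code, and its workload size is arbitrary; only Rprof() and summaryRprof() are the actual tools being suggested.

```r
# Placeholder for the real parsing function (hypothetical workload that
# runs long enough for the sampling profiler to collect some samples).
parse.stub <- function(n = 5e5) {
  out <- vector("list", n)
  for (i in seq_len(n)) out[[i]] <- paste0("node", i)
  out
}

prof.file <- tempfile(fileext = ".out")

Rprof(prof.file)       # start sampling the call stack
res <- parse.stub()
Rprof(NULL)            # stop profiling

# Summarise time per function; 'by.self' lists time spent in each
# function itself, excluding the functions it calls.
prof <- summaryRprof(prof.file)
head(prof$by.self)
```

With the real code in place of parse.stub(), the top rows of by.self should indicate whether the time goes into XML package calls, S4 object construction (for example new() and initialize()), or the surrounding glue code.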

I'm happy to take a look at speeding it up if you can make the test file
available and show me your code.

    D.

Re: Parsing large XML documents in R - how to optimize the speed?

Erdal Karaca-2
In reply to this post by Frederic Fournier
If this is an option for you: an XML database can handle very large XML
files and lets you query nodes very efficiently.
You could then query the XML database from R (using REST) to do your
statistics.

There are some open-source XQuery/XML databases available.
