Mismatch distribution

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Mismatch distribution

Myriam Croze
Hello!

I need your help. I am trying to calculate the pairwise differences between
sequences from several fasta files.
I would like for each of my DNA alignments (fasta files), calculate the
pairwise differences and then:
- 1. Combine all the data of each file to have one file and one histogram
(mismatch distribution)
- 2. calculate the mean for each difference for all the file and again make
a mismatch distribution plot

Here the script that I wrote:

library("pegas")
> library("seqinr")
> library("ggplot2")
>
>

> Files <- list.files(pattern="fas")
> nb_files <- length(Files)
>
>
> for (i in 1:nb_files) {
>         Dist <-  as.numeric(dist.gene(read.dna(Files[i], "fasta"), method
> = "pairwise",
>                            pairwise.deletion = FALSE, variance = FALSE))
>
>         Data <- merge(Data, Dist, by=c("x"), all=T)
>     }
>


> hist(Data, prob=TRUE)
> lines(density(Data), col="blue", lwd=2)
>

However, the script does not work and I do not know what to change to make
it working.
Thanks in advance for your help.

Myriam

--
Myriam Croze, PhD
Post-doctorante
Division of EcoScience,
Ewha Womans University
Seoul, South Korea

Email: [hidden email]

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Mismatch distribution

Bert Gunter-2
"Do not work" does not work (in providing sufficient info). See the Posting
guide  linked below for how to post an intelligible question.

HOWEVER, I suspect you would do better posting on te Bioconductor list
where they are much more likely to know what "fasta" files look like and
might even have software already developed to do what you want. You could
well be trying to reinvent wheels.

Cheers,
Bert


On Mon, Jan 21, 2019 at 5:35 PM Myriam Croze <[hidden email]>
wrote:

> Hello!
>
> I need your help. I am trying to calculate the pairwise differences between
> sequences from several fasta files.
> I would like for each of my DNA alignments (fasta files), calculate the
> pairwise differences and then:
> - 1. Combine all the data of each file to have one file and one histogram
> (mismatch distribution)
> - 2. calculate the mean for each difference for all the file and again make
> a mismatch distribution plot
>
> Here the script that I wrote:
>
> library("pegas")
> > library("seqinr")
> > library("ggplot2")
> >
> >
>
> > Files <- list.files(pattern="fas")
> > nb_files <- length(Files)
> >
> >
> > for (i in 1:nb_files) {
> >         Dist <-  as.numeric(dist.gene(read.dna(Files[i], "fasta"), method
> > = "pairwise",
> >                            pairwise.deletion = FALSE, variance = FALSE))
> >
> >         Data <- merge(Data, Dist, by=c("x"), all=T)
> >     }
> >
>
>
> > hist(Data, prob=TRUE)
> > lines(density(Data), col="blue", lwd=2)
> >
>
> However, the script does not work and I do not know what to change to make
> it working.
> Thanks in advance for your help.
>
> Myriam
>
> --
> Myriam Croze, PhD
> Post-doctorante
> Division of EcoScience,
> Ewha Womans University
> Seoul, South Korea
>
> Email: [hidden email]
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Mismatch distribution

Boris Steipe
Myriam -

This is the right list in principle, all the packages you use are CRAN packages, not Bioconductor.

However I am at a loss as to how you wrote your code: both pegas and seqinr have "read.<something>()" functions, but neither has read.dna(); similarly both pegas and seqinr have "dist.<something>()" functions, but neither has dist.gene(). Did you just extrapolate those function names and parameters from other function calls?

In any case: please start from a minimal, reproducible example that comes close to what you are trying to achieve, then post again. Here are the three URLs we usually recommend to get things started. Use a small number of small example files, don't nest your expressions until you are sure they produce what you think they do, and take it step by step.

http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
http://adv-r.had.co.nz/Reproducibility.html
https://cran.r-project.org/web/packages/reprex/index.html (read the vignette)

Cheers,
B

-



> On 2019-01-21, at 21:08, Bert Gunter <[hidden email]> wrote:
>
> "Do not work" does not work (in providing sufficient info). See the Posting
> guide  linked below for how to post an intelligible question.
>
> HOWEVER, I suspect you would do better posting on te Bioconductor list
> where they are much more likely to know what "fasta" files look like and
> might even have software already developed to do what you want. You could
> well be trying to reinvent wheels.
>
> Cheers,
> Bert
>
>
> On Mon, Jan 21, 2019 at 5:35 PM Myriam Croze <[hidden email]>
> wrote:
>
>> Hello!
>>
>> I need your help. I am trying to calculate the pairwise differences between
>> sequences from several fasta files.
>> I would like for each of my DNA alignments (fasta files), calculate the
>> pairwise differences and then:
>> - 1. Combine all the data of each file to have one file and one histogram
>> (mismatch distribution)
>> - 2. calculate the mean for each difference for all the file and again make
>> a mismatch distribution plot
>>
>> Here the script that I wrote:
>>
>> library("pegas")
>>> library("seqinr")
>>> library("ggplot2")
>>>
>>>
>>
>>> Files <- list.files(pattern="fas")
>>> nb_files <- length(Files)
>>>
>>>
>>> for (i in 1:nb_files) {
>>>        Dist <-  as.numeric(dist.gene(read.dna(Files[i], "fasta"), method
>>> = "pairwise",
>>>                           pairwise.deletion = FALSE, variance = FALSE))
>>>
>>>        Data <- merge(Data, Dist, by=c("x"), all=T)
>>>    }
>>>
>>
>>
>>> hist(Data, prob=TRUE)
>>> lines(density(Data), col="blue", lwd=2)
>>>
>>
>> However, the script does not work and I do not know what to change to make
>> it working.
>> Thanks in advance for your help.
>>
>> Myriam
>>
>> --
>> Myriam Croze, PhD
>> Post-doctorante
>> Division of EcoScience,
>> Ewha Womans University
>> Seoul, South Korea
>>
>> Email: [hidden email]
>>
>>        [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Mismatch distribution

Myriam Croze
Thanks for your answer.

First, concerning the function read.dna and dist.gene, they come from the
package ape which is downloaded with pegas.

Here the code that I did for one sequence and which works:

##Code
Seqs1 <- "file1.fas"
Seqs2 <- read.dna(Seqs1, "fasta")

Dist <- dist.gene(Seqs2, method = "pairwise", pairwise.deletion = FALSE,
variance = FALSE)
Dist2 <- as.numeric(Dist)

hist(Dist2, prob=TRUE)
##

And then the code for several files:

#######
Files <- list.files(pattern="fas")
nb_files <- length(Files)
Data1 <- as.numeric()

for (i in 1:nb_files) {
  Seqs <- read.dna(Files[i], "fasta")

  Dist <- dist.gene(Seqs, method = "pairwise", pairwise.deletion = FALSE,
variance = FALSE)
  Dist <-  as.numeric(Dist)

  Data1 <- merge(Data1, Dist)
    }

hist(Data1, prob=TRUE)
########

In the last code, the file Data1 (where I want all the data from the 3
files) is empty at the end. I guess something is missing in this last step
or maybe should I use another function.

Cheers,
Myriam

Le mar. 22 janv. 2019 à 11:52, Boris Steipe <[hidden email]> a
écrit :

> Myriam -
>
> This is the right list in principle, all the packages you use are CRAN
> packages, not Bioconductor.
>
> However I am at a loss as to how you wrote your code: both pegas and
> seqinr have "read.<something>()" functions, but neither has read.dna();
> similarly both pegas and seqinr have "dist.<something>()" functions, but
> neither has dist.gene(). Did you just extrapolate those function names and
> parameters from other function calls?
>
> In any case: please start from a minimal, reproducible example that comes
> close to what you are trying to achieve, then post again. Here are the
> three URLs we usually recommend to get things started. Use a small number
> of small example files, don't nest your expressions until you are sure they
> produce what you think they do, and take it step by step.
>
>
> http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
> http://adv-r.had.co.nz/Reproducibility.html
> https://cran.r-project.org/web/packages/reprex/index.html (read the
> vignette)
>
> Cheers,
> B
>
> -
>
>
>
> > On 2019-01-21, at 21:08, Bert Gunter <[hidden email]> wrote:
> >
> > "Do not work" does not work (in providing sufficient info). See the
> Posting
> > guide  linked below for how to post an intelligible question.
> >
> > HOWEVER, I suspect you would do better posting on te Bioconductor list
> > where they are much more likely to know what "fasta" files look like and
> > might even have software already developed to do what you want. You could
> > well be trying to reinvent wheels.
> >
> > Cheers,
> > Bert
> >
> >
> > On Mon, Jan 21, 2019 at 5:35 PM Myriam Croze <[hidden email]>
> > wrote:
> >
> >> Hello!
> >>
> >> I need your help. I am trying to calculate the pairwise differences
> between
> >> sequences from several fasta files.
> >> I would like for each of my DNA alignments (fasta files), calculate the
> >> pairwise differences and then:
> >> - 1. Combine all the data of each file to have one file and one
> histogram
> >> (mismatch distribution)
> >> - 2. calculate the mean for each difference for all the file and again
> make
> >> a mismatch distribution plot
> >>
> >> Here the script that I wrote:
> >>
> >> library("pegas")
> >>> library("seqinr")
> >>> library("ggplot2")
> >>>
> >>>
> >>
> >>> Files <- list.files(pattern="fas")
> >>> nb_files <- length(Files)
> >>>
> >>>
> >>> for (i in 1:nb_files) {
> >>>        Dist <-  as.numeric(dist.gene(read.dna(Files[i], "fasta"),
> method
> >>> = "pairwise",
> >>>                           pairwise.deletion = FALSE, variance = FALSE))
> >>>
> >>>        Data <- merge(Data, Dist, by=c("x"), all=T)
> >>>    }
> >>>
> >>
> >>
> >>> hist(Data, prob=TRUE)
> >>> lines(density(Data), col="blue", lwd=2)
> >>>
> >>
> >> However, the script does not work and I do not know what to change to
> make
> >> it working.
> >> Thanks in advance for your help.
> >>
> >> Myriam
> >>
> >> --
> >> Myriam Croze, PhD
> >> Post-doctorante
> >> Division of EcoScience,
> >> Ewha Womans University
> >> Seoul, South Korea
> >>
> >> Email: [hidden email]
> >>
> >>        [[alternative HTML version deleted]]
> >>
> >> ______________________________________________
> >> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide
> >> http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >>
> >
> >       [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
>

--
Myriam Croze, PhD
Post-doctorante
Division of EcoScience,
Ewha Womans University
Seoul, South Korea

Email: [hidden email]

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Mismatch distribution

Boris Steipe
Your "file1.fas" contains one sequence?
I can't see how that would work to produce a distance matrix.

Please show the output of:
str(Seqs2)

------

You need to understand what a distance matrix is, and what merge() does. Consider:

x <- data.frame(l1 = c("a", "a", "g"),
                l2 = c("g", "g", "t"),
                l3 = c("t", "c", "t"),
                stringsAsFactors = FALSE)
rownames(x) <- c("s1", "s2", "s3")
(myDist <- ape::dist.gene(x))

   s1 s2
s2  1  
s3  2  3

- this means s1/s2 have 1 difference,  agt vs agc;
             s1/s3 have 2 differences, agt vs gtt;
             s2/s3 have 3 differences, agc vs gtt;

The values are the lower triangle of a square matrix, perhaps easier to understand if we write the full matrix:

   s1 s2 s3
s1  0
s2  1  0    
s3  2  3  0

The values in the diagonal are always zero (d(a, a) == 0); and the values in the upper triangle are the same as in the lower triangle (d(a, b) == d(b, a)) So we don't store them separately. TLDR: a distance object stores the pairwise distances of n objects as n*(n-1)/2 numbers.

merge() would try to coerce the distance matrices to data frames, then combine them based on shared row- or column names. That won't make sense. This won't be meaningful if there are shared names, and it won't work if there are no shared names.

But you don't need merged objects since you are merely producing histograms of the individual distances. Dump the numbers into a vector, then c() the vectors. If I understand correctly what your input data actually is, and what you are trying to do, I would write it as:

fileNames <- list.files(pattern = "\\.fas$")

dVec <- numeric()
for (myFile in fileNames) {
  dMat <- ape::dist.gene(ape::read.dna(myFile, format = "fasta"))
  dVec <-  c(dVec, as.vector(dMat))
}

hist(dVec, prob=TRUE)


--
B.





> On 2019-01-21, at 22:27, Myriam Croze <[hidden email]> wrote:
>
> Thanks for your answer.
>
> First, concerning the function read.dna and dist.gene, they come from the package ape which is downloaded with pegas.
>
> Here the code that I did for one sequence and which works:
>
> ##Code
> Seqs1 <- "file1.fas"
> Seqs2 <- read.dna(Seqs1, "fasta")
>
> Dist <- dist.gene(Seqs2, method = "pairwise", pairwise.deletion = FALSE, variance = FALSE)
> Dist2 <- as.numeric(Dist)
>
> hist(Dist2, prob=TRUE)
> ##
>
> And then the code for several files:
>
> #######
> Files <- list.files(pattern="fas")
> nb_files <- length(Files)
> Data1 <- as.numeric()
>
> for (i in 1:nb_files) {
>   Seqs <- read.dna(Files[i], "fasta")  
>  
>   Dist <- dist.gene(Seqs, method = "pairwise", pairwise.deletion = FALSE, variance = FALSE)
>   Dist <-  as.numeric(Dist)
>  
>   Data1 <- merge(Data1, Dist)
>     }
>
> hist(Data1, prob=TRUE)
> ########
>
> In the last code, the file Data1 (where I want all the data from the 3 files) is empty at the end. I guess something is missing in this last step or maybe should I use another function.
>
> Cheers,
> Myriam
>
> Le mar. 22 janv. 2019 à 11:52, Boris Steipe <[hidden email]> a écrit :
> Myriam -
>
> This is the right list in principle, all the packages you use are CRAN packages, not Bioconductor.
>
> However I am at a loss as to how you wrote your code: both pegas and seqinr have "read.<something>()" functions, but neither has read.dna(); similarly both pegas and seqinr have "dist.<something>()" functions, but neither has dist.gene(). Did you just extrapolate those function names and parameters from other function calls?
>
> In any case: please start from a minimal, reproducible example that comes close to what you are trying to achieve, then post again. Here are the three URLs we usually recommend to get things started. Use a small number of small example files, don't nest your expressions until you are sure they produce what you think they do, and take it step by step.
>
> http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
> http://adv-r.had.co.nz/Reproducibility.html
> https://cran.r-project.org/web/packages/reprex/index.html (read the vignette)
>
> Cheers,
> B
>
> -
>
>
>
> > On 2019-01-21, at 21:08, Bert Gunter <[hidden email]> wrote:
> >
> > "Do not work" does not work (in providing sufficient info). See the Posting
> > guide  linked below for how to post an intelligible question.
> >
> > HOWEVER, I suspect you would do better posting on te Bioconductor list
> > where they are much more likely to know what "fasta" files look like and
> > might even have software already developed to do what you want. You could
> > well be trying to reinvent wheels.
> >
> > Cheers,
> > Bert
> >
> >
> > On Mon, Jan 21, 2019 at 5:35 PM Myriam Croze <[hidden email]>
> > wrote:
> >
> >> Hello!
> >>
> >> I need your help. I am trying to calculate the pairwise differences between
> >> sequences from several fasta files.
> >> I would like for each of my DNA alignments (fasta files), calculate the
> >> pairwise differences and then:
> >> - 1. Combine all the data of each file to have one file and one histogram
> >> (mismatch distribution)
> >> - 2. calculate the mean for each difference for all the file and again make
> >> a mismatch distribution plot
> >>
> >> Here the script that I wrote:
> >>
> >> library("pegas")
> >>> library("seqinr")
> >>> library("ggplot2")
> >>>
> >>>
> >>
> >>> Files <- list.files(pattern="fas")
> >>> nb_files <- length(Files)
> >>>
> >>>
> >>> for (i in 1:nb_files) {
> >>>        Dist <-  as.numeric(dist.gene(read.dna(Files[i], "fasta"), method
> >>> = "pairwise",
> >>>                           pairwise.deletion = FALSE, variance = FALSE))
> >>>
> >>>        Data <- merge(Data, Dist, by=c("x"), all=T)
> >>>    }
> >>>
> >>
> >>
> >>> hist(Data, prob=TRUE)
> >>> lines(density(Data), col="blue", lwd=2)
> >>>
> >>
> >> However, the script does not work and I do not know what to change to make
> >> it working.
> >> Thanks in advance for your help.
> >>
> >> Myriam
> >>
> >> --
> >> Myriam Croze, PhD
> >> Post-doctorante
> >> Division of EcoScience,
> >> Ewha Womans University
> >> Seoul, South Korea
> >>
> >> Email: [hidden email]
> >>
> >>        [[alternative HTML version deleted]]
> >>
> >> ______________________________________________
> >> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide
> >> http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >>
> >
> >       [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
>
>
> --
> Myriam Croze, PhD
> Post-doctorante
> Division of EcoScience,
> Ewha Womans University
> Seoul, South Korea
>
> Email: [hidden email]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.