ALTREP string methods for substr and nchar

Jim Hester
A useful extension of ALTREP would be two new string methods: one that
returns the number of characters in a given string element, and one
that returns a substring of an element.

Having these methods would allow retrieving these values without
needing to create a CHARSXP for the full element data, which could
potentially be costly for long elements.

For example, say you have an ALTREP altstring vector where each element
holds the sequence of a single chromosome. It would be useful to query
the length of each chromosome, retrieve the first 100 characters,
etc., without having to put the whole chromosome in memory. I realize
there are tools in Bioconductor to handle this particular case, but it
seems the general case would be perfect for ALTREP.
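
To make the idea concrete, here is a minimal sketch in plain C of what
such methods could look like for one element. It is not R's ALTREP API:
the type lazy_elt and the functions lazy_nchar and lazy_substr are
hypothetical names invented for illustration, and the sketch assumes one
byte per character (ASCII data such as a DNA sequence).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one element of a lazy string vector.
   The bytes live in some backing store (a file, a memory map, ...)
   and are never copied as a whole. Not part of R's ALTREP API. */
typedef struct {
    const char *store;   /* backing data for this element */
    size_t      nbytes;  /* element length, known from metadata */
} lazy_elt;

/* nchar-like method: answered from metadata only, no data is read.
   Assumes one byte per character. */
static size_t lazy_nchar(const lazy_elt *e)
{
    return e->nbytes;
}

/* substr-like method: copy only positions start..stop (1-based,
   inclusive, following the convention of R's substr) into buf.
   Returns the number of bytes written, excluding the terminator. */
static size_t lazy_substr(const lazy_elt *e, size_t start, size_t stop,
                          char *buf, size_t bufsize)
{
    assert(buf != NULL && bufsize > 0);
    if (start < 1) start = 1;
    if (stop > e->nbytes) stop = e->nbytes;
    if (start > stop) {              /* empty or out-of-range request */
        buf[0] = '\0';
        return 0;
    }
    size_t n = stop - start + 1;
    if (n > bufsize - 1) n = bufsize - 1;   /* truncate to fit buf */
    memcpy(buf, e->store + (start - 1), n);
    buf[n] = '\0';
    return n;
}
```

With these, asking for the first 100 characters of a chromosome touches
at most 100 bytes of the backing store, and asking for its length
touches none.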

Jim

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: ALTREP string methods for substr and nchar

Gabriel Becker-2
Hi Jim,

Thanks for posting this. Honestly, the methods list I initially proposed
years ago, and the one Luke eventually put in (which had some of what I
had suggested and a bunch of new stuff), were pretty heavily focused on
numeric values (exclusively so, in my case, I think).

I agree that there is a lot of space to beef up how AltStrings behave. I
agree that nchar and substring also make a lot of sense. Perhaps nzchar as
well.

There are some other things that might be good as well. In particular,
Michael Lawrence and I have talked about things like AltStrings that *know* all
their elements have the same encoding, in the same way that numeric,
integer, or now logical ALTREP vectors can know they don't have any NAs.
The suspicion being that this would make certain expensive (I think)
encoding checks, and possibly conversions, much cheaper. I'm far from an
expert on encodings and the costs/difficulties therein, but the concept
seems pretty straightforward and reasonable to me.
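
A rough sketch of the "known property" idea, in plain C rather than the
real ALTREP API: the struct str_vec, the flag known_ascii, and both
functions are hypothetical, and "all elements are ASCII" is used here as
a simplified stand-in for "all elements share one encoding". The point is
only that a cached flag lets a whole-vector check return without
scanning any element.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical string vector with a cached property, analogous to the
   "no NA" information that numeric ALTREP vectors can carry. */
typedef struct {
    const char **elts;
    size_t       n;
    bool         known_ascii;  /* true: every element is known to be ASCII */
} str_vec;

/* Per-element check: scans every byte of the string. */
static bool elt_is_ascii(const char *s)
{
    for (const unsigned char *p = (const unsigned char *)s; *p; p++)
        if (*p > 0x7F)
            return false;
    return true;
}

/* Whole-vector check. *elts_scanned reports how many elements had to be
   inspected, to make the saving visible. */
static bool vec_all_ascii(const str_vec *v, size_t *elts_scanned)
{
    assert(v != NULL && elts_scanned != NULL);
    *elts_scanned = 0;
    if (v->known_ascii)
        return true;                 /* fast path: no element is touched */
    for (size_t i = 0; i < v->n; i++) {
        (*elts_scanned)++;
        if (!elt_is_ascii(v->elts[i]))
            return false;
    }
    return true;
}
```

When the flag is set the check is constant time; when it is not, the
cost grows with the total number of bytes in the vector.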

I hope to get back to the matching logic and get that hooked in (I did a
bunch of work on it some time ago but it ended up having problems at the
time, so it either never went in or it did go in but Luke had to pull it
back out, I don't recall which). When(/if) that does happen I'd suspect
that matching would be another one that we'd want AltStrings to have first
class support for.

Regexes in general are probably another big area, since I'd think it would
be nice to not need to wrap and unwrap the elements when the underlying
library doesn't want them wrapped as CHARSXPs anyway...

Another area that is more fraught, but my intuition suggests might be
really nice, is pasting. A paste(x,collapse="bla") method would be easily
achievable and potentially useful. paste(x,y, z) where x is an ALTREP with
a paste method could also be nice, potentially returning the same type of
AltString representation of the concatenation. If there were both paste
before and paste after then it would be possible to potentially support
arbitrary pastes, though things would get complicated (perhaps fatally so?)
if more than one argument was an AltString.
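
As an illustration of why the collapse case looks achievable, here is a
sketch in plain C (again not the ALTREP API; paste_collapse and its
signature are hypothetical). It assumes the element lengths are available
up front, so the output size is computed once and each element is copied
from its backing storage straight into the result, with no intermediate
per-element string objects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Join n elements with sep between them, in the spirit of
   paste(x, collapse = sep). elts[i] points at lens[i] bytes that need
   not be NUL-terminated. Returns a malloc'd NUL-terminated string that
   the caller must free, or NULL if allocation fails. */
static char *paste_collapse(const char *const *elts, const size_t *lens,
                            size_t n, const char *sep)
{
    assert(sep != NULL);
    size_t seplen = strlen(sep);

    /* Output size from lengths alone: no element data is read yet. */
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += lens[i];
    if (n > 1)
        total += seplen * (n - 1);

    char *out = malloc(total + 1);
    if (out == NULL)
        return NULL;

    /* Single pass: copy separator and element bytes into place. */
    char *p = out;
    for (size_t i = 0; i < n; i++) {
        if (i > 0) {
            memcpy(p, sep, seplen);
            p += seplen;
        }
        memcpy(p, elts[i], lens[i]);
        p += lens[i];
    }
    *p = '\0';
    return out;
}
```

The multi-argument paste(x, y, z) case has no such simple form, since
the result is itself a vector and would need its own lazy
representation.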

Overall, though, I agree. It is looking like I'll have some time shortly to
get back to some R things I've been wanting to do, so I'll put a proposal
for some string altmethods together and see what people (mostly Luke, tbh)
think.

Best,
~G

On Thu, Dec 19, 2019 at 11:39 AM Jim Hester <[hidden email]> wrote:
> [...]
