internal string comparison (Scollate)

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

internal string comparison (Scollate)

Romain François-2
Hello,

I’d like to compare two strings internally the way R would, so I need Scollate which is not part of the authorized R api.

So:
 - Can Scollate (and perhaps Seql) be promoted to api ?
 - If not what are the alternatives ? Using strcmp or stroll does not seem as general as Scollate.

Romain

PS: Here is some context: https://github.com/hadley/dplyr/issues/325
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: internal string comparison (Scollate)

Dirk Eddelbuettel

On 26 March 2014 at 17:22, Romain François wrote:
| I’d like to compare two strings internally the way R would, so I need Scollate which is not part of the authorized R api.
|
| So:
|  - Can Scollate (and perhaps Seql) be promoted to api ?
|  - If not what are the alternatives ? Using strcmp or stroll does not seem as general as Scollate.

I'd add a third option:

   - Put this in a new package and register the functions you want.

as I don't see a R Core change on the public vs non-public APIs anytime soon.

But we have CRAN, and we can cheaply and reliably import functions from other
packages.  Yes, it is code duplication, but it offers a layer of indirection
that permits possibles changes to such an API.

I have been meaning to put such a package up for serialization code. It will
contain a few lines of C code from base R. Junji and Ei-ji have already
placed this in their Rhpc package. I am using them (in yet another copy) in
my unfinished Redis package in order to get to serialization from C++.

Dirk

--
Dirk Eddelbuettel | [hidden email] | http://dirk.eddelbuettel.com

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: internal string comparison (Scollate)

Gabriel Becker
On Wed, Mar 26, 2014 at 9:50 AM, Dirk Eddelbuettel <[hidden email]> wrote:

>
> On 26 March 2014 at 17:22, Romain François wrote:
> | I'd like to compare two strings internally the way R would, so I need
> Scollate which is not part of the authorized R api.
> |
> | So:
> |  - Can Scollate (and perhaps Seql) be promoted to api ?
> |  - If not what are the alternatives ? Using strcmp or stroll does not
> seem as general as Scollate.
>
> I'd add a third option:
>
>    - Put this in a new package and register the functions you want.
>
That would not achieve what Romain wants. Or rather, it would when he did
it, but would not be guaranteed to do so at any point after the next
release of R.

My understanding of his request is that he wants something that will
"behave as R will", not "behave as R does now".

All that having been said, not every single thing R does internally can be
public and of course I'm not privy or a party to discussions on what should
and shouldn't be. But freezing and duplicating small pieces of R's
non-exported internal code seems like a dangerous move to me.

~G




--
Gabriel Becker
Graduate Student
Statistics Department
University of California, Davis

        [[alternative HTML version deleted]]


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: internal string comparison (Scollate)

Romain François-2

Le 26 mars 2014 à 18:03, Gabriel Becker <[hidden email]> a écrit :

> On Wed, Mar 26, 2014 at 9:50 AM, Dirk Eddelbuettel <[hidden email]> wrote:
>
> On 26 March 2014 at 17:22, Romain François wrote:
> | I’d like to compare two strings internally the way R would, so I need Scollate which is not part of the authorized R api.
> |
> | So:
> |  - Can Scollate (and perhaps Seql) be promoted to api ?
> |  - If not what are the alternatives ? Using strcmp or stroll does not seem as general as Scollate.
>
> I'd add a third option:
>
>    - Put this in a new package and register the functions you want.
>
> That would not achieve what Romain wants. Or rather, it would when he did it, but would not be guaranteed to do so at any point after the next release of R.
That’s one part of the problem. Indeed I’d rather use something rather than copy and paste it and run the risk of being outdated. The answer to that is testing though. I can develop a test suite that can let me know I’m out of date and I need to copy and paste some new code, etc … Done that before, this is tedious, but so what.

The other part of the problem (the real part of the problem actually) is that, at least when R is built with ICU support, Scollate will depend on a the collator pointer in util.c
https://github.com/wch/r-source/blob/trunk/src/main/util.c#L1777

And this can be controlled by the base::icuSetCollate function. Of course the collator pointer is not public.

So in short that does not help.

> My understanding of his request is that he wants something that will "behave as R will", not "behave as R does now".
>
> All that having been said, not every single thing R does internally can be public and of course I'm not privy or a party to discussions on what should and shouldn't be. But freezing and duplicating small pieces of R's non-exported internal code seems like a dangerous move to me.
>
> ~G
>
> --
> Gabriel Becker
> Graduate Student
> Statistics Department
> University of California, Davis

        [[alternative HTML version deleted]]


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: internal string comparison (Scollate)

Dirk Eddelbuettel

On 26 March 2014 at 19:09, Romain François wrote:
| That’s one part of the problem. Indeed I’d rather use something rather than
| copy and paste it and run the risk of being outdated. The answer to that is

We all would. But "they" won't let us by refusing to create more API access points.

| testing though. I can develop a test suite that can let me know I’m out of

Correct.

| date and I need to copy and paste some new code, etc … Done that before, this
| is tedious, but so what.
|
| The other part of the problem (the real part of the problem actually) is that,
| at least when R is built with ICU support, Scollate will depend on a the
| collator pointer in util.c
| https://github.com/wch/r-source/blob/trunk/src/main/util.c#L1777
|
| And this can be controlled by the base::icuSetCollate function. Of course the
| collator pointer is not public.

So the next (and even less pleasant) answer is to build a new package which
links to, (or worse yet, embeds) libicu.  

As you want ICU behaviour, you will need ICU code.

Dirk

--
Dirk Eddelbuettel | [hidden email] | http://dirk.eddelbuettel.com

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: internal string comparison (Scollate)

Kevin Ushey
I too think it would be useful if R exported some version of its
string sorting routines, since sorting strings while respecting
locale, and doing so in a portable fashion while respecting the user's
environment, is not trivial. R holds a fast, portable, well-tested
solution, and I think package developers would be very appreciative if
some portion of this was exposed at the C level.

If not `Scollate`, then perhaps other candidates could be the more
generic `sortVector`, or the more string-specific (and NA-respecting)
`scmp`.

I understand that the volunteers at R Core have limited time and
resources, and exposing an API imposes additional maintenance burdens
on an already thinly stretched team, but this is a situation where the
R users and package authors alike could benefit. Or, if there are
other reasons why exporting such routines is not possible nor
recommended, it would be very informative to know why.

Thanks,
Kevin

On Thu, Mar 27, 2014 at 11:08 AM, Dirk Eddelbuettel <[hidden email]> wrote:

>
> On 26 March 2014 at 19:09, Romain François wrote:
> | That's one part of the problem. Indeed I'd rather use something rather than
> | copy and paste it and run the risk of being outdated. The answer to that is
>
> We all would. But "they" won't let us by refusing to create more API access points.
>
> | testing though. I can develop a test suite that can let me know I'm out of
>
> Correct.
>
> | date and I need to copy and paste some new code, etc ... Done that before, this
> | is tedious, but so what.
> |
> | The other part of the problem (the real part of the problem actually) is that,
> | at least when R is built with ICU support, Scollate will depend on a the
> | collator pointer in util.c
> | https://github.com/wch/r-source/blob/trunk/src/main/util.c#L1777
> |
> | And this can be controlled by the base::icuSetCollate function. Of course the
> | collator pointer is not public.
>
> So the next (and even less pleasant) answer is to build a new package which
> links to, (or worse yet, embeds) libicu.
>
> As you want ICU behaviour, you will need ICU code.
>
> Dirk
>
> --
> Dirk Eddelbuettel | [hidden email] | http://dirk.eddelbuettel.com
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: internal string comparison (Scollate)

Duncan Murdoch-2
On 14-03-27 3:01 PM, Kevin Ushey wrote:
> I too think it would be useful if R exported some version of its
> string sorting routines, since sorting strings while respecting
> locale, and doing so in a portable fashion while respecting the user's
> environment, is not trivial. R holds a fast, portable, well-tested
> solution, and I think package developers would be very appreciative if
> some portion of this was exposed at the C level.

It does.  You can put your strings in an R STRSXP vector, and call the R
sort function on it.

The usual objection to constructing an R expression and evaluating it is
that it is slow, but if you are talking about sorting, the time spent in
the sort is likely to dominate the time spent in the setup.

>
> If not `Scollate`, then perhaps other candidates could be the more
> generic `sortVector`, or the more string-specific (and NA-respecting)
> `scmp`.

Evaluating an R expression gives you sortVector.

I can see an argument for Scollate being useful (sorting isn't the only
reason to compare strings), but I can see arguments against exposing it
too.  Take a look at the source:  it needs to be used carefully.  In
particular, it can return a 0 for unequal strings, and users are likely
to get messed up by that, or to submit bogus bug reports.  And it's not
impossible to work around: if you can collect the universe of strings to
compare in advance, then just use order() to convert them to integer
values, and compare those.

Duncan Murdoch

>
> I understand that the volunteers at R Core have limited time and
> resources, and exposing an API imposes additional maintenance burdens
> on an already thinly stretched team, but this is a situation where the
> R users and package authors alike could benefit. Or, if there are
> other reasons why exporting such routines is not possible nor
> recommended, it would be very informative to know why.
>
> Thanks,
> Kevin
>
> On Thu, Mar 27, 2014 at 11:08 AM, Dirk Eddelbuettel <[hidden email]> wrote:
>>
>> On 26 March 2014 at 19:09, Romain François wrote:
>> | That's one part of the problem. Indeed I'd rather use something rather than
>> | copy and paste it and run the risk of being outdated. The answer to that is
>>
>> We all would. But "they" won't let us by refusing to create more API access points.
>>
>> | testing though. I can develop a test suite that can let me know I'm out of
>>
>> Correct.
>>
>> | date and I need to copy and paste some new code, etc ... Done that before, this
>> | is tedious, but so what.
>> |
>> | The other part of the problem (the real part of the problem actually) is that,
>> | at least when R is built with ICU support, Scollate will depend on a the
>> | collator pointer in util.c
>> | https://github.com/wch/r-source/blob/trunk/src/main/util.c#L1777
>> |
>> | And this can be controlled by the base::icuSetCollate function. Of course the
>> | collator pointer is not public.
>>
>> So the next (and even less pleasant) answer is to build a new package which
>> links to, (or worse yet, embeds) libicu.
>>
>> As you want ICU behaviour, you will need ICU code.
>>
>> Dirk
>>
>> --
>> Dirk Eddelbuettel | [hidden email] | http://dirk.eddelbuettel.com
>>
>> ______________________________________________
>> [hidden email] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: internal string comparison (Scollate)

Romain François-2
Hello,

The use case I have might involve sorting many small such STRSXP vectors.

If I have Scollate, I don’t need to materialize the vectors and I can use the sorting algorithm I choose.

Here is some made up data:

df <- data.frame(
  x = sample( 1:10, 1000, replace = TRUE),
  y = sample( 1:100, 100, replace = TRUE),
  z = replicate( 10000, paste( sample(letters, sample(1:100, size = 1), replace = TRUE ), collapse = "" ) ),
  stringsAsFactors = FALSE
)

For which I’d like something like what order( df$x, df$y, df$z ) gives me.

For example:

> system.time( res1 <- order( df$x, df$y, df$z) )
utilisateur     système      écoulé
      0.017       0.000       0.017
> system.time( res2 <- dplyr::order_( df$x, df$y, df$z ) )
utilisateur     système      écoulé
      0.005       0.000       0.005
> identical( res1, res2 )
[1] TRUE

The way dplyr::order_ is implemented I don’t need to materialize 500 STRSXP vectors and call order or sort on them ( 492 == nrow( unique( df[, c("x", "y" ) ] ) ) )

I just need to be able to compare two scalars together (either two ints, two doubles, or two CHARSXP SEXP). We already have special code to handle what it means to compare int, double etc in the R world with NA and NaN, etc ...

Scollate would give a way to compare two CHARSXP SEXP, the way R would. Of course one has to be careful how it is called, I have read the source.

Materialising temporary values into an R vector may be the R way of doing things, but sometimes it is a waste of both memory and time. Yes, this is about performance. We are often asked to choose between performance and correctness when in fact we can have both.

Romain

Le 27 mars 2014 à 22:12, Duncan Murdoch <[hidden email]> a écrit :

> On 14-03-27 3:01 PM, Kevin Ushey wrote:
>> I too think it would be useful if R exported some version of its
>> string sorting routines, since sorting strings while respecting
>> locale, and doing so in a portable fashion while respecting the user's
>> environment, is not trivial. R holds a fast, portable, well-tested
>> solution, and I think package developers would be very appreciative if
>> some portion of this was exposed at the C level.
>
> It does.  You can put your strings in an R STRSXP vector, and call the R sort function on it.
>
> The usual objection to constructing an R expression and evaluating it is that it is slow, but if you are talking about sorting, the time spent in the sort is likely to dominate the time spent in the setup.
>
>>
>> If not `Scollate`, then perhaps other candidates could be the more
>> generic `sortVector`, or the more string-specific (and NA-respecting)
>> `scmp`.
>
> Evaluating an R expression gives you sortVector.
>
> I can see an argument for Scollate being useful (sorting isn't the only reason to compare strings), but I can see arguments against exposing it too.  Take a look at the source:  it needs to be used carefully.  In particular, it can return a 0 for unequal strings, and users are likely to get messed up by that, or to submit bogus bug reports.  And it's not impossible to work around: if you can collect the universe of strings to compare in advance, then just use order() to convert them to integer values, and compare those.
>
> Duncan Murdoch
>
>>
>> I understand that the volunteers at R Core have limited time and
>> resources, and exposing an API imposes additional maintenance burdens
>> on an already thinly stretched team, but this is a situation where the
>> R users and package authors alike could benefit. Or, if there are
>> other reasons why exporting such routines is not possible nor
>> recommended, it would be very informative to know why.
>>
>> Thanks,
>> Kevin
>>
>> On Thu, Mar 27, 2014 at 11:08 AM, Dirk Eddelbuettel <[hidden email]> wrote:
>>>
>>> On 26 March 2014 at 19:09, Romain François wrote:
>>> | That's one part of the problem. Indeed I'd rather use something rather than
>>> | copy and paste it and run the risk of being outdated. The answer to that is
>>>
>>> We all would. But "they" won't let us by refusing to create more API access points.
>>>
>>> | testing though. I can develop a test suite that can let me know I'm out of
>>>
>>> Correct.
>>>
>>> | date and I need to copy and paste some new code, etc ... Done that before, this
>>> | is tedious, but so what.
>>> |
>>> | The other part of the problem (the real part of the problem actually) is that,
>>> | at least when R is built with ICU support, Scollate will depend on a the
>>> | collator pointer in util.c
>>> | https://github.com/wch/r-source/blob/trunk/src/main/util.c#L1777
>>> |
>>> | And this can be controlled by the base::icuSetCollate function. Of course the
>>> | collator pointer is not public.
>>>
>>> So the next (and even less pleasant) answer is to build a new package which
>>> links to, (or worse yet, embeds) libicu.
>>>
>>> As you want ICU behaviour, you will need ICU code.
>>>
>>> Dirk
>>>
>>> --
>>> Dirk Eddelbuettel | [hidden email] | http://dirk.eddelbuettel.com
>>>
>>> ______________________________________________
>>> [hidden email] mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>> ______________________________________________
>> [hidden email] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel