ALTREP: Design concept of alternative string

Wang Jiefei
Hello from Bioconductor,

I'm developing a package to share R objects across clusters using the Boost
library. The concept is similar to that of the mmap package:
https://cran.r-project.org/web/packages/mmap/index.html . However, I ran
into a problem while writing the Dataptr_method for the alternative
string.

Based on my understanding, the return value of the Dataptr_method function
should be a vector of CHARSXP pointers. This design might be problematic in
two ways:

1. The behavior of the Dataptr_method function is inconsistent between
string and the other ALTREP types. For the other types we return a vector
of pure data in memory allocated outside of R, but for string we return a
vector of R objects allocated by R.

2. It causes an unnecessary duplication of the data. In order to return
CHARSXPs to R, it forces me to allocate the CHARSXPs and copy the entire
data into the R process. By contrast, for the other ALTREP types, say
altreal, I can just return the pointer to R if the data is already in
memory.
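
As a rough illustration of the contrast in points 1 and 2, here is a toy C
model. It does not use R's headers or API: ToyCharsxp, toy_real_dataptr,
toy_string_dataptr and toy_string_free are hypothetical names invented for
this sketch, and ToyCharsxp is only a stand-in for an R-managed CHARSXP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for an R-managed string object (NOT R's real CHARSXP). */
typedef struct {
    size_t len;
    char  *bytes;   /* private copy owned by the toy "R side" */
} ToyCharsxp;

/* Numeric case: the externally owned buffer can be handed out as-is. */
static const double *toy_real_dataptr(const double *shared) {
    assert(shared != NULL);
    return shared;  /* zero-copy */
}

/* Release the first n elements of v and the pointer array itself. */
static void toy_string_free(ToyCharsxp **v, size_t n) {
    if (v == NULL) return;
    for (size_t i = 0; i < n; i++) {
        if (v[i] != NULL) {
            free(v[i]->bytes);
            free(v[i]);
        }
    }
    free(v);
}

/* String case: one managed object per element has to be created, and
 * creating it copies the bytes. Returns NULL on allocation failure. */
static ToyCharsxp **toy_string_dataptr(const char *const *shared, size_t n) {
    ToyCharsxp **out = calloc(n ? n : 1, sizeof *out);
    if (out == NULL) return NULL;
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(shared[i]);
        ToyCharsxp *c = malloc(sizeof *c);
        char *copy = malloc(len + 1);
        if (c == NULL || copy == NULL) {
            free(c);
            free(copy);
            toy_string_free(out, i);
            return NULL;
        }
        memcpy(copy, shared[i], len + 1);  /* the duplication in point 2 */
        c->len = len;
        c->bytes = copy;
        out[i] = c;
    }
    return out;
}
```

In this model the numeric accessor returns the very pointer it was given,
while the string accessor has to build n new objects before it can return
anything, which is the asymmetry described above.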

The same problem occurs for Elt_method as well, but it is less serious
since only one CHARSXP is allocated. Because my package is designed for
sharing large R objects, allocating memory is undesirable, especially when
the code only reads the data (e.g. the print function). I'm not sure
whether any solution exists in the current R version, but I can imagine
three workarounds:

1. Change the behavior of the R functions so that they use the get_element
function instead of the Dataptr function. This would be more
memory-friendly but would still cause allocation.

2. Return a vector of const char* from the Dataptr method. It would be very
efficient and consistent with the return values of the other ALTREP types.

3. Provide an alternative CHARSXP. This might be the best solution since
STRSXP behaves more like a list than a string, so an alternative
CHARSXP fits the concept of ALTREP better.

I'm not an expert in R, so this may already be a solved problem. I would
greatly appreciate any suggestions.

Best,
Jiefei

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: ALTREP: Design concept of alternative string

Gabriel Becker-2
Hi Jiefei,

The issue here is that while the memory consequences of what you're
describing may be true, this is simply how R handles character vector (what
you're calling string) values internally. It doesn't actually have anything
to do with ALTREP. Standard character vector SEXPs have an array of CHARSXP
pointers in their payload (what is returned by DATAPTR) as well.

As far as I know, this is important for string caching and is actually
intended to save memory when the same string value appears many times in an
R session (and takes up more bytes than a pointer), though I haven't dug
around R's low-level string handling a ton. Either way, this would
be a much larger change than just changing the ALTREP API (which for
things like this explicitly and intentionally matches how the C API behaves
for non-ALTREP SEXPs, for compatibility).

Likewise, the reason that get_element is going to return a CHARSXP is
that this is what STRING_ELT(x, i) returns (equivalent to
((SEXP *) DATAPTR(x))[i] ), so I don't think that can be changed either.
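
To make the caching argument concrete, here is a small toy model in C. It
is not R's implementation: toy_intern, toy_cache, ToyStrVec and
toy_string_elt are hypothetical names, the cache is a fixed-size array
searched linearly, and encodings are ignored. It only shows the idea that a
character vector's payload is an array of pointers into a shared cache, so
repeated values cost one pointer each.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TOY_CACHE_MAX 64

/* Toy global string cache: each distinct string is stored once. */
static char  *toy_cache[TOY_CACHE_MAX];
static size_t toy_cache_n = 0;

/* Return the cached copy of s, adding it on first sight.
 * Returns NULL if the cache is full or allocation fails. */
static const char *toy_intern(const char *s) {
    for (size_t i = 0; i < toy_cache_n; i++) {
        if (strcmp(toy_cache[i], s) == 0) return toy_cache[i];
    }
    if (toy_cache_n == TOY_CACHE_MAX) return NULL;
    size_t len = strlen(s);
    char *copy = malloc(len + 1);
    if (copy == NULL) return NULL;
    memcpy(copy, s, len + 1);
    toy_cache[toy_cache_n++] = copy;
    return copy;
}

/* Toy "character vector": the payload is an array of cache pointers. */
typedef struct {
    size_t n;
    const char **payload;
} ToyStrVec;

/* Element access is plain indexing into the pointer payload. */
static const char *toy_string_elt(const ToyStrVec *x, size_t i) {
    assert(i < x->n);
    return x->payload[i];
}
```

Interning "gene" twice yields the same pointer, so a vector holding that
value a million times would store a million pointers but only one copy of
the bytes. A Dataptr that returned raw const char* buffers would bypass
this sharing.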

One other thing to note, though, is that if you're asking for the dataptr
(and it isn't read-only) then you're basically stepping out of ALTREP space
anyway, so it makes sense that you get back a normally laid-out STRSXP (with
its CHARSXP payload).

Best,
~G


Re: ALTREP: Design concept of alternative string

Wang Jiefei
Hi Gabriel,

Thanks for your explanation. I totally understand that it is almost
impossible to change the data structure of STRSXP. However, what I'm
proposing is not about changing the internal representation, but rather
about how we design and use the ALTREP API.

I may not have stated the workarounds clearly, as English is not my first
language. Please let me explain them again in detail.

1. Update the existing R functions. When the ALTREP API Dataptr_or_null
returns NULL, use get_element instead (or as close to that as possible). I
have seen this pattern in some R functions, but some functions still do
not follow it. For example, the print function calls Dataptr
unconditionally (it does not even call Dataptr_or_null first), which
forces me to allocate a large chunk of memory in R. Updating these
functions would not completely solve the problem we are discussing, but it
would make it less serious.

2. Update the ALTREP API to return a vector of const char *, and internally
wrap them as CHARSXPs. This can be a way to "hack" the R data structure at
only the small cost of creating the CHARSXP headers.

3. Provide a character ALTREP. Instead of using a string ALTREP, we can
define an alternative CHARSXP. Doing so would completely solve the problem,
since the return value of the Dataptr of a CHARSXP is a const char*. We
would not have to change any internal representation of characters; it
would just require remapping the DATAPTR macro (or function?).
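
The pattern in workaround 1 can be sketched with a toy model in C. This is
not R's API: ToyAltString, toy_no_dataptr, toy_shared_elt and
toy_total_nchar are hypothetical names, and elements are simplified to
plain C strings, whereas in R each element would be a CHARSXP. The consumer
asks for the whole payload first and falls back to element-wise access when
the class declines to provide one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model of an ALTREP-like string vector (NOT R's real API). */
typedef struct ToyAltString ToyAltString;
struct ToyAltString {
    size_t n;
    const char *const *shared;   /* data owned outside the toy "R" heap */
    const char *const *(*dataptr_or_null)(ToyAltString *);
    const char *(*elt)(ToyAltString *, size_t);
    size_t elt_calls;            /* bookkeeping for the demonstration */
};

/* A class that declines to expose a contiguous payload. */
static const char *const *toy_no_dataptr(ToyAltString *x) {
    (void)x;
    return NULL;
}

/* Element access reads straight from the shared data. */
static const char *toy_shared_elt(ToyAltString *x, size_t i) {
    assert(i < x->n);
    x->elt_calls++;
    return x->shared[i];
}

/* A read-only consumer: total number of bytes over all elements.
 * It never forces a full materialization of the vector. */
static size_t toy_total_nchar(ToyAltString *x) {
    const char *const *p = x->dataptr_or_null(x);
    size_t total = 0;
    if (p != NULL) {
        /* Fast path: a payload already exists, so index it directly. */
        for (size_t i = 0; i < x->n; i++) total += strlen(p[i]);
    } else {
        /* Fallback: fetch one element at a time. */
        for (size_t i = 0; i < x->n; i++) total += strlen(x->elt(x, i));
    }
    return total;
}
```

With this shape, a read-only function such as print would touch one element
at a time, so at most one element's worth of allocation is live per step
rather than the whole vector.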

Again, I sincerely appreciate your time and the detailed explanation you
provided. I'm looking forward to seeing a way to solve this problem in the
current or a future R release.

Best,
Jiefei

--
Jiefei Wang
Room 2-501,Tangxuan,QilinGarden,NanshanDistrict,Shenzhen
Guangdong,China
Phone (+86)18312589584
[hidden email]


Re: [External] Re: ALTREP: Design concept of alternative string

Tierney, Luke
On Fri, 10 May 2019, 介非王 wrote:

> Hi Gabriel,
>
> Thanks for your explanation, I totally understand that it is almost
> impossible to change the data structure of STRSXP. However, what I'm
> proposing is not about changing the internal representation, but rather
> about how we design and use the ALTREP API.
>
> I might do not state the workarounds clearly as English is not my first
> language. Please let me explain them again in detail.
>
> 1. Update the existing R functions. When the ALTREP API Dataptr_or_null
> returns NULL, use get_element instead(or as best as we can). I have seen
> this pattern for some R functions, but somehow there are still some
> functions left that do not follow this rule. For example, print function
> will blindly call Dataptr (It even did not call Dataptr_or_null first) and
> forces me to allocate a large chunk of memory in R. Updating these
> functions would not completely solve the problem we are discussing but will
> make it less serious.

Fixing print() is pretty high priority (I thought we had done so for R
3.6.0 but apparently not). Others will come in over time; filing a
request in Bugzilla is one way to push up the priority of a particular
function or set of functions.

Keep in mind that one option for your implementation is to signal an
error if a data pointer is requested. You could make that dependent on
some sort of option setting or make the error continuable by providing
a restart.
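
A minimal sketch of that option, again as a toy C model rather than real
ALTREP code: ToySharedStrings, ToyStatus, toy_dataptr and the
allow_materialize flag are hypothetical names. A real Dataptr method would
signal an R error instead of returning a status code, and the restart
mechanism is not modeled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef enum {
    TOY_OK = 0,
    TOY_ERR_DATAPTR_REFUSED,
    TOY_ERR_NOMEM
} ToyStatus;

typedef struct {
    size_t n;
    const char *const *shared;  /* externally owned strings */
    const char **copy;          /* lazily created full copy, or NULL */
    bool allow_materialize;     /* hypothetical user-controlled option */
} ToySharedStrings;

/* Hand out a full payload only if the option permits building one.
 * On refusal or failure, *out is set to NULL. */
static ToyStatus toy_dataptr(ToySharedStrings *x, const char ***out) {
    assert(x != NULL && out != NULL);
    *out = NULL;
    if (x->copy == NULL) {
        if (!x->allow_materialize) return TOY_ERR_DATAPTR_REFUSED;
        x->copy = malloc((x->n ? x->n : 1) * sizeof *x->copy);
        if (x->copy == NULL) return TOY_ERR_NOMEM;
        /* The expensive step: build the complete payload once. */
        for (size_t i = 0; i < x->n; i++) x->copy[i] = x->shared[i];
    }
    *out = x->copy;
    return TOY_OK;
}
```

With the option off, any caller that insists on a data pointer fails
loudly instead of silently allocating; turning the option on restores the
materializing behavior for code that cannot yet work element by element.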

> 2. Update the ALTREP API, return a vector of const char *, and internally
> wrap them as CHARSXP. This can be a way to "hack" the R data structure with
> only a little cost to create the CHARSXP header.

That doesn't seem feasible but I may not be understanding what you mean.

> 3. Provide character ALTREP. Instead of using string ALTREP, we can define
> an alternative CHARSXP. By doing it we will completely solve the problem
> since the return value of the Dataptr of CHARSXP is a const char*. We do
> not have to change any internal representation of characters, it just
> requires a remap of the DATAPTR macro( or function?).

Allowing ALTREP CHARSXP objects might be something to consider in the
future, but the combination of caching and encoding issues makes that
very complex. I'm not sure it would be a good idea or even
feasible. In any case it won't happen anytime soon.

Best,

luke

--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:   [hidden email]
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu

Re: [External] Re: ALTREP: Design concept of alternative string

Wang Jiefei
Thank you very much for your explanation! I'm looking forward to seeing the
changes to the R functions in a future release.

Best,
Jiefei
