R tutorials#
R tutorial on adding a lubridate binding#
In this tutorial, we will document how to contribute a binding to the Arrow R package, following the steps specified in the Quick Reference section of the guide and the more detailed Steps in making your first PR section. Navigate there whenever you find that some information is missing here.
The binding will be added to the expression.R file in the R package. You can also follow these steps if you are adding a binding that will live somewhere else.
This tutorial differs from Steps in making your first PR in that we will be working on a specific case. It is not meant as a step-by-step guide.
Let's start!
Set up#
Let's set up the Arrow repository. We presume here that Git is already installed. Otherwise please see the Set up section.
Once the Apache Arrow repository is forked (see Fork the repository) we will clone it and add the main repository as our upstream remote.
$ git clone https://github.com/<your username>/arrow.git
$ cd arrow
$ git remote add upstream https://github.com/apache/arrow
Building R package#
The steps to follow for building the R package differ depending on the operating system you are using. For this reason, this tutorial only refers to the instructions for the building process.
See also
For the introduction to the building process refer to the Building the Arrow libraries section.
For the instructions on how to build the R package refer to the R developer docs.
The issue#
In this tutorial we will be tackling an issue for implementing a simple binding for the mday() function that will match the existing R function from lubridate.
Note
If you do not have an issue and you need help finding one, please refer to the Finding good first issues part of the guide.
Once you have an issue picked out and assigned to yourself, you can proceed to the next step.
Start the work on a new branch#
Before we start working on adding the binding we should create a new branch from the updated main.
$ git checkout main
$ git fetch upstream
$ git pull --ff-only upstream main
$ git checkout -b ARROW-14816
Now we can start with researching the R function and the C++ Arrow compute function we want to expose or connect to.
Examine the lubridate mday() function
Going through the lubridate documentation we can see that mday() takes a date object and returns the day of the month as a numeric object.
We can run some examples in the R console to help us understand the function better:
> library(lubridate)
> mday(as.Date("2000-12-31"))
[1] 31
> mday(ymd(080306))
[1] 6
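To probe the behaviour a little further, the sketch below applies mday() to a vector of dates and to a date-time, then cross-checks the result against base R's format(). The sample dates are made up for illustration and do not come from the issue.

```r
library(lubridate)

# Illustrative dates chosen for this sketch (not taken from the issue).
dates <- as.Date(c("2000-12-31", "2020-02-29", "2021-03-01"))

# mday() is vectorised: it returns one day-of-month value per element.
days <- mday(dates)

# It also accepts date-times; the time-of-day part does not affect the result.
dt_day <- mday(as.POSIXct("2000-12-31 23:59:59", tz = "UTC"))

# Cross-check against base R: "%d" formats the day of the month.
base_days <- as.integer(format(dates, "%d"))
stopifnot(all(days == base_days))
```

Checking against a base R equivalent like this is a cheap way to confirm our reading of the documentation before looking at the Arrow side.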
Examine the Arrow C++ day() function
From the compute function documentation we can see that day is a unary function, which means that it takes a single data input. The data input must be a Temporal class and the returned value is an Integer/numeric type.
The Temporal class is specified as: Date types (Date32, Date64), Time types (Time32, Time64), Timestamp, Duration, Interval.
We can call an Arrow C++ function from an R console using call_function to see how it works:
> call_function("day", Scalar$create(lubridate::ymd("2000-12-31")))
Scalar
31
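The same call also works on a whole Arrow Array rather than a single Scalar. The following is a minimal sketch, assuming the arrow and lubridate packages are installed; the sample dates are invented for illustration.

```r
library(arrow)
library(lubridate)

# Illustrative dates chosen for this sketch.
dates <- as.Date(c("2000-12-31", "2020-02-29", "2021-03-01"))

# Build an Arrow Array from the dates and call the C++ "day" kernel on it.
arr <- Array$create(dates)
arrow_days <- call_function("day", arr)

# Convert the result back to an R vector and compare it with lubridate.
stopifnot(all(as.vector(arrow_days) == mday(dates)))

# List the available compute functions whose names start with "day".
list_compute_functions("^day")
```

Comparing the two results element by element gives some confidence that the C++ function is a suitable target for the binding.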
We can see that the lubridate and Arrow functions operate on and return equivalent data types. lubridate's mday() function has no additional arguments and there are also no option classes associated with the Arrow C++ function day().
Looking at the code in expression.R in the apache/arrow repository we can see that the day function is already specified/mapped on the R package side.
We only need to add mday() to the list of expressions, connecting it to the C++ day function.
# second is defined in dplyr-functions.R
# wday is defined in dplyr-functions.R
"mday" = "day",
"yday" = "day_of_year",
"year" = "year",
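To illustrate the idea behind such a name map, here is a simplified sketch in plain R. The names unary_function_map and kernel_for are hypothetical and exist only for this illustration; they are not the identifiers used in the Arrow R package, whose real list is longer and wired into the dplyr translation layer.

```r
# Hypothetical, simplified name map: R function name -> C++ kernel name.
# Only the entries shown in this tutorial are copied here.
unary_function_map <- c(
  "mday" = "day",
  "yday" = "day_of_year",
  "year" = "year"
)

# Hypothetical helper: look up the kernel name for an R function name.
kernel_for <- function(r_name) {
  if (!r_name %in% names(unary_function_map)) {
    stop("No binding registered for ", r_name, "()")
  }
  unname(unary_function_map[r_name])
}

kernel_for("mday")
```

Because the R and C++ functions take the same input and need no options, a single name-to-name entry is enough for this kind of binding.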
Adding a test#
Now we need to add a test that checks that everything works well. If there were additional options or edge cases, we would have to add more tests. Looking at the tests for similar functions (for example yday() or day()) we can see that a good place to add our two tests is test-dplyr-funcs-datetime.R:
test_that("extract mday from timestamp", {
  compare_dplyr_binding(
    .input %>%
      mutate(x = mday(datetime)) %>%
      collect(),
    test_df
  )
})
And
test_that("extract mday from date", {
  compare_dplyr_binding(
    .input %>%
      mutate(x = mday(date)) %>%
      collect(),
    test_df
  )
})
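As a rough manual equivalent of what these tests check, the sketch below runs the same pipeline once on a regular data frame and once on an Arrow Table, then compares the results. It assumes an arrow build that already includes the mday() binding; example_df is a made-up stand-in for test_df, and the comparison is a simplification rather than the actual implementation of compare_dplyr_binding().

```r
library(arrow)
library(dplyr)
library(lubridate)

# Made-up stand-in for test_df, with the two columns used in the tests.
example_df <- tibble(
  date = as.Date(c("2000-12-31", "2017-02-28")),
  datetime = as.POSIXct(
    c("2000-12-31 12:00:00", "2017-02-28 23:59:59"),
    tz = "UTC"
  )
)

# Run the pipeline in plain dplyr, where lubridate's mday() is called directly.
expected <- example_df %>%
  mutate(x = mday(date), y = mday(datetime)) %>%
  collect()

# Run it on an Arrow Table, where mday() is translated to the C++ kernel.
actual <- arrow_table(example_df) %>%
  mutate(x = mday(date), y = mday(datetime)) %>%
  collect()

# The day values should agree (compared as integers to sidestep type differences).
stopifnot(
  all(as.integer(actual$x) == as.integer(expected$x)),
  all(as.integer(actual$y) == as.integer(expected$y))
)
```

Running something like this interactively can help narrow down a failing test before going back to the full test suite.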
Now we need to check whether the tests pass or whether we need to do some more research and code corrections.
devtools::test(filter="datetime")
> devtools::test(filter="datetime")
ℹ Loading arrow
See arrow_info() for available features
ℹ Testing arrow
See arrow_info() for available features
✔ | F W S  OK | Context
✖ | 1     230 | dplyr-funcs-datetime [1.4s]
────────────────────────────────────────────────────────────────────────────────
Failure (test-dplyr-funcs-datetime.R:187:3): strftime
``%>%`(...)` did not throw the expected error.
Backtrace:
 1. testthat::expect_error(...) test-dplyr-funcs-datetime.R:187:2
 2. testthat:::expect_condition_matching(...)
────────────────────────────────────────────────────────────────────────────────
══ Results ═════════════════════════════════════════════════════════════════════
Duration: 1.4 s
[ FAIL 1 | WARN 0 | SKIP 0 | PASS 230 ]
We get a failure for the strftime function, but looking at the code we see that it is not connected to our work. We can move on and maybe ask others whether they are getting a similar failure when running the tests. It could be that we only need to rebuild the library.
Check styling#
We should also run linters to check that the styling of the code follows the tidyverse style. To do that we run the following command in the shell:
$ make style
R -s -e 'setwd(".."); if (requireNamespace("styler")) styler::style_file(setdiff(system("git diff --name-only | grep r/.*R$", intern = TRUE), file.path("r", source("r/.styler_excludes.R")$value)))'
Loading required namespace: styler
Styling 2 files:
 r/R/expression.R                             ✔
 r/tests/testthat/test-dplyr-funcs-datetime.R ℹ
────────────────────────────────────────────
Status  Count  Legend
✔       1      File unchanged.
ℹ       1      File changed.
✖       0      Styling threw an error.
────────────────────────────────────────────
Please review the changes carefully!
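To get a feel for the kind of change the styler makes, we can style a single line of code from the R console. This is a small sketch that assumes the styler package is installed; the badly spaced input line is made up for illustration.

```r
library(styler)

# A deliberately badly spaced line, made up for this sketch.
messy <- "mutate(x=mday( date ))"

# style_text() applies the tidyverse style to a string of code.
styled <- style_text(messy)
print(styled)
```

The styler only changes layout such as spacing and indentation, so the code still has to be reviewed for correctness separately.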
Creating a Pull Request#
First let's review our changes in the shell using git status to see which files have been changed and to commit only the ones we are working on.
$ git status
On branch ARROW-14816
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: R/expression.R
modified: tests/testthat/test-dplyr-funcs-datetime.R
And git diff to see the changes in the files in order to spot any errors we might have made.
$ git diff
diff --git a/r/R/expression.R b/r/R/expression.R
index 37fc21c25..0e71803ec 100644
--- a/r/R/expression.R
+++ b/r/R/expression.R
@@ -70,6 +70,7 @@
"quarter" = "quarter",
# second is defined in dplyr-functions.R
# wday is defined in dplyr-functions.R
+ "mday" = "day",
"yday" = "day_of_year",
"year" = "year",
diff --git a/r/tests/testthat/test-dplyr-funcs-datetime.R b/r/tests/testthat/test-dplyr-funcs-datetime.R
index 359a5403a..228eca56a 100644
--- a/r/tests/testthat/test-dplyr-funcs-datetime.R
+++ b/r/tests/testthat/test-dplyr-funcs-datetime.R
@@ -444,6 +444,15 @@ test_that("extract wday from timestamp", {
)
})
+test_that("extract mday from timestamp", {
+ compare_dplyr_binding(
+ .input %>%
+ mutate(x = mday(datetime)) %>%
+ collect(),
+ test_df
+ )
+})
+
test_that("extract yday from timestamp", {
compare_dplyr_binding(
.input %>%
@@ -626,6 +635,15 @@ test_that("extract wday from date", {
)
})
+test_that("extract mday from date", {
+ compare_dplyr_binding(
+ .input %>%
+ mutate(x = mday(date)) %>%
+ collect(),
+ test_df
+ )
+})
+
test_that("extract yday from date", {
compare_dplyr_binding(
.input %>%
Everything looks OK. Now we can make the commit (save our changes to the branch history):
$ git commit -am "Adding a binding and a test for mday() lubridate"
[ARROW-14816 ed37d3a3b] Adding a binding and a test for mday() lubridate
2 files changed, 19 insertions(+)
We can use git log to check the history of commits:
$ git log
commit ed37d3a3b3eef76b696532f10562fea85f809fab (HEAD -> ARROW-14816)
Author: Alenka Frim <frim.alenka@gmail.com>
Date: Fri Jan 21 09:15:31 2022 +0100
Adding a binding and a test for mday() lubridate
commit c5358787ee8f7b80f067292f49e5f032854041b9 (upstream/main, upstream/HEAD, main, ARROW-15346, ARROW-10643)
Author: Krisztián Szűcs <szucs.krisztian@gmail.com>
Date: Thu Jan 20 09:45:59 2022 +0900
ARROW-15372: [C++][Gandiva] Gandiva now depends on boost/crc.hpp which is missing from the trimmed boost archive
See build error https://github.com/ursacomputing/crossbow/runs/4871392838?check_suite_focus=true#step:5:11762
Closes #12190 from kszucs/ARROW-15372
Authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
If we started the branch some time ago, we may need to rebase to upstream main to make sure there are no merge conflicts:
$ git pull upstream main --rebase
And now we can push our work to the forked Arrow repository on GitHub called origin.
$ git push origin ARROW-14816
Enumerating objects: 233, done.
Counting objects: 100% (233/233), done.
Delta compression using up to 8 threads
Compressing objects: 100% (130/130), done.
Writing objects: 100% (151/151), 35.78 KiB | 8.95 MiB/s, done.
Total 151 (delta 129), reused 33 (delta 20), pack-reused 0
remote: Resolving deltas: 100% (129/129), completed with 80 local objects.
remote:
remote: Create a pull request for 'ARROW-14816' on GitHub by visiting:
remote: https://github.com/AlenkaF/arrow/pull/new/ARROW-14816
remote:
To https://github.com/AlenkaF/arrow.git
* [new branch] ARROW-14816 -> ARROW-14816
Now we have to go to the Arrow repository on GitHub to create a Pull Request. On the GitHub Arrow page (main or forked) we will see a yellow notice bar with a note that we made recent pushes to the branch ARROW-14816. That's great, now we can make the Pull Request by clicking on Compare & pull request.
First we need to change the Title to ARROW-14816: [R] Implement bindings for lubridate::mday() in order to match it with the issue. Note that a punctuation mark (the colon after the issue ID) was added!
Extra note: when this tutorial was created, we were using the Jira issue tracker. As we are currently using GitHub issues, the title would be prefixed with GH-14816: [R] Implement bindings for lubridate::mday().
We will also add a description to make it clear to others what we are trying to do.
Once we click Create pull request our code can be reviewed as a Pull Request in the Apache Arrow repository.
The pull request gets connected to the issue and the CI starts running. After some time passes and we get a review, we can correct the code, comment, resolve conversations and so on.
See also
For more information about Pull Request workflow see Lifecycle of a pull request.
The Pull Request we made can be viewed here.