Arrow Flight is a general-purpose client-server framework for high-performance transport of large datasets over network interfaces, built as part of the Apache Arrow project. It makes data transfer highly efficient in several ways:
- Flight removes the need for deserialization during data transfer.
- Flight allows for parallel data streaming.
- Flight employs optimizations designed to take advantage of Arrow’s columnar format.
The arrow package provides methods for connecting to Flight servers to send and receive data.
Prerequisites
At present, the arrow package in R does not supply an independent implementation of Arrow Flight. Instead, it calls the Flight methods supplied by PyArrow, the Python implementation of Arrow, so both the reticulate package and the Python PyArrow library must be installed. If you are using them for the first time, you can install them like this:
install.packages("reticulate")
arrow::install_pyarrow()
See the Python integrations article for more details on setting up PyArrow.
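Before starting a server or client, it can help to confirm that both prerequisites are visible from the current R session. The following is a minimal sketch, assuming reticulate can locate the Python environment where PyArrow was installed; the variable names are illustrative only.

```r
# Check whether the Flight prerequisites are available in this session.
has_reticulate <- requireNamespace("reticulate", quietly = TRUE)

# py_module_available() is only called when reticulate itself is installed.
has_pyarrow <- has_reticulate && reticulate::py_module_available("pyarrow")

if (!has_pyarrow) {
  message("PyArrow was not found; try arrow::install_pyarrow() first.")
}
```

If `has_pyarrow` is `FALSE` even after installation, reticulate may be bound to a different Python environment than the one PyArrow was installed into.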
Example
The package includes methods for starting a Python-based Flight server, as well as methods for connecting to a Flight server running elsewhere. To illustrate both sides, in one R process we’ll start a demo server:
library(arrow)
demo_server <- load_flight_server("demo_flight_server")
server <- demo_server$DemoFlightServer(port = 8089)
server$serve()
The call to server$serve() blocks that R session, so we’ll leave it running.
In a different R process, let’s connect to it and put some data in it.
library(arrow)
client <- flight_connect(port = 8089)
flight_put(client, iris, path = "test_data/iris")
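To confirm that an upload succeeded, the client can ask the server which paths it currently exposes. The sketch below wraps this in `put_and_verify()`, a hypothetical helper that is not part of arrow; it relies on the package's `flight_put()`, `flight_path_exists()`, and `list_flights()` functions.

```r
# Hypothetical helper (not part of arrow): upload a data frame, then confirm
# that the server reports the path as available.
put_and_verify <- function(client, data, path) {
  arrow::flight_put(client, data, path = path)
  if (!arrow::flight_path_exists(client, path)) {
    stop("Upload failed: '", path, "' is not listed on the server")
  }
  # Return every path the server currently exposes.
  arrow::list_flights(client)
}

# Usage, assuming the demo server is still listening on port 8089:
# client <- arrow::flight_connect(port = 8089)
# put_and_verify(client, iris, "test_data/iris")
```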
Now, in yet another R process, we can connect to the server and pull the data we put there:
library(arrow)
library(dplyr)
client <- flight_connect(port = 8089)
client %>%
  flight_get("test_data/iris") %>%
  group_by(Species) %>%
  summarize(max_petal = max(Petal.Length)) %>%
  collect()  # materialise the result as a tibble
## # A tibble: 3 x 2
## Species max_petal
## <fct> <dbl>
## 1 setosa 1.9
## 2 versicolor 5.1
## 3 virginica 6.9
Because flight_get() returns an Arrow data structure, you can directly pipe its result into a dplyr workflow. See the article on data wrangling for more information on working with Arrow objects via a dplyr interface.
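When an ordinary R data frame is needed instead of an Arrow object, the result can be converted explicitly. The following is a minimal sketch, assuming the server runs on localhost and that `flight_get()` returns an Arrow Table; `fetch_as_data_frame()` is a hypothetical helper, not part of arrow.

```r
# Hypothetical helper (not part of arrow): fetch a dataset from a Flight
# server on localhost and materialise it as an R data frame.
fetch_as_data_frame <- function(port, path) {
  client <- arrow::flight_connect(port = port)
  # Close the connection when the function exits, even on error.
  on.exit(arrow::flight_disconnect(client), add = TRUE)
  tbl <- arrow::flight_get(client, path)
  as.data.frame(tbl)
}

# Usage, assuming the demo server is still listening on port 8089:
# iris_df <- fetch_as_data_frame(8089, "test_data/iris")
```

Converting copies the data into R memory, so for large datasets it is usually preferable to keep working with the Arrow object for as long as possible.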
Further reading
- The specification of the Flight remote procedure call (RPC) protocol is available on the Arrow project homepage.
- The Arrow C++ documentation contains a list of best practices for Arrow Flight.
- A detailed worked example of an Arrow Flight server in Python is provided in the Apache Arrow Python Cookbook.