Object-Oriented Programming
Martin Westgate & Dax Kellie
2024-11-19
The default method for building queries in galah is to first use
galah_call() to create a query object called a “data_request”. This
object class is specific to galah.
galah_call() |>
  filter(genus == "Crinia") |>
  class()
## [1] "data_request"
When a piped object is of class data_request, galah can trigger
functions to use specific methods for this object class, even if a
function name is used by another package. For example, users can use
the filter() and group_by() functions from dplyr instead of
galah_filter() and galah_group_by() to construct a query. Consequently,
the following queries are synonymous:
galah_call() |>
  galah_filter(genus == "Crinia", year == 2020) |>
  galah_group_by(species) |>
  atlas_counts()

galah_call() |>
  filter(genus == "Crinia", year == 2020) |>
  group_by(species) |>
  atlas_counts()
## # A tibble: 16 × 2
##    species                 count
##    <chr>                   <int>
##  1 Crinia signifera        42621
##  2 Crinia parinsignifera    8664
##  3 Crinia glauerti          3111
##  4 Crinia georgiana         1509
##  5 Crinia remota             718
##  6 Crinia sloanei            682
##  7 Crinia insignifera        530
##  8 Crinia tinnula            291
##  9 Crinia deserticola        253
## 10 Crinia pseudinsignifera   223
## 11 Crinia tasmaniensis       181
## 12 Crinia bilingua            74
## 13 Crinia subinsignifera      46
## 14 Crinia riparia             10
## 15 Crinia flindersensis        3
## 16 Crinia nimba                1
Thanks to object-oriented programming, galah “masks” the filter() and
group_by() functions so that they use methods defined for data_request
objects instead. The full list of masked functions is:
- arrange() (dplyr)
- count() (dplyr)
- filter() (dplyr) as a synonym for galah_filter()
- identify() (graphics) as a synonym for galah_identify()
- select() (dplyr) as a synonym for galah_select()
- group_by() (dplyr) as a synonym for galah_group_by()
- slice_head() (dplyr) as a synonym for the limit argument in atlas_counts()
- st_crop() (sf) as a synonym for galah_polygon()
Note that these functions are all evaluated lazily; they amend the
underlying request object, but no data are queried or changed until the
call is evaluated. To actually build and run the query, we’ll need to
use one or more of a different set of dplyr verbs: collapse(),
compute() and collect().
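The masking and lazy evaluation described above can be sketched in base R using S3 dispatch. This is a toy illustration, not galah's implementation: the generic keep(), the class toy_request and the helper run() are all hypothetical names invented for this example. The same verb evaluates immediately on a data.frame, but only records its conditions when given a toy_request.

```r
# Hypothetical generic standing in for a verb such as dplyr::filter()
keep <- function(.data, ...) UseMethod("keep")

# Method for ordinary data frames: conditions are evaluated immediately
keep.data.frame <- function(.data, ...) {
  caller <- parent.frame()
  conditions <- eval(substitute(alist(...)))  # capture conditions unevaluated
  for (cond in conditions) {
    .data <- .data[eval(cond, .data, caller), , drop = FALSE]
  }
  .data
}

# Method for the toy request class: conditions are only recorded
keep.toy_request <- function(.data, ...) {
  .data$filters <- c(.data$filters, eval(substitute(alist(...))))
  .data
}

# Constructor for an empty request (hypothetical, loosely analogous to galah_call())
toy_request <- function() {
  structure(list(filters = list()), class = "toy_request")
}

# Stand-in for the evaluation step: apply the recorded conditions to real data
run <- function(request, data) {
  for (cond in request$filters) {
    data <- data[eval(cond, data), , drop = FALSE]
  }
  data
}

frogs <- data.frame(
  genus = c("Crinia", "Litoria", "Crinia"),
  year  = c(2020, 2020, 2019)
)

req <- toy_request() |>
  keep(genus == "Crinia", year == 2020)

class(req)                 # the request keeps its class; no rows were touched
length(req$filters)        # number of recorded conditions
run(req, frogs)            # conditions are evaluated only at this point
keep(frogs, year == 2020)  # same verb, data.frame method, evaluated at once
```

Because dispatch depends on the class of the first argument, the same function name can behave differently for a request object than it does for a data frame, which is the mechanism that masking relies on.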
Advanced query building
The usual way to begin a query to request data in galah is using
galah_call(). However, this function now calls one of three types of
request_ functions. If you prefer, you can begin your pipe with one of
these dedicated request_ functions (rather than galah_call()) depending
on the type of data you want to collect.
For example, if you want to download occurrences, use request_data():
x <- request_data("occurrences") |> # note that "occurrences" is the default `type`
  filter(species == "Crinia tinnula",
         year == 2010) |>
  collect()
You’ll notice that this query differs slightly from the query structure
used in earlier versions of galah. The desired data type,
"occurrences", is specified at the beginning of the query within
request_data() rather than at the end using atlas_occurrences().
Specifying the data type at the start allows users to make use of
advanced query building using three newly implemented stages of query
building: collapse(), compute() and collect(). These stages mirror
existing functions in dplyr for querying databases, and act in the
following way:
- collapse() converts the object to a query. This allows users to inspect their API calls before they are sent. Depending on the request, this function may also call ‘supplementary’ APIs to collect required information, such as Taxon Concept Identifiers or field names.
- compute() is intended to send the query in question to the requested API for processing. This is particularly important for occurrences, where it can be useful to submit a query and retrieve it at a later time. If the compute() stage is not required, however, compute() simply converts the query to a new class (computed_query).
- collect() retrieves the requested data into your workspace, returning a tibble.
We can use these in sequence, or just leap ahead to the stage we want:
x <- request_data() |>
  filter(genus == "Crinia", year == 2020) |>
  group_by(species) |>
  arrange(species) |>
  count()
collapse(x)
## Object of class query with type data/occurrences-count-groupby
## url: https://api.ala.org.au/occurrences/occurrences/facets?fq=%28genus%3A%2...
## arrange: species (ascending)
compute(x)
## Object of class computed_query with type data/occurrences-count-groupby
## url: https://api.ala.org.au/occurrences/occurrences/facets?fq=%28genus%3A%2...
## arrange: species (ascending)
collect(x) |> head()
## # A tibble: 6 × 2
##   species              count
##   <chr>                <int>
## 1 Crinia bilingua         74
## 2 Crinia deserticola     253
## 3 Crinia flindersensis     3
## 4 Crinia georgiana      1509
## 5 Crinia glauerti       3111
## 6 Crinia insignifera     530
The benefit of using collapse(), compute() and collect() is that
queries are more modular. This is particularly useful for large data
requests in galah. Users can send their query using compute() and
download the data with collect() once the query has finished, rather
than waiting for the request to finish within R.
# Create and send query to be calculated server-side
request <- request_data() |>
  identify("perameles") |>
  filter(year > 1900) |>
  compute()

# Download data
request |>
  collect()
Additionally, functions that are more modular are generally easier to
interrogate and debug. Previously some functions did several different
things, making it difficult to know which APIs were being called, when,
and for what purpose. Partitioning queries into three distinct stages is
much more transparent, and allows users to check their query
construction prior to sending a request. For example, the query above is
constructed with the following information, returned by collapse().
request_data() |>
  identify("perameles") |>
  filter(year > 1900) |>
  collapse()
## Object of class query with type data/occurrences
## url: https://api.ala.org.au/occurrences/occurrences/offline/download?fq=%28...
The collapse() stage includes an additional argument (.expand) that,
when set to TRUE, shows all the APIs called to construct the
user-requested query. This is especially useful for debugging.
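The staged design described in this section can be sketched with a few S3 generics in base R. This is only an illustration under stated assumptions: to_query(), send_query() and fetch() are hypothetical stand-ins for collapse(), compute() and collect(), the toy_ classes and the example.org URL are invented, and nothing is sent over the network. The sketch shows how each stage returns a new class, and how calling a later stage on an earlier object runs the intermediate stages first.

```r
# Hypothetical generics standing in for collapse(), compute() and collect()
to_query   <- function(x, ...) UseMethod("to_query")
send_query <- function(x, ...) UseMethod("send_query")
fetch      <- function(x, ...) UseMethod("fetch")

# Stage 1: build the URL from the request, but send nothing
to_query.toy_request <- function(x, ...) {
  url <- paste0("https://example.org/records?q=",
                utils::URLencode(x$filter, reserved = TRUE))
  structure(list(type = "data/occurrences", url = url), class = "toy_query")
}

# Stage 2: a real client would submit the request here;
# this sketch only records a status and changes the class
send_query.toy_query <- function(x, ...) {
  structure(c(unclass(x), list(status = "submitted")),
            class = "toy_computed_query")
}
send_query.toy_request <- function(x, ...) send_query(to_query(x))

# Stage 3: stand-in for the download, returning a fixed data frame
fetch.toy_computed_query <- function(x, ...) {
  data.frame(url = x$url, status = "downloaded")
}
fetch.toy_query   <- function(x, ...) fetch(send_query(x))
fetch.toy_request <- function(x, ...) fetch(send_query(x))

req <- structure(list(filter = "genus:Crinia"), class = "toy_request")

q   <- to_query(req)    # inspect the URL before anything is "sent"
cq  <- send_query(q)    # move to the next stage
out <- fetch(req)       # or leap ahead from the original request

class(q)
class(cq)
class(out)
```

Keeping each stage as a separate class means an intermediate object can be printed, stored or passed on later, which is what makes the query inspectable before it is run.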
Object classes
Under the hood, the different query-building verbs each amend the supplied object to a new class:
- collapse() returns class query, which is a list containing a type slot and one or more urls
- compute() returns a single object of class computed_query
- collect() returns a tibble
The dedicated request_ functions can be called directly, or via the
method and type arguments of galah_call(), which specify which
request_ function and data type to return.
To demonstrate what we mean, take the following calls, which despite
using different syntax, all return the number of records available for
the year 2020:
# new syntax
request_data() |>
  filter(year == 2020) |>
  count() |>
  collect()

# similar, but using `galah_call()`
galah_call(method = "data",
           type = "occurrences-count") |>
  filter(year == 2020) |>
  collect()

# original syntax
galah_call() |>
  galah_filter(year == 2020) |>
  atlas_counts()
Another example is to list available fields in the selected atlas:
request_metadata(type = "fields") |>
  collect()

galah_call(method = "metadata",
           type = "fields") |>
  collect()

show_all(fields)
Or to show values for states and territories:
request_metadata() |>
  filter(field == "cl22") |>
  unnest() |>
  collect()

galah_call(method = "metadata",
           type = "fields-unnest") |>
  galah_filter(id == "cl22") |>
  collect()

search_all(fields, "cl22") |>
  show_values()
While request_metadata() is more modular than show_all(), there is
little benefit to using it for most applications. However, in some
cases, larger databases like GBIF return huge data.frames of metadata
when called via show_all(). Using request_metadata() allows users to
specify a slice_head() line within their pipe to get around this issue.
Which syntax should I prefer?
Despite these benefits, we have no plans to require users to call
masked functions. Functions prefixed with galah_ or atlas_ are not
going away. Indeed, while there is perfect redundancy between old and
new syntax in some cases, in others they serve different purposes. In
atlas_media() for example, several calls are made and joined in a way
that reduces the number of steps required by the user. Under the hood,
however, all atlas_ functions are now entirely built using the above
syntax.