The filter()
function is used to subset a data, retaining all rows that
satisfy your conditions. To be retained, the row must produce a value of
TRUE
for all conditions. Unlike 'local' filters that act on a tibble
,
the galah implementations work by amending a query which is then enacted
by collect()
or one of the atlas_
family of functions (such as
atlas_counts()
or atlas_occurrences()
).
Arguments
- .data
An object of class
data_request
,metadata_request
orfiles_request
, created usinggalah_call()
or related functions.- ...
Expressions that return a logical value, and are defined in terms of the variables in the selected atlas (and checked using
show_all(fields)
. If multiple expressions are included, they are combined with the & operator. Only rows for which all conditions evaluate toTRUE
are kept.- profile
Details
Syntax
filter.data_request()
and galah_filter()
uses non-standard evaluation
(NSE), and are designed to be as compatible as possible with
dplyr::filter()
syntax. Permissible examples include:
==
(e.g.year = 2020
) but not=
(for consistency withdplyr
)!=
, e.g.year != 2020
)>
or>=
(e.g.year >= 2020
)<
or<=
(e.g.year <= 2020
)OR
statements (e.g.year == 2018 | year == 2020
)AND
statements (e.g.year >= 2000 & year <= 2020
)
Some general tips:
Separating statements with a comma is equivalent to an
AND
statement; Ergofilter(year >= 2010 & year < 2020)
is the same as_filter(year >= 2010, year < 2020)
.All statements must include the field name; so
filter(year == 2010 | year == 2021)
works, as doesfilter(year == c(2010, 2021))
, butfilter(year == 2010 | 2021)
fails.It is possible to use an object to specify required values, e.g.
year_value <- 2010; filter(year > year_value)
.solr
supports range queries on text as well as numbers; sofilter(cl22 >= "Tasmania")
is valid.It is possible to filter by 'assertions', which are statements about data validity, such as
filter(assertions != c("INVALID_SCIENTIFIC_NAME", "COORDINATE_INVALID")
. Valid assertions can be found usingshow_all(assertions)
.
Exceptions
When querying occurrences, species, or their respective counts (i.e. all of
the above examples), field names are checked internally against
show_all(fields)
. There are some cases where bespoke field names are
required, as follows.
When requesting a data download from a DOI, the field doi
is valid, i.e.:
galah_call() |>
filter(doi = "a-long-doi-string") |>
collect()
For taxonomic metadata, the taxa
field is valid:
request_metadata() |>
filter(taxa == "Chordata") |>
unnest()
For building taxonomic trees, the rank
field is valid:
request_data() |>
identify("Chordata") |>
filter(rank == "class") |>
atlas_taxonomy()
Media queries are more involved, but break two rules: they accept the media
field, and they accept a tibble on the rhs of the equation. For example,
users wishing to break down media queries into their respective API calls
should begin with an occurrence query:
occurrences <- galah_call() |>
identify("Litoria peronii) |>
select(group = c("basic", "media") |>
collect()
They can then use the media
field to request media metadata:
media_metadata <- galah_call("metadata") |>
filter(media == occurrences) |>
collect()
And finally, the metadata tibble can be used to request files:
galah_call("files") |>
filter(media == media_metadata) |>
collect()
See also
select()
,
group_by()
and geolocate()
for
other ways to amend the information returned by atlas_()
functions. Use
search_all(fields)
to find fields that you can filter by, and
show_values()
to find what values of those filters are available.
Examples
if (FALSE) { # \dontrun{
galah_call() |>
filter(year >= 2019,
basisOfRecord == "HumanObservation") |>
count() |>
collect()
} # }