The Global Biodiversity Information Facility (GBIF; https://www.gbif.org)
provides tools to enable users to find, access, combine and visualise
biodiversity data. galah is a dplyr extension package that enables the R
community to directly access data and resources hosted by GBIF and several
of its subsidiary organisations (known as 'nodes') using dplyr verbs.
The basic unit of data stored by these infrastructures is
an occurrence record, which is an observation of a biological entity at
a specific time and place. However, galah also facilitates access to
taxonomic information and to associated media such as images or sounds,
all while letting users restrict their queries to particular taxa or locations. Users
can specify which columns are returned by a query, or restrict their results
to observations that meet particular quality-control criteria.
For those outside Australia, 'galah' is the common name of Eolophus roseicapilla, a widely-distributed Australian bird species.
Functions
Getting Started
galah_config(): Set package configuration options
galah_call() / request_(): Start to build a request
Update a request object
apply_profile(): Restrict to data that pass predefined checks
arrange(): Arrange rows of a query on the server side
authenticate(): Authenticate your request via OAUTH in the browser
count(): Request counts of the specified data type
distinct(): Keep distinct/unique rows
filter(): Filter records (see also filter_object_classes)
geolocate(): Spatial filtering of a query
glimpse(): Get a glimpse of your data
group_by(): Group counts by one or more fields
identify(): Search for taxonomic identifiers (see also taxonomic_searches)
select(): Fields to report information for
slice_head(): Choose the first n rows of a download
unnest(): Expand metadata for fields, lists, profiles or taxa
Create and execute a query
capture(): Convert a request into a prequery or query
compound(): Convert an object into a query_set showing all calls needed for evaluation
collapse(): Convert an object to a valid query
compute(): Compute a query
collect(): Retrieve a database query
Wrappers for accessing data
show_all() & search_all(): Data for generating filter queries
show_values() & search_values(): Show or search for values within fields, profiles, lists, collections, datasets or providers
atlas_occurrences(): Download occurrence data
atlas_counts(): Get a summary of the number of records or species
atlas_species(): Download occurrences grouped by speciesID
atlas_taxonomy(): Download taxonomic trees
atlas_media(): Download media metadata linked to occurrences
collect_media(): Download media (images and sounds)
Miscellaneous functions
atlas_citation(): Get a citation for a dataset
read_zip(): Read data from an earlier download
print(): Print functions for galah objects
Terminology
To get the most value from galah, it is helpful to understand some
terminology. Each occurrence record contains taxonomic
information, and usually some information about the observation itself, such
as its location. In addition to this record-specific information, the living
atlases append contextual information to each record, particularly data from
spatial layers reflecting climate gradients or political boundaries. They
also run a number of quality checks against each record, resulting in
assertions attached to the record. Each piece of information
associated with a given occurrence record is stored in a field,
which corresponds to a column when imported to a
tibble. See show_all(fields) to view valid fields,
layers and assertions, or conduct a search using search_all(fields).
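As a rough illustration of the two lookups mentioned above, the sketch below lists every field and then searches for a subset. It assumes the package is configured for the Australian node and that the search term "precipitation" matches at least one field there; both choices are illustrative only, and the calls need an internet connection.

```r
library(galah)

# Assumption: query the Australian node; other nodes expose different fields
galah_config(atlas = "Australia")

# List every field, layer and assertion known to the chosen atlas
all_fields <- show_all(fields)
print(all_fields)

# Narrow the list to fields whose metadata mention a search term
# ("precipitation" is an illustrative term only)
rain_fields <- search_all(fields, "precipitation")
print(rain_fields)
```

The identifiers returned by either call are the names that can then be used when filtering or selecting.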
Data fields are important because they provide a means to filter
occurrence records; i.e. to return only the information that you need, and
no more. Consequently, much of the architecture of galah has been
designed to make filtering as simple as possible. The easiest way to do this
is to start a pipe with galah_call() and follow it with the relevant
dplyr function: usually filter(),
but also select(),
group_by() or others. Functions without
a dplyr equivalent include
identify() for choosing a taxon and
geolocate() for choosing a specific location. By combining different filters,
it is possible to build complex queries to return only the most valuable
information for a given problem.
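A minimal sketch of such a pipe follows. It assumes the Australian node, uses the galah itself as the example taxon, and assumes that a field called year is available for filtering and grouping; the object names request and yearly_counts are purely illustrative. Counts are requested rather than full occurrence records, so no registered email address should be needed, but an internet connection is.

```r
library(galah)
library(dplyr)  # loaded explicitly so the dplyr generics are available

# Assumption: query the Australian node
galah_config(atlas = "Australia")

# Build the request step by step; nothing is downloaded at this point
request <- galah_call() |>
  identify("Eolophus roseicapilla") |>  # choose a taxon
  filter(year >= 2020) |>               # keep recent records only
  group_by(year)                        # report one count per year

# Ask for counts and execute the query
yearly_counts <- request |>
  count() |>
  collect()

print(yearly_counts)
```

Keeping the request in its own object makes it easy to reuse the same filters for a different output, for example by swapping count() for a select() call before collecting occurrence records.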
A notable extension of the filtering approach is to remove records with low
'quality'. All living atlases perform quality control checks on all records
that they store. These checks are used to generate new fields that can then
be used to filter out records that are unsuitable for particular applications.
However, there are many possible data quality checks, and it is not always
clear which are most appropriate in a given instance. Therefore, galah
supports data quality profiles, which can be passed to
apply_profile() to quickly remove undesirable records. A full list of
data quality profiles is returned by show_all(profiles).
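The sketch below compares record counts with and without a data quality profile. It assumes the Australian node and a profile named ALA; treat that name as an assumption and check it against the output of show_all(profiles), since available profiles differ between atlases. The object names are illustrative.

```r
library(galah)
library(dplyr)  # loaded explicitly so the dplyr generics are available

# Assumption: query the Australian node
galah_config(atlas = "Australia")

# See which data quality profiles the atlas offers
profiles <- show_all(profiles)
print(profiles)

# Count all records for the taxon
all_records <- galah_call() |>
  identify("Eolophus roseicapilla") |>
  count() |>
  collect()

# Count only records that pass the checks in the assumed "ALA" profile
checked_records <- galah_call() |>
  identify("Eolophus roseicapilla") |>
  apply_profile(ALA) |>
  count() |>
  collect()

print(all_records)
print(checked_records)
```

The difference between the two counts gives a quick sense of how many records the profile excludes before committing to a full download.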
Author
Maintainer: Martin Westgate martin.westgate@csiro.au
Authors:
Dax Kellie dax.kellie@csiro.au
Other contributors:
Shandiya Balasubramaniam shandiya.balasubramaniam@csiro.au [contributor]
Matilda Stevenson [contributor]
