
Narrow your results
Martin Westgate & Dax Kellie
2023-11-07
Source:vignettes/narrow_your_results.Rmd
narrow_your_results.Rmd
Each occurrence record contains taxonomic information and information
about the observation itself, like its location and the date of
observation. These pieces of information are recorded and categorised
into respective fields. When you import data using
galah
, columns of the resulting tibble
correspond to these fields.
Data fields are important because they provide a means to manipulate
queries to return only the information that you need, and no more.
Consequently, much of the architecture of galah
has been
designed to make narrowing as simple as possible. These functions
include:
-
galah_identify
oridentify
-
galah_filter
orfilter
-
galah_select
orselect
-
galah_group_by
orgroup_by
-
galah_geolocate
orst_crop
These names have been chosen to echo comparable functions from
dplyr
; namely filter
, select
and
group_by
. With the exception of
galah_geolocate
, they also use dplyr
tidy
evaluation and syntax. This means that how you use dplyr
functions is also how you use galah_
functions.
galah_identify & search_taxa
Perhaps unsurprisingly, search_taxa
searches for
taxonomic information. It uses fuzzy matching to work a lot like the
search bar on the Atlas of Living
Australia website, and you can use it to search for taxa by their
scientific name. Finding your desired taxon with
search_taxa
is an important step to using this taxonomic
information to download data with galah
.
For example, to search for reptiles, we first need to identify whether we have the correct query:
search_taxa("Reptilia")
## # A tibble: 1 × 9
## search_term scientific_name taxon_concept_id rank match_type kingdom phylum class issues
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Reptilia REPTILIA https://biodiver… class exactMatch Animal… Chord… Rept… noIss…
If we want to be more specific by providing additional taxonomic
information to search_taxa
, you can provide a
tibble
(or data.frame
) containing more levels
of the taxonomic hierarchy:
search_taxa(tibble(genus = "Eolophus", kingdom = "Aves"))
## # A tibble: 1 × 13
## search_term scientific_name scientific_name_authorship taxon_concept_id rank match_type
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Eolophus_Aves Eolophus Bonaparte, 1854 https://biodive… genus exactMatch
## # ℹ 7 more variables: kingdom <chr>, phylum <chr>, class <chr>, order <chr>, family <chr>,
## # genus <chr>, issues <chr>
Once we know that our search matches the correct taxon or taxa, we
can use galah_identify
to narrow the results of our
queries:
galah_call() |>
galah_identify("Reptilia") |>
atlas_counts()
## # A tibble: 1 × 1
## count
## <int>
## 1 1673906
taxa <- search_taxa(tibble(genus = "Eolophus", kingdom = "Aves"))
galah_call() |>
galah_identify(taxa) |>
atlas_counts()
## # A tibble: 1 × 1
## count
## <int>
## 1 1092638
If you’re using an international atlas, search_taxa
will
automatically switch to using the local name-matching service. For
example, Portugal uses the GBIF taxonomic backbone, but integrates
seamlessly with our standard workflow.
galah_config(atlas = "Portugal")
## Atlas selected: GBIF Portugal (GBIF.pt) [Portugal]
galah_call() |>
galah_identify("Lepus") |>
galah_group_by(species) |>
atlas_counts()
## # A tibble: 5 × 2
## species count
## <chr> <int>
## 1 Lepus granatensis 1378
## 2 Lepus microtis 64
## 3 Lepus europaeus 10
## 4 Lepus saxatilis 2
## 5 Lepus capensis 1
Conversely, the UK’s National Biodiversity Network (NBN), has its’ own taxonomic backbone, but is supported using the same function call.
galah_config(atlas = "United Kingdom")
## Atlas selected: National Biodiversity Network (NBN) [United Kingdom]
galah_call() |>
galah_filter(genus == "Bufo") |>
galah_group_by(species) |>
atlas_counts()
## # A tibble: 3 × 2
## species count
## <chr> <int>
## 1 Bufo bufo 94054
## 2 Bufo spinosus 87
## 3 Bufo marinus 1
galah_filter
Perhaps the most important function in galah
is
galah_filter
, which is used to filter the rows of
queries:
galah_config(atlas = "Australia")
## Atlas selected: Atlas of Living Australia (ALA) [Australia]
# Get total record count since 2000
galah_call() |>
galah_filter(year > 2000) |>
atlas_counts()
## # A tibble: 1 × 1
## count
## <int>
## 1 90503179
# Get total record count for iNaturalist in 2021
galah_call() |>
galah_filter(
year > 2000,
dataResourceName == "iNaturalist Australia") |>
atlas_counts()
## # A tibble: 1 × 1
## count
## <int>
## 1 5600557
To find available fields and corresponding valid values, use the
field lookup functions show_all(fields)
,
search_all(fields)
& show_values()
.
Finally, a special case of galah_filter
is to make more
complex taxonomic queries than are possible using
search_taxa
. By using the taxonConceptID
field, it is possible to build queries that exclude certain taxa, for
example. This can be useful for paraphyletic concepts such as
invertebrates:
galah_call() |>
galah_filter(
taxonConceptID == search_taxa("Animalia")$taxon_concept_id,
taxonConceptID != search_taxa("Chordata")$taxon_concept_id
) |>
galah_group_by(class) |>
atlas_counts()
## # A tibble: 30 × 2
## class count
## <chr> <int>
## 1 Insecta 5806340
## 2 Gastropoda 957271
## 3 Maxillopoda 793194
## 4 Arachnida 689708
## 5 Malacostraca 651658
## 6 Polychaeta 276272
## 7 Bivalvia 234110
## 8 Anthozoa 216343
## 9 Cephalopoda 145929
## 10 Demospongiae 118375
## # ℹ 20 more rows
galah_apply_profile
When working with the ALA, a notable feature is the ability to
specify a profile
to remove records that are suspect in
some way.
galah_call() |>
galah_filter(year > 2000) |>
galah_apply_profile(ALA) |>
atlas_counts()
## # A tibble: 1 × 1
## count
## <int>
## 1 81471797
To see a full list of data quality profiles, use
show_all(profiles)
.
galah_group_by
Use galah_group_by
to group record counts and summarise
counts by specified fields:
# Get record counts since 2010, grouped by year and basis of record
galah_call() |>
galah_filter(year > 2015 & year <= 2020) |>
galah_group_by(year, basisOfRecord) |>
atlas_counts()
## # A tibble: 36 × 3
## year basisOfRecord count
## <chr> <chr> <int>
## 1 2020 Human observation 6551035
## 2 2020 Occurrence 419842
## 3 2020 Preserved specimen 84136
## 4 2020 Machine observation 38906
## 5 2020 Observation 24887
## 6 2020 Material Sample 1677
## 7 2020 Living specimen 62
## 8 2019 Human observation 5730445
## 9 2019 Occurrence 290610
## 10 2019 Preserved specimen 165391
## # ℹ 26 more rows
galah_select
Use galah_select
to choose which columns are returned
when downloading records:
# Get *Reptilia* records from 1930, but only 'eventDate' and 'kingdom' columns
occurrences <- galah_call() |>
galah_identify("reptilia") |>
galah_filter(year == 1930) |>
galah_select(kingdom, species, eventDate) |>
atlas_occurrences()
## Retrying in 1 seconds.
occurrences |> head()
## # A tibble: 6 × 3
## kingdom species eventDate
## <chr> <chr> <dttm>
## 1 Animalia Drysdalia coronoides 1930-06-16 00:00:00
## 2 Animalia Intellagama lesueurii 1930-01-01 00:00:00
## 3 Animalia <NA> 1930-04-23 00:00:00
## 4 Animalia <NA> 1930-01-01 00:00:00
## 5 Animalia Oxyuranus scutellatus 1930-01-01 00:00:00
## 6 Animalia Tympanocryptis centralis 1930-11-30 00:00:00
You can also use other dplyr
functions that work with
dplyr::select()
with galah_select()
occurrences <- galah_call() |>
galah_identify("reptilia") |>
galah_filter(year == 1930) |>
galah_select(starts_with("accepted") | ends_with("record")) |>
atlas_occurrences()
## Retrying in 1 seconds.
occurrences |> head()
## # A tibble: 6 × 6
## acceptedNameUsage acceptedNameUsageID basisOfRecord raw_basisOfRecord
## <chr> <lgl> <chr> <chr>
## 1 <NA> NA PRESERVED_SPECIMEN Museum specimen
## 2 <NA> NA HUMAN_OBSERVATION HumanObservation
## 3 <NA> NA PRESERVED_SPECIMEN PreservedSpecimen
## 4 <NA> NA HUMAN_OBSERVATION HumanObservation
## 5 <NA> NA PRESERVED_SPECIMEN PreservedSpecimen
## 6 <NA> NA HUMAN_OBSERVATION HumanObservation
## # ℹ 2 more variables: OCCURRENCE_STATUS_INFERRED_FROM_BASIS_OF_RECORD <lgl>,
## # userDuplicateRecord <lgl>
galah_geolocate
Use galah_geolocate
to specify a geographic area or
region to limit your search:
# Get list of perameles species only in area specified:
# (Note: This can also be specified by a shapefile)
wkt <- "POLYGON((131.36328125 -22.506468769126,135.23046875 -23.396716654542,134.17578125 -27.287832521411,127.40820312499 -26.661206402316,128.111328125 -21.037340349154,131.36328125 -22.506468769126))"
galah_call() |>
galah_identify("perameles") |>
galah_geolocate(wkt) |>
atlas_species()
## # A tibble: 1 × 10
## kingdom phylum class order family genus species author species_guid vernacular_name
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Animalia Chordata Mammalia Peram… Peram… Pera… Perame… Spenc… https://bio… Desert Bandico…