Narrow your results
Martin Westgate & Dax Kellie
2024-04-09
Source: vignettes/narrow_your_results.Rmd
Each occurrence record contains taxonomic information and information
about the observation itself, like its location and the date of
observation. These pieces of information are recorded and categorised
into respective fields. When you import data using galah, the columns of
the resulting tibble correspond to these fields.
Data fields are important because they let you narrow and refine queries so that they return only the information you need. Consequently, much of galah's architecture is designed to make narrowing as simple as possible. The functions used for narrowing are:
- galah_identify() or identify()
- galah_filter() or filter()
- galah_select() or select()
- galah_group_by() or group_by()
- galah_geolocate() or st_crop()
These names have been chosen to echo comparable functions from dplyr,
namely filter(), select() and group_by(). With the exception of
galah_geolocate(), they also use dplyr tidy evaluation and syntax. This
means that you can alternate between the dplyr and galah versions of
these functions as you see fit. Below we use the galah_ prefix for
consistency with earlier versions of this vignette.
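As a minimal sketch of that interchangeability, the two pipelines below
should describe the same query. This assumes a recent galah release in
which the dplyr-style methods are available for galah_call() objects,
and it attaches dplyr so that filter() refers to the dplyr generic; both
points are assumptions rather than something stated in this vignette.
The taxon and year are illustrative only.

```r
library(galah)
library(dplyr)  # assumption: attached so that filter() is the dplyr generic

galah_config(atlas = "Australia")

# galah_-prefixed verbs, as used in the rest of this vignette
galah_call() |>
  galah_identify("Reptilia") |>
  galah_filter(year == 2020) |>
  atlas_counts()

# dplyr-style verbs applied to the same kind of request object
galah_call() |>
  identify("Reptilia") |>
  filter(year == 2020) |>
  atlas_counts()
```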
galah_identify & search_taxa
Perhaps unsurprisingly, search_taxa() searches for taxonomic
information. It uses fuzzy matching, much like the search bar on the
Atlas of Living Australia website, and you can use it to search for taxa
by their scientific name.
Finding your desired taxon with search_taxa() is an important first step
before using that taxonomic information to download data. For example,
to search for reptiles, we first check that our query matches the
correct taxon:
search_taxa("Reptilia")
## # A tibble: 1 × 9
## search_term scientific_name taxon_concept_id rank match_type kingdom phylum class issues
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Reptilia REPTILIA https://biodiversity.org.au/afd/taxa/682e1228-5b3c-45ff-833b-550efd40c399 class exactMatch Animalia Chordata Reptilia noIssue
If we want to be more specific, we can provide a tibble (or data.frame)
containing additional taxonomic information.
search_taxa(tibble(genus = "Eolophus", kingdom = "Aves"))
## # A tibble: 1 × 13
## search_term scientific_name scientific_name_authorship taxon_concept_id rank match_type kingdom phylum class order family genus issues
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Eolophus_Aves Eolophus Bonaparte, 1854 https://biodiversity.org.au/afd/taxa/009169a9-a9… genus exactMatch Animal… Chord… Aves Psit… Cacat… Eolo… noIss…
Once we know that our search matches the correct taxon or taxa, we can
use galah_identify() to narrow the results of our query.
galah_call() |>
galah_identify("Reptilia") |>
atlas_counts()
## # A tibble: 1 × 1
## count
## <int>
## 1 1732625
taxa <- search_taxa(tibble(genus = "Eolophus", kingdom = "Aves"))
galah_call() |>
galah_identify(taxa) |>
atlas_counts()
## # A tibble: 1 × 1
## count
## <int>
## 1 1094712
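Because later steps depend on the match being correct, it can help to
check the result of search_taxa() programmatically before passing it to
galah_identify(). The sketch below uses a hypothetical helper,
needs_review(), and a hand-built data frame that imitates the
search_term, taxon_concept_id and issues columns shown above. It assumes
that unmatched search terms come back with a missing taxon_concept_id,
which you should confirm against your own results.

```r
# Hypothetical helper (not part of galah): return the search terms whose
# match is missing or carries an issue flag other than "noIssue"
needs_review <- function(taxa) {
  unmatched <- is.na(taxa$taxon_concept_id)
  flagged <- !is.na(taxa$issues) & taxa$issues != "noIssue"
  taxa$search_term[unmatched | flagged]
}

# Hand-built stand-in for a search_taxa() result; the second row is a
# made-up unmatched term, included only to exercise the helper
mock_taxa <- data.frame(
  search_term = c("Reptilia", "not a real taxon"),
  taxon_concept_id = c(
    "https://biodiversity.org.au/afd/taxa/682e1228-5b3c-45ff-833b-550efd40c399",
    NA
  ),
  issues = c("noIssue", NA),
  stringsAsFactors = FALSE
)

needs_review(mock_taxa)
```

With a real result, an empty vector from needs_review() would suggest
that every search term matched without a flagged issue.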
If you’re using an international atlas, search_taxa() will automatically
switch to the local name-matching service. For example, Portugal uses
the GBIF taxonomic backbone, yet integrates seamlessly with our standard
workflow.
galah_config(atlas = "Portugal")
## Atlas selected: GBIF Portugal (GBIF.pt) [Portugal]
galah_call() |>
galah_identify("Lepus") |>
galah_group_by(species) |>
atlas_counts()
## # A tibble: 5 × 2
## species count
## <chr> <int>
## 1 Lepus granatensis 1378
## 2 Lepus microtis 64
## 3 Lepus europaeus 10
## 4 Lepus saxatilis 2
## 5 Lepus capensis 1
Conversely, the UK’s National Biodiversity Network (NBN) has its own taxonomic backbone, but is supported using the same function calls.
galah_config(atlas = "United Kingdom")
## Atlas selected: National Biodiversity Network (NBN) [United Kingdom]
galah_call() |>
galah_filter(genus == "Bufo") |>
galah_group_by(species) |>
atlas_counts()
## # A tibble: 3 × 2
## species count
## <chr> <int>
## 1 Bufo bufo 95466
## 2 Bufo spinosus 91
## 3 Bufo marinus 1
galah_filter
Perhaps the most important function in galah is galah_filter(), which is
used to filter the rows of queries.
galah_config(atlas = "Australia")
## Atlas selected: Atlas of Living Australia (ALA) [Australia]
# Get total record count for years after 2000
galah_call() |>
galah_filter(year > 2000) |>
atlas_counts()
## # A tibble: 1 × 1
## count
## <int>
## 1 92053621
# Get total record count for iNaturalist Australia for years after 2000
galah_call() |>
galah_filter(
year > 2000,
dataResourceName == "iNaturalist Australia") |>
atlas_counts()
## # A tibble: 1 × 1
## count
## <int>
## 1 6589403
To find available fields and corresponding valid values, use the field
lookup functions show_all(fields), search_all(fields) & show_values().
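For instance, the lookup functions could be combined as follows before
writing a filter. This is a sketch that assumes the Australian atlas is
selected; the search string and the field name are illustrative choices.

```r
library(galah)
galah_config(atlas = "Australia")

# List every field available for the selected atlas
show_all(fields)

# Find fields that match the search string "resource"
search_all(fields, "resource")

# Show the values recorded for a matching field, for use in galah_filter()
search_all(fields, "basisOfRecord") |>
  show_values()
```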
galah_filter() can also be used to make more complex taxonomic queries
than are possible using search_taxa(). By using the taxonConceptID
field, it is possible to build queries that exclude certain taxa, for
example. This can be useful to filter for paraphyletic concepts such as
invertebrates.
galah_call() |>
galah_filter(
taxonConceptID == search_taxa("Animalia")$taxon_concept_id,
taxonConceptID != search_taxa("Chordata")$taxon_concept_id
) |>
galah_group_by(class) |>
atlas_counts()
## # A tibble: 70 × 2
## class count
## <chr> <int>
## 1 Insecta 6228043
## 2 Gastropoda 970153
## 3 Arachnida 812946
## 4 Maxillopoda 700845
## 5 Malacostraca 657558
## 6 Polychaeta 276240
## 7 Bivalvia 231228
## 8 Anthozoa 221108
## 9 Cephalopoda 148054
## 10 Demospongiae 117913
## # ℹ 60 more rows
galah_apply_profile
When working with the ALA, a notable feature is the ability to specify a
data profile (a set of data quality filters) to remove records that are
suspect in some way.
galah_call() |>
galah_filter(year > 2000) |>
galah_apply_profile(ALA) |>
atlas_counts()
## # A tibble: 1 × 1
## count
## <int>
## 1 82002698
To see a full list of data quality profiles, use show_all(profiles).
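To see what a profile actually removes, its individual filters can be
listed as well. The sketch below assumes the Australian atlas is
selected and uses the "ALA" profile from the example above.

```r
library(galah)
galah_config(atlas = "Australia")

# List the data quality profiles offered by the selected atlas
show_all(profiles)

# Inspect the individual filters that make up the "ALA" profile
search_all(profiles, "ALA") |>
  show_values()
```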
galah_group_by
Use galah_group_by() to group and summarise record counts by specified
fields.
# Get record counts from 2016 to 2020, grouped by year and basis of record
galah_call() |>
galah_filter(year > 2015 & year <= 2020) |>
galah_group_by(year, basisOfRecord) |>
atlas_counts()
## # A tibble: 36 × 3
## year basisOfRecord count
## <chr> <chr> <int>
## 1 2020 HUMAN_OBSERVATION 6583810
## 2 2020 OCCURRENCE 419843
## 3 2020 PRESERVED_SPECIMEN 86211
## 4 2020 MACHINE_OBSERVATION 39643
## 5 2020 OBSERVATION 24801
## 6 2020 MATERIAL_SAMPLE 2034
## 7 2020 LIVING_SPECIMEN 62
## 8 2019 HUMAN_OBSERVATION 5753392
## 9 2019 OCCURRENCE 290610
## 10 2019 PRESERVED_SPECIMEN 167373
## # ℹ 26 more rows
galah_select
Use galah_select() to select which columns are returned when downloading
records.
# Return columns 'kingdom', 'species' & 'eventDate' only
occurrences <- galah_call() |>
galah_identify("reptilia") |>
galah_filter(year == 1930) |>
galah_select(kingdom, species, eventDate) |>
atlas_occurrences()
occurrences |> head()
You can also use the selection helpers that work with dplyr::select(),
such as starts_with() and ends_with(), inside galah_select().
occurrences <- galah_call() |>
galah_identify("reptilia") |>
galah_filter(year == 1930) |>
galah_select(starts_with("accepted") | ends_with("record")) |>
atlas_occurrences()
occurrences |> head()
## # A tibble: 6 × 6
## acceptedNameUsage acceptedNameUsageID basisOfRecord raw_basisOfRecord OCCURRENCE_STATUS_INFERRED_FROM_BASIS_OF_RECORD userDuplicateRecord
## <chr> <lgl> <chr> <chr> <lgl> <lgl>
## 1 <NA> NA PRESERVED_SPECIMEN Museum specimen FALSE FALSE
## 2 <NA> NA PRESERVED_SPECIMEN PreservedSpecimen FALSE FALSE
## 3 <NA> NA PRESERVED_SPECIMEN Museum specimen FALSE FALSE
## 4 <NA> NA HUMAN_OBSERVATION HumanObservation FALSE FALSE
## 5 <NA> NA PRESERVED_SPECIMEN PreservedSpecimen FALSE FALSE
## 6 <NA> NA PRESERVED_SPECIMEN PreservedSpecimen FALSE FALSE
galah_geolocate
Use galah_geolocate() to specify a geographic area or region to limit
your search.
# Get list of Perameles species in the area specified:
# (Note: This can also be specified by a shapefile)
wkt <- "POLYGON((131.36328125 -22.506468769126,135.23046875 -23.396716654542,134.17578125 -27.287832521411,127.40820312499 -26.661206402316,128.111328125 -21.037340349154,131.36328125 -22.506468769126))"
galah_call() |>
galah_identify("perameles") |>
galah_geolocate(wkt) |>
atlas_species()
## # A tibble: 1 × 11
## taxon_concept_id species_name scientific_name_auth…¹ taxon_rank kingdom phylum class order family genus vernacular_name
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 https://biodiversity.org.au/afd/taxa/f234f91f-8e00-405d-a89b-eb8fb… Perameles e… Spencer, 1897 species Animal… Chord… Mamm… Pera… Peram… Pera… Desert Bandico…
## # ℹ abbreviated name: ¹scientific_name_authorship
galah_geolocate() also accepts shapefiles. More complex shapefiles may
need to be simplified first (e.g., using rmapshaper::ms_simplify()).
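As a rough sketch of the spatial-object route, the polygon from the WKT
example above can be read into an sf object and passed to
galah_geolocate() directly. This assumes the sf package is installed and
that the coordinates are longitude/latitude in WGS84 (EPSG:4326); the
object name region is illustrative. A shapefile read from disk with
sf::st_read() would be used in the same way.

```r
library(galah)
library(sf)

galah_config(atlas = "Australia")

# Same polygon as in the WKT example above, read into an sf object
# (assumption: coordinates are longitude/latitude, EPSG:4326)
wkt <- "POLYGON((131.36328125 -22.506468769126,135.23046875 -23.396716654542,134.17578125 -27.287832521411,127.40820312499 -26.661206402316,128.111328125 -21.037340349154,131.36328125 -22.506468769126))"
region <- st_as_sfc(wkt, crs = 4326) |>
  st_sf()

# Confirm the geometry is valid before sending it to the atlas
st_is_valid(region)

# Use the sf object in place of the WKT string
galah_call() |>
  galah_identify("perameles") |>
  galah_geolocate(region) |>
  atlas_species()
```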