Spatial filtering#

Adapted from the Spatial Filtering galah-R article.

Callum Waite, Shandiya Balasubramaniam, Amanda Buyan

Biodiversity queries to the ALA usually require some spatial filtering. Let’s see how spatial data are stored in the ALA and a few different methods to spatially filter data with {galah}, {shapely} and {geopandas}.

Most records in the ALA contain location data in the form of two key fields: decimalLatitude and decimalLongitude. As expected, these fields are the decimal coordinates of each occurrence, with south and west values denoted by negatives. While there may be some uncertainty in these values (see field coordinateUncertaintyInMeters), they are generally very accurate.

While these are very important and useful fields, very rarely will we encounter situations that require queries directly calling decimalLatitude and decimalLongitude. Instead, the ALA and galah have a number of features that make spatial queries simpler.

Contextual and spatial layers#

Often we want to filter results down to some commonly defined spatial regions, such as states, LGAs or IBRA/IMCRA regions. The ALA contains a large range (>100) of contextual and spatial layers, in-built as searchable and queriable fields. They are denoted by names beginning with “cl”, followed by an identifying number that may be up to 6 digits long. These fields are each based on shapefiles, and contain the names of the regions in these layers that each record lies in.

We strongly recommend using galah.search_all(fields="<your-search-term-here>") to check whether a contextual layer already exists in the ALA that matches what you require before proceeding with other methods of spatial filtering. These fields are all able to be used in the filters argument of atlas_counts, atlas_occurrences, atlas_species or atlas_media, so they are generally easier to use.

Suppose we are interested in querying records of the Red-Necked Avocet (Recurvirostra novaehollandiae) in a particular protected wetlands, the Coorong wetlands in South Australia. We can search the ALA fields for wetlands.

>>> import galah
>>> galah.galah_config(email="your-email-here")
>>> galah.search_all(fields="wetlands")
        id                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           description    type link
0    cl901                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Directory of Important Wetlands Directory of Important Wetlands Spatial Database including Wetlands Type and Criteria  layers     
1  cl11192  Ramsar Wetlands of Australia 2024 National dataset of Australia's Ramsar Wetlands.\r\n\r\nThe Convention on Wetlands of International Importance (the Ramsar Convention) was signed in Ramsar, Iran on 2 February 1971. The Ramsar Convention aims to halt the worldwide loss of wetlands and to conserve, through wise use and management, those that remain. The Convention encourages member countries to nominate sites containing representative, rare or unique wetlands, or that are important for conserving biological diversity, to the List of Wetlands of International Importance (Ramsar sites). Australia was one of the first countries to become a Contracting Party to the Convention and designated the world's first Ramsar site, Cobourg Peninsula, in 1974.\r\n\r\nThis project was initiated by the Wetlands Section of the Australian Government Department of Climate Change, Energy, the Environment and Water. Spatial data was sourced from the relevant State and Territory agencies and compiled into a single national coverage.\r\n\r\nIn WGS84 CRS.  layers

Our search identifies that layer cl901 seems to match what we are looking for. We can then either view all possible values in the field with galah.show_values(), or use galah.search_values() again for our particular field.

>>> galah.search_values(field="cl901",value="coorong")
   field                                     category
0  cl901  The Coorong, Lake Alexandrina & Lake Albert

We can filter all occurrences for exact matches with this value, “The Coorong, Lake Alexandrina & Lake Albert”. Our galah query can be built as follows:

>>> avocets = galah.atlas_occurrences(
...   taxa = "Recurvirostra novaehollandiae",
...   filters = "cl901=The Coorong, Lake Alexandrina & Lake Albert"
... )
>>> avocets.head(5)
   decimalLatitude  decimalLongitude             eventDate                 scientificName                                                             taxonConceptID                              recordID                          dataResourceName occurrenceStatus
0       -36.525833        139.823608  2012-07-29T00:00:00Z  Recurvirostra novaehollandiae  https://biodiversity.org.au/afd/taxa/c69e7308-527a-429d-a80d-143bd20b5100  838c1162-6c3e-4b5f-84d3-3639556f1acf               BirdLife Australia, Birdata          PRESENT
1       -36.521950        139.827200                   NaN  Recurvirostra novaehollandiae  https://biodiversity.org.au/afd/taxa/c69e7308-527a-429d-a80d-143bd20b5100  3e4d190c-7032-48ba-885e-ebf513fe5af0  Australian Waterbird Surveys (1983-2018)          PRESENT
2       -36.521950        139.827200  2010-10-21T00:00:00Z  Recurvirostra novaehollandiae  https://biodiversity.org.au/afd/taxa/c69e7308-527a-429d-a80d-143bd20b5100  8fdebcda-de5e-4c2c-94dc-a9fb25cd6817  Australian Waterbird Surveys (1983-2018)          PRESENT
3       -36.521950        139.827200                   NaN  Recurvirostra novaehollandiae  https://biodiversity.org.au/afd/taxa/c69e7308-527a-429d-a80d-143bd20b5100  e02745af-44d4-4ad8-9816-7aaaaa318a82  Australian Waterbird Surveys (1983-2018)          PRESENT
4       -36.518368        139.824353  1997-01-01T00:00:00Z  Recurvirostra novaehollandiae  https://biodiversity.org.au/afd/taxa/c69e7308-527a-429d-a80d-143bd20b5100  447135d5-7ccf-4dd5-9ca5-e6bb1e238823                          NSW BioNet Atlas          PRESENT

Shapes and Regions#

While server-side spatial information is useful, there are likely to be cases where the shapefile or region you wish to query will not be pre-loaded as a contextual layer in the ALA. In this case, shapefiles can be introduced to the filtering process using the {shapely} package and the polygon or bbox arguments in atlas_counts, atlas_occurrences, atlas_species or atlas_media. Shapefiles can be provided as a shapely object, the string denoting the polygon itself, or in the base of bbox, a dictionary specifying xmin, xmax, ymin, and ymax. For instance, we might interested in species occurrences in King George Square, Brisbane. We can take the MULTIPOLYGON object for the square (as sourced from the Brisbane City Council) and transform it into a {shapely} object.

>>> import shapely.wkt
>>> king_george_square = shapely.wkt.loads("MULTIPOLYGON(((153.0243 -27.46886, 153.0242 -27.46896, 153.0236 -27.46837, 153.0239 -27.46814, 153.0239 -27.46813, 153.0242 -27.46789, 153.0244 -27.46805, 153.0245 -27.46821, 153.0246 -27.46828, 153.0247 -27.46835, 153.0248 -27.46848, 153.0246 -27.4686, 153.0246 -27.46862, 153.0245 -27.46871, 153.0243 -27.46886)))")

We can provide this MULTIPOLYGON as the argument polygon in atlas_occurrences to assess which species have been recorded in King George Square.

>>> species_king_george_square = galah.atlas_occurrences(
...   polygon=king_george_square,
...   fields=["decimalLatitude", "decimalLongitude", "eventDate", "scientificName", "vernacularName"]
... )
>>> species_king_george_square.head(10)
    decimalLatitude  decimalLongitude             eventDate         scientificName         vernacularName
0        -27.468625        153.024333  2006-03-30T09:32:00Z  Threskiornis moluccus  Australian White Ibis
1        -27.468611        153.024444  2023-12-02T09:12:02Z    Cosmophasis baehrae                    NaN
2        -27.468546        153.023911  2024-02-03T20:45:00Z            Mediastinia                    NaN
3        -27.468510        153.023821  2023-04-08T17:49:00Z     Entomyzon cyanotis  Blue-faced Honeyeater
4        -27.468418        153.024140  2022-08-30T11:29:00Z  Threskiornis moluccus  Australian White Ibis
5        -27.468364        153.024121  2024-08-28T10:31:54Z               Vitellus                    NaN
6        -27.468326        153.024077  2022-07-24T14:43:00Z  Threskiornis moluccus  Australian White Ibis
7        -27.468308        153.024097  2024-10-13T14:34:00Z  Threskiornis moluccus  Australian White Ibis
8        -27.468295        153.023967  2024-10-13T14:35:00Z     Gymnorhina tibicen      Australian Magpie
9        -27.468167        153.024047  2024-07-23T10:00:53Z  Threskiornis moluccus  Australian White Ibis
10       -27.468155        153.024011  2024-04-26T12:07:00Z  Threskiornis moluccus  Australian White Ibis

This argument can be provided as its string; however, we show you the above to give you an idea of what galah-python does when you provide it a string.

There is a third argument of atlas_counts, atlas_occurrences, atlas_species or atlas_media called simplify_polygon, which defaults to False. By setting the simplify_polygon argument to True, the provided POLYGON or MULTIPOLYGON will be converted into the smallest bounding box (rectangle) that contains the POLYGON. In this case, records will be included that may not exactly lie inside the provided shape.

>>> species_king_george_square = galah.atlas_occurrences(
...   polygon=king_george_sq,
...   simplify_polygon=True,
...   fields=["decimalLatitude", "decimalLongitude", "eventDate", "scientificName", "vernacularName"]
... )
>>> species_king_george_square.head(10)
   decimalLatitude  decimalLongitude             eventDate                             scientificName         vernacularName
0       -27.468889        153.024444  2015-10-09T00:00:00Z             Burhinus (Burhinus) grallarius      Bush Stone-curlew
1       -27.468817        153.023616  2019-04-21T14:40:00Z                      Threskiornis moluccus  Australian White Ibis
2       -27.468816        153.023613  2005-08-28T12:18:00Z                      Threskiornis moluccus  Australian White Ibis
3       -27.468680        153.023828  2024-07-15T15:00:18Z                      Pseudanapaea denotata                    NaN
4       -27.468680        153.023828  2024-01-13T15:07:42Z                      Tenodera australasiae    Purplewinged Mantid
5       -27.468680        153.023828  2024-01-13T15:04:53Z                  Pristhesancus plagipennis             Bee-killer
6       -27.468680        153.023828  2024-01-13T15:06:26Z  Toxorhynchites (Toxorhynchites) speciosus                    NaN
7       -27.468680        153.023828  2024-01-13T12:20:00Z                                  Xixuthrus                    NaN
8       -27.468680        153.023828  2024-07-15T12:12:27Z                    Columba (Columba) livia              Rock Dove
9       -27.468680        153.023828  2024-01-13T15:07:42Z              Lyramorpha (Lyramorpha) rosea       Litchi Stink Bug

Large shapefiles#

The simplify_polygon argument with option polygon is provided because objects with a large amount of vertices will take a long time to filter on the ALA’s end. In the event you have a large shapefile, using simplify_polygon=True will at least enable an initial reduction of the time it takes for the data to be downloaded, before finer filtering to the actual shapefile will obtain the desired set of occurrences.

Alternatively, one can also perform the “bbox” reduction before passing the shape to atlas_counts, atlas_occurrences, atlas_species or atlas_media by using {geopandas} and the unary_union function of the {shapely} package.

A common situation for this to occur is when a shapefile with multiple shapes is provided, where we are interested in grouping our results by each shape. Here is a mock workflow using a subset of a shapefile of all 2,184 Brisbane parks.

Let’s say we are interested in knowing which parks in the Brisbane postcode 4075 have the most occurrences of the Scaly-Breasted Lorikeet, Trichoglossus chlorolepidotus, since 2020. We can download the entire shapefile from the above link, and perform our filtering and summarising as follows:

>>> # first, get all parts within the Brisbane postcode 4075
>>> import geopandas as gpd
>>> parks = gpd.read_file("Park___Locations.shp")
>>> parks.columns
>>> brisbane_parks = parks[parks["POST_CODE"] == '4075']
>>> brisbane_parks.head(10)
   OBJECTID  ...                                           geometry
0        64  ...  POLYGON ((152.99236 -27.53098, 152.99118 -27.5...
1        65  ...  POLYGON ((152.99137 -27.53090, 152.99130 -27.5...
2       102  ...  POLYGON ((152.98534 -27.57025, 152.98534 -27.5...
3       115  ...  POLYGON ((152.98702 -27.57090, 152.98723 -27.5...
4       116  ...  POLYGON ((152.98525 -27.57294, 152.98513 -27.5...
5       147  ...  POLYGON ((152.98423 -27.57400, 152.98429 -27.5...
6       166  ...  POLYGON ((152.96348 -27.55583, 152.96348 -27.5...
7       188  ...  POLYGON ((152.98312 -27.56635, 152.98302 -27.5...
8       401  ...  POLYGON ((152.98792 -27.52198, 152.98800 -27.5...
9       402  ...  POLYGON ((152.98165 -27.51611, 152.98159 -27.5...

[10 rows x 13 columns]
>>> # second, get a shape of all the Brisbane parks and draw a bounding box around it
>>> import shapely
>>> from shapely.ops import unary_union
>>> brisbane_parks_all = gpd.GeoSeries(unary_union(brisbane_parks["geometry"]))
>>> brisbane_parks_bbox = brisbane_parks_all.bounds
         minx       miny        maxx       maxy
0  152.963302 -27.577386  152.996692 -27.516073

To visualise

>>> plot_brisbane_parks_bbox = shapely.box(xmin=brisbane_parks_bbox["minx"][0],
...                                xmax=brisbane_parks_bbox["maxx"][0],
...                                ymin=brisbane_parks_bbox["miny"][0],
...                                ymax=brisbane_parks_bbox["maxy"][0]
...                                )
>>> brisbane_parks_all.plot(edgecolor = "#5A5A5A", linewidth = 1, facecolor = "white", figsize = (7,10))
>>> plt.plot(*plot_brisbane_parks_bbox.exterior.xy,color="red")
>>> plt.ylabel("Latitude",size=16,x=.45,y=0.5)
>>> plt.xlabel("Longitude",size=16)
../_images/brisbane_parks_and_bbox.png
>>> # third, find all occurrences of Trichoglossus chlorolepidotus in the bounding box in 2022
>>> import galah
>>> galah.galah_config(email="your-email-here")
>>> lorikeet_brisbane = galah.atlas_occurrences(
...     taxa="Trichoglossus chlorolepidotus",
...     filters="year=2022",
...     bbox=brisbane_parks_bbox
... )
>>> lorikeet_brisbane
      decimalLatitude  decimalLongitude  ...       dataResourceName occurrenceStatus
0          -27.576858        152.970235  ...        eBird Australia          PRESENT
1          -27.576303        152.981694  ...        eBird Australia          PRESENT
2          -27.575771        152.989980  ...  iNaturalist Australia          PRESENT
3          -27.574852        152.992831  ...        eBird Australia          PRESENT
4          -27.574169        152.995529  ...        eBird Australia          PRESENT
...               ...               ...  ...                    ...              ...
2780       -27.517759        152.985278  ...        eBird Australia          PRESENT
2781       -27.517759        152.985278  ...        eBird Australia          PRESENT
2782       -27.517759        152.985278  ...        eBird Australia          PRESENT
2783       -27.517759        152.985278  ...        eBird Australia          PRESENT
2784       -27.517759        152.985278  ...        eBird Australia          PRESENT

[2785 rows x 8 columns]
>>> brisbane_parks_all.plot(edgecolor = "#5A5A5A", linewidth = 1, facecolor = "white", figsize = (7,10))
>>> plt.plot(*plot_brisbane_parks_bbox.exterior.xy,color="red")
>>> plt.ylabel("Latitude",size=16,x=.45,y=0.5)
>>> plt.xlabel("Longitude",size=16)
>>> plt.scatter(lorikeet_brisbane["decimalLongitude"],lorikeet_brisbane["decimalLatitude"],alpha=0.5,color="orange",label="Lorikeet occurrences")
>>> plt.legend(loc=(0.5,0.96))
../_images/lorikeets_on_map_shapefile.png
>>> # Filter records down to only those in the shapefile polygons
>>> brisbane_parks["count"] = pd.Series([0 for i in range(len(brisbane_parks))])
>>> for i,park in brisbane_parks.iterrows():
...     points = [(x,y) for x,y in zip(lorikeet_brisbane["decimalLongitude"], lorikeet_brisbane["decimalLatitude"]) if shapely.contains_xy(park["geometry"],x,y)]
...     brisbane_parks.at[i,"count"] = len(points)
>>> brisbane_parks_counts = brisbane_parks[["PARK_NAME","count"]].sort_values("count",ascending=False).reset_index(drop=True)
>>> brisbane_parks_counts.head(10)
                        PARK_NAME  count
0                NYUNDARE-BA PARK    552
1              SHERWOOD ARBORETUM    472
2                   FAULKNER PARK     64
3              FORT ROAD BUSHLAND     46
4         STRICKLAND TERRACE PARK     44
5               BENARRAWA RESERVE     34
6  GRACEVILLE RIVERSIDE PARKLANDS     33
7                  NOSWORTHY PARK     17
8           HORACE WINDOW RESERVE     14
9          GRACEVILLE AVENUE PARK      7

Some shapefiles cover large geographic areas with the caveat that even the bounding box doesn’t restrict the number of records to a value that can be downloaded easily. In this case, we recommend more nuances and detailed methods that can be performed using looping techniques. One of our ALA Labs blog posts, Hex maps for species occurrence data, has been written detailing how to approach larger problems such as this.