h3.dataloading package#

Submodules#

h3.dataloading.HurricaneDataset module#

class h3.dataloading.HurricaneDataset.HurricaneDataset(*args: Any, **kwargs: Any)[source]#

Bases: Dataset

h3.dataloading.cmorph module#

h3.dataloading.cmorph.cmorph_filename_range(start_date: date, stop_date: date) list[source]#
h3.dataloading.cmorph.load_cmorph(hurricane_date: date, day_range: int)[source]#
h3.dataloading.cmorph.main()[source]#

h3.dataloading.flood_risk module#

h3.dataloading.flood_risk.main()[source]#

h3.dataloading.general_df_utils module#

h3.dataloading.general_df_utils.calc_distance_between_df_cols(df: pd.DataFrame, cols_compare: list[tuple[str]] | list[list[str]], new_col_name: str = 'distance') pd.DataFrame[source]#

Calculate the geodesic distance between sets of lat/lon values. See https://geopy.readthedocs.io/en/stable/#module-geopy.distance for more info.

Parameters:
  • df (pd.DataFrame) – df containing two pairs of lat/lon values

  • cols_compare (list[tuple[str]] or list[list[str]]) – list of lat/lon column names, given as pairs (each pair a tuple or list)

  • new_col_name (str, optional) – name of the new distance column. The default is ‘distance’.

Returns:

copy of df with an extra distance column, named by new_col_name (‘distance’ by default)

Return type:

pd.DataFrame

h3.dataloading.general_df_utils.calc_means_df_cols(df: pd.DataFrame, col_names: list[str]) pd.DataFrame[source]#

Return mean values of prescribed columns in df

Parameters:
  • df (pd.DataFrame) –

  • col_names (list[str]) – list of columns to calculate mean

Return type:

list[float] of mean value of each column

h3.dataloading.general_df_utils.calculate_first_last_dates_from_df(df: pd.DataFrame, time_buffer: tuple[float, str] = [0, 'h'], date_col_name: list[str] = None) tuple[pd.Timestamp][source]#

Calculate the first and last dates from a df, with a time buffer before and after.

Parameters:
  • df (pd.DataFrame) – should contain at least one datetime column. If there are multiple datetime columns, the column to use should be specified; otherwise the first such column is used

  • time_buffer (tuple[float,str], defaults to [0,'h'] (no buffer)) – extra time to subtract from the first occurrence and add to the last occurrence. A string specifying the unit of time is necessary, e.g. ‘h’ for hours (either as a tuple or list). See https://pandas.pydata.org/docs/reference/api/pandas.Timedelta.html for more info.

  • date_col_name (list[str] defaults to None) – name of column containing datetime objects to be processed

Returns:

  • tuple[pd.Timestamp] – detailing start and end time/date

  • N.B. currently discarding any timezone information
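
The buffering described above can be sketched as follows. This is a minimal illustration and not the package's implementation: the function name first_last_dates and the column-detection logic are assumptions, and only documented pandas calls (select_dtypes, tz_localize, Timedelta) are used.

```python
import pandas as pd


def first_last_dates(df, time_buffer=(0, "h"), date_col_name=None):
    """Hypothetical stand-in for calculate_first_last_dates_from_df."""
    if date_col_name is None:
        # Assumption: fall back to the first datetime-typed column
        date_cols = df.select_dtypes(include=["datetime", "datetimetz"]).columns
        date_col_name = date_cols[0]
    dates = df[date_col_name]
    if dates.dt.tz is not None:
        # Discard timezone information, as the N.B. above notes
        dates = dates.dt.tz_localize(None)
    # (value, unit) pair -> pd.Timedelta, e.g. (6, "h") is six hours
    buffer = pd.Timedelta(time_buffer[0], unit=time_buffer[1])
    return dates.min() - buffer, dates.max() + buffer
```

With a buffer of (6, "h"), the returned start is six hours before the earliest timestamp and the returned end is six hours after the latest one.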

h3.dataloading.general_df_utils.concat_df_cols(df: pd.DataFrame, concatted_col_name: str, cols_to_concat: list[str], delimiter: str = '') pd.DataFrame[source]#

Concatenate columns in a pd.DataFrame into a new column of strings joined by ‘delimiter’.

Parameters:
  • df (pd.DataFrame) – df containing columns to concatenate

  • concatted_col_name (str) – name of new concatenated column

  • cols_to_concat (list[str]) – names of columns to concatenate (in desired order)

  • delimiter (str, optional) – character to insert in between column values. Defaults to empty string

Returns:

df with an additional concatenated column

Return type:

pd.DataFrame
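
One way such a concatenation can be written in pandas is sketched below. The name concat_cols is hypothetical, and whether the package copies the df or modifies it in place is not stated in this page; the sketch copies.

```python
import pandas as pd


def concat_cols(df, concatted_col_name, cols_to_concat, delimiter=""):
    """Hypothetical stand-in for concat_df_cols."""
    out = df.copy()  # assumption: leave the caller's df untouched
    # Cast every value to str, then join each row's values with the delimiter
    out[concatted_col_name] = (
        out[cols_to_concat].astype(str).agg(delimiter.join, axis=1)
    )
    return out


events = pd.DataFrame({"name": ["LAURA", "MICHAEL"], "year": [2020, 2018]})
tagged = concat_cols(events, "tag", ["name", "year"], delimiter="_")
```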

h3.dataloading.general_df_utils.exclude_df_rows_by_range(df: pd.DataFrame, col_names: list[str], value_bounds: list[tuple[float]] | list[float], buffer: list[float] | list[tuple[float, str]] = 0) pd.DataFrame[source]#

Return a pd.DataFrame composed only of rows whose values in the columns listed in col_names fall within the range of value_bounds +/- an optional buffer amount. Handy for restricting large dataframes based on date ranges (bounds must be specified as pd.Timestamp objects) or lat/lon ranges.

Parameters:
  • df (pd.DataFrame) – df to limit

  • col_names (list[str]) – e.g. [‘col1’, …, ‘colN’] list of column names to be restricted by their relevant…

  • value_bounds (list[ [tuple[float], list[float]] ]) – e.g. [ (start_val1,end_val1), …, (start_valN,end_valN) ] list of tuples (or lists) specifying minimum and maximum values to allow

  • buffer ([ list[float], list[tuple[float,str]] ] = 0) – add buffer on either side of value_bounds. Defaults to no buffer. Useful for requiring that weather station observations exist for some time before and after the event of interest

Return type:

restricted pd.DataFrame object (sub-set of original df)
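
The range restriction can be sketched as a boolean mask built column by column. This is an illustration under stated assumptions, not the package's code: the name keep_rows_in_range is hypothetical, bounds are treated as inclusive, and a (value, unit) buffer is converted with pd.Timedelta.

```python
import pandas as pd


def keep_rows_in_range(df, col_names, value_bounds, buffer=None):
    """Hypothetical stand-in for exclude_df_rows_by_range."""
    if buffer is None:
        buffer = [0] * len(col_names)  # no buffer on any column
    mask = pd.Series(True, index=df.index)
    for col, (low, high), buf in zip(col_names, value_bounds, buffer):
        if isinstance(buf, (tuple, list)):
            # A (value, unit) pair is read as a time buffer
            buf = pd.Timedelta(buf[0], unit=buf[1])
        if buf:  # a zero buffer leaves the bounds unchanged
            low, high = low - buf, high + buf
        # Series.between is inclusive on both ends by default
        mask &= df[col].between(low, high)
    return df[mask]
```

For a date column, the bounds would be pd.Timestamp objects and the buffer something like (24, "h"); for lat/lon columns, plain floats are enough.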

h3.dataloading.general_df_utils.exclude_df_rows_symmetrically_around_value(df: pd.DataFrame, col_names: list[str], poi: list[float] | list[pd.Timestamp], buffer_val: list[float] | list[tuple[float, str]]) pd.DataFrame[source]#

Return a pd.DataFrame which excludes rows outside a range of +/- one buffer around a point of interest. The buffer can be a float or a period of time. Handy e.g. for excluding stations for which there is no weather data within the period of interest.

Parameters:
  • df (pd.DataFrame) – pd.DataFrame containing values to potentially exclude

  • col_names (list[str]) – list of strings specifying the names of the columns of interest

  • poi ([ list[float], list[pd.Timestamp] ]) – points of interest (value about which any exclusion will be centred). One value for each relevant column.

  • buffer_val ([ list[float], list[tuple[float,str]] ]) – distance from poi beyond which rows are excluded. In the case that poi is a Timestamp object, a string specifying the unit of time is necessary, e.g. ‘h’ for hours (either as a tuple or list). See https://pandas.pydata.org/docs/reference/api/pandas.Timedelta.html for more info. One value for each relevant column.

Return type:

pd.DataFrame excluding values outside of provided ranges
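
A symmetric window around a point of interest can be sketched as an absolute-difference test. The name keep_rows_around is hypothetical, and treating the window edges as inclusive is an assumption.

```python
import pandas as pd


def keep_rows_around(df, col_names, poi, buffer_val):
    """Hypothetical stand-in for exclude_df_rows_symmetrically_around_value."""
    mask = pd.Series(True, index=df.index)
    for col, centre, buf in zip(col_names, poi, buffer_val):
        if isinstance(buf, (tuple, list)):
            # A (value, unit) pair becomes a pd.Timedelta for Timestamp columns
            buf = pd.Timedelta(buf[0], unit=buf[1])
        # Keep rows no further than one buffer from the point of interest
        mask &= (df[col] - centre).abs() <= buf
    return df[mask]
```

The same loop serves float columns (buffer in the column's own units) and datetime columns (buffer as a time period), since subtracting a pd.Timestamp from a datetime column yields timedeltas.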

h3.dataloading.general_df_utils.find_index_closest_point_in_col(poi: shapely.Point, points_df: pandas.DataFrame, points_df_geom_col: str, which_closest: int = 0) int[source]#

Find the df index of the closest point object to poi in the df object.

Parameters:
  • poi (shapely.Point) – point of interest (shapely.Point object)

  • points_df (pd.DataFrame) – dataframe containing a column of shapely.Point objects

  • points_df_geom_col (str) – name of column of shapely.Point objects

  • which_closest (int = 0) – which neighbour to return. By default, find the closest; for any other N, find the Nth closest

Return type:

int object relating to index of points_df df

h3.dataloading.general_df_utils.generate_lat_lon_from_points_cols(df: pd.DataFrame, points_cols: list[str]) None[source]#

Generate lat and lon column(s) from column(s) of shapely.Point objects. The new column(s) are added to the df being processed.

Parameters:
  • df (pd.DataFrame) – pd.DataFrame containing column(s) of shapely.Point objects

  • points_cols (list[str]) – names of columns to convert to lat/lon. Chosen not to find the columns as default (by dtype) since might only want to convert one.

Returns:

df containing the new lat and lon columns

Return type:

pd.DataFrame

h3.dataloading.general_df_utils.limit_df_spatial_range(df: pd.DataFrame, centre_coords: list[float] | tuple[float], min_number: int = None, distance_buffer: float = None, verbose: bool = False) pd.DataFrame[source]#

Restrict df to rows within +/- a lat-lon distance of a centre point, or to a minimum of min_number rows.

Parameters:
  • df (pd.DataFrame) – df containing ‘lat’ and ‘lon’ columns

  • centre_coords (list[float] or tuple[float]) – geographical centre about which to restrict df

  • min_number (int = None) – minimum number of rows in df to be returned

  • distance_buffer (float = None) – distance from geographical centre within which points in df should be returned

  • verbose (bool = False) – whether to show a message when distance_buffer is expanded. Defaults to False (no message)

Returns:

spatially limited df

Return type:

pd.DataFrame

h3.dataloading.general_df_utils.points_from_df_lat_lon_cols(df: pandas.DataFrame, point_col_name: str = 'geometry') pandas.DataFrame[source]#

TODO: docstring

h3.dataloading.general_df_utils.standardise_df(df: pd.DataFrame, date_cols: list[str] = None, new_point_col_name: str = 'geometry') pd.DataFrame[source]#

Apply various formatting functions to make any df behave as you’d expect.

Parameters:
  • df (pd.DataFrame) – any pandas df

  • date_cols (list[str], optional) – list of column names containing date values. Default is None.

Returns:

reformatted pd.DataFrame object

Return type:

pd.DataFrame

h3.dataloading.general_df_utils.station_availability(df_stations: pd.DataFrame, df_noaa_weather_event: pd.DataFrame, time_buffer: list[float, str] = [0, 'h'], available: bool = True) pd.DataFrame[source]#

Filter the dataframe by time to return only stations with observations present. Defaults to returning available stations (available=True).

h3.dataloading.noaa_six_hourly_processing module#

h3.dataloading.noaa_six_hourly_processing.convert_lat_lon(coord: str) str[source]#

Convert lat/long strings with a hemisphere suffix (of type 00N/S) to signed +/- values
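
A minimal sketch of such a conversion is shown below. The name convert_lat_lon_sketch is hypothetical, and handling E/W suffixes alongside N/S is an assumption; the package's exact output formatting is not documented here.

```python
def convert_lat_lon_sketch(coord: str) -> str:
    """Hypothetical stand-in for convert_lat_lon."""
    coord = coord.strip()
    value, hemisphere = coord[:-1].strip(), coord[-1].upper()
    if hemisphere in ("S", "W"):
        # Southern and western hemispheres become negative
        return f"-{value}"
    if hemisphere in ("N", "E"):
        return value
    raise ValueError(f"unrecognised hemisphere suffix in {coord!r}")


print(convert_lat_lon_sketch("28.0N"))  # 28.0
print(convert_lat_lon_sketch("94.8W"))  # -94.8
```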

h3.dataloading.noaa_six_hourly_processing.preprocess_noaa_textfile(data: list) list[source]#

Preprocess the data before reading it into a pandas df: assign an event to each row, delete headers, and reformat lat/long. The data must have been read in from the standard new NOAA .txt file format.

h3.dataloading.noaa_six_hourly_processing.reformat_noaa_df(df: pandas.DataFrame) pandas.DataFrame[source]#

Tidy up data types in pd.DataFrame

h3.dataloading.noaa_six_hourly_processing.return_most_recent_events_by_name(df: pd.DataFrame, event_names: list[str]) pd.DataFrame[source]#

Return the df containing the data for the most recent occurrence of each event included in event_names. df must have a ‘date’ column to judge which occurrence is most recent.

Returns:

  • restricted pd.DataFrame

  • TODO (make this more flexible for selecting events)

h3.dataloading.noaa_six_hourly_processing.windspeed_to_strength_category(val: float | int) bool | int[source]#

Assign an intensity value based on maximum sustained wind speed

Parameters:

val (float | int) – numerical value to be compared

Returns:

storm categorisation

Return type:

int
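
The mapping used by the package is not documented on this page. As an illustration only, the sketch below assumes the Saffir-Simpson scale with wind speeds in knots (category 1 starts at 64 kt, category 5 at 137 kt) and returns 0 below hurricane strength; the name category_from_windspeed and the 0 return value are assumptions.

```python
from bisect import bisect_right

# Assumed lower bounds (knots) of Saffir-Simpson categories 1 to 5
SAFFIR_SIMPSON_KT = [64, 83, 96, 113, 137]


def category_from_windspeed(val):
    """Hypothetical stand-in for windspeed_to_strength_category."""
    # Count how many category thresholds the wind speed has reached
    return bisect_right(SAFFIR_SIMPSON_KT, val)


print(category_from_windspeed(50))   # 0 (below hurricane strength)
print(category_from_windspeed(100))  # 3
```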

h3.dataloading.strom_surge module#

h3.dataloading.strom_surge.check_storm(clean_after_unpack: bool = False, reload: bool = False) None[source]#

Helper function to check if files are present in storm_dir. It is quite a naive function: it does not check which files are present, only whether any files are present. If no files are present, it will download and unpack them.

Parameters:
  • clean_after_unpack (bool, optional) – If True, will delete the downloaded zip files after unpacking them. The default is False.

  • reload (bool, optional) – If True, force the re-download and unpack.

h3.dataloading.strom_surge.get_location_box(location: Literal['us', 'haiti']) None | tuple[source]#

Get the location box.

Parameters:

location (str, {"us", "haiti"}) – location to get the box coordinate in lat lon

Returns:

a tuple of the coordinates of the location, as follows: LON_MIN, LON_MAX, LAT_MIN, LAT_MAX

Return type:

tuple, optional

h3.dataloading.strom_surge.is_in_area(point: Sequence[float], area_box: Container[float]) bool[source]#

Check if point is in an area bounding box.

Parameters:
  • point (tuple_like) – A tuple_like of the lat, lon of the point to check.

  • area_box (tuple_like) – A tuple_like of the bounding box of the area to check. It needs to be as follows: LON_MIN, LON_MAX, LAT_MIN, LAT_MAX. Usually this bounding box is given by the get_location_box function.

Returns:

True if the point is in the bounding box

Return type:

bool
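
Note that the point is given as lat, lon while the box is ordered LON_MIN, LON_MAX, LAT_MIN, LAT_MAX. A minimal sketch of the check follows; the name in_box, the inclusive edges, and the example box values are assumptions for illustration, not the values returned by get_location_box.

```python
def in_box(point, area_box):
    """Hypothetical stand-in for is_in_area."""
    lat, lon = point  # point order: lat, lon
    lon_min, lon_max, lat_min, lat_max = area_box  # box order: lon first
    return lon_min <= lon <= lon_max and lat_min <= lat <= lat_max


# Illustrative box only (not taken from the package)
example_box = (-75.0, -71.0, 17.0, 21.0)
print(in_box((18.0, -72.0), example_box))  # True
print(in_box((-72.0, 18.0), example_box))  # False: lat and lon swapped
```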

h3.dataloading.strom_surge.latlon2_storm_surge(coords: Iterable[Iterable[float, float]], category: int) list[int][source]#
h3.dataloading.strom_surge.main()[source]#
h3.dataloading.strom_surge.point_locations(lat: float, lon: float, locations: Iterable[str]) None | str[source]#

Get the location name of a point given its latitude and longitude

Parameters:
  • lat (float) – latitude of the point to check

  • lon (float) – longitude of the point to check

  • locations (Iterable of str) – Iterable-like of the name of the different locations to check.

Returns:

Name of the location area containing the point.

Return type:

str, optional

Examples

>>> lat, lon = 18, -72  # this point is in Haiti
>>> point_locations(lat, lon, ["us", "haiti"])
haiti

h3.dataloading.terrain_ef module#

h3.dataloading.terrain_ef.calculate_esa(building_groups: pandas.DataFrameGroupBy, coast_points: numpy.ndarray, dem_urls: list, dis_threshold: int = 2) pandas.DataFrame[source]#
h3.dataloading.terrain_ef.check_coastlines_file()[source]#
h3.dataloading.terrain_ef.check_dem_files()[source]#
h3.dataloading.terrain_ef.get_building_group() pandas.core.groupby.generic.DataFrameGroupBy[source]#

Load the data from the building damage pickle file and group it into groups of 1 degree lat/lon

Return type:

DataFrameGroupBy

See also

pd.DataFrame.groupby

h3.dataloading.terrain_ef.get_buildings_bounding_box(buildings_df: pd.DataFrame) tuple[int, int, int, int][source]#

Helper function to get the bounding box of a buildings DataFrame. It will get the min and max of the lat lon present in the buildings_df.

Parameters:

buildings_df (pd.DataFrame) – A pd.DataFrame of the buildings; it needs to have columns labeled lat and lon.

Returns:

tuple of int of the coordinates of the bounding box. The values are in degrees and as follows: east, north, west, south

Return type:

tuple

h3.dataloading.terrain_ef.get_coastlines() list[tuple[float, float]][source]#

Load the coastline from the .shp data. Converts the data from the file to a list of (lat, lon)

Returns:

List of tuple of the coordinates of the coastline

Return type:

list of tuple of float

h3.dataloading.terrain_ef.get_coastpoints_range(bounding_box: tuple, coast_points: numpy.ndarray, dis_threshold: int = 2) numpy.ndarray[source]#

Function to select a subset of the coastlines points which are in range of buildings.

Parameters:
  • bounding_box (tuple) – tuple of int of the coordinates of the bounding box. The values are in degrees and as follows: east, north, west, south. See the function get_buildings_bounding_box().

  • coast_points (ndarray) – an array of the coast points. Coordinates are in lat and lon.

  • dis_threshold (int, optional) – padding distance in degrees to the bounding box. The default is 2.

Returns:

an array of the coast points within dis_threshold of the buildings bounding box. Coordinates are given in lat, lon.

Return type:

ndarray
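
The selection can be sketched as padding the bounding box by dis_threshold degrees and masking the coast points. The name coast_points_near is hypothetical, the inclusive edges are an assumption, and the sketch does not handle boxes that cross the antimeridian.

```python
import numpy as np


def coast_points_near(bounding_box, coast_points, dis_threshold=2):
    """Hypothetical stand-in for get_coastpoints_range."""
    east, north, west, south = bounding_box  # documented order
    lat, lon = coast_points[:, 0], coast_points[:, 1]  # columns: lat, lon
    # Pad the box by dis_threshold degrees on every side
    mask = (
        (lat >= south - dis_threshold)
        & (lat <= north + dis_threshold)
        & (lon >= west - dis_threshold)
        & (lon <= east + dis_threshold)
    )
    return coast_points[mask]
```

Pre-filtering the coastline this way keeps the later nearest-neighbour search limited to points that could plausibly be closest to a building in the group.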

h3.dataloading.terrain_ef.get_dem_urls(building_groups: pandas.core.groupby.generic.DataFrameGroupBy) list[source]#

Generate the list of DEM to download from Land Processes Distributed Active Archive Center (LP DAAC): https://e4ftl01.cr.usgs.gov/ASTT/ASTGTM.003/2000.03.01/.

Parameters:

building_groups (DataFrameGroupBy) – DataFrameGroupBy object of the buildings

Returns:

dem_urls – list of all the urls to download.

Return type:

list

Notes

The DEM files can be manually downloaded from https://e4ftl01.cr.usgs.gov/ASTT/ASTGTM.003/2000.03.01/.

To automatically download them an account will be needed. To create the account please go to https://urs.earthdata.nasa.gov/users/new/

h3.dataloading.terrain_ef.get_distance_coast(buildings: np.ndarray, coast_points: np.ndarray) tuple[np.ndarray, np.ndarray][source]#

Parameters:
  • buildings (ndarray) – array of the buildings coordinates in lat, lon.

  • coast_points (ndarray) – array of the coastline point coordinates in lat, lon.

Returns:

  • nearest_coast_point (ndarray) – array of the coordinates of the coast point closest to the corresponding building point.

  • dist (ndarray) – array of the distance between the nearest coast point and the corresponding building point, in metres.

See also

sklearn.metrics.pairwise.haversine_distances

Compute the Haversine distance between two samples.

sklearn.neighbors.BallTree

BallTree for nearest neighbours lookup, generalised for N-points.

Notes

The haversine function assumes the coordinates are latitude and longitude in radians. Use the following equation for the Haversine distance:

\[D(x, y) = 2 \arcsin{\left(\sqrt{\sin^{2}{\left(\frac{x_1 - y_1}{2}\right)} + \cos{(x_1)}\cos{(y_1)}\sin^{2}{\left(\frac{x_2-y_2}{2}\right)}}\right)}\]
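
The formula above gives an angle in radians; multiplying by the Earth's radius gives a distance. The sketch below is a plain-Python illustration for a single pair of points, with x_1, y_1 as latitudes and x_2, y_2 as longitudes. The name haversine_m and the mean Earth radius of 6,371,000 m are assumptions; the package's own constant is not shown on this page.

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    # The formula expects radians, so convert first
    phi1, lam1, phi2, lam2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (
        math.sin((phi1 - phi2) / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin((lam1 - lam2) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))
```

A quarter of the equator, haversine_m(0, 0, 0, 90), comes out as pi / 2 times the radius, which is a convenient sanity check.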

h3.dataloading.terrain_ef.get_elevation(lon: list, lat: list, dem: rasterio.DatasetReader) numpy.ndarray[source]#

Return the elevation values for the given (lon, lat) coordinates from the provided DEM raster dataset.

Parameters:
  • lon (list) – A list of longitude values of the query points.

  • lat (list) – A list of latitude values of the query points.

  • dem (rasterio.DatasetReader) – A rasterio dataset reader object of the DEM raster dataset.

Returns:

elevation – A 1D numpy array of elevation values for the query points.

Return type:

np.ndarray

h3.dataloading.terrain_ef.get_terrain_ef(esa_df)[source]#
h3.dataloading.terrain_ef.lonlat2xy(lon: list, lat: list, transform: affine.Affine) tuple[source]#
h3.dataloading.terrain_ef.main()[source]#

h3.dataloading.weather module#

h3.dataloading.xbd module#