h3.dataloading package#

Submodules#

h3.dataloading.HurricaneDataset module#

class h3.dataloading.HurricaneDataset.HurricaneDataset(*args: Any, **kwargs: Any)[source]#: Bases: Dataset

h3.dataloading.cmorph module#

h3.dataloading.cmorph.cmorph_filename_range(start_date: date, stop_date: date) → list[source]#

h3.dataloading.cmorph.load_cmorph(hurricane_date: date, day_range: int)[source]#

h3.dataloading.cmorph.main()[source]#

h3.dataloading.flood_risk module#

h3.dataloading.flood_risk.main()[source]#

h3.dataloading.general_df_utils module#

h3.dataloading.general_df_utils.calc_distance_between_df_cols(df: pd.DataFrame, cols_compare: list[tuple[str]] | list[list[str]], new_col_name: str = 'distance') → pd.DataFrame[source]#

Calculate the geodesic distance between sets of lat/lon values. See https://geopy.readthedocs.io/en/stable/#module-geopy.distance for more info.

Parameters:

df (pd.DataFrame) – df containing two pairs of lat/lon values
cols_compare (list[tuple[str]] or list[list[str]]) – list of lat/lon column names, given as pairs (each pair a tuple or list)
new_col_name (str, optional) – name of the new distance column. The default is ‘distance’.

Returns:

copy of df with an extra distance column, named by new_col_name (‘distance’ by default)

Return type:

pd.DataFrame
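
As a rough illustration of the computation, the sketch below derives one distance per row from two lat/lon pairs. It is not the package's implementation: it uses only the standard library, swaps geopy's geodesic distance for the simpler haversine (great-circle) formula on a spherical Earth, and represents rows as plain dicts rather than a pd.DataFrame. The helper names haversine_km and add_distance, and the column names, are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def add_distance(rows, cols_compare, new_col_name="distance"):
    """Return copies of rows with an extra distance column.

    cols_compare holds two (lat_col, lon_col) pairs, mirroring the
    documented argument. This is a hypothetical stand-in, not the real function.
    """
    (lat_a, lon_a), (lat_b, lon_b) = cols_compare
    out = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are left untouched
        new_row[new_col_name] = haversine_km(row[lat_a], row[lon_a], row[lat_b], row[lon_b])
        out.append(new_row)
    return out


rows = [{"lat": 0.0, "lon": 0.0, "station_lat": 0.0, "station_lon": 1.0}]
result = add_distance(rows, [("lat", "lon"), ("station_lat", "station_lon")])
```

A geodesic distance on an ellipsoid, as geopy computes, will differ slightly from this spherical estimate.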

h3.dataloading.general_df_utils.calc_means_df_cols(df: pd.DataFrame, col_names: list[str]) → pd.DataFrame[source]#

Return the mean value of each specified column in df.

Parameters:

df (pd.DataFrame) – df containing the columns to average
col_names (list[str]) – list of columns for which to calculate the mean

Return type:

list[float] of mean value of each column

h3.dataloading.general_df_utils.calculate_first_last_dates_from_df(df: pd.DataFrame, time_buffer: tuple[float, str] = [0, 'h'], date_col_name: list[str] = None) → tuple[pd.Timestamp][source]#

Calculate the first and last dates from a df, with a time buffer before and after.

Parameters:

df (pd.DataFrame) – should contain at least one datetime column. If there are multiple datetime columns, the column to be used should be specified; otherwise the first such column is used.
time_buffer (tuple[float,str] defaults to [0,'h'] (no buffer)) – extra time to subtract from the first occurrence and add to the last occurrence. A string specifying the unit of time is necessary e.g. ‘h’ for hours (either as a tuple or list). See https://pandas.pydata.org/docs/reference/api/pandas.Timedelta.html for more info.
date_col_name (list[str] defaults to None) – name of column containing datetime objects to be processed

Returns:

tuple[pd.Timestamp] – detailing start and end time/date
N.B. currently discarding any timezone information
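
The buffering logic described above can be sketched with the standard library. This is only an illustration under stated assumptions: datetime and timedelta stand in for pd.Timestamp and pd.Timedelta, the function name first_last_dates is hypothetical, and only a handful of unit strings are mapped.

```python
from datetime import datetime, timedelta

# A few unit strings mapped onto timedelta keyword arguments (illustrative subset).
UNIT_TO_KWARG = {"D": "days", "h": "hours", "min": "minutes", "s": "seconds"}


def first_last_dates(dates, time_buffer=(0, "h")):
    """Return (earliest - buffer, latest + buffer) for a list of datetimes."""
    amount, unit = time_buffer
    buffer = timedelta(**{UNIT_TO_KWARG[unit]: amount})
    # Drop timezone information, as the documented function is said to do.
    naive = [d.replace(tzinfo=None) for d in dates]
    return min(naive) - buffer, max(naive) + buffer


dates = [
    datetime(2017, 9, 10, 6),
    datetime(2017, 9, 8, 18),
    datetime(2017, 9, 11, 0),
]
start, end = first_last_dates(dates, time_buffer=(12, "h"))
```

With a 12 hour buffer, the window opens 12 hours before the earliest entry and closes 12 hours after the latest one.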

h3.dataloading.general_df_utils.concat_df_cols(df: pd.DataFrame, concatted_col_name: str, cols_to_concat: list[str], delimiter: str = '') → pd.DataFrame[source]#

Concatenate columns in a pd.DataFrame into a new column of strings joined by delimiter.

Parameters:

df (pd.DataFrame) – df containing columns to concatenate
concatted_col_name (str) – name of new concatenated column
cols_to_concat (list[str]) – names of columns to concatenate (in desired order)
delimiter (str, optional) – character to insert in between column values. Defaults to empty string

Returns:

df with the additional concatenated column

Return type:

pd.DataFrame
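
A minimal sketch of the concatenation, assuming rows are plain dicts rather than a pd.DataFrame and that values are converted with str() before joining (the real function may format values differently). The name concat_cols and the example columns are hypothetical.

```python
def concat_cols(rows, concatted_col_name, cols_to_concat, delimiter=""):
    """Return copies of rows with a new column joining the listed columns in order."""
    out = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are left untouched
        new_row[concatted_col_name] = delimiter.join(str(row[c]) for c in cols_to_concat)
        out.append(new_row)
    return out


rows = [{"name": "IRMA", "year": 2017}]
result = concat_cols(rows, "event_id", ["name", "year"], delimiter="_")
```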

h3.dataloading.general_df_utils.exclude_df_rows_by_range(df: pd.DataFrame, col_names: list[str], value_bounds: list[tuple[float]] | list[float], buffer: list[float] | list[tuple[float, str]] = 0) → pd.DataFrame[source]#

Return a pd.DataFrame containing only the rows whose values in the columns listed in col_names fall within the range of value_bounds +/- an optional buffer amount. Handy for restricting large dataframes based on date ranges (bounds must be specified as pd.Timestamp objects) or lat/lon ranges.

Parameters:

df (pd.DataFrame) – df to limit
col_names (list[str]) – e.g. [‘col1’, …, ‘colN’] list of column names to be restricted by their relevant…
value_bounds (list[ [tuple[float], list[float]] ]) – e.g. [ (start_val1,end_val1), …, (start_valN,end_valN) ] list of tuples (or lists) specifying minimum and maximum values to allow
buffer ([ list[float], list[tuple[float,str]] ] = 0) – add buffer on either side of value_bounds. Defaults to no buffer. Useful for specifying weather station observations must exist some time before and after the event of interest

Return type:

restricted pd.DataFrame object (sub-set of original df)
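
The filtering rule can be sketched as follows. This is an illustration, not the package's code: rows are plain dicts, datetime and timedelta stand in for pd.Timestamp and pd.Timedelta, bounds are treated as inclusive (an assumption), and the name keep_rows_in_range is hypothetical.

```python
from datetime import datetime, timedelta


def keep_rows_in_range(rows, col_names, value_bounds, buffers=None):
    """Keep rows whose value in every listed column lies within bounds +/- buffer.

    buffers needs one entry per column; use timedelta entries for date columns.
    The default of no buffer only works for numeric columns in this sketch.
    """
    if buffers is None:
        buffers = [0] * len(col_names)
    kept = []
    for row in rows:
        inside = all(
            low - buf <= row[col] <= high + buf
            for col, (low, high), buf in zip(col_names, value_bounds, buffers)
        )
        if inside:
            kept.append(row)
    return kept


rows = [
    {"lat": 25.0, "date": datetime(2017, 9, 10, 0)},
    {"lat": 27.2, "date": datetime(2017, 9, 10, 12)},
    {"lat": 26.0, "date": datetime(2017, 9, 12, 0)},
]
kept = keep_rows_in_range(
    rows,
    ["lat", "date"],
    [(24.0, 27.0), (datetime(2017, 9, 9), datetime(2017, 9, 11))],
    buffers=[0.5, timedelta(hours=6)],
)
```

Here the second row survives only because of the 0.5 degree latitude buffer, while the third row falls outside the buffered date window.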

h3.dataloading.general_df_utils.exclude_df_rows_symmetrically_around_value(df: pd.DataFrame, col_names: list[str], poi: list[float] | list[pd.Timestamp], buffer_val: list[float] | list[tuple[float, str]]) → pd.DataFrame[source]#

Return a pd.DataFrame which excludes rows outside a range of +/- buffer_val around each point of interest. The buffer can be given as floats or as a period of time. Handy e.g. for excluding stations for which there is no weather data within the period of interest.

Parameters:

df (pd.DataFrame) – pd.DataFrame containing values to potentially exclude
col_names (list[str]) – list of strings specifying the names of the columns of interest
poi ([ list[float], list[pd.Timestamp] ]) – points of interest (value about which any exclusion will be centred). One value for each relevant column.
buffer_val ([ list[float], list[tuple[float,str]] ]) – distance from poi beyond which rows are excluded. In the case that poi is a Timestamp object, a string specifying the unit of time is necessary e.g. ‘h’ for hours (either as a tuple or list). See https://pandas.pydata.org/docs/reference/api/pandas.Timedelta.html for more info. One value for each relevant column.

Return type:

pd.DataFrame excluding values outside of provided ranges

h3.dataloading.general_df_utils.find_index_closest_point_in_col(poi: shapely.Point, points_df: pandas.DataFrame, points_df_geom_col: str, which_closest: int = 0) → int[source]#

Find the df index of the closest point object to poi in the df object.

Parameters:

poi (shapely.Point) – point of interest (shapely.Point object)
points_df (pd.DataFrame) – dataframe containing a column of shapely.Point objects
points_df_geom_col (str) – name of column of shapely.Point objects
which_closest (int = 1) – if 1 (default), find closest. For any other N, find the Nth closest

Return type:

int object relating to index of points_df df
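
A small sketch of picking the Nth closest point, under several assumptions: points are (x, y) tuples instead of shapely.Point objects, distance is planar Euclidean, the returned value is a position in the list rather than a DataFrame index label, and ranks are zero-based so that 0 means the closest. The name index_of_nth_closest is hypothetical.

```python
import math


def index_of_nth_closest(poi, points, which_closest=0):
    """Return the position of the point with the requested distance rank from poi."""
    # Sort positions by planar distance to poi, then pick the requested rank.
    order = sorted(range(len(points)), key=lambda i: math.dist(poi, points[i]))
    return order[which_closest]


points = [(-80.0, 25.0), (-72.0, 18.0), (-90.0, 30.0)]
poi = (-73.0, 19.0)
closest = index_of_nth_closest(poi, points)
second = index_of_nth_closest(poi, points, which_closest=1)
```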

h3.dataloading.general_df_utils.generate_lat_lon_from_points_cols(df: pd.DataFrame, points_cols: list[str]) → None[source]#

Generate lat and lon column(s) from column(s) of shapely.Point objects. The new column(s) are added to the df being processed.

Parameters:

df (pd.DataFrame) – pd.DataFrame containing column(s) of shapely.Point objects
points_cols (list[str]) – names of columns to convert to lat/lon. The columns are not detected automatically (by dtype), since you might only want to convert one.

Returns:

containing new lat and lon columns

Return type:

pd.DataFrame

h3.dataloading.general_df_utils.limit_df_spatial_range(df: pd.DataFrame, centre_coords: list[float] | tuple[float], min_number: int = None, distance_buffer: float = None, verbose: bool = False) → pd.DataFrame[source]#

Restrict df to within +/- a lat-lon distance of centre_coords, or to a minimum of min_number rows.

Parameters:

df (pd.DataFrame) – df containing ‘lat’ and ‘lon’ columns
centre_coords (list[float] or tuple[float]) – geographical centre about which to restrict df
min_number (int = None) – minimum number of rows in df to be returned
distance_buffer (float = None) – distance from geographical centre within which points in df should be returned
verbose (bool = False (don't show message re expansion of distance_buffer)) – choose whether or not to show that distance_buffer was expanded

Returns:

spatially limited df

Return type:

pd.DataFrame

h3.dataloading.general_df_utils.points_from_df_lat_lon_cols(df: pandas.DataFrame, point_col_name: str = 'geometry') → pandas.DataFrame[source]#: TODO: docstring

h3.dataloading.general_df_utils.standardise_df(df: pd.DataFrame, date_cols: list[str] = None, new_point_col_name: str = 'geometry') → pd.DataFrame[source]#

Apply various formatting functions to make any df behave as you’d expect.

Parameters:

df (pd.DataFrame) – any pandas df
date_cols (list[str], optional) – list of column names containing date values. Default is None.

Returns:

reformatted pd.DataFrame object

Return type:

pd.DataFrame

h3.dataloading.general_df_utils.station_availability(df_stations: pd.DataFrame, df_noaa_weather_event: pd.DataFrame, time_buffer: list[float, str] = [0, 'h'], available: bool = True) → pd.DataFrame[source]#: Filter a dataframe by time to return only stations with observations present. Defaults to available.

h3.dataloading.noaa_six_hourly_processing module#

h3.dataloading.noaa_six_hourly_processing.convert_lat_lon(coord: str) → str[source]#: Convert lat/long of type 00N/S to +/-
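
One way such a conversion could look is sketched below. It is an assumption-laden illustration rather than the package's code: the input is taken to be a number followed by a single hemisphere letter (e.g. "28.0N"), E/W is handled the same way as N/S, and the name convert_lat_lon_sketch is hypothetical.

```python
def convert_lat_lon_sketch(coord: str) -> str:
    """Turn a hemisphere-suffixed coordinate into a signed numeric string."""
    coord = coord.strip()
    value, hemisphere = coord[:-1], coord[-1].upper()
    if hemisphere in ("S", "W"):
        return "-" + value  # southern and western hemispheres are negative
    if hemisphere in ("N", "E"):
        return value
    raise ValueError(f"unrecognised hemisphere in {coord!r}")


north = convert_lat_lon_sketch("28.0N")
west = convert_lat_lon_sketch("94.8W")
```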

h3.dataloading.noaa_six_hourly_processing.preprocess_noaa_textfile(data: list) → list[source]#: Preprocess the data before reading it into a pandas df: assign an event to each row, delete headers, and reformat lat/long. The data must have been read in from the standard new NOAA .txt file format.

h3.dataloading.noaa_six_hourly_processing.reformat_noaa_df(df: pandas.DataFrame) → pandas.DataFrame[source]#: Tidy up data types in pd.DataFrame

h3.dataloading.noaa_six_hourly_processing.return_most_recent_events_by_name(df: pd.DataFrame, event_names: list[str]) → pd.DataFrame[source]#

Returns the df containing the data for the most recent occurrence of each event included in event_names. df must have a ‘date’ column to judge which is most recent.

Returns:

restricted pd.DataFrame
TODO (make this more flexible for selecting events)

h3.dataloading.noaa_six_hourly_processing.windspeed_to_strength_category(val: float | int) → bool | int[source]#

Assign an intensity value based on maximum sustained wind speed

Parameters:: val (float | int) – numerical value to be compared
Returns:: storm categorisation
Return type:: int
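
The documentation does not state which thresholds are used. As a sketch only, the example below assumes the Saffir-Simpson scale with wind speed in knots and returns 0 for anything below hurricane strength; both the unit and the 0 return value are assumptions, and the name saffir_simpson_category is hypothetical.

```python
from bisect import bisect_right

# Lower bounds, in knots, of Saffir-Simpson categories 1 to 5.
CATEGORY_LOWER_BOUNDS_KT = [64, 83, 96, 113, 137]


def saffir_simpson_category(wind_kt):
    """Return 0 below hurricane strength, otherwise the category from 1 to 5."""
    # The number of lower bounds at or below wind_kt is the category.
    return bisect_right(CATEGORY_LOWER_BOUNDS_KT, wind_kt)


categories = [saffir_simpson_category(v) for v in (50, 64, 100, 150)]
```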

h3.dataloading.strom_surge module#

h3.dataloading.strom_surge.check_storm(clean_after_unpack: bool = False, reload: bool = False) → None[source]#

Helper function to check if files are present in storm_dir. It is a quite naive function: it does not check which files are present, only whether any files are present. If no files are present, it will download and unpack them.

Parameters:

clean_after_unpack (bool, optional) – If True, will delete the zip downloaded files after unpacking them. The default is False.
reload (bool, optional) – If True, force the re-download and unpack.

h3.dataloading.strom_surge.get_location_box(location: Literal['us', 'haiti']) → None | tuple[source]#

Get the location box.

Parameters:: location (str, {"us", "haiti"}) – location to get the box coordinate in lat lon
Returns:: a tuple of the coordinates of the location, as follows: LON_MIN, LON_MAX, LAT_MIN, LAT_MAX
Return type:: tuple, optional

h3.dataloading.strom_surge.is_in_area(point: Sequence[float], area_box: Container[float]) → bool[source]#

Check if point is in an area bounding box.

Parameters:

point (tuple_like) – A tuple_like of the lat, lon of the point to check.
area_box (tuple_like) – A tuple_like of the bounding box of the area to check. It needs to be as follows: LON_MIN, LON_MAX, LAT_MIN, LAT_MAX. Usually this bounding box is given by the get_location_box function.

Returns:

True if the point is in the bounding box

Return type:

bool
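
Note the differing orders: the point is given as lat, lon while the box lists longitudes first. A minimal sketch of the check follows, assuming inclusive bounds; the name point_in_box is hypothetical and the box values are illustrative only, not those returned by get_location_box.

```python
def point_in_box(point, area_box):
    """Check a (lat, lon) point against a (LON_MIN, LON_MAX, LAT_MIN, LAT_MAX) box."""
    lat, lon = point
    lon_min, lon_max, lat_min, lat_max = area_box
    return lon_min <= lon <= lon_max and lat_min <= lat <= lat_max


# Illustrative box only; the real coordinates come from get_location_box.
example_box = (-75.0, -71.0, 17.5, 20.5)
inside = point_in_box((18.0, -72.0), example_box)
outside = point_in_box((25.0, -80.0), example_box)
```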

h3.dataloading.strom_surge.latlon2_storm_surge(coords: Iterable[Iterable[float, float]], category: int) → list[int][source]#

h3.dataloading.strom_surge.main()[source]#

h3.dataloading.strom_surge.point_locations(lat: float, lon: float, locations: Iterable[str]) → None | str[source]#

Get the location name of a point given its latitude and longitude

Parameters:

lat (float) – latitude of the point to check
lon (float) – longitude of the point to check
locations (Iterable of str) – Iterable-like of the name of the different locations to check.

Returns:

Name of the location area containing the point.

Return type:

str, optional

Examples

>>> lat, lon = 18, -72  # this point is in Haiti
>>> point_locations(lat, lon, ["us", "haiti"])
haiti

h3.dataloading.terrain_ef module#

h3.dataloading.terrain_ef.calculate_esa(building_groups: pandas.DataFrameGroupBy, coast_points: numpy.ndarray, dem_urls: list, dis_threshold: int = 2) → pandas.DataFrame[source]#

h3.dataloading.terrain_ef.check_coastlines_file()[source]#

h3.dataloading.terrain_ef.check_dem_files()[source]#

h3.dataloading.terrain_ef.get_building_group() → pandas.core.groupby.generic.DataFrameGroupBy[source]#

Load the data from the building damage pickle file and group it into groups of 1 degree of lat/lon.

Return type:: DataFrameGroupBy
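
How the 1 degree groups are keyed is not documented. One plausible reading, sketched here with plain dicts instead of a DataFrameGroupBy, is to key each record by the floor of its latitude and longitude. The floor-based cells, the column names and the name group_by_degree_cell are all assumptions.

```python
import math
from collections import defaultdict


def group_by_degree_cell(records):
    """Group records into 1 degree cells keyed by (floor(lat), floor(lon))."""
    groups = defaultdict(list)
    for rec in records:
        key = (math.floor(rec["lat"]), math.floor(rec["lon"]))
        groups[key].append(rec)
    return dict(groups)


records = [
    {"lat": 18.2, "lon": -72.4},
    {"lat": 18.9, "lon": -72.1},
    {"lat": 19.1, "lon": -72.4},
]
groups = group_by_degree_cell(records)
```

Using floor rather than int() keeps negative longitudes in the correct cell, since int() truncates towards zero.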

h3.dataloading.weather module#

h3.dataloading.xbd module#