Title: | Convert Addresses to Standard Inputs |
Version: | 0.4.5 |
Description: | Efficient tools for parsing and standardizing Australian addresses from textual data. It utilizes optimized algorithms to accurately identify and extract components of addresses, such as street names, types, and postcodes, especially for large batched data in contexts where sending addresses to internet services may be slow or inappropriate. The core functionality is built on fast string processing techniques to handle variations in address formats and abbreviations commonly found in Australian address data. Designed for data scientists, urban planners, and logistics analysts, the package facilitates the cleaning and normalization of address information, supporting better data integration and analysis in urban studies, geography, and related fields. |
License: | GPL-2 |
Encoding: | UTF-8 |
URL: | https://github.com/HughParsonage/healthyAddress |
BugReports: | https://github.com/HughParsonage/healthyAddress/issues |
RoxygenNote: | 7.2.0 |
Imports: | data.table, fastmatch, fst, hutils, hutilscpp, magrittr, qs, utils |
Suggests: | tinytest |
Depends: | R (≥ 3.5.0) |
NeedsCompilation: | yes |
Packaged: | 2025-01-09 04:23:55 UTC; hughp |
Author: | Hugh Parsonage [aut, cre] |
Maintainer: | Hugh Parsonage <hugh.parsonage@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2025-01-09 04:50:02 UTC |
Package for address standardization
Description
Efficient tools for parsing and standardizing Australian addresses from textual data. It utilizes optimized algorithms to accurately identify and extract components of addresses, such as street names, types, and postcodes, especially for large batched data in contexts where sending addresses to internet services may be slow or inappropriate. The core functionality is built on fast string processing techniques to handle variations in address formats and abbreviations commonly found in Australian address data. Designed for data scientists, urban planners, and logistics analysts, the package facilitates the cleaning and normalization of address information, supporting better data integration and analysis in urban studies, geography, and related fields.
Author(s)
Maintainer: Hugh Parsonage <hugh.parsonage@gmail.com>
See Also
Useful links:
Report bugs at https://github.com/HughParsonage/healthyAddress/issues
Extract the n-th digit of a duocentehexaquinquagesimal number
Description
Extract the n-th digit of a duocentehexaquinquagesimal (base-256) number
Usage
.digit256(x, d)
Arguments
x |
|
d |
|
Value
For b = 256, if
x = a_0 + a_1 b + a_2 b^2 + a_3 b^3
then .digit256(x, d) = a_d
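The relation in the Value section can be checked in base R with integer division and remainder. This is only a minimal sketch, assuming x is a non-negative whole number and d counts from zero as in the formula; digit256_sketch is a hypothetical helper, not the package's implementation.

```r
# Hypothetical helper, not part of healthyAddress: the d-th base-256 digit
# of a non-negative whole number x, with d counted from zero.
digit256_sketch <- function(x, d) {
  (x %/% 256^d) %% 256
}

# Build a number whose base-256 digits are a_0 = 1, a_1 = 2, a_2 = 3, a_3 = 4.
x <- 1 + 2 * 256 + 3 * 256^2 + 4 * 256^3
digit256_sketch(x, 0:3)  # 1 2 3 4
```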
Street types allowed.
Description
Street types allowed.
Usage
.permitted_street_type_ord()
Value
A character vector of the permitted street type codes, ordered by approximate frequency of occurrence: more common street types appear towards the head of the vector.
Hash a street name quickly and accurately
Description
Hash a street name quickly and accurately
Usage
HashStreetName(x)
unHashStreetName(x)
Arguments
x |
A character vector of uppercase street names (without the street type). |
Value
For HashStreetName, an integer vector the same length as x, a hash of the input; for unHashStreetName, the inverse operation.
If the original x does not contain a recognized street name, the result of unHashStreetName will be NA.
Examples
HashStreetName("FLINDERS")
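The documentation for standardize_address describes the street-name hash as a DJB2 hash. For readers unfamiliar with it, the following is a sketch of the classic DJB2 recurrence (start at 5381, then multiply by 33 and add each byte) in base R. The helper name djb2_sketch and the wrap to 32 bits are assumptions for illustration; the values returned by HashStreetName may be encoded differently, and this sketch has no inverse.

```r
# Hypothetical illustration of the classic DJB2 string hash; the package's
# own implementation (and its integer encoding) may differ.
djb2_sketch <- function(s) {
  h <- 5381
  for (b in as.integer(charToRaw(s))) {
    # h * 33 + b, wrapped to 32 bits. Doubles represent these values exactly.
    h <- (h * 33 + b) %% 2^32
  }
  h
}

djb2_sketch("FLINDERS")
```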
Compress latitude and longitude to a 32-bit integer
Description
Although lat and lon are usually represented by doubles, this is often wasteful. These functions allow you to represent each coordinate pair as a single integer, reducing the memory footprint from two 8-byte doubles to one 4-byte integer.
Usage
compress_latlon(lat, lon, nThread = getOption("healthyAddress.nThread", 1L))
decompress_latlon(x, nThread = getOption("healthyAddress.nThread", 1L))
compress_latlon_general(
lat,
lon,
nThread = getOption("healthyAddress.nThread", 1L)
)
decompress_latlon_general(x, nThread = getOption("healthyAddress.nThread", 1L))
Arguments
lat , lon |
Coordinates to compress. |
nThread |
Number of threads to use. |
x |
An integer vector formed by one of the compression functions. |
Value
The _general versions of the compression/decompression functions use the observed range of the latitude and longitude to form a 2^16 grid, while the bare versions use the known limits of Australian address coordinates (including the overseas territories). Since, in the latter, the grid will be much less fine, you should expect greater loss of information, possibly exceeding 100 metres.
compress_latlon
An integer vector.
decompress_latlon
The original lat,lon, with some information loss.
compress_latlon_general
An integer vector, with attributes minmaxLat and minmaxLon.
decompress_latlon_general
The original lat,lon, with some information loss.
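To illustrate the idea of snapping coordinates to a grid over an observed range and packing both cell indices into one integer, here is a base R sketch. Everything in it is an assumption for illustration: the helper names pack_latlon and unpack_latlon, the attribute names, and the exact bit layout are hypothetical and are not the package's implementation. The latitude axis uses one cell fewer than 2^16 so that the packed value never equals -2^31, which R reserves for NA_integer_.

```r
# Hypothetical helpers; not the healthyAddress implementation.
pack_latlon <- function(lat, lon, lat_range = range(lat), lon_range = range(lon)) {
  stopifnot(diff(lat_range) > 0, diff(lon_range) > 0)
  # Quantize each axis to a 16-bit cell index (0..65534 for lat, 0..65535 for lon).
  ilat <- round((lat - lat_range[1]) / diff(lat_range) * 65534)
  ilon <- round((lon - lon_range[1]) / diff(lon_range) * 65535)
  # Combine in double arithmetic, then shift into R's signed integer range.
  out <- as.integer(ilat * 65536 + ilon - 2147483647)
  attr(out, "lat_range") <- lat_range
  attr(out, "lon_range") <- lon_range
  out
}

unpack_latlon <- function(x) {
  lat_range <- attr(x, "lat_range")
  lon_range <- attr(x, "lon_range")
  u <- as.numeric(x) + 2147483647  # undo the shift (as.numeric drops attributes)
  list(lat = lat_range[1] + (u %/% 65536) / 65534 * diff(lat_range),
       lon = lon_range[1] + (u %% 65536) / 65535 * diff(lon_range))
}

lat <- c(-37.8136, -33.8688, -12.4634)
lon <- c(144.9631, 151.2093, 130.8456)
packed <- pack_latlon(lat, lon)
restored <- unpack_latlon(packed)
max(abs(restored$lat - lat))  # rounding error, at most half a grid cell
```

The wider the range the grid must cover, the coarser each cell becomes, which is why a fixed nationwide range loses more precision than a range fitted to the observed data.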
Download latitude longitude data by address
Description
Download latitude longitude data by address
Usage
download_latlon_data(
.ste = c("NSW", "VIC", "QLD", "SA", "WA", "TAS", "NT", "ACT", "OT"),
data_dir = getOption("healthyAddress.data_dir"),
repo = "https://github.com/HughParsonage/PSMA-202311",
overwrite = NA
)
Arguments
.ste |
The jurisdiction to download. Default is to download all. |
data_dir |
The directory for |
repo |
The repository from which data will be downloaded. Currently only the default is supported,
and |
overwrite |
|
Value
Called for its side effect (downloading the files), but returns the files downloaded.
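A possible workflow, sketched under the assumptions that the package is installed, that network access is available, and that a directory under tempdir() is acceptable for the data (the path below is hypothetical). The option name healthyAddress.data_dir is taken from the default in the Usage section; the download itself is guarded so the snippet is harmless in non-interactive sessions.

```r
# Hypothetical data location; a persistent directory is likely preferable in practice.
data_dir <- file.path(tempdir(), "healthyAddress-data")
dir.create(data_dir, showWarnings = FALSE)
options(healthyAddress.data_dir = data_dir)

# Only attempt the download when the package is available and a user is present.
if (requireNamespace("healthyAddress", quietly = TRUE) && interactive()) {
  files <- healthyAddress::download_latlon_data(.ste = "ACT")
  print(files)
}
```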
Extract the flat number, number first/last from an address
Description
Extract the flat number, number first/last from an address
Usage
extract_flatNumberFirstLast(address)
Arguments
address |
A character vector from which the numbers are to be extracted. |
Value
A data.table of three components: the flat number, the number first, and the number last.
Extract the postcode from the suffix of a string
Description
Extract the postcode from the suffix of a string
Usage
extract_postcode(x)
Arguments
x |
A character vector. |
Value
An integer vector the same length as x, giving the postcode as it appears in the last 3 or 4 characters in each string. Returns NA_integer_ for other strings.
There is no guarantee made that the postcode is a real postcode.
Examples
extract_postcode("3000")
extract_postcode("Melbourne Vic 3000")
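For comparison, the behaviour described above can be approximated in base R with a regular expression. This is a rough sketch, not the package's algorithm: extract_postcode_sketch is a hypothetical name, and the choice to return NA when the trailing run of digits is longer than four is an assumption.

```r
# Hypothetical base R approximation; not the healthyAddress implementation.
extract_postcode_sketch <- function(x) {
  # A trailing run of exactly 3 or 4 digits, preceded by a non-digit or the start.
  hit <- grepl("(^|[^0-9])[0-9]{3,4}$", x)
  out <- rep(NA_integer_, length(x))
  out[hit] <- as.integer(sub("^.*?([0-9]{3,4})$", "\\1", x[hit], perl = TRUE))
  out
}

extract_postcode_sketch(c("3000", "Melbourne Vic 3000", "Darwin NT 800", "no postcode"))
# 3000 3000 800 NA
```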
Find the street type within an address
Description
Find the street type within an address
Usage
match_StreetType(address)
Arguments
address |
A character vector, every string an address. |
Value
A list of two elements. The first element gives the indices, within .permitted_street_type_ord(), of the street type found in each address. The second element gives the corresponding string positions of the street type so identified.
Examples
cds <- .permitted_street_type_ord()
head(cds)
match_StreetType("712 FLINDERS STREET MELBOURNE 3004")
# 012345678901234
match_StreetType("712 FLINDERS ST MELBOURNE 3004")
Find word within a sentence
Description
Find word within a sentence
Usage
match_word(x, tbl)
Arguments
x |
A character vector of uppercase sentences. |
tbl |
A table of words. Long vectors are not permitted. |
Value
An integer vector the same length as x, where the i-th entry is the integer position of the first word in tbl detected in x[i]. Non-matches return NA. Words are strings of uppercase characters separated by spaces.
Add latitude and longitude columns to a standard address
Description
Add latitude and longitude columns to a standard address
Usage
mutate_latlon(DT, data_dir = getOption("healthyAddress.data_dir"))
Arguments
DT |
A |
data_dir |
The directory in which the latitude longitude data has been
downloaded. (See |
Value
DT, with the columns lat and lon added by reference, giving the latitude and longitude of the address in each row.
Uppercase character vectors
Description
Checks that all elements of a character vector are uppercase, i.e. that they contain no lowercase characters.
Usage
nany_lowercase(x, nThread = getOption("healthyAddress.nThread", 1L))
Arguments
x |
A character vector, of ASCII elements. |
nThread |
Number of threads to use. |
Value
nany_lowercase
FALSE if any character in x is a lowercase letter.
Examples
nany_lowercase("ABC")
nany_lowercase("ABC 123 /--")
nany_lowercase("ABC 123 /-- z")
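The name reads as "not any lowercase". A base R equivalent for ASCII input might look like the sketch below; nany_lowercase_sketch is a hypothetical name, and returning TRUE when no lowercase letter is present is an assumption inferred from the Value section.

```r
# Hypothetical base R equivalent for ASCII input.
nany_lowercase_sketch <- function(x) {
  !any(grepl("[a-z]", x, perl = TRUE))
}

nany_lowercase_sketch("ABC")            # TRUE
nany_lowercase_sketch("ABC 123 /--")    # TRUE
nany_lowercase_sketch("ABC 123 /-- z")  # FALSE
```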
In what states do postcodes lie?
Description
For most postcodes, the enclosing state is easy to determine (e.g. most postcodes in 2000-2999 are in NSW), but the general case is non-trivial. In particular, some postcodes straddle state borders.
Usage
postcode2ste(Postcodes, result = c("integer", "character"))
Arguments
Postcodes |
An integer vector of postcodes. |
result |
One of |
Value
A vector giving the minimal set of states that covers all the postcodes given. For example, if all postcodes lie within a single state, a scalar integer or string for that state is returned.
Examples
vic_poa <- c(3021L, 3084L, 3013L, 3147L, 3030L,
3123L, 3070L, 3004L, 3250L, 3630L)
postcode2ste(vic_poa)
postcode2ste(vic_poa, result = "character")
postcode2ste(c(vic_poa, 2000L))
postcode2ste(3644L)
Get internal data
Description
Get internal data
Usage
read_ste_fst(
ste = c("ACT", "NSW", "NT", "OT", "QLD", "SA", "TAS", "VIC", "WA"),
columns = NULL,
data_env = getOption("healthyAddress.data_env"),
data_dir = getOption("healthyAddress.data_dir", tempfile()),
rbind = TRUE
)
Arguments
ste |
The abbreviated state name. |
columns |
Character vector of columns to select. If |
data_env |
The environment in which objects are cached. Mainly for internal use. |
data_dir |
The file directory into which the downloaded files should be
stored. Defaults to a temporary directory. It is recommended to set the option
|
rbind |
Whether or not to bind the list result should multiple states be requested. |
Value
A data.table containing all the addresses in the given states.
Standard address
Description
Standardize an address from a free text expression into its components as used in the PSMA (formerly, "Public Sector Mapping Agencies") database.
Usage
standardize_address(
Address,
AddressLine2 = NULL,
return.type = c("data.table", "integer"),
integer_StreetType = FALSE,
hash_StreetName = FALSE,
check = 1L,
nThread = getOption("healthyAddress.nThread", 1L)
)
standard_address2(Address, nThread = getOption("healthyAddres.nThread", 1L))
standard_address3(Line1, Line2, Postcode = NULL, KeepStreetName = FALSE)
Arguments
Address |
A character vector, either a full address or (if |
AddressLine2 |
Either |
return.type |
Either |
integer_StreetType |
Should the street type be returned as an integer vector? |
hash_StreetName |
Should |
check |
An integer, whether the inputs should be checked for possibly invalid addresses or addresses that may not be parsed correctly. |
nThread |
Number of threads to use. |
Line1 , Line2 , Postcode |
For addresses split by line. |
KeepStreetName |
Should an additional character vector be included in the result of the street name? |
Details
By convention observed in the PSMA, street names such as 'THE ESPLANADE' have a street name of 'THE ESPLANADE' and an absent street type code.
Non-addresses passed have unspecified behaviour, though usually the numbers of the standard address will be 0 or NA. Postcodes may be negative in some circumstances where a postcode is not detected, though this should not be relied on.
For maximum performance, consider setting integer_StreetType and hash_StreetName to TRUE. It has been observed that joining two tables is faster when using the hash of the standardized street name, rather than the street name itself, even when taking into account the hashing process.
For performance reasons, addresses with more than 32 words are not supported.
If a postcode-like number exists at the end of an Address, but is not in fact a postcode, then NA will be in each field, except postcode, which will have the value -1.
Value
A data.table containing columns indicating the components of the standard address:
FLAT_NUMBER
The flat or unit number. This includes things like SHOP number.
NUMBER_FIRST
As used in the PSMA, this identifies the first (or only) number in the address range.
NUMBER_LAST
As used in the PSMA, if an address is marked as having a range of street numbers, the last of the range.
NUMBER_SUFFIX
A raw vector. The suffix observed after the numbers. The PSMA technically has multiple suffixes for each number component.
H0
If hash_StreetName = TRUE, the DJB2 hash (as used in HashStreetName) of the street name. Observed to have performance benefits.
STREET_NAME
The (uppercase) street name. Streets such as 'THE ESPLANADE' or 'THE AVENUE' are treated as entirely made up of a street name and have a STREET_TYPE_CODE of zero.
STREET_TYPE_CODE
An integer, the street type code marking the type of street such as ROAD, STREET, AVENUE, etc. The code corresponds approximately to the rank of its frequency in addresses.
STREET_TYPE
If integer_StreetType = FALSE, then the (uppercase) standard name of the street type.
POSTCODE
An integer vector, the postcode observed.
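A short usage sketch, assuming the package is installed. The addresses are the ones used in the match_StreetType examples, the arguments are those listed in the Usage section, and no particular output is claimed here.

```r
# Example inputs borrowed from the match_StreetType examples in this manual.
addresses <- c("712 FLINDERS STREET MELBOURNE 3004",
               "712 FLINDERS ST MELBOURNE 3004")

# Guarded so that the snippet runs even when the package is not installed.
if (requireNamespace("healthyAddress", quietly = TRUE)) {
  std <- healthyAddress::standardize_address(
    addresses,
    integer_StreetType = TRUE,  # street type returned as an integer code
    hash_StreetName = TRUE      # include the hashed street name
  )
  print(std)
}
```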
Uppercase
Description
Uppercase
Usage
toupper_basic(x)
Arguments
x |
A character vector |
Value
The same as toupper(x) for ASCII entries. For implementation reasons, strings wider than 32767 characters (bytes) will be ignored.
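As a point of reference, an ASCII-only uppercase conversion can be written in base R with chartr. This sketch is a hypothetical stand-in used to illustrate the "same as toupper(x) for ASCII entries" contract; it says nothing about how toupper_basic is implemented or how it treats very long strings.

```r
# Hypothetical ASCII-only uppercase conversion in base R.
toupper_ascii_sketch <- function(x) {
  chartr(paste(letters, collapse = ""), paste(LETTERS, collapse = ""), x)
}

x <- c("712 flinders st", "Melbourne Vic 3004")
identical(toupper_ascii_sketch(x), toupper(x))  # TRUE for ASCII input
```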
Unique postcodes of
Description
Unique postcodes of
Usage
unique_Postcodes(x, strict = TRUE)
uniqueN_Postcodes(x, strict = TRUE)
Arguments
x |
An integer vector of postcodes. |
strict |
(logical, default: |
Value
unique_Postcodes
A (sorted) integer vector of the unique, non-NA values in x.
uniqueN_Postcodes
The number of unique postcodes.
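The documented results correspond to the following base R expressions, shown here as a sketch for comparison. The helper name unique_postcodes_sketch is hypothetical, and the effect of the strict argument is not modelled.

```r
# Hypothetical base R counterpart of the documented return values.
unique_postcodes_sketch <- function(x) sort(unique(x[!is.na(x)]))

x <- c(3000L, 2000L, NA, 3000L, 800L)
unique_postcodes_sketch(x)          # 800 2000 3000
length(unique_postcodes_sketch(x))  # 3
```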