Machine learning and NLP approaches in address matching
Author
Syne, Lamine
Date
2022
Abstract
The objective of this project is to explore the potential of machine learning and NLP
in the address matching sub-field of geographic information science. To this end, we
studied word and sentence embedding models in depth: how they work and how they can
be used to generate numerical representations of an address.
For each word or sentence embedding model, we generate vector representations of the
addresses in the database and compute the cosine similarity between them to
determine which pairs refer to the same geographic position.
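The comparison step described above can be sketched as follows. This is only an illustration under stated assumptions: the `embed` function is a toy stand-in (character trigram counts) for the word and sentence embedding models studied in this work, the threshold value is hypothetical, and the example addresses are invented.

```python
import math
from collections import Counter


def embed(address: str, n: int = 3) -> Counter:
    """Toy stand-in for an embedding model: character n-gram counts.

    Assumption: a real pipeline would call a word or sentence embedding
    model here and return a dense vector instead.
    """
    text = " ".join(address.lower().split())  # lowercase, collapse whitespace
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


THRESHOLD = 0.8  # hypothetical cut-off, to be tuned on labelled pairs


def is_match(a: str, b: str, threshold: float = THRESHOLD) -> bool:
    """Declare two addresses a match when their similarity reaches the threshold."""
    return cosine_similarity(embed(a), embed(b)) >= threshold


# Invented example addresses, for illustration only.
same = cosine_similarity(embed("Calle Mayor 12, Las Palmas"),
                         embed("Calle Mayor 12 Las Palmas"))
different = cosine_similarity(embed("Calle Mayor 12, Las Palmas"),
                              embed("Avenida Anaga 5, Santa Cruz"))
```

Here `same` should come out well above `different`. With a real embedding model only `embed` changes, while the similarity and thresholding logic stay the same.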
We then use the confusion matrix to evaluate the performance of each model on a
dataset of already matched addresses built from ISTAC [1] data sources, and we
compare the models against one another.
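As a rough sketch of this evaluation, the snippet below counts the four confusion matrix cells for binary match / no-match decisions and derives the usual summary metrics. The class and function names are illustrative assumptions, and the labels in the usage example are made up rather than taken from the ISTAC dataset.

```python
from typing import NamedTuple, Sequence


class ConfusionMatrix(NamedTuple):
    tp: int  # pairs correctly predicted as matches
    fp: int  # pairs wrongly predicted as matches
    fn: int  # true matches the model missed
    tn: int  # pairs correctly predicted as non-matches

    @property
    def precision(self) -> float:
        return self.tp / (self.tp + self.fp) if (self.tp + self.fp) else 0.0

    @property
    def recall(self) -> float:
        return self.tp / (self.tp + self.fn) if (self.tp + self.fn) else 0.0

    @property
    def accuracy(self) -> float:
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total if total else 0.0


def confusion_matrix(y_true: Sequence[bool], y_pred: Sequence[bool]) -> ConfusionMatrix:
    """Count the four outcome types for binary match / no-match labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    return ConfusionMatrix(tp, fp, fn, tn)


# Made-up labels for five address pairs (True = same location).
y_true = [True, True, False, False, True]
y_pred = [True, False, False, True, True]
cm = confusion_matrix(y_true, y_pred)
```

Computing one such matrix per embedding model on the same labelled pairs gives directly comparable precision, recall, and accuracy figures.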
Finally, we present a use case built on the best-performing model among those
studied. It can serve as a starting point for building a powerful tool for matching
address pairs across the Canary Islands.
Keywords: machine learning, NLP, language model, address matching, word
embedding, similarity