One might say, don’t use Placekey if …
Okay, now that the catchy headline is out of the way, we can dig into why I would say something like that as a developer on the Placekey team. Let me take you on a journey through the process of being a developer working on a conflation problem.
As the project begins, you sit in a meeting bright-eyed and bushy-tailed, ready to take on the joining of your datasets to create the richest depiction of each entity in the collection; perhaps the ask is a parcel_number column from dataset_a, an appraisal column from dataset_b, and maybe even a location_name column from dataset_c. Three values that up to this point exist in isolation in their own datasets, but three values that, if brought together, could be meaningful to some hard problem you and your team are going to solve.
So you have the idea: join these datasets to create your beautiful unified view.
As you start evaluating the problem you step into the realm of address matching. Each dataset appears to have either a city or a postal_code. You think you will be able to group on those to limit your candidate matches. Well, that is, if you can find the corresponding postal_codes that map the cities in dataset_a to those in dataset_b.
Let’s say you are at a mature org that along the way has built a beautiful Data Warehouse you can quickly pull this data from. That data is nothing to scoff at, so you will need some real processing power, say a big data processing tool like Apache Spark. Fortunately, the org that has the Data Warehouse also has Apache Spark for you to use.
Now, in your Spark job, you manage to ingest and clean up nearly all of your cities and postal_codes so you can narrow your matches. Great, you have limited the candidate matches to those within the same city across the datasets. Still, there is a considerable number of candidates in each group.
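For the curious, here is a minimal sketch of what that blocking step might look like in PySpark. The file paths, column names, and the choice of keying on postal_code are assumptions for illustration, not an actual schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("conflation-blocking").getOrCreate()

# Hypothetical inputs: both datasets already cleaned to share a postal_code column.
a = (spark.read.parquet("dataset_a.parquet")
     .select("parcel_number", "postal_code", F.col("address").alias("address_a")))
b = (spark.read.parquet("dataset_b.parquet")
     .select("appraisal", "postal_code", F.col("address").alias("address_b")))

# Blocking: only pair up records that share a postal_code,
# which cuts the candidate space down from a full cross join.
candidates = a.join(b, on="postal_code", how="inner")

# A rough sense of how many candidates each parcel still has to be compared against.
candidates.groupBy("parcel_number").count().show(10)
```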
Oh, but wait: suddenly you are reading a Medium article that shares a bit on Apache Sedona, a powerful open source project that lets you do spatial joins on Spark between centroids. And look at that, your org that has a Data Warehouse and Apache Spark also has a resource for Apache Sedona.
You think, perhaps I can leverage our contract with the geocoder to process my datasets and get a centroid to join on; and perfect, your org that has a Data Warehouse, Apache Spark, and Apache Sedona also just built the best async Data Pipeline Geocoder, which will make this super easy to do.
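Roughly, the Sedona flavor of that join could look like the sketch below. The 100-meter radius, the lon/lat column names coming back from the geocoder, and the setup call are all placeholders that vary by Sedona version and schema.

```python
from sedona.spark import SedonaContext

# The setup call varies by Sedona version; this assumes the Sedona jars are on the cluster.
sedona = SedonaContext.create(spark)

# a and b are the geocoded DataFrames from earlier, now carrying lon/lat from the geocoder.
a.createOrReplaceTempView("a")
b.createOrReplaceTempView("b")

# Spatial blocking: keep only pairs whose geocoded centroids fall within ~100 meters.
candidates = sedona.sql("""
    SELECT a.parcel_number, a.address_a, b.appraisal, b.address_b
    FROM a JOIN b
      ON ST_DistanceSphere(ST_Point(a.lon, a.lat), ST_Point(b.lon, b.lat)) < 100
""")
```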
Great, you now have addresses in proximity to each other, and that makes the matching a much more manageable process, maybe 1 to 100 candidates each. You think this is great and finally get down to the matching logic. So, you pull out your trusty resources and start looking into address matching. It leads you to address parsing; and look, there are a few off-the-shelf tools that make it super easy to do. You parse the addresses and write logic to match the components, and it works! Well, until one of the datasets you got was handwritten data from doctors who on the weekends moonlight as ghostwriters for county clerks. Okay, that's made up, but you get the point: spelling errors and handwriting translated by OCR. Oh no.
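One off-the-shelf option for that parsing step is the usaddress library. Here is a deliberately naive sketch of component matching with made-up addresses; it holds up on tidy input and falls over as soon as the typos arrive.

```python
import usaddress

def parse(addr: str) -> dict:
    # usaddress.tag returns labeled components plus an overall address type.
    components, _ = usaddress.tag(addr)
    return dict(components)

def components_match(addr_a: str, addr_b: str) -> bool:
    a, b = parse(addr_a), parse(addr_b)
    # Naive rule: house number and street name must agree exactly (case-insensitive).
    keys = ("AddressNumber", "StreetName")
    return all(a.get(k, "").lower() == b.get(k, "").lower() for k in keys)

print(components_match("123 N Main St", "123 North Main Street"))  # True
print(components_match("123 Main St", "123 Mian St"))              # False: one OCR typo sinks it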
Lying in bed at night, you have your eureka moment: semantic similarity! You rise from the ashes, I mean the sleepless night, thinking about your neat new implementation. Itching to start applying it, you outfit the last step of your Spark job with the hottest new embedding model; but wait, the job has slowed to a crawl. Inching along, you examine the CPU usage and find it is pegged near 100% for the entire job and craving more resources. No worries though, your org that has a Data Warehouse, Apache Spark, Apache Sedona, and a Data Pipeline Geocoder also has a large budget for GPUs to support your embeddings.
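The semantic-similarity idea, sketched outside of Spark with the sentence-transformers library; the model name is just a common default, and inside the pipeline this would live in something like a pandas UDF rather than a loop.

```python
from sentence_transformers import SentenceTransformer, util

# The model choice is a placeholder; any sentence-embedding model slots in here.
model = SentenceTransformer("all-MiniLM-L6-v2")

pairs = [
    ("123 Main St, Springfield", "123 Mian Street, Springfeild"),  # OCR-style typos
    ("123 Main St, Springfield", "987 Elm Ave, Shelbyville"),      # genuinely different place
]

left = model.encode([p[0] for p in pairs], convert_to_tensor=True)
right = model.encode([p[1] for p in pairs], convert_to_tensor=True)

# Cosine similarity stays high for the misspelled pair that exact component matching rejects.
scores = util.cos_sim(left, right).diagonal()
for (a, b), score in zip(pairs, scores):
    print(f"{float(score):.2f}  {a!r} vs {b!r}")
```

Encoding every candidate pair is exactly the step that turns the CPU-bound crawl into a GPU line item.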
And there you have it, you have built a beautiful, robust and effective, yet expensive, data pipeline for conflation. You can confidently ingest your datasets and produce a reliable outcome to solve your hard problem.
What Placekey Offers:
Placekey is great at many things and continually provides value for those who use it, but it is not the only solution. I, and I think your project managers and the people who negotiated your cloud provider contracts, would argue that if you have the resources to do this type of intricate conflation yourself, you should be doing it.
That said, when you decide to work with Placekey, whether it be for a quick analysis or a foundational pipeline, you are getting not only a service but a partnership with an organization that has committed its long history of spatial data processing, conflation, and matching to bringing you the highest level of service possible, and of course an organization that has a Data Warehouse, Apache Spark, Apache Sedona, a Data Pipeline Geocoder, and GPUs. If you are interested in getting started, you can sign up for an account here.