User:The Anome/Wikidata/OSM reconciliation notes


Revision as of 09:09, 19 April 2026


OWL Places, a one-at-a-time service: https://osm.wikidata.link/

From the Geodata Wikiproject talk page:

I'm planning to do a massive reconciliation of complete dumps of Wikidata and OpenStreetMap, in a way that will be compliant with both Wikidata's and OSM's license terms. It is intended to do two things: create a mapping of OSM relations to Wikidata item IDs, and extract and compare OSM coordinates to Wikidata coordinates, flagging those pages for which the coordinates differ considerably. What it ''won't'' do is extract coordinates for consumption by Wikidata or Wikipedia, because that would be against OSM's licensing terms; they explicitly note that you can use OSM's data to ''check'' other datasets, but not to republish OSM's data; see [https://osmfoundation.org/wiki/Licence/Licence_and_Legal_FAQ#4._CAN_I_USE_OSM_DATA_AND_OPENSTREETMAP-DERIVED_MAPS_TO_VERIFY_MY_OWN_DATA_WITHOUT_TRIGGERING_SHARE-ALIKE?].
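As a rough illustration of the "flag those which differ considerably" step, here is a minimal Python sketch. The function names, the input tuple layout and the 1 km threshold are all assumptions made for this example, not part of the actual design. Note that it emits only the Wikidata ID and the size of the discrepancy, never the OSM coordinates themselves.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against rounding error near antipodal points.
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, a)))


def flag_discrepancies(pairs, threshold_m=1000.0):
    """Yield (qid, distance_m) where the two coordinates differ by more than threshold_m.

    pairs: iterable of (qid, (wd_lat, wd_lon), (osm_lat, osm_lon)).
    The threshold is an arbitrary placeholder; a real run would probably
    want it to depend on feature type.
    """
    for qid, wd, osm in pairs:
        distance = haversine_m(wd[0], wd[1], osm[0], osm[1])
        if distance > threshold_m:
            yield qid, distance
```

A single fixed threshold is almost certainly too crude (a country and a village need very different tolerances), but it shows the shape of the check.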

My new, faster home internet connection, together with the extra RAM and NVRAM disks on my new small desktop computer, will now allow this to be done in days rather than months. The first step in both processes is to winnow out only the data needed for this purpose, discarding over 99.9% of each dump as irrelevant chaff. Design work, which will take significantly longer, is still under way, but my current intention is to download the PBF form of the OSM dump and use Osmosis for extraction, and to download the compressed JSON version of the Wikidata dump and analyse it with DIY scripts. The resulting data will be put into a private PostgreSQL database on the same computer for later analysis.
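For the Wikidata side, the winnowing could look something like the Python sketch below. It assumes the bzip2-compressed JSON dump, which is laid out as one large JSON array with one entity per line, and it keeps only the item ID and the first coordinate location (P625) value. The function names are made up for this example, and the real scripts will need to extract much more (labels, instance-of, containment and so on).

```python
import bz2
import json


def iter_entities(lines):
    """Yield one entity dict per dump line, skipping the enclosing '[' and ']'."""
    for line in lines:
        line = line.strip().rstrip(",")  # each entity line ends with a comma
        if line in ("[", "]", ""):
            continue
        yield json.loads(line)


def extract_coordinate(entity):
    """Return (qid, lat, lon) for the first P625 claim that has a value, else None."""
    for claim in entity.get("claims", {}).get("P625", []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # "novalue" / "somevalue" snaks carry no coordinates
        value = snak["datavalue"]["value"]
        return entity["id"], value["latitude"], value["longitude"]
    return None


def coordinates_from_dump(path):
    """Stream a .json.bz2 dump and yield (qid, lat, lon) rows, one entity at a time."""
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        for entity in iter_entities(fh):
            row = extract_coordinate(entity)
            if row is not None:
                yield row
```

Reading line by line keeps memory use flat regardless of dump size, which matters far more here than parsing speed.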

Matching will then be done on in-RAM copies of both datasets, on the same basis as the Anomebot scripts: if there is one, and only one, entry in each dataset with the same name, enclosure hierarchy, and feature type, the two will be considered to be the same. There are also some other wrinkles involving Wikipedia page types, DAG traversal, and a significant number of ad-hoc heuristics developed over the years, and I can also use the Wikipedia category tree as a cross-check on the Wikidata categorization. For the moment, I'll concentrate on data extraction, with correlation coming afterwards. There are also some potential other benefits, including better GNS matching that can use the OSM data to double-check GNS data; unlike OSM, GNS data is in the public domain and can be reused here.
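The "one, and only one" rule can be sketched in Python as follows. This is a simplified illustration and not the Anomebot code: the record layout, the function names and the case-folding normalisation are assumptions, and none of the extra heuristics mentioned above are represented.

```python
from collections import defaultdict


def index_by_key(records):
    """Group record IDs by (name, enclosure hierarchy, feature type).

    records: iterable of (record_id, name, hierarchy, feature_type), where
    hierarchy is a sequence of enclosing place names, innermost first.
    """
    index = defaultdict(list)
    for rec_id, name, hierarchy, feature_type in records:
        key = (name.casefold(),
               tuple(h.casefold() for h in hierarchy),
               feature_type)
        index[key].append(rec_id)
    return index


def unique_matches(wikidata_records, osm_records):
    """Return {wikidata_id: osm_id} for keys that occur exactly once on each side."""
    wd_index = index_by_key(wikidata_records)
    osm_index = index_by_key(osm_records)
    matches = {}
    for key, wd_ids in wd_index.items():
        osm_ids = osm_index.get(key, [])
        if len(wd_ids) == 1 and len(osm_ids) == 1:
            matches[wd_ids[0]] = osm_ids[0]
    return matches
```

Anything ambiguous on either side is simply left unmatched, which trades recall for precision; that seems the right default when the output may end up being published.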

OSM also has its own tag type for links to Wikidata: wikidata=*.
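Existing wikidata=* tags could serve as a cross-check on the name-based matching. A small Python sketch of reading them, assuming the tags of an OSM object are already available as a plain dict (the function name is made up for this example):

```python
import re

# A Wikidata item ID: capital Q followed by a positive integer.
QID_RE = re.compile(r"Q[1-9][0-9]*")


def qids_from_tags(tags):
    """Return the well-formed item IDs found in an object's wikidata=* tag.

    OSM tags with multiple values are conventionally separated by
    semicolons, so the value is split on ';' and each part is validated.
    """
    raw = tags.get("wikidata", "")
    qids = []
    for part in raw.split(";"):
        part = part.strip()
        if QID_RE.fullmatch(part):
            qids.append(part)
    return qids
```

Where a relation already carries a tag that disagrees with the name-based match, that disagreement is itself worth flagging for review.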

The other big potential win is being able to publish the mapping of OSM relations to Wikidata item IDs as part of Wikidata using https://www.wikidata.org/wiki/Property:P402, but I need more understanding about licensing issues before I consider that. Finally, yes, I know about the lack of stability guarantees regarding relation identifiers and Wikidata, but I think this would still be useful and relevant, and re-reconciliations can be performed from time to time to keep the data fresh.