Updated on January 25, 2016
Replicator: Wikidata import tool
Replicator was created for importing data from Wikidata into the QueryR REST API persistence. It has two big conceptual components: getting entities from a specified source, and then doing something with said entities.
Wikidata Web API
As the above asciicast shows, you can import entities via the Wikidata web API. This requires a connection to the API and is by far the slowest way to import. It is still much more convenient than fetching a dump when you just want to import a few entities for testing purposes.
The one required argument for the API import command is the list of entities to import. This argument accepts single entity IDs such as Q1 and P42, as well as ranges such as Q1-Q100. You can have as many of these as you want, for instance P42 Q1-100 P20-30 Q1337. The -r or --include-references flag specifies that all referenced entities should also be imported. This is particularly useful when you need the labels of these entities to display the one you actually imported. Finally, there is a verbosity option that switches between 3 different levels of output.
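To make the accepted ID syntax concrete, here is a minimal sketch of how such arguments could be expanded into individual entity IDs. The function name `expandEntityIds` is hypothetical and this is not Replicator's actual parser. The rules are inferred from the examples above: a range may repeat the prefix (Q1-Q100) or omit it (Q1-100).

```php
<?php
declare(strict_types=1);

/**
 * Expands arguments such as "Q1", "P42", "Q1-Q100" or "Q1-100" into single IDs.
 * Hypothetical helper for illustration; not part of Replicator.
 *
 * @param string[] $arguments
 * @return string[]
 */
function expandEntityIds(array $arguments): array
{
    $ids = [];

    foreach ($arguments as $argument) {
        // Prefix letter, start number, then an optional "-end" whose prefix may be omitted.
        if (!preg_match('/^([QP])([1-9]\d*)(?:-\1?([1-9]\d*))?$/i', $argument, $match)) {
            throw new InvalidArgumentException("Not an entity ID or range: $argument");
        }

        $prefix = strtoupper($match[1]);
        $start = (int)$match[2];
        // A single ID is treated as a range of length one.
        $end = isset($match[3]) ? (int)$match[3] : $start;

        if ($end < $start) {
            throw new InvalidArgumentException("Range ends before it starts: $argument");
        }

        for ($number = $start; $number <= $end; $number++) {
            $ids[] = $prefix . $number;
        }
    }

    return $ids;
}
```

Called with `['P42', 'Q1-3', 'Q1337']`, this sketch yields P42, Q1, Q2, Q3 and Q1337, in that order.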
The command internally does batching using my Batching Iterator PHP library. You can specify the batch size with the -b or --batchsize option. The command can also be safely aborted via ctrl+c. Rather than immediately dying and leaving your database (or other output) in a potentially inconsistent state, Replicator will finish importing the current entity before exiting. A summary of the import is displayed once it has completed or was aborted.
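The batching and safe-abort behaviour described above can be sketched roughly as follows. This is an illustration only and does not use the Batching Iterator library; the names `inBatches`, `importAll`, `$importOne` and `$shouldStop` are all hypothetical. The point is that the abort check happens after an entity has been fully imported, never in the middle of one.

```php
<?php
declare(strict_types=1);

/**
 * Groups the items of any iterable into arrays of at most $batchSize items.
 */
function inBatches(iterable $items, int $batchSize): Generator
{
    if ($batchSize < 1) {
        throw new InvalidArgumentException('Batch size must be at least 1');
    }

    $batch = [];
    foreach ($items as $item) {
        $batch[] = $item;
        if (count($batch) === $batchSize) {
            yield $batch;
            $batch = [];
        }
    }

    // Emit the last, possibly smaller, batch.
    if ($batch !== []) {
        yield $batch;
    }
}

/**
 * Imports entities batch by batch and checks for an abort request
 * only after an entity has been imported completely.
 *
 * @return int number of entities imported, usable for the final summary
 */
function importAll(iterable $ids, int $batchSize, callable $importOne, callable $shouldStop): int
{
    $imported = 0;

    foreach (inBatches($ids, $batchSize) as $batch) {
        foreach ($batch as $id) {
            $importOne($id);
            $imported++;

            if ($shouldStop()) {
                return $imported;
            }
        }
    }

    return $imported;
}
```

In a CLI application, `$shouldStop` could read a flag that a SIGINT handler sets, for example one registered with `pcntl_signal` when the pcntl extension is available.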
It is possible to import data from both compressed and uncompressed JSON dumps. This functionality is exposed via several commands: import:json for uncompressed dumps, import:bz2 for bzip2 compressed dumps and import:gz for gzip compressed dumps. You can specify a maximum number of entities to import, or safely and interactively abort the import via ctrl+c. In both cases you will be presented with a continuation token that can be used to continue the import from where it stopped.
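As a rough illustration of a limited import that can be resumed, the sketch below reads an uncompressed dump, which holds a single JSON array with one entity per line. Using a byte offset as the continuation token is an assumption made for this example; the format of Replicator's actual token is not described here, and compressed dumps would likely need a different kind of token. The function name `readEntities` is hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Reads up to $maxEntities entities from an uncompressed JSON dump, starting at $offset.
 * Returns the decoded entities and the byte offset to resume from (assumed token format).
 *
 * @return array [array[] $entities, int $nextOffset]
 */
function readEntities(string $dumpPath, int $maxEntities, int $offset = 0): array
{
    $handle = fopen($dumpPath, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $dumpPath");
    }
    fseek($handle, $offset);

    $entities = [];
    // Check the limit first so that no line is read and then thrown away.
    while (count($entities) < $maxEntities && ($line = fgets($handle)) !== false) {
        // Drop surrounding whitespace and the comma that separates array elements.
        $json = rtrim(trim($line), ',');

        // Skip blank lines and the brackets that open and close the array.
        if ($json === '' || $json === '[' || $json === ']') {
            continue;
        }

        $entity = json_decode($json, true);
        if (is_array($entity)) {
            $entities[] = $entity;
        }
    }

    $nextOffset = ftell($handle);
    fclose($handle);

    return [$entities, $nextOffset];
}
```

Passing the returned offset back in as `$offset` continues reading at the first entity that was not yet imported.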
The JSON import functionality is built on my Wikidata JSON Dump Reader PHP library. You can even import XML dumps via the import:xml command, though you are likely better off sticking with the recommended JSON dumps.
As Replicator was written for the QueryR REST API, it by default imports into the persistence used by this API. This persistence is composed of the QueryR EntityStore, the QueryR TermStore and Wikibase QueryEngine, all open source PHP libraries providing persistence for Wikibase data.
While internally Replicator uses a plugin system, there currently is no way to add additional sources without modifying the PHP code. The needed modifications are trivial, and it is also relatively simple to make the application as a whole truly extensible. While I’m currently working on other projects, I suspect this capability is useful for various use cases. All it takes is implementing an interface with a method handleEntity( EntityDocument $entity ), and suddenly the tool becomes capable of importing into your MediaWiki, your Wikibase or your custom persistence.
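To give an idea of what such a plugin might look like, here is a minimal sketch. Only the handleEntity method comes from the description above. The interface name `EntityHandler`, the `void` return type and the example classes are hypothetical, and `EntityDocument` is declared here as a simplified stand-in for the interface from Wikibase DataModel so that the example runs on its own.

```php
<?php
declare(strict_types=1);

// Simplified stand-in for Wikibase DataModel's EntityDocument (assumption for this sketch).
interface EntityDocument
{
    public function getId(): string;
}

// Hypothetical name for the plugin interface; only the method is taken from the text.
interface EntityHandler
{
    public function handleEntity(EntityDocument $entity): void;
}

// Example handler that only records which entity IDs it received.
// A real handler would write the entity to MediaWiki, Wikibase or a custom store.
final class IdCollectingHandler implements EntityHandler
{
    /** @var string[] */
    private $ids = [];

    public function handleEntity(EntityDocument $entity): void
    {
        $this->ids[] = $entity->getId();
    }

    /** @return string[] */
    public function getIds(): array
    {
        return $this->ids;
    }
}

// Tiny entity used to exercise the handler in this sketch.
final class StubEntity implements EntityDocument
{
    /** @var string */
    private $id;

    public function __construct(string $id)
    {
        $this->id = $id;
    }

    public function getId(): string
    {
        return $this->id;
    }
}
```

A handler like this would be called once per imported entity, so it needs no knowledge of where the entities came from.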
If you are interested in creating such a plugin, let me know and I will add the missing parts of the plugin system. I might get to this at some point anyway, and then do another blog post covering the details.
Further points of interest
I should mention that the Replicator application is built on top of the Wikibase DataModel and Wikibase DataModel Serialization PHP libraries, without which creating such a tool would be a lot more work, both initially and maintenance-wise. It also uses the Symfony Console component, which I can highly recommend for anyone creating a CLI application in PHP.
If you are running a Wikibase instance on your wiki, take a look at the Wikibase Import MediaWiki extension by Aude. If you want to import things into Wikidata, then have a look at this reference Microdata import script by Addshore. If you are working with Java and want to import dumps, check out Wikidata Toolkit.