Friday, September 7, 2012

Wikipedia and Wikimedia Commons HTML content

Wikipedia and Wikimedia Commons have excellent information. I am building a few applications that use information from these sites, and I wanted to extract some portions of the HTML markup, for example the image source, title, and thumbnail. I tried the MediaWiki API and was able to get the file names with a query like this:

This produced the following output:

<?xml version="1.0"?>
<api>
  <query>
    <categorymembers>
      <cm title="File:Ethiopian - Coin Depicting an Anonymous King - Walters 59793 - Obverse.jpg" ns="6" pageid="18842729"/>
      <cm title="File:Ethiopian - Coin Depicting an Anonymous King - Walters 59793 - Reverse.jpg" ns="6" pageid="18842732"/>
      <cm title="File:Ethiopian - One of Two Coins Depicting Ousanas and an Anonymous King - Walters 59794.jpg" ns="6" pageid="18809697"/>
      <cm title="File:Greek - Apollo - Walters 59533.jpg" ns="6" pageid="18787772"/>
      <cm title="File:Greek - Athena - Walters 59519 - Obverse.jpg" ns="6" pageid="18801612"/>
      <cm title="File:Greek - Athena - Walters 59519 - Reverse.jpg" ns="6" pageid="18801616"/>
      <cm title="File:Greek - Athena - Walters 59702 - Back.jpg" ns="6" pageid="18787788"/>
      <cm title="File:Greek - Persephone - Walters 59693.jpg" ns="6" pageid="18787786"/>
      <cm title="File:Greek - Tetradrachme with King Nicodemus II - Walters 59723 - Back.jpg" ns="6" pageid="18787795"/>
      <cm title="File:Matthes Gebel - Medal of Arnold and Nicholas Wenck - Walters 59480 - Obverse.jpg" ns="6" pageid="18839416"/>
    </categorymembers>
  </query>
  <query-continue>
    <categorymembers cmcontinue="file|7e524f4d414e202d20434f494e2057495448204120484950504f504f54414d555320414e4420504f525452414954204f46204f544143494c494120534556455241202d2057414c54455253203539373531202d204241434b2e4a50470a524f4d414e202d20434f494e2057495448204120484950504f504f54414d555320414e4420504f525452414954204f46204f544143494c494120534556455241202d2057414c54455253203539373531202d204241434b2e4a5047|18787799"/>
  </query-continue>
</api>
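
To use a response of this shape in an application, the file titles and the continuation token have to be pulled out of the XML. Here is a minimal sketch in Python using only the standard library. The function name `parse_category_members` and the shortened `SAMPLE` document (with a placeholder `cmcontinue` value) are my own illustrations, not part of the API.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample with the same structure as the response above.
SAMPLE = """<?xml version="1.0"?>
<api>
  <query>
    <categorymembers>
      <cm title="File:Greek - Apollo - Walters 59533.jpg" ns="6" pageid="18787772"/>
      <cm title="File:Greek - Persephone - Walters 59693.jpg" ns="6" pageid="18787786"/>
    </categorymembers>
  </query>
  <query-continue>
    <categorymembers cmcontinue="file|EXAMPLE|18787799"/>
  </query-continue>
</api>"""


def parse_category_members(xml_text):
    """Return (members, continuation_token) from a categorymembers response."""
    root = ET.fromstring(xml_text)
    # Each <cm> element carries the page title and page id as attributes.
    members = [
        {"title": cm.get("title"), "pageid": int(cm.get("pageid"))}
        for cm in root.findall("./query/categorymembers/cm")
    ]
    # <query-continue> is only present when more results are available.
    cont = root.find("./query-continue/categorymembers")
    token = cont.get("cmcontinue") if cont is not None else None
    return members, token


members, token = parse_category_members(SAMPLE)
for member in members:
    print(member["pageid"], member["title"])
```

When the token is not `None`, it can be sent back as the `cmcontinue` parameter of the next request to fetch the following batch of files.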

You can use the file name to build a URL like the following to get the image info:
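
The exact URL is not reproduced here, so the following is only a sketch of one way such a URL could be built. It assumes the Commons API endpoint `https://commons.wikimedia.org/w/api.php` and the standard `prop=imageinfo` query with `iiprop=url`; the helper name `image_info_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint for Wikimedia Commons; other wikis use their own host.
API_ENDPOINT = "https://commons.wikimedia.org/w/api.php"


def image_info_url(file_title):
    """Build an API URL that asks for the image info of one file page."""
    params = {
        "action": "query",
        "titles": file_title,   # e.g. a title taken from a <cm> element
        "prop": "imageinfo",
        "iiprop": "url",        # request the URL of the full-size image
        "format": "xml",
    }
    # urlencode percent-encodes the title (spaces become '+', ':' becomes '%3A').
    return API_ENDPOINT + "?" + urlencode(params)


print(image_info_url("File:Greek - Apollo - Walters 59533.jpg"))
```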

However, I also wanted to get the title, thumbnail URL, and so on. After some research I found the discussion "How to get HTML content text of a Wikipedia Page (via Wikipedia API)?" and then the page "Manual:Parameters to index.php".

This led to the following query, which returns all the relevant HTML markup:
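
The query itself is not reproduced here. As a hedged sketch of the kind of request the index.php manual describes, the snippet below builds a URL with `action=render`, which returns the rendered HTML of a page without the surrounding site skin. The endpoint `https://commons.wikimedia.org/w/index.php` and the helper name `rendered_html_url` are my assumptions, and the original query may have used different parameters.

```python
from urllib.parse import urlencode

# Assumed index.php endpoint for Wikimedia Commons.
INDEX_ENDPOINT = "https://commons.wikimedia.org/w/index.php"


def rendered_html_url(page_title):
    """Build an index.php URL that requests only the rendered page HTML."""
    params = {
        "title": page_title,
        "action": "render",  # page content as HTML, without the skin
    }
    return INDEX_ENDPOINT + "?" + urlencode(params)


print(rendered_html_url("File:Greek - Apollo - Walters 59533.jpg"))
```

The returned markup can then be handed to an HTML parser to pick out the image source, title, and thumbnail.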
