Thursday, July 22, 2010

Changes to this Blog

From now on I'll try to keep the posts in here a little bit less technical. Everything else can be found on my personal blog. That said, I've kicked that blog off with a post about how I'll try to generate new data for this version of OSMdoc.

I'll keep you updated here and on Twitter. I won't promise anything but I hope to have fresh data before the end of the month.

As always just contact me if you have suggestions or questions.

And sponsors are still very welcome!

Wednesday, June 2, 2010

Import of OSM history

I've done some work on OSMdoc in the last few days. That code probably won't be useful for anyone else, but I've put it in the public anyway.

This morning I started the process of importing the complete history of OSM data into the database. If that works it should take about eight days at the current speed. Once that's done I'll have to bring the data up to date, as the dump is from February 2010. I don't have any code yet to download the diffs, but that shouldn't be too hard to write.

I'll try to get back to regular blogging but in the meantime I'll try to post quick updates to Twitter.

The import is currently running at about 1500 elements per second. Each element goes through quite a few steps, and none of this is optimized for speed:

  1. Read XML data
  2. Parse an element: Node, Way, Relation or Changeset
  3. Serialize that data into a byte Array by using Avro
  4. Send a message containing the serialized data to an AMQP/RabbitMQ exchange
Another process is bound to the same exchange and listens for those messages:
  1. Reads messages from RabbitMQ
  2. Uses Avro to deserialize those messages into Java objects
  3. Writes them to HBase
This isn't optimal for bulk imports but it was a process I already had and which worked reasonably well. I'm not concerned about speed at the moment. This allows me to pretty easily switch to other backends or to do some more processing of the data.
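
The two processes described above can be sketched roughly as follows. This is only an illustration and not the actual OSMdoc code: to keep it runnable with nothing but the Python standard library, JSON stands in for Avro, an in-memory queue.Queue stands in for the RabbitMQ exchange, and a plain dict stands in for HBase. The record fields and the row key format are made up for the example.

```python
import io
import json
import queue
import xml.etree.ElementTree as ET

# Element types named in the post.
ELEMENT_TAGS = {"node", "way", "relation", "changeset"}


def produce(xml_stream, exchange):
    """Producer: read XML, parse elements, serialize, publish."""
    sent = 0
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag not in ELEMENT_TAGS:
            continue
        record = {
            "type": elem.tag,
            "id": int(elem.get("id")),
            "version": int(elem.get("version", "1")),
            "tags": {t.get("k"): t.get("v") for t in elem.findall("tag")},
        }
        # JSON is a stand-in for the Avro serialization step.
        exchange.put(json.dumps(record).encode("utf-8"))
        elem.clear()  # free memory, important for very large dumps
        sent += 1
    return sent


def consume(exchange, store):
    """Consumer: read messages, deserialize, write to the store."""
    while True:
        try:
            payload = exchange.get_nowait()
        except queue.Empty:
            break
        record = json.loads(payload.decode("utf-8"))
        # Hypothetical row key; the real HBase schema is not described here.
        row_key = f"{record['type']}/{record['id']}/{record['version']}"
        store[row_key] = record


SAMPLE = b"""<osm version="0.6">
  <node id="1" version="2" lat="52.27" lon="8.05">
    <tag k="amenity" v="restaurant"/>
  </node>
  <way id="7" version="1"><nd ref="1"/></way>
</osm>"""

exchange = queue.Queue()  # stand-in for the RabbitMQ exchange
store = {}                # stand-in for HBase
sent = produce(io.BytesIO(SAMPLE), exchange)
consume(exchange, store)
```

Because the producer and the consumer only share the message format, either side can be swapped out without touching the other, which is the flexibility mentioned above.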


I'm currently at about 20 million elements.
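
For a feeling of what those numbers mean, here is a quick back-of-the-envelope calculation. It only uses the figures from this post (1500 elements per second, about eight days, about 20 million elements so far) and assumes the rate stays constant, which it probably won't. The implied total is derived from those figures and is not an official count.

```python
RATE = 1500                      # elements per second, from this post
SECONDS_PER_DAY = 24 * 60 * 60


def elements_in(days, rate=RATE):
    """Elements processed in the given number of days at a constant rate."""
    return days * SECONDS_PER_DAY * rate


def hours_for(elements, rate=RATE):
    """Hours needed to process the given number of elements."""
    return elements / rate / 3600


implied_total = elements_in(8)         # 1,036,800,000 elements in eight days
hours_so_far = hours_for(20_000_000)   # roughly 3.7 hours for 20 million
```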

As always: Feedback is welcome.

Friday, April 23, 2010

Status update March & April

There haven't been any updates here because there isn't any news to report at the moment.

I've been busy with other stuff but I'll start working on OSMdoc again next week and I hope to have something to report again real soon.

Sunday, March 7, 2010

FOSSGIS, OpenSource and Hosting

A lot has happened since my last update. Shortly after my last post in January I began actively searching for a place to host the new OSMdoc version. A thread on the German mailing lists was very constructive, and I decided to hold off on deciding how to move on until the FOSSGIS conference in Osnabrück. Just a heads up: The rest of this blog post will be a bit technical and probably uninteresting for most people.

Following the thread I decided to begin open sourcing parts of the stuff I'm doing. Most of it won't be very interesting at all for others but I've been asked for the source code multiple times...so here it is ;-) I'll continue cleaning up my existing code and release it on the Wiki page.

I also decided to (once again) redesign parts of the architecture to be more flexible. I'll probably write something about this up in the future and include a flow chart. For now a short overview has to suffice:

I've designed a very modular architecture (for various reasons which I'll go into later). I've written two simple projects that read input data: one for .osm and another for .osc files. Those projects do nothing more than parse the XML files, generate domain objects, serialize them and send them off to a RabbitMQ exchange/queue. I use Avro to specify the schema for the domain objects and to serialize them.
I then use another small tool to subscribe to those messages and import them into the HBase database.

I know that this process seems overly complicated but it is pretty easy to use, very easy to extend and very flexible (I've only written about the open-sourced parts, the rest uses this message based design too). I also know that this isn't the fastest way to do this but I don't care about speed that much and it is still very fast - I'd say fast enough.

Which leads me to FOSSGIS: While there were "only" two days for OSM and of those only one hour of discussions about the dev servers, it was still very interesting and constructive. We discussed at length the problem of the data basis used by the different tools. At the moment everyone uses their own format or database, but a lot of tools need to do the same things. We briefly discussed a few common use cases (API DB, Mapnik DB, routing DB, planet files, etc.) and I think we agreed that there is a lot to be gained by a common way to process and save the data.

This is another reason for my aforementioned design. In theory it'd allow other tools on the server (where OSMdoc is hosted) to subscribe to the same messages and thus easily process diff files. By generating their own messages, other tools can make additional generated data available to everyone else too. In the discussions at FOSSGIS, PubSubHubbub was mentioned as a possible protocol to subscribe to those events. This has two advantages: It is probably even easier to implement for a lot of people because of the familiar protocol, and it is easier to access from remote computers. Fortunately it should be very easy to use RabbitMQ internally and expose the feeds for which it makes sense via PubSubHubbub. I haven't done it yet and it is not on my to-do list, but it'd be worth investigating.
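
To illustrate the idea of several tools subscribing to the same messages, here is a tiny in-memory sketch. It is not RabbitMQ and not my real code: FanoutExchange, importer and tag_counter are made-up names, and the message layout is invented for the example. It only shows the pattern of one message stream feeding several independent tools, one of which publishes derived messages of its own.

```python
from collections import Counter


class FanoutExchange:
    """Hypothetical stand-in for a message exchange: every subscriber gets every message."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, message):
        for callback in self._subscribers:
            callback(message)


elements = FanoutExchange()  # parsed OSM elements
derived = FanoutExchange()   # data generated by other tools

stored_ids = []
key_counts = Counter()
announcements = []


def importer(element):
    # Stand-in for the tool that writes elements to the database.
    stored_ids.append(element["id"])


def tag_counter(element):
    # A second tool that consumes the same stream and publishes its own messages.
    for key in element["tags"]:
        key_counts[key] += 1
        derived.publish({"key": key, "count": key_counts[key]})


elements.subscribe(importer)
elements.subscribe(tag_counter)
derived.subscribe(announcements.append)

elements.publish({"id": 1, "tags": {"amenity": "restaurant"}})
elements.publish({"id": 2, "tags": {"amenity": "cafe", "name": "Example"}})
```

Neither tool knows about the other. Adding a third subscriber would not require changes to the parser or to the existing tools.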

I'll continue going this way and I hope to start the import of the full history planet very soon now (the latest version was released mid-February) to see how everything performs. I am thinking about providing a data update for the old OSMdoc version in the meantime as the new version takes longer than expected. I'll see what I can do.

Sunday, January 31, 2010

January update

I've kept quiet in January but that doesn't mean that there was no progress. A lot of effort and focus has been put towards the mapping in Haiti, but I've gotten a lot done on OSMdoc too.

First of all I've done a lot of work on HBase and some of my patches have been committed already. Sometime in January I was able to run my whole pipeline of tools for the first time. An OSM parser reads the XML file and inserts the data into my HBase database. Another tool generates the tag statistics and yet another tool exports these values into a Solr instance, and it all seemed to work flawlessly.

The new part is that I've been able to get data from Solr and HBase into the web interface so I can now focus on the front end part. I've got a few problems there. OSMdoc has seven columns at the moment but that won't be nearly enough for the new version. I've got over twenty values for each key and ideally I'd like to display and allow sorting and filtering on each and every one of those, but it'll be much too wide to display all those on one line. I've got to find a solution - unfortunately I suck at HTML/CSS/JavaScript so this might take a while.

I hope to be able to provide screenshots and perhaps a demo sometime very soon.

In the meantime - starting February I'm looking for a new job...so if anyone has a good idea feel free to contact me :)

Update: I decided to start smaller. Once an updated history dump of OSM has been made available (I hope that'll be sometime in the next few weeks), I'll load and process it and will only try to replicate the current feature set of OSMdoc for a first new version.

There are various reasons for this. One is that I can't rely on all the functionality of Django and the other libraries (JavaScript and Python) that the current version uses, because the backend no longer uses a conventional database but mainly Solr. I'm using ajax-solr but there's a lot of manual work involved. Solr is a lot more powerful than the current PostgreSQL search, so I'll have to find good ways to integrate this functionality in an unobtrusive and easy way.

The other reason is that I still haven't figured out the hosting (I haven't spent much time on it either), as I'll need a lot more power for all the data and new servers. Ideally HBase should run on at least three to six servers but that won't be possible, so I'll try to cram as much as possible into Solr and load the rest on demand or from a PostgreSQL instance (which requires yet more work).

In case anyone feels generous or knows a company that might be willing to sponsor something here is a list of things I'm looking for (the absolute minimum): Hosting with 2 GB RAM+, 500 GB HDD space minimum (keep in mind that there is a full OSM history database involved), it has to be cheap as I'm currently between jobs (another thing I need) and I can't afford a big server or multiple EC2 instances. Do contact me if you have any questions.

This weekend I'll be at FOSDEM in Bruxelles, anyone else there from the OSM world?

Saturday, January 2, 2010

Full writeup of how I parse tags

Introduction

In this blog post I try to explain how (and why) I parse tags for the new version of OSMdoc.

First a bit of terminology:
  • A tag is a key-value pair with up to 255 Unicode characters each in the key and the value
  • So I call the whole thing a tag and the parts are the key and the value
  • An example tag: amenity (key) = restaurant (value)
The current versions of OSMdoc, Tagwatch and tagstat provide strictly statistical information without evaluating the tags in any way. I plan to change that in the new version of OSMdoc. So here is a description of the ways I (plan to) parse tags. Any comments are welcome.

Keys
  1. I increment the counters for the original key
  2. I check if the original key is a known misspelling of another key. If that is the case I proceed with that new key
  3. Trim the key and see if there are differences (this means that there was whitespace at the beginning or the end of the key), if there are I increment the counters for the trimmed key too
  4. I split every key on colons except when the key in question is on a blacklist of keys that should not be split
    1. I trim every resulting part[1] deleting empty parts
    2. For every part I add links of the types of parent, child, descendant, ascendant and root (see below for examples)
[1]: I chose to trim here by default because keys should be more or less well defined unlike values and in my opinion (the current data set seems to agree with me) those namespaces should be separated by colons and no additional whitespace. So I treat whitespace in keys as typos and ignore it at the beginning and end of parts. See below for an example what'd happen if I didn't do that.

I thought about adding a link from the last "qualifier" to its "unnamespaced" version but I chose not to do so because it doesn't seem to be a valuable link/idea considering the current data in the data set. For the example below that would mean a link from seamark:light:colour to colour. This can be added easily at a later time if needed.

Examples:
  • seamark:light:colour will resolve to the following entries
    • seamark → seamark:light [child]
    • seamark → seamark:light:colour [descendant]
    • seamark:light → seamark [parent]
    • seamark:light → seamark:light:colour [child]
    • seamark:light:colour → seamark:light [parent]
    • seamark:light:colour → seamark [root]
  • An example of why I chose to trim parts of the key by default: seamark: light:colour. This adds a lot of useless clutter to the correct key pages for just one single whitespace.
    • seamark → seamark:light [child]
    • seamark → seamark: light [child]
    • seamark → seamark:light:colour [descendant]
    • seamark → seamark: light:colour [descendant]
    • seamark:light → seamark [parent]
    • seamark:light → seamark:light:colour [child]
    • seamark:light → seamark: light:colour [child]
    • seamark: light → seamark [parent]
    • seamark: light → seamark:light:colour [child]
    • seamark: light → seamark: light:colour [child]
    • seamark:light:colour → seamark:light [parent]
    • seamark:light:colour → seamark: light [parent]
    • seamark:light:colour → seamark [root]
    • seamark: light:colour → seamark:light [parent]
    • seamark: light:colour → seamark: light [parent]
    • seamark: light:colour → seamark [root]
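
The splitting and linking of keys can be sketched in a few lines of Python. This is a simplified illustration, not the real implementation: the function names and the no_split parameter (the blacklist) are made up, the counters and the misspelling lookup are left out, and the rule for when a link is called "ascendant" rather than "root" is my reading of the list above (root for the first part of the key, ascendant for anything in between).

```python
def split_key(key, no_split=frozenset()):
    """Split a key on colons, trim each part and drop empty parts."""
    if key in no_split:
        return [key]
    return [part.strip() for part in key.split(":") if part.strip()]


def key_links(key, no_split=frozenset()):
    """Return (source, target, link type) tuples for all prefixes of a key."""
    parts = split_key(key, no_split)
    prefixes = [":".join(parts[:i]) for i in range(1, len(parts) + 1)]
    links = []
    for i, source in enumerate(prefixes):
        for j, target in enumerate(prefixes):
            if i == j:
                continue
            distance = j - i
            if distance == 1:
                kind = "child"
            elif distance > 1:
                kind = "descendant"
            elif distance == -1:
                kind = "parent"
            elif j == 0:
                kind = "root"
            else:
                kind = "ascendant"
            links.append((source, target, kind))
    return links


links = key_links("seamark:light:colour")
same_links = key_links("seamark: light:colour")  # stray whitespace is trimmed away
```

With trimming in place, both spellings of the key produce the same six links as in the first example, and none of the clutter from the second one.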

Values
  1. I split every value on semicolons except when the value in question is on a blacklist of values that should not be split
    1. If this results in more than one value I add the original (unsplit) value to the list
  2. For every resulting value in the list I do the following:
    1. Increment the counters for the current value in question
    2. Trim the value and see if there are differences (this means that there was whitespace at the beginning or the end of the value), if there are and the trimmed value is not empty I increment the counters for the trimmed values too
    3. Check if the value or the trimmed value are known misspellings and if that is the case increment the counters for the correct value
This means that the numbers that OSMdoc presents won't be the same numbers that the other programs present and they won't represent the actual numbers from the OSM database.

If a value is later found to be incorrectly trimmed or split on semicolons it should be easy to correct it since I still record the usage information even for the unsplit and incorrect values and I know how I processed the information and can just subtract the numbers from all those resulting values.

Do you think this is a good move or should I still record the precise numbers for each and every value? I could do that but it adds complexity in the front- and backend and I believe it to be unnecessary. In the long run I hope/think that most programs consuming OSM data will process the data first in a similar manner as simple typos and irrelevant whitespace shouldn't have a negative influence on works produced by using OSM data. They should still be fixed though where possible.

I never automatically mark values as misspellings, no matter how likely it seems. There are just too many languages and meanings to consider.

A few examples:
  • "restaurant" is just kept the same
  • "restaurant; parking" → "restaurant; parking", "restaurant", " parking" and "parking"
  • "restaurant; ;parking" → "restaurant; ;parking", "restaurant", " " and "parking"
  • "restaurant; praking" → "restaurant; praking", "restaurant", " praking", "praking" and "parking" (assuming that "praking" has been marked as a misspelling of "parking")
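
The value rules can be sketched like this. Again this is only an illustration under a few assumptions: count_value, misspellings and no_split are made-up names, a single Counter stands in for all the counters I keep, and I assume the correct value is incremented only once even if both the raw and the trimmed value are known misspellings.

```python
from collections import Counter


def count_value(raw, counters, misspellings=None, no_split=frozenset()):
    """Apply the value rules to one raw value and update the counters."""
    misspellings = misspellings or {}
    values = [raw] if raw in no_split else raw.split(";")
    if len(values) > 1:
        values.append(raw)  # keep the original, unsplit value as well
    for value in values:
        counters[value] += 1
        trimmed = value.strip()
        if trimmed and trimmed != value:
            counters[trimmed] += 1
        # Assumption: count the correct spelling once per value.
        correct = misspellings.get(value) or misspellings.get(trimmed)
        if correct:
            counters[correct] += 1


counters = Counter()
count_value("restaurant; praking", counters, misspellings={"praking": "parking"})
```

After this call the counters contain "restaurant; praking", "restaurant", " praking", "praking" and "parking", each with a count of one, which matches the last example in the list.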

Sunday, December 27, 2009

Christmas

I haven't got much done over the holidays but I'll start again today.

There is just one good thing to report. The database talked to the front end for the first time (i.e. the new HBase Thrift interface seems to work). So I'm now able to display data again. I'll see if I can get a beta version up that talks to my local database for limited demonstration purposes.

That also means that I'm now working on two more fronts: The HBase Thrift interface, the program to import the history dump into the database, the program to import diffs into the database, a Python API for the data and the new OSMdoc interface to display all that data. This will probably mean that the overall progress will slow down a bit but I currently plan to release at least the new version of the old functionality (tag statistics) sometime in February.

I'll keep you updated.