<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-8417410101561011253</id><updated>2012-02-17T00:27:01.499+01:00</updated><title type='text'>OSMdoc</title><subtitle type='html'>Temporary blog about development news regarding OSMdoc.com. So this might be quite technical and not very interesting for an average OpenStreetMap user at the moment. This might change later when I come closer to the release of OSMdoc2.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://osmdoc.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8417410101561011253/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://osmdoc.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Lars Francke</name><uri>https://profiles.google.com/109602018791202712990</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>12</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-8417410101561011253.post-7943674128921413008</id><published>2010-07-22T17:01:00.000+02:00</published><updated>2010-07-22T17:01:55.388+02:00</updated><title type='text'>Changes to this Blog</title><content type='html'>From now on I'll try to keep the posts in here a little bit less technical. 
Everything else can be found on my &lt;a href="http://blog.lars-francke.de/"&gt;personal blog&lt;/a&gt;. That said I've kicked that &lt;a href="http://blog.lars-francke.de/2010/07/22/processing-openstreetmap-data-with-hive/"&gt;blog&lt;/a&gt; off with a post about how I'll try to generate new data for this version of OSMdoc.&lt;br /&gt;&lt;br /&gt;I'll keep you updated here and on Twitter. I won't promise anything but I hope to have fresh data before the end of the month.&lt;br /&gt;&lt;br /&gt;As always just contact me if you have suggestions or questions.&lt;br /&gt;&lt;br /&gt;And sponsors are still very welcome!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8417410101561011253-7943674128921413008?l=osmdoc.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osmdoc.blogspot.com/feeds/7943674128921413008/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osmdoc.blogspot.com/2010/07/changes-to-this-blog.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8417410101561011253/posts/default/7943674128921413008'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8417410101561011253/posts/default/7943674128921413008'/><link rel='alternate' type='text/html' href='http://osmdoc.blogspot.com/2010/07/changes-to-this-blog.html' title='Changes to this Blog'/><author><name>Lars Francke</name><uri>https://profiles.google.com/109602018791202712990</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8417410101561011253.post-8208386657453513113</id><published>2010-06-02T11:58:00.000+02:00</published><updated>2010-06-02T11:58:41.324+02:00</updated><title type='text'>Import of OSM history</title><content type='html'>I've done &lt;a href="http://bitbucket.org/lfrancke/"&gt;some work&lt;/a&gt; on OSMdoc in the last few days. That code probably won't be useful for anyone else but I've made it public anyway.&lt;br /&gt;&lt;br /&gt;This morning I started the process of importing the complete &lt;a href="http://planet.openstreetmap.org/full-experimental/"&gt;history&lt;/a&gt; of OSM data into the database. If that works it should take about eight days at the current speed. Once that's done I'll have to get that data up to date, as the dump is from February 2010. I don't have any code yet to download the diffs but that shouldn't be too hard to do.&lt;br /&gt;&lt;br /&gt;I'll try to get back to regular blogging but in the meantime I'll try to post quick updates to &lt;a href="http://twitter.com/osmdoc"&gt;Twitter&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The import is currently running at about 1500 elements per second which involves quite a bit and isn't optimized for speed:&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Read XML data&lt;/li&gt;&lt;li&gt;Parse an element: Node, Way, Relation or Changeset&lt;/li&gt;&lt;li&gt;Serialize that data into a byte array by using &lt;a href="http://avro.apache.org/"&gt;Avro&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Send a message to an AMQP/RabbitMQ exchange containing the serialized data&lt;/li&gt;&lt;/ol&gt;&lt;div&gt;Another process is bound to the same exchange and listens for those messages:&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;Reads messages from RabbitMQ&lt;/li&gt;&lt;li&gt;Uses Avro to deserialize those messages into Java objects&lt;/li&gt;&lt;li&gt;Writes them to HBase&lt;/li&gt;&lt;/ol&gt;&lt;div&gt;This 
isn't optimal for bulk imports but it was a process I already had and which worked reasonably well. I'm not concerned about speed at the moment. This allows me to pretty easily switch to other backends or to do some more processing of the data.&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;I'm currently at about 20 million elements.&lt;br /&gt;&lt;br /&gt;As always: Feedback is welcome.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8417410101561011253-8208386657453513113?l=osmdoc.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osmdoc.blogspot.com/feeds/8208386657453513113/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osmdoc.blogspot.com/2010/06/import-of-osm-history.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8417410101561011253/posts/default/8208386657453513113'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8417410101561011253/posts/default/8208386657453513113'/><link rel='alternate' type='text/html' href='http://osmdoc.blogspot.com/2010/06/import-of-osm-history.html' title='Import of OSM history'/><author><name>Lars Francke</name><uri>https://profiles.google.com/109602018791202712990</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8417410101561011253.post-2088848656671493469</id><published>2010-04-23T12:43:00.000+02:00</published><updated>2010-04-23T12:43:14.042+02:00</updated><title type='text'>Status update March &amp; April</title><content type='html'>There haven't been any updates here because there isn't any news to report at the moment.&lt;br /&gt;&lt;br /&gt;I've been busy 
with other stuff but I'll start working on OSMdoc again next week and I hope to have something to report again real soon.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8417410101561011253-2088848656671493469?l=osmdoc.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osmdoc.blogspot.com/feeds/2088848656671493469/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osmdoc.blogspot.com/2010/04/status-update-march-april.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8417410101561011253/posts/default/2088848656671493469'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8417410101561011253/posts/default/2088848656671493469'/><link rel='alternate' type='text/html' href='http://osmdoc.blogspot.com/2010/04/status-update-march-april.html' title='Status update March &amp; April'/><author><name>Lars Francke</name><uri>https://profiles.google.com/109602018791202712990</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8417410101561011253.post-8146734737414558459</id><published>2010-03-07T23:55:00.000+01:00</published><updated>2010-03-07T23:55:10.317+01:00</updated><title type='text'>FOSSGIS, OpenSource and Hosting</title><content type='html'>A lot has happened since my last update. Shortly after my last post in January I began actively searching for a place to host the new OSMdoc version. 
A &lt;a href="http://lists.openstreetmap.org/pipermail/talk-de/2010-February/063103.html"&gt;thread&lt;/a&gt; on the German mailing list was very constructive and I decided to wait until the &lt;a href="http://www.fossgis.de/"&gt;FOSSGIS&lt;/a&gt; conference in Osnabrück before deciding how to move on. Just a heads up: The rest of this blog post will be a bit technical and probably uninteresting for most people.&lt;br /&gt;&lt;br /&gt;Following the thread I decided to begin &lt;a href="http://wiki.openstreetmap.org/wiki/OSMdoc"&gt;open sourcing&lt;/a&gt; parts of the stuff I'm doing. Most of it won't be very interesting at all for others but I've been asked for the source code multiple times...so here it is ;-) I'll continue cleaning up my existing code and release it on the Wiki page.&lt;br /&gt;&lt;br /&gt;I also decided to (once again) redesign parts of the architecture to be more flexible. I'll probably write this up in the future and include a flow chart. For now a short overview has to suffice:&lt;br /&gt;&lt;br /&gt;I've designed a very modular architecture (for various reasons which I'll go into later). I've written two simple projects reading input data: One for .osm and another for .osc files. Those projects do nothing more than parse the XML files, generate &lt;a href="http://bitbucket.org/lfrancke/openstreetmap-domain/src/tip/src/main/java/org/openstreetmap/domain/generated/"&gt;domain objects&lt;/a&gt;, serialize them and send them off to a &lt;a href="http://www.rabbitmq.com/"&gt;RabbitMQ&lt;/a&gt; exchange/queue. 
I use &lt;a href="http://hadoop.apache.org/avro/"&gt;Avro&lt;/a&gt;&amp;nbsp;to specify the &lt;a href="http://bitbucket.org/lfrancke/openstreetmap-domain/src/d6b7616142bc/src/main/resources/openstreetmap.avsc"&gt;schema&lt;/a&gt;&amp;nbsp;for the domain objects and to serialize them.&lt;br /&gt;I then use another &lt;a href="http://bitbucket.org/lfrancke/osm-hbase-import/"&gt;small tool&lt;/a&gt;&amp;nbsp;to subscribe to those messages and import them into the HBase database.&lt;br /&gt;&lt;br /&gt;I know that this process seems overly complicated but it is pretty easy to use, very easy to extend and very flexible (I've only written about the open-sourced parts, the rest uses this message based design too). I also know that this isn't the fastest way to do this but I don't care about speed that much and it is still very fast - I'd say fast enough.&lt;br /&gt;&lt;br /&gt;Which leads me to FOSSGIS:&amp;nbsp;While there were "only" two days for OSM and of those only one hour of discussions about the dev servers it was still very interesting and constructive. We discussed at length the problem of the data basis used by the different tools. At the moment everyone uses their own format or database but a lot of tools need to do the same. We briefly discussed a few common use cases (API DB, Mapnik DB, routing DB, planet files, etc.) and I think we agreed that there is a lot to be gained by a common way to process and save the data.&lt;br /&gt;&lt;br /&gt;This is another reason for my aforementioned design. In theory it'd allow other tools on the server (where OSMdoc is hosted) to subscribe to the same messages and thus easily process diff files. By generating their own messages other tools can benefit too from additional generated data. In the discussions at FOSSGIS PubSubHubBub was mentioned as a possible protocol to subscribe to those events. 
This has two advantages: It is probably even easier to implement for a lot of people because of the familiar protocol and it is easier to access from remote computers. Fortunately it should be very easy to use RabbitMQ internally and expose the feeds for which it makes sense by PubSubHubBub. I haven't done it yet and it is not on my to-do list but it'd be worth investigating.&lt;br /&gt;&lt;br /&gt;I'll continue going this way and I hope to start the import of the full history planet very soon now (the latest version was released mid-February) to see how everything performs. I am thinking about providing a data update for the old OSMdoc version in the meantime as the new version takes longer than expected. I'll see what I can do.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8417410101561011253-8146734737414558459?l=osmdoc.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osmdoc.blogspot.com/feeds/8146734737414558459/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osmdoc.blogspot.com/2010/03/fossgis-opensource-and-hosting.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8417410101561011253/posts/default/8146734737414558459'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8417410101561011253/posts/default/8146734737414558459'/><link rel='alternate' type='text/html' href='http://osmdoc.blogspot.com/2010/03/fossgis-opensource-and-hosting.html' title='FOSSGIS, OpenSource and Hosting'/><author><name>Lars Francke</name><uri>https://profiles.google.com/109602018791202712990</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8417410101561011253.post-301455365461122962</id><published>2010-01-31T17:49:00.001+01:00</published><updated>2010-02-03T16:41:22.339+01:00</updated><title type='text'>January update</title><content type='html'>I've kept quiet in January but that doesn't mean that there was no progress. A lot of effort and focus has been put towards the mapping in&amp;nbsp;&lt;a href="http://wiki.openstreetmap.org/wiki/WikiProject_Haiti"&gt;Haiti&lt;/a&gt;, but I've gotten a lot done on OSMdoc too.&lt;br /&gt;&lt;br /&gt;First of all I've done a lot of work on HBase and some of my patches have been&amp;nbsp;committed&amp;nbsp;already. Sometime in January I was able to first run my whole pipeline of tools. An OSM parser reads the XML file and inserts the data into my HBase database. Another tool generates the tag statistics and yet another tool exports these values into a Solr instance and it seemed to work flawlessly.&lt;br /&gt;&lt;br /&gt;The new part is that I've been able to get data from Solr and HBase into the web interface so I can now focus on the front end part. I've got a few problems there. OSMdoc has seven columns at the moment but that won't be nearly enough for the new version. I've got over twenty values for each key and ideally I'd like to display and allow sorting and filtering on each and every one of those but it'll be much too wide to display all those on one line. I've got to find a solution - unfortunately I suck at HTML/CSS/JavaScript so this might take a while.&lt;br /&gt;&lt;br /&gt;I hope to be able to provide screenshots and perhaps a demo sometime very soon.&lt;br /&gt;&lt;br /&gt;In the meantime - starting February I'm looking for a new job...so if anyone has a good idea feel free to contact me :)&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Update:&lt;/i&gt;&amp;nbsp;I decided to start smaller. 
Once an updated history dump of OSM has been made available (I hope that'll be sometime in the next weeks) I'll load and process that and will only try to replicate the current feature set of OSMdoc for a first new version.&lt;br /&gt;&lt;br /&gt;There are various reasons for this. One is that I can't rely on all the functionality of Django and the other libraries (JavaScript and Python) that the current version uses, as the backend doesn't use a conventional database anymore but mainly Solr instead. I'm using&amp;nbsp;&lt;a href="http://github.com/evolvingweb/ajax-solr"&gt;ajax-solr&lt;/a&gt;&amp;nbsp;but there's a lot of manual work involved. Solr is a lot more powerful than the current PostgreSQL search so I'll have to find good ways to integrate this functionality in an&amp;nbsp;unobtrusive&amp;nbsp;and easy way.&lt;br /&gt;&lt;br /&gt;The other reason is that I still haven't figured out the hosting (I haven't spent much time on it either) as I'll need a lot more power for all the data and new servers. Ideally HBase should run on at least three to six servers but that won't be possible so I'll try to cram as much as possible into Solr and load the rest on demand or from a PostgreSQL instance (which requires yet more work).&lt;br /&gt;&lt;br /&gt;In case anyone feels generous or knows a company that might be willing to sponsor something here is a list of things I'm looking for (the absolute minimum): Hosting with 2 GB RAM+, 500 GB HDD space minimum (keep in mind that there is a full OSM history database involved), it has to be cheap as I'm currently between jobs (another thing I need) and I can't afford a big server or multiple EC2 instances. 
Do&amp;nbsp;&lt;a href="mailto:lars.francke@gmail.com"&gt;contact&lt;/a&gt;&amp;nbsp;me if you have &lt;i&gt;any&lt;/i&gt;&amp;nbsp;questions.&lt;br /&gt;&lt;br /&gt;This weekend I'll be at FOSDEM in Bruxelles, anyone else there from the OSM world?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8417410101561011253-301455365461122962?l=osmdoc.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osmdoc.blogspot.com/feeds/301455365461122962/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osmdoc.blogspot.com/2010/01/january-update.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8417410101561011253/posts/default/301455365461122962'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8417410101561011253/posts/default/301455365461122962'/><link rel='alternate' type='text/html' href='http://osmdoc.blogspot.com/2010/01/january-update.html' title='January update'/><author><name>Lars Francke</name><uri>https://profiles.google.com/109602018791202712990</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8417410101561011253.post-7755968599884512245</id><published>2010-01-02T05:05:00.000+01:00</published><updated>2010-01-02T05:05:53.012+01:00</updated><title type='text'>Full writeup of how I parse tags</title><content type='html'>&lt;span class="Apple-style-span" style="font-size: x-large;"&gt;Introduction&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In this blog post I try to explain how (and why) I parse &lt;i&gt;tags &lt;/i&gt;for the new version of OSMdoc.&lt;br /&gt;&lt;br /&gt;First a bit of&amp;nbsp;terminology:&lt;br 
/&gt;&lt;ul&gt;&lt;li&gt;A &lt;i&gt;tag&lt;/i&gt;&amp;nbsp;is a &lt;i&gt;key-value&lt;/i&gt;&amp;nbsp;pair with up to 255 Unicode characters in the &lt;i&gt;key&lt;/i&gt; and &lt;i&gt;value&lt;/i&gt;&lt;/li&gt;&lt;li&gt;So I call the whole thing a &lt;i&gt;tag&lt;/i&gt;&amp;nbsp;and the parts are the &lt;i&gt;key&lt;/i&gt;&amp;nbsp;and the &lt;i&gt;value&lt;/i&gt;&lt;/li&gt;&lt;li&gt;An example tag: amenity&lt;i&gt;&amp;nbsp;&lt;/i&gt;(&lt;i&gt;key&lt;/i&gt;) = restaurant (&lt;i&gt;value&lt;/i&gt;)&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;The current versions of OSMdoc,&amp;nbsp;&lt;a href="http://tagwatch.stoecker.eu/"&gt;Tagwatch&lt;/a&gt;&amp;nbsp;and&amp;nbsp;&lt;a href="http://tagstat.hypercube.telascience.org/"&gt;tagstat&lt;/a&gt;&amp;nbsp;provide strictly statistical information without evaluating the &lt;i&gt;tags&lt;/i&gt;&amp;nbsp;in any way. I plan to change that in the new version of OSMdoc. So here is a description of the ways I (plan to) parse &lt;i&gt;tags&lt;/i&gt;. Any comments are welcome.&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="font-size: x-large;"&gt;Keys&lt;/span&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;I increment the counters for the original &lt;i&gt;key&lt;/i&gt;&lt;/li&gt;&lt;li&gt;I check if the original &lt;i&gt;key&lt;/i&gt; is a known misspelling of another &lt;i&gt;key&lt;/i&gt;. 
If that is the case I proceed with that new &lt;i&gt;key&lt;/i&gt;&lt;/li&gt;&lt;li&gt;Trim the&amp;nbsp;&lt;i&gt;key&lt;/i&gt;&amp;nbsp;and see if there are differences (this means that there was whitespace at the beginning or the end of the &lt;i&gt;key&lt;/i&gt;), if there are I increment the counters for the &lt;i&gt;trimmed&amp;nbsp;key&lt;/i&gt;&amp;nbsp;too&lt;/li&gt;&lt;li&gt;I split every &lt;i&gt;key&lt;/i&gt;&amp;nbsp;on colons except when the key in question is on a blacklist of &lt;i&gt;keys&lt;/i&gt;&amp;nbsp;that should not be split&lt;/li&gt;&lt;ol&gt;&lt;li&gt;I trim every resulting &lt;i&gt;part&lt;/i&gt;[1] deleting empty parts&lt;/li&gt;&lt;li&gt;For every &lt;i&gt;part &lt;/i&gt;I add links of the types &lt;i&gt;parent&lt;/i&gt;, &lt;i&gt;child&lt;/i&gt;, &lt;i&gt;descendant&lt;/i&gt;, &lt;i&gt;ascendant&lt;/i&gt;&amp;nbsp;and &lt;i&gt;root &lt;/i&gt;(see below for examples)&lt;/li&gt;&lt;/ol&gt;&lt;/ol&gt;&lt;div&gt;[1]: I chose to trim here by default because &lt;i&gt;keys&lt;/i&gt;&amp;nbsp;should be more or less well defined unlike &lt;i&gt;values&lt;/i&gt;&amp;nbsp;and in my opinion (the current data set seems to agree with me) those &lt;i&gt;namespaces&lt;/i&gt;&amp;nbsp;should be separated by colons and no additional whitespace. So I treat whitespace in &lt;i&gt;keys &lt;/i&gt;as typos and ignore it at the beginning and end of &lt;i&gt;parts&lt;/i&gt;. See below for an example of what'd happen if I didn't do that.&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I thought about adding a link from the last "qualifier" to its "unnamespaced" version but I chose not to do so because it doesn't seem to be a valuable link/idea considering the current data in the data set. For the example below that would mean a link from &lt;i&gt;seamark:light:colour&lt;/i&gt;&amp;nbsp;to &lt;i&gt;colour&lt;/i&gt;. 
This can be added easily at a later time if needed.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Examples:&lt;/b&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;i&gt;seamark:light:colour&lt;/i&gt;&amp;nbsp;will resolve to the following entries&lt;/li&gt;&lt;ul&gt;&lt;li&gt;&lt;i&gt;seamark&amp;nbsp;&lt;/i&gt;&lt;i&gt;&lt;span style="color: darkblue; font-family: monospace; line-height: 14px; white-space: pre;"&gt;→&lt;/span&gt;&amp;nbsp;seamark:light&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;[child]&lt;/li&gt;&lt;li&gt;&lt;i&gt;&lt;i&gt;seamark&amp;nbsp;&lt;/i&gt;&lt;i&gt;&lt;span style="color: darkblue; font-family: monospace; line-height: 14px; white-space: pre;"&gt;→&lt;/span&gt;&amp;nbsp;seamark:light:colour&lt;/i&gt;&amp;nbsp;&lt;span style="font-style: normal;"&gt;[descendant]&lt;/span&gt;&lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;&lt;span style="font-style: normal;"&gt;&lt;i&gt;seamark:light&amp;nbsp;&lt;/i&gt;&lt;i&gt;&lt;span style="color: darkblue; font-family: monospace; line-height: 14px; white-space: pre;"&gt;→&lt;/span&gt;&amp;nbsp;seamark&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;[parent] &lt;/span&gt;&lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;&lt;span style="font-style: normal;"&gt;&lt;i&gt;seamark:light&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;&lt;i&gt;&lt;span style="color: darkblue; font-family: monospace; line-height: 14px; white-space: pre;"&gt;→&lt;/span&gt;&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;&lt;i&gt;seamark:light:colour&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;[child]&lt;/span&gt;&lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;&lt;span style="font-style: normal;"&gt;&lt;i&gt;&lt;span style="font-style: normal;"&gt;&lt;i&gt;seamark:light:colour&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;&lt;i&gt;&lt;span style="color: darkblue; font-family: monospace; line-height: 14px; white-space: 
pre;"&gt;→&lt;/span&gt;&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;&lt;i&gt;seamark:light&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;[parent]&lt;/span&gt;&lt;/i&gt;&lt;/span&gt;&lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;&lt;span style="font-style: normal;"&gt;&lt;i&gt;&lt;span style="font-style: normal;"&gt;&lt;i&gt;&lt;span style="font-style: normal;"&gt;&lt;i&gt;seamark:light:colour&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;&lt;i&gt;&lt;span style="color: darkblue; font-family: monospace; line-height: 14px; white-space: pre;"&gt;→&lt;/span&gt;&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;&lt;i&gt;seamark&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;[root]&lt;/span&gt;&lt;/i&gt;&lt;/span&gt;&lt;/i&gt;&lt;/span&gt;&lt;/i&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;An example of why I chose to trim parts of the key by default:&lt;i&gt;&amp;nbsp;seamark: light:colour.&lt;/i&gt;&amp;nbsp;This adds a lot of useless clutter to the correct &lt;i&gt;key&lt;/i&gt;&amp;nbsp;pages for just one single whitespace.&lt;/li&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;/div&gt;&lt;ul&gt;&lt;li&gt;&lt;i&gt;seamark&amp;nbsp;&lt;/i&gt;&lt;i&gt;&lt;span style="color: darkblue; font-family: monospace; line-height: 14px; white-space: pre;"&gt;→&lt;/span&gt;&amp;nbsp;seamark:light&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;[child]&lt;/li&gt;&lt;li&gt;&lt;i&gt;seamark&amp;nbsp;&lt;/i&gt;&lt;i&gt;&lt;span style="color: darkblue; font-family: monospace; line-height: 14px; white-space: pre;"&gt;→&lt;/span&gt;&amp;nbsp;seamark: light&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;[child]&lt;/li&gt;&lt;li&gt;&lt;i&gt;&lt;i&gt;seamark&amp;nbsp;&lt;/i&gt;&lt;i&gt;&lt;span style="color: darkblue; font-family: monospace; line-height: 14px; white-space: pre;"&gt;→&lt;/span&gt;&amp;nbsp;seamark:light:colour&lt;/i&gt;&amp;nbsp;&lt;/i&gt;[descendant]&lt;/li&gt;&lt;li&gt;&lt;i&gt;&lt;i&gt;&lt;i&gt;seamark&amp;nbsp;&lt;/i&gt;&lt;i&gt;&lt;span 
style="color: darkblue; font-family: monospace; line-height: 14px; white-space: pre;"&gt;→&lt;/span&gt;&amp;nbsp;seamark: light:colour&lt;/i&gt;&amp;nbsp;&lt;/i&gt;&lt;span class="Apple-style-span" style="font-style: normal;"&gt;[descendant]&lt;/span&gt;&lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;seamark:light&amp;nbsp;&lt;/i&gt;&lt;i&gt;&lt;span style="color: darkblue; font-family: monospace; line-height: 14px; white-space: pre;"&gt;→&lt;/span&gt;&amp;nbsp;seamark&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;[parent]&lt;/li&gt;&lt;li&gt;&lt;i&gt;seamark:light&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;&lt;i&gt;&lt;span style="color: darkblue; font-family: monospace; line-height: 14px; white-space: pre;"&gt;→&lt;/span&gt;&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;&lt;i&gt;seamark:light:colour&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;[child]&lt;/li&gt;&lt;li&gt;&lt;i&gt;seamark:light&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;&lt;i&gt;&lt;span style="color: darkblue; font-family: monospace; line-height: 14px; white-space: pre;"&gt;→&lt;/span&gt;&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;&lt;i&gt;seamark: light:colour&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;[child]&lt;/li&gt;&lt;li&gt;&lt;i&gt;seamark: light&amp;nbsp;&lt;/i&gt;&lt;i&gt;&lt;span style="color: darkblue; font-family: monospace; line-height: 14px; white-space: pre;"&gt;→&lt;/span&gt;&amp;nbsp;seamark&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;[parent]&lt;/li&gt;&lt;li&gt;&lt;i&gt;&lt;i&gt;seamark: light&lt;/i&gt;&amp;nbsp;&lt;i&gt;&lt;span style="color: darkblue; font-family: monospace; line-height: 14px; white-space: pre;"&gt;→&lt;/span&gt;&lt;/i&gt;&amp;nbsp;&lt;i&gt;seamark:light:colour&lt;/i&gt;&amp;nbsp;&lt;/i&gt;[child]&lt;/li&gt;&lt;li&gt;&lt;i&gt;seamark: light&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;&lt;i&gt;&lt;span style="color: darkblue; font-family: monospace; line-height: 14px; white-space: pre;"&gt;→&lt;/span&gt;&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;&lt;i&gt;seamark: 
light:colour&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;[child]&lt;/li&gt;&lt;li&gt;&lt;i&gt;&lt;i&gt;seamark:light:colour&lt;/i&gt;&amp;nbsp;&lt;i&gt;&lt;span style="color: darkblue; font-family: monospace; line-height: 14px; white-space: pre;"&gt;→&lt;/span&gt;&lt;/i&gt;&amp;nbsp;&lt;i&gt;seamark:light&lt;/i&gt;&amp;nbsp;&lt;/i&gt;[parent]&lt;/li&gt;&lt;li&gt;&lt;i&gt;seamark:light:colour&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;&lt;i&gt;&lt;span style="color: darkblue; font-family: monospace; line-height: 14px; white-space: pre;"&gt;→&lt;/span&gt;&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;&lt;i&gt;seamark: light&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;[parent]&lt;/li&gt;&lt;li&gt;&lt;i&gt;seamark:light:colour&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;&lt;i&gt;&lt;span style="color: darkblue; font-family: monospace; line-height: 14px; white-space: pre;"&gt;→&lt;/span&gt;&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;&lt;i&gt;seamark&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;[root]&lt;/li&gt;&lt;li&gt;&lt;i&gt;&lt;i&gt;seamark: light:colour&lt;/i&gt;&amp;nbsp;&lt;i&gt;&lt;span style="color: darkblue; font-family: monospace; line-height: 14px; white-space: pre;"&gt;→&lt;/span&gt;&lt;/i&gt;&amp;nbsp;&lt;i&gt;seamark:light&lt;/i&gt;&amp;nbsp;&lt;/i&gt;[parent]&lt;/li&gt;&lt;li&gt;&lt;i&gt;seamark: light:colour&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;&lt;i&gt;&lt;span style="color: darkblue; font-family: monospace; line-height: 14px; white-space: pre;"&gt;→&lt;/span&gt;&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;&lt;i&gt;seamark: light&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;[parent]&lt;/li&gt;&lt;li&gt;&lt;i&gt;seamark: light:colour&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;&lt;i&gt;&lt;span style="color: darkblue; font-family: monospace; line-height: 14px; white-space: pre;"&gt;→&lt;/span&gt;&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;&lt;i&gt;seamark&lt;/i&gt;&lt;i&gt;&amp;nbsp;&lt;/i&gt;[root]&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;span style="font-size: 
x-large;"&gt;Values&lt;/span&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;I split every &lt;i&gt;value &lt;/i&gt;on semicolons except when the &lt;i&gt;value &lt;/i&gt;in question is on a blacklist of &lt;i&gt;values &lt;/i&gt;that should not be split&lt;/li&gt;&lt;ol&gt;&lt;li&gt;If this results in more than one&amp;nbsp;&lt;i&gt;value&lt;/i&gt;&amp;nbsp;I add the original (unsplit)&amp;nbsp;&lt;i&gt;value&lt;/i&gt;&amp;nbsp;to the list&lt;/li&gt;&lt;/ol&gt;&lt;li&gt;For every resulting&amp;nbsp;&lt;i&gt;value&lt;/i&gt;&amp;nbsp;in the list I do the following:&lt;/li&gt;&lt;ol&gt;&lt;li&gt;Increment the counters for the current&amp;nbsp;&lt;i&gt;value&lt;/i&gt;&amp;nbsp;in question&lt;/li&gt;&lt;li&gt;Trim the &lt;i&gt;value&lt;/i&gt; and see if there are differences (this means that there was whitespace at the beginning or the end of the &lt;i&gt;value&lt;/i&gt;), if there are and the &lt;i&gt;trimmed value &lt;/i&gt;is not empty I increment the counters for the &lt;i&gt;trimmed values&lt;/i&gt; too&lt;/li&gt;&lt;li&gt;Check if the &lt;i&gt;value&lt;/i&gt;&amp;nbsp;or the &lt;i&gt;trimmed value&lt;/i&gt;&amp;nbsp;are known misspellings and if that is the case increment the counters for the &lt;i&gt;correct value&lt;/i&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/ol&gt;&lt;div&gt;This means that the numbers that OSMdoc presents won't be the same numbers that the other programs present and they won't represent the actual numbers from the OSM database.&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;If a &lt;i&gt;value &lt;/i&gt;is later found to be incorrectly trimmed or split on semicolons it should be easy to correct it since I still record the usage information even for the unsplit and incorrect values and I know how I processed the information and can just&amp;nbsp;subtract&amp;nbsp;the numbers from all those resulting &lt;i&gt;values&lt;/i&gt;.&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Do you think this is a good 
move or should I still record the precise numbers for each and every &lt;i&gt;value&lt;/i&gt;? I could do that but it adds complexity in the front- and backend and I believe it to be unnecessary. In the long run I hope/think that most programs consuming OSM data will process the data first in a similar manner, as simple typos and irrelevant whitespace shouldn't have a negative influence on works produced by using OSM data. They should still be fixed though where possible.&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I never automatically mark values as misspellings no matter how likely it seems. There are just too many languages and meanings to consider.&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;A few examples:&lt;/b&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;"restaurant" is just kept the same&lt;/li&gt;&lt;li&gt;"restaurant; parking"&amp;nbsp;&lt;i&gt;&lt;i&gt;&lt;i&gt;&lt;span style="color: darkblue; font-family: monospace; line-height: 14px; white-space: pre;"&gt;→&lt;/span&gt;&lt;/i&gt;&lt;/i&gt;&lt;/i&gt;&amp;nbsp;"restaurant; parking", "restaurant", " parking" and "parking"&lt;/li&gt;&lt;li&gt;"restaurant; ;parking"&amp;nbsp;&lt;i&gt;&lt;i&gt;&lt;i&gt;&lt;span style="color: darkblue; font-family: monospace; line-height: 14px; white-space: pre;"&gt;→&lt;/span&gt;&lt;/i&gt;&lt;/i&gt;&lt;/i&gt;&amp;nbsp;"restaurant; ;parking", "restaurant", " ", "parking"&lt;/li&gt;&lt;li&gt;"restaurant; praking"&amp;nbsp;&lt;i&gt;&lt;i&gt;&lt;i&gt;&lt;span style="color: darkblue; font-family: monospace; line-height: 14px; white-space: pre;"&gt;→&lt;/span&gt;&lt;/i&gt;&lt;/i&gt;&lt;/i&gt;&amp;nbsp;"restaurant; praking", "restaurant", " praking", "praking" and "parking" (assuming that "praking" has been marked as a misspelling of "parking")&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/8417410101561011253-7755968599884512245?l=osmdoc.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osmdoc.blogspot.com/feeds/7755968599884512245/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osmdoc.blogspot.com/2010/01/full-writeup-of-how-i-parse-tags.html#comment-form' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8417410101561011253/posts/default/7755968599884512245'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8417410101561011253/posts/default/7755968599884512245'/><link rel='alternate' type='text/html' href='http://osmdoc.blogspot.com/2010/01/full-writeup-of-how-i-parse-tags.html' title='Full writeup of how I parse tags'/><author><name>Lars Francke</name><uri>https://profiles.google.com/109602018791202712990</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8417410101561011253.post-4664800592770355783</id><published>2009-12-27T18:47:00.000+01:00</published><updated>2009-12-27T18:47:08.494+01:00</updated><title type='text'>Christmas</title><content type='html'>I haven't got much done over the holidays but I'll start again today.&lt;br /&gt;&lt;br /&gt;There is just one good thing to report. The database talked to the front end for the first time (i.e. the new HBase Thrift interface seems to work). So I'm now able to display data again. 
I'll see if I can get a beta version up that talks to my local database for limited demonstration purposes.&lt;br /&gt;&lt;br /&gt;That also means that I'm now working on two more &lt;i&gt;fronts&lt;/i&gt;: The HBase Thrift interface, the program to import the history dump into the database, the program to import diffs into the database, a Python API for the data and the new OSMdoc interface to display all that data. This will probably mean that the overall progress will slow down a bit but I currently plan to release at least the new version of the &lt;i&gt;old &lt;/i&gt;functionality (tag statistics) sometime in February.&lt;br /&gt;&lt;br /&gt;I'll keep you updated.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8417410101561011253-4664800592770355783?l=osmdoc.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osmdoc.blogspot.com/feeds/4664800592770355783/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osmdoc.blogspot.com/2009/12/christmas.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8417410101561011253/posts/default/4664800592770355783'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8417410101561011253/posts/default/4664800592770355783'/><link rel='alternate' type='text/html' href='http://osmdoc.blogspot.com/2009/12/christmas.html' title='Christmas'/><author><name>Lars Francke</name><uri>https://profiles.google.com/109602018791202712990</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8417410101561011253.post-6629907141172563248</id><published>2009-12-16T12:55:00.001+01:00</published><updated>2009-12-17T14:42:31.773+01:00</updated><title type='text'>Relation roles</title><content type='html'>I'm currently writing parts of an analysis for relation roles. It works just like tags so it counts how often a role has been used and how often it has been used in combination with a specific key or key-value (tag). This is one of the most frequently requested features.&lt;br /&gt;&lt;br /&gt;But it also isn't as easy.&lt;br /&gt;&lt;br /&gt;Roles are potentially used thousands of times on a single relation. Should I count each one of those or only how many relations a role has been used on? It also isn't as easy to detect when a role has been removed or changed because relations may contain a member multiple times.&lt;br /&gt;&lt;br /&gt;The two types of counts would be something like this:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;role &lt;i&gt;forward&lt;/i&gt;&amp;nbsp;has been used 234,567 times&lt;/li&gt;&lt;li&gt;role &lt;i&gt;forward &lt;/i&gt;is being used on 12,345 relations&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;The latter is much easier to do and in my opinion more meaningful. 
I'm leaning towards it.&lt;br /&gt;&lt;br /&gt;I also don't know if the semicolon or colon is used in roles the same way they are used in keys or values but I think I won't do any special processing for the time being.&lt;br /&gt;&lt;br /&gt;Any input on relation roles would be welcome!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8417410101561011253-6629907141172563248?l=osmdoc.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osmdoc.blogspot.com/feeds/6629907141172563248/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osmdoc.blogspot.com/2009/12/relation-roles.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8417410101561011253/posts/default/6629907141172563248'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8417410101561011253/posts/default/6629907141172563248'/><link rel='alternate' type='text/html' href='http://osmdoc.blogspot.com/2009/12/relation-roles.html' title='Relation roles'/><author><name>Lars Francke</name><uri>https://profiles.google.com/109602018791202712990</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8417410101561011253.post-8594005592929146088</id><published>2009-12-14T02:56:00.003+01:00</published><updated>2009-12-14T03:21:16.630+01:00</updated><title type='text'>history planet, planet.gpx, development update</title><content type='html'>Unfortunately I haven't gotten much done in the last week but there is some news nonetheless.&lt;br /&gt;&lt;br /&gt;There is now an&amp;nbsp;&lt;a 
href="http://lists.openstreetmap.org/pipermail/dev/2009-December/017880.html"&gt;experimental full history&lt;/a&gt; version of the planet after we've fixed a few stupid bugs with the&amp;nbsp;&lt;a href="http://bitbucket.org/lfrancke/historydump/"&gt;program doing the export&lt;/a&gt;. If you are interested in the details of the schema that this program writes you should have a look at the source code. It is basically a normal .osm file with multiple versions of elements.&lt;br /&gt;&lt;br /&gt;Then there are &lt;a href="http://lists.openstreetmap.org/pipermail/dev/2009-December/017859.html"&gt;a few steps&lt;/a&gt; in the direction of a full dump of all the GPX data that was uploaded to OpenStreetMap (the privacy settings will be respected of course) and I'll try to get something done in the next few weeks.&lt;br /&gt;&lt;br /&gt;The license discussions also &lt;a href="http://lists.openstreetmap.org/pipermail/talk/2009-December/045627.html"&gt;touched&lt;/a&gt; OSMdoc but I decided to ignore that for now and just assume that this aggregated data will be fine.&lt;br /&gt;&lt;br /&gt;But there are a few developments and new problems with the processing of the data. Last week I asked about &lt;i&gt;namespaced&lt;/i&gt;&amp;nbsp;tags and &lt;i&gt;multivalued values&lt;/i&gt;&amp;nbsp;and I've implemented parts of it. I split every value on semicolons now and trim each part (e.g. "red; white" will be parsed to two values "red" and "white"). As long as I'm not sure if this works for the majority of tags I still include the original (unsplit) value. I can easily delete all those later. As always there are a few problems. What about duplicate values ("&lt;i&gt;red&lt;/i&gt;; white; &lt;i&gt;red&lt;/i&gt;")? Filter those out or leave them in?&lt;br /&gt;&lt;br /&gt;As for the keys, I've done nothing there so far as I'm not yet satisfied with the information I'm collecting and how to save and display this. 
One (constructed) example being if there is no entry for the actual &lt;i&gt;parent&lt;/i&gt;&amp;nbsp;tag.&lt;br /&gt;An example: &lt;i&gt;foo:bar&lt;/i&gt;. The &lt;i&gt;foo:bar&lt;/i&gt;&amp;nbsp;page would point to &lt;i&gt;foo&lt;/i&gt;. But what if &lt;i&gt;foo&lt;/i&gt; hasn't even been used?&lt;br /&gt;Another question I'm asking myself is if I should link from &lt;i&gt;left&lt;/i&gt; to &lt;i&gt;cycleway:left, buoy:left, highway:left&lt;/i&gt;?&lt;br /&gt;&lt;br /&gt;I've also partially implemented the "used with" functionality that was available in an earlier version of OSMdoc. Currently I record the following combinations:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;i&gt;key&lt;/i&gt;&amp;nbsp;to &lt;i&gt;key &lt;/i&gt;(How often were &lt;i&gt;amenity=* &lt;/i&gt;and &lt;i&gt;cuisine=* &lt;/i&gt;used together on the same element?)&lt;/li&gt;&lt;li&gt;&lt;i&gt;key&lt;/i&gt;&amp;nbsp;to &lt;i&gt;key-value &lt;/i&gt;(How often were &lt;i&gt;amenity=*&lt;/i&gt; and &lt;i&gt;cuisine=german&lt;/i&gt; used together?)&lt;/li&gt;&lt;li&gt;&lt;i&gt;key-value&lt;/i&gt;&amp;nbsp;to &lt;i&gt;key-value &lt;/i&gt;(How often were &lt;i&gt;amenity=restaurant&lt;/i&gt; and &lt;i&gt;cuisine=german&lt;/i&gt; used together?)&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;This is a lot more detailed than before and I hope it'll provide useful information (as an alternative to the &lt;i&gt;Useful combination&lt;/i&gt;&amp;nbsp;box on the OSM Wiki).&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I started a few other features but those are the ones I've made real progress on this week. So that's it for now but I hope I'll get more done this week.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Edit:&lt;/i&gt;&lt;br /&gt;I've got another question that I'm thinking about. When an element is &lt;a href="http://www.openstreetmap.org/browse/node/366396189/history"&gt;deleted&lt;/a&gt;&amp;nbsp;it is still in the database with all its tags. 
So the tags weren't really &lt;i&gt;removed&lt;/i&gt;&amp;nbsp;but should they still count? I'm leaning towards &lt;i&gt;no&lt;/i&gt;&amp;nbsp;at the moment. The problem is that elements can be undeleted&amp;nbsp;(simply by setting the &lt;i&gt;visible&lt;/i&gt;&amp;nbsp;attribute to &lt;i&gt;true &lt;/i&gt;again). So I would have to check if an element was previously deleted each time an element is modified. Very complicated.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8417410101561011253-8594005592929146088?l=osmdoc.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osmdoc.blogspot.com/feeds/8594005592929146088/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osmdoc.blogspot.com/2009/12/history-planet-planetgpx-development.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8417410101561011253/posts/default/8594005592929146088'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8417410101561011253/posts/default/8594005592929146088'/><link rel='alternate' type='text/html' href='http://osmdoc.blogspot.com/2009/12/history-planet-planetgpx-development.html' title='history planet, planet.gpx, development update'/><author><name>Lars Francke</name><uri>https://profiles.google.com/109602018791202712990</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8417410101561011253.post-4902883976143037701</id><published>2009-12-06T21:47:00.000+01:00</published><updated>2009-12-06T21:47:20.776+01:00</updated><title type='text'>Tags with "namespaces"</title><content type='html'>As far as I can 
tell there are more and more tags being used in a format that uses a single colon to delimit various parts of the key. This is used to implement&lt;i&gt;&amp;nbsp;namespaces&lt;/i&gt;&amp;nbsp;for tags.&lt;br /&gt;&lt;br /&gt;Examples are all the TIGER tags (&lt;i&gt;tiger:county&lt;/i&gt;, &lt;i&gt;tiger:source&lt;/i&gt;, though automatically generated), the&amp;nbsp;&lt;a href="http://wiki.openstreetmap.org/wiki/Key:addr"&gt;Karlsruhe schema for addresses&lt;/a&gt;, &lt;i&gt;name&lt;/i&gt;&amp;nbsp;or the &lt;i&gt;seamark &lt;/i&gt;tags used by OpenSeaMap. I believe that as more and more people use OSM, more namespaces will emerge for specialized data to avoid name clashes.&lt;br /&gt;&lt;br /&gt;These schemas can be used to do things like:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;i&gt;&lt;span class="Apple-style-span" style="font-style: normal;"&gt;&lt;i&gt;maxweight=7&lt;/i&gt;&lt;/span&gt;&lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;maxweight:agricultural=no&lt;/i&gt;&amp;nbsp;or &lt;i&gt;maxweight:except=agricultural&lt;/i&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;I like this schema; it's essentially a tree structure with the most general settings at the top of the tree, and these defaults can be overridden at lower levels. The nice thing about this format is that it's already a&amp;nbsp;&lt;a href="http://en.wikipedia.org/wiki/Materialized_path"&gt;Materialized path&lt;/a&gt;, which makes it very easy to store in a database.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I'm currently thinking of how to implement/support this in OSMdoc. I thought of just listing (for every key) all its children or parents with the optional possibility to mark a tag as "not namespaced" (this would be for keys that have a colon but for different reasons). I'd leave the search function as it is.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;But I'd be interested if there are any more evaluations or analyses I might have forgotten for this kind of tag. 
Any data you'd like to see... go ahead.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The next post will probably be about multivalued tags separated by semicolons.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8417410101561011253-4902883976143037701?l=osmdoc.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osmdoc.blogspot.com/feeds/4902883976143037701/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osmdoc.blogspot.com/2009/12/tags-with-namespaces.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8417410101561011253/posts/default/4902883976143037701'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8417410101561011253/posts/default/4902883976143037701'/><link rel='alternate' type='text/html' href='http://osmdoc.blogspot.com/2009/12/tags-with-namespaces.html' title='Tags with &quot;namespaces&quot;'/><author><name>Lars Francke</name><uri>https://profiles.google.com/109602018791202712990</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8417410101561011253.post-9023548299509524478</id><published>2009-12-05T03:08:00.000+01:00</published><updated>2009-12-05T03:08:50.393+01:00</updated><title type='text'>planet.gpx, Tagcounter &amp; Solr</title><content type='html'>It &lt;a href="http://lists.openstreetmap.org/pipermail/dev/2009-December/017859.html"&gt;looks like&lt;/a&gt; we might get a dump of most of the GPX traces that are available in OpenStreetMap as long as ... 
I (or someone else) write a program to export the data ;-) In that case I'd include that in OSMdoc but that should be very simple as these tags are not key-value pairs.&lt;br /&gt;&lt;br /&gt;I'm about halfway done with the changes to HBase and first feedback comments are very promising so I hope I'll be able to access the data from Python in a week or two. This would enable me to demo some parts of the new system.&lt;br /&gt;&lt;br /&gt;At the moment OSMdoc only displays &lt;i&gt;current&lt;/i&gt;&amp;nbsp;usage data but as the new version will have access to the full OSM history (the export has started btw.) I'll be able to provide a lot more data. What is done so far are these counters (amazingly fast thanks to HBase's&amp;nbsp;&lt;i&gt;&lt;a href="http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/client/HTable.html#incrementColumnValue(byte[], byte[], byte[], long)"&gt;atomic incrementColumnValue&lt;/a&gt;&lt;span class="Apple-style-span" style="font-style: normal;"&gt;):&lt;/span&gt;&lt;/i&gt;&lt;span id="goog_1259977881705"&gt;&lt;/span&gt;&lt;span id="goog_1259977881706"&gt;&lt;/span&gt;&lt;a href="http://www.blogger.com/"&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;How often is a key or a key-value pair used on a changeset&lt;/li&gt;&lt;li&gt;How often is a key or a key-value pair &lt;i&gt;currently&lt;/i&gt;&amp;nbsp;used on a node, way or relation&lt;/li&gt;&lt;li&gt;New are counters for how often a key or key-value pair has been added, left unmodified or removed from a node, way or relation. 
For keys I also record how often the value has changed.&lt;/li&gt;&lt;li&gt;How many distinct values does a key currently have and what was the maximum number of distinct values&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;This is only a small part of the new features and I don't really know how useful these values are but they were easy to implement.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Keys and values are also indexed in a&amp;nbsp;&lt;a href="http://lucene.apache.org/solr/"&gt;Solr&lt;/a&gt; server which should make searches blazingly fast and it should be possible to do very &lt;a href="http://lucene.apache.org/java/2_9_1/queryparsersyntax.html"&gt;complex searches&lt;/a&gt;&amp;nbsp;through keys and values. The only thing that I'm not yet sure about is whether I'll be able to provide substring searches in values as those are very expensive to implement (using n-grams). I'll just test it.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8417410101561011253-9023548299509524478?l=osmdoc.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osmdoc.blogspot.com/feeds/9023548299509524478/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osmdoc.blogspot.com/2009/12/planetgpx-tagcounter-solr.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8417410101561011253/posts/default/9023548299509524478'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8417410101561011253/posts/default/9023548299509524478'/><link rel='alternate' type='text/html' href='http://osmdoc.blogspot.com/2009/12/planetgpx-tagcounter-solr.html' title='planet.gpx, Tagcounter &amp; Solr'/><author><name>Lars Francke</name><uri>https://profiles.google.com/109602018791202712990</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8417410101561011253.post-8197811192962256804</id><published>2009-11-29T21:22:00.002+01:00</published><updated>2009-11-29T22:14:48.009+01:00</updated><title type='text'>Status update</title><content type='html'>I thought it would be about time for a quick status update about the development of OSMdoc and as I'm too lazy to set up something permanent I'll just use this blog until I get around to setting up something more permanent. This will at times be technical and not very interesting for those only interested in the data. After I've got something to show this will probably change.&lt;br /&gt;&lt;br /&gt;OSMdoc gets between 20 and 50 visitors a day. Not too bad considering that it is quite specific and not advertised very well. In the last few months I've received numerous feature requests and bug fixes. One of the most requested features of course was up-to-date data. The current data is from August 2009.&lt;br /&gt;&lt;br /&gt;So I decided it was time to do it all over and do it right this time. With the &lt;a href="http://www.mail-archive.com/dev@openstreetmap.org/msg09660.html"&gt;help&lt;/a&gt; of Matt Amos I hope that we'll have a complete dump of the OSM history and I plan to use that as the data basis for OSMdoc combined with the new replication diffs available thanks to Brett Henderson. The only drawback is that those diffs are missing changeset information.&lt;br /&gt;&lt;br /&gt;As there are numerous requests for data and analysis of this data that could not easily be done with the current PostgreSQL solution I decided to convert the data and insert it into an &lt;a href="http://hadoop.apache.org/hbase/"&gt;HBase&lt;/a&gt; database. 
I have most of the basic import code done (for OSM elements: nodes, ways, relations and changesets) that works with the full history dump. Some of the analysis steps are done too but for the moment I'm concentrating on getting the data out and building the rest incrementally. As the front end is written in Python (using Django) I need to access HBase from Python. So at the moment I'm working on the ticket&amp;nbsp;&lt;a href="https://issues.apache.org/jira/browse/HBASE-1744"&gt;HBASE-1744&lt;/a&gt;&amp;nbsp;to update the Thrift API. After that I'm back to work on the data.&lt;br /&gt;&lt;br /&gt;I'm still not sure how I will be able to host this as the current host is quite limited and more powerful servers also are a lot more expensive :)&lt;br /&gt;&lt;br /&gt;I'm always interested in your opinions, criticism, feature requests and general comments.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8417410101561011253-8197811192962256804?l=osmdoc.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osmdoc.blogspot.com/feeds/8197811192962256804/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osmdoc.blogspot.com/2009/11/status-update.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8417410101561011253/posts/default/8197811192962256804'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8417410101561011253/posts/default/8197811192962256804'/><link rel='alternate' type='text/html' href='http://osmdoc.blogspot.com/2009/11/status-update.html' title='Status update'/><author><name>Lars Francke</name><uri>https://profiles.google.com/109602018791202712990</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry></feed>
