Researching Digital Data: 2014

Friday, December 19, 2014

Creating an a4 with rulers to test the output accuracy of a printer

I recently bought a new printer and wanted to test its borderless printing capabilities. So I wrote a small VBA script that creates an A4 MS Word document with four rulers, each starting at one of the four borders:

In case you're interested in the source code, I put in on github. The PDF can be downloaded from here (http://renenyffenegger.ch/blog/files/a4-ruler.pdf).

Tuesday, December 16, 2014

The missing \b regular expression special character in Oracle.

Oracle's regular expressions don't support the special regular expression character \b, at least not in Oracle 11i.

Consider the following table:

create table tq84_word_boundaries ( txt varchar2(50) ); insert into tq84_word_boundaries values ('AFooABarAndABaz' ); insert into tq84_word_boundaries values ('A FooA BarAndABaz' ); insert into tq84_word_boundaries values ('A Foo, a Bar and a Baz'); insert into tq84_word_boundaries values ('A Foo without a Baz' ); insert into tq84_word_boundaries values ('Foo Bar, Baz' ); insert into tq84_word_boundaries values ('Is it a Foo?' ); insert into tq84_word_boundaries values ('Bar-Foo-Baz' );

Now, I want to find all records that contain the exact word Foo. That is, I want, for example A Foo without a Baz (the fourth record), but I don't want A FooA BarAndABaz (the second record) because FooA is not the exact word Foo

If Oracle supported \b I could use

select * from tq84_word_boundaries where regexp_like(txt, '\bFoo\b');

Yet, Oracle, does not support it and no record is returned.

To improve matters, I could try

select * from tq84_word_boundaries where regexp_like(txt, '\sFoo\s');

This approach returns

TXT -------------------------------------------------- A Foo without a Baz

A bit better, but far from perfect. For example, the fifth record (Foo Bar, Baz) is not returned, because the \s doesn't match start of line. So, I improve the where condition:

select * from tq84_word_boundaries where regexp_like(txt, '(^|\s)Foo($|\s)');

The record is now returned:

TXT -------------------------------------------------- A Foo without a Baz Foo Bar, Baz

Yet again, this is far from perfect. I need also record records where Foo is followed or lead by a non word character (such as ? or -):

select * from tq84_word_boundaries where regexp_like(txt, '(^|\s|\W)Foo($|\s|\W)');

This returns

TXT -------------------------------------------------- A Foo, a Bar and a Baz A Foo without a Baz Foo Bar, Baz Is it a Foo? Bar-Foo-Baz

I think I can live with it.

Links

Source code on github

Monday, December 15, 2014

Little things that make live easier #6: Using clip.exe in cmd.exe to copy somehting into the clipboard

Windows comes with a handy little program, named clip.exe, that can be used to quickly copy something into the clipboard from cmd.exe.

For example, if I wanted to copy the current working directory into the clipboard, I'd go

C:\some\long\and\complicated\path>echo %CD% | clip

Similarly, if I needed to copy all files within the current directory, I'd do

C:\some\path>dir /b | clip

Also, I can copy the content of a file into the clipboard like so:

C:\some\path>clip < someFile.txt

Friday, December 12, 2014

Paginating SQL queries with Oracle 12c

Oracle 12c not only allows to create TOP N queries with an understandable syntax, the row limiting clause also makes it possible to paginate a result set:

create table tq84_table ( id number, txt varchar2(10) ); insert into tq84_table values ( 2, 'two' ); insert into tq84_table values ( 1, 'one' ); insert into tq84_table values ( 6, 'six' ); insert into tq84_table values ( 8, 'eight'); insert into tq84_table values ( 3, 'three'); insert into tq84_table values ( 7, 'seven'); insert into tq84_table values ( 4, 'four' ); insert into tq84_table values ( 9, 'nine' ); insert into tq84_table values (10, 'ten' ); insert into tq84_table values ( 5, 'five' );

select id, txt from tq84_table order by id offset 4 rows -- skip first 4 records fetch next 3 rows only -- fetch next 3 records ;

returns

ID TXT ---------- ---------- 5 five 6 six 7 seven

Links

SQL statement on github

TOP N select queries with Oracle 12c

With Oracle 12c, it's possible, finally, to do TOP n select queries with an easily understandable syntax.

Here's a table to demonstrate it:

create table tq84_table ( id number, txt varchar2(10) ); insert into tq84_table values ( 2, 'two' ); insert into tq84_table values ( 1, 'one' ); insert into tq84_table values ( 6, 'six' ); insert into tq84_table values ( 8, 'eight'); insert into tq84_table values ( 3, 'three'); insert into tq84_table values ( 7, 'seven'); insert into tq84_table values ( 4, 'four' ); insert into tq84_table values ( 0, 'zero' ); insert into tq84_table values ( 9, 'nine' ); insert into tq84_table values (10, 'ten' ); insert into tq84_table values ( 5, 'five' ); commit;

Now, a select statement with fetch first row only:

select id, txt from tq84_table order by id fetch first row only;

selects

ID TXT ---------- ---------- 0 zero

Links

Paginating SQL queries with Oracle 12c

SQL on github

Thursday, December 11, 2014

Little things that make live easier #5: Navigation the Windows Registry with Powershell

It's possible to navigate the Windows Registry with Powershell

PS C:\> cd HKLM:/SOFTWARE/Microsoft/PowerShell PS HKLM:\SOFTWARE\Microsoft\PowerShell> ls Hive: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\PowerShell SKC VC Name Property --- -- ---- -------- 4 2 1 {Install, PID} PS HKLM:\SOFTWARE\Microsoft\PowerShell> cd 1 PS HKLM:\SOFTWARE\Microsoft\PowerShell\1> ls Hive: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\PowerShell\1 SKC VC Name Property --- -- ---- -------- 0 1 0407 {Install} 0 6 PowerShellEngine {ApplicationBase, PSCompatibleVersion, RuntimeVersion, ConsoleHostAssemblyName...} 1 0 PSConfigurationProviders {} 2 0 ShellIds {}

Monday, December 8, 2014

Little things that make live easier #4: writing colored text in cmd.exe

A constant source of frustration when using cmd.exe is that it seems impossible to write colored text easily, unlike other shells (although I don't consider cmd.exe to be a shell). Yet, if powershell is installed, then it's possible to write such colored text. Here's a powershell script (c:\temp\colors.ps1):

write-host "" write-host " " -noNewLine write-host "Warning" -foregroundColor red -backgroundColor yellow write-host " " -noNewLine write-host "-------" -foregroundColor blue write-host " Wearing headphones doesn't" -foregroundColor red write-host " make farts silent." -foregroundColor red

When executed with something like

powershell -executionpolicy bypass c:/temp/colors.ps1

It prints

Wednesday, December 3, 2014

Using Exporter in a Perl module

The Exporter module makes it easy for another module to define which functions are exported into the other module's users' namespace.

Here's a simple module to demonstrate that

package SomeModule; use warnings; use strict; use Exporter; use vars qw($VERSION @ISA @EXPORT @EXPORT_OK %EXPORT_TAGS); @ISA = qw(Exporter); $VERSION = 1; @EXPORT = qw(exp_one exp_two); @EXPORT_OK = qw(ok_one ok_two ok_three ok_four); %EXPORT_TAGS = ('all' => [ ( @EXPORT, @EXPORT_OK) ], 'some' => [ qw(ok_one ok_three exp_two) ] ); sub exp_one { return 'exp_one'; } sub exp_two { return 'exp_two'; } sub ok_one { return 'ok_one'; } sub ok_two { return 'ok_two'; } sub ok_three { return 'ok_three';} sub ok_four { return 'ok_four'; } 1;

And here's a script that uses SomeModule:

use warnings; use strict; use feature 'say'; use SomeModule; say exp_one(); say exp_two(); # say ok_one();

This script can call exp_one and exp_two from SomeModule because these two functions are listed in @EXPORT. Functions listed in @EXPORT are by default exported into the user's namespace.

ok_one, on the other hand, cannot be called directly as it is not listed in @EXPORT. It can, however, be called so SomeModule::ok_one().

Here's another script:

use warnings; use strict; use feature 'say'; use SomeModule qw(ok_one ok_two); # say exp_one(); say ok_one(); say ok_two();

Now, I specify the identifiers I want to have imported into my namespace (qw(ok_one ok_two)). Accordingly, I can call these.

I cannot call exp_one (which I could call in the first script), because as soon as I begin specifying identifiers to be exported I need to specify all of them.

Of course, it will soon be tiresome to always indicate which functions I want to have imported. Therefore, I can define tags with the %EXPORT_TAGS hash that groups multiple identifiers under a tag. For example, the tag all imports all functions. In my case, I just used the combined values of @EXPORT and @EXPORT_OK. This is demonstrated with the third script:

use warnings; use strict; use feature 'say'; use SomeModule (':all'); say exp_one(); say ok_one(); say ok_two();

Lastly, I can import some functions, too:

use warnings; use strict; use feature 'say'; use SomeModule (':some'); # say exp_one(); say ok_one(); say exp_two();

Links

Source code on github: SomeModule.pm, script.pl, script_2.pl, script_3.pl and script_4.pl

Wednesday, November 19, 2014

Bypassing ORA-00942: table or view does not exist

The infamous ORA-00942: table or view does not exist error is a constant source of frustration with Oracle.

Ususally, the cause for this error is that someone is granted access to a table via a role rather than directly. So, the user can select from that_table but as soon as he uses the statement in a compiled PL/SQL source, it won't work anymore, erroring out with this ORA-00942.

I can demonstrate this easily: First, a fresh user is created and given a few privileges:

connect / as sysdba create user ipc_user identified by ipc_user; grant create procedure, create session, select_catalog_role to ipc_user;

The user has been granted select_catalog_role so he can do

select count(*) from v$process;

and

select count(*) from v$session;

Now, the user wants to create a stored procedure that shows him the amount of memory used for his process:

create or replace function tq84_proc_memory return varchar2 as v_result varchar2(200); begin select 'Used: ' || round(pga_used_mem /1024/1024)||', '|| 'Alloc: ' || round(pga_alloc_mem /1024/1024)||', '|| 'Freeable: ' || round(pga_freeable_mem/1024/1024)||', '|| 'PGA Max: ' || round(pga_max_mem /1024/1024) into v_result from v$process where addr = (select paddr from v$session where sid = sys_context('USERENV','SID')); return v_result; end tq84_proc_memory; /

This procedure won't compile for the reasons outlined in the beginning of this post:

-- 5/5 PL/SQL: SQL Statement ignored -- 15/33 PL/SQL: ORA-00942: table or view does not exist

So, what to do? For such reasons, I have created the IPC PL/SQL package. The package's function exec_plsql_in_other_session uses dbms_job to create another session that can make full use of the granted privileges and also uses dbms_pipe to return a value to the caller.

Of course, I need to grant the required privileges also:

grant execute on dbms_job to ipc_user; grant execute on dbms_pipe to ipc_user; grant create job to ipc_user;

After installing the packages, I can rewrite my function like so:

Now, I can use the function to report some memory figures

select tq84_proc_memory from dual;

Links

Github:

SQL demonstration script

Friday, November 14, 2014

A PL/SQL bug

Today, I discovered what seems to be a PL/SQL bug. Here's the relevant part:

if guid_ is null then dbms_output.put_line('guid_ is null: ' || guid_); end if;

Under some special circumstances, it prints

guid_ is null: 07D242FCC55000FCE0530A30D4928A21

Of course, this is impossible if PL/SQL were executing correctly. Either guid_ is null or it has a value. Since the if statement executes the dbms_output line, I should assume that guid_ is indeed null. Yet, a value for guid_ is printed.

This behaviour can be reproduced, at least on Oracle 11R2, with the following code:

create type tq84_t as table of varchar2(32); / create type tq84_o as object ( dummy number(1), not final member procedure clear ) not final; / show errors create type tq84_d under tq84_o ( g varchar2(32), constructor function tq84_d return self as result, overriding member procedure clear ); / show errors create package tq84_h as t tq84_t; end tq84_h; / show errors create package body tq84_h as begin t := tq84_t(); end; / show errors create type body tq84_o as member procedure clear is begin null; end clear; end; / create type body tq84_d as constructor function tq84_d return self as result is begin g := sys_guid; return; end tq84_d; overriding member procedure clear is begin tq84_h.t.extend; tq84_h.t(tq84_h.t.count) := g; g := null; end clear; end; / show errors declare b tq84_o; -- Change to tq84_d ... guid_ varchar2(32); begin b := new tq84_d; guid_ := treat(b as tq84_d).g; b.clear; if guid_ is null then dbms_output.put_line('guid_ is null: ' || guid_); end if; end; / drop type tq84_t; drop type tq84_d; drop type tq84_o; drop package tq84_h;

I have also asked a question on stackoverflow.

Wednesday, November 12, 2014

The power of k-nearest-neighbor searches

I came across the k-nearest-neighbor (k-NN) algorithm recently. Altough it's a relatively simple algorithm, its power still amazed me.

K-NN can be used to classify sets of data when the algorithm is fed with some examples of classifications.

In order to demonstrate that, I have written a perl script. The script creates two csv files (known.csv and unknown) and a third file: correct.txt. The k-NN algorithm will use known.csv to train its understanding of a classification. Then, it tries to guess a classification for each record in unknown.csv. For comparing purposes, correct.txt contains the classification for each record in unknown.csv.

Structure of `known.csv`

known.csv is a csv file in which each record consists of 11 numbers. The first number is the classifaction for the record. It is a integer between 1 and 4 inclusively. The remaining 10 numbers are floats between 0 and 1.

Structure of `unknown.csv`

In unknown.csv, each record consists of 10 floats between 0 and 1. They correspond to the remaining 10 numbers in known.csv. The classification for the records in unkown.csv is missing in the file - it is the task of the k-NN algorithm to determine this classification. However, for each record in unknown.csv, the correct classification is found in correct.txt

Values for the floats

A record's classification determines the value-ranges for the floats in the record, according to the following graphic:

The four classifications are represented by the four colors red, green, blue and orange. The Perl script generated eight floats. For range for the first float for the red classification is [0.15,0.45], the range for the first float for the green classification is [0.35,0.65] etc. Similarly, the range for the second value of the red classification is [0.55,0.85] and so on.

To make things a bit more complicated, two random values in the range [0,1] are added to the eight values resulting in 10 values. These two values can either be at the beginning, at the end or one at the beginning and the other at the end.

Result

When I let the Perl script create 1000 unknown and known records, the following r script guesses more than 985 classifications right, most of the time.

library(FNN) known <- read.csv("known.csv" , header=FALSE) unknown <- read.csv("unknown.csv", header=FALSE) labels <- known[, 1] known <- known[,-1] results <- (0:4)[knn(known, unknown, labels, k = 10, algorithm="cover_tree")] write(results, file="guessed.txt", ncolumns=1)

Links

Source code on github

I find that rather impressive

Tuesday, November 11, 2014

Introductory Scripts for Active Session History and Automatic Workload Repository

A client of mine asked me to provide a short introductory talk on Oracle's Active Session History (ASH) and Automatic Workload Repository (AWR). I now have put the demonstration scripts on github at https://github.com/ReneNyffenegger/oracle_scriptlets/tree/master/ash-awr/Introduction-Nov-3rd-14. Maybe, someone can profit from the scripts?

Tuesday, November 4, 2014

Little things that make live easier #3: javascripts' Number.toFixed() for a fixed number of digits after the decimal point.

Javascript's Number object has the toFixed method that allows display a number with a fixed amount of digits after the decimal point. Unfortunately, it doesn't round the number.

function tf(n) { out.innerHTML += n.toFixed(3) + '<br>'; } tf(4 ); // 4.000 tf(4.2 ); // 4.200 tf(4.1004); // 4.100 tf(4.1009); // 4.100

Links

Source code on github

Friday, October 31, 2014

Little things that make live easier #2: sys.ora_mining_number_nt

With Oracle, sys.ora_mining_number can be used to turn a set of numbers to a result set:

SQL> select * from table(sys.ora_mining_number_nt(4, 20, 15)); COLUMN_VALUE ------------ 4 20 15

Thursday, October 30, 2014

Is this a bug in the PL/SQL compiler

I had some leisure time and sifted through some of my old stackoverflow questions and answers. On December 22nd, 2010, I asked the question Is this a bug in the PL/SQL compiler. It turned out, that the following PL/SQL compiles (at least on Oracle 11):

create or replace package return as subtype return is varchar2(10); end return; / create or replace package tq84 as constant constant return . return := 'return'; function function return return . return; end tq84; /

I still find this quite funny, so I had to post it on this blog!

Thursday, October 23, 2014

Creating psychedelic images with word and VBA

The Shape object of MS-Word can be used to insert pictures into word documents with VBA. If I insert a small picture and blow it up to use the entire page size, it creates a cool, almost psychedelic effect.

Here's the image: Its size is 20x30 pixels, approximately the aspect ratio of an A4 document.

The function, named main needs a parameter, path, that points to the location of the image:

sub main(path)

First, we declare a variable for the image (background_image), which is a shape), then load an image and assign it to the variable:

dim background_image as shape set background_image = activeDocument.shapes.addPicture( _ fileName := path & "\background.png", _ linkToFile := false)

Then, we need to place the image's top left corner on the page's top left corner:

background_image.relativeVerticalPosition = _ wdRelativeVerticalPositionPage background_image.top = 0 background_image.relativeHorizontalPosition = _ wdRelativeHorizontalPositionPage background_image.left = 0

Finally, we want the image to be behind the text in case we're going to write on it:

background_image.zOrder msoSendBehindText

This VBA program should be saved into a file, for example backgroundImage.bas. It can then be executed with a

runVBAFilesInOffice.vbs -word backgroundImage -c main %CD%

The resulting word (or image) then looks like

Links

See this link for runVBAFilesInOffice.vbs.

Source code on github

Friday, October 17, 2014

Cloning a user in Oracle

dbms_metadata makes it simple to clone a user. Here's a demonstration.

First, I create a small schema with some simple grants: The first user (a_user) is needed to contain a few tables to test if grants to these tables will be cloned:

create user a_user identified by a_password;

The second user (user_to_be_cloned) is the user that I will actually duplicate:

create user user_to_be_cloned identified by "Secret*49" quota 10M on users;

I also need a role through which I will grant an object privilege:

create role a_role;

Creating the mentioned tables:

create table a_user.table_01 (id number); create table a_user.table_02 (id number); create table a_user.table_03 (id number);

Granting some privileges:

grant create session, create table to user_to_be_cloned; grant select, insert on a_user.table_01 to user_to_be_cloned; grant all on a_user.table_02 to a_role; grant a_role to user_to_be_cloned;

Now, I want an sql script that contains the necessary statements to clone the user. The following script uses the SQL*Plus spool command to create the file (so it is designed to be run in SQL*Plus).
It also uses replace a lot to change USER_TO_BE_CLONED to CLONED_USER.

set heading off set pages 0 set long 1000000 set termout off exec dbms_metadata.set_transform_param( dbms_metadata.session_transform, 'SQLTERMINATOR', true); spool create_cloned_user.sql select replace( dbms_metadata.get_ddl( 'USER', 'USER_TO_BE_CLONED' ), '"USER_TO_BE_CLONED"', 'CLONED_USER' ) from dual; select replace( dbms_metadata.get_granted_ddl( 'SYSTEM_GRANT', 'USER_TO_BE_CLONED' ), '"USER_TO_BE_CLONED"', 'CLONED_USER' ) from dual; select replace( dbms_metadata.get_granted_ddl( 'OBJECT_GRANT', 'USER_TO_BE_CLONED' ), '"USER_TO_BE_CLONED"', 'CLONED_USER' ) from dual; select replace( dbms_metadata.get_granted_ddl( 'ROLE_GRANT', 'USER_TO_BE_CLONED' ), '"USER_TO_BE_CLONED"', 'CLONED_USER' ) from dual; select replace( dbms_metadata.get_granted_ddl( 'TABLESPACE_QUOTA', 'USER_TO_BE_CLONED' ), '"USER_TO_BE_CLONED"', 'CLONED_USER' ) from dual; spool off set termout on

The created script can now be used to create the cloned user. After logging on with CLONED_USER you can verify that the grants are cloned, too, by trying to select from a_user.table_01, a_user.table_02 and a_user.table_03.

Links

Source code on github

Thursday, October 9, 2014

How UBS cuts costs for finding solutions to their Big Data problems

This is interesting: the UBS Innovation Challenge.

Here's what it boils down to: UBS has lots of data and everybody talks about turning Big Data into Big Money. This is exactly what UBS want to to. The problem: UBS lacks the expertise and know how to tackle this kind of problem. After all, they're a bank. So, what can they do?

They could go to one of the many new Big Data consultancies that spring up like mushrooms these days and ask them for a solution. Yet, that is risky. How can a bank determine if such a consultancy is worth the money, especially in a field where those companies are relatively young. UBS could contact many consultancies. Of course, that will drive prices even higher. There must be a better way.

Indeed, UBS has found a better way. They do the UBS Innovation Challenge. Everybody can participate with ideas tailored to the UBS problem domain. UBS screens those ideas, selects a "best idea" and gives the winner SGD 40'000. There are additional costs, such as flights to Singapore and accomodations for the final participiants, but the costs for UBS are manageable.

So, UBS benefits in two ways: 1) They are presented with not only one or two ideas but with many ideas. 2) The costs for these ideas is known in advance.

Saturday, October 4, 2014

An open street map node density map for Switzerland

I'd like to know how open street map nodes are distributed in Switzerland, that is, I want to map their density on an image. This should be a fairly easy task with Python since I have already loaded Switzerland into an sqlite database (See OpenStreetMap: convert an pbf to an sqlite database with Python and download-switzerland-pbf.py [github]).

First, I create a table that contains the count of nodes per pixel. I fill this this table with a create table... as ...count(*).... group by. The values for the variables a through f determine the aspect ration and the size of the resulting picture:

create table count_per_pixel_ch as select count(*) cnt, cast ( ( lon - a ) / c * e as int) x, cast ( ( lat - b ) / d * f as int) y from node group by x, y;

x and y are unique. This pair represents a pixel on the final image. cnt is the number of nodes that fall into the area covered by a x/y pair.

After filling this table, I can iterate over each pixel and draw a color that is lighter for large pixel counts and darker for small pixel counts. I use the Python Image Library (import Image) to draw the image.

import Image # # Code filling the table # for r in cur.execute(""" select x, y, cnt from count_per_pixel_ch where x >= 0 and x < {image_width_px} and y >= 0 and y < {image_height_px} """.format( image_width_px = image_width_px , image_height_px = image_height_px)): x = r[0] y = image_height_px - r[1] - 1 cnt = float(r[2]) blue = int(float(cnt)/float(avg_count_per_pixel) * 255.0) green = 0 if blue > 255: green = blue - 255 blue = 255 pixels[x, y] = (blue/2, green, blue)

Here's the result:

The same picture with a better resolution.

Links

Source code on github

Wednesday, October 1, 2014

A Google Earth hiking map for the Säntis region with Open Street Map data

After I have loaded a Switzerland pbf file (see download-switzerland-pbf.py or download-switzerland-pbf.bat) into an sqlite database (see OpenStreetMap: convert an pbf to an sqlite database with Python ), I can use this data to create a Google Earth kml file that highlights objects found in Open Street Map.

I am particularly intersted in creating an SAC hiking map for the Säntis region. For that end, I chose osm ways that have a tag named sac_scale.
Additionally, I restricted the respective nodes to the lattitude longitude for the Säntis region.

These ways are kept in a specific table:

create table sac_ways_around_saentis as select distinct wy.id way_id from tag tg join way wy on wy.id = tg.way_id join node_in_way nw on wy.id = nw.way_id join node nd on nd.id = nw.node_id where nd.lat > 47.2210118322 and nd.lat < 47.2604651483 and nd.lon > 9.3149215728 and nd.lon < 9.3959004678 and tg.k = 'sac_scale'

With this table, I can now extract the lattitude/longitude pairs for each node in the relevant ways and write them into a kml file.

The algorithm basically boils down to:

select way_id from sac_ways_around_saentis -- with each way_id: select node_id from node_in_way where way_id = ? order by order_ -- with node_id select lat, lon from node where id = ? -- Emit lat, lon into kml file

Of course, I would use a Python script for this (sac_ways_saentis.py on github).

Here's a screen shot of the result:

Tuesday, September 30, 2014

OpenStreetMap: convert an pbf to an sqlite database with Python

Recently, I have demonstrated how the open street map pbf parser (OSMpbfParser.py) can be used to create xml files. With little effort, a script can be written that creates and fills an sqlite database. Of course, I first have to define a schema into which I load the data. I think this is also a good exercise to understand the internals of osm data. The tables are adapted from the original open street data database schema.

Tables

Table node

The node table is the fundamental table for open street map. Each node consists of a longitude (lon) / latitude (lat) pair and thus defines a geographical point on the surface of the earth. The original schema has a few more attributes, such as changeset_id or version. I am not interested in these, so I don't load them:

create table node( id integer primary key, lat real not null, lon real not null );

Tables way and node_in_way

Each "line" (such as a street or border etc) on open street map is an ordered list of nodes. Open Street Map calles them ways. Ways are also used to define areas in which case the last node equals the first node in the list.

First, I need a table just to store each way. Again, I am not interested in attributes such as user or changeset, so I just load the way's id:

create table way( id integer primary key );

The table node_in_way relates nodes and ways:

create table node_in_way( way_id integer not null references way deferrable initially deferred, node_id integer not null references node deferrable initially deferred, order_ integer not null )

Tables relation and member_in_relation

Open Street Map allows to relate multipe nodes and ways (and even relations) in a relation:

create table relation( id integer primary key );

Each member (that is either node, way or relation) is listed in member_in_relation. Thus, exactly one of the attributes node_id, way_id or relation_id is not null:

create table member_in_relation( id_of_relation integer not null references relation deferrable initially deferred, node_id integer null references node deferrable initially deferred, way_id integer null references way deferrable initially deferred, relation_id integer null references relation deferrable initially deferred, role text )

Table tag

Finally, there are tags. A tag is a key (k) value (v) pair that can be assigned to nodes, ways or relations. Often, the values for specific keys define what an object is. As in member_in_relation exactly on of node_id, way_id or relation_id is not null.

create table tag( node_id integer null references node deferrable initially deferred, way_id integer null references way deferrable initially deferred, relation_id integer null references relation deferrable initially deferred, k text not null, v text not null )

ERD

Here's the ERD for these tables:

The ERD was created with dia from pbf2sqlite-erd.dia.

Loading the pbf to a sqlite db

In order to run the script, a pbf must be obtained, for example with download-switzerland-pbf.py.

Then, this pbf can be loaded into a sqlite db on the command line with

pbf2sqlite.py xyz.pbf xyz.db

The script is on github: pbf2sqlite.py.

Links

Parsing an Open Street Map pbf file with Python

Open Street Map: convert pbf to xml

A Google Earth hiking map for the Säntis region with Open Street Map data

An open street map node density map for Switzerland

Wednesday, September 24, 2014

Why is there no == operator in SQL?

Why is there no == operator in SQL that yields true if both operands are either null or have the same value?

Here's the truth table for the = operator:

=	42	13	null
42	true	false	null
13	false	true	null
null	null	null	null

This has some implications. The statement

select * from t where col1 = col2

won't return a record where both col1 and col2 are null.

I suspect that in most cases this is not what the author of such a statement wants. Therefore, they will rewrite the query so:

select * from t where ( a is null and b is null) or ( a = b)

Now, if there were a == operator with this truth table:

==	42	13	null
42	true	false	false
13	false	true	false
null	false	false	true

it would definitely make my life easier (and would not cost the database companies too much money to implement).

Maybe I am all wrong and there is such a thing somewhere. If you know of such an operater in any database product, please let me know!

Tuesday, September 23, 2014

Oh that pesky Oracle outer join symbol (+)

Here are three tables, named A, A2Z and Z that I want to outer join from left to right:

The first table, A, contains the (primary keys) 1 through 7. I want my select to return each of these, that's why I use an outer join. A's column i is outer joined to A2Z's column ia:
A.i = A2Z.ia (+).
The (+) symbol indicates that a record should be returned even if there is no matching record. Note, the (+) is on the side of the = where "missing" records are expected.

Now, the records in A2Z should be joined to Z if A2Z's column flg is equal to y (colored green in the graphic above). For example, for the 1 in A, I expected the query to return the a in Z, for the 2 in A I expect no matching record in Z since the corresponding flg is either null or not equal to y.
This requirement can be implemented with a
A2Z.flg (+) = 'y'
Note, the (+) is on the side where missing records (or null values) are expected. Since y is neither missing nor null, it goes to the other side.

Finally, A2Z needs to be joined to Z:
A2Z.iz = Z.i (+)

Complete SQL Script

create table A (i number(1) primary key); create table Z (i char (1) primary key); create table A2Z ( ia not null references A, flg char(1) not null, iz char(1) not null ); insert into A values (1); insert into A values (2); insert into A values (3); insert into A values (4); insert into A values (5); insert into A values (6); insert into A values (7); insert into Z values ('a'); insert into Z values ('b'); insert into Z values ('c'); insert into Z values ('d'); insert into A2Z values (1, 'y', 'a' ); insert into A2Z values (1, 'n', 'b' ); insert into A2Z values (2,null, 'c' ); insert into A2Z values (2, 'q', 'd' ); insert into A2Z values (3, 'y', 'e' ); insert into A2Z values (4, , 'f' ); insert into A2Z values (5, 'y', null); insert into A2Z values (6, 'v', null); select A.i a_i, Z.i z_i from A, A2Z, Z where A.i = A2Z.ia (+) and A2Z.flg (+) = 'y' and A2Z.iz = Z.i (+) order by A.i; drop table A2Z purge; drop table Z purge; drop table A purge;

When run, the select returns the following records:

       A_I Z
---------- -
         1 a
         2
         3
         4
         5
         6
         7

Monday, September 22, 2014

Open Street Map: convert pbf to xml

Here's an example on how the open street map parser can be used to create xml files. Please excuse the wide source...

import sys import time import OSMpbfParser def xml_escape(s_): s_=s_.replace ("&", "&" ) s_=s_.replace ("<", "<" ) s_=s_.replace (">", ">" ) s_=s_.replace ('"', '"') return s_ def callback_node(node): stamp=time.strftime("%Y-%m-%dT%H:%M:%SZ",time.gmtime(node.time)) if len(node.Tags)>0: fh.write(' \n' % (node.NodeID, node.version, stamp, node.uid, xml_escape(node.user), node.changeset, node.Lat, node.Lon)) for t in node.Tags.keys(): fh.write(' \n' % (t, xml_escape(node.Tags[t]))) fh.write(' \n') else: fh.write(' \n' % (node.NodeID, node.version, stamp, node.uid, xml_escape(node.user), node.changeset, node.Lat, node.Lon)) def callback_way(way): stamp=time.strftime("%Y-%m-%dT%H:%M:%SZ",time.gmtime(way.time)) fh.write(' \n' % (way.WayID, way.version, stamp, way.uid, xml_escape(way.user), way.changeset)) for n in way.Nds: fh.write(' \n' % (n)) for t in way.Tags.keys(): fh.write(' \n' % (t,xml_escape(way.Tags[t]))) fh.write(' \n') def callback_relation(relation): stamp=time.strftime("%Y-%m-%dT%H:%M:%SZ",time.gmtime(relation.time)) fh.write(' \n' % (relation.RelID, relation.version, stamp, relation.uid, xml_escape(relation.user), relation.changeset)) for m in relation.Members: fh.write(' \n' % (m.type, m.ref, xml_escape(m.role))) for t in relation.Tags.keys(): fh.write(' \n' % (t,xml_escape(relation.Tags[t]))) fh.write(' \n') # First argument is *.pbf file name pbf_filename = sys.argv[1] # Second (optional) argument is output file if len(sys.argv) > 2: fh = open(sys.argv[2], 'w') else: fh = sys.stdout fh.write('\n') fh.write('\n') OSMpbfParser.go(pbf_filename, callback_node, callback_way, callback_relation) fh.write('\n') fh.close()

This script takes one mandatory and an optional argument. The mandatory argument specifies the pbf file. If the optional parameter is given, it specifies the name of the xml file into which the output is written, otherwise, the output goes to stdout:

c:\> pbf2xml.py liechtenstein-latest.osm.pbf liechtenstein.xml

Links

Source code on github

open street map parser

OpenStreetMap: convert an pbf to an sqlite database with Python

Parsing an Open Street Map pbf file with Python

Chris Hill at http://pbf.raggedred.net/ has written a parser in Python for open street map pbf files. His parser is free software, so I used this liberty to adapt it for my needs. In particular, his script either collects osm node, way and relation data either in memory (which is a problem for big pbf files) or it emits xml files with the osm data. I have changed his script so that I can pass three callback functions that are invoked as soon as the parser finished with one of the three fundamental osm type node, way or relation.

Here's a template that can be used to write a script that uses my adaption of the parser:

import OSMpbfParser def callback_node(node): do_something_with(node) def callback_way(way): do_something_with(way) def callback_relation(relation): do_something_with(relation) OSMpbfParser.go( 'the-file-to-be-parsed.pbf', callback_node, callback_way, callback_relation)

Each of the callback functions has exactly one parameter that corresponds to the classes OSMNode, OSMWay and OSMRelation (see the source at github).

Installing google's protocol buffers

To parse pbf files, google's protocol buffers are needed. The can be optained from code.google.com/p/protobuf/downloads/list.
For Windows, I downloaded protoc-2.5.0-win32.zip which contains one file: protoc.exe. After extracting this file, the environment variable PATH should be changed so that it points to the directory with protoc.exe.

For the python installation, the full source protobuf-2.5.0.tar.bz2 is also needed. After extracting them, cd into the python directory and execute:

cd protobuf-2.5.0\protobuf-2.5.0\python python setup.py build python setup.py test python setup.py install

Links

Source code on github

convert pbf files to xml

OpenStreetMap: convert an pbf to an sqlite database with Python

Wednesday, September 17, 2014

Creating a simple TopoJSON map

I want to create a simple map for a country (perhaps an island) that consists of three regions. Here's how this country looks like:

The three regions are confined by what TopoJSON refers to as arcs: A red arc to the West, an orange arc to the North East, a blue arc to the South east, and a light blue, a purple and a green arc going to the center of the country. To draw the map with these regions, I need to create a TopoJSON topology object. Usually, this object is specified in JSON, but for simplicity's sake, I just create an ordinary JavaScript object. This JavaScript object has a member arcs that is an array.

arcs: [ ..... ]

For each arc in the map, another array is inserted into this arcs: array:

arcs: [ [ ... ], // Red arc [ ... ], // light blue arc [ ... ], // orange arc /* etc */ ]

The elements in these nested arrays deteremine the coordinates of the respective arc. These coordinates are specified in a special format: the first element is the absolute coordinates and the following coordinates are relative to their previous coordinates. I hope the following picture makes this a bit clearer:

The red arc starts at -1,-3 and goes to -3,-1 which is -2,2 relative to the first coordinate. The third (and last) coordinate is relative 1,3 to the second coordinate. Hence, the entries in the arcs array become:

arcs: [ [ // Red arc [-1, -3], [-2, 2], [ 1, 3] ], [ // light blue arc [ .. ], [ .. ] ] /* etc */ ]

We can now refer to one of these arcs with an integer. The first arc is 0, the second 1 etc. If we want to refer to one of these arcs against its direction, the integer for the first arc is -1, for the second arc, its -2 etc. This allows us to define the regions with the integers for the respective confining arc. For example, the first region looks like

{ id: 'foo', type: 'Polygon', arcs: [ [0, -2, -5] ] },

This indicates that the region with id=foo is confined by the first arc (0) in direction of the arc, the second arc (-2, note the negative sign) against the direction of the arc (negative sign!) and the fifth arc (again against its direction, negative sign). Finally, the country (or island) needs to be placed somewhere on the earth. The latitute-spread of the Northernmost and Southernmost point of the country is 20 degrees, therefore, the scale is

scale: [ 10/3, 10/3 ]

The middle of the country is 45 degrees north and 0 degrees west, so the translation becomes

translate: [0, 45]

So, the complete TopoJSON object for the country looks like this:

topology = { type: 'Topology', objects: { regions: { type: 'GeometryCollection', geometries: [ {id: 'foo', type: 'Polygon', arcs: [ [0, -2, -5] ] }, {id: 'bar', type: 'Polygon', arcs: [ [3, 1, -3 ] ] }, {id: 'baz', type: 'Polygon', arcs: [ [ -6, 4, -4 ] ] } ] } }, arcs: [ [ // Red arc # 0 / -1 { [-1, -3], [-2, 2], [ 1, 3] ], // } [ // light blue arc # 1 / -2 { [ 0, 1], [-1, -1], [-1, 2] ], // } [ // orange arc # 2 / -3 { [ 3, 0], [-1, 1], [ 1, 2], [-5,-1] ], // } [ // green arc # 3 / -4 { [ 3, 0], [-3, 1] ], // } [ // purple arc # 4 / -5 { [ -1, -3], [ 1, 4] ], // } [ // blue arc # 5 / -6 { [ -1, -3], [ 4, 3] ] // } ], transform: { scale: [ 10/3, 10/3 ], translate: [0, 45] } }

To show this topology with d3.js, the folllowing code should do:

var width = 1000; var height = 500; var projection = d3.geo.albers() .center([0, 45]) .rotate([0,0]) .parallels([ 5,9]) .scale(1000) .translate([width / 2, height / 2]); // Create «SVG window» var svg = d3.select("body").append("svg").attr("width", width ).attr("height", height); // Create path generator var path = d3.geo.path().projection(projection).pointRadius(2); var regions = topojson.feature(topology, topology.objects.regions); svg.selectAll(".region") .data(regions.features) .enter().append("path") .attr("class", function(d) { return "region-" + d.id; }) .attr("d" , path);

Some CSS to change the colors of the regions:

svg { background-color: #eee; } path { stroke: #444; stroke-width:2px; } .region-foo { fill: #7ad; } .region-bar { fill: #d77; } .region-baz { fill: #da7; }

The complete html file is on github, as well as the inkscape/svg file for the graphics.

Friday, September 12, 2014

A simple bar chart with d3.js

With d3.js it is surprisingly simple to create a bar chart:

d3.select('.bar-chart').
  selectAll('div').
  data([194, 52, 228, 268, 163, 138, 92]).
  enter().
    append('div').
    style ('width', function(d) {return d + "px"}).
    text  (         function(d) {return d       });

For each of the data-elements (194, 52, 268 etc), a new div is appended with its css width style set to the respective px width. Since the divs are transparent per default, a bit of css is needed to make them visible:

.bar-chart div {
  background-color:#fa5;
  margin: 2px;
  color: #713;
  text-align: right}

Result:

github link

Wednesday, September 10, 2014

python -m SimpleHTTPServer

I wish I had known this earlier. With python installed, a simple webserver can be started on the command line with just a
python -m SimpleHTTPServer

This command creates a webserver that listens on port 8000, so that it can be accessed with a browser on localhost:8000
The command also accepts an alternative port instead the default 8000:

python -m SimpleHTTPServer 7777

github link

Monday, September 8, 2014

Sihleggstrasse 23, 8832 Wollerau

Switzerlands Zefix, the central index of companies (zentraler Firmenindex), is downloadable with FTP. This is of course an invitation to create a script that determines the address at which most Swiss companies are registered. Not surprisingly, this is in Wollerau, more precisly at

Sihleggstrasse 23
8832 Wollerau

At this address, 328 companies are registered!
For the reference, here's the list of the top twenty company addresses in Switzerland:

Registered companies	Street	Location
328	Sihleggstrasse 23	8832 Wollerau
304	Murbacherstrasse 37	6003 Luzern
222	Technoparkstrasse 1	8005 Zürich
203	Gewerbestrasse 5	6330 Cham
201	Baarerstrasse 75	6300 Zug
201	Chamerstrasse 172	6300 Zug
188	Neuhofstrasse 5A	6340 Baar
185	Industriestrasse 47	6300 Zug
182	Chemin du Château 26 A	2805 Soyhières
163	Riva Albertolli 1	6900 Lugano
162	Haldenstrasse 5	6340 Baar
157	Churerstrasse 35	9470 Buchs
157	Dammstrasse 19	6300 Zug
156	Rue du Rhône 100	1204 Genève
156	Weissbadstrasse 14	9050 Appenzell
149	Poststrasse 6	6300 Zug
148	Baarerstrasse 78	6300 Zug
145	Industriestrasse 21	6055 Alpnach Dorf
144	Oberneuhofstrasse 5	6340 Baar
140	Baarerstrasse 2	6300 Zug

Note how many companies are registered in Zug.

Sorting à la sqlite

Here's a sqllite table

create table dattyp (
  without_dt,
  dt_int integer,
  dt_text text)

The first column (without_dt) does not have an associated datatype with it, the second and third columns do: integer and text, respectively.
Let's fill the table with some values:

insert into dattyp values ( 2,  2,  2)
insert into dattyp values ( 9,  9,  9)
insert into dattyp values (19, 19, 19)

Check the sorting behavior: is it dependent on the datatype?

select without_dt from dattyp order by without_dt

2
9
19

select dt_int from dattyp order by dt_int

2
9
19

select dt_text from dattyp order by dt_text

19
2
9

Columns without explicit datatypes and integer columns are sorted numerically, while the text column is sorted alphanumerically.
Strings that can be converted to integers are inserted:

insert into dattyp values ('28', '28', '28')

Same sort check as above:

select without_dt from dattyp order by without_dt

select dt_int from dattyp order by dt_int

select dt_text from dattyp order by dt_text

The sorting behavior didn't change. Inserting strings that cannot be converted to an integer. Note that the strings can be inserted to the integer column as well:

insert into dattyp values ('foo', 'bar', 'baz')

Again the same selects:

select without_dt from dattyp order by without_dt

2
9
19
28
foo

select dt_int from dattyp order by dt_int

2
9
19
28
bar

select dt_text from dattyp order by dt_text

19
2
28
9
baz

A complete python script is in this github file

Sunday, September 7, 2014

Codesnippet for using sqlite with Python

This is a code snippet intented to demonstrate the use use of the sqlite3 module in Python:

import sqlite3
import os.path

if os.path.isfile('foo.db'):
   os.remove('foo.db')

db = sqlite3.connect('foo.db')

cur = db.cursor()

cur.execute('create table bar (a number, b varchar)')

cur.execute("insert into bar values (2, 'two')")

cur.execute('insert into bar values (?, ?)', (42, 'forty-two'))

cur.executemany('insert into bar values (?, ?)', [
  (4, 'four'),
  (5, 'five'),
  (7, 'seven'),
  (9, 'nine')
])

for row in cur.execute('select * from bar order by a'):  
    print "%2d: %s" % (row[0], row[1])

Github link to script

Saturday, September 6, 2014

Python: reading a csv file

Here's a csv file:

col 1,col 2,col 3
foo,bar,baz
one,two,three
42,,0

This file can be read in python with this script:

import csv

csv_file   = open('data.csv', 'r')
csv_reader = csv.reader(csv_file)

header = csv_reader.next()

for record in csv_reader:
    
    print 'Record:'
    
    i = 0
    for rec_value in record:
        print '  ' + header[i] + ': ' + rec_value
        i += 1

Github links: data.csv and script.py.

Python: How to download a gz file and decompress it in one go

Here's a python snippet I recently used that downloads .gz files from a ftp server and decompresses it in one go.

import zlib from ftplib import FTP def get_gz(ftp, ftp_filename, local_filename): decomp = zlib.decompressobj(16+zlib.MAX_WBITS) unzip = open (local_filename, 'wb') def next_packet(data): unzip.write(decomp.decompress(data)) ftp.retrbinary('RETR ' + ftp_filename, next_packet) decompressed = decomp.flush() unzip.write(decompressed) unzip.close() ftp_ = FTP('ftp.host.xyz') ftp_.login() ftp_.cwd('/foo/bar') get_gz(ftp_, 'remote-file.gz', 'local-file')

Link to github gist

Friday, December 19, 2014

Tuesday, December 16, 2014

Monday, December 15, 2014

Friday, December 12, 2014

Thursday, December 11, 2014

Monday, December 8, 2014

Wednesday, December 3, 2014

Wednesday, November 19, 2014

Friday, November 14, 2014

Wednesday, November 12, 2014

Structure of known.csv

Structure of unknown.csv

Values for the floats

Result

Tuesday, November 11, 2014

Tuesday, November 4, 2014

Friday, October 31, 2014

Thursday, October 30, 2014

Thursday, October 23, 2014

Friday, October 17, 2014

Thursday, October 9, 2014

Saturday, October 4, 2014

Wednesday, October 1, 2014

Tuesday, September 30, 2014

Tables

Table node

Tables way and node_in_way

Tables relation and member_in_relation

Table tag

ERD

Loading the pbf to a sqlite db

Wednesday, September 24, 2014

Tuesday, September 23, 2014

Complete SQL Script

Links

Monday, September 22, 2014

Links

Installing google's protocol buffers

Wednesday, September 17, 2014

Friday, September 12, 2014

Wednesday, September 10, 2014

Monday, September 8, 2014

Sunday, September 7, 2014

Saturday, September 6, 2014

Structure of `known.csv`

Structure of `unknown.csv`