Tuesday, December 16, 2014

The missing \b regular expression special character in Oracle.

Oracle's regular expressions don't support the special regular expression character \b, at least not in Oracle 11i.

Consider the following table:

create table tq84_word_boundaries ( txt varchar2(50) ); insert into tq84_word_boundaries values ('AFooABarAndABaz' ); insert into tq84_word_boundaries values ('A FooA BarAndABaz' ); insert into tq84_word_boundaries values ('A Foo, a Bar and a Baz'); insert into tq84_word_boundaries values ('A Foo without a Baz' ); insert into tq84_word_boundaries values ('Foo Bar, Baz' ); insert into tq84_word_boundaries values ('Is it a Foo?' ); insert into tq84_word_boundaries values ('Bar-Foo-Baz' );

Now, I want to find all records that contain the exact word Foo. That is, I want, for example A Foo without a Baz (the fourth record), but I don't want A FooA BarAndABaz (the second record) because FooA is not the exact word Foo

If Oracle supported \b I could use

select * from tq84_word_boundaries where regexp_like(txt, '\bFoo\b');
Yet, Oracle, does not support it and no record is returned.

To improve matters, I could try

select * from tq84_word_boundaries where regexp_like(txt, '\sFoo\s');
This approach returns
TXT -------------------------------------------------- A Foo without a Baz

A bit better, but far from perfect. For example, the fifth record (Foo Bar, Baz) is not returned, because the \s doesn't match start of line. So, I improve the where condition:

select * from tq84_word_boundaries where regexp_like(txt, '(^|\s)Foo($|\s)');
The record is now returned:
TXT -------------------------------------------------- A Foo without a Baz Foo Bar, Baz

Yet again, this is far from perfect. I need also record records where Foo is followed or lead by a non word character (such as ? or -):

select * from tq84_word_boundaries where regexp_like(txt, '(^|\s|\W)Foo($|\s|\W)');
This returns
TXT -------------------------------------------------- A Foo, a Bar and a Baz A Foo without a Baz Foo Bar, Baz Is it a Foo? Bar-Foo-Baz
I think I can live with it.
Source code on github

No comments:

Post a Comment