Matching class numbers

I have two columns in OpenRefine - column OCLC-DDC contains the most frequent DDC number obtained from OCLC through classify API (through web scraping by ISBN number of books) e.g.181.045, and another column Local-DDC (DDC number of the same book in a local library). e.g 181.4.
There are many such rows with these two columns. How can I match these two columns in a way that it will give score 1 if values of these two column matches exactly, score 0.50 if first three digits matches (before '.'), score 0.75 if first three digits as well as three digits after '.' matches, and score 0 if first three digits do not match? I was trying something like this:

if(cells["DDC-MF"].value.trim().substring(0,3) == cells["NLI-ddc"].value.trim().substring(0,3),0.5,if(cells["DDC-MF"].value.trim().substring(0,6) == cells["NLI-ddc"].value.trim().substring(0,6), 0.75,if(cells["DDC-MF"].value.trim() == cells["NLI-ddc"].value.trim(),1,0)))

But the issue here that -1) If two class number matches exactly they also match up to the three digits after '.', and 2) it seems not an elegant solution.

Any suggestions?

Thanks and regards

This one is giving me better results -

if(cells["DDC-MF"].value.trim() == cells["NLI-ddc"].value.trim(), 1,
   if(cells["DDC-MF"].value.trim().substring(0, 7) == cells["NLI-ddc"].value.trim().substring(0, 7), 0.8,
      if(cells["DDC-MF"].value.trim().substring(0, 6) == cells["NLI-ddc"].value.trim().substring(0, 6), 0.7,
         if(cells["DDC-MF"].value.trim().substring(0, 5) == cells["NLI-ddc"].value.trim().substring(0, 5), 0.6,
            if(cells["DDC-MF"].value.trim().substring(0, 3) == cells["NLI-ddc"].value.trim().substring(0, 3), 0.5, 0)
DDC-MF NLI-ddc Score
181.045 181.4 0.5
891.451109 891.451504 0.8
658.00954 658.00954 1
342.54087 342.54087 1
491.4309 891.4309 0
338.13095455 338.10954 0.6
539.72 539.72 1
378.54162 891.4513 0
363.7 333.7 0
954.2 928.91439 0
616.99401 616.99401 1
639.2 639.2 1
639.209165 639.2 0.6
320.12 327.58 0
338.954009033 330.954 0
956.04 294.5 0
891.43171 891.434 0.7

But still I think a better and more elegant solution is there for sure.

Elegant is nice, but excessively clever isn't always a good thing, so I wouldn't try to make it too obscure.

Here's a less verbose version with the same structure:

with( cells["DDC-MF"].value.trim(), mf,
with(cells["NLI-ddc"].value.trim(), nli,
if(mf == nli, 1,
if(mf[0, 7] == nli[0, 7], 0.8,
if(mf[0, 6] == nli[0, 6], 0.7,
if(mf[0, 5] == nli[0, 5], 0.6,
if(mf[0, 3] == nli[0, 3], 0.5, 0)

and here's a version which uses forEach and array processing to compact it more (but returns slightly different scores on a 10 point scale to make the math easier)

with( cells["DDC-MF"].value.trim(), mf,
with(cells["NLI-ddc"].value.trim(), nli,
if(mf == nli, [10,10],
if(mf[0, i] == nli[0, i], floor((0.1 + i*0.1)*10), 0)

Hope these examples give you some ideas of other things to try to fine tune it.


Picking up on the comment from @tfmorris that

Elegant is nice, but excessively clever isn't always a good thing, so I wouldn't try to make it too obscure.

An alternative approach I'd suggest is to create three custom facets with the following expressions:


These correspond in order to the conditions:

  1. Exact match
  2. First three digits before and first three digits after the decimal point match
  3. The first three digits match

Filter rows choosing true in the first facet. Do "add new column" for your Score with the expression containing just the digit 1. This will give you a new column with the score 1 for all exact matches

Then set facet 1 to false and facet 2 to true (not exact match but first 7 chars - 3 before and 3 after decimal point - do match). Update your score column for the resulting rows with the expression 0.75

Then set facet 1 to false and facet 2 to false and facet 3 to true (not exact match, not 3/3 match, but first three digits match). Update your score column for the resulting rows with the expression 0.5

Finally set all facets to false and update the score column for the resulting rows with expression 0

While this takes a few more steps, the expressions are easier to write and read, and you can get the exact outcome you originally describe in your original post.

Another approach would be to treat the class numbers as decimal numbers and calculate the different - then assign a score based on the range the difference falls into. I'm pretty sure with DDC this is a valid approach because of the way the numbers work, although I don't think it would work with similar class schemes like UDC where the rules for the order of the digits are more subtle and there are additional, non-numeric, characters involved

E.g. add a "difference" column using:


Then add your score with an expression like based on the difference column:


I think that gets your outcome - and of course you can play with the ranges specified and add more ranges if you want to get more fine-grained scoring. The raw difference is the ultimate fine grained scoring here

Thanks @tfmorris and @ostephens for all these nice solutions. We have applied all these solutions in a test dateset and getting results as expected.

These will help us immensely in identifying books with different/wrong class numbers (DDC-based) in a large collection with 2.5 million book records. We found that for almost 60% of this collection we can fetch DDC number (Most Frequent in OCLC collection) and FAST subject headings.

Heartfelt thanks and best regards

1 Like