Matching class numbers

psm · August 12, 2023, 8:57pm

I have two columns in OpenRefine - column OCLC-DDC contains the most frequent DDC number obtained from OCLC through classify API (through web scraping by ISBN number of books) e.g.181.045, and another column Local-DDC (DDC number of the same book in a local library). e.g 181.4.
There are many such rows with these two columns. How can I match these two columns in a way that it will give score 1 if values of these two column matches exactly, score 0.50 if first three digits matches (before '.'), score 0.75 if first three digits as well as three digits after '.' matches, and score 0 if first three digits do not match? I was trying something like this:

if(cells["DDC-MF"].value.trim().substring(0,3) == cells["NLI-ddc"].value.trim().substring(0,3),0.5,if(cells["DDC-MF"].value.trim().substring(0,6) == cells["NLI-ddc"].value.trim().substring(0,6), 0.75,if(cells["DDC-MF"].value.trim() == cells["NLI-ddc"].value.trim(),1,0)))

But the issue here that -1) If two class number matches exactly they also match up to the three digits after '.', and 2) it seems not an elegant solution.

Any suggestions?

Thanks and regards

psm · August 13, 2023, 11:59am

This one is giving me better results -


if(cells["DDC-MF"].value.trim() == cells["NLI-ddc"].value.trim(), 1,
   if(cells["DDC-MF"].value.trim().substring(0, 7) == cells["NLI-ddc"].value.trim().substring(0, 7), 0.8,
      if(cells["DDC-MF"].value.trim().substring(0, 6) == cells["NLI-ddc"].value.trim().substring(0, 6), 0.7,
         if(cells["DDC-MF"].value.trim().substring(0, 5) == cells["NLI-ddc"].value.trim().substring(0, 5), 0.6,
            if(cells["DDC-MF"].value.trim().substring(0, 3) == cells["NLI-ddc"].value.trim().substring(0, 3), 0.5, 0)
         )
      )
   )
)

DDC-MF	NLI-ddc	Score
181.045	181.4	0.5
891.451109	891.451504	0.8
658.00954	658.00954	1
342.54087	342.54087	1
491.4309	891.4309	0
338.13095455	338.10954	0.6
539.72	539.72	1
378.54162	891.4513	0
363.7	333.7	0
954.2	928.91439	0
616.99401	616.99401	1
639.2	639.2	1
639.209165	639.2	0.6
320.12	327.58	0
338.954009033	330.954	0
956.04	294.5	0
891.43171	891.434	0.7

But still I think a better and more elegant solution is there for sure.

tfmorris · August 13, 2023, 9:26pm

Elegant is nice, but excessively clever isn't always a good thing, so I wouldn't try to make it too obscure.

Here's a less verbose version with the same structure:

with( cells["DDC-MF"].value.trim(), mf,
with(cells["NLI-ddc"].value.trim(), nli,
if(mf == nli, 1,
if(mf[0, 7] == nli[0, 7], 0.8,
if(mf[0, 6] == nli[0, 6], 0.7,
if(mf[0, 5] == nli[0, 5], 0.6,
if(mf[0, 3] == nli[0, 3], 0.5, 0)
)
)
)
)
))

and here's a version which uses forEach and array processing to compact it more (but returns slightly different scores on a 10 point scale to make the math easier)

with( cells["DDC-MF"].value.trim(), mf,
with(cells["NLI-ddc"].value.trim(), nli,
if(mf == nli, [10,10],
forEach([7,6,5,3],i,
if(mf[0, i] == nli[0, i], floor((0.1 + i*0.1)*10), 0)
)
)
)
).sort()[-1]

Hope these examples give you some ideas of other things to try to fine tune it.

Tom

ostephens · August 14, 2023, 9:39am

Picking up on the comment from @tfmorris that

Elegant is nice, but excessively clever isn't always a good thing, so I wouldn't try to make it too obscure.

An alternative approach I'd suggest is to create three custom facets with the following expressions:

cells["DDC-MF"].value==cells["NLI-ddc"].value

cells["DDC-MF"].value.substring(0,7)==cells["NLI-ddc"].value.substring(0,7)

cells["DDC-MF"].value.substring(0,3)==cells["NLI-ddc"].value.substring(0,3)

These correspond in order to the conditions:

Exact match
First three digits before and first three digits after the decimal point match
The first three digits match

Filter rows choosing true in the first facet. Do "add new column" for your Score with the expression containing just the digit 1. This will give you a new column with the score 1 for all exact matches

Then set facet 1 to false and facet 2 to true (not exact match but first 7 chars - 3 before and 3 after decimal point - do match). Update your score column for the resulting rows with the expression 0.75

Then set facet 1 to false and facet 2 to false and facet 3 to true (not exact match, not 3/3 match, but first three digits match). Update your score column for the resulting rows with the expression 0.5

Finally set all facets to false and update the score column for the resulting rows with expression 0

While this takes a few more steps, the expressions are easier to write and read, and you can get the exact outcome you originally describe in your original post.

ostephens · August 14, 2023, 9:58am

Another approach would be to treat the class numbers as decimal numbers and calculate the different - then assign a score based on the range the difference falls into. I'm pretty sure with DDC this is a valid approach because of the way the numbers work, although I don't think it would work with similar class schemes like UDC where the rules for the order of the digits are more subtle and there are additional, non-numeric, characters involved

E.g. add a "difference" column using:

abs(value.toNumber()-cells["NLI-ddc"].value.toNumber())

Then add your score with an expression like based on the difference column:

if(value==0,1,
  if(and(value>0,value<0.001),0.75,
    if(and(value>0.001,value<1),0.5,0)
  )
)

I think that gets your outcome - and of course you can play with the ranges specified and add more ranges if you want to get more fine-grained scoring. The raw difference is the ultimate fine grained scoring here

psm · August 15, 2023, 6:30am

Thanks @tfmorris and @ostephens for all these nice solutions. We have applied all these solutions in a test dateset and getting results as expected.

These will help us immensely in identifying books with different/wrong class numbers (DDC-based) in a large collection with 2.5 million book records. We found that for almost 60% of this collection we can fetch DDC number (Most Frequent in OCLC collection) and FAST subject headings.

Heartfelt thanks and best regards

Topic		Replies	Views
Number of matches between the columns Support and Helpdesk	6	422	November 19, 2022
Exact matching on a property during reconcile Support and Helpdesk wikidata , reconciliation	2	87	June 7, 2024
Configuring auto-match with openrefine wikibase reconciliation Support and Helpdesk wikidata , wikibase , reconciliation	3	210	March 24, 2024
Extract unmatched values between 2 cells/columns with space or column separated data Support and Helpdesk	13	74	October 29, 2024
Can't reconcile with an URL (https://www.wikidata.org/wiki/Property:P973 ) Support and Helpdesk	3	59	April 18, 2025

Matching class numbers

Related topics