Word Facet design improvement

When we first designed our Word Facet, it targeted a very basic use case: splitting Latin-script (English) text into words on the common whitespace separator.

But I think our Word Facet could be improved for our users with some small tweaks to the underlying split() call, which currently splits only on a single whitespace character. A more useful split would break on any word boundary and optionally ignore punctuation characters such as underscores. We could go further and expose those options as 2 or 3 checkboxes in the facet-controls div, which would alter the regex passed to split().
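To make the checkbox idea concrete, here is a minimal sketch in Python of how a few options could be turned into a split regex. The option names (`any_whitespace`, `split_on_punctuation`, `split_on_underscore`) and both functions are hypothetical and only illustrate the idea; this is not how the facet is implemented.

```python
import re


def build_split_pattern(any_whitespace=True,
                        split_on_punctuation=True,
                        split_on_underscore=True):
    """Build a split regex from three hypothetical facet checkboxes."""
    # With the first box unchecked, fall back to a literal space only.
    parts = [r"\s" if any_whitespace else " "]
    if split_on_punctuation:
        # Anything that is neither a word character nor whitespace.
        # Note that "_" counts as a word character, so it is not covered here.
        parts.append(r"[^\w\s]")
    if split_on_underscore:
        parts.append("_")
    # A run of separators is treated as a single break.
    return re.compile("(?:" + "|".join(parts) + ")+")


def split_words(value, pattern):
    """Split value on the pattern and drop empty strings."""
    return [token for token in pattern.split(value) if token]


text = "new_york,  big apple!"
print(text.split(" "))                                  # split on a single space
print(split_words(text, build_split_pattern()))         # all three options on
print(split_words(text, build_split_pattern(split_on_underscore=False)))
```

Keeping the underscore as its own option matters because regex word classes treat `_` as part of a word, so a generic "ignore punctuation" switch would not split on it.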

As an example, notice how much more useful it can become in the top word facet versus original bottom word facet:

I would also likely want to click an Edit link next to each extracted word to mass-replace a few words with more recent accepted terminology. Having that Edit link would also make the Word Facet useful for bulk edit operations, in addition to its only current use in limited text analysis.

I meant to reply to this for a while but am only getting round to it now. I think it is a really interesting problem you are bringing up here.
It feels like it should be possible to use better tokenizers for such a facet. In a sense, just by exposing a given tokenizer in GREL or Jython, you could already improve the expression that defines the word facet.

Tokenization depends on the language, so ideally we should let the user choose the tokenizer. I wonder if any of the libraries we already depend on offer a sensible tokenizer for any language.
I know Solr offers tokenizers in various languages, but I haven’t checked if we incidentally depend on the artifact where they are deployed. If not it would be nice to expose them as GREL functions in some extension. And there might be more complete sets of tokenizers in other Java libraries.

I completely agree about tokenization - this would be a good addition to GREL especially for the type of use case described by @thadguidry above. Supporting a GREL function like:

<string>.tokenize("<tokenizer name string>",<optional tokenizer params>)

in a similar way to how we support different algorithms for phonetic, including an extension point so that additional tokenizers can be added, would be good.

The optional tokenizer parameters may be needed - for example the Solr Whitespace tokenizer takes a param to determine whether the Java or Unicode definitions of whitespace should be used … but of course the problem here is that each tokenizer could have a very different set of parameters which might be a challenge for validation.
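As a rough illustration of the shape such a function could take, here is a minimal Python sketch of a name-based tokenizer registry with an extension point and per-tokenizer parameter validation. Everything in it is hypothetical: the `register_tokenizer` decorator, the tokenizer names, and the `rule` parameter with its `"unicode"` and `"ascii"` values are invented for the example and do not mirror Solr's or OpenRefine's actual API.

```python
import re

# Registry mapping a tokenizer name to (function, allowed parameter names).
TOKENIZERS = {}


def register_tokenizer(name, allowed_params=()):
    """Extension point: register a tokenizer under a name."""
    def decorator(func):
        TOKENIZERS[name] = (func, frozenset(allowed_params))
        return func
    return decorator


@register_tokenizer("whitespace", allowed_params=("rule",))
def whitespace_tokenizer(text, rule="unicode"):
    # Hypothetical "rule" parameter selecting the whitespace definition.
    if rule == "unicode":
        return text.split()  # str.split() uses Unicode whitespace
    if rule == "ascii":
        return [t for t in re.split(r"[ \t\n\r\f\v]+", text) if t]
    raise ValueError(f"unknown rule: {rule!r}")


@register_tokenizer("word")
def word_tokenizer(text):
    return re.findall(r"\w+", text)


def tokenize(text, name, **params):
    """Look up a tokenizer by name and validate its parameters."""
    if name not in TOKENIZERS:
        raise ValueError(f"unknown tokenizer: {name!r}")
    func, allowed = TOKENIZERS[name]
    unknown = set(params) - allowed
    if unknown:
        raise ValueError(f"{name!r} does not accept: {sorted(unknown)}")
    return func(text, **params)


print(tokenize("a\u00a0b c", "whitespace"))                # no-break space splits
print(tokenize("a\u00a0b c", "whitespace", rule="ascii"))  # no-break space kept
```

Letting each tokenizer declare its own allowed parameters is one way around the validation problem: the generic function only checks names against that declaration and leaves value checking to the tokenizer itself.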

But overall I feel like this would be a good addition.

However, this doesn’t quite resolve the issue that @thadguidry raises. Whether we do it via a tokenizer or via a modified version of the current split() command used in the default word facet, we need to make sure that we choose a sensible ‘default’ for the word facet that works for a wide range of common use cases.

Are you thinking this would trigger a find and replace for the ‘edited’ word?

Yes @ostephens, that’s the idea. First, I need a nicer facet to see the list of words (what is a word and what is not is up to language and tokenizer expression), not see a list of super long strings (so tokenization helps here), and then replace some words (by clicking the mass edit link next to each word in the facet list). Give me a better view first…then I can take additional steps once I see the problem through that better view.

When David and Stefano designed and implemented this more than 12 years ago, Java had much more limited capabilities, but improving it is straightforward now.

This sounds like the same enhancement request that I created in 2012, which Antonin expanded upon in 2020. Although I closed my enhancement request when I found the duplicate, I’ve decided to reopen it because it’s trivial to implement and captures 90% of the value while a full blown tokenizer implementation is discussed.

If folks want to experiment with this, I’d recommend using the following expression, which I suggest as the new default:

value.split(/(?U)[\W]/)
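To see what the Unicode flag buys, here is a small Python sketch that mimics the comparison. It is an approximation under stated assumptions, not GREL: Python's `\W` is Unicode-aware by default for text strings, which plays the role of Java's `(?U)` flag here, and `re.ASCII` stands in for the same pattern without that flag. The helper `split_on` and the sample string are made up for the example, and empty strings are filtered out explicitly, which may differ from what the facet does.

```python
import re

UNICODE_NON_WORD = re.compile(r"\W")          # analogous to (?U)[\W]
ASCII_NON_WORD = re.compile(r"\W", re.ASCII)  # same pattern without Unicode classes


def split_on(pattern, value):
    """Split on every match of pattern and drop empty strings."""
    return [token for token in pattern.split(value) if token]


# "café au_lait, naïve" written with escapes so the accents are precomposed.
sample = "caf\u00e9 au_lait, na\u00efve"

space_split = sample.split(" ")                     # punctuation stays attached
unicode_split = split_on(UNICODE_NON_WORD, sample)  # accented words stay whole
ascii_split = split_on(ASCII_NON_WORD, sample)      # accented letters break words

print(space_split)
print(unicode_split)
print(ascii_split)
```

One thing the sketch makes visible: the underscore counts as a word character, so `au_lait` stays a single token under this pattern.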

For a stronger normalization, folks can use

value.fingerprint().split(' ')

but I don’t think it’s a good idea for a default because it’s too different from the current behavior. It adds case folding and diacritic folding, but the deduplication that it includes will change the word counts.
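The effect on word counts can be shown with a rough Python approximation of a fingerprint-style normalization. The function `approx_fingerprint` is hypothetical and only imitates the steps mentioned above (case folding, diacritic folding, deduplication) plus punctuation removal and sorting, which are assumptions; it is not the actual fingerprint() implementation and handles only ASCII punctuation.

```python
import string
import unicodedata


def approx_fingerprint(value):
    """Approximate a fingerprint: fold case and diacritics, dedupe, sort."""
    s = value.strip().lower()
    # Drop ASCII punctuation only (a simplification).
    s = "".join(ch for ch in s if ch not in string.punctuation)
    # Decompose accented letters, then drop the combining marks.
    s = unicodedata.normalize("NFKD", s)
    s = "".join(ch for ch in s if not unicodedata.combining(ch))
    # Deduplicate and sort the whitespace-separated tokens.
    return " ".join(sorted(set(s.split())))


cell = "The caf\u00e9, the Caf\u00e9"  # "The café, the Café"

plain_tokens = cell.split(" ")
fingerprint_tokens = approx_fingerprint(cell).split(" ")

print(len(plain_tokens), plain_tokens)
print(len(fingerprint_tokens), fingerprint_tokens)
```

The cell contributes four tokens under the plain split but only two after the fingerprint-style pass, which is why facet counts would shift if this became the default.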

I’ve put a pull request with my proposed new default up for review.

Tom