Normalize longer text data

grafisco · August 7, 2024, 5:29pm

Hey there,

I have a dataset of developer jobs data. In this dataset, I have a "skills" column. A few example values:

-HTML, CSS, JavaScript Frontend frameworks (e.g., React, Angular) User experience (UX)
-Proficiency in one or more backend programming languages (e.g., Java, Python, Node.js, Ruby) API development Database management (SQL or NoSQL) Server management and deployment Security best practices Knowledge of web frameworks (e.g., Express, Django)
-Mobile app development languages (e.g., Java, Swift, Kotlin) Cross-platform development (e.g., React Native, Flutter) Mobile app design principles APIs and web services integration Debugging and troubleshooting
-SQL (Structured Query Language) Database design Query optimization Data modeling Database maintenance Problem-solving skills

I need to normalize these to technology names, such as:

-HTML, CSS, React, Angular, UX
-Java, Python, Node.js, Ruby
-Swift, Kotlin, React, Flutter
etc..

The reason why I need to normalize this is because I need to answer this question: What are the most popular technologies?

Any suggestions on what could be the best approach to this?

thadguidry · August 7, 2024, 11:29pm

If there's a fairly consistent pattern of finding the technology names within paratheses, then you might extract them by Adding a new column based on using some GREL

value.find(/\(([^)]+)\)/).join("|||")

The alternative is to use a Word Facet as a hint on what the soup of words looks like. And stealing it's regex used by clicking on its "change" and then using that in a Add new column based on using its GREL and adding the join.

value.split(/(?U)[\W]+/).join("|||")

Finally, you can split the new column on the separator characters |||.
Then either run a Reconcilation(but that's harder) or perhaps easier to use an existing technology terms database or "job skills" database that you create a new OpenRefine project against and then in your existing project use cross() on the new column of words with those found in the database of technology terms. Dice.com used to maintain a really good list long ago, but that's long ago and technology rapidly changes. So it's best to use something like Wikidata and query using SPARQL to extract and download the CSV listing of instances in certain categories, like "web frameworks" as an example.

Record linkage is a science on its own. Best of luck!

thadguidry · August 7, 2024, 11:35pm

Just thought of another GREL pattern. Only first letter capital words?

value.find(/\b[A-Z].*?\b/).join("|||")

Topic		Replies	Views
Reconcile data from GlobalNames API into separate columns Data cleaning and transformations reconciliation	2	248	November 3, 2023
GREL replace() using regular expression Support and Helpdesk	10	358	July 23, 2024
Clean data as a base for dirty data Data cleaning and transformations	3	328	April 30, 2023
Dario Schiraldi : How do I clean or format messy data? Community	1	28	July 9, 2025
Reconcile many values into single entity Support and Helpdesk wikidata , reconciliation	1	153	December 23, 2023

Normalize longer text data

Related topics