Hey there,
I have a dataset of developer jobs data. In this dataset, I have a "skills" column. A few example values:
-HTML, CSS, JavaScript Frontend frameworks (e.g., React, Angular) User experience (UX)
-Proficiency in one or more backend programming languages (e.g., Java, Python, Node.js, Ruby) API development Database management (SQL or NoSQL) Server management and deployment Security best practices Knowledge of web frameworks (e.g., Express, Django)
-Mobile app development languages (e.g., Java, Swift, Kotlin) Cross-platform development (e.g., React Native, Flutter) Mobile app design principles APIs and web services integration Debugging and troubleshooting
-SQL (Structured Query Language) Database design Query optimization Data modeling Database maintenance Problem-solving skills
I need to normalize these to technology names, such as:
-HTML, CSS, React, Angular, UX
-Java, Python, Node.js, Ruby
-Swift, Kotlin, React, Flutter
etc..
The reason why I need to normalize this is because I need to answer this question: What are the most popular technologies?
Any suggestions on what could be the best approach to this?
If there's a fairly consistent pattern of finding the technology names within paratheses, then you might extract them by Adding a new column based on using some GREL
value.find(/\(([^)]+)\)/).join("|||")
The alternative is to use a Word Facet as a hint on what the soup of words looks like. And stealing it's regex used by clicking on its "change" and then using that in a Add new column based on using its GREL and adding the join.
value.split(/(?U)[\W]+/).join("|||")
Finally, you can split the new column on the separator characters |||
.
Then either run a Reconcilation(but that's harder) or perhaps easier to use an existing technology terms database or "job skills" database that you create a new OpenRefine project against and then in your existing project use cross()
on the new column of words with those found in the database of technology terms. Dice.com used to maintain a really good list long ago, but that's long ago and technology rapidly changes. So it's best to use something like Wikidata and query using SPARQL to extract and download the CSV listing of instances in certain categories, like "web frameworks" as an example.
Record linkage is a science on its own. Best of luck!
Just thought of another GREL pattern. Only first letter capital words?
value.find(/\b[A-Z].*?\b/).join("|||")