Hi all,
Generalist analyst here, trying to understand the extent to which data cleansing solutions require human intervention - and at which points.
For example, if we have a workflow which includes the following stages:
· Removing data irrelevant to the analysis;
· Deduplicating records;
· Fixing structural errors (typos, inconsistent labels, and the like);
· Handling missing data;
· Addressing outliers;
· Validating the cleaned data.
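For concreteness, here is a very rough pandas sketch of what I imagine a fully automated version of the first four stages might look like. I'm not claiming this is how any particular tool actually works; the column names, the label mapping, and the fill-with-median choice are all just my invented example.

```python
import pandas as pd

# Hypothetical raw data; the column names are invented for illustration.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 4],
    "country": ["UK", "UK", "uk", "U.K.", "France"],
    "spend": [100.0, 100.0, None, 250.0, 80.0],
    "internal_notes": ["x"] * 5,  # irrelevant to the analysis
})

# 1. Remove irrelevant data: drop columns the analysis doesn't use.
df = df.drop(columns=["internal_notes"])

# 2. Deduplicate: exact duplicate rows are dropped mechanically.
df = df.drop_duplicates()

# 3. Fix structural errors: map inconsistent labels to one canonical form.
df["country"] = df["country"].replace({"uk": "UK", "U.K.": "UK"})

# 4. Handle missing data: here, fill missing spend with the column median.
df["spend"] = df["spend"].fillna(df["spend"].median())

print(df)
```

My (possibly naive) reading is that steps 1 and 2 can run unattended once configured, whereas step 3 needs a human to supply the mapping and step 4 needs a human to choose the fill strategy. Happy to be corrected on that.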
To what extent does a human need to be involved in each? There is clearly a need around outliers (to determine whether a value was an error or just a rare event), but how much human engagement is needed in the other stages? Also, if we look at the various open source solutions currently available (OpenRefine, DataCleaner, etc.), is there any significant difference in the degree of human interaction each requires?
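On the outlier point specifically, my mental model is that the tool does the flagging mechanically and the human only makes the error-vs-rare-event call. Something like the sketch below, where the 1.5×IQR threshold is purely my assumption rather than anything a specific tool is documented to use:

```python
import pandas as pd

def flag_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask for values outside the Tukey fences
    (Q1 - k*IQR, Q3 + k*IQR); flagged rows go to a human for review."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

spend = pd.Series([80.0, 100.0, 95.0, 110.0, 105.0, 9999.0])
print(spend[flag_outliers(spend)])  # surfaces 9999.0 for a person to judge
```

The flagging itself looks trivial to automate; what I can't see being automated is the judgement on each flagged value.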
Not a specialist in this area, so please reply as if responding to a small child.
Many thanks in advance,
Windsor Holden