Data Cleansing is the process of correcting, editing and deleting data that is incorrect, incomplete or duplicated. The cleansing can be done at once or can be carried out continuously, and either manually or using various software tools. We strive to clean the data in the closest place possible to where the data is originated - place close to the source - both from a technical point of view (ideally right in the original system/database/software/app) and from the point of view of time. The later the data is cleaned, the more the pollution of the rest of the information system will spread. The goal of data cleansing is to improve the overall consistency of data generated in different ways.
How is it done?
Data cleansing is always done in two basic steps:
- Error detection, i.e. identifying the places containing errors - misspellings (typographical errors), incompleteness, incorrect and inaccurate values, duplicity, and the like
- Correction of errors, i.e. fixing of damaged or incorrect records - repair, replacement, modification and deletion of data
Data cleansing in enterprise applications is performed at the database level using specialized tools. Once this is done, incorrect records are replaced by the clean data.
Why is data cleansing necessary?
Not many information systems (if some at all), applications, and other tools storing business records contain only clean data. This might be due to the failure to coordinate processes and activities within the organization or due to the isolation of individual parts of the information system which means that they do not communicate between each other or the connection established is not working. As a result, the same information is entered to several places within the system. This is frequently the case of customer database that is duplicated in several places because it is involved in many processes and systems that are isolated (uncoordinated) from one another. As a result, customer details are kept in several places, e.g. in ERP softwares, CRM softwares and others. The problem is that when a customer, for example, changes the address, this update only appears in one of these places. Another common origin of this sort of trouble are typographical errors, e.g. an error in the business name causing someone else to enter the same record (item) once again.
The problem of dirty data (i.e. duplicated, inaccurate, incomplete or inconsistent data) naturally increases as the amount of input data and the complexity of information systems grow. If the input data is of a poor quality, the output cannot be better. In other words, poor input values will always generate a poor output.
If the input data is not clean when it’s entered into a complex integrated system, the pollution is then spread to other places and the cost of data cleansing exponentially grows. Data hygiene, as it is sometimes referred to, is becoming more and more important as the number of systems in use grows and the amount of data and information sources rises.
What are the most common origins of dirty data (also called rogue data)?
There are many ways the dirty data can be created. This is the list of the most common ones:
- Entry errors, including misspellings, typographical errors, and various transcripts of the same words.
- Missing data, which may be due to incomplete questionnaires or a negligent transcription into the database.
- Missing enterprise or industry standards for data encryption, which is a major problem, for example, in healthcare.
- Multiple databases scattered across different parts of an organization, each of which is structured according to the current departmental needs that are incompatible with structure of the other existing databases.
- Poorly documented or obsolete data.
- Errors caused by transmission (transmission noise), migration, or merging of data from different sources
- Different definitions of terms, inconsistent vocabulary, inconsistent method or methodology of data input
- Non-uniform formats of data in source systems
What are the methods and technologies used for data cleansing?
Data cleansing can be done at once (for example, before migrating data from one application to another), but it is mostly a continuous and never-ending process. It always depends on the extent of data pollution, and on the means the source data is saved (where - in what systems, how it is accessible, etc.). In practice, it is the following methods of data cleansing that are used:
- Manual data cleansing - it can be convenient and effective for smaller volumes of data, this method is time consuming and financially demanding
- Automatic data cleansing - this method makes use of specialized tools that use sophisticated algorithms to standardize, harmonize, repair, compare, cross check, divide and consolidate data
The features of the data cleansing tools can vary substantially ranging from a simple data cleansing tool to the tool able to compare, correct and consolidate database items from different databases. The tool can either be built-in a particular software or can stand alone as a specialized data cleansing tool available, for example, in the cloud. However, this kind of tools should only be used with caution so to avoid any breach or leakage of data.