Interview Question: How does data cleaning play a vital role in analysis?
Cleaning data is a large part of a data scientist’s job, and of related roles. It is therefore essential to understand this part of the work and to be able to answer questions about it in an interview for a data science or similar position.
Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table or database. In practice, this means identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying or deleting that dirty data.
The process of data cleaning may involve removing typographical errors, or validating and correcting values against a known list of entities. The validation may be strict, such as rejecting any record that is incomplete, or fuzzy, such as correcting values that only partially match a known entry. Some data cleaning solutions clean data by cross-checking it against a validated dataset.
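The difference between strict and fuzzy validation can be illustrated with a minimal sketch. It uses Python's standard `difflib` module; the list `KNOWN_CITIES`, the function names and the `cutoff` value of 0.8 are assumptions made up for this example, not part of any particular data cleaning tool.

```python
from difflib import get_close_matches

# Hypothetical validated list of entities to check values against.
KNOWN_CITIES = ["London", "Manchester", "Birmingham"]


def validate_strict(value, known=KNOWN_CITIES):
    """Strict validation: accept only exact matches, reject everything else."""
    return value if value in known else None


def validate_fuzzy(value, known=KNOWN_CITIES, cutoff=0.8):
    """Fuzzy validation: correct a value to its closest known entity.

    Returns None when nothing in the list is similar enough
    (similarity below the assumed cutoff).
    """
    matches = get_close_matches(value, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With these definitions, a misspelling such as `"Londn"` is rejected by `validate_strict` but corrected to `"London"` by `validate_fuzzy`. A sensible cutoff depends on the data, so treat 0.8 as a starting point rather than a rule.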
The steps involved in data cleaning are:
- Remove duplicates
- Remove irrelevant data
- Standardise capitalisation
- Convert data types
- Clear formatting
- Correct errors
- Translate languages
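Several of the steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the record layout (dictionaries with `name` and `age` fields) and the function name `clean_records` are hypothetical, and error correction and language translation are left out because they depend heavily on the dataset.

```python
def clean_records(records, relevant_fields=("name", "age")):
    """Apply a few basic cleaning steps to a list of dict records.

    Assumes each record has a text 'name' and an 'age' stored as text.
    """
    cleaned = []
    seen = set()
    for record in records:
        # Remove irrelevant data: keep only the fields the analysis needs.
        row = {field: record.get(field, "") for field in relevant_fields}

        # Clear formatting: trim and collapse stray whitespace.
        name = " ".join(str(row["name"]).split())

        # Standardise capitalisation.
        name = name.title()

        # Convert data type: age from text to int, None if it cannot be parsed.
        try:
            age = int(str(row["age"]).strip())
        except ValueError:
            age = None

        # Remove duplicates: skip records already seen after cleaning.
        key = (name, age)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"name": name, "age": age})
    return cleaned
```

Duplicates are checked after the other steps on purpose: two records that differ only in spacing or capitalisation only show up as duplicates once they have been standardised.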