Data Prep, Data Cleaning, & Data Wrangling

Why are we so concerned about our data and its format before we process it? Surely it’s all bits and bytes and bobs that can be shoved into an online database, and then we can let someone else’s good ol’ Object Oriented Stats program do with it as it pleases…

Image c/o geralt from Pixabay: https://pixabay.com/users/geralt-9301/

“According to interviews and expert estimates, data scientists spend approximately 70 percent to 80 percent of their time embroiled in the mundane labor of collecting and preparing unruly digital data before it can be explored for useful nuggets.” (quote courtesy of Steve Lohr, The New York Times)

What is Data Preprocessing?

Data preprocessing, or data wrangling, is the process of transforming and mapping data from various sources: converting raw data into a format suitable for analysis and further processing.
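To make that a little more concrete, here is a minimal Python (pandas) sketch of what “converting raw data into an appropriate format” often looks like in practice. The column names, values, and cleaning rules are all hypothetical, invented purely for illustration.

```python
import pandas as pd

# Hypothetical raw extract: stray whitespace, inconsistent capitalisation,
# numbers stored as strings, and a few unusable values.
raw = pd.DataFrame({
    "customer": ["  Alice ", "Bob", "bob", None],
    "signup_date": ["2023-01-05", "2023-02-05", "not recorded", "2023-04-20"],
    "spend": ["1,200", "350", "n/a", "875"],
})

tidy = (
    raw
    .assign(
        # Normalise names: trim whitespace, consistent capitalisation.
        customer=lambda d: d["customer"].str.strip().str.title(),
        # Unify dates; anything unparseable becomes NaT rather than breaking the run.
        signup_date=lambda d: pd.to_datetime(d["signup_date"], errors="coerce"),
        # Turn "1,200"-style strings into numbers; "n/a" becomes NaN.
        spend=lambda d: pd.to_numeric(d["spend"].str.replace(",", ""), errors="coerce"),
    )
    .dropna(subset=["customer"])   # rows with no customer are unusable here
)

print(tidy.dtypes)
```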

Why is Data Wrangling Important?

Data wrangling may significantly influence the statistical conclusions drawn from the data. As the ancient saying went when I was studying Pascal programming back in the old days: “Garbage in will give you garbage out.” GIGO is a famous expression used to emphasise that “the quality of the statistical analyses (or output) will always depend upon the quality of the data (or input).” By preprocessing your input data, you can minimise the garbage contaminating your analysis and reduce the risk of low-quality output later in the process.

The Five V’s of Data

Velocity is the speed at which data accumulates. Data is generated constantly and extremely fast; it never stops. Near-real-time and real-time streaming, along with local and cloud-based technologies, can process this information quickly.

Volume is the scale of the data, or the increase in the amount of data stored. Drivers of volume include additional data sources, higher-resolution sensors, and scalable infrastructure.

Variety is the diversity of the data. For example, structured data fits neatly into rows and columns in relational databases, while unstructured data is not organized in a pre-defined way: think tweets, blog posts, pictures, numbers, and video. Variety also reflects that data comes from different sources (machines, people, and processes), both internal and external to organizations. Drivers include mobile technologies, social media, wearable technologies, geo-technologies, video, and much more.

Veracity is the quality and origin of data and its conformity to facts and accuracy. Attributes include consistency, completeness, integrity, and ambiguity. Drivers include cost and the need for traceability. With a large amount of data available, the debate rages concerning data accuracy in the digital age.

Value is our ability and need to turn data into value (usually business value). Value isn’t just profit; it may also take the form of medical or social benefits, or customer, employee, or personal satisfaction. The main reason people invest time in understanding data is to derive meaningful value from it.

So What Are the Major Tasks of Data Preprocessing?

The five key steps required to ensure that your data is up to scratch are as follows:

  1. Obtain the data
  2. Understand how it is structured (or unstructured)
  3. Clean, tidy and manipulate the data
  4. Scan data to ensure no nasty surprises occur down the road, and
  5. Transform the data into a format that can be readily imported into the statistical analysis software.
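As a rough illustration of how those five steps can map onto code, here is a hedged Python (pandas) sketch. The file names, the “age” column, and the choice of a clean CSV as the output format are assumptions made up for this example, not a prescription.

```python
import pandas as pd

# 1. Obtain the data (hypothetical CSV export; the path is an assumption).
df = pd.read_csv("survey_responses.csv")

# 2. Understand how it is structured.
print(df.shape)        # rows x columns
print(df.dtypes)       # column types
print(df.head())       # a quick look at the raw values

# 3. Clean, tidy and manipulate the data.
df = df.rename(columns=str.lower)                      # consistent column names
df["age"] = pd.to_numeric(df["age"], errors="coerce")  # bad entries become NaN
df = df.drop_duplicates()

# 4. Scan for nasty surprises down the road.
print(df.isna().sum())                 # missing values per column
print(df.describe(include="all"))      # ranges, outliers, odd categories

# 5. Transform into a format the analysis software can readily import.
df.to_csv("survey_responses_clean.csv", index=False)
```

The point is not the particular calls but the order: look at the data before you clean it, and check it again before you hand it on to the analysis.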

So there you have it — the chief reasons why Data Scientists spend so much of their working time playing with digits, alphas, blobs, and bytes so that they are all looking the same, standing at attention and ready to march onwards to be converted into business value.

--

Paul Chambiras https://freelance-writer.store

I am a freelance writer on all things Business, DIY, Sport, Technology, IT and Management.