Julia for Data Science

What is data munging?

The term "munging" comes from "munge," which was coined by students at the Massachusetts Institute of Technology (MIT), USA. Data munging is considered one of the most essential parts of the data science process. It involves collecting, aggregating, cleaning, and organizing the data so that it can be consumed by the algorithms designed to make discoveries or to create models. This takes numerous steps, including extracting data from the data source and then parsing or transforming it into a predefined data structure. Data munging is also referred to as data wrangling.

The data munging process

So what does the data munging process look like? Data can arrive in any format, and the data science process may require data from multiple sources. The data aggregation phase can therefore include scraping data from websites, downloading thousands of .txt or .log files, or gathering data from RDBMS or NoSQL data stores.
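As a minimal sketch of the file-gathering part of this phase, the following collects every line from the .txt and .log files in one directory. The temporary directory, the file names, and the helper `gather_lines` are all hypothetical and exist only to make the example self-contained; real aggregation would point at the actual download location.

```julia
# Create a throwaway directory with a few sample files (illustrative only).
dir = mktempdir()
write(joinpath(dir, "a.log"), "line 1\nline 2\n")
write(joinpath(dir, "b.txt"), "line 3\n")
write(joinpath(dir, "notes.md"), "ignored\n")

# Hypothetical helper: read all lines from the .txt and .log files in `dir`.
function gather_lines(dir)
    lines = String[]
    for name in readdir(dir)  # readdir returns names in sorted order
        if endswith(name, ".txt") || endswith(name, ".log")
            append!(lines, readlines(joinpath(dir, name)))
        end
    end
    return lines
end

all_lines = gather_lines(dir)
```

The result is a single vector of raw strings, which is still far from analysis-ready and has to be parsed in the later phases.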

It is very rare to find data in a format that the data science process can use directly. The data received is usually unsuitable for modeling and analysis as it stands, because algorithms generally require data to be stored in a tabular format or in matrices. Converting the gathered raw data into the required format can be very complex and time consuming, but this phase lays the foundation for the sophisticated data analysis that follows.

It is good practice to define, in advance, the structure of the data that you will feed to the algorithms. This structure is determined by the nature of the problem. The algorithms that you have designed or will design should not only accept data in this format, but should also be able to easily identify patterns, find outliers, make discoveries, or meet whatever the desired outcomes are.
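One way to pin down such a structure ahead of time is to declare a record type. The sketch below assumes a weather-analysis problem; the type `WeatherReading`, its fields, and the sample values are hypothetical and chosen only for illustration.

```julia
# Hypothetical target record for a weather-analysis problem.
struct WeatherReading
    station::String                       # identifier of the reporting station
    condition::String                     # categorical label, e.g. "sunny"
    temperature::Union{Float64, Missing}  # numerical value; `missing` if not recorded
end

# Sample records (made-up values).
readings = [
    WeatherReading("A1", "sunny", 24.5),
    WeatherReading("A1", "rainy", missing),
    WeatherReading("B7", "foggy", 11.0),
]

# A column can be pulled out of the records when an algorithm needs a vector.
temperatures = [r.temperature for r in readings]
```

Declaring the field types up front means that a record which does not fit the agreed structure fails at construction time rather than deep inside an algorithm.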

After defining how the data will be structured, you define the process that produces it. This process works like a pipeline that accepts data in various forms and emits meaningful data in the predefined format. It consists of several steps, such as converting data from one form to another, which may require string operations or regular expressions, and finding missing values and outliers.
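The steps described here can be sketched as a small pipeline. The log-line format, the helper `parse_temp`, and the outlier rule (values more than five median absolute deviations from the median) are assumptions made for this example, not a prescribed method; a real pipeline would choose its rules to suit the data.

```julia
using Statistics  # standard library: median

# Hypothetical raw log lines; the format is assumed for illustration only.
raw_lines = [
    "2016-01-01 station=A1 temp=21.5",
    "2016-01-02 station=A1 temp=NA",
    "2016-01-03 station=A1 temp=22.0",
    "2016-01-04 station=A1 temp=20.8",
    "2016-01-05 station=A1 temp=95.0",
    "2016-01-06 station=A1 temp=21.9",
]

# Step 1: extract the temperature field with a regular expression and
# convert it to Float64; anything unparseable becomes `missing`.
function parse_temp(line)
    m = match(r"temp=(\S+)", line)
    m === nothing && return missing
    return something(tryparse(Float64, m.captures[1]), missing)
end

temps = [parse_temp(line) for line in raw_lines]

# Step 2: find the missing values.
n_missing = count(ismissing, temps)
present = collect(skipmissing(temps))

# Step 3: flag outliers with a simple median-based rule (an assumed threshold).
med = median(present)
mad = median(abs.(present .- med))
outliers = [t for t in present if abs(t - med) > 5 * mad]
```

With this sample input, the "NA" entry is counted as missing and the 95.0 reading is the only value flagged as an outlier.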

Data science problems generally revolve around two kinds of data: categorical and numerical. Categorical data comes with labels, each of which names a group of values. Weather, for example, can be treated as a categorical feature whose labels are sunny, cloudy, rainy, foggy, and snowy. Each underlying value is associated with one of these groups and takes that group's label. Labels have their own characteristics, and we may not be able to apply arithmetic operations to them.
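A short sketch of working with categorical labels, using only Base Julia. The sample observations and the variable names are hypothetical.

```julia
# Observed weather labels (categorical data, made-up sample).
weather = ["sunny", "rainy", "sunny", "foggy", "cloudy", "sunny"]

# The distinct labels, in order of first appearance.
labels = unique(weather)

# Count how many observations fall under each label.
counts = Dict(l => count(==(l), weather) for l in labels)

# Integer codes let algorithms handle the labels, but the codes are only
# identifiers: averaging or adding them has no meaning.
codes = [findfirst(==(w), labels) for w in weather]

# Base Julia defines no `+` method for two strings, so labels cannot be added.
can_add = applicable(+, "sunny", "rainy")
```

Counting and grouping are the natural operations on such data; arithmetic is not.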

Numerical data, such as temperature, is much more common. Temperature is recorded as floating-point numbers, so we can certainly apply mathematical operations to it. Every value can be compared with the other values in the dataset, so the values have a direct relation with each other.
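By contrast with labels, numerical values support arithmetic and ordering directly. The readings below are made-up values, assumed to be in degrees Celsius.

```julia
using Statistics  # standard library: mean

# Hypothetical temperature readings in degrees Celsius.
temperatures = [21.5, 23.0, 19.5, 25.0]

avg = mean(temperatures)                                # arithmetic mean
spread = maximum(temperatures) - minimum(temperatures)  # range of the readings
fahrenheit = temperatures .* 9 ./ 5 .+ 32               # element-wise unit conversion

# Values are directly comparable, so they can be ordered.
warmest_first = sort(temperatures, rev=true)
```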