Guide to Data Wrangling: Tools and Techniques for Data Scientists

Consider this hypothetical scenario: A professional well-versed in computers and technology is tasked with organizing a small company’s data and files into a more cohesive, holistic and easy-to-understand format. Throughout the company’s history, there has never been a standard process for how to catalog or store data. Some workers have been organizing company financial info in Microsoft Excel spreadsheets in a more structured manner, but other employees have not applied these standards to their own sheets. Even the company’s top executives have not been following a set protocol for storing their data, with senior officials using a range of tools, such as cloud platforms, external hard drives and even personal computer hard drives.

Data wrangling, the process of converting one form of data into a more organized and easy-to-interpret form, would be beneficial to this company and its staff, as well as other organizations. Wrangling can help professionals and companies to analyze and interpret their data more effectively, as long as wranglers are aware of the best tools and practices to use.

Wranglers can be dedicated data professionals who hold advanced degrees in computer science-related fields and are fluent in current analytics methodologies and practices. They may also be professionals who don’t necessarily have an academic background in analytics but who aim to organize and present data in an easier-to-understand manner. As long as professionals understand wrangling tools and best practices, they can collect and organize data in a more beneficial manner.

What Are Data Wrangling and Data Munging?

Data wrangling and munging are tools and processes that data analysts and other professionals can use to organize data. For example, a large investment firm could use data wrangling to organize complex information on certain stocks or investments. Data wrangling could also help a media company to better present how many views or impressions a collection of content generated over time.

Defining Data Wrangling and Data Munging

According to Principles of Data Wrangling, “The phrase data wrangling, born in the modern context of agile analytics, is meant to describe the lion’s share of the time people spend working with data.” This can include determining what data is available or what types of data exist, which types are beneficial and which are not necessary to include, and how to best present insights to others.

“Data munging,” often a synonym for “data wrangling,” refers to the “data preparation process of manually transforming and cleansing large data sets,” according to the software organization Import.io. “This process is typically performed manually using spreadsheets or scripts to filter out unwanted data and create a more relevant, digestible output.” “Munge” is an IT term for altering or changing a piece of data, sometimes destructively or irreversibly, Techopedia notes.

Types of Data Wrangling and Munging

According to Openbridge, data wrangling includes cleaning data, converting one form of data into another, and mapping and storing data.

Cleaning data entails modifying or removing items that are inconsistent with the rest of a data set. For example, if a company is creating a list of mailing addresses based on customer survey responses, cleaning data could mean adjusting inconsistent or incomplete ZIP codes so each one is in a nine-digit format.
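As a minimal sketch of this kind of cleaning, the Python snippet below normalizes ZIP codes taken from survey responses. The function name normalize_zip and the sample values are hypothetical, and the sketch assumes US ZIP codes. It only reformats values that already contain nine digits, because the four-digit extension cannot be inferred from a five-digit code; those entries are flagged for follow-up rather than guessed.

```python
import re

def normalize_zip(raw):
    """Return (zip_code, is_complete) for one raw survey response."""
    digits = re.sub(r"\D", "", raw)  # drop spaces, hyphens and other non-digits
    if len(digits) == 9:
        # Nine digits present: format as ZIP+4.
        return f"{digits[:5]}-{digits[5:]}", True
    if len(digits) == 5:
        # Usable five-digit ZIP, but the extension is missing.
        return digits, False
    # Anything else cannot be repaired automatically.
    return None, False

# Hypothetical survey responses
responses = ["12345-6789", "123456789", " 12345 ", "1234"]
cleaned = [normalize_zip(r) for r in responses]
```

Entries returned with is_complete set to False could then be looked up against an address service or sent back to the customer for confirmation.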

Converting data from one form to another can include extracting raw data on transactions for a certain time frame from a company’s credit card processors and transferring it to a more cohesive platform, such as Excel.

Mapping and storing data can also be components of the data wrangling or munging process. For example, a media production company wants to organize hundreds of hours of footage from an event. That company could catalog footage captured by certain camera operators onto individual hard drives, or it could design a map of where each individual segment or hour should be stored on a database server.

Industries Ideal for Data Wrangling and Munging

Any type of company or industry that gathers and organizes large amounts of data can benefit from data wrangling or munging. A community bakery that serves thousands of customers each year generates a lot of data from functions or events, including customer transactions, employee wages, and forecasts for future prices and costs. Data wrangling is also beneficial in helping a major accounting and consulting firm that generates hundreds of millions in annual revenue to organize account information, client payments and information regarding employee benefits.

Data wrangling may be particularly beneficial in the following industries:

  • Healthcare. On a given day at a hospital, a health practitioner provides a patient with a prescription for a specific dosage of a medication, an administrator completes a bill listing the services that a patient received, and individuals donate different amounts of money to help the organization with a new initiative. All of these activities generate data that the hospital needs to organize and understand.
  • Supply chain. A beverage company manager may try to determine how much production costs will change after using new materials, a new manufacturing device may help to produce the product at a 20% faster rate, and individuals may deliver and receive shipments and payments across the world in different time zones. These activities generate data in a variety of formats, all of which need to be organized and wrangled to be effectively interpreted and used.
  • Higher education. Thousands of students submit applications online to attend a university, staff members award need-based aid to recently admitted students, and the alumni organization sorts through emails regarding recent graduates’ job titles and the amount of money they are earning in their new roles. These activities generate data that can be wrangled.

Data Wrangling Tools and Techniques

Many tools and techniques can help professionals in their efforts to wrangle data so that others may use it to uncover insights. Some of these tools can facilitate data processing, and others can help make data more organized and easier to understand, but each one is useful to professionals as they wrangle data to benefit their organizations.

Processing and Organizing Data

The specific tool a professional uses to process and organize data can depend on the data type and the purpose or goal for the data. For example, spreadsheet software, such as Microsoft Excel or Google Sheets, may be suitable for certain data wrangling and organizing projects.

Solutions Review notes that big data processing and storage tools, such as Google BigQuery and Amazon Web Services, help to sort and store data. For example, Microsoft Excel can be used to catalog data, such as the number of transactions a business recorded over a given week. Google BigQuery, however, can help to store the data (the transactions) and can be used to analyze the data to determine how many transactions were above a certain amount, periods with a certain frequency of transactions, etc.
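The kind of question described here can be sketched in a few lines of Python. This is an illustration in plain Python, not BigQuery itself; the transaction records, the THRESHOLD value and the variable names are all hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical transactions: (date, amount in dollars)
transactions = [
    (date(2024, 3, 4), 12.50),
    (date(2024, 3, 4), 250.00),
    (date(2024, 3, 5), 99.99),
    (date(2024, 3, 6), 310.25),
    (date(2024, 3, 6), 8.75),
    (date(2024, 3, 6), 120.00),
]

THRESHOLD = 100.00  # assumed cutoff for a "large" transaction

# Transactions above the chosen amount
large = [(d, amt) for d, amt in transactions if amt > THRESHOLD]

# Number of transactions recorded on each day
per_day = Counter(d for d, _ in transactions)
busiest_day, busiest_count = per_day.most_common(1)[0]
```

With this sample data, large holds three transactions and the busiest day is March 6 with three transactions. A data warehouse would answer the same questions with a query over far more rows.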

Supervised and unsupervised machine learning algorithms can help to process and analyze the stored and organized data. “In a supervised learning model, the algorithm learns on a labeled data set, providing an answer key that the algorithm can use to evaluate its accuracy on training data,” writes Isha Salian for the Nvidia blog. “An unsupervised model, in contrast, provides unlabeled data that the algorithm tries to make sense of by extracting features and patterns on its own.”

For example, a supervised learning algorithm that has been designed to understand the difference between data sets of pictures of either donuts or pizza could ideally sort through a large data set of pictures of both. An unsupervised learning algorithm could be given 10,000 pictures of pizza, varying slightly in size, toppings, crust and other factors, and try to make sense of those images without any preexisting qualifiers or labels. Both learning algorithms would allow for the data to be better organized than what was included in the original set.

Cleaning and Consolidating Data

Excel allows individuals to store data. The organization Digital Vidya provides tips for cleaning data in Excel, such as deleting extra spaces, converting numbers from text into numerals and removing formatting. For example, after data has been moved into an Excel spreadsheet, removing extra spaces in individual cells can help support more accurate analysis later on. Leaving numbers written out as text (e.g., nine instead of 9) may hamper other analytical procedures.
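The same two cleanup steps can be sketched outside Excel. The Python snippet below is a minimal illustration, assuming cell values arrive as strings; the function name clean_cell and the small NUMBER_WORDS lookup (zero through nine only) are hypothetical.

```python
import re

# Assumed lookup for numbers written out as words (zero through nine only)
NUMBER_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
    "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
}

def clean_cell(value):
    """Trim extra spaces and convert numeric text to a number."""
    # Remove leading/trailing spaces and collapse repeated inner spaces.
    text = " ".join(value.split())
    lowered = text.lower()
    if lowered in NUMBER_WORDS:
        return NUMBER_WORDS[lowered]
    if re.fullmatch(r"-?\d+", text):
        return int(text)
    if re.fullmatch(r"-?\d+\.\d+", text):
        return float(text)
    return text  # leave ordinary text as cleaned text

# Hypothetical spreadsheet column
column = ["  nine ", " 42", "3.5", "  Main   Street "]
cleaned = [clean_cell(cell) for cell in column]
```

Here the written-out number and the digit strings become numeric values, while the street name is kept as text with its spacing normalized.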

As noted earlier, data wrangling best practices may differ depending on the organization or individual who will be accessing the data later, as well as the goal or purpose for the data’s use. The small bakery may not need to purchase a large database server, but it may need a digital tool or service that is more intuitive and comprehensive than a folder on a computer’s desktop. Specific types of database systems and tools include those offered by MySQL and Oracle.

Extracting Insights From Data

Professionals use different tools to extract insights from data, a step that follows the wrangling process.

Descriptive, diagnostic, predictive and prescriptive analytics can be applied to a data set that has been munged or wrangled to uncover insights. For example, descriptive analytics could show the small bakery how much profit it generated in a year. Diagnostic analytics could illustrate why it generated that amount of profit. Predictive analytics could show that the bakery may see a 10% drop in profit within the next year. Prescriptive analytics could highlight potential solutions that can help the bakery mitigate the potential drop.
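The arithmetic behind the bakery example can be sketched as follows. All figures and names are hypothetical, and the 10% drop is taken as a given input rather than produced by a forecasting model.

```python
# Hypothetical quarterly figures for the bakery, in dollars
revenue = [52000, 48000, 61000, 70000]
costs = [41000, 40000, 47000, 53000]

# Descriptive view: how much profit was generated in the year
annual_profit = sum(revenue) - sum(costs)

# A simple diagnostic view: which quarter contributed the most profit
quarterly_profit = [r - c for r, c in zip(revenue, costs)]
best_quarter = quarterly_profit.index(max(quarterly_profit)) + 1

# Predictive view, assuming a 10% drop has already been forecast
EXPECTED_DROP_PERCENT = 10
projected_profit = annual_profit * (100 - EXPECTED_DROP_PERCENT) / 100
```

With these numbers, the annual profit is 50,000 dollars, the fourth quarter contributes the most, and the projected profit after a 10% drop is 45,000 dollars.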

Datamation also notes different types of data tools that can be beneficial to organizations. Tableau, for example, enables users to access visualizations of their data, and IBM Cognos Analytics provides services that can help in various stages of an analytics process.

Additional Tips for Data Wrangling and Munging

Data wrangling and munging can be performed in many ways, and several tools can facilitate the process. Depending on the organization or individual for which the data is presented, a specific wrangler’s approach can vary. With the bakery example described earlier, the bakery owner might just want to see the data organized and wrangled in the most easy-to-understand form, but professionals in a large-scale accounting or consulting firm may want to see that data wrangled and presented more comprehensively. Individual needs can vary for data wrangling and munging. As such, wranglers should keep the following tips in mind.

Understand Your Audience

Specific data-wrangling needs or goals can vary by organization. For example, someone who is wrangling a firm’s financial data at the end of the fiscal year may be able to break it down into hyperspecific segments, such as individual purchases made by employees or the amount spent on contributing 401(k) matches for a certain department.

The data wrangling process may also depend on who is going to interpret it. If the data is being used to showcase to potential clients or investors the firm’s overall revenue-generating capabilities, the additional segmenting of 401(k) info or individual expenditures may be unnecessary. But if the firm’s executives are going to review the data to find potential areas to reduce costs and inefficiencies, additional segmenting may be helpful.

Knowing who is going to be accessing and using that data, as well as what those individuals want to achieve with the insights, is important when wrangling.

Reevaluate Your Work

Businesses may instruct professionals on how to wrangle a client’s data, such as by specific invoice, payment received or account expenditure. The methods will vary. For example, they may organize that information into an Excel spreadsheet, clean up data so it matches a certain format, and segment the data so it can be easily interpreted across invoices, payments received and expenditures.

Once professionals have completed the wrangling process, they may notice room for improvement. For example, more efficient wrangling could entail having unpaid invoices correspond to anticipated future payments or grouping account expenditures together rather than listing them individually. Additionally, the wrangler may notice an operations error, such as sending multiple invoices to clients for a single service.
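A check for the operations error mentioned above can be sketched in Python. The invoice records, identifiers and field layout are hypothetical; the sketch assumes each invoice names the client and the service it bills for.

```python
from collections import defaultdict

# Hypothetical invoice records: (invoice_id, client, service_id)
invoices = [
    ("INV-001", "Acme", "SVC-10"),
    ("INV-002", "Acme", "SVC-11"),
    ("INV-003", "Birch", "SVC-12"),
    ("INV-004", "Acme", "SVC-10"),
]

# Group invoice IDs by the client and service they bill for.
by_service = defaultdict(list)
for invoice_id, client, service_id in invoices:
    by_service[(client, service_id)].append(invoice_id)

# Any service billed more than once is a candidate duplicate.
duplicates = {key: ids for key, ids in by_service.items() if len(ids) > 1}
```

In this sample, the service SVC-10 for Acme appears on two invoices, INV-001 and INV-004, which would be worth reviewing before either is sent again.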

Data wrangling can be a time-consuming process, and after completing a project, a professional may want to move on to the next one. However, taking the extra time to reevaluate wrangled data to ensure that it’s of the highest quality and is organized as efficiently as possible can help to reduce errors and inefficiencies in the future.

Learn More About Data

Successful data wrangling and munging takes place when professionals understand the full scope of tools and services at their disposal, as well as the audience for whom they will be wrangling data. The audience may grow, and the types of tools and resources may expand. Data professionals need to adapt to these changes and be well-versed in new technologies and breakthroughs in analytics to ensure they are prepared to execute effective data wrangling and munging services.