Data Cleaning Challenge
DATA CLEANING CHALLENGE WITH PANDAS
Anastasia Migunova is a data scientist currently working at a Big Four firm. Based in Germany, she holds a Ph.D. in Applied Math and an M.A. in Computer Science.
You are given two files with sales information of an industrial company that produces, distributes and sells electronics worldwide.
The goal of the project is to clean the sales data stored in the two datasets, merge them and produce a single aggregated dataframe, in both long and wide format, with the 2020 revenue broken down by product, branch, sector and quarter. The final dataframe must be ready to load into well-known visualization tools like Qlik or Tableau so that senior managers can quickly check company data and create reports and charts.
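As an illustration of what "long and wide format" can mean for such a table, here is a minimal sketch. The column names (`product`, `branch`, `sector`, `date`, `revenue`) and all values are invented for the example and are not taken from the challenge files.

```python
import pandas as pd

# Hypothetical cleaned sales records; every column name and value is invented.
sales = pd.DataFrame({
    "product": ["TV", "TV", "Radio", "Radio", "TV"],
    "branch": ["Berlin", "Berlin", "Berlin", "Madrid", "Madrid"],
    "sector": ["retail", "retail", "wholesale", "retail", "wholesale"],
    "date": pd.to_datetime(
        ["2020-01-15", "2020-02-03", "2020-04-20", "2020-07-01", "2019-12-30"]
    ),
    "revenue": [100.0, 50.0, 200.0, 80.0, 999.0],
})

# Keep 2020 only and derive a quarter label (Q1..Q4) from the date.
sales_2020 = sales[sales["date"].dt.year == 2020].copy()
sales_2020["quarter"] = "Q" + sales_2020["date"].dt.quarter.astype(str)

keys = ["product", "branch", "sector"]

# Long format: one row per product/branch/sector/quarter combination.
long_df = sales_2020.groupby(keys + ["quarter"], as_index=False)["revenue"].sum()

# Wide format: one row per product/branch/sector, one column per quarter.
wide_df = long_df.pivot_table(
    index=keys, columns="quarter", values="revenue", aggfunc="sum", fill_value=0
).reset_index()
wide_df.columns.name = None  # drop the leftover "quarter" axis label
```

The long table tends to suit tools that expect one measure per row, while the wide table is easier to read as a report; which one a given dashboard needs depends on the tool.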
In addition to the two files with retail (129,000 rows) and wholesale (9,400 rows) data, you are provided with four extra files needed to complete some of the exercises and build the final dataset. Your task consists of cleaning both tables, combining them with the other files and calculating the 2020 revenue according to several conditions.
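The combining step could look roughly like the sketch below. It assumes, purely for illustration, that one of the extra files is a price lookup keyed by `product_id`; the actual files and their columns may differ.

```python
import pandas as pd

# Hypothetical stand-ins for the retail, wholesale and lookup tables.
retail = pd.DataFrame({"product_id": [1, 2], "units": [10, 5]})
wholesale = pd.DataFrame({"product_id": [2, 3], "units": [100, 40]})
prices = pd.DataFrame({"product_id": [1, 2, 3], "unit_price": [2.0, 3.0, 1.5]})

# Tag each table with its origin before stacking them.
retail["sector"] = "retail"
wholesale["sector"] = "wholesale"
combined = pd.concat([retail, wholesale], ignore_index=True)

# Left join keeps every sales row; validate raises if the lookup has duplicate keys.
combined = combined.merge(
    prices, on="product_id", how="left", validate="many_to_one"
)
combined["revenue"] = combined["units"] * combined["unit_price"]
```

A left join is used here so that sales rows without a matching lookup entry show up as nulls instead of silently disappearing.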
Cleaning involves dealing with null values, unwanted characters, duplicates, column types, string manipulation, etc. You will also have to create new columns, change date formats, standardize and replace values through dictionaries, apply list comprehensions and much more, with the aim of building a final table that follows a required data model.
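A few of these cleaning steps can be sketched as follows. The table, its column names and the mapping dictionary are made up for the example and do not reflect the real data.

```python
import pandas as pd

# Hypothetical raw rows with typical problems: nulls, stray characters,
# a duplicate, inconsistent labels, text dates and numbers stored as text.
raw = pd.DataFrame({
    "product": [" tv#", "Radio ", "Radio ", None],
    "country": ["DE", "Ger", "Ger", "ES"],
    "date": ["15/01/2020", "03/02/2020", "03/02/2020", "20/04/2020"],
    "amount": ["1,200", "80", "80", "45"],
})

# Drop rows without a product, then drop exact duplicate rows.
clean = raw.dropna(subset=["product"]).drop_duplicates().copy()

# String manipulation: trim whitespace, remove unwanted characters, normalise case.
clean["product"] = (
    clean["product"].str.strip().str.replace("#", "", regex=False).str.upper()
)

# Standardize values through a dictionary.
country_map = {"DE": "Germany", "Ger": "Germany", "ES": "Spain"}
clean["country"] = clean["country"].replace(country_map)

# Fix column types: parse day/month/year dates and convert amounts to floats.
clean["date"] = pd.to_datetime(clean["date"], format="%d/%m/%Y")
clean["amount"] = clean["amount"].str.replace(",", "", regex=False).astype(float)
```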
Anastasia has divided the project into more than 30 assignments so you can complete the challenge step by step.
It is a great project to practice the Pandas library and to get confident with data manipulation in Python. You will practice the majority of tasks data scientists and data analysts from a wide range of industries must perform to prepare and exploit data.
DOWNLOAD / CONTENT
You will receive an email with a ZIP file. In addition, the download is always available on your account.
1) One PDF with the instructions and guidelines, including the project broken down into 34 exercises that you may follow in case you need guidance.
2) Six data files: four spreadsheets and two .csv files.
3) A Notebook file with the solutions. It contains not only the source code but also detailed explanations and comments about how the code works. The code has been written by a senior developer so it is clean and easy to understand.
IMPORTANT: to open the solutions (Notebook) you need to have Jupyter or the Anaconda distribution installed on your machine. If you do not have it, you may download it here. It is free.
WHAT YOU WILL PRACTICE
– Libraries: Pandas, Numpy, datetime.
– Import and read .csv and Excel files.
– Remove, select, rename, filter columns and rows.
– Data types.
– Conditional slicing.
– Convert to long and wide format.
– Merge and joins.
– Loops (for).
– Lists and dictionaries.
– apply + lambda.
– List comprehension.
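To give a flavour of the last two items, here is a small sketch with invented column names and an arbitrary threshold of 100 units; it is not taken from the exercises.

```python
import pandas as pd

# Hypothetical table with untidy column names.
df = pd.DataFrame({"Unit Price ": [2.0, 3.0], " Units": [10, 200]})

# List comprehension: normalise every column name in one pass.
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# apply + lambda: label each row based on a condition (threshold is arbitrary).
df["order_size"] = df.apply(
    lambda row: "bulk" if row["units"] >= 100 else "standard", axis=1
)
```

Row-wise `apply` is easy to read but can be slow on large tables, so a vectorized expression is often preferred once the logic is settled.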
If you need additional information, do not hesitate to contact us.
1 review for Data Cleaning Challenge
Pandas is the most important Python library if you want to get into data analysis. The workload is fair and it covers all the necessary Pandas topics, such as data wrangling, aggregation, merges and strings.
The solutions to the exercises come in a Jupyter Notebook and they are concise, well structured and properly explained.