Data Wrangling with Python – Intermediate

Python is known for its incredible versatility and simplicity in handling data, making it an excellent tool for data wrangling. This article will delve into the intermediate aspects of Python, such as file manipulation and reading CSV files. This guide assumes a basic knowledge of Python and Python syntax. If you need a refresher on Python basics, check out HODP’s Python for beginners guide.

File Input/Output (I/O)

File I/O operations are crucial in Python, especially when dealing with large amounts of data stored outside of Python, such as text files or spreadsheets exported to CSV. Python makes it easy to read and write files in different modes, making data manipulation efficient and straightforward.

Opening Files

To open a file in Python, we use the built-in open() function. open() is most often called with two arguments: the file’s name (or path) and the mode in which we want to open the file. The mode is optional and defaults to 'r'.

Python provides different modes for opening a file. The common modes are:

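  • 'r' for read-only (the default).
  • 'w' for write-only; the file is created if it is missing and emptied if it exists.
  • 'a' for append; new data is added at the end of the file.
  • 'x' for exclusive creation; opening fails if the file already exists.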

In addition to these modes, you can specify text mode ('t') or binary mode ('b') by adding the letter to the mode, as in 'rb'. By default, files are opened in text mode. After processing a file, we should always close it using its close() method to free up any resources associated with the file. Opening the file in a with statement does this automatically when the block ends.
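As a minimal sketch of the modes and of closing files, the example below creates a file with 'x', shows that a second exclusive creation fails, and then reads the file back with the default mode. The file name notes.txt and the temporary directory are assumptions chosen only so the example runs without touching real data.

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "notes.txt")

# 'x' creates the file and fails if it already exists.
f = open(path, "x")
f.write("first line\n")
f.close()  # release the file handle explicitly

# A second 'x' open raises FileExistsError because the file is now there.
try:
    open(path, "x")
    created_twice = True
except FileExistsError:
    created_twice = False

# 'r' is the default mode, so open(path) is the same as open(path, "r").
# The with statement closes the file when the block ends.
with open(path) as f:
    contents = f.read()

print(contents)   # first line
print(f.closed)   # True
```

The with form is usually the safer habit, since the file is closed even if an error occurs inside the block.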

File Operations

File objects provide several methods for reading and writing. The read() method reads the entire file (or, when given a size, at most that many characters), and the write() method writes a string to the file. However, opening a file in write mode erases all of its existing data, so it’s often preferable to use append mode when you want to add data to an existing file.
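The difference between write mode and append mode can be sketched as follows. The file name log.txt and its contents are made up for illustration.

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "log.txt")

with open(path, "w") as f:
    f.write("old data\n")

# Opening with 'w' again empties the file before writing.
with open(path, "w") as f:
    f.write("new data\n")

# Opening with 'a' keeps what is there and adds to the end.
with open(path, "a") as f:
    f.write("appended line\n")

with open(path) as f:
    result = f.read()

print(result)
# new data
# appended line
```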

The tell() method returns the current position of the file pointer, and the seek() method moves the pointer to a given position. This gives us control over where in the file we are reading or writing.
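A small sketch of moving the file pointer is shown below. It opens the file in binary mode, an assumption made here so that positions are plain byte offsets; in text mode, the values returned by tell() should only be passed back to seek(), not interpreted as character counts. The file name is hypothetical.

```python
import os
import tempfile

# Hypothetical scratch file holding ten known bytes.
path = os.path.join(tempfile.mkdtemp(), "letters.bin")
with open(path, "wb") as f:
    f.write(b"abcdefghij")

with open(path, "rb") as f:
    start = f.tell()        # 0: the pointer starts at the beginning
    first = f.read(4)       # b"abcd"
    after_read = f.tell()   # 4: reading advanced the pointer
    f.seek(0)               # jump back to the beginning
    again = f.read(2)       # b"ab"
    f.seek(7)               # jump to byte offset 7
    tail = f.read()         # b"hij"

print(start, first, after_read, again, tail)
```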

File objects also have a readlines() method, which returns a list of all lines in the file. But if you only want to loop through every line, you can loop over the file object directly. This reads one line at a time instead of loading the whole file into memory, which makes large files more manageable.
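The two approaches can be compared with a short sketch. The file name and its three lines are invented for the example.

```python
import os
import tempfile

# Hypothetical scratch file with three short lines.
path = os.path.join(tempfile.mkdtemp(), "words.txt")
with open(path, "w") as f:
    f.write("alpha\nbeta\ngamma\n")

# readlines() loads every line into a list at once.
with open(path) as f:
    all_lines = f.readlines()   # ['alpha\n', 'beta\n', 'gamma\n']

# Looping over the file object yields one line at a time.
line_count = 0
longest = ""
with open(path) as f:
    for line in f:
        line = line.rstrip("\n")   # drop the trailing newline
        line_count += 1
        if len(line) > len(longest):
            longest = line

print(line_count, longest)   # 3 alpha
```

Each line keeps its trailing newline in both cases, so stripping it is a common first step.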

Reading CSVs

Python provides built-in support for reading and writing CSV files through the csv module. This module offers reader() and DictReader(), both of which let us read CSV files conveniently.

reader() returns each row in the CSV as a list of strings, so we access data using indices. It is a good fit for files without a header row, or when column position is all you need. DictReader() works similarly, but it returns each row as a dictionary, taking the keys from the first row of the file by default. This is convenient when the file has a header row and you want to access values by column name. Both read the file one row at a time, so either can be used on large CSV files.

import csv

# newline="" lets the csv module handle line endings itself
with open("students.csv", newline="") as f:
    reader = csv.DictReader(f, delimiter=",")
    for row in reader:
        print(row)  # each row is a dict keyed by column name

This code reads a CSV file and prints each row, which is represented as a dictionary where the keys correspond to the column names and the values to the data in the respective cells.
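For comparison, here is a sketch that reads the same data with both reader() and DictReader(). The column names and values are invented, and io.StringIO stands in for a real file so that the example is self-contained; with a file on disk you would pass the open file object instead.

```python
import csv
import io

# Hypothetical CSV content; a real script would open a file instead.
data = "name,grade\nAda,90\nGrace,95\n"

# reader(): every row, including the header, is a list of strings.
rows = list(csv.reader(io.StringIO(data)))
header, records = rows[0], rows[1:]
grades = [int(record[1]) for record in records]   # access by index

# DictReader(): the header row supplies the keys for each row.
dict_rows = list(csv.DictReader(io.StringIO(data)))
names = [row["name"] for row in dict_rows]        # access by column name

print(header)   # ['name', 'grade']
print(grades)   # [90, 95]
print(names)    # ['Ada', 'Grace']
```

Note that both readers return every value as a string, so numeric columns need an explicit conversion such as int().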

In conclusion, file handling and CSV parsing are core skills for data wrangling in Python. With open(), the file methods covered above, and the csv module, you can read, write, and process data stored outside your program efficiently and flexibly.