When importing and exporting CSV files, a common challenge is handling newline characters within cells.
Let’s explore simple solutions to help manage such scenarios effectively with Python.
🤔 Understanding the issue
CSV files use specific characters (or delimiters), like commas (’,’) or pipes (’|’) to separate values and newlines to indicate new rows. But what if the data itself contains new lines (within a cell)?
Let’s imagine we extract the following data from a worksheet:
Name | Description
-----------|------------
Pump 1 | Not running
Pump 2 | Turned off
at panel
If saved as a CSV file, it may look like this:
Name|Description
Pump 1|Not running
Pump 2|"Turned off
at panel"
Notice the quotes used around the “Turned off at panel” cell. These exist because the CSV writer used quotes to handle cells with multiple cells i.e. multiline cells.
✍️ Writing data to CSV
Before writing data to CSV, you need to handle newlines within cells to avoid any cell content breaking the structure of the CSV file.
Let’s look at a few ways you could write your data to a CSV while cleaning up newlines in each cell content:
Note⚠️: The following steps are shown for a Windows machine using ‘\r\n’ for newlines. Replace with ‘\n’ if you’re on a Unix-based machine instead.
🔄 Approach 1: Replace newlines
You can swap newlines with a space or another placeholder of your choice:
df["Description"] = df["Description"].str.replace('\\r\\n', ' ', regex=True)
df.to_csv("new_file.csv", sep='|', index=False, quoting=csv.QUOTE_NONE, escapechar='\\\\')
Output:
Name|Description
Pump 1|Not running
Pump 2|Turned off at panel
Let’s break down the code used:
.str.replace('\\r\\n', ' ', regex=True)
- searches for newline characters and replaces them with spacesdf.to
_csv
- writes the dataframe to a CSV filesep='|'
- sets the pipe character as the column delimiterindex=False
- excludes row indices from the CSV filequoting=csv.QUOTE_NONE
- Stops pandas from adding quotes automaticallyescapechar='\\\\'
- Escapes delimiters that show in the data
Pros✅
Easy to implement
Preserves the row integrity in the CSV file
Cons❌
Loses the original data format
Not suitable if a newline has semantic meaning
📦Approach 2: Wrap everything with quotes
Configure the CSV writer to place quotes around every field and record, which preserves the newlines as they are:
df.to_csv("new_file.csv", sep='|', index=False, quoting=csv.QUOTE_ALL)
Output:
"Name"|"Description"
"Pump 1"|"Not running"
"Pump 2"|"Turned off
at panel"
This code is similar to the previous approach, apart from:
quoting=csv.QUOTE_ALL
- encloses all fields in quotes
Pros✅
Preserves the data in its original form, including the newlines too
Doesn’t require any data manipulation before saving it
Cons❌
CSV files become harder to read in simple text editors
More complex parsing logic may be required to read the data back in
🗃️ Approach 3: Use custom encoding for newlines
Encode the newlines in the cells with a unique placeholder before writing to CSV and decode them when reading:
df["Description"] = df["Description"].str.replace("\\r\\n", "<br>")
df.to_csv("new_file.csv", sep='|', index=False)
Output:
Name|Description
Pump 1|Not running
Pump 2|Turned off<br>at panel
The .str.replace("\\r\\n", "<br>")
line replaces the newlines with an HTML break indicator, which represents the custom encoding solution we’re writing into our logic.
Pros✅
Keeps the structure of the CSV clean and intact
Allows you to restore the original formatting after decoding
Cons❌
Requires constant encoding and decoding steps
The placeholder may collide with your actual data if it appears anywhere else coincidentally (choose one that doesn’t appear anywhere in your data)
📚 Approach 4: Use libraries for multi-line CSV formats
Some modules support the handling of CSV files with multiline formats using quotes:
import csv
df.to_csv("new_file.csv", sep='|', lineterminator='\\r\\n', quoting=csv.QUOTE_NONNUMERIC)
Output:
""|"Name"|"Description"
0|"Pump 1"|"Not running"
1|"Pump 2"|"Turned off
at panel"
Let’s break down some lesser-known areas of our code:
lineterminator='\\r\\n'
- defines the newline character for the end of each recordquoting=csv.QUOTE_NONNUMERIC
- wraps quotes around every field that isn’t numeric
Pros✅
Retains the data formatting
Matches the CSV standards for handling multi-line fields
Cons❌
Not every CSV parser supports multi-line records
Can still be confusing if the file is opened and viewed in simple text editors
🏁 Conclusion
When deciding on what method to advance with, think about what is important for your data and who will be using it.
Each method has its benefits and trade-offs, so select the one that fits your situation best, and most importantly, always test to make sure your data looks right once everything is processed!