How to Open Big CSV Files

profile picture of article author Jason by Jason Strauss Data Engineer at CSV Explorer

CSV files are chunks of text used to move data between spreadsheets, databases, and programming languages. Spreadsheet software, like Excel, can have a difficult time opening very large CSVs. I’ll explain why large CSVs are difficult to work with and outline some tools to open big CSV files.

The Difficulty with Opening Big CSVs in Excel

Screenshot from Oregon Trail computer game. From the animals you shot, you got 2949 pounds of meat but were only able to carry 100 pounds back to the wagon.

Spreadsheet software, like Excel and Google Sheets, work by loading entire files into a computer's high speed memory (RAM). Excel is limited to opening CSVs that fit within your computer’s RAM. For most modern computers, that means a limit of about 60,000 to 200,000 rows. Google Sheets runs on Google computers with more RAM and can often open slightly bigger CSV files.

The Tools

Databases and programming languages are able to open enormous data sets by bringing small chunks of a file into RAM as they're needed. Below, I'll outline some popular databases and programming languages to open large CSVs such as shells, SQL databases, & Python.

Shells

Shells are computer applications for running software commands. They are good for getting samples of the data and basic searching. They aren't good for numerical analysis.

Mac, Windows, and Linux computers all come with a shell installed. Mac computers come with the Terminal application. Windows comes with PowerShell. Here are some popular commands for working with big CSVs in the shell.

Windows PowerShell

# Print the first 100 lines of big_file.csv:
Get-Content -First 100 'C:\Users\jason\Desktop\big_file.csv'

# Print the last 100 lines of big_file.csv
Get-Content -Last 100 'C:\Users\jason\Desktop\big_file.csv'

# Print lines that contain the word "Utah"
Get-Content 'C:\Users\jason\Desktop\big_file.csv' | Select-String "Utah"

# Get the first 100 lines of big_file.csv and write them to a new file, sample.csv
Get-Content -First 100 'C:\Users\jason\Desktop\big_file.csv' | Out-File 'C:\Users\jason\Desktop\sample.csv'

# Count the number of lines in big_file.csv (3 separate commands)
# Note that we tell the PowerShell to count in chunks of 2,000 rows
$count = 0
Get-Content '.\big_file.csv' -ReadCount 2000 | foreach { $Count += $_.count }
$count

Mac Terminal

# Count the number of lines in big_file.csv
wc -l /Users/jason/Desktop/big_file.csv

# Print the first 100 lines of big_file.csv
head -100 /Users/jason/Desktop/big_file.csv

# Print the last 100 lines of big_file.csv
tail -100 /Users/jason/Desktop/big_file.csv

# Print lines that contain the word "Utah"
grep Utah /Users/jason/Desktop/big_file.csv

# Get the first 100 lines of big_file.csv and write them to a new file, sample.csv
head -100 /Users/jason/Desktop/big_file.csv > /Users/jason/Desktop/sample.csv

SQL Databases

SQL Databases are applications for querying, aggregating, and updating rows of data. They can be difficult to set up but are remarkably powerful. If you’re planning on using your CSV in a software application or connecting your CSV to a Business Intelligence tool, you should load your CSV into a SQL database. Many BI tools can import CSVs but the tools often have size limits.

SQL stands for Structured Query Language and is a standard for querying data across many databases. The most Popular open-source SQL databases are MySQL and PostgreSQL. Loading a CSV into PostgreSQL or MySQL requires 3 (not-so-easy) steps.

  1. Create and connect to a database: Install a database locally (not recommended) or use a cloud database like Amazon’s RDS or Compose.com
  2. Create the table and schema: Databases are strict about datatypes and require you to define which columns contain text, numbers, and dates.
    CREATE TABLE Flights(Airline TEXT, Flight_Time INT, Departure_Date TIMESTAMP);
  3. Load the CSV: Use the database's commands to import the CSV to your new table. COPY in PostgreSQL and LOAD in MySQL.
    COPY Flights from 'FlightData.csv' CSV HEADER;

Be sure to use the COPY and LOAD commands and not INSERT, which will be much slower on large CSVs.

Microsoft Access

Microsoft Access is Microsoft's desktop database tool. It comes with a user interface for importing CSVs and querying them. You can also write SQL in Access and connect Access to Excel. Access is included in the Microsoft Office Professional Suite or can be downloaded here. See Microsoft's documentation on importing CSVs into Access.

Python

Python is a general-purpose programming language and contains many of the same functions as SQL. It's easier to load a CSV into Python than into a database. Python is the tool-of-choice for many data scientists and statisticians. There are several Python tools for working with large CSVs: the native CSV module, Pandas, and csvkit. CSV Explorer uses Python to parse large CSVs before loading them into a database.

Python comes pre-installed on Mac computers, and can be opened by opening the Terminal and typing python. For Windows, you can download Python here. To read large files in either the native CSV module or Pandas, use chunksize to read small parts of the file at time.

Other programming languages like R, SAS, and Matlab have similar functions for opening and analyzing CSVs.

CSV Explorer

CSV Explorer is a tool for basic searching, aggregating, and plotting of big CSV files. Behind the scenes, it uses a combination of Python and SQL to open big CSVs.

The documentation will show you examples to search, aggregate, plot, and export to Excel. You can try CSV Explorer on your data here.

Conclusion

The tools for working with big CSVs are powerful, but not friendly to beginners. Using a Shell is a great way to get a sample of a large CSV and get started. For more complicated analysis, Python or a SQL database will give you much more power. CSV Explorer lies somewhere in the middle and is often a good first step for looking at big CSVs.