A CSV (comma-separated values) file is a text file that has a specific format which allows data to be saved in a table structured format. It uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The term CSV also denotes several closely-related delimiter-separated formats that use other field delimiters such as semicolons. A delimiter guaranteed not to be part of the data greatly simplifies parsing.

This article discusses about how to combine multiple CSV (or .xlsx) file into a single .CSV (or .xlsx) files.

List Files

Let assume there are five mock sales files in CSV format that I am going to use in the next step of the tutorial. Each one of the files, includes million rows and multiple number of columns.

Following code snippet show how to prepare list of the files to merge based on given pattern.

fileNamePrefix = 'AAFull_Table'
path = "C:\\Users\\anandabh\\Downloads\\Report\\"

# Create list with files to merge based on name convention
def findFilesToMerge(filePrefix):
    file_list = [path + f for f in os.listdir(path) if f.startswith(fileNamePrefix)]
    return file_list

In the above example listdir() list the content (files and directory) of the directory path. Next we filter the files which name starts with fileNamePrefix using f.startswith(). This function finally returns the list of the files which we intent to merge.

Create Dataframe

Next we will create dataframe using Pandas. File list created above is unordered, sorted() function is used to sort the files based on the its name.

Then iteratively parse each CSV file in the file_list with pandas read_csv() method and append these datasets to the df_list. read_csv() is used to parse a CSV, the dataset are automatically converted to pandas DFs. Together with parsing the files, I am adding a new column in each DF that includes the name of the original CSV file.

def createDF(file_list):
    # creates empty list to include the content of each file converted to pandas DF
    df_list = []

    # reads each (sorted) file in file_list, converts it to pandas DF and appends it to the df_List
    # Also add new column having the file name at the end
    for file in sorted(file_list):
        df_list.append(pd.read_csv(file).assign(File_Name = os.path.basename(file)))

    return df_list

To read from .xlsx file instead of .csv file, use read_excel() instead of read_csv().

Below function finally merge the pandas DF by applying the concat() method on df_list. ignore_index = True means that a brand new ordered index will be generated for the merged dataframe.

# Merges single pandas DFs into a single DF, index is refreshed 
def mergeDataFrame(df_list):
    return pd.concat(df_list, ignore_index=True)

Merge Dataframe

Finally we write the merged dataframe to a file. First create a xlsxwriter engine, which we use to write the result to a file. to_excel function take xlsx write engine, and saves the result to file in the sheet name Sheet1.

def writeToExcel(dataFrame, path, outfileName):
    # Create a Pandas Excel writer using XlsxWriter as the engine.
    writer = pd.ExcelWriter(path + outfileName + '_merged.xlsx', engine='xlsxwriter')

    # Convert the dataframe to an XlsxWriter Excel object.
    dataFrame.to_excel(writer, sheet_name='Sheet1', index=False)

    # Close the Pandas Excel writer and output the Excel file.
    writer.save()

If we want to save the result to .csv file instead of .xlsx file, use to_csv() instead of to_excel(), as shown below

dataFrame.to_csv(path + 'sales_records_full.csv', index=False)

Example

Following is full code, which includes code snippet which we discussed above. It reads multiple .csv file and save the result to a .xlsx file.

## Merge Multiple 1M Rows CSV files
import os
import pandas as pd
import xlsxwriter

# CSV Directory
path = "C:\\Users\\anandabh\\Downloads\\Report\\"

fileNamePrefix = 'AAFull_Table'
outFilName = 'merged'

# Create list with files to merge based on name convention
def findFilesToMerge(filePrefix):
    file_list = [path + f for f in os.listdir(path) if f.startswith(fileNamePrefix)]
    return file_list

def createDF(file_list):
    # creates empty list to include the content of each file converted to pandas DF
    df_list = []

    # reads each (sorted) file in file_list, converts it to pandas DF and appends it to the df_List
    for file in sorted(file_list):
        df_list.append(pd.read_csv(file).assign(File_Name = os.path.basename(file)))

    return df_list

# Merges single pandas DFs into a single DF, index is refreshed 
def mergeDataFrame(df_list):
    return pd.concat(df_list, ignore_index=True)

def getFileName(filePath):
    return os.path.basename(filePath)

def writeToExcel(dataFrame, path, outfileName):
    # Create a Pandas Excel writer using XlsxWriter as the engine.
    writer = pd.ExcelWriter(path + outfileName + '_merged.xlsx', engine='xlsxwriter')

    # Convert the dataframe to an XlsxWriter Excel object.
    dataFrame.to_excel(writer, sheet_name='Sheet1', index=False)

    # Close the Pandas Excel writer and output the Excel file.
    writer.save()


file_list = findFilesToMerge(fileNamePrefix)

if len(file_list) > 0:
    outfileName =  getFileName(file_list[0])

    df_merged = mergeDataFrame(createDF(file_list))

    writeToExcel(df_merged, path, outfileName)