1. YouTube Summaries
  2. Excel Sheet Comparison with Python: Generating Automated Reports

Excel Sheet Comparison with Python: Generating Automated Reports

By scribe 7 minute read

Create articles from any YouTube video or use our API to get YouTube transcriptions

Start for free
or, create a free article to see how easy it is.

Introduction to Excel Sheet Comparison

In today's data-driven world, comparing Excel sheets efficiently is a crucial skill for professionals across various industries. Whether you're a data analyst, financial expert, or business manager, the ability to quickly identify differences between datasets can save time and provide valuable insights. This comprehensive guide will walk you through the process of comparing Excel sheets and generating automated comparison reports using Python and the pandas library.

Setting Up the Environment

Before we dive into the code, let's set up our working environment. For this tutorial, we'll be using Google Colab as our Integrated Development Environment (IDE). Google Colab is a cloud-based platform that allows you to write and execute Python code in your browser, making it an excellent choice for data analysis tasks.

Why Google Colab?

  • Free to use
  • Provides 1TB of cloud storage
  • No need to worry about local hardware limitations
  • Easy to share and collaborate on notebooks

If you're new to Google Colab, you can find detailed tutorials and documentation online to help you get started.

Preparing the Data

For this example, we'll be working with two sample Excel sheets: "Day 1 Report" and "Day 2 Report". These sheets contain simplified data to make it easier to understand the comparison process. In a real-world scenario, you might be dealing with much larger datasets.

Sample Data Structure

Both Excel sheets contain the following columns:

  • Date
  • Items
  • Amount

The "Items" column includes categories such as:

  • Furniture
  • Office Supplies
  • Technology
  • Peripherals
  • Laptops

We've intentionally introduced some differences between the two sheets to demonstrate how our comparison script will identify and report these variations.

Writing the Python Code

Now, let's break down the Python code that will perform the Excel sheet comparison and generate our report.

Importing Required Libraries

import pandas as pd
from google.colab import drive

We start by importing the pandas library, which is essential for data manipulation and analysis in Python. We also import the Google Colab drive module to access files stored in our Google Drive.

Mounting Google Drive

drive.mount('/content/drive')

This line mounts your Google Drive to the Colab environment, allowing you to access files stored there.

Reading Excel Files

df1 = pd.read_excel('/content/drive/MyDrive/day1_report.xlsx')
df2 = pd.read_excel('/content/drive/MyDrive/day2_report.xlsx')

Here, we use pandas to read both Excel files into DataFrame objects. Make sure to replace the file paths with the correct locations of your Excel files in Google Drive.

Displaying Data Preview

print("Day 1 Report:")
print(df1.head())
print("\nDay 2 Report:")
print(df2.head())

This code snippet displays the first few rows of each DataFrame, allowing us to quickly verify that the data has been loaded correctly.

Comparing DataFrame Structures

columns_match = df1.columns.equals(df2.columns)
print(f"Columns match: {columns_match}")

Before proceeding with the comparison, we check if the columns in both DataFrames match. This step is crucial to ensure that we're comparing compatible datasets.

Performing Row-wise Comparison

merged = pd.merge(df1, df2, how='outer', indicator=True)
differences = merged[merged['_merge'] != 'both']

day1_only = differences[differences['_merge'] == 'left_only'].drop('_merge', axis=1)
day2_only = differences[differences['_merge'] == 'right_only'].drop('_merge', axis=1)

This section of code performs the actual comparison:

  1. We merge the two DataFrames using an outer join.
  2. We identify rows that are not present in both DataFrames.
  3. We separate the differences into two categories: rows exclusive to Day 1 and rows exclusive to Day 2.

Generating the Comparison Report

with pd.ExcelWriter('/content/drive/MyDrive/comparison_report.xlsx') as writer:
    day1_only.to_excel(writer, sheet_name='Exclusive to Day 1', index=False)
    day2_only.to_excel(writer, sheet_name='Exclusive to Day 2', index=False)
    differences.to_excel(writer, sheet_name='All Differences', index=False)

print("Comparison report generated successfully.")

Finally, we create an Excel file containing three sheets:

  1. Rows exclusive to Day 1
  2. Rows exclusive to Day 2
  3. All differences combined

This report is saved in your Google Drive, making it easy to access and share.

Analyzing the Comparison Report

Once the comparison report is generated, you can open it to analyze the differences between the two Excel sheets. Let's examine what kind of insights you might gain from this report:

Exclusive to Day 1

In our example, you might find:

  • A "Computers" entry with an amount of 7,834
  • A "Technology" entry with an amount of 7,800

Exclusive to Day 2

You might see:

  • A "Mouse" entry with an amount of 30
  • A "Technology" entry with an amount of 9,800

Interpreting the Results

From these differences, we can deduce:

  1. The "Computers" category was present in Day 1 but removed in Day 2.
  2. A new "Mouse" category was added in Day 2.
  3. The amount for the "Technology" category increased from 7,800 to 9,800.

These insights can be crucial for various business scenarios, such as:

  • Identifying data entry errors
  • Tracking inventory changes
  • Monitoring budget allocations
  • Detecting unauthorized modifications

Benefits of Automated Excel Comparison

Implementing this automated comparison process offers several advantages:

  1. Time Efficiency: What might take hours to do manually can be accomplished in seconds.
  2. Accuracy: Eliminates human error in the comparison process.
  3. Scalability: Can handle large datasets with ease.
  4. Consistency: Provides a standardized format for comparison reports.
  5. Quick Decision Making: Allows for rapid identification of discrepancies, enabling faster response times.

Expanding the Functionality

While our current script provides a solid foundation for Excel sheet comparison, there are several ways you could enhance its functionality:

1. Handling Multiple Sheets

Expand the code to compare multiple sheets within each Excel file:

def compare_multiple_sheets(file1, file2):
    xls1 = pd.ExcelFile(file1)
    xls2 = pd.ExcelFile(file2)
    
    for sheet in xls1.sheet_names:
        if sheet in xls2.sheet_names:
            df1 = pd.read_excel(file1, sheet_name=sheet)
            df2 = pd.read_excel(file2, sheet_name=sheet)
            # Perform comparison and generate report
        else:
            print(f"Sheet '{sheet}' not found in second file")

2. Customizable Comparison Criteria

Allow users to specify which columns to compare:

def compare_sheets(df1, df2, compare_columns):
    merged = pd.merge(df1[compare_columns], df2[compare_columns], how='outer', indicator=True)
    # Rest of the comparison logic

3. Highlighting Changes

Implement a feature to highlight cells that have changed between the two sheets:

def highlight_changes(df1, df2):
    changes = df1.ne(df2)
    styled = df1.style.apply(lambda x: ['background-color: yellow' if v else '' for v in changes], axis=None)
    return styled

4. Statistical Analysis

Add statistical comparisons for numerical columns:

def statistical_comparison(df1, df2, numeric_columns):
    stats1 = df1[numeric_columns].describe()
    stats2 = df2[numeric_columns].describe()
    diff = stats2 - stats1
    return diff

5. Visualization of Differences

Create charts or graphs to visualize the differences:

import matplotlib.pyplot as plt

def plot_differences(df1, df2, column):
    plt.figure(figsize=(10, 6))
    plt.bar(['Day 1', 'Day 2'], [df1[column].sum(), df2[column].sum()])
    plt.title(f'Comparison of {column} Totals')
    plt.ylabel('Total')
    plt.savefig('comparison_plot.png')

Best Practices for Excel Comparison

To ensure the most effective use of this Excel comparison tool, consider the following best practices:

  1. Data Cleaning: Before comparison, ensure your data is clean and consistently formatted.
  2. Version Control: Keep track of different versions of your Excel files to maintain a history of changes.
  3. Regular Comparisons: Set up a schedule for regular comparisons to catch discrepancies early.
  4. Documentation: Maintain clear documentation of your comparison process and any custom modifications to the script.
  5. Security: Be mindful of data sensitivity and ensure proper access controls are in place.

Troubleshooting Common Issues

When working with Excel comparisons, you might encounter some common issues. Here's how to address them:

  1. Mismatched Column Names: Ensure column names are identical in both sheets. Use string manipulation functions to standardize names if necessary.

  2. Data Type Inconsistencies: Convert columns to appropriate data types before comparison:

    df1['Amount'] = pd.to_numeric(df1['Amount'], errors='coerce')
    df2['Amount'] = pd.to_numeric(df2['Amount'], errors='coerce')
    
  3. Missing Values: Handle missing values explicitly:

    df1 = df1.fillna('N/A')
    df2 = df2.fillna('N/A')
    
  4. Large File Sizes: For very large files, consider using chunking:

    chunk_size = 10000
    for chunk1, chunk2 in zip(pd.read_excel(file1, chunksize=chunk_size),
                              pd.read_excel(file2, chunksize=chunk_size)):
        # Perform comparison on chunks
    
  5. Encoding Issues: Specify the correct encoding when reading files:

    df = pd.read_excel(file_path, encoding='utf-8')
    

Conclusion

Mastering Excel sheet comparison with Python opens up a world of possibilities for data analysis and quality control. By automating this process, you can save countless hours and gain insights that might otherwise be overlooked. The script we've developed provides a solid foundation, but remember that the real power lies in customizing and expanding it to meet your specific needs.

As you continue to work with this tool, consider the following next steps:

  1. Experiment with different datasets to understand the tool's capabilities and limitations.
  2. Explore additional pandas functions to enhance the comparison process.
  3. Integrate this comparison tool into your regular workflow for continuous data quality monitoring.
  4. Share your insights and improvements with colleagues to foster a culture of data-driven decision making.

Remember, the key to successful data analysis is not just in the tools we use, but in our ability to interpret and act on the insights they provide. Happy comparing!

Article created from: https://youtu.be/vmQjWkOigfY

Ready to automate your
LinkedIn, Twitter and blog posts with AI?

Start for free