PowerShell Record Extractor

Introduction

When you work on Big Data projects, you tend to work with very large files. In my day job, I frequently deal with CSV files that are larger than 500 MB. Even on my Dell workstation, it takes a long time to open such a file in Excel or a text editor such as Notepad++. During development you typically don’t want to work with files of this size; in many scenarios it is sufficient to work with just the top N records.

I found myself frequently reaching for the PowerShell Get-Content or Import-Csv cmdlets, but even those don’t handle large files well, especially since they still try to read the entire file. In addition, cmdlets such as Import-Csv assume a specific format (in this case, comma-separated values).

Getting a bit frustrated with what was out there, I decided to quickly write my own little PowerShell script, which is described below.

The ExtractRecords Script

Requirements

The requirements for my custom script were very simple and straightforward:

  1. The script should work with text files and read and write one line at a time.
  2. The code should not care about the specific text format (CSV, tab-delimited, etc.).
  3. The parameters to the script should be the following:
    • The first parameter should allow me to specify the input file name.
    • The second parameter specifies the output file name.
    • The next parameter should specify the number of records to extract. This parameter should default to 1000.
    • The final parameter should allow me to optionally skip the first (header) record. In some cases, my files do have header records, while in other scenarios there is no header record, so the script should be able to handle both.
  4. Most of all, the script should be fast, very fast! Did I mention that it should be fast? In our projects we already have enough work to do without being slowed down by mundane tasks like extracting a test data set.

The script file is named ExtractRecords.ps1 and it can be found at my GitHub repository:

Usage

In these examples I will be working with a CSV file named BigFile.csv, which is about 458 MB in size. In this first example, let’s assume that we would like to extract the first 5000 records into a file named Extract.csv:
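A minimal sketch of that invocation, assuming the parameters are named -InputFile, -OutputFile, and -RecordCount (illustrative names; check the param block in the script for the actual ones):

    .\ExtractRecords.ps1 -InputFile .\BigFile.csv -OutputFile .\Extract.csv -RecordCount 5000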

A sample run in PowerShell is shown below:
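    # Illustrative transcript; the exact messages depend on the script,
    # which echoes its parameter values as information to the user.
    PS C:\Data> .\ExtractRecords.ps1 -InputFile .\BigFile.csv -OutputFile .\Extract.csv -RecordCount 5000
    Input file    : .\BigFile.csv
    Output file   : .\Extract.csv
    Record count  : 5000
    Include header: False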

By default, the script will NOT include the first (header) record. If you do want this first header row included, you can run the script as follows:
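Assuming the header switch is named -IncludeHeader (again, an illustrative name), that run could look like this:

    .\ExtractRecords.ps1 -InputFile .\BigFile.csv -OutputFile .\Extract.csv -RecordCount 5000 -IncludeHeader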

Performance

The script is indeed very fast. Below is a sample run where I write 100,000 records. The script takes just a bit over half a second to read and write 100,000 lines. To measure performance I used the Measure-Command cmdlet, which is really handy for these types of tasks:
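A timing run along these lines reproduces the measurement; Measure-Command returns a TimeSpan object, so the elapsed time can be read from properties such as TotalMilliseconds (parameter names are the illustrative ones from above):

    Measure-Command {
        .\ExtractRecords.ps1 -InputFile .\BigFile.csv -OutputFile .\Extract.csv -RecordCount 100000
    }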


Script Analysis

In this section we take a quick look at how the script is structured. I am by no means a PowerShell specialist, so any feedback is always welcome.

I declare the script parameters in a param block, then write out their values as information to the user:
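A sketch of what that param block might look like, based on the requirements above (the parameter names are assumptions; the [CmdletBinding()] attribute is what makes the $PSCmdlet automatic variable available later in the script):

    [CmdletBinding()]
    param (
        [string] $InputFile,       # assumed name for the input file parameter
        [string] $OutputFile,      # assumed name for the output file parameter
        [int]    $RecordCount = 1000,
        [switch] $IncludeHeader
    )

    # Echo the parameter values as information to the user.
    Write-Host "Input file    : $InputFile"
    Write-Host "Output file   : $OutputFile"
    Write-Host "Record count  : $RecordCount"
    Write-Host "Include header: $IncludeHeader"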

Next, I make sure that the input and output file names were specified:
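A minimal validation sketch, using the illustrative parameter names from above:

    if ([string]::IsNullOrWhiteSpace($InputFile)) {
        throw "Please specify an input file name."
    }
    if ([string]::IsNullOrWhiteSpace($OutputFile)) {
        throw "Please specify an output file name."
    }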


I convert the specified paths into fully qualified paths. The $PSCmdlet.GetUnresolvedProviderPathFromPSPath method is a great way to do this:
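For example (again with the illustrative parameter names):

    # Resolve relative paths such as .\BigFile.csv into fully qualified paths;
    # unlike Resolve-Path, this also works for output files that do not exist yet.
    $fullInputPath  = $PSCmdlet.GetUnresolvedProviderPathFromPSPath($InputFile)
    $fullOutputPath = $PSCmdlet.GetUnresolvedProviderPathFromPSPath($OutputFile)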

Next, I need to make sure that the file specified by the full input path actually exists:
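Something like the following, using Test-Path:

    if (-not (Test-Path -Path $fullInputPath -PathType Leaf)) {
        throw "Input file '$fullInputPath' does not exist."
    }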

I use the StreamReader and StreamWriter classes from the .NET System.IO namespace; this is one of the fastest ways to read and write text files in PowerShell. The remainder of the script is shown below:
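A sketch of the core read/write loop, using the illustrative names introduced above:

    $reader = [System.IO.StreamReader]::new($fullInputPath)
    $writer = [System.IO.StreamWriter]::new($fullOutputPath)
    try {
        if (-not $IncludeHeader) {
            # By default the first (header) record is skipped.
            $null = $reader.ReadLine()
        }
        $count = 0
        while ($count -lt $RecordCount -and -not $reader.EndOfStream) {
            $writer.WriteLine($reader.ReadLine())
            $count++
        }
    }
    finally {
        $reader.Close()
        $writer.Close()
    }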