hashtable - Powershell 2 and .NET: Optimize for extremely large hash tables?


Question: 

I am dabbling in Powershell and completely new to .NET.

I am running a PS script that starts with an empty hash table. The hash table will grow to at least 15,000 to 20,000 entries. Keys of the hash table will be email addresses in string form, and values will be booleans. (I simply need to track whether or not I've seen an email address.)

So far, I've been growing the hash table one entry at a time. I check to make sure the key-value pair doesn't already exist (PS will error on this condition), then I add the pair.

Here's the portion of my code we're talking about:

...
    if ($ALL_AD_CONTACTS[$emailString] -ne $true) {
      $ALL_AD_CONTACTS += @{$emailString = $true}
    }
...

I am wondering if there is anything one can do from a PowerShell or .NET standpoint that will optimize the performance of this hash table if you KNOW it's going to be huge ahead of time, like 15,000 to 20,000 entries or beyond.

Thanks!




3 Answers: 

I performed some basic tests using Measure-Command, using a set of 20 000 random words.

The individual results are shown below, but in summary it appears that adding to one hashtable by first allocating a new hashtable with a single entry is incredibly inefficient :) Although there were some minor efficiency gains among options 2 through 5, in general they all performed about the same.

If I were to choose, I might lean toward option 5 for its simplicity (just a single Add call per string), but all the alternatives I tested seem viable.

$chars = [char[]]('a'[0]..'z'[0])
$words = 1..20KB | foreach {
  $count = Get-Random -Minimum 15 -Maximum 35
  -join (Get-Random $chars -Count $count)
}

# 1) Original, adding to hashtable with "+=".
#     TotalSeconds: ~800
Measure-Command {
  $h = @{}
  $words | foreach { if( $h[$_] -ne $true ) { $h += @{ $_ = $true } } }
}

# 2) Using sharding among sixteen hashtables.
#     TotalSeconds: ~3
Measure-Command {
  [hashtable[]]$hs = 1..16 | foreach { @{} }
  $words | foreach {
    $h = $hs[$_.GetHashCode() % 16]
    if( -not $h.ContainsKey( $_ ) ) { $h.Add( $_, $null ) }
  }
}

# 3) Using ContainsKey and Add on a single hashtable.
#     TotalSeconds: ~3
Measure-Command {
  $h = @{}
  $words | foreach { if( -not $h.ContainsKey( $_ ) ) { $h.Add( $_, $null ) } }
}

# 4) Using ContainsKey and Add on a hashtable constructed with capacity.
#     TotalSeconds: ~3
Measure-Command {
  $h = New-Object Collections.Hashtable( 21KB )
  $words | foreach { if( -not $h.ContainsKey( $_ ) ) { $h.Add( $_, $null ) } }
}

# 5) Using HashSet<string> and Add.
#     TotalSeconds: ~3
Measure-Command {
  $h = New-Object Collections.Generic.HashSet[string]
  $words | foreach { $null = $h.Add( $_ ) }
}
 

So it's a few weeks later, and I wasn't able to come up with the perfect solution. A friend at Google suggested splitting the hash into several smaller hashes. He suggested that each time I went to look up a key, I'd have several misses until I found the right "bucket", but he said the read penalty wouldn't be nearly as bad as the write penalty when the collision algorithm ran to insert entries into the (already giant) hash table.

I took this idea and took it one step further. I split the hash into 16 smaller buckets. When inserting an email address as a key into the data structures, I actually first compute a hash on the email address itself, and do a mod 16 operation to get a consistent value between 0 and 15. I then use that calculated value as the "bucket" number.

So instead of using one giant hash, I actually have a 16-element array, whose elements are hash tables of email addresses.

The total speed it takes to build the in-memory representation of my "master list" of 20,000+ email addresses, using split-up hash table buckets, is now roughly 1,000% faster. (10 times faster).

Accessing all of the data in the hashes has no noticeable speed delays. This is the best solution I've been able to come up with so far. It's slightly ugly, but the performance improvement speaks for itself.

 

You're going to spend a lot of the CPU time re-allocating the internal 'arrays' in the Hashtable. Have you tried the .NET constructor for Hashtable that takes a capacity?

$t = New-Object Hashtable 20000
...
if (!($t.ContainsKey($emailString))) { 
    $t.Add($emailString, $emailString) 
}

My version uses the same $emailString for the key & value, no .NET boxing of $true to an [object] just as a placeholder. The non-null string will evaluate to $true in PowerShell 'if' conditionals, so other code where you check shouldn't change. Your use of '+= @{...}' would be a big no-no in performance sensitive .NET code. You might be allocating a new Hashtable per email just by using the '@{}' syntax, which could be wasting a lot of time.

Your approach of breaking up the very large collection into a (relatively small) number of smaller collections is called 'sharding'. You should use the Hashtable constructor that takes a capacity even if you're sharding by 16.

Also, @Larold is right, if you're not looking up the email addresses, then use 'New-Object ArrayList 20000' to create a pre-allocated list.

Also, the collections grow expenentially (factor of 1.5 or 2 on each 'growth'). The effect of this is that you should be able to reduce how much you pre-allocate by an order of manitude, and if the collections resize once or twice per 'data load' you probably won't notice. I would bet it is the first 10-20 generations of 'growth' that is taking time.

 

More Articles


php - Key for HMAC Algorithm

How to generate the secret key for the HMAC algorithm as I have to use it for data verification at the other clients end?Thanks in advance.

How do I cross-compile a Linux kernel to a MIPS little endian host?

The kernel in question is 2.6.18. If I callmake ARCH=mips CROSS_COMPILE=mipsel-linux- menuconfigthere will be only the option to build a big endian kernel in the menu. If I use ARCH=mipsel, it will complain about not having an arch/mipsel dir.How's this done?

android - Can't find class [org/drinkless/td/libcore/telegram/TdApi$Object]

I am getting this error "Can't find class [org/drinkless/td/libcore/telegram/TdApi$Object]" but I haven't used this class anywhere in my project.I haven't used that class anywhere.But i am getting the above errorCan anyone please let me know how to solve the above error.


java - Location mocking - google map detects the movement but my own application not triggering location change

I am facing a strange issue. I modified official mock provider source code provided by google to mock some route for my application.Using this code. mockLocation.setElapsedRealtimeNanos(elapsedTimeNanos); mockLocation.setTime(currentTime); // Set the loc

python - Heap that supports modification of its elements?

Here is my scenario. I want to implement A* (in Python) without having to resort to linear-time min or in operations. I need a heap to be able to efficiently get the lowest weight item. My immediate response was 'Easy! I'll use heapq!' Then I discovered that life is rarely as simple as we would like

Missing plugin for GStreamer for Android SDK

I changed the stream url in the included Tutorial 5 (a basic media player) to a h.264/mp3 media stream (from its original ogv stream) and it started complaining about some missing plugins.After doing some googling I found Prajnashi's gst-ffmpeg plugin for Android https://github.com/prajnashi/gst-ffm


powershell - How to add key/value pair at the end of hash table?

I am trying to calculate code count using PowerShell script.I found a script on the Internet and am trying to add the total line at the end.I have added the column$CountHash.Add("Total", $Total)at the end.Param( [string]$path, [string]$outputFile, [string]$include = "*.*", [string]

.net - HMACSHA256.ComputeHash - Unexpected Result

I'm trying to generate a signature in VB.NET using the following vendor documentation as a reference guide:https://shuttle.support.signiant.com/customer/en/portal/articles/2807676-media-shuttle-metadata-developer-s-guide#AppendixAThey also provide this sample JS code:https://github.com/Signiant/medi

r - adding two column of a data where col1 contains date and col2 contains days

I have a data frame in which i have two columns date and days and i want to add date column with days and show the result in other column data frame-1col date is in format of mm/dd/yyyy formatdate days3/2/2019 83/5/2019 43/6/2019 43/21/2019 33/25/2019 7and i want my output like t

parsing - How to parse an array of json object using jq

I need to parse a Json file which have a lot of arrays. This is the json source:{"iabVersion": "IAB_V2","categories": [{ "categories": [{ "categories": [{ "id": "1.1.1", "name": "Commercial Trucks" }, { "id": "1.1.2", "name": "Conve