How Hashing works

June 13, 2019 · 7 min read

author

Hashing is a data structure which allows to add and read data with $O(1)$ . But the search is done with $O(n)$ .

Hashing basically does nothing else than mapping data as keys. This means that arbitrarily large data sets are converted to a data set of fixed length, and then reference the input data set. Mostly, hashing is used to convert passwords into hashes, which are then stored in a database. Or to check data for their integrity (checksum).

Hashing is a very easy data structure to implement. The only thing you would need is an array that stores the data.

Mathematically, hashing works on the principle that you have a table that is m large. Ideally, m should be a prime number. We have records $k_0$ , $k_1$ , ... $k_n$ and determine our key $h$ which should be inside the table $m$ .

The mathematical formula would be:

h = k \bmod m

This data structure can be used to represent anything, e.g. numbers, strings (where each ASCII value is summed up), objects, images, etc ...

Depending on how the key was calculated, the case can arise that a mapping refences on more than one data set. This is then the collision, what Google has done with the cryptographic hash procedure SHA-1, but that is another story.

I have added a sample code here:

#include <iostream>
#include <stdlib.h>
#include <string.h>
#include <math.h>

using namespace std;

const int m = 7;
int array[m] = {0};

int h(int k)
 {
     int y = 0;
     y = k % m;
     if(array[y] != 0)
     {
         if(debug)
             cout << "Kollision --> duplicate hash: " << y << endl;

     y = double_hashing(k); 
     } 
     return y;
 }

int hashtb_insert(int k)
{
     int hash = h(k);
     array[hash] = k;
     return hash;
}

int hashtb_get(int hash)
{
     return array[hash];
}

void hashtb_show()
{
     cout << "HASH | VALUE" << endl;
     for(int i = 0; i < m; i++)
     {
         cout << i << " | " << array[i] << endl;
     }
}

int main(int argc, char **argv)
{
     int datas[m] = {1,43,12,76,88,99,12};
     for(int i = 0; i < m; i++)
     {
         cout << "hashing " << datas[i] << endl;
         cout << "Hashvalue of " << datas[i] << " is " << hashtb_insert(datas[i]) << endl << endl; 
     }
     hashtb_show();
     cout << "testing get vals from hash" << endl;
     int hash = 0;
     for(int i = 0; i < m; i++)
     {
         cout << "enter " << i + 1 << " hash :";
         cin >> hash;
         cout << "the hash is from value --> " << hashtb_get(hash) << endl << endl;
     }
}

To handle collisions, there are different strategies: Open addressing or concatenation.

Open addressing:

Linear Special
Linear scaled probing
Quadratic probing
Double Hashing

n linear probing, nothing is done other than to look for a free space in the table.

The mathematical expression for this is:

h = (k + i) \bmod m

Here, the $k$ , which represents our data set as a number, is extended by $i$ . This is done until a place is found. This is done until a free position is found.

int linear_sond(int k)
{
	int y = 0;
	for(int i = 1; i < m; i++)
	{
		y = (k + i) % m;
		if(array[y] == 0)
		{
			return y;
		}
	}
}

The scaled linear probing is like the linear probing, which is extended by another variable c, however.

The formula looks like this:

h = (k + ci) \bmod m

int linear_sond_scaled(int k, int s)
{
	int y = 0;
	for(int i = 1; i < m; i++)
	{
		y = (k + (s * i)) % m;
		if(array[y] == 0)
		{
			return y;
		}
	}
}

Quadratic probing is also a way to find free spaces in the table. To do this, add to the $k$ , $i^2$ .

This then looks like this in mathematics:

h = k + i^2 \bmod m

int quad_sond(int k)
{
	int y = 0;
	for(int i = 0; i < m; i++)
	{
		y = (k + (i * i) % m);
		if(array[y] == 0)
		{
			return y;
		}		
	}	
}

Doublehashing is a bit different than the methods just shown. Here the hash function is extended by $h'$ . The function $h'$ must not be a divisor of $m$ . That means, $k$ is increased by 1 in $h'$ and $m$ is decreased by 2.

The formula is then for $h'$ :

h'(k) = (1 + k) \bmod (m - 2)

The formula for $h$ :

h = (k +i ( h'(k) \bmod m))

int h_strich(int k)
{
	return 1 + (k % (m - 2));
}

int double_hashing(int k)
{
	int y = 0;
	for(int i = 0; i < m; i++)
	{
		y = (k + i * (h_strich(k) % m);
		if(array[y] == 0)
		{
			return y;
		}
	}
}

The second strategy, concatenation, is also a method of handling in case of a collision. By concatenation we mean that records with the same key are concatenated by means of a simple concatenated list. A small schematic example:

In the event of a collision, the current record is saved in place of the previous record. All previous ones, hung behind the current one, that in descending order. At each place in the table is an anchor node, where the separate lists are bound.

The Loadfactor. What is the load factor? The loadfactor is the specification of the number of elements to the table size $m$ . This then indicates how full the table is.

If you want to extend the code with the loadfactor, you would have to create a new integer variable $n_E$ , which is incremented by 1 on an insert operation.

To specify the loadfactor, the number of existing elements in the table is divided by the table size m. The loadfactor is the number of elements in the table. Since both variables are of type integer, the result is taken times 1.0 to get a decimal number. The load factor is also used for rehashing, as a threshold value.

int nE = 0;

double loadFactor()
{
	return (1.0 * nE) / m;
}

Rehashing is used when the table is almost full or the load factor reaches or exceeds a certain value. Rehashing creates a new array or vector, with twice the table size $m$ .

m = 2m

The old array is then temporarily stored. After the enlargement, all elements from the old array are inserted again. At the end, the old array is deleted.

info

Important, I changed the array data type to a dynamic array to simplify the code.

void rehash()
{
	int *table = new int[m * 2]; //No standard array but arraypointer to extend it dynamically
	int *temp = array; //No standard array but arraypointer to extend it dynamically 
	array = table;

	for(int i = 0; i < m; i++)
	{
		if(temp[i] != 0)
		{
			hashtb_insert(temp[i]); //If concatenation is used, the lists from the old array must still be removed here.
		}
	}

	delete[] temp;
}

The hashing data structure should normally be realized with a std::vector, in order to have less effort with changes.

I hope I could give a little insight into the topic of hashing 😄