Non Machine Learning Algorithm

Hi Guys
While it is not in the spirit of machine learning, I found an approach to this problem that sat at the top of the leaderboard for quite a while with a score of 97.5.

There is a theorem from linear algebra which states that the dot product of two vectors a and b is equal to the length of a times the length of b times the cosine of the angle between them.

import numpy as np
def Cosine (a,b):
    cos = np.vdot (a,b) / ( np.linalg.norm (a) * np.linalg.norm(b) )
    return cos

We can think of a chromatograph as a vector in 16-dimensional space. The idea is to dot each entry from the input data with each of the target classifications, ordovician … and compute the cosine. The largest cosine wins. A cosine of 1.0 is a perfect fit.
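As a rough illustration of "the largest cosine wins", here is a minimal sketch. The two reference vectors, their names (`family_A`, `family_B`) and the sample are made up, and have 4 components instead of 16 to keep it short.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.vdot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical reference shapes (the real ones have 16 components).
targets = {
    "family_A": np.array([1.0, 4.0, 2.0, 1.0]),
    "family_B": np.array([3.0, 1.0, 1.0, 4.0]),
}

# Hypothetical sample: roughly twice family_A, plus a little noise.
sample = np.array([2.1, 7.8, 4.2, 1.9])

# Score the sample against every target and keep the largest cosine.
scores = {name: cosine(sample, vec) for name, vec in targets.items()}
best = max(scores, key=scores.get)
print(best, scores[best])
```

The sample is about twice as large as `family_A`, but it still matches it closely because only the shape is compared.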

This algorithm has some nice features.

Conceptually simple
Computationally simple
No scaling is needed. Only the direction of the vector matters.
The standard deviation is not used.
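The "no scaling" point can be checked with a small sketch. The vectors and scale factors here are arbitrary made-up values; multiplying either vector by a positive constant should leave the cosine unchanged.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.vdot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 4.0])

original = cosine(a, b)
rescaled = cosine(250.0 * a, 0.01 * b)  # arbitrary positive scale factors
print(np.isclose(original, rescaled))
```

Note that this holds for positive factors only. A negative factor reverses the direction of the vector and flips the sign of the cosine.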

Here’s a screen shot showing the match of row 47 with each of the targets:


Hi guys
We are almost at the end of this contest and I am posting the code to my cosine algorithm. I am having trouble getting the indents in the code to appear.

#load modules
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import matplotlib.image as mpimg

Introduction

This method of identifying the rows in the sample, while not following the suggested machine learning procedure, is very simple and gives very good results (leaderboard scores). It works much better than the machine learning exercise, at least for me.

Purging High Values and Zeros

  • We first separate the data by lab. This is necessary because we are going to replace the zeros with random data of the same mean and standard deviation, and these differ from lab to lab.
  • Next we replace all the anomalously high values in variable 21 with zeros.
  • We now can treat all zeros the same (per lab).
  • In each column we count the nonzeros and get the sum of the column and the sum of the squares.
  • We compute the mean (sum / nonzeros) and variance (mean of squares - square of mean) for that column.
  • We then replace the zeros with random normal data using that mean and stdev. This is the only difference between runs.
  • We then reassemble the dataset by concatenating the three lab dataframes.
  • Finally we sort by sample number to get them back in their original order. The scoring system is expecting this order.
  • To get a different random sequence, just change the Seed.
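The steps above can be sketched for a single column as follows. This is only an illustration under assumptions: the helper name `fill_zeros_with_normal` and the column values are made up, and it uses NumPy's `default_rng` generator instead of `np.random.seed`, so it will not reproduce the random numbers of an actual run.

```python
import numpy as np

def fill_zeros_with_normal(column, rng):
    """Replace zeros in a 1-D array with draws from N(mean, stdev),
    where mean and stdev come from the positive entries only."""
    column = np.asarray(column, dtype=float).copy()
    positive = column[column > 0]
    mean = positive.sum() / positive.size
    # variance = mean of squares - square of mean
    var = (positive ** 2).sum() / positive.size - mean ** 2
    stdev = np.sqrt(var)
    zeros = column == 0
    column[zeros] = rng.normal(mean, stdev, size=zeros.sum())
    return column, mean, stdev

rng = np.random.default_rng(13)          # hypothetical seed
values = [4.0, 0.0, 6.0, 5.0, 0.0, 5.0]  # made-up column with two zeros
filled, mean, stdev = fill_zeros_with_normal(values, rng)
print(mean, stdev, filled)
```

For this column the nonzero entries are 4, 6, 5, 5, so the mean is 5 and the variance is 25.5 - 25 = 0.5. The draws come from a normal distribution, so a replacement value can in principle be negative when stdev is large relative to the mean.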

Seed = 13 #value for my best run
np.random.seed(Seed) # so we can recreate a particular run.
# df is the sample dataframe, assumed to be loaded already.
# .copy() gives each lab its own frame, so the edits below are not made on a view of df.
df_lab1 = df.query('Lab == "Lab_1"').copy()
df_lab2 = df.query('Lab == "Lab_2"').copy()
df_lab3 = df.query('Lab == "Lab_3"').copy()
frames = [df_lab1, df_lab2, df_lab3 ]
for frame in frames:

    # Set the anomalously high values to zero.
    frame.loc[frame['21'] > 100, '21'] = 0

    #   We now find all the zeros and replace them
    nrows = frame.shape[0]
    ncols = frame.shape[1]  # first 3 are lab, x, y

    # loop thru each column get number nonzero rows, sum and sum squares
    for j in range (3, ncols):   #j is column index
        nonzeroes = 0
        total     = 0   # renamed from sum, which shadows the Python builtin
        sumsq     = 0
        for i in range (0, nrows):  #i is row index
            x = frame.iloc [i, j]
            if x > 0:
                nonzeroes += 1
                total     += x
                sumsq     += x * x
        # get mean and standard deviation
        mean = total / nonzeroes
        var  =  ( sumsq / nonzeroes ) - ( mean * mean )
        stdev = np.sqrt (var)
        #print ("mean:  ",  mean  )
        #print ("stdev: ",  stdev  )
        for i in range (0, nrows):
            if frame.iloc [i, j] == 0:
                r = np.random.normal (mean, stdev)
                frame.iloc [i, j] = r

'''
There are now no zeros. They have been replaced by random normal
data with mean and stdev calculated from the non zero
column members.
'''

# We must now reassemble the frames separated by lab.
xxx = pd.concat(frames)

# Sort by sample. Submission is expecting this order.
df_nozeros = xxx.sort_values(['Sample'])
#Add new columns, 'Family' and 'Cosine'. I used these in studying the results.
df_nozeros.insert(1, 'Family', 'Nan')
df_nozeros.insert(2, 'Cosine', 0.0)
#df_nozeros

Test by Cosine of Geochem Data and Sample Data

There is a theorem from linear algebra describing the dot product of two vectors, a and b:

For any two vectors, a and b, the dot product is equal to the length of a, times the length of b, times the cosine of the angle between them. We take each row in sample data as a vector in 16 dimensional space. We then dot it with each of the four vectors in the geochem database. We are looking for the largest cosine. A perfect fit will give a cosine of 1.0.

The numpy function ‘linalg.norm’ gives the length of a vector. The function ‘vdot’ gives the dot product of two vectors. A simple rearrangement of the above equation gives the cosine.
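Written out, with θ the angle between a and b, the rearrangement is:

```latex
\mathbf{a}\cdot\mathbf{b} = \lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert\cos\theta
\quad\Longrightarrow\quad
\cos\theta = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}
```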

This algorithm does not require any scaling of the data. It only depends on the shape. Also, it does not use the standard deviation of the geochem data. So far I have not found a way to employ this, yet it works extremely well without it.

At the end of the notebook there is a geometric illustration of the algorithm. We show a sample chromatograph superimposed on each of the geochem chromatographs.

This equation may look familiar to those with a background in statistics. If we square both sides, we get the equation for R squared, the coefficient of determination, for a straight-line fit through the origin (the vectors are not mean-centered here).
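The link to statistics can be checked numerically. This is a small sketch with made-up vectors: the squared cosine should match the R squared of a least-squares fit a ≈ k·b with no intercept, and the cosine of the mean-centered vectors should match the ordinary (Pearson) correlation coefficient.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.vdot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0, 5.0])   # made-up data
b = np.array([2.0, 3.0, 7.0, 9.0])

# Least-squares fit of a = k * b with no intercept term.
k = np.vdot(a, b) / np.vdot(b, b)
residual = a - k * b
r2_through_origin = 1.0 - np.vdot(residual, residual) / np.vdot(a, a)

# After subtracting the means, the cosine is the Pearson correlation.
centered_cos = cosine(a - a.mean(), b - b.mean())
pearson_r = np.corrcoef(a, b)[0, 1]

print(np.isclose(cosine(a, b) ** 2, r2_through_origin))
print(np.isclose(centered_cos, pearson_r))
```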

I added the family and cosine to the nozero dataframe.

# Cosine of angle between vectors a & b
def Cosine (a,b):
    cos = np.vdot (a,b) / ( np.linalg.norm (a) * np.linalg.norm(b) )
    return cos

geochem_db = pd.read_csv('global_ranges_oil_families.csv')
Family = geochem_db.family.unique ()

# The spectra we will test against.
# Each family occupies 16 consecutive rows; Family is assumed to list
# the names in the same order as Spectra.
Ordovician = geochem_db.iloc [  0 : 16, 1 ]
Deltaic    = geochem_db.iloc [ 16 : 32, 1 ]
Lacustrine = geochem_db.iloc [ 32 : 48, 1 ]
Marine     = geochem_db.iloc [ 48 : 64, 1 ]
Spectra = [Ordovician, Deltaic, Lacustrine, Marine]

#Initialize submit file
submitfile = "seed" + str (Seed) + ".csv"
f = open(submitfile, "w")
f.write ("family\n")

vector = np.zeros (16)
nrows = df_nozeros.shape[0]
ncols = df_nozeros.shape[1] #first 5 are Lab, Family, Cosine, x, y
for i in range (0, nrows):
    # row i as a float vector of the 16 measurements
    vector = df_nozeros.iloc[ i, 5:ncols].astype(float)

    large = 0.0
    family = ''
    j = 0
    # Find best match (largest cosine)
    for spectrum in Spectra:
        cosTheta = Cosine (vector, spectrum)
        if cosTheta > large:
            large  = cosTheta
            family = Family[j]
        j += 1
    #print ( family, large)
    # i is a row position, so look up its index label before using .at
    label = df_nozeros.index[i]
    df_nozeros.at[label, 'Family'] = family
    df_nozeros.at[label, 'Cosine'] = large
    f.write (family + "\n")

f.close()
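The row-by-row loop above could also be sketched as a single matrix product. This is an alternative formulation, not the code used for the leaderboard runs; the function name `classify_by_cosine` and the small 3-component data are made up for illustration.

```python
import numpy as np

def classify_by_cosine(samples, spectra, names):
    """Return the best-matching name and its cosine for each row of samples.

    samples: (n_samples, n_features) array
    spectra: (n_classes, n_features) array
    names:   sequence of n_classes labels, in the same order as spectra
    """
    samples = np.asarray(samples, dtype=float)
    spectra = np.asarray(spectra, dtype=float)
    # Scale every row to unit length so a plain dot product is the cosine.
    unit_samples = samples / np.linalg.norm(samples, axis=1, keepdims=True)
    unit_spectra = spectra / np.linalg.norm(spectra, axis=1, keepdims=True)
    cosines = unit_samples @ unit_spectra.T   # shape (n_samples, n_classes)
    best = cosines.argmax(axis=1)
    return [names[k] for k in best], cosines[np.arange(len(best)), best]

# Made-up data, for illustration only.
spectra = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
names = ["family_A", "family_B"]
samples = [[5.0, 0.5, 0.2], [0.1, 2.0, 2.2]]
labels, best_cos = classify_by_cosine(samples, spectra, names)
print(labels, best_cos)
```

One difference from the loop: `argmax` always picks a family, whereas the loop leaves the family empty if no cosine is greater than zero.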

Geometric Illustration of the Cosine Algorithm

We display the chromatographs of each of the four oil types in the geochem database. Then we superimpose, in black, a chromatograph from the samples. In some instances, the fit is astonishing. To use this, just type in the row number (sample) from the most recent run. We scale the database numbers to match the scale of the sample. The cosine for each pair is also displayed.

def Cosine (a,b):
    cos = np.vdot (a,b) / ( np.linalg.norm (a) * np.linalg.norm(b) )
    return cos

#Let's look at database chromatographs.
geochem_db = pd.read_csv('global_ranges_oil_families.csv')
Family = geochem_db.family.unique ()
xLabels = geochem_db.iloc[ 0:16, 0 ]
colors = ['r', 'b', 'g', 'y']

# Data for our sample
sample = 47 # <================= Enter sample (row) here.
Y = df_nozeros.iloc[sample, 5:]
X = xLabels

fig = plt.figure(figsize=(10, 6), num = 2)
fig.suptitle("Oil Chromatograph Database vs sample " + str(sample))

for i in range (4):
    plt.subplot (221 + i)
    # copy, so the scaling below does not modify geochem_db
    y = geochem_db.iloc[ 16 * i : 16 * (i + 1), 1 ].copy()

    #scale up database values to match our sample
    mY = max (Y)  # sample
    my = max (y)  # database
    factor = mY / my
    y *= factor

    #get cosine
    cosine = Cosine (y,Y)

    #  plot database
    x = xLabels
    plt.xlabel(Family[i] + ',  cosine:  ' + str (cosine))
    plt.plot( x, y, colors[i])

    #plot our sample
    plt.plot(X, Y, 'black')

plt.show()