Hi guys
We are almost at the end of this contest and I am posting the code to my cosine algorithm. I am having trouble getting the indents in the code to appear.
#load modules
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
Introduction
This method of identifying the rows in the sample, while not following the suggested machine learning procedure, is very simple and gives very good results (leaderboard scores). It works much better than the machine learning exercise, at least for me.
Purging High Values and Zeros
- We first separate the data by lab. This is necessary because we are going to replace the zeros with random data of the same mean and standard deviation, and these are not the same for all labs.
- Next we replace all the anomalously high values in variable 21 with zeros.
- We now can treat all zeros the same (per lab).
- In each column we count the nonzeros and get the sum of the column and the sum of its squares.
- We compute the mean (sum / nonzeros) and variance (mean of squares - square of mean) for that column.
- We then replace the zeros with random normal data using the mean and stdev. These random draws are the only difference between runs.
- We then reassemble the dataset by concatenating the three lab dataframes.
- Finally we sort by sample number to get them back in their original order. The scoring system is expecting this order.
- To get a different random sequence, just change the Seed.
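The per-column steps above can also be sketched without explicit row loops. This is only an illustration under assumptions: the helper name `fill_zeros_with_normal` and the toy column are made up, and it uses NumPy's `default_rng` generator rather than `np.random.seed`, so it will not reproduce my runs.

```python
import numpy as np
import pandas as pd

def fill_zeros_with_normal(col, rng):
    """Replace zeros in one column with draws from N(mean, stdev) of its positive entries."""
    values = col.to_numpy(dtype=float).copy()
    positive = values > 0            # same test as the loop version: x > 0
    zeros = values == 0
    n = positive.sum()
    if n == 0 or not zeros.any():
        # nothing to estimate from, or nothing to replace
        return pd.Series(values, index=col.index, name=col.name)
    mean = values[positive].sum() / n
    var = (values[positive] ** 2).sum() / n - mean * mean
    stdev = np.sqrt(max(var, 0.0))   # guard against tiny negative round-off
    values[zeros] = rng.normal(mean, stdev, size=zeros.sum())
    return pd.Series(values, index=col.index, name=col.name)

rng = np.random.default_rng(13)
toy = pd.Series([2.0, 0.0, 4.0, 0.0, 6.0], name="toy")  # hypothetical column
filled = fill_zeros_with_normal(toy, rng)
print(filled)
```

On a real frame this could be applied column by column, one lab at a time, for example with `frame.iloc[:, 3:].apply(fill_zeros_with_normal, rng=rng)`.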
Seed = 13 #value for my best run
np.random.seed(Seed) # so we can recreate a particular run.
# .copy() gives each lab its own frame, so the edits below do not touch df
df_lab1 = df.query('Lab == "Lab_1"').copy()
df_lab2 = df.query('Lab == "Lab_2"').copy()
df_lab3 = df.query('Lab == "Lab_3"').copy()
frames = [df_lab1, df_lab2, df_lab3]
for frame in frames:
    # Set the anomalously high values to zero.
    frame.loc[frame['21'] > 100, '21'] = 0
    # We now find all the zeros and replace them
    nrows = frame.shape[0]
    ncols = frame.shape[1] # first 3 are lab, x, y
    # loop thru each column get number nonzero rows, sum and sum squares
    for j in range(3, ncols): # j is column index
        nonzeroes = 0
        total = 0 # named total so the builtin sum is not shadowed
        sumsq = 0
        for i in range(0, nrows): # i is row index
            x = frame.iloc[i, j]
            if x > 0:
                nonzeroes += 1
                total += x
                sumsq += x * x
        # get mean and standard deviation
        mean = total / nonzeroes
        var = (sumsq / nonzeroes) - (mean * mean)
        stdev = np.sqrt(var)
        #print("mean: ", mean)
        #print("stdev: ", stdev)
        for i in range(0, nrows):
            if frame.iloc[i, j] == 0:
                r = np.random.normal(mean, stdev)
                frame.iloc[i, j] = r
'''
There are now no zeros. They have been replaced by random normal
data with mean and stdev calculated from the non zero
column members.
'''
# We must now reassemble the frames separated by lab.
xxx = pd.concat(frames)
# Sort by sample. Submission is expecting this order.
df_nozeros = xxx.sort_values(['Sample'])
# Add new columns, 'Family' and 'Cosine'. I used these in studying the results.
df_nozeros.insert(1, 'Family', 'Nan')
df_nozeros.insert(2, 'Cosine', 0.0)
#df_nozeros
Test by Cosine of Geochem Data and Sample Data
There is a theorem from linear algebra describing the dot product of two vectors, a and b:
$$\mathbf{a} \cdot \mathbf{b} = \lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert \cos\theta$$
For any two vectors, a and b, the dot product is equal to the length of a, times the length of b, times the cosine of the angle between them. We take each row in the sample data as a vector in 16-dimensional space. We then dot it with each of the four vectors in the geochem database. We are looking for the largest cosine. A perfect fit will give a cosine of 1.0.
The numpy function ‘linalg.norm’ gives the length of a vector. The function ‘vdot’ gives the dot product of two vectors. A simple rearrangement of the above equation gives the cosine.
This algorithm does not require any scaling of the data, because the cosine depends only on the shape (direction) of a vector, not on its magnitude. It also does not use the standard deviations in the geochem data. So far I have not found a way to employ them, yet the method works extremely well without them.
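A small check of the "no scaling needed" point, using made-up three-element vectors rather than real chromatogram data: multiplying either vector by a positive constant leaves the cosine unchanged.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.vdot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])    # made-up values
b = np.array([2.0, 1.0, 0.5])

c1 = cosine(a, b)
c2 = cosine(10.0 * a, 0.01 * b)  # rescaling either vector leaves the angle unchanged
c3 = cosine(a, 5.0 * a)          # same shape, different size
print(c1, c2, c3)
```

Here `c1` and `c2` agree, and `c3` is 1.0 up to round-off, which is why the database values and the sample values can be on completely different scales.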
At the end of the notebook there is a geometric illustration of the algorithm. We show a sample chromatogram superimposed on each of the geochem chromatograms.
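This equation may look familiar to those with a background in statistics. If both vectors are first centered on their means, the cosine is the Pearson correlation coefficient r, and squaring it gives r squared, the coefficient of determination. The vectors here are not centered, so the cosine is the uncentered analogue.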
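The link to correlation can be checked numerically. In this sketch the two 16-element vectors are random stand-ins, not contest data: the cosine of the mean-centered vectors matches `np.corrcoef`, while the raw cosine used in this post is generally a different number.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.vdot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
a = rng.random(16)   # stand-in for a sample row
b = rng.random(16)   # stand-in for a database spectrum

pearson_r = np.corrcoef(a, b)[0, 1]
centered = cosine(a - a.mean(), b - b.mean())  # equals Pearson r
uncentered = cosine(a, b)                      # what the algorithm in this post uses
print(pearson_r, centered, uncentered)
```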
I store the best-matching family and its cosine in the df_nozeros dataframe.
# Cosine of angle between vectors a & b
def Cosine(a, b):
    cos = np.vdot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return cos
geochem_db = pd.read_csv('global_ranges_oil_families.csv')
Family = geochem_db.family.unique()
# The spectra we will test against
Ordovician = geochem_db.iloc [ 0 : 16, 1 ]
Deltaic = geochem_db.iloc [ 16 : 32, 1 ]
Lacustrine = geochem_db.iloc [ 32 : 48, 1 ]
Marine = geochem_db.iloc [ 48 : 64, 1 ]
Spectra = [Ordovician, Deltaic, Lacustrine, Marine]
#Initialize submit file
submitfile = "seed" + str(Seed) + ".csv"
f = open(submitfile, "w")
f.write("family\n")
vector = np.zeros(16)
nrows = df_nozeros.shape[0]
ncols = df_nozeros.shape[1] # first 5 are Lab, Family, Cosine, x, y
for i in range(0, nrows):
    # a row sliced from a mixed-type frame has object dtype, so cast to float
    vector = df_nozeros.iloc[i, 5:ncols].astype(float)
    large = 0.0
    family = ''
    # Find best match (largest cosine)
    for j, spectrum in enumerate(Spectra):
        cosTheta = Cosine(vector, spectrum)
        if cosTheta > large:
            large = cosTheta
            family = Family[j]
    #print(family, large)
    # .at is label-based, so look up the index label of row i
    label = df_nozeros.index[i]
    df_nozeros.at[label, 'Family'] = family
    df_nozeros.at[label, 'Cosine'] = large
    f.write(family + "\n")
f.close()
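The matching loop can also be written as a single matrix product. This is a sketch, not the code I ran: `best_match` is a hypothetical helper and the arrays are tiny made-up examples. One difference to note is that `argmax` always picks a family, whereas the loop above leaves the family empty if no cosine is positive.

```python
import numpy as np

def best_match(samples, spectra):
    """Return (index of best spectrum, its cosine) for every row of samples."""
    s_unit = samples / np.linalg.norm(samples, axis=1, keepdims=True)
    t_unit = spectra / np.linalg.norm(spectra, axis=1, keepdims=True)
    cosines = s_unit @ t_unit.T   # shape (n_samples, n_spectra)
    return cosines.argmax(axis=1), cosines.max(axis=1)

# made-up reference spectra (one per row) and two made-up samples
spectra = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [1.0, 1.0, 1.0]])
samples = np.array([[5.0, 0.1, 0.0],
                    [2.0, 2.1, 1.9]])

idx, cos = best_match(samples, spectra)
print(idx, cos)
```

With the real data, `samples` would be the numeric columns of `df_nozeros` as a float array and `spectra` the four 16-element database vectors stacked as rows, so `Family[idx]` would give the predicted family names.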
Geometric Illustration of the Cosine Algorithm
We display the chromatograms of each of the four oil types in the geochem database. Then we superimpose, in black, a chromatogram from the samples. In some instances the fit is astonishing. To use this, just enter the row number (sample) from the most recent run. We scale the database numbers to match the scale of the sample. The cosine for each pair is also displayed.
def Cosine(a, b):
    cos = np.vdot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return cos
# Let's look at database chromatograms.
geochem_db = pd.read_csv('global_ranges_oil_families.csv')
Family = geochem_db.family.unique()
xLabels = geochem_db.iloc[0:16, 0]
colors = ['r', 'b', 'g', 'y']
# Data for our sample
sample = 47 # <================= Enter sample (row) here.
Y = df_nozeros.iloc[sample, 5:]
X = xLabels
fig = plt.figure(figsize=(10, 6), num = 2)
fig.suptitle("Oil Chromatograph Database vs sample " + str(sample))
for i in range(4):
    plt.subplot(221 + i)
    y = geochem_db.iloc[16 * i : 16 * (i + 1), 1]
    # scale up database values to match our sample
    mY = max(Y) # sample
    my = max(y) # database
    factor = mY / my
    y = y * factor # new Series, so geochem_db itself is not modified
    # get cosine
    cosine = Cosine(y, Y)
    # plot database
    x = xLabels
    plt.xlabel(Family[i] + ', cosine: ' + str(cosine))
    plt.plot(x, y, colors[i])
    # plot our sample
    plt.plot(X, Y, 'black')
plt.show()