What's a Weka going to do?

Here I want to examine how machine learning might compare with our conventional gene expression analysis. The data set includes both male and female oysters exposed to OA conditions (and controls); the tissue is gonad. Sam ran the data through bowtie/stringtie for comparison. Complete sample details are below. PDF of post

Sample.ID	OldSample.ID	Treatment	Sex	TreatmentN	Parent.ID
12M	S12M	Exposed	M	3	EM05
13M	S13M	Control	M	1	CM04
16F	S16F	Control	F	2	CF05
19F	S19F	Control	F	2	CF08
22F	S22F	Exposed	F	4	EF02
23M	S23M	Exposed	M	3	EM04
29F	S29F	Exposed	F	4	EF07
31M	S31M	Exposed	M	3	EM06
35F	S35F	Exposed	F	4	EF08
36F	S36F	Exposed	F	4	EF05
39F	S39F	Control	F	2	CF06
3F	S3F	Exposed	F	4	EF06
41F	S41F	Exposed	F	4	EF03
44F	S44F	Control	F	2	CF03
48M	S48M	Exposed	M	3	EM03
50F	S50F	Exposed	F	4	EF01
52F	S52F	Control	F	2	CF07
53F	S53F	Control	F	2	CF02
54F	S54F	Control	F	2	CF01
59M	S59M	Exposed	M	3	EM01
64M	S64M	Control	M	1	CM05
6M	S6M	Control	M	1	CM02
76F	S76F	Control	F	2	CF04
77F	S77F	Exposed	F	4	EF04
7M	S7M	Control	M	1	CM01
9M	S9M	Exposed	M	3	EM02

Sam provided me with a table of FPKM values pulled from Ballgown. There is information for about 39k genes. In short, I believe Sam got no DEGs with regard to treatment but, as might be expected, many differences based on sex.

table-pic

I opened the csv file up in the Weka Explorer.

exp

I then ran Classify using SMO, with Treatment as the class (thus ignoring sex). Correctly Classified Instances = 59.09%
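To put 59.09% in context, here is a quick back-of-the-envelope check in R. It assumes this first run used the same 22 instances (12 Exposed, 10 Control) that appear in the confusion matrix further down; that is my assumption, not something the Weka output above confirms.

```r
# Class counts taken from the confusion matrix later in this post
# (assumed to apply to this first run as well).
n_exposed <- 12
n_control <- 10
n_total <- n_exposed + n_control

# Accuracy of a trivial "classifier" that always predicts the larger class.
baseline_accuracy <- max(n_exposed, n_control) / n_total

# Number of correct calls implied by the reported 59.09%.
reported_accuracy <- 0.5909
n_correct <- as.integer(round(reported_accuracy * n_total))

cat(sprintf("Majority-class baseline: %.2f%%\n", 100 * baseline_accuracy))
cat(sprintf("Implied correct calls: %d of %d\n", n_correct, n_total))
```

Under that assumption, SMO on all genes gets one more sample right than always guessing Exposed would.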

class

I went on to Select Attributes.

Based on this I found 24 contributing genes, with ranked attribute values ranging from 0.5 to 0.7743.
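The cutoff step could also be done in R rather than by eye. This is only a sketch: the text below is made up to resemble the general shape of Weka's ranked attribute listing (score, attribute index, attribute name), the gene names are placeholders, and the 0.5 threshold is my reading of the range above.

```r
# Made-up listing in the assumed shape "score  index  name".
# Scores and names are illustrative only, not the real results.
ranked_text <- "
 0.7743   101 gene-EXAMPLE1
 0.6120  2050 gene-EXAMPLE2
 0.5000    17 gene-EXAMPLE3
 0.4100   905 gene-EXAMPLE4
"

ranked <- read.table(text = ranked_text,
                     col.names = c("score", "index", "attribute"),
                     stringsAsFactors = FALSE)

# Keep attributes at or above the (assumed) 0.5 cutoff.
selected <- ranked[ranked$score >= 0.5, ]
print(selected$attribute)
```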

My goal now is to rerun this pipeline in Weka using only the data from these 24 genes to see if predictability improves. This will involve some inelegant data munging with BBEdit, Excel, and RStudio.

Step 1) simply read in Sam’s fpkm table with 39k genes.

fpkm <- read.csv(file = "../analyses/gene_fpkm.csv")

Step 2) generate a list of genes I can insert into a dplyr::select function. (skip this part if averse to cringing)

Paste into excel, copy gene column, paste back into bbedit:
Text::remove line breaks
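Find::replace gene- with gene. (probably not Weka formatting: R's read.csv, with its default check.names = TRUE, turns the hyphen in column names into a dot)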
Find::replace ` ` with ,
Paste in dplyr::select code chunk

select05.treat <- dplyr::select(fpkm, Treatment, TreatmentN, Sex, gene.LOC111128233, gene.LOC111105393, gene.LOC111116464, gene.LOC111135088, gene.LOC111130500, gene.LOC111127278, gene.LOC111113969, gene.LOC111132481, gene.LOC111105938, gene.LOC111138283, gene.LOC111131835, gene.LOC111129832, gene.LOC111130487, gene.LOC111127862, gene.LOC111110254, gene.LOC111124688, gene.LOC111125297, gene.LOC111116779, gene.LOC111119607, gene.LOC111123744, gene.LOC111124204, gene.LOC111113829, gene.LOC111110051, gene.LOC111104660)
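For next time, a less cringe-worthy route might skip Excel and BBEdit entirely. The sketch below uses a tiny made-up table (`fpkm_toy`, with invented FPKM values) in place of the real one, and only the first three gene IDs from the call above. `make.names()` applies the same renaming that `read.csv()` does by default, so the Weka-style names can be converted in one step.

```r
# Gene IDs as they would appear in Weka (hyphenated); first three from above.
weka_genes <- c("gene-LOC111128233", "gene-LOC111105393", "gene-LOC111116464")

# Convert to the column names R produces on import (hyphen becomes a dot).
gene_cols <- make.names(weka_genes)

# Toy stand-in for the real FPKM table; all values are invented.
csv_text <- "Treatment,TreatmentN,Sex,gene-LOC111128233,gene-LOC111105393,gene-LOC111116464,gene-UNSELECTED
Exposed,3,M,1.2,0.0,5.1,9.9
Control,2,F,0.4,2.2,4.8,9.9"
fpkm_toy <- read.csv(text = csv_text)

# Keep the metadata columns plus the selected genes.
keep <- c("Treatment", "TreatmentN", "Sex", gene_cols)
selected_toy <- fpkm_toy[, keep]
print(names(selected_toy))
```

With dplyr loaded, `dplyr::select(fpkm_toy, dplyr::all_of(keep))` should give the same columns.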

Then write it out as a CSV:

write.table(select05.treat, file = "../analyses/select05.treat.csv", sep = ",", row.names = FALSE, quote = FALSE)

Let's bring that into the Weka Explorer.

Classify now gives a much higher accuracy, ROC area, etc.

classify

Summary

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances          21               95.4545 %
Incorrectly Classified Instances         1                4.5455 %
Kappa statistic                          0.9076
Mean absolute error                      0.0455
Root mean squared error                  0.2132
Relative absolute error                  9.1304 %
Root relative squared error             42.6835 %
Total Number of Instances               22     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 1.000    0.100    0.923      1.000    0.960      0.911    0.950     0.923     Exposed
                 0.900    0.000    1.000      0.900    0.947      0.911    0.950     0.945     Control
Weighted Avg.    0.955    0.055    0.958      0.955    0.954      0.911    0.950     0.933     

=== Confusion Matrix ===

  a  b   <-- classified as
 12  0 |  a = Exposed
  1  9 |  b = Control
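As a sanity check on how these numbers hang together, the headline statistics can be recomputed from the confusion matrix alone. This is a small base R sketch; the variable names are mine, and Exposed is treated as the positive class for the MCC.

```r
# Confusion matrix from the Weka output: rows = actual, columns = predicted.
cm <- matrix(c(12, 0,
                1, 9),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("Exposed", "Control"),
                             predicted = c("Exposed", "Control")))

n <- sum(cm)
accuracy <- sum(diag(cm)) / n

# Cohen's kappa: observed agreement corrected for chance agreement.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

# Per-class precision and recall.
precision <- diag(cm) / colSums(cm)
recall <- diag(cm) / rowSums(cm)

# Matthews correlation coefficient, with Exposed as the positive class.
tp <- cm["Exposed", "Exposed"]; fn <- cm["Exposed", "Control"]
fp <- cm["Control", "Exposed"]; tn <- cm["Control", "Control"]
mcc <- (tp * tn - fp * fn) /
  sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(round(c(accuracy = accuracy, kappa = cohen_kappa, mcc = mcc), 4))
print(round(rbind(precision, recall), 3))
```

These reproduce the reported 95.4545% accuracy, 0.9076 kappa, and 0.911 MCC, along with the per-class precision and recall.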

This seems to suggest to me that the expression patterns of these 24 genes can confidently predict exposure, and thus that these genes are impacted by OA exposure, independent of sex.

But the next task is to identify what these genes are.

Written on January 22, 2022