What's a Weka going to do?

Here I want to examine how machine learning might compare with our conventional gene expression analysis. The data set includes gonad tissue from both male and female oysters exposed to OA conditions (and controls). Sam ran the data through bowtie/stringtie for comparison. Complete sample details are below. PDF of post

Sample.ID OldSample.ID Treatment Sex TreatmentN Parent.ID
12M S12M Exposed M 3 EM05
13M S13M Control M 1 CM04
16F S16F Control F 2 CF05
19F S19F Control F 2 CF08
22F S22F Exposed F 4 EF02
23M S23M Exposed M 3 EM04
29F S29F Exposed F 4 EF07
31M S31M Exposed M 3 EM06
35F S35F Exposed F 4 EF08
36F S36F Exposed F 4 EF05
39F S39F Control F 2 CF06
3F S3F Exposed F 4 EF06
41F S41F Exposed F 4 EF03
44F S44F Control F 2 CF03
48M S48M Exposed M 3 EM03
50F S50F Exposed F 4 EF01
52F S52F Control F 2 CF07
53F S53F Control F 2 CF02
54F S54F Control F 2 CF01
59M S59M Exposed M 3 EM01
64M S64M Control M 1 CM05
6M S6M Control M 1 CM02
76F S76F Control F 2 CF04
77F S77F Exposed F 4 EF04
7M S7M Control M 1 CM01
9M S9M Exposed M 3 EM02
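
As a quick sanity check on the design, the sample table above can be tallied in R. This is only a minimal sketch that pastes the table in as text; the object names `samples` and `counts` are mine and are not part of Sam's analysis.

```r
# Sample sheet copied from the table above (whitespace-delimited).
samples <- read.table(text = "Sample.ID OldSample.ID Treatment Sex TreatmentN Parent.ID
12M S12M Exposed M 3 EM05
13M S13M Control M 1 CM04
16F S16F Control F 2 CF05
19F S19F Control F 2 CF08
22F S22F Exposed F 4 EF02
23M S23M Exposed M 3 EM04
29F S29F Exposed F 4 EF07
31M S31M Exposed M 3 EM06
35F S35F Exposed F 4 EF08
36F S36F Exposed F 4 EF05
39F S39F Control F 2 CF06
3F S3F Exposed F 4 EF06
41F S41F Exposed F 4 EF03
44F S44F Control F 2 CF03
48M S48M Exposed M 3 EM03
50F S50F Exposed F 4 EF01
52F S52F Control F 2 CF07
53F S53F Control F 2 CF02
54F S54F Control F 2 CF01
59M S59M Exposed M 3 EM01
64M S64M Control M 1 CM05
6M S6M Control M 1 CM02
76F S76F Control F 2 CF04
77F S77F Exposed F 4 EF04
7M S7M Control M 1 CM01
9M S9M Exposed M 3 EM02",
  header = TRUE,
  colClasses = c("character", "character", "character",
                 "character", "integer", "character"))

# Cross-tabulate treatment by sex.
counts <- table(samples$Treatment, samples$Sex)
print(counts)

# TreatmentN encodes the treatment-by-sex combination.
combos <- unique(samples[order(samples$TreatmentN),
                         c("TreatmentN", "Treatment", "Sex")])
print(combos)
```

The table lists 26 samples: 4 control males, 8 control females, 6 exposed males, and 8 exposed females, with TreatmentN 1 to 4 coding Control M, Control F, Exposed M, and Exposed F respectively.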

Sam provided me with a table of FPKM values pulled from Ballgown. There is information for about 39k genes. In short, I believe Sam found no DEGs with regard to treatment but, as might be expected, many differences based on sex.

I opened the csv file up in the Weka Explorer.

Then I ran a classifier using SMO (Weka's support vector machine trainer), with Treatment as the class (thus ignoring sex). Correctly Classified Instances = 59.09%
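
To put 59.09% in context, it helps to compare it with always guessing the more common class. The sketch below assumes this first run used the same 22 instances (12 Exposed, 10 Control) that appear in the cross-validation output further down; that is my assumption, since the confusion matrix for this run is not shown here.

```r
# Assumption: same 22 instances as the later run (12 Exposed, 10 Control).
n_exposed <- 12
n_control <- 10
n_total <- n_exposed + n_control

# Accuracy of always predicting the majority class.
baseline_acc <- max(n_exposed, n_control) / n_total

# 59.09% of 22 instances corresponds to 13 correct calls.
smo_correct <- 13
smo_acc <- smo_correct / n_total

cat(sprintf("Majority-class baseline: %.2f%%\n", 100 * baseline_acc))
cat(sprintf("SMO on all genes:        %.2f%%\n", 100 * smo_acc))
```

Under that assumption, the classifier trained on all ~39k genes does only slightly better than calling every sample Exposed.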

I went on to Select Attributes

Based on this I found 24 contributing genes, with Ranked attributes values ranging from 0.5 to 0.7743.

My goal now is to rerun this pipeline in Weka using only the data from these 24 genes to see whether prediction improves. This will involve some inelegant data munging with BBEdit, Excel, and RStudio.

Step 1) simply read in Sam’s fpkm table with 39k genes.

fpkm <- read.csv(file = "../analyses/gene_fpkm.csv")

Step 2) generate a list of genes I can insert into a dplyr::select function. (skip this part if averse to cringing)

Paste into excel, copy gene column, paste back into bbedit:
Text::remove line breaks
Find::replace gene- with gene. (due to Weka formatting?)
Find::replace ` ` with ,
Paste in dplyr::select code chunk
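
For what it's worth, the copy/paste steps above could also be done directly in R. Here is a minimal sketch, assuming the gene IDs are available as a character vector with the `gene-` prefix. The `toy` data frame, its values, and the extra gene `gene.LOC111999999` are made up for illustration. Note that `read.csv()` with its default `check.names = TRUE` passes column names through `make.names()`, which turns `-` into `.`; that is likely why the names need `gene.` rather than `gene-` on the R side.

```r
# Gene IDs as they might be copied from the Weka output (only three shown).
genes_raw <- c("gene-LOC111128233", "gene-LOC111105393", "gene-LOC111116464")

# make.names() applies the same conversion read.csv() uses for column names
# when check.names = TRUE (the default): "-" becomes ".".
genes <- make.names(genes_raw)

# Toy stand-in for the fpkm table; all values are made up.
toy <- data.frame(
  Treatment = c("Exposed", "Control"),
  TreatmentN = c(3, 1),
  Sex = c("M", "M"),
  gene.LOC111128233 = c(1.2, 0.4),
  gene.LOC111105393 = c(0.0, 2.1),
  gene.LOC111116464 = c(5.5, 5.0),
  gene.LOC111999999 = c(9.9, 9.8)  # hypothetical gene that is not selected
)

# Keep the metadata columns plus the selected genes.
keep <- c("Treatment", "TreatmentN", "Sex", genes)
selected <- toy[, keep]
print(names(selected))
```

With dplyr loaded, the same subset could be written as `dplyr::select(toy, dplyr::all_of(keep))`, which avoids pasting 24 names by hand.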

select05.treat <- dplyr::select(fpkm, Treatment, TreatmentN, Sex, gene.LOC111128233, gene.LOC111105393, gene.LOC111116464, gene.LOC111135088, gene.LOC111130500, gene.LOC111127278, gene.LOC111113969, gene.LOC111132481, gene.LOC111105938, gene.LOC111138283, gene.LOC111131835, gene.LOC111129832, gene.LOC111130487, gene.LOC111127862, gene.LOC111110254, gene.LOC111124688, gene.LOC111125297, gene.LOC111116779, gene.LOC111119607, gene.LOC111123744, gene.LOC111124204, gene.LOC111113829, gene.LOC111110051, gene.LOC111104660)

Then writing it out as

write.table(select05.treat, file = "../analyses/select05.treat.csv", sep = ",", row.names = FALSE, quote = FALSE)

Let's bring that into Weka Explorer

Classify now gives much better results (95.45% correctly classified, ROC area 0.950)

Summary


=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances          21               95.4545 %
Incorrectly Classified Instances         1                4.5455 %
Kappa statistic                          0.9076
Mean absolute error                      0.0455
Root mean squared error                  0.2132
Relative absolute error                  9.1304 %
Root relative squared error             42.6835 %
Total Number of Instances               22     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 1.000    0.100    0.923      1.000    0.960      0.911    0.950     0.923     Exposed
                 0.900    0.000    1.000      0.900    0.947      0.911    0.950     0.945     Control
Weighted Avg.    0.955    0.055    0.958      0.955    0.954      0.911    0.950     0.933     

=== Confusion Matrix ===

  a  b   <-- classified as
 12  0 |  a = Exposed
  1  9 |  b = Control
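
The headline numbers in the summary can be reproduced from the confusion matrix alone. Below is a minimal sketch using the standard textbook formulas for accuracy, Cohen's kappa, precision, recall, and MCC; the variable names are mine, and Exposed is treated as the positive class.

```r
# Confusion matrix from the Weka output: rows = actual, columns = predicted.
cm <- matrix(c(12, 0,
                1, 9),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("Exposed", "Control"),
                             predicted = c("Exposed", "Control")))

n <- sum(cm)
accuracy <- sum(diag(cm)) / n

# Cohen's kappa: observed agreement corrected for chance agreement.
p_chance <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_stat <- (accuracy - p_chance) / (1 - p_chance)

# Per-class counts, treating Exposed as the positive class.
tp <- cm["Exposed", "Exposed"]
fn <- cm["Exposed", "Control"]
fp <- cm["Control", "Exposed"]
tn <- cm["Control", "Control"]

precision <- tp / (tp + fp)
recall <- tp / (tp + fn)
mcc <- (tp * tn - fp * fn) /
  sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(round(c(accuracy = accuracy, kappa = kappa_stat,
              precision = precision, recall = recall, mcc = mcc), 4))
```

These agree with the Weka report: 21 of 22 correct (95.4545%), kappa 0.9076, and for the Exposed class precision 0.923, recall 1.000, and MCC 0.911. The single error is one Control sample classified as Exposed.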

This seems to suggest that the expression patterns of these 24 genes can confidently predict exposure, and thus that these genes must be impacted by OA exposure, independent of sex.

But the next task is to identify what these genes are.

Written on January 22, 2022