This method comprises two variants of a greedy Gene Selection procedure: [[Gene Addition]] and [[Gene Elimination]].\n[img[Gene Selection Add Del|SelectionAddDel.jpg]]\n
Billy Ranking is a gene [[Ranking]] method proposed by Basilis Moustakis. The procedure is depicted in the following image:\n[img[Billy Ranking|BillyRanking.jpg]]
This is a [[Gene Selection]] method proposed by Basilis Moustakis. This method evaluates each [[group|Grouping]], starting from the most discriminant. The evaluation is done through the [[Prediction]] method that the user selects. First we evaluate the first group and estimate the [[Accuracy]] over the [[train data|Input]]. Then we evaluate the accuracy of the first and the second group together. If the accuracy improves then we keep the second group in our selection, otherwise we discard it. Then we continue by evaluating the previously selected groups plus the third group. This procedure continues until all groups are examined. At the end we select the minimum set of groups that produced the maximum accuracy.
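The greedy forward pass described above can be sketched in Python. This is an illustrative reading, not the tool's actual code: the {{{accuracy}}} callback is a hypothetical stand-in for the user-selected [[Prediction]] method evaluated on the train data, and all names are ours.

```python
def billy_select(groups, accuracy):
    """Greedy forward evaluation of ranked groups (most discriminant first).

    `groups` is a list of gene groups ordered by discriminative power;
    `accuracy(genes)` is a user-supplied callback returning train-set
    accuracy for a candidate gene set (a hypothetical stand-in for the
    Prediction method chosen in the GUI)."""
    selected = list(groups[0])           # always start with the top group
    best = accuracy(selected)
    for group in groups[1:]:
        candidate = selected + list(group)
        acc = accuracy(candidate)
        if acc > best:                   # keep the group only if it helps
            selected = candidate
            best = acc
    return selected, best
```

Because a group is kept only on a strict improvement, the returned selection is the smallest prefix-driven set that reached the maximum accuracy seen.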
The Data File contains the primary data with gene expressions. It should be a tab-delimited file with {{{k}}} rows and {{{l}}} columns. The {{{i}}}-th row and {{{j}}}-th column should hold the expression of the {{{i}}}-th gene for the {{{j}}}-th patient/sample. The filename could be anything, say “train.txt”. \n\nFor example the following data file contains the expression values of 4 genes in 10 samples:\n{{{\n3.2 5.3 1.2 1.1 5.6 6.8 1.9 6.5 5.2 7.0\n6.5 5.5 8.9 4.1 7.0 6.2 1.5 6.2 8.1 3.8\n0.5 6.8 8.4 2.7 7.5 3.9 6.9 3.9 3.3 5.1\n1.8 4.9 9.9 6.3 4.8 6.2 5.9 5.2 4.8 8.3\n}}}\n\nThe file can have any name of the form {{{Filename.ext}}}, but the [[Options File]] and the [[Names File]] should be of the form {{{Filename.opt}}} and {{{Filename.names}}} respectively. For example a domain can have the names: {{{Leukemia.data}}}, {{{Leukemia.opt}}}, {{{Leukemia.names}}}.
As depicted in the [[Introduction]], ~MineGene offers both [[Supervised]] (class prediction) and [[Unsupervised]] (clustering) learning methods. Moreover, it includes some [[Filtering]] methods for data screening and a set of [[Validation]] methods.
[[Introduction]]
Assume that the sample is presented as a vector of gene-expression values for the genes that are present in the diagnostic/prognostic kit. The Discretization method is a novel [[Prediction]] procedure, with a respective metric, that predicts the class of a sample. It has been proposed by:\n\n//Potamias, G., Koumakis, L., Moustakis, L. Gene Selection via Discretized ~Gene-Expression Profiles and Greedy ~Feature-Elimination. 2004. Hellenic Conference on Artificial Intelligence.//\n\nWe assign the integer values ‘1’ and ‘-1’ to the respective discretised genes’ expression-levels of the new unknown sample. To do this, we discretize the expression values of the unseen samples according to the midpoint values computed via the [[Entropy Ranking]] method. The integer values ‘1’ and ‘-1’ stand for the //‘h’// and //‘l’// assignments, respectively, denoted with //sign(s~~g~~)//. The matching formula, below, is used to predict the class of a sample //s//.\n\n[img[Discretization Prediction Formula|DiscrPredict.jpg]]\n\nIn this formula, with //g ∈ P// we denote all selected and positively ranked genes and, respectively, with //g ∈ N// we denote all selected and negatively ranked genes. With //|P|// and //|N|// we denote the number of “Positive” and “Negative” train samples respectively. As with the [[gene-ranking formula|Entropy Ranking]], this formula also encompasses a polarity characteristic. If the outcome of the formula is positive then the new sample is assigned to class //P//, and if it is negative then it is assigned to class //N//. In addition, the strength with which the sample is predicted to belong to one of the two classes is also provided, so that strong (or weak) predictions can be made. 
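A simplified sketch of the matching idea follows. This is one reading consistent with the description above, not the published formula itself: the published formula also weights each gene by its class counts, which we omit here, and all names are ours.

```python
def predict_discretized(sample_signs, pos_genes, neg_genes):
    """Match a discretised sample against the selected genes.

    `sample_signs` maps gene -> +1 ('h') or -1 ('l'); `pos_genes` /
    `neg_genes` are the positively / negatively ranked selected genes.
    Simplified sketch: the per-gene class-count weighting of the
    original formula is omitted."""
    score = sum(sample_signs[g] for g in pos_genes) \
          - sum(sample_signs[g] for g in neg_genes)
    label = "P" if score > 0 else "N" if score < 0 else "unclassified"
    return score, label
```

For an ideal class //P// sample this score reaches the total number of selected genes, and for an ideal class //N// sample its negation, matching the polarity discussion below.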
Take as an example the extreme case where //L~~g;P~~ = H~~g;N~~ = 0// for all selected genes (i.e., all the genes have ‘high’ values for all class //P// samples, and ‘low’ values for all class //N// samples; in other words all selected genes are ideally associated with the respective classes). Then, in the above formula the bracketed factor receives its maximum positive value, which equals the total number of selected genes, say //T//. Now, if the incoming unseen sample has ‘high’ values (i.e., //sign(s~~g~~) = 1//) for all genes associated with class //P//, and ‘low’ values (i.e., //sign(s~~g~~) = -1//) for all genes associated with class //N// (i.e., an ideal class //P// sample), then the formula receives its maximum positive value, which equals //T//. So, the sample is strongly predicted to belong to class //P//. The same holds for the inverse case where the incoming sample is an ideal class //N// sample: the outcome of the formula will be //-T//, and the sample will be strongly predicted to belong to class //N//. Under suitable assumptions (based on an analysis of all prediction figures) a ‘weak’ prediction could leave the sample unclassified.
Entropy is a gene-ranking method proposed by:\n\n//Potamias, G., Koumakis, L., Moustakis, L. Gene Selection via Discretized ~Gene-Expression Profiles and Greedy ~Feature-Elimination. 2004. Hellenic Conference on Artificial Intelligence.//\n\nAn explanation of the method can also be found [[here|http://www.csd.uoc.gr/~kantale/Kanterakis_THESIS.pdf]] (page 51).\n\nA general statement of the two-interval discretization problem, and the two-step process that solves it, follows:\nGiven: A vector of numbers //V = <n~~1~~, n~~2~~, ..., n~~v~~>//, sorted so that //n~~i~~ < n~~i+1~~//, where each number //n~~i~~// in //V// is assigned to one of two classes. \nFind: A number //μ//, //n~~1~~ < μ < n~~v~~//, that splits the numbers in //V// into two intervals: //[n~~1~~, μ)// and //[μ, n~~v~~]//, and best //discriminates// between the two classes. Best discrimination is decided according to a specified criterion. This [[Ranking]] method contains two steps:\n# For every consecutive pair of numbers //n~~i~~//, //n~~i+1~~// in //V// their midpoint, //μ~~i~~ = (n~~i~~ + n~~i+1~~)/2//, is computed, and the corresponding ordered vector of midpoint numbers is formed, //M = < μ~~1~~ , μ~~2~~ … μ~~v-1~~>//. \n# For each //μ ∈ M// the well-known information gain metric is computed:\n[img[Entropy Ranking|EntropyRanking.jpg]]\nwhere sets //V~~l~~// and //V~~h~~// include the numbers from //V// which are less than //μ// and higher than (or equal to) //μ//, correspondingly. That is, //V~~l~~ = {n~~i~~ ∈ V / n~~i~~ in [n~~1~~,μ)}// and //V~~h~~ = {n~~i~~∈V / n~~i~~ in [μ,n~~v~~]}//. \n\nNote that the first term in the equation is just the entropy of the original set of numbers in //V// according to their class assignment, i.e., the distribution of class-values assigned to the numbers in //V//. The second term is the expected entropy after //V// is split using //μ// as the split point, that is, taking into account the distribution of class-values assigned to the numbers in //V~~l~~// and //V~~h~~//. 
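The two-step procedure can be sketched as follows (an illustrative Python rendering of the standard information-gain split; names are ours):

```python
import math

def entropy(labels):
    """Shannon entropy of a multiset of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def best_midpoint(values, labels):
    """Two-interval discretisation: return the midpoint with maximum
    information gain.  `values` are one gene's expression values,
    `labels` the class of each sample (same order)."""
    pairs = sorted(zip(values, labels))
    base = entropy([c for _, c in pairs])      # entropy of the whole vector
    best_mu, best_gain = None, -1.0
    for i in range(len(pairs) - 1):
        mu = (pairs[i][0] + pairs[i + 1][0]) / 2
        low = [c for v, c in pairs if v < mu]
        high = [c for v, c in pairs if v >= mu]
        if not low or not high:                # skip degenerate splits (ties)
            continue
        gain = base - (len(low) / len(pairs)) * entropy(low) \
                    - (len(high) / len(pairs)) * entropy(high)
        if gain > best_gain:
            best_mu, best_gain = mu, gain
    return best_mu, best_gain
```

A perfectly separable gene (all class-//N// values below all class-//P// values) attains gain equal to the base entropy.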
The midpoint that exhibits the maximum information gain is considered as the gene’s expression value which, when considered as a split point, exhibits the best discrimination between the classes. Then, this point is selected to assign the gene’s expression values to the nominal //‘l’//ow or, //‘h’//igh values, respectively (i.e., less than //μ// and higher than //μ//). A ‘natural’ (even extreme and controversial in a molecular setting!) interpretation of low and high expression values for a gene is that the state of the gene is ‘on’ or ‘off’ in a particular sample (e.g., disease type or state).\n\nFor each discretized gene we count the number of //‘h’//s and //‘l’//s that occur in the respective samples. Assume that each sample is assigned to one of two classes, i.e., //P//, and //N//. The following quantities are computed: //H~~g,P~~// = number of //‘h’// values for gene //g// assigned to class //P//; //L~~g,P~~// = number of //‘l’// values for gene //g// assigned to class //P//; //H~~g,N~~// = number of //‘h’// values for gene //g// assigned to class //N//; and //L~~g,N~~// = number of //‘l’// values for gene //g// assigned to class //N//. The formula below, computes a rank for each gene that measures the power of the gene to distinguish between the two classes:\n[img[Entropy Ranking 2|EntropyRanking2.jpg]]\n\n
We can select to perform our study restricted to genes whose names exist in an external file. The file should have the following format:\n{{{\n0 AFFX-BioB-5_at CL1\n3 AFFX-BioB-M_at CL2\n1 AFFX-BioB-3_at CL2\n}}}\nThe first and third columns are ignored, although required for consistency reasons. The second column contains the names of the genes that will be further processed. In this simple way we can restrict a study to a particular set of genes of interest.\n
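Reading this three-column format reduces to taking the second whitespace-delimited field of each line. A minimal sketch (function name is ours, not the tool's):

```python
def read_gene_names(path):
    """Read the three-column file described above and return the set of
    gene names (second column); the first and third columns are ignored."""
    keep = set()
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) >= 2:       # skip blank or malformed lines
                keep.add(fields[1])
    return keep
```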
Filtering is the first task of all the learning procedures and the only one that accesses the primary [[datasets|Input]]. Filtering can be considered as a preprocessing of the primary data. With filtering we reduce the number of gene expression profiles that are studied further, according to a preferred criterion. The main reason to do this is to simplify the following tasks by providing them with less data, and to reduce the dimensionality of the problem. Usually the filtered-out data do not contain any significant information about gene expression regulation; namely, filtered data are not significantly regulated among the different sample classes. The provided filtering methods are:\n* [[NaN Removal]]\n* [[Wilcoxon Rank-Sum Test Filtering]]\n* [[Fold Difference Filtering]]\n* [[File Filtering]]\nMore than one filtering method can be applied to a dataset, as we can see in the filtering dialog box of ~MineGene, by checking the corresponding check boxes:\n[img[Filtering Dialog Box|Filtering.jpg]]
With fold difference we test the overall mean value of a gene's expression values over 2 different kinds of samples. If the //Samples%// value is zero, then we test whether the mean value of the samples that belong to the first class is at least {{{x}}} times greater than the mean value of the samples that belong to the second class. {{{x}}} is the value that we enter in the //Fold// text box. \n\nIf //Samples%// is not zero, then we calculate the number (let it be {{{k}}}) of samples, over all classes, whose values are at least {{{x}}} times greater than the mean value over all samples. If {{{k}}}, expressed as a percentage of all samples, is greater than or equal to //Samples%//, then the gene is not screened out. \n[img[Fold Difference|Fold.jpg]]\n
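The rule can be sketched as below. This is one reading of the description above and not the tool's exact semantics; all names are ours.

```python
def passes_fold_filter(class_a, class_b, fold, samples_pct=0.0):
    """Fold-difference screen for one gene (sketch of one reading of the
    rule; the actual tool semantics may differ).

    With samples_pct == 0: keep the gene if one class mean is at least
    `fold` times the other class mean.  Otherwise: keep it if at least
    samples_pct percent of all samples reach `fold` times the overall
    mean."""
    if samples_pct == 0.0:
        mean_a = sum(class_a) / len(class_a)
        mean_b = sum(class_b) / len(class_b)
        hi, lo = max(mean_a, mean_b), min(mean_a, mean_b)
        return hi >= fold * lo
    values = list(class_a) + list(class_b)
    overall = sum(values) / len(values)
    k = sum(1 for v in values if v >= fold * overall)
    return 100.0 * k / len(values) >= samples_pct
```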
This is the GUI tiddler
In this [[Gene Selection]] method we are presented with the two vectors of groups of genes, //O~~P~~ = <O~~p;f~~, O~~p;f-1~~ … O~~p;1~~>// and //O~~N~~ = <O~~n;f~~, O~~n;f-1~~ … O~~n;1~~>//. Note that the beginning elements in the two vectors contain groups of genes that are less distinguishing between the two classes. In contrast, the ending elements contain genes that are most discriminant. So, it is rational to consider a procedure that adds groups from the beginning of the two vectors. We consider three situations: \n# Adding a group from //O~~P~~//\n# Adding a group from //O~~N~~//\n# Adding a group from both //O~~P~~// and //O~~N~~//. \nIn all cases, the accuracy is assessed on the training samples. The accuracy is computed via the selected [[Prediction]] method. The accuracy figure and the respective list of selected genes are recorded. The addition that exhibits the highest accuracy is performed. The group-addition process continues till all the groups in the two lists are considered. The list of genes with the highest accuracy is selected as the final set of most discriminant genes. \n
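The three-move greedy loop can be sketched as follows. This is an illustrative reading: the {{{accuracy}}} callback is a hypothetical stand-in for the selected [[Prediction]] method evaluated on the training samples, and all names are ours. The mirror-image [[Gene Elimination]] variant deletes groups instead of adding them.

```python
def greedy_group_addition(o_p, o_n, accuracy):
    """Greedy ADD over the two ordered group vectors O_P and O_N,
    taking groups from the beginning of the vectors."""
    o_p, o_n = list(o_p), list(o_n)
    chosen, best_genes, best_acc = [], [], -1.0
    while o_p or o_n:
        moves = []                     # (accuracy, take_from_p, take_from_n)
        if o_p:
            moves.append((accuracy(_flat(chosen + [o_p[0]])), 1, 0))
        if o_n:
            moves.append((accuracy(_flat(chosen + [o_n[0]])), 0, 1))
        if o_p and o_n:
            moves.append((accuracy(_flat(chosen + [o_p[0], o_n[0]])), 1, 1))
        acc, take_p, take_n = max(moves, key=lambda m: m[0])
        if take_p:
            chosen.append(o_p.pop(0))
        if take_n:
            chosen.append(o_n.pop(0))
        if acc > best_acc:             # record the best selection seen so far
            best_acc, best_genes = acc, _flat(chosen)
    return best_genes, best_acc

def _flat(groups):
    """Flatten a list of gene groups into one gene list."""
    return [g for grp in groups for g in grp]
```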
In this [[Gene Selection]] method we are presented with the two vectors of groups of genes, //O~~P~~ = <O~~p;f~~, O~~p;f-1~~ … O~~p;1~~>// and //O~~N~~ = <O~~n;f~~, O~~n;f-1~~ … O~~n;1~~>//. Note that the beginning elements in the two vectors contain groups of genes that are less distinguishing between the two classes. In contrast, the ending elements contain genes that are most discriminant. So, it is rational to consider a procedure that eliminates groups from the beginning of the two vectors. We consider three situations: \n# Deleting a group from //O~~P~~// \n# Deleting a group from //O~~N~~//\n# Deleting a group from both //O~~P~~// and //O~~N~~//. \nIn all cases, the accuracy of the remaining genes on the training samples is assessed. The accuracy is computed based on the selected [[Prediction]] method. The accuracy figure and the respective list of remaining genes are recorded. The deletion that exhibits the highest accuracy is performed. The group-elimination process continues till all the groups in the two lists are considered. The list of remaining genes with the highest accuracy is selected as the final set of most discriminant genes.
The next step is to select the most suitable genes, i.e., those that according to our methods can discriminate the samples among two or more categories. These genes must have been regulated differently in the two classes of samples. Most studies select an ‘ad hoc’ number of best-ranked genes, but some algorithmic approaches exist as well. ~MineGene provides the following methods for Gene Selection:\n* [[ADD/DEL Selection Method]]\n* [[Select Groups]]\n* [[Select Genes]]\n* [[Billy Selection]]\n* [[Read From File Selection]]
Usually it is undesirable to manage each gene as a unique feature. The main reason for this is that, according to the previous step, some genes may exhibit a similar ranking, thus they should be treated as a group of genes. Moreover, treating each gene as a unique feature is sometimes an expensive computational task. Grouping allows us to reduce complexity and to reveal possible biological correlations among the genes.\n\nAfter ranking, the genes are sorted according to their ranking in descending order. At the top of the ordering we have genes with maximum descriptive ability. We provide the following grouping methods:\n* [[MAXMIN Grouping]]\n* No Grouping. No grouping is performed at all. With this option, every gene is considered to belong to a unique group.
This general-purpose machine learning tool must meet several requirements. One of them is that it acts as a plug-in in a gene expression database, thus it is implemented in a general-purpose, flexible computer language. Another concern is that it is composed of several [[Mining Operations]] with certain correlations between them. All the tasks carried out by the platform are families of certain algorithms (i.e., we have the family of gene ranking algorithms). Algorithms belonging to the same family share common attributes, methods and architecture. Thus, ~MineGene has been implemented under object-oriented programming principles to ensure the component-like structure of the tool. Finally, the tool provides a Graphical User Interface so that a user has visual contact with the various available algorithms and their parameters. \n\nThe programming language that fulfills the above requirements is C++ and the programming environment selected is Microsoft Visual Studio v. 6.0. The component-based schema of the tool is reflected in the hierarchy of the classes, as we can see in the figure below. It is crucial to note that ~MineGene’s architecture allows a component / plug-in approach. Thus, if a new specific (i.e., ranking) algorithm appears, it is very easy and straightforward to embody it in the tool and enrich its architecture.\n\n[img[alternate text|Architecture2.jpg]]
The basic input of ~MineGene is a set of files (also called a //domain//) that describes a Microarray experiment. This domain contains three files:\n* [[Data File]]\n* [[Options File]]\n* [[Names File]]\nA domain can be treated either as a train dataset or as a test dataset. To open a domain we should press one of the "Select File #" buttons. \n[img[Input|Input.jpg]]\nWe then select the [[Data File]] of the domain and we select from the relevant combo box how this file should be treated. We have 4 choices:\n* Train. This dataset will be treated as train\n* Test. This dataset will be treated as test\n* [[Study]]\n* [[C45 Names]]\nIf we select more than one train domain then these domains will be merged. The same happens if we select more than one test domain.
~MineGene is a general-purpose [[machine learning|http://en.wikipedia.org/wiki/Machine_learning]] tool to serve as an application platform for various [[Mining Operations]] including [[gene selection|Gene Selection]], [[classification|Supervised]] and [[clustering|Unsupervised]] algorithms. It is a collection of Machine Learning algorithms and heuristics for intelligent processing of gene expression data produced by [[DNA Microarray|http://en.wikipedia.org/wiki/DNA_microarray]] experiments. Its main purpose is to mine vast and redundant data for information regarding the ability of certain genes to discriminate between different sample states. ~MineGene is designed and [[implemented|Implementation]] to be suited as a plug-in in a gene expression database. With ~MineGene we give a gene expression database the ability, apart from storing, retrieving, sharing and querying the data, to infer fundamental conclusions about the inner regularities, descriptive ability and possible relations of the data stored.\n\nThe majority of the studies performed on gene expression data analysis follow a ‘one-way’ approach: they apply only one or a very limited set of algorithms. Even when a study is composed of many parts, each responsible for a specific aspect of the process, it is not possible to apply and test various algorithms for this aspect and infer valuable conclusions, not only about the data, but about the application spectrum of an algorithm as well. Moreover, even when we want to test a single algorithm it is desirable to have an environment capable of performing multiple runs with different inputs and parameters. ~MineGene is an effort to implement this environment through a Graphical User Interface. 
The picture below shows the initial, starting GUI of ~MineGene.\n\n[img[MineGene|MineGene.jpg]]\n\nJudging from recently published studies, there is not yet any standard method for microarray gene expression data analysis, but some general guidelines have recently started to form. These guidelines represent a sequential procedure that starts after data acquisition and ends with the construction of a predictor or a clustering mechanism, depending on whether we are performing [[Supervised]] or [[Unsupervised]] [[Mining Operations]]. This procedure is depicted in the image below.\n[img[MineGene|Architecture1.jpg]]\n
In K-fold cross-validation, the original train sample is partitioned randomly into a training set and a testing set. The number of samples that will belong to the train and test datasets is defined by the user as a percentage value. For example, if the user selects 40% K-fold validation in a domain that has 150 samples, 60 random samples will be used for training and the remaining 90 will be used for testing. This procedure is repeated several times (user defined) so that an adequate number of experiments is performed. On each experiment we estimate the [[accuracy]] of the model.\n\nIn the Validation dialog box, in the ~K-Fold Validation section:\n[img[K Fold Validation|KFoldValidation.jpg]]\nIn this segment of the dialog there are 2 text boxes. In the first we enter the number of experiments that we want to perform for each percentage value entered in the second text box. In the second text box we enter percentage values separated with commas (,). For example, if we enter the value {{{10}}} in the first text box and the values {{{30, 50, 80}}} in the second, then 30 experiments will be performed: 10 experiments where 30% of the train samples will be treated as train and the rest 70% as test samples, another 10 experiments where 50% of the train samples will be treated as train and the rest 50% as test, and another 10 experiments where 80% of the train samples will be treated as train samples and the rest 20% as test.\n\nBy entering consecutive percentage values for train dataset partitioning, e.g. {{{20, 40, 50, 60, 80, 90}}}, we can acquire the [[Learning Curve|http://en.wikipedia.org/wiki/Learning_curve]] of our model.
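The random percentage-split loop can be sketched as follows (an illustration; the {{{evaluate}}} callback is a hypothetical stand-in for a full train-and-predict run):

```python
import random

def percentage_validation(samples, percentages, runs, evaluate, seed=0):
    """Random train/test partitioning as described above.

    For each percentage, `runs` random splits are drawn and the mean
    accuracy returned; `evaluate(train, test) -> accuracy` is a
    placeholder for the learning pipeline."""
    rng = random.Random(seed)
    results = {}
    for pct in percentages:
        accs = []
        for _ in range(runs):
            shuffled = samples[:]
            rng.shuffle(shuffled)
            n_train = round(len(shuffled) * pct / 100)
            train, test = shuffled[:n_train], shuffled[n_train:]
            accs.append(evaluate(train, test))
        results[pct] = sum(accs) / len(accs)
    return results
```

Plotting {{{results}}} over an increasing sequence of percentages yields the learning curve mentioned above.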
With this method we take one sample from the train dataset. Then we perform all the selected [[algorithms|Supervised]], using the taken sample as a test dataset. This process is repeated iteratively for all train samples. The ratio of the samples predicted successfully by the predictor reflects the predictive capacity of the train data. In the absence of test data we may consider as most discriminant the genes that participated most often in the selected gene sets during a LOOCV procedure. LOOCV is an essential validation method that can estimate the value of our learning method and/or the predictive ability of our data.
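The LOOCV loop, including the gene-participation count mentioned above, can be sketched as follows (the {{{train_and_predict}}} callback is a hypothetical stand-in for the whole selected pipeline):

```python
from collections import Counter

def loocv(samples, train_and_predict):
    """Leave-one-out cross-validation sketch.

    `train_and_predict(train, held_out)` is a placeholder returning
    (predicted_label, selected_genes) for one held-out sample."""
    correct = 0
    gene_counts = Counter()            # how often each gene was selected
    for i, held_out in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        label, genes = train_and_predict(train, held_out)
        correct += (label == held_out["label"])
        gene_counts.update(genes)
    return correct / len(samples), gene_counts
```

The most frequently counted genes in {{{gene_counts}}} are the LOOCV-stable candidates discussed above.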
First we estimate the value:\n[img[maxmim grouping 1|MaxminGroup1.jpg]]\n{{{MaxRank}}} and {{{MinRank}}} are the maximum and minimum ranking of the genes respectively, as they were computed in the [[Ranking]] step. As we have positive and negative ranking we have to estimate two {{{g}}} values: one for positive and one for negative ranking. Gene {{{i}}} is assigned to group O~~i~~ according to the formula:\n[img[maxmim grouping 1|MaxminGroup2.jpg]]\nIn this formula, R~~i~~ is the ranking of gene {{{i}}}, and {{{k}}} is an integer variable.
[[Introduction]]\n[[Implementation]]\n[[Mining Operations]]\n[[Input]]\n
This [[Ranking]] method is available only for discrete expression data. The discretized values should be 0 or 1. The ranking procedure used is depicted in the following figure:\n[img[Ranking Nominal 0 1|NOMRanking01.jpg]]\n
This [[Prediction]] method is related to the [[NOM {0, 1}|NOM {0, 1} Ranking]] [[Ranking]] method. In this approach, for each [[selected|Gene Selection]] gene we check the corresponding expression value of a test sample. ''If this value is 1'' then we check the ranking value of this gene and add it to one of two variables, depending on whether this ranking value is positive or negative. When we have checked all selected genes, we compare the absolute values of these variables. If the positive sum is greater than the absolute value of the negative sum, then the sample is assigned to the positive class; otherwise it is assigned to the negative class.
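A minimal sketch of this tally, under the reading above (names ours; class labels are illustrative):

```python
def nom01_predict(sample, selected, ranks):
    """NOM {0, 1} prediction sketch: only genes whose discretised value
    is 1 contribute, each adding its ranking value to the positive or
    negative tally; the larger absolute tally decides the class."""
    pos = neg = 0.0
    for g in selected:
        if sample[g] == 1:
            if ranks[g] > 0:
                pos += ranks[g]
            else:
                neg += ranks[g]
    return "positive" if pos > abs(neg) else "negative"
```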
This [[Ranking]] method is available only for discrete expression data. The range of the discretized values should be [1, N]. The ranking procedure is depicted in the following figure:\n[img[Nominal Ranking 1 N|NOMRanking1N.jpg]]\nIn the latter formula //R~~g~~// is the ranking value for the gene //g//. #Values is the N value, which is 3 for this example. //D(i,1)// is the distribution value for value //i// in class 1.
This [[Prediction]] method is related to the [[NOM {1, N}|NOM {1, N} Ranking]] [[Ranking]] method. Firstly we
Microarray expression data are sometimes erratic or unavailable. In these cases the gene expression matrix contains the value ~NaN (Not a Number) in the corresponding position, instead of a certain continuous value. Gene expression profiles containing too many ~NaN values do not exhibit any particular information, so they can be safely removed. With this [[Filtering]] method we can remove gene expression profiles containing ~NaN values over a certain percentage. \n[img[NaN Removal|NaN.jpg]]\nBy entering a percentage value in the //Features// text box we eliminate all genes that have ~NaN values over the inserted percentage. By entering a percentage value in the //Samples// text box we eliminate all samples that have ~NaN values over the inserted percentage.
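The gene-wise (//Features//) variant of the filter can be sketched as follows; the sample-wise variant is the same check applied column-wise. Names are ours, not the tool's.

```python
import math

def nan_filter_genes(matrix, max_pct):
    """Drop gene rows whose share of NaN entries exceeds max_pct percent.

    `matrix` is genes x samples, with missing values encoded as
    float('nan')."""
    kept = []
    for row in matrix:
        nan_count = sum(1 for v in row if math.isnan(v))
        if 100.0 * nan_count / len(row) <= max_pct:
            kept.append(row)
    return kept
```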
This file contains the names of all genes that participate in an experiment. The name of the file should be in the form {{{Filename.names}}}, where {{{Filename}}} is the filename of the [[Data File]]. The format of the file is:\n{{{\n#gene <tab> GeneName <tab> Value\n}}}\nwhere {{{#gene}}} is the index of the gene whose name the line contains. ''Gene numbering starts from 0, not from 1''. {{{GeneName}}} is the name of the gene and {{{Value}}} is an [[External Gene Name]]. For example in a domain with 4 genes, a possible Names File could be the following:\n{{{\n0 BRCA1 Relevant\n1 COG4 Irrelevant\n2 ABCA1 Relevant\n3 SNX3 Irrelevant\n}}}
The Options File contains information about the samples. The file name should always be in the form {{{Filename.opt}}}, where {{{Filename}}} is the filename of the data file. The information it contains is of the form:\n{{{Attribute = Value}}}. The attributes that this file should contain are:\n* {{{classes}}}. It contains the class assignment of each sample. Each sample is assigned a value from {{{1}}} to {{{n}}} where {{{n}}} is the number of classes. For example if we have a domain with 3 classes and 10 samples a possible class assignment would be:\n{{{\nclasses = 1 2 1 2 3 3 2 1 1 3\n}}}\nThe numbers can be delimited with any white space character (tab or/and enter). If for any reason we don't want a sample to participate in our study then we put a zero value in its class assignment. Moreover (and only for datasets acting as train datasets) if we want some samples to be treated as test samples then we put negative values in their class assignments. For example the following line:\n{{{\nclasses = 1 2 0 2 -3 3 -3 -2 -1 1\n}}}\nindicates that the third sample is not participating at all in the study. Samples {1, 2, 4, 6, 10} are train samples and {5, 7, 8, 9} are test samples.\n* {{{names}}}. It contains the name of each class. For example a names declaration in the .opt file can be:\n{{{\nnames = Benign Malignant Normal\n}}}\nIn this example all samples having the value {{{1}}} in the {{{classes}}} declaration are considered to belong to the {{{Benign}}} class.\n* {{{samples}}}. It contains the name of each sample. This declaration is optional but is useful, especially in the [[Results File]] where the [[Association Value]] for each sample is printed. An example of a {{{samples}}} declaration is:\n{{{\nsamples = Sam1a Sam1b Sam2 Sam3 Sam4a Sam4b Sam4c Sam5 Sam6a Sam6b\n}}}\n
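The class-assignment conventions (zero = excluded, negative = test, positive = train) can be sketched as a small parser (names are ours, not the tool's):

```python
def parse_classes(values):
    """Split a `classes` vector into 1-based train / test / excluded
    sample indices, per the conventions described above."""
    train, test, excluded = [], [], []
    for idx, v in enumerate(values, start=1):
        if v == 0:
            excluded.append(idx)       # sample does not participate
        elif v < 0:
            test.append(idx)           # negative class value: test sample
        else:
            train.append(idx)          # positive class value: train sample
    return train, test, excluded
```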
The final step is to select the predictor. Here the genes selected in the previous task act as attributes with continuous values. Then each sample in the testing dataset is processed by the learning method selected here and assigned to a class. This is the only task where the test dataset is needed. The provided prediction methods are:\n* [[Discretization]]\n* [[NOM {0,1} Prediction]]\n* [[NOM {1,N} Prediction]]\n* [[Billy]]\n* [[Support Vector Machines]] (SVM)\n* [[K-Nearest Neighbor]] (KNN)\n* [[KMeans]]\n* [[Naive Bayes]]
In [[ROC|http://en.wikipedia.org/wiki/Receiver_operating_characteristic]] (Receiver Operating Characteristic) Validation we enter values exactly as in [[K-Fold Validation]] and the sample partitioning is the same. The difference is that instead of the [[accuracy]] we estimate the AUC (Area Under Curve) of the ROC plot.
With ranking we tag each gene with a value indicative of its descriptive ability. The higher the ranking, the better the ability of the gene to discriminate between different sample classes. The available ranking methods are:\n* [[Entropy Ranking]]\n* [[Signal-to-Noise Ranking]]\n* [[Wilcoxon Rank-Sum Ranking]]\n* [[Billy Ranking]]\n* [[NOM {0, 1} Ranking]]\n* [[NOM {1, N} Ranking]]\n* [[Read From File Ranking]]
Here we have the ability to assign to each gene a value that comes from an external source. The file should have the following format:\n{{{\n0 AFFX-BioB-5_at 13.5\n3 AFFX-BioB-M_at 17.1\n1 AFFX-BioB-3_at 21.2\n}}}\nThe first column is ignored. The second and third columns contain the name of the gene and its value respectively. If the file does not contain values for all available genes (all, except those removed by filtering) then an error message is displayed.\n
Instead of applying a specific algorithm to find the most informative genes, we can select genes from an external file. The file should have the following format:\n{{{\n0 AFFX-BioB-5_at CL1\n3 AFFX-BioB-M_at CL2\n1 AFFX-BioB-3_at CL2\n}}}\nThe first and third columns are ignored. The second column contains the names of the selected genes. This method is useful in order to estimate the descriptive ability of genes published in an external study. \n
Apart from the heuristic/algorithmic methods for gene selection, a user can manually set the number of positively or negatively ranked genes to be selected as markers. Let P and N denote the number of positively and negatively ranked genes respectively. Via a special dialog box (figure below) a user can set either the absolute number of desired positive and negative genes, or a percentage of them. We can also choose to “lock” the ratio of selected positive and negative genes to be P/N regardless of our selection of genes. For example, let C be the absolute number of genes that we select to be our final markers. If we choose to “lock” the ratio then we will use (C*P)/(P+N) positive genes and (C*N)/(P+N) negative genes. Similarly, if we select a percentage C instead of an absolute value, then we will use C% of the P positive genes and C% of the N negative genes. The same selection can be done for [[groups|Select Groups]] rather than genes.\n[img[Select Genes|SelectGenes.jpg]]
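The locked-ratio arithmetic can be worked through in a couple of lines (a sketch; we assume integer rounding down, which the tool may handle differently):

```python
def locked_ratio(c, p, n):
    """Split a requested total of c marker genes between positive and
    negative markers while preserving the P/N ratio (rounded down)."""
    pos = (c * p) // (p + n)
    neg = (c * n) // (p + n)
    return pos, neg
```

For example, with P = 30, N = 20 and C = 10 the locked split is 6 positive and 4 negative genes; with rounding down the two parts may sum to slightly less than C.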
This method is exactly the same as the [[Select Genes|Gene Selection]] method. Instead of manual selection of genes, we have the ability to select an arbitrary number of [[groups|Grouping]] via the same dialog box and the same options.
Each gene that has samples in class //a// and in class //b// is ranked according to the formula:\n[img[Signal To Noise Ratio|SignalToNoise.jpg]]\nwhere //μ~~a~~//, //μ~~b~~// are the mean values of the expression values of class //a// and class //b// respectively, and //σ~~a~~//, //σ~~b~~// are the standard deviations of the expression values of class //a// and class //b// respectively. Intuitively, this formula calculates how ‘concentrated’ the expression values are within the two classes.
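Assuming the usual form of the signal-to-noise ratio, (μ~~a~~ - μ~~b~~)/(σ~~a~~ + σ~~b~~), the ranking of one gene can be computed as:

```python
from statistics import mean, pstdev

def signal_to_noise(class_a, class_b):
    """Signal-to-noise ratio for one gene: difference of the class means
    over the sum of the (population) class standard deviations."""
    return (mean(class_a) - mean(class_b)) / (pstdev(class_a) + pstdev(class_b))
```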
A Machine Learning Tool for Mining Gene Expression profiles
~MineGene
By [["Supervised" learning|http://en.wikipedia.org/wiki/Supervised_learning]] we mean that we use a train sample in order to build a model (through learning) of our dataset and then test our model on an independent test dataset. A supervised learning procedure through ~MineGene contains 5 discrete steps, where the user should select and customize the algorithm that each step should follow. These steps are:\n# [[Filtering]]\n# [[Ranking]]\n# [[Grouping]]\n# [[Gene Selection]]\n# [[Prediction]]\n[img[Methods for Supervised Learning|SupervisedMethods.jpg]]\nA user can select the desired combination of algorithms from the combo boxes shown above. After that, the user has to press the "Gene Selection" button.\n
MineGene
By [["Unsupervised" learning|http://en.wikipedia.org/wiki/Unsupervised_learning]] or clustering we mean the procedure of partitioning the genes into groups (clusters) that share common expression values. The clustering procedure includes only two steps: the [[Filtering]] of the initial data and the choice of the clustering algorithm. We provide the following clustering algorithms:\n* [[GTC-MSP Clustering]] or ~GraphoTheoretical Clustering based on Minimum Spanning Trees\n* [[KMeans Clustering]]\n* [[Discrete KMeans Clustering]]\n\nThe selection of the clustering method is made through the relevant combo box:\n[img[Clustering Combo|Clustering.jpg]]\nAfter the selection of the clustering algorithm, the user has to press the "Clustering" button.
Through validation, we estimate the learning ability of our method. This estimation is made on train data. The available validation methods are:\n* [[Leave One Out Cross Validation (LOOCV)]] \n* [[K-Fold Validation]]\n* [[ROC Curve Validation]]\nIn order to validate our data we have to press the "Validation" button, after which the following dialog box appears:\n[img[Validation Dialog Box|Validation.jpg]]\nWe can select only one validation method by checking one of the radio buttons.
As in [[Wilcoxon Rank-Sum Test Filtering]], we apply the [[Wilcoxon rank-sum test|http://en.wikipedia.org/wiki/Mann-Whitney_U]] and estimate the probability that genes are significantly regulated between the two sample classes. Namely, the ranking value is 1-p, where p is the p-value estimated by the Wilcoxon ~Rank-Sum Test.
We can perform a [[hypothesis testing|http://en.wikipedia.org/wiki/Hypothesis_testing]] procedure to test the [[null hypothesis|http://en.wikipedia.org/wiki/Null_hypothesis]] that a gene is not significantly regulated between the two sample classes. This can be done via the [[Wilcoxon rank-sum test|http://en.wikipedia.org/wiki/Mann-Whitney_U]]. The user has to specify the maximum p value. Usually a value close to 0.05 or less yields satisfactory results.\n[img[Wilcoxon|Wilcoxon.jpg]]\n
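For illustration, a rank-sum p-value for one gene can be computed with the standard normal approximation. This is a sketch, not the tool's implementation: ties receive average ranks and no continuity correction is applied.

```python
import math

def ranksum_pvalue(xs, ys):
    """Two-sided Wilcoxon rank-sum p-value via the normal approximation.

    `xs`, `ys` are one gene's expression values in the two classes."""
    combined = sorted((v, 0 if i < len(xs) else 1)
                      for i, v in enumerate(list(xs) + list(ys)))
    # assign average ranks to tied values (ranks are 1-based)
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg = (i + j + 1) / 2
        for k in range(i, j):
            ranks[k] = avg
        i = j
    w = sum(r for r, (_, grp) in zip(ranks, combined) if grp == 0)
    n1, n2 = len(xs), len(ys)
    mu = n1 * (n1 + n2 + 1) / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mu) / sigma
    return math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value
```

Genes whose p-value exceeds the user-specified maximum would be screened out.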
[[kantale|http://www.ics.forth.gr/bmi/kanterakis.html]] is the author of ~MineGene.