Expert Discovery

The ExpertDiscovery system applies an original knowledge discovery approach, Relational Data Mining [Scientific Discovery Web Site; Vityaev, 2006; Vityaev, Kovalerchuk, 2008; Vityaev, Kovalerchuk, 2004; Kovalerchuk, Vityaev, 2000]. The approach was used in the Discovery system, which has been successfully applied to particular problems in psychophysics, cancer diagnostics and securities rate prediction. The heart of the system is semantic probabilistic inference [Vityaev, 2006].

The idea behind the discovery of new knowledge is to increase the accuracy of hypotheses sequentially, so that at each step the hypotheses have a higher probability and a higher level of definition. The significance level of the results is also tested with statistical criteria.

The Discovery system implements semantic probabilistic inference and represents the discovered knowledge as sets of probabilistic laws, strongest probabilistic laws and maximally specific laws.

ExpertDiscovery is an adaptation of the Discovery system configured for knowledge discovery in sets of nucleotide sequences. In accordance with semantic probabilistic inference, the knowledge is discovered in the form of complex signals (CS) with specified parameters.

The ExpertDiscovery plugin in UGENE has the following advantages:

Cross-platform support
A unified system
1. Many algorithms within a single project give more possibilities than many separate, narrowly focused applications. This approach simplifies the user's work: all that is needed is to launch UGENE, which gives access to a wide range of algorithms, instead of launching several unrelated programs.
2. UGENE plugins share a unified interface and work logic, so a user who is already familiar with UGENE can master a new module faster. ExpertDiscovery thus uses the reliable interface and visualization solutions of UGENE (sequence view, annotation view, task manager, etc.).
3. Results can be extended and combined. For example, ExpertDiscovery markups can be the results of UGENE algorithms (SITECON, Weight Matrix, Query Designer, etc.).
4. Data formats. ExpertDiscovery can read sequences in any format supported by UGENE (FASTA, FASTQ, GenBank, GFF, EMBL, etc.).

More detailed information about ExpertDiscovery is given below:

UGENE extends the ExpertDiscovery CS library. In the integrated system, an elementary signal can be any signal from Table 1.

Table 1. The elementary signals of ExpertDiscovery

Figure 2. UGENE Query Designer algorithms

With the UGENE Query Designer we can create markups containing the results of different algorithms (SITECON, PWM, Repeat Finder) (Fig. 2). A full description of the algorithms and of how to launch them in the Query Designer can be found in the UGENE documentation [Unipro UGENE].

As for the hierarchical analysis of regulatory regions, UGENE lets us create sequence markups with elementary signals, which are then loaded into ExpertDiscovery for further analysis and for building more complex models of the regulatory regions. For example, markups are easy to create with the UGENE Workflow Designer by building a corresponding scheme.

Fig. 3 shows a scheme that produces a DNA sequence markup with TFBS recognized by the weight matrix method. The scheme contains an element that reads the sequence, an element that reads the weight matrix, a recognition element and an element that writes the result to a file. The resulting file contains annotations of the IRF binding sites in the sequences. The file can then be loaded into ExpertDiscovery as a markup with the IRF site as the elementary signal.

Figure 3. UGENE Workflow Designer scheme for creating markups with weight matrix method.

Note that any file with annotations in GenBank format can be loaded as a markup for ExpertDiscovery. For instance, a markup can be generated with a similar scheme that uses the SITECON method instead of the weight matrix method, by running an element search with the UGENE Query Designer, or with any other UGENE tool.

The main work cycle of the ExpertDiscovery program is the following (Fig. 4):

Figure 4. The main iteration of an ExpertDiscovery user's work.

Logically, ExpertDiscovery consists of two parts: the part that extracts CS (ED Signal Extractor) and the part that recognizes CS in sequences (ED Recognition).

The expert loads a positive set of sequences (Positive), which contain the regulatory object of interest, and a negative set (Negative), which does not contain the object. The system learns from these two sets. It is also necessary to set the parameters (Parameters) and to load markups of the sequences with the elementary signals that will be used to extract complex signals. The output of the algorithm is a set of CSs (Complex Signals). The user can then recognize any CS in the sequences of the control set (Control). The user sets the recognition bound and obtains the recognition data (Recognition Data) as an HTML report or a recognition profile.

ExpertDiscovery is integrated into UGENE as a plugin and is launched from the Tools menu in the main UGENE window.

Figure 5. The main ExpertDiscovery window.

Fig. 5 shows the main ExpertDiscovery window in UGENE. Functions such as document management, markup loading and signal extraction are available as buttons on the toolbar. The window is divided into three areas. The upper left area contains the hierarchical list of the elements of the system: sequences (positive, negative and control), markups and signals. The lower left area shows the properties of the selected element. The area on the right is the sequence view, where the selected signal is shown as an annotation of the sequences.

Loading data. To build a model of a regulatory region, the system requires a training set: a positive and a negative set of nucleotide sequences. The sequences of the positive set contain the region the expert is interested in. It can be a set of sequences that contain binding sites of a specific type, or the sequences of a specific group of genes.

Usually, sequences that do not contain the investigated signal (or set of signals) are used as the negative set. However, it is sometimes difficult to provide such a set before the computer analysis stage. For this reason the negative set can consist of so-called “random sequences”, which are generated automatically and preserve the symbol frequencies of the positive set.
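
One possible way to generate such random sequences is sketched below in Python. It is only an illustration: the function name random_negative_set is hypothetical, and sampling each symbol independently with the pooled frequencies of the positive set is an assumption, since the exact procedure used by ExpertDiscovery is not described here.

```python
import random
from collections import Counter


def random_negative_set(positive, seed=0):
    """Build one random sequence per positive sequence (same length),
    drawing symbols with the frequencies observed in the positive set.
    Assumption: symbols are sampled independently of each other."""
    rng = random.Random(seed)
    counts = Counter("".join(positive))
    symbols = sorted(counts)
    weights = [counts[s] for s in symbols]
    return [
        "".join(rng.choices(symbols, weights=weights, k=len(seq)))
        for seq in positive
    ]


positive = ["ACGTACGT", "AACCGGTT", "ACACACGT"]
negative = random_negative_set(positive)
```

A fixed seed keeps the generated set reproducible between runs.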

Figure 6. The dialog for loading a set of DNA sequences.

The dialog for loading a set of nucleotide sequences (Fig. 6) is opened with the “New ExpertDiscovery Document” button on the toolbar. Sequence files in any format supported by UGENE can be chosen. Next, a markup of the sequences with the elementary signals that will form the basis of the CS must be loaded. The user can choose the nucleotide markup or load any previously generated markup file. A markup is usually described by the locations in a sequence where a signal occurs and by the name of the family the signal belongs to.

Figure 7. The dialog for loading markups of sequences.

The dialog for loading markups of sequences (Fig. 7) is opened with the “Load Markup” button on the toolbar. If the “Append to Current Markup” flag is not checked, the old markup is deleted.

CS editing. To create a CS manually, use the popup menu of the “Complex signals” item in the “Items” project window. Grouping folders are also provided for convenience.

By definition, a CS is represented as a hierarchical tree in which the operations are nodes and the markup items or words are leaves.

When a CS is created and selected, its structure can be changed and its parameters can be viewed in the parameters area. The available node types are the “distance” operation (binary), the “repetition” operation, the “interval” operation, markup items and words. A CS is fully determined when all its leaves are terminal symbols, that is, words or markup items.
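
The tree structure described above can be sketched in Python as follows. All class and field names are hypothetical and chosen for illustration; they do not reproduce the ExpertDiscovery data model. A missing child (None) stands for a leaf that has not been filled in yet.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Word:                 # terminal: a literal word in the sequence
    text: str


@dataclass
class MarkupItem:           # terminal: a signal taken from a loaded markup
    family: str
    name: str


@dataclass
class Distance:             # binary operation
    left: Optional["Node"]
    right: Optional["Node"]
    min_dist: int
    max_dist: int
    ordered: bool = True


@dataclass
class Repetition:           # unary operation
    child: Optional["Node"]
    n_min: int
    n_max: int
    min_dist: int
    max_dist: int


@dataclass
class Interval:             # unary operation
    child: Optional["Node"]
    min_pos: int
    max_pos: int


Node = Union[Word, MarkupItem, Distance, Repetition, Interval]


def is_fully_determined(node):
    """True when every leaf of the tree is a word or a markup item."""
    if node is None:
        return False
    if isinstance(node, (Word, MarkupItem)):
        return True
    if isinstance(node, Distance):
        return is_fully_determined(node.left) and is_fully_determined(node.right)
    return is_fully_determined(node.child)


cs = Distance(Word("TATA"), None, min_dist=0, max_dist=10)
incomplete = is_fully_determined(cs)           # the right leaf is still empty
cs.right = MarkupItem(family="TFBS", name="IRF")
complete = is_fully_determined(cs)
```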

Creating CS automatically. Using the training set (the positive and negative sets and the markups), the system can construct the structure of a regulatory region as a CS. The extraction wizard is launched with the “Extract signals” button on the toolbar.

Figure 8. CS extraction parameters dialog.

The extraction parameters (see below) are set in the first dialog window (Fig. 8). The following windows are used to set the operations that will be the nodes of the CS and to choose a folder in which to store the CS.

To see the location of a CS in a sequence, first pick the sequences to display with the popup menu of the sequence. Any CS can then be chosen, and it is shown as auto-annotations on each displayed sequence. It is also possible to view several signals on a sequence at once: to do this, the user marks the signals for group display with the popup menu. The same operation is used to choose signals for recognition.

CS recognition on a sequence. After the CSs have been extracted automatically, they can be recognized in any sequence. A set of such sequences can be loaded as the control set.

For recognition, a set of CSs is chosen and each of the signals is applied to the sequence. A score of -log(1 - P) is added to each symbol of the sequence where a CS occurs, where P is the conditional probability of the signal. The score of the sequence is the total score of all its symbols. The sequence is considered recognized when it contains a selected CS and its total score is higher than the recognition bound. The expert can choose the recognition bound using the training set. The bound is chosen in the dialog opened with the “Set recognition bound” button on the toolbar. The dialog shows the type I and type II errors to help choose the value.
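
The scoring rule can be illustrated with a short Python sketch. The function names and the representation of an occurrence as a (start, end, P) triple are hypothetical. The sketch assumes that "a symbol where a CS occurs" means every position covered by the occurrence, that the natural logarithm is used, and that P is strictly less than 1.

```python
import math


def sequence_score(length, hits):
    """hits: list of (start, end, p), end exclusive, where p is the
    conditional probability of the CS that produced the occurrence.
    Assumption: every covered symbol receives -log(1 - p)."""
    per_symbol = [0.0] * length
    for start, end, p in hits:
        weight = -math.log(1.0 - p)     # requires p < 1
        for i in range(start, end):
            per_symbol[i] += weight
    return sum(per_symbol)


def is_recognized(length, hits, bound):
    """Recognized: at least one selected CS occurs and the total score
    is higher than the recognition bound."""
    return bool(hits) and sequence_score(length, hits) > bound


hits = [(2, 5, 0.9)]                    # one CS covering 3 symbols, P = 0.9
score = sequence_score(10, hits)
```

Signals with P close to 1 contribute large weights, so a few reliable signals can outweigh many weak ones.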

Also, for convenience, an HTML recognition report can be generated. The report includes statistical parameters and a recognition result for each sequence.

The following operations can be applied to CS:

Distance between signals. The input CSs are s1 and s2. The distance between them is specified to lie between min and max, and the order is taken into account. The resulting CS is found at a position if s1 is found at that position and s2 is found at a distance from min to max from it. If the order is unimportant, s2 may also be found before s1. The min and max parameters are specified by the expert.

Repetition of a signal. The resulting CS is a repetition of the input signal s from N_min to N_max times, where the distance between neighboring repetitions varies from min to max. N_min, N_max, min and max are specified by the expert.
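
A minimal Python sketch of the “distance” operation follows. The function name and the half-open (start, end) representation of occurrences are assumptions made for illustration, and the distance is measured here from the end of the first signal to the beginning of the second.

```python
def distance_hits(s1, s2, min_dist, max_dist, ordered=True):
    """s1, s2: lists of (start, end) occurrences, end exclusive.
    Returns the spans of the combined CS."""
    result = set()
    for a_start, a_end in s1:
        for b_start, b_end in s2:
            if min_dist <= b_start - a_end <= max_dist:
                result.add((a_start, b_end))        # s1 followed by s2
            elif not ordered and min_dist <= a_start - b_end <= max_dist:
                result.add((b_start, a_end))        # s2 followed by s1
    return sorted(result)


s1 = [(0, 4)]
s2 = [(10, 14), (30, 34)]
found = distance_hits(s1, s2, min_dist=5, max_dist=10)
```

With the order taken into account, swapping the arguments yields no result for this data, while ordered=False finds the same span again.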

Belonging of a signal to an interval. The input CS must lie in the interval from min to max, where min and max are absolute distances from the first symbol of the sequence. The operation makes sense only for an aligned set. The min and max parameters are specified by the expert.

The distance between a pair of CSs can be determined in any of the following ways:

from the end of the first signal to the beginning of the second;
from the beginning of the first signal to the beginning of the second;
from the middle of the first signal to the beginning of the second.

The way the distance is determined is a parameter of the corresponding operation and is specified by the expert.
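
The three ways of measuring the distance can be compared with a small Python sketch. The function name and mode names are hypothetical; occurrences are assumed to be half-open (start, end) pairs, and the middle of a signal is assumed to be rounded down to an integer position.

```python
def pair_distance(first, second, mode="end_to_start"):
    """Distance between two occurrences given as (start, end), end exclusive."""
    f_start, f_end = first
    s_start, _ = second
    if mode == "end_to_start":
        return s_start - f_end
    if mode == "start_to_start":
        return s_start - f_start
    if mode == "middle_to_start":
        return s_start - (f_start + f_end) // 2
    raise ValueError(f"unknown mode: {mode}")


first, second = (0, 4), (10, 14)
distances = {
    mode: pair_distance(first, second, mode)
    for mode in ("end_to_start", "start_to_start", "middle_to_start")
}
```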

By specifying the parameters of the operations, the expert defines the set of operations SetO that can be used to create CSs as hypotheses, as well as the set SetCS of all CSs that are to be tested by the expert or extracted automatically.

The user specifies the set of operations SetO that will be applied to CSs, thereby defining the expert hypotheses as CSs and sequentially increasing their accuracy. The CS selection parameters must also be specified.

In the first step of the algorithm, the elementary signals are taken as the initial population of signals. The following steps increase the accuracy of the signals in the population. The following procedure is performed to improve the quality of a CS:

Choose one of the elementary signals T of the current CS;
Choose an operation O from the set of operations SetO and substitute T with O applied to some other elementary signals;
Test the resulting CS against the selection criteria (see below):
1. If they are satisfied, the CS is written to the resulting set ResCS.
2. Otherwise, the CS is tested against the branching criteria (see below). If they are satisfied, the signal is transferred to the next population.
3. If neither set of criteria is satisfied, the CS is discarded.

Then the next signal in the population is taken. When there are no more signals in the current population, the algorithm moves on to the next population. The cycle continues until the population is empty. The result is the set of CSs ResCS. Note that each resulting CS is more significant and more probable than each of its sub-signals.
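
The population loop can be sketched in Python as follows. This is an illustration under stated assumptions, not the ExpertDiscovery implementation: a CS is modelled as a string (elementary signal) or a nested tuple (operation, left, right); every operation is treated as binary; substituting T is interpreted as replacing it with O(T, E) for another elementary signal E; and the selection and branching criteria are passed in as hypothetical callbacks.

```python
from itertools import product


def refinements(cs, operations, elementary):
    """Yield every CS obtained by replacing one leaf T with (op, T, E)."""
    if isinstance(cs, str):
        for op, other in product(operations, elementary):
            yield (op, cs, other)
    else:
        op, left, right = cs
        for new_left in refinements(left, operations, elementary):
            yield (op, new_left, right)
        for new_right in refinements(right, operations, elementary):
            yield (op, left, new_right)


def extract(elementary, operations, selects, branches):
    """Run populations until none is left; return the selected CSs (ResCS)."""
    population, result = list(elementary), []
    while population:
        next_population = []
        for cs in population:
            for candidate in refinements(cs, operations, elementary):
                if selects(candidate):
                    result.append(candidate)
                elif branches(candidate):
                    next_population.append(candidate)
                # otherwise the candidate is discarded
        population = next_population
    return result


def complexity(cs):
    """Number of operations contained in the CS."""
    return 0 if isinstance(cs, str) else 1 + complexity(cs[1]) + complexity(cs[2])


# Toy criteria: select CSs with exactly two operations, branch on smaller ones.
found = extract(
    elementary=["A", "B"],
    operations=["distance"],
    selects=lambda cs: complexity(cs) == 2,
    branches=lambda cs: complexity(cs) < 2,
)
```

The loop terminates only if the branching criterion eventually rejects candidates, which is the role a maximal complexity limit plays in this toy example.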

Two sets are required to test a CS: a positive set (the YES set) and a negative set (the NO set). The YES set contains sequences that are known in advance to contain certain signals. The sequences of the NO set are known in advance not to contain the signals, or they can be generated randomly; they are needed to test the statistical parameters of the CS.

ExpertDiscovery uses the following selection criteria for CSs:

A conditional probability threshold: the minimal value of the conditional probability that a signal must have. It is also checked that the signal is more probable than the previous sub-signal;
A threshold for statistical significance by the Fisher criterion, which is needed to check statements 3 and 4 of semantic probabilistic inference;
If minimization of the significance level by the Fisher criterion is set, it is checked that the signal is more significant than the previous sub-signal;
A threshold for statistical significance by the Yule criterion [Yule, 1900];
A positive set coverage threshold;
Uniqueness check. Signals with a similar structure can be found at different steps. It is possible to choose between saving all the signals and saving unique signals only.

The branching criteria are the following:

A conditional probability threshold. It is also checked that the signal resulting from branching is more probable than the initial one;
A threshold for statistical significance by the Fisher criterion;
If minimization of the significance level by the Fisher criterion is set, it is checked that the signal resulting from branching is more significant than the initial one;
Minimal complexity (number of contained operations) of the CS;
Maximal complexity of the CS;
A condition on the correlation of the arguments of the “distance” operation of the CS.

The selection and branching criteria use the following terms:

Conditional probability P that the CS belongs to the YES set.

P = a₁₁/(a₁₀ + a₁₁),

where:

a₁₁ is the total number of occurrences of the CS in the YES set,

a₁₀ is the total number of occurrences of the CS in the NO set.
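
As a worked example of the formula, a CS that occurs 30 times in the YES set and 10 times in the NO set has P = 30/(10 + 30) = 0.75. A minimal Python sketch (the function name is hypothetical):

```python
def conditional_probability(a11, a10):
    """P = a11 / (a10 + a11): the share of CS occurrences in the YES set."""
    total = a10 + a11
    if total == 0:
        raise ValueError("the CS occurs in neither set, so P is undefined")
    return a11 / total


p = conditional_probability(a11=30, a10=10)
```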

Statistical significance by the Fisher criterion (Fisher's exact test for contingency tables [Kendel, 1973]). Four values are used to calculate the significance level (f):

t₀₀ is the number of negative sequences that contain occurrences of the signal;

t₀₁ is the total number of occurrences of the CS in the YES set;

t₁₀ is the total number of occurrences of the CS in the NO set;

t₁₁ is the number of positive sequences that do not contain occurrences of the signal.

f = (t₀₀+t₀₁)! (t₁₀+t₁₁)! (t₀₀+t₁₀)! (t₀₁+t₁₁)! / ( (t₀₀+t₀₁+t₁₀+t₁₁)! t₀₀! t₀₁! t₁₀! t₁₁! )
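
The value f can be computed with the Python sketch below. It uses the standard hypergeometric form for a 2×2 contingency table, in which the numerator is the product of the factorials of the two row totals (t₀₀+t₀₁, t₁₀+t₁₁) and the two column totals (t₀₀+t₁₀, t₀₁+t₁₁). The function name is hypothetical, and the sketch gives the probability of a single table only; whether ExpertDiscovery additionally sums over more extreme tables is not stated here.

```python
from math import factorial


def table_probability(t00, t01, t10, t11):
    """Hypergeometric probability of one 2x2 contingency table."""
    n = t00 + t01 + t10 + t11
    numerator = (
        factorial(t00 + t01) * factorial(t10 + t11)      # row totals
        * factorial(t00 + t10) * factorial(t01 + t11)    # column totals
    )
    denominator = (
        factorial(n)
        * factorial(t00) * factorial(t01)
        * factorial(t10) * factorial(t11)
    )
    return numerator / denominator


f = table_probability(3, 1, 1, 3)       # equals 16/70
```

Python integers have arbitrary precision, so the factorials are exact and only the final division is rounded.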

Statistical significance by the Yule criterion [Yule, 1900];
Positive set coverage in percent (the share of positive set sequences that contain the signal);
Negative set coverage in percent (the share of negative set sequences that contain the signal).

For the “distance” operation, the level of correlation between its arguments is evaluated.

UGENE and EXPERT DISCOVERY INTEGRATED SYSTEM

The integrated system is a powerful tool for the hierarchical analysis of regulatory regions. By generating markups with different methods, we enable the system to perform recognition at the higher levels of the hierarchy, which is impossible with other programs. We can thus extract and investigate a model of a complex regulatory region. All the functionality is accessible within a single program.

UGENE uses a plugin system: each independent module is a plugin that the user can switch on or off, and the plugins can interact with each other.

The ExpertDiscovery algorithms fit the UGENE concept well, which is why it was decided to integrate them into UGENE as a plugin that reproduces and extends the capabilities of ExpertDiscovery.
