## Cas.uah.edu

Outline: Summary, Separations, Discretization, Important Factors, Explanations, Examples, References
Numerical records containing values of attributes.

Records of usage of equipment and wear results.

Records of characteristics of mortgage borrowers and results of defaults.

Records of financial data and stock price performance.

Records of amniotic fluid analysis for proteins during 14th week of pregnancy and preterm/normal delivery results.

Patient records covering anamnesis, lab data, histology, treatment, results.

Records of neurological tests and time to onset of Alzheimer’s disease.

The desired explanations are found and validated by a new method called EXARP in four steps.

The process typically works with as few as 40 to 60 records, even when the records contain a large number of attributes.

1. Use one half of the numerical data sets to derive equivalent logic data sets. The method involves pattern analysis.

2. Compute importance values for the attributes. Select the most important attributes and restrict the logic data to these attributes.

3. Compute explanations for the restricted logic data.

4. Validate the explanations using the second half of the numerical data.

In each step of EXARP, an Alternate Random Process (ARP) is postulated whose goal is to disrupt or mislead the computing procedure by random actions.

The steps are so designed that the negative effects of the ARPs are mitigated or even eliminated.

This is quite different from direct use of probabilistic or uncertainty models, say in Bayesian networks or naive Bayes classifiers. There, the model is directly employed for the desired results.

Second Idea: Early Termination (Giving Up)
If we discover that an alternate random process has distorted the data so much that explanations very likely cannot be derived, then we stop with that conclusion.

Training sets A and B consist of records of length n. The kth entry is the value for attribute k. Entries may be True, False, or Unavailable.

The value Unavailable means that one cannot get a True/False value.

Note: Generally, another value is possible: Absent. In that case, one does not know the value, but can obtain it. In this talk, we will not make use of that option.

Find a logic formula that is True on A and False on B, or show that this cannot be done. The formula separates A from B.

The sets A and B usually are taken from two populations 𝒜 and ℬ. A separating formula for A and B may then be used to predict whether a record is in 𝒜 or ℬ.

There are effective methods to find separating formulas. In our approach, the formulas are in disjunctive normal form (DNF), and we construct them by a recursive process.

Example: (x ∧ ¬y ∧ z) ∨ (¬x ∧ v) ∨ (y ∧ w)
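
To make the separation condition concrete, here is a minimal sketch that evaluates the example DNF formula on records whose entries are True, False, or Unavailable. The representation (Python `None` for Unavailable) and the three-valued evaluation rule are assumptions made for illustration; the slides do not say how Unavailable entries propagate through a formula. The names `eval_dnf`, `separates`, and `FORMULA` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Value = Optional[bool]        # True, False, or None standing in for Unavailable
Literal = Tuple[str, bool]    # (attribute name, value the literal requires)
Clause = List[Literal]        # conjunction of literals
DNF = List[Clause]            # disjunction of clauses

def eval_literal(record: Dict[str, Value], lit: Literal) -> Value:
    name, wanted = lit
    v = record.get(name)      # a missing attribute is treated as Unavailable
    if v is None:
        return None
    return v == wanted

def eval_clause(record: Dict[str, Value], clause: Clause) -> Value:
    vals = [eval_literal(record, lit) for lit in clause]
    if any(v is False for v in vals):
        return False          # one false literal falsifies the conjunction
    if any(v is None for v in vals):
        return None           # assumption: otherwise Unavailable propagates
    return True

def eval_dnf(record: Dict[str, Value], formula: DNF) -> Value:
    vals = [eval_clause(record, c) for c in formula]
    if any(v is True for v in vals):
        return True           # one true clause satisfies the disjunction
    if any(v is None for v in vals):
        return None
    return False

# (x AND NOT y AND z) OR (NOT x AND v) OR (y AND w)
FORMULA: DNF = [
    [("x", True), ("y", False), ("z", True)],
    [("x", False), ("v", True)],
    [("y", True), ("w", True)],
]

def separates(formula: DNF, A: List[Dict[str, Value]], B: List[Dict[str, Value]]) -> bool:
    """True if the formula is True on every record of A and False on every record of B."""
    return (all(eval_dnf(r, formula) is True for r in A)
            and all(eval_dnf(r, formula) is False for r in B))
```

Under this convention a record with an Unavailable entry can leave the formula undetermined, in which case it counts as neither True nor False for the separation test.
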
Goal: Logic data sets A and B representing A and B.

Most popular method: Entropy plus Minimum Description Length Principle (MDLP). The principle generally selects the hypothesis that minimizes the description length of the hypothesis plus the description length of the data given the hypothesis.
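
For comparison, the entropy half of that popular method can be sketched as follows. This is a simplified illustration only: it picks the single cutpoint that minimizes the weighted class entropy of the two sides and omits the MDLP stopping criterion entirely. The function names are hypothetical, and the A/B labels in any usage are made up since the slides do not give them.

```python
import math
from typing import List, Optional, Tuple

def entropy(labels: List[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for lab in set(labels):
        p = labels.count(lab) / n
        h -= p * math.log2(p)
    return h

def best_entropy_cut(values: List[float], labels: List[str]) -> Optional[Tuple[float, float]]:
    """Return (cutpoint, weighted entropy) of the best binary split, or None."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best: Optional[Tuple[float, float]] = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut possible between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or h < best[1]:
            best = (cut, h)
    return best
```
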

Here, we apply a new method of pattern analysis that supports several essential enhancements.

1. Sort the numerical entries of the attribute. Label each entry of the sorted list by “A” (resp. “B”) if coming from a record of A (resp. B). The result is a label sequence.

Example of sorted entries: 10.8 3.7 2.9 1.7 0.5 −1.0 −3.5 −11.9

3. Find an interval where the sequence switches from mostly [...] important the interval. Put a cutpoint c into the middle of the interval. Details are given in a moment.

Logic attribute y_c: if x ≤ c, then y_c = False
1. In the label sequence, replace each A by 1 and each B by 0.

2. Smooth the sequence of 0s and 1s by Gaussian convolution.

The variance of Gaussian convolution is determined by evaluation of an ARP that randomly constructs consecutive As or Bs to mimic an important switch. Goal is smoothing so that the effect of the ARP is mostly eliminated.

3. Select cutpoint where the smoothed data change by maximum amount.
4. If attribute values may change randomly: Evaluate a second ARP to estimate the size of such random variations and to define an interval of uncertainty around the cutpoint to eliminate or at least reduce such random variations.
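
The steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the EXARP implementation: the smoothing width `sigma` and the uncertainty `half_width` are plain parameters here, whereas the talk derives them from ARPs; edge handling of the convolution is a choice made for the sketch; and mapping x > c to True is assumed, since the slides only state that x ≤ c gives False. All function names are hypothetical.

```python
import math
from typing import List, Optional

def gaussian_smooth(seq: List[float], sigma: float) -> List[float]:
    """Smooth a 0/1 sequence with a truncated Gaussian kernel, renormalized at the edges."""
    radius = max(1, int(3 * sigma))
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-0.5 * (d / sigma) ** 2) for d in offsets]
    n = len(seq)
    out = []
    for i in range(n):
        num = den = 0.0
        for d, w in zip(offsets, weights):
            j = i + d
            if 0 <= j < n:
                num += w * seq[j]
                den += w
        out.append(num / den)
    return out

def find_cutpoint(values: List[float], labels: List[str], sigma: float = 1.0) -> float:
    """Sort entries, build the label sequence, smooth it, and cut at the largest change."""
    pairs = sorted(zip(values, labels), reverse=True)       # descending, as on the slide
    xs = [v for v, _ in pairs]
    bits = [1.0 if lab == "A" else 0.0 for _, lab in pairs]  # A -> 1, B -> 0
    smooth = gaussian_smooth(bits, sigma)
    k = max(range(len(xs) - 1), key=lambda i: abs(smooth[i + 1] - smooth[i]))
    return (xs[k] + xs[k + 1]) / 2                           # middle of the switching interval

def logic_value(x: float, c: float, half_width: float = 0.0) -> Optional[bool]:
    """Logic attribute y_c; None stands for Unavailable inside the interval of uncertainty."""
    if half_width > 0 and abs(x - c) <= half_width:
        return None
    return x > c   # x <= c gives False as on the slide; True for x > c is assumed
```

With the slide's eight sorted entries and a hypothetical labeling of four As followed by four Bs, the cutpoint falls midway between 1.7 and 0.5.
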

Extension: Discretization via Uncertain Regions
Recursive process derives discretization of regions in n-dimensional space, for any n ≥ 2.

Discussion via example of cancer data.

Cervical Cancer: FIGO I-III versus Recurrence
Derive factors explaining difference between FIGO I-III and Recurrence, using lab data.

1. Partition the data into FIGO I-III cases and Recurrence cases.
2. Discretize the two data sets, getting sets A for FIGO I-III and B for Recurrence.
3. Compute 20 separating logic formulas with short clauses.

One half of the 20 formulas evaluates to True on A and to False on B. The other half evaluates to the opposite values.

4. Use an ARP to find a significance value for each literal of the formulas. The value can be viewed as a probability estimate that the literal is important for explaining the difference between the records of A versus those of B.

5. Declare each attribute for which at least one of its literals has significance value ≥ 0.6 to be important. That way, we get the important factors. If just one or two important factors are selected, lower the threshold to 0.55, and repeat the selection.
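
The selection rule of step 5 can be written down directly. In this sketch the significance values are taken as given (the ARP computation that produces them is not shown), and the dictionary layout keyed by (attribute, literal) as well as the function name are assumptions for illustration.

```python
from typing import Dict, List, Tuple

def important_factors(
    literal_significance: Dict[Tuple[str, str], float],
    threshold: float = 0.6,
    lowered_threshold: float = 0.55,
) -> List[str]:
    """Attributes with at least one literal whose significance reaches the threshold."""
    def select(t: float) -> List[str]:
        return sorted({attr for (attr, _lit), s in literal_significance.items() if s >= t})

    chosen = select(threshold)
    # As stated in step 5: if just one or two factors are selected,
    # lower the threshold to 0.55 and repeat the selection.
    if 1 <= len(chosen) <= 2:
        chosen = select(lowered_threshold)
    return chosen
```

For example, with hypothetical values 0.9 and 0.4 for two ENDOSTATIN literals, 0.58 for an M2PK PLASMA literal, and 0.3 for a VEGFD literal, only ENDOSTATIN passes 0.6, so the threshold drops to 0.55 and M2PK PLASMA is added.
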

Suppose we have a continuous goal such as time to progression (TTP). We want to find out explanations for short versus long TTP.

Carry out the above process parametrically, by varying a TTP threshold interval. Obtain significance values for each TTP threshold interval and for each attribute. These values define for each attribute an importance function that maps the threshold intervals to the significance values.

Pick threshold intervals for which the sum of the importance functions is large. Find explanations for these threshold intervals.
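
A minimal sketch of the last selection, assuming each importance function is stored as a dictionary from threshold interval to significance value. The data layout and the function name are hypothetical, and what counts as a "large" sum is left to the caller.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]   # (lower, upper) TTP threshold interval

def rank_threshold_intervals(
    importance: Dict[str, Dict[Interval, float]]
) -> List[Tuple[Interval, float]]:
    """Sum the importance functions of all attributes per interval, largest sum first."""
    totals: Dict[Interval, float] = {}
    for func in importance.values():
        for interval, significance in func.items():
            totals[interval] = totals.get(interval, 0.0) + significance
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```
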

Discussion uses example of FIGO I-III versus Recurrence.

1. Delete from the data all attributes except for the important factors determined in the previous step.

2. Discretize the sets of FIGO I-III and Recurrence records, getting sets A and B.
3. Compute a logic formula that is True on A and False on B, and a second formula that is False on A and True on B.

The logic formulas, combined with the discretization information, constitute the desired explanations.

1. Two explanations are consistent if they cannot simultaneously [...]
2. Two explanations are stable if small random changes of attributes that are subject to such changes do not turn one conclusion into its opposite.

Note: Under stability, small random changes are allowed to turn True or False into Undecided or vice versa.

We define a statistic with 0/1 value via the explanations obtained in the previous step.

Hypothesis H0: The explanations produce accurate predictions.

Hypothesis H1: The explanations do not produce accurate predictions. Indeed, with some luck, the same accuracy can be achieved by flipping an unbiased coin, which statistically is a Bernoulli trial with α = 0.5.

Find Explanations and Establish Significance
1. Split the given data into a training set and a testing set.

2. Obtain explanations from the training data using the earlier steps.
3. Apply the explanations to the testing data and determine how often the explanations are correct/incorrect.

4. Compare the outcome of Step 3 with results of Bernoulli trials.
5. Use direct computation or approximation by normal distribution to obtain probability p that Bernoulli trials obtain the same results or better. If p is very small, then accept H0. Otherwise, accept H1.
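
Step 5 can be sketched with both routes it mentions. The slides do not say which computation, or which continuity convention, produced the p-values reported in the examples, so results from this sketch need not match those figures exactly. The function names are hypothetical.

```python
from math import comb, erfc, sqrt

def binomial_tail(correct: int, total: int, alpha: float = 0.5) -> float:
    """Direct computation: probability that `total` Bernoulli trials with success
    probability `alpha` give at least `correct` successes."""
    return sum(
        comb(total, k) * alpha**k * (1 - alpha) ** (total - k)
        for k in range(correct, total + 1)
    )

def normal_tail(correct: int, total: int, alpha: float = 0.5) -> float:
    """Normal approximation of the same tail, without continuity correction (an assumption)."""
    mean = total * alpha
    sd = sqrt(total * alpha * (1 - alpha))
    z = (correct - mean) / sd
    return 0.5 * erfc(z / sqrt(2))   # upper tail of the standard normal at z
```

Calling `binomial_tail(21, 26)` gives the chance that fair coin flips match or beat 21 correct predictions out of 26 test cases; a very small value supports H0.
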

Cervical Cancer: Difference Between FIGO I-III and Recurrence
The data set was supplied by the Frauenklinik, Charité
Note: At present, treatments cannot utilize this information.

We include it here to demonstrate validation.

57 patients (31 for training, 26 for testing)
31 training cases: 19 FIGO I-III, 12 Recurrence
26 testing cases: 14 FIGO I-III, 12 Recurrence
Note: FIGO IV excluded since too few cases.

Explanations by C4.5 Decision Tree Method
[124 < ENDOSTATIN ≤ 156 and VEGFD ≤ 254],
[ENDOSTATIN > 124 and VEGFD > 254],
The two explanations are consistent but not stable.

19 of the 26 test cases are predicted correctly.

Accuracy = 73%.

Significance of the explanations is p < 0.18, which causes rejection of the explanations.

If [ENDOSTATIN ≥ 129.0], then Recurrence case.

The two explanations are consistent but not stable.

21 of the 26 test cases are predicted correctly.

Accuracy = 81%.

Significance of the explanations is p < 0.002.

If ENDOSTATIN < 123.0 or M2PK PLASMA < 18.8, then FIGO I-III case.

If ENDOSTATIN > 136.0 and M2PK PLASMA > 21.8, then Recurrence case.

The two explanations are consistent and stable.
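
The two rules above translate directly into code, which makes the gaps between their cutpoints visible. Returning "Undecided" when neither rule fires is an assumption that borrows the Undecided outcome from the note on stability; the function name is hypothetical.

```python
def classify_cervical(endostatin: float, m2pk_plasma: float) -> str:
    """Apply the two explanations for FIGO I-III versus Recurrence."""
    figo = endostatin < 123.0 or m2pk_plasma < 18.8
    recurrence = endostatin > 136.0 and m2pk_plasma > 21.8
    # The rules cannot both fire: Recurrence needs ENDOSTATIN > 136.0 and
    # M2PK PLASMA > 21.8, which rules out both FIGO I-III conditions.
    if figo:
        return "FIGO I-III"
    if recurrence:
        return "Recurrence"
    return "Undecided"   # assumption: neither explanation applies
```

Because the thresholds leave the ranges 123.0 to 136.0 and 18.8 to 21.8 between the two rules, a small change in a measurement near a cutpoint moves a case into the undecided region rather than to the opposite conclusion.
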

22 of the 26 test cases are predicted correctly.

Accuracy = 85%.

Significance of conclusion is p < 0.0002, which is almost the gold standard p < 0.0001.

The data set was supplied by the Frauenklinik, Universität
SURVZ (living at present)
HER2 (value of second test)
Thymidine Phosphorylase:
VEGF (Vascular Endothelial Growth Factor)
COX2 (Cyclooxygenase 2)
K18 (Keratin 18)
HAT AD MED (adjuvant hormone therapy: medication
HER2 STAT (HER2 status (2, 3, ?))
FISH STAT (FISH status (0, 1))
HISTO TYP (histological tumor type (1, 2, 3))
PT (tumor size (1, 2, 3, 4, ?))
PN (lymph node status (1, 2, 3, ?))
M (metastasis (0, 1, ?))
G (grading (1, 2, 3, ?))
REZ ER (estrogen receptor expression (0, 1, ?))
REZ PR (progesterone receptor expression (0, 1, ?))
Local recurrence and distant metastasis status:
AT LOK (local)
AT ABD (abdominal)
AT HEP (liver)
AT PUL (lungs)
AT ZNS (central nervous system)
AT PERI (heart)
AT PLEU (pleura)
AT ASCI (ascites)
AT LYM (lymphangiosis)
AT KNO (bone)
HT AD (hormone therapy)
HT PA (palliative hormone therapy)
CT AD (chemotherapy)
CT PA (palliative chemotherapy)
ST (radiation)
BT (bisphosphonates)
Age (years)
BEST RES (best response (1 = complete response, 2 = partial response,
TTP (time to progression (weeks))
SURV (survival time (weeks))
1. Which factors influence time to progression (TTP)?
2. Why do 3 patients have amazingly high TTP values?
Explanations by C4.5 Decision Tree Method
If [AT HEP > 0 and TP TISSUE > 4], then high TTP case.

These explanations are probably wrong since absence of metastases in the liver is claimed to result in not-high TTP.

The explanations are 92.9% correct on the training data. This is disturbing since there are just 14 cases in total, yet C4.5 does not explain all of them.

RIPPER declares each case to be not-high TTP. Needless to say, this is not useful.

If TP TISSUE ≥ 6, and if COX2 ≤ 2 or K18 ≥ 9, then high TTP case.

If TP TISSUE ≤ 4, or if COX2 ≥ 4 and K18 ≤ 8, then not-high TTP case.

The two explanations are consistent and only partially stable since K18 ≥ 9 and K18 ≤ 8 cause instability.

1. The explanations are in full agreement with current understanding of the actions/interactions of Xeloda and with other results of molecular biology.

2. The explanations are 100% correct on the training data.

In two situations, high TTP values are predicted:
If TP TISSUE ≥ 6 and [K18 ≥ 9 or COX2 ≤ 2]:
If TP TISSUE ≥ 6 and above condition does not apply:
Xeloda, Vioxx (to lower COX2), and Aspirin (to prevent thrombosis).

A clinical study has been started in Germany to test validity.

If correct, a substantial number of patients can be saved.

The data set was supplied by the Technical University, Munich.

457 records (230 for training, 227 for testing)
230 training cases: 79 early dementia, 151 late/no dementia
227 testing cases: 65 early dementia, 162 late/no dementia
Explanations by C4.5 Decision Tree Method
Two explanations with 6 and 9 rules each. The largest rule has 7 terms. The explanations seem contrived.

The two explanations are consistent but not stable.

191 of the 227 test cases are predicted correctly.

Accuracy = 84%.

Significance of the explanations is better than the gold standard p < 0.0001.

RIPPER produces two explanation sets.

If SKTSCORE ≥ 8, or if CDRWERT ≥ 1 and SKT8RW ≥ 8, or if SKT4RW ≥ 26 and SKT3RW ≤ 8 and SKT5RW ≥ 23, then near-term dementia.

If CDRWERT ≥ 1 and SKT9RW ≥ 2 and MMS ≤ 26, or if CDRWERT ≥ 1 and SKT9RW ≥ 1 and SKT3RW ≤ 8, or if MMS ≤ 22, or if HBEFIND2 = 2 and SKT2RW ≥ 7, or if SKT1RW ≥ 19 and GDSSCORE ≥ 4, then late/no dementia.

The first explanation set seems interesting, while the second one seems contrived. Hence, select the first explanation set.

This turns out to be a good choice (83% accuracy versus 75%).

The explanations of set 1 are consistent but not stable.

Explanation set 1 makes sense except for the term SKT3RW ≤ 8. It indicates, in part, that a good performance of reading numbers implies early dementia.

188 of the 227 test cases are predicted correctly.

Accuracy = 83%.

Significance of the explanations is better than the gold standard p < 0.0001.

If CDRWERT ≥ 1 and [SKT4RW ≥ 28 or PERSEV = 0], then near-term dementia.

If CDRWERT = 0 or [SKT4RW ≤ 25 and PERSEV ≥ 1], then late/no dementia.

The two explanations are consistent and stable.

An expert has commented that the explanations make sense except that the PERSEV terms are questionable.

180 of the 227 test cases are predicted correctly.

Accuracy = 79%.

Significance of conclusion is better than the gold standard p < 0.0001.

Riehl, K., and Truemper, K., “Construction of Explanations from Numerical Data,” in preparation.

Book “Data Mining and Knowledge Discovery Approaches Based on Rule Induction Techniques”, E. Triantaphyllou and G. Felici, eds., Springer Verlag, Berlin, 2006:
1. Bartnikowski, S., Granberry, M., Mugan, J., and Truemper, K., “Transformation of Rational and Set Data to Logic Data.”
2. Felici, G., Sun, F., and Truemper, K., “Learning Logic Formulas and Related Error Distributions.”

Truemper, K., “Design of Logic-based Intelligent Systems,” Wiley, 2004.

Source: http://cas.uah.edu/tsengf/workshop_lecture_1.pdf
