Guide to the Collagen Diseases Data Set
Domain
The database was collected at Chiba University Hospital. Each
patient came to the hospital's outpatient clinic for collagen
diseases on the recommendation of a home doctor or a general physician
at a local hospital.
Collagen diseases are autoimmune diseases: patients generate
antibodies that attack their own bodies. For example, if a patient
generates antibodies in the lungs, he/she will progressively lose
respiratory function and eventually die. The disease mechanisms
are only partially known and their classification is still fuzzy.
Some patients may generate many kinds of antibodies, and their
manifestations may include all the characteristics of collagen
diseases.
In collagen diseases, thrombosis is one of the most important
and severe complications and one of the major causes of death. Thrombosis
is an increased coagulation of blood that clogs blood vessels.
It usually lasts several hours and can recur over time. Thrombosis
can arise from different collagen diseases. It has been found
that this complication is closely related to anti-cardiolipin
antibodies. This was discovered by physicians, one of whom donated
the data sets for the discovery challenge.
Thrombosis must be treated as an emergency, so it is important
to detect and predict the possibility of its occurrence. However,
no such database analysis has yet been carried out by experts in immunology.
Domain experts are very much interested in discovering regularities
behind patients' observations.
Goals
- Search for patterns that detect and predict thrombosis.
- Search for temporal patterns specific/sensitive to thrombosis.
(The examination date is very close to the date of the thrombosis, so
specific/sensitive patterns found before/after the thrombosis
would be very useful.)
- Search for features that classify collagen diseases correctly.
- Search for temporal patterns specific/sensitive to each collagen
disease.
Domain experts told us that if useful patterns are discovered,
they would be acceptable for publication in major journals on rheumatology
(collagen diseases).
Evaluation Scheme
One of the domain experts, who is well known in rheumatology,
will attend the PKDD'99 conference and evaluate all the results. The
results will also be evaluated in the clinical environment in
the future.
Database
The database consists of three tables (TSUM_A.CSV, TSUM_B.CSV,
TSUM_C.CSV). The patients in these three tables are linked
by ID number.
Basic information about patients (input by doctors). This dataset
includes all patients (about 1000 records).
item | meaning | remark |
ID | identification of the patient | |
Sex | | |
Birthday | | YYYY/M/D |
Description date | the first date when a patient data was recorded | YY.MM.DD |
First date | the date when a patient came to the hospital | YY.MM.DD |
Admission | patient was admitted to the hospital (+) or followed at the outpatient clinic (-) | |
Diagnosis | disease names | multivalued attribute |
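The two date formats in this table can be read with a few lines of Python. This is only a minimal sketch: the function names and sample values are hypothetical, and it assumes that every two-digit year in the YY.MM.DD fields belongs to the 1900s, which the guide does not state explicitly.

```python
from datetime import date

def parse_birthday(text):
    """Parse a 'YYYY/M/D' string (month and day may lack leading zeros)."""
    year, month, day = (int(part) for part in text.strip().split("/"))
    return date(year, month, day)

def parse_short_date(text, century=1900):
    """Parse a 'YY.MM.DD' string.

    Assumption: all two-digit years fall in the given century (1900s).
    """
    yy, month, day = (int(part) for part in text.strip().split("."))
    return date(century + yy, month, day)

# Hypothetical sample values, not taken from the data set.
birthday = parse_birthday("1953/7/9")
first_visit = parse_short_date("94.02.14")
age_at_first_visit = (first_visit - birthday).days // 365  # rough age in years
```

Converting both fields to `date` objects makes it possible to compute intervals such as the approximate age at the first visit.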
Special laboratory examinations (input by doctors) (measured
by the Laboratory on Collagen Diseases). This dataset does not
include all the patients, but includes the patients with these
special tests.
item | meaning | remark |
ID | identification of the patient | |
Examination Date | date of the test | YYYY/MM/DD |
aCL IgG | anti-Cardiolipin antibody (IgG) concentration | |
aCL IgM | anti-Cardiolipin antibody (IgM) concentration | |
ANA | anti-nucleus antibody concentration | |
ANA Pattern | pattern observed in the sheet of ANA examination | |
aCL IgA | anti-Cardiolipin antibody (IgA) concentration | |
Diagnosis | disease names | multivalued attribute |
KCT | measure of degree of coagulation | |
RVVT | measure of degree of coagulation | |
LAC | measure of degree of coagulation | |
Symptoms | other symptoms observed | multivalued attribute |
Thrombosis | degree of thrombosis | 0: negative (no thrombosis); 1: positive (the most severe one); 2: positive (severe); 3: positive (mild) |
The examination date is very close to the date of the thrombosis. For negative
examples, these tests were performed when thrombosis was suspected.
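The Thrombosis codes are not ordered by severity (1 is the most severe, 3 is mild), so it may help to map them explicitly before modelling. The following is a sketch only; the dictionary and function names are hypothetical, and collapsing the codes into a binary label is one possible choice, not something the guide prescribes.

```python
# Code meanings copied from the Thrombosis row of the table above.
THROMBOSIS_DEGREE = {
    0: "negative (no thrombosis)",
    1: "positive (the most severe one)",
    2: "positive (severe)",
    3: "positive (mild)",
}

def is_thrombosis_positive(code):
    """Collapse the four-level code into a binary label (an assumed simplification)."""
    code = int(code)
    if code not in THROMBOSIS_DEGREE:
        raise ValueError(f"unexpected Thrombosis code: {code!r}")
    return code != 0

# Hypothetical raw cell values as they might be read from a CSV file.
labels = [is_thrombosis_positive(c) for c in ["0", "1", "3"]]
```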
Laboratory examinations stored in the Hospital Information System
(stored from 1980 to March 1999). All the data consist of ordinary
laboratory examinations and carry time stamps. The tests are
not necessarily connected to thrombosis.
item | meaning | normal range |
ID | identification of the patient | |
Date | date of the laboratory tests (YYMMDD) | |
GOT | AST, glutamic oxaloacetic transaminase | N < 60 |
GPT | ALT, glutamic pyruvic transaminase | N < 60 |
LDH | lactate dehydrogenase | N < 500 |
ALP | alkaline phosphatase | N < 300 |
TP | total protein | 6.0 < N < 8.5 |
ALB | albumin | 3.5 < N < 5.5 |
UA | uric acid | N > 8.0 (Male); N > 6.5 (Female) |
UN | urea nitrogen | N < 30 |
CRE | creatinine | N < 1.5 |
T-BIL | total bilirubin | N < 2.0 |
T-CHO | total cholesterol | N < 250 |
TG | triglyceride | N < 200 |
CPK | creatine phosphokinase | N < 250 |
GLU | blood glucose | N < 180 |
WBC | white blood cell | 3.5 < N < 9.0 |
RBC | red blood cell | 3.5 < N < 6.0 |
HGB | hemoglobin | 10 < N < 17 |
HCT | hematocrit | 29 < N < 52 |
PLT | platelet | 100 < N < 400 |
PT | prothrombin time | N < 14 |
Note | comment for the test PT | |
APTT | activated partial thromboplastin time | N < 45 |
FG | fibrinogen | 150 < N < 450 |
AT3 | marker of DIC, one of the most important complications of collagen diseases | 70 < N < 130 |
A2PI | marker of DIC | 70 < N < 130 |
U-PRO | proteinuria | 0 < N < 30 |
IGG | Ig G | 900 < N < 2000 |
IGA | Ig A | 80 < N < 500 |
IGM | Ig M | 40 < N < 400 |
CRP | C-reactive protein | N= -, +-, or N < 1.0 |
RA | Rheumatoid Factor | N= -, +- |
RF | RAHA | N < 20 |
C3 | complement 3 | N > 35 |
C4 | complement 4 | N > 10 |
RNP | anti-ribonuclear protein | N= -, +- |
SM | anti-SM | N= -, +- |
SCl70 | anti-scl70 | N= -, +- |
SSA | anti-SSA | N= -, +- |
SSB | anti-SSB | N= -, +- |
CENTROMEA | anti-centromere | N= -, +- |
DNA | anti-DNA | N < 8 |
DNA-II | anti-DNA | N < 8 |
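The normal ranges above can be turned into a simple lookup for flagging abnormal values. The sketch below covers only a handful of numeric items; the names `NORMAL_RANGE` and `in_normal_range` and the sample record are hypothetical, and treating the bounds as strict inequalities follows the notation of the table rather than any stated rule.

```python
# (lower, upper) bounds copied from the table; None means no bound is listed.
NORMAL_RANGE = {
    "GOT": (None, 60),
    "TP": (6.0, 8.5),
    "WBC": (3.5, 9.0),
    "PLT": (100, 400),
    "C3": (35, None),
}

def in_normal_range(item, value):
    """Return True if value lies strictly inside the listed normal range."""
    lower, upper = NORMAL_RANGE[item]
    if lower is not None and value <= lower:
        return False
    if upper is not None and value >= upper:
        return False
    return True

# Hypothetical laboratory record, not taken from the data set.
record = {"GOT": 34, "TP": 5.8, "WBC": 12.1, "PLT": 250, "C3": 80}
abnormal = sorted(item for item, value in record.items()
                  if not in_normal_range(item, value))
```

Items with qualitative ranges (such as "N= -, +-") need separate handling and are deliberately left out of this lookup.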
This database was donated by Dr. Katsuhiko
Takabayashi and prepared by Prof. Shusaku
Tsumoto.
For questions on the data and task description, contact
Shusaku Tsumoto or
Dr. Takabayashi.
All questions and answers will be published as appendices to this
document.
- The description of attributes (and their normal ranges)
doesn't correspond to the data. (21.7.1999)
Following several questions, I checked the original database at the hospital
and found several errors in the attribute information, affecting the
attributes from PT to TAT2. Please replace the first line below with the
second one.
Old: PT APTT FG PIC TAT TAT2 U-PRO
New: PT Note APTT FG AT3 A2PI U-PRO
The Normal Range is:
AT3 70 < N < 130
A2PI 70 < N < 130
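Applied in chronological order, the two corrections to the TSUM_C attribute list (30.6.1999 and 21.7.1999, both quoted in this document) can be sketched as list operations. This is only an illustration: the helper `replace_run` is hypothetical, and whether the CSV file itself carries a header line with these names is an assumption.

```python
def replace_run(columns, old, new):
    """Replace the first contiguous run `old` in `columns` with `new`."""
    for start in range(len(columns) - len(old) + 1):
        if columns[start:start + len(old)] == old:
            return columns[:start] + new + columns[start + len(old):]
    raise ValueError("run not found: " + " ".join(old))

# Attribute list as originally published.
original = (
    "ID Date GOT GPT LDH ALP TP ALB UA UN CRE T-BIL T-CHO TG CPK GLU "
    "WBC RBC HGB HCT PLT PT APTT FG PIC TAT U-PRO IGG IGA IGM CRP "
    "RA RF C3 C4 RNP SM SC170 SSA SSB CENTROMEA DNA DNA-II"
).split()

# Correction of 30.6.1999: add the examination between TAT and U-PRO.
step1 = replace_run(original, ["TAT", "U-PRO"], ["TAT", "TAT2", "U-PRO"])

# Correction of 21.7.1999: fix the names in the run from PT to U-PRO.
corrected = replace_run(
    step1,
    "PT APTT FG PIC TAT TAT2 U-PRO".split(),
    "PT Note APTT FG AT3 A2PI U-PRO".split(),
)
```

The resulting list has the same attributes, in the same order, as the TSUM_C table in this guide.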
- While most values fall into the normal range (as given
by the Guide to the Medical Data Set), this is not the case for
items UN and CRE. (16.7.1999)
These are the experts' mistakes. The normal values are UN: N < 30 and
CRE: N < 1.5. Sorry.
- For the following items the normal range does not fit
the data: WBC: values 0.1 <= N <= 119.5, normal range 3500 < N < 9000;
RBC: values 0.01 <= N <= 6.57, normal range 350 < N < 600.
(16.7.1999)
Yes, these are also the experts' mistakes. Please change the normal
ranges to WBC: 3.5 < N < 9.0 and RBC: 3.5 < N < 6.0.
- Are the values of diagnosis ordered lists or just sets,
i.e., is 'RA, SJS' equal to 'SJS, RA'? (14.7.1999)
The values are just sets, so RA,SJS is equal to SJS,RA.
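Since the values are sets, one way to compare them is to parse each cell into a `frozenset`. The sketch below assumes the values are separated by commas, as in the example; the function name is hypothetical.

```python
def parse_multivalued(text):
    """Split a comma-separated cell into an order-free set of values."""
    return frozenset(part.strip() for part in text.split(",") if part.strip())

# The two spellings from the question compare equal once parsed as sets.
same = parse_multivalued("RA, SJS") == parse_multivalued("SJS,RA")
```

A `frozenset` is hashable, so the parsed diagnosis can also be used as a dictionary key or for grouping.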
- What does the word 'susp' mean that comes after many of the diagnoses,
like 'SLE susp'? And what about the words that come in
parentheses, as in 'BEHCET (entero)', 'EN (r/o BEHCET)' and 'RA
(seronegative)'? Can all diagnoses like 'SLE susp' be grouped
into a higher level like 'SLE'? (14.7.1999)
'susp' stands for "suspected", so these diagnoses have
not been confirmed.
'BEHCET (entero)': entero stands for the enterocolitis
type of Behçet's disease, in which the colon is the main target
of the autoimmune process. Behçet's disease has several types;
in 'BEHCET (neuro)', the main target is the neurons.
'EN (r/o BEHCET)': this is an enterocolitis case in
which Behçet's disease is strongly suspected.
'RA (seronegative)': from the observations (symptoms), this case
can be diagnosed as RA, but the serum tests (laboratory examinations)
are negative. Such strange cases do occur in real clinical
practice. So, this case is clinically RA but negative in the
laboratory tests. (From the viewpoint of the laboratory tests, these are
"true-negative" cases.)
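The conventions described in this answer could be separated into fields before analysis. This is a sketch under assumptions: the function and field names are hypothetical, and grouping 'SLE susp' under 'SLE' is the analyst's choice, since the answer explains the notation without prescribing a grouping.

```python
def parse_diagnosis_entry(entry):
    """Split one diagnosis entry into disease name, 'susp' flag and note."""
    text = entry.strip()
    note = None
    # A trailing parenthesised part, e.g. '(entero)', is kept as a free-text note.
    if text.endswith(")") and "(" in text:
        text, _, rest = text.partition("(")
        note = rest[:-1].strip()
        text = text.strip()
    # A trailing 'susp' marks a suspected (unconfirmed) diagnosis.
    suspected = text.lower().endswith(" susp")
    if suspected:
        text = text[: -len(" susp")].strip()
    return {"disease": text, "suspected": suspected, "note": note}

examples = [parse_diagnosis_entry(e)
            for e in ["SLE susp", "BEHCET (entero)", "EN (r/o BEHCET)", "RA (seronegative)"]]
```

Keeping the flag and the note separate lets an analysis decide later whether suspected cases count as the disease or not.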
- We came across some attribute values in table TSUM_C which
puzzled us a bit. (30.6.1999)
I found one error in the attribute information for Tsumoto_c.csv:
one laboratory examination between TAT and U-PRO is missing from the list. All
the questions about CRP, IGM, RF and IGG stem from this error.
So, please replace the attribute list:
ID Date GOT GPT LDH ALP TP ALB UA UN CRE T-BIL T-CHO TG CPK GLU
WBC RBC HGB HCT PLT PT APTT FG PIC TAT U-PRO IGG IGA IGM CRP
RA RF C3 C4 RNP SM SC170 SSA SSB CENTROMEA DNA DNA-II
with:
ID Date GOT GPT LDH ALP TP ALB UA UN CRE T-BIL T-CHO TG CPK GLU
WBC RBC HGB HCT PLT PT APTT FG PIC TAT TAT2 U-PRO IGG
IGA IGM CRP RA RF C3 C4 RNP SM SC170 SSA SSB CENTROMEA DNA DNA-II
- Another little puzzle has been the attribute ANA Pattern in
table TSUM_B. First of all, are its values ordered lists or just
sets, i.e., is "P,S" = "S,P"? Are there
any other considerations you think are important for this attribute?
(30.6.1999)
These values are just sets, so {P,S} = {S,P}.
- Attributes RNP, SM, SC170, SSA, SSB, CENTROMEA, which
are expected to assume [-, +-], are often seen to have numbers
as values. How are they supposed to be interpreted? (30.6.1999)
Usually, these tests have two kinds of measurements: qualitative
and quantitative. We thought that they were measured by qualitative
methods. I will check the normal range.
- We found values such as "<30" and ">=1000"
for some numerical attributes. How should they be interpreted?
Could they be replaced with some number? (30.6.1999)
It means that these values are too small or too large. You can
substitute some value for each case; for example, "<30"
can be transformed to 10 and ">=1000"
to 1500.
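A substitution of this kind might be sketched as follows. The general rule used here (dividing a lower limit by 3 and multiplying an upper limit by 1.5) is an assumption chosen only because it reproduces the two examples in the answer; the function name and parameters are hypothetical.

```python
def parse_measurement(text, low_divisor=3.0, high_factor=1.5):
    """Convert a cell to float, replacing censored values by a stand-in.

    Assumed rule: '<x' or '<=x' becomes x / low_divisor,
    '>x' or '>=x' becomes x * high_factor.
    """
    text = text.strip()
    # Two-character prefixes are checked first so '>=' is not read as '>'.
    for prefix in ("<=", ">=", "<", ">"):
        if text.startswith(prefix):
            bound = float(text[len(prefix):])
            if prefix.startswith("<"):
                return bound / low_divisor
            return bound * high_factor
    return float(text)

values = [parse_measurement(v) for v in ["<30", ">=1000", "42.5"]]
```

Other stand-in values are equally defensible; keeping a separate flag for censored cells would allow the choice to be revisited.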
- The new attribute U-PRO (after TAT-2) is sometimes seen to assume
the value TR. What does it mean?
TR means that the laboratory could not perform the measurement
because of some problem with the blood serum, i.e., "error in measurement
due to the problems with submitted blood serum".
- We have downloaded and started to analyze the medical
data set. It seems that 350 of the patient IDs in TSUM_B.csv
have no corresponding entry in TSUM_A.csv. Therefore, do data from
BOTH examinations exist for only about 400 of the 1200 patients
in TSUM_A.csv? (30.6.1999)
Tsumoto_a.csv includes the data of all patients who have been followed
by doctors at the outpatient clinic of the University Hospital for at least
several months. Tsumoto_b.csv, on the other hand, includes the
data of two kinds of patients. The first is a patient followed
at the University Hospital. The second is a patient who is not
followed at the University Hospital but for whom special laboratory examinations
were made (even in this case, we register the patient and
assign a University Hospital ID number). So, tsumoto_a.csv
and tsumoto_b.csv together include three types of patients:
- First type: a patient followed at the outpatient clinic of the University
Hospital for whom no special examinations were made.
(Patients of the first type do not suffer from thrombosis; that
is, they are negative with respect to thrombosis.)
- Second type: a patient followed at the University Hospital for whom
special examinations were made.
- Third type: a patient who is not followed at the University Hospital
but for whom special examinations were made.
Thus, about 400 patients in Tsumoto_b.csv belong to the
third type. Since they are not followed at the University Hospital,
they do not have temporal data. So, please use first-type and
second-type patients for the analysis of thrombosis.
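The three patient types described above correspond to simple set operations on the ID columns of the two tables. The sketch below uses made-up IDs; the function name and dictionary keys are hypothetical.

```python
def classify_patients(ids_a, ids_b):
    """Group patient IDs by their presence in TSUM_A and TSUM_B."""
    ids_a, ids_b = set(ids_a), set(ids_b)
    return {
        "first": ids_a - ids_b,   # followed, no special examinations
        "second": ids_a & ids_b,  # followed, with special examinations
        "third": ids_b - ids_a,   # special examinations only, not followed
    }

# Made-up IDs for illustration only.
groups = classify_patients(ids_a=[1, 2, 3, 4], ids_b=[3, 4, 5])

# Per the answer, only first-type and second-type patients are used
# for the analysis of thrombosis.
usable = groups["first"] | groups["second"]
```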
- For a given patient, are the values of attribute "Diagnosis"
in table TSUM_A the same as the values of attribute "Diagnosis"
in table TSUM_B?
Yes. If they differ, please use the diagnosis in TSUM_A, which is the
most recently updated file for diagnoses.
- Is "Diagnosis" (tables TSUM_A, TSUM_B) a concept that
contributors to the discovery challenge should consider, or
is "Thrombosis" the only target attribute?
Diagnosis is also a target attribute. My colleagues are
interested not only in "Thrombosis" but also in "Diagnosis".
- In table TSUM_C, you gave the normal range of values.
What is the possible range of all values? (E.g., I have found the value
"+" for attributes with normal range "-",
"+-", but I have also found the value "-" in an
attribute with normal range N<8.)
First, {-,+-,+} is a usual notation in medical "qualitative"
tests: "-" is negative (in the normal range), "+-"
is not negative but at the border of the normal range, and "+"
means positive, or abnormal.
{-,+-,+} can be observed in a simple test; usually, each symbol
corresponds to a range of "quantitative" values. In the
case of "-" with normal range N<8, "-"
means that the value of this test is in the normal range (N<8).
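Columns that mix the qualitative symbols with numbers could be read along the following lines. This is a hedged sketch: the function name, the returned labels and the `upper_limit` parameter are hypothetical, and the numeric branch assumes a normal range of the form N < limit, as for DNA.

```python
# Meaning of the qualitative symbols, as explained in the answer above.
QUALITATIVE = {"-": "normal", "+-": "borderline", "+": "abnormal"}

def interpret(value, upper_limit=None):
    """Interpret a cell that holds either a qualitative symbol or a number."""
    text = str(value).strip()
    if text in QUALITATIVE:
        return QUALITATIVE[text]
    number = float(text)
    if upper_limit is None:
        return "numeric"  # no numeric normal range supplied for this item
    return "normal" if number < upper_limit else "abnormal"

# Hypothetical cells for an item with normal range N < 8.
results = [interpret(v, upper_limit=8) for v in ["-", "4", "12.3", "+-"]]
```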
This page was originally written by Peter Berka, PKDD99 Discovery
Chair. The donors would like to thank him for his effort and
his kindness. Without his efforts, the workshop would not
have had such an impact, and we would not have continued using this data
for future discovery challenge workshops.
Shusaku Tsumoto and Katsuhiko Takabayashi
For More Information on this Data :
- Shusaku Tsumoto
- Department of Medical Informatics, Shimane Medical University.
- E-mail: tsumoto@computer.org
For more details, refer to PKDD99 Discovery Challenge Home Page.
Last modified: Fri Feb 4 11:06:53 JST 2000