KDD Challenge 2000

Guide to the Collagen Diseases Data Set

Domain

The database was collected at Chiba University Hospital. Each patient came to the hospital's outpatient clinic for collagen diseases on the recommendation of a home doctor or a general physician at a local hospital.

Collagen diseases are auto-immune diseases: patients generate antibodies that attack their own bodies. For example, a patient who generates antibodies in the lungs will chronically lose respiratory function and will eventually die. The disease mechanisms are only partially known, and their classification is still fuzzy. Some patients generate many kinds of antibodies, and their manifestations may include all the characteristics of collagen diseases.

In collagen diseases, thrombosis is one of the most important and severe complications and one of the major causes of death. Thrombosis is an increased coagulation of blood that clogs blood vessels. It usually lasts several hours and can recur over time. Thrombosis can arise from different collagen diseases. This complication has been found to be closely related to anti-cardiolipin antibodies. This was discovered by physicians, one of whom donated the data sets for the discovery challenge.

Thrombosis must be treated as an emergency, so it is important to detect and predict its occurrence. However, no expert in immunology has yet carried out such a database analysis. Domain experts are very much interested in discovering regularities behind patients' observations.

 

Goals

  1. Search for patterns which detect and predict thrombosis.
  2. Search for temporal patterns specific/sensitive to thrombosis. (The examination date is very close to the date of thrombosis. Specific/sensitive patterns found before/after the thrombosis would be very useful.)
  3. Search for features which classify collagen diseases correctly.
  4. Search for temporal patterns specific/sensitive to each collagen disease.

Domain experts told us that useful patterns, if discovered, would be acceptable for publication in major journals on rheumatology (collagen diseases).

 

Evaluation Scheme

One of the domain experts, who is well known in rheumatology, will attend the PKDD'99 conference and evaluate all the results. The results will also be evaluated in a clinical environment in the future.

 

Database

The database consists of three tables (TSUM_A.CSV, TSUM_B.CSV, TSUM_C.CSV). The patients in these three tables are linked by ID number.
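As a minimal sketch of how the three tables can be linked by ID, the Python below groups the rows of TSUM_B and TSUM_C under each patient of TSUM_A. The in-memory CSV text, the column headers and the helper names (read_rows, group_by_id, link_patients) are illustrative assumptions; the real files may use different headers or encodings.

```python
import csv
import io
from collections import defaultdict

# Tiny in-memory stand-ins for the three files. Headers and values are
# assumptions for illustration, not copies of the real data.
TSUM_A = "ID,Sex\n1,F\n2,M\n"
TSUM_B = "ID,Thrombosis\n1,0\n"
TSUM_C = "ID,Date,GOT\n1,910101,30\n1,910201,35\n2,920101,20\n"


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def group_by_id(rows):
    """Group rows by patient ID; a patient may have several rows."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["ID"]].append(row)
    return grouped


def link_patients(a_rows, b_rows, c_rows):
    """Attach each patient's special exams and lab tests to the basic record."""
    b_by_id = group_by_id(b_rows)
    c_by_id = group_by_id(c_rows)
    return {
        a["ID"]: {
            "patient": a,
            "special_exams": b_by_id.get(a["ID"], []),
            "lab_tests": c_by_id.get(a["ID"], []),
        }
        for a in a_rows
    }


linked = link_patients(read_rows(TSUM_A), read_rows(TSUM_B), read_rows(TSUM_C))
```

Patients without special tests simply get an empty special_exams list, which matches the fact that TSUM_B does not cover every patient.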

 

TSUM_A.CSV

Basic information about patients (input by doctors). This dataset includes all patients (about 1000 records).

item             | meaning                                         | remark
ID               | identification of the patient                   |
Sex              |                                                 |
Birthday         |                                                 | YYYY/M/D
Description date | the first date when a patient data was recorded | YY.MM.DD
First date       | the date when a patient came to the hospital    | YY.MM.DD
Admission        | patient was admitted to the hospital (+) or followed at the outpatient clinic (-) |
Diagnosis        | disease names                                   | multivalued attribute
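The date formats listed for TSUM_A differ between columns, so a small parsing step is usually needed before computing anything such as age at first visit. The sketch below is one way to do it. The function names are hypothetical, and expanding two-digit years to 19xx is an assumption that should be checked against the actual records.

```python
from datetime import date


def parse_birthday(text):
    """Parse a 'YYYY/M/D' string such as '1934/2/13'."""
    year, month, day = (int(part) for part in text.strip().split("/"))
    return date(year, month, day)


def parse_short_date(text, century=1900):
    """Parse a 'YY.MM.DD' string; the century is an assumption (default 19xx)."""
    yy, mm, dd = (int(part) for part in text.strip().split("."))
    return date(century + yy, mm, dd)


def age_at(birthday, when):
    """Age in whole years on the date `when`."""
    had_birthday = (when.month, when.day) >= (birthday.month, birthday.day)
    return when.year - birthday.year - (0 if had_birthday else 1)


# Made-up example values, not taken from the data set.
birthday = parse_birthday("1934/2/13")
first_visit = parse_short_date("94.02.14")
age_at_first_visit = age_at(birthday, first_visit)
```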

 

TSUM_B.CSV

Special laboratory examinations (input by doctors, measured by the Laboratory on Collagen Diseases). This dataset does not include all the patients, only those who underwent these special tests.

item             | meaning                                          | remark
ID               | identification of the patient                    |
Examination Date | date of the test                                 | YYYY/MM/DD
aCL IgG          | anti-Cardiolipin antibody (IgG) concentration    |
aCL IgM          | anti-Cardiolipin antibody (IgM) concentration    |
ANA              | anti-nuclear antibody concentration              |
ANA Pattern      | pattern observed in the sheet of ANA examination |
aCL IgA          | anti-Cardiolipin antibody (IgA) concentration    |
Diagnosis        | disease names                                    | multivalued attribute
KCT              | measure of degree of coagulation                 |
RVVT             | measure of degree of coagulation                 |
LAC              | measure of degree of coagulation                 |
Symptoms         | other symptoms observed                          | multivalued attribute
Thrombosis       | degree of thrombosis                             | 0: negative (no thrombosis)
                 |                                                  | 1: positive (the most severe one)
                 |                                                  | 2: positive (severe)
                 |                                                  | 3: positive (mild)

The examination date is very close to the date of thrombosis. For negative examples, these tests were performed when thrombosis was suspected.
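Because the examination date approximates the date of thrombosis, one way to look for temporal patterns is to collect the ordinary laboratory tests from TSUM_C that fall in a window before each TSUM_B examination. The following Python is a sketch under stated assumptions: the row dictionaries, the 30-day window and the function names are hypothetical, and the 19xx century follows from TSUM_C covering 1980 to March 1999.

```python
from datetime import date, timedelta


def parse_exam_date(text):
    """Parse a TSUM_B 'YYYY/MM/DD' examination date."""
    year, month, day = (int(part) for part in text.split("/"))
    return date(year, month, day)


def parse_lab_date(text, century=1900):
    """Parse a TSUM_C 'YYMMDD' date; the 19xx century is an assumption."""
    return date(century + int(text[0:2]), int(text[2:4]), int(text[4:6]))


def labs_before_exam(lab_rows, exam_date, days_before=30):
    """Return lab rows dated from `days_before` days earlier up to exam_date."""
    start = exam_date - timedelta(days=days_before)
    return [
        row for row in lab_rows
        if start <= parse_lab_date(row["Date"]) <= exam_date
    ]


def is_thrombosis_positive(degree):
    """Collapse the degree code (0 negative; 1, 2, 3 positive) to a boolean."""
    return int(degree) in (1, 2, 3)


# Made-up rows for one patient, for illustration only.
lab_rows = [
    {"ID": "1", "Date": "930105", "PLT": "210"},
    {"ID": "1", "Date": "930201", "PLT": "150"},
    {"ID": "1", "Date": "930301", "PLT": "190"},
]
exam_date = parse_exam_date("1993/02/10")
recent = labs_before_exam(lab_rows, exam_date, days_before=30)
```

The window length is a free parameter; a window placed after the examination date can be built the same way to study what follows a thrombosis.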

 

TSUM_C.CSV

Laboratory examinations stored in the Hospital Information System (from 1980 to March 1999). All the data are ordinary laboratory examinations and carry time stamps. The tests are not necessarily connected to thrombosis.

item      | meaning                               | normal range
ID        | identification of the patient         |
Date      | Date of the laboratory tests (YYMMDD) |
GOT       | AST glutamic oxaloacetic transaminase | N < 60
GPT       | ALT glutamic pyruvic transaminase     | N < 60
LDH       | lactate dehydrogenase                 | N < 500
ALP       | alkaline phosphatase                  | N < 300
TP        | total protein                         | 6.0 < N < 8.5
ALB       | albumin                               | 3.5 < N < 5.5
UA        | uric acid                             | N > 8.0 (Male)
          |                                       | N > 6.5 (Female)
UN        | urea nitrogen                         | N < 30
CRE       | creatinine                            | N < 1.5
T-BIL     | total bilirubin                       | N < 2.0
T-CHO     | total cholesterol                     | N < 250
TG        | triglyceride                          | N < 200
CPK       | creatine phosphokinase                | N < 250
GLU       | blood glucose                         | N < 180
WBC       | White blood cell                      | 3.5 < N < 9.0
RBC       | Red blood cell                        | 3.5 < N < 6.0
HGB       | Hemoglobin                            | 10 < N < 17
HCT       | Hematocrit                            | 29 < N < 52
PLT       | platelet                              | 100 < N < 400
PT        | prothrombin time                      | N < 14
Note      | comment for the test PT               |
APTT      | activated partial thromboplastin time | N < 45
FG        | fibrinogen                            | 150 < N < 450
AT3       | marker of DIC, one of the most important complications of collagen diseases | 70 < N < 130
A2PI      | marker of DIC                         | 70 < N < 130
U-PRO     | proteinuria                           | 0 < N < 30
IGG       | Ig G                                  | 900 < N < 2000
IGA       | Ig A                                  | 80 < N < 500
IGM       | Ig M                                  | 40 < N < 400
CRP       | C-reactive protein                    | N= -, +-, or N < 1.0
RA        | Rheumatoid Factor                     | N= -, +-
RF        | RAHA                                  | N < 20
C3        | complement 3                          | N > 35
C4        | complement 4                          | N > 10
RNP       | anti-ribonuclear protein              | N= -, +-
SM        | anti-SM                               | N= -, +-
SCl70     | anti-scl70                            | N= -, +-
SSA       | anti-SSA                              | N= -, +-
SSB       | anti-SSB                              | N= -, +-
CENTROMEA | anti-centromere                       | N= -, +-
DNA       | anti-DNA                              | N < 8
DNA-II    | anti-DNA                              | N < 8
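The normal ranges above can be turned into simple flags that mark each result as inside or outside its range, which is often a convenient first feature for pattern search. The sketch below encodes only a handful of the listed tests. The NORMAL_RANGES dictionary and the function names are hypothetical, and treating blank or non-numeric entries as "cannot be judged" is an assumption about how the raw values look.

```python
# A few ranges copied from the table, as (lower, upper) bounds.
# None means the table gives no bound on that side.
NORMAL_RANGES = {
    "GOT": (None, 60),   # N < 60
    "TP": (6.0, 8.5),    # 6.0 < N < 8.5
    "PLT": (100, 400),   # 100 < N < 400
    "C3": (35, None),    # N > 35
}


def is_normal(test, raw_value):
    """Return True or False for a numeric result, or None if it cannot be judged."""
    if test not in NORMAL_RANGES:
        return None
    try:
        value = float(raw_value)
    except (TypeError, ValueError):
        return None  # missing or non-numeric entry
    low, high = NORMAL_RANGES[test]
    if low is not None and value <= low:
        return False
    if high is not None and value >= high:
        return False
    return True


def is_normal_qualitative(raw_value):
    """Qualitative tests (for example RNP or SM) list '-' and '+-' as normal."""
    return raw_value.strip() in ("-", "+-")


# Made-up results, for illustration only.
flags = {
    "GOT": is_normal("GOT", "45"),
    "TP": is_normal("TP", "5.9"),
    "PLT": is_normal("PLT", ""),
}
```

The bounds are applied as strict inequalities, as written in the table, so a GOT value of exactly 60 counts as outside the normal range.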

 

This database was donated by Dr. Katsuhiko Takabayashi and prepared by Prof. Shusaku Tsumoto.
For questions about the data and task description, contact Shusaku Tsumoto or Dr. Takabayashi. All questions and answers will be published as appendixes to this document.

 


Asked Questions in PKDD'99 Discovery Challenge

 


This page was originally written by Peter Berka, PKDD99 Discovery Chair. The donors would like to thank him for his effort and his kindness. Without his efforts, the workshop would not have had such an impact, nor would we have continued using this data for future discovery challenge workshops.

Shusaku Tsumoto and Katsuhiko Takabayashi


For more information on this data:

Shusaku Tsumoto
Department of Medical Informatics, Shimane Medical University.
E-mail: tsumoto@computer.org

For more details, refer to the PKDD99 Discovery Challenge Home Page.




Last modified: Fri Feb 4 11:06:53 JST 2000