Statistical relational learning for document mining

Alexandrin Popescul, Lyle H. Ungar, Steve Lawrence, David M. Pennock

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

31 Scopus citations


A major obstacle to fully integrated deployment of many data mining algorithms is the assumption that data sits in a single table, even though most real-world databases have complex relational structures. We propose an integrated approach to statistical modeling from relational databases. We structure the search space based on "refinement graphs", which are widely used in inductive logic programming for learning logic descriptions. The use of statistics allows us to extend the search space to include a richer set of features, many of which are not Boolean. Search and model selection are integrated into a single process, allowing information criteria native to the statistical model (for example, logistic regression) to make feature selection decisions in a step-wise manner. We present experimental results for the task of predicting where scientific papers will be published, based on relational data taken from CiteSeer. Our approach yields classification accuracies superior to those achieved with classical "flat" features. The resulting classifier can be used to recommend where to publish articles.
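
The step-wise selection described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Everything in it is an assumption made for illustration: the toy relations (`venue_of`, `cites`), the flat attribute `n_authors`, the labels `in_icdm`, the use of BIC as the information criterion, and the plain gradient-ascent fit that stands in for a proper logistic regression solver. The candidate features are also listed by hand here, whereas the paper generates them by searching a refinement graph.

```python
import math

# Hypothetical toy "database": two relations, one flat attribute, and labels.
venue_of = {"a": "ICDM", "b": "ICDM", "c": "KDD", "d": "KDD"}  # published papers
cites = {  # new paper -> papers it cites
    "q1": ["a", "b"], "q2": ["a", "b", "c"], "q3": ["a"], "q4": ["b", "a"],
    "q5": ["c", "d"], "q6": ["c"], "q7": ["d", "c"], "q8": ["d"],
    "q9": ["a", "c"], "q10": ["b", "d"], "q11": ["a", "b", "d"],
    "q12": ["c", "d", "a"],
}
n_authors = {"q1": 1, "q2": 2, "q3": 1, "q4": 3, "q5": 1, "q6": 2,
             "q7": 3, "q8": 2, "q9": 2, "q10": 1, "q11": 2, "q12": 2}
in_icdm = {"q1": 1, "q2": 1, "q3": 1, "q4": 1, "q5": 0, "q6": 0,
           "q7": 0, "q8": 0, "q9": 1, "q10": 0, "q11": 1, "q12": 0}

papers = list(cites)
y = [in_icdm[p] for p in papers]
features = {
    # Flat feature: read directly from the paper's own row.
    "n_authors": [n_authors[p] for p in papers],
    # Relational, non-Boolean feature: join cites with venue_of, then count.
    "icdm_citations": [sum(venue_of[c] == "ICDM" for c in cites[p])
                       for p in papers],
}


def fit_logistic(X, y, steps=2000, lr=0.5):
    """Fit logistic regression by batch gradient ascent; weights[0] is the bias."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(steps):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            err = yi - 1.0 / (1.0 + math.exp(-z))
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * g / n for wj, g in zip(w, grad)]
    return w


def log_likelihood(w, X, y):
    """Bernoulli log-likelihood, computed in a numerically stable form."""
    ll = 0.0
    for xi, yi in zip(X, y):
        z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
        ll += yi * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return ll


def bic(names, features, y):
    """BIC of a logistic model that uses the named features plus a bias."""
    n = len(y)
    X = [[features[f][i] for f in names] for i in range(n)]
    w = fit_logistic(X, y)
    return -2.0 * log_likelihood(w, X, y) + (len(names) + 1) * math.log(n)


def stepwise_select(features, y):
    """Greedy forward selection: add a feature only while BIC keeps dropping."""
    selected, best = [], bic([], features, y)
    while True:
        scored = [(bic(selected + [f], features, y), f)
                  for f in features if f not in selected]
        if not scored or min(scored)[0] >= best:
            return selected
        best, chosen = min(scored)
        selected.append(chosen)


selected = stepwise_select(features, y)
print(selected)
```

On this toy data the count of cited ICDM papers is the only feature that lowers the BIC, so the flat `n_authors` attribute is left out. The point of the sketch is the control flow: each candidate is scored by the criterion of the model being fitted, so feature search and model selection happen in the same loop.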

Original language: English (US)
Title of host publication: Proceedings - 3rd IEEE International Conference on Data Mining, ICDM 2003
Number of pages: 8
State: Published - 2003
Externally published: Yes
Event: 3rd IEEE International Conference on Data Mining, ICDM '03 - Melbourne, FL, United States
Duration: Nov 19 2003 → Nov 22 2003

Publication series

Name: Proceedings - IEEE International Conference on Data Mining, ICDM
ISSN (Print): 1550-4786


Conference: 3rd IEEE International Conference on Data Mining, ICDM '03
Country/Territory: United States
City: Melbourne, FL

All Science Journal Classification (ASJC) codes

  • Engineering(all)

