Logical and Relational Learning


This textbook covers logical and relational learning in depth, and hence provides an introduction to inductive logic programming (ILP), multirelational data mining (MRDM) and (statistical) relational learning (SRL). These subfields of data mining and machine learning are concerned with the analysis of complex and structured data sets that arise in numerous applications, such as bio- and chemoinformatics, network analysis, Web mining, and natural language processing, within the rich representations offered by relational databases and computational logic.

The author introduces the machine learning and representational foundations of the field and explains some important techniques in detail by using some of the classic case studies centered around well-known logical and relational systems.

The book is suitable for use in graduate courses and should be of interest to graduate students and researchers in computer science, databases and artificial intelligence, as well as practitioners of data mining and machine learning. It contains numerous figures and exercises, and slides are available for many chapters.

Year: 2008
Edition: 1
Publisher: Springer
Language: English
Pages: 395
ISBN 10: 3540200401
ISBN 13: 9783540200406
Series: Cognitive Technologies
File: PDF, 3.81 MB

Cognitive Technologies
H. Prendinger, M. Ishizuka (Eds.)
Life-Like Characters
Tools, Affective Functions, and Applications
IX, 477 pages. 2004
H. Helbig
Knowledge Representation and the Semantics of Natural Language
XVIII, 646 pages. 2006
P.M. Nugues
An Introduction to Language Processing with Perl and Prolog
An Outline of Theories, Implementation, and Application with Special Consideration
of English, French, and German
XX, 513 pages. 2006
W. Wahlster (Ed.)
SmartKom: Foundations of Multimodal Dialogue Systems
XVIII, 644 pages. 2006
B. Goertzel, C. Pennachin (Eds.)
Artificial General Intelligence
XVI, 509 pages. 2007
O. Stock, M. Zancanaro (Eds.)
PEACH — Intelligent Interfaces for Museum Visits
XVIII, 316 pages. 2007
V. Torra, Y. Narukawa
Modeling Decisions: Information Fusion and Aggregation Operators
XIV, 284 pages. 2007
P. Manoonpong
Neural Preprocessing and Control of Reactive Walking Machines
Towards Versatile Artificial Perception–Action Systems
XVI, 185 pages. 2007
S. Patnaik
Robot Cognition and Navigation
An Experiment with Mobile Robots
XVI, 290 pages. 2007
M. Cord, P. Cunningham (Eds.)
Machine Learning Techniques for Multimedia
Case Studies on Organization and Retrieval
XVI, 290 pages. 2008
L. De Raedt
Logical and Relational Learning
XVI, 388 pages. 2008

Cognitive Technologies
Managing Editors: D. M. Gabbay J. Siekmann
Editorial Board: A. Bundy J. G. Carbonell
M. Pinkal H. Uszkoreit M. Veloso W. Wahlster
M. J. Wooldridge

Advisory Board:
Luigia Carlucci Aiello
Franz Baader
Wolfgang Bibel
Leonard Bolc
Craig Boutilier
Ron Brachman
Bruce G. Buchanan
Anthony Cohn
Artur d’Avila Garcez
Luis Fariñas del Cerro
Koichi Furukawa
Georg Gottlob
Patrick J. Hayes
James A. Hendler
Anthony Jameson
Nick Jennings
Aravind K. Joshi
Hans Kamp
Martin Kay
Hiroaki Kitano
Robert Kowalski
Sarit Kraus
Maurizio Lenzerini
Hector Levesque
John Lloyd

Alan Mackworth
Mark Maybury
Tom Mitchell
Johanna D. Moore
Stephen H. Muggleton
Bernhard Nebel
Sharon Oviatt
Luis Pereira
Lu Ruqian
Stuart Russell
Erik Sandewall
Luc Steels
Oliviero Stock
Peter Stone
Gerhard Strube
Katia Sycara
Milind Tambe
Hidehiko Tanaka
Sebastian Thrun
Junichi Tsujii
Kurt VanLehn
Andrei Voronkov
Toby Walsh
Bonnie Webber

Luc De Raedt

Logical and
Relational Learning
With 77 Figures and 10 Tables

Author:

Managing Editors:

Prof. Dr. Luc De Raedt
Dept. of Computer Science
Katholieke Universiteit Leuven
Celestijnenlaan 200A
3001 Heverlee
Belgium
luc.deraedt@cs.kuleuven.be

Prof. Dov M. Gabbay
Augustus De Morgan Professor of Logic
Department of Computer Science
King’s College London
Strand, London WC2R 2LS, UK

Prof. Dr. Jörg Siekmann
Forschungsbereich Deduktions- und
Multiagentensysteme, DFKI
Stuhlsatzenweg 3, Geb. 43
66123 Saarbrücken, Germany

ISBN: 978-3-540-20040-6

e-ISBN: 978-3-540-68856-3

Cognitive Technologies ISSN: 1611-2482
Library of Congress Control Number: 2008933706
ACM Computing Classification (1998): I.2.6, I.2.4, H.2.8, I.5.1
© 2008 Springer-Verlag Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting,
reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9,
1965, in its current version, and permission for use must always be obtained from Springer. Violations are
liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws
and regulations and therefore free for general use.
Cover design: KuenkelLopka GmbH
Printed on acid-free paper
9 8 7 6 5 4 3 2 1
springer.com

To Lieve, Maarten and Soetkin

Preface

I use the term logical and relational learning to refer to the subfield of artificial
intelligence, machine learning and data mining that is concerned with learning
in expressive logical or relational representations. It is the union of inductive
logic programming, (statistical) relational learning and multi-relational data
mining, which all have contributed techniques for learning from data in relational form. Even though some early contributions to logical and relational
learning are about forty years old now, it was only with the advent of inductive logic programming in the early 1990s that the field became popular.
Whereas initial work was often concerned with logical (or logic programming)
issues, the focus has rapidly changed to the discovery of new and interpretable
knowledge from structured data, often in the form of rules, and soon important successes in applications in domains such as bio- and chemo-informatics
and computational linguistics were realized. Today, the challenges and opportunities of dealing with structured data and knowledge have been taken up by
the artificial intelligence community at large and form the motivation for a lot
of ongoing research. Indeed, graph, network and multi-relational data mining
are now popular themes in data mining, and statistical relational learning is
receiving a lot of attention in the machine learning and uncertainty in artificial intelligence communities. In addition, the range of tasks for which logical
and relational techniques have been developed now covers almost all machine
learning and data mining tasks. On the one hand these developments have resulted in a new role and novel views on logical and relational learning, but on
the other hand have also made it increasingly difficult to obtain an overview
of the field as a whole.
This book wants to address these needs by providing a new synthesis of
logical and relational learning. It constitutes an attempt to summarize some
of the key results about logical and relational learning, it covers a wide range
of topics and techniques, and it describes them in a uniform and accessible
manner. While the author has tried to select a representative set of topics
and techniques from the field of logical and relational learning, he also realizes that he is probably biased by his own research interests and views on the


field. Furthermore, rather than providing detailed accounts of the many specific systems and techniques, the book focuses on the underlying principles,
which should enable the reader to easily get access to and understand the
relevant literature on logical and relational learning. Actually, at the end of
each chapter, suggestions for further reading are provided.
The book is intended for graduate students and researchers in artificial
intelligence and computer science, especially those in machine learning, data
mining, uncertainty in artificial intelligence, and computational logic, with an
interest in learning from relational data. The book is the first textbook on
logical and relational learning and is suitable for use in graduate courses,
though it can also be used for self-study and as a reference. It contains
many different examples and exercises. Teaching material will become available from the author’s website. The author would also appreciate receiving
feedback, suggestions for improvement and needed corrections by email to
luc.deraedt@cs.kuleuven.be.
The book starts with an introductory chapter clarifying the nature, motivations and history of logical and relational learning. Chapter 2 provides
a gentle introduction to logic and logic programming, which will be used
throughout the book as the representation language. Chapter 3 introduces
the idea of learning as search and provides a detailed account of some fundamental machine learning algorithms that will play an important role in later
chapters. In Chapter 4, a detailed study of a hierarchy of different representations that are used in machine learning and data mining is given, and two
techniques (propositionalization and aggregation) for transforming expressive
representations into simpler ones are introduced. Chapter 5 is concerned with
the theoretical basis of the field. It studies the generality relation in logic, the
relation between induction and deduction, and introduces the most important
framework and operators for generality. In Chapter 6, a methodology for developing logical and relational learning systems is presented and illustrated
using a number of well-known case studies that learn relational rules, decision
trees and frequent queries. The methodology starts from existing learning approaches and upgrades them towards the use of rich representations. Whereas
the first six chapters are concerned with the foundations of logical and relational learning, the chapters that follow introduce more advanced techniques.
Chapter 7 focuses on learning the definition of multiple relations, that is,
on learning theories. This chapter covers abductive reasoning, using integrity
constraints, program synthesis, and the use of an oracle. Chapter 8 covers
statistical relational learning, which combines probabilistic models with logical and relational learning. The chapter starts with a gentle introduction to
graphical models before turning towards probabilistic logics. The use of kernels and distances for logical and relational learning is addressed in Chapter 9,
and in Chapter 10 computational issues such as efficiency considerations and
learnability results are discussed. Finally, Chapter 11 summarizes the most
important lessons learned about logical and relational learning. The author
suggests reading it early on, possibly even directly after Chapter 1.


An introductory course on logical and relational learning covers most of the
material in Chapters 1 to 4, Sects. 5.1 – 5.4, 5.9, and Chapters 6 and 11. The
other chapters do not depend on one another, and, hence, further chapters can
be selected according to the interests and the preferences of the reader. Given
the interest in statistical relational learning, the author certainly recommends
Chapter 8. Advanced sections and exercises are marked with a * or even with
**. They are more challenging, but can be skipped without loss of continuity.
This book could not have been written without the help and encouragement of many persons. The author is indebted to a number of co-workers who
contributed ideas, techniques, surveys and views that have found their way
into this book, including: Maurice Bruynooghe for influencing the use of logic
in this book and numerous suggestions for improvement; Hendrik Blockeel,
Luc Dehaspe and Wim Van Laer for contributions to the upgrading methodology described in Chapter 6; Kristian Kersting for joint work on statistical
relational learning presented in Chapter 8; and Jan Ramon for his work on
distances in Chapter 9. This book has also taken inspiration from a number
of joint overview papers and tutorials that the author delivered in collaboration with Hendrik Blockeel, Sašo Džeroski, Kristian Kersting, Nada Lavrač
and Stephen Muggleton. The author would also like to thank the editor at
Springer, Ronan Nugent, for his patience, help, and support during all phases
of this book-writing project.
The author is grateful for the feedback and encouragement he received on
the many earlier versions of this book. He would like to thank especially:
the reading clubs at the University of Bristol (headed by Peter Flach, and involving Kerstin Eder, Robert Egginton, Steve Gregory, Susanne Hoche, Simon
Price, Simon Rawles and Ksenia Shalonova) and at the Katholieke Universiteit Leuven (Hendrik Blockeel, Björn Bringmann, Maurice Bruynooghe, Fabrizio Costa, Tom Croonenborghs, Anton Dries, Kurt Driessens, Daan Fierens,
Christophe Costa Florencio, Elisa Fromont, Robby Goetschalkx, Bernd Gutmann, Angelika Kimmig, Niels Landwehr, Wannes Meert, Siegfried Nijssen,
Stefan Raeymaekers, Leander Schietgat, Jan Struyf, Ingo Thon, Anneleen
van Assche, Joaquin Vanschoren, Celine Vens, and Albrecht Zimmermann),
several colleagues who gave feedback or discussed ideas in this book (James
Cussens, Paolo Frasconi, Thomas Gaertner, Tamas Horvath, Manfred Jaeger,
Martijn van Otterlo), and the students in Freiburg who used several previous
versions of this book.
Last but not least I would like to thank my wife, Lieve, and my children,
Soetkin and Maarten, for their patience and love during the many years it
took to write this book. I dedicate this book to them.

Destelbergen, May 2008

Luc De Raedt

Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 What Is Logical and Relational Learning? . . . . . . . . . . . . . . . . . . 1
1.2 Why Is Logical and Relational Learning Important? . . . . . . . . . 2
1.2.1 Structure Activity Relationship Prediction . . . . . . . . . . . . 3
1.2.2 A Web Mining Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.3 A Language Learning Example . . . . . . . . . . . . . . . . . . . . . . 7
1.3 How Does Relational and Logical Learning Work? . . . . . . . . . . . 8
1.4 A Brief History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2 An Introduction to Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1 A Relational Database Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 The Syntax of Clausal Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3 The Semantics of Clausal Logic — Model Theory . . . . . . . . . . . . 22
2.4 Inference with Clausal Logic — Proof Theory . . . . . . . . . . . . . . . 28
2.5 Prolog and SLD-resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.6 Historical and Bibliographic Remarks . . . . . . . . . . . . . . . . . . . . . . 39

3 An Introduction to Learning and Search . . . . . . . . . . . . . . . . . . . . . 41
3.1 Representing Hypotheses and Instances . . . . . . . . . . . . . . . . . . . . 41
3.2 Boolean Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4 Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.5 A Generate-and-Test Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.6 Structuring the Search Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.7 Monotonicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.8 Borders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.9 Refinement Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.10 A Generic Algorithm for Mining and Learning . . . . . . . . . . . . . . 58
3.11 A Complete General-to-Specific Algorithm . . . . . . . . . . . . . . . . . . 59
3.12 A Heuristic General-to-Specific Algorithm . . . . . . . . . . . . . . . . . . 60
3.13 A Branch-and-Bound Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.14 A Specific-to-General Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.15 Working with Borders* . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.15.1 Computing a Single Border . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.15.2 Computing Two Borders . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.15.3 Computing Two Borders Incrementally . . . . . . . . . . . . . . . 66
3.15.4 Operations on Borders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.16 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.17 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

4 Representations for Mining and Learning . . . . . . . . . . . . . . . . . . . . 71
4.1 Representing Data and Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2 Attribute-Value Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3 Multiple-Instance Learning: Dealing With Sets . . . . . . . . . . . . . . 76
4.4 Relational Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.5 Logic Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.6 Sequences, Lists, and Grammars . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.7 Trees and Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.8 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.9 Background Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.10 Designing It Yourself . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.11 A Hierarchy of Representations* . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.11.1 From AV to BL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.11.2 From M I to AV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.11.3 From RL to M I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.11.4 From LP to RL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.12 Propositionalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.12.1 A Table-Based Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.12.2 A Query-Based Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.13 Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.14 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.15 Historical and Bibliographical Remarks . . . . . . . . . . . . . . . . . . . . . 113

5 Generality and Logical Entailment . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.1 Generality and Logical Entailment Coincide . . . . . . . . . . . . . . . . 115
5.2 Propositional Subsumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.3 Subsumption in Logical Atoms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.3.1 Specialization Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.3.2 Generalization Operators* . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.3.3 Computing the lgg and the glb . . . . . . . . . . . . . . . . . . . . . . 125
5.4 Θ-Subsumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.4.1 Soundness and Completeness . . . . . . . . . . . . . . . . . . . . . . . 128
5.4.2 Deciding Θ-Subsumption . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.4.3 Equivalence Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.5 Variants of Θ-Subsumption* . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.5.1 Object Identity* . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135


5.5.2 Inverse Implication* . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
5.6 Using Background Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
5.6.1 Saturation and Bottom Clauses . . . . . . . . . . . . . . . . . . . . . 139
5.6.2 Relative Least General Generalization* . . . . . . . . . . . . . . . 141
5.6.3 Semantic Refinement* . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.7 Aggregation* . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.8 Inverse Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.9 A Note on Graphs, Trees, and Sequences . . . . . . . . . . . . . . . . . . . 152
5.10 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
5.11 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
6 The Upgrading Story . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
6.1 Motivation for a Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
6.2 Methodological Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
6.2.1 Representing the Examples . . . . . . . . . . . . . . . . . . . . . . . . . 159
6.2.2 Representing the Hypotheses . . . . . . . . . . . . . . . . . . . . . . . 160
6.2.3 Adapting the Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
6.2.4 Adding Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
6.3 Case Study 1: Rule Learning and Foil . . . . . . . . . . . . . . . . . . . . . 161
6.3.1 Foil’s Problem Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.3.2 Foil’s Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
6.4 Case Study 2: Decision Tree Learning and Tilde . . . . . . . . . . . . 168
6.4.1 The Problem Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
6.4.2 Inducing Logical Decision Trees . . . . . . . . . . . . . . . . . . . . . 172
6.5 Case Study 3: Frequent Item-Set Mining and Warmr . . . . . . . . 174
6.5.1 Relational Association Rules and Local Patterns . . . . . . 174
6.5.2 Computing Frequent Queries . . . . . . . . . . . . . . . . . . . . . . . . 177
6.6 Language Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
6.6.1 Syntactic Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
6.6.2 Semantic Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
6.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
6.8 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184

7 Inducing Theories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
7.1 Introduction to Theory Revision . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
7.1.1 Theories and Model Inference . . . . . . . . . . . . . . . . . . . . . . . 188
7.1.2 Theory Revision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
7.1.3 Overview of the Rest of This Chapter . . . . . . . . . . . . . . . . 192
7.2 Towards Abductive Logic Programming . . . . . . . . . . . . . . . . . . . . 193
7.2.1 Abduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
7.2.2 Integrity Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
7.2.3 Abductive Logic Programming . . . . . . . . . . . . . . . . . . . . . . 196
7.3 Shapiro’s Theory Revision System . . . . . . . . . . . . . . . . . . . . . . . . . 199
7.3.1 Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
7.3.2 The Model Inference System . . . . . . . . . . . . . . . . . . . . . . . . 203


7.4 Two Propositional Theory Revision Systems* . . . . . . . . . . . . . . . 208
7.4.1 Learning a Propositional Horn Theory Efficiently . . . . . . 208
7.4.2 Heuristic Search in Theory Revision . . . . . . . . . . . . . . . . . 212
7.5 Inducing Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
7.5.1 Problem Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
7.5.2 An Algorithm for Inducing Integrity Constraints . . . . . . 215
7.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
7.7 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
8 Probabilistic Logic Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
8.1 Probability Theory Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
8.2 Probabilistic Logics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
8.2.1 Probabilities on Interpretations . . . . . . . . . . . . . . . . . . . . . 226
8.2.2 Probabilities on Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
8.3 Probabilistic Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
8.3.1 Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
8.3.2 Structure Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
8.4 First-Order Probabilistic Logics . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
8.4.1 Probabilistic Interpretations . . . . . . . . . . . . . . . . . . . . . . . 248
8.4.2 Probabilistic Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
8.5 Probabilistic Logic Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
8.5.1 Learning from Interpretations . . . . . . . . . . . . . . . . . . . . . . . 267
8.5.2 Learning from Entailment . . . . . . . . . . . . . . . . . . . . . . . . . . 270
8.5.3 Learning from Proof Trees and Traces . . . . . . . . . . . . . . . . 271
8.6 Relational Reinforcement Learning* . . . . . . . . . . . . . . . . . . . . . . . 274
8.6.1 Markov Decision Processes . . . . . . . . . . . . . . . . . . . . . . . . . 274
8.6.2 Solving Markov Decision Processes . . . . . . . . . . . . . . . . . . 277
8.6.3 Relational Markov Decision Processes . . . . . . . . . . . . . . . . 280
8.6.4 Solving Relational Markov Decision Processes . . . . . . . . . 282
8.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
8.8 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287

9 Kernels and Distances for Structured Data . . . . . . . . . . . . . . . . . . . 289
9.1 A Simple Kernel and Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
9.2 Kernel Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
9.2.1 The Max Margin Approach . . . . . . . . . . . . . . . . . . . . . . . . . 291
9.2.2 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
9.2.3 The Kernel Trick . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
9.3 Distance-Based Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
9.3.1 Distance Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
9.3.2 The k-Nearest Neighbor Algorithm . . . . . . . . . . . . . . . . . . 297
9.3.3 The k-Means Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
9.4 Kernels for Structured Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
9.4.1 Convolution and Decomposition . . . . . . . . . . . . . . . . . . . . . 299
9.4.2 Vectors and Tuples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299

9.4.3 Sets and Multi-sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
9.4.4 Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
9.4.5 Trees and Atoms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
9.4.6 Graph Kernels* . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
9.5 Distances and Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
9.5.1 Generalization and Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 308
9.5.2 Vectors and Tuples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
9.5.3 Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
9.5.4 Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
9.5.5 Atoms and Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
9.5.6 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
9.6 Relational Kernels and Distances . . . . . . . . . . . . . . . . . . . . . . . . . . 321
9.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
9.8 Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . . . . . 323

10 Computational Aspects of Logical and Relational Learning 325
10.1 Efficiency of Relational Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 325
10.1.1 Coverage as θ-Subsumption . . . . . . . . . . . . . . . . . . . . . . . . . 326
10.1.2 θ-Subsumption Empirically . . . . . . . . . . . . . . . . . . . . . . . . . 327
10.1.3 Optimizing the Learner for θ-subsumption . . . . . . . . . . . . 328
10.2 Computational Learning Theory* . . . . . . . . . . . . . . . . . . . . . . . . . . 333
10.2.1 Notions of Learnability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
10.2.2 Positive Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
10.2.3 Negative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
10.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
10.4 Historical and Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 342
11 Lessons Learned . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
11.1 A Hierarchy of Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
11.2 From Upgrading to Downgrading . . . . . . . . . . . . . . . . . . . . . . . . . . 346
11.3 Propositionalization and Aggregation . . . . . . . . . . . . . . . . . . . . . . 346
11.4 Learning Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
11.5 Operators and Generality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
11.6 Unification and Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
11.7 Three Learning Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
11.8 Knowledge and Background Knowledge . . . . . . . . . . . . . . . . . . . . 350
11.9 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381

1
Introduction

The field of logical and relational learning, which is introduced in this chapter, is motivated by the limitations of traditional symbolic machine learning
and data mining systems that largely work with propositional representations.
These limitations are clarified using three case studies: predicting the activity
of chemical compounds from their structure; link mining, where properties of
websites are discovered; and learning a simple natural language interface for a
database. After sketching how logical and relational learning works, the chapter ends with a short sketch of the history of this subfield of machine learning
and data mining as well as a brief overview of the rest of this book.

1.1 What Is Logical and Relational Learning?
Artificial intelligence has many different subfields. Logical and relational learning combines principles and ideas of two of the most important subfields of
artificial intelligence: machine learning and knowledge representation. Machine learning is the study of systems that improve their behavior over time
with experience. In many cases, especially in a data mining context, the experience consists of a set of observations or examples in which one searches
for patterns, regularities or classification rules that provide valuable new insights into the data and that should ideally be readily interpretable by the
user. The learning process then typically involves a search through various
generalizations of the examples.
In the past fifty years, a wide variety of machine learning techniques have
been developed (see Mitchell [1997] or Langley [1996] for an overview). However, most of the early techniques were severely limited from a knowledge representation perspective. Indeed, most of the early techniques (such as decision
trees [Quinlan, 1986], Bayesian networks [Pearl, 1988], perceptrons [Nilsson,
1990], or association rules [Agrawal et al., 1993]) could only handle data and
generalizations in a limited representation language, which was essentially


propositional. Propositional representations (based on boolean or propositional logic) cannot elegantly represent domains involving multiple entities
as well as the relationships amongst them. One such domain is that of social networks where the persons are the entities and their social interactions
are characterized by their relationships. The representational limitations of
the early machine learning systems carried over to the resulting learning and
data mining techniques, which were in turn severely limited in their application domain. Various machine learning researchers, such as Ryszard Michalski
[1983] and Gordon Plotkin [1970], soon realized these limitations and started
to employ more expressive knowledge representation frameworks for learning.
Research focused on using frameworks that were able to represent a variable
number of entities as well as the relationships that hold amongst them. Such
representations are called relational. When they are grounded in or derived
from first-order logic they are called logical representations. The interest in
learning using these expressive representation formalisms soon resulted in the
emergence of a new subfield of artificial intelligence that I now describe as
logical and relational learning.
Logical and relational learning is thus viewed in this book as the study of
machine learning and data mining within expressive knowledge representation
formalisms encompassing relational or first-order logic. It specifically targets
learning problems involving multiple entities and the relationships amongst
them. Throughout the book we shall mostly be using logic as a representation
language for describing data and generalizations, because logic is inherently
relational, it is expressive, understandable, and interpretable, and it is well
understood. It provides solid theoretical foundations for many developments
within artificial intelligence and knowledge representation. At the same time,
it enables one to specify and employ background knowledge about the domain,
which is often also a key factor determining success in many applications of
artificial intelligence.

1.2 Why Is Logical and Relational Learning Important?
To answer this question, let us look at three important applications of logical
and relational learning. The first is concerned with learning to classify a set
of compounds as active or inactive [Srinivasan et al., 1996], the second with
analyzing a website [Craven and Slattery, 2001], and the third with learning
a simple natural language interface to a database system [Mooney, 2000]. In
these applications there are typically a variable number of entities as well as
relationships amongst them. This makes it very hard, if not impossible, to
use more traditional machine learning methods that work with fixed feature
vectors or attribute-value representations. Using relational or logical representations, these problems can be alleviated, as we will show throughout the
rest of this book.


1.2.1 Structure Activity Relationship Prediction
Consider the compounds shown in Fig. 1.1. Two of the molecules are active
and two are inactive. The learning task now is to find a pattern that discriminates the actives from the inactives. This type of task is an important
task in computational chemistry. It is often called structure activity relationship prediction (SAR), and it forms an essential step in understanding various
processes related to drug design and discovery [Srinivasan and King, 1999b],
toxicology [Helma, 2005], and so on. The figure also shows a so-called structural alert, which allows one to distinguish the actives from the inactives because the structural alert is a substructure (subgraph) that matches both of
the actives but none of the inactives. At the same time, the structural alert is
readily interpretable and provides useful insights into the factors determining
the activity.

[Fig. 1.1. Predicting mutagenicity. The figure shows two active compounds (nitrofurazone and 4-nitropenta[cd]pyrene), two inactive compounds (6-nitro-7,8,9,10-tetrahydrobenzo[a]pyrene and 4-nitroindole), and the structural alert, a substructure containing a double bond Y=Z. Reprinted from [Srinivasan et al., 1996], page 288, © 1996, with permission from Elsevier.]

Traditional machine learning methods employ the single-tuple single-table
assumption, which assumes that the data can be represented using attribute-value pairs. Within this representation, each example (or compound) corresponds to a
single column; cf. Table 1.1. For each example and attribute, the cells of the


table specify the value of the attribute for the specific example, for instance,
whether or not a benzene ring is present. We shall call representations that
can easily be mapped into the single-tuple single-table format propositional.
Table 1.1. A table-based representation

Compound  Attribute1  Attribute2  Attribute3  Class
1         true        false       true        active
2         true        true        true        active
3         false       false       true        inactive
4         true        false       true        inactive

From a user perspective, the difficulty with this type of representation is
the mismatch between the graphical and structured two-dimensional representation of the molecules and the flat representation in the table. In order
to use the flat representation, the user must first determine the features or
attributes of interest. In structure activity relationship prediction, these are
sometimes called fingerprints. Finding these features is in itself a non-trivial
task as there exist a vast number of potentially interesting features. Furthermore, the result of the learning process will critically depend on the quality of
the employed features. Even though there exist a number of specialized tools
to tackle this kind of task (involving the use of libraries of fingerprints), the
question arises as to whether there exist general-purpose machine learning
systems able to cope with such structured representations directly. The answer to this question is affirmative, as logical and relational learners directly
deal with structured data.
Example 1.1. The graphical structure of one of the compounds can be represented by means of the following tuples, which we call facts:
active(f1) ←
logmutag(f1, 0.64) ←
lumo(f1, −1.785) ←
logp(f1, 1.01) ←
atom(f1, f11, c, 21, 0.187) ←
atom(f1, f12, c, 21, −0.143) ←
atom(f1, f13, c, 21, −0.143) ←
atom(f1, f14, c, 21, −0.013) ←
atom(f1, f15, o, 52, −0.043) ←
...

bond(f1, f11, f12, 7) ←
bond(f1, f12, f13, 7) ←
bond(f1, f13, f14, 7) ←
bond(f1, f14, f15, 7) ←
bond(f1, f18, f19, 2) ←
bond(f1, f18, f110, 2) ←
bond(f1, f11, f111, 1) ←
bond(f1, f111, f112, 2) ←
bond(f1, f111, f113, 1) ←

In this encoding, each entity is given a name and the relationships among
the entities are captured. For instance, in the above example, the compound
is named f1 and its atoms f11, f12, .... Furthermore, the relation atom/5 of
arity 5 states properties of the atoms: the molecule they occur in (e.g., f1),


the element (e.g., c denoting a carbon) and the type (e.g., 21) as well as the
charge (e.g., 0.187). The relationships amongst the atoms are then captured
by the relation bond/3, which represents the bindings amongst the atoms.
Finally, there are also overall properties or attributes of the molecule, such
as their logp and lumo values. Further properties of the compounds could be
mentioned, such as the functional groups or ring structures they contain:
ring size 5(f1, [f15, f11, f12, f13, f14]) ←
hetero aromatic 5 ring(f1, [f15, f11, f12, f13, f14]) ←
...
The first tuple states that there is a ring of size 5 in the compound f1 that
involves the atoms f15, f11, f12, f13 and f14; the second one states
that this is a heteroaromatic ring.
Using this representation it is possible to describe the structural alert in
the form of a rule
active(M) ← ring size 5(M, R), element(A1, R), bond(M, A1, A2, 2)
which actually reads as¹:
Molecule M is active IF it contains a ring of size 5 called R and atoms
A1 and A2 that are connected by a double (2) bond such that A1 also
belongs to the ring R.
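
To make the encoding concrete, the example can be written down as a small executable logic program, for instance in SWI-Prolog. The fragment below is an illustrative sketch, not the actual data set: predicate and constant names are rewritten with underscores (ring_size_5 for ring size 5, f1_1 for the atom f11), element/2 from the rule corresponds to the standard member/2, and the double bond from the ring atom f1_1 is invented so that the query succeeds.

% Hypothetical fragment of the relational encoding of compound f1.
atom(f1, f1_1, c, 21, 0.187).
atom(f1, f1_2, c, 21, -0.143).
atom(f1, f1_5, o, 52, -0.043).
bond(f1, f1_1, f1_2, 7).
bond(f1, f1_1, f1_6, 2).          % invented double bond from ring atom f1_1
ring_size_5(f1, [f1_5, f1_1, f1_2, f1_3, f1_4]).

% The structural alert as a clause.
active(M) :-
    ring_size_5(M, R),
    member(A1, R),                % element(A1, R) in the text
    bond(M, A1, _A2, 2).

% ?- active(f1).
% true, since f1_1 belongs to the five-membered ring and takes part in a
% double bond.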
The previous example illustrates the use of logical representations for data
mining. It is actually based on the well-known mutagenicity application of relational learning due to Srinivasan et al. [1996], where the structural alert
was discovered using the inductive logic programming system Progol [Muggleton, 1995] and the representation employed above. The importance of this
type of application is clear when considering that the results were published
in the scientific literature in the application domain [King and Srinivasan,
1996], that they were obtained using a general-purpose inductive logic programming algorithm and were transparent to the experts in the domain. The
combination of these factors has seldom been achieved in artificial intelligence.

¹ This form of description is sometimes called ‘Sternberg’ English in inductive logic programming, after the computational biologist Michael Sternberg, who has been involved in several pioneering scientific applications of logical learning.
1.2.2 A Web Mining Example
While structure activity relationship prediction involves the mining of a (potentially large) set of small graphs, link mining and discovery is concerned
with the analysis of a single large graph or network [Getoor, 2003, Getoor and
Diehl, 2005]. To illustrate link mining, we consider the best known example
of a network, that is, the Internet, even though link mining is applicable in
other contexts as well, for instance, in social networks, protein networks, and
bibliographic databases. The following example is inspired by the influential
WebKB example of Craven and Slattery [2001].
Example 1.2. Consider the website of a typical university. It contains several
web pages that each describe a wide variety of entities. These entities belong
to a wide variety of classes, such as student, faculty, staff, department, course,
project, and publication. Let us assume that for each such web page, there is
a corresponding tuple in our relational database. Consider, for instance, the
following facts (where the urls denote particular URLs):
faculty(url1, stephen) ←
course(url3, logic for learning) ←
department(url5, computer science) ←
...

faculty(url2, john) ←
project(url4, april2) ←
student(url6, hiroaki) ←

In addition, there are relationships among these entities. For instance, various
pages that refer to one another, such as the link from url6 to url1, denote a particular relationship, in this case the relationship between the student hiroaki
and his adviser stephen. This can again be modeled as facts in a relational
database:
adviser(stephen, hiroaki) ←
teaches(john, logic for learning) ←
belongsTo(stephen, computer science) ←
follows(hiroaki, logic for learning) ←
...
Again, the structure of the problem can elegantly be represented using a
relational database. This representation can easily be extended with additional
background knowledge in the form of rules:
studentOf(Lect, Stud) ← teaches(Lect, Course), follows(Stud, Course)
which expresses that
Stud is a studentOf Lect IF Lect teaches a Course and Stud follows the
Course.
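
The same data and rule can be run directly as a logic program. The sketch below is only a fragment; constants containing spaces are rewritten as Prolog atoms (logic_for_learning, computer_science), and everything else follows the facts and the rule shown above.

% Entities (web pages) and their classes.
faculty(url1, stephen).
faculty(url2, john).
course(url3, logic_for_learning).
project(url4, april2).
department(url5, computer_science).
student(url6, hiroaki).

% Relationships between the entities.
adviser(stephen, hiroaki).
teaches(john, logic_for_learning).
belongsTo(stephen, computer_science).
follows(hiroaki, logic_for_learning).

% Background knowledge: Stud is a student of Lect if Lect teaches a course
% that Stud follows.
studentOf(Lect, Stud) :-
    teaches(Lect, Course),
    follows(Stud, Course).

% ?- studentOf(john, X).
% X = hiroaki.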
Further information could be contained in the data set, such as extracts of the
text appearing on the different links or pages. There are several interesting
link mining tasks in this domain. It is for instance possible to learn to predict
the classes of web pages or the nature of the relationships encoded by links
between web pages; cf. [Getoor, 2003, Craven and Slattery, 2001, Chakrabarti,
2002, Baldi et al., 2003]. To address such tasks using logical or relational
learning, one has to start from a set of examples of relations that are known to
hold. For instance, the fact that stephen is the adviser of hiroaki is a (positive)
example stating that the link from hiroaki to stephen belongs to the relation
adviser. If for a given university website, say the University of Freiburg, all


hyperlinks were labeled (by hand) with the corresponding relationships,
one could then learn general rules using logical and relational learning that
would allow one to predict the labels of unseen hyperlinks. These rules could
then be applied to determine the labels of the hyperlinks at another university
website, say that of the University of Leuven. An example of a rule that might
be discovered in this context is
adviser(Prof, Stud) ←
webpage(Stud, Url), student(Url),
contains(Url, advisor), contains(Url, Prof)
which expresses that
Prof is an adviser of Stud IF Stud has a webpage with Url of type
student that contains the words adviser and Prof.
To tackle such problems, one often combines logical and relational learning
with probabilistic models. This topic will be introduced in Chapter 8.
Link mining problems in general, and the above example in particular,
cannot easily be represented using the single-tuple single-table assumption
(as in Table 1.1) without losing information.
Exercise 1.3. Try to represent the link mining example within the single-tuple single-table assumption and identify the problems with this approach.
Exercise 1.4. Sketch other application domains that cannot be modeled under this assumption.
1.2.3 A Language Learning Example
A third illustration of an application domain that requires dealing with knowledge as well as structured data is natural language processing. Empirical
natural language processing is now a major trend within the computational
linguistics community [Manning and Schütze, 1999], and several logical and
relational learning scientists have contributed interesting techniques and applications; cf. [Cussens and Džeroski, 2000, Mooney, 2000]. As an illustration
of this line of work, let us look at one of the applications of Raymond Mooney’s
group, which pioneered the use of logical and relational learning techniques for
language learning. More specifically, we sketch how to learn to parse database
queries in natural language, closely following Zelle and Mooney [1996].² The
induced semantic parser is the central component of a question-answering
system.

² For ease of exposition, we use a slightly simplified notation.
Example 1.5. Assume you are given a relational database containing information about geography. The database contains information about various basic
entities such as countries, cities, states, rivers and places. In addition, it contains facts about the relationships among them, for instance, capital(C, Y),
which specifies that C is the capital of Y, loc(X, Y), which states that X is located in Y, nextTo(X, Y), which states that X is located next to Y, and many
other relationships.
The task could then be to translate queries formulated in natural language
to database queries that can be executed by the underlying database system.
For instance, the query in natural language
What are the major cities in Kansas?
could be translated to the database query
answer(C, (major(C), city(C), loc(C, S), equal(S, kansas)))
This last query can then be passed on to the database system and executed.
The database system then generates all entities C that are major, a city, and
located in kansas.
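
Such a translated query can be executed directly against a logic database. The sketch below uses a handful of invented geography facts; only the structure of the query follows the text, and the answer/2 term from above is wrapped as a one-argument predicate for simplicity.

% Invented geography facts, for illustration only.
city(wichita).   city(topeka).   city(lawrence).
major(wichita).  major(topeka).
loc(wichita, kansas).   loc(topeka, kansas).   loc(lawrence, kansas).

equal(X, X).

% The database query produced by the semantic parser.
answer(C) :-
    major(C), city(C), loc(C, S), equal(S, kansas).

% ?- answer(C).
% C = wichita ;
% C = topeka.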
Zelle and Mooney’s learning system [1996] starts from examples, which
consist of queries in natural language and in database format, from an elementary shift-reduce parser, and from some background knowledge about the
domain. The task is then to learn control knowledge for the parser. Essentially,
the parser has to learn the conditions under which to apply the different operators in the shift-reduce parser. The control knowledge is represented using
a set of clauses (IF rules) as in the previous two case studies. The control
rules need to take into account knowledge about the stack used by the parser,
the structure of the database, and the semantics of the language. This is
again hard (if not impossible) to represent under the single-tuple single-table
assumption.

1.3 How Does Relational and Logical Learning Work?
Symbolic machine learning and data mining techniques essentially search a
space of possible patterns, models or regularities. Depending on the task,
different search algorithms and principles apply. For instance, consider the
structure activity relationship prediction task and assume that one is searching
for all structural alerts that occur in at least 20% of the actives and at most
2% of the inactives. In this case, a complete search strategy is applicable.
On the other hand, if one is looking for a structural alert that separates the
actives from the inactives and could be used for classification, a heuristic
search method such as hill climbing is more appropriate.
Data mining is often viewed as the process of computing the set of patterns
Th(Q, D, L) [Mannila and Toivonen, 1997], which can be defined as follows (cf.
also Chapter 3). The search space consists of all patterns expressible within a
language of patterns L. For logical and relational learning this will typically


be a set of rules or clauses of the type we encountered in the case studies of
the previous section; the data set D consists of the examples that need to
be generalized; and, finally, the constraint Q specifies which patterns are of
interest. The constraint typically depends on the data mining task tackled, for
instance, finding a single structural alert in a classification setting, or finding
all alerts satisfying particular frequency thresholds. So, the set Th(Q, D, L)
can be defined as the set of all patterns h ∈ L that satisfy the constraint
Q(h, D) with respect to the data set D.
A slightly different perspective is given by the machine learning view,
which is often formulated as that of finding a particular function h (again
belonging to a language of possible functions L) that minimizes a loss function
l(h, D) on the data. Using this view, the natural language application of the
previous subsection can be modeled more easily, as the goal is to learn a
function mapping statements in natural language to database queries. An
adequate measure is the accuracy of the function, that is, the fraction of
database queries that are correctly predicted; the corresponding loss is the
fraction predicted incorrectly. The machine learning and data
mining views can be reconciled, for instance, by requiring that the constraint
Q(h, D) succeeds only when l(h, D) is minimal; cf. Chapter 3.
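
The definition of Th(Q, D, L) translates directly into a naive generate-and-test procedure. The sketch below assumes that the language L is given as an explicit list of candidate patterns, that covers/2 is available for the chosen representation (a possible definition is sketched after Ex. 1.6 below), and it uses an invented minimum-frequency constraint; it only illustrates the definition and is far too naive to be an efficient miner.

% th(+Q, +D, +L, -Th): Th contains every pattern H in the language L (an
% explicit list) that satisfies the constraint Q on the data set D.
th(Q, D, L, Th) :-
    findall(H, ( member(H, L), call(Q, H, D) ), Th).

% Example constraint: H must cover at least MinFreq percent of the examples
% in D; covers(H, Example) is assumed to be defined elsewhere.
min_frequency(MinFreq, H, D) :-
    include(covers(H), D, Covered),
    length(Covered, C),
    length(D, N),
    C * 100 >= MinFreq * N.

% ?- th(min_frequency(20), Examples, Candidates, Th).
% collects the candidate patterns covering at least 20% of Examples.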
Central in the definition of the constraint Q(h, D) or the loss function
l(h, D) is the covers relation between the data and the rules. It specifies
when a rule covers an example, or, equivalently, when an example satisfies
a particular rule. There are various possible ways to represent examples and
rules and these result in different possible choices for the covers relation (cf.
Chapter 4). The most popular choice is that of learning from entailment. It
is also the setting employed in the case studies above.
Example 1.6. To illustrate the notion of coverage, let us reconsider Ex. 1.1
and let us also simplify it a bit. An example could now be represented by the
rule:
active(m1) ←
atom(m1, m11, c), . . . , atom(m1, m1n, c),
bond(m1, m11, m12, 2), . . . , bond(m1, m11, m13, 1),
ring size 5(m1, [m15, m11, m12, m13, m14]), . . .
Consider now the rule
active(M) ← ring size 5(M, R), atom(M, M1, c)
which actually states that
Molecule M is active IF it contains a ring of size 5 called R and an
atom M1 that is a carbon (c).
The rule covers the example because the conditions in the rule are satisfied by the example when setting M = m1, R = [m15, m11, m12, m13, m14] and
M1 = m11. The reader familiar with logic (see also Chapter 2) will recognize
that the example e is a logical consequence of the rule r, which is sometimes
written as r |= e.
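
For the simplified situation of Ex. 1.6, where the example is a single ground clause and the rule has a conjunctive body, the coverage test can be sketched as follows. Rules and examples are represented here as cl(Head, BodyLiterals) terms (an invented wrapper), and coverage amounts to finding a substitution under which the head of the rule unifies with the head of the example and every body literal of the rule matches a fact of the example; this is the θ-subsumption test of Chapter 5, not the general learning-from-entailment test.

% covers(+Rule, +Example): both are cl(Head, Body) terms, the example ground.
covers(cl(Head, RuleBody), cl(Head, ExampleFacts)) :-   % the heads unify
    all_matched(RuleBody, ExampleFacts).

all_matched([], _).
all_matched([Lit | Rest], Facts) :-
    member(Lit, Facts),            % each rule literal must match a fact
    all_matched(Rest, Facts).

% ?- covers(cl(active(M), [ring_size_5(M, R), atom(M, M1, c)]),
%           cl(active(m1), [atom(m1, m11, c), atom(m1, m12, c),
%                           bond(m1, m11, m12, 2),
%                           ring_size_5(m1, [m15, m11, m12, m13, m14])])).
% succeeds with M = m1, R = [m15, m11, m12, m13, m14], M1 = m11.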


Now that we know what to compute, we can look at how this can be
realized. The computation of the solutions proceeds typically by searching
the space of possible patterns or hypotheses L. One way of realizing this is to
employ a generate-and-test algorithm, though this is too naive to be efficient.
Therefore symbolic machine learning and data mining techniques typically
structure the space L according to generality. One pattern or hypothesis
is more general than another if all examples that are covered by the latter
pattern are also covered by the former.
Example 1.7. For instance, the rule
active(M) ← ring size 5(M, R), element(A1, R), bond(M, A1, A2, 2)
is more general than the rule
active(M) ←
ring size 5(M, R), element(A1, R),
bond(M, A1, A2, 2), atom(M, A2, o, 52, C)
which reads as
Molecule M is active IF it contains a ring of size 5 called R and atoms
A1 and A2 that are connected by a double (2) bond such that A1 also
belongs to the ring R and atom A2 is an oxygen of type 52.
The former rule is more general (or, equivalently, the latter one is more
specific) because the latter one requires also that the atom connected to the
ring of size 5 be an oxygen of atom-type 52. Therefore, all molecules satisfying
the latter rule will also satisfy the former one.
The generality relation is quite central during the search for solutions. The
reason is that the generality relation can often be used 1) to prune the search
space, and 2) to guide the search towards the more promising parts of the
space. The generality relation is employed by the large majority of logical
and relational learning systems, which often search the space in a general-to-specific fashion. This type of system starts from the most general rule (the
unconditional rule, which states that all molecules are active in our running
example), and then repeatedly specializes it using a so-called refinement operator. Refinement operators map rules onto a set of specializations; cf. Chapters
3 and 5.
Example 1.8. Consider the rule
active(Mol) ← atom(Mol, Atom, c, Type, Charge),
which states that a molecule is active if it contains a carbon atom. Refinements
of this rule include:


active(Mol) ← atom(Mol, Atom, c, 21, Charge)
active(Mol) ← atom(Mol, Atom, c, T, Charge), atom(Mol, Atom2, h, T2, Charge2)
active(Mol) ← atom(Mol, Atom, c, T, Charge), ring size 5(Mol, Ring)
...
The first refinement states that the carbon atom must be of type 21, the second
one requires that there be carbon as well as hydrogen atoms, and the third one
that there be a carbon atom and a ring of size 5. Many more specializations
are possible, and, in general, the operator depends on the description language
and generality relation used.
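A refinement operator of this kind can be sketched as a small Prolog meta-program. The representation below (a rule as a Head-Body pair, and a hand-written vocabulary of candidate literals) is an assumption made only for illustration; it covers refinements that add a literal, not those that instantiate a variable such as the type-21 refinement above.

:- use_module(library(lists)).    % for append/3

% Candidate literals that may be added; each mentions the molecule variable M.
vocab(atom(M, _A, h, _T, _C), M).
vocab(ring_size_5(M, _R), M).

% refine(+Rule, -Specialization): add one literal from the vocabulary.
refine(Head-Body, Head-NewBody) :-
    Head =.. [_Pred, M|_],        % the molecule variable of the head
    vocab(Lit, M),                % a fresh literal about that molecule
    append(Body, [Lit], NewBody).

% ?- refine(active(Mol)-[atom(Mol, A, c, T, Ch)], R).
% the two answers correspond to the second and third refinements listed in Ex. 1.8.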
The generality relation can be used to prune the search. Indeed, assume that
we are looking for rules that cover at least 20% of the active molecules and at
most 1% of the inactive ones. If our current rule (say the second one in Ex.
1.7) only covers 18% of the actives, then we can prune away all specializations
of that rule because specialization can only decrease the number of covered
examples. Conversely, if our current rule covers 2% of the inactives, then all
generalizations of the rule cover at least as many inactives (as generalization can only increase the number of covered examples), and therefore these
generalizations can safely be pruned away; cf. Chapter 3 for more details.
Using logical description languages for learning provides us not only with a
very expressive representation, but also with an excellent theoretical foundation for the field. This becomes clear when looking at the generality relation.
It turns out that the generality relation coincides with logical entailment. Indeed, the above examples of the generality relation clearly show that the more
general rule logically entails the more specific one (this holds when learning from entailment; in other settings, such as learning from interpretations, the relationship is reversed and the more specific hypothesis entails the more general one; cf. Chapter 5). So, the more specific rule
is a logical consequence of the more general one, or, formulated differently, the
more general rule logically entails the more specific one. Consider the simpler
example: flies(X) ← bird(X) (if X is a bird, then X flies), which logically entails
and which is clearly more general than the rule flies(X) ← bird(X), normal(X)
(only normal birds fly). This property of the generalization relation provides
us with an excellent formal basis for studying inference operators for learning.
Indeed, because one rule is more general than another if the former entails the
latter, deduction is closely related to specialization as deductive operators can
be used as specialization operators. At the same time, as we will see in detail
in Chapter 5, one can obtain generalization (or inductive inference) operators
by inverting deductive inference operators.

1.4 A Brief History
Logical and relational learning typically employ a form of reasoning known as
inductive inference. This form of reasoning generalizes specific facts into general laws. It is commonly applied within the natural sciences, and therefore
has been studied in the philosophy of science by several philosophers since
Aristotle. For instance, Francis Bacon investigated an inductive methodology
for scientific inquiry. The idea is that knowledge can be obtained by careful experimenting, observing, generalizing and testing of hypotheses. This is
also known as empiricism, and various aspects have been studied by many
other philosophers including Hume, Mill, Peirce, Popper and Carnap. Inductive reasoning is fundamentally different from deductive reasoning in that the
conclusions of inductive reasoning do not follow logically from their premises
(the observations) but are at best cogent; that is, they can only be true with
a certain probability. The reader may notice that this method is actually very
close in spirit to that of logical and relational learning today. The key difference seems to be that logical and relational learning investigates computational
approaches to inductive reasoning.
Computational models of inductive reasoning and scientific discovery have
been investigated since the very beginning of artificial intelligence. Several
cognitive scientists, such as a team involving the Nobel prize winner Herbert A. Simon (see [Langley et al., 1987] for an overview), developed several
models that explain how specific scientific theories could be obtained. Around
the same time, other scientists (including Bruce Buchanan, Nobel prize winner Joshua Lederberg, Ed Feigenbaum, and Tom Mitchell [Buchanan and
Mitchell, 1978]) started to develop learning systems that could assist scientists in discovering new scientific laws. Their system Meta-Dendral produced new results in chemistry that were amongst the first scientific
discoveries made by an artificial intelligence system that were published in
the scientific literature of the application domain. These two lines of research
have actually motivated many developments in logical and relational learning
albeit there is also a crucial difference between them. Whereas the mentioned
approaches were domain-specific, the goal of logical and relational learning is
to develop general-purpose inductive reasoning systems that can be applied
across different application domains. The example concerning structure activity relationship is a perfect illustration of the results of these developments
in logical and relational learning. An interesting philosophical account of the
relationship between these developments is given by Gillies [1996].
Supporting the scientific discovery process across different domains requires a solution to two important computational problems. First, as scientific theories are complex by their very nature, an expressive formalism is
needed to represent them. Second, the inductive reasoning process should be
able to employ the available background knowledge to obtain meaningful hypotheses. These two problems can to a large extent be solved by using logical
representations for learning.
The insight that various types of logical and relational representations can
be useful for inductive reasoning and machine learning can be considered as
an outgrowth of two parallel developments in computer science and artificial intelligence. First, and most importantly, since the mid-1960s a number


of researchers proposed to use (variants of) predicate logic as a formalism
for studying machine learning problems. This was motivated by severe limitations of the early machine learning systems that essentially worked with
propositional representations. Ranan Banerji [1964] was amongst the earliest
advocates of the use of logic for machine learning. The logic he proposed was
motivated by a pattern recognition task. Banerji’s work on logical descriptions
provided inspiration for developing logical learning systems such as Confucius [Cohen and Sammut, 1982] and Marvin [Sammut and Banerji, 1986],
which already incorporated the first inverse resolution operators; cf. Chapter
5. These systems learned incrementally and were able to employ the already
learned concepts during further learning tasks.
Around the same time, Ryszard Michalski [1983] developed his influential
AQ and Induce systems that address the traditional classification task that
made machine learning so successful. Michalski’s work stressed the importance
of both learning readable descriptions and using background knowledge. He
developed his own variant of a logical description language, the Variable Valued Logic, which is able to deal with structured data and relations. At the
same time, within VVL, he suggested that induction be viewed as the inverse of deduction and proposed several inference rules for realizing this. This
view can be traced back to the philosophy of science [Jevons, 1874] and still
forms the basis for much of the theory of generalization, which is extensively
discussed in Chapter 5.
Theoretical properties of generalization and specialization were also studied by researchers such as Plotkin [1970], Reynolds [1970], Vere [1975] and
Buntine [1988]. In particular, Plotkin’s Ph.D. work on θ-subsumption and relative subsumption, two generalization relations for clausal logic, has been very
influential and still constitutes the main framework for generalization in logical learning. It will be extensively studied in Chapter 5. Second, there is the
work on automatic programming [Biermann et al., 1984] that was concerned
with synthesizing programs from examples of their input-output behavior,
where researchers such as Biermann and Feldman [1972], Summers [1977] and
Shapiro [1983] contributed very influential systems and approaches. Whereas
Alan Biermann’s work was concerned with synthesizing Turing machines, Phil
Summers studied functional programs (LISP) and Ehud Shapiro studied the
induction of logic programs and hence contributed an inductive logic programming system avant la lettre. Shapiro’s Model Inference System is still one
of the most powerful program synthesis and inductive inference systems today.
It will be studied in Chapter 7.
In the mid-1980s, various researchers including Bergadano et al. [1988],
Emde et al. [1983], Morik et al. [1993], Buntine [1987] and Ganascia and
Kodratoff [1986] contributed inductive learning systems that used relational
or logical description languages. Claude Sammut [1993] gives an interesting
account of this period in logical and relational learning.
A breakthrough occurred when researchers started to realize that both
problems (in automatic programming and machine learning) could be studied simultaneously within the framework of computational logic. It was the
contribution of Stephen Muggleton [1991] to define the new research field of
inductive logic programming (ILP) [Muggleton and De Raedt, 1994, Lavrač
and Džeroski, 1994] as the intersection of inductive concept learning and logic
programming and to bring together researchers in these areas in the inductive
logic programming workshops [Muggleton, 1992b] that have been organized
annually since 1991. A new subfield of machine learning was born and attracted many scientists, especially in Japan, Australia and Europe (where
two European projects on inductive logic programming were quite influential [De Raedt, 1996]). Characteristic of the early 1990s was that inductive
logic programming was developing firm theoretical foundations, built on logic
programming concepts, for logical learning. In parallel, various well-known
inductive logic programming systems were developed, including Foil [Quinlan, 1990], Golem [Muggleton and Feng, 1992], Progol [Muggleton, 1995],
Claudien [De Raedt and Dehaspe, 1997], Mobal [Morik et al., 1993], Linus
[Lavrač and Džeroski, 1994]. Also, the first successes in real-life applications of
inductive logic programming were realized by Ross King, Stephen Muggleton,
Ashwin Srinivasan, and Michael Sternberg [King et al., 1992, King and Srinivasan, 1996, King et al., 1995, Muggleton et al., 1992]; see [Džeroski, 2001] for
an overview. Due to the success of these applications and the difficulty of making real progress in program synthesis, the field soon focused on machine learning and data mining rather than on automatic programming. A wide variety
of systems and techniques were being developed that upgraded traditional
machine learning systems towards the use of logic; cf. Chapter 6.
During the mid-1990s, both the data mining and the uncertainty in artificial intelligence communities started to realize the limitations of the key representation formalisms they were using. Within the data mining community, the
item-sets representation employed in association rules [Agrawal et al., 1993]
corresponds essentially to a boolean or propositional logic, and the Bayesian
network formalism [Pearl, 1988] defines a probability distribution over propositional worlds. These limitations motivated researchers to look again at more
expressive representations derived from relational or first-order logic. Indeed,
within the data mining community, the work on Warmr [Dehaspe et al.,
1998], which discovers frequent queries and relational association rules from
a relational database, was quite influential. It was successfully applied on
a structure activity relationship prediction task, and motivated several researchers to look into graph mining [Washio et al., 2005, Inokuchi et al.,
2003, Washio and Motoda, 2003]. Researchers in data mining soon started to
talk about (multi-)relational data mining (MRDM) (cf. [Džeroski and Lavrač,
2001]), and an annual series of workshops on multi-relational data mining was
initiated [Džeroski et al., 2002, 2003, Džeroski and Blockeel, 2004, 2005].
A similar development took place in the uncertainty in artificial intelligence community. Researchers started to develop expressive probabilistic logics [Poole, 1993a, Breese et al., 1994, Haddawy, 1994, Muggleton, 1996], and
started to study learning [Sato, 1995, Friedman et al., 1999] in these frameworks soon afterward. In the past few years, these researchers have gathered in
the statistical relational learning (SRL) workshops [Getoor and Jensen, 2003,
2000, De Raedt et al., 2005, 2007a]. A detailed overview and introduction to
this area is given in Chapter 8; two recent volumes in the area are [Getoor and Taskar, 2007, De Raedt et al., 2008]. Statistical relational
learning is one of the most exciting and promising areas for relational and
logical learning today.
To conclude, the field of logical and relational learning has a long history
and is now being studied under different names: inductive logic programming, multi-relational data mining and (statistical) relational learning. The
approach taken in this book is to stress the similarities between these trends
rather than the differences because the problems studied are essentially the
same even though the formalisms employed may be different. The author
hopes that this may contribute to a better understanding of this exciting
field.

2
An Introduction to Logic

The data mining and machine learning approaches discussed in this book employ relational or logical representations. This chapter introduces relational
database representations and their formalization in logic. Logic is employed
because it is an expressive, well-understood and elegant formalism for representing knowledge. We focus on definite clause logic, which forms the basis of
computational logic and logic programming. The chapter introduces the syntax
as well as the model-theoretic and proof-theoretic semantics of definite clause
logic. Throughout this book an attempt is made to introduce all the necessary
concepts and terminology in an intuitive and yet precise manner, without resorting to unnecessary formal proof or theory. The reader interested in a more
detailed theoretical account of clausal logic may want to consult [Lloyd, 1987,
Genesereth and Nilsson, 1987, Flach, 1994, Hogger, 1990, Kowalski, 1979],
and for a more detailed introduction to the programming language Prolog we
refer the reader to [Flach, 1994, Bratko, 1990, Sterling and Shapiro, 1986].

2.1 A Relational Database Example
This book is concerned with mining relational data. The data to be mined
will typically reside in a relational database. An example relational database,
written in the logical notation employed throughout this book, is given in the
following example.
Example 2.1. The database contains information about publications, authors
and citations. The entry authorOf(lloyd, logic for learning) ← for the relation
authorOf/2 represents the fact that lloyd is the author of the publication
logic for learning. Similarly, reference(logic for learning, foundations of lp) ← is
a fact stating that foundations of lp is in the bibliography of logic for learning.


reference(logic for learning, foundations of lp) ←
reference(logic for learning, learning logical definitions from relations) ←
reference(logic for learning, ai a modern approach) ←
reference(ai a modern approach, foundations of lp) ←
reference(ai a modern approach, ilp theory and methods) ←
reference(ilp theory and methods, learning logical definitions from relations) ←
reference(ilp theory and methods, foundations of lp) ←
authorOf(quinlan, learning logical definitions from relations) ←
authorOf(lloyd, foundations of lp) ←
authorOf(lloyd, logic for learning) ←
authorOf(russell, ai a modern approach) ←
authorOf(norvig, ai a modern approach) ←
authorOf(muggleton, ilp theory and methods) ←
authorOf(deraedt, ilp theory and methods) ←
Even though the database contains only a small extract of a real database,
it will be used to introduce some of the key concepts of the knowledge representation formalism employed in this book. The formalism employed is that
of clausal logic because it is an expressive representation formalism, it is well understood, and it provides us with the necessary theory and tools for relational learning. At the same time, it is closely related to relational database
formalisms. This is illustrated by the bibliographic database just introduced
and will become more apparent throughout this section (and Sect. 4.4).
In logic, a relation is called a predicate p/n (e.g., authorOf/2) where p denotes the name of the predicate (authorOf) and n the arity (2), which indicates
the number of arguments the predicate takes. The arguments of predicates are
terms. Constants are a particular type of term that start with a lowercase character, for instance, quinlan and logic for learning. They refer to a particular object in the domain of discourse, such as the book “Logic for Learning” [Lloyd,
2003] or the author “J.R. Quinlan”. All entries in Ex. 2.1 are facts, which are
expressions of the form p(t1 , · · · , tn ) ←, where p/n is a predicate and the ti are
terms. Facts correspond to tuples in a relational database. They express unconditional truths. For instance, the fact authorOf(lloyd, logic for learning) ←
expresses that lloyd is the author of the publication logic for learning. As in
relational databases it is possible to query for information that resides in the
database.
Example 2.2. The query ← authorOf(lloyd, logic for learning) asks whether
lloyd is the author of logic for learning. A theorem prover (such as Prolog)
would return the answer yes, implying that the fact is true, that is, that it
logically follows from the specified database.
The query ← authorOf(lloyd, Pub) asks whether there is a publication
authored by lloyd. The term Pub is called a variable, and by convention,
variables start with an uppercase character. The theorem prover would answer yes and would also return the substitutions {Pub/logic for learning} and


{Pub/foundations of lp}, which state the conditions under which the original query is true. For those readers familiar with SQL, a popular database
language, we also specify the corresponding query in SQL (readers unfamiliar with SQL can safely skip the SQL queries):
SELECT Pub FROM authorOf WHERE Author = lloyd
The query
← authorOf(Author, logic for learning), authorOf(Author, foundations of lp)
asks whether there is an author who wrote logic for learning as well as
foundations of lp. The single substitution that would be returned for this query
is {Author/lloyd}. In SQL, this query is written as:
SELECT Author
FROM authorOf t1, authorOf t2
WHERE t1.Author = t2.Author AND t1.Pub = logic for learning
and t2.Pub = foundations of lp
So, the “,” between the two atoms in the query corresponds to and, and two
occurrences of the same variable must be instantiated to the same constant
in the substitution, which corresponds to a “join” in relational databases.
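The correspondence with Prolog is direct: in an actual Prolog file the facts of Ex. 2.1 are written with a terminating period instead of the trailing “←”, and multi-word constants are written with underscores (a naming convention assumed here for readability). The queries of Ex. 2.2 can then be posed at the Prolog prompt; the fragment below is a small runnable sketch.

% A fragment of the bibliographic database of Ex. 2.1 as a Prolog file.
authorOf(quinlan, learning_logical_definitions_from_relations).
authorOf(lloyd, foundations_of_lp).
authorOf(lloyd, logic_for_learning).
authorOf(russell, ai_a_modern_approach).
reference(logic_for_learning, foundations_of_lp).

% ?- authorOf(lloyd, logic_for_learning).
% true.
% ?- authorOf(lloyd, Pub).
% Pub = foundations_of_lp ;  Pub = logic_for_learning.
% ?- authorOf(A, logic_for_learning), authorOf(A, foundations_of_lp).
% A = lloyd.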
Suppose now that instead of being interested in knowing which publications refer to which other publications, we are interested in knowing the
authors who cite one another. They can be listed using the following query:
← authorOf(A, P), reference(P, Q), authorOf(B, Q)
Many answers would be generated for this query, including:
{A/lloyd, P/logic for learning, Q/foundations of lp, B/lloyd}
{A/lloyd, P/logic for learning, Q/ai a modern approach, B/russell}
...
If the query is posed a number of times, it is useful to define a new predicate
cites/2 using the clause:
cites(A, B) ← authorOf(A, P), reference(P, Q), authorOf(B, Q)
This clause states that A cites B if A is the authorOf P, P references Q, and B is
the authorOf Q. This clause corresponds to the definition of a view or an intensional relation in relational database terminology, which can be implemented
by the following statement in SQL.
CREATE VIEW cites
AS SELECT a1.Author, a2.Author
FROM authorOf a1, authorOf a2, reference r
WHERE a1.Pub = r.Pub1 AND a2.Pub = r.Pub2

The effect is that the predicate cites can now be queried as if it contained the
following facts:
cites(lloyd, lloyd) ←
cites(lloyd, russell) ←
cites(russell, lloyd) ←
cites(russell, muggleton) ←
cites(russell, deraedt) ←
cites(muggleton, quinlan) ←
cites(muggleton, lloyd) ←

cites(lloyd, quinlan) ←
cites(lloyd, norvig) ←
cites(norvig, lloyd) ←
cites(norvig, muggleton) ←
cites(norvig, deraedt) ←
cites(deraedt, quinlan) ←
cites(deraedt, lloyd) ←

Exercise 2.3. Pose a query to identify authors who cite themselves. Use
both the original database (consisting of reference/2 and authorOf/2) and the
one where the cites predicate has been defined. Define also a new predicate
self citation/1 that succeeds for those authors who cite themselves.

2.2 The Syntax of Clausal Logic
Whereas the previous section introduced some logical concepts in a rather
informal manner, the present section will introduce them in a more formal
and precise manner.
A term t is a constant, a variable or a compound term f (t1 , ..., tn ) composed of a function symbol f /n (where f is the name of the function and n is
its arity) and n terms ti . Simple terms are, for instance, logic for learning (a
constant) and Pub (a variable). We will use the convention that constants and
function symbols start with a lowercase character and variables start with an
uppercase character.
Compound terms have not been introduced so far, as they do not belong to the
relational subset of clausal logic. They are used to represent structured objects. Examples of compound terms include card(j, hearts), where card/2 is a
function symbol, and j and hearts are constants. It can be used to denote the
Jack of Hearts card. Compound terms can also be nested. For instance, one
can denote the father of John using the term fatherOf(john). The grandfather of John would then be denoted as fatherOf(fatherOf(john)), that is, as
the father of the father of John. This notation is also used to represent the
natural numbers using logic: the number 0 would be represented by 0, the
number 1 by succ(0), the successor of 0, the number 2 by succ(succ(0)), etc.
Lists are represented using the functor cons/2 (in Prolog the functor is often
represented as ·/2). The first argument of cons/2 denotes the first element of
the list, the second argument denotes the rest of the list, and the constant nil
is used to denote the empty list. For instance, the list [1, 2] is represented as
cons(1, cons(2, nil)). (The list functor cons/2 is usually written in Prolog in infix notation: cons(A, B) is written as [A | B], and cons(A, cons(B, C)) as [A | [B | C]] or [A, B | C] for short.)
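The following small sketch (the predicate names are invented for illustration) shows such structured terms in executable form, converting succ/1 numerals to integers and cons/2 lists to Prolog's built-in list notation.

% nat_to_int(+Nat, -Int): succ(succ(0)) represents 2, etc.
nat_to_int(0, 0).
nat_to_int(succ(N), I) :- nat_to_int(N, J), I is J + 1.

% cons_to_list(+ConsTerm, -List): cons(1, cons(2, nil)) represents [1, 2].
cons_to_list(nil, []).
cons_to_list(cons(H, T), [H|Rest]) :- cons_to_list(T, Rest).

% ?- nat_to_int(succ(succ(0)), I).              % I = 2
% ?- cons_to_list(cons(1, cons(2, nil)), L).    % L = [1, 2]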


An atom is a formula of the form p(t1 , ..., tn ), where p/n is a predicate
symbol and the ti are terms. For instance, authorOf(lloyd, logic for learning),
largerRank(ace, X), faceCard(j, hearts), and pair(card(j, hearts), card(j, diamonds))
are atoms. Although terms and atoms possess a similar syntax, their meaning
is quite different. These differences are clearest when looking at ground terms
and atoms: ground terms represent objects in the domain of discourse, whereas
ground atoms represent a particular relationship among the objects denoted
by the terms appearing in the atom. Therefore, ground atoms possess a truthvalue, that is, they are either true or false, whereas ground terms (and objects)
do not posses truth-values. Indeed, it does not make sense to talk about the
truth-value of the constant green, the person “Mary Ann” or fatherOf(john),
or the list cons(a, cons(b, nil)). However, the atom parent(fatherOf(john), john)
(for the predicate parent/2) would typically be true.
By now, we are able to define clauses, which are the key constructs in
clausal logic. Clausal logic is a subset of first-order logic that forms the basis
of logic programming and the programming language Prolog. It is frequently
employed within artificial intelligence due to its uniform and simple representation, and the existence of efficient inference engines or theorem provers for
clausal logic.
Definition 2.4. A clause is an expression of the form h1 ; · · · ; hn ← b1 , · · · , bm
where the hi and bj are logical atoms.
The symbol “,” stands for conjunction (and), the symbol “;” for disjunction
(or), and “←” for implication (if). Furthermore, all variables are universally
quantified (though this is not explicitly written). So, the clause
female(X); male(X) ← human(X)
specifies that for all X, when X is human, X is also male or female. h1 ; · · · ; hn
is referred to as the conclusion part of the clause, or the head of the clause,
b1 , · · · , bm as the condition part or the body of the clause. It will often be
convenient to employ the notation body(c) for {b1 , · · · , bm } and head(c) for
{h1 , · · · , hn }, where c is the clause h1 ; · · · ; hn ← b1 , · · · , bm . This set notation
can be extended to the overall clause. For instance, the above clause c is
represented by the set {h1 , · · · , hn , ¬b1 , · · · , ¬bm } of the literals it contains,
where a literal is either a logical atom a or a negated atom ¬a. The set notation
for clauses reflects also that a clause is a disjunction h1 ∨· · ·∨hn ∨¬b1 ∨· · ·∨¬bm
of literals.
There exist several special types of clauses:
• facts, where n = 1 and m = 0,
• definite clauses, where n = 1,
• Horn clauses, where n = 1 or n = 0, and
• denials, where n = 0.


The clauses that are most often employed in logical and relational learning are
facts and definite clauses, which we already encountered in the previous section. Denials represent negative information, for instance, ← human(dracula)
specifies that dracula is not human, that is, that human(dracula) is false. Denials are used as queries or goals. The reasons for this will become clear soon.
Typically, the database consists of a set of clauses. A set of clauses
{c1 , · · · , cn } specifies that all clauses in the set are true, that is, the set denotes a conjunction c1 , · · · , cn of the clauses ci . We will sometimes refer to
such sets of clauses as (clausal) theories, databases, or knowledge bases.
A substitution θ = {V1 /t1 , · · · , Vn /tn } is an assignment of terms t1 , ..., tn to
variables V1 , ..., Vn , for instance, {A/lloyd, Pub/logic for learning}. The instantiated formula F θ, where F is a term, atom, or clause and θ = {V1 /t1 , · · · , Vn /tn }
a substitution, is the formula obtained by simultaneously replacing all variables V1 , ..., Vn in F by the terms t1 , ..., tn . For instance, the formula card(X, Y)θ,
where θ = { X/j, Y/diamonds}, denotes card(j, diamonds).

2.3 The Semantics of Clausal Logic — Model Theory
Now that the syntax of clausal logic has been defined, its semantics can be
introduced.
The semantics of logic are based on the concept of an interpretation. Interpretations are defined using the notion of a domain, which contains the set of
objects that exist (in the interpretation). Roughly speaking, an interpretation
of a set of clauses is an assignment that maps
• constants onto objects in the domain,
• function symbols f /n onto n-ary functions defined on the domain; these functions map an n-tuple of objects onto an object in the domain,
• predicates p/n onto n-ary relations defined on the domain; these relations represent a set of n-tuples of objects.

For instance, consider the constant john. It could represent a particular person
named John Doe Jr. The function symbol fatherOf/1 could then represent the
function that maps every person to his or her father. It could map John Doe
Jr. onto John Doe Sr. The predicate parent/2 could then map to the parent
relationship. These mappings then define the truth-values of atoms in the
interpretation. For instance, the atom parent(fatherOf(john), john) would map
to true, because John Doe Sr. (fatherOf(john)) is a parent of John Doe Jr.
(john) in our interpretation, assuming that the tuple (John Doe Sr., John Doe
Jr.) is in the relation denoted by parent.
Because it is inconvenient and complicated to work with general interpretations, and because this is not really needed when working with clausal logic,
we will restrict our attention to a special class of interpretations named after
the French logician Jacques Herbrand. A Herbrand interpretation uses the


Herbrand domain as its domain. The Herbrand domain is defined as the set
of all ground terms that can be constructed using the constants and function
symbols that occur in the set of clauses considered.
Example 2.5. The Herbrand universe in the bibliographic database of Ex. 2.1
consists of all constants appearing in the facts, i.e.,
{quinlan, lloyd, muggleton, . . . , logic for learning, . . . , ilp theory and methods}.
Example 2.6. Now consider the clauses defining the natural numbers:
nat(0) ←
nat(succ(X)) ← nat(X)
The first clause states that 0 is a natural number; the second one that succ(X)
is a natural number if X is one. The Herbrand universe in this case is the
infinite set {0, succ(0), succ(succ(0)), . . .}.
A Herbrand interpretation now maps each ground term to itself. So each
ground term refers to itself; for instance, john now refers to the object john
instead of to some person such as John Doe. In a similar manner, predicates
are mapped onto relations over the Herbrand universe. The result is that a
Herbrand interpretation can be viewed as the set of ground atoms that are true
in the interpretation. This is the view that we will be employing throughout
this book when talking about interpretations.
Definition 2.7. A Herbrand interpretation of a set of clauses is a set of
ground atoms (over the constant, function and predicate symbols occurring
in the set of clauses).
All ground atoms in the interpretation are assumed to be true, and all others
are assumed to be false.
Example 2.8. One possible Herbrand interpretation I1 over the bibliographic
database of Ex. 2.1 is:
{authorOf(russell, logic for learning), authorOf(russell, quinlan)}
The interpretation I2 consists of all the ground atoms occurring in the bibliographic database.
Example 2.9. Similarly, for the above specified natural numbers, I3 , I4 , and
I5 defined below are Herbrand interpretations.
I3 = {}
I4 = {nat(0), nat(succ(0))}
I5 = {nat(0), nat(succ(0)), nat(succ(succ(0))), . . .}.


Some interpretations do not really reflect the properties of the clauses that
we have written down. For instance, the first interpretations in both the bibliographic database and the definition of the natural numbers are intuitively
impossible given the clauses that we have written down. This motivates the
following definition.
Definition 2.10. A Herbrand interpretation I is a Herbrand model for a set
of clauses C if and only if for all clauses h1 ; · · · ; hn ← b1 , · · · , bm ∈ C and
for all ground substitutions θ: {b1 θ, · · · , bm θ} ⊆ I → {h1 θ, · · · , hn θ} ∩ I ≠ ∅.
So, a Herbrand interpretation I is a model for a clause c if for all substitutions θ for which body(c)θ is true in I, head(c)θ is also true in I. If an
interpretation is a model for a clause (or set of clauses), the interpretation
satisfies the clause (or set of clauses); otherwise, the interpretation violates
the clause (or set of clauses). Clausal theories that possess a Herbrand model
are called satisfiable; those that do not possess one are called unsatisfiable.
When the context is clear, we will talk about interpretations and models
rather than Herbrand interpretations and Herbrand models.
Example 2.11. Reconsider our bibliographic database and the interpretation
I1 . I1 is not a model for the database because there exists a clause f
authorOf(russell, ai a modern approach) ←
such that body(f ) = {} ⊆ I1 but
head(f ) = {authorOf(russell, ai a modern approach)} ∩ I1 = ∅.
At the same time, it is easy to see that I2 is a model for the bibliographic
database.
Example 2.12. For the natural numbers, I3 is not a model, because of the fact
nat(0) ←. Neither is I4 , because of the recursive clause c for which there exists
a substitution θ ={X/succ(0)} for which body(c)θ = {nat(succ(0))}⊆ I4 but
head(c)θ ={nat(succ(succ(0)))}∩I4 = ∅. Finally, I5 is a model for the two
clauses defining nat/1.
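Definition 2.10 can be checked mechanically. The sketch below is restricted to ground clauses (so the enumeration of substitutions in the definition is replaced by explicitly listing the relevant ground instances), and it assumes a clause is represented as a pair Heads-Body of lists of ground atoms and an interpretation as a list of ground atoms; these representation choices are assumptions made for this sketch.

:- use_module(library(lists)).    % for member/2

% satisfies(+I, +Clause): the body is not fully true in I, or some head atom is true in I.
satisfies(I, Heads-Body) :-
    (   member(B, Body), \+ member(B, I)
    ;   member(H, Heads), member(H, I)
    ), !.

% is_model(+I, +Clauses): I satisfies every clause.
is_model(I, Clauses) :-
    forall(member(C, Clauses), satisfies(I, C)).

% The relevant ground instances of the nat/1 clauses, checked against I4 of Ex. 2.9:
% ?- is_model([nat(0), nat(succ(0))],
%             [ [nat(0)]-[],
%               [nat(succ(0))]-[nat(0)],
%               [nat(succ(succ(0)))]-[nat(succ(0))] ]).
% false, because the last instance is violated (cf. Ex. 2.12).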
Model theory is the basis for reasoning about the declarative semantics
of logical formulae, which relies on the notion of logical entailment. Logical
entailment defines when one formula is a consequence of, or follows from,
another one.
Definition 2.13. Let C be a set of clauses and c be a clause. C logically
entails c, notation C |= c, if and only if all models of C are also models of c.
In this definition, not only the Herbrand interpretations (and models) but all
interpretations and models are considered. The following theorems, however,
allow us to focus our attention on Herbrand interpretations and models to
reason about the satisfiability of a particular formula.


Theorem 2.14. (Proposition 3.30 from [Nienhuys-Cheng and de Wolf, 1997])
Let T be a set of clauses. Then T has a model if and only if T has a Herbrand
model.
To generalize the applicability of this result to entailment between two formulae C and c, the theorem below can be used.
Theorem 2.15. (This follows from Proposition 3.31 [Nienhuys-Cheng and
de Wolf, 1997]) Let C be a set of clauses and c be a clause. Then C |= c if
and only if C ∧ ¬c is unsatisfiable, that is, if C ∧ ¬c |= □.
The empty clause ←, sometimes written as □, is the clause whose head and
body are both empty. This clause is unsatisfiable because its body is always
true and its head is never true, regardless of the interpretation.
Example 2.16. Reconsider our bibliographic database B consisting of all facts
listed in Ex. 2.1. Then
B |= authorOf(lloyd, logic for learning) ←
because
B ∧ ¬ authorOf(lloyd, logic for learning)
is unsatisfiable as authorOf(lloyd, logic for learning) and its negation are contradictory.
Example 2.17. Similarly, let N consist of the two clauses defining nat/1. Then
N |= nat(0) and also N |= nat(succ(0)).
An algorithm to generate a (Herbrand) model of a set of clauses is listed in
Algo. 2.1. It assumes that all clauses are range-restricted, i.e., that all variables
appearing in the head of a clause also appear in its body. Range-restrictedness
implies that all facts are ground.
Algorithm 2.1 Generating a model of a set of clauses
M := ∅
while M is not a model of C do
    if there is a denial ← b1 , · · · , bm in C that is violated by M then
        backtrack
    end if
    select h1 ; · · · ; hn ← b1 , · · · , bm from C and θ (choice point) such that
        {b1 θ, · · · , bm θ} ⊆ M , and
        {h1 θ, · · · , hn θ} ∩ M = ∅
    add one of the hi θ to M (choice point)
end while

The algorithm starts with the empty model and repeatedly expands it by
non-deterministically adding facts belonging to the head of a violated clause.


This process continues until M either becomes a model of the theory, or until
a denial is violated. When a denial is violated, the model cannot be expanded
by adding facts, which explains why the algorithm then backtracks to earlier
choice points. Observe that the algorithm can fail when no model exists, and
also that it can loop infinitely while attempting to construct an infinite model.
There exists also a very elegant and short implementation of this algorithm in
Prolog. It is called Satchmo and was published by Manthey and Bry [1987];
it is described in detail in [Flach, 1994].
Example 2.18. Consider the following clausal theory:
human(X) ← male(X)
human(X) ← female(X)
female(X); male(X) ← human(X)
human(john) ←
One possible trace of Algo. 2.1 generates the following sequence of interpretations:
{}
{human(john)} using the last clause.
{human(john), female(john)} using the third clause.
Another model for this example is {human(john), male(john)}.
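For ground clauses, Algo. 2.1 itself can be sketched as a Prolog meta-program in which Prolog's backtracking plays the role of the algorithm's choice points. Clauses are again Heads-Body pairs of lists of ground atoms (a representation assumed here, not taken from the text); a denial is a pair with an empty head list.

:- use_module(library(lists)).    % for member/2

model_of(Clauses, Model) :-
    extend(Clauses, [], Model).

extend(Clauses, I, Model) :-
    (   member(Heads-Body, Clauses),
        forall(member(B, Body), member(B, I)),       % body true in I
        \+ (member(H, Heads), member(H, I))          % but no head atom true: violated
    ->  Heads \== [],                                % a violated denial forces backtracking
        member(Choice, Heads),                       % choice point: add one head atom
        extend(Clauses, [Choice|I], Model)
    ;   Model = I                                    % nothing violated: I is a model
    ).

% The ground instances (over john) of the clauses of Ex. 2.18:
% ?- model_of([ [human(john)]-[male(john)],
%               [human(john)]-[female(john)],
%               [female(john), male(john)]-[human(john)],
%               [human(john)]-[] ], M).
% M = [female(john), human(john)] ;
% M = [male(john), human(john)].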
Example 2.19. Reconsider the bibliographic database together with the clause
defining the predicate cites/2. Algo. 2.1 would generate the interpretation consisting of all ground facts for the predicates reference/2, cites/2 and authorOf/2
that were listed in the earlier examples.
Example 2.20. Reconsider the definition of the natural numbers using the
predicate nat/1. For these clauses, the algorithm does not terminate, because
it attempts to generate the infinite model I5 defined above.
Exercise 2.21. Does the following clausal theory have a model? If so, generate one.
← student(X), vampire(X)
← student(X), professor(X)
← female(X), male(X)
being(dracula) ←
clever(X); student(X) ← being(X)
female(X); male(X); vampire(X) ← being(X)
student(X); professor(X); vampire(X) ← being(X)
Exercise 2.22. Use the theorem prover Satchmo implemented in Prolog
(cf. [Manthey and Bry, 1987] or [Flach, 1994]) to generate more models of the
theory in the previous example. (This exercise requires that you have access
to a Prolog implementation. Some excellent Prolog implementations, such as SWI-Prolog and YAP-Prolog, are freely available.)


Some theories (sets of clauses) have multiple models. Furthermore, one
model can be a subset of another model, which motivates the notion of a
minimal model. A Herbrand model is minimal if and only if none of its subsets
is a model.
Example 2.23. Reconsider the theory listed in Ex. 2.18. The two models listed
there are minimal. However, the model {human(john), male(john), female(john)}
is not minimal and this interpretation would not be a model if the clause
← female(X), male(X) belonged to the theory.
When restricting one’s attention to definite clauses, which is the subset of
clausal logic most often applied in logic programming, the minimal model is
unique, which explains why it is called the least Herbrand model. The least
Herbrand model of a set of clauses C will be denoted as M (C). The least
Herbrand model is an important concept because it captures the semantics of
the set of clauses. It consists of all ground atoms that are logically entailed
by the definite clause theory. This can be formally specified as:
Property 2.24. Let C be a set of definite clauses and f be a ground fact. Then
C |= f if and only if f ∈ M (C).
The least Herbrand model also captures the intuitive meaning of the theory. This is best illustrated by the bibliographic database, where the least
Herbrand model is that sketched in Ex. 2.19, which indeed contains exactly
those facts which would be generated using the SQL statement, and which
one would intuitively expect.
When applied to (range-restricted) definite clause theories, Algo. 2.1 will
indeed generate the least Herbrand model of the theory. However, in this case
one can also use the simplified version sketched in Algo. 2.2.
Algorithm 2.2 Computing the least Herbrand model of a definite clause theory
M0 := ∅
M1 := {f | f ← is a fact in C}
i := 1
while Mi ≠ Mi−1 do
    Mi+1 := ∅
    for all h ← b1 , · · · , bn ∈ C do
        for all θ such that {b1 θ, · · · , bn θ} ⊆ Mi do
            add hθ to Mi+1
        end for
    end for
    Mi+1 := Mi ∪ Mi+1
    i := i + 1
end while


The key difference between this algorithm and the previous one is that it
processes all violated clauses in parallel to generate the next model. It can
be optimized by adding {b1 θ, · · · , bn θ} ∩ (Mi − Mi−1 ) ≠ ∅ as an additional
constraint in the inner for loop. This way the same conclusions hθ will not be
regenerated in every iteration, which will remove some redundant computations.
Example 2.25. Suppose the definite clause theory is:
ancestor(X, Y) ← parent(X, Y)
ancestor(X, Y) ← parent(X, Z), ancestor(Z, Y)
parent(rose, luc) ←
parent(leo, rose) ←
The algorithm then computes the following sequence of models:
M0 = ∅
M1 = {parent(rose, luc), parent(leo, rose)}
M2 = M1 ∪ {ancestor(rose, luc), ancestor(leo, rose)}
M3 = M2 ∪ {ancestor(leo, luc)}
M4 = M3
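Algorithm 2.2 can itself be sketched as a Prolog meta-program, representing each definite clause as a Head-Body pair with Head a single atom and Body a list (an assumed encoding); Prolog's own unification then enumerates the substitutions θ of the inner loop. On the parent/ancestor clauses of this example the fixpoint reproduces M4.

:- use_module(library(lists)).    % for member/2, append/3

body_true([], _M).
body_true([B|Bs], M) :- member(B, M), body_true(Bs, M).

lhm(Clauses, Model) :-
    lhm_iter(Clauses, [], Model).

lhm_iter(Clauses, M, Model) :-
    findall(H, ( member(H-Body, Clauses), body_true(Body, M) ), Hs),
    append(M, Hs, All),
    sort(All, M1),                      % sort/2 also removes duplicates
    (   M1 == M -> Model = M            % fixpoint reached: the least Herbrand model
    ;   lhm_iter(Clauses, M1, Model)
    ).

% ?- lhm([ ancestor(X, Y)-[parent(X, Y)],
%          ancestor(U, V)-[parent(U, W), ancestor(W, V)],
%          parent(rose, luc)-[],
%          parent(leo, rose)-[] ], M).
% M = [ancestor(leo, luc), ancestor(leo, rose), ancestor(rose, luc),
%      parent(leo, rose), parent(rose, luc)].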

Exercise 2.26. Specify the least Herbrand model of the following theory:
plus(0, X, X) ← nat(X)
plus(succ(X), Y, succ(Z)) ← plus(X, Y, Z)
nat(0) ←
nat(succ(X)) ← nat(X)

2.4 Inference with Clausal Logic — Proof Theory
Now that the semantics of clausal logic has been defined, we discuss how
to perform inference in clausal logic. Deductive inference is concerned with
deciding whether one formula F logically entails another one G, that is, deciding whether F |= G. When working with clausal logic, the typical inference
procedure is based on the deductive inference rule known as resolution. The
resolution principle was introduced by Robinson [1965]. It is employed by
the large majority of theorem provers for first-order logic. Even though many
variants of the resolution principle exist, we will restrict our attention to resolution for Horn clauses and SLD-resolution, because this form of resolution
underlies the programming language Prolog on which the vast majority of
logical learning approaches is based.
Let us start by introducing resolution for propositional logic. In propositional logic, all predicates have arity 0, which implies that there are no
constants, variables or structured terms that need to be taken into account.


Given the clauses
l ← b1 , · · · , bn and h ← c1 , · · · , ci−1 , l, ci , · · · , cm
the propositional resolution operator infers the resolvent
h ← c1 , · · · , ci−1 , b1 , · · · , bn , ci , · · · , cm        (2.1)
The rule is sometimes displayed as
l ← b1 , · · · , bn        h ← c1 , · · · , ci−1 , l, ci , · · · , cm
---------------------------------------------------------------
h ← c1 , · · · , ci−1 , b1 , · · · , bn , ci , · · · , cm        (2.2)
which indicates that if the clauses above the line are presented, the ones below
the line may be inferred. We sometimes use the notation
l ← b1 , · · · , bn and h ← c1 , · · · , ci−1 , l, ci , · · · , cm ⊢res h ← c1 , · · · , ci−1 , b1 , · · · , bn , ci , · · · , cm        (2.3)
to denote a particular resolution step.
This inference step is graphically illustrated in Fig. 2.1, where the operator
takes the typical V form. Notice that this rule also holds when h = {}, that
is, when working with a denial instead of a definite clause.
Fig. 2.1. The propositional resolution operator: resolving l ← b1 , ..., bn with h ← c1 , · · · , ci−1 , l, ci , · · · , cm yields h ← c1 , · · · , ci−1 , b1 , ..., bn , ci , · · · , cm

Example 2.27.
mammal ← rabbit and animal ← mammal ⊢res animal ← rabbit
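The propositional resolution step itself is easy to write down as a Prolog predicate over Head-Body pairs (again an assumed representation, with a single head atom); since a body is a conjunction, the order in which the inherited conditions appear is immaterial, so the sketch simply appends them.

:- use_module(library(lists)).    % for select/3, append/3

% resolve(+Clause1, +Clause2, -Resolvent): Clause1's head occurs in Clause2's body.
resolve(L-Bs, H-Cs, H-NewBody) :-
    select(L, Cs, Rest),
    append(Bs, Rest, NewBody).

% ?- resolve(mammal-[rabbit], animal-[mammal], R).
% R = animal-[rabbit].          % the resolvent of Ex. 2.27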


The resolution operator is sound, which means that whenever c1 ∧ c2 ⊢res c
holds, c1 ∧ c2 |= c holds as well.
When more than one resolution step is performed, one talks about resolution derivations or proofs. A resolution proof of a clause c from a theory
T = {c1 , ..., cn }, notation T ⊢ c, is a sequence of resolution steps c1,1 ∧ c1,2 ⊢res
c1 , · · · , ck,1 ∧ ck,2 ⊢res ck where ck = c and ci,j ∈ T ∪ {c1 , · · · , ci−1 }. Resolution
proofs can be graphically represented as trees; see Fig. 2.2 for an example.
Fig. 2.2. A resolution derivation: resolving pos ← foursided, red with foursided ← rectangle yields pos ← rectangle, red, which is then resolved with rectangle ← square to yield pos ← square, red

Example 2.28. A resolution proof of the clause pos ← square, red (red squares
are positive) is shown in Fig. 2.2. It assumes the following theory T is given:
foursided ← rectangle
rectangle ← square
pos ← foursided, red
A popular technique in mathematics when proving theorems is to assume
that a theorem is false and to prove that this leads to a contradiction. This
technique also applies to logical inference and theorem proving, where it is
known as proving by refutation. The idea of refutation was already formulated
in Theorem 2.15, where it was stated that C |= c if and only if C ∧ ¬c |= □.
A proof by refutation is now a resolution derivation that ends in the empty
clause □ or ←, which is unsatisfiable.
Resolution is not a complete operator for definite clause logic (even in the
propositional case). Completeness would require that whenever C |= c there
also exists a resolution derivation of c from C.
Example 2.29. Indeed,
mammal ← rabbit |= mammal ← rabbit, brown


but it is impossible to derive the second clause by resolution from the first
one.
Fortunately, resolution is refutation complete, as indicated in the following
property.
Property 2.30. (Refutation completeness; cf. Theorem 5.18 of [Nienhuys-Cheng
and de Wolf, 1997]) Let C be a set of clauses. Then C is unsatisfiable, that is,
C |= □, if and only if there exists a resolution derivation of the empty clause
□ starting from C.
Due to the soundness and refutation completeness of resolution (for propositional Horn clauses) we now have an effective procedure for deciding logical
entailment. Deciding whether a set of Horn clauses logically entails a Horn
clause, that is whether C |= h ← b1 , · · · , bn , can be realized as follows:
• negate the clause h ← b1 , · · · , bn , which yields the clauses T = { ← h, b1 ←, · · · , bn ←}
• try to derive the empty clause □ from C ∧ T , that is, decide whether C ∧ T ⊢ □.

Example 2.31. Reconsider the clauses in Ex. 2.28. We now prove by refutation
that T |= pos ← square, red. To this end, we first negate the clause, which yields
the clauses:
square ← and red ← and ← pos
Figure 2.3 shows how to derive □ from these clauses and T .
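For propositional Horn clauses this refutation procedure amounts to repeatedly resolving the current denial with a program clause until the empty clause is reached. The sketch below writes a definite clause as Head-Body with Head a single atom, as in the resolution sketch above (an assumed encoding), and represents the denial simply by its list of body atoms.

:- use_module(library(lists)).    % for member/2, append/3

refute(_Clauses, []).                      % the empty clause has been derived
refute(Clauses, [G|Gs]) :-
    member(G-Body, Clauses),               % a clause whose head resolves with G
    append(Body, Gs, NewGoals),
    refute(Clauses, NewGoals).

% T of Ex. 2.28 together with the facts obtained by negating pos ← square, red:
% ?- refute([ foursided-[rectangle], rectangle-[square], pos-[foursided, red],
%             square-[], red-[] ],
%           [pos]).
% true, i.e., the empty clause □ is derivable, so T |= pos ← square, red.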
So far, we have introduced resolution for propositional logic, but resolution
becomes more interesting when clauses contain terms. The key difference is
that unification must be used in this case. Unification is needed to make a
literal in one clause match a literal in the other clause. For instance, when
trying to resolve
father(X, Y) ← parent(X, Y), male(X)
with
← father(luc, maarten)
it is necessary to unify the literals father(X, Y) and father(luc, maarten) using
the substitution {X/luc, Y/maarten} to yield the clause
← parent(luc, maarten), male(luc).
Unification was already implicitly used in Algos. 2.1 and 2.2. Formally, a
unifier of two expressions f1 and f2 (terms or atoms) is a substitution θ such
that f1 θ = f2 θ.
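In Prolog, unification is built in: the goal f1 = f2 succeeds exactly when the two terms unify, and the variable bindings it leaves behind form the unifying substitution (unify_with_occurs_check/2 is the occurs-check variant). The queries below reproduce the unifiers discussed here and in Ex. 2.32.

% ?- father(X, Y) = father(luc, maarten).
% X = luc, Y = maarten.
% ?- plus(succ(succ(0)), succ(X), succ(Y)) = plus(A, B, succ(succ(C))).
% A = succ(succ(0)), B = succ(X), Y = succ(C).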


Fig. 2.3. A proof by refutation: resolving ← pos with pos ← foursided, red yields ← foursided, red; resolving with foursided ← rectangle yields ← rectangle, red; with rectangle ← square yields ← square, red; with square ← yields ← red; and with red ← yields the empty clause □

Example 2.32. For instance, to unify father(luc, X) and father(Y, soetkin) one
can use the substitution {Y/luc, X/soetkin}, and to unify the atoms
plus(succ(succ(0)), succ(X), succ(Y)) and plus(A, B, succ(succ(C))) one can use
θ1 = {A/succ(succ(0)), B/succ(X), Y/succ(C)}, or
θ2 = {A/succ(succ(0)), B/succ(0), X/0, Y/suc