
Data Mining
Code
82033
Academic Unit
Instituto Superior de Estatística e Gestão de Informação
Credits
6.0
Professor in charge
Fernando José Ferreira Lucas Bação
Weekly hours
45.0
Language of instruction
Portuguese. If Erasmus students are enrolled, classes will be taught in English.
Objectives
The goal of the course is to introduce students to methods for solving complex problems
that involve the use and understanding of large sets of data,
an area globally known as Data Mining.
Among the methods that will be studied, particular relevance will be given to Evolutionary
Algorithms (Holland, 1975; Goldberg, 1989) and Neural Networks (Haykin, 1999).
The two main features these methods share are the ability to return approximate solutions
to a problem (when they are not able to find a perfect one) and the ability to deal
with imperfect, imprecise, and incomplete information.
Other computational methods are also discussed in this course, such as various forms of
Local Search (Hill Climbing (Aarts and Korst, 1998), Simulated Annealing (Kirkpatrick et al., 1983),
Tabu Search (Glover and Laguna, 1997)) and Particle Swarm Optimization (Clerc, 2006),
a method for the optimization of continuous values inspired by the collective
behavior of swarms (and belonging to the field of Swarm Intelligence).
All the methods presented in this course are also traditionally classified as belonging to
the wider field of Machine Learning (Mitchell, 1997). Thus an important part of this course
is dedicated to introducing the foundations of Machine Learning.
Prerequisites
The course will be taught in English.
No particular preparatory knowledge is required to attend this course,
beyond a general knowledge of the fundamentals of Computer Science and the basics
of Mathematics, in particular Statistics.
Contents
1 Introduction and Motivation
- Complex problems and NP-hard problems.
- Large datasets and complex problems
- What is data mining?
- Definition and foundations of Machine Learning.
- Outline of typical Data Mining applications.
- Optimization Problems and their complexity. Are they hard? Why?
- No Free Lunch Theorem.
- Introduction to Local Search.
- Hill Climbing.
- Fitness Landscapes: their meaning and importance; their structure and topology.
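As an illustrative aside, the Hill Climbing scheme listed above can be sketched in a few lines of Python. The OneMax fitness function (number of 1-bits), the single-bit-flip neighbourhood, and all parameter values are assumptions chosen for the sketch, not material prescribed by the course:

```python
import random

def onemax(bits):
    # Fitness: number of 1-bits (the classic OneMax toy problem).
    return sum(bits)

def hill_climb(n_bits=20, max_iters=500, seed=0):
    rng = random.Random(seed)
    current = [rng.randint(0, 1) for _ in range(n_bits)]
    for _ in range(max_iters):
        # Neighbourhood: solutions at Hamming distance 1 from the current one.
        i = rng.randrange(n_bits)
        neighbour = current[:]
        neighbour[i] ^= 1
        # Move only if the neighbour is strictly better (greedy ascent).
        if onemax(neighbour) > onemax(current):
            current = neighbour
    return current
```

Because moves are accepted only when strictly better, the search can stall on a local optimum of a less friendly fitness landscape, which is what motivates methods such as Simulated Annealing and Tabu Search.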
- Simulated Annealing
- general definitions and functioning.
- theory and asymptotic behavior.
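A minimal Simulated Annealing sketch, again on the illustrative OneMax problem (maximisation); the geometric cooling schedule and all constants are assumptions for the sketch:

```python
import math
import random

def simulated_annealing(n_bits=20, t0=2.0, cooling=0.99, iters=800, seed=4):
    # A worsening neighbour is accepted with probability exp(delta / T);
    # T decreases geometrically, so the search turns greedy over time.
    rng = random.Random(seed)
    current = [rng.randint(0, 1) for _ in range(n_bits)]
    best = current[:]
    t = t0
    for _ in range(iters):
        i = rng.randrange(n_bits)
        neighbour = current[:]
        neighbour[i] ^= 1  # single-bit-flip neighbourhood
        delta = sum(neighbour) - sum(current)
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current = neighbour
        if sum(current) > sum(best):
            best = current[:]
        t *= cooling  # geometric cooling schedule
    return best
```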
- Tabu Search
- general definitions and functioning.
- short term and long term memory.
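The short-term memory idea can be sketched as follows: recently flipped positions are kept "tabu" for a fixed number of iterations (the tabu tenure). The OneMax problem and the tenure value are assumptions for the sketch:

```python
import random

def tabu_search(n_bits=12, iters=100, tenure=5, seed=1):
    rng = random.Random(seed)
    current = [0] * n_bits
    best = current[:]
    tabu = {}  # position -> last iteration at which it is still tabu
    for it in range(iters):
        # Evaluate all non-tabu single-bit-flip neighbours and take the best
        # one (Tabu Search accepts the best neighbour even if it is worse).
        candidates = []
        for i in range(n_bits):
            if tabu.get(i, -1) >= it:
                continue  # short-term memory forbids revisiting this move
            neighbour = current[:]
            neighbour[i] ^= 1
            candidates.append((sum(neighbour), i, neighbour))
        if not candidates:
            break
        fit, i, current = max(candidates, key=lambda c: (c[0], -c[1]))
        tabu[i] = it + tenure  # forbid re-flipping position i for a while
        if sum(current) > sum(best):
            best = current[:]
    return best
```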
- Outline of the basic concepts of Darwinian Evolution Theory.
- Genetic operators.
- General functioning of a simple Genetic Algorithm (GA).
- Genotype and phenotype.
- Selection mechanisms.
- Asymptotic behavior of GAs.
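The general functioning of a simple GA can be sketched in Python; the binary representation, OneMax fitness, tournament size, and rates below are illustrative assumptions, not the course's prescribed settings:

```python
import random

def simple_ga(n_bits=16, pop_size=30, generations=60, seed=2):
    # Minimal generational GA on OneMax: tournament selection,
    # one-point crossover, per-bit mutation.
    rng = random.Random(seed)
    fitness = sum  # fitness = number of 1-bits in the genotype
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        new_pop = []
        while len(new_pop) < pop_size:
            # Binary tournament selection picks each parent.
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            # One-point crossover of the two parent genotypes.
            cut = rng.randrange(1, n_bits)
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation with probability 1/n_bits per position.
            child = [bit ^ (rng.random() < 1.0 / n_bits) for bit in child]
            new_pop.append(child)
        pop = new_pop  # generational replacement
    return max(pop, key=fitness)
```

Here the genotype is the bit string and the phenotype coincides with it; in real applications a decoding step maps one to the other.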
- Multiobjective optimization
- VEGA (Schaffer)
- The concepts of Pareto optimal set and Pareto front.
- Experimental evaluation of the performance of a GA.
- A general discussion of the possible applications.
- Introduction and motivation.
- Continuous optimization and real-valued parameters optimization.
- Particle Swarm Optimization (PSO)
- Basic (or "canonical") PSO version.
- Possible extensions and their advantages and drawbacks.
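The basic ("canonical") PSO version can be sketched as below, minimising the sphere function f(x) = Σ x_i²; the inertia and acceleration coefficients and the search bounds are assumptions chosen for the sketch:

```python
import random

def pso_sphere(dim=2, n_particles=15, iters=200, seed=3):
    rng = random.Random(seed)
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia, cognitive and social coefficients
    f = lambda x: sum(v * v for v in x)  # sphere function to minimise
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]   # personal best position of each particle
    gbest = min(pbest, key=f)[:]  # global best position of the swarm
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity update: inertia + cognitive + social components.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            if f(pos[i]) < f(pbest[i]):
                pbest[i] = pos[i][:]
                if f(pbest[i]) < f(gbest):
                    gbest = pbest[i][:]
    return gbest
```

Unlike a GA, there is no selection or crossover: each particle keeps its identity and is pulled toward its own and the swarm's best positions.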
- Representation and main differences with GAs.
- Genetic operators.
- Fitness calculation methods.
- Properties of Closure and Sufficiency.
- The concept of Steady State.
- Automatically Defined Functions (ADF).
- Genetic Programming (GP) benchmarks (even parity, multiplexer, symbolic regression, artificial ant on the Santa Fe trail).
- Open Issues in GP.
- A general discussion of the possible applications.
- The concept of learning.
- Learning a function.
- The concept of generalization.
- Supervised and unsupervised learning.
- The case of classification and clustering.
- Performance of a classifier.
- Data splitting.
- Training, test and validation sets. Definitions and motivations.
- Cross-validation and its variants.
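The data-splitting idea behind k-fold cross-validation can be sketched as an index partition; assigning indices round-robin without shuffling is an assumption of this sketch:

```python
def kfold_indices(n, k):
    # Partition indices 0..n-1 into k folds; each fold serves once as the
    # test set while the remaining k-1 folds form the training set.
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        splits.append((train, test))
    return splits
```

Each of the n examples is tested exactly once, so the k test-set scores can be averaged into a single performance estimate.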
- Precision and Recall. True positives, true negatives, false positives, false negatives. F-measure and K-statistic.
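These counts and measures can be computed directly from the predictions; the sketch below assumes binary labels with 1 as the positive class and that at least one positive is predicted and present (no zero-division guard):

```python
def classification_metrics(y_true, y_pred):
    # Confusion-matrix counts for a binary classifier (positive class = 1).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)           # fraction of predicted positives that are real
    recall = tp / (tp + fn)              # fraction of real positives that are found
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_measure
```

For example, with y_true = [1, 1, 1, 0, 0, 0] and y_pred = [1, 1, 0, 1, 0, 0] there are 2 true positives, 1 false positive and 1 false negative, giving precision = recall = F-measure = 2/3.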
- The concept of feature and its importance.
- The importance of feature selection.
- Outline of the main feature selection methods.
- Principal component based feature selection.
- Correlation based feature selection.
- Introduction and Motivations.
- A very light outline of the physical structure of human brain. Neurons.
- Artificial neuron.
- Perceptron
- Single neuron model.
- Perceptron learning rule.
- Convergence theorem of perceptron.
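The perceptron learning rule can be sketched as below; the step activation, the learning rate, and the logical-AND training data are illustrative assumptions:

```python
def perceptron_train(samples, epochs=20, lr=1.0):
    # Perceptron learning rule: w <- w + lr * (target - output) * x,
    # with the bias handled as an extra, always-1 input.
    n_inputs = len(samples[0][0])
    w = [0.0] * (n_inputs + 1)
    for _ in range(epochs):
        for x, target in samples:
            xb = list(x) + [1.0]  # append the bias input
            out = 1 if sum(wi * xi for wi, xi in zip(w, xb)) >= 0 else 0
            err = target - out    # 0 when the example is already correct
            w = [wi + lr * err * xi for wi, xi in zip(w, xb)]
    return w

def perceptron_predict(w, x):
    xb = list(x) + [1.0]
    return 1 if sum(wi * xi for wi, xi in zip(w, xb)) >= 0 else 0

# Logical AND is linearly separable, so the convergence theorem
# guarantees the rule finds a separating hyperplane.
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights = perceptron_train(and_data)
```

The same loop run on XOR would never converge, since no single hyperplane separates its classes; this is the limitation addressed later by hidden layers.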
- The main activation functions.
- Networks of neurons.
- Adaline
- General structure and functioning.
- Delta learning rule. Gradient descent.
- Linearly separable and non-linearly separable problems.
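The delta rule differs from the perceptron rule in that the error is taken on the linear output before thresholding, making it a gradient descent on the squared error. A sketch, with the learning rate and the toy linear data as assumptions:

```python
def adaline_train(samples, epochs=100, lr=0.1):
    # Delta (Widrow-Hoff) rule: w <- w + lr * (target - w.x) * x,
    # with the bias handled as an extra, always-1 input.
    n_inputs = len(samples[0][0])
    w = [0.0] * (n_inputs + 1)
    for _ in range(epochs):
        for x, target in samples:
            xb = list(x) + [1.0]
            out = sum(wi * xi for wi, xi in zip(w, xb))  # linear, not thresholded
            err = target - out
            w = [wi + lr * err * xi for wi, xi in zip(w, xb)]
    return w

# Fitting the exactly linear data y = 2x + 1 drives the error toward zero,
# so the learned weights approach slope 2 and bias 1.
line_data = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0), ((3.0,), 7.0)]
learned = adaline_train(line_data)
```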
- Hidden layer of neurons.
- Universal Approximation Theorem.
- Backpropagation. Formal justification and implementation.
- Cyclic (or Recursive) Neural Networks.
- Jordan networks.
- Elman networks.
- Hopfield networks. The concept of associative memory. Hebb learning rule.
- Non-Supervised Neural Networks
- Competitive learning networks.
- Kohonen networks (Self-Organizing Maps).
Bibliography
A. Tettamanzi, M. Tomassini, "Soft Computing", Springer
M. Mitchell, "Introduzione agli algoritmi genetici", Apogeo
W. Banzhaf, P. Nordin, R. Keller and F. Francone, "Genetic Programming, an introduction", Morgan Kaufmann
T. Mitchell, "Machine Learning", McGraw-Hill
Teaching method
Theoretical classes will be taught using the blackboard
and projected slides. Practical classes will be held
in computer rooms and laboratories, allowing the students
to apply the concepts explained previously.
Assessment method
The examination consists of a written test covering
all the contents of the course.