Faculdade de Ciências e Tecnologia

Data Analytics and Mining

Code

11563

Academic unit

Faculdade de Ciências e Tecnologia

Department

Departamento de Informática

Credits

6.0

Teacher in charge

Joaquim Francisco Ferreira da Silva, Pedro Manuel Corrêa Calvente Barahona

Weekly hours

4

Teaching language

Portuguese

Objectives

Knowledge

  • Understand the paradigms and challenges of Data Analytics and Text Mining.
  • Learn the fundamental methods and their applications in extracting patterns from data. Understand data features, model selection, and the interpretation of model results.
  • Understand the advantages and disadvantages of the different methods.

Skills

  • Implement and adapt Data Analytics and Text Mining algorithms.
  • Model real data experimentally.
  • Interpret and evaluate experimental results.

Competences

  • Evaluate the suitability of each method for case studies.
  • Critically evaluate the results.
  • Show autonomy and self-reliance in applying, and pursuing further studies in, Data Analytics and Text Mining.

Subject matter

Introduction

Data Analytics

What is data? Examples of data analytics tasks and different perspectives on them

Visualization as a convenient tool for business analytics

Text Mining

Structured or unstructured data? Why mine texts?

What types of problems can be solved?

  • Module I

Data Understanding

  • 1D Summarization and Visualization of a Single Feature
  • 2D Analysis: Correlation and Visualization of Two Quantitative Features
  • Verification of structure in data

Data Preparation

  • Variable cleaning
  • Feature creation
  • Why normalization matters

Descriptive Modeling I

Principal Component Analysis (PCA): Model and Method

  • Summarization versus Correlation
  • Matrix spectrum and Singular Value Decomposition (SVD)
  • PCA as SVD. Conventional PCA.

PCA: Applications

Descriptive Modeling II

  • K‐Means, Anomalous Clusters, Intelligent K‐Means
  • Spectral clustering
  • Relational clustering (if time permits)

Interpreting Descriptive Models

  • Conventional Cluster Model Interpretation
  • Assessing Cluster Tendency
  • Interpretation aids induced by the least-squares principle

Data Analytics Case Studies

  • Module II Text Mining

Relevant Information Extraction

  • Relevant Expressions: Multi‐words and single‐words
  • Statistical vs symbolic extractors. Algorithms and metrics
  • Language independence

Symbolic and Statistical Analysis of Texts

  • Tokenization, Stemming and Part‐Of‐Speech Tagging
  • Word and multi‐word distribution in a Big Data context. Zipf's law
  • Metrics for word association and retrieval
  • Document correlation
  • Word Sense Disambiguation

Document Descriptors

  • Language‐independent Mining of Explicit and Implicit Keywords from documents.
  • Semantic Scope of Documents
  • Document Summarization

Document Classification

  • Relevant Expressions as features for document characterization. Feature selection and reduction.
  • Document Similarity
  • Supervised vs unsupervised Document Clustering.
  • Prediction and evaluation

Text Mining Case Studies (some examples)

  • Extraction of Named Entities
  • Email filtering
  • Language detection
  • Efficient Extraction of Multi‐words
  • Polarity Detection

Bibliography

  • D. T. Larose, C. D. Larose (2015), Data Mining and Predictive Analytics, 2nd Edition, Wiley.
  • B. Mirkin (2011), Core Concepts in Data Analysis: Summarization, Correlation, Visualization. Undergraduate Topics in Computer Science Series, Springer, London.
  • S. M. Weiss, N. Indurkhya, T. Zhang, F. Damerau (2005), Text Mining: Predictive Methods for Analyzing Unstructured Information, Springer.

Evaluation method

Assessment in this course unit has two components: theoretical/problems (T) and project (P). Each component contributes 50% to the final grade.

To pass, a student must score at least 10 points (out of 20) in both the theoretical/problems component and the project component of each of the two modules. The final grade is the weighted average of the two assessment components.
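As an illustration only, the pass rule and the 50/50 weighting could be sketched in Python as below. The function name final_grade is hypothetical, and the sketch assumes that each component grade is the plain average of its two module grades, which this description does not state. Rounding is likewise left out because it is not specified here.

```python
from typing import Optional, Tuple

MIN_GRADE = 10.0  # minimum required in each component of each module (out of 20)


def final_grade(
    theory: Tuple[float, float],
    project: Tuple[float, float],
) -> Optional[float]:
    """Return the final grade, or None if any minimum is not met.

    `theory` and `project` hold the grades for Module I and Module II.
    ASSUMPTION: each component grade is the mean of its two module grades.
    """
    # Every component grade in every module must reach the minimum.
    if any(grade < MIN_GRADE for grade in (*theory, *project)):
        return None

    t = sum(theory) / 2   # theoretical/problems component (T), assumed mean
    p = sum(project) / 2  # project component (P), assumed mean

    # Each component contributes 50% to the final grade.
    return 0.5 * t + 0.5 * p
```

For example, final_grade((12.0, 14.0), (16.0, 10.0)) averages a T of 13 and a P of 13, while any single grade below 10 makes the function return None.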

The theoretical part consists of two individual written tests; alternatively, this component can be assessed by a written exam.

The project component is assessed through a set of assignments and two programming projects, each accompanied by a written report. Projects are carried out in groups of at most two students and are subject to individual assessment.

Attendance is required at a minimum of 2/3 of the classes, whether theoretical or practical.
