Basic ideas

2020

What is this “Big Data era”?

A period of accelerated technological change, with countless effects on every area of our lives, including companies and industries.

A significant part of these advances is related to the abundance of data.

The data come from the multitude of “smart” devices connected to networks.

Massive data have enabled major advances in the research field of Machine Learning/Deep Learning (ML/DL).

Big Data is also known as massive data or macrodata.

The three V's of Big Data: volume, velocity and variety.

Massive data have also raised a number of challenges, such as privacy and bias in algorithms.

AI, ML and DL

The fundamental idea behind Artificial Intelligence (AI) is to get a computer to solve a complex problem the way a human would.

3 stages: the beginnings, expert systems, and ML/DL.

In ML/DL the rules do not have to be coded; instead, an algorithm is given enough data and finds the rules by itself. In a certain sense, it “learns”.

This approach has made it possible to automate many “human” tasks.

An ML system/program is trained rather than explicitly programmed.

Deep Learning is a subset of ML. It uses techniques/algorithms that are computationally more complex and therefore need a larger volume of data to learn, but the philosophy is the same: provide data so that the computer learns to solve and automate a specific task.
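
The “trained rather than programmed” idea can be illustrated with a minimal sketch in base R. Everything here is an assumption made for illustration: the simulated data, the hidden rule (y = 3x + 2) and the choice of a linear model, which is among the simplest learning algorithms. The rule is never written into the model; it is estimated from examples.

```r
# Simulated examples that follow a rule the model is never told:
# y = 3 * x + 2, plus some random noise (hypothetical data).
set.seed(42)
x <- runif(200, min = 0, max = 10)
y <- 3 * x + 2 + rnorm(200, sd = 0.5)

# "Training": the algorithm estimates the rule from the examples.
fit <- lm(y ~ x)
learned <- coef(fit)   # named vector: "(Intercept)" and "x"
print(round(learned, 2))

# Using the learned rule on a value the model has not seen.
prediction <- predict(fit, newdata = data.frame(x = 12))
print(prediction)
```

The estimated coefficients should land close to 2 and 3, although the code never states those numbers as a rule. More data or less noise would bring the estimates closer still.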

What is Data Science?

A new field of knowledge that is making a strong entrance into many curricula.
The ultimate goal of Data Science (DS) is to extract information/knowledge from data that generates value.
Statistics vs. Data Science. Is it the same thing? Yes and no.

The Big Data era and DS

“The world’s most valuable resource is no longer oil, but data” (The Economist, 2017).

Corporations need to turn data into information that adds value.

“Data Scientist: The Sexiest Job of the 21st Century” (Harvard Business Review, 2012).

What is a data scientist?

A statistician with a bow tie… no, obviously not.

For a data scientist, hacking skills (programming) are very important.

Do I have to learn to program?

YES. Well, at least a little, as much as we can.

Point-and-click programs vs. programming languages: flexibility and reproducibility

Programming is one of the key skills in DS, but … it takes time.

Reproducible research and free software

For a data analysis to be reproducible, it is not enough that the data used are accessible; at a minimum, the analysis should:
- provide the original data (and, obviously, document the sources)
- carry out the whole process through code (scripts)
- document the workflow (for example, the order in which the scripts were run)
It should also use free software.
In this course we will use R.
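
As a rough sketch of what “the whole process through code” can look like, the hypothetical script below goes from raw data to a result in one file and records the software versions used. The data frame, its column names and the commented-out file path are invented for illustration. In a real project the raw data would be read from a documented source file.

```r
# analysis.R: a hypothetical single-script analysis (base R only).

# 1. Raw data. In a real project this would come from a documented file,
#    e.g. raw <- read.csv("data/raw.csv")   # hypothetical path
raw <- data.frame(
  id    = 1:6,
  group = rep(c("a", "b"), each = 3),
  value = c(10, 12, 11, 20, 19, 21)
)

# 2. Every transformation is written down as code, never done by hand.
summary_tbl <- aggregate(value ~ group, data = raw, FUN = mean)
print(summary_tbl)

# 3. Record the R version and loaded packages alongside the results.
print(sessionInfo())
```

Anyone with the same data and R installed can rerun the script and obtain the same table, which is hard to guarantee with a sequence of mouse clicks.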

What is R?

R is a programming language and software environment for statistical computing and graphics
R is distributed under the GNU GPL license; that is, it is free software
R is also free of charge (its price is zero)
R is multiplatform: it is available for Windows, macOS and GNU/Linux
The official website of R is called The R Project for Statistical Computing
R was created by R. Ihaka and R. Gentleman of the University of Auckland in 1993
Since 1997 the development of R has been carried out by a group of programmers known as the “R Core Team” …
… but today the R environment is the result of the collaboration of a whole community of R users

Capabilities of R

R is extensible through functions and packages
The official repository is CRAN: Comprehensive R Archive Network
In January 2017 CRAN reached 10,000 packages
R (with its packages) can implement a huge variety of statistical and graphical techniques.

Why R?

Please, don’t look at this other one 😄!

For me, the best part of R is the R community
In many ways, R is the data language: in data science it is THE language to beat (with only one serious contender: Python)
If you don’t believe me, you could read this or this, or this
Among the companies that use R are Google, Facebook, Twitter, Microsoft, IBM, Uber, Ford, Airbnb, American Express, Barclays Bank, Bank of America … You can find a more complete list here
BUT not everyone loves R: read this or the classic The R Inferno

The R vs. Python debate

Python is a more general-purpose programming language and R is more domain-specific. See some rankings here
There are many opinions about R vs. Python, but … check this one

Although, after what Elmo said, I think the debate is settled:

R-base vs. tidyverse

But before starting your journey into R, you should know a few things about R's recent history

In recent years there has been a kind of revolution in the R world: Hadley Wickham & the tidyverse

Hadley Wickham, chief scientist at RStudio, was the father of the Hadleyverse, which has evolved into the tidyverse thanks to a group of developers; for instance:

What is the tidyverse?

Well … we can say that the tidyverse is a group of R packages that work well together and that have changed, for the better, the way we program in R.
In Emily Robinson’s words, from these fantastic slides, the tidyverse is: “An opinionated collection of R packages designed for data science that share an underlying design philosophy, grammar, and data structures”

The philosophy of the tidyverse

You could read The tidy tools manifesto, but these 3 sentences capture its essence:

Compose simple functions with the pipe (%>%)

“Programs must be written for people to read, and only incidentally for machines to execute” — Hal Abelson

“If I had one thing to tell biologists learning bioinformatics, it would be write code for humans, write data for computers” — Vince Buffalo

“An important aspect of writing data for computers is to make your data TIDY” — Jenny Bryan
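
The “compose simple functions with the pipe” principle mentioned above can be sketched with a small comparison. The sketch uses R's native pipe `|>`, which assumes R 4.1 or later. The `%>%` operator comes from the magrittr package and behaves the same way in a simple chain like this one. The vector `x` is made up for illustration.

```r
x <- c(1, 4, 9, 16)   # hypothetical input

# Nested calls: they must be read from the inside out.
nested <- round(mean(sqrt(x)), 1)

# Piped calls: read left to right, one simple step at a time.
piped <- x |> sqrt() |> mean() |> round(1)

# Both forms compute the same value.
print(identical(nested, piped))
```

The piped version follows the order in which a person would describe the steps, which is the sense of “write code for humans”.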

Some differences between base R & the tidyverse

Here you can find a post about the differences between base R functions and the tidyverse
Here is another post, from a recent convert to the tidyverse. He says: “Up until last year my R workflow was not dramatically different from when I started using R more than 10 years ago. Thanks to several R package authors my workflow has changed for the better ...”
An example of code à la tidyverse:

data %>% filter(X1 > 400) %>% group_by(X2) %>% summarise(media = mean(X3))
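
For comparison, the same filter, group and summarise steps could be written in base R roughly as follows. This is only one of several possible base R translations, and the small `data` frame is invented so the sketch runs on its own. The column names `X1`, `X2`, `X3` and the result name `media` follow the example above.

```r
# Hypothetical data with the columns used in the tidyverse example.
data <- data.frame(
  X1 = c(350, 420, 510, 610, 390, 700),
  X2 = c("a", "a", "b", "b", "a", "b"),
  X3 = c(1, 2, 3, 4, 5, 6)
)

# filter(X1 > 400): keep the rows where X1 exceeds 400.
kept <- data[data$X1 > 400, ]

# group_by(X2) %>% summarise(media = mean(X3)): mean of X3 within each X2.
result <- aggregate(X3 ~ X2, data = kept, FUN = mean)
names(result)[names(result) == "X3"] <- "media"

print(result)
```

The base R version needs an intermediate object and a separate renaming step, while the piped version reads as a single sentence. This is the kind of difference that the posts mentioned above discuss.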

Tidyverse workflow & Reproducible Research

Data wrangling from http://r4ds.had.co.nz/wrangle-intro.html

In my opinion R is one step ahead of other languages in tools for making reproducible documents: R Markdown & friends (blogdown, pkgdown, bookdown, flexdashboard, shiny …)
You can see this by looking at this gallery. To learn how to make this kind of reproducible document you can go here or here

5 websites you MUST know

Packages:

Official R package repository: CRAN
Unofficial repository: GitHub

To find inspiration (examples of recent analyses in R):

To find help:

Stack Overflow

To finish, let’s open R & RStudio!