Ebook: Data visualization tools for users (English)


Tools for data visualization

01. The data scientist’s toolbox
02. Five data visualization tools
03. Get the benefit from data with four webinars


The data scientist’s toolbox

Data Science stands today as a multidisciplinary profession. What follows is a basic guide to some of the useful resources available for each facet of the work these professionals perform.


01. TOOLBOX

Data Science stands today as a multidisciplinary profession in which knowledge from various areas overlaps, producing a profile more typical of the Renaissance than of the super-specialized 21st century. Given the scarcity of formal training in this field, data scientists are forced to gather scattered knowledge and tools in order to develop their skills.

The following is intended to be a basic guide, obviously not exhaustive, to some useful resources available for each of the facets of these professionals’ work.

TOOLS AND LANGUAGES
• SQL
• SQLite
• sqlite3
• RSQLite
• Toad
• Tora
• RapidMiner
• Knime
• Pentaho
• RODBC
• RJDBC
• pyODBC
• mxODBC
• SQLAlchemy
• pandas
• data.table
• XML
• jsonlite
• json



Data management

Part of the data scientist’s work is to capture, clean and store information in a format suitable for processing and analysis. The most common scenario is accessing a copy of the data source for a one-time or periodic capture. You will need to know SQL to access data stored in relational databases. Every database has a console for executing SQL queries, although most people prefer a graphical environment that shows information about tables, fields and indexes. Two of the most popular data management tools are Toad, proprietary software for Microsoft’s platform, and Tora, which is open source and cross-platform. Once the data has been extracted, we can store it in plain text files to upload to our working environment for machine learning, or load it into a tool such as SQLite.



SQLite is a lightweight relational database with no external dependencies that does not need to be installed on a server. Moving a database is as easy as copying a single file. In our case, we can process the information without concurrency or multiple access to the source data, which suits the characteristics of SQLite perfectly. The languages we use for our algorithms can connect to SQLite (Python through sqlite3, and R through RSQLite), so we can choose either to import the data before preprocessing or to do part of the preprocessing in the database itself, which helps avoid problems once the number of records grows large.

An alternative to bulk data capture is to use a tool that covers the full ETL cycle (Extraction, Transformation and Load), e.g. RapidMiner, Knime or Pentaho. With them, we can graphically define the data acquisition and cleaning cycles using connectors.

When we have guaranteed access to the data source during preprocessing, we can use an ODBC connection (RODBC and RJDBC in R; pyODBC, mxODBC and SQLAlchemy in Python) and take advantage of the database engine to perform joins (JOIN) and groupings (GROUP BY), subsequently importing the results. For external processing, pandas (a Python library) and data.table (an R package) are our first choice. data.table works around one of R’s weaknesses, memory management, by performing vectorized operations and grouping by reference without having to duplicate objects temporarily.
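As a minimal sketch of letting the database engine do the joins and groupings before importing the results, the following Python snippet uses the standard-library sqlite3 module with an in-memory database. The table and column names (customers, orders, region, amount) and the sample rows are hypothetical, chosen only for illustration; a real project would connect to its own database file or ODBC source.

```python
import sqlite3

# In-memory database for illustration; pass a file path instead to keep
# the whole database in a single file on disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'north'), (2, 'south'), (3, 'north');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5), (4, 3, 2.5);
""")

# The engine performs the JOIN and GROUP BY; only the summary is imported.
query = """
    SELECT c.region, COUNT(*) AS n_orders, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
"""
summary = conn.execute(query).fetchall()
conn.close()

for region, n_orders, total in summary:
    print(region, n_orders, total)
```

If pandas is available, the same query string and connection could instead be passed to pandas.read_sql_query to obtain the summary as a DataFrame.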



A third scenario is accessing information generated in real time and transmitted in formats such as XML or JSON. These are called incremental learning projects, and they include recommendation systems, online advertising and high-frequency trading. For this we use tools such as XML or jsonlite (R packages), or xml and json (Python modules). With them we capture the stream, make our predictions, send them back in the same format, and update our model once the source system later provides us with the outcomes actually observed.
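The cycle of capturing a message, predicting, replying in the same format and updating the model later can be sketched with Python’s json module. This is only an illustration under stated assumptions: the message fields (id, features, outcome), the class OnlineLogisticModel and the two handler functions are hypothetical names, and the model is a deliberately tiny logistic regression updated by stochastic gradient descent rather than any particular production system.

```python
import json
import math


class OnlineLogisticModel:
    """Tiny logistic regression updated one observation at a time (SGD)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, features):
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features, outcome):
        # One gradient step on the log-loss for a single observed outcome (0 or 1).
        error = self.predict_proba(features) - outcome
        self.weights = [w - self.learning_rate * error * x
                        for w, x in zip(self.weights, features)]
        self.bias -= self.learning_rate * error


def handle_request(model, message):
    """Parse a JSON request, predict, and answer in the same format."""
    request = json.loads(message)
    score = model.predict_proba(request["features"])
    return json.dumps({"id": request["id"], "score": score})


def handle_feedback(model, message):
    """Update the model once the source system reports the observed outcome."""
    feedback = json.loads(message)
    model.update(feedback["features"], feedback["outcome"])


model = OnlineLogisticModel(n_features=2)
response = json.loads(handle_request(model, '{"id": 1, "features": [1.0, 0.0]}'))
before = response["score"]
handle_feedback(model, '{"id": 1, "features": [1.0, 0.0], "outcome": 1}')
after = model.predict_proba([1.0, 0.0])
print(before, after)
```

After the positive outcome is fed back, the score for the same features moves above its initial value, which is the incremental update the paragraph describes.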



Data analysis

Even though Business Intelligence, Data Warehousing and Machine Learning are all part of Data Science, the last of these is the one that requires the greatest number of specific utilities. Hence, our toolbox will need to include R and Python, the programming languages most widely used in machine learning.

For Python we highlight the suite scikit-learn, which covers almost all techniques, except perhaps neural networks. For these we have several interesting alternatives, such as Caffe and Pylearn2. The latter is based on Theano, an interesting Python library that allows symbolic definitions and a transparent use of GPU processors.



If we need to change any R package we will need C++ and some utilities that allow us to re-generate them: Rtools, an environment for creating packages in R under Windows, and devtools, which facilitates all processes related to development.

Some of the most used packages for R:

• Gradient boosting: gbm and xgboost.
• Random forests for classification and regression: randomForest and randomForestSRC.
• Support vector machines: e1071, LiblineaR and kernlab.
• Regularized regression (Ridge, Lasso and ElasticNet): glmnet.
• Generalized additive models: gam.
• Clustering: cluster.

There are also some general-purpose tools that will make our life easier in R:

• data.table: Fast reading of text files; creation, modification and deletion of columns by reference; joins by a common key or group, and summary of data.
• foreach: Execution of parallel processes against a previously defined backend, using utilities such as doMC or doParallel.
• bigmemory: Manage massive matrices in R and share information across multiple sessions or parallel analyses.
• caret: Compare models, control data partitions (splitting, bootstrapping, subsampling) and tune parameters (grid search).
• Matrix: Manage sparse matrices and transform categorical variables to binary (one-hot encoding) using the sparse.model.matrix function.
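The idea behind sparse one-hot encoding, mentioned in the last item above, can be sketched in a few lines of Python. This illustrates the technique using only the standard library; it is not the R Matrix implementation, and the function name one_hot_sparse and the sample data are hypothetical.

```python
def one_hot_sparse(values):
    """Encode a categorical column as sparse (row, column) -> 1 entries.

    Returns the sorted category levels and a dict holding only the non-zero
    cells, so memory grows with the number of rows, not rows x levels.
    """
    levels = sorted(set(values))
    column_of = {level: j for j, level in enumerate(levels)}
    cells = {(i, column_of[v]): 1 for i, v in enumerate(values)}
    return levels, cells


colors = ["red", "green", "red", "blue"]
levels, cells = one_hot_sparse(colors)
print(levels)
print(sorted(cells))
```

Each row contributes exactly one stored cell, whereas a dense binary matrix would store one value per row and level.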



Distributed environments deserve a special mention. If we have dealt with data from a large institution or company, we will probably have experience working with the so-called Hadoop ecosystem. Hadoop combines a distributed file system (HDFS) with a programming model (MapReduce) that allows information to be processed in parallel.
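To give a feel for the MapReduce model, here is a single-machine sketch in plain Python of the classic word-count example. It only imitates the map, shuffle and reduce phases sequentially and does not use Hadoop; the function names and sample documents are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


documents = ["big data big value", "data in parallel"]
# In a cluster each document could be mapped on a different node;
# here the map calls simply run one after another.
mapped = chain.from_iterable(map_phase(doc) for doc in documents)
counts = reduce_phase(shuffle_phase(mapped))
print(counts)
```

Because each map call looks at one document only and each reduce call at one key only, both phases can be spread across machines, which is what makes the model suitable for parallel processing.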

Among the machine learning tools compatible with Hadoop we find:

• Vowpal Wabbit: Online learning methods based on gradient descent.
• Mahout: A suite of algorithms that includes recommendation systems, clustering, logistic regression and random forests.
• h2o: Perhaps the tool currently experiencing the fastest growth, with a large number of parallelizable algorithms. It can be run from a graphical environment or from R or Python.

The data scientist should also keep abreast of the generational shift from Hadoop to Spark. Spark has several advantages over Hadoop for processing information and executing algorithms. The main one is speed: it can be up to 100 times faster because, unlike Hadoop, it manages data in memory and only writes to disk when necessary.



Spark can run independently or coexist as a component of Hadoop, which allows migration to be planned in a non-traumatic way. You can, for example, use HBase as a database, even though Cassandra is emerging as a storage solution thanks to its redundancy and scalability.



Visualization

Finally, a brief reference to the presentation of results. The most popular tools are lattice and ggplot2 for R, and Matplotlib for Python. If we need professional presentations embedded in web environments, however, D3.js is the best choice. Among integrated Business Intelligence environments with a clear focus on presentation, the well-known Tableau stands out, with Birst and Necto as alternatives for graphical exploration of data.


Five data visualization tools that you should not miss

We present some of the best data visualization tools that you can use in your business to take full advantage of the large amount of information created every day in the digital world.


02. DATA VISUALIZATION TOOLS

VISUALIZATION TOOLS INDEX
• Google Fusion Tables
• CartoDB
• Tableau Public
• iCharts
• Smart Data Report

The digital universe is reaching new thresholds. The amount of data generated by both private users and companies is growing at a rapid pace. In fact, according to a study by IDC and EMC, the world of digital data is doubling in size every two years, and by 2020 it will reach 44 zettabytes of information, or, put another way, 44 trillion gigabytes of structured and unstructured data. Creating or accessing a website, participating in a blog, gaining followers, posting comments, sending a tweet or just surfing the internet all produce a whole range of data that, if exploited properly, can be of great value to companies.
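The unit conversion behind those figures is easy to check. The short Python sketch below assumes decimal (SI) units, so that one zettabyte is 10^21 bytes and one gigabyte is 10^9 bytes, and reads "trillion" on the short scale (10^12); the six-year horizon in the second calculation is an arbitrary illustrative choice, not a figure from the study.

```python
ZETTABYTE = 10**21  # bytes, decimal (SI) definition
GIGABYTE = 10**9    # bytes, decimal (SI) definition

gigabytes_in_2020 = 44 * ZETTABYTE // GIGABYTE
print(gigabytes_in_2020)  # 44000000000000, i.e. 44 trillion

# Doubling every two years means growth by a factor of 2 ** (years / 2).
growth_over_six_years = 2 ** (6 / 2)
print(growth_over_six_years)  # 8.0
```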



The big challenge, however, is to make sense of all that data: to capture, link and analyze it and extract its true value, so that the information can be presented in an attractive, clear, concise and understandable way that facilitates decision making in your business. Visually exploring and analyzing customer data can also lead you to discover new ways of reaching customers, segment them better, personalize offers for products or services, and generate innovative ideas, among many other possibilities that can help maintain the engagement between your brand and your users over time.

Where to start

The first steps in data visualization can be intimidating. Fortunately, just as data is growing, so are the tools that help us get the most out of it. Here we present the five tools that we consider the best, based on the capabilities they provide and the level of experience required.



Google Fusion Tables

This is an excellent tool for beginners or for those who don’t know how to program. For more advanced users there is an API for producing graphics or maps from data. One of the advantages of this application is the diversity of data representations it offers. It also offers a relatively fast way to create graphics and maps, including GIS functions to analyze data by geographic area. The Guardian uses this tool frequently to produce detailed maps very quickly.



CartoDB

This is an open source service with a friendly interface, aimed at any user regardless of technical level. It lets you create a variety of interactive maps, choosing from a catalog of options (which includes Google Maps) or adding your own customized maps. The most interesting feature of this tool is that it lets you access Twitter’s data to see how users react to a brand, a particular marketing campaign or an event. A good example is the tweet-tracking map created last year for the launch of Beyoncé’s latest album, which clearly shows the places where the release had the most impact. This is a great source of visual information for marketing professionals and businesses. It is also worth highlighting that it has an active group of developers who provide extensive documentation and examples. In addition, the open nature of its API makes it possible to keep creating new integrations and to extend the capabilities of the tool with new libraries.



Tableau Public

With Tableau Public you can easily create interactive maps, bar charts, pie charts and more. One of its advantages is that, as with Google Fusion Tables, you can import tables from Excel to make your work easier. In a matter of minutes you can generate an interactive graphic, embed it in your website and share it. For example, the news portal Global Post used it to create a series of charts about the best countries in Africa in which to do business. The recently released version 8.2 also includes the new OpenStreetMap tool, which makes it possible to produce very detailed maps from local data such as cafes or shops. Tableau Public is a free tool, although it also has a premium version.



iCharts

You can get started in the world of data visualization with the service offered by iCharts, which has a free version (Basic) and two premium options (Platinum and Enterprise). With this tool you can create visualizations in just a few steps, importing Excel and Google Drive documents or adding data manually. You can share your graphics privately with your collaborators, and edit and update them with new data through its cloud service. You can even share them with your clients through emails, newsletters or social networks. Among the companies using this service is the prestigious consulting firm IDC, which uses iCharts to visualize relevant data included in its reports.



Smart Data Report

Finally, we also recommend Smart Data Report, which is not as powerful a tool as the previous ones but has the advantage of being an affordable data solution for entrepreneurs and small businesses whose staff don’t have much spare time. Among other services, this website offers free data analysis and the option of receiving reports by email without having to create them yourself. Once the service has your report ready, it generates HTML code that you can embed in your corporate website or in your articles.


Get the maximum benefit from data with these four webinars

Mapping data, visualizing it in geospatial apps and applying machine learning: we put our knowledge into practice with the help of these video tutorials.


03. WEBINARS

Mapping data

CartoDB explains how to convert location data into knowledge for your business. In this tutorial you can learn how to analyze, visualize and build data apps using the CartoDB tool.



Machine Learning

Now that summer is around the corner, Andrés González, solutions manager for Big Data and Data Prediction at Clever Task, shows us how to make forecasts from data in a very specific area: the tourism sector.



Geospatial apps

If you want to learn to create geospatial apps and data, you can’t miss this tutorial, also by CartoDB, which explains how to make the most of an API (in this case the one opened by BBVA for the InnovaChallenge competition) to create apps and visualizations.



Good examples of visualization

Finally, to round off this selection, Alberto Cairo, professor of data visualization at the University of Miami, teaches us good practices in data visualization. It is good to learn from our own mistakes and from the successes of others.


THIS MIGHT INTEREST YOU

Innovation Edge Big Data: to create business value with data

Emerging Tech: Data visualization beyond the noise

Infographic: Big Data, chronology, present and future

Case study: data visualization with Illustreets and CartoDB

Infographic: the keys of Big Data by DJ Patil


BBVA is not responsible for the opinions expressed herein.


www.bbvaopen4u.com
