Courses on Open Source Tools Announced (In-Person and Online)

October 3, 2016, 8:23 am

The widest range of courses on Open Source solutions has now been announced, in both in-person and online formats, starting in mid-October and running until the end of the year.

Announced courses:

↧

List of Open Source solutions for Smart Cities - Internet of Things projects

October 4, 2016, 1:53 am

More and more projects are being carried out around so-called 'Smart Cities', supported by Big Data, the Internet of Things... and the good news is that most of them are built with Open Source technologies. From TodoBI.com, we share our insights about these technologies.

Making a city “smart” involves a set of areas that we outline below. Without IoT (Internet of Things), there will be no Smart City.

Since automatically collected data is the most efficient way to gather huge amounts of information, devices connected to the internet are an essential part of a Smart City.
Data from the city is generally stored and processed using Big Data and real-time streaming technologies.

The final goal, where more innovative and custom analysis becomes possible, is reached using Artificial Intelligence and Machine Learning. Finally, I would include apps, as this kind of solution is usually consumed on mobile devices.

Here we outline the common process of building a Smart City solution:

-Choose data
-Connecting devices
-Design Data Storage Infrastructure
-Real Time Events and Notifications
-Analytics
-Visualization (Dashboards)

1) Choosing Data

In a city there are three basic sources of data: citizens, systems and sensors. Use the information available about users on social networks, in information systems, and in the public statistics published by the administration.

A typical example is a user with geolocation enabled on Twitter. Information about the systems and services in a city is sometimes available in open data sources; an example could be water or electricity consumption.

Last but not least, sensors. A city hoping to become “Smart” must aim to provide automatic information about its environment, and that can be achieved using sensors. Sensors can be anywhere.

2) Connecting Devices

Devices (sensors) connect to the real-time data streaming and storage infrastructure using efficient communication protocols that rely on lightweight packaging and asynchronous communication.

Examples of some communication protocols used:

MQTT (Message Queuing Telemetry Transport)

WebSocket (bi-directional web communication and connection management)

STOMP (The Simple Text Oriented Messaging Protocol)

XMPP (Extensible Messaging and Presence Protocol)
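
As an illustration of how publish/subscribe protocols such as MQTT route sensor messages, here is a minimal Python sketch of MQTT-style topic filter matching, where `+` matches one topic level and `#` matches all remaining levels. The topic names and the `topic_matches` helper are hypothetical and simplified; a real deployment would rely on an MQTT broker and client library rather than this function.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if an MQTT-style topic filter matches a concrete topic.

    '+' matches exactly one level; '#' matches the rest of the topic
    (simplified: assumes '#' only appears as the last level of the filter).
    """
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            return True  # multi-level wildcard: everything below matches
        if i >= len(topic_levels):
            return False  # filter is longer than the topic
        if level != "+" and level != topic_levels[i]:
            return False  # literal level mismatch
    return len(filter_levels) == len(topic_levels)


# Hypothetical sensor topics, for illustration only.
subscriptions = ["city/+/temperature", "city/parking/#"]
incoming = "city/parking/lot7/free_spaces"
matched = [f for f in subscriptions if topic_matches(f, incoming)]
print(matched)
```

Only the subscribers whose filter matches receive the message, which is what keeps the sensors decoupled from the consumers of their data.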

3) Design Data Storage Infrastructure

The Data Storage Infrastructure for a Smart City solution has special characteristics, due to the diversity and dynamism of its sources.

Time series databases are frequently used, because the data captured by sensors evolves over time. Some examples of this kind of database are InfluxDB and Druid.

Other data stores commonly used in Smart City projects are MongoDB (advantages of the JSON format), Cassandra (fast insertion) and Hadoop (Big Data frameworks).
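
To show the kind of operation time series stores are built for, here is a small Python sketch that downsamples raw sensor readings into fixed time buckets. The readings and the `downsample` function are hypothetical; databases such as InfluxDB or Druid provide this kind of aggregation natively and at much larger scale.

```python
from collections import defaultdict
from statistics import mean


def downsample(readings, bucket_seconds):
    """Average (timestamp, value) readings into fixed-width time buckets."""
    buckets = defaultdict(list)
    for timestamp, value in readings:
        # Align each reading to the start of its bucket.
        bucket_start = timestamp - (timestamp % bucket_seconds)
        buckets[bucket_start].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


# Hypothetical temperature readings: (Unix timestamp in seconds, degrees C).
readings = [(0, 20.0), (30, 22.0), (60, 21.0), (90, 23.0), (125, 19.0)]
per_minute = downsample(readings, bucket_seconds=60)
print(per_minute)
```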

Some samples

4) Real Time events and notifications

Smart City solutions usually need real-time notifications of events. To meet this requirement, the system must have a stream analytics engine that can react to events in real time and send notifications. Technologies relevant to this include Storm, Spark Streaming, Flink, WebSocket and Socket.IO.
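
A minimal sketch of what "reacting to events in real time" can mean in practice: a sliding window over incoming values that emits a notification when the moving average crosses a limit. The `ThresholdAlert` class, the noise readings and the limit are assumptions made for illustration; engines such as Storm, Spark Streaming or Flink offer windowing as a built-in, distributed feature.

```python
from collections import deque


class ThresholdAlert:
    """Emit a notification when the moving average of recent values exceeds a limit."""

    def __init__(self, window_size, limit):
        self.window = deque(maxlen=window_size)  # keeps only the latest values
        self.limit = limit

    def on_event(self, value):
        """Process one event; return an alert message or None."""
        self.window.append(value)
        average = sum(self.window) / len(self.window)
        # Only alert once the window is full, to avoid reacting to a single spike.
        if len(self.window) == self.window.maxlen and average > self.limit:
            return f"ALERT: average {average:.1f} exceeds {self.limit}"
        return None


# Hypothetical noise-level stream (dB) from a street sensor.
detector = ThresholdAlert(window_size=3, limit=70)
alerts = [detector.on_event(v) for v in [60, 65, 72, 80, 85]]
print(alerts)
```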

IoT Frameworks:

●Node-RED

Node-RED is a tool for wiring together hardware devices, APIs and online services in new and interesting ways.

The light-weight runtime is built on Node.js, taking full advantage of its event-driven, non-blocking model. This makes it ideal to run at the edge of the network on low-cost hardware such as the Raspberry Pi as well as in the cloud.

The flows created in Node-RED are stored using JSON which can be easily imported and exported for sharing with others.

An online flow library allows you to share your best flows with the world.

●PubNub

PubNub is a Data Stream Network that offers infrastructure as a service. With PubNub, we can connect our devices to the infrastructure provided, design our architecture and simply take advantage of all this.

PubNub has 5 main tools:

-Publish/Subscribe (allows real-time notification of events to users)
-Stream Controller (allows managing channels and groups of channels)
-Presence (allows notifications when users log in to or leave the system, or similar behaviour such as device availability)
-Access Manager (allows administrators to grant or deny permissions to users of the system)
-Storage & Playback (provides storage for messages and allows them to be retrieved at a later time)

●AWS IoT

AWS IoT is a platform that enables you to connect devices to AWS Services and other devices, secure data and interactions, process and act upon device data, and enable applications to interact with devices even when they are offline.

5) Analytics and Visualization

You can show real time dashboards, reports, OLAP Analysis using tools like Pentaho. See samples of Analytics

Other Open Source projects for Smart Cities and IoT:

- AllSeen Alliance
- Bug Labs dweet and freeboard
- DeviceHive
- DSA
- Eclipse IoT (Kura)
- Kaa
- Macchina.io
- Predix
- Home Assistant
- Mainspring
- Node-RED
- Open Connectivity Foundation
- openHAB
- OpenIoT
- OpenRemote
- OpenThread
- Physical Web/Eddystone
- PlatformIO
- The Thing System
- ThingSpeak
- Zetta

↧

New Version of DataCleaner

October 6, 2016, 2:32 am

The heart of DataCleaner is a strong data profiling engine for discovering and analyzing the quality of your data. Find the patterns, missing values, character sets and other characteristics of your data values.

Profiling is an essential activity of any Data Quality, Master Data Management or Data Governance program. If you don't know what you're up against, you have poor chances of fixing it.

Learn how DataCleaner works with ...

Duplicate detection
Big Data and Hadoop
Pentaho Business Intelligence
CRM systems (such as Salesforce.com)

DataCleaner community edition downloads

↧

Real-Time Apache Kafka Use Case, Big Data

October 14, 2016, 9:47 am

This is a good example of using Apache Kafka in Big Data environments for querying and visualization. See the Dashboard

The image below shows the cluster of 3 brokers and the 3 producers that send data to the Kafka cluster.

The "Kafka Producer" component connects to the Wikipedia stream and registers a listener, which is a participant in the observer pattern. When an update is generated on Wikipedia, it is received through the "Socket", which notifies the "Listener"; the listener holds an org.apache.kafka.clients.producer.KafkaProducer. The producer registers a callback that is notified when a message has been sent to Kafka; the notification contains the offset and partition of each message. In this step, the time in milliseconds and the offset for that time are sent via an API every minute.

This information is stored in a PostgreSQL database to be queried later. When the user selects a date from which they want to see messages, the system looks up in the database an offset recorded on the requested date. The Kafka cluster keeps messages in its local files for 3 days.
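
The date-to-offset lookup can be sketched as a search over the per-minute checkpoints. In this Python sketch the `checkpoints` list stands in for the PostgreSQL table, and the values, the `offset_for` function and the single-partition simplification are all assumptions made for illustration (in Kafka, offsets are tracked per partition).

```python
import bisect

# Hypothetical checkpoints recorded once per minute: (timestamp in ms, offset).
# In the described system these rows would live in PostgreSQL.
checkpoints = [
    (1_476_400_000_000, 1000),
    (1_476_400_060_000, 1450),
    (1_476_400_120_000, 1980),
]


def offset_for(timestamp_ms, checkpoints):
    """Return the offset of the latest checkpoint at or before timestamp_ms.

    Falls back to the earliest checkpoint when the requested time is older
    than anything recorded.
    """
    times = [t for t, _ in checkpoints]
    index = bisect.bisect_right(times, timestamp_ms) - 1
    return checkpoints[max(index, 0)][1]


print(offset_for(1_476_400_090_000, checkpoints))
```

The returned offset is the position a consumer would seek to before it starts reading.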

Once the offset for the requested date has been obtained, a "Thread Safe Kafka Consumer" is requested through the "Consumer Holder"; it performs the seek and poll operations, to set the starting position and to consume from it, respectively.

By default, an org.apache.kafka.clients.consumer.KafkaConsumer is not thread safe, so for use in an environment with simultaneous user access an implementation was written that allows a single Consumer to be used by several threads, synchronizing access to the object.
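
The synchronization idea can be sketched in a few lines of Python. This is not the implementation described in the post, which is not shown: `FakeConsumer` is a hypothetical stand-in for a non-thread-safe consumer, and `SynchronizedConsumer` is one possible way to serialize access with a lock so that seek and poll run as a single unit.

```python
import threading


class FakeConsumer:
    """Stand-in for a non-thread-safe consumer (hypothetical, for illustration)."""

    def __init__(self, messages):
        self._messages = messages
        self._position = 0

    def seek(self, offset):
        self._position = offset

    def poll(self, max_records):
        batch = self._messages[self._position:self._position + max_records]
        self._position += len(batch)
        return batch


class SynchronizedConsumer:
    """Serializes access so several threads can share one consumer."""

    def __init__(self, consumer):
        self._consumer = consumer
        self._lock = threading.Lock()

    def read_from(self, offset, max_records):
        # seek and poll must happen under the same lock; otherwise another
        # thread could move the position between the two calls.
        with self._lock:
            self._consumer.seek(offset)
            return self._consumer.poll(max_records)


shared = SynchronizedConsumer(FakeConsumer([f"msg-{i}" for i in range(100)]))
results = {}


def worker(name, offset):
    results[name] = shared.read_from(offset, max_records=3)


threads = [threading.Thread(target=worker, args=(f"user{i}", i * 10)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results["user2"])
```

Holding the lock across both calls is the important design choice: locking each call separately would still let two users interleave their seek and poll operations.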

↧

Fraud Detection in Retail with Neo4J

October 18, 2016, 8:56 am

In this short example we demonstrate the fraud detection capabilities of Neo4J (a graph database), in work done by our colleagues at Stratebi.

Our data set includes:

10 People (Nodes): Fernando, Juan, Daniel, Marcos...
13 Stores (Nodes): Fnac, El Corte Inglés, Primark, Ikea...
64 Purchase transactions (Relationships) identifying purchases by a given person at a store. Each of these Relationships has the following attributes: purchase amount in €, date and status (legitimate or fraudulent).

// Create the 10 customers
CREATE (Fernando:Persona {id:'1', nombre:'Fernando', sexo:'masculino', edad:'50'})
CREATE (Juan:Persona {id:'2', nombre:'Juan', sexo:'masculino', edad:'48'})
CREATE (Daniel:Persona {id:'3', nombre:'Daniel', sexo:'masculino', edad:'23'})
CREATE (Marcos:Persona {id:'4', nombre:'Marcos', sexo:'masculino', edad:'30'})
CREATE (Gonzalo:Persona {id:'5', nombre:'Gonzalo', sexo:'masculino', edad:'31'})
CREATE (Marta:Persona {id:'6', nombre:'Marta', sexo:'femenino', edad:'52'})
...

// Create the stores
CREATE (Fnac:Comercio {id:'11', nombre:'Fnac', calle:'2626 Wilkinson Court', address:'Madrid 92410'})
CREATE (El_Corte_Ingles:Comercio {id:'12', nombre:'El Corte Ingles', calle:'Nuevos Minist', address:'Madrid 92410'})
CREATE (Primark:Comercio {id:'13', nombre:'Primark', calle:'2092 Larry Street', address:'Madrid 92410'})
CREATE (MacDonalds:Comercio {id:'14', nombre:'MacDonalds', calle:'1870 Caynor Circle', address:'Madrid 92410'})
CREATE (Springfield:Comercio {id:'15', nombre:'Springfield', calle:'1381 Spruce Drive', address:'Madrid 92410'})
CREATE (Burguer_King:Comercio {id:'16', nombre:'Burguer King', calle:'826 Anmoore Road', address:'Madrid 92410'})
CREATE (Ikea:Comercio {id:'17', nombre:'Ikea', calle:'1925 Spring Street', address:'Madrid 92410'})
CREATE (Nike:Comercio {id:'18', nombre:'Nike', calle:'4209 Elsie Drive', address:'Madrid 92410'})
CREATE (Adidas:Comercio {id:'19', nombre:'Adidas', calle:'86 D Street', address:'Madrid 92410'})
CREATE (Sprinter:Comercio {id:'20', nombre:'Sprinter', calle:'945 Kinney Street', address:'Madrid 92410'})
CREATE (Starbucks:Comercio {id:'21', nombre:'Starbucks', calle:'3810 Apple Lane', address:'Madrid 92410'})
...


Below is a subset with 25 purchases.


// Create the purchases
CREATE (Fernando)-[:HA_COMPRADO_EN {cantidad:'986', fecha:'4/17/2015', estado:'Legitima'}]->(Burguer_King)
CREATE (Fernando)-[:HA_COMPRADO_EN {cantidad:'239', fecha:'5/15/2015', estado:'Legitima'}]->(Starbucks)
CREATE (Fernando)-[:HA_COMPRADO_EN {cantidad:'475', fecha:'3/28/2015', estado:'Legitima'}]->(Nike)
CREATE (Fernando)-[:HA_COMPRADO_EN {cantidad:'654', fecha:'3/20/2015', estado:'Legitima'}]->(Primark)
CREATE (Juan)-[:HA_COMPRADO_EN {cantidad:'196', fecha:'7/24/2015', estado:'Legitima'}]->(Adidas)
CREATE (Juan)-[:HA_COMPRADO_EN {cantidad:'502', fecha:'4/9/2015', estado:'Legitima'}]->(El_Corte_Ingles)
CREATE (Juan)-[:HA_COMPRADO_EN {cantidad:'848', fecha:'5/29/2015', estado:'Legitima'}]->(Primark)
CREATE (Juan)-[:HA_COMPRADO_EN {cantidad:'802', fecha:'3/11/2015', estado:'Legitima'}]->(Fnac)
CREATE (Juan)-[:HA_COMPRADO_EN {cantidad:'203', fecha:'3/27/2015', estado:'Legitima'}]->(Subway)
CREATE (Daniel)-[:HA_COMPRADO_EN {cantidad:'35', fecha:'1/23/2015', estado:'Legitima'}]->(MacDonalds)
.....

Now we start using the capabilities of Cypher, the graph query language of Neo4J.

1. We show all fraudulent transactions

MATCH (victima:Persona)-[r:HA_COMPRADO_EN]->(comercio)
WHERE r.estado = "Fraudulenta"
RETURN
victima.nombre AS `Nombre Cliente`, 
comercio.nombre AS `Nombre Comercio`, 
r.cantidad AS Cantidad, 
r.fecha AS `Fecha Transaccion`
ORDER BY `Fecha Transaccion` DESC

Result: 16 fraudulent transactions

2. So far we know the stores where fraud has occurred.

But there is a fraudster we are looking for, and to help us find them we will rely on the transaction date.
The thief we are looking for captured the credit card number during a legitimate purchase. After stealing the card details, the thief carried out fraudulent transactions.
In the following query we show, for the people who have been victims of fraud, the legitimate purchases made earlier than the fraudulent ones. This reveals the stores where the card number could have been stolen.

MATCH (victima:Persona)-[r:HA_COMPRADO_EN]->(comercio)
WHERE r.estado = "Fraudulenta"

MATCH (victima)-[t:HA_COMPRADO_EN]->(otroscomercios)
// fecha is stored as an M/D/YYYY string, so < compares text rather than calendar dates
WHERE t.estado = "Legitima" AND t.fecha < r.fecha

WITH victima, otroscomercios, t 
ORDER BY t.fecha DESC

RETURN
victima.nombre AS `Nombre Cliente`, 
otroscomercios.nombre AS `Nombre Comercio`, 
t.cantidad AS Cantidad, 
t.fecha AS `Fecha Transaccion`
ORDER BY `Fecha Transaccion` DESC

Result: 34 legitimate transactions made earlier than the fraudulent ones

3. Now we compute the common denominator: we group and sort by the number of people who have bought at each store.

MATCH (victima:Persona)-[r:HA_COMPRADO_EN]->(comercio)
WHERE r.estado = "Fraudulenta"

MATCH (victima)-[t:HA_COMPRADO_EN]->(otroscomercios)
WHERE t.estado = "Legitima" AND t.fecha < r.fecha
WITH victima, otroscomercios, t
ORDER BY t.fecha DESC

RETURN
DISTINCT otroscomercios.nombre AS `Comercio Sospechoso`,
count(DISTINCT t) AS Contador,
collect(DISTINCT victima.nombre) AS Victimas
ORDER BY Contador DESC

Result: In every fraudulent purchase, the card owner had made a purchase at Primark in the preceding days. Now we know both the date and the store where the bank details were stolen.

We now display the victims' purchases sorted by date, so that we know the date on which the data was stolen.
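
For readers who want to follow the reasoning of the queries outside Neo4J, here is a Python sketch of the same "common denominator" idea on a tiny in-memory data set. The people, stores and dates are hypothetical and are not the data set from this post; unlike the query, which counts transactions, this sketch ranks stores by number of victims, and it parses the dates instead of comparing them as text.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical purchases: (person, store, date as M/D/YYYY, status).
purchases = [
    ("Ana", "Primark", "3/1/2015", "Legitima"),
    ("Ana", "Ikea", "3/5/2015", "Legitima"),
    ("Ana", "Nike", "3/10/2015", "Fraudulenta"),
    ("Luis", "Primark", "2/20/2015", "Legitima"),
    ("Luis", "Fnac", "4/2/2015", "Fraudulenta"),
    ("Luis", "Ikea", "4/9/2015", "Legitima"),
]


def parse(date_text):
    """Turn an M/D/YYYY string into a datetime so dates compare correctly."""
    return datetime.strptime(date_text, "%m/%d/%Y")


def suspicious_stores(purchases):
    """Rank stores by how many victims bought there legitimately before a fraud."""
    frauds = defaultdict(list)
    for person, _, date_text, status in purchases:
        if status == "Fraudulenta":
            frauds[person].append(parse(date_text))

    victims_by_store = defaultdict(set)
    for person, store, date_text, status in purchases:
        earlier = any(parse(date_text) < f for f in frauds.get(person, []))
        if status == "Legitima" and earlier:
            victims_by_store[store].add(person)

    return sorted(
        ((store, sorted(victims)) for store, victims in victims_by_store.items()),
        key=lambda item: (-len(item[1]), item[0]),
    )


ranking = suspicious_stores(purchases)
print(ranking)
```

The store shared by the most victims comes first in the ranking, which mirrors the role of the `ORDER BY Contador DESC` clause in the third query.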

↧

The 6 Open Source Solutions That Companies Use

October 20, 2016, 4:45 am

We could go on at length in this message, but we will get to the point. What we want to reflect is a reality we are seeing in more and more organizations: the use of Open Source solutions, of ever higher quality, to manage the day-to-day and the strategic needs of companies.

We are no longer talking only about operating systems or backend solutions, but about powerful business solutions for all kinds of users within the company. Here they are:

Portals (and more): Liferay
Document Management (and more): Alfresco
Analytics (and more): Pentaho
ERP (and more): Odoo
CRM (and more): SuiteCRM
Data Management (and more): Talend

↧

The Best Open Source Resources for Alfresco

October 25, 2016, 5:01 am

For all of you who work with Alfresco, you will find this compilation tremendously useful:

Auditing

A.A.A.R. - Alfresco Audit Analysis and Reporting
Alfresco Audit Dashlet - Dashlet to view Alfresco audit logs

Authentication and Authorization

alfresco-agreement-filter - This extension adds a must read page for every user before starting to use Alfresco.
Share oAuth - Spring Surf extension allowing remote endpoints to be easily set up against OAuth 1.0 and OAuth 2.0 services
Share oAuth SSO - Alfresco Share OAuth SSO Support

Backup and Restore

Alfresco BART - Backup and Recovery Tool - Alfresco BART is a tool written in shell script on top of Duplicity for doing Alfresco backups and restore from a local file system, FTP, SCP or Amazon S3.

Benchmark

Alfresco Benchmark - Alfresco Benchmark framework, utilities and load tests: a scalable load test suite

Content Management Systems

Crafter CMS - A web CMS built on top of Alfresco as the repository

Content Management System Integrations

Drupal Alfresco - Alfresco module provides integration between Drupal and Alfresco Enterprise Content Management System.
AlfrescoDoc for Joomla - A Joomla module to display documents from Alfresco.
AlfrescoDoc for Wordpress - A WordPress plugin to display documents from Alfresco.

Content Stores

Alfresco Cloud Store - Migrated from Google Code
alfresco-s3-adapter - Alfresco AMP Module for S3 Backed Storage
Compressing Content Store for Alfresco - An Alfresco ContentStore implementation, which compresses certain mime types (but not others)
Simple Content Stores - Addon to provide a set of common content store implementations and easy-to-use configuration (no Spring config)

Classification and OCR

Alfresco Google Vision - Google Vision API integration in Alfresco
Alfresco Simple OCR - Simple OCR action for Alfresco
Uploader Plus - An Alfresco uploader that prompts for metadata

Custom Builds

LXCommunity ECM - Open source custom build of Alfresco Community with commercial support

Data List Management

Alfresco Datalists - Datalist Extensions for Alfresco Share
alfresco-datalist-constraints - Use datalists to maintain Alfresco model constraints
AlfrescoDataListDownload - Download as Spreadsheet support for Alfresco DataLists
Alfresco List Manager - Component used to manage custom list of values used in metadata forms.

Desktop Sync

CMISSync - Synchronize content between a CMIS repository and your desktop. Like Dropbox for Enterprise Content Management!

Development

Aikau - Aikau UI Framework
Alfresco SDK - The Alfresco SDK, based on Apache Maven, includes support for rapid and standard development, testing, packaging, versioning and release of your Alfresco integration and extension projects
Alfresco Enhanced Script Environment - Provides additional functionality for the server-side JavaScript environments of both the Alfresco Repository and Alfresco Share tier.
Alfresco JavaScript Batch Executer - Easy bulk processing in Alfresco with JavaScript
Alfresco Javascript Console - Administration Console component for Alfresco Share that enables the execution of arbitrary JavaScript code against the repository
alfresco-jscript-extensions - Alfresco repository module with JavaScript root object extensions that are useful in many scenarios.
Alfresco Maven - Base Maven setup of parent POM, common definitions and plugins for building Alfresco modules without Alfresco SDK (except for a single plugin mojo)
Alfresco @mvc - Enables the usage of Spring @MVC within Alfresco.
alfresco-ng2-components - Alfresco Angular 2 components
Dynamic Extensions for Alfresco - Rapid development of Alfresco repository extensions in Java. Deploy your code in seconds, not minutes. Life is too short for endless server restarts.
Enables CORS support for an Alfresco repository
generator-alfresco - A Yeoman generator based on the Alfresco all-in-one Maven archetype with some generators and an opinionated project structure.
Alfresco Share ReactJS - An Alfresco AIO starter kit to start creating Alfresco Share widgets with ReactJS
Alfresco Utility - Project to consolidate abstract utility features that may be reused across functional Alfresco modules

Deployment and Installation

Alfresco Ubuntu Install - Install a production ready Alfresco on Ubuntu 14.04 onwards.
Chef Alfresco - A build automation tool that provides a modular, configurable and extensible way to install an Alfresco architecture
Docker Alfresco - Containerised Alfresco
Puppet Alfresco - Puppet Build Script for Alfresco
Vagrant Alfresco - Project for starting up an Alfresco instance inside a Vagrant VM
Alfresco SPK - Design, run, integrate Alfresco stacks
Share Announcements - Alfresco add-on that allows system announcements to be managed in the Data Dictionary and displayed on the login page.

Digital Signatures

Alfresco eSign Cert - Provides an Alfresco Share action for signing PDF files (PAdES-BES format) and any other file (CAdES-BES format detached) via java applet and more.
CounterSign - A digital signature solution for Alfresco

Documents

Alfresco PDF Toolkit - Migrated project from Google Code
Alfresco PDF Toolkit - Loftux maintained fork - Maintained fork of Alfresco PDF Toolkit

Email

Alfresco Discussions - Send an email to all site members whenever a discussion topic is created/updated. This extension also allows you to reply to the notification via email
Alfresco RFC822/EML tweaks - Alfresco RFC822/EML tweaks
Inbound Invites - send calendar invitations to an Alfresco Share site

Encryption

Alfresco Encryption Module - Extends features of Alfresco system, which allows users to encrypt and decrypt their data on repository.

External App Development

Alfresco JS API - Alfresco API for JavaScript in the browser and Node.js
CMIS JS - A CMIS javascript library for node and browser
Spring Social Alfresco - Spring Social plugin for Alfresco.

External Clients and Applications

Alfrescian CMIS Browser - Simple CMIS Repository Browser using CMIS 1.1
Alfresco HTML5 Client - A simple Alfresco client written only in HTML5 and JavaScript. Uses the CMIS Browser Binding and is based on AngularJS and Bootstrap.
Bootfresco - Twitter Bootstrap client for Alfresco

Form Controls and Document Library Components

alfresco-colleagues-picker-form-control - Limits the people picker to show only users who are members of the same groups as the currently logged-in user
alfresco-value-assistance - Configurable value assistance module for Alfresco Share that allows picklists to be managed using datalists.
Alvex Datagrid - Can be used in place of Alfresco default datagrid with additional features
Alvex Masterdata - Extends default Alfresco content model LIST constraints to use dynamic and external lists of values.
Alvex Orgchart - Extends standard Alfresco users and groups functionality by adding complete organizational chart that is more convenient for business users than flat groups.

Integrations

Marklogic Alfresco Integration

Online Editing

Alfresco Etherpad Integration - Alfresco to Etherpad integration
Alfresco Google Docs - Alfresco Google Docs integration
Alfresco LibreOffice Online Editing - A LibreOffice Online Edit Module for Alfresco
Alfresco OnlyOffice Integration - This Share plugin enables users to edit Office documents within ONLYOFFICE from Alfresco Share.
Online editing with LibreOffice in Alfresco Share

Mobile Clients

Alfresco iOS App - Alfresco Official iOS app
Alfresco Android App - Alfresco Official Android App
Ionic Alfresco - Alfresco ADF bindings for Ionic 2 and Angular 2

Localisation Tools

alfresco-localisation-tools - Localisation tools for Alfresco

Language Packs

Serbian - Serbian Language pack for Alfresco
Swedish - Swedish Language pack for Alfresco

Management

Alfresco JMX - Add JMX functionality to Alfresco Community Edition
Alfresco Share Import Export - This extension allows you to import and export ACP files from Share UI
Alfresco Bulk Import - Alfresco Bulk Import Tool v2.x - for Alfresco v5.0 and up
Alfresco Bulk Export - Migrated from Google Code
Alfresco ETL Connector - The ETL Connector extension for Alfresco allows documents to be imported into an Alfresco repository using compatible ETL tools.
Alfresco Max Version Policy - Alfresco Max Version Policy limits the number of versions that are created for a versioned node.
Alfresco My Files Quota - Define quota policies on My Files folder for each user
Alfresco Shell Tools - Command line tools to admin Alfresco. Migrated from Google Code
Alfresco Trashcan Cleaner - This Alfresco module periodically purges old content from the Alfresco trashcan.
AuditShare for Alfresco - displays sites and repository usage info.
AuditSurf - AuditSurf is a SURF app displaying repository usage info
FileSynchronizer - Small tool for synchronizing local files with remote server (based on ssh) or Alfresco (based on http)
MassiveDelete - A simple Alfresco massive deletion batch.
OOTBEE Support Tools - "Liberated" variant of the Alfresco Support Tools addon
Share Import/Export Tools - A collection of Python scripts which can be used to import and export sites and users from Alfresco Share.

Records Management

Alfresco Records Management - Official Alfresco Records Management Community Source Code

Share Add-ons

Alfresco Permission Labels - Displays user permission levels in Document Library Views as a label

Alfresco Default User Avatars - Alfresco module that creates color coded avatars for users without a personal profile picture
Alfresco Share Clipboard - This extension adds a Clipboard to the Alfresco Share document library that allows collecting documents.
Alfresco Share Site Creators - An Alfresco add-on that limits site creation to those in a specific group.
Alfresco Share Site Logo Customization - This addon will allow you to set a different logo for each Alfresco Site
Alfresco Unzip Action - This extension allows you to add "Unzip" action in Alfresco Share Document Library web tier (available in both Document Library site and repository).
Geo Views add-on for Alfresco Share - Map-based views of geotagged content items in Share, plus support for adding/modifying geotags via a map interface

Share Dashlets

Alfresco Favorite Folders Dashlet - Adds favorite folder dashlet to Alfresco Share
Event Scheduling Dashlet - This extension allows you to plan events directly from a Share dashlet (the dashlet can be added, either on a user or on a site dashboard).
Notice Dashlet - Dashlet to display a user-defined piece of content on a user or a site dashboard

Transformers and Previewers

Alfresco Vector Transformations Module - Adding support for vector file transformations in Alfresco including DWG and SVG
Loftux Media Viewers for Alfresco Share - Loftux maintained fork of Alfresco Media Viewers add-on with additional viewers
MD Preview - Markdown Previews and Editing for Alfresco Share
Media Viewers - Enhanced document previews for a range of different document and media types, plus a dashlet allowing any content item to be displayed on a site dashboard.
STL Previewer - Enables Share previews of STL 3d Model files

Tutorials

Alfresco Developer Series - Source code from Alfresco Developer Series tutorials by Jeff Potts
Alfresco Tutorials - Source for Alfresco Tutorials written by Ole Hejlskov.
Alfresco API Java Examples - Examples showing how to hit the Alfresco Public API using Java.

Visualisations

Alfresco Visualization Tools - Includes dashlets to view and visualize content within Alfresco repositories using D3.js and Simile Project.
ContentCraft - ContentCraft is a Bukkit style plugin for Minecraft that connects, via CMIS, to an Alfresco repository.

Workflow

Activiti - Activiti Workflow
Flowable - Recent fork of Alfresco Activiti by core maintainers

Documentation

Manual Manager for Alfresco - Create a documentation and manuals system based on Markdown inside Alfresco

Other

Slack Bot for Alfresco - a simple chatbot for Slack that connects to your Alfresco instance and provides some handy functionality
Alfresco Tooling - Common Alfresco tooling, scripts and test setups.

↧

Upcoming Webinar Presenting the New Jedox 7

October 27, 2016, 1:06 am

Don't miss the upcoming webinar presenting the best Business Intelligence tool for Planning and Budgeting. Register for free.

↧

Thinking of Improving or Upgrading Your Pentaho Environment?

October 27, 2016, 5:04 am

Pentaho CE has been deployed in many organizations for more than 10 years.

Fortunately, in most cases users get a great deal out of it, but as new versions have been released and the community has contributed improvements, an upgrade often becomes necessary to improve:

- Performance and bottlenecks
- The front-end and the user experience
- New features and enhancements

You can take a look at the improvements introduced by the Pentaho specialists at Stratebi, which include:

- Console improvements (tags, search, comments)
- Improved OLAP and Reporting tools
- New tools for building Dashboards and Scorecards
- Powerful predefined Dashboards
- Integration with Big Data and Real Time environments

See the improvements in action:

Demo_Pentaho - Big Data

↧

Twitter Real Time Dashboard

October 27, 2016, 10:53 am

A good example of a Real Time application with Big Data technologies for ingesting information from social networks, which can then be processed, run through 'sentiment analysis', combined with information in a Data Lake, etc.

Go to the Dashboard

Architecture:

The user or API sends filter keywords over a WebSocket connection; on the server, a connection with the client (API or user) is created, obtained through the "Stream Holder" component, whose job is to manage the requested connections.

The "Stream Holder" requests a credential from the "Credentials Pool", which is used to open a connection to the public Twitter API and send a query specifying the filters. The result is a stream of real-time tweets received through the "Message Receiver".

The "Message Receiver" is a participant in the observer pattern: when the Twitter connection receives a tweet, it notifies the "Message Receiver", which, so as not to block the calling thread, uses a message queue to communicate with the "Server Socket". In other words, it puts the messages on the queue and the "Server Socket" picks them up from there.

This keeps the blocking time at O(1), which is the computational complexity of inserting into a queue.
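
The hand-off through a queue can be sketched with Python's standard library. The function names `on_tweet` and `server_socket_loop`, the sample messages and the `STOP` sentinel are assumptions made for this illustration; the actual components of the demo are not shown in the post.

```python
import queue
import threading

message_queue = queue.Queue()
delivered = []
STOP = object()  # sentinel telling the consumer to finish


def on_tweet(text):
    """Called by the (hypothetical) stream thread; returns without waiting on the client."""
    message_queue.put(text)  # enqueue only, delivery happens elsewhere


def server_socket_loop():
    """Drains the queue and forwards each message (here: appends to a list)."""
    while True:
        item = message_queue.get()  # blocks until a message is available
        if item is STOP:
            break
        delivered.append(item)


consumer = threading.Thread(target=server_socket_loop)
consumer.start()
for text in ["tweet 1", "tweet 2", "tweet 3"]:
    on_tweet(text)
message_queue.put(STOP)
consumer.join()
print(delivered)
```

The receiving side only pays the cost of the enqueue, so a slow client connection cannot stall the thread that reads from Twitter.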

This solution can be extended to a much larger number of nodes when combined with a Kafka cluster, as shown in our Kafka demo.

See it in action:

↧

Big Data Applications in Tourism

October 31, 2016, 7:51 am

An interesting study presented by our friends at Territorio Creativo, which gives a good overview of Big Data applications in Tourism.

For our part, we leave you some examples of applications in Tourism and Big Data demos, applicable to different areas.

↧

Big Data OLAP Analysis on Hadoop with Apache Kylin

November 2, 2016, 7:47 am

In this case study we use Apache Kylin and STPivot to support interactive OLAP analysis of a data warehouse (DW) that holds data with Big Data characteristics (Volume, Velocity and Variety).

The data set is a large Volume of academic data covering the last 15 years of a large university. From this data source, a multidimensional model was designed for analyzing academic performance. It contains about 100 million rows with measures such as the credits for subjects passed, failed or enrolled in. These facts are analyzed along several dimensions or analysis contexts, such as Sex or Grade, and the ever-present time component, the Academic Year.

Since this Volume of data is too large to analyze with acceptable performance using traditional OLAP systems (R-OLAP and M-OLAP), we decided to try Apache Kylin, which promises query response times of a few seconds for Volumes above 10 billion rows.

The Hadoop ecosystem technologies that Kylin relies on are Apache Hive and Apache HBase.
The data warehouse (DW) is built as a star schema and kept in Apache Hive.

From this model, and through the definition of a metadata model for the OLAP cube, Apache Kylin builds a multidimensional cube (MOLAP) in HBase by means of an offline process.

See Big Data OLAP in action

From this point on, Kylin allows queries against the cube through its SQL interface, which is also accessible through JDBC/ODBC connectors.

Finally, to enable OLAP analysis through MDX queries and the corresponding multidimensional tables or views, we use STPivot.

STPivot is an OLAP viewer developed by StrateBI as part of the STTools suite.
STPivot uses Mondrian as its OLAP server and can be deployed on a BI server such as Pentaho BA Server, both of which are open source. This way, STPivot lets you create and explore multidimensional views or tables, like the ones in this demo, that use the OLAP cube built with Apache Kylin.

Developed by eBay and later released as an open source Apache project, Kylin is an open source tool that supports online analytical processing (OLAP) of large volumes of data with Big Data characteristics (Volume, Velocity and Variety).

Until Kylin arrived, however, OLAP technology was limited to relational databases or, at best, relational databases with optimizations for multidimensional storage, technologies with significant limitations when facing Big Data.

Apache Kylin, built on top of several Hadoop technologies, provides a SQL interface for running multidimensional analysis queries over a data set, achieving very low query times (seconds) for fact tables that can reach 10 billion rows or more.
The Hadoop technologies that Kylin fundamentally depends on are Apache Hive and Apache HBase.

The Data Warehouse (DW) is built as a star schema and kept in Apache Hive. From this model, and through the definition of a metadata model for the OLAP cube, Apache Kylin runs an offline process that builds a multidimensional (MOLAP) cube in HBase. This structure is optimized for querying through the SQL interface that Kylin provides.
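To make the star schema and the SQL interface more concrete, here is a minimal sketch of the kind of aggregate query such a model supports. It uses Python's built-in `sqlite3` module purely as a stand-in for a SQL endpoint; it does not talk to Kylin or Hive, and the table and column names (`fact_credits`, `dim_year` and so on) are hypothetical, not taken from the demo.

```python
import sqlite3


def credits_by_year(fact_rows, year_rows):
    """Sum enrolled and passed credits per academic year over a tiny star schema."""
    conn = sqlite3.connect(":memory:")  # stand-in for a real SQL endpoint
    conn.executescript(
        """
        CREATE TABLE dim_year (year_id INTEGER PRIMARY KEY, academic_year TEXT);
        CREATE TABLE fact_credits (
            year_id INTEGER REFERENCES dim_year(year_id),
            sex TEXT,
            credits_enrolled REAL,
            credits_passed REAL
        );
        """
    )
    conn.executemany("INSERT INTO dim_year VALUES (?, ?)", year_rows)
    conn.executemany("INSERT INTO fact_credits VALUES (?, ?, ?, ?)", fact_rows)
    # Join the fact table to one dimension and aggregate the measures.
    result = conn.execute(
        """
        SELECT d.academic_year,
               SUM(f.credits_enrolled) AS enrolled,
               SUM(f.credits_passed)   AS passed
        FROM fact_credits f
        JOIN dim_year d ON d.year_id = f.year_id
        GROUP BY d.academic_year
        ORDER BY d.academic_year
        """
    ).fetchall()
    conn.close()
    return result
```

A pre-built cube pays off for exactly this query shape (a join to dimension tables followed by GROUP BY and SUM), because the aggregates can be computed ahead of time instead of being scanned from the fact table on each request.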

When Kylin receives a SQL query, it must decide whether it can answer it from the MOLAP cube in HBase (in milliseconds or seconds) or whether, on the contrary, the requested data is not included in the MOLAP cube and the query has to be run against the star schema in Apache Hive (minutes), which is rare.

Finally, thanks to the use of SQL and the availability of JDBC/ODBC drivers, we can connect Business Intelligence tools such as Tableau, Apache Zeppelin or even MDX query engines such as Pentaho Mondrian, enabling multidimensional analysis in its usual forms: multidimensional views or tables, dashboards or reports.

See Big Data Dashboard in action

STPivot is an OLAP viewer that is powerful yet easy to use, developed by StrateBI as part of the STTools suite of Business Intelligence applications.

The goal of this viewer is to improve the user experience by making OLAP analysis as simple as dragging and dropping measures and analysis contexts onto a canvas, so that the OLAP view is generated transparently for the user.

In addition, query wizards, novel charts alongside the multidimensional tables themselves, an advanced formula editor and export of views in different formats for publishing are some of STPivot's most notable features, and they set our tool apart from other existing OLAP viewers.

As for its architecture, STPivot runs on top of the Mondrian MDX execution engine.
This is why STPivot can be used as an application of the open source Business Intelligence server Pentaho BA Server (CE), which already includes Mondrian.

Thanks to JDBC connectivity, Mondrian can connect to Apache Kylin, which makes it possible to use this OLAP and Big Data source with STPivot.

As the Big Data source for this demo, we have a large Volume of fictitious academic data covering the last 15 years of a large university that more than one million students have attended in that time. From this data source we designed a multidimensional model for analyzing academic performance.

It contains about 100 million rows with measures such as the sum of the credits for passed, failed or enrolled courses.

There are also other measures derived from these, and therefore more complex, such as the Performance Rate and the Success Rate, calculated as the ratio of Credits Passed to Credits Enrolled and the ratio of Credits Passed to Credits Attempted, respectively.
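The two derived measures can be sketched in a few lines. This is only an illustration under stated assumptions: the field names (`credits_passed`, `credits_enrolled`, `credits_attempted`) and the zero-denominator convention are hypothetical and are not taken from the demo's cube definition.

```python
def rate(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def academic_rates(rows):
    """Compute the Performance Rate and Success Rate for a group of fact rows.

    Rates are ratios, so they are not additive: sum the credits first,
    then divide, instead of averaging per-row rates.
    """
    passed = sum(r["credits_passed"] for r in rows)
    enrolled = sum(r["credits_enrolled"] for r in rows)
    attempted = sum(r["credits_attempted"] for r in rows)
    return {
        "performance_rate": rate(passed, enrolled),
        "success_rate": rate(passed, attempted),
    }
```

Summing before dividing matters whenever the view is rolled up: averaging the rates of two groups of different sizes would weight them equally and give a different, misleading figure.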

No less important are the dimensions or analysis contexts along which these measures are analyzed. As single-level dimensions we have Sex, Grade, Age Range and the ever-present time component, the Academic Year. We have also included two complex dimensions, with two-level hierarchies and higher cardinality, since dimensions of this kind are common in practice.

With the Study dimension, we can analyze the data grouped at the Study Type level (Bachelor's, Master's, Doctorate, ...) or drill down (the Drill Down operation on the OLAP view) to the individual Study Plans, that is, the individual degree programs, such as "315-Grado en Biología".
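The roll-up and drill-down operations on a two-level hierarchy like Study can be sketched as follows. This is a plain in-memory illustration, not how Kylin or Mondrian evaluate them; the function names and the `study_type`, `study_plan` and `credits_passed` field names are assumptions made for the example.

```python
from collections import defaultdict


def roll_up(rows, level, measure="credits_passed"):
    """Sum a measure grouped by one hierarchy level (a field name)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[level]] += row[measure]
    return dict(totals)


def drill_down(rows, study_type, measure="credits_passed"):
    """Restrict to one Study Type and break it down by Study Plan."""
    members = [r for r in rows if r["study_type"] == study_type]
    return roll_up(members, "study_plan", measure)
```

Calling `roll_up(rows, "study_type")` gives the top-level view, and `drill_down(rows, "Grado")` then shows only the plans that belong to that type.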

↧

Open Source Business Intelligence tips and tricks in October 16

November 4, 2016, 2:02 am

≫ Next: List of Open Source Business Intelligence tools

≪ Previous: Analysis Big Data OLAP sobre Hadoop con Apache Kylin

Here are the latest tips on Open Source Business Intelligence from October, mainly about Pentaho, CTools and Saiku. You can see some of these tips implemented in our Demo Online.

This month's great stuff:

- https://www.panorama.com/blog/history-business-intelligence/

- http://pedroalves-bi.blogspot.com.es/2016/09/ctools-iot-smart-cities-and-more.html

- https://www.youtube.com/watch?v=WgoPYx21xYU&app=desktop

- http://todobi.blogspot.com.es/2016/09/location-intelligence-bringing-together.html

- https://github.com/mbostock/shapefile/blob/master/README.md

- http://todobi.blogspot.com.es/2016/11/analysis-big-data-olap-sobre-hadoop-con.html

- http://www.lewisgavin.co.uk/CDE-Dashboard/

- http://rpbouman.blogspot.com.es/2016/05/odxl-generic-data-export-layer-for.html

- http://pedroalves-bi.blogspot.com.es/2016/10/pentaho-7.0.html

- https://github.com/kleysonr/NMC-samples

- http://diethardsteiner.github.io/flink/2016/09/18/Flink-Twitter-Stream.html

- http://ubiquis.co.uk/dwh/status-change-fact-table-part-1-the-problem/

- http://todobi.blogspot.com.es/2016/10/list-of-open-source-solutions-for-smart.html

- https://redcloverbi.wordpress.com/2016/10/21/backup-y-restore-en-pentaho-de-forma-facil/

- https://github.com/bhagyas/awesome-alfresco

- http://todobi.blogspot.com.es/2016/10/twitter-real-time-dashboard.html

- http://ubiquis.co.uk/dwh/status-change-fact-table-part-2-the-input-data/

- http://ubiquis.co.uk/dwh/status-change-fact-table-part-4-a-pdi-implementation/

- https://github.com/jazzido/mondrian-rest

↧

List of Open Source Business Intelligence tools

November 5, 2016, 10:02 am

≫ Next: OLAP for Big Data. Is it possible?

≪ Previous: Open Source Business Intelligence tips and tricks in October 16

Here you can find an updated list of the main open source business intelligence tools. If you know of any others, don't hesitate to write to us.

- Talend, including ETL, Data Quality and MDM. Open Source and Enterprise versions

- Pentaho, including Kettle, Mondrian, JFreeReport and Weka. Open Source and Enterprise versions

- BIRT, for reporting

- Seal Report, for reporting

- LinceBI, including Kettle, Mondrian, STDashboard, STCard and STPivot

- Jasper Reports, including iReport. Open Source and Enterprise versions

- Jedox Base, including Palo core. Open Source and Enterprise versions

- Saiku, for OLAP Analysis. Open Source and Enterprise versions

- SpagoBI, including Talend, Mondrian, JPivot and Palo

- Knime, including Knime connectors

- Kibana, for elasticsearch data

↧

OLAP for Big Data. Is it possible?

November 10, 2016, 8:03 am

≫ Next: Pentaho 7 CE Now Available for Download

≪ Previous: List of Open Source Business Intelligence tools

Hadoop is a great platform for storing a lot of data, but running OLAP is usually done on smaller datasets in legacy and traditional proprietary platforms. OLAP workloads are beginning to migrate to the one data lake that is running Hadoop and Spark.

Fortunately, there are a number of Apache projects that are starting to make OLAP possible on Hadoop.

Apache Kylin

For an introduction to this interesting Hadoop project, check out this article. Apache Kylin, originally from eBay, is a Distributed Analytics Engine that provides SQL and OLAP access to Hadoop datasets using Hive and HBase. It can also be called through Spark SQL, which makes it a very useful project. This project lets you work with Power BI, Tableau and Excel, with more tool support coming soon. You can build MOLAP cubes and support many users with fast queries over billions of rows. Apache Kylin provides JDBC and ODBC drivers.

Check our Post with demo online and detailed information

An interesting talk on Mondrian, MDX and Apache Kylin points to big things in OLAP. Kylin is yet another project that uses the excellent Apache Calcite.

I would recommend giving this project a try to see if it meets your needs. It is one of the best options out there. It is currently not part of the Big Hadoop Three's supported stacks.

Druid

Druid is another very strong offering in fast SQL OLAP solutions on Hadoop, with support growing rapidly. The documentation for this project is excellent and makes it easy for OLAP-oriented DBAs, data architects, data engineers and data-focused programmers to get started with this interesting Big Data project. Druid provides sub-second OLAP queries, using column orientation and inverted indexes to enable multi-dimensional filtering, scanning and aggregation. Again, it is not officially part of the Big Hadoop Three's supported stacks. I recommend downloading and installing this project and giving it a test run. Airbnb and Alibaba are users of Druid.
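The idea of filtering on several dimensions through inverted indexes can be sketched in a few lines. This is a simplified illustration and not Druid's implementation: Druid stores compressed bitmaps per dimension value, whereas this sketch uses plain Python sets, and the function names and sample field names are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(rows, dimensions):
    """Map each (dimension, value) pair to the set of row ids that contain it."""
    index = defaultdict(set)
    for row_id, row in enumerate(rows):
        for dim in dimensions:
            index[(dim, row[dim])].add(row_id)
    return index


def filtered_sum(rows, index, filters, metric):
    """Sum a metric over the rows that match every dimension filter."""
    matching = None
    for dim, value in filters.items():
        ids = index.get((dim, value), set())
        # Intersect the id sets so only rows matching all filters remain.
        matching = ids if matching is None else matching & ids
    if matching is None:  # no filters given: aggregate over every row
        matching = range(len(rows))
    return sum(rows[i][metric] for i in matching)
```

The index lookups narrow the candidate rows before any metric value is read, which is what keeps filtered aggregations cheap when only a small share of the rows match.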

And the secret word for Druid: Apache Calcite. This project seems to be everywhere, and you will find it here as well.

Apache Lens

Apache Lens provides a unified analytics interface to Hadoop. It is pretty quick to install, works with Hive, JDBC and OLAP Cubes. There is an Apache Zeppelin interface for Apache Lens which is good. I don't hear a lot about this one, but again it seems interesting.

Other Options To Investigate:

SnappyData (Strong SQL, In-Memory Speed, and GemfireXD history)
Apache HAWQ (Strong SQL support and Greenplum history)
Splice Machine (Now Open Source)
Hive LLAP is moving into OLAP, SQL 2011 support is growing and so is performance.
Apache Phoenix may be able to do basic OLAP with some help from Saiku or STPivot. I really like Phoenix and it has the performance and power to back up a lot of data through queries and concurrency. It is lacking a lot of the OLAP specific queries that some tools and users will most likely need. I am thinking that Apache Calcite and Phoenix will eventually make this a great OLAP tool.

Source: Dzone

↧

Pentaho 7 CE Now Available for Download

November 12, 2016, 1:07 am

≫ Next: How to Start Learning Big Data in 2 Hours

≪ Previous: OLAP for Big Data. Is it possible?

Version 7 of Pentaho Open Source is now available, both the BI Server and PDI (Pentaho Data Integration).

Enjoy!!

If you need help migrating from earlier versions, take a look at this post.

On this blog you can follow what was covered in each of the very interesting talks given at the Pentaho Community Meeting in Antwerp (PCM16).

One of the most interesting features presented is:

WebSpoon

A web browser based version of Spoon. WebSpoon is basically Spoon running in your browser, as easy as that.

By accessing a server URL, you can create, preview, save and run transformations and jobs in your browser. WebSpoon works on the server side, so all your transformations are stored and run on the server.

If you want to deploy WebSpoon yourself, you can download the .war file from the repository, copy it to the Tomcat web server folder and restart your server. After doing so, WebSpoon will be accessible through the URL of your running server.

Several use cases come to mind for a browser based Spoon:

PDI on the go: run PDI on your smartphone or tablet.
Security: transformations and jobs run on the server so the data remains within the server.
No installation required.
No difference in UI between BI server and DI server.

To start developing and contribute to the project, clone the repository, install RAP and Eclipse, and import the cloned UI folder as an Eclipse project.


↧

How to Start Learning Big Data in 2 Hours

November 17, 2016, 6:45 am

≫ Next: Dashboards and Business Intelligence for Smart Cities

≪ Previous: Pentaho 7 CE Now Available for Download

Big Data is one of the milestones of recent years. Many people want to get closer to it and learn the basics first, to get a general idea. But it is hard to find a quick guide that, in a couple of hours, helps you hold your own with Big Data, especially if you do not have strong technical skills.

That is why we have put together a collection of infographics, presentations, a webinar, demos and documentation so that you can get a first look at Big Data in 2 hours!!

1. Infographics

2. Webinar

View in presentation format

3. Demos

See Online Demos

4. Key Points and Presentations

5. Big Data Green Paper

More info? Write to us

↧

Dashboards and Business Intelligence for Smart Cities

November 17, 2016, 8:17 am

≫ Next: Types of Roles in Analytics (Business Intelligence, Big Data)

≪ Previous: How to Start Learning Big Data in 2 Hours

More and more cities are implementing Smart City solutions, which span a wide range of areas in terms of technologies, devices, data analytics and so on.

What they all have in common is that they must integrate information and indicators from all kinds of data sources: traditional relational databases, social networks, mobile applications, sensors... It is essential that there are no islands or closed technologies, which is why Open Source is key, since it can be adapted to all kinds of solutions.

Based on our experience in some of the smart city projects we have taken part in, we want to share a few technologies, resources and demos that may help you:

1. List of Open Source solutions for Smart Cities - Internet of Things projects

2. List of Open Source Business Intelligence tools for Smart Cities

3. 35 Open Source Tools for the Internet of Things (IoT)

Demos:

Big Data Technologies

Business Intelligence Demos

Near real time traffic monitoring for Madrid City Council (Access)

Geopositioning of dynamic routes (Access/Video)

Route recommendation (graphs) (Access/Video)

↧

Types of Roles in Analytics (Business Intelligence, Big Data)

November 20, 2016, 9:01 am

≫ Next: Business Intelligence for Hadoop Benchmark

≪ Previous: Dashboards and Business Intelligence for Smart Cities

As the Analytics industry grows, it becomes harder to know the description of each role and position. What is more, the titles are often used incorrectly, mixing up tasks, job descriptions and so on.

This confuses both the specialists themselves and the people who are training and studying to do these jobs. In such a fast-changing industry, new specialized positions appear frequently. Here we describe each of them:

Business Analyst:

Data Analyst:

Data and Analytics Manager:

Data Architect:

Data Engineer:

Data Scientist:

Database Administrator:

Statistician:

You may also be interested in:

How to pass an interview with Pentaho BI Open Source?
Data Analyst skills and their differences
Start learning Big Data in 2 hours?

Seen on KDnuggets

↧

Business Intelligence for Hadoop Benchmark

November 25, 2016, 10:45 am

≫ Next: Jedox 7 Release and What's New

≪ Previous: Types of Roles in Analytics (Business Intelligence, Big Data)

This benchmark, which you can download from AtScale, is quite interesting and offers insights about Business Intelligence on Hadoop.

If you are interested, check also our posts:

- OLAP for Big Data. Is it possible?
- List of Open Source Business Intelligence tools
- Analysis Big Data OLAP sobre Hadoop con Apache Kylin (Spanish)
- Caso de uso de Apache Kafka en tiempo real, Big Data (Spanish)

About the Benchmark:

Key Findings:

SQL-on-Hadoop engines are well suited for Business Intelligence (BI): All tested engines – Hive, Impala, Presto, and Spark SQL – successfully executed all of the queries in our benchmark suite and are stable enough to support business intelligence workloads.

There is no single “best engine”: We continue to see the different engines shine in different areas. Depending on raw data size, query complexity, and the target number of end-users enterprises will find that each engine has its own ‘sweet spot’.

Version-to-version improvements are significant: The open source community continues to drive significant and rapid improvements across the board. All engines tested showed performance gains of between 2x and 4x in the six months between the first and second edition of the benchmarks. This is great news for those enterprises deploying BI workloads to Hadoop.

Small vs. Big Data: Impala and Spark SQL continue to shine for small data queries (queries against the AtScale Adaptive Cache). New in this edition, the latest release of Hive LLAP (Live Long and Process) shows suitable “small data” query response times. Presto also shows promise on small, interactive queries.

Few vs. Many Users: While Impala continues to shine in terms of concurrent query performance, Hive and SparkSQL showed improvements in this category. Presto, new to this edition of the benchmarks, showed the best results in our user concurrency testing.

↧