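A few days ago at Google I/O, Google officially announced Cloud Dataflow (CDF), the successor to MapReduce. It will join Google Cloud Platform as a competitive offering that combines batch processing and stream computing. It is essentially an ETL and streaming tool, with back-end analysis handed off to BigQuery (Dremel). CDF is aimed mainly at the data processing products AWS has launched, such as Data Pipeline and Kinesis. It has three main uses: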
· for data integration and preparation (e.g. in preparation for interactive SQL in BigQuery)
· to examine a real-time stream of events for significant patterns and activities
· to implement advanced, multi-step processing pipelines to extract deep insight from datasets of any size
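Google CDF is based on the internal Flume and MillWheel systems. In a sense, though, CDF does not exclude MapReduce (MR): the Flume paper, at least, does not describe any replacement for it, while MillWheel handles streaming. CDF will probably not have much impact on the Hadoop community, which is already gradually moving to Spark. Google itself only began migrating its internal systems to CDF one or two years ago, so some MR systems are probably still running inside the company. What was announced this time is that CDF will be offered as a service to end users (including mobile apps such as Snapchat and Rising Star).
In my view, the claims in some Chinese and foreign media that Google has abandoned MR, or replaced MR with CDF, are questionable. CDF is most likely a system that combines MillWheel, Flume, and MR. MR itself may of course have been optimized and improved, for example with in-memory techniques. After all, GFS has been renamed CNS, so why couldn't MR be optimized too?
Read on if you are interested in the details: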
Sneak peek: Google Cloud Dataflow, a Cloud-native data processing service
Posted: Thursday, June 26, 2014
In today’s world, information is being generated at an incredible rate. However, unlocking insights from large datasets can be cumbersome and costly, even for experts.
It doesn’t have to be that way. Yesterday, at Google I/O, you got a sneak peek of Google Cloud Dataflow, the latest step in our effort to make data and analytics accessible to everyone. You can use Cloud Dataflow:
· for data integration and preparation (e.g. in preparation for interactive SQL in BigQuery)
· to examine a real-time stream of events for significant patterns and activities
· to implement advanced, multi-step processing pipelines to extract deep insight from datasets of any size
In these cases and many others, you use Cloud Dataflow’s data-centric model to easily express your data processing pipeline, monitor its execution, and get actionable insights from your data, free from the burden of deploying clusters, tuning configuration parameters, and optimizing resource usage. Just focus on your application, and leave the manag
Reimagining developer productivity and data analytics in the cloud – news from Google IO
Posted: Wednesday, June 25, 2014
Today at Google I/O, we are introducing new services that help developers build and optimize data pipelines, create mobile applications, and debug, trace, and monitor their cloud applications in production.
Introducing Google Cloud Dataflow
A decade ago, Google invented MapReduce to process massive datasets using distributed computing. Since then, more devices and information require more capable analytics pipelines — though they are difficult to create and maintain.
Today at Google I/O, we are demonstrating Google Cloud Dataflow for the first time. Cloud Dataflow is a fully managed service for creating data pipelines that ingest, transform and analyze data in both batch and streaming modes. Cloud Dataflow is a successor to MapReduce, and is based on our internal technologies like Flume and MillWheel.
Cloud Dataflow makes it easy for you to get actionable insights from your data while lowering operational costs without the hassles of deploying, maintaining or scaling infrastructure. You can use Cloud Dataflow for use cases like ETL, batch data processing and streaming analytics, and it will automatically optimize, deploy and manage the code and resources required.
Debug, trace and monitor your application in production
We are also introducing several new Cloud Platform tools that let developers understand, diagnose and improve systems in production.
Google Cloud Monitoring is designed to help you find and fix unusual behavior across your application stack. Based on technology from our recent acquisition of Stackdriver, Cloud Monitoring provides rich metrics, dashboards and alerting for Cloud Platform, as well as more than a dozen popular open source apps, including Apache, Nginx, MongoDB, MySQL, Tomcat, IIS, Redis, Elasticsearch and more. For example, you can use Cloud Monitoring to identify and troubleshoot cases where users are experiencing increased error rates connecting from an App Engine module or slow query times from a Cassandra database with minimal configuration.
We know that it can be difficult to isolate the root cause of performance bottlenecks. Cloud Trace helps you visualize and understand time spent by your application for request processing. In addition, you can compare performance between various releases of your application using latency distributions.
Finally, we’re introducing Cloud Debugger, a new tool to help you debug your applications in production with effectively no performance overhead. Cloud Debugger gives you a full stack trace and snapshots of all local variables for any watchpoint that you set in your code while your application continues to run undisturbed in production. This brings modern debugging to cloud-based applications.
New features for mobile development
With rapid autoscaling, caching and other mobile friendly capabilities, many apps like Snapchat or Rising Star have built and run on Cloud Platform. We’re adding new features that make building a mobile app using Cloud Platform even better.
Today, we’re demonstrating a new version of Google Cloud Save, which gives you a simple API for saving, retrieving, and synchronizing user data to the cloud and across devices without needing to code up the backend. Data is stored in Google Cloud Datastore, making the data accessible from Google App Engine or Google Compute Engine using the existing Datastore API. Google Cloud Save is currently in private beta and will be available for general use soon.
We’ve also added tooling to Android Studio, which simplifies the process of adding an App Engine backend to your mobile app. In particular, Android Studio now has three built-in App Engine backend module templates, including Java Servlet, Java Endpoints and an App Engine backend with Google Cloud Messaging. Since this functionality is powered by the open-source App Engine plug-in for Gradle, you can use the same build configuration for both your app and your backend across IDE, CLI and Continuous Integration environments.
We’ll be doing more detailed follow-up posts about these announcements in the coming days, so stay tuned.
-Posted by Greg DeMichillie, Director of Product Management
*Apache, Nginx, MongoDB, MySQL, Tomcat, IIS, Redis, Elasticsearch and Cassandra are trademarks of their respective owners.
Google Launches Cloud Dataflow, A Managed Data Processing Service
Posted Jun 25, 2014 by Frederic Lardinois (@fredericl)
Google expanded its Cloud Platform today with a new managed service called Cloud Dataflow that allows developers to create data pipelines to help them ingest, transform and, most importantly, analyze data. Developers can use the service to work with streaming real-time data and by uploading batches of data to the system.
For now, the service is in private beta and it’s unclear how Google will price Dataflow once it is launched to the public. At its core, Cloud Dataflow is Google’s successor to MapReduce, which has been an experimental App Engine feature for quite a while now.
The company says Dataflow is based on a number of technologies the company has been using internally, including Flume and MillWheel. Google is using Java for the first Cloud Dataflow SDK, but it is also providing a dashboard for monitoring these pipelines right from the developer console.
The focus here, according to Google, is to help its users get “actionable insights from your data while lowering operational costs without the hassles of deploying, maintaining or scaling infrastructure.”
Because this is a private beta, Google isn’t publishing any throughput numbers just yet, but the service will be able to ingest virtually any kind of data in its streaming mode and newline-delimited text files, BigQuery tables and similar data in its batch mode.
With this service, Google closes a major hole in its Cloud Platform lineup. For quite a while now, Amazon has offered its own data pipeline service, and with Kinesis, it launched a service that specializes in real-time data processing at its developer conference last November.
Previously, Google’s focus in this area had mostly been on MapReduce and BigQuery. Google says BigQuery is complementary to Dataflow. Developers can use Dataflow as a part of the data ingestion into BigQuery, for example, by preparing or filtering the data for BigQuery. Once the data is cleaned, it can be written to BigQuery, where it becomes immediately accessible. At the same time, though, Dataflow can be used to read from BigQuery in case you want to join data from your database with other data sources. And to complete the cycle, you can then write all of this back to BigQuery, too, of course.
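The prepare, filter and join flow described above can be pictured with a small plain-Python sketch. This is only an illustration under stated assumptions: it does not use the Cloud Dataflow SDK or any BigQuery API, and every name in it (`parse_lines`, `keep_valid`, `join_with_users`, the `user_id`, `amount` and `country` fields) is hypothetical. It reads newline-delimited JSON records, drops malformed or invalid ones, joins the rest with a second data source, and produces rows that would then be loaded into an analytics store.

```python
import json


def parse_lines(lines):
    """Parse newline-delimited JSON, skipping blank or malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def keep_valid(records):
    """Filter step: keep records with a user id and a positive amount."""
    for rec in records:
        if rec.get("user_id") is not None and rec.get("amount", 0) > 0:
            yield rec


def join_with_users(records, users_by_id):
    """Join step: enrich each record with data from a second source."""
    for rec in records:
        user = users_by_id.get(rec["user_id"])
        if user is not None:
            yield {**rec, "country": user["country"]}


def prepare_rows(lines, users_by_id):
    """Chain the steps into one pipeline and materialize the cleaned rows."""
    return list(join_with_users(keep_valid(parse_lines(lines)), users_by_id))


# Hypothetical input: raw event lines plus a lookup table from another source.
raw_lines = [
    '{"user_id": 1, "amount": 9.5}',
    "not json",
    '{"user_id": 2, "amount": -3}',
    '{"user_id": 3, "amount": 4.0}',
    "",
]
users = {1: {"country": "US"}, 2: {"country": "DE"}}

rows = prepare_rows(raw_lines, users)
print(rows)
```

Each step is a generator, so records flow through the chain one at a time. In a managed service the equivalent steps would be distributed and scaled for you, which is the part this sketch does not attempt to show.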
In a demo during today’s keynote, Google showed how its engineers, with the help of Twitter, used this service to do sentiment analysis around the World Cup by looking at millions of tweets.
Google Dumps MapReduce in Favor of New Hyper-Scale Analytics System
Google has abandoned MapReduce, the system for running data analytics jobs spread across many servers the company developed and later open sourced, in favor of a new cloud analytics system it has built called Cloud Dataflow.
MapReduce has been a highly popular infrastructure and programming model for doing parallelized distributed computing on server clusters. It is the basis of Apache Hadoop, the Big Data infrastructure platform that has enjoyed widespread deployment and become core of many companies’ commercial products.
The technology is unable to handle the amounts of data Google wants to analyze these days, however. Urs Hölzle, senior vice president of technical infrastructure at the Mountain View, California-based giant, said it got too cumbersome once the size of the data reached a few petabytes.
“We don’t really use MapReduce anymore,” Hölzle said in his keynote presentation at the Google I/O conference in San Francisco Wednesday. The company stopped using the system “years ago.”
Cloud Dataflow, which Google will also offer as a service for developers using its cloud platform, does not have the scaling restrictions of MapReduce.
“Cloud Dataflow is the result of over a decade of experience in analytics,” Hölzle said. “It will run faster and scale better than pretty much any other system out there.”
It is a fully managed service that is automatically optimized, deployed, managed and scaled. It enables developers to easily create complex pipelines using unified programming for both batch and streaming services, he said.
All these characteristics address what Google thinks does not work in MapReduce: it is hard to ingest data rapidly, it requires a lot of different technology, batch and streaming are unrelated, and deployment and operation of MapReduce clusters is always required.
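To make the idea of one programming model for both batch and streaming more concrete, here is a toy plain-Python sketch. It is an assumption-laden illustration, not the Cloud Dataflow SDK: the functions `count_words`, `run_batch` and `run_streaming` are hypothetical, and the streaming side uses simple fixed-size windows counted by number of events rather than anything time-based. The point is only that the same transform is written once and applied to a bounded collection or to successive windows of a stream.

```python
from collections import Counter
from itertools import islice


def count_words(events):
    """The shared transform: count words across a collection of text events."""
    counts = Counter()
    for text in events:
        counts.update(text.lower().split())
    return counts


def run_batch(events):
    """Batch mode: apply the transform once to the whole bounded dataset."""
    return count_words(events)


def run_streaming(event_stream, window_size):
    """Streaming mode: apply the same transform to consecutive windows."""
    iterator = iter(event_stream)
    while True:
        window = list(islice(iterator, window_size))
        if not window:
            return
        yield count_words(window)


# Hypothetical events, loosely inspired by the World Cup demo mentioned earlier.
events = ["goal Brazil", "goal Germany", "Germany wins", "goal Germany"]

batch_result = run_batch(events)
stream_results = list(run_streaming(iter(events), window_size=2))

print(batch_result)
for window_counts in stream_results:
    print(window_counts)
```

Summing the per-window counters gives the same totals as the batch run, which is the property a unified model is meant to preserve. A real streaming system also has to deal with late or out-of-order data, which this sketch ignores.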
Hölzle announced other new services on Google’s cloud platform at the show:
§ Cloud Save is an API that enables an application to save an individual user’s data in the cloud or elsewhere and use it without requiring any server-side coding. Users of Google’s Platform-as-a-Service offering App Engine and Infrastructure-as-a-Service offering Compute Engine can build apps using this feature.
§ Cloud Debugging makes it easier to sift through lines of code deployed across many servers in the cloud to identify software bugs.
§ Cloud Tracing provides latency statistics across different groups (latency of database service calls for example) and provides analysis reports.
§ Cloud Monitoring is an intelligent monitoring system that is a result of integration with Stackdriver, a cloud monitoring startup Google bought in May. The feature monitors cloud infrastructure resources, such as disks and virtual machines, as well as service levels for Google’s services and for more than a dozen non-Google open source packages.
Related information:
1. http://www.infoworld.com/t/hadoop/why-google-cloud-dataflow-no-hadoop-killer-245212
2. Sneak peek: Google Cloud Dataflow, a Cloud-native data processing service
3. Reimagining developer productivity and data analytics in the cloud – news from Google IO
4. Google Launches Cloud Dataflow, A Managed Data Processing Service
5. Google launches Cloud Dataflow, says MapReduce tired
6. Why Google’s Unveiling of Cloud Dataflow is Great News for Tableau Users
7. Google Dumps MapReduce in Favor of New Hyper-Scale Analytics System
8. http://www.infoq.com/cn/news/2014/06/google-cloud-dataflow