Yesterday I learned that Pivotal has open-sourced HAWQ. It is not quite the Greenplum open-sourcing I had heard about earlier, but it is probably close enough, since HAWQ is essentially Greenplum over HDFS. I took a rough look at the code:
- src: essentially the source of a PostgreSQL server. Some of the code dates from a 2006 release and some from after 2009, so it looks like a version maintained internally by Greenplum rather than one matching any particular community release. The bulk of it was exported around 2006; some source archaeology might uncover its lineage. The code differs greatly from the 9.0 series and is much closer to 8.0.
(1) backend: adds external-access code, most of it apparently for PXF (a query sketch against a PXF external table follows this list); adds catalog code that can load from HCatalog; modifies the btree index code; adds Parquet support; adds the CDB storage engine; modifies the executor and optimizer; adds SunOS and QNX ports; adds a resource manager; adds a scan cache
(2) the rest of the tree: the Greenplum engineers clearly have a taste for optimization. "This is the greenplum logtape implementation. The original postgres logtape impl is unnecessarily complex and it prevents many perfomanace optmizations."
- PXF: the Pivotal Extension Framework, which lets HAWQ query external data sources, including Hive, HBase, and HDFS, through a pluggable interface. It is written in Java. Its main interfaces are Fragmenter, Resolver, Accessor, and FilterBuilder, which handle data partitioning, column resolution, data access, and data filtering respectively (see the plugin-roles sketch after this list).
- contrib: external libraries the project uses, including data hashing and encryption, test code, a sparse-vector type for float arrays, and fixed-length data handling
- depends: the C/C++ library for YARN
- tools: gpnetbench, for testing network performance
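To make the external-access path concrete, here is a minimal sketch of querying HDFS data through a PXF external table over JDBC, assuming a reachable HAWQ master and a running PXF service. The host names, the 51200 port, the HdfsTextSimple profile, and the table and file names are illustrative assumptions, not taken from the HAWQ sources, so adjust them to the actual deployment.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HawqPxfQuery {
    public static void main(String[] args) throws Exception {
        // HAWQ speaks the PostgreSQL wire protocol, so the stock JDBC driver works.
        String url = "jdbc:postgresql://hawq-master:5432/postgres";
        try (Connection conn = DriverManager.getConnection(url, "gpadmin", "");
             Statement st = conn.createStatement()) {

            // External table over a CSV file in HDFS, served through PXF.
            // Host, port, and PROFILE are assumptions; they vary by deployment.
            st.execute("CREATE EXTERNAL TABLE ext_users (id int, name text) "
                    + "LOCATION ('pxf://namenode:51200/demo/users.csv?PROFILE=HdfsTextSimple') "
                    + "FORMAT 'TEXT' (DELIMITER ',')");

            try (ResultSet rs = st.executeQuery("SELECT id, name FROM ext_users WHERE id > 1")) {
                while (rs.next()) {
                    System.out.println(rs.getInt(1) + " " + rs.getString(2));
                }
            }
        }
    }
}
```

To show how the four PXF roles split the work, the toy pipeline below walks a fake data source through fragmentation, access, resolution, and filtering. The interfaces here are hypothetical stand-ins written for this post only; they are not the real PXF Java API, they just mirror the responsibilities of Fragmenter, Accessor, Resolver, and FilterBuilder.

```java
import java.util.Arrays;
import java.util.List;

public class PxfPluginSketch {

    // Fragmenter role: split an external data set into fragments (splits)
    // that HAWQ segments can later read in parallel.
    interface Fragmenter {
        List<String> getFragments(String dataSource);
    }

    // Accessor role: open a fragment and hand back raw rows.
    interface Accessor {
        List<String> readFragment(String fragment);
    }

    // Resolver role: turn a raw row into typed fields that match the
    // external table's column definitions.
    interface Resolver {
        List<Object> getFields(String rawRow);
    }

    // FilterBuilder role: push a WHERE-clause predicate down to the source
    // so irrelevant rows are dropped before they reach HAWQ.
    interface FilterBuilder {
        boolean accept(List<Object> fields);
    }

    public static void main(String[] args) {
        Fragmenter fragmenter = source -> Arrays.asList(source + "#part-0", source + "#part-1");
        Accessor accessor = fragment -> Arrays.asList("1,alice", "2,bob");   // fake raw CSV rows
        Resolver resolver = raw -> {
            String[] cols = raw.split(",");
            return Arrays.<Object>asList(Integer.parseInt(cols[0]), cols[1]);
        };
        FilterBuilder filter = fields -> (Integer) fields.get(0) > 1;        // e.g. WHERE id > 1

        // The rough pipeline an external-table scan follows:
        for (String fragment : fragmenter.getFragments("hdfs:///demo/users")) {
            for (String raw : accessor.readFragment(fragment)) {
                List<Object> fields = resolver.getFields(raw);
                if (filter.accept(fields)) {
                    System.out.println(fragment + " -> " + fields);
                }
            }
        }
    }
}
```

In PXF proper each role is a separate pluggable class, which is what makes it possible to add new data sources without touching the HAWQ server itself.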
http://hawq.incubator.apache.org/
https://github.com/apache/incubator-hawq
In a class by itself, only Apache HAWQ (incubating) combines exceptional MPP-based analytics performance, robust ANSI SQL compliance, Hadoop ecosystem integration and manageability, and flexible data-store format support. All natively in Hadoop. No connectors required.
Built from a decade’s worth of massively parallel processing (MPP) expertise developed through the creation of the Pivotal Greenplum® enterprise database and open source PostgreSQL, HAWQ enables you to swiftly and interactively query Hadoop data, natively via HDFS.
HAWQ is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.