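Introduction
Atlas is a scalable and extensible set of core foundational governance services that enables enterprises to meet their compliance requirements within Hadoop effectively and efficiently, and that integrates with the wider enterprise data ecosystem. Apache Atlas provides open metadata management and governance capabilities so that organizations can build a catalog of their data assets, classify and govern those assets, and give data scientists, analysts, and data governance teams a way to collaborate around them.
Features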
Metadata types & instances
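- Ships with many predefined types for Hadoop and non-Hadoop metadata
- Lets you define new types for the metadata you want to manage
- Types can have primitive attributes, complex attributes, and object references, and can inherit from other types
- Instances of types, called entities, capture the details of metadata objects and the relationships between them
- REST APIs for working with types and instances make integration easier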
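As an illustration of the points above, the following sketch builds the JSON body for a new entity type that inherits from the predefined DataSet type. The type name `sample_report` and its attributes are hypothetical, and the body is only printed, not sent. It is shaped for the v2 typedefs endpoint (`POST /api/atlas/v2/types/typedefs`); check the field names against the REST documentation for your Atlas version.

```python
import json


def attribute_def(name, type_name, optional=True):
    """Build one attribute definition in the shape used by Atlas v2 typedefs."""
    return {
        "name": name,
        "typeName": type_name,
        "isOptional": optional,
        "cardinality": "SINGLE",
        "isUnique": False,
        "isIndexable": False,
    }


def entity_def(name, super_types, attributes):
    """Build an entity type definition; super_types expresses inheritance."""
    return {
        "name": name,
        "superTypes": list(super_types),
        "attributeDefs": list(attributes),
    }


# Hypothetical type: "sample_report" inherits from the predefined DataSet type.
report_type = entity_def(
    "sample_report",
    ["DataSet"],
    [attribute_def("format", "string"), attribute_def("rowCount", "long")],
)

# Request body for the typedefs endpoint (built here, not sent).
typedefs_payload = {
    "enumDefs": [],
    "structDefs": [],
    "classificationDefs": [],
    "entityDefs": [report_type],
}

print(json.dumps(typedefs_payload, indent=2))
```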
Classification
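- Classifications can be created dynamically, for example PII, EXPIRES_ON, DATA_QUALITY, SENSITIVE
- Classifications can carry attributes, for example an expiry_date attribute on EXPIRES_ON
- An entity can be associated with multiple classifications, which makes discovery and security enforcement easier
- Classifications propagate along lineage, so they follow the data through each processing step automatically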
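The propagation idea can be sketched with a toy lineage graph. This is a simplified model for illustration only, not Atlas's implementation; the entity names are hypothetical.

```python
from collections import deque


def propagate(lineage, tagged):
    """Return the classifications of each entity after pushing tags downstream.

    lineage: dict mapping an entity to the entities derived from it.
    tagged:  dict mapping an entity to its directly attached classifications.
    """
    result = {entity: set(tags) for entity, tags in tagged.items()}
    queue = deque(tagged)
    while queue:
        source = queue.popleft()
        for target in lineage.get(source, []):
            current = result.setdefault(target, set())
            missing = result[source] - current
            if missing:
                current |= missing
                queue.append(target)  # target gained tags, so revisit its children
    return result


# Hypothetical pipeline: raw table -> cleaned table -> report.
lineage = {
    "raw_customers": ["clean_customers"],
    "clean_customers": ["customer_report"],
}
tags = propagate(lineage, {"raw_customers": {"PII"}})
print(tags["customer_report"])
```

Because an entity is re-queued only when it gains a new classification, the loop also terminates on graphs that contain cycles.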
Lineage
- An intuitive UI for viewing the lineage of data as it moves through processing steps
- REST APIs for reading and updating lineage
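To show how lineage data might be consumed, the sketch below walks a list of lineage edges and collects everything upstream of an entity. It assumes each edge is a dict with `fromEntityId` and `toEntityId` keys, which is how the `relations` list of the v2 lineage response is commonly described; treat those field names and the sample identifiers as assumptions.

```python
def upstream(relations, guid):
    """Collect every entity that feeds into `guid`, directly or indirectly."""
    parents = {}
    for rel in relations:
        parents.setdefault(rel["toEntityId"], set()).add(rel["fromEntityId"])

    seen, stack = set(), [guid]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical edges: table_a -> process_1 -> table_b -> process_2 -> table_c
relations = [
    {"fromEntityId": "table_a", "toEntityId": "process_1"},
    {"fromEntityId": "process_1", "toEntityId": "table_b"},
    {"fromEntityId": "table_b", "toEntityId": "process_2"},
    {"fromEntityId": "process_2", "toEntityId": "table_c"},
]
print(sorted(upstream(relations, "table_c")))
```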
Search/Discovery
- An intuitive UI for searching entities by type, classification, attribute value, or free text
- Rich REST APIs for searching by complex criteria
- A SQL-like query language for searching entities: the Domain Specific Language (DSL)
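As a small example of the DSL, the sketch below builds a URL for the v2 DSL search endpoint. The base URL reuses the `atlas.rest.address` value from this article's configuration, the query text is illustrative, and the request is not actually sent.

```python
from urllib.parse import urlencode

# atlas.rest.address as configured later in this article.
ATLAS_URL = "http://CentOS:21000"


def dsl_search_url(query, limit=10, offset=0):
    """Return a GET URL for a DSL search, with the query safely URL-encoded."""
    params = urlencode({"query": query, "limit": limit, "offset": offset})
    return f"{ATLAS_URL}/api/atlas/v2/search/dsl?{params}"


# Illustrative query: Hive tables with a given name.
url = dsl_search_url('hive_table where name = "customers"')
print(url)
```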
Security & Data Masking
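- Fine-grained security for metadata access, controlling who may act on entities, for example adding, updating, or removing classifications
- Integration with Apache Ranger enables authorization and data masking on data access, based on the classifications attached to entities in Atlas. For example:
- who can access data classified as PII or SENSITIVE
- customer-service users see only the last four digits of data classified as NATIONAL_ID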
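The masking example above can be sketched in a few lines. This is a toy policy for illustration, not how Ranger evaluates policies; the `compliance` group is a hypothetical name for users allowed to see full values.

```python
def mask_show_last4(value, mask_char="x"):
    """Hide every character except the last four."""
    text = str(value)
    if len(text) <= 4:
        return text
    return mask_char * (len(text) - 4) + text[-4:]


def visible_value(value, classifications, user_groups):
    """Toy policy: only the hypothetical 'compliance' group sees NATIONAL_ID in full."""
    if "NATIONAL_ID" in classifications and "compliance" not in user_groups:
        return mask_show_last4(value)
    return value


print(visible_value("123456789", {"NATIONAL_ID"}, {"customer-service"}))
print(visible_value("123456789", {"NATIONAL_ID"}, {"compliance"}))
```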
Architecture
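Atlas consists mainly of the following components: Type System, Graph Engine, and Ingest / Export.
- Type System - Atlas lets users define a model for the metadata objects they want to manage. The model is made up of definitions called 'types'. Instances of types are called 'entities', and they represent the actual metadata objects being managed. The Type System is the component that lets users define and manage types and entities.
- Graph Engine - Internally, Atlas stores the metadata objects it manages in a graph model. This approach is very flexible and handles the rich relationships between metadata objects efficiently. The Graph Engine translates between the types and entities of the Atlas Type System and the underlying graph persistence model. Besides managing graph objects, it also creates the appropriate indexes for metadata objects so that they can be searched efficiently. Atlas uses JanusGraph to store metadata objects.
- Ingest / Export - The Ingest component lets metadata be added to Atlas. Similarly, the Export component exposes the metadata changes that Atlas detects as events. Consumers can use these change events to react to metadata changes in real time.
Note: the JanusGraph project site is https://janusgraph.org/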
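The relationship between types and entities described above can be made concrete with a tiny in-memory model. This is a conceptual sketch only and does not reflect Atlas's internal classes; the `asset` and `table` types are hypothetical.

```python
class TypeDef:
    """A named type with attributes and an optional parent type."""

    def __init__(self, name, attributes, super_type=None):
        self.name = name
        self.attributes = dict(attributes)  # attribute name -> Python type
        self.super_type = super_type

    def all_attributes(self):
        """Own attributes plus everything inherited from parent types."""
        inherited = self.super_type.all_attributes() if self.super_type else {}
        return {**inherited, **self.attributes}


def validate_entity(type_def, values):
    """Return a list of problems; an empty list means the entity fits the type."""
    problems = []
    expected = type_def.all_attributes()
    for name, py_type in expected.items():
        if name not in values:
            problems.append(f"missing attribute: {name}")
        elif not isinstance(values[name], py_type):
            problems.append(f"wrong type for {name}")
    problems.extend(f"unknown attribute: {n}" for n in values if n not in expected)
    return problems


asset = TypeDef("asset", {"name": str})
table = TypeDef("table", {"columns": list}, super_type=asset)

print(validate_entity(table, {"name": "orders", "columns": ["id", "total"]}))
print(validate_entity(table, {"columns": "id"}))
```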
Deployment and installation
1. Install the base services: JDK, Solr, Kafka, HDFS, HBase, ZooKeeper, and Maven. The environment variables are configured as follows:
JAVA_HOME=/usr/java/latest
HBASE_HOME=/usr/hbase-2.2.4
HIVE_HOME=/usr/apache-hive-3.1.2-bin
M2_HOME=/usr/apache-maven-3.6.3
HADOOP_HOME=/usr/hadoop-2.9.0
CLASSPATH=.
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$M2_HOME/bin:$HBASE_HOME/bin:$HIVE_HOME/bin
export JAVA_HOME
export CLASSPATH
export PATH
export HADOOP_HOME
export M2_HOME
export MAVEN_OPTS="-Xms2g -Xmx2g"
export HBASE_HOME
export HBASE_CONF_DIR=$HBASE_HOME/conf
export HIVE_CONF_DIR=$HIVE_HOME/conf
export HIVE_AUX_JARS_PATH=/usr/apache-atlas-hive-hook-2.1.0/hook/hive
export HIVE_HOME
Note that Atlas needs to read the HBase configuration at runtime, so HBASE_CONF_DIR and HIVE_CONF_DIR must be set on the machine where Atlas is installed.
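Before going further, it can help to confirm that these variables are actually visible to the process that will run Atlas. The following is a minimal sketch; the list of required variables is taken from the exports above, and you may want to adjust it for your environment.

```python
import os

# Variables exported in the environment setup above.
REQUIRED = [
    "JAVA_HOME",
    "HADOOP_HOME",
    "HBASE_HOME",
    "HBASE_CONF_DIR",
    "HIVE_HOME",
    "HIVE_CONF_DIR",
]


def check_environment(env):
    """Return {variable: problem} for unset variables or paths that are not directories."""
    problems = {}
    for name in REQUIRED:
        value = env.get(name)
        if not value:
            problems[name] = "not set"
        elif not os.path.isdir(value):
            problems[name] = f"not a directory: {value}"
    return problems


# Report problems for the current shell environment.
for name, problem in check_environment(os.environ).items():
    print(f"{name}: {problem}")
```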
2. Download the Atlas source code and build it
① Download https://www.apache.org/dyn/closer.cgi/atlas/2.1.0/apache-atlas-2.1.0-sources.tar.gz
② Extract the source archive and compile it
[root@CentOS ~]# tar -zxf apache-atlas-2.1.0-sources.tar.gz
[root@CentOS ~]# export MAVEN_OPTS="-Xms2g -Xmx2g"
[root@CentOS apache-atlas-sources-2.1.0]# mvn clean -DskipTests install
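The build takes a long time. Maven must be installed, and configuring the Aliyun mirror in Maven's settings.xml speeds up dependency downloads: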
<mirror>
<id>nexus-aliyun</id>
<mirrorOf>central</mirrorOf>
<name>Nexus aliyun</name>
<url>http://maven.aliyun.com/nexus/content/groups/public</url>
</mirror>
...
[INFO] Reactor Summary:
[INFO]
[INFO] Apache Atlas Server Build Tools 1.0 ................ SUCCESS [ 3.788 s]
[INFO] apache-atlas 2.1.0 ................................. SUCCESS [ 26.519 s]
[INFO] Apache Atlas Test Utility Tools 2.1.0 .............. SUCCESS [ 16.866 s]
[INFO] Apache Atlas Integration 2.1.0 ..................... SUCCESS [ 20.882 s]
[INFO] Apache Atlas Common 2.1.0 .......................... SUCCESS [ 8.228 s]
[INFO] Apache Atlas Client 2.1.0 .......................... SUCCESS [ 0.494 s]
[INFO] atlas-client-common 2.1.0 .......................... SUCCESS [ 5.477 s]
[INFO] atlas-client-v1 2.1.0 .............................. SUCCESS [ 5.570 s]
[INFO] Apache Atlas Server API 2.1.0 ...................... SUCCESS [ 5.407 s]
[INFO] Apache Atlas Notification 2.1.0 .................... SUCCESS [ 10.025 s]
[INFO] atlas-client-v2 2.1.0 .............................. SUCCESS [ 4.534 s]
[INFO] Apache Atlas Graph Database Projects 2.1.0 ......... SUCCESS [ 0.252 s]
[INFO] Apache Atlas Graph Database API 2.1.0 .............. SUCCESS [ 5.109 s]
[INFO] Graph Database Common Code 2.1.0 ................... SUCCESS [ 4.075 s]
[INFO] Apache Atlas JanusGraph-HBase2 Module 2.1.0 ........ SUCCESS [ 7.278 s]
[INFO] Apache Atlas JanusGraph DB Impl 2.1.0 .............. SUCCESS [ 17.070 s]
[INFO] Apache Atlas Graph DB Dependencies 2.1.0 ........... SUCCESS [ 1.338 s]
[INFO] Apache Atlas Authorization 2.1.0 ................... SUCCESS [ 6.442 s]
[INFO] Apache Atlas Repository 2.1.0 ...................... SUCCESS [ 59.939 s]
[INFO] Apache Atlas UI 2.1.0 .............................. SUCCESS [ 48.107 s]
[INFO] Apache Atlas New UI 2.1.0 .......................... SUCCESS [01:13 min]
[INFO] Apache Atlas Web Application 2.1.0 ................. SUCCESS [02:50 min]
[INFO] Apache Atlas Documentation 2.1.0 ................... SUCCESS [ 2.449 s]
[INFO] Apache Atlas FileSystem Model 2.1.0 ................ SUCCESS [ 3.564 s]
[INFO] Apache Atlas Plugin Classloader 2.1.0 .............. SUCCESS [ 6.368 s]
[INFO] Apache Atlas Hive Bridge Shim 2.1.0 ................ SUCCESS [ 7.260 s]
[INFO] Apache Atlas Hive Bridge 2.1.0 ..................... SUCCESS [ 15.244 s]
[INFO] Apache Atlas Falcon Bridge Shim 2.1.0 .............. SUCCESS [ 7.279 s]
[INFO] Apache Atlas Falcon Bridge 2.1.0 ................... SUCCESS [ 6.254 s]
[INFO] Apache Atlas Sqoop Bridge Shim 2.1.0 ............... SUCCESS [ 2.611 s]
[INFO] Apache Atlas Sqoop Bridge 2.1.0 .................... SUCCESS [ 6.670 s]
[INFO] Apache Atlas Storm Bridge Shim 2.1.0 ............... SUCCESS [ 10.405 s]
[INFO] Apache Atlas Storm Bridge 2.1.0 .................... SUCCESS [ 6.310 s]
[INFO] Apache Atlas Hbase Bridge Shim 2.1.0 ............... SUCCESS [ 4.456 s]
[INFO] Apache Atlas Hbase Bridge 2.1.0 .................... SUCCESS [ 14.005 s]
[INFO] Apache HBase - Testing Util 2.1.0 .................. SUCCESS [ 2.127 s]
[INFO] Apache Atlas Kafka Bridge 2.1.0 .................... SUCCESS [ 5.254 s]
[INFO] Apache Atlas classification updater 2.1.0 .......... SUCCESS [ 4.538 s]
[INFO] Apache Atlas Impala Hook API 2.1.0 ................. SUCCESS [ 2.435 s]
[INFO] Apache Atlas Impala Bridge Shim 2.1.0 .............. SUCCESS [ 2.685 s]
[INFO] Apache Atlas Impala Bridge 2.1.0 ................... SUCCESS [ 8.240 s]
[INFO] Apache Atlas Distribution 2.1.0 .................... SUCCESS [ 1.378 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
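With this many modules, it is easy to miss a failure in the middle of the Reactor Summary. The sketch below parses summary lines in the format shown above and returns the status of each module; the sample text is abbreviated from the log above.

```python
import re

# Matches lines such as:
# [INFO] Apache Atlas Common 2.1.0 .......................... SUCCESS [  8.228 s]
MODULE_LINE = re.compile(
    r"^\[INFO\] (?P<module>.+?) \.+ (?P<status>SUCCESS|FAILURE|SKIPPED)\b"
)


def reactor_statuses(log_text):
    """Map each module in a Maven Reactor Summary to its build status."""
    statuses = {}
    for line in log_text.splitlines():
        match = MODULE_LINE.match(line.strip())
        if match:
            statuses[match.group("module")] = match.group("status")
    return statuses


sample_log = """\
[INFO] Apache Atlas Common 2.1.0 .......................... SUCCESS [  8.228 s]
[INFO] Apache Atlas Distribution 2.1.0 .................... SUCCESS [  1.378 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
"""

statuses = reactor_statuses(sample_log)
not_ok = [module for module, status in statuses.items() if status != "SUCCESS"]
print(statuses)
print("modules needing attention:", not_ok)
```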
3. Package the build to produce the Atlas distribution archives
[root@CentOS apache-atlas-sources-2.1.0]# mvn clean -DskipTests package -Pdist
...
[INFO] Apache Atlas Sqoop Bridge Shim 2.1.0 ............... SUCCESS [ 0.121 s]
[INFO] Apache Atlas Sqoop Bridge 2.1.0 .................... SUCCESS [ 2.522 s]
[INFO] Apache Atlas Storm Bridge Shim 2.1.0 ............... SUCCESS [ 0.271 s]
[INFO] Apache Atlas Storm Bridge 2.1.0 .................... SUCCESS [ 2.437 s]
[INFO] Apache Atlas Hbase Bridge Shim 2.1.0 ............... SUCCESS [ 0.705 s]
[INFO] Apache Atlas Hbase Bridge 2.1.0 .................... SUCCESS [ 2.541 s]
[INFO] Apache HBase - Testing Util 2.1.0 .................. SUCCESS [ 1.706 s]
[INFO] Apache Atlas Kafka Bridge 2.1.0 .................... SUCCESS [ 0.966 s]
[INFO] Apache Atlas classification updater 2.1.0 .......... SUCCESS [ 0.439 s]
[INFO] Apache Atlas Impala Hook API 2.1.0 ................. SUCCESS [ 0.060 s]
[INFO] Apache Atlas Impala Bridge Shim 2.1.0 .............. SUCCESS [ 0.072 s]
[INFO] Apache Atlas Impala Bridge 2.1.0 ................... SUCCESS [ 2.186 s]
[INFO] Apache Atlas Distribution 2.1.0 .................... SUCCESS [01:12 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 04:08 min
[INFO] Finished at: 2021-01-19T22:40:49+08:00
[INFO] ------------------------------------------------------------------------
When the command finishes, the following files are generated under the distro/target directory:
[root@CentOS apache-atlas-sources-2.1.0]# tree distro/target/ | grep tar.gz
├── apache-atlas-2.1.0-bin.tar.gz
├── apache-atlas-2.1.0-falcon-hook.tar.gz
├── apache-atlas-2.1.0-hbase-hook.tar.gz
├── apache-atlas-2.1.0-hive-hook.tar.gz
├── apache-atlas-2.1.0-impala-hook.tar.gz
├── apache-atlas-2.1.0-kafka-hook.tar.gz
├── apache-atlas-2.1.0-server.tar.gz
├── apache-atlas-2.1.0-sources.tar.gz
├── apache-atlas-2.1.0-sqoop-hook.tar.gz
├── apache-atlas-2.1.0-storm-hook.tar.gz
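If you script the build, you may want to confirm that all of these archives were produced. The following sketch derives the expected file names from the listing above and reports any that are missing from a directory; the function names are illustrative.

```python
from pathlib import Path

# Hook names taken from the archive listing above.
HOOKS = ["falcon", "hbase", "hive", "impala", "kafka", "sqoop", "storm"]


def expected_archives(version="2.1.0"):
    """File names the dist build is expected to produce for this version."""
    names = [
        f"apache-atlas-{version}-{kind}.tar.gz"
        for kind in ("bin", "server", "sources")
    ]
    names += [f"apache-atlas-{version}-{hook}-hook.tar.gz" for hook in HOOKS]
    return sorted(names)


def missing_archives(target_dir, version="2.1.0"):
    """Return the expected archives that are not present in target_dir."""
    present = {path.name for path in Path(target_dir).glob("*.tar.gz")}
    return [name for name in expected_archives(version) if name not in present]


# Run from the source root; prints an empty list when everything was built.
print(missing_archives("distro/target"))
```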
4. Extract and install the Atlas server, then configure atlas-application.properties
[root@CentOS ~]# tar -zxf apache-atlas-2.1.0-bin.tar.gz -C /usr/
[root@CentOS ~]# cd /usr/apache-atlas-2.1.0/
[root@CentOS apache-atlas-2.1.0]# tree -L 1
.
├── bin
├── conf
├── data
├── DISCLAIMER.txt
├── hook
├── hook-bin
├── LICENSE
├── logs
├── models
├── NOTICE
├── server
└── tools
9 directories, 3 files
The relevant settings in conf/atlas-application.properties are:
######### Graph Database Configs #########
# Graph Database
atlas.graph.storage.backend=hbase2
atlas.graph.storage.hbase.table=apache_atlas_janus
#Hbase
#For standalone mode , specify localhost
#for distributed mode, specify zookeeper quorum here
atlas.graph.storage.hostname=CentOS:2181
atlas.graph.storage.hbase.regions-per-server=1
atlas.graph.storage.lock.wait-time=10000
# Delete handler
#
# This allows the default behavior of doing "soft" deletes to be changed.
#
# Allowed Values:
# org.apache.atlas.repository.store.graph.v1.SoftDeleteHandlerV1 - all deletes are "soft" deletes
# org.apache.atlas.repository.store.graph.v1.HardDeleteHandlerV1 - all deletes are "hard" deletes
#
#atlas.DeleteHandlerV1.impl=org.apache.atlas.repository.store.graph.v1.SoftDeleteHandlerV1
atlas.DeleteHandlerV1.impl=org.apache.atlas.repository.store.graph.v1.HardDeleteHandlerV1
# Graph Search Index
atlas.graph.index.search.backend=solr
#Solr
#Solr cloud mode properties
atlas.graph.index.search.solr.mode=cloud
atlas.graph.index.search.solr.zookeeper-url=CentOS:2181
atlas.graph.index.search.solr.zookeeper-connect-timeout=60000
atlas.graph.index.search.solr.zookeeper-session-timeout=60000
atlas.graph.index.search.solr.wait-searcher=true
#Solr http mode properties
atlas.graph.index.search.solr.http-urls=http://CentOS:8983/solr
# Solr-specific configuration property
atlas.graph.index.search.max-result-set-size=150
######### Notification Configs #########
atlas.notification.embedded=false
atlas.kafka.data=${sys:atlas.home}/data/kafka
atlas.kafka.zookeeper.connect=CentOS:2181
atlas.kafka.bootstrap.servers=CentOS:9092
atlas.kafka.zookeeper.session.timeout.ms=400
atlas.kafka.zookeeper.connection.timeout.ms=200
atlas.kafka.zookeeper.sync.time.ms=20
atlas.kafka.auto.commit.interval.ms=1000
atlas.kafka.hook.group.id=atlas
atlas.kafka.enable.auto.commit=false
atlas.kafka.auto.offset.reset=earliest
atlas.kafka.session.timeout.ms=30000
atlas.kafka.offsets.topic.replication.factor=1
atlas.kafka.poll.timeout.ms=1000
atlas.notification.create.topics=true
atlas.notification.replicas=1
atlas.notification.topics=ATLAS_HOOK,ATLAS_ENTITIES
atlas.notification.log.failed.messages=true
atlas.notification.consumer.retry.interval=500
atlas.notification.hook.retry.interval=1000
## Server port configuration
atlas.server.http.port=21000
atlas.server.https.port=21443
######### Server Properties #########
atlas.rest.address=http://CentOS:21000
# If enabled and set to true, this will run setup steps when the server starts
atlas.server.run.setup.on.start=false
########## Add http headers ###########
atlas.headers.Access-Control-Allow-Origin=*
atlas.headers.Access-Control-Allow-Methods=GET,OPTIONS,HEAD,PUT,POST
#atlas.headers.<headerName>=<headerValue>
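A few of these settings depend on each other, for example Solr cloud mode needs a ZooKeeper URL and an external Kafka needs bootstrap servers. The sketch below parses a properties file and checks those dependencies. The parser is deliberately simplified (it handles only `key=value` lines and `#` comments), and the checks are an assumption about what is worth verifying, not an official validation.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_atlas_config(props):
    """Return a list of likely misconfigurations (empty when none are found)."""
    problems = []
    if not props.get("atlas.graph.storage.hostname"):
        problems.append("atlas.graph.storage.hostname is not set")
    if props.get("atlas.graph.index.search.backend") == "solr":
        mode = props.get("atlas.graph.index.search.solr.mode")
        if mode == "cloud" and not props.get("atlas.graph.index.search.solr.zookeeper-url"):
            problems.append("solr cloud mode needs atlas.graph.index.search.solr.zookeeper-url")
        if mode == "http" and not props.get("atlas.graph.index.search.solr.http-urls"):
            problems.append("solr http mode needs atlas.graph.index.search.solr.http-urls")
    if props.get("atlas.notification.embedded") == "false" and not props.get("atlas.kafka.bootstrap.servers"):
        problems.append("external Kafka needs atlas.kafka.bootstrap.servers")
    return problems


# A few lines taken from the configuration above.
SAMPLE = """\
# Graph Database
atlas.graph.storage.backend=hbase2
atlas.graph.storage.hostname=CentOS:2181
atlas.graph.index.search.backend=solr
atlas.graph.index.search.solr.mode=cloud
atlas.graph.index.search.solr.zookeeper-url=CentOS:2181
atlas.notification.embedded=false
atlas.kafka.bootstrap.servers=CentOS:9092
"""

print(check_atlas_config(parse_properties(SAMPLE)))
```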
5. Start the Atlas service and log in to the web UI; the username and password are both admin (admin/admin)
[root@CentOS apache-atlas-2.1.0]# ./bin/atlas_start.py
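Once the server has started, one way to confirm it is responding is to call the admin version endpoint with the same admin/admin credentials. The sketch below only builds the authenticated request; `fetch_version` sends it and should be called only when the server is reachable. The host and port come from `atlas.rest.address` above, and the shape of the JSON response is not assumed here.

```python
import base64
import json
import urllib.request


def version_request(base_url="http://CentOS:21000", user="admin", password="admin"):
    """Build a GET request for the admin version endpoint with HTTP basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(f"{base_url}/api/atlas/admin/version")
    request.add_header("Authorization", f"Basic {token}")
    return request


def fetch_version(request, timeout=10):
    """Send the request and return the parsed JSON body (requires a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = version_request()
print(request.full_url)
# print(fetch_version(request))  # uncomment once Atlas is up
```

Atlas can take several minutes to finish starting, so a connection error immediately after running atlas_start.py does not necessarily mean the startup failed.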