
Atlas Metadata Governance: Introduction, Build, and Installation

Introduction

Atlas is a scalable and extensible set of core foundational governance services, enabling enterprises to effectively and efficiently meet their compliance requirements within Hadoop and allowing integration with the whole enterprise data ecosystem. Apache Atlas provides organizations with open metadata management and governance capabilities to build a catalog of their data assets, classify and govern those assets, and offer collaboration capabilities around them for data scientists, analysts and the data governance team.

Features

Metadata types & instances

  • Predefined types for many Hadoop and non-Hadoop metadata objects
  • New types can be defined for the metadata to be managed
  • Types can have primitive attributes, complex attributes and object references, and can inherit from other types
  • Instances of types, called entities, capture the details of metadata objects and their relationships
  • REST APIs to work with types and instances allow easier integration
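As a sketch of that REST surface, the following queries the v2 typedef API; the host CentOS:21000 and the admin/admin login are assumptions matching the deployment configured later in this article:

```shell
# List all entity type definitions via the Atlas v2 REST API.
# ATLAS_URL is an assumed deployment address, not a fixed default.
ATLAS_URL=${ATLAS_URL:-http://CentOS:21000}
# --connect-timeout keeps the example from hanging when no server is running
curl -s --connect-timeout 3 -u admin:admin \
  "$ATLAS_URL/api/atlas/v2/types/typedefs?type=entity" \
  || echo "Atlas server not reachable"
```

The same endpoint accepts POST/PUT/DELETE to create, update and remove type definitions.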

Classification

  • Classifications can be created dynamically - for example PII, EXPIRES_ON, DATA_QUALITY, SENSITIVE
  • Classifications can include attributes - for example an expiry_date attribute on the EXPIRES_ON classification
  • Entities can be associated with multiple classifications, enabling easier discovery and security enforcement
  • Classifications propagate via lineage, so they automatically follow the data as it goes through various processing steps
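As a hedged illustration of what a classification with an attribute looks like, this is a minimal request body for the v2 typedefs endpoint; the description text is illustrative:

```json
{
  "classificationDefs": [
    {
      "name": "EXPIRES_ON",
      "description": "Marks data that expires on a given date",
      "attributeDefs": [
        { "name": "expiry_date", "typeName": "date" }
      ]
    }
  ]
}
```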

Lineage

  • Intuitive UI to view the lineage of data as it moves through various processes
  • REST APIs to access and update lineage
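A minimal sketch of the lineage API, assuming the same server address as the rest of this article; the GUID value is a placeholder for a real entity id:

```shell
# Fetch the lineage graph of one entity by GUID via the v2 REST API.
# GUID below is a placeholder; obtain a real one from a search or entity API call.
GUID=00000000-0000-0000-0000-000000000000
curl -s --connect-timeout 3 -u admin:admin \
  "http://CentOS:21000/api/atlas/v2/lineage/$GUID" \
  || echo "Atlas server not reachable"
```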

Search/Discovery

  • Intuitive UI to search entities by type, classification, attribute value or free text
  • Rich REST APIs to search by complex criteria
  • SQL-like query language to search entities: Domain Specific Language (DSL)
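A few illustrative DSL queries, assuming the hive_table entity type registered by the Hive bridge described later; names and values are examples only:

```
hive_table where name = "orders"
hive_table where owner = "etl" select name, owner
hive_table where name = "orders" limit 10
```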

Security & Data Masking

  • Fine-grained security for metadata access, enabling controls on access to entity instances and operations like add/update/remove of classifications
  • Integration with Apache Ranger enables authorization and data masking on data access, based on the classifications associated with entities in Atlas
  • For example: who can access data classified as PII or SENSITIVE
  • For example: the customer-service user can only see the last 4 digits of columns classified as NATIONAL_ID

Architecture

Atlas consists of the following main components: Type System, Graph Engine, and Ingest/Export.

  • Type System - Atlas allows users to define a model for the metadata objects they want to manage. The model is composed of definitions called 'types'. Instances of types, called 'entities', represent the actual metadata objects that are managed. The Type System lets users define and manage both types and entities.
  • Graph Engine - Internally, Atlas persists the metadata objects it manages using a graph model. This approach provides great flexibility and handles the rich relationships between metadata objects efficiently. The Graph Engine component is responsible for translating between the types and entities of the Atlas type system and the underlying graph persistence model. Besides managing graph objects, it also creates the appropriate indexes for metadata objects so they can be searched efficiently. Atlas uses JanusGraph to store metadata objects.
  • Ingest / Export - The Ingest component allows metadata to be added to Atlas. Similarly, the Export component exposes metadata changes detected by Atlas as events. Consumers can use these change events to react to metadata changes in real time.
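Those change events land on a Kafka topic; a sketch of watching them with the standard Kafka console consumer, assuming the Kafka CLI tools are on the PATH and the broker address configured later in this article:

```shell
# Watch Atlas entity-change events on the ATLAS_ENTITIES topic.
# Broker address matches the notification config in this article; --timeout-ms
# makes the consumer exit after a quiet period instead of blocking forever.
kafka-console-consumer.sh \
  --bootstrap-server CentOS:9092 \
  --topic ATLAS_ENTITIES \
  --from-beginning \
  --timeout-ms 5000 \
  || echo "Kafka console consumer not available"
```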

Note: JanusGraph - https://janusgraph.org/

Deployment and Installation

1. Install the base services: JDK/Solr/Kafka/HDFS/HBase/Zookeeper/Maven. The environment variable configuration looks like this:

JAVA_HOME=/usr/java/latest
HBASE_HOME=/usr/hbase-2.2.4
HIVE_HOME=/usr/apache-hive-3.1.2-bin
M2_HOME=/usr/apache-maven-3.6.3
HADOOP_HOME=/usr/hadoop-2.9.0
CLASSPATH=.
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$M2_HOME/bin:$HBASE_HOME/bin:$HIVE_HOME/bin
export JAVA_HOME
export CLASSPATH
export PATH
export HADOOP_HOME
export M2_HOME
export MAVEN_OPTS="-Xms2g -Xmx2g"
export HBASE_HOME
export HBASE_CONF_DIR=$HBASE_HOME/conf
export HIVE_CONF_DIR=$HIVE_HOME/conf
export HIVE_AUX_JARS_PATH=/usr/apache-atlas-hive-hook-2.1.0/hook/hive
export HIVE_HOME

           
Note that Atlas needs to know the HBase configuration at runtime, so HBASE_CONF_DIR and HIVE_CONF_DIR must be set in the environment on the host where Atlas is installed.

2. Download the Atlas source archive, then build and install it

① Download https://www.apache.org/dyn/closer.cgi/atlas/2.1.0/apache-atlas-2.1.0-sources.tar.gz

② Extract the source archive and build from source

[root@CentOS ~]# tar -zxf apache-atlas-2.1.0-sources.tar.gz
[root@CentOS ~]# export MAVEN_OPTS="-Xms2g -Xmx2g"
[root@CentOS apache-atlas-sources-2.1.0]# mvn clean -DskipTests install
           

The build takes a long time. Maven must be installed, and configuring the Aliyun Maven mirror speeds it up considerably:

<mirror>
  <id>nexus-aliyun</id>
  <mirrorOf>central</mirrorOf>
  <name>Nexus aliyun</name>
  <url>http://maven.aliyun.com/nexus/content/groups/public</url>
</mirror>
           
...
[INFO] Reactor Summary:
[INFO]
[INFO] Apache Atlas Server Build Tools 1.0 ................ SUCCESS [  3.788 s]
[INFO] apache-atlas 2.1.0 ................................. SUCCESS [ 26.519 s]
[INFO] Apache Atlas Test Utility Tools 2.1.0 .............. SUCCESS [ 16.866 s]
[INFO] Apache Atlas Integration 2.1.0 ..................... SUCCESS [ 20.882 s]
[INFO] Apache Atlas Common 2.1.0 .......................... SUCCESS [  8.228 s]
[INFO] Apache Atlas Client 2.1.0 .......................... SUCCESS [  0.494 s]
[INFO] atlas-client-common 2.1.0 .......................... SUCCESS [  5.477 s]
[INFO] atlas-client-v1 2.1.0 .............................. SUCCESS [  5.570 s]
[INFO] Apache Atlas Server API 2.1.0 ...................... SUCCESS [  5.407 s]
[INFO] Apache Atlas Notification 2.1.0 .................... SUCCESS [ 10.025 s]
[INFO] atlas-client-v2 2.1.0 .............................. SUCCESS [  4.534 s]
[INFO] Apache Atlas Graph Database Projects 2.1.0 ......... SUCCESS [  0.252 s]
[INFO] Apache Atlas Graph Database API 2.1.0 .............. SUCCESS [  5.109 s]
[INFO] Graph Database Common Code 2.1.0 ................... SUCCESS [  4.075 s]
[INFO] Apache Atlas JanusGraph-HBase2 Module 2.1.0 ........ SUCCESS [  7.278 s]
[INFO] Apache Atlas JanusGraph DB Impl 2.1.0 .............. SUCCESS [ 17.070 s]
[INFO] Apache Atlas Graph DB Dependencies 2.1.0 ........... SUCCESS [  1.338 s]
[INFO] Apache Atlas Authorization 2.1.0 ................... SUCCESS [  6.442 s]
[INFO] Apache Atlas Repository 2.1.0 ...................... SUCCESS [ 59.939 s]
[INFO] Apache Atlas UI 2.1.0 .............................. SUCCESS [ 48.107 s]
[INFO] Apache Atlas New UI 2.1.0 .......................... SUCCESS [01:13 min]
[INFO] Apache Atlas Web Application 2.1.0 ................. SUCCESS [02:50 min]
[INFO] Apache Atlas Documentation 2.1.0 ................... SUCCESS [  2.449 s]
[INFO] Apache Atlas FileSystem Model 2.1.0 ................ SUCCESS [  3.564 s]
[INFO] Apache Atlas Plugin Classloader 2.1.0 .............. SUCCESS [  6.368 s]
[INFO] Apache Atlas Hive Bridge Shim 2.1.0 ................ SUCCESS [  7.260 s]
[INFO] Apache Atlas Hive Bridge 2.1.0 ..................... SUCCESS [ 15.244 s]
[INFO] Apache Atlas Falcon Bridge Shim 2.1.0 .............. SUCCESS [  7.279 s]
[INFO] Apache Atlas Falcon Bridge 2.1.0 ................... SUCCESS [  6.254 s]
[INFO] Apache Atlas Sqoop Bridge Shim 2.1.0 ............... SUCCESS [  2.611 s]
[INFO] Apache Atlas Sqoop Bridge 2.1.0 .................... SUCCESS [  6.670 s]
[INFO] Apache Atlas Storm Bridge Shim 2.1.0 ............... SUCCESS [ 10.405 s]
[INFO] Apache Atlas Storm Bridge 2.1.0 .................... SUCCESS [  6.310 s]
[INFO] Apache Atlas Hbase Bridge Shim 2.1.0 ............... SUCCESS [  4.456 s]
[INFO] Apache Atlas Hbase Bridge 2.1.0 .................... SUCCESS [ 14.005 s]
[INFO] Apache HBase - Testing Util 2.1.0 .................. SUCCESS [  2.127 s]
[INFO] Apache Atlas Kafka Bridge 2.1.0 .................... SUCCESS [  5.254 s]
[INFO] Apache Atlas classification updater 2.1.0 .......... SUCCESS [  4.538 s]
[INFO] Apache Atlas Impala Hook API 2.1.0 ................. SUCCESS [  2.435 s]
[INFO] Apache Atlas Impala Bridge Shim 2.1.0 .............. SUCCESS [  2.685 s]
[INFO] Apache Atlas Impala Bridge 2.1.0 ................... SUCCESS [  8.240 s]
[INFO] Apache Atlas Distribution 2.1.0 .................... SUCCESS [  1.378 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------

           

3. Package the build to produce the Atlas runtime distribution

[root@CentOS apache-atlas-sources-2.1.0]# mvn clean -DskipTests package -Pdist

           
...
[INFO] Apache Atlas Sqoop Bridge Shim 2.1.0 ............... SUCCESS [  0.121 s]
[INFO] Apache Atlas Sqoop Bridge 2.1.0 .................... SUCCESS [  2.522 s]
[INFO] Apache Atlas Storm Bridge Shim 2.1.0 ............... SUCCESS [  0.271 s]
[INFO] Apache Atlas Storm Bridge 2.1.0 .................... SUCCESS [  2.437 s]
[INFO] Apache Atlas Hbase Bridge Shim 2.1.0 ............... SUCCESS [  0.705 s]
[INFO] Apache Atlas Hbase Bridge 2.1.0 .................... SUCCESS [  2.541 s]
[INFO] Apache HBase - Testing Util 2.1.0 .................. SUCCESS [  1.706 s]
[INFO] Apache Atlas Kafka Bridge 2.1.0 .................... SUCCESS [  0.966 s]
[INFO] Apache Atlas classification updater 2.1.0 .......... SUCCESS [  0.439 s]
[INFO] Apache Atlas Impala Hook API 2.1.0 ................. SUCCESS [  0.060 s]
[INFO] Apache Atlas Impala Bridge Shim 2.1.0 .............. SUCCESS [  0.072 s]
[INFO] Apache Atlas Impala Bridge 2.1.0 ................... SUCCESS [  2.186 s]
[INFO] Apache Atlas Distribution 2.1.0 .................... SUCCESS [01:12 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  04:08 min
[INFO] Finished at: 2021-01-19T22:40:49+08:00
[INFO] ------------------------------------------------------------------------

           

After this command finishes, the following archives are produced under the distro/target directory:

[root@CentOS apache-atlas-sources-2.1.0]# tree distro/target/ | grep tar.gz
├── apache-atlas-2.1.0-bin.tar.gz
├── apache-atlas-2.1.0-falcon-hook.tar.gz
├── apache-atlas-2.1.0-hbase-hook.tar.gz
├── apache-atlas-2.1.0-hive-hook.tar.gz
├── apache-atlas-2.1.0-impala-hook.tar.gz
├── apache-atlas-2.1.0-kafka-hook.tar.gz
├── apache-atlas-2.1.0-server.tar.gz
├── apache-atlas-2.1.0-sources.tar.gz
├── apache-atlas-2.1.0-sqoop-hook.tar.gz
├── apache-atlas-2.1.0-storm-hook.tar.gz
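Among these, the hive-hook bundle is what the HIVE_AUX_JARS_PATH variable set earlier points at. A sketch of installing it, assuming the archive unpacks to the apache-atlas-hive-hook-2.1.0 directory used in that variable:

```shell
# Unpack the Hive hook so /usr/apache-atlas-hive-hook-2.1.0/hook/hive
# (the HIVE_AUX_JARS_PATH configured earlier) exists. Paths are assumptions
# based on this article's layout; the fallback echo guards a missing archive.
tar -zxf distro/target/apache-atlas-2.1.0-hive-hook.tar.gz -C /usr/ \
  || echo "hive-hook archive not found"
ls /usr/apache-atlas-hive-hook-2.1.0/hook/hive 2>/dev/null || true
```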

           

4. Extract and install the Atlas server, then configure atlas-application.properties

[root@CentOS ~]# tar -zxf apache-atlas-2.1.0-bin.tar.gz -C /usr/
[root@CentOS ~]# cd /usr/apache-atlas-2.1.0/
[root@CentOS apache-atlas-2.1.0]# tree -L 1
.
├── bin
├── conf
├── data
├── DISCLAIMER.txt
├── hook
├── hook-bin
├── LICENSE
├── logs
├── models
├── NOTICE
├── server
└── tools

9 directories, 3 files

           
Edit conf/atlas-application.properties as follows:

#########  Graph Database Configs  #########

# Graph Database
atlas.graph.storage.backend=hbase2
atlas.graph.storage.hbase.table=apache_atlas_janus

#Hbase
#For standalone mode , specify localhost
#for distributed mode, specify zookeeper quorum here
atlas.graph.storage.hostname=CentOS:2181
atlas.graph.storage.hbase.regions-per-server=1
atlas.graph.storage.lock.wait-time=10000

# Delete handler
#
# This allows the default behavior of doing "soft" deletes to be changed.
#
# Allowed Values:
# org.apache.atlas.repository.store.graph.v1.SoftDeleteHandlerV1 - all deletes are "soft" deletes
# org.apache.atlas.repository.store.graph.v1.HardDeleteHandlerV1 - all deletes are "hard" deletes
#
#atlas.DeleteHandlerV1.impl=org.apache.atlas.repository.store.graph.v1.SoftDeleteHandlerV1
atlas.DeleteHandlerV1.impl=org.apache.atlas.repository.store.graph.v1.HardDeleteHandlerV1

# Graph Search Index
atlas.graph.index.search.backend=solr

#Solr
#Solr cloud mode properties
atlas.graph.index.search.solr.mode=cloud
atlas.graph.index.search.solr.zookeeper-url=CentOS:2181
atlas.graph.index.search.solr.zookeeper-connect-timeout=60000
atlas.graph.index.search.solr.zookeeper-session-timeout=60000
atlas.graph.index.search.solr.wait-searcher=true

#Solr http mode properties
atlas.graph.index.search.solr.http-urls=http://CentOS:8983/solr

# Solr-specific configuration property
atlas.graph.index.search.max-result-set-size=150

#########  Notification Configs  #########
atlas.notification.embedded=false
atlas.kafka.data=${sys:atlas.home}/data/kafka
atlas.kafka.zookeeper.connect=CentOS:2181
atlas.kafka.bootstrap.servers=CentOS:9092
atlas.kafka.zookeeper.session.timeout.ms=400
atlas.kafka.zookeeper.connection.timeout.ms=200
atlas.kafka.zookeeper.sync.time.ms=20
atlas.kafka.auto.commit.interval.ms=1000
atlas.kafka.hook.group.id=atlas

atlas.kafka.enable.auto.commit=false
atlas.kafka.auto.offset.reset=earliest
atlas.kafka.session.timeout.ms=30000
atlas.kafka.offsets.topic.replication.factor=1
atlas.kafka.poll.timeout.ms=1000

atlas.notification.create.topics=true
atlas.notification.replicas=1
atlas.notification.topics=ATLAS_HOOK,ATLAS_ENTITIES
atlas.notification.log.failed.messages=true
atlas.notification.consumer.retry.interval=500
atlas.notification.hook.retry.interval=1000

## Server port configuration
atlas.server.http.port=21000
atlas.server.https.port=21443

#########  Server Properties  #########
atlas.rest.address=http://CentOS:21000
# If enabled and set to true, this will run setup steps when the server starts
atlas.server.run.setup.on.start=false

########## Add http headers ###########

atlas.headers.Access-Control-Allow-Origin=*
atlas.headers.Access-Control-Allow-Methods=GET,OPTIONS,HEAD,PUT,POST
#atlas.headers.<headerName>=<headerValue>
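Because the configuration above points the search backend at an external SolrCloud rather than an embedded one, the index collections Atlas expects should exist before the first start. A sketch, where the SOLR_HOME install path is an assumption and the shard/replica counts suit a single-node setup:

```shell
# Create the three index collections Atlas uses in SolrCloud mode.
# SOLR_HOME is an assumed install location for this article's environment.
SOLR_HOME=${SOLR_HOME:-/usr/solr}
"$SOLR_HOME/bin/solr" create -c vertex_index   -shards 1 -replicationFactor 1 || echo "solr CLI not available"
"$SOLR_HOME/bin/solr" create -c edge_index     -shards 1 -replicationFactor 1 || echo "solr CLI not available"
"$SOLR_HOME/bin/solr" create -c fulltext_index -shards 1 -replicationFactor 1 || echo "solr CLI not available"
```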

           

5. Start the Atlas service and log in to the web UI; the default username and password are admin/admin

[root@CentOS apache-atlas-2.1.0]# ./bin/atlas_start.py
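Once startup finishes (it can take a couple of minutes), a quick sanity check is to probe the admin API; the host and the admin/admin credentials match this article's setup:

```shell
# Probe the Atlas admin API; prints build/version info when the server is up.
curl -s --connect-timeout 3 -u admin:admin \
  http://CentOS:21000/api/atlas/admin/version \
  || echo "Atlas server not reachable"
```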
           