Hands-On with Kubeflow on Kind: An Installation and Usage Tutorial


Summary: Kubeflow is an open-source, Kubernetes-based ML workflow platform from Google that integrates a large number of machine-learning tools. This article walks through installing and deploying Kubeflow from Alibaba Cloud image registries, then demonstrates how the individual Kubeflow components are used through a Katib hyperparameter-tuning example and a pipeline workflow example, and closes with some thoughts on building an MLOps platform on top of Kubeflow.

Introduction

Kubeflow is an open-source, Kubernetes-based ML workflow platform from Google that integrates a large number of machine-learning tools, such as a JupyterLab environment for interactive experimentation, Katib for hyperparameter tuning, and Argo Workflows for pipeline orchestration. As a large "toolbox" collection, Kubeflow gives machine-learning developers a wide choice of tools, and it provides a practical foundation for the engineering side of machine learning.

That said, because Kubeflow is driven mainly by Google, many of its technical choices are deeply tied to Google's own products even though it is an open-source project. For example, Google's own storage tool gsutil is treated as a first-class citizen, and most image addresses point to Google-hosted registries such as gcr.io.

Installation and Deployment

If you do not have reliable access to those registries, you can refer to a project I open-sourced that maintains manifests specifically for the Chinese network environment. All of its images are hosted on Alibaba Cloud registries, so Kubeflow can be installed quickly and easily even from within China:

git clone https://github.com/shikanon/kubeflow-manifests.git
cd kubeflow-manifests
python install.py

After the installation completes, check the cluster and wait for all pods to reach Running:

$ kubectl get po -A
NAMESPACE            NAME                                                        READY   STATUS    RESTARTS   AGE
auth                 dex-6686f66f9b-54s96                                        1/1     Running   0          2h6m
cattle-system        cattle-cluster-agent-5f695c79c-x9ql7                        1/1     Running   0          3h
cert-manager         cert-manager-9d5774b59-4xjmk                                1/1     Running   0          2h23m
cert-manager         cert-manager-cainjector-67c8c5c665-nmcp6                    1/1     Running   0          2h23m
cert-manager         cert-manager-webhook-75dc9757bd-z2k5c                       1/1     Running   1          2h23m
fleet-system         fleet-agent-7d959597cb-q8ckq                                1/1     Running   0          3h
istio-system         authservice-0                                               1/1     Running   0          2h23m
istio-system         cluster-local-gateway-66bcf8bc5d-j9kvp                      1/1     Running   0          2h23m
istio-system         istio-ingressgateway-85b49c758f-l4hgc                       1/1     Running   0          2h22m
istio-system         istiod-5ff6cdbbcd-2v5kj                                     1/1     Running   0          2h23m
knative-eventing     broker-controller-5c84984b97-86zkx                          1/1     Running   0          2h23m
knative-eventing     eventing-controller-54bfbd5446-rx9ll                        1/1     Running   0          2h23m
knative-eventing     eventing-webhook-58f56d9cf4-bnq9q                           1/1     Running   0          2h23m
knative-eventing     imc-controller-769896c7db-kzjv6                             1/1     Running   0          2h23m
knative-eventing     imc-dispatcher-86954fb4cd-9b6gz                             1/1     Running   0          2h23m
knative-serving      activator-75696c8c9-9c5ff                                   1/1     Running   0          2h23m
knative-serving      autoscaler-6764f9b5c5-2gwqj                                 1/1     Running   0          2h23m
knative-serving      controller-598fd8bfd7-bpn5k                                 1/1     Running   0          2h23m
knative-serving      istio-webhook-785bb58cc6-ts9f2                              1/1     Running   0          2h23m
knative-serving      networking-istio-77fbcfcf9b-pg26h                           1/1     Running   0          2h23m
knative-serving      webhook-865f54cf5f-rzpjf                                    1/1     Running   0          2h23m
kube-system          coredns-5644d7b6d9-hwwnr                                    1/1     Running   0          3h1m
kube-system          coredns-5644d7b6d9-zds92                                    1/1     Running   0          3h1m
kube-system          etcd-kubeflow-control-plane                                 1/1     Running   0          3h
kube-system          kindnet-8tvm5                                               1/1     Running   0          3h1m
kube-system          kindnet-zkmkq                                               1/1     Running   0          3h1m
kube-system          kube-apiserver-kubeflow-control-plane                       1/1     Running   0          3h
kube-system          kube-controller-manager-kubeflow-control-plane              1/1     Running   0          3h
kube-system          kube-proxy-c8zn7                                            1/1     Running   0          3h1m
kube-system          kube-proxy-k7b8c                                            1/1     Running   0          3h1m
kube-system          kube-scheduler-kubeflow-control-plane                       1/1     Running   0          3h
kubeflow             admission-webhook-deployment-6fb9d65887-pzvgc               1/1     Running   0          2h22m
kubeflow             cache-deployer-deployment-7558d65bf4-jhgwg                  2/2     Running   1          2h6m
kubeflow             cache-server-c64c68ddf-stz72                                2/2     Running   0          22m
kubeflow             centraldashboard-7b7676d8bd-g2s8j                           1/1     Running   0          2h7m
kubeflow             jupyter-web-app-deployment-66f74586d9-scbsm                 1/1     Running   0          2h5m
kubeflow             katib-controller-77675c88df-mx4rh                           1/1     Running   0          2h22m
kubeflow             katib-db-manager-646695754f-z797r                           1/1     Running   0          2h22m
kubeflow             katib-mysql-5bb5bd9957-gbl5t                                1/1     Running   0          2h22m
kubeflow             katib-ui-55fd4bd6f9-r98r2                                   1/1     Running   0          2h22m
kubeflow             kfserving-controller-manager-0                              2/2     Running   0          2h22m
kubeflow             kubeflow-pipelines-profile-controller-5698bf57cf-btpn5      1/1     Running   0          22m
kubeflow             metacontroller-0                                            1/1     Running   0          2h7m
kubeflow             metadata-envoy-deployment-76d65977f7-rmlzc                  1/1     Running   0          2h7m
kubeflow             metadata-grpc-deployment-697d9c6c67-j6dl2                   2/2     Running   3          2h7m
kubeflow             metadata-writer-58cdd57678-8t6gw                            2/2     Running   1          2h7m
kubeflow             minio-6d6784db95-tqs77                                      2/2     Running   0          2h7m
kubeflow             ml-pipeline-85fc99f899-plsz2                                2/2     Running   1          2h7m
kubeflow             ml-pipeline-persistenceagent-65cb9594c7-xvn4j               2/2     Running   1          2h7m
kubeflow             ml-pipeline-scheduledworkflow-7f8d8dfc69-7wfs4              2/2     Running   0          2h7m
kubeflow             ml-pipeline-ui-5c765cc7bd-4r2j7                             2/2     Running   0          2h7m
kubeflow             ml-pipeline-viewer-crd-5b8df7f458-5b8qg                     2/2     Running   1          2h7m
kubeflow             ml-pipeline-visualizationserver-56c5ff68d5-92bkf            2/2     Running   0          2h7m
kubeflow             mpi-operator-789f88879-n4xms                                1/1     Running   0          2h22m
kubeflow             mxnet-operator-7fff864957-vq2bg                             1/1     Running   0          2h22m
kubeflow             mysql-56b554ff66-kd7bd                                      2/2     Running   0          2h7m
kubeflow             notebook-controller-deployment-74d9584477-qhpp8             1/1     Running   0          2h22m
kubeflow             profiles-deployment-67b4666796-k7t2h                        2/2     Running   0          2h22m
kubeflow             pytorch-operator-fd86f7694-dxbgf                            2/2     Running   0          2h22m
kubeflow             tensorboard-controller-controller-manager-fd6bcffb4-k9qvx   3/3     Running   1          2h22m
kubeflow             tensorboards-web-app-deployment-78d7b8b658-dktc6            1/1     Running   0          2h22m
kubeflow             tf-job-operator-7bc5cf4cc7-gk8tz                            1/1     Running   0          2h22m
kubeflow             volumes-web-app-deployment-68fcfc9775-bz9gq                 1/1     Running   0          2h22m
kubeflow             workflow-controller-5449754fb4-tdg2t                        2/2     Running   1          22m
kubeflow             xgboost-operator-deployment-5c7bfd57cc-9rtq6                2/2     Running   1          2h22m
local-path-storage   local-path-provisioner-58f6947c7-mv4mg                      1/1     Running   0          3h1m

Access Control

Kubeflow uses Dex for authentication. Once Kubeflow is installed, open a local browser and you will see the Dex login form; enter the username and password:

The credentials can be configured through Dex's ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: dex
  namespace: auth
data:
  config.yaml: |
    issuer: http://dex.auth.svc.cluster.local:5556/dex
    storage:
      type: kubernetes
      config:
        inCluster: true
    web:
      http: 0.0.0.0:5556
    logger:
      level: "debug"
      format: text
    oauth2:
      skipApprovalScreen: true
    enablePasswordDB: true
    staticPasswords:
    - email: "admin@example.com"
      hash: "$2a$10$2b2cU8CPhOTashikanonGrs1HRQJTT5ZHsHSzYiFPm1leZck7Mc8T4W"
      username: "admin"
      userID: "08a8684b-db88-4b73-90a9-3cd1661f5466"
    staticClients:
    - idEnv: OIDC_CLIENT_ID
      redirectURIs: ["/login/oidc"]
      name: 'Dex Login Application'
      secretEnv: OIDC_CLIENT_SECRET

The email field is the username we log in with, and hash is the bcrypt hash of the password we set, which can be generated with the following Python snippet:

from passlib.hash import bcrypt
import getpass

# Prompt for a password and print its bcrypt hash for the Dex ConfigMap.
print(bcrypt.using(rounds=12, ident="2y").hash(getpass.getpass()))

Component Overview

You can see that the new version of Kubeflow has gained quite a lot of functionality.

The following sections introduce Kubeflow's core components module by module.

Notebook Servers

Notebooks are probably the favorite tool of machine-learning practitioners, since they bring out the full interactivity of a dynamic language. Kubeflow provides Jupyter notebooks for quickly building experiment environments in the cluster. Here we take a custom image as an example:

We create a notebook server named test-for-jupyter configured with a TensorFlow image. After clicking Launch, we can see that our application has been created in the kubeflow-user-example-com namespace:

kubectl get po -n kubeflow-user-example-com
NAME                                               READY   STATUS            RESTARTS   AGE
ml-pipeline-ui-artifact-6d7ffcc4b6-9kxkk           2/2     Running           0          48m
ml-pipeline-visualizationserver-84d577b989-5hl46   2/2     Running           0          48m
test-for-jupyter-0                                 0/2     PodInitializing   0          44s

Once it is up, click Connect to enter the application's UI.

In the JupyterLab environment, developers can run algorithm experiments very conveniently. And because it runs in the cluster, you can even use the Kubernetes API to create Kubernetes resources directly, for example creating an ML service through KFServing.
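
As an illustration, here is a minimal sketch of creating a KFServing InferenceService from inside a notebook with the official Kubernetes Python client. It assumes the installed KFServing serves the serving.kubeflow.org/v1beta1 API and that the notebook's service account is allowed to create InferenceServices; the sklearn-iris model URI comes from the KFServing samples:

from kubernetes import client, config

# Inside a notebook pod, authenticate with the pod's service account.
config.load_incluster_config()

# The sklearn-iris example model from the KFServing samples.
inference_service = {
    "apiVersion": "serving.kubeflow.org/v1beta1",
    "kind": "InferenceService",
    "metadata": {
        "name": "sklearn-iris",
        "namespace": "kubeflow-user-example-com",
    },
    "spec": {
        "predictor": {
            "sklearn": {"storageUri": "gs://kfserving-samples/models/sklearn/iris"}
        }
    },
}

# InferenceService is a CRD, so it is created through the custom objects API.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kubeflow.org",
    version="v1beta1",
    namespace="kubeflow-user-example-com",
    plural="inferenceservices",
    body=inference_service,
)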

AutoML

AutoML is one of the hotter areas of machine learning, mainly covering automatic model optimization and hyperparameter tuning. In Kubeflow this is implemented by Katib, a Kubernetes-based AutoML project.

Katib mainly provides hyperparameter tuning (Hyperparameter Tuning), early stopping (Early Stopping), and neural architecture search (Neural Architecture Search).

Here is an example using the random search algorithm:

apiVersion: "kubeflow.org/v1beta1" kind: Experiment metadata: namespace: kubeflow-user-example-com name: random-example spec: objective: type: maximize goal: 0.99 objectiveMetricName: Validation-accuracy additionalMetricNames: - Train-accuracy algorithm: algorithmName: random parallelTrialCount: 3 maxTrialCount: 12 maxFailedTrialCount: 3 parameters: - name: lr parameterType: double feasibleSpace: min: "0.01" max: "0.03" - name: num-layers parameterType: int feasibleSpace: min: "2" max: "5" - name: optimizer parameterType: categorical feasibleSpace: list: - sgd - adam - ftrl trialTemplate: primaryContainerName: training-container trialParameters: - name: learningRate description: Learning rate for the training model reference: lr - name: numberLayers description: Number of training model layers reference: num-layers - name: optimizer description: Training model optimizer (sdg, adam or ftrl) reference: optimizer trialSpec: apiVersion: batch/v1 kind: Job spec: template: spec: containers: - name: training-container image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-45c5727 command: - "python3" - "/opt/mxnet-mnist/mnist.py" - "--batch-size=64" - "--lr=${trialParameters.learningRate}" - "--num-layers=${trialParameters.numberLayers}" - "--optimizer=${trialParameters.optimizer}" restartPolicy: Never

This example trains a simple neural network with three tunable parameters, lr, num-layers, and optimizer. The search algorithm is random search, and the objective is to maximize the validation accuracy.

You can paste the YAML directly into the UI and submit it.

When the experiment completes, Katib generates a plot relating each parameter to the accuracy, together with a list of trials:

Experiments and Pipelines

Experiments provide a workspace in which to organize runs, while a pipeline defines a template for composing algorithms: through a pipeline we can connect the processing steps of an algorithm in a specific topology.
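
In practice, rather than writing raw YAML, pipelines are usually authored with the kfp v1 Python SDK and compiled into an Argo Workflow. The following is a minimal sketch, not taken from this article's project, showing a two-step DAG; the images and commands are placeholders:

import kfp
from kfp import dsl

@dsl.pipeline(name="echo-pipeline", description="A minimal two-step pipeline sketch")
def echo_pipeline():
    # Each ContainerOp compiles into one Argo template (one step of the DAG).
    hello = dsl.ContainerOp(
        name="hello",
        image="python:3.7",
        command=["python", "-c"],
        arguments=["print('hello')"],
    )
    world = dsl.ContainerOp(
        name="world",
        image="python:3.7",
        command=["python", "-c"],
        arguments=["print('world')"],
    )
    world.after(hello)  # declare the dependency edge explicitly

# Compile to a workflow YAML that can be uploaded in the Pipelines UI.
kfp.compiler.Compiler().compile(echo_pipeline, "echo_pipeline.yaml")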

Here are a few of the pipeline examples provided upstream:

Kubeflow Pipelines is essentially built on Argo Workflows. Since our Kubeflow runs on Kind, where the container runtime is containerd, while Argo's default pipeline executor is docker, some features are incompatible. I therefore changed the workflow controller's containerRuntimeExecutor to k8sapi. But k8sapi is a second-class citizen in Argo, so some functionality is unavailable; for example, Kubeflow Pipelines' input/output artifacts rely on the docker cp command.
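
For reference, here is a small sketch of making that change with the Kubernetes Python client; it assumes the Argo controller reads its settings from the workflow-controller-configmap ConfigMap in the kubeflow namespace, which is where the default Kubeflow manifests place it:

from kubernetes import client, config

config.load_kube_config()

# Switch the Argo executor to k8sapi so steps run without a docker socket.
client.CoreV1Api().patch_namespaced_config_map(
    name="workflow-controller-configmap",  # assumed name from the Kubeflow manifests
    namespace="kubeflow",
    body={"data": {"containerRuntimeExecutor": "k8sapi"}},
)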

For these reasons, the default examples that ship with Kubeflow do not use volumes and cannot run on Kind, so here we implement a pipeline ourselves using Argo Workflow syntax.

Building a Workflow with Pipelines

Step 1: create a workflow pipeline file:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: kubeflow-test-
spec:
  entrypoint: kubeflow-test
  templates:
  - name: kubeflow-test
    dag:
      tasks:
      - name: print-text
        template: print-text
        dependencies: [repeat-line]
      - {name: repeat-line, template: repeat-line}
  - name: repeat-line
    container:
      args: [--line, Hello, --count, '15', --output-text, /gotest/outputs/output_text/data]
      command:
      - sh
      - -ec
      - |
        program_path=$(mktemp)
        printf "%s" "$0" > "$program_path"
        python3 -u "$program_path" "$@"
      - |
        def _make_parent_dirs_and_return_path(file_path: str):
            import os
            os.makedirs(os.path.dirname(file_path), exist_ok=True)
            return file_path

        def repeat_line(line, output_text_path, count=10):
            '''Repeat the line specified number of times'''
            with open(output_text_path, 'w') as writer:
                for i in range(count):
                    writer.write(line + '\n')

        import argparse
        _parser = argparse.ArgumentParser(prog='Repeat line', description='Repeat the line specified number of times')
        _parser.add_argument("--line", dest="line", type=str, required=True, default=argparse.SUPPRESS)
        _parser.add_argument("--count", dest="count", type=int, required=False, default=argparse.SUPPRESS)
        _parser.add_argument("--output-text", dest="output_text_path", type=_make_parent_dirs_and_return_path, required=True, default=argparse.SUPPRESS)
        _parsed_args = vars(_parser.parse_args())
        _outputs = repeat_line(**_parsed_args)
      image: python:3.7
      volumeMounts:
      - name: workdir
        mountPath: /gotest/outputs/output_text/
    volumes:
    - name: workdir
      persistentVolumeClaim:
        claimName: kubeflow-test-pv
  - name: print-text
    container:
      args: [--text, /gotest/outputs/output_text/data]
      command:
      - sh
      - -ec
      - |
        program_path=$(mktemp)
        printf "%s" "$0" > "$program_path"
        python3 -u "$program_path" "$@"
      - |
        def print_text(text_path):  # The "text" input is untyped so that any data can be printed
            '''Print text'''
            with open(text_path, 'r') as reader:
                for line in reader:
                    print(line, end='')

        import argparse
        _parser = argparse.ArgumentParser(prog='Print text', description='Print text')
        _parser.add_argument("--text", dest="text_path", type=str, required=True, default=argparse.SUPPRESS)
        _parsed_args = vars(_parser.parse_args())
        _outputs = print_text(**_parsed_args)
      image: python:3.7
      volumeMounts:
      - name: workdir
        mountPath: /gotest/outputs/output_text/
    volumes:
    - name: workdir
      persistentVolumeClaim:
        claimName: kubeflow-test-pv

Here we define two tasks, repeat-line and print-text. The repeat-line task writes its generated output into the kubeflow-test-pv PVC, and print-text reads the data back from the PVC and prints it to stdout.

Since a PVC is used here, we first need to create the kubeflow-test-pv PVC in the cluster:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: kubeflow-test-pv
  namespace: kubeflow-user-example-com
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 128Mi

Step 2: once the pipeline file is defined, create the pipeline:
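
Besides the UI, the pipeline can also be registered with the KFP SDK. A sketch, assuming the in-cluster ml-pipeline service address and that the workflow above was saved as kubeflow-test.yaml:

import kfp

# Address of the ml-pipeline API service (assumed in-cluster default).
client = kfp.Client(host="http://ml-pipeline.kubeflow.svc.cluster.local:8888")

# Register the workflow definition as a new pipeline.
client.upload_pipeline("kubeflow-test.yaml", pipeline_name="kubeflow-test")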

Step 3: start the pipeline:

Besides the one-off run mode, starting a pipeline also supports a recurring, scheduler-driven mode (Recurring); choose whichever fits your needs.
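
A recurring run can likewise be scheduled through the SDK. A sketch, assuming the pipeline registered above and a six-field cron expression that fires at the top of every hour:

import kfp

client = kfp.Client(host="http://ml-pipeline.kubeflow.svc.cluster.local:8888")

# create_experiment returns the existing experiment if the name is already taken.
experiment = client.create_experiment("kubeflow-test")

client.create_recurring_run(
    experiment_id=experiment.id,
    job_name="kubeflow-test-hourly",
    pipeline_id=client.get_pipeline_id("kubeflow-test"),
    cron_expression="0 0 * * * *",  # seconds minutes hours day month weekday
)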

Check the run results:

After a run finishes, the experiment can be archived (Archived).

Some Thoughts on MLOps

Let's look at a simple ML operations workflow:

This is the level 1 machine-learning pipeline automation described by Google, and the whole pipeline is composed of several stages.

Based on the capabilities described above, we could fairly easily implement a simple MLOps release flow using Kubeflow's pipeline and KFServing features. It is worth noting, however, that DevOps is not just a technology but also an engineering culture, so putting it into practice requires collaboration across the team and a staged rollout.
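
To make that concrete, here is a hedged sketch of such a release flow as a kfp pipeline: a training step followed by a deploy step that applies a KFServing InferenceService manifest. The training image and the manifest path are hypothetical placeholders:

import kfp
from kfp import dsl

@dsl.pipeline(name="mlops-release", description="Train a model, then roll out a serving release")
def mlops_release_pipeline():
    # Hypothetical training image that writes its model to shared storage.
    train = dsl.ContainerOp(
        name="train",
        image="registry.example.com/demo/train:latest",
        command=["python", "train.py"],
    )
    # Hypothetical manifest baked into the deploy image.
    deploy = dsl.ContainerOp(
        name="deploy",
        image="bitnami/kubectl:latest",
        command=["kubectl", "apply", "-f", "/manifests/inference-service.yaml"],
    )
    deploy.after(train)  # only release after training succeeds

kfp.compiler.Compiler().compile(mlops_release_pipeline, "mlops_release.yaml")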
