性能方案设计

调研清单

powerfulseal

schemathesis

Ceph 自动化

参考文档:https://teams.microsoft.com/l/file/2043B749-2D71-4ACF-88C9-12DF50126452?tenantId=19441c6a-f224-41c8-ac36-82464c2d9b13&fileType=pdf&objectUrl=https%3A%2F%2Fapulistech.sharepoint.com%2Fsites%2FQA-Ops%2FShared%20Documents%2FGeneral%2F2019-05-20%20teuthology.pdf&baseUrl=https%3A%2F%2Fapulistech.sharepoint.com%2Fsites%2FQA-Ops&serviceName=teams&threadId=19:30392c27d21c45a7ac07848d310b4347@thread.tacv2&groupId=846ac8d2-94e2-44a9-a6e3-0386803e6f89

测试工具:https://github.com/ceph/teuthology/tree/master/docs

https://github.com/ceph/teuthology/ https://www.youtube.com/watch?v=XwZqo_KQEag

https://www.youtube.com/channel/UCno-Fry25FJ7B4RycCxOtfw Ceph - YouTubeEnjoy the videos and music you love, upload original content, and share it all with friends, family, and the world on YouTube.www.youtube.com

https://www.youtube.com/watch?v=gj1OXrKdSrs

系统备份还原

https://clonezilla.org/downloads.php

业界参考书籍

  • AI自动化测试:https://item.jd.com/70985685552.html

  • 京东质量团队转型实践: https://item.jd.com/12442039.html

  • 机器学习测试入门与实践: https://item.jd.com/12958088.html

  • iOS/Android 自动化测试实战:https://search.jd.com/Search?keyword=%E6%B5%8B%E8%AF%95%E4%B9%A6%E7%B1%8D%20%E8%85%BE%E8%AE%AF&enc=utf-8&wq=%E6%B5%8B%E8%AF%95%E4%B9%A6%E7%B1%8D%20%E8%85%BE%E8%AE%AF&pvid=22b5893757e341608b0edcc81a17dd18

k8s 性能

  • kubeflow - kubeserving

  • 平台性能,稳定性,异常

Results of measuring of API performance of Kubernetes kubernetes-storage-performance-comparison-v2-2020 scalability network-plugins-cni etcd community sig-testing kubemark perf-tests kubetest2 test-infra prow.k8s.io Kubernetes Aggregated Failures from

PXE Server 自动安装系统

  • 参考:

  1. Fully Automated Ubuntu 20.04 server install using PXE

  2. Huawei Server Purley Platform BIOS Parameter Reference 21

  3. PXELINUX

  4. Configure a PXE server to load Windows PE

Kubernetes 维护

  1. Kubernetes how to make Deployment to update image

编码规范

命名规范【主谓/动宾表】

  • 脚本命名规范

    主体_操作_属性_实体_环境

    • jmeter 脚本命名示例:

      more_user_submit_more_CPU_job.jmx

  • 代码命名规范

    • 代码文件, 函数命名:小写单词 + “_”

      get_job_name

    • 变量命名: 双驼峰

      TestList

    • 常量,特殊标记: 全大写

      WAITTIME

AI(HPC)高性能计算框架, NPU, GPU

kubic-control for openSUSE Kubic OpenQA kubernetes-perf-tests argo Kubeadm on SLES Jenkins X k8s 获取master token kubeadm国内源

  • Azurre k8s最佳实践 https://docs.microsoft.com/en-us/azure/aks/quotas-skus-regions

  • k8s-kpis-with-kuberhealthy:

https://kubernetes.io/blog/2020/05/29/k8s-kpis-with-kuberhealthy/

https://github.com/Comcast/kuberhealthy/blob/master/docs/EXTERNAL_CHECKS_REGISTRY.md

  • kubeflow/testing:

https://github.com/kubeflow/testing https://github.com/kubernetes/test-infra

  • argo

https://github.com/argoproj/argo/

https://argoproj.github.io/argo-cd/getting_started/#1-install-argo-cd

https://blog.argoproj.io/about

  • jenkins-x https://www.inovex.de/blog/spinnaker-vs-argo-cd-vs-tekton-vs-jenkins-x/

  • kubernetes perf-tests:

https://github.com/kubernetes/perf-tests

k8s Limits

https://kubernetes.io/docs/setup/best-practices/cluster-large/#:~:text=More%20specifically%2C%20Kubernetes%20is%20designed,more%20than%20150000%20total%20pods

No more than 100 pods per node No more than 5000 nodes No more than 150000 total pods No more than 300000 total containers

https://cloud.google.com/kubernetes-engine/docs/best-practices/scalability 默认情况下:

Pod 次要范围默认为 /14(262144 个 IP 地址)。 每个节点都有为其 Pod 分配的 /24 范围(用于其 Pod 的 256 个 IP 地址)。 节点的子网为 /20(4092 个 IP 地址)。

testExample: https://www.jeffgeerling.com/blog/2020/10000-kubernetes-pods-10000-subscribers https://docs.openshift.com/container-platform/3.7/scaling_performance/cluster_limits.html https://learnk8s.io/setting-cpu-memory-limits-requests https://blog.newrelic.com/engineering/kubernetes-request-and-limits/ https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/ https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md

稳定性测试与功能(自动化)测试的相比的特殊之处

  1. 不同职责的性能测试相关人员需要专业的知识背景

  2. 需要系统的搭建基础环境

  3. 是平台稳定性、性能调优的前提

  4. 一般需要至少1个月时间,不能在短时间内完成

  5. 系统模块层级、通信等越来越复杂,某些模块的“性能很好”或“性能很差”不能解释或解决问题 page: 158

  6. 测试环境和生产环境的差异可能使得测试失败或结果没有参考价值

  7. 压力场景设计要充分考虑多类,多种情况的操作

  8. 要考虑访问、查找等的负载较大页面的比例差异

  9. 要合理估计用户停留时间(思考时间)

  10. 要考虑会话在多个页面之间迁移,会话中的(迁移) 信息累计对内存的占用

  11. 要用不同的ID, 不同的对象,不同的条件触发动态请求,避免“缓存命中”

  12. 测试报告不能简单的通过PASS, FAIL来反馈;面向客户的测试报告有别于内部详细认真的数据报告;更多的,或者不得不考虑客户的理解和信任感,而做取舍或“概述”

测试范围(种类)

性能测试有很多种分类比如:压力测试,负载测试、(上,下)临界测试,狭义性能测试,可靠性测试,耐久测试或稳定性测试。

  • 我们目前关注:稳定性、可靠性测试,压力测试

由于平台有别于WEB系统的特殊性,狭义性能测试和临界值测试在当前的平台架构和实际部署环境不太适用。

  • 既定的平台可靠运行的资源上限

    CPU负载: 75% 内存使用率:
    IO使用率:
    外出网络带宽: 10Gbps 业务网络带宽: 500Mbps OS盘使用率: 75% (超过80%,k8s会将pod驱逐,已驱逐的Pod不能自行恢复) 存储使用率: 90%

注意事项:

  1. 在日志中记录各个处理的会话ID,序列号

工具

Apache Bench (license)

Oracle Application Testing suite

待确认问题

  1. 内存缓冲,缓存机制 是否抢先加载占用内存空间

  2. 用户ID(IP, Head)是怎么区分的,

  3. 虚拟CPU的分配机制,超线程情况

  4. 虚拟内存的分配、过载使用情况,内存共享,(Large Page)大页面,从POD中回收内存页 均一分配,动态分配

  5. IO延时 存储命令等待时间 GAVG 存储队列 SCSI预约竞争 分布式的存储迁移

  6. 任务批处理,异步处理机制

  7. 自动扩容和容量管理

    故障发生时的降容

  8. WAN, LAN网络 WAN 口访问 内部大LAN的,存储,前端,后端,DB通信

  9. Restful 请求

参考:

  • 接口响应: [Results of measuring of API performance of Kubernetes] https://docs.openstack.org/developer/performance-docs/test_results/container_cluster_systems/kubernetes/API_testing/index.html#kubernetes-pod-startup-latency-measurement [Kubernetes网络和集群性能测试]https://jimmysong.io/kubernetes-handbook/practice/network-and-cluster-perfermance-test.html [Kubernetes Performance Measurements and Roadmap]https://kubernetes.io/blog/2015/09/kubernetes-performance-measurements-and/

  • 存储性能基线 Kubernetes Storage Performance Comparison v2 (2020 Updated) 翻译:https://toutiao.io/posts/nmflsd/preview [logs]https://gist.github.com/pupapaik/76c5b7f124dbb69080840f01bf71f924

  • k8s集群可扩展性和性能SLI/SLO Kubernetes scalability and performance SLIs/SLOs 中文:https://www.jianshu.com/p/de768ea3fc19

  • Benchmark results of Kubernetes network plugins (CNI) over 10Gbit/s network 参考: [2020-Benchmark results of Kubernetes network plugins (CNI) over 10Gbit/s network] (https://itnext.io/benchmark-results-of-kubernetes-network-plugins-cni-over-10gbit-s-network-updated-august-2020-6e1b757b9e49) [2018-Benchmark results of Kubernetes network plugins (CNI) over 10Gbit/s network] https://itnext.io/benchmark-results-of-kubernetes-network-plugins-cni-over-10gbit-s-network-36475925a560

  • eted 性能基线与优化 参考:[When to update the heartbeat interval and election timeout settings]https://github.com/etcd-io/etcd/blob/master/Documentation/tuning.md etcd 性能测试与调优 (https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/performance.md) limit

  • 性能优化参考点: kubernetes 集群的性能优化 eBay应用程序集群管理器TESS.IO在大规模集群下的性能优化 网易云基于Kubernetes的深度定制化实践 开放下载《阿里巴巴云原生实践 15 讲》揭秘九年云原生规模化落地 使用 K8S 几年后,这些技术专家有话要说 腾讯成本优化黑科技:整机CPU利用率最高提升至90% 华为云在 K8S 大规模场景下的 Service 性能优化实践

  • 测试场景和用例 https://kiddie92.github.io/2019/01/23/kubernetes-%E6%80%A7%E8%83%BD%E6%B5%8B%E8%AF%95%E6%96%B9%E6%B3%95%E7%AE%80%E4%BB%8B/ https://supereagle.github.io/2017/03/09/kubemark/

  • 自动化测试参考:

  • testim

    • 1,000 runs per month

    • Chrome browser tests only

    • Serial tests runs

    • Runs tests on Testim’s cloud

    • Advanced branching and merging

    • File testing

    • Self-help only (Docs and Resources)

    • Limit: one account per organization

  • Functioniz

  • testhub BlazeMeter

API Key Id: a5ba1e4e2d9afc743a481d9f
API Key Secret: 433cd60a89dda9e400497a8b5b04f0ba5dc988ba1319ea9d0d5c836d8610bbf9cf188ba6
# Usage example: 
$ curl https://a.blazemeter.com/api/v4/user --user 'a5ba1e4e2d9afc743a481d9f:433cd60a89dda9e400497a8b5b04f0ba5dc988ba1319ea9d0d5c836d8610bbf9cf188ba6'

测试报告

ludeknovy/jtl-reporter Generate intuitive Locust.io reports with Jtl Reporter

Locust Monitoring with Grafana in Just 15 Minutes