fixing other format issues

Hasko, Vladimir 2023-05-20 23:10:05 +00:00
parent 2e341f4a19
commit 40423a0364
10 changed files with 42 additions and 56 deletions

@ -51,7 +51,7 @@ of the specific service.
24/7 Mission Control squads use CloudMon, ApiMon and EpMon metrics and present
them on their own customized dashboards, which fulfill their
requirements.
https://dashboard.tsi-dev.otc-service.com/d/eBQoZU0nk/overview?orgId=1&refresh=1m
@ -98,7 +98,7 @@ Service Based Dashboard
=======================
The dashboard provides deeper insight into a single service with tailored views,
graphs and tables to address the service's major functionalities and specifics.
https://dashboard.tsi-dev.otc-service.com/d/APImonCompute/compute-service-statistics?orgId=1

@ -8,7 +8,7 @@ Metrics are stored in 2 different database types:
- Graphite time series database
- PostgreSQL relational database
Graphite
========
@ -61,8 +61,8 @@ Counters and timers have following subbranches:
Every section has the following further branches:
- environment name (production_regA, production_regB, etc.)
- monitoring location (production_regA, awx) - specification of the environment from which the metric is gathered
openstack.api
@ -70,27 +70,35 @@ openstack.api
The OpenStack metrics branch is structured as follows:
- service (normally service_type from the service catalog, but sometimes differs slightly)
- request method (GET/POST/DELETE/PUT)
- resource (service resource, e.g. server, keypair, volume, etc.). Subresources are joined with "_" (e.g. cluster_nodes)
- response code - received response code
- count/upper/lower/mean/etc. - timer-specific metrics (available only under stats.timers.openstack.api.$environment.$zone.$service.$request_method.$resource.$status_code.{count,mean,upper,*})
- count/rate - counter-specific metrics (available only under stats.counters.openstack.api.$environment.$zone.$service.$request_method.$resource.$status_code.{count,mean,upper,*})
- attempted - counter of attempted requests (only for counters)
- failed - counter of failed requests (no response received, connection problems, etc.) (only for counters)
- passed - counter of requests that received any response (only for counters)
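As a rough illustration of the branch layout above, the following Python sketch assembles a timer path from its components. The helper name ``openstack_api_timer_path`` and the sample values are hypothetical; only the path template is taken from the description above.

```python
def openstack_api_timer_path(environment, zone, service, method, resource,
                             status_code, stat="mean"):
    """Build a Graphite timer path following the openstack.api layout.

    Hypothetical helper for illustration; not part of ApiMon itself.
    """
    # Subresources are joined with "_" (for example cluster_nodes).
    if not isinstance(resource, str):
        resource = "_".join(resource)
    parts = [
        "stats", "timers", "openstack", "api",
        environment, zone, service,
        method.upper(), resource, str(status_code), stat,
    ]
    return ".".join(parts)


# Sample values are assumptions chosen to match the names used above.
path = openstack_api_timer_path(
    "production_regA", "production_regA", "compute", "get", "server", 200
)
print(path)
```

The resulting string can be pasted into a Graphite query as a target; replacing individual components with ``*`` yields a wildcard query over that level.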
apimon.metric
-------------
- metric name (e.g. create_cce_cluster, delete_volume_eu-de-01, etc.) - complex metrics branch
- attempted/failed/failedignored/passed/skipped - counters for the corresponding operation results (this branch element represents the status of the corresponding Ansible task)
- $az - some metrics include the availability zone of the operation at this level. Since this information is not always available, this part of the path varies
- curl - subtree for the curl type of metrics
- $name - short name of the host to be checked
- stats.timers.apimon.metric.$environment.$zone.**csm_lb_timings**.{public,private}.{http,https,tcp}.$az.__VALUE__ - timer values for the loadbalancer test
- stats.counters.apimon.metric.$environment.$zone.**csm_lb_timings**.{public,private}.{http,https,tcp}.$az.{attempted,passed,failed} - counter values for the loadbalancer test
- stats.timers.apimon.metric.$environment.$zone.**curl**.$host.{passed,failed}.__VALUE__ - timer values for the curl test
@ -128,4 +136,4 @@ These queries are used mainly on Test Results dashboard and Service specific sta
+-------------------------------+-------------------------------------------------------------------------------------------------------------+
.. image:: training_images/postgresql_query.jpg

@ -37,4 +37,4 @@ detected error codes or no responses at all.
.. image:: training_images/epmon_dashboard_details.jpg
EpMon findings are also reported to Alerta, and notifications are sent to the
dedicated Zulip topic "apimon_endpoint_monitoring".

@ -1,6 +1,6 @@
============================
How Can I Access Dashboard ?
============================
OTC LDAP authentication is supported on
https://dashboard.tsi-dev.otc-service.com.

@ -8,4 +8,3 @@ Frequently Asked Questions
how_can_i_access_dashboard
how_to_read_the_logs_and_understand_the_issue
what_are_the_annotations

@ -18,6 +18,6 @@ with the respective change directly on the dashboard:
- JIRA Change issue ID
- Impacted Availability Zone
- Affected Environment
- Main component
- Summary

@ -9,7 +9,7 @@ available to them via the Internet. While internal monitoring checks on the OTC
backplane are necessary, they are not sufficient to detect failures that
manifest in the interface, network connectivity, or the API logic itself. Simple
HTTP requests to the REST endpoints that check for 200 status codes are also
helpful, but not sufficient.
ApiMon is an Open Telekom Cloud product developed by the
Ecosystem squad.

@ -8,50 +8,29 @@ Logs
- Each job log file has a unique URL that can be opened to view the log
details
- These URLs are available on all ApiMon levels:
- In Zulip alarm messages
- In Alerta events
- In Grafana Dashboards
- Logs are simple plain text files of the whole playbook output::
2020-07-12 05:54:04.661170 | TASK [List Servers]
2020-07-12 05:54:09.050491 | localhost | ok
2020-07-12 05:54:09.067582 | TASK [Create Server in default AZ]
2020-07-12 05:54:46.055650 | localhost | MODULE FAILURE:
2020-07-12 05:54:46.055873 | localhost | Traceback (most recent call last):
2020-07-12 05:54:46.057441 | localhost |
2020-07-12 05:54:46.057499 | localhost | During handling of the above exception, another exception occurred:
2020-07-12 05:54:46.057535 | localhost |
2020-07-12 05:54:46.063992 | localhost | File "/tmp/ansible_os_server_payload_uz1c7_iw/ansible_os_server_payload.zip/ansible/modules/cloud/openstack/os_server.py", line 500, in _create_server
2020-07-12 05:54:46.065152 | localhost | return self._send_request(
2020-07-12 05:54:46.065186 | localhost | File "/root/.local/lib/python3.8/site-packages/keystoneauth1/session.py", line 1020, in _send_request
2020-07-12 05:54:46.065334 | localhost | raise exceptions.ConnectFailure(msg)
2020-07-12 05:54:46.065378 | localhost | keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to https://ims.eu-de.otctest.t-systems.com/v2/images: ('Connection aborted.', OSError(107, 'Transport endpoint is not connected'))
2020-07-12 05:54:46.295035 |
2020-07-12 05:54:46.295241 | TASK [Delete server]
2020-07-12 05:54:48.481374 | localhost | ok
2020-07-12 05:54:48.505761 |
2020-07-12 05:54:48.505906 | TASK [Delete SecurityGroup]
2020-07-12 05:54:50.727174 | localhost | changed
2020-07-12 05:54:50.745541 |
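To show how such a log can be read programmatically, here is a minimal Python sketch that pairs each ``TASK [...]`` header with the first host message that follows it. The function name ``task_results`` is hypothetical, and treating the first host line as the task result is an assumption based only on the excerpt above, not a documented log format.

```python
import re

# "<date> <time> | TASK [<name>]"
TASK_RE = re.compile(r"^\S+ \S+ \| TASK \[(?P<name>.+)\]\s*$")
# "<date> <time> | <host> | <message>"
HOST_RE = re.compile(r"^\S+ \S+ \| (?P<host>\S+) \| (?P<msg>.*)$")


def task_results(log_text):
    """Return (task name, first host message) pairs in log order."""
    results = []
    current = None
    for line in log_text.splitlines():
        task = TASK_RE.match(line)
        if task:
            current = [task.group("name"), None]
            results.append(current)
            continue
        host = HOST_RE.match(line)
        # Only the first host message after a TASK header is recorded.
        if host and current is not None and current[1] is None:
            current[1] = host.group("msg").rstrip(":")
    return [tuple(item) for item in results]


SAMPLE = """\
2020-07-12 05:54:04.661170 | TASK [List Servers]
2020-07-12 05:54:09.050491 | localhost | ok
2020-07-12 05:54:09.067582 | TASK [Create Server in default AZ]
2020-07-12 05:54:46.055650 | localhost | MODULE FAILURE:
2020-07-12 05:54:46.055873 | localhost | Traceback (most recent call last):
"""

for name, status in task_results(SAMPLE):
    print(f"{name}: {status}")
```

Filtering the result for anything other than ``ok`` or ``changed`` gives a quick list of the tasks worth inspecting in the full log.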

@ -46,6 +46,6 @@ achieve the desired state of service or service resource. For example boot up of
virtual machine from deployment until successful login via SSH.
.. code-block::
tags: ["metric=delete_server"]
tags: ["az={{ availability_zone }}", "service=compute", "metric=create_server{{ metric_suffix }}"]
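The tags above are plain ``key=value`` strings. How ApiMon itself consumes them is not shown here; the following Python sketch only illustrates how such strings can be split into a dictionary once the templates are rendered. The function name ``parse_tags`` and the rendered value ``eu-de-01`` are assumptions made for the example.

```python
def parse_tags(tags):
    """Split "key=value" tag strings into a dict (illustrative only)."""
    parsed = {}
    for tag in tags:
        key, sep, value = tag.partition("=")
        # Tags without "=" carry no key/value pair and are skipped.
        if sep:
            parsed[key] = value
    return parsed


# Assumed result of rendering the templated tags for one availability zone.
rendered = ["az=eu-de-01", "service=compute", "metric=create_server"]
tags = parse_tags(rendered)
print(tags["metric"], tags["service"], tags["az"])
```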

@ -48,4 +48,4 @@ Monitoring dashboards
* KPI dashboards
* 24/7 dashboards
* Test results dashboards
* Specific service dashboards