Help Center Spider
The Open Telekom Cloud Help Center Spider is a tool that visits all links, starting from its landing page at https://docs.otc.t-systems.com/, to find and identify URLs that are not correct. It parses all types of hyperlinks and normalizes them into a canonical format. The spider descends into the document tree via […] breadth-first (width-first) search. [and does what?] [when is logged which event?]
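To illustrate the two ideas in the paragraph above, here is a minimal sketch of link normalization and a breadth-first traversal. It is not taken from the spider's source code: the function names `normalize` and `crawl_breadth_first`, the `links_of` callable, and the specific normalization rules (dropping fragments, lower-casing scheme and host, treating an empty path as `/`) are assumptions chosen for illustration.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlsplit, urlunsplit


def normalize(base: str, href: str) -> str:
    """Resolve href against base and return one canonical spelling."""
    # Make the link absolute and drop the #fragment, which never
    # identifies a different document.
    absolute, _fragment = urldefrag(urljoin(base, href))
    parts = urlsplit(absolute)
    # Scheme and host are case-insensitive; an empty path means "/".
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        parts.query,
        "",
    ))


def crawl_breadth_first(start: str, links_of) -> list:
    """Visit every reachable URL once, pages closest to start first.

    links_of is a hypothetical callable that returns the raw hrefs
    found on a page; a real spider would fetch and parse the page here.
    """
    start = normalize(start, "")
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()  # FIFO queue gives breadth-first order
        order.append(url)
        for href in links_of(url):
            target = normalize(url, href)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order
```

Because every link is normalized before it is compared against `seen`, different spellings of the same URL are visited only once.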
Getting started
Once you have installed the code and its required packages into a virtual environment and checked its configuration file config.json, start the web spider by invoking the tool without any arguments. Results are listed in [… TBD].
Requirements and Installation
To run the tool, clone this repository and prepare an environment for it. You can easily do this with a Python virtual environment:
$ cd _local_folder_/
$ git clone https://gitea.eco.tsi-dev.otc-service.com/infra/hc-spider.git
$ cd hc-spider
$ python -m venv venv/
$ source venv/bin/activate
(venv)$ python -m pip install -r requirements.txt
Configuration
In config.json you can define several items:
- watchdog_file: if you run the tool in the background and want to stop it properly (without sending a signal with kill), just send an exit message into the watchdog file: echo exit > watchdog.fifo
- timer_runtime: maximum runtime limit in seconds.
- log_dir: logging folder.
- logging_interval: frequency of dumping log files.
- workers: number of workers (background processes) you want to run. If set to 0, the number is derived from the number of CPU cores (number_of_cores - 1).
- starting_point: base URL where the spider starts.
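As an illustration of the settings listed above, the following sketch parses a sample configuration and resolves the `workers` value. Only the key names come from this README; all values are hypothetical, and `resolve_workers` is an assumed helper, not the tool's actual code. The lower bound of one worker on single-core machines is also an assumption.

```python
import json
import os
from typing import Optional

# Hypothetical sample values; only the key names come from the list above.
SAMPLE_CONFIG = """
{
  "watchdog_file": "watchdog.fifo",
  "timer_runtime": 3600,
  "log_dir": "log",
  "logging_interval": 60,
  "workers": 0,
  "starting_point": "https://docs.otc.t-systems.com/"
}
"""


def resolve_workers(configured: int, cpu_count: Optional[int] = None) -> int:
    """Return the configured worker count, or cores - 1 when it is 0."""
    if configured > 0:
        return configured
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    # Assumption: never drop below one worker on a single-core machine.
    return max(1, cores - 1)


config = json.loads(SAMPLE_CONFIG)
workers = resolve_workers(config["workers"])
```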
Operations
There are two ways to start the spider:
In the foreground
$ source venv/bin/activate
(venv)$ python main.py
In the background
$ source venv/bin/activate
(venv)$ nohup python main.py > log/hc_spider.log 2> log/hc_spider.err <&- &
Stopping the process politely
To stop the tool when it runs in the background, send a command to the control FIFO with:
(venv)$ echo exit > _watchdog_file_
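For readers curious how such a control FIFO can work, here is a minimal sketch of a watchdog thread that waits for the `exit` message. It is an assumption about the mechanism, not the spider's actual implementation; `watch_fifo` and `start_watchdog` are hypothetical names, and `os.mkfifo` is available on Unix-like systems only.

```python
import os
import threading


def watch_fifo(path: str, stop: threading.Event) -> None:
    """Read the FIFO and set `stop` once a line reading "exit" arrives."""
    while not stop.is_set():
        # open() blocks until a writer (e.g. `echo exit > path`) connects.
        with open(path) as fifo:
            for line in fifo:
                if line.strip() == "exit":
                    stop.set()
                    break
        # Any other message is ignored; the FIFO is reopened for the next writer.


def start_watchdog(path: str) -> threading.Event:
    """Create the FIFO if needed and watch it in a daemon thread."""
    if not os.path.exists(path):
        os.mkfifo(path)
    stop = threading.Event()
    threading.Thread(target=watch_fifo, args=(path, stop), daemon=True).start()
    return stop
```

The main loop of a long-running process could then poll `stop.is_set()` between units of work and shut down cleanly, which avoids interrupting it with a signal.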