33a81a5f5c
Document authentication in HTTP crawler and rename prepare_request
2021-05-23 11:55:34 +02:00
803e5628a2
Clean up logging
...
Paths are now (hopefully) logged consistently across all crawlers.
2021-05-23 11:37:19 +02:00
53e031d9f6
Reuse dl/cl for I/O retries in ILIAS crawler
2021-05-23 00:28:27 +02:00
3053278721
Move HTTP crawler to own file
2021-05-22 23:23:21 +02:00
4d07de0d71
Adjust forum log message in ilias crawler
2021-05-22 23:20:21 +02:00
953a1bba93
Adjust to new crawl / download names
2021-05-22 23:18:05 +02:00
ae3d80664c
Update local crawler to new crawler structure
2021-05-22 21:46:36 +02:00
4b104b6252
Try out some HTTP authentication handling
...
This is by no means final yet and will change a bit once the dl and cl
are changed, but it might serve as a first try. It is also wholly
untested.
2021-05-21 12:02:51 +02:00
83d12fcf2d
Add some explains to ilias crawler and use crawler exceptions
2021-05-20 14:58:54 +02:00
e4f9560655
Only retry on aiohttp errors in ILIAS crawler
...
This patch removes quite a few retries and now only retries the ilias
element method. Every other HTTP-interacting method (except for the root
requests) is called from there and should be covered.
In the future we also want to retry the root a few times, but that
will be done after the download sink API is adjusted.
2021-05-19 22:01:09 +02:00
8cfa818f04
Only call should_crawl once
2021-05-19 21:57:55 +02:00
81301f3a76
Rename the ilias crawler to ilias web crawler
2021-05-19 21:41:17 +02:00
2976b4d352
Move ILIAS file templates to own file
2021-05-19 21:37:10 +02:00
9f03702e69
Split up ilias crawler in multiple files
...
The ilias crawler contained both a crawler and an HTML parser. Now they are
split in two.
2021-05-19 21:34:36 +02:00
0d10752b5a
Configure explain log level via cli and config file
2021-05-19 17:50:10 +02:00
5916626399
Make noqa comment more specific
2021-05-19 17:16:59 +02:00
3851065500
Fix local crawler's download bars
...
Display the pure path instead of the local path.
2021-05-18 23:23:40 +02:00
4b68fa771f
Move logging logic to singleton
...
- Renamed module and class because "conductor" didn't make a lot of sense
- Used singleton approach (there's only one stdout after all)
- Redesigned progress bars (now with download speed!)
2021-05-18 22:45:19 +02:00
1525aa15a6
Fix link template error and use indeterminate progress bar
2021-05-18 22:40:28 +02:00
db1219d4a9
Create a link file in ILIAS crawler
...
This allows us to crawl links and represent them in the file system.
Users can choose between an ILIAS-imitation (that optionally
auto-redirects) and a plain text variant.
2021-05-17 21:44:54 +02:00
b8efcc2ca5
Respect filters in ILIAS crawler
2021-05-17 21:30:26 +02:00
0bae009189
Run formatting tools
2021-05-16 14:32:53 +02:00
8b76ebb3ef
Rename IliasCrawler to KitIliasCrawler
2021-05-16 13:28:06 +02:00
2b6235dc78
Fix pylint warnings (and 2 found bugs) in ILIAS crawler
2021-05-16 13:17:12 +02:00
1c226c31aa
Add some repeat annotations to the ILIAS crawler
2021-05-16 13:01:56 +02:00
9ec0d3e16a
Implement date-demangling in ILIAS crawler
2021-05-16 13:01:56 +02:00
cf6903d109
Retry crawling on I/O failure
2021-05-16 13:01:56 +02:00
c454fabc9d
Add support for exercises in ILIAS crawler
2021-05-15 21:40:17 +02:00
7d323ec62b
Implement video downloads in ilias crawler
2021-05-15 21:32:32 +02:00
c7494e32ce
Start implementing crawling in ILIAS crawler
...
The ilias crawler can now crawl quite a few file types. It also splits off
folders and crawls them concurrently.
2021-05-15 20:42:18 +02:00
1123c8884d
Implement an IliasPage
...
This allows PFERD to semantically understand ILIAS HTML and is the
foundation for the ILIAS crawler. This patch extends the ILIAS crawler
to crawl the personal desktop and print the elements on it.
2021-05-15 18:59:23 +02:00
8c32da7f19
Let authenticators provide username and password separately
2021-05-15 18:27:03 +02:00
d63494908d
Properly invalidate exceptions
...
The simple authenticator now properly invalidates its credentials. Also, the
invalidation functions have been given better names and documentation.
2021-05-15 17:37:05 +02:00
868f486922
Rename local crawler path to target
2021-05-15 17:12:25 +02:00
b2a2b5999b
Implement ILIAS auth and crawl home page
...
This commit introduces the necessary machinery to authenticate with
ILIAS and crawl the home page.
It can't do much yet and just silently fetches the homepage.
2021-05-15 15:25:05 +02:00
b0f731bf84
Make crawlers use transformers
2021-05-15 15:25:05 +02:00
302b8c0c34
Fix errors loading local crawler config
...
Apparently getint and getfloat may return a None even though this is not
mentioned in their type annotations.
2021-05-15 15:25:05 +02:00
ed2e19a150
Add reasons for invalid values
2021-05-15 15:25:05 +02:00
1591cb9197
Add options to slow down local crawler
...
These options are meant to make the local crawler behave more like a
network-based crawler for purposes of testing and debugging other parts of the
code base.
2021-05-15 15:25:01 +02:00
94d6a01cca
Use file mtime in local crawler
2021-05-13 19:42:40 +02:00
0acdee15a0
Let crawlers obtain authenticators
2021-05-13 18:57:20 +02:00
c3ce6bb31c
Fix crawler cleanup not being awaited
2021-05-11 00:28:45 +02:00
d5f29f01c5
Use global conductor instance
...
The switch from crawler-local conductors to a single pferd-global conductor was
made to prepare for auth section credential providers.
2021-05-11 00:05:04 +02:00
595ba8b7ab
Remove dummy crawler
2021-05-10 23:47:46 +02:00
f9b2fd60e2
Document local crawler and auth
2021-05-09 01:33:47 +02:00
60cd9873bc
Add local file crawler
2021-05-06 01:02:40 +02:00
273d56c39a
Properly load crawler config
2021-05-05 23:45:10 +02:00
f776186480
Use PurePath instead of Path
...
Path should only be used when we need to access the file system. For all other
purposes (mainly crawling), we use PurePath instead since the paths don't
correspond to paths in the local file system.
2021-04-29 20:20:25 +02:00
d96a361325
Test and fix exclusive output
2021-04-29 15:27:16 +02:00
ac3bfd7388
Make progress bars easier to use
...
The crawler now supports two types of progress bars.
2021-04-29 13:53:16 +02:00