Commit Graph

663 Commits

Author SHA1 Message Date
Joscha
b70b62cef5 Make crawler sections start with "crawl:"
Also, use only the part of the section name after the "crawl:" as the crawler's
output directory. Now, the implementation matches the documentation again
2021-05-15 17:24:37 +02:00
Joscha
868f486922 Rename local crawler path to target 2021-05-15 17:12:25 +02:00
I-Al-Istannen
b2a2b5999b Implement ILIAS auth and crawl home page
This commit introduces the necessary machinery to authenticate with
ILIAS and crawl the home page.

It can't do much yet and just silently fetches the homepage.
2021-05-15 15:25:05 +02:00
Joscha
595de88d96 Fix authenticator and crawler names
Now, the "auth:" and "crawl:" parts are considered part of the name. This fixes
crawlers not being able to find their authenticators.
2021-05-15 15:25:05 +02:00
Joscha
a6fdf05ee9 Allow variable whitespace in arrow rules 2021-05-15 15:25:05 +02:00
Joscha
f897d7c2e1 Add name variants for all arrows 2021-05-15 15:25:05 +02:00
Joscha
b0f731bf84 Make crawlers use transformers 2021-05-15 15:25:05 +02:00
Joscha
302b8c0c34 Fix errors loading local crawler config
Apparently getint and getfloat may return a None even though this is not
mentioned in their type annotations.
2021-05-15 15:25:05 +02:00
Joscha
acd674f0a0 Change limiter logic
Now download tasks are a subset of all tasks.
2021-05-15 15:25:05 +02:00
I-Al-Istannen
b0f9e1e8b4 Add vscode directory to gitignore 2021-05-15 15:25:05 +02:00
Joscha
ed2e19a150 Add reasons for invalid values 2021-05-15 15:25:05 +02:00
Joscha
296a169dd3 Make limiter logic more complex
The limiter can now distinguish between crawl and download actions and has a
fancy slot system and delay logic.
2021-05-15 15:25:05 +02:00
Joscha
1591cb9197 Add options to slow down local crawler
These options are meant to make the local crawler behave more like a
network-based crawler for purposes of testing and debugging other parts of the
code base.
2021-05-15 15:25:01 +02:00
Joscha
0c9167512c Fix output dir
I missed these while renaming the resolve function. Shame on me for not running
mypy earlier.
2021-05-14 21:28:38 +02:00
Joscha
a673ab0fae Delete old files
I should've done this earlier
2021-05-14 21:27:44 +02:00
Joscha
6e5fdf4e9e Set user agent to "pferd/<version>" 2021-05-14 21:27:44 +02:00
Joscha
93a5a94dab Single-source version number 2021-05-14 21:27:44 +02:00
Joscha
d565df27b3 Add HttpCrawler 2021-05-13 22:28:14 +02:00
Joscha
961f40f9a1 Document simple authenticator 2021-05-13 19:55:04 +02:00
Joscha
e3ee4e515d Disable highlighting of primitives
This commit prevents rich from highlighting python-looking syntax like numbers,
arrays, 'None' etc.
2021-05-13 19:47:44 +02:00
Joscha
94d6a01cca Use file mtime in local crawler 2021-05-13 19:42:40 +02:00
Joscha
38bb66a776 Update file metadata in more cases
PFERD now not only updates file metadata when a file is successfully added or
changed, but also when a file is downloaded and then detected to be unchanged.

This could occur for example if a remote file's modification time was bumped,
possibly because somebody touched the file without changing it.
2021-05-13 19:40:10 +02:00
Joscha
68781a88ab Fix asynchronous methods being not awaited 2021-05-13 19:39:49 +02:00
Joscha
910462bb72 Log stuff happening to files 2021-05-13 19:37:27 +02:00
Joscha
6bd6adb977 Fix tmp file names 2021-05-13 19:36:46 +02:00
Joscha
0acdee15a0 Let crawlers obtain authenticators 2021-05-13 18:57:20 +02:00
Joscha
c3ce6bb31c Fix crawler cleanup not being awaited 2021-05-11 00:28:45 +02:00
Joscha
0459ed093e Add simple authenticator
... including some required authenticator infrastructure
2021-05-11 00:28:03 +02:00
Joscha
d5f29f01c5 Use global conductor instance
The switch from crawler-local conductors to a single pferd-global conductor was
made to prepare for auth section credential providers.
2021-05-11 00:05:04 +02:00
Joscha
595ba8b7ab Remove dummy crawler 2021-05-10 23:47:46 +02:00
Joscha
cec0a8e1fc Fix mymy errors 2021-05-09 01:45:01 +02:00
Joscha
f9b2fd60e2 Document local crawler and auth 2021-05-09 01:33:47 +02:00
Joscha
60cd9873bc Add local file crawler 2021-05-06 01:02:40 +02:00
Joscha
273d56c39a Properly load crawler config 2021-05-05 23:45:10 +02:00
Joscha
5497dd2827 Add @noncritical and @repeat decorators 2021-05-05 23:36:54 +02:00
Joscha
bbfdadc463 Implement output directory 2021-05-05 18:08:34 +02:00
Joscha
fde811ae5a Document on_conflict option 2021-05-05 12:24:35 +02:00
Joscha
07e831218e Add sync report 2021-05-02 00:56:10 +02:00
Joscha
91c33596da Load crawlers from config file 2021-04-30 16:22:14 +02:00
Joscha
a8dcf941b9 Document possible redownload settings 2021-04-30 15:32:56 +02:00
Joscha
e7a51decb0 Elaborate on transforms and implement changes 2021-04-29 20:24:18 +02:00
Joscha
9ec19be113 Document config file format 2021-04-29 20:24:18 +02:00
Joscha
f776186480 Use PurePath instead of Path
Path should only be used when we need to access the file system. For all other
purposes (mainly crawling), we use PurePath instead since the paths don't
correspond to paths in the local file system.
2021-04-29 20:20:25 +02:00
Joscha
0096d83387 Simplify Limiter implementation 2021-04-29 20:20:25 +02:00
Joscha
20a24dbcbf Add changelog 2021-04-29 20:20:25 +02:00
Joscha
502654d853 Fix mypy errors 2021-04-29 15:47:52 +02:00
Joscha
d2103d7c44 Document crawler 2021-04-29 15:43:20 +02:00
Joscha
d96a361325 Test and fix exclusive output 2021-04-29 15:27:16 +02:00
Joscha
2e85d26b6b Use conductor via context manager 2021-04-29 14:23:28 +02:00
Joscha
6431a3fb3d Fix some mypy errors 2021-04-29 14:23:09 +02:00