Joscha
533f75ea71
Add --debug-transforms flag
2021-05-26 11:37:32 +02:00
Joscha
adb5d4ade3
Print files that are *not* deleted by cleanup
...
These are files that are not present on the remote source any more, but still
present locally. They also show up in the report.
2021-05-26 10:58:19 +02:00
Joscha
a879c6ab6e
Fix function being printed
2021-05-26 10:54:01 +02:00
Joscha
915e42fd07
Fix report not being printed if pferd exits normally
2021-05-26 10:53:54 +02:00
I-Al-Istannen
2d8dcc87ff
Send CSRF token in TFA request
2021-05-25 22:50:40 +02:00
I-Al-Istannen
66f0e398a1
Await result in tfa authenticate path
2021-05-25 19:19:51 +02:00
Joscha
30be4e29fa
Add workaround for RuntimeError after program finishes on Windows
2021-05-25 16:34:22 +00:00
I-Al-Istannen
263780e6a3
Use certifi to ensure CA certificates are bundled in pyinstaller
2021-05-25 18:24:06 +02:00
Joscha
07a75a37c3
Fix FileNotFoundError on Windows
2021-05-25 15:57:03 +00:00
Joscha
f85b75df8c
Switch from exit() to sys.exit()
...
Pyinstaller doesn't recognize exit().
2021-05-25 17:33:38 +02:00
Joscha
519a7ef435
Split --dump-config into two options
...
--dump-config with its optional argument tended to consume the command name, so
it had to be split up.
2021-05-25 17:17:35 +02:00
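The problem described in commit 519a7ef435 can be reproduced with plain argparse. The following is a minimal sketch, not PFERD's actual parser: the program name "demo", the "local" subcommand and the "--dump-config-to" flag name are assumptions made for illustration.

```python
import argparse


def build_ambiguous_parser() -> argparse.ArgumentParser:
    # One flag with an optional value: "--dump-config" or "--dump-config PATH".
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("--dump-config", nargs="?", const=True, metavar="PATH")
    subparsers = parser.add_subparsers(dest="command")
    subparsers.add_parser("local")  # hypothetical subcommand
    return parser


def build_split_parser() -> argparse.ArgumentParser:
    # Two flags: one never takes a value, the other always takes exactly one.
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("--dump-config", action="store_true")
    parser.add_argument("--dump-config-to", metavar="PATH")  # assumed name
    subparsers = parser.add_subparsers(dest="command")
    subparsers.add_parser("local")  # hypothetical subcommand
    return parser


# With nargs="?", the flag greedily takes "local" as its value,
# so no command is left for the subparser.
ambiguous = build_ambiguous_parser().parse_args(["--dump-config", "local"])

# After the split, "local" is parsed as the command again.
split = build_split_parser().parse_args(["--dump-config", "local"])
```

Splitting the option removes the ambiguity because neither flag has an optional value any more.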
I-Al-Istannen
a848194601
Rename plaintext link option to "plaintext"
2021-05-25 17:15:13 +02:00
Joscha
aabce764ac
Clean up TODOs
2021-05-25 15:54:01 +02:00
Joscha
5a331663e4
Rename functions for consistency
2021-05-25 15:49:06 +02:00
Joscha
40144f8bd8
Fix rule error messages
2021-05-25 15:47:09 +02:00
Joscha
f68849c65f
Fix rules not being parsed entirely
2021-05-25 15:42:46 +02:00
Joscha
edb52a989e
Print report even if exiting due to Ctrl+C
2021-05-25 15:35:36 +02:00
Joscha
980578d05a
Avoid downloading in some cases
...
Depending on how on_conflict is set, we can determine a few situations where
downloading is never necessary.
2021-05-25 15:20:30 +02:00
I-Al-Istannen
486699cef3
Create anonymous TFA authenticator in ilias crawler
...
This ensures that *some* TFA authenticator is always present when
authenticating, even if none is specified in the config.
The TfaAuthenticator does not depend on any configured values, so it can
be created on-demand.
2021-05-25 15:11:52 +02:00
I-Al-Istannen
0096a0c077
Remove section and config parameter from Authenticator
2021-05-25 15:11:33 +02:00
I-Al-Istannen
d905e95dbb
Allow invalidation of keyring authenticator
2021-05-25 15:02:35 +02:00
Joscha
61430c8739
Overhaul config and CLI option names
2021-05-25 14:23:38 +02:00
Joscha
eb8b915813
Fix path prefix on Windows
...
Previously, the path prefix was only set if "windows_paths" was true, regardless
of OS. Now the path prefix is always set on Windows and never set on other OSes.
2021-05-25 14:23:38 +02:00
Joscha
22c2259adb
Clean up authenticator exceptions
...
- Renamed to *Error for consistency
- Treating AuthError like CrawlError
2021-05-25 14:23:38 +02:00
Joscha
c15a1aecdf
Rename keyring authenticator file for consistency
2021-05-25 14:20:26 +02:00
I-Al-Istannen
651b087932
Use cl/dl deduplication mechanism for ILIAS crawler
2021-05-25 12:15:38 +02:00
Joscha
bce3dc384d
Deduplicate path names in crawler
...
Also rename files so they follow the restrictions for Windows file names if
we're on Windows.
2021-05-25 12:11:15 +02:00
I-Al-Istannen
c21ddf225b
Add a CLI option to configure ILIAS links behaviour
2021-05-25 11:58:41 +02:00
I-Al-Istannen
4fefb98d71
Add a wrapper to pretty-print ValueErrors in argparse parsers
2021-05-25 11:57:59 +02:00
I-Al-Istannen
ffda4e43df
Add extension to link files
2021-05-25 11:41:57 +02:00
I-Al-Istannen
69cb2a7734
Add Links option to ilias crawler
...
This allows you to configure what type the link files should have and
whether to create them at all.
2021-05-25 11:41:57 +02:00
I-Al-Istannen
85f89a7ff3
Interpret accordions and expandable headers as virtual folders
...
This allows us to find a file named "Test" in an accordion "Acc" as "Acc/Test".
2021-05-24 18:54:26 +02:00
I-Al-Istannen
9ce20216b5
Do not set a timeout for whole HTTP request
...
Downloads might take longer!
2021-05-24 18:54:26 +02:00
Joscha
86ba47541b
Fix cookie loading and saving
2021-05-24 16:55:11 +02:00
I-Al-Istannen
492ec6a932
Detect and skip ILIAS tests
2021-05-24 16:36:15 +02:00
I-Al-Istannen
342076ee0e
Handle exercise detail containers in ILIAS html parser
2021-05-24 16:22:51 +02:00
I-Al-Istannen
d44f6966c2
Log authentication attempts in HTTP crawler
2021-05-24 16:22:11 +02:00
Joscha
1c1f781be4
Reword some log messages
2021-05-24 13:17:28 +02:00
Joscha
c687d4a51a
Implement cookie sharing
2021-05-24 13:10:44 +02:00
I-Al-Istannen
fca62541ca
De-duplicate element names in ILIAS crawler
...
This prevents any conflicts caused by multiple files with the same name.
Conflicts may still arise due to transforms, but that is out of our
control and a user error.
2021-05-24 00:24:31 +02:00
I-Al-Istannen
3ab3581f84
Add timeout for HTTP connection
2021-05-23 23:41:05 +02:00
I-Al-Istannen
8dd0689420
Add keyring authentication to ILIAS CLI
2021-05-23 23:04:18 +02:00
Joscha
79be6e1dc5
Switch some other options to BooleanOptionalAction
2021-05-23 22:49:09 +02:00
Joscha
edbd92dbbf
Add --status and --report flags
2021-05-23 22:41:59 +02:00
Joscha
27b5a8e490
Rename log.action to log.status
2021-05-23 22:40:33 +02:00
Joscha
1f400d5964
Implement BooleanOptionalAction
2021-05-23 22:26:59 +02:00
Joscha
0ca0680165
Simplify --version
2021-05-23 21:40:48 +02:00
Joscha
ce1dbda5b4
Overhaul colours
...
"Crawled" and "Downloaded" are now printed less bright than "Crawling" and
"Downloading" as they're not as important. Explain topics are printed in yellow
to stand out a bit more from the cyan action messages.
2021-05-23 21:33:04 +02:00
Joscha
9cce78669f
Print report after all crawlers have finished
2021-05-23 21:17:13 +02:00
Joscha
6ca0ecdf05
Load and store reports
2021-05-23 20:46:29 +02:00
I-Al-Istannen
6e9f8fd391
Add a keyring authenticator
2021-05-23 19:44:12 +02:00
Joscha
2fdf24495b
Restructure crawling and auth related modules
2021-05-23 19:16:42 +02:00
Joscha
bbf9f8f130
Add -C as alias for --crawler
2021-05-23 19:06:09 +02:00
I-Al-Istannen
37f8d84a9c
Output total number of HTTP requests in HTTP crawler
2021-05-23 19:00:01 +02:00
Joscha
5edd868d5b
Fix always-smart redownloading the wrong files
2021-05-23 18:49:34 +02:00
Joscha
e4e5e83be6
Fix downloader using crawl bar
...
Looks like I made a dumb copy-paste error. Now the download bar shows the proper
progress and speed again.
2021-05-23 18:39:43 +02:00
Joscha
74c7b39dc8
Clean up files in alphabetical order
2021-05-23 18:39:25 +02:00
Joscha
445dffc987
Reword some explanations
2021-05-23 18:35:32 +02:00
I-Al-Istannen
d97d6bf147
Fix handling nested ILIAS folders
2021-05-23 18:29:28 +02:00
I-Al-Istannen
79efdb56f7
Adjust ILIAS html explain messages
2021-05-23 18:24:25 +02:00
Joscha
a9af56a5e9
Improve specifying crawlers via CLI
...
Instead of removing the sections of unselected crawlers from the config file,
crawler selection now happens in the Pferd after loading the crawlers and is
more sophisticated. It also has better error messages.
2021-05-23 18:18:50 +02:00
I-Al-Istannen
59f13bb8d6
Explain ILIAS HTML parsing and add some warnings
2021-05-23 18:14:54 +02:00
I-Al-Istannen
463f8830d7
Add warn_contd
2021-05-23 18:14:54 +02:00
I-Al-Istannen
05ad06fbc1
Only enclose get_page in iorepeat in ILIAS crawler
...
We previously also gathered in there, which could lead to some more
surprises when the method was retried.
2021-05-23 18:14:51 +02:00
Joscha
29d5a40c57
Replace asyncio.gather with custom Crawler function
2021-05-23 17:25:16 +02:00
Joscha
c0cecf8363
Log crawl and download actions more extensively
2021-05-23 16:25:44 +02:00
Joscha
b998339002
Fix cleanup logging of paths
2021-05-23 16:25:44 +02:00
Joscha
245c9c3dcc
Explain output dir decisions and steps
2021-05-23 16:25:44 +02:00
I-Al-Istannen
d8f26a789e
Implement CLI Command for ilias crawler
2021-05-23 13:30:42 +02:00
I-Al-Istannen
e1d18708b3
Rename "no_videos" to videos
2021-05-23 13:30:42 +02:00
Joscha
b44b49476d
Fix noncritical and anoncritical decorators
...
I must've forgotten to update the anoncritical decorator when I last changed the
noncritical decorator. Also, every exception should make the crawler not
error_free, not just CrawlErrors.
2021-05-23 13:24:53 +02:00
Joscha
7e0bb06259
Clean up TODOs
2021-05-23 12:47:30 +02:00
I-Al-Istannen
ecdedfa1cf
Add no-videos flag to ILIAS crawler
2021-05-23 12:37:01 +02:00
I-Al-Istannen
3d4b997d4a
Retry crawl_url and work around Python's closure handling
...
Closures capture the scope and not the variables. Therefore, any
type-narrowing performed by mypy on captured variables is lost inside
the closure.
2021-05-23 12:28:15 +02:00
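The closure behaviour mentioned in commit 3d4b997d4a can be illustrated in isolation. This is a small sketch with hypothetical function names, not code from the ILIAS crawler: the first function shows that a closure looks its variable up when it is called, and the second shows one possible workaround, binding the narrowed value to a fresh name that is never reassigned.

```python
from typing import Callable, List, Optional


def late_binding() -> List[int]:
    # Each lambda refers to the variable "i" itself, not to the value
    # "i" had when the lambda was created.
    callbacks = []
    for i in range(3):
        callbacks.append(lambda: i)
    return [callback() for callback in callbacks]


def make_shouter(text: Optional[str]) -> Callable[[], str]:
    if text is None:
        text = "default"
    # A type checker narrows "text" to str here, but that narrowing does
    # not carry over into a nested function, because "text" could be
    # reassigned before the closure runs. Binding the value to a new,
    # non-optional name avoids the problem.
    narrowed: str = text

    def shout() -> str:
        return narrowed.upper()

    return shout
```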
Joscha
e81005ae4b
Fix CLI arguments
2021-05-23 12:24:21 +02:00
I-Al-Istannen
33a81a5f5c
Document authentication in HTTP crawler and rename prepare_request
2021-05-23 11:55:34 +02:00
Joscha
25e2abdb03
Improve transformer explain wording
2021-05-23 11:45:14 +02:00
Joscha
803e5628a2
Clean up logging
...
Paths are now (hopefully) logged consistently across all crawlers
2021-05-23 11:37:19 +02:00
Joscha
c88f20859a
Explain config file dumping
2021-05-23 11:04:50 +02:00
Joscha
ec3767c545
Create crawler base dir at start of crawl
2021-05-23 10:52:02 +02:00
Joscha
729ff0a4c7
Fix simple authenticator output
2021-05-23 10:45:37 +02:00
Joscha
6fe51e258f
Number rules starting at 1
2021-05-23 10:45:37 +02:00
Joscha
44ecb2fbe7
Fix cleanup deleting crawler's base directory
2021-05-23 10:45:37 +02:00
I-Al-Istannen
53e031d9f6
Reuse dl/cl for I/O retries in ILIAS crawler
2021-05-23 00:28:27 +02:00
I-Al-Istannen
8ac85ea0bd
Fix a few typos in HttpCrawler
2021-05-22 23:37:34 +02:00
I-Al-Istannen
adfdc302d7
Save cookies after successful authentication in HTTP crawler
2021-05-22 23:30:32 +02:00
I-Al-Istannen
3053278721
Move HTTP crawler to own file
2021-05-22 23:23:21 +02:00
I-Al-Istannen
4d07de0d71
Adjust forum log message in ilias crawler
2021-05-22 23:20:21 +02:00
I-Al-Istannen
953a1bba93
Adjust to new crawl / download names
2021-05-22 23:18:05 +02:00
Joscha
e724ff7c93
Fix normal arrow
2021-05-22 20:44:59 +00:00
Joscha
62f0f7bfc5
Explain crawling and partially explain downloading
2021-05-22 20:39:57 +00:00
Joscha
9cb2b68f09
Fix arrow parsing error messages
2021-05-22 20:39:29 +00:00
Joscha
1bbc0b705f
Improve transformer error handling
2021-05-22 20:38:56 +00:00
Joscha
662191eca9
Fix crash as soon as first cl or dl token was acquired
2021-05-22 20:25:58 +00:00
Joscha
ae3d80664c
Update local crawler to new crawler structure
2021-05-22 21:46:36 +02:00
Joscha
e21795ee35
Make file cleanup part of default crawler behaviour
2021-05-22 21:45:51 +02:00
Joscha
ec95dda18f
Unify crawling and downloading steps
...
Now, the progress bar, limiter etc. for downloading and crawling are all handled
via the reusable CrawlToken and DownloadToken context managers.
2021-05-22 21:36:53 +02:00
Joscha
098ac45758
Remove deprecated repeat decorators
2021-05-22 21:13:25 +02:00
Joscha
9889ce6b57
Improve PFERD error handling
2021-05-22 21:13:25 +02:00
Joscha
b4d97cd545
Improve output dir and report error handling
2021-05-22 20:54:42 +02:00
Joscha
afac22c562
Handle abort in exclusive output state correctly
...
If the event loop is stopped while something holds the exclusive output, the
"log" singleton is now reset so the main thread can print a few more messages
before exiting.
2021-05-22 18:58:19 +02:00
Joscha
552cd82802
Run async input and password getters in daemon thread
...
Previously, it ran in the event loop's default executor, which would block until
all its workers were done working.
However, if Ctrl+C was pressed while input or a password was being read, only
the asyncio.run() call in the main thread would be interrupted, not the input
thread. This meant that multiple key presses (either enter or a second Ctrl+C)
were necessary to stop a running PFERD in some circumstances.
This change instead runs the input functions in daemon threads so they exit as
soon as the main thread exits.
2021-05-22 18:37:53 +02:00
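The approach described in commit 552cd82802 can be sketched as follows. This is an illustration under assumptions, not PFERD's implementation: the helper name run_in_daemon_thread is hypothetical, and it only shows the general pattern of running a blocking call in a daemon thread and handing the result back to the event loop.

```python
import asyncio
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


async def run_in_daemon_thread(func: Callable[[], T]) -> T:
    # Hypothetical helper: runs a blocking function such as input() or
    # getpass.getpass() outside the event loop's default executor.
    loop = asyncio.get_running_loop()
    future: "asyncio.Future[T]" = loop.create_future()

    def set_result(result: T) -> None:
        if not future.done():
            future.set_result(result)

    def set_exception(error: Exception) -> None:
        if not future.done():
            future.set_exception(error)

    def worker() -> None:
        try:
            result = func()
        except Exception as error:
            # Futures are not thread-safe, so results are handed over
            # via call_soon_threadsafe.
            loop.call_soon_threadsafe(set_exception, error)
        else:
            loop.call_soon_threadsafe(set_result, result)

    # A daemon thread does not keep the interpreter alive, so a blocked
    # input() call cannot delay the exit of the main thread.
    threading.Thread(target=worker, daemon=True).start()
    return await future
```

A call such as `await run_in_daemon_thread(lambda: input("Username: "))` would then read a line without tying up an executor worker.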
Joscha
dfde0e2310
Improve reporting of unexpected exceptions
2021-05-22 18:36:25 +02:00
Joscha
54dd2f8337
Clean up main and improve error handling
2021-05-22 16:47:24 +02:00
Joscha
b5785f260e
Extract CLI argument parsing to separate module
2021-05-22 15:03:45 +02:00
Joscha
98b8ca31fa
Add some todos
2021-05-22 14:45:46 +02:00
I-Al-Istannen
4b104b6252
Try out some HTTP authentication handling
...
This is by no means final yet and will change a bit once the dl and cl
are changed, but it might serve as a first try. It is also wholly
untested.
2021-05-21 12:02:51 +02:00
I-Al-Istannen
83d12fcf2d
Add some explains to ilias crawler and use crawler exceptions
2021-05-20 14:58:54 +02:00
I-Al-Istannen
e4f9560655
Only retry on aiohttp errors in ILIAS crawler
...
This patch removes quite a few retries and now only retries the ilias
element method. Every other HTTP-interacting method (except for the root
requests) is called from there and should be covered.
In the future we also want to retry the root a few times, but that
will be done after the download sink API is adjusted.
2021-05-19 22:01:09 +02:00
I-Al-Istannen
8cfa818f04
Only call should_crawl once
2021-05-19 21:57:55 +02:00
I-Al-Istannen
81301f3a76
Rename the ilias crawler to ilias web crawler
2021-05-19 21:41:17 +02:00
I-Al-Istannen
2976b4d352
Move ILIAS file templates to own file
2021-05-19 21:37:10 +02:00
I-Al-Istannen
9f03702e69
Split up ilias crawler in multiple files
...
The ilias crawler contained a crawler and an HTML parser, now they are
split in two.
2021-05-19 21:34:36 +02:00
Joscha
3300886120
Explain config file loading
2021-05-19 18:11:43 +02:00
Joscha
0d10752b5a
Configure explain log level via cli and config file
2021-05-19 17:50:10 +02:00
Joscha
92886fb8d8
Implement --version flag
2021-05-19 17:33:36 +02:00
Joscha
5916626399
Make noqa comment more specific
2021-05-19 17:16:59 +02:00
Joscha
a7c025fd86
Implement reusable FileSinkToken for OutputDirectory
2021-05-19 17:16:23 +02:00
Joscha
b7a999bc2e
Clean up crawler exceptions and (a)noncritical
2021-05-19 13:25:57 +02:00
Joscha
3851065500
Fix local crawler's download bars
...
Display the pure path instead of the local path.
2021-05-18 23:23:40 +02:00
Joscha
4b68fa771f
Move logging logic to singleton
...
- Renamed module and class because "conductor" didn't make a lot of sense
- Used singleton approach (there's only one stdout after all)
- Redesigned progress bars (now with download speed!)
2021-05-18 22:45:19 +02:00
I-Al-Istannen
1525aa15a6
Fix link template error and use indeterminate progress bar
2021-05-18 22:40:28 +02:00
I-Al-Istannen
db1219d4a9
Create a link file in ILIAS crawler
...
This allows us to crawl links and represent them in the file system.
Users can choose between an ILIAS-imitation (that optionally
auto-redirects) and a plain text variant.
2021-05-17 21:44:54 +02:00
I-Al-Istannen
b8efcc2ca5
Respect filters in ILIAS crawler
2021-05-17 21:30:26 +02:00
Joscha
0bae009189
Run formatting tools
2021-05-16 14:32:53 +02:00
I-Al-Istannen
8b76ebb3ef
Rename IliasCrawler to KitIliasCrawler
2021-05-16 13:28:06 +02:00
I-Al-Istannen
2b6235dc78
Fix pylint warnings (and 2 found bugs) in ILIAS crawler
2021-05-16 13:17:12 +02:00
I-Al-Istannen
1c226c31aa
Add some repeat annotations to the ILIAS crawler
2021-05-16 13:01:56 +02:00
I-Al-Istannen
9ec0d3e16a
Implement date-demangling in ILIAS crawler
2021-05-16 13:01:56 +02:00
I-Al-Istannen
cf6903d109
Retry crawling on I/O failure
2021-05-16 13:01:56 +02:00
Joscha
9fd356d290
Ensure tmp files are deleted
...
This doesn't seem to fix the case where an exception bubbles up to the top of
the event loop. It also doesn't seem to fix the case when a KeyboardInterrupt is
thrown, since that never makes its way into the event loop in the first place.
Both of these cases lead to the event loop stopping, which means that the tmp
file cleanup doesn't get executed even though it's inside a "with" or "finally".
2021-05-15 23:00:40 +02:00
Joscha
989032fe0c
Fix cookies getting deleted
2021-05-15 22:25:48 +02:00
Joscha
05573ccc53
Add fancy CLI options
2021-05-15 22:22:01 +02:00
I-Al-Istannen
c454fabc9d
Add support for exercises in ILIAS crawler
2021-05-15 21:40:17 +02:00
I-Al-Istannen
7d323ec62b
Implement video downloads in ilias crawler
2021-05-15 21:32:32 +02:00
I-Al-Istannen
c7494e32ce
Start implementing crawling in ILIAS crawler
...
The ilias crawler can now crawl quite a few filetypes, splits off
folders and crawls them concurrently.
2021-05-15 20:42:18 +02:00
I-Al-Istannen
1123c8884d
Implement an IliasPage
...
This allows PFERD to semantically understand ILIAS HTML and is the
foundation for the ILIAS crawler. This patch extends the ILIAS crawler
to crawl the personal desktop and print the elements on it.
2021-05-15 18:59:23 +02:00
Joscha
e1104f888d
Add tfa authenticator
2021-05-15 18:27:16 +02:00
Joscha
8c32da7f19
Let authenticators provide username and password separately
2021-05-15 18:27:03 +02:00
Joscha
d63494908d
Properly invalidate exceptions
...
The simple authenticator now properly invalidates its credentials. Also, the
invalidation functions have been given better names and documentation.
2021-05-15 17:37:05 +02:00
Joscha
b70b62cef5
Make crawler sections start with "crawl:"
...
Also, use only the part of the section name after the "crawl:" as the crawler's
output directory. Now, the implementation matches the documentation again.
2021-05-15 17:24:37 +02:00
Joscha
868f486922
Rename local crawler path to target
2021-05-15 17:12:25 +02:00
I-Al-Istannen
b2a2b5999b
Implement ILIAS auth and crawl home page
...
This commit introduces the necessary machinery to authenticate with
ILIAS and crawl the home page.
It can't do much yet and just silently fetches the homepage.
2021-05-15 15:25:05 +02:00
Joscha
595de88d96
Fix authenticator and crawler names
...
Now, the "auth:" and "crawl:" parts are considered part of the name. This fixes
crawlers not being able to find their authenticators.
2021-05-15 15:25:05 +02:00
Joscha
a6fdf05ee9
Allow variable whitespace in arrow rules
2021-05-15 15:25:05 +02:00
Joscha
f897d7c2e1
Add name variants for all arrows
2021-05-15 15:25:05 +02:00
Joscha
b0f731bf84
Make crawlers use transformers
2021-05-15 15:25:05 +02:00
Joscha
302b8c0c34
Fix errors loading local crawler config
...
Apparently getint and getfloat may return a None even though this is not
mentioned in their type annotations.
2021-05-15 15:25:05 +02:00
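The behaviour noted in commit 302b8c0c34 can be observed with the standard configparser module: on a section proxy, getint() and getfloat() return None for a missing key instead of raising. The sketch below uses a made-up section and key names, and the helper get_int_or is hypothetical.

```python
import configparser

parser = configparser.ConfigParser()
parser.read_string("[crawl:example]\ntasks = 3\n")
section = parser["crawl:example"]


def get_int_or(section: configparser.SectionProxy, key: str, default: int) -> int:
    # Hypothetical helper: treat a missing key like an explicit default.
    value = section.getint(key)
    return default if value is None else value


present = get_int_or(section, "tasks", 1)      # key exists in the section
missing = get_int_or(section, "delay", 1)      # key is absent, default is used
raw_missing = section.getint("delay")          # None rather than an exception
```

Passing `fallback=` to getint() directly is an alternative to the explicit None check.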
Joscha
acd674f0a0
Change limiter logic
...
Now download tasks are a subset of all tasks.
2021-05-15 15:25:05 +02:00
Joscha
ed2e19a150
Add reasons for invalid values
2021-05-15 15:25:05 +02:00