Commit Graph

80 Commits

Author SHA1 Message Date
46fb782798 Add forum crawling
This downloads all forum posts when needed and saves each thread in its
own html file, named after the thread title.
2022-05-24 23:43:53 +02:00
846c29aee1 Download page descriptions 2022-05-11 21:16:56 +02:00
a5015fe9b1 Correctly parse day-only meeting dates
I failed to recognize the correct format in the previous adjustment, so
this (hopefully) fixes it for good.
Meetings apparently don't always have a time portion.
2022-05-08 23:22:26 +02:00
616b0480f7 Simplify IPD crawler link regex 2022-05-08 18:18:05 +02:00
bcc537468c Fix crawling of expanded meetings
The last meeting on every page is expanded by default.
Its content is then shown inline *and* in the meeting page itself.
We should skip the inline content.
2022-05-05 22:53:37 +02:00
694ffb4d77 Fix meeting date parsing
Apparently the new pattern "<relative time qualifier>: <date>," was
added. This patch adds support for it.
2022-05-05 22:28:30 +02:00
af2cc1169a Mention href for users of link_regex option 2022-05-05 14:36:03 +02:00
bc3fa36637 Fix IPD crawler crashing on weird HTML comments 2022-05-05 14:35:42 +02:00
b8fe25c580 Add .cpp to ipd link regex 2022-05-04 14:19:26 +02:00
b56475450d Use utf-8 for cookies 2022-04-29 23:12:41 +02:00
602044ff1b Fix mypy errors and add missing await 2022-04-27 22:52:50 +02:00
a2831fbea2 Fix shib authentication
Authentication failed previously if the shib session was still valid.
If Shibboleth gets a request and the session is still valid, it directly
responds without a second redirect.
2022-04-27 13:55:24 +02:00
86e2e226dc Notify user when shibboleth presents new entitlements 2022-04-03 11:37:08 +02:00
7872fe5221 Fix tables with more columns than expected 2022-01-18 22:38:48 +01:00
4f022e2d19 Reword changelog 2022-01-15 15:06:02 +01:00
f47e7374d2 Use fixed windows path for video cache 2022-01-15 12:00:30 +01:00
57ec51e95a Fix login after shib url parser change 2022-01-14 20:17:27 +01:00
4ee919625d Add rudimentary support for content pages 2022-01-08 20:47:35 +01:00
d30f25ee97 Detect shib login page as login page
And do not assume we are logged in...
2022-01-08 20:28:45 +01:00
10d9d74528 Bail out when crawling recursive courses 2022-01-08 20:28:30 +01:00
43c5453e10 Correctly crawl files on desktop
The files on the desktop do not include a download link, so we need to
rewrite it.
2022-01-08 20:00:53 +01:00
e32c1f000f Fix mtime for single streams 2022-01-08 18:05:48 +01:00
5f527bc697 Remove Python 3.9 Pattern typehints 2022-01-08 17:14:40 +01:00
ced8b9a2d0 Fix some accordions 2022-01-08 16:58:30 +01:00
6f3cfd4396 Fix personal desktop crawling 2022-01-08 16:58:15 +01:00
462d993fbc Fix local video path cache (hopefully) 2022-01-08 00:27:48 +01:00
a99356f2a2 Fix video stream extraction 2022-01-08 00:27:34 +01:00
eac2e34161 Fix is_logged_in for ILIAS 7 2022-01-07 23:32:31 +01:00
a82a0b19c2 Collect crawler warnings/errors and include them in the report 2021-11-07 21:48:55 +01:00
90cb6e989b Do not download single videos if cache does not exist 2021-11-06 23:21:15 +01:00
6289938d7c Do not stop crawling files when encountering a CrawlWarning 2021-11-06 12:09:51 +01:00
88afe64a92 Refactor IPD crawler a bit 2021-11-02 01:25:01 +00:00
6b2a657573 Fix IPD crawler for different subpages (#42)
This patch reworks the IPD crawler to support subpages which do not use
"/intern" for links and fetches the folder names from table headings.
2021-11-02 01:25:01 +00:00
e42ab83d32 Add support for ILIAS cards 2021-10-30 18:13:44 +02:00
f9a3f9b9f2 Handle multi-stream videos 2021-10-30 18:12:29 +02:00
6673077397 Add kit-ipd crawler 2021-10-21 13:20:21 +02:00
544d45cbc5 Catch non-critical exceptions at crawler top level 2021-07-13 15:42:11 +02:00
ee67f9f472 Sort elements by ILIAS id to ensure deterministic ordering 2021-07-06 17:45:48 +02:00
8ec3f41251 Crawl ilias booking objects as links 2021-07-06 16:15:25 +02:00
89be07d4d3 Use final crawl path in HTML parsing message 2021-07-03 17:05:48 +02:00
91200f3684 Fix nondeterministic name deduplication 2021-07-03 12:09:55 +02:00
6e4d423c81 Crawl all video stages in one crawl bar
This ensures folders are not renamed, which happened when they were crawled twice
2021-06-13 17:18:45 +02:00
70ec64a48b Fix wrong base URL for multi-stage pages 2021-06-13 15:44:47 +02:00
8ab462fb87 Use the exercise label instead of the button name as path 2021-06-04 19:24:23 +02:00
df3ad3d890 Add 'skip' option to crawlers 2021-06-04 18:47:13 +02:00
722970a255 Store cookies in text-based format
Using the stdlib's http.cookies module, cookies are now stored as one
"Set-Cookie" header per line. Previously, the aiohttp.CookieJar's save() and
load() methods were used (which use pickling).
2021-05-31 20:18:20 +00:00
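
A minimal sketch of the text-based format this commit describes, assuming one "Set-Cookie" header per line. The function names dump_cookies and load_cookies are hypothetical and do not come from the project; only the stdlib http.cookies module is used, not aiohttp.

```python
from http.cookies import SimpleCookie


def dump_cookies(jar: SimpleCookie) -> str:
    """Serialize every cookie as one "Set-Cookie: ..." header per line."""
    # Morsel.output() renders a single header line, e.g.
    # "Set-Cookie: session=abc123; Path=/"
    return "\n".join(morsel.output() for morsel in jar.values())


def load_cookies(text: str) -> SimpleCookie:
    """Rebuild a cookie jar from the line-based text format."""
    jar = SimpleCookie()
    prefix = "Set-Cookie:"
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(prefix):
            line = line[len(prefix):].strip()
        # SimpleCookie.load() parses "name=value; attr=..." strings
        jar.load(line)
    return jar
```

Unlike a pickled file, the result can be inspected and edited with any text editor.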
f40820c41f Warn if using concurrent tasks with kit-ilias-web 2021-05-31 20:18:20 +00:00
1fba96abcb Fix exercise date parsing for non-group submissions
ILIAS apparently changes the order of the fields as it sees fit, so we
now try to parse *every* column, starting from the right, as a date.
The first column that parses successfully is then used.
2021-05-31 18:15:12 +02:00
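
The right-to-left strategy described in this commit can be sketched roughly as follows. The date format and the name find_date are assumptions made for illustration; the formats the crawler actually accepts are not stated here.

```python
from datetime import datetime
from typing import Optional, Sequence

# Assumed format for illustration only; the real format may differ.
DATE_FORMAT = "%d.%m.%Y %H:%M"


def find_date(columns: Sequence[str]) -> Optional[datetime]:
    """Return the first column, scanning from the right, that parses as a date."""
    for text in reversed(columns):
        try:
            return datetime.strptime(text.strip(), DATE_FORMAT)
        except ValueError:
            continue  # not a date, try the next column to the left
    return None
```

Because every column is tried, the lookup does not depend on the position of the date field within the row.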
7b062883f6 Use raw paths for --debug-transforms
Previously, the already-transformed paths were used, which meant that
--debug-transforms was cumbersome to use (as you had to remove all transforms
and crawl once before getting useful results).
2021-05-31 12:33:37 +02:00
64a2960751 Align paths in status messages and progress bars
Also print "Ignored" when paths are ignored due to transforms
2021-05-31 12:32:42 +02:00