
Refactor getPageHtml function to handle selector not found case, using body as fallback. Add support for downloading URLs from sitemap.xml. Update comments to note that sitemap is supported #26

Merged · 14 commits · Nov 26, 2023

Conversation

guillermoscript (Contributor)

This pull request includes several changes to improve the functionality of the code:

  1. Refactored the getPageHtml function to handle the case when the specified selector is not found on the page; the function now falls back to the body selector to retrieve the page content (sketched below).

  2. Added a try-catch block to handle the case when the specified selector is not found during the page crawl. If the selector is not found, a warning message is logged and the function falls back to using the body selector.

  3. Added support for downloading URLs from a sitemap.xml file. If the provided URL is a sitemap, all pages listed in the sitemap will be crawled.

  4. Updated comments in the code to indicate that sitemap support has been added.

These changes improve the robustness and flexibility of the code, allowing it to handle cases where the specified selector is not found and enabling the crawling of pages listed in a sitemap.

Fixes #16
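
A minimal sketch of the fallback described in items 1 and 2, assuming a Playwright Page; the actual getPageHtml in src/main.ts may be shaped differently:

```ts
import { Page } from "playwright";

// Hypothetical sketch: grab the content of the configured selector,
// falling back to <body> when the selector matches nothing.
async function getPageHtml(page: Page, selector: string): Promise<string> {
  return page.evaluate((sel) => {
    const el = document.querySelector(sel) ?? document.querySelector("body");
    return el?.textContent ?? "";
  }, selector);
}
```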

src/main.ts Outdated

```diff
@@ -1,5 +1,5 @@
 // For more information, see https://crawlee.dev/
 import { PlaywrightCrawler } from "crawlee";
```


Had to revert these changes; they were capturing URLs like "www.site.com/img.png" from links on pages and taking up an unnecessarily large amount of space in output.json. I'd suggest keeping this feature, but adding an option in config.ts to enable or disable it, or enabling it with a URL blacklist option.

guillermoscript (Contributor Author):

Okay, nice catch; let me see what I can do about the image issue.

guillermoscript (Contributor Author), Nov 19, 2023:

Following your suggestion, I added an extension blacklist in config.ts; if the route matches a blocked extension, the request is aborted.
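
A minimal sketch of that idea, assuming a hypothetical blockedExtensions list (the PR's actual option name may differ), using Playwright's page.route to abort matching requests:

```ts
import { Page } from "playwright";

// Hypothetical list of extensions to block; not necessarily the PR's exact config shape.
const blockedExtensions = ["png", "jpg", "jpeg", "gif", "svg", "webp"];

// Abort any request whose URL ends with one of the blocked extensions.
async function blockResources(page: Page): Promise<void> {
  await page.route(`**/*.{${blockedExtensions.join(",")}}`, (route) =>
    route.abort(),
  );
}
```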

src/main.ts Outdated
Comment on lines 42 to 52

```ts
try {
  await page.waitForSelector(config.selector, {
    timeout: config.waitForSelectorTimeout ?? 1000,
  });
} catch (e) {
  // If the selector is not found, let the user know
  log.warning(
    `Selector "${config.selector}" not found on ${request.loadedUrl}, Falling back to "body"`,
  );
  // using body as a fallback
  await page.waitForSelector("body", {
    timeout: config.waitForSelectorTimeout ?? 1000,
  });
}
```


Suggestion: Refactor the code to avoid code duplication when waiting for selectors. You can create a helper function that accepts a selector and timeout as parameters and use it in both cases.

Suggested change

```diff
-try {
-  await page.waitForSelector(config.selector, {
-    timeout: config.waitForSelectorTimeout ?? 1000,
-  });
-} catch (e) {
-  // If the selector is not found, let the user know
-  log.warning(`Selector "${config.selector}" not found on ${request.loadedUrl}, Falling back to "body"`);
-  // using body as a fallback
-  await page.waitForSelector("body", {
-    timeout: config.waitForSelectorTimeout ?? 1000,
-  });
+async function waitForSelectorOrFallback(page: Page, selector: string, fallbackSelector: string, timeout: number) {
+  try {
+    await page.waitForSelector(selector, { timeout });
+  } catch (e) {
+    log.warning(`Selector "${selector}" not found, Falling back to "${fallbackSelector}"`);
+    await page.waitForSelector(fallbackSelector, { timeout });
+  }
+}
+await waitForSelectorOrFallback(page, config.selector, "body", config.waitForSelectorTimeout ?? 1000);
```

guillermoscript (Contributor Author):

Thanks for the suggestions, I just added those!

src/main.ts Outdated

```diff
@@ -57,8 +73,20 @@ if (process.env.NO_CRAWL !== "true") {
   // headless: false,
 });

-// Add first URL to the queue and start the crawl.
-await crawler.run([config.url]);
+const isUrlASitemap = config.url.endsWith("sitemap.xml");
```


Suggestion: Refactor the code to avoid hardcoding the "sitemap.xml" string. You can create a constant for it and use it in the condition.

Suggested change

```diff
-const isUrlASitemap = config.url.endsWith("sitemap.xml");
+const SITEMAP_SUFFIX = "sitemap.xml";
+const isUrlASitemap = config.url.endsWith(SITEMAP_SUFFIX);
```
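
For context, a minimal sketch of how the sitemap branch might look, assuming Crawlee's downloadListOfUrls utility (the PR's exact implementation may differ):

```ts
import { downloadListOfUrls } from "crawlee";

const SITEMAP_SUFFIX = "sitemap.xml";
const isUrlASitemap = config.url.endsWith(SITEMAP_SUFFIX);

if (isUrlASitemap) {
  // Fetch every URL listed in the sitemap and enqueue them all before crawling.
  const listOfUrls = await downloadListOfUrls({ url: config.url });
  await crawler.addRequests(listOfUrls);
  await crawler.run();
} else {
  // Add the first URL to the queue and start the crawl.
  await crawler.run([config.url]);
}
```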

README.md Outdated

````diff
@@ -58,7 +58,7 @@ See the top of the file for the type definition for what you can configure:

 ```ts
 type Config = {
-  /** URL to start the crawl */
+  /** URL to start the crawl, if sitemap is providedm then it will be used instead and download all pages in the sitemap */
````


Suggestion: Fix the typo in the comment. Replace "providedm" with "provided".

Suggested change

```diff
-/** URL to start the crawl, if sitemap is providedm then it will be used instead and download all pages in the sitemap */
+/** URL to start the crawl, if sitemap is provided then it will be used instead and download all pages in the sitemap */
```

steve8708 (Contributor) left a comment:

Very cool @guillermoscript! We just have a merge conflict, and once it's resolved we can get this in.

core.ts, that way users can download a list of URLs from the sitemap.xml; also added an abort if the crawler finds a resource that is blocked.
guillermoscript (Contributor Author):

> Very cool @guillermoscript! We just have a merge conflict, and once it's resolved we can get this in.

Thanks! I just updated the code, basically adding the sitemap support to this new version plus the blocked-resource list prop, so users can skip images for example. If you want to test those, I'd recommend using a config along these lines:
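
A hypothetical config sketch; the option name resourceExclusions and the values are assumptions based on the description above, not necessarily the PR's exact API:

```ts
// Hypothetical config.ts sketch; option names and URL are illustrative.
export const config = {
  url: "https://example.com/sitemap.xml", // a sitemap URL crawls every page it lists
  selector: ".docs-content",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
  // Skip heavy resources such as images during the crawl.
  resourceExclusions: ["png", "jpg", "jpeg", "gif", "svg", "webp"],
};
```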

Let me know if any other change is required :D

steve8708 (Contributor):

Looks great, just a couple of new merge conflicts and then we're good to go.

guillermoscript (Contributor Author):

> Looks great, just a couple of new merge conflicts and then we're good to go.

Conflict resolved 👍

@steve8708 steve8708 merged commit 10a71ed into BuilderIO:main Nov 26, 2023
1 check passed

🎉 This PR is included in version 1.0.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

hirsaeki pushed a commit to hirsaeki/gpt-crawler-y-upstream that referenced this pull request Mar 27, 2024
Refactor getPageHtml function to handle selector not found case, using body as fallback. Add support for downloading URLs from sitemap.xml. Update comments to note that sitemap is supported
Development

Successfully merging this pull request may close these issues.

Any way to use a sitemap.xml for the crawler?