
Refactor getPageHtml function to handle selector not found case, using body as fallback. Add support for downloading URLs from sitemap.xml. Update comments to note that sitemap is supported #26

Merged · 14 commits · Nov 26, 2023

Conversation

guillermoscript (Contributor)

This pull request includes several changes to improve the functionality of the code:

  1. Refactored the getPageHtml function to handle the case when the specified selector is not found on the page; the function now falls back to the body selector to retrieve the page content (sketched below).

  2. Added a try-catch block to handle the case when the specified selector is not found during the page crawl. If the selector is not found, a warning message is logged and the function falls back to using the body selector.

  3. Added support for downloading URLs from a sitemap.xml file. If the provided URL is a sitemap, all pages listed in the sitemap will be crawled.

  4. Updated comments in the code to indicate that sitemap support has been added.

These changes improve the robustness and flexibility of the code, allowing it to handle cases where the specified selector is not found and enabling the crawling of pages listed in a sitemap.

Fixes #16
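
A minimal sketch of the fallback described in items 1 and 2, assuming a Playwright Page; the actual getPageHtml in src/main.ts may be shaped differently:

```ts
import { Page } from "playwright";

// Hypothetical sketch: grab the content of the configured selector,
// falling back to <body> when the selector matches nothing.
async function getPageHtml(page: Page, selector: string): Promise<string> {
  return page.evaluate((sel) => {
    const el = document.querySelector(sel) ?? document.querySelector("body");
    return el?.textContent ?? "";
  }, selector);
}
```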

src/main.ts Outdated

```diff
@@ -1,5 +1,5 @@
 // For more information, see https://crawlee.dev/
 import { PlaywrightCrawler } from "crawlee";
```


Had to revert these changes; they were capturing URLs like "www.site.com/img.png" from links on pages and taking up an unnecessarily large amount of space in output.json. I'd suggest keeping this feature, but adding an option in config.ts to enable or disable it, or enabling it with a URL blacklist option.

guillermoscript (Contributor Author):

Okay, nice catch; let me see what I can do about the image issue.

guillermoscript (Contributor Author), Nov 19, 2023:

Following your suggestion, I added an extension blacklist in config.ts; if the route matches a blocked extension, the request is aborted.
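
A minimal sketch of that idea, assuming a hypothetical blockedExtensions list (the PR's actual option name may differ), using Playwright's page.route to abort matching requests:

```ts
import { Page } from "playwright";

// Hypothetical list of extensions to block; not necessarily the PR's exact config shape.
const blockedExtensions = ["png", "jpg", "jpeg", "gif", "svg", "webp"];

// Abort any request whose URL ends with one of the blocked extensions.
async function blockResources(page: Page): Promise<void> {
  await page.route(`**/*.{${blockedExtensions.join(",")}}`, (route) =>
    route.abort(),
  );
}
```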

src/main.ts Outdated
Comment on lines 42 to 52

```ts
try {
  await page.waitForSelector(config.selector, {
    timeout: config.waitForSelectorTimeout ?? 1000,
  });
} catch (e) {
  // If the selector is not found, let the user know
  log.warning(
    `Selector "${config.selector}" not found on ${request.loadedUrl}, Falling back to "body"`,
  );
  // using body as a fallback
  await page.waitForSelector("body", {
    timeout: config.waitForSelectorTimeout ?? 1000,
  });
}
```


Suggestion: Refactor the code to avoid code duplication when waiting for selectors. You can create a helper function that accepts a selector and timeout as parameters and use it in both cases.

Suggested change

```diff
-try {
-  await page.waitForSelector(config.selector, {
-    timeout: config.waitForSelectorTimeout ?? 1000,
-  });
-} catch (e) {
-  // If the selector is not found, let the user know
-  log.warning(`Selector "${config.selector}" not found on ${request.loadedUrl}, Falling back to "body"`);
-  // using body as a fallback
-  await page.waitForSelector("body", {
-    timeout: config.waitForSelectorTimeout ?? 1000,
-  });
+async function waitForSelectorOrFallback(page: Page, selector: string, fallbackSelector: string, timeout: number) {
+  try {
+    await page.waitForSelector(selector, { timeout });
+  } catch (e) {
+    log.warning(`Selector "${selector}" not found, Falling back to "${fallbackSelector}"`);
+    await page.waitForSelector(fallbackSelector, { timeout });
+  }
+}
+await waitForSelectorOrFallback(page, config.selector, "body", config.waitForSelectorTimeout ?? 1000);
```

guillermoscript (Contributor Author):

Thanks for the suggestions, I just added those!

src/main.ts Outdated

```diff
@@ -57,8 +73,20 @@ if (process.env.NO_CRAWL !== "true") {
   // headless: false,
 });

-// Add first URL to the queue and start the crawl.
-await crawler.run([config.url]);
+const isUrlASitemap = config.url.endsWith("sitemap.xml");
```


Suggestion: Refactor the code to avoid hardcoding the "sitemap.xml" string. You can create a constant for it and use it in the condition.

Suggested change

```diff
-const isUrlASitemap = config.url.endsWith("sitemap.xml");
+const SITEMAP_SUFFIX = "sitemap.xml";
+const isUrlASitemap = config.url.endsWith(SITEMAP_SUFFIX);
```
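
For context, a minimal sketch of how the sitemap branch might look, assuming Crawlee's downloadListOfUrls utility (the PR's exact implementation may differ):

```ts
import { downloadListOfUrls } from "crawlee";

const SITEMAP_SUFFIX = "sitemap.xml";
const isUrlASitemap = config.url.endsWith(SITEMAP_SUFFIX);

if (isUrlASitemap) {
  // Fetch every URL listed in the sitemap and enqueue them all before crawling.
  const listOfUrls = await downloadListOfUrls({ url: config.url });
  await crawler.addRequests(listOfUrls);
  await crawler.run();
} else {
  // Add the first URL to the queue and start the crawl.
  await crawler.run([config.url]);
}
```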

README.md Outdated

````diff
@@ -58,7 +58,7 @@ See the top of the file for the type definition for what you can configure:

 ```ts
 type Config = {
-  /** URL to start the crawl */
+  /** URL to start the crawl, if sitemap is providedm then it will be used instead and download all pages in the sitemap */
````


Suggestion: Fix the typo in the comment. Replace "providedm" with "provided".

Suggested change

```diff
-/** URL to start the crawl, if sitemap is providedm then it will be used instead and download all pages in the sitemap */
+/** URL to start the crawl, if sitemap is provided then it will be used instead and download all pages in the sitemap */
```

steve8708 (Contributor) left a comment:

Very cool @guillermoscript! We just have a merge conflict, and once it's resolved we can get this in.

core.ts, that way users can download a list of URLs from the sitemap.xml; also added an abort if the crawler finds a resource that is blocked.
guillermoscript (Contributor Author):

> Very cool @guillermoscript! We just have a merge conflict, and once it's resolved we can get this in.

Thanks! I just updated the code, basically adding the sitemap support to this new version plus the blocked-resource list prop, so users can skip images for example. If you want to test those, I'd recommend using a config along these lines:
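
A hypothetical config sketch; the option name resourceExclusions and the values are assumptions based on the description above, not necessarily the PR's exact API:

```ts
// Hypothetical config.ts sketch; option names and URL are illustrative.
export const config = {
  url: "https://example.com/sitemap.xml", // a sitemap URL crawls every page it lists
  selector: ".docs-content",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
  // Skip heavy resources such as images during the crawl.
  resourceExclusions: ["png", "jpg", "jpeg", "gif", "svg", "webp"],
};
```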

Let me know if any other change is required :D

steve8708 (Contributor):

Looks great, just a couple of new merge conflicts and then we're good to go.

guillermoscript (Contributor Author):

> Looks great, just a couple of new merge conflicts and then we're good to go.

Conflict resolved 👍

@steve8708 steve8708 merged commit 10a71ed into BuilderIO:main Nov 26, 2023
1 check passed

🎉 This PR is included in version 1.0.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

hirsaeki pushed a commit to hirsaeki/gpt-crawler-y-upstream that referenced this pull request Mar 27, 2024
Refactor getPageHtml function to handle selector not found case, using body as fallback. Add support for downloading URLs from sitemap.xml. Update comments to note that sitemap is supported
Development

Successfully merging this pull request may close these issues.

Any way to use a sitemap.xml for the crawler?