Refactor getPageHtml function to handle selector not found case, using body as fallback. Add support for downloading URLs from sitemap.xml. Update comments to let users know that sitemap is supported #26
Conversation
src/main.ts (outdated)

@@ -1,5 +1,5 @@
// For more information, see https://crawlee.dev/
import { PlaywrightCrawler } from "crawlee";
Had to revert the changes; it was capturing 'www.site.com/img.png' from links on pages and taking up an unnecessarily large amount of space in output.json. I'd suggest keeping this feature, but adding an option in config.ts to enable or disable it, or to enable it with a URL blacklist option.
okkk, nice gotcha, let me see what I can do about the image issue
Following your suggestion, I added an extension blacklist in config.ts; if the route matches a blocked extension, the request is aborted.
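For readers who want a concrete picture, here is a minimal sketch of how such an extension blacklist could be wired into a Crawlee PlaywrightCrawler through a pre-navigation hook; the `resourceExclusions` field name and the exact config.ts layout are assumptions for illustration, not necessarily the PR's code.

```ts
// Sketch only: `resourceExclusions` is an assumed config field, not necessarily the PR's exact shape.
import { PlaywrightCrawler } from "crawlee";

const resourceExclusions = ["png", "jpg", "jpeg", "gif", "svg", "css", "woff", "woff2"];

const crawler = new PlaywrightCrawler({
  preNavigationHooks: [
    async ({ page }) => {
      // Abort any request whose URL ends in a blocked extension before it is downloaded.
      await page.route(`**/*.{${resourceExclusions.join(",")}}`, (route) => route.abort());
    },
  ],
  async requestHandler({ request, log }) {
    log.info(`Crawling ${request.loadedUrl ?? request.url}`);
  },
});
```

Blocking at the routing layer keeps the image bytes out of the crawl entirely, so they never end up inflating output.json.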
src/main.ts (outdated)

try {
  await page.waitForSelector(config.selector, {
    timeout: config.waitForSelectorTimeout ?? 1000,
  });
} catch (e) {
  // If the selector is not found, let the user know
  log.warning(`Selector "${config.selector}" not found on ${request.loadedUrl}, Falling back to "body"`);
  // using body as a fallback
  await page.waitForSelector("body", {
    timeout: config.waitForSelectorTimeout ?? 1000,
  });
Suggestion: Refactor the code to avoid code duplication when waiting for selectors. You can create a helper function that accepts a selector and timeout as parameters and use it in both cases.
Original code:

try {
  await page.waitForSelector(config.selector, {
    timeout: config.waitForSelectorTimeout ?? 1000,
  });
} catch (e) {
  // If the selector is not found, let the user know
  log.warning(`Selector "${config.selector}" not found on ${request.loadedUrl}, Falling back to "body"`);
  // using body as a fallback
  await page.waitForSelector("body", {
    timeout: config.waitForSelectorTimeout ?? 1000,
  });

Suggested replacement:

async function waitForSelectorOrFallback(page: Page, selector: string, fallbackSelector: string, timeout: number) {
  try {
    await page.waitForSelector(selector, { timeout });
  } catch (e) {
    log.warning(`Selector "${selector}" not found, Falling back to "${fallbackSelector}"`);
    await page.waitForSelector(fallbackSelector, { timeout });
  }
}

await waitForSelectorOrFallback(page, config.selector, "body", config.waitForSelectorTimeout ?? 1000);
thanks for the suggestions, I just added those!
src/main.ts (outdated)

@@ -57,8 +73,20 @@ if (process.env.NO_CRAWL !== "true") {
  // headless: false,
});

// Add first URL to the queue and start the crawl.
await crawler.run([config.url]);
const isUrlASitemap = config.url.endsWith("sitemap.xml");
Suggestion: Refactor the code to avoid hardcoding the "sitemap.xml" string. You can create a constant for it and use it in the condition.
const isUrlASitemap = config.url.endsWith("sitemap.xml");

Suggested replacement:

const SITEMAP_SUFFIX = "sitemap.xml";
const isUrlASitemap = config.url.endsWith(SITEMAP_SUFFIX);
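For context on how this flag is typically used, here is a hedged sketch of the sitemap branch, assuming Crawlee's `downloadListOfUrls` helper is what expands the sitemap into individual URLs; `config` and `crawler` come from the surrounding main.ts, and the PR's exact wiring may differ.

```ts
// Sketch of the sitemap handling discussed in this thread; assumes Crawlee's downloadListOfUrls helper.
import { downloadListOfUrls } from "crawlee";

const SITEMAP_SUFFIX = "sitemap.xml";
const isUrlASitemap = config.url.endsWith(SITEMAP_SUFFIX);

if (isUrlASitemap) {
  // Expand the sitemap into individual page URLs and crawl all of them.
  const listOfUrls = await downloadListOfUrls({ url: config.url });
  await crawler.run(listOfUrls);
} else {
  // Add the first URL to the queue and start the crawl.
  await crawler.run([config.url]);
}
```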
README.md (outdated)

@@ -58,7 +58,7 @@ See the top of the file for the type definition for what you can configure:

```ts
type Config = {
  /** URL to start the crawl */
  /** URL to start the crawl, if sitemap is providedm then it will be used instead and download all pages in the sitemap */
Suggestion: Fix the typo in the comment. Replace "providedm" with "provided".
/** URL to start the crawl, if sitemap is providedm then it will be used instead and download all pages in the sitemap */

Suggested replacement:

/** URL to start the crawl, if sitemap is provided then it will be used instead and download all pages in the sitemap */
Very cool @guillermoscript! We just have a merge conflict and once resolved we can get this in
thanks! I just updated the code, basically just adding the sitemap support to this new version and the block resource list prop, so users can skip images for example, if you want to test those I would recommend you to use core.ts, that way users can download a list of urls from the sitemap.xml, also added an abort if the crawler finds a resource that is blocked.
let me know if any other change is required :D
looks great, just a couple new merge conflicts then we're good to go
conflict resolved 👍
🎉 This PR is included in version 1.0.0 🎉 The release is available on:
Your semantic-release bot 📦🚀
This pull request includes several changes to improve the functionality of the code:
- Refactored the `getPageHtml` function to handle the case when the specified selector is not found on the page. In this case, the function now falls back to using the `body` selector to retrieve the page content (see the sketch below).
- Added a try-catch block to handle the case when the specified selector is not found during the page crawl. If the selector is not found, a warning message is logged and the function falls back to using the `body` selector.
- Added support for downloading URLs from a sitemap.xml file. If the provided URL is a sitemap, all pages listed in the sitemap will be crawled.
- Updated comments in the code to indicate that sitemap support has been added.
These changes improve the robustness and flexibility of the code, allowing it to handle cases where the specified selector is not found and enabling the crawling of pages listed in a sitemap.
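To make the selector fallback concrete, here is a hedged sketch of what a `getPageHtml` with a `body` fallback might look like; the actual function in the PR may differ in signature and detail.

```ts
// Sketch only: illustrates the "fall back to body" behaviour described above, not the PR's exact code.
import { Page } from "playwright";

export function getPageHtml(page: Page, selector: string) {
  return page.evaluate((sel) => {
    // Prefer the configured selector; fall back to <body> when it is missing.
    const el = (document.querySelector(sel) as HTMLElement | null) ?? document.body;
    return el.innerText;
  }, selector);
}
```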
Fixes #16