
Web Scraping

Learn about the web scraping feature of the Knowledge Library

Written by Bryce DeCora
Updated yesterday

Overview

Scraping a website is the easiest way to load a bunch of information into an agent for it to reference. However, it's also the easiest way to feed too much information into your agent. Websites typically contain a lot of text, and despite what you may have heard, more doesn't always mean better when it comes to knowledge bases for agents. When possible, use document uploads or text inputs instead of web scraping.

Websites are re-scraped every 24 hours, so any changes made to your website will be reflected in your Agent within a day.

Initiate Web Scrape - URL

Before putting a URL in, decide what information from your website you want to be available in your Knowledge Document. I know your gut reaction is going to be "All of it!" but this is likely a bad idea.

Your webpage has a ton of information, and a lot of it just isn't useful in a document designed for product/service descriptions and Frequently Asked Questions. You probably have funnels on your page and lots of buttons that say "Book Now" or "Schedule a Call." Those words can add information to your prompt that causes serious issues, like your bookings failing.

You can scrape specific pages by adding their direct URLs into this field, letting you cherry-pick the information you want available.

Add your website's Product or FAQ URL, including the https://, into this field.

High, Medium, and Low Detail

Select how much "detail" you would like to have in the scraped site. Detail in this case does not mean "accuracy of the information"; it is more like "scope." By selecting one of these options, you are affecting the breadth and depth parameters, i.e., how many pages you want the scrape to cover. See more below in the Advanced section. High, Medium, and Low are just presets for breadth and depth.
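
To make this concrete, here is a minimal sketch of how the presets might map to breadth and depth pairs. The actual numbers behind each preset are not documented here, so the values below are hypothetical (aside from High, which presumably uses the documented maximums of 10 and 3):

```python
# Hypothetical preset values, for illustration only: the article does not
# state which breadth/depth numbers each preset actually uses.
DETAIL_PRESETS = {
    "low":    {"breadth": 3,  "depth": 1},  # assumed
    "medium": {"breadth": 5,  "depth": 2},  # assumed
    "high":   {"breadth": 10, "depth": 3},  # assumed: the documented maximums
}
```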


Connect to your Source

Just like with other documents, make sure that after you initiate the web scrape, you attach it to the correct source. The Knowledge Doc must be attached for a source to be able to access the information within it.



Advanced

Breadth

Breadth is a number that represents how many links you would like the scrape to follow from each page. The first links that your webpage offers, up to this number, will be included. The max breadth that we allow is 10.

Depth

Depth represents how many levels deep you would like the scrape to go. After opening the pages from Breadth, it repeats the process at each level: from every page it opens, it follows another breadth's worth of links, until it reaches the depth you set.

This gets exponentially huge, literally! The number of pages at each level is Breadth raised to the power of that level, so the total is Breadth^0 + Breadth^1 + ... + Breadth^Depth.
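
As a quick sanity check, here is a small Python snippet (an illustration, not the product's actual code) that computes the total page count for a given breadth and depth using that formula:

```python
def total_pages(breadth: int, depth: int) -> int:
    """Total pages scraped: breadth^0 + breadth^1 + ... + breadth^depth.

    Level 0 is the page you start on; each deeper level opens `breadth`
    links from every page at the previous level.
    """
    return sum(breadth ** level for level in range(depth + 1))

print(total_pages(10, 3))  # 1 + 10 + 100 + 1000 = 1111 (the maximum scrape)
print(total_pages(0, 0))   # 1: just the page you are on
```

The two example calls match the Limits section below: 0 Breadth and 0 Depth scrapes a single page, and 10 Breadth with 3 Depth tops out at 1,111 pages.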

Deduplication

We also "deduplicate" when we are pulling links for breadth, which means that if we have already scraped a unique page, we will skip that link when determining which 10 links we will scrape, so each layer offers unique pages.
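
Conceptually, this behaves like a breadth-first crawl with a "visited" set. Here is a rough sketch of that logic (my illustration, not the actual scraper; `get_links` is a stand-in for fetching a page and extracting its links in the order the page offers them):

```python
from collections import deque

def crawl(start_url, get_links, breadth=10, depth=3):
    """Breadth-first crawl sketch with deduplication."""
    visited = {start_url}
    queue = deque([(start_url, 0)])  # (url, level)
    while queue:
        url, level = queue.popleft()
        # ... scrape `url` here ...
        if level == depth:
            continue  # don't follow links past the depth limit
        picked = 0
        for link in get_links(url):
            if picked == breadth:
                break  # only the first `breadth` unique links count
            if link in visited:
                continue  # deduplicate: skip pages already scraped
            visited.add(link)
            queue.append((link, level + 1))
            picked += 1
    return visited
```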

Limits

The smallest scrape you can do is 0 Breadth and 0 Depth, meaning we just scrape the page you are on. 0^0 = 1, because anything to the power of 0 equals 1.

The largest scrape that we allow is 10 Breadth and 3 Depth, which will scrape the page it is on, then open 10 links from that page and scrape them, then open 10 links from each of those pages and scrape them, and then open 10 links from each of THOSE pages and scrape them. 10^0 + 10^1 + 10^2 + 10^3 = 1,111 web pages.

Document Uploads/Limits

We have a hard limit of 6MB per document uploaded. This is a system limitation and is unlikely to ever get higher. We will also stop a scrape at the point that the scraped information hits 6MB.
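
In effect, the scraper accumulates page text until the 6MB cap would be exceeded and then stops. A rough sketch of that cutoff (an illustration, assuming 6MB means 6 × 1024 × 1024 bytes of UTF-8 text):

```python
MAX_DOC_BYTES = 6 * 1024 * 1024  # assumed interpretation of the 6MB hard limit

def accumulate(pages):
    """Append scraped page text until the 6MB document limit is hit."""
    chunks, total = [], 0
    for text in pages:  # `pages` yields the text of each scraped page
        size = len(text.encode("utf-8"))
        if total + size > MAX_DOC_BYTES:
            break  # stop the scrape once the document would exceed the limit
        chunks.append(text)
        total += size
    return "".join(chunks)
```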

Along with being a system limitation, this also prevents someone from running up your (or their) storage costs by, for example, trying to upload a video into the Knowledge Library.


Summary

Scraping a website is very easy, and the scraper will automatically re-scrape the website each day to make sure the data is up to date, but that is really where the benefits end. Using Upload or Create Text File to add to your knowledge documents is much more controllable and will give better results than using the web scraper.
