How to Check for Broken Links (404 Errors) in Python

Links are one of the critical SEO factors for a website. When creating or redesigning pages, we cannot skip the website audit, especially finding and tracking broken links regularly. Although there are many online tools for this, it is still worth building such a tool ourselves, if only for fun. To keep the programming simple, I decided to use Python: with its powerful libraries, it is fairly easy to crawl web pages, parse HTML elements, and check server response codes.

How to Check a Website for 404 Errors

To be SEO-friendly, a website should provide a sitemap.xml file, which helps search engines crawl all of its valid URLs. We can therefore implement the crawler in three steps:

  1. Read sitemap.xml to extract all web page links.
  2. Parse the HTML of every page to collect internal and outbound links from the href attribute.
  3. Request every collected link and check the response code, as in the snippet below.
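
Step 3 boils down to reading the HTTP status code of a request. With Python's built-in urllib, a single check looks roughly like this (the URL is just a placeholder for illustration):

from urllib.request import urlopen
from urllib.error import HTTPError, URLError

url = "https://www.dynamsoft.com/no-such-page"  # placeholder URL for illustration

try:
    print(urlopen(url, timeout=3).getcode(), ':', url)
except HTTPError as e:
    print(e.code, ':', url)   # 404 means the link is broken
except URLError as e:
    print('connection failed:', e.reason)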

Installation

Beautiful Soup is a Python library for parsing HTML and XML documents. Install it with the following command:

pip install beautifulsoup4
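
With the library installed, the three steps above can be sketched end to end in a few lines. This is a deliberately simplified, single-threaded version (minimal error handling, no graceful interruption) using Python 3's urllib and the same sitemap URL used later in this article; the sections below build the more robust, interruptible crawler.

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

SITEMAP = "http://kb.dynamsoft.com/sitemap.xml"

# Step 1: read sitemap.xml and extract all page URLs from the <loc> tags.
soup = BeautifulSoup(urlopen(SITEMAP, timeout=3).read(), "html.parser")
pages = [loc.text for loc in soup.find_all("loc")]

for page in pages:
    # Step 2: parse the page and collect absolute links from href attributes.
    html = BeautifulSoup(urlopen(page, timeout=3).read(), "html.parser")
    links = [a.get('href') for a in html.select('a')
             if a.get('href') and a.get('href').startswith('http')]

    # Step 3: request every link and report its response code.
    for link in links:
        try:
            code = urlopen(link, timeout=3).getcode()
        except HTTPError as e:
            code = e.code
        except URLError:
            code = 0
        print(code, ':', link)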

Implementing a Web Page Crawler in Python

First, bind a keyboard interrupt handler (Ctrl+C) so the program can be stopped at any time.

import signal
import threading

# Event checked by the crawling loops to detect a requested shutdown.
shutdown_event = threading.Event()

def ctrl_c(signum, frame):
    # Signal all loops to stop, then exit.
    shutdown_event.set()
    raise SystemExit('\nCancelling...')

signal.signal(signal.SIGINT, ctrl_c)
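
The fragments below also use a few shared imports, a build_request helper, and a GAME_OVER sentinel that come from the full project rather than from this article. A minimal sketch of those pieces, under the assumption that build_request simply wraps urllib.request.Request with a browser-like User-Agent header, might look like this:

from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

# Sentinel returned by the crawler methods when the user interrupts the run.
# The actual value used by the project is not shown here; True is assumed.
GAME_OVER = True

def build_request(url):
    # Assumed implementation: attach a browser-like User-Agent, since some
    # servers reject requests that use Python's default agent string.
    return Request(url, headers={'User-Agent': 'Mozilla/5.0'})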

Next, read sitemap.xml with Beautiful Soup. Each <url> entry in a sitemap contains a <loc> child holding a page address, so we collect the text of every loc tag into a list of pages.

pages = []
try:
    request = build_request("http://kb.dynamsoft.com/sitemap.xml")
    f = urlopen(request, timeout=3)
    xml = f.read()
    soup = BeautifulSoup(xml, "html.parser")
    urlTags = soup.find_all("url")

    print("The number of url tags in sitemap:", len(urlTags))

    # Every <url> entry holds the page address in its <loc> child.
    for sitemap in urlTags:
        link = sitemap.findNext("loc").text
        pages.append(link)

    f.close()
except (HTTPError, URLError) as e:
    print(e)

return pages

Then parse the HTML content of every page and collect all absolute links (internal and outbound) from the href attributes. The following methods belong to the crawler class, which is why they take self.

    def queryLinks(self, result):
        links = []
        content = b''.join(result)            # result holds byte chunks from readHref
        soup = BeautifulSoup(content, "html.parser")
        elements = soup.select('a')

        for element in elements:
            # Stop early if Ctrl+C was pressed.
            if shutdown_event.is_set():
                return GAME_OVER

            try:
                link = element.get('href')
                # Keep only absolute links (internal and outbound).
                if link.startswith('http'):
                    links.append(link)
            except AttributeError:
                # The anchor has no href attribute; skip it.
                print('href error!!!')
                continue

        return links

    def readHref(self, url):
        result = []
        try:
            request = build_request(url)
            f = urlopen(request, timeout=3)
            # Read the response in 10 KB chunks so Ctrl+C can interrupt quickly.
            while not shutdown_event.is_set():
                tmp = f.read(10240)
                if len(tmp) == 0:
                    break
                else:
                    result.append(tmp)

            f.close()
        except (HTTPError, URLError) as e:
            print(e)

        if shutdown_event.is_set():
            return GAME_OVER

        return self.queryLinks(result)

Finally, send a request to every collected link and check the response code. Links that return 404 are written to a report file.

    def crawlLinks(self, links, file=None):
        for link in links:
            if shutdown_event.is_set():
                return GAME_OVER

            status_code = 0

            try:
                request = build_request(link)
                f = urlopen(request)
                status_code = f.getcode()
                f.close()
            except (HTTPError, URLError) as e:
                # HTTPError carries an HTTP status code; other URL errors do not.
                status_code = getattr(e, 'code', 0)

            # Record broken links in the report file.
            if status_code == 404:
                if file is not None:
                    file.write(link + '\n')

            print(status_code, ':', link)

        return GAME_OVER
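
Assuming the methods above belong to one crawler class, here called LinkChecker purely for illustration, and that the sitemap-reading snippet is wrapped in a read_sitemap_pages() function, the pieces can be wired together roughly like this:

# Hypothetical wiring of the snippets above; the class and function names are
# illustrative, not taken from the original project.
crawler = LinkChecker()
pages = read_sitemap_pages()                 # the sitemap-reading snippet as a function

with open('404.txt', 'w') as report:         # broken links are written here
    for page in pages:
        links = crawler.readHref(page)       # collect absolute links on the page
        if shutdown_event.is_set() or links == GAME_OVER:
            break
        crawler.crawlLinks(links, report)    # check each link and record 404s

The complete, runnable script is available in the repository linked below.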

Source Code

https://github.com/yushulx/crawl-404