How to Check for Broken Links (404 Errors) in Python
Links are one of the critical SEO factors for a website. When creating or redesigning pages, we cannot skip the site audit, especially finding and tracking broken links on a regular basis. Although there are many online tools for this, it is still worth building such a tool ourselves for fun. To keep the programming simple, I decided to use Python: with its libraries it is fairly easy to crawl web pages, parse HTML elements, and check server response codes.
How to Check a Website for 404 Errors
To be SEO-friendly, a website should provide a sitemap.xml file, which helps search engines crawl all valid URLs. We can therefore implement the crawler in three steps (a minimal driver tying them together is sketched after the list):
- Read sitemap.xml to extract all web page links.
- Parse the HTML of every page to collect internal and outbound links from the href attribute.
- Request each link and check the response code.
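A minimal sketch of how the three steps fit together might look like the following. The LinkChecker class name and the broken_links.txt file name are assumptions made for illustration; readSitemap, readHref and crawlLinks are the methods implemented in the rest of this post.
# Minimal driver sketch, assuming the methods shown below (readSitemap,
# readHref, crawlLinks) live on a hypothetical LinkChecker class.
checker = LinkChecker()
pages = checker.readSitemap()                  # step 1: page URLs from sitemap.xml
with open('broken_links.txt', 'w') as report:  # hypothetical report file
    for page in pages:
        if shutdown_event.is_set():
            break
        links = checker.readHref(page)         # step 2: collect href links
        if links != GAME_OVER:
            checker.crawlLinks(links, report)  # step 3: check response codes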
Installation
Beautiful Soup is a Python library for parsing HTML and XML documents. Install it with the following command:
pip install beautifulsoup4
Implementing a Web Page Crawler in Python
First, bind a keyboard-interrupt handler so the program can be stopped at any time. A shared threading.Event flag lets the long-running crawl loops notice the interrupt and exit cleanly.
import signal
import threading

# Sentinel the crawl routines return when they stop; the exact value is arbitrary.
GAME_OVER = 'GAME OVER'
shutdown_event = threading.Event()

def ctrl_c(signum, frame):
    # Tell all running loops to stop, then exit.
    shutdown_event.set()
    raise SystemExit('\nCancelling...')

signal.signal(signal.SIGINT, ctrl_c)
Read sitemap.xml with Beautiful Soup.
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def readSitemap(self):
    # Return all page URLs listed in the sitemap.
    pages = []
    try:
        # build_request() is a helper defined elsewhere in the script.
        request = build_request("http://kb.dynamsoft.com/sitemap.xml")
        f = urlopen(request, timeout=3)
        xml = f.read()
        # html.parser ships with Python and is enough to pick out <url>/<loc> tags.
        soup = BeautifulSoup(xml, 'html.parser')
        urlTags = soup.find_all("url")
        print("The number of url tags in sitemap:", len(urlTags))
        for sitemap in urlTags:
            link = sitemap.findNext("loc").text
            pages.append(link)
        f.close()
    except (HTTPError, URLError) as e:
        print(e)
    return pages
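To make clear what the find_all("url") and findNext("loc") calls above are matching, here is a small self-contained sketch using a made-up two-entry sitemap:
from bs4 import BeautifulSoup

# A tiny, hypothetical sitemap fragment, just to illustrate the structure
# the code above expects.
xml = """<urlset>
  <url><loc>http://example.com/page-a</loc></url>
  <url><loc>http://example.com/page-b</loc></url>
</urlset>"""

soup = BeautifulSoup(xml, 'html.parser')
print([url.findNext('loc').text for url in soup.find_all('url')])
# -> ['http://example.com/page-a', 'http://example.com/page-b']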
Parse HTML content to collect all links.
def queryLinks(self, result):
    # Extract all absolute links from the downloaded HTML chunks.
    links = []
    content = b''.join(result)
    soup = BeautifulSoup(content, 'html.parser')
    elements = soup.select('a')
    for element in elements:
        if shutdown_event.is_set():
            return GAME_OVER
        try:
            link = element.get('href')
            if link and link.startswith('http'):
                links.append(link)
        except Exception:
            print('href error!!!')
            continue
    return links
def readHref(self, url):
    # Download the page in chunks, then hand the HTML to queryLinks.
    result = []
    try:
        request = build_request(url)
        f = urlopen(request, timeout=3)
        while not shutdown_event.is_set():
            tmp = f.read(10240)
            if len(tmp) == 0:
                break
            result.append(tmp)
        f.close()
    except (HTTPError, URLError) as e:
        print(e)
    if shutdown_event.is_set():
        return GAME_OVER
    return self.queryLinks(result)
Finally, send a request for each collected link and check the response code.
def crawlLinks(self, links, file=None):
    # Request each link; write any 404 link to the report file if one is given.
    for link in links:
        if shutdown_event.is_set():
            return GAME_OVER
        status_code = 0
        try:
            request = build_request(link)
            f = urlopen(request)
            status_code = f.getcode()
            f.close()
        except (HTTPError, URLError) as e:
            # URLError (e.g. a connection failure) carries no HTTP code.
            status_code = getattr(e, 'code', 0)
        if status_code == 404 and file is not None:
            file.write(link + '\n')
        print(status_code, ':', link)
    return GAME_OVER