This Python script retrieves all of the URLs from a website's sitemap and then uses the T5 text-to-text transformer model to summarize the content of every page listed in that sitemap.
Meta descriptions are an important element when constructing a webpage. Though Google does not include the meta description in its ranking algorithm, a good meta description can increase your CTR (click-through rate), which can in turn improve your ability to rank. Many websites lack meta descriptions and may therefore rank lower in Google's results than they otherwise would. Writing these descriptions is almost always done manually, so on sites with thousands of pages it can be incredibly costly, if not practically impossible. With an algorithm like this one, the task can be simplified greatly: thousands of meta descriptions can be created within a couple of hours on a powerful enough computer.
What I used:
In this Python script I used several tools for fetching the data from the websites and then analyzing it. I used the Python libraries urllib, bs4, difflib, csv, usp.tree, and transformers.
From usp.tree, transformers, and bs4 I used only specific functions and classes: sitemap_tree_for_homepage, AutoModelWithLMHead and AutoTokenizer, and BeautifulSoup, respectively.
T5 is the model I used for processing and summarizing the text. “The T5 model was presented in Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu” (huggingface.co/transformers/model_doc/t5.html). This transformer can be used for many purposes, such as answering questions, translating, and, most importantly for this project, summarizing. I chose it after searching through the available text transformers for ones that could be used for summarization: between Pipeline, BERT, and T5, T5 produced the most substantive and yet the briefest summaries of long texts from websites. The result might have been different if I had trained the models myself instead of using the pretrained models from the website linked above. That wasn't a viable option for this project, because I didn't have a large amount of summarization data available, due to the problem described earlier (meta descriptions take so much manual labour that most websites don't write them).
There is one drawback to using this model: when I worked with it, it was still a work in progress, and some features that could be used with other transformers did not seem to be available yet. Being able to run the code on the GPU instead of the CPU, for example, would shorten the processing time and allow larger amounts of data to be processed. Future versions of T5 will likely yield better results as well as better processing time for this implementation.
How I did it
Initially I just ran the T5 model on text that I entered manually, but using the URLs from a sitemap was a more appealing and commercially more useful implementation. The algorithm first accesses the sitemap from the homepage URL of the website to be summarized. It then loops through the pages in the sitemap, populating an array with the URLs of the individual pages. The library I use for this sometimes produces duplicate URLs, so I handle that with a very simple uniqueness check: a URL is only added if it is not already in the array.
After initializing the pretrained model and opening the CSV file that the data will be written into, the algorithm loops through all of the URLs in the array and extracts the text from each page. One issue arises here: web pages contain a lot of extraneous non-content text in the form of headers, footers, and so on. The easiest way to extract only the content of a page is to look at how it differs from a similar page. I do this with the difflib library, which lets you see the differences between the lines of two texts. This works because the extraneous text tends to sit in the same place on similar kinds of pages (a blog post is structured like another blog post, and a contacts page is structured like an about page).
This is why I compare the URL to be summarized with the URL that follows it: the library I used to get the URLs tends to group similar types of pages together. This creates a fence-post problem at the end of the array, because the last URL has no successor, so I compare the last URL with the one before it instead. Once the unique, content-based text of the page has been extracted, it is processed with the pretrained T5 transformer model set to summarization.
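The boilerplate-removal step can be illustrated in isolation. The following is a minimal sketch, not the script's exact code: the helper names `pick_reference_index` and `unique_lines` and the two sample pages are made up for illustration, and it assumes the page text has already been reduced to one chunk per line.

```python
import difflib


def pick_reference_index(i, n):
    """Compare page i with the next page; the last page uses the previous one."""
    return i - 1 if i == n - 1 else i + 1


def unique_lines(article, reference):
    """Return the lines that appear in `article` but not in `reference`."""
    diff = difflib.unified_diff(
        article.splitlines(), reference.splitlines(), lineterm="", n=0
    )
    kept = []
    for line in diff:
        # "-" marks a line found only in `article`; "---" is the diff's file header.
        if line.startswith("-") and not line.startswith("---"):
            kept.append(line[1:])
    return kept


# Two hypothetical pages that share a header, a menu, and a footer.
page_a = "My Blog\nHome | About\nHow I trained a summarizer\nCopyright 2021"
page_b = "My Blog\nHome | About\nNotes on sitemaps\nCopyright 2021"

print(unique_lines(page_a, page_b))  # ['How I trained a summarizer']
print(pick_reference_index(0, 3), pick_reference_index(2, 3))  # 1 1
```

Only the line that differs between the two pages survives, which is the behaviour the comparison relies on. If the two pages happen to share real content lines, those lines are dropped as well, so the approach works best when neighbouring pages share layout but not text.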
The raw output of the model consists of tokens instead of actual words, so the tokens need to be decoded into normal text before any further processing. Another issue with the T5 model is that it often produces duplicate sentences in its output. I solve this with a simple function that removes duplicate sentences from a string. Once all of these steps are complete, the summary of the text from a URL is written to the CSV file together with the URL it came from.
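To make the duplicate-sentence cleanup concrete, here is a small standalone sketch. It is a variant, not the script's own function: the name `dedupe_sentences` and the sample string are hypothetical, and unlike `str.capitalize()`, which lowercases the rest of the sentence, it only uppercases the first character so that names such as "CTR" keep their casing.

```python
def dedupe_sentences(text):
    """Drop repeated sentences, keeping the first occurrence of each."""
    seen = set()
    kept = []
    for sentence in text.split("."):
        sentence = sentence.strip()
        if not sentence:
            continue
        key = sentence.lower()  # compare case-insensitively
        if key in seen:
            continue
        seen.add(key)
        # Uppercase only the first character; leave the rest untouched.
        kept.append(sentence[0].upper() + sentence[1:] + ".")
    return " ".join(kept)


raw = (
    "google ranks pages by relevance. Google ranks pages by relevance. "
    "meta descriptions affect CTR."
)
print(dedupe_sentences(raw))
# Google ranks pages by relevance. Meta descriptions affect CTR.
```

Splitting on "." is a simplification: abbreviations and decimal numbers would be cut into separate "sentences", which is acceptable for short summaries but worth keeping in mind.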
import csv
import difflib
import urllib.request

from bs4 import BeautifulSoup
from transformers import AutoModelWithLMHead, AutoTokenizer
from usp.tree import sitemap_tree_for_homepage


# The T5 model used for this summarization sometimes produces duplicate
# sentences in the summary; this function removes the duplicate sentences.
def remove_duplicates(text):
    arr = []
    for sentence in text.split("."):
        sentence = sentence.strip()
        if sentence != "":
            sentence = sentence.capitalize()
            sentence = sentence + ". "
            if sentence not in arr:
                arr.append(sentence)
    return ''.join(arr)


# Download a page and return its visible text, one chunk per line.
def get_page_text(url):
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, features="lxml")
    # remove script and style elements, which contain no readable content
    for script in soup(["script", "style"]):
        script.extract()
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    # break lines that hold several phrases apart on double spaces
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return '\n'.join(chunk for chunk in chunks if chunk)


# populate an array with all of the URLs from the sitemap
urls = []
# the URL of the homepage of the website whose content is to be summarized
tree = sitemap_tree_for_homepage('http://dominikzeman.blogspot.com')
for page in tree.all_pages():
    if page.url not in urls:
        urls.append(page.url)

# initialize the T5 text processing model
# (newer transformers releases deprecate AutoModelWithLMHead in favour of AutoModelForSeq2SeqLM)
model = AutoModelWithLMHead.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

# keep the csv file open until the loop is finished
with open('url_and_output.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)

    # process all of the URLs produced by the sitemap
    for i in range(len(urls)):
        # get the text of the URL intended for summarization
        intended_url = urls[i]
        input_article = get_page_text(intended_url)

        # choose the reference URL used to eliminate headers, footers and other shared text
        if i == len(urls) - 1:
            # the last article is compared to the article before it, as no article comes after it
            reference_url = urls[i - 1]
        else:
            # every other URL is compared to the one after it, because sitemaps group similar URLs together
            reference_url = urls[i + 1]
        reference = get_page_text(reference_url)

        # split the two articles into individual lines
        input_article = input_article.splitlines()
        reference = reference.splitlines()

        # the diff prefixes each line with a symbol: "-" unique to input_article,
        # "+" unique to reference, and " " identical in both
        difference = difflib.unified_diff(input_article, reference)
        final_article = []
        for line in difference:
            # keep lines unique to input_article; skip the "---" file header line
            if line.startswith("-") and not line.startswith("---"):
                final_article.append(line[1:] + " ")
        final_article = ''.join(final_article)

        # the model that produces the final summarized text, "where the magic happens"
        inputs = tokenizer.encode("summarize: " + final_article, return_tensors="pt",
                                  max_length=512, truncation=True)
        outputs = model.generate(inputs, max_length=100, min_length=5, length_penalty=2.0,
                                 num_beams=4, early_stopping=True)
        # decode the first (and only) generated sequence from tokens back into text
        output = tokenizer.decode(outputs[0], skip_special_tokens=True)

        # use the remove_duplicates function to fix any possible model mess-ups
        output = remove_duplicates(output)

        # write the URL and its summarization to the csv file
        writer.writerow([intended_url, output])

        # print how many pages have been summarized out of the total
        print(str(i + 1) + "/" + str(len(urls)))