For the finale, we'll reset our lists and loop through every available page to capture the complete dataset. The URL structure follows a predictable pattern: the first page is books.toscrape.com/catalogue/page-1.html, and subsequent pages are page-2.html, page-3.html, and so forth. The plan is to iterate over pages 1 to 50 (the pagination maximum we discovered earlier), create a Beautiful Soup object for each page, and add its titles and prices to our consolidated lists.
Let's implement this. We'll use a for loop: `for page_num in range(1, pagination_max + 1)`, where pagination_max holds the value of 50 we determined earlier. Driving the loop from pagination_max rather than a literal 50 means only one value has to change if the page count turns out to be different.
Notice the critical `+ 1` addition—this compensates for Python's range function being exclusive at the upper bound. When we specify range(1, 51), we get numbers 1 through 50, exactly what we need. For each iteration, we'll dynamically construct the target URL using Python's f-string formatting, allowing us to inject the current page number into our base URL template.
The URL construction follows this pattern: `f"https://books.toscrape.com/catalogue/page-{page_num}.html"`, where page_num cycles from 1 to 50. Note the site's spelling of "catalogue" and the hyphen before the page number. Building the URL this way ensures we don't miss any pages while keeping the code readable. Once we have our target URL, we execute the familiar request-response cycle: `response = requests.get(url)` followed by `soup = BeautifulSoup(response.content, 'html.parser')`.
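The loop and URL construction described above can be sketched without touching the network. This is a minimal illustration, not the course's exact code: the `page_urls` list is a name introduced here for demonstration, and the path (`catalogue/page-N.html`) is what the site appeared to serve when this was written, so confirm it in your browser's address bar.

```python
# Assumed value: the pagination maximum found earlier in the walkthrough.
pagination_max = 50

page_urls = []  # hypothetical helper list, used only to inspect the result
for page_num in range(1, pagination_max + 1):  # range() excludes the upper bound
    # Inject the current page number into the URL template with an f-string.
    url = f"https://books.toscrape.com/catalogue/page-{page_num}.html"
    page_urls.append(url)

print(page_urls[0])    # first page
print(page_urls[-1])   # last page
print(len(page_urls))  # number of URLs generated
```

Printing the first and last URL is a quick way to catch an off-by-one mistake before any requests are sent.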
This implementation will generate 50 separate HTTP requests—a significant operation that requires patience as each request involves network latency and server processing time. In production environments, you'd want to implement rate limiting and error handling to maintain respectful scraping practices and handle potential connectivity issues.
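One way to add the rate limiting and error handling mentioned above is sketched here. The function name `fetch_page`, the one-second default pause, and the ten-second timeout are all assumptions chosen for illustration; tune them to the site you are working with.

```python
import time

import requests


def fetch_page(url, session, delay_seconds=1.0, timeout=10):
    """Fetch one page politely; return the body bytes, or None on failure."""
    try:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()  # raise on 4xx/5xx status codes
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, and HTTP error statuses.
        print(f"Skipping {url}: {exc}")
        return None
    finally:
        time.sleep(delay_seconds)  # pause after every attempt, successful or not
    return response.content
```

In the scraping loop this might be called as `fetch_page(url, session)` inside a `with requests.Session() as session:` block, skipping any page for which it returns `None`.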
Now we need to generalize our earlier extraction logic for bulk processing. For titles, we'll use list concatenation: `titles = titles + [title extraction logic]`. There are other ways to do this, such as list.extend(), which appends to the existing list in place instead of building a new one, but concatenation reads clearly and keeps the plain lists we already have.
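To make the comparison concrete, here is a small sketch showing that concatenation and `list.extend()` produce the same contents. The book names are made-up stand-ins for one page's results.

```python
titles = ["Book A"]
page_titles = ["Book B", "Book C"]  # made-up stand-ins for one page of results

# Concatenation builds a new list and rebinds the name `titles` to it.
titles = titles + page_titles

# extend() appends to the existing list in place.
titles_in_place = ["Book A"]
titles_in_place.extend(page_titles)

print(titles == titles_in_place)  # True: the contents are identical
```

At 50 pages the cost of rebuilding the list on each iteration is negligible, so the choice here is mostly about readability.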
The title extraction follows our established pattern: first, we locate all H3 elements with `h3s = soup.find_all('h3')`, then we extract the title attribute from each nested anchor tag using a list comprehension: `[h3.find('a')['title'] for h3 in h3s]`. This efficiently processes all titles on the current page in a single operation.
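The title step can be tried on a small piece of HTML before running it against the live site. The markup below is a made-up sample that imitates the structure described above (an `h3` wrapping an anchor with a `title` attribute); it is not copied from the site.

```python
from bs4 import BeautifulSoup

# Made-up HTML imitating the listing markup; the first link text is truncated
# on purpose to show why the `title` attribute is the safer source.
sample_html = """
<article class="product_pod">
  <h3><a href="book-1/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
</article>
<article class="product_pod">
  <h3><a href="book-2/index.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>
</article>
"""

soup = BeautifulSoup(sample_html, "html.parser")
titles = []

h3s = soup.find_all("h3")  # every h3 on the page
titles = titles + [h3.find("a")["title"] for h3 in h3s]  # full title from the attribute

print(titles)
```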
Price extraction requires additional processing since we need numerical values rather than raw text. We target paragraph tags with the 'price_color' class: `price_elements = soup.find_all('p', class_='price_color')`. Each price element contains text like "£51.77" that requires cleaning and conversion.
Our price processing pipeline involves three steps: extract the text content using `.get_text()`, remove the pound symbol with string slicing (`[1:]` to skip the first character), and convert to float for numerical operations. The complete operation looks like: `[float(element.get_text()[1:]) for element in price_elements]`. This transforms raw price strings into proper numerical data suitable for analysis and calculations.
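The three-step price pipeline can be checked the same way on sample markup. The HTML here is again made up for illustration, and the slice assumes each price string starts with exactly one currency character.

```python
from bs4 import BeautifulSoup

# Made-up HTML imitating the price elements described above.
sample_html = """
<p class="price_color">£51.77</p>
<p class="price_color">£53.74</p>
"""

soup = BeautifulSoup(sample_html, "html.parser")
prices = []

price_elements = soup.find_all("p", class_="price_color")
prices = prices + [
    float(element.get_text()[1:])  # text -> drop first character -> float
    for element in price_elements
]

print(prices)
```

If the text ever carries extra leading characters (for example from an encoding mismatch), `float()` will raise a `ValueError`, which is a useful early warning that the cleaning step needs adjusting.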
When executed, this loop processes all 50 pages one after another, which can take a few minutes because each HTTP request must finish before the next one begins. The result is two lists containing every title and price from the entire catalog.
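Putting the pieces together, the whole loop might look like the sketch below. The helper names `parse_page` and `scrape_all` are introduced here for illustration, and the page fetcher is passed in as a function so the logic can be exercised without making 50 real requests; the URL path is the same assumption about the site's layout noted earlier.

```python
from bs4 import BeautifulSoup


def parse_page(html):
    """Return (titles, prices) found in one listing page."""
    soup = BeautifulSoup(html, "html.parser")
    page_titles = [h3.find("a")["title"] for h3 in soup.find_all("h3")]
    page_prices = [
        float(p.get_text()[1:])  # drop the leading pound sign
        for p in soup.find_all("p", class_="price_color")
    ]
    return page_titles, page_prices


def scrape_all(fetch_html, pagination_max):
    """Loop over pages 1..pagination_max, accumulating titles and prices."""
    titles, prices = [], []
    for page_num in range(1, pagination_max + 1):
        url = f"https://books.toscrape.com/catalogue/page-{page_num}.html"
        page_titles, page_prices = parse_page(fetch_html(url))
        titles = titles + page_titles
        prices = prices + page_prices
    return titles, prices
```

Against the live site, a call such as `titles, prices = scrape_all(lambda url: requests.get(url).content, 50)` would reproduce the loop described above, assuming `requests` is imported.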
To verify our success, we'll construct a pandas DataFrame: `books = pd.DataFrame({'title': titles, 'price': prices})`. This creates a structured dataset with 1,000 rows (one per book, from 50 pages of 20 books each) and two columns (title and price). The DataFrame format supports sorting, filtering, summary statistics, and export to formats such as CSV, giving us a complete foundation for analysis and visualization.
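The DataFrame step can be sketched with a few stand-in values in place of the full scraped lists. The three rows below are sample data for illustration only; with the real lists the shape should be 1,000 rows by 2 columns.

```python
import pandas as pd

# Stand-in values; in the walkthrough these lists come from the scraping loop.
titles = ["A Light in the Attic", "Tipping the Velvet", "Soumission"]
prices = [51.77, 53.74, 50.10]

# A length mismatch would mean a title or price was missed on some page.
assert len(titles) == len(prices)

books = pd.DataFrame({"title": titles, "price": prices})

print(books.shape)                 # (3, 2) for this sample
print(books.sort_values("price"))  # cheapest first
print(books["price"].describe())   # quick summary statistics
```

Checking the list lengths before building the DataFrame gives a clearer error than the one pandas raises when the columns differ in length.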