This project is designed to help us uncover the finest Amazon deals. Our first step involves collecting data from Amazon and conducting exploratory data analysis (EDA) on the acquired information. As a result, we will be able to identify the most advantageous deal.
I am using Playwright in Python to extract data from Amazon. Playwright is a browser automation library that works well for scraping modern web pages and single-page applications. It automates interactions with a website, such as clicking buttons, filling out forms, and extracting data.
Using Playwright, I attempt to scrape product details like the product title, price, and ratings.
Step 1: Install Dependencies
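Ensure you have Python, Playwright, and Pandas installed. Playwright also needs a browser binary, which the second command downloads.
pip install playwright pandas
playwright install chromium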
Step 2: Web Scraping with Playwright
Create a Python script to scrape product prices from Amazon. Here’s an example of scraping the prices of laptops:
from playwright.sync_api import sync_playwright
import pandas as pd

# Initialize Playwright
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    amazon_url = 'https://www.amazon.com/s?k=laptop'
    # Navigate to the Amazon search results page
    page.goto(amazon_url)
    # Scrape product names, prices, and ratings
    product_names = page.query_selector_all('.puisg-col-inner h2 span')
    product_prices = page.query_selector_all('.puisg-row a .a-price-whole')
    product_rating = page.query_selector_all('span a i span')
    products_data = []
    # zip() pairs the three lists by position and stops at the shortest one
    for name, price, rating in zip(product_names, product_prices, product_rating):
        products_data.append({
            'Product Name': name.inner_text().replace(',', ''),
            'Price': price.inner_text(),
            'Rating': rating.inner_text().split(' ')[0]
        })
    # Clean up
    browser.close()
This code scrapes the details. We then convert products_data into a DataFrame and save it as a CSV file.
df = pd.DataFrame(products_data)
# Save the DataFrame to a CSV file
df.to_csv('product_details.csv', index=False)
In summary, this code uses Playwright to extract product information (name, price, and rating) from a link, then stores the data in a Pandas DataFrame, and finally saves it to a CSV file for further analysis or use.
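The scraped values are still raw strings, so it can help to normalize them before analysis. The sketch below is one way to do that in plain Python. The helper names parse_price and parse_rating are hypothetical, and the input formats ('1,299' for a price, '4.5 out of 5 stars' for a rating) are assumptions about what the selectors return, not something the page guarantees.

```python
import re
from typing import Optional


def parse_price(text: str) -> Optional[float]:
    """Turn a scraped price string such as '1,299' into a float (hypothetical helper)."""
    cleaned = re.sub(r"[^\d.]", "", text)  # keep only digits and decimal points
    cleaned = cleaned.rstrip(".")          # drop a dangling trailing dot, if any
    try:
        return float(cleaned)
    except ValueError:
        # Empty or malformed text: signal "no usable price"
        return None


def parse_rating(text: str) -> Optional[float]:
    """Pull the leading number out of text like '4.5 out of 5 stars' (assumed format)."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None


print(parse_price("1,299"))                # 1299.0
print(parse_rating("4.5 out of 5 stars"))  # 4.5
```

Returning None for unparseable text keeps bad rows visible instead of silently turning them into zeros.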
Step 3: Data Analysis with Pandas
Let’s assume that you have already collected additional data that includes customer ratings, prices, and product names. I’ll show you how to incorporate this data into your analysis using Python, Pandas, and Matplotlib for visualization.
Install Pandas, Seaborn, and Matplotlib for analysis and visualization.
pip install pandas
pip install matplotlib
pip install seaborn
After installing these packages, you can import them into your Python script or Jupyter Notebook:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Read the CSV file using Pandas
df = pd.read_csv('product_details.csv')
df.info()
To replace the commas in prices
This step takes the DataFrame df, selects the 'Price' column, removes commas from the string representations of numbers in that column, and then converts the cleaned strings to floating-point numbers. This is a common preprocessing step when numerical data is stored as strings with special characters, such as commas, that must be removed before numerical operations work.
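The cleaning step described above could look like the following minimal sketch. The sample rows are made up for illustration and only stand in for the scraped CSV. The column name 'Price' matches the one used in the scraping step.

```python
import pandas as pd

# Made-up sample rows standing in for the scraped CSV
df = pd.DataFrame({
    "Product Name": ["Laptop A", "Laptop B", "Laptop C"],
    "Price": ["1,299", "549", "74,990"],
    "Rating": [4.5, 4.1, 3.9],
})

# Remove thousands separators, then convert the strings to floats
df["Price"] = (
    df["Price"]
    .astype(str)
    .str.replace(",", "", regex=False)
    .astype(float)
)

print(df["Price"].tolist())
print(df["Price"].dtype)
```

If some scraped prices might not be valid numbers, pd.to_numeric(..., errors='coerce') is an alternative that turns them into NaN instead of raising an error.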
To find the price frequency
plt.hist(df[df['Price']<150000]['Price'], bins=120, color='skyblue')
plt.title('Price Distribution')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()
This code creates a histogram of the 'Price' column from the DataFrame df for prices less than 150,000. It divides the data into 120 bins and sets the plot's title, x-axis label, and y-axis label. Finally, it displays the plot on the screen using plt.show(). The histogram provides a visual representation of the distribution of prices in the dataset, allowing you to see how prices are distributed across different price ranges.
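To see the numbers behind a histogram like this, the same filtering and binning can be done without drawing anything. The sketch below uses np.histogram, which computes the bin counts that a histogram plot displays. The prices are made up, and 4 bins are used instead of 120 only to keep the output readable.

```python
import numpy as np
import pandas as pd

# Made-up prices; the last one is an outlier above the 150,000 cutoff
prices = pd.Series([450.0, 520.0, 610.0, 980.0, 1250.0, 1300.0, 210000.0])

# Same filter as in the plot: keep only prices below 150,000
filtered = prices[prices < 150000]

# Split the remaining range into equal-width bins and count values per bin
counts, edges = np.histogram(filtered, bins=4)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:8.1f} to {right:8.1f}: {count}")
```

Filtering out the outlier first matters because the bins span the full range of the data, so one extreme price would squeeze every other value into the first bin.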
Scatter plot of price vs. customer ratings
plt.figure(figsize=(10, 6))
plt.scatter(df['Rating'], df['Price'], color='purple', alpha=0.5)
plt.title('Price vs. Customer Ratings')
plt.xlabel('Rating')
plt.ylabel('Price')
plt.show()
This code creates a scatter plot using matplotlib in Python. It plots the 'Rating' column on the x-axis and the 'Price' column on the y-axis from the DataFrame df. The points are represented as purple dots with reduced opacity (alpha=0.5), and the plot is given a title, x-axis label ('Rating'), and y-axis label ('Price') before being displayed. The scatter plot visually displays the relationship, if any, between customer ratings and product prices, helping to identify patterns or correlations between these two variables.
Create a pair plot to explore correlations between variables
sns.set(style="ticks")
sns.pairplot(df, height=3, plot_kws={'alpha': 0.7})
plt.suptitle('Pairplot of Variables', y=1.02)
plt.show()
This code uses Seaborn to create a pairplot of the variables in the DataFrame df. Each pair of numeric variables is drawn as a scatter plot in a grid, showing the relationship between them. The alpha parameter controls the transparency of the data points in the plots. The sns.set(style="ticks") line sets the style of the plot, and plt.suptitle sets the title for the pairplot. This visualization helps explore relationships and distributions between multiple variables in the dataset.
Calculate and visualize the correlation matrix
The code below calculates and visualizes the correlation matrix between customer ratings and prices. It provides a quantitative measure of the strength and direction of the linear relationship between these two variables. The heatmap with annotations makes it easier to interpret the correlation values.
correlation_matrix = df[['Rating', 'Price']].corr()
plt.figure(figsize=(6, 4))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=.5)
plt.title('Correlation Matrix')
plt.show()
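To make the numbers in the heatmap less of a black box, the sketch below computes the Pearson correlation coefficient by hand and compares it with the value from DataFrame.corr(), which uses Pearson by default. The sample data is made up, and pearson_r is a hypothetical helper name.

```python
import pandas as pd

# Made-up sample data for illustration
df = pd.DataFrame({
    "Rating": [3.9, 4.1, 4.3, 4.5, 4.7],
    "Price": [549.0, 620.0, 899.0, 1299.0, 1150.0],
})


def pearson_r(x: pd.Series, y: pd.Series) -> float:
    """Pearson correlation: covariance scaled by both standard deviations."""
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / ((dx ** 2).sum() ** 0.5 * (dy ** 2).sum() ** 0.5))


manual = pearson_r(df["Rating"], df["Price"])
from_pandas = df[["Rating", "Price"]].corr().loc["Rating", "Price"]

print(round(manual, 4), round(from_pandas, 4))
```

A value near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates little or no linear relationship. It says nothing about non-linear patterns.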
Create a box plot to examine price variations within different rating ranges
plt.figure(figsize=(10, 6))
sns.boxplot(x='Rating', y='Price', data=df, palette='Set2')
plt.title('Price Variations by Customer Ratings')
plt.xlabel('Customer Ratings')
plt.ylabel('Price')
plt.xticks(rotation=45)
plt.show()
The box plot shows the median, quartiles, and outliers for each rating category. This makes it possible to see whether typical prices or their spread differ between rating levels.
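The quartiles that a box plot draws can also be read as a table. The sketch below groups made-up sample rows by rating and summarizes the price within each group. It is a numeric companion to the plot, not a replacement for it.

```python
import pandas as pd

# Made-up sample data: three products at each of two rating levels
df = pd.DataFrame({
    "Rating": [4.0, 4.0, 4.0, 4.5, 4.5, 4.5],
    "Price": [500.0, 600.0, 700.0, 900.0, 1000.0, 1400.0],
})

# Lower quartile, median, and upper quartile of price per rating
summary = df.groupby("Rating")["Price"].describe()[["25%", "50%", "75%"]]

print(summary)
```

The 50% column is the line inside each box, and the 25% and 75% columns are the box edges.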
Exploratory data analysis (EDA) on an e-commerce dataset that includes product prices and customer ratings is an essential step in understanding and summarizing the characteristics of your data. Here’s what we can learn from this code:
1. Visualization of the Relationship Between Price and Customer Ratings:
- A joint plot visualizes the relationship between product prices and customer ratings. The kind='reg' parameter selects a regression plot, allowing us to see if there is a linear relationship between price and ratings. This plot can help identify trends in how price and ratings are related.
2. Pairplot for Exploring Correlations:
- A pair plot is generated to explore correlations between different variables in the dataset. By default, it shows scatter plots for all pairs of numerical variables and histograms for the diagonal. This allows us to visually inspect relationships and dependencies between variables.
3. Price Variations by Customer Ratings:
- A box plot is created to examine how prices vary within different customer rating categories. This visualization helps identify price variations among products with different ratings.
4. Correlation Matrix:
- The code calculates and visualizes the correlation matrix between customer ratings and prices. It provides a quantitative measure of the strength and direction of the linear relationship between these two variables. The heatmap with annotations makes it easier to interpret the correlation values.