Automated Sentiment & Readability Analysis of Web Articles
Project Overview
This project automates the extraction, cleaning, and analysis of text from a list of URLs supplied in an Excel file. Using Python libraries such as requests, BeautifulSoup, NLTK, and pandas, it produces sentiment scores and readability metrics for each article, covering both the emotional tone and the complexity of the content.
Key Steps
- Data Extraction: Reads URLs from an Excel file and scrapes each page to extract article titles and content.
- Text Preprocessing: Cleans text by removing stopwords and punctuation, then tokenizes for analysis.
- Sentiment Scoring: Calculates positive/negative scores using provided word lists, and computes polarity and subjectivity scores.
- Readability Analysis: Computes metrics such as average sentence length, percentage of complex words, Fog Index, and syllable counts.
- Output: Saves processed articles as text files and compiles all results into a structured CSV file.
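The scoring and readability steps above can be sketched roughly as follows. This is a minimal, standard-library illustration, not the project's actual code: the word lists, the stopword set, and the function names (`split_sentences`, `tokenize`, `count_syllables`, `analyze`) are hypothetical stand-ins, and the formulas are assumptions. Polarity and subjectivity use a common word-count formulation, the Fog Index uses the standard Gunning Fog formula, and "complex" is taken to mean three or more syllables under a rough vowel-group heuristic. In the real pipeline, NLTK's tokenizers and stopword corpus would likely replace the regex helpers.

```python
import re

# Hypothetical stand-ins for the project's provided word lists and stopwords.
POSITIVE_WORDS = {"good", "great", "excellent"}
NEGATIVE_WORDS = {"bad", "poor", "terrible"}
STOPWORDS = {"the", "is", "a", "an", "and", "but", "was", "of", "to"}


def split_sentences(text):
    # Naive split on sentence-ending punctuation; empty fragments are dropped.
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def tokenize(text):
    # Lowercase alphabetic tokens, allowing one internal apostrophe ("don't").
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def count_syllables(word):
    # Rough heuristic: one syllable per run of vowels, never fewer than one.
    return max(1, len(re.findall(r"[aeiouy]+", word)))


def analyze(text):
    sentences = split_sentences(text)
    words = tokenize(text)
    content_words = [w for w in words if w not in STOPWORDS]

    positive = sum(1 for w in content_words if w in POSITIVE_WORDS)
    negative = sum(1 for w in content_words if w in NEGATIVE_WORDS)

    # Assumed formulas; the small constant avoids division by zero.
    polarity = (positive - negative) / (positive + negative + 1e-6)
    subjectivity = (positive + negative) / (len(content_words) + 1e-6)

    avg_sentence_length = len(words) / max(len(sentences), 1)
    complex_count = sum(1 for w in words if count_syllables(w) >= 3)
    pct_complex = 100 * complex_count / max(len(words), 1)

    # Standard Gunning Fog formula: 0.4 * (words per sentence + % complex words).
    fog_index = 0.4 * (avg_sentence_length + pct_complex)

    return {
        "positive_score": positive,
        "negative_score": negative,
        "polarity": polarity,
        "subjectivity": subjectivity,
        "avg_sentence_length": avg_sentence_length,
        "complex_word_count": complex_count,
        "pct_complex_words": pct_complex,
        "fog_index": fog_index,
    }


if __name__ == "__main__":
    sample = "The service was excellent. The food was bad but the staff was great!"
    for name, value in analyze(sample).items():
        print(f"{name}: {value}")
```

Returning a flat dictionary per article keeps the last step simple: a list of such dictionaries can be passed directly to `pandas.DataFrame` and written out with `to_csv`.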
Summary
The workflow takes a list of URLs and, without manual intervention, produces sentiment scores and readability metrics for each article, collected in a single CSV file.