Can you share Python code for scraping specific data points from URLs and saving them to a database? Can you automate the workflow so the code runs every 12 hours and replaces only the database content that has changed during the scraping crawl cycle?

Answer

Here is a Python code snippet that shows how to scrape data from a URL, save it to a SQLite database, and automate the workflow to run every 12 hours. The code replaces only the data that has changed since the last scrape.

You will need to install the required libraries with pip if you have not already done so. The sqlite3 and hashlib modules ship with Python and do not need to be installed:

pip install requests beautifulsoup4 apscheduler

Code overview

Scrape a URL for specific data points: the example scrapes a generic URL. You can modify the scrape_data function to extract the specific data points you need.
Save to a SQLite database: the code uses SQLite to store the scraped data.
Automate with APScheduler: the APScheduler library schedules the scraping job to run every 12 hours.

Python code

import requests
from bs4 import BeautifulSoup
import sqlite3
from apscheduler.schedulers.blocking import BlockingScheduler
import hashlib

# Database setup: one row per URL, so changed content replaces the old row
def create_table():
    conn = sqlite3.connect('scraping_data.db')
    cursor = conn.cursor()
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS scraped_data (
            url TEXT PRIMARY KEY,
            data_hash TEXT,
            content TEXT
        )
    ''')
    conn.commit()
    conn.close()

# Scraping function
def scrape_data(url):
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Raise on HTTP errors instead of hashing an error page
    soup = BeautifulSoup(response.text, 'html.parser')

    # Adjust the following lines to scrape the specific data points you need
    element = soup.find('div', class_='specific-class')  # Example of scraping a specific div
    if element is None:
        raise ValueError(f'No matching element found at {url}')
    content = element.text.strip()
    data_hash = hashlib.md5(content.encode()).hexdigest()  # Hash used only for change detection

    return data_hash, content

# Function to insert or update data in the database
def save_to_database(url):
    data_hash, content = scrape_data(url)

    conn = sqlite3.connect('scraping_data.db')
    cursor = conn.cursor()

    # Look up the hash stored for this URL by the previous run
    cursor.execute('SELECT data_hash FROM scraped_data WHERE url = ?', (url,))
    result = cursor.fetchone()

    if result is None:
        # URL not seen before: insert a new record
        cursor.execute('INSERT INTO scraped_data (url, data_hash, content) VALUES (?, ?, ?)', (url, data_hash, content))
        print("New content saved to database.")
    elif result[0] != data_hash:
        # Content changed since the last run: replace the stored record
        cursor.execute('UPDATE scraped_data SET data_hash = ?, content = ? WHERE url = ?', (data_hash, content, url))
        print("Content changed. Database record updated.")
    else:
        print("Content has not changed. No update needed.")

    conn.commit()
    conn.close()

# Function to run the scraper
def run_scraper():
    url = 'http://example.com'  # Replace with the URL you want to scrape
    save_to_database(url)

# Scheduler setup
if __name__ == '__main__':
    create_table()
    scheduler = BlockingScheduler()
    scheduler.add_job(run_scraper, 'interval', hours=12)  # Run every 12 hours
    run_scraper()  # Initial run
    scheduler.start()

Explanation

Scraping: the scrape_data function sends an HTTP request to the given URL and uses BeautifulSoup to parse the HTML and extract the required content.
Database management: the save_to_database function uses a hash to check whether the scraped content is already stored. If the content has changed (that is, a new hash is produced), it writes the new content to the database; otherwise it leaves the database untouched.
Scheduling: APScheduler is configured to run run_scraper() every 12 hours. The initial run is called directly, just before the scheduler starts.
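
The change-detection step described above can be tried without any network access. The following is a minimal sketch, assuming a table keyed by URL with one row per page; the helper names upsert_if_changed and make_demo_connection, the in-memory database, and the sample strings are hypothetical and exist only for illustration.

```python
import hashlib
import sqlite3


def make_demo_connection():
    """Create an in-memory database with one row per URL (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE scraped_data (url TEXT PRIMARY KEY, data_hash TEXT, content TEXT)"
    )
    return conn


def upsert_if_changed(conn, url, content):
    """Store content for url only when its hash differs from the stored hash.

    Returns 'inserted', 'updated', or 'unchanged'.
    """
    new_hash = hashlib.md5(content.encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT data_hash FROM scraped_data WHERE url = ?", (url,)
    ).fetchone()

    if row is None:
        # First time this URL is seen
        conn.execute(
            "INSERT INTO scraped_data (url, data_hash, content) VALUES (?, ?, ?)",
            (url, new_hash, content),
        )
        status = "inserted"
    elif row[0] != new_hash:
        # Stored hash differs, so the page content changed
        conn.execute(
            "UPDATE scraped_data SET data_hash = ?, content = ? WHERE url = ?",
            (new_hash, content, url),
        )
        status = "updated"
    else:
        status = "unchanged"

    conn.commit()
    return status
```

Calling upsert_if_changed three times for the same URL, with the same content twice and then different content, should leave a single row holding the latest content.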

Notes

Replace http://example.com and the scraping logic with the actual URL and data-extraction logic for your requirements.
Make sure the HTML structure of the page you are scraping matches the tags and classes used in the code.
Handle errors and edge cases (for example, failed requests and changes to the page layout) as needed for robustness.
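
One possible way to handle the failed-request case from the notes is to retry a few times before giving up, so that a single transient failure does not cost a whole 12-hour cycle. This is only a rough sketch: call_with_retries and its parameters are hypothetical names, and fetch stands for any zero-argument callable.

```python
import time


def call_with_retries(fetch, attempts=3, delay_seconds=0.0):
    """Call fetch() up to `attempts` times and return the first successful result.

    Re-raises the last exception if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as error:  # Real code should catch narrower exception types
            last_error = error
            print(f"Attempt {attempt} of {attempts} failed: {error}")
            if attempt < attempts:
                time.sleep(delay_seconds)  # Wait before the next attempt
    raise last_error
```

Inside run_scraper this might be used as call_with_retries(lambda: save_to_database(url), attempts=3, delay_seconds=60), with the attempt count and delay chosen to suit the target site.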
