Today's Reading:

Today's Software:

  • Quartz 4
    The framework this blog now runs on. I originally meant to build the site with Jekyll, since I had prior experience with it.
    But rolling my own features like Graph View turned out to be quite a hassle, and since I'm used to journaling in Obsidian, I also wanted to keep using backlinks when organizing entries later. Then I discovered Quartz 4, a publishing tool designed specifically for Obsidian, and an officially recommended framework at that.
  • Pixzip
    An image compression tool. Plenty of image compressors are on the market, but ones supporting AVIF are rare. Apart from a few animated images, everything on my new blog is AVIF, which loads noticeably faster (see the conversion sketch after this list).
  • voice-models
    An aggregator site for voice models such as VITS, with thorough annotations for each model's parameters.
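
Batch-converting a blog's images to AVIF is easy to script. Below is a minimal sketch assuming Pillow plus the pillow-avif-plugin package (which registers an AVIF codec when imported); the folder path and quality value are illustrative placeholders, not my actual settings.

from pathlib import Path

from PIL import Image
import pillow_avif  # noqa: F401  (side-effect import: registers the AVIF codec)

def convert_folder_to_avif(folder, quality=60):
    """Convert every PNG/JPEG in `folder` to an AVIF file next to the original."""
    for src in Path(folder).glob('*'):
        if src.suffix.lower() not in {'.png', '.jpg', '.jpeg'}:
            continue  # leave animated images and other formats alone
        with Image.open(src) as im:
            im.save(src.with_suffix('.avif'), quality=quality)

convert_folder_to_avif('content/images')  # hypothetical folder name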

Today's Code:

confluence_blog_scraper.py
from bs4 import BeautifulSoup
import requests
import os
import time
from urllib.parse import urljoin
import html2text
import logging
from retrying import retry
 
# Set up logging
logging.basicConfig(level=logging.INFO)
 
# Constants and configuration
BASE_URL = "http://example.com"
HEADERS = {'Cookie': 'your_cookie_value'}  # placeholder: paste a logged-in Confluence session cookie
SESSION = requests.Session()
SESSION.headers.update(HEADERS)
 
# Retry decorator: up to 3 attempts, 2 seconds between tries
@retry(stop_max_attempt_number=3, wait_fixed=2000)
def get_with_retry(url):
    response = SESSION.get(url, timeout=10)
    response.raise_for_status()
    return response
 
def download_image(img_url, save_folder, image_num):
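    """Download one image into save_folder as image<N>.png; return the local path, or None on failure."""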
    try:
        os.makedirs(save_folder, exist_ok=True)
        local_path = os.path.join(save_folder, f'image{image_num}.png')
        response = get_with_retry(img_url)
        with open(local_path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
        return local_path
    except Exception as e:
        logging.error(f"Error downloading image: {e}")
        return None
 
def html_to_markdown(html, title, base_image_folder='downloaded_images'):
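    """Convert a Confluence page body to Markdown, downloading embedded images into a per-post folder."""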
    try:
        soup = BeautifulSoup(html, 'html.parser')
        content_div = soup.find(id="main-content")
        if not content_div:
            return "No content found in 'main-content'"
        
        image_num = 1
        title_for_path = title.replace('/', '-').replace('\\', '-')
        image_folder = os.path.join(base_image_folder, title_for_path)
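        # Replace each embedded <img> with a Markdown image link that points at a local copy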
        for img_tag in content_div.find_all('img', class_="confluence-embedded-image"):
            img_src = img_tag.get('data-image-src') or img_tag.get('src')
            if img_src:
                img_src = urljoin(BASE_URL, img_src) if not img_src.startswith('http') else img_src
                local_image_path = download_image(img_src, image_folder, image_num)
                if local_image_path:
                    # Link relative to the Markdown file (written to the CWD), not to the image folder's parent
                    relative_image_path = os.path.relpath(local_image_path).replace('\\', '/')
                    img_markdown = f"![Image {image_num}]({relative_image_path})"
                    img_tag.replace_with(BeautifulSoup(img_markdown, 'html.parser'))
                    image_num += 1
        
        h = html2text.HTML2Text()
        h.ignore_links = False
        return h.handle(str(content_div))
    except Exception as e:
        logging.error(f"Error converting HTML to Markdown: {e}")
        return ""
 
def download_page(page, title):
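    """Fetch one blog post, convert it to Markdown, and save it as <title>.md."""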
    try:
        url = urljoin(BASE_URL, page)  # urljoin avoids doubled slashes if page already starts with '/'
        response = get_with_retry(url)
        html_content = response.text
        markdown = html_to_markdown(html_content, title)
        # Sanitize the title so slashes can't break the output path
        safe_title = title.replace('/', '-').replace('\\', '-')
        filename = f"{safe_title}.md"
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(markdown)
        logging.info(f"Markdown content saved to {filename}.")
    except Exception as e:
        logging.error(f"Error downloading page: {e}")
 
def get_page_list(month, year, key):
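    """List one month's blog posts via Confluence's pagetree REST endpoint and download each one."""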
    try:
        url = f"{BASE_URL}/rest/ia/1.0/pagetree/blog/subtree?spaceKey={key}&groupType=2&groupValue={month}%2F{year}"
        response = get_with_retry(url)
        json_content = response.json()
        if not json_content:
            logging.info(f"No blog data for {year}-{month}, skipping.")
            return
        
        for page_info in json_content:
            download_page(page_info['url'], page_info['title'])
            time.sleep(1)  # brief pause between posts to go easy on the server
    except Exception as e:
        logging.error(f"Error getting page list: {e}")
 
if __name__ == "__main__":
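    # Walk every month from January 2023 through December 2024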
    for year in range(2023, 2025):
        for month in range(1, 13):
            get_page_list(f"{month:02}", year, "your_space_key")

Today's Happenings:

Too much happened today to bother listing it all.

Today's Rambling:

Migrating a blog is exhausting, roughly on par with running 1,500 meters on a treadmill.
Doing the migration right after actually running 1,500 meters is even more exhausting!
I bought an off-the-shelf NAS, a 极空间 (ZSpace) Q2C that comes with a 4 TB drive. Its built-in NAT traversal was a pleasant surprise: even before I'd gone in to the company to set up the 随身私网 private-network service, the speeds were already decent.
The only pity is that I bought too early, before the promotion on the higher-spec Z2Pro came out. Since it was an internal company purchase, returns and exchanges aren't supported, so I'll just have to make do with this Docker-less NAS and tinker with it slowly.