Python Scrapy: installation, usage, and configuring pipeline, item, and settings; a brief look at Scrapy's deduplication mechanism; Python: running Scrapy with custom arguments; passing parameters with yield and Request meta; keeping a session with requests.Session in Python

Common commands

scrapy startproject <project_name>
cd <project_name>   # enter the project; this directory was generated from a template when the project was created
scrapy genspider <spider_name> <domain>



spider_name: the spider name must be unique within the project.
domain: the domain is used to restrict the crawl scope.
Spiders in the Scrapy framework are written as classes (object-oriented style).

response.xpath("expression") returns a list of Selector objects; to extract an element attribute or its text you call get() or getall() on the result.

get() serializes the matched node and returns it as a single Unicode string; if there are several matches it returns the first one. The return type is str, and it returns None when nothing matches.
getall() serializes all matched nodes and returns them as a list of Unicode strings; the return type is list, and it returns an empty list when nothing matches.
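
A minimal sketch of the difference (the HTML snippet and XPath expressions are made up for illustration):

from scrapy.selector import Selector

html = "<ul><li class='item'>a</li><li class='item'>b</li></ul>"
sel = Selector(text=html)

print(sel.xpath("//li/text()").get())      # 'a'  -> only the first match
print(sel.xpath("//li/text()").getall())   # ['a', 'b']
print(sel.xpath("//p/text()").get())       # None -> no match
print(sel.xpath("//p/text()").getall())    # []   -> no match gives an empty list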

Passing parameters with yield and Request meta
How do you pass parameters during a recursive Scrapy crawl, and avoid every yielded item ending up with the last value of the loop? This is where the meta argument of a Scrapy Request comes in; it only accepts a dict:
meta={'k1':v1,'k2':v2}
Usage:

def parse(self, response):
    items = ScrapytestItem()
    items['name'] = 'csdn'
    href = href_domains + item.css('......').extract_first()
    yield Request(
        url=href,
        callback=self.parse_details,
        meta={'items': items},
    )

Reading the parameter:

def parse_details(self, response):
    items2 = response.meta['items']

Deep copy (so each yielded request carries its own copy of the item rather than a reference to the same object):

import copy
meta = {'items': copy.deepcopy(items)}

Reference: https://blog.csdn.net/DL_min/article/details/105593318?

Python: running Scrapy with custom arguments

# run the spider
scrapy crawl spiderName
# run with custom arguments
$ scrapy crawl spiderName -a parameter1=value1 -a parameter2=value2

Reading the arguments:

def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    # read the argument in __init__
    num = kwargs.get('num')
    print('init num: ', num)

# read the argument in an instance method
num = getattr(self, 'num', False)
print('getattr: ', num)

Reference: https://blog.csdn.net/mouday/article/details/112303043
Overriding start_requests in Scrapy (full version); powerful uses of Python requests: https://blog.csdn.net/sirobot/article/details/105360486
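
Since the reference above mentions overriding start_requests, here is a minimal, self-contained sketch (spider name and URLs are placeholders); it replaces the default behaviour of requesting each entry in start_urls:

import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def start_requests(self):
        # build the initial requests yourself instead of relying on start_urls
        urls = ['http://example.com/page1', 'http://example.com/page2']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        yield {'url': response.url, 'title': response.xpath('//title/text()').get()}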

Keeping a session with requests.Session in Python

import requests
 
# create a session
session = requests.session()
data = {
   'loginName': 'xxxxxx',       # replace with your own username
   'password': 'xxxxxxxxxx'     # replace with your own password
}
# log in
url = "https://passport.17k.com/ck/user/login"
 
result = session.post(url, data=data)
 
# print(result.text)
# print(result.cookies)
 
# request again with the same session to fetch the bookshelf data
url2 = "https://user.17k.com/ck/author/shelf?page=1&appKey=2406394919"
result_data = session.get(url2)
print(result_data.json()['data'])

run.py

# from scrapy import cmdline
# import cmdline
import os
import threading
import time

# line = ('py -m scrapy crawl msrc --nolog'.strip()).split()
# line = ('python -m scrapy crawl getm3u8 --nolog'.strip()).split()
# line = ('python -m scrapy crawl getm3u8'.strip()).split()
# line = ('py -m scrapy crawl getm3u8'.strip()).split()
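# start the MongoDB Windows service so the MongoDB pipeline can connect to it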
mongo = 'net start MongoDB'
os.system(mongo)
# line = 'py -m scrapy crawl getm3u8'
# line = 'py -m scrapy crawl getm3u8 --nolog'
try:
    item = int(input('请问要收集多少条,无限制填0,其余数字为条数\n'))
    if len(str(item).strip()) < 1:
        item = 2
except:
    item = 2
# line = 'py -m scrapy crawl getm3u8 -s CLOSESPIDER_ITEMCOUNT=3 --nolog'
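# -s CLOSESPIDER_ITEMCOUNT=N tells the built-in CloseSpider extension to stop the spider
# after N items have been scraped; --nolog suppresses Scrapy's log output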
line = f'py -m scrapy crawl getm3u8 -s CLOSESPIDER_ITEMCOUNT={str(item).strip()} --nolog'
# line = 'py -m scrapy crawl getm3u8 -s CLOSESPIDER_ITEMCOUNT=3'
# print(line)
# cmdline.execute(line)
print(line)
# os.system(line)
def runline():
    os.system(line)
t = threading.Thread(target=runline, args=())
t.start()
time.sleep(10)
input('\n\n\n\nfinished waiting\n')

Spider file

import re
import scrapy
import urllib
import datetime
import requests
import time

class Getm3u8Spider(scrapy.Spider):
    name = 'getm3u8'
    url = 'https://137aaa.com/index.php/vod/play/id/66625/sid/1/nid/1.html'
    url = (str(input('请输入要分析m3u8资源地址的网站\n'))).strip()
    if url[:4] != 'http':
        url1 = 'https://' + url.strip()   # https variant of the bare address (built before url is prefixed)
        url = 'http://' + url.strip()
    url_split = url.split('/')
    domain_o = url_split[2]
    domain_o = domain_o.split(":")[0]
    domain_s = domain_o.split('.')
    domain = domain_s[-2] + '.' + domain_s[-1]

    # allowed_domains = ['137aaa.com']
    allowed_domains = [domain, ]
    start_urls = [url, ]

    my_usefull_domains = [domain, 'xjzyplay.com', 'cdn.xjzyplay.com' ]
    my_forbidden_domains = []
    my_items_count = 0
    my_forbidden_domains_count = {}     # per-domain count of failed resource checks

    # start_urls = [
    #     # 'http://httpbin.org/get',
    # ]



    def parse(self, response):
        url_from = response.request.url
        # print('正在分析:', url_from)
        text = response.text
        txt = text.replace('\\', '')
        # print('*\n' * 5)
        # print(text)
        urls_m3u8_list = re.findall(r'''(https?:\\?/\\?/[^!"<>',]*?\.(?:m3u8|mp4|avi|mov|mpeg|mp3))''', txt,
                                    re.S | re.M | re.I)
        # print(urls_m3u8_list)

        # play_src = f'https://play.panjinhe.cn/player/index.php?name={title}&pic=&site={urls_m3u8_list[0]}'
        # print(play_src)
        # recursive crawl: follow links found on the page
        hrefs = response.xpath('//a/@href').getall()
        hrefs = list(set(hrefs))
        # print(hrefs)
        a = 1
        for u in hrefs:
            if u[:10] != 'javascript':
                if u[:4] != 'http':
                    url_split = url_from.split('/')
                    ht = url_split[0] + '//' + url_split[2]
                    u = ht + u
                u1 = u.split('/')
                if self.domain in u and len(u1) > 4:
                    # print('start to get url: ', u)
                    yield scrapy.Request(url=u, callback=self.parse, dont_filter=False,)
                    a = a + 1
                    if a > 3:
                        # break
                        pass
        time.sleep(0.1)


        title = response.xpath('//title/text()').get()

        # print(title)
        title_u = urllib.parse.quote(title)
        # print(title_u)
        keywords = response.xpath('//meta[@name="keywords"]/@content').get()
        # print(keywords)
        description = response.xpath('//meta[@name="description"]/@content').get()
        # print(description)
        hrefs = response.xpath('//a/@href').getall()
        # print(hrefs)

        save_time = (str(datetime.datetime.now())[:-7]).replace(" ", "_")
        # deduplicate the extracted resource URLs
        urls_m3u8_list = list(set(urls_m3u8_list))
        for m3u8_src in urls_m3u8_list:
            # clean up doubly-concatenated links: prepend a dummy prefix, then keep only the part after the last scheme
            m3u8_src = 'https://dhjjhkfhjk' + m3u8_src
            abcd = re.findall('https://', m3u8_src, re.S | re.I | re.M)
            if len(abcd) > 1:
                m3u8_src = 'https://' + m3u8_src.split('https://')[-1]

            abcde = re.findall('http://', m3u8_src, re.S | re.I | re.M)
            if len(abcde) > 1:
                m3u8_src = 'http://' + m3u8_src.split('http://')[-1]
            m3u8_domains = m3u8_src.split('/')[2]
            try:
                # speed-up: cache domains that are already known to be good or bad
                if m3u8_domains in self.my_usefull_domains:
                    # skip the check for domains already verified as valid
                    m3u8_src_status = 200
                elif m3u8_domains in self.my_forbidden_domains:
                    # skip domains that are known to be unreachable
                    m3u8_src_status = 404
                else:
                    try:
                        # test whether the resource URL actually responds
                        hd = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36"}
                        m3u8_src_status = requests.get(m3u8_src, headers=hd, timeout=8)
                        m3u8_src_status = m3u8_src_status.status_code
                    except Exception as exp:
                        # handle timeouts and domains that cannot be reached
                        print(exp)
                        m3u8_src_status = 10101

                if m3u8_src_status == 200:
                    str_replace = [' ', '&', '/', '\\', '=', '[', ']', '(', ')', '{', '}', '*', '#', '@', '!', ':', "'", '"', ',', '“',
                                   '‘', ',', ':', '!']
                    try:
                        for i in str_replace:
                            title = title.replace(i, '')
                            title_u = title_u.replace(i, '')
                            keywords = keywords.replace(i, '')
                            description = description.replace(i, '')
                    except:
                        pass

                    play_src = f'https://play.panjinhe.cn/player/index.php?name={title}&pic=&site={m3u8_src}'
                    items = {}
                    items['m3u8_src'] = m3u8_src
                    items['play_src'] = play_src
                    items['url_from'] = url_from
                    items['title'] = title
                    items['title_u'] = title_u
                    items['description'] = description
                    items['keywords'] = keywords
                    items['domain'] = self.domain
                    items['save_time'] = save_time
                    items['m3u8_domains'] = m3u8_domains
                    yield items
                    # print(items)
                    print(self.domain, '获得资源:', self.my_items_count, datetime.datetime.now())
                    print(title)
                    self.my_items_count = self.my_items_count + 1
                    self.my_usefull_domains.append(m3u8_domains)
                    self.my_usefull_domains = list(set(self.my_usefull_domains))  # deduplicate the list
                elif m3u8_src_status == 404:
                    self.my_forbidden_domains.append(m3u8_domains)
                    self.my_forbidden_domains = list(set(self.my_forbidden_domains))
                else:
                    try:
                        try:
                            self.my_forbidden_domains_count[m3u8_domains] = self.my_forbidden_domains_count[m3u8_domains] + 1
                        except Exception as exp:
                            self.my_forbidden_domains_count[m3u8_domains] = 1

                        print('\n\n\n*\n', m3u8_domains, self.my_forbidden_domains_count[m3u8_domains])

                        if self.my_forbidden_domains_count[m3u8_domains] > 3:
                            self.my_forbidden_domains.append(m3u8_domains)
                            self.my_forbidden_domains = list(set(self.my_forbidden_domains))
                    except Exception as exp:
                        print(exp)
                        pass

            except Exception as exp:
                print(exp)
                pass

items.py

import scrapy


class Getallm3U8Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    m3u8_src = scrapy.Field()
    play_src = scrapy.Field()
    url_from = scrapy.Field()
    title = scrapy.Field()
    title_u = scrapy.Field()
    description = scrapy.Field()
    keywords = scrapy.Field()
    domain = scrapy.Field()
    save_time = scrapy.Field()
    m3u8_domains = scrapy.Field()

settings.py

# Obey robots.txt rules (enabled by default; disabled here)
ROBOTSTXT_OBEY = False

# maximum concurrency
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32

# random User-Agent type (read by some random-UA middleware implementations)
RANDOM_UA_TYPE = "random"
# downloader middleware settings
DOWNLOADER_MIDDLEWARES = {
   'getAllM3u8.middlewares.RandomUserAgentMiddlemares': 543,
}

# item pipelines (lower number = higher priority, runs first)
ITEM_PIPELINES = {
   'getAllM3u8.pipelines.Getallm3U8Pipeline': 300,
   'getAllM3u8.pipelines.MongoDBPipeline': 310,
   'getAllM3u8.pipelines.BaikePipeline': 320,
}

middlewares.py


# Random UA, method 1
# pip install fake_useragent
# import UserAgent
from fake_useragent import UserAgent

# downloader middleware that sets a random User-Agent on every request
class RandomUserAgentMiddlemares(object):
    ua = UserAgent()

    def process_request(self, request, spider):
        user_agent = self.ua.random
        request.headers["User-Agent"] = user_agent


# Random UA, method 2
import random

class RandomUserAgentMiddlemares1(object):
    # a User-Agent list found online
    agents = [
 "Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
 "Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)",
 "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
 "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
 "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
 "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
 "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
 "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
 "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
 "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
 "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
 "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10",
 "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)",
 "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5",
 "Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)",
 "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
 "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
 "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",]
    def process_request(self, request, spider):
        user_agent = random.choice(self.agents)
        request.headers["User-Agent"] = user_agent

pipelines.py

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import datetime
import time
import os

# Save items to MongoDB (a schema-less database)
import json
import pymongo
class MongoDBPipeline(object):
    DB_URL = 'mongodb://localhost:27017/'  # DB_URL and DB_NAME are hard-coded here; they could instead be configured in settings.py
    DB_NAME = 'm3u8db'
    def __init__(self):
        # connect to MongoDB
        # self.client = pymongo.MongoClient(host='localhost', port=27017)
        self.client = pymongo.MongoClient(self.DB_URL)
        # get (or create) the database
        self.db = self.client['m3u8db']
        # get (or create) the collection
        self.table = self.db['m3u8tb']

    def process_item(self, item, spider):
        # print('___________________++++++', dict(item))
        self.table.insert_one(dict(item))
        # newer pymongo versions no longer support insert(); use insert_one() for a single document or insert_many() for a list of documents
        return item
    def close_spider(self, spider):
        self.client.close()
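
A quick, self-contained illustration of the insert_one()/insert_many() distinction mentioned in the comment above (connection string and documents are placeholders):

import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017/')
table = client['m3u8db']['m3u8tb']
# insert a single document
table.insert_one({'m3u8_src': 'https://example.com/a.m3u8'})
# insert several documents at once
table.insert_many([{'m3u8_src': 'https://example.com/a.m3u8'},
                   {'m3u8_src': 'https://example.com/b.m3u8'}])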


# Save items to a txt/JSON file
import json
class BaikePipeline(object):
    # fname = 'my_m3u8_src1.json'
    # def open_spider(self, spider):
    #     # self.file = open('items.txt', 'w', encoding="utf-8")
    #     # self.file = open(self.fname, 'a+', encoding="utf-8")
    #     pass
    #
    # def close_spider(self, spider):
    #     self.file.close()

    # the Item still has to be converted back (to a dict) before later use; it mainly exists to keep errors from aborting the program
    def process_item(self, item, spider):
        res = dict(item)
        # res = json.dumps(res)
        m3u8_src = res['m3u8_src']
        domain = res['domain']
        # fname = 'my_m3u8_src1.json'
        fname = f'save/m3u8_{domain}.json'
        # make sure the output directory exists
        os.makedirs('save', exist_ok=True)

        try:
            try:
                with open(fname, 'r', encoding="utf-8") as f:
                    f_local = f.read()
                    dict_item_local = dict(json.loads(f_local))
                    f.close()
                pass
            except:
                dict_item_local = {}
                pass


            out_text = {}
            out_text[m3u8_src] = res
            # merge dicts: entries already in the local file plus the new one
            out_text = dict(dict_item_local, **out_text)
            json_f = out_text
            out_text = json.dumps(out_text, indent=4, ensure_ascii=False)
            # line = res['name']
            # self.file.write(line.encode('utf-8') + '\n')
            # self.file.write(res)
            # self.file.write(out_text)
            with open(fname, 'w', encoding='utf-8') as f:
                f.write(out_text)
                f.close()
            self.save_html(json_f=json_f, domain=domain)
        except Exception:
            pass
        return item

    # generate a local HTML page listing the collected resources
    def save_html(self, file_name='', json_f='', domain=''):
        time.sleep(5)
        """
        https://www.cnblogs.com/ivkeji/p/14491959.html
        json_file = {
                "https://new.qqaku.com/20220919/u2tRwIl3/index.m3u8": {
                    "src": "https://new.qqaku.com/20220919/u2tRwIl3/index.m3u8",
                    "name": "罚罪-第36集在线观看-连续剧 - 月亮电影网",
                    "description": "《罚罪》部分取材于真实事件,以一桩恶性案件为切入口,通过青年刑警常征的视角,讲述出两代公安干警为维护一方安宁,扫除犯罪团伙,不畏艰险、前赴后继的英勇故事。在昌武(虚构地)这座小城,在危机重重的战斗第一",
                    "from": "https://www.zqzdzj.com/vod/129111-2-36.html",
                    "times": "2022-09-19"
                },
                "https://56z.cc/default.php?url=https://sod.bunediy.com/20220825/wxOaBtQK/index.m3u8": {
                    "src": "https://56z.cc/default.php?url=https://sod.bunediy.com/20220825/wxOaBtQK/index.m3u8",
                    "name": "罚罪第04集_电视剧完整版_免费在线观看_爱碟影院",
                    "description": "电视剧《罚罪》第04集免费在线观看",
                    "from": "https://aidie.cc/play-90051-8-4/",
                    "times": "2022-09-19"
                },
                "https://new.qqaku.com/20220510/AcX1YZp7/index.m3u8": {
                    "src": "https://new.qqaku.com/20220510/AcX1YZp7/index.m3u8",
                    "name": "《运河风流》第27集在线观看_电视剧_虎鱼影院",
                    "description": "运河风流免费播放全集,运河风流大结局剧情介绍:山东济宁,南临微山,东辖曲阜,大运河穿城而过,得交通之便,浸孔孟之风,自古以来政要商贾云集、文人雅..",
                    "from": "https://www.021huyu.com/bofang/17398-0-26.html",
                    "times": "2022-09-19"
                }
            }
        """
        try:
            # get_m3u8_src.save_html(json_str)
            # file_name = 'save/save_2022-09-20_m3u8_srcs_info.json'
            # file_name = self
            #
            # with open(file_name, 'r', encoding='utf-8') as f:
            #     loacl_f = f.read()
            #     json_file = json.loads(loacl_f)
            #     f.close()
            #     pass
            # json_file = json.dumps(self, ensure_ascii=False)
            json_file = json_f
            if len(json_file) > 0:
                # the *p templates below are an alternate version (a friend's layout)
                demo0 = """
                <!DOCTYPE html>
                    <html>
                    <head>
                    <meta charset="utf-8">
                    <meta http-equiv="refresh" content="16">
                    <title>m3u8资源
                """

                demo1 = """
                清爽观影(www.qsbox.cn)</title>
                  <meta name="viewport" content="width=device-width, initial-scale=1">
                  <link href="https://cdn.staticfile.org/twitter-bootstrap/5.1.1/css/bootstrap.min.css" rel="stylesheet">
                  <script src="https://cdn.staticfile.org/twitter-bootstrap/5.1.1/js/bootstrap.bundle.min.js"></script>
                    </head>
                    <body>
                    <div>
                    <center><p><h2><a href="https://www.qsbox.cn" target="_blank">欢迎使用清爽搜索盒子</a></h2></p>
                    <h1>以下是 """

                demo1p = """
                清爽观影</title>
                  <meta name="viewport" content="width=device-width, initial-scale=1">
                  <link href="https://cdn.staticfile.org/twitter-bootstrap/5.1.1/css/bootstrap.min.css" rel="stylesheet">
                  <script src="https://cdn.staticfile.org/twitter-bootstrap/5.1.1/js/bootstrap.bundle.min.js"></script>
                    </head>
                    <body>
                    <div>
                    <center><p><h2>欢迎使用清爽观影媒体资源提取工具,仅供测试,请及时销毁</h2></p>
                    <h1>以下是 """

                demo1a = """
                                的搜索结果</h1>

                        <div class="container-fluid">
                            <div class="row" id="demo">
                                <p id="demo1"></p>
                            </div>
                        </div>
                    </div>
                    <script>

                    var myObj, i, x = "", n=0, m;
                    var myObj = 
                """

                demo2 = """;

                    for (i in myObj) {
                        n = n + 1;
                        m = n%2;
                        if (m == 1){
                            x += "<div class='col-6 col-lg-4 pt-2 my-3 '>";
                        }else{
                            x += "<div class='col-6 col-lg-4 pt-2 my-3 border bg-light text-dark'>";
                        }

                        x +=  " <a href='https://play.panjinhe.cn/player/index.php?name="+ myObj[i].title +"m3u8%E5%9C%A8%E7%BA%BF%E6%92%AD%E6%94%BE%E5%99%A8&pic=&site=" + myObj[i].m3u8_src + "' target='__blank'>";
                        x += myObj[i].title + "</a>&nbsp;</br>";

                        x += "地址:<a id='src"+ n +"' href='" + myObj[i].m3u8_src + "' target='__blank'>";
                        x += "" + myObj[i].m3u8_src + "</a></br>";

                        x += "资源" + n + ".&nbsp; <a style='color:red;' href='https://play.panjinhe.cn/player/index.php?name="+ myObj[i].title +"m3u8%E5%9C%A8%E7%BA%BF%E6%92%AD%E6%94%BE%E5%99%A8&pic=&site=" + myObj[i].m3u8_src + "' target='__blank'>";
                        x += "清爽观影" + "</a>&nbsp;";

                        x += "<a style='color:red;' href='"+ myObj[i].url_from + "' target='__blank'>";
                        x += "访问原站点" + "</a>&nbsp;";	

                        x += "<a style='color:red;' href='http://"+ myObj[i].domain + "' target='__blank'>";
                        x += "来源:" + myObj[i].domain + "</a>&nbsp;";	

                        x += "<span style='display:none;'> <a style='color:red;' class='goDouyin'>";
                        x += "复制资源地址<textarea style='display:none;'>" + myObj[i].src+ "</textarea></a></span></br>";	

                        x += "</div>";
                    }
                    x += "<center>本次获得链接总数:" + n + "个";  
                    document.getElementById("demo").innerHTML = x;
                    </script>

                    <script type="text/javascript">
                      $(document).on("click", ".goDouyin", function() {
                        var Url2=$(this).find('textarea');
                        Url2.select(); // 选择对象用户定义的代码区域
                        document.execCommand("Copy"); //原生copy方法执行浏览器复制命令
                        if( document.execCommand("Copy")==true){
                             layer.msg('复制成功'); //弹窗
                        }
                        });
                      </script>
                    <center>
                    © 2013 -
                """

                demo2p = """;

                    for (i in myObj) {
                        n = n + 1;
                        m = n%2;
                        if (m == 1){
                            x += "<div class='col-xs-12 col-sm-8 col-lg-6 pt-2 my-3 '>";
                        }else{
                            x += "<div class='col-xs-12 col-sm-8 col-lg-6 pt-2 my-3 border bg-light text-dark'>";
                        }

                        x +=  " <a href='https://www.m3u8play.com/?play=" + myObj[i].src + "' target='__blank'>";
                        x += myObj[i].name + "</a>&nbsp;</br>";

                        x += "媒体地址:<a id='src"+ n +"' href='" + myObj[i].src + "' target='__blank'>";
                        x += "" + myObj[i].src + "</a></br>";

                        x += "资源" + n + ".&nbsp; <a style='color:red;' href='https://www.icesun.cn/tools/player-m3u8.php?url=" + myObj[i].src + "' target='__blank'>";
                        x += "清爽观影" + "</a>&nbsp;";

                        x += "<a style='color:red;' href='"+ myObj[i].from + "' target='__blank'>";
                        x += "访问原站点" + "</a>&nbsp;";	

                        x += "<span style='display:none;'> <a style='color:red;' class='goDouyin'>";
                        x += "复制资源地址<textarea style='display:none;'>" + myObj[i].src+ "</textarea></a></span></br>";	

                        x += "</div>";
                    }
                    x += "<p><center>本次获得链接总数:<span style='color: red; font-weight:bold; font-size:1.6em;'>" + n + "个<span><p>";  
                    document.getElementById("demo").innerHTML = x;
                    </script>

                    <script type="text/javascript">
                      $(document).on("click", ".goDouyin", function() {
                        var Url2=$(this).find('textarea');
                        Url2.select(); // 选择对象用户定义的代码区域
                        document.execCommand("Copy"); //原生copy方法执行浏览器复制命令
                        if( document.execCommand("Copy")==true){
                             layer.msg('复制成功'); //弹窗
                        }
                        });
                      </script>
                    <center>
                        </p><h1>m3u8在线播放器</h1><p>

                        <a href='https://linqingping.github.io/M3U8-player' target='_blank'>   https://linqingping.github.io/M3U8-player/#</a></br>

                        <a href='http://tool.liumingye.cn/m3u8/' target='_blank'>     http://tool.liumingye.cn/m3u8/</a></br>

                        <a href='https://www.hlsplayer.net/' target='_blank'>     https://www.hlsplayer.net/</a></br>

                        <a href='https://m3u8-player.com/' target='_blank'>     https://m3u8-player.com/</a></br>

                        <a href='https://m3u8.looks.wang/' target='_blank'>    https://m3u8.looks.wang/</a></br>

                        <a href='http://m3u8player.lantianye3.top/' target='_blank'>     http://m3u8player.lantianye3.top/</a></br>

                        <a href='http://tool.pfan.cn/m3u8/' target='_blank'>     http://tool.pfan.cn/m3u8/</a></br>

                        <a href='https://www.m3u8play.com/' target='_blank'>     https://www.m3u8play.com/</a></br>

                        <a href='http://www.m3u8player.top/' target='_blank'>    http://www.m3u8player.top/</a></br>

                        <a href='https://meetpasser.com/webplayer/' target='_blank'>     https://meetpasser.com/webplayer/</a></br>

                        <a href='https://www.icesun.cn/tools/video-player.php' target='_blank'>    https://www.icesun.cn/tools/video-player.php</a></br>
                        <p>
                        如需播放,可复制媒体文件地址到m3u8在线播放器中粘贴地址播放,谢谢</br>
                        测试文件地址: https://new.qqaku.com/20220914/02PA7bkH/index.m3u8 <p>

                    © 2013 -
                """

                demo3 = """
                  <a href='https://www.qsbox.cn' target='_blank'>  www.qsbox.cn 清爽观影 </a></br>
                    </body>
                    </html>
                """

                demo3p = """
                  清爽观影
                    </body>
                    </html>
                """
                demo = demo0 + f'{str(datetime.datetime.now())[:10]}' + demo1 + f'{str(datetime.datetime.now().strftime("%Y年%m月%d日"))}' + demo1a + str(
                    json_file) + demo2 + str(datetime.datetime.now().strftime("%Y")) + demo3;
                # demo = demo0 + f'{str(datetime.datetime.now())[:10]}' + demo1p + f'{str(datetime.datetime.now().strftime("%Y年%m月%d日"))}' + demo1a +  str(json_file) + demo2p + str(datetime.datetime.now().strftime("%Y")) + demo3p;
                # print(demo)
                html_file_name = f'save/0A_{str(datetime.datetime.now())[:10]}结果.html'

                demo = demo0 + str(domain) + demo1 + "域名《" + str(domain) + "》" + demo1a + str(json_file) + demo2 + str(datetime.datetime.now().strftime("%Y")) + demo3;
                # demo = demo0 + str(kws_link) + demo1p + "关键词《" + str(kws_link) + "》" + demo1a + str(json_file) + demo2p + str(datetime.datetime.now().strftime("%Y")) + demo3p;
                html_file_name = f'save/0A_{domain}结果.html'
                with open(html_file_name, 'w', encoding="utf-8") as f:
                    f.write(demo)
                    f.close()
        except:
            pass


Scrapy won't fetch data for a repeated URL: dont_filter=True

The likely cause: you asked Scrapy to request a URL it had already requested.

Scrapy ships with a duplicate-request filter, and it is enabled by default.

Add the dont_filter=True argument so that Scrapy does not filter out the duplicate request.

When Scrapy enters parse it has already requested start_urls[0] once by default, so when you request start_urls[0] again inside parse, Scrapy silently filters out the duplicate URL and never issues the request; that is why parse2 is never called.

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "test"
    # allowed_domains = ["https://www.baidu.com/"]

    start_urls = ["https://www.baidu.com/"]

    def parse(self, response):
        yield scrapy.Request(self.start_urls[0], callback=self.parse2, dont_filter=True)

    def parse2(self, response):
        print(response.url)
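
To relax deduplication globally rather than per request, the duplicate filter can also be swapped out via the DUPEFILTER_CLASS setting; a minimal sketch for settings.py:

# settings.py
# the default is 'scrapy.dupefilters.RFPDupeFilter' (request-fingerprint based filtering);
# BaseDupeFilter performs no filtering at all, so every request goes through
DUPEFILTER_CLASS = 'scrapy.dupefilters.BaseDupeFilter'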

Notes on the components of the Scrapy architecture:

ScrapyEngine: the engine. It controls how data flows between all the other components and triggers events as actions occur. It is the "brain" of the crawler and the scheduling centre of the whole framework.
Scheduler: the scheduler. It receives requests from the engine and enqueues them. The initial URLs and the follow-up URLs found while crawling are placed in the scheduler to wait their turn, and the scheduler automatically drops duplicate URLs.
Downloader: the downloader. It fetches page content and hands it to the engine, which passes it on to the spider.
Spider: user-written code that parses responses, extracts items, and collects additional URLs to follow. Follow-up URLs are handed to the ScrapyEngine and added to the Scheduler. Each spider handles one specific site (or a small set of sites).
ItemPipeline: processes the items extracted by the spider. Once the data parsed from a page is stored in an Item, it is sent through the pipelines in the configured order.
DownloaderMiddlewares: downloader middleware. Specific hooks between the engine and the downloader that process the requests and responses passing between them. They provide a simple mechanism for extending Scrapy by plugging in custom code, for example rotating the User-Agent or IP automatically.
SpiderMiddlewares: spider middleware. Specific hooks between the engine and the spider that process the spider's input (responses) and output (items or requests). They offer the same simple plug-in mechanism for extending Scrapy.

The Scrapy data flow:

1. The ScrapyEngine opens a website, finds the Spider that handles it, and asks that Spider for the first URL(s) to crawl;
2. The ScrapyEngine hands the first URL to the Scheduler, where it is queued as a request awaiting dispatch;
3. The ScrapyEngine asks the Scheduler for the next URL to crawl;
4. The Scheduler returns the next URL to the ScrapyEngine, which forwards it to the Downloader through the DownloaderMiddlewares;
5. Once the page has been downloaded, the Downloader builds a Response for it and sends it to the ScrapyEngine through the DownloaderMiddlewares;
6. The ScrapyEngine receives the Response from the Downloader and sends it to the Spider through the SpiderMiddlewares for processing;
7. The Spider processes the Response and returns the extracted Items, plus any new Requests, to the ScrapyEngine;
8. The ScrapyEngine hands the returned Items to the ItemPipeline and the returned Requests to the Scheduler, and the cycle repeats from step 2 until no pending Requests remain in the Scheduler, at which point the ScrapyEngine shuts down.
