Python Scrapy: installation, usage, configuring pipeline, item and settings; a brief look at Scrapy's request-deduplication mechanism; Python: running Scrapy with custom arguments; passing parameters with yield and meta; keeping sessions alive with requests in Python
Commonly used commands
scrapy startproject project_name
cd project_name   # enter the project directory generated from the template
scrapy genspider spider_name domain
# spider_name: the spider's name must be unique within the current project.
# domain: constrains the scope of the crawl (it becomes allowed_domains).
Scrapy spiders are written in an object-oriented style: each spider is a class.
response.xpath("expression") returns a list of Selector objects; to extract an element's attribute or text you call get() or getall() on the result.
get() serializes and returns the first matching node as a single Unicode string; if several nodes match, only the first is returned, and it returns None when nothing matches.
getall() serializes and returns all matching nodes as a list of Unicode strings; it returns an empty list when nothing matches.
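For example, inside a spider callback (a minimal sketch; the XPath expressions are only illustrative):

def parse(self, response):
    # get(): the first matching node as a string, or None when nothing matches
    title = response.xpath('//title/text()').get()
    # getall(): every matching node as a list of strings (an empty list when nothing matches)
    hrefs = response.xpath('//a/@href').getall()
    print(title, len(hrefs))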
Passing parameters with yield and the meta argument
How do you pass parameters during a recursive Scrapy crawl, and avoid the problem where every yielded callback ends up seeing only the last item's data? This is where the meta argument of scrapy.Request comes in; it only accepts a dictionary:
meta={'k1': v1, 'k2': v2}
Usage:
def parse(self, response):
    items = ScrapytestItem()
    items['name'] = 'csdn'
    href = href_domains + item.css('......').extract_first()
    yield Request(
        url=href,
        callback=self.parse_details,
        meta={'items': items},
    )
# Read the parameter back in the callback
def parse_details(self, response):
    items2 = response.meta['items']
Deep copy
import copy
meta={'items': copy.deepcopy(items)}
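A short sketch of why the deep copy matters when yielding requests inside a loop (ScrapytestItem and parse_details reuse the names from the snippet above; the selector is hypothetical, and the method lives inside the spider class):

import copy
import scrapy

def parse(self, response):
    items = ScrapytestItem()          # one item object reused across the whole loop
    for row in response.xpath('//div[@class="row"]'):   # hypothetical selector
        items['name'] = row.xpath('./a/text()').get()
        href = response.urljoin(row.xpath('./a/@href').get())
        # Without the deep copy every callback would receive the same object,
        # which by then only holds the values of the last loop iteration.
        yield scrapy.Request(url=href, callback=self.parse_details,
                             meta={'items': copy.deepcopy(items)})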
https://blog.csdn.net/DL_min/article/details/105593318?
Python: running Scrapy with custom command-line arguments
# Run a spider
scrapy crawl spiderName
# Run with custom arguments
scrapy crawl spiderName -a parameter1=value1 -a parameter2=value2
Reading the arguments
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    # read the argument inside __init__
    num = kwargs.get('num')
    print('init num: ', num)

# read the argument inside an instance method
num = getattr(self, 'num', False)
print('getattr: ', num)
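Putting the two pieces together, a minimal runnable sketch (the spider name, start URL and the num argument are illustrative):

import scrapy

class ArgsSpider(scrapy.Spider):
    name = 'spiderName'
    start_urls = ['http://httpbin.org/get']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # "-a num=5" on the command line arrives here as a keyword argument (always a string)
        self.num = kwargs.get('num')

    def parse(self, response):
        # the same value is also reachable as an instance attribute
        print('num =', getattr(self, 'num', False))

Run it with: scrapy crawl spiderName -a num=5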
https://blog.csdn.net/mouday/article/details/112303043
Overriding start_requests in Scrapy; powerful uses of Python requests
https://blog.csdn.net/sirobot/article/details/105360486
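The linked article covers overriding start_requests; a minimal hedged sketch of the idea (the spider name, URL and headers are placeholders):

import scrapy

class StartRequestsSpider(scrapy.Spider):
    name = 'start_requests_demo'

    def start_requests(self):
        # Overriding start_requests replaces the default handling of start_urls:
        # you yield the initial requests yourself, so you can attach headers,
        # cookies, meta or a POST body before the crawl begins.
        urls = ['http://httpbin.org/get']  # placeholder start URL
        for url in urls:
            yield scrapy.Request(url, callback=self.parse,
                                 headers={'User-Agent': 'Mozilla/5.0'},
                                 dont_filter=True)

    def parse(self, response):
        print(response.url)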
Session persistence with requests in Python
import requests
# Create a session; it keeps cookies alive across requests
session = requests.session()
data = {
    'loginName': 'xxxxxx',      # replace with your own username
    'password': 'xxxxxxxxxx'    # replace with your own password
}
# Log in
url = "https://passport.17k.com/ck/user/login"
result = session.post(url, data=data)
# print(result.text)
# print(result.cookies)
# Request again with the same session to fetch the bookshelf data
url2 = "https://user.17k.com/ck/author/shelf?page=1&appKey=2406394919"
result_data = session.get(url2)
print(result_data.json()['data'])
run.py
# from scrapy import cmdline
# import cmdline
import os
import threading
import time
# line = ('py -m scrapy crawl msrc --nolog'.strip()).split()
# line = ('python -m scrapy crawl getm3u8 --nolog'.strip()).split()
# line = ('python -m scrapy crawl getm3u8'.strip()).split()
# line = ('py -m scrapy crawl getm3u8'.strip()).split()
# Start the MongoDB service (Windows) before crawling
mongo = 'net start MongoDB'
os.system(mongo)
# line = 'py -m scrapy crawl getm3u8'
# line = 'py -m scrapy crawl getm3u8 --nolog'
try:
    item = int(input('How many items should be collected? Enter 0 for no limit, any other number sets the limit\n'))
except Exception:
    item = 2
# Stop the spider automatically after the requested number of items (CLOSESPIDER_ITEMCOUNT)
line = f'py -m scrapy crawl getm3u8 -s CLOSESPIDER_ITEMCOUNT={item} --nolog'
# print(line)
# cmdline.execute(line)
print(line)
# os.system(line)
# Run the crawl command in a background thread so the console is not blocked
def runline():
    os.system(line)
t = threading.Thread(target=runline, args=())
t.start()
time.sleep(10)
input('\n\n\n\nfinished waiting\n')
The spider file
import re
import scrapy
import urllib.parse
import datetime
import requests
import time
class Getm3u8Spider(scrapy.Spider):
    name = 'getm3u8'
    url = 'https://137aaa.com/index.php/vod/play/id/66625/sid/1/nid/1.html'  # example default, overwritten by the prompt below
    url = (str(input('Enter the site to scan for m3u8 resource URLs\n'))).strip()
    if url[:4] != 'http':
        url = 'http://' + url.strip()
    # Derive the bare domain (e.g. example.com) from the URL that was entered
    url_split = url.split('/')
    domain_o = url_split[2]
    domain_o = domain_o.split(":")[0]
    domain_s = domain_o.split('.')
    domain = domain_s[-2] + '.' + domain_s[-1]
    # allowed_domains = ['137aaa.com']
    allowed_domains = [domain, ]
    start_urls = [url, ]
    my_usefull_domains = [domain, 'xjzyplay.com', 'cdn.xjzyplay.com']
    my_forbidden_domains = []
    my_items_count = 0
    my_forbidden_domains_count = {}  # failure count per unreachable domain

    def parse(self, response):
        url_from = response.request.url
        # print('Analyzing:', url_from)
        text = response.text
        txt = text.replace('\\', '')
        # Collect every m3u8/mp4/... resource URL found in the page source
        urls_m3u8_list = re.findall(r'''(https?:\\?/\\?/[^!"<>',]*?\.(?:m3u8|mp4|avi|mov|mpeg|mp3))''', txt,
                                    re.S | re.M | re.I)
        # Recursive crawl: follow the links on this page that stay inside the target domain
        hrefs = response.xpath('//a/@href').getall()
        hrefs = list(set(hrefs))
        a = 1
        for u in hrefs:
            if u[:10] != 'javascript':
                if u[:4] != 'http':
                    # turn a relative link into an absolute one
                    url_split = url_from.split('/')
                    ht = url_split[0] + '//' + url_split[2]
                    u = ht + u
                u1 = u.split('/')
                if self.domain in u and len(u1) > 4:
                    # print('start to get url: ', u)
                    yield scrapy.Request(url=u, callback=self.parse, dont_filter=False,)
                    a = a + 1
                    if a > 3:
                        # break
                        pass
                    time.sleep(0.1)
        title = response.xpath('//title/text()').get()
        title_u = urllib.parse.quote(title)
        keywords = response.xpath('//meta[@name="keywords"]/@content').get()
        description = response.xpath('//meta[@name="description"]/@content').get()
        hrefs = response.xpath('//a/@href').getall()
        save_time = (str(datetime.datetime.now())[:-7]).replace(" ", "_")
        # Deduplicate the resource URLs
        urls_m3u8_list = list(set(urls_m3u8_list))
        for m3u8_src in urls_m3u8_list:
            # Normalize links that were concatenated twice (player pages embedding the real URL)
            m3u8_src = 'https://dhjjhkfhjk' + m3u8_src
            abcd = re.findall('https://', m3u8_src, re.S | re.I | re.M)
            if len(abcd) > 1:
                m3u8_src = 'https://' + m3u8_src.split('https://')[-1]
            abcde = re.findall('http://', m3u8_src, re.S | re.I | re.M)
            if len(abcde) > 1:
                m3u8_src = 'http://' + m3u8_src.split('http://')[-1]
            m3u8_domains = m3u8_src.split('/')[2]
            try:
                # Speed-up: skip the probe for domains whose status is already known
                if m3u8_domains in self.my_usefull_domains:
                    # domain already verified as reachable
                    m3u8_src_status = 200
                elif m3u8_domains in self.my_forbidden_domains:
                    # domain already known to be unreachable
                    m3u8_src_status = 404
                else:
                    try:
                        # Probe the resource URL to check that it is reachable
                        hd = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36"}
                        m3u8_src_status = requests.get(m3u8_src, headers=hd, timeout=8)
                        m3u8_src_status = m3u8_src_status.status_code
                    except Exception as exp:
                        # timeout / connection failure
                        print(exp)
                        m3u8_src_status = 10101
                if m3u8_src_status == 200:
                    str_replace = [' ', '&', '/', '\\', '=', '[', ']', '(', ')', '{', '}', '*', '#', '@', '!', ':', "'", '"', ',', '“',
                                   '‘', ',', ':', '!']
                    try:
                        for i in str_replace:
                            title = title.replace(i, '')
                            title_u = title_u.replace(i, '')
                            keywords = keywords.replace(i, '')
                            description = description.replace(i, '')
                    except Exception:
                        pass
                    play_src = f'https://play.panjinhe.cn/player/index.php?name={title}&pic=&site={m3u8_src}'
                    items = {}
                    items['m3u8_src'] = m3u8_src
                    items['play_src'] = play_src
                    items['url_from'] = url_from
                    items['title'] = title
                    items['title_u'] = title_u
                    items['description'] = description
                    items['keywords'] = keywords
                    items['domain'] = self.domain
                    items['save_time'] = save_time
                    items['m3u8_domains'] = m3u8_domains
                    yield items
                    print(self.domain, 'resources collected:', self.my_items_count, datetime.datetime.now())
                    print(title)
                    self.my_items_count = self.my_items_count + 1
                    self.my_usefull_domains.append(m3u8_domains)
                    self.my_usefull_domains = list(set(self.my_usefull_domains))  # dedupe the list
                elif m3u8_src_status == 404:
                    self.my_forbidden_domains.append(m3u8_domains)
                    self.my_forbidden_domains = list(set(self.my_forbidden_domains))
                else:
                    try:
                        # count failures per domain; after 3 failures blacklist the domain
                        try:
                            self.my_forbidden_domains_count[m3u8_domains] = self.my_forbidden_domains_count[m3u8_domains] + 1
                        except Exception:
                            self.my_forbidden_domains_count[m3u8_domains] = 1
                        print('\n\n\n*\n', m3u8_domains, self.my_forbidden_domains_count[m3u8_domains])
                        if self.my_forbidden_domains_count[m3u8_domains] > 3:
                            self.my_forbidden_domains.append(m3u8_domains)
                            self.my_forbidden_domains = list(set(self.my_forbidden_domains))
                    except Exception as exp:
                        print(exp)
            except Exception as exp:
                print(exp)
items.py
import scrapy
class Getallm3U8Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    m3u8_src = scrapy.Field()
    play_src = scrapy.Field()
    url_from = scrapy.Field()
    title = scrapy.Field()
    title_u = scrapy.Field()
    description = scrapy.Field()
    keywords = scrapy.Field()
    domain = scrapy.Field()
    save_time = scrapy.Field()
    m3u8_domains = scrapy.Field()
settings.py
# Obey robots.txt rules (enabled by default; disabled here)
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32
# Random User-Agent mode
RANDOM_UA_TYPE = "random"
# Downloader middleware settings
DOWNLOADER_MIDDLEWARES = {
    'getAllM3u8.middlewares.RandomUserAgentMiddlemares': 543,
}
# Item pipelines (lower numbers run first; distinct values make the execution order explicit)
ITEM_PIPELINES = {
    'getAllM3u8.pipelines.Getallm3U8Pipeline': 300,
    'getAllM3u8.pipelines.MongoDBPipeline': 301,
    'getAllM3u8.pipelines.BaikePipeline': 302,
}
middlewares.py
# Random User-Agent, approach 1
# pip install fake_useragent
# import the UserAgent class
import random
from fake_useragent import UserAgent

# Downloader middleware that swaps in a random User-Agent on every request
class RandomUserAgentMiddlemares(object):
    ua = UserAgent()

    def process_request(self, request, spider):
        user_agent = self.ua.random
        request.headers["User-Agent"] = user_agent

# Random User-Agent, approach 2: pick from a hand-made list
class RandomUserAgentMiddlemares1(object):
    # User-Agent strings collected from the web
    agents = [
"Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)",
"Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
"Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10",
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)",
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)",
"Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",]

    def process_request(self, request, spider):
        user_agent = random.choice(self.agents)
        request.headers["User-Agent"] = user_agent
pipelines.py
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import datetime
import time
import os
# Save items to MongoDB (a schema-less, non-relational database)
import json
import pymongo
class MongoDBPipeline(object):
    DB_URL = 'mongodb://localhost:27017/'  # DB_URL / DB_NAME are hard-coded here; they could also be configured in settings.py
    DB_NAME = 'm3u8db'

    def __init__(self):
        # connect to the database server
        # self.client = pymongo.MongoClient(host='localhost', port=27017)
        self.client = pymongo.MongoClient(self.DB_URL)
        # select (create) the database
        self.db = self.client['m3u8db']
        # select (create) the collection
        self.table = self.db['m3u8tb']

    def process_item(self, item, spider):
        # Newer pymongo versions no longer support insert(); use insert_one() for a single
        # document or insert_many() for a batch.
        self.table.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
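As the comment above notes, the connection details can instead live in settings.py; a hedged sketch using from_crawler (the MONGO_URI / MONGO_DB setting names are assumptions, not part of the original project):

import pymongo  # already imported at the top of pipelines.py

class MongoDBSettingsPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # read the connection info from settings.py, e.g.
        # MONGO_URI = 'mongodb://localhost:27017/'
        # MONGO_DB = 'm3u8db'
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017/'),
            mongo_db=crawler.settings.get('MONGO_DB', 'm3u8db'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        self.db['m3u8tb'].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()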
# Save items to a txt or JSON file
class BaikePipeline(object):
    # fname = 'my_m3u8_src1.json'
    # def open_spider(self, spider):
    #     # self.file = open('items.txt', 'w', encoding="utf-8")
    #     # self.file = open(self.fname, 'a+', encoding="utf-8")
    #     pass
    #
    # def close_spider(self, spider):
    #     self.file.close()

    # The item is converted to a plain dict here and has to be converted back when it is used
    # later; wrapping the work in try/except mainly keeps one bad item from aborting the run.
    def process_item(self, item, spider):
        res = dict(item)
        # res = json.dumps(res)
        m3u8_src = res['m3u8_src']
        domain = res['domain']
        # fname = 'my_m3u8_src1.json'
        fname = f'save/m3u8_{domain}.json'
        # make sure the save/ directory exists before writing into it
        if not os.path.isfile(fname):
            try:
                os.makedirs('save')
            except Exception:
                pass
        try:
            try:
                # read whatever has been saved for this domain so far
                with open(fname, 'r', encoding="utf-8") as f:
                    f_local = f.read()
                dict_item_local = dict(json.loads(f_local))
            except Exception:
                dict_item_local = {}
            out_text = {}
            out_text[m3u8_src] = res
            # merge old and new dictionaries (keyed by resource URL, so duplicates collapse)
            out_text = dict(dict_item_local, **out_text)
            json_f = out_text
            out_text = json.dumps(out_text, indent=4, ensure_ascii=False)
            with open(fname, 'w', encoding='utf-8') as f:
                f.write(out_text)
            self.save_html(json_f=json_f, domain=domain)
        except Exception:
            pass
    # Generate a local HTML results page listing the collected resources
    def save_html(self, file_name='', json_f='', domain=''):
        time.sleep(5)
        """
https://www.cnblogs.com/ivkeji/p/14491959.html
json_file = {
"https://new.qqaku.com/20220919/u2tRwIl3/index.m3u8": {
"src": "https://new.qqaku.com/20220919/u2tRwIl3/index.m3u8",
"name": "罚罪-第36集在线观看-连续剧 - 月亮电影网",
"description": "《罚罪》部分取材于真实事件,以一桩恶性案件为切入口,通过青年刑警常征的视角,讲述出两代公安干警为维护一方安宁,扫除犯罪团伙,不畏艰险、前赴后继的英勇故事。在昌武(虚构地)这座小城,在危机重重的战斗第一",
"from": "https://www.zqzdzj.com/vod/129111-2-36.html",
"times": "2022-09-19"
},
"https://56z.cc/default.php?url=https://sod.bunediy.com/20220825/wxOaBtQK/index.m3u8": {
"src": "https://56z.cc/default.php?url=https://sod.bunediy.com/20220825/wxOaBtQK/index.m3u8",
"name": "罚罪第04集_电视剧完整版_免费在线观看_爱碟影院",
"description": "电视剧《罚罪》第04集免费在线观看",
"from": "https://aidie.cc/play-90051-8-4/",
"times": "2022-09-19"
},
"https://new.qqaku.com/20220510/AcX1YZp7/index.m3u8": {
"src": "https://new.qqaku.com/20220510/AcX1YZp7/index.m3u8",
"name": "《运河风流》第27集在线观看_电视剧_虎鱼影院",
"description": "运河风流免费播放全集,运河风流大结局剧情介绍:山东济宁,南临微山,东辖曲阜,大运河穿城而过,得交通之便,浸孔孟之风,自古以来政要商贾云集、文人雅..",
"from": "https://www.021huyu.com/bofang/17398-0-26.html",
"times": "2022-09-19"
}
}
"""
        try:
            # get_m3u8_src.save_html(json_str)
            # file_name = 'save/save_2022-09-20_m3u8_srcs_info.json'
            # with open(file_name, 'r', encoding='utf-8') as f:
            #     loacl_f = f.read()
            #     json_file = json.loads(loacl_f)
            # json_file = json.dumps(self, ensure_ascii=False)
            json_file = json_f
            if len(json_file) > 0:
                # the "p" variants below are a friend's version of the page template
                demo0 = """
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="refresh" content="16">
<title>m3u8资源
"""
                demo1 = """
清爽观影(www.qsbox.cn)</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
<link href="https://cdn.staticfile.org/twitter-bootstrap/5.1.1/css/bootstrap.min.css" rel="stylesheet">
<script src="https://cdn.staticfile.org/twitter-bootstrap/5.1.1/js/bootstrap.bundle.min.js"></script>
</head>
<body>
<div>
<center><p><h2><a href="https://www.qsbox.cn" target="_blank">欢迎使用清爽搜索盒子</a></h2></p>
<h1>以下是 """
                demo1p = """
清爽观影</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
<link href="https://cdn.staticfile.org/twitter-bootstrap/5.1.1/css/bootstrap.min.css" rel="stylesheet">
<script src="https://cdn.staticfile.org/twitter-bootstrap/5.1.1/js/bootstrap.bundle.min.js"></script>
</head>
<body>
<div>
<center><p><h2>欢迎使用清爽观影媒体资源提取工具,仅供测试,请及时销毁</h2></p>
<h1>以下是 """
                demo1a = """
的搜索结果</h1>
<div class="container-fluid">
<div class="row" id="demo">
<p id="demo1"></p>
</div>
</div>
</div>
<script>
var myObj, i, x = "", n=0, m;
var myObj =
"""
                demo2 = """;
for (i in myObj) {
n = n + 1;
m = n%2;
if (m == 1){
x += "<div class='col-6 col-lg-4 pt-2 my-3 '>";
}else{
x += "<div class='col-6 col-lg-4 pt-2 my-3 border bg-light text-dark'>";
}
x += " <a href='https://play.panjinhe.cn/player/index.php?name="+ myObj[i].title +"m3u8%E5%9C%A8%E7%BA%BF%E6%92%AD%E6%94%BE%E5%99%A8&pic=&site=" + myObj[i].m3u8_src + "' target='__blank'>";
x += myObj[i].title + "</a> </br>";
x += "地址:<a id='src"+ n +"' href='" + myObj[i].m3u8_src + "' target='__blank'>";
x += "" + myObj[i].m3u8_src + "</a></br>";
x += "资源" + n + ". <a style='color:red;' href='https://play.panjinhe.cn/player/index.php?name="+ myObj[i].title +"m3u8%E5%9C%A8%E7%BA%BF%E6%92%AD%E6%94%BE%E5%99%A8&pic=&site=" + myObj[i].m3u8_src + "' target='__blank'>";
x += "清爽观影" + "</a> ";
x += "<a style='color:red;' href='"+ myObj[i].url_from + "' target='__blank'>";
x += "访问原站点" + "</a> ";
x += "<a style='color:red;' href='http://"+ myObj[i].domain + "' target='__blank'>";
x += "来源:" + myObj[i].domain + "</a> ";
x += "<span style='display:none;'> <a style='color:red;' class='goDouyin'>";
x += "复制资源地址<textarea style='display:none;'>" + myObj[i].src+ "</textarea></a></span></br>";
x += "</div>";
}
x += "<center>本次获得链接总数:" + n + "个";
document.getElementById("demo").innerHTML = x;
</script>
<script type="text/javascript">
$(document).on("click", ".goDouyin", function() {
var Url2=$(this).find('textarea');
Url2.select(); // 选择对象用户定义的代码区域
document.execCommand("Copy"); //原生copy方法执行浏览器复制命令
if( document.execCommand("Copy")==true){
layer.msg('复制成功'); //弹窗
}
});
</script>
<center>
© 2013 -
"""
                demo2p = """;
for (i in myObj) {
n = n + 1;
m = n%2;
if (m == 1){
x += "<div class='col-xs-12 col-sm-8 col-lg-6 pt-2 my-3 '>";
}else{
x += "<div class='col-xs-12 col-sm-8 col-lg-6 pt-2 my-3 border bg-light text-dark'>";
}
x += " <a href='https://www.m3u8play.com/?play=" + myObj[i].src + "' target='__blank'>";
x += myObj[i].name + "</a> </br>";
x += "媒体地址:<a id='src"+ n +"' href='" + myObj[i].src + "' target='__blank'>";
x += "" + myObj[i].src + "</a></br>";
x += "资源" + n + ". <a style='color:red;' href='https://www.icesun.cn/tools/player-m3u8.php?url=" + myObj[i].src + "' target='__blank'>";
x += "清爽观影" + "</a> ";
x += "<a style='color:red;' href='"+ myObj[i].from + "' target='__blank'>";
x += "访问原站点" + "</a> ";
x += "<span style='display:none;'> <a style='color:red;' class='goDouyin'>";
x += "复制资源地址<textarea style='display:none;'>" + myObj[i].src+ "</textarea></a></span></br>";
x += "</div>";
}
x += "<p><center>本次获得链接总数:<span style='color: red; font-weight:bold; font-size:1.6em;'>" + n + "个<span><p>";
document.getElementById("demo").innerHTML = x;
</script>
<script type="text/javascript">
$(document).on("click", ".goDouyin", function() {
var Url2=$(this).find('textarea');
Url2.select(); // 选择对象用户定义的代码区域
document.execCommand("Copy"); //原生copy方法执行浏览器复制命令
if( document.execCommand("Copy")==true){
layer.msg('复制成功'); //弹窗
}
});
</script>
<center>
</p><h1>m3u8在线播放器</h1><p>
<a href='https://linqingping.github.io/M3U8-player' target='_blank'> https://linqingping.github.io/M3U8-player/#</a></br>
<a href='http://tool.liumingye.cn/m3u8/' target='_blank'> http://tool.liumingye.cn/m3u8/</a></br>
<a href='https://www.hlsplayer.net/' target='_blank'> https://www.hlsplayer.net/</a></br>
<a href='https://m3u8-player.com/' target='_blank'> https://m3u8-player.com/</a></br>
<a href='https://m3u8.looks.wang/' target='_blank'> https://m3u8.looks.wang/</a></br>
<a href='http://m3u8player.lantianye3.top/' target='_blank'> http://m3u8player.lantianye3.top/</a></br>
<a href='http://tool.pfan.cn/m3u8/' target='_blank'> http://tool.pfan.cn/m3u8/</a></br>
<a href='https://www.m3u8play.com/' target='_blank'> https://www.m3u8play.com/</a></br>
<a href='http://www.m3u8player.top/' target='_blank'> http://www.m3u8player.top/</a></br>
<a href='https://meetpasser.com/webplayer/' target='_blank'> https://meetpasser.com/webplayer/</a></br>
<a href='https://www.icesun.cn/tools/video-player.php' target='_blank'> https://www.icesun.cn/tools/video-player.php</a></br>
<p>
如需播放,可复制媒体文件地址到m3u8在线播放器中粘贴地址播放,谢谢</br>
测试文件地址: https://new.qqaku.com/20220914/02PA7bkH/index.m3u8 <p>
© 2013 -
"""
                demo3 = """
<a href='https://www.qsbox.cn' target='_blank'> www.qsbox.cn 清爽观影 </a></br>
</body>
</html>
"""
                demo3p = """
清爽观影
</body>
</html>
"""
                # the date-based variants below are superseded by the domain-based ones that follow
                # demo = demo0 + f'{str(datetime.datetime.now())[:10]}' + demo1 + f'{str(datetime.datetime.now().strftime("%Y年%m月%d日"))}' + demo1a + str(json_file) + demo2 + str(datetime.datetime.now().strftime("%Y")) + demo3
                # demo = demo0 + f'{str(datetime.datetime.now())[:10]}' + demo1p + f'{str(datetime.datetime.now().strftime("%Y年%m月%d日"))}' + demo1a + str(json_file) + demo2p + str(datetime.datetime.now().strftime("%Y")) + demo3p
                # html_file_name = f'save/0A_{str(datetime.datetime.now())[:10]}结果.html'
                demo = demo0 + str(domain) + demo1 + "域名《" + str(domain) + "》" + demo1a + str(json_file) + demo2 + str(datetime.datetime.now().strftime("%Y")) + demo3
                # demo = demo0 + str(kws_link) + demo1p + "关键词《" + str(kws_link) + "》" + demo1a + str(json_file) + demo2p + str(datetime.datetime.now().strftime("%Y")) + demo3p
                html_file_name = f'save/0A_{domain}结果.html'
                with open(html_file_name, 'w', encoding="utf-8") as f:
                    f.write(demo)
        except Exception:
            pass
With Scrapy, no data comes back for a repeated URL: dont_filter=True
The likely cause: you asked Scrapy to request a URL it has already requested.
Scrapy has built-in duplicate filtering, and it is enabled by default.
Add the dont_filter=True argument so that Scrapy does not filter out the repeated request.
When Scrapy enters parse it has already requested start_urls[0] once; if you request start_urls[0] again inside parse, Scrapy silently filters out the duplicate URL and never schedules the request, which is why parse2 is never called.
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "test"
    # allowed_domains = ["https://www.baidu.com/"]
    start_urls = ["https://www.baidu.com/"]

    def parse(self, response):
        yield scrapy.Request(self.start_urls[0], callback=self.parse2, dont_filter=True)

    def parse2(self, response):
        print(response.url)
Notes on the components of the Scrapy architecture:
Scrapy Engine: the engine. It controls how data flows between all the other components and triggers events when particular actions occur. It is the "brain" of the crawler and the scheduling centre of the whole framework.
Scheduler: receives requests from the engine and enqueues them. The initial URLs and every follow-up URL discovered on crawled pages wait in the scheduler to be crawled; the scheduler automatically drops duplicate URLs.
Downloader: fetches page data and hands it to the engine, which passes it on to the spider.
Spider: user-written code that parses responses, extracts items and finds additional URLs to follow. The follow-up URLs are handed to the Scrapy Engine and added to the Scheduler. Each spider is responsible for one specific site (or a few).
Item Pipeline: processes the items extracted by the spider. Once the data parsed from a page has been stored in an Item, it is sent through the pipelines in the configured order.
Downloader Middlewares: specific hooks between the engine and the downloader that process the requests and responses passing between them. They provide a simple mechanism for extending Scrapy by plugging in custom code, for example rotating the user-agent or the IP address automatically.
Spider Middlewares: specific hooks between the engine and the spiders that process the spider's input (responses) and output (items and requests). They offer the same simple mechanism for extending Scrapy with custom code.
Scrapy data flow:
1. The Scrapy Engine opens a website, finds the spider that handles it and asks that spider for the first URL(s) to crawl.
2. The Scrapy Engine takes the first URL to crawl and places it in the Scheduler as a request, ready to be scheduled.
3. The Scrapy Engine asks the Scheduler for the next URL to crawl.
4. The Scheduler returns the next URL, and the Scrapy Engine forwards it to the Downloader through the Downloader Middlewares.
5. Once the page has been downloaded, the Downloader builds a Response for it and sends it back to the Scrapy Engine through the Downloader Middlewares.
6. The Scrapy Engine receives the Response from the Downloader and passes it to the Spider through the Spider Middlewares for processing.
7. The Spider processes the Response and returns the extracted Items plus any new Requests to the Scrapy Engine.
8. The Scrapy Engine hands the returned Items to the Item Pipeline and the returned Requests to the Scheduler, repeating the cycle from step 2 until no pending requests remain in the Scheduler, at which point the Scrapy Engine shuts down.