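Today's goal: crawl the CVPR 2018 papers, work out which keywords appear most often, display them as a wordCloud chart, and show the matching papers and their links when a keyword is clicked.
The task breaks down into three steps:
① Crawl the title, abstract, keywords, and paper link of each CVPR 2018 paper
② Render the crawled data as a wordCloud chart
③ Add a click event that shows the papers and links for the clicked keyword
1. Crawler implementation
The paper pages do not list keywords, so each title is split into words and the words are stored as a comma-separated keyword string.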
import requests
from bs4 import BeautifulSoup
import pymysql

# Browser-like User-Agent header sent with every request
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
url = 'http://openaccess.thecvf.com/CVPR2018.py'
r = requests.get(url, headers=headers)
content = r.content.decode('utf-8')
soup = BeautifulSoup(content, 'html.parser')
# Each <dt class="ptitle"> on the index page holds one paper title and the link to its detail page
dts = soup.find_all('dt', class_='ptitle')
hts = 'http://openaccess.thecvf.com/'
# Crawl the detail page of every paper
alllist = []
for i in range(len(dts)):
    print('Paper #' + str(i))
    title = dts[i].a.text.strip()
    href = hts + dts[i].a['href']
    r = requests.get(href, headers=headers)
    content = r.content.decode('utf-8')
    soup = BeautifulSoup(content, 'html.parser')
    # The abstract sits in <div id="abstract">
    divabstract = soup.find(name='div', attrs={"id": "abstract"})
    abstract = divabstract.text.strip()
    # The fifth <a> on the page is taken as the paper link; [6:] drops the
    # first six characters of its href (presumably a relative '../../' prefix)
    alllink = soup.select('a')
    link = hts + alllink[4]['href'][6:]
    # Use the words of the title as comma-separated keywords
    keywords = ','.join(title.split(' '))
    value = (title, abstract, link, keywords)
    alllist.append(value)
print(alllist)
tuplist = tuple(alllist)
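# Save the data
# PyMySQL 1.0 and later accept these connection settings only as keyword arguments
db = pymysql.connect(host="localhost", user="root", password="fengge666", database="yiqing", charset='utf8')
cursor = db.cursor()
sql_cvpr = "INSERT INTO cvpr values (%s,%s,%s,%s)"
try:
    cursor.executemany(sql_cvpr, tuplist)
    db.commit()
except Exception as e:
    print('Insert failed, rolling back:', e)
    db.rollback()
finally:
    db.close()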
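The keyword step can also be looked at in isolation. The sketch below is a variation on it, not the code used above: it assumes that punctuation should be dropped and that words should be lower-cased, so that "With" and "with" later count as the same keyword. The function name `title_to_keywords` and the sample title are made up for illustration.

```python
import re


def title_to_keywords(title: str) -> str:
    """Turn a paper title into a comma-separated keyword string."""
    # Keep runs of letters, digits, hyphens and apostrophes. This drops
    # punctuation such as ':' that a plain split(' ') would leave attached.
    words = re.findall(r"[A-Za-z0-9][A-Za-z0-9'\-]*", title)
    # Lower-casing is an assumption of this sketch, not part of the crawler above.
    return ",".join(word.lower() for word in words)


print(title_to_keywords("Deep Learning for 3D Vision: A Survey"))
# deep,learning,for,3d,vision,a,survey
```

Normalising at crawl time keeps the later counting and stop-word filtering simpler, at the cost of losing the original capitalisation in the chart.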
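2. Displaying the data as a wordCloud
First, find a package that can render a word cloud chart.
Then load the JSON data from the back end asynchronously and display it.
The data from step one has not been analysed yet, so the DAO layer does the analysis: it counts how often each keyword occurs and sorts the counts in descending order (a Map is a natural fit for storing them).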
public Map<String, Integer> getallmax()
{
    String sql = "select * from cvpr";
    Map<String, Integer> map = new HashMap<>();
    Connection con = null;
    Statement state = null;
    ResultSet rs = null;
    con = DBUtil.getConn();
    try {
        state = con.createStatement();
        rs = state.executeQuery(sql);
        while (rs.next())
        {
            String keywords = rs.getString("keywords");
            String[] split = keywords.split(",");
            for (int i = 0; i < split.length; i++)
            {
                // merge() stores 1 the first time a keyword is seen and adds 1 on every later hit
                map.merge(split[i], 1, Integer::sum);
            }
        }
    } catch (SQLException e) {
        e.printStackTrace();
    } finally {
        DBUtil.close(rs, state, con);
    }
    // Sort by count, descending; the LinkedHashMap preserves that order
    Map<String, Integer> sorted = map
        .entrySet()
        .stream()
        .sorted(Collections.reverseOrder(Map.Entry.comparingByValue()))
        .collect(
            Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, (e1, e2) -> e2,
                LinkedHashMap::new));
    return sorted;
}
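The counting and descending sort done in the DAO method can be sketched in a few lines of Python as well, which is handy for checking the numbers straight after crawling. This is only an illustration: the function name `count_keywords` and the sample rows are invented, and the input is assumed to be the comma-separated keyword strings stored in the `keywords` column.

```python
from collections import Counter


def count_keywords(keyword_rows):
    """Count every keyword across rows of comma-separated keyword strings."""
    counts = Counter()
    for row in keyword_rows:
        # Skip empty fragments left behind by stray commas.
        counts.update(word for word in row.split(",") if word)
    # most_common() returns (keyword, count) pairs, highest count first.
    return counts.most_common()


rows = ["Deep,Learning,for,Vision", "Deep,Video,Learning", "Deep,Stereo"]
print(count_keywords(rows)[:2])
# [('Deep', 3), ('Learning', 2)]
```

A first occurrence counts as 1 here, so a keyword that appears in three titles is reported with a count of 3.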
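In the servlet layer the data still needs filtering: prepositions, articles such as "a", and similar function words should be removed, otherwise they get in the way of the keyword analysis. The top 30 remaining keywords are then sent to the front end for display.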
response.setCharacterEncoding("utf-8"); // the start of this line was cut off in the original; response is assumed
Map<String, Integer> sortMap = dao.getallmax();
JSONArray json = new JSONArray();
// Function words to leave out of the cloud
Set<String> stopWords = new HashSet<>(Arrays.asList(
    "for", "and", "with", "of", "in", "from", "a", "to", "the", "by"));
int k = 0;
for (Map.Entry<String, Integer> entry : sortMap.entrySet())
{
    // Compare in lower case so that "With" and "with" are both filtered out
    if (stopWords.contains(entry.getKey().toLowerCase()))
        continue;
    JSONObject ob = new JSONObject();
    ob.put("name", entry.getKey());
    ob.put("value", entry.getValue());
    json.add(ob);
    k++;
    if (k == 30) break;
}
System.out.println(json.toString());
response.getWriter().write(json.toString());
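The same filter-then-truncate step, sketched in Python under stated assumptions: the input is a list of (keyword, count) pairs already sorted by count in descending order, the stop-word list mirrors the one in the servlet, and the name `top_keywords` is hypothetical. The output uses the `name` and `value` keys that the front end reads.

```python
import json

# Function words that should not appear in the cloud.
STOP_WORDS = {"for", "and", "with", "of", "in", "from", "a", "to", "the", "by"}


def top_keywords(sorted_counts, limit=30):
    """Drop stop words and keep at most `limit` of the remaining entries."""
    result = []
    for word, count in sorted_counts:
        # Compare in lower case so "A" and "a" are treated alike.
        if word.lower() in STOP_WORDS:
            continue
        result.append({"name": word, "value": count})
        if len(result) == limit:
            break
    return result


counts = [("for", 9), ("Deep", 7), ("A", 5), ("Learning", 4), ("Video", 2)]
print(json.dumps(top_keywords(counts, limit=2)))
# [{"name": "Deep", "value": 7}, {"name": "Learning", "value": 4}]
```

Filtering before counting towards the limit matters: cutting to 30 first and filtering afterwards would leave fewer than 30 words in the chart.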
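3. Click event: showing the papers and links for a keyword
// Set up the click handler
var ecConfig = echarts.config;
myChart.on('click', eConsole);
The click handler is a function: it sends the clicked keyword to the back end, which runs a fuzzy query to find the matching paper titles and links and returns them to the front-end page.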
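function eConsole(param) {
    if (typeof param.seriesIndex == 'undefined') {
        return;
    }
    if (param.type == 'click') {
        var word = param.name;
        // The HTML tags inside the original string literals were lost;
        // the <ul>/<li> markup used here is a stand-in, not the author's markup.
        var htmltext = "<ul>";
        $.post('findkeytitle',
            {'word': word},
            function(result)
            {
                var json = JSON.parse(result);
                for (var i = 0; i < json.length; i++)
                {
                    htmltext += "<li>" + json[i].Title + "</li>";
                }
                htmltext += "</ul>";
                $("#show").html(htmltext);
            }
        );
    }
}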
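The back-end handler behind `findkeytitle` is not shown in this post. The fuzzy query it performs could look roughly like the Python sketch below, which makes several assumptions: an in-memory SQLite database stands in for MySQL so the example runs on its own, the columns of `cvpr` are named `title`, `abstract`, `link` and `keywords` (only `keywords` is confirmed by the DAO code), and the `Link` key in the result is hypothetical, while `Title` matches what the click handler reads.

```python
import json
import sqlite3


def find_by_keyword(conn, word):
    """Return the papers whose title contains `word` (a LIKE-based fuzzy match)."""
    # Escape LIKE wildcards so a keyword containing % or _ is matched literally.
    escaped = word.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = conn.execute(
        "SELECT title, link FROM cvpr WHERE title LIKE ? ESCAPE '\\'",
        ("%" + escaped + "%",),
    ).fetchall()
    return [{"Title": title, "Link": link} for title, link in rows]


# Tiny in-memory table with made-up rows, just to exercise the query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cvpr (title TEXT, abstract TEXT, link TEXT, keywords TEXT)")
conn.executemany(
    "INSERT INTO cvpr VALUES (?, ?, ?, ?)",
    [
        ("Deep Stereo Matching", "", "http://example.com/a.pdf", "Deep,Stereo,Matching"),
        ("Video Object Tracking", "", "http://example.com/b.pdf", "Video,Object,Tracking"),
    ],
)
print(json.dumps(find_by_keyword(conn, "Stereo")))
```

Passing the keyword as a bound parameter rather than concatenating it into the SQL string keeps the query safe from injection, which matters because the word comes straight from the browser.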
Result:
Front-end page code:
background-color: black;
}
#main {
    width: 70%;
    height: 100%;
    margin: 0;
    float: right;
    background: black;
}
#show {
    overflow-x: auto;
    overflow-y: auto;
    width: 30%;
    height: 100%;
    float: left;
    margin-top: 100px;
    padding-top: 100px;
    background: pink;
}
echartsCloud();
}); // Click event
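function eConsole(param) {
    if (typeof param.seriesIndex == 'undefined') {
        return;
    }
    if (param.type == 'click') {
        var word = param.name;
        // The HTML tags inside the original string literals were lost;
        // the <ul>/<li> markup used here is a stand-in, not the author's markup.
        var htmltext = "<ul>";
        $.post('findkeytitle',
            {'word': word},
            function(result)
            {
                var json = JSON.parse(result);
                for (var i = 0; i < json.length; i++)
                {
                    htmltext += "<li>" + json[i].Title + "</li>";
                }
                htmltext += "</ul>";
                $("#show").html(htmltext);
            }
        );
    }
}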
function echartsCloud() {
    $.ajax({
        url: "getmax",
        type: "POST",
        dataType: "JSON",
        async: true,
        success: function(data)
        {
            // Copy the name/value pairs returned by the server into the series data
            var mydata = [];
            for (var i = 0; i < data.length; i++)
            {
                var d = {};
                d["name"] = data[i].name; //.substring(0, 2);
                d["value"] = data[i].value;
                mydata.push(d);
            }
            var myChart = echarts.init(document.getElementById('main'));
            // Set up the click handler
            var ecConfig = echarts.config;
            myChart.on('click', eConsole);
            myChart.setOption({
                title: {
                    text: ''
                },
                tooltip: {},
                series: [{
                    type: 'wordCloud', // word cloud series
                    shape: 'smooth', // smooth
                    gridSize: 8, // grid size
                    size: ['50%', '50%'],
                    //sizeRange : [ 50, 100 ],
                    rotationRange: [-45, 0, 45, 90], // rotation range
                    textStyle: {
                        normal: {
                            fontFamily: '微软雅黑',
                            // Pick a random RGB colour for each word
                            color: function() {
                                return 'rgb(' + Math.round(Math.random() * 255) +
                                    ',' + Math.round(Math.random() * 255) +
                                    ',' + Math.round(Math.random() * 255) + ')';
                            }
                        },
                        emphasis: {
                            shadowBlur: 5, // shadow blur radius
                            shadowColor: '#333' // shadow colour
                        }
                    },
                    left: 'center',
                    top: 'center',
                    right: null,
                    bottom: null,
                    width: '100%',
                    height: '100%',
                    data: mydata
                }]
            });
        }
    });
}