태지쌤

Robot & Coding Education No.1 Creator

Python

Crawling and Scraping

태지쌤 2022. 8. 30. 22:25

[Crawling]: collecting publicly available data from the internet automatically, using a program

* Cautions

1) Respect copyright.

2) Do not disrupt a site's operation with excessive access.

3) Do not crawl sites that prohibit crawling.
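
Cautions 2) and 3) can be partly automated. The following is a minimal sketch, assuming the site publishes its crawling rules in a robots.txt file. The robots.txt content, the crawler name "my-crawler", and the example.com URLs are made up for illustration, and the helper names `is_allowed` and `polite_pause` are hypothetical. It uses only the Python standard library.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def polite_pause(seconds=1.0):
    """Wait between requests so the server is not hit with rapid access."""
    time.sleep(seconds)


# "my-crawler" and the example.com URLs are illustrative assumptions.
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", "https://example.com/public/page"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", "https://example.com/private/page"))
```

Against a real site, `RobotFileParser.set_url()` followed by `read()` downloads the live robots.txt instead of parsing an inline string. A suitable pause length depends on the site, so the one-second default here is only a placeholder.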

 

[Scraping]: interpreting the collected data to extract the data you need

-> Python library: Beautiful Soup 4
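
To give a feel for what scraping with Beautiful Soup 4 looks like, here is a small sketch. The HTML string is made up for illustration and stands in for the text a crawl would return. It assumes the `beautifulsoup4` package is installed and uses Python's built-in `html.parser` backend.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, standing in for the text returned by a crawl.
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>News</h1>
    <ul>
      <li><a href="/a">First article</a></li>
      <li><a href="/b">Second article</a></li>
    </ul>
  </body>
</html>
"""

# Parse the HTML into a tree that can be searched by tag name.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Extract only the parts we need: the page title and every link.
page_title = soup.title.string
links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a")]

print(page_title)
print(links)
```

The crawl step supplies the raw HTML, and the scrape step reduces it to the title and a list of (text, href) pairs.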

< Program that reads in an HTML file >

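import requests

# URL of the blog post to fetch
url = "https://blog.naver.com/scienleader/221509685658"

# Send an HTTP GET request; the timeout keeps the script from hanging forever
response = requests.get(url, timeout=10)
# Use the encoding detected from the body so Korean text decodes correctly
response.encoding = response.apparent_encoding
# Print the HTML source of the page
print(response.text)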

Result

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="ko">
<head>
<meta http-equiv="Pragma" content="no-cache"/>
<meta http-equiv="Expires" content="-1"/>
<meta name="robots" content="noindex,follow"/>
<meta name="referrer" content="always"/>
<meta http-equiv="content-type" content="text/html;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico?3" />
<link rel="alternate" type="application/rss+xml" href="https://rss.blog.naver.com/scienleader.xml" title="RSS feed for scienleader Blog"/>
<link rel="wlwmanifest" type="application/wlwmanifest+xml" href="https://blog.naver.com/NBlogWlwLayout.naver?blogId=scienleader" />




<title>로봇과학 & 코딩교육 No.1 크리에이터 = 태지쌤 : 네이버 블로그</title>
</head>
<script type="text/javascript" src="https://ssl.pstatic.net/t.static.blog/mylog/versioning/Frameset-347491577_https.js" charset="UTF-8"></script>

<script type="text/javascript" charset="UTF-8">
var photoContent="";
var postContent="";

var videoId 	  = "";
var thumbnail 	  = "";
var inKey 		  = "";
var movieFileSize = "";
var playTime 	  = "";
var screenSize 	  = "";

var blogId = 'scienleader';
var blogURL = 'https://blog.naver.com';
var eventCnt = '';

var g_ShareObject = {};
g_ShareObject.referer = "";


jsMVC.setController("framesetTitleController", FramesetTitleController);
jsMVC.setController("framesetUrlController", FramesetUrlController);
jsMVC.setController("framesetMusicController", FramesetMusicController);
var oFramesetTitleController = jsMVC.getController("framesetTitleController");
var oFramesetUrlController = jsMVC.getController("framesetUrlController");
var oFramesetMusicController = jsMVC.getController("framesetMusicController");
var sTitle = document.title;

var topFrameAlert = function(message){
	alert(message);
};

var topFrameConfirm = function(message){
	if(confirm(message)){
		return true;
	} else {
		return false;
	}
};
</script>
<style type="text/css">
    html{width:100%;height:100%;}
    body{width:100%;height:100%;margin:0;padding:0;font-size:0;}
    #mainFrame{width:100%;height:100%;margin:0;padding:0;border:0;}
    #hiddenFrame{width:0;height:0;margin:0;padding:0;border:0;}
</style>
<body>
    <iframe id="mainFrame" name="mainFrame" allowfullscreen="true" src="/PostView.naver?blogId=scienleader&logNo=221509685658&redirect=Dlog&widgetTypeCall=true&directAccess=false" scrolling="auto"  onload="oFramesetTitleController.start(self.frames['mainFrame'], self, sTitle);oFramesetTitleController.onLoadFrame();oFramesetUrlController.start(self.frames['mainFrame']);oFramesetUrlController.onLoadFrame()" allowfullscreen></iframe>
</body>
</html>
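
The result above is only a frame wrapper: the `<body>` holds a single `<iframe id="mainFrame">` whose `src` points at `/PostView.naver?...`, so the post text itself is not in this response. As a rough sketch, assuming `beautifulsoup4` is installed, the iframe address can be pulled out and turned into a full URL like this. The helper name `find_post_url` is hypothetical, and `FRAME_HTML` is a shortened copy of the printed body so the example runs offline.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

PAGE_URL = "https://blog.naver.com/scienleader/221509685658"

# Shortened copy of the <body> printed above, so the sketch runs offline.
FRAME_HTML = """
<body>
    <iframe id="mainFrame" name="mainFrame" src="/PostView.naver?blogId=scienleader&logNo=221509685658&redirect=Dlog&widgetTypeCall=true&directAccess=false"></iframe>
</body>
"""


def find_post_url(frame_html, page_url):
    """Return the absolute URL of the mainFrame iframe, or None if absent."""
    soup = BeautifulSoup(frame_html, "html.parser")
    iframe = soup.find("iframe", id="mainFrame")
    if iframe is None or not iframe.get("src"):
        return None
    # The src is site-relative, so resolve it against the page URL.
    return urljoin(page_url, iframe["src"])


post_url = find_post_url(FRAME_HTML, PAGE_URL)
print(post_url)
```

Passing `post_url` to `requests.get` would be the natural next step. Whether that second request returns the full post depends on how the site behaves at the time, which this sketch does not check.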