Purpose
There are three goals (plus one bonus):
- Crawling/scraping tends to require multiple environments for different purposes, so I want to be able to rebuild an environment easily, any number of times
- I want to be able to collect data even from sites rendered dynamically with JavaScript
- To make the accumulated data more useful, I want to be able to call it up and use it easily
- Multi-site support (bonus)
Approach
To make environment setup easy, docker-compose is used, and the spots that act as environment-specific settings are made explicit to improve reusability.
For crawling/scraping, Scrapy is used, since it is an established, well-patterned framework for the job. To cope with sites rendered dynamically by JavaScript, Splash (a headless browser) is used alongside it.
To make the accumulated scrape data easy to call up, it is exposed through a REST API. For affinity with Scrapy, the Python framework Django (with Django REST Framework) is used.
There was also a pre-existing requirement to run multiple environments side by side on a VPS, so a reverse proxy is placed in front.
Structure
The docker-compose/Dockerfile layout is shown below.
There are two docker-compose.yml files: one for the reverse proxy, and one for the scrapy/Django-REST-framework stack.
Reverse proxy layout
D:.
│ docker-compose.yml
└─proxy
│ Dockerfile
├─etc
│ └─nginx
│ │ nginx.conf
│ └─conf
│ docker-scrapy_local.conf
│ docker-scrapy.conf
├─html
│ index.html
└─static
The key point is that etc/nginx/conf holds one config file per site. If sites are added or removed, the reverse proxy only needs a conf file created (or deleted) to keep up; a minimal sketch of such a file follows.
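For illustration only, a per-site conf could look like the sketch below. The host names (newsite.example.com and the backend container newsite-nginx) are placeholders, not files from this setup:

server {
    listen 80;
    server_name newsite.example.com;  # hypothetical new site

    location / {
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass http://newsite-nginx:8000;  # backend reachable over dc_proxy_nw
    }
}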
Note: if you obtain your own domain and serve it over SSL, you will also need Let's Encrypt, but that is a separate topic and omitted here.
scrapy / Django-REST-framework
Minor files are omitted; only the Docker-related configuration files are shown below.
D:.
│ docker-compose.yml
├─nginx
│ │ Dockerfile
├─restapi
│ │ Dockerfile
│ │ requirements.txt
└─scrapy
│ Dockerfile
│ pip_requirements.txt
nginx here exists to tie the reverse proxy to Django.
The data model is created through Django REST Framework migrations, which keeps table management cheap; the data itself is persisted in the PostgreSQL instance that docker-compose creates.
Scrapy then writes its crawled/scraped data into the tables Django created, which is what makes the data callable through the REST API. A sketch of such a pipeline follows.
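As a rough sketch of that flow (the real models.py and pipelines.py are not reproduced here, and the table/column names below are assumptions), a Scrapy item pipeline along these lines can insert each scraped item into the table created by the Django migration:

# scraping/pipelines.py -- minimal sketch; table and column names are assumed
import psycopg2

class PostgresPipeline:
    def open_spider(self, spider):
        # POSTGRESQL_URL is defined in scraping/settings.py (shown later)
        self.conn = psycopg2.connect(spider.settings.get('POSTGRESQL_URL'))

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        with self.conn.cursor() as cur:
            # hypothetical table created by the scrapydb app's 0001_initial.py
            cur.execute(
                'INSERT INTO scrapydb_news (title, url) VALUES (%s, %s)',
                (item.get('title'), item.get('url')),
            )
        self.conn.commit()
        return item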
The full layout, including individual files, is shown below.
D:.
│ docker-compose.yml
│
├─nginx
│ │ default.conf
│ │ Dockerfile
│ │ uwsgi_params
│ ├─conf
│ ├─html
│ │ index.html
│ └─log
│ access.log
│ error.log
│ uwsgi.log
│
├─restapi
│ │ debug.log
│ │ Dockerfile
│ │ requirements.txt
│ │ uwsgi.log
│ │
│ ├─app
│ │ │ debug.log
│ │ │ manage.py
│ │ │
│ │ ├─app
│ │ │ │ app.ini
│ │ │ │ settings.py
│ │ │ │ urls.py
│ │ │ │ wsgi.py
│ │ │ │ __init__.py
│ │ │ │
│ │ │ └─__pycache__
│ │ │
│ │ ├─scrapydb
│ │ │ │ admin.py
│ │ │ │ adminResources.py
│ │ │ │ apps.py
│ │ │ │ models.py
│ │ │ │ serializer.py
│ │ │ │ tests.py
│ │ │ │ urls.py
│ │ │ │ views.py
│ │ │ │ __init__.py
│ │ │ │
│ │ │ ├─migrations
│ │ │ │ │ 0001_initial.py
│ │ │ │ │ __init__.py
│ │ │ │ │
│ │ │ │ └─__pycache__
│ │ │ │
│ │ │ └─__pycache__
│ │ │
│ │ └─static
│ │ ├─admin
│ │ │ ├─css
│ │ │ ├─fonts
│ │ │ ├─img
│ │ │ └─js
│ │ ├─import_export
│ │ └─rest_framework
│ │ ├─css
│ │ ├─docs
│ │ │ ├─css
│ │ │ ├─img
│ │ │ └─js
│ │ ├─fonts
│ │ ├─img
│ │ └─js
│ │
└─scrapy
│ Dockerfile
│ pip_requirements.txt
│
└─app
│ main.sh
│ scrapy.cfg
│
└─scraping
│ items.py
│ middlewares.py
│ pipelines.py
│ settings.py
│ __init__.py
│
├─results
├─spiders
│ │ google_news_freelance.py
│ │ __init__.py
│ └─__pycache__
└─__pycache__
Settings for each component
Reverse proxy settings
Reverse proxy / nginx / docker-compose.yml
version: '3'
services:
  proxy:
    build: ./proxy
    tty: true
    container_name: dc_proxy
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - '/etc/letsencrypt:/etc/letsencrypt'
      - /var/run/docker.sock:/tmp/docker.sock:ro
      - 'static_file_share:/static'
    restart: always
networks:
  default:
    external:
      name: dc_proxy_nw
# define the shared volume
volumes:
  static_file_share:
    external: true
The key point is the networks section: it attaches to an externally shared network, dc_proxy_nw, and the backend Docker containers communicate with the proxy over this network.
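Because the network is declared external: true, it must already exist before either compose file is brought up; creating it once on the host with the standard Docker CLI command is enough:

docker network create dc_proxy_nw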
Reverse proxy / nginx / Dockerfile
FROM nginx
COPY ./etc/nginx/nginx.conf /etc/nginx/nginx.conf
#COPY ./etc/nginx/conf/fx.conf /etc/nginx/conf.d/fx.conf
#COPY ./etc/nginx/conf/dc.conf /etc/nginx/conf.d/dc.conf
#COPY ./etc/nginx/conf/lm.conf /etc/nginx/conf.d/lm.conf
#COPY ./etc/nginx/conf/mt.conf /etc/nginx/conf.d/mt.conf
#COPY ./etc/nginx/conf/docker-scrapy.conf /etc/nginx/conf.d/docker-scrapy.conf
#COPY ./etc/nginx/conf/fx_local.conf /etc/nginx/conf.d/fx_local.conf
#COPY ./etc/nginx/conf/dc_local.conf /etc/nginx/conf.d/dc_local.conf
#COPY ./etc/nginx/conf/lm_local.conf /etc/nginx/conf.d/lm_local.conf
#COPY ./etc/nginx/conf/mt_local.conf /etc/nginx/conf.d/mt_local.conf
#COPY ./etc/nginx/conf/mtb_local.conf /etc/nginx/conf.d/mtb_local.conf
COPY ./etc/nginx/conf/docker-scrapy_local.conf /etc/nginx/conf.d/docker-scrapy_local.conf
COPY html/index.html /etc/nginx/html/index.html
#COPY static/ /static/
RUN apt-get update && apt-get install -y \
wget && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
The production and development domains differ, so the conf files are split per environment and the unused COPY lines are commented out. (There is surely a better way; I never got around to looking into it.) One possible cleanup is sketched below.
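As a sketch of one cleaner option (untested here): a build argument could select the conf file, so a single COPY line serves both environments. This assumes the production conf were renamed to follow a docker-scrapy_<env>.conf pattern:

FROM nginx
# TARGET selects the environment-specific conf; defaults to local
ARG TARGET=local
COPY ./etc/nginx/conf/docker-scrapy_${TARGET}.conf /etc/nginx/conf.d/

The production image would then be built with docker-compose build --build-arg TARGET=prod, or via an args: entry under build: in docker-compose.yml.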
Reverse proxy / nginx / nginx.conf
user nginx;
worker_processes auto;
error_log /var/log/nginx/error.log;
pid /run/nginx.pid;

events {
    worker_connections 1024;
}

http {
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';
    access_log /var/log/nginx/access.log main;

    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    types_hash_max_size 2048;

    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    # Load modular configuration files from the /etc/nginx/conf.d directory.
    # See http://nginx.org/en/docs/ngx_core_module.html#include
    # for more information.
    include /etc/nginx/conf.d/*.conf;
}
Reverse proxy / nginx / docker-scrapy_local.conf
server {
    listen 80;
    server_name ds-nginx.localhost;
    proxy_redirect off;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-Host $host;
    proxy_set_header X-Forwarded-Server $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

    # staticfiles
    location /static/docker-scrapy/ {
        alias /static/docker-scrapy/static/;
    }
    location /static/ {
        alias /static/docker-scrapy/static/;
    }
    location /app {
        alias /static/docker-scrapy/html/;
        index index.html;
    }
    location /api/ {
        proxy_pass http://ds-nginx:8000;
    }
    location /api/admin/ {
        proxy_pass http://ds-nginx:8000;
    }
    location ^~ /.well-known/acme-challenge/ {
        allow all;
        root /var/www/html;
        try_files $uri =404;
    }
}
- server_name : ds-nginx.localhost
- Pointing a browser at http://ds-nginx.localhost opens the application
- location /api/ : http://ds-nginx:8000
- The path for Django REST Framework API calls (a quick curl check follows this list)
- location /api/admin/ : http://ds-nginx:8000
- The path for the Django REST Framework admin screen
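Once everything is up, a smoke test through the proxy could look like the following; the endpoint name news is a placeholder, since the real name depends on the router registered in app/scrapydb/urls.py:

curl http://ds-nginx.localhost/api/news/
curl -I http://ds-nginx.localhost/api/admin/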
scrapy / Django-REST-framework
The settings for each component are shown below.
docker-compose.yml
version: '3'
services:
  splash:
    restart: always
    image: scrapinghub/splash
    container_name: ds-splash
    ports:
      - "5023:5023"
      - "8050:8050"
      - "8051:8051"
    environment:
      VIRTUAL_HOST: ds-splash.localhost
  scrapy:
    build: ./scrapy
    container_name: ds-scrapy
    depends_on:
      - splash
      - dsdb
    tty: true
  nginx:
    build: ./nginx
    tty: true
    container_name: ds-nginx
    environment:
      VIRTUAL_HOST: ds-nginx.localhost
    volumes:
      - '/etc/letsencrypt:/etc/letsencrypt'
      - '/var/run/uwsgi:/var/run/uwsgi'
      - '/static:/static'
    depends_on:
      - restapi
  restapi:
    build: ./restapi
    container_name: ds-restapi
    command: uwsgi --ini ./app/app/app.ini # added after the Django app is created
    volumes:
      - ./restapi:/code
      - '/var/run/uwsgi:/var/run/uwsgi'
    expose:
      - "8001"
    depends_on:
      - dsdb
  dsdb:
    image: postgres
    container_name: myscdb
    ports:
      - "5432:5432"
    volumes:
      - "dbdata:/var/lib/postgresql/data"
    environment:
      TZ: "Asia/Tokyo"
      POSTGRES_USER: ********
      POSTGRES_PASSWORD: ********
      POSTGRES_DB: myscdb
networks:
  default:
    external:
      name: dc_proxy_nw
volumes:
  dbdata:
- splash
- container_name: ds-splash
- VIRTUAL_HOST: ds-splash.localhost
- scrapy
- container_name: ds-scrapy
- nginx
- VIRTUAL_HOST: ds-nginx.localhost
- restapi
- container_name: ds-restapi
- command: uwsgi --ini ./app/app/app.ini # added after the Django app is created
- depends_on : dsdb
- dsdb: ★the database name should be kept unique within a shared Docker environment
- container_name: myscdb
- POSTGRES_DB: myscdb

The first-run startup sequence is sketched below.
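Assuming the two directories shown earlier, a typical first run looks like this; the external network and volume are created by hand because both compose files declare them external:

docker network create dc_proxy_nw        # once per host, if not already created
docker volume create static_file_share   # external volume used by the proxy
docker-compose up -d --build             # run in the reverse-proxy directory
docker-compose up -d --build             # run in the scrapy/DRF directory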
nginx
Dockerfile
FROM nginx
COPY default.conf /etc/nginx/conf.d/default.conf
COPY uwsgi_params /etc/nginx/uwsgi_params
COPY html/index.html /etc/nginx/html/index.html
RUN chmod o+w /etc/nginx;
RUN chmod o+x /etc/nginx/html;
RUN chmod o+r /etc/nginx/html/index.html;
RUN chmod o+r /etc/nginx/uwsgi_params;
RUN apt-get update && apt-get install -y \
wget && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
default.conf
upstream django {
    server unix:/var/run/uwsgi/docker-scrapy.sock;
}

server {
    listen 8000;
    server_name ds-nginx.localhost;
    #server_name 172.21.0.6;
    proxy_redirect off;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-Host $host;
    proxy_set_header X-Forwarded-Server $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

    # staticfiles
    location /static/ {
        alias /static/; # path to STATIC_ROOT goes here
    }
    location /api/ {
        include /etc/nginx/uwsgi_params;
        uwsgi_pass django;
    }
    location /api/admin/ {
        include /etc/nginx/uwsgi_params;
        uwsgi_pass django;
    }
}
postgresql
As described in docker-compose.yml above.
scrapy/splash
Dockerfile
FROM python:3.6-alpine
RUN apk add --update --no-cache \
build-base \
python-dev \
zlib-dev \
libxml2-dev \
libxslt-dev \
openssl-dev \
libffi-dev
ADD pip_requirements.txt /tmp/pip_requirements.txt
RUN \
apk add --no-cache postgresql-libs && \
apk add --no-cache --virtual .build-deps gcc musl-dev postgresql-dev && \
python3 -m pip install -r /tmp/pip_requirements.txt --no-cache-dir && \
apk --purge del .build-deps
ADD ./app /usr/src/app
WORKDIR /usr/src/app
pip_requirements.txt
# =============================================================================
# Production
# =============================================================================
scrapy==1.4.0
scrapy-splash
psycopg2-binary
psycopg2
# =============================================================================
# Development
# =============================================================================
flake8==3.3.0
flake8-mypy==17.3.3
mypy==0.511
scrapy.cfg
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.org/en/latest/deploy.html
[settings]
default = scraping.settings
[deploy]
project = scraping
settings.py
SPIDER_MODULES = ['scraping.spiders']
NEWSPIDER_MODULE = 'scraping.spiders'

ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20100101 Firefox/7.7'
DOWNLOAD_DELAY = 0.25  # note: this later assignment overrides the value above

SPLASH_URL = 'http://ds-splash:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

# Configure item pipelines
ITEM_PIPELINES = {
    'scraping.pipelines.PostgresPipeline': 200
}

# DB
POSTGRESQL_URL = 'postgresql://********:********@dsdb:5432/postgres'
- SPLASH_URL = 'http://ds-splash:8050' (Splash is addressed by its compose service name; a spider sketch using it follows)
- POSTGRESQL_URL = 'postgresql://********:********@dsdb:5432/postgres'
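To show how SPLASH_URL is consumed, here is a minimal sketch of a spider that has Splash render the page before parsing. The spider name, URL, and selector are illustrative assumptions; the actual google_news_freelance.py is not reproduced here:

# scraping/spiders/example_splash.py -- illustrative sketch only
import scrapy
from scrapy_splash import SplashRequest

class ExampleSplashSpider(scrapy.Spider):
    name = 'example_splash'

    def start_requests(self):
        # Splash renders the JavaScript before the response reaches parse()
        yield SplashRequest(
            'https://news.google.com/',
            callback=self.parse,
            args={'wait': 2},  # give the page time to render
        )

    def parse(self, response):
        # yielded dicts are handed to the PostgresPipeline sketched earlier
        for href in response.css('a::attr(href)').extract():
            yield {'title': '', 'url': response.urljoin(href)}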
Django-REST-Framework
Dockerfile
# base image
FROM python:3
# environment variable available inside the container
ENV PYTHONUNBUFFERED 1
# mkdir /code at image build time
RUN mkdir /code
# subsequent instructions run in /code
WORKDIR /code
# copy requirements.txt into /code/ in the image
ADD requirements.txt /code/
# pip install from requirements.txt at build time
RUN pip install -r requirements.txt
# copy everything under the current directory into /code
ADD . /code/
requirements.txt
# =============================================================================
# Production
# =============================================================================
uwsgi
django>=2.0
djangorestframework
django-filter
psycopg2
psycopg2-binary
django-cors-headers
django-import-export
slackweb
djangorestframework-jwt
django-rest-auth
django-allauth
The following files are configured after the Django project/app has been created.
app.ini
[uwsgi]
#----------
socket = /var/run/uwsgi/docker-scrapy.sock
chmod-socket = 666
module = app.app.wsgi
wsgi-file = app/app/wsgi.py
logto = uwsgi.log
py-autoreload = 1
master = true
processes = 4
threads = 2
- socket = /var/run/uwsgi/docker-scrapy.sock (the Unix socket shared with nginx via the /var/run/uwsgi volume; see the note below)
- module = app.app.wsgi
- wsgi-file = app/app/wsgi.py
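One host-side detail worth noting: /var/run/uwsgi is bind-mounted into both the nginx and restapi containers. Docker creates a missing bind-mount directory as root on first start, but creating it yourself beforehand keeps ownership predictable:

sudo mkdir -p /var/run/uwsgi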
settings.py
# Build paths inside the project like this: os.path.join(BASE_DIR, ...)
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

# SECURITY WARNING: don't run with debug turned on in production!

# Application definition
INSTALLED_APPS = [
    ...,
    'app.scrapydb',
]

ROOT_URLCONF = 'app.app.urls'
WSGI_APPLICATION = 'app.app.wsgi.application'

# Database
# https://docs.djangoproject.com/en/2.2/ref/settings/#databases
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': '********',
        'USER': '********',
        'PASSWORD': '********',
        'HOST': 'dsdb',
        'PORT': 5432,
    }
}

# Password validation
# https://docs.djangoproject.com/en/2.2/ref/settings/#auth-password-validators

# Static files (CSS, JavaScript, Images)
# https://docs.djangoproject.com/en/2.2/howto/static-files/
STATIC_URL = '/static/'
STATIC_ROOT = '/static/'
- INSTALLED_APPS
- ROOT_URLCONF = 'app.app.urls'
- WSGI_APPLICATION = 'app.app.wsgi.application'
- 'HOST': 'dsdb' (the compose service name of the PostgreSQL container; migration commands follow this list)
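With these settings in place, the tables mentioned earlier are created by running the standard migration commands inside the restapi container; this assumes the service name from docker-compose.yml above and manage.py sitting at app/manage.py under the /code mount:

docker-compose exec restapi python app/manage.py makemigrations scrapydb
docker-compose exec restapi python app/manage.py migrate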
urls.py
from django.contrib import admin
from django.urls import path
from django.conf.urls import url, include
from app.scrapydb.urls import router as scrapydb_router

urlpatterns = [
    url('^api/admin/', admin.site.urls),
    url('^api/', include(scrapydb_router.urls)),
]
wsgi.py
import os
from django.core.wsgi import get_wsgi_application
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'app.app.settings')
application = get_wsgi_application()
- os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'app.app.settings')
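The scrapydb app's own files (models.py, serializer.py, views.py, urls.py) appear in the tree but are not reproduced above. As a rough sketch of how the router imported by app/app/urls.py could be wired, assuming a hypothetical News model with title/url fields (inlined into one file here for brevity; the real project splits this across serializer.py, views.py, and urls.py):

# app/scrapydb/urls.py -- minimal sketch; the News model and its fields are assumptions
from rest_framework import routers, serializers, viewsets
from .models import News

class NewsSerializer(serializers.ModelSerializer):
    class Meta:
        model = News
        fields = '__all__'

class NewsViewSet(viewsets.ModelViewSet):
    queryset = News.objects.all()
    serializer_class = NewsSerializer

# this router is what app/app/urls.py includes under /api/
router = routers.DefaultRouter()
router.register('news', NewsViewSet)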