Build Groph Step by Step

Groph is my new personal project for studying big data analysis.

Data set

The Tencent QQ group data leaked around 2012.

Tech Stack:

  • SQL Server 2008 R2, to restore the original data set.
  • Neo4j, a graph database to store the data and its relationships
  • Python 3, with libraries:
      • pipenv: to set up the virtualenv
      • pymssql: to connect to SQL Server

Day 0

  • install SQL Server
  • WIP: import data into SQL Server
  • import query: sp_attach_single_file_db @dbname='GroupData5_Data', @physname='[path to your data set folder]\GroupData5_Data.MDF'

Day 1, 2017-3-30

  • init git repo, push to GitHub
  • set up the Python virtualenv with pipenv
  • connect to SQL Server from a Python script
  • SQL Server must be configured to enable TCP/IP connections first (see the docs); run C:\Windows\SysWOW64\SQLServerManager10.msc to open the Configuration Manager if you can't find it.
  • in the Python script: conn = pymssql.connect(server='SX-DEV', database="GroupData1_Data")
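A minimal sketch of this connection step, using the server and database names from the notes above; the table name and the TOP-style sampling query are my assumptions, not the dump's actual schema:

```python
def member_query(table, limit=10):
    # SQL Server 2008 paginates with TOP, not LIMIT
    return "SELECT TOP {} * FROM [{}]".format(limit, table)

def fetch_sample(table, server="SX-DEV", database="GroupData1_Data"):
    import pymssql  # imported lazily so member_query() works without the driver installed
    conn = pymssql.connect(server=server, database=database)
    try:
        cur = conn.cursor(as_dict=True)  # rows come back as dicts keyed by column name
        cur.execute(member_query(table))
        return cur.fetchall()
    finally:
        conn.close()

if __name__ == "__main__":
    for row in fetch_sample("Group"):  # 'Group' is a guessed table name
        print(row)
```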

Day 2, 2017-4-5

  • add Django to the project
  • set up a local Neo4j with Docker
  • install Python 3.6
  • use Python 3.6 with pipenv; on Windows: pipenv install --python=E:\python36\python.exe
  • well, pymssql doesn't support Python 3.6 yet, so I still need to use 3.5
  • create a Django management command to port the data.

Day 3, 2017-04-07

  • set up dotenv

Day 4, 2017-04-08

  • set up the Neo4j connector, py2neo
  • add methods to create Group nodes
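One way to sketch this helper: py2neo's Graph.merge with a primary label and key keeps reruns idempotent. The raw column names (GroupNum, Title) and the node property names are assumptions, not the dump's actual schema:

```python
def group_props(row):
    """Map a raw SQL row (as a dict) to the properties stored on a Group node.
    GroupNum and Title are hypothetical column names."""
    return {"gid": int(row["GroupNum"]), "name": row.get("Title") or ""}

def add_group(graph, row):
    # graph is a py2neo Graph; merge on the gid key so re-imports don't duplicate nodes
    from py2neo import Node  # imported lazily so group_props() is testable without py2neo
    node = Node("Group", **group_props(row))
    graph.merge(node, "Group", "gid")
    return node
```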

Day 5, 2017-04-12, I got engaged today!

  • make the port command handle exceptions, so node creation resumes where it stopped.
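The resume-on-failure logic from this step can be sketched like this; the checkpoint-file name and the shape of the row iterable are assumptions:

```python
import os

def load_offset(path):
    # index of the first row that still needs importing; 0 on a fresh run
    if os.path.exists(path):
        with open(path) as f:
            return int(f.read().strip() or 0)
    return 0

def save_offset(n, path):
    with open(path, "w") as f:
        f.write(str(n))

def port(rows, create_node, path="port.offset"):
    """Skip rows already imported, and checkpoint on failure so the next
    run resumes exactly where this one stopped."""
    start = load_offset(path)
    for i, row in enumerate(rows):
        if i < start:
            continue
        try:
            create_node(row)
        except Exception:
            save_offset(i, path)  # row i failed, so retry from it next time
            raise
    save_offset(len(rows), path)
```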

Day 6, 2019-04-06, Yes, two years later I'm back on this project again

  • discarded the old Django code base
  • created a new repo using NestJS
  • migrated the database importer code
  • installed SQL Server 2008 again!

Day 7, 2019-04-07

  • created a Neo4j VM on Google Cloud
  • refactored the Python importer code; tried running it, but it's way too slow, about 7 nodes/second

Day 8, 2019-04-08

  • Rewrote the importer in Node.js so I can import from all databases at the same time. It was super fast at the beginning but slowed down after a few minutes; the reason turned out to be CPU usage above 100% on the Neo4j server. I tried increasing the server resources, but it didn't help.

Day 9, 2019-04-09

  • Importing programmatically doesn't seem to work, so I have to find another way. Then I discovered the neo4j-admin import tool, which might help.
  • However, that import tool only works with CSV files, which means I have to export the original data sitting inside SQL Server to CSV. After some research, I finally got the export working programmatically in bash with SQL Server's bcp command: bcp ${table} out ${file}.csv -S AXE-PC -w -T -d ${db} -o ${file}.log
  • Wrote some bash scripts to run neo4j-admin import with the generated CSV files
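The bash export loop above can be sketched equivalently in Python (a swap of shell for subprocess; the bcp flags come from the note, while the database/table lists and the file layout are my guesses):

```python
import subprocess

def bcp_args(db, table, out_dir="."):
    # mirrors: bcp <table> out <file>.csv -S AXE-PC -w -T -d <db> -o <file>.log
    # -w: Unicode character format, -T: trusted (Windows) authentication
    base = "{}/{}_{}".format(out_dir, db, table)
    return ["bcp", table, "out", base + ".csv",
            "-S", "AXE-PC", "-w", "-T", "-d", db,
            "-o", base + ".log"]

def export_all(dbs, tables, out_dir="."):
    # e.g. dbs = ["GroupData1_Data", "GroupData5_Data"]; table names are guesses
    for db in dbs:
        for table in tables:
            subprocess.check_call(bcp_args(db, table, out_dir))
```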

Day 10, 2019-04-10

  • Created a new disk in GCP, attached it to the VM, then uploaded all the CSV files to it.
  • Ran the import script with all the uploaded CSVs. It's still running as I write this; no idea how long it will take.