
Wednesday, November 23, 2011

[Transcript] Why Can't Developers Estimate Time?

This article is a transcript from the following url:
http://blog.patchspace.co.uk/why-cant-developers-estimate-time

A few interesting points came up on a mailing list thread I was involved in. Here are a few of them. The original comments are presented as sub-headers / quoted blocks, with my response below. This isn't a thorough look at the issues involved, but what I thought were relevant responses. Note: I've done some editing to improve the flow and to clarify a few things.
Why can't developers estimate time?
We can't estimate the time for any individual task in software development because the nature of the work is creating new knowledge.
The goal of software development is to automate processes. Once a process is automated, it can be run repeatedly, and in most cases, in a predictable time. Source code is like a manufacturing blueprint, the computer is like a manufacturing plant, the inputs (data) are like raw materials, and the outputs (data) are like finished goods. To use another analogy, the reason Starbucks makes drinks so quickly and repeatably is because they invested a lot of time in the design of the process, which was (and is, ongoing) a complex and expensive task. Individual Starbucks franchises don't have to re-discover this process, they just buy the blueprint. I'll leave it as an exercise to the reader to infer my opinion of the Costa coffee-making process.
It's not actually always a problem that development time is unpredictable, because the flipside is that so is the value returned. A successful piece of software can make or save vastly more than its cost. Tom DeMarco argues for focussing on the high value projects for exactly this reason. Note that this does require a value-generation mindset, rather than the currently-prevalent cost-control mindset. This is a non-trivial problem.
By far the best explanation I've read of variability and how to exploit it for value is Don Reinertsen's Principles of Product Development Flow, which is pretty much the adopted "PatchSpace Bible" for day-to-day process management. And when I say "by far the best", I mean by an order of magnitude above pretty much everything else I've read, apart from the Theory of Constraints literature.
Here is the data from my last development project. (Histogram generated in R with 5-hour buckets: the horizontal axis shows the duration in hours for the user stories - 0-5 hours, 5-10 hours, etc; the vertical axis is the number of stories that took that duration). We worked in 90 minute intervals and journaled the work on Wave, so we knew task durations to a pretty fine resolution. (We did this for both client communication and billing purposes.) The result: our development times were about as predictable as radioactive decay, but they were very consistently radioactive. Correlation with estimates was so poor I refused to estimate individual tasks, as it would have been wilfully misleading, but we had enough data to make sensible aggregates.
[Figure: Stories_dev_time_histogram]
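The shape described above is easy to simulate. The sketch below is my own illustration, not the article's data: assuming story durations are exponentially distributed (the distribution of radioactive decay times), with an invented 11-hour mean, individual durations are wildly unpredictable while batch totals stay stable.

```python
import random

random.seed(42)

# Hypothetical story durations: exponentially distributed, like decay times.
# The 11-hour mean is an invented figure, not taken from the article.
mean_hours = 11.0
stories = [random.expovariate(1.0 / mean_hours) for _ in range(200)]

# Any single story is wildly unpredictable...
print(f"shortest: {min(stories):.1f}h, longest: {max(stories):.1f}h")

# ...but aggregates are stable: batches of 50 stories have similar totals.
batches = [sum(stories[i:i + 50]) for i in range(0, 200, 50)]
print([round(b) for b in batches])
```

This is why refusing to estimate individual tasks while still quoting sensible aggregates is internally consistent: the relative spread of a sum shrinks as the batch grows, even when single draws vary enormously.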
Rule of thumb: take the estimates of a developer, double it and add a bit
The double-and-add-a-bit rule is interesting. When managers do this, how often are tasks completed early? We generally pay much more attention to overruns than underruns. If a team is not completing half of its tasks early, it is padding its estimates, and that padding trades development cycle time for schedule predictability. Cycle time is usually much more valuable than predictability, as it means getting to market sooner. Again, see Reinertsen's work; the numbers can come out an order of magnitude apart.
Also, this is the basis for Critical Chain project management, which halves the "safe" estimates to condense the timescale, and puts the remaining time (padding on individual tasks) at the end, as a "project buffer". This means that Parkinson's Law doesn't cause individual tasks to expand unduly. I'm unconvinced that Critical Chain is an appropriate method for software though, as the actual content of development work can change significantly, as feedback and learning improves the plan.
People in general just make shit up
It's not just developers that are bad with estimates either. Everyone at some point is just winging it because it's something they've never done before and won't be able to successfully make a judgement until they have.
As a community we need to get away from this. If we don't know, we don't know, and we need to say it. Clients who see regular progress on tasks they were made aware were risky (and chose to invest in) have much more trust in their team than clients whose teams make shit up. It's true! Srsly. Don't just take my word for it, though - read David Anderson's Kanban.
Estimating is a very important skill and should be taught more in junior dev roles
I propose an alternative: what we need to do is teach junior devs the meaning of done. If estimation problems are bad enough, finding out at some indeterminate point in the future that something went out unfinished (possibly in a rush to meet a commitment ... I mean, estimate!) blows not only that estimate out of the water, but also the schedule of all the work currently in process. This is very common, and can cause a significant loss of a development team's capacity.

[Note] Spectral Clustering (SVD)

I found this while searching for normalized cut; it is a transcript of a transcript.
The source URL is:
http://blog.xuite.net/coke750101/coketech/19917487

Spectral Clustering
http://140.123.102.86:180/phpBB3/viewtopic.php?f=10&t=154

Author: Sam.Tzeng [Mon Mar 10, 2008 3:09 pm]
Subject: Spectral Clustering

Spectral Clustering

Reposted from bbs.sjtu.edu.cn


What is a Spectral Algorithm?
Broadly speaking, any algorithm that uses SVD or eigendecomposition is called a Spectral Algorithm. From the venerable PCA/LDA to the more recent Spectral Embedding and Spectral Clustering, all belong to this family.


Why use SVD/eigendecomposition?
It is not used for its own sake; rather, there is little choice. Many of the fundamental problems encountered in research today are NP-hard, and finding a good approximation algorithm takes great effort; even when a polynomial-time approximation exists, it may still be too slow in practice, or its solutions may fall into local minima.

Take K-means clustering: the model itself could hardly be simpler, yet the problem is NP-hard, and the traditional EM-style iteration used to approximate it gets trapped in local minima.

SVD, by contrast, has a unique solution in theory, is comparatively fast to compute, and is supported by a large body of theoretical results and surrounding properties. It is therefore an ideal model onto which NP-hard problems can be "leaned". A further advantage: as a special case of constrained quadratic programming, it is quite "compatible" with quadratic objective functions, and the mathematical skill this "leaning" requires is modest; anyone, in any field, can pick it up and try.


A few directions within Spectral Algorithms:
The classics, PCA/LDA, for linear dimensionality reduction; the Spectral Embedding and Spectral Clustering work from around 2000; and some surrounding topics such as low-rank approximation.


Why does doing dimensionality reduction first make K-means work better?
Interestingly, K-means can in fact be approximated with PCA. K-means asks for K points such that the sum of squared distances from all points to those K centers is minimal;
SVD asks for a subspace such that the sum of squared distances from all points to that subspace is minimal. This establishes the connection between the two: K-means relaxes to SVD.
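The relaxation can be checked numerically. The sketch below is my own illustration (synthetic data, brute-force 2-means), not code from the post: after centering, the residual of projecting onto the top K-1 principal directions is a lower bound on the optimal K-means cost, because any discrete partition is one feasible point of the relaxed problem.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
# Two small blobs in the plane (synthetic data, purely for illustration).
X = np.vstack([rng.normal(0.0, 0.3, (4, 2)), rng.normal(3.0, 0.3, (4, 2))])
Xc = X - X.mean(axis=0)              # centre the data

def kmeans_cost(X, labels):
    """Sum of squared distances from each point to its cluster mean."""
    return sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
               for c in np.unique(labels))

# Exact 2-means by brute force over all bipartitions of the 8 points.
best = min(kmeans_cost(X, np.array(lab))
           for lab in itertools.product([0, 1], repeat=len(X))
           if 0 < sum(lab) < len(X))

# SVD relaxation: the residual after projecting onto the top K-1 = 1
# principal direction lower-bounds the K-means objective.
svals = np.linalg.svd(Xc, compute_uv=False)
pca_residual = (svals[1:] ** 2).sum()

print(pca_residual <= best + 1e-9)   # the relaxation is a valid lower bound
```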



Spectral Clustering/Embedding:

Spectral Clustering is the centerpiece of spectral algorithms.
Clustering means grouping a pile of things (reasonably) into two, or K, parts. Mathematically, the clustering problem is equivalent to Graph Partition: given a graph G = (V, E), partition its vertex set into disjoint subsets so that the partition is as good as possible. There are two main difficulties:

1. That "reasonably" is actually quite hard to achieve; an arbitrarily chosen objective function may not produce the result you hope for. See [1] Ravi Kannan and Adrian Vetta, On clusterings: good, bad and spectral, which discusses the choice of criterion in detail.
2. Even once a good clustering criterion is defined, optimizing it is yet another problem.

For difficulty 1, each school of Spectral Clustering has its own ideas, mainly the following:
a) The celebrated Normalized Cut [2], plus variants such as Ratio Cut and Min-max Cut.
b) Minimum conductance [1], closely tied to algebraic graph theory.
c) No criterion, but an algorithm that comes with a proof [3].
d) Not graph-based at all: reformulate an existing clustering method into a problem SVD can solve [4].
Difficulty 2 is then entirely determined by the choice made for difficulty 1.



Normalized Cut:
On a graph, the simplest way to define the best clustering is: after fixing K disjoint vertex sets, ask that the total weight of the edges running between the sets be minimal. (An edge weight represents how close, or how similar, its two endpoint vertices are.) This is the so-called MinCut (minimum cut) problem. The two-way minimum cut is not NP-hard, but that is nothing to be happy about, because MinCut is a poor criterion for clustering.

Concretely, MinCut may well separate a single vertex lying far from the main body from everything else, forming two "clusters".
In fact we want not only the cut weight to be small, but also the K vertex sets to be of roughly comparable size; only then does the result match our intuition about clustering.

So, building on MinCut, Normalized Cut appeared. The idea is simple: normalize the cut by dividing it by some measure of vertex-set size (e.g. vol A = the sum of the degrees of the vertices in A):

NCut(A, B) = cut(A, B) / vol(A) + cut(A, B) / vol(B)

With this one change, however, NP-hardness arrives. This is the nightmare of nearly every combinatorial optimization problem.

What to do, then? Make the combinatorial problem continuous: drop some constraints and relax appropriately. So how does SVD get involved?

Very simply: clustering divides things into disjoint sets, so a notion of orthogonality is built in; it is only the 0-1, all-or-nothing assignment, i.e. the discreteness, that makes the problem NP-hard. Keep the orthogonality constraint, but let the discrete variables become continuous, and clustering becomes an optimization problem over (column-)orthogonal matrices; that is exactly where SVD's firepower lies!

And so, through this clever relaxation, the NP-hard problem gains an approximate solution. Whatever the quality of that approximation, the method itself is quite exciting. (As for the quality of the Normalized Cut approximation, no paper seems to give a rigorous proof; it simply works well in practice.)

It is worth mentioning that Normalized Cut is also closely related to Markov chains on graphs [5]. Rendered in Markov-chain language, the Normalized Cut measure is the probability that a random walk on the graph "drops in" from one subset to the other. Quite interesting.
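The whole relaxation pipeline fits in a few lines on a toy graph. This is my own sketch in the spirit of [2], not code from the post: two small cliques joined by one weak edge, the symmetric normalized Laplacian, and a sign-split of the second eigenvector as the discretization step.

```python
import numpy as np

# Toy similarity graph: two 3-node cliques joined by one weak edge
# (an invented example, not data from the original post).
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1              # the weak bridge the cut should sever

d = W.sum(axis=1)
D_isqrt = np.diag(1.0 / np.sqrt(d))
# Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
L = np.eye(6) - D_isqrt @ W @ D_isqrt

vals, vecs = np.linalg.eigh(L)       # eigenvalues in ascending order
fiedler = D_isqrt @ vecs[:, 1]       # relaxed NCut indicator vector
labels = (fiedler > 0).astype(int)   # discretize by sign

print(labels)                        # one clique on each side of the cut
```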



Minimum conductance:

Minimum conductance is another Graph Partition criterion. It looks much like Normalized Cut, yet its analysis is completely different. Minimum conductance is a graph invariant; in algebraic graph theory it is known as the Cheeger constant: the minimum, over all bipartitions (A, B) of the vertex set, of the cut weight divided by min(vol A, vol B). Here vol measures the size of a vertex set (the same measure Normalized Cut uses).

This should in principle be harder to handle than the Normalized Cut measure, because of the min function. Interestingly, though, algebraic graph theory provides the following result: the second-smallest eigenvalue \lambda_2 of the graph's normalized Laplacian (the smallest is 0) estimates the Cheeger constant:

(Cheeger constant)^2 / 2 <= \lambda_2 <= 2 * (Cheeger constant)

OK, SVD takes the stage again: find the second-smallest eigenvalue of the Laplacian, then use the corresponding eigenvector to cut the graph. The resulting cut is somewhat worse, but not much worse than the minimum-conductance cut, and how much worse is bounded in theory.
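The two-sided estimate can be sanity-checked on a small example of my own (a 6-cycle, not from the post): compute the Cheeger constant by brute force and the second-smallest eigenvalue of the normalized Laplacian, then verify h^2/2 <= lambda_2 <= 2h.

```python
import itertools
import numpy as np

# Small example graph: a 6-cycle with unit edge weights (my own choice).
n = 6
W = np.zeros((n, n))
for i in range(n):
    W[i, (i + 1) % n] = W[(i + 1) % n, i] = 1.0

d = W.sum(axis=1)

def vol(S):
    """Sum of degrees over a vertex subset."""
    return d[list(S)].sum()

def cut(S, T):
    """Total weight of edges running between S and T."""
    return W[np.ix_(list(S), list(T))].sum()

# Cheeger constant by brute force over all bipartitions (S, complement).
verts = set(range(n))
h = min(cut(S, verts - S) / min(vol(S), vol(verts - S))
        for r in range(1, n)
        for S in map(set, itertools.combinations(range(n), r)))

# Second-smallest eigenvalue of the symmetric normalized Laplacian.
D_isqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(n) - D_isqrt @ W @ D_isqrt
lam2 = np.linalg.eigvalsh(L)[1]

print(h ** 2 / 2 <= lam2 <= 2 * h)   # Cheeger's inequality holds
```

For the 6-cycle, the best split is into two arcs of three vertices (h = 2/6 = 1/3), and lambda_2 = 1 - cos(pi/3) = 0.5, comfortably inside both bounds.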



An algorithm with a proof, but no criterion:
Writing down a criterion can be a nuisance; conversely, skipping the criterion and directly giving an algorithm together with a proof is harder, but more flexible. [1] gives a recursive algorithm that keeps splitting the graph in two until a stopping condition is met, which yields a K-way clustering.
[1] also gives a proof guaranteeing the quality of the resulting clustering. By contrast, Normalized Cut cannot give such a bound for multi-way clustering, even though it too proceeds by repeated bisection. Separately, [3] gives a method that performs multi-way clustering directly: spectral first, then K-means. It has no objective-function criterion, yet uses matrix perturbation theory to prove that the algorithm works, at least when the input deviates only slightly from the ideal case.

This approach was a great inspiration to me, having always followed the route from objective function to algorithm. The road I had been walking turns out to be a narrow one.



Reformulating existing clustering methods:
Perhaps the most important thing Spectral Clustering leaves us is not its many algorithms but the idea of folding SVD into an optimization problem: the so-called spectral relaxation.
[4] follows exactly this route: it re-models K-means, relaxes the discrete problem into a continuous one that SVD can solve, and solves it. The mathematical technique in [4] is nothing special, but once applied, something quite interesting comes out.

Other clustering methods:
There are also other clustering methods that use spectral algorithms. In [7], for example, a spectral algorithm first splits the point set into a tree, and the leaves are then merged back together on the tree under another criterion (such as K-means) to form the final clustering. On a tree, many otherwise NP-hard problems become solvable with dynamic programming.



Spectral Embedding, some nonlinear dimensionality reduction methods:
Besides spectral clustering, Spectral Embedding, i.e. using spectral algorithms for nonlinear dimensionality reduction, is another application of the family.
It has ties to manifold learning; Salu is extremely well versed in this area, heh.
Laplacian Eigenmaps [9] is a well-known method of this kind. It also uses the graph's Laplacian matrix, optimizing a quadratic objective and solving it via SVD. Unlike clustering, however, there is no relaxation step here, so the solution obtained is exact.
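As a rough sketch of the idea (an invented example, not the construction from [9] verbatim): put heat-kernel weights on near neighbours of points sampled along an arc, then use the first non-trivial eigenvector of the graph Laplacian as a 1-D coordinate; it orders the points along the curve.

```python
import numpy as np

# Points sampled along an arc in the plane (invented data).
t = np.linspace(0.0, np.pi, 20)
X = np.c_[np.cos(t), np.sin(t)]

# Heat-kernel weights between near neighbours, Laplacian Eigenmaps style.
sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)
W = np.exp(-sq / 0.05) * (sq < 0.1)
np.fill_diagonal(W, 0.0)

d = W.sum(axis=1)
L = np.diag(d) - W                   # unnormalized graph Laplacian

vals, vecs = np.linalg.eigh(L)
embedding = vecs[:, 1]               # first non-trivial eigenvector

# The 1-D coordinate varies monotonically along the arc.
diffs = np.diff(embedding)
print(bool(np.all(diffs > 0) or np.all(diffs < 0)))
```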


Some peripheral topics:
1. Low-rank approximation
Solving an SVD is already quite fast, but still not fast enough for some applications, so SVD solvers have kept developing; randomized algorithms are one trend.

2. Semidefinite Programming (SDP)
As a competitor to spectral algorithms, relaxing certain clustering problems to SDP gives better results in theory, although experiments show no difference [11]. And, fatally... it is just far too complicated.


Some useful material and websites:

http://crd.lbl.gov/~cding/Spectral/
Needless to say, this is the first result when you search for spectral clustering. The tutorial attached to the page is poor, though; on its own it is almost incomprehensible. You are better off reading the papers listed under Selected References.
http://www.cc.gatech.edu/~vempala/papers/spectral.html



[1] Ravi Kannan and Adrian Vetta. On clusterings: good, bad and spectral. Journal of the ACM (JACM) 51(3), 497-515, 2004.
[2] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22:888-905, 2000.
[3] A.Y. Ng, M.I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. Proc. Neural Info. Processing Systems (NIPS 2001), 2001.
[4] H. Zha, C. Ding, M. Gu, X. He, and H.D. Simon. Spectral relaxation for K-means clustering. Advances in Neural Information Processing Systems 14 (NIPS 2001), pp. 1057-1064, Vancouver, Canada, Dec. 2001.
[5] M. Meila and J. Shi. A random walks view of spectral segmentation. Int'l Workshop on AI & Stat (AI-STAT 2001).
[6] F.R.K. Chung. Spectral Graph Theory. Amer. Math. Society Press, 1997.
[7] D. Cheng, R. Kannan, and G. Wang. A Divide-and-Merge Methodology for Clustering. Proc. of the ACM Symposium on Principles of Database Systems, 2005.
[8] H. Zha, X. He, C. Ding, M. Gu, and H. Simon. Bipartite Graph Partitioning and Data Clustering. Proc. of ACM 10th Int'l Conf. on Information and Knowledge Management (CIKM 2001), pp. 25-31, 2001, Atlanta.
[9] M. Belkin and P. Niyogi. Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering. Advances in Neural Information Processing Systems 14 (NIPS 2001), pp. 585-591, MIT Press, Cambridge, 2002.
[10] E.P. Xing and M.I. Jordan. On semidefinite relaxation for normalized k-cut and connections to spectral clustering. Tech Report CSD-03-1265, UC Berkeley, 2003.

Thursday, November 17, 2011

[Windows] Displaying Chinese in PuTTY

Well... PuTTY is a tool for SSH connections, usually used on Windows.

If your server side outputs Chinese characters, PuTTY will show mojibake, along with other problems.

The fix is to adjust PuTTY's font settings (and remember to set the encoding to UTF-8 while you are at it).

In "Change Settings", find the "Appearance" option under "Window";

under "Font settings" on the right, press "Change" and pick the Chinese font you want, e.g. 新細明體 (PMingLiU).

For the character set below that, choose Chinese Big5.

Well... try connecting.

[SUSE] VNC server

VNC is a cross-platform remote desktop: it lets you open a remote desktop between Windows and non-Windows systems. It is typically used when you need a graphical interface to a Linux server; most of the time plain SSH sees more use.

---------------------------------The following is copied and pasted-----------------------------------------------------------------------

Step 1. Start the VNC service

Run the vncserver command to start the VNC service and generate its configuration files. After you run vncserver it will ask you to set the password used later to log in to VNC. If no port is specified for the service, it defaults to port 5901.

 #vncserver                                                   //run this command to start the VNC service
 You will require a password to access your desktops.
 Password:                                                    //set the VNC password
 Verify:                                                      //confirm the VNC password
 Would you like to enter a view-only password (y/n)? n        //whether to set a view-only VNC password
 New 'X' desktop is weithenn:1                                //the VNC listen port is 5901
 Creating default startup script /root/.vnc/xstartup          //path of the generated VNC config file
 Starting applications specified in /root/.vnc/xstartup
 Log file is /root/.vnc/weithenn:1.log

Step 2. Modify the vncserver configuration

Edit the VNC configuration so that logging in to OpenSuse later brings up the Gnome desktop environment.

 #vi ~/.vnc/xstartup                                          //edit the VNC config file
 #!/bin/sh
 xrdb $HOME/.Xresources
 xsetroot -solid grey
 xterm -geometry 80x24+10+10 -ls -title "$VNCDESKTOP Desktop" &
 #twm &                                                       //comment out this line (the default)
 gnome-session &                                              //add this line to use the Gnome desktop

After editing, stop the VNC service so the new configuration can be reloaded in a moment:

 #vncserver -kill :1                                          //kill the VNC service that was listening on port 5901
 Killing Xvnc process ID 15878

Step 3. Start the VNC service

The following command starts the VNC service listening on port 6000 (5900 + :100 => 6000):

 #vncserver :100
 New 'X' desktop is weithenn:100
 Starting applications specified in /root/.vnc/xstartup
 Log file is /root/.vnc/weithenn:100.log

After it starts, confirm that the system is listening on port 6000:

 #netstat -tnl |grep '6000'
 tcp        0      0 0.0.0.0:6000            0.0.0.0:*               LISTEN

Confirm that the VNC service's process exists:

 #ps aux |grep vnc
 root     15952  0.2  0.3  86264 24564 pts/2    S    09:32   0:00 Xvnc :100 -desktop X -httpd /usr/share/vnc/classes -auth /root/.Xauthority...








*************************************
1. This method only enables remote login for your own account; other accounts must each log in and start their own vncserver.

2. If you see the error message 'cannot acquire name on session bus', it is because this account is already remotely logged in to Gnome (a graphical-interface mode); the same account cannot run two Gnome sessions at once.
If vncserver was originally started from within a graphical session, then of course Gnome is already occupied.

So the fix is: start vncserver from an SSH (text-mode) session, then connect remotely with a VNC client.

3. In the Chinese-language YaST, the '遠端管理(VNC)' (Remote Administration (VNC)) item is xorg-x11-vnc; in the firewall allow rules it is translated as 'VNC伺服器' (VNC server) and opens only 5901. The other entry, labeled just 'VNC', opens the whole range 5800-5900.

Original source:
http://www.weithenn.org/cgi-bin/wiki.pl?OpenSuse-VNC_Server%E9%81%A0%E7%AB%AF%E9%80%A3%E7%B7%9A%E4%BC%BA%E6%9C%8D%E5%99%A8

Other reference articles:
http://opensuse.swerdna.org/susefirewall.html

(Mainland China page)
http://www.sparelife.net/?p=82

Wednesday, November 16, 2011

[Linux] Adding users and groups to sudoers

About sudoers: they are simply the folks allowed to run sudo. First, a quick look at what sudo does.


On Linux, many sensitive operations require root privileges, much like Administrator on Windows.

So there are two ways to obtain root privileges.

The first is to get hold of the root password and log in as root directly, but giving the root password to too many people is never a good idea,

and that is where sudo comes in!

This approach designates certain people as holding the privilege; these people merely prefix their command with sudo,

the system then asks for "your own password" to confirm you are really you, and once the password checks out, the command runs.


For example: you want to shut down immediately.

Method 1: log in as root
spencer:~ # su
 [enter the root password]
spencer:~ # shutdown -h now


Method 2: use sudo
spencer:~ # sudo shutdown -h now
[enter your own password]

-----------------------------------------------------------------------------------------------------
Now, you hardly need to think to see that if just anyone could use sudo, root might as well not exist.

The sudoers configuration lives in /etc/sudoers, but editing it directly by brute force is not recommended; here is the command-based approach.

Suppose I want to add the user Spencer, and the group wheel, to sudoers.

(Everything below must be run with root privileges.)

spencer:~ # visudo
  #this opens a temporary file for you to edit, and only writes to the real sudoers file once the format is verified





Comment out these two lines, otherwise sudo will still ask for the root password, which defeats the purpose:
#Defaults targetpw   # ask for the password of the target user i.e. root
#ALL    ALL=(ALL) ALL   # WARNING! Only use this together with 'Defaults targetpw'!

Add the following lines:

 Spencer  ALL=(ALL) ALL
 %wheel ALL=(ALL) ALL


If you are lazy enough not to want to type even your own password, do this instead:
 Spencer  ALL=(ALL) NOPASSWD:  ALL
 %wheel ALL=(ALL) NOPASSWD:  ALL


To restrict the privilege to certain commands only (taking /bin/mount and /bin/umount as the example), list them comma-separated in place of the final ALL:
Spencer ALL=(ALL) /bin/mount, /bin/umount

Save the file and log in again.


----------------------------------------------------
While we are at it, here is how to create a group and add a user to a group; commands only.

Create a group: create the wheel group

spencer:~ # groupadd wheel

Add the user Spencer to the wheel group (-a appends, so Spencer keeps his existing groups):
spencer:~ # usermod -aG wheel Spencer

To put Spencer in the wheel group at creation time, and create a home directory as well:
spencer:~ # useradd -G wheel -m Spencer